01. A Data Scientist is working on optimizing a model during the training process by varying multiple parameters. The Data Scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values.
What should the Data Scientist do to improve the training process?
a) Increase the learning rate. Keep the batch size the same.
b) Reduce the batch size. Decrease the learning rate.
c) Keep the batch size the same. Decrease the learning rate.
d) Do not change the learning rate. Increase the batch size.
02. A Machine Learning Engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm.
The ML Engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%.
What should the ML Engineer do to minimize bias due to missing values?
a) Replace each missing value by the mean or median across non-missing values in same row.
b) Delete observations that contain missing values because these represent less than 5% of the data.
c) Replace each missing value by the mean or median across non-missing values in the same column.
d) For each feature, approximate the missing values using supervised learning based on other features.
03. An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each.
Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not.
Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
a) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
b) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
c) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
d) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
04. A company is interested in building a fraud detection model. Currently, the Data Scientist does not have a sufficient amount of information due to the low number of fraud cases.
Which method is MOST likely to detect the GREATEST number of valid fraud cases?
a) Oversampling using bootstrapping
c) Oversampling using SMOTE
d) Class weight adjustment
05. A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features:
id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field.
For this use case, which is the most effective course of action to address test data samples with missing features?
a) Drop the test samples with missing full review text fields, and then run through the test set.
b) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
c) Use an algorithm that handles missing data better than decision trees.
d) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
06. A Data Scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result.
The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
After creating each binary classification model, the Data Scientist generates the corresponding confusion matrix.
Which confusion matrix represents the model that satisfies the requirements?
a) TN = 91, FP = 9 FN = 22, TP = 78
b) TN = 99, FP = 1 FN = 21, TP = 79
c) TN = 96, FP = 4 FN = 10, TP = 90
d) TN = 98, FP = 2 FN = 18, TP = 82
07. A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf–idf matrix?
a) (2, 16)
b) (2, 8)
c) (2, 10)
d) (8, 10)
08. A company is setting up a system to manage all of the datasets it stores in Amazon S3.
The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance.
Which solution will allow the company to achieve its goals?
a) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
b) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
c) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
d) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
09. A Machine Learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process.
What can a Machine Learning Specialist do to address this concern?
a) Use Amazon SageMaker Pipe mode.
b) Use Amazon Machine Learning to train the models.
c) Use Amazon Kinesis to stream the data to Amazon SageMaker.
d) Use AWS Glue to transform the CSV dataset to the JSON format.
10. A Data Scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model.
What action will definitively help the model detect more than 10% of fraud cases?
a) Using oversampling to balance the dataset
b) Using regularization to reduce overfitting
c) Decreasing the class probability threshold
d) Using undersampling to balance the dataset