The Snowflake DSA-C02 exam preparation guide is designed to give candidates the essential information about the SnowPro Advanced - Data Scientist exam. It includes an exam summary, sample questions, a practice test, and the exam objectives, along with guidance on interpreting those objectives, so that candidates can gauge the types of questions they may encounter on the Snowflake SnowPro Advanced - Data Scientist exam.
All candidates are encouraged to review the DSA-C02 objectives and sample questions provided in this preparation guide. The Snowflake SnowPro Advanced - Data Scientist certification is aimed at candidates who want to build their careers in the data science domain and demonstrate their expertise. We suggest using the practice exam listed in this certification guide to become familiar with the exam environment and to identify the knowledge areas that need more work before you take the actual Snowflake SnowPro Advanced - Data Scientist exam.
Snowflake SnowPro Advanced - Data Scientist Syllabus:
Each section below lists the exam objectives it covers and its weight in the overall exam.

Data Science Concepts (Weight: 15-20%)
- Define machine learning concepts for data science workloads.
  - Machine Learning
    - Supervised learning
    - Unsupervised learning
- Outline machine learning problem types.
  - Supervised Learning
    1. Structured Data
       - Linear regression
       - Binary classification
       - Multi-class classification
       - Time-series forecasting
    2. Unstructured Data
       - Image classification
       - Segmentation
  - Unsupervised Learning
    - Clustering
    - Association models
- Summarize the machine learning lifecycle.
  - Data collection
  - Data visualization and exploration
  - Feature engineering
  - Training models
  - Model deployment
  - Model monitoring and evaluation (e.g., model explainability, precision, recall, accuracy, confusion matrix)
  - Model versioning
- Define statistical concepts for data science.
  - Normal versus skewed distributions (e.g., mean, outliers)
  - Central limit theorem
  - Z and T tests
  - Bootstrapping (see the sketch below)
  - Confidence intervals
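To make the statistics objectives concrete, here is a minimal sketch of estimating a 95% confidence interval for a mean by bootstrapping. It uses plain NumPy; the exponential sample and the 10,000-resample count are illustrative assumptions, not exam requirements.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative data: 500 observations from a skewed (exponential) distribution
sample = rng.exponential(scale=3.0, size=500)

# Bootstrapping: resample with replacement and record each resample's mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# 95% confidence interval from the percentiles of the bootstrap distribution
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean={sample.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Note how the central limit theorem shows up here: the bootstrap means are approximately normally distributed even though the underlying data are skewed.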
Data Pipelining (Weight: 15-20%)
- Enrich data by consuming data sharing sources.
  - Snowflake Marketplace
  - Direct Sharing
  - Shared database considerations
- Build a data science pipeline.
  - Automation of data transformation with streams and tasks
  - Python User-Defined Functions (UDFs) (see the registration sketch below)
  - Python User-Defined Table Functions (UDTFs)
  - Python stored procedures
  - Integration with machine learning platforms (e.g., connectors, ML partners)
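As a study aid for the UDF objective, the following is a minimal sketch of registering a permanent Python UDF with Snowpark so that SQL pipelines (for example, tasks) can call it. The connection parameters, the clean_amount function, and the @udf_stage stage are hypothetical placeholders.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import FloatType

# Hypothetical connection parameters; substitute your own account details.
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

def clean_amount(x: float) -> float:
    # Clamp negative amounts to zero before downstream aggregation.
    return max(x, 0.0)

# Register as a permanent UDF backed by a stage so it persists across sessions.
session.udf.register(
    func=clean_amount,
    return_type=FloatType(),
    input_types=[FloatType()],
    name="clean_amount",
    is_permanent=True,
    stage_location="@udf_stage",
    replace=True,
)
```

Once registered, the function is callable from any SQL context, e.g. SELECT clean_amount(amount) FROM sales;.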
Data Preparation and Feature Engineering (Weight: 30-35%)
- Prepare and clean data in Snowflake.
  - Use Snowpark for Python and SQL (see the cleaning sketch below)
    - Aggregate
    - Joins
    - Identify critical data
    - Remove duplicates
    - Remove irrelevant fields
    - Handle missing values
    - Data type casting
    - Sampling data
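A minimal Snowpark sketch of several of the cleaning steps above, reusing the session from the earlier sketch; the RAW_SALES table and its AMOUNT, REGION, and NOTES columns are invented for illustration.

```python
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import FloatType

df = session.table("RAW_SALES")

cleaned = (
    df.drop("NOTES")                                          # remove an irrelevant field
      .drop_duplicates()                                      # remove duplicate rows
      .na.fill({"AMOUNT": 0.0})                               # handle missing values
      .with_column("AMOUNT", col("AMOUNT").cast(FloatType())) # data type casting
)

# Aggregate: average amount per region
by_region = cleaned.group_by("REGION").avg("AMOUNT")

# Sampling: pull 10% of rows for quick local exploration
preview = cleaned.sample(frac=0.1)
```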
- Perform exploratory data analysis in Snowflake.
  - Snowpark and SQL
    - Identify initial patterns (i.e., data profiling)
    - Connect external machine learning platforms and/or notebooks (e.g., Jupyter)
  - Use Snowflake native statistical functions to analyze and calculate descriptive data statistics (see the sketch below).
    - Window functions
    - MIN/MAX/AVG/STDDEV
    - VARIANCE
    - TOPn
    - Approximation/high-performance functions
  - Linear regression
    - Find the slope and intercept
    - Verify the dependencies between dependent and independent variables
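The descriptive statistics and slope/intercept items map directly onto Snowflake's native aggregate functions. Here is a sketch using session.sql against a hypothetical SALES table with AD_SPEND, REVENUE, and CUSTOMER_ID columns.

```python
# Descriptive statistics with native aggregate and approximation functions
stats = session.sql("""
    SELECT MIN(REVENUE)      AS min_rev,
           MAX(REVENUE)      AS max_rev,
           AVG(REVENUE)      AS avg_rev,
           STDDEV(REVENUE)   AS std_rev,
           VARIANCE(REVENUE) AS var_rev,
           APPROX_COUNT_DISTINCT(CUSTOMER_ID) AS approx_customers
    FROM SALES
""").collect()

# Simple linear regression: slope and intercept of REVENUE against AD_SPEND
fit = session.sql("""
    SELECT REGR_SLOPE(REVENUE, AD_SPEND)     AS slope,
           REGR_INTERCEPT(REVENUE, AD_SPEND) AS intercept,
           REGR_R2(REVENUE, AD_SPEND)        AS r_squared
    FROM SALES
""").collect()
```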
- Perform feature engineering on Snowflake data.
  - Preprocessing
    - Scaling data
    - Encoding
    - Normalization
  - Data transformations
    - Data frames (i.e., pandas, Snowpark)
    - Derived features (e.g., average spend)
  - Binarizing data (see the encoding sketch below)
    - Binning continuous data into intervals
    - Label encoding
    - One-hot encoding
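A short pandas sketch of the binning and encoding items; the toy column values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 35, 58, 41, 67],
    "plan": ["basic", "pro", "basic", "enterprise", "pro"],
})

# Binning continuous data into intervals
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "senior"])

# Label encoding: map each category to an integer code
df["plan_code"] = df["plan"].astype("category").cat.codes

# One-hot encoding: one binary indicator column per category
df = pd.concat([df, pd.get_dummies(df["plan"], prefix="plan")], axis=1)
print(df)
```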
- Visualize and interpret the data to present a business case.
  - Statistical summaries
    - Snowsight with SQL
    - Streamlit
    - Interpret open-source graph libraries
    - Identify data outliers
  - Common types of visualization formats (see the sketch below)
    - Bar charts
    - Scatterplots
    - Heat maps
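For the three chart types named above, a minimal matplotlib sketch; the data are invented, and in practice the frames would typically come from Snowpark via to_pandas().

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
spend = rng.uniform(10, 100, size=60)
revenue = 2.5 * spend + rng.normal(0, 20, size=60)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Bar chart: average revenue per (illustrative) region
regions = ["NA", "EMEA", "APAC"]
axes[0].bar(regions, [revenue[:20].mean(), revenue[20:40].mean(), revenue[40:].mean()])
axes[0].set_title("Bar chart")

# Scatterplot: spend versus revenue
axes[1].scatter(spend, revenue)
axes[1].set_title("Scatterplot")

# Heat map: correlation matrix rendered as an image
axes[2].imshow(np.corrcoef(spend, revenue), cmap="viridis")
axes[2].set_title("Heat map")

plt.tight_layout()
plt.show()
```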
Model Development (Weight: 15-20%)
- Connect data science tools directly to data in Snowflake.
  - Connecting Python to Snowflake (see the connector sketch below)
    - Snowpark
    - Python connector with pandas support
    - Spark connector
  - Snowflake best practices
    - One platform, one copy of data, many workloads
    - Enrich datasets using the Snowflake Marketplace
    - External tables
    - External functions
    - Zero-copy cloning for training snapshots
    - Data governance
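A minimal sketch of the "Python connector with pandas support" item; the credentials are placeholders, and the SALES table is hypothetical.

```python
import snowflake.connector

# Hypothetical credentials; substitute your own account details.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# fetch_pandas_all() requires the connector's pandas extras:
#   pip install "snowflake-connector-python[pandas]"
cur = conn.cursor()
cur.execute("SELECT * FROM SALES LIMIT 1000")
df = cur.fetch_pandas_all()  # result set as a pandas DataFrame
print(df.describe())

cur.close()
conn.close()
```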
- Train a data science model.
  - Hyperparameter tuning (see the cross-validation sketch below)
  - Optimization metric selection (e.g., log loss, AUC, RMSE)
  - Partitioning
    - Cross-validation
    - Train/validation hold-out
  - Down-sampling/up-sampling
  - Training with Python stored procedures
  - Training outside Snowflake through external functions
  - Training with Python User-Defined Table Functions (UDTFs)
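A compact scikit-learn sketch that combines the hyperparameter tuning, partitioning, and metric-selection items; the synthetic dataset and the parameter grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Train/validation hold-out, then cross-validate on the training partition
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc",  # optimization metric selection (AUC here)
    cv=5,               # 5-fold cross-validation
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("hold-out AUC:", grid.score(X_val, y_val))
```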
- Validate a data science model (see the metrics sketch below).
  - ROC curve/confusion matrix
    - Calculate the expected payout of the model
  - Regression problems
  - Residuals plot
    - Interpret graphics with context
  - Model metrics
- Interpret a model.
  - Feature impact
  - Partial dependence plots
  - Confidence intervals
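A brief, self-contained scikit-learn sketch of the classification validation metrics named above (ROC curve, confusion matrix, precision, recall); the synthetic dataset stands in for real predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
preds = (proba >= 0.5).astype(int)

print("confusion matrix:\n", confusion_matrix(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))
print("AUC:", roc_auc_score(y_test, proba))

# ROC curve points; plot fpr against tpr to visualize the trade-off
fpr, tpr, thresholds = roc_curve(y_test, proba)
```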
Model Deployment (Weight: 15-20%)
- Move a data science model into production.
  - Use an externally hosted model
    - External functions
    - Pre-built models
  - Deploy a model in Snowflake
    - Vectorized/scalar Python User-Defined Functions (UDFs)
    - Pre-built models
    - Storing predictions
    - Stage commands
- Determine the effectiveness of a model and retrain if necessary.
  - Metrics for model evaluation
    1. Data drift/model decay (see the drift-check sketch below)
       - Data distribution comparisons
         - Do the data used for predictions look similar to the training data?
         - Do the same data points give the same predictions once a model is deployed?
  - Area under the curve
  - Accuracy, precision, recall
  - User-Defined Functions (UDFs)
- Outline model lifecycle and validation tools.
  - Streams and tasks
  - Metadata tagging
  - Model versioning with partner tools
  - Automation of model retraining
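As a sketch of the distribution-comparison questions above, a two-sample Kolmogorov-Smirnov test from SciPy can flag drift in a single feature between training data and live scoring data; the arrays here are invented stand-ins.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Invented stand-ins: the same feature at training time and in production
train_feature = rng.normal(loc=50.0, scale=10.0, size=5000)
live_feature = rng.normal(loc=55.0, scale=12.0, size=5000)  # drifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={stat:.3f}); consider retraining.")
else:
    print("No significant drift detected.")
```

In a Snowflake pipeline, a check like this could be scheduled with a task and the results logged alongside model version metadata.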