01. In data modeling, an entity-relationship diagram (ERD) is primarily used to:
a) Transform raw data into structured data.
b) Illustrate relationships between entities.
c) Store data in a physical location.
d) Cleanse dirty data.
02. A company is designing a data lake on Amazon S3. To ensure high performance when accessing the data, which best practice should the company adopt in organizing its data in the S3 bucket?
a) Store all data files as a single large file and use AWS Lambda to parse required data segments.
b) Use a flat structure by avoiding the creation of any prefix or "folder" hierarchy.
c) Partition data based on commonly accessed attributes and use a consistent naming scheme for prefixes.
d) Enable S3 Transfer Acceleration to ensure data is quickly accessible from any location.
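Note: for readers unfamiliar with prefix-based layouts, the sketch below shows one way an attribute-partitioned key scheme with consistent prefix names might look. The dataset name, the partition attributes (year, month, day, region), and the helper `partitioned_key` are all hypothetical, chosen only for illustration. The `key=value` prefix style is the Hive-style convention that many query engines recognize, not a requirement of S3.

```python
from datetime import date


def partitioned_key(dataset: str, event_date: date, region: str, filename: str) -> str:
    """Build a Hive-style partitioned object key (hypothetical layout)."""
    return (
        f"{dataset}/"
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
        f"region={region}/"
        f"{filename}"
    )


# Example: one object written under its date and region prefixes.
key = partitioned_key("sales", date(2024, 3, 7), "eu-west-1", "part-0000.parquet")
print(key)
```

With a layout like this, a query that filters on date or region only needs to list and read the objects under the matching prefixes.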
03. You have been tasked with migrating an on-premises MySQL database to Amazon Aurora PostgreSQL using AWS Database Migration Service (DMS). The stakeholder emphasizes that the source database must remain fully operational during the migration process.
Which of the following statements about DMS is accurate with respect to this scenario?
a) AWS DMS only supports full-load migrations, which would require downtime for the source database.
b) AWS DMS supports both full-load and continuous replication, allowing the source MySQL database to remain operational during migration.
c) When using DMS, the target Amazon Aurora PostgreSQL instance cannot be accessed or queried until the migration is complete.
d) AWS DMS requires the source MySQL database to be version 5.7 or higher for migrating to Amazon Aurora PostgreSQL.
e) AWS DMS can convert the MySQL database schema directly to PostgreSQL without any manual intervention.
04. Company DEF has a strict security policy that requires all data at rest in Amazon S3 to be encrypted. The company wants AWS to manage the encryption keys, but it also wants the flexibility to change the keys when required.
Which of the following encryption methods best meets Company DEF's requirements?
a) Server-Side Encryption with Customer-Provided Keys (SSE-C).
b) Server-Side Encryption with Amazon S3 Managed Keys (SSE-S3).
c) Server-Side Encryption with AWS Key Management Service Keys (SSE-KMS).
d) Client-Side Encryption with a client-side master key.
05. In a data engineering pipeline, multiple applications and teams at a company access a shared Amazon S3 bucket. To streamline access and simplify permissions management for these different entities, which S3 feature should the company use?
a) Create multiple IAM roles, one per application or team, each granting access to the S3 bucket.
b) Use S3 Access Points to create unique endpoints with tailored permissions for each application or team.
c) Activate S3 Transfer Acceleration for the bucket to ensure fast and differentiated access for each application or team.
d) Implement S3 Lifecycle policies for each application or team to manage their specific data access and retention.
06. When large datasets are processed with distributed computing frameworks, an uneven distribution of data across partitions or workers can lead to processing delays. What is this phenomenon commonly known as?
a) Data skew
b) Data partitioning
c) Data shuffling
d) Data fragmentation
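Note: the effect of an uneven key distribution can be sketched in a few lines. The example below is a simplified model, not how any particular framework assigns work: it assumes integer keys and plain modulo partitioning. The helper names `partition_sizes` and `skew_ratio` are hypothetical.

```python
def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under modulo partitioning."""
    sizes = [0] * num_partitions
    for k in keys:
        sizes[k % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size; 1.0 is perfectly even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# 90 records share one "hot" key (7); the other 10 records have distinct keys.
keys = [7] * 90 + list(range(10))
sizes = partition_sizes(keys, num_partitions=4)
print(sizes)              # [3, 3, 2, 92]
print(skew_ratio(sizes))  # 3.68
```

In this model the worker that owns the hot key's partition handles 92 of the 100 records, so the whole job waits on that one worker while the others sit idle.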
07. When a schema is expected to evolve over time and compatibility must be maintained, which data format should be chosen for downstream analytics?
a) CSV
b) JSON
c) Parquet
d) Avro
08. What is the primary purpose of data lineage in data engineering?
a) To optimize query performance.
b) To transform data formats.
c) To trace the source and flow of data.
d) To create visualizations.
09. Which of the following best describes the type of data found in traditional relational databases?
a) Structured data
b) Unstructured data
c) Semi-structured data
d) Free-form data
10. Pivoting in SQL is mainly used to transform data from:
a) One row into one column
b) One column into one row
c) Multiple columns into multiple rows
d) Multiple rows into multiple columns
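Note: not every SQL dialect has a dedicated PIVOT operator, but a pivot can usually be emulated with conditional aggregation. The sketch below runs such a query against an in-memory SQLite database through Python's standard `sqlite3` module. The `sales` table, its columns, and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "Q1", 100),
        ("east", "Q2", 150),
        ("west", "Q1", 80),
        ("west", "Q2", 120),
    ],
)

# One input row per (region, quarter) becomes one output row per region,
# with a separate column for each quarter.
rows = conn.execute(
    """
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()
print(rows)  # [('east', 100, 150), ('west', 80, 120)]
```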