Amazon DEA-C01 Online Practice Questions

Latest Amazon DEA-C01 Exam Practice Questions

The practice questions for the Amazon DEA-C01 exam were last updated on 2026-02-24.

Question#1

A company uses the AWS Glue Data Catalog to index data that is uploaded to an Amazon S3 bucket every day. The company uses daily batch processes in an extract, transform, and load (ETL) pipeline to upload data from external sources into the S3 bucket.
The company runs a daily report on the S3 data. Some days, the company runs the report before all the daily data has been uploaded to the S3 bucket. A data engineer must be able to send a message that identifies any incomplete data to an existing Amazon Simple Notification Service (Amazon SNS) topic.
Which solution will meet this requirement with the LEAST operational overhead?

A. Create data quality checks for the source datasets that the daily reports use. Create a new AWS managed Apache Airflow cluster. Run the data quality checks by using Airflow tasks that run data quality queries on the columns' data types and the presence of null values. Configure Airflow Directed Acyclic Graphs (DAGs) to send an email notification that informs the data engineer about the incomplete datasets to the SNS topic.
B. Create data quality checks on the source datasets that the daily reports use. Create a new Amazon EMR cluster. Use Apache Spark SQL to create Apache Spark jobs in the EMR cluster that run data quality queries on the columns' data types and the presence of null values. Orchestrate the ETL pipeline by using an AWS Step Functions workflow. Configure the workflow to send an email notification that informs the data engineer about the incomplete datasets to the SNS topic.
C. Create data quality checks on the source datasets that the daily reports use. Create data quality actions by using AWS Glue workflows to confirm the completeness and consistency of the datasets. Configure the data quality actions to create an event in Amazon EventBridge if a dataset is incomplete. Configure EventBridge to send the event that informs the data engineer about the incomplete datasets to the Amazon SNS topic.
D. Create AWS Lambda functions that run data quality queries on the columns' data types and the presence of null values. Orchestrate the ETL pipeline by using an AWS Step Functions workflow that runs the Lambda functions. Configure the Step Functions workflow to send an email notification that informs the data engineer about the incomplete datasets to the SNS topic.

Explanation:
AWS Glue workflows are designed to orchestrate the ETL pipeline, and you can create data quality checks to ensure the uploaded datasets are complete before running reports. If there is an issue with the data, AWS Glue workflows can trigger an Amazon EventBridge event that sends a message to an SNS topic.
AWS Glue Workflows:
AWS Glue workflows allow users to automate and monitor complex ETL processes. You can include data quality actions to check for null values, data types, and other consistency checks.
In the event of incomplete data, an EventBridge event can be generated to notify via SNS.
Reference: AWS Glue Workflows
Alternatives Considered:
A (Airflow cluster): Managed Airflow introduces more operational overhead and complexity compared to Glue workflows.
B (EMR cluster): Setting up an EMR cluster is also more complex compared to the Glue-centric solution.
D (Lambda functions): While Lambda functions can work, using Glue workflows offers a more integrated and lower operational overhead solution.
Reference: AWS Glue Workflow Documentation
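
As a rough illustration of option C, the sketch below pairs a small data quality ruleset with an EventBridge event pattern that would match failed evaluations. The column names, the ruleset contents, the sample event, and the helper `matches` are hypothetical. The `source` and `detail-type` values are assumed to follow the events that AWS Glue Data Quality publishes and should be checked against the current AWS documentation. An EventBridge rule carrying this pattern would list the existing SNS topic as its target.

```python
import json

# Hypothetical DQDL ruleset: completeness and data type checks on assumed columns.
RULESET = 'Rules = [ IsComplete "order_id", ColumnDataType "order_date" = "Date" ]'

# Assumed event pattern for failed AWS Glue Data Quality evaluations.
EVENT_PATTERN = {
    "source": ["aws.glue-dataquality"],
    "detail-type": ["Data Quality Evaluation Results Available"],
    "detail": {"state": ["FAILED"]},
}


def matches(pattern, event):
    """Simplified local stand-in for EventBridge matching (exact values only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: the event must carry a nested object that also matches.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical event, shaped like the pattern above, for a local check.
sample_event = {
    "source": "aws.glue-dataquality",
    "detail-type": "Data Quality Evaluation Results Available",
    "detail": {"state": "FAILED", "rulesetNames": ["daily_report_checks"]},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

The local `matches` helper only exists so the pattern can be exercised without an AWS account; real EventBridge matching supports more operators than exact values.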

Question#2

A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?

A. Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
B. Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
C. Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket.
D. Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.

Explanation:
AWS Glue is a fully managed service that provides a serverless data integration platform. It can automatically discover and categorize data from various sources, including SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. It can also infer the schema of the data and store it in the AWS Glue Data Catalog, which is a central metadata repository. AWS Glue can then use the schema information to generate and run Apache Spark code to extract, transform, and load the data into an Amazon S3 bucket. AWS Glue can also monitor and optimize the performance and cost of the data pipeline, and handle any schema changes that may occur in the source data. AWS Glue can meet the SLA of loading the data into the S3 bucket within 15 minutes of data creation, as it can trigger the data pipeline based on events, schedules, or on-demand. AWS Glue has the least operational overhead among the options, as it does not require provisioning, configuring, or managing any servers or clusters. It also handles scaling, patching, and security automatically.
Reference: AWS Glue
[AWS Glue Data Catalog]
[AWS Glue Developer Guide]
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
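
To make "detecting a schema that changes" concrete, here is a conceptual sketch in plain Python. It is not how AWS Glue crawlers or jobs are implemented; the function `infer_schema` and the sample records are hypothetical and only show the idea of deriving field names and types from the data itself rather than from a fixed definition.

```python
def infer_schema(records):
    """Return {field: sorted list of Python type names seen} across all records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical batch: the second record adds a column and changes a type.
batch = [
    {"id": 1, "amount": 9.5},
    {"id": 2, "amount": 12, "currency": "USD"},
]

print(infer_schema(batch))
```

A field that reports more than one type, or that appears only in later records, is the kind of drift a managed schema-detection step has to absorb without manual intervention.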

Question#3

A data engineer needs to optimize the performance of a data pipeline that handles retail orders. Data about the orders is ingested daily into an Amazon S3 bucket.
The data engineer runs queries once each week to extract metrics from the orders data based on the order date for multiple date ranges. The data engineer needs an optimization solution that ensures the query performance will not degrade when the volume of data increases.

A. Partition the data based on order date. Use Amazon Athena to query the data.
B. Partition the data based on order date. Use Amazon Redshift to query the data.
C. Partition the data based on load date. Use Amazon EMR to query the data.
D. Partition the data based on load date. Use Amazon Aurora to query the data.

Explanation:
For query workloads on S3 data that depend on date-based filters, partitioning by order date optimizes performance and cost because Athena reads only the relevant partitions.
Athena scales automatically and doesn’t degrade with increasing data size when partitions are managed efficiently.
“Partitioning data in Amazon S3 based on query predicates such as order date improves Athena query performance and reduces scanned data volume.”
This is the most cost-effective and scalable option for date-based queries.
Reference: Ace the AWS Certified Data Engineer - Associate Certification - version 2
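
The effect of partitioning by order date can be sketched as follows. The bucket name, the `order_date=` key layout, and both helper functions are assumptions for illustration. The point is that a query filtered on a date range only needs the prefixes for the days in that range, however much data the other partitions hold.

```python
from datetime import date, timedelta


def partition_prefix(base, order_date):
    """Hive-style prefix for one order date, e.g. .../order_date=2024-01-05/."""
    return f"{base}/order_date={order_date.isoformat()}/"


def prefixes_for_range(base, start, end):
    """Prefixes a query filtered on start <= order_date <= end needs to read."""
    days = (end - start).days
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days + 1)]


# Hypothetical bucket and prefix.
BASE = "s3://example-orders-bucket/orders"

for prefix in prefixes_for_range(BASE, date(2024, 1, 30), date(2024, 2, 1)):
    print(prefix)
```

Partitioning by load date instead would not help here, because the weekly queries filter on order date and would still have to scan every load-date partition.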

Question#4

A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant.
Which solution will meet these requirements with the LEAST operational overhead?

A. Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.
B. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.
C. Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.
D. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.

Explanation:
Option D meets the requirements of ingesting real-time streaming data into AWS and performing real-time analytics on it with the least operational overhead. Amazon Managed Service for Apache Flink is a fully managed service that allows you to run Apache Flink applications without having to manage any infrastructure or clusters. Apache Flink is a framework for stateful stream processing that supports aggregations over several window types, such as tumbling, sliding, and session windows. By using Amazon Managed Service for Apache Flink, you can connect to Amazon Kinesis Data Streams as the source and sink of your streaming data and perform time-based analytics over a window of up to 30 minutes. This solution is also highly fault tolerant, as Amazon Managed Service for Apache Flink automatically scales, monitors, and restarts your Flink applications in case of failures.
Reference: Amazon Managed Service for Apache Flink
Apache Flink
Window Aggregations in Flink
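
For readers unfamiliar with window aggregations, the following plain-Python sketch shows what a 30-minute tumbling window over event timestamps computes. It does not use the Flink API, and the function names and sample events are hypothetical. In a managed Flink application the framework keeps this per-window state and restores it after failures, which is what the Lambda-based options would have to reimplement.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=30)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(event_time, window=WINDOW):
    """Floor a timezone-aware event timestamp to the start of its tumbling window."""
    return EPOCH + ((event_time - EPOCH) // window) * window


def tumbling_sum(events, window=WINDOW):
    """Sum values per window; events are (event_time, value) pairs."""
    totals = defaultdict(float)
    for event_time, value in events:
        totals[window_start(event_time, window)] += value
    return dict(totals)


# Hypothetical events: two fall in the 10:00 window, one in the 10:30 window.
events = [
    (datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), 2.0),
    (datetime(2024, 1, 1, 10, 29, tzinfo=timezone.utc), 3.0),
    (datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc), 4.0),
]

for start, total in sorted(tumbling_sum(events).items()):
    print(start.isoformat(), total)
```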

Question#5

A technology company currently uses Amazon Kinesis Data Streams to collect log data in real time. The company wants to use Amazon Redshift for downstream real-time queries and to enrich the log data.
Which solution will ingest data into Amazon Redshift with the LEAST operational overhead?

A. Set up an Amazon Data Firehose delivery stream to send data to a Redshift provisioned cluster table.
B. Set up an Amazon Data Firehose delivery stream to send data to Amazon S3. Configure a Redshift provisioned cluster to load data every minute.
C. Configure Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to send data directly to a Redshift provisioned cluster table.
D. Use Amazon Redshift streaming ingestion from Kinesis Data Streams to present data as a materialized view.

Explanation:
The most efficient and low-operational-overhead solution for ingesting data into Amazon Redshift from Amazon Kinesis Data Streams is to use Amazon Redshift streaming ingestion. This feature allows Redshift to directly ingest streaming data from Kinesis Data Streams and process it in real-time.
Amazon Redshift Streaming Ingestion:
Redshift supports native streaming ingestion from Kinesis Data Streams, allowing real-time data to be queried using materialized views.
This solution reduces operational complexity because you don't need intermediary services such as Amazon Data Firehose or Amazon S3 for batch loading.
Reference: Amazon Redshift Streaming Ingestion
Alternatives Considered:
A (Data Firehose to Redshift): Firehose buffers records and delivers them to Redshift in batches, so this option is better suited to batch loading, and the delivery stream is an additional component to set up and operate.
B (Firehose to S3): This involves an intermediate step, which adds complexity and introduces latency that conflicts with the real-time requirement.
C (Managed Service for Apache Flink): This would work but introduces unnecessary complexity compared to Redshift’s native streaming ingestion.
Reference: Amazon Redshift Streaming Ingestion from Kinesis
Materialized Views in Redshift
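
A minimal sketch of what setting up streaming ingestion involves is shown below. The helper `streaming_ingestion_sql` and all schema, stream, view, and role names are hypothetical, and the statement syntax reflects the general shape described in the Amazon Redshift documentation as best understood here; it should be verified against the current documentation before use.

```python
def streaming_ingestion_sql(schema, stream, view, role_arn):
    """Build the two statements assumed to set up Redshift streaming ingestion."""
    # Maps the Kinesis data streams visible to the IAM role into an external schema.
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM KINESIS\n"
        f"IAM_ROLE '{role_arn}';"
    )
    # Materialized view that Redshift refreshes from the stream.
    create_view = (
        f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS\n"
        f"SELECT approximate_arrival_timestamp,\n"
        f"       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream}";'
    )
    return [create_schema, create_view]


# Hypothetical names and a placeholder role ARN.
statements = streaming_ingestion_sql(
    schema="kds",
    stream="example-log-stream",
    view="log_events_mv",
    role_arn="arn:aws:iam::123456789012:role/example-redshift-streaming-role",
)

for statement in statements:
    print(statement)
    print()
```

Only two SQL statements are involved and no delivery stream or staging bucket has to be created, which is why this option carries the least operational overhead.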

Disclaimer

This page is for educational and exam preparation reference only. It is not affiliated with Amazon, Data Engineer Associate, or the official exam provider. Candidates should refer to official documentation and training for authoritative information.

Exam Code: Amazon DEA-C01
Q & A: 231 Q&As
Updated: 2026-02-24
