Good DAS-C01 Exam Dumps Save Your Preparation Time

Post Date:

September 29, 2023

Amazon DAS-C01 dumps questions are an essential tool for anyone looking to pursue the AWS Certified Data Analytics - Specialty certification. By using these DAS-C01 dumps questions as part of your exam preparation, you can gain a deeper understanding of the exam format, identify knowledge gaps, build confidence, get real-time feedback, and stay up-to-date with the latest trends in AWS Certified Data Analytics - Specialty. So, if you're serious about achieving DAS-C01 certification success, be sure to incorporate DAS-C01 exam dumps questions into your study plan. Test free Amazon DAS-C01 exam dumps below.

1. A data engineer is using AWS Glue ETL jobs to process data at frequent intervals. The processed data is then copied into Amazon S3. The ETL jobs run every 15 minutes. The AWS Glue Data Catalog partitions need to be updated automatically after the completion of each job.

Which solution will meet these requirements MOST cost-effectively?

Use the AWS Glue Data Catalog to manage the data catalog. Define an AWS Glue workflow for the ETL process. Define a trigger within the workflow that can start the crawler when an ETL job run is complete.

Use the AWS Glue Data Catalog to manage the data catalog. Use AWS Glue Studio to manage ETL jobs. Use the AWS Glue Studio feature that supports updates to the AWS Glue Data Catalog during job runs.

Use an Apache Hive metastore to manage the data catalog. Update the AWS Glue ETL code to include the enableUpdateCatalog and partitionKeys arguments.

Use the AWS Glue Data Catalog to manage the data catalog. Update the AWS Glue ETL code to include the enableUpdateCatalog and partitionKeys arguments.
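
Two of the options above refer to the enableUpdateCatalog and partitionKeys arguments. As a rough sketch of what passing them might look like, the helper below builds the options dictionary for a Glue sink. The helper name, the partition columns, and the database and table names in the comment are hypothetical, and the Glue call itself is shown only as a comment because it needs the Glue job runtime.

```python
def build_sink_options(partition_keys):
    """Build the additional options for a Glue sink that updates the Data Catalog."""
    if not partition_keys:
        raise ValueError("at least one partition key is required")
    return {
        "enableUpdateCatalog": True,           # ask the sink to update the catalog on write
        "partitionKeys": list(partition_keys)  # columns that define the partitions
    }


# Hypothetical partition columns, chosen only for illustration.
options = build_sink_options(["year", "month", "day"])
print(options)

# Inside a Glue job (not runnable locally) the options would be passed roughly like:
# glue_context.write_dynamic_frame_from_catalog(
#     frame=transformed_frame,
#     database="analytics_db",      # hypothetical database name
#     table_name="events",          # hypothetical table name
#     additional_options=options,
# )
```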

2. A company uses Amazon Redshift for its data warehousing needs. ETL jobs run every night to load data, apply business rules, and create aggregate tables for reporting. The company's data analysis, data science, and business intelligence teams use the data warehouse during regular business hours. The workload management is set to auto, and separate queues exist for each team with the priority set to NORMAL.

Recently, a sudden spike of read queries from the data analysis team has occurred at least twice daily, and queries wait in line for cluster resources. The company needs a solution that enables the data analysis team to avoid query queuing without impacting latency and the query times of other teams.

Which solution meets these requirements?

Increase the query priority to HIGHEST for the data analysis queue.

Configure the data analysis queue to enable concurrency scaling.

Create a query monitoring rule to add more cluster capacity for the data analysis queue when queries are waiting for resources.

Use workload management query queue hopping to route the query to the next matching queue.

3. A financial services company is building a data lake solution on Amazon S3. The company plans to use analytics offerings from AWS to meet user needs for one-time querying and business intelligence reports. A portion of the columns will contain personally identifiable information (PII). Only authorized users should be able to see plaintext PII data.

What is the MOST operationally efficient solution that meets these requirements?

Define a bucket policy for each S3 bucket of the data lake to allow access to users who have authorization to see PII data. Catalog the data by using AWS Glue. Create two IAM roles. Attach a permissions policy with access to PII columns to one role. Attach a policy without these permissions to the other role.

Register the S3 locations with AWS Lake Formation. Create two IAM roles. Use Lake Formation data permissions to grant Select permissions to all of the columns for one role. Grant Select permissions to only columns that contain non-PII data for the other role.

Register the S3 locations with AWS Lake Formation. Create an AWS Glue job to create an ETL workflow that removes the PII columns from the data and creates a separate copy of the data in another data lake S3 bucket. Register the new S3 locations with Lake Formation. Grant users the permissions to each data lake data based on whether the users are authorized to see PII data.

Register the S3 locations with AWS Lake Formation. Create two IAM roles. Attach a permissions policy with access to PII columns to one role. Attach a policy without these permissions to the other role. For each downstream analytics service, use its native security functionality and the IAM roles to secure the PII data.
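
One of the options above describes column-level Select grants in Lake Formation. The sketch below only builds the request arguments for such a grant; it does not call AWS. The role ARNs, database, table, and column names are hypothetical, and the assumption is that the resulting dictionary would be passed to the boto3 Lake Formation client's grant_permissions call.

```python
def select_grant(role_arn, database, table, excluded_columns=None):
    """Build arguments for a Lake Formation SELECT grant on a table's columns."""
    # An empty ColumnWildcard covers all columns; ExcludedColumnNames removes some.
    wildcard = {}
    if excluded_columns:
        wildcard["ExcludedColumnNames"] = list(excluded_columns)
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": wildcard,
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical roles and columns, for illustration only.
full_access = select_grant(
    "arn:aws:iam::111122223333:role/pii-reader", "lake_db", "customers")
non_pii_only = select_grant(
    "arn:aws:iam::111122223333:role/analyst", "lake_db", "customers",
    excluded_columns=["ssn", "email"])

# Assumed usage with credentials configured:
# boto3.client("lakeformation").grant_permissions(**non_pii_only)
```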

4. A company has a process that writes two datasets in CSV format to an Amazon S3 bucket every 6 hours. The company needs to join the datasets, convert the data to Apache Parquet, and store the data within another bucket for users to query using Amazon Athena. The data also needs to be loaded to Amazon Redshift for advanced analytics. The company needs a solution that is resilient to the failure of any individual job component and can be restarted in case of an error.

Which solution meets these requirements with the LEAST amount of operational overhead?

Use AWS Step Functions to orchestrate an Amazon EMR cluster running Apache Spark. Use PySpark to generate data frames of the datasets in Amazon S3, transform the data, join the data, write the data back to Amazon S3, and load the data to Amazon Redshift.

Create an AWS Glue job using Python Shell that generates dynamic frames of the datasets in Amazon S3, transforms the data, joins the data, writes the data back to Amazon S3, and loads the data to Amazon Redshift. Use an AWS Glue workflow to orchestrate the AWS Glue job at the desired frequency.

Use AWS Step Functions to orchestrate the AWS Glue job. Create an AWS Glue job using Python Shell that creates dynamic frames of the datasets in Amazon S3, transforms the data, joins the data, writes the data back to Amazon S3, and loads the data to Amazon Redshift.

Create an AWS Glue job using PySpark that creates dynamic frames of the datasets in Amazon S3, transforms the data, joins the data, writes the data back to Amazon S3, and loads the data to Amazon Redshift. Use an AWS Glue workflow to orchestrate the AWS Glue job.

5. A gaming company is building a serverless data lake. The company is ingesting streaming data into Amazon Kinesis Data Streams and is writing the data to Amazon S3 through Amazon Kinesis Data Firehose. The company is using 10 MB as the S3 buffer size and is using 90 seconds as the buffer interval. The company runs an AWS Glue ETL job to merge and transform the data to a different format before writing the data back to Amazon S3.

Recently, the company has experienced substantial growth in its data volume. The AWS Glue ETL jobs are frequently showing an OutOfMemoryError error.

Which solutions will resolve this issue without incurring additional costs? (Select TWO.)

Place the small files into one S3 folder. Define one single table for the small S3 files in AWS Glue Data Catalog. Rerun the AWS Glue ETL jobs against this AWS Glue table.

Create an AWS Lambda function to merge small S3 files and invoke them periodically. Run the AWS Glue ETL jobs after successful completion of the Lambda function.

Run the S3DistCp utility in Amazon EMR to merge a large number of small S3 files before running the AWS Glue ETL jobs.

Use the groupFiles setting in the AWS Glue ETL job to merge small S3 files and rerun AWS Glue ETL jobs.

Update the Kinesis Data Firehose S3 buffer size to 128 MB. Update the buffer interval to 900 seconds.
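
The buffer settings in this question determine how many objects land in Amazon S3, because a delivery stream flushes when either the buffer size or the buffer interval is reached, whichever comes first. The following is a simplified back-of-the-envelope model of that behavior. The ingest rate of 0.5 MB/s is an assumption made purely for illustration, and the model ignores compression and any other delivery details.

```python
SECONDS_PER_DAY = 86_400


def objects_per_day(ingest_mb_per_s, buffer_mb, buffer_s):
    """Estimate S3 objects written per day under a size-or-interval flush rule."""
    seconds_to_fill = buffer_mb / ingest_mb_per_s
    # The flush happens on whichever limit is hit first.
    flush_every = min(buffer_s, seconds_to_fill)
    return SECONDS_PER_DAY / flush_every


RATE = 0.5  # MB per second, assumed for this example

small_buffers = objects_per_day(RATE, buffer_mb=10, buffer_s=90)
large_buffers = objects_per_day(RATE, buffer_mb=128, buffer_s=900)

print(f"10 MB / 90 s:   {small_buffers:.1f} objects per day")
print(f"128 MB / 900 s: {large_buffers:.1f} objects per day")
```

Under this assumed rate, the larger buffers produce far fewer and larger objects for the downstream job to read.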

6. A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon Athena. Users are seeing degradation in query performance as time progresses.

Which action can help improve query performance?

Merge the files in Amazon S3 to form larger files.

Increase the number of shards in Kinesis Data Streams.

Add more memory and CPU capacity to the streaming application.

Write the files to multiple S3 buckets.

7. A company that monitors weather conditions from remote construction sites is setting up a solution to collect temperature data from the following two weather stations.

Station A, which has 10 sensors

Station B, which has five sensors

These weather stations were placed by onsite subject-matter experts.

Each sensor has a unique ID. The data collected from each sensor will be collected using Amazon Kinesis Data Streams.

Based on the total incoming and outgoing data throughput, a single Amazon Kinesis data stream with two shards is created. Two partition keys are created based on the station names. During testing, there is a bottleneck on data coming from Station A, but not from Station B. Upon review, it is confirmed that the total stream throughput is still less than the allocated Kinesis Data Streams throughput.

How can this bottleneck be resolved without increasing the overall cost and complexity of the solution, while retaining the data collection quality requirements?

Increase the number of shards in Kinesis Data Streams to increase the level of parallelism.

Create a separate Kinesis data stream for Station A with two shards, and stream Station A sensor data to the new stream.

Modify the partition key to use the sensor ID instead of the station name.

Reduce the number of sensors in Station A from 10 to 5 sensors.
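
Kinesis Data Streams maps each partition key to a shard by taking an MD5 hash of the key and locating it in the shards' hash key ranges. The sketch below imitates that mapping locally to show why the choice of partition key matters for how records spread across shards. It assumes the shards split the 128-bit hash space evenly, and the sensor ID format is made up for the example.

```python
import hashlib
from collections import Counter


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal-sized hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 128-bit integer
    range_size = 2**128 // shard_count
    return min(hash_value // range_size, shard_count - 1)


def records_per_shard(keys, shard_count):
    """Count how many records each shard receives for the given keys."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# One record per sensor: 10 sensors at Station A, 5 at Station B.
station_keys = ["Station A"] * 10 + ["Station B"] * 5
sensor_keys = [f"A-{i:02d}" for i in range(10)] + [f"B-{i:02d}" for i in range(5)]

print("keyed by station:", dict(records_per_shard(station_keys, 2)))
print("keyed by sensor: ", dict(records_per_shard(sensor_keys, 2)))
```

With station names as keys, every record from one station necessarily lands on the same shard; with sensor IDs there are 15 distinct keys for the hash to distribute.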

8. An online retail company uses Amazon Redshift to store historical sales transactions. The company is required to encrypt data at rest in the clusters to comply with the Payment Card Industry Data Security Standard (PCI DSS). A corporate governance policy mandates management of encryption keys using an on-premises hardware security module (HSM).

Which solution meets these requirements?

Create and manage encryption keys using AWS CloudHSM Classic. Launch an Amazon Redshift cluster in a VPC with the option to use CloudHSM Classic for key management.

Create a VPC and establish a VPN connection between the VPC and the on-premises network. Create an HSM connection and client certificate for the on-premises HSM. Launch a cluster in the VPC with the option to use the on-premises HSM to store keys.

Create an HSM connection and client certificate for the on-premises HSM. Enable HSM encryption on the existing unencrypted cluster by modifying the cluster. Connect to the VPC where the Amazon Redshift cluster resides from the on-premises network using a VPN.

Create a replica of the on-premises HSM in AWS CloudHSM. Launch a cluster in a VPC with the option to use CloudHSM to store keys.

9. A company wants to improve user satisfaction for its smart home system by adding more features to its recommendation engine. Each sensor asynchronously pushes its nested JSON data into Amazon Kinesis Data Streams using the Kinesis Producer Library (KPL) in Java. Statistics from a set of failed sensors showed that, when a sensor is malfunctioning, its recorded data is not always sent to the cloud.

The company needs a solution that offers near-real-time analytics on the data from the most updated sensors.

Which solution enables the company to meet these requirements?

Set the RecordMaxBufferedTime property of the KPL to "−1" to disable the buffering on the sensor side. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Push the enriched data to a fleet of Kinesis data streams and enable the data transformation feature to flatten the JSON file. Instantiate a dense storage Amazon Redshift cluster and use it as the destination for the Kinesis Data Firehose delivery stream.

Update the sensors code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Direct the output of KDA application to a Kinesis Data Firehose delivery stream, enable the data transformation feature to flatten the JSON file, and set the Kinesis Data Firehose destination to an Amazon Elasticsearch Service cluster.

Set the RecordMaxBufferedTime property of the KPL to "0" to disable the buffering on the sensor side. Connect for each stream a dedicated Kinesis Data Firehose delivery stream and enable the data transformation feature to flatten the JSON file before sending it to an Amazon S3 bucket. Load the S3 data into an Amazon Redshift cluster.

Update the sensors code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use AWS Glue to fetch and process data from the stream using the Kinesis Client Library (KCL). Instantiate an Amazon Elasticsearch Service cluster and use AWS Lambda to directly push data into it.

10. A manufacturing company has many IoT devices in different facilities across the world. The company is using Amazon Kinesis Data Streams to collect the data from the devices.

The company's operations team has started to observe many WriteThroughputExceeded exceptions. The operations team determines that the reason is the number of records that are being written to certain shards. The data contains device ID, capture date, measurement type, measurement value, and facility ID. The facility ID is used as the partition key.

Which action will resolve this issue?

Change the partition key from facility ID to a randomly generated key

Increase the number of shards

Archive the data on the producers' side

Change the partition key from facility ID to capture date

11. A company uses Amazon EC2 instances to receive files from external vendors throughout each day. At the end of each day, the EC2 instances combine the files into a single file, perform gzip compression, and upload the single file to an Amazon S3 bucket. The total size of all the files is approximately 100 GB each day.

When the files are uploaded to Amazon S3, an AWS Batch job runs a COPY command to load the files into an Amazon Redshift cluster.

Which solution will MOST accelerate the COPY process?

Upload the individual files to Amazon S3. Run the COPY command as soon as the files become available.

Split the files so that the number of files is equal to a multiple of the number of slices in the Redshift cluster. Compress and upload the files to Amazon S3. Run the COPY command on the files.

Split the files so that each file uses 50% of the free storage on each compute node in the Redshift cluster. Compress and upload the files to Amazon S3. Run the COPY command on the files.

Apply sharding by breaking up the files so that the DISTKEY columns with the same values go to the same file. Compress and upload the sharded files to Amazon S3. Run the COPY command on the files.
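
One of the options above talks about making the number of files a multiple of the number of slices in the cluster. The arithmetic can be sketched as follows. The slice count of 16 and the 1 GB cap per file are assumptions for illustration; the question itself does not state either value.

```python
import math

GIB = 1024**3


def plan_file_count(total_bytes, slice_count, max_file_bytes=GIB):
    """Smallest multiple of slice_count that keeps every file at or below max_file_bytes."""
    min_files = math.ceil(total_bytes / max_file_bytes)
    multiples = max(1, math.ceil(min_files / slice_count))
    return multiples * slice_count


total = 100 * GIB   # about 100 GB per day, from the question
slices = 16         # hypothetical cluster with 16 slices

file_count = plan_file_count(total, slices)
print(file_count, "files of about", round(total / file_count / GIB, 2), "GiB each")
```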

12. A company wants to use a data lake that is hosted on Amazon S3 to provide analytics services for historical data. The data lake consists of 800 tables but is expected to grow to thousands of tables. More than 50 departments use the tables, and each department has hundreds of users. Different departments need access to specific tables and columns.

Which solution will meet these requirements with the LEAST operational overhead?

Create an IAM role for each department. Use AWS Lake Formation based access control to grant each IAM role access to specific tables and columns. Use Amazon Athena to analyze the data.

Create an Amazon Redshift cluster for each department. Use AWS Glue to ingest into the Redshift cluster only the tables and columns that are relevant to that department. Create Redshift database users. Grant the users access to the relevant department's Redshift cluster. Use Amazon Redshift to analyze the data.

Create an IAM role for each department. Use AWS Lake Formation tag-based access control to grant each IAM role access to only the relevant resources. Create LF-tags that are attached to tables and columns. Use Amazon Athena to analyze the data.

Create an Amazon EMR cluster for each department. Configure an IAM service role for each EMR cluster to access relevant S3 files. For each department's users, create an IAM role that provides access to the relevant EMR cluster. Use Amazon EMR to analyze the data.

13. A network administrator needs to create a dashboard to visualize continuous network patterns over time in a company's AWS account. Currently, the company has VPC Flow Logs enabled and is publishing this data to Amazon CloudWatch Logs. To troubleshoot networking issues quickly, the dashboard needs to display the new data in near-real time.

Which solution meets these requirements?

Create a CloudWatch Logs subscription to stream CloudWatch Logs data to an AWS Lambda function that writes the data to an Amazon S3 bucket. Create an Amazon QuickSight dashboard to visualize the data.

Create an export task from CloudWatch Logs to an Amazon S3 bucket. Create an Amazon QuickSight dashboard to visualize the data.

Create a CloudWatch Logs subscription that uses an AWS Lambda function to stream the CloudWatch Logs data directly into an Amazon OpenSearch Service cluster. Use OpenSearch Dashboards to create the dashboard.

Create a CloudWatch Logs subscription to stream CloudWatch Logs data to an AWS Lambda function that writes to an Amazon Kinesis data stream to deliver the data into an Amazon OpenSearch Service cluster. Use OpenSearch Dashboards to create the dashboard.

14. A company collects data from parking garages. Analysts have requested the ability to run reports in near real time about the number of vehicles in each garage.

The company wants to build an ingestion pipeline that loads the data into an Amazon Redshift cluster. The solution must alert operations personnel when the number of vehicles in a particular garage exceeds a specific threshold. The alerting query will use garage threshold values as a static reference. The threshold values are stored in Amazon S3.

What is the MOST operationally efficient solution that meets these requirements?

Use an Amazon Kinesis Data Firehose delivery stream to collect the data and to deliver the data to Amazon Redshift. Create an Amazon Kinesis Data Analytics application that uses the same delivery stream as an input source. Create a reference data source in Kinesis Data Analytics to temporarily store the threshold values from Amazon S3 and to compare the number of vehicles in a particular garage to the corresponding threshold value. Configure an AWS Lambda function to publish an Amazon Simple Notification Service (Amazon SNS) notification if the number of vehicles exceeds the threshold.

Use an Amazon Kinesis data stream to collect the data. Use an Amazon Kinesis Data Firehose delivery stream to deliver the data to Amazon Redshift. Create another Kinesis data stream to temporarily store the threshold values from Amazon S3. Send the delivery stream and the second data stream to Amazon Kinesis Data Analytics to compare the number of vehicles in a particular garage to the corresponding threshold value. Configure an AWS Lambda function to publish an Amazon Simple Notification Service (Amazon SNS) notification if the number of vehicles exceeds the threshold.

Use an Amazon Kinesis Data Firehose delivery stream to collect the data and to deliver the data to Amazon Redshift. Automatically initiate an AWS Lambda function that queries the data in Amazon Redshift. Configure the Lambda function to compare the number of vehicles in a particular garage to the corresponding threshold value from Amazon S3. Configure the Lambda function to also publish an Amazon Simple Notification Service (Amazon SNS) notification if the number of vehicles exceeds the threshold.

Use an Amazon Kinesis Data Firehose delivery stream to collect the data and to deliver the data to Amazon Redshift. Create an Amazon Kinesis Data Analytics application that uses the same delivery stream as an input source. Use Kinesis Data Analytics to compare the number of vehicles in a particular garage to the corresponding threshold value that is stored in a table as an in-application stream. Configure an AWS Lambda function as an output for the application to publish an Amazon Simple Queue Service (Amazon SQS) notification if the number of vehicles exceeds the threshold.

15. An IoT company wants to release a new device that will collect data to track sleep overnight on an intelligent mattress. Sensors will send data that will be uploaded to an Amazon S3 bucket. About 2 MB of data is generated each night for each bed. Data must be processed and summarized for each user, and the results need to be available as soon as possible. Part of the process consists of time windowing and other functions. Based on tests with a Python script, every run will require about 1 GB of memory and will complete within a couple of minutes.

Which solution will run the script in the MOST cost-effective way?

AWS Lambda with a Python script

AWS Glue with a Scala job

Amazon EMR with an Apache Spark script

AWS Glue with a PySpark job

16. A retail company has 15 stores across 6 cities in the United States. Once a month, the sales team requests a visualization in Amazon QuickSight that provides the ability to easily identify revenue trends across cities and stores. The visualization also helps identify outliers that need to be examined with further analysis.

Which visual type in QuickSight meets the sales team's requirements?

Geospatial chart

Line chart

Heat map

Tree map

17. A global company has different sub-organizations, and each sub-organization sells its products and services in various countries. The company's senior leadership wants to quickly identify which sub-organization is the strongest performer in each country. All sales data is stored in Amazon S3 in Parquet format.

Which approach can provide the visuals that senior leadership requested with the least amount of effort?

Use Amazon QuickSight with Amazon Athena as the data source. Use heat maps as the visual type.

Use Amazon QuickSight with Amazon S3 as the data source. Use heat maps as the visual type.

Use Amazon QuickSight with Amazon Athena as the data source. Use pivot tables as the visual type.

Use Amazon QuickSight with Amazon S3 as the data source. Use pivot tables as the visual type.

18. A company wants to ingest clickstream data from its website into an Amazon S3 bucket. The streaming data is in JSON format. The data in the S3 bucket must be partitioned by product_id.

Which solution will meet these requirements MOST cost-effectively?

Create an Amazon Kinesis Data Firehose delivery stream to ingest the streaming data into the S3 bucket. Enable dynamic partitioning. Specify the data field of product_id as one partitioning key.

Create an AWS Glue streaming job to partition the data by product_id before delivering the data to the S3 bucket. Create an Amazon Kinesis Data Firehose delivery stream. Specify the AWS Glue job as the destination of the delivery stream.

Create an Amazon Kinesis Data Firehose delivery stream to ingest the streaming data into the S3 bucket. Create an AWS Glue ETL job to read the data stream in the S3 bucket, partition the data by product_id, and write the data into another S3 bucket.

Create an Amazon Kinesis Data Firehose delivery stream to ingest the streaming data into the S3 bucket. Create an Amazon EMR cluster that includes a job to read the data stream in the S3 bucket, partition the data by product_id, and write the data into another S3 bucket.
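
To make the idea of partitioning JSON records by a field concrete, the sketch below groups sample clickstream records under one S3-style prefix per product_id value. It is a local simulation only and does not use any Firehose API; the Hive-style prefix layout and the sample records are assumptions for illustration.

```python
import json
from collections import defaultdict


def partition_records(raw_records, key="product_id"):
    """Group JSON records under a prefix derived from the partitioning field."""
    partitions = defaultdict(list)
    for raw in raw_records:
        record = json.loads(raw)
        prefix = f"{key}={record[key]}/"  # assumed Hive-style prefix
        partitions[prefix].append(record)
    return dict(partitions)


# Made-up clickstream records.
clicks = [
    '{"product_id": 42, "page": "/item/42", "user": "u1"}',
    '{"product_id": 7, "page": "/item/7", "user": "u2"}',
    '{"product_id": 42, "page": "/item/42/reviews", "user": "u3"}',
]

by_prefix = partition_records(clicks)
for prefix, records in sorted(by_prefix.items()):
    print(prefix, len(records), "record(s)")
```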

19. An airline has been collecting metrics on flight activities for analytics. A recently completed proof of concept demonstrates how the company provides insights to data analysts to improve on-time departures. The proof of concept used objects in Amazon S3, which contained the metrics in .csv format, and used Amazon Athena for querying the data. As the amount of data increases, the data analyst wants to optimize the storage solution to improve query performance.

Which options should the data analyst use to improve performance as the data lake grows? (Choose three.)

Add a randomized string to the beginning of the keys in S3 to get more throughput across partitions.

Use an S3 bucket in the same account as Athena.

Compress the objects to reduce the data transfer I/O.

Use an S3 bucket in the same Region as Athena.

Preprocess the .csv data to JSON to reduce I/O by fetching only the document keys needed by the query.

Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.

20. A company wants to use an automatic machine learning (ML) Random Cut Forest (RCF) algorithm to visualize complex real-world scenarios, such as detecting seasonality and trends, excluding outliers, and imputing missing values.

The team working on this project is non-technical and is looking for an out-of-the-box solution that will require the LEAST amount of management overhead.

Which solution will meet these requirements?

Use an AWS Glue ML transform to create a forecast and then use Amazon QuickSight to visualize the data.

Use Amazon QuickSight to visualize the data and then use ML-powered forecasting to forecast the key business metrics.

Use a pre-built ML AMI from the AWS Marketplace to create forecasts and then use Amazon QuickSight to visualize the data.

Use calculated fields to create a new forecast and then use Amazon QuickSight to visualize the data.

21. A company owns facilities with IoT devices installed across the world. The company is using Amazon Kinesis Data Streams to stream data from the devices to Amazon S3. The company's operations team wants to get insights from the IoT data to monitor data quality at ingestion. The insights need to be derived in near-real time, and the output must be logged to Amazon DynamoDB for further analysis.

Which solution meets these requirements?

Connect Amazon Kinesis Data Analytics to analyze the stream data. Save the output to DynamoDB by using the default output from Kinesis Data Analytics.

Connect Amazon Kinesis Data Analytics to analyze the stream data. Save the output to DynamoDB by using an AWS Lambda function.

Connect Amazon Kinesis Data Firehose to analyze the stream data by using an AWS Lambda function. Save the output to DynamoDB by using the default output from Kinesis Data Firehose.

Connect Amazon Kinesis Data Firehose to analyze the stream data by using an AWS Lambda function. Save the data to Amazon S3. Then run an AWS Glue job on schedule to ingest the data into DynamoDB.

22. A smart home automation company must efficiently ingest and process messages from various connected devices and sensors. The majority of these messages are comprised of a large number of small files. These messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then passed through a processing pipeline built on Amazon EMR running scheduled PySpark jobs.

The data platform team manages data processing and is concerned about the efficiency and cost of downstream data processing. They want to continue to use PySpark.

Which solution improves the efficiency of the data processing jobs and is well architected?

Send the sensor and devices data directly to a Kinesis Data Firehose delivery stream to send the data to Amazon S3 with Apache Parquet record format conversion enabled. Use Amazon EMR running PySpark to process the data in Amazon S3.

Set up an AWS Lambda function with a Python runtime environment. Process individual Kinesis data stream messages from the connected devices and sensors using Lambda.

Launch an Amazon Redshift cluster. Copy the collected data from Amazon S3 to Amazon Redshift and move the data processing jobs from Amazon EMR to Amazon Redshift.

Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.

23. A large financial company is running its ETL process. Part of this process is to move data from Amazon S3 into an Amazon Redshift cluster. The company wants to use the most cost-efficient method to load the dataset into Amazon Redshift.

Which combination of steps would meet these requirements? (Choose two.)

Use the COPY command with the manifest file to load data into Amazon Redshift.

Use S3DistCp to load files into Amazon Redshift.

Use temporary staging tables during the loading process.

Use the UNLOAD command to upload data into Amazon Redshift.

Use Amazon Redshift Spectrum to query files from Amazon S3.
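
One of the options above mentions a manifest file for the COPY command. A manifest is a JSON document that lists the exact S3 objects to load. The sketch below builds one with the standard library; the bucket, object keys, table name, and role ARN are all hypothetical.

```python
import json


def build_manifest(s3_urls, mandatory=True):
    """Return manifest JSON listing the S3 objects that COPY should load."""
    entries = [{"url": url, "mandatory": mandatory} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical object keys.
manifest = build_manifest([
    "s3://example-bucket/sales/part-0000.gz",
    "s3://example-bucket/sales/part-0001.gz",
])
print(manifest)

# After uploading the manifest, a COPY statement could reference it, for example:
# COPY sales
# FROM 's3://example-bucket/manifests/sales.manifest'
# IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-role'
# GZIP
# MANIFEST;
```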

24. A hospital is building a research data lake to ingest data from electronic health records (EHR) systems from multiple hospitals and clinics. The EHR systems are independent of each other and do not have a common patient identifier. The data engineering team is not experienced in machine learning (ML) and has been asked to generate a unique patient identifier for the ingested records.

Which solution will accomplish this task?

An AWS Glue ETL job with the FindMatches transform

Amazon Kendra

Amazon SageMaker Ground Truth

An AWS Glue ETL job with the ResolveChoice transform

25. A healthcare company uses AWS data and analytics tools to collect, ingest, and store electronic health record (EHR) data about its patients. The raw EHR data is stored in Amazon S3 in JSON format partitioned by hour, day, and year and is updated every hour. The company wants to maintain the data catalog and metadata in an AWS Glue Data Catalog to be able to access the data using Amazon Athena or Amazon Redshift Spectrum for analytics.

When defining tables in the Data Catalog, the company has the following requirements:

Choose the catalog table name and do not rely on the catalog table naming algorithm. Keep the table updated with new partitions loaded in the respective S3 bucket prefixes.

Which solution meets these requirements with minimal effort?

Run an AWS Glue crawler that connects to one or more data stores, determines the data structures, and writes tables in the Data Catalog.

Use the AWS Glue console to manually create a table in the Data Catalog and schedule an AWS Lambda function to update the table partitions hourly.

Use the AWS Glue API CreateTable operation to create a table in the Data Catalog. Create an AWS Glue crawler and specify the table as the source.

Create an Apache Hive catalog in Amazon EMR with the table schema definition in Amazon S3, and update the table partition with a scheduled job. Migrate the Hive catalog to the Data Catalog.

26. A medical company has a system with sensor devices that read metrics and send them in real time to an Amazon Kinesis data stream. The Kinesis data stream has multiple shards. The company needs to calculate the average value of a numeric metric every second and set an alarm for whenever the value is above one threshold or below another threshold. The alarm must be sent to Amazon Simple Notification Service (Amazon SNS) in less than 30 seconds.

Which architecture meets these requirements?

Use an Amazon Kinesis Data Firehose delivery stream to read the data from the Kinesis data stream with an AWS Lambda transformation function that calculates the average per second and sends the alarm to Amazon SNS.

Use an AWS Lambda function to read from the Kinesis data stream to calculate the average per second and send the alarm to Amazon SNS.

Use an Amazon Kinesis Data Firehose delivery stream to read the data from the Kinesis data stream and store it on Amazon S3. Have Amazon S3 trigger an AWS Lambda function that calculates the average per second and sends the alarm to Amazon SNS.

Use an Amazon Kinesis Data Analytics application to read from the Kinesis data stream and calculate the average per second. Send the results to an AWS Lambda function that sends the alarm to Amazon SNS.
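
The calculation this question asks for, an average per one-second window checked against a lower and an upper threshold, can be sketched in plain Python. This is only a local stand-in for what a streaming application would compute over the whole stream; the readings and threshold values are invented for the example.

```python
from collections import defaultdict


def per_second_averages(readings):
    """Average values in one-second tumbling windows.

    readings is an iterable of (timestamp_in_seconds, value) pairs.
    """
    windows = defaultdict(lambda: [0.0, 0])  # second -> [sum, count]
    for timestamp, value in readings:
        second = int(timestamp)  # window that the reading falls into
        windows[second][0] += value
        windows[second][1] += 1
    return {sec: total / count for sec, (total, count) in sorted(windows.items())}


def alarming_seconds(averages, low, high):
    """Return the windows whose average is below low or above high."""
    return [sec for sec, avg in averages.items() if avg < low or avg > high]


# Invented sensor readings: (seconds since start, metric value).
readings = [(0.1, 10.0), (0.9, 20.0), (1.2, 50.0), (2.5, 1.0)]

averages = per_second_averages(readings)
alarms = alarming_seconds(averages, low=5.0, high=40.0)
print(averages)
print("alarm for seconds:", alarms)
```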

27. A company uses Amazon Redshift as its data warehouse. The Redshift cluster is not encrypted. A data analytics specialist needs to use hardware security module (HSM) managed encryption keys to encrypt the data that is stored in the Redshift cluster.

Which combination of steps will meet these requirements? (Select THREE.)

Stop all write operations on the source cluster. Unload data from the source cluster.

Copy the data to a new target cluster that is encrypted with AWS Key Management Service (AWS KMS).

Modify the source cluster by activating AWS CloudHSM encryption. Configure Amazon Redshift to automatically migrate data to a new encrypted cluster.

Modify the source cluster by activating encryption from an external HSM. Configure Amazon Redshift to automatically migrate data to a new encrypted cluster.

Copy the data to a new target cluster that is encrypted with an HSM from AWS CloudHSM.

Rename the source cluster and the target cluster after the migration so that the target cluster is using the original endpoint.

28. A company wants to optimize the cost of its data and analytics platform. The company is ingesting a number of .csv and JSON files in Amazon S3 from various data sources. Incoming data is expected to be 50 GB each day. The company is using Amazon Athena to query the raw data in Amazon S3 directly. Most queries aggregate data from the past 12 months, and data that is older than 5 years is infrequently queried. The typical query scans about 500 MB of data and is expected to return results in less than 1 minute. The raw data must be retained indefinitely for compliance requirements.

Which solution meets the company’s requirements?

Use an AWS Glue ETL job to compress, partition, and convert the data into a columnar data format. Use Athena to query the processed dataset. Configure a lifecycle policy to move the processed data into the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class 5 years after object creation. Configure a second lifecycle policy to move the raw data into Amazon S3 Glacier for long-term archival 7 days after object creation.

Use an AWS Glue ETL job to partition and convert the data into a row-based data format. Use Athena to query the processed dataset. Configure a lifecycle policy to move the data into the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class 5 years after object creation. Configure a second lifecycle policy to move the raw data into Amazon S3 Glacier for long-term archival 7 days after object creation.

Use an AWS Glue ETL job to compress, partition, and convert the data into a columnar data format. Use Athena to query the processed dataset. Configure a lifecycle policy to move the processed data into the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class 5 years after the object was last accessed. Configure a second lifecycle policy to move the raw data into Amazon S3 Glacier for long-term archival 7 days after the last date the object was accessed.

Use an AWS Glue ETL job to partition and convert the data into a row-based data format. Use Athena to query the processed dataset. Configure a lifecycle policy to move the data into the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class 5 years after the object was last accessed. Configure a second lifecycle policy to move the raw data into Amazon S3 Glacier for long-term archival 7 days after the last date the object was accessed.

29. A company using Amazon QuickSight Enterprise edition has thousands of dashboards, analyses, and datasets. The company struggles to manage and assign permissions for granting users access to various items within QuickSight. The company wants to make it easier to implement sharing and permissions management.

Which solution should the company implement to simplify permissions management?

Use QuickSight folders to organize dashboards, analyses, and datasets. Assign individual users permissions to these folders.

Use QuickSight folders to organize dashboards, analyses, and datasets. Assign group permissions by using these folders.

Use AWS IAM resource-based policies to assign group permissions to QuickSight items.

Use QuickSight user management APIs to provision group permissions based on dashboard naming conventions.

30. A company is streaming its high-volume billing data (100 MBps) to Amazon Kinesis Data Streams. A data analyst partitioned the data on account_id to ensure that all records belonging to an account go to the same Kinesis shard and order is maintained. While building a custom consumer using the Kinesis Java SDK, the data analyst notices that, sometimes, the messages arrive out of order for account_id. Upon further investigation, the data analyst discovers the messages that are out of order seem to be arriving from different shards for the same account_id and are seen when a stream resize runs.

What is an explanation for this behavior and what is the solution?

There are multiple shards in a stream and order needs to be maintained in the shard. The data analyst needs to make sure there is only a single shard in the stream and no stream resize runs.

The hash key generation process for the records is not working correctly. The data analyst should generate an explicit hash key on the producer side so the records are directed to the appropriate shard accurately.

The records are not being received by Kinesis Data Streams in order. The producer should use the PutRecords API call instead of the PutRecord API call with the SequenceNumberForOrdering parameter.

The consumer is not processing the parent shard completely before processing the child shards after a stream resize. The data analyst should process the parent shard completely first before processing the child shards.
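
Question 30 turns on the parent/child relationship that a stream resize creates between shards. As a rough illustration only, the sketch below orders shard descriptions so that each parent comes before its children. It works on plain dictionaries that reuse the `ShardId`, `ParentShardId`, and `AdjacentParentShardId` field names from the Kinesis shard description; the function name and the sample data are assumptions.

```python
def parents_first(shards):
    """Return shard IDs ordered so every parent precedes its children.

    `shards` is a list of dicts with a "ShardId" key and optional
    "ParentShardId" / "AdjacentParentShardId" keys (illustrative input).
    """
    by_id = {shard["ShardId"]: shard for shard in shards}
    ordered, seen = [], set()

    def visit(shard_id):
        if shard_id in seen or shard_id not in by_id:
            return  # already placed, or the parent is no longer listed
        seen.add(shard_id)
        shard = by_id[shard_id]
        for key in ("ParentShardId", "AdjacentParentShardId"):
            parent = shard.get(key)
            if parent:
                visit(parent)  # place parents before this shard
        ordered.append(shard_id)

    for shard in shards:
        visit(shard["ShardId"])
    return ordered
```

A consumer that drains shards in this order reads everything written before a split or merge ahead of anything written after it.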

31. A company uses Amazon Redshift for its data warehouse. The company is running an ETL process that receives data in data parts from five third-party providers. The data parts contain independent records that are related to one specific job. The company receives the data parts at various times throughout each day.

A data analytics specialist must implement a solution that loads the data into Amazon Redshift only after the company receives all five data parts.

Which solution will meet these requirements?

Create an Amazon S3 bucket to receive the data. Use S3 multipart upload to collect the data from the different sources and to form a single object before loading the data into Amazon Redshift.

Use an AWS Lambda function that is scheduled by cron to load the data into a temporary table in Amazon Redshift. Use Amazon Redshift database triggers to consolidate the final data when all five data parts are ready.

Create an Amazon S3 bucket to receive the data. Create an AWS Lambda function that is invoked by S3 upload events. Configure the function to validate that all five data parts are gathered before the function loads the data into Amazon Redshift.

Create an Amazon Kinesis Data Firehose delivery stream. Program a Python condition that will invoke a buffer flush when all five data parts are received.
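
The "load only after all five data parts have arrived" condition in question 31 reduces to a completeness check over the objects received so far. The sketch below shows that check in isolation. The key layout (`jobs/<job-id>/<provider>.part`) and the function name are hypothetical, and listing the bucket and loading into Amazon Redshift are deliberately left out.

```python
EXPECTED_PARTS = 5  # from the question: five third-party providers


def ready_to_load(object_keys, job_prefix, expected=EXPECTED_PARTS):
    """Return True once `expected` distinct part objects exist under job_prefix.

    The ".part" suffix and the prefix layout are assumptions for illustration.
    """
    parts = {
        key
        for key in object_keys
        if key.startswith(job_prefix) and key.endswith(".part")
    }
    return len(parts) >= expected
```

For example, `ready_to_load(keys, "jobs/2023-09-29/")` stays `False` until the fifth distinct part key appears under that prefix.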

32. An online food delivery company wants to optimize its storage costs. The company has been collecting operational data for the last 10 years in a data lake that was built on Amazon S3 by using a Standard storage class. The company does not keep data that is older than 7 years. The data analytics team frequently uses data from the past 6 months for reporting and runs queries on data from the last 2 years about once a month. Data that is more than 2 years old is rarely accessed and is only used for audit purposes.

Which combination of solutions will optimize the company's storage costs? (Select TWO.)

Create an S3 Lifecycle configuration rule to transition data that is older than 6 months to the S3 Standard-Infrequent Access (S3 Standard-IA) storage class. Create another S3 Lifecycle configuration rule to transition data that is older than 2 years to the S3 Glacier Deep Archive storage class.

Create an S3 Lifecycle configuration rule to transition data that is older than 6 months to the S3 One Zone-Infrequent Access (S3 One Zone-IA) storage class. Create another S3 Lifecycle configuration rule to transition data that is older than 2 years to the S3 Glacier Flexible Retrieval storage class.

Use the S3 Intelligent-Tiering storage class to store data instead of the S3 Standard storage class.

Create an S3 Lifecycle expiration rule to delete data that is older than 7 years.

Create an S3 Lifecycle configuration rule to transition data that is older than 7 years to the S3 Glacier Deep Archive storage class.
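
The options in question 32 all describe S3 Lifecycle rules, which are expressed as transitions and expirations measured in days since object creation. As a hedged sketch, the helper below builds a rule dictionary in the shape that the S3 lifecycle configuration API accepts. The helper name, the rule ID, and the day counts (180, 730, and 2555 as approximations of 6 months, 2 years, and 7 years) are example values, not a statement of which options are correct.

```python
def lifecycle_rule(rule_id, transitions, expire_after_days=None, prefix=""):
    """Build one lifecycle rule dict (illustrative helper, not an AWS API).

    `transitions` is a list of (days_after_creation, storage_class) pairs,
    for example (180, "STANDARD_IA") or (730, "DEEP_ARCHIVE").
    """
    rule = {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix applies to all objects
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in transitions
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Example values only: swap in the classes and ages from any option.
example_config = {
    "Rules": [
        lifecycle_rule(
            "tiering-example",
            [(180, "STANDARD_IA"), (730, "DEEP_ARCHIVE")],
            expire_after_days=2555,
        )
    ]
}
```

Other storage class identifiers that appear in the options include `ONEZONE_IA`, `GLACIER` (S3 Glacier Flexible Retrieval), and `INTELLIGENT_TIERING`.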

33. A company launched a service that produces millions of messages every day and uses Amazon Kinesis Data Streams as the streaming service.

The company uses the Kinesis SDK to write data to Kinesis Data Streams. A few months after launch, a data analyst found that write performance is significantly reduced. The data analyst investigated the metrics and determined that Kinesis is throttling the write requests. The data analyst wants to address this issue without significant changes to the architecture.

Which actions should the data analyst take to resolve this issue? (Choose two.)

Increase the Kinesis Data Streams retention period to reduce throttling.

Replace the Kinesis API-based data ingestion mechanism with Kinesis Agent.

Increase the number of shards in the stream using the UpdateShardCount API.

Choose partition keys in a way that results in a uniform record distribution across shards.

Customize the application code to include retry logic to improve performance.
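
Kinesis Data Streams maps each record to a shard by taking the MD5 hash of its partition key, and each shard owns a range of the resulting 128-bit hash space. The sketch below simulates that mapping to show how the choice of partition key affects distribution. It assumes the hash space is split evenly across shards, which is a simplification; the function names are illustrative.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def distribution(partition_keys, shard_count):
    """Count how many records land on each simulated shard."""
    return Counter(shard_for(key, shard_count) for key in partition_keys)
```

Feeding the same key many times puts every record on one simulated shard, while many distinct keys spread roughly evenly, which is the difference between a hot shard and balanced write throughput.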

34. A company is reading data from various customer databases that run on Amazon RDS. The databases contain many inconsistent fields. For example, a customer record field that is place_id in one database is location_id in another database. The company wants to link customer records across different databases, even when many customer record fields do not match exactly.

Which solution will meet these requirements with the LEAST operational overhead?

Create an Amazon EMR cluster to process and analyze data in the databases. Connect to the Apache Zeppelin notebook, and use the FindMatches transform to find duplicate records in the data.

Create an AWS Glue crawler to crawl the databases. Use the FindMatches transform to find duplicate records in the data. Evaluate and tune the transform by evaluating performance and results of finding matches.

Create an AWS Glue crawler to crawl the data in the databases. Use Amazon SageMaker to construct Apache Spark ML pipelines to find duplicate records in the data.

Create an Amazon EMR cluster to process and analyze data in the databases. Connect to the Apache Zeppelin notebook, and use Apache Spark ML to find duplicate records in the data. Evaluate and tune the model by evaluating performance and results of finding duplicates.

35. A company has 1 million scanned documents stored as image files in Amazon S3. The documents contain typewritten application forms with information including the applicant first name, applicant last name, application date, application type, and application text. The company has developed a machine learning algorithm to extract the metadata values from the scanned documents. The company wants to allow internal data analysts to analyze and find applications using the applicant name, application date, or application text. The original images should also be downloadable. Cost control is secondary to query performance.

Which solution organizes the images and metadata to drive insights while meeting the requirements?

For each image, use object tags to add the metadata. Use Amazon S3 Select to retrieve the files based on the applicant name and application date.

Index the metadata and the Amazon S3 location of the image file in Amazon Elasticsearch Service. Allow the data analysts to use Kibana to submit queries to the Elasticsearch cluster.

Store the metadata and the Amazon S3 location of the image file in an Amazon Redshift table. Allow the data analysts to run ad-hoc queries on the table.

Store the metadata and the Amazon S3 location of the image files in an Apache Parquet file in Amazon S3, and define a table in the AWS Glue Data Catalog. Allow data analysts to use Amazon Athena to submit custom queries.

36. An online retailer is rebuilding its inventory management system and inventory reordering system to automatically reorder products by using Amazon Kinesis Data Streams. The inventory management system uses the Kinesis Producer Library (KPL) to publish data to a stream. The inventory reordering system uses the Kinesis Client Library (KCL) to consume data from the stream. The stream has been configured to scale as needed. Just before production deployment, the retailer discovers that the inventory reordering system is receiving duplicated data.

Which factors could be causing the duplicated data? (Choose two.)

The producer has a network-related timeout.

The stream’s value for the IteratorAgeMilliseconds metric is too high.

There was a change in the number of shards, record processors, or both.

The AggregationEnabled configuration property was set to true.

The max_records configuration property was set to a number that is too high.

37. A manufacturing company is storing data from its operational systems in Amazon S3. The company's business analysts need to perform one-time queries of the data in Amazon S3 with Amazon Athena. The company needs to access the Athena service from the on-premises network by using a JDBC connection. The company has created a VPC. Security policies mandate that requests to AWS services cannot traverse the internet.

Which combination of steps should a data analytics specialist take to meet these requirements? (Select TWO.)

Establish an AWS Direct Connect connection between the on-premises network and the VPC.

Configure the JDBC connection to connect to Athena through Amazon API Gateway.

Configure the JDBC connection to use a gateway VPC endpoint for Amazon S3.

Configure the JDBC connection to use an interface VPC endpoint for Athena.

Deploy Athena within a private subnet.

38. A hospital uses an electronic health records (EHR) system to collect two types of data:

• Patient information, which includes a patient's name and address

• Diagnostic tests conducted and the results of these tests

Patient information is expected to change periodically. Existing diagnostic test data never changes, and only new records are added.

The hospital runs an Amazon Redshift cluster with four dc2.large nodes and wants to automate the ingestion of the patient information and diagnostic test data into respective Amazon Redshift tables for analysis. The EHR system exports data as CSV files to an Amazon S3 bucket on a daily basis. Two sets of CSV files are generated. One set of files is for patient information with updates, deletes, and inserts. The other set of files is for new diagnostic test data only.

What is the MOST cost-effective solution to meet these requirements?

Use Amazon EMR with Apache Hudi. Run daily ETL jobs using Apache Spark and the Amazon Redshift JDBC driver.

Use an AWS Glue crawler to catalog the data in Amazon S3. Use Amazon Redshift Spectrum to perform scheduled queries of the data in Amazon S3 and ingest the data into the patient information table and the diagnostic tests table.

Use an AWS Lambda function to run a COPY command that appends new diagnostic test data to the diagnostic tests table. Run another COPY command to load the patient information data into the staging tables. Use a stored procedure to handle create, update, and delete operations for the patient information table.

Use AWS Database Migration Service (AWS DMS) to collect and process change data capture (CDC) records. Use the COPY command to load patient information data into the staging tables. Use a stored procedure to handle create, update, and delete operations for the patient information table.

39. A social media company is using business intelligence tools to analyze data for forecasting. The company is using Apache Kafka to ingest data. The company wants to build dynamic dashboards that include machine learning (ML) insights to forecast key business trends.

The dashboards must show recent batched data that is not more than 75 minutes old. Various teams at the company want to view the dashboards by using Amazon QuickSight with ML insights.

Which solution will meet these requirements?

Replace Kafka with Amazon Managed Streaming for Apache Kafka (Amazon MSK). Use AWS Data Exchange to store the data in Amazon S3. Use SPICE in QuickSight Enterprise edition to refresh the data from Amazon S3 each hour. Use QuickSight to create a dynamic dashboard that includes forecasting and ML insights.

Replace Kafka with an Amazon Kinesis data stream. Use AWS Data Exchange to store the data in Amazon S3. Use SPICE in QuickSight Standard edition to refresh the data from Amazon S3 each hour. Use QuickSight to create a dynamic dashboard that includes forecasting and ML insights.

Configure the Kafka-Kinesis-Connector to publish the data to an Amazon Kinesis Data Firehose delivery stream. Configure the delivery stream to store the data in Amazon S3 with a max buffer size of 60 seconds. Use SPICE in QuickSight Enterprise edition to refresh the data from Amazon S3 each hour. Use QuickSight to create a dynamic dashboard that includes forecasting and ML insights.

Configure the Kafka-Kinesis-Connector to publish the data to an Amazon Kinesis Data Firehose delivery stream. Configure the delivery stream to store the data in Amazon S3 with a max buffer size of 60 seconds. Refresh the data in QuickSight Standard edition SPICE from Amazon S3 by using a scheduled AWS Lambda function. Configure the Lambda function to run every 75 minutes and to invoke the QuickSight API to create a dynamic dashboard that includes forecasting and ML insights.

40. A marketing company is using Amazon EMR clusters for its workloads. The company manually installs third-party libraries on the clusters by logging in to the master nodes. A data analyst needs to create an automated solution to replace the manual process.

Which options can fulfill these requirements? (Choose two.)

Place the required installation scripts in Amazon S3 and execute them using custom bootstrap actions.

Place the required installation scripts in Amazon S3 and execute them through Apache Spark in Amazon EMR.

Install the required third-party libraries in the existing EMR master node. Create an AMI out of that master node and use that custom AMI to re-create the EMR cluster.

Use an Amazon DynamoDB table to store the list of required applications. Trigger an AWS Lambda function with DynamoDB Streams to install the software.

Launch an Amazon EC2 instance with Amazon Linux and install the required third-party libraries on the instance. Create an AMI and use that AMI to create the EMR cluster.

41. A company has an application that ingests streaming data. The company needs to analyze this stream over a 5-minute timeframe to evaluate the stream for anomalies with Random Cut Forest (RCF) and summarize the current count of status codes. The source and summarized data should be persisted for future use.

Which approach would enable the desired outcome while keeping data persistence costs low?

Ingest the data stream with Amazon Kinesis Data Streams. Have an AWS Lambda consumer evaluate the stream, collect the number of status codes, and evaluate the data against a previously trained RCF model. Persist the source and results as a time series to Amazon DynamoDB.

Ingest the data stream with Amazon Kinesis Data Streams. Have a Kinesis Data Analytics application evaluate the stream over a 5-minute window using the RCF function and summarize the count of status codes. Persist the source and results to Amazon S3 through output delivery to Kinesis Data Firehose.

Ingest the data stream with Amazon Kinesis Data Firehose with a delivery frequency of 1 minute or 1 MB in Amazon S3. Ensure Amazon S3 triggers an event to invoke an AWS Lambda consumer that evaluates the batch data, collects the number of status codes, and evaluates the data against a previously trained RCF model. Persist the source and results as a time series to Amazon DynamoDB.

Ingest the data stream with Amazon Kinesis Data Firehose with a delivery frequency of 5 minutes or 1 MB into Amazon S3. Have a Kinesis Data Analytics application evaluate the stream over a 1-minute window using the RCF function and summarize the count of status codes. Persist the results to Amazon S3 through a Kinesis Data Analytics output to an AWS Lambda integration.

42. A company has a fitness tracker application that generates data from subscribers. The company needs real-time reporting on this data. The data is sent immediately, and the processing latency must be less than 1 second. The company wants to perform anomaly detection on the data as the data is collected. The company also requires a solution that minimizes operational overhead.

Which solution meets these requirements?

Amazon EMR cluster with Apache Spark streaming, Spark SQL, and Spark's machine learning library (MLlib)

Amazon Kinesis Data Firehose with Amazon S3 and Amazon Athena

Amazon Kinesis Data Firehose with Amazon QuickSight

Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics

43. A company wants to research user turnover by analyzing the past 3 months of user activities. With millions of users, 1.5 TB of uncompressed data is generated each day. A 30-node Amazon Redshift cluster with 2.56 TB of solid state drive (SSD) storage for each node is required to meet the query performance goals.

The company wants to run an additional analysis on a year’s worth of historical data to examine trends indicating which features are most popular. This analysis will be done once a week.

What is the MOST cost-effective solution?

Increase the size of the Amazon Redshift cluster to 120 nodes so it has enough storage capacity to hold 1 year of data. Then use Amazon Redshift for the additional analysis.

Keep the data from the last 90 days in Amazon Redshift. Move data older than 90 days to Amazon S3 and store it in Apache Parquet format partitioned by date. Then use Amazon Redshift Spectrum for the additional analysis.

Keep the data from the last 90 days in Amazon Redshift. Move data older than 90 days to Amazon S3 and store it in Apache Parquet format partitioned by date. Then provision a persistent Amazon EMR cluster and use Apache Presto for the additional analysis.

Resize the cluster node type to the dense storage node type (DS2) for an additional 16 TB storage capacity on each individual node in the Amazon Redshift cluster. Then use Amazon Redshift for the additional analysis.
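
The figures in question 43 are easier to compare once they are multiplied out. The short calculation below uses only numbers stated in the question; the function name is illustrative, and treating a year as 365 days is an assumption.

```python
DAILY_TB = 1.5        # uncompressed data generated per day
NODES = 30            # nodes in the existing cluster
TB_PER_NODE = 2.56    # SSD storage per node


def storage_summary(days_hot=90, days_history=365):
    """Return raw data volumes and cluster capacity in TB (illustrative)."""
    return {
        "cluster_capacity_tb": NODES * TB_PER_NODE,
        "hot_window_tb": DAILY_TB * days_hot,
        "one_year_tb": DAILY_TB * days_history,
    }
```

The raw totals are 135 TB for 90 days and 547.5 TB for a year, against 76.8 TB of cluster storage. The 90-day figure is uncompressed, so it does not translate directly into on-disk size, but the gap between the weekly one-year analysis and the current cluster is what each option has to bridge.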

44. A retail company stores order invoices in an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster. Indices on the cluster are created monthly. Once a new month begins, no new writes are made to any of the indices from the previous months. The company has been expanding the storage on the Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster to avoid running out of space, but the company wants to reduce costs. Most searches on the cluster are on the most recent 3 months of data, while the audit team requires infrequent access to older data to generate periodic reports. The most recent 3 months of data must be quickly available for queries, but the audit team can tolerate slower queries if the solution saves on cluster costs.

Which of the following is the MOST operationally efficient solution to meet these requirements?

Archive indices that are older than 3 months by using Index State Management (ISM) to create a policy to store the indices in Amazon S3 Glacier. When the audit team requires the archived data, restore the archived indices back to the Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster.

Archive indices that are older than 3 months by taking manual snapshots and storing the snapshots in Amazon S3. When the audit team requires the archived data, restore the archived indices back to the Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster.

Archive indices that are older than 3 months by using Index State Management (ISM) to create a policy to migrate the indices to Amazon OpenSearch Service (Amazon Elasticsearch Service) UltraWarm storage. When the audit team requires the older data, migrate the indices in UltraWarm storage back to hot storage.

45. A university intends to use Amazon Kinesis Data Firehose to collect JSON-formatted batches of water quality readings in Amazon S3. The readings are from 50 sensors scattered across a local lake. Students will query the stored data using Amazon Athena to observe changes in a captured metric over time, such as water temperature or acidity. Interest has grown in the study, prompting the university to reconsider how data will be stored.

Which data format and partitioning choices will MOST significantly reduce costs? (Choose two.)

Store the data in Apache Avro format using Snappy compression.

Partition the data by year, month, and day.

Store the data in Apache ORC format using no compression.

Store the data in Apache Parquet format using Snappy compression.

Partition the data by sensor, year, month, and day.
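
The partitioning options in question 45 correspond to different S3 key prefixes. The sketch below builds Hive-style `key=value` prefixes, the layout Athena can map to partition columns, for both schemes mentioned in the options. The function name and the sensor ID format are assumptions for illustration.

```python
from datetime import datetime, timezone


def partition_prefix(reading_time, sensor_id=None):
    """Build a Hive-style S3 prefix, optionally led by a sensor partition."""
    parts = []
    if sensor_id is not None:
        parts.append(f"sensor={sensor_id}")
    parts += [
        f"year={reading_time:%Y}",
        f"month={reading_time:%m}",
        f"day={reading_time:%d}",
    ]
    return "/".join(parts) + "/"


# Example: a reading taken on the post date of this page.
when = datetime(2023, 9, 29, tzinfo=timezone.utc)
date_only = partition_prefix(when)
with_sensor = partition_prefix(when, sensor_id="s17")
```

Here `date_only` is `year=2023/month=09/day=29/` and `with_sensor` is `sensor=s17/year=2023/month=09/day=29/`. A query that filters on a partition column only scans objects under the matching prefixes.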

46. A large company has several independent business units. Each business unit is responsible for its own data, but needs to share data with other units for collaboration.

Each unit stores data in an Amazon S3 data lake created with AWS Lake Formation. To create dashboard reports, the marketing team wants to join its data stored in an Amazon Redshift cluster with the sales team customer table stored in the data lake. The sales team has a large number of tables and schemas, but the marketing team should only have access to the customer table. The solution must be secure and scalable.

Which set of actions meets these requirements?

The sales team shares the AWS Glue Data Catalog customer table with the marketing team in read-only mode using the named resource method. The marketing team accepts the datashare using AWS Resource Access Manager (AWS RAM) and creates a resource link to the shared customer table. The marketing team joins its data with the customer table using Amazon Redshift Spectrum.

The marketing team creates an S3 cross-account replication between the sales team's S3 bucket as the source and the marketing team's S3 bucket as the destination. The marketing team runs an AWS Glue crawler on the replicated data in its AWS account to create an AWS Glue Data Catalog customer table. The marketing team joins its data with the customer table using Amazon Redshift Spectrum.

The marketing team creates an AWS Lambda function in the sales team's account to replicate data between the sales team's S3 bucket as the source and the marketing team's S3 bucket as the destination. The marketing team runs an AWS Glue crawler on the replicated data in its AWS account to create an AWS Glue Data Catalog customer table. The marketing team joins its data with the customer table using Amazon Redshift Spectrum.

The sales team shares the AWS Glue Data Catalog customer table with the marketing team in read-only mode using the Lake Formation tag-based access control (LF-TBAC) method. The sales team updates the AWS Glue Data Catalog resource policy to add relevant permissions for the marketing team. The marketing team creates a resource link to the shared customer table. The marketing team joins its data with the customer table using Amazon Redshift Spectrum.

47. A marketing company has an application that stores event data in an Amazon RDS database. The company is replicating this data to Amazon Redshift for reporting and business intelligence (BI) purposes. New event data is continuously generated and ingested into the RDS database throughout the day and captured by a change data capture (CDC) replication task in AWS Database Migration Service (AWS DMS). The company requires that the new data be replicated to Amazon Redshift in near-real time.

Which solution meets these requirements?

Use Amazon Kinesis Data Streams as the destination of the CDC replication task in AWS DMS. Use an AWS Glue streaming job to read changed records from Kinesis Data Streams and perform an upsert into the Redshift cluster.

Use Amazon S3 as the destination of the CDC replication task in AWS DMS. Use the COPY command to load data into the Redshift cluster.

Use Amazon DynamoDB as the destination of the CDC replication task in AWS DMS. Use the COPY command to load data into the Redshift cluster.

Use Amazon Kinesis Data Firehose as the destination of the CDC replication task in AWS DMS. Use an AWS Glue streaming job to read changed records from Kinesis Data Firehose and perform an upsert into the Redshift cluster.

48. A company is migrating from an on-premises Apache Hadoop cluster to an Amazon EMR cluster. The cluster runs only during business hours. Due to a company requirement to avoid intraday cluster failures, the EMR cluster must be highly available. When the cluster is terminated at the end of each business day, the data must persist.

Which configurations would enable the EMR cluster to meet these requirements? (Choose three.)

EMR File System (EMRFS) for storage

Hadoop Distributed File System (HDFS) for storage

AWS Glue Data Catalog as the metastore for Apache Hive

MySQL database on the master node as the metastore for Apache Hive

Multiple master nodes in a single Availability Zone

Multiple master nodes in multiple Availability Zones

49. A company is building a data lake and needs to ingest data from a relational database that has time-series data. The company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental data only from the source into Amazon S3.

What is the MOST cost-effective approach to meet these requirements?

Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.

Use AWS Glue to connect to the data source using JDBC Drivers. Store the last updated key in an Amazon DynamoDB table and ingest the data using the updated key as a filter.

Use AWS Glue to connect to the data source using JDBC Drivers and ingest the entire dataset. Use appropriate Apache Spark libraries to compare the dataset, and find the delta.

Use AWS Glue to connect to the data source using JDBC Drivers and ingest the full data. Use AWS DataSync to ensure the delta only is written into Amazon S3.

50. A company’s marketing team has asked for help in identifying a high performing long-term storage service for their data based on the following requirements:

The data size is approximately 32 TB uncompressed.

There is a low volume of single-row inserts each day.

There is a high volume of aggregation queries each day.

Multiple complex joins are performed.

The queries typically involve a small subset of the columns in a table.

Which storage service will provide the MOST performant solution?

Amazon Aurora MySQL

Amazon Redshift

Amazon Neptune

Amazon Elasticsearch

51. An Amazon Redshift database contains sensitive user data. Logging is necessary to meet compliance requirements. The logs must contain database authentication attempts, connections, and disconnections. The logs must also contain each query run against the database and record which database user ran each query.

Which steps will create the required logs?

Enable Amazon Redshift Enhanced VPC Routing. Enable VPC Flow Logs to monitor traffic.

Allow access to the Amazon Redshift database using AWS IAM only. Log access using AWS CloudTrail.

Enable audit logging for Amazon Redshift using the AWS Management Console or the AWS CLI.

Enable and download audit reports from AWS Artifact.

52. A large ecommerce company uses Amazon DynamoDB with provisioned read capacity and auto scaled write capacity to store its product catalog. The company uses Apache HiveQL statements on an Amazon EMR cluster to query the DynamoDB table. After the company announced a sale on all of its products, wait times for each query have increased. The data analyst has determined that the longer wait times are being caused by throttling when querying the table.

Which solution will solve this issue?

Increase the size of the EMR nodes that are provisioned.

Increase the number of EMR nodes that are in the cluster.

Increase the DynamoDB table's provisioned write throughput.

Increase the DynamoDB table's provisioned read throughput.

53. A data analytics specialist is building an automated ETL ingestion pipeline using AWS Glue to ingest compressed files that have been uploaded to an Amazon S3 bucket. The ingestion pipeline should support incremental data processing.

Which AWS Glue feature should the data analytics specialist use to meet this requirement?

Workflows

Triggers

Job bookmarks

Classifiers

54. A large ride-sharing company has thousands of drivers globally serving millions of unique customers every day. The company has decided to migrate an existing data mart to Amazon Redshift. The existing schema includes the following tables.

A trips fact table for information on completed rides.

A drivers dimension table for driver profiles.

A customers fact table holding customer profile information.

The company analyzes trip details by date and destination to examine profitability by region. The drivers data rarely changes. The customers data frequently changes.

What table design provides optimal query performance?

Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers and customers tables.

Use DISTSTYLE EVEN for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.

Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.

Use DISTSTYLE EVEN for the drivers table and sort by date. Use DISTSTYLE ALL for both fact tables.

55. A company is creating a data lake by using AWS Lake Formation. The data that will be stored in the data lake contains sensitive customer information and must be encrypted at rest using an AWS Key Management Service (AWS KMS) customer managed key to meet regulatory requirements.

How can the company store the data in the data lake to meet these requirements?

Store the data in an encrypted Amazon Elastic Block Store (Amazon EBS) volume. Register the Amazon EBS volume with Lake Formation.

Store the data in an Amazon S3 bucket by using server-side encryption with AWS KMS (SSE-KMS). Register the S3 location with Lake Formation.

Encrypt the data on the client side and store the encrypted data in an Amazon S3 bucket. Register the S3 location with Lake Formation.

Store the data in an Amazon S3 Glacier Flexible Retrieval vault bucket. Register the S3 Glacier Flexible Retrieval vault with Lake Formation.

56. A data analyst is using AWS Glue to organize, cleanse, validate, and format a 200 GB dataset. The data analyst triggered the job to run with the Standard worker type. After 3 hours, the AWS Glue job status is still RUNNING. Logs from the job run show no error codes. The data analyst wants to improve the job execution time without overprovisioning.

Which actions should the data analyst take?

Enable job bookmarks in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the executor-cores job parameter.

Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.

Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the spark.yarn.executor.memoryOverhead job parameter.

Enable job bookmarks in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the num-executors job parameter.

57. A manufacturing company has been collecting IoT sensor data from devices on its factory floor for a year and is storing the data in Amazon Redshift for daily analysis. A data analyst has determined that, at an expected ingestion rate of about 2 TB per day, the cluster will be undersized in less than 4 months. A long-term solution is needed. The data analyst has indicated that most queries only reference the most recent 13 months of data, yet there are also quarterly reports that need to query all the data generated from the past 7 years. The chief technology officer (CTO) is concerned about the costs, administrative effort, and performance of a long-term solution.

Which solution should the data analyst use to meet these requirements?

Create a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and delete those records from Amazon Redshift. Create an external table in Amazon Redshift to point to the S3 location. Use Amazon Redshift Spectrum to join to data that is older than 13 months.

Take a snapshot of the Amazon Redshift cluster. Restore the cluster to a new cluster using dense storage nodes with additional storage capacity.

Execute a CREATE TABLE AS SELECT (CTAS) statement to move records that are older than 13 months to quarterly partitioned data in Amazon Redshift Spectrum backed by Amazon S3.

Unload all the tables in Amazon Redshift to an Amazon S3 bucket using S3 Intelligent-Tiering. Use AWS Glue to crawl the S3 bucket location to create external tables in an AWS Glue Data Catalog. Create an Amazon EMR cluster using Auto Scaling for any daily analytics needs, and use Amazon Athena for the quarterly reports, with both using the same AWS Glue Data Catalog.
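
The storage figures in question 57 can be put in perspective with a rough calculation. The sketch below uses only the 2 TB per day ingestion rate, the 13-month query window, and the 7-year retention period from the question. It assumes raw ingest volume with no compression, an average month length, and no leap days, so the results are order-of-magnitude estimates rather than sizing guidance.

```python
INGEST_TB_PER_DAY = 2            # from the question
DAYS_PER_YEAR = 365              # assumption: leap days ignored
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # assumption: average month length


def retained_tb(days: float) -> float:
    """Raw volume accumulated over the given number of days."""
    return INGEST_TB_PER_DAY * days


hot_tb = retained_tb(13 * DAYS_PER_MONTH)   # data most queries touch
all_tb = retained_tb(7 * DAYS_PER_YEAR)     # data the quarterly reports touch
cold_tb = all_tb - hot_tb                   # data older than 13 months

print(f"Most recent 13 months: {hot_tb:.0f} TB")
print(f"Full 7 years:          {all_tb:.0f} TB")
print(f"Older than 13 months:  {cold_tb:.0f} TB ({cold_tb / all_tb:.0%} of the total)")
```

Under these assumptions, the data older than 13 months makes up the large majority of the 7-year total, which is why where that older data lives dominates the cost side of the question.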

58. A company needs to store objects containing log data in JSON format. The objects are generated by eight applications running in AWS. Six of the applications generate a total of 500 KiB of data per second, and two of the applications can generate up to 2 MiB of data per second.

A data engineer wants to implement a scalable solution to capture and store usage data in an Amazon S3 bucket. The usage data objects need to be reformatted, converted to .csv format, and then compressed before they are stored in Amazon S3. The company requires the solution to include the least custom code possible and has authorized the data engineer to request a service quota increase if needed.

Which solution meets these requirements?

Configure an Amazon Kinesis Data Firehose delivery stream for each application. Write AWS Lambda functions to read log data objects from the stream for each application. Have the function perform reformatting and .csv conversion. Enable compression on all the delivery streams.

Configure an Amazon Kinesis data stream with one shard per application. Write an AWS Lambda function to read usage data objects from the shards. Have the function perform .csv conversion, reformatting, and compression of the data. Have the function store the output in Amazon S3.

Configure an Amazon Kinesis data stream for each application. Write an AWS Lambda function to read usage data objects from the stream for each application. Have the function perform .csv conversion, reformatting, and compression of the data. Have the function store the output in Amazon S3.

Store usage data objects in an Amazon DynamoDB table. Configure a DynamoDB stream to copy the objects to an S3 bucket. Configure an AWS Lambda function to be triggered when objects are written to the S3 bucket. Have the function convert the objects into .csv format.
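
The throughput figures in question 58 are worth working through before comparing the options. The sketch below is a rough calculation, not a sizing tool. It assumes that each of the two high-volume applications can peak at 2 MiB per second (the question can also be read as 2 MiB per second combined), and it treats the write limit of a provisioned Kinesis Data Streams shard as 1 MiB per second.

```python
import math

KIB_PER_MIB = 1024
# Assumption: per-shard write limit of about 1 MiB per second (provisioned mode).
SHARD_WRITE_LIMIT_KIB = 1 * KIB_PER_MIB


def shards_needed(rate_kib_per_s: float) -> int:
    """Minimum number of shards whose combined write limit covers the rate."""
    return max(1, math.ceil(rate_kib_per_s / SHARD_WRITE_LIMIT_KIB))


small_apps_total_kib = 500          # six applications combined, from the question
large_app_kib = 2 * KIB_PER_MIB     # assumption: 2 MiB/s per high-volume application
large_app_count = 2

peak_total_kib = small_apps_total_kib + large_app_count * large_app_kib

print(f"Peak aggregate rate: {peak_total_kib} KiB/s "
      f"({peak_total_kib / KIB_PER_MIB:.2f} MiB/s)")
print(f"Shards for the aggregate rate: {shards_needed(peak_total_kib)}")
print(f"Shards for one 2 MiB/s application: {shards_needed(large_app_kib)}")
```

Under these assumptions, a single shard cannot absorb the peak rate of one high-volume application, which is a useful check when an option proposes one shard per application.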

59. A company uses an Amazon EMR cluster with 50 nodes to process operational data and make the data available for data analysts. These jobs run nightly, use Apache Hive with the Apache Tez framework as a processing model, and write results to Hadoop Distributed File System (HDFS). In the last few weeks, jobs are failing and are producing the following error message: "File could only be replicated to 0 nodes instead of 1".

A data analytics specialist checks the DataNode logs, the NameNode logs, and network connectivity for potential issues that could have prevented HDFS from replicating data. The data analytics specialist rules out these factors as causes for the issue.

Which solution will prevent the jobs from failing?

Monitor the HDFSUtilization metric. If the value crosses a user-defined threshold, add task nodes to the EMR cluster.

Monitor the HDFSUtilization metric. If the value crosses a user-defined threshold, add core nodes to the EMR cluster.

Monitor the MemoryAllocatedMB metric. If the value crosses a user-defined threshold, add task nodes to the EMR cluster.

Monitor the MemoryAllocatedMB metric. If the value crosses a user-defined threshold, add core nodes to the EMR cluster.

60. A company has multiple data workflows to ingest data from its operational databases into its data lake on Amazon S3. The workflows use AWS Glue and Amazon EMR for data processing and ETL. The company wants to enhance its architecture to provide automated orchestration and minimize manual intervention.

Which solution should the company use to manage the data workflows to meet these requirements?

AWS Glue workflows

AWS Step Functions

AWS Lambda

AWS Batch

61. A company has an application that ingests streaming data. The company needs to analyze this stream over a 5-minute timeframe to evaluate the stream for anomalies with Random Cut Forest (RCF) and summarize the current count of status codes. The source and summarized data should be persisted for future use.

Which approach would enable the desired outcome while keeping data persistence costs low?

Ingest the data stream with Amazon Kinesis Data Firehose with a delivery frequency of 1 minute or 1 MB in Amazon S3. Ensure Amazon S3 triggers an event to invoke an AWS Lambda consumer that evaluates the batch data, collects the number of status codes, and evaluates the data against a previously trained RCF model. Persist the source and results as a time series to Amazon DynamoDB.

Ingest the data stream with Amazon Kinesis Data Firehose with a delivery frequency of 5 minutes or 1 MB into Amazon S3. Have a Kinesis Data Analytics application evaluate the stream over a 1-minute window using the RCF function and summarize the count of status codes. Persist the results to Amazon S3 through a Kinesis Data Analytics output to an AWS Lambda integration.
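
The "summarize the current count of status codes over a 5-minute timeframe" requirement in question 61 is a tumbling-window aggregation. The sketch below illustrates that idea in plain Python only. It does not use any AWS service and does not implement Random Cut Forest. The function name, the event shape of (epoch seconds, status code), and the sample events are all hypothetical.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute timeframe from the question


def count_status_codes(events):
    """Group (epoch_seconds, status_code) events into 5-minute tumbling windows."""
    windows = defaultdict(Counter)
    for timestamp, status in events:
        # Align each event to the start of the window it falls in.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        windows[window_start][status] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical sample events: (seconds since epoch, HTTP status code).
sample_events = [(0, 200), (10, 200), (299, 500), (300, 200), (450, 404), (601, 200)]

summary = count_status_codes(sample_events)
for start, counts in summary.items():
    print(start, counts)
```

Each event belongs to exactly one window, so the per-window counts never overlap. A window shorter or longer than the required 5 minutes would change which events are summarized together.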

62. A company has several Amazon EC2 instances sitting behind an Application Load Balancer (ALB). The company wants its IT Infrastructure team to analyze the IP addresses coming into the company's ALB. The ALB is configured to store access logs in Amazon S3. The access logs create about 1 TB of data each day, and access to the data will be infrequent. The company needs a solution that is scalable, cost-effective, and has minimal maintenance requirements.

Which solution meets these requirements?

Copy the data into Amazon Redshift and query the data

Use Amazon EMR and Apache Hive to query the S3 data

Use Amazon Athena to query the S3 data

Use Amazon Redshift Spectrum to query the S3 data

63. A company developed a new elections reporting website that uses Amazon Kinesis Data Firehose to deliver full logs from AWS WAF to an Amazon S3 bucket. The company is now seeking a low-cost option to perform this infrequent data analysis with visualizations of logs in a way that requires minimal development effort.

Which solution meets these requirements?

Use an AWS Glue crawler to create and update a table in the Glue data catalog from the logs. Use Athena to perform ad-hoc analyses and use Amazon QuickSight to develop data visualizations.

Create a second Kinesis Data Firehose delivery stream to deliver the log files to Amazon Elasticsearch Service (Amazon ES). Use Amazon ES to perform text-based searches of the logs for ad-hoc analyses and use Kibana for data visualizations.

Create an AWS Lambda function to convert the logs into .csv format. Then add the function to the Kinesis Data Firehose transformation configuration. Use Amazon Redshift to perform ad-hoc analyses of the logs using SQL queries and use Amazon QuickSight to develop data visualizations.

Create an Amazon EMR cluster and use Amazon S3 as the data source. Create an Apache Spark job to perform ad-hoc analyses and use Amazon QuickSight to develop data visualizations.
