Professional Data Engineer Online Practice Questions


Latest Professional Data Engineer Exam Practice Questions

The practice questions for the Professional Data Engineer exam were last updated on 2025-12-14.

Question#1

You are developing a new deep learning model that predicts a customer's likelihood to buy on your e-commerce site. After running an evaluation of the model against both the original training data and new test data, you find that your model is overfitting the data. You want to improve the accuracy of the model when predicting new data.
What should you do?

A. Increase the size of the training dataset, and increase the number of input features.
B. Increase the size of the training dataset, and decrease the number of input features.
C. Reduce the size of the training dataset, and increase the number of input features.
D. Reduce the size of the training dataset, and decrease the number of input features.

Explanation:
https://machinelearningmastery.com/impact-of-dataset-size-on-deep-learning-model-skill-and-performance-estimates/
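
To make the recommended fix concrete, the following minimal sketch uses scikit-learn on synthetic data (the dataset, feature count, and classifier are illustrative placeholders, not part of the exam scenario). It shows the "decrease the number of input features" half of option B: keeping only the most informative features gives the model less opportunity to memorize noise in the training set.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for purchase-likelihood data: many input features, few informative.
    X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Baseline: train on all 100 features.
    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("All features:", baseline.score(X_test, y_test))

    # Keep only the 10 most informative features before training.
    selector = SelectKBest(score_func=f_classif, k=10)
    X_train_small = selector.fit_transform(X_train, y_train)
    X_test_small = selector.transform(X_test)
    reduced = LogisticRegression(max_iter=1000).fit(X_train_small, y_train)
    print("Top-10 features:", reduced.score(X_test_small, y_test))

Increasing the amount of training data (the other half of option B) works in the same direction: more examples per feature make it harder for the model to fit spurious patterns.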

Question#2

You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information), into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys.
How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?

A. Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
B. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
C. Scan every table in BigQuery, and mask the data that it finds to contain PII.
D. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
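
A minimal sketch of the pseudonymization approach, assuming the google-cloud-dlp Python client and placeholder project/KMS values: a deterministic cryptographic transformation always maps the same input to the same token, so masked names and emails still work as join keys (format-preserving encryption, CryptoReplaceFfxFpeConfig, is the variant used when values are drawn from a fixed alphabet). This is illustrative, not a production configuration.

    from google.cloud import dlp_v2

    # Placeholders - supply your own project, Cloud KMS key, and KMS-wrapped data key.
    PROJECT_ID = "my-project"
    KMS_KEY_NAME = "projects/my-project/locations/global/keyRings/ring/cryptoKeys/key"
    WRAPPED_KEY = b"<kms-wrapped-key-bytes>"

    dlp = dlp_v2.DlpServiceClient()

    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    # Deterministic encryption: identical inputs yield identical tokens,
                    # which preserves referential integrity across tables.
                    "primitive_transformation": {
                        "crypto_deterministic_config": {
                            "crypto_key": {
                                "kms_wrapped": {
                                    "wrapped_key": WRAPPED_KEY,
                                    "crypto_key_name": KMS_KEY_NAME,
                                }
                            },
                            "surrogate_info_type": {"name": "EMAIL_TOKEN"},
                        }
                    }
                }
            ]
        }
    }

    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{PROJECT_ID}",
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
            "deidentify_config": deidentify_config,
            "item": {"value": "Contact jane.doe@example.com about the order."},
        }
    )
    print(response.item.value)  # email replaced by a stable surrogate token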

Question#3

Your team runs a complex analytical query daily that processes terabytes of data. Recently, after running for 20 minutes, the query fails with a "Resources exceeded" error. You need to resolve this issue.
What should you do?

A. Increase your project's BigQuery API request quota.
B. Increase the maximum table size limit.
C. Analyze the SQL syntax for errors.
D. Move from BigQuery on-demand to slot reservations.

Explanation:
The error message "Resources exceeded" in BigQuery indicates that the query's execution plan is too complex or requires more computational resources (slots) than are available to it in the on-demand, fair-share pool.
Option D is the correct answer. BigQuery's on-demand pricing model uses a massive, shared pool of processing units called slots. While this pool is large, a single query cannot monopolize it, and there are limits to prevent runaway jobs. For consistently complex, high-resource queries, the solution is to switch to capacity-based pricing by purchasing slot reservations (e.g., using BigQuery editions). This provides your project with a dedicated, guaranteed amount of processing capacity, ensuring your complex queries have the resources they need to complete successfully.
Option A is incorrect because API request quotas relate to the number of API calls (e.g., how many jobs you can submit per minute), not the computational resources allocated to a single running query.
Option B is incorrect because table size limits are not related to query execution resources.
Option C is incorrect because while a syntax error would cause a query to fail, it would do so immediately with a syntax error message, not after 20 minutes with a "Resources exceeded" error. While optimizing the query is a good practice, the most direct way to solve a resource limit issue is to provide more resources.
Reference (Google Cloud Documentation Concepts): The Google Cloud documentation on "BigQuery pricing" explains the two main models: on-demand pricing and capacity-based pricing (editions). The "Resources exceeded" error is a known limitation of the on-demand model for extremely demanding queries. The documentation on "Introduction to slots" and "Reservations" explicitly presents purchasing dedicated slots as the solution for gaining more predictable and higher query performance for demanding workloads.
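As a sketch of how a reservation might be created programmatically, assuming the google-cloud-bigquery-reservation client library (the reservation name, slot count, and edition below are illustrative placeholders; the same can be done in the console or with the bq tool):

    from google.cloud import bigquery_reservation_v1 as reservation_api

    PROJECT_ID = "my-project"   # placeholder
    LOCATION = "US"             # must match the dataset's location

    client = reservation_api.ReservationServiceClient()
    parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"

    # Create a reservation with dedicated slot capacity (capacity-based pricing).
    reservation = client.create_reservation(
        parent=parent,
        reservation_id="daily-analytics",
        reservation=reservation_api.Reservation(
            slot_capacity=500,  # illustrative baseline; size to the workload
            edition=reservation_api.Edition.ENTERPRISE,
        ),
    )

    # Route the project's query jobs to the reservation so the daily query uses it.
    client.create_assignment(
        parent=reservation.name,
        assignment=reservation_api.Assignment(
            assignee=f"projects/{PROJECT_ID}",
            job_type=reservation_api.Assignment.JobType.QUERY,
        ),
    )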

Question#4

You’re training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to train a fully connected neural net, and you’ve discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you’d like to engineer a feature that incorporates this physical dependency.
What should you do?

A. Provide latitude and longitude as input vectors to your neural net.
B. Create a numeric column from a feature cross of latitude and longitude.
C. Create a feature cross of latitude and longitude, bucketize it at the minute level, and use L1 regularization during optimization.
D. Create a feature cross of latitude and longitude, bucketize it at the minute level, and use L2 regularization during optimization.

Explanation:
Reference: https://cloud.google.com/bigquery/docs/gis-data
To engineer a feature that incorporates the physical dependency of location on housing prices for a neural network, creating a numeric column from a feature cross of latitude and longitude is the most effective approach. Here’s why option B is the best choice:
Feature Crosses:
Feature crosses combine multiple features into a single feature that captures the interaction between them. For location data, a feature cross of latitude and longitude can capture spatial dependencies that affect housing prices.
This approach allows the neural network to learn complex patterns related to geographic location more effectively than using raw latitude and longitude values.
Numerical Representation:
Converting the feature cross into a numeric column simplifies the input for the neural network and can improve the model's ability to learn from the data.
This method ensures that the model can leverage the combined information from both latitude and longitude in a meaningful way.
Model Training:
Representing the feature cross as a compact numeric column keeps the model size manageable and helps it generalize to unseen data rather than overfit.
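
As an illustration of how a latitude × longitude feature cross can be wired into a model, here is a minimal sketch using tf.keras preprocessing layers (the bin boundaries, bin counts, and layer sizes are illustrative assumptions, not values from the exam scenario): each coordinate is bucketized and the buckets are crossed, so the network learns a weight per grid cell rather than a simple linear trend in latitude or longitude.

    import tensorflow as tf

    # Hypothetical bin boundaries roughly spanning the region of interest.
    lat_bins = [32.0 + 0.5 * i for i in range(21)]    # 32.0 .. 42.0
    lon_bins = [-124.0 + 0.5 * i for i in range(21)]  # -124.0 .. -114.0

    lat_in = tf.keras.Input(shape=(1,), name="latitude")
    lon_in = tf.keras.Input(shape=(1,), name="longitude")

    # Bucketize each coordinate, then cross the buckets into one categorical feature.
    lat_bucket = tf.keras.layers.Discretization(bin_boundaries=lat_bins)(lat_in)
    lon_bucket = tf.keras.layers.Discretization(bin_boundaries=lon_bins)(lon_in)
    crossed = tf.keras.layers.HashedCrossing(num_bins=2048)((lat_bucket, lon_bucket))
    cross_onehot = tf.keras.layers.CategoryEncoding(num_tokens=2048, output_mode="one_hot")(crossed)

    hidden = tf.keras.layers.Dense(32, activation="relu")(cross_onehot)
    price = tf.keras.layers.Dense(1)(hidden)

    model = tf.keras.Model(inputs=[lat_in, lon_in], outputs=price)
    model.compile(optimizer="adam", loss="mse")
    model.summary()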

Question#5

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples.
Which two characteristics support this method? (Choose two.)

A. There are very few occurrences of mutations relative to normal samples.
B. There are roughly equal occurrences of both normal and mutated samples in the database.
C. You expect future mutations to have different features from the mutated samples in the database.
D. You expect future mutations to have similar features to the mutated samples in the database.
E. You already have labels for which samples are mutated and which are normal in the database.

Explanation:
Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. https://en.wikipedia.org/wiki/Anomaly_detection
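
For illustration, a minimal unsupervised anomaly-detection sketch on synthetic data (the feature values, class ratio, and IsolationForest settings are assumptions for the example, not from the question): the model is fit without labels, and the rare samples that fit the bulk of the data least are flagged as candidate mutations.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    normal = rng.normal(loc=0.0, scale=1.0, size=(980, 4))   # majority: normal tissue features
    mutated = rng.normal(loc=4.0, scale=1.0, size=(20, 4))   # rare: mutated samples
    X = np.vstack([normal, mutated])                         # no labels are used for fitting

    # contamination encodes the assumption that mutations are rare (characteristic A)
    model = IsolationForest(contamination=0.02, random_state=0)
    preds = model.fit_predict(X)   # -1 = anomaly (candidate mutation), +1 = normal
    print(int((preds == -1).sum()), "samples flagged as anomalous")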

Exam Code: Professional Data Engineer | Q&As: 384 | Updated: 2025-12-14
