Databricks Certified Associate-Developer for Apache Spark 3.5 Online Practice Questions

Latest Databricks Certified Associate-Developer for Apache Spark 3.5 Exam Practice Questions

The practice questions for the Databricks Certified Associate-Developer for Apache Spark 3.5 exam were last updated on 2025-10-31.

Question#1

An engineer has two DataFrames ― df1 (small) and df2 (large).
To optimize the join, the engineer uses a broadcast join:
from pyspark.sql.functions import broadcast
df_result = df2.join(broadcast(df1), on="id", how="inner")
What is the purpose of using broadcast() in this scenario?

A. It increases the partition size for df1 and df2.
B. It ensures that the join happens only when the id values are identical.
C. It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.
D. It filters the id values before performing the join.

Explanation:
A broadcast join is a type of join where the smaller DataFrame is replicated (broadcast) to all worker nodes in the cluster. This avoids shuffling the large DataFrame across the network.
Benefits:
Eliminates shuffle for the smaller dataset.
Greatly improves performance when one side of the join is small enough to fit in memory.
Correct usage example:
df_result = df2.join(broadcast(df1), "id")
This is a map-side join, where each executor joins its local partition of the large dataset with the broadcasted copy of the small one.
Why the other options are incorrect:
A: Broadcasting does not change partition sizes.
B: Joins always match on key equality; this is not specific to broadcast joins.
D: Broadcasting does not filter; it distributes data for faster joins.
Reference: Databricks Exam Guide (June 2025): Section “Developing Apache Spark DataFrame/DataSet API Applications” ― broadcast joins and partitioning strategies.
PySpark SQL Functions ― broadcast() method documentation.
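
For illustration, a minimal runnable sketch of this pattern; the DataFrames, sizes, and column values below are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical small lookup table and large fact table
df1 = spark.createDataFrame([(1, "US"), (2, "DE")], ["id", "country"])   # small
df2 = spark.range(1_000_000).select((col("id") % 2 + 1).alias("id"))     # large

# The broadcast hint ships df1 to every executor, so df2 is never shuffled
df_result = df2.join(broadcast(df1), on="id", how="inner")
df_result.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin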

Question#2

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.
Which action should the engineer take to resolve this issue?

A. Optimize the data processing logic by repartitioning the DataFrame.
B. Modify the Spark configuration to disable garbage collection
C. Increase the memory allocated to the Spark Driver.
D. Cache large DataFrames to persist them in memory.

Explanation:
The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much time in garbage collection with little memory recovery. This suggests that the driver or executor is under-provisioned in memory.
The most effective remedy is to increase the driver memory using:
--driver-memory 4g
This is confirmed in Spark's official troubleshooting documentation:
“If you see a lot of GC overhead limit exceeded errors in the driver logs, it’s a sign that the driver is running out of memory.”
― Spark Tuning Guide
Why the other options are incorrect:
A: May help in some cases, but it does not directly address the driver memory shortage.
B: Not a valid action; garbage collection cannot be disabled.
D: Caching large DataFrames increases memory usage, worsening the problem.
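
A minimal sketch of applying the fix, assuming the application is launched via spark-submit or builds its own SparkSession in local/client mode; the 4g value is illustrative:

# Via spark-submit (driver memory must be set before the driver JVM starts):
#   spark-submit --driver-memory 4g my_app.py
#
# Or via the session builder, which only takes effect when the driver JVM is
# launched by this process (e.g. local mode):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-overhead-fix-sketch")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)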

Question#3

A data scientist has been investigating user profile data to build features for their model. After some exploratory data analysis, the data scientist identified that some records in the user profiles contain NULL values in too many fields to be useful.
The schema of the user profile table looks like this:
user_id STRING,
username STRING,
date_of_birth DATE,
country STRING,
created_at TIMESTAMP
The data scientist decided that if any record contains a NULL value in any field, they want to remove that record from the output before further processing.
Which block of Spark code can be used to achieve these requirements?

A. filtered_users = raw_users.na.drop("any")
B. filtered_users = raw_users.na.drop("all")
C. filtered_users = raw_users.dropna(how="any")
D. filtered_users = raw_users.dropna(how="all")

Explanation:
In Spark’s DataFrame API, the dropna() (or equivalently, DataFrameNaFunctions.drop()) method removes rows containing null values.
Behavior:
how="any" → drops rows where any column has a null value.
how="all" → drops rows where all columns are null.
Since the data scientist wants to drop records with any null field, the correct parameter is how="any".
Correct syntax:
filtered_users = raw_users.dropna(how="any")
This will remove all records that have at least one null value in any column.
Why the other options are incorrect:
A: raw_users.na.drop("any") is also valid PySpark (na.drop() accepts how as its first argument and behaves the same as option C), but the explicit dropna(how="any") in option C is the clearer, expected answer.
B/D: how="all" only removes rows where all values are null ― too strict for this use case.
Reference: PySpark DataFrame API ― DataFrameNaFunctions.drop() and DataFrame.dropna().
Databricks Exam Guide (June 2025): Section “Developing Apache Spark DataFrame/DataSet API Applications” ― covers handling missing data and DataFrame cleaning operations.
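
As an illustration, a small self-contained sketch of the behavior; the sample rows and the reduced schema are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropna-sketch").getOrCreate()

raw_users = spark.createDataFrame(
    [
        ("u1", "alice", "US"),
        ("u2", None,    "DE"),    # one NULL field -> removed by how="any"
        (None, None,    None),    # all NULL -> removed by either setting
    ],
    ["user_id", "username", "country"],
)

filtered_users = raw_users.dropna(how="any")
filtered_users.show()  # only the fully populated row ("u1", "alice", "US") remains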

Question#4

Which command overwrites an existing JSON file when writing a DataFrame?

A. df.write.mode("overwrite").json("path/to/file")
B. df.write.overwrite.json("path/to/file")
C. df.write.json("path/to/file", overwrite=True)
D. df.write.format("json").save("path/to/file", mode="overwrite")

Explanation:
The correct way to overwrite an existing file using the DataFrameWriter is:
df.write.mode("overwrite").json("path/to/file")
Option D is also technically valid, but Option A is the most concise and idiomatic PySpark syntax.
Reference: PySpark DataFrameWriter API
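
A short sketch showing both valid forms side by side; the output path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-overwrite-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Option A: concise, idiomatic form
df.write.mode("overwrite").json("/tmp/example_output")

# Option D: equivalent long form
df.write.format("json").save("/tmp/example_output", mode="overwrite")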

Question#5

A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

A. Execute their pyspark shell with the option --remote "https://localhost"
B. Execute their pyspark shell with the option --remote "sc://localhost"
C. Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell
D. Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code
E. Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code

Explanation:
Spark Connect enables decoupling of the client and Spark driver processes, allowing remote access.
Spark supports configuring the remote Spark Connect server in multiple ways:
From Databricks and Spark documentation:
Option B (--remote "sc://localhost") is a valid command-line argument for the pyspark shell to connect using Spark Connect.
Option C (setting SPARK_REMOTE environment variable) is also a supported method to configure the remote endpoint.
Option A is incorrect because Spark Connect uses the sc:// protocol, not https://.
Option D requires modifying the code, which the question explicitly avoids.
Option E configures the port on the server side but doesn’t start a client connection.
Final Answers: B and C
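
For context, a rough sketch of how the two accepted options look in practice, assuming a local Spark Connect server is already running on the default port 15002:

# Option B: pass the remote URL on the command line
#   pyspark --remote "sc://localhost"
#
# Option C: export the environment variable, then start the shell unchanged
#   export SPARK_REMOTE="sc://localhost"
#   pyspark
#
# In either case the existing application code runs as-is; the `spark` object
# provided by the shell is a Spark Connect client session:
df = spark.range(3)
df.show()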

Exam Code: Databricks Certified Associate-Developer for Apache Spark 3.5
Q&As: 135
Updated: 2025-10-31