Overview of GCP Dataproc Serverless Spark

Sandeep Kongathi
Jan 22, 2022

After spending some time with the GCP Dataproc Serverless Spark offering, I wanted to put down some thoughts. I was invited to try the preview in Nov 2021 (as of Jan 20, 2022, it is GA), and I wrote a small program to test how the whole thing works. Here are my findings:

  • It supports Spark 3.2 and above (with Java 11). Initially only Scala with a compiled jar was supported, but Python, R, and SQL modes are now supported as well
  • Here is an example batch program I ran
gcloud beta dataproc batches submit --project plmgo-316515 --region us-central1 spark --batch batch-83f4 --class MainApp --jars gs://serverless-32415/serverless-spark-maven-1.0-SNAPSHOT.jar --subnet default
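Since Python mode is now supported as well, a PySpark script can be submitted in much the same way. A minimal sketch, assuming a hypothetical script at gs://my-bucket/wordcount.py and a placeholder project ID:

```shell
# Submit a PySpark batch; the bucket, script, and project here are
# placeholders, not from my actual run
gcloud dataproc batches submit pyspark gs://my-bucket/wordcount.py \
    --project=my-project \
    --region=us-central1 \
    --subnet=default
```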

I really liked the UI and CLI, as both are simple and not complex

Here is the link to the project on GitHub: https://github.com/sandeep540/serverless-spark-maven

Please feel free to suggest any improvements or changes

It’s a very basic Spark 101 program, nothing heavy, and here is how it executed
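For a sense of scale, a "101" program here means something on the order of the following sketch. This is my guess at what a MainApp entry point might look like, not the actual code from the repo:

```scala
// A minimal Spark batch job; Dataproc Serverless supplies the master
// and cluster configuration, so only an app name is set here
import org.apache.spark.sql.SparkSession

object MainApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serverless-spark-101")
      .getOrCreate()

    // A trivial workload: count the even numbers in a small range
    val count = spark.range(0, 1000).filter("id % 2 = 0").count()
    println(s"Even numbers: $count")

    spark.stop()
  }
}
```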

It ran for about 100 seconds (which is expected, as the startup time is around 60 seconds) and started with 2 executors. If the program had some serious heavy lifting to do, I am sure it would scale to many more executors
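The scaling behaviour can also be bounded explicitly at submission time through Spark properties. A sketch based on my earlier command, where the property values are illustrative assumptions rather than settings from my run:

```shell
# Set the initial executor count and cap dynamic allocation for a
# serverless batch; the property values here are illustrative
gcloud beta dataproc batches submit spark \
    --project=plmgo-316515 \
    --region=us-central1 \
    --class=MainApp \
    --jars=gs://serverless-32415/serverless-spark-maven-1.0-SNAPSHOT.jar \
    --subnet=default \
    --properties=spark.executor.instances=2,spark.dynamicAllocation.maxExecutors=100
```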

I really like the serverless option for Spark (AWS EMR and Databricks have similar serverless options, too). It removes the headache of provisioning and fine-tuning, and with the metrics collected, I am sure many simple jobs can be shifted to the serverless option

Complex/heavy jobs still need careful planning and fine-tuning, but there are many projects that can happily take up this PaaS offering

Personally, I like this offering very much, and with the pricing GCP is providing, this is something every cloud operator will offer in the future

I will try to run some heavy jobs with large data and post my findings later

Thanks.


Sandeep Kongathi

Developer, Data Engineer, Cloud-Native, K8S, Docker, GCP, Spark, Flink