Overview of GCP Dataproc Serverless Spark
After spending some time with GCP's Dataproc Serverless Spark offering, I wanted to put my thoughts down. I was invited to try the Preview in Nov 2021 (as of Jan 20, 2022, it is GA), and I wrote a small program to test how the whole thing works. Here are my findings:
- It supports Spark 3.2 and above (with Java 11). Initially only Scala with a compiled JAR was supported, but Python, R, and SQL modes are now supported as well
- Here is an example batch program I ran:
gcloud beta dataproc batches submit --project plmgo-316515 --region us-central1 spark --batch batch-83f4 --class MainApp --jars gs://serverless-32415/serverless-spark-maven-1.0-SNAPSHOT.jar --subnet default
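Since the GA release also supports Python mode, the equivalent submission for a PySpark script would look roughly like this (a sketch only; the bucket path, project, and batch name below are hypothetical, not from my actual run):

```shell
# Submit a PySpark script as a serverless batch job
# (your-bucket, your-project, and the batch name are placeholders)
gcloud dataproc batches submit pyspark gs://your-bucket/main.py \
  --project=your-project \
  --region=us-central1 \
  --batch=batch-pyspark-demo \
  --subnet=default
```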
I really liked the UI/CLI, as it is simple and uncluttered
Here is the link to the project on GitHub: https://github.com/sandeep540/serverless-spark-maven
Please feel free to suggest any improvements or changes
It’s a very basic 101 Spark program, nothing too heavy. Here is how it executed:
It ran for about 100 seconds (which is expected, since startup time is around 60 seconds) and started with 2 executors. If the program had some serious heavy lifting to do, I am sure it would scale out to many more executors
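For context, the `MainApp` class referenced in the submit command could be as simple as the following sketch (my illustration of a basic 101 Spark program, not the actual code from the repo):

```scala
import org.apache.spark.sql.SparkSession

// A minimal Spark 101 job: count words in a small in-memory dataset.
object MainApp {
  def main(args: Array[String]): Unit = {
    // No master URL or executor config here: Dataproc Serverless
    // injects those at submission time.
    val spark = SparkSession.builder()
      .appName("serverless-spark-101")
      .getOrCreate()
    import spark.implicits._

    val counts = Seq("spark is fast", "spark is serverless")
      .toDF("line")
      .selectExpr("explode(split(line, ' ')) as word")
      .groupBy("word")
      .count()

    counts.show()
    spark.stop()
  }
}
```

The point is that the job itself carries no cluster sizing; provisioning and scaling decisions are left entirely to the serverless runtime.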
I really like the serverless option for Spark (AWS EMR and Databricks also have some kind of serverless option). It removes the headache of provisioning and fine-tuning, and with the metrics collected, I am sure many simple jobs can be shifted to a serverless option
Complex/heavy jobs still need careful planning and fine-tuning, but there are many projects that can happily take up this PaaS offering
Personally, I like this offering very much, and with the pricing GCP is providing, this is something every cloud operator will offer in the future
I will try to run some heavy jobs with large data and post my findings later
Thanks.