Wrong Spark configuration that cost us $3k/month

Itay Bittan · Published in ITNEXT · Jun 17, 2022 · 4 min read


Spark version: 3.2.0

Here at Dynamic Yield, we run thousands of Spark applications every day. Over the last two years, we worked hard to upgrade our infrastructure to run those jobs on Kubernetes and we continuously improve our system.

One of the innocent configurations that we added a few months ago was:

spark.eventLog.rolling.enabled: true
spark.eventLog.rolling.maxFileSize: 16m

That’s it. Two lines, more than $3,000 per month:

AWS Cost Explorer before and after the rolling flag (image by author)

Assuming you’re using the Spark history server to monitor your Spark applications, make sure that it's not going to cost you more than it should.

Spark History Configuration

The Spark history server is a very useful tool for monitoring your applications. By adding a few configuration lines to the spark-submit command, application logs are saved to your preferred storage (AWS S3 in our case) and remain available for exploration later on. The most basic configurations are:

spark.eventLog.enabled: true
spark.eventLog.dir: s3a://a-team-bucket/history-logs

Those make sure that the application’s logs will be available for future exploration.
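
For reference, here is a minimal sketch of how these properties can be passed directly on the command line; the class name and jar are placeholders, and the bucket is the same illustrative one as above:

spark-submit \
  --class com.example.MyJob \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://a-team-bucket/history-logs \
  my-job.jar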

While reading the documentation, we found Applying compaction on rolling event log files promising, especially after suffering from a non-responsive UI while monitoring extremely slow applications in the past. Nothing there hints at the price of this new feature.

Implementation Details

The history server’s refresh interval is 10 seconds by default, and it can be tuned via spark.history.fs.update.interval. Now, let’s say we store the last 50k applications in an S3 bucket and each application is represented by one file. In that case, there are 50 ListBucket operations behind the scenes (each returning up to 1,000 objects) every 10 seconds:

.
├── spark-00079d419d924b4d900a0a27cd6a9ae0
├── spark-000af08856194e6e82046bc65237bc78
├── ...
├── ...
└── spark-z01051c66d93409583b17001c34fc21c

On the other hand, using rolling logs creates a folder per application, which looks like this:

.
├── eventlog_v2_spark-00079d419d924b4d900a0a27cd6a9ae0
│   ├── appstatus_spark-00079d419d924b4d900a0a27cd6a9ae0
│   └── events_1_spark-00079d419d924b4d900a0a27cd6a9ae0
├── eventlog_v2_spark-000af08856194e6e82046bc65237bc78
│   ├── appstatus_spark-000af08856194e6e82046bc65237bc78
│   └── events_1_spark-000af08856194e6e82046bc65237bc78
├── ...
├── ...
└── eventlog_v2_spark-z01051c66d93409583b17001c34fc21c
    ├── appstatus_spark-z01051c66d93409583b17001c34fc21c
    └── events_1_spark-z01051c66d93409583b17001c34fc21c

This time the history server performs the same 50 ListBucket operations to get all the folders, and then another 50,000 ListBucket operations to get each folder's contents! In other words, 50,050 ListBucket operations every 10 seconds.
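
To get a feeling for how such request counts translate into an S3 bill, here is a rough Python sketch. The application count and update interval reuse the hypothetical numbers above, and the per-request price is an assumption taken from public S3 LIST pricing, so the output is illustrative only and not a reconstruction of our actual bill:

# Back-of-the-envelope estimate of the monthly S3 LIST bill caused by history-server polling.
# All figures are illustrative assumptions, not our production numbers.
apps = 50_000                  # applications kept in the bucket
update_interval_s = 10         # spark.history.fs.update.interval
price_per_1000_lists = 0.005   # USD, check current S3 request pricing for your region

flat_per_refresh = apps / 1_000                 # one paginated ListBucket call per 1,000 objects
rolling_per_refresh = flat_per_refresh + apps   # plus one ListBucket call per application folder

seconds_per_month = 30 * 24 * 3600
for layout, per_refresh in [("flat", flat_per_refresh), ("rolling", rolling_per_refresh)]:
    requests = per_refresh * seconds_per_month / update_interval_s
    print(f"{layout}: ~{requests:,.0f} LIST requests/month, "
          f"~${requests / 1_000 * price_per_1000_lists:,.0f}")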

After debugging the source code, I found that the history server works against a file-system abstraction, since logs can be stored on a local file system or on an HDFS-compatible store such as S3. It walks the tree (listing all folders and then each folder's files), even though a deep, recursive scan is available.
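
The difference is easy to reproduce with a small boto3 sketch (the bucket and prefix are made up). The first function does one recursive, paginated listing over the whole prefix; the second mimics the tree walk by listing the top-level folders and then listing each folder separately:

import boto3

s3 = boto3.client("s3")
BUCKET = "a-team-bucket"     # illustrative names, not a real bucket
PREFIX = "history-logs/"

def deep_scan():
    """Single recursive listing: roughly one LIST call per 1,000 keys."""
    paginator = s3.get_paginator("list_objects_v2")
    return sum(1 for _ in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX))

def tree_walk():
    """List top-level 'folders' first, then issue a separate listing per folder."""
    paginator = s3.get_paginator("list_objects_v2")
    calls, folders = 0, []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        calls += 1
        folders += [cp["Prefix"] for cp in page.get("CommonPrefixes", [])]
    for folder in folders:
        calls += sum(1 for _ in paginator.paginate(Bucket=BUCKET, Prefix=folder))
    return calls

print("deep scan LIST calls:", deep_scan())
print("tree walk LIST calls:", tree_walk())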

Conclusions

  • Tune the update interval to fit your needs. If you only check your applications once a day, a 1-minute, 1-hour, or even daily update interval might be enough (see the snippet after this list).
  • Don’t rush to add configurations when you don’t have a problem. Rolling logs might be useful for monitoring large/heavy applications, but if you don’t feel pain while using your history server, you probably don’t need them.
  • You can mix the two. The same history server path supports a mix of folders (rolling logs) and plain files, so if you go with the rolling-log configuration, add it only to your biggest applications.
  • Don’t forget to monitor the price you pay for storage. While it is perceived as cheap compared to compute resources, you don’t want to waste your budget on redundant operations.
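
As a concrete illustration of the first and third points, a minimal sketch (the 1h interval is an arbitrary example): relax the history server’s update interval, and pass the rolling flags only to the spark-submit of your heaviest applications:

# history server side (spark-defaults.conf)
spark.history.fs.update.interval    1h

# heavy applications only (via spark-submit --conf); smaller jobs keep the single-file layout
spark.eventLog.rolling.enabled      true
spark.eventLog.rolling.maxFileSize  16m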
Spark History server (image by author)

Appendix

If you’re interested in running the Spark history server locally and seeing what’s going on, you may find this guide useful.

Running the Spark history server in IntelliJ

  1. Clone the Spark repository from GitHub.
  2. Open IntelliJ.
  3. Search for org.apache.spark.deploy.history.HistoryServer and run it (it will probably fail).
  4. Edit Run Configurations:

Choose a module. I chose the spark-core_2.12 module; we will extend it in a second.

Tip: If you run the history server from the script (sbin/start-history-server.sh), you will see the run command printed in the first line of the log.

5. Go to Project Structure → Project Settings → Modules and select spark-core_2.12. Click on the Dependencies tab and add all the jars from assembly/target/scala-2.12/jars. If you’re testing with AWS/Hadoop jars, add them as well.

6. In the conf folder, add a spark-defaults.conf file with your custom configuration, for example:

spark.history.fs.logDirectory   s3a://logs-bucket/history
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.endpoint http://localhost:9000
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.path.style.access true
spark.history.fs.update.interval 30s

If you’re using MinIO (with Docker) to simulate S3, you can add the credentials to the Run Configuration (step 4) via environment variables.
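
For example, a minimal MinIO container might look like this (the credentials are made-up placeholders and have to match whatever you configure on the S3A side):

docker run -p 9000:9000 \
  -e MINIO_ROOT_USER=minio-access-key \
  -e MINIO_ROOT_PASSWORD=minio-secret-key \
  minio/minio server /data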

7. If you want to see debug logs, copy the log4j2.properties.template file from the conf directory to log4j2.properties and change the rootLogger.level severity.
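
The relevant line in log4j2.properties then looks roughly like this:

rootLogger.level = debug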
