Wrong Spark configuration that cost us $3k/month
Spark version: 3.2.0
Here at Dynamic Yield, we run thousands of Spark applications every day. Over the last two years, we worked hard to upgrade our infrastructure to run those jobs on Kubernetes and we continuously improve our system.
One innocent-looking configuration that we added a few months ago was:
spark.eventLog.rolling.enabled: true
spark.eventLog.rolling.maxFileSize: 16m
That’s it. Two lines that ended up costing more than $3,000 per month.

Assuming you’re using the Spark history server to monitor your Spark applications, make sure that it's not going to cost you more than it should.
Spark History Configuration
The Spark history server is a great tool for monitoring your applications. By adding a few configuration lines to the spark-submit
command, application logs are saved to your preferred storage (AWS S3 in our case) and are available for exploration later on. The most basic configurations are:
spark.eventLog.enabled: true
spark.eventLog.dir: s3a://a-team-bucket/history-logs
These make sure the application’s event logs will be available for future exploration.
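For reference, a minimal spark-submit invocation with those two settings might look like this (the class and jar names are placeholders; the bucket is the one from the example above):
# the application class and jar below are illustrative placeholders
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://a-team-bucket/history-logs \
  --class com.example.MyApp \
  my-app.jar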
While reading the documentation, “Applying compaction on rolling event log files” looked promising, especially after suffering from a non-responsive UI while monitoring extremely slow applications in the past. Nothing there hints at the price of this new feature.
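For reference, compaction itself is opt-in on the history server side. A sketch of the relevant setting from that documentation section (the value is just an example; by default nothing is compacted):
spark.history.fs.eventLog.rolling.maxFilesToRetain: 10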
Implementation Details
The history server’s refresh interval is 10 seconds by default, and it can be tuned via spark.history.fs.update.interval. Now, let’s say we store the last 50k applications in an S3 bucket and each application is represented by one file. In that case, there are 50 ListBucket operations behind the scenes (each returning up to 1,000 objects) every 10 seconds:
.
├── spark-00079d419d924b4d900a0a27cd6a9ae0
├── spark-000af08856194e6e82046bc65237bc78
├── ...
├── ...
└── spark-z01051c66d93409583b17001c34fc21c
On the other hand, using rolling logs creates a folder per application, which looks like this:
.
├── eventlog_v2_spark-00079d419d924b4d900a0a27cd6a9ae0
│   ├── appstatus_spark-00079d419d924b4d900a0a27cd6a9ae0
│   └── events_1_spark-00079d419d924b4d900a0a27cd6a9ae0
├── eventlog_v2_spark-000af08856194e6e82046bc65237bc78
│   ├── appstatus_spark-000af08856194e6e82046bc65237bc78
│   └── events_1_spark-000af08856194e6e82046bc65237bc78
├── ...
├── ...
└── eventlog_v2_spark-z01051c66d93409583b17001c34fc21c
    ├── appstatus_spark-z01051c66d93409583b17001c34fc21c
    └── events_1_spark-z01051c66d93409583b17001c34fc21c
This time the history server performs the same 50 ListBucket operations to get all the folders, and then another 50,000 ListBucket operations to get each folder’s content! In other words, 50,050 ListBucket operations every 10 seconds.
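To put a rough price on that: assuming S3’s standard rate of about $0.005 per 1,000 LIST requests (pricing varies by region), 50,050 requests every 10 seconds is roughly 5,000 requests per second, or about 13 billion per month, which at that illustrative 50k-application scale would be on the order of $65,000 per month. Our real retention was smaller, which is how two innocent lines still added up to more than $3,000 a month.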
After debugging the source code, I found that the history server reads logs through a file-system abstraction, so they can live on a local file system or on an HDFS-compatible store such as S3. With that abstraction, tree walking is used (list all folders, then each folder’s files), even though a deep tree scan that lists everything in one pass is available.
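To make the difference concrete, in S3 terms the two strategies map to the delimiter parameter of the LIST API (bucket and prefix here are placeholders):
# shallow listing: returns only the top-level "folders" under the prefix
aws s3api list-objects-v2 --bucket a-team-bucket --prefix history-logs/ --delimiter '/'
# deep scan: one recursive listing returns every object under the prefix
aws s3api list-objects-v2 --bucket a-team-bucket --prefix history-logs/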
Conclusions
- Tune the update interval to fit your needs. If you only check your applications once a day, a 1-minute, 1-hour, or even daily refresh might be enough for you (see the sketch after this list).
- Don’t rush to add configurations when you don’t have a problem. Rolling logs might be useful for monitoring large/heavy applications, but if you don’t feel pain while using your history server, you probably don’t need them.
- You can mix the two. The same history server path supports a mix of folders (rolling logs) and files. If you go with the rolling logs configuration, add it only to your biggest applications, as shown in the sketch after this list.
- Don’t forget to monitor the price you pay for storage. While it is perceived as cheap compared to compute resources, you don’t want to waste your budget on redundant operations.
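To make the first and third points concrete, here is a sketch (the interval, file size, job name, and jar are placeholders): a relaxed refresh interval on the history server side, and rolling logs enabled only for a single heavy application at submit time:
# history server's spark-defaults.conf: refresh hourly instead of every 10s
spark.history.fs.update.interval 1h
# heavy job only: rolling logs; other jobs keep writing single-file logs
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://a-team-bucket/history-logs \
  --conf spark.eventLog.rolling.enabled=true \
  --conf spark.eventLog.rolling.maxFileSize=128m \
  --class com.example.HeavyDailyJob \
  heavy-daily-job.jar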

Appendix
If you’re interested in running the Spark history server locally and seeing what’s going on, you may find this guide useful.
Running the Spark history server in IntelliJ
1. Clone the Spark repository from GitHub.
2. Open IntelliJ.
3. Search for org.apache.spark.deploy.history.HistoryServer and run it (it will probably fail).
4. Edit Run Configurations and choose a module. I chose the spark-core_2.12 module, and we will extend it in a second.
Tip: If you try to run from the script (sbin/start-history-server.sh), you will see the run command printed in the first line of the log.
5. Go to Project Structure → Project Settings → Modules and select spark-core_2.12. Click on the Dependencies tab and add all jars from assembly/target/scala-2.12/jars. If you’re testing AWS / Hadoop jars, add them as well.
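Note: those jars only exist after Spark has been built; if the directory is empty, building the project first should produce them, e.g. with the Maven wrapper bundled in the repository:
./build/mvn -DskipTests clean package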
6. In the conf folder, create a spark-defaults.conf file with your custom configuration, for example:
spark.history.fs.logDirectory s3a://logs-bucket/history
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.endpoint http://localhost:9000
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.path.style.access true
spark.history.fs.update.interval 30s
If you’re using MinIO (with Docker) to simulate S3, you can add the credentials in the Run Configuration (step 4) via environment variables.
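If you haven’t set MinIO up yet, a minimal Docker invocation looks roughly like this (the credentials are placeholders; use the same values on the history server side):
# placeholder credentials, matching whatever the history server is configured with
docker run -p 9000:9000 \
  -e MINIO_ROOT_USER=minio \
  -e MINIO_ROOT_PASSWORD=minio123 \
  minio/minio server /data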
7. If you want to see debug logs, copy the log4j2.properties.template file from the conf directory to log4j2.properties and change the rootLogger.level severity.