Even More Useful Spark Tips

Even more tips and things I've come across while using pyspark. Enjoy!

Avoid Spark accidentally optimizing coalesce

Sometimes with the following code, Spark will push the coalesce far up the optimized plan and end up running the whole stage on a single partition, which can be very inefficient. It's best to avoid calling coalesce(1) directly after large transformations.

# < transformations above >

df.coalesce(1).write.parquet("path/to/save")

You can avoid this by triggering an action before the write:

# < transformations above >

# Computation is triggered
df.cache().count()
# Done, now coalesce
df.coalesce(1).write.parquet("path/to/save")

You can also call repartition(1) instead, which adds a shuffle but keeps the upstream computation parallel rather than collapsing it onto a single partition.
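For illustration, here's a minimal sketch of that alternative, assuming the same DataFrame and output path as above:

# < transformations above >

# repartition(1) adds a shuffle boundary, so the transformations
# above still run in parallel before the single-file write
df.repartition(1).write.parquet("path/to/save")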

Sorting large writes

When writing out a large amount of data that gets read frequently, consider sorting your data on columns with commonly repeated values when using parquet. When Spark compresses parquet on write, having neighboring rows that share values drastically helps with compression. As a real-world example, I've been able to shrink a 50GB write to around 2GB.

Look into sortWithinPartitions to get a boost without the expensive shuffle of a full sort.

As always, play around with which columns and how many.
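Here's a rough sketch of both approaches; the column names are made up for illustration:

# Global sort: adds an expensive shuffle, but groups repeated values
# together across the whole dataset for maximum compression
df.sort("country", "city").write.parquet("path/to/save")

# Cheaper: sort rows only within each existing partition, no extra shuffle
df.sortWithinPartitions("country", "city").write.parquet("path/to/save")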

Reading in highly partitioned data

While writing data out by partition is advised for faster reads thanks to predicate pushdown, reading that data back in with filters can sometimes be much slower than specifying the partition directly in the path string.

Depending on the total number of files under that path, InMemoryFileIndex can take a long time to build the directory listing before any reads even start. See more here

You can potentially get around this by using a datasource table (see the mentioned Spark Summit talk), or by using the basePath option to drill down to the partition level you want.
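As a sketch of the basePath approach, assuming the data was written out partitioned by a date column (the paths and column name are illustrative):

# Slow when there are many partitions: Spark lists every file under
# the root before the filter is applied
df = spark.read.parquet("path/to/table").filter("date = '2019-01-01'")

# Faster: point directly at the partition directory, and supply basePath
# so the partition column is still available in the DataFrame
df = (
    spark.read.option("basePath", "path/to/table")
    .parquet("path/to/table/date=2019-01-01")
)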