Apache Spark Do’s and Don’ts

Apache Spark is one of the main tools in every data engineer's toolbox, and knowing how to use it effectively and efficiently pays dividends in the long run. There are many tools in the data engineering space, each with its own specific use case, and no single tool delivers the required solution or outcome on its own; it is the combination of tools that makes someone an effective architect or developer when designing data pipelines and data transformation activities. Many tools are judged by how well and how correctly they solve a given problem, but there are also tools that you simply like and want to use: they are properly designed, they fit well in your hand, and you do not need to dig into the documentation to understand how to perform this or that simple action.

In this post I describe the optimization methods and tips that help me solve certain technical problems and achieve high efficiency with Apache Spark.

Never collect data on driver

When your DataFrame is large enough that its elements will not fit into the driver machine's memory, do not do the following:

df.collect()
(Image: the Spark driver struggling to collect all the data into driver memory)

The collect action tries to move all the data in the RDD/DataFrame to the machine running the driver, which may then run out of memory and crash.

Instead, limit the number of items returned to the driver by calling take or takeSample, or by filtering or aggregating your RDD/DataFrame first.
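
A minimal PySpark sketch of these alternatives (assuming an existing DataFrame df and a hypothetical status column):

rows = df.take(100)                           # pull only the first 100 rows to the driver
sampled = df.sample(fraction=0.01).collect()  # collect only a ~1% random sample
active = df.filter(df["status"] == "ACTIVE")  # narrow the data before any collect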

Similarly, be careful with other actions if you are not sure that your dataset is small enough to fit into the driver memory:

  • countByKey
  • countByValue
  • collectAsMap

Broadcasting Dimension Tables / Variables

From the official Spark docs:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

The use of broadcast variables, available via the SparkContext, can significantly reduce the size of each serialized task, as well as the cost of launching tasks on the cluster. If your tasks use a large object from the driver program (e.g. a static lookup table or a large list), consider turning it into a broadcast variable.

If we do not broadcast, the same variable is sent to the executors separately for every task. Broadcast variables let the programmer cache a read-only variable, in deserialized form, on each machine instead of shipping a copy of it with every task.
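
A rough PySpark sketch of both patterns (the paths, the dim_key join column, and the country lookup are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Broadcast a small read-only lookup to every executor once, instead of with every task
country_lookup = spark.sparkContext.broadcast({"US": "United States", "IN": "India"})

# Broadcast-join a small dimension table to a large fact table
dim_df = spark.read.parquet("/path/to/dim")
fact_df = spark.read.parquet("/path/to/fact")
joined = fact_df.join(broadcast(dim_df), "dim_key")  # hints Spark to broadcast dim_df to all executors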

As per the Spark source code, a broadcast variable has an upper size limit of 8 GB.

Note: Spark has a ContextCleaner, which runs at periodic intervals to remove broadcast variables once they are no longer used.

Using the most suitable file format for processing

To increase productivity, be wise in choosing file formats. The best format may vary depending on the specific application or the individual functionality of your Spark jobs.

Many formats have their own specifics. For example, Avro offers easy serialization/deserialization, which allows for efficient integration with ingestion processes, while Parquet works well when selecting specific columns and can be effective for storing intermediate files. Parquet files are immutable, so modifications require overwriting the whole data set, whereas Avro files cope easily with frequent schema changes.
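
For example, with Parquet, Spark only reads the columns you actually select; a rough sketch (the paths and column names below are hypothetical):

df = spark.read.parquet("/path/to/events")
daily = df.select("event_date", "user_id")          # Parquet reads only these two columns from disk
daily.write.mode("overwrite").parquet("/path/to/intermediate")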

When reading CSV and JSON files, you get better performance by specifying the schema instead of relying on the inference mechanism: schema inference requires an extra pass over the data, while an explicit schema reduces errors and is recommended for production code.
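
A minimal sketch of specifying an explicit schema for a CSV read (the column names and path are assumptions):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.schema(schema).csv("/path/to/data.csv", header=True)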

Using the right compression for files

In general, the files we deal with for downstream processing can be categorized as splittable or non-splittable.

Everything that is old is new again: Spark at its core employs a Massively Parallel Processing (MPP) framework.

A Resilient Distributed Dataset (RDD) in Spark is essentially the same thing as a distributed table in an MPP database: a distributed collection of key-value pairs, partitioned by a hash on the key, where the value can be a more-or-less arbitrary object. Each of the common RDD operations, including map(), filter(), union(), intersection(), cartesian(), distinct(), aggregate(), fold(), count(), take(), collect(), etc., has a direct equivalent in an MPP database. Even advanced Spark programming constructs like accumulators and broadcast variables have natural counterparts in MPP database operators that deal with the movement of data.
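
A tiny PySpark sketch of a few of these RDD operations (assuming an existing SparkSession named spark):

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 2)])
result = (rdd.filter(lambda kv: kv[1] > 1)   # keep pairs with value > 1
             .distinct()                     # drop the duplicate ("b", 2)
             .map(lambda kv: kv[0])          # project the key, like a SELECT in an MPP database
             .collect())                     # small enough to bring back to the driver
# e.g. ['b', 'a'] (order may vary)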

So we can conclude that

Spark = MPP Database + Query Optimisation + Transaction Support (Delta)

When we are dealing with the fastest animal in the Big Data forest, we should feed it easily chewable small chunks of food (data chunks) so it can ingest and digest them faster, without any side effects.

Spark = Cheetah

In this context, it is more effective to feed Spark splittable files rather than non-splittable files. "Splittable" means the files can be processed in parallel in a distributed manner, rather than on a single machine (non-splittable).

Splittable

  • Snappy (splittable in practice when used inside container formats such as Parquet, ORC, or Avro)
  • bzip2
  • LZO (when indexed)

Non-Splittable

  • gzip
  • zip
  • lz4

Do not use large source files in zip/gzip format: they are not splittable, so it is not possible to read them in parallel with Spark. First, Spark has to download the whole file to one executor, unpack it on just one core, and only then redistribute the partitions to the cluster nodes. As you can imagine, this becomes a huge bottleneck in your distributed processing. If the files are stored on HDFS, you should unpack them before loading them into Spark.
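
If you cannot avoid a single large gzip file, one common workaround is to repartition right after the read so that the rest of the pipeline runs in parallel; a sketch with a hypothetical path and partition count:

df = spark.read.csv("/path/to/big_file.csv.gz", header=True)  # the read itself happens on a single core
df = df.repartition(200)                                      # then spread the data across the cluster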

Bzip2 files have a similar problem. Even though they are splittable, they are compressed so heavily that you get very few partitions, and the data can therefore be poorly distributed. Use bzip2 only when there are no limits on compression time and CPU load, for example for one-time packaging of large amounts of data.

In the end, uncompressed files often outperform compressed files for processing: reading uncompressed data is I/O bound, while reading compressed data is CPU bound, and I/O is usually good enough here. Apache Spark handles uncompressed data quite well, but other libraries and data warehouses may not.
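
When you do compress output, a splittable-friendly combination such as Parquet with Snappy (Spark's default codec for Parquet) usually works well; a minimal sketch with hypothetical output paths:

df.write.mode("overwrite").option("compression", "snappy").parquet("/path/to/output")  # columnar and parallel-friendly
# For plain text outputs, avoid gzip if downstream jobs need to read the files in parallel
df.write.option("compression", "none").csv("/path/to/csv_output")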

Conclusion

As you can infer from the discussion above, it is the responsibility of the architect/developer to design and feed the system in a way that makes it easy for the system to process data effectively and efficiently. To achieve this, one has to know how to use the system and its components efficiently and effectively.
