It’s a very common problem across the whole data industry: people struggle to reduce processing time, they buy bigger machines, they move to Spark, and yet some problems can’t be solved by just adopting a tool. They require a deeper insight, to literally feel how the framework works and why it can be slow even with a considerable amount of resources.
Here we’ll discuss how Spark treats a join and figure out how to join efficiently.
Before we go into much detail, let’s discuss the various strategies which Spark uses to join.
There are 3 join strategies:
- Sort-merge Join: The data is first shuffled so that rows with the same key land on the same machine, then the keys on both sides are sorted, and finally the sorted streams are merged to perform the join.
- Broadcast-hash Join: The smaller side is broadcast to every node where a partition of the larger side resides, and the join happens locally; this eliminates the shuffling of the bigger dataset.
- Shuffle-hash Join: This is similar to the join in MapReduce. It doesn’t sort the data; it simply hashes the join keys of both datasets and shuffles the rows to the nodes by (hash mod number of partitions), then joins the co-located partitions with a hash table.
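The mechanics behind the shuffle and the merge step can be sketched in plain Python. This is a toy model for illustration only, not Spark code, and all names here are made up; real Spark runs the same idea distributed across executors:

```python
# Toy model of "shuffle, then merge sorted keys" as described above.

def shuffle(rows, num_partitions):
    """Route each (key, value) row to a partition by hash(key) % num_partitions,
    so equal keys from both datasets end up in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def merge_join(left, right):
    """Sort both sides by key, then merge the two sorted streams,
    emitting (key, left_value, right_value) for every match."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit one output row per matching right-side row
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out
```

Joining then means shuffling both datasets with the same partition count and merge-joining each pair of co-located partitions, e.g. `merge_join(left_parts[p], right_parts[p])` for every partition `p`.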
When the datasets on both sides are too big to broadcast, it’s better to use sort-merge join or shuffle-hash join, depending on the data locality.
After the Spark 2.3 release, Spark made sort-merge join the default strategy, which can be disabled by setting ‘spark.sql.join.preferSortMergeJoin’ = false.
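In PySpark, that setting can be applied when building the session. The configuration key is the one from the text; the app name below is just a placeholder (a configuration sketch, not a full job):

```python
from pyspark.sql import SparkSession

# Disable the post-2.3 default preference for sort-merge join,
# letting Spark pick shuffle-hash join where it applies.
spark = (
    SparkSession.builder
    .appName("join-strategies")  # hypothetical app name
    .config("spark.sql.join.preferSortMergeJoin", "false")
    .getOrCreate()
)
```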
Broadcast-hash join is best suited when one of the datasets is small [about 1 GB]. There are 2 ways of achieving this:
- By setting spark.sql.autoBroadcastJoinThreshold: this parameter takes a value in bytes and the default is 10 MB. Spark checks the size of the dataset, and if it is found to be smaller than the threshold value, it broadcasts this dataset automatically to all the nodes where the big dataset’s partitions reside.
** Sometimes Spark may not be able to figure out the size of the data correctly, which may end up in a sort-merge join. To get rid of this, one can persist the dataset on disk and memory; by doing this Spark will know the actual size of the data and will choose the strategy accordingly.
- Forcing Spark to broadcast: Spark has also provided functionality to broadcast a dataset forcefully. This approach is advisable only when we are sure about the data sizing.
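In the DataFrame API the forced broadcast is the `broadcast` hint from `pyspark.sql.functions`. The mechanics behind it can be sketched in plain Python: the small side is hashed once and (conceptually) shipped whole to every partition of the big side, which then joins locally with no shuffle. This is a toy illustration; the function and variable names are made up:

```python
# Toy model of a broadcast-hash join in plain Python.
# The real PySpark hint looks like:
#   from pyspark.sql.functions import broadcast
#   big_df.join(broadcast(small_df), "key")

def broadcast_hash_join(big_partitions, small_rows):
    """Join every partition of the big side against a broadcast copy
    of the small side; the big side is never shuffled."""
    # Build the hash table once; a copy goes to each node.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    out = []
    for partition in big_partitions:  # each iteration runs locally on its node
        for key, value in partition:
            for small_value in lookup.get(key, []):
                out.append((key, value, small_value))
    return out
```

Note how the cost scales with the small side being copied to every node, which is why the approach only pays off when that side is genuinely small.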