Playing with joins in spark

It’s a very common problem across the whole data industry: people struggle to reduce processing time. They buy bigger machines, they move to Spark, and yet some problems can’t be solved just by adopting a new tool. They require a deeper insight, to literally feel how the framework works and why it stays slow even with a considerable amount of resources.

Here we’ll be discussing how Spark treats a join and figuring out how to join efficiently.
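To give a taste of what the post covers: when one side of a join is small, Spark can broadcast it to every executor and run a local hash join, skipping the expensive shuffle entirely. The sketch below is plain Python rather than Spark code, with made-up data and function names, and it only illustrates the hash-join idea that a broadcast join relies on:

```python
# Toy illustration of the hash-join strategy behind Spark's broadcast join.
# The small side is turned into an in-memory hash map (the "broadcast"),
# so each row of the big side is matched with one local lookup -- no shuffle.

def broadcast_hash_join(big_side, small_side):
    """Inner-join two lists of (key, value) pairs on the key."""
    # Build phase: hash the small side once
    # (Spark ships this map to every executor).
    lookup = {}
    for key, value in small_side:
        lookup.setdefault(key, []).append(value)
    # Probe phase: stream the big side and look each key up locally.
    return [
        (key, big_value, small_value)
        for key, big_value in big_side
        for small_value in lookup.get(key, [])
    ]

orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
countries = [(1, "IN"), (2, "US")]  # small dimension table

joined = broadcast_hash_join(orders, countries)
# Key 3 has no match on the small side, so its row is dropped (inner join).
```

The same asymmetry is why broadcasting the small table is usually the first optimization to reach for in Spark joins.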

Continue reading “Playing with joins in spark”

Understanding the flow of classpath to run spark job on yarn

To run a job on a cluster, it’s essential to provide the correct set of jars, but that’s always challenging in a clustered environment where we deal with lots of moving components, each with its own set of requirements. It’s therefore important to understand how the classpath is built; otherwise one can land in a never-ending stream of ClassNotFoundException.
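As a starting point, the usual way to make dependencies visible to both the driver and the executors is through `spark-submit` flags. The jar paths and class name below are placeholders for illustration, not from the original post:

```shell
# Ship the application's dependency jars to the cluster and put them
# on the classpath of both the driver and the executors.
# --jars distributes the listed jars to every container;
# extraClassPath prepends entries so they win over the cluster's own jars.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  --conf spark.driver.extraClassPath=dep1.jar:dep2.jar \
  --conf spark.executor.extraClassPath=dep1.jar:dep2.jar \
  --class com.example.MyJob \
  my-job.jar
```

In cluster mode the jars passed via `--jars` are localized into each container’s working directory, which is why `extraClassPath` can refer to the bare file names.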

Continue reading “Understanding the flow of classpath to run spark job on yarn”

Apache Spark SQL

Earlier systems developed for Big Data applications, such as MapReduce, offered a powerful but low-level procedural programming interface. To give users a better experience, several systems layered a relational interface on top of it, such as Pig, Hive, and Shark.

Continue reading “Apache Spark SQL”

I started working with big data technologies in July 2014. I already had hands-on experience with MapReduce code, but in late 2014 I was introduced to another computing engine, Apache Spark, and that’s how I started with Scala, since Spark itself is written in Scala. I began with a fun data science project, trying to recommend items on the basis of their attributes. This later turned into a great way of understanding the core concepts of Spark and its programming model.

