Playing with joins in spark

It’s a very common problem across the whole data industry: people struggle to reduce execution time. They buy bigger machines and move to Spark, yet some problems can’t be solved just by adopting a tool. They require deeper insight into how the framework works and why a job can be very slow even with a considerable amount of resources.

Here we’ll discuss how Spark treats a join and figure out how best to perform one.


Understanding the flow of classpath to run spark job on yarn

To run a job on a cluster, it is necessary to provide the correct set of jars. This is always challenging in a clustered environment, where we have to deal with many moving components that have different requirements. That is why it is important to understand classpath settings; otherwise one can end up chasing a never-ending series of ClassNotFoundException errors.


Setting Up Hadoop Credential Provider API

Today, security is a main concern for everyone. When your product needs to be deployed on premises, a few things have to be provided to the application; a very basic example is a database password. Organizations are no longer willing to put such secrets in a configuration file in cleartext, and everyone is looking for encryption. This kind of encrypted store is now commonly known as a vault.

Here I’ve prepared a working vault using the Hadoop Credential Provider API.


Setting up Virtual Environment for Pyspark or any other clustered env

In a clustered environment, we face a lot of issues with the Python version available on the nodes. If we are shipping our product, we have to perform a lot of sanity tests before deployment to make sure the application runs as expected. We can’t cover all scenarios, though, so there is a high chance of hitting an issue.

So we thought of a better way and came up with the idea of shipping our own Python version with everything preinstalled in the package. You may already be familiar with virtual environments or Anaconda, but believe me, after reading this you will learn something new.


Apache Spark SQL

Earlier systems developed for Big Data applications, such as MapReduce, offered a powerful but low-level procedural programming interface. To provide a better user experience, newer systems such as Pig, Hive, and Shark introduced relational interfaces to big data.


I started working with big data technologies in July 2014. I had hands-on experience with MapReduce code, but in late 2014 I was introduced to another computing engine, Apache Spark, and that’s how I started with Scala, since Spark itself is written in Scala. I began with a fun data science project that tried to recommend items on the basis of their attributes. This turned out to be a great way of understanding the core concepts of Spark and its programming.

