Playing with joins in Spark

It’s a very common problem across the whole data industry: people struggle to reduce processing time. They buy big machines and move to Spark, yet some problems can’t be solved just by adopting a tool. They require a deeper insight into how the framework works and why a job can be very slow even with a considerable amount of resources.

Here we’ll discuss how Spark treats a join and figure out how to join efficiently.


Understanding the flow of classpath to run a Spark job on YARN

To run a job on a cluster, it is necessary to provide the correct set of jars. This is always challenging in a clustered environment, where we have to deal with lots of moving components, each with a different set of requirements. It is therefore important to understand how the classpath settings work; otherwise one can end up with a never-ending problem: ClassNotFoundException.


Setting up a Virtual Environment for PySpark or any other clustered environment

In a clustered environment, we face a lot of issues with the Python version available on the nodes. If we are shipping our product, we have to perform a lot of sanity tests before deployment to make sure our application will run as expected, but we can’t cover all scenarios, so there is a high chance of hitting an issue.

So we thought of a better way and came up with the idea of shipping our own Python version with everything preinstalled in that package. Everyone might be familiar with Virtual Environment or Anaconda, but believe me, after reading this you will have learned something new.

