Don’t be afraid of helping others

When you give something to a person who doesn’t have it, it brings happiness not just to them but also to your own heart. The real challenge, though, is deciding whom to help. How can you be sure that a person isn’t lying, that an organisation isn’t lying? In the end, that doubt makes you restrict yourself and limit your boundaries.

So I decided to write a post to share my perception: how I see things and how I help people. A lot of people carry a very wrong perception in their minds about:

Continue reading “Don’t be afraid of helping others”

Playing with joins in spark

It’s a very common problem across the data industry: people struggle to reduce time complexity, they buy big machines, they move to Spark, and yet some problems can’t be solved by simply adopting a tool. They require a deeper insight, to literally feel how the framework works and why it can still be very slow even with a considerable amount of resources.

Here we’ll discuss how Spark treats a join and figure out how to join efficiently.
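One of the join strategies the post alludes to is Spark’s broadcast hash join: the small table is copied to every executor and hashed in memory, so the large table never has to be shuffled. Below is a pure-Python toy sketch of that build/probe idea (this is not Spark code; the function and field names are made up for illustration):

```python
def broadcast_hash_join(large, small, key):
    """Join two lists of dicts on `key`, hashing the smaller side.

    Mirrors the shape of a broadcast hash join: build a hash map
    from the small ("broadcast") side, then stream the large side
    against it -- the large side is scanned exactly once.
    """
    # Build phase: hash the small side once.
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the large side, emitting matched rows.
    joined = []
    for row in large:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined

orders = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 15}]
users = [{"user_id": 1, "name": "asha"}, {"user_id": 2, "name": "ravi"}]
print(broadcast_hash_join(orders, users, "user_id"))
```

In real Spark the same choice is made for you when one side is small enough (or hinted with `broadcast()`); the point of the sketch is why that avoids the expensive shuffle of the large side.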

Continue reading “Playing with joins in spark”


If you are not failing, then you are not moving; you are in a box of situations and goals created by others and by your society, and you are trying to accomplish them: getting a job in a big company, earning more money than anyone you know, getting the biggest house, getting a car. These comparisons create an illusion around you, a box, that this is your goal. But do you really think that is what makes you happy? Even when you fail to accomplish them, you try to think positively and live with them, because you don’t trust yourself. If you really want to know the real you, you have to feel that box, that cage created by others, step out of it, and think about what you exactly want. Go back into your childhood and try to remember what you felt when you saw an older woman begging for a coin, what you decided to be, or how you decided to help her, or anything that struck you to become something. Bring that real you back into yourself; dream it now, but do set a goal, because …




Understanding the flow of classpath to run spark job on yarn

To run a job on a cluster, it’s essential to provide the correct set of jars, but that is always challenging in a clustered environment, where we have to deal with many moving components, each with its own set of requirements. It is therefore very important to understand how the classpath is assembled; otherwise one can land in a never-ending problem: ClassNotFoundException.
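As a rough sketch of where classpath entries come from on YARN, there are (at least) three knobs in play: jars shipped with the job, entries prepended on the driver, and entries prepended on the executors. The jar names and paths below are purely illustrative:

```shell
# --jars ships the listed jars to the driver and executors and adds
# them to their classpaths; the extraClassPath settings prepend paths
# that must already exist on every node.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /opt/libs/mysql-connector-java-8.0.33.jar \
  --conf spark.driver.extraClassPath=/opt/node-local/custom.jar \
  --conf spark.executor.extraClassPath=/opt/node-local/custom.jar \
  --class com.example.Main \
  app.jar
```

The key distinction is that `--jars` copies files around for you, while `extraClassPath` assumes the file is already present at that path on each machine; mixing the two up is a classic source of the ClassNotFoundException mentioned above.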

Continue reading “Understanding the flow of classpath to run spark job on yarn”

Setting Up Hadoop Credential Provider API

Today, security is a main concern for everyone, and when your product needs to be deployed on premises, a few things must be supplied to the application securely; a very basic example is a database password. Industries are no longer willing to put such secrets in a configuration file in cleartext; everyone is looking for encryption, in a setup now commonly known as a vault.

Here I’ve prepared a working vault using the Hadoop Credential Provider API.
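For a flavour of what that looks like, the Hadoop CLI can create a JCEKS keystore and store password aliases in it (the store path and alias name below are examples, not taken from the post):

```shell
# Create an alias in an encrypted JCEKS credential store
# (prompts for the secret instead of putting it on the command line).
hadoop credential create db.password \
  -provider jceks://file/opt/secure/app.jceks

# Verify which aliases the store now holds.
hadoop credential list \
  -provider jceks://file/opt/secure/app.jceks
```

An application then points `hadoop.security.credential.provider.path` at the store and resolves the alias through `Configuration.getPassword("db.password")`, so no cleartext password ever lands in a config file.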

Continue reading “Setting Up Hadoop Credential Provider API”

Setting up Virtual Environment for Pyspark or any other clustered env

On a clustered environment, we face a lot of issues with the Python versions available on the nodes. If we are shipping a product, we have to perform a lot of pre-deployment sanity tests to make sure the application runs as expected, but we can’t cover every scenario, so there is a high chance of hitting an issue.

So we thought of a better way and came up with the idea of shipping our own Python, with everything preinstalled in the package. Everyone might be familiar with virtual environments or Anaconda, but believe me, after reading this you will learn something new.
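One common recipe along these lines (a sketch under the assumption of a YARN cluster; the package list and file names are illustrative) is to build a virtualenv locally, pack it into an archive, and let YARN unpack it beside each container:

```shell
# Build and pack an isolated environment on the build machine.
python3 -m venv pyspark_env
source pyspark_env/bin/activate
pip install numpy pandas venv-pack
venv-pack -o pyspark_env.tar.gz

# Ship the archive; YARN extracts it into a directory named
# "environment" in each container's working directory.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --master yarn \
  --archives pyspark_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  job.py
```

Because every executor unpacks the same archive, the job no longer depends on whatever Python happens to be installed on the nodes.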

Continue reading “Setting up Virtual Environment for Pyspark or any other clustered env”

Apache Spark SQL

Earlier systems developed for Big Data applications, such as MapReduce, offered a powerful but low-level procedural programming interface. To give users a better experience, several systems introduced a relational interface on top, such as Pig, Hive, and Shark.

Continue reading “Apache Spark SQL”

I started working with big data technologies in July 2014, with hands-on experience writing MapReduce code. In late 2014 I was introduced to another computing engine, Apache Spark, and that’s how I started with Scala, since Spark itself is written in Scala. I began with a fun data science project, trying to recommend items on the basis of their attributes, and that in turn became a great way of understanding the core concepts of Spark and its programming.

