Understanding the classpath flow for running a Spark job on YARN

To run a job on a cluster, it is essential to provide the correct set of jars, but this is always challenging in a clustered environment where we have to deal with many moving components with different requirements. It is therefore important to understand how the classpath settings fit together; otherwise one can land in a never-ending cycle of ClassNotFoundException.


<CPS>: YARN resolves this placeholder to ':' (the classpath separator).

<PWD>: YARN resolves this placeholder to the container's current working directory.
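
For example, a classpath entry written with these placeholders (reusing one of the parcel paths from the example below) is expanded by YARN before the container starts:

Example:
<PWD>/*<CPS>/opt/cloudera/parcels/CDH/lib/spark/jars/*
becomes, on the node,
<container working directory>/*:/opt/cloudera/parcels/CDH/lib/spark/jars/*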

YARN-ENV-ENTRIES

CLASSPATH: Jars specified in this text box are loaded onto the YARN launcher classpath and are used to launch the Spark job. This is a way to isolate Spark's dependencies from YARN's.

Example:

CLASSPATH
/opt/cloudera/parcels/CDH/lib/hive/*<CPS>/opt/cloudera/parcels/CDH/lib/hive/lib/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/lib/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop/*<CPS>/opt/cloudera/parcels/CDH/lib/hadoop/lib/*<CPS>/opt/cloudera/parcels/CDH/lib/spark/jars/*

We usually need the Hadoop jars, Scala jars and Spark jars to submit a Spark job.

YARN-HADOOP

yarn.mode.hadoop.classpath: This points to the Hadoop jars, but it is not required if the CLASSPATH property above is already set.

SPARK
spark.driver.extraClassPath: We can use this property to put jars on the driver's classpath: the Hadoop jars and any other jars that are already localised (already present on all the nodes). These jars are available from the very start of the Spark job.

spark.executor.extraClassPath: We can use this property to put jars on the classpath of every executor, again the Hadoop jars and any other jars that are already localised (already present on all the nodes).
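
A minimal sketch of setting both properties at submit time, assuming YARN cluster mode (the application class and jar names below are placeholders; the jar locations reuse the parcel paths from the CLASSPATH example above). Note that in client mode the driver classpath must be supplied via spark-submit or spark-defaults.conf rather than from application code, because the driver JVM has already started by then.

Example:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.extraClassPath="/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/lib/*" \
  --conf spark.executor.extraClassPath="/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/lib/*" \
  --class com.example.MyApp \
  my-app.jar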

spark.yarn.archive: We can point this property at a compressed file of jars stored on HDFS, so that jars which are not available locally still end up on the Spark job's classpath. However, these jars are not available until the SparkContext has been created, so putting the Hadoop jars (or any other jar needed during launch) into the YARN archive will not work; the launch of the Spark job will fail with a ClassNotFoundException.
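
A common way to build such an archive is to bundle the jars under $SPARK_HOME/jars, upload the bundle to HDFS and point the property at it. A minimal sketch (the HDFS location is an assumption for your environment):

Example:
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put spark-libs.jar /user/spark/share/lib/

spark-submit ... --conf spark.yarn.archive=hdfs:///user/spark/share/lib/spark-libs.jar ...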

spark.yarn.stagingDir: This is the directory Spark uses to stage the jars and other files that need to be localised. If the driver fails with a "jar not found" error, do either of the following:

1. Set this property to an HDFS path, because Spark may be writing to the local node due to a default-filesystem misconfiguration.
Example: hdfs://ip-11-0-0-189.ec2.internal:8020/user/

2. Put the correct configuration files (core-site.xml, hdfs-site.xml and the other Hadoop configuration files) on the classpath. If we do not put these files on the classpath while submitting, Spark does not know that it is the same cluster where the job is running, and submitting the Spark job takes a long time.
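
A minimal submit-time sketch combining both fixes, reusing the staging path from the example above (the HADOOP_CONF_DIR location and the application class and jar names are placeholders for your environment):

Example:
export HADOOP_CONF_DIR=/etc/hadoop/conf    # directory containing core-site.xml, hdfs-site.xml, etc.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.stagingDir=hdfs://ip-11-0-0-189.ec2.internal:8020/user/ \
  --class com.example.MyApp \
  my-app.jar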

spark.files: Any external file can be shipped to the Spark job using this property; in our case we want hive-site.xml on the classpath so the job can connect to the Hive metastore. Example: /etc/hive/conf/hive-site.xml
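
A short sketch of passing the file at submit time (the --files option is the command-line equivalent of spark.files; the application class and jar names are placeholders):

Example:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /etc/hive/conf/hive-site.xml \
  --class com.example.MyApp \
  my-app.jar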
