To run a job on a cluster, we must provide the correct set of jars. This is always challenging in a clustered environment, where we deal with many moving components, each with its own set of requirements. It is therefore important to understand classpath settings, otherwise one can land in a never-ending ClassNotFoundException problem.
<CPS>: YARN resolves this keyword to ':' (the classpath separator).
<PWD>: YARN resolves this keyword to the container's working directory.
CLASSPATH: Jars specified in this text-box are loaded on the YARN launcher's classpath and used to launch the Spark job. This is a way to isolate Spark's dependencies from YARN's.
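For illustration, a CLASSPATH value could look like the following (the Hadoop and Spark directories are assumptions; substitute your distribution's actual locations). At container launch, YARN expands <PWD> to the working directory and <CPS> to ':':

<PWD><CPS><PWD>/*<CPS>/usr/lib/hadoop/*<CPS>/usr/lib/hadoop/lib/*<CPS>/usr/lib/spark/jars/*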
We usually need the Hadoop jars, Scala jars, and Spark jars to submit a Spark job.
yarn.mode.hadoop.classpath: This points to the Hadoop jars, but it is not required if we have already set the above property.
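Assuming a typical installation layout (the paths below are only illustrative), it could be set as:

yarn.mode.hadoop.classpath=/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*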
spark.driver.extraClassPath: We can use this property to provide jars to the driver node, typically the Hadoop jars and all the other jars that are already localized (already present on all the nodes). These jars are available from the very start of the Spark job.
spark.executor.extraClassPath: We can use this property to provide jars to all executor nodes, again the Hadoop jars and all the other jars that are already localized (already present on all the nodes).
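A minimal sketch of setting both properties at submit time (the jar directories, class name, and application jar are assumptions for illustration):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.driver.extraClassPath='/usr/lib/hadoop/*:/opt/libs/*' \
  --conf spark.executor.extraClassPath='/usr/lib/hadoop/*:/opt/libs/*' \
  --class com.example.MyApp my-app.jar

Note that extraClassPath does not ship any jars; it only prepends entries to the JVM classpath, which is why the paths must already exist on every node.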
spark.yarn.archive: We can provide a compressed file containing jars, hosted on HDFS itself, so that jars which are not available locally still end up on the Spark job's classpath. These jars are not available until the SparkContext is created.
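A common pattern (the HDFS path is an assumption) is to bundle the Spark jars into a single uncompressed archive, upload it, and point spark.yarn.archive at it; YARN then localizes the archive into each container's working directory:

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
hdfs dfs -mkdir -p /user/spark
hdfs dfs -put spark-libs.jar /user/spark/
spark-submit --master yarn \
  --conf spark.yarn.archive=hdfs:///user/spark/spark-libs.jar \
  --class com.example.MyApp my-app.jar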
spark.yarn.stagingDir: This is the directory Spark uses to put all the jars and other files to be localized. If the driver fails with a jar-not-found error, do either of the following (a sketch of both options follows the list):
1. Set spark.yarn.stagingDir explicitly to a path on the same HDFS cluster where the job runs, so the localized files are reachable from the containers.
2. Put the correct set of configuration files (core-site.xml, hdfs-site.xml, and other Hadoop configuration files) on the classpath. If we don't put these files on the classpath while submitting the Spark job, Spark will not know that it is the same cluster where the job is running, and it will take a long time to submit the job.
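A sketch of both options (the hostname, port, and paths are assumptions):

export HADOOP_CONF_DIR=/etc/hadoop/conf   # spark-submit picks up core-site.xml and hdfs-site.xml from here
spark-submit --master yarn \
  --conf spark.yarn.stagingDir=hdfs://namenode:8020/user/etl/.sparkStaging \
  --class com.example.MyApp my-app.jar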
spark.files: Any external file can be passed to the Spark job using this property; in our case we want hive-site.xml on the classpath to connect to the Hive metastore. Example:
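A minimal sketch (the hive-site.xml location is an assumption; use your cluster's actual path):

spark-submit --master yarn \
  --conf spark.files=/etc/hive/conf/hive-site.xml \
  --class com.example.MyApp my-app.jar

YARN localizes the file into each container's working directory, which is on the classpath, so the job can read the metastore settings from hive-site.xml.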