Running the Spark Application
Driver Configuration
As mentioned in the Configuration section, the `Driver` class picks up its parameters from the `driver.ini` configuration file. For batch geocoding, you need to change the parameters in the `[batch-geocode]` section.
- The input files must be in TSV (tab-separated values) format and must have `address` and `join_key` columns (headers).
- This application writes its output directly to Hive. You need to change `hive.output.table` for each run, or else the Spark application will fail with `AnalysisException: Table already exists`.
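For reference, here is a sample of the expected input layout. The `address` and `join_key` headers come from the requirement above; the rows themselves are made-up illustrations:

```
join_key	address
1001	123 Main St, Springfield, MA 01101
1002	456 Oak Ave, Quincy, MA 02169
```

And a minimal sketch of the `[batch-geocode]` section of `driver.ini`. Only `hive.output.table` is named in this section; the input key shown is a hypothetical placeholder, so check the Configuration section for the actual parameter names:

```ini
[batch-geocode]
# Hypothetical key for the input TSV location; the real name is documented in the Configuration section
input.path = /data/geocode/input.tsv
# Must be unique per run, or the job fails with AnalysisException: Table already exists
hive.output.table = geocode_results_20190701
```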
Submitting the Spark Application
```bash
spark2-submit --master yarn \
  --deploy-mode cluster \
  --class "org.nfpa.spatial.Driver" \
  --num-executors 80 \
  --executor-memory 18G \
  --executor-cores 2 \
  --files=driver.ini \
  --conf "spark.storage.memoryFraction=1" \
  --conf "spark.yarn.executor.memoryOverhead=2048" \
  --conf spark.executor.extraLibraryPath=/path/to/jniLibs \
  --conf spark.driver.extraLibraryPath=/path/to/jniLibs \
  path/to/location-tools-1.0-SNAPSHOT.jar --batch-geocode driver.ini
```
You should run `spark2-submit` either in headless mode or inside a tmux session, since batch jobs may take several hours to execute.
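A minimal tmux workflow, using standard tmux commands (the session name is arbitrary):

```bash
# Start a named tmux session so the job survives SSH disconnects
tmux new-session -s batch-geocode

# ... run the spark2-submit command above inside the session ...

# Detach with Ctrl-b d; re-attach later to check progress:
tmux attach -t batch-geocode
```

Alternatively, `nohup spark2-submit ... > geocode.log 2>&1 &` achieves the same headless effect.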
spark2-submit Arguments
- `--master`: URL where the Application Master for this job will be instantiated. If you are running Spark on YARN, choose `yarn`. Default is `local`.
- `--deploy-mode`: Choose `cluster` to run the driver on one of the node machines in the cluster; otherwise choose `client` to run it locally.
- `--class`: Application driver class name.
- `--num-executors`: Number of executors to spawn.
- `--executor-memory`: Memory to be allocated to each executor.
- `--executor-cores`: Number of CPU cores allocated to each executor.
- `--conf`: Arbitrary Spark configuration in `key=value` format. For values that contain spaces, wrap `"key=value"` in quotes (see the example after this list).
- `path/to/location-tools-1.0-SNAPSHOT.jar`: Location of the JAR file generated after building the project.
- `--batch-geocode`: Directs the `Driver` class to run the batch geocoding application.
- `driver.ini`: Path to the `driver.ini` configuration file.
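To illustrate the quoting rule for `--conf`, here is a value that contains spaces. The property is a standard Spark setting, but the specific JVM flags are just placeholders:

```bash
# The whole key=value pair is quoted because the value contains spaces
spark2-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc" \
  ... # remaining arguments as shown above
```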
Tuning Spark Parameters
Please refer to this Stack Overflow thread to estimate Spark parameters.
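As a rough worked example of the sizing logic commonly recommended there (the node counts and memory figures below are assumptions, not measurements from this cluster):

```bash
# Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB RAM each.
# Reserve 1 core and ~1 GB per node for OS/Hadoop daemons.
#   cores per executor  : 5 (common rule of thumb for HDFS throughput)
#   executors per node  : (16 - 1) / 5 = 3
#   total executors     : 3 * 10 - 1 = 29   (one slot left for the driver/AM)
#   memory per executor : (64 - 1) / 3 ≈ 21 GB, minus ~7% YARN overhead ≈ 19 GB
spark2-submit --num-executors 29 --executor-cores 5 --executor-memory 19G ...
```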