Lucene Index from TIGER Census Data
Download TIGER data
We use TIGER2018 data published by the United States Census Bureau. TIGER data is distributed in shapefile format. We first download the zip files from the Census website.
You can specify which states to download in the driver.ini file by changing the states key in the [download] section. The value is a comma-separated list of US state codes.
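Before starting a long download, it can help to confirm which states the configuration actually selects. The sketch below prints the value of the states key from the [download] section. It assumes driver.ini uses the common "key = value" INI syntax, which this document does not show, and the function name download_states is only illustrative.

```shell
# Print the value of the "states" key found in the [download] section.
# Assumption: driver.ini uses "key = value" (or "key=value") lines under
# "[section]" headers; adjust the patterns if the real file differs.
download_states() {
    awk '
        # A section header switches the flag on only for [download].
        /^[ \t]*\[/ {
            in_section = ($0 ~ /^[ \t]*\[download\][ \t]*$/)
            next
        }
        # Inside [download], print whatever follows "states =".
        in_section && /^[ \t]*states[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print
            exit
        }
    ' "$1"
}
```

Calling download_states driver.ini prints the raw comma-separated value, or nothing if the key is missing.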
Make sure you have about 30 GB of free disk space for a complete US build.
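The free-space requirement can be checked up front with a small helper. This is a minimal sketch using the POSIX df utility; the function name has_free_gb is hypothetical, and it treats the 30 GB figure as roughly 30 GiB.

```shell
# Succeed only if the filesystem holding directory $1 has at least $2 GiB free.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -P gives the stable POSIX output format; -k reports 1024-byte blocks.
    # The fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, has_free_gb . 30 returns a non-zero status when the current directory's filesystem has less than about 30 GB available.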
java -cp /usr/src/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --download driver.ini
Preprocess the data for indexing
For Lucene to consume the downloaded data, we first convert the downloaded shapefiles to CSV and combine information from multiple file types such as FACES, EDGES, STATE, and COUNTY. For this, we use GeoSpark.
java -cp /usr/src/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --process driver.ini
Build Lucene Indexes
We can now index all the CSV files into Lucene.
java -cp /usr/src/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --index driver.ini
Make sure your driver.ini config has correct values.
You should now have files in the lucene.index.dir directory.
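A quick way to confirm that the directory holds a committed index is to look for a segments_N file, which Lucene writes on every commit. The helper below is only a rough sanity check under that assumption; the function name looks_like_lucene_index is illustrative, and the directory to pass is whatever lucene.index.dir is set to in your driver.ini.

```shell
# Succeed if directory $1 contains at least one Lucene commit file (segments_N).
looks_like_lucene_index() {
    index_dir=$1
    for f in "$index_dir"/segments_*; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}
```

This only shows that a commit exists. It says nothing about whether the documents and fields are what you expect.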
It's always a good idea to check the index with Lucene Luke, which you can find in the Lucene binary releases (Lucene >= 8.1).
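The download, process, and index commands from this document can also be chained so that a failure in one step stops the rest. The sketch below assumes that org.nfpa.spatial.Driver exits with a non-zero status on failure, which is not confirmed here; the function name build_index is hypothetical.

```shell
# Run the download, process, and index steps in order, stopping at the
# first step that fails.
# $1: path to the location-tools jar, $2: path to driver.ini
build_index() {
    jar=$1
    ini=$2
    for step in download process index; do
        echo "Running --$step" >&2
        # Assumption: the Driver returns a non-zero exit code on error.
        java -cp "$jar" org.nfpa.spatial.Driver "--$step" "$ini" || return 1
    done
}
```

With the paths used above, this would be invoked as build_index /usr/src/location-tools-1.0-SNAPSHOT.jar driver.ini.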