Bulk Loading Shapefiles Into Postgres/Postgis

Recently I’ve been doing a fair bit of work with geospatial data, mostly on the data preparation side. While there are common data formats, I have found that because so much of this data are sourced from government agencies, the data are often in many files that can be concatenated.

In this example, I will show how to take a few dozen county-level shapefiles of parcel data from Utah and load it into a single table in Postgres/Postgis.

Step 1: Downloading Shapefiles

The following shell commands come from an in-progress collaboration with a friend, where we are going to analyze daily air quality in Utah over the past several years. Utah is open-data-friendly, providing shapefiles for every parcel of land in Utah.

While it may have been possible to use wget or curl to download every shapefile, they are stored within Google Drive with a bunch of hashed URLs, so I just clicked on each file instead of trying to be clever. So if you want to follow along with this blog post exactly, you’ll need to download the 25 zip files of Utah shapefiles:

$ ls -lrt utah_lir_shapefiles/
total 408688
-rw-rw-r-- 1 rzwitch rzwitch    954984 Jun  1 13:10 Parcels_Beaver_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   7183466 Jun  1 13:10 Parcels_BoxElder_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   9152777 Jun  1 13:10 Parcels_Cache_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   3279384 Jun  1 13:10 Parcels_Carbon_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch    356058 Jun  1 13:10 Parcels_Daggett_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch  18908413 Jun  1 13:10 Parcels_Davis_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   3900415 Jun  1 13:10 Parcels_Duchesne_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   2689950 Jun  1 13:10 Parcels_Garfield_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   2156109 Jun  1 13:10 Parcels_Grand_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   8107608 Jun  1 13:10 Parcels_Iron_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   1975537 Jun  1 13:10 Parcels_Juab_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   3273485 Jun  1 13:10 Parcels_Kane_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   2741403 Jun  1 13:10 Parcels_Millard_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   1110627 Jun  1 13:10 Parcels_Morgan_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   2970626 Jun  1 13:10 Parcels_Rich_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 200183664 Jun  1 13:11 Parcels_SaltLake_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   1397522 Jun  1 13:11 Parcels_SanJuan_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   1576757 Jun  1 13:11 Parcels_Sanpete_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   7911261 Jun  1 13:11 Parcels_Summit_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   4480456 Jun  1 13:11 Parcels_Tooele_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch  69690149 Jun  1 13:11 Parcels_Utah_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch   5025674 Jun  1 13:11 Parcels_Wasatch_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch  35896908 Jun  1 13:11 Parcels_Washington_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch    298313 Jun  1 13:11 Parcels_Wayne_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch  23225130 Jun  1 13:11 Parcels_Weber_LIR.zip

Step 2: Bulk Unzip

With all of these files in the same directory at the same level (i.e. no subfolders), it’s pretty easy to bulk unzip the files, with one caveat: to move the contents of the unzipped files into a new directory, you need to use the -d flag:

mkdir utah_lir_shapefiles_unzipped && unzip utah_lir_shapefiles/\*.zip -d utah_lir_shapefiles_unzipped

The reason I created a new directory (mkdir) and then unzipped the files into a new directory is that when doing analysis, I always like to keep the source data separate, so that I always have the option of starting completely over. It also can make regular expression globs easier :)

Step 3: Creating Postgis Table Definition

After all of the county zip files are unzipped, you get 25 sub-directories structured like the following:

ls -ltr
total 10916
-rw-rw-rw- 1 rzwitch rzwitch   67868 Sep  3  2017 Parcels_Beaver_LIR.shx
-rw-rw-rw- 1 rzwitch rzwitch   28280 Sep  3  2017 Parcels_Beaver_LIR.shp.xml
-rw-rw-rw- 1 rzwitch rzwitch 1503304 Sep  3  2017 Parcels_Beaver_LIR.shp
-rw-rw-rw- 1 rzwitch rzwitch    3036 Sep  3  2017 Parcels_Beaver_LIR.sbx
-rw-rw-rw- 1 rzwitch rzwitch   83052 Sep  3  2017 Parcels_Beaver_LIR.sbn
-rw-rw-rw- 1 rzwitch rzwitch     425 Sep  3  2017 Parcels_Beaver_LIR.prj
-rw-rw-rw- 1 rzwitch rzwitch 9471508 Sep  3  2017 Parcels_Beaver_LIR.dbf
-rw-rw-rw- 1 rzwitch rzwitch       5 Sep  3  2017 Parcels_Beaver_LIR.cpg

The .shp files from the 25 counties all have the same format, which is very convenient. In this step, we can use the shp2pgsql utility that comes with Postgis to read a shapefile, determine the proper schema, then create the table in the database:

shp2pgsql -I -s 26912 -p utah_lir_shapefiles_unzipped/Parcels_Beaver_LIR/Parcels_Beaver_LIR.shp \
utahlirparcels  | psql -h localhost -U <username> <database>;

The key flag here is -p, which means ‘prepare mode’; the shapefile will get read, a table created, but no data loaded. By not loading the data in this step, it makes looping over the files easier later, as no special logic is required to keep the Parcels_Beaver_LIR.shp from being duplicated in Postgis (because it was never loaded in the first place).

Step 4: Bulk Loading Shapefiles into Postgis

The last steps of the loading process are to 1) get all of the shapefile locations and 2) feed them to shp2pgsql:

for i in $(find utah_lir_shapefiles_unzipped/ -type f -name '*.shp'); do
  shp2pgsql -I -s 26912 -a $i utahlirparcels  | psql -h localhost -U <username> <database>;
done;

To get all of the shapefile locations, I use find with flags -type f (files type) and name to search for the pattern within the directory. This command goes through the entire set of subdirectories and gets all the .shp files. From there, I iterate over the list of files using for i in..., then pass the value of $i into a similar shp2pgsql as above. However, rather than using flag -p for ‘prepare’, we are now going to use flag -a for ‘append’. This will perform an INSERT INTO utahlirparcels() statement for Postgres, loading in the actual data from the 25 shapefiles.

Spend Time Now To Save Time Later

Like so much of shell scripting, figuring out these commands took longer than I would’ve expected. Certainly, they took longer to figure out than it would’ve taken to copy-paste a shp2pgsql 25 times! But by taking the time upfront to figure out a generic method of looping over shapefiles, the next time (and every time after that) I find myself needing to do this, this code will be available to load multiple shapefiles into Postgis.


Using RSiteCatalyst With Microsoft PowerBI Desktop

With pretty regular frequency I get emails asking if RSiteCatalyst can be used with Microsoft Power BI. While admittedly I’m not a frequent user of the Windows operating system (nor dashboarding tools like Tableau or Power BI), I am pleased to report that it is fact possible to call the Adobe Analytics API with Power BI via RSiteCatalyst!

Step 1: Call Adobe Analytics API Using Get Data Menu

The majority of getting RSiteCatalyst to work within Power BI desktop is getting the R script correct. From the Get Data menu, choose the More... menu option to bring up all of the data import tools that Power BI defines:

rsitecatalyst powerbi get data

Once you choose ‘R Script’, an input box will open where you can place your RSiteCatalyst function calls:

rsitecatalyst powerbi rscript

After hitting ‘OK’, Power BI will evaluate your R code, determining which statements return a data.frame (which is the only allowable data structure imported into Power BI). You can choose which data.frame(s) you want to import from the ‘Navigator’ window:

rsitecatalyst powerbi navigator

Once you hit ‘OK’, Power BI imports the data and you can use your Adobe Analytics data just as you would in R with RSiteCatalyst (or, like any other data source like CSV or database…)

Limitations

While it’s possible to call RSiteCatalyst through Power BI, there are some limitations to keep in mind.

First, RSiteCatalyst will only work with Microsoft Power BI Desktop, which is installed locally on your machine. The Power BI Service, which is more of a shared dashboard/data store environment, does not allow external API calls as part of its security model. So while you can analyze your data locally, you cannot share dashboards to the Power BI Service.

The second limitation I’ve noticed is that Power BI doesn’t read from from a .Renviron file (at least, not from the default Windows location that R GUI reads). So you will need to place your credentials directly in the R script, which is never really ideal (though, may not be a big deal all things considered).

Finally, the R script runs synchronously, so when placing multiple calls in the same R script you will need to wait for all of the data.frame results before you can use them within Power BI. This is the same default behavior within R, sans using Promises or parallelism of some sort, but it’s still important to keep in mind.

Dashboards, Dashboards, Dashboards!

With a few minutes work, I was able to create this rudimentary dashboard (R code):

rsitecatalyst powerbi dashboard

Someone with more interesting/higher volume data could surely do better. But the most important thing in my opinion is that Microsoft has built an awesome integration with R and that creating dashboards in Power BI is waaaaaay easier than the last time I tried to create a dashboard using Excel and the Adobe Report Builder plugin.

Happy dashboarding!


Getting Started With OmniSci, Part 2: Electricity Dataset

Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion

In my previous MapD post, I loaded electricity data into MapD Community Edition, intentionally ignoring the what of the data to keep that post from being too overwhelming. Now let’s take a step back and explain the dataset, show how to format the data using Python that was loaded into MapD, then use the MapD Immerse UI to build a simple dashboard.

PJM Metered Load Data

I started off my career at PJM doing long-term electricity demand forecasting, to help power engineers do transmission line studies for reliability and to support expansion of the electrical grid in the U.S. Because PJM is a quasi-government agency, they provide over 25 years of hourly electricity usage for the Eastern and Central U.S., both in aggregate and by local power region (roughly, the local power company territories).

However, just because the data is available doesn’t mean it’s convenient, and unfortunately, the data are stored as Excel spreadsheets. This is easily remedied using pandas (v0.22.0, python3.6):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
import os
import pandas as pd

#change to directory with files for convenience
os.chdir("~/electricity_data")

#first sheet in workbook contains all info for years 1993-1999
df1993_1999 = [pd.read_excel(str(x) + "-hourly-loads.xls", usecols = "A:Z") for x in range(1993,1999)]

#melt, append df1993-df1999 together
df_melted = pd.DataFrame()
for x in df1993_1999:
    x.columns = df1993_1999[1].columns.tolist()
    x_melt = pd.melt(x, id_vars=['ACTUAL_DATE', 'ZONE_NAME'], var_name = "HOUR_ENDING", value_name = "MW")
    df_melted = df_melted.append(x_melt)

#multiple sheets to concatenate
#too much variation for a one-liner
d2000 = pd.read_excel("2000-hourly-loads.xls", sheet_name = [x for x in range(2,17)], usecols = "A:Z")
d2001 = pd.read_excel("2001-hourly-loads.xls", sheet_name = None, usecols = "A:Z")
d2002 = pd.read_excel("2002-hourly-loads.xls", sheet_name = [x for x in range(1,18)], usecols = "A:Z")
d2003 = pd.read_excel("2003-hourly-loads.xls", sheet_name = [x for x in range(1,19)], usecols = "A:Z")
d2004 = pd.read_excel("2004-hourly-loads.xls", sheet_name = [x for x in range(2,24)], usecols = "A:Z")
d2005 = pd.read_excel("2005-hourly-loads.xls", sheet_name = [x for x in range(2,27)], usecols = "A:Z")
d2006 = pd.read_excel("2006-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2007 = pd.read_excel("2007-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2008 = pd.read_excel("2008-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2009 = pd.read_excel("2009-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2010 = pd.read_excel("2010-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2011 = pd.read_excel("2011-hourly-loads.xls", sheet_name = [x for x in range(3,32)], usecols = "A:Z")
d2012 = pd.read_excel("2012-hourly-loads.xls", sheet_name = [x for x in range(3,33)], usecols = "A:Z")
d2013 = pd.read_excel("2013-hourly-loads.xls", sheet_name = [x for x in range(3,34)], usecols = "A:Z")
d2014 = pd.read_excel("2014-hourly-loads.xls", sheet_name = [x for x in range(3,34)], usecols = "A:Z")
d2015 = pd.read_excel("2015-hourly-loads.xls", sheet_name = [x for x in range(3,40)], usecols = "B:AA")
d2016 = pd.read_excel("2016-hourly-loads.xls", sheet_name = [x for x in range(3,40)], usecols = "B:AA")
d2017 = pd.read_excel("2017-hourly-loads.xls", sheet_name = [x for x in range(3,42)], usecols = "B:AA")
d2018 = pd.read_excel("2018-hourly-loads.xls", sheet_name = [x for x in range(3,40)], usecols = "B:AA")

#loop over dataframes, read in matrix-formatted data, melt to normalized form
for ord in [d2000, d2001, d2002, d2003, d2004, d2005, d2006, d2007, d2008, d2009, d2010,
            d2011, d2012, d2013, d2014, d2015, d2016, d2017, d2018]:

    for key in ord:
        temp = ord[key]
        temp.columns = df1993_1999[1].columns.tolist() #standardize column names
        temp["ACTUAL_DATE"] = pd.to_datetime(temp["ACTUAL_DATE"]) #force datetime, excel reader wonky
        df_melted = df_melted.append(pd.melt(temp, id_vars=['ACTUAL_DATE', 'ZONE_NAME'], var_name = "HOUR_ENDING", value_name = "MW"))

#(4941384, 4)
#130MB as CSV
#remove any dates that are null, artifacts from excel reader
df_melted[pd.notnull(df_melted["ACTUAL_DATE"])].to_csv("hourly_loads.csv", index=False)    

The code is a bit verbose, if only because I didn’t want to spend time to figure out how to programmatically determine how many tabs each workbook has. But the concept is the same each time: read an Excel file, get the data into a dataframe, then convert the data to long form. So instead of having 26 columns (Date, Zone, Hr1-Hr24), we have 4 columns, which is quite frequently a more convenient way to access the data (especially when using SQL).

The final statement writes out a CSV of approximately 4MM rows, the same dataset that was loaded using mapdql in the first post.

Top 10 Usage Days By Season

One of the metrics I used to monitor as part of my job was the top 5/top 10 peak electricity use days per Summer (high A/C usage) and Winter (electric space heating) seasons. Back in those days, I used to use SAS against an enterprise database and the results would come back eventually

Obviously, it’s not a fair comparison to compare today’s GPUs vs. late ’90s enterprise databases in terms of performance, but back then it did take a non-trivial amount of effort to run this query to keep the report updated. With MapD, I can do the same report in ~100ms:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
--MapD doesn't currently support window functions, so need to precalculate maximum by day
with qry as (select
actual_date,
zone_name,
max(MW) as daily_max_usage
from hourly_loads
where zone_name = 'MIDATL' and actual_date between '2017-06-01' and '2017-09-30'
group by 1,2)
select
hl.actual_date,
hl.zone_name,
hl.hour_ending,
hl.MW
from hourly_loads as hl
inner join qry on qry.actual_date = hl.actual_date and qry.daily_max_usage = hl.mw
order by daily_max_usage desc
limit 10;

top 10 electric usage

The thing about returning an answer in 100ms or so is that its fast enough where calling these results from a webpage/dashboard would be very responsive; that’s where MapD Immerse comes in.

Building A Dashboard Using MapD Immerse

Rather than copy/pasting the query in and running it, it’s pretty easy to build an automated report using the Immerse dashboard builder. I’m limited to a single data source because I’m using MapD Community Edition, but in just a few minutes I was able to create the following dashboard:

mapd immerse dashboard

I took the query from above and built a view to encapsulate the query, so I didn’t have to worry about the with statement or joins, I could just use the view as if the results were pre-calculated. From there, adding in a results table and two bar charts was fairly quick (in the same drag-and-drop style of Tableau or other BI/reporting tools).

While this dashboard is pretty rudimentary in its design, were this data source set up as real-time using Apache Kafka or similar, this chart would always be up-to-date for use on a TV screen or as a browser bookmark without any additional data or web engineering.

Obviously, many dashboarding tools exist, but its important to note that no pre-aggregation or column indexing or other standard database performance tricks are being employed (outside of specialized hardware and fast GPU RAM caching). Even with 10 dashboard tiles updating serially 100ms at a time, you are still in the 1-2s page load time, on par with the fastest-loading dynamic webpages on the internet.

Programmatic analytics using pymapd

While dashboarding can be very effective for keeping senior management up-to-date, the real value of data is unlocked with more in-depth analytics and segmentation. In my next blog post, I’ll cover how to access MapD using pymapd in Python, doing more advanced visualizations and maybe even some machine learning…


  • RSiteCatalyst Version 1.4.16 Release Notes
  • Using RSiteCatalyst With Microsoft PowerBI Desktop
  • RSiteCatalyst Version 1.4.14 Release Notes
  • RSiteCatalyst Version 1.4.13 Release Notes
  • RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes
  • Self-Service Adobe Analytics Data Feeds!
  • RSiteCatalyst Version 1.4.10 Release Notes
  • WordPress to Jekyll: A 30x Speedup
  • Bulk Downloading Adobe Analytics Data
  • Adobe Analytics Clickstream Data Feed: Calculations and Outlier Analysis
  • Adobe: Give Credit. You DID NOT Write RSiteCatalyst.
  • RSiteCatalyst Version 1.4.8 Release Notes
  • Adobe Analytics Clickstream Data Feed: Loading To Relational Database
  • Calling RSiteCatalyst From Python
  • RSiteCatalyst Version 1.4.7 (and 1.4.6.) Release Notes
  • RSiteCatalyst Version 1.4.5 Release Notes
  • Getting Started: Adobe Analytics Clickstream Data Feed
  • RSiteCatalyst Version 1.4.4 Release Notes
  • RSiteCatalyst Version 1.4.3 Release Notes
  • RSiteCatalyst Version 1.4.2 Release Notes
  • Destroy Your Data Using Excel With This One Weird Trick!
  • RSiteCatalyst Version 1.4.1 Release Notes
  • Visualizing Website Pathing With Sankey Charts
  • Visualizing Website Structure With Network Graphs
  • RSiteCatalyst Version 1.4 Release Notes
  • Maybe I Don't Really Know R After All
  • Building JSON in R: Three Methods
  • Real-time Reporting with the Adobe Analytics API
  • RSiteCatalyst Version 1.3 Release Notes
  • Adobe Analytics Implementation Documentation in 60 Seconds
  • RSiteCatalyst Version 1.2 Release Notes
  • Clustering Search Keywords Using K-Means Clustering
  • RSiteCatalyst Version 1.1 Release Notes
  • Anomaly Detection Using The Adobe Analytics API
  • (not provided): Using R and the Google Analytics API
  • My Top 20 Least Useful Omniture Reports
  • For Maximum User Understanding, Customize the SiteCatalyst Menu
  • Effect Of Modified Bounce Rate In Google Analytics
  • Adobe Discover 3: First Impressions
  • Using Omniture SiteCatalyst Target Report To Calculate YOY growth
  • ODSC webinar: End-to-End Data Science Without Leaving the GPU
  • PyData NYC 2018: End-to-End Data Science Without Leaving the GPU
  • Data Science Without Leaving the GPU
  • Getting Started With OmniSci, Part 2: Electricity Dataset
  • Getting Started With OmniSci, Part 1: Docker Install and Loading Data
  • Parallelizing Distance Calculations Using A GPU With CUDAnative.jl
  • Building a Data Science Workstation (2017)
  • JuliaCon 2015: Everyday Analytics and Visualization (video)
  • Vega.jl, Rebooted
  • Sessionizing Log Data Using data.table [Follow-up #2]
  • Sessionizing Log Data Using dplyr [Follow-up]
  • Sessionizing Log Data Using SQL
  • Review: Data Science at the Command Line
  • Introducing Twitter.jl
  • Code Refactoring Using Metaprogramming
  • Evaluating BreakoutDetection
  • Creating A Stacked Bar Chart in Seaborn
  • Visualizing Analytics Languages With VennEuler.jl
  • String Interpolation for Fun and Profit
  • Using Julia As A "Glue" Language
  • Five Hard-Won Lessons Using Hive
  • Using SQL Workbench with Apache Hive
  • Getting Started With Hadoop, Final: Analysis Using Hive & Pig
  • Quickly Create Dummy Variables in a Data Frame
  • Using Amazon EC2 with IPython Notebook
  • Adding Line Numbers in IPython/Jupyter Notebooks
  • Fun With Just-In-Time Compiling: Julia, Python, R and pqR
  • Getting Started Using Hadoop, Part 4: Creating Tables With Hive
  • Tabular Data I/O in Julia
  • Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob
  • A Beginner's Look at Julia
  • Getting Started Using Hadoop, Part 3: Loading Data
  • Innovation Will Never Be At The Push Of A Button
  • Getting Started Using Hadoop, Part 2: Building a Cluster
  • Getting Started Using Hadoop, Part 1: Intro
  • Instructions for Installing & Using R on Amazon EC2
  • Video: SQL Queries in R using sqldf
  • Video: Overlay Histogram in R (Normal, Density, Another Series)
  • Video: R, RStudio, Rcmdr & rattle
  • Getting Started Using R, Part 2: Rcmdr
  • Getting Started Using R, Part 1: RStudio
  • Learning R Has Really Made Me Appreciate SAS