Similar to my last post about needing to merge shapefiles using Postgis, I recently downloaded a bunch of energy data from the federal government. 13,370 files to be exact. While the data size itself isn’t that large (~8GB, compressed), an open-source tool I was looking to evaluate only supports gzip compression instead of the zip compressed files I actually had.
While I could’ve used this opportunity to merge the files together into one and do all the data cleaning, I became obsessed with figuring out how to just switch the compression scheme. Here’s the one-liner that emerged:
As a one-liner, it’s not the hardest to comprehend what’s going on, but it’s also not the most intuitive. The key idea here is that once we find all of the zip files, we can unzip/gzip the files in parallel. Note that this works because each process is independent from the other; a single file itself is being unzipped and then gzipped, we’re not unzipping and gzipping a single file in parallel. Just that multiple single-threaded processes are being kicked off at once instead of leaving the other cores in the CPU idle.
Once the unzip-to-gzip process has occurred, then I delete the original zip file. So for the most part, this process can be considered to take constant disk space (if you ignore that 4-8 files are being processed at one time).
Like many one-liners, this took longer to figure out than was actually worth the time savings. But such is life, and now it’s available in the wild for the next person who wonders how to do something like this!
Recently I’ve been doing a fair bit of work with geospatial data, mostly on the data preparation side. While there are common data formats, I have found that because so much of this data are sourced from government agencies, the data are often in many files that can be concatenated.
In this example, I will show how to take a few dozen county-level shapefiles of parcel data from Utah and load it into a single table in Postgres/Postgis.
While it may have been possible to use wget or curl to download every shapefile, they are stored within Google Drive with a bunch of hashed URLs, so I just clicked on each file instead of trying to be clever. So if you want to follow along with this blog post exactly, you’ll need to download the 25 zip files of Utah shapefiles:
Step 2: Bulk Unzip
With all of these files in the same directory at the same level (i.e. no subfolders), it’s pretty easy to bulk unzip the files, with one caveat: to move the contents of the unzipped files into a new directory, you need to use the -d flag:
The reason I created a new directory (mkdir) and then unzipped the files into a new directory is that when doing analysis, I always like to keep the source data separate, so that I always have the option of starting completely over. It also can make regular expression globs easier :)
Step 3: Creating Postgis Table Definition
After all of the county zip files are unzipped, you get 25 sub-directories structured like the following:
The .shp files from the 25 counties all have the same format, which is very convenient. In this step, we can use the shp2pgsql utility that comes with Postgis to read a shapefile, determine the proper schema, then create the table in the database:
The key flag here is -p, which means ‘prepare mode’; the shapefile will get read, a table created, but no data loaded. By not loading the data in this step, it makes looping over the files easier later, as no special logic is required to keep the Parcels_Beaver_LIR.shp from being duplicated in Postgis (because it was never loaded in the first place).
Step 4: Bulk Loading Shapefiles into Postgis
The last steps of the loading process are to 1) get all of the shapefile locations and 2) feed them to shp2pgsql:
To get all of the shapefile locations, I use find with flags -type f (files type) and name to search for the pattern within the directory. This command goes through the entire set of subdirectories and gets all the .shp files. From there, I iterate over the list of files using for i in..., then pass the value of $i into a similar shp2pgsql as above. However, rather than using flag -p for ‘prepare’, we are now going to use flag -a for ‘append’. This will perform an INSERT INTO utahlirparcels() statement for Postgres, loading in the actual data from the 25 shapefiles.
Spend Time Now To Save Time Later
Like so much of shell scripting, figuring out these commands took longer than I would’ve expected. Certainly, they took longer to figure out than it would’ve taken to copy-paste a shp2pgsql 25 times! But by taking the time upfront to figure out a generic method of looping over shapefiles, the next time (and every time after that) I find myself needing to do this, this code will be available to load multiple shapefiles into Postgis.
With pretty regular frequency I get emails asking if RSiteCatalyst can be used with Microsoft Power BI. While admittedly I’m not a frequent user of the Windows operating system (nor dashboarding tools like Tableau or Power BI), I am pleased to report that it is fact possible to call the Adobe Analytics API with Power BI via RSiteCatalyst!
Step 1: Call Adobe Analytics API Using Get Data Menu
The majority of getting RSiteCatalyst to work within Power BI desktop is getting the R script correct. From the Get Data menu, choose the More... menu option to bring up all of the data import tools that Power BI defines:
Once you choose ‘R Script’, an input box will open where you can place your RSiteCatalyst function calls:
After hitting ‘OK’, Power BI will evaluate your R code, determining which statements return a data.frame (which is the only allowable data structure imported into Power BI). You can choose which data.frame(s) you want to import from the ‘Navigator’ window:
Once you hit ‘OK’, Power BI imports the data and you can use your Adobe Analytics data just as you would in R with RSiteCatalyst (or, like any other data source like CSV or database…)
While it’s possible to call RSiteCatalyst through Power BI, there are some limitations to keep in mind.
First, RSiteCatalyst will only work with Microsoft Power BI Desktop, which is installed locally on your machine. The Power BI Service, which is more of a shared dashboard/data store environment, does not allow external API calls as part of its security model. So while you can analyze your data locally, you cannot share dashboards to the Power BI Service.
The second limitation I’ve noticed is that Power BI doesn’t read from from a .Renviron file (at least, not from the default Windows location that R GUI reads). So you will need to place your credentials directly in the R script, which is never really ideal (though, may not be a big deal all things considered).
Finally, the R script runs synchronously, so when placing multiple calls in the same R script you will need to wait for all of the data.frame results before you can use them within Power BI. This is the same default behavior within R, sans using Promises or parallelism of some sort, but it’s still important to keep in mind.
Dashboards, Dashboards, Dashboards!
With a few minutes work, I was able to create this rudimentary dashboard (R code):
Someone with more interesting/higher volume data could surely do better. But the most important thing in my opinion is that Microsoft has built an awesome integration with R and that creating dashboards in Power BI is waaaaaay easier than the last time I tried to create a dashboard using Excel and the Adobe Report Builder plugin.