Getting Started With OmniSci, Part 2: Electricity Dataset

Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion

In my previous MapD post, I loaded electricity data into MapD Community Edition, intentionally ignoring the what of the data to keep that post from being too overwhelming. Now let’s take a step back and explain the dataset, show how to format the data using Python that was loaded into MapD, then use the MapD Immerse UI to build a simple dashboard.

PJM Metered Load Data

I started off my career at PJM doing long-term electricity demand forecasting, to help power engineers do transmission line studies for reliability and to support expansion of the electrical grid in the U.S. Because PJM is a quasi-government agency, they provide over 25 years of hourly electricity usage for the Eastern and Central U.S., both in aggregate and by local power region (roughly, the local power company territories).

However, just because the data is available doesn’t mean it’s convenient, and unfortunately, the data are stored as Excel spreadsheets. This is easily remedied using pandas (v0.22.0, python3.6):

import os
import pandas as pd

#change to directory with files for convenience

#first sheet in workbook contains all info for years 1993-1999
df1993_1999 = [pd.read_excel(str(x) + "-hourly-loads.xls", usecols = "A:Z") for x in range(1993,1999)]

#melt, append df1993-df1999 together
df_melted = pd.DataFrame()
for x in df1993_1999:
    x.columns = df1993_1999[1].columns.tolist()
    x_melt = pd.melt(x, id_vars=['ACTUAL_DATE', 'ZONE_NAME'], var_name = "HOUR_ENDING", value_name = "MW")
    df_melted = df_melted.append(x_melt)

#multiple sheets to concatenate
#too much variation for a one-liner
d2000 = pd.read_excel("2000-hourly-loads.xls", sheet_name = [x for x in range(2,17)], usecols = "A:Z")
d2001 = pd.read_excel("2001-hourly-loads.xls", sheet_name = None, usecols = "A:Z")
d2002 = pd.read_excel("2002-hourly-loads.xls", sheet_name = [x for x in range(1,18)], usecols = "A:Z")
d2003 = pd.read_excel("2003-hourly-loads.xls", sheet_name = [x for x in range(1,19)], usecols = "A:Z")
d2004 = pd.read_excel("2004-hourly-loads.xls", sheet_name = [x for x in range(2,24)], usecols = "A:Z")
d2005 = pd.read_excel("2005-hourly-loads.xls", sheet_name = [x for x in range(2,27)], usecols = "A:Z")
d2006 = pd.read_excel("2006-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2007 = pd.read_excel("2007-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2008 = pd.read_excel("2008-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2009 = pd.read_excel("2009-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2010 = pd.read_excel("2010-hourly-loads.xls", sheet_name = [x for x in range(3,29)], usecols = "A:Z")
d2011 = pd.read_excel("2011-hourly-loads.xls", sheet_name = [x for x in range(3,32)], usecols = "A:Z")
d2012 = pd.read_excel("2012-hourly-loads.xls", sheet_name = [x for x in range(3,33)], usecols = "A:Z")
d2013 = pd.read_excel("2013-hourly-loads.xls", sheet_name = [x for x in range(3,34)], usecols = "A:Z")
d2014 = pd.read_excel("2014-hourly-loads.xls", sheet_name = [x for x in range(3,34)], usecols = "A:Z")
d2015 = pd.read_excel("2015-hourly-loads.xls", sheet_name = [x for x in range(3,40)], usecols = "B:AA")
d2016 = pd.read_excel("2016-hourly-loads.xls", sheet_name = [x for x in range(3,40)], usecols = "B:AA")
d2017 = pd.read_excel("2017-hourly-loads.xls", sheet_name = [x for x in range(3,42)], usecols = "B:AA")
d2018 = pd.read_excel("2018-hourly-loads.xls", sheet_name = [x for x in range(3,40)], usecols = "B:AA")

#loop over dataframes, read in matrix-formatted data, melt to normalized form
for ord in [d2000, d2001, d2002, d2003, d2004, d2005, d2006, d2007, d2008, d2009, d2010,
            d2011, d2012, d2013, d2014, d2015, d2016, d2017, d2018]:

    for key in ord:
        temp = ord[key]
        temp.columns = df1993_1999[1].columns.tolist() #standardize column names
        temp["ACTUAL_DATE"] = pd.to_datetime(temp["ACTUAL_DATE"]) #force datetime, excel reader wonky
        df_melted = df_melted.append(pd.melt(temp, id_vars=['ACTUAL_DATE', 'ZONE_NAME'], var_name = "HOUR_ENDING", value_name = "MW"))

#(4941384, 4)
#130MB as CSV
#remove any dates that are null, artifacts from excel reader
df_melted[pd.notnull(df_melted["ACTUAL_DATE"])].to_csv("hourly_loads.csv", index=False)    

The code is a bit verbose, if only because I didn’t want to spend time to figure out how to programmatically determine how many tabs each workbook has. But the concept is the same each time: read an Excel file, get the data into a dataframe, then convert the data to long form. So instead of having 26 columns (Date, Zone, Hr1-Hr24), we have 4 columns, which is quite frequently a more convenient way to access the data (especially when using SQL).

The final statement writes out a CSV of approximately 4MM rows, the same dataset that was loaded using mapdql in the first post.

Top 10 Usage Days By Season

One of the metrics I used to monitor as part of my job was the top 5/top 10 peak electricity use days per Summer (high A/C usage) and Winter (electric space heating) seasons. Back in those days, I used to use SAS against an enterprise database and the results would come back eventually

Obviously, it’s not a fair comparison to compare today’s GPUs vs. late ’90s enterprise databases in terms of performance, but back then it did take a non-trivial amount of effort to run this query to keep the report updated. With MapD, I can do the same report in ~100ms:

--MapD doesn't currently support window functions, so need to precalculate maximum by day
with qry as (select
max(MW) as daily_max_usage
from hourly_loads
where zone_name = 'MIDATL' and actual_date between '2017-06-01' and '2017-09-30'
group by 1,2)
from hourly_loads as hl
inner join qry on qry.actual_date = hl.actual_date and qry.daily_max_usage =
order by daily_max_usage desc
limit 10;

top 10 electric usage

The thing about returning an answer in 100ms or so is that its fast enough where calling these results from a webpage/dashboard would be very responsive; that’s where MapD Immerse comes in.

Building A Dashboard Using MapD Immerse

Rather than copy/pasting the query in and running it, it’s pretty easy to build an automated report using the Immerse dashboard builder. I’m limited to a single data source because I’m using MapD Community Edition, but in just a few minutes I was able to create the following dashboard:

mapd immerse dashboard

I took the query from above and built a view to encapsulate the query, so I didn’t have to worry about the with statement or joins, I could just use the view as if the results were pre-calculated. From there, adding in a results table and two bar charts was fairly quick (in the same drag-and-drop style of Tableau or other BI/reporting tools).

While this dashboard is pretty rudimentary in its design, were this data source set up as real-time using Apache Kafka or similar, this chart would always be up-to-date for use on a TV screen or as a browser bookmark without any additional data or web engineering.

Obviously, many dashboarding tools exist, but its important to note that no pre-aggregation or column indexing or other standard database performance tricks are being employed (outside of specialized hardware and fast GPU RAM caching). Even with 10 dashboard tiles updating serially 100ms at a time, you are still in the 1-2s page load time, on par with the fastest-loading dynamic webpages on the internet.

Programmatic analytics using pymapd

While dashboarding can be very effective for keeping senior management up-to-date, the real value of data is unlocked with more in-depth analytics and segmentation. In my next blog post, I’ll cover how to access MapD using pymapd in Python, doing more advanced visualizations and maybe even some machine learning…

RSiteCatalyst Version 1.4.14 Release Notes

Like the last several updates, this blog post will be fairly short, given only a single bug fix was added.

Thanks again to GitHub user leocwlau who reported that the GetReportSuiteGroups function added an additional field AND provided the solution. No other bug fixes were made, nor was any additional functionality added.

Version 1.4.14 of RSiteCatalyst was submitted to CRAN today and should be available for download in the coming days.

Community Contributions

As I’ve mentioned in many a blog post before this one, I encourage all users of the software to continue reporting bugs via GitHub issues, and especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug will be addressed in a timely manner and for the benefit to others in the community.

Note: Please don’t email directly via the email in the RSiteCatalyst package, it will not be returned. Having a valid email contact in the package is a requirement to have a package listed on CRAN so they can contact the package author, it is not meant to imply I can/will provide endless, personalized support for free.

Getting Started With OmniSci, Part 1: Docker Install and Loading Data

It’s been nearly five years since I wrote about Getting Started with Hadoop for big data. In those years, there have been incremental improvements in columnar file formats and dramatic computation speed improvements with Apache Spark, but I still wouldn’t call the Hadoop ecosystem convenient for actual data analysis. During this same time period, thanks to NVIDIA and their CUDA library for general-purpose calculations on GPUs, graphics cards went from enabling visuals on a computer to enabling massively-parallel calculations as well.

Building upon CUDA is MapD, an analytics platform that allows for super-fast SQL queries and interactive visualizations. In this blog post, I’ll show how to use Docker to install MapD Community Edition and load hourly electricity demand data to analyze.

Installing MapD CE using Docker/nvidia-docker

While CUDA makes it possible to do calculations on GPUs, I wouldn’t go as far as to say it is easy, including just getting everything installed! Luckily, there is Docker and nvidia-docker, which provide all-in-one containers with all necessary drivers and libraries installed to build upon. MapD provides instructions for installing MapD CE using nvidia-docker, with the main installation command as follows:

##nvidia-docker version 2
docker run --runtime=nvidia \
-v $HOME/mapd-docker-storage:/mapd-storage \
-p 9090-9092:9090-9092 \

When you kickoff this command (I’m using a ssh terminal into a remote Ubuntu desktop), Docker will download all the required images from the mapd/mapd-ce-cudarepository and start a background process for the MapD database and the Immerse visualization interface/web server:

docker images

Once all of the images are downloaded, you can find the container that was created using docker container ls, then run docker exec -it <container id> bash to start the container and drop you into a Bash shell (on the container). From this point, MapD Community Edition will be running!

docker ls

Loading Data Using the Immerse Interface

Once the Bash shell opens in the terminal, you can now interact with MapD via the Docker container. However, for beginning exploration, it’s much simpler to use the Immerse Web Interface at localhost:9092:

mapd immerse

Uploading data via the Data Manager interface is reasonably performant for smaller files; a test file with four columns and million or so rows loaded in a few seconds (dependent on your upload speed, obviously):

mapd data manager

Edit the column names and types if you want (the CSV reader gets it right for me most of the time). Then, once the ‘Save Table’ button is clicked, MapD will import the CSV data into a columnar binary format, so that the GPU can operate directly on the data rather than reading from the CSV each query.

Loading Data Using the Command Line

While browser GUIs are great for some things, I’m still very much a command-line guy, at least for things like loading data. MapD provides the mapdql interface to load data and query, very much like psql for Postgres and other databases. To load my 4.9 million * 4 column dataset, I used the following commands:

$ docker container ls
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                              NAMES
51be4b888448        mapd/mapd-ce-cuda   "/bin/sh -c '/mapd/s…"   44 hours ago        Up 44 hours>9090-9092/tcp   nifty_heisenberg

$ docker exec -it 51be4b888448 bash
root@1f64b2dcc316:/mapd# bin/mapdql
Password: <default is "HyperInteractive">
User mapd connected to database mapd

mapdql> create table hourly_loads(
..> MW FLOAT);

mapdql> copy hourly_loads from 'hourly_loads.csv';
Loaded: 4898472 recs, Rejected: 0 recs in 0.923000 secs

The DDL for MapD seems pretty much the same as every other database language. First you define a table’s columns and their types, then you can use the copy command to load data from a CSV. The statement that prints upon success begins to give an indication of the speed MapD provides, loading nearly 5 million records in less than a second.

Simplistic Query Performance

Up this point, I’ve intentionally not described the data I uploaded into MapD; in my next post, I’ll cover the dataset I’m using and how I converted the data from Excel spreadsheets into a CSV. But before ending this post, I wanted to show a brief summary of the performance of MapD:

mapd query speed

The first query shows a simple record count by the hour_ending dimension in my table, something you might run if you weren’t too familiar with the table. You’ll notice that running this group by across the 5 million row dataset took 5143ms, which isn’t so fast. What’s going on?

Because this is the first query from a cold start, MapD needs to load data into GPU RAM. So while the first query takes a few seconds, the second query displays a warmed-up level of performance: 212ms to scan 5 million rows, filter by a few values of the zone_name column, then grouping by hour_ending. For reference, a human blink takes 100-400 ms, so this second query quite literally finished in the blink of an eye…

Dashboards, Streaming Data and more…

This first blog post just scratched the surface on what is possible using just the Community Edition of MapD. In future blog posts, I will provide the code to create the dataset, do some basic descriptive statistics, and even do some analysis and dashboarding of historical electricity demand.

Update, 2/1/2018 4:49 p.m.

Per Todd Mostak from MapD, the second query would likely even run faster than 212ms, had I run it again:

  • Using RSiteCatalyst With Microsoft PowerBI Desktop
  • RSiteCatalyst Version 1.4.14 Release Notes
  • RSiteCatalyst Version 1.4.13 Release Notes
  • RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes
  • Self-Service Adobe Analytics Data Feeds!
  • RSiteCatalyst Version 1.4.10 Release Notes
  • WordPress to Jekyll: A 30x Speedup
  • Bulk Downloading Adobe Analytics Data
  • Adobe Analytics Clickstream Data Feed: Calculations and Outlier Analysis
  • Adobe: Give Credit. You DID NOT Write RSiteCatalyst.
  • RSiteCatalyst Version 1.4.8 Release Notes
  • Adobe Analytics Clickstream Data Feed: Loading To Relational Database
  • Calling RSiteCatalyst From Python
  • RSiteCatalyst Version 1.4.7 (and 1.4.6.) Release Notes
  • RSiteCatalyst Version 1.4.5 Release Notes
  • Getting Started: Adobe Analytics Clickstream Data Feed
  • RSiteCatalyst Version 1.4.4 Release Notes
  • RSiteCatalyst Version 1.4.3 Release Notes
  • RSiteCatalyst Version 1.4.2 Release Notes
  • Destroy Your Data Using Excel With This One Weird Trick!
  • RSiteCatalyst Version 1.4.1 Release Notes
  • Visualizing Website Pathing With Sankey Charts
  • Visualizing Website Structure With Network Graphs
  • RSiteCatalyst Version 1.4 Release Notes
  • Maybe I Don't Really Know R After All
  • Building JSON in R: Three Methods
  • Real-time Reporting with the Adobe Analytics API
  • RSiteCatalyst Version 1.3 Release Notes
  • Adobe Analytics Implementation Documentation in 60 Seconds
  • RSiteCatalyst Version 1.2 Release Notes
  • Clustering Search Keywords Using K-Means Clustering
  • RSiteCatalyst Version 1.1 Release Notes
  • Anomaly Detection Using The Adobe Analytics API
  • (not provided): Using R and the Google Analytics API
  • My Top 20 Least Useful Omniture Reports
  • For Maximum User Understanding, Customize the SiteCatalyst Menu
  • Effect Of Modified Bounce Rate In Google Analytics
  • Adobe Discover 3: First Impressions
  • Using Omniture SiteCatalyst Target Report To Calculate YOY growth
  • ODSC webinar: End-to-End Data Science Without Leaving the GPU
  • PyData NYC 2018: End-to-End Data Science Without Leaving the GPU
  • Data Science Without Leaving the GPU
  • Getting Started With OmniSci, Part 2: Electricity Dataset
  • Getting Started With OmniSci, Part 1: Docker Install and Loading Data
  • Parallelizing Distance Calculations Using A GPU With CUDAnative.jl
  • Building a Data Science Workstation (2017)
  • JuliaCon 2015: Everyday Analytics and Visualization (video)
  • Vega.jl, Rebooted
  • Sessionizing Log Data Using data.table [Follow-up #2]
  • Sessionizing Log Data Using dplyr [Follow-up]
  • Sessionizing Log Data Using SQL
  • Review: Data Science at the Command Line
  • Introducing Twitter.jl
  • Code Refactoring Using Metaprogramming
  • Evaluating BreakoutDetection
  • Creating A Stacked Bar Chart in Seaborn
  • Visualizing Analytics Languages With VennEuler.jl
  • String Interpolation for Fun and Profit
  • Using Julia As A "Glue" Language
  • Five Hard-Won Lessons Using Hive
  • Using SQL Workbench with Apache Hive
  • Getting Started With Hadoop, Final: Analysis Using Hive & Pig
  • Quickly Create Dummy Variables in a Data Frame
  • Using Amazon EC2 with IPython Notebook
  • Adding Line Numbers in IPython/Jupyter Notebooks
  • Fun With Just-In-Time Compiling: Julia, Python, R and pqR
  • Getting Started Using Hadoop, Part 4: Creating Tables With Hive
  • Tabular Data I/O in Julia
  • Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob
  • A Beginner's Look at Julia
  • Getting Started Using Hadoop, Part 3: Loading Data
  • Innovation Will Never Be At The Push Of A Button
  • Getting Started Using Hadoop, Part 2: Building a Cluster
  • Getting Started Using Hadoop, Part 1: Intro
  • Instructions for Installing & Using R on Amazon EC2
  • Video: SQL Queries in R using sqldf
  • Video: Overlay Histogram in R (Normal, Density, Another Series)
  • Video: R, RStudio, Rcmdr & rattle
  • Getting Started Using R, Part 2: Rcmdr
  • Getting Started Using R, Part 1: RStudio
  • Learning R Has Really Made Me Appreciate SAS