A Million Text Files And A Single Laptop


More often than I would like, I receive datasets where the data has only been partially cleaned: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough that using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampling transactions (with replacement), and the Python Faker (fake-factory) package to generate fake customer profiles and for creating the 1MM+ text files:

#R Code
library(arules)
data("Groceries")
write(Groceries, "groceries.txt", sep = ",")

#Python Code
import csv
import random

import pandas as pd
from faker import Faker

fake = Faker()

# Create customer file of 1,234,567 customers with fake data
# Use dataframe index as a way to generate unique customer id
customers = [fake.simple_profile() for x in range(0,1234567)]
customer_df = pd.DataFrame(customers)
customer_df["cust_id"] = customer_df.index

#Read in transactions file exported from the arules package (groceries.txt, written by the R code)
with open("groceries.txt") as f:
    transactions = f.readlines()

#Remove trailing new line character
transactions = [x.rstrip("\n") for x in transactions]

#Generate transactions by cust_id

#file format:

#for each customer...
for i in range(0,1234567):
    #...create a file...
    with open('/transactions/custfile_%s' % i, 'w') as csvfile:
        trans = csv.writer(csvfile, delimiter=' ', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        #...that contains all of the transactions they've ever made
        for j in range(1, random.randint(1,365)):
            trans.writerow([i, fake.zipcode(), fake.date_time_this_decade(before_now=True, after_now=False), transactions[random.randint(0,len(transactions) - 1)]])

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to almost anyone who has ever opened up a Terminal window. Take some or all of the files in a folder and concatenate them together into one big file. But something funny happens once you get enough files…

$ cat * >> out.txt

-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, the operating system caps the total size of the argument list (plus environment variables) that can be handed to a new process; you can see the limit with getconf ARG_MAX. The asterisk in the cat command is expanded by the shell before cat runs, so the above statement tries to pass 1,234,567 file names to cat and you get an error message.
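To make the limit concrete, here is a minimal Python sketch (not part of the original workflow) that roughly estimates how many bytes the expanded file names would occupy and looks up the system limit. The custfile_%s naming comes from the generation script above; the helper names argv_bytes and arg_max are made up for illustration, and the estimate ignores environment variables and any per-argument overhead the kernel adds.

```python
import os


def argv_bytes(names):
    """Rough size of an argument list: each name plus its NUL terminator."""
    return sum(len(os.fsencode(name)) + 1 for name in names)


def arg_max(default=None):
    """Return the system's ARG_MAX, or `default` where it cannot be queried."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on some platforms, or the name is unknown
        return default


if __name__ == "__main__":
    # Same naming scheme as the generated transaction files
    names = ["custfile_%s" % i for i in range(1234567)]
    print("bytes needed for file names:", argv_bytes(names))
    print("ARG_MAX on this system:", arg_max())
```

If the first number is larger than the second, a plain cat * cannot work, no matter how the files are named.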

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The $f argument in the code is to highlight that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown below (graph code, Julia):

Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say exactly where the parallelism gets maxed out, but at roughly 10 concurrent jobs, there’s no additional benefit. It’s also interesting to point out what the -m argument represents; by specifying -m, you let parallel insert multiple arguments (i.e. multiple text files) into each cat command, as many as the command-line length permits, rather than starting one cat per file. This alone leads to an 8x speedup over the naive loop solution.
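The effect of -m can be imitated without GNU Parallel. The sketch below is a serial Python illustration, assuming a Unix-like system with cat on the PATH; the function names and the batch size of 1,000 are illustrative choices, not anything parallel does internally. It starts one cat per batch of file names instead of one per file, which keeps each command line short while avoiding a process launch for every file.

```python
import subprocess
from itertools import islice
from pathlib import Path


def batched(names, size):
    """Yield successive lists of at most `size` names."""
    it = iter(names)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def concatenate_in_batches(src_dir, out_path, batch_size=1000):
    """Append every file in src_dir to out_path, one `cat` call per batch.

    out_path should live outside src_dir. Returns the number of `cat`
    processes that were started.
    """
    names = sorted(p.name for p in Path(src_dir).iterdir() if p.is_file())
    calls = 0
    with open(out_path, "wb") as out:
        for batch in batched(names, batch_size):
            # Equivalent to: cat file1 file2 ... fileN >> out_path
            subprocess.run(["cat", *batch], cwd=src_dir, stdout=out, check=True)
            calls += 1
    return calls
```

With 1,234,567 files and a batch size of 1,000, this launches 1,235 processes rather than more than a million.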

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).
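The “chunked” idea is simple enough to sketch in plain Python. The generator below groups lines into blocks of roughly a fixed size, so only one block needs to be in memory at a time. iter_line_blocks and count_lines are hypothetical helpers written for this illustration, and the 1,000,000-byte default is just a convenient round number.

```python
def iter_line_blocks(lines, block_bytes=1_000_000):
    """Yield lists of whole lines, each list roughly block_bytes in size."""
    block, size = [], 0
    for line in lines:
        block.append(line)
        size += len(line)
        if size >= block_bytes:
            yield block
            block, size = [], 0
    if block:
        # Whatever is left over after the last full block
        yield block


def count_lines(path, block_bytes=1_000_000):
    """Count lines in a file without ever holding the whole file in RAM."""
    total = 0
    with open(path, "rb") as f:  # a file object iterates line by line
        for block in iter_line_blocks(f, block_bytes):
            total += len(block)
    return total
```

Any per-block computation can replace len(block) here, as long as the partial results can be combined afterwards.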

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1: Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the [tr](http://www.linfo.org/tr.html) (transliterate) utility, we can map our data to one product per row as we stream over the file:

# Serial method (i.e. no parallelism)
# This is a simple implementation of map & reduce; tr statements represent one map, sort -u statements one reducer

# cut -d ' ' -f 5- transactions.csv | \     - Using cut, take everything from the 5th column onward from the transactions.csv file
# tr -d \" | \                              - Using tr, trim off double-quotes. This leaves us with a comma-delimited string of products representing a transaction
# tr ',' '\n' | \                           - Using tr, replace each comma with a newline, giving one product per line
# sort -u | \                               - Using sort, put similar items together, but only output the unique values
# wc -l                                     - Count number of unique lines, which after de-duping, represents number of unique products

$ time cut -d ' ' -f 5- transactions.csv | tr -d \" | tr ',' '\n' | sort -u | wc -l

real	292m7.116s

# Parallelized version, default chunk size of 1MB. This will use 100% of all CPUs (real and virtual)
# Also map & reduce; tr statements a single map, sort -u statements multiple reducers (8 by default)

$ time cut -d ' ' -f 5- transactions.csv | tr -d \" | tr ',' '\n' | parallel --pipe --block 1M sort -u | sort -u | wc -l

# block size performance - Making block size smaller might improve performance
# Number of jobs can also be manipulated (not evaluated)
# --500K:               73m57.232s
# --Default 1M:         75m55.268s (3.84x faster than serial)
# --2M:                 79m30.950s
# --3M:                 80m43.311s

The trick here is that we swap the commas that delimit the products in each transaction with newline characters; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using sort -u to de-dup the list and wc -l to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!
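The same map-and-reduce shape can be written out in Python, which may make the two sort -u stages easier to see: de-dup within each block first, then de-dup across blocks. This is a sketch rather than a replacement for the pipeline above. It assumes the four-field, space-delimited layout produced by the generation script (customer id, zip code, quoted timestamp, quoted product list); the function names and sample rows are invented, and the map step runs serially here, although it could be handed to something like multiprocessing.Pool.map.

```python
import csv
from functools import reduce
from itertools import islice


def chunks(lines, n):
    """Yield lists of at most n lines."""
    it = iter(lines)
    while True:
        block = list(islice(it, n))
        if not block:
            return
        yield block


def unique_products_in_chunk(block):
    """Map step: the set of product names seen in one block of lines."""
    products = set()
    for row in csv.reader(block, delimiter=" ", quotechar='"'):
        if len(row) >= 4:
            # Fourth field is the comma-delimited product list
            products.update(row[3].split(","))
    return products


def count_unique_products(lines, lines_per_chunk=10000):
    """Reduce step: union the per-block sets and count what is left."""
    partial_sets = map(unique_products_in_chunk, chunks(lines, lines_per_chunk))
    return len(reduce(set.union, partial_sets, set()))


if __name__ == "__main__":
    sample = [
        '0 12345 "2015-01-02 03:04:05" "whole milk,rolls/buns"\n',
        '0 12345 "2015-02-03 04:05:06" yogurt\n',
        '1 54321 "2015-01-02 10:00:00" "whole milk,yogurt,citrus fruit"\n',
    ]
    print(count_unique_products(sample, lines_per_chunk=2))
```

Because each block is reduced to a small set before anything is combined, memory use depends on the number of distinct products, not on the size of the file.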

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL Group By on the date and sum the rows:

# Data is at transaction level, so just need to do equivalent of 'group by' operation
# Using cut again, we choose field 3, which is the date part of the timestamp
# sort | uniq -c is a common pattern for doing a 'group by' count operation
# Final tr step is to trim the leading quotation mark from date string

time cut -d ' ' -f 3 transactions.csv | sort | uniq -c | tr -d \"

real	76m51.223s

# Parallelized version
# Quoting can be annoying when using parallel, so writing a Bash function is often much easier than dealing with escaping quotes
# To do 'group by' operation using awk, need to use an associative array
# Because we are doing parallel operations, need to pass awk output to awk again to return final counts

awksub () { awk '{a[$3]+=1;}END{for(i in a)print i" "a[i];}';}
export -f awksub
time parallel --pipe awksub < transactions.csv | awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | tr -d \" | sort

real	8m22.674s (9.05x faster than serial)

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up from another blog post).
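The two-stage awk logic (count per chunk, then add up the per-chunk counts) looks like this as a Python sketch. It assumes the timestamp field is written as YYYY-MM-DD HH:MM:SS, so the first ten characters are the date; the function names and sample rows are hypothetical, and the per-block step is run serially for clarity.

```python
import csv
from collections import Counter


def count_by_day(block):
    """First pass: transactions per date within one block of lines."""
    counts = Counter()
    for row in csv.reader(block, delimiter=" ", quotechar='"'):
        if len(row) >= 3:
            counts[row[2][:10]] += 1  # keep only the YYYY-MM-DD part
    return counts


def transactions_per_day(blocks):
    """Second pass: add the partial counts together, like the final awk."""
    total = Counter()
    for partial in map(count_by_day, blocks):
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total


if __name__ == "__main__":
    blocks = [
        ['0 12345 "2015-01-02 03:04:05" "whole milk,rolls/buns"\n',
         '0 12345 "2015-02-03 04:05:06" yogurt\n'],
        ['1 54321 "2015-01-02 10:00:00" "whole milk,yogurt,citrus fruit"\n'],
    ]
    for day, n in sorted(transactions_per_day(blocks).items()):
        print(day, n)
```

The second pass is needed because the same date can appear in more than one block, which is exactly why the awk output is piped into awk again.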

Q3. Total items Per store, Per month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.

It may be possible that one of you out there knows how to do this correctly, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup, but saving seven minutes per run isn’t worth spending hours trying to find the optimal settings.
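For readers who want to see what a serial, streaming version of this aggregation could look like, here is a rough Python sketch. It is not the command used for the timings above. It assumes that the zip code field stands in for the store, that “items” means the number of comma-separated products on a row, and that timestamps are written as YYYY-MM-DD HH:MM:SS; items_per_store_month is a made-up name.

```python
import csv
from collections import defaultdict


def items_per_store_month(lines):
    """Sum the number of items per (store, month) while streaming over rows."""
    totals = defaultdict(int)
    for row in csv.reader(lines, delimiter=" ", quotechar='"'):
        if len(row) < 4:
            continue  # skip malformed or empty rows
        _cust_id, store, timestamp, products = row[:4]
        month = timestamp[:7]  # "YYYY-MM"
        totals[(store, month)] += len(products.split(","))
    return dict(totals)


if __name__ == "__main__":
    sample = [
        '0 12345 "2015-01-02 03:04:05" "whole milk,rolls/buns"\n',
        '0 12345 "2015-02-03 04:05:06" yogurt\n',
        '1 54321 "2015-01-02 10:00:00" "whole milk,yogurt,citrus fruit"\n',
    ]
    for key, n in sorted(items_per_store_month(sample).items()):
        print(key, n)
```

Passing an open file object as lines keeps memory use proportional to the number of store and month combinations rather than to the file size.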

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.

21st Century C: Error 64 on OSX When Using Make

To end 2015, I decided to finally learn C, instead of making it a 2016 resolution! I had previously done the ‘Learn C The Hard Way’ tutorials, taken about a year off, and thus forgotten everything.

Rather than re-do the same material, I decided to get ’21st Century C’ from O’Reilly and work through that. Unfortunately, there is an error/misprint in the very beginning chapters that makes doing the exercises near impossible on OSX. This error manifests itself as c99: invalid argument 'all' to -W Error 64. If you encounter this error on OSX (I’m using OSX 10.11.2 El Capitan as of writing this post), here are three methods for fixing the issue.

Error 64!

When the discussion of using Makefiles begins on page 15, there is a discussion of the “smallest practicable makefile”, which is just six lines long:

CFLAGS= -g -Wall -O3
CC=c99

$(P): $(OBJECTS)

Unfortunately, this doesn’t quite work on OSX. Page 11 in the book sort-of references that a fix is needed, but the directions aren’t so clear…

Error 64, solution 1: Book Fix, updated

To use the book fix, you are supposed to:

  1. Create a file named c99
  2. Put the line gcc -std=c99 $* OR clang $* in the c99 file
  3. Add the file to your PATH in Terminal (such as export PATH="/Users/computeruser:$PATH" if the c99 file were located in /Users/computeruser directory)
  4. Run chmod +x c99 on the file to make it executable

Once you add this work-around to your PATH, then open a fresh Terminal window (or run source .bash_profile to refresh the Bash settings), you should be able to use Make to compile your C code.
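If you would rather script the first, second and fourth steps than do them by hand, something like the following Python sketch would work. The function name write_c99_wrapper is invented, and the #!/bin/sh first line is an addition that the steps above do not mention; the PATH change still has to be made in your shell profile.

```python
import stat
from pathlib import Path


def write_c99_wrapper(directory, command="gcc -std=c99 $*"):
    """Create an executable file named c99 in `directory` and return its path."""
    path = Path(directory) / "c99"
    # The shebang line is an assumption added for this sketch
    path.write_text("#!/bin/sh\n" + command + "\n")
    # Same effect as: chmod +x c99
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


if __name__ == "__main__":
    wrapper = write_c99_wrapper(Path.home())
    print("wrote", wrapper, "- now add its directory to your PATH")
```

Passing command="clang $*" gives the Clang variant of the same wrapper.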

But to be honest, this seems like a really weird “fix” to me, as it overrides the C compiler settings for any program run via Terminal. I prefer one of two alternate solutions.

Error 64, solution 2: Makefile Change

As I was researching this, a helpful Twitter user noted:

When you switch the ‘c99’ reference to just ‘cc’ in the Makefile, everything works fine. Here’s the subtly different, corrected Makefile:

CFLAGS= -g -Wall -O3
CC=cc

$(P): $(OBJECTS)

Error 64, solution 3: Switch to Clang

The final solution I came across: rather than using the GCC compiler, you can use an alternate compiler called Clang, which is also generally available on OSX (especially with XCode installed). Like solution 2 above, the Makefile is just subtly different:

CFLAGS= -g -Wall -O3
CC=clang

$(P): $(OBJECTS)

Whether to use GCC or Clang as your compiler is really beyond the scope of this blog post; as 21st Century C discusses, it really shouldn’t matter (especially when you are just learning the mechanics of the language).

Error 64, Be Gone!

There’s not really much more to say at this point; this blog post is mainly documentation for anyone who comes across this error in the future. I’ve continued on through the book using Clang, but suffice to say, it’s not the compiler that writes poor-quality, non-compiling code, it’s the user. Ah, the fun of learning 🙂

Four Tactics For Well Thought Out Business Requirements

One of the most common issues in business (especially large corporations) is trying to nail down the requirements for a given analysis request. The “business people” on the front-lines are talking to their higher-ups about what they think are important questions for the business to solve, but by the time the question gets to the analyst or developer, it sounds something like:

It would be interesting to model using SAS how our customers shop for our merchandise by channel and what overlaps there are between demographics, geography, product type and tenure. But we also have to timebox this, we can’t be boiling-the-ocean just looking for needles-in-a-haystack.

Say WHAT? Mr. Business Person, I cannot help you if you do not run that mess through Unsuck-It first.

In all seriousness, I’ve found there are a few great ways for an analyst to refine a “question” like the one above into an actionable plan of attack. So the next time you get a jargon-filled, completely generic analysis request such as the one above, try these four tactics.

1. All Requests Should Be Phrased In The Form Of A Question

The first thing to notice about the mock interaction above is that there are no question marks; it’s not a question! For an analyst or developer to work effectively, questions need to be presented, not bland statements. For example, a series of refining questions from the analyst might include:

  • You need a model? What type of model? Do you mean a predictive model, a decision tree for understanding, a PivotTable for you to poke at, a one-page PowerPoint slide to give your boss?
  • You specified four attributes (demographics, geography, product type and tenure). Do you have a hypothesis around these attributes (or are you just brain-blabbing)?
  • What is meant by “shop”? Do you mean how do customers browse our goods online and in stores, the purchase cycle, what goods are frequently purchased together or something else?

Note that in all three of the refinement questions above, you are taking a generic idea and really drilling into what is needed. It is the analyst that is the expert in the techniques for analyzing data, so the analyst should be helping the business person to take a raw analysis request and make it into answerable questions.

2. Separate The Tools From The Question

The second thing to notice in the mock interaction above is the statement “using SAS”. I didn’t write that to pick on SAS, but rather, this exact statement was said to me early in my career. I had a boss who would try and guess which tool was appropriate for the question he was asking. I presume that he was trying to gauge how hard he thought the problem was, or try to signal to me how hard he thought the problem was. In the end, a plain SQL query with the results copied into an Excel table was all that was necessary.

As the analyst, confirm whether the tool is actually part of the deliverable. Meaning, if you need to deliver a Tableau workbook, ok, specifying “use Tableau” is an important part of the business question. But if the requirement is “production-quality visualizations”, Tableau may or may not be the right tool or might just be one part of a larger workflow.

3. Every Question Is Interesting To Someone. Solve The Valuable Ones.

Paraphrasing the aphorism “The path to hell is paved with good intentions”, the path to doing low-value work your entire career is answering questions that start “Wouldn’t it be interesting if…”.

The basis for these statements is often a tangent in another meeting, where high-level executives think there is information that should just be available at everyone’s fingertips. But if you were to ask “What business action would you take if you knew this piece of information?” or “Is it worth me stopping a project worth $1 million in Pre-Tax Profit per month to answer this for you?”, you’ll suddenly find the question becomes a lot less interesting.

So always have estimates of the business impact of what you are currently working on and ask for the same estimate from those who ask for your time. Projects that are valuable to the business are “interesting”; everything else is just making work for other people.

4. Don’t Just Solve The Stated Question. Solve The Unstated Question Too.

Finally, when I read the mock interaction above, there are actually two questions:

  • Stated: Do we understand our customers’ purchasing behaviors?
  • Unstated: How do we optimize our business to take into account our customers’ purchasing behaviors?

For sure, a deep understanding of the customer base is important no matter the product. But the unstated question of “What are we going to do about it?” is so much more valuable to answer (i.e. tactic #3).

So even if the refined question becomes ‘Build a customer segmentation based on past purchases’, go one step further and figure out how to implement your findings. Create a test plan for increasing email click-through rates based on the segments or optimize your display bidding, maybe build a recommender system for your website…implementation of new ideas is always going to be more valuable than just analyzing the past.

Always Be Assertive.

If the key to sales is “Always Be Closing”, the key to quality analysis is “Always Be Assertive”. Ask questions. Make people think about what they are doing, what they ask of others and what can be done to improve the business. It’s a rare, ego-centric co-worker who doesn’t appreciate collaborating to get to a better quality question (and answer!) than they originally started with.

Being able to read into what other people are asking for, estimating its value, then delivering more than they even knew they were asking for has helped me tremendously throughout my career. Hopefully by doing some or all of the tactics above, you’ll see a marked improvement in your analysis and career as well!
