A Million Text Files And A Single Laptop

GNU Parallel Cat Unix

Wait…What? Why?

More often that I would like, I receive datasets where the data has only been partially cleaned, such as the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having being generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces 2) the data in aggregate are too large to hold in RAM but 3) the data are small enough where using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampls ing transactions (with replacement), and the Python Faker (fake-factory) package to generate fake customer profiles and for creating the 1MM+ text files.

The contents of the data itself isn’t important for this blog post, but the data generation code is posted as a GitHub gist should you want to run these commands yourself.

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder, concatenate them together….one big file. But something funny happens once you get enough files…

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, many Unix tools will only accept about 10,000 arguments; the use of the asterisk in the `cat` command gets expanded before running, so the above statement passes 1,234,567 arguments to `cat` and you get an error message.

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The `$f` argument in the code is to highlight that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown below (graph code, Julia):

Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say exactly where the parallelism gets maxed out, but at roughly 10 concurrent jobs, there’s no additional benefit. It’s also interesting to point out what the `-m` argument represents; by specifying `m`, you allow multiple arguments (i.e. multiple text files) to be passed as inputs into parallel. This alone leads to an 8x speedup over the naive loop solution.

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1: Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the `tr` (transliterate) utility, we can map our data to one product per row as we stream over the file:

The trick here is that we swap the comma-delimited transactions with the newline character; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using `sort -u` to de-dup the list and `wc -l` to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL `Group By` on the date and sum the rows:

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up at this blog post).

Q3. Total items Per store, Per month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.

It may be possible that one of you out there knows how to do this correctly, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU available. So parallelization might yield a 2x speedup, but seven minutes extra per run isn’t worth spending hours trying to the optimal settings.

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.

Comments

  1. As usual, Unix command line tools rocks!

  2. Without a lock on the file, wouldn’t your parallel cat in Solution 1 create some issues where, say, lines generated by one process may be incomplete and cut off by the output from another process? It might be safer to aggregate one file per process, then join them together in a second step.

    • Randy Zwitch says:

      Good question…

    • The command used here will not be protected by GNU Parallel. This is because the redirection is done inside the command. So this may mix up output:

      ls | parallel -m -j $f “cat {} >> ../transactions_cat/transactions.csv”

      But this command will be protected against mixing by GNU Parallel:

      ls | parallel -m -j $f cat {} >> ../transactions_cat/transactions.csv

      Note how the subtle difference in moving the >> outside the quotes makes the difference.

  3. I know that this article is about knowing the right tools and not reinventing the wheel, but with “cat” problem, why wouldn’t you simply use xargs? this is the tool created especially for this kind of problems.

    20G of data is too small to use database? ummm, what? If you’d import this data into *SQL after looking at it for a while, you’d get normalized data way smaller than the original in less time than it took your “transactions by day” parallel job to run. After adding 1-2 indexes queries would be trivial and take seconds to run without worrying about memory etc.

    it is possible, but this solutions would finally fail if arrays you are building in awk would fill up available memory. Also parallel is not that easy as stated in above comments and can mangle the data used this way.

    Enough to say, that my weapon of choice for this kind of small datasets is a database, because you’d get there parallelism, storage optimization, query optimizer, indexes… and you wouldn’t have to “reindex” all the fields each time you run group by via awk-parallel hybrid.

    Don’t get me wrong 🙂 I’m a heavy user of awk, cut, cat, etc. just this example doesn’t look convincing for me.

    • Randy Zwitch says:

      I’m not sure there is a “cat problem” (see comment above about the example from the parallel man pages)

      In terms of where the cutoff is for large enough to use a database…well, that’s certainly up to you. There’s obviously a cost of loading the data, then building indexes, But sure, if you have enough queries to amortize that cost, it’s a perfectly valid way to solve the problem when your data size exceeds RAM.

      As far as “simply using xargs”, the parallel man pages addresses this too. I learned parallel first, so that’s what I use.

      https://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel

      Feel free to solve a problem however you’d like.

    • I had the same reaction as you. The size of the dataset has nothing to do with the use of a relational database or not. It’s simply a matter of convenience. Moving a dataset into a SQL database is effortless, from there, most problems can be solved with simple queries.

      The task, generalized, is to perform summaries of transactional data. It’s hard to find a better tool for this task than a relational database.

      • Randy Zwitch says:

        I still assert that it could; what if you don’t have enough space to load the data to SQLite? 20GB is immaterial here, but you could extend the example to 100GB of data and only 50GB of free disk space.

        You might load to SQLite, incrementally deleting files as you load them and it would work. But then your source is gone. Or, you could do as the example here and stream over the files.

        A contrived example? Sure. But it was a question posed to me the day I wrote this article.

  4. Very helpful–thank you. I’m always casting about for ways to deal with that @$% arg list issue…

  5. nice to see somebody is getting good use out of gnu parallel. It can be a bit cumbersome sometimes to pass everything via command line so I often call a bash script from parallel that uses arguments passed to it, ike this: https://gist.github.com/cloudkitsch/b494ad84d707eb84359d

  6. Try this:

    for f in *; do cat “$f” ; done > ../transactions_cat/transactions.csv

Trackbacks

  1. […] Read More: GNU Parallel, A Million Text Files And A Single Laptop […]

Leave a Reply

%d bloggers like this: