A Beginner's Look at BenchmarkTools.jl

For the number of years I’ve been programming in Julia, I’ve never really been concerned with performance. Which is to say, I’ve appreciated that other people are interested in performance and have proven that Julia can be as fast as any other high-performance language out there. But I’ve never been one to pore over the Performance Tips section of the Julia manual, trying to squeeze out every last bit of speed.

But now that I’ve released OmniSci.jl, and as a company one of our major selling points is accelerated analytics, I figured it was time to stop assuming I wrote decent-ish code and really pay attention to performance. This post highlights my experience as a beginner, and hopefully will show how others can get started in learning to optimize their Julia code.

Read The Manuals!

As I mentioned above, I’ve written Julia for many years now, and in that time I’ve grown up with many of the tips in the performance tips section of the documentation. Things like “write type stable functions” and “avoid global variables” are things that I’ve internalized as good programming practices, as opposed to doing them just because they are performant. But with this long familiarity with the language comes laziness, and by not reading the BenchmarkTools.jl documentation, I started off benchmarking incorrectly. Consider this example:

julia> using Random, OmniSci, BenchmarkTools, Base.Threads

julia> #change defaults, since examples long-running
       BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1000

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 5

julia> #generate test data
       gendata(x, T) = [rand(typemin(T):typemax(T)) for y in 1:x]
gendata (generic function with 1 method)

julia> int64_10x6 = gendata(10^6, Int64);

julia> #Test whether broadcasting more/less efficient than pre-allocating results array
       function preallocate(x)
           v = Vector{OmniSci.TStringValue}(undef, length(x))
           for idx in 1:length(x)
               v[idx] = OmniSci.TStringValue(x[idx])
           end
           return v
       end
preallocate (generic function with 1 method)

julia> @benchmark v61 = OmniSci.TStringValue.(int64_10x6)
  memory estimate:  297.55 MiB
  allocs estimate:  6000005
  minimum time:     750.146 ms (0.00% GC)
  median time:      1.014 s (29.38% GC)
  mean time:        1.151 s (28.38% GC)
  maximum time:     1.794 s (43.06% GC)
  samples:          5
  evals/sample:     1

julia> @benchmark v62 = preallocate(int64_10x6)
  memory estimate:  297.55 MiB
  allocs estimate:  6000002
  minimum time:     753.877 ms (0.00% GC)
  median time:      1.021 s (28.30% GC)
  mean time:        1.158 s (28.10% GC)
  maximum time:     1.806 s (43.17% GC)
  samples:          5
  evals/sample:     1

The benchmark above tests whether it’s worth pre-allocating the results array vs. using the more convenient dot broadcasting syntax. The idea here is that growing an array over and over can be inefficient when you know the result size at the outset. Yet, comparing the times above, pre-allocating the array is slightly worse on every statistic, even though we’re passing the compiler more knowledge up front. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual and found the following about variable interpolation:

A good rule of thumb is that external variables should be explicitly interpolated into the benchmark expression

Interpolating the int64_10x6 input array into the function takes it from being a global variable to a local, and sure enough, we see roughly a 6% improvement in the minimum time when we pre-allocate the array:

julia> @benchmark v61i = OmniSci.TStringValue.($int64_10x6)
  memory estimate:  297.55 MiB
  allocs estimate:  6000002
  minimum time:     763.817 ms (0.00% GC)
  median time:      960.446 ms (24.02% GC)
  mean time:        1.178 s (28.68% GC)
  maximum time:     1.886 s (45.11% GC)
  samples:          5
  evals/sample:     1

julia> @benchmark v62i = preallocate($int64_10x6)
  memory estimate:  297.55 MiB
  allocs estimate:  6000002
  minimum time:     721.597 ms (0.00% GC)
  median time:      1.072 s (30.45% GC)
  mean time:        1.234 s (32.92% GC)
  maximum time:     1.769 s (44.51% GC)
  samples:          5
  evals/sample:     1

Whether that 6% improvement will hold up over time or not, at least conceptually we’re no longer worse off for pre-allocating, which fits my mental model of how things should work.
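The same untyped-global penalty can be demonstrated without BenchmarkTools at all. Here’s a minimal sketch (using Base’s `@elapsed` and a plain `sum`; these names are my own illustration, not code from the benchmarks above) of why reading a non-const global is slower than passing the value as an argument:

```julia
data = rand(10^6)          # non-const global: its type can change, so access is untyped

sumglobal() = sum(data)    # reads the untyped global on every call
sumlocal(x) = sum(x)       # receives the array as a typed argument

# Warm up both to exclude compilation, then time a second call;
# the argument-passing version is typically faster.
sumglobal(); sumlocal(data)
t_global = @elapsed sumglobal()
t_local  = @elapsed sumlocal(data)
```

Interpolating with `$` in `@benchmark` accomplishes the same thing: the value is spliced into the benchmarked expression, so the timing loop never pays the untyped-global lookup.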

Evaluate Your Benchmark Over the Range of Inputs You Care About

In the comparison above, I evaluate the benchmark over 10^6 observations. How did I choose 1 million as the “right” number of events to test, instead of just testing 1 or 10 events? My general goal for benchmarking this code is to speed up the methods of loading data into an OmniSciDB database. TStringValue is one of the internal methods used as part of a row-wise table load, converting whatever data is present in an array or DataFrame from ::Type{T} into String (think iterating over a text file by line). Since users trying to accelerate their database operations are probably going to be working with millions to billions of data points, I’m interested in understanding how the functions perform at these volumes of data.

The other conscious decision I made was the environment to test on. I could test this on massive CPU- and GPU-enabled servers, but I’m testing this on my Dell XPS 15 laptop. Why? Because I’m actually interested in how things are performing under more real-world conditions for a realistic user. Testing the performance characteristics of a high-end server with tons of memory and cores would be fun, but I want to make sure any performance improvements are broadly applicable, instead of just because I am throwing more hardware at the problem.

Less important to me was controlling for garbage collection, using a fresh session before each measurement, or other “best case scenario” optimizations. I would expect my users to be more analytics and data science focused, so re-using the same session is going to be common. If the performance improvements aren’t completely obvious, I’m not going to incorporate them into the codebase.

Case Study: Speeding Up TStringValue

For my test, I evaluate the following as the methods to benchmark:

  • broadcasting: current library default
  • pre-allocating result array
  • pre-allocated result array with @inbounds macro
  • pre-allocated result array with threads
  • pre-allocated result array with threads and @inbounds
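As a rough sketch, the threaded + pre-allocated variant looks like the following (with `string` standing in for `OmniSci.TStringValue`, and the element type hardcoded, so the example runs without OmniSci.jl installed):

```julia
using Base.Threads

# Pre-allocate the output, then fill it in parallel; each thread
# writes to disjoint indices, so no locking is needed.
function preallocate_threaded(f, x)
    v = Vector{String}(undef, length(x))  # String because f here returns String
    @threads for idx in 1:length(x)
        v[idx] = f(x[idx])
    end
    return v
end

preallocate_threaded(string, [1, 2, 3])  # ["1", "2", "3"]
```

The `@inbounds` variants simply wrap the assignment in the loop body, trading bounds-checking safety for a small amount of speed.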

10^6 observations

The first three results on the left compare the single-threaded methods. You can see that pre-allocating the output array is marginally faster than broadcasting, and using the @inbounds macro is incrementally faster still, but neither method provides enough speedup to be worth implementing. The difference between the red and the blue bars represents a garbage collection occurring, but again, the three methods aren’t different enough to notice anything interesting.

For the multi-threaded tests, I’m using 6 threads (one per physical core), and we’re seeing roughly a 3x speedup. Like the single-threaded tests above, using @inbounds is only marginally faster, and not enough to implement widely given the cost of increased code complexity. Interestingly, these multi-threaded benchmarks didn’t trigger garbage collection at all across my five iterations; I’m not sure whether this is specifically due to threading or not, but it’s something to explore outside of this blog post.
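For anyone reproducing the multi-threaded runs: the thread count is fixed when Julia starts, via the JULIA_NUM_THREADS environment variable (6 here matches my laptop’s physical cores), and can be checked in-session:

```julia
using Base.Threads

# Set before launching Julia, e.g.:
#   $ JULIA_NUM_THREADS=6 julia
nthreads()   # number of threads this session was started with
```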

10^7 observations

To see how these calculation methods might change at a larger scale, I bumped up the observations by an order of magnitude and saw the following results:

As at the 1 million observation level, there isn’t much difference between the three single-threaded methods. All three are within a few percent of one another (and all three triggered garbage collection in each of their five runs).

For the multi-threaded tests, an interesting performance scenario emerged. Like the 1 million point tests, it’s possible to get a run where garbage collection isn’t triggered, which leads to a large min/median difference in the multi-threaded tests. If you can avoid garbage collection, using six threads here gives nearly a 10x speedup, and at the median where both single-threaded and multi-threaded trigger garbage collection you still get a 2x speedup.

Parallelism > Compiler Hinting

In the case study above, I’ve demonstrated that for this problem, threading is the first avenue to pursue in speeding up the OmniSci.jl load table methods. While pre-allocating the size of the output array and using the @inbounds macro did show some slight speedups, using threads to perform the calculations is where the largest improvements occurred. Incorporating the pre-allocation step falls out naturally from the way I wrote the threading methods, so I’ll incorporate that too. Disabling bounds-checking on arrays using @inbounds seems more dangerous than it is worth, even though none of these methods should ever get outside of their bounds.

Overall, I hope this post has demonstrated that you don’t have to fancy yourself a high-frequency trader or a bit-twiddler to find ways to improve your Julia code. The first step is reading the manuals for benchmarking, and then like any other pursuit, the only way to get a feeling for what works is to try things.

All of the code for this blog post can be found in this GitHub gist.
