Building pyarrow with CUDA support

Randy Zwitch × April 3, 2020 × Programming

The other day I was looking to read an Arrow buffer on GPU using Python, but as far as I could tell, none of the provided pyarrow packages on conda or pip are built with CUDA support. Like many of the packages in the compiled-C-wrapped-by-Python ecosystem, Apache Arrow is thoroughly documented, but the number of permutations of how you could choose to build pyarrow with CUDA support quickly becomes overwhelming.

In this post, I’ll show how to build pyarrow with CUDA support on Ubuntu using Docker and virtualenv. These directions are approximately the same as the official Apache Arrow docs, just that I explain them step-by-step and show only the single build toolchain I used.

Step 1: Docker with GPU support

Even though I use Ubuntu 18.04 LTS on a workstation with an NVIDIA GPU, whenever I undertake a project like this, I like to use a Docker container to keep everything isolated. The last thing you want to do is to debug environment errors, changing dependencies for one project and breaking something else. Thankfully, NVIDIA Docker developer images are available via DockerHub:

docker run -it --gpus=all --rm nvidia/cuda:10.1-devel-ubuntu18.04 bash

Here, the -it flag puts us inside the container at a bash prompt, --gpus=all allows the Docker container to access my workstation’s GPUs and --rm deletes the container after we’re done to save space.

Step 2: Setting up the Ubuntu Docker container

When you pull Docker containers from DockerHub, frequently they are bare-bones in terms of libraries included, and usually can also be updated. For building pyarrow, it’s useful to install the following:

apt update && apt upgrade -y

apt install git \
wget \
libssl-dev \
autoconf \
flex \
bison \
llvm-7 \
clang \
cmake \
python3-pip \
libjemalloc-dev \
libboost-dev \
libboost-filesystem-dev \
libboost-system-dev \
libboost-regex-dev  \
python3-dev -y

In a later step, we’ll use the Arrow third-party dependency script to ensure all needed dependencies are present, but these are a good start.

Step 3: Cloning Apache Arrow from GitHub

Cloning Arrow from GitHub is pretty straightforward. The git checkout apache-arrow-0.15.0 line is optional; I needed version 0.15.0 for the project I was exploring, but if you want to build from the master branch of Arrow, you can omit that line.

git clone https://github.com/apache/arrow.git /repos/arrow
cd /repos/arrow
git submodule init && git submodule update
git checkout apache-arrow-0.15.0
export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
export ARROW_TEST_DATA="${PWD}/testing/data"

Step 4: Installing remaining Apache Arrow dependencies

As mentioned in Step 2, some of the dependencies for building Arrow are system-level and can be installed via apt. To ensure that we have all the remaining third-party dependencies, we can use the provided script in the Arrow repository:

pip3 install virtualenv
virtualenv pyarrow
source ./pyarrow/bin/activate
pip install six numpy pandas cython pytest hypothesis
mkdir dist
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH

cd cpp
./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty

The script downloads all of the necessary libraries as well as sets environment variables that are picked up later, which is amazingly helpful.

Step 5: Building Apache Arrow C++ library

pyarrow links to the Arrow C++ bindings, so it needs to be present before we can build the pyarrow wheel:

mkdir build && cd build

cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_FLIGHT=ON \
-DARROW_GANDIVA=ON \
-DARROW_ORC=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_PARQUET=ON \
-DARROW_PYTHON=ON \
-DARROW_PLASMA=ON \
-DARROW_BUILD_TESTS=ON \
-DARROW_CUDA=ON \
..

make -j
make install

This is a pretty standard workflow for building a C or C++ library. We create a build directory, call cmake from inside of that directory to set up the options we want to use, then use make and then make install to compile and install the library, respectively. I chose all of the -DARROW_* options above just as a copy/paste from the Arrow documentation; Arrow doesn’t take long to build using these options, but it’s possibly the case that only -DARROW_PYTHON=ON and -DARROW_CUDA=ON are truly necessary to build pyarrow.

Step 6: Building pyarrow wheel

With the Apache Arrow C++ bindings built, we can now build the Python wheel:

cd /repos/arrow/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_CUDA=1
python setup.py build_ext --build-type=release --bundle-arrow-cpp bdist_wheel

As cmake and make run, you’ll eventually see the following in the build logs, which shows that we’re getting the behavior we want:

cmake --build . --config release --
[  5%] Compiling Cython CXX source for _cuda...
[  5%] Built target _cuda_pyx
Scanning dependencies of target _cuda
[ 11%] Building CXX object CMakeFiles/_cuda.dir/_cuda.cpp.o
[ 16%] Linking CXX shared module release/_cuda.cpython-36m-x86_64-linux-gnu.so
[ 16%] Built target _cuda

When the process finishes, the final wheel will be in the /repos/arrow/python/dist directory.

Step 7 (optional): Validate build

If you want to validate that your pyarrow wheel has CUDA installed, you can run the following:

(pyarrow) root@9260485caca3:/repos/arrow/python/dist# pip install pyarrow-0.15.1.dev0+g40d468e16.d20200402-cp36-cp36m-linux_x86_64.whl
Processing ./pyarrow-0.15.1.dev0+g40d468e16.d20200402-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: six>=1.0.0 in /repos/arrow/pyarrow/lib/python3.6/site-packages (from pyarrow==0.15.1.dev0+g40d468e16.d20200402) (1.14.0)
Requirement already satisfied: numpy>=1.14 in /repos/arrow/pyarrow/lib/python3.6/site-packages (from pyarrow==0.15.1.dev0+g40d468e16.d20200402) (1.18.2)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1.dev0+g40d468e16.d20200402
(pyarrow) root@9260485caca3:/repos/arrow/python/dist# python
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyarrow import cuda
>>>

When the line from pyarrow import cuda runs without error, then we know that our pyarrow build with CUDA was successful.

A Beginner's Look at BenchmarkTools.jl

Randy Zwitch × December 16, 2019 × Programming

For the number of years I’ve been programming using Julia, I’ve never really been concerned with performance. Which is to say, I’ve appreciated that other people are interested in performance and have proven that Julia can be as fast as any other performance language out there. But I’ve never been one to pour over the Performance Tips section of the Julia manual trying to squeeze every last bit of performance.

But now that I’ve released OmniSci.jl, and as a company one of our major selling points is accelerated analytics, I figured it was time to stop assuming I wrote decent-ish code and really pay attention to performance. This post highlights my experience as a beginner, and hopefully will show how others can get started in learning to optimize their Julia code.

Read The Manuals!

As I mentioned above, I’ve written Julia for many years now, and in that time I’ve grown up with many of the tips in the performance tips section of the documentation. Things like “write type stable functions” and “avoid global variables” are things that I’ve internalized as good programming practices, as opposed to doing them just because they are performant. But with this long familiarity with the language comes laziness, and by not reading the BenchmarkTools.jl documentation, I started off benchmarking incorrectly. Consider this example:

julia> using Random, OmniSci, BenchmarkTools, Base.Threads

julia> #change defaults, since examples long-running
       BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1000
1000

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 5
5

julia> #generate test data
       gendata(x, T) = [rand(typemin(T):typemax(T)) for y in 1:x]
gendata (generic function with 1 method)

julia> int64_10x6 = gendata(10^6, Int64);

julia> #Test whether broadcasting more/less efficient than pre-allocating results array
       function preallocate(x)

           v = Vector{OmniSci.TStringValue}(undef, length(x))

           for idx in 1:length(x)
               v[idx] = OmniSci.TStringValue(x[idx])
           end

           return v
       end
preallocate (generic function with 1 method)

julia> @benchmark v61 = OmniSci.TStringValue.(int64_10x6)
BenchmarkTools.Trial:
  memory estimate:  297.55 MiB
  allocs estimate:  6000005
  --------------
  minimum time:     750.146 ms (0.00% GC)
  median time:      1.014 s (29.38% GC)
  mean time:        1.151 s (28.38% GC)
  maximum time:     1.794 s (43.06% GC)
  --------------
  samples:          5
  evals/sample:     1

julia> @benchmark v62 = preallocate(int64_10x6)
BenchmarkTools.Trial:
  memory estimate:  297.55 MiB
  allocs estimate:  6000002
  --------------
  minimum time:     753.877 ms (0.00% GC)
  median time:      1.021 s (28.30% GC)
  mean time:        1.158 s (28.10% GC)
  maximum time:     1.806 s (43.17% GC)
  --------------
  samples:          5
  evals/sample:     1

The benchmark above tests whether it’s worth pre-allocating the results array vs. using the more convenient dot broadcasting syntax. The idea here is that growing an array over and over can be inefficient when you know the result size at the outset. Yet, comparing the times above, for all statistics pre-allocating the array is slightly worse, even though we’re passing the compiler more knowledge up front. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual and found the following about variable interpolation:

A good rule of thumb is that external variables should be explicitly interpolated into the benchmark expression

Interpolating the int64_10x6 input array into the function takes it from being a global variable to a local, and sure enough, we see roughly a 6% improvement in the minimum time when we pre-allocate the array:

julia> @benchmark v61i = OmniSci.TStringValue.($int64_10x6)
BenchmarkTools.Trial:
  memory estimate:  297.55 MiB
  allocs estimate:  6000002
  --------------
  minimum time:     763.817 ms (0.00% GC)
  median time:      960.446 ms (24.02% GC)
  mean time:        1.178 s (28.68% GC)
  maximum time:     1.886 s (45.11% GC)
  --------------
  samples:          5
  evals/sample:     1

julia> @benchmark v62i = preallocate($int64_10x6)
BenchmarkTools.Trial:
  memory estimate:  297.55 MiB
  allocs estimate:  6000002
  --------------
  minimum time:     721.597 ms (0.00% GC)
  median time:      1.072 s (30.45% GC)
  mean time:        1.234 s (32.92% GC)
  maximum time:     1.769 s (44.51% GC)
  --------------
  samples:          5
  evals/sample:     1

Whether that 6% improvement will hold up over time or not, at least conceptually we’re no longer worse off for pre-allocating, which fits my mental model of how things should work.

Evaluate Your Benchmark Over the Range of Inputs You Care About

In the comparison above, I evaluate the benchmark over 10^6 observations. How did I choose 1 million as the “right” number of events to test, instead of just testing 1 or 10 events? My general goal for benchmarking this code is to speed up the methods of loading data into an OmniSciDB database. TStringValue is one of the internal methods as part of doing a row-wise table load, converting whatever data is present in an array or DataFrame from ::Type{T} into String (think iterating over a text file by line). Since users trying to accelerate their database operations are probably going to be using millions to billions of data points, I’m interested in understanding how the functions are performing at these volumes of data.

The other conscious decision I made was the environment to test on. I could test this on massive CPU- and GPU-enabled servers, but I’m testing this on my Dell XPS 15 laptop. Why? Because I’m actually interested in how things are performing under more real-world conditions for a realistic user. Testing the performance characteristics of a high-end server with tons of memory and cores would be fun, but I want to make sure any performance improvements are broadly applicable, instead of just because I am throwing more hardware at the problem.

Less important to me to control for was garbage collection, using a fresh session before each measurement or other “best case scenario” optimizations. I would expect my users to be more analytics and data science focused, so re-using the same session is going to be common. If the performance improvements aren’t completely obvious, I’m not going to incorporate them into the codebase.

Case Study: Speeding Up `TStringValue`

For my test, I evaluate the following as the methods to benchmark:

broadcasting: current library default
pre-allocating result array
pre-allocated result array with @inbounds macro
pre-allocated result array with threads
pre-allocated result array with threads and @inbounds

10x6 observations

For the first three on the left, these are comparisons of the single-threaded methods. You can see that pre-allocating the output array is marginally faster than broadcasting, and using the @inbounds macro is incrementally faster still, but neither method provides enough speedup to be worth implementing. The difference between the red and the blue bars represents a garbage collection occurring, but again, the three methods aren’t different enough to notice anything interesting.

For the multi-threaded tests, I’m using 6 threads (one per physical core), and we’re seeing roughly a 3x speedup. Like the single-threaded tests above, using @inbounds is only marginally faster, but not enough to widely implement for the cost of increased code complexity. Interestingly, doing these multi-threaded benchmarks didn’t trigger garbage collect at all across my five iterations; not sure if this is specific due to threading or not, but something to explore outside of this blog post.

10x7 observations

To see how these calculation methods might change at a larger scale, I bumped up the observations by an order of 10 and saw the following results:

Like at the 1 million data range, there isn’t much difference between the three single-threaded methods. All three of them are within a few percentage in either direction (all three methods triggered garbage collection in each of their five runs).

For the multi-threaded tests, an interesting performance scenario emerged. Like the 1 million point tests, it’s possible to get a run where garbage collection isn’t triggered, which leads to a large min/median difference in the multi-threaded tests. If you can avoid garbage collection, using six threads here gives nearly a 10x speedup, and at the median where both single-threaded and multi-threaded trigger garbage collection you still get a 2x speedup.

Parallelism > Compiler Hinting

In the case study above, I’ve demonstrated that for this problem, threading is the first way to pursue speeding up the OmniSci.jl load table methods. While pre-allocating the size of the output array and using the @inbounds did show some slight speedups, using threads to perform the calculations are where the largest improvements occurred. Incorporating the pre-allocation step naturally comes out from the way I wrote the threading methods, so I’ll incorporate that too. Disabling bounds-checking on arrays using @inbounds seems more dangerous than it is worth, even though none of these methods should ever get outside of their bounds.

Overall, I hope this post has demonstrated that you don’t have to fancy yourself a high-frequency trader or a bit-twiddler to find ways to improve your Julia code. The first step is reading the manuals for benchmarking, and then like any other pursuit, the only way to get a feeling for what works is to try things.

All of the code for this blog post can be found in this GitHub gist

RSiteCatalyst Version 1.4.16 Release Notes

Randy Zwitch × November 5, 2019 × Analytics

It’s been a while since the last update, but RSiteCatalyst is still going strong! Thanks to Wen for submitting a fix/enhancement to enable the ability to use multiple columns from a Classification within the QueueDataWarehouse function. No other bug fixes were made, nor was any additional functionality added.

Version 1.4.16 of RSiteCatalyst was submitted to CRAN yesterday and should be available for download in the coming days.

Community Contributions

As I’ve mentioned in many a blog post before this one, I encourage all users of the software to continue reporting bugs via GitHub issues, and especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug will be addressed in a timely manner and for the benefit to others in the community.

Note: Please don’t email directly via the email in the RSiteCatalyst package, it will not be returned. Having a valid email contact in the package is a requirement to have a package listed on CRAN so they can contact the package author, it is not meant to imply I can/will provide endless, personalized support for free.

« Prev 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 Next »