As my Developer Advocate role leads me to doing more and more Sysadmin/Data Engineer type of work, I continuously find myself looking for more efficient ways of copying data folders to where I need them. While there are a lot of great GUI ETL tools out there, for me the simplest and fastest way tends to be using linux utilities. Here’s how to mirror an FTP using lftp, with a cron repeater every five minutes.
Data are on an FTP, Need Further Processing
The problem I have is data that exists on a remote FTP, but are in a binary format that is incompatible with loading directly into MapD. My current plan is to use Python to convert the binary format into CSV, but with the data on a server that I don’t control, I need to make a copy somewhere else.
It’s also the case that the data are roughly 300GB per day, streaming in at various intervals across the day, so I need to make sure that any copying I do is thoughtful. Downloading 300GB of data per day is bad enough, doing it multiple times even worse!
lftp mirror to the Rescue!
The best choice in my case seems to be copying the files onto a VM I own. lftp has an option mirror to do just that. Here is the one-liner I’m using:
lftp -e : Execute command in quotes. In this case, the FTP allows anonymous access, so no user/pw arguments needed
mirror: Mirror command for lftp
-c: If download fails for whatever reason, keep trying (c = “continue”)
-e: Delete files on remote that are no longer on source (i.e. keep folders in perfect sync)
--parallel: Allow multiple connections for parallel downloading of multiple files
--verbose: Print lots of messages, helpful for debugging
With all of the flags in place, the last two arguments are the source (remote FTP) and destination (my VM) directories. Finally, I add a quit statement to exit lftp once the mirror process is over. This is mostly hygiene since I plan to run this on a cron scheduler and don’t want to leave the sessions open.
Run This Every Five Minutes, Forever
cron really is one of the greatest timesavers ever, especially in that it allows super-repetitive work to be automated away, usually with a single line. Here is the line I added after calling crontab -e on the command-line:
Quite simply, “every 5 minutes, run pull_from_ftp.sh”. Creating pull_from_ftp.sh is as straightforward as creating a text file:
That’s It? YES!
With just a few characters short of a full tweet, you can mirror an entire folder from an FTP, automatically. lftp in combination with cran helped me factor out hundreds of lines of pre-existing Python code, which not only removed untested, copy-pasted code from the workflow but also added parallel downloading, increasing data throughput.
In my post Installing MapD on Microsoft Azure, I showed how to install MapD Community Edition on Microsoft Azure, using Ubuntu 16.04 LTS as the base image. One thing I glossed over during the firewall/security section was that I opened ports for Jupyter Notebook and other data science tools, but I didn’t actually show how to install any of those tools.
For this post, I’ll cover how to install MapD ODBC drivers and create a connection within RStudio server.
There are two major ODBC driver managers for Linux and macOS: unixODBC and iODBC. I have had more overall ODBC driver installation success with unixODBC than iODBC; here are the instructions for building unixODBC from source:
If you want to check everything is installed correctly, you can run the following command:
For Linux, the MapD ODBC drivers are provided as a tarball, which when extracted provides all of the necessary ODBC driver files:
By convention, MapD suggests placing the ODBC drivers in the same directory as your installation (frequently, /opt/mapd). Wherever you choose to place the directory, you need add that location into the /etc/odbcinst.ini file:
At this point, we have everything we need to define a connection string within R using odbc:
Depending on your use case/security preferences, there are two downsides to this method: 1) the credentials are in plain-text in the middle of the script and 2) the RStudio Connection window also shows the credentials in connection window in plain-text until you delete the connection. This can be remedied by defining a DSN (data source name).
4. Defining A DSN
A DSN is what people usually think of when installing ODBC drivers, as it holds some/all of the actual details for connecting to the database. DSN files can be placed in two locations: system-wide in /etc/obdc.ini or in an individual user’s home directory (needs to be ~/.odbc.ini, a hidden file).
In order to have the credentials completely masked in the RStudio session, place the following in the /etc/obdc.ini file:
Within the RStudio Connection pane, we can now test our DSN:
With the DSN defined, the R connection code becomes much shorter, with no credentials exposed within the R session:
ODBC: A Big Bag Of Hurt, But Super Useful
While the instructions above aren’t the easiest to work through, once you have ODBC set up and working one time, it’s usually just a matter of appending various credentials to the existing files to add databases.
From a MapD perspective, ODBC is supported through our Enterprise Edition, but it is the slowest way to work with the database. Up to this point, we’ve focused mostly on supporting Python through the pymapd package and the MapD Ibis backend, but there’s no reason technical reason why R can’t also be a first-class citizen.
So if you’re interested in helping develop an R package for MapD, whether using reticulate to wrap pymapd or to help develop Apache Thrift bindings and Apache Arrow native code, send me a Twitter message or connect via LinkedIn (or any other way to contact me) and we’ll figure out how to collaborate!
Data has been growing rapidly for some time now, but CPU-based analytics solutions haven’t been able to sustain the same rate of growth in order to keep up. CPUs in desktop and laptop machines have started adding more cores, but even a 4- or 8-core CPU can only do so much work. Eventually the bottleneck will become not having enough bandwidth to keep all the CPU cores ‘fed’ with data to manipulate. Hadoop provides a framework for working with larger datasets, but its distributed nature can often feel like setting it up is more hassle than its worth.
GPU-based analytics solutions provide a great middle-ground; high-parallelism via thousands of GPU cores, while not having to automatically use a networked, multi-node architecture such as Hadoop. A single data science workstation with 2-4 GPUs can reasonably handle hundreds of millions of records, especially when using the Ibis backend for MapD.
In this webinar, I demonstrate how to do each step of a machine learning workflow, from exploring a dataset to adding features to estimating an xgboost model for predicting the amount of tip a user will give after a taxi ride. Because MapD incorporates Apache Arrow under the hood for its data transfer, this can all be done seamlessly by passing pointers, rather than needing expensive I/O operations, between each tool used. Not having to transfer the data off of the GPU has interesting implications for analytics, which I also discuss towards the end of the talk.