As my Developer Advocate role leads me to doing more and more Sysadmin/Data Engineer type of work, I continuously find myself looking for more efficient ways of copying data folders to where I need them. While there are a lot of great GUI ETL tools out there, for me the simplest and fastest way tends to be using linux utilities. Here’s how to mirror an FTP using lftp, with a cron repeater every five minutes.
Data are on an FTP, Need Further Processing
The problem I have is data that exists on a remote FTP, but are in a binary format that is incompatible with loading directly into MapD. My current plan is to use Python to convert the binary format into CSV, but with the data on a server that I don’t control, I need to make a copy somewhere else.
It’s also the case that the data are roughly 300GB per day, streaming in at various intervals across the day, so I need to make sure that any copying I do is thoughtful. Downloading 300GB of data per day is bad enough, doing it multiple times even worse!
mirror to the Rescue!
The best choice in my case seems to be copying the files onto a VM I own. lftp has an option
mirror to do just that. Here is the one-liner I’m using:
lftp -e: Execute command in quotes. In this case, the FTP allows anonymous access, so no user/pw arguments needed
mirror: Mirror command for lftp
-c: If download fails for whatever reason, keep trying (c = “continue”)
-e: Delete files on remote that are no longer on source (i.e. keep folders in perfect sync)
--parallel: Allow multiple connections for parallel downloading of multiple files
--verbose: Print lots of messages, helpful for debugging
With all of the flags in place, the last two arguments are the source (remote FTP) and destination (my VM) directories. Finally, I add a
quit statement to exit
lftp once the mirror process is over. This is mostly hygiene since I plan to run this on a cron scheduler and don’t want to leave the sessions open.
Run This Every Five Minutes, Forever
cron really is one of the greatest timesavers ever, especially in that it allows super-repetitive work to be automated away, usually with a single line. Here is the line I added after calling
crontab -e on the command-line:
Quite simply, “every 5 minutes, run pull_from_ftp.sh”. Creating
pull_from_ftp.sh is as straightforward as creating a text file:
That’s It? YES!
With just a few characters short of a full tweet, you can mirror an entire folder from an FTP, automatically.
lftp in combination with
cran helped me factor out hundreds of lines of pre-existing Python code, which not only removed untested, copy-pasted code from the workflow but also added parallel downloading, increasing data throughput.
Not bad for a couple of free Linux utilities :)