If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.
While you can use packages such as ff and bigmemory to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using Amazon EC2 to provision the resources you need. Here are two ways to get started…
Use a Pre-Made AMI
In the great open-source tradition, there are already R Amazon EC2 AMI images available out there to use. The way I got started was using the pre-built images that Louis Aslett provides on his website. Louis also provides great instructions on learning about EC2, so if you’ve never worked with R in the cloud or a just looking to get up and running fast, his website is a great means to do so.
Build Your Own Image
Alternatively, suppose you want to build your own customized image. For example, say you wanted to build a proof-of-concept ‘big data’ environment, so you want R, Python, MySQL and MongoDB. The commands to accomplish this are listed below. Note that I’m assuming you have a basic understanding of working through the Amazon Web Service Console (AWS), including being able to get to the ‘Classic Wizard’ for launching an EC2 instance. You also should have a basic understanding of working from the command line.
Setting Up Amazon EC2 Instance
- Launch an Ubuntu 12.04.1 LTS 64-bit image. You can use a free “t1.micro” image while building, then provision more resources later once you’re ready for analysis.
- Accept defaults until you get to Key-Pair tab. The Key-Pair is what allows you to login securely to your Amazon EC2 image without a password. Create and download a Key-Pair if you don’t already have one or choose an existing Key-Pair if you do.
- When you get to the ‘Security Groups’ tab, create a security group that has the following ports open:
22
(SSH),80
(HTTP),443
(HTTPS),3389
(RDP, optional), and8787
(RStudio Server). - Work through the rest of the Wizard until your instance is launched.
Connecting to Amazon EC2 Instance
- There are two ways to connect to your EC2 image, both of which can be found by going to the “Actions” tab in the AWS console, then selecting “Connect” from the drop-down. The rest of this tutorial assumes you connect via a stand-alone SSH client (such as Terminal for Mac OSX)
- Connect to your instance by typing the code provided to you, such as:
ssh -i me-aws.pem ubuntu@ec2-50-19-18-120.compute-1.amazonaws.com
- Be sure that before you submit this code, you either modify the line to put the directory in front of your Key-Pair, or “cd” to the directory where the Key-Pair is located
- After submitting the connect code, you will get a warning saying that the ‘authenticity can’t be established, do you want to continue?’ Type
yes
and hit enter to log in.
Installing Base R
Once you are logged in, there are about a dozen commands that need to be submitted. Some commands run quickly, others can take 10-15 minutes to run through the entire installation process. Depending on how quickly each command completes, you may or may not need to type “sudo” in front of each command to have proper access rights for installation. Submit each line one at a time.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio
#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade
#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list
#Add following line (or your favorite CRAN mirror)
deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu precise/
#Update files to use CRAN mirror
#Don't worry about error message
sudo apt-get update
#Install latest version of R
#Install without verification
sudo apt-get install r-base
While not strictly required to run R, I also like to run the following commands to install the Curl and XML packages as well, which are useful if you want to use R to connect to any web data/APIs.
1
2
3
#Install in order to use RCurl & XML
sudo aptitude install libcurl4-openssl-dev
sudo apt-get install libxml2-dev
With these commands run, you will now be able to run R from the command line just by typing “R” at the prompt. However, it would be a crime to do all this work and not install RStudio Server, which makes working in R so much easier.
Installing RStudio Server
Once you’ve installed the above commands, you can now access RStudio through your local browser. Navigate to the Public DNS of your image on port 8787, similar to:
http://ec2-50-19-18-120.compute-1.amazonaws.com:8787
The login
and password
will be the values you used in the image creation process (I used rstudio
as my username above).
Installing MySQL, Python, and MongoDB
If you’ve made it this far, I’m sure you realize that installing additional packages will only take a line or two of code. Even better, Python is installed by default on Linux, so we really only need to install MySQL and MongoDB.
1
2
3
4
5
6
#Install MySQL
sudo apt-get install mysql-common
sudo apt-get install mysql-server
#Install MongoDB
sudo apt-get install mongodb
Summary
While the steps above can be intimidating if you’ve never used Linux or worked on the command line, but once you get the hang of it, your ability to use R on ‘big data’ (however you define it) will be much improved. For only a few pennies to up to a few dollars per hour, you can use hardware having 16-64GB of RAM or more.
EDIT, 4/9: The code is wrapping weird on some monitors. Click here for the commands in a .txt. file.