Instructions for Installing & Using R on Amazon EC2

If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.

While you can use packages such as ff and bigmemory to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using Amazon EC2 to provision the resources you need.  Here are two ways to get started…

Use a Pre-Made AMI

In the great open-source tradition, there are already R AMIs (Amazon Machine Images) for Amazon EC2 available to use. The way I got started was using the pre-built images that Louis Aslett provides on his website.  Louis also provides great instructions on learning about EC2, so if you’ve never worked with R in the cloud or are just looking to get up and running fast, his website is a great way to do so.

Build Your Own Image

Alternatively, suppose you want to build your own customized image. For example, say you wanted to build a proof-of-concept ‘big data’ environment, so you want R, Python, MySQL and MongoDB.  The commands to accomplish this are listed below. Note that I’m assuming you have a basic understanding of working through the Amazon Web Services (AWS) Console, including being able to get to the ‘Classic Wizard’ for launching an EC2 instance. You also should have a basic understanding of working from the command line.

Setting Up Amazon EC2 Instance

  1. Launch an Ubuntu 12.04.1 LTS 64-bit image. You can use a free “t1.micro” instance while building, then provision more resources later once you’re ready for analysis.
  2. Accept defaults until you get to the Key-Pair tab. The Key-Pair is what allows you to log in securely to your Amazon EC2 instance without a password. Create and download a Key-Pair if you don’t already have one, or choose an existing Key-Pair if you do.
  3. When you get to the ‘Security Groups’ tab, create a security group that has the following ports open: 22 (SSH), 80 (HTTP), 443 (HTTPS), 3389 (RDP, optional), and 8787 (RStudio Server).
  4. Work through the rest of the Wizard until your instance is launched.
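
If you prefer the command line to the Wizard, the security group from step 3 can be sketched with the AWS CLI. This is only a sketch under assumptions: it assumes the AWS CLI is installed and configured, that your account has a default VPC, and the group name is hypothetical. With DRY_RUN left at its default, the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. GROUP is a hypothetical name; choose your own.
GROUP="r-analysis"
PORTS="22 80 443 8787"   # add 3389 here if you want RDP

# Print the command when DRY_RUN=1 (the default), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the group, then open each port to inbound TCP traffic.
# 0.0.0.0/0 allows any address; narrow it if you can.
open_ports() {
  run aws ec2 create-security-group --group-name "$GROUP" --description "R-on-EC2"
  for port in $PORTS; do
    run aws ec2 authorize-security-group-ingress \
      --group-name "$GROUP" --protocol tcp --port "$port" --cidr 0.0.0.0/0
  done
}

open_ports
```

Set DRY_RUN=0 only after you have read the printed commands and are happy with the group name and port list.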

Connecting to Amazon EC2 Instance

  1. There are two ways to connect to your EC2 instance, both of which can be found by going to the “Actions” tab in the AWS console, then selecting “Connect” from the drop-down. The rest of this tutorial assumes you connect via a stand-alone SSH client (such as Terminal on Mac OS X).
  2. Connect to your instance by typing the code provided to you, such as: ssh -i me-aws.pem ubuntu@ec2-50-19-18-120.compute-1.amazonaws.com
  3. Before you submit this code, either modify the line to put the directory path in front of your Key-Pair file name, or “cd” to the directory where the Key-Pair is located.
  4. After submitting the connect code, you will get a warning saying that the ‘authenticity can’t be established, do you want to continue?’  Type yes and hit enter to log in.
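
The connection steps above can be sketched as a small script. The key file name and host are the example values from step 2, and the keys directory is a hypothetical location; substitute your own. The script prints the ssh command instead of running it, so it is safe to try on any machine.

```shell
#!/bin/sh
# Sketch only: example values from the steps above, not your real ones.
KEY="me-aws.pem"
HOST="ec2-50-19-18-120.compute-1.amazonaws.com"

# Build the ssh command for a given key path and host.
ssh_command() {
  printf 'ssh -i %s ubuntu@%s\n' "$1" "$2"
}

# Restrict the key file to owner read-only; ssh refuses private keys
# that other users on your machine can read.
lock_key() {
  chmod 400 "$1"
}

# $HOME/keys is a hypothetical directory holding the downloaded Key-Pair.
ssh_command "$HOME/keys/$KEY" "$HOST"
```

Run lock_key once on your own key file, then paste the printed command into your terminal.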

Installing Base R

Once you are logged in, there are about a dozen commands that need to be submitted. Some commands run quickly; others can take 10-15 minutes to run through the entire installation process. The commands below are prefixed with “sudo” so that they run with the access rights installation requires. Submit each line one at a time.

#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio

#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade

#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list

#Add following line (or your favorite CRAN mirror)
deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu precise/

#Update package lists to use CRAN mirror
#Don't worry about the GPG key warning; the CRAN signing key has not been added
sudo apt-get update

#Install latest version of R
#Answer yes when asked to install without verification
sudo apt-get install r-base
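
If you would rather not edit the file in vi, the CRAN line can be appended from the shell instead. The sketch below is one way to do that, assuming the same mirror line as above; the function name and the SUDO variable are my own inventions for illustration. It only appends the line when it is not already present, and the demonstration writes to a scratch file rather than to /etc.

```shell
#!/bin/sh
# Same mirror line as in the listing above; swap in your favorite CRAN mirror.
CRAN_LINE="deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu precise/"

# Append CRAN_LINE to the given file unless an identical line already exists.
# SUDO is a hypothetical knob: leave it empty for files you own,
# or set SUDO=sudo when the target lives under /etc.
add_cran_source() {
  target="$1"
  if ! grep -qxF "$CRAN_LINE" "$target" 2>/dev/null; then
    printf '%s\n' "$CRAN_LINE" | ${SUDO:-} tee -a "$target" > /dev/null
  fi
}

# Demonstrate on a scratch file; calling twice still leaves a single copy.
demo=$(mktemp)
add_cran_source "$demo"
add_cran_source "$demo"
cat "$demo"
rm -f "$demo"
```

On the instance itself the call would look like: SUDO=sudo add_cran_source /etc/apt/sources.list.d/sources.list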

While not strictly required to run R, I also like to run the following commands to install the libcurl and libxml2 development libraries, which the RCurl and XML packages need. Both packages are useful if you want to use R to connect to any web data/APIs.

#Install in order to use RCurl & XML
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libxml2-dev

With these commands run, you will now be able to run R from the command line just by typing “R” at the prompt. However, it would be a crime to do all this work and not install RStudio Server, which makes working in R so much easier.

Installing RStudio Server
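
The installation commands for RStudio Server itself are not listed here, so what follows is a minimal sketch of one common route, assuming the gdebi installer and a .deb package downloaded from the RStudio website. DEB_URL is a placeholder, not a real download link; copy the current link from the RStudio Server download page. With DRY_RUN left at its default, the script only prints the commands it would run.

```shell
#!/bin/sh
# Placeholder URL (an assumption): replace with the real .deb link
# from the RStudio Server download page.
DEB_URL="${DEB_URL:-https://example.com/rstudio-server-amd64.deb}"

# Print the command when DRY_RUN=1 (the default), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Install gdebi, download the package, then install it with its dependencies.
install_rstudio_server() {
  deb_file=$(basename "$DEB_URL")
  run sudo apt-get install -y gdebi-core
  run wget "$DEB_URL"
  run sudo gdebi -n "$deb_file"
}

install_rstudio_server
```

Once DEB_URL points at a real package, running the script with DRY_RUN=0 on the instance performs the installation.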

Once RStudio Server is installed, you can access it through your local browser. Navigate to the Public DNS of your instance on port 8787, similar to:

http://ec2-50-19-18-120.compute-1.amazonaws.com:8787

The login and password will be the values you set when creating the user earlier (I used rstudio as my username above).

Installing MySQL, Python, and MongoDB

If you’ve made it this far, I’m sure you realize that installing additional packages will only take a line or two of code. Even better, Python is installed by default on Ubuntu, so we really only need to install MySQL and MongoDB.

#Install MySQL

sudo apt-get install mysql-common
sudo apt-get install mysql-server

#Install MongoDB

sudo apt-get install mongodb
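
After the installs finish, a quick check that each tool is on the PATH can save some head-scratching later. The sketch below is one way to do it; the function name is my own, and the binary names (R, python, mysql, mongod) are assumptions about what the packages above provide on your image.

```shell
#!/bin/sh
# Report, for each name given, whether a command by that name is on the PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

# Binary names are assumptions; adjust them to match your installation.
check_tools R python mysql mongod
```

Anything reported as missing points to an install command that needs another look.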

Summary

The steps above can be intimidating if you’ve never used Linux or worked on the command line, but once you get the hang of it, your ability to use R on ‘big data’ (however you define it) will be much improved. For anywhere from a few pennies to a few dollars per hour, you can use hardware with 16-64GB of RAM or more.

