Instructions for Installing & Using R on Amazon EC2

If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.

While you can use packages such as ff and bigmemory to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using Amazon EC2 to provision the resources you need.  Here are two ways to get started…

Use a Pre-Made AMI

In the great open-source tradition, there are already R AMIs (Amazon Machine Images) available for Amazon EC2. The way I got started was using the pre-built images that Louis Aslett provides on his website.  Louis also provides great instructions on learning about EC2, so if you’ve never worked with R in the cloud or are just looking to get up and running fast, his website is a great place to start.

Build Your Own Image

Alternatively, suppose you want to build your own customized image. For example, say you wanted to build a proof-of-concept ‘big data’ environment, so you want R, Python, MySQL and MongoDB.  The commands to accomplish this are listed below. Note that I’m assuming you have a basic understanding of working through the Amazon Web Services (AWS) Console, including being able to get to the ‘Classic Wizard’ for launching an EC2 instance. You also should have a basic understanding of working from the command line.

Setting Up Amazon EC2 Instance

  1. Launch an Ubuntu 12.04.1 LTS 64-bit image. You can use a free-tier “t1.micro” instance while building, then provision more resources later once you’re ready for analysis.
  2. Accept the defaults until you get to the Key-Pair tab. The Key-Pair is what allows you to log in securely to your Amazon EC2 instance without a password. Create and download a Key-Pair if you don’t already have one, or choose an existing Key-Pair if you do.
  3. When you get to the ‘Security Groups’ tab, create a security group that has the following ports open: 22 (SSH), 80 (HTTP), 443 (HTTPS), 3389 (RDP, optional), and 8787 (RStudio Server).
  4. Work through the rest of the Wizard until your instance is launched.
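
For readers who prefer scripting to the Wizard, the security group from step 3 could be sketched with the AWS command line tools. This is only an illustration under assumptions: the group name `r-server` is hypothetical, the AWS CLI is assumed to be installed and configured, and `0.0.0.0/0` opens each port to every address, which you may want to narrow. The snippet only prints the commands so you can review them before running any of them.

```shell
# Hypothetical group name; the port list comes from step 3 above.
GROUP_NAME="r-server"
PORTS="22 80 443 8787"   # add 3389 only if you need RDP

# Build the AWS CLI commands as text so they can be reviewed before running.
SG_COMMANDS="aws ec2 create-security-group --group-name $GROUP_NAME --description 'R and RStudio Server'"
for port in $PORTS; do
  SG_COMMANDS="$SG_COMMANDS
aws ec2 authorize-security-group-ingress --group-name $GROUP_NAME --protocol tcp --port $port --cidr 0.0.0.0/0"
done

# Print one command per line; nothing is sent to AWS here.
echo "$SG_COMMANDS"
```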

Connecting to Amazon EC2 Instance

  1. There are two ways to connect to your EC2 instance, both of which can be found by going to the “Actions” tab in the AWS console, then selecting “Connect” from the drop-down. The rest of this tutorial assumes you connect via a stand-alone SSH client (such as Terminal on Mac OS X).
  2. Connect to your instance by typing the code provided to you, such as:
    ssh -i zwitch-aws.pem ubuntu@ec2-50-19-18-120.compute-1.amazonaws.com
  3. Be sure that before you submit this code, you either modify the line to put the directory in front of your Key-Pair, or “cd” to the directory where the Key-Pair is located
  4. After submitting the connect code, you will get a warning saying that the ‘authenticity can’t be established, do you want to continue?’  Type ‘yes’ and hit enter to log in.
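
The connection steps above can be sketched as a small helper that uses the full path to the Key-Pair, so it works from any directory. The `$HOME/keys` folder is a hypothetical location; the key name and host are the example values from step 2, so substitute your own. The `chmod 400` line is there because ssh refuses private keys that other users can read.

```shell
# Hypothetical key location; the key name and host are the examples from step 2.
KEY_PATH="$HOME/keys/zwitch-aws.pem"
EC2_HOST="ec2-50-19-18-120.compute-1.amazonaws.com"

connect_ec2() {
  # Restrict the key to your own user, otherwise ssh rejects it
  chmod 400 "$KEY_PATH"
  # Log in as the default "ubuntu" user of the Ubuntu image
  ssh -i "$KEY_PATH" "ubuntu@$EC2_HOST"
}

# Show the command that connect_ec2 will run; call connect_ec2 to actually log in
echo "ssh -i $KEY_PATH ubuntu@$EC2_HOST"
```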


Installing Base R

Once you are logged in, there are about a dozen commands that need to be submitted. Some commands run quickly, others can take 10-15 minutes to run through the entire installation process. Commands that install software or change system files need “sudo” in front of them to have proper access rights; commands that only read information do not. Submit each line one at a time.
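
A minimal sketch of the base R install, assuming the CRAN mirror quoted in the comments below and the `precise` codename for Ubuntu 12.04. Any other CRAN mirror should work the same way. The function name `install_base_r` is just for illustration. Because this sketch does not add CRAN's signing key, apt may ask you to confirm installing unverified packages.

```shell
# Mirror and release name as quoted in the comments below; swap in any CRAN mirror.
CRAN_MIRROR="http://streaming.stat.iastate.edu/CRAN"
UBUNTU_RELEASE="precise"   # codename for Ubuntu 12.04
CRAN_LINE="deb $CRAN_MIRROR/bin/linux/ubuntu $UBUNTU_RELEASE/"

install_base_r() {
  # Append the CRAN repository so apt offers a current R build
  echo "$CRAN_LINE" | sudo tee -a /etc/apt/sources.list
  # Refresh the package index, then install R
  sudo apt-get update
  sudo apt-get install r-base
}

# On the instance, run: install_base_r
# Here we only print the repository line that would be added.
echo "$CRAN_LINE"
```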

While not strictly required to run R, I also like to install the Curl and XML packages, which are useful if you want to use R to connect to web data and APIs.
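
One way this step might look, as a sketch: it assumes the Ubuntu development libraries `libcurl4-openssl-dev` and `libxml2-dev`, which the RCurl and XML packages in R are built against, and then installs those two R packages for all users. The function name and the choice of the main CRAN site as the repository are assumptions.

```shell
# Ubuntu development libraries that R's RCurl and XML packages build against
WEB_LIBS="libcurl4-openssl-dev libxml2-dev"

install_web_libs() {
  # $WEB_LIBS is left unquoted on purpose so each package is a separate argument
  sudo apt-get install $WEB_LIBS
  # Then install the R packages themselves for all users
  sudo Rscript -e 'install.packages(c("RCurl", "XML"), repos = "http://cran.r-project.org")'
}

# On the instance, run: install_web_libs
echo "$WEB_LIBS"
```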

With these commands run, you will now be able to run R from the command line just by typing “R” at the prompt. However, it would be a crime to do all this work and not install RStudio Server, which makes working in R so much easier.

Installing RStudio Server
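
A rough sketch of the RStudio Server setup, under stated assumptions. The `useradd`, `mkdir` and `chmod` lines are the ones mentioned in the comments below; the `passwd` step and the use of `gdebi` to install the downloaded package are assumptions. `RSTUDIO_DEB_URL` is deliberately left empty: copy the current 64-bit Ubuntu link from the RStudio download page rather than trusting a hard-coded version.

```shell
# Placeholder: paste the current 64-bit Ubuntu .deb link from the RStudio download page.
RSTUDIO_DEB_URL=""

install_rstudio_server() {
  # Refuse to run until a download link has been filled in
  if [ -z "$RSTUDIO_DEB_URL" ]; then
    echo "Set RSTUDIO_DEB_URL first" >&2
    return 1
  fi
  # gdebi installs a local .deb together with its dependencies
  sudo apt-get install gdebi-core
  wget -O rstudio-server.deb "$RSTUDIO_DEB_URL"
  sudo gdebi rstudio-server.deb
}

create_rstudio_user() {
  # Linux account used to log in to RStudio Server
  sudo useradd rstudio
  sudo mkdir /home/rstudio
  # Assumed step: set the password you will type on the login page
  sudo passwd rstudio
  # Very permissive; this is the command quoted in the comments below
  sudo chmod -R 0777 /home/rstudio
}

# On the instance, run: install_rstudio_server && create_rstudio_user
```

Defining more users in the same way gives each person a separate workspace on the same server.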


Once you’ve run the commands above, you can access RStudio through your local browser. Navigate to the Public DNS of your instance on port 8787, similar to:

http://ec2-50-19-18-120.compute-1.amazonaws.com:8787

The login and password are the ones you set for the Linux user during setup (I used rstudio as my username above).

Installing MySQL, Python, and MongoDB

If you’ve made it this far, I’m sure you realize that installing additional packages only takes a line or two of code. Even better, Python is installed by default on Ubuntu, so we really only need to install MySQL and MongoDB.

#Install MySQL
sudo apt-get install mysql-common
sudo apt-get install mysql-server

#Install MongoDB
sudo apt-get install mongodb
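
To confirm everything landed, a quick check along these lines may help. It assumes the command names `python`, `mysql` and `mongod`, which is what these Ubuntu packages normally provide; `check_tool` is a hypothetical helper name.

```shell
check_tool() {
  # command -v succeeds only if the program is on the PATH
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: not found"
  fi
}

# Report on each tool this section relies on
for tool in python mysql mongod; do
  check_tool "$tool"
done
```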

Summary

The steps above can be intimidating if you’ve never used Linux or worked on the command line, but once you get the hang of it, your ability to use R on ‘big data’ (however you define it) will be much improved. For anywhere from a few pennies to a few dollars per hour, you can use hardware with 16-64GB of RAM or more.

EDIT, 4/9: The code is wrapping weirdly on some monitors.  Click here for the commands in a .txt file.

Comments

  1. randyzwitch says:

    usujason Glad you enjoyed it

  2. BrockTibert says:

    I have been meaning to do this for a while, thanks for sharing!  A couple of stupid questions, I am sure.  1) How do we save the image to use at a later date?  2) Any instructions on how to read/write data we keep in S3?  

    I have tried to read as much as I can before trying this myself so I apologize if these are straightforward.  Thanks again.  This is great timing!

    • BrockTibert Hi Brock, glad you found this useful!
      To save the image, you just choose “Stop” from the Actions menu. This will shut down the server so that you aren’t charged by the hour, yet leave the image in your account. Depending on how much space you allocate to the instance, there is still a monthly charge for storing the image (though much less than the hourly charge would be if you left the instance running).
      In terms of reading/writing to S3, I’ve never done it, but you can mount an S3 drive in an EC2 instance and it should work just like a regular hard drive. At that point, I *assume* you can use the regular file references to read.csv or whatever and use your files that way.

      Here’s a link that might give some clues:

      https://forums.aws.amazon.com/message.jspa?messageID=319538

  3. Hi Randy,

    Do you have any tips on changing the sources.list file? I originally installed R from the Ubuntu repository, but it turns out that’s a very outdated version. I then tried your commands, with no success. With some help from a friend I think we managed to change it to include a CRAN mirror, but after updating, it still looks like 2.14 is running on the server. I’m stumped!

    Thanks!

    • Randy Zwitch says:

      Hi Kate, sorry to hear you’re having trouble.

      If you’ve successfully added a mirror, running sudo apt-get update and sudo apt-get upgrade should update not only R, but every software package on your instance. I just ran the commands above on a new t1.micro instance on AWS and 3.0.2 was installed.

      The only thing I can think of is that either you didn’t save the file containing the mirror or didn’t run sudo apt-get update. If you run sudo vi /etc/apt/sources.list.d/sources.list, do you see your mirror?

      • If the goal is to make it easier for users (especially those not familiar with linux) I would highly suggest staying away from vi in future tutorials. I love it but it’s not really something you should be telling new users to use to edit text documents (especially if you don’t include instructions on how to actually use vi to edit a file). To make it easiest on the user you can just use a bash command to append the line of interest to the file.

        Something like this is sloppy and I know it could be cleaner but this will do it:

        sudo sh -c "echo 'deb http://streaming.stat.iastate.edu/CRAN/bin/linux/ubuntu precise/' >> /etc/apt/sources.list"

        • Randy Zwitch says:

          Thanks for the comment. You are right, I implicitly assumed that anyone wanting to work with Amazon EC2 would know what vi/m is and would be comfortable adding text to a configuration file and would just be looking for the apt-get instructions.

  4. Hi Randy,
    Very useful post, thanks.
    I have problem with RStudio, when I try to login it enters without problems but then RStudio throws an error: Unable to connect to service.
    How can I fix this.
    Many thanks.

    • Randy Zwitch says:

      Hi Rafik –

      I was able to generate the “Unable to connect to service” by omitting this line from the directions:

      sudo chmod -R 0777 /home/rstudio

      When I ran the above command, rstudio then worked as expected. So I would try that first.

  5. Thanks for the awesome tutorial! Very much appreciate the help. I’m getting comfortable with linux command line stuff, but not all the way there yet (in fact, a far way off) and your instructions were very helpful.

  6. Hi Randy, this was very helpful! I have a few (fairly naive) questions about using RStudio on EC2. Can I run multiple instances of RStudio from the same port? Is there a way to submit R scripts so that they continue processing even if I am logged off the server? I’m guessing this needs to be done in the command line, but wasn’t sure how. Finally, I noticed that the location of the (visible) cursor sometimes gets misaligned with the actual location of the cursor when typing scripts in RStudio in EC2. Have you experienced this? If so, were you able to find a fix? Thanks for all the help!

    • Randy Zwitch says:

      Thanks for reading Andrew!

      It’s not a matter of the port that determines the number of instances, it’s the number of users you define. So in the steps above with ‘sudo useradd rstudio’ & ‘sudo mkdir /home/rstudio’, if you defined more users you could have more instances. However, you can only have one instance open of RStudio per login you use. The main use case is to set up RStudio for a classroom full of students or a company, where everyone could share the same server resources while having different workspaces.

      If you want to submit scripts without having to be logged-in, you can use the command line with a linux utility called Screen (http://www.gnu.org/software/screen/). Screen allows you to define a session, disconnect, then re-connect to the Screen at a later time. Screen is a good way to have multiple instances of R running at the same time, just define a Screen for each process.

      And finally, yes I have noticed the mis-alignment. That happens on my local laptop RStudio as well, so I think it might just be in the main RStudio code. Might want to open a ticket with RStudio, they seem pretty receptive to user comments.

  7. Hi, Randy! I ran across your nice post while investigating the problem(s) with my R/RStudio setup on new EC2 instance. I would appreciate, if you could advise on that: http://stackoverflow.com/questions/23357551/unexpected-behavior-of-r-after-install-on-another-ec2-instance. Thank you!

    • Randy Zwitch says:

      Hi Aleksandr –

      Based on the StackOverflow article you posted, it appears you already solved your problem.

      In general, it’s not good practice to have default packages loaded in your profile, as it makes your code less reproducible when you send it to other people. From the looks of what you posted, removing that code solved your issue; without knowing why that code was there originally or having access to your EC2 instance, I can’t comment on what the issue might have been.

  8. Jeremiah Johnson says:

    Very nice intro. As someone who works with large datasets on ec2, have you ever run into the issue described here: http://stackoverflow.com/questions/24891861/trouble-uploading-large-files-to-rstudio-using-louis-asletts-ami-on-ec2?

    • Randy Zwitch says:

      I haven’t, though I suspect that the issue might be a limitation on the size of the file that the browser will allow to upload. This is remedied by SFTP’ing into your EC2 image and uploading directly. If you can gzip the file locally first, the upload will go even faster, then you can gunzip from the EC2 image.

