This video explains how to overlay histogram plots in R for 3 common cases: overlaying a histogram with a normal curve, overlaying a histogram with a density curve, and overlaying a histogram with a second data series plotted on a secondary axis.
Note: Towards the end of the video (around minute 14), I make a language error when talking about the padj parameter in the mtext function. The setting doesn’t “left truncate” the label; I meant “right align”, “left align”, etc.
#Step 0: load/prepare data
#Read in data
sample_data <- read.csv("~/Desktop/test_data.csv")

# "Explode" counts by age back to unsummarized "raw" data
age.exploded <- rep.int(sample_data$age, sample_data$count)

#1. Histogram with normal distribution overlaid or density curve
#1A. Create histogram
hist(age.exploded, xlim=c(0,20), ylim=c(0,.2),
     breaks=seq(min(age.exploded), max(age.exploded), length.out=22),
     xlab="Age", ylab="Percentage of Accounts",
     main="Age Distribution of Accounts\n (where 0 <= age <= 20)",
     prob=TRUE, col="lightgray")

#1B. Do one of the following: either put the normal distribution on the histogram
# or put the smoothed density function
#Calculate normal distribution having mean/sd equal to data plotted in the
#histogram above
points(seq(min(age.exploded), max(age.exploded), length.out=500),
       dnorm(seq(min(age.exploded), max(age.exploded), length.out=500),
             mean(age.exploded), sd(age.exploded)),
       type="l", col="red")

#Add smoothed density function to histogram, smoothness toggled using
#"adjust" parameter
lines(density(age.exploded, adjust=2), col="blue")

#2. Histogram with line plot overlaid
#2A. Create histogram with extra border space on right-hand side
#Extra border space "2" on right (bottom, left, top, right)
par(oma=c(0,0,0,2))
hist(age.exploded, xlim=c(0,20), ylim=c(0,.2),
     breaks=seq(min(age.exploded), max(age.exploded), length.out=22),
     xlab="Age", ylab="Percentage of Accounts",
     main="Age Distribution of Accounts vs. Subscription Rate \n (where reported age <= 20)",
     prob=TRUE, col="lightgray")

#2B. Add overlaid line plot, create a right-side numeric axis
par(new=TRUE)
plot(sample_data$subscribe_pct, xlab="", ylab="", type="b", col="red", axes=FALSE)
axis(4)

#2C. Add right-side axis label
mtext(text="Subscription Rate", side=4, outer=TRUE, padj=1)
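Since test_data.csv isn't available to readers, here is a minimal sketch that builds a stand-in data frame so the first part of the script can be run as-is. The column names (age, count, subscribe_pct) come from the code above; the values themselves are made up purely for illustration, and `curve()` is swapped in as a shorter way to draw the normal curve.

```r
# Made-up summarized data standing in for test_data.csv (illustrative values only)
sample_data <- data.frame(
  age           = 0:20,
  count         = c(2, 5, 9, 14, 20, 26, 30, 32, 30, 26, 21,
                    16, 12, 9, 6, 4, 3, 2, 1, 1, 1),
  subscribe_pct = seq(0.05, 0.45, length.out = 21)
)

# Repeat each age 'count' times to recover one value per account
age.exploded <- rep.int(sample_data$age, sample_data$count)

# Histogram on the density scale, followed by both overlays from step 1B
hist(age.exploded,
     breaks = seq(min(age.exploded), max(age.exploded), length.out = 22),
     prob = TRUE, col = "lightgray", xlab = "Age", main = "Synthetic example")
curve(dnorm(x, mean(age.exploded), sd(age.exploded)), add = TRUE, col = "red")
lines(density(age.exploded, adjust = 2), col = "blue")
```

Because `prob=TRUE` puts the bars on the density scale, the normal curve and the kernel density estimate can be drawn on the same y-axis without any rescaling.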
I did a screencast for my co-workers to show how to get started with R: first what a base installation of R looks like, then how to improve your workflow using RStudio, Rcmdr or rattle. The examples are somewhat pedestrian, but they give a feel for what using R actually looks like.
If you have any questions, comments, or jeers about how bad I am at R, feel free to leave them in the comments section!
In my first post in this series, I discussed RStudio, an IDE that adds significant functionality and consistency to a basic installation of R. In this post, I will discuss Rcmdr, a GUI that provides the ability to do basic business statistics without having to code in R.
Rcmdr (“R Commander”)
Example Rcmdr window with the "Statistics" menu expanded
Rcmdr is a package for R that was created by John Fox at McMaster University in Canada to provide basic statistics functionality for classroom use. In this way, Rcmdr is somewhat similar to SAS Enterprise Guide, a GUI that allows quick access to the power of SAS without the requirement of writing code.
While using Rcmdr won’t allow you to tap into every single advanced feature that R provides, it does provide a lot of great “general” functionality that can be used in everyday business such as summary statistics, t-tests, ANOVA, linear regression modeling, graphing and data re-coding.
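To give a feel for what sits behind those menus, here is a rough sketch of plain R calls for the same kinds of analyses. This is not the exact code Rcmdr generates; it uses the built-in `mtcars` data set, and the choice of variables is only an assumption for illustration.

```r
# Built-in example data; convert the grouping columns to factors
cars <- transform(mtcars, am = factor(am), cyl = factor(cyl))

summary(cars$mpg)                      # summary statistics
tt  <- t.test(mpg ~ am, data = cars)   # two-sample t-test
av  <- aov(mpg ~ cyl, data = cars)     # one-way ANOVA
fit <- lm(mpg ~ wt, data = cars)       # linear regression

# Inspect any of the results with summary(), e.g. the regression fit
summary(fit)
```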
For the most part, the Rcmdr dialog boxes all look very similar. Only the most useful options are provided, such as the variable(s) you are looking to interrogate, variable(s) you’d like to break down your analysis by, what statistics you want the output to display (mean, median, mode, etc.) and so on. The dialog boxes vary depending on whether you are estimating a model or plotting a graph, but in my preliminary usage I haven’t found any dialog boxes that were so confusing that I needed to check the “Help” files.
For example, suppose I wanted to make a boxplot of my data, income by job type. To do so, I would go to the “Graphs” menu and select “Boxplot”, which provides me with the following dialog box:
Rcmdr options for creating a Boxplot
Boxplot output created by Rcmdr
Within this dialog box, there are only 3 choices: the variable to plot (income), the variable to break down the graph by (type), and “Identify outliers with mouse”, which lets the user click points on the resulting graph to label them as outliers. When I click “OK” in the dialog box, the result is the boxplot shown above. We can see that the “bc” (blue-collar) group has a lower median income than “prof” (professional) and “wc” (white-collar).
One of the best features of Rcmdr is that not only do we get the output we requested, but the code window also shows the code that was necessary to create the boxplot. In this example, the underlying R code is relatively simple:
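The generated snippet itself isn't reproduced in this text. As a rough stand-in, a base R call of the following shape draws an income-by-type boxplot. The data frame `jobs` and its income figures are invented for illustration, and the exact call Rcmdr writes may differ by version.

```r
# Hypothetical data: 'jobs' and its income values are made up for illustration
jobs <- data.frame(
  income = c(3500, 4200, 5100, 6100,
             8800, 10400, 12500, 14100,
             4000, 4900, 5600, 6300),
  type   = factor(rep(c("bc", "prof", "wc"), each = 4))
)

# Formula interface: numeric response on the left, grouping factor on the right
bp <- boxplot(income ~ type, data = jobs, xlab = "type", ylab = "income")

# A title can be added afterwards, the kind of tweak the generated code invites
title(main = "Income by job type")
```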
By providing the underlying code, Rcmdr serves as a teaching tool that moves the beginning user towards coding in R directly, or at least towards modifying the tool-generated code to include titles or whatever other options the user wants to add to the original analysis/output.
Installation of Rcmdr
Sadly, Rcmdr is one of those add-ins that seems to work better on Windows than on Mac OS X, at least for the installation portion. I’ve been able to successfully install Rcmdr on my relatively old MacBook Pro, but it did take a bit of time to figure out. Luckily, the instructions to install Rcmdr on a Mac are fairly well laid out in this article.
However, once you get over the hurdle of downloading tcltk and XQuartz (the X11 server for Mac), the program seems to work the same on both platforms.
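If you want to check from within R whether those pieces are in place before installing, something like the following sketch may help. `capabilities()`, `requireNamespace()` and `install.packages()` are standard R functions; `start_rcmdr` is a hypothetical helper name of my own, not part of Rcmdr.

```r
# Report whether this R build has Tcl/Tk and X11 support available
caps <- capabilities(c("tcltk", "X11"))
print(caps)

# Hypothetical helper (not part of Rcmdr): install the package if missing, then load it
start_rcmdr <- function() {
  if (!requireNamespace("Rcmdr", quietly = TRUE)) {
    install.packages("Rcmdr")
  }
  library(Rcmdr)  # loading the package opens the R Commander window
}

# Run this from an interactive R session once the checks above look right:
# start_rcmdr()
```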