randyzwitch.com

Video: SQL Queries in R using sqldf

Randy Zwitch × December 17, 2012 × DataScience

This video covers how to run SQL queries using the ‘sqldf’ package within R. This sqldf tutorial was part of a Keystone Solutions podcast discussion about data science and which skills beginning analysts should learn.

The example files from this tutorial can be downloaded from this link:

Example Data files
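
As a rough illustration of the idea, here is a minimal sketch of a sqldf query run against an ordinary data frame. The `accounts` data frame and its columns are made up for this example and are not the tutorial's data files. The query only runs if the sqldf package is installed.

```r
# Hypothetical data, not the tutorial's example files
accounts <- data.frame(
  age    = c(3, 5, 5, 8, 12),
  region = c("east", "west", "east", "west", "east"),
  stringsAsFactors = FALSE
)

# Run the query only when the sqldf package is available
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  # Data frames in the calling environment are referenced by name as SQL tables
  by_region <- sqldf(
    "SELECT region, COUNT(*) AS n, AVG(age) AS avg_age
       FROM accounts
      GROUP BY region
      ORDER BY region"
  )
  print(by_region)
}
```

The result comes back as a regular data frame, so it can be passed straight to other R functions.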

Video: Overlay Histogram in R (Normal, Density, Another Series)

Randy Zwitch × November 9, 2012 × DataScience

This video explains how to overlay histogram plots in R for 3 common cases: overlaying a histogram with a normal curve, overlaying a histogram with a density curve, and overlaying a histogram with a second data series plotted on a secondary axis.

Note: Towards the end of the video (around minute 14 or so), I misspoke when talking about the padj parameter in the mtext function. The setting doesn’t “left truncate” the label; I meant “right align”, “left align”, etc.
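
The code for this post starts from summarized data (one row per age, with a count) and expands it back to one value per account before calling hist. A minimal sketch of that step on made-up numbers follows; the `toy` data frame is hypothetical and is not the contents of test_data.csv.

```r
# Hypothetical summarized data: one row per age, with a count of accounts
toy <- data.frame(age = c(1, 2, 3), count = c(2L, 0L, 3L))

# Repeat each age 'count' times to recover one value per account
toy_exploded <- rep.int(toy$age, toy$count)
print(toy_exploded)  # 1 1 3 3 3

# The mean of the expanded vector matches the count-weighted mean of the summary
stopifnot(isTRUE(all.equal(mean(toy_exploded),
                           weighted.mean(toy$age, toy$count))))
```

An age with a count of zero simply drops out of the expanded vector, which is what a histogram of the raw data would show.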

#Step 0:  load/prepare data

#Read in data
sample_data <- read.csv("~/Desktop/test_data.csv")

# "Explode" counts by age back to unsummarized "raw" data
age.exploded <- rep.int(sample_data$age, sample_data$count)


#1. Histogram with normal distribution or density curve overlaid


#1A.  Create histogram
hist(age.exploded, xlim= c(0,20), ylim= c(0,.2), breaks=seq(min(age.exploded),
  max(age.exploded), length.out=22), xlab = "Age", ylab= "Percentage of Accounts",
  main = "Age Distribution of Accounts\n (where 0 <= age <= 20)",
  prob= TRUE, col= "lightgray")

#1B.  Do one of the following: overlay either the normal distribution
#     or the smoothed density function on the histogram

#Calculate normal distribution having mean/sd equal to data plotted in the
#histogram above
x.grid <- seq(min(age.exploded), max(age.exploded), length.out=500)
lines(x.grid, dnorm(x.grid, mean(age.exploded), sd(age.exploded)), col="red")

#Add smoothed density function to histogram, smoothness toggled using
#"adjust" parameter
lines(density(age.exploded, adjust = 2), col = "blue")

#2. Histogram with line plot overlaid

#2A.  Create histogram with extra border space on right-hand side

#Extra border space "2" on right  (bottom, left, top, right)
par(oma=c(0,0,0,2))

hist(age.exploded, xlim= c(0,20), ylim= c(0,.2),
     breaks=seq(min(age.exploded), max(age.exploded), length.out=22), xlab = "Age",
     ylab= "Percentage of Accounts",
     main = "Age Distribution of Accounts vs. Subscription Rate \n (where reported age <= 20)",
     prob= TRUE, col= "lightgray")

#2B.  Add overlaid line plot, create a right-side numeric axis
par(new=TRUE)
plot(sample_data$subscribe_pct, xlab= "", ylab="", type = "b", col = "red", axes=FALSE)
axis(4)

#2C.  Add right-side axis label

mtext(text="Subscription Rate",side=4, outer=TRUE, padj=1)
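
One thing to keep in mind with step 2B: plot(sample_data$subscribe_pct, ...) draws the rates against their row index rather than against age, so how well the points line up with the bars depends on how the rows of sample_data are ordered. Below is a sketch of an alternative that passes age explicitly and reuses the same xlim for both plots. The `toy_rates` data frame is made up, and the null PDF device is only there so the sketch runs without opening a window.

```r
# Hypothetical summarized data: age, number of accounts, subscription rate
toy_rates <- data.frame(
  age           = 0:5,
  count         = c(2L, 5L, 9L, 6L, 3L, 1L),
  subscribe_pct = c(0.10, 0.15, 0.22, 0.30, 0.28, 0.35)
)
ages <- rep.int(toy_rates$age, toy_rates$count)

# Null device: nothing is written to disk
pdf(NULL)

# Widen the right margin to make room for the second axis and its label
op <- par(mar = c(5, 4, 4, 5))

# One bin per age, centered on the integer ages
x_range <- c(-0.5, 5.5)
h <- hist(ages, breaks = seq(-0.5, 5.5, by = 1), freq = FALSE,
          xlim = x_range, col = "lightgray",
          xlab = "Age", ylab = "Share of accounts",
          main = "Age Distribution vs. Subscription Rate")

# Overlay the rate against age, using the same x-range as the histogram
par(new = TRUE)
plot(toy_rates$age, toy_rates$subscribe_pct, xlim = x_range,
     type = "b", col = "red", axes = FALSE, xlab = "", ylab = "")
axis(4)
mtext(text = "Subscription Rate", side = 4, line = 3)

# Restore the previous margins and close the device
par(op)
invisible(dev.off())
```

Because both plots share xlim, each red point sits over the bar for its own age regardless of row order.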

File Download:

Histogram overlay in R code and sample data file

Video: R, RStudio, Rcmdr & rattle

Randy Zwitch × September 7, 2012 × DataScience

I did a screencast for my co-workers to show how to get started with R: first what a base installation of R looks like, then how to improve your workflow using RStudio, Rcmdr or rattle. The examples are somewhat pedestrian, but they give a feel for what using R actually looks like.

If you have any questions, comments, or jeers about how bad I am at R, feel free to leave them in the comments section!

R Demo - Randy Zwitch from Keystone Solutions on Vimeo.
