Clustering Search Keywords Using K-Means Clustering

One of the key tenets of doing impactful digital analysis is understanding what your visitors are trying to accomplish. One of the easiest ways to do this is to analyze the words your visitors use to arrive on the site (search keywords) and the words they use while on the site (on-site search).

Although Google has made it much more difficult to analyze search keywords over the past several years (due to their passing of “(not provided)” instead of the actual keywords), we can create customer intent segments based on the keywords that are still being passed using unsupervised clustering methods such as k-means clustering.

Concept: K-Means Clustering/Unsupervised Learning

K-means clustering is one of many techniques within unsupervised learning that can be used for text analysis. ‘Unsupervised’ refers to the fact that we’re trying to understand the structure of our underlying data, rather than trying to optimize for a specific, pre-labeled criterion (such as creating a predictive model for conversion). Unsupervised learning is a great technique for exploratory analysis because the analyst imposes few assumptions on the data, so previously unexamined relationships can be discovered and then analyzed. Contrast that with an analyst specifying pre-defined groups (such as visitors from mobile or visitors from social), then evaluating how various metrics differ across those groups.

Without getting too technical, k-means clustering is a method of partitioning data into ‘k’ subsets, where each data element is assigned to the closest cluster based on the distance of the data element from the center of the cluster. In order to use k-means clustering with text data, we need to do some text-to-numeric transformation of our text data. Luckily, R provides several packages to simplify the process.
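To make the assignment rule concrete, here is a minimal base R sketch on made-up two-dimensional points. The data, the two starting centers and all variable names are illustrative assumptions, not anything from this blog's keyword data. It assigns each point to its closest center by hand, then lets the built-in kmeans() function do the full job (repeating the assignment and recomputing the centers until they settle).

```r
# Toy data: two obvious groups of points (illustrative values only)
pts <- rbind(c(1, 1), c(1.5, 2), c(1, 1.5),
             c(8, 8), c(9, 9), c(8.5, 9.5))

# Two hypothetical cluster centers
centers <- rbind(c(1, 1), c(9, 9))

# Squared Euclidean distance from every point to every center
dist_to_centers <- sapply(seq_len(nrow(centers)), function(k) {
  center_rows <- matrix(centers[k, ], nrow(pts), ncol(pts), byrow = TRUE)
  rowSums((pts - center_rows)^2)
})

# Assignment step: each point goes to its closest center
assignment <- apply(dist_to_centers, 1, which.min)

# kmeans() starts from random centers, so fix the seed for repeatability
set.seed(42)
fit <- kmeans(pts, centers = 2)
```

Here assignment holds the cluster number chosen by hand for each row of pts, and fit$cluster holds the labels found by kmeans(). The label numbers themselves are arbitrary, so only the grouping is meaningful.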

Converting Text to Numeric Data: Document-Term Matrix

Since I use Adobe Analytics on this blog, I’m going to use the RSiteCatalyst package to get my natural search keywords into a dataframe. Once the keywords are in a dataframe, we can use the RTextTools package to create a document-term matrix, where each row is a search term and each column is a 1/0 representation of whether a single word is contained within that natural search term.

#### 0. Setup
library("RSiteCatalyst") #Provides SCAuth and QueueRanked for the Adobe Analytics API
library("RTextTools") #Loads many packages useful for text mining

#### 1. RSiteCatalyst code - Get Natural Search Keywords & Metrics

#Set credentials
SCAuth(<username:company>, <shared secret>)

#Get list of search engine terms
searchkeywords <- QueueRanked(<report_suite>, "2013-02-01","2013-09-16",
                  c("entries", "visits", "pageviews", "instances", "bounces"),
                  "searchenginenaturalkeyword", top="100000", startingWith = "1")

#### 2. Process keywords into format suitable for text mining

#Create document-term matrix, passing data cleaning options
#Stem the words to avoid multiples of similar words
#Need to set wordLength to minimum of 1 because "r" is a likely term
dtm <- create_matrix(searchkeywords$'Natural Search Keyword',
                     stemWords=TRUE,
                     removeStopwords=FALSE,
                     minWordLength=1,
                     removePunctuation=TRUE)

Within the create_matrix function, I’m using four keyword arguments to process the data:

  1. stemWords reduces a word down to its root, which is a standardization method to avoid having multiple versions of words referring to the same concept (e.g. argue, arguing, argued reduces to ‘argu’)
  2. removeStopwords eliminates common English words such as “they”, “he”, “always”
  3. minWordLength sets the minimum number of characters that constitutes a ‘word’, which I set to 1 because of the high likelihood of ‘r’ being a keyword
  4. removePunctuation removes periods, commas, etc.
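To show what the resulting matrix looks like, here is a minimal base R sketch that builds a 1/0 document-term matrix by hand. The three keywords are invented for illustration, and this sketch only lowercases and strips punctuation (it does no stemming or stopword handling), so it is a simplified stand-in for what create_matrix does rather than a replacement for it.

```r
# Hypothetical search keywords (illustrative only)
keywords <- c("fix macbook hinge",
              "wordpress remove footer",
              "macbook screen, cracked")

# Lowercase, strip punctuation, then split each keyword on whitespace
cleaned <- gsub("[[:punct:]]", "", tolower(keywords))
tokens  <- strsplit(cleaned, "\\s+")

# Vocabulary: every distinct word across all keywords
vocab <- sort(unique(unlist(tokens)))

# One row per keyword, one column per word: 1 if the word appears, else 0
dtm_toy <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
dimnames(dtm_toy) <- list(keywords, vocab)
```

Printing dtm_toy shows one row per keyword and one column per distinct word, which is the numeric form that kmeans() needs.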

If you are unfamiliar with the terms that might be contained in your dataset, you can use the findFreqTerms function to see which terms occur with a minimum frequency. Here are the terms that occur at least 20 times on this blog:

> #Inspect most popular words, minimum frequency of 20
> findFreqTerms(dtm, lowfreq=20)
  [1] "15"           "2008"         "2009"         "2011"         "a"            "ad"           "add"          "adsens"      
  [9] "air"          "analyt"       "and"          "appl"         "at"           "back"         "bezel"        "black"       
 [17] "book"         "bookmark"     "break"        "broke"        "broken"       "bubbl"        "by"           "can"         
 [25] "case"         "chang"        "child"        "code"         "comment"      "comput"       "cost"         "cover"       
 [33] "crack"        "css"          "custom"       "data"         "delet"        "disabl"       "display"      "do"          
 [41] "doe"          "drop"         "edit"         "eleven"       "em209"        "entri"        "fix"          "footer"      
 [49] "footerphp"    "for"          "free"         "from"         "get"          "glue"         "googl"        "hadoop"      
 [57] "header"       "hing"         "how"          "i"            "if"           "imag"         "in"           "is"          
 [65] "it"           "laptop"       "late"         "lcd"          "lid"          "link"         "logo"         "loos"        
 [73] "mac"          "macbook"      "make"         "mobil"        "modifi"       "much"         "my"           "navig"       
 [81] "of"           "off"          "omnitur"      "on"           "page"         "permalink"    "php"          "post"        
 [89] "power"        "pro"          "problem"      "program"      "proud"        "r"            "remov"        "repair"      
 [97] "replac"       "report"       "sas"          "screen"       "separ"        "site"         "sitecatalyst" "store"       
[105] "tag"          "text"         "the"          "theme"        "this"         "tighten"      "to"           "top"         
[113] "turn"         "twenti"       "twentyeleven" "uncategor"    "unibodi"      "use"          "variabl"      "version"     
[121] "view"         "vs"           "warranti"     "was"          "what"         "will"         "with"         "wordpress"   
[129] "wp"           "you"    

Guessing at ‘k’: A First Run at Clustering

Once we have our data set up, we can very quickly run the k-means algorithm within R. The one downside to using k-means clustering as a technique is that the user must choose ‘k’, the number of clusters expected from the dataset. In absence of any heuristics about what ‘k’ to use, I can guess that there are five topics on this blog:

  1. Data Science
  2. Digital Analytics
  3. R
  4. Julia
  5. WordPress

Running the following code, we can see if the algorithm agrees:

#I think there are 5 main topics: Data Science, Web Analytics, R, Julia, Wordpress
kmeans5<- kmeans(dtm, 5)

#Merge cluster assignment back to keywords
kw_with_cluster <- as.data.frame(cbind(searchkeywords$'Natural Search Keyword', kmeans5$cluster))
names(kw_with_cluster) <- c("keyword", "kmeans5")

#Make df for each cluster result, quickly "eyeball" results
cluster1 <- subset(kw_with_cluster, subset=kmeans5 == 1)
cluster2 <- subset(kw_with_cluster, subset=kmeans5 == 2)
cluster3 <- subset(kw_with_cluster, subset=kmeans5 == 3)
cluster4 <- subset(kw_with_cluster, subset=kmeans5 == 4)
cluster5 <- subset(kw_with_cluster, subset=kmeans5 == 5)

Opening the dataframes to observe the results, it seems that the algorithm disagrees:

  • Cluster 1: “Free-for-All” cluster: not well separated (41.1% of terms)
  • Cluster 2: “wordpress” and “remove” (4.9% of terms)
  • Cluster 3: “powered by wordpress” (4.3% of terms)
  • Cluster 4: “twenty eleven” (13.5% of terms)
  • Cluster 5: “macbook” (36.2% of terms)

Of the clusters, the strongest in terms of performance is cluster 5, which is pretty homogeneous in being about ‘macbook’ terms. Clusters 2-4 are all about WordPress, albeit different topics surrounding blogging. And cluster 1 is a large hodge-podge of terms that seem unrelated. Clearly, five clusters isn’t the proper value for ‘k’.
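Percentages like the ones in the list above can be read straight off the cluster assignments. A minimal sketch, using a short made-up vector of cluster labels as a stand-in for kmeans5$cluster (the values are hypothetical, so the shares will not match the figures quoted above):

```r
# Hypothetical cluster assignments, standing in for kmeans5$cluster
cluster_id <- c(1, 5, 5, 4, 1, 2, 5, 1, 3, 1)

# Count keywords per cluster, then convert the counts to percentages
cluster_share <- round(100 * prop.table(table(cluster_id)), 1)
```

A cluster holding a very large share of the terms, like the "Free-for-All" cluster, is often a hint that the chosen ‘k’ is too small to separate the topics.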

Selecting ‘k’ Using ‘Elbow Method’

Instead of randomly choosing values of ‘k’, then looking at each cluster result until we find one we like, we can take a more automated approach to picking ‘k’. Every kmeans object returned by R includes a metric, tot.withinss, which is the total within-cluster sum of squares: the squared distance of each data point from its cluster center, summed across all clusters.

#accumulator for cost results
cost_df <- data.frame()

#run kmeans for all clusters up to 100
for(i in 1:100){
  #Run kmeans for each level of i, allowing up to 100 iterations for convergence
  #Store the result as 'fit' so the kmeans function itself is not masked
  fit <- kmeans(x=dtm, centers=i, iter.max=100)

  #Combine cluster number and cost together, write to df
  cost_df <- rbind(cost_df, cbind(i, fit$tot.withinss))
}

names(cost_df) <- c("cluster", "cost")

The cost_df dataframe accumulates the results for each run, which can then be plotted using ggplot2 (ggplot2 Gist here):


The plot above illustrates a technique known informally as the ‘elbow method’, where we look for breakpoints in the cost plot to understand where we should stop adding clusters. We can see that the slope of the cost function gets flatter at 10 clusters, then flatter again around 20 clusters. This means that as we add clusters above 10 (or 20), each additional cluster becomes less effective at reducing the distance of each data point from its cluster center (i.e. it reduces the variance less). So while we haven’t determined an absolute, single ‘best’ value of ‘k’, we have narrowed down a range of values for ‘k’ to evaluate.
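The same idea can be tried on data where the answer is known in advance. The sketch below is a base R illustration on simulated points with three well-separated groups (the data, the range of ‘k’ and the nstart setting are my assumptions, not part of the keyword analysis). It records the cost for each ‘k’ and how much each extra cluster reduces it.

```r
set.seed(123)  # kmeans() starts from random centers, so fix the seed

# Toy data with three well-separated groups (illustrative only)
toy <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
# nstart = 10 tries several random starts and keeps the best one
ks   <- 1:8
cost <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10, iter.max = 100)$tot.withinss
})

# How much each additional cluster reduces the cost
cost_drop <- -diff(cost)

# Draw the curve only when running interactively
if (interactive()) {
  plot(ks, cost, type = "b",
       xlab = "Number of clusters (k)", ylab = "Total within-cluster SS")
}
```

On data like this the reductions in cost_drop should be large up to three clusters and small afterwards, which is the "elbow". Real keyword data is rarely this clean, which is why the plot usually narrows ‘k’ to a range rather than a single value.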

Ultimately, the best value of ‘k’ will be determined as a combination of a heuristic method like the ‘Elbow Method’, along with analyst judgement after looking at the results. Once you’ve determined your optimal cluster definitions, it’s trivial to calculate metrics such as Bounce Rate, Pageviews per Visit, Conversion Rate or Average Order Value to see how well the clusters actually describe different behaviors on-site.
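As a rough sketch of that last step, the base R snippet below rolls keyword-level metrics up to the cluster level. The data frame and its numbers are hypothetical; the column names mirror the metrics requested earlier (entries, bounces, visits, pageviews), and I am assuming bounce rate is defined as bounces divided by entries.

```r
# Hypothetical keyword-level metrics with a cluster label attached
kw_metrics <- data.frame(
  cluster   = c(1, 1, 2, 2, 2),
  entries   = c(100, 50, 40, 30, 30),
  bounces   = c(80, 40, 10, 5, 5),
  visits    = c(110, 55, 45, 35, 40),
  pageviews = c(150, 70, 180, 120, 100)
)

# Sum the raw counts per cluster first, then derive the ratios,
# so that high-volume keywords carry proportionally more weight
by_cluster <- aggregate(cbind(entries, bounces, visits, pageviews) ~ cluster,
                        data = kw_metrics, FUN = sum)
by_cluster$bounce_rate         <- by_cluster$bounces / by_cluster$entries
by_cluster$pageviews_per_visit <- by_cluster$pageviews / by_cluster$visits
```

Summing before dividing is a deliberate choice here: averaging per-keyword rates would give a keyword with one visit the same weight as a keyword with thousands.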


K-means clustering is one of many unsupervised learning techniques that can be used to understand the underlying structure of a dataset. When used with text data, k-means clustering can provide a great way to organize the thousands-to-millions of words being used by your customers to describe their visits. Once you understand what your customers are trying to do, you can tailor your on-site experiences to match these needs, as well as adjusting your reporting/dashboards to monitor the various customer groups.

EDIT: For those who want to play around with the code but don’t use Adobe Analytics, here is the file of search keywords I used. Once you read in the .csv file into a dataframe and name it searchkeywords, you should be able to replicate everything in this blog post.

Fun With Just-In-Time Compiling: Julia, Python, R and pqR

Recently I’ve been spending a lot of time trying to learn Julia by doing the problems at Project Euler. What’s great about these problems is that it gets me out of my normal design patterns, since I don’t generally think about prime numbers, factorials and other number theory problems during my normal workday. These problems have also given me the opportunity to really think about how computers work, since Julia allows the programmer to pass type declarations to the just-in-time compiler (JIT).

As I’ve been working on optimizing my Julia code, I decided to figure out how fast this problem can be solved using any of the languages/techniques I know. So I decided to benchmark one of the Project Euler problems using Julia, Python, Python with Numba, PyPy, R, R using the compiler package, pqR and pqR using the compiler package. Here’s what I found…


The problem I’m using for the benchmark is calculating the smallest number that is divisible by all of the numbers in a factorial. For example, for the numbers in 5!, 60 is the smallest number that is divisible by 2, 3, 4 and 5. Here’s the Julia code:

function smallestdivisall(n::Int64)
    for i = 1:factorial(n)
        for j = 1:n
            if i % j !=0
                break
            elseif j == n
                return i
            end
        end
    end
end

All code versions follow this same pattern: the outer loop runs from 1 up to n!, since by definition the last value in the loop will be divisible by all of the numbers in the factorial. The inner loop goes through and does a modulo calculation, checking to see if there is a remainder after division. If there is a remainder, break out of the inner loop and move to the next number. Once the state occurs where there is no remainder on the modulo calculation and the inner loop value of j equals the last number in the factorial (i.e. it is divisible by all of the factorial numbers), we have found the minimum number.
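For readers more at home in R, here is a sketch of the same pattern in base R. This is my own illustration of the loop structure described above, not the code used in the benchmark. One deliberate difference: the outer loop is a while loop, to avoid asking R to build a sequence running all the way up to n!, which becomes impractically long for larger n.

```r
# Smallest number evenly divisible by every integer from 1 to n,
# found by brute force with the same nested-loop pattern
smallestdivisall <- function(n) {
  limit <- factorial(n)  # n! is divisible by 1..n, so the search ends by here
  i <- 1
  while (i <= limit) {
    for (j in 1:n) {
      if (i %% j != 0) {
        break            # remainder found: i fails, try the next candidate
      } else if (j == n) {
        return(i)        # reached the last divisor with no remainder
      }
    }
    i <- i + 1
  }
}

result5 <- smallestdivisall(5)  # 60, matching the example in the text
```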

Benchmarking - Overall

Here are the results of the eight permutations of languages/techniques (see this GitHub Gist for the actual code used, this link for results file, and this GitHub Gist for the ggplot2 code):


Across the range of tests from 5! to 20!, Julia is the fastest to find the minimum number. Python with Numba is second and PyPy is third. pqR fares better than R in general, but using the compiler package can narrow the gap.

To make more useful comparisons, in the next section I’ll compare each language to its “compiled” function state.

Benchmarking - Individual



Amongst the native Python code options, I saw a 16x speedup by using PyPy instead of Python 2.7.6 (10.62s vs. 172.06s at 20!). Using Numba with Python instead of PyPy nets an incremental ~40% speedup using the @autojit decorator (7.63s vs. 10.63s at 20!).

So in the case of Python, using two lines of code with the Numba JIT compiler you can get substantial improvements in performance without needing to do any code re-writes. This is a great benefit given that you can stay in native Python, since PyPy doesn’t support all existing packages within the Python ecosystem.



It’s understood in the R community that loops are not a strong point of the language. In the case of this problem, I decided to use loops because 1) it keeps the code pattern similar across languages and 2) I hoped I’d see the max benefit from the compiler package by not trying any funky R optimizations up front.

As expected, pqR is generally faster than R and using the compiler package is faster than not using the compiler. I saw ~30% improvement using pqR relative to R and ~20% incremental improvement using the compiler package with pqR. Using the compiler package within R showed ~35% improvement.

So unlike the case with Python, where you can just add Numba and stay within the same language/environment, R offers two separate improvements that stack: if you can use both pqR and the compiler package, you get a performance benefit from each.
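For anyone who hasn’t used the compiler package, the sketch below shows the basic pattern with cmpfun, which takes a function and returns a byte-compiled version of it. The loop-heavy function is a made-up example rather than the benchmark code. Note that recent R releases (3.4.0 and later) enable a byte-code JIT by default, so on a current installation the timing gap may be much smaller than what I measured for this post.

```r
library(compiler)  # ships with R, no installation needed

# A deliberately loop-heavy function (illustrative, not the benchmark code)
sum_of_remainders <- function(n) {
  total <- 0
  for (i in 1:n) {
    for (j in 1:10) {
      total <- total + (i %% j)
    }
  }
  total
}

# Byte-compile the function; the compiled version is called the same way
sum_of_remainders_cmp <- cmpfun(sum_of_remainders)

# Time both versions on the same input and keep the results for comparison
time_plain <- system.time(r_plain <- sum_of_remainders(1e5))
time_cmp   <- system.time(r_comp  <- sum_of_remainders_cmp(1e5))
```

Comparing time_plain and time_cmp on your own machine is the only reliable way to know whether compiling is worth it for a given function.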


For a comparison like the one I’ve done above, it’s easy to get carried away and extrapolate the results from one simple test to all programming problems ever. “Julia is the best language for all cases ever!!!11111eleventy!” would be easy to proclaim, but not all problems are looping problems using simple division. Once you get into writing longer programs that involve other tasks such as string manipulation, accessing APIs, or using a technique from a package only available in one ecosystem but not another, which tool is “best” for solving a problem becomes a much more difficult decision. The only way to know how much improvement you can see from different techniques & tools is to profile your program(s) and experiment.

The main thing that I took away from this exercise is that no matter which tool you are comfortable with to do analysis, there are potentially large performance improvements that can be made just by using a JIT without needing to dramatically re-write your code. For those of us who don’t know C (and/or are too lazy to re-write our code several times to wring out a little extra performance), that’s a great thing.

RSiteCatalyst Version 1.1 Release Notes

RSiteCatalyst version 1.1 is now available on CRAN. Changes from version 1 include:

  • Support for Correlations/Subrelations in the QueueRanked function
  • Support for Current Data in all Queue* functions
  • Support Anomaly Detection for QueueOvertime and QueueTrended functions (example usage with ggplot2 graph)
  • Decrease in wait time for API calls (from 5 seconds to 2 seconds) and extending total number of API tries before report failure (from 100 seconds to 10 minutes)

For those of you Adobe Analytics (Omniture) users who haven’t yet tried to use the Adobe Analytics API, I’ve created an introduction video to get started. There will also continue to be examples of using this package on this blog on the RSiteCatalyst tag. Enjoy!
