Clustering Search Keywords Using K-Means Clustering

One of the key tenets of impactful digital analysis is understanding what your visitors are trying to accomplish. One of the easiest ways to do this is by analyzing the words your visitors use to arrive on the site (search keywords) and the words they use while on the site (on-site search). Although Google has made it much more difficult to analyze search keywords over the past several years (by passing “(not provided)” instead of the actual keywords), we can create customer intent segments based on the keywords that are still being passed, using unsupervised clustering methods such as k-means clustering.

Concept: K-Means Clustering/Unsupervised Learning

K-means clustering is one of many techniques within unsupervised learning that can be used for text analysis. Unsupervised refers to the fact that we’re trying to understand the structure of the underlying data, rather than optimizing for a specific, pre-labeled criterion (such as building a predictive model for conversion). Unsupervised learning is a great technique for exploratory analysis in that the analyst imposes few assumptions on the data, so previously unexamined relationships can be discovered and then analyzed; contrast that with defining groups up front (such as visitors from mobile or visitors from social) and then evaluating how various metrics differ across those pre-defined groups.

Without getting too technical, k-means clustering is a method of partitioning data into ‘k’ subsets, where each data element is assigned to the cluster whose center it is closest to. To use k-means clustering with text data, we first need to transform the text into a numeric representation. Luckily, R provides several packages to simplify the process.


Converting Text to Numeric Data: Document-Term Matrix

Since I use Adobe Analytics on this blog, I’m going to use the RSiteCatalyst package to get my natural search keywords into a dataframe. Once the keywords are in a dataframe, we can use the RTextTools package to create a document-term matrix, where each row is a search term and each column is a 1/0 indicator of whether a given word is contained within that search term. Within the create_matrix function, I’m using four keyword arguments to process the data:

  • stemWords reduces a word down to its root, a standardization method that avoids having multiple versions of words referring to the same concept (e.g. argue, arguing, argued all reduce to ‘argu’)
  • removeStopwords eliminates common English words such as “they”, “he”, “always”
  • minWordLength sets the minimum number of characters that constitutes a ‘word’, which I set to 1 because of the high likelihood of ‘r’ being a keyword
  • removePunctuation removes periods, commas, etc.
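
A minimal sketch of that step, assuming the keywords already sit in a dataframe named searchkeywords with the text in a column called Natural.Search.Keyword (the naming used in the comments and the sample file at the end of this post); the RSiteCatalyst pull itself is omitted here:

library(RTextTools)

# Document-term matrix: one row per search phrase, one column per word
dtm <- create_matrix(searchkeywords$Natural.Search.Keyword,
                     stemWords = TRUE,         # 'argue'/'arguing'/'argued' -> 'argu'
                     removeStopwords = FALSE,  # toggle for dropping common English words
                     minWordLength = 1,        # keep single-letter terms like 'r'
                     removePunctuation = TRUE) # strip periods, commas, etc.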

Popular Words

If you are unfamiliar with the terms that might be contained in your dataset, you can use the findFreqTerms function to see which terms occur with at least a given frequency; on this blog, I looked at the terms that occur at least 20 times.
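
A minimal sketch, assuming dtm is the document-term matrix created above (findFreqTerms comes from the tm package):

library(tm)

# Show every term appearing in at least 20 search phrases
findFreqTerms(dtm, lowfreq = 20)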

Guessing at ‘k’: A First Run at Clustering

Once we have our data set up, we can very quickly run the k-means algorithm within R. The one downside to using k-means clustering as a technique is that the user must choose ‘k’, the number of clusters expected in the dataset. In the absence of any heuristics about what ‘k’ to use, I can guess that there are five topics on this blog: 1) Data Science 2) Digital Analytics 3) R 4) Julia 5) WordPress. Running the following code, we can see if the algorithm agrees:
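
Something along these lines, assuming dtm and searchkeywords from the steps above (the seed is my own addition so the run is repeatable):

# k-means needs a plain numeric matrix, so densify the document-term matrix
dtm_m <- as.matrix(dtm)

set.seed(42)
kmeans5 <- kmeans(dtm_m, centers = 5)

# Attach each keyword's cluster assignment, then split out one dataframe per cluster
kw_with_cluster <- data.frame(keyword = searchkeywords$Natural.Search.Keyword,
                              kmeans5 = kmeans5$cluster,
                              stringsAsFactors = FALSE)

cluster1 <- subset(kw_with_cluster, kmeans5 == 1)
cluster2 <- subset(kw_with_cluster, kmeans5 == 2)
cluster3 <- subset(kw_with_cluster, kmeans5 == 3)
cluster4 <- subset(kw_with_cluster, kmeans5 == 4)
cluster5 <- subset(kw_with_cluster, kmeans5 == 5)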

Opening the dataframes to observe the results, it seems that the algorithm disagrees:

  • Cluster 1: “Free-for-All” cluster: not well separated (41.1% of terms)
  • Cluster 2: “wordpress” and “remove” (4.9% of terms)
  • Cluster 3: “powered by wordpress” (4.3% of terms)
  • Cluster 4: “twenty eleven” (13.5% of terms)
  • Cluster 5: “macbook” (36.2% of terms)

Of the five clusters, the strongest in terms of cohesion is cluster 5, which is fairly homogeneous in being about ‘macbook’ terms. Clusters 2-4 are all about WordPress, albeit different topics surrounding blogging. And cluster 1 is a large hodge-podge of seemingly unrelated terms. Clearly, five clusters isn’t the proper value for ‘k’.

Selecting ‘k’ Using the ‘Elbow Method’

Instead of randomly choosing values of ‘k’ and then looking at each clustering result until we find one we like, we can take a more automated approach to picking ‘k’. Every kmeans object returned by R contains a metric, tot.withinss, which is the total within-cluster sum of squares across all clusters (the sum of the squared distances from each point to its cluster center).
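
One way to collect that metric across a range of candidate values (a sketch; the range of 1 to 50 clusters is my own assumption):

# Accumulator: one row per value of k, holding the total within-cluster sum of squares
cost_df <- data.frame()

dtm_m <- as.matrix(dtm)
for (k in 1:50) {
  kfit <- kmeans(dtm_m, centers = k, iter.max = 100)
  cost_df <- rbind(cost_df, data.frame(cluster = k, cost = kfit$tot.withinss))
}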

The cost_df dataframe accumulates the results for each run, which can then be plotted using ggplot2 (ggplot2 Gist here):
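
Something like the following produces the plot (axis labels are my own rather than the original Gist):

library(ggplot2)

ggplot(cost_df, aes(x = cluster, y = cost)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters (k)",
       y = "Total within-cluster sum of squares")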

[Figure: ‘elbow plot’ of total within-cluster sum of squares (cost) versus number of clusters]

The plot above illustrates a technique known informally as the ‘elbow method’, where we look for breakpoints in the cost plot to understand where we should stop adding clusters. We can see that the slope of the cost function flattens around 10 clusters, then flattens again around 20 clusters. This means that as we add clusters beyond 10 (or 20), each additional cluster becomes less effective at reducing the distance of each data point from its cluster center (i.e. it reduces the variance less). So while we haven’t determined a single ‘best’ value of ‘k’, we have narrowed down a range of values for ‘k’ to evaluate.

Ultimately, the best value of ‘k’ will be determined by a combination of a heuristic like the ‘elbow method’ and analyst judgment after looking at the results. Once you’ve determined your optimal cluster definitions, it’s trivial to calculate metrics such as Bounce Rate, Pageviews per Visit, Conversion Rate, or Average Order Value to see how well the clusters actually describe different behaviors on-site.
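
As a sketch of that last step, suppose each keyword also came back with visit and bounce counts; the visits and bounces columns below are hypothetical, not part of the original post:

# Join hypothetical keyword-level metrics onto the cluster assignments
kw_with_cluster$visits  <- searchkeywords$visits
kw_with_cluster$bounces <- searchkeywords$bounces

# Roll the metrics up by cluster and compute a bounce rate per cluster
cluster_summary <- aggregate(cbind(visits, bounces) ~ kmeans5,
                             data = kw_with_cluster, FUN = sum)
cluster_summary$bounce_rate <- cluster_summary$bounces / cluster_summary$visits
cluster_summary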

Summary

K-means clustering is one of many unsupervised learning techniques that can be used to understand the underlying structure of a dataset. When used with text data, k-means clustering provides a great way to organize the thousands-to-millions of words your customers use to describe their visits. Once you understand what your customers are trying to do, you can tailor your on-site experiences to match those needs, as well as adjust your reporting/dashboards to monitor the various customer groups.

EDIT: For those who want to play around with the code but don’t use Adobe Analytics, here is the file of search keywords I used. Once you read the .csv file into a dataframe named searchkeywords, you should be able to replicate everything in this blog post.
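
A minimal read of that file (the filename is a placeholder for wherever you save it; note that R swaps the blanks in the column header for periods, so the column comes through as Natural.Search.Keyword, as discussed in the comments below):

searchkeywords <- read.csv("natural-search-keywords.csv", stringsAsFactors = FALSE)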

Comments

  1. Any thoughts on this error message when setting up the Document Term Matrix?

    Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
    ‘data’ must be of a vector type, was ‘NULL’

    Imported via CSV and the printed data frame doesn’t show any “null” values on inspection.

    • Randy Zwitch says:

      When writing to csv, R slightly changed the column name, replacing the blanks with periods. So you need to make the following modification to the code to get it to work:

      dtm <- create_matrix(searchkeywords$Natural.Search.Keyword,
                           stemWords=TRUE,
                           removeStopwords=FALSE,
                           minWordLength=1,
                           removePunctuation=TRUE)

      • The QueueRanked function call gave me an error: “API error 5030 : Invalid mixing of commerce and traffic metrics”. I kept only the pageviews metric and searchenginenaturalkeyword element and the call was accepted.

        The dtm<-create_matrix code snippet above got me exactly the same error as what Jesse saw. I did replace the code as per what Randy recommended above. Could we get an updated version of the whole code?

        • This one worked for me; I just referred to the column by index number rather than by name:

          dtm <- create_matrix(
            scdata3[1:1],
            stemWords=TRUE,
            removeStopwords=FALSE,
            minWordLength=1,
            removePunctuation=TRUE
          )

          • Randy Zwitch says:

            Glad you were able to get this figured out. Just for clarification, which part were you looking to have updated, the code to load the keywords from the file or the RSiteCatalyst API call itself?

  2. Could you show us how you can use ggplot2 to plot the 5 clusters?

  3. Aanchal Maheshwari says:

    The package “RTextTools” is not available (for R version 3.0.2); what should I use instead?

  4. I’m seeing the same thing.
    It installs on OSX but not Windows
    Warning in install.packages :
    package ‘RTextTools’ is not available (for R version 3.0.2)

  5. sebastienbrodeur says:

    Great article.

    Could the users’ intent be weighted by visits?

    • Randy Zwitch says:

      Thanks for stopping by, Sebastien. If you are using R, you can accomplish this by using the rep() function prior to creating the document-term matrix. The rep() function takes a vector of data and a vector of frequencies and returns a vector as if the data were in its raw, unsummarized form. So if the original search term had 1 row with a visits frequency of 5, the vector after using rep() would show that search term 5 times.
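
      A tiny made-up example of that behavior:

      keywords <- c("r ggplot2", "julia language")
      visits <- c(5, 2)
      rep(keywords, visits)  # "r ggplot2" five times, then "julia language" twice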

      Note, there exists a “weighted k-means” algorithm, but it’s not the same thing as the process you’re describing above. Rather, weighted k-means describes a weight that measures the relative importance of a column of data in determining cluster assignment.

  6. sebastienbrodeur says:

    Have you ever tried adding synonyms to help clustering (i.e. a business lexicon)?

    • Randy Zwitch says:

      I haven’t; my usage of the technique has been more exploratory in nature, just to see what groupings the algorithm finds based on the underlying data structure. But adding in additional context can only help (in the same way that stemming words improves the ability to cluster like terms).

  7. Randy,

    In this example, how did you find the percent of terms in the Clusters?

    • Randy Zwitch says:

      Hi Chip –

      In the step where I observe the clusters, I’m just dividing the number of rows in each sub-data frame by the total number of rows in kw_with_cluster. So there are 3611 total terms/rows, and cluster1 had 1484 rows, or 41.1% of the terms. In terms of R code:

      pct_terms_cluster1 <- nrow(cluster1)/nrow(kw_with_cluster)
