If you’ve spent any non-trivial amount of time working with Hadoop and Hive at the command line, you’ve likely wished that you could interact with Hadoop like you would any other database. If you’re lucky, your Hadoop administrator has already installed the Apache Hue front end for your cluster, which allows for interacting with Hadoop via an easy-to-use browser interface. However, if you don’t have Hue, Hive also supports access via JDBC; the downside is that setup is not as easy as including a single JDBC driver.
While there are paid database administration tools such as Aqua Data Studio that support Hive, I’m an open source kind of guy, so this tutorial will show you how to use SQL Workbench to access Hive via JDBC. This tutorial assumes that you are proficient enough to get SQL Workbench installed on whatever computing platform you are using (Windows, OS X, or Linux).
Download Hadoop jars
The hardest part of using Hive via JDBC is getting all of the required jars. At work I am using a MapR distribution of Hadoop, and each Hadoop vendor platform provides drivers for their version of Hadoop. For MapR, all of the required Java .jar files are located at /opt/mapr/hive/hive-0.1X/lib (where X represents the Hive version number you are using).
Download all the .jar files in one shot, just in case you need them in the future
Since it’s not always clear which .jar files are required (especially for other projects/setups you might be doing), I just downloaded the entire set of files and placed them in a directory called hadoop_jars. If you’re not using MapR, you’ll need to find and download your vendor-specific version of the following .jar files:
Additionally, you will need the following general Hadoop jars (Note: for clarity/long-term applicability of this blog post, I have removed the version number from all of the jars):
Whew. Once you have the Hive JDBC driver and the 10 other .jar files, we can begin the installation process.
Setting up Hive JDBC driver
Setting up the JDBC driver is simply a matter of providing SQL Workbench with the location of all 11 of the required .jar files. After clicking File -> Manage Drivers, you’ll want to click on the white page icon to create a New Driver. Use the Folder icon to add the .jars:
For the Classname box, if you are using a relatively new version of Hive, you’ll be using the Hive2 server. In that case, the Classname for the Hive driver is org.apache.hive.jdbc.HiveDriver (this should pop up on-screen; you just need to select the value). You are not required to put any value in the Sample URL box. Hit OK and the driver window will close.
With the Hive driver defined, all that’s left is to define the connection string. Assuming your Hadoop administrator didn’t change the default port from 10000, your connection string should look as follows:
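jdbc:hive2://your-hadoop-cluster-location:10000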
As stated above, I’m assuming you are using Hive2 server; replace your-hadoop-cluster-location with the hostname of your cluster, keeping port 10000 unless your administrator changed it. After that, type in your Username and Password and you should be all set.
Using Hive with SQL Workbench
Assuming you have achieved success with the instructions above, you’re now ready to use Hive like any other database. You will be able to submit your Hive code via the Query Window, view your schemas/tables (via the ‘Database Explorer’ functionality which opens in a separate tab) and generally use Hive like any other relational database.
Of course, it’s good to remember that Hive isn’t actually a relational database! From my experience, using Hive via SQL Workbench works pretty well, but the underlying processing still happens in Hadoop. So you’re not going to get the clean cancelling of queries that you would with an RDBMS, there can be a significant lag before answers come back (due to the Hive overhead), and you can blow up your computer by streaming back results larger than available RAM…but it beats working at the command line.
Before using the real-time reporting capabilities of Adobe Analytics, you first need to indicate which metrics and elements you are interested in seeing in real time. To see which reports are already set up for real-time access on a given report suite, you can use the GetRealTimeConfiguration() function:
#Get real-time reports that are already set up
realtime_config <- GetRealTimeConfiguration("<reportsuite>")
It’s likely that the first time you set this up, you’ll already see a real-time report for ‘Instances-Page-Site Section-Referring Domain’. You can leave this report in place or switch the parameters using SaveRealTimeConfiguration().
If you want to add/modify which real-time reports are available in a report suite, you can use the SaveRealTimeConfiguration() function:
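Here’s a sketch of what a call storing three reports might look like; the metric1/elements1-style argument names are my assumption, so check the package documentation for the exact signature:

#Store up to three real-time reports in one call
#(metricN/elementsN argument names are an assumption; see ?SaveRealTimeConfiguration)
SaveRealTimeConfiguration("<reportsuite>",
                          metric1 = "instances",
                          elements1 = c("page", "sitesection", "referringdomain"),
                          metric2 = "revenue",
                          elements2 = c("referringdomain", "sitesection"),
                          metric3 = "orders",
                          elements3 = c("products"))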
Up to three real-time reports can be stored at any given time. Note that you can mix and match which reports you want to modify; you don’t have to submit all three reports at once. Finally, keep in mind that it can take up to 15 minutes for the API to incorporate your real-time report changes, so if you don’t get your data right away, don’t keep re-submitting the function call!
Once you have your real-time reports set up in the API, you can use the GetRealTimeReport() function in order to access your reports. There are numerous parameters for customization; selected examples are below.
Minimum Example - Overtime Report
The simplest function call for a real-time report is to create an Overtime report (monitoring a metric over a specific time period):
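A minimal sketch, assuming an ‘instances’ real-time report has already been configured via SaveRealTimeConfiguration() (the report suite name is a placeholder):

#Monitor 'instances' over the last 15 minutes at one-minute granularity
realtime_overtime <- GetRealTimeReport("<reportsuite>", "instances")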
The result of this call will be a DataFrame with 15 rows of one-minute granularity for your metric. This is a great way to monitor real-time orders & revenue during a flash sale, see how users are accessing a landing page for an email marketing campaign, or track any other metric where you want up-to-the-minute status updates.
Granularity, Offset, Periods
If you want to have a time period other than the last 15 minutes, or one minute granularity is too volatile for the metric you are monitoring, you can add additional arguments to modify the returned DataFrame:
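A sketch of such a call; the periodMinutes/periodCount argument names are my assumption based on the heading above, so verify them against the function documentation:

#Instances by Page: 9-minute granularity x 3 periods = last 27 minutes
realtime_pages <- GetRealTimeReport("<reportsuite>", "instances", "page",
                                    periodMinutes = "9", periodCount = "3")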
This function call will return Instances by Page for the last 27 minutes (3 rows/periods per page, 9-minute granularity…just because!). Additionally, there are other arguments such as algorithm, algorithmArgument, firstRankPeriod and floorSensitivity that allow for creating reports similar to what is provided in the Real-Time tab of the Adobe Analytics interface.
Currently, even though the Adobe Analytics API supports real-time reports with three breakdowns, only a one-element breakdown is supported by RSiteCatalyst; the plan is to extend these functions to fully support the real-time capabilities in the near future.
From DataFrame to Something ‘Shiny’
If we’re talking real-time reports, we’re probably talking about dashboarding. If we’re talking about R and dashboarding, then naturally, ggvis/Shiny comes to mind. While providing a full ggvis/Shiny example is beyond the scope of this blog post, it’s my hope to provide a working example in a future blog post. Stay tuned!
Version 1.3 of the RSiteCatalyst package to access the Adobe Analytics API is now available on CRAN! Changes include:
Search via regex functionality in QueueRanked/QueueTrended functions
Support for Realtime API reports: Overtime and one-element Ranked report
Allow for variable API request timing in Queue* functions
Fixed validate flag in JSON request to work correctly
Deprecated GetAdminConsoleLog (appears to be removed from the API)
Searching via Regex functionality
RSiteCatalyst now supports the search functionality of the API, similar in nature to using the Advanced Filter/Search feature within Reports & Analytics. Here are some examples for the QueueRanked function:
#Top 100 Pages where the pagename starts with "Categories"
#Uses searchKW argument
queue_ranked_pages_search <- QueueRanked("<reportsuite>", "2013-01-01", "2014-01-28",
                                         c("pageviews", "visits"), "page",
                                         top = "100", searchKW = "^Categories")

#Top 100 Pages where the pagename starts with "Categories" OR contains "Home Page"
#Uses searchKW and searchType arguments
#(the searchKW vector and searchType = "OR" are inferred from the comments above)
queue_ranked_pages_search_or <- QueueRanked("<reportsuite>", "2013-01-01", "2014-01-28",
                                            c("pageviews", "visits"), "page",
                                            top = "100",
                                            searchKW = c("^Categories", "Home Page"),
                                            searchType = "OR")
QueueTrended function calls work in a similar manner, returning elements broken down by time rather than a single record per element name.
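For illustration, a comparable QueueTrended call might look like the following; the date.granularity argument name is my assumption:

#Same search as above, with results broken down by day
queue_trended_pages_search <- QueueTrended("<reportsuite>", "2013-01-01", "2014-01-28",
                                           c("pageviews", "visits"), "page",
                                           top = "100", date.granularity = "day",
                                           searchKW = "^Categories")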
Realtime Reporting API
Accessing the Adobe Analytics Realtime API now has limited support in RSiteCatalyst. Note that this is different from just using the currentData parameter within the Queue* functions, as the Realtime API methods provide data within a minute of that data being generated on-site. Currently, RSiteCatalyst only supports the most common types of reports: Overtime (no eVar or prop breakdown) and a one-element breakdown.
Because of the extensive new functionality for the GetRealTimeConfiguration(), SaveRealTimeConfiguration() and GetRealTimeReport() functions, code examples will be provided as a separate blog post.
Variable request timing for Queue function calls
This feature fixes the issue of an API request running so long that RSiteCatalyst gives up on retrieving an answer. Usually, API requests come back in a few seconds, but in selected cases a call could run long enough to exhaust the number of attempts (previously capped at 10 minutes). You can use the maxTries and waitTime arguments to specify how many times you’d like RSiteCatalyst to try to retrieve the report and the wait time between calls:
#Change timing of function call
#Wait 30 seconds between attempts to retrieve the report, try 5 times
#(QueueOvertime is shown purely for illustration; any Queue* function accepts these arguments)
queue_overtime_timing <- QueueOvertime("<reportsuite>", "2014-01-01", "2014-01-28",
                                       "pageviews", date.granularity = "day",
                                       maxTries = 5, waitTime = 30)
If you don’t specify either of these arguments, RSiteCatalyst will default to trying every five seconds to retrieve the report, up to 120 tries.
New Contributor: Willem Paling
I’m pleased to announce that I’ve got a new contributor for RSiteCatalyst, Willem Paling! Willem did a near-complete re-write of the underlying code to access the API, and rather than have multiple packages out in the wild, we’ve decided to merge our work. So look forward to better-written R code and more complete access to the Adobe Analytics APIs in future releases…
If you run into any problems with RSiteCatalyst, please file an issue on GitHub so it can be tracked properly. Note that I’m not an Adobe employee, so I can only provide so much support, as in most cases I can’t validate your settings to ensure you are set up correctly (nor do I have any inside information about how the system works :) )
Edit 2/20/2014: I mistakenly forgot to add the new real-time functions to the R NAMESPACE file, and as such, you won’t be able to use them if you are using version 1.3. Upgrade to 1.3.1 to access the real-time functionality.