Recently, I’ve been getting my blood pressure up reading (marketing) articles about “big data” and “data science”. What saddens me about the whole discussion is that there is the underlying premise that what is stopping companies from “harnessing the power of big data” is just the lack of an easy-to-use, push-button tool. Respectfully, if you believe this, you should bow out of the conversation altogether.
Math is hard and stuff.
The first article that really bothered me is titled “Do Predictive Modelers Need to Know Math?” This is a provocative title from a veteran in the data mining/data science industry, and his conclusion is basically ‘Yes, but not everyone on the team needs to be able to hand-solve equations.’ I think that’s a fair point within the context of needing to understand the mathematical concepts behind algorithms, but not needing to be bogged down by notation.
Extending that idea a little further, how far away from the math should a business be comfortable with an employee pushing the button on a machine learning algorithm? Should the CEO be building predictive models? The Intern? A Call Center Rep? For me, I think the answer falls back on the allegory of the highly-specialized tradesperson:
Driver: “How can you charge $100 for five minutes’ work? All you did was put a bolt on and turn the wrench a few times!”
Mechanic: “I didn’t charge you for the parts, I charged you for knowing where to put the wrench…”
The value a data scientist brings to a business is not that he can push the buttons in a GUI like rattle for R, Weka, KnowledgeSeeker, or SAS Enterprise Miner. What your data scientist brings to the table is knowing the underlying assumptions that go into a model, how the algorithm works, which algorithm is appropriate for the business problem being solved, and how to tell when the model/algorithm has failed.
Of all of the things listed, the experience of knowing when the model can/has failed is what you’re paying the money for. That knowledge doesn’t come from just pushing the GUI buttons a bunch of times. And if you’re making million-dollar decisions based on an algorithm, it’s worth paying the salary for a person who really understands the model.
Hire a mathematician, get a programmer free
The next article that bothered me is a “Business Analytics: Do we need data scientists?” debate over whether we need data scientists at all. The No argument boils down to the idea that only things that can be made easy and are sufficiently developed are useful/valuable. Thus, because a general analyst can’t just use Excel, but rather might need to write a SQL query or a program to put together a dataset, the problem domain is too difficult. The No debater also describes data scientists as “adversarial” and “pretentious”, and accuses them of project “snobbery”.
But here’s the thing…the problem domain isn’t particularly difficult if you hire someone with above-average math proficiency. Any decent graduate program in mathematics, statistics, computer science, economics, finance, psychology and others will be using data through programming. Now, the languages may vary between Java, R, Python, Matlab, C++, SAS, Octave, EViews or others, but the language doesn’t matter; they’ll learn whatever language your company is using once you hire them. They will also learn the systems you are using to store your data, whether it’s a standard relational database, a NoSQL database, or a parallel processing platform like Hadoop.
How can I be certain that the math person you hire will be able to learn all that’s necessary for data science? Because the type of person who likes math & programming is probably a ‘system builder’ type of person. The type of person who played with Legos growing up. The type of person who built their own desktop computer back in the day. The type of person who thinks How It’s Made is much more interesting TV than mindless reality shows. The type of person who WANTS to know how a database is storing data, what new open-source technology is out there, and how many nodes they can connect together before their program won’t finish any faster.
As for the adversarial/pretentious/snobby comment, all I can say is I’ve never witnessed that. The people I know in the data science community are the nicest people, willing to share code, collaborate on ideas and talk until they lose their voice about how to solve an interesting problem.
Data Science is about innovative research, not reporting
I’ve read four academic papers this week. I’m not in graduate school.
As some of you might know, I started a new position at a startup which provides real-time intelligence for the lead generation industry. As such, I’ve got access to billions of records of unstructured data and just as much structured data. And as a startup, there are several warts that need to be fixed with respect to data storage. So on any given day, I might access a MySQL database, Amazon Redshift (columnar RDBMS), Amazon DynamoDB (NoSQL), plain ol’ .csv files via Excel, or massive .csv files on Amazon S3. To access this data, I’ve used a combination of R, Python, SQL Workbench, and MySQL Workbench on OS X, Ubuntu desktop and a ‘headless’ Ubuntu image on Amazon EC2.
Why am I giving you all this jibber-jabber about research papers and tools? Because the idea of building a one-size-fits-all tool to solve the problem I’m working on just doesn’t make sense. And for that matter, I’m not even sure the problem I’m working on is worth solving. But that’s the thing…I don’t KNOW it’s not worth solving, so I need to find out. I’ve got a quarter-billion URLs that I think I can extract information from, just to give our clients ONE more data element to use to optimize their marketing strategies. There may be an already existing algorithm I can use, or maybe I’ll try this research paper on “word breaking” I found from Microsoft Research. Once I find out the answer, if it’s valuable, then I need to be able to implement my algorithm in our real-time API, because it’s likely whatever language I end up using isn’t going to be what our API is written in.
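To make “word breaking” concrete, here is a minimal sketch of the problem: splitting a run-together string (the kind you find in a URL) into words. This is a textbook dynamic-programming approach with a tiny, made-up vocabulary, not the method from the Microsoft Research paper; the function name, the vocabulary and the sample string are all assumptions for illustration.

```python
def word_break(text, vocabulary):
    """Split text into vocabulary words, or return None if no split exists."""
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break  # keep the first segmentation found for this prefix
    return best[len(text)]


# Hypothetical vocabulary; a real one would come from a large word list.
vocab = {"cheap", "car", "insurance", "quotes"}
print(word_break("cheapcarinsurancequotes", vocab))
```

A real version would need a far larger vocabulary and some way to rank competing splits, which is exactly the kind of thing the research papers are for.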
So if these aren’t the type of problems you’re working on, then maybe there is an all-in-one tool out there for you to use (and that’s okay). But these are the types of edge-case problems that I think about when I think about “data science”, and as such, it will always be custom and ad-hoc. There are many awesome open-source tools I will use to help me along the way, but it will never make sense to build an easy-to-use tool for a problem a few dozen companies may ever need to know the answer to.
I’m already 1100 words into this rant, so I’ll finish up with a few admissions. Yes, “data science” is a somewhat ridiculous name for the combination of advanced analytics and data engineering that it represents. And yes, there are plenty of vendors out there peddling hype about the grandeur of ‘Big Data’ and why every business MUST jump on board or be left behind.
But rather than focusing on why something is “useless” or “stupid” or “hype”, just ask yourself “Can I solve the business problems I have today using the tools I currently have access to?” If the answer is yes, then great, get to work. If not, maybe you can find someone to help you get where you’re going (and that person may or may not call themselves a “Data Scientist”). Either way, let’s all move forward and do something extraordinary. It’s the least we can do for our customers.
In Part 1 of this series, I discussed some of the basic concepts around Hadoop, specifically when it’s appropriate to use Hadoop to solve your data engineering problems and the terminology of the Hadoop eco-system. This post will cover how to install your own Hadoop cluster on Amazon EC2 using Cloudera Manager.
Like prior posts talking about Amazon EC2, this post assumes you have some basic facility with Linux, submitting instructions via the command line, etc. Because really, if you’re interested in Hadoop, using the command line probably isn’t a limiting factor!
Building an 18-node Hadoop Cluster
The SlideShare presentation below shows the steps to building an 18-node Hadoop cluster, using a single m1.large EC2 instance as the ‘Name Node’ and 18 m1.medium EC2 instances as the ‘Data Nodes’. I chose 18 nodes because according to Cloudera, 20 is the maximum that can be activated at one time through the Amazon API, so let’s stay under the max to avoid any errors. It’s possible to add more instances later through the Cloudera Manager (up to 50 total), if so desired.
Note that going through this tutorial will cost $2.40/hr at current prices ($0.24/hr per m1.large instance and $0.12/hr per m1.medium instance).
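If you want to budget for a different cluster size, the arithmetic behind that figure can be sketched as below. The prices are the ones quoted in this post and will have changed since; the function name and the 8-hour session are just assumptions for illustration.

```python
# Hourly prices quoted in this post; check current EC2 pricing before relying on them.
M1_LARGE_PER_HOUR = 0.24   # the single Name Node
M1_MEDIUM_PER_HOUR = 0.12  # each Data Node


def cluster_cost_per_hour(name_nodes=1, data_nodes=18):
    """Total hourly cost for the cluster layout described above."""
    return name_nodes * M1_LARGE_PER_HOUR + data_nodes * M1_MEDIUM_PER_HOUR


hourly = cluster_cost_per_hour()
print(f"${hourly:.2f}/hr")                       # $2.40/hr
print(f"${hourly * 8:.2f} for an 8-hour session")  # $19.20 for an 8-hour session
```

Remember to stop or terminate the instances when you’re done, since the meter keeps running whether or not you’re submitting jobs.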
Since the SlideShare presentation is potentially not so friendly on the eyes, I’ve also created a PDF download that’s full resolution.
Once you make it through all these steps to set up a Hadoop cluster, you are ready to do some analysis. Part 3 of this tutorial will cover how to upload data into HDFS using Hue.
Update, 7/13/13: As is the case with any open-source project, there have been several changes to the Cloudera Manager that make setup easier. When getting started, on the screen where it asks “Which Cloudera do you want to deploy?”, choose ‘Cloudera Standard’. Also, once you get to slides 13-14 where you click on the link to get started with Hue, the link now works correctly (you don’t need to search for the Amazon DNS any more!).
For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint as well as how to adopt newer technologies from a business perspective.
Unfortunately, there wasn’t a whole lot of practical information on how to actually get started using ‘big data’ technologies, of which Hadoop is one. Luckily, it’s fairly easy to create a proof-of-concept Hadoop cluster using Amazon EC2 and Cloudera.
This series will be at least 5 parts, as follows:
- Intro to Hadoop ecosystem and concepts
- Setting up Hadoop cluster on Amazon EC2 using Cloudera
- Populating HDFS with airline dataset files using Hue
- Use Hive and/or Pig to stack datasets into one master dataset
- Doing analytics on the combined Airline dataset using Pig and/or Hive
My aim with this series is to simply explain why you might want to consider using Hadoop for your data storage and processing. There’s a lot of marketing & vendor ~~bullshit~~ excitement surrounding the term ‘big data’, so for this blog series, I’m just going to focus on the most important points for an analyst/marketer to understand. And other than this sentence, there will be no mentions of MS Excel in terms of ‘big data’, which is barely an appropriate tool for analysis in general, let alone analysis at scale.
What Is Hadoop & Why Are People Talking About It?
At its simplest, Hadoop provides a parallel-processing computing framework for data storage and processing. A parallel-processing framework is important for enterprise-level analysis because of physical limitations on how quickly a single machine can process information.
As an example, suppose you want to create a report that looks at 1 trillion daily credit card transactions. It’s possible to do your calculations on your local desktop using a tool like SAS. However, the amount of time to process that much data on a desktop with 8GB-16GB of RAM might be 8 hours, 10 hours…24 hours?! So an analyst trying to get an answer can start a SINGLE business question at 8am and hope they get their answer before it’s time to leave at the end of the day. Suffice it to say, not a particularly efficient way to run a business.
The solution might seem to be adding more processors and RAM to a desktop, but what happens when you add more users asking questions? Now you need an enterprise-class server such as Oracle or Teradata (and a few million dollars!). And for every terabyte of data you want to store, you’ll need a few thousand dollars. And that’s just for your nicely structured data…what happens when you want to start storing data such as free-form text that’s not so cleanly structured? Eventually, these types of engineering questions lead you towards a solution like Hadoop.
The reason why there is so much discussion around Hadoop as a data platform is that it solves the problems stated above: excessive time to process vast amounts of data and excessive cost of data storage. By using “commodity hardware” along with some fancy engineering, Hadoop provides an extremely cost-effective and flexible way to handle your enterprise data.
If Hadoop is so Great, Why Doesn’t Everyone Use It?
“Fast, Cheap And Good. Everyone should use Hadoop!” - Every vendor in the marketplace
Just like you (probably) don’t use a screwdriver to stir a pot of chicken soup, not every data storage and analysis problem requires the extreme flexibility that Hadoop can provide. From the example above with credit card transactions, a standard relational database might continue to be an acceptable solution if you’re just running a basic SQL query to sum across the rows. But the further your data moves beyond “rows and columns” and into things such as free-form text, images and clickstream data, the more Hadoop makes sense.
It’s a tautology, but you know you need a solution like Hadoop when you suspect you need a solution like Hadoop! If you already have a highly functioning data mart that answers your business questions, you probably don’t need to re-engineer everything just because. If you’re an Internet startup trying to create the next Facebook, then a standard relational database probably won’t cut it.
The best example I heard at eMetrics about the need for Hadoop was from Bob Page (now at Hortonworks, a Hadoop vendor): when Bob was at eBay, for the longest time they were throwing away data, specifically images from the listings. High storage costs were leading to an undesirable business outcome (deletion), and the data in question was unstructured (images), so a Hadoop framework made sense to implement. Once implemented, eBay could look across years of auctions to answer their business questions.
I’m An Analyst, Not An Engineer…What’s The Minimum I Need To Know To Get Started?
“MapReduce, Pigs, HCatalogs, Elephants, Bees, Zoos…Ooozie (Uzi’s)? WTF is everyone talking about?”
If you’ve made it this far and you’re not an engineer or DBA, you’re probably someone who’s interested in data science. You may be someone who already uses R, Python, Ruby or Java. Or, you’re a masochist. In any case, here are the minimum concepts I think you need to know to get started for later blog posts:
- MapReduce: Not explicitly a Hadoop idea, but the idea that data can be split into chunks by a key (“Map”) and then processed into information by one or more functions/transformations (“Reduce”). In the Hadoop sense, MapReduce is generally a reference to a “job” written in Java that performs a data transformation
- HDFS: Hadoop Distributed File System. Raw data gets imported into HDFS (either structured or unstructured), then distributed across all of the various nodes to allow for parallel processing
- Hive: SQL-like interface so that analysts don’t have to write MapReduce code directly
- Pig: A scripting language used for analysis. Generally, an analyst will use Hive and/or Pig to do their work
- HCatalog: A ‘Data Warehouse’ layer on top of HDFS, similar to how you define a database table (a series of columns in a table with formats)
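To make the MapReduce idea less abstract, here is a minimal single-machine sketch in plain Python. It is not Hadoop code (a real job would typically be written in Java, or generated for you by Hive or Pig, and would run in parallel across the nodes); the flight records and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical flight records: (carrier code, arrival delay in minutes).
flights = [("AA", 10), ("DL", 5), ("AA", 20), ("UA", 0), ("DL", 15)]


def map_phase(records):
    """Emit one (key, value) pair per record."""
    for carrier, delay in records:
        yield carrier, delay


def shuffle(pairs):
    """Group values by key; Hadoop does this step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single result (here, the mean delay)."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}


avg_delay = reduce_phase(shuffle(map_phase(flights)))
print(avg_delay)
```

This is roughly what a GROUP BY with an average would express in SQL, which is why tools like Hive can spare an analyst from writing the map and reduce steps by hand.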
With the above five Hadoop concepts in place, the next few posts will be to set up a proof-of-concept Hadoop cluster on Amazon EC2, processing ~12GB of publicly available data from the ‘Airline dataset’. That’s not ‘big’ as ‘big data’ goes, but it’s big enough to be fun to work with.