What do yellow elephants, hives and pigs have in common? Is Cassandra the Zookeeper? Because if so, the cats need to be herded!
No, stay with me, I haven’t lost it just yet! If I had, I could find it using HDFS and MapReduce processes.
Don’t be alarmed! What you’ve just experienced is something we can help with: confusion! One of the biggest hurdles we have been helping our clients over in recent times is the barrier to Big Data adoption: a confusion of technologies, concepts and strategies, blurred together to sell one of the most on-trend IT initiatives today.
In this article I will break down and separate the components of a Big Data solution, and provide some meaningful examples of how Big Data strategies can be leveraged to drive real value for your organisation. I will use my experience in “traditional” enterprise business intelligence and information management to highlight the synergies I see between the two, and how your organisation can adopt one in addition to the other. There are some very good reasons for running these initiatives in parallel: they are extremely complementary, and together they can provide a powerful mixture of agility and performance for your organisation.
So what is Big Data?
Just in case you weren’t aware, we live in a data-rich world! Most people have a number of personal smart devices (phones, tablets, watches, televisions and so on) which are capable of connecting to networks, tracking events, displaying information and talking to us. In each instance these devices produce data, both qualitative and quantitative. Most of us understand this demand on a personal level, but this consumerism also drives industry. The networks that carry data, the applications that move data between devices, our electricity and gas suppliers, our traffic lights, our airports, our booking systems and our cars all have points of communication along their disparate processing channels. Our world, and our businesses, create enormous amounts of data.
Some organisations have always dealt with enormous data volumes, so Big Data remains a very relative term. Generally speaking, though, the concern is not so much with the volume of data specifically, but rather with how easily a quantity of data can be used to generate value for an organisation. There are a number of issues for consideration: how accessible is the information in your organisation? How secure is it? How do we get insight from it? How do we derive real strategic value?
Some of the more traditional forms of collating, processing and displaying information struggle under the requirements of the modern data explosion, and for that reason alternative approaches have been created to deal with the burden of knowing!
The Big Data approach is a distinct change in philosophy from that of a traditional enterprise data warehouse, but it still deals with some familiar issues:
- What data is important to our organisation?
- How do we get access to it?
- How do we take that raw data and present it as useful information?
All data is important and insightful – we just don’t know it yet
Determining the importance of data is a difficult equation. We don’t just look at the data; we look at the cost of churning raw data into something useful, we look at the impacts of that data tactically and strategically, and then we prioritise according to highest value and business focus. Fundamentally, we make decisions around investing today for future value.
Big Data is a change in focus. It assumes that all data in an organisation is of value today, or will become visibly valuable in the future. Big Data takes the prioritisation process out of the business case, and justifies visibility over maturity. Invest today and have it all; find uses for it over time through explorative analysis.
I haven’t met a lock I couldn’t pick with a crowbar
Accessing data (and by that I mean obtaining it, transforming it and conforming it) is a long-winded process, made simpler by a variety of great tools in the ETL space, but nonetheless a time- and resource-intensive part of traditional business intelligence and information management. The main contention here is moving data between environments to the point of processing; generally, the focus on performance sits at that point of processing.
The Big Data world diverges a little here in terms of how this happens physically: it either utilises the Hadoop open source platform, or some other form of massively parallel processing (MPP). Conceptually, though, the focus is not on bringing data to processing power, but rather on taking processing power to the data at its source, and plenty of it! Scalability is a huge factor with Big Data processing: the more hardware you throw at it, the more processing power there is to churn through data. While a well-planned EDW can offer serious grunt as well, generally speaking there is a limit to what can be achieved, especially when constrained by delivery timeframes and organic growth over time. In this respect Big Data can take serious advantage of cloud-based infrastructure; a shameless (and fully disclosed!) plug here for our sister company Kloud Solutions, who are already taking advantage of those synergies.
Hadoop is the word in terms of open source. The Hadoop project is maintained by Apache.org. It’s not an acronym, but rather a creative name for a series of subprojects that provide a technical solution specifically around the concepts that Big Data seeks to address. Hadoop features a yellow elephant as its mascot, and many of the subprojects feature the creative names and depictions that also had a mention in my witty introduction. Most commercial Big Data products borrow in some way from the fundamental offerings of the project, but seek to provide a more intuitive, manageable and supportable solution.
To quote Apache.org:
“The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures”
Notable Hadoop projects for the acquisition of data include:
- HDFS (Hadoop Distributed File System) knits distributed hardware together into a single file system, putting both its data and its computational power on the Big Data radar.
- MapReduce is responsible for picking out specific data elements across HDFS, typically from flat, text-based files. A Java-based executable is pushed out to the source, where in effect a “query” is run; the results are then brought together and assembled.
Commercial MPP alternatives in the same space include:
- IBM Netezza
- Oracle Exadata
- EMC Greenplum
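To make the MapReduce idea above a little more concrete, here is a minimal, single-machine Python sketch of the map, shuffle and reduce phases, counting words across some text lines. It is a toy illustration of the concept only, not the Hadoop API, and the sample lines are made up for the example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (key, value) pair for every word in a line of text
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values emitted under the same key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in grouped.items()}

log_lines = [
    "big data big insight",
    "big value",
]
pairs = chain.from_iterable(map_phase(line) for line in log_lines)
counts = reduce_phase(shuffle(pairs))
```

In a real Hadoop cluster the map phase runs locally on each node holding a block of the file, and only the much smaller intermediate results travel across the network, which is exactly the “take the processing to the data” point made earlier.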
The proof of the pudding is in the reporting
“Amounts and counts Brad, that’s all they really want, amounts and counts!” The words of one of my very first project managers still ring in my ears, and there is some truth in them: simplicity is often key in reporting. In keeping with the comparison between the more traditional uses of information and the Big Data approach, the key difference lies in the maturity of reporting in general.
Some people may take exception to the over-generalisation found here, but it is in keeping with our focus on simplicity. Traditional warehousing and modelling techniques generally emphasise well-defined relationships between entities over a period of time, mature in terms of organisational IP, and reporting typically reflects that. What I mean is that the data has generally gone through a process of vetting by SMEs within the business, and consolidation within a framework of understood business rules and usages that make sense at a whole-of-network level. Organisations look for well-understood business patterns.
Then in walks Big Data with its pistols a’blazin! Firstly, let’s deal with the “unstructured” misnomer sometimes used to describe Big Data: in essence read “not in the conformed structure of a traditional EDW”, because the data we are talking about is generally very structured. In truth Big Data is not mature in the enterprise sense, but from an operational point of view it is extremely optimised as a source, and as such has strong, interpretable formatting. At a base level what we are really talking about are definable name-value pairs: a text-based search result, and the number of its occurrences in a data set, which can be aggregated and transformed in layer upon layer of processing. Think of Twitter logs that mention your business, and then the correlation of those tags with sales of specific products online at specific times; this might inform marketing campaigns. The value here lies in true data mining and analytical exploration of the unknown. Spotting a new, unforeseen trend that sets your organisation apart from others is the goal. Measuring the state of your organisation in a managerial sense is something that is better placed in an EDW.
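The name-value pair idea is simple enough to sketch in a few lines of Python. Here some hypothetical tweet text (the company name and hashtags are invented for the example) is scanned for hashtags, and occurrences are counted; exactly the kind of raw result that could later be correlated with sales figures:

```python
import re
from collections import Counter

# Hypothetical tweets mentioning a made-up business
tweets = [
    "Loving the new widget from #AcmeCorp!",
    "#AcmeCorp support was slow today #frustrated",
    "Just ordered again from #AcmeCorp",
]

# Extract each hashtag (the "name") and count its occurrences (the "value")
tag_counts = Counter(
    tag.lower() for tweet in tweets for tag in re.findall(r"#(\w+)", tweet)
)
```

The output is just pairs like `acmecorp: 3`; simple in itself, but layered aggregation and correlation over millions of such pairs is where the explorative value appears.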
Notable Hadoop projects for the analysis of data include:
- Hive, a data warehouse infrastructure for data summarization and ad hoc querying.
- Mahout, a scalable machine learning and data mining library.
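Hive exposes this kind of summarisation through a SQL-like language (HiveQL), which it compiles down to MapReduce jobs behind the scenes. The equivalent ad hoc grouping, sketched here in plain Python over some invented sales records, shows the shape of what a HiveQL GROUP BY produces:

```python
from collections import defaultdict

# Hypothetical sales records: (product, amount)
sales = [
    ("widget", 10.0),
    ("gadget", 25.0),
    ("widget", 15.0),
]

# Conceptually: SELECT product, SUM(amount) FROM sales GROUP BY product
totals = defaultdict(float)
for product, amount in sales:
    totals[product] += amount
```

The appeal of Hive is that an analyst can write the one-line SQL version of this against files sitting in HDFS, without writing or deploying any Java at all.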
The Strategic Coalition
So we have looked at how conceptually Big Data is different to a traditional EDW, but to be fair we aren’t comparing apples with apples; we are, however, comparing a couple of fruits that mix well. Maybe more of an orange and mango? Apple and guava? OK, you get the point. This is where the human factor can intervene to determine the appropriate use for these software and hardware based solutions.
I alluded to someone in a long white coat earlier on, which was a reference to the term “Data Scientist”. The definitions of this term are many and varied, and not always associated with Big Data, but in short a data scientist is someone who understands the traditional practices of business and data analysis. What sets them apart is an innovative approach to sourcing and delivering information in a way that influences how an organisation approaches a business challenge. In a world where terms come and go at the speed of light, I’d like to coin one I prefer: “Data Pioneer”. I think it’s more fitting given the explorative nature of Big Data, not only in assessment, but in picking which trail to follow strategically to arrive at real value.
If we look at the benefits of both schools of thought we find a more holistic solution. We can supplement the inherent benefits of conformed, well-defined data, which suits a more established way of measuring our business’s success, with a more dynamic, explorative assessment of a volatile marketplace, delivered with unsurpassed speed.
Who says pioneering went out with the wild west! Just watch out for the cowboys!
(author: Brad Riley)