In this article I’ll highlight the history of BigData since Google’s MapReduce till current trends and tools.
It’s debatable what BigData means, or where the boundaries lies, there is no standard way to define it, but generally the following diagram is quite popular :
other figures will use only 3 Fronts (Variety, Volume and Velocity) , generally the more far from the center, the more close you’re to what is treated and called now “BigData”.
Google File System (GFS) (2003) :
The kickstart, we can’t say that GFS is the start of the BigData concepts, but by far, it’s almost the only thing people now, will remember as a start (too few still remember or use datagrids or other forms), as in 2003, google has published this paper, which describes a Distributed File System, a system whose main purpose is handling the distribution of user’s files across a number of connected commodity machine, and abstracting the developer from the complexity of the underlying systems, those underlying systems are composed of cheap hardware components, which isn’t fail-safe by any means.
The framework is illustrated in the following figure (see google’s paper for more details) :
This system, should be able to store files sized at terabytes of data, across the cloud, and handle the replication as appropriate, so the developer or system analyst who is using the system, should only be worried about the data he want to store or retrieve and not the low level details.
MapReduce (2004) :
Shortly after the GFS paper, MapReduce was introduced in 2004 from Google as a published paper, which was a big step toward distributed and shared nothing systems, an idea which is very simple, but yet, extremely powerful, which have given data analysts a very simple and intuitive framework, to process Tera Bytes of data stored in distributed systems, without worrying about the underlying Network and Distributed computing power, and let them just focus on the task on hand.
The idea is simply, divide any data processing task, in two parts, a reducer and a mapper, and the analyst will just need to write/provide the mapper logic, and the reducer logic, an idea which is illustrated by the following simple example.
Assuming that we want to calculate the total payments each employee has received from a certain company, And assume the data has the following format.
ID Name Payment Date 2100 Mostafa 1000.0 01/01/2004 2200 Mohamed 2000.0 01/01/2004 2300 Usama 1500.0 01/01/2004 2100 Mostafa 1000.0 01/02/2004 2200 Mohamed 2000.0 01/02/2004 2300 Usama 1500.0 01/02/2004
Now let’s assume that the final output should look like
Name Total Mostafa 2000 Mohamed 4000 Usama 3000
To achieve this result, we construct a MapReduce Job, as illustrated in the following diagram :
Now let’s understand what’s described in this image, on the very left, we see the input file, then we see Mappers, which is logic provided by The Data Analyst, such that given a line from the input file, it produces a key/value pair corresponding to it, as illustrated in the “Intermediate results” part of the previous illustration, i.e.:
Mapper Input: 2100 Mostafa 1000.0 01/01/2004 Mapper Output: Key=>Mostafa Value=>1000.0
So, after the mapping phase, we’ll be having each line of the input transformed by some mapper, to key/value form, and the order in which Mappers receive input, is arbitrary, and the analyst need not worry about that order, he, should just worry about each line separate from others.
After that, we meet the reduce phase, in which we start aggregating the results, in reduction phase, a number of reduce tasks are created, then the framework will call the reduce logic provided by the Data Analyst, with each call, it’ll handle the reduce function/logic, a key, and a list of corresponding values associated with that key, and it’s up to the analyst to decide what logic will reside in the reduce function.
Each reducer will then return a line, which is handled by the framework, and stored in the distributed file system (explanation will follow shortly).
So, that’s it, simple huh !
This is the magical concept which was the origin of almost all current big data systems (as we’ll explain below).
The Birth Of Hadoop and HDFS (2005) :
Once Google has published it’s papers, Doug Cutting and Mike Cafarella, who have been working on Nutch (A web crawling/indexing engine) and could handle about 100 million web pages, decided to mimic what Google has done, clone it, and use it for this web crawler, to enhance it’s ability to embrace the web explosion, so they started by creating their own version of the GFS, and called it Hadoop Distribute File System, and created their own implementation of MapReduce, and called it Hadoop, and released it around 2005.
Doug created Hadoop while he was working at yahoo, where he led a Hadoop project as full-timer, and since then Hadoop has drawn a huge deal of community interest, as the only reliable MapReduce implementation available at the time.
In 2008, Yahoo has announced the Search Webmap, a Hadoop cluster running on 10,000 linux core, and in 2009, Yahoo has provided it’s version of Hadoop to the community.
In 2010, Facebook claimed the owner ship of the largest Hadoop cluster, with a size of 21 PB, which increased in 2012 to 100 PB.
And as of 2013, hadoop was used by half the 50 fortune companies.
Life getting simpler with Pig & Hive (2006) :
Although the MapReduce model is extremely simple, but it’s far too much simple to be used to create large and complex data analysis systems, and is considered too low-level, and hard to use, for those who’re familiar with either SQL syntax or procedural programming, so the need emerged for tools which can satisfy those needs, and thus was the birth of Pig & Hive tools, see the following example for Pig code, and the corresponding Hive code samples.
Pig Sample :
Users = load 'users' as (name, age, ipaddr); Clicks = load 'clicks' as (user, url, value); ValuableClicks = filter Clicks by value > 0; UserClicks = join Users by name, ValuableClicks by user; Geoinfo = load 'geoinfo' as (ipaddr, dma); UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr; ByDMA = group UserGeo by dma; ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo); store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Hive Sample :
insert into ValuableClicksPerDMA select dma, count(*) from geoinfo join ( select name, ipaddr from users join clicks on (users.name = clicks.user) where value > 0; ) using ipaddr group by dma; Those two examples shows how easy it is to construct an analysis job, which could have required a large number of MapReduce jobs to create manually, but that doesn't mean, these tools has dominated the need to write native MapReduce jobs, as writing the jobs directly still gives you more control on the execution of your jobs.
Google Again, BigTable (2006):
Google has came back to the scene with another great idea, presented in this paper, BigTable, which came from the need to a distributed key/value store, as the GFS or HDFS, doesn’t give the ability to retrieve certain row/line corresponding to certain key, and instead, you should scan through the whole thing to extract even a tiny bit of information, or update a small part of the data, so this BigTable, is a map of maps, where each row is referenced by a key, which then retrieve a set of column families, which in turn another map, referenced by the column and column-family, as illustrated in the following diagram:
And the paper claimed that, they have used the BigTable both as input and output to MapReduce jobs.
Amazon comes to play, with it’s Dynamo (2007):
Inspired by Google’s BigTable, Amazon’s Dynamo provides the same functionality of a Distributed key/value store, but has taken a decentralised approach, unlike the master/slave setup of most systems, Dynamo, is a fully distributed system, which means there’s no single point of failure, as the node negotiate and comes to know each other by discovering the surrounding network by peer-to-peer protocol.
Look who comes, Facebook’s Cassandra (2008):
Inspired by the other two giants, Facebook has created it’s own version of the Distributed key/value store, Cassandra, which was developed at Facebook, and then contributed to the community, the Cassandra, is providing almost the same main features provided by it’s two predecessors, and is decentralised, but with the advantage of being open source, and having a very good documentation and community, it is getting a good deal of attention and potential.
Is traditional MapReduce dead yet:
No, it’s not, although the Key/Value stores seems to provide huge advantages, the traditional Distributed File Systems are still more desirable when it comes to batch processing, as the key/value stores are used more for on-line processing needs, and a typical use case, will include both systems, such that, during the day, an on-line system will be running , and by the end of the day, may be at night or so, a batch processing job, can be run, to analyse the data collected during the day, combine it with previous results, and then feed the online system with whatever findings resulted from this batch processing analytics.
Hadoop community trying to catchup and HBase is introduced (2007):
A company called PowerSet created HBase, and handed it to the community, and was appealing to the community, as a counter part to Cassandra which is taking a totally different path from HDFS and Hadoop, a thing which then became a weak point for HBase, which is inheriting a lot of weaknesses and problems of HDFS along with it, and given the fact that HDFS was not designed with online tasks on mind, you can understand why HBase is facing issues serving as online key/value store.
Google is taking the lead again, with Percolator (2010):
In 2010, google has published another paper, describing what’s called Percolator, which is taking online processing one step further, and provide what was called Triggers, in the traditional RDBMs, see the following figure :
The idea behind percolator, is that instead of waiting till the end of the day, to update the data stores, the percolator system, will define trigger-like logic, that’s on data change in one row in a data store, the change is percolated through the System, to all other related data stores.
Has any body else provided a free “Percolator-like” system :
Although Cassandra has promised to provide a trigger functionality soon, they’ve not done so yet, but other similar ideas was introduced like Yahoo’s S4 (which is Apache project now), which is introducing a similar functionality to Percolator, but it’s not very integrated with the data stores, and is providing external monitoring for events to update data stores based on them, and also a System called Storm, provided to the community by Twitter, which it’s main focus is providing out-of-the-box support, for handling incoming stream of tweets, and update a data-store as appropriate, but still not the same thing provided by percolator.
Google again! , introducing Pregel (2010):
Both patch processing systems like hadoop and online data stores, fail when it comes to the need to process data having a graph nature (like friends network, or web sites relation ships), and that’s what Pregel doing, it provides support for such types of processing,
Another interesting idea from Google, Dremel (2010):
After the introduction of online data-stores (BigTable…etc), here comes the need for online data processing, Dremel, which provides a tree like structure for query processing, and take a column based approach to pass the data through the tree, so that instead of moving every thing across the tree, only the needed data is passed, then Google has provided a hosted service called BigQuery, which is providing real-time analytics based on Dremel.
Not only google, SAP was here too, SAP HANA (2010):
Around the same time, Dremel was announced, SAP also has announced a similar online processing service called SAP HANA and provided it as a hosted service also, but as part of a complete data analysing and processing suite.
Spark stack :
Spark has also provided it’s own stack, which is parallel to Hadoop’s stack, but claiming to be more than 10x faster, due to depending more on memory and optimising the communication layer, it provides the following components :
Shark (Hive like implementation)
Mlib (Machine Learning)
GraphX (Pregel-like graph processing)
Other Names :
Those aren’t every thing, a lot of other names and system has emerged too :
Facebook’s Skuba (Dremel like realtime-DB)
Gigraph (Pregel like, graph processing tool)
And a lot more others, which you can find in the following link :
The point from the article is highlighting the history of the tools we use today to process and analyse BigData, it’s not a complete map of the available tool (although I intend to provide such map soon).
The point that should be clear from this article, is that, BigData is not only about Hadoop, it’s way more than that.
organisations and people intending to learn and leverage BigData Analytics should recognise that fact, and recognise that although Hadoop is one of the oldest and most mature tools out there, it has a limited real need, and a lot other variations can be leveraged to achieve maximum performance.