In this article I'll trace the history of Big Data from Google's MapReduce to current trends and tools.
It's debatable what Big Data means and where its boundaries lie; there is no standard definition, but the following diagram is a popular way to frame it:
Other figures use only three dimensions (Variety, Volume and Velocity); generally, the farther you move from the center, the closer you are to what is now treated and called "Big Data".
So, today I had an instructive experience with Hadoop's mapper output compression. I had structured the mapper's output as a custom object (to simplify later calculations), but to my surprise the shuffled data was way too big, about 3x the original data size, even though I had enabled map output compression. I then tried encoding the mapper's output value as a Text object, and the shuffled data size improved by about 100x, because my data was easy to compress in textual form: the entries were similar to a large extent.
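For reference, map output compression is controlled by two job properties; here is a hypothetical `mapred-site.xml` fragment (property names assume MRv2; older releases used `mapred.compress.map.output`, and the codec choice is up to you):

```
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```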
The reason was that the custom object I had created at first was serialized to a binary format, which hid the similar nature of the data, so it didn't compress well.
So, next time you decide to use a custom object as mapper output and marshal it, think twice about the nature of your data, and experiment with encoding it in textual format instead.
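The effect is easy to check outside Hadoop. Below is a minimal, self-contained Java sketch (not Hadoop code; the class and method names are made up for illustration) that DEFLATE-compresses the same records in a raw binary encoding, roughly what a custom Writable's `write()` would emit, and in a tab-separated textual encoding, so you can compare compressed sizes for data shaped like yours:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class CompressDemo {

    // DEFLATE-compress a byte array and return the compressed size in bytes.
    static int compressedSize(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams; won't happen
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        int n = 10_000;

        // Binary encoding: a raw long + double per record.
        ByteArrayOutputStream binBuf = new ByteArrayOutputStream();
        DataOutputStream bin = new DataOutputStream(binBuf);
        // Textual encoding: the same records as tab-separated lines.
        StringBuilder text = new StringBuilder();

        for (int i = 0; i < n; i++) {
            long id = 1_000_000L + i; // records are highly similar on purpose
            double score = 0.5;
            bin.writeLong(id);
            bin.writeDouble(score);
            text.append(id).append('\t').append(score).append('\n');
        }

        byte[] binary = binBuf.toByteArray();
        byte[] textual = text.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println("binary:  " + binary.length + " -> "
                + compressedSize(binary) + " bytes compressed");
        System.out.println("textual: " + textual.length + " -> "
                + compressedSize(textual) + " bytes compressed");
    }
}
```

Which encoding wins depends entirely on your data's shape; the point is to measure on a representative sample rather than assume.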