Hadoop Mapper Output Compression: An Optimization Trick

Today I had an interesting experience with Hadoop’s Mapper output compression. My mapper emitted its output as structured data in a custom object (to simplify later calculations), but to my surprise the shuffled data was way too large, about 3x the original data size, even though I had enabled map output compression. I then tried encoding the Mapper output value as a Text object instead, and the size of the shuffled data improved by about 100x. My data is easy to compress in textual form because the entries are, to a large extent, similar to one another.

The reason was that the custom object I created at first was serialised to a binary format, which lost the advantage of the entries being so similar, so the output didn’t compress well.

So next time you decide to use a custom object as Mapper output and marshal it, think twice about the nature of your data, and experiment with encoding it in a textual format instead.
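
If you want to try that experiment outside a cluster first, here is a minimal sketch using only the Java standard library. Everything in it is an assumption for illustration: the `Sample` record and its fields are hypothetical stand-ins for a custom Mapper output value, the binary layout only resembles what a Writable's write method might produce, and `java.util.zip.Deflater` stands in for whatever codec your job is configured with. It prints raw and compressed sizes for both encodings, and the result depends entirely on your data, so plug in records that look like yours.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class EncodingCompressionCheck {

    // Hypothetical record standing in for a custom Mapper output value.
    static final class Sample {
        final long id;
        final int count;
        final double score;

        Sample(long id, int count, double score) {
            this.id = id;
            this.count = count;
            this.score = score;
        }
    }

    // Fixed-width binary encoding: 8 + 4 + 8 = 20 bytes per record.
    static byte[] encodeBinary(Sample[] records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Sample r : records) {
                out.writeLong(r.id);
                out.writeInt(r.count);
                out.writeDouble(r.score);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Tab-separated text encoding, one record per line.
    static byte[] encodeText(Sample[] records) {
        StringBuilder sb = new StringBuilder();
        for (Sample r : records) {
            sb.append(r.id).append('\t')
              .append(r.count).append('\t')
              .append(r.score).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Number of bytes produced by DEFLATE at the default compression level.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Synthetic records; replace with values shaped like your real data.
        Sample[] records = new Sample[10_000];
        for (int i = 0; i < records.length; i++) {
            records[i] = new Sample(1_000_000L + i, i % 5, 0.5);
        }
        byte[] binary = encodeBinary(records);
        byte[] text = encodeText(records);
        System.out.printf("binary: %d raw, %d deflated%n",
                binary.length, deflatedSize(binary));
        System.out.printf("text:   %d raw, %d deflated%n",
                text.length, deflatedSize(text));
    }
}
```

In an actual job, the text form would be wrapped in an org.apache.hadoop.io.Text value and parsed back on the reducer side. The numbers that matter in the end are the shuffle byte counters the job itself reports.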