Big data and MapReduce

Big data is a big thing. With the sheer amount of data being collected in this age of interconnectivity, there has to be an efficient and effective way of storing and processing it. It isn’t much of an issue when dealing with a couple of gigabytes, but when data storage needs reach hundreds of petabytes (a single petabyte is 1 followed by 15 zeroes, in bytes) a bigger solution is required.

When Google was faced with this problem in the early 2000s, they came up with ‘MapReduce’, a programming model they described publicly in 2004. First, imagine how much data Google has. The Google search engine indexes most of the webpages on the internet – and there are a lot of webpages. As well as being able to serve a link, Google results pages need to serve the metadata, titles, descriptions, text and images, and so on. As you can imagine, this adds up to LOTS of data.

Storing this data all in one place would be 1) stupid and 2) expensive. It would also result in very slow data processing. For example, when a Google search query is made for ‘cat pictures’, if all the possible data for that query were stored in one place, the Google algorithm would have to go through the database entry by entry, checking for mentions of ‘cat pictures’. You can imagine that this might take a while – there are a lot of cat pictures on the internet – and Google’s famously superfast results could end up taking hours. Maybe then people would start using Bing, but maybe not.

Instead, Google devised an approach that allowed them to process their bulk of data and produce results much faster. This is ‘MapReduce’. Essentially, the data is split across many (many, many) storage locations, and the processing is done in parallel. The ‘Map’ part means that the same operation is applied to each chunk of data independently, wherever it is stored, and the ‘Reduce’ part means that the partial results from all of those chunks are combined into one final answer. By way of example, one data location could store the metadata, one could store the XML, one could store the links, and one could store page content.

Then, whenever a search is made for ‘cat pictures’, Google’s MapReduce algorithm searches all of those individual databases in parallel, and then collates the results to be displayed on the search page.
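The search described above can be sketched in a few lines of Python. This is only an illustration of the map-then-reduce pattern, not Google's actual implementation: the shards, the URLs and the function names (map_shard, reduce_pairs, search) are all made up for the example, and the shards are processed in a simple loop where a real system would use many machines at once.

```python
from collections import defaultdict

# Hypothetical data: each dict stands in for one storage location ("shard").
SHARDS = [
    {"a.example/cats": "cat pictures and more cat pictures",
     "a.example/dogs": "dog pictures"},
    {"b.example/pets": "cat pictures of sleepy cats"},
    {"c.example/news": "weather and sport"},
]


def map_shard(shard, query):
    """Map step: emit a (url, mentions) pair for each matching page in one shard."""
    pairs = []
    for url, text in shard.items():
        mentions = text.lower().count(query.lower())
        if mentions:
            pairs.append((url, mentions))
    return pairs


def reduce_pairs(all_pairs):
    """Reduce step: add up the mentions per URL and rank the pages."""
    totals = defaultdict(int)
    for url, mentions in all_pairs:
        totals[url] += mentions
    # Most mentions first; ties are broken alphabetically by URL.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def search(query, shards=SHARDS):
    intermediate = []
    # Each map_shard call only looks at its own shard, so in a real
    # deployment these calls could run on separate machines in parallel.
    for shard in shards:
        intermediate.extend(map_shard(shard, query))
    return reduce_pairs(intermediate)


results = search("cat pictures")
print(results)
```

The important point is that map_shard never needs to see any other shard, which is what makes it safe to run many copies of it at the same time.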

Imagine it like this… you have a deck of cards and want to locate all of the Kings in the deck. One way of doing it would be to search through all 52 cards one by one to find the King of each suit. Or, if you have a friend with you, a more efficient way would be to give your friend half the cards, so you can both look for Kings in parallel. If there are four of you, split the deck in four and, again, each search your own pile for Kings. You each find your Kings, and then collate them at the end of the search.
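For the curious, the card analogy above translates almost directly into code. The following is a minimal sketch, assuming four players and a standard 52-card deck; the helper names (build_deck, split_deck, find_kings, find_all_kings) are invented for this example. It uses Python threads purely to show the shape of the idea. For a task this small, threads will not make it any faster.

```python
from concurrent.futures import ThreadPoolExecutor

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]


def build_deck():
    """Return all 52 cards as (rank, suit) pairs."""
    return [(rank, suit) for suit in SUITS for rank in RANKS]


def split_deck(deck, players):
    """Deal the deck into one pile per player, one card at a time."""
    return [deck[i::players] for i in range(players)]


def find_kings(pile):
    """Map step: one player searches only their own pile."""
    return [card for card in pile if card[0] == "K"]


def find_all_kings(deck, players=4):
    piles = split_deck(deck, players)
    # Every player searches their pile at the same time.
    with ThreadPoolExecutor(max_workers=players) as pool:
        kings_per_player = list(pool.map(find_kings, piles))
    # Reduce step: collate what each player found into one list.
    return [card for kings in kings_per_player for card in kings]


kings = find_all_kings(build_deck(), players=4)
print(sorted(kings))
```

Changing players to 1 gives the "search all 52 cards yourself" version, and the result is the same four Kings either way. Only the amount of work each person does changes.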

This, at a very basic level, is the method of MapReduce. It leads to more efficient access to data storage, and faster data processing.

With increasing developments in the Internet of Things (IoT), artificial intelligence and machine learning, big data storage and processing are going to become more and more important. If your smart thermostat is taking temperature measurements in every room of your house every minute of every day, then there’s got to be an efficient way to store and process that data. And, with the new General Data Protection Regulation (GDPR) coming into force in May 2018, the privacy and security of that data are going to become more important too.

Katie Collins
