Storage space is a hot commodity, and it’s always a good idea to optimise the amount of storage space you require. One method for this is data deduplication.

What is data deduplication?

Data deduplication is a way of reducing the amount of storage space required when backing up files to a server. Imagine a business environment where each employee has their own desktop machine. With that, they have their own folder system for storing their files. These files are backed up onto the company’s network. In an enterprise of hundreds or thousands of employees, all of these files add up to a huge amount of storage space. As such, methods like data deduplication are vital for reducing storage space on a server.

How data deduplication works

In an office environment, imagine that your boss sends the five members of your team a copy of a PowerPoint document (let’s be super imaginative and call it example.ppt). Everyone saves example.ppt to their own machines, which means there are now six copies of example.ppt on the company network – the original, and the five copies. Say that example.ppt is 10MB in size; because everyone has a copy, it now takes up 60MB of storage space on the network.

Admittedly, 60MB doesn’t sound like much, but what if it was a 50MB photo file sent to 100 employees, or a 1GB HD video file sent to 1000 employees? Suddenly the 1GB video file takes up a terabyte of storage space across the network.
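The arithmetic behind these examples is simple enough to sketch (the file sizes and copy counts below are just the illustrative figures from this article):

```python
def storage_mb(file_size_mb, copies):
    """Total space used when there is no deduplication:
    every copy of the file is stored in full."""
    return file_size_mb * copies

# The examples from the text: the original plus the copies saved by each recipient.
print(storage_mb(10, 6))       # 10MB PowerPoint, 6 copies -> 60MB
print(storage_mb(50, 101))     # 50MB photo, 101 copies -> 5050MB (~5GB)
print(storage_mb(1024, 1001))  # 1GB video, 1001 copies -> 1025024MB (~1TB)
```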

Data deduplication stores only one copy of the file, and everyone else gets a pointer to that original. The user doesn’t even realise that they’re working from a pointer rather than their own copy of the document. This dramatically cuts storage requirements.

File and block level data deduplication

Data deduplication runs on a server, scanning for exact duplicates of files and, well… deduplicating them. This is called file-level data deduplication. A more efficient method is block-level data deduplication. Let’s explain the differences…

File-level data deduplication

File-level data deduplication is the basic method we’ve described so far. When an employee saves a file to their area of the network, data deduplication checks the file against the index of all of the files on the network. If it is unique, the file is stored and the index is updated, but if it isn’t unique a pointer to the original file is saved.
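The check-against-an-index process can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product implements it – the content-hash index and the class below are assumptions made for the sake of the example:

```python
import hashlib

class FileDedupStore:
    """A toy file-level deduplication store: identical files are stored
    once, and duplicates are saved as pointers to the original."""

    def __init__(self):
        self.blobs = {}  # content hash -> file bytes, stored once
        self.files = {}  # network path -> content hash (the "pointer")

    def save(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            # Unique file: store it and update the index.
            self.blobs[digest] = data
        # Duplicate (or freshly stored) file: only a pointer is kept per path.
        self.files[path] = digest

    def read(self, path):
        # The user is transparently given the stored copy.
        return self.blobs[self.files[path]]

store = FileDedupStore()
ppt = b"example.ppt contents"
for user in ["boss", "alice", "bob", "carol", "dave", "erin"]:
    store.save(f"/{user}/example.ppt", ppt)

print(len(store.files))  # 6 pointers on the network...
print(len(store.blobs))  # ...but only 1 copy actually stored
```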

This does save space, but it’s pretty inefficient. If one of the employees corrects a typo in example.ppt, file-level data deduplication considers the result a brand-new unique file – it’s no longer a pointer, and the network stores it in full. In our example, this doubles the storage used: there is now one 10MB original, four pointers, and one 10MB edited copy. Every edited copy has to be stored in its entirety, no matter how small the change.

Block-level data deduplication

A solution to this problem is block-level data deduplication. Instead of treating files as indivisible wholes, block-level data deduplication works at a more granular level, on the binary data itself. Say two employees are sent example.ppt, and they both make some changes to it. File-level data deduplication would treat this as three separate files, but block-level data deduplication works slightly differently.

Block-level deduplication splits a file into blocks of data and saves only the unique blocks. To keep things simple, imagine that example.ppt is a PowerPoint presentation made up of four slides. Block-level data deduplication would treat those as four unique blocks of data within the file, and the original file would be saved as those four blocks.

Let’s say that it is saved as blocks ABCD. There are now three versions of the same file, all with ABCD blocks. The two employees both make slight changes to it, and now the three files are unique, but similar. They are now made up of blocks ABCD, ABCE, and ABDE. Block-level data deduplication just stores the unique blocks, not the unique files, so blocks ABCDE are stored on the network.

This means that instead of having three unique 10MB files totalling 30MB of space, there are now only five blocks to be stored for a little over 10MB of space.
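The ABCD/ABCE/ABDE example can be sketched the same way, hashing fixed-size blocks instead of whole files. Again, this is an illustrative toy – the tiny block size and the class are assumptions for the example, and real systems typically use much larger blocks:

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use far larger blocks

class BlockDedupStore:
    """A toy block-level deduplication store: files are split into
    fixed-size blocks, and only unique blocks are stored."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes, stored once
        self.files = {}   # network path -> ordered list of block hashes

    def save(self, path, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            hashes.append(digest)
        self.files[path] = hashes

# Three similar files, as in the text: blocks ABCD, ABCE and ABDE.
store = BlockDedupStore()
store.save("/boss/example.ppt",  b"AAAABBBBCCCCDDDD")  # ABCD
store.save("/alice/example.ppt", b"AAAABBBBCCCCEEEE")  # ABCE
store.save("/bob/example.ppt",   b"AAAABBBBDDDDEEEE")  # ABDE

print(len(store.files))   # 3 files on the network...
print(len(store.blocks))  # ...but only 5 unique blocks (A, B, C, D, E) stored
```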

Even though files aren’t stored in their entirety, when a user wants to access their file again, block-level data deduplication is clever enough to reassemble the relevant blocks and present each user with their own version. With block-level data deduplication, any changes the employees make don’t turn the file into an entirely new one – only the changed blocks are stored.
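Reassembly is just a matter of looking up each block reference in order. A minimal, self-contained sketch (the hash names and block contents here are placeholders, not real digests):

```python
# Unique blocks stored once on the network (placeholder hashes for clarity).
blocks = {"hA": b"AAAA", "hB": b"BBBB", "hC": b"CCCC", "hE": b"EEEE"}

# Each file is an ordered list of references to those blocks.
file_index = {"/alice/example.ppt": ["hA", "hB", "hC", "hE"]}

def read(path):
    """Present the user with the relevant blocks, in order, as one file."""
    return b"".join(blocks[h] for h in file_index[path])

print(read("/alice/example.ppt"))  # b'AAAABBBBCCCCEEEE'
```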

Benefits of data deduplication

The obvious benefit of data deduplication is a reduced storage demand. This also comes with the benefit of a reduced bandwidth consumption on the network, and faster speeds when downloading from (or uploading to) the backup. Ultimately, the benefit here is cost savings. Saving storage space saves money.

Data deduplication is also highly customisable. You can tell the server to only process certain folders, to exclude duplicates of certain file types, or to exclude files that are less than a set number of days old.

Deduplication can be set up to happen as soon as data is backed up to the server (known as in-line deduplication), or it can happen in the background at set intervals (known as post-processing).

File compression and data deduplication

We previously wrote a blog on how to save storage space with file compression. Data deduplication isn’t an alternative to file compression, it’s more of a supplement to it. In fact, many companies will use a combination of data deduplication and file compression to fully optimise their storage space.

Data deduplication can be enabled on all Fasthosts Windows Dedicated Servers.