Duplicated files on the left and deduplicated files on the right after using the Open-E JovianDSS system

    Deduplication – The Global Compression

    The society we live in runs on information. People have become dependent on it and, in turn, are generating more and more data. As a result, the ever-growing need for storage capacity has become one of the biggest problems in today’s IT industry. If we add the backup process – which doubles this need and thus the cost of maintaining IT infrastructure – it becomes apparent that the data explosion requires immediate action. What can be the “antidote”?

    Deduplication – the Compression Alternative

    Looking at the question above, we can point to two ways of solving the problem: we may try to increase the capacity of storage devices (at the hardware level) or find a way to organize data storage so that capacity consumption is optimized (at the software level). Let’s consider the second idea – in this case, optimization through data deduplication. The process helps keep backup storage “thrifty.” How should we understand it? What makes deduplication a better tool than hardware data compression, incremental backup techniques, or differential backups? Let’s take a closer look.

    In short, deduplication is a process that eliminates duplicated data and replaces it with references that point to a single copy of the original data. Easy, isn’t it? It is, without a doubt. What is more interesting, however, is that while deduplication is primarily aimed at virtual mass storage, it can also be used in database systems and other applications.

    How Does It Work?

    Theoretically, the process of deduplication is quite simple. It is based on systematically searching for repeated data blocks, eliminating them, and replacing them with references to the single remaining copy of the data in the system. The process can take place at the file system level or at the level of disk blocks; the latter usually gives better results, since it is independent of the type and number of files in the file system and of the operating system on which the system works. We will return to this issue later.
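
    To make this concrete, below is a minimal Python sketch of block-level deduplication under assumed parameters (the 4 KiB block size and the sample data are illustrative, and this is not the algorithm of any particular product): every block is fingerprinted with a hash, unique blocks are stored once, and the original data is kept as a list of references to the stored blocks. In a real system the block store and the reference table live on disk and blocks are reference-counted, but the principle stays the same.

        import hashlib

        BLOCK_SIZE = 4096  # assumed fixed block size (4 KiB)

        def deduplicate(data):
            """Store each unique block once and describe the data as a list of references."""
            store = {}       # hash -> block contents, kept only once
            references = []  # ordered hashes that reconstruct the original data
            for offset in range(0, len(data), BLOCK_SIZE):
                block = data[offset:offset + BLOCK_SIZE]
                digest = hashlib.sha256(block).hexdigest()
                store.setdefault(digest, block)  # keep only the first copy of a block
                references.append(digest)
            return store, references

        def restore(store, references):
            """Rebuild the original data by following the references."""
            return b"".join(store[digest] for digest in references)

        # Three identical blocks plus one unique block: 4 referenced, only 2 stored.
        sample = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
        store, refs = deduplicate(sample)
        assert restore(store, refs) == sample
        print(len(refs), "blocks referenced,", len(store), "blocks actually stored")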

    The process’s complexity increases when we consider the “smart” side of this solution. Deduplication can also rely on finding matching data despite record differences, errors, and typos, which means it is not always necessary to find exact duplicates. Such a solution is possible thanks to an advanced algorithm that evaluates the similarity between data blocks. When the search is finished, records are assigned to one of three groups: identical, similar, or different. That is how it works – in general.
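
    The snippet below is only a toy illustration of that classification, not the actual algorithm of any deduplication engine: a pair of blocks is labelled identical when their hashes match, similar when an assumed byte-level similarity ratio (computed here with Python’s difflib against a made-up 0.9 threshold) is high enough, and different otherwise.

        import hashlib
        from difflib import SequenceMatcher

        SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for calling two blocks "similar"

        def classify(block_a, block_b):
            """Assign a pair of data blocks to one of three groups."""
            if hashlib.sha256(block_a).digest() == hashlib.sha256(block_b).digest():
                return "identical"
            ratio = SequenceMatcher(None, block_a, block_b).ratio()
            return "similar" if ratio >= SIMILARITY_THRESHOLD else "different"

        record = b"customer-record: John Smith, Wroclaw, 2024"
        print(classify(record, record))                                        # identical
        print(classify(record, b"customer-record: Jon Smith, Wroclaw, 2024"))  # similar
        print(classify(record, b"completely unrelated block of data"))         # different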

    Deduplication vs. Compression – the Difference

    There is one main difference between the two methods. Compression takes place at the file level, so it does not matter whether we compress many files or only one. Deduplication, by contrast, applies to all the files it covers: repeated data blocks are searched for globally, not within the area of a single file. Such a solution is unquestionably more space-efficient than compression.
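
    The difference in scope can be sketched as follows (the file contents and block size are made up for the example): a per-file view, which is all a file-level compressor can see, counts the shared blocks in every file, while the global view used by deduplication counts them only once.

        import hashlib

        BLOCK_SIZE = 4096  # assumed block size

        def block_hashes(data):
            """Hashes of the fixed-size blocks making up one piece of data."""
            return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
                    for i in range(0, len(data), BLOCK_SIZE)]

        # Two backup images that share ten blocks and differ in one (illustrative).
        shared = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
        files = {
            "backup_monday.img":  shared + b"M" * BLOCK_SIZE,
            "backup_tuesday.img": shared + b"T" * BLOCK_SIZE,
        }

        # Per-file scope: duplicates across files stay invisible (11 + 11 = 22 blocks).
        per_file = sum(len(set(block_hashes(data))) for data in files.values())

        # Global scope: the ten shared blocks are stored once (10 + 1 + 1 = 12 blocks).
        global_scope = len({h for data in files.values() for h in block_hashes(data)})

        print("blocks kept with per-file scope:", per_file)
        print("blocks kept with global scope:  ", global_scope)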

    Models

    Deduplication is a method, an idea. But if we decide on such a solution, we also need to choose between “models” (we can also call them “projects”) – for example, Opendedup, LessFS, BitWackr (Exar), etc. The choice depends on the characteristics of the individual solutions, and it is hard to point at a single correct one. The best way to find the optimal solution is to take their different properties into account.

    Some of the existing solutions are based on a model that combines deduplication and compression. The range of these solutions and their quality in this area is extensive, and may depend on the adopted technology, programming language, etc. Their level of performance may also be a result of deduplication efficiency, and read and write speeds can differ depending on the deduplication model. As is easy to see, the choice of a proper solution should be a compromise between the technological capabilities of the implementation and the performance requirements.

    Another thing is the level of deduplication. We can highlight:

    • file-level deduplication – the least effective model, but also the simplest and the one that requires the least effort;
    • variable block-level deduplication – where data blocks do not have a fixed size but are adjusted to catch the best possible data string (see the content-defined chunking sketch after this list);
    • fixed block-level deduplication – in this method, the unit of data for which a fingerprint is generated and checked for uniqueness is a block of fixed size. The shorter the block, the greater the number of copies that can be found and the greater the benefit in terms of recovered space;
    • byte-level deduplication – in this case, data are compared byte by byte; such a solution is applied to similar files (it is content-aware deduplication) – like .doc, .png, etc.
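
    As referenced in the variable block item above, one common way to obtain variable-size blocks is content-defined chunking with a rolling hash: a block boundary is declared wherever the hash of a small sliding window hits a chosen bit pattern, so inserting a few bytes near the beginning of a file does not shift every later boundary. The sketch below is a simplified illustration with assumed parameters (window, mask, minimum and maximum chunk sizes), not the algorithm of any specific product. Comparing the resulting chunk hashes then works just like in the block-level sketch shown earlier.

        import os

        BASE, MOD = 257, (1 << 31) - 1       # parameters of a Rabin-Karp style hash
        WINDOW = 16                          # sliding window length (assumed)
        MASK = 0x3FF                         # cut when hash & MASK == 0 (~1 KiB chunks)
        MIN_CHUNK, MAX_CHUNK = 256, 8192     # keep chunk sizes within sane bounds
        REMOVE = pow(BASE, WINDOW - 1, MOD)  # factor that drops the oldest window byte

        def content_defined_chunks(data):
            """Split data into variable-size chunks whose boundaries depend on content."""
            chunks, start, h = [], 0, 0
            for i, byte in enumerate(data):
                if i - start >= WINDOW:                      # window full: drop oldest byte
                    h = (h - data[i - WINDOW] * REMOVE) % MOD
                h = (h * BASE + byte) % MOD                  # slide the hash one byte forward
                length = i - start + 1
                if length < MIN_CHUNK:
                    continue
                if (h & MASK) == 0 or length >= MAX_CHUNK:   # content-defined (or forced) cut
                    chunks.append(data[start:i + 1])
                    start, h = i + 1, 0
            if start < len(data):
                chunks.append(data[start:])                  # whatever is left at the end
            return chunks

        payload = os.urandom(64 * 1024)                      # illustrative data
        sizes = [len(c) for c in content_defined_chunks(payload)]
        print(len(sizes), "chunks, average size", sum(sizes) // len(sizes), "bytes")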

    It is easy to see that each of the mentioned solutions has its own benefits and drawbacks. That is why every implementation must be adapted to the specific application – also with regard to the specific functionalities that are unique to each model. In addition, it is important to consider the purpose of the specific application – not every model works with every piece of software, e.g., Hyper-V. So, as written above, the choice of an appropriate solution must follow from needs we are aware of.
