What is Deduplication?
Deduplication is the process of identifying records that appear more than once within or across data sets, and removing all but one copy of each duplicated record for the purpose of data integration.
A very important step in integrating data from various sources is finding and eliminating duplicate records that refer to the same entity. This process is called deduplication.
Deduplication is a crucial operation in integrating data from various sources. The major challenge in this undertaking is creating a function that can determine when two records refer to the same entity in spite of a variety of data inconsistencies.
A general requirement for knowledge discovery is accurately merging data from numerous, diverse sources into a unified database. An important step in creating such a database is record deduplication: merging multiple records that refer to the same entity. The difficulty in this task arises both from data errors (e.g. misspellings and missing fields) and from variations in field values (e.g. abbreviations).
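As a minimal sketch of how misspellings, missing fields, and abbreviations might be handled when comparing two field values, the Python below normalizes each value and then scores their similarity with the standard library's difflib. The abbreviation table and the function names are assumptions made for illustration, not part of any particular product.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be domain-specific.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "corp": "corporation"}


def normalize(value):
    """Lowercase, replace punctuation with spaces, and expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def field_similarity(a, b):
    """Return a similarity between 0 and 1, or None when either field is missing."""
    if not a or not b:
        return None  # a missing field gives no evidence either way
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With this sketch, "12 Main St." and "12 main street" normalize to the same string, while a misspelling such as "Jonh Smith" still scores close to "John Smith".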
Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the merge decisions made for records of other types.
Given any arbitrary pair of records in a database, a deduplication procedure can do one of four things: declare the pair a duplicate (both records represent the same entity), declare the pair not a duplicate, declare the pair a possible duplicate requiring manual evaluation, or say nothing at all.
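The four possible outcomes for a pair of records can be sketched as a small classifier over a similarity score. The thresholds (0.6 and 0.9), the use of None to mean "no evidence", and all names here are illustrative assumptions; suitable values depend on the data.

```python
from enum import Enum


class Decision(Enum):
    DUPLICATE = "duplicate"
    NOT_DUPLICATE = "not a duplicate"
    POSSIBLE_DUPLICATE = "possible duplicate, needs manual evaluation"
    NO_DECISION = "no decision"


def classify_pair(score, lower=0.6, upper=0.9):
    """Map a similarity score (0..1, or None if unavailable) to a decision.

    The lower and upper thresholds are assumed values for illustration.
    """
    if score is None:
        return Decision.NO_DECISION  # nothing to compare, so say nothing
    if score >= upper:
        return Decision.DUPLICATE
    if score < lower:
        return Decision.NOT_DUPLICATE
    return Decision.POSSIBLE_DUPLICATE  # grey zone goes to a human reviewer
```

Widening the gap between the two thresholds sends more pairs to manual evaluation; narrowing it automates more decisions at the risk of more errors.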
One method that can be used to manage huge quantities of data is deduplication. As the name implies, it is a process that reduces a list of resources by removing any duplicate references to a resource.
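Reducing a list by removing duplicate references might look like the following sketch, which keeps the first occurrence of each resource and preserves the original order. The function name and the optional key parameter are assumptions for illustration.

```python
def dedupe_references(resources, key=lambda resource: resource):
    """Return resources with later duplicates removed, keeping first occurrences.

    key maps a resource to the hashable value used to decide whether two
    references are the same (identity by default).
    """
    seen = set()
    unique = []
    for resource in resources:
        identity = key(resource)
        if identity not in seen:
            seen.add(identity)
            unique.append(resource)
    return unique
```

Passing key=str.lower, for example, would treat "Report.doc" and "report.doc" as references to the same resource.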
With the costs of processing and reviewing electronic data soaring, eliminating duplicative data from a database is clearly of great value. Deduplication not only greatly lowers the cost of processing and review, it also considerably shortens review time and decreases the possibility of identical documents receiving conflicting review calls.
Checking data before mailing is very important to a successful campaign. Sending out duplicates not only adds considerably to the cost of mailing, but can irritate and alienate customers or prospective customers.
Data Consolidation - As soon as you have established that multiple records represent an identical data element, you must decide what procedure to follow to merge the duplicate or redundant data. Again, because data can be represented ambiguously, the same customer, prospect, part, item, transaction, or other essential data can occur numerous times. In situations like these, the redundancy can only be established by looking across numerous fields, which requires a data quality tool. Your consolidation process could consist of merging (selecting the best information across numerous records) or of retaining the information from each and every data source.
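The two consolidation options can be sketched as follows, assuming each record is a dictionary of string fields. Treating the longest non-empty value as the "best" information is only an assumed rule for illustration; a real rule might prefer the most recent or most trusted source.

```python
def merge_best(records):
    """Merge duplicates by keeping, per field, the longest non-empty value.

    'Longest wins' is an assumed stand-in for 'best information'.
    """
    merged = {}
    for record in records:
        for field, value in record.items():
            value = (value or "").strip()
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


def merge_keep_all(records):
    """Merge duplicates by retaining every distinct non-empty value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            value = (value or "").strip()
            if value and value not in merged.setdefault(field, []):
                merged[field].append(value)
    return merged
```

The first function yields a single clean record, while the second loses nothing and leaves the choice between conflicting values to a later step.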
Metadata is data about data. It documents data characteristics such as name, size, and type; it records data structures such as length, fields, and columns; and it details data properties such as where data is located, how it is associated, and who owns it.
Deduplication based on a comparison of metadata is mostly effective for email, where minute differences in formatting not obvious to the user can trigger changes in hash values. Deduping based on metadata will have varying results, depending on the metadata that is compared. For example, if the number of attachments for email is not one of the fields that is compared, two emails with identical content, one with an attachment and one without, sent by the same party to two different people, may be considered duplicates.
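To illustrate the difference, here is a rough sketch that compares a content hash with a metadata key built from selected fields. The field names (sender, subject, sent, attachment_count) and the dictionary layout are assumptions for the example, not a standard email schema.

```python
import hashlib


def content_hash(raw_message):
    """Exact-match fingerprint: any change in the text changes the hash."""
    return hashlib.sha256(raw_message.encode("utf-8")).hexdigest()


def metadata_key(email, fields):
    """Build a comparison key from the chosen metadata fields only."""
    return tuple(email.get(field) for field in fields)


# Two hypothetical emails: same content, but only the first has an attachment.
first = {"sender": "a@example.com", "subject": "Q3 report",
         "sent": "2020-01-05T10:00", "attachment_count": 1}
second = {"sender": "a@example.com", "subject": "Q3 report",
          "sent": "2020-01-05T10:00", "attachment_count": 0}

narrow = ("sender", "subject", "sent")
wide = narrow + ("attachment_count",)

same_without_count = metadata_key(first, narrow) == metadata_key(second, narrow)
same_with_count = metadata_key(first, wide) == metadata_key(second, wide)
```

Here same_without_count is True and same_with_count is False, which mirrors the attachment example: the outcome depends entirely on which fields are compared. By contrast, content_hash gives different values for two messages that differ only by a trailing space.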
Once you have decided that you would like to dedupe and have defined what constitutes a duplicate in your case, you must decide whether you would like to dedupe globally. This refers to removing duplicates across all sources, instead of only removing duplicates found within the data of a single source. Global deduping removes more duplicates, but imposes two major restrictions.
This process can save enormous amounts of money and time. You must define what a dupe is in your case - functional v. exact, global v. custodian level - and that will determine the method that should be used. No approach is inherently better than the other; what matters is what works best given the requirements of a particular case.
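The difference between global and custodian-level deduping can be sketched as a change in the key used to detect duplicates. The document layout (a dictionary with custodian and hash entries) and the scope parameter are assumptions made for this example.

```python
def dedupe(documents, scope="global"):
    """Keep the first copy of each document within the chosen scope.

    scope="global" compares hashes across all custodians;
    scope="custodian" only compares documents held by the same custodian.
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    seen = set()
    kept = []
    for doc in documents:
        if scope == "global":
            key = doc["hash"]
        else:
            key = (doc["custodian"], doc["hash"])
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

For the same input, the global scope never keeps more documents than the custodian scope, because a copy held by a second custodian is removed only in the global case.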
At Intimate Data we understand the power of quality data - and the damage that can be caused by data that has become infected, corrupted or out-of-date. To improve the quality of your data, we have developed powerful and sophisticated data quality software to get it clean and keep it clean.
We at Intimate Data can undertake the cleaning of almost any data, however badly structured it is, and however many different file formats the data files are supplied in.