Efficient and Effective Entity Resolution Under Cloud-Scale Data

Project leader:

Saake, Gunter; Prof. Dr.

Project staff:

M.Sc. Xiao Chen

Funding:

Funding body - Other; 01.07.2014 to 30.04.2020

A single real-world entity often has several different descriptions. The differences may result from typographical errors, abbreviations, data formatting, etc. Such divergent descriptions lower data quality and can lead to misunderstanding, so it is necessary to be able to resolve and clarify them. Entity Resolution (ER) is the process of identifying records that refer to the same real-world entity. It is also known under several other names. If the records to be identified are all located within a single source, it is called de-duplication. Otherwise, in computer science it is also typically referred to as data matching, record linkage, duplicate detection, reference reconciliation, or object identification. In the database domain, ER is synonymous with similarity join.
Today, ER plays a vital role in diverse areas, not only in traditional applications such as census, health data, and national security, but also in networked applications such as business mailing lists, online shopping, and web search. It is also an indispensable step in data cleaning, data integration, and data warehousing. The use of computer techniques to perform ER dates back to the middle of the last century. Since then, researchers have developed many techniques and algorithms for ER because of its extensive applications. From its early days, ER has had two general goals: efficiency and effectiveness, that is, how fast and how accurately an ER task can be solved.
In recent years, the rise of the web has led to the extension of techniques and algorithms for ER. Web data (also known as big data) is often semi-structured, comes from diverse domains, and exists on a very large scale. These three properties make big data qualitatively different from traditional data, which brings new challenges to ER that require new techniques or algorithms as solutions.
Specifically, specialized similarity measures are required for semi-structured data; cross-domain techniques are needed to handle data from diverse domains; and parallel techniques are needed to make algorithms not only efficient and effective but also scalable, so that they can deal with the large scale of the data. This project focuses on the last point: parallelizing the process of entity resolution. The specific research direction is to explore several big data processing frameworks and to determine their advantages and disadvantages for performing ER.
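As a rough illustration of why ER lends itself to parallel processing, the following sketch de-duplicates a handful of records. It is not the project's method: the Jaccard token similarity, the blocking key (first letter of the alphabetically last token), the 0.5 threshold, the sample records, and all function names are assumptions made for this example. The point it shows is only that, once records are grouped into blocks, each block can be matched independently of the others.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def tokens(text):
    """Lowercase a description and split it into a set of word tokens."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def blocking_key(text):
    """Hypothetical, deliberately crude key: first letter of the last token in sort order."""
    toks = tokens(text)
    return max(toks)[0] if toks else ""


def match_block(block, threshold):
    """Compare all record pairs inside one block; return the id pairs judged to match."""
    return [
        (id_a, id_b)
        for (id_a, text_a), (id_b, text_b) in combinations(block, 2)
        if jaccard(tokens(text_a), tokens(text_b)) >= threshold
    ]


def resolve(records, threshold=0.5, workers=4):
    """Group records into blocks, then match every block as an independent task."""
    blocks = defaultdict(list)
    for rec_id, text in records:
        blocks[blocking_key(text)].append((rec_id, text))
    # Blocks share no state, so they can be handed to separate workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda b: match_block(b, threshold), blocks.values()))
    return sorted(pair for pairs in results for pair in pairs)


# Invented sample data: records 1/2 and 3/4 describe the same person.
records = [
    (1, "John A. Smith, Berlin"),
    (2, "Smith, John, Berlin"),
    (3, "Jane Doe, Hamburg"),
    (4, "Doe, Jane, Hamburg"),
    (5, "Joan Smith, Bonn"),
]
matches = resolve(records)
print(matches)
```

A thread pool is used here only to keep the sketch self-contained. In CPython it gives little speed-up for CPU-bound comparisons, and at the data scale the project targets the same block-wise decomposition would instead be expressed in a distributed big data processing framework.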