Efficient and Effective Entity Resolution Under Cloud-Scale Data
M.Sc. Xiao Chen
Funding body: Other
A single real-world entity may have several different descriptions. The differences can result from typographical errors, abbreviations, data formatting, and similar causes. Such differing descriptions lower data quality and can lead to misunderstandings, so it is necessary to be able to resolve and reconcile them. Entity Resolution (ER) is the process of identifying records that refer to the same real-world entity. It is known under several other names. If the records to be identified are all located within a single source, it is called de-duplication. Otherwise, in computer science it is typically referred to as data matching, record linkage, duplicate detection, reference reconciliation, or object identification. In the database domain, ER is synonymous with similarity join.

Today, ER plays a vital role in diverse areas: not only in traditional applications such as census, health data, and national security, but also in networked applications such as business mailing lists, online shopping, and web search. It is also an indispensable step in data cleaning, data integration, and data warehousing. The use of computers to perform ER dates back to the middle of the last century. Since then, researchers have developed many techniques and algorithms for ER because of its extensive applications. From its early days, ER has had two general goals: efficiency and effectiveness, that is, how fast and how accurately an ER task can be solved.

In recent years, the rise of the web has led to the extension of ER techniques and algorithms. Such web data (also known as big data) is often semi-structured, comes from diverse domains, and exists on a very large scale. These three properties make big data qualitatively different from traditional data, which brings new challenges to ER that require new techniques or algorithms as solutions.
Specifically, specialized similarity measures are required for semi-structured data; cross-domain techniques are needed to handle data from diverse domains; and parallel techniques are needed to make algorithms not only efficient and effective but also scalable, so that they can deal with the large scale of the data. This project focuses on the last point: parallelizing the process of entity resolution. The specific research direction is to explore several big data processing frameworks in order to understand their advantages and disadvantages when performing ER.
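To make the idea of ER as a similarity join more concrete, the following is a minimal sketch of de-duplication within a single source. It is an illustration only, not the method used in this project: the record layout, the first-letter blocking key, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all assumptions chosen for brevity. Records are first grouped into blocks, and only records within the same block are compared pairwise.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first letter of the lowercased name.
    return record["name"].strip().lower()[:1]


def similarity(a, b):
    # Character-based similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_block(block, threshold):
    # Compare every pair of records inside one block.
    return [
        (r1["id"], r2["id"])
        for r1, r2 in combinations(block, 2)
        if similarity(r1["name"], r2["name"]) >= threshold
    ]


def resolve(records, threshold=0.85):
    # Group records by blocking key so that only likely matches are compared.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # Each block is independent of the others; this loop is the part
    # that a parallel framework could distribute across workers.
    matches = []
    for block in blocks.values():
        matches.extend(match_block(block, threshold))
    return sorted(matches)


# Illustrative records (made up for this sketch).
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "Jon Smyth"},
]

duplicates = resolve(records)
```

Because the blocks do not depend on each other, the per-block comparison is a natural unit of parallel work, which is one plausible way a big data processing framework could be applied to ER. The choice of blocking key trades efficiency against effectiveness: a coarser key compares more pairs, while a finer key risks separating true duplicates.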
Data Matching, Entity Resolution, Record Linkage, Similarity Join
Prof. Dr. Gunter Saake
Institut für Technische und Betriebliche Informationssysteme
Tel.: +49 391 6758800