Explore or Exploit? Effective Strategies for Disambiguating Large Databases
Prof. Reynold C.K. Cheng
Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. To obtain a higher-quality database, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning incurs a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely it is that a database object can be cleaned may not be precisely known. We tackle these challenges by proposing two kinds of algorithms. The first kind uses greedy heuristics to make sensible decisions; however, these algorithms do not make use of information gathered during cleaning, and they require user-supplied parameters to achieve high cleaning effectiveness. The second kind is the Explore-Exploit (EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested.
We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches.
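The abstract does not spell out how EE works, so the following Python sketch is only an illustration of the general explore-then-exploit idea under a cleaning budget, not the algorithm presented in the talk. It assumes that objects fall into groups sharing an unknown cleaning success rate, that each object is attempted at most once, and that a fixed fraction of the budget is reserved for exploration. The names `explore_exploit`, `try_clean`, and `explore_fraction` are hypothetical.

```python
from collections import defaultdict


def explore_exploit(objects, budget, try_clean, explore_fraction=0.3):
    """Illustrative sketch only; not the EE algorithm from the talk.

    objects: list of (object_id, group, cost) tuples (assumed layout).
    try_clean: hypothetical callable, object_id -> bool (True if cleaned).
    explore_fraction: assumed share of the budget spent on exploration.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    cleaned = []
    remaining = list(objects)
    spent = 0.0

    def estimate(group):
        # Laplace-smoothed estimate of the group's cleaning success rate.
        return (successes[group] + 1) / (attempts[group] + 2)

    def attempt(obj):
        nonlocal spent
        obj_id, group, cost = obj
        remaining.remove(obj)  # each object is attempted at most once
        spent += cost
        attempts[group] += 1
        if try_clean(obj_id):
            successes[group] += 1
            cleaned.append(obj_id)

    # Explore: try objects in the given order to learn success rates.
    explore_budget = budget * explore_fraction
    for obj in list(remaining):
        if spent + obj[2] <= explore_budget:
            attempt(obj)

    # Exploit: spend what is left on the best estimated success per cost.
    while True:
        affordable = [o for o in remaining if spent + o[2] <= budget]
        if not affordable:
            break
        attempt(max(affordable, key=lambda o: estimate(o[1]) / o[2]))

    return cleaned, spent
```

In this sketch the estimates keep being updated during the exploit phase, so a group that starts failing loses priority. How much budget to reserve for exploration is exactly the kind of parameter the abstract says needs fine-tuning.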
Dr. Reynold Cheng is an Associate Professor in the Department of Computer Science at the University of Hong Kong. He is the Chair of the Department Research Postgraduate Committee and the Vice Chairperson of the ACM (Hong Kong Chapter). He is also a guest editor for a special issue of TKDE. He has served as a PC member and reviewer for international conferences and journals including TODS, TKDE, TMC, VLDBJ, IS, DKE, KAIS, VLDB, ICDE, ICDM, DEXA and DASFAA.