With a Case Study on the Cleaning of Scientific references in bibliographic databases
Dr. Emiel Caron, Dr. Ekaterini Ioannou, & Wen Xin Lin
Many databases contain ambiguous and unstructured data which makes the information it contains difficult to use for further analysis. In order for these databases to be a reliable point of reference, the data needs to be cleaned. Entity resolution focuses on disambiguating records that refer to the same entity. In this paper we propose a generic optimization method for disambiguating large databases. This method is used on a table with scientific references from the Patstat database. The table holds ambiguous information on citations to scientific references. The research method described is used to create clusters of records that refer to the same bibliographic entity. The method starts by pre-cleaning the records and extracting bibliographic labels. Next, we construct rules based on these labels and make use of the tf-idf algorithm to compute string similarities. We create clusters by means of a rule-based scoring system. Finally, we perform precision-recall analysis using a golden set of clusters and optimize our parameters with simulated annealing. Here we show that it is possible to optimize the performance of a disambiguation method using a global optimization algorithm