Developing a generic method for data disambiguation in large bibliographic databases

Supervisor: Emiel Caron

Many databases contain ambiguous and unstructured data, which makes the information they hold difficult for researchers to use. For these databases to serve as a reliable point of reference for research, the data need to be cleaned and disambiguated. Entity resolution focuses on identifying records that refer to the same entity. In this thesis we study different applications of entity resolution in order to explore a generic method for cleaning large databases. We implement the method on table TLS214 of the database PatStat. PatStat is a product of the European Patent Office that contains bibliographic information on patent applications and publications. Table TLS214 holds information on citations to scientific references. The method starts by pre-cleaning the records of table TLS214 in SQL and extracting bibliographic information. Next, the data are transferred to Python, where we use the TF-IDF algorithm to compute a string similarity measure. We then create clusters by means of a rule-based scoring system. Finally, we perform a precision and recall analysis using a golden set of clusters and optimize our parameters with a genetic algorithm.
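As an illustration only, the Python part of such a pipeline could look roughly like the sketch below. It is not the thesis implementation: the function names, the use of character trigrams, the single similarity threshold standing in for the rule-based scoring system, and the pairwise definition of precision and recall are all assumptions made for this example. The SQL pre-cleaning and the genetic algorithm are left out.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Lowercase, normalise whitespace, and split into character n-grams (assumed tokenisation)."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def tfidf_vectors(records, n=3):
    """Return one L2-normalised TF-IDF vector (as a dict) per record."""
    docs = [Counter(char_ngrams(r, n)) for r in records]
    df = Counter(g for d in docs for g in d)  # document frequency per n-gram
    total = len(docs)
    vectors = []
    for d in docs:
        # Smoothed IDF, so every weight stays strictly positive.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1) for g, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def cluster(records, threshold=0.5):
    """Single-link clustering: merge records whose similarity reaches the threshold.

    The single threshold is a hypothetical stand-in for a rule-based score.
    """
    vectors = tfidf_vectors(records)
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(records))]


def pairwise_precision_recall(predicted, gold):
    """Compare predicted and golden cluster labels over all record pairs."""
    pairs = list(combinations(range(len(gold)), 2))
    pred_pairs = {p for p in pairs if predicted[p[0]] == predicted[p[1]]}
    gold_pairs = {p for p in pairs if gold[p[0]] == gold[p[1]]}
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall
```

Calling `cluster(records)` on a list of cleaned reference strings returns one cluster label per record, and `pairwise_precision_recall(labels, gold_labels)` scores those labels against a golden set. The comparison of all record pairs is quadratic, so a table the size of TLS214 would need some form of blocking or sparse matrix computation. The `threshold` value is the kind of parameter that an optimization step, such as the genetic algorithm mentioned above, could tune.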

At the moment Wenxin Lin (w.x.lin@tilburguniversity.edu) is working on this topic.