Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat

The patent system acts as a policy instrument that encourages innovation and technical progress by providing protection and exclusivity. The Worldwide Patent Statistical Database, henceforth PATSTAT, is a product of the European Patent Office designed to assist statistical research into patent information. Its TLS214_NPL_PUBLN table (referred to as TLS214) contains additional bibliographic references of patents. TLS214 contained over 40 million records in the 2019 Spring Edition, but these records are often duplicated or inaccurate. A disambiguation method for these records was developed in an SQL environment, with a codebase written in T-SQL and C#, on a 2014 version of PATSTAT. This study evaluates that disambiguation method on a 2017 version of PATSTAT and presents the translation of the codebase to Python.

Accordingly, the first part of this study focuses on the performance and bottlenecks of the disambiguation method in SQL. Performance is measured with precision, recall, F1-score, and cosine measure on a golden sample of the TLS214 table. The second part focuses on converting part of the codebase to Python. Several adjustments are proposed to turn the method into a generic disambiguation method. The codebase is translated to Python and extended with the proposed adjustments. After the Python codebase is presented, the performance of the generic disambiguation method is measured.
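As an illustration of how precision, recall, and F1-score can be computed against a golden sample, the following minimal sketch scores a predicted clustering pair by pair. It is not the thesis code: the function names, the toy record ids, and the choice of pairwise (rather than cluster-level) evaluation are assumptions made for this example, and the cosine measure is left out.

```python
from itertools import combinations


def cluster_pairs(labels):
    """Return all unordered record pairs that share a cluster label."""
    by_cluster = {}
    for record_id, cluster_id in labels.items():
        by_cluster.setdefault(cluster_id, []).append(record_id)
    pairs = set()
    for members in by_cluster.values():
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def pairwise_scores(gold, predicted):
    """Pairwise precision, recall and F1 of `predicted` against `gold`.

    Both arguments map record id -> cluster id. Returning 0.0 for an
    empty denominator is a convention chosen for this sketch.
    """
    gold_pairs = cluster_pairs(gold)
    pred_pairs = cluster_pairs(predicted)
    true_positives = len(gold_pairs & pred_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: five records, two true clusters.
gold = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
# A conservative prediction: record 3 is left out of its true cluster.
predicted = {1: "x", 2: "x", 3: "y", 4: "z", 5: "z"}

precision, recall, f1 = pairwise_scores(gold, predicted)
print(precision, recall, round(f1, 3))
```

In this toy case every predicted pair is correct, so precision is 1.0, but only two of the four true pairs are found, so recall is 0.5. This is the pattern of a method that values precision over recall.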

With the initial parameters, the disambiguation method in Python remains conservative: it favors precision over recall. The connected components and MaxClique algorithms perform equally, but connected components requires less computation time. By altering the rule parameters, simulated annealing increases the F1-score by approximately 16.5%. After optimization, the study concludes that altering the rule parameters does not solve the recall problem, because the rule-based scoring system remains unable to find sufficient evidence to create pairs.
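To make the clustering step concrete, the sketch below groups records into connected components using a small union-find structure. The scored pairs, the 0.5 threshold, and all names are hypothetical, since the abstract does not describe the actual rules or parameters. The sketch only shows the general technique of turning accepted pairs into clusters.

```python
def connected_components(records, pairs):
    """Group records into clusters: two records end up together when a
    chain of accepted pairs links them (union-find with path halving)."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical scored pairs: (record, record, rule-based score).
records = [1, 2, 3, 4, 5]
scored_pairs = [(1, 2, 0.9), (2, 3, 0.8), (4, 5, 0.3)]
threshold = 0.5  # assumed cut-off, not taken from the thesis

accepted = [(a, b) for a, b, score in scored_pairs if score >= threshold]
clusters = connected_components(records, accepted)
print(clusters)
```

Records 1 and 3 are merged although no pair links them directly. A clique-based approach would instead require every pair inside a cluster to be accepted. Records 4 and 5 stay apart because their score falls below the assumed threshold, which shows how insufficient pair evidence limits recall regardless of the clustering algorithm.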

Colin de Ruiter is working on this topic.