{"id":225,"date":"2020-05-20T23:47:01","date_gmt":"2020-05-20T21:47:01","guid":{"rendered":"https:\/\/emielcaron.nl\/?p=225"},"modified":"2020-05-21T00:11:11","modified_gmt":"2020-05-20T22:11:11","slug":"thesis-project-disambiguation-of-scientific-literature-in-patent-data-an-entity-resolution-process-in-patstat","status":"publish","type":"post","link":"https:\/\/emielcaron.nl\/?p=225","title":{"rendered":"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The patent system acts as a policy instrument to encourage innovation and technical progress by providing protection and exclusivity. The Worldwide Patent Statistical Database, henceforth PATSTAT, is a product of the European Patent Office designed to assist in statistical research into patent information. The <em>TLS214_NPL_PUBLN<\/em> table (referred to as TLS214) contains additional bibliographic references of patents. TLS214 contained over 40 million records in the 2019 Spring Edition; however, these records are often duplicated or inaccurate. The disambiguation method was developed in an SQL environment with a codebase written in T-SQL and C# on a 2014 version of PATSTAT. This study evaluates the disambiguation method on a 2017 version of PATSTAT and presents the translation of the codebase to Python.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first part of this study focuses on the performance and bottlenecks of the disambiguation method in SQL. The performance is measured with precision, recall, F1-score, and cosine measure on a golden sample of the TLS214 table. The second part of this study focuses on converting a part of the codebase to Python. To turn the method into a generic disambiguation method, several adjustments are proposed. The codebase of the disambiguation method is translated into Python and extended with the proposed adjustments.
After the Python codebase is presented, the performance of the generic disambiguation method is measured.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With the initial parameters, the disambiguation method in Python remains conservative, because it values precision over recall. The connected components and MaxClique algorithms perform equally well; however, the connected components algorithm requires less computation time. By altering the rule parameters, simulated annealing increases the F1-score by approximately 16.5%. After optimization, it is concluded that altering the rule parameters does not solve the recall problem, because the rule-based scoring system remains unable to find sufficient evidence to create pairs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Colin de Ruiter (<a href=\"mailto:colinderuiter@gmail.com\" target=\"_blank\" rel=\"noreferrer noopener\">colinderuiter@gmail.com<\/a>) is working on this topic.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The patent system acts as a policy instrument to encourage innovation and technical progress by providing protection and exclusivity. The Worldwide Patent Statistical Database, henceforth PATSTAT, is a product of the European Patent Office designed to assist in statistical research into patent information.
The TLS214_NPL_PUBLN table (referred to as TLS214) contains additional bibliographic references of &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/emielcaron.nl\/?p=225\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-225","post","type-post","status-publish","format-standard","hentry","category-thesis"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat - Emiel Caron<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/emielcaron.nl\/?p=225\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat - Emiel Caron\" \/>\n<meta property=\"og:description\" content=\"The patent system acts as a policy instrument to encourage innovation and technical progress by providing protection and exclusivity. The Worldwide Patent Statistical Database, henceforth PATSTAT, is a product of the European Patent Office designed to assist in statistical research into patent information. 
The TLS214_NPL_PUBLN table (referred to as TLS214) contains additional bibliographic references of &hellip; Continue reading &quot;Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/emielcaron.nl\/?p=225\" \/>\n<meta property=\"og:site_name\" content=\"Emiel Caron\" \/>\n<meta property=\"article:published_time\" content=\"2020-05-20T21:47:01+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-05-20T22:11:11+00:00\" \/>\n<meta name=\"author\" content=\"Emiel Caron\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Emiel Caron\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225\"},\"author\":{\"name\":\"Emiel Caron\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\"},\"headline\":\"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat\",\"datePublished\":\"2020-05-20T21:47:01+00:00\",\"dateModified\":\"2020-05-20T22:11:11+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225\"},\"wordCount\":321,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\"},\"articleSection\":[\"Thesis 
projects\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/emielcaron.nl\\\/?p=225#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225\",\"url\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225\",\"name\":\"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat - Emiel Caron\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#website\"},\"datePublished\":\"2020-05-20T21:47:01+00:00\",\"dateModified\":\"2020-05-20T22:11:11+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/emielcaron.nl\\\/?p=225\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=225#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/emielcaron.nl\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#website\",\"url\":\"https:\\\/\\\/emielcaron.nl\\\/\",\"name\":\"Emiel Caron\",\"description\":\"PhD, Lecturer &amp; Researcher in Business Intelligence &amp; Analytics, Data 
science\",\"publisher\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/emielcaron.nl\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\",\"name\":\"Emiel Caron\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\",\"caption\":\"Emiel Caron\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat - Emiel Caron","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/emielcaron.nl\/?p=225","og_locale":"en_US","og_type":"article","og_title":"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat - Emiel Caron","og_description":"The patent system acts as a policy instrument to encourage innovation and technical progress by providing protection and exclusivity. The Worldwide Patent Statistical Database, henceforth PATSTAT, is a product of the European Patent Office designed to assist in statistical research into patent information. The TLS214_NPL_PUBLN table (referred to as TLS214) contains additional bibliographic references of &hellip; Continue reading \"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat\"","og_url":"https:\/\/emielcaron.nl\/?p=225","og_site_name":"Emiel Caron","article_published_time":"2020-05-20T21:47:01+00:00","article_modified_time":"2020-05-20T22:11:11+00:00","author":"Emiel Caron","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Emiel Caron","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/emielcaron.nl\/?p=225#article","isPartOf":{"@id":"https:\/\/emielcaron.nl\/?p=225"},"author":{"name":"Emiel Caron","@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd"},"headline":"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat","datePublished":"2020-05-20T21:47:01+00:00","dateModified":"2020-05-20T22:11:11+00:00","mainEntityOfPage":{"@id":"https:\/\/emielcaron.nl\/?p=225"},"wordCount":321,"commentCount":0,"publisher":{"@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd"},"articleSection":["Thesis projects"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/emielcaron.nl\/?p=225#respond"]}]},{"@type":"WebPage","@id":"https:\/\/emielcaron.nl\/?p=225","url":"https:\/\/emielcaron.nl\/?p=225","name":"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat - Emiel Caron","isPartOf":{"@id":"https:\/\/emielcaron.nl\/#website"},"datePublished":"2020-05-20T21:47:01+00:00","dateModified":"2020-05-20T22:11:11+00:00","breadcrumb":{"@id":"https:\/\/emielcaron.nl\/?p=225#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/emielcaron.nl\/?p=225"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/emielcaron.nl\/?p=225#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/emielcaron.nl\/"},{"@type":"ListItem","position":2,"name":"Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat"}]},{"@type":"WebSite","@id":"https:\/\/emielcaron.nl\/#website","url":"https:\/\/emielcaron.nl\/","name":"Emiel Caron","description":"PhD, Lecturer &amp; Researcher in Business Intelligence &amp; Analytics, Data 
science","publisher":{"@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/emielcaron.nl\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd","name":"Emiel Caron","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g","caption":"Emiel Caron"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g"}}]}},"_links":{"self":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts\/225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=225"}],"version-history":[{"count":6,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts\/225\/revisions"}],"predecessor-version":[{"id":236,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts\/225\/revisions\/236"}],"wp:attachment":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=225"}]
,"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}