{"id":173,"date":"2019-11-19T16:23:56","date_gmt":"2019-11-19T14:23:56","guid":{"rendered":"https:\/\/emielcaron.nl\/?p=173"},"modified":"2019-11-19T17:46:08","modified_gmt":"2019-11-19T15:46:08","slug":"a-generic-method-for-the-disambiguation-of-bibliographic-databases","status":"publish","type":"post","link":"https:\/\/emielcaron.nl\/?p=173","title":{"rendered":"Developing a generic method for data disambiguation in large bibliographic databases"},"content":{"rendered":"\n<p>Supervisor: Emiel Caron<\/p>\n\n\n\n<p>Many databases contain ambiguous and unstructured data, which makes the information they contain difficult for researchers to use. For these databases to be a reliable point of reference for research, the data needs to be cleaned and disambiguated. Entity resolution focuses on disambiguating records that refer to the same entity. In this thesis, we study different applications of entity resolution in order to explore a generic method for cleaning large databases. We apply the method to table TLS214 of the PatStat database. PatStat is a product of the European Patent Office that contains bibliographic information on patent applications and publications. Table TLS214 holds information on citations to scientific references. The method starts by pre-cleaning the records of table TLS214 in SQL and extracting bibliographic information. Next, the data is transferred to Python, where we use the TF-IDF algorithm to compute a string similarity measure. We create clusters by means of a rule-based scoring system. Finally, we perform precision and recall analysis using a golden set of clusters and optimize our parameters with a genetic algorithm.
<\/p>\n\n\n\n<p>At the moment Wenxin Lin (<a href=\"mailto:w.x.lin@tilburguniversity.edu\">w.x.lin@tilburguniversity.edu<\/a>) is working on this topic.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Supervisor: Emiel Caron Many databases contain ambiguous and unstructured data which makes the information it contains difficult to use for researchers. In order for these databases to be a reliable point of reference for research, the data needs to be cleaned and disambiguated. Entity resolution focuses on disambiguating records that refer to the same entity. &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/emielcaron.nl\/?p=173\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Developing a generic method for data disambiguation in large bibliographic databases&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-173","post","type-post","status-publish","format-standard","hentry","category-thesis"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Developing a generic method for data disambiguation in large bibliographic databases - Emiel Caron<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/emielcaron.nl\/?p=173\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Developing a generic method for data disambiguation in large bibliographic databases - Emiel Caron\" \/>\n<meta property=\"og:description\" content=\"Supervisor: Emiel Caron Many databases contain ambiguous and unstructured data which makes the information it contains 
difficult to use for researchers. In order for these databases to be a reliable point of reference for research, the data needs to be cleaned and disambiguated. Entity resolution focuses on disambiguating records that refer to the same entity. &hellip; Continue reading &quot;Developing a generic method for data disambiguation in large bibliographic databases&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/emielcaron.nl\/?p=173\" \/>\n<meta property=\"og:site_name\" content=\"Emiel Caron\" \/>\n<meta property=\"article:published_time\" content=\"2019-11-19T14:23:56+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-11-19T15:46:08+00:00\" \/>\n<meta name=\"author\" content=\"Emiel Caron\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Emiel Caron\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173\"},\"author\":{\"name\":\"Emiel Caron\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\"},\"headline\":\"Developing a generic method for data disambiguation in large bibliographic databases\",\"datePublished\":\"2019-11-19T14:23:56+00:00\",\"dateModified\":\"2019-11-19T15:46:08+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173\"},\"wordCount\":211,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\"},\"articleSection\":[\"Thesis 
projects\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/emielcaron.nl\\\/?p=173#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173\",\"url\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173\",\"name\":\"Developing a generic method for data disambiguation in large bibliographic databases - Emiel Caron\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#website\"},\"datePublished\":\"2019-11-19T14:23:56+00:00\",\"dateModified\":\"2019-11-19T15:46:08+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/emielcaron.nl\\\/?p=173\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/?p=173#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/emielcaron.nl\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Developing a generic method for data disambiguation in large bibliographic databases\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#website\",\"url\":\"https:\\\/\\\/emielcaron.nl\\\/\",\"name\":\"Emiel Caron\",\"description\":\"PhD, Lecturer &amp; Researcher in Business Intelligence &amp; Analytics, Data science\",\"publisher\":{\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/emielcaron.nl\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/emielcaron.nl\\\/#\\\/schema\\\/person\\\/992b3c38031ce991eef0e83dd12e11cd\",\"name\":\"Emiel 
Caron\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\",\"caption\":\"Emiel Caron\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Developing a generic method for data disambiguation in large bibliographic databases - Emiel Caron","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/emielcaron.nl\/?p=173","og_locale":"en_US","og_type":"article","og_title":"Developing a generic method for data disambiguation in large bibliographic databases - Emiel Caron","og_description":"Supervisor: Emiel Caron Many databases contain ambiguous and unstructured data which makes the information it contains difficult to use for researchers. In order for these databases to be a reliable point of reference for research, the data needs to be cleaned and disambiguated. Entity resolution focuses on disambiguating records that refer to the same entity. &hellip; Continue reading \"Developing a generic method for data disambiguation in large bibliographic databases\"","og_url":"https:\/\/emielcaron.nl\/?p=173","og_site_name":"Emiel Caron","article_published_time":"2019-11-19T14:23:56+00:00","article_modified_time":"2019-11-19T15:46:08+00:00","author":"Emiel Caron","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Emiel Caron","Est. 
reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/emielcaron.nl\/?p=173#article","isPartOf":{"@id":"https:\/\/emielcaron.nl\/?p=173"},"author":{"name":"Emiel Caron","@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd"},"headline":"Developing a generic method for data disambiguation in large bibliographic databases","datePublished":"2019-11-19T14:23:56+00:00","dateModified":"2019-11-19T15:46:08+00:00","mainEntityOfPage":{"@id":"https:\/\/emielcaron.nl\/?p=173"},"wordCount":211,"commentCount":0,"publisher":{"@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd"},"articleSection":["Thesis projects"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/emielcaron.nl\/?p=173#respond"]}]},{"@type":"WebPage","@id":"https:\/\/emielcaron.nl\/?p=173","url":"https:\/\/emielcaron.nl\/?p=173","name":"Developing a generic method for data disambiguation in large bibliographic databases - Emiel Caron","isPartOf":{"@id":"https:\/\/emielcaron.nl\/#website"},"datePublished":"2019-11-19T14:23:56+00:00","dateModified":"2019-11-19T15:46:08+00:00","breadcrumb":{"@id":"https:\/\/emielcaron.nl\/?p=173#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/emielcaron.nl\/?p=173"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/emielcaron.nl\/?p=173#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/emielcaron.nl\/"},{"@type":"ListItem","position":2,"name":"Developing a generic method for data disambiguation in large bibliographic databases"}]},{"@type":"WebSite","@id":"https:\/\/emielcaron.nl\/#website","url":"https:\/\/emielcaron.nl\/","name":"Emiel Caron","description":"PhD, Lecturer &amp; Researcher in Business Intelligence &amp; Analytics, Data 
science","publisher":{"@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/emielcaron.nl\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/emielcaron.nl\/#\/schema\/person\/992b3c38031ce991eef0e83dd12e11cd","name":"Emiel Caron","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g","caption":"Emiel Caron"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/16d7767d69c769cde896a0f5e53533595a081cfaeab0aca485f4736e51e08ae0?s=96&d=mm&r=g"}}]}},"_links":{"self":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts\/173","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=173"}],"version-history":[{"count":6,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts\/173\/revisions"}],"predecessor-version":[{"id":179,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=\/wp\/v2\/posts\/173\/revisions\/179"}],"wp:attachment":[{"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=173"}]
,"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=173"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emielcaron.nl\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=173"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}