Entity Resolution in Large Patent Databases: an optimization approach (Accepted for ICEIS 2021, April 26-28):

Candidate to ICEIS 2021 best paper award

Authors: Emiel Caron and Ekaterini Ioannou

Abstract: Entity resolution focuses on detecting and merging entities that refer to the same real-world object. Collective resolution is among the most prominent mechanisms suggested for addressing this challenge, since the resolution decisions are not made independently but are based on available relationships. In this paper we introduce a novel resolution approach that combines the essence of collective resolution with rules and transformations among entity attributes and values. We illustrate how the approach’s parameters are optimized based on a global optimization algorithm, i.e., simulated annealing, and explain how this optimization is performed using a small training set. The quality of the approach is verified through an extensive experimental evaluation with 40M real-world scientific entities from the Patstat database.
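
To make the optimization step concrete, the following is a minimal sketch of simulated annealing tuning a single parameter on a small labelled training set. It is not the implementation from the paper: the training pairs, the single match threshold, the geometric cooling schedule and the function names are all assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(objective, start, neighbor, steps=2000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Maximize `objective` and return (best_solution, best_score)."""
    rng = random.Random(seed)
    current, current_score = start, objective(start)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        candidate_score = objective(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Hypothetical training set: (rule score of a record pair, is it a true match?).
TRAINING_PAIRS = [(0.95, True), (0.90, True), (0.72, True), (0.65, False),
                  (0.60, True), (0.40, False), (0.30, False), (0.10, False)]


def f1_at_threshold(threshold):
    """F1-score obtained when pairs scoring >= threshold are declared matches."""
    tp = sum(1 for score, match in TRAINING_PAIRS if score >= threshold and match)
    fp = sum(1 for score, match in TRAINING_PAIRS if score >= threshold and not match)
    fn = sum(1 for score, match in TRAINING_PAIRS if score < threshold and match)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def perturb(threshold, rng):
    """Propose a nearby threshold, clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, threshold + rng.uniform(-0.1, 0.1)))


best_threshold, best_f1 = simulated_annealing(f1_at_threshold, 0.0, perturb)
print(f"threshold={best_threshold:.2f}  F1={best_f1:.3f}")
```

Accepting some worse moves while the temperature is high is what lets the search escape local optima; as the temperature drops, the procedure behaves more and more like plain hill climbing.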

Keywords: Entity Resolution, Data Disambiguation, Data Cleaning, Data Integration, Bibliographic Databases.

Presentation at ICT OPEN 2021: ‘An Optimization Method for Entity Resolution in Databases’

With a Case Study on the Cleaning of Scientific references in bibliographic databases

Dr. Emiel Caron, Dr. Ekaterini Ioannou, & Wen Xin Lin

Many databases contain ambiguous and unstructured data, which makes the information they contain difficult to use for further analysis. For these databases to be a reliable point of reference, the data needs to be cleaned. Entity resolution focuses on disambiguating records that refer to the same entity. In this paper we propose a generic optimization method for disambiguating large databases. This method is applied to a table with scientific references from the Patstat database. The table holds ambiguous information on citations to scientific references. The research method described is used to create clusters of records that refer to the same bibliographic entity. The method starts by pre-cleaning the records and extracting bibliographic labels. Next, we construct rules based on these labels and make use of the tf-idf algorithm to compute string similarities. We create clusters by means of a rule-based scoring system. Finally, we perform precision-recall analysis using a golden set of clusters and optimize our parameters with simulated annealing. Here we show that it is possible to optimize the performance of a disambiguation method using a global optimization algorithm.
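
The string-similarity step can be illustrated with a small tf-idf and cosine-similarity sketch. The tokenizer, the smoothed idf formula and the three reference strings are assumptions made up for this example; the abstract does not specify the exact weighting used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict: token -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumption): log((1 + N) / (1 + df)) + 1.
        vectors.append({
            token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented reference strings: the first two describe the same publication.
references = [
    "Smith J., Nature, vol. 400, pp. 12-15, 1999",
    "J. Smith, NATURE 400 (1999) 12-15",
    "Jones A., Journal of Applied Physics, vol. 85, 2001",
]
vectors = tfidf_vectors(references)
same_work = cosine_similarity(vectors[0], vectors[1])
different_work = cosine_similarity(vectors[0], vectors[2])
print(same_work > different_work)  # True
```

Because rare tokens receive a higher weight than frequent ones, agreement on a distinctive author name or page range counts for more than agreement on a common token such as "vol".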

10 minute presentation at ICT Open 2021

Thesis project: An operational alignment model for developing business performance dashboards

Business performance dashboards are a well-known solution to support business decision making. However, developing such a dashboard is not a matter of applying an off-the-shelf solution. Questions that arise are, for example: which KPIs should the dashboard contain? What data should be used? What is the most appropriate visual to display the information? What functionalities should the dashboard contain? Therefore, the aim of this research is to develop an operational alignment model for developing business performance dashboards: the Cross-industry standard process for business performance dashboards (CRISP-PD). This research presents a first exploration of this model.

Alignment between the requirements of the business and the IT solution is essential when designing a dashboard. Alignment is guaranteed in every step of the development process that is presented in this research. This is accomplished by implementing feedback mechanisms between the business and IT.

First, literature is reviewed to build a theoretical framework. This review contains the concept of alignment and theory about business performance dashboards. Second, the alignment model, based on the literature, is presented. Finally, to verify the model, it is applied in a case study. The goal of developing a business performance dashboard in the case study is supporting business decision making at commercial departments.

The case study resulted in a concept version of the business performance dashboard, which was evaluated by the problem owners. Since developing a business performance dashboard is an iterative process, the dashboard is not finalized. Changing circumstances (e.g. market conditions, business goals, information needs) can cause changes in the design of the dashboard. Therefore, several recommendations are made for further developing the dashboard.

However, this research has some limitations. First, CRISP-PD is not fully applied in the case study. Second, the case study is based on a limited data set. Therefore, basic statistical parameters could not be calculated. Third, CRISP-PD is only applied in one case study and therefore lacks reliability.

In addition, recommendations are made for future research. Since this is an operational alignment model, nothing is said about strategic and tactical performance dashboards. Therefore, in future research the alignment model should be tested for aligning strategic and tactical performance dashboards. Furthermore, the model should be validated in other case studies in order to improve the research reliability.

Yannick Visser is working on this topic.

 

Thesis project: Models for the prediction of overdue invoices

Keeping a healthy cash flow is vital for businesses of every size and kind. One of the biggest influences on a healthy cash flow is paying invoices on time and getting paid on time. Some researchers suggest that getting paid on time would prevent the collapse of over 50,000 small businesses every year, and that figure predates the economic uncertainty caused by COVID-19.

Many companies, including multiple payment providers, reported rising payment periods during the first and second lockdown periods. To keep payment periods short and cash flows steady, companies can pay more attention to their Accounts Receivable processes. However, deciding which clients or invoices to invest time and resources in can be difficult. To determine which clients need to be contacted, it must first be determined which invoices are likely to be overdue. By applying the methodological framework CRISP-DM, different Machine Learning models were studied to predict which invoices are likely to be overdue. For building these models, a dataset consisting of 290,000 invoices from a Dutch top-30 accounting firm was used.

After thoroughly executing the processes of data preparation, feature selection and applying different techniques of feature engineering and hyperparameter tuning, it is concluded that weighted Random Forest models yield the best predictive performance. When evaluating these models, historical behavior of clients is determined to be the best predictor of overdue invoices. Interestingly though, models that solely rely on client demographics without any historical behavior also predict overdue invoices relatively well. This means even for new clients without any historical record, relatively accurate predictions can also be provided.

Frank van den Berg is working on this topic. LinkedIn: https://www.linkedin.com/in/frank-van-den-berg-29021996/

Thesis project: Explanatory analytics in business dashboards – a comparison of explanatory models

Business dashboards are visual products of business intelligence and analytics that are used to support decision-making. This thesis studies how dashboards can be extended with explanatory analytics. Explanatory analytics are automated diagnostics that generate probable explanations for a problem, often an exceptional value. This is especially important with the advent of big data and maturing dashboard technologies.

This thesis follows a design science research approach, in which an artifact is created based on previous theoretical grounding and then evaluated by business experts. Three separate models for explanations from different fields are compared, namely explanation formalism, informative summarization, and explanation by intervention. First, the models’ theoretical bases are detailed and compared. Then the extension is designed with UML diagrams and implemented in Python using object-oriented programming, with Microsoft Power BI as the dashboarding platform. This implementation is then evaluated with business experts through semi-structured qualitative interviews.

As a result, it is found that business dashboards can be extended with explanatory analytics, and that the three models share many functions while differing in others. The main differences found are the recursion logic, the measure of impact, and the visualisation of the models. Explanation formalism uses top-down recursion logic, with a measure of impact based on the absolute difference between the actual and the reference value, and has a visualisation in the form of an explanation tree. Informative summarization, in contrast, uses bottom-up logic, with an impact measure of both magnitude and ratio, and the result is in the form of a table. Explanation by intervention has no recursion, but calculates everything with a big bang method, measuring the impact by ratio, and the result is in the form of a table. The qualitative evaluation found that most business experts prefer the use of absolute difference in the measure and a visualisation such as an explanation tree to speed up the assimilation of information. Ratio as a measure of impact was seen as including insignificant explanations when solving business problems.
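
The top-down logic described for explanation formalism can be sketched as a recursive drill-down over a hierarchy of figures. This is a simplified illustration, not the thesis implementation: the sales hierarchy, the `min_share` pruning rule and the function name are hypothetical.

```python
def explain(node, min_share=0.25, depth=0):
    """Return the lines of an explanation tree for `node`, built top-down.

    Impact is measured as the difference between actual and reference value.
    A child is kept as an explanation when the absolute value of its
    difference is at least `min_share` of its parent's absolute difference
    (a hypothetical pruning rule for this sketch).
    """
    diff = node["actual"] - node["reference"]
    lines = ["  " * depth + f"{node['name']}: {diff:+d}"]
    for child in node.get("children", []):
        child_diff = child["actual"] - child["reference"]
        if abs(child_diff) >= min_share * abs(diff):
            lines.extend(explain(child, min_share, depth + 1))
    return lines


# Hypothetical sales figures: actual value versus reference (e.g. budget).
sales = {
    "name": "Total sales", "actual": 900, "reference": 1000,
    "children": [
        {"name": "North", "actual": 480, "reference": 500},
        {"name": "South", "actual": 420, "reference": 500,
         "children": [
             {"name": "Product A", "actual": 150, "reference": 220},
             {"name": "Product B", "actual": 270, "reference": 280},
         ]},
    ],
}

for line in explain(sales):
    print(line)
# Total sales: -100
#   South: -80
#     Product A: -70
```

North and Product B are pruned because their differences are small relative to their parents, which mirrors the experts' preference for explanations ranked by absolute difference.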

Keywords: Business dashboard, Business intelligence, Analytics, Explanatory analytics

Aaro Askala (a.j.askala@tilburguniversity.edu) has worked on this topic.

Thesis project: creating HR talent demand insights by classification of job requisitions by applying dictionary matching on job descriptions

Due to the rapidly changing internal and external environment, organizations are constantly challenged to find the most qualified employees. To stay ahead of competitors, recruiters must find talent that helps achieve the organizational goals. Finding qualified employees requires strategic recruitment decisions. For these decisions, HR managers require insights into the external supply of talent and the internal demand for talent.

HR analysis enables data-driven decision making for HR managers by creating insights based on the available recruitment data. One of the data sources used by HR analysts is the data from job requisitions, which is used to create insights for recruitment. The structured data from the job requisitions alone does not contain what HR analysts need to create insights about the talent demand of the organization. That information is in the unstructured job description, which specifies the skill and study requirements that an applicant must meet. In order to use this data, the job requisitions should be classified by the skill and study requirements mentioned in the job description.

Different methods were considered for extracting the studies and skills from the job description, but in the end dictionary matching was chosen. Literature research showed that dictionary matching was the most accurate for smaller datasets. This method was applied to a case study with internal job requisition data. This resulted in a process covering the phases data collection, data preparation, information extraction and classification.
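
As an illustration of the information extraction and classification phases, the sketch below matches a small dictionary of terms against a job description. The dictionary contents, the category names and the whole-term matching rule are assumptions for this example and are not taken from the thesis.

```python
import re

# Hypothetical dictionary: category -> terms that signal it in a job description.
SKILL_DICTIONARY = {
    "python": ["python"],
    "sql": ["sql", "t-sql"],
    "project management": ["project management", "prince2"],
}


def classify(description, dictionary=SKILL_DICTIONARY):
    """Return the sorted dictionary categories whose terms occur in the text."""
    text = description.lower()
    found = set()
    for category, terms in dictionary.items():
        for term in terms:
            # Lookarounds ensure the term is not part of a longer word,
            # so "sql" does not match inside "mysql".
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text):
                found.add(category)
                break
    return sorted(found)


print(classify("We seek an analyst with Python and T-SQL experience."))
# ['python', 'sql']
```

A dictionary-based classifier like this is transparent and needs no training data, which fits the small-dataset setting mentioned above, but its recall depends entirely on how complete the dictionary is.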

After validation of this process with a precision and recall analysis it is concluded that dictionary matching on job descriptions to acquire classifications for the job requisitions is accurate. In addition, there are indications that the insights created with the job requisitions classified by the skills and study are valuable for data-driven recruitment decisions.

Rody Franken (FrankenRody@gmail.com) is working on this topic.

Thesis project: the prediction of voluntary employee turnover

This thesis examines how the prediction of voluntary employee turnover could bring value to organisations. A case study was performed with data from Deloitte Holding B.V., consisting of employee records. Four classification models were used as prediction methods. CRISP-DM was used as the guiding framework for the application of data mining. The data set was re-sampled because it proved to be imbalanced. With the F1 score as the leading performance measure, it was concluded that Random Forest was the best predictive model for Deloitte. The literature pointed out that voluntary employee turnover was shown to be dysfunctional. Hence, it was concluded that decision trees empower organisations to identify profiles that form a ‘risk’ for the organisation. Organisations can use decision trees as insights in order to develop effective policies and strategies for retaining employees. However, voluntary employee turnover remains a complex phenomenon, and only a small percentage of the variance in the actual turnover decision can be explained.
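
The abstract does not say which re-sampling technique was used. As one possible illustration, the sketch below applies random oversampling, which duplicates minority-class records until all classes are equally frequent; the function name and the toy labels are assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # All original rows that carry this label.
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 employees who stayed, 2 who left voluntarily.
rows = list(range(10))
labels = ["stay"] * 8 + ["leave"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Re-sampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution.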

Keywords: voluntary employee turnover, classification, imbalanced data

Koen Geerding (koengeerding@gmail.com) is working on this topic. LinkedIn: https://www.linkedin.com/in/koengeerding/

Thesis project: Disambiguation of scientific literature in patent data: an entity resolution process in Patstat

The patent system acts as a policy instrument to encourage innovation and technical progress by providing protection and exclusivity. The Worldwide Patent Statistical Database, henceforth PATSTAT, is a product of the European Patent Office designed to assist in statistical research into patent information. The TLS214_NPL_PUBLN table (referred to as TLS214) contains additional bibliographic references of patents. TLS214 contained over 40 million records in the 2019 Spring Edition; however, these records are often duplicated or inaccurate. The disambiguation method was developed in an SQL environment with a codebase written in T-SQL and C# on a 2014 version of PATSTAT. This study evaluates the disambiguation method on a 2017 version of PATSTAT and presents the translation of the codebase to Python.

As a result, the first part of this study focuses on the performance and bottlenecks of the disambiguation method in SQL. The performance is measured with precision, recall, F1-score, and cosine measure on a golden sample of the TLS214 table. The second part of this study focuses on converting a part of the codebase to Python. To transform the method towards a generic disambiguation method several adjustments are proposed. The codebase of the disambiguation method is transformed into Python and extended with the proposed adjustments. After presenting the Python codebase, the performance of the generic disambiguation method is measured.
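
One common way to compute precision, recall and F1-score for a clustering against a golden sample is to compare record pairs. The sketch below shows this pairwise variant; whether the thesis uses exactly this definition is an assumption, and the record and cluster ids are invented.

```python
from itertools import combinations


def cluster_pairs(assignment):
    """Return the set of unordered record pairs that share a cluster.

    `assignment` maps record id -> cluster id.
    """
    clusters = {}
    for record, cluster in assignment.items():
        clusters.setdefault(cluster, []).append(record)
    return {frozenset(pair)
            for members in clusters.values()
            for pair in combinations(members, 2)}


def pairwise_scores(predicted, golden):
    """Return (precision, recall, f1) of predicted pairs against golden pairs."""
    pred_pairs, gold_pairs = cluster_pairs(predicted), cluster_pairs(golden)
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(gold_pairs) if gold_pairs else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Golden sample: records 1-3 are one reference, records 4-5 another.
golden = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
# A conservative prediction that fails to attach record 3 to its cluster.
predicted = {1: "x", 2: "x", 3: "y", 4: "z", 5: "z"}

precision, recall, f1 = pairwise_scores(predicted, golden)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.50 f1=0.67
```

The toy prediction makes no wrong merges but misses half of the true pairs, which is the pattern of a method that values precision over recall.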

With the initial parameters, the disambiguation method in Python remains conservative, because it values precision over recall. The connected components and MaxClique algorithms are equal in terms of performance; however, connected components requires less computation time. By altering the rule parameters, simulated annealing increases the F1-score by approximately 16.5%. After optimization, it is concluded that altering the rule parameters does not solve the recall problem, because the rule-based scoring system remains unable to find sufficient evidence to create pairs.
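
To show how matched pairs turn into clusters, here is a minimal connected-components sketch based on a union-find structure. The record ids and pairs are invented, and the thesis codebase may implement this step differently.

```python
def connected_components(records, matched_pairs):
    """Cluster records: any chain of matched pairs puts records in one cluster."""
    parent = {record: record for record in records}

    def find(record):
        # Follow parent links to the cluster representative, halving paths on the way.
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in records:
        clusters.setdefault(find(record), []).append(record)
    return sorted(sorted(members) for members in clusters.values())


# Records 1-2 and 2-3 were matched by the rules, so 1, 2 and 3 end up together.
print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# [[1, 2, 3], [4, 5]]
```

Connected components accepts transitive evidence: records 1 and 3 share a cluster without ever being matched directly. A clique-based grouping would instead require every pair inside a cluster to be matched, which is stricter and generally more expensive to compute.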

Colin de Ruiter (colinderuiter@gmail.com) is working on this topic.

Thesis project: Detecting unusual journal entries in financial statement audits with auto-encoder neural networks

Every once in a while, a news item appears reporting a new case of fraud and its many victims. According to the Association of Certified Fraud Examiners, fraud caused more than $7 billion in total losses in 125 different countries between January 2016 and October 2017. Current fraud detection techniques are based on heuristics and past experience, so the main issue is that new types of fraud cannot be detected. The thesis aims to introduce a new method for fraud detection that resolves the downsides of the current methods, namely auto-encoder neural networks. This method is explored by first carrying out a replication study, after which adversarial auto-encoders are implemented in an attempt to exceed the results of the existing studies.

At the moment Joyce Hendriks (j.p.a.hendriks@tilburguniversity.edu) is working on this topic.

Thesis project: Develop a diagnostic analytic tool to enrich business dashboards

Supervisor: Emiel Caron

The central research question of this thesis is: ‘How can (explainable) AI analytics enrich dashboards by automatically raising alerts for exceptional values, analyzing root causes and potentially even recommending fixes or suggesting process optimizations?’

The aim of this research is to enrich business dashboards with integrated explanatory analytics that are supported by mathematical models developed in Python. As a starting point, the model proposed by Daniels & Feelders in “A general model for automated business diagnosis” will function as our base model. Throughout this research several subjects are tackled, for example the detection of exceptional values, various techniques to prune explanations, and how to extract mathematical business models from data.

At the moment Claire Vink (c.p.j.vink@tilburguniversity.edu) is working on this topic.