Abstract: Entity resolution focuses on detecting and merging entities that refer to the same real-world object. Collective resolution is among the most prominent mechanisms suggested for addressing this challenge, since the resolution decisions are not made independently but are based on available relationships. In this paper we introduce a novel resolution approach that combines the essence of collective resolution with rules and transformations among entity attributes and values. We illustrate how the approach’s parameters are optimized with a global optimization algorithm, namely simulated annealing, and explain how this optimization is performed using a small training set. The quality of the approach is verified through an extensive experimental evaluation with 40M real-world scientific entities from the Patstat database.
Keywords: Entity Resolution, Data Disambiguation, Data Cleaning, Data Integration, Bibliographic Databases.
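As an illustration of the optimization step described in the abstract, the following minimal sketch tunes a single merge threshold with simulated annealing against a small labelled training set. It is not the authors' implementation: the training pairs, the function names (`f1_score`, `anneal_threshold`), the Gaussian step size, and the cooling schedule are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical training set: (similarity score, refers_to_same_entity) pairs.
TRAINING_PAIRS = [
    (0.95, True), (0.88, True), (0.81, True), (0.78, True),
    (0.60, False), (0.55, False), (0.42, False), (0.30, False),
]

def f1_score(threshold, pairs):
    """F1 of the rule 'merge when score >= threshold' on labelled pairs."""
    tp = sum(1 for s, same in pairs if s >= threshold and same)
    fp = sum(1 for s, same in pairs if s >= threshold and not same)
    fn = sum(1 for s, same in pairs if s < threshold and same)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def anneal_threshold(pairs, steps=2000, start_temp=1.0, cooling=0.995, seed=0):
    """Search [0, 1] for a threshold with high F1 (assumed schedule)."""
    rng = random.Random(seed)
    current = rng.random()
    current_score = f1_score(current, pairs)
    best, best_score = current, current_score
    temp = start_temp
    for _ in range(steps):
        # Propose a nearby threshold, clamped to the valid range.
        candidate = min(1.0, max(0.0, current + rng.gauss(0, 0.1)))
        candidate_score = f1_score(candidate, pairs)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
        temp *= cooling
    return best, best_score

best_threshold, best_f1 = anneal_threshold(TRAINING_PAIRS)
```

Accepting occasional worse moves early on is what lets the search leave a local optimum. A real setup would presumably tune several parameters at once rather than one threshold.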
With a Case Study on the Cleaning of Scientific References in Bibliographic Databases
Dr. Emiel Caron, Dr. Ekaterini Ioannou, & Wen Xin Lin
Many databases contain ambiguous and unstructured data, which makes the information they contain difficult to use for further analysis. For these databases to be a reliable point of reference, the data needs to be cleaned. Entity resolution focuses on disambiguating records that refer to the same entity. In this paper we propose a generic optimization method for disambiguating large databases. This method is applied to a table of scientific references from the Patstat database. The table holds ambiguous information on citations to scientific references. The research method described is used to create clusters of records that refer to the same bibliographic entity. The method starts by pre-cleaning the records and extracting bibliographic labels. Next, we construct rules based on these labels and use the tf-idf algorithm to compute string similarities. We create clusters by means of a rule-based scoring system. Finally, we perform precision-recall analysis using a golden set of clusters and optimize our parameters with simulated annealing. We show that it is possible to optimize the performance of a disambiguation method using a global optimization algorithm.
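The similarity and scoring steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method from the paper: the records are invented, the label names (`year`, `volume`, `first_page`), the rule weights, and the merge threshold are hypothetical, and the tf-idf variant (term frequency times a smoothed inverse document frequency) is one common formulation among several.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {token: weight} dict per document (tf * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (cnt / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical rule weights and threshold, for illustration only.
RULE_WEIGHTS = {"year": 1.0, "volume": 1.0, "first_page": 2.0}
TEXT_WEIGHT = 3.0
MERGE_THRESHOLD = 4.0

def pair_score(rec_a, rec_b, text_sim):
    """Add points for each matching label plus weighted text similarity."""
    score = TEXT_WEIGHT * text_sim
    for label, weight in RULE_WEIGHTS.items():
        value = rec_a.get(label)
        if value is not None and value == rec_b.get(label):
            score += weight
    return score

# Invented, pre-cleaned records with already extracted labels.
records = [
    {"text": "smith j graph clustering methods", "year": 2010,
     "volume": 7, "first_page": 15},
    {"text": "smith graph clustering method", "year": 2010,
     "volume": 7, "first_page": 15},
    {"text": "lee k protein folding dynamics", "year": 1998,
     "volume": 3, "first_page": 200},
]
vectors = tfidf_vectors([r["text"] for r in records])

def same_cluster(i, j):
    """Decide whether records i and j should be clustered together."""
    sim = cosine(vectors[i], vectors[j])
    return pair_score(records[i], records[j], sim) >= MERGE_THRESHOLD
```

Here `same_cluster(0, 1)` holds because the two records agree on every label, while records 0 and 2 share neither labels nor tokens. In practice the weights and threshold would be the parameters tuned against the golden set.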
There is a lack of consensus on the usefulness of Human Resource (HR) analytics for achieving better business results. The authors suggest this is due to a lack of empirical evidence demonstrating how the use of data in the HR field improves performance, to the detachment of the HR function from accessible data, and to the typically poor IT infrastructure in place. We provide an in-depth case study of Strategic Competence analytics, an important part of HR analytics, in a large multinational company, labelled ABC, which potentially makes two important contributions. First, we contribute to the HR analytics literature by providing a data-driven competency model to improve the recruitment and selection process. The organization uses this model to search more effectively for talent in its knowledge networks. Second, we further develop a model for data-driven competence analytics, thereby also contributing to the information systems literature by developing specialized analytics for HR and by finding appropriate forms of computerized network analysis for identifying and analyzing knowledge hubs. Overall, our approach shows how internal and external data triangulation and better IT integration make a difference for the recruitment and selection process. We conclude by discussing our model’s implications for future research and practice.