Developing a generic method for data disambiguation in large bibliographic databases

Supervisor: Emiel Caron

Many databases contain ambiguous and unstructured data, which makes the information they contain difficult for researchers to use. For these databases to be a reliable point of reference for research, the data needs to be cleaned and disambiguated. Entity resolution focuses on disambiguating records that refer to the same entity. In this thesis we study different applications of entity resolution in order to explore a generic method for cleaning large databases. We implement the method on table TLS214 of the database PatStat. PatStat is a product of the European Patent Office that contains bibliographic information on patent applications and publications. Table TLS214 of the database holds information on citations to scientific references. The method starts by pre-cleaning the records of table TLS214 in SQL and extracting bibliographic information. Next, the data is transferred to Python, where we use the TF-IDF algorithm to compute a string similarity measure. We create clusters by means of a rule-based scoring system. Finally, we perform precision and recall analysis using a golden set of clusters and optimize our parameters with a genetic algorithm.
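The Python part of the pipeline described above can be illustrated with a minimal sketch. It is not the thesis implementation: the character trigrams, the single similarity threshold (standing in for the rule-based scoring system), the function names, and the three toy records are all assumptions made for illustration, and the genetic-algorithm tuning step is left out.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(text, n=3):
    """Character n-grams of a lower-cased, whitespace-normalised string."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]


def tfidf_vectors(records, n=3):
    """Return one L2-normalised TF-IDF vector (as a dict) per record."""
    counts = [Counter(ngrams(r, n)) for r in records]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    total = len(records)
    vectors = []
    for c in counts:
        # Smoothed IDF, so n-grams shared by all records still get weight 1.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def cluster(records, threshold=0.5):
    """Link records whose similarity reaches the threshold (assumed rule)
    and return one cluster label per record, using union-find."""
    vectors = tfidf_vectors(records)
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]


def pairwise_precision_recall(predicted, gold):
    """Precision and recall over record pairs placed in the same cluster."""
    def pairs(labels):
        return {(i, j) for i, j in combinations(range(len(labels)), 2)
                if labels[i] == labels[j]}
    p, g = pairs(predicted), pairs(gold)
    tp = len(p & g)
    precision = tp / len(p) if p else 1.0
    recall = tp / len(g) if g else 1.0
    return precision, recall


# Invented toy references, not taken from PatStat.
records = [
    "Smith J, Nature 1999, vol 12, pp 34-56",
    "SMITH J, NATURE 1999 VOL 12 PP 34-56",
    "Doe A, Science 2005, vol 3, pp 7-9",
]
gold = [0, 0, 1]  # hypothetical golden set: first two records are one entity
labels = cluster(records)
precision, recall = pairwise_precision_recall(labels, gold)
print(labels, precision, recall)
```

Comparing all record pairs is quadratic in the number of records, so a table of realistic size would need some form of blocking or sparse matrix multiplication before the similarity step.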

At the moment Wenxin Lin (w.x.lin@tilburguniversity.edu) is working on this topic.

Explanatory Analytics in Business Dashboards to improve business decision-making

Supervisor: Emiel Caron

The central question of this research is: ‘how can business dashboards be extended with explanatory analytics capabilities to support business analysts in answering managerial questions?’. The business relevance of this question comes from the lack of explanatory functions in current business intelligence tools, and more specifically in business dashboards. This is a problem for business analysts, who must browse large amounts of data visually to discover interesting patterns. Furthermore, visual analysis is not only slow and expensive, but also highly subjective because of the bias of the analyst (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Lastly, explanatory functions can support the move to a more data-driven decision-making process. Tools that make these tasks easier and quicker are therefore valuable.

The objective of this design-oriented research is to develop applications that extend current dashboard solutions such as MS PowerBI with functions to 1) find exceptional values in the dashboard and 2) explain why the exceptions have occurred.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11), 27–34. https://doi.org/10.1145/240455.240464

The Effectiveness of Adding Storytelling Design Elements to Business Dashboards

Supervisor: Emiel Caron

Data is increasingly used for decision making. Business dashboards help users with decision-making, monitoring, and managing by presenting findings from data in an accessible way. The more effectively a business dashboard is designed, the better it supports these functions. Knowing how to design a business dashboard as effectively as possible is especially important for consultants and data analysts, as they have to design dashboards that communicate patterns in data and findings to the client and/or decision maker. As argued in the literature, one way to make data visualizations (and thus dashboards) more effective is to add storytelling design elements to them (Hullman & Diakopoulos, 2011; Segel & Heer, 2010). However, incorporating these elements takes additional (valuable) time for consultants and data analysts. Hence, it is important to quantitatively research the effectiveness of incorporating storytelling design elements into business dashboards.

Hullman, J., & Diakopoulos, N. (2011). Visualization Rhetoric: Framing Effects in Narrative Visualization. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2231–2240. https://doi.org/10.1109/tvcg.2011.255

Segel, E., & Heer, J. (2010). Narrative Visualization: Telling Stories with Data. IEEE Transactions on Visualization and Computer Graphics, 16(6), 1139–1148. https://doi.org/10.1109/tvcg.2010.179

Master thesis project & internship: Governance models for data science & analytics

Supervisor: Emiel Caron

An insurance company wants to be more successful with its data science projects. It wants to develop a governance framework that manages the life cycle of data science & analytics projects. The framework should provide a clear alignment between business objectives and data science projects. An important part of the framework is how to manage changes in data and data mining models in a transparent and accountable way.

  • Skills & knowledge required: Project management, business analytics, databases, data warehousing.
  • Start: January 2020.
  • Please contact me via the contact form for information.

Visited conference ICSoft 2019

Paper: Knowledge hubs in competence analytics

Abstract

There is a lack of consensus on the usefulness of Human Resource (HR) analytics to achieve better business results. The authors suggest this is due to a lack of empirical evidence demonstrating how the use of data in the HR field makes a positive impact on performance, to the detachment of the HR function from accessible data, and to the typically poor IT infrastructure in place. We provide an in-depth case study of Strategic Competence analytics, as an important part of HR analytics, in a large multinational company, labelled ABC, which potentially offers two important contributions. First, we contribute to the HR analytics literature by providing a data-driven competency model to improve the recruitment and selection process. This is used by the organization to search more effectively for talents in their knowledge networks. Second, we further develop a model for data-driven competence analytics, thus also contributing to the information systems literature, by developing specialized analytics for HR and by finding appropriate forms of computerized network analysis for identifying and analyzing knowledge hubs. Overall, our approach shows how internal and external data triangulation and better IT integration make a difference for the recruitment and selection process. We conclude by discussing our model’s implications for future research and practice.