Completed STUDENTSHIP UKRI Gateway to Research

Novel advances in unsupervised learning for mixed-type data

Funder	Engineering and Physical Sciences Research Council
Recipient Organization	Imperial College London
Country	United Kingdom
Start Date	Oct 01, 2021
End Date	Sep 30, 2025
Duration	1,460 days
Number of Grantees	1
Roles	Student
Data Source	UKRI Gateway to Research
Grant ID	`2602507`

Grant Description

Given the vast amount of data now available, there is strong demand in industry for efficient and flexible unsupervised learning algorithms. Unsupervised learning includes cluster analysis and outlier detection, among other tasks. A typical use case of the former comes from market research, where one of the main tasks is dividing a large group of a company's current or potential customers into smaller groups.

The ultimate goal of this process is to aggregate the subjects into segments or clusters, such that each segment consists of subjects which are likely to share the same needs or have common interests. This is achieved by identifying similarities among the subjects, which may be of demographic, geographic, psychographic or behavioural nature. However, the process of clustering can be very challenging and may lead to misleading conclusions being drawn when dealing with data sets that include outliers. Outliers are defined as data points consisting of unusual values which arouse suspicion regarding the mechanism that has been used to generate them.

Although a significant number of methods for cluster analysis and outlier detection exist in the literature, the majority of these cannot deal with a combination of continuous and categorical variables, also known as mixed-type data. Moreover, very few such methods are robust to the presence of outlying data points in a mixed-attribute domain.

This is potentially because outliers have a well-established definition for numerical data but not for nominal observations. This calls for a more general definition of categorical outliers ('categorical' referring to the fact that some variables may only take a fixed number of values, called 'categories'), so that we can better understand what it means to have outliers in a mixed data set.

Although a simple approach could involve detecting outliers in the numerical and the categorical data separately, this is rather naive; outliers might still exist based on the relationships among variables of different types. In fact, very few anomaly detection algorithms for mixed-type data exist in the literature, but these rely on the aforementioned simplistic approach and most of them lack a software implementation.
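To make the limitation of the separate-checks approach concrete, the following is a minimal sketch on made-up data, not the project's methodology. It assumes a toy data set with one categorical variable (occupation) and one continuous variable (income, in thousands), and it uses Tukey's fences and a rare-level frequency rule as stand-in marginal detectors. All names, thresholds and values are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Indices of values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def rare_levels(labels, min_share=0.05):
    """Indices of records whose category covers less than min_share of the data."""
    counts = Counter(labels)
    n = len(labels)
    return [i for i, lab in enumerate(labels) if counts[lab] / n < min_share]


def conditional_outliers(labels, values, k=1.5):
    """Apply Tukey's fences to the numeric variable within each category."""
    by_level = defaultdict(list)
    for i, lab in enumerate(labels):
        by_level[lab].append(i)
    flagged = []
    for idx in by_level.values():
        if len(idx) < 4:  # too few points to estimate quartiles sensibly
            continue
        local = tukey_outliers([values[i] for i in idx], k)
        flagged.extend(idx[j] for j in local)
    return sorted(flagged)


# Hypothetical toy data: (occupation, income in thousands).
student_income = [18, 20, 22, 19, 21, 20, 18, 22, 20, 21]
executive_income = [95, 100, 105, 98, 102, 100, 97, 103, 99, 101]
records = (
    [("student", x) for x in student_income]
    + [("executive", x) for x in executive_income]
    + [("student", 100)]  # common category, common income, unusual pairing
)
labels = [lab for lab, _ in records]
incomes = [inc for _, inc in records]

marginal_numeric = tukey_outliers(incomes)        # income checked on its own
marginal_categorical = rare_levels(labels)        # occupation checked on its own
joint = conditional_outliers(labels, incomes)     # income checked given occupation

print("marginal numeric:", marginal_numeric)
print("marginal categorical:", marginal_categorical)
print("conditional on category:", joint)
```

On this toy data the two marginal checks flag nothing, because "student" is a frequent category and an income of 100 is common overall, while the conditional check flags only the last record. Conditioning on a single categorical variable is itself a simplification; it does not scale to many interacting variables.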

Our project seeks to address these gaps in the literature by defining a notion of outlyingness for purely nominal variables and hence developing novel methodology for identifying data points that are anomalous in a mixed data set. This will involve using statistical tools to capture any dependencies or interactions between the variables in a mixed data set, in order to achieve a good understanding of the data set and, as a result, of which observations may be outlying.

Combining the results of such a method with clustering algorithms for mixed-type data could enhance the performance of existing non-robust methods. Ultimately, we aim to provide a novel framework under which practitioners from several industry sectors (such as automotive, education, insurance, retail or telecommunications, all of which use segmentation techniques in some form) can obtain results which are meaningful and easily interpretable to them, without being misled by anomalous observations. This project falls within the EPSRC Statistics and applied probability research area.

All Grantees

Imperial College London

