| Funder | Engineering and Physical Sciences Research Council |
|---|---|
| Recipient Organization | Imperial College London |
| Country | United Kingdom |
| Start Date | Oct 01, 2021 |
| End Date | Sep 30, 2025 |
| Duration | 1,460 days |
| Number of Grantees | 1 |
| Roles | Student |
| Data Source | UKRI Gateway to Research |
| Grant ID | 2602507 |
Given the vast amount of data now available, there is strong demand in industry for efficient and flexible unsupervised learning algorithms. Unsupervised algorithms include cluster analysis and outlier detection, among others. A typical use case of the former comes from market research, where one of the main tasks is dividing a large group of a company's current or potential customers into smaller groups.
The ultimate goal of this process is to aggregate the subjects into segments or clusters, such that each segment consists of subjects which are likely to share the same needs or have common interests. This is achieved by identifying similarities among the subjects, which may be of demographic, geographic, psychographic or behavioural nature. However, the process of clustering can be very challenging and may lead to misleading conclusions being drawn when dealing with data sets that include outliers. Outliers are defined as data points consisting of unusual values which arouse suspicion regarding the mechanism that has been used to generate them.
Although a significant number of methods for cluster analysis and outlier detection exist in the literature, the majority of these cannot deal with a combination of continuous and categorical variables, also known as mixed-type data. Moreover, very few such methods are robust to the presence of outlying data points in a mixed-attribute domain.
This is potentially because outliers have a well-established definition for numerical data but not for nominal observations. A more general definition of categorical outliers is therefore needed ('categorical' referring to the fact that some variables may only take a fixed number of values, called 'categories'), so that we can better understand what it means to have outliers in a mixed data set.
Although a simple approach could involve detecting outliers for the numerical and the categorical data individually, this is rather naive; outliers might still exist based on the relationships among variables of different types. In fact, very few anomaly detection algorithms for mixed-type data exist in the literature, and these rely on the aforementioned simplistic approach; most of them also lack a software implementation.
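The limitation of treating each variable type separately can be illustrated with a minimal sketch. The toy data, the variable names and the use of z-scores and category frequencies as outlyingness measures are all assumptions made for illustration; they are not the methodology developed in this project.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy data: (category, numeric value) pairs.
# Category "A" clusters near 10, category "B" clusters near 20.
data = [("A", 10.0 + 0.1 * i) for i in range(10)]
data += [("B", 20.0 + 0.1 * i) for i in range(10)]

# Suspicious point: a common category paired with a value typical of "B".
suspect = ("A", 20.5)
data.append(suspect)

# Numerical check in isolation: z-score against all numeric values.
values = [v for _, v in data]
marginal_z = (suspect[1] - mean(values)) / stdev(values)

# Categorical check in isolation: relative frequency of the category.
category_share = Counter(c for c, _ in data)[suspect[0]] / len(data)

# Joint check: z-score against the other points in the same category.
same_category = [v for c, v in data[:-1] if c == suspect[0]]
conditional_z = (suspect[1] - mean(same_category)) / stdev(same_category)

print(f"marginal z-score:    {marginal_z:.2f}")
print(f"category share:      {category_share:.2f}")
print(f"conditional z-score: {conditional_z:.2f}")
```

In this sketch the suspect point looks unremarkable on each variable alone (a moderate marginal z-score and a frequent category), yet it is extreme once the numeric value is considered conditionally on the category, which is the kind of cross-type relationship the separate-detection approach misses.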
Our project seeks to address these gaps in the literature by defining a notion of outlyingness for purely nominal variables and hence developing novel methodology for identifying data points that are anomalous in a mixed data set. This will involve using statistical tools to capture any dependencies or interactions between the variables that make up a mixed data set, in order to achieve a good understanding of the data set and, as a result, of which observations may be outlying.
Combining the results of such a method with clustering algorithms for mixed-type data could enhance the performance of existing non-robust methods. Ultimately, we aim to provide a novel framework under which practitioners from several industry sectors (such as automotive, education, insurance, retail or telecommunications, all of which use segmentation techniques in some form) can obtain results which are meaningful and easily interpretable to them, without being misled by anomalous observations. This project falls within the EPSRC Statistics and applied probability research area.