Completed CONTINUING GRANT National Science Foundation (US)

Robust and Efficient Statistical Inference in Large Scale Semi-Supervised Settings

$1.7M USD

Funder	National Science Foundation (US)
Recipient Organization	Texas A&M University
Country	United States
Start Date	Aug 01, 2021
End Date	Jul 31, 2025
Duration	1,460 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2113768`

Grant Description

This project will develop methods for robust statistical inference in semi-supervised settings. Unlike more traditional data settings, semi-supervised settings are characterized by two types of available data: 1) a small or moderately sized labeled (or supervised) dataset containing observations for a response (or outcome) and a set of covariates (or predictors), and 2) a much larger unlabeled (or unsupervised) dataset with observations only for the covariates.

Such settings arise naturally whenever the covariates are easily available for a large cohort, while the response may be difficult and/or expensive to obtain due to practical constraints. They are increasingly relevant in modern studies in the big data era, where large unlabeled databases (often electronically recorded) are becoming easily available (and tractable) alongside a labeled dataset.

Examples are ubiquitous across many disciplines, including computer science, machine learning, econometrics, and biomedical applications like electronic health records and integrative genomics. Statistical inference in semi-supervised settings is therefore of substantial interest. The central question is when and how one can use the extra information available from the large unlabeled data to “improve” upon a corresponding supervised approach, where improvement could be in terms of efficiency, robustness, or both.

This project aims to provide answers to such questions by developing a class of novel, provable and scalable semi-supervised inference methods for a range of fundamental problems in two fairly distinct and active research areas: 1) causal inference in semi-supervised settings, and 2) semi-supervised inference in the presence of selection bias in labeling. The research outlined in the project will lead to advances in bridging some major gaps in the existing literature and providing a much-needed unified understanding of semi-supervised inference and its subtleties.

The methods will also have wide applicability to various domain areas, e.g. biomedical studies for precision medicine and causal inference. The project also has a significant education component, including mentoring of graduate students and curriculum development via short courses to raise awareness about these exciting new areas in modern statistics.

In the first part of the project, the PI will consider causal inference in semi-supervised settings under the potential outcome framework, and explore semi-supervised inference for popular causal parameters, e.g. the average treatment effect and the quantile treatment effect, both of which have been widely studied in supervised settings but rarely so under semi-supervised settings. The PI will aim to develop semi-supervised methods for so-called doubly robust estimation of such parameters that can lead to improved (if not optimal) efficiency, as well as much stronger robustness properties than their best achievable supervised counterparts.
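For reference, the doubly robust estimation mentioned above can be illustrated with the standard supervised augmented inverse probability weighting (AIPW) estimator of the average treatment effect. This is only a minimal sketch and not the project's semi-supervised method: the function name `aipw_ate` and the nuisance callables `propensity`, `outcome_treated`, and `outcome_control` are hypothetical, and in practice they would be fitted from data rather than supplied by hand.

```python
from typing import Callable, Sequence


def aipw_ate(
    x: Sequence[float],
    t: Sequence[int],
    y: Sequence[float],
    propensity: Callable[[float], float],       # assumed model for P(T=1 | X=x)
    outcome_treated: Callable[[float], float],  # assumed model for E[Y | X=x, T=1]
    outcome_control: Callable[[float], float],  # assumed model for E[Y | X=x, T=0]
) -> float:
    """Standard (supervised) AIPW estimate of the average treatment effect."""
    total = 0.0
    for xi, ti, yi in zip(x, t, y):
        e = propensity(xi)
        if not 0.0 < e < 1.0:
            # The estimator needs positivity/overlap: 0 < e(x) < 1.
            raise ValueError("propensity must lie strictly between 0 and 1")
        m1 = outcome_treated(xi)
        m0 = outcome_control(xi)
        # Outcome-model contrast plus inverse-probability-weighted residuals.
        total += (m1 - m0) + ti * (yi - m1) / e - (1 - ti) * (yi - m0) / (1.0 - e)
    return total / len(x)


# Toy data in which the true treatment effect is 2 for every unit.
x = [0.0, 1.0, 2.0, 3.0]
t = [1, 0, 1, 0]
y = [2.0, 1.0, 4.0, 3.0]

# Correct outcome models, deliberately misspecified propensity (0.9).
print(aipw_ate(x, t, y, lambda v: 0.9, lambda v: v + 2.0, lambda v: v))  # 2.0
```

In this toy example the outcome models are exact, so the residual terms vanish and the estimate is unaffected by the misspecified propensity. That is one half of the double robustness property: the estimator remains consistent if either the propensity model or the outcome models are correctly specified.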

The second part of the project will consider semi-supervised inference where the labeling mechanism has inherent selection bias, so that the labeled and unlabeled data follow different distributions. Such settings, while of great practical relevance, have rarely been addressed so far, partly because their analysis is quite challenging: the labeling fraction decays to zero, leading to a natural violation of the so-called positivity/overlap assumption.

Under this setting, the PI will explore efficient and rate-optimal semi-supervised inference for various parameters, e.g. the mean response and the average treatment effect (under a causal framework), via doubly robust estimation methods, as well as modeling strategies for estimating the decaying propensity score which arises as an inevitable challenge and is of independent interest. Throughout, the PI's emphasis will be on developing methods with rigorous theoretical guarantees as well as efficient implementation that meets the scalability demanded by the intended applications on large modern datasets.
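To make the role of the labeling propensity concrete, the sketch below shows a textbook doubly robust estimator of the mean response when only some units are labeled. It averages an outcome model over all covariates (labeled and unlabeled) and adds an inverse-propensity-weighted residual correction on the labeled units. It is an illustration under stated assumptions, not the estimator proposed in the project: `dr_mean`, `labeling_propensity`, and `outcome_model` are hypothetical names, and unlabeled responses are encoded as `None`.

```python
from typing import Callable, Optional, Sequence


def dr_mean(
    x: Sequence[float],
    y: Sequence[Optional[float]],                  # None marks an unlabeled unit
    labeling_propensity: Callable[[float], float], # assumed model for P(labeled | X=x)
    outcome_model: Callable[[float], float],       # assumed model for E[Y | X=x]
) -> float:
    """Doubly robust estimate of E[Y] from labeled and unlabeled data."""
    total = 0.0
    for xi, yi in zip(x, y):
        m = outcome_model(xi)
        total += m  # imputation term, uses every unit's covariates
        if yi is not None:
            pi = labeling_propensity(xi)
            if not 0.0 < pi <= 1.0:
                raise ValueError("labeling propensity must lie in (0, 1]")
            # Residual correction, available only where a label exists.
            total += (yi - m) / pi
    return total / len(x)


# Six units, two of them labeled; the true regression is E[Y | X=x] = 2x + 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, None, None, 7.0, None, None]

print(dr_mean(x, y, lambda v: 1.0 / 3.0, lambda v: 2.0 * v + 1.0))  # 6.0
```

When the labeling propensity is very small, the weights `1 / pi` become large, which is why a decaying labeling fraction makes both the estimation of the propensity and the resulting inference delicate.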

The proposed methods will also combine tools and ideas from classical semi-parametric inference and modern high-dimensional statistical theory.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Texas A&M University
