Completed STANDARD GRANT National Science Foundation (US)

RR: CompCog: A challenge suite for statistical word segmentation

$2.57M USD

Funder	National Science Foundation (US)
Recipient Organization	MGH Institute of Health Professions
Country	United States
Start Date	Oct 01, 2024
End Date	Aug 31, 2025
Duration	334 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2435735`

Grant Description

A central scientific puzzle is how children manage to acquire language despite limited and inconsistent explicit feedback. Numerous mathematical results seem to suggest that acquiring a language should be impossible; the fact that children do it every day reveals a deep gap in the science of learning. Some research suggests that children make considerable headway by detecting patterns in what they hear, without any explicit teaching and without knowing what is being talked about ("statistical" or "unsupervised" learning).

Indeed, much of the recent progress in "teaching" computers to understand language has made use of just this strategy. Even more compelling: numerous experiments have shown that both adults and infants are able to learn at least a little bit about language this way. How much they can learn remains unclear.

A central difficulty is that mathematically, there are many different methods for pattern-detection and it is unclear which one(s) humans use. This is important because some work better than others; and whether unsupervised pattern-detection can help solve the mystery of language learning depends on which method is used. The purpose of this project is to put together a "challenge suite": a dataset that can be used to systematically evaluate and compare the possibilities.
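As one illustration of what a pattern-detection method can look like, the sketch below segments a syllable stream using forward transitional probabilities, a well-known approach in the statistical learning literature. It is not presented as the method this project adopts or favors; the function names, the toy lexicon, and the fixed threshold are all assumptions made for the example.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate forward TP(a -> b) = count(a followed by b) / count(a as a predecessor)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment(syllables, threshold):
    """Place a word boundary wherever the TP between adjacent syllables is below threshold.

    Assumes a non-empty syllable list. The threshold is a hypothetical free parameter.
    """
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Toy stream built from three invented trisyllabic "words" (illustrative only).
lexicon = {"A": ["tu", "pi", "ro"], "B": ["go", "la", "bu"], "C": ["bi", "da", "ku"]}
order = "ABCACBABCBACCAB"
stream = [syllable for word in order for syllable in lexicon[word]]

segmented = segment(stream, threshold=0.9)
print(segmented[:3])
```

In this toy stream every within-word transition is perfectly predictable while every across-word transition is not, so a simple threshold recovers the words. Real speech is far less clean, which is part of why different methods can yield different predictions.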

Such challenge suites have been instrumental in advancing artificial intelligence. This project also serves as a proof of concept to determine whether challenge suites are similarly beneficial for the science of learning, while providing valuable resources and training to the research community.

To develop the challenge suite, the investigators will first conduct a comprehensive, quantitative literature review (meta-analysis) focusing on the largest body of work on unsupervised pattern-detection: adult statistical word segmentation. With the aid of outside experimenters, the investigators will use the meta-analysis to identify 10-15 key experiments. As a group, these experiments will establish a basic set of facts about adult statistical word segmentation that any theory must account for.

For these reasons, the project will focus particularly on theoretically central phenomena that distinguish different theories. To measure different aspects of linguistic pattern-detection, each experiment will involve a large number of subjects (approx. 1,200 each), and a subset of 3-5 experiments will involve an even larger number (approx. 24,000 each). A tool will be developed to enable researchers to compare any mathematical theory of learning against these data, determining how well it matches human performance.
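The abstract does not say how the comparison tool will score a theory against human data. As a rough sketch of one possibility, the snippet below computes two common fit measures, root-mean-square error and Pearson correlation, between a model's predicted accuracy and human accuracy across experimental conditions. The function names and all numbers are hypothetical placeholders, not project results.

```python
import math

def rmse(model, human):
    """Root-mean-square error between model predictions and human scores."""
    return math.sqrt(sum((m - h) ** 2 for m, h in zip(model, human)) / len(human))

def pearson_r(xs, ys):
    """Pearson correlation; assumes neither list is constant (nonzero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Placeholder per-condition accuracies (proportion correct); NOT real data.
human = [0.50, 0.60, 0.70, 0.80]
model = [0.55, 0.65, 0.75, 0.85]

fit = {"rmse": rmse(model, human), "r": pearson_r(model, human)}
print(fit)
```

The two measures answer different questions: correlation captures whether a model orders the conditions the way people do, while RMSE also penalizes a model that is consistently too high or too low.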

To determine how much each mathematical theory could learn about language, a database of transcripts of child-directed speech in 3-5 languages will be developed. Each theory will be trained and tested on the database to see how much it could learn about those languages. The challenge suite will be made available to all researchers as a download and also through a website where researchers can submit their models and compare results against those of other models.

This work will be publicized to the scientific community through a closing workshop focused on models of unsupervised word segmentation.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

MGH Institute of Health Professions
