Completed STANDARD GRANT National Science Foundation (US)

Collaborative Research: NCS-FO: Studying language in the brain in the modern machine learning era

$5M USD

Funder	National Science Foundation (US)
Recipient Organization	Children's Hospital Corporation
Country	United States
Start Date	Sep 15, 2021
End Date	Aug 31, 2025
Duration	1,446 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2123818`

Grant Description

The project will investigate how the brain processes language, one of the most consequential questions we can ask. Language skills significantly affect lifetime income and social disparities. The loss of language or a halt in its development can be devastating.

At the same time, insights from the brain that could improve machines’ understanding of language open up new applications, from web search to voice assistants to, one day, robots that can help us in our daily lives. Using neuroscience to understand what happens during language use, what goes right and wrong, and which linguistic structures and theories the brain uses would be revolutionary.

To do this, neuroscientists use many of the same tools as those created for machine learning. Those machine learning tools have improved tremendously using large datasets, changing what machines are capable of; yet the neuroscience of language has been largely unable to reap these rewards. We will provide the data, new methods, and metrics required to enable neuroscience to scale up and take advantage of modern machine learning.

At the same time, scale in machine learning has democratized access to tools; scientific communities can investigate questions that pertain to them. Today, only a few groups have the resources to collect data and investigate questions around the neuroscience of language, leaving many communities in the dark. A large-scale central repository of data, tools, and benchmarks will democratize access to the study of language in the brain, one of the core aspects of what makes us human.

Our technical goal is to produce the largest dataset, by a factor of 1000, for investigating the neuroscience of language along with new types of models that exploit this data, and benchmarks which formalize linguistic questions to derive insights about the language network and the structure of language. Thus far, investigations in the neuroscience of language have only been able to provide small snapshots of the language network on different datasets, making it hard to build a coherent picture.

A single large-scale dataset with precise benchmarks that formally define what hypotheses in linguistics mean in terms of neural data will enable the community to ask many questions of the same data, allowing for a synthesis of the structure and operation of the language network. At the same time, large-scale data is known to be required to probe the understanding of artificial language models.

It is likely that if tens of thousands of sentences are required to probe an artificial language neural network and derive meaningful insight, the same scale of data will be required per subject to probe biological language neural networks. Formalizing questions around benchmarks on a common dataset has resulted in astronomical progress in many fields from parsing (Penn Treebank) to image recognition (ImageNet); we will apply this same methodology to the neuroscience of language.

This process is so efficient, in part, because it casts questions in a way that non-domain experts can access. Machine learning experts need not concern themselves with linguistic minutiae; they will be able to improve decoding of language from the brain, and thereby drive insights, by following existing protocols. By putting forward linguistic questions in a precisely defined manner, we will also enable cross-disciplinary collaboration: linguists will be able to propose benchmarks framed as questions about the performance of classifiers or the mapping between networks and neural activity.

These benchmarks will provide a common mathematical language by which different fields can express their key questions, in a way that has not been possible before because no dataset existed that could even support such work. We see a future where neuroscience, linguistics, natural language processing, and machine learning act as an integrated whole to ask the right questions about language in the brain, to develop new tools that support answering those questions, and to probe a large-scale resource that supports building a coherent picture of the language system.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Children's Hospital Corporation
