Completed | Standard Grant | National Science Foundation (US)

CRII: RI: Using Linguistic Variation to Understand Deep Neural Models of Language

$1.75M USD

Funder	National Science Foundation (US)
Recipient Organization	University of California-Santa Barbara
Country	United States
Start Date	Jul 01, 2021
End Date	Aug 31, 2021
Duration	61 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2104995`

Grant Description

Many successful modern computational language systems rely on deep neural networks. Whereas older techniques relied on structured linguistic representations, the inner workings of neural models can be opaque even to the engineers and scientists who create them. Therefore, a current major challenge in Natural Language Processing, as in other areas of Artificial Intelligence, is to develop methods that allow us to understand the internal representations of opaque neural models.

Natural Language Processing is uniquely poised to contribute to this endeavor for two reasons. First, the field of linguistics has long sought to develop tools for characterizing the kinds of representations necessary for processing human language, and so there is a rich body of prior work to draw on. Second, the breadth and variation of world languages give us a natural way of studying models under different, but equally valid, parameterizations.

Just as studying how humans can process diverse languages gives insight into human language processing and human cognition, understanding how multilingual computational systems process different languages can give insights into computational models.

A class of deep neural models, known as transformers, has been particularly successful at natural language tasks. Some of these models are massively multilingual, trained on large numbers of languages at once. Interestingly, these multilingual models seem to acquire both language-specific and language-general knowledge.

Taking advantage of linguistic techniques and variation among world languages, this Computer and Information Science and Engineering (CISE) Research Initiation Initiative (CRII) project undertakes a series of computational experiments. These experiments train small classifiers on the pre-trained embedding space of massively multilingual models (e.g., Multilingual BERT and XLM-RoBERTa) and use the classifier output to characterize how these models represent crucial grammatical aspects of language (e.g., grammatical subject) across languages with different morphosyntactic systems. Moreover, to develop more robust ways of studying grammatical roles in these models, the project uses computational techniques to build and publicly release more richly annotated multilingual corpora.
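The idea of training a small classifier on frozen embeddings can be illustrated with a minimal sketch. This is not the project's code: the vectors below are synthetic stand-ins for model embeddings, the "subject" signal is planted by hand in one dimension, and every name (`synthetic_embeddings`, `train_probe`, `DIM`) is hypothetical. The sketch only shows the general shape of a linear probe: the embeddings are never updated, and only the probe's weights are learned.

```python
import math
import random

DIM = 8  # toy embedding size; real mBERT / XLM-RoBERTa vectors are much larger


def synthetic_embeddings(n, seed):
    """Hypothetical stand-in for frozen token embeddings.

    Label 1 = grammatical subject, 0 = not a subject. The label is planted
    in dimension 0 so that a linear probe has something to find.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        vec[0] += 2.0 if label == 1 else -2.0
        examples.append((vec, label))
    return examples


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(weights, bias, vec):
    """Linear score of one embedding under the probe."""
    return sum(w * x for w, x in zip(weights, vec)) + bias


def train_probe(examples, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain SGD; embeddings stay fixed."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for vec, label in examples:
            error = sigmoid(score(weights, bias, vec)) - label
            weights = [w - lr * error * x for w, x in zip(weights, vec)]
            bias -= lr * error
    return weights, bias


def accuracy(examples, weights, bias):
    """Fraction of examples whose predicted label matches the gold label."""
    hits = sum(
        (score(weights, bias, vec) > 0) == (label == 1)
        for vec, label in examples
    )
    return hits / len(examples)


train_set = synthetic_embeddings(400, seed=0)
held_out = synthetic_embeddings(200, seed=1)
weights, bias = train_probe(train_set)
held_out_acc = accuracy(held_out, weights, bias)
print(f"held-out probe accuracy: {held_out_acc:.2f}")
```

In a real experiment, the synthetic vectors would be replaced by embeddings extracted from a pre-trained multilingual model, and the held-out set could come from a different language than the training set. High probe accuracy on the held-out data would then suggest that the grammatical property is linearly recoverable from the embedding space.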

The experimental results and public corpora both contribute to our understanding of computational language models and diversify the set of languages that can be studied using these techniques.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of California-Santa Barbara
