| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | University of California-Santa Barbara |
| Country | United States |
| Start Date | Jul 01, 2021 |
| End Date | Aug 31, 2021 |
| Duration | 61 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2104995 |
Many successful modern computational language systems rely on deep neural networks. Whereas older techniques relied on structured linguistic representations, the inner workings of neural models can be opaque even to the engineers and scientists who create them. Therefore, a current major challenge in Natural Language Processing, as in other areas of Artificial Intelligence, is to develop methods that allow us to understand the internal representations of opaque neural models.
Natural Language Processing is uniquely poised to contribute to this endeavor for two reasons. First, the field of linguistics has long sought to develop tools for characterizing the kinds of representations necessary for processing human language, and so there is a rich body of prior work to draw on. Second, the breadth and variation of world languages give us a natural way of studying models under different, but equally valid, parameterizations.
Just as studying how humans can process diverse languages gives insight into human language processing and human cognition, understanding how multilingual computational systems process different languages can give insights into computational models.
A class of deep neural models, known as transformers, has been particularly successful at natural language tasks. Some of these models are massively multilingual, trained on large numbers of languages at once. Interestingly, these multilingual models seem to acquire both language-specific and language-general knowledge.
Taking advantage of linguistic techniques and variation among the world's languages, this Computer and Information Science and Engineering Research Initiation Initiative (CRII) project undertakes a series of computational experiments. The experiments train small classifiers on the pre-trained embedding space of massively multilingual models (e.g., Multilingual BERT and XLM-RoBERTa) and use the classifier output to characterize how these models represent crucial grammatical aspects of language (e.g., grammatical subject) across languages with different morphosyntactic systems. Moreover, to develop more robust ways of studying grammatical roles in these models, the project uses computational techniques to build and publicly release more richly annotated multilingual corpora.
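The idea of training a small classifier on a frozen embedding space can be illustrated with a minimal sketch. This is not the project's code: the vectors below are synthetic stand-ins for embeddings that would in practice come from a model such as Multilingual BERT or XLM-RoBERTa, the function names are hypothetical, and the shuffled-label control is an added assumption that is common in probing work but is not described in the abstract.

```python
import math
import random


def make_synthetic_embeddings(n, dim, rng, signal=2.0):
    """Return (vectors, labels) standing in for frozen model embeddings.

    Label 1 means "subject", 0 means "non-subject". By assumption, the label
    is encoded along dimension 0; real embeddings would come from a model.
    """
    vectors, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal
        vectors.append(vec)
        labels.append(label)
    return vectors, labels


def train_probe(vectors, labels, lr=0.1, epochs=100):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(vectors[0])
    n = len(vectors)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for vec, y in zip(vectors, labels):
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for i in range(dim):
                grad_w[i] += err * vec[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, vectors, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for vec, y in zip(vectors, labels):
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        correct += int((1 if z > 0 else 0) == y)
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_synthetic_embeddings(300, 8, rng)
test_x, test_y = make_synthetic_embeddings(150, 8, rng)

# Probe trained on the real labels.
w, b = train_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)

# Control (an added assumption): same probe, labels shuffled during training.
shuffled_y = train_y[:]
rng.shuffle(shuffled_y)
w_c, b_c = train_probe(train_x, shuffled_y)
control_acc = accuracy(w_c, b_c, test_x, test_y)

print(f"probe accuracy:   {probe_acc:.2f}")
print(f"control accuracy: {control_acc:.2f}")
```

If the probe scores well above the shuffled-label control, that suggests the information is linearly recoverable from the embeddings rather than memorized by the classifier. Running the same probe on embeddings from languages with different morphosyntactic systems would then allow the kind of cross-lingual comparison the project describes.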
The experimental results and public corpora both contribute to our understanding of computational language models and diversify the set of languages that can be studied using these techniques.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.