| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | University of Colorado At Boulder |
| Country | United States |
| Start Date | May 15, 2021 |
| End Date | May 31, 2026 |
| Duration | 1,842 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2149404 |
Language technology has become an integral part of how we interact with the world of information, but sophisticated natural language processing (NLP) tools are available only for a handful of the approximately 7,000 languages spoken across the world. Modern data-driven methods for developing NLP tools generally rely on the availability of enormous amounts of data for the language in question, an obstacle that may be insurmountable for many languages, especially languages lacking significant digital resources and languages with small or diminishing numbers of speakers.
This project aims to remove barriers to building NLP tools for languages with less data by developing new methods that incorporate knowledge about the linguistic properties of languages into models learned from data. Learning how to build faster paths to NLP tools for new languages has the potential to rapidly advance the state of language technology for any language.
In addition, the tools and knowledge developed here have the potential to speed up the description of endangered languages, helping to secure an informed record of the world's languages while there are still speakers to learn from.
The imbalance in access to language technologies arises in part because current NLP models and algorithms need to learn from large amounts of training data. This project addresses that imbalance by adapting methods from cross-lingual transfer learning, in which models learned on one language are adapted and exploited to make predictions for another language.
One innovation of this project is to investigate the incorporation of expert linguistic knowledge for improving model transfer. Two types of linguistic knowledge will be injected into artificial neural network models for morphological analysis and part-of-speech tagging: a) knowledge about relationships between individual languages and language families; and b) knowledge about specific linguistic properties of individual languages and language families.
The models will be evaluated both intrinsically and extrinsically, the latter by studying the usefulness of the models for human linguistic analysis and as part of the language documentation and description workflow.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.