Completed STANDARD GRANT National Science Foundation (US)

Computational Methods for Speech Analysis

$2.49M USD

Funder	National Science Foundation (US)
Recipient Organization	Washington University
Country	United States
Start Date	Aug 01, 2021
End Date	Jul 31, 2024
Duration	1,095 days
Number of Grantees	2
Roles	Principal Investigator; Co-Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2120087`

Grant Description

This research project will develop tools for testing hypotheses about human communication. Researchers generally study human communication from textual transcripts which omit vocal tone. The project will directly address the disconnect between the data-generating process - in which speakers and listeners use the auditory channel to convey both textual and non-textual signals - and the widespread practice of discarding speech audio.

The investigators will extend their prior speech model, The Model of Audio and Speech Structure, to address some limitations of the model. In particular, the statistical extensions will accommodate multiple speakers and allow for the joint modeling of text and tone. To demonstrate the value of the statistical extensions, the model will be applied to two original video corpora - police body-worn camera footage and campaign speeches for federal office.

New software will be developed that makes it easy for researchers to quickly annotate a large amount of speech audio. The browser-based tools will enable automatic and manual segmentation, along with labeling. Multiple graduate students will gain experience in computationally intensive research and software development.

The tools to be developed will be incorporated into ongoing public-private collaborations to improve oversight of police officers in the field.

This research project will extend the Model of Audio and Speech Structure (MASS), which analyzes conversation as a nested stochastic process in which (i) the flow of conversation unfolds as a sequence of utterances transitioning between speakers and their vocal tones, based on contextual covariates; and (ii) the auditory signal within each utterance unfolds as a hidden Markov model that transitions between phonemes which generate sound. The model enables social scientists to test hypotheses about how conversations are structured by fixed covariates (e.g., speaker gender, conversation role) and time-varying covariates (e.g., exogenous external stimuli, endogenous conversation trajectory such as the previous speaker's tone).

In its current implementation, however, MASS has two key limitations: First, it uses resource-intensive human annotations of tone for each speaker, which limits application to contexts with many unique speakers, such as police body-worn camera footage. This project will develop extensions allowing the model to borrow strength by partial pooling across speakers with similar speech profiles.

Second, MASS incorporates text as externally given metadata. The project will develop a new approach for joint modeling of text and audio which will incorporate a dynamic topic model into the flow-of-conversation layer of MASS. The investigators will conduct two applications to demonstrate the value of the multi-speaker and joint text-audio modeling extensions.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Washington University

Interested in applying for this grant?

Complete our application form to express your interest and we'll guide you through the process.

Apply for This Grant

Computational Methods for Speech Analysis

Grant Description

All Grantees

Interested in applying for this grant?

Quick Summary

Related Grants