Completed STANDARD GRANT National Science Foundation (US)

Deep Learning Based Complex Spectral Mapping for Multi-Channel Speaker Separation and Speech Enhancement

$3.91M USD

Funder	National Science Foundation (US)
Recipient Organization	Ohio State University
Country	United States
Start Date	Aug 01, 2021
End Date	Jul 31, 2025
Duration	1,460 days
Number of Grantees	2
Roles	Former Principal Investigator; Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2125074`

Grant Description

Despite tremendous advances in deep learning based speech separation and automatic speech recognition, a major challenge remains how to separate concurrent speakers and recognize their speech in the presence of room reverberation and background noise. This project will develop a multi-channel complex spectral mapping approach to multi-talker speaker separation and speech enhancement in order to improve speech recognition performance in such conditions.

The proposed approach trains deep neural networks to predict the real and imaginary parts of individual talkers from the multi-channel input in the complex domain. After overlapped speakers are separated into simultaneous streams, sequential grouping will be performed for speaker diarization, which is the task of grouping the speech utterances of the same talker over intervals with the utterances of other speakers and pauses.

Proposed speaker diarization will integrate spatial and spectral speaker features, which will be extracted through multi-channel speaker localization and single-channel speaker embedding. Recurrent neural networks will be trained to perform classification for the purpose of speaker diarization, which can handle an arbitrary number of speakers in a meeting.

The proposed separation system will be evaluated using open, multi-channel speaker separation datasets that contain both room reverberation and background noise. The results from this project are expected to substantially elevate the performance of continuous speaker separation, as well as speaker diarization, in adverse acoustic environments, helping to close the performance gap between recognizing single-talker speech and recognizing multi-talker speech.

The overall goal of this project is to develop a deep learning system that can continuously separate individual speakers in a conversational or meeting setting and accurately recognize the utterances of these speakers. Building on recent advances on simultaneous grouping to separate and enhance overlapped speakers in a talker-independent fashion, the project is mainly focused on speaker diarization, which aims to group the speech utterances of the same speaker across time.

To achieve speaker diarization, deep learning based sequential grouping will be performed and it will integrate spatial and spectral speaker characteristics. Through sequential organization, simultaneous streams will be grouped with earlier-separated speaker streams to form sequential streams, each of which corresponds to all the utterances of the same speaker up to the current time.

Speaker localization and classification will be investigated to make sequential grouping capable of creating new sequential streams and handling an arbitrary number of speakers in a meeting scenario. With the added spatial dimension, the proposed diarization approach provides a solution to the question of who spoke when and where, significantly expanding the traditional scope of who spoke when.

The proposed separation system will be evaluated using multi-channel speaker separation datasets that contain highly overlapped speech in recorded conversations, as well as room reverberation and background noise present in real environments. The main evaluation metric will be word error rate in automatic speech recognition. The performance of speaker diarization will be measured using diarization error rate.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Ohio State University

Interested in applying for this grant?

Complete our application form to express your interest and we'll guide you through the process.

Apply for This Grant

Deep Learning Based Complex Spectral Mapping for Multi-Channel Speaker Separation and Speech Enhancement

Grant Description

All Grantees

Interested in applying for this grant?

Quick Summary

Related Grants