Whispered and alaryngeal speech conversion

Whispered utterances and alaryngeal speech, i.e., speech produced with a substitute voice after surgical removal of the larynx (laryngectomy), share similar characteristics. Primarily due to the absence of pitch, whispered and alaryngeal speech is perceived as less natural and less intelligible than regular laryngeal voice production.
While whispering is a common method of speech communication that is usually employed only for a limited period (e.g., in areas where loud noise is prohibited), surgical treatment necessitated by laryngeal cancer forces the affected individuals to use the substitute voice as their permanent means of speech communication. In everyday situations, the properties of alaryngeal speech can become obstructive, ultimately resulting in a lower quality of life.
Recently, deep learning methods have been successfully employed to recover prosodic information from whispered speech signals. These methods usually combine a vocoder for analysis and synthesis of the speech signal with deep neural networks for the prediction of speech features.
This work aims to develop more efficient systems by moving the transformation of whispered or alaryngeal inputs into voiced outputs directly into the vocoder, thereby removing the need for a separate feature predictor.
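The following sketch illustrates this idea under stated assumptions: a minimal vocoder-style generator in PyTorch (all class and parameter names are hypothetical, not the architecture of this work) that upsamples mel-spectrogram frames of whispered or alaryngeal speech directly to a voiced waveform, with no intermediate pitch predictor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectConversionVocoder(nn.Module):
    """Maps mel frames of whispered/alaryngeal speech straight to a waveform."""
    def __init__(self, n_mels=80, channels=256):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)
        ups, ch = [], channels
        for r in (8, 8, 2, 2):  # 8 * 8 * 2 * 2 = 256 samples per mel frame
            ups.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                          stride=r, padding=r // 2))
            ch //= 2
        self.ups = nn.ModuleList(ups)
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel):
        x = self.pre(mel)                 # (batch, channels, frames)
        for up in self.ups:
            x = up(F.leaky_relu(x, 0.1))  # upsample frames toward samples
        return torch.tanh(self.post(x))   # (batch, 1, frames * 256)

model = DirectConversionVocoder()
whispered_mel = torch.randn(1, 80, 100)  # 100 mel frames of whispered input
voiced_wave = model(whispered_mel)       # -> shape (1, 1, 25600)
```

Because the conversion happens inside the waveform generator itself, the pipeline no longer needs a separately trained network to predict voicing features before synthesis.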
The developed systems are evaluated based on their ability to reconstruct voiced speech and to create realistic pitch contours. The goal is to apply speech conversion techniques not only to whispered signals but also to recordings obtained from patients who have undergone laryngectomy.
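As an illustration, one common way to quantify how realistic a generated pitch contour is (an assumption for this sketch, not necessarily the evaluation protocol of this work) is to track F0 in the converted and reference signals and report the F0 RMSE over frames voiced in both, together with the voicing decision error:

```python
import numpy as np
import librosa

def pitch_metrics(converted, reference, sr=16000):
    # Track F0 and per-frame voicing with librosa's pYIN implementation.
    f0_c, v_c, _ = librosa.pyin(converted, fmin=60.0, fmax=400.0, sr=sr)
    f0_r, v_r, _ = librosa.pyin(reference, fmin=60.0, fmax=400.0, sr=sr)
    n = min(len(f0_c), len(f0_r))
    f0_c, v_c, f0_r, v_r = f0_c[:n], v_c[:n], f0_r[:n], v_r[:n]
    both = v_c & v_r                      # frames voiced in both signals
    f0_rmse = float(np.sqrt(np.mean((f0_c[both] - f0_r[both]) ** 2)))
    vde = float(np.mean(v_c != v_r))      # voicing decision error rate
    return f0_rmse, vde
```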
The methods applied in this work are generative models such as Generative Adversarial Networks (GANs). Since these systems are not specifically designed for the transformation of speech, adjustments to their architecture and training criteria are required.
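A minimal sketch of one such adjusted training criterion, assuming a least-squares GAN objective combined with an auxiliary mel-spectrogram reconstruction term that ties the generator output to the target voiced speech (the loss weight and tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake, fake_mel, real_mel, lambda_mel=45.0):
    # Least-squares adversarial term: push fake scores toward 1.
    adv = torch.mean((disc_fake - 1.0) ** 2)
    # Reconstruction term: keep the converted output close to voiced speech.
    recon = F.l1_loss(fake_mel, real_mel)
    return adv + lambda_mel * recon

def discriminator_loss(disc_real, disc_fake):
    # Push real scores toward 1 and fake scores toward 0.
    return torch.mean((disc_real - 1.0) ** 2) + torch.mean(disc_fake ** 2)
```

The reconstruction term is one way to anchor an otherwise general-purpose GAN to the speech conversion task, since the adversarial loss alone does not constrain the output to match a specific utterance.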
Publications
Wagner, D., Bayerl, S.P., Cordourier Maruri, H. and T. Bocklet (2022):
Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT).
Wagner, D., Baumann, I., Bayerl, S.P. and T. Bocklet (2022):
Nonwords Pronunciation Classification in Language Development Tests for Preschool Children. In Proc. Interspeech 2022 (pp. 3643–3647).
Wagner, D., Bayerl, S.P., Nöth, E. and K. Riedhammer (2022):
Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0. In Proc. Interspeech 2022 (pp. 2868–2872).
Wagner, D., Bayerl, S.P., Nöth, E., Bocklet, T. and K. Riedhammer (2022):
The Influence of Dataset Partitioning on Dysfluency Detection Systems. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, Speech, and Dialogue (pp. 423–436). Springer International Publishing.