
BayWISS-Kolleg Gesundheit www.baywiss.de

Projects in the Verbundkolleg Gesundheit


Whispered and alaryngeal speech conversion

Whispered utterances and alaryngeal speech, i.e., the speech produced by a substitute voice after surgical larynx removal (laryngectomy), have similar characteristics. Primarily due to the absence of pitch, whispered and alaryngeal speech are perceived as less natural and less intelligible than regular laryngeal voice production.

While whispering is a common method of speech communication that is usually applied only for a limited period (e.g., in areas where loud noises are prohibited), surgical treatment necessitated by laryngeal cancer forces the affected individuals to use the substitute voice as their permanent method of speech communication. In everyday situations, the properties of alaryngeal speech can become obstructive, ultimately resulting in a lower quality of life.

Recently, deep learning methods have been successfully employed to recover prosodic information from whispered speech signals. These methods usually combine a vocoder for analysis and synthesis of the speech signal with deep neural networks for the prediction of speech features.
This work aims to develop more efficient systems by integrating the transformation of whispered/alaryngeal inputs into voiced outputs directly into the vocoder, thereby removing the need for a separate feature predictor.
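The architectural difference can be sketched in plain Python with placeholder computations (the function names and the toy arithmetic are purely illustrative, not the actual system):

```python
# Hypothetical stubs standing in for real signal-processing components.
def extract_features(whispered_audio):
    # e.g. spectral frames computed from the whispered input
    return [[abs(s)] for s in whispered_audio]

def predict_voiced_features(feats):
    # separate deep neural network mapping whispered to voiced features
    return [[f[0] * 2.0] for f in feats]

def vocoder_synthesize(feats):
    # vocoder turning (voiced) features back into a waveform
    return [f[0] for f in feats]

def two_stage(whispered_audio):
    # Conventional pipeline: feature predictor followed by a vocoder.
    return vocoder_synthesize(predict_voiced_features(extract_features(whispered_audio)))

def integrated_vocoder(whispered_audio):
    # Integrated system: the whispered-to-voiced transformation is
    # fused into the vocoder itself; no separate feature predictor.
    feats = extract_features(whispered_audio)
    return [f[0] * 2.0 for f in feats]
```

Both paths produce the same output here; the point of the sketch is only the structural difference, i.e., that the integrated system drops one trained component.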
The developed systems are evaluated based on their ability to reconstruct voiced speech and to create realistic pitch contours. The goal is to apply speech conversion techniques not only to whispered signals, but also to recordings obtained from patients who underwent laryngectomy.
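A typical way to score predicted pitch contours, sketched here as a minimal stdlib-only example (the metric choice and function names are our own and not necessarily those used in this project), is the F0 root-mean-square error on voiced frames together with the voiced/unvoiced decision error:

```python
import math

def f0_rmse(ref_f0, pred_f0):
    """RMSE of F0 in Hz, computed only on frames that are voiced
    (F0 > 0) in both the reference and the predicted contour."""
    voiced = [(r, p) for r, p in zip(ref_f0, pred_f0) if r > 0 and p > 0]
    if not voiced:
        return float("nan")
    return math.sqrt(sum((r - p) ** 2 for r, p in voiced) / len(voiced))

def voicing_error(ref_f0, pred_f0):
    """Fraction of frames whose voiced/unvoiced decision disagrees."""
    mismatches = sum((r > 0) != (p > 0) for r, p in zip(ref_f0, pred_f0))
    return mismatches / len(ref_f0)
```

For example, with `ref = [0, 110, 115, 120, 0]` and `pred = [0, 112, 118, 0, 0]`, only the two frames voiced in both contours enter the RMSE, and the frame dropped to unvoiced counts toward the voicing error.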
The methods applied in this work are generative models such as Generative Adversarial Networks (GANs). Since these systems are not specifically designed for the transformation of speech, adjustments to their architecture and training criteria need to be made.
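One common adjustment to the training criterion in GAN-based speech synthesis (used, for example, by GAN vocoders such as MelGAN and HiFi-GAN) is replacing the standard binary cross-entropy objective with a least-squares adversarial loss. A minimal sketch over per-sample discriminator scores (framework-free, for illustration only):

```python
def lsgan_d_loss(d_real, d_fake):
    # Least-squares discriminator loss: push scores on real audio
    # toward 1 and scores on generated audio toward 0.
    n_real, n_fake = len(d_real), len(d_fake)
    return (sum((x - 1.0) ** 2 for x in d_real) / n_real
            + sum(x ** 2 for x in d_fake) / n_fake)

def lsgan_g_loss(d_fake):
    # Least-squares generator loss: push discriminator scores on
    # generated audio toward 1 (i.e., toward "real").
    return sum((x - 1.0) ** 2 for x in d_fake) / len(d_fake)
```

A perfectly fooled discriminator (scores of 1.0 on generated audio) yields a generator loss of zero; in practice these terms are combined with reconstruction losses on spectral features.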

MEMBER OF THE KOLLEG


Verbundkolleg Gesundheit

Publications

Wagner, D., Baumann, I., and T. Bocklet (2024):
Generative Adversarial Networks for Whispered to Voiced Speech Conversion: A Comparative Study. International Journal of Speech Technology.

Seeberger, P., Wagner, D., and K. Riedhammer (2024):
MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 6539–6548). Association for Computational Linguistics.

Wagner, D., Lee, S., Baumann, I., Seeberger, P., Riedhammer, K., and T. Bocklet (2024):
Optimized Speculative Sampling for GPU Hardware Accelerators. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 6442–6458). Association for Computational Linguistics.

Baumann, I., Unger, N., Wagner, D., Riedhammer, K., and T. Bocklet (2024): Automatic Evaluation of a Sentence Memory Test for Preschool Children. In Proc. INTERSPEECH 2024 (pp. 5158–5162).

Wagner, D., Baumann, I., Riedhammer, K., and T. Bocklet (2024):
Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models. In Proc. INTERSPEECH 2024 (pp. 4623–4627).

Baumann, I., Wagner, D., Schuster, M., Riedhammer, K., Nöth, E., and T. Bocklet (2024):
Towards Self-Attention Understanding for Automatic Articulatory Processes Analysis in Cleft Lip and Palate Speech. In Proc. INTERSPEECH 2024 (pp. 2430–2434).

Wagner, D., Bayerl, S. P., Baumann, I., Nöth, E., Riedhammer, K., and T. Bocklet (2024):
Large Language Models for Dysfluency Detection in Stuttered Speech. In Proc. INTERSPEECH 2024 (pp. 5118–5122).

Baumann, I., Wagner, D., Schuster, M., Nöth, E., and T. Bocklet (2024):
Towards Interpretability of Automatic Phoneme Analysis in Cleft Lip and Palate Speech. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 12602–12606).

Wagner, D., Churchill, A., Sigtia, S., Georgiou, P., Mirsamadi, M., Mishra, A., and E. Marchi (2024):
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 10451–10455).

Bayerl, S. P., Wagner, D., Baumann, I., Hönig, F., Bocklet, T., Nöth, E., and K. Riedhammer (2023):
A Stutter Seldom Comes Alone – Cross-Corpus Stuttering Detection as a Multi-label Problem. In Proc. INTERSPEECH 2023 (pp. 1538–1542).

Bayerl, S. P., Wagner, D., Baumann, I., Bocklet, T., and K. Riedhammer (2023):
Detecting Vocal Fatigue with Neural Embeddings. Journal of Voice.

Wagner, D., Baumann, I., Braun, F., Bayerl, S. P., Nöth, E., Riedhammer, K., and T. Bocklet (2023):
Multi-class Detection of Pathological Speech with Latent Features: How does it perform on unseen data? In Proc. INTERSPEECH 2023 (pp. 2318–2322).

Baumann, I., Wagner, D., Braun, F., Bayerl, S. P., Nöth, E., Riedhammer, K., and T. Bocklet (2023):
Influence of Utterance and Speaker Characteristics on the Classification of Children with Cleft Lip and Palate. In Proc. INTERSPEECH 2023 (pp. 4648–4652).

Wagner, D., Bayerl, S. P., and T. Bocklet (2023):
Implementing Easy-to-Use Recipes for the Switchboard Benchmark. In C. Draxler (Ed.), Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2023 (pp. 150–157). TUDpress, Dresden.

Wagner, D., Churchill, A., Sigtia, S., Georgiou, P., Mirsamadi, M., Mishra, A., and E. Marchi (2023):
Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models. Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III) at NeurIPS 2023.

Riedhammer, K., Baumann, I., Bayerl, S. P., Bocklet, T., Braun, F., and D. Wagner (2023):
Medical Speech Processing for Diagnosis and Monitoring: Clinical Use Cases. Fortschritte der Akustik - DAGA 2023.

Wagner, D., Baumann, I., Bayerl, S. P., Riedhammer, K., and T. Bocklet (2023):
Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Wagner, D., Bayerl, S. P., Maruri, H. C., and T. Bocklet (2022):
Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT).

Baumann, I., Wagner, D., Bayerl, S. P., and T. Bocklet (2022):
Nonwords Pronunciation Classification in Language Development Tests for Preschool Children. In Proc. Interspeech 2022 (pp. 3643–3647).

Bayerl, S. P., Wagner, D., Nöth, E., and K. Riedhammer (2022):
Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0. In Proc. Interspeech 2022 (pp. 2868–2872).

Bayerl, S. P., Wagner, D., Nöth, E., Bocklet, T., and K. Riedhammer (2022):
The Influence of Dataset Partitioning on Dysfluency Detection Systems. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, Speech, and Dialogue (pp. 423–436). Springer International Publishing.

Wagner, D. (2019): Latent Representations of Transaction Network Graphs in Continuous Vector Spaces as Features for Money Laundering Detection. In SKILL 2019 - Studierendenkonferenz Informatik (pp. 143–154). Gesellschaft für Informatik e.V.

Dominik Wagner


Technische Hochschule Nürnberg Georg Simon Ohm

Coordination

Get in touch with us. We look forward to your questions and suggestions about the Verbundkolleg Gesundheit.

Dr. des. Christina Schmidt


Coordinator of the BayWISS-Verbundkolleg Gesundheit and the BayWISS-Verbundkolleg Economics and Business

Universität Regensburg
Zentrum zur Förderung des wissenschaftlichen Nachwuchses
Universitätsstraße 31
93053 Regensburg

Phone: +49 941 9435548
gesundheit.vk [ at ] baywiss.de