HCAIM Webinar: Silent Speech Interfaces & HCI / ML aspects

April 27, 2022

Webinar

On Thursday, April 28, 2022, at 15:00 CET, we will host a live session with an academic partner from Hungary, Dr Tamás Gábor Csapó.

Dr Csapó obtained his PhD in computer science, speech technology and machine learning from the Budapest University of Technology and Economics (BME), Hungary, in 2014. He was a Fulbright scholar at Indiana University, USA, in 2014, where he began working with ultrasound imaging of the tongue. In 2016, he joined the MTA-ELTE Lingual Articulation Research Group, focusing on Hungarian articulation during speech production. Since 2017, he has had two national research projects on ultrasound-based articulatory-to-acoustic mapping and articulatory-to-acoustic inversion, both applying deep learning methods. He regularly cooperates with international researchers and has co-authors from the USA, Canada, Colombia, China, and several EU countries. His research interests include Silent Speech Interfaces, speech analysis and synthesis, ultrasound-based tongue movement analysis, and deep learning methods applied to speech technologies. He is currently a research fellow at BME.

Silent Speech Interfaces (SSI) are a revolutionary field of speech technology. The main idea is to record articulatory movement and automatically generate speech from that movement information while the speaker produces no sound. This research area, also known as articulatory-to-acoustic mapping (AAM), has a large potential impact in a number of domains. It might be highly useful for the speaking impaired (e.g., after laryngectomy) and for scenarios where regular speech is not feasible but information must still be transmitted from the speaker (e.g., extremely noisy environments or military applications). Voice assistants have become popular lately, but they are still not in every home. One of the reasons is privacy: some people do not feel comfortable speaking aloud with others around, and SSI equipment can be a solution for that.

There are two distinct types of SSI solution, namely ‘direct synthesis’ and ‘recognition-and-synthesis’. In the first case, the speech signal is generated directly from the articulatory data, without an intermediate step. In the second case, silent speech recognition (SSR) is applied to the biosignal to extract the content spoken by the person (i.e., the result of this step is text); this step is then followed by text-to-speech (TTS) synthesis. In the SSR+TTS approach, any information related to speech prosody (intonation and durations) is lost, whereas it may be kept with direct synthesis. In addition, the smaller delay of the direct synthesis approach might enable conversational use; therefore, we are following this approach in our project.
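The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration, not the project's implementation: the function names, the `Frame` and `Acoustic` types, and the three stand-in models are all hypothetical, and a real system would use trained neural networks and a vocoder instead.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # one articulatory frame, e.g. features of one ultrasound image
Acoustic = List[float]    # one acoustic feature vector, e.g. the input to a vocoder


def direct_synthesis(
    frames: Sequence[Frame],
    frame_to_acoustic: Callable[[Frame], Acoustic],
) -> List[Acoustic]:
    """Direct synthesis: every articulatory frame is mapped straight to acoustics."""
    return [frame_to_acoustic(frame) for frame in frames]


def recognition_and_synthesis(
    frames: Sequence[Frame],
    recognise: Callable[[Sequence[Frame]], str],
    synthesise: Callable[[str], List[Acoustic]],
) -> List[Acoustic]:
    """SSR+TTS: the utterance is first reduced to text, then re-synthesised."""
    text = recognise(frames)   # timing and intonation of the input are discarded here
    return synthesise(text)    # TTS chooses its own durations


# Hypothetical stand-ins for trained models (assumptions for illustration only).
def toy_frame_to_acoustic(frame: Frame) -> Acoustic:
    return [sum(frame) / len(frame)]


def toy_recognise(frames: Sequence[Frame]) -> str:
    return "hello"


def toy_synthesise(text: str) -> List[Acoustic]:
    return [[0.0] for _ in text]   # one output frame per character


frames = [[0.1, 0.3], [0.2, 0.4], [0.5, 0.7]]
direct_out = direct_synthesis(frames, toy_frame_to_acoustic)
two_step_out = recognition_and_synthesis(frames, toy_recognise, toy_synthesise)

print(len(frames), len(direct_out), len(two_step_out))  # prints: 3 3 5
```

In the direct path the output has one acoustic frame per input frame, so the speaker's own timing carries through. In the two-step path the output length depends only on the recognised text, which is one way to see why duration information is lost.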

To fulfil the above goals, we formed a multidisciplinary team of senior researchers with expertise in speech synthesis, recognition, deep learning, and articulatory data acquisition. As biosignals, we used 2D ultrasound, lip video and magnetic resonance imaging to image the motion of the speaking organs. In our experiments, we used standard deep learning approaches (convolutional and recurrent neural networks, autoencoders) and high-potential novel machine learning methods (adversarial training, neural vocoders and cross-speaker experiments). When designing ML/DL approaches, it is not enough to test the system with objective measures (e.g., validation loss); it is also important to keep the human aspects in mind. Therefore, after each deep learning experiment, we evaluated the resulting synthesized speech samples in subjective listening tests with potential users. An SSI system that can convert the silent articulation of any person to fully natural audible speech is not yet available, but we have made significant progress towards practical prototypes.
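As a rough illustration of how listening-test ratings might be summarised alongside an objective measure, the sketch below averages listener scores per system and attaches an approximate 95% confidence interval. The rating scale, the `mean_with_ci` helper, the system names and all numbers are assumptions made up for this example; they are not the project's test design or results.

```python
import math
import statistics
from typing import Dict, List, Tuple


def mean_with_ci(scores: List[float], z: float = 1.96) -> Tuple[float, float]:
    """Return the mean score and the half-width of a normal-approximation CI."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, 0.0   # no spread can be estimated from a single rating
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Placeholder ratings on an assumed 1-5 naturalness scale (not real data).
ratings: Dict[str, List[float]] = {
    "hypothetical_system_a": [3, 4, 3, 4, 3, 4],
    "hypothetical_system_b": [2, 3, 2, 2, 3, 2],
}

for system, scores in ratings.items():
    mean, half_width = mean_with_ci(scores)
    print(f"{system}: {mean:.2f} +/- {half_width:.2f}")
```

With small listener panels, a t-distribution interval or a paired significance test would usually be more appropriate than the fixed `z = 1.96` used here.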

So far, numerous Hungarian and international BSc, MSc and PhD students at BME have been involved in the above project as part of their project laboratory, thesis, internship or individual research work. We invite students of the Human-Centred Artificial Intelligence Master’s Programme to get involved with Silent Speech Interfaces!

All sessions will run live and will be hosted on LinkedIn Live. You can view the recorded sessions at our Webinars Archive. We will have more engaging discussions with top industry leaders, including our project partners from universities, research labs, industry and others. A complete list of all project partners can be found here. View the live event here.