The challenges associated with an aging global population, such as increased healthcare demands and shifts in workforce demographics, have become a growing focal point of societal concern. One key challenge faced by older adults is difficulty articulating their needs clearly due to hearing or speech impairments, which can hinder effective communication.

An article published in IEEE Transactions on Human-Machine Systems introduces AVE Speech, a comprehensive multimodal dataset for speech recognition tasks. By integrating audio, visual, and EMG signals into a single multimodal speech recognition system, the proposed approach offers a natural and efficient means of human-machine interaction. It can be applied to a broad range of scenarios, including rehabilitation for patients with speech disorders, daily assistance for older adults, and private communication in low-light or dynamic indoor environments.

Reviewing Prior Work

The researchers analyze related work to establish a baseline of knowledge before introducing their model. According to the article, speech recognition can generally be categorized into two main approaches: automatic speech recognition (ASR) and silent speech recognition (SSR). ASR has become deeply integrated into daily life, exemplified by the speech-to-text functionality in numerous mobile applications. Despite the widespread utility and impressive capabilities of ASR, its accuracy significantly diminishes in environments where acoustic signals cannot be reliably captured. 

To address these limitations, SSR has experienced accelerated development, particularly in sectors such as elderly care, disability assistance, and environments with significant background noise. The progress in SSR has been further propelled by advancements in training datasets and deep learning algorithms, leading to substantial improvements in specific areas such as visual speech recognition (commonly known as lipreading) and surface electromyography (EMG)-based speech recognition. 

Introducing AVE Speech

By integrating multiple modalities, AVE Speech offers complementary information that enhances the robustness and accuracy of speech recognition, enabling unconstrained and more reliable communication. Leveraging the strengths of each modality allows a multimodal system to overcome the limitations inherent in single-modal speech recognition.

Fig. 1. Illustration of multimodal speech recognition pipeline integrating audio, visual, and electromyographic signals.

Explicitly designed for multimodal speech recognition, the AVE Speech dataset integrates audio, visual, and EMG signals directly related to the speech process, providing a fusion paradigm for this emerging field.

AVE Speech Dataset

The AVE Speech dataset enables multimodal speech recognition by integrating audio, visual, and EMG signals. In the article, the researchers outline the corpus design, participant demographics, data collection equipment, dataset organization, data preprocessing, and basic feature extraction.

The AVE Speech corpus is organized around the five levels of Maslow's Hierarchy of Needs: physiological, safety, belongingness and love, esteem, and self-actualization. It comprises 100 Mandarin sentences designed to meet a broad spectrum of user needs, particularly those of individuals with speech impairments and older adults requiring assistance with daily living, rehabilitation, and caregiving.

Representative examples of sentences in the AVE Speech corpus, categorized by the type of need.

Statistical overview of the corpus and participant demographics. (a) Distribution of the number of words per sentence. (b) Gender distribution of the participants. (c) Age distribution of the participants.

During the multimodal data collection, participants were seated comfortably in a quiet room with normal lighting conditions. They were instructed to read the presented sentences within two seconds, while audio signals, lip-region video images, and six-channel EMG signals corresponding to the same sentence were collected simultaneously.
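
As a rough sketch of how such recordings might be turned into model-ready features, the Python snippet below computes a log-Mel spectrogram for the audio, a Mel-style spectrogram for each of the six EMG channels, and keeps the lip-region video as a frame sequence. The sampling rates, FFT parameters, and array shapes are illustrative assumptions, not the paper's exact preprocessing settings.

```python
# Illustrative feature extraction for one 2-second utterance.
# Sampling rates, FFT sizes, and frame counts are assumptions made for
# this sketch; they are not taken from the AVE Speech paper.
import numpy as np
import librosa

AUDIO_SR = 16_000      # assumed audio sampling rate (Hz)
EMG_SR = 1_000         # assumed EMG sampling rate (Hz)
N_EMG_CHANNELS = 6     # six-channel EMG, as described in the article
DURATION_S = 2.0       # each sentence is read within two seconds

# Placeholder signals standing in for real recordings.
audio = np.random.randn(int(AUDIO_SR * DURATION_S)).astype(np.float32)
emg = np.random.randn(N_EMG_CHANNELS, int(EMG_SR * DURATION_S)).astype(np.float32)
lip_frames = np.random.rand(50, 64, 64).astype(np.float32)  # (frames, height, width)

# Audio: log-Mel spectrogram.
audio_mel = librosa.feature.melspectrogram(y=audio, sr=AUDIO_SR, n_mels=80)
audio_logmel = librosa.power_to_db(audio_mel)

# EMG: a Mel-style spectrogram per channel, stacked along the channel axis.
emg_logmel = np.stack([
    librosa.power_to_db(
        librosa.feature.melspectrogram(
            y=ch, sr=EMG_SR, n_fft=256, hop_length=64, n_mels=32, fmax=EMG_SR / 2
        )
    )
    for ch in emg
])

print(audio_logmel.shape)  # (80, time frames)
print(emg_logmel.shape)    # (6, 32, time frames)
print(lip_frames.shape)    # (50, 64, 64)
```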

Overview of the data collection system, including both hardware devices and the recording interface.

Experimental Results

The researchers implemented standard methods for single-modal speech recognition and employed conventional fusion paradigms to conduct multimodal speech recognition experiments using the proposed dataset. 

Fusion network architecture.

The fusion pipeline integrates three modalities of speech data: audio (Mel spectrogram), visual (image sequence of lip movements), and EMG (Mel spectrogram of the electromyographic signals). The entire AVE Speech dataset was used for the cross-subject speech recognition experiments. Unlike prior research, which often trains and tests on data from the same speakers, this work focuses on speaker-independent speech recognition, aligning closely with real-world applications.
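
As a loose illustration of such a fusion paradigm, the PyTorch sketch below encodes each modality with its own small network and concatenates the resulting embeddings before a 100-way sentence classifier. The encoder architectures, embedding sizes, and input shapes are hypothetical and only convey the general idea of late fusion, not the authors' actual model.

```python
# Hedged sketch of a three-branch fusion classifier (not the paper's model).
# Input shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleFusionNet(nn.Module):
    def __init__(self, n_classes: int = 100, embed_dim: int = 128):
        super().__init__()
        # Audio branch: 2D CNN over a (1, 80, T) log-Mel spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        # Visual branch: 3D CNN over a (1, T, H, W) lip-image sequence.
        self.visual_enc = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        # EMG branch: 2D CNN over a (6, 32, T) multi-channel spectrogram.
        self.emg_enc = nn.Sequential(
            nn.Conv2d(6, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        # Late fusion: concatenate the three embeddings, then classify.
        self.classifier = nn.Linear(3 * embed_dim, n_classes)

    def forward(self, audio, video, emg):
        z = torch.cat(
            [self.audio_enc(audio), self.visual_enc(video), self.emg_enc(emg)], dim=-1
        )
        return self.classifier(z)

# Dummy batch with assumed shapes: audio (B, 1, 80, 63), video (B, 1, 50, 64, 64),
# EMG (B, 6, 32, 32).
model = SimpleFusionNet()
logits = model(
    torch.randn(2, 1, 80, 63), torch.randn(2, 1, 50, 64, 64), torch.randn(2, 6, 32, 32)
)
print(logits.shape)  # torch.Size([2, 100])
```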

The experimental findings demonstrate that integrating multimodal speech data not only enhances recognition performance under severe audio noise but also mitigates the variability caused by subject differences. The multilevel, fine-grained semantic information provided by the combination of audio, visual, and EMG data effectively compensates for the limitations of single-modal speech recognition systems, particularly in cross-subject and high-noise scenarios.
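
To make the cross-subject protocol more concrete, the snippet below sketches a speaker-independent split in which all utterances from a given speaker land entirely in either the training set or the test set. The speaker count, utterance count, and use of scikit-learn's GroupShuffleSplit are illustrative choices, not details taken from the paper.

```python
# Illustrative speaker-independent (cross-subject) split: no speaker appears
# in both training and test sets. Speaker IDs and sample counts are invented.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_utterances = 1_000
X = np.arange(n_utterances)                        # stand-in for utterance indices
speakers = np.random.randint(0, 20, n_utterances)  # one group label per utterance

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=speakers))

# Sanity check: training and test speaker sets do not overlap.
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])
print(len(train_idx), len(test_idx))
```
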

Limitations and Future Directions

The proposed AVE Speech dataset is a comprehensive, high-quality, and synchronized multimodal corpus collected from a diverse pool of speakers, and it serves as a valuable resource for a wide range of research and practical applications. Nevertheless, several limitations should be acknowledged.

Primarily, the current participant group consists of younger adults with no known speech impairments. While practical and effective for establishing a high-quality baseline dataset, this approach may limit immediate generalizability to populations such as older adults or individuals with speech disorders, who often experience different physiological and cognitive conditions that affect speech production and perception. Age-related changes—such as reduced respiratory capacity, decreased vocal fold elasticity, and muscular atrophy—can impact various aspects of speech, including fluency and articulation.

Despite these considerations, the researchers believe the dataset serves as a robust foundation for the development of assistive and human-machine interaction technologies. The AVE Speech dataset offers a platform for future research into speech recognition and multimodal learning, serving as a crucial tool for investigating the underlying mechanisms of human speech perception and interaction.

Interested in learning more about speech recognition and human-machine interaction? The IEEE Xplore Digital Library offers over 66,000 publications on Speech Recognition.

Interested in acquiring full-text access to this collection for your entire organization? Request a free demo and trial subscription for your organization.