
NTT’s 18 papers accepted for Interspeech2025, the world’s largest international conference on spoken language processing



18 papers authored by NTT Laboratories have been accepted at Interspeech2025 (the 26th edition of the Interspeech Conference) to be held in Rotterdam, The Netherlands, from August 17 to 21, 2025. Interspeech is the world’s largest and most comprehensive international conference on the science and technology of spoken language processing that supports speech communication between humans as well as between humans and machines/AI. It covers a broad range of fields, from speech recognition, speech synthesis, and spoken dialogue to phonetics. In addition to the 18 accepted papers, we will be presenting a Survey Talk and a demonstration at Interspeech2025.
Abbreviated names of the laboratories:
CS: NTT Communication Science Laboratories
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
SIC: NTT Software Innovation Center
(Affiliations are at the time of submission.)

◆Towards Pre-training an Effective Respiratory Audio Foundation Model
Daisuke Niizumi (CS), Daiki Takeuchi (CS), Binh Thien Nguyen (CS), Masahiro Yasuda (CS/CD), Yasunori Ohishi (CS), Noboru Harada (CS)
Respiratory audio foundation models are attracting interest as a key AI component for respiratory audio applications. However, the effectiveness of pre-training such models with existing methods, and with training data that lack diversity, has not been sufficiently verified in this domain. We investigated the effectiveness of 21 different audio foundation models and identified best practices for pre-training. Experiments confirmed that these practices significantly improve benchmark performance. Looking ahead, we aim to contribute to advancing research in areas such as health monitoring using respiratory audio.

◆CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
Daiki Takeuchi (CS), Binh Thien Nguyen (CS), Masahiro Yasuda (CS/CD), Daisuke Niizumi (CS), Yasunori Ohishi (CS), Noboru Harada (CS/CD)
Automated audio captioning aims to describe the semantic content of general sounds in natural language. In previous studies, discrete tokens obtained from neural audio codec models were repurposed as input features; however, these tokens are designed to reconstruct waveforms rather than to capture a sound’s semantic context. To address this issue, our approach applies vector quantization to pre-trained audio representations designed to capture semantic information, generating semantic-rich discrete tokens. Experimental results show that our method achieves higher captioning accuracy than conventional approaches that use discrete tokens from neural audio codecs, and it achieves performance comparable to state-of-the-art methods built on large language models, even though it is fine-tuned with smaller language models. This work contributes to developing foundation models capable of understanding diverse sounds, with potential applications in advanced monitoring systems and audio content retrieval.
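As a rough illustration of the tokenization step described above (not the paper’s actual implementation; the encoder output, codebook, and dimensions are hypothetical placeholders), the following Python sketch shows how continuous, semantically rich audio embeddings can be vector-quantized into discrete token IDs:

```python
# Minimal sketch: nearest-neighbor vector quantization of frame-level audio
# embeddings against a codebook, yielding discrete "semantic" token ids.
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 50, 256, 512                   # frames, embedding dim, codebook size (illustrative)
embeddings = rng.standard_normal((T, D)) # stand-in for pre-trained encoder outputs
codebook = rng.standard_normal((K, D))   # in practice learned, e.g., with k-means

def quantize(x, codebook):
    """Map each frame embedding to the index of its nearest codebook entry."""
    # Squared Euclidean distances without materializing a (T, K, D) tensor.
    d2 = (x ** 2).sum(1, keepdims=True) - 2 * x @ codebook.T + (codebook ** 2).sum(1)
    return d2.argmin(axis=1)             # (T,) discrete token ids

tokens = quantize(embeddings, codebook)
print(tokens[:10])                       # token sequence fed to the captioning model
```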

◆Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains
Takanori Ashihara (HI), Marc Delcroix (CS), Tsubasa Ochiai (CS), Kohei Matsuura (HI), Shota Horiguchi (HI)
Technologies that convert continuous audio waveforms into discrete units (tokens) are increasingly being used in a variety of audio processing tasks. In this study, we analyze how audio tokens behave across different domains (e.g., speech, music, and general sounds). Our analysis reveals characteristics of token representations that are domain-invariant, as well as properties that can be shared across domains. These findings offer valuable insights for the future design of audio processing systems and machine learning models that leverage audio tokens, particularly from the perspective of domain generalization.

◆Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces
Takafumi Moriya (HI/CS), Masato Mimura (HI), Kiyoaki Matsui (HI), Hiroshi Sato (HI), Kohei Matsuura (HI)
We propose a model architecture based on State Space Models (SSMs) that does not rely on attention mechanisms, which require quadratic computational cost, along with a decoding algorithm suited for streaming chunkwise processing. Experimental results demonstrate that the proposed SSM-based model achieves recognition performance comparable to that of models employing attention mechanisms, while enabling faster inference. This work facilitates the development of even faster and more accurate speech recognition systems, leading to applications in meeting transcription and dialogue systems.
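For readers unfamiliar with state space models, the sketch below illustrates, under simplifying assumptions (a diagonal linear SSM with illustrative dimensions, not the proposed architecture), why such models suit streaming chunkwise processing: the state carried across chunks summarizes all past input at a cost linear in sequence length.

```python
# Minimal sketch of a diagonal linear state-space layer,
#   h_t = A * h_{t-1} + B * x_t,  y_t = C @ h_t,
# processed chunk by chunk while carrying the state across chunks.
import numpy as np

rng = np.random.default_rng(0)
N = 16                                # state dimension (illustrative)
A = np.full(N, 0.9)                   # diagonal state transition
B = rng.standard_normal(N) * 0.1
C = rng.standard_normal(N) * 0.1

def ssm_chunk(x_chunk, h):
    """Run the recurrence over one chunk, returning outputs and the final state."""
    ys = []
    for x_t in x_chunk:               # scalar input per frame for simplicity
        h = A * h + B * x_t
        ys.append(C @ h)
    return np.array(ys), h

x = rng.standard_normal(300)          # a long input sequence
h = np.zeros(N)
outputs = []
for start in range(0, len(x), 40):    # 40-frame chunks, as in chunkwise streaming
    y, h = ssm_chunk(x[start:start + 40], h)
    outputs.append(y)
print(np.concatenate(outputs).shape)  # (300,)
```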

◆MOVER: Combining Multiple Meeting Recognition Systems
Naoyuki Kamo (CS), Tsubasa Ochiai (CS), Marc Delcroix (CS), Tomohiro Nakatani (CS)
Meeting recognition is the task of estimating who spoke what and when from audio data involving multiple speakers. This study proposes a method for integrating the outputs of multiple meeting recognition systems. Although conventional methods for system combination of speech recognition results exist, they assume that the recognition results correspond to utterances by the same speaker during the same time segments. Therefore, these methods cannot be directly applied to general meeting recognition systems. In this study, we propose a method that first determines the correspondence between recognition results from different time segments and then performs system combination. Our approach enables further improvement in the performance of meeting recognition systems.
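The following Python sketch is a greatly simplified illustration of the general idea (segment matching by time overlap followed by voting); the actual MOVER algorithm, which must also reconcile speaker labels and word-level alignments, differs in its details.

```python
# Simplified illustration: match segments from different meeting-recognition
# systems by time overlap, then take a majority vote over the matched texts.
from collections import Counter

# Each system output: list of (start_sec, end_sec, text).
sys_a = [(0.0, 2.0, "hello everyone"), (2.5, 4.0, "let's begin")]
sys_b = [(0.1, 2.1, "hello everyone"), (2.4, 4.1, "lets begin")]
sys_c = [(0.0, 1.9, "hello every one"), (2.5, 4.0, "let's begin")]

def overlap(a, b):
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def combine(reference_segments, other_systems):
    combined = []
    for ref in reference_segments:
        votes = [ref[2]]
        for system in other_systems:
            # Pick the segment with the largest time overlap with the reference.
            best = max(system, key=lambda seg: overlap(ref, seg), default=None)
            if best is not None and overlap(ref, best) > 0:
                votes.append(best[2])
        text, _ = Counter(votes).most_common(1)[0]
        combined.append((ref[0], ref[1], text))
    return combined

print(combine(sys_a, [sys_b, sys_c]))
# [(0.0, 2.0, 'hello everyone'), (2.5, 4.0, "let's begin")]
```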

◆Switch Conformer with Universal Phonetic Experts for Multilingual ASR
Masato Mimura (HI), Jaeyoung Lee (Kyoto University), Tatsuya Kawahara (Kyoto University)
Multilingual end-to-end ASR presents significant challenges due to the need to accommodate diverse writing systems, lexicons, and grammatical structures. Existing methods often rely on large models with high computational costs for adequate cross-language performance. To address this, we propose the switch Conformer, which enhances model capacity while maintaining nearly the same inference cost as a standard Conformer. Our approach replaces the feedforward network module in each Conformer block with a sparse mixture of independent experts, activating only one expert per input to enable efficient language-specific feature learning. In addition, a shared expert captures language-universal speech characteristics. Experiments on streaming speech recognition demonstrate that these experts work synergistically to achieve better performance than the baseline Conformer, with minimal additional active parameters. This technology enables more robust and seamless multilingual communication.
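As a hedged sketch of the general mechanism (dimensions and module layout are illustrative assumptions, not the paper’s configuration), the following PyTorch snippet shows a feed-forward block in which a router activates exactly one expert per frame while a shared expert is always applied:

```python
# Sketch of a switch-style feed-forward block: top-1 routing over
# language-specific experts plus an always-on shared expert.
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # Always-on expert intended to capture language-universal characteristics.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                               # x: (batch, time, d_model)
        idx = self.router(x).argmax(dim=-1)             # top-1 expert per frame
        out = self.shared(x)                            # language-universal path
        for e, expert in enumerate(self.experts):
            mask = (idx == e).unsqueeze(-1).to(x.dtype) # frames routed to expert e
            out = out + mask * expert(x)                # one routed expert active per frame
        return out

y = SwitchFFN()(torch.randn(2, 10, 256))
print(y.shape)                                          # torch.Size([2, 10, 256])
```

For clarity this sketch computes every expert densely and masks the results; an efficient implementation would dispatch only the frames routed to each expert.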

◆Why is children’s ASR so difficult? Analyzing children’s phonological error patterns using SSL-based phoneme recognizers
Koharu Horii (CS), Naohiro Tawara (CS), Atsunori Ogawa (CS), Shoko Araki (CS)
Accurate children’s automatic speech recognition (child ASR) is essential for systems that more effectively support childcare and education at home and at school. However, child ASR is generally more challenging than adult ASR, mainly because children’s speech features (e.g., speech rate and pitch) differ significantly from those of adults and change as children grow. To identify challenges in current child ASR systems, we developed a self-supervised learning (SSL)-based phoneme recognizer adapted to children’s speech. Using this recognizer, we analyzed children’s speech from ages 5 to 15, identifying factors that make their speech difficult to recognize and revealing how recognition error patterns vary with age. Our findings provide insights for improving child ASR, advancing research on children’s speech development, and supporting the development of educational and parenting applications.

◆Pick and Summarize: Integrating Extractive and Abstractive Speech Summarization
Takatomo Kano (CS), Atsunori Ogawa (CS), Marc Delcroix (CS), Ryo Fukuda (CS), William Chen (CMU), Shinji Watanabe (CMU)
There are two main approaches to speech summarization: extractive summarization, which constructs summaries by selecting meaningful utterances, and abstractive summarization, which freely rephrases the talk’s content. This study proposes a method that integrates both approaches into a single deep-learning model. In the proposed method, the speech summarization model first selects and outputs meaningful utterances from a long talk. It then generates an abstractive summary that uses more topic-relevant and specific vocabulary, based on both the overall content of the speech and the extracted text summary. Our approach successfully improves overall summarization accuracy. With this technique, the spoken language model outputs summaries in two stages, showing which parts of the speech it focused on and how the model summarizes the overall content. This two-step process is expected to improve both the interpretability and the accuracy of speech summarization.

◆Unified Audio-Visual Modeling for Recognizing Which Face Spoke When and What in Multi-Talker Overlapped Speech and Video
Naoki Makishima (HI), Naotaka Kawata (HI), Taiga Yamane (HI), Mana Ihori (HI), Tomohiro Tanaka (HI), Satoshi Suzuki (HI), Shota Orihashi (HI), Ryo Masumura (HI)
In understanding videos where multiple speakers talk simultaneously, it is practically important to recognize which face spoke "when" and "what" from multi-talker overlapped speech and video of various speakers. Conventional methods require combining speech separation, active speaker detection, and automatic speech recognition to address this task. However, the combination of these partially optimized systems complicates the overall system and produces suboptimal results. In this study, we address this problem by serializing "which face spoke when and what" from multiple speakers into a single token sequence and estimating it recursively with a unified model, which enhances the accuracy of estimating who spoke when and what. This study enables AI to integrate visual and auditory information, allowing for more advanced environmental understanding and communication support compared to conventional systems that rely solely on audio or visual input.
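To make the serialization idea concrete, here is a toy Python sketch (the tag format and field names are hypothetical, not the paper’s): utterances from all faces are folded into one flat token sequence that a single unified model could be trained to emit.

```python
# Toy serialization of "which face spoke when and what" into one token sequence.
def serialize(utterances):
    """utterances: list of dicts with a face id, a start time (sec), and words."""
    tokens = []
    for u in sorted(utterances, key=lambda u: u["start"]):
        tokens.append(f"<face:{u['face']}>")
        tokens.append(f"<t:{u['start']:.1f}>")
        tokens.extend(u["words"])
        tokens.append("<eos>")
    return tokens

mixture = [
    {"face": 1, "start": 0.4, "words": ["hello", "there"]},
    {"face": 2, "start": 0.9, "words": ["hi", "nice", "to", "meet", "you"]},
]
print(serialize(mixture))
# ['<face:1>', '<t:0.4>', 'hello', 'there', '<eos>',
#  '<face:2>', '<t:0.9>', 'hi', 'nice', 'to', 'meet', 'you', '<eos>']
```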

◆SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization
Naoki Makishima (HI), Naotaka Kawata (HI), Taiga Yamane (HI), Mana Ihori (HI), Tomohiro Tanaka (HI), Satoshi Suzuki (HI), Shota Orihashi (HI), Ryo Masumura (HI)
In the task of estimating "who spoke when and what" from overlapped speech in which multiple speakers talk simultaneously, the conventional method, called SOMSRED, has demonstrated high performance. SOMSRED solves the problem of estimating "who" by treating it as a problem of estimating discrete speaker tokens and their intermediate features, speaker vectors. This enables high-precision estimation even for speech with high overlap rates, where the entire recording is overlapping. However, discretization into speaker tokens degraded the speaker vectors. In this study, we introduce a loss function that imposes constraints on the speaker vectors in a continuous feature space, improving both the speaker-vector quality and the speech recognition performance of SOMSRED. This study is expected to help AI better understand its surroundings and support smoother voice communication.
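The sketch below shows one plausible shape such an objective could take (the exact constraint used in the paper may differ; the cosine-based term and all dimensions here are assumptions for illustration): a discrete speaker-token classification loss combined with a term that keeps the predicted continuous speaker vector close to a reference embedding.

```python
# Illustrative combined loss: speaker-token cross-entropy plus a continuous
# constraint pulling predicted speaker vectors toward reference embeddings.
import torch
import torch.nn.functional as F

def speaker_constrained_loss(token_logits, token_targets,
                             predicted_spk_vec, reference_spk_vec, weight=1.0):
    # Discrete part: which speaker token should be emitted.
    ce = F.cross_entropy(token_logits, token_targets)
    # Continuous part: 1 - cosine similarity in the speaker-vector space.
    cos = F.cosine_similarity(predicted_spk_vec, reference_spk_vec, dim=-1)
    constraint = (1.0 - cos).mean()
    return ce + weight * constraint

logits = torch.randn(8, 100)          # 8 decisions over 100 speaker tokens
targets = torch.randint(0, 100, (8,))
pred_vec = torch.randn(8, 192)        # predicted speaker vectors
ref_vec = torch.randn(8, 192)         # reference speaker embeddings
print(speaker_constrained_loss(logits, targets, pred_vec, ref_vec))
```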

◆Pretraining Multi-Speaker Identification for Neural Speaker Diarization
Shota Horiguchi (HI), Atsushi Ando (HI), Marc Delcroix (CS), Naohiro Tawara (CS)
Speaker diarization is the task of estimating who spoke when in an audio recording. Traditionally, achieving high-performance speaker diarization models required pretraining with a large amount of simulated multi-speaker conversation data. In this study, we propose a novel pretraining method that uses only short audio clips involving 0 to 2 speakers. This approach successfully improves model performance. The results of this research are expected to contribute to more effective recognition and understanding of speech in multi-speaker scenarios such as meetings and business negotiations.

◆Mitigating Non-Target Speaker Bias in Guided Speaker Embedding
Shota Horiguchi (HI), Takanori Ashihara (HI), Marc Delcroix (CS), Atsushi Ando (HI), Naohiro Tawara (CS)
A speaker embedding is a vector representation that enables the determination of whether two utterances are from the same speaker based on their similarity. In this study, we propose a method to improve the accuracy of extracting speaker embeddings corresponding to a specific speaker from audio involving multiple speakers. Our method enhances extraction performance by weighting the embeddings based on statistics calculated from the speech intervals of the target speaker. This approach is expected to contribute to better speech recognition and understanding in multi-speaker scenarios such as meetings and business negotiations.
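As an illustrative sketch of the underlying idea (the pooling scheme and dimensions below are simplified assumptions, not the paper’s method), frame-level features can be pooled into a speaker embedding using only the intervals where the target speaker is active, so the statistics are not biased by non-target speakers:

```python
# Weighted statistics pooling guided by target-speaker activity.
import numpy as np

rng = np.random.default_rng(0)
T, D = 200, 80
frames = rng.standard_normal((T, D))        # frame-level features of a mixture
target_active = np.zeros(T)
target_active[50:120] = 1.0                 # frames where the target speaker talks

def guided_pooling(frames, weights, eps=1e-8):
    w = weights / (weights.sum() + eps)
    mean = (w[:, None] * frames).sum(axis=0)
    var = (w[:, None] * (frames - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(var + eps)])   # mean + std pooling

embedding = guided_pooling(frames, target_active)
print(embedding.shape)                       # (160,)
```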

◆Voice Impression Control in Zero-Shot TTS
Kenichi Fujita (HI), Shota Horiguchi (HI), Yusuke Ijima (HI)
We have developed a method for controlling the impression of synthesized speech in zero-shot text-to-speech synthesis, a technology that can generate voices similar to a target speaker from just a few seconds of utterance. Our approach quantifies 11 different voice impressions—such as "cold–warm" or "weak–powerful"—as numerical vectors. By manipulating these vectors, it is possible to change the impression of the synthesized speech while preserving the original speaker’s characteristics. This enables users to easily generate speech that matches their desired style or scene.
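A toy sketch of the concept follows (the axis names, scale, and the commented-out synthesis call are assumptions, not the actual system): impressions are represented as a numerical vector whose axes can be shifted before conditioning a zero-shot TTS model, while the speaker embedding itself is left untouched.

```python
# Toy impression-vector manipulation for impression-controllable zero-shot TTS.
import numpy as np

AXES = ["cold-warm", "weak-powerful", "dark-bright"]  # three of the 11 axes, for brevity
impression = np.zeros(len(AXES))                      # neutral impression

def adjust(impression, axis, amount):
    """Shift one impression axis; positive moves toward the right-hand pole."""
    shifted = impression.copy()
    shifted[AXES.index(axis)] += amount
    return shifted

warmer = adjust(impression, "cold-warm", +1.5)
# A hypothetical synthesis call would then take the (unchanged) speaker
# embedding plus the edited impression vector:
#   wav = tts.synthesize(text, speaker_embedding, impression=warmer)
print(warmer)
```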

◆FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Yuto Kondo (CS)
Voice conversion is a technique that changes a person’s voice to sound like another’s while keeping the spoken content. Recently, diffusion model-based methods have gained attention for their high speech quality and speaker similarity. However, they require repeated computations and take time to extract linguistic content. To address this, we propose a new method called Adversarial Diffusion Conversion Distillation. This method achieves voice conversion performance comparable to that of existing methods in significantly less time, making it a promising step toward practical, high-performance voice conversion.

◆Vocoder-Projected Feature Discriminator
Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Yuto Kondo (CS)
In speech synthesis and voice conversion, a common approach is the two-stage method: one model predicts acoustic features, and another separately generates the waveform based on those features. This study aims to improve the performance of the first model to produce higher-quality speech. In this field, generative adversarial networks (GANs), where models are trained in competition, are often used. However, GAN-based training is prone to instability. To address this issue, we propose a new approach using a Vocoder-Projected Feature Discriminator, which enables the model to capture more effective feature representations and ensures more stable training. As a result, we significantly reduced training time and memory usage while maintaining voice conversion quality. This technique can lower the development cost of speech AI and is expected to contribute to more sustainable technologies.
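The snippet below is a heavily simplified sketch of the concept (the stand-in vocoder and discriminator architectures are assumptions, not the paper’s models): predicted acoustic features are passed through a frozen pre-trained vocoder, and the discriminator judges the projected representation rather than the raw features.

```python
# Sketch: discriminate acoustic features after projecting them through a
# frozen vocoder stand-in, rather than discriminating the features directly.
import torch
import torch.nn as nn

frozen_vocoder = nn.Sequential(        # toy stand-in for a pre-trained neural vocoder
    nn.Conv1d(80, 1, kernel_size=1),
)
for p in frozen_vocoder.parameters():
    p.requires_grad_(False)            # the vocoder is kept fixed during training

discriminator = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(16, 1, kernel_size=3),
)

def adversarial_score(mel):            # mel: (batch, 80, frames)
    projected = frozen_vocoder(mel)    # vocoder-projected features
    return discriminator(projected).mean()

print(adversarial_score(torch.randn(2, 80, 100)))
```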

◆JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
Yuto Kondo (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Takuhiro Kaneko (CS)
In order to advance research on speech generation AI technologies such as text-to-speech synthesis and voice conversion, we have constructed a large-scale, multi-speaker speech corpus called JIS, which includes 17 hours of recordings from over 150 live idols as speakers. Until now, surveys on speaker similarity — often conducted to evaluate the performance of speech generation AI by comparing synthesized speech with target speaker speech — have tended to provide relatively lenient assessments. However, given the unique characteristics of JIS (e.g., fan-based evaluations leveraging stage names), we anticipate that it will now be possible to conduct much more fine-grained similarity assessments. The copyright of JIS belongs to NTT. We plan to provide JIS free of charge to other research institutions, strictly for non-commercial, fundamental research purposes. We expect that JIS will further accelerate high-quality speech generation AI research and development by research groups throughout Japan.

◆Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State Tracking
Haris Gulzar (SIC), Monikka Roslianna Busto (SIC), Akiko Masaki (SIC), Takeharu Eda (SIC), Ryo Masumura (HI)
Dialog State Tracking (DST) is a vital component of Task Oriented Dialog (TOD) systems for navigating human conversation to perform various tasks. However, traditional TOD systems trained on written text perform poorly in spoken scenarios due to disfluencies and recognition errors of human speech. Labeled spoken-style data is scarce because of cost and privacy issues. In this work, we leveraged Large Language Models (LLMs) to generate such data synthetically. Carefully designed prompts produced labeled spoken TOD data that improved Joint Goal Accuracy (JGA) by 3.39% (absolute) and 11.6% (relative) for spoken TOD systems. In this research, we present our divide-and-conquer approach for data generation and DST model training. Adapting the existing dialog systems to spoken scenarios can significantly enhance various applications like voice-operated assistive robots, AI assistants in cars, and call center agents.
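As a hedged illustration of the data-transformation step (the prompt wording and the `call_llm` helper are hypothetical placeholders, not the authors’ prompts or any specific API), the sketch below rewrites a written-style, labeled dialog turn into a spoken-style turn while reusing the original dialog-state labels:

```python
# Sketch: prompt an LLM to convert a written-style TOD turn into spoken style
# while keeping the slot labels unchanged, yielding labeled spoken-style data.
PROMPT_TEMPLATE = """Rewrite the user turn below in a natural spoken style.
Add mild disfluencies (fillers, restarts) and plausible ASR-like errors,
but do NOT change any slot values.

User turn: {utterance}
Slot labels: {slots}

Spoken-style rewrite:"""

def make_spoken_example(utterance, slots, call_llm):
    prompt = PROMPT_TEMPLATE.format(utterance=utterance, slots=slots)
    rewritten = call_llm(prompt)        # any LLM client can be plugged in here
    return {"utterance": rewritten, "slots": slots}  # labels are reused as-is

# Example with a stub in place of a real LLM call:
fake_llm = lambda p: "uh I need a um a cheap hotel in the, in the north please"
print(make_spoken_example(
    "I need a cheap hotel in the north.",
    {"hotel-pricerange": "cheap", "hotel-area": "north"},
    fake_llm,
))
```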

◆Improving User Impression of Spoken Dialogue Systems by Controlling Para-linguistic Expression Based on Intimacy
Shoki Kawanishi (Tohoku University), Akinori Ito (Tohoku University), Yuya Chiba (CS), Takashi Nose (Tohoku University)
We propose a method for controlling the intimacy of responses in spoken dialogue systems by gradually adapting both linguistic and para-linguistic expressions in accordance with the progression of the dialogue. While previous studies on intimacy-based dialogue control have demonstrated the effectiveness of linguistic adaptations, para-linguistic features such as prosody and speaking rate have largely been overlooked. In this study, we trained a speech synthesis model using utterances extracted from conversations involving varying levels of intimacy between speakers, allowing the system to adapt its para-linguistic expressions as well. The results of this study are expected to provide design guidelines for maintaining user engagement in systems that are used continuously in everyday life.

We will also deliver a Survey Talk:

◆Advances in Conversational Speech Recognition
Marc Delcroix (CS)
Conversational Speech Recognition (CSR) aims to accurately transcribe natural, multi-talker conversations, such as meetings, and associate speaker labels and temporal information to the utterances. This task presents significant challenges due to the inherent complexities of spontaneous conversations, encompassing phenomena such as overlapping speech, speaker variability, and diverse acoustic environments. This survey talk will provide a comprehensive overview of the CSR problem, establishing its formal definition, common datasets, and standard evaluation metrics. We will analyze existing CSR frameworks and review recent developments in the field. The talk will conclude with a discussion on remaining challenges.

We also present a demonstration at a Show and Tell session.

◆Real-time TSE demonstration via SoundBeam with KD
Keigo Wakayama (CD), Tomoko Kawase (CD), Takafumi Moriya (HI/CS), Marc Delcroix (CS), Hiroshi Sato (HI), Tsubasa Ochiai (CS), Masahiro Yasuda (CD/CS), Shoko Araki (CS)
We propose a demonstration of a target sound extraction (TSE) system that extracts desired sound sources from a recorded sound mixture in real time. The proposed system extends NTT’s SoundBeam TSE approach to a causal system, maintaining high extraction performance by using knowledge distillation (KD) from a non-causal TSE system. Moving forward, we will actively enhance the real-time TSE system with the goal of applying it to a wide range of fields, including immersive systems and auditory devices.
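As an illustrative sketch of the distillation objective (simplified; the actual SoundBeam training recipe differs), a causal student can be trained to match both the clean target source and the output of a stronger non-causal teacher:

```python
# Sketch of a knowledge-distillation loss for causal target sound extraction:
# the student matches the reference target and the non-causal teacher's output.
import torch
import torch.nn.functional as F

def kd_tse_loss(student_out, teacher_out, reference, alpha=0.5):
    supervised = F.l1_loss(student_out, reference)          # match the clean target
    distill = F.l1_loss(student_out, teacher_out.detach())  # match the teacher
    return (1 - alpha) * supervised + alpha * distill

student_out = torch.randn(1, 16000, requires_grad=True)  # 1 s of audio at 16 kHz
teacher_out = torch.randn(1, 16000)                       # non-causal teacher output
reference   = torch.randn(1, 16000)                       # ground-truth target source
print(kd_tse_loss(student_out, teacher_out, reference))
```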



