Development of End-to-End Speaker Diarization Method for Detecting Multi-speaker Overlapping Speech

Our method outperformed conventional methods on overlapping speech

February 3, 2020

Hitachi, Ltd. today announced the development of an end-to-end^*1 speaker diarization^*2 method that accurately detects the speech segments (start and end times) of multiple speakers by using a neural network trained on speech in which speakers overlap. Unlike most other speaker diarization methods, which cannot handle overlapping speech, this method improves speech recognition accuracy for the overlapping speech found in natural conversation. Evaluation results on a telephone speech dataset show that the method outperforms conventional methods. The method also achieved excellent diarization error rates^*3 on simulated speech datasets with heavy overlap. Hitachi aims to help address labor shortages and to contribute to productivity improvement by applying the method to speech recognition and dialogue services.
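As a rough illustration of what detecting the segments of multiple speakers involves when speech overlaps, the sketch below assumes a model that outputs a speech-activity probability per speaker for every frame, so that two speakers can be active in the same frame. The function name, the 0.5 threshold, and the 0.1-second frame shift are illustrative assumptions and are not details given in this announcement.

```python
from typing import List, Tuple


def activity_to_segments(
    probs: List[List[float]],
    threshold: float = 0.5,    # assumed decision threshold
    frame_shift: float = 0.1,  # assumed seconds per frame
) -> List[Tuple[int, float, float]]:
    """Convert per-frame, per-speaker probabilities into segments.

    probs[t][s] is the probability that speaker s is talking in frame t.
    Returns (speaker, start_seconds, end_seconds) tuples. Segments of
    different speakers may overlap in time.
    """
    if not probs:
        return []
    segments = []
    for spk in range(len(probs[0])):
        start = None  # index of the frame where the current segment began
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append(
                    (spk, round(start * frame_shift, 3), round(t * frame_shift, 3))
                )
                start = None
        if start is not None:  # segment still open at the end of the audio
            segments.append(
                (spk, round(start * frame_shift, 3), round(len(probs) * frame_shift, 3))
            )
    return segments


# Two speakers; both are active in the second frame (overlap).
example = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.1]]
print(activity_to_segments(example))
```

In this toy input, speaker 0 talks from 0.0 s to 0.2 s and speaker 1 from 0.1 s to 0.3 s, so the interval from 0.1 s to 0.2 s is attributed to both speakers rather than forced onto one of them.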

Fig. 1 Developed end-to-end speaker diarization

Fig. 2 Conventional speaker diarization

*1: End-to-end: A learning method using a single neural network that directly outputs the target results, bypassing a complicated pipeline of systems.
*2: Speaker diarization: The process of detecting the speech segments of multiple speakers to answer the question "who spoke when?"
*3: Diarization error rate: The most commonly used metric for speaker diarization. The ratio of audio time that is missed, falsely detected, or assigned to the wrong speaker to the total audio time of the correct (reference) speaker segments.
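The definition in footnote *3 can be made concrete with a small frame-level calculation. The sketch below is a simplified illustration, not the scoring tool used in the evaluation: it assumes reference and hypothesis are binary frame-by-speaker matrices, that hypothesis speakers have already been mapped to reference speakers, and that no forgiveness collar is applied around segment boundaries. The function name is hypothetical.

```python
from typing import List


def frame_level_der(ref: List[List[int]], hyp: List[List[int]]) -> float:
    """Diarization error rate over aligned binary activity matrices.

    ref[t][s] and hyp[t][s] are 1 if speaker s is active in frame t.
    DER = (missed + false alarm + speaker confusion) / reference speech.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    missed = false_alarm = confusion = total_ref = 0
    for ref_frame, hyp_frame in zip(ref, hyp):
        n_ref = sum(ref_frame)  # speakers really talking in this frame
        n_hyp = sum(hyp_frame)  # speakers the system says are talking
        n_correct = sum(r and h for r, h in zip(ref_frame, hyp_frame))
        missed += max(n_ref - n_hyp, 0)
        false_alarm += max(n_hyp - n_ref, 0)
        confusion += min(n_ref, n_hyp) - n_correct
        total_ref += n_ref
    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref


# Frame 2 has overlapping speech; the hypothesis finds only one speaker there.
ref = [[1, 0], [1, 1], [0, 1], [0, 0]]
hyp = [[1, 0], [1, 0], [1, 0], [0, 1]]
print(frame_level_der(ref, hyp))  # (1 missed + 1 false alarm + 1 confusion) / 4 = 0.75
```

Because the denominator counts every active reference speaker in every frame, a system that can output only one speaker per frame necessarily accumulates missed speech wherever speakers overlap.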

For more information, use the enquiry form below to contact the Research & Development Group, Hitachi, Ltd. Please make sure to include the title of the article.

https://www8.hitachi.co.jp/inquiry/hitachi-ltd/hqrd/news/en/form.jsp