SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION


Yifan Ding1, Yong Xu2, Shi-Xiong Zhang2, Yahuan Cong3, and Liqiang Wang1

1University of Central Florida, USA, 2Tencent AI Lab, USA, 3BUPT, China




Abstract

Speaker diarization, which aims to find the speech segments of specific speakers, has been widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +10% in F1-score and -12% in DER. Finally, we introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.



Network Architecture


Fig. 1. Two stream network architecture.


We process audio and video separately using a two-stream network, a common approach for multi-modal tasks. For the audio stream, the input is converted into MFCC (Mel-Frequency Cepstral Coefficient) features via the DCT (Discrete Cosine Transform), i.e., a power spectrum of the sound on a non-linear mel scale of frequency. The MFCCs are then fed to a 2D convolutional network to produce speech features. For the video stream, a 3D convolution module extracts both temporal information across consecutive video frames and spatial information within each frame.
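As a rough illustration, the sketch below shows how the two streams might be wired up. It assumes PyTorch, and the layer counts, channel widths, and embedding dimension are placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of the two-stream architecture (assumed PyTorch;
# layer sizes are illustrative, not the paper's exact configuration).
import torch
import torch.nn as nn

class AudioStream(nn.Module):
    """2D CNN over MFCC features -> speech embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, mfcc):            # mfcc: (B, 1, n_mfcc, time)
        return self.fc(self.conv(mfcc).flatten(1))

class VideoStream(nn.Module):
    """3D CNN over a short face-crop clip -> visual embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, clip):            # clip: (B, 3, frames, H, W)
        return self.fc(self.conv(clip).flatten(1))

# The two embeddings are then compared with a distance, e.g.:
# dist = torch.norm(audio_emb - video_emb, dim=1)
```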

Training Examples


Fig. 2. Examples of audio-video synchronized, shifted, and heterologous pairs. W denotes the length of the visual clip; T is the shifting range.


We consider three types of training samples in our experiments: the synchronized pair (positive), in which the audio and video are synchronized; the shifted pair (negative), in which the audio and video come from the same source but from time clips offset by varying shift ranges; and the heterologous pair (negative), in which the audio and video come from different sources.
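The sketch below illustrates how the three pair types could be sampled from a list of clips. The function name, its arguments, and the frame-level slicing are hypothetical and only meant to make the pairing scheme concrete; it assumes each clip is long enough to accommodate the window W plus the maximum shift T.

```python
# Hedged sketch of building the three pair types; `make_pairs` and its
# inputs are illustrative, not the paper's data pipeline.
import random

def make_pairs(clips, W, T):
    """clips: list of dicts {'audio': ..., 'video': ...} with frame-aligned streams.
    W: length (in frames) of the visual window; T: maximum shift range."""
    pairs = []
    for i, clip in enumerate(clips):
        start = random.randint(0, len(clip['video']) - W - T)
        video = clip['video'][start:start + W]

        # synchronized pair (positive): same source, same time window
        pairs.append((video, clip['audio'][start:start + W], 'positive'))

        # shifted pair (negative): same source, audio offset by up to T frames
        shift = random.randint(1, T)
        pairs.append((video, clip['audio'][start + shift:start + shift + W], 'shifted'))

        # heterologous pair (negative): audio taken from a different source
        other = clips[(i + 1) % len(clips)]
        pairs.append((video, other['audio'][start:start + W], 'heterologous'))
    return pairs
```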


Losses



Fig. 3. Comparison of loss functions. (a): The negative pair is separated by a margin. (b): Negative pairs are separated from each other by a margin.  (c):  Negative pairs with different shifts or heterologous audio are separated by separate margins.


We propose two new losses: the dynamic triplet loss and the multinomial loss. In the dynamic triplet loss, positive and negative pairs are determined dynamically in each iteration. In the multinomial loss, all shifting combinations for an audio-video pair and all heterologous combinations of audio-video pairs within a mini-batch are considered simultaneously. Specifically, we cluster the negative pairs into groups and apply a different margin to each group, using LogSumExp (LSE) as a smooth maximum within each group. The experimental results show that the multinomial loss achieves faster convergence and better performance than the dynamic triplet loss and the contrastive loss.
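The following sketch shows one way the two losses could be written. It assumes PyTorch, and the grouping of negatives and the margin values are illustrative rather than the paper's exact settings.

```python
# Hedged sketch of the dynamic triplet loss and a grouped LSE-based
# multinomial loss; grouping and margins are illustrative assumptions.
import torch
import torch.nn.functional as F

def dynamic_triplet_loss(pos_dist, neg_dist, margin=0.5):
    """Synchronized (positive) pair should be closer than the negative pair
    selected for this iteration by at least `margin`."""
    return F.relu(pos_dist - neg_dist + margin).mean()

def multinomial_loss(pos_dist, neg_dist_groups, margins):
    """pos_dist:        (B,) distances of synchronized pairs
    neg_dist_groups: list of (B, N_g) distances, one tensor per negative group
                     (e.g. different shift ranges, heterologous audio)
    margins:         one margin per group."""
    loss = pos_dist.new_zeros(())
    for neg_dist, margin in zip(neg_dist_groups, margins):
        # hinge term for every negative in the group
        hinge = margin + pos_dist.unsqueeze(1) - neg_dist      # (B, N_g)
        # LogSumExp acts as a smooth maximum over the group's negatives
        loss = loss + F.softplus(torch.logsumexp(hinge, dim=1)).mean()
    return loss
```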


More Test Results


We compare our results with SyncNet and with UIS-RNN (unbounded interleaved-state recurrent neural networks), with and without the number of faces as a bound. UIS-RNN is a fully supervised, audio-only speaker diarization system that takes d-vector embeddings as input; each individual speaker is modeled by a parameter-sharing RNN, and the RNN states of different speakers interleave in the time domain. In the face-bounded version, the number of faces is used to indicate the number of speakers, which brings video information into the system.

Per-frame Test Distances



Fig. 4. (a): Ours. Different colors denote the audio-video distances of different speakers. The curve with the lowest distance is the predicted active speaker. (b): SyncNet. (c): GT. (d): Visualization of GT. The red box frames the face of the active speaker.
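As a rough sketch of the decision rule in (a), the snippet below picks, for each frame, the face whose audio-visual distance is lowest. The `distances` array, the function name, and the optional silence threshold are assumptions for illustration, not part of the paper's pipeline.

```python
# Hedged sketch of the per-frame decision rule: the face with the lowest
# audio-visual distance is predicted as the active speaker.
import numpy as np

def predict_active_speaker(distances, silence_threshold=None):
    """distances: (num_faces, num_frames) array of audio-visual distances.
    Returns the per-frame index of the closest face; if `silence_threshold`
    is given, frames where even the best distance exceeds it are marked -1."""
    best = distances.argmin(axis=0)                     # (num_frames,)
    if silence_threshold is not None:
        best = np.where(distances.min(axis=0) > silence_threshold, -1, best)
    return best
```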

Video Demo


GT label

Ours

The green boxes denote the labeled/detected active speaker.