SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION


Yifan Ding1, Yong Xu2, Shi-Xiong Zhang2, Yahuan Cong3, and Liqiang Wang1

1University of Central Florida, USA, 2Tencent AI Lab, USA, 3BUPT, China




Abstract

Speaker diarization, which aims to find the speech segments of specific speakers, has been widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +10% in F1-score and -12% in DER. Finally, we introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.



Network Architecture


Fig. 1. Two stream network architecture.


We process audio and video separately using a two-stream network, a common approach for multi-modal tasks. For the audio stream, the input is converted into MFCC (Mel-Frequency Cepstral Coefficient) features via the DCT (Discrete Cosine Transform), i.e., a power spectrum of the sound on a non-linear mel scale of frequency. The MFCCs are then fed to a 2D convolutional network to produce speech features. For the video stream, a 3D convolution module extracts both temporal information across consecutive video frames and spatial information within each frame.
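As a rough illustration, the sketch below shows how the two streams might be wired up. It assumes PyTorch, and the layer counts, channel widths, and embedding dimension are placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of the two-stream architecture (assumed PyTorch;
# layer sizes are illustrative, not the paper's exact configuration).
import torch
import torch.nn as nn

class AudioStream(nn.Module):
    """2D CNN over MFCC features -> speech embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, mfcc):            # mfcc: (B, 1, n_mfcc, time)
        return self.fc(self.conv(mfcc).flatten(1))

class VideoStream(nn.Module):
    """3D CNN over a short face-crop clip -> visual embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, clip):            # clip: (B, 3, frames, H, W)
        return self.fc(self.conv(clip).flatten(1))

# The two embeddings are then compared with a distance, e.g.:
# dist = torch.norm(audio_emb - video_emb, dim=1)
```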

Training Examples


Fig. 2. Examples of audio-video synchronized, shifted, and heterologous pairs. W denotes the length of the visual clip; T is the shifting range.


We consider three types of training samples in our experiments: the synchronized pair (positive), in which the audio and video are synchronized; the shifted pair (negative), in which the audio and video come from the same source but from time clips offset by varying shift ranges; and the heterologous pair (negative), in which the audio and video come from different sources.
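The sketch below illustrates how the three pair types could be sampled from a list of clips. The function name, its arguments, and the frame-level slicing are hypothetical and only meant to make the pairing scheme concrete; it assumes each clip is long enough to accommodate the window W plus the maximum shift T.

```python
# Hedged sketch of building the three pair types; `make_pairs` and its
# inputs are illustrative, not the paper's data pipeline.
import random

def make_pairs(clips, W, T):
    """clips: list of dicts {'audio': ..., 'video': ...} with frame-aligned streams.
    W: length (in frames) of the visual window; T: maximum shift range."""
    pairs = []
    for i, clip in enumerate(clips):
        start = random.randint(0, len(clip['video']) - W - T)
        video = clip['video'][start:start + W]

        # synchronized pair (positive): same source, same time window
        pairs.append((video, clip['audio'][start:start + W], 'positive'))

        # shifted pair (negative): same source, audio offset by up to T frames
        shift = random.randint(1, T)
        pairs.append((video, clip['audio'][start + shift:start + shift + W], 'shifted'))

        # heterologous pair (negative): audio taken from a different source
        other = clips[(i + 1) % len(clips)]
        pairs.append((video, other['audio'][start:start + W], 'heterologous'))
    return pairs
```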


Losses



Fig. 3. Comparison of loss functions. (a): The negative pair is separated by a margin. (b): Negative pairs are separated from each other by a margin.  (c):  Negative pairs with different shifts or heterologous audio are separated by separate margins.


We propose two new losses: the dynamic triplet loss and the multinomial loss. In the dynamic triplet loss, positive and negative pairs are determined dynamically in each iteration. In the multinomial loss, all shifting combinations for an audio-video pair and all heterologous combinations of audio-video pairs within a mini-batch are considered simultaneously. Specifically, we cluster the negative pairs into groups and apply a different margin to each group, using LogSumExp (LSE) as a smooth maximum within each group. The experimental results show that the multinomial loss achieves faster convergence and better performance than the dynamic triplet loss and the contrastive loss.
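The following sketch shows one way the two losses could be written. It assumes PyTorch, and the grouping of negatives and the margin values are illustrative rather than the paper's exact settings.

```python
# Hedged sketch of the dynamic triplet loss and a grouped LSE-based
# multinomial loss; grouping and margins are illustrative assumptions.
import torch
import torch.nn.functional as F

def dynamic_triplet_loss(pos_dist, neg_dist, margin=0.5):
    """Synchronized (positive) pair should be closer than the negative pair
    selected for this iteration by at least `margin`."""
    return F.relu(pos_dist - neg_dist + margin).mean()

def multinomial_loss(pos_dist, neg_dist_groups, margins):
    """pos_dist:        (B,) distances of synchronized pairs
    neg_dist_groups: list of (B, N_g) distances, one tensor per negative group
                     (e.g. different shift ranges, heterologous audio)
    margins:         one margin per group."""
    loss = pos_dist.new_zeros(())
    for neg_dist, margin in zip(neg_dist_groups, margins):
        # hinge term for every negative in the group
        hinge = margin + pos_dist.unsqueeze(1) - neg_dist      # (B, N_g)
        # LogSumExp acts as a smooth maximum over the group's negatives
        loss = loss + F.softplus(torch.logsumexp(hinge, dim=1)).mean()
    return loss
```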


More Test Results


We compare our results with SyncNet and with UIS-RNN (unbounded interleaved-state recurrent neural networks), with and without the number of faces as a bound. UIS-RNN is a fully supervised, audio-only speaker diarization system that takes d-vector embeddings as input; each individual speaker is modeled by a parameter-sharing RNN, and the RNN states of different speakers interleave in the time domain. In the face-bounded version, the number of faces is used to indicate the number of speakers, which brings video information into the system.

Per-frame Test Distances



Fig. 4. (a): Ours. Different colors denote the audio-video distances of different speakers. The curve with the lowest distance is the predicted active speaker. (b): SyncNet. (c): GT. (d): Visualization of GT. The red box frames the face of the active speaker.
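As a rough sketch of the decision rule in (a), the snippet below picks, for each frame, the face whose audio-visual distance is lowest. The `distances` array, the function name, and the optional silence threshold are assumptions for illustration, not part of the paper's pipeline.

```python
# Hedged sketch of the per-frame decision rule: the face with the lowest
# audio-visual distance is predicted as the active speaker.
import numpy as np

def predict_active_speaker(distances, silence_threshold=None):
    """distances: (num_faces, num_frames) array of audio-visual distances.
    Returns the per-frame index of the closest face; if `silence_threshold`
    is given, frames where even the best distance exceeds it are marked -1."""
    best = distances.argmin(axis=0)                     # (num_frames,)
    if silence_threshold is not None:
        best = np.where(distances.min(axis=0) > silence_threshold, -1, best)
    return best
```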

Video Demo


GT label

Ours

The green boxes denote the labeled/detected active speaker.