2022, End-to-End Audio-Visual Neural Speaker Diarization [2022, Interspeech]

DongKeon Park · May 17, 2023



  • multimodal inputs
    • uses audio features, lip regions of interest, and i-vector embeddings
  • The i-vectors are key to resolving the alignment problem caused by visual modality errors
    • e.g., occlusions, off-screen speakers, or unreliable detection
  • The audio-visual model is robust to a missing visual modality, a condition under which the visual-only model's diarization performance degrades significantly
  • It is robust to visual modality errors and outperforms audio-only and video-only systems
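
To make the role of the three inputs concrete, here is a minimal sketch of one way per-frame features for a single speaker could be combined. It is not the paper's architecture, which these notes do not describe: plain concatenation with zero-filling is swapped in purely for illustration, and `fuse_inputs`, `lip_dim`, and the toy vectors are hypothetical. The point it shows is that when a lip ROI is missing (occlusion, off-screen speaker, failed detection), the speaker's i-vector is still present in every frame.

```python
def fuse_inputs(audio_frames, lip_frames, ivector, lip_dim):
    """Concatenate audio, lip, and i-vector features frame by frame.

    Assumption for illustration only: a missing lip embedding is marked
    with None and replaced by zeros; the real model may handle this differently.
    """
    if len(audio_frames) != len(lip_frames):
        raise ValueError("audio and lip streams must have the same number of frames")
    fused = []
    for audio, lip in zip(audio_frames, lip_frames):
        # Zero-fill the visual part when the lip ROI is unavailable.
        visual = list(lip) if lip is not None else [0.0] * lip_dim
        if len(visual) != lip_dim:
            raise ValueError("unexpected lip embedding size")
        # The i-vector is constant per speaker, so it is appended to every frame.
        fused.append(list(audio) + visual + list(ivector))
    return fused


audio = [[0.1, 0.2], [0.3, 0.4]]   # toy audio features, 2 frames
lips = [[1.0, 1.0, 1.0], None]     # second frame has no usable lip ROI
ivec = [9.0]                       # toy speaker i-vector
fused = fuse_inputs(audio, lips, ivec, lip_dim=3)
```

In the second frame the visual slots are all zeros, but the i-vector entry still identifies the speaker, which is the intuition behind using it to compensate for visual errors.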


  • Explores the effects of lip motion and speech on speaker diarization using high-definition lip ROIs and single-channel audio
  • By manually removing lip ROI fragments, we can compare the impact of different degrees of lip misalignment on speaker diarization.
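
The fragment-removal experiment can be sketched as follows. This is only an assumed protocol, since the notes do not say how fragments were chosen: the lip ROI sequence is split into fixed-length fragments and a given fraction of them is blanked out at random. The names `drop_lip_fragments`, `missing_ratio`, `fragment_len`, and `seed` are hypothetical.

```python
import random


def drop_lip_fragments(lip_frames, missing_ratio, fragment_len=25, seed=0):
    """Return a copy of lip_frames with whole fragments replaced by None.

    missing_ratio is the fraction of fragments (not frames) to remove.
    """
    if not 0.0 <= missing_ratio <= 1.0:
        raise ValueError("missing_ratio must be in [0, 1]")
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    # Number of fragments, counting a shorter final fragment if present.
    n_fragments = (len(lip_frames) + fragment_len - 1) // fragment_len
    n_removed = round(missing_ratio * n_fragments)
    # Seeded generator so the same degradation can be reproduced.
    rng = random.Random(seed)
    removed = set(rng.sample(range(n_fragments), n_removed))
    return [
        None if (i // fragment_len) in removed else frame
        for i, frame in enumerate(lip_frames)
    ]


frames = [f"roi_{i}" for i in range(100)]  # stand-ins for lip ROI images
for ratio in (0.0, 0.3, 1.0):
    degraded = drop_lip_fragments(frames, ratio, fragment_len=10)
    print(ratio, sum(f is None for f in degraded))
```

Running a diarization system on the output for several values of `missing_ratio` would give the kind of comparison described above, with each ratio standing for a different degree of lip misalignment.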
