[ASR] SincNet: Speaker Recognition from Raw Waveform with SincNet

누렁이 · February 8, 2024

ASR Speaker recognition


paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8639585&tag=1
code:
Related material: https://velog.io/@gyfla1512/ASR-Study-Paper.-SPEAKER-RECOGNITION-FROM-RAW-WAVEFORM-WITH-SINCNET

Preliminary

band-pass filter: selectively passes the part of a signal that lies in a given frequency band
"rectangular bandpass filters" = a filter bank made of filters that each extract a rectangle-shaped frequency band
sinc function
Baum-Welch
GMM-UBM

Introduction

Research trends:

1) i-vector representation of speech segments:
- i-vector => combines all the information into one vector?!
- Gaussian Mixture Model-Universal Background Models (GMM-UBMs)
- Deep learning: used to compute the Baum-Welch statistics; extracts frame-level features
2) raw waveforms
- Handcrafted features were the norm, but nothing guarantees they are optimal. So feeding the raw waveform in directly is becoming more common.
- CNN: the most popular choice for processing raw speech samples. (weight sharing, local filters, pooling -> these help find invariant representations)

Challenge: First Convolutional Layer

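The most important part of a waveform-based CNN is the first conv layer.
It takes the high-dimensional input, and it is the layer most exposed to the vanishing gradient problem.
Filters learned by a CNN tend to have noisy, incongruous multi-band shapes. (especially when there are few training samples?)
Does that mean they are not pretty and jump around?!
Anyway, I guess the point is how to effectively... preprocess? embed? the input.
So producing an effective representation at this stage is really important!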

Approach

A constraint is added so that the CNN discovers meaningful filters.

Standard CNN: the filter bank depends on many parameters. (each element of the filter vector is learned directly)
SincNet: convolves the waveform with parametrized sinc functions. (the sinc functions implement band-pass filters) The low and high cutoff frequencies are the only filter parameters learned from data. (what does that mean?)
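
To make "the cutoff frequencies are the only parameters" concrete, here is a minimal NumPy sketch, not the authors' implementation. The helper name `sinc_bandpass`, the 251-tap length, the Hamming window, the 16 kHz sample rate, and the 500-1500 Hz band are all assumptions chosen for illustration. It builds a band-pass kernel from just two numbers and uses it to filter a two-tone signal.

```python
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, length=251, sample_rate=16000):
    """Band-pass kernel defined only by its two cutoffs (hypothetical helper)."""
    # Time index centred on zero so the kernel is symmetric (no phase shift).
    n = np.arange(length) - (length - 1) / 2
    f1 = f1_hz / sample_rate  # normalised cutoffs, in cycles per sample
    f2 = f2_hz / sample_rate
    # np.sinc(x) = sin(pi x) / (pi x), so 2f * np.sinc(2 f n) is an ideal
    # low-pass filter with cutoff f. The difference of two is a band-pass.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window softens the abrupt truncation of the ideal filter.
    return kernel * np.hamming(length)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate      # one second of audio
low_tone = np.sin(2 * np.pi * 200 * t)        # outside the 500-1500 Hz band
mid_tone = np.sin(2 * np.pi * 1000 * t)       # inside the band
waveform = low_tone + mid_tone

kernel = sinc_bandpass(500, 1500)             # two numbers define all 251 taps
filtered = np.convolve(waveform, kernel, mode="same")

# Compare away from the edges, where zero padding has no effect.
core = slice(500, -500)
err_mid = np.max(np.abs(filtered[core] - mid_tone[core]))
err_all = np.max(np.abs(filtered[core] - waveform[core]))
print(f"residual vs 1 kHz tone: {err_mid:.4f}, vs full input: {err_all:.4f}")
```

The output should sit very close to the 1 kHz tone alone, while the 200 Hz tone is suppressed. In a trainable layer, `f1_hz` and `f2_hz` would be the quantities updated by gradient descent, whereas a standard conv layer would update each of the 251 taps separately.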

Contribution:

flexibility

Experiments & Results

task: speaker recognition
minimal training data (only 12-15 seconds per speaker; short test sentences of 2-6 seconds)
SincNet: converges faster and performs better than a standard CNN and i-vectors

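CNN 1d:
as the filter slides along the signal, the output gets larger where the two correlate more => is that why the filters jump around???
every element of the filter is learned!
SincNet
only the small set of learnable parameters is trained
band-pass filter: keep only what is needed, throw the rest away!
=> rectangular function: apparently a rectangular shape is good at this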

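When defining the magnitude of the band-pass filter,

the band that gets suppressed is set by taking the difference of two low-pass filters?!

The only scalar parameters are f1 and f2.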
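
Written out, the "difference of two low-pass filters" idea looks like the block below. This is my reading of the construction, using the convention sinc(x) = sin(x)/x, with f1 and f2 as the low and high cutoffs. The first line is the rectangular magnitude in the frequency domain and the second is what the inverse Fourier transform gives in the time domain.

```latex
G[f, f_1, f_2] = \mathrm{rect}\!\left(\frac{f}{2 f_2}\right) - \mathrm{rect}\!\left(\frac{f}{2 f_1}\right)

g[n, f_1, f_2] = 2 f_2\,\mathrm{sinc}(2\pi f_2 n) - 2 f_1\,\mathrm{sinc}(2\pi f_1 n),
\qquad \mathrm{sinc}(x) = \frac{\sin x}{x}
```

Each rect term is an ideal low-pass filter (it passes everything up to its cutoff), so subtracting the narrower one from the wider one leaves only the band between f1 and f2.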

There were clear main peaks?!
So it looked at the parts that matter?! The result comes out clean..!? So it filtered well?

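EER (equal error rate): the error rate at the threshold where the false acceptance rate equals the false rejection rate
frame error rate, what is that?: in general a data-transmission quality metric, but here it presumably means the fraction of frames (short waveform chunks) assigned to the wrong speaker
sentence error rate??: in ASR, a sentence counts as wrong if even one word in it is wrong; here it presumably means the fraction of whole sentences assigned to the wrong speaker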
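
Since EER is easy to misread as a plain error rate, here is a small sketch of how it can be approximated from verification scores. The function name `equal_error_rate` and the toy scores are made up for illustration, and real toolkits usually interpolate along the ROC curve instead of picking the closest sampled threshold.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False rejection: a genuine trial scores below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance: an impostor trial scores at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # EER is where the two curves cross; take the closest sampled point.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

genuine = [0.9, 0.8, 0.7, 0.35]   # same-speaker trials (toy values)
impostor = [0.6, 0.3, 0.2, 0.1]   # different-speaker trials (toy values)
eer = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2f}")  # EER = 0.25
```

At a threshold of 0.6, one of four genuine trials is rejected and one of four impostor trials is accepted, so both error rates are 25% and that is the EER for this toy set.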

Question

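Where exactly does it differ from the existing approach? Would I get it from reading the code?! The 8,000 vs 160 part: where in the equations do those numbers come from?!
Fig. 2: what is the difference between the top and the bottom..?! Ah, one shows the time domain and the other the frequency domain, so they are the same filters either way?
And then converting to the time domain is equivalent to taking the inverse Fourier transform?

Why only the first layer! Do the other layers not matter?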
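
On the 8,000 vs 160 question, one reading that reproduces both numbers is a plain parameter count for the first layer: a standard CNN learns every tap of every filter, while SincNet learns one (f1, f2) pair per filter. The values F = 80 filters and L = 100 taps below are an assumption picked because they yield exactly those two figures.

```python
def first_layer_params(num_filters, filter_length):
    """Learnable first-layer weights: (standard CNN, SincNet)."""
    standard_cnn = num_filters * filter_length  # every tap is a free weight
    sincnet = 2 * num_filters                   # one (f1, f2) pair per filter
    return standard_cnn, sincnet

print(first_layer_params(80, 100))  # (8000, 160)
```

Note that the SincNet count does not depend on the filter length at all, so longer filters cost no extra parameters.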
