[NeurIPS 2022] Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

May 27, 2024

multi-modal representation


https://arxiv.org/abs/2203.02053

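As the title suggests, this paper is the first to identify the modality gap in multi-modal contrastive representation learning.

The modality gap is the phenomenon in which the embeddings of different modalities occupy separate regions of the representation space of a multi-modal model such as CLIP.

The authors explain the modality gap with three observations.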

Cone effect
- The embedding space of a model, whether it is pre-trained or has random weights, is confined to a very narrow cone.
Different random initializations create different embedding cones
- A multi-modal model with two encoders will therefore form a different cone for each encoder.
The contrastive learning objective keeps preserving this gap

The authors identify these three factors as the reasons the modality gap arises.
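
To make the first two points concrete, here is a toy sketch in NumPy. It is not the paper's setup: the MLP sizes, the `random_relu_encoder` helper, and the Gaussian inputs are all assumptions picked for illustration. It pushes the same random inputs through two randomly initialized ReLU MLPs with different seeds and compares the mean pairwise cosine similarity inside each embedding set with the similarity across the two sets.

```python
import numpy as np

def random_relu_encoder(seed, in_dim=64, width=256, out_dim=64, depth=4):
    """Build a randomly initialized ReLU MLP (hypothetical toy encoder, no training)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [width] * depth
    hidden = [rng.standard_normal((a, b)) / np.sqrt(a)
              for a, b in zip(dims[:-1], dims[1:])]
    proj = rng.standard_normal((width, out_dim)) / np.sqrt(width)

    def encode(x):
        h = x
        for w in hidden:
            h = np.maximum(h @ w, 0.0)  # ReLU layer
        z = h @ proj                    # final linear projection
        return z / np.linalg.norm(z, axis=1, keepdims=True)

    return encode

def mean_pairwise_cosine(a, b=None):
    """Mean cosine similarity between unit-norm rows (self-pairs excluded when b is None)."""
    if b is None:
        sims = a @ a.T
        n = len(a)
        return (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return (a @ b.T).mean()

rng = np.random.default_rng(0)
inputs = rng.standard_normal((500, 64))
unit_inputs = inputs / np.linalg.norm(inputs, axis=1, keepdims=True)

emb_a = random_relu_encoder(seed=1)(inputs)  # "encoder A"
emb_b = random_relu_encoder(seed=2)(inputs)  # "encoder B", different random init

input_cos = mean_pairwise_cosine(unit_inputs)
cos_a = mean_pairwise_cosine(emb_a)
cos_b = mean_pairwise_cosine(emb_b)
cross_cos = mean_pairwise_cosine(emb_a, emb_b)
center_gap = np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0))

print(f"inputs: {input_cos:.3f}  A: {cos_a:.3f}  B: {cos_b:.3f}")
print(f"A vs B: {cross_cos:.3f}  distance between centers: {center_gap:.3f}")
```

A high within-encoder similarity next to a lower cross-encoder similarity is what "each encoder has its own narrow cone" looks like in this toy setting. Real encoders and real data will give different numbers.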

The paper runs quite a few experiments, so I'll introduce just one of them.

Section 4 presents an experiment to verify that contrastive learning preserves the modality gap.

Embedding Shift Experiment
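- Hypothesis: the contrastive learning objective keeps the modality gap in place.
  - To check this, they build a loss landscape (of the contrastive loss) from n=5000 image-caption pairs.
  - They compute the center (the mean embedding) of the image embeddings and of the text embeddings, and define the modality gap as the difference between the two centers.
  - They then shift every image embedding and every text embedding in the direction that closes the modality gap (Figure 3(a)).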

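According to the formula, the shift is weighted by a scalar lambda: the measured gap (the difference of the two means), scaled by lambda, is subtracted from the image embeddings and added to the text embeddings, so images move toward the text side and text moves toward the image side.

The shifted embeddings are then normalized (projected back onto the unit hypersphere).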
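
The shift-then-normalize step described above can be sketched as follows. This is a minimal NumPy version under my own assumptions: the embeddings are synthetic clusters rather than CLIP outputs, and the function names (`modality_gap`, `shift_embeddings`) and the sign convention (gap = image center minus text center) are mine, not taken from the authors' code.

```python
import numpy as np

def normalize(v):
    """Project each row back onto the unit hypersphere."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def modality_gap(image_emb, text_emb):
    """Gap vector: center of the image embeddings minus center of the text embeddings."""
    return image_emb.mean(axis=0) - text_emb.mean(axis=0)

def shift_embeddings(image_emb, text_emb, lam):
    """Move images toward text and text toward images by lam * gap, then renormalize."""
    gap = modality_gap(image_emb, text_emb)
    return normalize(image_emb - lam * gap), normalize(text_emb + lam * gap)

# Synthetic stand-in for the two modalities: two noisy clusters on the sphere.
rng = np.random.default_rng(0)
dim, n = 32, 200
image_emb = normalize(rng.standard_normal(dim) + 0.3 * rng.standard_normal((n, dim)))
text_emb = normalize(rng.standard_normal(dim) + 0.3 * rng.standard_normal((n, dim)))

gap_before = np.linalg.norm(modality_gap(image_emb, text_emb))
image_shift, text_shift = shift_embeddings(image_emb, text_emb, lam=0.5)
gap_after = np.linalg.norm(modality_gap(image_shift, text_shift))

print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

With this convention, `lam=0` leaves the embeddings unchanged and `lam=0.5` makes the two centers coincide before the renormalization step.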

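Before shifting, the gap is 0.82 (the black vertical dashed line); after shifting, the contrastive loss increases.

Because the embeddings are unit-normalized, the Euclidean distance also carries the angle information ((x-y)^T(x-y) = 2(1-x^Ty), with cos(x,y) = x^Ty).

In Figure 3(b, c, d), when the shift is large enough that the two modalities swap positions (Euclidean distance below 0 on the axis), the landscape becomes repulsive, and the loss is optimal at the Euclidean distance of the unshifted embeddings (the 0.82 mark).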
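
For completeness, the identity relating Euclidean distance to cosine similarity follows in one expansion, assuming both vectors have unit norm (which holds after the normalization step):

```latex
\begin{aligned}
\|x - y\|^2 &= (x - y)^\top (x - y) \\
            &= x^\top x - 2\,x^\top y + y^\top y \\
            &= 2 - 2\,x^\top y \\
            &= 2\,\bigl(1 - \cos(x, y)\bigr),
\qquad \text{since } \|x\| = \|y\| = 1 \text{ and } \cos(x, y) = x^\top y .
\end{aligned}
```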

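In other words, having the modalities separated is not a problem from the point of view of the contrastive loss.
As the temperature increases, the repulsive behavior disappears,
and the global optimum of the loss forms at Euclidean distance 0 (i.e., the point where the embeddings are not separated).
At high temperature, the point where the modalities are not separated is the optimal point.
In conclusion, both the repulsive structure and the optimal gap are temperature-dependent.
When fine-tuning with a high temperature (0.1, 1), the modality gap was reduced.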
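
To get a feel for how such a loss landscape is computed, here is a rough NumPy sketch that sweeps the shift weight at several temperatures. Everything here is an assumption for illustration: the symmetric InfoNCE form of `contrastive_loss`, the synthetic paired embeddings, the lambda grid, and the temperature list. It will not reproduce the curves in Figure 3, which come from real CLIP embeddings.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def log_softmax_rows(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature):
    """Symmetric InfoNCE loss; row i of each matrix is assumed to be a matched pair."""
    logits = image_emb @ text_emb.T / temperature
    image_to_text = -np.diag(log_softmax_rows(logits)).mean()
    text_to_image = -np.diag(log_softmax_rows(logits.T)).mean()
    return 0.5 * (image_to_text + text_to_image)

def loss_landscape(image_emb, text_emb, lambdas, temperature):
    """Contrastive loss after shifting both modalities by lam * gap, for each lam."""
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    losses = []
    for lam in lambdas:
        img = normalize(image_emb - lam * gap)
        txt = normalize(text_emb + lam * gap)
        losses.append(contrastive_loss(img, txt, temperature))
    return np.array(losses)

# Synthetic pairs: shared content per pair plus a modality-specific offset.
rng = np.random.default_rng(0)
n, dim = 200, 32
content = rng.standard_normal((n, dim))
image_emb = normalize(content + rng.standard_normal(dim))
text_emb = normalize(content + rng.standard_normal(dim)
                     + 0.1 * rng.standard_normal((n, dim)))

lambdas = np.linspace(-0.5, 1.5, 9)  # lam = 0 means "no shift"
for temperature in (0.01, 0.1, 1.0):
    losses = loss_landscape(image_emb, text_emb, lambdas, temperature)
    print(f"temperature={temperature}: {np.round(losses, 3)}")
```

Plotting each row against the gap distance after the shift (instead of against lambda) gives an axis comparable to the one used in the figure.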

