BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

이슬비·2024년 7월 23일

Backdoor attack unlearning

paper

목록 보기

5/5

Publish: CVPR 2024
Year: 2024
LINK: https://openaccess.thecvf.com/content/CVPR2024/papers/Liang_BadCLIP_Dual-Embedding_Guided_Backdoor_Attack_on_Multimodal_Contrastive_Learning_CVPR_2024_paper.pdf

Summary

CLIP과 같은 multimodal contrastive learning methods (MCL) 는 방대한 양의 data로부터 학습되어 여러 downstream task를 수행할 수 있는 foundation model로 작용

하지만 malicious attack에 취약함 = backdoor attacks
- 단점: malicious examples를 training data에 주입하여 test time에 특정 input에 대해 잘못된 예측을 수행함
- 장점: privacy/copyright protection에 있어 이점 + defense를 더 강화한다는 장점
backdoor attack은 backdoor defense 방법을 통해 충분히 방어 가능
- clean dataset으로 fine-tuning 하여 malicious effects를 제거
- pretrained MCL을 이용하여 encoder에 backdoor가 있는지 탐지
본 논문에서는, 위와 같은 defense 방법을 사용함에도 불구하고 여전히 효과적임을 보임
- Bayesian rule의 입장에서 backdoor attack 재정의
  - poisoned model parameter와 clean model parameter 간에는 매우 작은 차이를 가져야함
  - poisoned dataset은 clean fine-tuning dataset과 매우 유사해야함
- 이러한 분석을 기반으로 강력한 CLIP에 대해 backdoor attack인 BadCLIP 제안
  - textual embedding consistency optimization 제안: visual trigger pattern이 target label을 향하도록 함
  - visual embedding resistance optimization 제안: poisoned sample이 target label의 original visual feature와 더 잘 정렬되도록 함

Method

Bayesian rule's analysis

Pre-training process

Notation
- Initial model parameters distribution $P(\Theta)$
- pre-trining dataset $\mathcal{D}_0$

P(\Theta|\mathcal{D}_0) \propto P(\mathcal{D}_0|\Theta)P(\Theta)

Poisoning process

Notation
- pre-trained model $\Theta^{(0)}$
- poisoning training set $\mathcal{D}_1$ : clean dataset에 약간의 perturbation을 추가한 것
  - clean set: $v_i$
  - poisioned set: $\hat{v}_i = v_i + \delta_v$

P(\Theta^{(0)}|\mathcal{D}_1) \propto P(\mathcal{D}_1|\Theta^{(0)})P(\Theta^{(0)})

이때 backdoor를 주입한다는 것은 결국 pre-trained model을 poisoned dataset으로 fine-tuning 하는 것 $\Theta^{(1)} = \argmin_{\Theta^{(0)}+\epsilon}-\sum_{i=1}^{N_1}\log\frac{g(\{{v_i}^{(1)}, {t_i}^{(1)}\};\Theta^{(0)} + \epsilon)}{\sum_{j=1}^{N_1}g(\{{v_i}^{(1)}, {t_i}^{(1)}\};\Theta^{(0)} + \epsilon)}$

Defense process

User/defender들은 third-party poisoned model $\Theta^{(1)}$ 을 사용
Notation
- clean dataset $\mathcal{D}_2$

P(\Theta^{(0)}|\mathcal{D}_2, \mathcal{D}_1) \propto P(\mathcal{D}_2|\Theta^{(0)}, \mathcal{D}_1)(P(\mathcal{D}_1|\Theta^{(0)})P(\Theta^{(0)}))

원래의 pre-trianed model과 clean dataset을 이용하여 fine-tuning 한 model은 서로 비슷해야함
즉, $P(\Theta^{(0)}|\mathcal{D}_2, \mathcal{D}_1) \sim P(\Theta^{(0)})$
이를 이용하여 위의 수식을 다시 작성해보면, $P(\Theta^{(0)}) \propto P(\mathcal{D}_2|\Theta^{(0)}, \mathcal{D}_1)(P(\mathcal{D}_1|\Theta^{(0)})P(\Theta^{(0)}))$

Motivation

Poisoned model parameters $\Theta^{(1)}$ 과 clean model parameter $\Theta^{(0)}$ 의 차이는 매우 작아야함

P(\Theta^{(0)}|\mathcal{D}_1) \propto P(\mathcal{D}_1|\Theta^{(0)})P(\Theta^{(0)})

위의 수식을 볼 때, 기존 model parameter와 차이가 많이 나면 안됨

Poisoned dataset은 clean dataset과 매우 비슷해야함

P(\Theta^{(0)}) \propto P(\mathcal{D}_2|\Theta^{(0)}, \mathcal{D}_1)(P(\mathcal{D}_1|\Theta^{(0)})P(\Theta^{(0)}))

위의 수식에서 $\mathcal{D}_2$ 과 $\mathcal{D}_1$ 서로 반대의 의미를 내포하고 있기 때문에 $\mathcal{D}_2$ 를 이용하여 backdoor의 영향을 줄일 수 있는 것

BadCLIP Attack Design

Textual embedding consistency Optimization

첫 번째 motivation을 해결하기 위한 방법
$\mathcal{D}_1$ 이 original model에 주는 영향을 최소화
```
- Text: inference phase의 text는 attacker 수정할 수 없다는 가정 하에서 구성 -> target text를 $$\mathcal{T}$$로 정의
- Image: model parameter의 미묘한 변화를 유도하기 위해 visual trigger pattern을 찾는 것을 목표 (아래의 optimizer 식을 이용하여!)
```
$\Theta^{(1)} = \min_{\Theta^{(0)}+\epsilon}\min_{\hat{v}_i^{(1)}}-\sum_{i=1}^{N_1}\log\frac{g(\{\hat{v}_i^{(1)}, \mathcal{T}_i^*\};\Theta^{(0)} + \epsilon)}{\sum_{j=1}^{N_1}g(\{\hat{v}_i^{(1)}, {t_i}^{(1)}\};\Theta^{(0)} + \epsilon)}$
위의 식이 의미하는 바는 original model에 대한 $\mathcal{D}_1$ 의 영향을 줄이는 것
여기서 최적의 시나리오는 모델 $\Theta^{(0)}$ 을 수정하지 않고도 얻을 수 있는 자연스러운 backdoor가 존재한다는 것!
즉, visual trigger pattern을 찾아 원래 모델을 목표 텍스트로 오도할 수 있는 경우를 의미
다시 말해, 모델 파라미터를 변경하지 않고도 최소한의 손실을 달성할 수 있는 visual trigger pattern 을 최적화 하는 것으로 이어짐

\mathcal{L}_t = -\sum_{i=1}^{N_1} \log \frac{g(\{\hat{v}_i^{(1)}, \mathcal{T}_i^*\}; \Theta^{(0)})}{\sum^{N_1}_{j=1}g(\{\hat{v}_i^{(1)}, t_j^{(1)}\}; \Theta^{(0)})}

Visual embedding consistency Optimization

backdoor forgetting을 방지하기 위해 $\mathcal{D}_2$ 와 $\mathcal{D}_1$ 간의 충돌을 막아야 함 (매우 비슷해야 함)
하지만! $\mathcal{D}_2$ 는 attacker가 접근 불가
중요한 사실: $\mathcal{D}_2$ 는 original training dataset $\mathcal{D}_0$ 을 반영하고 있어야 함 (매우 비슷)
- 왜? model usability를 보존하고 clean performance를 보존하고 있어야 하고 있기 때문
즉, $\mathcal{D}_1$ 은 $\mathcal{D}_0$ 과 비슷해야 함 $\mathcal{L}_i^p = \sum_{i=1}^{N_1}d(f^v(\hat{v}_i^{(1)};\theta_v^{(0)});f^v(\mathcal{I}_i^*;\theta_v^{(0)}))$
이때 trigger가 담긴 sample이 anchor sample 이므로 해당 이미지가 다른 이미지와도 구별되도록 negative sample과의 connection도 추가 $\mathcal{L}_i^n = \sum_{i=1}^{N_1}d(f^v(\hat{v}_i^{(1)};\theta_v^{(0)});f^v(v_i^{(1)};\theta_v^{(0)}))$

Overall Poisoning Process

Trigger pattern optimization

patch-based visual trigger pattern $\delta_v \in \R^{w \times h \times c}$
Total loss $\mathcal{L} = \mathcal{L}_t + \lambda_1 \times \max(0, \mathcal{L_i^p + \lambda_2 \times \mathcal{L}_i^n} + \eta)$

Experiments

Setup

Models & dataset
- CLIP model 사용
- CC3M dataset으로부터 500K의 image-text pairs 선택
- 1500개의 sample은 target label인 banana로 poisoning
Evaluation
- Clean Accuracy (CA)
- Attack success rate (ASR)
Backdoor attack
- 흔히 쓰이는 7개의 method
  - unimodal: BadNet, Bleded, SIG, SSBA
  - multimodel: TrojanVQA (visual question & answering)
  - SSL: mmPosion, BadEncoder
Backdoor defense
- DECREE: backdoor detection on pre-trained encoders
- FT: fine-tuning with clean dataset
- CleanCLIP: CLIP에 특화된 defense method
Implementation Details
- $\lambda_1$ = 500
- $\lambda_2$ = 1
- $\eta$ = 1

Results

Effectiveness of attacks

No-defense senario에서 가장 높은 ASR를 보여줌
defense senario
- CC3M에서 100개의 pair를 random sampling 하여 clean dataset으로 삼음
- fine-tuning 이후에 CA가 약간 하락함을 알 수 있음 (FT를 이용한 defense의 한계)
- BadCLIP은 CleanCLIP, FT를 이용한 defense에도 불구하고 높은 ASR을 보여줌

DECREE defense 에서의 attack mehtod 비교
BadCLIP: Clean Encoder의 양상과 값을 가장 일관적으로 가지고 있는 method

Attacks on the Linear Probe Task

CLIP이 downstream task에서 사용된다는 상황에서 evaluation
50K의 dataset을 ImageNet로부터 random sampling
일종의 fine-tuning defense 방법이라고도 볼 수 있지만, feature extractor를 freeze 시키고 linear layer만 학습한다는 점에서 차이가 있음 (fine-tuning defense의 특이 케이스)

CA는 대부분의 method가 64%에 머무는 것을 확인
ASR 측면에서는 BadCLIP이 가장 좋은 성능을 보임

Attacks on More rigorous senarios

Fine-tuning poisoned model on cross-domain data
: defender가 poisoned dataset의 domain/distribution을 알고 있음 + 다른 domain/distribution의 clean dataset으로 fine-tuning 했을 때
- Dataset
  - poisoned dataset: CC3M subset
  - CleanCLIP defense: SBU caption subset
- Result
  - ASR이 많이 떨어지는 다른 attack method에 비해 BadCLIP은 robust한 편

Poisoned data detection on pre-trained CLIP
: 의심스러운 dataset을 defender가 확보 했다고 가정 + 이를 정제한 dataset으로 pre-trained dataset을 re-train
- 방법
  - ABL defense 방법 채택
- Result
  - 각 마커라인을 보았을 때, BadCLIP이 다른 방법보다 clean sample의 분포에 더 가깝다는 것을 알 수 있음
  - 즉, clean sample과의 구분이 어려워 detection이 어려움

Analysis

Ablation study
- Poisoned Pairs Sampling Strategy (PPS) 와 최종 loss term을 함께 썼을 때 가장 높은 성능의 결과를 보여줌

Triger patch size
- Patch size가 커지면 커질수록 ASR도 높아지지만, 16 X 16 이후에는 유사
- 16 X 16 을 default size로 설정
Poisoned sample numbers
- 1500개의 sample에서 peak를 찍고 내려옴
- 1500개를 default numbers로 설정

이슬비

정말 알아?

이전 포스트

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

paper

Summary

Method

Bayesian rule's analysis

Pre-training process

Poisoning process

Defense process

Motivation

BadCLIP Attack Design

Textual embedding consistency Optimization

Visual embedding consistency Optimization

Overall Poisoning Process

Trigger pattern optimization

Experiments

Setup

Results

Effectiveness of attacks

Attacks on the Linear Probe Task

Attacks on More rigorous senarios

Analysis

Test-Time Linear Out-of-Distribution Detection

0개의 댓글