Learning Deep Learning from the Creator of Keras: Chapter 12

코넬 · February 23, 2023

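Deep Learning for Generative Models

Let's look, from several angles, at how the deep learning techniques covered so far can be used for artistic creation.

Text generation

Generating sequence data

Let's see how to generate sequence data with a recurrent neural network.
The usual way to generate sequence data in deep learning is to train a model to predict the next token, or the next few tokens, of a sequence, using the previous tokens as input.
Any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language, that is, its statistical structure.

Once a language model is trained, you can sample from it: feed it an initial string of text (called conditioning data), have it generate the next character or word, append the generated output to the input, and repeat the process many times.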
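
The sampling loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `TOY_MODEL` is a hypothetical, hand-written table of next-word probabilities that stands in for a trained network, and `generate` is a made-up helper name.

```python
import random

# Hypothetical stand-in for a trained language model: for each previous word,
# a probability distribution over the next word. The values are made up.
TOY_MODEL = {
    "this": {"movie": 0.9, "film": 0.1},
    "movie": {"is": 0.7, "was": 0.3},
    "film": {"is": 0.5, "was": 0.5},
    "is": {"great": 0.6, "boring": 0.4},
    "was": {"great": 0.5, "boring": 0.5},
    "great": {"this": 1.0},
    "boring": {"this": 1.0},
}

def generate(prompt, num_tokens, rng):
    """Repeatedly sample the next word and append it to the running text."""
    tokens = prompt.split()
    for _ in range(num_tokens):
        probs = TOY_MODEL[tokens[-1]]      # condition on the text so far
        words = list(probs)
        weights = [probs[w] for w in words]
        next_word = rng.choices(words, weights=weights, k=1)[0]
        tokens.append(next_word)           # the output becomes part of the input
    return " ".join(tokens)

text = generate("this movie", num_tokens=4, rng=random.Random(0))
print(text)
```

A real language model conditions on the whole preceding sequence rather than only the last word, but the loop has the same shape: predict, sample, append, repeat.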

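The importance of the sampling strategy

When generating text, how the next token is chosen matters a great deal. There are two basic approaches.

Greedy sampling: the simplest approach. Always pick the most probable token. Because this produces repetitive, predictable strings, the result does not read like coherent language.
Stochastic sampling: introduce randomness by sampling the next word from the probability distribution. For example, if a word has a probability of 0.3 of being next in the sentence, the model picks it about 30% of the time.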
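
To make the difference concrete, here is a small sketch that draws 1,000 samples with each strategy from the same distribution. The vocabulary and probabilities are hypothetical values chosen for illustration, and the function names are not from any library.

```python
import random
from collections import Counter

vocab = ["good", "bad", "great", "boring"]
probs = [0.3, 0.4, 0.2, 0.1]  # hypothetical next-word probabilities

def greedy_sample(probs):
    """Always return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def stochastic_sample(probs, rng):
    """Draw an index at random, weighted by the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(42)
greedy_picks = Counter(vocab[greedy_sample(probs)] for _ in range(1000))
stochastic_picks = Counter(
    vocab[stochastic_sample(probs, rng)] for _ in range(1000))

print("greedy:    ", dict(greedy_picks))
print("stochastic:", dict(stochastic_picks))
```

Greedy sampling returns the same word every time, while stochastic sampling picks "good" in roughly 30% of the draws, in line with its probability of 0.3.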

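Sampling stochastically from the model's softmax output is a sound approach, but on its own it offers no way to control how much randomness goes into the choice. Both extremes are unhelpful: uniform random sampling has maximum entropy and yields noise, while greedy sampling has minimum entropy and yields repetition. To control the amount of randomness in sampling, a parameter called the softmax temperature is used. It characterizes the entropy of the probability distribution used for sampling: a higher temperature gives a higher-entropy distribution and more surprising output, while a lower temperature gives less randomness and more predictable output.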
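
As a rough sketch of what temperature does, the snippet below reweights a distribution by dividing its log-probabilities by the temperature and renormalizing, then reports the entropy of the result. The input distribution is a made-up example, and `reweight_distribution` and `entropy` are illustrative helper names.

```python
import math

def reweight_distribution(probs, temperature=1.0):
    """Divide log-probabilities by the temperature, then renormalize."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

original = [0.5, 0.3, 0.15, 0.05]  # hypothetical softmax output

low = reweight_distribution(original, temperature=0.2)
same = reweight_distribution(original, temperature=1.0)
high = reweight_distribution(original, temperature=1.5)

for name, dist in (("T=0.2", low), ("T=1.0", same), ("T=1.5", high)):
    print(name, [round(p, 3) for p in dist], "entropy:", round(entropy(dist), 3))
```

A temperature of 1.0 leaves the distribution unchanged, a lower temperature concentrates probability on the most likely token, and a higher temperature flattens the distribution.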

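Implementing a text generation model with Keras

Let's implement it. Training the model requires a lot of text data. We will use the IMDB movie review dataset and train a model to generate movie reviews that it has never seen before.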

!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
# Download and extract the IMDB movie review dataset.

import tensorflow as tf
from tensorflow import keras
dataset = keras.utils.text_dataset_from_directory(
    directory="aclImdb", label_mode=None, batch_size=256)
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))
# Create a dataset from text files (one file = one sample).

# Prepare the TextVectorization layer.
from tensorflow.keras.layers import TextVectorization

sequence_length = 100
vocab_size = 15000
text_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
text_vectorization.adapt(dataset)

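We use this layer to create a language modeling dataset: each input sample is a vectorized text, and the corresponding target is the same text offset by one step.

Let's build the language modeling dataset.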

def prepare_lm_dataset(text_batch):
    vectorized_sequences = text_vectorization(text_batch)
    x = vectorized_sequences[:, :-1]  # Inputs: drop the last word of each sequence.
    y = vectorized_sequences[:, 1:]  # Targets: drop the first word of each sequence.
    return x, y

lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)
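
To see what the one-step offset looks like, here is the same slicing applied to a small NumPy array. The token ids are made up for illustration, with 0 standing for padding.

```python
import numpy as np

# Hypothetical batch of two vectorized sequences (0 = padding).
batch = np.array([[11, 12, 13, 14, 15],
                  [21, 22, 23, 0, 0]])

x = batch[:, :-1]  # inputs: every token except the last
y = batch[:, 1:]   # targets: every token except the first

print("inputs: ", x.tolist())
print("targets:", y.tolist())
```

At every position, the target is the token that follows the corresponding input token, so a sequence of length 100 yields 99 input/target pairs.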

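A Transformer-based sequence-to-sequence model

The goal is to train a model that predicts a probability distribution over the next word of a sentence, given a few initial words.
Once the model is trained, we feed it a prompt, sample the next word, append that word to the prompt, and repeat until a short paragraph has been generated.
A naive setup that takes N words as input and predicts only word N+1 runs into two problems in the context of sequence generation.

First, the model would only learn to make predictions when N words are available, but it should be able to start predicting from fewer than N words. Otherwise, generation would always require a relatively long prompt.
Second, many of the training sequences overlap heavily. For example, splitting "I have a lot of apples" into windows gives "I have a", "have a lot", "a lot of", "lot of apples", and so on. A model that treats each such sequence as an independent sample does a lot of redundant work, re-encoding subsequences it has largely seen before. For dense and convolutional models, redoing that work every time is the only option.

To address both problems, we use a sequence-to-sequence model.

We feed sequences of N words (indexed 0 to N) into the model and predict the sequence offset by one step (1 to N+1). Causal masking ensures that, at any index i, the model uses only words 0 through i to predict word i+1.

In other words, the model is trained simultaneously on N mostly overlapping but distinct problems: predicting the next word given each prefix of the sequence.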
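
As an illustration of causal masking, the sketch below builds the lower-triangular mask for a sequence of length 4 with NumPy. It uses the same `i >= j` comparison as the TransformerDecoder further down, without the batch dimension; `causal_mask` is a hypothetical helper name.

```python
import numpy as np

def causal_mask(sequence_length):
    """Entry (i, j) is 1 when position i may attend to position j (j <= i)."""
    i = np.arange(sequence_length)[:, np.newaxis]  # column of query positions
    j = np.arange(sequence_length)                 # row of key positions
    return (i >= j).astype("int32")

mask = causal_mask(4)
print(mask)
```

Row i has ones only in columns 0 through i, so the prediction made at position i cannot look at later words.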

Now let's set up the model. We reuse the PositionalEmbedding and TransformerDecoder building blocks from Chapter 11.

# A simple Transformer-based language model (the classes below are the Chapter 11 building blocks).
import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super(TransformerDecoder, self).get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # No mask was propagated: keep padding_mask defined for attention_2.
            padding_mask = None
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

from tensorflow.keras import layers
embed_dim = 256
latent_dim = 2048
num_heads = 2

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")

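A text generation callback with variable-temperature sampling

We will use a callback to generate text with a range of temperatures at the end of every epoch. This shows how the generated text evolves as the model converges, and how temperature affects the sampling strategy.
Let's write the text generation callback.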

import numpy as np

tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))

def sample_next(predictions, temperature=1.0):
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)

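class TextGenerator(keras.callbacks.Callback):
    def __init__(self,
                 prompt,
                 generate_length,
                 model_input_length,
                 temperatures=(1.,),
                 print_freq=1):
        super().__init__()
        self.prompt = prompt
        self.generate_length = generate_length
        self.model_input_length = model_input_length
        self.temperatures = temperatures
        self.print_freq = print_freq
        # Count the real (non-padding) tokens in the prompt; index 0 is padding.
        vectorized_prompt = text_vectorization([prompt])[0].numpy()
        self.prompt_length = int(np.count_nonzero(vectorized_prompt))

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.print_freq != 0:
            return
        for temperature in self.temperatures:
            print("== Generating with temperature", temperature)
            sentence = self.prompt
            for i in range(self.generate_length):
                tokenized_sentence = text_vectorization([sentence])
                predictions = self.model(tokenized_sentence)
                # Use the distribution at the last real token of the sentence so far,
                # and pass the temperature so each setting actually changes sampling.
                next_token = sample_next(
                    predictions[0, self.prompt_length - 1 + i, :], temperature)
                sampled_token = tokens_index[next_token]
                sentence += " " + sampled_token
            print(sentence)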

prompt = "This movie"
text_gen_callback = TextGenerator(
    prompt,
    generate_length=50,
    model_input_length=sequence_length,
    temperatures=(0.2, 0.5, 0.7, 1., 1.5))

Calling the fit() method trains the language model:

model.fit(lm_dataset, epochs=200, callbacks=[text_gen_callback])

# Earlier epochs omitted

Epoch 7/200
390/391 [============================>.] - ETA: 0s - loss: 4.2607== Generating with temperature 0.2
This movie is portrayed a just more like extreme a depictions greek of republicans [UNK] who on wind another investigates matter the after history the of deaths film death makers scene knows with nothing one but [UNK] an really entirely interesting realistic story view making of meaningful michael situations man in expertly
== Generating with temperature 0.5
This movie somewhat is dark a and psychological movie drama to everything study about into nightclub relationships backdrop with but a its body plot office does network drag undergoes rounds a it convincing has can a be testament suspenseful to performances sugar from and old george tables c performance s ive who
== Generating with temperature 0.7
This movie satire is about in a the real bizarre news situation for in example years that on the earth simplest of of it life deals but with about an not 16 gentle year comedy not but heart a prepares great for movie guys popping [UNK] up and every cast day warrants
== Generating with temperature 1.0
This movie version leaves of no it sleeps sucked because divorced it wife got was embark better i than have average some at teenagers the but same i how could not you enjoy care it while about doing lowbudget something things short whereas and both then love works before his you surprise
== Generating with temperature 1.5
This movie show is could played not out compare of to a the buddy reality movie tv [UNK] show at all the costs characters and would plot be go less wrong than some getting kind together of a a lot series of that american people stereotype are those so ones superior i
391/391 [==============================] - 29s 74ms/step - loss: 4.2607
Epoch 8/200
390/391 [============================>.] - ETA: 0s - loss: 4.2164== Generating with temperature 0.2
This movie is is so an wonderful odd dialogues funny which and has surprisingly the enough rare magic [UNK] and to very make serious something movies of with the characters effect absolutely it frightening makes is you one hope of this a scene thespian that who makes cares no about sense the
== Generating with temperature 0.5
This movie movie should pretty have much a [UNK] lot production of television slow in but intensity at than all moments the that plot make and it this look film too felt high [UNK] if for only it a was time good when movie the [UNK] lead [UNK] the name acting actors
== Generating with temperature 0.7
This movie [UNK] is with loosely about [UNK] the woman notorious legend of in a london contemporary ian [UNK] [UNK] and in coincidence the it movie was is [UNK] such ernest a [UNK] greater [UNK] than [UNK] fun career but of watching manners it and in even one the [UNK] directors reputation
== Generating with temperature 1.0
This movie movie is has fantastic the special story effects the are [UNK] excellent [UNK] and [UNK] sweet use that of sound [UNK] in has order its good entire [UNK] dvd and resembles a a [UNK] usual screenplay 80s changes story and here locations the very music at supports the the first
== Generating with temperature 1.5
This movie was from ever the by first the fairness [UNK] i there watched was the nothing movie like just acting a no guy feeling in was the most music just [UNK] kept [UNK] me in going awe after but [UNK] not through a the [UNK] [UNK] mix should comes make with
391/391 [==============================] - 29s 74ms/step - loss: 4.2163
Epoch 9/200
390/391 [============================>.] - ETA: 0s - loss: 4.1789== Generating with temperature 0.2
This movie interesting should 20 never or have have the been redeeming so feature bad for it a while while maybe long if bloodbath you nine can lives get than some spot [UNK] on amongst the others two youd liberals expect do of think that a theres high something point about in
== Generating with temperature 0.5
This movie is [UNK] so bad cute it that has brings been with hacked films down like images when this you film will it frequent is the evil makers japanese of sign this onto up television from screen harmony italian to the summarize point the defending writing wall for eliminate us all
== Generating with temperature 0.7
This movie show could africanamerican actually man be me an haha american thats idiot true format its anyways result the the episodes plot held involves the a kind dog of who australian happens [UNK] to his be friends increasingly are difficult petty to crazed catch hunter phrase from too criminal stairs to
== Generating with temperature 1.0
This movie movie is makes about you look people at who your dies parents in cheated the upon sex psychopathic drugs numbers plenty logic of and gay distant prostitutes plot and points somewhat to the make whole sure movie isnt is to funny say but never i things also ever does seen
== Generating with temperature 1.5
This movie movie is is such made a up bad of screenplay bad unworthy acting of is the far worst worst acting comedy ive ever ever seen seen it and is i boring recommend it dislike as other a films movie that i produces had film no types rules of such movies

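In general, a low temperature produces very monotonous, repetitive text, and the generation step can get stuck in a loop (the same phrases keep coming up). At very high temperatures, local structure starts to break down and the output looks largely random.
Picking a suitable temperature means inspecting outputs like these; here, a good generation temperature appears to be around 0.7.
It is worth experimenting further with different sampling strategies to find what works best.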

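DeepDream

DeepDream is an artistic image-modification technique that uses the representations learned by convolutional neural networks.
DeepDream images are full of hallucinatory, algorithmically generated artifacts such as bird feathers and dog eyes, because the convnet behind them was trained on ImageNet, a dataset with many breeds of dogs and species of birds.
The algorithm works much like convnet filter visualization (gradient ascent on the input image), with the following differences.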

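DeepDream maximizes the activation of entire layers rather than that of a specific filter, mixing together visualizations of a large number of features at once.

It starts not from a blank or slightly noisy input but from an existing image. As a result, the effects latch on to preexisting visual patterns and distort elements of the image in a somewhat artistic fashion.

The input image is processed at several different scales (called octaves), which improves the visual quality of the result.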
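
The multi-scale processing can be sketched as a schedule of image shapes, from smallest to largest. This is only an illustration: the function name, the default of three octaves, the scale factor of 1.4, and the 900x1200 image size are assumptions chosen for the example, not values taken from this post.

```python
def octave_shapes(original_shape, num_octave=3, octave_scale=1.4):
    """Return the image shapes to process, from the smallest scale up to the original."""
    shapes = [tuple(original_shape)]
    for i in range(1, num_octave):
        # Each successive octave is smaller by a factor of octave_scale.
        shapes.append(
            tuple(int(dim / (octave_scale ** i)) for dim in original_shape))
    return shapes[::-1]

shapes = octave_shapes((900, 1200))
for height, width in shapes:
    print(f"process at {height}x{width}")
```

In a full implementation, gradient ascent on the layer activations would be run at each of these shapes in turn, with the image upscaled between octaves.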

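Implementing DeepDream in Keras

The details can be found in the code.

DeepDream generates its output by running a convnet in reverse, based on the representations the network has learned.
The process is not specific to image models or convnets; it can also be applied to speech, music, and more.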

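Neural style transfer

Another major development in deep-learning-driven image modification is neural style transfer.
Neural style transfer applies the style of a reference image to a target image while preserving the content of the target image.
Here, style means the textures, colors, and visual patterns in the image at various spatial scales. Content is the higher-level macrostructure of the image.

The key idea behind the implementation is the same as for every deep learning algorithm: define a loss function that expresses the goal, then minimize it. The goal here is to preserve the content of the original image while adopting the style of the reference image.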

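Content loss

Activations from the lower layers of a network contain local information about the image, whereas activations from higher layers contain increasingly global, abstract information. In other words, the activations of a convnet's different layers decompose the contents of an image over different spatial scales.

Feed the target image and the generated image into a pretrained convnet and compute the activations of an upper layer for each. The L2 norm between these two sets of activations is a good candidate for the content loss: it makes the generated image look similar to the original target image as seen from the upper layer. Assuming that what the upper layers of a convnet see is really the content of the input image, this works as a way to preserve image content.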
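
A minimal sketch of such a content loss, assuming the upper-layer activations have already been extracted as arrays of shape (height, width, channels). Random arrays stand in for real convnet activations here, and the squared L2 distance (sum of squared differences) is used as the measure.

```python
import numpy as np

def content_loss(base_features, combination_features):
    """Squared L2 distance between two sets of upper-layer activations."""
    return float(np.sum(np.square(combination_features - base_features)))

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4, 8))  # placeholder for the target image's activations
identical = base.copy()            # a generated image with the same activations
perturbed = base + 0.1             # a generated image that drifted slightly

print(content_loss(base, identical))
print(content_loss(base, perturbed))
```

The loss is zero when the two sets of activations match and grows as the generated image's upper-layer representation moves away from that of the target.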

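Style loss

The content loss uses only a single upper layer, but the style loss uses multiple layers of the convnet. The aim is to capture the appearance of the style reference image at all the spatial scales extracted by the convnet, not just one. For the style loss, Gatys et al. use the Gram matrix of a layer's activations: the inner product of the feature maps of that layer. This inner product can be understood as a map of the correlations between the layer's features, and those correlations capture the statistics of the patterns at a particular spatial scale. The layer activations are computed for both the style reference image and the generated image, and the style loss aims to keep the correlations within those activations similar.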
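
The Gram matrix and a per-layer style loss can be sketched with NumPy as follows. The activations are random placeholders of shape (height, width, channels), and the normalization constant is one common choice that is assumed here rather than taken from this post.

```python
import numpy as np

def gram_matrix(features):
    """Inner products between the flattened feature maps of one layer."""
    channels = features.shape[-1]
    flat = features.reshape(-1, channels).T  # shape: (channels, height * width)
    return flat @ flat.T                     # shape: (channels, channels)

def style_loss(style_features, combination_features):
    """Squared difference between Gram matrices, scaled by an assumed constant."""
    S = gram_matrix(style_features)
    C = gram_matrix(combination_features)
    height, width, channels = style_features.shape
    size = height * width
    return float(np.sum(np.square(S - C)) / (4.0 * channels**2 * size**2))

rng = np.random.default_rng(0)
style = rng.normal(size=(4, 4, 3))
# Same activations at shuffled spatial positions: same statistics, different layout.
shuffled = rng.permutation(style.reshape(-1, 3)).reshape(4, 4, 3)
other = rng.normal(size=(4, 4, 3))

print(style_loss(style, shuffled))
print(style_loss(style, other))
```

Shuffling the spatial positions leaves the Gram matrix essentially unchanged, which illustrates why this loss captures texture-like statistics rather than the layout of the image. A full style loss would sum this quantity over several layers.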

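To summarize:

Preserve content by keeping the upper-layer activations similar between the original image and the generated image. The convnet should see both images as containing the same things.

Preserve style by keeping the correlations within activations similar, for both low-level and high-level layers. Feature correlations capture textures, so the generated image and the style reference image should share the same textures at different spatial scales.

Implementing neural style transfer in Keras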