[Deep Learning] Padding

백건 · January 20, 2022

육알(育AI) - AI Nurturing Project

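Meaning

In natural language processing, sentences have different lengths, so a batch of them cannot be stored directly as a single matrix. Padding makes every sentence the same length so that the whole batch can be processed at once as one matrix.
Deep learning relies on parallel computation, which requires all sentences in a batch to have the same length.
To equalize the lengths, 0s are appended to the end of a sentence or prepended to its front.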

Flow

Sentence → split into words → replace each word with an integer (encoding) → pad to the length of the longest sentence (or to a chosen length)
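The flow above can be sketched in plain Python, without Keras. This is only a minimal illustration and not how the Keras Tokenizer is implemented: the helper names (build_vocab, encode, pad_post) and the sample sentences are made up for this example, and splitting on whitespace is an assumption.

```python
from collections import Counter

def build_vocab(tokenized):
    """Map each word to an integer starting at 1 (more frequent words get smaller numbers)."""
    counts = Counter(word for sent in tokenized for word in sent)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokenized, vocab):
    """Replace every word with its integer."""
    return [[vocab[word] for word in sent] for sent in tokenized]

def pad_post(encoded, max_len=None, value=0):
    """Append `value` until every sentence has length max_len (default: the longest sentence)."""
    if max_len is None:
        max_len = max(len(sent) for sent in encoded)
    return [sent[:max_len] + [value] * (max_len - len(sent)) for sent in encoded]

# Hypothetical sample sentences
sentences = ["the barber kept a secret", "the secret", "a huge secret"]

tokenized = [s.split() for s in sentences]   # sentence -> words
vocab = build_vocab(tokenized)               # word -> integer
encoded = encode(tokenized, vocab)           # integer encoding
padded = pad_post(encoded)                   # equal lengths

print(encoded)  # [[2, 4, 5, 3, 1], [2, 1], [3, 6, 1]]
print(padded)   # [[2, 4, 5, 3, 1], [2, 1, 0, 0, 0], [3, 6, 1, 0, 0]]
```

Numbering starts at 1 so that 0 stays free to act as the padding value.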

Usage example

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

The text data is below.

preprocessed_sentences = [['barber', 'person'], 
['barber', 'good', 'person'], ['barber', 'huge', 'person'], 
['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], 
['huge', 'secret'], ['barber', 'kept', 'word'], 
['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], 
['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'],
['barber', 'went', 'huge', 'mountain']]

Tokenize the words and integer-encode them

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)

[[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]

Each word has been converted to a unique integer.

Compute the length of the longest sentence

max_len = max(len(item) for item in encoded)
print('Max length:', max_len)

Max length: 7

Pad every sentence to length 7

for sentence in encoded:
    while len(sentence) < max_len:
        sentence.append(0)

padded_np = np.array(encoded)
padded_np

array([[ 1, 5, 0, 0, 0, 0, 0],
[ 1, 8, 5, 0, 0, 0, 0],
[ 1, 3, 5, 0, 0, 0, 0],
[ 9, 2, 0, 0, 0, 0, 0],
[ 2, 4, 3, 2, 0, 0, 0],
[ 3, 2, 0, 0, 0, 0, 0],
[ 1, 4, 6, 0, 0, 0, 0],
[ 1, 4, 6, 0, 0, 0, 0],
[ 1, 4, 2, 0, 0, 0, 0],
[ 7, 7, 3, 2, 10, 1, 11],
[ 1, 12, 3, 13, 0, 0, 0]])

Every row now has length 7.
Rows that were shorter than 7 have been filled with 0s.
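Note that the while loop above appends to the lists inside encoded, so encoded itself is changed. If the original ragged lists are still needed, new lists can be built instead. The snippet below is a small sketch of that alternative on a shortened copy of the data; it is not part of the original post.

```python
# Shortened copy of the encoded data used in this post
encoded = [[1, 5], [1, 8, 5], [7, 7, 3, 2, 10, 1, 11]]

max_len = max(len(sentence) for sentence in encoded)

# Build new padded lists instead of appending to the originals
padded = [sentence + [0] * (max_len - len(sentence)) for sentence in encoded]

print(padded)   # [[1, 5, 0, 0, 0, 0, 0], [1, 8, 5, 0, 0, 0, 0], [7, 7, 3, 2, 10, 1, 11]]
print(encoded)  # [[1, 5], [1, 8, 5], [7, 7, 3, 2, 10, 1, 11]]  (unchanged)
```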

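Index 0 is not assigned to any word (the Keras Tokenizer starts numbering at 1), so it can stand for a token with no meaning. The model skips it only when masking is enabled, for example with mask_zero=True on the Embedding layer.
Filling data with a specific value to adjust its size (shape) is called padding.
When the fill value is the number 0, it is called zero padding.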
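To make the idea of "ignoring 0" concrete, here is a rough sketch of a padding mask computed by hand. It only illustrates the concept on three rows taken from the padded array above; it is not the mechanism Keras uses internally, and the variable names are chosen for this example.

```python
# Three rows from the zero-padded array above
padded = [
    [1, 5, 0, 0, 0, 0, 0],
    [7, 7, 3, 2, 10, 1, 11],
    [1, 12, 3, 13, 0, 0, 0],
]

# True where there is a real word, False where there is padding
mask = [[token != 0 for token in row] for row in padded]

# The number of real words in each sentence can be recovered from the mask
lengths = [sum(row) for row in mask]

print(mask[0])  # [True, True, False, False, False, False, False]
print(lengths)  # [2, 7, 4]
```

This works only because 0 is never used as a word index.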

Padding with Keras

Keras provides pad_sequences() for padding.

from tensorflow.keras.preprocessing.sequence import pad_sequences

The steps below are the same as before.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)

[[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]

Use Keras pad_sequences

padded = pad_sequences(encoded)
padded

array([[ 0, 0, 0, 0, 0, 1, 5],
[ 0, 0, 0, 0, 1, 8, 5],
[ 0, 0, 0, 0, 1, 3, 5],
[ 0, 0, 0, 0, 0, 9, 2],
[ 0, 0, 0, 2, 4, 3, 2],
[ 0, 0, 0, 0, 0, 3, 2],
[ 0, 0, 0, 0, 1, 4, 6],
[ 0, 0, 0, 0, 1, 4, 6],
[ 0, 0, 0, 0, 1, 4, 2],
[ 7, 7, 3, 2, 10, 1, 11],
[ 0, 0, 0, 1, 12, 3, 13]], dtype=int32)

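The result differs from the manual version because pad_sequences adds the padding at the front by default (padding='pre').
To pad at the end instead, pass padding='post'.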

padded = pad_sequences(encoded, padding='post')
padded

array([[ 1, 5, 0, 0, 0, 0, 0],
[ 1, 8, 5, 0, 0, 0, 0],
[ 1, 3, 5, 0, 0, 0, 0],
[ 9, 2, 0, 0, 0, 0, 0],
[ 2, 4, 3, 2, 0, 0, 0],
[ 3, 2, 0, 0, 0, 0, 0],
[ 1, 4, 6, 0, 0, 0, 0],
[ 1, 4, 6, 0, 0, 0, 0],
[ 1, 4, 2, 0, 0, 0, 0],
[ 7, 7, 3, 2, 10, 1, 11],
[ 1, 12, 3, 13, 0, 0, 0]], dtype=int32)

Reducing the sentence length to 5

padded = pad_sequences(encoded, padding='post', maxlen=5)
padded

array([[ 1, 5, 0, 0, 0],
[ 1, 8, 5, 0, 0],
[ 1, 3, 5, 0, 0],
[ 9, 2, 0, 0, 0],
[ 2, 4, 3, 2, 0],
[ 3, 2, 0, 0, 0],
[ 1, 4, 6, 0, 0],
[ 1, 4, 6, 0, 0],
[ 1, 4, 2, 0, 0],
[ 3, 2, 10, 1, 11],
[ 1, 12, 3, 13, 0]], dtype=int32)

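Sentences longer than 5 lose data. By default the words at the front are removed (truncating='pre'), so [7, 7, 3, 2, 10, 1, 11] becomes [3, 2, 10, 1, 11].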
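Before picking a maxlen, it can help to check how much data would be cut. The helper below, truncation_report, is a hypothetical name introduced for this sketch; it simply counts the affected sentences and the dropped words for the encoded data from this post.

```python
encoded = [[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2],
           [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]

def truncation_report(seqs, maxlen):
    """Return (number of sentences that get cut, total number of words dropped)."""
    dropped = [len(s) - maxlen for s in seqs if len(s) > maxlen]
    return len(dropped), sum(dropped)

print(truncation_report(encoded, 5))  # (1, 2)
print(truncation_report(encoded, 3))  # (3, 6)
print(truncation_report(encoded, 7))  # (0, 0)
```

With maxlen=5 only the longest sentence is affected, and it loses two words.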

To remove words from the end instead of the front, use the truncating argument.

Pass truncating='post'.

padded = pad_sequences(encoded, padding='post', truncating='post', maxlen=5)
padded

array([[ 1, 5, 0, 0, 0],
[ 1, 8, 5, 0, 0],
[ 1, 3, 5, 0, 0],
[ 9, 2, 0, 0, 0],
[ 2, 4, 3, 2, 0],
[ 3, 2, 0, 0, 0],
[ 1, 4, 6, 0, 0],
[ 1, 4, 6, 0, 0],
[ 1, 4, 2, 0, 0],
[ 7, 7, 3, 2, 10],
[ 1, 12, 3, 13, 0]], dtype=int32)
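The effect of the padding and truncating arguments can be reproduced in a few lines of plain Python. The function below, pad_sequences_sketch, is a hypothetical re-implementation written for this post to show the semantics; it returns plain lists rather than a NumPy array and is not the Keras source code.

```python
def pad_sequences_sketch(seqs, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or truncate every sequence to maxlen (default: the longest sequence)."""
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    result = []
    for s in seqs:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops words from the front, 'post' drops them from the end
            s = s[-maxlen:] if truncating == 'pre' else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        # 'pre' puts the padding in front, 'post' puts it at the end
        result.append(fill + s if padding == 'pre' else s + fill)
    return result

sample = [[1, 5], [7, 7, 3, 2, 10, 1, 11]]

print(pad_sequences_sketch(sample, maxlen=5))
# [[0, 0, 0, 1, 5], [3, 2, 10, 1, 11]]
print(pad_sequences_sketch(sample, maxlen=5, padding='post', truncating='post'))
# [[1, 5, 0, 0, 0], [7, 7, 3, 2, 10]]
```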
