😄 [Day 13] : EXPLORATION 04. AI Lyricist

Baekgeon · January 21, 2022

Building an AI Lyricist

๋ฌธ์žฅ์˜ ์ •์˜

  • ์ƒ๊ฐ์ด๋‚˜ ๊ฐ์ •์„ ๋ง๊ณผ ๊ธ€๋กœ ํ‘œํ˜„ํ•  ๋–„ ์™„๊ฒฐ๋œ ๋‚ด์šฉ์„ ๋‚˜ํƒ€๋‚ด๋Š” ์ตœ์†Œ์˜ ๋‹จ์œ„

Key points

  • ์ธ๊ณต์ง€๋Šฅ์ด ๋ฌธ์žฅ์„ ์ดํ•ดํ•˜๋Š” ๋ฐฉ์‹
  • ์ž‘๋ฌธ์„ ๊ฐ€๋ฅด์น˜๋Š” ๋ฒ•

Sequences

Sequence data types in Python

list1 = list()          # create an empty list
list2 = list('ABCD')    # list(iterable object)
                        # ['A','B','C','D']
list3 = list(range(10)) # [0,1,2,3,4,5,6,7,8,9]

def even_generator() :
    for i in range(10) :
        if i % 2 == 0 :
            yield i

list4 = list(even_generator()) # [0, 2, 4, 6, 8]

list5 = list((i for i in range(10) if i % 2 == 0))
list6 = list( i for i in range(10) if i % 2 == 0 ) # works even without the ()

print('list1 ', list1)
print('list2 ', list2)
print('list3 ', list3)
print('list4 ', list4)
print('list5 ', list5)
print('list6 ', list6)


    
list1  []
list2  ['A', 'B', 'C', 'D']
list3  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list4  [0, 2, 4, 6, 8]
list5  [0, 2, 4, 6, 8]
list6  [0, 2, 4, 6, 8]

RNN

๋‹จ์–ด๋ฅผ ์ ์žฌ์ ์†Œ์— ํ™œ์šฉํ•˜๋Š” ๋Šฅ๋ ฅ์„ ๋ฐœ๋‹ฌ(๋ฌธ๋ฒ•์ ์ธ ์›๋ฆฌ X)

sentence = " ๋‚˜๋Š” ๋ฐฅ์„ ๋จน์—ˆ๋‹ค "

source_sentence = "<start>" + sentence
target_sentence = sentence + "<end>"

print("Source ๋ฌธ์žฅ:", source_sentence)
print("Target ๋ฌธ์žฅ:", target_sentence)
Source ๋ฌธ์žฅ: <start> ๋‚˜๋Š” ๋ฐฅ์„ ๋จน์—ˆ๋‹ค 
Target ๋ฌธ์žฅ:  ๋‚˜๋Š” ๋ฐฅ์„ ๋จน์—ˆ๋‹ค <end>
  • ์ผ๋ฐ˜์ ์œผ๋กœ '๋‚˜๋Š” ๋ฐฅ์„ ()'์—์„œ ()์— ๋“ค์–ด๊ฐˆ ๋ง์€ ๋จน๋Š”๋‹ค.
  • ๊ฐ€์žฅ ์ฒซ ์‹œ์ž‘์ธ ๋‚˜๋Š” ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ
  • \ ๋ผ๋Š” ํŠน์ˆ˜ํ•œ ํ† ํฐ์„ ์ถ”๊ฐ€
  • \ ํ† ํฐ์„ ๋ฐ›์€ ์ˆœํ™˜์‹ ๊ฒฝ๋ง์€ ๋‹ค์Œ์œผ๋กœ ๋‚˜๋Š” ์„ ์ƒ์„ฑํ•˜๊ณ , ์ƒ์„ฑํ•œ ๋‹จ์–ด๋ฅผ ๋‹ค์‹œ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ
  • ์ˆœ์ฐจ์ ์œผ๋กœ ์ƒ์„ฑํ•˜๊ณ ๋‚˜๋ฉด, ๋๋ƒˆ๋‹ค๋Š” \ ํ† ํฐ์„ ์ƒ์„ฑ
  • \๊ฐ€ ๋ฌธ์žฅ์˜ ์‹œ์ž‘์— ๋”ํ•ด์ง„ ์ž…๋ ฅ๋ฐ์ดํ„ฐ - ๋ฌธ์ œ
  • \๊ฐ€ ๋ฌธ์žฅ ๋์— ๋”ํ•ด์ง„ ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ - ๋‹ต์•ˆ์ง€
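
In pseudocode, that loop looks something like this (a conceptual sketch only; predict_next is a hypothetical helper, not a real API):

def generate(model, tokens=None, max_len=20):
    tokens = tokens or ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        # feed everything generated so far back in as the next input
        tokens.append(model.predict_next(tokens))  # hypothetical method
    return " ".join(tokens)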

์–ธ์–ด๋ชจ๋ธ

  • nโˆ’1๊ฐœ์˜ ๋‹จ์–ด ์‹œํ€€์Šค w1,...wnโˆ’1w_1, ... w_{n-1}๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, n๋ฒˆ์งธ ๋‹จ์–ด wnw_n์œผ๋กœ ๋ฌด์—‡์ด ์˜ฌ์ง€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํ™•๋ฅ  ๋ชจ๋ธ ์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ฮธ\theta๋กœ ๋ชจ๋ธ๋งํ•˜๋Š” ์–ธ์–ด ๋ชจ๋ธ

$P(w_n | w_1, \dots, w_{n-1}; \theta)$

A dataset where the word sequence up to the $(n-1)$-th word becomes x_train and the $n$-th word becomes y_train
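
As a toy illustration of this framing (not code from the notebook), a single sentence yields input/label pairs like these:

words = ["<start>", "나는", "밥을", "먹었다", "<end>"]
for n in range(1, len(words)):
    x_train, y_train = words[:n], words[n]
    print(x_train, "->", y_train)
# ['<start>'] -> 나는
# ['<start>', '나는'] -> 밥을   ... and so on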

๋ฐ์ดํ„ฐ ๋‹ค๋“ฌ๊ธฐ

๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ

import os, re 
import numpy as np
import tensorflow as tf

# ํŒŒ์ผ์„ ์ฝ๊ธฐ๋ชจ๋“œ๋กœ ์—ด๊ณ 
# ๋ผ์ธ ๋‹จ์œ„๋กœ ๋Š์–ด์„œ list ํ˜•ํƒœ๋กœ ์ฝ์–ด์˜ต๋‹ˆ๋‹ค.
file_path = './lyricist/data/shakespeare.txt'
with open(file_path, "r") as f:
    raw_corpus = f.read().splitlines()

# ์•ž์—์„œ๋ถ€ํ„ฐ 10๋ผ์ธ๋งŒ ํ™”๋ฉด์— ์ถœ๋ ฅํ•ด ๋ณผ๊นŒ์š”?
print(raw_corpus[:9])
['First Citizen:', 'Before we proceed any further, hear me speak.', '', 'All:', 'Speak, speak.', '', 'First Citizen:', 'You are all resolved rather to die than to famish?', '']

  • The unwanted sentences are the ones marking a speaker (indices 0, 3, 6) and the empty ones (indices 2, 5, 8)
  • Speaker-marking sentences end with :
  • Exclude sentences based on that trailing :
  • Exclude empty sentences by checking whether their length is 0
for idx, sentence in enumerate(raw_corpus):
    if len(sentence) == 0: continue   # skip sentences with length 0
    if sentence[-1] == ":": continue  # skip sentences that end with :

    if idx > 9: break   # let's just look at the first 10 sentences for now

    print(sentence)
The first words that come out
And I can see this song will be about you
I can't believe that I can breathe without you
But all I need to do is carry on
The next line I write down
And there's a tear that falls between the pages
I know that pain's supposed to heal in stages
But it depends which one I'm standing on I write lines down, then rip them up
Describing love can't be this tough I could set this song on fire, send it up in smoke
I could throw it in the river and watch it sink in slowly

Tokenization (Tokenize)

  1. Hi, my name is John. *(split into "Hi," "my", ..., "john.") - punctuation
    → add spaces on both sides of punctuation marks
  2. First, open the first chapter. *(First and first treated as different words) - case
    → convert all characters to lowercase
  3. He is a ten-year-old boy. *(ten-year-old treated as one word) - special characters
    → remove all special characters

์ •๊ทœํ‘œํ˜„์‹(Regex)์„ ์ด์šฉํ•œ ํ•„ํ„ฐ๋ง์ด ์œ ์šฉํ•˜๊ฒŒ ์‚ฌ์šฉ

# ์ž…๋ ฅ๋œ ๋ฌธ์žฅ์„
#     1. ์†Œ๋ฌธ์ž๋กœ ๋ฐ”๊พธ๊ณ , ์–‘์ชฝ ๊ณต๋ฐฑ์„ ์ง€์›๋‹ˆ๋‹ค
#     2. ํŠน์ˆ˜๋ฌธ์ž ์–‘์ชฝ์— ๊ณต๋ฐฑ์„ ๋„ฃ๊ณ 
#     3. ์—ฌ๋Ÿฌ๊ฐœ์˜ ๊ณต๋ฐฑ์€ ํ•˜๋‚˜์˜ ๊ณต๋ฐฑ์œผ๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค
#     4. a-zA-Z?.!,ยฟ๊ฐ€ ์•„๋‹Œ ๋ชจ๋“  ๋ฌธ์ž๋ฅผ ํ•˜๋‚˜์˜ ๊ณต๋ฐฑ์œผ๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค
#     5. ๋‹ค์‹œ ์–‘์ชฝ ๊ณต๋ฐฑ์„ ์ง€์›๋‹ˆ๋‹ค
#     6. ๋ฌธ์žฅ ์‹œ์ž‘์—๋Š” <start>, ๋์—๋Š” <end>๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค
# ์ด ์ˆœ์„œ๋กœ ์ฒ˜๋ฆฌํ•ด์ฃผ๋ฉด ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ƒํ™ฉ์„ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ๊ฒ ๋„ค์š”!
def preprocess_sentence(sentence):
    sentence = sentence.lower().strip() # 1
    sentence = re.sub(r"([?.!,ยฟ])", r" \1 ", sentence) # 2
    sentence = re.sub(r'[" "]+', " ", sentence) # 3
    sentence = re.sub(r"[^a-zA-Z?.!,ยฟ]+", " ", sentence) # 4
    sentence = sentence.strip() # 5
    sentence = '<start> ' + sentence + ' <end>' # 6
    return sentence

# ์ด ๋ฌธ์žฅ์ด ์–ด๋–ป๊ฒŒ ํ•„ํ„ฐ๋ง๋˜๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”.
print(preprocess_sentence("This @_is ;;;sample        sentence."))
<start> this is sample sentence . <end>

With that, the cleaning function that turns even a messy sentence into a tidy one is complete

  • <start> and <end> are added as well
  • In NLP, the sentence that becomes the model's input is the source sentence (Source Sentence)
    → x_train
  • The model's output sentence, which plays the role of the answer, is the target sentence (Target Sentence)
    → y_train

corpus ์ด์šฉ ์ •์ œ ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•

# ์—ฌ๊ธฐ์— ์ •์ œ๋œ ๋ฌธ์žฅ์„ ๋ชจ์„๊ฒ๋‹ˆ๋‹ค
corpus = []

for sentence in raw_corpus:
    # ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜์ง€ ์•Š๋Š” ๋ฌธ์žฅ์€ ๊ฑด๋„ˆ๋œ๋‹ˆ๋‹ค
    if len(sentence) == 0: continue
    if sentence[-1] == ":": continue
    
    # ์ •์ œ๋ฅผ ํ•˜๊ณ  ๋‹ด์•„์ฃผ์„ธ์š”
    preprocessed_sentence = preprocess_sentence(sentence)
    corpus.append(preprocessed_sentence)
        
# ์ •์ œ๋œ ๊ฒฐ๊ณผ๋ฅผ 10๊ฐœ๋งŒ ํ™•์ธํ•ด๋ณด์ฃ 
corpus[:10]
['<start> before we proceed any further , hear me speak . <end>',
 '<start> speak , speak . <end>',
 '<start> you are all resolved rather to die than to famish ? <end>',
 '<start> resolved . resolved . <end>',
 '<start> first , you know caius marcius is chief enemy to the people . <end>',
 '<start> we know t , we know t . <end>',
 '<start> let us kill him , and we ll have corn at our own price . <end>',
 '<start> is t a verdict ? <end>',
 '<start> no more talking on t let it be done away , away ! <end>',
 '<start> one word , good citizens . <end>']

The tf.keras.preprocessing.text.Tokenizer package

 - ์ •์ œ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ํ† ํฐํ™” 
 - ๋‹จ์–ด ์‚ฌ์ „(vocabulary ๋˜๋Š” dictionary๋ผ๊ณ  ์นญํ•จ) ์ œ์ž‘
 - ๋ฐ์ดํ„ฐ๋ฅผ ์ˆซ์ž๋กœ ๋ณ€ํ™˜.

→ This is vectorization (vectorize); the data converted to numbers is called a tensor

# For tokenization we use TensorFlow's Tokenizer and pad_sequences
# The documents below are good references if you want to dig deeper
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
def tokenize(corpus):
    # build a tokenizer that can remember 7000 words
    # we've already cleaned the sentences, so no filters are needed
    # words that don't make it into the 7000 are replaced with '<unk>'
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=7000, 
        filters=' ',
        oov_token="<unk>"
    )
    # build the tokenizer's internal vocabulary from corpus
    tokenizer.fit_on_texts(corpus)
    # use the fitted tokenizer to convert corpus into a tensor
    tensor = tokenizer.texts_to_sequences(corpus)   
    # make the input sequences a uniform length
    # if a sequence is short, padding is appended after the sentence to match the length.
    # to pad in front of the sentence instead, use padding='pre'
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')  
    
    print(tensor,tokenizer)
    return tensor, tokenizer

tensor, tokenizer = tokenize(corpus)
[[   2  143   40 ...    0    0    0]
 [   2  110    4 ...    0    0    0]
 [   2   11   50 ...    0    0    0]
 ...
 [   2  149 4553 ...    0    0    0]
 [   2   34   71 ...    0    0    0]
 [   2  945   34 ...    0    0    0]] <keras_preprocessing.text.Tokenizer object at 0x7fdf6cb56df0>

โ†’ ํ…์„œ ๋ฐ์ดํ„ฐ๋ฅผ 3๋ฒˆ์žฌ ํ–‰, 10๋ฒˆ์งธ ์—ด๊นŒ์ง€๋งŒ ์ถœ๋ ฅ

print(tensor[:3, :10])
[[   2  143   40  933  140  591    4  124   24  110]
 [   2  110    4  110    5    3    0    0    0    0]
 [   2   11   50   43 1201  316    9  201   74    9]]
  • ํ…์„œ ๋ฐ์ดํ„ฐ๋Š” ๋ชจ๋‘ ์ •์ˆ˜
  • ์ˆซ์ž๋Š” tokenizer์— ๊ตฌ์ถ•๋œ ๋‹จ์–ด ์‚ฌ์ „์˜ ์ธ๋ฑ์Šค
  • 2๋ฒˆ ์ธ๋ฑ์Šค๊ฐ€ ๋ฐ”๋กœ \
for idx in tokenizer.index_word:
    print(idx, ":", tokenizer.index_word[idx])

    if idx >= 10: break
1 : <unk>
2 : <start>
3 : <end>
4 : ,
5 : .
6 : the
7 : and
8 : i
9 : to
10 : of
  • The runs of 0s toward the end of the rows in the tensor output are padding, filled in wherever a sentence is shorter than the fixed input sequence length
  • 0 isn't in the vocabulary: it's the padding character <pad>
# tensor์—์„œ ๋งˆ์ง€๋ง‰ ํ† ํฐ์„ ์ž˜๋ผ๋‚ด์„œ ์†Œ์Šค ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
# ๋งˆ์ง€๋ง‰ ํ† ํฐ์€ <end>๊ฐ€ ์•„๋‹ˆ๋ผ <pad>์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค.
src_input = tensor[:, :-1]  
# tensor์—์„œ <start>๋ฅผ ์ž˜๋ผ๋‚ด์„œ ํƒ€๊ฒŸ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
tgt_input = tensor[:, 1:]    

print(src_input[0])
print(tgt_input[0])
[  2 143  40 933 140 591   4 124  24 110   5   3   0   0   0   0   0   0
   0   0]
[143  40 933 140 591   4 124  24 110   5   3   0   0   0   0   0   0   0
   0   0]
  • corpus ๋‚ด์˜ ์ฒซ ๋ฒˆ์งธ ๋ฌธ์žฅ์— ๋Œ€ํ•ด ์ƒ์„ฑ๋œ ์†Œ์Šค์™€ ํƒ€๊ฒŸ ๋ฌธ์žฅ์„ ํ™•์ธ
  • ์†Œ์Šค๋Š” 2(\)์—์„œ ์‹œ์ž‘ํ•ด์„œ 3(\)์œผ๋กœ ๋๋‚œ ํ›„ 0(\)๋กœ ์ฑ„์›Œ์ ธ
  • ํƒ€๊ฒŸ์€ 2๋กœ ์‹œ์ž‘ํ•˜์ง€ ์•Š๊ณ  ์†Œ์Šค๋ฅผ ์™ผ์ชฝ์œผ๋กœ ํ•œ ์นธ ์‹œํ”„ํŠธ ํ•œ ํ˜•ํƒœ

Use the tf.data.Dataset.from_tensor_slices() method → create a tf.data.Dataset

  • How to create a tf.data.Dataset object from the tensors we built
  • When used with TensorFlow, a tf.data.Dataset object provides an input pipeline with speed improvements and assorted conveniences
BUFFER_SIZE = len(src_input)
BATCH_SIZE = 256  # why 256??
steps_per_epoch = len(src_input) // BATCH_SIZE

 # 7000 words in the tokenizer's vocabulary, plus 0:<pad>, which isn't among them: 7001 in total
VOCAB_SIZE = tokenizer.num_words + 1   

# ์ค€๋น„ํ•œ ๋ฐ์ดํ„ฐ ์†Œ์Šค๋กœ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ์…‹์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค
# ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ๋Š” ์•„๋ž˜ ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”
# ์ž์„ธํžˆ ์•Œ์•„๋‘˜์ˆ˜๋ก ๋„์›€์ด ๋งŽ์ด ๋˜๋Š” ์ค‘์š”ํ•œ ๋ฌธ์„œ์ž…๋‹ˆ๋‹ค
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset
dataset = tf.data.Dataset.from_tensor_slices((src_input, tgt_input))
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
dataset
<BatchDataset shapes: ((256, 20), (256, 20)), types: (tf.int32, tf.int32)>

Training the AI

The architecture we're going to build

  • tf.keras.Model์„ Subclassingํ•˜๋Š” ๋ฐฉ์‹
  • 1๊ฐœ์˜ Embedding ๋ ˆ์ด์–ด, 2๊ฐœ์˜ LSTM ๋ ˆ์ด์–ด, 1๊ฐœ์˜ Dense ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ
class TextGenerator(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.rnn_1 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.rnn_2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.linear = tf.keras.layers.Dense(vocab_size)
        
    def call(self, x):
        out = self.embedding(x)
        out = self.rnn_1(out)
        out = self.rnn_2(out)
        out = self.linear(out)
        
        return out
    
embedding_size = 256
hidden_size = 1024
model = TextGenerator(tokenizer.num_words + 1, embedding_size, hidden_size)
  • The role of the Embedding layer in the model
  • The input tensor contains vocabulary indices
  • The Embedding layer converts each index value into the word vector at that index
  • Word vectors are used as abstract representations of words in a semantic vector space

์œ„ ์ฝ”๋“œ์—์„œ embedding_size ๋Š” ์›Œ๋“œ ๋ฒกํ„ฐ์˜ ์ฐจ์›์ˆ˜, ์ฆ‰ ๋‹จ์–ด๊ฐ€ ์ถ”์ƒ์ ์œผ๋กœ ํ‘œํ˜„๋˜๋Š” ํฌ๊ธฐ
์˜ˆ๋ฅผ ๋“ค์–ด 2๋ผ๋ฉด,

  • ์ฐจ๊ฐ‘๋‹ค: [0.0, 1.0]
  • ๋œจ๊ฒ๋‹ค: [1.0, 0.0]
  • ๋ฏธ์ง€๊ทผํ•˜๋‹ค: [0.5, 0.5]
  • ํด์ˆ˜๋ก ์ถ”์ƒ์ ์ธ ํŠน์ง•์„ ์žก์•„๋‚ผ ์ˆ˜ ์žˆ์Œ
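
To make that lookup concrete, here's a tiny sketch with toy sizes (10 words, 2 dimensions; both are assumptions for illustration):

embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=2)
tokens = tf.constant([[2, 5, 7]])  # a batch holding one 3-token sentence
print(embedding(tokens).shape)     # (1, 3, 2): each index became a 2-d word vector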

  • But without enough data, a larger size only invites confusion

  • For this problem it's set to 256

  • hidden_size, the dimensionality of the LSTM layers' hidden state, works in the same spirit

  • 1024 is about right > why???

  • The model isn't built yet

    • model.compile() hasn't been called
    • the model's input tensor hasn't been specified either
  • model.build() is called automatically once the model's input shape is determined

# ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฐ์ดํ„ฐ ํ•œ ๋ฐฐ์น˜๋งŒ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
# ์ง€๊ธˆ์€ ๋™์ž‘ ์›๋ฆฌ์— ๋„ˆ๋ฌด ๋น ์ ธ๋“ค์ง€ ๋งˆ์„ธ์š”~
for src_sample, tgt_sample in dataset.take(1): break

# Feed the one batch we pulled into the model
model(src_sample)
<tf.Tensor: shape=(256, 20, 7001), dtype=float32, numpy=
array([[[-4.31300432e-05, -1.61468924e-04, -2.28978643e-05, ...,
         -1.30526796e-05, -1.65740246e-04, -1.52532608e-04],
        [-3.31729389e-04, -3.35492572e-04,  9.25426939e-05, ...,
         -5.40830388e-06, -3.98255826e-04, -1.54450114e-04],
        [-3.56497185e-04, -5.21454436e-04,  9.67504602e-05, ...,
          5.08658530e-04, -2.61457870e-04, -4.00694320e-04],
        ...,
        [-1.85111989e-04, -4.65744082e-03, -2.55297264e-03, ...,
         -1.98253267e-03,  1.08430139e-03, -1.12067943e-03],
        [-2.26652643e-04, -4.91228886e-03, -2.57866457e-03, ...,
         -2.21222127e-03,  1.26814318e-03, -1.42959598e-03],
        [-2.72341655e-04, -5.11216559e-03, -2.58543598e-03, ...,
         -2.40770658e-03,  1.43469241e-03, -1.71830074e-03]],

       [[-4.31300432e-05, -1.61468924e-04, -2.28978643e-05, ...,
         -1.30526796e-05, -1.65740246e-04, -1.52532608e-04],
        [-5.15408523e-04, -2.02041178e-04,  1.23195961e-04, ...,
          2.49126664e-04, -3.87586042e-04, -2.07115838e-04],
        [-4.60794341e-04, -6.10847856e-05,  1.97986214e-04, ...,
          4.49985295e-04, -6.17051148e-04, -3.10465286e-04],
        ...,
        [ 8.63971800e-05, -4.08918737e-03, -2.27619591e-03, ...,
         -2.37792335e-03,  3.57422628e-04, -1.23522384e-03],
        [ 4.76920541e-05, -4.44815168e-03, -2.36615562e-03, ...,
         -2.57940381e-03,  6.33708783e-04, -1.43076328e-03],
        [-4.88970727e-06, -4.74188896e-03, -2.42232508e-03, ...,
         -2.74938415e-03,  8.77008599e-04, -1.63904799e-03]],

       [[-4.31300432e-05, -1.61468924e-04, -2.28978643e-05, ...,
         -1.30526796e-05, -1.65740246e-04, -1.52532608e-04],
        [-3.92017188e-04, -4.05272236e-04, -4.03737446e-04, ...,
          2.08484649e-04, -1.78227972e-04, -1.51166416e-04],
        [-6.78456272e-04, -8.21831811e-04, -4.12415975e-04, ...,
          7.16181123e-04, -2.64001399e-04, -1.19791046e-04],
        ...,
        [ 6.99736876e-04, -3.99236009e-03, -1.08700444e-03, ...,
         -1.69491197e-03,  2.50219717e-04, -7.33265886e-04],
        [ 6.50285685e-04, -4.42640856e-03, -1.28968060e-03, ...,
         -1.98570453e-03,  5.35400759e-04, -1.02897582e-03],
        [ 5.66453091e-04, -4.77890717e-03, -1.45729876e-03, ...,
         -2.23665801e-03,  7.83600728e-04, -1.31549360e-03]],

       ...,

       [[-4.31300432e-05, -1.61468924e-04, -2.28978643e-05, ...,
         -1.30526796e-05, -1.65740246e-04, -1.52532608e-04],
        [ 9.46695582e-06, -1.02088648e-04,  2.74929276e-04, ...,
          2.84693117e-04, -2.84875568e-04, -2.80248147e-04],
        [ 2.56561616e-04, -2.39318193e-04,  5.11636317e-04, ...,
          1.42004428e-04, -4.73114924e-04, -4.22215729e-04],
        ...,
        [ 3.18835628e-05, -5.07764611e-03, -2.44524307e-03, ...,
         -2.17102608e-03,  9.32406518e-04, -1.62668424e-04],
        [-3.98430202e-05, -5.27923508e-03, -2.50015478e-03, ...,
         -2.35773833e-03,  1.14679523e-03, -5.46753930e-04],
        [-1.08012224e-04, -5.43248793e-03, -2.52589677e-03, ...,
         -2.51999241e-03,  1.33289036e-03, -9.18680802e-04]],

       [[-4.31300432e-05, -1.61468924e-04, -2.28978643e-05, ...,
         -1.30526796e-05, -1.65740246e-04, -1.52532608e-04],
        [-2.36422085e-04, -2.61728710e-04,  2.46185751e-04, ...,
         -6.08692972e-05, -3.13386961e-04, -1.68910818e-04],
        [-2.05222939e-04, -2.19296082e-04,  1.04809449e-04, ...,
          1.52687702e-04, -2.95088452e-04, -1.40395117e-04],
        ...,
        [-3.88026383e-04, -4.55201278e-03, -2.38304026e-03, ...,
         -2.11151317e-03,  1.44088583e-03, -1.46701431e-03],
        [-4.51929373e-04, -4.85128677e-03, -2.44412408e-03, ...,
         -2.32648896e-03,  1.53514405e-03, -1.73510751e-03],
        [-5.07737219e-04, -5.08674514e-03, -2.47974368e-03, ...,
         -2.51027266e-03,  1.62658445e-03, -1.98661070e-03]],

       [[-4.31300432e-05, -1.61468924e-04, -2.28978643e-05, ...,
         -1.30526796e-05, -1.65740246e-04, -1.52532608e-04],
        [-9.77924428e-05, -4.37246083e-04,  3.97726777e-04, ...,
          5.69955155e-05, -2.75543978e-04, -6.65865809e-05],
        [-1.78766222e-05, -7.11038534e-04,  5.13010891e-04, ...,
          6.48956920e-05, -4.87306388e-04,  4.01803874e-04],
        ...,
        [ 7.31389737e-04, -3.84947122e-03, -2.41923332e-03, ...,
         -2.36577378e-03,  1.51708734e-03, -6.34322700e-04],
        [ 5.93748991e-04, -4.21182066e-03, -2.47242069e-03, ...,
         -2.56023020e-03,  1.65442866e-03, -9.69938294e-04],
        [ 4.57269518e-04, -4.51200176e-03, -2.50035292e-03, ...,
         -2.72135716e-03,  1.77274307e-03, -1.29425328e-03]]],
      dtype=float32)>
  • ๋ชจ๋ธ์˜ ์ตœ์ข… ์ถœ๋ ฅ ํ…์„œ shape๋ฅผ ์œ ์‹ฌํžˆ ๋ณด๋ฉด shape=(256, 20, 7001)์ž„
  • 7001์€ Dense ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ ์ฐจ์›์ˆ˜
  • 7001๊ฐœ์˜ ๋‹จ์–ด ์ค‘ ์–ด๋Š ๋‹จ์–ด์˜ ํ™•๋ฅ ์ด ๊ฐ€์žฅ ๋†’์„์ง€๋ฅผ ๋ชจ๋ธ๋งํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ
  • 256์€ ์ด์ „ ์Šคํ…์—์„œ ์ง€์ •ํ•œ ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ
  • dataset.take(1)๋ฅผ ํ†ตํ•ด์„œ 1๊ฐœ์˜ ๋ฐฐ์น˜, ์ฆ‰ 256๊ฐœ์˜ ๋ฌธ์žฅ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์˜จ ๊ฒƒ
  • 20์€
    • tf.keras.layers.LSTM(hidden_size, return_sequences=True)๋กœ ํ˜ธ์ถœํ•œ LSTM ๋ ˆ์ด์–ด์—์„œ return_sequences=True์ด๋ผ๊ณ  ์ง€์ •ํ•œ ๋ถ€๋ถ„
    • LSTM์€ ์ž์‹ ์—๊ฒŒ ์ž…๋ ฅ๋œ ์‹œํ€€์Šค์˜ ๊ธธ์ด๋งŒํผ ๋™์ผํ•œ ๊ธธ์ด์˜ ์‹œํ€€์Šค๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค๋Š” ์˜๋ฏธ
    • return_sequences=False์˜€๋‹ค๋ฉด LSTM ๋ ˆ์ด์–ด๋Š” 1๊ฐœ์˜ ๋ฒกํ„ฐ๋งŒ ์ถœ๋ ฅ
    • ๋ฐ์ดํ„ฐ์…‹์˜ max_len์ด 20์œผ๋กœ ๋งž์ถฐ์ ธ ์žˆ์—ˆ๋˜ ๊ฒƒ
# At last, the model.summary() output
model.summary()
Model: "text_generator"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        multiple                  1792256   
_________________________________________________________________
lstm (LSTM)                  multiple                  5246976   
_________________________________________________________________
lstm_1 (LSTM)                multiple                  8392704   
_________________________________________________________________
dense (Dense)                multiple                  7176025   
=================================================================
Total params: 22,607,961
Trainable params: 22,607,961
Non-trainable params: 0
_________________________________________________________________

Output Shape???

  • ๋ชจ๋ธ์€ ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ๊ธธ์ด๋ฅผ ๋ชจ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Output Shape๋ฅผ ํŠน์ •ํ•  ์ˆ˜ ์—†๋Š” ๊ฒƒ
    ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” ์ธก์ •
    - 22million
    - GPT-2์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์‚ฌ์ด์ฆˆ๋Š”, 1.5billion
    - GPT-3์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” GPT-2์˜ 100๋ฐฐ

๋ชจ๋ธ ํ•™์Šต

# We'll get to optimizers and losses bit by bit
# If you want a head start, see the docs below
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
# https://www.tensorflow.org/api_docs/python/tf/keras/losses
# They're quite long, so reading them right now isn't recommended
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction='none'
)

model.compile(loss=loss, optimizer=optimizer)
model.fit(dataset, epochs=30)
Epoch 1/30
93/93 [==============================] - 21s 203ms/step - loss: 1.8062
Epoch 2/30
93/93 [==============================] - 20s 216ms/step - loss: 1.6858
Epoch 3/30
93/93 [==============================] - 19s 205ms/step - loss: 1.6052
Epoch 4/30
93/93 [==============================] - 19s 200ms/step - loss: 1.5281
Epoch 5/30
93/93 [==============================] - 19s 202ms/step - loss: 1.4540
Epoch 6/30
93/93 [==============================] - 19s 207ms/step - loss: 1.3776
Epoch 7/30
93/93 [==============================] - 19s 206ms/step - loss: 1.3050
Epoch 8/30
93/93 [==============================] - 19s 203ms/step - loss: 1.2305
Epoch 9/30
93/93 [==============================] - 19s 203ms/step - loss: 1.1576
Epoch 10/30
93/93 [==============================] - 19s 204ms/step - loss: 1.0847
Epoch 11/30
93/93 [==============================] - 19s 206ms/step - loss: 1.0169
Epoch 12/30
93/93 [==============================] - 19s 205ms/step - loss: 0.9519
Epoch 13/30
93/93 [==============================] - 19s 203ms/step - loss: 0.8902
Epoch 14/30
93/93 [==============================] - 19s 203ms/step - loss: 0.8373
Epoch 15/30
93/93 [==============================] - 19s 203ms/step - loss: 0.7912
Epoch 16/30
93/93 [==============================] - 19s 203ms/step - loss: 0.7523
Epoch 17/30
93/93 [==============================] - 19s 204ms/step - loss: 0.7208
Epoch 18/30
93/93 [==============================] - 19s 204ms/step - loss: 0.6963
Epoch 19/30
93/93 [==============================] - 19s 204ms/step - loss: 0.6760
Epoch 20/30
93/93 [==============================] - 19s 205ms/step - loss: 0.6597
Epoch 21/30
93/93 [==============================] - 19s 205ms/step - loss: 0.6463
Epoch 22/30
93/93 [==============================] - 19s 204ms/step - loss: 0.6353
Epoch 23/30
93/93 [==============================] - 19s 203ms/step - loss: 0.6260
Epoch 24/30
93/93 [==============================] - 19s 203ms/step - loss: 0.6183
Epoch 25/30
93/93 [==============================] - 19s 203ms/step - loss: 0.6118
Epoch 26/30
93/93 [==============================] - 19s 203ms/step - loss: 0.6053
Epoch 27/30
93/93 [==============================] - 19s 203ms/step - loss: 0.6007
Epoch 28/30
93/93 [==============================] - 19s 203ms/step - loss: 0.5963
Epoch 29/30
93/93 [==============================] - 19s 204ms/step - loss: 0.5925
Epoch 30/30
93/93 [==============================] - 19s 205ms/step - loss: 0.5893





<keras.callbacks.History at 0x7fdedafa3400>

Loss๋Š” ๋ชจ๋ธ์ด ์˜ค๋‹ต์„ ๋งŒ๋“ค๊ณ  ์žˆ๋Š” ์ •๋„๋ผ๊ณ  ์ƒ๊ฐ( Loss๊ฐ€ 1์ผ ๋•Œ 99%๋ฅผ ๋งž์ถ”๊ณ  ์žˆ๋‹ค๋Š” ์˜๋ฏธ๋Š” ์•„๋‹˜).
์˜ค๋‹ต๋ฅ ์ด ๊ฐ์†Œํ•˜๊ณ  ์žˆ์œผ๋‹ˆ ํ•™์Šต์ด ์ž˜ ์ง„ํ–‰๋˜๊ณ  ์žˆ๋‹ค ๊ณ  ํ•ด์„

์ž‘๋ฌธ์„ ์‹œ์ผœ๋ณด๊ณ  ์ง์ ‘ ํ‰๊ฐ€

def generate_text(model, tokenizer, init_sentence="<start>", max_len=20):
    # for the test, convert the init_sentence we received into a tensor too
    test_input = tokenizer.texts_to_sequences([init_sentence])
    test_tensor = tf.convert_to_tensor(test_input, dtype=tf.int64)
    end_token = tokenizer.word_index["<end>"]

    # build the sentence by predicting one word at a time
    #    1. feed the input sentence's tensor to the model
    #    2. take the word index with the highest predicted probability
    #    3. append the word index predicted in 2 to the end of the sentence
    #    4. if the model predicted <end>, or max_len was reached, stop generating
    while True:
        # 1
        predict = model(test_tensor) 
        # 2
        predict_word = tf.argmax(tf.nn.softmax(predict, axis=-1), axis=-1)[:, -1] 
        # 3 
        test_tensor = tf.concat([test_tensor, tf.expand_dims(predict_word, axis=0)], axis=-1)
        # 4
        if predict_word.numpy()[0] == end_token: break
        if test_tensor.shape[1] >= max_len: break

    generated = ""
    # use the tokenizer to turn each word index back into a word 
    for word_index in test_tensor[0].numpy():
        generated += tokenizer.index_word[word_index] + " "

    return generated

generate_text() takes init_sentence as an argument
and first turns that argument into a tensor.
By default it's just the single token <start>.

- Suppose on the first pass of the while loop, test_tensor holds only <start>.
    - Say the model picks A out of the 7001 words as its output.
- On the second pass, test_tensor holds <start> A. 
    - Say the model then picks B.
- On the third pass, test_tensor holds <start> A B. 
    - And so on..... (the rest omitted)
generate_text(model, tokenizer, init_sentence="<start> I")
'<start> i am not well , sir , i am not well . <end> '

Making Sentences

Downloading the data

Reading in the data

  • glob ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•˜๋ฉด ํŒŒ์ผ์„ ์ฝ์–ด์˜ค๋Š” ์ž‘์—…์„ ํ•˜๊ธฐ๊ฐ€ ์•„์ฃผ ์šฉ์ด
    • txt ํŒŒ์ผ์„ ์ฝ์–ด์˜จ ํ›„
    • raw_corpus ๋ฆฌ์ŠคํŠธ์— ๋ฌธ์žฅ ๋‹จ์œ„๋กœ ์ €์žฅ
import glob
import os

txt_file_path = './lyricist/data/lyrics/*'

txt_list = glob.glob(txt_file_path)

raw_corpus = []

# ์—ฌ๋Ÿฌ๊ฐœ์˜ txt ํŒŒ์ผ์„ ๋ชจ๋‘ ์ฝ์–ด์„œ raw_corpus ์— ๋‹ด์Šต๋‹ˆ๋‹ค.
for txt_file in txt_list:
    with open(txt_file, "r") as f:
        raw = f.read().splitlines()
        raw_corpus.extend(raw)

print("๋ฐ์ดํ„ฐ ํฌ๊ธฐ:", len(raw_corpus))
print("Examples:\n", raw_corpus[:3])
๋ฐ์ดํ„ฐ ํฌ๊ธฐ: 187088
Examples:
 ['The first words that come out', 'And I can see this song will be about you', "I can't believe that I can breathe without you"]

๋ฐ์ดํ„ฐ ์ •์ œ

  • Reuse the preprocess_sentence() function
  • Additionally, remove sentences that are too long, since they force excessive padding onto the rest of the data
  • Excluding sentences whose tokenized length exceeds 15 tokens from the training data is recommended (a sketch follows below)

ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹ ๋ถ„๋ฆฌ


  • After converting the data to a tensor with the tokenize() function, use sklearn's train_test_split() function to split it into training data and evaluation data
  • Set the vocabulary size to 12,000 or more! Use 20% of the total data as the evaluation dataset
enc_train, enc_val, dec_train, dec_val = <write code>
  File "/tmp/ipykernel_2308/3310982766.py", line 1
    enc_train, enc_val, dec_train, dec_val = <write code>
                                             ^
SyntaxError: invalid syntax
# ๊ฒฐ๊ณผํ™•์ธ
print("Source Train:", enc_train.shape)
print("Target Train:", dec_train.shape)
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

/tmp/ipykernel_2308/1251382030.py in <module>
      1 # ๊ฒฐ๊ณผํ™•์ธ
----> 2 print("Source Train:", enc_train.shape)
      3 print("Target Train:", dec_train.shape)


NameError: name 'enc_train' is not defined
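
One way the placeholder might be filled in, as a sketch (random_state is an arbitrary choice, and tokenize() would need its num_words raised to 12,000 per the guideline above):

from sklearn.model_selection import train_test_split

tensor, tokenizer = tokenize(corpus)
src_input = tensor[:, :-1]   # source: drop the last token (likely <pad>)
tgt_input = tensor[:, 1:]    # target: drop <start>

enc_train, enc_val, dec_train, dec_val = train_test_split(
    src_input, tgt_input, test_size=0.2, random_state=42)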

# Making the AI
- Design a model that can bring val_loss down to around 2.2 within 10 epochs by adjusting the model's Embedding Size and Hidden Size
- Use the Loss function as-is
  File "/tmp/ipykernel_2308/1025019345.py", line 2
    - Design a model that can bring val_loss down to around 2.2 within 10 epochs by adjusting the model's Embedding Size and Hidden Size
          ^
SyntaxError: invalid syntax
#Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
generate_text(lyricist, tokenizer, init_sentence="<start> i love", max_len=20)
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

/tmp/ipykernel_2308/936477000.py in <module>
----> 1 generate_text(lyricist, tokenizer, init_sentence="<start> i love", max_len=20)


NameError: name 'lyricist' is not defined
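
A sketch of the missing step that would define lyricist, reusing the TextGenerator class above (the sizes are starting guesses to tune toward a val_loss of 2.2):

lyricist = TextGenerator(tokenizer.num_words + 1, embedding_size=256, hidden_size=1024)
lyricist.compile(loss=loss, optimizer=tf.keras.optimizers.Adam())
lyricist.fit(enc_train, dec_train,
             validation_data=(enc_val, dec_val),
             batch_size=256, epochs=10)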

ํ‰๊ฐ€๋ฌธํ•ญ ์ƒ์„ธ๊ธฐ์ค€
1. ๊ฐ€์‚ฌ ํ…์ŠคํŠธ ์ƒ์„ฑ ๋ชจ๋ธ์ด ์ •์ƒ์ ์œผ๋กœ ๋™์ž‘ํ•˜๋Š”๊ฐ€?

ํ…์ŠคํŠธ ์ œ๋„ˆ๋ ˆ์ด์…˜ ๊ฒฐ๊ณผ๊ฐ€ ๊ทธ๋Ÿด๋“ฏํ•œ ๋ฌธ์žฅ์œผ๋กœ ์ƒ์„ฑ๋˜๋Š”๊ฐ€?

  1. ๋ฐ์ดํ„ฐ์˜ ์ „์ฒ˜๋ฆฌ์™€ ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์„ฑ ๊ณผ์ •์ด ์ฒด๊ณ„์ ์œผ๋กœ ์ง„ํ–‰๋˜์—ˆ๋Š”๊ฐ€?

ํŠน์ˆ˜๋ฌธ์ž ์ œ๊ฑฐ, ํ† ํฌ๋‚˜์ด์ € ์ƒ์„ฑ, ํŒจ๋”ฉ์ฒ˜๋ฆฌ ๋“ฑ์˜ ๊ณผ์ •์ด ๋น ์ง์—†์ด ์ง„ํ–‰๋˜์—ˆ๋Š”๊ฐ€?

  1. ํ…์ŠคํŠธ ์ƒ์„ฑ๋ชจ๋ธ์ด ์•ˆ์ •์ ์œผ๋กœ ํ•™์Šต๋˜์—ˆ๋Š”๊ฐ€?

ํ…์ŠคํŠธ ์ƒ์„ฑ๋ชจ๋ธ์˜ validation loss๊ฐ€ 2.2 ์ดํ•˜๋กœ ๋‚ฎ์•„์กŒ๋Š”๊ฐ€?
