😢 Study Notes (Machine Learning 15)

zoe · May 30, 2023

Clustering

  • ๋น„์ง€๋„ ํ•™์Šต
    - ๊ตฐ์ง‘ Clustering ๋น„์Šทํ•œ ์ƒ˜ํ”Œ ๋ชจ์Œ
    - ์ด์ƒ์น˜ ํƒ์ง€ Outier detection : ์ •์ƒ ๋ฐ์ดํ„ฐ๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณด์ด๋Š”์ง€ ํ•™์Šต, ๋น„์ •์ƒ ์ƒ˜ํ”Œ์„ ๊ฐ์ง€
    - ๋ฐ€๋„ ์ถ”์ • : ๋ฐ์ดํ„ฐ์…‹์˜ ํ™•๋ฅ  ๋ฐ€๋„ ํ•จ์ˆ˜ Probability Density Function PDF๋ฅผ ์ถ”์ •. ์ด์ƒ์น˜ ํƒ์ง€ ๋“ฑ์— ์‚ฌ์šฉ
  • K-Means :
    - ๊ตฐ์ง‘ํ™”์—์„œ ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜
    - ๊ตฐ์ง‘ ์ค‘์‹ฌ(centroid)์ด๋ผ๋Š” ์ž„์˜์˜ ์ง€์ ์„ ์„ ํƒํ•ด์„œ ํ•ด๋‹น ์ค‘์‹ฌ์— ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ํฌ์ธํŠธ๋“ค์„ ์„ ํƒํ•˜๋Š” ๊ตฐ์ง‘ํ™”
    - ์ผ๋ฐ˜์ ์ธ ๊ตฐ์ง‘ํ™”์—์„œ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๊ธฐ๋ฒ•
    - ๊ฑฐ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์†์„ฑ์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งค์šฐ ๋งŽ์„ ๊ฒฝ์šฐ ๊ตฐ์ง‘ํ™”์˜ ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์ง
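
The K-Means loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function names `assign_labels` and `kmeans_sketch`, the toy data, and the plain random initialization (rather than k-means++) are all assumptions made for the example.

```python
import numpy as np

def assign_labels(X, centroids):
    """Return, for each sample, the index of its nearest centroid."""
    # Distance from every point to every centroid, shape (n_samples, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Hypothetical helper: assign points, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Simple random init: k distinct samples (an assumption, not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign_labels(X, centroids)
        # Each centroid becomes the mean of its points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return assign_labels(X, centroids), centroids

# Toy data: two blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(centroids)
```

Because every step relies on Euclidean distance, features on very different scales (or a very large number of features) can distort the assignments, which is the weakness noted in the list above.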



# Practice with the iris data

from sklearn.preprocessing import scale
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

iris = load_iris()
# Feature names: the trailing " (cm)" on every name is inconvenient

iris.feature_names
# Strip the trailing " (cm)" (5 characters)

cols = [each[:-5] for each in iris.feature_names]
cols
# Put the iris data into a DataFrame

iris_df = pd.DataFrame(data=iris.data, columns=cols)
iris_df.head()
# For convenience, use only two features

feature = iris_df[['petal length', 'petal width']]
feature.head()


Clustering

  • KMeans parameters
    - n_clusters: the number of clusters, i.e. the number of centroids
    - init: how the initial centroid coordinates are chosen
    - max_iter: the maximum number of iterations; the algorithm stops earlier once the centroids no longer move
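
As a small illustration of these parameters, the sketch below fits KMeans with two `init` settings and prints how many iterations each run needed (`n_iter_`) and the final within-cluster sum of squares (`inertia_`). The variable names and `n_init=10` are assumptions chosen for this example, not values taken from the notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Petal length and petal width only, as in the practice code
X_demo = load_iris().data[:, 2:4]

results = {}
for init in ("k-means++", "random"):
    km = KMeans(n_clusters=3, init=init, n_init=10, max_iter=300, random_state=0)
    km.fit(X_demo)
    # n_iter_ is the number of iterations actually run; it never exceeds max_iter
    results[init] = (km.n_iter_, km.inertia_)
    print(init, "iterations:", km.n_iter_, "inertia:", round(km.inertia_, 3))
```

On a small, well-separated dataset both settings tend to land on similar solutions; the choice of `init` usually matters more on larger or messier data.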
# Fit K-Means with three clusters

model = KMeans(n_clusters=3)
model.fit(feature)
# Resulting labels (unlike supervised labels, they only identify the clusters and carry no order)

model.labels_
# Cluster centers

model.cluster_centers_
# Reorganize the data for plotting

predict = pd.DataFrame(model.predict(feature), columns=['cluster'])
feature = pd.concat([feature, predict], axis=1)
feature.head()
# Check the result

centers = pd.DataFrame(model.cluster_centers_, columns=['petal length', 'petal width'])
center_x = centers['petal length']
center_y = centers['petal width']

plt.figure(figsize=(12, 8))
plt.scatter(feature['petal length'], feature['petal width'], c = feature['cluster'], alpha=0.5)
plt.scatter(center_x, center_y, s=50, marker='D', c='r')
plt.show()


make_blobs

  • make_blobs: a data generator for practicing clustering
# Generate practice data with make_blobs

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, n_features=2, centers=3, 
                  cluster_std=0.8, random_state=0)
print(X.shape, y.shape)

unique, counts = np.unique(y, return_counts = True)
print(unique, counts)
# ๋ฐ์ดํ„ฐ ์ •๋ฆฌ

cluster_df = pd.DataFrame(data=X, columns=['ftr1', 'ftr2'])
cluster_df['target'] = y
cluster_df.head()
# ๊ตฐ์ง‘ํ™”

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, random_state=13)
cluster_labels = kmeans.fit_predict(X)
cluster_df['kmeans-label'] = cluster_labels
# ๊ฒฐ๊ณผ ๋„์‹ํ™”

centers = kmeans.cluster_centers_
unique_labels = np.unique(cluster_labels)
markers = ['o', 's', '^', 'P', 'D', 'H', 'x']

for label in unique_labels:
    label_cluster = cluster_df[cluster_df['kmeans-label']==label]
    center_x_y = centers[label]
    plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'],
                edgecolors='k', marker=markers[label])
    plt.scatter(x=center_x_y[0], y=center_x_y[1], s=200, color = 'white',
                alpha=0.9, edgecolors='k', marker=markers[label])
    plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color = 'k',
                edgecolors='k', marker='$%d$' % label)

plt.show()
    
# Check the result: compare the true blob index with the assigned cluster label

print(cluster_df.groupby('target')['kmeans-label'].value_counts())


๊ตฐ์ง‘ ํ‰๊ฐ€

  • ๊ตฐ์ง‘ ๊ฒฐ๊ณผ์˜ ํ‰๊ฐ€ : ๋ถ„๋ฅ˜๊ธฐ๋Š” ํ‰๊ฐ€ ๊ธฐ์ค€์„ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ, ๊ตฐ์ง‘์€ ๊ทธ๋ ‡์ง€ ์•Š๋‹ค. ๊ตฐ์ง‘ ๊ฒฐ๊ณผ๋ฅผ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ์‹ค๋ฃจ์—ฃ ๋ถ„์„์„ ๋งŽ์ด ํ™œ์šฉํ•œ๋‹ค

  • ์‹ค๋ฃจ์—ฃ ๋ถ„์„ : ์‹ค๋ฃจ์—ฃ ๋ถ„์„์€ ๊ฐ ๊ตฐ์ง‘ ๊ฐ„์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ์–ผ๋งˆ๋‚˜ ํšจ์œจ์ ์œผ๋กœ ๋ถ„๋ฆฌ๋˜์–ด ์žˆ๋Š”์ง€ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๋‹ค๋ฅธ ๊ตฐ์ง‘๊ณผ๋Š” ๊ฑฐ๋ฆฌ๊ฐ€ ๋–จ์–ด์ ธ ์žˆ๊ณ , ๋™์ผ ๊ตฐ์ง‘๊ฐ„์˜ ๋ฐ์ดํ„ฐ๋Š” ์„œ๋กœ ๊ฐ€๊น๊ฒŒ ์ž˜ ๋ญ‰์ณ ์žˆ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค. ๊ตฐ์ง‘ํ™”๊ฐ€ ์ž˜ ๋˜์–ด ์žˆ์„์ˆ˜๋ก ๊ฐœ๋ณ„ ๊ตฐ์ง‘์€ ๋น„์Šทํ•œ ์ •๋„์˜ ์—ฌ์œ ๊ณต๊ฐ„์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

  • ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜ : ๊ฐœ๋ณ„ ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฐ€์ง€๋Š” ๊ตฐ์ง‘ํ™”์˜ ์ง€ํ‘œ
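
For a single sample, the silhouette coefficient is usually defined as s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes it by hand and compares it with scikit-learn's `silhouette_samples`. The helper name `silhouette_by_hand`, the variable names, and `n_init=10` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples

X_sil = load_iris().data
labels_sil = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sil)

def silhouette_by_hand(X, labels):
    """Hypothetical helper: per-sample silhouette with Euclidean distance."""
    # Full pairwise distance matrix, shape (n_samples, n_samples)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a single-point cluster keeps the score 0
        # a: mean distance to the other members of the same cluster
        # (D[i, i] is 0, so dividing by n_same - 1 excludes the point itself)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_by_hand(X_sil, labels_sil)
reference = silhouette_samples(X_sil, labels_sil)
print("max abs difference:", np.abs(manual - reference).max())
print("mean silhouette:", manual.mean())
```

Values near 1 mean a point sits well inside its own cluster, values near 0 mean it lies between clusters, and negative values suggest it may belong to a neighboring cluster.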




# ๋ฐ์ดํ„ฐ ์ฝ๊ธฐ

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import pandas as pd

iris = load_iris()
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df = pd.DataFrame(data=iris.data, columns=feature_names)
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=0).fit(iris_df)
# Store the cluster assignments

iris_df['cluster'] = kmeans.labels_
iris_df.head()
# Prepare the cluster evaluation

from sklearn.metrics import silhouette_samples, silhouette_score

avg_value = silhouette_score(iris.data, iris_df['cluster'])
score_values = silhouette_samples(iris.data, iris_df['cluster'])

print('avg_value', avg_value)
print('shape of the silhouette_samples() return value', score_values.shape)
# Install yellowbrick

#!pip install yellowbrick
# Silhouette plot (a good result shows the clusters as cleanly separated, straight-edged blocks)

from yellowbrick.cluster import silhouette_visualizer

silhouette_visualizer(kmeans, iris.data, colors='yellowbrick')


Image Segmentation

  • Image segmentation
    - Image segmentation splits an image into several segments
    - Semantic segmentation assigns pixels that belong to the same type of object to the same segment
    - For the best semantic segmentation performance, CNN-based models are the usual choice
    - Here we simply try segmentation by color
# Read the image

from matplotlib.image import imread

image = imread('./ladybug.png')
image.shape
plt.imshow(image)
# Cluster by color

from sklearn.cluster import KMeans

X = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=8, random_state=13).fit(X) # try splitting the colors into 8 groups
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)

# Result: the set of colors becomes simpler

plt.imshow(segmented_img)
# This time, compare several cluster counts

segmented_imgs = []
n_colors = (10, 8, 6, 4, 2)
for n_clusters in n_colors:
    kmeans = KMeans(n_clusters=n_clusters, random_state=13).fit(X)
    segmented_img = kmeans.cluster_centers_[kmeans.labels_]
    segmented_imgs.append(segmented_img.reshape(image.shape))
# This time, visualize the results in a slightly more elaborate layout

plt.figure(figsize=(10, 5))
plt.subplots_adjust(wspace=0.05, hspace=0.1)

plt.subplot(231)
plt.imshow(image)
plt.title('Original image')
plt.axis('off')


for idx, n_clusters in enumerate(n_colors):
    plt.subplot(232 + idx)
    plt.imshow(segmented_imgs[idx])
    plt.title('{} colors'.format(n_clusters))
    plt.axis('off')

plt.show()
# Digits data (scikit-learn's 8x8 load_digits set, a small MNIST-like dataset)

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=13)
# Logistic regression

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000, random_state=13)

log_reg.fit(X_train, y_train)
# Result

log_reg.score(X_test, y_test)
# Use KMeans as a preprocessing step inside a pipeline

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('kmeans', KMeans(n_clusters=50, random_state=13)),
    ('log_reg', LogisticRegression(multi_class='ovr', solver='lbfgs',
                                   max_iter=5000, random_state=13))
])

pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
# Grid search over the number of clusters

from sklearn.model_selection import GridSearchCV

param_grid = dict(kmeans__n_clusters=range(2, 100))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)
grid_clf.best_params_
grid_clf.score(X_test, y_test)
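
A short sketch of why KMeans can act as a preprocessing step: inside a pipeline, its `transform` method replaces each sample with its distances to the cluster centers, so the classifier is trained on those distances instead of the raw pixels. The variable names and `n_init=10` below are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X_d, _ = load_digits(return_X_y=True)

km_d = KMeans(n_clusters=50, n_init=10, random_state=13).fit(X_d)
# transform() returns the distance from every sample to every centroid
X_dist = km_d.transform(X_d)

print(X_d.shape, "->", X_dist.shape)  # (1797, 64) -> (1797, 50)
```

With 50 clusters, the 64 pixel features become 50 distance features, which is the representation the logistic regression step receives.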

💻 Source: Zerobase Data Job School
