
Unsupervised Learning in Python
1. Clustering for dataset exploration
Unsupervised Learning
Unsupervised learning: a class of machine learning techniques for discovering patterns in data
- (ex) clustering customers by their purchases, compressing the data using purchase patterns (dimension reduction)
- Supervised learning vs. Unsupervised learning
- supervised: finds patterns for a prediction task
- (ex) classify tumors as benign/cancerous (labels)
- unsupervised: finds patterns in data w/o labels (w/o a specific prediction task)
K-means clustering
- finds clusters of samples
- number of clusters must be specified
- uses sklearn (scikit-learn)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
model.cluster_centers_
- model.fit(samples)
- fits the model to the data by locating & remembering the regions where the different clusters occur
labels = model.predict(samples)
Cluster labels for new samples
- new samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the centroids)
- new samples are assigned to the cluster whose centroid is closest
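A minimal sketch of this, assuming the model fitted above & a hypothetical array new_samples with the same columns as samples:
new_labels = model.predict(new_samples)   # each new sample gets the label of its closest centroid
print(new_labels)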
Scatter plots
import matplotlib.pyplot as plt
xs = samples[:,0]
ys = samples[:,2]
plt.scatter(xs, ys, c=labels)
plt.show()
- plt.scatter(xs, ys, c=labels, marker='D', s=50)
- c: marker colors, set by the cluster labels
- marker='D': diamond as marker
- s: size of marker
Evaluating a clustering
- compare the clusters with the original data
- measure quality of a clustering
- informs choice of how many clusters to look for
import pandas as pd
df = pd.DataFrame({'labels': labels, 'species': species})
ct = pd.crosstab(df['labels'], df['species'])
Crosstab of labels and species

Measuring clustering quality using only samples & cluster labels
- a good clustering has tight clusters
- tight clusters: samples in each cluster bunched together (not spread out)
- inertia: measures how spread out the clusters are
- lower is better
- measures distance from each sample to centroid of its cluster
- available after fit() method, as attribute inertia_
- decreases with increasing number of clusters
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
- how to choose a good clustering, given the tradeoff between inertia & number of clusters
- choose an "elbow": low inertia & not too many clusters
- i.e., the point where inertia begins to decrease more slowly as the number of clusters increases (see the sketch below)
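A minimal sketch of this "elbow" inspection, assuming samples is already loaded; the range of k values tried here is an arbitrary choice:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k)   # fit a separate model for each number of clusters
    model.fit(samples)
    inertias.append(model.inertia_)   # inertia_ is available after fit()
plt.plot(ks, inertias, '-o')   # look for the "elbow" where the curve starts to flatten
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.show()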
Transforming features for better clustering
- in KMeans: feature variance = feature influence
- variance of a feature corresponds to its influence on the clustering algorithm
- to give every feature a chance, data needs to be transformed so that features have equal variance
- StandardScaler transforms each feature to have mean 0 & variance 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)   # returns StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)
- StandardScaler vs. KMeans
- StandardScaler: fit()/transform()
- KMeans: fit()/predict()
- assigns cluster labels to samples
- StandardScaler then KMeans
- use sklearn pipeline to combine the steps
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
labels = pipeline.predict(samples)
Normalizer()
- StandardScaler() standardizes each feature by removing the mean & scaling to unit variance
- Normalizer() rescales each sample (row) independently of the others (see the sketch below)
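A minimal sketch contrasting the two, assuming a 2D array samples; Normalizer() rescales each row to unit norm & can be combined with KMeans in a pipeline just like StandardScaler (n_clusters=3 is an arbitrary choice):
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
normalizer = Normalizer()   # rescales each sample (row), not each feature (column)
pipeline = make_pipeline(normalizer, KMeans(n_clusters=3))
pipeline.fit(samples)
labels = pipeline.predict(samples)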
2. Visualization with hierarchical clustering and t-SNE
Visualizing hierarchies
- t-SNE: creates 2D map of a dataset
- conveys useful information about the proximity of samples to one another
- hierarchical clustering

Hierarchical clustering (Agglomerative)
- every country begins in a separate cluster
- at each step, the 2 closest clusters are merged
- continue until all countries in a single cluster
- divisive clustering works the other way around
The dendrogram
- read from bottom up
- vertical lines represent clusters
- joining of vertical lines = merging of clusters
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
mergings = linkage(samples, method='complete')
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()
Cluster labels in hierarchical clustering
Intermediate clusterings & heights on dendrogram
- intermediate stage in the hierarchical clustering is specified by choosing a height on the dendrogram
Dendrograms
- y-axis (height on dendrogram) = distance between merging clusters
- don’t merge clusters further apart than this
- distance b/w clusters
- defined by Linkage method
- “complete” linkage: distance b/w clusters is distance b/w furthest points
- “single” linkage: distance b/w clusters is the distance b/w closest points
- specified via method parameter
Extracting cluster labels
- fcluster() function
- returns a NumPy array of cluster labels
- cluster labels start at 1 (not 0 like scikit-learn)
from scipy.cluster.hierarchy import linkage
mergings = linkage(samples, method='complete')
from scipy.cluster.hierarchy import fcluster
labels = fcluster(mergings, 15, criterion='distance')
Aligning cluster labels w/ country names
import pandas as pd
pairs = pd.DataFrame({'labels': labels, 'countries': country_names})
pairs.sort_values('labels')
t-SNE for 2-dimensional maps
- t-SNE: unsupervised learning method for visualization
- t-distributed stochastic neighbor embedding
- maps samples from a high-dimensional space into 2D or 3D space so they can be visualized
- approximately preserves nearness of samples
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys, c=species)
plt.show()
- fit_transform() method
- simultaneously fits the model & transforms the data
- t-SNE has no separate fit() & transform() methods
- can’t extend the map to include new data samples
- must start over each time
- t-SNE learning rate
- choose learning rate for the data set
- wrong choice: points bunch together
- try values between 50 & 200
- axes of t-SNE plot have no meaning
- changes every time even on same dataset
3. Decorrelating your data and dimension reduction
Visualizing the PCA transformation
Dimension reduction: finds patterns in data & uses the patterns to re-express the data in a compressed form
- more efficient storage & computation
- remove less-informative noise features
- noise features cause problems for prediction tasks (i.e., classification, regression)
Principal Component Analysis (PCA)
- fundamental dimension reduction technique
- Steps
- decorrelation
- dimension reduction
PCA in coding
- PCA is a scikit-learn component
- fit() learns the transformation from given data
- how to shift & how to rotate the samples
- does not actually change them
- transform() applies the transformation that fit learned
- can also be applied to new data
- returns a new array of transformed samples
- same number of rows & columns
- columns: PCA features
from sklearn.decomposition import PCA
model = PCA()
model.fit(samples)
transformed = model.transform(samples)
- PCA features are not correlated (unlike features of original dataset)
Pearson correlation
- measures linear correlation of features
- value b/w -1 & 1
- value of 0 = no linear correlation
from scipy.stats import pearsonr
correlation, pvalue = pearsonr(width, length)
Principal components
- principal components = directions of variance
- directions in which samples vary the most
- PCA aligns principal components w/ the axes
- available as .components_ attribute of PCA object
- numpy array w/ 1 row for each principal component
- each row defines displacement from mean
attributes of PCA
- .components_ : principal components
- .mean_ : coordinates of the mean of data
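A minimal sketch of inspecting these attributes on the PCA model fitted above (shapes depend on the data):
print(model.components_)   # one row per principal component
print(model.components_.shape)   # (n_components, n_features)
print(model.mean_)   # coordinates of the mean of the data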
Intrinsic dimension
Intrinsic dimension: number of features needed to approximate the dataset
- informs dimension reduction b/c it tells how much a dataset can be compressed
- the most compact representation
- can be detected w/ PCA
PCA identifies intrinsic dimensions
- intrinsic dimension = number of PCA features w/ significant variance
Plotting variances of PCA features
- samples: array of samples
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(samples)
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()
How to draw arrow
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
- .arrow()
- 1st argument: x coordinate of starting point
- 2nd argument: y coordinate of starting point
- 3rd argument: length of the arrow along x (dx)
- 4th argument: length of the arrow along y (dy); see the sketch below for obtaining mean & first_pc from a fitted PCA
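A hedged sketch of how mean & first_pc might be obtained before drawing the arrow, assuming pca is a PCA model fitted on a 2-feature array samples (variable names are assumptions):
import matplotlib.pyplot as plt
plt.scatter(samples[:, 0], samples[:, 1])
mean = pca.mean_   # the arrow starts at the mean of the samples
first_pc = pca.components_[0]   # first principal component: direction of greatest variance
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
plt.axis('equal')
plt.show()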
Dimension reduction w/ PCA
- dimension reduction: represents the same data using fewer features
- PCA performs dimension reduction by discarding the PCA features w/ lower variance, which it assumes to be noise, & retaining the higher-variance PCA features, which it assumes to be informative
- specify how many features to keep i.e., PCA(n_components=2)
- intrinsic dimension is a good choice
Code
- samples: array of measurements (4 features)
- aim to decrease to 2 features
- species: list of species numbers
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(samples)
transformed = pca.transform(samples)
print(transformed.shape)
- results
- PCA reduced dimension to 2
- retained the 2 PCA features w/ highest variance
- important information preserved: species remain distinct
Word Frequency arrays
- rows represent documents & columns represent words
- entries measure presence of each word in each document
- most entries of the word-frequency array are zero
- use scipy.sparse.csr_matrix
- csr_matrix remembers only the non-zero entries
- scikit-learn's PCA doesn't accept csr_matrix; use TruncatedSVD instead, which performs the same transformation but accepts sparse input
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents)
transformed = model.transform(documents)
How to create a tf-idf word frequency array
- TfidfVectorizer from sklearn
- transforms a list of documents into a word frequency array in the form of csr_matrix
- has fit() & transform() methods
- tf: frequency of the word in the document
- idf: reduces the influence of frequent words (e.g., 'the')
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)
csr_mat.toarray()
words = tfidf.get_feature_names()
4. Discovering interpretable features
Non-negative matrix factorization (NMF)
- dimension reduction technique
- NMF models are interpretable (unlike PCA)
- all sample features must be non-negative for NMF to be applied
Interpretable parts
- NMF achieves its interpretability by decomposing samples as sums of their parts
- NMF expresses documents as combinations of topics (or themes)
- expresses images as combinations of patterns
Using scikit-learn NMF
- unlike PCA, desired number of components must always be specified
- works with NumPy arrays & csr_matrix
from sklearn.decomposition import NMF
model = NMF(n_components=2)
model.fit(samples)
nmf_features = model.transform(samples)
model.components_
NMF features
- non-negative
- can be used to reconstruct samples when combined with components
Reconstruction of sample
- multiply components by feature values & add up
- [2, 1] * [[1 0.5 0], [0.2 0.1 2.1]] → [2.2, 1.1, 2.1]
- can be expressed as a product of matrices (see the sketch below)
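A minimal NumPy sketch of this reconstruction, using the toy numbers above (the arrays are illustrative only):
import numpy as np
features = np.array([2, 1])   # NMF feature values of one sample
components = np.array([[1, 0.5, 0],
                       [0.2, 0.1, 2.1]])   # plays the role of model.components_
reconstruction = features.dot(components)   # multiply & add up = matrix product
print(reconstruction)   # [2.2 1.1 2.1]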
NMF fits to non-negative data only
NMF learns interpretable parts
from sklearn.decomposition import NMF
nmf = NMF(n_components=10)
nmf.fit(articles)
NMF components
- for documents:
- NMF components represent topics
- NMF features combine topics into documents
- for images, NMF components are parts of images
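For the document case, a hedged sketch of inspecting the words that dominate one topic, assuming words holds the column names of the word-frequency array (e.g., from tfidf.get_feature_names()) & nmf is the model fitted on articles above; component 3 is an arbitrary choice:
import pandas as pd
components_df = pd.DataFrame(nmf.components_, columns=words)   # one row per component, one column per word
print(components_df.iloc[3].nlargest())   # the words with the largest values hint at the topic of component 3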
Grayscale images: no colors, only different shades of grey
- since there are only shades of grey, a grayscale image can be encoded by the brightness of every pixel
- represent brightness w/ value b/w 0 & 1
- convert to 2D array of numbers
Grayscale images as flat arrays
- enumerate the entries
- read-off the values row-by-row
- from left to right, top to bottom
- a flat array of non-negative numbers
A collection of grayscale images of the same size can be encoded as a 2D array
- each row represents an image as a flattened array
- each column represents a pixel
- NMF can be used
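A minimal NumPy sketch of this flattening (the 2x3 image is made up for illustration):
import numpy as np
image = np.array([[0.0, 1.0, 0.5],
                  [1.0, 0.0, 1.0]])   # hypothetical 2x3 grayscale image
flat = image.flatten()   # read off row-by-row, left to right, top to bottom
print(flat)   # a flat array of the 6 pixel brightnesses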
To recover the image
- reshape() method
- specify the dimensions of the original image as a tuple
- returns 2D array of pixel brightnesses
- use pyplot to show the image
bitmap = sample.reshape((2, 3))
from matplotlib import pyplot as plt
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.show()
def show_as_image(sample):
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap='gray', interpolation='nearest')
    plt.colorbar()
    plt.show()
Building recommender systems using NMF
task: recommend articles similar to the one currently being read by a customer
Strategy
- apply NMF to the word-frequency array of articles & use the resulting NMF features
- NMF feature values describe the topics
- so similar documents have similar NMF feature values
Apply NMF to the word-frequency array
- articles: word frequency array
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)
Compare articles by NMF features
- different versions of the same document have similar topic proportions, but the exact feature values may differ
- e.g., one version uses many meaningless words, which reduces the values of the NMF features representing the topics
- however, on a scatter plot of the NMF features, all these versions (weak & strong) lie on a single line passing through the origin
Cosine similarity: the cosine of the angle between the lines
- higher values: greater similarity
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
current_article = norm_features[23,:]
similarities = norm_features.dot(current_article)
similarities
Dataframes and Labels
- label similarities with article titles
import pandas as pd
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)
current_article = df.loc['Dog bites man']
similarities = df.dot(current_article)
similarities.nlargest()
MaxAbsScaler
- transforms the data so that all users have the same influence on the model regardless of how many different artists they’ve listened to
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
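A hedged sketch of how MaxAbsScaler might be combined with NMF & Normalizer in a pipeline for an artist-recommendation task (artists, artist_names & the artist name are assumptions: an artists-by-users array of listening counts & its row labels):
from sklearn.decomposition import NMF
from sklearn.preprocessing import MaxAbsScaler, Normalizer
from sklearn.pipeline import make_pipeline
import pandas as pd
scaler = MaxAbsScaler()   # scales each user column by its maximum, so every user has equal influence
nmf = NMF(n_components=20)   # arbitrary number of components
normalizer = Normalizer()   # unit-norm rows, so dot products give cosine similarities
pipeline = make_pipeline(scaler, nmf, normalizer)
norm_features = pipeline.fit_transform(artists)
df = pd.DataFrame(norm_features, index=artist_names)
similarities = df.dot(df.loc['Bruce Springsteen'])   # hypothetical artist name
print(similarities.nlargest())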