
Supervised Learning with scikit-learn
Classification
Supervised Learning
- Machine Learning: the art and science of giving computers the ability to learn to make decisions from data w/o being explicitly programmed
- (ex) learning to predict whether an email is spam, clustering Wikipedia entries into different categories, etc.
- Unsupervised learning: uses unlabeled data
- uncovering hidden patterns from unlabeled data
- (ex) clustering: grouping customers into distinct categories
- Reinforcement learning
- software agents interact with an environment
- learn how to optimize their behavior
- given a system of rewards & punishments
- draws inspiration from behavioral psychology
- Supervised learning: uses labeled data
- predictor variables/features and a target variable
- aim: predict the target variable, given the predictor variables
- classification: target variable consists of categories
- regression: target variable is continuous
- applications
- automate time-consuming or expensive manual tasks
- make predictions about the future
- need labeled data
Exploratory data analysis
The Iris dataset in scikit-learn
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()
type(iris)
print(iris.keys())
type(iris.data), type(iris.target)
iris.data.shape
iris.target_names
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
Visual EDA
_ = pd.plotting.scatter_matrix(df, c = y, figsize = [8,8],
s= 150, marker='D')
- c: color (data points in the figure will be colored by this value)
- figsize: size of figure
Result

- diagonal: histograms of the features corresponding to row & column
- off-diagonal: scatter plots of the column feature vs. row feature colored by target variable
The Classification Challenge
K-Nearest Neighbors (KNN)
- predicts the label of a data point by looking at the 'k' closest labeled data points
- the data points vote on what label the unlabeled point should have
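The voting idea can be sketched in a few lines of NumPy. This is only an illustration, not how scikit-learn implements it: the helper `knn_predict` and the toy arrays `X_toy` / `y_toy` are made up here, and Euclidean distance is assumed as the notion of "closest".

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # each neighbor votes with its label; the most common label wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data: two well-separated clusters (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.1]), k=3))
```

A point near the first cluster gets label 0 and a point near the second gets label 1, because all three of its nearest neighbors carry that label.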
Scikit-learn fit and predict
- all ML models implemented as Python classes
- they implement the algorithms for learning & predicting
- store the information learned from the data
- training a model on the data = 'fitting' a model to the data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])
- data should be a NumPy array or pandas DataFrame
- features should be continuous values (not categories)
- there should be no missing values in the data
- each column is a feature & each row is a data point
- in the fit method, the first argument should be the features & the second should be the target variable
x_new = np.array([[5.6, 2.5, 3.1, 2.6], [5.7, 2.6, 3.1, 2.6],
[1.5, 2.7, 4.1, 2.7]])
prediction = knn.predict(x_new)
- prediction: 1-D array of length 3 (shape (3,)) with one prediction for each observation (row) in x_new
Measuring model performance
- accuracy is a commonly used metric in classification
- accuracy = fraction of correct predictions
Procedure
- split data into training & test set
- fit/train the classifier on the training set
- make predictions on test set
- compare predictions with the known labels
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=21, stratify=y)
- train_test_split()
- 1st argument: feature data, 2nd argument: targets or labels
- test_size: proportion of the original data to be used for the test set
- random_state: sets a seed for the random number generator that splits the data into train & test
- stratify: pass the array of labels so that the class proportions in the train & test sets match those in the original dataset
- returns 4 arrays: training data, test data, training labels, test labels
- by default, test data: 25%, training data: 75%
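To see what stratify does, one can count the labels in each split. A minimal sketch on the Iris data; the names `train_counts` and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# number of samples per class in each split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(X_train.shape, X_test.shape)
print(train_counts, test_counts)
```

Iris has the same number of samples in each class, so with stratify=y the per-class counts within each split should also come out balanced.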
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
knn.score(X_test, y_test)
Model Complexity
- larger k = smoother decision boundary = less complex model
- complex models run the risk of being sensitive to noise in the data rather than reflecting the underlying trend → overfitting
- too large a k → model too simple → underfitting
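One way to look at this trade-off is to record training and test accuracy for a range of k values. The sketch below uses the Iris data and an arbitrary range of 1 to 30 neighbors; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21,
    stratify=iris.target)

neighbors = np.arange(1, 31)
train_acc = np.empty(len(neighbors))
test_acc = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # accuracy on the data the model has seen vs. data it has not seen
    train_acc[i] = knn.score(X_train, y_train)
    test_acc[i] = knn.score(X_test, y_test)

for k, tr, te in zip(neighbors, train_acc, test_acc):
    print(k, round(tr, 3), round(te, 3))
```

Plotting train_acc and test_acc against neighbors gives a model complexity curve: a large gap between the two at small k hints at overfitting, and both dropping at large k hints at underfitting.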
Regression
Introduction to Regression
- target value: continuous value
boston = pd.read_csv('boston.csv')
X = boston.drop('MEDV', axis=1).values
y = boston['MEDV'].values
X_rooms = X[:, 5]
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)
Fitting a regression model
import numpy as np
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1, 1)
Basics of Linear Regression
- define an error function for any given line & choose the line that minimizes the error function
- aka loss function (cost function)
- the error is measured by the residuals (vertical distances between the data points and the line)
- Ordinary least squares (OLS): minimize the sum of the squares of the residuals
- Linear regression in higher dimensions
- must specify a coefficient for each feature (x) & the intercept b
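As a rough check of the OLS idea, the least-squares solution can be computed directly with NumPy and compared with LinearRegression. The data below is synthetic (made up for this sketch), and `coef_ols` / `intercept_ols` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
# synthetic target: y = 2*x1 - 1*x2 + 0.5*x3 + 4 + small noise
y_demo = (X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0
          + rng.normal(scale=0.1, size=100))

# append a column of ones so the intercept b is fitted with the coefficients
A = np.column_stack([X_demo, np.ones(len(X_demo))])
# lstsq finds the vector that minimizes the sum of squared residuals
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
coef_ols, intercept_ols = solution[:-1], solution[-1]

reg_demo = LinearRegression().fit(X_demo, y_demo)
print(coef_ols, intercept_ols)
print(reg_demo.coef_, reg_demo.intercept_)
```

Both approaches should agree up to floating-point error, and both should land close to the coefficients used to generate the data.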
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)
Cross-validation
- cross-validation motivation
- model performance is dependent on the way the data is split
- a score from a single split may not be representative of the model's ability to generalize
- so, we use cross-validation to avoid the problem of the chosen metric being dependent on the train test split
- cross-validation basics
- split the dataset into 5 groups (folds)
- hold out the first fold as a test set & fit the model on the remaining 4 folds
- predict on the test
- compute the metric of interest
- repeat for the second fold, third fold.. etc.
- interpreting cross-validation
- as a result of 5-fold cross-validation, we get 5 values of R squared, from which we can compute statistics of interest (mean, median, 95% confidence intervals)
- cross-validation & model performance
- split into five folds - 5-fold cross validation
- split into 10 folds - 10-fold cross validation
- split into k folds - k-fold cross validation (CV)
- trade-off of using more folds: more computationally expensive
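The fold-by-fold procedure can be written out by hand. The sketch below assumes the diabetes dataset bundled with scikit-learn as a stand-in regression problem and uses unshuffled folds, which is what an integer cv gives for a regressor; `manual_scores` and `library_scores` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_cv, y_cv = load_diabetes(return_X_y=True)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_cv):
    reg = LinearRegression()
    # fit on four folds, score (R squared) on the held-out fold
    reg.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(reg.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

# the library helper runs the same loop
library_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5)
print(manual_scores)
print(manual_scores.mean(), np.median(manual_scores))
```

The spread of the five scores shows how much a single train/test split could mislead.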
Cross-validation in scikit-learn
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv = 5)
- cross_val_score(regressor, feature data, target data, cv = #)
- cv: specifies number of folds
- returns an array of cross-validation scores
- length of array = # of folds utilized
- score is R-squared (by default)
Regularized regression
- why regularize?
- linear regression picks a coefficient for each feature; in a high-dimensional space it can end up with large coefficients, which may lead to overfitting
- we should penalize large coefficients → Regularization
Ridge regression
- loss function = OLS loss function + $\alpha \sum_{i=1}^{n} a_i^2$, where the a_i are the model coefficients
- models are penalized for coefficients w/ a large magnitude
- alpha: parameter we need to choose
- controls complexity
- alpha = 0: OLS (possibly overfitting)
- high alpha: large coefficients are significantly penalized (underfitting)
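The effect of alpha can be observed by tracking the size of the coefficient vector as alpha grows. A small sketch, assuming the diabetes dataset bundled with scikit-learn and an arbitrary list of alpha values:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge

X_r, y_r = load_diabetes(return_X_y=True)

# unregularized baseline (corresponds to alpha = 0)
ols_norm = np.linalg.norm(LinearRegression().fit(X_r, y_r).coef_)

alphas = [0.01, 0.1, 1.0, 10.0]
ridge_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_r, y_r)
    # Euclidean length of the coefficient vector
    ridge_norms.append(np.linalg.norm(ridge.coef_))
    print(alpha, ridge_norms[-1])
print('OLS', ols_norm)
```

The norm shrinks as alpha increases: a larger penalty leaves less room for large coefficients, which is why a very high alpha can underfit.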
from sklearn.linear_model import Ridge
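X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# normalize=True worked in older scikit-learn; the argument was removed in 1.2
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
- Ridge(alpha = #)
- alpha: regularization strength
- older releases also took normalize = True/False (True: all variables are put on the same scale before fitting); since scikit-learn 1.2, scale the features separately instead (e.g. with StandardScaler)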
Lasso regression
- loss function = OLS loss function + $\alpha \sum_{i=1}^{n} |a_i|$, where the a_i are the model coefficients
- can be used to select important features of a dataset
- shrinks the coefficients of less important features to 0
- the features whose coefficients stay non-zero are the ones selected by the algorithm
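The selection behaviour is easy to see on synthetic data where only some features matter. Everything below (the random data, the names `X_l`, `y_l`, `lasso_demo`, and alpha=0.1) is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X_l = rng.normal(size=(200, 5))
# only the first two features influence the target
y_l = 3.0 * X_l[:, 0] - 2.0 * X_l[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_l, y_l)
# indices of the features whose coefficient was not shrunk to zero
selected = np.flatnonzero(lasso_demo.coef_)
print(lasso_demo.coef_)
print(selected)
```

The three irrelevant features end up with coefficients of exactly 0, while the two informative ones keep coefficients somewhat smaller in magnitude than the true 3 and -2.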
from sklearn.linear_model import Lasso
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lasso = Lasso(alpha=0.1)  # normalize=True was accepted before scikit-learn 1.2
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)
Feature selection of Lasso regression
from sklearn.linear_model import Lasso
names = boston.drop('MEDV', axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

Fine-tuning your model
How good is your model?
- accuracy may not be the best measure
- ways to diagnose classification predictions
- Confusion matrix: 2-by-2 matrix that summarizes predictive performance (given binary classifier)

- top left & bottom right are correctly labeled (True)
- class of interest: positive class (i.e. spam)
- accuracy = (sum of the diagonal) / (total sum of the matrix)
- Metrics from the confusion matrix
- precision = (number of true positives) / (total number of true positives and false positives)
- aka positive predictive value (PPV)
- high precision: few false positives among the predicted positives
- (ex) not many real emails were predicted as spam
- recall = (number of true positives) / (total number of true positives and false negatives)
- aka sensitivity, hit rate, true positive rate
- high recall: classifier predicted most positives correctly
- F1 score = 2 * (precision * recall) / (precision + recall)
- aka the harmonic mean of precision and recall
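These formulas can be checked by hand on a tiny example. The labels below are made up; note that scikit-learn's confusion_matrix puts true labels in rows and predictions in columns with labels sorted, so for labels {0, 1} the top-left cell is the true negatives.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# made-up true labels and predictions (1 = positive class, e.g. spam)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# for binary labels {0, 1}, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat),
      f1_score(y_true, y_hat))
```

Here there are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, so accuracy is 0.8 and precision, recall and F1 are all 0.75.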
Confusion matrix in scikit-learn
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier(n_neighbors=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
- classification_report(y_test, y_pred)
- 1st argument: true label, 2nd argument: prediction
Logistic regression and the ROC curve
- Logistic regression for binary classification
- logistic regression outputs probabilities
- given 1 feature, log reg outputs a probability p with respect to the target variable
- p > 0.5 → data labeled as 1
- p < 0.5 → data labeled as 0
- log reg produces a linear decision boundary

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
- by default, logistic regression threshold = 0.5
- thresholds affect true & false positive rates
- threshold == 0: model predicts 1 for all data
- true positive rate == false positive rate == 1
- threshold == 1: model predicts 0 for all data
- both true & false positive rates are 0
- threshold in between 0 & 1: series of different false positive & true positive rates
- Receiver Operating Characteristic (ROC) curve: the set of points we get when trying all possible thresholds
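The effect of the threshold can be reproduced by hand. The helper `tpr_fpr` and the probabilities below are made up for illustration; the sketch assumes a point is labeled 1 when its probability is at least the threshold.

```python
import numpy as np

def tpr_fpr(y_true, prob, threshold):
    """Return (true positive rate, false positive rate) for one threshold."""
    pred = (prob >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# made-up labels and predicted probabilities of the positive class
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
prob_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

for t in [0.0, 0.5, 1.0]:
    print(t, tpr_fpr(y_true_demo, prob_demo, t))
```

At threshold 0 both rates are 1, at threshold 1 both are 0, and at 0.5 this toy example gives a true positive rate of 2/3 and a false positive rate of 1/3. Each threshold contributes one point to the ROC curve.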
from sklearn.metrics import roc_curve
y_pred_prob = logreg.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
- roc_curve(y_test, y_pred_prob)
- 1st argument: actual labels, 2nd argument: predicted probabilities
- fpr: false positive rate, tpr: true positive rate
- logreg.predict_proba(X_test)[:,1]
- returns array with 2 columns
- each column contains the probabilities for the respective target values
- we choose the second column (index=1)
- the probabilities of the predicted labels being 1
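A quick look at the shape of the predict_proba output, using synthetic data from make_classification (an assumption; any binary dataset would do). The names `logreg_demo`, `proba` and `manual_labels` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_p, y_p = make_classification(n_samples=200, n_features=4, random_state=0)
logreg_demo = LogisticRegression().fit(X_p, y_p)

proba = logreg_demo.predict_proba(X_p)
print(logreg_demo.classes_)   # column order of proba: class 0, then class 1
print(proba[:3])              # each row sums to 1

# thresholding the second column at 0.5 should reproduce predict(),
# apart from exact ties at 0.5
manual_labels = (proba[:, 1] > 0.5).astype(int)
print(np.mean(manual_labels == logreg_demo.predict(X_p)))
```

The second column is what gets passed to roc_curve, since it holds the probability of the positive class.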

Area under the ROC curve (AUC)
- larger area under the ROC curve → better model
from sklearn.metrics import roc_auc_score
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_prob)
AUC using cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
Hyperparameter tuning
- linear regression: choosing parameters
- Ridge/lasso regression: choosing alpha
- kNN: choosing n_neighbors
Hyperparameters: parameters that cannot be explicitly learned by fitting the model
- (ex) alpha, n_neighbors, etc.
- choosing the correct hyperparameter = hyperparameter tuning
- essential to use cross-validation
- Grid search cross-validation

- try every combination of parameters in the grid
- fill up the grid
- choose the combination with best performance
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X, y)
knn_cv.best_params_
knn_cv.best_score_
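What the grid search does can be approximated with a plain loop over candidate values. This sketch assumes the Iris data and a smaller range of k than above to keep it quick; `mean_scores` and `best_k_manual` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_g, y_g = datasets.load_iris(return_X_y=True)
k_values = np.arange(1, 20)

# manual version: one mean cross-validation score per candidate n_neighbors
mean_scores = np.array([
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_g, y_g, cv=5).mean()
    for k in k_values
])
best_k_manual = k_values[np.argmax(mean_scores)]

# GridSearchCV does the same bookkeeping and refits the best model
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': k_values}, cv=5)
search.fit(X_g, y_g)

print(best_k_manual, mean_scores.max())
print(search.best_params_, search.best_score_)
```

With more than one hyperparameter the loop becomes nested, one level per hyperparameter, which is the "grid" being filled in.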
Hold-out set for final evaluation
- how well can the model perform on never seen data?
- using all data for cross-validation is not ideal
- split data into training & hold-out set at the beginning
- perform grid search cross-validation on training set
- choose best hyperparameters & evaluate on hold-out set
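The steps above might look like this in code. A minimal sketch on the Iris data; the 20% hold-out size and the range of n_neighbors are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_h, y_h = datasets.load_iris(return_X_y=True)

# set the hold-out data aside before any tuning
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42, stratify=y_h)

tuner = GridSearchCV(KNeighborsClassifier(),
                     {'n_neighbors': np.arange(1, 20)}, cv=5)
tuner.fit(X_tr, y_tr)   # cross-validation only sees the training set

# score() uses the best model, refitted on the whole training set
holdout_score = tuner.score(X_hold, y_hold)
print(tuner.best_params_, holdout_score)
```

Because the hold-out rows played no part in choosing n_neighbors, holdout_score is a fairer estimate of performance on unseen data than best_score_.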
Preprocessing and pipelines
- dealing with categorical features
- scikit-learn does not accept categorical features by default
- we need to encode categorical features numerically → dummy variables
- 0: observation was not the category
- 1: observation was the category
- Dummy variables


- Dealing with categorical features in Python
- scikit-learn: OneHotEncoder()
- pandas: get_dummies()
import pandas as pd
df = pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)
df_origin = df_origin.drop('origin_Asia', axis=1)
# alternative to the explicit drop: pd.get_dummies(df, drop_first=True)
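A small made-up DataFrame shows what get_dummies produces. The `cars` frame below is a hypothetical stand-in for auto.csv, whose actual columns are not shown in these notes.

```python
import pandas as pd

# made-up stand-in for auto.csv with one categorical column
cars = pd.DataFrame({
    'mpg': [18.0, 30.5, 24.0, 36.1],
    'origin': ['US', 'Asia', 'Europe', 'Asia'],
})

# one indicator column per category
dummies_all = pd.get_dummies(cars)
# drop_first=True leaves out the first category (alphabetically)
dummies_dropped = pd.get_dummies(cars, drop_first=True)

print(list(dummies_all.columns))
print(list(dummies_dropped.columns))
```

One column can be dropped without losing information: a row that is neither Europe nor US must be Asia.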
Handling missing data
- change all the missing data entries to NaN (here a 0 in the insulin column stands for a missing value)
df['insulin'] = df['insulin'].replace(0, np.nan)
Drop missing data
- drawback: we will have to drop a lot of data
df = df.dropna()
Imputing missing data: making an educated guess about the missing values
- (ex) using the mean of the non-missing entries
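from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X = imp.transform(X)
- SimpleImputer(missing_values=np.nan, strategy='mean')
- missing_values: missing values are represented by NaN
- imputes along columns: each missing entry is replaced by the mean of its feature
- replaces the older sklearn.preprocessing.Imputer (axis=0: impute along columns, axis=1: impute along rows), which is no longer available in current scikit-learn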
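Mean imputation is simple enough to do by hand, which makes it a useful check. The array below is made up; the sketch uses SimpleImputer from sklearn.impute, the imputer class in current scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, np.nan],
                      [5.0, 60.0]])

# manual version: replace each NaN with the mean of its column's observed values
col_means = np.nanmean(X_missing, axis=0)
X_manual = np.where(np.isnan(X_missing), col_means, X_missing)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X_missing)

print(col_means)
print(X_imputed)
```

The column means of the observed values are 3 and 30, so those are the values filled in for the two missing entries.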
Imputing within a pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
logreg = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logreg)]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
- steps: each step is a 2-tuple containing the name you wish to give the relevant step & estimator
Centering & Scaling
- why scale your data?
- many models use some form of distance to inform them
- features on larger scales can unduly influence the model
- (ex) KNN uses distance explicitly when making predictions
- we want features to be on a similar scale → normalizing (centering & scaling)
- ways to normalize data
- standardization: subtract the mean and divide by the standard deviation
- all features centered around 0 & have variance 1
- subtract the minimum & divide by the range
- the data then ranges over [0, 1]
- can also normalize so the data ranges [-1,+1]
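Both rescalings can be written directly in NumPy and compared with scikit-learn's transformers. The array is made up; note that centering and then dividing by the standard deviation is what gives each feature unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# two features on very different scales
X_s = np.array([[1.0, 200.0],
                [2.0, 300.0],
                [3.0, 400.0],
                [4.0, 900.0]])

# standardization: subtract the column mean, divide by the column std
X_std = (X_s - X_s.mean(axis=0)) / X_s.std(axis=0)
# min-max scaling: subtract the column minimum, divide by the column range
X_minmax = (X_s - X_s.min(axis=0)) / (X_s.max(axis=0) - X_s.min(axis=0))

print(np.allclose(X_std, StandardScaler().fit_transform(X_s)))
print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_s)))
print(X_std.mean(axis=0), X_std.var(axis=0))
```

After either transformation the second feature no longer dominates distance calculations just because its raw numbers are larger.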
Scaling in scikit-learn
from sklearn.preprocessing import scale
X_scaled = scale(X)
Scaling in a pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
knn_unscaled.score(X_test, y_test)
CV & scaling in a pipeline
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {'knn__n_neighbors': np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid = parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)