
Applying logistic regression and SVM
Scikit-learn refresher
import sklearn.datasets
newsgroups = sklearn.datasets.fetch_20newsgroups_vectorized()
X, y = newsgroups.data, newsgroups.target
X.shape
y.shape
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)
Model evaluation
knn.score(X, y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
Applying logistic regression and SVM
Using LogisticRegression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)
lr.score(X_test, y_test)
lr.predict_proba(X_train[:1])
LinearSVC
import sklearn.datasets
wine = sklearn.datasets.load_wine()
from sklearn.svm import LinearSVC
svm = LinearSVC()
svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target)
SVC (uses a non-linear RBF kernel by default, so it can fit non-linear datasets)
import sklearn.datasets
wine = sklearn.datasets.load_wine()
from sklearn.svm import SVC
svm = SVC()
svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target)
- more complex models like non-linear SVMs carry a higher risk of overfitting (illustrated in the sketch after the complexity review below)
Complexity review
- underfitting: model is too simple, low training accuracy
- overfitting: model is too complex, low test accuracy
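To make the overfitting risk concrete, a minimal sketch (my own split of the wine data; exact scores vary, and LinearSVC may warn about convergence on unscaled features):
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
wine = load_wine()
# hypothetical split, just for this illustration
Xtr, Xte, ytr, yte = train_test_split(wine.data, wine.target, random_state=0)
for model in (LinearSVC(), SVC()):
    model.fit(Xtr, ytr)
    # compare training vs. test accuracy: a large gap suggests overfitting
    print(type(model).__name__, model.score(Xtr, ytr), model.score(Xte, yte))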
Linear decision boundaries
Decision boundary: tells us what class our classifier will predict for any value of x


- classifier predicts the blue class in the blue shaded area
- blue shaded area: feature 2 is small
- classifier predicts the red class in the red shaded area
- red shaded area: feature 2 is large
- decision boundary: dividing line between the two regions
- the line can be in any orientation
- in this specific case the boundary is a straight (horizontal) line, so it is a linear boundary (a sketch for drawing such a boundary follows this list)
- in basic forms, logistic regression & SVMs are linear classifiers
- they learn linear decision boundaries
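A minimal sketch of drawing a linear decision boundary (synthetic two-feature data, invented for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# hypothetical 2-feature, 2-class data
X2, y2 = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)
clf = LogisticRegression().fit(X2, y2)
# evaluate the classifier on a grid and shade the two predicted regions
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y2)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()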
Vocabulary
- classification: supervised learning when the y-values are categories
- in contrast w/ regression (predicting continuous values)
- decision boundary: the surface separating different predicted classes
- linear classifier: a classifier that learns linear decision boundaries
- (ex) logistic regression, linear SVM
- linearly separable: a data set that can be perfectly explained (separated) by a linear classifier

- left figure: no single line that separates the red and blue examples
- right figure: we could divide 2 classes w/ a straight line → linearly separable
Loss Functions
Linear classifiers: the coefficients
Dot product
import numpy as np
x = np.arange(3)
y = np.arange(3, 6)
x * y
np.sum(x * y)
x @ y
Linear classifier prediction
- raw model output = coefficients · features + intercept
- linear classifier prediction: compute the raw model output, then check its sign (see the sketch after the code below)
- if positive, predict one class
- if negative, predict the other class
- this is the same for logistic regression & linear SVM
- .fit() is different but .predict() is the same
- differences in .fit() relate to loss functions
lr = LogisticRegression()
lr.fit(X, y)
lr.predict(X)[10]
lr.predict(X)[20]
lr.coef_ @ X[10] + lr.intercept_
lr.coef_ @ X[20] + lr.intercept_
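A minimal sketch (on a synthetic binary dataset, not the data above) confirming that the sign of the raw model output matches the predicted class:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# hypothetical binary problem, for illustration only
Xb, yb = make_classification(n_samples=200, n_features=5, random_state=0)
lr_b = LogisticRegression().fit(Xb, yb)
raw = Xb @ lr_b.coef_.ravel() + lr_b.intercept_   # raw model output for every example
print(np.all((raw > 0).astype(int) == lr_b.predict(Xb)))  # True: positive raw output -> class 1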

What is a loss function?
Least squares: the squared loss
- scikit-learn’s LinearRegression minimizes a loss:
∑_{i=1}^{n} (true i-th target value − predicted i-th target value)²
- minimizes sum of squares of errors made on training set
- error is defined as the difference b/w the true target value & the predicted target value
- jiggle the coefficients (parameters) around until the error term (the loss function) is as small as possible
- the minimization is over the coefficients/parameters
- the loss function is a penalty score that tells us how well or how badly the model is doing on the training data
- think of the .fit() function as running code that minimizes the loss
- scikit-learn model.score() isn’t necessarily the loss function
- could be, but not guaranteed
Classification errors: the 0-1 loss
- Squared loss is not appropriate for classification problems
- b/c y-values are categories, not numbers
- a natural loss for a classification problem: the number of errors
- the 0-1 loss:
- 0 for a correct prediction
- 1 for an incorrect prediction
- by summing this function over all training examples, we get the number of mistakes we’ve made on the training set
- since we add 1 to the total for each mistake
- but the loss is hard to minimize!
- thus logistic regression & SVMs don’t use it (a quick computation of the 0-1 loss is sketched below)
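A quick sketch of the 0-1 loss with made-up labels: it is just the count of mismatches.
import numpy as np
y_true = np.array([ 1, -1,  1,  1, -1])   # hypothetical true labels
y_hat  = np.array([ 1,  1,  1, -1, -1])   # hypothetical predictions
print(np.sum(y_hat != y_true))            # 2 mistakes -> 0-1 loss of 2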
Minimizing a loss
from scipy.optimize import minimize
minimize(np.square, 0).x
- minimize(function, initial guess).x
- 1st: function
- 2nd: initial guess
- .x : grab the input value that makes the function as small as possible
- result is 0 for the above code b/c the function is minimized when x = 0
- the square of a number can only be zero or more
- smallest possible value is attained when x = 0
minimize(np.square, 2).x
- the very small number is normal for numerical optimization:
- we don’t expect exactly the right answer, but something very close
- inputs: model coefficients
- to answer the question: “what values of the model coefficients make my squared error as small as possible?”
- this is what linear regression is doing (sketched below with scipy)
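A minimal sketch of that idea on invented data: minimize the sum of squared errors over the coefficients and compare with LinearRegression.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
Xr = rng.normal(size=(100, 3))                          # hypothetical features
yr = Xr @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
def squared_loss(w):
    # w[:-1] are the coefficients, w[-1] is the intercept
    return np.sum((yr - (Xr @ w[:-1] + w[-1])) ** 2)
print(minimize(squared_loss, np.zeros(4)).x)            # ~ [1, -2, 0.5, 0]
lin = LinearRegression().fit(Xr, yr)
print(lin.coef_, lin.intercept_)                        # should closely match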
Loss Function Diagrams
The raw model output

- Since we predict using the sign of the raw model output, the plot is divided into 2 halves
- the left half: predict the one class (-1)
- the right half: predict the other class (+1)
0-1 loss diagram

- By definition of 0-1 loss, incorrect predictions get a penalty of 1 & correct ones get no penalty
- this picture is the loss for a single training example
- to get the whole loss, we need to sum up the contribution from all examples
Linear regression loss diagram

- squared/quadratic function
- the raw model output is the prediction
- intuitively, the loss is higher as the prediction is further away from the true target value (1)
- problem: the left side makes sense (the loss increases as the raw output moves further from the true target value of +1), but the right side does not
- on the right side, the raw output is large and positive, so we predict +1, which is correct, yet the loss still grows
- perfectly good models are considered bad by the loss
- we need specialized loss functions for classification
Logistic loss diagram

- used in logistic regression
- a smoother version of the 0-1 loss
- as you move to the right (towards the zone of correct predictions), loss goes down
Hinge loss
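The hinge loss is the loss used by SVMs. A minimal sketch plotting the standard formulas (logistic loss log(1 + exp(-z)), hinge loss max(0, 1 - z), where z is the raw model output for a true label of +1) to reproduce these diagrams:
import numpy as np
import matplotlib.pyplot as plt
z = np.linspace(-3, 3, 200)                 # raw model output (true label +1)
plt.plot(z, np.log(1 + np.exp(-z)), label="logistic loss")
plt.plot(z, np.maximum(0, 1 - z), label="hinge loss")
plt.plot(z, (z < 0).astype(float), label="0-1 loss")
plt.xlabel("raw model output")
plt.ylabel("loss")
plt.legend()
plt.show()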

Logistic Regression
Logistic regression and regularization
- regularization combats overfitting by making the model coefficients smaller

- The figure shows the learned coefficients of a logistic regression model w/ default regularization
- In scikit-learn, the hyperparameter “C” is the inverse of the regularization strength
- larger C → less regularization
- smaller C → more regularization

- orange curve: with smaller value of C
- more regularization for our logistic regression model
- regularization makes the coefficients smaller
How does regularization affect training accuracy?
lr_weak_reg = LogisticRegression(C=100)
lr_strong_reg = LogisticRegression(C=0.01)
lr_weak_reg.fit(X_train, y_train)
lr_strong_reg.fit(X_train, y_train)
lr_weak_reg.score(X_train, y_train)
lr_strong_reg.score(X_train, y_train)
- model w/ weak regularization gets a higher training accuracy
- regularization: an extra term added to the original loss function, which penalizes large values of the coefficients
regularized loss = original loss + large coefficient penalty
- more regularization → lower training accuracy
- w/o regularization, we maximize the training accuracy
- when we add regularization, we modify the loss function to penalize large coefficients, which distracts from the goal of optimizing accuracy
- more regularization (smaller C)
→ more deviation from the goal of maximizing training accuracy
→ lower training accuracy (see the coefficient-size sketch below)
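To see the mechanism directly, a minimal sketch (assuming lr_weak_reg and lr_strong_reg were fit as above): stronger regularization should give smaller coefficients.
import numpy as np
print(np.abs(lr_weak_reg.coef_).max(), np.linalg.norm(lr_weak_reg.coef_))
print(np.abs(lr_strong_reg.coef_).max(), np.linalg.norm(lr_strong_reg.coef_))  # expected: smaller values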
How does regularization affect test accuracy?
lr_weak_reg.score(X_test, y_test)
lr_strong_reg.score(X_test, y_test)
- in this example, more regularization reduces training accuracy but IMPROVES test accuracy
- not having access to a particular feature ⇒ the corresponding coefficient set to zero
- regularizing (making your coefficient smaller) is like a compromise b/w not using the feature at all (setting the coefficient to zero) & fully using it (the un-regularized coefficient value)
- using a feature too heavily → overfitting
- regularization lessens overfitting
L1 vs. L2 regularization
- Lasso: linear regression w/ L1 regularization
- Ridge: linear regression w/ L2 regularization
- for other models like logistic regression we just say L1, L2, etc.
- both help reduce overfitting
- L1 performs feature selection
import matplotlib.pyplot as plt
lr_L1 = LogisticRegression(penalty='l1', solver='liblinear')  # L1 requires a solver that supports it (liblinear or saga)
lr_L2 = LogisticRegression()  # default penalty is L2
lr_L1.fit(X_train, y_train)
lr_L2.fit(X_train, y_train)
plt.plot(lr_L1.coef_.flatten())
plt.plot(lr_L2.coef_.flatten())

- L1 regularization: sets many of the coefficients exactly to zero
- the corresponding features are effectively ignored
- in other words, it performed feature selection for us (counted in the sketch below)
- L2 regularization: shrinks the coefficients to be smaller
- analogous to what happens w/ Lasso & Ridge regression
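A quick check of the feature selection effect (assuming lr_L1 and lr_L2 were fit as above):
import numpy as np
print("coefficients set to zero by L1:", np.sum(lr_L1.coef_ == 0))
print("coefficients set to zero by L2:", np.sum(lr_L2.coef_ == 0))  # typically 0: L2 only shrinks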