Terminology
DL ⊂ NN ⊂ ML ⊂ AI (each is a subset of the next)
- Artificial Intelligence (AI): science and engineering of making intelligent machines
- Machine Learning (ML): subset/subfield of AI that self-learns through data
- Neural Networks (NN): subset of ML; models built from layers of connected artificial neurons
- Deep Learning (DL) - "deep" refers to the number of hidden layers
Note on ML
- ML algorithms are often better than humans at spotting patterns in large amounts of data
- Ethics: the algorithms themselves are neutral, but biased input data produces biased models
Types of Learning

- Supervised Learning: data with labels
- Unsupervised Learning: data without labels
- Semi-supervised Learning: some data with labels, some without
- Reinforcement Learning: learning based on errors and rewards
Data & Pre-processing
Types of Data
- Structured Data: features have clearly defined meanings/data types (e.g. a database table)
- Unstructured Data (e.g. audio, image, text, etc.)
Common/General Pre-processing Techniques
- Imputation: dealing with missing data/values (e.g. mean, interpolation, etc.)
- Principal Component Analysis (PCA): dimensionality reduction
- Standardization/Normalization: especially important when features have different units/scales (e.g. without scaling, height in cm contributes far more to a distance calculation than weight in kg)
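A minimal sketch of these three pre-processing steps, assuming scikit-learn is available; the toy height/weight values are made up for illustration.

```python
# Pre-processing sketch (scikit-learn assumed); the height/weight values are made up.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[170.0, 70.0],
              [np.nan, 82.0],      # missing height value
              [160.0, 55.0],
              [183.0, 90.0]])

X = SimpleImputer(strategy="mean").fit_transform(X)  # imputation: fill missing values with the column mean
X = StandardScaler().fit_transform(X)                # standardization: zero mean, unit variance per feature
X = PCA(n_components=1).fit_transform(X)             # PCA: keep only the top principal component
```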
Bias & Variance
Trade-off between bias and variance
- Bias: systematic gap between the predicted values and the actual values (labels)
- ≈ | human(-level) error - training error | (the "avoidable bias")
- Variance: how much the predictions change across different training sets (sensitivity to noise/fluctuations in the data)
- ≈ | training error - validation error |
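As a quick illustration of these two gap heuristics, a minimal sketch with made-up error rates:

```python
# Error rates are assumed for illustration only.
human_error, train_error, val_error = 0.01, 0.08, 0.10

avoidable_bias = abs(human_error - train_error)  # 0.07 -> bias problem (underfitting)
variance       = abs(train_error - val_error)    # 0.02 -> variance is comparatively small
```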

Underfitting: high bias, low variance ("systematically wrong; fails to capture the complexity of the data")
- Fix: use a more complex network
- Fix: train longer
Overfitting: low bias, high variance ("fits the noise and fluctuations in the training data")
- Fix: get more data
- Fix: regularization (e.g. dropout, L2 weight penalty)
Common Techniques
Train/Dev/Test set
- Validation or Development set: used to tune hyperparameters (settings that control learning; not learned by the machine)
- Test set: final evaluation of the model ("final say"); never used for training or tuning
The dev and test sets should come from the same distribution, one that is close to the "real" data the model will face
K-fold Cross Validation - useful when data is limited (every sample gets used for both training and validation instead of a fixed train/dev split)
- Increase K: lower bias, higher variance
- More folds means more training data per fold (lower bias), but each validation fold is smaller and the training folds overlap more, so the error estimate is noisier (higher variance) and training is more expensive
- Leave-One-Out Cross Validation (LOOCV): K=n where n is the number of training samples
- At each iteration, we use n-1 samples as training data and 1 sample as validation (thus, LOO)
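A minimal K-fold sketch, assuming scikit-learn; K=5, the iris dataset, and the KNN model are arbitrary choices.

```python
# K-fold cross-validation sketch (scikit-learn assumed; K=5, iris, and KNN are arbitrary choices).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)  # 5 train/validation splits
print(scores.mean(), scores.std())  # average validation score and its spread across folds
```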
Prediction
- Ensemble (e.g. bagging): average the results of several models (lowers variance)
- Stacking: predictions of one model becomes the inputs of another model
- Idea: each model learns some part of the problem instead of the whole problem, potentially improving performance
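A sketch of both ideas with scikit-learn; the particular base models and the final estimator are assumptions.

```python
# Bagging vs. stacking sketch (scikit-learn assumed; model choices are arbitrary).
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Bagging: vote/average over many trees trained on bootstrap samples (lowers variance).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)

# Stacking: the base models' predictions become the inputs of a final model.
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)
```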
ML
Supervised Learning
Classification - predicts class/label
Naive Bayes
Assumes that each evidence/feature/predictor makes an independent and equal contribution to the belief/label/prediction.
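A minimal Gaussian Naive Bayes sketch, assuming scikit-learn; the iris dataset is an arbitrary choice.

```python
# Gaussian Naive Bayes sketch (scikit-learn assumed; iris is an arbitrary dataset).
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)   # fits one Gaussian per feature per class; features treated as independent
print(model.predict(X[:3]))      # predicted class labels for the first three samples
```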
K-Nearest Neighbor (KNN)
Lazy (instance-based) learning - no explicit training phase; the computation happens at prediction time, against the stored training samples
Standardization is necessary
PCA can speed up KNN
Select K neighbors and predict using majority vote (break ties when necessary)
Close to a neighbor -> More likely to share common characteristics
Good value of K
- Too small -> sensitive to outliers (high variance, low bias)
- Too large -> overly smooth (in the extreme K=n, the prediction is the same majority class for ALL query points)
- In general, larger K yields lower variance, higher bias
- Use CV to determine a good value
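A sketch combining the points above (standardize, then pick K by cross-validation), assuming scikit-learn; the candidate K values and dataset are arbitrary.

```python
# KNN sketch (scikit-learn assumed): standardize, then pick K by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9):                                  # candidate K values (arbitrary)
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    print(k, cross_val_score(pipe, X, y, cv=5).mean())     # pick the K with the best CV score
```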
Neural Networks (NN)
Artificial Neural Network (ANN)
Perceptron
a single artificial neuron (the simplest ANN)
draws a linear decision boundary (binary classification) - thus can only model linearly separable problems (e.g. logical AND)
Multi-Layer Perceptron (MLP)
We can form an MLP by connecting the output of one perceptron to the input of another
Draws a curved (non-linear) boundary - thus can handle non-linearly separable problems (e.g. logical XOR)
- 2 linear boundaries can be joined by another neuron, forming a curved boundary (see the sketch below)
An ideal activation function is non-linear and differentiable. Without a non-linear activation, an MLP collapses into a single linear model, no better than one perceptron
- Activation functions do NOT have to be differentiable at every point (e.g. ReLU is not differentiable at 0 but works fine in practice)
- ReLU is a good activation since it is non-saturating (for positive inputs) and typically converges faster than sigmoid
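A hand-wired illustration of the XOR point above: two hidden neurons draw two linear boundaries (OR and AND) and a third neuron combines them. The weights are set by hand, not learned, and a step activation is used only to keep the arithmetic obvious.

```python
# Hand-wired MLP for XOR; weights chosen by hand for illustration, not learned.
import numpy as np

def step(z):
    return (z > 0).astype(int)        # step activation (illustration only; not usable with gradient descent)

def xor_mlp(x1, x2):
    h_or  = step(x1 + x2 - 0.5)       # hidden neuron 1: linear boundary for x1 OR x2
    h_and = step(x1 + x2 - 1.5)       # hidden neuron 2: linear boundary for x1 AND x2
    return step(h_or - h_and - 0.5)   # output neuron: OR AND (NOT AND)  ->  XOR

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
print(xor_mlp(x1, x2))                # [0 1 1 0]
```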
Convolutional Neural Network (CNN)
Feature Extractor: Convolution+Pooling
Classifier: Flatten+FC layers
Main Idea: we extract increasingly high-level features as we go deeper through the convolutions
(e.g. edges -> combinations of edges -> object models)
Pooling summarizes the feature maps and shrinks them, reducing the number of parameters in later layers
Pooling & strides (downsampling) help prevent overfitting
It is often better to stack several small kernels than to apply one large kernel once (similar receptive field, better feature quality, fewer parameters)
Conv > FC
- Preserves spatial context
- Fewer parameters required (weights are shared across positions via the kernels)
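A minimal sketch of the Conv+Pooling / Flatten+FC split, assuming Keras; the input shape (28x28 grayscale) and the 10 output classes are assumptions.

```python
# Minimal CNN sketch (Keras assumed; input shape and class count are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g. 28x28 grayscale images
    # Feature extractor: stacked small (3x3) kernels + pooling
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),                     # summarizes feature maps, shrinks spatial size
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    # Classifier: flatten + fully connected layers
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # e.g. 10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```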
Recurrent Neural Network (RNN)
Good for sequential data (e.g. text, audio, video)
- Retains internal memory ("context")
tanh is commonly used as the activation
Long Short-Term Memory (LSTM)
- gates (input/forget/output) and a cell state let the network retain information over long spans of the sequence
- mitigates the exploding/vanishing gradient problem of plain RNNs
Gated Recurrent Unit (GRU) - a simplified LSTM with fewer gates and fewer parameters
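A minimal recurrent model sketch, assuming Keras; the per-step feature size and the binary output are assumptions.

```python
# Minimal recurrent model sketch (Keras assumed; feature size and output are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(None, 8)),            # variable-length sequences of 8-dim feature vectors
    layers.LSTM(32),                          # gated memory cell mitigates vanishing gradients
    layers.Dense(1, activation="sigmoid"),    # e.g. one binary prediction per sequence
])
# Swapping layers.LSTM(32) for layers.GRU(32) gives the lighter-weight gated alternative.
```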
Regression - predicts a continuous value
Linear Regression
etc.
Logistic Regression
Sigmoid - non-linear (but saturating), values between 0 and 1 (can have a cutoff/threshold for binary classification)
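A small numeric illustration of the sigmoid and a cutoff; the 0.5 threshold and the input scores are assumptions.

```python
# Sigmoid squashes a linear score into (0, 1); 0.5 is an assumed cutoff for the binary decision.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.0])          # linear scores w*x + b
p = sigmoid(z)                          # ~[0.047, 0.5, 0.881]
print((p >= 0.5).astype(int))           # [0 1 1] with a 0.5 threshold
```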
Unsupervised Learning
Clustering
K-Means
Standardization is necessary
PCA is often applied before clustering; in practice, reducing dimensionality (and noise) tends to improve clustering results
Select K initial centroids (random)
Compute distances and assign each data point to a cluster
Re-compute the centroids according to new membership
Repeat until cluster memberships no longer change (guaranteed to converge, though possibly to a local optimum)
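A from-scratch sketch that follows the steps above; K, the iteration cap, and the toy data are assumptions, and empty clusters are not handled.

```python
# K-Means sketch following the steps above (K, iteration cap, and toy data are assumptions).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]           # random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                   # assign each point to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                       # memberships stable -> converged
            break
        centroids = new_centroids                                       # recompute centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])  # two toy blobs
labels, centroids = kmeans(X, k=2)
```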
Density-based, Model-based, etc.
etc.
Optimization Algorithms
Gradient Descent

x := x - alpha * dJ/dx, where x is a parameter, J is the cost function
- Why the minus sign?
- For a negative gradient, x should increase
- For a positive gradient, x should decrease
- Why gradient descent instead of a closed-form solution?
- Often, there is no closed-form solution
- Even when one exists (e.g. the normal equation for linear regression), gradient descent is often computationally cheaper for large problems
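A minimal numeric sketch of the update rule on 1-D linear regression; the data, learning rate, and iteration count are assumptions.

```python
# Gradient descent sketch for 1-D linear regression, J(w) = mean((w*x - y)^2).
# Data, learning rate, and iteration count are assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                               # true slope is 2

w, alpha = 0.0, 0.05                      # initial parameter and learning rate
for _ in range(200):
    grad = np.mean(2.0 * (w * x - y) * x) # dJ/dw
    w = w - alpha * grad                  # w := w - alpha * dJ/dw
print(w)                                  # converges toward 2.0
```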
Exploding/Vanishing Gradients
In a deep network, there is a risk of exploding/vanishing gradients during backward propagation
- Exploding gradients: gradients grow exponentially as they are propagated backward, so parameter updates become huge and training is unstable
- Vanishing gradients: gradients shrink exponentially as they are propagated backward, so parameters of layers closer to the input barely change
Solution
- Careful weight initialization (e.g. Xavier/He)
- Use non-saturating activation (e.g. ReLU)
- Gradient Clipping
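A sketch of how two of these remedies can look in code, assuming Keras: He initialization for ReLU layers, plus gradient-norm clipping on the optimizer. Layer sizes and the clipping threshold are arbitrary.

```python
# He initialization + gradient clipping sketch (Keras assumed; sizes and threshold are arbitrary).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(64, activation="relu", kernel_initializer="he_normal"),  # He init suits ReLU
    layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    layers.Dense(1),
])
optimizer = keras.optimizers.Adam(clipnorm=1.0)   # rescale gradients whose norm exceeds 1.0
model.compile(optimizer=optimizer, loss="mse")
```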
Batch
The batch size controls how many parameter updates happen per epoch (one epoch = one full pass through the training set)
Let N be # of training samples in the training set
- Stochastic Gradient Descent: batch size = 1
- Mini-batch Gradient Descent: 1 < batch size < N
- Batch Gradient Descent: batch size = N
For batch size > 1, gradients from all samples in the batch are averaged before each update (more computation per update); for batch size < N, that average is only an estimate of the full gradient
Batch - slow per update, but converges smoothly toward the (local) optimum given enough time
Stochastic - fast, noisy updates; keeps oscillating around the minimum instead of settling
Mini-batch strikes a balance between the two
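A quick way to see the "updates per epoch" point, with an assumed training-set size N:

```python
# Parameter updates per epoch = ceil(N / batch_size); N is an assumed training-set size.
import math

N = 10_000
for batch_size in (1, 32, N):                     # stochastic, mini-batch, full batch
    print(batch_size, math.ceil(N / batch_size))  # 10000, 313, and 1 updates per epoch
```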
Optimization Algorithms
Gradient-based
Momentum - compute exponentially weighted average of gradients and use it to update weights
RMSprop - compute an exponentially weighted average of the squared gradients and divide the update by its square root (damps out oscillations)
Adam - combines momentum and RMSprop
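A from-scratch sketch of how the momentum and RMSprop terms combine into Adam, on a toy 1-D cost J(w) = w²; the hyperparameters are the usual defaults, assumed here for illustration.

```python
# Momentum / RMSprop terms combined into Adam on J(w) = w^2 (hyperparameters are assumed defaults).
import numpy as np

def grad(w):
    return 2.0 * w                                # dJ/dw for J(w) = w^2

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 201):
    g = grad(w)
    v = beta1 * v + (1 - beta1) * g               # momentum: weighted average of gradients
    s = beta2 * s + (1 - beta2) * g * g           # RMSprop: weighted average of squared gradients
    v_hat = v / (1 - beta1 ** t)                  # Adam's bias corrections
    s_hat = s / (1 - beta2 ** t)
    w -= alpha * v_hat / (np.sqrt(s_hat) + eps)   # Adam update
print(w)                                          # ends up near the minimum at w = 0
```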
Evaluation Metrics

- P or N: determined by the predicted class (positive or negative)
- T or F: determined by whether the prediction matches the actual label
Metrics
- Accuracy: focus on TP, TN (fraction of all cases identified correctly): (TP + TN) / total
- F1: focus on FP, FN (better option for imbalanced data)
- harmonic mean of precision & recall
- precision: of the samples predicted positive (1), how many actually are: TP / (TP + FP)
- recall: of the actually positive (1) samples, how many are detected: TP / (TP + FN)
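A tiny numeric example with assumed (made-up) confusion-matrix counts, showing why accuracy can look good on imbalanced data while precision/recall/F1 reveal that the positives are handled poorly.

```python
# Metrics from assumed confusion-matrix counts (imbalanced: only 45 positives out of 1000).
tp, fp, fn, tn = 10, 5, 35, 950

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 0.96: looks great despite missing most positives
precision = tp / (tp + fp)                                 # ~0.67: of predicted positives, how many are right
recall    = tp / (tp + fn)                                 # ~0.22: of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)  # ~0.33: harmonic mean exposes the problem
print(accuracy, precision, recall, f1)
```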