//

Machine and Deep learning

Christophe Cerisara

2018/2019

Machine Learning: introduction

Resources

  • To contact me: cerisara@loria.fr
  • Info on deep learning, including course: http://deeploria.gforge.inria.fr
  • Book
    • Deep Learning by I. Goodfellow, Y. Bengio and A. Courville. https://www.deeplearningbook.org/
    • Stanford course on ConvNet: http://cs231n.stanford.edu
    • Oxford course on Machine Learning: https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

Requirements

  • Fundamentals of probability theory (Bayes rule, prior, posterior, likelihood, marginal, conditional, density, normal distribution…)
  • Basics of optimization (differentiability, gradient, local optima, convexity, chain rule, matrix algebra…)
  • Good programming skills in python

For next week

  • Strongly recommended: install ubuntu (virtualbox, dual boot, native…)
  • Mandatory: install python + numpy
  • Mandatory: install pytorch (https://pytorch.org)
  • Check pytorch:
    • copy, paste and run the second example in https://pytorch.org/tutorials/beginner/pytorch_with_examples.html (called PyTorch: Tensors)
    • Check that you obtain a value smaller than \(10^{-4}\) at iteration 500

Plan of course

  • Part 1: Introduction to Machine Learning

  • Part 2: Probabilistic models

  • Part 3: Deep learning

Machine learning introduction

  • Very brief reminder of linear models
  • Reminder fundamentals of parameter learning:
    • loss, risks
    • bias/variance tradeoff
  • Good practices for experimental evaluations

Probabilistic models

  • Markov Random Fields vs. Bayesian Networks
  • Naive Bayes, CRF
  • Training, maximum likelihood, EM

Deep learning

  • convex optimization: line search, steepest, Newton, BFGS, conjugate, SGD
  • feed-forward networks: backpropagation

  • convolutional networks
    • Model zoo for image processing
    • Model zoo for text processing
  • recurrent networks (RNN, LSTM, GRU):
    • vanishing gradient
  • attention and memory models
    • Model zoo for text processing
  • variational autoencoders
  • Generative Adversarial Networks

What is Machine Learning, and why ?

  • Let’s assume we want to solve a real-world problem: predict the weather tomorrow
  • We can exploit our knowledge: it’s raining more in fall…
  • Programming an expert system:
p=0.2
if summer: p=p/2
if rainYesterday: p=p*2
if pressure<1000: p=p*2

Pros and cons of expert approaches

  • Encoding of expert knowledge into a program
  • But limited by expert understanding
  • Agreement between experts ?
  • Gives unmaintainable programs for complex phenomena
  • The only solution when there is no data

Alternative: Machine learning (ML) approach

  • Do not be too much confident about our knowledge: we (experts) may be wrong !
  • Rather observe the data, and try and capture correlations between observations.

ML paradigm

  • Observe what happens during a few years (data, observations, evidence)
  • Implement a parametric function that predicts the weather, based on features \[f(X;\theta) = \sum_i \theta_i X_i\]
  • Adopt a decision rule: "It will rain iff \(f(X;\theta)>0\)
  • Compute \(\theta\) so as to minize the prediction error (training = optimization)

ML = knowledge-free approach ?

  • Does not rely on explicit human knowledge, but…
  • Which family of functions for \(f()\) ?
  • Which observations \(X\) to take ?
  • We can add constraints and priors into \(f()\)

Challenges for building an ML system

  • Where to get data from ?
  • Which features \(X\) to choose ?
  • Which model and parameters \(f(X;\theta)\) to use ?
  • How to train the parameters \(\theta\) ?
  • How to evaluate reliably whether our model is good or not ?

Where to get data from ?

  • Depends on the domain:
    • From sensors: temperature, microphone, camera, GPS…
    • From humans: books, Twitter, traces…
    • From simulations: go/chess/video games…
  • Reproducible research:
    • From standard datasets, benchmarks (Kaggle…)

A word about ethics in ML

  • Ethics in ML is a fundamental and difficult issue
    • fair evaluations, transparent ML
    • open data
    • 80% of bio papers not reproducible !
  • Never assume data is “objective”, is the truth
  • Data is often biased

E.g.: Microsoft Tay bot

Which features to choose ?

  • Features that are relevant for the task
  • Independent features ?
    • Some models require independent features
    • Most current models support correlated features
  • Diverse information
  • Many vs. a few features:
    • More features improve performances…
    • … but augment the number of parameters to estimate, and make training more difficult

Which form does \(X\) take ?

\(X=\) vector; each feature is encoded into 1+ dim:

  • Numerical feature: e.g., “Temperature today”
    • Encoded as a float in one dimension
    • Must be normalized !
  • Symbolic feature: e.g., “Season (summer, winter…)”
    • Encoded as a one-hot vector
    • Summer: \(X_i = [0,1,0,0]\) Winter: \(X_i = [0,0,0,1]\)

Which model ?

  • Standard discriminative models: linear, SVM, CRF, decision trees…
  • Bayesian models: HMM, LDA…
  • Complex discriminative models: Deep Neural Networks
  • Less common models: Memory-Based Learning, genetic algorithms…

How to train parameters ?

  • We will study how to train each type of model
    • Which objective: minimize prediction error, (re)generate \(X\)
    • 2 most important: cross-entropy and maximum likelihood
    • Most important algorithms: Stochastic Gradient Descent and Expectation Maximization

How to evaluate a model ?

  • Avoid models that are good on the training data, but fail on unknown data.
  • We want to assess generalization capabilities: bias-variance tradeoff

Linear models

Definitions

A linear model is defined as follows:

\[\hat y(x) = \sum_j x_{j} \theta_j\]

  • \(x=[x_1 \cdots x_N]^T\) is an observation vector: it contains data that is observed, known
  • There is at most one data type per dimension:
    • dim 1: temperature
    • dim 2: pression
  • Each “data type” is called a feature

Definitions

\[\hat y(x) = \sum_j x_{j} \theta_j\]

  • \(\hat y(x)\) is the predicted value (scalar); it may represent
    • A value we are interested in, but we don’t know: e.g., how much rain will fall tomorrow ? (regression)
    • The score of a class: e.g., will it rain tomorrow ? (classification)

Definitions

\[\hat y(x) = \sum_j x_{j} \theta_j\]

  • \(\theta\) is a vector of parameters: relates \(x\) to \(y\); it’s very hard to set by hand, so it is trained on a training corpus

Definitions

\[\hat y(x) = \sum_j x_{j} \theta_j\]

  • This is the equation of an hyperplane that contains the origin in \((0,x_1,\cdots,x_N,y)\)
  • To be more flexible, we add a constant scalar bias

\[\hat y(x) = \sum_j x_{j} \theta_j + b\]

  • The bias can be moved into the sum by adding one constant feature \(x_1=1\)

Linear models: Linear regression

Linear regression

  • regression = linear model of the relation between
    • \(x\): (years of education, seniority)
    • \(y\): income
  • \(y\) is the target value that we want to predict

Training

  • Imagine 2 different sets of parameters: \[\hat y = 3x_1 + 2x_2\] and \[\hat y = 1 + 2x_1 + x_2\]

  • Which model is the best ?

Training

  • Objective function to minimize (square error): \[J(\theta)=\sum_1^n (\hat y_i - y_i)^2\]

  • Training boils down computing \[\hat\theta = \arg\min_\theta J(\theta)\]

Supervised training

  • All true \(y\) are given in the training corpus
  • They are called gold values / gold labels
  • We can compute and thus optimize \(J(\theta)\) on the training corpus

Unsupervised training

  • The training corpus only contains \(X\); no \(y\) value is given
  • We can not compute \(J(\theta)\)
  • Other training criteria are used

Semi-supervised training

  • \(y\) is given for only some of the training examples
  • But we still want to use all data to train the models
  • Special cases:
    • Few-shot learning: a few \(y\) are known
    • One-shot learning: a single \(y\) is known

Solving linear regression

  • Let \(X\) be a matrix where
    • each row is one example
    • each column is one feature
  • Let \(y\) be a vector with the (gold) income for each observed example

We can express the square error as: \[J(\theta) = (X\theta-y)^T(X\theta-y)\]

Solving linear regression

  • Exercice: solve = derive the optimal \(\hat \theta\) that minimizes the square loss.

  • Reminder:
    • need \(\frac{\partial A \theta}{\partial \theta} = A^T\)
    • and \(\frac{\partial \theta^T A \theta}{\partial \theta}=2A^T \theta\)

Linear regression: solution

\[\hat \theta = (X^T X)^{-1} X^T y\]

  • Nice ! But…
    • Many problems are not linear
    • Many questions are not regression

Linear classification

  • What animal is shown on a picture ?
  • Will it rain tomorrow ?

  • The gold target \(y\) is discrete / symbolic
  • We need to map our continuous prediction \(f(x)\) into a set of classes / labels

Binary classification

  • Choose a decision boundary: \[\hat y(x) = sgn(f(x))\]

  • For the training corpus, give gold labels as \(y \in \{ -1, +1 \}\)
    • It’s not raining: \(y=-1\)
    • It’s raining: \(y=+1\)
  • Can be trained with minimum square error

Binary classification

  • \(X\) = input vector
  • \(Y\) = scalar
  • blackbox = linear function (for now…)

N-ary classification

  • Use one linear classifier per class: \(f_1(x), \cdots, f_N(x)\)
  • Decision rule: choose the class with the maximum predicted score \[\hat y(x) = \arg\max_c f_c(x)\]

  • In the training corpus, keep discrete labels \(y_i \in \{1,\cdots,N\}\)

N-ary classification

  • \(Y\) = vector = one score per class
  • blackbox = multi-linear function

Binary classification

  • Binary classification may have either one or two outputs (with different decision functions)
  • Double the number of parameters !

Problem: person name ?

  • Training dataset:
    Pierre qui roule n'amasse pas mousse
    *Jean* va manger avec *Marie*
    Il fait beau à Nancy
  • Problem: detect person names
    • We assume a fixed dictionary with \(N\) words
  • Exercice:
    • build a binary linear classifier
    • compute its optimal parameters
    • what is the predition error ?

Solution

  • Let \((w_1,y_1)\cdots,(w_N,y_N)\) be all our training samples

  • For any word \(x\), we compute indicator features:
    • \(x_i=1\) iff \(x=w_i\) for \(1\leq i \leq N\)

Solution

  • We can rearrange the dataset so that \(X=I\)
  • … (solve) … \(\hat\theta = y\)

Memory-based learning

  • Prediction error on the training dataset = 0%

Practical issues

Test sentence:

Jacques va manger avec Marie
  • What is the prediction for Jacques ?

Practical issues

Test sentence:

Jacques va manger avec Marie
  • How to handle out-of-vocabulary words (OOV) ?
    • Either in-voc, but issues with matrix inversion
    • Or we extend \(\theta\) afterwards with non-informative parameters

Theoretical issue

Test sentence:

Pierre va manger avec Marie
  • What is the prediction for Pierre ?

Theoretical issue

Test sentence:

Pierre va manger avec Marie
  • Erroneous prediction
  • This model is very bad: it severly overfits the training corpus
  • Objective of machine learning is to generalize to unseen data

Theoretical issue

  • Improve generalization capacity by
    • adding more training examples
    Pierre dort.
  • Now we get: \(\hat\theta(Pierre)=\frac 1 2\)
  • It’s better; but depending on the training ratio of Pierre, we’ll always get the same answer.

Limits of chosen representations

  • Features = only the word ID
  • This is not enough to disambiguate
    • Pierre qui roule n’amasse pas mousse
    • Pierre est venu
  • Better feature: word ID + context IDs
    • “roule” suggests “not a person”

Limits of square error

  • Solving with \(10^6\) training samples is hard

  • The square error is not the best possible measure for classification:
    • 1 extreme outlier may dramatically impact accuracy

Alt. objective: minimum error rate

  • How to count the number of prediction errors ?

\[1- \delta(y,sgn(f(x)))\]

  • We want to minimize this loss, but no analytical solution
    • Iterative training algorithms

Perceptron algorithm

Video perceptron

Linear models: The Perceptron algorithm

  • Finds a hyper plane that separates training data

    • Repeat until convergence:
      • For \(t=1 \cdots n\):
        • \(\hat y = sgn(f(x))\)
        • if \(\hat y \neq y\) then \(\theta = \theta + yX\)
  • Algorithm converges if training data separable (Novikoff,1962)

Limits of classification error loss

  • Multiple solutions:

  • Which solution is better (wrt generalization) ?

Margin

\(w\) is normal to the hyper-plane

distance \(r\) of any \(x\): \[w^T \left(x-yr\frac w {||w||} \right) + b=0\]

Margin

  • Solve for \(r\): \[r=\frac{y(w^Tx+b)}{||w||}\]

Margin

  • We choose a scaling of \(w\) and \(b\) s.t. \[y_i(w^Tx_i+b)\geq 1\]
  • with at least one equality, margin= \[\rho = \frac 2 {||w||}\]

Linear models: Maximum Margin linear classifier

  • Find \(\theta\) that minimizes \(\frac 1 2 || \theta ||\) s.t. \[y_i X_i\theta \geq 1 ~~~ \forall i\]

  • = Quadratic programming problem

  • Gives a unique hyperplane as far as possible from training examples

  • The solution only depends on a subset of training examples that appear on the margin = the support vectors

Linear models: Maximum Margin linear classifier

Limits:

  • if training data not separable, the hyperplane does not exist

  • even a single example can radically change the position of the hyperplane

Linear models: SVM

  • Transform your input data with \(F(a)\) into a higher dimensional space so that it is linearly separable
  • Kernel trick: Don’t explicit this transformation \(F(a)\) but use a kernel s.t. \(K(a,b)=F(a).F(b)\)

Wrap-up

  • Informal presentation of:
    • Components of ML systems:
      • Data encoding into features
      • Choice of parametric model
      • Training methods

Wrap-up

  • Data encoding into features
    • continuous vs. symbolic evidence
    • one-hot vector

Wrap-up

  • Choice of parametric model
    • linear model applied to
      • linear regression
      • linear classification

Wrap-up

  • 3 objective functions:
    • Square error
    • Classification error
    • Maximum margin
  • 2 training methods:
    • Exact solution for lin. reg. with SE
    • Perceptron algo for classif. with CE (http://computing.dcu.ie/~humphrys/Notes/Neural/single.neural.html)
  • Limitations

ML concepts

Loss

  • Loss measures how much does it cost you when the model is wrong for one example

\[L(x,y)=(y-f(x))^2\]

  • L2 loss
    • severly penalizes outliers (large values)
    • slower convergence

Classification loss

\[ L(x,y) = \begin{bmatrix}1 & y\neq f(x)\\ 0 & y=f(x)\end{bmatrix}\]

  • also known as 0-1 loss
  • When you sum over all examples, gives the classification error
  • Not differentiable, not convex

Hinge loss

  • Hinge loss: \(L = \max(0,1-yf(x))\)
    • tight convex upper bound of 0-1 loss
    • not differentiable at \(yf(x)=1\), but admits a subgradient
    • used in SVM to find a “corridor” of maximum width that separates data

Logistic loss

  • Logistic loss: \[\frac 1 {\ln 2} \ln(1+e^{-yf(x)})\]

  • differentiable everywhere
  • but more sensitive to outliers

Losses

Risk

  • Classifier risk = expected loss over all possible corpora

\[R(f)=\int p(X,Y)L(X,f(X)=Y)dXdY\]

  • It’s our real objective, as we want to optimize on test (unknown) samples !

  • more info: https://www.cs.cmu.edu/~epxing/Class/10715/lectures/RiskMin.pdf

Bayes Risk

  • Minimum risk when considering all possible classifiers: \(R^* = \min_f R(f)\)

  • Bayes classifier: \(f^* = \arg\min_f R(f)\)

  • Minimizing over \(f\) is good after marginalizing over all corpora
  • It’s dramatic when done before !

Risks: Empirical risk

  • Empirical Risk: \[R(f) = \sum_i L(y_i,f(x_i))\]

  • Law of large number: converges to the true risk
  • But empirical risk makes overfitting possible !

Generative vs. discriminative

  • Generative:
    • Model of the observations
    • \(P(X|Y)\)
  • Discriminative:
    • Model of the boundary
    • \(P(Y|X)\)

Generative model

  • Example:
    • X = pressure (scalar)
    • Y = will it rain ? (0 or 1)

Generative model

  • Gaussian model for \(P(X|Y)\):
    • Pick a centroid \(\mu_0\) and std dev \(\sigma_0\) without rain
    • Pick a centroid \(\mu_1\) and std dev \(\sigma_1\) when it rains
  • Likelihood to observe this atm pressure when it rains:

\[P(X|Y=1)=\frac 1 {\sqrt{2\pi}\sigma_1}e^{-\frac 1 2 \left( \frac{X-\mu_1} 2 \right)^2}\]

Unsupervised learning

  • Assume we don’t know any label \(Y\).
  • All we can do is fit a model on \(X\):

\[\hat\theta = \arg\max_\theta \prod_t P(X_t|\theta)\]

\[P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(X|Y=y,\theta)P(Y=y)\]

Unsup. training of generative model

  • Assume \(P(Y)\) uniform: we only need to maximize \[\prod_t \sum_i \frac 1 {\sqrt{2\pi}\sigma_i}e^{-\frac 1 2 \left( \frac{X_t-\mu_i} 2 \right)^2}\]

  • When we change \(\theta\), this likelihood change
  • So there exists an optimum that can be found for \(\theta\)

Discriminative model

  • Linear model: \[\hat y = w x + b\]

  • We want a probability: logistic regression: \[P(Y=1|X) = \frac 1 {1+e^{-(wx+b)}}\]

Unsup training of discriminative model

\[P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(Y=y|X,\theta)P(X)\]

  • Getting \(P(X)\) out of the sum: \[P(X|\theta)=P(X)\]

  • The observation likelihood does not depend on the parameters !

So it is impossible to maximize it…

Overfitting: Bias variance tradeoff

  • The prediction error can be decomposed into 3 components:
    • error due to “bias”
    • error due to “variance”
    • irreducible error

Overfitting: Bias variance tradeoff

  • Bias:
    • avg diff btw prediction and true value (if you repeat training+test many times)
    • results from simplifying assumptions about the form of the target
    • Low bias: DT, kNN, SVM
    • High bias: linear, logistic regression

Overfitting: Bias variance tradeoff

  • Variance:
    • variability of output for a given input (if you repeat…)
    • sensitivity of the model to changes in training data
    • Low var: linear, logistic regression
    • High var: DT, kNN, SVM

Overfitting: Bias variance tradeoff

  • Irreducible error:
    • Intrinsic ambiguity amongst classes
    • Ex: given the player ID, predict the output of throwing a dice ?

Overfitting: Bias variance tradeoff

Depends on both classifier + learning method

Overfitting: Bias variance tradeoff

  • Let \(Z\) be a random variable with distrib \(P\)

  • Let \(\bar{Z} = E_P[Z]\) be the average value of \(Z\)

Show that \[E\left[ (Z-\bar{Z})^2 \right] = E[Z^2] - \bar{Z}^2\]

Overfitting: Bias variance tradeoff

  • Let \(f\) be a true regression function \(y=f(x)+\epsilon\)
    • with \(\epsilon \sim N(0;\sigma^2)\)
  • We fit an hypothesis \(h(x) = w\cdot x +b\) with MSE

  • Let be a new data point \(y = f(x) + \epsilon\):
    • What is the expected error ? \[E\left[ (y-h(x))^2 \right]\]

!! Expected over the training corpora of \(h\): \(x\) is constant !!

Bias variance tradeoff

\(E[(h(x)-y)^2] = E[h^2(x)] -2E[h(x)]E[y] + E[y^2]\)

Apply the previous lemma to both square terms:

\(E[(h(x)-\bar{h(x)})^2] + \bar{h(x)}^2 -2\bar{h(x)}f(x) + E[(y-f(x))^2] + f(x)^2\)

Factorize into Variance + Bias + intrinsic error:

\(E[(h(x)-\bar{h(x)})^2] + (\bar{h(x)} - f(x))^2 + E[(y-f(x))^2]\)

Overfitting: Bias variance tradeoff

  • (Wrong) intuition: it’s more important to reduce bias
    • Yes, on the average in the long run, low-bias models may perform well
    • But long run avgs are irrelevant in practice !

Overfitting: Bias variance tradeoff

  • Bagging helps to reduce variance:
    • resample corpus \(\rightarrow\) train multiple models
    • average predictions
    • Ex: Random forests

Overfitting: Bias variance tradeoff

  • Large bias: underfitting
    • Increase nb of parameters
  • Large variance: overfitting
    • More regularization

Overfitting: Under/over-fitting

Overfitting: Under/over-fitting

  • Model “complexity” is related to the nb of parameters, but not only !
    • Notion of “capacity” in deep neural networks

How to find this optimum ?

  • There are theoretical measures (see Aikake’s Information Criteria)
  • But much better to use cross-validation

Wrapup

  • Concepts of:
    • Loss, risk
    • generative vs. discriminative
    • supervised vs. unsupervised
    • bias-variance tradeoff

Evaluation: holdout split

  • Our objective is to predict new (never seen) samples
    • So we never evaluate on the training set
  • Best approach: holdout data split

Evaluation: holdout split

  • Main issues:
    • Penalizing for small datasets
    • Experiments must never be run several times on the test set !! \(\rightarrow\) contamination

Evaluation: Cross-validation

Evaluation: Cross-validation

  • How many folds ?
    • Few: biased error estimates with low variance
    • Many (leave-one-out): unbiased error estimate but high variance
    • Practical: 5-folds or 10-folds

Evaluation: Parameters vs. hyper-parameters

  • Parameters: trained when optimizing the loss
  • Hyper-parameters: not trained with the loss
    • model topology
    • nb of iterations (epochs)
    • learning rate
    • regularization strength…

Hyper-parameters have to be tuned on a development set.

Evaluation: Hyper-parameter tuning

  • Best option: further split the training set on 2 subsets: training + development

  • Alternative: nested cross-validation

For each test fold i:
    - For each possible values of hyper-parms:
        For each dev fold j != i:
            - train on remaining folds
            - evaluate on dev fold j
        Average results on all dev folds
    - Pick the best hyper-parms
    - Retrain model on all folds but i
    - Evaluate on test fold
Average results over test folds

Evaluation: Hyper-parameter tuning

For each test fold i:
    - For each possible values of hyper-parms:
        For each dev fold j != i:
            - train on remaining folds
            - evaluate on dev fold j
        Average results on all dev folds
    - Pick the best hyper-parms
    - Retrain model on all folds but i
    - Evaluate on test fold
Average results over test folds
  • Issues:
    • Too slow !

Evaluation: Evaluation metrics

  • Standard classification metric: accuracy \[ acc = \frac{N_{OK}} {N_{tot}}\]

  • Question: on the same test set,
    • classifier C1 gives 71% of accuracy
    • classifier C2 gives 73% of accuracy
    • Is C2 better than C1 ?

Evaluation: Confidence interval

  • Standard statistical hypothesis test:
    • Hyp H0: C2 is not better than C1
  • Accuracy is a proportion: compute 95%-Wald confidence interval: \[p \pm 1.96 \sqrt{\frac {p(1-p)} n}\]

with \(n=\) nb of samples in the test corpus.

Evaluation: Confidence interval

  • Example with 1000 examples in test:
  • acc(C1) \(\in \{68.2\%; 73.8\%\}\)
  • acc(C2) \(\in \{70.2\%; 75.7\%\}\)

  • So we can not conclude that C2 is better than C1

Evaluation: Detection

Evaluation: Detection

  • Assume one class is very frequent (“Is there a cat in the picture ?”)
  • Cast as a detection problem

  • Assume next 2 classes:
    • positive (class of interest)
    • negative (very frequent)
  • Evaluated with:
    • precision: is every cat detected a real cat ?
    • recall: have we found every cat ?

Evaluation: Detection

  • True positive (TP): \(f(x)=+\) and \(y=+\)
  • False positive (FP): \(f(x)=+\) but \(y=-\)
  • True negative (TN): \(f(x)=-\) and \(y=-\)
  • False negative (FN): \(f(x)=-\) but \(y=+\)

Evaluation: Detection

Precision: \[ p = \frac {TP}{TP+FP}\]

Recall: \[ r = \frac {TP}{TP+FN}\]

Accuracy: \[ a = \frac {TP+TN}{TP+TN+FP+TN}\]

Evaluation: detection threshold

  • We may want to be sure that every positive detected is really positive

  • Adjust decision threshold:
    • pos. iff \(f(x)>\tau\)

Evaluation: detection threshold

  • Precision-recall curve:

Evaluation: ROC curve

  • True positive rate = recall
  • False positive rate = \(\frac{TN}{TN+FP}\)

Receiver Operating Characteristic:

References

  • https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf
  • http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf

Going further

  • How to define good features ?
  • SVM
  • Maximum entropy models (log-linear models)
  • Statistical tests: Chi2, McNemard
  • Curse of dimensionality: running time, overfitting, nb of samples required
  • kNN
  • MBL
  • clustering

K-Means clustering

  • Clustering = unsupervised

  • K-Means minimizes the Square Loss

  • Model composed of
    • \(K\) centroids \(c_i\)

K-Means clustering

  • Training:
    • Randomly select \(K\) initial centroids
    • Compute distance \(d(x_i,c_j)\)
    • Assign each \(x_i\) to its closest \(c_j\)
    • Re-compute each centroid from its assigned samples: \[c_j = \frac 1 N_j \sum_{i\in j} x_i\]
    • Stop when nothing change

K-Means

  • Very common clustering

  • But:
    • Does not handle categorical observations
    • Sensitive to outliers
    • Fails on non-linear dataset

K-Means

K-nearest neighbors

  • May be used for classification or regression
  • Instance of MBL (or Instance Based Learning)

  • Training:
    • choose hyper-parameter \(K\) manually
    • store all training samples \((x_i,y_i)\)

K-nearest neighbors

  • Classification of \(x\):
    • compute \(d(x,x_i)\) for all \(i\)
    • pick the \(K\) closest training samples \(x_i\)
    • assign \(y=\) majority vote
  • may use a weighted vote by distance

kNN

  • With infinite dataset, the 2-class kNN has an error rate \(\leq 2\times\) Bayes risk

  • Does not scale to large datasets:
    • Approximate nearest neighbors search
  • Does not work in high-dimensions: all distances are similar
    • Preprocessing to reduce dimensions

1-NN

  • Very effective given its simplicity

  • Decision boundary: Voronoi diagrams

Decision Trees classifier

DT learning

  • Several algorithms for training: CART, ID3, C4.5…

  • Advantages:
    • Easy to interpret
    • Efficient with large datasets
  • Drawbacks:
    • Not very accurate
    • Overfit easily / very non-robust to data samples

Wrapup

  • Concepts of:
    • cross-validation
    • hyper-parameter tuning