Christophe Cerisara

2018/2019

- To contact me: cerisara@loria.fr
- Info on deep learning, including course: http://deeploria.gforge.inria.fr
- Book: **Deep Learning** by I. Goodfellow, Y. Bengio and A. Courville. https://www.deeplearningbook.org/
- Stanford course on ConvNets: http://cs231n.stanford.edu
- Oxford course on Machine Learning: https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

- Fundamentals of probability theory (Bayes rule, prior, posterior, likelihood, marginal, conditional, density, normal distribution…)
- Basics of optimization (differentiability, gradient, local optima, convexity, chain rule, matrix algebra…)
- Good programming skills in python

- Strongly recommended: install ubuntu (virtualbox, dual boot, native…)
- Mandatory: install python + numpy
- Mandatory: install pytorch (https://pytorch.org)
- Check pytorch:
- copy, paste and run the second example in https://pytorch.org/tutorials/beginner/pytorch_with_examples.html (called *PyTorch: Tensors*)
- Check that you obtain a value smaller than \(10^{-4}\) at iteration 500

Part 1: Introduction to Machine Learning

Part 2: Probabilistic models

Part 3: Deep learning

- Very brief reminder of linear models
- Reminder fundamentals of parameter learning:
- loss, risks
- bias/variance tradeoff

- Good practices for experimental evaluations

- Markov Random Fields vs. Bayesian Networks
- Naive Bayes, CRF
- Training, maximum likelihood, EM

- convex optimization: line search, steepest, Newton, BFGS, conjugate, SGD
- feed-forward networks: backpropagation

- convolutional networks
- Model zoo for image processing
- Model zoo for text processing

- recurrent networks (RNN, LSTM, GRU):
- vanishing gradient

- attention and memory models
- Model zoo for text processing

- variational autoencoders
- Generative Adversarial Networks

- Let’s assume we want to solve a real-world problem: predict the weather tomorrow
- We can exploit our knowledge: it’s raining more in fall…
- Programming an expert system:

```
# toy expert system: hand-coded rules adjusting a rain probability
summer, rainYesterday, pressure = False, True, 990
p = 0.2
if summer: p = p / 2
if rainYesterday: p = p * 2
if pressure < 1000: p = p * 2   # p is now 0.8
```

- Encoding of expert knowledge into a program
- But limited by expert understanding
- Agreement between experts ?
- Gives unmaintainable programs for complex phenomena
- The only solution when there is no data

- Do not be too confident in our knowledge: we (experts) may be wrong !
- Rather observe the data, and try and capture correlations between observations.

- Observe what happens during a few years (data, observations, evidence)
- Implement a parametric function that predicts the weather, based on features \[f(X;\theta) = \sum_i \theta_i X_i\]
- Adopt a decision rule: "It will rain iff \(f(X;\theta)>0\)"
- Compute \(\theta\) so as to minimize the prediction error (training = optimization)

- Does not rely on explicit human knowledge, but…
- Which family of functions for \(f()\) ?
- Which observations \(X\) to take ?
- We can add constraints and priors into \(f()\)

- Where to get data from ?
- Which features \(X\) to choose ?
- Which model and parameters \(f(X;\theta)\) to use ?
- How to train the parameters \(\theta\) ?
- How to evaluate *reliably* whether our model is good or not ?

- Depends on the domain:
- From sensors: temperature, microphone, camera, GPS…
- From humans: books, Twitter, traces…
- From simulations: go/chess/video games…

- Reproducible research:
- From standard datasets, benchmarks (Kaggle…)

- Ethics in ML is a fundamental and difficult issue: *fair* evaluations, *transparent* ML, *open* data
- 80% of bio papers are not reproducible !

- Never assume data is “objective”, is the truth
- Data is often *biased*

E.g.: Microsoft Tay bot

- Features that are *relevant* for the task
- Independent features ?
- Some models require independent features
- Most current models support correlated features

- Diverse information
- Many vs. a few features:
- More features improve performance…
- … but augment the number of parameters to estimate, and make training more difficult

\(X=\) vector; each feature is encoded into 1+ dim:

- Numerical feature: e.g., “Temperature today”
- Encoded as a float in one dimension
- Must be normalized !

- Symbolic feature: e.g., “Season (summer, winter…)”
- Encoded as a *one-hot vector*
- Summer: \(X_i = [0,1,0,0]\) Winter: \(X_i = [0,0,0,1]\)
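A minimal encoding sketch (the season vocabulary order and the normalization statistics are illustrative assumptions):

```python
import numpy as np

# Encode one example: a normalized numerical feature (temperature)
# concatenated with a one-hot symbolic feature (season).
seasons = ["spring", "summer", "fall", "winter"]  # assumed vocabulary order

def encode(temperature, season, mean=12.0, std=8.0):
    """mean/std are hypothetical normalization stats from the training set."""
    x_num = np.array([(temperature - mean) / std])  # normalized float, 1 dim
    x_sym = np.zeros(len(seasons))                  # one-hot, 4 dims
    x_sym[seasons.index(season)] = 1.0
    return np.concatenate([x_num, x_sym])

x = encode(20.0, "summer")
print(x)  # [1. 0. 1. 0. 0.]
```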

- Standard discriminative models: linear, SVM, CRF, decision trees…
- Bayesian models: HMM, LDA…
- Complex discriminative models: Deep Neural Networks
- Less common models: Memory-Based Learning, genetic algorithms…

- We will study how to train each type of model
- Which objective: minimize prediction error, (re)generate \(X\)…
- 2 most important: cross-entropy and maximum likelihood
- Most important algorithms: Stochastic Gradient Descent and Expectation Maximization

- Avoid models that are good on the training data, but fail on unknown data.
- We want to assess *generalization* capabilities: *bias-variance tradeoff*

A linear model is defined as follows:

\[\hat y(x) = \sum_j x_{j} \theta_j\]

- \(x=[x_1 \cdots x_N]^T\) is an observation vector: it contains data that is observed, known
- There is at most one data type per dimension:
- dim 1: temperature
- dim 2: pressure
- …

- Each “data type” is called a
*feature*

\[\hat y(x) = \sum_j x_{j} \theta_j\]

- \(\hat y(x)\) is the predicted value (scalar); it may represent
- A value we are interested in, but we don’t know: e.g., how much rain will fall tomorrow ? (*regression*)
- The score of a class: e.g., will it rain tomorrow ? (*classification*)

\[\hat y(x) = \sum_j x_{j} \theta_j\]

- \(\theta\) is a vector of parameters: relates \(x\) to \(y\); it’s very hard to set by hand, so it is *trained* on a *training corpus*

\[\hat y(x) = \sum_j x_{j} \theta_j\]

- This is the equation of a hyperplane through the origin in the space \((x_1,\cdots,x_N,y)\)
- To be more flexible, we add a constant scalar *bias*

\[\hat y(x) = \sum_j x_{j} \theta_j + b\]

- The bias can be moved into the sum by adding one constant feature \(x_1=1\)
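A quick numerical check of this "bias trick" (the weights and inputs below are arbitrary):

```python
import numpy as np

# y_hat = sum_j x_j theta_j + b is equivalent to a pure dot product
# once a constant feature 1 is prepended and b is folded into theta.
theta = np.array([2.0, -1.0])   # hypothetical weights
b = 0.5                         # bias
x = np.array([3.0, 4.0])

y1 = x @ theta + b                        # explicit bias
x_aug = np.concatenate([[1.0], x])        # add the constant feature
theta_aug = np.concatenate([[b], theta])  # bias becomes a regular parameter
y2 = x_aug @ theta_aug
print(y1, y2)  # 2.5 2.5
```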

- regression = linear model of the relation between
- \(x\): (years of education, seniority)
- \(y\): income

- \(y\) is the target value that we want to predict

Imagine 2 different sets of parameters: \[\hat y = 3x_1 + 2x_2\] and \[\hat y = 1 + 2x_1 + x_2\]

Which model is the best ?

Objective function to minimize (square error): \[J(\theta)=\sum_{i=1}^n (\hat y_i - y_i)^2\]

Training boils down to computing \[\hat\theta = \arg\min_\theta J(\theta)\]

- All true \(y\) are given in the training corpus
- They are called *gold values* / *gold labels*
- We can compute and thus optimize \(J(\theta)\) on the training corpus

- The training corpus only contains \(X\); no \(y\) value is given
- We can not compute \(J(\theta)\)
- Other training criteria are used

- \(y\) is given for only some of the training examples
- But we still want to use all data to train the models
- Special cases:
- Few-shot learning: a few \(y\) are known
- One-shot learning: a single \(y\) is known

- Let \(X\) be a matrix where
- each row is one example
- each column is one feature

- Let \(y\) be a vector with the (gold) income for each observed example

We can express the square error as: \[J(\theta) = (X\theta-y)^T(X\theta-y)\]

Exercise: derive the optimal \(\hat \theta\) that minimizes the square loss.

- Reminder:
- \(\frac{\partial A \theta}{\partial \theta} = A^T\)
- \(\frac{\partial \theta^T A \theta}{\partial \theta}=2A^T \theta\)

\[\hat \theta = (X^T X)^{-1} X^T y\]
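The closed-form solution can be checked numerically on synthetic data (the data and the true parameters below are illustrative):

```python
import numpy as np

# Verify theta_hat = (X^T X)^{-1} X^T y on noiseless synthetic data.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                         # noiseless targets

theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.allclose(theta_hat, true_theta))  # True
```

In practice `np.linalg.lstsq` (or `solve` on the normal equations) is preferred to an explicit matrix inverse, for numerical stability.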

- Nice ! But…
- Many problems are not linear
- Many questions are not regression


- What animal is shown on a picture ?
- Will it rain tomorrow ?

- The gold target \(y\) is discrete / symbolic
- We need to map our continuous prediction \(f(x)\) into a set of classes / labels

Choose a decision boundary: \[\hat y(x) = sgn(f(x))\]

- For the training corpus, give gold labels as \(y \in \{ -1, +1 \}\)
- It’s not raining: \(y=-1\)
- It’s raining: \(y=+1\)

Can be trained with minimum square error

- \(X\) = input vector
- \(Y\) = scalar
- blackbox = linear function (for now…)

- Use one linear classifier per class: \(f_1(x), \cdots, f_N(x)\)
- Decision rule: choose the class with the maximum predicted score \[\hat y(x) = \arg\max_c f_c(x)\]

In the training corpus, keep discrete labels \(y_i \in \{1,\cdots,N\}\)

- \(Y\) = vector = one score per class
- blackbox = multi-linear function

- Binary classification may have either one or two outputs (with different decision functions)
- Double the number of parameters !

- Training dataset:

```
Pierre qui roule n'amasse pas mousse
*Jean* va manger avec *Marie*
Il fait beau à Nancy
```

- Problem: detect person names
- We assume a fixed dictionary with \(N\) words

- Exercise:
- build a binary linear classifier
- compute its optimal parameters
- what is the prediction error ?

Let \((w_1,y_1),\cdots,(w_N,y_N)\) be all our training samples

- For any word \(x\), we compute indicator features:
- \(x_i=1\) iff \(x=w_i\) for \(1\leq i \leq N\)

- We can rearrange the dataset so that \(X=I\)
- … (solve) … \(\hat\theta = y\)

*Memory-based learning*

- Prediction error on the training dataset = 0%

Test sentence:

`Jacques va manger avec Marie`

- What is the prediction for *Jacques* ?

Test sentence:

`Jacques va manger avec Marie`

- How to handle out-of-vocabulary words (OOV) ?
- Either keep them in the vocabulary, but this causes issues with the matrix inversion
- Or we extend \(\theta\) afterwards with non-informative parameters

Test sentence:

`Pierre va manger avec Marie`

- What is the prediction for Pierre ?

Test sentence:

`Pierre va manger avec Marie`

- Erroneous prediction
- This model is very bad: it severely *overfits* the training corpus
- The objective of machine learning is to *generalize* to unseen data

- Improve generalization capacity by
- adding more training examples

`Pierre dort.`

- Now we get: \(\hat\theta(Pierre)=\frac 1 2\)
- It’s better; but depending on the training ratio of *Pierre*, we’ll always get the same answer.

- Features = only the word ID
- This is not enough to disambiguate
- Pierre qui roule n’amasse pas mousse
- Pierre est venu

- Better feature: word ID + context IDs
- “roule” suggests “not a person”

Solving with \(10^6\) training samples is hard

- The square error is not the best possible measure for classification:
- 1 extreme outlier may dramatically impact accuracy

- How to count the number of prediction errors ?

\[1- \delta(y,sgn(f(x)))\]

- We want to minimize this loss, but no analytical solution
- Iterative training algorithms

Finds a hyperplane that separates the training data

- Repeat until convergence:
- For \(t=1 \cdots n\):
- \(\hat y = sgn(f(x_t))\)
- if \(\hat y \neq y_t\) then \(\theta = \theta + y_t x_t\)

Algorithm converges if the training data is separable (Novikoff, 1962)
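The perceptron update rule above, as a minimal numpy sketch on toy separable data (data and labels are illustrative):

```python
import numpy as np

# Perceptron: on each mistake, theta <- theta + y * x.
def perceptron(X, y, max_epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if np.sign(xi @ theta) != yi:   # np.sign(0) == 0, counted as error
                theta += yi * xi
                errors += 1
        if errors == 0:                     # converged: no more mistakes
            break
    return theta

# Toy 2D data, with a constant feature for the bias
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0],
              [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])
theta = perceptron(X, y)
print(np.all(np.sign(X @ theta) == y))  # True
```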

- Multiple solutions:

- Which solution is better (wrt generalization) ?

\(w\) is normal to the hyper-plane

distance \(r\) of any \(x\): \[w^T \left(x-yr\frac w {||w||} \right) + b=0\]

- Solve for \(r\): \[r=\frac{y(w^Tx+b)}{||w||}\]

- We choose a scaling of \(w\) and \(b\) s.t. \[y_i(w^Tx_i+b)\geq 1\]
- with at least one equality, margin= \[\rho = \frac 2 {||w||}\]

Find \(\theta\) that minimizes \(\frac 1 2 || \theta ||\) s.t. \[y_i X_i\theta \geq 1 ~~~ \forall i\]

= Quadratic programming problem

Gives a unique hyperplane as far as possible from training examples

The solution only depends on a subset of training examples that appear on the margin: the *support vectors*

Limits:

- if the training data is not separable, the hyperplane does not exist
- even a single example can radically change the position of the hyperplane

- Transform your input data with \(F(a)\) into a higher dimensional space so that it is linearly separable
- Kernel trick: don’t compute this transformation \(F(a)\) explicitly, but use a kernel s.t. \(K(a,b)=F(a)\cdot F(b)\)

- Informal presentation of:
- Components of ML systems:
- Data encoding into features
- Choice of parametric model
- Training methods


- Data encoding into features
- continuous vs. symbolic evidence
- one-hot vector

- Choice of parametric model
- linear model applied to
- linear regression
- linear classification


- 3 objective functions:
- Square error
- Classification error
- Maximum margin

- 2 training methods:
- Exact solution for lin. reg. with SE
- Perceptron algo for classif. with CE (http://computing.dcu.ie/~humphrys/Notes/Neural/single.neural.html)

- Limitations

- Loss: measures how much it costs you when the model is wrong for one example

\[L(x,y)=(y-f(x))^2\]

- L2 loss
- severely penalizes outliers (large values)
- slower convergence

\[ L(x,y) = \begin{cases}1 & y\neq f(x)\\ 0 & y=f(x)\end{cases}\]

- also known as 0-1 loss
- When you sum over all examples, gives the classification error
- Not differentiable, not convex

- Hinge loss: \(L = \max(0,1-yf(x))\)
- tight convex upper bound of 0-1 loss
- not differentiable at \(yf(x)=1\), but admits a subgradient
- used in *SVM* to find a “corridor” of maximum width that separates the data

Logistic loss: \[\frac 1 {\ln 2} \ln(1+e^{-yf(x)})\]

- differentiable everywhere
- but more sensitive to outliers
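The losses above can be compared as functions of the margin \(m = y f(x)\); a minimal sketch:

```python
import numpy as np

# Classification losses as functions of the margin m = y * f(x).
def zero_one(m):  return np.where(m > 0, 0.0, 1.0)
def hinge(m):     return np.maximum(0.0, 1.0 - m)
def logistic(m):  return np.log1p(np.exp(-m)) / np.log(2)

for m in [-2.0, 0.0, 0.5, 2.0]:
    print(m, zero_one(m), hinge(m), logistic(m))
```

Note that the \(\frac 1 {\ln 2}\) scaling makes the logistic loss equal 1 at margin 0, like the 0-1 loss, while hinge stays a tight convex upper bound of 0-1.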

- Classifier risk = expected loss over all possible corpora

\[R(f)=\int L(Y,f(X))\, p(X,Y)\, dX\, dY\]

It’s our real objective, as we want to optimize on test (unknown) samples !

more info: https://www.cs.cmu.edu/~epxing/Class/10715/lectures/RiskMin.pdf

Minimum risk when considering all possible classifiers: \(R^* = \min_f R(f)\)

Bayes classifier: \(f^* = \arg\min_f R(f)\)

- Minimizing over \(f\) is good *after* marginalizing over all corpora
- It’s dramatic when done *before* !

Empirical Risk: \[\hat R(f) = \frac 1 n \sum_{i=1}^n L(y_i,f(x_i))\]

- Law of large numbers: it converges to the true risk
- But the empirical risk makes *overfitting* possible !

- Generative:
- Model of the observations
- \(P(X|Y)\)

- Discriminative:
- Model of the boundary
- \(P(Y|X)\)

- Example:
- X = pressure (scalar)
- Y = will it rain ? (0 or 1)

- Gaussian model for \(P(X|Y)\):
- Pick a centroid \(\mu_0\) and std dev \(\sigma_0\) without rain
- Pick a centroid \(\mu_1\) and std dev \(\sigma_1\) when it rains

- Likelihood to observe this atm pressure when it rains:

\[P(X|Y=1)=\frac 1 {\sqrt{2\pi}\sigma_1}e^{-\frac 1 2 \left( \frac{X-\mu_1}{\sigma_1} \right)^2}\]

- Assume we don’t know any label \(Y\).
- All we can do is fit a model on \(X\):

\[\hat\theta = \arg\max_\theta \prod_t P(X_t|\theta)\]

\[P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(X|Y=y,\theta)P(Y=y)\]

Assume \(P(Y)\) uniform: we only need to maximize \[\prod_t \sum_i \frac 1 {\sqrt{2\pi}\sigma_i}e^{-\frac 1 2 \left( \frac{X_t-\mu_i}{\sigma_i} \right)^2}\]

- When we change \(\theta\), this likelihood changes
- So there exists an optimum that can be found for \(\theta\)
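To see that the likelihood indeed depends on \(\theta\), here is a sketch that evaluates the log marginal likelihood of unlabeled data under two hypothetical parameter settings (the toy pressure data and the centroids are illustrative):

```python
import numpy as np

# Log marginal likelihood of unlabeled X under a 2-component Gaussian
# mixture with uniform P(Y).
def log_marginal(X, mus, sigmas):
    total = 0.0
    for x in X:
        comp = [np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)
                for m, s in zip(mus, sigmas)]
        total += np.log(0.5 * sum(comp))   # uniform prior P(Y=i) = 1/2
    return total

rng = np.random.default_rng(0)
# toy atmospheric pressures drawn from two regimes
X = np.concatenate([rng.normal(990, 5, 50), rng.normal(1015, 5, 50)])

good = log_marginal(X, mus=[990, 1015], sigmas=[5, 5])
bad  = log_marginal(X, mus=[1000, 1000], sigmas=[5, 5])
print(good > bad)  # well-placed centroids give a higher likelihood
```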

Linear model: \[\hat y = w x + b\]

We want a probability: logistic regression: \[P(Y=1|X) = \frac 1 {1+e^{-(wx+b)}}\]

\[P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(Y=y|X,\theta)P(X)\]

Getting \(P(X)\) out of the sum: \[P(X|\theta)=P(X)\]

The observation likelihood does not depend on the parameters !

So it is impossible to maximize it…

- The prediction error can be decomposed into 3 components:
- error due to “bias”
- error due to “variance”
- irreducible error

- Bias:
- avg diff btw prediction and true value (if you repeat training+test many times)
- results from simplifying assumptions about the form of the target
- Low bias: DT, kNN, SVM
- High bias: linear, logistic regression

- Variance:
- variability of output for a given input (if you repeat…)
- sensitivity of the model to changes in training data
- Low var: linear, logistic regression
- High var: DT, kNN, SVM

- Irreducible error:
- Intrinsic ambiguity amongst classes
- Ex: given the player ID, predict the output of throwing a dice ?

Depends on both classifier + learning method

Let \(Z\) be a random variable with distrib \(P\)

Let \(\bar{Z} = E_P[Z]\) be the average value of \(Z\)

Show that \[E\left[ (Z-\bar{Z})^2 \right] = E[Z^2] - \bar{Z}^2\]
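The lemma follows by expanding the square and using linearity of expectation (\(\bar{Z}\) is a constant):

\[E\left[ (Z-\bar{Z})^2 \right] = E\left[ Z^2 - 2\bar{Z} Z + \bar{Z}^2 \right] = E[Z^2] - 2\bar{Z}\, E[Z] + \bar{Z}^2 = E[Z^2] - \bar{Z}^2\]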

- Let \(f\) be a true regression function \(y=f(x)+\epsilon\)
- with \(\epsilon \sim N(0;\sigma^2)\)

We fit a hypothesis \(h(x) = w\cdot x +b\) with MSE

- Let a *new* data point be \(y = f(x) + \epsilon\):
- What is the expected error ? \[E\left[ (y-h(x))^2 \right]\]

!! Expected over the training corpora of \(h\): \(x\) is constant !!

\(E[(h(x)-y)^2] = E[h^2(x)] -2E[h(x)]E[y] + E[y^2]\)

Apply the previous lemma to both square terms:

\(E[(h(x)-\bar{h(x)})^2] + \bar{h(x)}^2 -2\bar{h(x)}f(x) + E[(y-f(x))^2] + f(x)^2\)

Factorize into Variance + Bias + intrinsic error:

\(E[(h(x)-\bar{h(x)})^2] + (\bar{h(x)} - f(x))^2 + E[(y-f(x))^2]\)
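The decomposition can be checked by simulation: repeat training on fresh corpora and compare the directly estimated error with variance + bias² + irreducible noise (the true function, noise level and test point below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2.0 * x + 1.0        # true regression function
sigma = 0.5                        # std dev of the irreducible noise
x0 = 0.7                           # fixed test input

preds, errs = [], []
for _ in range(2000):              # repeat training on fresh corpora
    x = rng.uniform(-1, 1, 20)
    y = f(x) + rng.normal(0, sigma, 20)
    A = np.stack([x, np.ones_like(x)], axis=1)
    w, b = np.linalg.lstsq(A, y, rcond=None)[0]   # fit h(x) = w*x + b, MSE
    pred = w * x0 + b
    preds.append(pred)
    y0 = f(x0) + rng.normal(0, sigma)             # fresh test target
    errs.append((y0 - pred) ** 2)

preds = np.array(preds)
variance = preds.var()
bias2 = (preds.mean() - f(x0)) ** 2               # ~0: h is well-specified
decomposed = variance + bias2 + sigma ** 2
expected_err = np.mean(errs)
print(variance, bias2, decomposed, expected_err)
```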

- (Wrong) intuition: it’s more important to reduce bias
- Yes, on the average in the long run, low-bias models may perform well
- But long run avgs are irrelevant in practice !

- *Bagging* helps to reduce variance:
- resample corpus \(\rightarrow\) train multiple models
- average predictions
- Ex: Random forests

- Large bias: underfitting
- Increase nb of parameters

- Large variance: overfitting
- More regularization

- Model “complexity” is related to the nb of parameters, but not only !
- Notion of “capacity” in deep neural networks

How to find this optimum ?

- There are theoretical measures (see Akaike’s Information Criterion)
- But much better to use cross-validation

- Concepts of:
- Loss, risk
- generative vs. discriminative
- supervised vs. unsupervised
- bias-variance tradeoff

- Our objective is to predict *new* (never seen) samples
- So we never evaluate on the training set

- Best approach: holdout data split

- Main issues:
- Penalizing for small datasets
- Experiments must *never* be run several times on the test set !! \(\rightarrow\) contamination

- How many folds ?
- Few: biased error estimates with low variance
- Many (leave-one-out): unbiased error estimate but high variance
- Practical: 5-folds or 10-folds

- Parameters: trained when optimizing the loss
- Hyper-parameters: not trained with the loss
- model topology
- nb of iterations (epochs)
- learning rate
- regularization strength…

Hyper-parameters have to be tuned on a *development set*.

Best option: further split the training set on 2 subsets: training + development

Alternative: nested cross-validation

```
For each test fold i:
    For each possible value of the hyper-parms:
        For each dev fold j != i:
            - train on remaining folds
            - evaluate on dev fold j
        Average results on all dev folds
    Pick the best hyper-parms
    Retrain model on all folds but i
    Evaluate on test fold
Average results over test folds
```

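The nested cross-validation pseudocode above can be sketched as runnable code. Here ridge regression (closed form) serves as the model, with the regularization strength \(\lambda\) as the single hyper-parameter; the data, fold count and \(\lambda\) grid are illustrative choices:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.1, size=100)
folds = np.array_split(np.arange(100), 5)          # 5 outer folds
lambdas = [0.01, 0.1, 1.0, 10.0]

outer_scores = []
for i, test_idx in enumerate(folds):               # for each test fold i
    dev_folds = [f for j, f in enumerate(folds) if j != i]
    best_lam, best_err = None, np.inf
    for lam in lambdas:                            # for each hyper-parm value
        errs = []
        for j, dev_idx in enumerate(dev_folds):    # inner loop over dev folds
            train_idx = np.concatenate(
                [f for k, f in enumerate(dev_folds) if k != j])
            theta = ridge_fit(X[train_idx], y[train_idx], lam)
            errs.append(mse(X[dev_idx], y[dev_idx], theta))
        if np.mean(errs) < best_err:               # pick the best hyper-parm
            best_lam, best_err = lam, np.mean(errs)
    train_idx = np.concatenate(dev_folds)          # retrain on all folds but i
    theta = ridge_fit(X[train_idx], y[train_idx], best_lam)
    outer_scores.append(mse(X[test_idx], y[test_idx], theta))

print(np.mean(outer_scores))                       # average over test folds
```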

- Issues:
- Too slow !

Standard classification metric: accuracy \[ acc = \frac{N_{OK}} {N_{tot}}\]

- Question: on the same test set,
- classifier C1 gives 71% of accuracy
- classifier C2 gives 73% of accuracy
- Is C2 better than C1 ?

- Standard statistical hypothesis test:
- Hyp H0: C2 is not better than C1

- Accuracy is a proportion: compute 95%-Wald confidence interval: \[p \pm 1.96 \sqrt{\frac {p(1-p)} n}\]

with \(n=\) nb of samples in the test corpus.
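A minimal sketch of this interval, checked on a 1000-example test set:

```python
import math

# 95% Wald confidence interval for a proportion p measured on n samples.
def wald_interval(p, n, z=1.96):
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

print(wald_interval(0.71, 1000))  # approximately (0.682, 0.738)
print(wald_interval(0.73, 1000))  # approximately (0.702, 0.758)
```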

- Example with 1000 examples in test:
- acc(C1) \(\in [68.2\%; 73.8\%]\)
- acc(C2) \(\in [70.2\%; 75.7\%]\)

So we can not conclude that C2 is better than C1

- Assume one class is very frequent (“Is there a cat in the picture ?”)
- Cast as a *detection* problem
- Assume the following 2 classes:
- positive (class of interest)
- negative (very frequent)

- Evaluated with:
- precision: is every cat detected a real cat ?
- recall: have we found every cat ?

- True positive (TP): \(f(x)=+\) and \(y=+\)
- False positive (FP): \(f(x)=+\) but \(y=-\)
- True negative (TN): \(f(x)=-\) and \(y=-\)
- False negative (FN): \(f(x)=-\) but \(y=+\)

Precision: \[ p = \frac {TP}{TP+FP}\]

Recall: \[ r = \frac {TP}{TP+FN}\]

Accuracy: \[ a = \frac {TP+TN}{TP+TN+FP+FN}\]
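The three metrics can be computed directly from (prediction, gold) pairs; the toy labels below are illustrative:

```python
# Precision, recall and accuracy from predicted/gold label pairs.
def detection_metrics(preds, golds):
    tp = sum(p == '+' == g for p, g in zip(preds, golds))
    fp = sum(p == '+' and g == '-' for p, g in zip(preds, golds))
    tn = sum(p == '-' == g for p, g in zip(preds, golds))
    fn = sum(p == '-' and g == '+' for p, g in zip(preds, golds))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Toy example: 2 TP, 1 FP, 6 TN, 1 FN
preds = ['+', '+', '+', '-', '-', '-', '-', '-', '-', '-']
golds = ['+', '+', '-', '+', '-', '-', '-', '-', '-', '-']
print(detection_metrics(preds, golds))  # (0.666..., 0.666..., 0.8)
```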

We may want to be sure that every positive detected is really positive

- Adjust decision threshold:
- pos. iff \(f(x)>\tau\)

- Precision-recall curve:

- True positive rate = recall
- False positive rate = \(\frac{FP}{FP+TN}\)

Receiver Operating Characteristic:

- https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf
- http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf

- How to define good features ?
- SVM
- Maximum entropy models (log-linear models)
- Statistical tests: Chi2, McNemar
- Curse of dimensionality: running time, overfitting, nb of samples required
- kNN
- MBL
- clustering

Clustering = unsupervised

K-Means minimizes the Square Loss

- Model composed of
- \(K\) centroids \(c_i\)

- Training:
- Randomly select \(K\) initial centroids
- Compute distance \(d(x_i,c_j)\)
- Assign each \(x_i\) to its closest \(c_j\)
- Re-compute each centroid from its assigned samples: \[c_j = \frac 1 {N_j} \sum_{i\in j} x_i\]
- Stop when nothing changes
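The training loop above as a minimal numpy sketch (toy 2-cluster data; the initialization simply picks \(K\) random training points):

```python
import numpy as np

# K-Means: random init, assign to closest centroid, re-estimate centroids,
# stop when the assignments no longer change.
def kmeans(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # random initial centroids
    assign = np.full(len(X), -1)
    while True:
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)                    # closest centroid
        if np.all(new_assign == assign):                 # nothing changed: stop
            return centroids, assign
        assign = new_assign
        for j in range(K):                               # re-compute centroids
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, assign = kmeans(X, K=2)
print(centroids)  # one centroid near (0,0), the other near (5,5)
```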

A very common clustering algorithm

- But:
- Does not handle categorical observations
- Sensitive to outliers
- Fails on non-linear dataset

- May be used for classification or regression
- Instance of MBL (or Instance-Based Learning)

- Training:
- choose hyper-parameter \(K\) manually
- store all training samples \((x_i,y_i)\)

- Classification of \(x\):
- compute \(d(x,x_i)\) for all \(i\)
- pick the \(K\) closest training samples \(x_i\)
- assign \(y=\) majority vote

- may use a weighted vote by distance
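The classification steps above as a minimal sketch (the stored training data is a toy example):

```python
import numpy as np
from collections import Counter

# kNN: store the training set; predict by majority vote among the K closest.
def knn_predict(x, X_train, y_train, K=3):
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every stored sample
    nearest = np.argsort(d)[:K]               # indices of the K closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.1, 0.1]), X_train, y_train))  # 0
print(knn_predict(np.array([5.0, 5.1]), X_train, y_train))  # 1
```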

With an infinite dataset, the 2-class kNN has an error rate \(\leq 2\times\) the Bayes risk

- Does not scale to large datasets:
- Approximate nearest neighbors search

- Does not work in high-dimensions: all distances are similar
- Preprocessing to reduce dimensions

Very effective given its simplicity

Decision boundary: Voronoi diagrams

Several algorithms for training: CART, ID3, C4.5…

- Advantages:
- Easy to interpret
- Efficient with large datasets

- Drawbacks:
- Not very accurate
- Overfit easily / very non-robust to data samples

- Concepts of:
- cross-validation
- hyper-parameter tuning