//

2018/2019

# Machine Learning: introduction

## Resources

• To contact me: cerisara@loria.fr
• Info on deep learning, including course: http://deeploria.gforge.inria.fr
• Book
• Deep Learning by I. Goodfellow, Y. Bengio and A. Courville. https://www.deeplearningbook.org/
• Stanford course on ConvNet: http://cs231n.stanford.edu
• Oxford course on Machine Learning: https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

## Requirements

• Fundamentals of probability theory (Bayes rule, prior, posterior, likelihood, marginal, conditional, density, normal distribution…)
• Basics of optimization (differentiability, gradient, local optima, convexity, chain rule, matrix algebra…)
• Good programming skills in python

## For next week

• Strongly recommended: install ubuntu (virtualbox, dual boot, native…)
• Mandatory: install python + numpy
• Mandatory: install pytorch (https://pytorch.org)
• Check pytorch:
• copy, paste and run the second example in https://pytorch.org/tutorials/beginner/pytorch_with_examples.html (called PyTorch: Tensors)
• Check that you obtain a value smaller than $$10^{-4}$$ at iteration 500

## Plan of course

• Part 1: Introduction to Machine Learning

• Part 2: Probabilistic models

• Part 3: Deep learning

## Machine learning introduction

• Very brief reminder of linear models
• Reminder fundamentals of parameter learning:
• loss, risks
• Good practices for experimental evaluations

## Probabilistic models

• Markov Random Fields vs. Bayesian Networks
• Naive Bayes, CRF
• Training, maximum likelihood, EM

## Deep learning

• convex optimization: line search, steepest, Newton, BFGS, conjugate, SGD
• feed-forward networks: backpropagation

• convolutional networks
• Model zoo for image processing
• Model zoo for text processing
• recurrent networks (RNN, LSTM, GRU):
• attention and memory models
• Model zoo for text processing
• variational autoencoders

## What is Machine Learning, and why ?

• Let’s assume we want to solve a real-world problem: predict the weather tomorrow
• We can exploit our knowledge: it’s raining more in fall…
• Programming an expert system:
p=0.2
if summer: p=p/2
if rainYesterday: p=p*2
if pressure<1000: p=p*2

## Pros and cons of expert approaches

• Encoding of expert knowledge into a program
• But limited by expert understanding
• Agreement between experts ?
• Gives unmaintainable programs for complex phenomena
• The only solution when there is no data

## Alternative: Machine learning (ML) approach

• Do not be too much confident about our knowledge: we (experts) may be wrong !
• Rather observe the data, and try and capture correlations between observations.

• Observe what happens during a few years (data, observations, evidence)
• Implement a parametric function that predicts the weather, based on features $f(X;\theta) = \sum_i \theta_i X_i$
• Adopt a decision rule: "It will rain iff $$f(X;\theta)>0$$
• Compute $$\theta$$ so as to minize the prediction error (training = optimization)

## ML = knowledge-free approach ?

• Does not rely on explicit human knowledge, but…
• Which family of functions for $$f()$$ ?
• Which observations $$X$$ to take ?
• We can add constraints and priors into $$f()$$

## Challenges for building an ML system

• Where to get data from ?
• Which features $$X$$ to choose ?
• Which model and parameters $$f(X;\theta)$$ to use ?
• How to train the parameters $$\theta$$ ?
• How to evaluate reliably whether our model is good or not ?

## Where to get data from ?

• Depends on the domain:
• From sensors: temperature, microphone, camera, GPS…
• From humans: books, Twitter, traces…
• From simulations: go/chess/video games…
• Reproducible research:
• From standard datasets, benchmarks (Kaggle…)

## A word about ethics in ML

• Ethics in ML is a fundamental and difficult issue
• fair evaluations, transparent ML
• open data
• 80% of bio papers not reproducible !
• Never assume data is “objective”, is the truth
• Data is often biased

E.g.: Microsoft Tay bot

## Which features to choose ?

• Features that are relevant for the task
• Independent features ?
• Some models require independent features
• Most current models support correlated features
• Diverse information
• Many vs. a few features:
• More features improve performances…
• … but augment the number of parameters to estimate, and make training more difficult

## Which form does $$X$$ take ?

$$X=$$ vector; each feature is encoded into 1+ dim:

• Numerical feature: e.g., “Temperature today”
• Encoded as a float in one dimension
• Must be normalized !
• Symbolic feature: e.g., “Season (summer, winter…)”
• Encoded as a one-hot vector
• Summer: $$X_i = [0,1,0,0]$$ Winter: $$X_i = [0,0,0,1]$$

## Which model ?

• Standard discriminative models: linear, SVM, CRF, decision trees…
• Bayesian models: HMM, LDA…
• Complex discriminative models: Deep Neural Networks
• Less common models: Memory-Based Learning, genetic algorithms…

## How to train parameters ?

• We will study how to train each type of model
• Which objective: minimize prediction error, (re)generate $$X$$
• 2 most important: cross-entropy and maximum likelihood
• Most important algorithms: Stochastic Gradient Descent and Expectation Maximization

## How to evaluate a model ?

• Avoid models that are good on the training data, but fail on unknown data.
• We want to assess generalization capabilities: bias-variance tradeoff

# Linear models

## Definitions

A linear model is defined as follows:

$\hat y(x) = \sum_j x_{j} \theta_j$

• $$x=[x_1 \cdots x_N]^T$$ is an observation vector: it contains data that is observed, known
• There is at most one data type per dimension:
• dim 1: temperature
• dim 2: pression
• Each “data type” is called a feature

## Definitions

$\hat y(x) = \sum_j x_{j} \theta_j$

• $$\hat y(x)$$ is the predicted value (scalar); it may represent
• A value we are interested in, but we don’t know: e.g., how much rain will fall tomorrow ? (regression)
• The score of a class: e.g., will it rain tomorrow ? (classification)

## Definitions

$\hat y(x) = \sum_j x_{j} \theta_j$

• $$\theta$$ is a vector of parameters: relates $$x$$ to $$y$$; it’s very hard to set by hand, so it is trained on a training corpus

## Definitions

$\hat y(x) = \sum_j x_{j} \theta_j$

• This is the equation of an hyperplane that contains the origin in $$(0,x_1,\cdots,x_N,y)$$
• To be more flexible, we add a constant scalar bias

$\hat y(x) = \sum_j x_{j} \theta_j + b$

• The bias can be moved into the sum by adding one constant feature $$x_1=1$$

## Linear regression

• regression = linear model of the relation between
• $$x$$: (years of education, seniority)
• $$y$$: income
• $$y$$ is the target value that we want to predict

## Training

• Imagine 2 different sets of parameters: $\hat y = 3x_1 + 2x_2$ and $\hat y = 1 + 2x_1 + x_2$

• Which model is the best ?

## Training

• Objective function to minimize (square error): $J(\theta)=\sum_1^n (\hat y_i - y_i)^2$

• Training boils down computing $\hat\theta = \arg\min_\theta J(\theta)$

## Supervised training

• All true $$y$$ are given in the training corpus
• They are called gold values / gold labels
• We can compute and thus optimize $$J(\theta)$$ on the training corpus

## Unsupervised training

• The training corpus only contains $$X$$; no $$y$$ value is given
• We can not compute $$J(\theta)$$
• Other training criteria are used

## Semi-supervised training

• $$y$$ is given for only some of the training examples
• But we still want to use all data to train the models
• Special cases:
• Few-shot learning: a few $$y$$ are known
• One-shot learning: a single $$y$$ is known

## Solving linear regression

• Let $$X$$ be a matrix where
• each row is one example
• each column is one feature
• Let $$y$$ be a vector with the (gold) income for each observed example

We can express the square error as: $J(\theta) = (X\theta-y)^T(X\theta-y)$

## Solving linear regression

• Exercice: solve = derive the optimal $$\hat \theta$$ that minimizes the square loss.

• Reminder:
• need $$\frac{\partial A \theta}{\partial \theta} = A^T$$
• and $$\frac{\partial \theta^T A \theta}{\partial \theta}=2A^T \theta$$

## Linear regression: solution

$\hat \theta = (X^T X)^{-1} X^T y$

• Nice ! But…
• Many problems are not linear
• Many questions are not regression

## Linear classification

• What animal is shown on a picture ?
• Will it rain tomorrow ?

• The gold target $$y$$ is discrete / symbolic
• We need to map our continuous prediction $$f(x)$$ into a set of classes / labels

## Binary classification

• Choose a decision boundary: $\hat y(x) = sgn(f(x))$

• For the training corpus, give gold labels as $$y \in \{ -1, +1 \}$$
• It’s not raining: $$y=-1$$
• It’s raining: $$y=+1$$
• Can be trained with minimum square error

## Binary classification

• $$X$$ = input vector
• $$Y$$ = scalar
• blackbox = linear function (for now…)

## N-ary classification

• Use one linear classifier per class: $$f_1(x), \cdots, f_N(x)$$
• Decision rule: choose the class with the maximum predicted score $\hat y(x) = \arg\max_c f_c(x)$

• In the training corpus, keep discrete labels $$y_i \in \{1,\cdots,N\}$$

## N-ary classification

• $$Y$$ = vector = one score per class
• blackbox = multi-linear function

## Binary classification

• Binary classification may have either one or two outputs (with different decision functions)
• Double the number of parameters !

## Problem: person name ?

• Training dataset:
    Pierre qui roule n'amasse pas mousse
*Jean* va manger avec *Marie*
Il fait beau à Nancy
• Problem: detect person names
• We assume a fixed dictionary with $$N$$ words
• Exercice:
• build a binary linear classifier
• compute its optimal parameters
• what is the predition error ?

## Solution

• Let $$(w_1,y_1)\cdots,(w_N,y_N)$$ be all our training samples

• For any word $$x$$, we compute indicator features:
• $$x_i=1$$ iff $$x=w_i$$ for $$1\leq i \leq N$$

## Solution

• We can rearrange the dataset so that $$X=I$$
• … (solve) … $$\hat\theta = y$$

Memory-based learning

• Prediction error on the training dataset = 0%

## Practical issues

Test sentence:

Jacques va manger avec Marie
• What is the prediction for Jacques ?

## Practical issues

Test sentence:

Jacques va manger avec Marie
• How to handle out-of-vocabulary words (OOV) ?
• Either in-voc, but issues with matrix inversion
• Or we extend $$\theta$$ afterwards with non-informative parameters

## Theoretical issue

Test sentence:

Pierre va manger avec Marie
• What is the prediction for Pierre ?

## Theoretical issue

Test sentence:

Pierre va manger avec Marie
• Erroneous prediction
• This model is very bad: it severly overfits the training corpus
• Objective of machine learning is to generalize to unseen data

## Theoretical issue

• Improve generalization capacity by
Pierre dort.
• Now we get: $$\hat\theta(Pierre)=\frac 1 2$$
• It’s better; but depending on the training ratio of Pierre, we’ll always get the same answer.

## Limits of chosen representations

• Features = only the word ID
• This is not enough to disambiguate
• Pierre qui roule n’amasse pas mousse
• Pierre est venu
• Better feature: word ID + context IDs
• “roule” suggests “not a person”

## Limits of square error

• Solving with $$10^6$$ training samples is hard

• The square error is not the best possible measure for classification:
• 1 extreme outlier may dramatically impact accuracy

## Alt. objective: minimum error rate

• How to count the number of prediction errors ?

$1- \delta(y,sgn(f(x)))$

• We want to minimize this loss, but no analytical solution
• Iterative training algorithms

Video perceptron

## Linear models: The Perceptron algorithm

• Finds a hyper plane that separates training data

• Repeat until convergence:
• For $$t=1 \cdots n$$:
• $$\hat y = sgn(f(x))$$
• if $$\hat y \neq y$$ then $$\theta = \theta + yX$$
• Algorithm converges if training data separable (Novikoff,1962)

## Limits of classification error loss

• Multiple solutions:

• Which solution is better (wrt generalization) ?

## Margin

$$w$$ is normal to the hyper-plane

distance $$r$$ of any $$x$$: $w^T \left(x-yr\frac w {||w||} \right) + b=0$

## Margin

• Solve for $$r$$: $r=\frac{y(w^Tx+b)}{||w||}$

## Margin

• We choose a scaling of $$w$$ and $$b$$ s.t. $y_i(w^Tx_i+b)\geq 1$
• with at least one equality, margin= $\rho = \frac 2 {||w||}$

## Linear models: Maximum Margin linear classifier

• Find $$\theta$$ that minimizes $$\frac 1 2 || \theta ||$$ s.t. $y_i X_i\theta \geq 1 ~~~ \forall i$

• Gives a unique hyperplane as far as possible from training examples

• The solution only depends on a subset of training examples that appear on the margin = the support vectors

## Linear models: Maximum Margin linear classifier

Limits:

• if training data not separable, the hyperplane does not exist

• even a single example can radically change the position of the hyperplane

## Linear models: SVM

• Transform your input data with $$F(a)$$ into a higher dimensional space so that it is linearly separable
• Kernel trick: Don’t explicit this transformation $$F(a)$$ but use a kernel s.t. $$K(a,b)=F(a).F(b)$$

## Wrap-up

• Informal presentation of:
• Components of ML systems:
• Data encoding into features
• Choice of parametric model
• Training methods

## Wrap-up

• Data encoding into features
• continuous vs. symbolic evidence
• one-hot vector

## Wrap-up

• Choice of parametric model
• linear model applied to
• linear regression
• linear classification

## Wrap-up

• 3 objective functions:
• Square error
• Classification error
• Maximum margin
• 2 training methods:
• Exact solution for lin. reg. with SE
• Perceptron algo for classif. with CE (http://computing.dcu.ie/~humphrys/Notes/Neural/single.neural.html)
• Limitations

# ML concepts

## Loss

• Loss measures how much does it cost you when the model is wrong for one example

$L(x,y)=(y-f(x))^2$

• L2 loss
• severly penalizes outliers (large values)
• slower convergence

## Classification loss

$L(x,y) = \begin{bmatrix}1 & y\neq f(x)\\ 0 & y=f(x)\end{bmatrix}$

• also known as 0-1 loss
• When you sum over all examples, gives the classification error
• Not differentiable, not convex

## Hinge loss

• Hinge loss: $$L = \max(0,1-yf(x))$$
• tight convex upper bound of 0-1 loss
• not differentiable at $$yf(x)=1$$, but admits a subgradient
• used in SVM to find a “corridor” of maximum width that separates data

## Logistic loss

• Logistic loss: $\frac 1 {\ln 2} \ln(1+e^{-yf(x)})$

• differentiable everywhere
• but more sensitive to outliers

## Risk

• Classifier risk = expected loss over all possible corpora

$R(f)=\int p(X,Y)L(X,f(X)=Y)dXdY$

• It’s our real objective, as we want to optimize on test (unknown) samples !

## Bayes Risk

• Minimum risk when considering all possible classifiers: $$R^* = \min_f R(f)$$

• Bayes classifier: $$f^* = \arg\min_f R(f)$$

• Minimizing over $$f$$ is good after marginalizing over all corpora
• It’s dramatic when done before !

## Risks: Empirical risk

• Empirical Risk: $R(f) = \sum_i L(y_i,f(x_i))$

• Law of large number: converges to the true risk
• But empirical risk makes overfitting possible !

## Generative vs. discriminative

• Generative:
• Model of the observations
• $$P(X|Y)$$
• Discriminative:
• Model of the boundary
• $$P(Y|X)$$

## Generative model

• Example:
• X = pressure (scalar)
• Y = will it rain ? (0 or 1)

## Generative model

• Gaussian model for $$P(X|Y)$$:
• Pick a centroid $$\mu_0$$ and std dev $$\sigma_0$$ without rain
• Pick a centroid $$\mu_1$$ and std dev $$\sigma_1$$ when it rains
• Likelihood to observe this atm pressure when it rains:

$P(X|Y=1)=\frac 1 {\sqrt{2\pi}\sigma_1}e^{-\frac 1 2 \left( \frac{X-\mu_1} 2 \right)^2}$

## Unsupervised learning

• Assume we don’t know any label $$Y$$.
• All we can do is fit a model on $$X$$:

$\hat\theta = \arg\max_\theta \prod_t P(X_t|\theta)$

$P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(X|Y=y,\theta)P(Y=y)$

## Unsup. training of generative model

• Assume $$P(Y)$$ uniform: we only need to maximize $\prod_t \sum_i \frac 1 {\sqrt{2\pi}\sigma_i}e^{-\frac 1 2 \left( \frac{X_t-\mu_i} 2 \right)^2}$

• When we change $$\theta$$, this likelihood change
• So there exists an optimum that can be found for $$\theta$$

## Discriminative model

• Linear model: $\hat y = w x + b$

• We want a probability: logistic regression: $P(Y=1|X) = \frac 1 {1+e^{-(wx+b)}}$

## Unsup training of discriminative model

$P(X|\theta)=\sum_y P(X,Y=y|\theta) = \sum_y P(Y=y|X,\theta)P(X)$

• Getting $$P(X)$$ out of the sum: $P(X|\theta)=P(X)$

• The observation likelihood does not depend on the parameters !

So it is impossible to maximize it…

• The prediction error can be decomposed into 3 components:
• error due to “bias”
• error due to “variance”
• irreducible error

• Bias:
• avg diff btw prediction and true value (if you repeat training+test many times)
• results from simplifying assumptions about the form of the target
• Low bias: DT, kNN, SVM
• High bias: linear, logistic regression

• Variance:
• variability of output for a given input (if you repeat…)
• sensitivity of the model to changes in training data
• Low var: linear, logistic regression
• High var: DT, kNN, SVM

• Irreducible error:
• Intrinsic ambiguity amongst classes
• Ex: given the player ID, predict the output of throwing a dice ?

Depends on both classifier + learning method

• Let $$Z$$ be a random variable with distrib $$P$$

• Let $$\bar{Z} = E_P[Z]$$ be the average value of $$Z$$

Show that $E\left[ (Z-\bar{Z})^2 \right] = E[Z^2] - \bar{Z}^2$

• Let $$f$$ be a true regression function $$y=f(x)+\epsilon$$
• with $$\epsilon \sim N(0;\sigma^2)$$
• We fit an hypothesis $$h(x) = w\cdot x +b$$ with MSE

• Let be a new data point $$y = f(x) + \epsilon$$:
• What is the expected error ? $E\left[ (y-h(x))^2 \right]$

!! Expected over the training corpora of $$h$$: $$x$$ is constant !!

$$E[(h(x)-y)^2] = E[h^2(x)] -2E[h(x)]E[y] + E[y^2]$$

Apply the previous lemma to both square terms:

$$E[(h(x)-\bar{h(x)})^2] + \bar{h(x)}^2 -2\bar{h(x)}f(x) + E[(y-f(x))^2] + f(x)^2$$

Factorize into Variance + Bias + intrinsic error:

$$E[(h(x)-\bar{h(x)})^2] + (\bar{h(x)} - f(x))^2 + E[(y-f(x))^2]$$

• (Wrong) intuition: it’s more important to reduce bias
• Yes, on the average in the long run, low-bias models may perform well
• But long run avgs are irrelevant in practice !

• Bagging helps to reduce variance:
• resample corpus $$\rightarrow$$ train multiple models
• average predictions
• Ex: Random forests

• Large bias: underfitting
• Increase nb of parameters
• Large variance: overfitting
• More regularization

## Overfitting: Under/over-fitting

• Model “complexity” is related to the nb of parameters, but not only !
• Notion of “capacity” in deep neural networks

How to find this optimum ?

• There are theoretical measures (see Aikake’s Information Criteria)
• But much better to use cross-validation

## Wrapup

• Concepts of:
• Loss, risk
• generative vs. discriminative
• supervised vs. unsupervised

## Evaluation: holdout split

• Our objective is to predict new (never seen) samples
• So we never evaluate on the training set
• Best approach: holdout data split

## Evaluation: holdout split

• Main issues:
• Penalizing for small datasets
• Experiments must never be run several times on the test set !! $$\rightarrow$$ contamination

## Evaluation: Cross-validation

• How many folds ?
• Few: biased error estimates with low variance
• Many (leave-one-out): unbiased error estimate but high variance
• Practical: 5-folds or 10-folds

## Evaluation: Parameters vs. hyper-parameters

• Parameters: trained when optimizing the loss
• Hyper-parameters: not trained with the loss
• model topology
• nb of iterations (epochs)
• learning rate
• regularization strength…

Hyper-parameters have to be tuned on a development set.

## Evaluation: Hyper-parameter tuning

• Best option: further split the training set on 2 subsets: training + development

• Alternative: nested cross-validation

For each test fold i:
- For each possible values of hyper-parms:
For each dev fold j != i:
- train on remaining folds
- evaluate on dev fold j
Average results on all dev folds
- Pick the best hyper-parms
- Retrain model on all folds but i
- Evaluate on test fold
Average results over test folds

## Evaluation: Hyper-parameter tuning

For each test fold i:
- For each possible values of hyper-parms:
For each dev fold j != i:
- train on remaining folds
- evaluate on dev fold j
Average results on all dev folds
- Pick the best hyper-parms
- Retrain model on all folds but i
- Evaluate on test fold
Average results over test folds
• Issues:
• Too slow !

## Evaluation: Evaluation metrics

• Standard classification metric: accuracy $acc = \frac{N_{OK}} {N_{tot}}$

• Question: on the same test set,
• classifier C1 gives 71% of accuracy
• classifier C2 gives 73% of accuracy
• Is C2 better than C1 ?

## Evaluation: Confidence interval

• Standard statistical hypothesis test:
• Hyp H0: C2 is not better than C1
• Accuracy is a proportion: compute 95%-Wald confidence interval: $p \pm 1.96 \sqrt{\frac {p(1-p)} n}$

with $$n=$$ nb of samples in the test corpus.

## Evaluation: Confidence interval

• Example with 1000 examples in test:
• acc(C1) $$\in \{68.2\%; 73.8\%\}$$
• acc(C2) $$\in \{70.2\%; 75.7\%\}$$

• So we can not conclude that C2 is better than C1

## Evaluation: Detection

• Assume one class is very frequent (“Is there a cat in the picture ?”)
• Cast as a detection problem

• Assume next 2 classes:
• positive (class of interest)
• negative (very frequent)
• Evaluated with:
• precision: is every cat detected a real cat ?
• recall: have we found every cat ?

## Evaluation: Detection

• True positive (TP): $$f(x)=+$$ and $$y=+$$
• False positive (FP): $$f(x)=+$$ but $$y=-$$
• True negative (TN): $$f(x)=-$$ and $$y=-$$
• False negative (FN): $$f(x)=-$$ but $$y=+$$

## Evaluation: Detection

Precision: $p = \frac {TP}{TP+FP}$

Recall: $r = \frac {TP}{TP+FN}$

Accuracy: $a = \frac {TP+TN}{TP+TN+FP+TN}$

## Evaluation: detection threshold

• We may want to be sure that every positive detected is really positive

• pos. iff $$f(x)>\tau$$

## Evaluation: detection threshold

• Precision-recall curve:

## Evaluation: ROC curve

• True positive rate = recall
• False positive rate = $$\frac{TN}{TN+FP}$$

## References

• https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf
• http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf

## Going further

• How to define good features ?
• SVM
• Maximum entropy models (log-linear models)
• Statistical tests: Chi2, McNemard
• Curse of dimensionality: running time, overfitting, nb of samples required
• kNN
• MBL
• clustering

## K-Means clustering

• Clustering = unsupervised

• K-Means minimizes the Square Loss

• Model composed of
• $$K$$ centroids $$c_i$$

## K-Means clustering

• Training:
• Randomly select $$K$$ initial centroids
• Compute distance $$d(x_i,c_j)$$
• Assign each $$x_i$$ to its closest $$c_j$$
• Re-compute each centroid from its assigned samples: $c_j = \frac 1 N_j \sum_{i\in j} x_i$
• Stop when nothing change

## K-Means

• Very common clustering

• But:
• Does not handle categorical observations
• Sensitive to outliers
• Fails on non-linear dataset

## K-nearest neighbors

• May be used for classification or regression
• Instance of MBL (or Instance Based Learning)

• Training:
• choose hyper-parameter $$K$$ manually
• store all training samples $$(x_i,y_i)$$

## K-nearest neighbors

• Classification of $$x$$:
• compute $$d(x,x_i)$$ for all $$i$$
• pick the $$K$$ closest training samples $$x_i$$
• assign $$y=$$ majority vote
• may use a weighted vote by distance

## kNN

• With infinite dataset, the 2-class kNN has an error rate $$\leq 2\times$$ Bayes risk

• Does not scale to large datasets:
• Approximate nearest neighbors search
• Does not work in high-dimensions: all distances are similar
• Preprocessing to reduce dimensions

## 1-NN

• Very effective given its simplicity

• Decision boundary: Voronoi diagrams

## DT learning

• Several algorithms for training: CART, ID3, C4.5…