Christophe Cerisara

2018/2019

Example: does person X appear in a photo?

- small B&W photo: 300x200 = 60k inputs
- e.g., 3 layers (60k; 1000; 1)
- 60M parameters (= connections)!

- Sample 1: the person appears in the top-left of the photo
- Sample 2: the same person appears in the bottom-right of the photo
- Common information between both samples is lost!

- Exploit the structural properties within data
- E.g., for images: invariance to translation
- So we can use the same set of weights at various positions of the image
- \(\rightarrow\) convolutional networks

- E.g., for speech: Markov property of the signal
- \(\rightarrow\) recurrent networks

- E.g., for graph data:
- \(\rightarrow\) graph convolutional networks

Kernel size = 3x3

- Input image \(I\)
- Kernel \(K=\) matrix of size \(h \times w\)

\[(I * K)_{x,y} = \sum_{i=1}^h \sum_{j=1}^w K_{ij} \cdot I_{x+i-1,y+j-1}\]

- Assume the input has \(d\) channels (red/green/blue); the kernel dimensions are extended accordingly
- A bias and an activation are also typically used:

\[conv(I,K)_{x,y} = \sigma\left(b+\sum_{i=1}^h \sum_{j=1}^w \sum_{k=1}^d K_{ijk} \cdot I_{x+i-1,y+j-1,k}\right)\]
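A naive, unoptimized transcription of this formula as a sketch (stride 1, no padding; the \(\tanh\) activation is an illustrative choice); in practice one would use `nn.Conv2d`:

```
import torch

def conv(I, K, b, sigma=torch.tanh):
    """Direct transcription of the formula above (stride 1, no padding).
    I: input image (H, W, d); K: kernel (h, w, d); b: scalar bias."""
    H, W, d = I.shape
    h, w, _ = K.shape
    out = torch.empty(H - h + 1, W - w + 1)
    for x in range(H - h + 1):
        for y in range(W - w + 1):
            # sum over the kernel positions and the input channels
            out[x, y] = b + (K * I[x:x+h, y:y+w, :]).sum()
    return sigma(out)
```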

From left to right: stride=1, stride=2, receptive field=3

Kernel size 3x3, stride 2, padding 1

- Apply 5 convolutions \(\rightarrow\) a 5-dim vector at each position:

With appropriate kernel size, padding, and a stride of 1, the resulting “blurred image” has the same size as the input image

- We want to compress the information and progressively reduce the layer sizes down to the final layer:
- Option 1: remove padding, increase stride

- Option 2: downsampling = pooling

Max pooling = pick the max value within each filter window

- 1x1 convolutions reduce dims: 200x200x50 \(\rightarrow\) 200x200x\(N_f\)
- add another non-linearity, with few parameters (deeper net)

Dilated convolutions:

- skip 0, 1, or 3 pixels between kernel elements
- increase the receptive field

`INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC`

- Prefer a stack of small filter CONV to one large receptive field CONV layer

conv layers use small filters (3x3, at most 5x5)

pool layer: usually 2x2 with stride 2
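A minimal sketch instantiating the pattern and guidelines above with `nn.Sequential` (here N=2, M=2, K=1; the channel sizes and the assumed 32x32 RGB input are illustrative):

```
import torch.nn as nn

# INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU]*1 -> FC
net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=2),               # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=2),               # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),   # assuming 32x32 inputs
    nn.Linear(128, 10),
)
```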

- Yann LeCun, 1990

- Alex Krizhevsky, Ilya Sutskever, Geoff Hinton, 2012

- Szegedy et al., 2014

Inception module:

- Let the model choose:
- compute multiple transformations: 1x1-conv, 3x3-conv, 5x5-conv, 3x3-pool

- Reduce dimensionality:
- by adding 1x1-convolutions

Karen Simonyan, Andrew Zisserman, 2014

Classical architecture but much deeper

- Kaiming He et al., 2015

G. Huang et al., 2016

each layer directly connected to all following layers

- Use 1D-convolutions instead of 2D-convolutions
- shift the kernel along the time axis only

- Lookup table = Word Embeddings
- assign a single vector \(\vec w \in R^{100}\) to every word in the voc.
- e.g. \(\vec w_{le}\), \(\vec w_{chat}\), \(\vec w_{chien}\)

- whenever word \(w\) occurs in the input, this vector \(\vec w\) is inserted as input
- e.g. “le chat le chien” \(\rightarrow\) \((\vec w_{le}, \vec w_{chat}, \vec w_{le}, \vec w_{chien})\)

- the real values in these vectors are trained by back-propagation

- The following convolutions use 100 input channels
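A minimal sketch of this lookup table followed by a 1D-convolution over the time axis (the vocabulary size, number of filters, and word indices are illustrative assumptions):

```
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10000, embedding_dim=100)  # lookup table
conv = nn.Conv1d(in_channels=100, out_channels=64, kernel_size=3)
tokens = torch.tensor([[1, 42, 1, 77]])   # "le chat le chien" as word indices
x = emb(tokens).transpose(1, 2)           # (batch, 100, seq): 100 input channels
features = torch.relu(conv(x))            # (batch, 64, seq-2)
```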

Yoon Kim, 2014

Not very deep, but extremely efficient!

- ConvNets are **reusable** amongst multiple tasks: nobody retrains a CNN from scratch
- Remove the last layer of an ImageNet ConvNet…
- … or freeze the first N layers
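For instance, a hedged sketch of this recipe with torchvision (the ResNet-18 backbone and the 10-class target task are illustrative assumptions):

```
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # ImageNet ConvNet
for p in model.parameters():                    # freeze the first layers...
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # ...replace the last layer
# only model.fc.parameters() are now trained on the new task
```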

- plot hidden weights

ex: AlexNet, 1st and 2nd layers:


- plot hidden activations
- occluding parts of the image
- retrieve images that maximally activate a neuron
- visualize gradients
- DeconvNet

- embed images into 2D s.t. their (pair-wise) distance in 2D \(\simeq\) their distance before last layer

- Stanford CS231n
- pytorch tutorials: https://github.com/ritchieng/the-incredible-pytorch

MNIST official example

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)  # flatten the 20 feature maps of size 4x4
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
if args.cuda: model.cuda()
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda: data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)  # NLL loss expects log-probabilities
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test():
    model.eval()
    # ...
```

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

- \(x\) = observation/input; \(o\) = output
- \(s\) = hidden neuron = represents the “state” (=memory)

\[s_t = f(U x_t + W s_{t-1})\]
\[o_t = \text{softmax}(V s_t)\]

- RNNs compute representations \(o\) of the inputs \(x\) with memory
- Often followed by a classifier layer
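A minimal sketch of this recurrence in PyTorch, with \(f = \tanh\) and the output projection \(V\) as in the equations above (all dimensions are illustrative):

```
import torch

n, m, k = 100, 50, 10   # input, state and output sizes (illustrative)
U, W, V = torch.randn(m, n), torch.randn(m, m), torch.randn(k, m)

def rnn(xs, s):
    """xs: list of input vectors x_t; s: initial state s_0."""
    outputs = []
    for x in xs:
        s = torch.tanh(U @ x + W @ s)                # s_t = f(U x_t + W s_{t-1})
        outputs.append(torch.softmax(V @ s, dim=0))  # o_t = softmax(V s_t)
    return outputs, s
```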

- All domains with a “time”-axis: language, speech, finance, DNA…
- speech recognition, translation, chatbots…

Back-Propagation Through Time:

- Key difference with standard backprop: the gradients *sum* at every time step

Example: gradient of the error at output 3 with respect to \(W\): \[\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}\]

- \(\nabla E_3\) contributes to steps 3, 2, 1 and 0: \[\frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}\]
- All steps share the same \(W\): \[\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \cdots \]

- Even in an MLP: \[ l_1 = U \cdot X\] \[ h_1 = \text{tanh}(l_1)\] \[ l_2 = V \cdot h_1\] \[ y = \text{tanh}(l_2)\] \[\frac {\partial E_y}{\partial l_1} = \frac{\partial E_y}{\partial y}\frac{\partial y}{\partial l_2}\frac{\partial l_2}{\partial h_1}\frac{\partial h_1}{\partial l_1}\]

Derivative of the \(tanh\) activation:

- The gradient of the activation is often close to 0
- When many such gradients are multiplied, the gradient becomes smaller and smaller, until it is negligible: after some time steps, no more training is possible

Exploding gradients are also possible, but can be solved with *clipping*

- Careful initialization of the \(W\) matrices
- Careful tuning of regularization
- Use ReLU activations instead of sigmoid/tanh
- Use LSTM or GRU cells instead of simple neurons

- Computes the new state \(s_{t+1}\) from the previous state and input using 3 gates:
- input gate: how much of the input-derived state we let through
- forget gate: how much of the previous state we let through
- output gate: how much of the overall state will be passed out
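Both cell types are available off the shelf in PyTorch; a minimal usage sketch (all dimensions are illustrative):

```
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=50, batch_first=True)
x = torch.randn(8, 20, 100)   # (batch, seq, features)
out, (h_n, c_n) = lstm(x)     # out: (8, 20, 50); h_n, c_n: final hidden/cell states
gru = nn.GRU(input_size=100, hidden_size=50, batch_first=True)
```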

- Bidirectional
- Deep (bidirectional) RNNs
- LSTM/GRUs
- Model zoo based on RNNs:
- seq2seq models
- attention networks
- key-value memory networks
- neural turing machine
- …

Cho et al., 2014

- Seq2Seq is also an Encoder-Decoder

| model   | in/out |
|---------|--------|
| seq     | \(x_1,\cdots,x_t \rightarrow x_{t+1}\) |
| seq2seq | \(x_1,\cdots,x_t \rightarrow x_{t+1}, x_{t+2}, \cdots\) |

Given the text, predict the topic:

```
The Republicans started a media offensive.
This is the first time the NBA organized a game in this city.
```

- Focus on the most important words: compute \(\alpha_t = P(i_t|w_*,c)\)
- \(c =\) context vector, \(i_t =\) importance of word \(t\)
- The probabilities are obtained with a softmax over the scores \(i_t = w_t \cdot c\)

- Seq2seq attention
- bi-LSTM encoder
- the \(\alpha_t\) vary at every decoding timestep

- Attention can be used without recurrence (greatly reducing the cost!)
- e.g.:
- memory networks: attention on embeddings
- transformers: idem

- Tensor dimensions are noted \((n,m,p,q)\)
- Element-wise ops:
- addition, subtraction
- Hadamard product: torch.mul()
- division

- dot product

`torch.dot(torch.tensor([4,2]), torch.tensor([3,1])) = tensor(14)`

- matrix-vector product

`torch.mv(a,x)`

- matrix-matrix product

`torch.mm(a,b)`

- batch matrix-matrix product

`torch.matmul(a,b)`

- first dimensions = *batch* dimensions, e.g. \((j,1,n,m) \times (k,m,p) \rightarrow (j,k,n,p)\)
- batch dimensions must be *broadcastable*: compare dims from right to left; they must be equal, or one of them is 1, or one of them does not exist
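A short sketch of these broadcasting rules (the shapes are illustrative):

```
import torch

a = torch.randn(5, 1, 4, 3)   # batch dims (5, 1), matrices of size 4x3
b = torch.randn(2, 3, 6)      # batch dim  (2,),  matrices of size 3x6
c = torch.matmul(a, b)        # batch dims broadcast: (5,1) vs (2,) -> (5,2)
print(c.shape)                # torch.Size([5, 2, 4, 6])
```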

- view

`torch.randn(4,3,5).view(12,5)`

(un)squeeze: add/remove dimensions of size 1

- concatenate, split, slice…: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/
- do the exercise

*Attention*

- 3 key components (vectors or batches of vectors):
- Query \(Q\)
- Keys \(K\)
- Value \(V\)

- Compare the query with every key with a **compatibility measure**
- Ex: scaled dot-product attention: \[\text{softmax}\left( \frac {Q^T K}{\sqrt{d_k}} \right) V\]

- Exercise: implement this measure as (a possible solution sketch is given after the skeleton):

```
class ScaledDotProductAttention(nn.Module):
def __init__(self, scale):
```
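One possible solution sketch, assuming \(Q,K,V\) are batches of shape (batch, seq, \(d_k\)) and that `scale` \(= \sqrt{d_k}\):

```
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, scale):
        super().__init__()
        self.scale = scale
    def forward(self, q, k, v):
        # compatibility of every query with every key: (batch, seq, seq)
        scores = torch.matmul(q, k.transpose(1, 2)) / self.scale
        att = torch.softmax(scores, dim=-1)
        return torch.matmul(att, v)   # weighted sum of the values
```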

- We assume \(Q,K,V\) have the same sizes: (batch, seq, vecdim)

```
class MultiHeadAttention(nn.Module):
def __init__(self, n_head, d):
```
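One possible solution sketch, assuming \(d\) is divisible by n_head and reusing scaled dot-product attention inside each head (the linear projections are standard but not specified in the skeleton):

```
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, d):
        super().__init__()
        self.n_head, self.d_k = n_head, d // n_head
        self.wq = nn.Linear(d, d)   # query projection
        self.wk = nn.Linear(d, d)   # key projection
        self.wv = nn.Linear(d, d)   # value projection
        self.fc = nn.Linear(d, d)   # merges the heads back to dim d
    def forward(self, q, k, v):
        b, t, d = q.shape
        # project, then split dim d into n_head subspaces: (b, n_head, t, d_k)
        def split(x):
            return x.view(b, -1, self.n_head, self.d_k).transpose(1, 2)
        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        # scaled dot-product attention in each head
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(b, t, d)
        return self.fc(out)
```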

- Every word in the sequence is compared to all other words of the sentence
- Implemented by feeding the same sequence of words \(X\) as \(Q=K=V\)

- 2 layers:
- conv1D w/ kernel=1 (d -> d_hid)
- relu
- conv1D w/ kernel=1 (d_hid -> d)

```
class PositionwiseFeedForward(nn.Module):
def __init__(self, d, d_hid, dropout=0.1):
```
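One possible solution sketch following these specifications; the dropout placement is an assumption:

```
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d, d_hid, dropout=0.1):
        super().__init__()
        self.w1 = nn.Conv1d(d, d_hid, kernel_size=1)   # d -> d_hid
        self.w2 = nn.Conv1d(d_hid, d, kernel_size=1)   # d_hid -> d
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        # x: (batch, seq, d); Conv1d expects (batch, channels, seq)
        out = self.w2(torch.relu(self.w1(x.transpose(1, 2))))
        return self.dropout(out.transpose(1, 2))
```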

- Create an EncoderLayer that stacks together a multi-head attention with a position-wise feed forward

```
class EncoderLayer(nn.Module):
def __init__(self, d, d_inner, n_head):
```
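One possible sketch, reusing the two modules above; the residual connections and LayerNorm are assumptions borrowed from the standard Transformer:

```
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d, d_inner, n_head):
        super().__init__()
        self.attn = MultiHeadAttention(n_head, d)
        self.ffn = PositionwiseFeedForward(d, d_inner)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x))   # self-attention: Q = K = V = x
        return self.norm2(x + self.ffn(x))       # position-wise feed forward
```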

- Attention focuses on some words, based on their *meaning* (embedding)
- The position of words in the sentence is also important, and there is no recurrence to capture it
- We may add the (integer) position to the embedding, but long sentences (hence large positions) are rare in training data
- So we use repeating position patterns: sinusoids \[PE_{(pos,2i)} = \sin(pos/10000^{2i/d})\] \[PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})\]

- Simply adds (sums) word embeddings with position encodings (they have the same size)
- Stack multiple EncoderLayer using:

`nn.ModuleList([EncoderLayer(d, d_hid, n_head) for _ in range(n_layers)])`

- Create positional Embeddings as follows:

`self.pos_enc = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(max_len+1, d), freeze=True)`
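A possible sketch of the `get_sinusoid_encoding_table` helper assumed above, directly implementing the two sinusoid formulas:

```
import numpy as np
import torch

def get_sinusoid_encoding_table(n_pos, d):
    # angle(pos, i) = pos / 10000^(2*(i//2)/d)
    table = np.array([[pos / np.power(10000, 2 * (i // 2) / d) for i in range(d)]
                      for pos in range(n_pos)])
    table[:, 0::2] = np.sin(table[:, 0::2])   # even dimensions: sin
    table[:, 1::2] = np.cos(table[:, 1::2])   # odd dimensions: cos
    return torch.FloatTensor(table)
```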

- Implement the decoder similarly, test the whole model

## Comments on the implementation

- Negative Log-Likelihood: a loss function that does not include the final softmax, so the (log-)softmax must be defined in the model
- view(): may be used to “manipulate” tensor dimensions
- training or test mode of a model: important when using dropout!