
2018/2019

# Convolutional networks

## Limits of feed-forward networks: size

Example: does person X appear in a photo?

• small B&W photo: 300x200 = 60k inputs
• e.g., 3 layers (60k; 1000; 1)
• 60M parameters (= connections)!

## Limits of feed-forward networks: no sharing

• Sample 1: the person appears on the top-left of the photo
• Sample 2: the same person appears on the bottom-right of the photo
• Common information between both samples is lost!

## Sharing parameters

• Exploit the structural properties within data
• E.g., for images: invariance to translation
• So we can use the same set of weights at various positions of the image
• $$\rightarrow$$ convolutional networks
• E.g., for speech: Markov property of the signal
• $$\rightarrow$$ recurrent networks
• E.g., for graph data:
• $$\rightarrow$$ graph convolutional networks

## 2D convolution

Kernel size = 3x3

## 2D convolution

• Input image $$I$$
• Kernel $$K=$$ matrix of size $$h \times w$$

$(I * K)_{x,y} = \sum_{i=1}^h \sum_{j=1}^w K_{ij} \cdot I_{x+i-1,y+j-1}$

• Assume the input has $$d$$ channels (red/green/blue), the kernel dimensions are extended accordingly
• A bias and an activation are also typically used:

$conv(I,K)_{x,y} = \sigma\left(b+\sum_{i=1}^h \sum_{j=1}^w \sum_{k=1}^d K_{ijk} \cdot I_{x+i-1,y+j-1,k}\right)$
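To make the formula concrete, here is a minimal sketch (not from the original slides) that applies one kernel over a multi-channel image with explicit loops, using tanh as the activation $$\sigma$$; in practice one would use an optimized routine such as `nn.Conv2d`.

```python
import torch

def conv2d(I, K, b=0.0):
    """Direct implementation of the formula above: one kernel over d channels,
    a bias b and a tanh activation."""
    H, W, d = I.shape            # input height, width, channels
    h, w, _ = K.shape            # kernel height, width, channels
    out = torch.zeros(H - h + 1, W - w + 1)
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = torch.tanh(b + (K * I[x:x + h, y:y + w, :]).sum())
    return out

I = torch.randn(200, 300, 3)     # a small RGB image (height, width, channels)
K = torch.randn(3, 3, 3)         # 3x3 kernel over the 3 channels
print(conv2d(I, K).shape)        # torch.Size([198, 298])
```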

## Stride / Shift

From left to right: stride=1, stride=2, receptive field=3

Kernel size 3x3, stride 2, padding 1

• Apply 5 convolutions $$\rightarrow$$ 5-dim vector:
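As a hedged illustration of this setting (the channel counts are made up), the output size obeys out = ⌊(in + 2·padding − kernel)/stride⌋ + 1; applying 5 convolution kernels yields a 5-dimensional vector at each output position:

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 2, padding 1, and 5 kernels -> 5 output channels
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 200, 300)          # (batch, channels, height, width)
y = conv(x)
# out = floor((in + 2*padding - kernel) / stride) + 1 = floor((200+2-3)/2)+1 = 100
print(y.shape)                           # torch.Size([1, 5, 100, 150])
```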

## Pooling

• With appropriate kernel size, padding and stride 1, the resulting “blurred image” has the same size as the input image

• We want to compress the information and reduce the layer sizes progressively down to the final layer:
• Option 1: remove padding, increase stride

## Pooling

• Option 2: downsampling = pooling

## Pooling

Max pooling = pick the max value within a filter size
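A minimal PyTorch sketch of max pooling (the tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # the usual 2x2 pooling with stride 2
x = torch.randn(1, 5, 100, 150)                # (batch, channels, height, width)
print(pool(x).shape)                           # torch.Size([1, 5, 50, 75])
```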

## 1x1 convolution

• reduce dims: 200x200x50 $$\rightarrow$$ 200x200x$$N_f$$
• add another non-linearity, with few parameters (deeper net)
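A small sketch of such a channel reduction, with a hypothetical $$N_f=16$$:

```python
import torch
import torch.nn as nn

# Reduce 50 channels to N_f = 16, adding one extra non-linearity with few parameters
conv1x1 = nn.Conv2d(in_channels=50, out_channels=16, kernel_size=1)
x = torch.randn(1, 50, 200, 200)
y = torch.relu(conv1x1(x))
print(y.shape)                                       # torch.Size([1, 16, 200, 200])
print(sum(p.numel() for p in conv1x1.parameters()))  # 50*16 + 16 = 816 parameters
```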

## Dilated convolution

• skip 0, or 1, or 3 pixels
• increase receptive field
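A quick sketch showing how dilation enlarges the receptive field (dilations 1, 2, 4 skip 0, 1, 3 pixels between kernel taps):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
for dilation in (1, 2, 4):          # skip 0, 1 or 3 pixels between kernel taps
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=dilation)
    # effective size of the 3x3 kernel becomes 3, 5, 9
    print(dilation, conv(x).shape)  # output shrinks as the receptive field grows
```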

## ConvNet architectures

INPUT -> [[CONV ->  RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
• Prefer a stack of small filter CONV to one large receptive field CONV layer
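A minimal sketch of this pattern with N=2, M=2, K=1 and made-up layer sizes (not a specific architecture from the slides):

```python
import torch
import torch.nn as nn

# [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU] -> FC
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
print(net(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 10])
```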

## Hyper-parameter choice

• conv layer use small filters (3x3, at most 5x5)

• pool layer: usually 2x2 with stride 2

## LeNet

• Yann Lecun 1990

## AlexNet

• Alex Krizhevsky, Ilya Sutskever, Geoff Hinton, 2012

## GoogLeNet

• Szegedy et al., 2014

Inception module:

• Let the model choose:
• compute multiple transformations: 1x1-conv, 3x3-conv, 5x5-conv, 3x3-pool
• Reduce dimensionality:
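A rough sketch of the parallel branches (the real Inception module also inserts 1x1 convolutions before the 3x3 and 5x5 branches to reduce dimensionality; the channel counts here are invented):

```python
import torch
import torch.nn as nn

class TinyInception(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and a 3x3 pooling branch,
    concatenated along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

block = TinyInception(32)
print(block(torch.randn(1, 32, 28, 28)).shape)   # torch.Size([1, 80, 28, 28])
```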

## VGGNet

• Karen Simonyan, Andrew Zisserman, 2014

• Classical architecture but much deeper

## ResNet

• Kaiming He et al., 2015

## DenseNet

• G. Huang et al., 2016

• each layer directly connected to all following layers
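A rough sketch (not the official DenseNet code) of this dense connectivity, implemented with channel-wise concatenation:

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Each 3x3 convolution sees the concatenation of all previous feature maps."""
    def __init__(self, in_ch, growth, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
             for i in range(n_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)              # passed on to every following layer
        return torch.cat(features, dim=1)

block = TinyDenseBlock(in_ch=16, growth=12, n_layers=3)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 52, 32, 32])
```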

## Model Zoo in Natural Language Processing

• Use 1D-convolutions instead of 2D-convolutions
• shift the kernel along the time axis only
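A minimal sketch of a 1D convolution over a sentence (the embedding size and lengths are arbitrary):

```python
import torch
import torch.nn as nn

# 100 input channels = embedding size; the kernel slides along the time axis only
conv = nn.Conv1d(in_channels=100, out_channels=50, kernel_size=3, padding=1)
x = torch.randn(8, 100, 20)        # (batch, embedding dim, sentence length)
print(conv(x).shape)               # torch.Size([8, 50, 20])
```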

## Collobert et al., 2011

• Lookup table = Word Embeddings
• assign a single vector $$\vec w \in R^{100}$$ to every word in the vocabulary
• e.g. $$\vec w_{le}$$, $$\vec w_{chat}$$, $$\vec w_{chien}$$
• whenever word $$w$$ occurs in the input, this vector $$\vec w$$ is inserted as input
• e.g. “le chat le chien” $$\rightarrow$$ $$(\vec w_{le}, \vec w_{chat}, \vec w_{le}, \vec w_{chien})$$
• the real values in these vectors are trained by back-propagation
• The following convolutions use 100 input channels
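A small sketch of such a lookup table with a toy vocabulary (the word indices are of course arbitrary):

```python
import torch
import torch.nn as nn

vocab = {"le": 0, "chat": 1, "chien": 2}
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=100)

# "le chat le chien" -> (w_le, w_chat, w_le, w_chien); the vectors are trained by back-propagation
ids = torch.tensor([[vocab[w] for w in "le chat le chien".split()]])
x = emb(ids)                       # shape (1, 4, 100)
x = x.transpose(1, 2)              # (1, 100, 4): 100 input channels for the following Conv1d
```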

## Wide and deep CNN

Yoon Kim, 2014

Not very deep, but extremely efficient!

## Transfer learning

• ConvNets are reusable amongst multiple tasks
• Nobody retrains a CNN from scratch
• Remove the last layer of an ImageNet-trained ConvNet
• … or freeze (“fix”) the first N layers
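As an illustration (a sketch using torchvision's pre-trained ResNet-18, not prescribed by the slide), the two options look like this:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)     # ConvNet trained on ImageNet

# Option A: replace the last layer for a new task with, say, 20 classes
model.fc = nn.Linear(model.fc.in_features, 20)

# Option B: freeze ("fix") the first layers, train only the new head
for name, p in model.named_parameters():
    if not name.startswith("fc"):
        p.requires_grad = False
```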

## Visualize convnets

• plot hidden weights

e.g., AlexNet, 1st and 2nd layers:


## Other approaches

• plot hidden activations
• occluding parts of the image
• retrieve images that maximally activate a neuron
• DeconvNet

## t-SNE

• embed images into 2D s.t. their (pair-wise) distance in 2D $$\simeq$$ their distance before last layer

## Refs

• Stanford CS231n
• pytorch tutorials: https://github.com/ritchieng/the-incredible-pytorch

## Implementation

MNIST official example

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)                  # flatten the 20 feature maps of size 4x4
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
if args.cuda: model.cuda()

optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda: data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()                # reset gradients before each step
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test():
model.eval()
# ...

• Show how to “move” the data and model onto GPU
• Illustrate the vanilla SGD optimizer
• Show the Negative Log-Likelihood loss function (`F.nll_loss`), which does not include the final softmax, so the log-softmax must be applied in the model
• Show a useful method: view(), which may be used to “manipulate” tensor dimensions
• Show how to set the training or test mode of a model: important when using dropout!

# Recurrent neural net (RNN)

## Recurrent neural networks

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

## RNN

• $$x$$ = observation/input; $$o$$ = output
• $$s$$ = hidden neuron = represents the “state” (=memory)

$s_t = f(U x_t + W s_{t-1})$
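A minimal sketch of this recurrence with $$f=\tanh$$ and arbitrary dimensions:

```python
import torch

d_in, d_hid = 10, 20
U = torch.randn(d_hid, d_in)
W = torch.randn(d_hid, d_hid)

xs = [torch.randn(d_in) for _ in range(5)]   # a sequence of 5 observations
s = torch.zeros(d_hid)                       # initial state (the "memory")
for x in xs:
    s = torch.tanh(U @ x + W @ s)            # s_t = f(U x_t + W s_{t-1})
```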

## RNN applications

• RNNs compute representations $$o$$ of the inputs $$x$$ with memory
• Often followed by a classifier layer
• All domains with a “time”-axis: language, speech, finance, DNA…
• speech recognition, translation, chatbots…

## RNN training

Back-Propagation Through Time:

## Back-Propagation Through Time

• Key difference with standard backprop: the gradients are summed over all time steps

Example: gradient of the error at step 3 with respect to $$W$$: $\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}$

## BPTT

• $$\nabla E_3$$ contributes to steps 3, 2, 1 and 0: $\frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}$
• All steps share the same $$W$$: $\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \cdots$

• Even in MLP: $l_1 = U \cdot X$ $h_1 = \text{tanh}(l_1)$ $l_2 = V \cdot h_1$ $y = \text{tanh}(l_2)$ $\frac {\partial E_y}{\partial l_1} = \frac{\partial E_y}{\partial y}\frac{\partial y}{\partial l_2}\frac{\partial l_2}{\partial h_1}\frac{\partial h_1}{\partial l_1}$

Derivative of the $$tanh$$ activation:

• The gradient of the activation is often close to 0
• When many such gradients are multiplied, the product becomes smaller and smaller, eventually negligible
• After some time steps, no more training is possible

• Exploding gradients are also possible, but can be solved with clipping
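For instance, a generic training step with gradient clipping (the model and loss here are dummies; only the clipping call matters):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.1)

x = torch.randn(4, 30, 10)                 # (batch, long sequence, features)
out, _ = rnn(x)
loss = out.pow(2).mean()                   # dummy loss, just to get gradients

optimizer.zero_grad()
loss.backward()                            # BPTT
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=5.0)   # clip exploding gradients
optimizer.step()
```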

Remedies for vanishing gradients:

• Careful initialization of the $$W$$ matrices
• Careful tuning of regularization
• Use ReLU activations instead of sigmoid/tanh
• Use LSTM or GRU cells instead of simple neurons

## LSTM

• Computes the new state $$s_{t+1}$$ from the previous state and input using 3 gates:
• input gate: how much of the input-derived state we let through
• forget gate: how much of the previous state we let through
• output gate: how much of the overall state will be passed out
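In PyTorch the three gates are handled internally by `nn.LSTM`; a minimal usage sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(4, 15, 10)           # (batch, time, features)
out, (h_n, c_n) = lstm(x)            # input/forget/output gates are computed internally
print(out.shape, h_n.shape)          # torch.Size([4, 15, 20]) torch.Size([1, 4, 20])
```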

## RNN extensions

• Bidirectional
• Deep (bidirectional) RNNs
• LSTM/GRUs
• Model zoo based on RNNs:
• seq2seq models
• attention networks
• key-value memory networks
• neural turing machine

# Design patterns

Cho et al., 2014

## Seq2Seq

• Seq2Seq is also an Encoder-Decoder

| model | in/out |
| --- | --- |
| seq | $$x_1,\cdots,x_t \rightarrow x_{t+1}$$ |
| seq2seq | $$x_1,\cdots,x_t \rightarrow x_{t+1}, x_{t+2}, \cdots$$ |

## Attention models

Given the text, predict the topic:

The Republicans started a media offensive.
This is the first time the NBA organized a game in this city.
• Focus on the most important words: compute $$\alpha_t =P(i_t|w_*,c)$$
• $$c=$$ context vector, $$i_t=$$ importance of word $$t$$
• The probabilities are obtained with a softmax over the scores $$i_t=w_t \cdot c$$
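A small sketch of this scoring with random word vectors and context (purely illustrative):

```python
import torch
import torch.nn.functional as F

d = 50
words = torch.randn(7, d)            # one vector w_t per word of the sentence
c = torch.randn(d)                   # context vector

scores = words @ c                   # importance i_t = w_t . c, shape (7,)
alpha = F.softmax(scores, dim=0)     # attention weights, sum to 1
summary = alpha @ words              # weighted sum fed to the classifier
```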

## Attention models

• Seq2seq attention
• bi-LSTM encoder
• the weights $$\alpha_t$$ vary at every decoding timestep

## Attention

• Attention can be used without recurrence (greatly reduces cost!)
• e.g.:
• memory networks: attention on embeddings
• transformers: likewise, attention on embeddings

## A primer on tensors

• Tensor dimensions are noted $$(n,m,p,q)$$
• Element-wise ops:
• division
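A few element-wise operations in PyTorch, for reference:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[10., 20.], [30., 40.]])
a + b      # element-wise addition
a * b      # element-wise (Hadamard) product
a / b      # element-wise division
```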

## A primer on tensors

• dot product
torch.dot(torch.tensor([4,2]), torch.tensor([3,1])) = tensor(14)
• matrix-vector product
torch.mv(a,x)
• matrix-matrix product
torch.mm(a,b)

## A primer on tensors

• batch matrix-matrix product
torch.matmul(a,b)
• first dimensions = batch dimensions
• ex: $$(j,1,n,m) \times (k,m,p) \rightarrow (j,k,n,p)$$

• The batch dimensions must be broadcastable: compare dims from right to left; at each position they must be equal, or one of them is 1, or one of them does not exist
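A small example of these broadcasting rules, with shapes chosen to match the example above:

```python
import torch

j, k, n, m, p = 2, 3, 4, 5, 6
a = torch.randn(j, 1, n, m)
b = torch.randn(k, m, p)
c = torch.matmul(a, b)       # batch dims (j,1) and (k,) broadcast to (j,k)
print(c.shape)               # torch.Size([2, 3, 4, 6])
```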

## A primer on tensors

• view
torch.randn(4,3,5).view(12,5)
• (un)squeeze: add/remove dimensions of size 1

• concatenate, split, slice… : https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/
• do the Attention exercise

## Another Attention

• 3 key components (vectors or batches of vectors):
• Query $$Q$$
• Keys $$K$$
• Value $$V$$
• Compare the query with every key with a compatibility measure
• Ex: scaled dot-product attention: $\text{softmax}\left( \frac {Q^T K}{\sqrt{d_k}} \right) V$
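A minimal sketch of this measure with batches of row vectors, so the transpose moves onto $$K$$ (this anticipates the exercise below, with dimensions chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq, d_k); with row vectors the transpose is applied to K
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5   # (batch, seq, seq)
    attn = F.softmax(scores, dim=-1)       # compatibility of each query with every key
    return torch.matmul(attn, V)           # (batch, seq, d_k)

Q = K = V = torch.randn(2, 7, 64)          # self-attention: same sequence fed to Q, K, V
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 7, 64])
```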

## Another Attention

• Exercise: implement this measure as:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, scale):
• We assume Q, K, V have the same shape (batch, seq, vecdim)

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, d):

## Self-attention

• Every word in the sequence is compared to all the other words of the sentence
• Implemented by feeding the same sequences of words $$X$$ to $$Q=K=V$$

## Encoder: feed forward

• 2 layers:
• conv1D w/ kernel=1 (d -> d_hid)
• relu
• conv1D w/ kernel=1 (d_hid -> d)
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d, d_hid, dropout=0.1):

## Encoder layer

• Create an EncoderLayer that stacks together a multi-head attention with a position-wise feed forward
class EncoderLayer(nn.Module):
    def __init__(self, d, d_inner, n_head):

## Positional encoding

• Attention focuses on some words, based on their meaning (embedding)
• The position of words in the sentence is also important, and there is no recurrence to capture it
• We could add the (integer) position to the embedding, but long sentences, hence large positions, are rare in training
• So we use repeating position patterns: sinusoids $PE_{pos,2i} = \sin(pos/10000^{2i/d})$ $PE_{pos,2i+1} = \cos(pos/10000^{2i/d})$
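A possible implementation of this sinusoid table (the helper name matches the one used on the next slide, but the body here is only a sketch):

```python
import numpy as np
import torch

def get_sinusoid_encoding_table(n_pos, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    table = np.array([[pos / np.power(10000, 2 * (i // 2) / d) for i in range(d)]
                      for pos in range(n_pos)])
    table[:, 0::2] = np.sin(table[:, 0::2])   # even dimensions
    table[:, 1::2] = np.cos(table[:, 1::2])   # odd dimensions
    return torch.FloatTensor(table)

print(get_sinusoid_encoding_table(100, 512).shape)   # torch.Size([100, 512])
```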

## Encoder

• Simply adds (sums) word embeddings with position encodings (they have the same size)
• Stack multiple EncoderLayer using:
nn.ModuleList([EncoderLayer(d, d_hid, n_head) for _ in range(n_layers)])
• Create positional Embeddings as follows:
self.pos_enc = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(max_len+1, d), freeze=True)

## Homework

• Implement the decoder similarly, test the whole model