Paper of the week - Transformer-XL
Transformers have been very successful in Natural Language Processing recently, in part because they train far more efficiently than recurrent models: they replace recurrence with self-attention, which can be computed in parallel over a whole sequence. The drawback is that self-attention only sees a fixed-length context window. This new paper extends that context considerably by reintroducing a recurrence across segments: hidden states from the previous segment are cached and reused as extra context, but no gradient is back-propagated through them. Because absolute positions become ambiguous once states are reused across segments, the authors also switch to a relative positional encoding. Their results on Language Modeling tasks are impressive.
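To make the recurrence idea concrete, here is a minimal sketch in plain Python of one attention layer that processes a text segment by segment while keeping a cache of earlier states. It is only an illustration under simplifying assumptions, not the paper's implementation: the names attend_with_memory, process_segments and mem_len are made up for this example, the layer has a single head with identity projections, and the relative positional encoding is left out entirely. Since plain Python has no gradients, the "no back-propagation" part appears only as a comment marking where a real framework would detach the cache.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(segment, memory):
    """Toy single-head causal attention (identity projections, an assumption).

    Each position of `segment` attends to every cached vector in `memory`
    plus the positions up to and including itself in the current segment.
    """
    context = memory + segment
    dim = len(segment[0])
    outputs = []
    for i, query in enumerate(segment):
        visible = context[: len(memory) + i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(dim)]
        )
    return outputs


def process_segments(segments, mem_len):
    """Run the toy layer over consecutive segments with a cache of size mem_len."""
    memory = []
    outputs = []
    for segment in segments:
        outputs.append(attend_with_memory(segment, memory))
        # Cache this layer's inputs for the next segment. In an autograd
        # framework the cache would be detached here (stop-gradient), so it
        # is reused as context but never back-propagated through.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return outputs


# Two segments of two 2-dimensional vectors each (made-up values).
segments = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, -0.5]],
]
without_memory = process_segments(segments, mem_len=0)
with_memory = process_segments(segments, mem_len=2)
```

With mem_len=0 the first position of the second segment can only attend to itself, which is the fixed-context behaviour of a standard Transformer. With mem_len=2 the same position also mixes in the cached vectors from the first segment, so its effective context extends beyond the segment boundary.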