We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism.
The proposed sparse attention can handle sequences of length up to 8x what was previously possible using similar hardware.
As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
Transformer-based models, led by BERT, have shown strong performance across NLP, but the full attention mechanism at the heart of the Transformer incurs $\small O(n^2)$ complexity in the sequence length, so the sequence lengths these models can handle are limited. To address this, the paper introduces BIGBIRD, a model that reduces the $\small O(n^2)$ cost to $\small O(n)$ via a sparse attention mechanism.
By showing that BIGBIRD is Turing complete, the authors demonstrate that it preserves the key properties of the full attention model and can serve as a general-purpose alternative to it.
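To make the complexity reduction concrete, here is a minimal NumPy sketch of a BigBird-style sparse attention mask built from the three components the paper combines (global tokens such as CLS, a sliding window, and random connections). The sizes `num_global`, `window`, and `num_random` below are illustrative choices, not the paper's exact block-sparse configuration.

```python
import numpy as np

def sparse_attention_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Build a BigBird-style boolean attention mask.

    mask[i, j] == True means token i may attend to token j. Three patterns
    are combined: global tokens (e.g. CLS) that attend to and are attended
    by everything, a sliding window of `window` neighbours on each side,
    and `num_random` randomly chosen keys per query.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Global tokens: the first `num_global` positions see and are seen by all tokens.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Sliding window: each token attends to its local neighbourhood.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Random connections: each token attends to a few random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

mask = sparse_attention_mask(seq_len=64)
# Each row has a fixed budget of entries (global + window + random), so the total
# number of attended pairs grows linearly with seq_len instead of quadratically.
print(mask.sum(), "attended pairs vs", 64 * 64, "for full attention")
```

Because the per-token budget is constant, the number of nonzero attention scores grows as $\small O(n)$ rather than $\small O(n^2)$, which is the source of BIGBIRD's memory savings.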
Models based on Transformers, such as BERT, are wildly successful for a wide variety of Natural Language Processing (NLP) tasks and consequently are a mainstay of modern NLP research. Their versatility and robustness are the primary drivers behind the wide-scale adoption of Transformers.
The model is easily adapted to a diverse range of sequence-based tasks – as a seq2seq model for translation, summarization, generation, etc., or as a standalone encoder for sentiment analysis, POS tagging, machine reading comprehension, etc. – and it is known to vastly outperform previous sequence models like LSTMs. The key innovation in Transformers is the introduction of a self-attention mechanism, which can be evaluated in parallel for each token of the input sequence, eliminating the sequential dependency present in recurrent neural networks like LSTMs.
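As a rough illustration of that parallelism, the single-head scaled dot-product self-attention below (a standard textbook formulation, not code from the paper; the weight names `Wq`, `Wk`, `Wv` are my own) scores every token pair with one matrix product, so all positions are processed simultaneously rather than step by step as in an LSTM.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over X of shape (n, d).

    Every query attends to every key in one batched matrix multiplication,
    so there is no sequential dependency across positions.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d) contextualised tokens

n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 16)
```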
This parallelism enables Transformers to leverage the full power of modern SIMD hardware accelerators like GPUs/TPUs, thereby facilitating training of NLP models on datasets of unprecedented size. This ability to train on large-scale data has led to the emergence of models like BERT and T5, which pretrain Transformers on large general-purpose corpora and transfer the knowledge to downstream tasks. The pretraining has led to significant improvements on low-data-regime downstream tasks as well as on tasks with sufficient data, and has thus been a major force behind the ubiquity of Transformers in contemporary NLP.
The self-attention mechanism overcomes the constraints of RNNs (namely their sequential nature) by allowing each token in the input sequence to attend independently to every other token in the sequence. This design choice has several interesting repercussions. In particular, full self-attention has a computational and memory requirement that is quadratic in the sequence length. We note that while the corpus can be large, the sequence length, which provides the context in many applications, is very limited. With commonly available hardware and model sizes, this requirement translates to being able to handle input sequences of roughly 512 tokens.
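To put that quadratic growth in perspective, the back-of-envelope calculation below counts only the storage for a single $\small n \times n$ attention score matrix, assuming float32 scores and ignoring heads, layers, and batch size (real models multiply the figure accordingly).

```python
# Memory for one n x n float32 attention score matrix (single head, single layer).
# Illustrative numbers only; multiply by heads, layers and batch size for a real model.
for n in (512, 1024, 4096, 8192):
    bytes_needed = n * n * 4          # 4 bytes per float32 score
    print(f"n={n:5d}: {bytes_needed / 2**20:8.1f} MiB")
# Doubling the sequence length quadruples the memory, which is why full attention
# is typically limited to a few hundred tokens on commodity accelerators.
```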