Deniz Koyuncu, Rensselaer Polytechnic Institute
Jacob Rich, Y Meadows Inc.
For the past two years, the “transformer” architecture has achieved state-of-the-art performance on most benchmark NLP tasks. A transformer model considers the entire input sequence of text at once – rather than one word or character at a time – and passes the text through a series of “attention layers”. Each layer computes a new “hidden” representation of the input in the form of a sequence of numerical feature vectors, before passing those vectors on to the next layer. The result is that the length of the sequence is maintained throughout the whole model.
More formally, for an input sequence of length T, at each depth there are also T hidden representations. Therefore, at the final output layer there is still one hidden representation corresponding to each of the input tokens.
This type of output is necessary in tasks like Named Entity Recognition or Part-of-Speech Tagging, where each word, or “token”, must be considered independently. On the other hand, for NLP tasks such as Sequence Representation and Text Classification, a sequence-level hidden representation is sufficient – we don’t need each input token to have a corresponding output. Therefore, when using a typical pre-trained transformer model for one of these tasks, one must first extract the token-level representations and then summarize them into a single vector representing the input text sequence as a whole. In the paper we’re looking at today, “Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, the authors propose going the other way around: first extract a sequence-level representation, and then, only if the task requires it, recover the token-level representations from it.
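To make the usual recipe concrete, here is a minimal NumPy sketch of summarizing token-level representations into a single sequence vector (the function name and pooling choices are illustrative, not taken from the paper or any particular library):

```python
import numpy as np

def summarize_tokens(hidden, method="mean"):
    """Collapse a (T, D) stack of token vectors into one (D,) sequence vector.

    hidden: array of shape (T, D) -- one D-dimensional vector per token.
    """
    if method == "mean":             # average over the sequence
        return hidden.mean(axis=0)
    if method == "first":            # take a [CLS]-style first token
        return hidden[0]
    raise ValueError(f"unknown method: {method}")

tokens = np.random.randn(128, 768)   # T=128 tokens, D=768 features
vec = summarize_tokens(tokens)       # single vector of shape (768,)
```

Mean pooling and taking the first token are the two most common summaries; either way, the T-by-D output of the transformer is reduced to one vector after the fact.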
The main benefit of this approach is computational – it allows for a large reduction in the memory and compute required to deploy (and train) the network. The relative inexpensiveness of these smaller internal representations also makes it easy to add several more layers to the model, resulting in a deeper network and a higher-level sequence representation.
The model consists of two parts. First is the encoder, which takes the input tokens and produces the sequence-level vector representation. When training on a task like Text Classification, in which only one output per sequence is required, the encoder is all that is needed. Second is an optional decoder, which supports tasks (including pre-training methods) that need token-level outputs.
The encoder consists of blocks containing attention modules and fully connected layers, and within each block the sequence length of the hidden representations is kept constant. Inspired by Image Classification models, in which pixel representations are “pooled” between computational blocks to produce lower-resolution representations, the authors introduce a pooling operation between blocks of the encoder that reduces the sequence length of the representation by half. (A simple stride-2, window-size-2 mean pooling is employed, meaning that each pair of consecutive vectors is averaged together.)
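That pooling step can be sketched in a few lines of NumPy (a minimal illustration, assuming an even sequence length):

```python
import numpy as np

def pool_sequence(hidden):
    """Halve the sequence length by averaging each pair of consecutive
    token vectors (stride-2, window-size-2 mean pooling).

    hidden: (T, D) with T even -> (T // 2, D).
    """
    T, D = hidden.shape
    return hidden.reshape(T // 2, 2, D).mean(axis=1)

h = np.arange(8.0).reshape(4, 2)     # T=4 tokens, D=2 features
pooled = pool_sequence(h)            # shape (2, 2): pairwise means
```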
In order to avoid too much loss of information in the pooling step, the authors also modify the self-attention applied at the first layer of each block. After a pooling operation, instead of computing the attention of each token with respect to every other token in the same sequence, as is usually done, the attention of each token in the pooled sequence is computed with respect to the tokens in the pre-pooled sequence; this results in a “smarter” pooling than naive averaging.
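In other words, only the queries come from the pooled sequence, while the keys and values still come from the full-length one. A single-head NumPy sketch of this idea (the paper's version is multi-head with relative positional attention, which is omitted here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pooled_query_attention(h_full, h_pooled, Wq, Wk, Wv):
    """Attention where queries come from the pooled sequence (length T/2)
    but keys and values come from the pre-pooled one (length T)."""
    Q = h_pooled @ Wq                          # (T/2, D)
    K, V = h_full @ Wk, h_full @ Wv            # (T, D) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T/2, T)
    return softmax(scores) @ V                 # (T/2, D): a learned pooling

rng = np.random.default_rng(0)
T, D = 8, 16
h = rng.normal(size=(T, D))
h_pooled = h.reshape(T // 2, 2, D).mean(axis=1)        # naive mean pooling
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
out = pooled_query_attention(h, h_pooled, Wq, Wk, Wv)  # shape (4, 16)
```

Each of the T/2 output vectors is a learned weighted average over all T pre-pooled tokens, rather than a fixed average of two neighbors.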
The authors also introduce a decoder module to support downstream tasks that require a token-level representation. This is also necessary for pre-training the model, since the standard BERT-style pre-training objective is to predict probable values for randomly masked tokens. The decoder first duplicates the hidden representation such that it has the same length as the original sequence, then adds in the pre-pooled output of the first attention block. Finally, that hidden representation, whose length is equal to the number of tokens, is fed into two normal transformer layers producing token-wise outputs. The decoder can thus serve a pre-training procedure that requires a hidden representation for each token, and then be discarded if the downstream task doesn’t require them.
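The decoder's up-sampling step can be sketched as follows (a NumPy illustration; the two transformer layers that follow it are omitted, and the function name is ours):

```python
import numpy as np

def upsample_and_merge(h_compressed, h_block1):
    """Restore token-level length: repeat each compressed vector enough
    times to match the original length, then add the (full-length)
    hidden states from the first encoder block as a residual."""
    factor = h_block1.shape[0] // h_compressed.shape[0]
    upsampled = np.repeat(h_compressed, factor, axis=0)  # back to length T
    return upsampled + h_block1
    # the result would then pass through two standard transformer layers

T, D = 8, 16
h_block1 = np.random.randn(T, D)            # pre-pooled output of block 1
h_compressed = np.random.randn(T // 4, D)   # after two poolings: length T/4
token_states = upsample_and_merge(h_compressed, h_block1)  # shape (8, 16)
```

The residual from the first block restores fine-grained, per-token detail that the compressed representation alone no longer carries.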
The computational complexity of a transformer layer is O(T²D + TD²), where T and D denote the input sequence length and the dimension of each output representation, respectively. Because the attention term is proportional to the square of T, each time the sequence length is halved, the computation of the next block is reduced by more than half.
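A quick back-of-the-envelope check of that claim, using illustrative BERT-base-like sizes:

```python
def layer_cost(T, D):
    """Rough cost model T^2*D + T*D^2: attention term plus feed-forward term."""
    return T * T * D + T * D * D

T, D = 512, 768                 # illustrative sequence length and width
full = layer_cost(T, D)
halved = layer_cost(T // 2, D)
ratio = full / halved           # attention term shrinks 4x, feed-forward 2x
```

Here `ratio` works out to 2.5: better than the 2x you would get from halving the work uniformly, and the advantage grows as T gets larger relative to D.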
The amount of resources saved by pooling the sequence allows the authors to add additional transformer layers to their proposed models. For example, as an alternative to a large transformer model with 24 layers, they propose a Funnel-Transformer with a three-section encoder (pooling between sections) of ten layers each. This 30-layer architecture uses approximately 27% fewer FLOPs while actually having 22% more parameters. In other words, they introduce more layers while reducing the computational expense of the model.
The authors compared the proposed model architecture with existing approaches in two experimental settings, which we will call: (1) “standard scale”, in which the model is pre-trained with the baseline BERT settings, datasets, and objective; and (2) “extreme scale”, to compare against more powerful pre-trained networks like XLNet and ELECTRA, which are pre-trained with more computational resources and on larger corpora.
In these experiments, the authors chose three model-size categories – namely small, base, and large – and constructed baseline transformer models of each size. Then, for each category, they introduced competing Funnel-Transformer architectures with FLOP counts less than or equal to the baseline, as displayed in Table 1.
The models are then fine-tuned and tested on benchmark tasks that involve sequence-level prediction, such as sentiment classification. Additionally, one task requiring token-level output (the SQuAD question-answering task) is included to evaluate how the model performs with the decoder attached.
Some interesting take-aways from the results are:
The authors proposed an architecture that gradually decreases the sequence length of the hidden representations through pooling, similar to computer-vision models, which gradually decrease the pixel resolution of their hidden representations.
The authors show that the FLOPs saved by reducing the temporal resolution can be reinvested in additional layers, introducing a new trade-off between temporal resolution and the number of layers. The proposed Funnel-Transformer not only achieves higher performance than previous state-of-the-art models, but also suggests possible paths for further exploration.
Deniz Koyuncu is a PhD student at Rensselaer Polytechnic Institute and works on applications of Machine Learning in bioinformatics.
Jacob Rich is Director of Machine Learning at YMeadows.
Reference implementation provided in the paper:
Hugging-Face Pre-trained Funnel Transformer Models:
Dai, Z., et al., Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. arXiv preprint arXiv:2006.03236, 2020.