
Transformers & Attention


Understanding Self-Attention and Transformers: A Deep Dive into Modern NLP Models

Transformers have revolutionized natural language processing (NLP), and at the heart of these models lies the self-attention mechanism. This blog post will break down key concepts such as softmax, recurrent neural networks (RNNs), minimal self-attention architecture, and the Transformer model itself, offering a detailed mathematical understanding of these foundational elements.

Softmax: The Gateway to Attention

In Transformers, the self-attention mechanism leverages the softmax function to compute attention weights, determining how much each token in an input sequence should influence the representation of a specific token.

Mathematically, for a token \( x_i \) in a sequence \( x_1, x_2, \ldots, x_n \), we compute a query vector \( q_i \) by multiplying \( x_i \) with a learned weight matrix \( Q \):

\[ q_i = Q x_i \]

Similarly, we define key and value vectors for each token \( x_j \) using two other weight matrices \( K \) and \( V \):

\[ k_j = K x_j \quad \text{and} \quad v_j = V x_j \]

The attention weight \( \alpha_{ij} \), indicating the contribution of \( x_j \) to \( x_i \), is computed using the softmax of the dot product between \( q_i \) and \( k_j \):

\[ \alpha_{ij} = \frac{\exp(q_i^\top k_j)}{\sum_{j'} \exp(q_i^\top k_{j'})} \]

These weights sum to 1 across the sequence, allowing the model to focus on the most relevant tokens. The final representation \( h_i \) of \( x_i \) is then a weighted sum of the value vectors:

\[ h_i = \sum_j \alpha_{ij} v_j \]

This mechanism enables the model to capture long-range dependencies and parallelize computations, making Transformers highly efficient.
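
As a concrete illustration, the following NumPy sketch computes \( h_i \) for a single token exactly as in the equations above. The sequence length, embedding size, and random weight matrices are illustrative assumptions, not values from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                                  # sequence length and embedding size (illustrative)
X = rng.normal(size=(n, d))                  # token embeddings x_1, ..., x_n as rows
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))      # stand-ins for the learned weight matrices

i = 2                                        # token whose representation h_i we compute
q_i = Q @ X[i]                               # query vector q_i = Q x_i
scores = np.array([q_i @ (K @ X[j]) for j in range(n)])    # dot products q_i^T k_j for every j
weights = np.exp(scores - scores.max())      # softmax numerator (max subtracted for stability)
weights /= weights.sum()                     # attention weights alpha_ij, summing to 1
h_i = sum(w * (V @ X[j]) for j, w in enumerate(weights))   # weighted sum of the value vectors
```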

The Role of Linear Algebra in Softmax

Linear algebra plays a crucial role in the efficient implementation of the softmax function. Here’s how:

  1. Matrix Multiplication for Efficient Computations: The scores fed into softmax come from a single matrix product of the stacked queries and keys, so the attention weights for all token pairs are computed in parallel.

  2. Dimensionality Reduction in Multi-Head Attention: Multi-head attention involves linearly transforming inputs into multiple lower-dimensional spaces, maintaining computational efficiency.

  3. Numerical Stability: Because the scores live in a single matrix, stabilizing tricks such as subtracting each row's maximum before exponentiation are straightforward to apply, which is crucial for training deep networks (see the sketch after this list).

  4. Low-Rank Approximations: Weight matrices in RNNs and attention mechanisms can be approximated using low-rank factorizations, reducing parameters and improving generalization.
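
To make points 1 and 3 concrete, here is a minimal sketch of how all attention scores can come from a single matrix product, followed by a row-wise softmax stabilized by subtracting the maximum. The shapes and weight matrices are illustrative assumptions.

```python
import numpy as np

def stable_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the row maximum subtracted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8                                  # sequence length and model dimension (illustrative)
X = rng.normal(size=(n, d))                  # token embeddings as rows
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

queries = X @ Wq.T                           # one matrix product produces every q_i
keys = X @ Wk.T                              # one matrix product produces every k_j
scores = queries @ keys.T                    # (n, n) matrix of all q_i^T k_j dot products
A = stable_softmax(scores)                   # attention weights; every row sums to 1
assert np.allclose(A.sum(axis=1), 1.0)
```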

Recurrent Neural Networks: A Brief Overview

RNNs are designed to process sequential data by maintaining a hidden state that depends on the current input and the previous hidden state. The basic RNN update is:

\[ h_t = \sigma(W h_{t-1} + U x_t) \]

where \( \sigma \) is an elementwise nonlinearity and \( W \) and \( U \) are learned weight matrices.
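
To make the sequential dependence explicit, here is a minimal NumPy sketch of this recurrence. The tanh nonlinearity and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 8, 4, 10                       # hidden size, input size, sequence length (illustrative)
W = rng.normal(size=(d_h, d_h))              # hidden-to-hidden weights
U = rng.normal(size=(d_h, d_x))              # input-to-hidden weights
xs = rng.normal(size=(T, d_x))               # input sequence x_1, ..., x_T

h = np.zeros(d_h)                            # initial hidden state h_0
for x_t in xs:                               # each step needs the previous hidden state,
    h = np.tanh(W @ h + U @ x_t)             # so the loop cannot be parallelized across time
```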

However, RNNs face two major limitations:

  1. Parallelization Issues: The hidden state at each time step depends on the previous state, limiting parallelization.

  2. Linear Interaction Distance: The number of operations separating two tokens grows linearly with their distance in the sequence, making it difficult to capture long-range dependencies.

These limitations have led to the development of attention mechanisms and Transformers, which handle long-range dependencies and parallelization more effectively.

A Minimal Self-Attention Architecture

Self-attention is a method for focusing on relevant parts of the input sequence when computing token representations. In self-attention, the same elements are used to define queries, keys, and values.

The key-query-value self-attention mechanism, a core component of Transformer models, operates as follows:

  1. Compute Queries, Keys, and Values: For each token \( x_i \), compute \( q_i = Q x_i \), \( k_j = K x_j \), and \( v_j = V x_j \).

  2. Calculate Attention Weights: Compute the dot product between \( q_i \) and each \( k_j \), then apply softmax to obtain attention weights.

  3. Compute Output Representation: The output representation \( h_i \) is a weighted sum of the values \( v_j \), using the attention weights as coefficients.

This mechanism allows the model to dynamically focus on the most relevant parts of the sequence, overcoming the limitations of RNNs.
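
Tying the three steps together, a minimal key-query-value self-attention layer can be sketched as a single function. The unscaled dot products match the description above, and the shapes and names are illustrative assumptions.

```python
import numpy as np

def self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Minimal self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T           # step 1: queries, keys, and values
    scores = Q @ K.T                                 # step 2: all q_i^T k_j dot products
    scores -= scores.max(axis=-1, keepdims=True)     #         softmax (stabilized)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V                                     # step 3: weighted sums of the values

rng = np.random.default_rng(1)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)                    # H[i] is the representation h_i
```

Note that the full Transformer additionally scales the dot products by \( 1/\sqrt{d_k} \) before the softmax; the minimal version above omits this for simplicity.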

Position Representation in Transformers

Self-attention lacks an inherent sense of order, so Transformers add positional embeddings to input tokens. These embeddings can be learned or predefined, like sinusoidal encodings, which provide unique representations for each position. This positional information is crucial for capturing the sequence order in the model.
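
As an illustration of the predefined variant, here is a hedged sketch of sinusoidal position encodings, following the standard sine/cosine construction; the dimensions are illustrative, and the result would simply be added to the token embeddings.

```python
import numpy as np

def sinusoidal_positions(n_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings: sine on even feature indices, cosine on odd ones."""
    positions = np.arange(n_positions)[:, None]          # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))     # a distinct frequency per feature pair
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

P = sinusoidal_positions(n_positions=16, d_model=8)      # a unique vector for each position
# X = X + P[:len(X)]                                     # added elementwise to the embeddings
```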

Elementwise Nonlinearity and Its Importance

Stacking self-attention layers alone isn’t sufficient. Nonlinear activation functions, such as ReLU, Sigmoid, or Tanh, are necessary to introduce complexity into the model. Without nonlinearities, stacking multiple layers would be equivalent to a single linear transformation, limiting the model’s expressive power.
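
For example, the position-wise feed-forward sublayer interleaved with attention in Transformers applies exactly such a nonlinearity between two linear maps. The following sketch uses ReLU and illustrative shapes.

```python
import numpy as np

def feed_forward(X: np.ndarray, W1: np.ndarray, b1: np.ndarray,
                 W2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Position-wise feed-forward block: linear -> ReLU -> linear."""
    hidden = np.maximum(X @ W1 + b1, 0.0)    # the ReLU supplies the nonlinearity; without it,
    return hidden @ W2 + b2                  # X @ W1 @ W2 collapses to a single linear map

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32                        # sequence length, model dim, hidden dim (illustrative)
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
out = feed_forward(X, W1, b1, W2, b2)        # same shape as X: (n, d)
```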

The Transformer: A Revolutionary Architecture

The Transformer architecture builds on the self-attention mechanism, consisting of stacked blocks with multi-head self-attention, feed-forward layers, and other components like layer normalization and residual connections.

Multi-Head Self-Attention applies self-attention multiple times in parallel, with different projections, allowing the model to attend to various parts of the sequence simultaneously. This increases the model’s ability to focus on the most relevant tokens.
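
A hedged sketch of the idea: each head runs the minimal (unscaled) self-attention from above with its own lower-dimensional projections, and the heads' outputs are concatenated. The head count, shapes, and the omission of a final output projection are illustrative simplifications.

```python
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X: np.ndarray, heads) -> np.ndarray:
    """heads is a list of (Wq, Wk, Wv) triples; each head attends with its own projections."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
        outputs.append(softmax_rows(Q @ K.T) @ V)        # one self-attention pass per head
    return np.concatenate(outputs, axis=-1)              # concatenate the heads' outputs

rng = np.random.default_rng(0)
n, d, n_heads = 5, 8, 2
d_head = d // n_heads                                    # each head works in a lower-dimensional space
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d_head, d)) for _ in range(3)) for _ in range(n_heads)]
H = multi_head_attention(X, heads)                       # (n, d); an output projection usually follows
```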

Layer Normalization and Its Impact

Layer normalization is crucial in Transformers, reducing uninformative variation in activations and providing stable inputs to subsequent layers. This stability is especially beneficial for the softmax function, improving numerical stability, mitigating vanishing gradients, and enhancing the overall training process.
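
A minimal sketch of layer normalization over the feature dimension, with a small epsilon for stability and the learnable scale and shift left at their identity values; the inputs are illustrative.

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: float = 1.0, beta: float = 0.0, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token's features to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=5.0, size=(4, 8))   # activations with an uninformative offset and scale
Y = layer_norm(X)                                 # each row is now roughly zero-mean, unit-variance
```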

Conclusion

Transformers and their self-attention mechanisms have transformed NLP by enabling efficient processing of long-range dependencies and parallelizing computations. Understanding the mathematical underpinnings of softmax, self-attention, and Transformer architectures is key to leveraging these models effectively in modern NLP tasks.
