# Inside Transformers: An In-depth Look at the Game-Changing Machine Learning Architecture

*Note: AI tools are used as an assistant in this post!*

Generated by Microsoft Bing Image Creator

As artificial intelligence (AI) evolves at a rapid pace, a few designs stand out for how much they have reshaped the field. The Transformer is one of them: it has transformed not only natural language processing but many other areas of machine learning as well.

In their seminal 2017 paper “Attention is All You Need”, Vaswani et al. introduced the Transformer, which changed the way we model and process sequences. Its key innovation, the attention mechanism, made sequence-to-sequence tasks easier to train and made long-range dependencies in data easier to capture.

But what is it that makes Transformers so strong? How do they use attention mechanisms to capture where tokens sit in a sequence and how they depend on one another? And why have they become the go-to architecture for many modern machine learning tasks, even outside of natural language processing?

In this blog post, we’ll take a close look at how the Transformer architecture works on the inside. We’ll look at its main parts, from inputs and embeddings to the multi-head attention system and positional encoding, all the way to outputs. We will see how it changed the way sequence modeling is done and how its design makes it useful for a wide range of machine learning tasks.

This guide aims to explain how the Transformer model works, whether you are an experienced machine learning engineer, a researcher, or someone new to the field who wants to learn about one of the most important designs in AI. Join us as we learn more about this new technology that is really changing the way artificial intelligence works.

Before I get into the transformer architecture blocks, I’ll quickly talk about the history of the attention mechanism.

**Attention Mechanism Brief History**

The history of attention mechanisms in deep learning can be traced back to the development of recurrent neural networks (RNNs) and has evolved into the powerful architecture of transformers, which have become a dominant force in the field of natural language processing (NLP) and beyond. Here’s an overview of the key milestones in the history of attention mechanisms:

1. **Recurrent Neural Networks (RNNs):** RNNs were introduced in the late 1980s and early 1990s as a way to model sequential data. These networks are designed to maintain an internal state or “memory” that can capture information from previous time steps. However, RNNs suffer from the vanishing and exploding gradient problems, which make it difficult to capture long-term dependencies in sequences.

2. **Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU):** To overcome the limitations of RNNs, LSTMs were introduced by Hochreiter and Schmidhuber in 1997, and GRUs were later proposed by Cho et al. in 2014. These architectures use gating mechanisms to selectively update and forget information, enabling them to capture longer-range dependencies more effectively than traditional RNNs.

3. **Neural Machine Translation (NMT) and Seq2Seq Models:** In 2014, Sutskever et al. introduced the Sequence-to-Sequence (Seq2Seq) model, which uses an encoder-decoder architecture for tasks like machine translation. The encoder processes the input sequence and generates a fixed-size context vector, while the decoder generates the output sequence based on this context vector. However, this fixed-size representation can limit the model’s ability to handle long sequences.

4. **Attention Mechanism:** The attention mechanism was introduced by Bahdanau et al. in 2015 as a way to address the limitations of fixed-size context vectors in Seq2Seq models. Instead of compressing the entire input sequence into a single context vector, attention allows the model to weigh different parts of the input sequence when generating each token in the output sequence. This improves the performance of neural machine translation and other sequence-to-sequence tasks.

The encoder-decoder model with additive attention mechanism in __Bahdanau et al., 2015__ (__source__).

5. **Self-Attention and Transformers:** In 2017, Vaswani et al. introduced the Transformer architecture, which relies on self-attention mechanisms to process input sequences. Transformers eliminate the need for recurrent connections and instead use a series of multi-head self-attention layers to process input tokens in parallel. This design allows for more efficient training, better handling of long-range dependencies, and improved scalability.

Since their introduction, Transformers have become the foundation for many state-of-the-art NLP models, such as BERT, GPT, and RoBERTa. Attention mechanisms have also been applied in other domains, such as computer vision and reinforcement learning, demonstrating their versatility and effectiveness in a wide range of applications.

We will go into the details of the self-attention mechanism and Transformer’s architecture in this blog post. To learn more about the works before Transformers, you can read __this awesome blog post__.

**Transformer Architecture**

The following slide shows the transformer architecture:

source: __Introduction to Deep Learning — Raschka__

Here is a high-level overview of the Transformer model pipeline:

1. **Input**: The model takes a sequence of words as input. The words are tokenized, and each token is represented by a unique integer.

2. **Embedding**: The integers are then converted into fixed-length vectors through an embedding layer. This vector representation captures the semantic meaning of the word.

3. **Positional Encoding**: Since Transformer models don’t inherently understand the order of words in a sequence (as they are not recurrent), a positional encoding is added to the word embeddings. This injects information about the position of the words in the sequence. The positional encoding can either be learned or be a fixed function of position.

4. **Self-Attention (or Scaled Dot-Product Attention)**: The heart of the Transformer model. It allows the model to weigh the relevance of each word when encoding a particular word. It’s a way of getting the context of each word in relation to all other words in the sentence. For a given word, it quantifies the ‘attention’ it should pay to all other words for a particular task.

5. **Multi-Head Attention**: This mechanism allows the model to focus on different positions, capturing various features from different perspectives. Essentially, it runs the self-attention mechanism in parallel multiple times (heads) with different learned linear transformations of the input, and then concatenates and transforms the results.

6. **Feed-Forward Neural Networks**: These are present in both the encoder and decoder. After multi-head attention, the output is passed through a feed-forward neural network independently for each position.

7. **Layer Normalization**: This is a technique to stabilize the learning process and accelerate training. Layer normalization is applied after each multi-head attention block and the feed-forward neural network.

8. **Residual Connections**: These are used around each of the two sub-layers (multi-head attention and FFNN) to prevent the vanishing gradient problem. Each sub-layer’s output is added to its input, and this result is then normalized.

9. **Encoder and Decoder Blocks**: The Transformer has an encoder-decoder structure. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. The decoder generates an output sequence of symbols from these continuous representations. Each of these consists of a stack of identical layers (the number of layers is a hyperparameter). The decoder has an additional multi-head attention layer to attend to the encoder output.

10. **Output**: The final output of the Transformer is a sequence of vectors, where each vector corresponds to a word in the output sequence. These vectors can then be transformed into a probability distribution of output words using a final linear layer followed by a softmax.
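To make the last step of the list above concrete, here is a minimal sketch of turning one output vector into a probability distribution over a vocabulary. The sizes, the `hidden` vector, and the `W_out` matrix are made-up toy values, not taken from any real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(hidden, W_out):
    """Project one output vector to vocabulary probabilities.

    hidden: list of d_model floats.
    W_out: vocab_size rows, each with d_model floats (the final linear layer).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
    return softmax(logits)

# Hypothetical toy sizes: d_model = 3 and a vocabulary of 4 words.
hidden = [0.5, -1.0, 2.0]
W_out = [
    [0.1, 0.0, 0.3],
    [0.2, -0.4, 0.1],
    [-0.3, 0.2, 0.5],
    [0.0, 0.1, -0.2],
]
probs = output_distribution(hidden, W_out)

# Greedy decoding: pick the index of the most probable word.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
```

Picking the most probable word is only one decoding strategy; sampling or beam search would use the same `probs` differently.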

Let’s dig deeper into some of these components now. I will explain positional encoding, self-attention, multi-head attention, and masked multi-head attention. The rest are simple and the above explanations are enough.

**Positional Encoding**

Positional encoding is a key component of the Transformer architecture that allows it to consider the order of words in a sequence. The transformer architecture doesn’t inherently understand the order of the sequence because it doesn’t have recurrence like RNNs or convolutions like CNNs. Positional encoding provides a way of injecting information about each token’s position in the sequence into the model.

**Absolute Positional Encoding:** In the original Transformer model, Vaswani et al. used a specific function to add a vector to the embedding of each token, providing absolute position information. For a given position, the elements of the positional encoding vector are computed as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Here, pos is the position, and i ranges over the dimensions of the encoding (from 0 to d_model/2-1, where d_model is the dimensionality of the embeddings). The authors used sine for even indices and cosine for odd indices to create a positional encoding that would allow the model to easily learn to attend by relative position, as detailed in the paper.

The intuition here is that these functions create unique positional encodings that the model can easily distinguish and use to learn the position of each word in the sequence.

Let’s look at how the Transformer uses positional encoding with an example.

Consider a simple sentence: “I love cats.” After tokenization, we might represent this sentence as a sequence of integers, where each integer is an index that corresponds to a word in our vocabulary: [9, 27, 301].

Each word is then transformed into a dense vector using a learned embedding. Let’s assume we’re using a 6-dimensional embedding space for simplicity. For example, the word “I” might be represented as a vector [0.1, 0.3, 0.9, -0.4, 0.5, 0.8], “love” as [-0.3, 0.8, -0.6, 0.1, 0.7, -0.8], and “cats” as [0.2, -0.9, 0.6, 0.3, -0.4, -0.5].

The Transformer then adds positional encodings to these embeddings to give position information.

For the absolute positional encoding, we would calculate the encoding as described in the original “Attention is All You Need” paper. For a 6-dimensional embedding, and for position ‘pos’, the positional encoding would look like this:

PE(pos, 0) = sin(pos / 10000^(0/6)) = sin(pos)

PE(pos, 1) = cos(pos / 10000^(0/6)) = cos(pos)

PE(pos, 2) = sin(pos / 10000^(2/6))

PE(pos, 3) = cos(pos / 10000^(2/6))

PE(pos, 4) = sin(pos / 10000^(4/6))

PE(pos, 5) = cos(pos / 10000^(4/6))

For position 1 (word “I”), we get a positional encoding of approximately [0.84, 0.54, 0.05, 1.00, 0.00, 1.00]. Similarly, we can calculate encodings for position 2 (“love”) and 3 (“cats”).

These positional encodings are then element-wise added to the word embeddings before they are fed into the encoder layers.
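The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes an even `d_model` and reuses the toy embeddings from the “I love cats” example; the function name `positional_encoding` is just a label chosen here.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position, following the formulas above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # even index 2i
        pe.append(math.cos(angle))  # odd index 2i + 1
    return pe

# Toy embeddings from the "I love cats" example.
embeddings = [
    [0.1, 0.3, 0.9, -0.4, 0.5, 0.8],    # "I"
    [-0.3, 0.8, -0.6, 0.1, 0.7, -0.8],  # "love"
    [0.2, -0.9, 0.6, 0.3, -0.4, -0.5],  # "cats"
]

# The example numbers positions from 1; many implementations start at 0.
encoder_input = []
for pos, emb in enumerate(embeddings, start=1):
    pe = positional_encoding(pos, d_model=6)
    encoder_input.append([e + p for e, p in zip(emb, pe)])  # element-wise add

pe_first = positional_encoding(1, 6)
print([round(v, 2) for v in pe_first])
```

Because the encoding depends only on `pos` and `d_model`, it can be precomputed once for the maximum sequence length and reused for every input.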

**Relative Positional Encoding:**

Relative positional encoding was later introduced to improve Transformer’s performance on certain tasks, where the relative position of words (how far apart they are) is more important than their absolute position in the sentence.

In this case, rather than adding a positional encoding to the embeddings before the attention calculation, the attention scores are adjusted based on the relative positions of words. Specifically, a bias is added to the attention scores that is based on a learned embedding for each possible relative position.

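The exact details of how this is implemented can vary. For example, Shaw et al. (2018) add learned relative position representations inside the attention computation, and the Music Transformer paper (Huang et al., 2018) introduced a “skewing” operation to compute these relative-position terms efficiently.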

In practice, relative positional encodings often perform similarly to absolute positional encodings. However, they might have advantages in certain settings, such as when dealing with very long sequences or when the relative position of words is particularly important.

Let’s do our example for relative positional encoding. In this case, the model learns a separate embedding for each possible relative position. For example, in a sequence of length 3, there are 5 possible relative positions, from -2 to +2; the nearest ones are -1 (for a word one position earlier), 0 (for the same word), and +1 (for a word one position later). Each relative position would have a separate learned embedding. For the three nearest ones, these could be, for example:

PE(-1) = [0.2, -0.4, 0.3, -0.1, 0.5, -0.3]

PE(0) = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3]

PE(+1) = [-0.1, 0.3, -0.2, 0.4, -0.3, 0.1]

In relative positional encoding, the embeddings are learned for each possible relative position. They’re not tied to any specific words, but rather indicate the relative positions between words.

For example, in a sequence of length 10, you’ll have possible relative positions ranging from -9 to +9 (a total of 19 relative positions, including 0). Each of these relative positions will have a learned embedding.

In practice, there’s often a maximum sequence length that the model can handle (due to memory and computational constraints), and we’d have a learned embedding for each possible relative position within that maximum length.

The embeddings are the same for all words and they’re learned based on relative positions, not the specific words at those positions. Once the embeddings are learned during the training process, they’re used to represent the same relative position in any context.

So, the learned relative positional embeddings represent the “distance” or “difference” between positions, and they are used to adjust the attention scores in the transformer model. This provides the model with information about the order of words in the sequence, which is crucial in many language processing tasks.

These relative positional embeddings would be used to adjust the attention scores, rather than being added to the word embeddings directly.
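Here is a small sketch of how such embeddings could adjust the attention scores. It assumes one possible formulation: the bias for a pair (i, j) is the dot product of the query with the embedding for the offset j - i, and offsets beyond ±1 are clipped so that only the three example embeddings are needed. Both choices, and the use of the word embeddings as stand-in queries and keys, are assumptions made for illustration.

```python
# Relative-position embeddings from the example, keyed by offset j - i.
rel_emb = {
    -1: [0.2, -0.4, 0.3, -0.1, 0.5, -0.3],
    0: [0.1, -0.2, 0.3, -0.1, 0.2, -0.3],
    1: [-0.1, 0.3, -0.2, 0.4, -0.3, 0.1],
}
MAX_DIST = 1  # assumption: larger offsets reuse the nearest embedding

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relative_scores(queries, keys):
    """Raw attention scores with a relative-position bias (before softmax)."""
    T = len(queries)
    scores = [[0.0] * T for _ in range(T)]
    for i in range(T):
        for j in range(T):
            offset = max(-MAX_DIST, min(MAX_DIST, j - i))
            content = dot(queries[i], keys[j])           # usual q . k term
            position = dot(queries[i], rel_emb[offset])  # relative-position bias
            scores[i][j] = content + position
    return scores

# Hypothetical queries and keys: the "I love cats" word embeddings.
x = [
    [0.1, 0.3, 0.9, -0.4, 0.5, 0.8],
    [-0.3, 0.8, -0.6, 0.1, 0.7, -0.8],
    [0.2, -0.9, 0.6, 0.3, -0.4, -0.5],
]
scores = relative_scores(x, x)
```

Note that nothing is added to the word embeddings themselves: the position information enters only through the `position` term of each score.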

In both cases, the idea is to give the model some information about the order of the words in the sequence, since the self-attention mechanism by itself doesn’t have any way of considering word order.

**Self-Attention Mechanism**

The self-attention mechanism is the core component of the transformer architecture. Here is the main procedure in the self-attention mechanism:

1- **Derive attention weights:** similarity between the current input and all other inputs. It means that if you have an input sequence x1, x2, …, xT, the self-attention mechanism will find the interaction/similarity between xi and xj for every pair of inputs.

2- **Normalize weights via softmax:** It then takes a softmax over the similarities to get a kind of probability distribution over all inputs xj. The sum of all attention weights for a given i is 1.

`sum(aij, j=1,...,T) = 1`

3- **Compute the attention value from normalized weights and corresponding inputs:** To compute the attention value for the i-th input, we can use the following formula:

`Ai = sum(aij * xj, j=1,...,T)`

It is basically a weighted average over all inputs, and the output is again a vector that is context-aware.

One easy and basic way to compute attention weights is to just consider them as the dot product between input vectors as follows without any learning:

```
eij = xi^T . xj
aij = softmax([eij], j=1,...,T) = exp(eij) / sum(exp(eik), k=1,...,T)
```

The following image shows the procedure in a clear way. For x^(i), the embedding of the i-th token, we calculate its dot product with all token embeddings from 1 to T and apply softmax to get the attention weights (aij in our formula and wij in the image). Then we can calculate the output as a weighted average of all token embeddings. This new output/embedding of the i-th token (Ai in our formula and o^(i) in the image) is a context-aware embedding for the i-th token and knows about the interaction between the i-th input and all other inputs.

source: Python Machine Learning — Third Edition — Raschka & Mirjalili
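The parameter-free version described above can be sketched in plain Python as follows. The input vectors are made-up toy values, and `basic_self_attention` is just a name chosen for this illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def basic_self_attention(x):
    """Self-attention without learned parameters.

    x: list of T input vectors. Returns (outputs, weights), where
    weights[i][j] is a_ij and outputs[i] is A_i.
    """
    outputs, weights = [], []
    for xi in x:
        # Step 1: similarities e_ij = xi . xj
        e_i = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        # Step 2: normalize with softmax so the row sums to 1
        a_i = softmax(e_i)
        # Step 3: weighted average of all inputs
        A_i = [sum(a_ij * xj[d] for a_ij, xj in zip(a_i, x))
               for d in range(len(xi))]
        weights.append(a_i)
        outputs.append(A_i)
    return outputs, weights

# Toy input: T = 3 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A, a = basic_self_attention(x)
```

Each output `A[i]` has the same dimension as the inputs, since it is just a weighted average of them.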

We can also learn the attention weights. But how? Using *self-attention mechanism!*

The idea is to add three trainable matrices, which will be multiplied by the input embeddings to calculate the following three values:

```
query = Wq * xi
key = Wk * xi
value = Wv * xi
```

Each input embedding will be multiplied by these three matrices as above and the query, key, and value will be generated.

source: __Introduction to Deep Learning — Raschka__

Let’s suppose we’re trying to determine the interaction of the second input with all the other inputs, denoted as A2. This requires us to compute the dot product of the second input’s query vector, represented by q2, with the key vectors of all inputs, from k1 to kT. This will yield the attention weights, ranging from a21 to a2T. Afterward, we’ll apply the softmax function to normalize these attention weights. Finally, we can calculate the weighted sum of all the value vectors, extending from v1 to vT. The following image shows all dimensions and also the attention embedding vector for the second input:

source: __Introduction to Deep Learning — Raschka__

Basically, we start with a token embedding of size de and calculate a vector with the dimension of dv as output.
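As a rough sketch of this computation for a single query, the snippet below follows the convention query = Wq * xi, so each weight matrix has one row per output dimension. The toy sizes (de = 3, dk = dv = 2) and the matrix values are invented for illustration, and the scaling by sqrt(dk) is left out because the steps described so far do not use it.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(i, x, Wq, Wk, Wv):
    """Attention output A_i and weights a_i1..a_iT for the i-th input (0-based)."""
    q_i = matvec(Wq, x[i])                  # query of the current input
    keys = [matvec(Wk, xj) for xj in x]     # k_1 .. k_T
    values = [matvec(Wv, xj) for xj in x]   # v_1 .. v_T
    scores = [sum(a * b for a, b in zip(q_i, k)) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(d_v)]
    return output, weights

# Hypothetical toy values: T = 3 tokens, d_e = 3, d_k = d_v = 2.
x = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
Wq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # d_k x d_e
Wk = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # d_k x d_e
Wv = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # d_v x d_e

# Second input (index 1), i.e. A2 in the text.
A2, a2 = attend(1, x, Wq, Wk, Wv)
```

The output `A2` has dv entries even though each input has de entries, which matches the change of dimension described above.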

The following image shows more examples and how to create the attention score matrix:

source: __Introduction to Deep Learning — Raschka__

You can also see the following image to get a better understanding of the dimensions and also how to calculate A in a matrix form and not only for one input:

source: __Introduction to Deep Learning — Raschka__

But, as you see, we have one scaling value in the formula: sqrt(dk) or temperature as used in some papers. Why do we need this term?

source: __Introduction to Deep Learning — Raschka__

I asked ChatGPT to see what it/he/she thinks. Here is the answer:

The scaling part of the Scaled Dot Product Attention formula helps mitigate a potential issue that arises when computing the dot products of high-dimensional vectors. When the dimensionality of the Key vectors (dk) is large, the dot products can become very large as well. This is because the dot product is a sum of element-wise products, and as the number of elements in the vectors increases, the sum can grow.

When the dot products become large, the input to the softmax function can have a wide range of values. In that regime the softmax saturates: it pushes almost all of the probability mass onto the largest input and squashes the others toward zero. Such a highly peaked distribution has very small gradients, which can lead to vanishing gradients during backpropagation and slow down the training process.

To address this issue, the dot product values are scaled down by dividing them by the square root of the dimension of the Key vectors (sqrt(dk)). This normalization step helps maintain a more manageable range of values for the input to the softmax function. The scaling factor, sqrt(dk), is chosen because it is the standard deviation of the dot product of two random vectors whose elements are independent with zero mean and unit variance.

The choice of the scaling factor sqrt(dk) is based on the assumption that the elements of the Query and Key vectors are independent random variables with zero mean and unit variance. Under this assumption, the dot product of two such vectors has an expected value of 0 and a variance of dk.

Let’s dive into this concept a bit more. Consider two random vectors u and v, each with n elements, where the elements are independent and identically distributed random variables with mean 0 and variance 1. The dot product of u and v is:

u · v = Σ(u_i * v_i), where i ranges from 1 to n.

Since u_i and v_i are independent, the expected value of their product is:

E[u_i * v_i] = E[u_i] * E[v_i] = 0 * 0 = 0.

Thus, the expected value of the dot product is:

E[u · v] = ΣE[u_i * v_i] = Σ0 = 0.

The variance of the dot product is given by:

Var(u · v) = ΣVar(u_i * v_i), as the elements are independent. Since u_i and v_i are independent with zero mean and unit variance, Var(u_i * v_i) = E[u_i^2] * E[v_i^2] - (E[u_i] * E[v_i])^2 = 1 * 1 - 0 = 1.

Therefore, the variance of the dot product is:

Var(u · v) = ΣVar(u_i * v_i) = Σ1 = n.

So, for vectors with dimensionality n, the dot product has an expected value of 0 and variance of n. In the case of the Scaled Dot Product Attention, we have n = dk.

Dividing the dot product by sqrt(dk) essentially normalizes the dot product values and scales down the variance to 1, which helps maintain a manageable range of values for the input to the softmax function. This normalization reduces the chance of having large values that could lead to vanishing gradients and numerical instability during training.

If you used a scaling factor of dk instead of sqrt(dk), you would be over-correcting for the variance of the dot product, resulting in a variance smaller than 1. This could cause the softmax function to produce a more uniform distribution, which might negatively impact the model’s ability to capture the importance of the most relevant Key-Value pairs in the attention mechanism. The choice of sqrt(dk) balances the need to control the range of dot product values without overly suppressing their variation.

The sqrt(dk) is the standard deviation (std) of the dot product of two random vectors with zero mean and unit variance when the dimension of the vectors is dk. Recall that the variance of the dot product of two such vectors was derived as Var(u · v) = dk. To obtain the standard deviation, we take the square root of the variance:

std(u · v) = sqrt(Var(u · v)) = sqrt(dk).

It also told me the following after some interactions about the above reasoning:

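The original paper “Attention is All You Need” by Vaswani et al. (2017) does not give a full derivation of the scaling factor sqrt(dk). It motivates it only briefly, in a footnote: if the components of the query and key are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance dk.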

However, the choice of sqrt(dk) can be intuitively justified by the discussion of the dot product of high-dimensional vectors with zero mean and unit variance. The idea is that dividing by sqrt(dk) helps to normalize the dot product values, preventing them from becoming too large and maintaining a manageable range of values for the input to the softmax function.
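The variance argument is easy to check empirically. The sketch below draws random query and key vectors with independent standard normal entries (the same assumption as in the reasoning above) and compares the variance of the raw and scaled dot products. The sample count and the seed are arbitrary choices.

```python
import math
import random
import statistics

def dot_product_samples(d_k, n_samples, rng):
    """Dot products of random vectors with i.i.d. N(0, 1) entries."""
    samples = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    return samples

rng = random.Random(0)  # fixed seed so the run is repeatable
d_k = 64
raw = dot_product_samples(d_k, 2000, rng)
scaled = [s / math.sqrt(d_k) for s in raw]

var_raw = statistics.pvariance(raw)        # expected to be close to d_k
var_scaled = statistics.pvariance(scaled)  # expected to be close to 1
print(round(var_raw, 1), round(var_scaled, 2))
```

With real, trained queries and keys the independence assumption no longer holds exactly, so this is a motivation for the scaling rather than a guarantee about the scores in a trained model.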

**Multi-Head Attention**

The next idea on top of the self-attention mechanism (scaled dot-product attention) is **Multi-Head Attention**. We can think of the multiple heads in the attention mechanism as analogous to the kernels in convolutional neural networks. Each head is one scaled dot-product attention, and all of these heads work in parallel and have separate matrices. For each head (self-attention layer), we use different matrices Wq, Wk, Wv and then concatenate the results Ai. The main transformer paper used 8 attention heads, i.e., Wq1, Wk1, Wv1, ..., Wq8, Wk8, Wv8. This allows the network to attend to different parts of the sequence differently. The following image shows the difference between one head and multiple heads.

source: the main transformer paper (Vaswani et al., 2017)

The following image shows the input and output dimensions of the multi-head attention.

source: __Introduction to Deep Learning — Raschka__

As shown, in the main transformer paper, the input dimension is T*de and the output dimension after concatenating the outputs of all attention heads is T*(dv*h) = T*de, so the input and output sizes are the same. Note that T is the number of input tokens and dv = de/h = 64, where de = 512 and h = 8, so that the concatenated output has size de = dv*h.

For the last Linear layer after concatenation, the dimensions are as follows:

source: __Introduction to Deep Learning — Raschka__

The main transformer paper used do = dv*h = de to have the same output size as the input, de. So, the output of the multi-head attention is again T*do = T*de.
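To see these dimensions in action, here is a minimal multi-head attention sketch in plain Python. It uses the row-vector convention (X has shape T x de and is multiplied on the right by each weight matrix), small toy sizes instead of de = 512 and h = 8, and random weights from a hypothetical `random_matrix` helper, so only the shapes are meaningful.

```python
import math
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, Wq, Wk, Wv):
    """One scaled dot-product attention head; X is T x d_e."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]  # transpose of K
    scores = matmul(Q, K_t)               # T x T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)             # T x d_v

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples; Wo: (h * d_v) x d_o."""
    outputs = [attention_head(X, *w) for w in heads]
    # Concatenate the head outputs for each token: T x (h * d_v)
    concat = [sum((out[t] for out in outputs), []) for t in range(len(X))]
    return matmul(concat, Wo)             # T x d_o

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d_e, h = 5, 16, 4  # toy sizes chosen for this sketch
d_v = d_e // h        # so that h * d_v = d_e
X = random_matrix(T, d_e, rng)
heads = [tuple(random_matrix(d_e, d_v, rng) for _ in range(3)) for _ in range(h)]
Wo = random_matrix(h * d_v, d_e, rng)
out = multi_head_attention(X, heads, Wo)
```

Real implementations usually compute all heads with one batched matrix multiplication instead of a Python loop; the loop here only keeps the per-head structure visible.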

To recap, the following slide shows a single head scaled dot-product attention procedure:

source: __Introduction to Deep Learning — Raschka__

Additionally, the following slide demonstrates the process for multiple heads, starting with input that is multiplied by various weight matrices of various heads to create Query, Key, and Value, to concatenating the outputs of various heads and sending them into the Linear layer:

source: __Introduction to Deep Learning — Raschka__

We have multi-head attention in both the encoder and decoder. As you can see in the transformer architecture at the beginning of this post, in the multi-head attention block in the encoder part, the same input is used to generate Query, Key, and Value. This is called self-attention because it tries to learn the interaction of the input with itself. On the other hand, the second multi-head attention block in the decoder, which comes after the masked multi-head self-attention layer, uses the output of the encoder to generate Value and Key, and the output of the masked multi-head attention → Add&Norm to generate Query. This is called cross attention, as it tries to learn the interaction between input and output. This layer is designed to allow each position in the decoder to attend over all positions in the input sequence from the encoder.

In other words, when generating an output token, the decoder not only considers the previously generated tokens (through masked self-attention) but also takes into account the entire input sequence (through cross attention). This process is crucial to tasks like machine translation where a word’s translation often depends on multiple words or the entire context of the input sentence.

In terms of implementation, the cross-attention mechanism works similarly to the standard (self-) attention mechanism. The difference lies in what is used as queries, keys, and values:

- Queries come from the previous decoder layer (the output of the masked self-attention layer).

- Keys and values come from the output of the encoder.

This way, the decoder can focus on different parts of the source sequence for every target position.
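A small sketch may help show that the only change from self-attention is where the queries, keys, and values come from. The vectors below are made-up stand-ins for already projected decoder queries and encoder keys and values; in a real model they would be produced by the learned matrices Wq, Wk, and Wv.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each decoder query attends over all encoder positions."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)  # one weight per encoder position
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
        weights.append(w)
    return outputs, weights

# Hypothetical toy data: 2 decoder positions, 3 encoder positions.
decoder_queries = [[1.0, 0.0], [0.0, 1.0]]
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = cross_attention(decoder_queries, encoder_keys, encoder_values)
```

The weight matrix `w` has one row per target position and one column per source position, so the source and target sequences are free to have different lengths.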

To give a specific example, let’s consider a translation task from English to French. The encoder takes in the English sentence and generates a sequence of representations. Then, while generating the French translation, for each French word (each step in the decoder), the model uses the cross-attention to focus on different words in the English sentence. This helps it understand the context and semantics of the source sentence and produce a coherent and accurate translation.

But what is the masked version of the multi-head attention block? Masked multi-head attention is a variant of multi-head attention used in the decoder part of the Transformer model to prevent the model from “seeing” future tokens in the output sequence during training, which would lead to unrealistically good training performance and poor results at test time. This aligns with the autoregressive nature of sequence generation tasks like translation, summarization, etc., where you generate one token at a time.

In the standard multi-head attention mechanism, the attention scores are calculated for all pairs of tokens. In contrast, in the masked version, we add a mask to the attention scores before they’re passed through the softmax function, effectively zeroing-out positions that correspond to ‘future’ tokens. This mask ensures that when generating the output token at position i, the model only has access to output tokens at positions less than i.

To explain further, consider the sentence “The cat sat on the mat”. When the model is predicting the word “sat”, we don’t want it to have access to “on the mat”, because in a real-world scenario, it won’t have this future information. So we mask or hide this information, forcing the model to make the best prediction based on the words “The cat” only.

To perform the masking operation, we create a mask matrix whose entries above the main diagonal (the strictly upper triangular part) are set to a very large negative value (like -1e9) and whose remaining entries are zero. This mask is then added to the scaled dot products of the query and key vectors. When these large negative values are passed through the softmax function, the corresponding weights become very close to zero, so the future tokens are effectively ignored.

This mechanism is vital to maintain the causality property in the decoder, preventing the current output token from depending on future output tokens.

source: __Introduction to Deep Learning — Raschka__

The following slide shows another example for masking:

source: __Introduction to Deep Learning — Raschka__
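The masking step can be sketched as follows. The score matrix is a made-up example standing in for already scaled query-key dot products, and -1e9 plays the role of the very large negative value.

```python
import math

NEG_INF = -1e9  # large negative value used to hide future positions

def causal_mask(T):
    """T x T mask: 0 on and below the diagonal, -1e9 above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Add the causal mask to the scores, then apply softmax row by row."""
    mask = causal_mask(len(scores))
    return [softmax([s + m for s, m in zip(score_row, mask_row)])
            for score_row, mask_row in zip(scores, mask)]

# Hypothetical scaled scores for a 3-token target sequence.
scores = [
    [0.5, 1.0, -0.3],
    [0.2, 0.1, 0.9],
    [1.2, -0.7, 0.4],
]
weights = masked_attention_weights(scores)
```

Row i of `weights` is nonzero only in columns 0 to i, so the first token can attend only to itself, while the last token can attend to the whole sequence.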

In terms of learning different concepts via different heads, there is no guarantee that different attention heads in the Transformer architecture will learn different concepts. However, the hope is that by initializing the weights of the model randomly, and by having multiple heads that learn in parallel, the model will find different useful patterns to pay attention to.

Evidence suggests that different heads do indeed learn different types of attention patterns. For example, some heads may specialize in syntax (like paying attention to the previous word), others in longer-range dependencies (like matching parentheses in a sentence), or specific semantic relationships (like attending from an object to its attributes).

We can analyze this post-training by looking at the attention patterns that different heads produce on a dataset, or using probing tasks that aim to determine what kind of information is captured by different layers or heads in the model.

However, the interpretability of attention heads is still an open research question. While we can find heads that seem to correspond to certain intuitive patterns, many heads do not correspond to anything easily interpretable. Furthermore, recent research has shown that the attention weights might not be as interpretable as we’d like to think, and we should be cautious in over-interpreting them.

In terms of encouraging heads to learn different things during training, there’s ongoing research into techniques for doing this, such as using regularization terms in the loss function that encourage diversity among the heads.

**Add & Norm**

After the multi-head attention block in the encoder and also in some other parts of the transformer architecture, we have a block called Add&Norm. The Add part of this block is a residual connection, similar to ResNet: the input of a block is added to the output of that block, x + layer(x).

Normalization is a technique used in deep learning models to stabilize the learning process and reduce the number of training epochs needed to train deep networks. In the Transformer architecture, a specific type of normalization, called Layer Normalization, is used.

Layer Normalization (LN) is applied over the last dimension (the feature dimension) in contrast to Batch Normalization which is applied over the first dimension (the batch dimension).

Basically, we do Layer Normalization across the last dimension (the feature dimension, of size de in the Transformer). This means that the mean and variance are computed over all the features of a single token vector, and this is done for each position and each example separately.

In Batch Normalization, you calculate the mean and variance across the batch dimension, so each feature is normalized with statistics shared by all the examples in a batch.

In Layer Normalization, you normalize across the feature dimension, and this normalization does not depend on other examples in the batch. It’s computed independently for each example, hence it’s better suited to tasks where the batch size or sequence length can vary (like sequence-to-sequence tasks such as translation and summarization).

Unlike Batch Normalization, Layer Normalization performs exactly the same computation at training and test times. It’s not dependent on the batch of examples, and it has no effect on the representation ability of the network.
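Here is a minimal sketch of the Add&Norm step for a short sequence. It leaves out the learnable gain and bias that Layer Normalization normally includes, and the `sublayer_out` values are invented stand-ins for the output of an attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(x_t, s_t)])
            for x_t, s_t in zip(x, sublayer_out)]

# Toy data: T = 2 tokens with 4 features each.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 2.0, -1.0]]
# Hypothetical sub-layer output with the same shape as x.
sublayer_out = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.0, -1.0, 0.5]]
y = add_and_norm(x, sublayer_out)
```

Each token is normalized using only its own features, so the result for one token does not change if other tokens or other examples are added to the batch.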

In the Transformer model, Layer Normalization is applied in the following areas:

1. **After each sub-layer (Self-Attention or Feed-Forward)**: Each sub-layer (either a multi-head self-attention mechanism or a position-wise fully connected feed-forward network) in the Transformer is followed by a Layer Normalization step. This is combined with residual connections.

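2. **Before the final output layer**: The output of the stack of decoder layers is also normalized before it is fed into the final linear layer and softmax for prediction. In the original (post-norm) design this comes from the Add&Norm step that ends the last decoder layer; pre-norm variants add a separate final Layer Normalization instead.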

The addition of Layer Normalization in these areas helps to stabilize the learning process and allows the model to be trained more effectively. Moreover, it also aids in achieving higher performance and faster training times.

source: __Introduction to Deep Learning — Raschka__

**Conclusion**

So, there you have it. We’ve traveled through the heart of the Transformer model and gotten a look at how this game-changing architecture works. From seeing how they handle dependencies and positional encoding to understanding the importance of attention mechanisms, it’s easy to see why Transformers are making such a big splash in machine learning.

But hey, don’t forget that this is just the start! As we’ve seen, Transformer models aren’t just used for language tasks. They show up everywhere in machine learning, changing how we understand and process sequences. Who knows where we’ll see Transformers next? New research and ideas appear all the time.

No matter how long you’ve been in the field or how new you are, I hope this deep look has helped you understand the Transformer model and made you want to learn more. It’s pretty cool, right?

Don’t stop here, though. Continue to learn, look around, and ask questions. We keep pushing the limits in AI because that’s what we do.

For now, that’s all. Until next time, happy learning, and cheers to the great world of AI that keeps us all on our toes!

Source: https://kargarisaac.medium.com/inside-transformers-an-in-depth-look-at-the-game-changing-machine-learning-architecture-f619a704e72