
If you recently found out that the “T” in “GPT” stands for “transformer” and wish to learn more, you’ve come to the right place. Transformers are capable of all kinds of tasks, from doing your kid’s homework to drawing detailed pictures of whatever you want. But how do AI transformers work?
AI transformers work by employing self-attention mechanisms to connect every part of the input to each other. This allows transformers to capture dependencies between the different elements and give a highly relevant output to your prompt. They also utilize parallelization to process large datasets.
Keep reading as I’ll go into more detail about transformers and how they work from a technical standpoint. I’ll also explain why AI transformers are such an important invention and why they’re more powerful than their predecessors.
When Were AI Transformers First Introduced?
AI transformers were first introduced in a 2017 paper published by Google Brain. Unlike their neural network predecessors, transformers are amenable to parallelization, meaning they can train on enormous datasets while making fewer errors.
The paper “Attention Is All You Need” introduced the transformer model. It was nothing short of groundbreaking and revolutionary — it completely changed our approach to AI and deep learning.
The paper’s title doesn’t make much sense until you understand what is meant by “attention.”
Namely, “attention” refers to the mechanisms the transformer uses to focus on different parts of an input.
In layman’s terms, a transformer takes a sentence (input), breaks it down, processes each bit, and gives you a unique value (output).
And the transformer can run thousands, if not millions, of these calculations per second, given the hardware is powerful enough.
This is why they’re so revolutionary.
A 2021 paper titled “On the Opportunities and Risks of Foundation Models” calls transformers “foundation models.” Why, you might ask? Because of their “critically central yet incomplete character.”
We’ll dive more into the broad scope of what AI transformers can do and how exactly they work later.
For now, let’s quickly go over why they were invented in the first place.
Why We Need AI Transformers
Before we had transformers, we had recurrent neural networks (RNNs) and convolutional neural networks (CNNs; and no, I’m not talking about the news channel).
RNNs were good at handling sequential dependencies, but they had rather short reference windows. For instance, they could only focus on one clause within a larger sentence.
They also struggled with parallelization, which made their use somewhat limited.
Long short-term memory (LSTM) networks were a variant of RNNs that tried to address these limitations. They were better at dealing with vanishing gradients and had longer memories. Still, they faced many of the same issues as plain RNNs.
CNNs, on the other hand, were state-of-the-art in image processing thanks to their ability to recognize patterns. But, unlike RNNs, they struggled with sequential dependencies.
They also suffered from numerous other drawbacks: limited spatial resolution, poor interpretability, high parameter requirements, and trouble handling large datasets.
What Can AI Transformers Do?
Enter the transformer. It can do everything that RNNs and CNNs can do, but better (for the most part). It parallelizes well, is easier to interpret, and has a significantly better attention span.
It can capture long-range dependencies in sequences thanks to self-attention. This means that you can give it 5-paragraph-long instructions, and it’ll follow them almost to a tee.
AI transformers can perform a wide range of tasks: writing, research, instruction following, image recognition, sentiment analysis, and more. Generative transformers can take one set of input data, learn from it, and come up with original solutions to never-before-seen problems.
Transformers are potent AI tools. If you’ve tried ChatGPT, Midjourney, Dall-E, or Stable Diffusion, then you know what they’re capable of.
They can take practically any prompt you can imagine and create a short body of text or an image in less than 30 seconds.
In comparison to other AI architectures, transformers can:
- Capture long-range dependencies and utilize self-attention to handle complex sequences.
- Process large datasets using parallelization.
- Excel at natural language processing (NLP) tasks like writing, summarizing, research, etc.
- Generate creative, original images with the help of diffusion in less than 30 seconds.
- Learn from existing datasets to generate solutions to previously unseen problems.
But why are AI transformers so good at such a wide range of tasks?
Let’s take a deep dive to better understand their architecture, step-by-step.
How AI Transformers Work
As Hugging Face explains, transformer models are generally trained as language models. And for us humans, language is at the center of everything we do.
And while a pretrained transformer is great at breaking down language and understanding it through numbers (statistically), it can’t do much of practical use with that knowledge yet.
That’s why a form of supervised training called transfer learning is used to fine-tune a model for a specific task.
And that process is what differentiates ChatGPT from Dall-E, for instance, which are both based on the GPT architecture.
Let’s dive right into the deep end as I break down the AI transformer architecture presented in “Attention Is All You Need.”
A transformer consists of two parts: the encoder and the decoder. Here’s how they work:
Encoder
The input data is fed through an embedding layer, where each token (word) is mapped to a vector with continuous values.
The embedded data then goes through positional encoding. Without going too in-depth, it essentially computes sine and cosine functions of each token’s position and adds the resulting values to the corresponding vectors. This gives the transformer information about the position (order) of the tokens in the input sequence.
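To make this a bit more concrete, here’s a minimal sketch of an embedding layer with sinusoidal positional encoding. I’m assuming PyTorch here, and the vocabulary size, model dimension, and maximum length are illustrative values of my own, not anything prescribed by the paper:

```python
import math
import torch
import torch.nn as nn

class EmbeddingWithPosition(nn.Module):
    """Token embedding plus sinusoidal positional encoding (illustrative sizes)."""

    def __init__(self, vocab_size=10000, d_model=512, max_len=5000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # token id -> continuous vector
        # Precompute the sine/cosine position table described in "Attention Is All You Need"
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
        self.register_buffer("pe", pe)

    def forward(self, token_ids):
        x = self.embed(token_ids)                 # (batch, seq_len, d_model)
        return x + self.pe[: token_ids.size(1)]   # add position info to each token vector

# Example: embed a toy "sentence" of 4 token ids
layer = EmbeddingWithPosition()
out = layer(torch.tensor([[5, 42, 7, 99]]))
print(out.shape)  # torch.Size([1, 4, 512])
```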
The next part is the real meat of a transformer and where self-attention comes into play. It’s what allows the transformer to capture dependencies between the tokens in the input.
The data is fed into a multi-headed attention layer, with add & norm layers in between. Self-attention is achieved by feeding the input through three linear layers that produce the query, key, and value vectors.
This query/key/value idea comes from retrieval systems. Let’s put it into an example to help you better understand how it works.
As illustrated by a user on Stack Exchange, the query is what you type into a search bar, the key is the video description, title, and other data, and the value is the best-matched videos.
The query is mapped against the key to get a score matrix. This enables the encoder to determine the relevance between different tokens. In simpler terms, it allows the transformer to prioritize different parts of the input sequence.
The score matrix is divided by the square root of the dimension of the queries and keys to get scaled scores.
Next, a softmax is applied to the scaled scores, producing attention weights between 0 and 1. This is done so that the transformer knows which tokens to prioritize.
The attention weights from the softmax are then multiplied by the value vector to get the output vector.
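Here’s a rough sketch of that scaled dot-product attention step in Python (PyTorch assumed; the tensor sizes are illustrative, and the optional mask will come up again when we reach the decoder):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    """Score the query against the key, scale, softmax, then weight the values."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5       # scaled score matrix
    if mask is not None:
        scores = scores + mask                                 # e.g. -inf for future tokens
    weights = F.softmax(scores, dim=-1)                        # attention weights in [0, 1]
    return weights @ value                                     # output vectors

# Toy example: 4 tokens with 64-dimensional queries, keys, and values
q = k = v = torch.randn(1, 4, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 4, 64])
```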
Note that the query, key, and value vectors are split into N vectors before they’re fed through different self-attention heads. These different heads are what make up the multi-headed attention layer.
The outputs are then concatenated into a single output vector and go through another linear layer.
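A hedged sketch of how that multi-headed layer might look, again assuming PyTorch (version 2.x, for its built-in scaled_dot_product_attention helper); the head count and model dimension are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Split Q, K, V into N heads, attend in each head, concatenate, then project."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)    # query layer
        self.k_proj = nn.Linear(d_model, d_model)    # key layer
        self.v_proj = nn.Linear(d_model, d_model)    # value layer
        self.out_proj = nn.Linear(d_model, d_model)  # linear layer after concatenation

    def forward(self, q_in, k_in, v_in):
        batch, seq_len, d_model = q_in.shape

        def split(x):  # reshape the last dimension into (num_heads, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(q_in)), split(self.k_proj(k_in)), split(self.v_proj(v_in))
        heads = F.scaled_dot_product_attention(q, k, v)               # one result per head
        concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)  # concatenate heads
        return self.out_proj(concat)

x = torch.randn(1, 4, 512)
print(MultiHeadAttention()(x, x, x).shape)  # torch.Size([1, 4, 512])
```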
In the last step within the encoder, the output vector is added to the original input and goes through layer normalization.
That result then goes through a pointwise feed-forward network, and its output is again combined with a residual connection and layer normalization. The encoder’s final output is then passed on to the decoder. This is what allows the transformer to learn from the process while continuing to process the data.
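Putting the pieces together, here’s a rough sketch of one encoder block, using PyTorch’s built-in MultiheadAttention for brevity; the dimensions and ReLU activation are illustrative choices on my part:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention + add & norm, then feed-forward + add & norm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(             # pointwise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # self-attention over the input
        x = self.norm1(x + attn_out)         # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))       # feed-forward, residual, layer normalization
        return x

x = torch.randn(1, 4, 512)
print(EncoderLayer()(x).shape)  # torch.Size([1, 4, 512])
```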
Decoder
If you’re starting to feel a bit tired from understanding encoders, don’t worry. Decoders are more or less similar to encoders.
The decoder’s input is the output sequence shifted right. It’s the target sequence, shifted by one position so that the model has to predict each token without seeing it. This prevents the transformer from peeking at what it’s trying to predict.
Like the encoder, the input goes through output embedding and positional encoding.
It then goes into the first multi-headed attention layer to get the attention score for the input data.
The key difference here is that the decoder cannot tap into future tokens. This limitation is applied on a word-by-word basis: each word only has access to itself and the previous words in the sequence.
This is enforced by adding a mask with negative infinities (-inf) in its upper-right triangle to the scaled scores. The mask is applied before the softmax, so the -inf entries become zeros and the model pays no attention to future words in the sequence.
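Here’s a tiny illustration of that mask in Python (PyTorch assumed; the 4-token sequence is just for demonstration):

```python
import torch

def causal_mask(seq_len):
    """Upper-triangular mask: -inf above the diagonal blocks attention to future tokens."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(4, 4)                # toy scaled scores for a 4-token sequence
masked = scores + causal_mask(4)          # future positions become -inf
weights = torch.softmax(masked, dim=-1)   # softmax turns the -inf entries into 0
print(weights)  # each row only attends to itself and earlier tokens
```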
In the second multi-headed attention layer, the encoder’s output finally enters the decoder. It supplies the keys and values, whereas the outputs from the first multi-headed attention layer supply the queries.
This is where the magic happens. By matching its queries against the encoder’s keys, the decoder learns which parts of the encoder’s input it should focus on.
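A minimal sketch of this encoder-decoder (“cross”) attention step, using PyTorch’s built-in MultiheadAttention with illustrative sizes:

```python
import torch
import torch.nn as nn

# Cross-attention: the decoder's vectors supply the queries,
# while the encoder's output supplies the keys and values.
cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
decoder_states = torch.randn(1, 3, 512)   # output of the decoder's first attention layer
encoder_output = torch.randn(1, 4, 512)   # output of the encoder stack
out, weights = cross_attn(query=decoder_states, key=encoder_output, value=encoder_output)
print(out.shape, weights.shape)  # torch.Size([1, 3, 512]) torch.Size([1, 3, 4])
```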
It’s then fed to another pointwise feed-forward layer that allows the transformer to learn further and also to finish the process.
In the end, it goes through a linear classifier and a softmax layer. The word with the highest softmax value is the predicted word in the sequence and the one you see at the end.
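And a small sketch of that final step, assuming PyTorch and an illustrative vocabulary size:

```python
import torch
import torch.nn as nn

# Final step: project the decoder output onto the vocabulary and pick the most likely token.
vocab_size, d_model = 10000, 512              # illustrative sizes
classifier = nn.Linear(d_model, vocab_size)
decoder_output = torch.randn(1, 3, d_model)   # one vector per generated position
probs = torch.softmax(classifier(decoder_output), dim=-1)
next_token = probs[:, -1].argmax(dim=-1)      # highest-probability word for the last position
print(next_token.shape)  # torch.Size([1])
```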
Remember that the decoder (like the encoder) can be stacked N layers high, which is what allows for significantly better results.
As AI models advance and gain impressive capabilities in natural language processing, it’s natural to wonder about their self-awareness. Read our article “Does AI Know It’s AI” to explore whether AI machines are aware of their own existence.
Final Thoughts
And that’s how transformers work, as described in the seminal paper “Attention Is All You Need.”
AI transformers are a revolutionary technology, as they’re remarkably good at understanding human language. They’re much more thorough than their predecessors.
Sources
- Nvidia: What Is a Transformer Model?
- YouTube: Illustrated Guide to Transformers Neural Network: A step by step explanation
- Hugging Face: How do Transformers work?
- NeurIPS: Attention Is All You Need
- arXiv: On the Opportunities and Risks of Foundation Models
- Wikipedia: Transformer (machine learning model)
- Vaclav Kosar: Transformer Embeddings and Tokenization
- Stack Exchange: What exactly are keys, queries, and values in attention mechanisms?
- DeepAI: Transformer Neural Network Definition
