Attention is all you need

This weekend, I spent time reading Google’s now-seminal paper, Attention Is All You Need, which introduced the transformer architecture that forms the foundation of large language models like ChatGPT, Claude, and Gemini.

I’ve forgotten most of the math I studied during my engineering degree, which made it difficult to fully understand the paper. However, I was able to build some intuition. Here are my abstracted notes:

  1. Prior to transformers, techniques like RNNs (Recurrent Neural Networks) were used to train models:
    • RNNs process sequences one step at a time. If you’re reading a sentence, you read word 1, then word 2, then word 3. Each step depends on the previous one.
    • This made the process computationally slow—computing step 5 requires completing step 4, which requires step 3, and so on.
    • Since information has to travel far, RNNs start to fail on large contexts—just as information gets lost or modified in a long chain of Chinese whispers.
  2. The researchers proposed a radical new approach: instead of processing words sequentially, compute the relationships between all words in a sequence simultaneously.
  3. There’s a bunch of math behind this. The paper formalises it using three concepts:
    • Query: what I’m currently looking for
    • Key: what each word offers
    • Value: the information each word contains
  4. These are used to calculate “attention.” Attention is an elegant way of assigning weights to words based on their relationship with other words. For example, in the sentence “The dog sat on the chair as it was tired”, the calculation will enrich the word “it” with context such that its relationship with “dog” has a higher weight than its relationship with “chair.”
  5. Attention is content-aware but position-blind, which can be problematic. Without positional information, a transformer wouldn’t know which noun is the subject and which is the object: “man eats fish” and “fish eats man” would be indistinguishable. To solve this, the researchers added a positional encoding to each word before feeding it into the transformer. This preserves word order and meaning (see the sketch right after this list).
  6. The paper had several implications:
    • Training time compressed dramatically. The researchers achieved far superior results on a translation task in 3.5 days using 8 GPUs—at a fraction of the cost of the best models at that time.
    • Due to the parallel nature of computation, throwing more compute at the problem works well. Capital became a moat, which is why we’re seeing massive investments in GPU acquisition for training.
    • Transformers, due to their architecture, are extensible to multimodal inputs (voice, text, images) and diverse tasks (summarisation, text generation, translation).
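
Since positional encoding is the one piece the math section below doesn’t revisit, here is a minimal numpy sketch of the sinusoidal encoding the paper describes. The sizes and variable names are my own, chosen purely for illustration:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                         # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd dimensions: cosine
    return pe

# Each word's embedding gets its position's encoding added to it, so
# "man eats fish" and "fish eats man" no longer look identical to the model.
embeddings = np.random.rand(3, 8)          # 3 words, toy embedding size of 8
inputs = embeddings + positional_encoding(3, 8)
```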

Transformers made intelligence a scaling problem. Everything since has been a consequence of that.


For the more technically inclined, here is how the math works:

The formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

QKᵀ (The Score): We take the dot product of your Query with everyone else’s Keys. This tells us how much “attention” to pay to every other word.

1/√dₖ (The Scaling): If the vectors are high-dimensional, the dot products get huge, pushing the softmax into regions where gradients are tiny. We divide by √dₖ to keep the math stable.

softmax (The Probabilities): This turns the scores into percentages that sum to 1 (e.g., “Pay 90% attention to word A, 10% to word B”).

×V (The Weighted Sum): We multiply these percentages by the actual information (Values) and sum the results, giving the word a new, context-enriched vector.
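
Putting those four pieces together, the whole mechanism fits in a few lines of numpy. This is a minimal single-head sketch with made-up toy sizes, not the full multi-head version from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # the Score, scaled down
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # the Probabilities (softmax)
    return weights @ V                               # the Weighted Sum

# 4 words, embedding dimension 8 (toy sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8): one enriched vector per word
```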

I asked Gemini to explain this to me through an example in the style of Sal Khan, and in my opinion it did a great job:

Let’s walk through a concrete numerical example of how a Transformer calculates attention for a simple two-word sentence: “Hello World.”


Step 1: Turning Words into Vectors (Embeddings)

First, we can’t do math on letters, so we turn each word into a list of numbers. For this example, let’s say our Embedding Dimension is 3.

  • Hello: [1,0,1]
  • World: [0,1,1]

Step 2: Creating Q, K, and V

In the real world, we multiply these by weight matrices (W_Q, W_K, W_V). To keep it simple for our example, let’s assume the model has “learned” weights that just pass the values through.

We’ll focus on the word “Hello.” It needs to calculate its new meaning.

  • Query (Q) for “Hello”: [1,0,1] (What is “Hello” looking for?)
  • Key (K) for “Hello”: [1,0,1]
  • Key (K) for “World”: [0,1,1]
  • Value (V) for “Hello”: [1,0,1]
  • Value (V) for “World”: [0,1,1]

Step 3: The Similarity Score (The Dot Product)

Now, “Hello” looks at every Key in the sentence (including its own) to see who it should pay attention to. We compute the Dot Product (Q · K).

  1. “Hello” vs. “Hello”: (1×1)+(0×0)+(1×1)=2
  2. “Hello” vs. “World”: (1×0)+(0×1)+(1×1)=1

Step 4: Scaling and Softmax

“Hello” sees a score of 2 for itself and 1 for “World.” But we want these to be percentages.

First, we Scale. The paper says to divide by √dₖ. Here dₖ = 3, so √3 ≈ 1.73.

  • 2 / 1.73 ≈ 1.15
  • 1 / 1.73 ≈ 0.58

Now, we apply the Softmax function (e^xᵢ / Σⱼ e^xⱼ). This squashes the numbers so they sum to 1.

  • e^1.15 ≈ 3.16
  • e^0.58 ≈ 1.78
  • Total Sum: 3.16 + 1.78 = 4.94

Final Attention Weights:

  • Attention to “Hello”: 3.16 / 4.94 ≈ 0.64 (64%)
  • Attention to “World”: 1.78 / 4.94 ≈ 0.36 (36%)

Step 5: The Final Weighted Result

Now, we take those percentages and multiply them by the actual Values (V). This is the “information” each word brings to the table.

New Hello=(0.64×[1,0,1])+(0.36×[0,1,1])

New Hello=[0.64,0,0.64]+[0,0.36,0.36]

New Hello=[0.64,0.36,1.0]

The Big “Aha!” Moment

Look at what happened to the word “Hello.”

It started as [1,0,1]. But after the Attention mechanism, it became [0.64,0.36,1.0].

Because it “paid attention” to the word “World,” its new vector now contains 36% of the information from “World.” It has literally absorbed the context of the sentence into its own numerical representation. If this were a 50-word sentence, “Hello” would look at all 50 words simultaneously and pull in the most relevant bits from each one.
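
For anyone who wants to check the arithmetic, the whole “Hello World” walkthrough fits in a short numpy script (using the same pass-through simplification from Step 2, where Q, K, and V are just the embeddings):

```python
import numpy as np

hello = np.array([1.0, 0.0, 1.0])
world = np.array([0.0, 1.0, 1.0])

# Step 2: the "learned" weights just pass the embeddings through
q_hello = hello
keys    = np.stack([hello, world])
values  = np.stack([hello, world])

# Step 3: dot-product scores against every Key
scores = keys @ q_hello                            # [2.0, 1.0]

# Step 4: scale by √dₖ (dₖ = 3), then softmax
scaled  = scores / np.sqrt(3)                      # [1.15, 0.58]
weights = np.exp(scaled) / np.exp(scaled).sum()    # [0.64, 0.36]

# Step 5: weighted sum of the Values
new_hello = weights @ values                       # [0.64, 0.36, 1.0]
print(weights.round(2), new_hello.round(2))
```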

