Three Breakthroughs That Shaped the Modern Transformer Architecture

David Bressler
February 12, 2025

Introduction

Perhaps the most surprising aspect of the evolution of the Transformer architecture for language modeling is how consistent it has remained over nearly a decade. The most cutting-edge OpenAI or DeepSeek model would still look quite familiar to an NLP engineer from 2017, when the seminal paper “Attention Is All You Need” introduced the core design [Vaswani et al., 2017]. Nevertheless, a few refinements to the Transformer architecture have proven truly significant, improving model quality, stability, and efficiency. This post covers three of the most important of those architectural improvements.

Scope Clarification
This post focuses on advancements to the Transformer’s architecture that specifically enhance model performance. Many other innovations exist that primarily focus on efficiency or memory reduction—sometimes at the cost of accuracy—and I will not delve into those here. Examples of such efficiency-focused improvements include the following:

  • Reformer [Kitaev et al., 2020] (uses locality-sensitive hashing for attention)
  • Linformer [Wang et al., 2020] (low-rank projections of keys and values for linear-complexity attention)
  • Performer [Choromanski et al., 2021] (random feature maps for linear attention)
  • Big Bird [Zaheer et al., 2020] (sparse attention over large contexts)

I also do not discuss training procedure innovations, including the important step of alignment, which typically involves reinforcement learning from human feedback (RLHF) [Stiennon et al., 2020; Ouyang et al., 2022], as well as other training advances such as large-batch training [You et al., 2019], mixed-precision training [Micikevicius et al., 2018], or parameter-efficient fine-tuning methods like LoRA [Hu et al., 2022]. These training-related improvements are critically important for building today’s massive language models, but they lie outside the scope of the architecture-centered focus here.

A QUICK REVIEW OF TRANSFORMERS

The Transformer architecture introduced by Vaswani et al. [2017] radically changed the landscape of NLP by discarding recurrence and convolutions in favor of self-attention. At a high level, a Transformer for language modeling has these key components (see the figure below, from the original paper):

  1. Embedding Layer
    Each token (a word or subword unit) in the input sequence is converted to a continuous vector representation.
  2. Positional Encoding
    The original Transformer used a sinusoidal function to encode position information, which is then added to the input embeddings. This positional signal allows the self-attention mechanism to differentiate between tokens at different positions.
  3. Multi-Head Self-Attention
    Self-attention gathers and combines information across all positions in the sequence. The token at each position produces its own Query, Key, and Value vectors. For each token’s Query, attention weights are computed via dot products with the Keys of all tokens, then normalized (via softmax) and used to weight the Value vectors before summation. Multi-head attention means the process is repeated multiple times in parallel with different learned projections, allowing the model to attend to multiple representation subspaces.
  4. Feed-Forward Network (FFN)
    After self-attention, the position-wise feed-forward network applies the same MLP transformation to each token individually (i.e., independently of other positions), typically two linear layers with a nonlinearity such as GELU in between. This allows each token to refine its own features in isolation before the model once again combines information across positions.
  5. Residual Connections & Layer Normalization
    Each sublayer (attention or FFN) is wrapped in a residual connection, plus a layer normalization (LN). In the original (“post-norm”) Transformer, layer normalization is applied after the residual sum.
  6. Stacked Layers
    The above (multi-head attention + FFN + normalization) block is repeated L times, forming a deep network. The final output can then be projected to vocabulary space for next-token prediction (language modeling) or used for classification, etc.

Despite its simplicity, the Transformer’s self-attention mechanism enables learning long-range dependencies efficiently across sequences, and the architecture scales well when combined with large amounts of data and distributed training.

[Figure: The Transformer model architecture (Vaswani et al., 2017)]
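
To make the review concrete, here is a minimal PyTorch sketch of a single post-norm Transformer block. The class name, hyperparameter values, and use of GELU are illustrative choices for the sketch, not taken from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm Transformer block: multi-head self-attention and a
    position-wise FFN, each wrapped in a residual connection followed by
    LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sublayer: residual add, then normalize (post-norm).
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln1(x + attn_out)
        # Feed-forward sublayer: applied to each position independently.
        x = self.ln2(x + self.ffn(x))
        return x

x = torch.randn(2, 16, 512)     # (batch, sequence, d_model)
y = TransformerBlock()(x)       # same shape as the input
```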

1) MOVING NORMALIZATION OUTSIDE THE RESIDUAL STREAM (“PRE-NORM”)

Background: Layer Normalization

Layer normalization (LN) [Ba et al., 2016] normalizes the features of each sample so that they have zero mean and unit variance. In Transformers, LN helps stabilize training by controlling the distribution of activations inside each layer.
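
For concreteness, here is a tiny sketch of the per-token computation LN performs; gamma and beta are the standard learned scale and shift parameters.

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the feature dimension of each token to zero mean and unit
    variance, then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(2, 16, 512)     # (batch, sequence, features)
out = layer_norm(x, torch.ones(512), torch.zeros(512))
```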

Post-Norm vs. Pre-Norm

The original Transformer applied LN in a “post-norm” fashion [Vaswani et al., 2017]: after each sublayer’s output was added to the residual stream, the combined sum was normalized. Formally (for a sublayer):

y = sublayer(x)

z = x + y

o = LN(z)

x = o

Subsequent research [Wang et al., 2019; Xiong et al., 2020] showed that switching to “pre-norm” (i.e., applying LN before the sublayer) yields better training stability and allows scaling to deeper networks. In “pre-norm,” the sublayer sees a normalized input, while the residual stream itself remains an unnormalized sum of all outputs (see Figure from Xiong et al., 2020):

x' = LN(x)

y  = sublayer(x')

o  = x + y

x  = o
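
The two arrangements differ only in where the normalization sits relative to the residual sum. A minimal sketch, with sublayer and ln standing in for an attention/FFN module and a LayerNorm:

```python
import torch
import torch.nn as nn

# Post-norm (original Transformer): normalize after the residual sum,
# so gradients to earlier layers pass through every LayerNorm.
def post_norm_step(x, sublayer, ln):
    return ln(x + sublayer(x))

# Pre-norm: normalize only the sublayer's input; the residual stream itself
# stays an unnormalized sum, giving gradients a direct path to earlier layers.
def pre_norm_step(x, sublayer, ln):
    return x + sublayer(ln(x))

# Example with a feed-forward sublayer (dimensions are illustrative).
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
ln = nn.LayerNorm(512)
x = torch.randn(2, 16, 512)
y = pre_norm_step(x, ffn, ln)
```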

Why Does Pre-Norm Help?

By keeping the residual pathway clean (just a direct sum across layers), gradients can backpropagate more directly to earlier layers. In post-norm, the gradient to earlier layers must pass through a normalization of (x + y). Because LN’s derivative is not simply an identity function, it can dampen or amplify signals in ways that are more unpredictable over many layers. Pre-norm avoids that complication for the residual stream, improving gradient flow and training stability—especially critical for very deep Transformers.

Moreover, in pre-norm:

  • Each sublayer consistently sees a well-conditioned input (via LN).
  • The residual sum can grow in magnitude, but that is not problematic because each subsequent sublayer normalizes its input again before processing.
  • Many implementations also add small scaling factors to the residual connections to keep forward and backward signals in a stable range.

As a result, most modern Transformer-based architectures (e.g., GPT-3, PaLM, LLaMA) adopt pre-norm by default [Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023].

[Figure: Post-norm vs. pre-norm Transformer layers (Xiong et al., 2020)]

2) ROTARY POSITIONAL ENCODINGS (RoPE)

Why Positional Embeddings?

Self-attention is inherently order-invariant: without positional information, the model treats the input as an unordered set of token embeddings. Positional encodings or embeddings break this symmetry by informing the model about where each token occurs within a sequence.

Originally, Transformers used absolute position embeddings [Vaswani et al., 2017], adding a sinusoidal or learned vector to each token’s input. Later, relative position embeddings [Shaw et al., 2018] were proposed, encoding the difference between query and key positions within the attention mechanism.
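
As a reference point, here is a minimal sketch of the original sinusoidal encoding; the function name is illustrative, and the base constant 10000 follows the original paper.

```python
import torch

def sinusoidal_positions(seq_len, d_model):
    """Absolute sinusoidal positional encodings (Vaswani et al., 2017):
    each dimension pair uses a different frequency, and the result is
    added to the token embeddings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    inv_freq = 10000 ** (-torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    angles = pos * inv_freq                                             # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

embeddings = torch.randn(16, 512)                 # (sequence, d_model)
x = embeddings + sinusoidal_positions(16, 512)    # position-aware input
```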

From Relative Position Embeddings to RoPE

Rotary Positional Encodings (RoPE) [Su et al., 2021/2023] extend relative position embeddings by applying a rotation operation to the Query and Key vectors that encodes their relative positions. Conceptually:

1. Pair the Dimensions
For a given token’s Query or Key vector of dimension d, split it into d/2 pairs of dimensions, indexed by i (with i ranging from 1 to d/2).

2. Rotate Each Pair
Each pair (u, v) is treated like a 2D point, or a complex number with magnitude sqrt(u^2 + v^2). The pair is rotated by an angle θ via:

u' = u·cos(θ) − v·sin(θ)

v' = u·sin(θ) + v·cos(θ)

3. Position-Dependent Frequencies

The angle θ depends on both the pair index i and the token position p: each pair index i has its own “frequency,” which determines how fast the 2D point rotates, while the token position p determines how far it rotates (in the original formulation, θ = p · 10000^(−2(i−1)/d)). Because every pair of every vector is rotated by a deterministic amount that depends only on position and pair index, relative distances between positions are effectively encoded in the phase differences across multiple frequencies.

4. Compute Attention
Once each pair of a given Query or Key has been rotated, the pairs are recombined into a single vector of dimension d, and the standard dot-product attention is computed between the Queries and the Keys. The rotations are constructed so that the dot product depends on the relative positions of tokens rather than solely on their absolute positions (see the code sketch after this list).
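
Here is a minimal sketch of applying RoPE to one attention head's Queries and Keys. The even/odd pairing of dimensions and the base constant 10000 are one common convention; the function name is illustrative.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate consecutive dimension pairs of each query/key vector by an
    angle theta = position * frequency, with one frequency per pair index."""
    seq_len, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # positions p
    freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # one frequency per pair i
    theta = pos * freq                                                  # (seq_len, d/2)
    cos, sin = torch.cos(theta), torch.sin(theta)
    u, v = x[..., 0::2], x[..., 1::2]                                   # split dimensions into pairs (u, v)
    # 2D rotation of each pair: (u, v) -> (u cos θ - v sin θ, u sin θ + v cos θ)
    rotated = torch.stack((u * cos - v * sin, u * sin + v * cos), dim=-1)
    return rotated.flatten(-2)                                          # recombine back to dimension d

q = torch.randn(16, 64)                     # (sequence, head_dim) Queries
k = torch.randn(16, 64)                     # (sequence, head_dim) Keys
scores = apply_rope(q) @ apply_rope(k).T    # attention logits now depend on relative position
```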

Benefits

  • Smooth Relative Encoding: RoPE encodes relative position information continuously, without needing a huge table of embeddings for all possible offsets.
  • Multi-Scale Awareness: The chosen frequencies for the pair indices are usually spaced exponentially, which captures both local and long-distance positional relationships.
  • Easy Extension to Long Contexts: With RoPE, context length can be extended by rescaling the rotation angles for longer sequences, without having to learn new position embeddings.
  • Simplicity in Implementation: Only a rotation of the Query and Key vectors is required; the Values and the rest of the attention computation are unchanged.

As a result, RoPE (and related approaches) has been adopted in many state-of-the-art LLMs that target long context capabilities, such as LLaMA [Touvron et al., 2023], DeepSeek, and several others.

[Figure: Rotary positional embeddings (Su et al., 2023)]

3) MIXTURE OF EXPERTS (MoE)

MoE in a Nutshell

A Mixture of Experts (MoE) layer replaces the standard feed-forward sublayer of the Transformer with multiple “expert” sub-networks. Each expert is itself typically a feed-forward network with the same shape as the original FFN. A gating function decides which expert(s) each token is routed to, usually based on the dot product between the token’s representation and a set of gating parameters [Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2022].

How It Works

  1. Token-Level Routing: For each token, compute the dot product with each expert’s “centroid” vector to generate affinity scores.
  2. Top-K Selection: Choose the top-K experts (often K=1 or 2) according to the largest affinity scores.
  3. Expert Processing: Each selected expert is a separate FFN that transforms the token independently.
  4. Combine Outputs: If K>1, combine expert outputs (often via a weighted sum using the gating scores).

By routing tokens to only a fraction of the total experts, MoE can dramatically increase the model’s overall parameter count (capacity) without increasing the per-token compute cost by the same factor. It also allows the model to learn more specialized sub-networks that focus on certain token patterns or language phenomena.
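
A minimal sketch of token-level top-K routing with a simple softmax gate is below. Real systems typically add load-balancing losses, capacity limits, and careful parallelism, all omitted here; the class name and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Replace the FFN with n_experts smaller FFNs; each token is routed to
    its top_k experts and their outputs are combined with the gate scores."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # expert "centroids"
        self.top_k = top_k

    def forward(self, x):                                    # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)             # affinity of each token to each expert
        weights, idx = scores.topk(self.top_k, dim=-1)       # keep only the top_k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = idx[:, slot] == e                     # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(32, 512)      # 32 tokens routed independently
y = MoELayer()(tokens)
```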

DeepSeekMoE (2024) Advances

In “DeepSeekMoE” [Dai et al., 2024], the number of experts per MoE layer is increased, while the size of each individual expert is reduced proportionally. Key additional ideas:

  • Fine-Grained Expert Segmentation: Many smaller experts rather than fewer large ones.
  • Higher K: More experts can be selected simultaneously to leverage more specialized transformations.
  • Shared Expert Isolation: A subset of “shared” experts is used for all tokens, handling broadly useful knowledge, while the rest can specialize in narrower linguistic or domain-specific patterns.

This “fine-grained plus shared” approach maintains constant total compute while improving the potential for specialization. Each token picks from a broader pool of specialized experts, and the shared experts capture universal functions in a single place.
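
A rough sketch of the shared-plus-routed arrangement, reusing the MoELayer sketch from the previous section; the expert counts and sizes here are illustrative, not DeepSeekMoE's actual configuration.

```python
import torch
import torch.nn as nn

class FineGrainedMoE(nn.Module):
    """Sketch of the DeepSeekMoE idea: many small routed experts plus a few
    shared experts that every token always passes through ("shared expert
    isolation"). Routing reuses the MoELayer sketch above."""
    def __init__(self, d_model=512, d_ff=128, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_shared)
        )
        self.routed = MoELayer(d_model, d_ff, n_routed, top_k)

    def forward(self, x):
        shared_out = sum(expert(x) for expert in self.shared)   # always-on shared experts
        return shared_out + self.routed(x)                      # plus top_k routed experts

tokens = torch.randn(32, 512)
y = FineGrainedMoE()(tokens)
```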

[Figure: DeepSeekMoE architecture (Dai et al., 2024)]

CONCLUSION

From the original post-norm Transformer to today’s cutting-edge language models, most of the core architecture of Transformers has remained intact: self-attention, residual connections, layer norms, and feed-forward layers. However, three architectural refinements have proven particularly impactful:

  1. Pre-Norm Layer Normalization – Simplifies gradient flow and stabilizes deeper models.
  2. Rotary Positional Encodings – Elegantly incorporate relative positional awareness at multiple scales.
  3. Mixture of Experts – Increases model capacity and specialization without linear increases in compute.

These innovations, combined with large-scale data, clever training strategies, and efficiency improvements, have powered the remarkable performance leaps in large language models.