Perhaps the most surprising aspect of the evolution of the Transformer architecture for language modeling is how consistent it has remained over the past near-decade. The most cutting-edge OpenAI or DeepSeek model would still look quite familiar to an NLP engineer from 2017, when the seminal paper “Attention Is All You Need” introduced the core design [Vaswani et al., 2017]. Nevertheless, a few refinements to the Transformer architecture have proven to be truly significant, improving model quality, stability, and efficiency. This blog covers three of the most important of those architectural improvements.
Scope Clarification
This post focuses on advancements to the Transformer’s architecture that specifically enhance model performance. Many other innovations primarily target efficiency or memory reduction, sometimes at the cost of accuracy, and I will not delve into those here.
I also do not discuss training procedure innovations, including the important step of alignment, which typically involves reinforcement learning from human feedback (RLHF) [Stiennon et al., 2020; Ouyang et al., 2022], as well as other training advances such as large-batch training [You et al., 2019], mixed-precision training [Micikevicius et al., 2018], or parameter-efficient fine-tuning methods like LoRA [Hu et al., 2022]. These training-related improvements are critically important for building today’s massive language models, but they lie outside the scope of the architecture-centered focus here.
The Transformer architecture introduced by Vaswani et al. [2017] radically changed the landscape of NLP by discarding recurrence and convolutions in favor of self-attention. At a high level, a Transformer for language modeling has a few key components (see the architecture diagram in Vaswani et al. [2017]): token embeddings combined with positional information, a stack of identical blocks that pair multi-head self-attention with a position-wise feed-forward network, residual connections and layer normalization around each sublayer, and a final projection to vocabulary logits.
Despite its simplicity, the Transformer’s self-attention mechanism enables learning long-range dependencies efficiently across sequences, and the architecture scales well when combined with large amounts of data and distributed training.
Layer normalization (LN) [Ba et al., 2016] normalizes each token’s feature vector to zero mean and unit variance, followed by a learned scale and shift. In Transformers, LN helps stabilize training by controlling the distribution of activations inside each layer.
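For reference, here is a minimal sketch of what LN computes, written in PyTorch; the function name, the epsilon, and the explicit gain/bias arguments are illustrative choices, not any particular model’s settings:

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, seq_len, d_model); normalize each token's feature vector
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance per token
    return gamma * x_hat + beta                 # learned scale and shift

x = torch.randn(2, 4, 8)
gamma, beta = torch.ones(8), torch.zeros(8)
out = layer_norm(x, gamma, beta)
# agrees with F.layer_norm(x, (8,), gamma, beta) up to floating-point error
```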
The original Transformer applied LN in a “post-norm” fashion [Vaswani et al., 2017]: after each sublayer’s output was added to the residual stream, the combined sum was normalized. Formally (for a sublayer):
y = sublayer(x)
z = x + y
o = LN(z)
x = o
Subsequent research [Wang et al., 2019; Xiong et al., 2020] showed that switching to “pre-norm” (i.e., applying LN before the sublayer) yields better training stability and allows scaling to deeper networks. In “pre-norm,” the sublayer sees a normalized input, while the residual stream itself remains an unnormalized sum of all outputs (see Figure from Xiong et al., 2020):
x' = LN(x)
y = sublayer(x')
o = x + y
x = o
By keeping the residual pathway clean (just a direct sum across layers), gradients can backpropagate more directly to earlier layers. In post-norm, the gradient to earlier layers must pass through a normalization of (x + y). Because LN’s derivative is not simply the identity, it can dampen or amplify signals in ways that compound unpredictably over many layers. Pre-norm avoids that complication for the residual stream, improving gradient flow and training stability, which is especially critical for very deep Transformers.
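To make the two orderings concrete, here is a minimal PyTorch sketch of a single sublayer wrapped post-norm versus pre-norm; the class names are mine, and sublayer stands in for any attention or feed-forward module that preserves the input shape:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    # original Transformer ordering: normalize the residual sum itself
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer           # attention or feed-forward module
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    # pre-norm ordering: normalize only the sublayer's input;
    # the residual stream x stays an unnormalized sum of sublayer outputs
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```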
Moreover, pre-norm models tend to train stably without the long learning-rate warmup that post-norm typically requires [Xiong et al., 2020]; the main cost is that activations on the residual stream can grow in magnitude with depth, so a final LN is usually applied before the output projection.
As a result, most modern Transformer-based architectures (e.g., GPT-3, PaLM, LLaMA) adopt pre-norm by default [Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023].
Self-attention is inherently order-agnostic: if tokens differ only by their embeddings, permuting the input sequence simply permutes the corresponding outputs. Positional encodings or embeddings break this symmetry by informing the model about where each token occurs within a sequence.
Originally, Transformers used absolute position embeddings [Vaswani et al., 2017], adding a sinusoidal or learned vector to each token’s input. Later, relative position embeddings [Shaw et al., 2018] were proposed, encoding the difference between query and key positions within the attention mechanism.
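For reference, here is a minimal sketch of the sinusoidal absolute encoding from Vaswani et al. [2017]; the helper name is mine, and d_model is assumed even:

```python
import torch

def sinusoidal_positions(seq_len, d_model):
    # pe[p, 2i]   = sin(p / 10000^(2i/d_model))
    # pe[p, 2i+1] = cos(p / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    inv_freq = 10000.0 ** (-torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    angles = pos * inv_freq                                              # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe  # added to the token embeddings: x = embed(tokens) + pe
```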
Rotary Positional Encodings (RoPE) [Su et al., 2021/2023] extend relative position embeddings by applying a position-dependent rotation to the Query and Key vectors so that their dot products encode relative positions. Conceptually (a code sketch follows the four steps below):
1. Pair the Dimensions
For a given token’s Query or Key vector of dimension d, split it into d/2 pairs of dimensions, indexed by ‘i’ (with ‘i’ ranging from 1 to d/2).
2. Rotate Each Pair
Each pair (u, v) is treated like a 2D point or a complex number with magnitude sqrt(u^2 + v^2). The pair is rotated by an angle θ via (u, v) → (u cos θ - v sin θ, u sin θ + v cos θ).
3. Position-Dependent Frequencies
The angle is chosen based on the pair index ‘i’ and the token position ‘p’: θ = p · θ_i, where θ_i is a per-pair “frequency” (in the original formulation, θ_i = 10000^(-2(i-1)/d)). The pair index thus determines how fast the 2D point rotates, while the token position determines how far it has rotated. Because every pair of every vector is rotated by an amount fixed by its position and pair index, relative distances between positions end up encoded in the phase differences across multiple frequencies.
4. Compute Attention
Once each pair of a given Query or Key has been rotated, the pairs are recombined back into a single vector of dimension d, and the standard dot-product attention is computed between the Queries and the Keys. The angles are constructed so that the dot product depends on the relative distance between query and key positions rather than on their absolute positions.
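Here is a minimal sketch of those four steps applied to a query or key tensor; the pairing of adjacent dimensions, the assumed tensor shape, and the base of 10000 follow common convention rather than any single implementation:

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, seq_len, n_heads, head_dim), head_dim even
    batch, seq_len, n_heads, d = x.shape
    # one "frequency" per dimension pair: theta_i = base^(-2i/d) for i = 0..d/2-1
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    pos = torch.arange(seq_len, dtype=torch.float32)                       # (seq_len,)
    angles = torch.einsum("p,i->pi", pos, inv_freq)                        # (seq_len, d/2)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    u, v = x[..., 0::2], x[..., 1::2]      # step 1: split dimensions into pairs
    # steps 2-3: rotate each pair by its position- and frequency-dependent angle
    rotated = torch.stack((u * cos - v * sin, u * sin + v * cos), dim=-1)
    return rotated.flatten(-2)             # recombine pairs -> (batch, seq_len, n_heads, d)

# step 4: q, k = apply_rope(q), apply_rope(k), then standard dot-product attention
```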
As a result, RoPE (and related approaches) has been adopted in many state-of-the-art LLMs that target long context capabilities, such as LLaMA [Touvron et al., 2023], DeepSeek, and several others.
A Mixture of Experts (MoE) layer replaces the standard feed-forward sublayer of the Transformer with multiple “expert” sub-networks. Each expert is itself typically a feed-forward network with the same shape as the original FFN. A gate decides which expert(s) each token is routed to, usually based on the dot product between the token representation and some gating parameters [Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2022].
By routing tokens to only a fraction of the total experts, MoE can dramatically increase the model’s overall parameter count (capacity) without increasing the per-token compute cost by the same factor. It also allows the model to learn more specialized sub-networks that focus on certain token patterns or language phenomena.
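The sketch below shows a simple top-k routed MoE layer in the spirit of this gating idea; the class name is mine, the softmax over the top-k scores is one common choice, and practical details such as load-balancing losses and capacity limits are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, top_k=2):
        super().__init__()
        # each expert mirrors the shape of a standard Transformer FFN
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # gating parameters
        self.top_k = top_k

    def forward(self, x):
        # x: (n_tokens, d_model), tokens flattened across batch and sequence
        scores = self.gate(x)                            # dot product with gating weights
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)          # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

In practice the explicit loops would be replaced by batched gather/scatter kernels and paired with an auxiliary load-balancing loss; they are written out here only to make the routing logic visible.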
In “DeepSeekMoE” [Dai et al., 2024], the number of experts per MoE layer is increased, while the size of each individual expert is reduced proportionally. The key additional ideas are fine-grained expert segmentation (each token’s compute budget is spread over more, smaller experts, so the gate can combine a larger number of them) and shared expert isolation (a few experts are always active for every token and absorb common knowledge, leaving the routed experts free to specialize).
This “fine-grained plus shared” approach maintains constant total compute while improving the potential for specialization. Each token picks from a broader pool of specialized experts, and the shared experts capture universal functions in a single place.
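Below is a minimal sketch of how shared experts can sit alongside the routed pool, reusing the illustrative MoELayer (and imports) from the earlier sketch; in a real DeepSeekMoE-style layer the routed experts would also be smaller and more numerous so that total compute stays roughly constant:

```python
class SharedPlusRoutedMoE(nn.Module):
    # illustrative combination of always-on shared experts with the routed MoELayer above
    def __init__(self, d_model, d_ff, n_routed, n_shared, top_k):
        super().__init__()
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_shared)
        )
        self.routed = MoELayer(d_model, d_ff, n_routed, top_k)

    def forward(self, x):
        out = self.routed(x)          # top-k routed, specialized experts
        for expert in self.shared:    # shared experts see every token
            out = out + expert(x)
        return out
```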
From the original post-norm Transformer to today’s cutting-edge language models, most of the core architecture of Transformers has remained intact: self-attention, residual connections, layer norms, and feed-forward layers. However, three architectural refinements have proven particularly impactful: pre-norm layer placement, rotary positional encodings, and Mixture-of-Experts layers.
These innovations, combined with large-scale data, clever training strategies, and efficiency improvements, have powered the remarkable performance leaps in large language models.