The paper introduces DeepSeek-R1-Zero and DeepSeek-R1, a first generation of reasoning models. The former is trained through large-scale reinforcement learning (RL) without any supervised fine-tuning (SFT) as a preliminary step; it develops strong reasoning capabilities but suffers from poor readability and language mixing. To address these issues and further improve performance, DeepSeek-R1 adds a multi-stage training pipeline that starts from a small amount of cold-start data before RL.
The authors present several key contributions:
1. Post-Training Advances: They demonstrate that LLMs can develop reasoning capabilities solely through RL, achieving remarkable performance on reasoning tasks without SFT.
2. Pipeline Development: The paper outlines a comprehensive pipeline for enhancing reasoning performance through multiple RL and SFT stages.
3. Distillation of Reasoning Skills: The research illustrates how reasoning patterns from larger models can be transferred to smaller models, enabling better performance without the need for extensive computational resources.
DeepSeek-R1-Zero is trained with Group Relative Policy Optimization (GRPO), which estimates the advantage baseline from a group of outputs sampled for the same prompt rather than from a separate critic model. Its training setup includes:
- Reward Modeling: A rule-based reward system is used rather than a neural reward model, combining an accuracy reward (is the final answer correct?) with a format reward that checks the response follows the required structure.
- Training Template: A simple template instructs the model to place its reasoning process between `<think>` and `</think>` tags and its final answer between `<answer>` and `</answer>` tags, keeping outputs structured and easy to parse (a minimal sketch of the reward and template checks appears after this list).
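The paper does not include reward code; the snippet below is a minimal Python sketch, assuming the `<think>…</think><answer>…</answer>` template and an exact string match for answer grading, of how a rule-based format-plus-accuracy reward and GRPO's group-relative advantage normalization could look. The reward weights and the exact-match check are illustrative assumptions, not the paper's implementation.

```python
import re
from statistics import mean, pstdev

# Template check: reasoning inside <think>...</think>, answer inside <answer>...</answer>.
TEMPLATE = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, reference: str) -> float:
    """Rule-based reward: a format component (does the output follow the
    template?) plus an accuracy component (does the extracted answer match
    the reference?). The 0.5/1.0 weights are illustrative assumptions."""
    match = TEMPLATE.search(completion)
    if match is None:
        return 0.0                        # malformed output: no reward
    accuracy = 1.0 if match.group(1).strip() == reference.strip() else 0.0
    return 0.5 + accuracy                 # 0.5 format reward + accuracy reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core idea: normalize each sampled completion's reward by the
    mean and standard deviation of its group, instead of using a learned
    critic as the baseline."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# Four completions sampled for the same prompt whose reference answer is "42".
rewards = [rule_based_reward(c, "42") for c in [
    "<think>21 * 2 = 42</think><answer>42</answer>",  # correct, well formatted
    "<think>a guess</think><answer>41</answer>",      # well formatted, wrong
    "the answer is 42",                               # correct but no template
    "<think>40 + 2</think><answer>42</answer>",       # correct, well formatted
]]
print(group_relative_advantages(rewards))
```

In the full GRPO objective these advantages weight a clipped policy-ratio term with a KL penalty toward a reference policy; the sketch only covers the reward and advantage pieces.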
During RL training, DeepSeek-R1-Zero's performance improves steadily, with its AIME 2024 Pass@1 rising from 15.6% to 71.0%, and behaviors such as longer chains of thought, reflection, and re-evaluation of earlier steps emerge without being explicitly programmed, demonstrating the effectiveness of RL in enhancing reasoning capabilities.
To improve upon DeepSeek-R1-Zero, a four-stage training pipeline is designed:
1. Cold Start: A small amount of long Chain-of-Thought (CoT) data is used to fine-tune the base model, enhancing readability and preventing early instability.
2. Reasoning-oriented RL: The model undergoes RL training focused on reasoning-intensive tasks while implementing a language consistency reward to mitigate language mixing.
3. Rejection Sampling and SFT: After reasoning-oriented RL converges, new SFT data is collected by rejection sampling from the RL checkpoint and combined with supervised data from other domains (e.g., writing, factual QA), and the model is fine-tuned on this mixture (sketches of the language consistency reward and the rejection-sampling step follow this list).
4. Comprehensive RL: A secondary RL phase aligns the model’s outputs with human preferences, enhancing helpfulness and harmlessness.
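Two of these stages lend themselves to short sketches. The paper describes the language consistency reward as the proportion of target-language words in the CoT, and the stage-3 reasoning data as correct samples kept from many generations per prompt. The sketch below follows those descriptions; the ASCII-based language check and the `generate`/`is_correct` callables are hypothetical stand-ins.

```python
import re
from typing import Callable

def language_consistency_reward(cot: str) -> float:
    """Proportion of target-language words in the chain of thought.
    'English word' is crudely approximated as an ASCII token here;
    a real implementation would use a proper language identifier."""
    tokens = re.findall(r"\w+", cot)
    if not tokens:
        return 0.0
    return sum(t.isascii() for t in tokens) / len(tokens)

def rejection_sample(prompts: list[str],
                     generate: Callable[[str, int], list[str]],
                     is_correct: Callable[[str, str], bool],
                     k: int = 16) -> list[dict]:
    """Stage-3 data collection: sample k completions per prompt from the RL
    checkpoint and keep only those judged correct, forming the reasoning
    portion of the SFT dataset."""
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, k):
            if is_correct(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept
```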
The distillation step uses DeepSeek-R1 as a teacher: roughly 800K samples curated with it are used to fine-tune smaller dense Qwen and Llama models with SFT only (no RL stage for the distilled models), transferring much of the reasoning ability at far lower computational cost (a minimal sketch follows).
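Since distillation here is plain supervised fine-tuning on teacher-generated traces, a minimal sketch is just standard SFT. The version below assumes Hugging Face `transformers` and a toy in-memory dataset; the student checkpoint, hyperparameters, and data format are illustrative, not the paper's.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-1.5B"              # placeholder student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# In practice: ~800K reasoning and non-reasoning samples curated with DeepSeek-R1.
teacher_traces = [
    {"prompt": "Solve: 21 * 2 = ?",
     "completion": "<think>21 * 2 = 42</think><answer>42</answer>"},
]

def collate(batch):
    texts = [ex["prompt"] + "\n" + ex["completion"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()    # next-token loss (pad positions would
    return enc                                  # be masked to -100 in a real run)

student.train()
for batch in DataLoader(teacher_traces, batch_size=1, collate_fn=collate):
    loss = student(**batch).loss                # cross-entropy on teacher traces
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The paper's notable finding about this step is that SFT-only distillation transfers reasoning to small models more effectively than running RL on those small models directly.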
Evaluation shows that DeepSeek-R1 performs strongly across reasoning benchmarks (Pass@1 figures are estimated by sampling several responses per question and averaging their correctness; a short sketch of this estimator follows the list):
- AIME 2024: DeepSeek-R1 achieves 79.8% Pass@1, slightly surpassing OpenAI-o1-1217.
- MATH-500: It scores 97.3%, showcasing its strength in mathematical reasoning.
- Codeforces: A rating of 2029, outperforming roughly 96% of human participants, indicates expert-level performance on competitive programming tasks.
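The Pass@1 numbers above are not greedy-decoding scores: the paper samples k responses per question at a non-zero temperature and averages their correctness. A minimal sketch of that estimator (the correctness flags here are toy values):

```python
def pass_at_1(correctness_per_question: list[list[bool]]) -> float:
    """Average, over questions, of the fraction of the k sampled responses
    that were graded correct."""
    per_question = [sum(flags) / len(flags) for flags in correctness_per_question]
    return sum(per_question) / len(per_question)

# Two questions, k = 4 sampled responses each.
print(pass_at_1([[True, True, False, True],
                 [False, True, True, True]]))   # -> 0.75
```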
Comparison with distilled models shows that they also achieve competitive scores, with DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B significantly outperforming non-reasoning models.
The findings indicate that DeepSeek-R1 substantially strengthens reasoning through RL-centered training, achieving results comparable to leading closed models. Future research directions include reducing language mixing on queries outside English and Chinese, improving general capabilities such as function calling, multi-turn dialogue, and structured output, and lifting performance on software engineering tasks.