![image](https://fortune.com/img-assets/wp-content/uploads/2025/01/GettyImages-2195402115_5043c9-e1737975454770.jpg?w=1440&q=75)
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on numerous benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably affordable, with input tokens costing just $0.14–0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).
![image](https://bsmedia.business-standard.com/_media/bs/img/article/2025-01/27/full/1737959259-7169.png?im=FeatureCrop,size=(826,465))
Until around GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials
The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
DeepSeek-R1 relies on two major ideas:
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag before answering with a final summary.
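For example, a response from an R1-style model looks roughly like this (the content here is made up for illustration; only the tag-then-summary structure is the point):

```
<think>
The user asks for 17 * 24. I can split it: 17 * 20 + 17 * 4 = 340 + 68 = 408.
</think>
17 * 24 = 408.
```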
R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
It is interesting that some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they produced such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
It's interesting that their training pipeline differs from the usual approach:
The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with several SFT and RL phases
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general abilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step produced a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is usually a larger model than the student.
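As a rough illustration, here is a minimal sketch of that idea using Hugging Face transformers. The model names are placeholders rather than the actual teacher/student pairs DeepSeek used, and a real run would need far more data, batching, and a proper chat template.

```python
# Minimal distillation sketch: teacher generates reasoning traces,
# student is fine-tuned on them with plain next-token prediction (SFT).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "reasoning-teacher"  # placeholder for an R1-style model
student_name = "small-student"      # placeholder for a small Qwen/Llama model

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

prompts = ["Solve: 12 * 13 = ?", "Is 97 prime? Explain."]

# 1) Teacher generates the reasoning traces that form the distillation dataset.
traces = []
with torch.no_grad():
    for p in prompts:
        ids = teacher_tok(p, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=256)
        traces.append(teacher_tok.decode(out[0], skip_special_tokens=True))

# 2) Student is fine-tuned on the traces with a standard causal LM loss.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in traces:
    batch = student_tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```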
Group Relative Policy Optimization (GRPO)
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
In this paper, they encourage the R1 model to produce chain-of-thought reasoning through RL training with GRPO.
Instead of adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on costly external models or human-graded examples as in standard RLHF, the RL used for R1 relies on simple criteria: it may give a higher reward if the answer is correct, if it follows the expected thinking/answer formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
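To make that concrete, here is a toy Python reward function in that spirit. The specific checks and weights are my own illustrative guesses, not the paper's exact rules.

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy reward combining correctness, formatting, and language consistency."""
    reward = 0.0

    # Correctness: does the part after the thinking block contain the reference answer?
    final_part = response.split("</think>")[-1]
    if reference_answer.strip() in final_part:
        reward += 1.0

    # Formatting: exactly one well-formed <think>...</think> block.
    if len(re.findall(r"<think>.*?</think>", response, flags=re.DOTALL)) == 1:
        reward += 0.5

    # Language consistency: crude proxy using the presence of CJK characters,
    # so a Chinese prompt is expected to get a Chinese answer and vice versa.
    def has_cjk(text: str) -> bool:
        return any("\u4e00" <= ch <= "\u9fff" for ch in text)

    if has_cjk(prompt) == has_cjk(final_part):
        reward += 0.25

    return reward
```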
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
1. For each input prompt, the model generates several different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others (see the sketch after this list).
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
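Here is a minimal sketch of the group-relative advantage computation from step 3, using the normalize-by-group-mean-and-std form described in the DeepSeekMath paper (the example rewards are made up):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward by the group's mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses to the same prompt, scored by a rule-based reward.
rewards = torch.tensor([1.75, 0.5, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
# Responses above the group average get positive advantages and are reinforced;
# below-average responses get negative advantages and are discouraged.
# The clipping and KL penalty from step 4 are applied in the policy update itself.
print(advantages)
```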
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the thinking-tag syntax, to guide the training.
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
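For a flavor of what that looks like in practice, here is a minimal sketch based on TRL's GRPOTrainer. The dataset, model, and toy reward function are placeholders, and argument names may differ between TRL versions, so check the library's documentation before running it.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt dataset works; this small example set comes from TRL's docs.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy rule-based reward: prefer completions that are roughly 50 characters long.
def reward_len(completions, **kwargs):
    return [-abs(50 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```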
Finally, Yannic Kilcher has a fantastic video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video:
"These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."