Chinese AI startup DeepSeek has tackled a problem that has frustrated AI researchers for several years. Its breakthrough in AI reward models could dramatically improve how AI systems reason and respond to questions.
In partnership with Tsinghua University researchers, DeepSeek has developed a technique detailed in a research paper titled “Inference-Time Scaling for Generalist Reward Modeling.” The paper outlines how the new approach outperforms existing methods and how the team “achieved competitive performance” compared with strong public reward models.
The innovation focuses on enhancing how AI systems learn from human preferences – an important aspect of creating more useful and aligned artificial intelligence.
What are AI reward models, and why do they matter?
AI reward models are important components in reinforcement learning for large language models. They provide feedback signals that help guide an AI’s behaviour toward preferred outcomes. In simpler terms, reward models are like digital teachers that help AI understand what humans want from their responses.
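To make the idea concrete, here is a minimal Python sketch of the role a reward model plays: a scoring function rates candidate responses, and the highest-scoring one is preferred. The `score_response` heuristic is a hypothetical stand-in for illustration only; real reward models are fine-tuned neural networks, not keyword overlap.

```python
# Toy illustration of a reward model's role: score candidate responses and
# prefer the highest-scoring one. A real reward model is a trained LLM head;
# this keyword-overlap heuristic is purely a stand-in for the concept.

def score_response(prompt: str, response: str) -> float:
    """Hypothetical reward: favour non-empty, on-topic responses."""
    if not response.strip():
        return 0.0
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return min(1.0, 0.2 + 0.1 * overlap)

def pick_best(prompt: str, candidates: list[str]) -> str:
    """The reward signal decides which behaviour gets reinforced."""
    return max(candidates, key=lambda r: score_response(prompt, r))

prompt = "What is the capital of France?"
candidates = ["", "Paris is the capital of France.", "I like turtles."]
print(pick_best(prompt, candidates))  # -> "Paris is the capital of France."
```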
“Reward modeling is a process that guides an LLM towards human preferences,” the DeepSeek paper states. It becomes increasingly important as AI systems grow more sophisticated and are deployed in scenarios beyond simple question-answering tasks.
The innovation from DeepSeek addresses the challenge of obtaining accurate reward signals for LLMs in different domains. While current reward models work well for verifiable questions or artificial rules, they struggle in general domains where criteria are more diverse and complex.
The dual approach: How DeepSeek’s method works
DeepSeek’s approach combines two methods:
- Generative reward modeling (GRM): This approach enables flexibility with different input types and allows for scaling at inference time. Unlike earlier scalar or semi-scalar approaches, GRM provides a richer representation of rewards through language.
- Self-principled critique tuning (SPCT): A learning method that uses online reinforcement learning to foster scalable reward-generation behaviours in GRMs, generating principles adaptively.
Zijun Liu, one of the paper’s authors from Tsinghua University and DeepSeek-AI, explained that the combination of methods allows “principles to be generated based on the input query and responses, adaptively aligning reward generation process.”
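As a rough illustration of what generating rewards “through language” looks like, here is a hedged Python sketch: the reward model is prompted to write principles and a critique in text, ending with a numeric score that the caller parses out. The prompt wording, the `llm_generate` callable, and the `grm_score` helper are illustrative assumptions, not DeepSeek’s actual interface.

```python
import re

# Hypothetical sketch of generative reward modeling (GRM): the model writes
# principles and a critique in natural language, then a final numeric score.
# `llm_generate` stands in for any text-generation call; it is an assumption,
# not DeepSeek's API.

GRM_PROMPT = """Evaluate the response to the query below.
First state the principles relevant to this query, then critique the
response against them, and end with a line of the form 'Score: <1-10>'.

Query: {query}
Response: {response}"""

def grm_score(llm_generate, query: str, response: str) -> float:
    """Generate a textual critique, then parse the trailing numeric score."""
    critique = llm_generate(GRM_PROMPT.format(query=query, response=response))
    match = re.search(r"Score:\s*(\d+)", critique)
    return float(match.group(1)) if match else 0.0
```

Because the reward is expressed in text before being parsed, the same model can apply different principles to different queries – the adaptivity Liu describes.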
The approach is particularly valuable for its potential for “inference-time scaling” – improving performance by increasing computational resources during inference rather than just during training.
The researchers found that their methods achieved better results with increased sampling: given more compute at inference time, the models generated more accurate rewards.
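A minimal sketch of that scaling behaviour, reusing the hypothetical `grm_score` helper above: sample several independent critiques and aggregate their parsed scores. The paper describes voting over sampled rewards, optionally guided by a meta reward model; plain averaging below is an illustrative simplification.

```python
import statistics

# Sketch of inference-time scaling: spend more compute (larger k) to sample
# more critiques, then aggregate. Averaging is a simplification; the paper
# votes over sampled rewards, optionally filtered by a meta reward model.

def scaled_grm_score(llm_generate, query: str, response: str, k: int = 8) -> float:
    samples = [grm_score(llm_generate, query, response) for _ in range(k)]
    return statistics.mean(samples)  # more samples -> a steadier reward estimate
```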
Implications for the AI industry
DeepSeek’s innovation comes at an important time in AI development. The paper states “reinforcement learning (RL) has been widely adopted in post-training for large language models […] at scale,” leading to “remarkable improvements in human value alignment, long-term reasoning, and environment adaptation for LLMs.”
The new approach to reward modelling could have several implications:
- More accurate AI feedback: By creating better reward models, AI systems can receive more precise feedback about their outputs, leading to improved responses over time.
- Increased adaptability: The ability to scale model performance during inference means AI systems can adapt to different computational constraints and requirements.
- Broader application: Systems can perform better in a broader range of tasks by improving reward modelling for general domains.
- More efficient resource use: The research shows that inference-time scaling with DeepSeek’s method can outperform scaling model size at training time, potentially allowing smaller models to match larger ones when given appropriate inference-time resources.
DeepSeek’s growing influence
The latest development adds to DeepSeek’s rising profile in global AI. Founded in 2023 by entrepreneur Liang Wenfeng, the Hangzhou-based company has made waves with its V3 foundation and R1 reasoning models.
DeepSeek recently upgraded its V3 model (DeepSeek-V3-0324), which the company said offers “enhanced reasoning capabilities, optimised front-end web development and upgraded Chinese writing proficiency.” DeepSeek has committed to open-source AI, releasing five code repositories in February that allow developers to review and contribute to its development.
While speculation continues about the potential release of DeepSeek-R2, the successor to R1 – Reuters has reported on possible release dates – DeepSeek has not commented through its official channels.
What’s next for AI reward models?
According to the researchers, DeepSeek intends to make the GRM models open-source, although no specific timeline has been provided. Open-sourcing could accelerate progress in the field by allowing broader experimentation with reward models.
As reinforcement learning continues to play an important role in AI development, advances in reward modelling like those from DeepSeek and Tsinghua University will likely shape the capabilities and behaviour of AI systems.
The work on AI reward models demonstrates that innovations in how and when models learn can be as important as increasing their size. By focusing on feedback quality and scalability, DeepSeek addresses one of the fundamental challenges in creating AI that better understands and aligns with human preferences.
See also: DeepSeek disruption: Chinese AI innovation narrows global technology divide
Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.
Explore other upcoming enterprise technology events and webinars powered by TechForge here.