Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context settings. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. In this regime, however, RL optimization degrades: reward convergence slows, policy updates become unstable due to KL-divergence fluctuations, and entropy collapses. These obstacles expose a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.

QwenLong-L1: A Structured RL Framework for Long-Context Adaptation

To overcome these limitations, the Qwen research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is organized into three key stages:

  • Warm-up Supervised Fine-Tuning (SFT): Training on curated question-context-answer triplets provides a stable initialization for the policy model, establishing basic competence in contextual comprehension and answer extraction.
  • Curriculum-Guided Phased RL: A staged training process with progressively increasing context lengths. This progression lets the model acquire long-context reasoning behaviors without destabilizing policy updates.
  • Difficulty-Aware Retrospective Sampling: To encourage deeper exploration and robustness across diverse inputs, training retains and reuses the hardest examples from earlier stages (a minimal sketch follows this list).
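The article does not reproduce the authors' sampling code, but the idea behind difficulty-aware retrospective sampling can be sketched as follows. The buffer structure, the `avg_reward` field, the hard-example fraction, and the difficulty threshold are illustrative assumptions, not details from the paper.

```python
import random

def build_stage_batch(current_pool, hard_buffer, batch_size, hard_fraction=0.3):
    """Mix fresh examples from the current curriculum stage with hard
    examples retained from earlier stages (illustrative sketch only)."""
    n_hard = min(int(batch_size * hard_fraction), len(hard_buffer))
    # Prioritize the hardest retained examples (lowest historical reward).
    hard_part = sorted(hard_buffer, key=lambda ex: ex["avg_reward"])[:n_hard]
    fresh_part = random.sample(current_pool, batch_size - n_hard)
    return hard_part + fresh_part

def update_hard_buffer(hard_buffer, batch, rewards, threshold=0.5):
    """Retain examples whose average group reward falls below a threshold,
    so later stages keep revisiting them."""
    for ex, r in zip(batch, rewards):
        ex["avg_reward"] = r
        if r < threshold:
            hard_buffer.append(ex)
    return hard_buffer
```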

These stages are complemented by a hybrid reward mechanism: combining rule-based exact-match verification with semantic evaluation by a lightweight LLM judge ensures both precision and recall during policy training.

Technical Design and Methodological Advantages

QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to reduce the computational overhead of value estimation over long contexts (a minimal sketch of the group-relative advantage computation follows this list):

  • GRPO: Estimates advantages by normalizing rewards within groups of sampled responses, eliminating the need for a separate value network and encouraging diverse generation patterns.
  • DAPO: Incorporates mechanisms such as dynamic sampling, overlong penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.
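As a rough illustration of group-relative advantage estimation, the sketch below standardizes the rewards of one prompt's sampled responses against their group mean and standard deviation, so no learned value function is required. Names and the epsilon constant are illustrative, not taken from the paper's code.

```python
import statistics

def group_relative_advantages(group_rewards, eps=1e-6):
    """Compute group-relative advantages for one prompt's sampled responses.

    group_rewards: scalar rewards, one per sampled response to the same prompt.
    Returns one advantage per response: the reward standardized against the
    group mean and standard deviation (no separate value network needed).
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four sampled answers to one long-context question.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```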

The reward function is defined as the maximum of two signals: a deterministic rule-based exact match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid answer formats while maintaining correctness across varied notations and phrasings.
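A minimal sketch of such a hybrid reward, assuming the judge model is exposed as a callable returning a score in [0, 1]; the `llm_judge` interface and the normalization used for exact match are placeholders rather than the paper's actual evaluator.

```python
import re

def exact_match_reward(prediction: str, gold: str) -> float:
    """Deterministic rule-based check: 1.0 on a normalized exact match."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(prediction) == norm(gold) else 0.0

def hybrid_reward(prediction: str, gold: str, llm_judge) -> float:
    """Hybrid reward: maximum of rule-based exact match and a semantic score
    from a lightweight judge model (e.g., a Qwen2.5-1.5B-class evaluator).

    llm_judge: callable(prediction, gold) -> float in [0, 1]; a placeholder
    for whatever judging prompt/model the training setup actually uses.
    """
    rule_score = exact_match_reward(prediction, gold)
    if rule_score == 1.0:  # no need to query the judge on an exact match
        return 1.0
    return max(rule_score, llm_judge(prediction, gold))
```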

Moreover, optimization is structured around progressive context scaling, in which the RL process transitions through controlled stages from 20K-token to 60K-token input lengths, stabilizing training dynamics and facilitating policy generalization.
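As an illustration only, a staged schedule like the one below could drive such a curriculum; the two-stage split and the helper names are assumptions beyond the 20K-to-60K range stated above.

```python
# Illustrative curriculum: each RL stage trains on longer inputs than the
# last, carrying forward hard examples retained from earlier stages.
CONTEXT_CURRICULUM = [
    {"stage": 1, "max_input_tokens": 20_000},
    {"stage": 2, "max_input_tokens": 60_000},
]

def filter_by_length(dataset, max_input_tokens, count_tokens):
    """Keep only examples whose context fits the current stage's budget."""
    return [ex for ex in dataset if count_tokens(ex["context"]) <= max_input_tokens]

def run_curriculum(dataset, train_rl_stage, count_tokens):
    """Run each RL stage in turn; train_rl_stage is a placeholder for the
    actual policy-optimization loop and returns the updated hard buffer."""
    hard_buffer = []
    for cfg in CONTEXT_CURRICULUM:
        stage_data = filter_by_length(dataset, cfg["max_input_tokens"], count_tokens)
        hard_buffer = train_rl_stage(stage_data, hard_buffer, cfg)
    return hard_buffer
```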

Experimental Results and Benchmark Performance

QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, showed strong empirical performance:

  • It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading models such as OpenAI-o3-mini and Qwen3-235B-A22B.
  • Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capability under extreme context lengths.
  • Pass@K analysis showed consistent gains with increased sampling, achieving an average Pass@2 score of 73.7 and surpassing DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.

Ablation studies further confirm the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL plays a crucial role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking, traits that are not effectively induced by supervised fine-tuning alone.

Conclusion

QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context proficiency and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results on long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.


Check out the Paper, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit and subscribe to our Newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
