Realtime AI News
Active-GRPO: New Training Method Combines Imitation and Self-Improving Reasoning for Molecular Optimization
Researchers propose Active-GRPO, a method that integrates imitation learning with reinforcement learning with verifiable rewards (RLVR) to train large language models (LLMs) for scientific reasoning. Applied to molecular optimization, the approach addresses the limitations of both pure supervised fine-tuning and sparse-reward RLVR.
A new paper posted on arXiv introduces Active-GRPO, a method the authors describe as adaptive imitation and self-improving reasoning for molecular optimization. It targets a fundamental challenge in training LLMs for scientific reasoning tasks and uses instruction-based molecular optimization as its experimental domain.
Training LLMs for scientific reasoning presents a dilemma. Supervised fine-tuning (SFT) collapses multi-step reasoning chains because it only optimizes for final answers. Reinforcement learning with verifiable rewards (RLVR) provides better process guidance but suffers from sparse feedback signals, making training inefficient.
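The sparse-feedback problem can be made concrete with the group-relative advantage used by GRPO-style methods. The sketch below is a minimal illustration, not the paper's implementation: the function name and the `eps` constant are assumptions made for this example. It shows that when every sampled completion in a group fails verification, all advantages are zero and the group contributes no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `eps` is an illustrative constant that avoids division by zero when
    every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse case: no sampled completion passes the verifier, so every
# advantage is zero and the group yields no gradient signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Informative case: one of four completions passes, so it receives a
# positive advantage and the failures receive negative ones.
one_passed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

On hard tasks most groups look like the first case early in training, which is one way to read the inefficiency the article describes.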
Active-GRPO bridges these approaches through a staged training strategy. In the early phase, the model learns effective reasoning patterns via reference-guided policy optimization (imitation learning). It then transitions to a self-improvement phase where verification signals continuously refine reasoning quality.
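One way to picture a staged strategy like this is as a weight that shifts the training objective from an imitation term to a verifier-driven term. The sketch below is hypothetical: the article does not specify how Active-GRPO schedules the transition, so the linear ramp, the function names, and the `switch_step` and `ramp_steps` parameters are all assumptions chosen for illustration.

```python
def imitation_weight(step, switch_step, ramp_steps):
    """Hypothetical schedule: 1.0 in the imitation phase, then a linear
    ramp down to 0.0 as training moves to self-improvement."""
    if step < switch_step:
        return 1.0
    if ramp_steps <= 0 or step >= switch_step + ramp_steps:
        return 0.0
    return 1.0 - (step - switch_step) / ramp_steps


def staged_loss(imitation_loss, rl_loss, step, switch_step=1000, ramp_steps=500):
    """Blend a reference-guided (imitation) loss with an RL loss.

    Both loss values are assumed to be computed elsewhere; this function
    only illustrates the phase transition, not either objective.
    """
    w = imitation_weight(step, switch_step, ramp_steps)
    return w * imitation_loss + (1.0 - w) * rl_loss
```

With these illustrative defaults, the objective is pure imitation before step 1000 and purely verifier-driven from step 1500 onward. A hard switch corresponds to `ramp_steps=0`.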

Molecular optimization is a critical task in drug discovery and materials science. It requires multi-step chemical reasoning (understanding molecular structures, predicting properties, and generating improved candidates), which places high demands on both the length and accuracy of LLM reasoning chains.
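To illustrate what a verifiable reward might look like in this setting, the sketch below scores a candidate molecule with a binary check. It is a toy under stated assumptions, not the paper's reward: the `Candidate` fields, the thresholds, and the pass/fail rule are all hypothetical, and the validity flag, property value, and similarity score are assumed to come from an external cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical, precomputed facts about a generated molecule."""
    is_valid: bool               # did the output parse as a molecule?
    property_value: float        # target property of the candidate
    similarity_to_source: float  # structural similarity in [0, 1]


def verifiable_reward(source_property, candidate,
                      min_gain=0.0, min_similarity=0.4):
    """Return 1.0 only if the candidate is valid, stays close to the
    source molecule, and improves the target property; otherwise 0.0.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if not candidate.is_valid:
        return 0.0
    if candidate.similarity_to_source < min_similarity:
        return 0.0
    gain = candidate.property_value - source_property
    return 1.0 if gain > min_gain else 0.0
```

Because a reward of this shape returns 1.0 only when every check passes, long reasoning chains that fail at any step all score zero, which is the kind of sparsity the article attributes to RLVR.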
The paper's choice of molecular optimization as a testbed is deliberate. On this task, SFT produces unstable reasoning chains, while RLVR's sparse rewards offer too little training signal. That gap is what Active-GRPO's phased design is meant to close.
The significance of this work may extend beyond molecular design. By adapting GRPO-style methods, which have been widely recognized since DeepSeek-R1, to complex scientific reasoning tasks, Active-GRPO proposes a training paradigm that could generalize to other LLM research applications.
Areas to watch include whether Active-GRPO generalizes to other scientific reasoning domains such as mathematical proof, code generation, and experimental design, as well as its scalability to larger models.
Why it matters
Active-GRPO offers a new training paradigm combining imitation learning with reinforcement learning for LLM scientific reasoning, with potential to advance AI applications in drug discovery and materials science.