Active-GRPO: New Training Method Combines Imitation and Self-Improving Reasoning for Molecular Optimization

A new paper posted on arXiv introduces Active-GRPO (Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization), which targets a fundamental challenge in training large language models (LLMs) for scientific reasoning tasks. The work uses instruction-based molecular optimization as its experimental domain.

Training LLMs for scientific reasoning presents a dilemma. Supervised fine-tuning (SFT) tends to collapse multi-step reasoning chains because it optimizes only for final answers. Reinforcement learning with verifiable rewards (RLVR) provides better process guidance, but its feedback is sparse, which makes training inefficient.
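The sparse-feedback problem can be made concrete with a small sketch. It assumes the RLVR signal is turned into a GRPO-style group-relative advantage, where each sampled answer's reward is normalized against the other answers to the same prompt. The function name and the 0/1 rewards are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumption: this is the standard GRPO-style estimator,
    (reward - group mean) / (group std + eps), not necessarily
    the exact form used in the Active-GRPO paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled answer fails verification: all advantages are zero,
# so this prompt contributes no gradient signal at all.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A single verified answer is enough to produce a usable signal:
# the passing sample gets a positive advantage, the rest negative.
one_passed = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

When a task is hard enough that most groups look like the first case, most training steps carry no information, which is one way to read the inefficiency described above.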

Active-GRPO bridges these approaches through a staged training strategy. In the early phase, the model learns effective reasoning patterns via reference-guided policy optimization (imitation learning). It then transitions to a self-improvement phase where verification signals continuously refine reasoning quality.
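A rough sketch of how such a two-phase schedule might be wired up follows. Everything here is an assumption made for illustration: the fixed `warmup_steps` switch point, the `Rollout` type, and the idea of mixing a reference solution into the sampled group during the early phase are hypothetical, and the paper may trigger the transition and use references differently.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One candidate answer with its verification reward (hypothetical type)."""
    text: str
    reward: float
    from_reference: bool = False


def select_phase(step, warmup_steps):
    """Pick the training phase; a fixed step threshold is an assumption."""
    return "imitation" if step < warmup_steps else "self_improvement"


def build_training_group(step, warmup_steps, sampled, reference):
    """Assemble the group of rollouts used for one policy update.

    Early phase: append the reference solution so the group always
    contains at least one good reasoning trace to learn from.
    Later phase: rely only on the model's own verified samples.
    """
    if select_phase(step, warmup_steps) == "imitation":
        return sampled + [reference]
    return list(sampled)


samples = [Rollout("model attempt 1", 0.0), Rollout("model attempt 2", 0.0)]
ref = Rollout("reference solution", 1.0, from_reference=True)

early_group = build_training_group(10, 1000, samples, ref)    # includes ref
late_group = build_training_group(5000, 1000, samples, ref)   # samples only
```

In this toy setup the early group still contains a rewarded trace even when every model attempt fails, which is the kind of gap the imitation phase is meant to cover.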

[Image: Active-GRPO, new research that uses adaptive imitation and self-improving reasoning to optimize molecular design. Image source: notebooklm.google]

Molecular optimization is a critical task in drug discovery and materials science. It requires multi-step chemical reasoning (understanding molecular structures, predicting properties, and generating improved candidates), which places high demands on both the length and the accuracy of LLM reasoning chains.

The paper's choice of molecular optimization as a testbed is deliberate. SFT produces unstable reasoning chains on this task, while RLVR's sparse reward structure offers insufficient training signal. Active-GRPO balances both approaches through its phased design.

The significance of this work extends beyond molecular design. By adapting GRPO-style methods, widely recognized since DeepSeek-R1, to complex scientific reasoning tasks, Active-GRPO proposes a training paradigm that could carry over to LLMs in other research applications.

Areas to watch include whether Active-GRPO generalizes to other scientific reasoning domains such as mathematical proof, code generation, and experimental design, as well as its scalability to larger models.