Realtime AI News
One LLM Rewrite Suffices: Automated Skill Description Optimization Cuts Agent Engineering Time by 32x
A new paper from Microsoft Research and collaborators proposes an automated pipeline for optimizing enterprise AI agent skill descriptions, achieving routing F1 scores close to manually tuned levels (79.2% vs 79.4%) while reducing per-skill engineering effort from 120 minutes to 3.8 minutes. Systematic ablation reveals that a single LLM rewrite using available false-positive and false-negative cases captures most of the improvement.
A new arXiv paper from Microsoft Research and collaborators, published July 1, tackles a practical bottleneck in enterprise AI agents: skill description optimization. As enterprise agents scale to dozens of specialized skills, overlapping natural language descriptions cause the routing LLM to misdirect user queries — a failure mode the authors term skill collision.
Traditionally, engineers manually tune each skill description to maintain routing accuracy, a process that becomes unsustainable as the number of skills grows. The team deployed an automated description optimization pipeline on a production enterprise group chat agent with 9 skills and 372 regression test cases.
The results show the pipeline matches manual tuning: auto-generated descriptions achieved an average F1 of 79.2%, compared with 79.4% for manually tuned descriptions, a gap of 0.2 percentage points that sits well within the reported 0.78% multi-seed noise floor. Per-skill engineering effort dropped from 120 minutes to 3.8 minutes, roughly a 32x speedup.
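The arithmetic behind these headline numbers is easy to check. The sketch below is illustrative only and is not code from the paper: the function names are hypothetical, and it assumes the noise floor is expressed in the same units as the F1 scores.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts: 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def within_noise(auto_f1: float, manual_f1: float, noise_floor: float) -> bool:
    """True when the absolute F1 gap does not exceed the seed-to-seed noise."""
    return abs(auto_f1 - manual_f1) <= noise_floor


def speedup(manual_minutes: float, auto_minutes: float) -> float:
    """Ratio of manual effort to automated effort per skill."""
    return manual_minutes / auto_minutes


if __name__ == "__main__":
    # Figures quoted in the article.
    print(within_noise(79.2, 79.4, 0.78))  # True
    print(round(speedup(120, 3.8), 1))     # 31.6, reported as 32x
```

The exact ratio is about 31.6, which the headline rounds to 32x.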

Systematic ablation studies on both the production system and ToolBench (16,000 tools) yielded a striking finding: a single LLM rewrite using any available false-positive and false-negative cases captures most of the achievable improvement. Other design choices — iteration budget, feedback signal composition, dual editing of confused pairs, and training set size — each affected final F1 by less than 0.5%.
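A single-pass rewrite of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: `RoutingCase`, `collect_failures`, `build_rewrite_prompt`, and `rewrite_once` are hypothetical names, the prompt wording is invented, and `llm` stands in for whatever text-completion callable a team already uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class RoutingCase:
    query: str
    expected_skill: str   # skill the test case says should handle the query
    predicted_skill: str  # skill the router actually chose


def collect_failures(
    cases: Sequence[RoutingCase], skill: str
) -> Tuple[List[str], List[str]]:
    """Split a skill's routing errors into false positives and false negatives."""
    false_pos = [c.query for c in cases
                 if c.predicted_skill == skill and c.expected_skill != skill]
    false_neg = [c.query for c in cases
                 if c.expected_skill == skill and c.predicted_skill != skill]
    return false_pos, false_neg


def build_rewrite_prompt(skill: str, description: str,
                         false_pos: List[str], false_neg: List[str]) -> str:
    """Assemble one rewrite request (hypothetical wording)."""
    fp_lines = "\n".join(f"- {q}" for q in false_pos) or "- (none)"
    fn_lines = "\n".join(f"- {q}" for q in false_neg) or "- (none)"
    return (
        f"Rewrite the description of the skill '{skill}'.\n"
        f"Current description:\n{description}\n\n"
        f"Queries wrongly routed to this skill:\n{fp_lines}\n\n"
        f"Queries this skill should have handled but missed:\n{fn_lines}\n\n"
        "Return only the new description."
    )


def rewrite_once(skill: str, description: str,
                 cases: Sequence[RoutingCase],
                 llm: Callable[[str], str]) -> str:
    """One LLM call per skill; the description is unchanged if nothing failed."""
    false_pos, false_neg = collect_failures(cases, skill)
    if not false_pos and not false_neg:
        return description
    return llm(build_rewrite_prompt(skill, description, false_pos, false_neg)).strip()
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider SDK and makes it easy to test with a stub.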
The paper also identifies clear boundaries for the approach. Description optimization resolves collisions caused by overlapping descriptions but cannot fix cases where two skills have genuinely overlapping intended scopes. The authors propose a diagnostic signal — a large train-validation F1 gap — that flags these cases for architectural rather than text-level intervention.
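One way such a diagnostic could be wired into a regression suite is sketched below. The function name and the default threshold of 10 F1 points are assumptions made for illustration; the article does not state what size of gap the authors consider large.

```python
from typing import List, Mapping


def flag_scope_overlap(train_f1: Mapping[str, float],
                       val_f1: Mapping[str, float],
                       gap_threshold: float = 10.0) -> List[str]:
    """Return skills whose train F1 exceeds validation F1 by more than the threshold.

    gap_threshold is a hypothetical cutoff in F1 points, not a value from the paper.
    Skills missing from val_f1 are skipped rather than flagged.
    """
    flagged = []
    for skill, train in train_f1.items():
        val = val_f1.get(skill)
        if val is None:
            continue
        if train - val > gap_threshold:
            flagged.append(skill)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    train = {"calendar": 95.0, "email": 80.0}
    val = {"calendar": 70.0, "email": 78.0}
    print(flag_scope_overlap(train, val))  # ['calendar']
```

Skills returned by a check like this would be candidates for merging or re-scoping rather than another description rewrite.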
For enterprises deploying AI agents at scale, the finding has immediate practical value: teams may no longer need hours of manual description tuning per skill. In the reported results, a single LLM rewrite driven by the available routing failures achieved comparable accuracy, lowering the operational cost of maintaining agent routing quality as skill inventories grow.
Why it matters
This research dramatically lowers the engineering cost of enterprise AI agent routing optimization, reducing hours of manual tuning to a single LLM call with comparable accuracy.