One LLM Rewrite Suffices: Automated Skill Description Optimization Cuts Agent Engineering Time by 32x

A new arXiv paper from Microsoft Research and collaborators, published July 1, tackles a practical bottleneck in enterprise AI agents: the manual tuning of skill descriptions. As enterprise agents scale to dozens of specialized skills, overlapping natural-language descriptions cause the routing LLM to misdirect user queries, a failure mode the authors term skill collision.

Traditionally, engineers manually tune each skill description to maintain routing accuracy, a process that becomes unsustainable as the number of skills grows. The team deployed an automated description optimization pipeline on a production enterprise group chat agent with 9 skills and 372 regression test cases.

The results show the pipeline matches manual tuning on routing quality: auto-generated descriptions achieved an average F1 of 79.2%, compared to 79.4% for manually tuned descriptions. The 0.2-point gap is well within the 0.78-point multi-seed noise floor. Per-skill engineering effort dropped from 120 minutes to 3.8 minutes, a roughly 32x speedup.
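The headline numbers can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the figures quoted above; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the article; variable names are illustrative only.
auto_f1 = 79.2          # average F1 (%) with auto-generated descriptions
manual_f1 = 79.4        # average F1 (%) with manually tuned descriptions
noise_floor = 0.78      # multi-seed noise floor, in F1 points
manual_minutes = 120.0  # per-skill effort with manual tuning
auto_minutes = 3.8      # per-skill effort with the automated pipeline

f1_delta = round(auto_f1 - manual_f1, 2)    # gap in F1 points
within_noise = abs(f1_delta) < noise_floor  # is the gap smaller than the noise?
speedup = manual_minutes / auto_minutes     # ratio of per-skill effort

print(f"F1 delta: {f1_delta:+.2f} points (within noise: {within_noise})")
print(f"Speedup: {speedup:.1f}x")
```

The ratio works out to about 31.6, which the article rounds to 32x.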

Study finds that a single LLM rewrite can optimize AI agent skill routing, raising engineering efficiency 32x. (Image source: notebooklm.google)

Systematic ablation studies on both the production system and ToolBench (16,000 tools) yielded a striking finding: a single LLM rewrite using any available false-positive and false-negative cases captures most of the achievable improvement. Other design choices (iteration budget, feedback signal composition, dual editing of confused pairs, and training set size) each affected final F1 by less than 0.5 points.
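To make the single-rewrite idea concrete, here is a minimal sketch of how one rewrite request could be assembled from a skill's false positives and false negatives. It is not the authors' pipeline: `RoutingFailure`, `build_rewrite_prompt`, `rewrite_once`, the prompt wording, and the example skills are all hypothetical, and the model is passed in as a plain callable so that no particular LLM API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutingFailure:
    query: str      # user query from a regression test case
    expected: str   # skill that should have handled the query
    predicted: str  # skill the router actually chose


def build_rewrite_prompt(skill: str, description: str,
                         failures: List[RoutingFailure]) -> str:
    """Assemble one rewrite request from a skill's routing failures."""
    # False positives: the router chose this skill when it should not have.
    false_pos = [f for f in failures
                 if f.predicted == skill and f.expected != skill]
    # False negatives: the router missed this skill when it was expected.
    false_neg = [f for f in failures
                 if f.expected == skill and f.predicted != skill]
    lines = [
        f"Rewrite the description of the skill '{skill}'.",
        f"Current description: {description}",
        "Queries wrongly routed to this skill (false positives):",
    ]
    lines += [f"- {f.query} (belongs to {f.expected})" for f in false_pos]
    lines.append("Queries this skill should have handled (false negatives):")
    lines += [f"- {f.query} (went to {f.predicted})" for f in false_neg]
    lines.append("Return only the revised description.")
    return "\n".join(lines)


def rewrite_once(skill: str, description: str,
                 failures: List[RoutingFailure],
                 llm: Callable[[str], str]) -> str:
    """Single pass: one prompt, one model call, no iteration loop."""
    return llm(build_rewrite_prompt(skill, description, failures)).strip()


# Usage with a stand-in model; a real deployment would call an actual LLM.
failures = [
    RoutingFailure("book a room for Friday", expected="calendar", predicted="travel"),
    RoutingFailure("plan my trip to Oslo", expected="travel", predicted="calendar"),
]
fake_llm = lambda prompt: " Schedules meetings and rooms; does not book trips. "
new_description = rewrite_once("calendar", "Handles scheduling.", failures, fake_llm)
print(new_description)
```

Keeping the model behind a callable makes the single-pass structure easy to test with a stub, and an iteration loop could be added around `rewrite_once` if the single rewrite turned out not to be enough for a given system.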

The paper also identifies clear boundaries for the approach. Description optimization resolves collisions caused by overlapping descriptions but cannot fix cases where two skills have genuinely overlapping intended scopes. The authors propose a diagnostic signal, a large train-validation F1 gap, that flags these cases for architectural rather than text-level intervention.
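The diagnostic can be illustrated with a short sketch that computes F1 on a training split and a validation split and flags skills whose gap is large. The article gives neither a threshold nor the granularity of the check, so the per-skill comparison, the 0.15 cutoff, and the names `skill_f1` and `flag_scope_overlap` are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Case = Tuple[str, str]  # (expected skill, predicted skill)


def skill_f1(cases: List[Case], skill: str) -> float:
    """F1 for one skill over a list of routing outcomes (0.0 if undefined)."""
    tp = sum(1 for e, p in cases if e == skill and p == skill)
    fp = sum(1 for e, p in cases if e != skill and p == skill)
    fn = sum(1 for e, p in cases if e == skill and p != skill)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def flag_scope_overlap(train: List[Case], val: List[Case],
                       skills: List[str],
                       gap_threshold: float = 0.15) -> Dict[str, float]:
    """Return skills whose train-minus-validation F1 gap exceeds the cutoff.

    gap_threshold is a hypothetical value, not one reported in the paper.
    """
    flagged = {}
    for skill in skills:
        gap = skill_f1(train, skill) - skill_f1(val, skill)
        if gap > gap_threshold:
            flagged[skill] = gap
    return flagged


# Toy data: "a" and "b" look fine on train but collide on validation.
train = [("a", "a")] * 4 + [("b", "b")] * 4 + [("c", "c")] * 4
val = [("a", "a"), ("a", "b"), ("a", "b"), ("b", "b"), ("c", "c")]
print(flag_scope_overlap(train, val, ["a", "b", "c"]))
```

In this toy data, skills "a" and "b" score perfectly on the training split but confuse each other on validation, so both are flagged, while "c" is not. In practice the cutoff would need to be set relative to the system's own noise floor.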

For enterprises deploying AI agents at scale, the finding has immediate practical value: teams may no longer need hours of manual description tuning per skill. A single LLM rewrite driven by a handful of routing failures achieves comparable accuracy, substantially lowering the operational cost of maintaining agent routing quality as skill inventories grow.