Realtime AI News

Do VLMs Search Like Humans? New Study Uses Reasoning Tokens as Reaction-Time Analog in Visual Search

A new arXiv study uses reasoning tokens in vision-language models as an analog to reaction time in human visual search, finding behavioral similarities across four classic paradigms.

Published Jun 25, 2026, 12:00 Beijing time

A paper titled 'Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms' has been posted to arXiv. The study asks whether vision-language models (VLMs) show the same behavioral signatures as human visual attention, treating the number of reasoning tokens a model produces as an analog to human reaction time.

The research adapts four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and tilted-versus-vertical line search. With these experiments, the authors investigate whether VLMs distinguish between parallel 'pop-out' search, where the target stands out regardless of how many distractors are present, and serial, attention-demanding search.
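In human visual-search research, the usual way to separate pop-out from serial search is the slope of reaction time against set size: a near-flat slope suggests parallel search, a steep one suggests item-by-item search. The sketch below applies that same idea to reasoning-token counts. It is only an illustration of the general technique, not the paper's analysis code; the function name `search_slope` and every number in it are hypothetical placeholders, not results from the study.

```python
from statistics import mean


def search_slope(set_sizes, token_counts):
    """Least-squares slope of token count against set size (tokens per item).

    Hypothetical helper: mirrors the reaction-time-by-set-size slope used in
    human visual-search studies, with reasoning tokens standing in for time.
    """
    if len(set_sizes) != len(token_counts) or len(set_sizes) < 2:
        raise ValueError("need two or more paired observations")
    x_bar = mean(set_sizes)
    y_bar = mean(token_counts)
    sxx = sum((x - x_bar) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(set_sizes, token_counts))
    return sxy / sxx


# Placeholder values for illustration only. These are NOT data from the paper.
set_sizes = [4, 8, 16, 32]
feature_tokens = [100, 100, 100, 100]      # flat: pop-out-like pattern
conjunction_tokens = [120, 160, 240, 400]  # rising: serial-like pattern

feature_slope = search_slope(set_sizes, feature_tokens)
conjunction_slope = search_slope(set_sizes, conjunction_tokens)
print(f"feature slope: {feature_slope:.1f} tokens/item")
print(f"conjunction slope: {conjunction_slope:.1f} tokens/item")
```

With real model outputs, the token counts would come from repeated trials at each set size, and the slopes for the two conditions would be compared in the same way.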

The source is arXiv cs.AI (ID 2606.25066), published on June 25, 2026.

Why it matters

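The study links VLM reasoning behavior to research on human visual attention. Treating reasoning-token counts as a reaction-time analog gives researchers a way to compare model behavior against well-established human findings, which could inform both the understanding of model cognition and the design of more human-like vision systems.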

Vision-Language Model, Visual Search, Cognition, arXiv

Sources

Source 1: https://arxiv.org/abs/2606.25066