LLM-Based Peer Review Survey: Fluency Is Not Enough, Reliability Remains a Challenge
A comprehensive survey finds that while LLMs can generate fluent peer review critiques, their reliability, robustness, and security as decision-support systems remain poorly understood.
LLMs Score Close to Human Examiners on Real GCSE Mock Exam Benchmark
A new dataset of 32,534 double-marked real student GCSE responses shows top LLMs agree with examiner consensus nearly as closely as two examiners agree with each other.
Detecting Jailbreaks from Within: Entropy Dynamics Across LLM Layers Reveal Harmful Intent
New research analyzes token-level predictive entropy trajectories across LLM layers to detect jailbreak attacks encoded in the model's internal representations.
G-SPIN: A Graph-Based Framework for Noisy ASR Error Correction
A new framework called G-SPIN uses graph structures to correct phonetically similar residual errors in ASR output, going beyond naive token-level fixes.
Knowing ≠ Steering: Study Reveals Geometric Gap Between Detection and Control Directions in LLMs
New research shows that the direction detecting a behavior in LLM activations differs significantly from the direction that causes it, challenging a key interpretability assumption.
Do VLMs Search Like Humans? New Study Uses Reasoning Tokens as Reaction-Time Analog in Visual Search
A new arXiv study uses reasoning tokens in vision-language models as an analog to reaction time in human visual search, finding behavioral similarities across four classic paradigms.
The Hitchhiker's Guide to Agentic AI: A Comprehensive Reference from Foundations to Deployment
A new comprehensive practitioner's reference, 'The Hitchhiker's Guide to Agentic AI', has been published on arXiv, covering the full stack from transformer architecture to production deployment.
LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning
A new survey paper reframes industrial continual learning for LLMs as a closed-loop update-and-release problem in an ecosystem, shifting focus from static benchmarks to real industrial needs.
What Actually Works for Spacecraft Fault-Tolerant Control: An Honest Benchmark of Learned and Classical Methods
A new study questions the reliability of learned fault-tolerant control methods for spacecraft, proposing a stricter benchmark that requires sustained pointing accuracy on unseen faults.
New Research Diagnoses and Mitigates Compounding Failures in Agentic Persuasion — RAG Semantic Leakage Identified as Key Trigger
A new arXiv paper finds that multi-agent debate systems in subjective tasks like persuasion suffer from severe problem drift and sycophantic conformity, identifying semantic leakage in standard RAG as a reproducible trigger.
Beyond Shapley: Efficient Exact Computation of Asymmetric Shapley Values Achieved in New Work
A new arXiv paper presents a method for computing Asymmetric Shapley Values using causal graphs, achieving polynomial-time computation in scenarios where SHAP is #P-hard.
InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating LLM Procedural Reasoning in Expert Investment Philosophy
A new benchmark, InvestPhilBench, evaluates large language models' procedural reasoning across 8 cognitive tiers in expert investment decision frameworks.
Project Auto-World: Using LLMs to Automate Benchmarking of Neural Relational Reasoners
A new arXiv paper proposes using large language models to automate benchmarking for relational reasoning, addressing the core problem of unknown instance difficulty in evaluating neural generalization.
BrainAgent: A Large Language Model-Driven Multi-Agent Framework for Autonomous Brain Signal Understanding
Researchers propose BrainAgent, a multi-agent LLM framework for autonomous brain signal understanding that lowers the technical barrier to brain-computer interface applications.
Heuresis: A Search Strategy Framework for Autonomous AI Research Agents Balancing Quality, Diversity, and Novelty
A new arXiv paper introduces Heuresis, a framework that abstracts the research pipeline into composable primitives, enabling open-ended scientific exploration while optimizing for quality, diversity, and novelty.
RWGBench: Evaluating Scholarly Positioning in Related Work Generation
New benchmark RWGBench evaluates LLM performance in generating related work sections from a citation-level scholarly positioning perspective, going beyond traditional summarization metrics.
Elo-Disentangled Chess Style Embeddings: New Method Separates Playing Strength from Individual Style
A new arXiv paper proposes per-player style embeddings for human chess that measure stylistic similarity via inner products while being approximately disentangled from playing strength (Elo).
ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact
New framework ReviewGuard uses a two-stage architecture to align LLM-generated peer reviews with citation-based estimates of long-term scientific impact.
Confidence Sequences for Online Statistical Model Checking of Markov Decision Processes
New paper proposes using confidence sequences for online statistical model checking of MDPs, addressing the unrealistic assumption of exact probability knowledge.
WinDOM: Self-Family Distillation for Small-Model GUI Grounding
New research proposes WinDOM, combining self-family distillation with reinforcement learning to achieve breakthrough GUI grounding performance in small models of roughly 2B parameters.
Offline Multi-agent Continual Cooperation via Skill Partition and Reuse
New research proposes extracting and reusing skills from multi-agent offline datasets to address catastrophic forgetting and plasticity loss in sequential task scenarios.
Autodata: An Agentic Data Scientist to Create High Quality Synthetic Data
New research introduces Autodata, a method enabling AI agents to act as data scientists building high-quality training and evaluation data, with self-optimization through Agentic Self-Instruct.
Agentic Evolution of Physically Constrained Foundation Models
Researchers build a physically grounded multi-agent discovery engine that autonomously designs hardware-compliant computing systems, addressing the hallucination problem in generalist AI agents.
Cost-Efficient Multi-Agent RAG: New Study Reveals Dichotomy in Assessment Mechanisms — Isolate vs. Score
A new arXiv paper reveals a sharp dichotomy in how multi-agent RAG models benefit from document assessment — per-document filtering vs. holistic scoring — and proposes model-adaptive strategies to reduce computational costs.
Position Spaces and Graphs: A Graph-Based Reasoning Framework for Spatial Relations
New research introduces position graphs, a formal graph-based reasoning framework using two strict partial orders to model horizontal and vertical alignment of discrete tokens.
Reasonable Motion: A General ASP Foundation for Environment Constrained Movement Trajectory Computation
New research presents a hybrid quantitative-qualitative method based on Answer Set Programming for computing constrained branching trajectory modes for moving objects in real-world settings.
OPPO: Omni-Perception Policy Optimization Framework for Multimodal Emotion Reasoning
A new arXiv paper proposes OPPO, a reinforcement learning framework that explicitly optimizes multimodal perception for emotion reasoning, addressing the underutilization of multimodal cues and hallucination in current Omni-MLLMs.
Automated Curriculum Learning for Multi-Domain RLVR: Leveraging Cross-Domain Transferability
A new arXiv paper proposes using cross-domain transferability of reasoning skills to dynamically adjust multi-domain RLVR training curricula, addressing inefficiency in fixed sampling strategies.
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Researchers introduce the cliff token concept, identifying the precise single token where an LLM's reasoning shifts from correctness toward failure in mathematical tasks.
Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation
Drawing on the First Proof benchmark, a new paper systematically categorizes four failure modes in which LLMs are fluently and confidently wrong on research-level mathematics.
Fuzzy Quantification over OWL Ontologies and Knowledge Graphs: A Versatile Framework
New research presents a versatile framework for evaluating fuzzy quantification queries over standard and fuzzy ontologies as well as knowledge graphs, agnostic to quantifier type and evaluation method.
Trust and Liability in Autonomous AI Prescribing: New Study Examines H.R. 238 and Utah Pilot
A new arXiv paper examines autonomous AI systems transitioning from advisory to prescribing roles, noting that US bill H.R. 238 and Utah's prescription-renewal pilot authorize AI to prescribe, while identifying critical gaps in current regulatory guidelines.
Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions
Researchers propose the TSJ (Theater-Stage-Judge) framework, a longitudinal evaluation approach that reveals cumulative risks from prolonged AI companion interactions with users whose cognition is still developing, including children and adolescents.
GUI agent: Guided Exploration of User-Sensitive Screens
New research addresses the problem of LLM-driven GUI agents encountering user-sensitive information screens, proposing a guided exploration approach that enables user takeover when needed.
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems
New paper proposes an 'Unfireable Safety Kernel' concept for execution-time AI alignment, addressing the fundamental vulnerability of safety controls inside agent runtimes.
Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models
New research reveals that low-bit post-training quantization causes reasoning models to generate longer chains of thought even when answers remain correct, adding hidden inference costs.