Guozhen AI
Global AI field notes and model intelligence

Realtime AI News

arXiv, Survey, Peer Review, Reliability

LLM-Based Peer Review Survey: Fluency Is Not Enough, Reliability Remains a Challenge

A comprehensive survey finds that while LLMs can generate fluent peer review critiques, their reliability, robustness, and security as decision-support systems remain poorly understood.

arXiv, Benchmark, LLM, Education

LLMs Score Close to Human Examiners on Real GCSE Mock Exam Benchmark

A new dataset of 32,534 double-marked real student GCSE responses shows top LLMs agree with examiner consensus nearly as closely as two examiners agree with each other.

arXiv, Jailbreak, Safety, Interpretability

Detecting Jailbreaks from Within: Entropy Dynamics Across LLM Layers Reveal Harmful Intent

New research analyzes token-level predictive entropy trajectories across LLM layers to detect jailbreak attacks encoded in the model's internal representations.
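As a rough illustration of what a per-layer entropy trajectory looks like, the sketch below computes the Shannon entropy of the next-token distribution at each layer. It assumes per-layer logits are already available (for example, by projecting intermediate hidden states through the output head, which is an assumption here and not a detail taken from the paper); the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_trajectory(per_layer_logits):
    """Return one entropy value per layer for a single token position.

    per_layer_logits: list of logit vectors, one per layer (assumed input).
    """
    return [entropy(softmax(logits)) for logits in per_layer_logits]

# Toy input: an early layer that is maximally uncertain over four tokens,
# followed by a later layer that strongly prefers the first token.
trajectory = entropy_trajectory([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]])
```

A detector along these lines would compare such trajectories between benign and adversarial prompts; how the paper does that comparison is not stated in the summary above.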

arXiv, ASR, Speech Recognition

G-SPIN: A Graph-Based Framework for Noisy ASR Error Correction

A new framework called G-SPIN uses graph structures to correct phonetically similar residual errors in ASR output, going beyond naive token-level fixes.

arXiv, ASR, RAG

Error-Aware TF-IDF RAG for ASR Error Correction

A lightweight RAG approach uses phonetically aware TF-IDF retrieval to correct ASR hallucinations of rare entities and domain-specific terms.

arXiv, Inference, LLM

Dustin: Sparse Verification for Efficient Long-Context Speculative Decoding

Dustin introduces draft-augmented sparse verification to overcome the KV cache loading bottleneck in long-context speculative decoding.

arXiv, Interpretability, Alignment

Knowing ≠ Steering: Study Reveals Geometric Gap Between Detection and Control Directions in LLMs

New research shows that the direction detecting a behavior in LLM activations differs significantly from the direction that causes it, challenging a key interpretability assumption.

Vision-Language Model, Visual Search, Cognition, arXiv

Do VLMs Search Like Humans? New Study Uses Reasoning Tokens as Reaction-Time Analog in Visual Search

A new arXiv study uses reasoning tokens in vision-language models as an analog to reaction time in human visual search, finding behavioral similarities across four classic paradigms.

Agentic AI, Book, arXiv

The Hitchhiker's Guide to Agentic AI: A Comprehensive Reference from Foundations to Deployment

A comprehensive new practitioner's reference, 'The Hitchhiker's Guide to Agentic AI', has been published on arXiv, covering the full stack from transformer architecture to production deployment.

LLM, Continual Learning, Industry, Survey

LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning

A new survey paper reframes industrial continual learning for LLMs as a closed-loop update-and-release problem in an ecosystem, shifting focus from static benchmarks to real industrial needs.

Robotics, Benchmark, AI Safety

What Actually Works for Spacecraft Fault-Tolerant Control: An Honest Benchmark of Learned and Classical Methods

A new study questions the reliability of learned fault-tolerant control methods for spacecraft, proposing a stricter benchmark that requires sustained pointing accuracy on unseen faults.

Agent, Persuasion, RAG, arXiv

New Research Diagnoses and Mitigates Compounding Failures in Agentic Persuasion — RAG Semantic Leakage Identified as Key Trigger

A new arXiv paper finds that multi-agent debate systems in subjective tasks like persuasion suffer from severe problem drift and sycophantic conformity, identifying semantic leakage in standard RAG as a reproducible trigger.

Explainability, Shapley, Causal, arXiv

Beyond Shapley: Efficient Exact Computation of Asymmetric Shapley Values Achieved in New Work

A new arXiv paper presents a method for computing Asymmetric Shapley Values using causal graphs, achieving polynomial-time computation in scenarios where SHAP is #P-hard.
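For readers unfamiliar with the quantity itself, the sketch below shows a brute-force definition of asymmetric Shapley values: marginal contributions are averaged only over feature orderings that place causal ancestors before their descendants. This is an exponential-time illustration under that standard definition, not the polynomial-time method the paper describes; the function name, the `precedes` parameter, and the toy value function are all hypothetical.

```python
from itertools import permutations

def shapley(players, value, precedes=()):
    """Average marginal contributions over admissible orderings.

    players:  list of feature names.
    value:    function mapping a frozenset of players to a float.
    precedes: pairs (a, b) meaning a must appear before b in an ordering.
              With no pairs this reduces to the ordinary Shapley value.
    """
    constraints = list(precedes)
    orders = [
        order for order in permutations(players)
        if all(order.index(a) < order.index(b) for a, b in constraints)
    ]
    if not orders:
        raise ValueError("no ordering satisfies the precedence constraints")
    totals = {p: 0.0 for p in players}
    for order in orders:
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    return {p: t / len(orders) for p, t in totals.items()}

def toy_value(coalition):
    """Two redundant features: either one alone yields the full payoff."""
    return 1.0 if coalition else 0.0

symmetric = shapley(["a", "b"], toy_value)
asymmetric = shapley(["a", "b"], toy_value, precedes=[("a", "b")])
```

In this toy game the symmetric value splits credit evenly between the two redundant features, while the constraint that `a` precedes `b` assigns all credit to `a`.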

LLM, Benchmark, Finance, Evaluation

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating LLM Procedural Reasoning in Expert Investment Philosophy

A new benchmark, InvestPhilBench, evaluates large language models' procedural reasoning across 8 cognitive tiers in expert investment decision frameworks.

Benchmarking, Reasoning, LLM, arXiv

Project Auto-World: Using LLMs to Automate Benchmarking of Neural Relational Reasoners

A new arXiv paper proposes using large language models to automate benchmarking for relational reasoning, addressing the core problem of unknown instance difficulty in evaluating neural generalization.

LLM, Multi-Agent, BCI, Brain-Computer Interface

BrainAgent: A Large Language Model-Driven Multi-Agent Framework for Autonomous Brain Signal Understanding

Researchers propose BrainAgent, a multi-agent LLM framework for autonomous brain signal understanding that lowers the technical barrier to brain-computer interface applications.

AI Research Agent, Scientific Discovery, Framework, arXiv

Heuresis: A Search Strategy Framework for Autonomous AI Research Agents Balancing Quality, Diversity, and Novelty

A new arXiv paper introduces Heuresis, a framework that abstracts the research pipeline into composable primitives, enabling open-ended scientific exploration while optimizing for quality, diversity, and novelty.

LLM, Benchmark, Scientific Writing, Evaluation

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

New benchmark RWGBench evaluates LLM performance in generating related work sections from a citation-level scholarly positioning perspective, going beyond traditional summarization metrics.

Chess, Representation Learning, Style, arXiv

Elo-Disentangled Chess Style Embeddings: New Method Separates Playing Strength from Individual Style

A new arXiv paper proposes per-player style embeddings for human chess that measure stylistic similarity via inner products while being approximately disentangled from playing strength (Elo).

LLM, Peer Review, Scientific Research

ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact

New framework ReviewGuard uses a two-stage architecture to align LLM-generated peer reviews with citation-based estimates of long-term scientific impact.

AI Research, Markov Decision Processes, Statistical Verification

Confidence Sequences for Online Statistical Model Checking of Markov Decision Processes

New paper proposes using confidence sequences for online statistical model checking of MDPs, addressing the unrealistic assumption of exact probability knowledge.

GUI Agents, Knowledge Distillation, Small Models

WinDOM: Self-Family Distillation for Small-Model GUI Grounding

New research proposes WinDOM, which combines self-family distillation with reinforcement learning to improve GUI grounding performance in small models of roughly 2B parameters.

Multi-Agent, Reinforcement Learning, Continual Learning

Offline Multi-agent Continual Cooperation via Skill Partition and Reuse

New research proposes extracting and reusing skills from multi-agent offline datasets to address catastrophic forgetting and plasticity loss in sequential task scenarios.

AI Agents, Security, Privacy

AI Snitches Get Glitches: Towards Evading Agentic Surveillance

New paper warns that widespread AI agent deployment could be abused for user surveillance, and proposes methods to evade such agentic surveillance.

AI Agents, Synthetic Data, Data Science

Autodata: An Agentic Data Scientist to Create High Quality Synthetic Data

New research introduces Autodata, a method that lets AI agents act as data scientists who build high-quality training and evaluation data, with self-optimization through Agentic Self-Instruct.

Multi-Agent, Foundation Models, Scientific Discovery, Hardware

Agentic Evolution of Physically Constrained Foundation Models

Researchers build a physically grounded multi-agent discovery engine that autonomously designs hardware-compliant computing systems, addressing the hallucination problem in generalist AI agents.

Multi-Agent, RAG, Efficiency, arXiv

Cost-Efficient Multi-Agent RAG: New Study Reveals Dichotomy in Assessment Mechanisms — Isolate vs. Score

A new arXiv paper reveals a sharp dichotomy in how multi-agent RAG models benefit from document assessment — per-document filtering vs. holistic scoring — and proposes model-adaptive strategies to reduce computational costs.

Knowledge Representation, Graph Theory, Reasoning

Position Spaces and Graphs: A Graph-Based Reasoning Framework for Spatial Relations

New research introduces position graphs, a formal graph-based reasoning framework using two strict partial orders to model horizontal and vertical alignment of discrete tokens.

Knowledge Representation, ASP, Robotics

Reasonable Motion: A General ASP Foundation for Environment Constrained Movement Trajectory Computation

New research presents a hybrid quantitative-qualitative method based on Answer Set Programming for computing constrained branching trajectory modes for moving objects in real-world settings.

Multimodal, Emotion AI, Reinforcement Learning, arXiv

OPPO: Omni-Perception Policy Optimization Framework for Multimodal Emotion Reasoning

A new arXiv paper proposes OPPO, a reinforcement learning framework that explicitly optimizes multimodal perception for emotion reasoning, addressing the underutilization of multimodal cues and hallucination in current Omni-MLLMs.

RLVR, Curriculum Learning, Reasoning, arXiv

Automated Curriculum Learning for Multi-Domain RLVR: Leveraging Cross-Domain Transferability

A new arXiv paper proposes using cross-domain transferability of reasoning skills to dynamically adjust multi-domain RLVR training curricula, addressing inefficiency in fixed sampling strategies.

LLM, Reasoning, Interpretability

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the cliff token concept, identifying the precise single token where an LLM's reasoning shifts from correctness toward failure in mathematical tasks.

LLM, Mathematics, Benchmark, Failure Analysis

Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation

Drawing on the First Proof benchmark, a new paper systematically categorizes four failure modes of LLMs on research-level mathematics, in which models are fluently and confidently wrong.

AI Agents, AI Research, Compression

Agentic System as Compressor: Quantifying System Intelligence in Bits

A new paper adopts a 'compression is intelligence' viewpoint, proposing to quantify AI agent system intelligence in bits.

Knowledge Graphs, Fuzzy Logic, OWL, Ontology

Fuzzy Quantification over OWL Ontologies and Knowledge Graphs: A Versatile Framework

New research presents a versatile framework for evaluating fuzzy quantification queries over standard and fuzzy ontologies as well as knowledge graphs, agnostic to quantifier type and evaluation method.

AI Regulation, Healthcare, AI Prescribing, arXiv

Trust and Liability in Autonomous AI Prescribing: New Study Examines H.R. 238 and Utah Pilot

A new arXiv paper examines autonomous AI systems transitioning from advisory to prescribing roles, noting that US bill H.R. 238 and Utah's prescription-renewal pilot authorize AI to prescribe, while identifying critical gaps in current regulatory guidelines.

AI Safety, LLM, Children, Ethics

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

Researchers propose the TSJ (Theater-Stage-Judge) framework, a longitudinal evaluation approach that reveals cumulative risks from prolonged AI companion interactions with users whose cognition is still developing, including children and adolescents.

LLM Agent, GUI, Privacy, Safety

GUI agent: Guided Exploration of User-Sensitive Screens

New research addresses the problem of LLM-driven GUI agents encountering user-sensitive information screens, proposing a guided exploration approach that enables user takeover when needed.

AI Safety, AI Agents, Alignment

The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

New paper proposes an 'Unfireable Safety Kernel' concept for execution-time AI alignment, addressing the fundamental vulnerability of safety controls inside agent runtimes.

LLM, Quantization, Efficiency

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

New research reveals that low-bit post-training quantization causes reasoning models to generate longer chains of thought even when answers remain correct, adding hidden inference costs.
