Realtime AI News - Bilingual Model, Agent, and Tool Updates

Jun 25, 2026, 04:00 AM UTC | arXiv, Survey, Peer Review, Reliability

LLM-Based Peer Review Survey: Fluency Is Not Enough, Reliability Remains a Challenge

A comprehensive survey finds that while LLMs can generate fluent peer review critiques, their reliability, robustness, and security as decision-support systems remain poorly understood.

Jun 25, 2026, 04:00 AM UTC | arXiv, Benchmark, LLM, Education

LLMs Score Close to Human Examiners on Real GCSE Mock Exam Benchmark

A new dataset of 32,534 double-marked real student GCSE responses shows top LLMs agree with examiner consensus nearly as closely as two examiners agree with each other.

Jun 25, 2026, 04:00 AM UTC | arXiv, Jailbreak, Safety, Interpretability

Detecting Jailbreaks from Within: Entropy Dynamics Across LLM Layers Reveal Harmful Intent

New research analyzes token-level predictive entropy trajectories across LLM layers to detect jailbreak attacks from the model's internal representations.

Jun 25, 2026, 04:00 AM UTC | arXiv, ASR, Speech Recognition

G-SPIN: A Graph-Based Framework for Noisy ASR Error Correction

A new framework called G-SPIN uses graph structures to correct phonetically similar residual errors in ASR output, going beyond naive token-level fixes.

Jun 25, 2026, 04:00 AM UTC | arXiv, ASR, RAG

Error-Aware TF-IDF RAG for ASR Error Correction

A lightweight RAG approach uses phonetically aware TF-IDF retrieval to correct ASR hallucinations of rare entities and domain-specific terms.

Jun 25, 2026, 04:00 AM UTC | arXiv, Inference, LLM

Dustin: Sparse Verification for Efficient Long-Context Speculative Decoding

Dustin introduces draft-augmented sparse verification to overcome the KV cache loading bottleneck in long-context speculative decoding.

Jun 25, 2026, 04:00 AM UTC | arXiv, Interpretability, Alignment

Knowing ≠ Steering: Study Reveals Geometric Gap Between Detection and Control Directions in LLMs

New research shows that the direction detecting a behavior in LLM activations differs significantly from the direction that causes it, challenging a key interpretability assumption.

Jun 25, 2026, 04:00 AM UTC | Vision-Language Model, Visual Search, Cognition, arXiv

Do VLMs Search Like Humans? New Study Uses Reasoning Tokens as Reaction-Time Analog in Visual Search

A new arXiv study uses reasoning tokens in vision-language models as an analog to reaction time in human visual search, finding behavioral similarities across four classic paradigms.

Jun 25, 2026, 04:00 AM UTC | Agentic AI, Book, arXiv

The Hitchhiker's Guide to Agentic AI: A Comprehensive Reference from Foundations to Deployment

A comprehensive practitioner's reference titled 'The Hitchhiker's Guide to Agentic AI' has been published on arXiv, covering the full stack from transformer architecture to production deployment.

Jun 25, 2026, 04:00 AM UTC | LLM, Continual Learning, Industry, Survey

LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning

A new survey paper reframes industrial continual learning for LLMs as a closed-loop update-and-release problem in an ecosystem, shifting focus from static benchmarks to real industrial needs.

Jun 25, 2026, 04:00 AM UTC | Robotics, Benchmark, AI Safety

What Actually Works for Spacecraft Fault-Tolerant Control: An Honest Benchmark of Learned and Classical Methods

A new study questions the reliability of learned fault-tolerant control methods for spacecraft, proposing a stricter benchmark that requires sustained pointing accuracy on unseen faults.

Jun 25, 2026, 04:00 AM UTC | Agent, Persuasion, RAG, arXiv

New Research Diagnoses and Mitigates Compounding Failures in Agentic Persuasion — RAG Semantic Leakage Identified as Key Trigger

A new arXiv paper finds that multi-agent debate systems in subjective tasks like persuasion suffer from severe problem drift and sycophantic conformity, identifying semantic leakage in standard RAG as a reproducible trigger.

Jun 25, 2026, 04:00 AM UTC | Explainability, Shapley, Causal, arXiv

Beyond Shapley: Efficient Exact Computation of Asymmetric Shapley Values Achieved in New Work

A new arXiv paper presents a method for computing Asymmetric Shapley Values using causal graphs, achieving polynomial-time computation in scenarios where computing SHAP values is #P-hard.

Jun 25, 2026, 04:00 AM UTC | LLM, Benchmark, Finance, Evaluation

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating LLM Procedural Reasoning in Expert Investment Philosophy

A new benchmark, InvestPhilBench, evaluates large language models' procedural reasoning across 8 cognitive tiers in expert investment decision frameworks.

Jun 25, 2026, 04:00 AM UTC | Benchmarking, Reasoning, LLM, arXiv

Project Auto-World: Using LLMs to Automate Benchmarking of Neural Relational Reasoners

A new arXiv paper proposes using large language models to automate benchmarking for relational reasoning, addressing the core problem of unknown instance difficulty in evaluating neural generalization.

Jun 25, 2026, 04:00 AM UTC | LLM, Multi-Agent, BCI, Brain-Computer Interface

BrainAgent: A Large Language Model-Driven Multi-Agent Framework for Autonomous Brain Signal Understanding

Researchers propose BrainAgent, a multi-agent LLM framework for autonomous brain signal understanding that lowers the technical barrier to brain-computer interface applications.

Jun 25, 2026, 04:00 AM UTC | AI Research Agent, Scientific Discovery, Framework, arXiv

Heuresis: A Search Strategy Framework for Autonomous AI Research Agents Balancing Quality, Diversity, and Novelty

A new arXiv paper introduces Heuresis, a framework that abstracts the research pipeline into composable primitives, enabling open-ended scientific exploration while optimizing for quality, diversity, and novelty.

Jun 25, 2026, 04:00 AM UTC | LLM, Benchmark, Scientific Writing, Evaluation

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

New benchmark RWGBench evaluates LLM performance in generating related work sections from a citation-level scholarly positioning perspective, going beyond traditional summarization metrics.

Jun 25, 2026, 04:00 AM UTC | Chess, Representation Learning, Style, arXiv

Elo-Disentangled Chess Style Embeddings: New Method Separates Playing Strength from Individual Style

A new arXiv paper proposes per-player style embeddings for human chess that measure stylistic similarity via inner products while being approximately disentangled from playing strength (Elo).

Jun 25, 2026, 04:00 AM UTC | LLM, Peer Review, Scientific Research

ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact

New framework ReviewGuard uses a two-stage architecture to align LLM-generated peer reviews with citation-based estimates of long-term scientific impact.

Jun 25, 2026, 04:00 AM UTC | AI Research, Markov Decision Processes, Statistical Verification

Confidence Sequences for Online Statistical Model Checking of Markov Decision Processes

New paper proposes using confidence sequences for online statistical model checking of MDPs, addressing the unrealistic assumption of exact probability knowledge.

Jun 25, 2026, 04:00 AM UTC | GUI Agents, Knowledge Distillation, Small Models

WinDOM: Self-Family Distillation for Small-Model GUI Grounding

New research proposes WinDOM, which combines self-family distillation with reinforcement learning to improve GUI grounding in small models of roughly 2B parameters.

Jun 25, 2026, 04:00 AM UTC | Multi-Agent, Reinforcement Learning, Continual Learning

Offline Multi-agent Continual Cooperation via Skill Partition and Reuse

New research proposes extracting and reusing skills from multi-agent offline datasets to address catastrophic forgetting and plasticity loss in sequential task scenarios.

Jun 25, 2026, 04:00 AM UTC | AI Agents, Security, Privacy

AI Snitches Get Glitches: Towards Evading Agentic Surveillance

New paper warns that widespread AI agent deployment could be abused for user surveillance, and proposes methods to evade such agentic surveillance.

Jun 25, 2026, 04:00 AM UTC | AI Agents, Synthetic Data, Data Science

Autodata: An Agentic Data Scientist to Create High Quality Synthetic Data

New research introduces Autodata, a method enabling AI agents to act as data scientists building high-quality training and evaluation data, with self-optimization through Agentic Self-Instruct.

Jun 25, 2026, 04:00 AM UTC | Multi-Agent, Foundation Models, Scientific Discovery, Hardware

Agentic Evolution of Physically Constrained Foundation Models

Researchers build a physically grounded multi-agent discovery engine that autonomously designs hardware-compliant computing systems, addressing the hallucination problem in generalist AI agents.

Jun 25, 2026, 04:00 AM UTC | Multi-Agent, RAG, Efficiency, arXiv

Cost-Efficient Multi-Agent RAG: New Study Reveals Dichotomy in Assessment Mechanisms — Isolate vs. Score

A new arXiv paper reveals a sharp dichotomy in how multi-agent RAG models benefit from document assessment — per-document filtering vs. holistic scoring — and proposes model-adaptive strategies to reduce computational costs.

Jun 25, 2026, 04:00 AM UTC | Knowledge Representation, Graph Theory, Reasoning

Position Spaces and Graphs: A Graph-Based Reasoning Framework for Spatial Relations

New research introduces position graphs, a formal graph-based reasoning framework using two strict partial orders to model horizontal and vertical alignment of discrete tokens.

Jun 25, 2026, 04:00 AM UTC | Knowledge Representation, ASP, Robotics

Reasonable Motion: A General ASP Foundation for Environment Constrained Movement Trajectory Computation

New research presents a hybrid quantitative-qualitative method based on Answer Set Programming for computing constrained branching trajectory modes for moving objects in real-world settings.

Jun 25, 2026, 04:00 AM UTC | Multimodal, Emotion AI, Reinforcement Learning, arXiv

OPPO: Omni-Perception Policy Optimization Framework for Multimodal Emotion Reasoning

A new arXiv paper proposes OPPO, a reinforcement learning framework that explicitly optimizes multimodal perception for emotion reasoning, addressing the underutilization of multimodal cues and hallucination in current Omni-MLLMs.

Jun 25, 2026, 04:00 AM UTC | RLVR, Curriculum Learning, Reasoning, arXiv

Automated Curriculum Learning for Multi-Domain RLVR: Leveraging Cross-Domain Transferability

A new arXiv paper proposes using cross-domain transferability of reasoning skills to dynamically adjust multi-domain RLVR training curricula, addressing inefficiency in fixed sampling strategies.

Jun 25, 2026, 04:00 AM UTC | LLM, Reasoning, Interpretability

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the cliff token concept, identifying the precise single token where an LLM's reasoning shifts from correctness toward failure in mathematical tasks.

Jun 25, 2026, 04:00 AM UTC | LLM, Mathematics, Benchmark, Failure Analysis

Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation

Drawing on the First Proof benchmark, a new paper systematically categorizes four failure modes of LLMs on research-level mathematics, in which models are fluent and confident but wrong.

Jun 25, 2026, 04:00 AM UTC | AI Agents, AI Research, Compression

Agentic System as Compressor: Quantifying System Intelligence in Bits

A new paper adopts a 'compression is intelligence' viewpoint, proposing to quantify AI agent system intelligence in bits.

Jun 25, 2026, 04:00 AM UTC | Knowledge Graphs, Fuzzy Logic, OWL, Ontology

Fuzzy Quantification over OWL Ontologies and Knowledge Graphs: A Versatile Framework

New research presents a versatile framework for evaluating fuzzy quantification queries over standard and fuzzy ontologies as well as knowledge graphs, agnostic to quantifier type and evaluation method.

Jun 25, 2026, 04:00 AM UTC | AI Regulation, Healthcare, AI Prescribing, arXiv

Trust and Liability in Autonomous AI Prescribing: New Study Examines H.R. 238 and Utah Pilot

A new arXiv paper examines autonomous AI systems transitioning from advisory to prescribing roles, noting that US bill H.R. 238 and Utah's prescription-renewal pilot authorize AI to prescribe, while identifying critical gaps in current regulatory guidelines.

Jun 25, 2026, 04:00 AM UTC | AI Safety, LLM, Children, Ethics

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

Researchers propose the TSJ (Theater-Stage-Judge) framework, a longitudinal evaluation approach that reveals cumulative risks from prolonged AI companion interactions with users whose cognition is still developing, including children and adolescents.

Jun 25, 2026, 04:00 AM UTC | LLM Agent, GUI, Privacy, Safety

GUI agent: Guided Exploration of User-Sensitive Screens

New research addresses the problem of LLM-driven GUI agents encountering user-sensitive information screens, proposing a guided exploration approach that enables user takeover when needed.

Jun 25, 2026, 04:00 AM UTC | AI Safety, AI Agents, Alignment

The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

New paper proposes an 'Unfireable Safety Kernel' concept for execution-time AI alignment, addressing the fundamental vulnerability of safety controls inside agent runtimes.

Jun 25, 2026, 04:00 AM UTC | LLM, Quantization, Efficiency

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

New research reveals that low-bit post-training quantization causes reasoning models to generate longer chains of thought even when answers remain correct, adding hidden inference costs.

Daily Briefs

2026-06-25 | Daily AI Brief, June 25, 2026: Agent Systems Breakthrough, Quantization Inflation Hidden Cost, Brain-Computer Interface Multi-Agent Framework