TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory
A new arXiv paper introduces the TRUSTMEM framework, which targets error accumulation and persistent hallucinations in the long-term memory of LLM agents, problems caused by model-generated write, revise, and delete operations.
Agentic Knowledge Tracing: A Multi-Agent LLM Architecture for Stealth Assessment of Financial Literacy in Serious Games
Researchers propose the Agentic BKT pipeline, a multi-agent LLM architecture that stealthily assesses financial competencies from open-ended gameplay events without disrupting the learning experience.
Supervised Reinforcement Learning Tackles Distributed Energy Resource Coordination
Researchers propose a supervised reinforcement learning approach for coordinating distributed energy resources (DERs), achieving more efficient energy management under the uncertainty and complexity that challenge traditional optimization methods.
MacroLens Benchmark Released: Multi-Task Financial Reasoning Under Macroeconomic Scenarios
Researchers release MacroLens, a multi-task benchmark for contextual financial reasoning under macroeconomic scenarios that addresses key challenges in time-series evaluation, such as data leakage and reporting lags.
Study Reveals 'Readout Blind Spot' in Looped Language Models: Dense Supervision Misses Hidden State Variables
A new study shows that dense per-loop cross-entropy loss in looped language models only controls variables exposed by the readout, not all hidden-state variables active in the recurrent transition, creating a systematic supervision blind spot.
Human-AI Collaboration Discovers Quantum Algorithms: From Vague Intuition to Mathematical Discovery
A new paper documents how human-AI co-discovery transformed a vague research intuition into concrete sign-embedding quantum algorithms for matrix equations and matrix functions, showing a new paradigm for AI-assisted mathematics.
AgentOdyssey: A New Framework for Evaluating Test-Time Continual Learning in AI Agents
AgentOdyssey procedurally generates open-ended text games to benchmark agents on exploration, knowledge acquisition, memory retention, and long-horizon planning.
OpenAI publishes a new research paper examining how AI agents are transforming work by handling longer, more complex tasks and expanding productivity across roles.
DiARC Paper: Distinguishing Positive and Negative Samples Improves LLM Reasoning on ARC Tasks
A new arXiv paper introduces DiARC, a method that improves large language models' performance on the Abstraction and Reasoning Corpus (ARC) by distinguishing positive and negative samples.
Cerebras Stock Plunges After First Earnings Since IPO as CEO Says Margin Outlook Misunderstood
AI chipmaker Cerebras saw its stock plummet after its first earnings report since going public, with a narrower gross margin forecast spooking investors.
Companies Scramble to Stop Employees from Burning Through AI Budgets with Small Tasks
TechCrunch reports that companies are rushing to stop employees from exhausting AI budgets on low-value small tasks, marking a shift from the 'tokenmaxxing' era to an era of 'token rationing'.