This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Consensus-Validated Memory Improves Agent Performance on Complex Tasks
Citations: 2 · Authors: 1 · Year: 2026
Abstract
Large language model (LLM) agents do not learn across invocations. Each call starts from the same pretrained weights regardless of prior experience, preventing multi-agent systems from accumulating institutional knowledge. We present empirical evidence that consensus-validated institutional memory — where agent observations are summarized, proposed to a Byzantine fault-tolerant network, validated by weighted quorum, and committed to a queryable knowledge base — measurably improves downstream agent performance on complex sequential tasks. Using a controlled experimental framework applied to AI-driven CTF (Capture The Flag) challenge generation, we find:

- Calibration accuracy: Agents with access to institutional memory achieve target difficulty levels with 40% lower calibration error than memoryless agents (Cohen's d = -0.824, large effect; mean gap 0.282 vs 0.468, n = 20 per arm).
- Causal isolation of memory-mediated convergence: A 50-run sequential no-memory baseline (same machine, same session, (S)AGE off) does not converge, while the matched (S)AGE arm converges monotonically with variance collapse (stdev 0.159 to 0.113). This rules out temporal ordering as an alternative explanation for the observed convergence.
- Memory surpasses prompt engineering: An agent with a minimal "onboarding" prompt (job scope, scoring rubric, output format) and access to curated institutional memory achieved perfect calibration (gap = 0.0), outperforming an expert-crafted 120-line domain-specific prompt whose best result across 20 runs was gap = 0.10. Institutional memory can substitute for domain-specific prompt engineering.
- Echo chamber effect: Without diversity controls, institutional memory converges to a single dominant pattern. All 20 sequential (S)AGE runs generated the same vulnerability type (Padding Oracle), and each run's observation reinforced the pattern. We identify knowledge diversity and novelty-aware summarization as necessary conditions for effective institutional memory.
- Consensus filtering works: The BFT validator quorum correctly accepts domain-relevant observations (scores 0.59–0.86) while rejecting cross-domain contamination from prior experiments (scores 0.35–0.50), demonstrating that governance prevents knowledge pollution.
- Quality consistency: (S)AGE runs achieve 100% quality consistency (a score of 93.0 across all 20 runs), versus high variance for vanilla runs (mean 90.1, std 13.1), indicating that institutional memory stabilizes output quality.

We report all results including null findings and failure modes. The experimental framework, statistical analysis engine, and (S)AGE infrastructure are open-source at https://github.com/l33tdawg/sage.
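The weighted-quorum validation step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Validator` class, the relevance threshold of 0.55, and the two-thirds quorum fraction are all assumptions chosen so that the abstract's accepted (0.59–0.86) and rejected (0.35–0.50) score bands fall on the correct sides of the decision.

```python
from dataclasses import dataclass

@dataclass
class Validator:
    name: str
    weight: float  # voting weight in the quorum

def quorum_accepts(scores, validators, threshold=0.55, quorum=2 / 3):
    """Accept a proposed observation when validators holding at least
    a `quorum` fraction of total weight score it at or above `threshold`.
    (threshold and quorum are illustrative assumptions, not from the paper.)
    """
    total = sum(v.weight for v in validators)
    approving = sum(v.weight for v, s in zip(validators, scores) if s >= threshold)
    return approving / total >= quorum

validators = [Validator("v1", 2.0), Validator("v2", 1.0), Validator("v3", 1.0)]

# Domain-relevant observation: all scores in the accepted 0.59-0.86 band.
assert quorum_accepts([0.86, 0.72, 0.59], validators)

# Cross-domain contamination: all scores in the rejected 0.35-0.50 band.
assert not quorum_accepts([0.50, 0.41, 0.35], validators)
```

Only observations that clear the quorum are committed to the shared knowledge base; everything else is discarded, which is the governance mechanism the abstract credits with preventing knowledge pollution.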