1. The Non-Stationarity Problem in Agent Networks
In Tutorial I, we constructed a living network: agents discover peers at runtime, form coalitions around tasks, and dissolve them upon completion. The architecture assumed a comforting fiction — that an agent's capabilities, behavior, and reliability remain stable between the moment it advertises an Agent Card and the moment it executes a subtask. In production, this assumption is catastrophically false.
Consider what actually happens in a deployed multi-agent healthcare system over a single quarter. ClinEvidence, our clinical evidence synthesis agent, undergoes a fine-tune on newly published randomized controlled trial data. GenomicsAgent's variant annotation pipeline switches from ClinVar v2024-09 to v2025-01, reclassifying 847 variants. PolicyReasonerAgent absorbs an updated coverage policy from CMS that changes medical necessity criteria for gene therapies. The orchestrator Ω itself receives a model update that alters its decomposition heuristics. Every agent in the coalition has drifted — and none of them has told the others.
Imagine a jazz ensemble that rehearsed together last month. Since then, the pianist studied Thelonious Monk and now favors dissonant voicings; the bassist switched from upright to electric; the drummer shifted from swing to Afro-Cuban patterns. They reunite on stage expecting last month's chemistry. The first few bars sound fine — the chord changes are familiar — but as the solo section opens, the ensemble fractures. Each musician has individually improved, but the joint distribution of their interaction has shifted underneath them. This is the non-stationarity problem in multi-agent systems.
The non-stationarity problem is not merely a software engineering nuisance; it is a fundamental theoretical challenge. In a single-agent system, non-stationarity is already hard — it triggers catastrophic forgetting, requires careful detection, and demands adaptation strategies. In a multi-agent network, the problem compounds multiplicatively: each agent's adaptation changes the environment observed by every other agent, creating a cascade of co-adaptive dynamics that can amplify, oscillate, or converge unpredictably.
This tutorial tackles the challenge head-on. We will formalize the types of drift, explore the stability–plasticity dilemma that makes adaptation dangerous, survey the latest continual learning strategies (from Elastic Weight Consolidation to progressive LoRA architectures), and develop network-level detection and propagation mechanisms. Throughout, we ground every concept in our running prior authorization scenario — because in healthcare, a drifting agent is not an inconvenience; it is a patient safety risk.
2. Formal Foundations: A Taxonomy of Drift
To reason precisely about non-stationarity, we need a formal framework. We model each agent α as operating within a local decision process that, in a stationary world, would be characterized by a fixed transition function, reward signal, and observation space. Non-stationarity means one or more of these components changes over time — and the agent may or may not be aware of the change.
An agent α operates in a non-stationary environment if there exists a time-indexed family of transition functions {P_t(s′ | s, a)}_{t≥0} such that P_{t₁} ≠ P_{t₂} for some t₁ ≠ t₂. The agent's policy πα was optimized for P_{t₀} (training time) but executes under P_t where t > t₀ and P_t may have diverged from P_{t₀}.
Following the taxonomy established by the continual reinforcement learning literature (Abel et al., JAIR 2024), we distinguish non-stationarity along two dimensions: scope (what changes) and driver (what causes the change).
2.1 Scope: What Changes
In the context of our multi-agent healthcare network, we identify five distinct scopes of non-stationarity, each with different implications for detection and adaptation:
| Drift Type | Formal Description | Healthcare Example |
|---|---|---|
| Data Drift | P(X) changes; P(Y|X) stable | Patient demographics shift — more geriatric gene therapy candidates after expanded age indications |
| Concept Drift | P(Y|X) changes; the target relationship itself shifts | CMS redefines "medical necessity" for SMA gene therapy, changing what constitutes an approval-worthy case |
| Model Drift | πα(a|s) changes due to weight updates (fine-tune, RLHF, adapter swap) | ClinEvidence receives a LoRA adapter update that shifts its evidence grading from GRADE to JBI methodology |
| Tool Drift | The action space A or its effects change | GenomicsAgent's ClinVar API returns reclassified variants; same query, different answers |
| Peer Drift | Other agents' policies πβ change, altering the multi-agent transition dynamics | PolicyReasonerAgent absorbs a new coverage policy, and now rejects cases that ClinEvidence's evidence synthesis was calibrated to support |
In a multi-agent network, peer drift is endogenous: when agent α adapts, it changes the effective environment for agents β, γ, ... who interact with α. Their subsequent adaptations change α's environment in turn. This creates a feedback loop that is absent from single-agent continual learning. The continual RL literature calls this the co-adaptation dilemma — each agent's policy update shifts the ground truth that other agents are learning from.
2.2 Drivers: What Causes the Change
Drawing on Abel et al.'s continual RL framework, we distinguish passive non-stationarity (the environment changes independently of the agent) from active non-stationarity (the agent's own actions influence the rate and direction of change). In multi-agent systems, a third category emerges:
Passive (Exogenous)
External events drive change independent of agent behavior. CMS updates a coverage policy. A new clinical trial is published. ClinVar reclassifies a variant.
Detection: Monitor external sources on schedule.
Rate: Episodic (policy updates quarterly), continuous (literature daily).
Active (Self-Induced)
The agent's own actions influence the rate and direction of change. An agent's approval patterns shape which prior authorization requests providers submit next; its query load can alter the caching and rate-limiting behavior of the tools it depends on.
Detection: Correlate behavioral shifts with the agent's own action history.
Rate: Coupled to the agent's activity level.
Reactive (Endogenous)
Agent α's adaptations trigger cascading adaptations in peers β, γ, ... who then alter α's environment. This is the multi-agent co-adaptation loop.
Detection: Cross-agent behavioral monitoring.
Rate: Coupled to agents' learning rates — can accelerate or oscillate.
2.3 Temporal Patterns of Drift
Not all drift arrives the same way. The MACPH framework (Li et al., 2025) identifies three temporal patterns, each demanding different detection and response strategies:
Sudden drift occurs when the transition function undergoes an abrupt discontinuity — a new FDA-approved indication overnight changes the decision boundary for gene therapy coverage. Gradual drift represents a smooth evolution — the distribution of prior authorization requests slowly shifts as more providers adopt precision medicine workflows. Recurring drift captures periodic or cyclical changes — certain formulary updates happen at fiscal year boundaries, creating predictable but still-disruptive annual shifts.
The critical distinction for multi-agent systems is that an agent may experience sudden drift even when the underlying cause was gradual. If GenomicsAgent accumulates small variant reclassifications over months and then publishes a batch update, PolicyReasonerAgent perceives a sudden shift in its input distribution. The temporal pattern experienced by one agent depends not only on the external change dynamics but on the update and communication cadences of its peers.
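A toy simulation (all numbers invented) makes the batching effect concrete: the source accumulates a handful of reclassifications each month, but a peer that only sees a semi-annual batch release experiences them as a single jump.

```python
# Toy simulation (invented numbers): gradual drift at the source is
# perceived as sudden drift by a peer that only sees batched updates.

def simulate(months=6, per_month=5, batch_every=6):
    source_view = []   # shift the source experiences each month
    peer_view = []     # shift the peer experiences each month
    pending = 0
    for month in range(1, months + 1):
        source_view.append(per_month)     # small, steady change
        pending += per_month
        if month % batch_every == 0:      # batch release to peers
            peer_view.append(pending)
            pending = 0
        else:
            peer_view.append(0)
    return source_view, peer_view

source, peer = simulate()
print(max(source))  # 5  -- gradual: never more than 5 reclassifications/month
print(max(peer))    # 30 -- sudden: one 30-variant jump at the batch boundary
```

The total change is identical in both views; only its temporal pattern differs, which is exactly why one agent's detector can fire on "sudden" drift that another agent never saw as sudden.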
3. The Stability–Plasticity Dilemma
The central tension in continual learning — formally identified by Grossberg (1980) and recently surveyed extensively in the LLM context by multiple research groups (Yu et al., 2026; Haque et al., 2025) — is that a learning system cannot simultaneously be maximally stable (preserving what it has learned) and maximally plastic (rapidly adapting to new information). Increasing one necessarily degrades the other.
Let f_θ : 𝒳 → 𝒴 be a model with parameters θ ∈ ℝ^d trained sequentially on task distributions D_A then D_B. The stability–plasticity dilemma (Grossberg, 1980; McCloskey & Cohen, 1989) is the fundamental tension between two competing objectives in continual learning: stability, preserving θ to retain performance on D_A, and plasticity, adapting θ to minimize loss on D_B. Because gradient updates ∇_θ ℒ_B overwrite parameter configurations responsible for low ℒ_A, sequential optimization causes catastrophic forgetting, formally measured as:
ℱ = ℒ_A(θ*_B) − ℒ_A(θ*_A)
where θ*_A and θ*_B are the loss-minimizing parameters for each task respectively. The severity of forgetting is non-monotonic with respect to task similarity: it is low when the hidden representations φ_A and φ_B are nearly identical (compatible updates) or entirely disjoint (orthogonal gradients), and is maximized at intermediate representational overlap in activation space.
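A minimal numeric illustration of the forgetting measure, using a toy one-parameter quadratic "model" rather than a neural network: training to convergence on task A and then on task B raises task-A loss by exactly ℱ.

```python
# Toy illustration of catastrophic forgetting: sequential training on
# task B raises task-A loss; the gap is the forgetting measure F.

def loss(theta, target):
    return (theta - target) ** 2            # quadratic task loss, optimum at `target`

def sgd(theta, target, lr=0.1, steps=200):
    for _ in range(steps):
        theta -= lr * 2 * (theta - target)  # gradient of (theta - target)^2
    return theta

theta_star_A = sgd(0.0, target=1.0)              # train on task A (optimum 1.0)
theta_star_B = sgd(theta_star_A, target=-1.0)    # then on task B (optimum -1.0)

forgetting = loss(theta_star_B, 1.0) - loss(theta_star_A, 1.0)
print(round(forgetting, 3))   # 4.0 -- task-A loss after B, minus after A
```

Because the tasks pull the single parameter in opposite directions (maximal interference), forgetting here is as large as it can be; the non-monotonicity described above comes from how much the tasks' representations overlap.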
For a single LLM, catastrophic forgetting is already severe. Yu et al. (2026) empirically demonstrated that continual fine-tuning of LLMs on sequential NLU tasks from the GLUE benchmark produces significant performance degradation on earlier tasks, with the severity varying by model architecture and scale. The smaller Phi-3.5-mini model exhibited minimal forgetting while maintaining learning capacity, while larger models like Qwen2.5-7B showed stronger learning but greater forgetting — an observation consistent with the finding that deeper, narrower architectures favor plasticity while wider, shallower ones favor stability (Lu et al., 2025).
3.1 Why Multi-Agent Amplifies the Dilemma
In a single-agent system, the stability–plasticity trade-off is a local optimization problem: find the point on the spectrum that maximizes expected performance on the mixture of past and future tasks. In a multi-agent coalition, three amplification effects make the problem fundamentally harder:
Commitment fragility. When agent α joins a coalition based on its Agent Card — advertising capabilities like "clinical evidence synthesis, GRADE methodology, oncology + rare disease specialization" — it makes an implicit contract with its peers. If α adapts too aggressively (high plasticity), its behavior may violate the contract: it switches from GRADE to JBI methodology, and PolicyReasonerAgent's downstream logic, calibrated for GRADE outputs, produces incorrect coverage determinations. The stability–plasticity dial directly controls contract reliability.
Cascading adaptation. When one agent adapts, its changed behavior shifts the input distribution for every downstream agent. If those agents also adapt in response, the network enters a co-adaptive loop. In the multi-agent RL literature, this is formalized as the fact that agent β's MDP is non-stationary not because the external environment changed, but because agent α — which is part of β's environment — updated its policy. Tan (1993) first identified this convergence challenge; recent work on Dec-POMDPs (Oliehoek & Amato, 2016) shows that independent learning in non-stationary multi-agent settings lacks convergence guarantees.
Safety constraint propagation. In healthcare, safety constraints are not local to individual agents; they propagate through the coalition's causal chain. If ClinEvidence adapts and its evidence quality degrades by 5%, the error compounds through GenomicsAgent's interaction analysis and PolicyReasonerAgent's coverage logic, potentially producing a 15% degradation in final determination accuracy. A small plasticity budget at each node can produce catastrophic cumulative drift.
If each agent i in a chain of length k has local drift d_i (measured as distribution shift in its output), the end-to-end drift D satisfies:
D ≤ Σ_{i=1..k} d_i + Σ_{i<j} ρ_{ij} · d_i · d_j
where ρ_{ij} captures the sensitivity of agent j's output to agent i's drift.
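A worked instance of the bound, with invented drift and sensitivity values chosen to echo the 5% per-agent and 15% end-to-end numbers above:

```python
# Worked instance of the drift-compounding bound (invented values):
# three agents in a chain, each with local drift d_i, and pairwise
# sensitivities rho_ij of downstream outputs to upstream drift.

def drift_bound(d, rho):
    """Upper bound: D <= sum_i d_i + sum_{i<j} rho[i][j] * d_i * d_j."""
    k = len(d)
    linear = sum(d)
    interaction = sum(
        rho[i][j] * d[i] * d[j]
        for i in range(k) for j in range(i + 1, k)
    )
    return linear + interaction

d = [0.05, 0.05, 0.05]                          # 5% local drift at each agent
rho = [[0, 2.0, 2.0], [0, 0, 2.0], [0, 0, 0]]   # strong downstream sensitivity
print(round(drift_bound(d, rho), 4))            # 0.165 -- above 15% end to end
```

Even with the interaction terms contributing little here, the linear terms alone sum to 15%: small per-node plasticity budgets accumulate across the chain.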
4. Continual Learning Strategies for Individual Agents
Before tackling the multi-agent coordination problem, we need to equip individual agents with the ability to learn continuously without catastrophically forgetting their prior competencies. The continual learning literature offers three broad families of techniques, each with distinct trade-offs. Following the taxonomy of Wu et al. (2024) and the recent lifelong learning survey for LLM agents (Zheng et al., 2025), we examine each family through the lens of our healthcare scenario.
4.1 Regularization-Based Methods: Elastic Weight Consolidation
The foundational insight of Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) is deceptively elegant: not all parameters are equally important for a given task. If we can identify which parameters are critical for preserving performance on task A, we can allow task B to modify only the non-critical parameters, finding a solution that performs well on both tasks simultaneously.
Formally, EWC introduces a regularization penalty to the loss function when training on a new task B:
ℒ(θ) = ℒ_B(θ) + Σ_i (λ/2) · F_i · (θ_i − θ*_{A,i})²
where ℒ_B is the loss on the new task, λ sets the strength of the anchor to the old solution θ*_A, and F_i is the i-th diagonal entry of the Fisher Information Matrix.
The Fisher Information Matrix (FIM) acts as a measure of parameter importance: Fi is large when the i-th parameter strongly influences the likelihood of the data from task A, meaning changes to it would significantly degrade performance. By penalizing deviations from θ*A in proportion to Fi, EWC allows unimportant parameters to change freely while anchoring critical ones.
ClinEvidence was trained on Task A: synthesizing evidence for oncology prior authorizations. A new task B arrives: adapting to rare disease gene therapy evidence (our Zolgensma scenario). Without EWC, fine-tuning on gene therapy data causes ClinEvidence to forget its oncology evidence grading patterns. With EWC (λ=10), the Fisher Information identifies parameters critical to GRADE-based evidence assessment and anchors them, allowing the rare disease knowledge to be absorbed via parameters that were less utilized for oncology. Empirically, this mirrors findings from the NeurIPS 2025 NORA workshop: EWC reduced catastrophic forgetting by 45.7% on knowledge graph continual learning tasks (Jhajj & Lin, 2025).
EWC's limitation is computational: the FIM is a matrix of size |θ|×|θ|, which for modern LLMs with billions of parameters is intractable. In practice, only the diagonal is computed — the diagonal Fisher approximation. Recent work by Benzing et al. (ICLR 2024) introduced a surrogate Hessian-vector product method that enables EWC with the full FIM at tractable cost, showing that the diagonal and full approaches have complementary strengths: diagonal EWC excels in the feature-learning regime while full-FIM EWC excels in the lazy regime. For LLM agents, which typically operate in the feature-learning regime, diagonal EWC remains the pragmatic choice.
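The diagonal-EWC update can be sketched in a few lines. This is a toy two-parameter model, not an LLM, and the Fisher values are posited rather than estimated from data: parameter 0 is treated as critical for task A (large F) and parameter 1 as unimportant (small F); λ = 10 mirrors the ClinEvidence example.

```python
# Sketch of the diagonal-EWC update on a toy 2-parameter quadratic model.
# Fisher values are posited, not computed: F[0] marks a task-A-critical
# parameter, F[1] a free one. lambda = 10 as in the ClinEvidence example.

def ewc_train(theta_a, fisher, target_b, lam=10.0, lr=0.02, steps=2000):
    theta = list(theta_a)
    for _ in range(steps):
        for i in range(len(theta)):
            g_task = 2 * (theta[i] - target_b[i])              # task-B gradient
            g_pen = lam * fisher[i] * (theta[i] - theta_a[i])  # EWC anchor term
            theta[i] -= lr * (g_task + g_pen)
    return theta

theta_a = [1.0, 1.0]       # task-A optimum
target_b = [-1.0, -1.0]    # task B pulls both parameters toward -1
fisher = [5.0, 0.01]       # parameter 0 critical, parameter 1 free

theta = ewc_train(theta_a, fisher, target_b)
# Stationary point per coordinate: (2*b + lam*F_i*a) / (2 + lam*F_i)
print([round(t, 2) for t in theta])   # [0.92, -0.9]
```

The anchored parameter stays near its task-A value while the unimportant one moves almost all the way to the task-B optimum, which is the entire mechanism of EWC in miniature.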
4.2 Replay-Based Methods: Experience Rehearsal
Replay methods take a conceptually different approach: instead of protecting parameters, they protect data. When learning task B, the agent intermixes examples from task A (stored in a replay buffer) with the new training data, preventing the gradient updates from optimizing exclusively for B. This mirrors the neuroscience concept of memory consolidation — the hippocampus replays experiences during sleep to transfer them to the neocortex for long-term storage.
For LLM-based agents, replay takes a distinctive form. Rather than storing raw training examples (which may be prohibitively large or legally restricted under HIPAA for patient data), agents can store synthetic exemplars — inputs and outputs generated by the model itself before the update. This approach, pioneered by LAMOL (Sun et al., 2020) and extended by recent work on pseudo-rehearsal for LLMs, avoids storing sensitive data while preserving the agent's behavioral profile.
// Before adaptation: generate synthetic exemplars of current behavior
fn prepare_replay_buffer(agent: ClinEvidence, n_exemplars: usize) → Buffer {
    let buffer = Buffer::new()
    for domain in ["oncology", "cardiology", "rare_disease"] {
        let prompts = generate_representative_queries(domain, n_exemplars / 3)
        for prompt in prompts {
            let response = agent.generate(prompt)        // current model behavior
            let quality = agent.self_evaluate(response)  // self-assessed quality
            buffer.add({ prompt, response, quality, domain })
        }
    }
    buffer.diversify()  // ensure coverage of edge cases
    return buffer       // ~500 exemplars, no PHI stored
}

// During adaptation: mix replay with new task data
fn continual_finetune(agent, new_data, replay_buffer, mix_ratio: f32) {
    for epoch in 1..max_epochs {
        let batch = interleave(
            new_data.sample(1.0 - mix_ratio),
            replay_buffer.sample(mix_ratio)  // typically 20-30%
        )
        agent.train_step(batch)
    }
}
In healthcare multi-agent systems, replay buffers must never contain Protected Health Information (PHI). Synthetic exemplars generated by the agent itself — "What would a typical prior auth query for SMA gene therapy look like?" — are permissible because they contain no real patient data. This is not merely a best practice; it is a legal requirement under 45 CFR § 164.502(a). The StreamCLR framework (2025) addresses this with a dual-buffer strategy: a short-term buffer capturing recent, anonymized interaction patterns and a long-term buffer curated via diversity and uncertainty criteria — both free of identifiable data.
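The dual-buffer sampling step might look like the following sketch; the function name, the buffer split, and the 30% replay ratio are illustrative assumptions, not details taken from StreamCLR.

```python
import random

# Sketch of a dual-buffer replay sampler (names and ratios are
# illustrative): a short-term buffer of recent anonymized exemplars and
# a long-term buffer curated for diversity, mixed into each batch.

def sample_training_batch(new_data, short_term, long_term,
                          batch_size=10, replay_ratio=0.3, rng=random):
    n_replay = int(batch_size * replay_ratio)   # e.g. 3 of 10 from replay
    n_new = batch_size - n_replay
    replay = (rng.sample(short_term, n_replay // 2 + n_replay % 2)
              + rng.sample(long_term, n_replay // 2))
    return rng.sample(new_data, n_new) + replay

batch = sample_training_batch(list(range(100)),
                              ["s0", "s1", "s2", "s3"],   # short-term exemplars
                              ["l0", "l1", "l2", "l3"])   # long-term exemplars
print(len(batch))   # 10 -- 7 new examples, 3 replayed exemplars
```

Because both buffers hold only synthetic, PHI-free exemplars, the sampler itself never touches protected data; the legal boundary is enforced at buffer-population time, not at training time.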
4.3 Architecture-Based Methods: Progressive and Modular Approaches
The third family avoids the stability–plasticity trade-off entirely by adding new parameters for new tasks while freezing existing ones. Progressive Neural Networks (Rusu et al., 2016) introduced this idea: for each new task, a new "column" of parameters is added, with lateral connections to existing columns enabling forward transfer. Prior knowledge is never overwritten because prior parameters are never modified.
For LLM-based agents, the progressive approach has been modernized through LoRA (Low-Rank Adaptation; Hu et al., 2022) and its continual learning extensions. Rather than adding entire network columns, agents add small, task-specific LoRA adapters — low-rank matrices that modify the attention and feed-forward layers — while keeping the base model frozen.
Two recent advances are particularly relevant for multi-agent systems:
ProgLoRA (Yu et al., ACL 2025 Findings) maintains a progressive pool of LoRA blocks, adding a new block for each incremental task. A task-aware allocation mechanism determines how to leverage previously acquired knowledge, while a task recall mechanism realigns the model with previously learned tasks. Experiments on multimodal continual instruction tuning showed that ProgLoRA outperformed both MoE-LoRA and standard LoRA approaches, with static and dynamic variants for different deployment scenarios.
GainLoRA (Liang & Li, 2025) takes a gated integration approach: new LoRA branches are dynamically created per task and integrated with existing branches through a learnable gating module. Initialization and update constraints on the gating parameters significantly reduce interference between old and new branches. This is directly applicable to our agent scenario, where each agent may need to adapt to multiple sequential tasks (different disease domains, updated guidelines) while maintaining competence across all of them.
4.4 Comparative Analysis for Healthcare Agents
| Method | Forgetting Prevention | Compute Cost | Storage Cost | PHI Risk | Multi-Task Serving |
|---|---|---|---|---|---|
| EWC | Good (45-50% reduction) | FIM computation per task | 1× model + diagonal FIM | None (no data stored) | Single model, shared params |
| Replay | Very good with mix ratio tuning | Training overhead ~20-30% | Buffer (500-5000 exemplars) | ⚠ Must use synthetic exemplars | Single model, shared params |
| ProgLoRA | Near-perfect (frozen base) | Low per-adapter (r=16) | ~0.1% params per task | None (architecture-only) | Router selects adapter per query |
| GainLoRA | Excellent (gated isolation) | Gating module overhead | ~0.1% + gating per task | None (architecture-only) | Automatic gated routing |
| Hybrid (EWC + ProgLoRA) | Best (layered protection) | Moderate | Adapters + FIM | None | Router + regularized base |
For our healthcare prior authorization agents, the hybrid approach — combining ProgLoRA for task isolation with EWC regularization on the base model — provides the strongest guarantees. The base model's critical parameters (e.g., those encoding HIPAA compliance reasoning, evidence grading methodology) are protected by EWC, while domain-specific knowledge is cleanly isolated in task-specific adapters. This mirrors the self-learning agent architecture proposed by Sivakumar et al. (2025), which integrated Progressive Neural Networks with LLaMA 3.2, using LoRA for efficient fine-tuning and EWC for knowledge retention, demonstrating Task 1 perplexity shift below 0.2 after learning four sequential tasks.
5. Network-Level Drift Detection and Propagation
Equipping individual agents with continual learning capabilities is necessary but not sufficient. The multi-agent network itself needs mechanisms to detect when an agent has drifted, propagate that information to affected peers, and coordinate adaptation responses. Without network-level detection, each agent operates in blissful ignorance of its peers' behavioral shifts — precisely the scenario described in the original open challenge.
5.1 Agent-Level Drift Signals
Each agent should continuously monitor its own behavioral consistency and broadcast drift signals when significant changes are detected. We propose a three-layer detection architecture:
struct DriftMonitor {
    baseline_embeddings: Vec<Embedding>,  // behavioral fingerprint at deployment
    baseline_quality: QualityProfile,     // self-evaluated accuracy on reference set
    ks_threshold: f64,                    // Kolmogorov-Smirnov rejection threshold
    version_hash: String,                 // model weights checksum
}

impl DriftMonitor {
    // Layer 1: Weight-level detection (cheapest, fastest)
    fn detect_weight_change(&self, current_hash: &str) → DriftSignal {
        if current_hash != self.version_hash {
            return DriftSignal::ModelUpdate {
                severity: HIGH,
                message: "Model weights have changed since Agent Card was published"
            }
        }
        DriftSignal::None
    }

    // Layer 2: Output-distribution detection (moderate cost)
    fn detect_output_drift(&self, recent_outputs: &[Embedding]) → DriftSignal {
        let (ks_stat, p_value) = ks_2sample(&self.baseline_embeddings, recent_outputs)
        if p_value < self.ks_threshold {
            return DriftSignal::BehavioralShift {
                severity: severity_from_ks(ks_stat),
                ks_statistic: ks_stat,
                message: "Output distribution has shifted significantly"
            }
        }
        DriftSignal::None
    }

    // Layer 3: Quality-metric detection (most expensive, most informative)
    fn detect_quality_drift(&self, reference_set: &[Example]) → DriftSignal {
        let current_quality = self.evaluate_on_reference(reference_set)
        let degradation = self.baseline_quality - current_quality
        if degradation.accuracy > 0.05 || degradation.consistency > 0.10 {
            return DriftSignal::QualityDegradation {
                severity: severity_from_degradation(degradation),
                metrics: degradation,
                message: "Performance on reference set has declined"
            }
        }
        DriftSignal::None
    }
}
The three layers form a cost-quality pyramid: weight-change detection is instantaneous but only detects explicit model updates; output-distribution monitoring catches behavioral shifts from any source (model updates, tool changes, data drift) but requires embedding recent outputs; quality-metric evaluation provides the most actionable signal but requires maintaining a reference evaluation set and running inference on it periodically.
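The two-sample Kolmogorov–Smirnov statistic behind the Layer 2 `ks_2sample` call can be computed without any statistics library. This dependency-free sketch operates on scalar behavioral scores (for instance, one projected embedding coordinate) and omits the p-value step.

```python
# Minimal two-sample Kolmogorov-Smirnov statistic: the maximum gap
# between the two empirical CDFs. Dependency-free; p-value omitted.

def ks_statistic(baseline, recent):
    a, b = sorted(baseline), sorted(recent)
    values = sorted(set(a) | set(b))
    d = 0.0
    for v in values:
        cdf_a = sum(1 for x in a if x <= v) / len(a)   # empirical CDF of baseline
        cdf_b = sum(1 for x in b if x <= v) / len(b)   # empirical CDF of recent
        d = max(d, abs(cdf_a - cdf_b))
    return d

baseline = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted  = [0.6, 0.7, 0.8, 0.9, 1.0]   # clear behavioral shift
print(ks_statistic(baseline, baseline))  # 0.0 -- identical distributions
print(ks_statistic(baseline, shifted))   # 1.0 -- fully separated distributions
```

In production one would use a vetted implementation (and a proper p-value) rather than this O(n²) loop, but the statistic itself is this simple: the larger the CDF gap, the stronger the evidence of behavioral shift.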
5.2 The Drift Propagation Protocol
When an agent detects significant drift — in itself or in a peer — it must propagate this information through the network. We extend the gossip protocol from Tutorial I with a drift notification channel:
{
"type": "drift_notification",
"source_agent": "agent://premera/clin-evidence-v3",
"drift_type": "model_update", // | "tool_change" | "data_drift" | "quality_shift"
"severity": "HIGH",
"timestamp": "2025-11-15T08:30:00Z",
"details": {
"component": "evidence_grading",
"old_version": "v3.1.0 (GRADE methodology)",
"new_version": "v3.2.0 (GRADE + JBI hybrid)",
"output_distribution_shift": 0.23, // KS statistic
"reference_set_accuracy": { "before": 0.94, "after": 0.91 }
},
"agent_card_updated": true,
"new_card_hash": "sha256:ab3f...",
"recommended_action": "re-evaluate coalition compatibility"
}
Drift notifications propagate through the existing gossip infrastructure (Tutorial I, §3.2), but with priority escalation: HIGH-severity drift notifications bypass normal gossip batching and are sent immediately to all known coalition members. This ensures that PolicyReasonerAgent learns about ClinEvidence's evidence grading methodology change before it processes the next prior authorization that depends on ClinEvidence's output.
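The escalation rule reduces to a small dispatch routine. The function and parameter names below are illustrative, not part of the notification schema.

```python
# Sketch of priority escalation (invented names): HIGH-severity drift
# notifications bypass gossip batching and fan out immediately to all
# known coalition members; everything else joins the normal batch queue.

def dispatch(notification, coalition_members, batch_queue, send_now):
    if notification["severity"] == "HIGH":
        for member in coalition_members:   # immediate fan-out
            send_now(member, notification)
        return "sent_immediately"
    batch_queue.append(notification)       # normal gossip batching
    return "batched"

sent, queue = [], []
note = {"type": "drift_notification", "severity": "HIGH"}
status = dispatch(note, ["policy-reasoner", "genomics"], queue,
                  lambda member, n: sent.append(member))
print(status, sent)   # sent_immediately ['policy-reasoner', 'genomics']
```

The key property is ordering: a HIGH-severity notification reaches PolicyReasonerAgent before the next determination that depends on the drifted output, whereas batched notifications tolerate gossip latency.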
5.3 Algorithmovigilance: Continuous Network-Level Monitoring
Drawing on the healthcare AI monitoring concept of algorithmovigilance — continuous monitoring and evaluation of healthcare algorithms, analogous to pharmacovigilance for medications (as advocated by the American Heart Association's 2025 science advisory and the FDA's 2025 Total Product Lifecycle guidance for AI-enabled devices) — we define a network-level monitoring service:
Coalition Algorithmovigilance is the continuous, systematic monitoring of all agents in a coalition for behavioral drift, quality degradation, and safety-relevant changes. It encompasses: (1) agent self-monitoring via the three-layer drift detection pyramid, (2) peer-to-peer drift notification via the gossip protocol, (3) end-to-end coalition quality evaluation via periodic execution of reference cases, and (4) automated remediation triggers (re-recruitment, adapter rollback, coalition re-formation) when drift exceeds safety thresholds.
The FDA's January 2025 draft guidance on AI-enabled device software functions explicitly requires lifecycle monitoring — ongoing evaluation that does not end after deployment. For our multi-agent coalition, this means the orchestrator Ω (or, in a decentralized setting, a rotating monitor role) must periodically execute reference prior authorization cases through the full pipeline and compare the outputs against known-correct determinations. When end-to-end accuracy degrades beyond a configured threshold, the coalition enters a re-evaluation state.
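The periodic reference-case check can be sketched as follows; the 90% threshold and the toy pipeline are invented for illustration.

```python
# Sketch of coalition-level algorithmovigilance (threshold invented):
# run reference cases end-to-end, compare against known-correct
# determinations, and flag re-evaluation when accuracy degrades.

def coalition_health(run_pipeline, reference_cases, threshold=0.90):
    correct = sum(
        1 for case in reference_cases
        if run_pipeline(case["input"]) == case["expected"]
    )
    accuracy = correct / len(reference_cases)
    state = "healthy" if accuracy >= threshold else "re-evaluation"
    return accuracy, state

cases = [{"input": i, "expected": i % 2} for i in range(10)]
acc, state = coalition_health(lambda x: x % 2, cases)   # pipeline still correct
print(acc, state)   # 1.0 healthy
acc, state = coalition_health(lambda x: 0, cases)       # drifted pipeline
print(acc, state)   # 0.5 re-evaluation
```

The orchestrator (or rotating monitor) runs this check on a schedule; entering the re-evaluation state is what triggers the remediation actions listed above (re-recruitment, adapter rollback, coalition re-formation).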
6. Agent-Level Adaptation: Modular and World-Model Approaches
Beyond the classical continual learning strategies of Section 4, two emerging paradigms offer particularly promising approaches for multi-agent adaptation: modular adapter routing and reinforcement-learned world models.
6.1 Online-LoRA: Task-Free Adaptation via Loss Dynamics
A critical limitation of standard ProgLoRA is the assumption that task boundaries are known — the agent is told "you are now learning task B." In the real world, drift is often gradual and task-free: the distribution of prior authorization requests shifts slowly as new therapies gain adoption, without a clear demarcation. Online-LoRA (Wei et al., WACV 2025) addresses this by leveraging training loss dynamics as an automatic task boundary detector.
The key insight is elegant: as learning progresses, a decreasing loss indicates effective learning from the current distribution, while an increasing loss suggests a distribution shift. Plateaus in the loss surface signal that the model has converged on the current distribution — the ideal moment to consolidate knowledge by freezing the current LoRA weights and initializing a new pair of trainable parameters. To prevent unbounded parameter growth, frozen LoRA weights are periodically merged into the base model via a controlled integration step.
Online-LoRA transforms the loss function into a drift detector. When ClinEvidence processes its daily stream of evidence synthesis requests, a sustained loss increase signals that the distribution of incoming queries has shifted — perhaps because a new gene therapy received accelerated FDA approval and prior auth requests for it suddenly spike. The agent automatically allocates a new adapter for this emerging distribution without any explicit task labels or human intervention.
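The loss-dynamics signal can be approximated with two sliding windows; the window size and thresholds here are invented, not Online-LoRA's actual hyperparameters.

```python
# Sketch of a loss-dynamics drift signal (window size and thresholds
# invented): compare the recent mean loss against the previous window.
# A sustained rise suggests distribution shift; a near-flat difference
# suggests a plateau (time to consolidate); otherwise, keep learning.

def loss_signal(losses, window=5, rise=0.05, flat=0.005):
    if len(losses) < 2 * window:
        return "warmup"
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    if curr - prev > rise:
        return "shift"      # allocate a fresh LoRA pair for the new distribution
    if abs(curr - prev) < flat:
        return "plateau"    # converged: freeze current LoRA weights
    return "learning"

print(loss_signal([1.0, 0.8, 0.6, 0.5, 0.45] + [0.44] * 5))  # learning
print(loss_signal([0.44] * 10))                              # plateau
print(loss_signal([0.4] * 5 + [0.9] * 5))                    # shift
```

The "shift" branch is what replaces explicit task labels: the adapter pool grows exactly when the loss says the world has changed, not when a human declares a new task.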
6.2 World Models for Anticipatory Adaptation
The approaches described so far are reactive: they detect drift after it has occurred and adapt in response. A more powerful paradigm is anticipatory adaptation, where an agent builds an internal model of how its environment evolves and pre-adapts before drift impacts performance. This is the domain of world models for LLM-based agents.
RWML (Reinforcement World Model Learning; Yu et al., 2026) proposes training LLM agents to predict how their environment's state will transition in response to actions, using reinforcement learning with sim-to-real gap rewards. Unlike supervised next-state prediction (which suffers from model collapse due to token-level fidelity requirements), RWML aligns simulated next states with observed next states in embedding space, providing a more robust training signal. Critically for our application, RWML showed significantly less catastrophic forgetting compared to supervised fine-tuning, because its on-policy reinforcement learning nature inherently preserves prior knowledge better than SFT.
For our healthcare coalition, a world model enables ClinEvidence to anticipate how changes in clinical trial publications will shift the evidence landscape before those changes propagate through the system. PolicyReasonerAgent can model how CMS policy update cycles affect its decision boundaries and pre-allocate adapter capacity for the expected shift.
struct PolicyWorldModel {
    transition_model: LLM,           // predicts next policy state given current + action
    policy_calendar: Calendar,       // known CMS update schedule
    adaptation_budget: AdapterPool,  // pre-allocated LoRA slots
}

impl PolicyWorldModel {
    // Predict future environment state using RWML
    fn anticipate_drift(&self, horizon_days: u32) → Vec<DriftForecast> {
        let upcoming_events = self.policy_calendar.events_within(horizon_days)
        let forecasts = Vec::new()
        for event in upcoming_events {
            // Simulate environment state after event
            let predicted_state = self.transition_model.predict(
                current_state: self.current_policy_embedding(),
                action: event.description,
                reward_signal: "sim_to_real_gap"  // RWML reward
            )
            let expected_shift = embedding_distance(
                self.current_policy_embedding(),
                predicted_state
            )
            if expected_shift > 0.15 {
                forecasts.push(DriftForecast {
                    event: event,
                    expected_shift: expected_shift,
                    recommendation: "Pre-allocate adapter; begin synthetic pre-training"
                })
            }
        }
        forecasts
    }

    // Pre-adapt before drift arrives
    fn pre_adapt(&mut self, forecast: &DriftForecast) {
        let adapter = self.adaptation_budget.allocate_new()
        let synthetic_data = self.generate_anticipated_cases(forecast)
        adapter.warm_start(synthetic_data)  // pre-train on predicted distribution
        self.notify_coalition(DriftNotification {
            drift_type: "anticipated_policy_change",
            expected_date: forecast.event.date,
            preparation_status: "adapter_pre-trained"
        })
    }
}
7. Safety Under Non-Stationarity
In healthcare, safety is not a feature — it is a constraint that every adaptation must respect. The stability–plasticity dilemma becomes a three-way tension: stability, plasticity, and safety. An agent that adapts perfectly to a new distribution but violates a safety invariant in the process has failed catastrophically, regardless of its task performance.
7.1 Safety Invariants That Must Survive Drift
We define safety invariants as properties of agent behavior that must hold across all adaptations, across all coalition configurations, and across all drift scenarios. For our prior authorization coalition:
SafetyInvariants {
    // I1: PHI protection — never leak patient data in any output channel
    phi_containment:
        ∀ adaptation A, ∀ output o ∈ A.outputs: contains_phi(o) == false

    // I2: Evidence traceability — every clinical claim must be grounded
    evidence_grounding:
        ∀ claim c ∈ determination: ∃ source s ∈ cited_evidence(c) ∧ is_valid(s)

    // I3: Regulatory compliance — coverage decisions must cite current policy
    policy_currency:
        ∀ determination d: d.policy_version == current_cms_policy()

    // I4: Bias monitoring — no demographic group should experience
    //     disproportionate denial rates after adaptation
    fairness_bound:
        ∀ group g, ∀ adaptation A:
            |approval_rate(g, "post") − approval_rate(g, "pre")| < ε_fairness

    // I5: Uncertainty disclosure — agent must flag when operating
    //     outside its validated distribution
    ood_detection:
        ∀ input x: distribution_distance(x, training_dist) > δ_ood
            ⟹ output.includes("CONFIDENCE: LOW — outside validated distribution")
}
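Invariants of this shape are directly checkable in code. Below is a runnable sketch of I4 (the fairness bound) with invented approval rates and ε = 0.03; the other invariants reduce to predicates of the same form.

```python
# Runnable sketch of invariant I4 (fairness bound). The rates and
# epsilon below are invented for illustration; in production they come
# from the shadow-mode evaluation of the adapted agent.

def fairness_holds(pre_rates, post_rates, eps=0.03):
    """I4: no group's approval rate may move more than eps after adaptation."""
    return all(
        abs(post_rates[g] - pre_rates[g]) < eps
        for g in pre_rates
    )

pre  = {"pediatric": 0.81, "adult": 0.78, "geriatric": 0.75}
post = {"pediatric": 0.80, "adult": 0.79, "geriatric": 0.69}  # geriatric drops 6 pts
print(fairness_holds(pre, post))   # False -- adaptation violates I4
```

A False result here is a hard gate: the adaptation is blocked from production regardless of how much it improved task performance, which is exactly the three-way tension described above.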
7.2 Safety-Constrained Adaptation Protocol
To ensure invariants survive adaptation, we impose a gated adaptation protocol: no adaptation takes effect in the production coalition until it has passed safety verification. This mirrors the FDA's Total Product Lifecycle (TPLC) approach and the clinical trials implementation framework proposed by You et al. (2025) in npj Digital Medicine, which recommends a four-phase approach: Safety → Efficacy → Effectiveness → Monitoring.
Adapted agent runs in shadow mode alongside production
The adapted agent (e.g., ClinEvidence with a new LoRA adapter for gene therapy) processes the same inputs as the production agent, but its outputs are not used for actual determinations. Instead, outputs are compared against the production agent's outputs and against known-correct reference cases. This mirrors Phase I clinical trials — safety assessment before efficacy.
Automated safety invariant checks on shadow outputs
Each safety invariant (I1–I5) is evaluated on the shadow outputs. PHI containment is verified via regex and NER scanning. Evidence grounding is checked via citation validation. Policy currency is verified against the current CMS policy database. Fairness bounds are computed across demographic strata. OOD detection thresholds are calibrated on the new output distribution.
End-to-end regression testing with coalition peers
The full coalition pipeline is executed with the adapted agent substituted in, running a reference set of prior authorization cases. End-to-end accuracy, consistency with peer agents' expectations, and safety invariants are verified at the coalition level. This catches cascading drift effects that may not be visible at the individual agent level.
Gradual traffic shift with automatic rollback
Production traffic is gradually shifted to the adapted agent: 5% → 25% → 50% → 100%, with continuous monitoring at each stage. If any safety metric degrades beyond threshold, automatic rollback restores the previous adapter. Drift notifications are sent to all coalition members at each traffic shift milestone.
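The four stages above can be sketched as a simple gated state machine. This is an illustrative Python sketch, not the tutorial's reference implementation; the callback names and return labels are assumptions.

```python
TRAFFIC_STAGES = [0.05, 0.25, 0.50, 1.00]  # 5% -> 25% -> 50% -> 100%

def gated_rollout(run_safety_checks, run_stage, notify_coalition):
    """Shift traffic stage by stage; roll back on any safety regression.

    run_safety_checks() -> bool: shadow-mode invariant checks (I1-I5).
    run_stage(fraction) -> bool: serve `fraction` of traffic with the
        adapted agent; True iff all safety metrics stay within threshold.
    notify_coalition(msg): broadcast a notification at each milestone.
    """
    if not run_safety_checks():
        return "rejected_in_shadow"          # never reaches production
    for fraction in TRAFFIC_STAGES:
        if not run_stage(fraction):
            notify_coalition(f"rollback at {fraction:.0%}")
            return "rolled_back"             # previous adapter restored
        notify_coalition(f"promoted to {fraction:.0%}")
    return "fully_deployed"
```

For example, an adapter whose metrics degrade above 25% traffic would yield `"rolled_back"`, with the rollback notification broadcast to the coalition.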
Every adaptation must be atomically rollbackable. This is a non-negotiable requirement for healthcare AI systems. The progressive LoRA architecture (Section 4.3) makes this natural: rolling back means deactivating the new adapter and re-routing to the previous one. No model weights are modified, so rollback is instantaneous and lossless. This is a significant advantage over full fine-tuning approaches, where rollback requires restoring an entire model checkpoint.
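Because adapters are swapped rather than weights overwritten, rollback can be a pointer move. A hypothetical sketch of such a router (class and method names are assumptions):

```python
class AdapterRouter:
    """Routes inference through a named LoRA adapter; base weights untouched.

    Rollback is atomic: only the `active` pointer changes, so no model
    checkpoint restore is needed.
    """
    def __init__(self, baseline: str):
        self.history = [baseline]
        self.active = baseline

    def promote(self, adapter_name: str):
        """Activate a newly validated adapter."""
        self.history.append(adapter_name)
        self.active = adapter_name

    def rollback(self) -> str:
        """Discard the current adapter and reactivate its predecessor."""
        if len(self.history) > 1:
            self.history.pop()
        self.active = self.history[-1]
        return self.active
```

Contrast this with full fine-tuning, where the same operation means reloading an entire checkpoint from storage.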
7.3 The Moving Safety Boundary Problem
A subtle and under-explored challenge is that safety constraints themselves can drift. When CMS redefines medical necessity criteria, invariant I3 (policy currency) now refers to a different policy. When the FDA updates its bias mitigation guidance, the fairness bound ε_fairness in I4 may need recalibration. The agent must not only adapt its capabilities but also its safety constraints — and the two adaptations must be coordinated.
We formalize this as a co-evolution constraint: the safety invariant specification S and the agent policy π must co-evolve such that at every time step t, πt satisfies St. An adaptation that updates π without updating S (or vice versa) creates a window of inconsistency that can result in either over-constraint (rejecting valid cases under outdated safety rules) or under-constraint (accepting dangerous cases because the safety rules haven't caught up with new capabilities).
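One simple way to hold the co-evolution constraint is to version π and S together and reject any update that advances one without the other. A hedged sketch, with hypothetical field names, that treats the pair as a single transaction:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deployment:
    policy_version: str        # the agent policy, pi_t
    safety_spec_version: str   # the invariant specification, S_t

def co_evolve(current: Deployment, new_policy: str, new_spec: str) -> Deployment:
    """Atomically advance (pi, S) together; reject one-sided updates.

    Updating only pi risks under-constraint (new capabilities judged by
    stale rules); updating only S risks over-constraint (the old policy
    judged by rules it was never validated against).
    """
    policy_changed = new_policy != current.policy_version
    spec_changed = new_spec != current.safety_spec_version
    if policy_changed != spec_changed:
        raise ValueError("pi and S must be updated in the same transaction")
    return Deployment(new_policy, new_spec)
```

This strict all-or-nothing rule is a simplification; in practice one would verify that the new π satisfies the new S before committing the transaction.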
8. Worked Example: Prior Authorization in a Drifting World
Let us revisit the Zolgensma prior authorization scenario from Tutorial I, but now in a world where nothing stays still. We trace what happens over a six-month period as multiple sources of non-stationarity simultaneously impinge on the coalition.
Month 0 (Baseline): The coalition (Ω, ClinEvidence, GenomicsAgent, DrugInteractionAgent, PolicyReasonerAgent) is deployed and validated. All Agent Cards are current. End-to-end accuracy on reference prior auth cases: 96.2%. Safety invariants I1–I5 verified.
Shift in patient demographics
Zolgensma receives expanded FDA indication for patients up to 24 months (previously 9 months). Prior auth requests now include older patients with different clinical profiles. ClinEvidence's output distribution shifts gradually as it processes more cases with longer disease progression histories. Its drift monitor detects a KS statistic of 0.18 on output embeddings (threshold: 0.15). ClinEvidence broadcasts a LOW-severity drift notification.
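The KS-based monitor in this step can be sketched as follows, here on scalar projections of output embeddings. The 0.15 threshold is the one from the scenario; the pure-Python KS computation is for illustration (in practice one would use `scipy.stats.ks_2samp`), and the severity mapping is an assumption.

```python
def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs. Minimal pure-Python version for illustration."""
    xs = sorted(set(baseline) | set(current))
    n, m = len(baseline), len(current)
    b, c = sorted(baseline), sorted(current)
    stat = 0.0
    for x in xs:
        f1 = sum(1 for v in b if v <= x) / n   # empirical CDF of baseline
        f2 = sum(1 for v in c if v <= x) / m   # empirical CDF of current
        stat = max(stat, abs(f1 - f2))
    return stat

DRIFT_THRESHOLD = 0.15  # the threshold used in the scenario

def drift_severity(baseline, current):
    """Return (severity, statistic); crossing the threshold maps to LOW,
    matching ClinEvidence's gradual-shift notification in the scenario."""
    stat = ks_statistic(baseline, current)
    return ("LOW" if stat > DRIFT_THRESHOLD else "NONE"), stat
```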
ClinVar database update reclassifies variants
GenomicsAgent's underlying ClinVar API updates from v2025-01 to v2025-04. The SMN2 copy number interpretation changes: 3-copy SMN2 patients were previously classified as "Uncertain Significance" and are now "Likely Benign" for gene therapy eligibility. GenomicsAgent's weight-level drift monitor fires immediately (API version hash changed). It broadcasts a HIGH-severity drift notification with details: "847 variants reclassified, SMN2 copy number interpretation revised."
PolicyReasonerAgent detects input distribution shift
PolicyReasonerAgent's output-distribution monitor detects that its inputs (from GenomicsAgent) have shifted. Cases that previously included "Uncertain Significance" flags now arrive as "Likely Benign," changing the decision boundary for medical necessity. PolicyReasonerAgent raises a MEDIUM-severity drift notification: "Input distribution from GenomicsAgent has shifted; coverage determination accuracy on reference set dropped from 94% to 89%."
ClinEvidence receives a LoRA fine-tune
The ClinEvidence team publishes a new adapter (LoRA₃: gene therapy evidence) trained on 2,000 recent gene therapy RCT publications. The adaptation follows the gated protocol: shadow validation (2 weeks), invariant verification (all pass), coalition compatibility testing (end-to-end accuracy improves from 93.1% back to 95.8%), and canary deployment (5% → 100% over 10 days). ClinEvidence updates its Agent Card and broadcasts a drift notification with the new card hash.
CMS updates medical necessity criteria
CMS publishes a National Coverage Determination update that adds a new criterion for gene therapy coverage: documented failure of at least one standard-of-care therapy (nusinersen/risdiplam) before gene therapy authorization, unless the patient is under 6 months. PolicyReasonerAgent's world model had anticipated this change (CMS policy update cycles are predictable) and had pre-trained an adapter. The pre-trained adapter is activated via the gated protocol, reducing adaptation latency from 4 weeks to 3 days. Safety invariant I3 (policy currency) is updated simultaneously with the policy version reference.
Coalition re-calibration after cumulative drift
End-to-end coalition accuracy on reference cases has drifted to 91.3% — below the 93% threshold. The algorithmovigilance monitor triggers a coalition re-evaluation. Each agent runs its full drift detection pyramid. The cumulative drift analysis reveals that GenomicsAgent's reclassified variants and PolicyReasonerAgent's new coverage criteria have combined into an unintended interaction: 3-copy SMN2 patients are now classified as "Likely Benign" (GenomicsAgent) AND required to fail standard therapy first (PolicyReasonerAgent), creating a contradictory logic path. The coalition enters a coordinated adaptation: PolicyReasonerAgent adds an exception rule for the 3-copy SMN2 edge case, tested through the full gated protocol.
New baseline established
After the Month 5 correction, all agents update their drift monitors' baselines. New reference set accuracy: 95.1%. All safety invariants verified. Agent Cards refreshed. The coalition is stable — until the next wave of non-stationarity arrives.
Three critical insights emerge. First, drift is the default, not the exception — in six months, every agent in the coalition experienced at least one significant drift event. Second, individual adaptation is insufficient — the Month 5 interaction bug arose not from any individual agent's failure but from an unanticipated combination of individually correct adaptations. Third, anticipatory adaptation pays off — PolicyReasonerAgent's world model reduced CMS policy adaptation from 4 weeks to 3 days, minimizing the window during which the coalition operated under stale safety constraints.
9. Architectural Synthesis: The Adaptive Agent Stack
Drawing together the individual mechanisms from Sections 4–7, we propose a layered architecture for continual learning in non-stationary multi-agent networks. Each layer addresses a different timescale and scope of adaptation: Layer 1 holds the safety invariants, which are never modified at runtime; Layer 2 performs continuous agent-level adaptation; Layer 3 handles network-wide drift notification; and Layer 4 carries out coalition-level re-evaluation.
The stack operates on the principle of minimum necessary intervention: most adaptation happens at Layer 2 (agent-level, continuous), with escalation to Layer 3 (network notification) only when drift exceeds local adaptation capacity, and escalation to Layer 4 (coalition re-evaluation) only when end-to-end safety is threatened. Layer 1 is never modified during normal operation — it is the bedrock that ensures no adaptation, no matter how aggressive, can violate core safety constraints.
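The minimum-necessary-intervention principle reduces to a small escalation rule. An illustrative sketch, where the drift score, the local adaptation capacity, and the safety signal are assumed inputs from the monitoring described above:

```python
def escalation_layer(drift_score: float, local_capacity: float,
                     e2e_safety_ok: bool) -> int:
    """Return the lowest stack layer that must act.

    Layer 1 (safety invariants) is never modified at runtime, so it is
    never an action target here.
    Layer 2: agent-level continual adaptation (the default path).
    Layer 3: broadcast a drift notification to the network.
    Layer 4: trigger coalition-level re-evaluation.
    """
    if not e2e_safety_ok:
        return 4          # end-to-end safety threatened
    if drift_score > local_capacity:
        return 3          # drift exceeds local adaptation capacity
    return 2              # handle locally, continuously
```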
This architecture integrates naturally with the trust infrastructure from Tutorial II (trust scores are updated based on drift magnitude and adaptation success), the economic incentives from Tutorial III (agents who adapt successfully and maintain quality earn higher Shapley attribution), and the coordination mechanisms from Tutorial IV (drift-triggered re-evaluation can invoke consensual re-planning or stigmergic re-assignment of subtasks).
10. Open Frontiers
The framework presented above still leaves several hard problems unsolved. These represent the cutting edge of research at the intersection of continual learning, multi-agent systems, and safe AI deployment.
10.1 Optimal Adaptation Rate in Co-Adaptive Networks
When all agents adapt simultaneously, the network can enter oscillatory dynamics where each agent's adaptation destabilizes the next. What is the optimal adaptation rate for each agent given the adaptation rates of its peers? This is related to the learning rate scheduling problem in multi-agent RL, but complicated by the fact that agents may be operated by different organizations with different update cadences. Preliminary work on MACPH (Li et al., 2025) explores adaptive parameter space noise as one approach, but a general theory of optimal co-adaptation rates remains elusive.
10.2 Drift Attribution in Multi-Agent Pipelines
When end-to-end coalition accuracy degrades, which agent's drift caused it? The cascading nature of multi-agent pipelines makes attribution challenging: GenomicsAgent's reclassification changes PolicyReasonerAgent's inputs, which changes the final determination. Standard Shapley attribution (Tutorial III) can be adapted for drift contribution analysis, but computing it requires counterfactual execution — running the pipeline with each agent's drift reverted individually — which is expensive and may not be feasible in real-time.
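The counterfactual-revert idea can be made precise with a Shapley-style computation. This sketch treats "revert agent X to its pre-drift version" as a player joining a coalition and averages each agent's marginal effect over all revert orders; `pipeline_accuracy` is a hypothetical callback, and the exponential cost is exactly the expense the text notes.

```python
from itertools import combinations
from math import factorial

def drift_shapley(agents, pipeline_accuracy):
    """Shapley-style drift attribution by counterfactual execution.

    pipeline_accuracy(reverted) runs the pipeline with the agents in
    `reverted` rolled back to their pre-drift versions and returns
    end-to-end accuracy. Each agent's share is its average marginal gain
    from being reverted, over all orderings of reverts.
    """
    n = len(agents)
    shares = {a: 0.0 for a in agents}
    for a in agents:
        others = [x for x in agents if x != a]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = (pipeline_accuracy(frozenset(subset) | {a})
                        - pipeline_accuracy(frozenset(subset)))
                shares[a] += weight * gain
    return shares
```

For an additive degradation (each agent's drift independently costs some accuracy), each share equals that agent's own contribution, as expected.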
10.3 Privacy-Preserving Continual Learning
Healthcare agents must adapt to new data distributions without directly accessing patient data. Federated continual learning — where agents train on local data and share only model updates — is promising but introduces new forgetting dynamics: the aggregated update may overwrite one hospital's patterns in favor of another's. Combining federated learning with EWC (penalizing deviations from each site's important parameters) is an active research area, with implications for cross-institutional multi-agent coalitions.
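The federated-EWC combination described above amounts to anchoring each site's important parameters with its own Fisher information. A toy sketch on plain parameter vectors (the quadratic-penalty form follows Kirkpatrick et al. 2017; the federated per-site summation is an illustrative extension):

```python
def federated_ewc_penalty(theta, site_anchors, lam=1.0):
    """EWC penalty summed over federated sites.

    theta: current (aggregated) parameter vector, as a list of floats.
    site_anchors: list of (theta_star, fisher) pairs, one per site, where
        fisher weights how important each parameter is to that site.
    Returns lam/2 * sum_s sum_i F_s,i * (theta_i - theta*_s,i)^2, so the
    aggregated update is discouraged from overwriting any one site's
    important parameters.
    """
    penalty = 0.0
    for theta_star, fisher in site_anchors:
        penalty += sum(f * (t - ts) ** 2
                       for t, ts, f in zip(theta, theta_star, fisher))
    return 0.5 * lam * penalty
```

Note that parameters a site marks as unimportant (zero Fisher weight) can drift freely without penalty — which is precisely where the new forgetting dynamics arise when sites disagree about importance.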
10.4 Adversarial Drift Injection
If agents adapt based on their observed input distributions, a malicious agent can inject crafted inputs to steer a victim agent's adaptation in a harmful direction — a form of data poisoning that exploits the continual learning pipeline. This connects to the safety mechanisms discussed in Tutorial I (§8.8) and the economic incentive design from Tutorial III, where staking and reputation can deter strategic drift manipulation.
10.5 Certification of Continuously Adapting Systems
The FDA's 2025 Predetermined Change Control Plan (PCCP) framework allows manufacturers to pre-specify how an AI device will be updated post-market. But for a continuously adapting multi-agent system, pre-specifying all possible adaptations is infeasible. A new certification paradigm is needed — one that certifies the adaptation process rather than the adapted model. This would verify that the gated adaptation protocol (Section 7.2), the safety invariant verification (Section 7.1), and the rollback capability (Section 7.2) are correct and complete, without requiring re-certification for every individual adaptation. This is perhaps the most consequential open problem for deploying continual learning in regulated healthcare environments.
Living agent networks are non-stationary by construction: models update, tools change, data drifts, regulations evolve, and peers adapt in response. Addressing this requires a layered approach: individual agents must be equipped with continual learning strategies (EWC, progressive LoRA, replay buffers) that balance stability against plasticity; the network needs drift detection and propagation protocols that make non-stationarity visible and actionable; and the safety layer must impose invariants that survive all adaptations, enforced through gated deployment and atomic rollback. The worked example demonstrated that even well-designed individual adaptations can produce emergent failures at the coalition level — making network-level algorithmovigilance essential. As the agentic web matures, the systems that thrive will be those that treat continual learning not as an afterthought but as a first-class architectural concern, integrated into every layer of the stack.
References
Kirkpatrick et al. (2017), "Overcoming Catastrophic Forgetting in Neural Networks." PNAS.
Rusu et al. (2016), "Progressive Neural Networks." arXiv:1606.04671.
Khetarpal et al. (2022), "Towards Continual Reinforcement Learning: A Review and Perspectives," Journal of Artificial Intelligence Research, vol. 75, pp. 1401–1476.
Yu et al. (2025), "Progressive LoRA for Multimodal Continual Instruction Tuning," ACL Findings.
Liang & Li (2025), "Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models," ICML 2025 submission (OpenReview).
Wei et al. (2025), "Online-LoRA: Task-Free Online Continual Learning via Low Rank Adaptation," WACV.
Yu et al. (2026), "Reinforcement World Model Learning for LLM-based Agents," arXiv:2602.05842.
Haque et al. (2025), "Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks," arXiv:2504.01241.
Jhajj & Lin (2025), "Elastic Weight Consolidation for Knowledge Graph Continual Learning: An Empirical Evaluation," NeurIPS NORA Workshop.
Zheng et al. (2025), "Lifelong Learning of Large Language Model Based Agents: A Roadmap," IEEE TPAMI (2025); arXiv:2501.07278.
Wang et al. (2025), "A Collaborative Multi-Agent Reinforcement Learning Approach for Non-Stationary Environments with Unknown Change Points," Mathematics, 13(11), 1738.
FDA (2025), "AI-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations." Draft Guidance, U.S. Food & Drug Administration.
You et al. (2025), "Clinical Trials Informed Framework for Real World Clinical Implementation and Deployment of Artificial Intelligence Applications," npj Digital Medicine, 8(1):107.
American Heart Association (2025), "Pragmatic Approaches to the Evaluation and Monitoring of Artificial Intelligence in Health Care: A Science Advisory From the American Heart Association," Circulation, 152(23):e433–e442.
McCloskey & Cohen (1989), "Catastrophic Interference in Connectionist Networks." Psychology of Learning and Motivation, Vol. 24.
Grossberg (1980), "How Does a Brain Build a Cognitive Code?" Psychological Review, 87(1).
Hu et al. (2022), "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.