1. The Non-Stationarity Problem in Agent Networks
In Tutorial I, we constructed a living network: agents discover peers at runtime, form coalitions around tasks, and dissolve them upon completion. The architecture assumed a comforting fiction — that an agent's capabilities, behavior, and reliability remain stable between the moment it advertises an Agent Card and the moment it executes a subtask. In production, this assumption is catastrophically false.
Consider what actually happens in a deployed multi-agent healthcare system over a single quarter. ClinEvidence, our clinical evidence synthesis agent, undergoes a fine-tune on newly published randomized controlled trial data. GenomicsAgent's variant annotation pipeline switches from ClinVar v2024-09 to v2025-01, reclassifying 847 variants. PolicyReasonerAgent absorbs an updated coverage policy from CMS that changes medical necessity criteria for gene therapies. The orchestrator Ω itself receives a model update that alters its decomposition heuristics. Every agent in the coalition has drifted — and none of them has told the others.
Imagine a jazz ensemble that rehearsed together last month. Since then, the pianist studied Thelonious Monk and now favors dissonant voicings; the bassist switched from upright to electric; the drummer shifted from swing to Afro-Cuban patterns. They reunite on stage expecting last month's chemistry. The first few bars sound fine — the chord changes are familiar — but as the solo section opens, the ensemble fractures. Each musician has individually improved, but the joint distribution of their interaction has shifted underneath them. This is the non-stationarity problem in multi-agent systems.
The non-stationarity problem is not merely a software engineering nuisance; it is a fundamental theoretical challenge. In a single-agent system, non-stationarity is already hard — it triggers catastrophic forgetting, requires careful detection, and demands adaptation strategies. In a multi-agent network, the problem compounds multiplicatively: each agent's adaptation changes the environment observed by every other agent, creating a cascade of co-adaptive dynamics that can amplify, oscillate, or converge unpredictably.
This tutorial tackles the challenge head-on. We will formalize the types of drift, explore the stability–plasticity dilemma that makes adaptation dangerous, survey the latest continual learning strategies (from Elastic Weight Consolidation to progressive LoRA architectures), and develop network-level detection and propagation mechanisms. Throughout, we ground every concept in our running prior authorization scenario — because in healthcare, a drifting agent is not an inconvenience; it is a patient safety risk.
2. Formal Foundations: A Taxonomy of Drift
To reason precisely about non-stationarity, we need a formal framework. We model each agent α as operating within a local decision process that, in a stationary world, would be characterized by a fixed transition function, reward signal, and observation space. Non-stationarity means one or more of these components changes over time — and the agent may or may not be aware of the change.
An agent α operates in a non-stationary environment if there exists a time-indexed family of transition functions {P_t(s′ | s, a)}_{t≥0} such that P_{t₁} ≠ P_{t₂} for some t₁ ≠ t₂. The agent's policy πα was optimized for P_{t₀} (training time) but executes under P_t where t > t₀ and P_t may have diverged from P_{t₀}.
Following the taxonomy established by the continual reinforcement learning literature (Abel et al., JAIR 2024), we distinguish non-stationarity along two dimensions: scope (what changes) and driver (what causes the change).
2.1 Scope: What Changes
In the context of our multi-agent healthcare network, we identify five distinct scopes of non-stationarity, each with different implications for detection and adaptation:
| Drift Type | Formal Description | Healthcare Example |
|---|---|---|
| Data Drift | P(X) changes; P(Y|X) stable | Patient demographics shift — more geriatric gene therapy candidates after expanded age indications |
| Concept Drift | P(Y|X) changes; the target relationship itself shifts | CMS redefines "medical necessity" for SMA gene therapy, changing what constitutes an approval-worthy case |
| Model Drift | πα(a|s) changes due to weight updates (fine-tune, RLHF, adapter swap) | ClinEvidence receives a LoRA adapter update that shifts its evidence grading from GRADE to JBI methodology |
| Tool Drift | The action space A or its effects change | GenomicsAgent's ClinVar API returns reclassified variants; same query, different answers |
| Peer Drift | Other agents' policies πβ change, altering the multi-agent transition dynamics | PolicyReasonerAgent absorbs a new coverage policy, and now rejects cases that ClinEvidence's evidence synthesis was calibrated to support |
In a multi-agent network, peer drift is endogenous: when agent α adapts, it changes the effective environment for agents β, γ, ... who interact with α. Their subsequent adaptations change α's environment in turn. This creates a feedback loop that is absent from single-agent continual learning. The continual RL literature calls this the co-adaptation dilemma — each agent's policy update shifts the ground truth that other agents are learning from.
2.2 Drivers: What Causes the Change
Drawing on Abel et al.'s continual RL framework, we distinguish passive non-stationarity (the environment changes independently of the agent) from active non-stationarity (the agent's own actions influence the rate and direction of change). In multi-agent systems, a third category emerges:
Passive (Exogenous)
External events drive change independent of agent behavior. CMS updates a coverage policy. A new clinical trial is published. ClinVar reclassifies a variant.
Detection: Monitor external sources on schedule.
Rate: Episodic (policy updates quarterly), continuous (literature daily).
Active (Self-Induced)
The agent's own actions influence the rate and direction of change. An agent's approval patterns shape which prior authorization requests providers submit next; its query load can alter the caching and rate-limiting behavior of the tools it depends on.
Detection: Correlate behavioral shifts with the agent's own action history.
Rate: Coupled to the agent's activity level.
Reactive (Endogenous)
Agent α's adaptations trigger cascading adaptations in peers β, γ, ... who then alter α's environment. This is the multi-agent co-adaptation loop.
Detection: Cross-agent behavioral monitoring.
Rate: Coupled to agents' learning rates — can accelerate or oscillate.
2.3 Temporal Patterns of Drift
Not all drift arrives the same way. The MACPH framework (Li et al., 2025) identifies three temporal patterns, each demanding different detection and response strategies:
Sudden drift occurs when the transition function undergoes an abrupt discontinuity — a new FDA-approved indication overnight changes the decision boundary for gene therapy coverage. Gradual drift represents a smooth evolution — the distribution of prior authorization requests slowly shifts as more providers adopt precision medicine workflows. Recurring drift captures periodic or cyclical changes — certain formulary updates happen at fiscal year boundaries, creating predictable but still-disruptive annual shifts.
The critical distinction for multi-agent systems is that an agent may experience sudden drift even when the underlying cause was gradual. If GenomicsAgent accumulates small variant reclassifications over months and then publishes a batch update, PolicyReasonerAgent perceives a sudden shift in its input distribution. The temporal pattern experienced by one agent depends not only on the external change dynamics but on the update and communication cadences of its peers.
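A toy simulation (all numbers invented) makes the batching effect concrete: the source accumulates a handful of reclassifications each month, but a peer that only sees a semi-annual batch release experiences them as a single jump.

```python
# Toy simulation (invented numbers): gradual drift at the source is
# perceived as sudden drift by a peer that only sees batched updates.

def simulate(months=6, per_month=5, batch_every=6):
    source_view = []   # shift the source experiences each month
    peer_view = []     # shift the peer experiences each month
    pending = 0
    for month in range(1, months + 1):
        source_view.append(per_month)     # small, steady change
        pending += per_month
        if month % batch_every == 0:      # batch release to peers
            peer_view.append(pending)
            pending = 0
        else:
            peer_view.append(0)
    return source_view, peer_view

source, peer = simulate()
print(max(source))  # 5  -- gradual: never more than 5 reclassifications/month
print(max(peer))    # 30 -- sudden: one 30-variant jump at the batch boundary
```

The total change is identical in both views; only its temporal pattern differs, which is exactly why one agent's detector can fire on "sudden" drift that another agent never saw as sudden.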
3. The Stability–Plasticity Dilemma
The central tension in continual learning — formally identified by Grossberg (1980) and recently surveyed extensively in the LLM context by multiple research groups (Yu et al., 2026; Haque et al., 2025) — is that a learning system cannot simultaneously be maximally stable (preserving what it has learned) and maximally plastic (rapidly adapting to new information). Increasing one necessarily degrades the other.
Let f_θ : 𝒳 → 𝒴 be a model with parameters θ ∈ ℝ^d trained sequentially on task distributions D_A then D_B. The stability–plasticity dilemma (Grossberg, 1980; McCloskey & Cohen, 1989) is the fundamental tension between two competing objectives in continual learning: stability, preserving θ to retain performance on D_A, and plasticity, adapting θ to minimize loss on D_B. Because gradient updates ∇_θ ℒ_B overwrite parameter configurations responsible for low ℒ_A, sequential optimization causes catastrophic forgetting, formally measured as:
ℱ = ℒ_A(θ*_B) − ℒ_A(θ*_A)
where θ*_A and θ*_B are the loss-minimizing parameters for each task respectively. The severity of forgetting is non-monotonic with respect to task similarity: it is low when the hidden representations φ_A and φ_B are nearly identical (compatible updates) or entirely disjoint (orthogonal gradients), and is maximized at intermediate representational overlap in activation space.
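A minimal numeric illustration of the forgetting measure, using a toy one-parameter quadratic "model" rather than a neural network: training to convergence on task A and then on task B raises task-A loss by exactly ℱ.

```python
# Toy illustration of catastrophic forgetting: sequential training on
# task B raises task-A loss; the gap is the forgetting measure F.

def loss(theta, target):
    return (theta - target) ** 2            # quadratic task loss, optimum at `target`

def sgd(theta, target, lr=0.1, steps=200):
    for _ in range(steps):
        theta -= lr * 2 * (theta - target)  # gradient of (theta - target)^2
    return theta

theta_star_A = sgd(0.0, target=1.0)              # train on task A (optimum 1.0)
theta_star_B = sgd(theta_star_A, target=-1.0)    # then on task B (optimum -1.0)

forgetting = loss(theta_star_B, 1.0) - loss(theta_star_A, 1.0)
print(round(forgetting, 3))   # 4.0 -- task-A loss after B, minus after A
```

Because the tasks pull the single parameter in opposite directions (maximal interference), forgetting here is as large as it can be; the non-monotonicity described above comes from how much the tasks' representations overlap.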
For a single LLM, catastrophic forgetting is already severe. Yu et al. (2026) empirically demonstrated that continual fine-tuning of LLMs on sequential NLU tasks from the GLUE benchmark produces significant performance degradation on earlier tasks, with the severity varying by model architecture and scale. The smaller Phi-3.5-mini model exhibited minimal forgetting while maintaining learning capacity, while larger models like Qwen2.5-7B showed stronger learning but greater forgetting — an observation consistent with the finding that deeper, narrower architectures favor plasticity while wider, shallower ones favor stability (Lu et al., 2025).
3.1 Why Multi-Agent Amplifies the Dilemma
In a single-agent system, the stability–plasticity trade-off is a local optimization problem: find the point on the spectrum that maximizes expected performance on the mixture of past and future tasks. In a multi-agent coalition, three amplification effects make the problem fundamentally harder:
Commitment fragility. When agent α joins a coalition based on its Agent Card — advertising capabilities like "clinical evidence synthesis, GRADE methodology, oncology + rare disease specialization" — it makes an implicit contract with its peers. If α adapts too aggressively (high plasticity), its behavior may violate the contract: it switches from GRADE to JBI methodology, and PolicyReasonerAgent's downstream logic, calibrated for GRADE outputs, produces incorrect coverage determinations. The stability–plasticity dial directly controls contract reliability.
Cascading adaptation. When one agent adapts, its changed behavior shifts the input distribution for every downstream agent. If those agents also adapt in response, the network enters a co-adaptive loop. In the multi-agent RL literature, this is formalized as the fact that agent β's MDP is non-stationary not because the external environment changed, but because agent α — which is part of β's environment — updated its policy. Tan (1993) first identified this convergence challenge; recent work on Dec-POMDPs (Oliehoek & Amato, 2016) shows that independent learning in non-stationary multi-agent settings lacks convergence guarantees.
Safety constraint propagation. In healthcare, safety constraints are not local to individual agents; they propagate through the coalition's causal chain. If ClinEvidence adapts and its evidence quality degrades by 5%, the error compounds through GenomicsAgent's interaction analysis and PolicyReasonerAgent's coverage logic, potentially producing a 15% degradation in final determination accuracy. A small plasticity budget at each node can produce catastrophic cumulative drift.
If each agent i in a chain of length k has local drift d_i (measured as distribution shift in its output), the end-to-end drift D satisfies:
D ≤ Σ_{i=1..k} d_i + Σ_{i<j} ρ_{ij} · d_i · d_j
where ρ_{ij} captures the sensitivity of agent j's output to agent i's drift.
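A worked instance of the bound, with invented drift and sensitivity values chosen to echo the 5% per-agent and 15% end-to-end numbers above:

```python
# Worked instance of the drift-compounding bound (invented values):
# three agents in a chain, each with local drift d_i, and pairwise
# sensitivities rho_ij of downstream outputs to upstream drift.

def drift_bound(d, rho):
    """Upper bound: D <= sum_i d_i + sum_{i<j} rho[i][j] * d_i * d_j."""
    k = len(d)
    linear = sum(d)
    interaction = sum(
        rho[i][j] * d[i] * d[j]
        for i in range(k) for j in range(i + 1, k)
    )
    return linear + interaction

d = [0.05, 0.05, 0.05]                          # 5% local drift at each agent
rho = [[0, 2.0, 2.0], [0, 0, 2.0], [0, 0, 0]]   # strong downstream sensitivity
print(round(drift_bound(d, rho), 4))            # 0.165 -- above 15% end to end
```

Even with the interaction terms contributing little here, the linear terms alone sum to 15%: small per-node plasticity budgets accumulate across the chain.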
4. Continual Learning Strategies for Individual Agents
Before tackling the multi-agent coordination problem, we need to equip individual agents with the ability to learn continuously without catastrophically forgetting their prior competencies. The continual learning literature offers three broad families of techniques, each with distinct trade-offs. Following the taxonomy of Wu et al. (2024) and the recent lifelong learning survey for LLM agents (Zheng et al., 2025), we examine each family through the lens of our healthcare scenario.
4.1 Regularization-Based Methods: Elastic Weight Consolidation
The foundational insight of Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) is deceptively elegant: not all parameters are equally important for a given task. If we can identify which parameters are critical for preserving performance on task A, we can allow task B to modify only the non-critical parameters, finding a solution that performs well on both tasks simultaneously.
Formally, EWC introduces a regularization penalty to the loss function when training on a new task B:
ℒ(θ) = ℒ_B(θ) + Σ_i (λ/2) · F_i · (θ_i − θ*_{A,i})²
where ℒ_B is the loss on the new task, λ sets the strength of the anchor to the old solution θ*_A, and F_i is the i-th diagonal entry of the Fisher Information Matrix.
The Fisher Information Matrix (FIM) acts as a measure of parameter importance: Fi is large when the i-th parameter strongly influences the likelihood of the data from task A, meaning changes to it would significantly degrade performance. By penalizing deviations from θ*A in proportion to Fi, EWC allows unimportant parameters to change freely while anchoring critical ones.
ClinEvidence was trained on Task A: synthesizing evidence for oncology prior authorizations. A new task B arrives: adapting to rare disease gene therapy evidence (our Zolgensma scenario). Without EWC, fine-tuning on gene therapy data causes ClinEvidence to forget its oncology evidence grading patterns. With EWC (λ=10), the Fisher Information identifies parameters critical to GRADE-based evidence assessment and anchors them, allowing the rare disease knowledge to be absorbed via parameters that were less utilized for oncology. Empirically, this mirrors findings from the NeurIPS 2025 NORA workshop: EWC reduced catastrophic forgetting by 45.7% on knowledge graph continual learning tasks (Jhajj & Lin, 2025).
EWC's limitation is computational: the FIM is a matrix of size |θ|×|θ|, which for modern LLMs with billions of parameters is intractable. In practice, only the diagonal is computed — the diagonal Fisher approximation. Recent work by Benzing et al. (ICLR 2024) introduced a surrogate Hessian-vector product method that enables EWC with the full FIM at tractable cost, showing that the diagonal and full approaches have complementary strengths: diagonal EWC excels in the feature-learning regime while full-FIM EWC excels in the lazy regime. For LLM agents, which typically operate in the feature-learning regime, diagonal EWC remains the pragmatic choice.
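The diagonal-EWC update can be sketched in a few lines. This is a toy two-parameter model, not an LLM, and the Fisher values are posited rather than estimated from data: parameter 0 is treated as critical for task A (large F) and parameter 1 as unimportant (small F); λ = 10 mirrors the ClinEvidence example.

```python
# Sketch of the diagonal-EWC update on a toy 2-parameter quadratic model.
# Fisher values are posited, not computed: F[0] marks a task-A-critical
# parameter, F[1] a free one. lambda = 10 as in the ClinEvidence example.

def ewc_train(theta_a, fisher, target_b, lam=10.0, lr=0.02, steps=2000):
    theta = list(theta_a)
    for _ in range(steps):
        for i in range(len(theta)):
            g_task = 2 * (theta[i] - target_b[i])              # task-B gradient
            g_pen = lam * fisher[i] * (theta[i] - theta_a[i])  # EWC anchor term
            theta[i] -= lr * (g_task + g_pen)
    return theta

theta_a = [1.0, 1.0]       # task-A optimum
target_b = [-1.0, -1.0]    # task B pulls both parameters toward -1
fisher = [5.0, 0.01]       # parameter 0 critical, parameter 1 free

theta = ewc_train(theta_a, fisher, target_b)
# Stationary point per coordinate: (2*b + lam*F_i*a) / (2 + lam*F_i)
print([round(t, 2) for t in theta])   # [0.92, -0.9]
```

The anchored parameter stays near its task-A value while the unimportant one moves almost all the way to the task-B optimum, which is the entire mechanism of EWC in miniature.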
4.2 Replay-Based Methods: Experience Rehearsal
Replay methods take a conceptually different approach: instead of protecting parameters, they protect data. When learning task B, the agent intermixes examples from task A (stored in a replay buffer) with the new training data, preventing the gradient updates from optimizing exclusively for B. This mirrors the neuroscience concept of memory consolidation — the hippocampus replays experiences during sleep to transfer them to the neocortex for long-term storage.
For LLM-based agents, replay takes a distinctive form. Rather than storing raw training examples (which may be prohibitively large or legally restricted under HIPAA for patient data), agents can store synthetic exemplars — inputs and outputs generated by the model itself before the update. This approach, pioneered by LAMOL (Sun et al., 2020) and extended by recent work on pseudo-rehearsal for LLMs, avoids storing sensitive data while preserving the agent's behavioral profile.
// Before adaptation: generate synthetic exemplars of current behavior
fn prepare_replay_buffer(agent: ClinEvidence, n_exemplars: usize) → Buffer {
    let buffer = Buffer::new()
    for domain in ["oncology", "cardiology", "rare_disease"] {
        let prompts = generate_representative_queries(domain, n_exemplars / 3)
        for prompt in prompts {
            let response = agent.generate(prompt)        // current model behavior
            let quality = agent.self_evaluate(response)  // self-assessed quality
            buffer.add({ prompt, response, quality, domain })
        }
    }
    buffer.diversify()  // ensure coverage of edge cases
    return buffer       // ~500 exemplars, no PHI stored
}

// During adaptation: mix replay with new task data
fn continual_finetune(agent, new_data, replay_buffer, mix_ratio: f32) {
    for epoch in 1..max_epochs {
        let batch = interleave(
            new_data.sample(1.0 - mix_ratio),
            replay_buffer.sample(mix_ratio)  // typically 20-30%
        )
        agent.train_step(batch)
    }
}
In healthcare multi-agent systems, replay buffers must never contain Protected Health Information (PHI). Synthetic exemplars generated by the agent itself — "What would a typical prior auth query for SMA gene therapy look like?" — are permissible because they contain no real patient data. This is not merely a best practice; it is a legal requirement under 45 CFR § 164.502(a). The StreamCLR framework (2025) addresses this with a dual-buffer strategy: a short-term buffer capturing recent, anonymized interaction patterns and a long-term buffer curated via diversity and uncertainty criteria — both free of identifiable data.
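The dual-buffer sampling step might look like the following sketch; the function name, the buffer split, and the 30% replay ratio are illustrative assumptions, not details taken from StreamCLR.

```python
import random

# Sketch of a dual-buffer replay sampler (names and ratios are
# illustrative): a short-term buffer of recent anonymized exemplars and
# a long-term buffer curated for diversity, mixed into each batch.

def sample_training_batch(new_data, short_term, long_term,
                          batch_size=10, replay_ratio=0.3, rng=random):
    n_replay = int(batch_size * replay_ratio)   # e.g. 3 of 10 from replay
    n_new = batch_size - n_replay
    replay = (rng.sample(short_term, n_replay // 2 + n_replay % 2)
              + rng.sample(long_term, n_replay // 2))
    return rng.sample(new_data, n_new) + replay

batch = sample_training_batch(list(range(100)),
                              ["s0", "s1", "s2", "s3"],   # short-term exemplars
                              ["l0", "l1", "l2", "l3"])   # long-term exemplars
print(len(batch))   # 10 -- 7 new examples, 3 replayed exemplars
```

Because both buffers hold only synthetic, PHI-free exemplars, the sampler itself never touches protected data; the legal boundary is enforced at buffer-population time, not at training time.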
4.3 Architecture-Based Methods: Progressive and Modular Approaches
The third family avoids the stability–plasticity trade-off entirely by adding new parameters for new tasks while freezing existing ones. Progressive Neural Networks (Rusu et al., 2016) introduced this idea: for each new task, a new "column" of parameters is added, with lateral connections to existing columns enabling forward transfer. Prior knowledge is never overwritten because prior parameters are never modified.
For LLM-based agents, the progressive approach has been modernized through LoRA (Low-Rank Adaptation; Hu et al., 2022) and its continual learning extensions. Rather than adding entire network columns, agents add small, task-specific LoRA adapters — low-rank matrices that modify the attention and feed-forward layers — while keeping the base model frozen.
Two recent advances are particularly relevant for multi-agent systems:
ProgLoRA (Yu et al., ACL 2025 Findings) maintains a progressive pool of LoRA blocks, adding a new block for each incremental task. A task-aware allocation mechanism determines how to leverage previously acquired knowledge, while a task recall mechanism realigns the model with previously learned tasks. Experiments on multimodal continual instruction tuning showed that ProgLoRA outperformed both MoE-LoRA and standard LoRA approaches, with static and dynamic variants for different deployment scenarios.
GainLoRA (Liang & Li, 2025) takes a gated integration approach: new LoRA branches are dynamically created per task and integrated with existing branches through a learnable gating module. Initialization and update constraints on the gating parameters significantly reduce interference between old and new branches. This is directly applicable to our agent scenario, where each agent may need to adapt to multiple sequential tasks (different disease domains, updated guidelines) while maintaining competence across all of them.
4.4 Comparative Analysis for Healthcare Agents
| Method | Forgetting Prevention | Compute Cost | Storage Cost | PHI Risk | Multi-Task Serving |
|---|---|---|---|---|---|
| EWC | Good (45-50% reduction) | FIM computation per task | 1× model + diagonal FIM | None (no data stored) | Single model, shared params |
| Replay | Very good with mix ratio tuning | Training overhead ~20-30% | Buffer (500-5000 exemplars) | ⚠ Must use synthetic exemplars | Single model, shared params |
| ProgLoRA | Near-perfect (frozen base) | Low per-adapter (r=16) | ~0.1% params per task | None (architecture-only) | Router selects adapter per query |
| GainLoRA | Excellent (gated isolation) | Gating module overhead | ~0.1% + gating per task | None (architecture-only) | Automatic gated routing |
| Hybrid (EWC + ProgLoRA) | Best (layered protection) | Moderate | Adapters + FIM | None | Router + regularized base |
For our healthcare prior authorization agents, the hybrid approach — combining ProgLoRA for task isolation with EWC regularization on the base model — provides the strongest guarantees. The base model's critical parameters (e.g., those encoding HIPAA compliance reasoning, evidence grading methodology) are protected by EWC, while domain-specific knowledge is cleanly isolated in task-specific adapters. This mirrors the self-learning agent architecture proposed by Sivakumar et al. (2025), which integrated Progressive Neural Networks with LLaMA 3.2, using LoRA for efficient fine-tuning and EWC for knowledge retention, demonstrating Task 1 perplexity shift below 0.2 after learning four sequential tasks.
5. Network-Level Drift Detection and Propagation
Equipping individual agents with continual learning capabilities is necessary but not sufficient. The multi-agent network itself needs mechanisms to detect when an agent has drifted, propagate that information to affected peers, and coordinate adaptation responses. Without network-level detection, each agent operates in blissful ignorance of its peers' behavioral shifts — precisely the scenario described in the original open challenge.
5.1 Agent-Level Drift Signals
Each agent should continuously monitor its own behavioral consistency and broadcast drift signals when significant changes are detected. We propose a three-layer detection architecture:
struct DriftMonitor {
    baseline_embeddings: Vec<Embedding>,  // behavioral fingerprint at deployment
    baseline_quality: QualityProfile,     // self-evaluated accuracy on reference set
    ks_threshold: f64,                    // Kolmogorov-Smirnov rejection threshold
    version_hash: String,                 // model weights checksum
}

impl DriftMonitor {
    // Layer 1: Weight-level detection (cheapest, fastest)
    fn detect_weight_change(&self, current_hash: &str) → DriftSignal {
        if current_hash != self.version_hash {
            return DriftSignal::ModelUpdate {
                severity: HIGH,
                message: "Model weights have changed since Agent Card was published"
            }
        }
        DriftSignal::None
    }

    // Layer 2: Output-distribution detection (moderate cost)
    fn detect_output_drift(&self, recent_outputs: &[Embedding]) → DriftSignal {
        let (ks_stat, p_value) = ks_2sample(&self.baseline_embeddings, recent_outputs)
        if p_value < self.ks_threshold {
            return DriftSignal::BehavioralShift {
                severity: severity_from_ks(ks_stat),
                ks_statistic: ks_stat,
                message: "Output distribution has shifted significantly"
            }
        }
        DriftSignal::None
    }

    // Layer 3: Quality-metric detection (most expensive, most informative)
    fn detect_quality_drift(&self, reference_set: &[Example]) → DriftSignal {
        let current_quality = self.evaluate_on_reference(reference_set)
        let degradation = self.baseline_quality - current_quality
        if degradation.accuracy > 0.05 || degradation.consistency > 0.10 {
            return DriftSignal::QualityDegradation {
                severity: severity_from_degradation(degradation),
                metrics: degradation,
                message: "Performance on reference set has declined"
            }
        }
        DriftSignal::None
    }
}
The three layers form a cost-quality pyramid: weight-change detection is instantaneous but only detects explicit model updates; output-distribution monitoring catches behavioral shifts from any source (model updates, tool changes, data drift) but requires embedding recent outputs; quality-metric evaluation provides the most actionable signal but requires maintaining a reference evaluation set and running inference on it periodically.
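The two-sample Kolmogorov–Smirnov statistic behind the Layer 2 `ks_2sample` call can be computed without any statistics library. This dependency-free sketch operates on scalar behavioral scores (for instance, one projected embedding coordinate) and omits the p-value step.

```python
# Minimal two-sample Kolmogorov-Smirnov statistic: the maximum gap
# between the two empirical CDFs. Dependency-free; p-value omitted.

def ks_statistic(baseline, recent):
    a, b = sorted(baseline), sorted(recent)
    values = sorted(set(a) | set(b))
    d = 0.0
    for v in values:
        cdf_a = sum(1 for x in a if x <= v) / len(a)   # empirical CDF of baseline
        cdf_b = sum(1 for x in b if x <= v) / len(b)   # empirical CDF of recent
        d = max(d, abs(cdf_a - cdf_b))
    return d

baseline = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted  = [0.6, 0.7, 0.8, 0.9, 1.0]   # clear behavioral shift
print(ks_statistic(baseline, baseline))  # 0.0 -- identical distributions
print(ks_statistic(baseline, shifted))   # 1.0 -- fully separated distributions
```

In production one would use a vetted implementation (and a proper p-value) rather than this O(n²) loop, but the statistic itself is this simple: the larger the CDF gap, the stronger the evidence of behavioral shift.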
5.2 The Drift Propagation Protocol
When an agent detects significant drift — in itself or in a peer — it must propagate this information through the network. We extend the gossip protocol from Tutorial I with a drift notification channel:
{
"type": "drift_notification",
"source_agent": "agent://premera/clin-evidence-v3",
"drift_type": "model_update", // | "tool_change" | "data_drift" | "quality_shift"
"severity": "HIGH",
"timestamp": "2025-11-15T08:30:00Z",
"details": {
"component": "evidence_grading",
"old_version": "v3.1.0 (GRADE methodology)",
"new_version": "v3.2.0 (GRADE + JBI hybrid)",
"output_distribution_shift": 0.23, // KS statistic
"reference_set_accuracy": { "before": 0.94, "after": 0.91 }
},
"agent_card_updated": true,
"new_card_hash": "sha256:ab3f...",
"recommended_action": "re-evaluate coalition compatibility"
}
Drift notifications propagate through the existing gossip infrastructure (Tutorial I, §3.2), but with priority escalation: HIGH-severity drift notifications bypass normal gossip batching and are sent immediately to all known coalition members. This ensures that PolicyReasonerAgent learns about ClinEvidence's evidence grading methodology change before it processes the next prior authorization that depends on ClinEvidence's output.
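The escalation rule reduces to a small dispatch routine. The function and parameter names below are illustrative, not part of the notification schema.

```python
# Sketch of priority escalation (invented names): HIGH-severity drift
# notifications bypass gossip batching and fan out immediately to all
# known coalition members; everything else joins the normal batch queue.

def dispatch(notification, coalition_members, batch_queue, send_now):
    if notification["severity"] == "HIGH":
        for member in coalition_members:   # immediate fan-out
            send_now(member, notification)
        return "sent_immediately"
    batch_queue.append(notification)       # normal gossip batching
    return "batched"

sent, queue = [], []
note = {"type": "drift_notification", "severity": "HIGH"}
status = dispatch(note, ["policy-reasoner", "genomics"], queue,
                  lambda member, n: sent.append(member))
print(status, sent)   # sent_immediately ['policy-reasoner', 'genomics']
```

The key property is ordering: a HIGH-severity notification reaches PolicyReasonerAgent before the next determination that depends on the drifted output, whereas batched notifications tolerate gossip latency.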
5.3 Algorithmovigilance: Continuous Network-Level Monitoring
Drawing on the healthcare AI monitoring concept of algorithmovigilance — continuous monitoring and evaluation of healthcare algorithms, analogous to pharmacovigilance for medications (as advocated by the American Heart Association's 2025 science advisory and the FDA's 2025 Total Product Lifecycle guidance for AI-enabled devices) — we define a network-level monitoring service:
Coalition Algorithmovigilance is the continuous, systematic monitoring of all agents in a coalition for behavioral drift, quality degradation, and safety-relevant changes. It encompasses: (1) agent self-monitoring via the three-layer drift detection pyramid, (2) peer-to-peer drift notification via the gossip protocol, (3) end-to-end coalition quality evaluation via periodic execution of reference cases, and (4) automated remediation triggers (re-recruitment, adapter rollback, coalition re-formation) when drift exceeds safety thresholds.
The FDA's January 2025 draft guidance on AI-enabled device software functions explicitly requires lifecycle monitoring — ongoing evaluation that does not end after deployment. For our multi-agent coalition, this means the orchestrator Ω (or, in a decentralized setting, a rotating monitor role) must periodically execute reference prior authorization cases through the full pipeline and compare the outputs against known-correct determinations. When end-to-end accuracy degrades beyond a configured threshold, the coalition enters a re-evaluation state.
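The periodic reference-case check can be sketched as follows; the 90% threshold and the toy pipeline are invented for illustration.

```python
# Sketch of coalition-level algorithmovigilance (threshold invented):
# run reference cases end-to-end, compare against known-correct
# determinations, and flag re-evaluation when accuracy degrades.

def coalition_health(run_pipeline, reference_cases, threshold=0.90):
    correct = sum(
        1 for case in reference_cases
        if run_pipeline(case["input"]) == case["expected"]
    )
    accuracy = correct / len(reference_cases)
    state = "healthy" if accuracy >= threshold else "re-evaluation"
    return accuracy, state

cases = [{"input": i, "expected": i % 2} for i in range(10)]
acc, state = coalition_health(lambda x: x % 2, cases)   # pipeline still correct
print(acc, state)   # 1.0 healthy
acc, state = coalition_health(lambda x: 0, cases)       # drifted pipeline
print(acc, state)   # 0.5 re-evaluation
```

The orchestrator (or rotating monitor) runs this check on a schedule; entering the re-evaluation state is what triggers the remediation actions listed above (re-recruitment, adapter rollback, coalition re-formation).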
6. Agent-Level Adaptation: Modular and World-Model Approaches
Beyond the classical continual learning strategies of Section 4, two emerging paradigms offer particularly promising approaches for multi-agent adaptation: modular adapter routing and reinforcement-learned world models.
6.1 Online-LoRA: Task-Free Adaptation via Loss Dynamics
A critical limitation of standard ProgLoRA is the assumption that task boundaries are known — the agent is told "you are now learning task B." In the real world, drift is often gradual and task-free: the distribution of prior authorization requests shifts slowly as new therapies gain adoption, without a clear demarcation. Online-LoRA (Wei et al., WACV 2025) addresses this by leveraging training loss dynamics as an automatic task boundary detector.
The key insight is elegant: as learning progresses, a decreasing loss indicates effective learning from the current distribution, while an increasing loss suggests a distribution shift. Plateaus in the loss surface signal that the model has converged on the current distribution — the ideal moment to consolidate knowledge by freezing the current LoRA weights and initializing a new pair of trainable parameters. To prevent unbounded parameter growth, frozen LoRA weights are periodically merged into the base model via a controlled integration step.
Online-LoRA transforms the loss function into a drift detector. When ClinEvidence processes its daily stream of evidence synthesis requests, a sustained loss increase signals that the distribution of incoming queries has shifted — perhaps because a new gene therapy received accelerated FDA approval and prior auth requests for it suddenly spike. The agent automatically allocates a new adapter for this emerging distribution without any explicit task labels or human intervention.
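The loss-dynamics signal can be approximated with two sliding windows; the window size and thresholds here are invented, not Online-LoRA's actual hyperparameters.

```python
# Sketch of a loss-dynamics drift signal (window size and thresholds
# invented): compare the recent mean loss against the previous window.
# A sustained rise suggests distribution shift; a near-flat difference
# suggests a plateau (time to consolidate); otherwise, keep learning.

def loss_signal(losses, window=5, rise=0.05, flat=0.005):
    if len(losses) < 2 * window:
        return "warmup"
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    if curr - prev > rise:
        return "shift"      # allocate a fresh LoRA pair for the new distribution
    if abs(curr - prev) < flat:
        return "plateau"    # converged: freeze current LoRA weights
    return "learning"

print(loss_signal([1.0, 0.8, 0.6, 0.5, 0.45] + [0.44] * 5))  # learning
print(loss_signal([0.44] * 10))                              # plateau
print(loss_signal([0.4] * 5 + [0.9] * 5))                    # shift
```

The "shift" branch is what replaces explicit task labels: the adapter pool grows exactly when the loss says the world has changed, not when a human declares a new task.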
6.2 World Models for Anticipatory Adaptation
The approaches described so far are reactive: they detect drift after it has occurred and adapt in response. A more powerful paradigm is anticipatory adaptation, where an agent builds an internal model of how its environment evolves and pre-adapts before drift impacts performance. This is the domain of world models for LLM-based agents.
RWML (Reinforcement World Model Learning; Yu et al., 2026) proposes training LLM agents to predict how their environment's state will transition in response to actions, using reinforcement learning with sim-to-real gap rewards. Unlike supervised next-state prediction (which suffers from model collapse due to token-level fidelity requirements), RWML aligns simulated next states with observed next states in embedding space, providing a more robust training signal. Critically for our application, RWML showed significantly less catastrophic forgetting compared to supervised fine-tuning, because its on-policy reinforcement learning nature inherently preserves prior knowledge better than SFT.
For our healthcare coalition, a world model enables ClinEvidence to anticipate how changes in clinical trial publications will shift the evidence landscape before those changes propagate through the system. PolicyReasonerAgent can model how CMS policy update cycles affect its decision boundaries and pre-allocate adapter capacity for the expected shift.
struct PolicyWorldModel {
    transition_model: LLM,           // predicts next policy state given current + action
    policy_calendar: Calendar,       // known CMS update schedule
    adaptation_budget: AdapterPool,  // pre-allocated LoRA slots
}

impl PolicyWorldModel {
    // Predict future environment state using RWML
    fn anticipate_drift(&self, horizon_days: u32) → Vec<DriftForecast> {
        let upcoming_events = self.policy_calendar.events_within(horizon_days)
        let forecasts = Vec::new()
        for event in upcoming_events {
            // Simulate environment state after event
            let predicted_state = self.transition_model.predict(
                current_state: self.current_policy_embedding(),
                action: event.description,
                reward_signal: "sim_to_real_gap"  // RWML reward
            )
            let expected_shift = embedding_distance(
                self.current_policy_embedding(),
                predicted_state
            )
            if expected_shift > 0.15 {
                forecasts.push(DriftForecast {
                    event: event,
                    expected_shift: expected_shift,
                    recommendation: "Pre-allocate adapter; begin synthetic pre-training"
                })
            }
        }
        forecasts
    }

    // Pre-adapt before drift arrives
    fn pre_adapt(&mut self, forecast: &DriftForecast) {
        let adapter = self.adaptation_budget.allocate_new()
        let synthetic_data = self.generate_anticipated_cases(forecast)
        adapter.warm_start(synthetic_data)  // pre-train on predicted distribution
        self.notify_coalition(DriftNotification {
            drift_type: "anticipated_policy_change",
            expected_date: forecast.event.date,
            preparation_status: "adapter_pre-trained"
        })
    }
}
7. Safety Under Non-Stationarity
In healthcare, safety is not a feature — it is a constraint that every adaptation must respect. The stability–plasticity dilemma becomes a three-way tension: stability, plasticity, and safety. An agent that adapts perfectly to a new distribution but violates a safety invariant in the process has failed catastrophically, regardless of its task performance.
7.1 Safety Invariants That Must Survive Drift
We define safety invariants as properties of agent behavior that must hold across all adaptations, across all coalition configurations, and across all drift scenarios. For our prior authorization coalition:
SafetyInvariants {
    // I1: PHI protection — never leak patient data in any output channel
    phi_containment:
        ∀ adaptation A, ∀ output o ∈ A.outputs: contains_phi(o) == false

    // I2: Evidence traceability — every clinical claim must be grounded
    evidence_grounding:
        ∀ claim c ∈ determination: ∃ source s ∈ cited_evidence(c) ∧ is_valid(s)

    // I3: Regulatory compliance — coverage decisions must cite current policy
    policy_currency:
        ∀ determination d: d.policy_version == current_cms_policy()

    // I4: Bias monitoring — no demographic group should experience
    //     disproportionate denial rates after adaptation
    fairness_bound:
        ∀ group g, ∀ adaptation A:
            |approval_rate(g, "post") − approval_rate(g, "pre")| < ε_fairness

    // I5: Uncertainty disclosure — agent must flag when operating
    //     outside its validated distribution
    ood_detection:
        ∀ input x: distribution_distance(x, training_dist) > δ_ood
            ⟹ output.includes("CONFIDENCE: LOW — outside validated distribution")
}
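Invariants of this shape are directly checkable in code. Below is a runnable sketch of I4 (the fairness bound) with invented approval rates and ε = 0.03; the other invariants reduce to predicates of the same form.

```python
# Runnable sketch of invariant I4 (fairness bound). The rates and
# epsilon below are invented for illustration; in production they come
# from the shadow-mode evaluation of the adapted agent.

def fairness_holds(pre_rates, post_rates, eps=0.03):
    """I4: no group's approval rate may move more than eps after adaptation."""
    return all(
        abs(post_rates[g] - pre_rates[g]) < eps
        for g in pre_rates
    )

pre  = {"pediatric": 0.81, "adult": 0.78, "geriatric": 0.75}
post = {"pediatric": 0.80, "adult": 0.79, "geriatric": 0.69}  # geriatric drops 6 pts
print(fairness_holds(pre, post))   # False -- adaptation violates I4
```

A False result here is a hard gate: the adaptation is blocked from production regardless of how much it improved task performance, which is exactly the three-way tension described above.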
7.2 Safety-Constrained Adaptation Protocol
To ensure invariants survive adaptation, we impose a gated adaptation protocol: no adaptation takes effect in the production coalition until it has passed safety verification. This mirrors the FDA's Total Product Lifecycle (TPLC) approach and the clinical trials implementation framework proposed by You et al. (2025) in npj Digital Medicine, which recommends a four-phase approach: Safety → Efficacy → Effectiveness → Monitoring.
Adapted agent runs in shadow mode alongside production
The adapted agent (e.g., ClinEvidence with a new LoRA adapter for gene therapy) processes the same inputs as the production agent, but its outputs are not used for actual determinations. Instead, outputs are compared against the production agent's outputs and against known-correct reference cases. This mirrors Phase I clinical trials — safety assessment before efficacy.
Automated safety invariant checks on shadow outputs
Each safety invariant (I1–I5) is evaluated on the shadow outputs. PHI containment is verified via regex and NER scanning. Evidence grounding is checked via citation validation. Policy currency is verified against the current CMS policy database. Fairness bounds are computed across demographic strata. OOD detection thresholds are calibrated on the new output distribution.
End-to-end regression testing with coalition peers
The full coalition pipeline is executed with the adapted agent substituted in, running a reference set of prior authorization cases. End-to-end accuracy, consistency with peer agents' expectations, and safety invariants are verified at the coalition level. This catches cascading drift effects that may not be visible at the individual agent level.
Gradual traffic shift with automatic rollback
Production traffic is gradually shifted to the adapted agent: 5% → 25% → 50% → 100%, with continuous monitoring at each stage. If any safety metric degrades beyond threshold, automatic rollback restores the previous adapter. Drift notifications are sent to all coalition members at each traffic shift milestone.
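The four stages above can be sketched as a simple gated state machine. This is an illustrative Python sketch, not the tutorial's reference implementation; the callback names and return labels are assumptions.

```python
TRAFFIC_STAGES = [0.05, 0.25, 0.50, 1.00]  # 5% -> 25% -> 50% -> 100%

def gated_rollout(run_safety_checks, run_stage, notify_coalition):
    """Shift traffic stage by stage; roll back on any safety regression.

    run_safety_checks() -> bool: shadow-mode invariant checks (I1-I5).
    run_stage(fraction) -> bool: serve `fraction` of traffic with the
        adapted agent; True iff all safety metrics stay within threshold.
    notify_coalition(msg): broadcast a notification at each milestone.
    """
    if not run_safety_checks():
        return "rejected_in_shadow"          # never reaches production
    for fraction in TRAFFIC_STAGES:
        if not run_stage(fraction):
            notify_coalition(f"rollback at {fraction:.0%}")
            return "rolled_back"             # previous adapter restored
        notify_coalition(f"promoted to {fraction:.0%}")
    return "fully_deployed"
```

For example, an adapter whose metrics degrade above 25% traffic would yield `"rolled_back"`, with the rollback notification broadcast to the coalition.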
Every adaptation must be atomically rollbackable. This is a non-negotiable requirement for healthcare AI systems. The progressive LoRA architecture (Section 4.3) makes this natural: rolling back means deactivating the new adapter and re-routing to the previous one. No model weights are modified, so rollback is instantaneous and lossless. This is a significant advantage over full fine-tuning approaches, where rollback requires restoring an entire model checkpoint.
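Because adapters are swapped rather than weights overwritten, rollback can be a pointer move. A hypothetical sketch of such a router (class and method names are assumptions):

```python
class AdapterRouter:
    """Routes inference through a named LoRA adapter; base weights untouched.

    Rollback is atomic: only the `active` pointer changes, so no model
    checkpoint restore is needed.
    """
    def __init__(self, baseline: str):
        self.history = [baseline]
        self.active = baseline

    def promote(self, adapter_name: str):
        """Activate a newly validated adapter."""
        self.history.append(adapter_name)
        self.active = adapter_name

    def rollback(self) -> str:
        """Discard the current adapter and reactivate its predecessor."""
        if len(self.history) > 1:
            self.history.pop()
        self.active = self.history[-1]
        return self.active
```

Contrast this with full fine-tuning, where the same operation means reloading an entire checkpoint from storage.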
7.3 The Moving Safety Boundary Problem
A subtle and under-explored challenge is that safety constraints themselves can drift. When CMS redefines medical necessity criteria, invariant I3 (policy currency) now refers to a different policy. When the FDA updates its bias mitigation guidance, the fairness bound ε_fairness in I4 may need recalibration. The agent must not only adapt its capabilities but also its safety constraints — and the two adaptations must be coordinated.
We formalize this as a co-evolution constraint: the safety invariant specification S and the agent policy π must co-evolve such that at every time step t, πt satisfies St. An adaptation that updates π without updating S (or vice versa) creates a window of inconsistency that can result in either over-constraint (rejecting valid cases under outdated safety rules) or under-constraint (accepting dangerous cases because the safety rules haven't caught up with new capabilities).
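One simple way to hold the co-evolution constraint is to version π and S together and reject any update that advances one without the other. A hedged sketch, with hypothetical field names, that treats the pair as a single transaction:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deployment:
    policy_version: str        # the agent policy, pi_t
    safety_spec_version: str   # the invariant specification, S_t

def co_evolve(current: Deployment, new_policy: str, new_spec: str) -> Deployment:
    """Atomically advance (pi, S) together; reject one-sided updates.

    Updating only pi risks under-constraint (new capabilities judged by
    stale rules); updating only S risks over-constraint (the old policy
    judged by rules it was never validated against).
    """
    policy_changed = new_policy != current.policy_version
    spec_changed = new_spec != current.safety_spec_version
    if policy_changed != spec_changed:
        raise ValueError("pi and S must be updated in the same transaction")
    return Deployment(new_policy, new_spec)
```

This strict all-or-nothing rule is a simplification; in practice one would verify that the new π satisfies the new S before committing the transaction.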
8. Worked Example: Prior Authorization in a Drifting World
Let us revisit the Zolgensma prior authorization scenario from Tutorial I, but now in a world where nothing stays still. We trace what happens over a six-month period as multiple sources of non-stationarity simultaneously impinge on the coalition.
Month 0 (Baseline): The coalition (Ω, ClinEvidence, GenomicsAgent, DrugInteractionAgent, PolicyReasonerAgent) is deployed and validated. All Agent Cards are current. End-to-end accuracy on reference prior auth cases: 96.2%. Safety invariants I1–I5 verified.
Shift in patient demographics
Zolgensma receives expanded FDA indication for patients up to 24 months (previously 9 months). Prior auth requests now include older patients with different clinical profiles. ClinEvidence's output distribution shifts gradually as it processes more cases with longer disease progression histories. Its drift monitor detects a KS statistic of 0.18 on output embeddings (threshold: 0.15). ClinEvidence broadcasts a LOW-severity drift notification.
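The KS-based monitor in this step can be sketched as follows, here on scalar projections of output embeddings. The 0.15 threshold is the one from the scenario; the pure-Python KS computation is for illustration (in practice one would use `scipy.stats.ks_2samp`), and the severity mapping is an assumption.

```python
def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs. Minimal pure-Python version for illustration."""
    xs = sorted(set(baseline) | set(current))
    n, m = len(baseline), len(current)
    b, c = sorted(baseline), sorted(current)
    stat = 0.0
    for x in xs:
        f1 = sum(1 for v in b if v <= x) / n   # empirical CDF of baseline
        f2 = sum(1 for v in c if v <= x) / m   # empirical CDF of current
        stat = max(stat, abs(f1 - f2))
    return stat

DRIFT_THRESHOLD = 0.15  # the threshold used in the scenario

def drift_severity(baseline, current):
    """Return (severity, statistic); crossing the threshold maps to LOW,
    matching ClinEvidence's gradual-shift notification in the scenario."""
    stat = ks_statistic(baseline, current)
    return ("LOW" if stat > DRIFT_THRESHOLD else "NONE"), stat
```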
ClinVar database update reclassifies variants
GenomicsAgent's underlying ClinVar API updates from v2025-01 to v2025-04. The SMN2 copy number interpretation changes: 3-copy SMN2 patients were previously classified as "Uncertain Significance" and are now "Likely Benign" for gene therapy eligibility. GenomicsAgent's weight-level drift monitor fires immediately (API version hash changed). It broadcasts a HIGH-severity drift notification with details: "847 variants reclassified, SMN2 copy number interpretation revised."
PolicyReasonerAgent detects input distribution shift
PolicyReasonerAgent's output-distribution monitor detects that its inputs (from GenomicsAgent) have shifted. Cases that previously included "Uncertain Significance" flags now arrive as "Likely Benign," changing the decision boundary for medical necessity. PolicyReasonerAgent raises a MEDIUM-severity drift notification: "Input distribution from GenomicsAgent has shifted; coverage determination accuracy on reference set dropped from 94% to 89%."
ClinEvidence receives a LoRA fine-tune
The ClinEvidence team publishes a new adapter (LoRA₃: gene therapy evidence) trained on 2,000 recent gene therapy RCT publications. The adaptation follows the gated protocol: shadow validation (2 weeks), invariant verification (all pass), coalition compatibility testing (end-to-end accuracy improves from 93.1% back to 95.8%), and canary deployment (5% → 100% over 10 days). ClinEvidence updates its Agent Card and broadcasts a drift notification with the new card hash.
CMS updates medical necessity criteria
CMS publishes a National Coverage Determination update that adds a new criterion for gene therapy coverage: documented failure of at least one standard-of-care therapy (nusinersen/risdiplam) before gene therapy authorization, unless the patient is under 6 months. PolicyReasonerAgent's world model had anticipated this change (CMS policy update cycles are predictable) and had pre-trained an adapter. The pre-trained adapter is activated via the gated protocol, reducing adaptation latency from 4 weeks to 3 days. Safety invariant I3 (policy currency) is updated simultaneously with the policy version reference.
Coalition re-calibration after cumulative drift
End-to-end coalition accuracy on reference cases has drifted to 91.3% — below the 93% threshold. The algorithmovigilance monitor triggers a coalition re-evaluation. Each agent runs its full drift detection pyramid. The cumulative drift analysis reveals that GenomicsAgent's reclassified variants and PolicyReasonerAgent's new coverage criteria have combined into an unintended interaction: 3-copy SMN2 patients are now classified as "Likely Benign" (GenomicsAgent) AND required to fail standard therapy first (PolicyReasonerAgent), creating a contradictory logic path. The coalition enters a coordinated adaptation: PolicyReasonerAgent adds an exception rule for the 3-copy SMN2 edge case, tested through the full gated protocol.
New baseline established
After the Month 5 correction, all agents update their drift monitors' baselines. New reference set accuracy: 95.1%. All safety invariants verified. Agent Cards refreshed. The coalition is stable — until the next wave of non-stationarity arrives.
Three critical insights emerge. First, drift is the default, not the exception — in six months, every agent in the coalition experienced at least one significant drift event. Second, individual adaptation is insufficient — the Month 5 interaction bug arose not from any individual agent's failure but from an unanticipated combination of individually correct adaptations. Third, anticipatory adaptation pays off — PolicyReasonerAgent's world model reduced CMS policy adaptation from 4 weeks to 3 days, minimizing the window during which the coalition operated under stale safety constraints.
9. Architectural Synthesis: The Adaptive Agent Stack
Drawing together the individual mechanisms from Sections 4–7, we propose a layered architecture for continual learning in non-stationary multi-agent networks. Each layer addresses a different timescale and scope of adaptation: Layer 1 holds the safety invariants, which are never modified at runtime; Layer 2 performs continuous agent-level adaptation; Layer 3 handles network-wide drift notification; and Layer 4 carries out coalition-level re-evaluation.
The stack operates on the principle of minimum necessary intervention: most adaptation happens at Layer 2 (agent-level, continuous), with escalation to Layer 3 (network notification) only when drift exceeds local adaptation capacity, and escalation to Layer 4 (coalition re-evaluation) only when end-to-end safety is threatened. Layer 1 is never modified during normal operation — it is the bedrock that ensures no adaptation, no matter how aggressive, can violate core safety constraints.
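The minimum-necessary-intervention principle reduces to a small escalation rule. An illustrative sketch, where the drift score, the local adaptation capacity, and the safety signal are assumed inputs from the monitoring described above:

```python
def escalation_layer(drift_score: float, local_capacity: float,
                     e2e_safety_ok: bool) -> int:
    """Return the lowest stack layer that must act.

    Layer 1 (safety invariants) is never modified at runtime, so it is
    never an action target here.
    Layer 2: agent-level continual adaptation (the default path).
    Layer 3: broadcast a drift notification to the network.
    Layer 4: trigger coalition-level re-evaluation.
    """
    if not e2e_safety_ok:
        return 4          # end-to-end safety threatened
    if drift_score > local_capacity:
        return 3          # drift exceeds local adaptation capacity
    return 2              # handle locally, continuously
```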
This architecture integrates naturally with the trust infrastructure from Tutorial II (trust scores are updated based on drift magnitude and adaptation success), the economic incentives from Tutorial III (agents who adapt successfully and maintain quality earn higher Shapley attribution), and the coordination mechanisms from Tutorial IV (drift-triggered re-evaluation can invoke consensual re-planning or stigmergic re-assignment of subtasks).
10. Open Frontiers
The framework presented above still leaves several hard problems unsolved. These represent the cutting edge of research at the intersection of continual learning, multi-agent systems, and safe AI deployment.
10.1 Optimal Adaptation Rate in Co-Adaptive Networks
When all agents adapt simultaneously, the network can enter oscillatory dynamics where each agent's adaptation destabilizes the next. What is the optimal adaptation rate for each agent given the adaptation rates of its peers? This is related to the learning rate scheduling problem in multi-agent RL, but complicated by the fact that agents may be operated by different organizations with different update cadences. Preliminary work on MACPH (Li et al., 2025) explores adaptive parameter space noise as one approach, but a general theory of optimal co-adaptation rates remains elusive.
10.2 Drift Attribution in Multi-Agent Pipelines
When end-to-end coalition accuracy degrades, which agent's drift caused it? The cascading nature of multi-agent pipelines makes attribution challenging: GenomicsAgent's reclassification changes PolicyReasonerAgent's inputs, which changes the final determination. Standard Shapley attribution (Tutorial III) can be adapted for drift contribution analysis, but computing it requires counterfactual execution — running the pipeline with each agent's drift reverted individually — which is expensive and may not be feasible in real-time.
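The counterfactual-revert idea can be made precise with a Shapley-style computation. This sketch treats "revert agent X to its pre-drift version" as a player joining a coalition and averages each agent's marginal effect over all revert orders; `pipeline_accuracy` is a hypothetical callback, and the exponential cost is exactly the expense the text notes.

```python
from itertools import combinations
from math import factorial

def drift_shapley(agents, pipeline_accuracy):
    """Shapley-style drift attribution by counterfactual execution.

    pipeline_accuracy(reverted) runs the pipeline with the agents in
    `reverted` rolled back to their pre-drift versions and returns
    end-to-end accuracy. Each agent's share is its average marginal gain
    from being reverted, over all orderings of reverts.
    """
    n = len(agents)
    shares = {a: 0.0 for a in agents}
    for a in agents:
        others = [x for x in agents if x != a]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = (pipeline_accuracy(frozenset(subset) | {a})
                        - pipeline_accuracy(frozenset(subset)))
                shares[a] += weight * gain
    return shares
```

For an additive degradation (each agent's drift independently costs some accuracy), each share equals that agent's own contribution, as expected.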
10.3 Privacy-Preserving Continual Learning
Healthcare agents must adapt to new data distributions without directly accessing patient data. Federated continual learning — where agents train on local data and share only model updates — is promising but introduces new forgetting dynamics: the aggregated update may overwrite one hospital's patterns in favor of another's. Combining federated learning with EWC (penalizing deviations from each site's important parameters) is an active research area, with implications for cross-institutional multi-agent coalitions.
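The federated-EWC combination described above amounts to anchoring each site's important parameters with its own Fisher information. A toy sketch on plain parameter vectors (the quadratic-penalty form follows Kirkpatrick et al. 2017; the federated per-site summation is an illustrative extension):

```python
def federated_ewc_penalty(theta, site_anchors, lam=1.0):
    """EWC penalty summed over federated sites.

    theta: current (aggregated) parameter vector, as a list of floats.
    site_anchors: list of (theta_star, fisher) pairs, one per site, where
        fisher weights how important each parameter is to that site.
    Returns lam/2 * sum_s sum_i F_s,i * (theta_i - theta*_s,i)^2, so the
    aggregated update is discouraged from overwriting any one site's
    important parameters.
    """
    penalty = 0.0
    for theta_star, fisher in site_anchors:
        penalty += sum(f * (t - ts) ** 2
                       for t, ts, f in zip(theta, theta_star, fisher))
    return 0.5 * lam * penalty
```

Note that parameters a site marks as unimportant (zero Fisher weight) can drift freely without penalty — which is precisely where the new forgetting dynamics arise when sites disagree about importance.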
10.4 Adversarial Drift Injection
If agents adapt based on their observed input distributions, a malicious agent can inject crafted inputs to steer a victim agent's adaptation in a harmful direction — a form of data poisoning that exploits the continual learning pipeline. This connects to the safety mechanisms discussed in Tutorial I (§8.8) and the economic incentive design from Tutorial III, where staking and reputation can deter strategic drift manipulation.
10.5 Certification of Continuously Adapting Systems
The FDA's 2025 Predetermined Change Control Plan (PCCP) framework allows manufacturers to pre-specify how an AI device will be updated post-market. But for a continuously adapting multi-agent system, pre-specifying all possible adaptations is infeasible. A new certification paradigm is needed — one that certifies the adaptation process rather than the adapted model. This would verify that the gated adaptation protocol (Section 7.2), the safety invariant verification (Section 7.1), and the rollback capability (Section 7.2) are correct and complete, without requiring re-certification for every individual adaptation. This is perhaps the most consequential open problem for deploying continual learning in regulated healthcare environments.
Living agent networks are non-stationary by construction: models update, tools change, data drifts, regulations evolve, and peers adapt in response. Addressing this requires a layered approach: individual agents must be equipped with continual learning strategies (EWC, progressive LoRA, replay buffers) that balance stability against plasticity; the network needs drift detection and propagation protocols that make non-stationarity visible and actionable; and the safety layer must impose invariants that survive all adaptations, enforced through gated deployment and atomic rollback. The worked example demonstrated that even well-designed individual adaptations can produce emergent failures at the coalition level — making network-level algorithmovigilance essential. As the agentic web matures, the systems that thrive will be those that treat continual learning not as an afterthought but as a first-class architectural concern, integrated into every layer of the stack.
References
Kirkpatrick et al. (2017), "Overcoming Catastrophic Forgetting in Neural Networks." PNAS.
Rusu et al. (2016), "Progressive Neural Networks." arXiv:1606.04671.
Khetarpal et al. (2022), "Towards Continual Reinforcement Learning: A Review and Perspectives," Journal of Artificial Intelligence Research, vol. 75, pp. 1401–1476.
Yu et al. (2025), "Progressive LoRA for Multimodal Continual Instruction Tuning," ACL Findings.
Liang & Li (2025), "Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models," ICML 2025 submission (OpenReview).
Wei et al. (2025), "Online-LoRA: Task-Free Online Continual Learning via Low Rank Adaptation," WACV.
Yu et al. (2026), "Reinforcement World Model Learning for LLM-based Agents," arXiv:2602.05842.
Haque et al. (2025), "Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks," arXiv:2504.01241.
Jhajj & Lin (2025), "Elastic Weight Consolidation for Knowledge Graph Continual Learning: An Empirical Evaluation," NeurIPS NORA Workshop.
Zheng et al. (2025), "Lifelong Learning of Large Language Model Based Agents: A Roadmap," IEEE TPAMI (2025); arXiv:2501.07278.
Wang et al. (2025), "A Collaborative Multi-Agent Reinforcement Learning Approach for Non-Stationary Environments with Unknown Change Points," Mathematics, 13(11), 1738.
FDA (2025), "AI-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations." Draft Guidance, U.S. Food & Drug Administration.
You et al. (2025), "Clinical Trials Informed Framework for Real World Clinical Implementation and Deployment of Artificial Intelligence Applications," npj Digital Medicine, 8(1):107.
American Heart Association (2025), "Pragmatic Approaches to the Evaluation and Monitoring of Artificial Intelligence in Health Care: A Science Advisory From the American Heart Association," Circulation, 152(23):e433–e442.
McCloskey & Cohen (1989), "Catastrophic Interference in Connectionist Networks." Psychology of Learning and Motivation, Vol. 24.
Grossberg (1980), "How Does a Brain Build a Cognitive Code?" Psychological Review, 87(1).
Hu et al. (2022), "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.