Tutorial · Multi-Agent Systems · Agent Recruitment

Semantic Capability Matching: From String Comparison to Neural Retrieval in Multi-Agent Discovery

How do agents decide that "clinical evidence synthesis" and "systematic review generation" refer to the same capability? A deep dive into the matching surface, tracing the path from brittle taxonomies through embedding-based retrieval to LLM-as-judge arbitration.

Authors: Nathan Crock, GPT 5.1 (research), Gemini 3.1 (research), Claude Opus 4.5 (coding)

1. The Matching Problem — Why Strings Aren't Enough

In Tutorial I, we introduced the Agent Card as the foundational metadata document enabling runtime discovery. Each card carries a skills array with human-readable descriptions such as "clinical evidence synthesis," "pharmacogenomic lookup," and "coverage determination." But we glossed over a critical question: when an orchestrator needs an agent that can perform "systematic literature review with GRADE assessment," how does a registry decide that ClinEvidence's skill labeled "clinical evidence synthesis" is a match?

This is the semantic capability matching problem, and it is far harder than it appears. As the original tutorial noted: "Does 'clinical evidence synthesis' match 'systematic review generation'? Does 'drug interaction analysis' subsume 'pharmacokinetic modeling'?" The answer depends on whether you treat capability descriptions as strings, as positions in a taxonomy, as points in a vector space, or as propositions for a reasoning engine to evaluate.

Formal Definition

Semantic Capability Matching is the problem of computing a graded relevance function M(q, c) → [0, 1] that, given a task requirement q and an advertised capability description c published on an Agent Card, returns the degree to which c satisfies q. This generalizes the classical exact / plugin / subsume / fail match categories from semantic web service discovery (Paolucci et al., 2002) into a continuous score.

A well-formed matcher must be:

  1. Lexically invariant. "Evidence synthesis" and "systematic review generation" should produce a high score despite sharing no words.
  2. Subsumption-sensitive and directional. A broad capability like "clinical evidence synthesis" should score highly against the narrower need "systematic review," but not necessarily vice versa. This asymmetry is what separates capability matching from symmetric similarity measures like cosine distance.
  3. Constraint-aware. Protocol compatibility, data-format alignment, and regulatory compliance must factor into the score alongside functional overlap.

No single technique satisfies all three properties, which is why the field has converged on layered retrieve-then-reason pipelines that combine embedding retrieval, ontological knowledge, and LLM-based judgment. To see why, consider five concrete challenges an orchestrator faces during our prior authorization scenario. Each one stresses a different property.
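The discrete exact / plugin / subsume / fail categories referenced above can be sketched over a toy is-a hierarchy. Everything here (the hierarchy, the concept names) is an illustrative assumption, not a real ontology:

```python
# Toy sketch of the Paolucci et al. (2002) match degrees over a tiny,
# hand-written is-a hierarchy (child -> parent links; purely illustrative).
IS_A = {
    "systematic review": "evidence synthesis",
    "evidence synthesis": "clinical reasoning",
    "pharmacokinetic modeling": "drug interaction analysis",
}

def ancestors(concept):
    """All concepts reachable by following is-a links upward."""
    out = set()
    while concept in IS_A:
        concept = IS_A[concept]
        out.add(concept)
    return out

def match_degree(query, provider):
    """exact / plugin / subsume / fail, in Paolucci et al.'s sense."""
    if query == provider:
        return "exact"
    if provider in ancestors(query):   # provider is more general than the need
        return "plugin"
    if query in ancestors(provider):   # provider is more specific than the need
        return "subsume"
    return "fail"

print(match_degree("systematic review", "evidence synthesis"))  # plugin
print(match_degree("evidence synthesis", "systematic review"))  # subsume
```

Note the directionality: swapping query and provider flips plugin into subsume, which is exactly the asymmetry that symmetric similarity measures cannot express.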

Five Challenges, Three Properties
① LEXICAL INVARIANCE
  · Synonymy: "evidence synthesis" ≟ "systematic review generation". Same capability, different words; string match finds 0/3 overlapping tokens → miss.
  · Cross-vocabulary: "CYP2D6 poor metabolizer assessment" (clinical jargon). Querier and provider use different vocabularies, e.g. a SNOMED CT code vs. a free-text skill description.
② SUBSUMPTION SENSITIVITY
  · Subsumption: "drug interaction analysis" ⊇ "pharmacokinetic modeling"? Is the broader skill sufficient for the narrow task? Requires hierarchical, directional reasoning.
  · Composability: "end-to-end prior auth" = evidence + policy + genomics? No single agent suffices; the question is whether the union of skills subsumes the requirement.
③ CONSTRAINT AWARENESS
  · Constraint-awareness: "HIPAA-compliant genomics analysis on VCF data". The skill matches, but the constraints may not; capability, policy, and format must be evaluated jointly.
Fig. 1 — Five real-world matching challenges, organized by the property each one stresses. No single technique covers all three groups.

The grouping reveals the structure of the problem. Lexical invariance, whether the challenge is synonymy or cross-vocabulary translation, demands dense semantic representations that capture meaning beyond surface tokens. Subsumption sensitivity, including the multi-agent case of composability where the question becomes whether a set of capabilities entails the requirement, demands hierarchical, directional reasoning. Constraint awareness demands structured evaluation of metadata that embedding models largely ignore. This is why the field has converged on layered, hybrid approaches: no single technique covers all three.

Analogy

Imagine hiring a surgical team through a job board. Exact-match keyword search would miss a candidate whose CV says "minimally invasive laparoscopic cholecystectomy" when you searched for "gallbladder removal surgery." A taxonomy would tell you that laparoscopic cholecystectomy is-a gallbladder removal. An embedding model would place both descriptions near each other in vector space. But only a domain expert (or a reasoning model acting as one) could assess whether that surgeon's board certification, malpractice history, and hospital privileges actually qualify them for your patient. Agent matching requires all of these layers.

2. Layer 1: Taxonomies and Ontologies — Structured Knowledge

The oldest and most rigorous approach to semantic matching relies on shared ontologies, which are formal, hierarchical representations of concepts and their relationships within a domain. Healthcare is arguably the most heavily ontologized domain in existence, with mature standards like SNOMED CT (over 350,000 concepts with is-a, part-of, and finding-site relationships), LOINC (laboratory and clinical observations), and HL7 FHIR (resource profiles for interoperable health data exchange).

The key insight is that ontologies encode relationships that pure text cannot: if "pharmacogenomic testing" is defined as a subclass of "genetic analysis" in SNOMED CT, then a registry can determine that an agent advertising "genetic analysis" capability may satisfy a requirement for "pharmacogenomic testing", not because the strings overlap, but because the ontology encodes a subsumption relationship between the concepts.

Formal Definition

An ontology is a formal, explicit specification of a shared conceptualization (Gruber, 1993). Formally, an ontology O = (C, R, A) consists of a set of concepts C, a set of typed semantic relations R ⊆ C × C (e.g., is-a, part-of, has-input), and a set of axioms A that constrain valid interpretations.

Ontology-based matching exploits this graph structure to compute semantic similarity between concepts. Three classical families exist:

  1. Path-based (symmetric). Similarity is inversely proportional to the shortest path between two concepts. The Wu-Palmer metric (1994) normalizes by depth: sim(c₁, c₂) = 2 · depth(LCA) / (depth(c₁) + depth(c₂)), where LCA is the least common ancestor in the taxonomy.
  2. Information-content-based (symmetric). Resnik (1995) defines similarity as the information content of the LCA, estimated from corpus frequency. Lin (1998) normalizes this across both concepts.
  3. Subsumption-based (asymmetric). Paolucci et al. (2002) define discrete match degrees: exact, plugin (provider is more general), subsume (provider is more specific), and fail. This is the only family that captures directionality.

For agent discovery, the choice between these families matters. A registry that treats "genetic analysis" as a match for "pharmacogenomic testing" because one subsumes the other is applying asymmetric subsumption logic, not symmetric distance. The symmetric families tell you two concepts are close; only the subsumption family tells you one concept covers the other.
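As a concrete illustration of the path-based family, here is the Wu-Palmer metric computed over a hypothetical five-concept mini-taxonomy. The concepts and structure are invented for this sketch, not real SNOMED CT content:

```python
# Wu-Palmer similarity over a toy taxonomy (child -> parent; root has parent None).
PARENT = {
    "procedure": None,                              # root, depth 1
    "evaluation procedure": "procedure",            # depth 2
    "systematic review": "evaluation procedure",    # depth 3
    "genetic analysis": "procedure",                # depth 2
    "pharmacogenomic testing": "genetic analysis",  # depth 3
}

def path_to_root(c):
    """Concepts from c up to the root, inclusive (c first, root last)."""
    path = [c]
    while PARENT[c] is not None:
        c = PARENT[c]
        path.append(c)
    return path

def wu_palmer(c1, c2):
    """sim = 2 * depth(LCA) / (depth(c1) + depth(c2)), root depth = 1."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    lca = next(a for a in p1 if a in p2)   # deepest common ancestor
    depth = lambda c: len(path_to_root(c))
    return 2 * depth(lca) / (depth(c1) + depth(c2))

print(wu_palmer("systematic review", "evaluation procedure"))  # 0.8
```

Sibling branches score lower: "systematic review" vs. "pharmacogenomic testing" share only the root, giving 2·1/(3+3) = 1/3. Note the metric is symmetric; it cannot say which concept covers the other.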

2.1 Ontology-Anchored Agent Skills

How would ontology-based matching work for agent discovery? The idea is straightforward: each skill in an Agent Card is annotated not just with a free-text description but with one or more concept identifiers from a shared ontology. When a registry receives a query, it compares concept IDs rather than strings.

Ontology-Annotated Agent Card Skills (Extended Schema) JSON
{
  "skills": [
    {
      "id": "evidence-synthesis",
      "name": "Clinical Evidence Synthesis",
      "description": "Synthesizes clinical trial data and systematic reviews...",

      // NEW: Ontology anchors — machine-readable concept references
      "ontologyAnchors": [
        {
          "system": "http://snomed.info/sct",
          "code": "386053000",
          "display": "Evaluation procedure (procedure)",
          "subsumes": ["44889009"]  // Systematic review
        },
        {
          "system": "http://loinc.org",
          "code": "83009-4",
          "display": "Evidence review report"
        }
      ],

      // NEW: FHIR resource profile for I/O contract
      "fhirProfiles": {
        "input": ["http://hl7.org/fhir/StructureDefinition/ResearchStudy"],
        "output": ["http://hl7.org/fhir/StructureDefinition/EvidenceReport"]
      }
    }
  ]
}

With these annotations, an ontology-aware registry can answer questions like: "I need an agent that can produce an EvidenceReport from ResearchStudy inputs." It checks FHIR profile compatibility structurally, then uses SNOMED CT's subsumption hierarchy to assess whether the agent's declared capabilities cover the queried concept.
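A minimal sketch of such a registry-side check, assuming the extended schema above. The subsumption test here is an in-memory stand-in for a real terminology-server call (e.g. FHIR's $subsumes operation), and the skill dict mirrors the JSON example:

```python
# Hypothetical registry check: structural FHIR profile match + ontology coverage.
skill = {
    "ontologyAnchors": [
        {"system": "http://snomed.info/sct", "code": "386053000",
         "subsumes": ["44889009"]},
    ],
    "fhirProfiles": {
        "input":  ["http://hl7.org/fhir/StructureDefinition/ResearchStudy"],
        "output": ["http://hl7.org/fhir/StructureDefinition/EvidenceReport"],
    },
}

def covers_concept(skill, system, code):
    """True if any anchor equals the queried code or declares it subsumed."""
    for a in skill["ontologyAnchors"]:
        if a["system"] == system and (a["code"] == code
                                      or code in a.get("subsumes", [])):
            return True
    return False

def io_compatible(skill, needed_input, needed_output):
    """Structural check of the declared FHIR I/O contract."""
    return (needed_input in skill["fhirProfiles"]["input"]
            and needed_output in skill["fhirProfiles"]["output"])

# "Produce an EvidenceReport from ResearchStudy inputs, covering systematic review"
ok = (io_compatible(skill,
                    "http://hl7.org/fhir/StructureDefinition/ResearchStudy",
                    "http://hl7.org/fhir/StructureDefinition/EvidenceReport")
      and covers_concept(skill, "http://snomed.info/sct", "44889009"))
print(ok)  # True
```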

2.2 The Brittleness Problem

Ontologies bring precision, but they carry a fundamental tension: the ontology must exist before the agents do. This works in healthcare, where SNOMED CT and FHIR have decades of standardization behind them. But the agentic web is moving faster than any standards body. New agent capabilities such as "hallucination detection," "prompt injection resistance scoring," and "multi-modal retrieval-augmented generation" emerge weekly. No ontology committee can keep pace.

Critical Limitation

Ontology-based matching fails when:

  1. no shared ontology exists for the domain,
  2. the agents' capabilities are too novel to be covered by existing ontologies, or
  3. different agents anchor to different ontologies and no alignment exists between them.

In practice, only healthcare and a handful of other regulated domains have ontologies mature enough for reliable concept-based matching. For the general agentic web, we need something that works on raw text.

3. Layer 2: Embedding-Based Semantic Matching

The breakthrough that makes capability matching practical across arbitrary domains is dense vector embeddings. Instead of requiring agents to map their skills to a pre-defined ontology, we encode both the query and each candidate skill description into a shared high-dimensional vector space using a pre-trained language model, then compute relevance via cosine similarity.

The landmark 2025 paper by Guo et al., "Agent Discovery in Internet of Agents" (arXiv:2511.19113), formalized this approach as a three-phase pipeline that has become the reference architecture for semantic agent discovery.

The Guo et al. (2025) Semantic Discovery Pipeline
PHASE 1: SEMANTIC PROFILING
  · Skills & tools, roles & expertise, state & constraints
  · Serialize → encode with BERT / DistilRoBERTa
PHASE 2: SCALABLE INDEXING
  · Product quantization: partition → cluster → discrete codes
  · Compact codebook: semantic IDs for fast retrieval
  · Incremental maintenance: new agents join without reindexing
PHASE 3: CONTINUAL DISCOVERY
  · Memory-enhanced retrieval: retain historical knowledge
  · Capability drift detection: agents evolve; embeddings update
  · Anti-forgetting mechanism: avoid catastrophic loss of old agents
Fig. 2 — Guo et al.'s three-phase pipeline: profile agents into embeddings, compress for scale, then discover continuously as the agent population evolves.

3.1 Structured Agent Profiling

The first phase transforms an Agent Card's structured metadata into a dense embedding vector. Guo et al. propose that each agent autonomously constructs a structured profile along three dimensions: skills and tools (executable actions like "route planning" or "evidence synthesis"), roles and expertise (high-level functions like "clinical specialist" or "policy reasoner"), and state and constraints (load, latency tolerance, compliance posture). This profile is serialized into natural language text, then encoded using a pre-trained model such as BERT or DistilRoBERTa to produce a dense vector representation.

Embedding-Based Agent Profiling Pseudo-code
// Phase 1: Generate semantic embedding for an agent's capabilities

function profile_agent(agent_card: AgentCard) → Vector:
    // Extract structured dimensions from the Agent Card
    skills_text = join([s.name + ": " + s.description for s in agent_card.skills])
    role_text   = agent_card.description
    state_text  = serialize_constraints(agent_card.constraints)

    // Compose a unified profile string
    profile = f("Agent: {agent_card.name}\n"
               "Role: {role_text}\n"
               "Skills: {skills_text}\n"
               "Protocols: {agent_card.protocols}\n"
               "Constraints: {state_text}")

    // Encode via pre-trained language model
    embedding = encoder.encode(profile)  // e.g., DistilRoBERTa → R^768

    return normalize(embedding)  // L2-normalize for cosine similarity

The critical property is that agents with functionally similar capabilities but lexically different descriptions will be positioned close to each other in embedding space. An agent described as "path planning" and one described as "route optimization" will have high cosine similarity, precisely because the language model was trained on billions of text passages where these phrases appear in similar contexts.

3.2 Scalable Indexing via Product Quantization

With potentially millions of agents in a future Internet of Agents, storing and searching full 768-dimensional embeddings becomes expensive. Guo et al. address this with a product quantization-inspired scheme. Each agent's embedding is partitioned into sub-vectors, each mapped to the nearest cluster centroid in a learned codebook. The result is a compact discrete semantic ID: a short code that preserves functional similarity while dramatically reducing storage and retrieval cost.

Key Insight

Scaling agent discovery relies on two complementary techniques. Product quantization (PQ) compresses each high-dimensional embedding into a short code via learned codebooks, slashing memory from O(n·d) floats to a handful of bytes per agent (Jégou et al., 2011). Graph-based indexing then replaces brute-force scanning with a greedy walk over a navigable graph, achieving sub-linear query time (Malkov & Yashunin, 2018). In practice the two are combined: an HNSW or inverted-file index narrows the candidate set, while PQ-compressed codes keep the memory footprint viable at billion scale. The trade-off is a small loss in ranking precision: on standard benchmarks (SIFT1M, GloVe, Deep1B) the recall@10 of these approximate methods typically falls only 2–4% below exact search, while delivering orders-of-magnitude speed-ups (Jégou et al., 2011; Aumüller et al., 2020).
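The PQ encoding step can be sketched in a few lines. This is illustrative only: real systems learn the codebooks with k-means over training embeddings (Jégou et al., 2011), whereas here the codebooks are hand-written toy values:

```python
# Minimal product-quantization sketch: split a vector into sub-vectors, map
# each to its nearest centroid in a per-subspace codebook, and store only the
# centroid indices (the compact "semantic ID").

def nearest(centroids, sub):
    """Index of the centroid closest to sub (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(c, sub)) for c in centroids]
    return dists.index(min(dists))

def pq_encode(vec, codebooks, m):
    """Split vec into m equal sub-vectors; return one centroid index each."""
    d = len(vec) // m
    return [nearest(codebooks[i], vec[i * d:(i + 1) * d]) for i in range(m)]

# Two sub-spaces of dimension 2, two centroids each (toy stand-in for k-means)
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # centroids for dims 0-1
    [(0.0, 1.0), (1.0, 0.0)],   # centroids for dims 2-3
]

code = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks, m=2)
print(code)  # [1, 0]
```

The 4-float vector collapses to two small integers; with 256 centroids per sub-space, a 768-dimensional embedding compresses to a handful of bytes.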

3.3 The Semantic Gap: Where Embeddings Fall Short

Embedding-based matching is a massive improvement over string matching, but it has blind spots. Cosine similarity is symmetric: sim("evidence synthesis", "systematic review") equals sim("systematic review", "evidence synthesis"). Capability matching, however, is often asymmetric. An agent that can do "clinical evidence synthesis" can probably handle a "systematic review" task, but an agent that only does "systematic reviews" may not cover the broader notion of "evidence synthesis", which could include expert panel consensus or meta-analysis.

Embeddings also struggle with negation and constraints. The phrases "HIPAA-compliant evidence synthesis" and "non-compliant evidence synthesis" are nearly identical in embedding space; the word "non" barely budges the 768-dimensional vector, yet the distinction is operationally fatal in healthcare.

4. Layer 3: LLM-as-Judge — Reasoning Over Capabilities

The most powerful and most expensive approach to capability matching is to use a large language model as a semantic judge. Rather than computing geometric distance in embedding space, we ask an LLM to reason about whether a candidate agent's capabilities satisfy a given task requirement. This approach addresses all three properties of a good capability matcher (lexical invariance, subsumption, and constraint-awareness) because the LLM brings world knowledge, domain expertise, and logical reasoning to bear.

Formal Definition

LLM-as-Judge Matching frames capability assessment as a natural language inference (NLI) task. Given a structured prompt containing the task requirement q, the candidate Agent Card Ca, and a domain-specific rubric R, the LLM returns a structured verdict: { match_score: [0,1], rationale: string, gaps: string[], confidence: [0,1] }. The rubric R encodes the evaluation criteria — functional overlap, constraint compatibility, I/O format alignment — ensuring deterministic evaluation dimensions even as the LLM's generation varies.

LLM-as-Judge Capability Matching Prompt Prompt Template
// System prompt for the LLM judge
"""You are an expert capability assessor for multi-agent systems.
Given a TASK REQUIREMENT and a CANDIDATE AGENT CARD, evaluate
whether the agent can fulfill the requirement.

Evaluate on these dimensions:
1. FUNCTIONAL OVERLAP: Does the agent's skill set cover the task?
2. SUBSUMPTION CHECK: Is the agent's capability a superset of need?
3. I/O COMPATIBILITY: Can the agent consume the required inputs and
   produce the expected outputs?
4. CONSTRAINT SATISFACTION: Does the agent meet all policy, protocol,
   and compliance constraints?
5. CONFIDENCE: How certain are you of this assessment?

Return a JSON object with:
  match_score (0.0-1.0), rationale, gaps[], confidence (0.0-1.0)
"""

// Task requirement from the orchestrator
TASK REQUIREMENT:
  Need an agent that can evaluate pharmacogenomic drug-gene
  interactions for CYP2D6 poor metabolizers, accepting VCF
  input and producing FHIR-compliant JSON output.
  Must operate in a HIPAA-compliant enclave with ephemeral
  data retention.

// Candidate Agent Card
CANDIDATE AGENT CARD:
  Name: GenomicsAgent
  Skills: variant annotation, pathway enrichment analysis,
          pharmacogenomic lookup
  I/O: VCF, JSON → JSON, TSV
  Protocols: a2a/1.0, mcp/1.1
  Constraints: Max 3 concurrent, HIPAA-compliant enclave only

The LLM judge would return something like match_score: 0.88, with a rationale noting that the pharmacogenomic lookup skill directly addresses CYP2D6 assessment, VCF input is supported, and the HIPAA enclave constraint is satisfied. A gap is noted: the agent's output includes TSV as a secondary format, and explicit FHIR profile compliance is not declared (only generic JSON).
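However the judge is invoked, its verdict should be validated before entering the ranking. A minimal sketch, assuming the verdict schema defined above, with a hard-coded string standing in for a real LLM response:

```python
# Hypothetical validation of the judge's structured verdict. Field names
# follow the rubric above; the raw string is a stand-in for an LLM response.
import json

def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON verdict and sanity-check its ranges/types."""
    v = json.loads(raw)
    assert 0.0 <= v["match_score"] <= 1.0
    assert 0.0 <= v["confidence"] <= 1.0
    assert isinstance(v["gaps"], list)
    return v

raw = """{"match_score": 0.88,
          "rationale": "pharmacogenomic lookup covers CYP2D6; VCF supported",
          "gaps": ["no explicit FHIR profile declared"],
          "confidence": 0.9}"""
verdict = parse_verdict(raw)
print(verdict["match_score"])  # 0.88
```

Rejecting malformed or out-of-range verdicts at this boundary keeps LLM generation variability from silently corrupting downstream score fusion.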

4.1 The Cost-Accuracy Trade-off

LLM-as-Judge matching delivers the highest accuracy: research on related NLI tasks shows well-designed LLM judges achieve over 80% agreement with expert human assessors (Yu et al., 2025; Zhuge et al., 2024). But it comes at a steep price. A single GPT-4-class evaluation takes 1–3 seconds and costs 10–50× more than an embedding cosine similarity. If a registry holds 10,000 agents, invoking the LLM judge on every candidate is prohibitively expensive.

This is precisely why the field has converged on hybrid architectures, where embeddings serve as a fast pre-filter and the LLM judge is invoked only on the top-k candidates.

Non-Determinism Warning

LLM-as-Judge matching is non-deterministic: the same query-agent pair may receive slightly different scores on successive evaluations because of sampling variability in generation. This instability is especially dangerous for mission-critical discovery decisions in healthcare, where an incorrectly matched agent could access PHI or make a clinical determination. To address this, the system must either use temperature=0 (greedy decoding), average scores over multiple runs, or employ multi-agent consensus judging (Qian et al., 2025). The consensus approach assigns three independent LLM judges to each evaluation and takes the majority vote, reducing variance at 3× the cost.
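The consensus step can be sketched in a few lines. For continuous scores, median aggregation is used here as the natural analogue of majority voting; the three scores and the disagreement threshold are illustrative assumptions:

```python
# Sketch of multi-judge consensus scoring: aggregate three independent judge
# scores by median and flag high-disagreement cases for escalation.
from statistics import median

def consensus_score(scores, spread_threshold=0.2):
    """Return (median score, True if judges roughly agree)."""
    agreed = max(scores) - min(scores) <= spread_threshold
    return median(scores), agreed

score, agreed = consensus_score([0.91, 0.88, 0.93])
print(score, agreed)  # 0.91 True
```

A disagreement flag lets the orchestrator route contested matches to a human reviewer or a stronger model instead of trusting a noisy median.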

5. Hybrid Architectures — The Retrieve-Then-Reason Pipeline

The state of the art in 2025 is not any single technique but a layered pipeline that combines the speed of embeddings with the precision of LLM reasoning. This is the same retrieve-then-reason pattern that powers Retrieval-Augmented Generation (RAG), adapted for agent discovery.

The Retrieve-Then-Reason Discovery Pipeline
Task query ("CYP2D6 analysis...")
  → Stage 0: Constraint filter (protocol, HIPAA, I/O) · 10,000 → 800 agents
  → Stage 1: Embedding retrieval · 800 → top 20 by cosine
  → Stage 2: Ontology re-rank (SNOMED/FHIR boost) · 20 → top 5 re-ordered
    Score fusion: α·emb + β·onto + γ·constraint
  → Stage 3: LLM-as-Judge · detailed evaluation of top 5 candidates → ranked shortlist
  → Top-k agents matched
Fig. 3 — The four-stage retrieve-then-reason pipeline. Each stage narrows the candidate set while increasing evaluation fidelity. Constraint filtering and embedding retrieval are fast (milliseconds); ontology re-ranking adds domain precision; LLM-as-Judge provides deep reasoning only on the final shortlist.

5.1 Score Fusion

The fusion stage combines signals from multiple matching layers into a single ranking score. The simplest approach is a weighted linear combination:

Score Fusion Function Pseudo-code
function fused_score(query, candidate) → Float:
    // Stage 0: Hard constraint filter (binary gate)
    if not passes_constraints(query.constraints, candidate.constraints):
        return 0.0  // Immediately disqualified

    // Stage 1: Embedding similarity (fast, broad recall)
    q_emb  = encoder.encode(query.text)
    c_emb  = encoder.encode(candidate.profile_text)
    sim_emb = cosine(q_emb, c_emb)

    // Stage 2: Ontology-based similarity (when anchors exist)
    sim_onto = 0.0
    if query.ontology_codes and candidate.ontology_anchors:
        sim_onto = wu_palmer_max(query.ontology_codes,
                                   candidate.ontology_anchors)

    // Stage 2b: I/O format compatibility (structural check)
    io_score = format_compatibility(query.required_input, query.required_output,
                                    candidate.input_modes, candidate.output_modes)

    // Weighted fusion
    α = 0.45   // embedding weight
    β = 0.30   // ontology weight (0 if no anchors available)
    γ = 0.25   // I/O compatibility weight

    return α * sim_emb + β * sim_onto + γ * io_score

When ontology anchors are unavailable, the weight β redistributes to α, gracefully degrading to pure embedding-based matching. This adaptability is what makes the hybrid architecture robust across both heavily ontologized domains like healthcare and under-specified domains like general-purpose automation.
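The redistribution rule can be made concrete with a runnable sketch. The component scores below are illustrative stand-ins for real stage outputs:

```python
# Fusion rule with beta-to-alpha redistribution when no ontology anchors exist.

def fused_score(sim_emb, sim_onto, io_score, has_anchors,
                alpha=0.45, beta=0.30, gamma=0.25):
    """Weighted fusion; without anchors, fold the ontology weight into alpha."""
    if not has_anchors:
        alpha, beta = alpha + beta, 0.0   # degrade to embedding-only matching
    return alpha * sim_emb + beta * sim_onto + gamma * io_score

with_onto    = fused_score(0.91, 1.0, 0.88, has_anchors=True)
without_onto = fused_score(0.91, 0.0, 0.88, has_anchors=False)
print(f"{with_onto:.4f} {without_onto:.4f}")  # 0.9295 0.9025
```

The anchored candidate gains roughly three points of fused score from its perfect ontology match, which is exactly the margin that re-orders the shortlist in the worked example later.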

5.2 AgentDNS: A Hybrid Discovery Service

The AgentDNS system (Cui et al., 2025; arXiv:2505.22368, now an IETF Internet-Draft) instantiates exactly this hybrid pattern as a production service. It introduces a semantically rich namespace (agentdns://org/category/name) that decouples agent identity from network addresses, combined with a natural language-driven discovery endpoint. When an agent queries AgentDNS with a request such as "find me an agent that can analyze pharmacogenomic interactions in a HIPAA-compliant setting," the service performs hybrid retrieval: keyword matching against registered capability descriptions combined with RAG-style semantic search over embedded agent profiles. The top candidates are returned with metadata including supported protocols, pricing, and compatibility information.

Similarly, the Agent Name Service (ANS) by Narajala et al. (2025; arXiv:2505.10609), also submitted as an IETF Internet-Draft, takes a complementary approach with capability-aware resolution. ANS uses DNS-inspired naming conventions but extends the resolution algorithm to filter by declared agentCapability fields, using hierarchical and semantic indexing to pre-screen candidates against constraints and requirements before returning actionable endpoints.

6. Agent Naming and Discovery Services — Infrastructure at Scale

The emergence of both AgentDNS and ANS as IETF Internet-Drafts in 2025 signals a critical maturation: semantic capability matching is moving from research prototypes to standards-track infrastructure. Both proposals recognize that the agentic web needs something analogous to DNS, though significantly richer, because agents have capabilities, not just addresses.

🌐 AgentDNS (Cui et al., 2025)

Naming: agentdns://org/category/name

Discovery: Natural language queries via RAG + keyword hybrid

Interop: Protocol Adapter Layer for A2A, MCP translation

Billing: Unified authentication and billing across vendors

Status: IETF I-D (draft-liang-agentdns-00, Oct 2025)

🔐 ANS (Narajala et al., 2025)

Naming: DNS-inspired with hierarchical capability fields

Discovery: Capability-aware resolution with semantic indexing

Security: PKI certificates for verifiable agent identity

Lifecycle: Formal registration/renewal mechanism

Status: IETF I-D (draft-narajala-ans-00, May 2025)

Key Insight

DNS maps names → addresses. AgentDNS and ANS map capabilities → agents that can perform them. This is a fundamentally richer resolution problem. DNS returns a single IP address; agent discovery returns a ranked list of candidates with match scores, constraint compliance assessments, and protocol compatibility metadata. The semantic matching layer is the engine that makes this ranking possible.

7. Worked Example: Semantic Discovery for Prior Authorization

Let's replay the prior authorization scenario from Tutorial I, but this time we'll focus exclusively on how the orchestrator's discovery queries are matched against the agent population. We'll trace the entire hybrid pipeline for each agent that Orchestrator Ω needs to recruit.

Scenario Recap

A prior authorization request arrives for a patient prescribed codeine for post-surgical pain. Orchestrator Ω must assemble a coalition to evaluate four distinct questions:

  1. Clinical evidence for codeine in post-surgical contexts
  2. The patient's genomic profile (CYP2D6 metabolizer status)
  3. Whether the insurer's formulary and medical policy cover this treatment
  4. Potential drug-drug interactions

Ω has never worked with any of these agents before and must discover them at runtime from a registry of ~2,000 registered healthcare agents.

7.1 Query 1: Finding a Clinical Evidence Agent

The orchestrator issues its first discovery query:

Orchestrator's First Discovery Query Query
query = DiscoveryQuery {
    text: "Agent capable of synthesizing clinical evidence for
           post-surgical pain management treatments, with GRADE
           certainty assessment. Must accept clinical questions in
           PICO format and return structured evidence summaries.",
    required_input:  ["text/plain", "application/json"],
    required_output: ["application/json"],
    constraints: {
        protocols: ["a2a/1.0"],
        data_policy: "ephemeral"
    },
    ontology_hint: {
        system: "http://snomed.info/sct",
        code: "386053000"   // Evaluation procedure
    }
}

Here is how the pipeline processes this query step by step:

STAGE 0 — CONSTRAINT FILTER

2,000 → 340 candidates

The registry eliminates all agents that don't support the a2a/1.0 protocol (drops 1,200), don't have an ephemeral data-retention policy (drops 320), or can't produce application/json output (drops 140). This is a fast, structural filter with no semantics involved, relying entirely on set intersection across metadata fields.
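The same filter expressed as pure set logic, over two hypothetical toy agent records:

```python
# Stage-0 constraint filter as set intersection over metadata fields.
agents = [
    {"id": "ClinEvidence", "protocols": {"a2a/1.0"},
     "data_policy": "ephemeral", "output_modes": {"application/json"}},
    {"id": "LegacyBot", "protocols": {"http/1.1"},
     "data_policy": "persistent", "output_modes": {"text/html"}},
]

def stage0_filter(agents, protocols, data_policy, required_output):
    """Keep agents satisfying protocol, retention, and output constraints."""
    return [a for a in agents
            if a["protocols"] & protocols               # protocol overlap
            and a["data_policy"] == data_policy         # retention policy
            and a["output_modes"] & required_output]    # producible output

survivors = stage0_filter(agents, {"a2a/1.0"}, "ephemeral",
                          {"application/json"})
print([a["id"] for a in survivors])  # ['ClinEvidence']
```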

STAGE 1 — EMBEDDING RETRIEVAL

340 → Top 15 by cosine similarity

The query text is encoded via DistilRoBERTa. Cosine similarity is computed against the pre-indexed embeddings of the 340 surviving candidates. The top 15 include: ClinEvidence (0.91), a literature search agent (0.84), a clinical trial matching agent (0.79), a drug monograph agent (0.76), and several general-purpose research agents (0.65–0.73). Note that the trial matcher and monograph agent scored well despite not actually performing evidence synthesis; their descriptions happen to mention "clinical evidence" and "treatment evaluation."

STAGE 2 — ONTOLOGY RE-RANK

Top 15 → Top 5 re-ordered

ClinEvidence has an ontology anchor at SNOMED 386053000 (Evaluation procedure) with a declared subsumption path to 44889009 (Systematic review). The query's ontology hint directly matches. Wu-Palmer similarity: 1.0. The literature search agent has no ontology anchors, so its ontology score defaults to 0. The drug monograph agent anchors to a different SNOMED concept (drug formulary record), yielding a Wu-Palmer similarity of 0.42. After score fusion (α=0.45, β=0.30, γ=0.25), ClinEvidence surges to rank 1 with a fused score of 0.93. The literature search agent, despite a decent embedding score, drops to rank 4.
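ClinEvidence's fused score can be checked arithmetically. The I/O compatibility score is not stated in the text, so a value of 0.88 is assumed here purely to illustrate how the reported 0.93 arises:

```python
# Arithmetic check of ClinEvidence's fused score under the stated weights.
alpha, beta, gamma = 0.45, 0.30, 0.25
sim_emb  = 0.91   # Stage 1 cosine similarity (stated)
sim_onto = 1.0    # Wu-Palmer on the matching SNOMED anchor (stated)
io_score = 0.88   # ASSUMED for illustration; not given in the text

fused = alpha * sim_emb + beta * sim_onto + gamma * io_score
print(f"{fused:.4f}")  # 0.9295
```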

STAGE 3 — LLM-AS-JUDGE

Top 5 → Final ranked selection

The LLM judge evaluates the top 5 candidates in detail. For ClinEvidence, it confirms: functional overlap is excellent (skill description explicitly mentions GRADE assessment and PICO format), I/O is compatible, constraints are satisfied. Match score: 0.96. For the second-ranked candidate (a meta-analysis agent), the judge notes it produces statistical forest plots but not structured evidence summaries — it would need a downstream formatter. Match score: 0.71. Orchestrator Ω selects ClinEvidence.

7.2 Query 2: The Mid-Task Discovery — Finding a Genomics Agent

Now comes the harder case. ClinEvidence returns its evidence summary, which notes that codeine is a prodrug metabolized via CYP2D6. For CYP2D6 poor metabolizers, codeine provides no analgesic effect, a finding that triggers a need for pharmacogenomic assessment. This is a mid-task discovery: the orchestrator didn't know it would need a genomics agent until it read ClinEvidence's output.

Mid-Task Discovery Query — Pharmacogenomics Query
// This query is generated dynamically from ClinEvidence's output
query = DiscoveryQuery {
    text: "Agent capable of CYP2D6 metabolizer status assessment from
           patient genomic data. Must accept VCF format input and return
           pharmacogenomic annotations as FHIR-compatible JSON.
           HIPAA-compliant enclave required.",
    required_input:  ["application/vcf"],
    required_output: ["application/json"],
    constraints: {
        protocols: ["a2a/1.0"],
        data_policy: "hipaa_enclave"
    }
}

This query is semantically specific: "CYP2D6 metabolizer status assessment" is clinical pharmacogenomic jargon. The registry contains a GenomicsAgent whose skill is described as "pharmacogenomic lookup," a three-word phrase containing neither "CYP2D6" nor "metabolizer." Here is how each matching layer handles this:

❌ String Match: FAIL

Token overlap between "CYP2D6 metabolizer status assessment" and "pharmacogenomic lookup" is zero. Not a single shared word. A keyword-based registry would never find this agent.
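The failure is easy to verify mechanically:

```python
# Case-insensitive token overlap between query and skill description.
def token_overlap(a: str, b: str) -> set:
    return set(a.lower().split()) & set(b.lower().split())

print(token_overlap("CYP2D6 metabolizer status assessment",
                    "pharmacogenomic lookup"))  # set()
```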

✓ Embedding Match: 0.82

The language model was trained on biomedical corpora where CYP2D6 and pharmacogenomic frequently co-occur. Their embedding representations are close in vector space. The agent scores in the top 5 — but is ranked below a generic genetics agent whose description uses more of the query's words.

✓ Ontology Boost: +0.18

If GenomicsAgent anchors to SNOMED CT code 363779003 (Genotype determination), and the query resolves CYP2D6 assessment to the same branch, the ontology re-ranker boosts GenomicsAgent above the generic genetics agent.

✓ LLM Judge: 0.91

The LLM judge knows that CYP2D6 is a specific gene in the cytochrome P450 family, that pharmacogenomic lookup inherently covers CYP enzyme assessment, and that VCF input and HIPAA enclave constraints are satisfied. It confidently assigns 0.91 with a note: "pharmacogenomic lookup is a direct superset of CYP2D6 metabolizer status assessment."

This example illustrates why hybrid matching is essential. The string matcher saw zero overlap. The embedding matcher placed the agent in the neighborhood. The ontology provided structural confirmation. The LLM judge delivered the definitive verdict with clinical reasoning.

8. The Matching Algorithm — Pseudo-Code Deep Dive

Let's now formalize the complete semantic matching algorithm that a discovery service would implement. This integrates all four stages and handles the case where ontology anchors may or may not be present.

SemanticMatcher — Complete Discovery Pipeline Pseudo-code
class SemanticMatcher:
    encoder:       EmbeddingModel       // e.g., DistilRoBERTa, MiniLM-L6-v2
    index:         VectorIndex          // Product-quantized ANN index
    ontology_svc:  OntologyService      // SNOMED CT / FHIR terminology server
    llm_judge:     LLMClient            // GPT-4 class model for final evaluation
    agent_store:   AgentCardStore       // All registered Agent Cards

    // Configuration
    top_k_embedding  = 20
    top_k_rerank     = 5
    weights          = { emb: 0.45, onto: 0.30, io: 0.25 }

    function discover(query: DiscoveryQuery) → RankedList[MatchResult]:

        // ═══ STAGE 0: Hard Constraint Filter ═══
        candidates = agent_store.filter(
            protocol   ∈ query.constraints.protocols,
            data_policy ⊇ query.constraints.data_policy,
            output_modes ∩ query.required_output ≠ ∅
        )
        log("Stage 0: {len(candidates)} pass constraints")

        // ═══ STAGE 1: Embedding Retrieval ═══
        q_embedding = encoder.encode(query.text)
        emb_results = index.search(q_embedding,
                                    k=top_k_embedding,
                                    filter_ids=candidates.ids)
        // emb_results: [(agent_id, cosine_score), ...]

        // ═══ STAGE 2: Ontology Re-Ranking ═══
        scored = []
        for (agent_id, emb_score) in emb_results:
            card = agent_store.get(agent_id)

            // Ontology similarity (0 if no anchors on either side)
            onto_score = 0.0
            if query.ontology_hint and card.ontology_anchors:
                onto_score = max(
                    ontology_svc.wu_palmer(query.ontology_hint.code,
                                           anchor.code)
                    for anchor in card.ontology_anchors
                    if anchor.system == query.ontology_hint.system
                )

            // I/O format compatibility: average Jaccard over the input
            // and output MIME-type sets
            io_score = 0.5 * (
                jaccard(set(query.required_input),  set(card.all_input_modes)) +
                jaccard(set(query.required_output), set(card.all_output_modes))
            )

            // Adaptive weight redistribution
            w = weights.copy()
            if onto_score == 0.0:
                // No ontology signal — redistribute weight to embedding
                w.emb += w.onto * 0.7
                w.io  += w.onto * 0.3
                w.onto = 0.0

            fused = w.emb * emb_score + w.onto * onto_score + w.io * io_score
            scored.append((agent_id, fused, card))

        // Sort by fused score, take top k for deep evaluation
        scored.sort(by=fused, descending=True)
        shortlist = scored[:top_k_rerank]

        // ═══ STAGE 3: LLM-as-Judge Deep Evaluation ═══
        results = []
        for (agent_id, fused, card) in shortlist:
            verdict = llm_judge.evaluate(
                task_requirement = query.text,
                agent_card       = card.to_json(),
                rubric           = CAPABILITY_RUBRIC,
                temperature      = 0.0    // Deterministic for reproducibility
            )
            results.append(MatchResult {
                agent_id:    agent_id,
                final_score: 0.4 * fused + 0.6 * verdict.match_score,
                rationale:   verdict.rationale,
                gaps:        verdict.gaps,
                confidence:  verdict.confidence
            })

        results.sort(by=final_score, descending=True)
        return results

Several design decisions deserve attention. First, the adaptive weight redistribution in Stage 2: when ontology anchors are absent (the common case for general-purpose agents), the algorithm doesn't break; it simply shifts the ontology weight to the embedding and I/O channels. This makes the system robust across both heavily-ontologized domains (healthcare) and unconstrained ones.
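The redistribution and fusion logic is small enough to isolate as a pure function. This sketch mirrors the pseudocode's weights and the 0.7/0.3 split (both are tunable assumptions, not fixed constants):

```python
DEFAULT_WEIGHTS = {"emb": 0.45, "onto": 0.30, "io": 0.25}

def fuse_scores(emb, onto, io, weights=DEFAULT_WEIGHTS):
    """Fuse the three channel scores; reroute the ontology weight to the
    embedding and I/O channels when no ontology anchors matched."""
    w = dict(weights)              # copy so the default is never mutated
    if onto == 0.0:
        w["emb"] += w["onto"] * 0.7
        w["io"]  += w["onto"] * 0.3
        w["onto"] = 0.0
    return w["emb"] * emb + w["onto"] * onto + w["io"] * io

# With ontology signal: 0.45*0.82 + 0.30*0.75 + 0.25*1.0 ≈ 0.844
print(fuse_scores(0.82, 0.75, 1.0))
# Without: weight shifts to the embedding and I/O channels (≈ 0.881)
print(fuse_scores(0.82, 0.0, 1.0))
```

Note that an agent with no ontology anchors can end up scoring slightly higher than one with a weak anchor match; whether that is acceptable depends on how much you trust anchors in your domain.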

Second, the final score blending in Stage 3: the LLM judge's verdict receives 60% of the final weight, reflecting that its deep reasoning is more reliable than geometric similarity for subtle distinctions. But the fused score from Stages 1–2 still contributes 40%, preventing the LLM's occasional hallucinated confidence from completely overriding strong structural signals.

Third, temperature=0.0 for the LLM judge: in healthcare contexts where reproducibility and auditability matter, we want the same query-agent pair to produce the same verdict every time. This sacrifices some of the LLM's creative reasoning capacity but is essential for regulatory compliance.

9. Open Challenges & Research Frontiers

Semantic capability matching has made extraordinary progress, advancing from keyword search to hybrid neural-symbolic pipelines, but several hard problems remain.

9.1 Composability Reasoning

Current matching evaluates individual agents against individual requirements. But coalition formation often demands skill composition: no single agent can handle "end-to-end prior authorization," but a coalition of four agents can. The matcher must reason about whether a set of capabilities, drawn from different agents, collectively covers a complex requirement. This is closer to automated planning than retrieval, and no existing discovery service handles it well. Future work may integrate task decomposition (via an LLM planner) directly into the discovery loop: decompose the requirement, discover agents for each sub-task, then verify the coalition's aggregate capability.
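The verification step at the end of that loop is the most tractable piece today. Once a planner has decomposed the requirement into atomic skills, checking whether a shortlist of agents collectively covers them is a small set-cover search (agent names and skill labels below are hypothetical):

```python
from itertools import combinations

def coalition_covers(required, agents):
    """Find a smallest coalition whose combined skills cover the requirement.
    Brute-force set cover; fine for the 5-20 agent shortlists a matcher emits."""
    for size in range(1, len(agents) + 1):
        for combo in combinations(agents, size):
            pooled = set().union(*(agents[a] for a in combo))
            if required <= pooled:
                return list(combo)
    return None   # no coalition covers the requirement

agents = {   # hypothetical shortlist: agent → advertised skills
    "ClinEvidence":  {"evidence_synthesis"},
    "GenomicsAgent": {"pharmacogenomic_lookup"},
    "PolicyAgent":   {"coverage_determination", "prior_auth_forms"},
}
need = {"evidence_synthesis", "pharmacogenomic_lookup", "coverage_determination"}
print(coalition_covers(need, agents))
# → ['ClinEvidence', 'GenomicsAgent', 'PolicyAgent']
```

The hard open problem is upstream of this: deciding that "end-to-end prior authorization" decomposes into exactly those atomic skills, and that the skill labels from different agents' cards are commensurable at all.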

9.2 Adversarial Capability Claims

Agent Cards are self-reported. A malicious agent could falsely claim capabilities such as "HIPAA-compliant genomics analysis" in order to infiltrate a coalition. Semantic matching has no way to distinguish between a genuine capability and a fabricated one. This links directly to the trust mechanisms explored in Tutorial II (Trust Without a Central Authority): verifiable credentials and reputation systems must work in concert with semantic matching to validate that claimed capabilities are backed by attestable evidence.

9.3 Embedding Drift and Continual Learning

As the agent population evolves through new registrations, capability updates, and retirements, the embedding space must evolve too. The Guo et al. framework addresses this with memory-enhanced continual discovery, but the fundamental tension between plasticity (incorporating new agent types) and stability (not forgetting existing ones) remains a research frontier. The catastrophic forgetting problem, well-studied in continual learning for neural networks, reappears here at the infrastructure level: how do you update a codebook that serves millions of agents without invalidating existing semantic IDs?

9.4 Cross-Protocol Capability Translation

An agent that speaks MCP and one that speaks A2A may have equivalent capabilities but describe them in incompatible schema formats. AgentDNS addresses this with a Protocol Adapter Layer, but semantic matching across protocol boundaries remains largely unsolved, since skill descriptions, constraint vocabularies, and I/O format declarations all differ structurally. Ontology alignment between protocol-specific vocabularies (MCP's "tool" vs. A2A's "skill") is a prerequisite.

9.5 Privacy-Preserving Discovery

In healthcare and other regulated domains, even the act of searching for an agent reveals sensitive information. If Orchestrator Ω queries a registry for "CYP2D6 poor metabolizer assessment," the registry now knows that a patient somewhere may have a pharmacogenomic condition. Privacy-preserving discovery, which hides query content from the registry while still enabling accurate matching, requires techniques like homomorphic encryption over embeddings or secure multi-party computation. Both AgentDNS and ANS identify this as a critical future direction, but no production implementation exists.

9.6 Calibrating the LLM Judge

LLM-as-Judge matching inherits all the biases of the underlying model. An LLM trained predominantly on English-language biomedical literature may systematically over-rate agents whose descriptions use Western medical terminology and under-rate agents described in other clinical frameworks. Research on multi-lingual LLM judging shows Fleiss' Kappa values as low as 0.1–0.32 in low-resource language settings (Fu et al., 2025). For a global agentic web, the judge must be calibrated across languages, terminologies, and clinical traditions, a challenge that remains largely unaddressed.

Appendix: Matching Approaches — Trade-off Matrix

Approach                 | Synonymy  | Subsumption | Constraints | Latency | Cost     | Maturity
String / Keyword         | Poor      | None        | None        | <1ms    | Trivial  | Mature
Ontology (SNOMED, FHIR)  | Moderate  | Excellent   | Partial     | ~5ms    | Low      | Domain-specific
Embedding Similarity     | Excellent | Moderate    | Poor        | ~10ms   | Low      | Mature
LLM-as-Judge             | Excellent | Excellent   | Excellent   | 1–3s    | High     | Emerging
Hybrid Pipeline          | Excellent | Excellent   | Excellent   | ~200ms  | Moderate | Emerging (2025)

References

Guo, S. et al. (2025). "Agent Discovery in Internet of Agents: Challenges and Solutions." arXiv:2511.19113. This paper presents the reference architecture for semantic-driven capability discovery, introducing the three-phase pipeline of semantic profiling, scalable indexing, and continual discovery.

Cui, E. et al. (2025). "AgentDNS: A Root Domain Naming System for LLM Agents." arXiv:2505.22368; IETF I-D draft-liang-agentdns-00. This paper presents the first DNS-inspired naming and discovery service for agents, with natural language-driven semantic retrieval.

Narajala, V. S. et al. (2025). "Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery and Interoperability." arXiv:2505.10609; IETF I-D draft-narajala-ans-00. This paper presents PKI-backed agent discovery with capability-aware resolution and protocol adapter layers.

Huang, K. et al. (2025). "ACNBP: Agent Communication and Negotiation Based Protocol." arXiv:2505.19301. This paper presents a zero-trust identity framework with Agent Name Service integration for structured discovery queries.

CASTER (2025). "Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing." arXiv:2601.19793. This paper presents context-aware step-level routing that reduces inference cost by up to 72.4% while maintaining task performance.

MasRouter (2025). "Learning to Route LLMs for Multi-Agent Systems." ACL 2025. This paper presents a unified framework for jointly optimizing model selection and collaboration mode in multi-agent systems.

Yu, F. et al. (2025). "When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs." arXiv:2508.02994. This paper offers a comprehensive survey of agent-based evaluation frameworks, including multi-agent debate approaches for reliable judging.

Fu, Y. et al. (2025). "Multilingual LLM-Judge Reliability." An empirical study showing limited cross-lingual reliability of LLM judges with Fleiss' Kappa values of 0.1–0.32.

Jégou, H., Douze, M. & Schmid, C. (2011). "Product Quantization for Nearest Neighbor Search." IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117–128. The foundational paper introducing product quantization and the IVFADC index, with recall benchmarks on SIFT1M and GIST1M showing that PQ-based approximate search retains 96–98% of exact recall@10.

Aumüller, M., Bernhardsson, E. & Faithfull, A. (2020). "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms." Information Systems, 87, 101374. A systematic comparison of ~50 ANN implementations (including HNSW, IVF+PQ, and ScaNN) across standard datasets, confirming that graph-based and quantization-based methods routinely achieve >0.95 recall@10 at high-throughput operating points. Results available at ann-benchmarks.com.

Google (2025). Agent-to-Agent (A2A) Protocol Specification. — The open protocol defining Agent Cards, discovery mechanisms, and task lifecycle management.

W3C (2022). Decentralized Identifiers (DIDs) v1.0. — Self-sovereign cryptographic identity standard for agents.

SNOMED International (2025). SNOMED CT — Systematized Nomenclature of Medicine, Clinical Terms. — Over 350,000 healthcare concepts with formal subsumption relationships.

HL7 (2025). Fast Healthcare Interoperability Resources (FHIR) R4. — Structural interoperability standard for healthcare data exchange.