An Agentic Architecture for Scalable and Reproducible Data Standardization to OMOP CDM

Abstract

The secondary use of clinical data for biomedical research is a cornerstone of modern medicine, yet its potential is severely constrained by the persistent challenge of data heterogeneity. While the OMOP Common Data Model (OMOP-CDM) provides a robust standard for data harmonization, transforming source data into it remains a significant bottleneck. This paper proposes a novel agentic architecture that reframes data standardization from an imperative scripting task into a declarative, AI-augmented knowledge-generation process. Our framework decomposes the complex mapping workflow across specialized agents that combine declarative data modeling in LinkML with structured LLM interfacing in BAML, the language developed by Boundary ML. The system's primary output is a versionable, directly executable mapping specification.

1. The Standardization Bottleneck in Biomedical Research

The digital transformation of healthcare has produced an unprecedented volume and variety of clinical data. However, the research potential of these data remains largely untapped because they are fragmented and siloed within disparate institutional systems. In practice, clinical information is organized using locally developed, mutually inconsistent information models.

To overcome these barriers, the biomedical research community has championed Common Data Models (CDMs). Among these, the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM), maintained by OHDSI, has emerged as a de facto international standard. By defining a consistent data structure for clinical domains (e.g., conditions, drugs, procedures, measurements) and leveraging controlled terminologies like SNOMED CT, LOINC, and RxNorm, OMOP-CDM enables true semantic harmonization.

Despite this demonstrated value, the practical process of transforming source data into the standard model, the extract-transform-load (ETL) pipeline, remains a severe bottleneck. Historically, ETL pipelines have been implemented through manual data mapping and brittle, imperative scripts in SQL or Python. These scripts are difficult to create, debug, maintain, and reuse.

2. A Multi-Agent Framework for Intelligent Data Harmonization

Achieving a true leap in the scalability of data standardization requires a fundamental paradigm shift. We must move away from the practice of writing imperative transformation scripts and toward the AI-assisted generation of declarative, human-readable, and machine-executable mapping specifications. To realize this vision, we propose a modular, multi-agent system (MAS) where each agent is specialized for a particular role:

2.1 The Orchestrator Agent

The Orchestrator Agent serves as the system's central coordinator. It initiates, monitors, and coordinates the entire data standardization job by defining the execution graph and sequencing the invocations of the worker agents. It also handles checkpointing and human-in-the-loop review.
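The sequencing and checkpointing behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's actual orchestrator: the step names, the `EXECUTION_GRAPH` dictionary, and the `run_job` helper are hypothetical, and the worker agents are replaced by stubs.

```python
from graphlib import TopologicalSorter

# Hypothetical execution graph: each step maps to the set of steps it depends on.
EXECUTION_GRAPH = {
    "schema_definition": set(),
    "semantic_grounding": {"schema_definition"},
    "model_alignment": {"semantic_grounding"},
    "human_review": {"model_alignment"},
    "transform_validate": {"human_review"},
}


def run_job(agents, checkpoint=None):
    """Run every step in dependency order, skipping steps already checkpointed."""
    checkpoint = {} if checkpoint is None else checkpoint
    for step in TopologicalSorter(EXECUTION_GRAPH).static_order():
        if step in checkpoint:
            continue  # resume: this step finished in an earlier run
        # Each agent receives the results so far and returns its own result.
        checkpoint[step] = agents[step](checkpoint)
    return checkpoint


# Stub agents that only report which upstream results were available to them.
agents = {
    name: (lambda results, n=name: f"{n} saw {sorted(results)}")
    for name in EXECUTION_GRAPH
}

# Resume a job whose first step was completed previously.
results = run_job(agents, checkpoint={"schema_definition": "cached schema"})
for step, outcome in results.items():
    print(step, "->", outcome)
```

Representing the job as an explicit dependency graph keeps the ordering logic out of the agents themselves, so a step such as human review can be inserted or removed by editing the graph alone.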

2.2 The Schema Definition Agent

The Schema Definition Agent bootstraps formal, machine-readable data models of the source datasets using LinkML and tools such as schema-automator. Once both source and target are described by schemas, the mapping problem becomes a structure-to-structure reasoning task.
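To illustrate what bootstrapping a schema from data involves, the sketch below infers a LinkML-style class definition from a few sample rows. It is a standard-library stand-in and does not call schema-automator; the helper names `infer_range` and `bootstrap_class` and the sample rows are invented for illustration.

```python
from datetime import datetime


def infer_range(values):
    """Pick the narrowest LinkML type name that fits every sample value."""
    def fits(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "float"
    if fits(datetime.fromisoformat):
        return "datetime"
    return "string"


def bootstrap_class(class_name, rows):
    """Build a LinkML-style class definition (as a plain dict) from sample rows."""
    attributes = {}
    for column in rows[0].keys():
        samples = [row[column] for row in rows if row[column] != ""]
        attributes[column] = {"range": infer_range(samples) if samples else "string"}
    return {"classes": {class_name: {"attributes": attributes}}}


# Synthetic sample rows, as they might be read from a CSV export.
rows = [
    {"subject_id": "1", "itemid": "101", "charttime": "2020-01-01 08:00:00", "valuenum": "1.2"},
    {"subject_id": "1", "itemid": "102", "charttime": "2020-01-02 09:30:00", "valuenum": "4"},
]
schema = bootstrap_class("LabEvent", rows)
print(schema["classes"]["LabEvent"]["attributes"])
```

A production tool would also need to infer identifiers, enumerations, and cardinalities, which this sketch deliberately leaves out.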

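2.3 The Semantic Grounding Agent

The Semantic Grounding Agent performs coarse-grained semantic similarity matching, using embeddings from a clinical language model such as ClinicalBERT to identify candidate target elements for each source element. This narrows the search space for the subsequent cognitive alignment step.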
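The candidate-retrieval idea can be shown with cosine similarity over vectors. The sketch below assumes a toy embedding built from character trigram counts in place of a clinical language model, so the scores carry no clinical meaning; the target descriptions are paraphrases written for this example, and `embed`, `cosine`, and `top_candidates` are hypothetical helper names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-trigram counts (a stand-in for a learned model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_candidates(source_description, targets, k=2):
    """Return the k target fields whose descriptions are most similar to the source."""
    source_vector = embed(source_description)
    scored = [(name, cosine(source_vector, embed(desc))) for name, desc in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Illustrative descriptions only; they are not quoted from the OMOP documentation.
targets = {
    "value_as_number": "numeric value of the measurement result",
    "measurement_datetime": "date and time of the measurement",
    "measurement_source_value": "source code for the measurement",
}
candidates = top_candidates("Numeric result", targets)
print(candidates)
```

Swapping `embed` for a real encoder would leave the ranking logic unchanged, which is the property that makes the grounding step easy to upgrade independently.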

2.4 The Model Alignment Agent: The Cognitive Core

The Model Alignment Agent is the cognitive heart of the system. It uses type-safe, schema-aligned LLM interfacing via BAML to turn the candidate matches into executable mapping specifications.
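The value of schema-aligned interfacing is that malformed model output is rejected before it reaches the execution step. The sketch below imitates that guarantee by hand with dataclasses; it does not use BAML or its generated client, and the `SlotDerivation` and `ClassDerivation` shapes are simplified assumptions rather than the full mapping data model.

```python
import json
from dataclasses import dataclass


@dataclass
class SlotDerivation:
    populated_from: str


@dataclass
class ClassDerivation:
    populated_from: str
    slot_derivations: dict


def parse_mapping(raw_json):
    """Parse LLM output into typed objects, raising ValueError on any mismatch."""
    try:
        data = json.loads(raw_json)  # invalid JSON already raises a ValueError subclass
        derivations = {}
        for target_class, body in data["class_derivations"].items():
            slots = {}
            for slot, spec in body["slot_derivations"].items():
                source = spec["populated_from"]
                if not isinstance(source, str):
                    raise TypeError(f"{slot}: populated_from must be a string")
                slots[slot] = SlotDerivation(source)
            derivations[target_class] = ClassDerivation(body["populated_from"], slots)
    except (KeyError, TypeError, AttributeError) as exc:
        raise ValueError(f"malformed mapping specification: {exc!r}") from exc
    return derivations


# A well-formed response, as the model might return it.
good = json.dumps({
    "class_derivations": {
        "Measurement": {
            "populated_from": "LabEvent",
            "slot_derivations": {"value_as_number": {"populated_from": "valuenum"}},
        }
    }
})
spec = parse_mapping(good)
print(spec["Measurement"].slot_derivations["value_as_number"].populated_from)
```

A structured-output framework generates this kind of parsing and validation from the type declarations, so the hand-written checks above would not need to be maintained.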

2.5 The Transformation and Validation Agent

The Transformation and Validation Agent executes the generated mappings using the linkml-map library, transforming the raw records and validating the output dataset against the constraints of the target OMOP LinkML schema.
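Executing a declarative mapping amounts to interpreting the specification against each record. The sketch below is a minimal standard-library interpreter, not the linkml-map API; it assumes a specification restricted to `populated_from` derivations, and the `MAPPING` dictionary and `apply_mapping` helper are hypothetical. The field names follow the laboratory example in Section 3.

```python
# Hypothetical mapping specification, restricted to simple field renames.
MAPPING = {
    "class_derivations": {
        "Measurement": {
            "populated_from": "LabEvent",
            "slot_derivations": {
                "value_as_number": {"populated_from": "valuenum"},
                "measurement_datetime": {"populated_from": "charttime"},
                "measurement_source_value": {"populated_from": "itemid"},
            },
        }
    }
}


def apply_mapping(mapping, source_class, record):
    """Return one target record per class derivation fed by `source_class`."""
    outputs = {}
    for target_class, derivation in mapping["class_derivations"].items():
        if derivation["populated_from"] != source_class:
            continue  # this derivation reads from a different source class
        outputs[target_class] = {
            target_slot: record.get(spec["populated_from"])
            for target_slot, spec in derivation["slot_derivations"].items()
        }
    return outputs


# Synthetic source record.
record = {"subject_id": "1", "itemid": 101, "charttime": "2020-01-01T08:00:00", "valuenum": 1.2}
transformed = apply_mapping(MAPPING, "LabEvent", record)
print(transformed)
```

Because the interpreter is generic, changing a mapping means editing data rather than code, which is the property the declarative approach relies on.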

3. Example: Mapping Laboratory Data to OMOP

To make this concrete, consider a source lab event database mapped to the OMOP MEASUREMENT table.

Step 1: LinkML Schema Ingestion

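id: https://example.org/source-lab  # hypothetical schema identifier
name: source_lab
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types  # provides the integer, float, and datetime types
default_range: string
classes:
  LabEvent:
    description: "Represents a single laboratory measurement event."
    attributes:
      subject_id: {required: true, description: "Patient identifier (shared by all of a patient's events)"}
      itemid: {range: integer, description: "Lab test identifier"}
      charttime: {range: datetime, description: "Time recorded"}
      valuenum: {range: float, description: "Numeric result"}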

Step 2: Semantic Grounding Mappings

source:valuenum → target:value_as_number (Similarity: 0.98)
source:charttime → target:measurement_datetime (Similarity: 0.95)
source:itemid → target:measurement_source_value (Similarity: 0.85)
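One way to use these scores is to route low-confidence candidates to the human reviewer before alignment. The sketch below assumes a cut-off of 0.90, which is a hypothetical value chosen for illustration and not a threshold specified by the architecture; the `route` helper is likewise invented.

```python
# Candidate scores copied from the grounding output above.
CANDIDATES = [
    ("valuenum", "value_as_number", 0.98),
    ("charttime", "measurement_datetime", 0.95),
    ("itemid", "measurement_source_value", 0.85),
]

REVIEW_THRESHOLD = 0.90  # hypothetical cut-off, not taken from the paper


def route(candidates, threshold=REVIEW_THRESHOLD):
    """Split candidates into those forwarded as-is and those flagged for review."""
    forwarded = [c for c in candidates if c[2] >= threshold]
    flagged = [c for c in candidates if c[2] < threshold]
    return forwarded, flagged


forwarded, flagged = route(CANDIDATES)
print("forwarded:", [source for source, _, _ in forwarded])
print("flagged for review:", [source for source, _, _ in flagged])
```

Under this assumed threshold only the itemid candidate is flagged, which matches the intuition that a local test code needs more scrutiny than a plainly named numeric field.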

Step 3: BAML Structured Interfacing

BAML allows us to define functions whose return types constrain probabilistic LLM outputs into predictable, typed objects:

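// BAML type-safe definitions. SlotDerivation and ClassDerivation are assumed
// minimal shapes, loosely modeled on linkml-map; adjust them to the real spec.
class SlotDerivation {
  populated_from string?
}

class ClassDerivation {
  populated_from string
  slot_derivations map<string, SlotDerivation>
}

class MappingSpecification {
  class_derivations map<string, ClassDerivation>
}

// GPT4_Turbo must be declared separately as a client<llm> block.
function GenerateMapping(
  source_schema: string,
  target_schema: string
) -> MappingSpecification {
  client GPT4_Turbo
  prompt #"...instruction content..."#
}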

Steps 4 and 5: Executable Mapping and Validation

The final YAML mapping is version-controlled and human-auditable. The data engineer reviews the generated mapping together with the validation results in a single review loop before final execution.
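The validation half of this loop can be sketched as a function that turns constraint violations into a report the reviewer can read. The constraint table below is illustrative only: the required flags are assumptions made for this example and are not copied from the OMOP specification, and `validate` is a hypothetical helper rather than a LinkML validator call.

```python
from datetime import datetime

# Hypothetical subset of target constraints: slot -> (expected type, required?).
MEASUREMENT_CONSTRAINTS = {
    "value_as_number": (float, False),
    "measurement_datetime": (datetime, True),
    "measurement_source_value": (str, True),
}


def validate(record, constraints):
    """Return a list of readable problems; an empty list means the record passes."""
    problems = []
    for slot, (expected, required) in constraints.items():
        value = record.get(slot)
        if value is None:
            if required:
                problems.append(f"{slot}: required value is missing")
        elif not isinstance(value, expected):
            problems.append(f"{slot}: expected {expected.__name__}, got {type(value).__name__}")
    return problems


# Synthetic output record with a missing datetime and an uncast source value.
report = validate({"value_as_number": 1.2, "measurement_source_value": 101}, MEASUREMENT_CONSTRAINTS)
for problem in report:
    print(problem)
```

Returning all problems at once, rather than stopping at the first, lets the reviewer correct the mapping in one pass before it is re-executed.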

4. Discussion and Architectural Benefits

The proposed agentic architecture represents a significant step beyond traditional ETL and simple AI wrappers:

Modularity: Each agent acts as a self-contained component, much like a microservice. Components can be updated independently without affecting the orchestrator.
Reproducibility: The entire transformation logic is captured in explicit schemas and mappings rather than opaque imperative code.
Human-in-the-loop: The system is explicitly designed as a mapping copilot. The agents aim to automate roughly 90% of the tedious mapping work, leaving the human expert in the role of validator and supervisor.

5. Conclusion and Future Directions

By combining declarative modeling with LinkML and reliable LLM interfacing with BAML, we transform the ETL bottleneck from a scripting task to a structured knowledge-generation loop. Future directions include developing richer graphical validation interfaces, expanding to genomic or unstructured text extraction agents, and orchestrating hierarchical critic-agents to review mapping candidates before final validation.