Declarative Data Standardization to OMOP-CDM using LinkML

Background

The secondary use of clinical data for research is severely constrained by data heterogeneity across healthcare systems. Although the OMOP Common Data Model (OMOP-CDM) provides a robust and widely adopted standard for observational research, the Extract–Transform–Load (ETL) process required to populate OMOP remains a major barrier.

Traditional ETL pipelines rely on bespoke, imperative code (e.g., SQL or Python), making them costly to develop, difficult to maintain, and hard to reproduce. This report reframes OMOP standardization as a declarative, model-driven, and AI-augmented process built on the Linked Data Modeling Language (LinkML).

Methods

We implemented a declarative OMOP standardization framework built on the LinkML ecosystem:

Source schema generation: Source schemas were automatically generated using LinkML Schema Automator, producing formal class–slot–enum definitions directly from relational databases. These schemas were semantically enriched using BioPortal-based annotators to suggest candidate mappings to OMOP standard vocabularies.
Target schema representation: A canonical LinkML representation of the OMOP-CDM was used as the target schema.
AI-augmented alignment mapping: A large language model (LLM) was employed to generate a LinkML-Map transformation specification aligning source and OMOP schemas. To ensure reliable structured output, we used BAML's schema-aligned parsing, constraining the LLM to produce syntactically valid LinkML-Map artifacts.
Execution and validation: The LinkML-Map Transformer executed the declarative mappings to generate OMOP-CDM–conformant data, with validation performed against the OMOP LinkML schema.
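
The review step between mapping generation and execution can be illustrated with a small structural check: every class and slot named in a generated mapping should exist in the source or target schema. The sketch below is plain Python and is only an illustration of that idea. It does not use BAML or the LinkML validation tooling, and the schema dictionaries, the `check_mapping` helper, and the deliberately misspelled `value_num` slot are all hypothetical.

```python
# Hypothetical, simplified stand-ins for the source and target schemas:
# class name -> set of slot names. Real LinkML schemas carry far more detail.
SOURCE_SLOTS = {"LabEvent": {"subject_id", "charttime", "valuenum", "valueuom"}}
TARGET_SLOTS = {
    "Measurement": {
        "person_id",
        "measurement_datetime",
        "value_as_number",
        "unit_source_value",
    }
}


def check_mapping(spec: dict) -> list[str]:
    """Return the problems found in a candidate mapping spec (empty if none)."""
    problems = []
    for target_class, derivation in spec.get("class_derivations", {}).items():
        if target_class not in TARGET_SLOTS:
            problems.append(f"unknown target class: {target_class}")
            continue
        source_class = derivation.get("populated_from")
        if source_class not in SOURCE_SLOTS:
            problems.append(f"unknown source class: {source_class}")
            continue
        for target_slot, rule in derivation.get("slot_derivations", {}).items():
            if target_slot not in TARGET_SLOTS[target_class]:
                problems.append(f"unknown target slot: {target_class}.{target_slot}")
            source_slot = rule.get("populated_from")
            if source_slot not in SOURCE_SLOTS[source_class]:
                problems.append(f"unknown source slot: {source_class}.{source_slot}")
    return problems


# A candidate mapping, as an LLM might propose it, with one misspelled source slot.
candidate = {
    "class_derivations": {
        "Measurement": {
            "populated_from": "LabEvent",
            "slot_derivations": {
                "person_id": {"populated_from": "subject_id"},
                "value_as_number": {"populated_from": "value_num"},
            },
        }
    }
}

print(check_mapping(candidate))  # ['unknown source slot: LabEvent.value_num']
```

A check of this kind catches only structural errors. Whether a source column is semantically the right one for an OMOP field still requires expert review.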

Results

The proposed approach yielded qualitative improvements over traditional ETL pipelines:

Externalized YAML Mappings: All transformation logic was externalized into human-readable, version-controllable YAML artifacts, improving transparency, auditability, and reproducibility.
Fewer Malformed Mappings: The use of structured schemas and schema-constrained LLM outputs reduced the number of malformed mappings and simplified expert review.
Modularity: The workflow promoted modularity by separating source modeling, vocabulary grounding, mapping generation, and execution, allowing independent evolution of each component.
Expert Co-Pilot Model: Rather than replacing human expertise, the framework functioned as a mapping co-pilot, automating repetitive tasks while enabling domain experts to review and refine mappings.

Example LinkML-Map specification (linkml_map_spec.yaml):

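# Each key under class_derivations names a target (OMOP) class.
class_derivations:
  Measurement:
    populated_from: LabEvent        # source class
    # Each key under slot_derivations names a target slot;
    # populated_from names the source slot it is copied from.
    slot_derivations:
      person_id:
        populated_from: subject_id
      measurement_datetime:
        populated_from: charttime
      value_as_number:
        populated_from: valuenum
      unit_source_value:
        populated_from: valueuom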
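
To make the semantics of this specification concrete, the following minimal Python sketch applies the same slot derivations to a single source record. It is a hand-written stand-in for illustration only, not the LinkML-Map Transformer. The `apply_class_derivation` helper, the sample `lab_event` values, and the `extra_column` field are hypothetical.

```python
# The mapping from the specification above, written as a Python dict so the
# example needs no YAML parser.
MAPPING = {
    "class_derivations": {
        "Measurement": {
            "populated_from": "LabEvent",
            "slot_derivations": {
                "person_id": {"populated_from": "subject_id"},
                "measurement_datetime": {"populated_from": "charttime"},
                "value_as_number": {"populated_from": "valuenum"},
                "unit_source_value": {"populated_from": "valueuom"},
            },
        }
    }
}


def apply_class_derivation(spec: dict, target_class: str, source_record: dict) -> dict:
    """Build one target record by copying each source slot named in the spec."""
    derivation = spec["class_derivations"][target_class]
    return {
        target_slot: source_record.get(rule["populated_from"])
        for target_slot, rule in derivation["slot_derivations"].items()
    }


# Hypothetical LabEvent row; values are invented for illustration.
lab_event = {
    "subject_id": 101,
    "charttime": "2020-01-01T08:30:00",
    "valuenum": 7.2,
    "valueuom": "mg/dL",
    "extra_column": "ignored",  # no derivation refers to it, so it is dropped
}

measurement = apply_class_derivation(MAPPING, "Measurement", lab_event)
print(measurement)
```

Because the logic lives entirely in the mapping, changing a source column name means editing one `populated_from` entry rather than rewriting transformation code. The sketch handles only direct slot-to-slot copies, which is all the specification above uses.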

Conclusion

This work suggests that declarative, model-driven ETL using LinkML can lower the technical barrier to OMOP-CDM adoption. By replacing brittle imperative scripts with explicit schemas and mapping specifications, the framework enhances maintainability, reproducibility, and trust.

Structured AI integration improves efficiency while preserving human-in-the-loop governance, which remains essential given the semantic ambiguity of clinical data and the limitations of LLMs. The approach balances standardization and local customization, enabling institutions to adapt mappings to local data while ensuring strict OMOP compliance. Declarative data standardization provides a scalable foundation for federated observational research and aligns well with OHDSI's goals of reusable, transparent, and shareable analytics pipelines.