Core Concepts of The Tuva Project

Disclaimer: For informational purposes only. This content is not medical or legal advice.
TL;DR

Tuva has 7 building blocks: Connectors (map raw data), Input Layer (standard format), Data Quality (validate), Normalization (standardize codes), Claims Preprocessing (group into encounters), Core Data Model (unified schema), and Data Marts (analytics-ready). Plus Terminology — healthcare code sets that ship with the package.

The Big Picture

Tuva's 7 building blocks form a pipeline. Each one transforms data a little further until you have analytics-ready datasets. Here's how they connect:

[Diagram: Tuva's 7 building blocks connected in a pipeline]
Explain Like I'm 12

Think of Tuva like a car factory assembly line. Raw materials (healthcare data) come in the door. Station 1 (Connectors) shapes the parts. Station 2 (Input Layer) puts them on a standard conveyor belt. Station 3 (Data Quality) is quality control — it rejects defective parts. Station 4 (Normalization) paints everything the same color. Station 5 (Claims Preprocessing) assembles the parts into bigger components. Station 6 (Core Data Model) builds the car. Station 7 (Data Marts) adds the features (GPS, AC, radio). And the Terminology is the parts catalog everyone references.

Cheat Sheet

| Stage | What It Does | Key Detail |
| --- | --- | --- |
| Connectors | Maps raw data → Input Layer format | Pre-built for common sources; custom for proprietary |
| Input Layer | Standardized data format Tuva expects | Claims tables: medical_claim, pharmacy_claim, eligibility |
| Data Quality | Validates data before processing | 100+ automated checks on completeness, validity |
| Normalization | Standardizes codes and values | Maps to ICD-10, CPT, SNOMED standard terminologies |
| Claims Preprocessing | Groups claim lines into encounters | Handles claim types, service categories, grouping |
| Core Data Model | Unified patient-centric schema | Patient, condition, encounter, lab, medication tables |
| Data Marts | Pre-built analytics | 13+ marts: CMS-HCC, HEDIS, readmissions, PMPM, etc. |

The 7 Building Blocks

Connectors — Mapping Raw Data In

Connectors are dbt projects that transform raw source data into Tuva's Input Layer. Think of them as adapters — they know how to read your specific data format and translate it into the format Tuva expects.

Tuva provides pre-built connectors for common healthcare data sources. If your data comes from a proprietary system, you write a custom connector. The connector handles field mapping, type casting, and basic transformations.

Tip: If your data doesn't have a pre-built connector, you'll build a custom one. It's just a dbt project with SQL SELECT statements mapping your columns to the Input Layer schema. No special tooling required.
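To make that concrete, a custom connector model might look like the sketch below. The source table (`claims_extract`) and its column names are hypothetical, and the target columns are only a small subset of the Input Layer's medical_claim schema; check the Input Layer documentation for the full required column list for your Tuva version.

```sql
-- models/input_layer/medical_claim.sql (hypothetical custom connector)
-- Renames and casts proprietary source columns into Input Layer names.
select
    claim_nbr                        as claim_id,
    line_nbr                         as claim_line_number,
    member_id                        as person_id,
    cast(svc_from_dt as date)        as claim_start_date,
    cast(svc_to_dt as date)          as claim_end_date,
    diag_cd_1                        as diagnosis_code_1,
    cast(paid_amt as numeric(38, 2)) as paid_amount
from {{ source('raw_claims', 'claims_extract') }}
```

That really is the whole pattern: one SELECT per Input Layer table, renaming and casting as you go.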

Input Layer — The Standard Format

The Input Layer is the standardized format Tuva expects. Once your data is in Input Layer format, you can run the entire Tuva package. Key tables include:

  • medical_claim — Professional and facility claims
  • pharmacy_claim — Prescription drug claims
  • eligibility — Member enrollment and coverage periods
  • lab_result — Lab test results
  • condition — Diagnosis records
  • procedure — Procedure records

Each table has a defined schema with required and optional columns. The Input Layer documentation specifies every field, its data type, and whether it's required.

Key insight: The Input Layer is the most important concept in Tuva. Once your data is in Input Layer format, you can run the entire Tuva package with a single dbt build command. Everything downstream — data quality, normalization, the core model, all 13+ data marts — just works.
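Before kicking off the full package, it's worth spot-checking what your connector produced. A minimal sanity query might look like this (the schema and column names such as `person_id` are illustrative; confirm them against the Input Layer documentation for your Tuva version):

```sql
-- Quick profile of the staged Input Layer medical_claim table:
-- row volume, distinct claims and patients, and date range covered.
select
    count(*)                  as claim_lines,
    count(distinct claim_id)  as claims,
    count(distinct person_id) as patients,
    min(claim_start_date)     as earliest_service_date,
    max(claim_end_date)       as latest_service_date
from input_layer.medical_claim
```

If the patient count or date range looks wrong here, fix the connector before running anything downstream.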

Data Quality — Validating Your Data

Tuva includes 100+ automated data quality checks that validate your data before it enters the pipeline. These checks catch problems early, before they propagate through every downstream model.

The checks fall into three categories:

  • Completeness — Are required fields populated? Does every claim have a patient ID, dates of service, and diagnosis codes?
  • Validity — Are codes valid? Is that ICD-10 code a real code? Are dates reasonable?
  • Consistency — Do dates make sense? Is the discharge date after the admission date? Does the claim amount match the sum of its line items?

The output is a data quality report that shows exactly what's wrong with your data, how many records are affected, and what needs to be fixed.
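As an illustration of the consistency category, a check like the one below flags stays where discharge precedes admission. This is a hand-written sketch, not one of Tuva's actual tests, and the schema/column names are assumptions:

```sql
-- Consistency check sketch: find claims whose discharge date
-- falls before the admission date (impossible, so flag them).
select
    claim_id,
    admission_date,
    discharge_date
from input_layer.medical_claim
where discharge_date < admission_date
```

Tuva packages checks like this so you get a report rather than having to write each query yourself.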

Normalization — Standardizing Codes

Normalization maps source codes to standard terminologies. Your source data might use proprietary codes, abbreviations, or non-standard formats. Normalization translates everything into a common language.

Examples of what normalization handles:

  • Custom drug codes → NDC (National Drug Code)
  • Facility type abbreviations → Standard CMS facility type codes
  • Discharge status values → Standard discharge status codes
  • Bill type codes → Standard UB-04 bill type codes

After normalization, every record in your data speaks the same language regardless of where it came from.
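A normalization step can be pictured as a lookup against a mapping table. In this sketch, `discharge_status_map` is a hypothetical client-specific crosswalk from source values to standard codes, not a table Tuva ships:

```sql
-- Hypothetical normalization: translate raw discharge status values
-- to standard codes via a mapping table, keeping unmapped values
-- visible rather than silently dropping them.
select
    mc.claim_id,
    mc.discharge_status_raw,
    coalesce(m.standard_code, 'unmapped') as discharge_disposition_code
from input_layer.medical_claim mc
left join custom_mapping.discharge_status_map m
    on mc.discharge_status_raw = m.source_value
```

The left join plus `coalesce` pattern is the important part: unmapped values surface as `'unmapped'` in data quality reports instead of disappearing.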

Claims Preprocessing — From Lines to Encounters

Raw claims data comes as individual claim lines — each line representing a single charge. Claims Preprocessing groups these lines into clinically meaningful encounters and episodes.

This is where raw claim data becomes useful for analytics:

  • Grouping — Multiple claim lines for the same hospital stay are combined into a single encounter
  • Service categories — Each encounter is classified as inpatient, outpatient, emergency department, skilled nursing, etc.
  • Claim type assignment — Professional vs. facility claims are identified and handled appropriately

Without this step, you'd be counting individual charge lines instead of actual patient visits.
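A heavily simplified version of the grouping idea looks like this. The real Tuva logic is far more involved (it also merges overlapping institutional claims into a single stay and handles many claim-type edge cases); this sketch only rolls claim lines up to the claim level, with illustrative names:

```sql
-- Simplified sketch: collapse individual claim lines into one row
-- per institutional claim, with the stay's date span and total paid.
select
    person_id,
    claim_id,
    min(claim_start_date) as encounter_start_date,
    max(claim_end_date)   as encounter_end_date,
    sum(paid_amount)      as total_paid_amount
from input_layer.medical_claim
where claim_type = 'institutional'
group by person_id, claim_id
```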

Core Data Model — The Unified Schema

The Core Data Model is a unified, patient-centric schema that brings together all the processed data. Key tables include:

  • patient — One row per patient with demographics and enrollment info
  • condition — All diagnoses mapped to standard terminologies
  • encounter — Every patient visit with dates, providers, and settings
  • lab_result — Lab test results with standard LOINC codes
  • medication — Prescriptions and drug claims with NDC codes
  • procedure — All procedures with CPT/HCPCS codes
  • observation — Clinical observations and vitals

Every downstream data mart reads from the Core Data Model, not from raw data. This means you only need to get the data right once.
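As an example of what "reading from the Core Data Model" looks like in practice, here is a simple analytics query. The table names follow the list above; the join column and the `encounter_type` value are illustrative assumptions:

```sql
-- Count inpatient encounters per patient from the core schema.
select
    p.person_id,
    count(e.encounter_id) as inpatient_encounters
from core.patient p
left join core.encounter e
    on p.person_id = e.person_id
   and e.encounter_type = 'acute inpatient'
group by p.person_id
```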

Warning: The Core Data Model is the foundation. If your Input Layer data is wrong, everything downstream will be wrong. Always check data quality first — fix issues at the source before running the full pipeline.

Terminology & Value Sets

Tuva ships with a comprehensive library of healthcare terminologies and value sets:

  • ICD-10-CM — Diagnosis codes (70,000+)
  • CPT — Procedure codes
  • SNOMED CT — Clinical terminology
  • HCPCS — Healthcare procedure codes (Level II)
  • NDC — National Drug Codes
  • CMS value sets — HCC mappings, CCSR categories, readmission flags

These are stored as versioned seed files and auto-loaded from S3, GCS, or Azure when you run dbt seed. Tuva also includes its own concept library for mapping conditions, which makes it easy to build condition-based cohorts (e.g., "all patients with diabetes").
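As a rough illustration of a terminology-driven cohort, the sketch below matches condition codes against the seeded ICD-10-CM table by description. Real cohort definitions should use curated value sets (or Tuva's condition concept library) rather than string matching, and the table/column names here are assumptions:

```sql
-- Hypothetical diabetes cohort: join normalized condition codes to
-- the seeded ICD-10-CM terminology and filter by description.
select distinct c.person_id
from core.condition c
inner join terminology.icd_10_cm t
    on c.normalized_code = t.icd_10_cm
where lower(t.short_description) like '%diabetes%'
```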

How It All Fits Together

Here's the simplified end-to-end flow showing how your data moves through the Tuva pipeline:

  1. Your Raw Data: Claims, eligibility, clinical records from your source systems
  2. Connector: Maps raw columns to the Input Layer schema
  3. Input Layer: Standardized tables: medical_claim, pharmacy_claim, eligibility
  4. DQ + Normalization + Preprocessing: Validate, standardize codes, and group claims into encounters
  5. Core Data Model: Unified patient-centric schema all marts read from
  6. Data Marts: 13+ pre-built analytics: CMS-HCC, HEDIS, readmissions, PMPM

Getting Started

Once your data is in Input Layer format, running Tuva is straightforward:

  1. dbt deps — Install the Tuva package and its dependencies
  2. dbt seed — Load terminology files (ICD-10, CPT, value sets) into your warehouse
  3. dbt build — Run the entire pipeline: data quality, normalization, core model, and data marts

That's it. Three commands to go from raw data to analytics-ready datasets.

Tuva uses dbt variables to control what runs:

  • claims_enabled — Set to true if you have claims data
  • clinical_enabled — Set to true if you have clinical/EHR data
  • tuva_marts_enabled — Control which specific data marts to build
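In dbt_project.yml, these variables sit under `vars`. A minimal fragment might look like this (the exact variable names and defaults should be confirmed against the Tuva docs for your version):

```yaml
# dbt_project.yml (fragment) -- enable claims processing only.
vars:
  claims_enabled: true      # you have claims data
  clinical_enabled: false   # no clinical/EHR data yet
```

Per-mart toggles like tuva_marts_enabled follow the same pattern; see the Tuva documentation for the full list.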

Tip: Start with DuckDB locally to test the pipeline before deploying to your production warehouse. You can use Tuva's sample datasets to see the entire pipeline in action without any real data.

Test Yourself

What is the Input Layer and why is it important?

The Input Layer is the standardized data format Tuva expects. It defines specific tables (medical_claim, pharmacy_claim, eligibility, etc.) with defined schemas. It's the most important concept because once your data is in Input Layer format, you can run the entire Tuva package with a single dbt build — everything downstream just works.

What's the difference between a Connector and the Input Layer?

A Connector is the code that transforms your raw source data into Input Layer format — it's the bridge between your specific data and Tuva's standard. The Input Layer is the destination format itself — the standardized tables with defined schemas. Think of the connector as the translator, and the Input Layer as the language being translated into.

Name 3 tables in the Core Data Model.

The Core Data Model includes: patient (demographics and enrollment), condition (diagnoses), encounter (visits), lab_result (lab tests), medication (prescriptions), procedure (procedures), and observation (clinical observations). Any three of these are correct.

What does Claims Preprocessing do?

Claims Preprocessing groups individual claim lines into clinically meaningful encounters and episodes. It assigns service categories (inpatient, outpatient, ED, skilled nursing, etc.) and handles claim type classification (professional vs. facility). Without it, you'd be analyzing individual charge lines instead of actual patient visits.

How do you enable specific data marts in Tuva?

You use dbt variables to control which parts of Tuva run. Set claims_enabled: true for claims data, clinical_enabled: true for clinical/EHR data, and tuva_marts_enabled to control which specific data marts to build. These are configured in your dbt_project.yml file.

Interview Questions

Q: Explain the Tuva Project pipeline from raw data to analytics-ready output.

The Tuva pipeline has 7 stages: (1) Connectors map raw source data into a standardized Input Layer format. (2) The Input Layer provides the standard tables (medical_claim, pharmacy_claim, eligibility) that the rest of the pipeline reads from. (3) Data Quality runs 100+ automated checks for completeness, validity, and consistency. (4) Normalization maps source codes to standard terminologies (ICD-10, CPT, SNOMED). (5) Claims Preprocessing groups claim lines into encounters and assigns service categories. (6) The Core Data Model creates a unified patient-centric schema. (7) Data Marts provide 13+ pre-built analytics for risk adjustment, quality measures, readmissions, and cost analysis.

Q: Why is the Input Layer the most critical component of the Tuva pipeline?

The Input Layer is the contract between your source data and the entire Tuva framework. Every downstream component — data quality, normalization, preprocessing, the core model, and all 13+ data marts — depends on Input Layer tables having the correct schema. If the Input Layer is wrong (missing fields, wrong types, bad mappings), every downstream result will be wrong. It's also the decoupling point: any data source can be connected to Tuva as long as it maps to the Input Layer, making the rest of the pipeline source-agnostic.

Q: How does Tuva handle data from multiple healthcare data sources with different formats?

Tuva uses a Connector pattern. Each source gets its own connector — a separate dbt project with SQL SELECT statements that map source-specific columns to the Input Layer schema. Pre-built connectors exist for common sources, and you write custom connectors for proprietary data. The key insight is that all connectors output the same Input Layer format, so the rest of the pipeline doesn't need to know where the data came from. This allows you to combine data from multiple payers, EHRs, or clearinghouses into a single unified dataset.

Q: What is the difference between Tuva's Normalization and Claims Preprocessing stages?

Normalization focuses on codes and values — mapping proprietary codes to standard terminologies (ICD-10, CPT, NDC) and standardizing field values (facility types, discharge statuses, bill types). Claims Preprocessing focuses on structure — grouping individual claim lines into encounters, assigning service categories (inpatient, outpatient, ED), and classifying claim types (professional vs. facility). Normalization ensures everything speaks the same language; preprocessing ensures the data is organized into clinically meaningful units.

Q: What are the steps to deploy Tuva in a new data warehouse environment?

Three main steps: (1) dbt deps to install the Tuva package and dependencies, (2) dbt seed to load terminology files (ICD-10, CPT, SNOMED, value sets) into the warehouse — these are versioned seed files auto-loaded from cloud storage, and (3) dbt build to run the entire pipeline. Before this, you need a Connector that maps your source data to the Input Layer. You configure dbt variables like claims_enabled, clinical_enabled, and tuva_marts_enabled to control what runs. Tuva supports Snowflake, BigQuery, Redshift, and DuckDB (for development).