Terminology & Value Sets

Disclaimer: For informational purposes only. This content is designed for data professionals learning healthcare domain knowledge, not for medical or insurance advice.
TL;DR

Tuva ships with healthcare terminologies (ICD-10, CPT, SNOMED, HCPCS, NDC), value sets (CMS-HCC, CCSR, readmissions), and its own concept library for conditions and diagnostics — all version-controlled and auto-loaded from cloud storage via dbt seed.

Explain Like I'm 12

Imagine every country spoke a different language, and you needed a universal dictionary so everyone could understand each other. That's what healthcare terminology is — a shared set of codes so that when a doctor in New York says "E11.9," a computer in California knows it means "Type 2 Diabetes."

Tuva comes with all these dictionaries pre-loaded. You don't have to go find them, download them, or figure out the right version. They're already there, ready to use, and they update automatically when new codes come out.

Tuva Terminology Architecture

Tuva terminology layers: Code Systems, Value Sets, Concept Library, Reference Data

How Tuva Manages Terminology

Tuva stores its terminology data in DoltHub repositories — version-controlled databases that work like Git for data. The terminology files are distributed as versioned seed files from cloud storage (S3, GCS, or Azure Blob), and they're auto-loaded when you run dbt seed.

Why This Is a Big Deal

  • Version-controlled: Every terminology update is tracked. You can see exactly what changed between versions and when.
  • Reproducible: Pin a specific version in your dbt_project.yml and your pipeline will always use the same terminology, regardless of when you run it.
  • Automated: No manual downloads from CMS websites, no parsing Excel files, no building your own lookup tables. Run dbt seed and everything loads.
  • Multi-warehouse: The same terminology works across Snowflake, BigQuery, Redshift, Databricks, and DuckDB.
The old way vs. the Tuva way: Before Tuva, healthcare data teams spent weeks downloading code files from CMS, parsing them, loading them into staging tables, and maintaining update scripts. Tuva reduces this to one command: dbt seed.

Four Layers of Terminology

Tuva organizes its terminology into four distinct layers, each serving a different purpose in the analytics pipeline:

Layer 1: Code Systems

Code systems are the raw dictionaries of healthcare. Each code system defines a set of codes with descriptions. These are the standard vocabularies used across the entire US healthcare system.

| Code System | What It Describes | Example Code | Tuva Table |
|---|---|---|---|
| ICD-10-CM | Diagnoses (what's wrong with the patient) | E11.9 = Type 2 diabetes without complications | terminology.icd_10_cm |
| CPT | Procedures (what was done) | 99213 = Office visit, established patient | terminology.cpt |
| SNOMED CT | Clinical concepts (broader clinical terms) | 44054006 = Type 2 diabetes mellitus | terminology.snomed_ct |
| HCPCS | Supplies, equipment, services (what was used) | E0601 = CPAP device | terminology.hcpcs |
| NDC | Drugs (which specific medication) | 0069-3150-83 = specific Lipitor package | terminology.ndc |
How these connect: A single patient encounter might use all five code systems: ICD-10 for the diagnosis, CPT for the procedure, HCPCS for the equipment, NDC for the prescription, and SNOMED for the clinical record. Tuva has all of them loaded and cross-referenced.
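To make the "all five systems in one encounter" idea concrete, here is a minimal sketch. The encounter fields and their values are made up for illustration; only the terminology table names come from the text above.

```python
# Hypothetical encounter record touching all five code systems.
# Field names are illustrative, not Tuva's actual schema.
encounter = {
    "diagnosis_code": "E11.9",        # ICD-10-CM: Type 2 diabetes
    "procedure_code": "99213",        # CPT: established-patient office visit
    "equipment_code": "E0601",        # HCPCS: CPAP device
    "drug_code": "0069-3150-83",      # NDC: a specific drug package
    "clinical_concept": "44054006",   # SNOMED CT: Type 2 diabetes mellitus
}

# Each field resolves against a different terminology table:
code_system_for_field = {
    "diagnosis_code": "terminology.icd_10_cm",
    "procedure_code": "terminology.cpt",
    "equipment_code": "terminology.hcpcs",
    "drug_code": "terminology.ndc",
    "clinical_concept": "terminology.snomed_ct",
}

for field, code in encounter.items():
    print(f"{code:>14} -> {code_system_for_field[field]}")
```

One encounter, five lookups against five different dictionaries — which is why having all of them pre-loaded and cross-referenced matters.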

Layer 2: Value Sets

Value sets are curated groupings of codes built for specific analytics use cases. Instead of working with raw codes, value sets tell you which codes belong to a particular concept or measure.

Key Value Sets in Tuva

  • CMS-HCC Mappings: Which ICD-10 diagnosis codes map to which HCC categories. This crosswalk is the foundation of risk adjustment. Example: ICD-10 E11.9 (Type 2 Diabetes) maps to HCC 19 (Diabetes without Complication).
  • CCSR Categories: Groups ICD-10 codes into ~530 clinically meaningful diagnosis categories and ~320 procedure categories.
  • Readmission Flags: Which diagnosis codes count as planned readmissions (excluded from penalty calculations) vs. unplanned readmissions.
  • Quality Measure Specifications: The code-level definitions for HEDIS-style measures — which codes define the denominator (eligible population) and numerator (measure met).
  • Chronic Condition Definitions: CMS Chronic Condition Warehouse (CCW) algorithms that define how to identify conditions like diabetes, CHF, and COPD from claims data.
Value sets power the data marts: Every data mart depends on value sets. The CMS-HCC mart uses the HCC mapping value set. The Quality Measures mart uses the measure specification value sets. The Readmissions mart uses readmission flag value sets. Without accurate value sets, the marts can't produce correct results.
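A value-set lookup can be sketched as a simple crosswalk. The one mapping shown comes from the text above; a real CMS-HCC crosswalk table holds thousands of rows, and this dict is only a stand-in for it.

```python
# Minimal sketch of a value-set lookup: ICD-10-CM -> HCC category.
# One real mapping from the text; everything else is omitted.
hcc_crosswalk = {
    "E11.9": {"hcc": "19", "description": "Diabetes without Complication"},
}

def lookup_hcc(icd_10_code: str):
    """Return the HCC mapping for a diagnosis code, or None when the
    code is not in the value set (not every diagnosis risk-adjusts)."""
    return hcc_crosswalk.get(icd_10_code)

print(lookup_hcc("E11.9"))   # maps to HCC 19
print(lookup_hcc("Z00.00"))  # routine exam: not in the value set -> None
```

The `None` case is the important half: a value set is a curated subset, so most codes in the full ICD-10-CM code system deliberately map to nothing.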

Layer 3: Concept Library

The Concept Library holds Tuva's own definitions for healthcare concepts. It builds on top of standard code systems but adds clinical meaning and cross-references that don't exist in any single standard.

What the Concept Library Contains

  • Conditions: Clinical definitions for diabetes, CHF, COPD, depression, CKD, and dozens more — each mapped to the relevant ICD-10 codes, HCCs, and CCW algorithms
  • Diagnostics: Lab tests and their clinical significance, including normal ranges and what abnormal values indicate
  • Procedures: Grouped procedures with clinical context (e.g., all codes that constitute a "knee replacement" episode)
  • Service Categories: Tuva's own classification of healthcare services (inpatient, outpatient professional, outpatient facility, ED, pharmacy, etc.)

The Concept Library is what makes Tuva more than just a code lookup tool. It provides the clinical logic layer that connects raw codes to meaningful analytics.
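As a toy illustration of the service-category idea, here is a simplified classifier. The rules below are deliberately minimal assumptions (inpatient and outpatient bill types really do start with 11 and 13 on institutional claims), but Tuva's actual logic combines bill type, revenue center, and place-of-service codes and is more involved than this.

```python
def service_category(claim_type: str, bill_type_code: str) -> str:
    """Toy service-category classifier. Real logic inspects bill type,
    revenue center, and place-of-service codes together."""
    if claim_type == "institutional" and bill_type_code.startswith("11"):
        return "inpatient"
    if claim_type == "institutional" and bill_type_code.startswith("13"):
        return "outpatient facility"
    if claim_type == "professional":
        return "outpatient professional"
    return "other"

print(service_category("institutional", "111"))  # inpatient
print(service_category("professional", ""))      # outpatient professional
```

Even this crude version shows the point of the layer: the classification is clinical/operational logic layered on top of raw codes, not something any single code system provides.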

Layer 4: Reference Data

Reference data provides the supporting context that analytics pipelines need beyond clinical codes:

  • Calendar tables: Pre-built date dimensions with fiscal years, quarters, and healthcare-specific periods (measurement year, benefit year)
  • Geographic crosswalks: ZIP code to state, state to region, FIPS codes, MSA (Metropolitan Statistical Area) mappings
  • Provider type references: Taxonomy codes to provider specialties and types
  • Code type lookups: Mappings between different coding systems (e.g., place of service codes, bill type codes, discharge status codes)
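The calendar-table idea can be sketched as the kind of row such a dimension provides. The field names below are illustrative, not Tuva's actual column names.

```python
from datetime import date

def calendar_row(d: date, measurement_year: int) -> dict:
    """Build one illustrative date-dimension row with a quarter and a
    healthcare-specific flag (is this date inside the measurement year?)."""
    return {
        "calendar_date": d.isoformat(),
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "in_measurement_year": d.year == measurement_year,
    }

print(calendar_row(date(2024, 7, 15), measurement_year=2024))
```

Pre-building thousands of such rows once, instead of recomputing date logic in every mart query, is the whole value of a reference calendar table.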

How Normalization Uses Terminology

When Tuva normalizes your raw data, it maps your source codes to standard terminologies. Here's the flow for a single diagnosis:

  1. Source code: Your raw data has a diagnosis field with value "DX123" (a proprietary code from your source system)
  2. Normalization: Tuva's normalization layer maps "DX123" to the standard ICD-10 code E11.9 (Type 2 Diabetes without complications)
  3. Value set lookup: The CMS-HCC value set maps E11.9 to HCC 19 (Diabetes without Complication)
  4. Data mart output: The CMS-HCC mart uses HCC 19's coefficient to calculate the patient's RAF score

This chain — source code to standard code to value set to analytics output — is why terminology accuracy is critical. A wrong mapping at any step cascades through the entire pipeline.
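The four-step chain above can be sketched end to end. The "DX123" mapping and the HCC lookup come from the steps in the text; the coefficient value is a made-up stand-in, not real CMS data.

```python
# End-to-end sketch: proprietary source code -> ICD-10 -> HCC -> RAF
# contribution. Mappings are one-row stand-ins; 0.105 is illustrative.
source_to_icd10 = {"DX123": "E11.9"}   # connector/normalization mapping
icd10_to_hcc    = {"E11.9": "19"}      # CMS-HCC value set
hcc_coefficient = {"19": 0.105}        # illustrative coefficient

def raf_contribution(source_code: str) -> float:
    icd10 = source_to_icd10[source_code]   # step 2: normalization
    hcc = icd10_to_hcc[icd10]              # step 3: value set lookup
    return hcc_coefficient[hcc]            # step 4: mart output

print(raf_contribution("DX123"))  # 0.105
```

A wrong entry in any of the three dictionaries changes the final number, which is the cascade the next paragraph warns about.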

Garbage in, garbage out: If your source data maps a diabetes code incorrectly, the HCC mart will calculate wrong RAF scores, the Chronic Conditions mart will miscategorize patients, and the Quality Measures mart will misidentify eligible populations. Terminology accuracy is the foundation of everything.

The Terminology Viewer

Tuva provides a web-based Terminology Viewer at thetuvaproject.com for browsing code systems and value sets interactively. You can:

  • Search codes: Look up any ICD-10, CPT, HCPCS, or SNOMED code and see its description
  • Browse value sets: See which codes belong to a specific HCC category, CCSR group, or quality measure
  • Explore mappings: Trace how a diagnosis code flows through the value set hierarchy (ICD-10 to HCC to RAF)
  • Compare versions: See what changed between terminology updates
Practical use: When a clinician asks "why was this patient flagged for HCC 85?" you can use the Terminology Viewer to trace backward from the HCC to the specific ICD-10 codes that triggered it. This is invaluable for coding accuracy reviews.
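The backward trace the viewer enables is just an inversion of the forward crosswalk. The handful of mappings below are illustrative examples (heart-failure codes grouping under one HCC), not the full CMS crosswalk.

```python
# Invert an ICD-10 -> HCC crosswalk to answer "which diagnosis codes
# triggered this HCC?". Mappings shown are illustrative examples only.
icd10_to_hcc = {
    "E11.9": "19",
    "I50.9": "85",
    "I50.22": "85",
}

def codes_for_hcc(hcc: str) -> list[str]:
    return sorted(code for code, h in icd10_to_hcc.items() if h == hcc)

print(codes_for_hcc("85"))  # the heart-failure codes behind HCC 85
```

Given a flagged HCC, this returns the candidate diagnosis codes to review against the patient's chart — the same workflow described above, minus the web UI.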

Managing Updates

Healthcare terminologies update regularly — ICD-10 codes change annually, CMS-HCC mappings update with each model version, and HEDIS measure specifications change yearly. Tuva handles this through a seed versioning system.

How Versioning Works

  1. Version pinning: Your dbt_project.yml specifies which terminology version to use
  2. Update process: Change the version number in your config file
  3. Re-seed: Run dbt seed to download and load the new terminology files
  4. Rebuild: Run dbt build to recalculate all marts with the updated terminology
```yaml
# dbt_project.yml
vars:
  # Pin to a specific terminology version
  tuva_terminology_version: "0.10.0"
```
Backward compatibility: Because terminology versions are pinned, you can always reproduce historical results. Need to recalculate 2024 risk scores with the 2024 version of the HCC crosswalk? Just set the version back and re-run.

Test Yourself

Q: What are the four layers of terminology in Tuva?

The four layers are: (1) Code Systems — standard healthcare vocabularies like ICD-10, CPT, SNOMED, HCPCS, and NDC (the raw dictionaries), (2) Value Sets — curated groupings of codes for specific analytics like CMS-HCC mappings, CCSR categories, and quality measure specifications, (3) Concept Library — Tuva's own clinical definitions for conditions, diagnostics, and procedures that add meaning on top of standard codes, (4) Reference Data — supporting context like calendar tables, geographic crosswalks, and provider type lookups.

Q: How does Tuva distribute and load terminology data?

Tuva stores terminology data in DoltHub repositories (version-controlled databases). The data is distributed as versioned seed files from cloud storage (S3, GCS, or Azure Blob). To load them, you run dbt seed, which downloads the files for your pinned version and loads them into your data warehouse. This replaces the old manual process of downloading code files from CMS websites, parsing them, and building lookup tables.

Q: What is the difference between a code system and a value set?

A code system is a complete vocabulary — it defines all possible codes and their descriptions (e.g., all 70,000+ ICD-10-CM codes). A value set is a curated subset of codes from one or more code systems, grouped for a specific purpose. For example, the CMS-HCC value set takes specific ICD-10 codes and maps them to HCC categories. Code systems answer "what does this code mean?" while value sets answer "which codes belong to this concept?"

Q: Trace a diagnosis code from source data through Tuva's terminology to a data mart output.

Starting with a source code "DX123" in raw data: (1) Normalization maps "DX123" to the standard ICD-10 code E11.9 (Type 2 Diabetes) using the code system layer, (2) The CMS-HCC value set maps E11.9 to HCC 19 (Diabetes without Complication), (3) The CMS-HCC data mart uses HCC 19's coefficient to contribute to the patient's RAF score. This chain — source to standard code to value set to analytics output — demonstrates why terminology accuracy is foundational to the entire pipeline.

Q: How do you update Tuva's terminology to a new version?

Three steps: (1) Change the tuva_terminology_version variable in your dbt_project.yml to the new version number, (2) Run dbt seed to download and load the new terminology files into your warehouse, (3) Run dbt build to recalculate all data marts with the updated terminology. Because versions are pinned, you can always reproduce historical results by setting the version back to the original.

Interview Questions

Q: How does Tuva manage healthcare terminology, and why is this approach better than manual management?

Tuva manages terminology through version-controlled seed files stored in DoltHub repositories and distributed via cloud storage (S3/GCS/Azure). Loading is automated via dbt seed. This is better than manual management because: (1) Version control — every update is tracked, and you can see exactly what changed, (2) Reproducibility — pin a version in dbt_project.yml and results are consistent regardless of when you run, (3) Automation — no manual downloads from CMS websites, no parsing Excel files, no building lookup tables, (4) Multi-warehouse — same terminology works across Snowflake, BigQuery, Redshift, Databricks, and DuckDB. The old way took weeks of engineering time per update. Tuva reduces it to one command.

Q: What is the difference between code systems and value sets in Tuva?

Code systems are complete standard vocabularies — ICD-10-CM (70,000+ diagnosis codes), CPT (10,000+ procedure codes), SNOMED CT, HCPCS, NDC. They define every possible code and its description. They answer "what does this code mean?" Value sets are curated subsets of codes grouped for specific analytics purposes. The CMS-HCC value set maps specific ICD-10 codes to HCC categories for risk adjustment. CCSR value sets group ICD-10 codes into clinically meaningful categories. Quality measure value sets define eligible populations and numerator criteria. Value sets answer "which codes belong to this concept?" Every data mart depends on value sets to function correctly.

Q: How does Tuva's normalization process use terminology to standardize data?

Normalization maps source-specific codes to standard terminologies through a chain: (1) Raw source codes from the Input Layer (e.g., proprietary diagnosis code "DX123") are mapped to standard code system codes (e.g., ICD-10 E11.9), (2) Standard codes are then looked up in value sets (e.g., E11.9 maps to HCC 19 via the CMS-HCC crosswalk), (3) Data marts consume the standardized codes and value set mappings to produce analytics output (e.g., HCC 19's coefficient contributes to the RAF score). Each Connector includes the mapping logic specific to its source system. The accuracy of this chain is critical — a wrong mapping at any step cascades through the entire pipeline into incorrect analytics results.

Q: How are terminology updates handled in Tuva, and how do you ensure reproducibility?

Tuva uses a seed versioning system. The tuva_terminology_version variable in dbt_project.yml pins the exact terminology version your pipeline uses. To update: change the version number, run dbt seed to load new files, and dbt build to recalculate marts. Reproducibility is guaranteed because version pinning means the same version number always produces the same terminology data, regardless of when or where you run the pipeline. This is critical for regulatory compliance — if CMS audits your risk adjustment submissions, you need to prove exactly which terminology version was used. You can also run multiple versions in parallel for comparison (e.g., "what would our RAF scores look like under the new HCC model?").