The Tuva Project Interview Questions

TL;DR

25+ interview questions about The Tuva Project, organized by topic, each followed by a detailed answer. Covers architecture, data marts, risk adjustment, quality measures, terminology, dbt integration, and real-world healthcare analytics scenarios.

Short on time? Focus on Architecture & Pipeline, CMS-HCC Risk Adjustment, and dbt Integration — these come up most in healthcare analytics interviews that involve Tuva.
YMYL Disclaimer: This content is for informational purposes only. It is designed for data professionals preparing for healthcare analytics interviews — not for medical advice or clinical decision-making.

Tuva Fundamentals

Q: What is The Tuva Project and what problem does it solve?

The Tuva Project is an open-source healthcare analytics framework built on dbt (data build tool). It solves the problem that every healthcare organization builds the same analytics from scratch — risk adjustment, quality measures, cost analysis, readmissions — each time re-implementing complex clinical logic. Tuva provides a standardized, tested, version-controlled pipeline that transforms raw healthcare claims data into analytics-ready tables. It includes pre-built connectors for common data sources, a normalization layer, a Core Data Model, 13+ analytics data marts, and all the healthcare terminology needed (ICD-10, CPT, HCC mappings, etc.). Instead of months of custom development, teams can have production-ready healthcare analytics running in days.

Q: What is the Input Layer in Tuva?

The Input Layer is Tuva's standardized interface for ingesting raw healthcare data. It defines a specific schema that all source data must conform to before entering the pipeline. The Input Layer includes tables for medical claims, pharmacy claims, eligibility, and provider data with defined columns and data types. Your raw data — whether from an EHR, claims warehouse, or flat files — must be mapped to this schema. This is where Connectors help: pre-built connectors handle the mapping from common source formats (Medicare LDS, Athena, etc.) to the Input Layer automatically.

Q: How do Tuva Connectors work?

Connectors are pre-built dbt packages that map data from specific source systems into Tuva's Input Layer format. Each connector knows the source system's schema and handles all the translation logic — column renaming, data type casting, code mapping, and structural transformation. For example, the Medicare LDS connector knows that the Medicare Limited Data Set uses specific column names and formats, and it transforms them into Tuva's standardized Input Layer tables. If no connector exists for your source, you build a custom one by creating dbt models that map your raw tables to the Input Layer schema. This is typically the most labor-intensive part of a Tuva implementation.

Q: What data warehouses does Tuva support?

Tuva supports five data platforms: Snowflake, Google BigQuery, Amazon Redshift, Databricks, and DuckDB. The same dbt code works across all of them because Tuva uses dbt's adapter system to handle SQL dialect differences. DuckDB — the one non-cloud option — is particularly useful for local development and testing: you can run the entire Tuva pipeline on your laptop without a cloud warehouse. For production, most organizations use Snowflake or BigQuery. The choice of platform doesn't affect the analytics logic or output — only the SQL syntax is adapted.

Q: Why is Tuva built on dbt rather than a custom framework?

Building on dbt gives Tuva several advantages: (1) SQL-native — healthcare analysts who know SQL can read, customize, and extend every model without learning a proprietary language, (2) Dependency management — dbt's ref() function ensures models run in the correct order across the 7-stage pipeline, (3) Testing — dbt's built-in testing framework validates data quality at every stage (not null, unique, accepted values, relationships), (4) Seed files — dbt seeds are perfect for loading terminology data (ICD-10 codes, HCC mappings), (5) Package management — Tuva can be installed as a dbt package via packages.yml, making version management trivial, (6) Community — the large dbt community means more contributors, faster bug fixes, and easier hiring.

Want deeper coverage? See Tuva Project Overview and Core Concepts.

Architecture & Pipeline

Q: Walk through Tuva's 7-stage pipeline from raw data to analytics output.

The 7 stages are:

1. Input Layer: Raw healthcare data mapped to Tuva's standardized schema (medical claims, pharmacy claims, eligibility, providers).
2. Connectors: Pre-built transformations that map specific source systems (Medicare LDS, Athena, etc.) into the Input Layer format.
3. Staging: Initial data cleaning, type casting, and basic validation of Input Layer data.
4. Normalization: Maps source-specific codes to standard terminologies (ICD-10, CPT, etc.) and standardizes formats across all sources.
5. Claims Preprocessing: Healthcare-specific logic — claim grouping, service categorization, encounter assignment, duplicate resolution.
6. Core Data Model: The clean, standardized analytical foundation with tables for conditions, encounters, eligibility spans, procedures, and medications.
7. Data Marts: 13+ pre-built analytics modules (CMS-HCC, Quality Measures, Readmissions, PMPM, etc.) that produce analytics-ready output tables.

Q: What is the difference between Normalization and Claims Preprocessing?

Normalization focuses on code standardization — it maps source-specific codes to standard healthcare terminologies. For example, mapping a proprietary diagnosis code to ICD-10, or standardizing gender values from "M/F" to "male/female." It's about making sure everyone speaks the same language.

Claims Preprocessing focuses on healthcare-specific business logic — it applies clinical and billing rules that require domain knowledge. This includes: grouping claim lines into complete claims, assigning service categories (inpatient, outpatient, ED, professional), identifying encounters from individual claims, resolving duplicate or overlapping claims, and applying claim type hierarchies. Normalization makes the data consistent; Claims Preprocessing makes it analytically meaningful.

Q: What does the Core Data Model contain?

The Core Data Model is the clean, standardized analytical foundation that all data marts build upon. Key tables include:

condition: Patient conditions with standardized ICD-10 codes, onset dates, and status
encounter: Clinical encounters (inpatient stays, ED visits, office visits) with dates, types, and providers
eligibility: Member enrollment spans with plan details and coverage dates
medical_claim: Standardized medical claims with normalized codes and amounts
pharmacy_claim: Standardized pharmacy claims with NDC codes and costs
procedure: Procedures performed with CPT/HCPCS codes
lab_result: Laboratory test results with LOINC codes
medication: Medication records with RxNorm codes

The Core Model is source-agnostic — regardless of whether data came from Medicare, a commercial payer, or an EHR, it all conforms to the same schema.

Q: How do data marts relate to the Core Data Model?

Data marts are specialized analytics layers that read from the Core Data Model and produce use-case-specific output tables. The relationship is one-to-many: one Core Model feeds many data marts. Each mart applies its own analytical logic: the CMS-HCC mart maps Core Model conditions to HCC categories and calculates RAF scores, the Quality Measures mart evaluates Core Model encounters and conditions against HEDIS specifications, the Financial PMPM mart aggregates Core Model claims and eligibility into per-member-per-month cost tables. This architecture means you build the Core Model once and get 13+ analytics products from it. Adding a new data source means building one connector — all existing marts automatically work with the new data.

Q: How does Tuva handle data quality?

Tuva handles data quality at multiple levels: (1) dbt tests at every pipeline stage — not_null, unique, accepted_values, and relationships tests validate data integrity, (2) Normalization validation — codes that can't be mapped to standard terminologies are flagged for review, (3) Claims Preprocessing checks — duplicate detection, claim line grouping validation, and service category assignment rules, (4) Data quality mart — a dedicated mart that generates quality metrics: mapping rates, null rates, duplicate rates, and code validity percentages across the pipeline, (5) Freshness checks — dbt source freshness tests alert when source data stops flowing. The data quality mart is particularly important — it gives you a dashboard-ready view of where your data is strong and where it needs attention.

Deeper coverage: Core Concepts

Data Marts & Analytics

Q: What is CMS-HCC and how does Tuva implement it?

CMS-HCC (Hierarchical Condition Categories) is the risk adjustment model CMS uses to determine Medicare Advantage plan payments. Tuva implements it through a dedicated data mart that: (1) Maps ICD-10 diagnosis codes from the Core Model's condition table to HCC categories using the official CMS crosswalk (loaded as terminology seed data), (2) Applies the HCC hierarchy — when a patient has both a higher-severity and lower-severity condition in the same hierarchy, only the higher one counts, (3) Calculates demographic risk factors (age, sex, Medicaid dual-eligibility, institutional status), (4) Computes interaction terms (certain HCC combinations that amplify risk), (5) Produces the final RAF score per patient by summing all components. Output tables include patient_risk_factors (individual HCCs), patient_risk_scores (final RAF), hcc_suspecting (undercoded conditions), and hcc_recapture (documentation gaps for chronic conditions).
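The final summation step (5) can be sketched in dbt-style SQL. The table and column names here (patient_risk_factors, coefficient) are illustrative placeholders, not Tuva's exact schema:

```sql
-- Illustrative sketch of the final RAF summation step.
-- Table and column names are hypothetical, not Tuva's exact schema.
select
    patient_id,
    sum(coefficient) as raf_score  -- demographic + HCC + interaction coefficients
from patient_risk_factors          -- one row per patient per risk component,
                                   -- produced after hierarchy logic has been applied
group by patient_id
```

The key point the sketch illustrates: a RAF score is just a sum of coefficients, so every dropped or suppressed-by-hierarchy component directly changes payment.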

Q: Explain PMPM and how Tuva's Financial PMPM mart calculates it.

PMPM (Per-Member-Per-Month) is the foundational metric in healthcare finance. Formula: PMPM = Total Cost / Member Months. It normalizes spending to a per-person-per-month basis for fair comparison. Tuva's Financial PMPM mart calculates it by: (1) Pulling paid amounts from the Core Model's medical_claim and pharmacy_claim tables, (2) Calculating member months from the eligibility table (each month a member is enrolled = 1 member month), (3) Breaking down costs by service category (inpatient, outpatient, professional, ED, pharmacy), (4) Aggregating at multiple levels: total, by payer/plan, by provider, by service category, and by time period. The output tables (pmpm_prep and pmpm_payer_plan) are designed to plug directly into BI tools for financial dashboards showing cost trends, category breakdowns, and variance analysis.
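The core calculation can be sketched in dbt-style SQL. The table names (eligibility_by_month, medical_claim) and columns are illustrative assumptions, not Tuva's exact output schema:

```sql
-- Minimal PMPM sketch: total paid amount divided by member months.
-- Table and column names are hypothetical, not Tuva's exact schema.
with member_months as (
    select year_month, count(*) as member_months
    from eligibility_by_month      -- one row per member per enrolled month
    group by year_month
),
monthly_paid as (
    select year_month, sum(paid_amount) as total_paid
    from medical_claim
    group by year_month
)
select
    p.year_month,
    p.total_paid / m.member_months as pmpm
from monthly_paid p
join member_months m
  on m.year_month = p.year_month
```

In practice the mart also breaks these aggregates down by service category, payer, and plan before the division, but the formula is the same at every grain.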

Q: How does the readmissions mart work?

The Readmissions mart identifies 30-day all-cause readmissions from the Core Model's encounter data. It works by: (1) Identifying index admissions — qualifying inpatient stays (excluding certain categories like psychiatric, rehab, and transfers), (2) Looking forward 30 days from each discharge date for any subsequent inpatient admission, (3) Classifying readmissions as planned (scheduled surgeries, planned procedures) or unplanned (unexpected returns) using CMS's planned readmission algorithm, (4) Flagging potentially preventable readmissions based on diagnosis and timing patterns. Output tables include the index admission details, readmission details, days between discharge and readmission, and preventability flags. This feeds CMS penalty calculations under the Hospital Readmissions Reduction Program (HRRP), where hospitals with excess readmissions face up to 3% payment reductions.
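The 30-day lookforward (step 2) is essentially a self-join on the encounter data. A hedged sketch, using hypothetical table/column names and generic date arithmetic (the exact syntax varies by warehouse):

```sql
-- Sketch of the 30-day lookforward: pair each index discharge with any
-- subsequent inpatient admission within 30 days. Names are illustrative,
-- and date arithmetic syntax varies by SQL dialect.
select
    idx.encounter_id                         as index_encounter_id,
    readmit.encounter_id                     as readmission_encounter_id,
    readmit.admit_date - idx.discharge_date  as days_to_readmission
from inpatient_encounters idx
join inpatient_encounters readmit
  on  readmit.patient_id = idx.patient_id
  and readmit.admit_date >  idx.discharge_date
  and readmit.admit_date <= idx.discharge_date + 30
```

The real mart layers index-admission exclusions and the planned-readmission classification on top of this join.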

Q: What is the difference between HCC suspecting and HCC recapture?

Both are revenue recovery tools, but they target different gaps:

HCC Suspecting identifies conditions that are likely present but never documented on claims. It uses indirect evidence: a patient on insulin (from pharmacy claims) without a diabetes diagnosis on medical claims, or abnormal lab values without a corresponding condition code. These are net-new HCC opportunities — conditions that have never been coded.

HCC Recapture tracks conditions that were documented in prior years but haven't been re-documented this year. Since CMS requires chronic conditions to be documented annually for risk adjustment, any condition that "drops off" means lost RAF score. These are documentation gaps — the condition still exists, it just hasn't been coded yet this year.

Together, they form a complete revenue integrity strategy: suspecting finds new money, recapture prevents losing existing money.
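A typical pharmacy-to-diagnosis suspecting rule can be sketched in SQL. All table and column names here (pharmacy_claim, insulin_ndc_codes, condition) are illustrative, not Tuva's exact suspecting logic:

```sql
-- Sketch of a pharmacy-to-diagnosis suspecting rule: members with insulin
-- fills but no diabetes diagnosis on medical claims. Names are illustrative.
select distinct rx.patient_id
from pharmacy_claim rx
join insulin_ndc_codes ndc          -- a value set of insulin NDC codes
  on rx.ndc_code = ndc.ndc_code
where not exists (
    select 1
    from condition c
    where c.patient_id = rx.patient_id
      and c.icd_10_code like 'E11%'  -- Type 2 diabetes code family
)
```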

Q: Name 5 data marts in Tuva and explain their use cases.

1. CMS-HCC (Risk Adjustment): Maps diagnoses to HCC categories and calculates RAF scores for Medicare Advantage payment optimization. The most commercially valuable mart.

2. Quality Measures (HEDIS): Calculates HEDIS-style quality measures, identifies care gaps, and tracks compliance rates. Feeds directly into CMS STAR rating improvement efforts.

3. Financial PMPM: Calculates Per-Member-Per-Month costs broken down by service category, provider, and time period. The foundation of healthcare financial reporting.

4. Chronic Conditions: Identifies and groups patients by chronic conditions using CMS CCW definitions. Powers population health segmentation and care management targeting.

5. Readmissions: Flags 30-day all-cause readmissions and classifies them as planned vs. unplanned. Used for CMS penalty avoidance and quality improvement initiatives.

Deeper coverage: Data Marts & Analytics

Terminology & Value Sets

Q: How does Tuva manage healthcare terminology?

Tuva manages terminology through a version-controlled seed system. Terminology data (ICD-10 codes, CPT descriptions, HCC mappings, CCSR categories, etc.) is stored in DoltHub repositories and distributed as versioned files from cloud storage (S3/GCS/Azure). When you run dbt seed, these files are automatically downloaded and loaded into your data warehouse. The version is pinned in dbt_project.yml, ensuring reproducibility. This replaces the traditional approach of manually downloading code files from CMS websites, parsing them, and building lookup tables — a process that typically took weeks of engineering time per update.

Q: What is the difference between code systems and value sets in Tuva?

Code systems are complete standard vocabularies — the "raw dictionaries" of healthcare. ICD-10-CM has 70,000+ diagnosis codes, CPT has 10,000+ procedure codes, SNOMED CT has hundreds of thousands of clinical concepts. Each code has a description. They answer: "what does this code mean?"

Value sets are curated subsets of codes grouped for specific analytics purposes. The CMS-HCC value set maps specific ICD-10 codes to HCC categories. CCSR value sets group ICD-10 codes into ~530 clinically meaningful categories. Quality measure value sets define which codes constitute the eligible population and numerator criteria for each HEDIS measure. They answer: "which codes belong to this concept?"

Every data mart depends on value sets. Without the HCC mapping value set, the CMS-HCC mart can't calculate RAF scores. Without measure specification value sets, the Quality Measures mart can't identify care gaps.

Q: How does Tuva's normalization process use terminology to standardize data?

Normalization uses terminology in a chain: (1) Source-specific codes from the Input Layer (e.g., proprietary code "DX123") are mapped to standard code system codes using source-specific mapping tables (e.g., ICD-10 E11.9 for Type 2 Diabetes), (2) Standard codes are validated against the terminology seed tables to ensure they're legitimate codes with descriptions, (3) Invalid or unmappable codes are flagged in data quality reports, (4) Downstream, value sets further classify the standardized codes (e.g., E11.9 maps to HCC 19 via the CMS-HCC crosswalk). The terminology layer is the foundation — if a code doesn't map correctly during normalization, it cascades into incorrect results in every data mart that uses it.

Q: How are terminology updates handled in Tuva?

Three-step process: (1) Update the tuva_terminology_version variable in dbt_project.yml to the new version number, (2) Run dbt seed to download and load the updated terminology files, (3) Run dbt build to recalculate all data marts with the new terminology. Version pinning ensures reproducibility — the same version number always produces the same terminology data. This is critical for regulatory compliance: if CMS audits risk adjustment submissions, you need to prove which terminology version (and therefore which HCC crosswalk) was used. You can also run multiple versions in parallel to compare impact (e.g., "how would RAF scores change under the new CMS-HCC model version?").
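Step 1 is a one-line configuration change. A sketch of what it looks like in dbt_project.yml — the variable name follows the description above, and "x.y.z" is a placeholder; confirm the exact key against the Tuva docs for your installed version:

```yaml
# dbt_project.yml sketch: pinning the terminology version (step 1).
# "x.y.z" is a placeholder; confirm the exact variable name in the Tuva docs.
vars:
  tuva_terminology_version: "x.y.z"
```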

Deeper coverage: Terminology & Value Sets

dbt Integration & Practical

Q: How do you install and configure Tuva in a dbt project?

Installation follows standard dbt package management: (1) Add Tuva to your packages.yml file with the desired version, (2) Run dbt deps to download the package, (3) Configure your dbt_project.yml with Tuva-specific variables — database/schema settings, which data marts to enable, terminology version, and source table references, (4) Build or configure your connector to map your source data to Tuva's Input Layer, (5) Run dbt seed to load terminology data, (6) Run dbt build to execute the entire pipeline. The most time-consuming step is typically building the connector (step 4) if a pre-built one doesn't exist for your source system. Everything else is configuration.
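Step 1 looks like any other dbt package installation. A sketch of packages.yml — the package name shown is the commonly used one, but confirm it and pin a real version (the "x.y.z" here is a placeholder):

```yaml
# packages.yml sketch: installing Tuva as a dbt package (step 1).
# Pin an exact version for reproducibility; "x.y.z" is a placeholder.
packages:
  - package: tuva-health/the_tuva_project
    version: ["x.y.z"]
```

After editing packages.yml, `dbt deps` (step 2) downloads the package into your project.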

Q: What dbt variables control which data marts run in Tuva?

Tuva uses dbt variables in dbt_project.yml to control mart execution. There's a master switch: tuva_marts_enabled: true enables all marts at once. For granular control, individual variables exist for each mart: cms_hcc_enabled, quality_measures_enabled, readmissions_enabled, financial_pmpm_enabled, chronic_conditions_enabled, ed_classification_enabled, etc. Setting any variable to true activates that mart; false (or omitting it) skips it. This is useful because not every organization needs every mart. A commercial health plan might enable CMS-HCC and Quality Measures but skip AHRQ measures. A hospital system might focus on Readmissions and ED Classification.
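A sketch of selective mart configuration in dbt_project.yml. The variable names follow the pattern described above; confirm the exact names against the Tuva docs for your installed version:

```yaml
# dbt_project.yml sketch: enabling only the marts you need.
# Variable names follow the pattern above; confirm exact keys in the Tuva docs.
vars:
  cms_hcc_enabled: true
  quality_measures_enabled: true
  readmissions_enabled: false   # skipped: not needed for this organization
```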

Q: How would you build a custom Connector for a new data source?

Building a custom connector involves: (1) Understand the source schema: Document every table and column in your source system, with data types and sample values, (2) Map to the Input Layer: For each Input Layer table (medical_claim, pharmacy_claim, eligibility, provider), identify which source columns map to which Input Layer columns, (3) Build dbt models: Create SQL models that SELECT from your source tables and transform/rename columns to match the Input Layer schema, handling data type casting, null handling, and code standardization, (4) Handle edge cases: Source-specific quirks like proprietary codes that need mapping, date format differences, and missing fields that need default values, (5) Add tests: Write dbt tests to validate that the connector output matches Input Layer expectations, (6) Document: Record all mapping decisions for future maintenance. The connector is typically the hardest part of a Tuva implementation because it requires deep knowledge of both the source system and Tuva's expectations.
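Step 3 can be sketched as a single dbt model per Input Layer table. Everything here — the source name, source columns, and target columns — is a hypothetical example, not Tuva's exact Input Layer schema:

```sql
-- Connector model sketch (step 3): map a hypothetical legacy source table
-- onto a medical-claim Input Layer shape. All names are illustrative.
select
    cast(clm_id as varchar)         as claim_id,
    cast(mbr_id as varchar)         as patient_id,
    cast(svc_from_dt as date)       as claim_start_date,
    cast(svc_thru_dt as date)       as claim_end_date,
    upper(trim(dx_1))               as diagnosis_code_1,      -- expect ICD-10-CM
    cast(pd_amt as numeric(38, 2))  as paid_amount,
    coalesce(pos_cd, '99')          as place_of_service_code  -- default when missing
from {{ source('legacy_claims', 'claim_header') }}
```

One model like this per Input Layer table, plus tests on the outputs, is the bulk of a custom connector.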

Q: How does Tuva's architecture leverage dbt features like refs, tests, and seeds?

Tuva leverages dbt features extensively:

ref(): Every model in the 7-stage pipeline uses ref() to reference upstream models, creating a dependency graph that ensures correct execution order. This is what makes the pipeline work — the CMS-HCC mart automatically knows it depends on the Core Model, which depends on Claims Preprocessing, and so on.

Tests: Tuva includes hundreds of built-in tests across the pipeline: not_null on critical fields, unique on primary keys, accepted_values on categorical fields (e.g., gender must be 'male' or 'female'), and relationships tests that validate foreign keys between tables. These run automatically during dbt build.

Seeds: All terminology data (ICD-10 codes, HCC mappings, value sets) is loaded via dbt seed. This makes terminology version-controlled, portable across warehouses, and easy to update.

Macros: Tuva uses dbt macros for reusable SQL logic (e.g., date calculations, code validation functions) and for warehouse-specific SQL generation (handling Snowflake vs. BigQuery syntax differences).
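The tests described above are declared in standard dbt YAML. A sketch, with illustrative model and column names rather than Tuva's exact ones:

```yaml
# Sketch of the kinds of schema tests described above, in standard dbt YAML.
# Model and column names are illustrative, not Tuva's exact schema.
models:
  - name: condition
    columns:
      - name: condition_id
        tests:
          - unique
          - not_null
      - name: patient_id
        tests:
          - relationships:
              to: ref('patient')
              field: patient_id
```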

Scenario Questions

Q: A health plan wants to improve their CMS STAR rating. How would you use Tuva to identify gaps in care?

Structured approach using Tuva:

1. Enable the Quality Measures mart: Set quality_measures_enabled: true and run dbt build. This calculates HEDIS-style measures across the member population.

2. Identify measures closest to STAR thresholds: Query the summary tables to find measures where your compliance rate is just below the next STAR cut-point. A measure at 71% where 72% earns an extra star is your highest-ROI target.

3. Generate member-level care gap lists: For priority measures, pull the list of members in the denominator (eligible) who are not in the numerator (measure not met). These are your actionable care gaps.

4. Segment by provider: Cross-reference care gaps with the Provider Attribution mart to identify which medical groups have the most open gaps. Target provider education and incentives there.

5. Cross-reference with Chronic Conditions: Members with multiple chronic conditions often have more care gaps. Use the Chronic Conditions mart to prioritize outreach to the highest-risk members.

6. Track over time: Run the Quality Measures mart monthly to track gap closure rates and forecast year-end STAR performance.
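The care-gap extraction in step 3 is a denominator-minus-numerator query. A sketch, with hypothetical table names and measure id rather than Tuva's exact output schema:

```sql
-- Sketch of step 3: members eligible for a measure (denominator) who have
-- not met it (numerator). Names and measure id are illustrative.
select d.patient_id
from measure_denominator d
left join measure_numerator n
  on  n.patient_id = d.patient_id
  and n.measure_id = d.measure_id
where d.measure_id = 'breast-cancer-screening'  -- hypothetical measure id
  and n.patient_id is null                      -- gap: eligible but not met
```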

Q: Your risk adjustment team suspects undercoding. How would you use Tuva's HCC suspecting mart?

1. Enable the HCC Suspecting mart: Set the appropriate variable and run dbt build.

2. Review the suspecting list: The mart produces a list of members with conditions suggested by indirect evidence but not documented on medical claims. Key signals include:
Pharmacy-to-diagnosis gaps: Member is on insulin but has no diabetes diagnosis code
Lab-to-diagnosis gaps: Abnormal HbA1c values without a diabetes diagnosis
Historical conditions: Chronic conditions documented in prior years but missing from current claims (this overlaps with recapture)

3. Prioritize by RAF impact: Not all suspected HCCs are equal. Sort by the HCC coefficient to focus on conditions that would add the most to RAF scores. HCC 85 (CHF) has a much higher coefficient than HCC 19 (Diabetes without complication).

4. Generate provider worklists: Group suspected conditions by rendering provider. Send targeted lists to coding teams and providers for chart review and documentation improvement.

5. Quantify revenue impact: Calculate: (suspected RAF increase) x (number of affected members) x (monthly capitation rate) x 12 = annual revenue opportunity.

6. Compliance guardrail: Suspected conditions must be confirmed through legitimate chart review. The suspecting mart identifies opportunities — it does NOT justify coding without clinical documentation.

Q: You need to build a financial dashboard showing PMPM trends. What Tuva tables would you use?

Primary data source: The Financial PMPM mart's output tables — pmpm_prep and pmpm_payer_plan.

Dashboard components:
Total PMPM trend line: Monthly total PMPM from pmpm_payer_plan, showing medical + pharmacy combined. Add year-over-year comparison.
Service category breakdown: Stacked bar/area chart showing inpatient, outpatient, professional, ED, and pharmacy PMPM components. Identifies which categories drive cost changes.
Medical vs. pharmacy split: Two trend lines showing how the mix between medical and pharmacy spend is shifting.
Top cost drivers: Join PMPM data with the CCSR or Chronic Conditions mart to show which conditions or clinical categories drive the most cost.
Provider-level analysis: PMPM by provider group, highlighting outliers (providers significantly above or below peers).

Supporting context from other marts: Chronic Conditions mart for disease prevalence context, CMS-HCC mart for risk-adjusted comparisons (a plan with sicker members should have higher PMPM), and eligibility data for member month calculations and population mix analysis.

Q: You're migrating from a custom analytics pipeline to Tuva. What's your approach?

Migration strategy:

Phase 1 — Assessment (2 weeks): Document the current pipeline: what data sources feed it, what transformations it applies, what outputs it produces. Map each existing output to a Tuva equivalent. Identify any custom analytics that Tuva doesn't cover (you'll need to build these as extensions).

Phase 2 — Parallel build (4-6 weeks): Install Tuva alongside the existing pipeline. Build or configure connectors for your data sources. Enable the data marts that correspond to your current outputs. Run both pipelines in parallel.

Phase 3 — Validation (2-4 weeks): Compare outputs between the old pipeline and Tuva. Discrepancies will exist — investigate each one. Common causes: different claim grouping logic, different code mapping versions, different business rules for edge cases. Document and resolve each difference. Get sign-off from analytics consumers that Tuva outputs are acceptable.

Phase 4 — Cutover (1-2 weeks): Redirect BI tools and downstream consumers to Tuva output tables. Keep the old pipeline available (read-only) for 30 days as a safety net. Decommission the old pipeline after the grace period.

Key risks: The custom pipeline likely has undocumented business rules and edge case handling. Expect the validation phase to take longer than planned. Involve domain experts (actuaries, quality analysts) in the comparison to catch subtle differences.