# Quantified Self 2.0: Build a Unified Health Data Warehouse with DuckDB and dbt

Source: Dev.to
Ever tried to compare your Oura sleep score with your Garmin body battery, only to realize you're comparing apples to... well, very differently formatted oranges? If you're a data nerd like me, you probably track everything. But the "Quantified Self" dream quickly turns into a nightmare when you're juggling JSON exports from Oura, messy CSVs from Fitbit, and timezone-conflicted data from Garmin.

In this tutorial, we are going to solve the fragmented-data problem by building a local Modern Data Stack (MDS). We'll use DuckDB as our powerhouse engine and dbt (data build tool) to transform raw, messy health metrics into a standardized Common Data Model (CDM). By the end, you'll have a production-grade data engineering pipeline running right on your laptop, ready for advanced AI analysis or visualization in Apache Superset.

## The Architecture

The goal is to move from "Raw Silos" to a "Unified Analytics Layer." We use Python to fetch and land the data, DuckDB as the storage and compute engine, and dbt to handle the heavy lifting of modeling.

```mermaid
graph TD
    subgraph Sources
        A[Oura API/JSON] --> D
        B[Fitbit CSVs] --> D
        C[Garmin Fit Files] --> D
    end
    subgraph "Data Lake (DuckDB)"
        D[raw_health_data.duckdb]
    end
    subgraph "Transformation (dbt)"
        D --> E[stg_oura]
        D --> F[stg_fitbit]
        D --> G[stg_garmin]
        E & F & G --> H[fact_daily_metrics]
        H --> I[dim_user_health]
    end
    subgraph "Visualization"
        I --> J[Apache Superset]
    end
```

## Prerequisites

Before we dive in, ensure you have the following installed:

- Python 3.9+
- DuckDB: the "SQLite for Analytics."
- dbt-duckdb: the adapter that allows dbt to talk to DuckDB.
- Apache Superset: for the shiny dashboards.

Note: While this setup is perfect for local experimentation, if you are looking for more production-ready patterns, enterprise-grade AI integrations, or advanced data architecture insights, I highly recommend checking out the deep dives over at the WellAlly Blog.

## Step 1: Ingesting Data with Python & DuckDB

First, we need to get our files into DuckDB. DuckDB is incredible because it can query JSON and CSV files directly.

```python
import duckdb

# Initialize our DuckDB database
con = duckdb.connect('health_warehouse.duckdb')

# Load Oura JSON data
# DuckDB's read_json_auto is magic ✨
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_oura AS
    SELECT * FROM read_json_auto('data/oura_export/*.json');
""")

# Load Fitbit CSV data
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_fitbit AS
    SELECT * FROM read_csv_auto('data/fitbit_export/*.csv');
""")

print("✅ Data successfully ingested into raw layer!")
```

## Step 2: Harmonizing Data with dbt

The real magic happens in dbt. We need to solve two main problems:

- Schema disparity: Oura calls it `score_sleep`, Fitbit calls it `sleep_efficiency`.
- Timezone hell: Garmin might be in UTC, while Oura is in your local "start of day" time.

### The Staging Layer (stg_oura.sql)

We create a view to rename columns and cast types correctly.

```sql
-- models/staging/stg_oura.sql
SELECT
    CAST(summary_date AS DATE) AS activity_date,
    score AS sleep_score,
    'oura' AS source_system,
    (rem_duration + deep_duration) / 3600 AS restorative_sleep_hours
FROM {{ source('raw', 'raw_oura') }}
```

### The Common Data Model (fact_daily_health.sql)

Now, we union everything into a single, clean table. This is the "Gold" layer of your warehouse.

```sql
-- models/marts/fact_daily_health.sql
{{ config(materialized='table') }}

WITH unified AS (
    SELECT activity_date, sleep_score, source_system FROM {{ ref('stg_oura') }}
    UNION ALL
    SELECT activity_date, sleep_score, source_system FROM {{ ref('stg_fitbit') }}
)

SELECT
    activity_date,
    AVG(sleep_score) AS avg_sleep_score,
    -- Simple logic to handle multi-device conflicts
    MAX(sleep_score) FILTER (WHERE source_system = 'oura') AS primary_sleep_score
FROM unified
GROUP BY 1
```

## Step 3: Handling Timezones Like a Pro

One of the biggest hurdles in Quantified Self data is the offset. If you fly from NY to London, your sleep heart rate might appear "in the future." In dbt, we use a macro or a cross-join with a dim_date table that includes UTC offsets.

```sql
-- logic to normalize to local time
SELECT
    event_timestamp_utc,
    timezone_offset,
    event_timestamp_utc + INTERVAL (timezone_offset) SECOND AS local_time
FROM {{ ref('stg_garmin_heartrate') }}
```

## Step 4: The "Official" Way to Scale

Building this locally is an amazing way to learn data engineering. However, when you start dealing with real-time biometric streams or sensitive health data at scale, you need a more robust framework. For those interested in taking these "learning in public" projects and turning them into scalable, secure production systems, the team at WellAlly has published some incredible resources. They cover everything from Pydantic data validation to vector databases for health AI.

👉 Check out more production-ready patterns at: wellally.tech/blog

## Step 5: Visualization in Apache Superset

Connect Superset to your health_warehouse.duckdb using the SQLAlchemy URI:

```
duckdb:////path/to/health_warehouse.duckdb
```

Now you can build a "Unified Health Dashboard" that finally shows you whether that late-night pizza actually affects your Oura HRV and your Garmin recovery time simultaneously.

## Conclusion

You've just built a modern data warehouse on your local machine!

- DuckDB handled the storage with lightning speed.
- dbt turned chaos into a structured Common Data Model.
- Python acted as the glue.

Where to go next:

- Try adding your Apple Health XML export (warning: it's a beast!).
- Plug an LLM into your DuckDB and ask: "Hey GPT-4, why was my recovery so low last Tuesday?"

Did you find this helpful? Drop a comment below with your favorite wearable, and don't forget to star the repo!
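For the dbt side, the dbt-duckdb adapter is configured through a standard `profiles.yml`. A minimal sketch (the profile name `health_warehouse` and the `dev` target are illustrative; the `path` points at the database file created in Step 1):

```yaml
# profiles.yml — minimal dbt-duckdb setup (names are illustrative)
health_warehouse:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: health_warehouse.duckdb
      threads: 4
```

With this in place, `dbt run` builds the staging views and the `fact_daily_health` table directly inside the same DuckDB file that Python ingested into.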
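As a bonus for Step 1: before handing the raw layer to dbt, it's worth a quick sanity check that ingestion actually produced rows. A minimal sketch, using an in-memory database and inline `VALUES` as a stand-in for the real exports so it runs anywhere:

```python
import duckdb

con = duckdb.connect()  # in-memory stand-in for health_warehouse.duckdb

# Stand-in for raw_oura; in the real pipeline this comes from read_json_auto
con.execute("""
    CREATE TABLE raw_oura AS
    SELECT * FROM (VALUES
        (DATE '2024-01-15', 82),
        (DATE '2024-01-16', 79)
    ) AS t(summary_date, score)
""")

# Quick row-count and range check on the raw layer
count, lo, hi = con.execute(
    "SELECT COUNT(*), MIN(score), MAX(score) FROM raw_oura"
).fetchone()
print(count, lo, hi)  # 2 79 82
```

A check like this catches glob patterns that silently matched zero files before you spend time debugging dbt models downstream.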
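The staging layer's renaming step boils down to a per-source column map, which is also handy when pre-cleaning data in Python before it ever reaches dbt. A sketch of the idea (the vendor column names here are illustrative, mirroring the schema-disparity examples above):

```python
# Per-source mapping from vendor column names to the Common Data Model.
COLUMN_MAP = {
    'oura':   {'summary_date': 'activity_date', 'score': 'sleep_score'},
    'fitbit': {'date': 'activity_date', 'sleep_efficiency': 'sleep_score'},
}

def harmonize(record: dict, source_system: str) -> dict:
    """Rename vendor-specific keys to CDM names and tag the source,
    dropping any columns the model doesn't know about."""
    mapping = COLUMN_MAP[source_system]
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    out['source_system'] = source_system
    return out

raw = {'date': '2024-01-15', 'sleep_efficiency': 88, 'steps': 10432}
print(harmonize(raw, 'fitbit'))
# {'activity_date': '2024-01-15', 'sleep_score': 88, 'source_system': 'fitbit'}
```

Keeping the map as data (rather than per-source `if` branches) means adding Garmin later is a one-line change.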
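The multi-device conflict rule in `fact_daily_health.sql` (average across all sources, but keep Oura as the primary score) is easy to sanity-check in plain Python. A sketch with made-up readings:

```python
from collections import defaultdict

def resolve_daily_scores(rows):
    """rows: (activity_date, sleep_score, source_system) tuples.
    Mirrors the mart logic: AVG(sleep_score) over all sources, plus
    the Oura score (if present) as primary_sleep_score."""
    by_day = defaultdict(list)
    for activity_date, sleep_score, source_system in rows:
        by_day[activity_date].append((sleep_score, source_system))

    result = {}
    for day, readings in by_day.items():
        scores = [s for s, _ in readings]
        oura = [s for s, src in readings if src == 'oura']
        result[day] = {
            'avg_sleep_score': sum(scores) / len(scores),
            'primary_sleep_score': max(oura) if oura else None,
        }
    return result

rows = [
    ('2024-01-15', 82, 'oura'),
    ('2024-01-15', 76, 'fitbit'),
    ('2024-01-16', 70, 'fitbit'),
]
print(resolve_daily_scores(rows)['2024-01-15'])
# {'avg_sleep_score': 79.0, 'primary_sleep_score': 82}
```

Note the second day resolves to `primary_sleep_score: None`, just as the SQL `MAX(...) FILTER` returns NULL when no Oura row exists.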
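Finally, the Step 3 offset arithmetic can be mirrored in plain Python when you normalize timestamps at ingestion time instead of in dbt. A minimal standard-library sketch (the function name and sample values are illustrative; the offset is in seconds, matching the `timezone_offset` column):

```python
from datetime import datetime, timedelta, timezone

def to_local_time(event_timestamp_utc: datetime, timezone_offset: int) -> datetime:
    """Shift a UTC timestamp by an offset in seconds, mirroring
    event_timestamp_utc + INTERVAL (timezone_offset) SECOND."""
    return event_timestamp_utc + timedelta(seconds=timezone_offset)

# Example: a heart-rate sample recorded at 23:30 UTC while in New York (UTC-5)
utc_sample = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)
local = to_local_time(utc_sample, -5 * 3600)
print(local)  # 2024-01-15 18:30:00+00:00 — wall clock shifted to local time
```

Like the SQL version, this shifts the wall-clock value rather than attaching a real tzinfo, which is exactly what you want for "what time did my body think it was" analysis.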