Tools: Scuderia Data Ep.3

🚛 Episode 3 — Fuel Logistics (Azure Data Factory)

"The best fuel in the world is useless if it never reaches the car."

Your fuel tank (ADLS Gen2) is ready and waiting. But raw data doesn't teleport itself from SAP, Salesforce, IoT devices, or REST APIs into your lake. You need a logistics system — and that's Azure Data Factory (ADF).

🔄 What ADF Does (and Doesn't Do)

ADF is the fuel truck fleet of your data platform. It moves data. It doesn't transform it deeply (that's Spark's job), but it knows every road, every connection type, and every schedule. Think of ADF as the logistics manager, not the engineer: it coordinates movement, while Databricks does the heavy manufacturing.

🧱 ADF Core Concepts

Linked Services — The Fuel Truck Models

A Linked Service is a connection definition — it tells ADF how to connect to a system. Each source or destination system needs one.

```json
{
  "name": "ls_adls_scuderia",
  "type": "AzureBlobFS",
  "typeProperties": {
    "url": "https://scuderiadatastorage.dfs.core.windows.net",
    "accountKey": {
      "type": "AzureKeyVaultSecret",
      "secretName": "adls-key"
    }
  }
}
```

Datasets — The Fuel Manifests

A Dataset describes the shape and location of data at a linked service. It's the cargo manifest for your fuel truck.

Pipelines — The Delivery Route

A Pipeline is a sequence of activities — Copy, Execute Notebook, Delete, Validation, and more. It's the delivery route the truck follows.

Triggers — The Dispatch Schedule

Triggers define when a pipeline runs:

- Schedule trigger: Every day at 02:00
- Tumbling window: Time-partitioned batches
- Event trigger: Fires when a file arrives in ADLS
- Manual: On-demand

📐 Ingestion Patterns

Full Load (One-Time or Periodic Snapshot)

Load everything from the source each time. Simple, but expensive at scale.

```
Source System → [Copy Activity] → ADLS /raw/entity/snapshot_date=2026-03-12/
```

Incremental Load (Watermark-Based)

Only load rows newer than the last run, using a watermark column (e.g., updated_at):

```
last_watermark = read from control table
new_data = SELECT * FROM source WHERE updated_at > last_watermark
copy new_data → ADLS
update control table with new watermark
```

Event-Driven (File Arrival)

An event trigger fires when a file lands in a watched container. Ideal for partner data feeds, SFTP drops, and IoT batches.

🔀 ADF vs Databricks for Orchestration

A common question: should I orchestrate with ADF or with Databricks Workflows? In practice, many platforms use both: ADF for ingestion orchestration, Databricks Workflows for transformation orchestration.
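To make the event-driven pattern concrete, here is a sketch of what a blob event trigger definition can look like in ADF. The trigger name, pipeline name, blob path, and subscription scope are all hypothetical, and placeholders like `<sub-id>` must be filled in for a real deployment:

```json
{
  "name": "tr_partner_file_arrival",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/blobs/partner/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/scuderiadatastorage",
      "events": ["Microsoft.Storage.BlobCreated"]
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "pl_ingest_partner_feed",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

The trigger watches the storage account for BlobCreated events matching the path filters and starts the referenced pipeline for each arriving file, so a partner SFTP drop at 03:17 gets picked up at 03:17, not at the next scheduled run.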

🏁 Pit Stop Summary

- ADF is the fuel logistics system — it moves data; it doesn't transform it
- Core components: Linked Services, Datasets, Pipelines, Triggers
- Key patterns: full load, incremental watermark, event-driven
- ADF and Databricks Workflows are complementary, not competing

Next Episode → The fuel is in the tank. Now let's meet the race car — Azure Databricks itself.
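As a closing worked example, the incremental watermark pattern can be sketched in a few lines of Python. This is a self-contained simulation, not ADF code: the control table and source system are plain in-memory structures, and the names (control_table, updated_at, "orders") are illustrative.

```python
from datetime import datetime

# Simulated control table: last successful watermark per entity.
control_table = {"orders": datetime(2026, 3, 10)}

# Simulated source system rows with an updated_at watermark column.
source_rows = [
    {"id": 1, "updated_at": datetime(2026, 3, 9)},   # already loaded last run
    {"id": 2, "updated_at": datetime(2026, 3, 11)},  # new since last run
    {"id": 3, "updated_at": datetime(2026, 3, 12)},  # new since last run
]

def incremental_load(entity: str) -> list[dict]:
    """Copy only rows newer than the stored watermark, then advance it."""
    last_watermark = control_table[entity]                      # 1. read watermark
    new_data = [r for r in source_rows
                if r["updated_at"] > last_watermark]            # 2. filter source
    # 3. (the copy of new_data to ADLS would happen here)
    if new_data:                                                # 4. advance watermark
        control_table[entity] = max(r["updated_at"] for r in new_data)
    return new_data

loaded = incremental_load("orders")
print([r["id"] for r in loaded])    # rows 2 and 3 only
print(control_table["orders"])      # watermark advanced to 2026-03-12
```

Running the function a second time returns nothing, because the watermark has moved past every source row — exactly the idempotent behavior you want from a nightly incremental pipeline.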