Data Pipelines

Get data from A to B, reliably. Batch, streaming, or both. Built to run without hand-holding.

Context

Pipeline engineering

A pipeline that runs is easy. A pipeline that runs correctly at 3am on a holiday, handles failures gracefully, and is still understandable six months later: that's engineering.

I build pipelines with proper testing, monitoring, and documentation. The kind that don't page you at midnight.

Capabilities

What I build

01

Batch Processing

Scheduled data processing for analytics, reporting, and warehouse loads. Optimized for reliability and cost efficiency.

  • ETL/ELT pipeline development
  • Incremental processing patterns
  • Data quality and validation
  • Idempotent, rerunnable jobs (sketched below)
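
A minimal sketch of the idempotent pattern, assuming a Delta table partitioned by event_date (paths and column names are illustrative): rerunning a date replaces that day's partition instead of appending duplicates.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    def load_day(run_date: str) -> None:
        # Read only the day being (re)processed. Source path is illustrative.
        events = (
            spark.read.parquet("s3://raw/events/")
            .where(F.col("event_date") == run_date)
        )
        # Overwrite just that one partition: running the job twice for the
        # same run_date produces the same result, not duplicate rows.
        (
            events.write.format("delta")
            .mode("overwrite")
            .option("replaceWhere", f"event_date = '{run_date}'")
            .save("s3://warehouse/events/")
        )
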
02

Real-Time Streaming

For when batch isn't fast enough. Sub-second latency with proper handling of late data and failures (see the sketch after this list).

  • Spark Structured Streaming
  • Kafka-based event streaming
  • Change Data Capture (CDC)
  • Event-driven architectures
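
As a sketch of what late-data handling looks like in a Kafka-to-Delta stream (broker address, topic, schema, and paths are all illustrative assumptions): a watermark bounds how long we wait for late events, and a checkpoint makes restarts safe.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read raw events from Kafka.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "orders")
        .load()
    )

    # Parse the JSON payload into columns.
    orders = raw.select(
        F.from_json(
            F.col("value").cast("string"),
            "order_id STRING, amount DOUBLE, ts TIMESTAMP",
        ).alias("o")
    ).select("o.*")

    # The watermark tells Spark how long to keep waiting for late events
    # before finalizing each one-minute window.
    revenue = (
        orders.withWatermark("ts", "10 minutes")
        .groupBy(F.window("ts", "1 minute"))
        .agg(F.sum("amount").alias("revenue"))
    )

    # The checkpoint lets the query restart after a failure without
    # reprocessing or dropping data.
    (
        revenue.writeStream.format("delta")
        .option("checkpointLocation", "s3://checkpoints/orders/")
        .outputMode("append")
        .start("s3://warehouse/orders_by_minute/")
    )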

Data Integration

Connect to any source: APIs, databases, SaaS platforms, files, message queues.
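
As one small example, an extract-and-load built on dlt from the stack below; the endpoint, primary key, and destination are placeholder assumptions:

    import dlt
    import requests

    # Incrementally merge records from a (hypothetical) REST endpoint,
    # deduplicated on the primary key.
    @dlt.resource(write_disposition="merge", primary_key="id")
    def customers():
        resp = requests.get("https://api.example.com/customers")
        resp.raise_for_status()
        yield resp.json()

    pipeline = dlt.pipeline(
        pipeline_name="crm",
        destination="duckdb",  # swap for a warehouse destination as needed
        dataset_name="raw",
    )
    print(pipeline.run(customers()))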

Orchestration

Airflow, Dagster, Databricks Workflows, or cloud-native schedulers.
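
For flavor, a minimal Airflow DAG with retries; the schedule, path, and task bodies are illustrative:

    from datetime import datetime, timedelta
    from airflow.decorators import dag, task

    @dag(
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    )
    def daily_events():
        @task
        def extract() -> str:
            # Land raw data and return its location (placeholder path).
            return "s3://raw/events/"

        @task
        def load(path: str) -> None:
            # Hand off to a batch job like the one sketched above.
            ...

        load(extract())

    daily_events()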

Monitoring & Alerting

SLA tracking, anomaly detection, and runbooks for when things go wrong.
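
The highest-value check is usually table freshness. A sketch against DuckDB, where the table name, the UTC loaded_at column, and the two-hour threshold are all assumptions:

    from datetime import datetime, timedelta, timezone
    import duckdb

    MAX_LAG = timedelta(hours=2)  # the freshness SLA; pick per table

    con = duckdb.connect("warehouse.duckdb")
    (last_load,) = con.execute(
        "SELECT max(loaded_at) FROM raw.events"
    ).fetchone()

    if last_load is None:
        raise RuntimeError("raw.events is empty")

    # Assumes loaded_at is stored as naive UTC.
    lag = datetime.now(timezone.utc) - last_load.replace(tzinfo=timezone.utc)
    if lag > MAX_LAG:
        # Wire this into the alerting channel instead of raising.
        raise RuntimeError(f"raw.events is stale: last loaded {last_load}")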

Honest Take

Do you actually need real-time?

Real-time streaming is typically 3-5x more complex and expensive to build and operate than batch. Most "real-time requirements" are actually "faster batch" requirements.

Before building streaming infrastructure, I'll help you figure out whether you truly need it, or whether hourly batch would solve the problem.

  • Fraud detection: yes, real-time matters
  • Live operational dashboards: maybe, depends on use case
  • Analytics dashboards: probably not, hourly is usually fine
  • Monthly reporting: definitely not

Stack

Technologies

Python · PySpark · SQL · dbt · Databricks · Delta Lake · Spark Streaming · Kafka · Debezium · Airflow · Dagster · Fivetran · Airbyte · dlt · DuckDB · MotherDuck · DuckLake

Need reliable data pipelines?

Let's talk about what you're trying to move and where.