Case study

Telemetry pipeline for a connected hardware manufacturer

Turning raw device logs from a global fleet of over four million units into analytics-ready tables that the business could actually plan around.

4M+ devices in fleet

5 yrs engagement

Lead role

AWS platform

Context

A connected fleet, growing fast

The client was a consumer hardware manufacturer with a connected device installed in millions of homes around the world. Each device emitted semi-structured log data, and the volume kept growing as the fleet did.

The data was rich but not usable. It sat in storage as raw files, with no consistent schema and no way for the analytics, product, and support teams to ask straightforward questions of it.

I joined the engagement early, eventually moved into the principal engineer and project lead role, and stayed with it for five years.

Problem

Logs in, nothing out

Three things had to be true at the same time. The pipeline had to scale with the fleet, which meant absorbing a doubling in volume roughly every couple of years. It had to produce stable, well-shaped tables that could feed everything from personalised recommendations to error analyses to management reports. And it had to be operable, because no one wanted to be paged on a Sunday over a single device shipping a malformed payload.

Most of the failure modes came from the data itself. Schemas drifted as firmware versions rolled out. Late-arriving and out-of-order events were normal, not exceptional. And because metric definitions changed over time, reprocessing a year of history had to be cheap enough that we would actually do it.
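To make the late-arrival problem concrete, here is a minimal sketch in plain Java rather than Spark. It is not the client's code: the `Event` fields, the `LateEventReconciler` class, and the "last received copy wins" rule are all assumptions chosen for illustration. The idea is that re-deliveries are collapsed per device and sequence number, and the output is ordered by event time instead of arrival time, so re-running the step over the same history is intended to give the same result.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and rules here are assumptions,
// not the pipeline described in this case study.
final class LateEventReconciler {

    // Hypothetical event shape. sequence is assumed to be assigned by the device,
    // eventTime is when the event happened, receivedAt is when it landed.
    record Event(String deviceId, long sequence, Instant eventTime,
                 Instant receivedAt, String payload) {}

    // Picks the "newest" copy of an event: latest receivedAt, ties broken by payload.
    private static final Comparator<Event> NEWEST =
            Comparator.comparing(Event::receivedAt).thenComparing(Event::payload);

    // Collapses re-deliveries and restores event-time order per device.
    static List<Event> reconcile(List<Event> raw) {
        Map<String, Event> latest = new LinkedHashMap<>();
        for (Event e : raw) {
            String key = e.deviceId() + "#" + e.sequence();
            // Keep only the newest copy for each (device, sequence) pair.
            latest.merge(key, e, (a, b) -> NEWEST.compare(b, a) > 0 ? b : a);
        }
        List<Event> out = new ArrayList<>(latest.values());
        // Order by when things happened, not by when they arrived.
        out.sort(Comparator.comparing(Event::deviceId)
                .thenComparing(Event::eventTime)
                .thenComparingLong(Event::sequence));
        return out;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        List<Event> raw = List.of(
                new Event("dev-1", 2, t0.plusSeconds(20), t0.plusSeconds(100), "b"),
                new Event("dev-1", 1, t0.plusSeconds(10), t0.plusSeconds(500), "a"),      // arrived late
                new Event("dev-1", 2, t0.plusSeconds(20), t0.plusSeconds(900), "b-retry")); // re-delivery
        reconcile(raw).forEach(System.out::println);
    }
}
```

In a Spark job the same rule would typically be expressed as a grouped or windowed operation over a partition of history; the plain-Java version is only meant to show the rule itself.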

Approach

Boring on purpose

We built the pipeline on Spark and AWS, with a clear separation between the raw landing layer, a normalised event layer, and the curated tables that downstream consumers actually queried. Each layer had its own contract, and changes flowed through a review process rather than landing as surprises in production.
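As a rough illustration of what a contract between the raw and normalised layers can look like, the sketch below maps drifting field names onto one canonical schema and sets aside records that do not meet it, instead of failing the run. Everything specific in it is hypothetical: the alias table, the required fields, and the `SchemaNormaliser` and `Result` names are assumptions, and it is plain Java rather than the Spark jobs the project used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only: the aliases and the contract are assumptions,
// not the client's actual schema.
final class SchemaNormaliser {

    // Hypothetical aliases: different firmware versions are assumed to have
    // used different names for the same field.
    private static final Map<String, String> ALIASES = Map.of(
            "dev_id", "device_id",
            "deviceId", "device_id",
            "ts", "event_time",
            "timestamp", "event_time",
            "err", "error_code");

    // Hypothetical contract of the normalised layer: these fields must be present.
    private static final Set<String> REQUIRED = Set.of("device_id", "event_time");

    // Records that satisfy the contract, and raw records that were set aside.
    record Result(List<Map<String, String>> normalised,
                  List<Map<String, String>> quarantined) {}

    static Result normalise(List<Map<String, String>> rawRecords) {
        List<Map<String, String>> ok = new ArrayList<>();
        List<Map<String, String>> bad = new ArrayList<>();
        for (Map<String, String> raw : rawRecords) {
            // Rename known aliases to canonical names; unknown fields pass through.
            Map<String, String> canonical = new TreeMap<>();
            for (Map.Entry<String, String> field : raw.entrySet()) {
                String name = ALIASES.getOrDefault(field.getKey(), field.getKey());
                canonical.put(name, field.getValue());
            }
            if (canonical.keySet().containsAll(REQUIRED)) {
                ok.add(canonical);
            } else {
                // A malformed payload is quarantined in its raw form
                // rather than stopping the whole run.
                bad.add(raw);
            }
        }
        return new Result(ok, bad);
    }

    public static void main(String[] args) {
        Result r = normalise(List.of(
                Map.of("dev_id", "d1", "ts", "2024-01-01T00:00:00Z"),
                Map.of("device_id", "d2", "event_time", "2024-01-01T00:00:05Z", "err", "E42"),
                Map.of("payload", "garbage")));
        System.out.println("normalised:  " + r.normalised());
        System.out.println("quarantined: " + r.quarantined());
    }
}
```

Keeping the quarantined records in their raw form means they can be replayed once the alias table or the contract is updated, which fits the goal of cheap reprocessing.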

CI/CD and automated testing were not optional. Every change shipped through GitLab pipelines, with unit and integration tests catching most of the schema and edge-case issues before they hit a real run. Monitoring went into Grafana and Instana, with alerting routed through OpsGenie so the on-call rotation always had context.

Once the foundation was stable, the work shifted toward growing the team and the system together. As principal engineer and project lead, I set the engineering standards and made sure the next set of engineers could ship without me holding the pen on every change.

Deliverables

What was shipped

ETL codebase

Spark and Java jobs transforming semi-structured logs into a layered data model, with unit tests, integration tests, and clear ownership.

Analytics-ready tables

Curated tables for personalised recommendations, error analysis, and management dashboards, documented and versioned.

CI/CD and quality gates

GitLab pipelines, automated test suites, and code review standards that made deployments routine rather than risky.

Monitoring and on-call

Grafana and Instana dashboards, OpsGenie routing, and runbooks the team could follow at three in the morning without calling for help.

Team and process

Mentored engineers, established review and release practices, and led the project as the system and team scaled together.

Operational track record

Five years of running the pipeline through fleet growth, firmware changes, and product pivots without losing trust in the numbers.

Outcome

Numbers people trust

By the time I rolled off, the pipeline was processing logs from over four million devices and feeding multiple downstream products. Personalised recommendations, error and quality reporting, and the management view of the fleet all sat on top of the same curated tables.

The bigger win was less visible. The team had moved from heroic firefighting to routine releases. New engineers could onboard in days rather than weeks. The client could ask new questions of the data and get an answer in the same sprint, not the next quarter.

Stack

Technologies

Apache Spark, Java, AWS, Maven, GitLab CI, Grafana, Instana, OpsGenie

Sitting on a pile of telemetry?

Happy to walk through what a sensible first step looks like.