Upgrading a Data Pipeline for a Fleet of 1,500+ Edge Devices
- Jul 10, 2022
The Situation:
An AI software startup operated a fleet of 1,500+ edge devices generating real-time sensor data. Over time, the data pipeline had evolved into a fragile web of SQL jobs with minimal documentation and growing performance bottlenecks.
This became critical when the engineer who built and maintained the system left the company.
At that point, the pipeline was not only slow, but also a single point of failure no one really understood.
The Challenge:
The company needed to:
- Stabilize a production system already under load
- Restore confidence in data correctness and availability
- Enable the pipeline to scale with a growing device fleet
- Do all of this without disrupting ongoing operations
This required both short-term containment and a longer-term architectural reset.
The Approach:
I worked closely with a software engineer to first make the existing system legible.
That meant:
- Tracing and documenting the full data flow end-to-end
- Identifying bottlenecks, failure modes, and unnecessary complexity
- Fixing high-impact bugs to stabilize the current pipeline
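To give a sense of what making a pipeline legible can look like, here is a minimal sketch (illustrative only, not the code from this engagement). It records which job reads from which as a small dependency graph, then answers two questions: in what order must the jobs run, and what breaks downstream if one job fails? The job names are hypothetical, and the sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each job maps to the set of jobs it reads from.
JOB_INPUTS = {
    "raw_sensor_load": set(),
    "dedupe_readings": {"raw_sensor_load"},
    "hourly_rollup": {"dedupe_readings"},
    "device_health": {"dedupe_readings"},
    "dashboard_export": {"hourly_rollup", "device_health"},
}


def execution_order(job_inputs):
    """Return the jobs in an order where every job comes after its inputs.

    Raises graphlib.CycleError if the jobs depend on each other in a loop.
    """
    return list(TopologicalSorter(job_inputs).static_order())


def downstream_of(job, job_inputs):
    """Return every job that depends on `job`, directly or indirectly."""
    affected = set()
    frontier = [job]
    while frontier:
        current = frontier.pop()
        for name, inputs in job_inputs.items():
            if current in inputs and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


if __name__ == "__main__":
    print(execution_order(JOB_INPUTS))
    print(sorted(downstream_of("dedupe_readings", JOB_INPUTS)))
```

Even a table this small doubles as documentation: it can be checked into the repository, and a cycle or an orphaned job shows up as soon as the script runs.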
In parallel, I designed and implemented a new architecture optimized for real-time ingestion and visualization at scale. The goal was not a file-by-file rewrite, but a simpler system with clear boundaries and fewer moving parts.
We transitioned from the legacy ETL setup to a pipeline built on the Elastic stack, allowing us to migrate incrementally while keeping the business running.
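One pattern that helps with an incremental migration like this is idempotent indexing: if every reading gets a deterministic document ID, the same data can be replayed from the legacy system and streamed live without creating duplicates. The sketch below is illustrative only, not the code from this engagement. It builds a request body in the newline-delimited JSON format that the Elasticsearch Bulk API accepts. The index name `sensor-readings` and the `device_id` and `timestamp` fields are assumptions made for the example.

```python
import hashlib
import json


def document_id(reading):
    """Derive a stable ID from device and timestamp (assumed field names).

    Indexing the same reading twice overwrites the document instead of
    duplicating it.
    """
    key = f"{reading['device_id']}|{reading['timestamp']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def to_bulk_body(readings, index="sensor-readings"):
    """Build a Bulk API body: one action line, then one document line, per reading."""
    lines = []
    for reading in readings:
        action = {"index": {"_index": index, "_id": document_id(reading)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(reading))
    if not lines:
        return ""
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"device_id": "edge-0001", "timestamp": "2022-07-10T12:00:00Z", "temp_c": 41.5},
        {"device_id": "edge-0002", "timestamp": "2022-07-10T12:00:00Z", "temp_c": 39.8},
    ]
    print(to_bulk_body(sample), end="")
```

The resulting string would be sent to the cluster's `_bulk` endpoint with the `application/x-ndjson` content type. Because the IDs are derived from the data itself, a backfill from the old pipeline and the new live feed can overlap safely during the cutover.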
Results:
The new system delivered immediate and measurable improvements:
- ~90% reduction in pipeline code complexity
- Real-time processing and visualization of device data
- Horizontal scalability as the fleet continued to grow
- Lower maintenance burden for the engineering team
Just as importantly, the pipeline was now understood. Clear documentation and a simpler design meant the team could diagnose issues and extend the system without relying on institutional memory.
Why this matters:
Many data pipelines fail quietly, becoming slower and harder to debug as systems scale. The real risk is not only reduced performance but also the loss of organizational understanding.
This engagement succeeded because we treated the problem as both a technical and an organizational failure, addressing root causes rather than layering fixes on top.
_________________________________________________
If your data pipeline has become a bottleneck (or a liability), I'm open to conversations where the goal is long-term stability, not temporary patches.