Unplanned vehicle downtime costs commercial fleets thousands of dollars per day per vehicle. When a truck breaks down mid-route, the cost is not limited to the repair itself. There is the tow, the missed delivery window, the rescheduled load, the idle driver, and the cascading impact on every downstream commitment. For large fleets, unplanned maintenance is one of the single largest controllable expenses.
Traditional maintenance strategies fall into two categories. Reactive maintenance waits for something to break. Calendar-based maintenance services vehicles on fixed intervals (every 15,000 miles or every 90 days) regardless of actual component condition. The first is expensive and dangerous. The second is wasteful, replacing parts that still have useful life while occasionally missing failures that develop between service windows.
AI-powered predictive maintenance offers a third option: maintain when the data says maintenance is needed. Not before, not after, but precisely when component degradation signals indicate that intervention will prevent a failure.
We built a predictive maintenance system for a commercial fleet platform that processes telemetry from thousands of vehicles in real time. The system reduced unplanned downtime by 18% and improved fuel efficiency by 12%. This post covers how it works from raw sensor data to calibrated alerts that fleet managers actually trust.
The Data Pipeline
Predictive maintenance is fundamentally a data engineering problem before it is a machine learning problem. The quality of predictions depends entirely on the quality, completeness, and timeliness of the telemetry data feeding the system.
Our pipeline ingests data from five primary sources.
Tire Pressure Monitoring Systems (TPMS). Pressure readings, temperature measurements, and inflation history for every tire on every vehicle. TPMS data arrives at high frequency, approximately every 30 seconds when the vehicle is in motion. Tire failures account for a disproportionate share of roadside breakdowns, making this one of the most valuable data streams.
Engine diagnostics. OBD-II diagnostic trouble codes, coolant temperature, oil pressure, RPM patterns, and engine load data. Unlike TPMS, engine diagnostics are primarily event-driven. Readings are transmitted when values cross thresholds or when the system polls at fixed intervals. The data is rich but irregular.
Fuel system telemetry. Consumption rates, fuel economy trends, and injector performance metrics. Fuel data is typically aggregated hourly rather than streamed in real time. Degradation in fuel efficiency is often the earliest detectable signal that something in the engine, drivetrain, or tire system is not performing correctly.
Environmental context. Ambient temperature, altitude, and road conditions inferred from GPS coordinates and accelerometer data. Environmental factors are critical for normalization. A tire pressure reading of 95 PSI means something different at sea level in July than it does at 8,000 feet in January.
Maintenance history. Past repairs, parts replaced, service intervals, and mechanic notes. Historical maintenance data provides the labels that supervised models need: it tells the system what failure looks like and when it happened.
The core engineering challenge is that these data streams arrive at different frequencies, in different formats, with different reliability characteristics. TPMS transmits every 30 seconds; OBD-II fires on events; fuel data aggregates hourly. Sensors fail, transmit erroneous readings, or drift out of calibration over time. Cellular connectivity gaps create holes in the data.
The pipeline must handle missing data through imputation, reject outliers that would corrupt downstream features, detect and correct sensor drift, and align all streams into a unified time-series representation per vehicle. We use Apache Kafka for ingestion, with stream processing that applies data quality checks before any data reaches the feature store. Approximately 4% of raw telemetry is flagged and either corrected or discarded at this stage.
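To make the quality gate concrete, here is a minimal sketch of a per-sensor plausibility check with last-known-good imputation. The sensor names, plausibility bounds, and function name are illustrative, not our production values, and the real pipeline runs these checks as stream processors rather than on lists:

```python
# Sketch of a per-sensor quality gate: reject implausible readings and
# impute gaps with the last known good value. Bounds are illustrative.
PLAUSIBLE_RANGES = {
    "tire_pressure_psi": (20.0, 130.0),
    "coolant_temp_c": (-40.0, 150.0),
}

def clean_stream(sensor, readings):
    """Replace out-of-range or missing readings with the last good value.

    `readings` is a list of floats or None (missing). Returns the cleaned
    list plus the fraction of readings that were flagged.
    """
    lo, hi = PLAUSIBLE_RANGES[sensor]
    cleaned, flagged, last_good = [], 0, None
    for value in readings:
        if value is not None and lo <= value <= hi:
            last_good = value
        else:
            flagged += 1
            value = last_good  # forward-fill; stays None until first good reading
        cleaned.append(value)
    return cleaned, flagged / len(readings)
```

The flagged fraction is what feeds the "approximately 4% of raw telemetry" figure above: tracking it per sensor family makes quality regressions visible immediately.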
Feature Engineering: Where Predictive Power Lives
Raw sensor readings are noisy and, in isolation, largely uninformative. A single tire pressure measurement tells you almost nothing about whether that tire will fail next week. The predictive power lives in features that capture change over time and relationships between sensors.
Our feature engineering pipeline produces several categories of derived features.
Rolling statistics. For each sensor on each vehicle, we compute 7-day and 30-day moving averages, variance, rate of change, and trend direction. A tire that has been losing 0.3 PSI per day for the past two weeks is behaving differently from one that dropped 2 PSI overnight. Both are below threshold, but the failure modes and the urgency are entirely different.
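A sketch of how the rolling features distinguish those two failure modes, using pandas on a toy pressure series (the window lengths and column names are illustrative):

```python
import pandas as pd

# Toy daily tire-pressure series: a slow leak of roughly 0.3 PSI per day.
pressure = pd.Series([100.0 - 0.3 * day for day in range(30)])

features = pd.DataFrame({
    "mean_7d": pressure.rolling(7).mean(),   # short-window level
    "std_7d": pressure.rolling(7).std(),     # short-window variability
    "rate_7d": pressure.diff(7) / 7,         # PSI lost per day over the past week
})

# The 7-day rate of change isolates the slow leak (~ -0.3 PSI/day); an
# overnight 2 PSI drop would instead show up as a spike in std_7d.
print(features["rate_7d"].iloc[-1])
```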
Cross-sensor correlations. Individual sensors tell part of the story. Combinations tell the rest. A tire pressure drop coinciding with a temperature spike suggests a potential blowout risk. Rising coolant temperature paired with increasing fuel consumption may indicate a cooling system failure that is forcing the engine to work harder. We compute pairwise correlation features across all sensor families, and these cross-sensor features consistently rank among the most important in our models.
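One way to sketch such a pairwise feature is a trailing-window Pearson correlation between two streams; the function name and window length are illustrative:

```python
import numpy as np

def windowed_correlation(x, y, window=12):
    """Pearson correlation between two sensor streams over the trailing window.

    A strongly negative pressure/temperature correlation (pressure falling
    while temperature rises) is one of the blowout-risk patterns described above.
    """
    x_w = np.asarray(x[-window:], dtype=float)
    y_w = np.asarray(y[-window:], dtype=float)
    return float(np.corrcoef(x_w, y_w)[0, 1])
```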
Usage pattern features. How a vehicle is driven affects how quickly components degrade. We extract hard braking frequency, rapid acceleration events, idle time percentage, highway-versus-city driving ratio, and average load weight where available. Two identical trucks with identical mileage can have vastly different maintenance needs if one runs highway routes and the other operates in stop-and-go urban delivery.
Degradation curves. For each component category, we model how sensor readings typically change over time relative to the last maintenance event. A healthy engine shows a predictable oil pressure curve between oil changes. A degrading engine deviates from that curve. The deviation magnitude and acceleration are strong predictive features.
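A simplified sketch of the deviation feature, approximating the typical post-service curve as a straight-line fit to healthy historical data (production curves can be more flexible; the function name is illustrative):

```python
import numpy as np

def degradation_deviation(days_since_service, readings,
                          healthy_days, healthy_readings):
    """Deviation of a vehicle's readings from the typical post-service curve.

    The 'typical' curve is a linear fit to healthy historical data; the
    residual (observed minus expected) is the predictive feature.
    """
    slope, intercept = np.polyfit(healthy_days, healthy_readings, deg=1)
    expected = slope * np.asarray(days_since_service, dtype=float) + intercept
    return np.asarray(readings, dtype=float) - expected
```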
Contextual normalization. Raw features are normalized against environmental conditions and vehicle specifications. Tire pressure behavior at altitude is different from behavior at sea level. A fuel efficiency drop after a route change is normal; the same drop on an unchanged route is a signal.
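As a concrete, simplified example of temperature normalization: gauge pressure can be adjusted to a reference ambient temperature via the ideal gas law on absolute pressure and absolute temperature. The sea-level atmospheric constant and function name below are illustrative, and a production version would also correct the atmospheric term for altitude:

```python
def normalize_tire_pressure(gauge_psi, ambient_c, ref_c=20.0, atm_psi=14.7):
    """Adjust a gauge pressure reading to a reference ambient temperature.

    Converts gauge to absolute pressure, scales by the ratio of absolute
    temperatures, and converts back. atm_psi is a sea-level approximation.
    """
    t_amb = ambient_c + 273.15   # absolute ambient temperature (K)
    t_ref = ref_c + 273.15       # absolute reference temperature (K)
    absolute = gauge_psi + atm_psi
    return absolute * (t_ref / t_amb) - atm_psi
```

A reading taken on a cold morning normalizes upward, so the model compares like with like instead of flagging every winter day as a slow leak.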
The key insight from our work is that teams often invest heavily in model architecture while underinvesting in feature engineering. In our experiments, improving the feature set consistently produced larger accuracy gains than swapping model architectures. The best model with mediocre features underperforms a simpler model with well-engineered features.
Model Architecture
The predictive maintenance system uses a multi-model architecture rather than a single monolithic model. Different failure modes require different modeling approaches.
Anomaly detection for unknown failure modes. Not all failures match known patterns. Isolation forests and autoencoders detect unusual telemetry patterns that deviate from a vehicle’s established baseline without requiring labeled examples of every possible failure type. The autoencoders learn a compressed representation of “normal” behavior for each vehicle class. When reconstruction error exceeds a dynamic threshold, the system flags the anomaly for review. This catches novel failure modes that supervised models, trained only on historical failures, would miss entirely.
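A minimal sketch of the isolation-forest half of this approach, using scikit-learn. The sensor values, contamination rate, and the two-feature baseline are illustrative; the production system fits per-vehicle-class baselines over the full feature set:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# "Normal" telemetry baseline: tire pressure ~95 PSI, coolant ~90 C.
normal = rng.normal(loc=[95.0, 90.0], scale=[1.5, 2.0], size=(500, 2))

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# A reading far outside the baseline: pressure collapse plus temperature spike.
suspect = np.array([[70.0, 115.0]])
print(detector.predict(suspect))  # [-1] means flagged as anomalous
```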
Remaining useful life (RUL) prediction. For known component types (tires, brakes, batteries, filters, belts), gradient-boosted models estimate the number of days until maintenance is needed. These models are trained on historical telemetry leading up to confirmed maintenance events and component replacements. XGBoost and LightGBM have consistently outperformed deep learning alternatives in our fleet context, likely because the feature engineering captures the relevant temporal patterns and the tabular structure of the data suits tree-based methods.
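A toy sketch of the RUL setup, using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost/LightGBM. The synthetic features, coefficients, and target are invented purely to show the shape of the problem: engineered features in, days-until-maintenance out:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
# Synthetic training data: features are (tread_wear_rate, 7d_pressure_trend);
# the target is days until the tire needed replacement, from maintenance history.
X = rng.uniform([0.0, -1.0], [1.0, 0.0], size=(400, 2))
y = 120.0 - 80.0 * X[:, 0] + 30.0 * X[:, 1] + rng.normal(0, 3.0, 400)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# A fast-wearing tire with a steep pressure decline should get a short RUL.
fast_wear = model.predict([[0.9, -0.8]])[0]
slow_wear = model.predict([[0.1, -0.1]])[0]
print(fast_wear, slow_wear)
```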
Handling class imbalance. This is one of the most critical and most underappreciated challenges in predictive maintenance. Actual component failures are rare events, typically 1% to 3% of the dataset. A naive classifier that predicts “no failure” for every observation achieves 97% accuracy or better and is completely useless. We address class imbalance through a combination of SMOTE (Synthetic Minority Over-sampling Technique) for training data augmentation, cost-sensitive learning that penalizes missed failures far more heavily than false alarms, and ensemble methods that aggregate predictions across multiple resampled training sets. The goal is not to maximize overall accuracy but to maximize recall on the failure class while keeping false positives within an acceptable range.
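To illustrate the oversampling idea, here is a simplified SMOTE-style sketch in numpy. Real SMOTE interpolates toward each minority point's k-nearest neighbors; this sketch pairs minority samples at random, which is enough to show the mechanism:

```python
import numpy as np

def oversample_minority(X, y, minority_label=1, seed=0):
    """Balance classes by synthesizing minority points as interpolations
    between random pairs of existing minority samples.
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    minority = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(minority))
    a = minority[rng.integers(len(minority), size=n_needed)]
    b = minority[rng.integers(len(minority), size=n_needed)]
    # Each synthetic point lies somewhere on the segment between a pair.
    synthetic = a + rng.uniform(size=(n_needed, 1)) * (b - a)
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_out, y_out
```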
Alert calibration. The threshold between a useful alert and a false positive that erodes fleet manager trust is the most operationally important parameter in the entire system. We calibrate alert thresholds per fleet based on their expressed tolerance for false positives. A fleet with an in-house maintenance shop and spare vehicles can tolerate more alerts because acting on a false positive costs relatively little. A long-haul fleet with tight schedules and no spare capacity needs higher confidence before an alert is worth acting on. This calibration is not a one-time setting; it is continuously adjusted based on feedback.
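One simple way to express per-fleet calibration is to pick, on validation data, the lowest alert threshold whose precision meets the fleet's stated tolerance. The function below is an illustrative sketch of that selection rule, not our production calibrator:

```python
import numpy as np

def calibrate_threshold(scores, labels, min_precision=0.6):
    """Pick the lowest alert threshold whose validation precision meets
    the fleet's tolerance.

    A low min_precision suits fleets where false alarms are cheap; a
    tight-schedule fleet would pass a higher value. Returns None if no
    threshold meets the tolerance.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    for t in np.sort(np.unique(scores)):
        alerts = scores >= t
        if alerts.any():
            precision = labels[alerts].mean()  # fraction of alerts that were real
            if precision >= min_precision:
                return float(t)
    return None
```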
MLOps: Keeping Models Accurate Over Time
A predictive maintenance model that is trained once and deployed forever will degrade. Fleet operations are not static. Vehicles are added and retired. Routes change seasonally. New vehicle models with different sensor characteristics join the fleet. Components from different manufacturers age differently. The data distribution shifts continuously.
Weekly retraining. Models retrain on a weekly cadence as new telemetry and confirmed maintenance outcomes arrive. The retraining pipeline validates that the new model outperforms the current production model on a holdout set before promotion. If the new model does not improve, the existing model remains in production and the training data is inspected for quality issues.
Distribution shift detection. We monitor for changes in the statistical distribution of incoming telemetry features. When a fleet adds a batch of new vehicles from a different manufacturer, the sensor characteristics shift. When routes change seasonally, driving patterns shift. Automated drift detection flags these shifts so that models can be retrained or recalibrated proactively rather than after prediction quality has already degraded.
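One common drift statistic, shown here as a sketch, is the population stability index (PSI) between a feature's training-time distribution and recent live telemetry. The rule-of-thumb cutoffs in the docstring are conventional, not values specific to our system:

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a baseline feature distribution and live telemetry.

    Rule of thumb: PSI < 0.1 stable, 0.1 to 0.25 moderate shift,
    > 0.25 significant shift worth a retrain or recalibration.
    """
    # Quantile-based bin edges from the baseline distribution.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    eps = 1e-6  # avoid log(0) for empty bins
    p = np.histogram(baseline, edges)[0] / len(baseline) + eps
    # Clip live data into the baseline range so every value lands in a bin.
    clipped = np.clip(current, edges[0], edges[-1])
    q = np.histogram(clipped, edges)[0] / len(current) + eps
    return float(np.sum((p - q) * np.log(p / q)))
```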
A/B testing for model promotion. New model versions are never promoted blindly. We run shadow deployments where the candidate model generates predictions alongside the production model without surfacing those predictions to fleet managers. After sufficient volume accumulates, we compare precision, recall, and calibration metrics between the two versions. Only models that demonstrate statistically significant improvement are promoted.
Mechanic feedback loops. When the system generates a maintenance alert, the outcome is tracked. Did the mechanic confirm the predicted issue? Was a different issue found? Was the vehicle inspected and found healthy? This feedback is the most valuable training signal in the entire pipeline. It closes the gap between what the model predicts and what actually happens in the shop. Over time, these feedback labels dramatically improve model accuracy for fleet-specific failure patterns.
Per-segment monitoring. Aggregate accuracy metrics hide problems. A model might perform well overall while failing badly for a specific vehicle type or component category. We track prediction accuracy per vehicle type, per component, per fleet, and per route category. Degradation in any segment triggers investigation and targeted retraining.
Results and Lessons Learned
The system has been running in production for over a year. The measurable outcomes are significant.
18% reduction in unplanned downtime. Vehicles that previously broke down mid-route are now flagged for maintenance before the failure occurs. The remaining unplanned downtime is primarily caused by failure modes not yet represented in the training data: road hazard damage, electrical failures with no telemetry precursor, and sudden catastrophic events.
12% improvement in fuel efficiency. This result combines two factors: predictive maintenance ensures that vehicles operate at peak mechanical efficiency, and the telemetry analysis identified systematic fuel waste patterns (underinflated tires, clogged air filters running past their effective life, and misaligned wheels) that calendar-based maintenance was not catching.
Maintenance cost reduction through predictive scheduling. Calendar-based maintenance replaces components on a fixed schedule regardless of actual condition. Predictive scheduling extends the life of components that are still healthy while catching degradation that would have been missed between fixed intervals. The net effect is fewer unnecessary part replacements and fewer emergency repairs.
The technical lessons are as instructive as the results. Data quality matters more than model complexity. We spent more engineering time on the ingestion pipeline, data validation, and feature engineering than on model architecture, and that allocation proved correct. Fleet-specific calibration is essential because no universal alert threshold works across fleets with different vehicles, routes, and operational constraints. Mechanic feedback loops close the prediction-reality gap faster than any amount of additional training data. And the hardest problem is not building a model that predicts failures; it is building a system that fleet managers trust enough to act on.
Looking Ahead
Predictive maintenance is one of the most tangible applications of AI in fleet operations. The return on investment is directly measurable in reduced downtime, lower repair costs, and improved vehicle utilization. The data already exists in most modern commercial fleets: TPMS, OBD-II, and fuel monitoring are standard equipment. The impact is immediate and compounds over time as feedback loops improve model accuracy.
The hard part is not the model. It is building a data pipeline that handles the messiness of real-world telemetry, engineering features that capture meaningful degradation patterns, managing the severe class imbalance inherent in failure prediction, and calibrating alerts so that the people who receive them trust the system enough to act. Get those right, and the models are the straightforward part.