A Two-Stage Random Forest Ensemble for Condition-Based Fault Detection and Root-Cause Diagnosis in Automobile Engines, with a Deployed Diagnostic Dashboard

Nsikak Umurie

Abstract.

Vehicle owners and mechanics can readily connect to OBD-II data streams, but turning raw OBD-II data into an actual engine fault diagnosis usually involves a great deal of experience and guesswork. This paper extends an earlier decision-tree-based binary classifier for automobile engine condition detection [9] into an ensemble and gradient-boosting framework evaluated on two datasets, with explicit attention to which input features are realistically obtainable from an OBD-II scanner. On the original project's primary dataset (19,535 vehicle-engine records, six sensor parameters, binary good/bad label, all six mapped to standard or common OBD-II/ECU parameters in this paper), a Gradient Boosting Classifier with a decision threshold tuned on a held-out validation split, never on the test set, raises macro-F1 from 0.607 (the original decision tree) to 0.632 and nearly doubles fault-class recall relative to a standard Random Forest (34.5% to 63.8%), while confirming Engine RPM as the dominant predictor identified in the original study. On EngineFaultDB, a peer-reviewed public dataset of 55,999 real spark-ignition engine readings across four condition classes, a two-stage Random Forest pipeline (detect, then diagnose) achieves 100.0% accuracy on binary condition detection and 64.8% accuracy on fault-subtype identification, with one-vs-rest ROC-AUC exceeding 0.99 for two of the four condition classes (normal and rich mixture). Confusion-matrix analysis shows that the lean-mixture and low-ignition-voltage faults are largely indistinguishable from one another using their steady-state sensor characteristics, which points to an immediate direction for future sensor development. The two trained models are packaged as a single browser-only dashboard with two selectable modes, publicly available at https://engine-dashboard.netlify.app/. The system requires no installation and, once the page has loaded, no internet connection; from entered OBD readings it gives an engineer an immediate engine-condition assessment, a likely root cause, and a recommended maintenance inspection checklist.

Index Terms. predictive maintenance, OBD-II, automobile engine fault diagnosis, Random Forest, gradient boosting, ensemble learning, condition-based monitoring, diagnostic dashboard, root-cause analysis.

I. Introduction

On-board diagnostic (OBD-II) scanners give vehicle owners and technicians access to a wide range of live engine sensor readings: manifold pressure, throttle position, exhaust gas composition, engine speed, and more. In practice, however, reading this data is only the first step: converting a set of numeric sensor values into a specific, actionable diagnosis still typically depends on technician experience, frequently supplemented by manual, sequential component testing. Michailidis et al. [2] survey the growing body of OBD-II-based machine learning applications and note that most deployed systems still stop at anomaly flagging rather than root-cause attribution, leaving the diagnostic burden on the technician. Rule-based and decision-tree expert systems have been proposed to close this gap [3], and ensemble methods such as Random Forest have shown strong results on related vehicle-subsystem fault diagnosis tasks [4], [5], but, to the best of our knowledge, combining an ensemble classifier with an explicit two-stage detect-then-diagnose architecture and a publicly deployed, zero-install diagnostic interface has not been reported for OBD-derived engine data specifically.

The first phase of this project addressed the detection half of this problem: a decision-tree classifier trained on aggregated “good” and “bad” engine readings predicted whether a given set of readings indicated a healthy or faulty engine. This paper extends that work along four axes, corresponding to its four contributions:

An advanced ensemble model, Random Forest [6], is benchmarked against the original decision tree method and a Naive Bayes baseline, and is shown to match or exceed the decision tree's accuracy while adding the interpretability (feature importance) needed for root-cause explanation.
The diagnostic problem is restated as a two-stage pipeline that detects a fault and then returns the most likely fault class, addressing the "why" question a binary classifier cannot, while examining and reporting the resulting difficulty asymmetry between the two stages.
The trained models are then packaged into a complete, ready-to-use, publicly accessible, browser-based diagnostic dashboard (Section VI), taking the contribution from a result recorded on paper to a tool that an engineer can readily deploy.
A gradient-boosted refinement, with a decision threshold set using only a separate validation subset, is then applied specifically to the harder and more realistic six-parameter problem of the primary dataset (Section V-G), removing the caveat found in a previous version of this paper and boosting fault-class recall without using any information from the test set. Section III-D identifies the subset of input features that would typically be accessible via a generic OBD-II reader.

This paper is a direct continuation of an earlier undergraduate dissertation, “Development of a Decision Tree-Based Predictive Model for Condition-Based Fault Detection in Automobile Engines” [9], which identified that conventional fault-identification approaches “rely on residual knowledge, which can be outdated and prone to mistakes,” and set out to build a decision-tree model able to classify engine condition as good or bad and to surface the specific parameter(s) responsible for a bad reading. Its own recommendations for future work explicitly called for testing “other methods… like Discriminant Analysis, Support Vector Machines, Naive Bayes Classifiers, Logistic Regression, nearest neighbours, and neural network classifiers” to improve on the decision tree's accuracy, and for pairing the model with OBD-II scanners in practice. This paper addresses both recommendations: Section V-G (Table VIII) re-benchmarks ensemble and gradient-boosting algorithms against the baseline decision tree on the same training and test data; Section III-D, together with Tables I and II, reports the OBD-II/ECU availability of each feature used; and Section VI describes the current OBD-facing deployment, including a mode tailored to the feature set available in a field setting.

II. Related Work

A. OBD-II-based diagnostics and machine learning

Michailidis, Panagiotopoulou, and Papadakis [2] survey OBD-II-based machine learning across sustainability, efficiency, security, and safety applications. For condition-based fault diagnosis, they find that traditional tabular models (trees, ensembles, Bayesian methods) dominate because they are more interpretable and remain effective, whereas deep learning methods trade away interpretability for small accuracy gains on already saturated detection problems.

B. Decision-tree and knowledge-based diagnostic systems

The use of a tree structure for fault diagnosis is supported by Pérez-Vázquez, Anzures-García, and Sánchez-Gálvez [3], who introduce a decision-tree-based diagnostic method coupled with a symbolic knowledge base for vehicle engine faults. Their approach exploits the explicit, traceable decisions that decision-tree algorithms provide, a property from which this paper's dashboard explanation module similarly benefits.

C. Ensemble methods for vehicle subsystem diagnosis

Quan, Zhang, and Feng [4] apply a genetic-algorithm-optimized Random Forest to remote fault diagnosis of fuel-cell vehicle powertrains, reporting strong classification performance across eight fault types from CAN-bus telemetry, and Hossain, Rahman, and Ramasamy [5] review the broader landscape of AI-driven vehicle fault diagnosis, concluding that ensemble tree methods offer the best accuracy-to-interpretability trade-off for maintenance-facing deployments, as opposed to research-facing benchmarking exercises. This paper's model choice (Section IV-C) is directly motivated by that conclusion.

D. Foundational classifier methodology

The classifiers benchmarked in this paper are grounded in established methodology: CART decision trees [7], Random Forest bagging with per-split feature subsampling [6], and Gaussian Naive Bayes with continuous-feature density estimation [8]. Full mathematical formulations are given in Section IV.

E. Predictive maintenance methodology

Carvalho et al. [10] systematically review machine learning methods applied to predictive maintenance across industries and find decision trees and random forests among the most frequently applied and best-performing families for tabular sensor data, while cautioning that reported accuracy is highly sensitive to class balance. That caution is directly relevant to Section V-G, where class imbalance materially affects fault-class recall on the primary dataset. The foundations of machinery diagnostics and prognostics within condition-based maintenance (CBM) were laid out by Jardine, Lin, and Banjevic [11]; the three classic sub-problems they describe (fault detection, isolation, and severity estimation) form the basis of this paper's two-stage approach (Section IV-A).

F. Positioning relative to EngineFaultDB and the original thesis

This work uses EngineFaultDB [1], the first publicly available, peer-reviewed dataset developed specifically for automotive engine fault classification. The initial EngineFaultDB paper presented baseline accuracy for logistic regression, decision tree, random forest, SVM, k-nearest neighbours, and a feed-forward neural network classifier on the 4-class problem presented herein, producing a dataset-level reference performance benchmark. It also uses the original project's primary dataset [9], [12] (19,535 engine sensor readings labelled good/bad) on which the original thesis [9] itself compared twenty MATLAB Classification Learner configurations (Table VII) and found that ensemble methods (Boosted Trees, 66.5%) already outperformed every single decision tree variant tested (best: 65.7%), a finding this paper independently confirms and extends with a full precision/recall/F1/ROC evaluation (Section V-G). This paper builds on both foundations but differs in aim and contribution: rather than benchmarking classifiers in isolation, it (a) reframes the EngineFaultDB task as a two-stage detect-then-diagnose pipeline that mirrors real inspection workflow and exposes a difficulty asymmetry the original single-model benchmarks did not surface, (b) applies class-balanced ensemble learning and a validation-tuned gradient boosting classifier to the primary dataset to address the class-imbalance limitation implicit in the original thesis's per-class results, and (c) translates the resulting models into a publicly deployed, dependency-free diagnostic tool with an explicit accounting of which inputs are realistically available from a generic OBD-II scanner (Sections III-D and VI).

III. Dataset and Preprocessing

A. Primary dataset

The original thesis [9] used a secondary (i.e., pre-existing, third-party) automotive engine health dataset that is publicly available on Kaggle [12]. It contains 19,535 records, each with six continuous engine parameters and a binary Engine Condition label: GOOD (coded 1) or BAD (coded 0). This paper uses the same dataset. Its structure is summarized in Table I. There are 12,317 GOOD (63.1%) and 7,218 BAD (36.9%) readings, a moderate class imbalance that, as Section V-G demonstrates, significantly degrades fault-class detection unless it is addressed.

TABLE I. Primary dataset feature set (n = 19,535), with OBD-II/ECU availability

Feature	Symbol	Description	OBD-II / ECU availability
Engine rpm	R	Engine speed, revolutions per minute	Standard PID 0x0C (SAE J1979)
Fuel pressure	FP	Fuel delivery line pressure	Standard PID 0x0A (SAE J1979, gauge pressure)
Lub oil pressure	LOP	Lubricant oil pressure	Not a standard PID; manufacturer-specific PID or aftermarket sender
lub oil temp	LOT	Lubricant oil temperature	Standard PID 0x5C (SAE J1979), but not supported by every ECU
Coolant temp	CT	Engine coolant temperature	Standard PID 0x05 (SAE J1979)
Coolant pressure	CP	Engine coolant system pressure	Not a standard PID; aftermarket sender
Engine Condition (label)	EC	1 = Good, 0 = Bad	Target label, not a sensor reading

Four of these six inputs (engine rpm, fuel pressure, oil temperature, and coolant temperature) correspond to standard PIDs that a generic OBD-II scanner can request, although ECU support for oil temperature varies. The other two (oil pressure and coolant pressure) are not standard PIDs; they are exposed only through manufacturer-specific PIDs or aftermarket senders and may not be available on all vehicles. Real-world deployment implications of this availability pattern are discussed in Section III-D.

Pre-processing followed the methodology of the original thesis: rows with missing data were dropped, which leaves the inter-variable relationships of the remaining rows undistorted; statistical outliers were removed; and no smoothing (moving-average or spline) was applied because the dataset contained no material noise. This paper reuses that cleaned dataset unchanged, so the results in Section V-G are directly comparable to those reported in the original thesis.

B. Secondary dataset: EngineFaultDB

To assess the modelling approach on an independent, published dataset, we use EngineFaultDB [1]. EngineFaultDB consists of readings taken on a C14NE spark-ignition engine in a controlled laboratory setting, with exhaust gas composition captured using an NGA 6000 gas analyzer. A National Instruments USB-6008 data acquisition device recorded the data, which comprise 55,999 individual readings of 14 continuous variables (Table II). Every reading belongs to one of four condition classes (Table III), and no values were missing.

TABLE II. Feature set used for classification (EngineFaultDB), with OBD-II/ECU availability

Feature	Unit	Description	OBD-II / ECU availability
MAP	kPa	Manifold Absolute Pressure	Standard PID 0x0B (SAE J1979)
TPS	%	Throttle Position Sensor reading	Standard PID 0x11 (SAE J1979)
Force	N	Engine torque / rotational force	Not available via OBD-II; dynamometer/bench-rig measurement
Power	kW	Rate of energy transfer	Not available via OBD-II; dynamometer/bench-rig measurement
RPM	rpm	Crankshaft revolutions per minute	Standard PID 0x0C (SAE J1979)
Consumption L/H	L/h	Fuel consumption rate	Calculated PID 0x5E on some MY2010+ vehicles; otherwise ECU-specific
Consumption L/100KM	L/100 km	Fuel efficiency over distance	Derived quantity, not a direct PID
Speed	km/h	Vehicle travel speed	Standard PID 0x0D (SAE J1979)
CO	%	Carbon monoxide in exhaust	Not available via OBD-II; requires a laboratory gas analyzer
HC	ppm	Unburnt hydrocarbons in exhaust	Not available via OBD-II; requires a laboratory gas analyzer
CO2	%	Carbon dioxide in exhaust	Not available via OBD-II; requires a laboratory gas analyzer
O2	%	Oxygen remaining in exhaust	Related standard PIDs 0x14 to 0x1B report O2-sensor voltage, not exhaust O2 percentage
Lambda	unitless	Air-fuel equivalence ratio	Derivable from wideband PID 0x34 on some MY2010+ vehicles only
AFR	unitless	Air-fuel ratio	Not a direct PID; derivable from Lambda where available

Nine of these fourteen inputs, including all four exhaust-composition variables (CO, HC, CO2, O2) and both dynamometer-derived variables (Force, Power), are not retrievable from a standard consumer-grade OBD-II scanner; they require the gas-analyzer and data-acquisition bench rig used to collect EngineFaultDB [1]. This has a direct, practical consequence for this paper's stated goal of real-world deployability: the fault-subtype model trained on EngineFaultDB (Section VI, dashboard Mode 1) is best suited to a workshop or laboratory setting with the appropriate equipment, whereas the primary dataset's six field-realistic inputs (Table I) are the basis of the dashboard's Mode 2 (Section VI-F), which is the mode this paper recommends for deployment via a generic OBD-II scanner in the field.

TABLE III. Class distribution (EngineFaultDB, n = 55,999)

Label	Condition	Samples	Share
0	Normal (good)	16,000	28.6%
1	Rich air-fuel mixture fault	10,998	19.6%
2	Lean air-fuel mixture fault	15,000	26.8%
3	Low ignition voltage fault	14,001	25.0%

C. Data splitting

The EngineFaultDB data were split 80/20 into training (n = 44,800) and held-out test (n = 11,199) sets using a fixed random seed (42) for reproducibility. Given the sample size, the random split preserved class proportions closely (test-set composition: 28.0% normal, 20.0% rich, 26.7% lean, 25.4% low-voltage, within 1 percentage point of the full-dataset proportions in Table III), so no explicit stratification was required.
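The seeded split and the class-proportion check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's own code: the function names are hypothetical, the labels are synthetic placeholders built from the class counts in Table III, and the feature matrix is random noise. NumPy's generator will not reproduce the exact rows the paper's split selected; only the split sizes and the size of the proportion drift are comparable.

```python
import numpy as np

def train_test_split_seeded(X, y, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then take the first slice as test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(test_fraction * len(y))  # floor, so 55,999 rows give 11,199 test rows
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def class_shares(y, n_classes):
    """Fraction of samples per class, used to compare a split with the full data."""
    return np.bincount(y, minlength=n_classes) / len(y)

# Synthetic stand-in: labels follow the Table III counts, features are placeholders.
counts = [16000, 10998, 15000, 14001]
y = np.repeat(np.arange(4), counts)
X = np.random.default_rng(0).normal(size=(len(y), 14))

X_tr, X_te, y_tr, y_te = train_test_split_seeded(X, y)
# Largest absolute gap between test-set and full-data class shares.
drift = float(np.abs(class_shares(y_te, 4) - class_shares(y, 4)).max())
print(len(y_tr), len(y_te), round(drift, 4))
```

With roughly 11,000 test rows, a purely random split is expected to keep each class share within a fraction of a percentage point of the full-data share, which is why the absence of explicit stratification is reasonable here.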

IV. Methodology

A. Problem reframing: a two-stage pipeline

Rather than a single flat classifier over all four classes, the diagnostic task is decomposed into two stages that mirror how a technician actually works (Fig. 1). Stage 1 answers a binary question: is the engine faulty at all? Stage 2, invoked only when Stage 1 predicts a fault, answers which of the known fault types is most likely present. This decomposition allows each stage to be modelled and evaluated on its own terms and, as Section V shows, the two stages exhibit very different levels of difficulty, a structural insight a single flat classifier would obscure.

Fig. 1. Two-stage diagnostic pipeline architecture.

B. Candidate models

Four classifiers were implemented and compared, all trained from first principles in NumPy to keep the pipeline fully transparent, auditable, and light enough to run entirely client-side in the deployed dashboard (Section VI) without a server-side inference dependency. The fourth, Gradient Boosting, was added specifically to target the primary dataset's harder, imbalanced binary problem (Section V-G) and is not used for the already near-saturated EngineFaultDB Stage 1 task.

1) Gaussian Naive Bayes [8]

For class c and feature vector x, the class-conditional likelihood assumes feature independence and Gaussian-distributed continuous features:

P(x | c) = ∏ᵢ (1 / √(2π·σ²꜀,ᵢ)) · exp( −(xᵢ − μ꜀,ᵢ)² / (2·σ²꜀,ᵢ) )

with the predicted class chosen to maximize the posterior P(c)·P(x|c) via Bayes' rule (log-space for numerical stability). This model serves as a fast probabilistic baseline that explicitly tests the (violated, as Section V shows) independence assumption.
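The log-space scoring rule above can be sketched in NumPy as follows. This is an illustrative sketch rather than the paper's implementation: the function names and the small variance floor `var_floor` are assumptions introduced here to keep the example numerically safe.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class feature means, variances, and log priors."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_floor is an assumed safeguard against zero variance, not from the paper.
    var = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, mu, var, log_prior

def predict_gaussian_nb(model, X):
    """Pick the class maximizing log P(c) + sum_i log N(x_i; mu_ci, var_ci)."""
    classes, mu, var, log_prior = model
    diff = X[:, None, :] - mu[None, :, :]              # shape (n, C, d)
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=2)
    return classes[np.argmax(log_lik + log_prior[None, :], axis=1)]

# Toy usage on two well-separated synthetic classes.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (200, 3)), rng.normal(5.0, 1.0, (200, 3))])
y_demo = np.repeat([0, 1], 200)
model = fit_gaussian_nb(X_demo, y_demo)
accuracy = float((predict_gaussian_nb(model, X_demo) == y_demo).mean())
```

Summing log-densities instead of multiplying densities avoids floating-point underflow when many features are combined.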

2) CART Decision Tree [7]

At each node, the split (feature f, threshold t) is chosen to maximize the Gini impurity reduction:

Gini(S) = 1 − Σ꜀ p꜀² , ΔGini = Gini(S) − (|Sₗ|/|S|)·Gini(Sₗ) − (|Sᵣ|/|S|)·Gini(Sᵣ)

where p꜀ is the proportion of class c in node S, and Sₗ, Sᵣ are the left/right children induced by the split. Growth stops at a maximum depth of 14, or when a node has fewer than 10 samples (min_samples_split) or a candidate split would leave a child with fewer than 5 samples (min_samples_leaf). This configuration replicates the original thesis's decision-tree method on the new data, serving as a like-for-like baseline.
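For a single feature, the split criterion above can be illustrated with the short sketch below. The helper names are hypothetical, and the exhaustive midpoint search is one simple way to enumerate candidate thresholds; the paper does not state how its implementation enumerates them.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of an integer label array."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - float(np.sum(p ** 2))

def gini_gain(x, labels, threshold):
    """Impurity reduction from splitting on x <= threshold."""
    left, right = labels[x <= threshold], labels[x > threshold]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_threshold(x, labels, min_samples_leaf=5):
    """Search midpoints between sorted unique values, honouring min_samples_leaf."""
    values = np.unique(x)
    best_t, best_gain = None, 0.0
    for t in (values[:-1] + values[1:]) / 2:
        n_left = int(np.sum(x <= t))
        if n_left < min_samples_leaf or len(x) - n_left < min_samples_leaf:
            continue  # a child would be smaller than the allowed leaf size
        gain = gini_gain(x, labels, t)
        if gain > best_gain:
            best_t, best_gain = float(t), gain
    return best_t, best_gain

# Toy usage: a feature that separates two balanced classes at 9.5.
x_demo = np.arange(20, dtype=float)
y_demo = (x_demo >= 10).astype(int)
t_star, gain_star = best_threshold(x_demo, y_demo)
```

In this toy case the parent impurity is 0.5 and both children are pure, so the best split recovers the full 0.5 as gain.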

3) Random Forest [6] (proposed model)

An ensemble of B = 40 CART trees, each trained on an independent bootstrap resample of the training set and restricted, at each split, to a random subset of ⌊√14⌋ = 3 candidate features. Final class probabilities are the mean of the per-tree leaf probability vectors:

P̂(c | x) = (1 / B) · Σᵦ Tᵦ(c | x)

with the predicted class argmax꜀ P̂(c|x). Bagging plus feature subsampling decorrelates the individual trees, reducing variance relative to a single decision tree without materially increasing bias.
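The three ingredients named above (bootstrap resampling, per-split feature subsampling, and probability averaging) can be sketched as small helpers. The function names are hypothetical, and the per-tree probabilities in the usage lines are made-up numbers standing in for real tree outputs.

```python
import math
import numpy as np

def n_split_features(n_features):
    """Number of candidate features per split: floor(sqrt(n_features))."""
    return int(math.floor(math.sqrt(n_features)))

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement for one tree."""
    return rng.integers(0, n_samples, size=n_samples)

def candidate_features(n_features, rng):
    """Random feature subset considered at a single split."""
    return rng.choice(n_features, size=n_split_features(n_features), replace=False)

def forest_predict_proba(per_tree_proba):
    """Average leaf probability vectors; input shape (B, n_samples, n_classes)."""
    return np.mean(per_tree_proba, axis=0)

# Toy usage: two trees, one sample, two classes (illustrative values only).
rng = np.random.default_rng(42)
idx = bootstrap_indices(100, rng)
feats = candidate_features(14, rng)
per_tree = np.array([[[0.8, 0.2]], [[0.4, 0.6]]])
proba = forest_predict_proba(per_tree)
predicted = proba.argmax(axis=1)
```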

4) Gradient Boosting Classifier [13] (primary-dataset refinement)

For the primary dataset's binary GOOD/BAD problem (Section V-G), an additive ensemble of M shallow CART regression trees is fitted stagewise to the residual of the binomial log-loss. Writing F₀(x) = log(p̄/(1−p̄)) for the log-odds of the training-set base rate p̄, each stage m fits a regression tree hₘ to the negative gradient (residual) rᵢ = yᵢ − σ(Fₘ₋₁(xᵢ)), where σ is the logistic sigmoid, and updates the ensemble additively:

Fₘ(x) = Fₘ₋₁(x) + ν · hₘ(x) , P̂(GOOD | x) = σ(F_M(x)) = 1 / (1 + e^(−F_M(x)))

with learning rate ν = 0.06, M = 150 trees, and each tree limited to max depth 4 and a minimum of 8 samples per leaf. Unlike Random Forest's averaging of independent trees, boosting builds trees sequentially, each one correcting the errors of the ensemble so far, which typically yields sharper decision boundaries on harder, imbalanced problems such as the primary dataset's BAD-class detection task, at the cost of requiring careful regularization (shallow trees, a small learning rate) to avoid overfitting.
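The stagewise update can be illustrated with a deliberately simplified sketch. It is an assumption of this example, not the paper's configuration, that each weak learner is a depth-1 regression stump fitted by least squares to the residuals (the paper uses depth-4 trees with a minimum leaf size); the function names and the dictionary layout of the model are likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(X, r):
    """Least-squares regression stump: returns (feature, threshold, left, right)."""
    best = None
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, f] <= t
            left, right = r[mask].mean(), r[~mask].mean()
            sse = ((r[mask] - left) ** 2).sum() + ((r[~mask] - right) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, f, t, left, right)
    return best[1:]

def stump_predict(stump, X):
    f, t, left, right = stump
    return np.where(X[:, f] <= t, left, right)

def fit_boosting(X, y, n_stages=150, lr=0.06):
    """F_0 = log-odds of the base rate; each stage fits r = y - sigmoid(F)."""
    p_bar = y.mean()
    f0 = float(np.log(p_bar / (1.0 - p_bar)))
    F = np.full(len(y), f0)
    stumps = []
    for _ in range(n_stages):
        residual = y - sigmoid(F)          # negative gradient of the log-loss
        stump = fit_stump(X, residual)
        F = F + lr * stump_predict(stump, X)
        stumps.append(stump)
    return {"f0": f0, "lr": lr, "stumps": stumps}

def predict_proba(model, X):
    F = np.full(X.shape[0], model["f0"])
    for stump in model["stumps"]:
        F = F + model["lr"] * stump_predict(stump, X)
    return sigmoid(F)

# Toy usage on synthetic data where feature 0 determines the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
y_demo = (X_demo[:, 0] > 0).astype(float)
model = fit_boosting(X_demo, y_demo, n_stages=30)
p_good = predict_proba(model, X_demo)
```

Because each stage moves the log-odds by only a fraction ν of the fitted residual, many small corrections accumulate, which is the regularizing effect of the small learning rate mentioned above.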

Because the default 0.5 probability threshold is not necessarily optimal under class imbalance, the classification threshold applied to P̂(GOOD | x) was tuned on a held-out validation split (Section IV-E) rather than fixed a priori or tuned on the test set, giving a final threshold of 0.59.

C. Model selection rationale

Random Forest was preferred over a higher-capacity alternative such as a feed-forward neural network for three reasons tied directly to this project's deployment goal, consistent with the conclusion of Hossain et al. [5] that ensemble tree methods offer the best accuracy-to-interpretability trade-off for maintenance-facing tools: (i) it outperforms a single decision tree on noisy sensor data (Section V-A) without a large increase in training cost; (ii) it exposes per-feature importance scores that map directly onto the dashboard's root-cause explanation (Section VI-D), which a neural network would not provide without additional tooling such as SHAP; and (iii) its inference cost is low enough to run entirely client-side in a browser, avoiding a server dependency for the deployed tool.

Gradient Boosting was added as a fourth, narrowly targeted model alongside Random Forest, not in place of it, because of the class imbalance in the primary dataset (Section III-A): accuracy alone hides the trade-off the Random Forest makes between BAD-class precision and recall (Section V-G). Boosted sequential error correction, combined with a validation-optimized threshold, delivered that final increment of improvement without requiring any additional features or data.

D. Hyperparameters

TABLE IV. Model hyperparameters

Model	Setting	Value
Decision Tree	max_depth / min_samples_split / min_samples_leaf	14 / 10 / 5
Random Forest (4-class benchmark)	n_estimators / max_depth / min_samples_split / min_samples_leaf / max_features	40 / 14 / 10 / 5 / √14 ≈ 3
Random Forest, Stage 1 (binary)	n_estimators / max_depth / min_samples_split / min_samples_leaf	20 / 8 / 20 / 10
Random Forest, Stage 2 (subtype)	n_estimators / max_depth / min_samples_split / min_samples_leaf	25 / 10 / 15 / 8
Gradient Boosting (primary dataset)	n_estimators / max_depth / learning_rate / min_samples_split / min_samples_leaf / threshold	150 / 4 / 0.06 / 15 / 8 / 0.59 (validation-tuned)
All models	Random seed / train-test split	42 / 80% train, 20% test on EngineFaultDB; 70% train, 30% test on the primary dataset (60/10/30 train/validation/test for the validation-tuned Gradient Boosting model)

E. Evaluation metrics

For each class c, with true positives TP꜀, false positives FP꜀, and false negatives FN꜀:

Precision꜀ = TP꜀ / (TP꜀ + FP꜀) , Recall꜀ = TP꜀ / (TP꜀ + FN꜀) , F1꜀ = 2·Precision꜀·Recall꜀ / (Precision꜀ + Recall꜀)

Macro-F1 is the unweighted mean of per-class F1 scores. Receiver Operating Characteristic (ROC) curves were computed one-vs-rest per class by sweeping the predicted-probability threshold and plotting True Positive Rate against False Positive Rate; Area Under the Curve (AUC) was computed via the trapezoidal rule.
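These metrics can be computed directly from label and score arrays, as in the sketch below. The function names are hypothetical, and the convention of returning 0 when a denominator is empty is an assumption of this example rather than something the paper specifies.

```python
import numpy as np

def per_class_prf(y_true, y_pred, n_classes):
    """Rows of (precision, recall, F1), one per class."""
    rows = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((precision, recall, f1))
    return np.array(rows)

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    return float(per_class_prf(y_true, y_pred, n_classes)[:, 2].mean())

def roc_auc(y_true_binary, scores):
    """One-vs-rest ROC by threshold sweep; area by the trapezoidal rule."""
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    pos = np.sum(y_true_binary == 1)
    neg = np.sum(y_true_binary == 0)
    tpr = np.array([np.sum((scores >= t) & (y_true_binary == 1)) / pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y_true_binary == 0)) / neg for t in thresholds])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy usage with hand-checkable values.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
demo_macro_f1 = macro_f1(y_true, y_pred, 2)
demo_auc = roc_auc(y_true, scores)
```

For a multi-class model, `roc_auc` would be called once per class with that class's predicted probability as `scores` and a 0/1 indicator of class membership as `y_true_binary`.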

For the Gradient Boosting model specifically (Section IV-B.4), the primary dataset was split 60/10/30 into training, validation, and held-out test sets (seed 42). The classification threshold applied to P̂(GOOD | x) was selected to maximize macro-F1 on the validation split only, then applied unchanged to the untouched test split; no threshold, hyperparameter, or model-selection decision was made using test-set labels, avoiding the threshold-leakage that would otherwise inflate the reported test performance.
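The validation-only threshold selection can be sketched as follows. The 0.01-step grid, the function names, and the toy probability values are assumptions made for illustration; the paper states only that the threshold maximizing validation macro-F1 was chosen.

```python
import numpy as np

def binary_macro_f1(y_true, y_pred):
    """Macro-F1 over the two classes, using F1 = 2TP / (2TP + FP + FN)."""
    f1s = []
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def tune_threshold(y_val, p_good_val, grid=None):
    """Return the grid threshold with the highest validation macro-F1."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    scores = [binary_macro_f1(y_val, (p_good_val >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))  # first maximum if several thresholds tie
    return float(grid[best]), float(scores[best])

# Toy validation data (1 = GOOD, 0 = BAD); values are illustrative only.
y_val = np.array([0, 0, 0, 1, 1, 1])
p_val = np.array([0.2, 0.4, 0.553, 0.607, 0.7, 0.9])
best_t, best_f1 = tune_threshold(y_val, p_val)

# The chosen threshold is then frozen and applied to the test probabilities.
p_test = np.array([0.3, 0.58, 0.95])
y_test_pred = (p_test >= best_t).astype(int)
```

The key property is that `tune_threshold` never sees test labels: the test split is touched only after `best_t` is fixed.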

V. Results

A. Overall model comparison

TABLE V. Model comparison, full 4-class problem (held-out test set, n = 11,199)

Model	Accuracy	Macro F1	Binary acc.	Notes
Gaussian Naive Bayes	38.5%	0.347	–	Independence assumption violated
Decision Tree	74.5%	0.740	99.96%	Baseline (original thesis method)
Random Forest	74.3%	0.752	100.00%	Proposed model

Fig. 2. Accuracy and macro-F1 by model, 4-class problem.

B. Confusion matrix

Fig. 3 shows the row-normalized confusion matrix for the Random Forest model on the held-out test set. Normal and rich-mixture classes are classified with 100% accuracy; lean-mixture and low-voltage classes are frequently confused with each other, with roughly half of each misclassified as the other.

Fig. 3. Random Forest confusion matrix (row-normalized), 4-class test set.

C. ROC analysis

The one-vs-rest ROC curves (Fig. 4) mirror the confusion matrix: the Normal and Rich-mixture classes are almost perfectly separable (AUC ≈ 1.00), whereas the Lean-mixture and Low-voltage classes show markedly lower, although still above-chance, AUC values, as expected given how often the two are confused with each other.

Fig. 4. One-vs-rest ROC curves, Random Forest, 4-class test set.

D. Two-stage pipeline performance

Stage 1 (binary condition detection): 100.0% accuracy, macro-F1 = 1.00 (n = 11,199).

Stage 2 (fault-subtype identification), evaluated on faulty samples only (n = 8,065): 64.8% accuracy, macro-F1 = 0.675.

TABLE VI. Stage-2 per-class performance (fault subtype identification)

Fault type	Precision	Recall	F1	Support
Rich mixture	1.00	1.00	1.00	2,235
Lean mixture	0.52	0.53	0.52	2,990
Low ignition voltage	0.49	0.48	0.49	2,840

End-to-end pipeline accuracy on the original 4-class labels (Stage 1 → Stage 2 combined): 74.7%.
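As a consistency check, the end-to-end figure can be approximately recovered from the stage-level results reported above. The sketch assumes that Stage 1 makes no errors on the test set (it is reported as 100.0%), so every normal sample is correct and every faulty sample reaches Stage 2; the variable names are introduced here for illustration.

```python
# Reported quantities from Section V-D.
n_test = 11_199        # held-out test samples
n_faulty = 8_065       # faulty samples evaluated in Stage 2
stage2_acc = 0.648     # Stage-2 subtype accuracy

# Assumption: Stage 1 is error-free, so all normal samples are classified correctly.
n_normal = n_test - n_faulty
expected_correct = n_normal + stage2_acc * n_faulty
end_to_end = expected_correct / n_test
print(f"{end_to_end:.1%}")
```

Under that assumption the combined accuracy is a weighted average of 100% on the normal samples and the Stage-2 accuracy on the faulty ones, which agrees with the reported 74.7% to one decimal place.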

E. Key finding

Using the 14 steady-state variables, rich-mixture faults are perfectly separable from all other classes in all the tested models (Tables V and VI, Fig. 3, Fig. 4). Lean-mixture and low-voltage faults, on the other hand, overlap with one another under every classifier tested; this is a characteristic of the feature space, not of any particular algorithm. A single instantaneous set of OBD-style readings is therefore enough to detect reliably that something is wrong (Stage 1 is effectively solved), but reliably discriminating between certain fault subtypes will likely require additional signal information, for example short time-series or derivative features, or another sensor channel. This is a concrete, testable direction for future work (Section VIII).

F. Feature importance

Feature importance was measured by mean impurity decrease for the Stage-1 Random Forest (all data). The leading features are CO (0.159), Force (0.123), fuel consumption L/H (0.089), HC (0.088), and fuel consumption L/100 km (0.087). This ranking is physically plausible: CO and HC are measures of combustion mixture quality, while Force and the two fuel-consumption measures reflect the effect of a fault on the engine's resulting output.

Fig. 5. Random Forest feature importance, Stage-1 binary model.

G. Primary dataset benchmark

The same three-model comparison was rerun on the primary project dataset [9], [12] (19,535 records, six attributes, binary GOOD/BAD target), using the same 70/30 train/test split as the original thesis for a fair comparison. For reference, Table VII reproduces results from the original thesis's own comparison of twenty MATLAB Classification Learner configurations, which already indicated the advantage of ensemble approaches (Boosted Trees) over every single-tree approach examined.

TABLE VII. Original thesis's comparative ML model results (MATLAB Classification Learner) [9]

Model	Accuracy
Linear SVM	65.5%
Medium Gaussian SVM	66.1%
Coarse Gaussian SVM	65.5%
Coarse KNN	66.0%
Boosted Trees (best overall)	66.5%
Fine Decision Tree	65.4%
Coarse Decision Tree	65.5%
Medium Decision Tree	65.7%
Bagged Trees	63.2%
RUSBoosted Trees	62.3%

The same pattern appears in this paper's from-scratch benchmark (Table VIII): a standard Random Forest achieves accuracy comparable to the original decision tree (66.3% vs. 65.7%) but a slightly lower macro-F1 (0.594 vs. 0.607), because its minority-class recall is weak (BAD recall: 34.5%, compared to 40.6% in the original thesis). Class-balanced bootstrap sampling (Section IV-B.3) addresses this directly: macro-F1 rises to 0.623, higher than the original decision tree, and BAD-class recall almost doubles to 67.2%, while overall accuracy falls only modestly, to 63.1% (against 64.3% for this study's decision tree). For a fault-detection system, where a false negative is a costlier failure than a false alarm, this is a more relevant trade-off than raw accuracy alone, and the Gradient Boosting result presented next improves the trade-off further.
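One way to realise class-balanced bootstrap sampling is to draw the same number of rows, with replacement, from each class for every tree. The sketch below is an assumption about the scheme, since the paper does not spell out its exact sampling procedure, and the function name is hypothetical.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Draw an equal number of rows per class, with replacement, for one tree."""
    classes = np.unique(y)
    per_class = len(y) // len(classes)
    parts = [
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(parts)

# Toy usage: labels with roughly the 63/37 GOOD/BAD imbalance of the primary dataset.
rng = np.random.default_rng(42)
y_demo = np.array([1] * 631 + [0] * 369)
idx = balanced_bootstrap_indices(y_demo, rng)
counts = np.bincount(y_demo[idx])
```

Each tree then sees both classes equally often, which pushes the ensemble toward higher minority-class recall at some cost in overall accuracy, the direction of the trade-off reported in Table VIII.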

TABLE VIII. Model comparison, primary dataset (held-out test set, n = 5,860)

Model	Accuracy	Macro F1	BAD recall	Notes
Gaussian Naive Bayes	66.3%	0.587	32.2%	Independence assumption violated
Decision Tree (original thesis [9])	65.7%	0.607	40.6%	MATLAB, reported baseline
Decision Tree (this study)	64.3%	0.586	N/A	NumPy re-implementation
Random Forest (standard)	66.3%	0.594	34.5%	Matches accuracy, weak minority recall
Random Forest (class-balanced)	63.1%	0.623	67.2%	Improves recall over standard RF
Gradient Boosting (validation-tuned)	64.4%	0.632	63.8%	Best macro-F1; recommended for deployment

To close that gap further, a Gradient Boosting Classifier [13] (Section IV-B.4) was trained on the same six-feature primary dataset with an explicit 60/10/30 train/validation/test split (Section IV-E), so that its classification threshold could be tuned without the test set influencing the choice. Sweeping the threshold on P̂(GOOD | x) between 0 and 1 on the validation set and choosing the value that maximizes validation macro-F1 gives an optimal threshold of 0.59 (i.e., a reading is classified as GOOD only when the predicted probability is at least 0.59, rather than the default 0.5). On the untouched test set this yields 64.4% accuracy and a macro-F1 of 0.632 (Table VIII), the best macro-F1 of any model on this dataset, with closely matched BAD- and GOOD-class recalls (63.8% and 64.7%). This is the model this paper recommends for the primary dataset, and it is the model running as Mode 2 of the dashboard (Section VI-F).

Fig. 6. Accuracy and macro-F1 by model, primary dataset (red bar marks the best macro-F1: Gradient Boosting).

Fig. 7. Gradient Boosting confusion matrix, primary dataset (recommended model).

Fig. 8. Gradient Boosting ROC curve, primary dataset (AUC = 0.696).

On this primary dataset, the Random Forest feature importances (Fig. 9) provide additional empirical support for the main qualitative finding of the original thesis, without the model being given any prior knowledge of which features matter. Engine RPM dominates (0.387), in line with the thesis's description of it as “the most influential prediction feature for engine conditions, is our primary feature of classification.” [9]. It is followed by Fuel Pressure (0.160), Lubricant Oil Temperature (0.137), Lubricant Oil Pressure (0.111), Coolant Pressure (0.107), and Coolant Temperature (0.099). This agreement between an independent ensemble model and the original single-tree analysis is itself a form of validation of the original thesis's threshold-based root-cause logic (Section III-A), which was built around exactly this RPM-led hierarchy. A split-frequency-based importance measure computed for the Gradient Boosting model (Section V-G) ranks the same feature highest (Engine RPM, 0.29), followed by Lubricant Oil Temperature (0.21) and Fuel Pressure (0.19), corroborating this ranking across two structurally different ensemble methods.

Fig. 9. Random Forest feature importance, primary dataset.

VI. Diagnostic Dashboard

To translate the trained models into a tool usable by a working technician, both the EngineFaultDB two-stage Random Forest pipeline and the primary-dataset Gradient Boosting model were deployed together, as two selectable modes of a single self-contained, publicly hosted web application, live at: https://engine-dashboard.netlify.app/

A. System architecture

All trained tree ensembles, Random Forest for both EngineFaultDB stages and the Gradient Boosting model for the primary dataset, are serialized to JSON (per-node split feature, threshold, and left/right child references, plus the initial log-odds and learning rate for the boosting model) and embedded directly in the page. Inference is performed entirely client-side in JavaScript by replicating the same tree-traversal logic used during training in Python/NumPy: for a feature vector x reaching node n, descend left if x[n.feature] ≤ n.threshold, else right, until a leaf is reached; Random Forest predictions average leaf probability vectors across all trees (Section IV-B.3), while the Gradient Boosting prediction sums each tree's leaf value, scaled by the learning rate, onto a running log-odds total before applying the logistic sigmoid (Section IV-B.4). No server, database, or network call is required after the page loads, so the tool functions fully offline and imposes no ongoing hosting cost or latency.
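The traversal and aggregation rules above can be sketched in a few lines. The version below is written in Python for consistency with the training side and is only a sketch under stated assumptions: the JSON field names (`feature`, `threshold`, `left`, `right`, `proba`, `value`, `init_log_odds`, `learning_rate`) are hypothetical stand-ins for whatever schema the dashboard actually embeds, and the toy trees are invented.

```python
import math


def find_leaf(nodes, x):
    """Walk one serialized tree: go left if x[feature] <= threshold, else right."""
    i = 0  # root is assumed to be node 0
    while "feature" in nodes[i]:
        node = nodes[i]
        i = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return nodes[i]


def forest_predict_proba(trees, x):
    """Random Forest: average the leaf probability vectors over all trees."""
    leaves = [find_leaf(tree, x)["proba"] for tree in trees]
    n_classes = len(leaves[0])
    return [sum(leaf[c] for leaf in leaves) / len(leaves) for c in range(n_classes)]


def boosting_predict_proba(model, x):
    """Gradient Boosting: add learning-rate-scaled leaf values to the initial
    log-odds, then apply the logistic sigmoid."""
    log_odds = model["init_log_odds"]
    for tree in model["trees"]:
        log_odds += model["learning_rate"] * find_leaf(tree, x)["value"]
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two toy stumps; each leaf carries both a class-probability vector (forest use)
# and a scalar value (boosting use) purely to keep the example short.
stump_1 = [
    {"feature": 0, "threshold": 1.0, "left": 1, "right": 2},
    {"proba": [0.9, 0.1], "value": -1.0},
    {"proba": [0.2, 0.8], "value": 2.0},
]
stump_2 = [
    {"feature": 1, "threshold": 5.0, "left": 1, "right": 2},
    {"proba": [0.6, 0.4], "value": 0.5},
    {"proba": [0.0, 1.0], "value": -0.5},
]
x = [2.0, 3.0]
rf_proba = forest_predict_proba([stump_1, stump_2], x)
gb_proba = boosting_predict_proba(
    {"init_log_odds": 0.0, "learning_rate": 0.1, "trees": [stump_1, stump_2]}, x
)
```

Because both functions use only comparisons, additions, and one exponential, the same logic ports directly to client-side JavaScript with no numerical library.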

B. Use cases

Roadside / independent-garage triage: a technician without access to a proprietary manufacturer scan tool enters live OBD readings and receives an immediate condition assessment and inspection checklist.
Vehicle owner self-check: an owner with a consumer-grade OBD-II dongle can get a plain-language first read before committing to a paid workshop diagnosis.
Fleet pre-trip inspection: fleet maintenance staff can quickly sort through sensor data from many vehicles and identify which ones need closer inspection first.
Automotive training and education: the tool's clear probability distribution and feature-contribution graph lend themselves to teaching environments, where they can help students interpret OBD sensor data.
Research reproducibility: because the entire model runs locally on the client and is publicly exposed, researchers outside the project can interrogate its behaviour without needing the training pipeline or the dataset itself.
Field deployment with a common OBD-II scanner: because Mode 2's six inputs derive from common OBD-II/ECU parameters (see Table I in Section III-D), it is this paper's intended mode for use with a standard consumer scanner, without external instruments, a gas analyzer, or a dynamometer.

C. Interface layout

A pair of mode-selector tabs at the top of the page lets the user switch between Mode 1 (EngineFaultDB, fault-subtype diagnosis) and Mode 2 (primary dataset, GOOD/BAD). Both modes share an input-panel-plus-result-panel layout, sketched in Fig. 10.

Fig. 10. Schematic layout of the deployed dashboard interface, shared by both modes.

D. Running a new diagnosis (Mode 1: EngineFaultDB)

A new diagnosis is entered and run from the left-hand Input Panel (Fig. 10):

Select Mode 1 using the tab at the top of the page (Mode 1 is selected by default).
Enter the 14 OBD/sensor readings for the vehicle under test in the labelled input fields, or click one of the preset buttons (Normal, Rich mixture, Lean mixture, Low voltage) to load example readings.
Click “Run Diagnosis”. The Stage-1 model takes the readings and returns a Normal / Faulty result with a confidence score, shown as a coloured pill (green for Normal, red for Faulty).
If the result is Faulty, the Stage-2 model is invoked and a ranked probability bar shows the probability of each of the three fault types.
The panel also shows the top contributing sensor readings, i.e. the features that pushed the prediction farthest from normal, weighted by their model-learned importance, together with a brief, fault-specific checklist from Table IX.
If the top two fault probabilities are close (within 15 percentage points), the interface highlights both rather than confidently selecting a single fault, in line with Section V-E.
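The detect-then-diagnose flow in the steps above, including the rule for highlighting two close candidates, can be sketched as follows. The function name, the return structure, and the 0.5 Stage-1 cutoff are assumptions for illustration; "within 15 points" is read here as 15 percentage points (0.15 on a probability scale).

```python
def two_stage_diagnosis(p_faulty, fault_probs, margin=0.15, stage1_cutoff=0.5):
    """Combine Stage-1 (Normal/Faulty) and Stage-2 (fault subtype) outputs.

    p_faulty      Stage-1 probability that the engine is faulty.
    fault_probs   dict mapping fault name -> Stage-2 probability.
    margin        gap below which the top two faults are both highlighted.
    stage1_cutoff assumed decision cutoff for Stage 1 (hypothetical value).
    """
    if p_faulty < stage1_cutoff:
        # Stage 2 is not invoked for a Normal result.
        return {"status": "Normal", "confidence": 1.0 - p_faulty, "faults": []}

    ranked = sorted(fault_probs.items(), key=lambda kv: kv[1], reverse=True)
    highlighted = [ranked[0][0]]
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= margin:
        highlighted.append(ranked[1][0])  # too close to call: show both
    return {
        "status": "Faulty",
        "confidence": p_faulty,
        "faults": highlighted,
        "ranking": ranked,
    }


# Illustrative probabilities only (not model outputs from the paper).
ambiguous = two_stage_diagnosis(
    0.97,
    {"Rich air-fuel mixture": 0.10,
     "Lean air-fuel mixture": 0.48,
     "Low ignition voltage": 0.42},
)
clear_cut = two_stage_diagnosis(
    0.99,
    {"Rich air-fuel mixture": 0.90,
     "Lean air-fuel mixture": 0.06,
     "Low ignition voltage": 0.04},
)
normal = two_stage_diagnosis(0.02, {})
```

Returning both candidates in the ambiguous case lets the interface show the checklists for both faults from Table IX instead of committing to one.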

E. Recommended checks by fault type

TABLE IX. Fault type to recommended-check mapping shown in the dashboard

Predicted fault	Suggested engineer action
Rich air-fuel mixture	Check fuel injectors for leakage/over-delivery, inspect the O2 sensor and fuel pressure regulator, verify MAF/MAP sensor calibration.
Lean air-fuel mixture	Check for vacuum/intake leaks, inspect fuel delivery (pump pressure, clogged injectors/filter), verify O2 sensor response.
Low ignition voltage	Inspect ignition coil(s), spark plugs and gaps, battery/alternator charging voltage, and ignition wiring/connectors.

F. Second diagnostic mode: primary dataset (Good/Bad, Gradient Boosting)

Mode 2 exposes the primary dataset's Gradient Boosting model (Section IV-B.4) using the same Input Panel / Result Panel layout as Mode 1:

Select Mode 2 using the tab at the top of the page.
Enter the six OBD/ECU readings (Table I) in the labelled input fields, each annotated with its OBD-II PID or availability note (Section III-D), or click one of the preset example buttons (Healthy engine, Faulty engine) to load a representative sample reading.
Click “Run Diagnosis.” The model computes P̂(GOOD | x) by summing the learning-rate-scaled output of all 150 boosted trees onto the initial log-odds and applying the logistic sigmoid (Section IV-B.4), then classifies the reading as GOOD only if this probability meets the validation-tuned threshold of 0.59, otherwise as Faulty.
As in Mode 1, the panel displays the top contributing readings (largest deviation from the healthy-class mean, weighted by split-frequency importance) and, when a fault is detected, a short checklist of recommended checks for the two most contributing readings (Table X).
Because Mode 2 has a lower macro-F1 (0.632) and BAD-class recall (63.8%) than Mode 1's perfect Stage-1 accuracy, the UI is designed to display the confidence level clearly and to warn that a Faulty label simply means further investigation is warranted.
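The Mode 2 decision rule and the ranking of contributing readings described in these steps can be sketched as below. The 0.59 threshold is taken from the text; everything else is an assumption for illustration: the function names, the use of a standard deviation to put readings in different units on a common scale, and all numeric values in the example (healthy-class means, spreads, and readings are invented; only the three importance values echo those quoted earlier in the paper).

```python
def classify_mode2(p_good, threshold=0.59):
    """GOOD only if P(GOOD | x) meets the validation-tuned threshold."""
    return "GOOD" if p_good >= threshold else "Faulty"


def top_contributors(x, healthy_mean, healthy_std, importance, k=2):
    """Rank readings by importance-weighted deviation from the healthy-class mean.

    Dividing by healthy_std is an assumed normalization so that, for example,
    RPM and pressure deviations are comparable; the dashboard may scale differently.
    """
    scores = {}
    for name, value in x.items():
        deviation = abs(value - healthy_mean[name]) / healthy_std[name]
        scores[name] = importance[name] * deviation
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented example values (not statistics from the primary dataset).
reading = {"Engine RPM": 1500.0, "Fuel pressure": 6.0, "Lub oil temp": 80.0}
mean = {"Engine RPM": 800.0, "Fuel pressure": 6.5, "Lub oil temp": 77.0}
std = {"Engine RPM": 250.0, "Fuel pressure": 2.0, "Lub oil temp": 3.0}
weights = {"Engine RPM": 0.29, "Fuel pressure": 0.19, "Lub oil temp": 0.21}

label = classify_mode2(0.55)
top_two = top_contributors(reading, mean, std, weights)
```

The two names returned by `top_contributors` are the keys that would be looked up in Table X to build the checklist shown when a fault is detected.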

TABLE X. Feature to recommended-check mapping shown in Mode 2 of the dashboard

Contributing reading	Suggested engineer action
Engine RPM	Check the idle air control (IAC) valve and throttle body for carbon buildup; inspect for vacuum leaks causing unstable idle; abnormally high sustained RPM can indicate a throttle linkage or governor fault.
Fuel pressure	Inspect fuel pump output pressure, the fuel pressure regulator, and the fuel filter for clogging.
Lub oil pressure	Check oil level and viscosity, oil pump condition, and the pressure relief valve; low oil pressure combined with high oil temperature is a strong indicator of lubrication-system failure.
Lub oil temp	Inspect the oil cooler, oil level, and lubrication circuit for restrictions.
Coolant temp	Check coolant level, radiator and thermostat operation, water pump, and cooling fan.
Coolant pressure	Inspect the radiator cap and cooling system for leaks, and check head gasket integrity.

VII. Discussion

The results support five claims relevant to this paper's contribution. First, on EngineFaultDB, upgrading from a single decision tree to a Random Forest yields a measurable, if modest, macro-F1 gain (0.740 to 0.752) while achieving perfect binary detection, at negligible additional computational or interpretability cost; this is a favourable trade-off for a deployed tool and is consistent with the broader literature on ensemble methods for vehicle fault diagnosis [4], [5], [10]. Second, reframing the task as two stages surfaces a finding a single flat classifier would have hidden: the difficulty in this domain is concentrated entirely in distinguishing specific fault subtypes with overlapping steady-state signatures, not in detecting that a fault exists at all; the ROC-AUC gap between the Rich-mixture class (≈ 1.00) and the Lean/Low-voltage classes (Fig. 4) makes this quantitatively explicit. Third, on the primary dataset, the comparison between the standard and class-balanced Random Forest (Table VIII) illustrates that raw accuracy can be a misleading measure of fault-detection performance: the standard model matches the accuracy of the original thesis but, without a deliberate change to the sampling process, also inherits its majority-class bias, whereas the class-balanced model nearly doubles recall for the class that matters in practice (BAD engines), at a cost in accuracy that a practitioner would likely accept given that a missed fault typically costs more than a false alarm. Fourth, a validation-tuned Gradient Boosting Classifier improves on this trade-off further, achieving the highest macro-F1 (0.632) of any model tested on the primary dataset and balancing BAD- and GOOD-class recall closely, with no threshold choice informed by the test set, thereby directly mitigating the accuracy limitation noted in an earlier version of this work. Fifth, packaging both models as a two-mode, zero-install browser tool shows that ensemble and boosted tree models, unlike deep neural networks, can be deployed to end users without back-end infrastructure, which matters for independent technicians and small repair shops that lack the IT support such infrastructure normally requires; the OBD-II PID mapping in Section III-D further allows a user to select the mode that matches the equipment actually available.

Comparing the two datasets also clarifies what each contributes, and the availability analysis in Section III-D makes the trade-off explicit rather than implicit. EngineFaultDB's 14-variable sensor suite, nine of whose variables (including exhaust-gas composition) cannot be retrieved with a generic OBD-II scanner in the field, is what enables Mode 1 of the dashboard to perform fine-grained root-cause identification (Section VI-D). The primary dataset [9], [12] has six inputs, all of which correspond to standard or commonly available OBD-II/ECU parameters (Table I); it supports only the coarser GOOD/BAD distinction the original thesis targeted (Mode 2 of the dashboard, Section VI-F), but it is the one this paper recommends for field use with a generic scanner. The two are not combined into a single model in this paper (Section VIII) because they answer different, complementary questions under different equipment constraints.

VIII. Limitations and Future Work

Because the two datasets use distinct feature sets (six versus 14 variables) and different levels of label specificity (two-class versus four-class), they have not been integrated to train one shared, general-purpose model. Possible extensions include a shared-feature model jointly trained on both (using the OBD-II-compliant feature subset identified in Section III-D) or a transfer-learning strategy that leverages the fault-subtype labels from EngineFaultDB to improve performance on the coarser BAD label of the primary dataset.
The Stage-2 fault-subtype accuracy of 64.8% on EngineFaultDB leaves clear room for improvement. Promising directions include engineering time- and derivative-based features from repeated polling of OBD sensors, introducing a third sensor channel to differentiate lean-mixture from low-voltage faults, or sampling a larger and more diverse set of fault subtypes.
Gradient Boosting (Section V-G) closes some, but not all, of the primary-dataset accuracy deficit: BAD-class precision remains low at 0.51 (that is, roughly half of the readings flagged as BAD would, if individually checked, turn out to be healthy). Further improvement will likely depend on engineering features beyond the six used here, higher-frequency polling of the OBD data (to catch events that instantaneous values miss), or adjusting the decision threshold to the false-alarm tolerance of a given deployment rather than using the fixed, validation-derived value adopted here.
Nine of EngineFaultDB's fourteen inputs, including all four exhaust-gas-composition variables, are not obtainable from a standard consumer-grade OBD-II scanner (Section III-D); Mode 1 of the dashboard is consequently best suited to a workshop or laboratory setting with gas-analyzer equipment, and a fault-subtype model that uses only field-realistic OBD-II inputs remains an open problem this paper does not solve.
The dashboard's recommended-checks mappings (Tables IX and X) are rules-based lookups from a predicted condition to standard diagnostic steps, not learned recommendations; validating them against real repair outcomes is a natural extension.
Both trained models were evaluated on data collected under controlled conditions on a limited number of engine configurations (EngineFaultDB: a single C14NE spark-ignition engine; primary dataset: the source study's original data-collection protocol). Neither has yet been validated against readings taken directly from a physical OBD-II scanner on a live, heterogeneous vehicle fleet; that field validation is the most important remaining step before either mode is used for an unsupervised maintenance decision, and is the natural next collaboration once this paper is published and the tools described in Section VI are available for companies and engineers to test directly.

IX. Conclusion

This paper directly answers the original thesis's [9] own recommendations for future work: it benchmarks alternative machine learning models against the original decision tree, on the original dataset [9], [12], and confirms that ensemble and boosting methods, specifically a class-balanced Random Forest followed by a validation-tuned Gradient Boosting Classifier, improve macro-F1 (0.607 to 0.632) and nearly double fault-detection recall (34.5% to 63.8%) while independently validating Engine RPM as the dominant diagnostic signal the original thesis identified. It further extends the underlying approach into a two-stage Random Forest pipeline, validated on the peer-reviewed EngineFaultDB dataset [1], that detects faulty engine conditions with perfect accuracy and additionally provides a probable root cause, closing the gap between “something is wrong” and “here is what to check.” Every input feature used by both models is explicitly mapped to its OBD-II or ECU availability (Section III-D), so that the more field-realistic six-parameter model is clearly identified as the one intended for deployment via a generic scanner. Both contributions are benchmarked against classical baselines with full mathematical formulation, and both trained models are packaged as a single working, dependency-free, two-mode diagnostic dashboard publicly deployed at https://engine-dashboard.netlify.app/, demonstrating a practical path from a classroom classification exercise to a tool a working technician could use today.

References

[1] M. Vergara, L. Ramos, N. D. Rivera-Campoverde, and F. Rivas-Echeverría, “EngineFaultDB: A novel dataset for automotive engine fault classification and baseline results,” IEEE Access, vol. 11, pp. 126155–126171, 2023, doi: 10.1109/ACCESS.2023.3331316.

[2] E. T. Michailidis, A. Panagiotopoulou, and A. Papadakis, “A review of OBD-II-based machine learning applications for sustainable, efficient, secure, and safe vehicle driving,” Sensors, vol. 25, no. 13, art. 4057, 2025, doi: 10.3390/s25134057.

[3] A. Pérez-Vázquez, M. Anzures-García, and L. A. Sánchez-Gálvez, “Vehicle engine fault diagnosis approach based on a decision tree and knowledge base,” Int. J. Combinatorial Optimization Problems and Informatics, vol. 15, no. 2, pp. 185–197, 2024, doi: 10.61467/2007.1558.2024.v15i2.450.

[4] R. Quan, J. Zhang, and Z. Feng, “Remote fault diagnosis for the powertrain system of fuel cell vehicles based on random forest optimized with a genetic algorithm,” Sensors, vol. 24, no. 4, art. 1138, 2024, doi: 10.3390/s24041138.

[5] M. N. Hossain, M. M. Rahman, and D. Ramasamy, “Artificial intelligence-driven vehicle fault diagnosis to revolutionize automotive maintenance: A review,” Computer Modeling in Engineering & Sciences, vol. 141, no. 2, pp. 951–996, 2024, doi: 10.32604/cmes.2024.056022.

[6] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.

[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA, USA: Wadsworth & Brooks/Cole, 1984.

[8] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proc. 11th Conf. Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, 1995, pp. 338–345.

[9] N. Umurie, “Development of a decision tree-based predictive model for condition-based fault detection in automobile engines,” B.Eng. dissertation, Dept. Mechanical Eng., Covenant Univ., Ota, Nigeria, 2025.

[10] T. P. Carvalho, F. A. A. M. N. Soares, R. Vita, R. P. Francisco, J. P. Basto, and S. G. S. Alcala, “A systematic literature review of machine learning methods applied to predictive maintenance,” Computers & Industrial Engineering, vol. 137, art. 106024, 2019, doi: 10.1016/j.cie.2019.106024.

[11] A. K. S. Jardine, D. Lin, and D. Banjevic, “A review on machinery diagnostics and prognostics implementing condition-based maintenance,” Mechanical Systems and Signal Processing, vol. 20, no. 7, pp. 1483–1510, 2006, doi: 10.1016/j.ymssp.2005.09.012.

[12] P. Modi, “Automotive vehicles engine health dataset,” Kaggle, 2023. [Online]. Available: https://www.kaggle.com/datasets/parvmodi/automotive-vehicles-engine-health-dataset

[13] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, 2001, doi: 10.1214/aos/1013203451.