A Two-Stage Random Forest Ensemble for Condition-Based Fault Detection and Root-Cause Diagnosis in Automobile Engines, with a Deployed Diagnostic Dashboard

Nsikak Umurie


Abstract. 

Vehicle owners and mechanics can readily connect to OBD-II data streams, but turning raw OBD-II data into an actual engine failure diagnosis usually takes a great deal of experience and guesswork. This paper extends an earlier decision-tree-based binary classifier for automobile engine condition detection [9] into an ensemble and gradient-boosting framework evaluated on two datasets, with explicit attention to which input features are realistically obtainable from an OBD-II scanner. On the original project's primary dataset (19,535 vehicle-engine records, six sensor parameters, binary good/bad label, all six mapped to standard or common OBD-II/ECU parameters in this paper), a validated Gradient Boosting Classifier with a decision threshold tuned on a held-out validation split, never on the test set, raises macro-F1 from 0.607 (the original decision tree) to 0.632 and nearly doubles fault-class recall relative to a standard Random Forest (34.5% to 63.8%), while confirming Engine RPM as the dominant predictor identified in the original study. On EngineFaultDB, a peer-reviewed public dataset of 55,999 real spark-ignition engine readings across four condition classes, a two-stage Random Forest pipeline (detect, then diagnose) achieves 100.0% accuracy on binary condition detection and 64.8% accuracy on fault-subtype identification, with ROC-AUC exceeding 0.99 for two of three fault classes. Confusion matrix analysis shows that the lean-mixture and low-ignition-voltage faults are largely indistinguishable from one another using their steady-state sensor characteristics, which points to an immediate direction for future sensor development. The two trained models are packaged as a single, browser-only dashboard with two selectable modes, publicly available at https://engine-dashboard.netlify.app/.
Once loaded, the system requires no installation or internet connection and gives an engineer immediate engine condition information, a likely root cause, and a recommended maintenance inspection checklist from entered OBD readings.

Index Terms. predictive maintenance, OBD-II, automobile engine fault diagnosis, Random Forest, gradient boosting, ensemble learning, condition-based monitoring, diagnostic dashboard, root-cause analysis.











I. Introduction

On-board diagnostic (OBD-II) scanners give vehicle owners and technicians access to a wide range of live engine sensor readings: manifold pressure, throttle position, exhaust gas composition, engine speed, and more. In practice, however, reading this data is only the first step: converting a set of numeric sensor values into a specific, actionable diagnosis still typically depends on technician experience, frequently supplemented by manual, sequential component testing. Michailidis et al. [2] survey the growing body of OBD-II-based machine learning applications and note that most deployed systems still stop at anomaly flagging rather than root-cause attribution, leaving the diagnostic burden on the technician. Rule-based and decision-tree expert systems have been proposed to close this gap [3], and ensemble methods such as Random Forest have shown strong results on related vehicle-subsystem fault diagnosis tasks [4], [5], but, to the best of our knowledge, combining an ensemble classifier with an explicit two-stage detect-then-diagnose architecture and a publicly deployed, zero-install diagnostic interface has not been reported for OBD-derived engine data specifically.

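The first phase of this project addressed the detection half of this problem: a decision-tree classifier trained on aggregated “good” and “bad” engine readings predicted whether a given set of readings indicated a healthy or faulty engine. This paper extends that work along three axes, corresponding to its three contributions (detailed in Section II-F): a two-stage detect-then-diagnose pipeline on EngineFaultDB, class-balanced ensemble and validation-tuned gradient boosting models on the primary dataset, and a publicly deployed diagnostic dashboard with an explicit accounting of OBD-II input availability.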

This paper is a direct continuation of an earlier undergraduate dissertation, “Development of a Decision Tree-Based Predictive Model for Condition-Based Fault Detection in Automobile Engines” [9], which identified that conventional fault-identification approaches “rely on residual knowledge, which can be outdated and prone to mistakes,” and set out to build a decision-tree model able to classify engine condition as good or bad and to surface the specific parameter(s) responsible for a bad reading. Its own recommendations for future work explicitly called for testing “other methods… like Discriminant Analysis, Support Vector Machines, Naive Bayes Classifiers, Logistic Regression, nearest neighbours, and neural network classifiers” to improve on the decision tree's accuracy, and for pairing the model with OBD-II scanners in practice. This paper addresses both recommendations: Table VIII in Section V-G re-benchmarks ensemble and gradient boosting algorithms against the baseline decision tree over the same training and testing data; Tables I and II in Section III list the OBD-II/ECU availability of each feature used; and Section VI describes the current OBD-facing deployment, including a mode tailored to the feature set available in a field setting.


II. Related Work

A. OBD-II-based diagnostics and machine learning

Michailidis, Panagiotopoulou, and Papadakis [2] survey OBD-II machine learning applications spanning sustainability, efficiency, security, and safety. For condition-based fault diagnosis, they find that traditional tabular machine learning models (trees, ensembles, Bayesian methods) dominate: these models are more interpretable than deep learning methods, which trade interpretability for only small accuracy gains on already saturated detection problems.

B. Decision-tree and knowledge-based diagnostic systems

The use of a tree structure for fault diagnosis is reinforced by Pérez-Vázquez, Anzures-García, and Sánchez-Gálvez [3], who introduce a decision-tree-based diagnosis method coupled with a symbolic knowledge base for vehicle engine faults. Their approach exploits the explicitness of the decisions that decision-tree algorithms produce, a property from which this dashboard's explanatory module similarly benefits.

C. Ensemble methods for vehicle subsystem diagnosis

Quan, Zhang, and Feng [4] apply a genetic-algorithm-optimized Random Forest to remote fault diagnosis of fuel-cell vehicle powertrains, reporting strong classification performance across eight fault types from CAN-bus telemetry, and Hossain, Rahman, and Ramasamy [5] review the broader landscape of AI-driven vehicle fault diagnosis, concluding that ensemble tree methods offer the best accuracy-to-interpretability trade-off for maintenance-facing deployments, as opposed to research-facing benchmarking exercises. This paper's model choice (Section IV-C) is directly motivated by that conclusion.

D. Foundational classifier methodology

The classifiers benchmarked in this paper are grounded in established methodology: CART decision trees [7], Random Forest bagging with per-split feature subsampling [6], and Gaussian Naive Bayes with continuous-feature density estimation [8]. Full mathematical formulations are given in Section IV.

E. Predictive maintenance methodology

Carvalho et al. [10] systematically review machine learning methods applied to predictive maintenance across industries and find decision trees and random forests among the most frequently applied and best-performing families for tabular sensor data, while cautioning that reported accuracy is highly sensitive to class balance. This caution is directly relevant to Section V-G, where class imbalance materially affects fault-class recall on the primary dataset. The foundation for machinery diagnostics and prognostics within condition-based maintenance (CBM) was laid by Jardine, Lin, and Banjevic [11], whose three classic sub-problems (fault detection, isolation, and severity estimation) form the basis of the two-stage approach of this work (Section IV-A).

F. Positioning relative to EngineFaultDB and the original thesis

This work uses EngineFaultDB [1], the first publicly available, peer-reviewed dataset developed specifically for automotive engine fault classification. The initial EngineFaultDB paper presented baseline accuracy for logistic regression, decision tree, random forest, SVM, k-nearest neighbours, and a feed-forward neural network classifier on the 4-class problem presented herein, producing a dataset-level reference performance benchmark. It also uses the original project's primary dataset [9], [12] (19,535 engine sensor readings labelled good/bad) on which the original thesis [9] itself compared twenty MATLAB Classification Learner configurations (Table VII) and found that ensemble methods (Boosted Trees, 66.5%) already outperformed every single decision tree variant tested (best: 65.7%), a finding this paper independently confirms and extends with a full precision/recall/F1/ROC evaluation (Section V-G). This paper builds on both foundations but differs in aim and contribution: rather than benchmarking classifiers in isolation, it (a) reframes the EngineFaultDB task as a two-stage detect-then-diagnose pipeline that mirrors real inspection workflow and exposes a difficulty asymmetry the original single-model benchmarks did not surface, (b) applies class-balanced ensemble learning and a validation-tuned gradient boosting classifier to the primary dataset to address the class-imbalance limitation implicit in the original thesis's per-class results, and (c) translates the resulting models into a publicly deployed, dependency-free diagnostic tool with an explicit accounting of which inputs are realistically available from a generic OBD-II scanner (Sections III-D and VI).

III. Dataset and Preprocessing

A. Primary dataset

The original thesis [9] used a secondary (i.e., pre-existing, third-party) automotive engine health dataset that is publicly available on Kaggle [12]. It contains 19,535 sensor readings, each with six continuous engine parameters and a binary Engine Condition label: GOOD (coded 1) or BAD (coded 0). This paper uses the same dataset. Its structure is summarized in Table I. There are 12,317 GOOD (63.1%) and 7,218 BAD (36.9%) readings, a moderate class imbalance that, as Section V-G demonstrates, significantly degrades fault-class detection unless addressed.

TABLE I.  Primary dataset feature set (n = 19,535), with OBD-II/ECU availability

Feature | Symbol | Description | OBD-II / ECU availability
Engine rpm | R | Engine speed, revolutions per minute | Standard PID 0x0C (SAE J1979)
Fuel pressure | FP | Fuel delivery line pressure | Standard PID 0x0A (SAE J1979, gauge pressure)
Lub oil pressure | LOP | Lubricant oil pressure | Not a standard PID; manufacturer-specific PID or aftermarket sender
lub oil temp | LOT | Lubricant oil temperature | Not a standard PID; manufacturer-specific PID or aftermarket sender
Coolant temp | CT | Engine coolant temperature | Standard PID 0x05 (SAE J1979)
Coolant pressure | CP | Engine coolant system pressure | Not a standard PID; aftermarket sender
Engine Condition (label) | EC | 1 = Good, 0 = Bad | Target label, not a sensor reading

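Three of these six inputs (engine rpm, fuel pressure, and coolant temperature) map to standard SAE J1979 PIDs and can be retrieved with a generic OBD-II scanner. The other three (lubricant oil pressure, lubricant oil temperature, and coolant pressure) are not standard PIDs: they depend on a manufacturer-specific PID or an aftermarket sender, so they are commonly measurable but may not be available on every vehicle. Real-world deployment implications of this availability pattern are discussed in Section III-D.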

Pre-processing followed the methodology of the original thesis: rows with missing data were dropped, which does not distort the inter-variable relationships; statistical outliers were removed; and no smoothing (moving-average or spline) was applied because the dataset contained no material noise. This paper reuses that cleaned dataset unchanged, so the results in Section V-G are directly comparable to those reported in the original thesis.

B. Secondary dataset: EngineFaultDB

To assess the modelling method on an independent, published dataset, we use EngineFaultDB [1]. EngineFaultDB consists of readings taken on a C14NE spark-ignition engine in a controlled laboratory setting, with exhaust gas composition captured using an NGA 6000 gas analyzer. A National Instruments USB-6008 data acquisition device recorded the data, which comprise 55,999 individual readings of 14 continuous variables (Table II). Every reading belongs to one of four condition classes (Table III), and no values were missing.

TABLE II.  Feature set used for classification (EngineFaultDB), with OBD-II/ECU availability

Feature | Unit | Description | OBD-II / ECU availability
MAP | kPa | Manifold Absolute Pressure | Standard PID 0x0B (SAE J1979)
TPS | % | Throttle Position Sensor reading | Standard PID 0x11 (SAE J1979)
Force | N | Engine torque / rotational force | Not available via OBD-II; dynamometer/bench-rig measurement
Power | kW | Rate of energy transfer | Not available via OBD-II; dynamometer/bench-rig measurement
RPM | rpm | Crankshaft revolutions per minute | Standard PID 0x0C (SAE J1979)
Consumption L/H | L/h | Fuel consumption rate | Calculated PID 0x5E on some MY2010+ vehicles; otherwise ECU-specific
Consumption L/100KM | L/100 km | Fuel efficiency over distance | Derived quantity, not a direct PID
Speed | km/h | Vehicle travel speed | Standard PID 0x0D (SAE J1979)
CO | % | Carbon monoxide in exhaust | Not available via OBD-II; requires a laboratory gas analyzer
HC | ppm | Unburnt hydrocarbons in exhaust | Not available via OBD-II; requires a laboratory gas analyzer
CO2 | % | Carbon dioxide in exhaust | Not available via OBD-II; requires a laboratory gas analyzer
O2 | % | Oxygen remaining in exhaust | Related standard PIDs 0x14 to 0x1B report O2-sensor voltage, not exhaust O2 percentage
Lambda | unitless | Air-fuel equivalence ratio | Derivable from wideband PID 0x34 on some MY2010+ vehicles only
AFR | unitless | Air-fuel ratio | Not a direct PID; derivable from Lambda where available

Nine of these fourteen inputs, including all four exhaust-composition variables (CO, HC, CO2, O2) and both dynamometer-derived variables (Force, Power), are not retrievable from a standard consumer-grade OBD-II scanner; they require the gas-analyzer and data-acquisition bench rig used to collect EngineFaultDB [1]. This has a direct, practical consequence for this paper's stated goal of real-world deployability: the fault-subtype model trained on EngineFaultDB (Section VI, dashboard Mode 1) is best suited to a workshop or laboratory setting with the appropriate equipment, whereas the primary dataset's six field-realistic inputs (Table I) are the basis of the dashboard's Mode 2 (Section VI-F), which is the mode this paper recommends for deployment via a generic OBD-II scanner in the field.

TABLE III.  Class distribution (EngineFaultDB, n = 55,999)

Label | Condition | Samples | Share
0 | Normal (good) | 16,000 | 28.6%
1 | Rich air-fuel mixture fault | 10,998 | 19.6%
2 | Lean air-fuel mixture fault | 15,000 | 26.8%
3 | Low ignition voltage fault | 14,001 | 25.0%

C. Data splitting

The dataset was split 80/20 into training (n = 44,800) and held-out test (n = 11,199) sets using a fixed random seed (42) for reproducibility. Given the sample size, the random split preserved class proportions closely (test-set composition: 28.0% normal, 20.0% rich, 26.7% lean, 25.4% low-voltage, within 1 percentage point of the full-dataset proportions in Table III), so no explicit stratification was required.
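As a minimal sketch of this kind of seeded split, the snippet below shuffles row indices once and checks how far the test-set class shares drift from the full-dataset shares. The function names, the use of NumPy's `default_rng`, and the synthetic label vector (built from the class counts in Table III, not from the real readings) are assumptions for illustration; the paper does not state which generator was used, so the exact rows selected would differ.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices once with a fixed seed and cut off the test share."""
    rng = np.random.default_rng(seed)        # generator choice is an assumption
    order = rng.permutation(n_samples)
    n_test = int(n_samples * test_fraction)  # floor: 55,999 * 0.2 -> 11,199
    return order[n_test:], order[:n_test]

def class_shares(labels, n_classes):
    """Fraction of rows belonging to each class label."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

# Synthetic label vector with the class counts of Table III (not the real data).
labels = np.repeat(np.arange(4), [16000, 10998, 15000, 14001])
train_idx, test_idx = split_indices(len(labels))

full_shares = class_shares(labels, 4)
test_shares = class_shares(labels[test_idx], 4)
max_drift = float(np.abs(full_shares - test_shares).max())

print(len(train_idx), len(test_idx))
print("largest class-share drift:", round(max_drift, 4))
```

With roughly 11,000 test rows, the sampling error on each class share is well below one percentage point, which is why an unstratified split can be adequate at this sample size.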

IV. Methodology

A. Problem reframing: a two-stage pipeline

Rather than a single flat classifier over all four classes, the diagnostic task is decomposed into two stages that mirror how a technician actually works (Fig. 1). Stage 1 answers a binary question: is the engine faulty at all? Stage 2, invoked only when Stage 1 predicts a fault, answers which of the known fault types is most likely present. This decomposition allows each stage to be modelled and evaluated on its own terms and, as Section V shows, the two stages exhibit very different levels of difficulty, a structural insight a single flat classifier would obscure.


Fig. 1.  Two-stage diagnostic pipeline architecture.

B. Candidate models

Four classifiers were implemented and compared, all trained from first principles in NumPy to keep the pipeline fully transparent, auditable, and light enough to run entirely client-side in the deployed dashboard (Section VI) without a server-side inference dependency. The fourth, Gradient Boosting, was added specifically to target the primary dataset's harder, imbalanced binary problem (Section V-G) and is not used for the already near-saturated EngineFaultDB Stage 1 task.

1) Gaussian Naive Bayes [8]

For class c and feature vector x, the class-conditional likelihood assumes feature independence and Gaussian-distributed continuous features:

P(x | c) = ∏ᵢ (1 / √(2π·σ²꜀,ᵢ)) · exp( −(xᵢ − μ꜀,ᵢ)² / (2·σ²꜀,ᵢ) )

with the predicted class chosen to maximize the posterior P(c)·P(x|c) via Bayes' rule (log-space for numerical stability). This model serves as a fast probabilistic baseline that explicitly tests the (violated, as Section V shows) independence assumption.
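The log-space decision rule can be sketched as follows. This is an illustrative NumPy version written for this explanation, not the paper's code: the function names, the small variance floor, and the two-blob toy data are assumptions.

```python
import numpy as np

def fit_gnb(X, y, var_floor=1e-9):
    """Estimate per-class feature means, variances, and log-priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_floor (an assumption) guards against zero variance in a feature.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, variances, log_priors

def predict_gnb(X, classes, means, variances, log_priors):
    """Pick argmax_c of log P(c) + sum_i log N(x_i | mu_ci, var_ci)."""
    diff = X[:, None, :] - means[None, :, :]            # (n, n_classes, n_features)
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    return classes[np.argmax(log_lik + log_priors[None, :], axis=1)]

# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(5.0, 1.0, size=(200, 2))])
y_toy = np.repeat([0, 1], 200)

params = fit_gnb(X_toy, y_toy)
toy_accuracy = float(np.mean(predict_gnb(X_toy, *params) == y_toy))
print("toy training accuracy:", toy_accuracy)
```

Summing log-densities instead of multiplying densities avoids numerical underflow when many features are involved.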

2) CART Decision Tree [7]

At each node, the split (feature f, threshold t) is chosen to maximize the Gini impurity reduction:

Gini(S) = 1 − Σ꜀ p꜀²  ,   ΔGini = Gini(S) − (|Sₗ|/|S|)·Gini(Sₗ) − (|Sᵣ|/|S|)·Gini(Sᵣ)

where p꜀ is the proportion of class c in node S, and Sₗ, Sᵣ are the left/right children induced by the split. Growth stops at a maximum depth of 14, or when a node has fewer than 10 samples (min_samples_split) or a candidate split would leave a child with fewer than 5 samples (min_samples_leaf). This configuration replicates the original thesis's decision-tree method on the new data, serving as a like-for-like baseline.
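A small sketch of the split search at a single node is given below. It assumes candidate thresholds at the midpoints between adjacent unique feature values, which the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def gini(labels, n_classes):
    """Gini impurity 1 - sum_c p_c^2 of an integer label vector."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y, n_classes, min_samples_leaf=5):
    """Return (feature, threshold, gain) maximizing the Gini reduction."""
    n = len(y)
    parent = gini(y, n_classes)
    best = (None, None, 0.0)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        # Midpoint thresholds are an assumption made for this sketch.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, f] <= t
            n_left = int(left.sum())
            n_right = n - n_left
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue  # split would violate min_samples_leaf
            gain = (parent
                    - (n_left / n) * gini(y[left], n_classes)
                    - (n_right / n) * gini(y[~left], n_classes))
            if gain > best[2]:
                best = (f, float(t), gain)
    return best

# Toy node: feature 0 is uninformative, feature 1 separates the classes.
X_node = np.column_stack([np.arange(20) % 2, np.arange(20) / 20.0])
y_node = (X_node[:, 1] >= 0.5).astype(int)

feature, threshold, gain = best_split(X_node, y_node, n_classes=2)
print(feature, threshold, gain)
```

On this toy node the parent impurity is 0.5 and the chosen split leaves two pure children, so the reported gain equals the whole parent impurity.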

3) Random Forest [6] (proposed model)

An ensemble of B = 40 CART trees, each trained on an independent bootstrap resample of the training set and restricted, at each split, to a random subset of ⌊√14⌋ = 3 candidate features. Final class probabilities are the mean of the per-tree leaf probability vectors:

P̂(c | x) = (1 / B) · Σᵦ Tᵦ(c | x)

with the predicted class argmax꜀ P̂(c|x). Bagging plus feature subsampling decorrelates the individual trees, reducing variance relative to a single decision tree without materially increasing bias.
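The three ingredients described here (bootstrap resampling, per-split feature subsampling, and probability averaging) can be sketched without a full tree learner. The helper names and the hand-written per-tree probabilities below are illustrative assumptions, not the paper's implementation.

```python
import math
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap resample)."""
    return rng.integers(0, n_samples, size=n_samples)

def candidate_features(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct features to consider at one split."""
    k = math.isqrt(n_features)  # floor(sqrt(14)) = 3
    return rng.choice(n_features, size=k, replace=False)

def forest_predict_proba(per_tree_proba):
    """Average leaf probability vectors over trees: shape (B, n, C) -> (n, C)."""
    return np.mean(per_tree_proba, axis=0)

rng = np.random.default_rng(42)
rows = bootstrap_indices(10, rng)
subset = candidate_features(14, rng)

# Hand-written leaf probabilities from B = 3 hypothetical trees for one sample.
per_tree = np.array([[[0.6, 0.4]],
                     [[0.2, 0.8]],
                     [[0.4, 0.6]]])
mean_proba = forest_predict_proba(per_tree)
predicted_class = int(np.argmax(mean_proba[0]))
print(mean_proba[0], predicted_class)
```

Averaging probabilities rather than taking a majority vote of hard labels keeps a graded confidence score available for the ROC analysis.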

4) Gradient Boosting Classifier [13] (primary-dataset refinement)

For the primary dataset's binary GOOD/BAD problem (Section V-G), an additive ensemble of M shallow CART regression trees is fitted stagewise to the residual of the binomial log-loss. Writing F₀(x) = log(p̄/(1−p̄)) for the log-odds of the training-set base rate p̄, each stage m fits a regression tree hₘ to the negative gradient (residual) rᵢ = yᵢ − σ(Fₘ₋₁(xᵢ)), where σ is the logistic sigmoid, and updates the ensemble additively:

Fₘ(x) = Fₘ₋₁(x) + ν · hₘ(x)  ,   P̂(GOOD | x) = σ(F_M(x)) = 1 / (1 + e^(−F_M(x)))

with learning rate ν = 0.06, M = 150 trees, and each tree limited to max depth 4 and a minimum of 8 samples per leaf. Unlike Random Forest's averaging of independent trees, boosting builds trees sequentially, each one correcting the errors of the ensemble so far, which typically yields sharper decision boundaries on harder, imbalanced problems such as the primary dataset's BAD-class detection task, at the cost of requiring careful regularization (shallow trees, a small learning rate) to avoid overfitting.
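To make the stagewise update concrete, the sketch below fits each stage to the residual y − σ(F) and adds the scaled prediction to the running log-odds. To stay short it substitutes depth-1 regression stumps for the depth-4 trees described above and uses 30 stages on toy data; these substitutions, the function names, and the data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(X, r):
    """Least-squares depth-1 regression tree: (feature, threshold, left, right)."""
    best, best_sse = None, np.inf
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, f] <= t
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best_sse:
                best, best_sse = (f, t, lv, rv), sse
    return best

def stump_predict(stump, X):
    f, t, lv, rv = stump
    return np.where(X[:, f] <= t, lv, rv)

def fit_gb(X, y, n_stages=150, learning_rate=0.06):
    """Stagewise fit: F_m = F_{m-1} + nu * h_m, with h_m fitted to y - sigmoid(F)."""
    p_bar = y.mean()
    f0 = float(np.log(p_bar / (1.0 - p_bar)))  # log-odds of the base rate
    F = np.full(len(y), f0)
    stumps = []
    for _ in range(n_stages):
        residual = y - sigmoid(F)              # negative gradient of the log-loss
        stump = fit_stump(X, residual)
        F = F + learning_rate * stump_predict(stump, X)
        stumps.append(stump)
    return {"f0": f0, "lr": learning_rate, "stumps": stumps}

def predict_proba_gb(model, X):
    F = np.full(X.shape[0], model["f0"])
    for stump in model["stumps"]:
        F = F + model["lr"] * stump_predict(stump, X)
    return sigmoid(F)

# Toy data: the label depends on the sign of feature 0; feature 1 is noise.
rng = np.random.default_rng(0)
X_toy = np.column_stack([np.linspace(-1.0, 1.0, 100), rng.normal(size=100)])
y_toy = (X_toy[:, 0] > 0).astype(float)

gb_model = fit_gb(X_toy, y_toy, n_stages=30)
p_good = predict_proba_gb(gb_model, X_toy)
toy_accuracy = float(np.mean((p_good >= 0.5) == (y_toy == 1)))
print("toy training accuracy:", toy_accuracy)
```

The small learning rate means each stage moves the log-odds only slightly, which is the regularization trade-off the paragraph above refers to.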

Because the default 0.5 probability threshold is not necessarily optimal under class imbalance, the classification threshold applied to P̂(GOOD | x) was tuned on a held-out validation split (Section IV-E) rather than fixed a priori or tuned on the test set, giving a final threshold of 0.59.

C. Model selection rationale

Random Forest was preferred over a higher-capacity alternative such as a feed-forward neural network for three reasons tied directly to this project's deployment goal, consistent with the conclusion of Hossain et al. [5] that ensemble tree methods offer the best accuracy-to-interpretability trade-off for maintenance-facing tools: (i) it outperforms a single decision tree on noisy sensor data (Section V-A) without a large increase in training cost; (ii) it exposes per-feature importance scores that map directly onto the dashboard's root-cause explanation (Section VI-D), which a neural network would not provide without additional tooling such as SHAP; and (iii) its inference cost is low enough to run entirely client-side in a browser, avoiding a server dependency for the deployed tool.

Gradient Boosting was added as a fourth, narrowly targeted model, not as a replacement for Random Forest. It addresses the class imbalance in the primary dataset (Section III-A), where accuracy alone hides the trade-off the Random Forest makes between BAD-class precision and recall (Section V-G). The final increment of improvement comes from boosted sequential error-correction combined with a validation-tuned threshold, and required no additional features or data.

D. Hyperparameters

TABLE IV.  Model hyperparameters

Model | Setting | Value
Decision Tree | max_depth / min_samples_split / min_samples_leaf | 14 / 10 / 5
Random Forest (4-class benchmark) | n_estimators / max_depth / min_samples_split / min_samples_leaf / max_features | 40 / 14 / 10 / 5 / ⌊√14⌋ = 3
Random Forest, Stage 1 (binary) | n_estimators / max_depth / min_samples_split / min_samples_leaf | 20 / 8 / 20 / 10
Random Forest, Stage 2 (subtype) | n_estimators / max_depth / min_samples_split / min_samples_leaf | 25 / 10 / 15 / 8
Gradient Boosting (primary dataset) | n_estimators / max_depth / learning_rate / min_samples_split / min_samples_leaf / threshold | 150 / 4 / 0.06 / 15 / 8 / 0.59 (validation-tuned)
All models | Random seed / train-test split | 42 / 80% train, 20% test (60/10/30 for the validation-tuned Gradient Boosting model)

E. Evaluation metrics

For each class c, with true positives TP꜀, false positives FP꜀, and false negatives FN꜀:

Precision꜀ = TP꜀ / (TP꜀ + FP꜀)      Recall꜀ = TP꜀ / (TP꜀ + FN꜀)      F1꜀ = 2·Precision꜀·Recall꜀ / (Precision꜀ + Recall꜀)

Macro-F1 is the unweighted mean of per-class F1 scores. Receiver Operating Characteristic (ROC) curves were computed one-vs-rest per class by sweeping the predicted-probability threshold and plotting True Positive Rate against False Positive Rate; Area Under the Curve (AUC) was computed via the trapezoidal rule.
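These definitions can be checked on small hand-made vectors. The sketch below is an illustrative NumPy version (function names and toy labels are assumptions); the AUC uses the trapezoidal rule over the thresholds induced by the distinct scores.

```python
import numpy as np

def per_class_prf(y_true, y_pred, n_classes):
    """Return a list of (precision, recall, F1) tuples, one per class."""
    results = []
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    return float(np.mean([f1 for _, _, f1 in per_class_prf(y_true, y_pred, n_classes)]))

def roc_auc_trapezoid(y_binary, scores):
    """One-vs-rest AUC: sweep thresholds high to low, integrate TPR over FPR."""
    n_pos = np.sum(y_binary == 1)
    n_neg = np.sum(y_binary == 0)
    tpr, fpr = [0.0], [0.0]
    for t in np.sort(np.unique(scores))[::-1]:
        flagged = scores >= t
        tpr.append(np.sum(flagged & (y_binary == 1)) / n_pos)
        fpr.append(np.sum(flagged & (y_binary == 0)) / n_neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy 3-class predictions (illustrative only).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
toy_macro_f1 = macro_f1(y_true, y_pred, 3)

# Toy binary scores for one class against the rest.
toy_auc = roc_auc_trapezoid(np.array([0, 0, 1, 1]),
                            np.array([0.1, 0.4, 0.35, 0.8]))
print(round(toy_macro_f1, 4), toy_auc)
```

In the toy example the per-class F1 scores are 0.5, 0.8, and 2/3, so the macro average weights the weakest class as heavily as the strongest, which is why the paper reports it alongside accuracy.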

For the Gradient Boosting model specifically (Section IV-B.4), the primary dataset was split 60/10/30 into training, validation, and held-out test sets (seed 42). The classification threshold applied to P̂(GOOD | x) was selected to maximize macro-F1 on the validation split only, then applied unchanged to the untouched test split; no threshold, hyperparameter, or model-selection decision was made using test-set labels, avoiding the threshold-leakage that would otherwise inflate the reported test performance.
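The leakage-free threshold selection described above can be sketched as follows. The 0.01 grid step, the function names, and the toy validation and test probabilities are assumptions; the paper does not state the grid it swept.

```python
import numpy as np

def macro_f1_binary(y_true, y_pred):
    """Macro-F1 over the two classes, using F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in (0, 1):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def tune_threshold(y_val, p_good_val, grid=None):
    """Pick the threshold on P(GOOD | x) maximizing macro-F1 on validation data."""
    if grid is None:
        grid = np.round(np.arange(0.01, 1.00, 0.01), 2)  # assumed grid
    best_t, best_score = 0.5, -1.0
    for t in grid:
        pred = (p_good_val >= t).astype(int)             # 1 = GOOD, 0 = BAD
        score = macro_f1_binary(y_val, pred)
        if score > best_score:
            best_t, best_score = float(t), score
    return best_t, best_score

# Toy validation split whose probabilities are skewed toward GOOD.
y_val = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p_val = np.array([0.90, 0.80, 0.70, 0.65, 0.60, 0.55, 0.30, 0.20])
best_t, best_score = tune_threshold(y_val, p_val)

# The chosen threshold is then applied, unchanged, to the test split.
p_test = np.array([0.95, 0.62, 0.58, 0.10])
y_test_pred = (p_test >= best_t).astype(int)
print(best_t, best_score, y_test_pred)
```

Because the test labels never enter `tune_threshold`, the reported test metrics are not optimistically biased by the threshold search.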

V. Results

A. Overall model comparison

TABLE V.  Model comparison, full 4-class problem (held-out test set, n = 11,199)

Model | Accuracy | Macro F1 | Binary acc. | Notes
Gaussian Naive Bayes | 38.5% | 0.347 | - | Independence assumption violated
Decision Tree | 74.5% | 0.740 | 99.96% | Baseline (original thesis method)
Random Forest | 74.3% | 0.752 | 100.00% | Proposed model


Fig. 2.  Accuracy and macro-F1 by model, 4-class problem.

B. Confusion matrix

Fig. 3 shows the row-normalized confusion matrix for the Random Forest model on the held-out test set. Normal and rich-mixture classes are classified with 100% accuracy; lean-mixture and low-voltage classes are frequently confused with each other, with roughly half of each misclassified as the other.


Fig. 3.  Random Forest confusion matrix (row-normalized), 4-class test set.

C. ROC analysis

One-vs-rest ROC curves (Fig. 4) mirror the confusion matrix: separability is excellent for the Normal and Rich-mixture classes (AUC ≈ 1.00), while AUC is much lower, although still above chance, for the Lean-mixture and Low-voltage classes, which the confusion results show are partly confusable with each other.


Fig. 4.  One-vs-rest ROC curves, Random Forest, 4-class test set.

D. Two-stage pipeline performance

Stage 1 (binary condition detection): 100.0% accuracy, macro-F1 = 1.00 (n = 11,199).

Stage 2 (fault-subtype identification), evaluated on faulty samples only (n = 8,065): 64.8% accuracy, macro-F1 = 0.675.

TABLE VI.  Stage-2 per-class performance (fault subtype identification)

Fault type | Precision | Recall | F1 | Support
Rich mixture | 1.00 | 1.00 | 1.00 | 2,235
Lean mixture | 0.52 | 0.53 | 0.52 | 2,990
Low ignition voltage | 0.49 | 0.48 | 0.49 | 2,840

End-to-end pipeline accuracy on the original 4-class labels (Stage 1 → Stage 2 combined): 74.7%.
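As a consistency check, the end-to-end figure can be recovered from the two stage results. The arithmetic below assumes the rounded values reported above are exact and that every Stage 1 decision is correct, so that all normal samples are right and all faulty samples reach Stage 2; the variable names are illustrative.

```python
n_test = 11_199                 # held-out test samples
n_faulty = 8_065                # Stage 2 support (2,235 + 2,990 + 2,840)
n_normal = n_test - n_faulty    # samples resolved by Stage 1 alone

stage1_accuracy = 1.000         # reported, treated as exact (assumption)
stage2_accuracy = 0.648         # reported, rounded to three digits

# Normal samples are correct when Stage 1 is correct; faulty samples are
# correct only when Stage 2 also names the right subtype.
expected_correct = n_normal * stage1_accuracy + n_faulty * stage2_accuracy
end_to_end = expected_correct / n_test

print(n_normal, f"{end_to_end:.1%}")
```

The result agrees with the reported 74.7% to the stated precision, which indicates that the end-to-end figure is essentially the Stage 2 accuracy diluted by the perfectly classified normal samples.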

E. Key finding

Rich-mixture faults are perfectly separable from all other classes using the 14 steady-state variables, across the tested models (Table VI, Fig. 3, Fig. 4). Lean-mixture and low-voltage faults, on the other hand, overlap heavily with one another regardless of the classifier used, which indicates a characteristic of the feature space rather than of any particular algorithm. An instantaneous reading of OBD-style information is therefore enough to reliably detect that something is wrong (Stage 1 is solved), but reliably discriminating between certain fault subtypes will likely require additional signal information, e.g., short time-series or derivative features, or another sensor channel. This is a concrete, testable direction for future work (Section VIII).

F. Feature importance

Ranked by mean impurity decrease in the Stage-1 Random Forest (all data), the leading features are CO (0.159), Force (0.123), fuel consumption L/h (0.089), HC (0.088), and fuel consumption L/100 km (0.087). This ranking is physically plausible: CO and HC are measures of combustion mixture quality, while Force and the two fuel-consumption variables reflect the effect of a fault on the engine's output.


Fig. 5.  Random Forest feature importance, Stage-1 binary model.

G. Primary dataset benchmark

The same three-model comparison was rerun on the primary project dataset [9], [12] (19,535 records, six attributes, binary GOOD/BAD target), using the same 70/30 train/test split as the original thesis for fair comparability. For reference, Table VII reproduces results from the original thesis's own comparison of twenty MATLAB Classification Learner configurations, which already indicated the advantage of an ensemble approach (Boosted Trees) over every single-tree variant examined.

TABLE VII.  Original thesis's comparative ML model results (MATLAB Classification Learner) [9]

Model | Accuracy
Linear SVM | 65.5%
Medium Gaussian SVM | 66.1%
Coarse Gaussian SVM | 65.5%
Coarse KNN | 66.0%
Boosted Trees (best overall) | 66.5%
Fine Decision Tree | 65.4%
Coarse Decision Tree | 65.5%
Medium Decision Tree | 65.7%
Bagged Trees | 63.2%
RUSBoosted Trees | 62.3%

The same pattern appears in this paper's from-scratch benchmark (Table VIII): a standard Random Forest achieves comparable accuracy (66.3% vs. 65.7%) but a slightly lower macro-F1 than the original thesis's decision tree (0.594 vs. 0.607), because its BAD-class recall is weaker (34.5%, compared to 40.6% in the original thesis). Class-balanced bootstrap sampling (Section IV-B.3) addresses this directly: macro-F1 reaches 0.623, higher than the original decision tree, and BAD-class recall almost doubles to 67.2%, while overall accuracy falls only to 63.1% (from 66.3% for the standard Random Forest). For a fault detection system, where a false negative is a costlier failure than a false alarm, this is a more relevant trade-off than raw accuracy alone, and the Gradient Boosting result presented next improves it further.
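One common way to realize class-balanced bootstrap sampling is to draw the same number of rows, with replacement, from each class for every tree. The sketch below illustrates that idea; the paper does not give the exact sampling scheme, so the per-class sample size (the minority-class count), the function name, and the synthetic labels are assumptions.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng, n_per_class=None):
    """Draw an equal number of rows (with replacement) from every class."""
    classes = np.unique(y)
    if n_per_class is None:
        # Assumption: match the size of the smallest class.
        n_per_class = min(int(np.sum(y == c)) for c in classes)
    parts = [rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
             for c in classes]
    idx = np.concatenate(parts)
    rng.shuffle(idx)  # mix the classes so row order carries no label signal
    return idx

# Synthetic labels with the primary dataset's class counts (1 = GOOD, 0 = BAD).
y_primary = np.concatenate([np.ones(12317, dtype=int), np.zeros(7218, dtype=int)])
rng = np.random.default_rng(42)

sample_idx = balanced_bootstrap_indices(y_primary, rng)
sample_counts = np.bincount(y_primary[sample_idx], minlength=2)
print(sample_counts)
```

Each tree then sees BAD and GOOD readings in equal proportion, which shifts its leaf probabilities toward the minority class and explains the recall-for-accuracy exchange reported above.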

TABLE VIII.  Model comparison, primary dataset (held-out test set, n = 5,860)

Model | Accuracy | Macro F1 | BAD recall | Notes
Gaussian Naive Bayes | 66.3% | 0.587 | 32.2% | Independence assumption violated
Decision Tree (original thesis [9]) | 65.7% | 0.607 | 40.6% | MATLAB, reported baseline
Decision Tree (this study) | 64.3% | 0.586 | N/A | NumPy re-implementation
Random Forest (standard) | 66.3% | 0.594 | 34.5% | Matches accuracy, weak minority recall
Random Forest (class-balanced) | 63.1% | 0.623 | 67.2% | Improves recall over standard RF
Gradient Boosting (validation-tuned) | 64.4% | 0.632 | 63.8% | Best macro-F1; recommended for deployment

To close that gap further, a Gradient Boosting Classifier [13] (Section IV-B.4) was trained on the same six-feature primary dataset with an explicit 60/10/30 train/validation/test split (Section IV-E), so that its classification threshold could be tuned without test-set labels influencing the choice. Sweeping the threshold applied to P̂(GOOD | x) between 0 and 1 on the validation set and choosing the value that maximizes validation macro-F1 gives an optimal threshold of 0.59 (i.e., a reading is classified as GOOD only when the predicted probability is at least 0.59, rather than the default 0.5). On the untouched test set, this model reaches 64.4% accuracy and macro-F1 0.632 (Table VIII), the best macro-F1 of all models on this dataset, with closely matched BAD-class and GOOD-class recall (63.8% and 64.7%). This is the model this paper recommends for the primary dataset, and it is the one running as Mode 2 of the dashboard (Section VI-F).


Fig. 6.  Accuracy and macro-F1 by model, primary dataset (red bar marks the best macro-F1: Gradient Boosting).


Fig. 7.  Gradient Boosting confusion matrix, primary dataset (recommended model).


Fig. 8.  Gradient Boosting ROC curve, primary dataset (AUC = 0.696).

On the primary dataset, Random Forest feature importance (Fig. 9) provides additional empirical support for the main qualitative finding of the original thesis, without the model being given any prior knowledge of which features matter. Engine RPM dominates (0.387), consistent with the thesis's description of it as “the most influential prediction feature for engine conditions” [9], followed by Fuel Pressure (0.160), Lubricant Oil Temperature (0.137), Lubricant Oil Pressure (0.111), Coolant Pressure (0.107), and Coolant Temperature (0.099). This agreement between an independent ensemble model and the original single-tree analysis is itself a form of validation of the original thesis's threshold-based root-cause logic (Section III-A), which was built around exactly this RPM-led hierarchy. A split-frequency-based importance measure computed for the Gradient Boosting model (Section V-G) ranks the same feature highest (Engine RPM, 0.29), followed by Lubricant Oil Temperature (0.21) and Fuel Pressure (0.19), corroborating this ranking across two structurally different ensemble methods.


Fig. 9.  Random Forest feature importance, primary dataset.

VI. Diagnostic Dashboard

To translate the trained models into a tool usable by a working technician, both the EngineFaultDB two-stage Random Forest pipeline and the primary-dataset Gradient Boosting model were deployed together, as two selectable modes of a single self-contained, publicly hosted web application, live at: https://engine-dashboard.netlify.app/

A. System architecture

All trained tree ensembles, Random Forest for both EngineFaultDB stages and the Gradient Boosting model for the primary dataset, are serialized to JSON (per-node split feature, threshold, and left/right child references, plus the initial log-odds and learning rate for the boosting model) and embedded directly in the page. Inference is performed entirely client-side in JavaScript by replicating the same tree-traversal logic used during training in Python/NumPy: for a feature vector x reaching node n, descend left if x[n.feature] ≤ n.threshold, else right, until a leaf is reached; Random Forest predictions average leaf probability vectors across all trees (Section IV-B.3), while the Gradient Boosting prediction sums each tree's leaf value, scaled by the learning rate, onto a running log-odds total before applying the logistic sigmoid (Section IV-B.4). No server, database, or network call is required after the page loads, so the tool functions fully offline and imposes no ongoing hosting cost or latency.
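The traversal and aggregation rules can be sketched as below. The deployed dashboard performs this step in JavaScript; the version here is a Python analogue for illustration, and the node field names (`feature`, `threshold`, `left`, `right`, `value`), the model keys, and the tiny hand-built trees are assumptions rather than the dashboard's actual JSON schema.

```python
import math

def traverse(tree, x):
    """Walk from the root (index 0) to a leaf and return the leaf's value."""
    node = tree[0]
    while "value" not in node:                 # internal nodes carry a split
        go_left = x[node["feature"]] <= node["threshold"]
        node = tree[node["left"] if go_left else node["right"]]
    return node["value"]

def forest_proba(trees, x):
    """Random Forest: average the leaf probability vectors over all trees."""
    leaves = [traverse(tree, x) for tree in trees]
    n_classes = len(leaves[0])
    return [sum(leaf[c] for leaf in leaves) / len(leaves) for c in range(n_classes)]

def boosting_proba(model, x):
    """Gradient Boosting: add scaled leaf values to the log-odds, then sigmoid."""
    total = model["initial_log_odds"]
    for tree in model["trees"]:
        total += model["learning_rate"] * traverse(tree, x)
    return 1.0 / (1.0 + math.exp(-total))

# Hand-built toy trees (hypothetical schema: a list of nodes with index links).
tree_a = [{"feature": 0, "threshold": 1.0, "left": 1, "right": 2},
          {"value": [0.9, 0.1]}, {"value": [0.2, 0.8]}]
tree_b = [{"feature": 1, "threshold": 0.5, "left": 1, "right": 2},
          {"value": [0.7, 0.3]}, {"value": [0.1, 0.9]}]
reg_tree = [{"feature": 0, "threshold": 1.0, "left": 1, "right": 2},
            {"value": -0.5}, {"value": 0.5}]
gb_toy = {"initial_log_odds": 0.0, "learning_rate": 0.06,
          "trees": [reg_tree, reg_tree]}

x = [2.0, 0.0]
rf_out = forest_proba([tree_a, tree_b], x)
gb_out = boosting_proba(gb_toy, x)
print(rf_out, gb_out)
```

Because inference is only comparisons, additions, and one exponential, the same logic ports directly to the browser, which is what makes a server-free deployment practical.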

B. Use cases

C. Interface layout

A pair of mode-selector tabs at the top of the page lets the user switch between Mode 1 (EngineFaultDB, fault-subtype diagnosis) and Mode 2 (primary dataset, GOOD/BAD). Both modes share the same Input Panel plus Result Panel layout, sketched in Fig. 10.


Fig. 10.  Schematic layout of the deployed dashboard interface, shared by both modes.

D. Running a new diagnosis (Mode 1: EngineFaultDB)

A new diagnosis is entered and then run in the left-hand Input Panel (Fig. 10).

E. Recommended checks by fault type

TABLE IX.  Fault type to recommended-check mapping shown in the dashboard

| Predicted fault | Suggested engineer action |
| --- | --- |
| Rich air-fuel mixture | Check fuel injectors for leakage/over-delivery, inspect the O2 sensor and fuel pressure regulator, verify MAF/MAP sensor calibration. |
| Lean air-fuel mixture | Check for vacuum/intake leaks, inspect fuel delivery (pump pressure, clogged injectors/filter), verify O2 sensor response. |
| Low ignition voltage | Inspect ignition coil(s), spark plugs and gaps, battery/alternator charging voltage, and ignition wiring/connectors. |
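One way the detect-then-diagnose flow could feed this lookup is sketched below. It is a minimal illustration rather than the dashboard's code: `detect_fault` and `classify_fault` are hypothetical stand-ins for the two Random Forest stages, the result keys are invented for the example, and the check texts are abbreviated from Table IX.

```python
from typing import Callable, Dict, Optional, Sequence

# Abbreviated from Table IX; keys are the three fault subtypes.
RECOMMENDED_CHECKS: Dict[str, str] = {
    "Rich air-fuel mixture": "Fuel injectors, O2 sensor, fuel pressure regulator, MAF/MAP calibration",
    "Lean air-fuel mixture": "Vacuum/intake leaks, fuel delivery, O2 sensor response",
    "Low ignition voltage": "Ignition coils, spark plugs, charging voltage, ignition wiring",
}

def diagnose(
    x: Sequence[float],
    detect_fault: Callable[[Sequence[float]], bool],
    classify_fault: Callable[[Sequence[float]], str],
) -> Dict[str, Optional[str]]:
    """Stage 1 decides whether a fault exists; stage 2 runs only if it does."""
    if not detect_fault(x):
        return {"condition": "normal", "fault": None, "check": None}
    fault = classify_fault(x)
    return {"condition": "faulty", "fault": fault, "check": RECOMMENDED_CHECKS[fault]}

# Stub stages for illustration only (real stages would be the trained forests).
healthy = diagnose([0.0], lambda x: False, lambda x: "Low ignition voltage")
faulty = diagnose([1.0], lambda x: True, lambda x: "Lean air-fuel mixture")
```

Keeping the lookup separate from the classifiers means the recommended checks can be revised without retraining or re-serializing either model.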

F. Second diagnostic mode: primary dataset (Good/Bad, Gradient Boosting)

Mode 2 exposes the primary dataset's Gradient Boosting model (Section IV-B.4) using the same Input Panel / Result Panel layout as Mode 1:

TABLE X.  Feature to recommended-check mapping shown in Mode 2 of the dashboard

| Contributing reading | Suggested engineer action |
| --- | --- |
| Engine rpm | Check the idle air control (IAC) valve and throttle body for carbon buildup; inspect for vacuum leaks causing unstable idle; abnormally high sustained RPM can indicate a throttle linkage or governor fault. |
| Fuel pressure | Inspect fuel pump output pressure, the fuel pressure regulator, and the fuel filter for clogging. |
| Lub oil pressure | Check oil level and viscosity, oil pump condition, and the pressure relief valve; low oil pressure combined with high oil temperature is a strong indicator of lubrication-system failure. |
| lub oil temp | Inspect the oil cooler, oil level, and lubrication circuit for restrictions. |
| Coolant temp | Check coolant level, radiator and thermostat operation, water pump, and cooling fan. |
| Coolant pressure | Inspect the radiator cap and cooling system for leaks, and check head gasket integrity. |

VII. Discussion

The results support five claims relevant to this paper's contribution. First, on EngineFaultDB, upgrading from a single decision tree to a Random Forest yields a measurable, if modest, macro-F1 gain (0.740 to 0.752) while achieving perfect binary detection, at negligible additional computational or interpretability cost. This is a favourable trade-off for a deployed tool and is consistent with the broader literature on ensemble methods for vehicle fault diagnosis [4], [5], [10]. Second, reframing the task as two stages surfaces a finding that a single flat classifier would have hidden: the difficulty in this domain is concentrated entirely in distinguishing specific fault subtypes with overlapping steady-state signatures, not in detecting that a fault exists at all. The ROC-AUC gap between the Rich-mixture class (≈ 1.00) and the Lean/Low-voltage classes (Fig. 4) makes this quantitatively explicit. Third, on the primary dataset, the comparison between the standard and class-balanced Random Forest (Table VIII) illustrates that raw accuracy can be a misleading measure of fault-detection performance. The standard model matches the accuracy reported in the original thesis but, without a deliberate correction for class imbalance, also inherits its majority-class bias. The class-balanced model nearly doubles recall on the class that matters in practice (BAD engines), at a cost in accuracy that a practitioner would likely accept given the asymmetric costs of fault-detection errors. Fourth, a validation-tuned Gradient Boosting Classifier improves on this trade-off: it achieves the highest macro-F1 (0.632) of any model tested on the primary dataset and closely balances BAD- and GOOD-class recall, with the decision threshold selected without any reference to the test set, thereby directly addressing the accuracy limitation noted in an earlier version of this work. Fifth, packaging both models as a two-mode, zero-install browser tool shows that ensemble and boosted tree models, unlike deep neural networks, can be deployed to end users without back-end infrastructure, which matters for independent technicians and small repair shops that lack dedicated IT support. The OBD-II PID mapping in Section III-D then lets a user select the mode that matches the readings actually available.
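The validation-only threshold selection behind the fourth claim can be illustrated with a short sketch. It assumes macro-F1 is the selection criterion and that label 1 denotes the BAD class; neither is stated in this section, and the probabilities, labels, and candidate grid are toy values rather than results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for binary labels 0/1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def apply_threshold(probs, threshold):
    """Label a sample 1 (assumed BAD) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def tune_threshold(val_probs, val_labels, candidates):
    """Pick the candidate threshold with the best macro-F1 on validation data."""
    best_t, best_score = None, -1.0
    for t in candidates:
        score = macro_f1(val_labels, apply_threshold(val_probs, t))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy validation split (hypothetical model outputs and labels).
val_probs = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
val_labels = [0, 0, 1, 0, 1, 1]
best_t, best_score = tune_threshold(val_probs, val_labels, [0.3, 0.4, 0.5])

# The threshold is now frozen; the test split is scored once and never
# influences the choice of threshold.
test_probs = [0.20, 0.45, 0.80]
test_preds = apply_threshold(test_probs, best_t)
```

Selecting the threshold on a separate split keeps the reported test metrics an unbiased estimate of field performance, which is the property the fourth claim relies on.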

Comparing the two datasets also clarifies what each contributes, and the availability analysis in Section III-D makes the trade-off explicit rather than implicit. Nine of the 14 EngineFaultDB variables, collected by a dedicated sensor suite (for example, exhaust gas composition), cannot be retrieved in the field with a generic OBD-II scanner, yet they are required for Mode 1 of the dashboard to perform fine-grained root-cause identification (Section VI-D). The primary dataset [9], [12] has six inputs, all of which correspond to standard or commonly available OBD-II/ECU parameters (Table I). It supports only the coarser GOOD/BAD distinction that the original thesis targeted, exposed as Mode 2 of the dashboard (Section VI-F), but it is the mode this paper recommends for field use with a generic scanner. The two are not combined into a single model in this paper (Section VIII) precisely because they answer different, complementary questions under different equipment constraints.

VIII. Limitations and Future Work

IX. Conclusion

This paper directly addresses the future-work recommendations of the original thesis [9]. It benchmarks alternative machine learning models against the original decision tree on the original dataset [9], [12], and confirms that ensemble and boosting methods, specifically a class-balanced Random Forest followed by a validation-tuned Gradient Boosting Classifier, improve macro-F1 (0.607 to 0.632) and nearly double fault-detection recall (34.5% to 63.8%) while independently validating Engine RPM as the dominant diagnostic signal the original thesis identified. It further extends the underlying approach into a two-stage Random Forest pipeline, validated on the peer-reviewed EngineFaultDB dataset [1], that detects faulty engine conditions with perfect accuracy and additionally provides a probable root cause, closing the gap between “something is wrong” and “here is what to check.” Every input feature used by both models is explicitly mapped to its OBD-II or ECU availability (Section III-D), so that the more field-realistic six-parameter model is clearly identified as the one intended for deployment via a generic scanner. Both contributions are benchmarked against classical baselines with full mathematical formulation, and both trained models are packaged as a single working, dependency-free, two-mode diagnostic dashboard publicly deployed at https://engine-dashboard.netlify.app/, demonstrating a practical path from a classroom classification exercise to a tool a working technician could use today.






References

[1] M. Vergara, L. Ramos, N. D. Rivera-Campoverde, and F. Rivas-Echeverría, “EngineFaultDB: A novel dataset for automotive engine fault classification and baseline results,” IEEE Access, vol. 11, pp. 126155–126171, 2023, doi: 10.1109/ACCESS.2023.3331316.

[2] E. T. Michailidis, A. Panagiotopoulou, and A. Papadakis, “A review of OBD-II-based machine learning applications for sustainable, efficient, secure, and safe vehicle driving,” Sensors, vol. 25, no. 13, art. 4057, 2025, doi: 10.3390/s25134057.

[3] A. Pérez-Vázquez, M. Anzures-García, and L. A. Sánchez-Gálvez, “Vehicle engine fault diagnosis approach based on a decision tree and knowledge base,” Int. J. Combinatorial Optimization Problems and Informatics, vol. 15, no. 2, pp. 185–197, 2024, doi: 10.61467/2007.1558.2024.v15i2.450.

[4] R. Quan, J. Zhang, and Z. Feng, “Remote fault diagnosis for the powertrain system of fuel cell vehicles based on random forest optimized with a genetic algorithm,” Sensors, vol. 24, no. 4, art. 1138, 2024, doi: 10.3390/s24041138.

[5] M. N. Hossain, M. M. Rahman, and D. Ramasamy, “Artificial intelligence-driven vehicle fault diagnosis to revolutionize automotive maintenance: A review,” Computer Modeling in Engineering & Sciences, vol. 141, no. 2, pp. 951–996, 2024, doi: 10.32604/cmes.2024.056022.

[6] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.

[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA, USA: Wadsworth & Brooks/Cole, 1984.

[8] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proc. 11th Conf. Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, 1995, pp. 338–345.

[9] N. Umurie, “Development of a decision tree-based predictive model for condition-based fault detection in automobile engines,” B.Eng. dissertation, Dept. Mechanical Eng., Covenant Univ., Ota, Nigeria, 2025.

[10] T. P. Carvalho, F. A. A. M. N. Soares, R. Vita, R. P. Francisco, J. P. Basto, and S. G. S. Alcala, “A systematic literature review of machine learning methods applied to predictive maintenance,” Computers & Industrial Engineering, vol. 137, art. 106024, 2019, doi: 10.1016/j.cie.2019.106024.

[11] A. K. S. Jardine, D. Lin, and D. Banjevic, “A review on machinery diagnostics and prognostics implementing condition-based maintenance,” Mechanical Systems and Signal Processing, vol. 20, no. 7, pp. 1483–1510, 2006, doi: 10.1016/j.ymssp.2005.09.012.

[12] P. Modi, “Automotive vehicles engine health dataset,” Kaggle, 2023. [Online]. Available: https://www.kaggle.com/datasets/parvmodi/automotive-vehicles-engine-health-dataset

[13] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, 2001, doi: 10.1214/aos/1013203451.