Article View
Machine Learning-Based Identification of Key Bacterial Biomarkers during Anaerobic Succession in Rice Paddy Soil
Ji-Hoon Lee1, 2*
1Department of Bioenvironmental Chemistry, Jeonbuk National University, Jeonju, Korea
2Department of Agricultural Convergence Technology, Jeonbuk National University, Jeonju, Korea
Abstract
Characterizing microbial community succession in anaerobic environments is essential for elucidating ecosystem dynamics and establishing predictive ecological models. However, identifying specific bacterial biomarkers that drive these temporal transitions remains a significant challenge, as traditional univariate statistical methods often struggle with the high dimensionality and small sample sizes typical of microbiome datasets. In this study, a Random Forest (RF) classifier was applied to high-throughput sequencing data from anaerobically incubated paddy soil across three time points (days 1, 15, and 56) to identify core predictive operational taxonomic units (OTUs) that govern microbial succession. From a total of 13,497 OTUs, the RF model successfully identified 30 key OTUs whose combined feature importances enabled temporal classification. Leave-one-out cross-validation (LOOCV) demonstrated a classification accuracy of 66.67% using the top 30 OTUs, compared to 0% when using all OTUs. The high scored taxa such as Deltaproteobacteria and Rhodospirillales provided consistent results with transition to anaerobic conditions. This study suggests that machine learning-based feature selection provides a robust complement to typical statistical methods for biomarker discovery in time-series microbiome research.
Keyword
Anaerobic succession,Biomarker,LOOCV,Microbial community,Random Forest
Introduction
The general objectives of microbial community investigation include characterizing the composition and distribution of microbial communities. In addition, another important goal might be to identify microorganisms that distinguish between groups. Next-generation sequencing technologies enable high-dimensional characterization of microbial community structure, often including hundreds to tens of thousands of taxa per sample. Nevertheless, the extraction of biologically meaningful signals from such high-dimensional and sparse datasets remains a considerable analytical challenge [1].
In many cases, it is necessary to determine the microbial taxa that separate groups or differentiate samples, and statistical approaches such as analysis of variance (ANOVA), non-parametric tests (Kruskal-Wallis), and linear discriminant analysis effect size (LEfSe) are frequently employed for this purpose. These are fundamentally univariate or require strict statistical assumptions that are difficult to satisfy when sample sizes are small relative to the number of variables [2]. These methods evaluate each OTU independently, without considering inter-taxa interactions and non-linear relationships that may be critical for understanding community-level dynamics. Therefore, these methods often fail to produce meaningful results because of statistical limitations. To address this issue, the present study aimed to apply a machine learning (ML)-based approach for selecting microbial taxa that contribute to differences between groups.
Machine learning approaches, particularly ensemble methods such as Random Forest (RF), have emerged as powerful alternatives for microbiome data analysis [3-5]. RF models account for complex, non-linear interactions among variables (OTUs) and are robust to high dimensionality and data sparsity. Furthermore, the RF algorithm provides an interpretable measure of variable importance (feature importance), which can be used to identify OTUs that contribute most substantially to classification accuracy, thereby facilitating data-driven biomarker discovery.
Microbial communities in flooded sediment environments such as rice paddy soils can be dynamic, since redox conditions can change over time in response to water level, substrate availability, and inter-species interactions. Paddy soil, which undergoes periodic flooding and draining, serves as a particularly instructive model system for studying anaerobic microbial succession, as it hosts a succession of distinct microbial guilds ranging from facultative anaerobes to obligate fermenters and methanogenic archaea [6,7]. Despite the above-mentioned advantages of ML-based approaches on microbiome data, the approaches have been less frequently applied to time-series microbiome data from anaerobic soil systems, and suggesting biomarkers. In this study, a RF classifier was employed to train time-series 16S rRNA gene amplicon sequencing data from anaerobically incubated paddy soil to identify core OTUs with high predictive capacity for temporal classification, followed by validation of the identified biomarkers using LOOCV.
Results
Overview of microbial community structure
Sequencing of the six anaerobic incubation samples (day 1, day 15, day 56; n=2 per time point) yielded a total of 13,497 OTUs after quality screening and rarefaction. Community composition at the phylum level showed dominance of Pseudomonadota, unclassified Bacteria, Acidobacteriota, Bacteroidota, and Verrucomicrobiota across all time points [8].
Microbial succession indicated by PCA
To investigate the overall shift in the soil microbial community during the anaerobic incubation, a principal component analysis (PCA) was performed based on the relative abundance of all OTUs (13,497 OTUs), which might be more appropriate for suggesting linear variability than distance matrix-based approaches. The first two principal components (PC1 and PC2) accounted for 23.4% and 20.1% of the total variance, respectively (Fig. 1). Samples from the initial phase (day 1) were clearly separated from those of the intermediate (day 15) and final phases (day 56) along both axes. This variable-based PCA result looked similar to the distance-based NMDS result in the previous publication [8]. In addition, the PCA results revealed a distinct and consistent succession trajectory of the microbial community over time. The succession trajectory indicated a pronounced shift along the PC1 axis between day 1 and day 15, reflecting a rapid and substantial reorganization of the microbial community during the initial anaerobic incubation phase. The transition from day 15 to day 56 was comparatively more subtle, suggesting progressive community stabilization toward a mature anaerobic steady state (Fig. 1).
This bacterial succession pattern, suggested as rapid early-phase restructuring followed by convergence, is similar to the previously reported ecological model of anaerobic paddy soil succession [9,10], in which initial aerobic/facultative microbial populations are displaced by specialized anaerobic decomposers as oxygen is depleted and reducing conditions intensify.
LEfSe analysis: non-significant biomarkers identified
LEfSe (linear discriminant analysis effect size) analysis was performed for all pairwise comparisons among time points (day 1 vs. day 15, day 1 vs. day 56, day 15 vs. day 56). However, none of OTU was found to meet the significance threshold (Kruskal-Wallis p<0.05 after FDR correction; LDA score > 2.0) in the above comparisons. It seems that small sample size (n=2 per group) probably resulted in the absence of statistically significant biomarkers by LEfSe.
Random Forest model run
A Random Forest (RF) classifier was employed to identify the most predictive microbial taxa (biomarkers) characterizing each incubation time point (phase). While a standard statistical approach (LEfSe) failed to identify significant taxa differentiating the groups (time points) probably due to the high dimensionality of the data (OTU numbers of 13,497) and small sample size (6 samples), the RF model indicated the potential taxa characterizing phase transitions of the community.
Based on the training of the relative abundance matrix of OTUs, OTUs were scored to match the time points by the RF classifier. Among them, top 30 OTUs ranked by RF feature importance were selected (Table 1, Fig. 2). The phyla with high importance scores were Verrucomicrobiota, Acidobacteriota, Bacteroidota, Pseudomonadota, Planctomycetota, etc. (Table 1, Fig. 2). Interestingly, as induced for anammox enrichment in the incubations, a phylum Planctomycetota in which anammox bacteria are affiliated, was found as one of the highly scored biomarkers. Even though the sizes are not large, multiple OTUs were found to be predictive taxa, such as OTUs 1306, 1337, etc.
Among the top-ranked taxa, OTU 1306 (Subdivision3 genera incertae sedis), OTU 504 (Gp3), and OTU 12 (Chitinophagaceae) exhibited the highest Importance Scores of 0.00329, 0.00329, and 0.00318, respectively (Table 1, Fig. 2), suggesting their influential roles in the community's temporal differentiation. Feature importance scores for all top 30 OTUs ranged from 0.00201 to 0.00329, representing 30 to 45-fold higher relative importance compared to the average of all OTUs (1/13,497 ≈ 0.000074), suggesting their disproportionate contribution to temporal groupings/transitions.
Abundance patterns of predictive OTUs
To compare the temporal dynamics of the top 30 OTUs, the relative abundance data were Z-score normalized in heatmap (Fig. 3) [11]. This relative abundance normalization can assist to visualize relative increases or decreases in microbial abundance across different incubation time points [12]. The temporal shifts of microbial community were inferred as two successional transitions during the anaerobic incubation, as the environment transitioned from the initial soil state to a mature anaerobic system. The transition from the initial phase (day 1) to the intermediate phase (day 15) was indicated by the rapid decline of the indigenous rice paddy soil community and the emergence of transitional anaerobic populations. At day 1, the community was characterized by high positive Z-scores for several taxa, most notably OTU504 (Gp3), OTU12 (Chitinophagaceae unclassified), OTU618 (Deltaproteobacteria unclassified), OTU1337 (Planctomycetaceae unclassified), and OTU2746 (Spartobacteria genera incertae sedis) (Fig. 3). These groups typically represent aerobic or facultative anaerobic or strict anaerobic microorganisms prevalent in aerated and/or micro-aerated soils. By day 15, these initial high Z-scored OTUs showed significant decreases (shifting from red to blue), probably as oxygen was consumed. In their place, a transitional community of microorganisms began to dominate. Taxa such as OTU205 (Undibacter), OTU390 (Sideroxydans), and OTU851 (Spartobacter genera incertae sedis), exhibited increased Z-scores, which have been reported as facultative or strict anaerobic bacteria. This may support the above-described PCA results, an establishment of an anaerobic environment and substantial reorganization of the microbial community under oxygen-limited conditions (Fig. 1). In addition, Acidobacteria subgroups (OTU3919 (Gp12) and OTU1633 (Gp25)) are known to occur in wetland and rice paddy soils with high ammonium concentrations [13,14]. These results support the anaerobic incubation conditions designed for enriching anammox-associated microbial communities in rice paddy soil [8].
The shift from the intermediate phase (day 15) to the final phase (day 56) suggested maturation of the anaerobic community. This period was defined by the replacement of transitional groups with anaerobic respirers. While day 15 served as a developmental bridge, day 56 was distinguished by a cluster of late-phase OTUs that reached their maximum Z-scores. Notably, OTU26 (Parcubacteria genera incertae sedis), OTU169 (Gp18), OTU229 (Deltaproteobacteria unclassified), OTU1129 (Betaproteobacteria unclassified), OTU12175 (Verrucomicrobiota unclassified), OTU1026 (Bacteria unclassified), and OTU1617 (Rhodospirillales unclassified) showed a strong enrichment at the final time point (Fig. 3). The increases of Deltaproteobacteria and Rhodospirillales at day 56 suggests a high capacity for anaerobic respiration including dissimilatory sulfate reduction, a process common in the later stages of anaerobic soil incubation. The clear separation of the red bands between day 15 and day 56 suggests that the microbial community at the end of the incubation was biologically distinct and enriched for anaerobic metabolic processes, completing the successional trajectory observed in the PCA (Fig. 1).
Validation of the model
To validate the model's reliability, a leave-one-out cross-validation (LOOCV) was performed [15]. The RF classifier trained on all 13,497 OTUs showed a LOOCV accuracy of 0.00%, indicating overfitting due to the dataset structure of large number of taxa (predictors) and small number of groups (observations) (13,497 OTUs; 6 samples) (Table 2). In contrast, when the model was retrained using only the top 30 OTUs identified by RF feature importance ranking, LOOCV accuracy increased to 66.67% (4 of 6 samples correctly classified). This value is approximately twice the random classification baseline (33.3% for a three-class problem), indicating that the selected taxa (biomarkers) suggest meaningful predictive information required to distinguish the anaerobic incubation stages, rather than reflecting coincidence variation (Table 2).
Discussion
The unsuccessful identification of significant biomarkers by LEfSe in this dataset would not be due to absence of biological signal, but rather a consequence of the fundamental limitations of univariate non-parametric testing of Kruskal–Wallis when sample sizes are very small (n=2 per group). The Kruskal-Wallis tests require a minimum of three observations per group to compute test statistics and the combined burden of multiple testing correction across 13,497 OTUs may have further suppressed statistical outputs. In comparison, the RF approach supplements these limitations by evaluating feature importance within a combination of decision trees that collectively integrate patterns across all OTUs simultaneously. This multivariate, non-parametric test would be more robust to small sample sizes and does not require explicit statistical testing of each OTU individually. The improvement in LOOCV accuracy from 0% (full OTU set) to 66.67% (top 30 OTU set) provides empirical evidence that the RF-selected OTUs (biomarkers) would suggest confident biological information that was invisible to univariate methods.
To improve the predictive accuracy of future models and identify more robust biomarker sets in low-resource microbiome studies, more advanced computational strategies may be considered, such as Generative Adversarial Networks (GANs) for synthetic data augmentation and Statistically Equivalent Signatures (SES) incorporating multivariate feature selection algorithms. Furthermore, the integration of multi-omics data (e.g., meta-transcriptomics) could provide functional context to taxonomic biomarkers, potentially reducing the noise-to-signal ratio and enhancing classification performance. These advancements will be critical for transitioning machine learning models from exploratory tools to diagnostic applications in soil ecology.
For the RF importance, in the present dataset, the sum of all OTU importance scores is normalized to 1.0 across 13,497 OTUs (features), yielding a theoretical mean importance of approximately 7.4×10-5. The top 30 OTUs selected in this study exhibited importance scores in the range of 2.0×10-3~3.3×10-3, which are approximately 30-45 folds above the mean. Although these absolute values may appear small individually, they represent a substantial enrichment in predictive taxa (OTUs) relative to the background of uninformative OTUs. This result that the 30 OTUs enable time-point grouping at high opportunities suggests a sound validation of their (the high importance scored OTUs) utility as predictive biomarkers.
The primary limitation of this study was the small sample size (n=2 per time point; n=6 total), which restricts the statistical power of all analyses and limits acknowledgement of the identified biomarkers. Therefore, this study should have considered an exploratory or pilot investigation for acceptable results of biomarkers using the limited datasets. For the purpose, the RF model, one of the machine learning was employed to retrieve biomarkers from the given limited datasets.
The classification performance of the RF model must be interpreted within the context of this study’s small sample size (n=6 total). In high-dimensional datasets where features greatly exceed observations, there is a high risk of overfitting, as evidenced by 0.00% accuracy from the all OUT-applied validation (Table 2). The subsequent increase to 66.67% accuracy by feature selection suggests that the identified biomarkers can indicate a meaningful biological signal. However, the moderate accuracy (66.67%) also reflects the sensitivity of small datasets to individual sample variability. These results can be considered exploratory, and further validation with larger groups may suggest a better general acknowledgement on this process for selecting biomarkers.
This study suggests that a Random Forest machine learning approach can successfully identify core predictive OTUs from high-dimensional, time-series microbiome data, even when conventional statistical methods cannot be performed due to lack of statistical condition. The top 30 OTUs identified by the RF classifier represent the most predictive taxa for distinguishing the three incubation phases. These biomarkers would suggest the essential biological signals of microbial succession, as evidenced by the significantly higher classification accuracy compared to the random baseline. In addition, the suggested biomarkers during the anaerobic succession may also be considered as important players in paddy soil ecosystem. Identification of the anaerobically noted biomarkers such as Deltaproteobacteria (including Geobacter) and Planctomycetota suggests key anaerobic process of Fe and nitrogen cycling [16].
This study would also suggest the potential advantages of machine learning-based biomarker selection in low number sample microbiome studies. However, further work with expanded sample sizes and complementary functional analyses will be essential to fully characterize the ecological roles of the identified core OTUs.
MaterialsandMethods
Data from anaerobic incubation
The previously published sequencing data were used for different purpose in this study. Rice paddy soils were incubated anaerobically for potential enrichment of anammox reaction and subsampled after 1 day, 15 days, and 56 days [8]. Those soil samples were subjected to DNA extraction, library preparation for bacterial 16S rRNA gene, Illumina sequencing, and downstream data analyses. These data and analytical results were reported in previous publication [8]. In brief, DNAs were extracted from the rice paddy soils that were anaerobically incubated in a freshwater medium and V3-V4 region of 16S rRNA gene was sequenced using Illumina Miseq [8]. The paired-end sequences were analyzed using Mothur. The raw data was re-run using updated database of RDP (v19 from Mothur SOP) and Silva (v138.2) and Mothur (v1.48.5). The raw data can be found at NCBI Sequence Read Archive (SRA) with the accession numbers of PRJNA1060457 and PRJNA1062134.
Random Forest classification
In order to select OUTs contributing changes by time, a Random Forest (RF) classifier (RandomForestClassifier) was performed using the scikit-learn library (v1.8.0) [17] in Python (v3.14.3). Prior to modeling, OTU count data were converted to relative abundances to normalize for differences in library size. The RF model was trained using the relative abundance matrix as the feature matrix (x) and sampling time point (day 1, day 15, day 56) as the response variable (y). In brief, randomly selected OTUs (x) were tested 1,000 times to be determined to the time points (y) and the highly decisive OTUs were selected. Model parameters were set as follows: n_estimators=1,000 and random_state=42. All other hyperparameters were maintained at scikit-learn defaults. For examples, the max_depth parameter was set to the default of ‘None’, allowing nodes to be expanded until all leaves were pure, and max_features parameter was set to the default of ‘sqrt’. Feature importances which provide contributing degrees to decision and/or estimation were extracted from the trained model and used to rank all OTUs. The top 30 OTUs with the highest feature importance scores were selected as candidate predictive biomarkers.
PCA and heatmap
Principal component analysis (PCA) was performed on the relative abundance matrix using prcomp function of ggplot2 (v4.0.2) package of R (v4.5.2) and succession trajectory was added. Heatmap was drawn for the top 30 OTUs predicted by RF classifier using sns.heatmap function of seaborn (v0.13.2) package of Python. Z-scores were calculated by the following equation 1, described by Eisen et al. [11].
where χ is the OTU’s absolute abundance data, μ is the OTU’s average value across the samples, and σ is the OUT’s standard deviation across the samples.
Model validation
Predictive performance on unseen data was evaluated using leave-one-out cross-validation (LOOCV) [15], in which each sample was iteratively excluded as the test set, while the remaining samples were used for training. LOOCV accuracy was computed for both the full OTU set (13,497 OTUs) and the reduced top-30 OTU set using RandomForestClassifier in Python. Classification accuracy was compared against a random baseline [1/3 ≈ 33.3% for a 3-class (3 groups: day 1, day 15, and day 56) problem].
Comparative analysis with LEfSe
LEfSe analysis was performed to provide a conventional univariate benchmark. All pairwise combinations of time points (day 1 vs. day 15, day 1 vs. day 56, day 15 vs. day 56) were analyzed using default LEfSe parameters (α=0.05; LDA score threshold=2.0).
Data Availability: All data are available in the main text or in the Supplementary Information.
Author Contributions: J.-H.L.: Conceptualization, Data analyses, Writing – original draft and editing, Funding acquisition and administration.
Notes: The author declares no conflict of interest
Acknowledgments: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2026-25481823).
Additional Information:
Supplementary information The online version contains supplementary material available at https://doi.org/10.5338/KJEA.2026.45.06
Correspondence and requests for materials should be addressed to Ji-Hoon Lee.
Peer review information Agricultural and Environmental Sciences thanks the anonymous reviewers for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.korseaj.org
Tables & Figures
Fig. 1.
Principal component analysis of the sediment microbial communities across three time points: day 1, day 15, and day 56. The dashed line indicates the mean succession trajectory.
Table 1.
Taxonomic classification and Random Forest (RF) importance scores of the top 30 predictive biomarkers
Fig. 2.
Top 30 predictive OTUs identified by Random Forest classification. The bars show the Importance Scores of the OTUs, effective at distinguishing the three incubation phases. The color gradient and numerical values at the bars represents the total abundance (size) of each OTU.
Fig. 3.
Heatmap of the top 30 predictive OTUs by Z-score normalized relative abundance across the incubation period. Red indicates an abundance higher than the mean across all samples, while blue indicates an abundance lower than the mean.
Table 2.
Classification accuracy using leave-one-out cross-validation (LOOCV) method
References
1. Gloor, GB., Macklaim, JM., Pawlowsky-Glahn, V., & Egozcue,JJ.
((2017)).
Microbiome datasets are compositional: And this is not optional..
Frontiers in Microbiology
8.
2224.
2. Segata, N., Izard, J., Waldron, L., Gevers, D., Miropolsky, L., Garrett, WS., & Huttenhower,C.
((2011)).
Metagenomic biomarker discovery and explanation..
Genome Biology
12.
R60.
3. Knights, D., Costello, EK., & Knight,R.
((2011)).
Supervised classification of human microbiota..
FEMS Microbiology Reviews
35.
343
- 359.
4. Pasolli, E., Truong, DT., Malik, F., Waldron, L., & Segata,N.
((2016)).
Machine learning meta-analysis of large metagenomic datasets: Tools and biological insights..
PLoS Computational Biology
12.
e1004977.
5. Liao, L., Yu, Z., Lu, Y., Hu, Y., Zhu, Y., Zhang, Y., & Yu,D.
((2026)).
Integrative analysis of nutritional components, differential metabolites, and endophytic microbiota reveals flavor determinants of Lushan Russet potato..
Foods
15.
67.
6. Conrad,R.
((2007)).
Microbial Ecology of Methanogens and Methanotrophs. In: Advances in Agronomy, Vol. 96..
1
- 63.
7. Liesack, W., Schnell, S., & Revsbech,NP.
((2000)).
Microbiology of flooded rice paddies..
FEMS Microbiology Reviews
24.
625
- 645.
8. Khanal, A., Song, H-G., Cho, Y-S., Yang, S-Y., Kim, W-S., Joshi, A., Min, J., & Lee,J-H.
((2024)).
Evidence of potential anammox activities from rice paddy soils in microaerobic and anaerobic conditions..
Biology
13.
548.
9. Li, W., Liu, M., Wu, M., Jiang, C., Kuzyakov, Y., Gavrichkova, O., Feng, Y., Dong, Y., & Li,Z.
((2019)).
Bacterial community succession in paddy soil depending on rice fertilization..
Applied Soil Ecology
144.
92
- 97.
10. Wang, W., Luo, X., Chen, Y., Ye, X., Wang, H., Cao, Z., Ran, W., & Cui,Z.
((2019)).
Succession of composition and function of soil bacterial communities during key rice growth stages..
Frontiers in Microbiology
10.
421.
11. Eisen, MB., Spellman, PT., Brown, PO., & Botstein,D.
((1998)).
Cluster analysis and display of genome-wide expression patterns..
Proceedings of the National Academy of Sciences
95.
14863
- 14868.
12. Delgado-Baquerizo, M., Oliverio, AM., Brewer, TE., Benavent-González, A., Eldridge, DJ., Bardgett, RD., Maestre, FT., Singh, BK., & Fierer,N.
((2018)).
A global atlas of the dominant bacteria found in soil..
Science
359.
320
- 325.
13. Chen, T., Hu, R., Zheng, Z., Yang, J., Fan, H., Deng, X., Yao, W., Wang, Q., Peng, S., & null,null.
((2021)).
Soil bacterial community in the multiple cropping system increased grain yield within 40 cultivation years..
Frontiers in Plant Science
12.
804527.
14. Sui, X., Frey, B., Yang, L., Liu, Y., Zhang, R., Ni, H., & Li,M-H.
((2022)).
Soil acidobacterial community composition changes sensitively with wetland degradation in northeastern of China..
Frontiers in Microbiology
13.
1052161.
15. Vehtari, A., Gelman, A., & Gabry,J.
((2017)).
Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC..
Statistics and Computing
27.
1413
- 1432.
16. Masuda, Y., Itoh, H., Shiratori, Y., Isobe, K., Otsuka, S., & Senoo,K.
((2017)).
Predominant but previously-overlooked prokaryotic drivers of reductive nitrogen transformation in paddy soils, revealed by metatranscriptomics..
Microbes and Environments
32.
180
- 183.
17. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., & null,null.
((2011)).
Scikit-learn: Machine learning in Python..
The Journal of Machine Learning Research
12.
2825
- 2830.