Computational Molecular Biology 2025, Vol.15, No.1 http://bioscipublisher.com/index.php/cmb © 2025 BioSci Publisher, registered at the publishing platform that is operated by Sophia Publishing Group, founded in British Columbia of Canada. All Rights Reserved.
BioSci Publisher is an international Open Access publishing platform that publishes scientific journals in the field of bioscience, registered at the publishing platform operated by Sophia Publishing Group (SPG), founded in British Columbia, Canada.
Publisher: BioSci Publisher
Edited by: Editorial Team of Computational Molecular Biology
Email: edit@cmb.bioscipublisher.com
Website: http://bioscipublisher.com/index.php/cmb
Address: 11388 Stevenston Hwy, PO Box 96016, Richmond, V7A 5J5, British Columbia Canada
Computational Molecular Biology (ISSN 1927-5587) is an open access, peer-reviewed journal published online by BioSciPublisher. The journal publishes research articles, letters, methods, and reviews in all areas of computational molecular biology, covering new discoveries in molecular biology, from genes to genomes, using statistical, mathematical, and computational methods, as well as new developments in computational methods and databases in molecular and genome biology. The papers published in the journal are expected to be of interest to computational scientists, biologists, and teachers, students, and researchers engaged in biology. All articles published in Computational Molecular Biology are Open Access and are distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. BioSciPublisher uses the CrossCheck service to identify academic plagiarism through the world’s leading plagiarism prevention tool, iParadigms, and to protect the original authors’ copyrights.
Computational Molecular Biology 2025, Vol.15, No.6, 265-272
http://bioscipublisher.com/index.php/cmb

Review Article, Open Access

Advances in Computational Vaccinology: From Antigen Discovery to Immune Simulation

Shiying Yu
Biotechnology Research Center, Cuixi Academy of Biotechnology, Zhuji, 311800, China
Corresponding author: shiying.yu@cuixi.org

Computational Molecular Biology, 2025, Vol.15, No.6 doi: 10.5376/cmb.2025.15.0026
Received: 01 Sep., 2025 Accepted: 11 Oct., 2025 Published: 02 Nov., 2025
Copyright © 2025 Yu. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Yu S.Y., 2025, Advances in computational vaccinology: from antigen discovery to immune simulation, Computational Molecular Biology, 15(6): 265-272 (doi: 10.5376/cmb.2025.15.0026)

Abstract Computational vaccinology, an emerging interdisciplinary field that integrates bioinformatics, immunology, and systems biology, is transforming vaccine research and development. This review surveys key advances in the field, covering theoretical foundations, core technologies, and practical applications. It examines the shift in vaccine development from traditional methods to computational strategies and introduces genome-based antigen screening (reverse vaccinology), epitope prediction algorithms, and the use of structural bioinformatics in antigen design. It also explores the integrated use of immunoinformatics tools and databases, especially the value of multi-omics data for refined antigen analysis.
The practical value of computational vaccinology is illustrated through several case studies: AI-assisted COVID-19 vaccine development, multi-epitope vaccine design for tuberculosis and malaria, and tumor neoantigen prediction and clinical translation. The review highlights the role of computational vaccinology in improving the efficiency of vaccine development, reducing costs, responding to emerging infectious diseases, and enabling personalized immunization strategies. It also provides a theoretical basis and a technical outlook for building AI-driven automated vaccine platforms.

Keywords Computational vaccinology; Reverse vaccinology; Epitope prediction; Immune simulation; Vaccine design

1 Introduction
How were vaccines made in the past? Mostly by experience: attenuation, inactivation, and then incremental trial and error (Li et al., 2024). These methods work, but they are slow and costly, and they often fall short against new pathogens (He and Wang, 2024). Computational vaccinology has come to the fore in recent years not because it is "high-end and sophisticated" but because it saves time and money in practice. Once genomic, proteomic, and immune data are integrated, antigen screening is no longer a blind search. Methods such as reverse vaccinology and immunoinformatics now provide precise targets, and they have shown clear advantages against infectious diseases and cancers (Basmenj et al., 2025). However good the tools are, the outcome still depends on how they are used. The computing platforms that many researchers rely on today are no longer merely analytical tools; they act as the "front desk" of vaccine development, the place where candidates are triaged before experiments begin.
Platforms like iVAX bundle epitope mapping, antigen construction, and immune simulation, and also provide data visualization and resource databases. Provided the models are sound and the data reliable, they can give a preliminary indication of which antigens are promising before a candidate ever reaches the laboratory (Moise et al., 2015). Computational models are not omnipotent, of course, and whether the algorithms can keep up with new variant strains remains an open question. Overall, however, progress in AI algorithms and structural modeling has greatly compressed prediction work that used to take several years. Computational vaccinology is therefore regarded as a "standard tool" for the next stage of vaccine research and development (Nag et al., 2025; Tang et al., 2025). This study reviews the latest advances in computational vaccinology, focusing on the entire process from antigen discovery to immune simulation. It also explores the challenges and future development
directions of integrating multi-omics data and artificial intelligence to enhance vaccine efficacy and safety. Through a comprehensive analysis of existing research and tools, this paper emphasizes the transformative impact of computational methods on immunology and public health, highlighting their role in rapid response to infectious disease outbreaks and in personalized vaccine design, with the aim of accelerating vaccine development.

2 Computational Approaches for Antigen Discovery
2.1 Genome-based vaccine target identification (reverse vaccinology)
Traditionally, finding suitable vaccine targets required step-by-step experimental screening, which is time-consuming and limited in scope. Reverse vaccinology bypasses this. Instead of culturing the pathogen, it starts directly from the genome and screens for genes that encode surface or secreted proteins, which are often virulence-related and more easily recognized by the immune system. The strategy does not suit every pathogen, but it has identified potential antigens in multiple cases (Rawal et al., 2021). Screening must consider not only antigenicity but also homology with host proteins: candidates that are too similar to human proteins may cause immune side effects. These computational steps act as a series of filters that narrow down candidate antigens layer by layer and lay the foundation for subsequent immunoinformatics analysis.

2.2 Epitope prediction algorithms (B-cell and T-cell epitopes)
Epitope prediction may sound highly technical, but the logic is simple: identify which fragments can be recognized by the immune system.
B-cell epitopes are usually regions that antibodies can recognize directly, whereas T-cell recognition depends on whether a peptide can bind to MHC molecules. The difficulty is that there are far too many TCR and MHC combinations to enumerate by experience alone. For this reason, more and more algorithms incorporate machine learning models, which perform especially well in identifying T-cell epitopes. Many tools can now balance affinity and specificity prediction (Zhang et al., 2021; Gao et al., 2023). Although their accuracy still falls short of experimental measurement, they greatly improve efficiency in early screening and help uncover conserved regions that are hard to detect but strongly immunogenic.

2.3 Applications of structural bioinformatics in antigen design
Designing a vaccine without structural information is like picking a key with one's eyes closed. Structural bioinformatics is the toolbox for this problem. It shows how antigens and antibodies bind, which epitopes are exposed on the surface, and which may be buried. Computational simulation and docking make it possible to judge in advance whether an antigen design is reasonable. In recent years, some models have incorporated machine learning algorithms that predict antibody affinity and binding sites more accurately (Mason et al., 2021; Wilman et al., 2022). All of this depends on a reliable structural template. If the pathogen is a structural "black box", modeling will be limited. Even so, integrating structural information with sequence-based predictions gives vaccine design a more realistic conformational basis and can improve the immune response in vivo.
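The layered filtering described in Section 2.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline of any particular tool: the `Candidate` fields, the function name `layered_filter`, and both thresholds (0.5 for the antigenicity score, 35% for host identity) are hypothetical, and the scores are assumed to come from upstream predictors that are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One predicted protein; all fields are assumed outputs of upstream tools."""
    protein_id: str
    localization: str      # predicted subcellular localization
    antigenicity: float    # score from a sequence-based predictor (0 to 1 here)
    host_identity: float   # highest percent identity to any host protein


# Localizations treated as accessible to the immune system (assumption).
EXPOSED = {"surface", "secreted"}


def layered_filter(candidates, min_antigenicity=0.5, max_host_identity=35.0):
    """Apply the three filters in order and rank the survivors by antigenicity."""
    exposed = [c for c in candidates if c.localization in EXPOSED]
    antigenic = [c for c in exposed if c.antigenicity >= min_antigenicity]
    safe = [c for c in antigenic if c.host_identity <= max_host_identity]
    return sorted(safe, key=lambda c: c.antigenicity, reverse=True)


proteome = [
    Candidate("P1", "surface", 0.82, 12.0),
    Candidate("P2", "cytoplasm", 0.91, 8.0),    # dropped: not exposed
    Candidate("P3", "secreted", 0.41, 10.0),    # dropped: low antigenicity
    Candidate("P4", "secreted", 0.77, 62.0),    # dropped: too similar to host
    Candidate("P5", "secreted", 0.66, 20.0),
]
shortlist = [c.protein_id for c in layered_filter(proteome)]
print(shortlist)
```

Ordering the filters from cheapest to most expensive is a design choice; in practice the homology search is usually the costly step, so it is run last on the smallest set.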
3 Immunoinformatics and Data Integration
3.1 Vaccine development databases and resources (e.g., IEDB, VaxiJen)
The first step in vaccine design is often taken not in the laboratory but in a database. Platforms like IEDB contain tens of thousands of experimentally verified B-cell and T-cell epitopes. IEDB works like a constantly updated "immune map" that researchers can hardly do without. Tools like VaxiJen, by contrast, ignore structure altogether and predict antigenicity directly from the sequence, so potential vaccine candidate proteins can be screened without sequence alignment. Although these databases and tools cannot replace experimental verification, they speed up antigen screening significantly. When a study faces a large number of candidate proteins, a system that prioritizes them automatically is far more efficient than relying on intuition (Oli et al., 2020). The prerequisite is that these databases are updated in a timely manner and remain easy to access; otherwise even the best resources cannot play their role.
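Alignment-free prediction needs some way to turn proteins of different lengths into numeric vectors of the same length. One common device is an auto-covariance transform of per-residue physicochemical values. The sketch below assumes a single property (Kyte-Doolittle hydropathy) and a hypothetical function name, `auto_covariance`; it illustrates the general idea only and should not be read as the model used by VaxiJen or any other specific tool.

```python
# Kyte-Doolittle hydropathy values. Using one property keeps the sketch short;
# real predictors typically combine several descriptors per residue.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(sequence, max_lag=3):
    """Return one feature per lag: mean product of centered values `lag` residues apart."""
    values = [HYDROPATHY[residue] for residue in sequence.upper()]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


# Two proteins of different lengths map to vectors of the same length.
short_vec = auto_covariance("MKTAYIAKQR")
long_vec = auto_covariance("MKTAYIAKQRQISFVKSHFSRQ")
print(len(short_vec), len(long_vec))
```

The resulting fixed-length vectors could then be passed to any standard classifier; that step is omitted here.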
3.2 Sequence alignment, motif search, and homology modeling techniques
Identifying key regions on proteins does not have to rely on experience. A fragment that is conserved across species or strains often carries crucial immune information, and this is where sequence alignment and motif search tools come in. They help identify stable fragments that may trigger immune responses, which makes them particularly suitable for designing vaccines that cover many strains. Where sequence analysis alone is not precise enough, homology modeling can fill the gap at the structural level. Even without complete structural data, a similar template is enough to assemble an approximate three-dimensional conformation. This is crucial for determining which epitopes are exposed to the immune system (Zaher et al., 2025). These tools can be used separately, but they work better in combination, and they are an indispensable intermediate step in many vaccine development workflows.

3.3 Integration of multi-omics data for comprehensive antigen profiling
A single data source rarely gives the full picture of an antigen. The genome tells us which components exist, the transcriptome shows which of them are expressed, and the proteome and immunopeptidome reveal which are actually at work. Relying on one data set alone makes it easy to miss key links in the immune response. Integrating multi-omics data is not easy, because data formats, analysis dimensions, and sampling time points may all be inconsistent, but once the data are connected they can present a complete picture of the interaction between pathogen and host. Many new algorithms and machine learning models are now being applied to the problem of data heterogeneity.
Although they are not yet fully automated, the trend is clear: whoever integrates these data better is likely to find new vaccine targets first (Anderson et al., 2025; Kamali et al., 2025). In personalized vaccine design in particular, multi-omics analysis can identify the most effective antigenic sites for a specific population or individual more accurately, which places it at the cutting edge of computational vaccinology.

4 Computational Modeling of Immune Responses
4.1 Agent-based and mechanistic models of the immune system
There is more than one way to model immune responses. Agent-based models treat individual cells or molecules as objects that can act, while mechanistic models use a set of differential equations to capture the operating rules of the immune system. The core objective is the same in both cases: to work out how the immune system reacts, step by step, to viruses, bacteria, and vaccines. For instance, some studies have used hybrid modeling to investigate how IL-2 and IL-4 regulate lymphocyte activation, and the simulated proliferation pathways of immune cells have revealed details of immune behavior during infection (Atitey and Anchang, 2022). For viral infections such as SARS-CoV-2, mechanistic models attempt to capture the dynamic interactions among viruses, immune cells, and cytokines, helping to predict disease progression and even the likelihood of immune clearance (Leon et al., 2023; Miroshnichenko et al., 2025). Although each model has its own focus, the theoretical support they provide remains indispensable for validating vaccine strategies and immunotherapies.

4.2 Applications of immune simulation tools (e.g., C-ImmSim, SimuLymph)
Not everyone can build an immune model from scratch. Fortunately, platforms like C-ImmSim and SimuLymph already provide the framework.
C-ImmSim is designed to link epitope information with the characteristics of lymphocyte receptors and to predict the formation of immune responses, and even immune memory, from amino acid sequences (Rapin et al., 2010). It has reproduced some classic experiments quite well, such as the influence of different MHC combinations on immune responses. SimuLymph takes a different approach. It is based on agent-based modeling and focuses on individualized immune responses, which makes it particularly suitable for predicting how specific individuals respond to specific immunotherapies (Figure 1) (Matalon et al., 2025). Although the two tools rely on different mechanisms, both act as a "digital immunity sandbox" in which vaccine constructs or treatment pathways can be tested in a virtual environment in advance.
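To make the mechanistic (differential-equation) style of Section 4.1 concrete, the following is a deliberately small sketch: two coupled equations for a viral load V and an effector-cell population E, dV/dt = rV - kVE and dE/dt = aV - dE, integrated with the forward Euler method. The equations, the function name `simulate_infection`, and every parameter value are illustrative assumptions in arbitrary units; this is not the model of any cited study, nor a simplified version of C-ImmSim or SimuLymph.

```python
def simulate_infection(days=30.0, dt=0.01, growth=1.0, kill=1.0,
                       recruit=1.0, decay=0.2, v0=0.01, e0=0.0):
    """Integrate dV/dt = growth*V - kill*V*E and dE/dt = recruit*V - decay*E.

    Returns two lists (viral load, effector cells), one entry per time step,
    including the initial condition.
    """
    steps = int(round(days / dt))
    viral, effectors = [v0], [e0]
    for _ in range(steps):
        v, e = viral[-1], effectors[-1]
        dv = growth * v - kill * v * e      # replication minus immune killing
        de = recruit * v - decay * e        # antigen-driven recruitment minus decay
        viral.append(v + dt * dv)
        effectors.append(e + dt * de)
    return viral, effectors


viral, effectors = simulate_infection()
peak = max(viral)
peak_day = viral.index(peak) * 0.01
print(f"peak viral load {peak:.3f} on day {peak_day:.1f}")
```

In this toy system the viral load rises, peaks once effector cells reach the level growth/kill, and then settles toward a lower set point. An agent-based model would instead track individual cells with rules for movement, contact, and division, at a much higher computational cost.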
Figure 1 Summary of MLR regulatory network and corresponding equations defining the state machine (Adopted from Matalon et al., 2025)

4.3 Prediction of immunogenicity and population coverage
A vaccine that works for most people cannot be designed through experiments alone. Knowing in advance which epitopes are more likely to induce immune responses requires computational prediction. Models take T-cell receptor specificity into account and analyze the binding affinity of epitopes to MHC molecules in order to determine which candidates are more reliable. Some machine learning models can handle nonlinear features and integrate a large number of variables, an advantage that is particularly evident in complex tasks such as predicting the response to pertussis vaccination (Shinde et al., 2025). Whether a vaccine suits a wide range of people must also be judged against the distribution of MHC alleles in the population. Combining these predictive capabilities with immune simulation models makes it possible to assess vaccine efficacy in advance more effectively and helps optimize immunization strategies for different populations.

5 Case Studies in Computational Vaccine Development
5.1 COVID-19: AI-assisted epitope mapping and vaccine candidate design
At the beginning of the outbreak, no one expected AI to become involved in vaccine research and development so quickly. Yet immunoinformatics platforms like iVAX began screening for conserved and potentially immunogenic T-cell epitopes shortly after the SARS-CoV-2 genome was published. These platforms do not rely on guesswork. They predict immune responses by analyzing sequences and can even optimize the design of antigen constructs.
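A simple way to see what "conserved" means in this screening step is to look for peptide windows that occur unchanged in every available strain sequence. The sketch below assumes exact matching of 9-residue windows and uses made-up toy sequences; the function names are hypothetical, and real platforms such as iVAX use considerably richer conservation and binding analyses.

```python
def kmers(sequence, k):
    """Return the set of all length-k windows in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def conserved_peptides(strain_sequences, k=9):
    """Return, in sorted order, the k-mers present in every strain sequence."""
    if not strain_sequences:
        return []
    shared = kmers(strain_sequences[0], k)
    for sequence in strain_sequences[1:]:
        shared &= kmers(sequence, k)   # keep only windows seen in this strain too
    return sorted(shared)


# Toy sequences (not real proteins) that differ only at their ends.
strains = [
    "MKTLLVAGSAYTQ",
    "MRTLLVAGSAYTE",
    "AKTLLVAGSAYTQ",
]
print(conserved_peptides(strains))
```

Peptides that survive this intersection are candidates for binding prediction; exact matching is the strictest possible criterion and would normally be relaxed to tolerate conservative substitutions.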
Although traditional methods remain important, computational tools have clearly shortened the time from sequencing to candidate vaccine design (De Groot et al., 2020).

5.2 Multi-epitope vaccine design for tuberculosis and malaria
For stubborn and structurally complex pathogens such as those that cause tuberculosis and malaria, computational methods are not an added bonus; they are often the key to solving the problem. Researchers first identify antigenic proteins by combining reverse vaccinology with immunoinformatics techniques, and then select highly immunogenic epitopes that can trigger MHC class I, MHC class II, or B-cell responses. The epitopes are then assembled into a single construct, with linkers and adjuvants added, somewhat like building with blocks.
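The block-building step can be sketched as plain string assembly. The linker choices below (EAAAK after the adjuvant, AAY before MHC class I epitopes, GPGPG before MHC class II epitopes, KK before B-cell epitopes) follow a pattern that appears frequently in published in silico designs, but they are an assumption here and are not taken from this article; the peptide strings are arbitrary placeholders, and `assemble_construct` is a hypothetical name.

```python
# Linker preceding each component type (assumed convention, see lead-in).
LINKERS = {"adjuvant": "EAAAK", "ctl": "AAY", "htl": "GPGPG", "bcell": "KK"}


def assemble_construct(adjuvant, ctl_epitopes, htl_epitopes, bcell_epitopes):
    """Join adjuvant and epitopes into one sequence, inserting linkers between parts."""
    blocks = []
    for kind, epitopes in (("ctl", ctl_epitopes),
                           ("htl", htl_epitopes),
                           ("bcell", bcell_epitopes)):
        for epitope in epitopes:
            blocks.append((kind, epitope))

    pieces = [adjuvant, LINKERS["adjuvant"]]
    for index, (kind, epitope) in enumerate(blocks):
        if index > 0:
            # The linker is chosen by the type of the epitope that follows it.
            pieces.append(LINKERS[kind])
        pieces.append(epitope)
    return "".join(pieces)


construct = assemble_construct(
    adjuvant="MKT",                              # placeholder, not a real adjuvant
    ctl_epitopes=["QTWEHKLMN", "VCDRPGSIY"],     # placeholder 9-mers
    htl_epitopes=["FNDKRWYTLSAGE"],              # placeholder 13-mer
    bcell_epitopes=["PEHGTWQ"],                  # placeholder
)
print(construct)
```

The assembled sequence would then go through the checks described in the text, such as allergenicity prediction, structure modeling, and immune simulation.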
Adding linkers and adjuvants in this way supports the stability of the construct and keeps the risk of allergic reactions as low as possible (Figure 2). Simulation data show good overall immune-inducing potential. Although clinical application is still some way off, the direction is clear (Kardani et al., 2020).

Figure 2 Discontinuous B-cell epitopes predicted by ElliPro. (A-E): 3D representation of conformational or discontinuous epitopes of the most antigenic chimeric protein from T. cruzi CL Brener. Epitopes are shown as yellow surfaces, and the bulk of the protein is represented in grey sticks (Adopted from Rawal et al., 2021)

5.3 Cancer neoantigen vaccines: from prediction to clinical trials
In the field of cancer vaccines, predicting patient-specific neoantigens is becoming increasingly realistic. AI is used to analyze tumor mutation sites and identify which fragments might become targets, and structural modeling then helps confirm whether they can trigger an immune attack. This workflow is no longer just theoretical. Some neoantigen vaccines selected with these algorithms have already entered clinical trials. Although every patient carries different mutations, this personalized strategy offers a real opening for tumor vaccine design (Guarra and Colombo, 2023).

6 Challenges, Limitations, and Validation Bottlenecks
6.1 Accuracy and generalizability issues in predictive models
Computational vaccine design is not as automatic as it seems. A model often works well on one type of pathogen, but its performance drops as soon as the target changes. When data are scarce and come from diverse sources, many AI algorithms fall into the trap of overfitting.
This is not uncommon: for some models, the longer they are trained, the worse they generalize. Existing data are also reported in many different formats. Simply linking different data sets is often a problem in itself, let alone extracting from them the factors that truly matter for immune protection (Dalsass et al., 2019; Bravi, 2024). In other words, a model is not a universal key. The more complex the setting, the more it calls for tailored algorithms and careful feature selection.

6.2 The necessity of experimental validation: bridging the computational-experimental gap
However accurate a calculation is, it still has to pass the experimental test. Many epitope predictions that seemed promising ultimately failed in vitro or in vivo. The gap between calculation and reality cannot be bridged with a few data sets. This makes experimental verification the most time-consuming yet indispensable part of the entire process. Some high-throughput technologies and community benchmarking platforms have been established, but they remain costly and cannot completely replace traditional verification methods. Only when computational prediction, immunological experiments, and clinical research are integrated can the process truly succeed. Otherwise, however intelligent the AI, its output remains at the hypothesis stage (Hashim and Dimier-Poisson, 2025).
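One way to probe the generalization problem raised in Section 6.1 is to evaluate a model on a pathogen it never saw during training, rather than on a random split that mixes all pathogens. The sketch below only generates such splits; the function name `leave_one_group_out` and the toy labels are hypothetical, and fitting and scoring the model are left out.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group.

    `groups` holds one label per sample, for example the source pathogen.
    """
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    for held_out in sorted(by_group):
        test_indices = by_group[held_out]
        train_indices = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_indices, test_indices


# Toy labels: the pathogen each training epitope came from.
pathogens = ["tb", "tb", "malaria", "flu", "malaria"]
splits = list(leave_one_group_out(pathogens))
for held_out, train, test in splits:
    print(held_out, train, test)
```

A large gap between the score on random splits and the score on these held-out-pathogen splits is a sign that the model has memorized pathogen-specific patterns instead of learning transferable ones.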
6.3 Ethical, regulatory, and data privacy concerns related to vaccine AI
Not all challenges are technical. Once AI enters the core process of vaccine development, regulatory and ethical issues become unavoidable. If the public is to trust AI models, it must first be clear how those models reach their judgments, yet many of the most effective models are also the most opaque. The use of individual genomic and health data also raises privacy concerns, and the more such data are used, the more demanding the governance framework must be. If a model is trained on biased data, its results will be distorted and it may amplify existing health inequalities. From algorithmic fairness to privacy protection to regulatory standards, the development of AI-designed vaccines is therefore a test of an entire set of social mechanisms (El Arab et al., 2025).

7 Future Perspectives and Technological Innovations
It is no longer news that AI is increasingly involved in vaccine research and development. What is notable is that the efficiency it has shown in antigen and adjuvant screening has displaced many older methods that relied on intuition and trial and error. Convolutional and recurrent neural networks, previously used mainly for image and text processing, are now being applied to the design of multi-epitope vaccines. By combining omics data with structural information, they can significantly improve the accuracy of antigen selection. During the COVID-19 outbreak, AI-driven epitope prediction came into its own, and the pace of research and development accelerated markedly.
In adjuvant development, many new strategies have also moved away from the traditional trial-and-error model. AI models that infer immune pathways have made discovery faster and improved the hit rate. Truly personalized vaccines, however, still depend on the genetic level. The combination of immunogenomics and computational vaccinology has begun to make personalized vaccines possible. Once a patient's genomic and transcriptomic data are available, the system can predict which neoantigens that person is likely to respond to, and analysis of the immune repertoire allows the design to be targeted further. This approach is particularly useful for infectious diseases caused by rapidly mutating viruses and for cancer vaccines. Some neoantigen vaccines have already shown initial efficacy in clinical trials. In addition, digital twin simulation and vaccine formulations customized for different populations have made protection strategies for people with different genetic backgrounds more precise, with better assurance of both safety and effectiveness.

A fully automated vaccine design process sounds ideal, but many technical barriers must be overcome before it can succeed. Many steps, from antigen screening to epitope prediction to immune simulation, can already be automated. The open questions are whether the models are interpretable, whether data standards are consistent, and whether experimental validation can be connected to the pipeline. All of these are still being worked out. Only when the AI workflow is integrated with high-throughput experiments and regulatory frameworks can vaccine research and development truly enter a "closed-loop era". By then, both the speed of response to sudden infectious disease outbreaks and the capacity for large-scale production may undergo a qualitative change.
Acknowledgments I thank the anonymous reviewers and the editor for their meticulous review of the manuscript, whose constructive comments and valuable suggestions improved the structure of the argument. Conflict of Interest Disclosure The author affirms that this research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest. References Anderson L., Hoyt C., Zucker J., McNaughton A., Teuton J., Karis K., Arokium-Christian N., Warley J., Stromberg Z., Gyori B., and Kumar N., 2025, Computational tools and data integration to accelerate vaccine development: challenges, opportunities, and future directions, Frontiers in Immunology, 16: 1502484.
Atitey K., and Anchang B., 2022, Mathematical modeling of proliferative immune response initiated by interactions between classical antigen-presenting cells under joint antagonistic IL-2 and IL-4 signaling, Frontiers in Molecular Biosciences, 9: 777390. https://doi.org/10.3389/fmolb.2022.777390 Basmenj E., Pajhouh S., Fallah A., Naijian R., Rahimi E., Atighy H., Ghiabi S., and Ghiabi S., 2025, Computational epitope-based vaccine design with bioinformatics approach: a review, Heliyon, 11(1): e41714. https://doi.org/10.1016/j.heliyon.2025.e41714 Bravi B., 2024, Development and use of machine learning algorithms in vaccine target selection, NPJ Vaccines, 9(1): 15. https://doi.org/10.1038/s41541-023-00795-8 Dalsass M., Brozzi A., Medini D., and Rappuoli R., 2019, Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery, Frontiers in Immunology, 10: 113. https://doi.org/10.3389/fimmu.2019.00113 De Groot A., Moise L., Terry F., Gutiérrez A., Hindocha P., Richard G., Hoft D., Ross T., Noe A., Takahashi Y., Kotraiah V., Silk S., Nielsen C., Minassian A., Ashfield R., Ardito M., Draper S., and Martin W., 2020, Better epitope discovery, precision immune engineering, and accelerated vaccine design using immunoinformatics tools, Frontiers in Immunology, 11: 442. https://doi.org/10.3389/fimmu.2020.00442 El Arab R., Alkhunaizi M., Alhashem Y., Khatib A., Bubsheet M., and Hassanein S., 2025, Artificial intelligence in vaccine research and development: an umbrella review, Frontiers in Immunology, 16: 1567116. https://doi.org/10.3389/fimmu.2025.1567116 Gao Y., Gao Y., Fan Y., Zhu C., Wei Z., Zhou C., Chuai G., Chen Q., Zhang H., and Liu Q., 2023, Pan-peptide meta learning for T-cell receptor–antigen binding recognition, Nature Machine Intelligence, 5(3): 236-249.
https://doi.org/10.1038/s42256-023-00619-3 Guarra F., and Colombo G., 2023, Computational methods in immunology and vaccinology: design and development of antibodies and immunogens, Journal of Chemical Theory and Computation, 19(16): 5315-5333. https://doi.org/10.1021/acs.jctc.3c00513 Hashim O., and Dimier-Poisson I., 2025, Computational vaccine development against protozoa, Computational and Structural Biotechnology Journal, 27: 2386-2393. https://doi.org/10.1016/j.csbj.2025.06.011 He X., and Wang S.B., 2024, Global trends in veterinary vaccine development for emerging pathogens, Molecular Pathogens, 15(2): 61-71. http://dx.doi.org/10.5376/mp.2024.15.0007 Kamali M., Salehi M., and Fath M., 2025, Advancing personalized immunotherapy for melanoma: integrating immunoinformatics in multi-epitope vaccine development, neoantigen identification via NGS, and immune simulation evaluation, Computers in Biology and Medicine, 188: 109885. https://doi.org/10.1016/j.compbiomed.2025.109885 Kardani K., Bolhassani A., and Namvar A., 2020, An overview of in silico vaccine design against different pathogens and cancer, Expert Review of Vaccines, 19(8): 699-726. https://doi.org/10.1080/14760584.2020.1794832 Leon C., Tokarev A., Bouchnita A., and Volpert V., 2023, Modelling of the innate and adaptive immune response to SARS viral infection, cytokine storm and vaccination, Vaccines, 11(1): 127. https://doi.org/10.3390/vaccines11010127 Li X.H., Liang H.B., and Xuan J., 2024, Observation analysis of vaccine efficacy in poultry farms: insights from field trials on chicken immunization, International Journal of Molecular Veterinary Research, 14(5): 202-210. 
http://dx.doi.org/10.5376/ijmvr.2024.14.0023 Mason D., Friedensohn S., Weber C., Jordi C., Wagner B., Meng S., Ehling R., Bonati L., Dahinden J., Gainza P., Correia B., and Reddy S., 2021, Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning, Nature Biomedical Engineering, 5(6): 600-612. https://doi.org/10.1038/s41551-021-00699-9 Matalon O., Perissinotto A., Baruch K., Braiman S., Maor A., Yoles E., Wilczynski E., Nevo U., and Priel A., 2025, Agent-based modeling for personalized prediction of an experimental immune response to immunotherapeutic antibodies, PLoS One, 20(6): e0324618. https://doi.org/10.1371/journal.pone.0324618 Miroshnichenko M., Kolpakov F., and Akberdin I., 2025, A modular mathematical model of the immune response for investigating the pathogenesis of infectious diseases, Viruses, 17(5): 589. https://doi.org/10.3390/v17050589 Moise L., Gutiérrez A., Kibria F., Martin R., Tassone R., Liu R., Terry F., Martin B., and De Groot A., 2015, IVAX: an integrated toolkit for the selection and optimization of antigens and the design of epitope-driven vaccines, Human Vaccines & Immunotherapeutics, 11(9): 2312-2321. https://doi.org/10.1080/21645515.2015.1061159 Nag R., Srivastava S., Rizvi S., Ahmed S., and Raza S., 2025, Innovations in vaccine design: computational tools and techniques, In: Advances in Pharmacology, Academic Press, 103: 375-391. https://doi.org/10.1016/bs.apha.2025.01.015 Oli A., Obialor W., Ifeanyichukwu M., Odimegwu D., Okoyeh J., Emechebe G., Adejumo S., and Ibeanu G., 2020, Immunoinformatics and vaccine development: an overview, ImmunoTargets and Therapy, 2020: 13-30.
Computational Molecular Biology 2025, Vol.15, No.6, 265-272 http://bioscipublisher.com/index.php/cmb 272 Rapin N., Lund O., Bernaschi M., and Castiglione F., 2010, Computational immunology meets bioinformatics: the use of prediction tools for molecular binding in the simulation of the immune system, PLoS ONE, 5(4): e9862. https://doi.org/10.1371/journal.pone.0009862 Rawal K., Sinha R., Abbasi B., Chaudhary A., Nath S., Kumari P., Preeti P., Saraf D., Singh S., Mishra K., Gupta P., Mishra A., Sharma T., Gupta S., Singh P., Sood S., Subramani P., Dubey A., Strych U., Hotez P., and Bottazzi M., 2021, Identification of vaccine targets in pathogens and design of a vaccine using computational approaches, Scientific Reports, 11(1): 17626. https://doi.org/10.1038/s41598-021-96863-x Shinde P., Willemsen L., Anderson M., Aoki M., Basu S., Burel J., Cheng P., Dastidar S., Dunleavy A., Einav T., Forschmiedt J., Fourati S., Garcia J., Gibson W., Greenbaum J., Guan L., Guan W., Gygi J., Ha B., Hou J., Hsiao J., Huang Y., Jansen R., Kakoty B., Kang Z., Kobie J., Kojima M., Konstorum A., Lee J., Lewis S., Li A., Lock E., Mahita J., Mendes M., Meng H., Neher A., Nili S., Olsen L., Orfield S., Overton J., Pai N., Parker C., Qian B., Rasmussen M., Reyna J., Richardson E., Safo S., Sorenson J., Srinivasan A., Thrupp N., Tippalagama R., Trevizani R., Ventz S., Wang J., Wu C., Ay F., Grant B., Kleinstein S., and Peters B., 2025, Putting computational models of immunity to the test—An invited challenge to predict B. pertussis vaccination responses, PLOS Computational Biology, 21(3): e1012927. https://doi.org/10.1371/journal.pcbi.1012927 Tang X., Deng J., He C., Xu Y., Bai S., Guo Z., Du G., Ouyang D., and Sun X., 2025, Application of in-silico approaches in subunit vaccines: overcoming the challenges of antigen and adjuvant development, Journal of Controlled Release, 381: 113629. 
https://doi.org/10.1016/j.jconrel.2025.113629 Wilman W., Wróbel S., Bielska W., Deszynski P., Dudzic P., Jaszczyszyn I., Kaniewski J., Mlokosiewicz J., Rouyan A., Satlawa T., Kumar S., Greiff V., and Krawczyk K., 2022, Machine-designed biotherapeutics: opportunities, feasibility and advantages of deep learning in computational antibody discovery, Briefings in Bioinformatics, 23(4): bbac267. https://doi.org/10.1093/bib/bbac267 Zaher M., El-Husseiny M., Hagag N., El-Amir A., Zowalaty M., and Tammam R., 2025, A novel immunoinformatic approach for design and evaluation of heptavalent multiepitope foot-and-mouth disease virus vaccine, BMC Veterinary Research, 21(1): 152. https://doi.org/10.1186/s12917-025-04509-1 Zhang W., Hawkins P., He J., Gupta N., Liu J., Choonoo G., Jeong S., Chen C., Dhanik A., Dillon M., Deering R., Macdonald L., Thurston G., and Atwal G., 2021, A framework for highly multiplexed dextramer mapping and prediction of T cell receptor sequences to antigen specificity, Science Advances, 7(20): eabf5835. https://doi.org/10.1126/sciadv.abf5835
Computational Molecular Biology 2025, Vol.15, No.6, 273-281
http://bioscipublisher.com/index.php/cmb

Research Insight Open Access

Genome Assembly and Comparative Genomics of a Novel Extremophilic Bacterium

Yinghua Chen, Hui Xiang, Zhongqi Wu
Institute of Life Science, Jiyang College of Zhejiang A&F University, Zhuji, 311800, Zhejiang, China
Corresponding author: zhongqi.wu@jicat.org
Computational Molecular Biology, 2025, Vol.15, No.6 doi: 10.5376/cmb.2025.15.0027
Received: 07 Sep., 2025 Accepted: 20 Oct., 2025 Published: 10 Nov., 2025
Copyright © 2025 Chen et al., This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Chen Y.H., Xiang H., and Wu Z.Q., 2025, Genome assembly and comparative genomics of a novel extremophilic bacterium, Computational Molecular Biology, 15(6): 273-281 (doi: 10.5376/cmb.2025.15.0027)

Abstract Extremophiles have evolved unique physiological mechanisms and genomic characteristics in extreme environments such as high salt, high temperature, high pressure, and strong acid. Studying their genetic basis is of great significance for revealing the adaptive evolution mechanisms of microorganisms and discovering functional genes with biotechnological value. This study aims to assemble and annotate the genome of a newly isolated extremophilic bacterium and to analyze its metabolic potential, environmental adaptation mechanisms, and evolutionary characteristics through comparative genomics. Functional annotation reveals that the genome of this strain is rich in key functional genes related to salt tolerance, heat resistance, and heavy metal resistance.
In the comparative analysis with other known extremophilic bacteria, conserved core gene clusters, species-specific genomic islands, and expansions of gene families associated with environmental stress were identified. Case studies show that the strain has application potential in industrial enzyme development and in the construction of synthetic biology platforms. This study provides a new genome-level perspective for understanding the adaptation mechanisms of extremophilic microorganisms and lays a foundation for their functional exploration and application development.

Keywords Polar microorganisms; Genome assembly; Comparative genomics; Salt resistance; Phylogeny

1 Introduction
Not all life favors comfortable, greenhouse-like conditions. Some microorganisms instead prefer high temperatures, extreme cold, strong acids, high salinity, or even radiation. Such places are unsuitable for most life forms but are home to extremophilic microorganisms. In environments such as hot springs, polar ice caps, and alkaline salt lakes, they not only survive tenaciously but also play a "behind-the-scenes" role in maintaining ecosystem cycles. Although they seem remote from everyday life, these organisms offer a window onto the limits of life and inform the search for extraterrestrial life, the development of biotechnology, and research on environmental remediation (Arias et al., 2023; Gomez et al., 2024).

However, observation alone is far from enough to work out what these organisms rely on to survive. Genome assembly and comparative analysis are the keys that give access to their genetic architecture. In the past, high GC content and abundant genomic repeats were genuinely troublesome. Now, with high-throughput sequencing and long-read technologies, even these "tough nuts to crack" can be tackled successfully (Dong, 2024).
Comparing the genomes of these extremophilic bacteria can identify genes related to membrane stability, DNA repair, or stress response, and may also uncover new gene families and mechanisms that previously went unnoticed. Some of these may even change our understanding of microbial diversity and evolutionary patterns (Zhang et al., 2021a).

This study uses state-of-the-art sequencing and bioinformatics methods to assemble and analyze the genome of a novel extremophilic microorganism isolated from a unique extreme environment. It constructs a high-quality genome sequence, conducts comparative genomic analysis with related extremophiles, and identifies the genetic determinants of the strain's extreme tolerance. The scientific significance of this research lies in expanding the catalogue of extremophile genomes and revealing molecular adaptation mechanisms that may have implications for biotechnology and evolutionary biology.
2 Sample Collection and Sequencing Strategies
2.1 Sampling from extreme environments and bacterial isolation
These "durable" microorganisms cannot be found everywhere. Most true extremophiles grow in places such as hot springs, salt lakes, and acidic mines, where conditions are harsh yet they thrive. However, isolating bacteria from these environments is not as simple as scooping up a ladle of water. Their original growth conditions must be reproduced; otherwise, they "die" as soon as they enter the laboratory. Careful cultivation under simulated native conditions is often needed to select the truly tolerant isolates. Although this takes considerable effort, it is the only way to obtain pure and stable strains for subsequent genomic analysis. These steps also lay the foundation for understanding how the organisms withstand extreme stress (Verma et al., 2024).

2.2 DNA extraction and selection of sequencing platforms (Illumina, Nanopore, PacBio, etc.)
Extracting DNA from these microorganisms is easier said than done. For instance, if the cell wall is particularly tough or the sample contains unusual environmental impurities, conventional methods may not work. Once high-quality DNA has been extracted, the next step is to select a sequencing platform. Combining accurate short reads from Illumina with long reads from ONT or PacBio, which provide long-range coverage, is currently the most common hybrid strategy. Especially for samples with uneven GC content or many repetitive sequences, a single platform often yields mediocre results. Typically, long reads (ONT or PacBio) are used for the initial assembly, and Illumina reads are then used for polishing.
This approach gives stable results and saves a considerable amount of budget (Goldstein et al., 2018; Neal-McKinney et al., 2021).

2.3 Data quality control and preprocessing approaches
Raw sequencing data cannot be used directly; quality control is the crucial next step. Adapters must be removed, low-quality reads filtered out, and contaminating sequences discarded. Each of these tasks must be done thoroughly; otherwise, errors are likely during the subsequent assembly. Especially when multiple sequencing platforms are combined, the error rate and coverage distribution need careful examination. Some long-read platforms are prone to insertion and deletion errors, and correcting them with high-accuracy short reads is a common practice. Most of these steps can now be automated, so the whole preprocessing workflow, from raw data to assembly-ready reads, can run as a seamless pipeline. This is particularly important when studying extremophilic microorganisms (De Maio et al., 2019; Olagoke et al., 2025).

3 Genome Assembly and Quality Assessment
3.1 Genome assembly strategies (de novo, hybrid, etc.)
Genome assembly for extremophilic microorganisms usually adopts a hybrid strategy that combines short-read sequencing (such as Illumina) with long-read platforms (such as Oxford Nanopore Technologies (ONT) or PacBio). Hybrid assembly tools, notably Unicycler, combine the high accuracy of short reads with the long-range continuity of long reads, generating more complete and contiguous genomes than de novo assembly based on a single technology. This approach can resolve the complex genomic regions, repetitive sequences, and structural variations common in extremophiles, enabling chromosome-level assembly and improving the accuracy and completeness of the result.
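As a minimal sketch of how a hybrid run of this kind might be launched with Unicycler, the following Python snippet builds the command line from paired short reads and a long-read file. The file names, output directory, and thread count are hypothetical placeholders and are not taken from this study; only the basic input, output, and thread options are used.

```python
import shlex
from pathlib import Path


def build_unicycler_command(short_r1, short_r2, long_reads, out_dir, threads=8):
    """Return the argument list for a hybrid Unicycler run.

    -1/-2 take the paired-end short reads, -l takes the long reads,
    -o sets the output directory and -t the number of threads.
    """
    return [
        "unicycler",
        "-1", str(short_r1),
        "-2", str(short_r2),
        "-l", str(long_reads),
        "-o", str(out_dir),
        "-t", str(threads),
    ]


# Hypothetical input files; replace with real read sets.
cmd = build_unicycler_command(
    Path("illumina_R1.fastq.gz"),
    Path("illumina_R2.fastq.gz"),
    Path("nanopore.fastq.gz"),
    Path("hybrid_assembly"),
)
print(shlex.join(cmd))

# To actually launch the assembly (requires Unicycler on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The command is only printed here so that the sketch runs without the assembler installed. In a real run, the assembled sequences would be expected in `assembly.fasta` inside the chosen output directory.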
Studies comparing different strategies have shown that hybrid assembly outperforms short-read-only or long-read-only assembly in contiguity and gene completeness, while also balancing sequencing cost and DNA input requirements (Wick et al., 2017; Chen et al., 2020).

3.2 Evaluation of contigs/scaffolds and statistics on N50 and GC content
Assembly quality is usually evaluated with indicators such as the number of contigs or scaffolds, the N50 value, and the distribution of GC content. N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases; a higher N50 indicates greater contiguity, reflecting longer assembled sequences that better represent genome structure. The genomes of extremely thermophilic bacteria often have unusual GC contents, which poses challenges for assembly algorithms; evaluating the GC content therefore helps verify assembly accuracy and detect potential biases. Compared with assembly using only short reads, hybrid assembly usually yields fewer contigs and higher N50 values, indicating a more complete and contiguous genome. Checking that the GC content is consistent with the value expected for the species further supports the reliability of the assembly (Zhang et al., 2021b).

3.3 Application of BUSCO, QUAST and other tools for completeness assessment
Assessment of assembly completeness and quality relies mainly on tools such as BUSCO and QUAST, which provide biologically and technically meaningful indicators, respectively. BUSCO estimates genome completeness and redundancy by searching for near-universal single-copy orthologs, giving insight into how complete the gene content is. QUAST reports assembly statistics, including contig counts, N50 values, and misassembly counts, enabling the detection of structural errors. Other tools such as CheckM2 use machine learning to predict completeness and contamination, especially for metagenome-assembled genomes. Combining these assessment methods helps ensure that the assembled genome is structurally accurate and biologically complete, which is crucial for subsequent comparative genomics and functional analysis of extremophiles (Manni et al., 2021; Chklovski et al., 2022).

4 Genome Annotation and Functional Analysis
4.1 Coding sequence prediction and structural annotation (Prokka, RAST, etc.)
To read the "genetic ledger" of an extremophile, one usually starts by identifying its coding regions. Researchers often use toolkits such as Prokka and RAST directly, which saves time and facilitates later standardization.
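As a rough illustration of how the output of such an annotation run might be inspected, the sketch below tallies feature types in GFF3 text (one of the formats Prokka writes) and counts coding sequences whose product is "hypothetical protein". The records are invented for demonstration and are not results from this study; the function name is likewise hypothetical.

```python
from collections import Counter


def summarize_gff3(lines):
    """Count feature types and hypothetical-protein CDS entries in GFF3 text."""
    feature_counts = Counter()
    hypothetical = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        feature_type, attributes = fields[2], fields[8]
        feature_counts[feature_type] += 1
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            item.split("=", 1) for item in attributes.split(";") if "=" in item
        )
        product = attrs.get("product", "").lower()
        if feature_type == "CDS" and product == "hypothetical protein":
            hypothetical += 1
    return feature_counts, hypothetical


# Invented example records, not real annotation output.
example = [
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t10\t400\t.\t+\t0\tID=G_00001;product=hypothetical protein",
    "contig_1\tProdigal\tCDS\t500\t1300\t.\t-\t0\tID=G_00002;product=Chaperone protein DnaK",
    "contig_1\tAragorn\ttRNA\t1400\t1475\t.\t+\t.\tID=G_00003;product=tRNA-Ala",
    "contig_1\tbarrnap\trRNA\t1600\t3100\t.\t+\t.\tID=G_00004;product=16S ribosomal RNA",
    "##FASTA",
    ">contig_1",
    "ACGT",
]
counts, n_hypothetical = summarize_gff3(example)
print(dict(counts), n_hypothetical)
```

For a real file, the same function could be called as `summarize_gff3(open(path))`. The share of hypothetical proteins among all CDS features gives a quick sense of how much of the genome remains functionally uncharacterized.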
The interfaces of these programs are not overly complicated, but behind the scenes they integrate multiple gene prediction models and databases and can generate a complete set of annotations in one run, including rRNAs, tRNAs, and hypothetical proteins. Prokka, for instance, has been applied repeatedly to a wide range of bacterial genomes; stability and speed are its advantages (De Almeida et al., 2023). However powerful the tools, automatically annotated "hypothetical proteins" still deserve caution: they often represent unknown functions, which is precisely the most intriguing part of extremophiles.

4.2 Functional annotation and database comparisons (COG, KEGG, Pfam)
The gene list produced by automatic annotation does not specify exactly what those genes do. That step depends on comparisons against databases such as the commonly used COG, KEGG, and Pfam, which focus on functional classification, metabolic networks, and domain recognition, respectively. Used together, they work like a jigsaw puzzle, filling the blank spaces of physiological processes with individual genes. KEGG pathway mapping, for example, helps identify the genes at key reaction nodes, while Pfam domain analysis reveals conserved modules within proteins (Sohail et al., 2025). Interestingly, such comparisons often unearth "module combinations" unique to extremophiles, which tend to be functionally related to energy metabolism, nutrient utilization, or environmental response.

4.3 Identification of special functional genes (salt tolerance, thermotolerance, heavy metal resistance, etc.)
Not every extremophile carries "dramatic" stress-resistance genes in its genome, but if it can survive in high-salt or high-temperature environments, the genes responsible for resistance are usually not far away.
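A first-pass screen for such genes could be as simple as a keyword search over annotated product names. The sketch below is only an assumption about how one might start: the keyword patterns, category names, and example annotations are illustrative and invented, and a real screen would rely on curated databases and domain searches rather than name matching.

```python
import re

# Illustrative keyword patterns only; not an exhaustive or validated list.
STRESS_PATTERNS = {
    "thermotolerance": re.compile(r"heat shock|chaperon", re.IGNORECASE),
    "salt tolerance": re.compile(r"ectoine|glycine betaine|trehalose", re.IGNORECASE),
    "heavy metal resistance": re.compile(r"heavy metal|copper|arsen|mercur", re.IGNORECASE),
}


def screen_products(annotations):
    """Group locus tags by stress category based on product names."""
    hits = {category: [] for category in STRESS_PATTERNS}
    for locus_tag, product in annotations:
        for category, pattern in STRESS_PATTERNS.items():
            if pattern.search(product):
                hits[category].append(locus_tag)
    return hits


# Invented (locus_tag, product) pairs for demonstration.
annotations = [
    ("G_00001", "Chaperone protein DnaK"),
    ("G_00002", "L-ectoine synthase"),
    ("G_00003", "Copper-exporting P-type ATPase"),
    ("G_00004", "hypothetical protein"),
    ("G_00005", "10 kDa chaperonin"),
]
hits = screen_products(annotations)
for category, loci in hits.items():
    print(category, loci)
```

Name matching of this kind misses unannotated or misnamed genes, so it serves only to shortlist candidates for closer inspection.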
Heat shock proteins, compatible solute synthases, metal transport pumps: these names may not look new, but once they appear in a strain as expanded families or unique combinations, they are worth a second look. Comparative analysis can also reveal in which bacteria they are "standard" and in which they were "introduced from outside", which is crucial for understanding adaptation strategies (Srivastava et al., 2017; Wang et al., 2025). Sometimes a metal-resistance protein is not just for survival; it may later become a "candidate star" in biotechnology.

5 Comparative Genomic Analysis
5.1 Whole-genome alignment with related extremophilic bacteria
Sometimes, identifying the "special features" of an extremophile does not require it to speak for itself: a genomic comparison with its "relatives" can reveal where it has retained tradition and where it has embarked on a unique evolutionary path. With the high-quality genomes produced by hybrid sequencing in particular, structural changes such as insertions, deletions, and gene rearrangements become especially clear. For instance, the consensus pan-chromosome assembly method applied to Acinetobacter baumannii revealed "flexible regions" related to environmental adaptability, and such strategies are also applicable to microorganisms living in extreme environments (Chan et al., 2015; Gould and Henderson, 2023). Often it is the easily overlooked genomic islands and resistance elements that hold the key to explaining adaptability.

5.2 Identification of core and variable genomic regions
Bacteria in extreme environments must not merely survive but survive well. In their genomes, therefore, alongside the core regions that maintain basic life activities, there are many "mobile" modules. These variable regions are not present in every strain; they are more like "additional configurations" tailored to particular ecological conditions. Extremophilic bacteria such as Salinibacter ruber exhibit an open pan-genome: the stable core is maintained by homologous recombination, while the dynamic accessory genome is strongly shaped by horizontal gene transfer and includes many fragments from plasmids or genomic islands (Figure 1) (González-Torres and Gabaldón, 2018). These "mobile components" may not be used every day, but once the environment changes, they become the trump card for survival.

Figure 1 Genome dynamics and ecological models (Adopted from González-Torres and Gabaldón, 2018)

5.3 Gene family expansion, contraction, and species-specific gene analysis
Not all gene families remain unchanged during evolution; some expand while others gradually fade away.
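As a rough illustration of the core/variable split described in Section 5.2 and of changes in family size, the sketch below works from a table of gene-family copy numbers per strain. The strain names, family names, counts, and the twofold threshold are all hypothetical. A real analysis would derive the table from orthology clustering and would typically use a phylogeny-aware model rather than a simple ratio.

```python
def partition_pangenome(copy_numbers, strains):
    """Split gene families into core, accessory, and strain-specific lists.

    copy_numbers maps family -> {strain: copy count}; a missing strain
    or a count of 0 means the family is absent from that strain.
    """
    core, accessory, specific = [], [], []
    for family, counts in copy_numbers.items():
        present = [s for s in strains if counts.get(s, 0) > 0]
        if len(present) == len(strains):
            core.append(family)       # found in every strain
        elif len(present) == 1:
            specific.append(family)   # found in exactly one strain
        elif present:
            accessory.append(family)  # found in some but not all strains
    return core, accessory, specific


def expanded_families(copy_numbers, strains, focal, min_ratio=2.0):
    """Families whose copy number in the focal strain is at least
    min_ratio times the mean copy number across the other strains."""
    others = [s for s in strains if s != focal]
    expanded = []
    for family, counts in copy_numbers.items():
        mean_other = sum(counts.get(s, 0) for s in others) / len(others)
        # Families absent from all other strains are strain-specific,
        # not expanded, so they are skipped here.
        if mean_other > 0 and counts.get(focal, 0) >= min_ratio * mean_other:
            expanded.append(family)
    return expanded


# Hypothetical strains and copy numbers, for demonstration only.
strains = ["novel", "relA", "relB"]
table = {
    "dnaK_family": {"novel": 4, "relA": 2, "relB": 2},
    "czc_efflux": {"novel": 6, "relA": 1, "relB": 2},
    "flagellar": {"novel": 0, "relA": 3, "relB": 3},
    "island_orf": {"novel": 1, "relA": 0, "relB": 0},
    "ribosomal_L2": {"novel": 1, "relA": 1, "relB": 1},
}
core, accessory, specific = partition_pangenome(table, strains)
expanded = expanded_families(table, strains, focal="novel")
print("core:", core)
print("accessory:", accessory)
print("strain-specific:", specific)
print("expanded in novel strain:", expanded)
```

Contracted families could be found in the same way by reversing the comparison, for example by flagging families whose focal copy number falls well below the mean of the related strains.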
For instance, gene families related to heat resistance, metal resistance, or osmotic regulation often become "expanded families" in extremophiles. In contrast, some less essential pathways undergo functional contraction. This pattern of expansion and contraction is not random but closely tied to the selective pressure of the environment. By comparing the genomes of different species, some species-specific metabolic