Abstract
Artificial intelligence (AI) is rapidly revolutionizing our daily lives, as it automates mundane tasks, enhances productivity, and transforms how we interact with technology. We believe it is inevitable that AI will soon become a crucial tool in common research practices, from data analysis to writing papers. Here we explore how this transition is occurring in the field of mass spectrometry-based metabolomics, a rapidly growing area of science. Metabolomics focuses on studying small molecules in biological systems, offering valuable insights into metabolic processes and their impact on health, diseases, and physiological conditions. With the remarkable advancements in sequencing technologies and the exploration of the microbiome, the combination of sequencing and metabolomics presents profound opportunities to understand biological complexity. Incorporating AI is promising to unlock new possibilities for expanding the realms of scientific discoveries. In this review we specifically focus on the current trends in the application of AI in metabolomics research. Existing practices are examined and a perspective on future directions for integrating AI into metabolomics research is presented.
Introduction
Metabolomics is a relatively new discipline within the field of science, emerging as a crucial component in the study of biological systems. Over the past few decades, particularly since the late 20th century and into the early 21st century, the application of metabolomics has grown rapidly, revolutionizing our understanding of various biological processes. Metabolomics focuses on studying small molecules, especially those generated through the metabolism of living organisms; however, the term is commonly used as a catch-all, describing the exploration of molecular distributions of all molecules, including not only those stemming from metabolic processes but also molecules originating from the environment, diet, etc. (Fiehn 2002, Tomita & Nishioka 2006, Griffiths 2008, Aksenov et al. 2017). In biomedical research, metabolomics involves the systematic analysis and measurement of metabolites found in samples like blood, urine, or tissue (di Meo et al. 2022). The primary goal of metabolomics is to understand the metabolic processes within organisms and how they are influenced by internal and external factors, including genetics, diet, lifestyle, and the environment. By profiling and quantifying a broad range of molecules, metabolomics provides a comprehensive overview of an organism's metabolic state, enabling the identification of metabolic signatures associated with various physiological and pathological conditions. This knowledge has far-reaching implications, with potential applications in disease diagnosis, prognosis, treatment monitoring, and the development of personalized medicine. Moreover, metabolomics finds applications in drug discovery, environmental toxicology, and food science.
Recently, metabolomics has experienced a significant surge in interest, largely due to its growing role in analyzing the effect and function of the microbiome. The gut microbiome in particular, comprising a multitude of different bacterial, fungal and archaeal species, is crucially important for health and well-being, to the extent that it has recently been classified as an organ (Baquero & Nombela 2012, Anwar et al. 2020). There has been an explosion in our understanding of the microbiome's role and its health implications. It has been shown that the microbiome plays a crucial role in conditions ranging from irritable bowel disease (IBD) (Menees & Chey 2018, Ghaffari et al. 2022), to those where the role of the microbiome is either unexpected or still being explored (e.g. autism spectrum disorder) (Pulikkan et al. 2019, Saurman et al. 2020, Alharthi et al. 2022, Morton et al. 2023) to the outright surprising, such as the role of the microbiome in cancer (Garrett 2015), including microbial DNA presence in tumors (Poore et al. 2020), which were long presumed sterile. Metabolomics plays a pivotal role in understanding the microbiome’s role by offering valuable insights into the metabolic activities and interactions between the host organism and its resident microorganisms, commensal, pathogenic, and everything in between (Browne et al. 2016). As per the central dogma of biology, the metabolome represents the ultimate readout of biological systems (Jansson & Baker 2016). The genome serves as the foundational blueprint for biology, encoding the genetic information that shapes an organism's characteristics and potential functions. However, it is the metabolome, with its inherent flexibility and responsiveness, that exhibits diverse expressions and adaptations in response to internal and external factors. Unlike the fixed nature of the genome, the metabolome is subject to constant modulation by environmental cues, diet, lifestyle, the organism's health status, etc.
This malleability allows the metabolome to manifest in various ways, making it a powerful indicator of an organism's physiological state and providing insights into disease progression, responses to treatment, and overall well-being. Studying the metabolites produced by the microbiome helps to reveal the functional dynamics of the microbial community, identify specific metabolic pathways, and uncover biomarkers associated with different microbial compositions or disease states. Molecular landscapes associated with microbiomes are complex and dynamic. For example, a single metabolite can be produced by more than one microbe, or multiple microbial species could be involved in production or modification of a metabolite (Zelezniak et al. 2015, Browne et al. 2016). This requires a comprehensive understanding of the interactions between the microbiome and the host (Schroeder & Bäckhed 2016).
The sheer size and complexity of metabolomes present multiple challenges that are yet to be fully overcome. Artificial intelligence (AI) has emerged as a powerful tool for processing and analyzing massive datasets, finding hidden patterns, or summarizing complex data. AI is a broad term, but it can be summarized as a set of tools and mathematical approaches that enable computer systems to perform tasks and make decisions that would otherwise require human intelligence. There are several broad categories of techniques that fall under the umbrella of AI. Unsupervised learning involves modeling the underlying structure or manifold of data without labels, through methods like principal component analysis (Sanguansat 2012), t-SNE (van der Maaten & Hinton 2008), and autoencoders (Pierre-Antoine 2010). Supervised learning focuses on predictive tasks like classification, predicting outcome Y from input data X (e.g. predicting the disease status from the abundances of metabolites measured by a mass spectrometer). Semisupervised learning combines unsupervised learning to discover features with supervised learning for prediction, which is useful when most data are unlabeled (Kingma et al. 2014, Devlin et al. 2018, Migdadi et al. 2021, Hamamsy et al. 2022). For example, molecular networking (Wang et al. 2016, Aksenov et al. 2021) uses unlabeled fragmentation spectra for dimensionality reduction, then uses some labeled data for identification. Active learning and reinforcement learning incorporate interaction, like robots exploring environments via sensory input to decide next actions. AI techniques include machine learning, which utilizes algorithms to enable computers to improve at tasks through exposure to data without explicit programming (Jordan & Mitchell 2015), and deep learning, which uses neural networks composed of multiple layers to automatically learn hierarchical feature representations directly from raw data (LeCun et al. 2015).
Overall, AI encompasses a spectrum of approaches from purely unsupervised manifold learning, to supervised classification, to interactive learning from experience, with appropriate techniques tailored to the problem and data availability. Advanced AI systems often strategically combine multiple techniques to produce robust and nuanced results.
The main machine learning approaches used in metabolomics experiments are supervised and unsupervised learning. Supervised learning takes in data, learns relationships, and makes a prediction based on the learned knowledge (Colins 2017). Unsupervised learning analyzes raw datasets, generating insights from unlabeled data (Usama et al. 2019). For metabolomics, there exist direct applications of these methodologies (Galal et al. 2022), and we foresee these applications expanding dramatically.
Here, we discuss the current applications of AI in metabolomics, as well as provide an outlook and a wish list for the future role of AI in the scientific process using the metabolomics field as an example.
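To make the distinction between the two learning modes concrete, the following minimal sketch applies each to a tiny, hypothetical table of metabolite abundances. The data values, the function names, and the choice of a nearest-centroid classifier and two-cluster k-means are illustrative assumptions, not methods taken from the studies cited above.

```python
from math import dist
from statistics import mean

# Hypothetical toy data: each row is a sample, each column the abundance of one metabolite.
healthy = [[1.0, 5.1], [1.2, 4.8], [0.9, 5.3]]
disease = [[3.1, 2.0], [2.8, 2.2], [3.3, 1.9]]


def centroid(rows):
    """Column-wise mean of a list of equally long rows."""
    return [mean(col) for col in zip(*rows)]


def fit_nearest_centroid(labelled):
    """Supervised step: learn one centroid per class from labeled samples."""
    return {label: centroid(rows) for label, rows in labelled.items()}


def predict(model, sample):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: dist(model[label], sample))


def two_means(rows, n_iter=10):
    """Unsupervised step: split unlabeled rows into two clusters (k-means with k=2)."""
    centers = [rows[0], rows[-1]]  # deterministic initialization, for the sketch only
    assignment = []
    for _ in range(n_iter):
        assignment = [min((0, 1), key=lambda k: dist(centers[k], r)) for r in rows]
        for k in (0, 1):
            members = [r for r, a in zip(rows, assignment) if a == k]
            if members:
                centers[k] = centroid(members)
    return assignment


model = fit_nearest_centroid({"healthy": healthy, "disease": disease})
predicted = predict(model, [3.0, 2.1])   # uses the labels
clusters = two_means(healthy + disease)  # ignores the labels
```

The supervised model needs the class labels to be known in advance, whereas the clustering recovers the same two groups from the abundances alone; real metabolomics tables have thousands of features and would call for more robust methods.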
Methodologies of metabolomics
Three analytical detection methods are most commonly associated with metabolomics: gas chromatography–mass spectrometry (GC-MS), liquid chromatography–mass spectrometry (LC-MS), and nuclear magnetic resonance (NMR). The specific advantages and disadvantages of these techniques are discussed in detail elsewhere (Patti et al. 2012, Gowda & Djukovic 2014, Bauermeister et al. 2022), and are not the focus of this review. The former two methodologies are examples of so-called hyphenated MS methods, where a mass spectrometer is interfaced with a separation module to fractionate and simplify complex samples, thus increasing the ability to comprehensively detect the complex metabolome. Mass spectrometry, broadly speaking, detects the mass of molecules by measuring the m/z (mass-to-charge ratio) of ions formed from neutral species. GC involves separating and detecting compounds in a mixture by exploiting differences in boiling temperature of compounds and their gas-phase interaction with a stationary phase. LC separates samples into individual components based on their hydrophobicity (and other factors) via differences in interactions in liquid phase with a stationary phase. NMR is a spectroscopic technique that uses energetic transitions of nuclear spins in the presence of a strong magnetic field. NMR offers the advantage of high reproducibility among laboratories, ensuring standardized procedures. However, it has lower sensitivity compared to MS techniques due to its low signal-to-noise ratio. Consequently, as biological systems tend to be highly complex and many molecules may be present at levels undetectable by NMR, mass spectrometry is a superior option for metabolomic studies, as it can be sensitive enough to detect metabolites circulating at femtomole or even attomole levels (Zhang et al. 2022).
GC-MS and LC-MS are the workhorse methodologies of metabolomics. A literature search for the term metabolomics returns tens of thousands of papers, roughly equally split between these techniques. GC-MS offers excellent separation capability, sensitivity, reproducibility, and fast analysis but is limited to identifying volatile compounds (Fiehn 2016). Also, it is not possible to unambiguously determine the mass of the molecule that gave rise to the MS spectrum (Bauermeister et al. 2022). On the other hand, LC-MS can detect a larger pool of metabolites (Patti et al. 2012, Gowda & Djukovic 2014), and can also provide information on the mass of the parent ion (MS1) as well as the fragmentation spectrum of the ion (MS/MS or MS2). A drawback of LC-MS is the limited ability to annotate the detected compounds (i.e. assign chemical identity to them), as most of the detected metabolome (often more than ~90%) is so-called metabolomics ‘dark matter’, i.e. molecules that are present in samples, but cannot be readily identified (da Silva et al. 2015). Compared to GC-MS, LC-MS has limited mass spectral libraries, but covers greater chemical space (Aksenov et al. 2017). The current LC-MS libraries consist of only a few tens of thousands of spectra, while the reference libraries for GC-MS contain over a million spectra (GC-MS libraries have been accumulated for over seven decades, while tandem MS libraries for LC-MS have only been rapidly accumulated over the past couple of decades).
There are two primary approaches for analyzing metabolites in a sample: targeted and untargeted metabolomics. Targeted metabolomics focuses on a specific list of metabolites, driven by particular biochemical questions or hypotheses related to specific pathways. On the other hand, untargeted metabolomics takes a global approach, aiming to measure as many metabolites as possible in a sample (Patti et al. 2012), but many of the detected molecular features cannot be annotated. The challenge of illuminating metabolomics ‘dark matter’ (da Silva et al. 2015) has not yet been overcome, although rapid progress in recent years has led to major advances in our ability to extract an increasingly greater amount of chemical information from the data. Looking ahead, there will be continued efforts to integrate targeted and untargeted metabolomics analyses, leveraging the strengths of each approach (Fiehn 2016). Targeted metabolomics provides more accurate quantification and concentration data for a defined set of metabolites (Melnik et al. 2017). This quantitative information can be used to calibrate and enhance the accuracy of untargeted analysis, which surveys a broader range of compounds. Combining these approaches (a so-called semitargeted approach) will enable both comprehensive coverage and accurate quantification of the metabolome.
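As a small worked example of what measuring m/z means, the sketch below computes the expected m/z of a protonated and a deprotonated ion from a neutral monoisotopic mass. It assumes the ion forms only by gaining or losing protons; other adducts (e.g. sodium) would need a different mass shift. The function names are hypothetical, and glucose is used purely as an illustration.

```python
PROTON_MASS = 1.007276  # Da; mass of a proton (a hydrogen atom minus its electron)


def mz_protonated(neutral_mass, charge=1):
    """m/z of an [M + zH]^z+ ion formed by adding `charge` protons (positive mode)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def mz_deprotonated(neutral_mass, charge=1):
    """m/z of an [M - zH]^z- ion formed by removing `charge` protons (negative mode)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass - charge * PROTON_MASS) / charge


GLUCOSE = 180.06339  # Da; monoisotopic mass of C6H12O6

mz_pos = mz_protonated(GLUCOSE)    # approximately 181.0707
mz_neg = mz_deprotonated(GLUCOSE)  # approximately 179.0561
```

The same neutral molecule therefore appears at different m/z values depending on ion mode and charge state, which is one reason the choice of acquisition mode shapes which metabolites are observed.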
AI and experimental design
In a typical metabolomics experiment, several key steps lead from the initial question to filling a knowledge gap. These steps are summarized and discussed in detail by Aksenov et al. (2017) and are schematically represented in Fig. 1. In many cases, AI solutions have already been proposed or implemented, as discussed next. We posit that each of these steps could benefit from the use of AI.
AI in experimental design
AI has the potential to revolutionize how we optimize experimental conditions in metabolomics to minimize bias, control confounding variables, and maximize the sensitivity and specificity of metabolite detection. Traditionally, researchers have relied on their experience to select the methodology for MS analysis, such as choosing between GC-MS and LC-MS, positive and negative ion mode, etc. (Patti 2011), with the goal of capturing a broad range of metabolites or specific molecules of interest. However, this approach does not guarantee the most suitable data for addressing research questions, as the broadest coverage of the metabolome does not guarantee that the biologically important metabolite(s) is/are indeed detected.
For instance, consider the case of Ruminococcus gnavus, a member of the human gut microbiome associated with Crohn's disease. This microbe is known to produce an inflammatory polysaccharide (Henke et al. 2019) that is linked to the pathogenic effect of the microbe. Detecting this molecule with the greatest efficiency requires using HILIC chromatography (Tang et al. 2016) and negative ion mode, an unusual combination of methodological choices in metabolomics. If researchers solely focus on the broadest metabolome approach, they would likely select reverse phase (RP) chromatography and positive ion mode, as a far greater number of metabolites are detectable due to generally greater stability of positively charged ions. Thus, in an untargeted analysis without any prior knowledge, the researchers would miss out on crucial information such as the involvement of this microbe’s specific molecule in Crohn's disease (Henke et al. 2019).
In such scenarios, AI can make a significant difference. Using AI, one could analyze large datasets of metabolomic information in context of research questions, and the distributions of the metabolites that are deemed important and relevant in the framework of said research question, then learn to predict the most likely and effective ways to identify specific metabolites for certain scenarios. By leveraging machine learning algorithms, AI can explore patterns and correlations in the data, ultimately suggesting optimal experimental conditions for detecting particular metabolites with enhanced accuracy and efficiency. This includes not only chromatography/mass spectrometry settings but also any other considerations such as sample collection, storage, or preparation (e.g. use of more or less polar extraction solvent), which would lead to bias in detection of certain metabolites or molecular families in general, thus defining the range of detectable metabolome. Other relevant questions could be explored, from the required study power, to the number and order of blanks in the experiment. Study power is a crucially important consideration, and well-established statistical tools exist to calculate it (Blaise 2013, Nyamundanda et al. 2013). However, the predicted number of samples is normally calculated based on the pilot or similar prior studies to assess the expected magnitude of the effect. As neither is necessarily precisely representative of the actual study, the magnitude of the effect is often ‘guesstimated’, postulated or even remains unknown. AI could be used to assess the expected magnitude of the effect, variability, etc. more reliably.
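The dependence of study power on the assumed effect size can be illustrated with the standard normal-approximation formula for a two-group comparison, n per group = 2(z_{1-α/2} + z_{power})² / d², where d is the standardized effect size. The sketch below is a generic textbook calculation, not the method of the tools cited above; the function name and the optional Bonferroni adjustment for multiple metabolite features are assumptions added for illustration.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80, n_tests=1):
    """Normal-approximation sample size per group for a two-sided, two-group comparison.

    effect_size: standardized mean difference (Cohen's d); this is the value that is
                 often guessed from pilot or similar prior studies.
    n_tests:     number of features tested; alpha is Bonferroni-adjusted (an assumption).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / n_tests) / 2)  # critical value for the test
    z_power = z.inv_cdf(power)                      # quantile for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


n_medium = samples_per_group(0.5)  # 63 samples per group
n_large = samples_per_group(1.0)   # 16 samples per group
```

Halving the assumed effect size roughly quadruples the required number of samples, so a poorly estimated effect magnitude translates directly into an under- or overpowered study.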
On the other hand, the blank collection strategy is, likewise, often purely ‘gut feel’. Increasing the number of blanks is generally beneficial to decrease the possibility of carryover, as well as accounting for background and contamination. However, increasing the number of blanks leads to reduced throughput and higher analysis costs, and thus a balance needs to be struck to ensure sufficient number of blanks is collected and analyzed but without overkill. Each lab has their own strategy that they stick to, with or without good understanding whether this strategy is indeed optimal. AI could be a great tool to suggest the best strategy for number and order of blanks in a specific experiment, by exploring the carryover patterns and the background trends, whether generalized or lab- and instrument-specific.
While the concept of using AI to predict the best approaches for identifying metabolites holds great promise, it is worth noting that its full implementation is still a future prospect. Techniques such as reinforcement learning can aid in tuning MS instrumentation settings to optimize output (Mnih et al. 2015). In addition, findings from AI models could be used to both design molecules and experiments, then closing the feedback loop, to make an active learning/reinforcement learning model (Abolhasani & Kumacheva 2023).
However, most of the scenarios of AI use described here are currently hypothetical. As AI continues to advance, with time, we may see the integration of AI-driven optimization in metabolomics becoming a standard practice, enabling researchers to gain deeper insights and make more informed decisions in their investigations. Also, while advanced AI tools provide powerful data analysis capabilities, a well-designed experiment with proper controls, quality checks, standardized metadata, and rigorous protocols remains paramount to ensure generation of high-quality, interpretable data that enables robust modeling (Dudzik et al. 2018).
Data conversion and preprocessing
Once the data are acquired, the raw data files that represent the readout of the instrument need to be converted into a tabulated list of detected molecular features. This step is called data processing and involves conversion of data into different formats, detection of individual features (deconvolution or feature finding), comparison of the features in the data to find the same features across different runs (alignment), calculation of peak areas, filtering to remove redundancies, etc. Different tools, both commercial and open source, exist for data processing (Kessner et al. 2008, Tsugawa et al. 2020, Schmid et al. 2023). The steps of data processing naturally lend themselves to the use of AI. Approaches such as convolutional neural networks (CNNs) (Albawi et al. 2017), recurrent neural networks (RNNs) (LeCun et al. 2015), and autoencoders (Gomari et al. 2022) are highly suitable to the MS data structure and have been applied to denoise data, to explore patterns in MS data for peak detection, and to investigate elution patterns across data to separate overlapping peaks (deconvolution) (Pomyen et al. 2020). The AI-based tools demonstrate improved performance compared to the processing approaches that require human input to determine processing settings. As an example, an AI-based approach for GC-MS data processing has been developed (Aksenov et al. 2020, 2021). This approach, named MSHub, investigates existing m/z and retention time shifts within the data using a one-layer neural network and determines optimal settings for the data processing. This has two benefits: a reduced barrier to data processing and improved reproducibility. In order to be able to set all of the settings correctly, a user needs to have deep knowledge of mass spectrometry and a thorough understanding of the data. Some of the settings could be rather unforgiving and, if set incorrectly, render processing results subpar or outright unusable.
On the other hand, starting from the same data, different users will invariably differ in their settings, leading to differences in processed data, which can cause significant irreproducibility. MSHub determines the optimal settings, in the same way, which means any users will generate the same, optimal results, starting from the same data. Moreover, a one-layer neural network of MSHub extracts spectral information across the entire dataset; thus, adding more data would enable better learning of spectral patterns (Aksenov et al. 2021). Although similar methodology does not yet exist for LC-MS data processing, we believe that AI-based approach(es) will become increasingly utilized across various kinds of MS data to simplify, streamline, and harmonize data processing.
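For readers unfamiliar with feature finding, the sketch below shows a deliberately simple, classical (non-AI) peak detector on a synthetic chromatogram. It is not MSHub or any published tool; the synthetic signal, the function names, and in particular the user-chosen `min_height` threshold are assumptions that stand in for the kind of manual setting discussed above.

```python
from math import exp


def gaussian(t, center, height, width):
    """Gaussian peak shape used to simulate an eluting compound."""
    return height * exp(-0.5 * ((t - center) / width) ** 2)


# Synthetic extracted-ion chromatogram: two peaks on a flat baseline (arbitrary units).
times = [i * 0.1 for i in range(200)]
signal = [5.0 + gaussian(t, 6.0, 100.0, 0.4) + gaussian(t, 12.0, 40.0, 0.5) for t in times]


def find_peaks(times, signal, min_height):
    """Return (apex_time, apex_height, area) for each local maximum above min_height."""
    baseline = min(signal)  # crude baseline estimate; assumes a flat background
    peaks = []
    for i in range(1, len(signal) - 1):
        is_apex = signal[i - 1] < signal[i] >= signal[i + 1]
        if not is_apex or signal[i] - baseline < min_height:
            continue
        # Walk outward until the signal drops to 1% of the apex height above baseline.
        cutoff = baseline + 0.01 * (signal[i] - baseline)
        left = i
        while left > 0 and signal[left] > cutoff:
            left -= 1
        right = i
        while right < len(signal) - 1 and signal[right] > cutoff:
            right += 1
        # Trapezoidal integration of the baseline-subtracted signal gives the peak area.
        area = sum(
            0.5 * (signal[k] + signal[k + 1] - 2 * baseline) * (times[k + 1] - times[k])
            for k in range(left, right)
        )
        peaks.append((times[i], signal[i] - baseline, area))
    return peaks


peaks = find_peaks(times, signal, min_height=10.0)
```

Setting `min_height` too high would silently drop the smaller peak, and setting it too low on noisy data would report spurious features, which illustrates why different users processing the same data can end up with different feature tables.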
Data correction
High-throughput omics data often encounter experimental bias, such as differences across sample sets, analysis times, and instrumentation drift, leading to various degrees of batch effect, i.e. situations when the technical differences are induced in the data (De Livera et al. 2015, Wehrens et al. 2016, Broadhurst et al. 2018). These differences may obscure or confound biological differences. However, the nature of the problem, yet again, lends itself to the use of AI, to explore complex patterns in the data and account for nonlinear effects. A significant breakthrough in the development of AI-based tools to mitigate such batch effects has been taking place, especially over the past decade. Many of the batch correction approaches have originated in the sequencing field and have been adopted for use in MS-based metabolomics. One widely used algorithm for batch correction is ComBat (Johnson et al. 2007), which operates within the empirical Bayes framework to adjust for batch effects. ComBat estimates batch-specific parameters and effectively reduces the impact of batch-specific effects by bringing them closer to a pooled estimate. This alignment process ensures that the data from different batches are harmonized while preserving the underlying biological signal (Johnson et al. 2007). Another powerful AI-based method for batch correction is Mutual Nearest Neighbor (MNN) (Haghverdi et al. 2018). MNN leverages shared nearest neighbor information to identify and correct batch effects. By aligning the nearest neighbors across batches, MNN can effectively detect and adjust for systematic variations, enabling datasets to be harmonized (Haghverdi et al. 2018).
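The basic idea of bringing batch-specific location and scale closer to a pooled estimate can be sketched for a single metabolite feature as follows. This is a simplified illustration and not ComBat itself: it omits the empirical Bayes shrinkage across features and any modeling of biological covariates, and it implicitly assumes that biological groups are balanced across batches. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def align_batches(values, batches):
    """Shift and rescale each batch of one feature to the grand mean and pooled spread.

    values:  abundances of a single feature, one per sample.
    batches: batch label of each sample, same length as values.
    """
    grand_mean = mean(values)
    groups = {}
    for value, batch in zip(values, batches):
        groups.setdefault(batch, []).append(value)

    stats = {}
    squared_residuals = 0.0
    for batch, members in groups.items():
        batch_mean = mean(members)
        ss = sum((v - batch_mean) ** 2 for v in members)
        stats[batch] = (batch_mean, sqrt(ss / len(members)))
        squared_residuals += ss
    pooled_sd = sqrt(squared_residuals / len(values))  # within-batch spread, pooled

    corrected = []
    for value, batch in zip(values, batches):
        batch_mean, batch_sd = stats[batch]
        scale = pooled_sd / batch_sd if batch_sd > 0 else 1.0
        corrected.append(grand_mean + (value - batch_mean) * scale)
    return corrected


# Hypothetical feature measured in two batches with a strong additive offset.
values = [10.0, 12.0, 11.0, 13.0, 20.0, 23.0, 21.0, 24.0]
batches = ["A"] * 4 + ["B"] * 4
corrected = align_batches(values, batches)
```

After correction both batches share the same mean and spread for this feature. If the batches differed in biological composition, this naive adjustment would also remove real signal, which is one motivation for the more careful model-based methods described above.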
A great strategy to mitigate batch effect is by using the Automatic Feature Engineering approach. Automatic feature engineering refers to methods that automatically generate or construct informative features from raw data to use in machine learning models, without requiring manual preprocessing and extraction of features. For MS, this would require extracting MS1 features and MS2 spectra in a differentiable manner, so that batch effects can be inferred directly from the experimental design. MS1 features and MS2 features would then be automatically learned, to calibrate batch effects, and infer a manifold over both MS2 features and biospecimens (AlQuraishi & Sorger 2021).
The AI-based tools have already significantly advanced the field of omics data analysis by addressing the challenge of experimental bias and batch effects. Comparatively speaking, the use of AI is quite mature for batch correction purposes. However, far greater utility of AI can be leveraged in the future. In particular, batch correction assumes a consistent experimental protocol throughout the study, and the correction is only possible until the technical disparities become too stark. Altering experimental choices such as chromatographic protocols, data acquisition modes, or use of different instruments would, most of the time, lead to irreconcilable differences where the data cannot be compared directly. Currently, the workaround for such disparate datasets is meta-analysis (Jarmusch et al. 2020). Global data harmonization, where every data point ever collected could be co-analyzed with any other data point, is still a thing of the future. However, if and when this transpires, it is almost certain that AI will play a central role in such harmonization.
Metabolite annotation
Metabolite annotation is a translation of a spectral pattern into the molecular identity. The most common approach for metabolite annotation involves matching mass spectra (electron ionization (EI) for GC-MS and tandem MS for LC-MS) to reference library spectra of known molecules (Aksenov et al. 2017). Both the matching and reference library generation could benefit from leveraging AI.
The matching of spectra against libraries appears straightforward: a query spectrum and a library spectrum are compared, and a similarity score is calculated (Stein 2012), e.g. cosine (dot product) score (Styczynski et al. 2007). However, the spectral patterns can vary greatly in richness, or fidelity (levels of noise, missing peaks, m/z drift etc.), while libraries can have redundancies or be incomplete and missing correct matches. Correspondingly, these various scenarios could be accounted for by using machine learning techniques to improve the scoring function. As an example, Spec2Vec is an approach to calculate spectral similarity based on learned co-occurrences across large datasets, considering the relationships between fragments instead of relying on binary assessments (Huber et al. 2021).
In terms of libraries, conventionally, the MS reference spectra are collected either for pure reference compounds, or for specific sources, where compounds may be unknown, but the molecules could be linked to the source, such as microbes (Han et al. 2021, Zuffa et al. 2024) and food (Song et al. 2021, Gauglitz et al. 2022). However, while these libraries are valuable for continuous expansion of reference data, they are limited in scope. We do not really know the magnitude of the chemical space, but it likely far exceeds all the existing libraries or the libraries that can be assembled in the foreseeable future (Aksenov et al. 2017), as hundreds of millions to billions of molecules are theoretically possible (Kind & Fiehn 2007, 2010), while current spectral libraries number in the hundreds of thousands (Aksenov et al. 2017). To broaden the scope of identified molecules, AI can play a crucial role in generating comprehensive reference libraries for metabolite identification. In silico generation of spectra, i.e. computational simulation or prediction of the possible fragmentation patterns, has been instrumental in lipidomics (Kind et al. 2013), due to comparatively predictable spectra of lipids, but is less utilized in metabolomics. METLIN (Guijas et al. 2018) and MassBank (Horai et al. 2010) are examples of databases that utilize AI techniques to build extensive reference libraries for LC-MS and GC-MS data. They provide information on metabolite structures, mass spectra, and fragmentation patterns, facilitating accurate metabolite identification in various biological samples. At the present time, ab initio methods are too computationally demanding to handle the scope needed for metabolomics, but this may change with the advent of quantum computing (Naeij et al. 2023).
Conversely, instead of generating libraries, the fragmentation patterns could be used to directly predict the structure. Yet again, AI approaches are ideal for this type of problem. In fact, the use of AI has led to a rapid increase in the ability to correctly predict structures (Shen et al. 2013). Sirius, a machine learning-based tool that can predict formula and structure (Dührkop et al. 2019), and CANOPUS, which predicts molecular families (Dührkop et al. 2021), have become staples of metabolomics analysis. In another example, the self-supervised BERT (Bidirectional Encoder Representations from Transformers) approach provides a way to learn latent representations of mass spectra in an unsupervised manner, and does not rely on peak assignment (Devlin et al. 2018). As AI tools improve, so will our ability to predict molecular structure from spectra. As an example, a recent tool named MS2Mol further pushes the boundaries in prediction accuracy (Butler et al. 2023).
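The cosine (dot product) score mentioned above can be sketched as follows. This is a minimal version under stated assumptions: spectra are centroided lists of (m/z, intensity) pairs, peaks are paired greedily within a fixed m/z tolerance, and raw intensities are used, whereas practical implementations often apply intensity or m/z weighting. The function name, tolerance, and example spectra are hypothetical.

```python
from math import sqrt


def cosine_score(spec_a, spec_b, tolerance=0.01):
    """Cosine (normalized dot product) score between two centroided spectra.

    Each spectrum is a list of (mz, intensity) pairs. Peaks are paired greedily,
    largest intensity product first, when their m/z values differ by <= tolerance.
    Each peak is used in at most one pair.
    """
    candidates = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tolerance
    ]
    used_a, used_b, dot = set(), set(), 0.0
    for product, i, j in sorted(candidates, reverse=True):
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = sqrt(sum(intensity ** 2 for _, intensity in spec_a))
    norm_b = sqrt(sum(intensity ** 2 for _, intensity in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra: the library entry shows slight m/z drift and one missing peak.
query = [(85.030, 30.0), (127.040, 100.0), (145.050, 60.0)]
library_hit = [(85.033, 28.0), (127.042, 100.0)]
unrelated = [(91.050, 100.0), (119.080, 45.0)]

score_hit = cosine_score(query, library_hit)
score_unrelated = cosine_score(query, unrelated)
```

The score is 1 for identical spectra and 0 when no peaks match within tolerance; the missing peak in `library_hit` lowers the score below 1, which shows how incomplete or noisy spectra can penalize a correct match and why learned scoring functions are attractive.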
AI-driven tools and techniques have already revolutionized metabolite annotation and identification by enabling the creation of comprehensive reference libraries, improving spectral matching accuracy, and providing efficient methods for molecular structure prediction without solely relying on existing databases. These advancements open new avenues for exploring the vast metabolome and gaining deeper insights into the complex world of metabolomics. Yet, we believe that we are only scratching the surface in terms of possibilities. The molecular annotation can be further contextualized to include all of the knowledge internal and external to the mass spectrometry experiment: the instrumental settings, sample preparation, type of sample or biological context. All of these and a myriad of other factors are reflected in detected and detectable metabolomes. Currently, it is up to a researcher to draw on the body of knowledge, experience, and familiarity with the problem, to propose annotations for ambiguous spectra or unknown molecules. Instead, using AI tools, we could carry out chemical inference, i.e. infer functional annotations of molecules, and perform classification/regression to infer biological states (i.e. disease) from engineered MS features. With the power of AI, the prior knowledge could be drawn upon for annotation context. Considering how rapidly improvements in annotation are taking place, we can hope for the possibility of completely illuminating ‘dark matter’ in our lifetime.
Statistical analysis and network inferences
The application of AI and deep learning in various analyses is a vibrant and rapidly developing field, outside of the scope of this review. Many of the advances that were made in image, video (e.g. Sora by OpenAI) or text recognition are making their way into scientific data analysis (Eraslan et al. 2019, Agarwal et al. 2020, Piuri et al. 2020, Baldi 2021, Elloumi 2021, Jabbar et al. 2021, Kose et al. 2021, Tiwari 2021, Bacciu et al. 2022, Alam et al. 2023, Attique et al. 2023, Dunn et al. 2023, Sumathi et al. 2023).
Another exciting application of AI is in network inference analysis. The chemical distribution patterns in MS analysis could be represented by networks (Bandeira et al. 2007, Wang et al. 2016, Aksenov et al. 2021). This approach is based on the conjecture that spectral similarities are reflective of structural similarities, as the shared structural fragments in molecules will give rise to a common peak(s) in fragmentation spectra. Representing MS data as a network in turn opens the possibility to leverage the approaches that are developed to operate on the data with network structure and leverage connectivity information. These include Network on Graphs (Bronstein et al. 2017), Community Detection (Yang et al. 2016), Network Propagation (Noble et al. 2005, da Silva et al. 2018) and Network Embedding (Amara et al. 2022), among others.
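The step from pairwise spectral similarities to a network can be sketched in a few lines: spectra become nodes, pairs whose similarity exceeds a threshold become edges, and connected components of the resulting graph serve as the simplest possible stand-in for groups of related molecules. The similarity values, the threshold, and the use of plain connected components instead of a dedicated community detection algorithm are all assumptions made for illustration.

```python
def connected_components(n_nodes, edges):
    """Group nodes into connected components using a union-find structure."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical pairwise spectral similarities between five spectra (indices 0 to 4).
similarity = {
    (0, 1): 0.92, (0, 2): 0.15, (1, 2): 0.81,
    (2, 3): 0.10, (3, 4): 0.88, (0, 4): 0.05,
}
THRESHOLD = 0.7  # assumed cutoff; suitable values depend on the scoring function and data

edges = [pair for pair, score in similarity.items() if score >= THRESHOLD]
families = connected_components(5, edges)
```

Note that spectra 0 and 2 end up in the same group even though their direct similarity is low, because both are similar to spectrum 1; this transitive linkage is what allows annotations to be propagated through a network.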
Data integration
As noted earlier, the revolution in sequencing and the resulting microbiome research has also led to rapidly increasing interest in metabolomics. Metabolomics provides an important piece of the puzzle: while DNA sequencing of microbiomes provides a template of the microbial composition and genetic potential, the actual small molecule metabolites produced represent the functional readout of the system. Pairing microbiome sequencing with metabolomics therefore provides critical insights by linking the metabolic profile of a microbiome to the microbes present and their activity. Other omics data, especially transcriptome and proteome, can provide additional information that helps to further uncover biological complexities. Multi-omics analysis is more than the sum of its parts (Jansson & Baker 2016), and thus there exists a strong impetus to integrate and co-analyze various omics data. However, such integration presents several challenges, which include: differences in data structure (e.g. sequencing data are compositional, discrete, and sparse, i.e. consist mostly of zeroes, while MS data are partially compositional, continuous, and far less sparse); size (datasets are often large, with thousands to tens of thousands of operational taxonomic units (OTUs) in sequencing and molecular features in metabolomics); complexity and nonlinearity (multiple omics features can have complex and nonlinear relationships and contain interdependencies and redundancies); as well as a multitude of technical artifacts that include missing data, differing levels of noise, possible differences in experimental conditions for different -omes, etc. Robust computational methods are therefore needed to address these challenges and effectively extract inter-omics relationships from heterogeneous, large-scale, noisy biological data generated across multiple platforms and technologies. Yet again, AI is a very promising way to tackle such challenges.
For example, tools like sparse canonical correlation analysis (SCCA) are highly suitable for multi-omics data as they can handle high dimensionality, reduce noise, and detect complex relationships (Lê Cao et al. 2009).
Powerful AI tools like deep neural networks have been rapidly advanced and refined in recent years for applications such as image recognition, natural language processing, and recommendation systems. Repurposing these cutting-edge AI technologies is now gaining momentum across science as well. An example of such adoption is the recently developed method called MMvec (molecule–microbe vectors), an approach that estimates the conditional probability that specific microorganisms are associated with certain molecules (Lloyd-Price et al. 2019, Morton et al. 2019). This approach overcomes a major limitation of integrating multi-omics data by correlating microbes with metabolites: because the data are compositional, correlation analysis tends to produce an overwhelmingly high rate of false positives and false negatives (Friedman & Alm 2012, Kurtz et al. 2015, Weiss et al. 2016, Gloor et al. 2017, Vandeputte et al. 2017). MMvec instead investigates the co-occurrence patterns of microbes (OTUs) and metabolites (MS features) by matrix factorization. MMvec borrows the methodology previously developed for language processing, word2vec, which estimates word probabilities conditioned on a single particular word (Mikolov et al. 2013). This approach has resulted in a breakthrough in our ability to functionally link microbes and molecular distributions and has been shown to uncover compelling biological insight, e.g. metabolites produced by microorganisms in inflammatory bowel disease, that was not possible otherwise (Morton et al. 2019). MMvec and related, perhaps future, approaches, along with existing cross-domain techniques such as bi-clustering analysis (Gu & Veselkov 2018), multi-block data integration (Jiang et al. 2021), and joint nonnegative matrix factorization (NMF) (Abe et al. 2021), hold promise for linking microbial community taxonomic composition or functional potential with metabolic output (Chong & Xia 2017).
By modeling the interconnected inferences across multi-omics datasets, the resulting meta-analyses can structurally map metabolic relationships to provide evidence of microbial biosynthetic origins. These AI-powered strategies are needed to overcome the limitations of conventional statistical correlation analysis by encoding biological constraints to simulate multi-scale mechanisms of metabolite production in complex microbial communities.
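Two points from the discussion above can be made concrete with a small sketch. The first part shows how converting absolute abundances to proportions can by itself create a strong correlation between two taxa that are uncorrelated in absolute terms. The second part shows the general form of a word2vec-style conditional probability, P(metabolite | microbe), computed as a softmax over inner products of low-dimensional vectors. All abundances, vectors, and names (`taxon_x`, `OTU_1`, `feature_a`, etc.) are hypothetical, and the embeddings are fixed by hand rather than learned from co-occurrence data, so this is an illustration of the idea and not the MMvec implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def close(sample):
    """Closure: rescale one sample so that its parts sum to 1 (relative abundances)."""
    total = sum(sample)
    return [value / total for value in sample]


# Part 1: hypothetical absolute abundances of three taxa in four samples.
# taxon_x and taxon_y are uncorrelated; taxon_z varies over orders of magnitude.
taxon_x = [10.0, 12.0, 10.0, 12.0]
taxon_y = [20.0, 20.0, 22.0, 22.0]
taxon_z = [5.0, 50.0, 500.0, 5000.0]

absolute_r = pearson(taxon_x, taxon_y)

relative = [close(sample) for sample in zip(taxon_x, taxon_y, taxon_z)]
relative_r = pearson([s[0] for s in relative], [s[1] for s in relative])


# Part 2: conditional probabilities from low-dimensional vectors.
def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def metabolite_given_microbe(microbe_vec, metabolite_vecs, biases):
    """P(metabolite | microbe) as a softmax over inner products plus per-metabolite biases."""
    names = sorted(metabolite_vecs)
    scores = [
        sum(u * v for u, v in zip(microbe_vec, metabolite_vecs[name])) + biases[name]
        for name in names
    ]
    return dict(zip(names, softmax(scores)))


# Hand-picked (not learned) embeddings, for illustration only.
microbe_vectors = {"OTU_1": [1.0, 0.0], "OTU_2": [0.0, 1.0]}
metabolite_vectors = {
    "feature_a": [2.0, 0.0],
    "feature_b": [0.0, 2.0],
    "feature_c": [0.5, 0.5],
}
metabolite_biases = {name: 0.0 for name in metabolite_vectors}

probs = metabolite_given_microbe(
    microbe_vectors["OTU_1"], metabolite_vectors, metabolite_biases
)
```

In the first part, the correlation between the two taxa is zero on absolute abundances but becomes strongly positive after closure, because both proportions shrink whenever the third taxon blooms. In the second part, the probabilities for a given microbe sum to one across metabolites, so the output is a ranking of candidate metabolites per microbe rather than a set of pairwise correlation coefficients.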
While these results have been promising, we need to acknowledge that we are, again, likely at the beginning of a long journey on the path of adopting and leveraging AI tools. The situation is akin to harmonizing and co-analyzing MS data but, of course, with a higher level of complexity, and is thus even more challenging. Ideally, we would like to analyze any omics data in the context of any other data and draw conclusions across disparate studies. Whether this would be possible in a meaningful way remains an open question, for now.
Knowledge acquisition: future uses of AI
In the rapidly evolving landscape of technology, the way the scientific process unfolds is on the brink of a transformative shift, propelled by the widespread adoption of AI models for knowledge generation and access. The integration of AI in scientific exploration clearly holds great promise, and we believe it will have a direct role in accelerating the pace of discovery, something that we already see happening. Many potential applications of combining AI with metabolomics can be conceived. For example, AI could be utilized in drug design, where the goal is to determine an optimal drug to treat patients with a specific disease. Given prior experimental data on the efficacy of existing drugs and molecular profiles of the patient (e.g. composition of the fecal metabolome), AI methods could be leveraged to design personalized drug candidates with higher efficacy by learning from the data, or even to design molecules de novo (Ajagekar & You 2023). It is likely that there will be many applications that we have not imagined yet.
The last step in the diagram shown in Fig. 1 is ‘knowledge incorporation’ and ‘re-learning’. This differs from the original diagram presented in Aksenov et al. (2017), where the last step was simply ‘knowledge’, in an important way: it implies an upgrade from static knowledge acquisition to a dynamic one. The paradigm for knowledge acquisition and dissemination has been formed over centuries. In the last century it solidified into the way we currently exchange knowledge: research findings are summarized, interpreted, and then published as a manuscript in a scientific journal, following peer review. Knowledge is thus disseminated in a processed format, where the findings are represented in the way the authors perceive and interpret them. The reviewers have some, albeit rather limited, ability to contribute to the interpretation process. Reviewers are not expected to repeat the analysis from scratch, starting from the raw data, and thus have to rely on the researchers’ judgment, at least with respect to the initial analyses that advance from raw data to interpretable findings. The raw data may or may not be shared via deposition into open access repositories. It is common to provide only processed data as a table, figure, etc. in Supplemental Materials. In the future, we envision that the final step in the workflow will be knowledge incorporation and re-learning, where new knowledge is immediately (or relatively quickly) included into existing knowledge bases and is recursively used to update learning algorithms. Any subsequent analysis would draw on all existing information, including the new knowledge, with knowledge updates tracked perhaps via mechanisms akin to blockchain technology. This will likely be driven by AI that recursively re-learns based on new evidence and updated knowledge across knowledge bases.
The current format of sharing findings in a paper may itself eventually become obsolete. It is already possible to summarize complex concepts and get answers to sophisticated questions quickly and easily using conversational AI assistants. AI is already capable of writing a research paper (Baker et al. 2023). It is quite possible that the scientific paper format will eventually morph into some form of ‘AI Science Chat Bot’ that, when prompted with a question, would provide an answer based on the current state of knowledge. Any new knowledge would be incorporated to update and refine the existing paradigm, while tracking the provenance of the information and the intellectual contributions of researchers. It is still, however, up to humans to come up with new ideas and propose explanations (or so we hope). Both re-learning and examination using new data require comprehensive data integration, that is, gathering and organizing data that differ in format, source, and characteristics. The rise of data aggregators has already revolutionized how we access information. Examples of sequencing data repositories are the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) (Leinonen et al. 2011), the European Nucleotide Archive (ENA) (Burgin et al. 2023), and the DNA Data Bank of Japan (DDBJ) (Mashima et al. 2017); MS-based metabolomics data repositories include ProteomeXchange (Vizcaíno et al. 2014), GNPS (Wang et al. 2016), MetaboLights (Haug et al. 2020), and Metabolomics Workbench (Sud et al. 2016). These platforms can act as powerful tools, allowing scientists to explore vast arrays of data and extract relevant insights to address their specific research inquiries. We envision such platforms, and meta-aggregators of data, playing a continuously increasing role in the future.
By harnessing the capabilities of AI algorithms, data aggregators will be able to efficiently sift through massive volumes of information and extract relevant and useful data (as well as metadata to provide context) to draw upon or to contextualize research questions. Trends that are too subtle or complex to glimpse in limited datasets may become noticeable when big data are interrogated.
The implications of AI entering the scientific process are profound. In the near term, it will make the scientific process more efficient and will save researchers’ time. In the long run, it will change the way we attain, interpret, exchange, and disseminate knowledge. The AI-powered future is likely to be more collaborative and inclusive (as the barriers to contributing to scientific processes will be reduced), as well as interconnected (the trend for interdisciplinary research is likely to accelerate). It will bring certain challenges, too. It will be crucial to install safeguards for AI use that prevent the propagation of errors, biases, and harm through continuous learning and improvement over time.
Alongside the potential benefits of AI applications, there are a number of challenges that need to be overcome to realize them. First, AI models are data limited, meaning that model accuracy is limited by the quality of the training dataset. This presents a dilemma, since most experimental insights have been gained from a select few organisms. Correspondingly, future AI algorithms need to be able to generalize well beyond commonly studied organisms. Emerging insights from self-supervised learning have shown promise in overcoming these challenges: data collected from unknown organisms and molecules can also be included in training the AI models, lowering the barriers for model generalization across this vast biological unknown.
To further improve model generalization across this biological unknown, incorporating biological knowledge is key. In the machine learning literature, such built-in assumptions are called inductive biases: the implicit structure of the data is incorporated into the AI model (e.g. convolutional neural networks, recurrent neural networks, and the attention mechanism). We are still in the very early stages of leveraging genomic architectures, metabolic pathways, and metabolite structures to engineer new inductive biases. Nevertheless, this will play an increasingly important role in constructing useful AI models designed to solve real biological problems.
Finally, a highly accurate AI model does not necessarily capture true causal explanations for its predictions when trained on correlative datasets. For instance, a high-performing image classifier distinguished wolves from dogs simply by the presence of snow in the background rather than by discerning anatomical features. The model exploited such incidental biases rather than learning biologically relevant differences. This illustrates the need for interpretable models even within successful implementations, to ensure that conclusions stem from scientifically meaningful input variables rather than spurious correlations. Like peer researchers, algorithms must demonstrate structure–function comprehension. Simply answering questions correctly is inadequate unless the answers come with justification that mirrors subject matter expertise and can reliably inform future discovery. As metabolomics explores complex and subtle molecular patterns, developing best practices for deploying AI as an insightful collaborative assistant requires ongoing discourse between data science and domain sciences to translate accuracy into real understanding. The best practices for scientific use of AI in general, and in specific practices and applications in particular, are yet to be established.
Declaration of interest
AAA and AVM are founders of Arome Science, Inc. JTM is founder and CEO of Gutz Analytics LLC. All other authors declare no competing interests.
Funding
AAA and AVM were supported by the USDA NIFA GRANT13665683.
Author contribution statement
AAA provided supervision. AAA, JTM, and AVM devised the conceptual framework for the manuscript. AAA, JTM, EAC, and WC wrote and edited the manuscript. EAC generated Fig. 1. All authors reviewed and approved the final version of the manuscript.
Acknowledgements
The authors are grateful to Dr Ricardo DaSilva for helpful discussions.
References
Abe K, Hirayama M, Ohno K, et al. 2021 Hierarchical non-negative matrix factorization using clinical information for microbial communities. BMC Genomics 22 104. (https://doi.org/10.1186/s12864-021-07401-y)
Abolhasani M & Kumacheva E 2023 The rise of self-driving labs in chemical and materials sciences. Nature Synthesis 2 483–492. (https://doi.org/10.1038/s44160-022-00231-0)
Agarwal B, Balas VE, Jain LC, et al. 2020 Deep Learning Techniques for Biomedical and Health Informatics. Cambridge, MA: Academic Press. (https://doi.org/10.1016/C2018-0-04781-7)
Ajagekar A & You F 2023 Molecular design with automated quantum computing-based deep learning and optimization. Npj Computational Materials 9 1–14. (https://doi.org/10.1038/s41524-023-01099-0)
Aksenov AA, da Silva R, Knight R, et al. 2017 Global chemical analysis of biology by mass spectrometry. Nature Reviews Chemistry 1. (https://doi.org/10.1038/s41570-017-0054)
Aksenov AA, Laponogov I, Zhang Z, et al. 2020 Algorithmic learning for auto-deconvolution of GC-MS data to enable molecular networking within GNPS. bioRxiv. (https://doi.org/10.1101/2020.01.13.905091)
Aksenov AA, da Silva R, Knight R, et al. 2021 Auto-deconvolution and molecular networking of gas chromatography-mass spectrometry data. Nature Biotechnology 39 169–173. (https://doi.org/10.1038/s41587-020-0700-3)
Alam T, Shia W-C, Hsu FR, et al. 2023 Improving breast cancer detection and diagnosis through semantic segmentation using the Unet3+ deep learning framework. Biomedicines 11 1536. (https://doi.org/10.3390/biomedicines11061536)
Albawi S, Mohammed TA & Al-Zawi S 2017 Understanding of a convolutional neural network. In International Conference on Engineering and Technology (ICET) 2017. Antalya: IEEE Publications. (https://doi.org/10.1109/ICEngTechnol.2017.8308186)
Alharthi A, Alhazmi S, Alburae N, et al. 2022 The human gut microbiome as a potential factor in autism spectrum disorder. International Journal of Molecular Sciences 23 1363. (https://doi.org/10.3390/ijms23031363)
AlQuraishi M & Sorger PK 2021 Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms. Nature Methods 18 1169–1180. (https://doi.org/10.1038/s41592-021-01283-4)
Amara A, Frainay C, Jourdan F, et al. 2022 Networks and graphs discovery in metabolomics data analysis and interpretation. Frontiers in Molecular Biosciences 9 841373. (https://doi.org/10.3389/fmolb.2022.841373)
Anwar H, Irfan S, Hussain G, et al. 2020 Gut microbiome: a new organ system in body. In Parasitology and Microbiology Research. Eds Bastidas Pacheco GA & Kamboh AA. London: IntechOpen. (https://doi.org/10.5772/intechopen.89634)
Attique M, Alkhalifah T, Alturise F, et al. 2023 DeepBCE: evaluation of deep learning models for identification of immunogenic B-cell epitopes. Computational Biology and Chemistry 104 107874. (https://doi.org/10.1016/j.compbiolchem.2023.107874)
Bacciu D, Lisboa PJG & Vellido A 2022 Deep Learning in Biology and Medicine. Singapore: World Scientific. (https://doi.org/10.1142/9781800610941_0001)
Baker N, Thompson B & Fox D 2023 ChatGPT Can Write a Paper in an Hour – but There Are Downsides. London: Nature Publishing Group. (https://doi.org/10.1038/d41586-023-02298-x)
Baldi P 2021 Deep Learning in Science. Cambridge: Cambridge University Press. (https://doi.org/10.1017/9781108955652)
Bandeira N, Tsur D, Frank A, et al. 2007 Protein identification by spectral networks analysis. PNAS 104 6140–6145. (https://doi.org/10.1073/pnas.0701130104)
Baquero F & Nombela C 2012 The microbiome as a human organ. Clinical Microbiology and Infection 18(Supplement 4) 2–4. (https://doi.org/10.1111/j.1469-0691.2012.03916.x)
Bauermeister A, Mannochio-Russo H, Costa-Lotufo LV, et al. 2022 Mass spectrometry-based metabolomics in microbiome investigations. Nature Reviews Microbiology 20 143–160. (https://doi.org/10.1038/s41579-021-00621-9)
Blaise BJ 2013 Data-driven sample size determination for metabolic phenotyping studies. Analytical Chemistry 85 8943–8950. (https://doi.org/10.1021/ac4022314)
Broadhurst D, Goodacre R, Reinke SN, et al. 2018 Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics 14 72. (https://doi.org/10.1007/s11306-018-1367-3)
Bronstein MM, Bruna J, LeCun Y, et al. 2017 Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 18–42. (https://doi.org/10.1109/MSP.2017.2693418)
Browne HP, Forster SC, Blessing O, et al. 2016 Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation. Nature 533 543–546. (https://doi.org/10.1038/nature17645)
Burgin J, Ahamed A, Cummins C, et al. 2023 The European nucleotide archive in 2022. Nucleic Acids Research 51 D121–D125. (https://doi.org/10.1093/nar/gkac1051)
Butler T, Frandsen A, Lightheart R, et al. 2023 MS2Mol: A Transformer Model for Illuminating Dark Chemical Space from Mass Spectra. ChemRxiv. (https://doi.org/10.26434/chemrxiv-2023-vsmpx-v2)
Chong J & Xia J 2017 Computational approaches for integrative analysis of the metabolome and microbiome. Metabolites 7 62. (https://doi.org/10.3390/metabo7040062)
Colins M 2017 Machine Learning: An Introduction to Supervised and Unsupervised Learning Algorithms. CreateSpace Independent Publishing Platform.
da Silva RR, Dorrestein PC & Quinn RA 2015 Illuminating the dark matter in metabolomics. PNAS 112 12549–12550. (https://doi.org/10.1073/pnas.1516878112)
da Silva RR, Wang M, Nothias L-F, et al. 2018 Propagating annotations of molecular networks using in silico fragmentation. PLoS Computational Biology 14 e1006089. (https://doi.org/10.1371/journal.pcbi.1006089)
De Livera AM, Sysi-Aho M, Jacob L, et al. 2015 Statistical methods for handling unwanted variation in metabolomics data. Analytical Chemistry 87 3606–3615. (https://doi.org/10.1021/ac502439y)
Devlin J, Chang M-W, Lee K, et al. 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Minneapolis, MN: Association for Computational Linguistics. (https://doi.org/10.48550/ARXIV.1810.04805)
di Meo SA, Loizzo D, Pandolfo SD, et al. 2022 Metabolomic approaches for detection and identification of biomarkers and altered pathways in bladder cancer. International Journal of Molecular Sciences 23 4173. (https://doi.org/10.3390/ijms23084173)
Dudzik D, Barbas-Bernardos C, García A, et al. 2018 Quality assurance procedures for mass spectrometry untargeted metabolomics. A review. Journal of Pharmaceutical and Biomedical Analysis 147 149–173. (https://doi.org/10.1016/j.jpba.2017.07.044)
Dührkop K, Fleischauer M, Ludwig M, et al. 2019 SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods 16 299–302. (https://doi.org/10.1038/s41592-019-0344-8)
Dührkop K, Nothias L-F, Fleischauer M, et al. 2021 Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nature Biotechnology 39 462–471. (https://doi.org/10.1038/s41587-020-0740-8)
Dunn B, Pierobon M & Wei Q 2023 Automated classification of lung cancer subtypes using deep learning and CT-scan based radiomic analysis. Bioengineering 10 690. (https://doi.org/10.3390/bioengineering10060690)
Elloumi M 2021 Deep Learning for Biomedical Data Analysis: Techniques, Approaches, and Applications. Springer Nature. (https://doi.org/10.1007/978-3-030-71676-9)
Eraslan G, Avsec Ž, Gagneur J, et al. 2019 Deep learning: new computational modelling techniques for genomics. Nature Reviews. Genetics 20 389–403. (https://doi.org/10.1038/s41576-019-0122-6)
Fiehn O 2002 Metabolomics – the link between genotypes and phenotypes. Plant Molecular Biology 48 155–171. (https://doi.org/10.1023/A:1013713905833)
Fiehn O 2016 Metabolomics by gas chromatography-mass spectrometry: combined targeted and untargeted profiling. Current Protocols in Molecular Biology 114 30.4.1–30.4.32. (https://doi.org/10.1002/0471142727.mb3004s114)
Friedman J & Alm EJ 2012 Inferring correlation networks from genomic survey data. PLOS Computational Biology 8 e1002687. (https://doi.org/10.1371/journal.pcbi.1002687)
Galal A, Talal M & Moustafa A 2022 Applications of machine learning in metabolomics: disease modeling and classification. Frontiers in Genetics 13 1017340. (https://doi.org/10.3389/fgene.2022.1017340)
Garrett WS 2015 Cancer and the microbiota. Science 348 80–86. (https://doi.org/10.1126/science.aaa4972)
Gauglitz JM, West KA, Bittremieux W, et al. 2022 Enhancing untargeted metabolomics using metadata-based source annotation. Nature Biotechnology 40 1774–1779. (https://doi.org/10.1038/s41587-022-01368-1)
Ghaffari P, Shoaie S & Nielsen LK 2022 Irritable bowel syndrome and microbiome; switching from conventional diagnosis and therapies to personalized interventions. Journal of Translational Medicine 20 173. (https://doi.org/10.1186/s12967-022-03365-z)
Gloor GB, Macklaim JM, Pawlowsky-Glahn V, et al. 2017 Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology 8 2224. (https://doi.org/10.3389/fmicb.2017.02224)
Gomari DP, Schweickart A, Cerchietti L, et al. 2022 Variational autoencoders learn transferrable representations of metabolomics data. Communications Biology 5 645. (https://doi.org/10.1038/s42003-022-03579-3)
Gowda GAN & Djukovic D 2014 Overview of mass spectrometry-based metabolomics: opportunities and challenges. Methods in Molecular Biology 1198 3–12. (https://doi.org/10.1007/978-1-4939-1258-2_1)
Griffiths WJ 2008 Metabolomics, Metabonomics and Metabolite Profiling. Royal Society of Chemistry. (https://doi.org/10.1039/9781847558107)
Gu Q & Veselkov K 2018 Bi-clustering of metabolic data using matrix factorization tools. Methods 151 12–20. (https://doi.org/10.1016/j.ymeth.2018.02.004)
Guijas C, Montenegro-Burke JR, Domingo-Almenara X, et al. 2018 METLIN: a technology platform for identifying knowns and unknowns. Analytical Chemistry 90 3156–3164. (https://doi.org/10.1021/acs.analchem.7b04424)
Haghverdi L, Lun ATL, Morgan MD, et al. 2018 Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 36 421–427. (https://doi.org/10.1038/nbt.4091)
Hamamsy T, Morton JT, Berenberg D, et al. 2022 TM-Vec: template modeling vectors for fast homology detection and alignment. bioRxiv. (https://doi.org/10.1101/2022.07.25.501437)
Han S, Van Treuren W, Fischer CR, et al. 2021 A metabolomics pipeline for the mechanistic interrogation of the gut microbiome. Nature 595 415–420. (https://doi.org/10.1038/s41586-021-03707-9)
Haug K, Cochrane K, Nainala VC, et al. 2020 MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Research 48 D440–D444. (https://doi.org/10.1093/nar/gkz1019)
Henke MT, Kenny DJ, Cassilly CD, et al. 2019 Ruminococcus gnavus, a member of the human gut microbiome associated with Crohn’s disease, produces an inflammatory polysaccharide. PNAS 116 12672–12677. (https://doi.org/10.1073/pnas.1904099116)
Horai H, Arita M, Kanaya S, et al. 2010 MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry 45 703–714. (https://doi.org/10.1002/jms.1777)
Huber F, Ridder L, Verhoeven S, et al. 2021 Spec2Vec: improved mass spectral similarity scoring through learning of structural relationships. PLoS Computational Biology 17 e1008724. (https://doi.org/10.1371/journal.pcbi.1008724)
Jabbar MA, Abraham A, Dogan O, et al. 2021 Deep Learning in Biomedical and Health Informatics: Current Applications and Possibilities. CRC Press. (https://doi.org/10.1201/9781003161233)
Jansson JK & Baker ES 2016 A multi-omic future for microbiome studies. Nature Microbiology 1 16049. (https://doi.org/10.1038/nmicrobiol.2016.49)
Jarmusch AK, Wang M, Aceves CM, et al. 2020 ReDU: a framework to find and reanalyze public mass spectrometry data. Nature Methods 17 901–904. (https://doi.org/10.1038/s41592-020-0916-7)
Jiang L, Elord C, Kim JJ, et al. 2021 Bayesian multivariate sparse functional principal components analysis with applications to longitudinal microbiome multi-omics data. Annals of Applied Statistics 16. (https://doi.org/10.48550/arXiv.2102.00067)
Johnson WE, Li C & Rabinovic A 2007 Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 118–127. (https://doi.org/10.1093/biostatistics/kxj037)
Jordan MI & Mitchell TM 2015 Machine learning: trends, perspectives, and prospects. Science 349 255–260. (https://doi.org/10.1126/science.aaa8415)
Kessner D, Chambers M, Burke R, et al. 2008 ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24 2534–2536. (https://doi.org/10.1093/bioinformatics/btn323)
Kind T & Fiehn O 2007 Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics 8 105. (https://doi.org/10.1186/1471-2105-8-105)
Kind T & Fiehn O 2010 Advances in structure elucidation of small molecules using mass spectrometry. Bioanalytical Reviews 2 23–60. (https://doi.org/10.1007/s12566-010-0015-9)
Kind T, Liu K-H, Lee DY, et al. 2013 LipidBlast in silico tandem mass spectrometry database for lipid identification. Nature Methods 10 755–758. (https://doi.org/10.1038/nmeth.2551)
Kingma DP, Rezende DJ, Mohamed S, et al. 2014 Semi-supervised learning with deep generative models. Available at: http://arxiv.org/abs/1406.5298
Kose U, Deperlioglu O & Hemanth DJ 2021 Deep Learning for Biomedical Applications. CRC Press. (https://doi.org/10.1201/9780367855611)
Kurtz ZD, Müller CL, Miraldi ER, et al. 2015 Sparse and compositionally robust inference of microbial ecological networks. PLoS Computational Biology 11 e1004226. (https://doi.org/10.1371/journal.pcbi.1004226)
Lê Cao K-A, Martin PGP, Robert-Granié C, et al. 2009 Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 10 34. (https://doi.org/10.1186/1471-2105-10-34)
LeCun Y, Bengio Y & Hinton G 2015 Deep learning. Nature 521 436–444. (https://doi.org/10.1038/nature14539)
Leinonen R, Sugawara H, Shumway M, et al. 2011 The sequence read archive. Nucleic Acids Research 39 D19–D21. (https://doi.org/10.1093/nar/gkq1019)
Lloyd-Price J, Arze C, Ananthakrishnan AN, et al. 2019 Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569 655–662. (https://doi.org/10.1038/s41586-019-1237-9)
Mashima J, Kodama Y, Fujisawa T, et al. 2017 DNA data bank of Japan. Nucleic Acids Research 45 D25–D31. (https://doi.org/10.1093/nar/gkw1001)
Melnik AV, da Silva RR, Hyde ER, et al. 2017 Coupling targeted and untargeted mass spectrometry for metabolome-microbiome-wide association studies of human fecal samples. Analytical Chemistry 89 7549–7559. (https://doi.org/10.1021/acs.analchem.7b01381)
Menees S & Chey W 2018 The gut microbiome and irritable bowel syndrome. F1000Research 7 1029. (https://doi.org/10.12688/f1000research.14592.1)
Migdadi L, Lambert J, Telfah A, et al. 2021 Automated metabolic assignment: semi-supervised learning in metabolic analysis employing two dimensional nuclear magnetic resonance (NMR). Computational and Structural Biotechnology Journal 19 5047–5058. (https://doi.org/10.1016/j.csbj.2021.08.048)
Mikolov T, Sutskever I, Chen K, et al. 2013 Distributed representations of words and phrases and their compositionality. Available at: http://arxiv.org/abs/1310.4546
Mnih V, Kavukcuoglu K, Silver D, et al. 2015 Human-level control through deep reinforcement learning. Nature 518 529–533. (https://doi.org/10.1038/nature14236)
Morton JT, Aksenov AA, Nothias LF, et al. 2019 Learning representations of microbe-metabolite interactions. Nature Methods 16 1306–1314. (https://doi.org/10.1038/s41592-019-0616-3)
Morton JT, Jin D-M, Mills RH, et al. 2023 Multi-level analysis of the gut-brain axis shows autism spectrum disorder-associated molecular and microbial profiles. Nature Neuroscience 26 1208–1217. (https://doi.org/10.1038/s41593-023-01361-0)
Naeij HR, Mahmoudi E, Yeganeh HD, et al.2023 Molecular electronic structure calculation via a quantum computer. Available at: http://arxiv.org/abs/2303.09911
Noble WS, Kuang R, Leslie C, et al.2005 Identifying remote protein homologs by network propagation. FEBS Journal 272 5119–5128. (https://doi.org/10.1111/j.1742-4658.2005.04947.x)
Nyamundanda G, Gormley IC, Fan Y, et al.2013 MetSizeR: selecting the optimal sample size for metabolomic studies using an analysis based approach. BMC Bioinformatics 14 338. (https://doi.org/10.1186/1471-2105-14-338)
Patti GJ 2011 Separation strategies for untargeted metabolomics. Journal of Separation Science 34 3460–3469. (https://doi.org/10.1002/jssc.201100532)
Patti GJ, Yanes O & & Siuzdak G 2012 Innovation: metabolomics: the apogee of the omics trilogy. Nature Reviews. Molecular Cell Biology 13 263–269. (https://doi.org/10.1038/nrm3314)
Pierre-Antoine M 2010 Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 3371–3408. (https://doi.org/10.5555/1756006.1953039)
Piuri V, Raj S, Genovese A, et al. 2020 Trends in Deep Learning Methodologies: Algorithms, Applications, and Systems. Academic Press: San Diego, CA, USA.
Pomyen Y, Wanichthanarak K, Poungsombat P, et al. 2020 Deep metabolome: applications of deep learning in metabolomics. Computational and Structural Biotechnology Journal 18 2818–2825. (https://doi.org/10.1016/j.csbj.2020.09.033)
Poore GD, Kopylova E, Zhu Q, et al. 2020 Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579 567–574. (https://doi.org/10.1038/s41586-020-2095-1)
Pulikkan J, Mazumder A & Grace T 2019 Role of the gut microbiome in autism spectrum disorders. Advances in Experimental Medicine and Biology 1118 253–269. (https://doi.org/10.1007/978-3-030-05542-4_13)
Sanguansat P 2012 Principal Component Analysis: Multidisciplinary Applications. BoD – Books on Demand. (https://doi.org/10.5772/2694)
Saurman V, Margolis KG & Luna RA 2020 Autism spectrum disorder as a brain-gut-microbiome axis disorder. Digestive Diseases and Sciences 65 818–828. (https://doi.org/10.1007/s10620-020-06133-5)
Schmid R, Heuckeroth S, Korf A, et al. 2023 Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nature Biotechnology 41 447–449. (https://doi.org/10.1038/s41587-023-01690-2)
Schroeder BO & Bäckhed F 2016 Signals from the gut microbiota to distant organs in physiology and disease. Nature Medicine 22 1079–1089. (https://doi.org/10.1038/nm.4185)
Shen H, Zamboni N, Heinonen M, et al. 2013 Metabolite identification through machine learning – tackling CASMI challenge using FingerID. Metabolites 3 484–505. (https://doi.org/10.3390/metabo3020484)
Song HS, Lee SH, Ahn SW, et al. 2021 Effects of the main ingredients of the fermented food, kimchi, on bacterial composition and metabolite profile. Food Research International 149 110668. (https://doi.org/10.1016/j.foodres.2021.110668)
Stein S 2012 Mass spectral reference libraries: an ever-expanding resource for chemical identification. Analytical Chemistry 84 7274–7282. (https://doi.org/10.1021/ac301205z)
Styczynski MP, Moxley JF, Tong LV, et al. 2007 Systematic identification of conserved metabolites in GC/MS data for metabolomics and biomarker discovery. Analytical Chemistry 79 966–973. (https://doi.org/10.1021/ac0614846)
Sud M, Fahy E, Cotter D, et al. 2016 Metabolomics workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Research 44 D463–D470. (https://doi.org/10.1093/nar/gkv1042)
Sumathi S, Suganya K, Swathi K, et al. 2023 A review on deep learning-driven drug discovery: strategies, tools and applications. Current Pharmaceutical Design 29 1013–1025. (https://doi.org/10.2174/1381612829666230412084137)
Tang DQ, Zou L, Yin X-X, et al. 2016 HILIC-MS for metabolomics: an attractive and complementary approach to RPLC-MS. Mass Spectrometry Reviews 35 574–600. (https://doi.org/10.1002/mas.21445)
Tiwari AK 2021 Deep Learning and Its Applications. Nova Science Publishers: Hauppauge, NY, USA.
Tomita M & Nishioka T 2006 Metabolomics: The Frontier of Systems Biology. Springer Science & Business Media, Tokyo, Japan.
Tsugawa H, Ikeda K, Takahashi M, et al. 2020 A lipidome atlas in MS-DIAL 4. Nature Biotechnology 38 1159–1163. (https://doi.org/10.1038/s41587-020-0531-2)
Usama M, Qadir J, Raza A, et al. 2019 Unsupervised machine learning for networking: techniques, applications and research challenges. IEEE Access 7 65579–65615. (https://doi.org/10.1109/ACCESS.2019.2916648)
Vandeputte D, Kathagen G, D’hoe K, et al. 2017 Quantitative microbiome profiling links gut community variation to microbial load. Nature 551 507–511. (https://doi.org/10.1038/nature24460)
van der Maaten L & Hinton G 2008 Visualizing data using t-SNE. Journal of Machine Learning Research 9 2579–2605.
Vizcaíno JA, Deutsch EW, Wang R, et al. 2014 ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nature Biotechnology 32 223–226. (https://doi.org/10.1038/nbt.2839)
Wang M, Carver JJ, Phelan VV, et al. 2016 Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nature Biotechnology 34 828–837. (https://doi.org/10.1038/nbt.3597)
Wehrens R, Hageman JA, van Eeuwijk F, et al. 2016 Improved batch correction in untargeted MS-based metabolomics. Metabolomics 12 88. (https://doi.org/10.1007/s11306-016-1015-8)
Weiss S, Van Treuren W, Lozupone C, et al. 2016 Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME Journal 10 1669–1681. (https://doi.org/10.1038/ismej.2015.235)
Yang Z, Algesheimer R & Tessone CJ 2016 A comparative analysis of community detection algorithms on artificial networks. Scientific Reports 6 30750. (https://doi.org/10.1038/srep30750)
Zelezniak A, Andrejev S, Ponomarova O, et al. 2015 Metabolic dependencies drive species co-occurrence in diverse microbial communities. PNAS 112 6449–6454. (https://doi.org/10.1073/pnas.1421834112)
Zhang Y, Li K, Zhao Y, et al. 2022 Attomole-level multiplexed detection of neurochemicals in picoliter droplets by On-Chip nanoelectrospray ionization coupled to mass spectrometry. Analytical Chemistry 94 13804–13809. (https://doi.org/10.1021/acs.analchem.2c02323)
Zuffa S, Schmid R, Bauermeister A, et al. 2024 microbeMASST: a taxonomically informed mass spectrometry search tool for microbial metabolomics data. Nature Microbiology 9 336–345. (https://doi.org/10.1038/s41564-023-01575-9)