celeri lab

Microbiome Studies Explorer

Where does the data come from?

The data is sourced from the European Nucleotide Archive (ENA), a comprehensive repository that houses nucleotide sequence data from studies worldwide. The data presented here focuses specifically on the metadata of raw sequencing reads from human gut microbiome studies. This database is updated weekly so that users have access to the most recent and relevant studies. The data structure follows ENA's schema: studies contain samples, which contain experiments, which in turn hold the sequencing runs. We do not host any sequencing data ourselves; we only provide structured metadata to help users find relevant datasets.
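The nesting described above (study, sample, experiment, run) can be pictured with a minimal Python sketch. The class names, the `n_runs` helper, and the accession values are illustrative placeholders, not the Explorer's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    accession: str


@dataclass
class Experiment:
    accession: str
    runs: list[Run] = field(default_factory=list)


@dataclass
class Sample:
    accession: str
    experiments: list[Experiment] = field(default_factory=list)


@dataclass
class Study:
    accession: str
    samples: list[Sample] = field(default_factory=list)

    def n_runs(self) -> int:
        # Walk the hierarchy: study -> samples -> experiments -> runs.
        return sum(len(e.runs) for s in self.samples for e in s.experiments)


# Placeholder accessions, for illustration only.
study = Study(
    "PRJEB00000",
    [
        Sample(
            "SAMEA0000000",
            [Experiment("ERX0000000", [Run("ERR0000000"), Run("ERR0000001")])],
        )
    ],
)
print(study.n_runs())
```

Counts such as "N experiments" and "N runs" in the field lists below correspond to walking this hierarchy from a given level downward.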

How is the metadata curated?

Metadata is pulled from ENA XML files containing descriptions and attributes provided by data submitters. We use AI to extract and standardize key metadata fields for studies and samples, which arrive from ENA unstructured and heterogeneous.
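As a rough illustration of the first step, the sketch below reads submitter-provided attributes from a sample record using Python's standard library. The element names (SAMPLE, TITLE, SAMPLE_ATTRIBUTE, TAG, VALUE) follow ENA's sample XML layout; the record itself, its values, and the `sample_attributes` helper are made up for this example, and this is not our actual pipeline.

```python
import xml.etree.ElementTree as ET

# Made-up record shaped like an ENA sample XML document.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMEA0000000">
    <TITLE>stool sample, subject 12, week 4</TITLE>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>geographic location (country and/or sea)</TAG>
        <VALUE>France</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection date</TAG>
        <VALUE>2019-03</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_text: str) -> dict:
    """Return {sample accession: {"title": ..., "attributes": {tag: value}}}."""
    root = ET.fromstring(xml_text)
    records = {}
    for sample in root.iter("SAMPLE"):
        attrs = {}
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = (attr.findtext("TAG") or "").strip()
            value = (attr.findtext("VALUE") or "").strip()
            if tag:
                attrs[tag] = value
        records[sample.get("accession")] = {
            "title": sample.findtext("TITLE"),
            "attributes": attrs,
        }
    return records


records = sample_attributes(SAMPLE_XML)
print(records["SAMEA0000000"]["attributes"])
```

Free-text fields such as the title, and attribute tags that vary from one submitter to the next, are what make the raw records heterogeneous and motivate the AI-based standardization step.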

Which AI model is used?

Our agents currently use GPT-5. We will update the model as newer ones become available.

What are the metadata fields?

Study-level metadata:

  • Accession: The unique identifier assigned to the study.
  • Study Title: The title of the study as provided by the submitter.
  • Center Name: The name of the center where the study was conducted.
  • Host status: Determined using AI on the study description to classify whether the hosts are healthy or diseased. Possible values are Healthy, Diseased, or Mixed.
  • Host age: Determined using AI on the study description and samples metadata to classify ages of studied subjects. Possible values are Infant, Child, Adolescent, Adult, and Elderly.
  • Diseases: The diseases investigated in the study. Extracted using AI from the study description, and mapped to the most relevant node in the Disease Ontology (DO) (v2025-09-30).
  • Interventions: The medical interventions performed and observed in the study. Extracted using AI from the study description, and mapped to the most relevant node in the Medical Action Ontology (MAxO) (v2025-04-24). When possible, drugs are mapped to entities in the ChEMBL database.
  • Longitudinal: Extracted using AI from the study description and samples metadata. Indicates if the study was longitudinal (multiple timepoints) or cross-sectional (single timepoint).
  • Timepoints: Extracted using AI from the study description if the study is longitudinal. Provides details on the available timepoints in the study.
  • N samples: Number of WGS samples associated with the study.
  • Samples origin: Indicates the countries where samples were collected. Extracted from samples metadata using AI and standardized using GeoPy's geocoder.
  • Collection date: The date when samples were collected (month and year). Either a range or a single date. Extracted from the samples metadata.
  • Joint publication: Publication in which the generation of this data is described.
  • Related publications: Publications citing this study.
  • MGnify studies: When available, link to the corresponding studies in the MGnify database.
  • NCT: When an NCT ID is provided, link to the corresponding clinical trial on ClinicalTrials.gov.
  • Tags: Extracted using AI from the study description to provide additional context on the study. Possible values include Clinical trial, Pregnancy, Placebo, Control, Healthy.
  • First published: The date when the study was first made public on ENA.
  • Last updated: The date when the study was last updated on ENA.

Sample-level metadata:

  • Accession: The unique identifier assigned to the sample.
  • Study: Study accessions associated with the sample.
  • Origin: Indicates the countries where samples were collected. Extracted from samples metadata using AI and standardized using GeoPy's geocoder.
  • Collection date: The date when samples were collected (month and year). Either a range or a single date. Extracted from the samples metadata.
  • Timepoint: When available, indicates the timepoint of the sample in a longitudinal study. Extracted using AI from the sample metadata.
  • Host ID: When available, the identifier of the subject from whom the sample was collected. Extracted using AI from the sample metadata.
  • Disease: When available, the disease associated with the sample. Extracted using AI from the sample metadata, and mapped to the most relevant node in the Disease Ontology (DO) (v2025-09-30).
  • Intervention: When available, the medical intervention associated with the sample. Extracted using AI from the sample metadata, and mapped to the most relevant node in the Medical Intervention Ontology (MIO) (v2025-09-30). When possible, drugs are mapped to entities in the ChEMBL database.
  • Response: When available, indicates the sample's response to the intervention. Extracted using AI from the sample metadata.
  • Host age: When available, the age of the host from whom the sample was collected. Extracted using AI from the sample metadata.
  • Host sex: When available, the sex of the host from whom the sample was collected. Extracted using AI from the sample metadata.
  • Host body mass: When available, the body mass (e.g., weight, height, BMI) of the host from whom the sample was collected. Extracted using AI from the sample metadata.
  • Clinical data: When available, additional clinical information about the host or sample. Extracted using AI from the sample metadata.
  • Tags: When available, additional tags associated with the sample. Extracted using AI from the sample metadata.
  • N experiments: Number of experiments associated with the sample.
  • N runs: Number of runs associated with the sample.

Experiment-level metadata:

  • Library layout: Paired-end or single-end sequencing. Extracted from the experiment metadata.
  • Library selection: The method used for library selection (e.g., RANDOM, PCR_AMPLIFIED). Extracted from the experiment metadata.