Chapter 2 Data preparation

2.0.1 Sample metadata

sample_metadata <- read_tsv("data/sample_metadata.tsv")

2.0.2 Genome metadata

Relevant metadata of genomes is fetched from 2-3 files and merged into one genome metadata object for downstream analyses.

2.0.2.1 Taxonomy

This is the raw taxonomy table generated by GTDBtk, which is simplified for downstream analyses.

genome_taxonomy <- bind_rows(read_tsv("data/gtdbtk.bac120.summary.tsv"),
                             read_tsv("data/gtdbtk.ar53.summary.tsv")) %>%
  rename(genome = user_genome) %>%
  mutate(genome = str_replace_all(genome,"\\.fa", "")) %>%
  separate(classification, c("domain","phylum","class","order","family","genus","species"),  sep =";") %>%
  select(genome,domain,phylum,class,order,family,genus,species) %>%
  arrange(match(genome, read_counts$genome))

2.0.2.2 Genome quality

Quality properties of the genomes. Derived from dREP’s Widb table.

genome_quality <- read_csv("data/genome_quality.csv") %>%
  rename(genome = 1) %>%
  arrange(match(genome, read_counts$genome)) %>%
  select(genome, completeness, contamination, size) %>%
  rename(length=size) %>%
  mutate(genome = str_remove(genome, "\\.fa$"))

2.0.2.3 Merged metadata object

Merge taxonomy, length and quality information

genome_metadata <- genome_taxonomy %>%
  left_join(genome_quality,by="genome") #join quality

2.0.3 Count table

This is the document containing the number of sequencing reads from each sample have been mapped to each MAG. Note that this is the raw data that needs to be further processed before running any statistics on them.

read_counts <- read_tsv("data/genome_count.tsv.gz") %>%
    rename_all(~ str_remove_all(., ".lib1")) %>% #simplify column names
    rename(genome = 1) %>%
    select(all_of(c("genome",sample_metadata$sample))) %>% # sort samples
    arrange(match(genome,genome_metadata$genome)) # sort genomes

2.0.4 Base hit table

This is the document containing the number of nucleotide bases have been covered by at least one read in each sample and MAG. This information is used to calculate MAG coverage values.

basehits <- read_tsv("data/genome_covered_bases.tsv.gz") %>%
    rename_all(~ str_remove_all(., ".lib1")) %>% #simplify column names
    rename(genome = 1) %>%
    select(all_of(c("genome",sample_metadata$sample))) %>% # sort samples
    arrange(match(genome,genome_metadata$genome)) # sort genomes

2.0.5 Genome tree

This is the raw tree generated by GTDBtk, which needs to be pruned to obtain the phylogenetic tree of the genomes. Note that the archaeal tree is only generated if any archaeans are detected among the genomes.

archaea_tree <- read.tree("data/gtdbtk.ar53.classify.tree") #note that when no archaea are detected, this tree is not generated
bacteria_tree <- read.tree("data/gtdbtk.backbone.bac120.classify.tree")
genome_tree <- bind.tree(archaea_tree, bacteria_tree)
genome_tree$tip.label <- str_replace_all(genome_tree$tip.label,"'", "") #remove single quotes in MAG names
genome_tree <- keep.tip(genome_tree, tip=genome_metadata$genome) # keep only MAG tips

2.0.6 MAG functional annotations

This is the raw annotation table generated by DRAM, which is used to generate GIFT data using distillR.

genome_annotations <- read_tsv("data/genome_annotations.tsv.xz") %>%
  rename(gene=1,genome=2)

2.1 Filter and normalise data

Raw data needs to be filtered and normalised to make it useful for downstream analyses.

2.1.1 Generate coverage table

By dividing the number of base hits by the length of each genome, coverage values can be calculated.

genome_coverage <- basehits %>%
  mutate(across(where(is.numeric), ~ ./genome_metadata$length))

2.1.2 Coverage filtering

Genomes that have less than 30% of their length covered by reads are turned into zeros to account for the random allocation of reads across genomes due to mapping heuristics.

min_coverage=0.3
read_counts_filt <- genome_coverage %>%
  mutate(across(where(is.numeric), ~ ifelse(. > min_coverage, 1, 0))) %>%
  mutate(across(-1, ~ . * read_counts[[cur_column()]])) 

2.1.3 Generate genome count table

After filtering the low-coverage reads, read counts are transformed into genome counts using genome-length and read-length information.

readlength=150 #change if sequencing read length is different

genome_counts <- read_counts %>%
  mutate(across(where(is.numeric), ~ . / (genome_metadata$length / readlength) ))

genome_counts_filt <- read_counts_filt %>%
  mutate(across(where(is.numeric), ~ . / (genome_metadata$length / readlength) ))

2.1.4 Distil functional annotations

Raw functional annotations are distilled into genome-inferred functional traits to generate biologically more meaningful functional traits for downstream analyses.

genome_gifts <- distill(genome_annotations,GIFT_db,genomecol=2,annotcol=c(9,10,19),verbosity=F)

2.2 Color scheme

AlberdiLab projects use unified color schemes developed for the Earth Hologenome Initiative, to facilitate figure interpretation.

phylum_colors <- read_tsv("https://raw.githubusercontent.com/earthhologenome/EHI_taxonomy_colour/main/ehi_phylum_colors.tsv") %>%
    right_join(genome_metadata, by=join_by(phylum == phylum)) %>%
    arrange(match(genome, genome_tree$tip.label)) %>%
    select(phylum, colors) %>% 
    unique() %>%
    arrange(phylum) %>%
    pull(colors, name=phylum)

2.3 Wrap working objects

In the last step, the objects that are needed for downstream analyses are stored in an R object.

save(read_counts, 
     read_counts_filt, 
     genome_counts, 
     genome_counts_filt, 
     genome_tree, 
     genome_metadata, 
     genome_gifts, 
     sample_metadata, 
     phylum_colors, 
     file = "data/data.Rdata")
  • read_counts: Number of reads mapped to each genome in each sample. Note this is the unfiltered and unnormalised raw community composition table.
  • genome_counts: Number of genomes quantified in each sample, calculated through filtering and normalising read_counts. This is the community composition table to be used in downstream analyses unless otherwise stated.
  • genome_tree: Phylogenetic tree of the genomes, to be employed in downstream phylogenetic analyses.
  • genome_metadata: Taxonomic and quality information of the genomes.
  • genome_gifts: Genome-inferred functional traits of the genomes, to be employed in downstream functional analyses.
  • sample_metadata: Treatment/population and other relevant metadata of the samples.