Chapter 2 Data preparation
2.0.2 Genome metadata
Relevant metadata of genomes is fetched from 2-3 files and merged into one genome metadata object for downstream analyses.
2.0.2.1 Taxonomy
This is the raw taxonomy table generated by GTDBtk, which is simplified for downstream analyses.
genome_taxonomy <- bind_rows(read_tsv("data/gtdbtk.bac120.summary.tsv"),
read_tsv("data/gtdbtk.ar53.summary.tsv")) %>%
rename(genome = user_genome) %>%
mutate(genome = str_replace_all(genome,"\\.fa", "")) %>%
separate(classification, c("domain","phylum","class","order","family","genus","species"), sep =";") %>%
select(genome,domain,phylum,class,order,family,genus,species) %>%
arrange(match(genome, read_counts$genome))
2.0.3 Count table
This is the document containing the number of sequencing reads from each sample have been mapped to each MAG. Note that this is the raw data that needs to be further processed before running any statistics on them.
2.0.4 Base hit table
This is the document containing the number of nucleotide bases have been covered by at least one read in each sample and MAG. This information is used to calculate MAG coverage values.
2.0.5 Genome tree
This is the raw tree generated by GTDBtk, which needs to be pruned to obtain the phylogenetic tree of the genomes. Note that the archaeal tree is only generated if any archaeans are detected among the genomes.
archaea_tree <- read.tree("data/gtdbtk.ar53.classify.tree") #note that when no archaea are detected, this tree is not generated
bacteria_tree <- read.tree("data/gtdbtk.backbone.bac120.classify.tree")
genome_tree <- bind.tree(archaea_tree, bacteria_tree)
genome_tree$tip.label <- str_replace_all(genome_tree$tip.label,"'", "") #remove single quotes in MAG names
genome_tree <- keep.tip(genome_tree, tip=genome_metadata$genome) # keep only MAG tips
2.0.6 MAG functional annotations
This is the raw annotation table generated by DRAM, which is used to generate GIFT data using distillR.
2.1 Filter and normalise data
Raw data needs to be filtered and normalised to make it useful for downstream analyses.
2.1.1 Generate coverage table
By dividing the number of base hits by the length of each genome, coverage values can be calculated.
2.1.2 Coverage filtering
Genomes that have less than 30% of their length covered by reads are turned into zeros to account for the random allocation of reads across genomes due to mapping heuristics.
2.2 Color scheme
AlberdiLab projects use unified color schemes developed for the Earth Hologenome Initiative, to facilitate figure interpretation.
phylum_colors <- read_tsv("https://raw.githubusercontent.com/earthhologenome/EHI_taxonomy_colour/main/ehi_phylum_colors.tsv") %>%
right_join(genome_metadata, by=join_by(phylum == phylum)) %>%
arrange(match(genome, genome_tree$tip.label)) %>%
select(phylum, colors) %>%
unique() %>%
arrange(phylum) %>%
pull(colors, name=phylum)
2.3 Wrap working objects
In the last step, the objects that are needed for downstream analyses are stored in an R object.
save(read_counts,
read_counts_filt,
genome_counts,
genome_counts_filt,
genome_tree,
genome_metadata,
genome_gifts,
sample_metadata,
phylum_colors,
file = "data/data.Rdata")
- read_counts: Number of reads mapped to each genome in each sample. Note this is the unfiltered and unnormalised raw community composition table.
- genome_counts: Number of genomes quantified in each sample, calculated through filtering and normalising read_counts. This is the community composition table to be used in downstream analyses unless otherwise stated.
- genome_tree: Phylogenetic tree of the genomes, to be employed in downstream phylogenetic analyses.
- genome_metadata: Taxonomic and quality information of the genomes.
- genome_gifts: Genome-inferred functional traits of the genomes, to be employed in downstream functional analyses.
- sample_metadata: Treatment/population and other relevant metadata of the samples.