Overview

This report documents the data processing and preparation pipeline for a BAMM (Bayesian Analysis of Macroevolutionary Mixtures) analysis of insect genome size evolution. The workflow is divided into three parts:

  • Part 1: Loading genome size data and taxonomy, normalizing taxon names, resolving duplicate genome size records, and cross-referencing species between datasets
  • Part 2: Pruning the phylogenetic trees to retain only taxa with matched genome size data
  • Part 3: Converting the ML tree to ultrametric form and estimating BAMM priors

Part 1: Tip Search and Genome Size Matching

Libraries

library(tidyverse)
library(readxl)
library(janitor)

Data Sources

Two datasets were used:

  • insectgs.csv: Genome size (GS) data sourced from an online genome size database, containing C-value measurements in picograms (pg) and megabase pairs (Mbp), measurement method, and metamorphosis type
  • List_of_4854_insects&10_outgroups.xlsx: Supplementary taxonomy file accompanying the phylogenetic tree, containing 4,864 taxa (4,854 insects + 10 outgroups) with full taxonomic rank information and tip labels matching the tree file

Load Files

gs_df  <- read_csv("insectgs.csv", locale = locale(encoding = "latin1"))
tip_df <- read_xlsx("List_of_4854_insects&10_outgroups.xlsx") |> clean_names()

The latin1 encoding was required for the GS CSV due to special characters present in some species names. clean_names() from the janitor package was applied to standardize column names in the taxonomy file.

Genome Size Duplicate Resolution

The GS dataset contained 773 duplicate species entries with differing genome size values, reflecting multiple measurements in the database. Duplicates were resolved by preferring the most reliable measurement method for each species, then averaging if multiple records of the same best method existed.

The Method column indicates the technique used to measure genome size:

Method Description Rank
FCM Flow Cytometry — gold standard 1
FIA Feulgen Image Analysis 2
FD Feulgen Densitometry 3
BCA Biochemical Analysis 4
NS / Other Not Specified or other 5

A coefficient of variation (CV) was retained to flag species where records disagreed substantially, providing a transparency trail for downstream analyses.

Note on C-value vs Mbp: Both columns represent the same measurement in different units (1 pg ≈ 978 Mbp). The conversion factor was confirmed consistent across the dataset. Mbp was retained for all downstream analyses.

gs_resolved <- gs_df |>
  mutate(method_rank = case_when(
    Method == "FCM" ~ 1,
    Method == "FIA" ~ 2,
    Method == "FD"  ~ 3,
    Method == "BCA" ~ 4,
    TRUE            ~ 5
  )) |>
  group_by(Name) |>
  slice_min(method_rank, with_ties = TRUE) |>
  summarise(
    gs_value      = mean(Mbp, na.rm = TRUE),
    gs_sd         = sd(Mbp, na.rm = TRUE),
    gs_cv         = gs_sd / gs_value,
    best_method   = first(Method),
    metamorphosis = first(Metamorphosis),
    n_records     = n(),
    .groups       = "drop"
  ) |>
  mutate(
    name_normalized = Name |>
      str_to_lower() |>
      str_replace_all("-", "") |>    # strip hyphens to match tip formatting
      str_squish()
  )

Name Normalization

Taxon names required normalization prior to matching due to formatting differences between the two files:

  • The taxonomy file encodes tip labels as Order_Genus_species (e.g. Coleoptera_Anthonomus_grandis)
  • Some tip labels contain additional identifiers for subspecies or unidentified specimens (e.g. Coleoptera_Anthonomus_grandis_grandis)
  • The GS file uses standard binomial nomenclature with spaces (e.g. Anthonomus grandis)
  • Some GS names contain hyphens absent in the tip labels (e.g. Xestia c-nigrum vs Xestia_cnigrum)

To address these issues:

  • Tip labels were normalized by extracting the 2nd and 3rd underscore-delimited words (genus + species epithet), ensuring subspecies suffixes are ignored
  • GS names were converted to lowercase and hyphens were stripped to match tip label formatting
  • Original names were preserved in all cases; normalization was applied to new columns only
tip_df <- tip_df |>
  mutate(
    name_normalized = tip_names_in_data_matrix |>
      str_split("_") |>
      map_chr(~ paste(.x[2], .x[3], sep = " ")) |>
      str_to_lower() |>
      str_squish()
  )

Cross-Referencing

gs_to_join <- gs_resolved |>
  select(name_normalized, gs_value, gs_sd, gs_cv,
         best_method, metamorphosis, n_records)

matches <- tip_df |>
  inner_join(gs_to_join, by = "name_normalized") |>
  select(
    tip_names_in_data_matrix,
    kingdom, phylum, subphylum, class, subclass, infraclass,
    cohort, order, suborder, infraorder,
    superfamily, family, subfamily, genus, species,
    metamorphosis, gs_value, gs_sd, gs_cv, best_method, n_records
  )

unmatched_tips <- anti_join(tip_df, gs_resolved, by = "name_normalized")
unmatched_gs   <- anti_join(gs_resolved, tip_df, by = "name_normalized")

Matching Results

cat("Tree tips with GS data:   ", nrow(matches), "\n")
cat("Tree tips WITHOUT GS data:", nrow(unmatched_tips), "\n")
cat("GS species not in tree:   ", nrow(unmatched_gs), "\n")
## Tree tips with GS data:    405
## Tree tips WITHOUT GS data: 4459
## GS species not in tree:    1201

405 of 4,864 tree tips (~8%) were successfully matched to genome size data. The low match rate reflects the sparsity of published genome size data relative to the scale of the phylogeny. The 4,459 unmatched tips include species for which no genome size data exists, and unidentified specimens (e.g. Scymnus sp. 1 HSL-2020) which are structurally unmatchable at species level. These unidentified specimens retain genus-level labels and will be captured in planned genus- and family-level analyses.

High-variance matches (CV > 10%) were flagged for review:

high_variance <- matches |>
  filter(!is.na(gs_cv), gs_cv > 0.1) |>
  select(tip_names_in_data_matrix, species, gs_value, gs_sd,
         gs_cv, best_method, n_records) |>
  arrange(desc(gs_cv))

Output

write.csv(matches, file = "Matched_Tips.csv")

Part 2: Tree Pruning

Libraries

library(ape)
library(tidyverse)

Load Files

matches     <- read_csv("Matched_Tips.csv")
astral_tree <- read.tree("Tree_dat/10outgroups_4854insects_824genes_ASTRAL_rooted_final.tree")
ml_tree     <- read.tree("Tree_Dat/10outgroups_4854insects_824genes_MLbest_rooted.tree")

Two trees were provided with the dataset:

  • ASTRAL tree: Coalescent-based species tree estimated with ASTRAL from 824 genes. Branch lengths are in coalescent units and tip branch lengths are absent (expected ASTRAL behavior)
  • ML tree: Maximum likelihood concatenated tree with branch lengths in substitutions per site

Pruning

Both trees were pruned to retain only the 397 taxa with matched genome size data. The original rooting was preserved in both cases. Tips without GS data were dropped regardless of whether they were outgroups.

tips_to_keep <- matches$tip_names_in_data_matrix

prune_tree <- function(tree, keep_tips) {
  tips_to_drop <- tree$tip.label[!tree$tip.label %in% keep_tips]

  if (length(tips_to_drop) == 0) {
    message("No tips to drop.")
    return(tree)
  }

  pruned <- drop.tip(tree, tips_to_drop, trim.internal = TRUE)
  message(paste("Dropped:", length(tips_to_drop), "tips"))
  message(paste("Remaining:", length(pruned$tip.label), "tips"))
  return(pruned)
}

astral_pruned <- prune_tree(astral_tree, tips_to_keep)
ml_pruned     <- prune_tree(ml_tree, tips_to_keep)

Verify and Save

is.rooted(astral_pruned)
is.rooted(ml_pruned)

write.tree(astral_pruned, "ASTRAL_pruned.tree")
write.tree(ml_pruned,     "ML_pruned.tree")
## [1] TRUE
## [1] TRUE

Part 3: Ultrametric Conversion and BAMM Priors

Libraries

library(tidyverse)
library(ape)
library(BAMMtools)

Load Files

matches <- read_csv("Matched_Tips.csv")
ml_tree <- read.tree("ML_pruned.tree")

The ML tree was used for BAMM prior estimation. The ASTRAL tree was not used for this step because ASTRAL branch lengths are in coalescent units rather than substitutions per site, which are required by chronos() and BAMM. For the final analysis, MCMCtree will be used for divergence time estimation with multiple fossil calibration points.

Ultrametric Conversion

The ML tree was converted to ultrametric form using penalized likelihood rate smoothing via chronos() in the ape package. The optimal smoothing parameter (lambda) was selected by minimizing cross-validation error across a range of values (10^-1 to 10^6):

l     <- 10^(-1:6)
cv_ml <- numeric(length(l))
for(i in 1:length(l)){
  print(i)
  cv_ml[i] <- sum(attr(chronopl(ml_tree, lambda = l[i], CV = TRUE), "D2"))
}
plot(l, cv_ml)

A single root calibration point was used for this preliminary analysis:

  • Root calibration: Orchesella cincta (Collembola) vs Drosophila elegans (Diptera) = 459 MYA
cal_ml  <- makeChronosCalib(ml_tree, node = "root", age.min = 459, age.max = 459)
ML.tree <- chronos(ml_tree, lambda = l[which.min(cv_ml)],
                   model = "correlated", calibration = cal_ml)
is.ultrametric(ML.tree)
write.tree(ML.tree, file = "MLTree.tree")
## [1] TRUE

Note: This is a preliminary single-calibration analysis. Final analyses will incorporate multiple fossil calibration points across insect orders to improve divergence time accuracy at deep nodes.

BAMM Prior Estimation

Genome size values were formatted as a named numeric vector with tip labels as names, then passed to setBAMMpriors(). A sanity check confirmed all names matched the tree tips exactly before running:

gs_vector <- setNames(matches$gs_value, matches$tip_names_in_data_matrix)

all(names(gs_vector) %in% ML.tree$tip.label)

setBAMMpriors(phy = ML.tree, traits = gs_vector, outfile = "ml_priors.txt")
## [1] TRUE

The output file ml_priors.txt contains the betaInitPrior and betaShiftPrior values to be copied into the BAMM configuration file.


Part 4: Genome Size Visualization

Continuous Trait Mapping

Genome size evolution was visualized on the time-calibrated ML tree using continuous trait mapping via contMap() from the phytools package [@revell2012phytools]. Ancestral genome sizes were reconstructed under Brownian motion, with branch colors representing genome size from small (blue) to large (red).

library(phytools)

gs_vector  <- setNames(matches$gs_value, matches$tip_names_in_data_matrix)
gs_vector  <- gs_vector[ML.tree$tip.label]

gs_contmap <- contMap(ML.tree, gs_vector, plot = FALSE)
gs_contmap <- setMap(gs_contmap, colors = c("steelblue", "gold", "firebrick"))

plot(gs_contmap,
     type   = "fan",
     legend = 0.5 * max(nodeHeights(ML.tree)),
     lwd    = 2,
     fsize  = 0.3,
     ftype  = "i")
Continuous trait map of genome size (Mbp) across insect phylogeny. Branch colors represent reconstructed genome size from small (blue) to large (red).

Continuous trait map of genome size (Mbp) across insect phylogeny. Branch colors represent reconstructed genome size from small (blue) to large (red).


Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Denver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] BAMMtools_2.1.12 ape_5.8-1        janitor_2.2.1    readxl_1.4.5    
##  [5] lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0    dplyr_1.2.1     
##  [9] purrr_1.2.2      readr_2.2.0      tidyr_1.3.2      tibble_3.3.1    
## [13] ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     gplots_3.3.0       bitops_1.0-9      
##  [5] KernSmooth_2.23-26 gtools_3.9.5       stringi_1.8.7      lattice_0.22-9    
##  [9] caTools_1.18.3     hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
## [13] evaluate_1.0.5     grid_4.5.1         timechange_0.4.0   RColorBrewer_1.1-3
## [17] fastmap_1.2.0      cellranger_1.1.0   jsonlite_2.0.0     scales_1.4.0      
## [21] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.7        withr_3.0.2       
## [25] cachem_1.1.0       yaml_2.3.12        otel_0.2.0         tools_4.5.1       
## [29] parallel_4.5.1     tzdb_0.5.0         vctrs_0.7.3        R6_2.6.1          
## [33] lifecycle_1.0.5    snakecase_0.11.1   pkgconfig_2.0.3    pillar_1.11.1     
## [37] bslib_0.10.0       gtable_0.3.6       glue_1.8.0         Rcpp_1.1.1        
## [41] xfun_0.57          tidyselect_1.2.1   rstudioapi_0.18.0  knitr_1.51        
## [45] farver_2.1.2       htmltools_0.5.9    nlme_3.1-169       rmarkdown_2.31    
## [49] compiler_4.5.1     S7_0.2.1