This report documents the data processing and preparation pipeline for a BAMM (Bayesian Analysis of Macroevolutionary Mixtures) analysis of insect genome size evolution. The workflow is divided into three parts:
Two datasets were used:
gs_df <- read_csv("insectgs.csv", locale = locale(encoding = "latin1"))
tip_df <- read_xlsx("List_of_4854_insects&10_outgroups.xlsx") |> clean_names()The latin1 encoding was required for the GS CSV due to
special characters present in some species names.
clean_names() from the janitor package was
applied to standardize column names in the taxonomy file.
The GS dataset contained 773 duplicate species entries with differing genome size values, reflecting multiple measurements in the database. Duplicates were resolved by preferring the most reliable measurement method for each species, then averaging if multiple records of the same best method existed.
The Method column indicates the technique used to
measure genome size:
| Method | Description | Rank |
|---|---|---|
| FCM | Flow Cytometry — gold standard | 1 |
| FIA | Feulgen Image Analysis | 2 |
| FD | Feulgen Densitometry | 3 |
| BCA | Biochemical Analysis | 4 |
| NS / Other | Not Specified or other | 5 |
A coefficient of variation (CV) was retained to flag species where records disagreed substantially, providing a transparency trail for downstream analyses.
Note on C-value vs Mbp: Both columns represent the same measurement in different units (1 pg ≈ 978 Mbp). The conversion factor was confirmed consistent across the dataset. Mbp was retained for all downstream analyses.
gs_resolved <- gs_df |>
mutate(method_rank = case_when(
Method == "FCM" ~ 1,
Method == "FIA" ~ 2,
Method == "FD" ~ 3,
Method == "BCA" ~ 4,
TRUE ~ 5
)) |>
group_by(Name) |>
slice_min(method_rank, with_ties = TRUE) |>
summarise(
gs_value = mean(Mbp, na.rm = TRUE),
gs_sd = sd(Mbp, na.rm = TRUE),
gs_cv = gs_sd / gs_value,
best_method = first(Method),
metamorphosis = first(Metamorphosis),
n_records = n(),
.groups = "drop"
) |>
mutate(
name_normalized = Name |>
str_to_lower() |>
str_replace_all("-", "") |> # strip hyphens to match tip formatting
str_squish()
)Taxon names required normalization prior to matching due to formatting differences between the two files:
Order_Genus_species
(e.g. Coleoptera_Anthonomus_grandis)Coleoptera_Anthonomus_grandis_grandis)Anthonomus grandis)Xestia c-nigrum vs Xestia_cnigrum)To address these issues:
gs_to_join <- gs_resolved |>
select(name_normalized, gs_value, gs_sd, gs_cv,
best_method, metamorphosis, n_records)
matches <- tip_df |>
inner_join(gs_to_join, by = "name_normalized") |>
select(
tip_names_in_data_matrix,
kingdom, phylum, subphylum, class, subclass, infraclass,
cohort, order, suborder, infraorder,
superfamily, family, subfamily, genus, species,
metamorphosis, gs_value, gs_sd, gs_cv, best_method, n_records
)
unmatched_tips <- anti_join(tip_df, gs_resolved, by = "name_normalized")
unmatched_gs <- anti_join(gs_resolved, tip_df, by = "name_normalized")cat("Tree tips with GS data: ", nrow(matches), "\n")
cat("Tree tips WITHOUT GS data:", nrow(unmatched_tips), "\n")
cat("GS species not in tree: ", nrow(unmatched_gs), "\n")## Tree tips with GS data: 405
## Tree tips WITHOUT GS data: 4459
## GS species not in tree: 1201
405 of 4,864 tree tips (~8%) were successfully matched to genome size
data. The low match rate reflects the sparsity of published genome size
data relative to the scale of the phylogeny. The 4,459 unmatched tips
include species for which no genome size data exists, and unidentified
specimens (e.g. Scymnus sp. 1 HSL-2020) which are
structurally unmatchable at species level. These unidentified specimens
retain genus-level labels and will be captured in planned genus- and
family-level analyses.
High-variance matches (CV > 10%) were flagged for review:
matches <- read_csv("Matched_Tips.csv")
astral_tree <- read.tree("Tree_dat/10outgroups_4854insects_824genes_ASTRAL_rooted_final.tree")
ml_tree <- read.tree("Tree_Dat/10outgroups_4854insects_824genes_MLbest_rooted.tree")Two trees were provided with the dataset:
Both trees were pruned to retain only the 397 taxa with matched genome size data. The original rooting was preserved in both cases. Tips without GS data were dropped regardless of whether they were outgroups.
tips_to_keep <- matches$tip_names_in_data_matrix
prune_tree <- function(tree, keep_tips) {
tips_to_drop <- tree$tip.label[!tree$tip.label %in% keep_tips]
if (length(tips_to_drop) == 0) {
message("No tips to drop.")
return(tree)
}
pruned <- drop.tip(tree, tips_to_drop, trim.internal = TRUE)
message(paste("Dropped:", length(tips_to_drop), "tips"))
message(paste("Remaining:", length(pruned$tip.label), "tips"))
return(pruned)
}
astral_pruned <- prune_tree(astral_tree, tips_to_keep)
ml_pruned <- prune_tree(ml_tree, tips_to_keep)The ML tree was used for BAMM prior estimation. The ASTRAL tree was
not used for this step because ASTRAL branch lengths are in coalescent
units rather than substitutions per site, which are required by
chronos() and BAMM. For the final analysis, MCMCtree will
be used for divergence time estimation with multiple fossil calibration
points.
The ML tree was converted to ultrametric form using penalized
likelihood rate smoothing via chronos() in the
ape package. The optimal smoothing parameter (lambda) was
selected by minimizing cross-validation error across a range of values
(10^-1 to 10^6):
l <- 10^(-1:6)
cv_ml <- numeric(length(l))
for(i in 1:length(l)){
print(i)
cv_ml[i] <- sum(attr(chronopl(ml_tree, lambda = l[i], CV = TRUE), "D2"))
}
plot(l, cv_ml)A single root calibration point was used for this preliminary analysis:
cal_ml <- makeChronosCalib(ml_tree, node = "root", age.min = 459, age.max = 459)
ML.tree <- chronos(ml_tree, lambda = l[which.min(cv_ml)],
model = "correlated", calibration = cal_ml)
is.ultrametric(ML.tree)
write.tree(ML.tree, file = "MLTree.tree")## [1] TRUE
Note: This is a preliminary single-calibration analysis. Final analyses will incorporate multiple fossil calibration points across insect orders to improve divergence time accuracy at deep nodes.
Genome size values were formatted as a named numeric vector with tip
labels as names, then passed to setBAMMpriors(). A sanity
check confirmed all names matched the tree tips exactly before
running:
gs_vector <- setNames(matches$gs_value, matches$tip_names_in_data_matrix)
all(names(gs_vector) %in% ML.tree$tip.label)
setBAMMpriors(phy = ML.tree, traits = gs_vector, outfile = "ml_priors.txt")## [1] TRUE
The output file ml_priors.txt contains the
betaInitPrior and betaShiftPrior values to be
copied into the BAMM configuration file.
Genome size evolution was visualized on the time-calibrated ML tree
using continuous trait mapping via contMap() from the
phytools package [@revell2012phytools]. Ancestral genome sizes
were reconstructed under Brownian motion, with branch colors
representing genome size from small (blue) to large (red).
library(phytools)
gs_vector <- setNames(matches$gs_value, matches$tip_names_in_data_matrix)
gs_vector <- gs_vector[ML.tree$tip.label]
gs_contmap <- contMap(ML.tree, gs_vector, plot = FALSE)
gs_contmap <- setMap(gs_contmap, colors = c("steelblue", "gold", "firebrick"))
plot(gs_contmap,
type = "fan",
legend = 0.5 * max(nodeHeights(ML.tree)),
lwd = 2,
fsize = 0.3,
ftype = "i")Continuous trait map of genome size (Mbp) across insect phylogeny. Branch colors represent reconstructed genome size from small (blue) to large (red).
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Denver
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BAMMtools_2.1.12 ape_5.8-1 janitor_2.2.1 readxl_1.4.5
## [5] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
## [9] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
## [13] ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 gplots_3.3.0 bitops_1.0-9
## [5] KernSmooth_2.23-26 gtools_3.9.5 stringi_1.8.7 lattice_0.22-9
## [9] caTools_1.18.3 hms_1.1.4 digest_0.6.39 magrittr_2.0.4
## [13] evaluate_1.0.5 grid_4.5.1 timechange_0.4.0 RColorBrewer_1.1-3
## [17] fastmap_1.2.0 cellranger_1.1.0 jsonlite_2.0.0 scales_1.4.0
## [21] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.7 withr_3.0.2
## [25] cachem_1.1.0 yaml_2.3.12 otel_0.2.0 tools_4.5.1
## [29] parallel_4.5.1 tzdb_0.5.0 vctrs_0.7.3 R6_2.6.1
## [33] lifecycle_1.0.5 snakecase_0.11.1 pkgconfig_2.0.3 pillar_1.11.1
## [37] bslib_0.10.0 gtable_0.3.6 glue_1.8.0 Rcpp_1.1.1
## [41] xfun_0.57 tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51
## [45] farver_2.1.2 htmltools_0.5.9 nlme_3.1-169 rmarkdown_2.31
## [49] compiler_4.5.1 S7_0.2.1