This section covers several genome analyses: gene family clustering, phylogenetic tree construction, divergence time estimation, expansion and contraction of gene families, and Ks distribution calculation.

Gene family clustering

Gene families were defined using TreeFam's methodology, and the genes of the other species were obtained from NCBI. We chose the transcript with the longest coding sequence to represent each gene. In brief, we first performed an all-against-all comparison of all proteins using BLASTP with an E-value cutoff of 1e-5 for each gene pair. The OrthoMCL package (version 1.4) was used to process the high-scoring segment pairs (HSPs). The MCL software in OrthoMCL was then used to define the final paralogous and orthologous genes with the parameters "-abc -I=1.5".
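The two preprocessing steps described above (picking the longest coding sequence per gene and applying the E-value cutoff) can be sketched in Python as follows. This is only an illustration, not the pipeline actually used: the function names are hypothetical, and the BLASTP results are assumed to be in the 12-column tabular format (-outfmt 6), where the E-value is the 11th field.

```python
def longest_transcript_per_gene(transcripts):
    """Pick one representative transcript per gene.

    transcripts: iterable of (gene_id, transcript_id, cds_length) tuples.
    Returns a dict mapping gene_id to the transcript with the longest CDS.
    """
    best = {}
    for gene_id, transcript_id, cds_length in transcripts:
        if gene_id not in best or cds_length > best[gene_id][1]:
            best[gene_id] = (transcript_id, cds_length)
    return {gene: tid for gene, (tid, _) in best.items()}


def filter_blastp_hits(lines, max_evalue=1e-5):
    """Keep BLASTP hits whose E-value is below the cutoff.

    lines: iterable of tab-separated rows in BLAST tabular format
    (assumed -outfmt 6: qseqid, sseqid, ..., evalue, bitscore).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < max_evalue:
            kept.append((query, subject, evalue))
    return kept
```

The filtered pairs would then be passed on to OrthoMCL and MCL for clustering; that step is not reproduced here.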


Figure: Ortholog cluster composition of Ataus: Aegilops tauschii, Atha: Arabidopsis thaliana, Bradi: Brachypodium distachyon, Erufi: Erianthus fulvus, Orysa: Oryza sativa, Phall: Panicum hallii, Pmili: Panicum miliaceum, Sital: Setaria italica, Sorbi: Sorghum bicolor, Zeama: Zea mays, and Sspon: Saccharum spontaneum.

Figure: Venn diagram showing the number of unique and shared gene families among Erianthus fulvus, Oryza sativa, Zea mays, Sorghum bicolor, and Saccharum spontaneum.
Table: Gene family statistics
Species                  Number of genes  Genes in families  Unclustered genes  Number of families  Unique families  Average genes per family
Aegilops tauschii        45305            39865              5440               18556               1551             2.15
Arabidopsis thaliana     26950            22998              3952               11710               1201             1.96
Brachypodium distachyon  25447            23437              2010               16827               181              1.39
Erianthus fulvus         35065            25802              9263               17563               676              1.47
Oryza sativa             28566            25667              2899               17376               323              1.48
Panicum hallii           26368            24990              1378               17958               102              1.39
Panicum miliaceum        34182            28019              6163               16204               150              1.73
Setaria italica          27422            25541              1881               17747               134              1.44
Sorghum bicolor          28110            26300              1810               18295               187              1.44
Saccharum spontaneum     83826            67545              16281              20553               1981             3.29
Zea mays                 37245            31794              5451               17587               942              1.81
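The columns of the table appear to be related by simple arithmetic: the number of unclustered genes is the total gene count minus the genes in families, and the average genes per family is consistent with the genes in families divided by the number of families. A minimal Python sketch, using the Erianthus fulvus row as input (the function names are hypothetical):

```python
def unclustered_genes(total_genes, genes_in_families):
    """Genes that were not assigned to any family."""
    return total_genes - genes_in_families


def average_genes_per_family(genes_in_families, family_number):
    """Mean family size, rounded to two decimals as in the table."""
    return round(genes_in_families / family_number, 2)


# Erianthus fulvus row: 35065 genes, 25802 in families, 17563 families
print(unclustered_genes(35065, 25802))         # 9263
print(average_genes_per_family(25802, 17563))  # 1.47
```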

Phylogenetic tree construction

Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and for inferring the nature and extent of the selective forces shaping the evolution of genes and species. Hence, the phylogenetic tree was reconstructed from the most conserved single-copy orthologs using Bayesian inference.
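How single-copy orthologs are selected and combined into one concatenated alignment can be sketched as below. This is a sketch under stated assumptions rather than the procedure actually used: the data structures and function names are hypothetical, a family is treated as single-copy when every species contributes exactly one gene, and each per-family alignment is assumed to be already computed.

```python
def single_copy_families(families, species):
    """Return IDs of families with exactly one gene in every species.

    families: dict mapping family_id -> dict mapping species -> list of gene IDs.
    """
    selected = []
    for family_id, members in families.items():
        if all(len(members.get(sp, [])) == 1 for sp in species):
            selected.append(family_id)
    return sorted(selected)


def concatenate_alignments(alignments, species):
    """Join per-family alignments into one sequence per species.

    alignments: list of dicts mapping species -> aligned sequence; within
    each dict all sequences must have the same length.
    """
    parts = {sp: [] for sp in species}
    for alignment in alignments:
        lengths = {len(alignment[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        for sp in species:
            parts[sp].append(alignment[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}
```

The concatenated sequences would then serve as input to the Bayesian inference program.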


Figure: Phylogenetic tree based on Bayesian inference analysis of a concatenated alignment of single-copy genes from Ataus: Aegilops tauschii, Atha: Arabidopsis thaliana, Bradi: Brachypodium distachyon, Erufi: Erianthus fulvus, Orysa: Oryza sativa, Phall: Panicum hallii, Pmili: Panicum miliaceum, Sital: Setaria italica, Sorbi: Sorghum bicolor, Zeama: Zea mays, and Sspon: Saccharum spontaneum.

Divergence time estimation

The MCMCTree program, implemented in the PAML package, was used to estimate divergence times for all species. MCMCTree performs Bayesian estimation of species divergence times using soft fossil constraints. The HKY85 model (model=4) and the independent-rates molecular clock (clock=2) were used for the calculation. The MCMC chain was run for 1,000,000 samples with a sampling frequency of 2, after a burn-in of 200,000 iterations. As input, the program takes a sequence alignment, a phylogenetic tree with fossil calibrations, and a control file (usually called mcmctree.ctl).
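The settings quoted above can be written into a control file fragment with a few lines of Python. This is a partial sketch only: the file names are placeholders, the mapping of the stated numbers onto the burnin, sampfreq and nsample options is our reading of the text, and a working mcmctree.ctl needs further options (for example the priors and the root age) that are not given here and are therefore omitted.

```python
def render_mcmctree_ctl(seqfile, treefile, outfile, model=4, clock=2,
                        burnin=200000, sampfreq=2, nsample=1000000):
    """Render a partial MCMCTree control file as text.

    Only the options mentioned in the text are included; the file names
    are supplied by the caller and are placeholders in this sketch.
    """
    settings = [
        ("seqfile", seqfile),    # sequence alignment
        ("treefile", treefile),  # tree with fossil calibrations
        ("outfile", outfile),
        ("clock", clock),        # 2 = independent rates
        ("model", model),        # 4 = HKY85
        ("burnin", burnin),
        ("sampfreq", sampfreq),
        ("nsample", nsample),
    ]
    return "\n".join(f"{key:>10} = {value}" for key, value in settings) + "\n"


def total_iterations(burnin, sampfreq, nsample):
    """Chain length implied by the settings: burn-in plus sampled iterations."""
    return burnin + sampfreq * nsample


if __name__ == "__main__":
    print(render_mcmctree_ctl("alignment.phy", "calibrated_tree.nwk", "out.txt"))
    print(total_iterations(200000, 2, 1000000))  # 2200000
```

Under this reading, the chain runs for 2,200,000 iterations in total, of which every second iteration after the burn-in is recorded.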


Figure: Estimation of divergence time.

Expansion and contraction of gene families

We used CAFE (Computational Analysis of gene Family Evolution) for the statistical analysis of the evolution of the size of gene families. For a specified phylogenetic tree, and given the gene family sizes in the extant species, CAFE can estimate the global birth and death rate of gene families, infer the most likely gene family size at all internal nodes, identify gene families that have accelerated rates of gain and loss (quantified by a p-value) and identify which branches cause the p-value to be small for significant families.
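Once family sizes have been inferred at the internal nodes, expansion and contraction along a branch follow from comparing each family's size at the child node with its size at the parent node. The Python sketch below illustrates that comparison and the resulting proportions; it is not CAFE's own code, and the function names and input dictionaries are hypothetical.

```python
def classify_changes(parent_sizes, child_sizes):
    """Count expanded, contracted and unchanged families along one branch.

    parent_sizes, child_sizes: dicts mapping family_id -> gene count at the
    parent node and at the child node of the branch, respectively.
    """
    counts = {"expanded": 0, "contracted": 0, "unchanged": 0}
    for family_id, parent in parent_sizes.items():
        child = child_sizes[family_id]
        if child > parent:
            counts["expanded"] += 1
        elif child < parent:
            counts["contracted"] += 1
        else:
            counts["unchanged"] += 1
    return counts


def proportions(counts):
    """Convert counts into fractions of all families on the branch."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}
```

For example, with parent sizes {"F1": 2, "F2": 3, "F3": 1, "F4": 1} and child sizes {"F1": 4, "F2": 1, "F3": 1, "F4": 2}, two families are expanded, one is contracted and one is unchanged.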


Figure: Proportion of gene families that expanded or contracted.

Ks distribution calculation

Tandemly duplicated gene families were identified from the BLASTP results, requiring fewer than 20 intervening genes between members. The sequences of each gene family were aligned with MUSCLE, Ks values between the sequences were calculated with yn00 in PAML, and Ks values greater than 2 were removed. The median or the mean was taken to represent the Ks value of each gene family. The Ks values were then counted in intervals with an increment of 0.5 units.
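The filtering and binning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: gene positions are given as (chromosome, index in gene order) pairs, the median is used as the family representative, Ks values are non-negative, and all function names are hypothetical. The alignment with MUSCLE and the yn00 run are not reproduced.

```python
from statistics import median


def is_tandem_pair(gene_a, gene_b, max_intervening=20):
    """True if two genes lie on the same chromosome with fewer than
    max_intervening genes between them.

    gene_a, gene_b: (chromosome, index in gene order) tuples.
    """
    (chrom_a, index_a), (chrom_b, index_b) = gene_a, gene_b
    if chrom_a != chrom_b or index_a == index_b:
        return False
    return abs(index_a - index_b) - 1 < max_intervening


def family_ks(pairwise_ks, max_ks=2.0):
    """Median Ks of a family after removing values greater than max_ks."""
    kept = [ks for ks in pairwise_ks if ks <= max_ks]
    return median(kept) if kept else None


def bin_ks(values, width=0.5, max_ks=2.0):
    """Count Ks values in consecutive intervals of the given width.

    Returns a list of (lower bound, upper bound, count) tuples; a value
    equal to max_ks is placed in the last interval.
    """
    n_bins = int(round(max_ks / width))
    counts = [0] * n_bins
    for ks in values:
        index = min(int(ks // width), n_bins - 1)
        counts[index] += 1
    return [(i * width, (i + 1) * width, count) for i, count in enumerate(counts)]
```

Plotting the counts returned by bin_ks against the interval bounds gives a Ks distribution of the kind shown in the figure.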


Figure: Ks distribution of Erianthus fulvus.