This part covers several genome analyses: gene family clustering, phylogenetic tree construction, divergence time estimation, expansion and contraction of gene families, and Ks distribution calculation.
Gene family clustering
Using TreeFam's [...] genes of other species were obtained from NCBI. We chose the transcript with the longest coding sequence to represent each gene. In summary, we first performed an all-against-all comparison of all proteins using BLASTP with a cutoff of E-value < 1e-5 applied to both genes. The OrthoMCL package (version 1.4) was used to process high-scoring segment pairs (HSPs). The MCL software in OrthoMCL was then used to define the final paralogous and orthologous genes with the parameter “-abc -I=1.5”.
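The first two steps of this procedure (picking one representative transcript per gene, then filtering BLASTP hits by E-value) can be sketched as below. This is a minimal illustration, not the pipeline actually used: the `Transcript` record, both function names, and the assumption that BLASTP results are in the 12-column tabular format (E-value in the 11th column) are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record type: one entry per annotated transcript.
Transcript = namedtuple("Transcript", ["gene_id", "transcript_id", "cds"])


def longest_transcripts(transcripts):
    """Keep, for each gene, the transcript with the longest coding sequence."""
    best = {}
    for t in transcripts:
        current = best.get(t.gene_id)
        if current is None or len(t.cds) > len(current.cds):
            best[t.gene_id] = t
    return best


def filter_blast_hits(rows, max_evalue=1e-5):
    """Keep tab-separated BLASTP rows whose E-value is below the cutoff.

    Assumes the 12-column tabular layout, where column 11 (index 10)
    holds the E-value.
    """
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < max_evalue:
            kept.append((query, subject, evalue))
    return kept
```

The filtered pairs would then be passed on to OrthoMCL and MCL for clustering; that step is not reproduced here.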
Figure: Orthology cluster compositions of Ataus: Aegilops tauschii, Atha: Arabidopsis thaliana, Bradi: Brachypodium distachyon, Erufi: Erianthus fulvus, Orysa: Oryza sativa, Phall: Panicum hallii, Pmili: Panicum miliaceum, Sital: Setaria italica, Sorbi: Sorghum bicolor, Zeama: Zea mays, and Sspon: Saccharum spontaneum.
Figure: Venn diagram showing the number of unique and shared gene families among Erianthus fulvus, Oryza sativa, Zea mays, Sorghum bicolor, and Saccharum spontaneum.
Species | Number of genes | Genes in families | Unclustered genes | Number of families | Unique families | Average genes per family |
---|---|---|---|---|---|---|
Aegilops tauschii | 45305 | 39865 | 5440 | 18556 | 1551 | 2.15 |
Arabidopsis thaliana | 26950 | 22998 | 3952 | 11710 | 1201 | 1.96 |
Brachypodium distachyon | 25447 | 23437 | 2010 | 16827 | 181 | 1.39 |
Erianthus fulvus | 35065 | 25802 | 9263 | 17563 | 676 | 1.47 |
Oryza sativa | 28566 | 25667 | 2899 | 17376 | 323 | 1.48 |
Panicum hallii | 26368 | 24990 | 1378 | 17958 | 102 | 1.39 |
Panicum miliaceum | 34182 | 28019 | 6163 | 16204 | 150 | 1.73 |
Setaria italica | 27422 | 25541 | 1881 | 17747 | 134 | 1.44 |
Sorghum bicolor | 28110 | 26300 | 1810 | 18295 | 187 | 1.44 |
Saccharum spontaneum | 83826 | 67545 | 16281 | 20553 | 1981 | 3.29 |
Zea mays | 37245 | 31794 | 5451 | 17587 | 942 | 1.81 |
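The derived columns of this table appear to follow from the counts: unclustered genes equal the total number of genes minus the genes in families, and the average is the number of genes in families divided by the number of families. This relationship is inferred from the values, not stated in the text; the function name below is hypothetical.

```python
def family_summary(n_genes, n_in_families, n_families):
    """Derive the unclustered-gene count and the average family size."""
    unclustered = n_genes - n_in_families
    average = round(n_in_families / n_families, 2)
    return unclustered, average


# Erianthus fulvus row of the table
print(family_summary(35065, 25802, 17563))  # → (9263, 1.47)
```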
Phylogenetic tree construction
Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of the selective forces shaping the evolution of genes and species. Hence, the phylogenetic tree was reconstructed from the most conserved single-copy orthologs using Bayesian inference.
Figure: Phylogenetic tree based on Bayesian inference analyses of a concatenated alignment of single-copy genes from Ataus: Aegilops tauschii, Atha: Arabidopsis thaliana, Bradi: Brachypodium distachyon, Erufi: Erianthus fulvus, Orysa: Oryza sativa, Phall: Panicum hallii, Pmili: Panicum miliaceum, Sital: Setaria italica, Sorbi: Sorghum bicolor, Zeama: Zea mays, and Sspon: Saccharum spontaneum.
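One way to picture how single-copy orthologs are selected and concatenated is sketched below. It assumes cluster membership is available as a mapping from family ID to (species, gene) pairs and that each family has already been aligned; the data layout and both function names are hypothetical, and the Bayesian tree inference itself is not shown.

```python
from collections import Counter


def single_copy_families(families, species):
    """Return IDs of families with exactly one gene in every listed species."""
    wanted = set(species)
    selected = []
    for fam_id, members in families.items():
        counts = Counter(sp for sp, _gene in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(fam_id)
    return selected


def concatenate_alignments(alignments, species):
    """Join per-family aligned sequences into one concatenated row per species.

    `alignments` is a list of dicts mapping species to an aligned sequence;
    within one dict all sequences are assumed to have equal length.
    """
    return {sp: "".join(aln[sp] for aln in alignments) for sp in species}
```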
Divergence time estimation
The MCMCTree program, implemented in the PAML package, was used to estimate divergence times for all species. MCMCTree performs Bayesian estimation of species divergence times using soft fossil constraints. The HKY85 model (model=4) and the independent-rates molecular clock (clock=2) were used for the calculation. The MCMC process was run for 1,000,000 samples with a sampling frequency of 2, after a burn-in of 200,000 iterations. The program takes as input a sequence alignment, a phylogenetic tree with fossil calibrations, and a control file (usually called mcmctree.ctl).
Figure: Estimation of divergence time.
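The settings described for MCMCTree could be written into a control-file fragment along the following lines. This is a sketch under assumptions: the file names are placeholders, mapping "1,000,000 samples" to `nsample` is an interpretation of the text, and a working mcmctree.ctl needs further settings (priors, data type, and so on) that are not given here.

```python
def render_mcmctree_ctl(seqfile, treefile, outfile):
    """Render the control-file lines that correspond to the described run."""
    settings = [
        ("seqfile", seqfile),    # sequence alignment (placeholder name)
        ("treefile", treefile),  # tree with fossil calibrations (placeholder name)
        ("outfile", outfile),    # results file (placeholder name)
        ("clock", 2),            # independent rates
        ("model", 4),            # HKY85
        ("burnin", 200000),
        ("sampfreq", 2),
        ("nsample", 1000000),    # assumed reading of "1,000,000 samples"
    ]
    return "\n".join(f"{key} = {value}" for key, value in settings) + "\n"


# Hypothetical file names, for illustration only.
print(render_mcmctree_ctl("alignment.phy", "calibrated_tree.nwk", "mcmctree_out.txt"))
```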
Expansion and contraction of gene families
We used CAFE (Computational Analysis of gene Family Evolution) for the statistical analysis of the evolution of the size of gene families. For a specified phylogenetic tree, and given the gene family sizes in the extant species, CAFE can estimate the global birth and death rate of gene families, infer the most likely gene family size at all internal nodes, identify gene families that have accelerated rates of gain and loss (quantified by a p-value) and identify which branches cause the p-value to be small for significant families.
Figure: The proportion of gene family expansion and contraction.
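Once family sizes at internal nodes have been inferred, the proportion of expanded and contracted families on a branch is a simple tally. The sketch below shows only that tally; it does not implement CAFE's birth and death model or its p-value computation, and the function names and input layout (family ID mapped to size at the parent node and at the child node) are hypothetical.

```python
def classify_families(parent_sizes, child_sizes):
    """Count families that grew, shrank, or kept their size along one branch."""
    counts = {"expanded": 0, "contracted": 0, "unchanged": 0}
    for fam_id, parent in parent_sizes.items():
        child = child_sizes[fam_id]
        if child > parent:
            counts["expanded"] += 1
        elif child < parent:
            counts["contracted"] += 1
        else:
            counts["unchanged"] += 1
    return counts


def proportions(counts):
    """Convert the tallies into fractions of all families on the branch."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```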
Ks distribution calculation
Tandem duplicated gene families were identified from the BLASTP results, requiring fewer than 20 intervening genes between duplicates. The sequences of each gene family were aligned with MUSCLE, then yn00 in PAML was used to calculate the Ks value between the sequences, and Ks values greater than 2 were removed. The median or the mean was taken to represent the Ks value of each copy of the gene family. The Ks values were then counted in intervals with an increment of 0.5 units.
Figure: Ks distribution of Erianthus fulvus.
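The tandem-pair criterion and the Ks filtering and binning can be illustrated as follows. This is a sketch under assumptions: gene positions are given as (chromosome, index in gene order), tandem pairs are required to lie on the same chromosome, and the median is used as the family representative; none of these details, nor the function names, are confirmed by the text. The MUSCLE alignment and yn00 calculation are not shown.

```python
from collections import Counter
from statistics import median


def tandem_pairs(hits, gene_order, max_intervening=20):
    """Keep BLASTP hit pairs separated by fewer than `max_intervening` genes.

    `gene_order` maps a gene ID to (chromosome, index along the chromosome).
    """
    pairs = []
    for a, b in hits:
        if a == b or a not in gene_order or b not in gene_order:
            continue
        chrom_a, i = gene_order[a]
        chrom_b, j = gene_order[b]
        if chrom_a == chrom_b and abs(i - j) - 1 < max_intervening:
            pairs.append((a, b))
    return pairs


def ks_histogram(family_ks, max_ks=2.0, bin_width=0.5):
    """Drop Ks values above `max_ks`, take the median per family, and bin."""
    hist = Counter()
    for values in family_ks.values():
        kept = [ks for ks in values if ks <= max_ks]
        if not kept:
            continue
        representative = median(kept)
        bin_start = int(representative // bin_width) * bin_width
        hist[bin_start] += 1
    return dict(sorted(hist.items()))
```

Replacing `median` with `statistics.mean` gives the mean-based variant mentioned in the text.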