Assembly statistics
Sequenced Read Pairs | 523,236,442 |
Normal Paired | 218,464,153 (41.75%) |
Chimeric Paired | 205,929,044 (39.36%) |
Chimeric Ambiguous | 94,735,398 (18.11%) |
Unmapped | 4,107,847 (0.79%) |
Ligation Motif Present | 246,325,885 (47.08%) |
Alignable (Normal+Chimeric Paired) | 424,393,197 (81.11%) |
Unique Reads | 391,426,518 (74.81%) |
PCR Duplicates | 32,066,559 (6.13%) |
Optical Duplicates | 900,120 (0.17%) |
Library Complexity Estimate | 2,653,449,089 |
Intra-fragment Reads | 13,654,065 (2.61% / 3.49%) |
Below MAPQ Threshold | 112,374,693 (21.48% / 28.71%) |
Hi-C Contacts | 265,397,760 (50.72% / 67.80%) |
Ligation Motif Present | 89,285,763 (17.06% / 22.81%) |
3' Bias (Long Range) | 62% - 38% |
Pair Type %(L-I-O-R) | 25% - 25% - 25% - 25% |
Inter-chromosomal | 163,545,853 (31.26% / 41.78%) |
Intra-chromosomal | 101,851,907 (19.47% / 26.02%) |
Short Range (<20Kb) | 46,129,853 (8.82% / 11.79%) |
Long Range (>20Kb) | 55,719,938 (10.65% / 14.24%) |
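The rows that carry two percentages appear to report each count against two denominators: the sequenced read pairs first and the unique reads second. The page does not state this explicitly, so the minimal Python check below treats it as an assumption; the helper name `pct` is illustrative only.

```python
# Denominators taken from the table above.
SEQUENCED_READ_PAIRS = 523_236_442
UNIQUE_READS = 391_426_518


def pct(count, denominator):
    """Return count as a percentage of denominator, rounded to two decimals."""
    return round(100 * count / denominator, 2)


# Hi-C Contacts row: 265,397,760 (50.72% / 67.80%)
hic_contacts = 265_397_760
of_sequenced = pct(hic_contacts, SEQUENCED_READ_PAIRS)
of_unique = pct(hic_contacts, UNIQUE_READS)
print(of_sequenced, of_unique)
```

Both values match the Hi-C Contacts row, which supports reading the pair as "% of sequenced / % of unique".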
Summary of genome
Stat Type | Revised genome: contig length (bp) | Revised genome: contig number | Final assembly: scaffold length (bp) | Final assembly: scaffold number |
---|---|---|---|---|
N90 | 4,639,304 | 34 | 61,586,581 | 10 |
N80 | 15,034,666 | 24 | 70,621,000 | 9 |
N70 | 20,394,276 | 19 | 72,404,604 | 7 |
N60 | 25,436,122 | 15 | 73,488,368 | 6 |
N50 | 30,662,761 | 12 | 83,965,975 | 5 |
N40 | 33,267,760 | 9 | 92,639,145 | 4 |
N30 | 38,162,988 | 6 | 99,781,956 | 3 |
N20 | 42,127,889 | 4 | 100,399,333 | 2 |
N10 | 55,419,229 | 2 | 114,828,255 | 1 |
Max length | 57,253,638 | | 114,828,255 | |
Total length | 901,936,647 | | 902,157,147 | |
Total number | | 829 | | 1,111 |
Average length | 1,087,981 | | 812,022 | |
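The page does not say how the Nx values were computed. Under the conventional definition, sequences are sorted from longest to shortest and accumulated until they cover x% of the total length; the length reached at that point is Nx, and the count of sequences used is the accompanying number. A minimal Python sketch under that assumption, with a hypothetical function name and toy lengths rather than the real assembly:

```python
def nx(lengths, x):
    """Return (Nx length, number of sequences needed) for 0 < x <= 100."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example: five sequences with a total length of 150.
toy = [50, 40, 30, 20, 10]
n50 = nx(toy, 50)  # the two longest sequences cover at least 75
n90 = nx(toy, 90)  # the four longest sequences cover at least 135
print(n50, n90)
```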
BUSCO statistics (aves)
Category | Gene numbers | Percentage |
---|---|---|
Complete BUSCOs | 1,570 | 97.20% |
Complete and single-copy BUSCOs | 1,463 | 90.60% |
Complete and duplicated BUSCOs | 107 | 6.60% |
Fragmented BUSCOs | 20 | 1.20% |
Missing BUSCOs | 24 | 1.50% |
Total BUSCO groups searched | 1,614 | - |
Repeat annotation
Tandem repeats were identified across the genome with Tandem Repeats Finder (TRF). Transposable elements (TEs) were identified by a combination of homology-based and de novo approaches. For homology-based prediction, known repeats were identified using RepeatMasker and RepeatProteinMask against Repbase (Repbase Release 16.10; http://www.girinst.org/repbase/index.html). For de novo prediction, RepeatModeler (http://repeatmasker.org/) and LTR FINDER (http://tlife.fudan.edu.cn/ltr_finder/) were used to identify repeats directly from the assembled genome.
Statistics of Repeats in Erianthus fulvus Genome.
Type | Repeat Size (bp) | % of genome |
---|---|---|
TRF | 29,925,519 | 3.317109 |
RepeatMasker | 464,380,822 | 51.47452 |
RepeatProteinMask | 2,482,700 | 0.275196 |
De novo | 150,594,738 | 16.69275 |
Total | 618,852,487 | 68.59701 |
TEs Content in the Assembled Erianthus fulvus Genome
Type | Repbase TEs length (bp) | Repbase TEs % in genome | TE proteins length (bp) | TE proteins % in genome | De novo length (bp) | De novo % in genome | Combined TEs length (bp) | Combined TEs % in genome |
---|---|---|---|---|---|---|---|---|
DNA | 87,015,373 | 9.645262 | 986,372 | 0.109335 | 26,206,859 | 2.904912 | 113,833,999 | 12.617986 |
LINE | 18,830,051 | 2.087226 | 619,464 | 0.068665 | 4,839,490 | 0.536436 | 24,238,336 | 2.68671 |
SINE | 15,072 | 0.001671 | 0 | 0 | 27,062 | 0.003 | 42,134 | 0.00467 |
LTR | 350,446,038 | 38.845365 | 875,053 | 0.096996 | 97,471,904 | 10.804321 | 444,709,531 | 49.294048 |
Other | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Unknown | 166,369 | 0.018441 | 4,101 | 0.000455 | 26,532,782 | 2.941039 | 26,703,252 | 2.959935 |
Total | 464,380,822 | 51.474522 | 2,482,700 | 0.275196 | 150,582,070 | 16.691344 | 612,542,897 | 67.897621 |
Note: Repbase TEs: results of RepeatMasker based on Repbase; TE proteins: results of RepeatProteinMask based on Repbase; De novo: repeats found de novo (RepeatModeler); Combined TEs: the combined results of Repbase TEs, TE proteins and De novo.
Protein-coding genes prediction
To predict protein-coding genes, we used both homology-based and de novo prediction methods.
Homology-based gene prediction
Proteins from all reference species were used for homology-based annotation. First, TBLASTN was run with the parameters “E-value = 1e-5, -F F”. BLAST hits corresponding to each reference protein were concatenated with Solar, and low-quality records were filtered out. The genomic sequence matched by each reference protein was extended upstream and downstream by 2,000 bp to represent a protein-coding region. GeneWise was then used to predict the gene structure within each protein-coding region.
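The 2,000 bp extension step can be sketched as follows. This is an illustration only, not the pipeline's actual code: the function name `extend_region`, the 1-based inclusive coordinates, and the clamping at the sequence ends are all assumptions.

```python
def extend_region(start, end, seq_len, flank=2000):
    """Extend a 1-based inclusive interval by `flank` bp on each side.

    The result is clamped so that it never runs off either end of the
    sequence (an assumption; the source does not describe edge handling).
    """
    if not (1 <= start <= end <= seq_len):
        raise ValueError("interval must lie within the sequence")
    return max(1, start - flank), min(seq_len, end + flank)


# A hit in the middle of a 100 kb sequence gains 2,000 bp on both sides.
interior = extend_region(5_000, 8_000, 100_000)
# A hit near the ends of a short sequence is clamped to the sequence bounds.
clamped = extend_region(500, 1_500, 3_000)
print(interior, clamped)
```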
De novo gene prediction
In addition, we used three de novo prediction programs: Augustus, GlimmerHMM, and GENSCAN. Augustus was run with gene model parameters trained on Erianthus fulvus, which were set up with PASA. GlimmerHMM and GENSCAN were run with gene model parameters trained on Sorghum bicolor. Partial genes and small genes with coding lengths of less than 150 bp were filtered out.
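The final filter can be illustrated with a short Python sketch. The record layout (an `id`, a `complete` flag marking non-partial models, and a list of 1-based inclusive CDS segments) is hypothetical and chosen only for the example; the real predictions would come from the programs' own output formats.

```python
MIN_CDS_BP = 150  # genes with a shorter coding length are discarded


def cds_length(segments):
    """Sum the lengths of 1-based inclusive (start, end) CDS segments."""
    return sum(end - start + 1 for start, end in segments)


def keep_gene(gene):
    """Keep complete gene models whose coding length is at least MIN_CDS_BP."""
    return gene["complete"] and cds_length(gene["cds"]) >= MIN_CDS_BP


# Hypothetical gene models for illustration.
genes = [
    {"id": "g1", "complete": True, "cds": [(1, 100), (201, 300)]},   # 200 bp
    {"id": "g2", "complete": True, "cds": [(1, 120)]},               # 120 bp
    {"id": "g3", "complete": False, "cds": [(1, 300)]},              # partial
]
kept = [g["id"] for g in genes if keep_gene(g)]
print(kept)
```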
High-confidence gene model generation
Finally, the predictions above were combined with transcriptome alignment data produced by PASA (http://pasa.sourceforge.net/). All gene sets were merged with the integration software EVidenceModeler (EVM, http://evidencemodeler.sourceforge.net/) to obtain a non-redundant and more complete gene set.
Gene annotation statistics for Erianthus fulvus genome.
Methods | Gene Number | Avg. mRNA Length | Total Exon Number | Avg. Exon Length | Avg. CDS Length | Avg. Exon Number | Total Intron Length |
---|---|---|---|---|---|---|---|
Ab initio | |||||||
augustus_busco | 184,675 | 2,419.225818 | 768,778 | 300.0223875 | 1,248.954168 | 4.162869907 | 216,119,917 |
glimmerHMM | 458,735 | 890.4655171 | 802,674 | 347.3729285 | 607.8176246 | 1.749755305 | 129,660,481 |
genscan | 150,293 | 4,507.772185 | 1,027,965 | 237.3783893 | 1,623.606396 | 6.839739708 | 433,469,929 |
snap | 596,445 | 707.3504011 | 1,163,537 | 265.6400819 | 518.2071507 | 1.950786745 | 112,813,546 |
Homology | |||||||
Arabidopsis thaliana | 27,304 | 2,800.446565 | 121,467 | 230.4133468 | 1,025.037284 | 4.448688837 | 48,475,775 |
Sesamum indicum | 29,008 | 3,333.27844 | 128,974 | 240.2205561 | 1,068.057295 | 4.446152785 | 65,709,535 |
Oryza sativa | 37,220 | 3,216.894546 | 164,427 | 269.1302767 | 1,188.938313 | 4.417705535 | 75,480,531 |
Zea mays | 39,882 | 2,978.23043 | 172,446 | 262.5742493 | 1,135.346246 | 4.323905521 | 73,497,907 |
Sorghum bicolor | 42,447 | 3,234.425919 | 186,722 | 267.8713274 | 1,178.35112 | 4.398944566 | 87,274,207 |
Brachypodium distachyon | 35,979 | 3,256.63676 | 159,355 | 264.3051928 | 1,170.637149 | 4.429111426 | 75,052,180 |
transcript | 11,366 | 4,974.840313 | 72,708 | 271.9961215 | 1,739.951962 | 6.39697343 | 25,946,868 |
EVM | 189,635 | 2,513.742458 | 804,939 | 291.3776746 | 1,236.803617 | 4.244675297 | 242,152,297 |
Final | 35,065 | 3,367.788849 | 186,937 | 264.0805779 | 1,407.85487 | 5.331156424 | 68,725,085 |
Non-protein-coding genes annotation
tRNAscan-SE (version 1.23) was used with default parameters for eukaryotes to annotate tRNAs. rRNA annotation was based on homology with plant rRNAs, using BLASTN with the parameter “E-value = 1e-5”. The miRNA and snRNA genes were predicted with INFERNAL (http://infernal.janelia.org, version 0.81) against the Rfam database (Release 11.0).
Statistics for Erianthus fulvus ncRNA annotation.
Type | Copy Number | Avg. Length (bp) | Total Length (bp) | % of genome |
---|---|---|---|---|
miRNA | 532 | 230.3271 | 122,534 | 0.013582 |
tRNA | 1,184 | 75.6267 | 89,542 | 0.009925 |
rRNA | 583 | 310.0566 | 180,763 | 0.020037 |
18S | 131 | 913.8473 | 119,714 | 0.01327 |
28S | 260 | 142.2654 | 36,989 | 0.0041 |
5.8S | 55 | 153.9091 | 8,465 | 0.000938 |
5S | 137 | 113.8321 | 15,595 | 0.001729 |
snRNA | 2,578 | 109.9387 | 283,422 | 0.031416 |
CD-box | 2,408 | 107.2018 | 258,142 | 0.028614 |
H/ACA-box | 36 | 117.0556 | 4,214 | 0.000467 |
Splicing | 134 | 157.209 | 21,066 | 0.002335 |
Gene function annotation
Gene functions were assigned according to the best alignment match using BLASTP against the Swiss-Prot and TrEMBL databases. Gene motifs and domains were determined using InterProScan against all available protein databases. Gene Ontology IDs were obtained from the corresponding Swiss-Prot and TrEMBL entries. All genes were aligned against the KEGG proteins, and the pathway to which each gene might belong was derived from its matching genes in KEGG.
Statistics for Erianthus fulvus gene function annotation.
Category | Number | Percent (%) |
---|---|---|
Total | 35,065 | |
InterPro | 21,855 | 62.327107 |
GO | 16,649 | 47.480394 |
KEGG | 22,868 | 65.216027 |
Swissprot | 19,779 | 56.406673 |
TrEMBL | 26,148 | 74.570084 |
Annotated | 27,763 | 79.175816 |
Unannotated | 7,302 | 20.824184 |