
Assembly statistics

We have published a Hi-C-based chromosome-level assembly of the genome. Raw reads were first assembled with NextDenovo, and the preliminary assembly was polished with NextPolish. Chromosome construction was performed with Juicer and 3D-DNA. Juicer was run on the scaffold sequences first; the alignment ratio was 81.11% and the proportion of effective data (Hi-C contacts) was 50.72%. 3D-DNA, which is based on unsupervised clustering, was then used to build the chromosome clusters. The Hi-C mapping statistics and, after review of the contact heat map, the final sequence length statistics are given below.
Sequenced Read Pairs 523,236,442
Normal Paired 218,464,153 (41.75%)
Chimeric Paired 205,929,044 (39.36%)
Chimeric Ambiguous 94,735,398 (18.11%)
Unmapped 4,107,847 (0.79%)
Ligation Motif Present 246,325,885 (47.08%)
Alignable (Normal+Chimeric Paired) 424,393,197 (81.11%)
Unique Reads 391,426,518 (74.81%)
PCR Duplicates 32,066,559 (6.13%)
Optical Duplicates 900,120 (0.17%)
Library Complexity Estimate 2,653,449,089
Intra-fragment Reads 13,654,065 (2.61% / 3.49%)
Below MAPQ Threshold 112,374,693 (21.48% / 28.71%)
Hi-C Contacts 265,397,760 (50.72% / 67.80%)
Ligation Motif Present 89,285,763 (17.06% / 22.81%)
3' Bias (Long Range) 62% - 38%
Pair Type %(L-I-O-R) 25% - 25% - 25% - 25%
Inter-chromosomal 163,545,853 (31.26% / 41.78%)
Intra-chromosomal 101,851,907 (19.47% / 26.02%)
Short Range (<20Kb) 46,129,853 (8.82% / 11.79%)
Long Range (>20Kb) 55,719,938 (10.65% / 14.24%)
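Several rows above carry two percentages. A minimal sketch, assuming the first is relative to all sequenced read pairs and the second to unique reads (an interpretation that reproduces the Hi-C Contacts row; the function name `pct` is illustrative only):

```python
def pct(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total

# Values copied from the table above.
sequenced_read_pairs = 523_236_442
unique_reads = 391_426_518
hic_contacts = 265_397_760

of_sequenced = pct(hic_contacts, sequenced_read_pairs)
of_unique = pct(hic_contacts, unique_reads)
print(f"{of_sequenced:.2f}% / {of_unique:.2f}%")  # prints "50.72% / 67.80%"
```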
Summary of genome
| Stat Type | Revised genome: contig length (bp) | Revised genome: contig number | Final assembly: scaffold length (bp) | Final assembly: scaffold number |
|---|---|---|---|---|
| N90 | 4,639,304 | 34 | 61,586,581 | 10 |
| N80 | 15,034,666 | 24 | 70,621,000 | 9 |
| N70 | 20,394,276 | 19 | 72,404,604 | 7 |
| N60 | 25,436,122 | 15 | 73,488,368 | 6 |
| N50 | 30,662,761 | 12 | 83,965,975 | 5 |
| N40 | 33,267,760 | 9 | 92,639,145 | 4 |
| N30 | 38,162,988 | 6 | 99,781,956 | 3 |
| N20 | 42,127,889 | 4 | 100,399,333 | 2 |
| N10 | 55,419,229 | 2 | 114,828,255 | 1 |
| Max length | 57,253,638 | - | 114,828,255 | - |
| Total length | 901,936,647 | - | 902,157,147 | - |
| Total number | - | 829 | - | 1,111 |
| Average length | 1,087,981 | - | 812,022 | - |
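The Nx rows in the summary above follow the usual definition: sort sequences from longest to shortest and accumulate lengths until x% of the total is reached; the length of the sequence that crosses the threshold is Nx, and the number of sequences used is the matching count. A minimal sketch of that calculation (the function name `nx_stats` is hypothetical, and this is not necessarily the script used to produce the table):

```python
def nx_stats(lengths, x=50):
    """Return (Nx length, number of sequences) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)   # longest first
    threshold = sum(ordered) * x / 100        # x% of the total assembly length
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count

# Toy example: five sequences totalling 30 bp.
print(nx_stats([10, 8, 6, 4, 2], x=50))  # (8, 2)
print(nx_stats([10, 8, 6, 4, 2], x=90))  # (4, 4)
```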
Statistic of BUSCO(aves)
| | Gene numbers | Percentage |
|---|---|---|
| Complete BUSCOs | 1,570 | 97.20% |
| Complete and single-copy BUSCOs | 1,463 | 97.20% |
| Complete and duplicated BUSCOs | 107 | 90.60% |
| Fragmented BUSCOs | 20 | 6.60% |
| Missing BUSCOs | 24 | 1.20% |
| Total BUSCO groups searched | 1,614 | - |

Repeat annotation

Tandem repeats were identified across the genome using Tandem Repeats Finder (TRF). Transposable elements (TEs) were identified by combining homology-based and de novo approaches. For homology-based prediction, known repeats were identified using RepeatMasker and RepeatProteinMask against Repbase (Release 16.10; http://www.girinst.org/repbase/index.html). For de novo prediction, RepeatModeler (http://repeatmasker.org/) and LTR_FINDER (http://tlife.fudan.edu.cn/ltr_finder/) were used to identify repeats directly from the assembled genome.

Statistics of Repeats in Erianthus fulvus Genome.
| Type | Repeat Size (bp) | % of genome |
|---|---|---|
| TRF | 29,925,519 | 3.317109 |
| RepeatMasker | 464,380,822 | 51.47452 |
| ProteinMask | 2,482,700 | 0.275196 |
| De novo | 150,594,738 | 16.69275 |
| Total | 618,852,487 | 68.59701 |
TEs Content in the Assembled Erianthus fulvus Genome
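| Type | Repbase TEs: length (bp) | Repbase TEs: % in genome | TE proteins: length (bp) | TE proteins: % in genome | De novo: length (bp) | De novo: % in genome | Combined TEs: length (bp) | Combined TEs: % in genome |
|---|---|---|---|---|---|---|---|---|
| DNA | 87,015,373 | 9.645262 | 986,372 | 0.109335 | 26,206,859 | 2.904912 | 113,833,999 | 12.617986 |
| LINE | 18,830,051 | 2.087226 | 619,464 | 0.068665 | 4,839,490 | 0.536436 | 24,238,336 | 2.68671 |
| SINE | 15,072 | 0.001671 | 0 | 0 | 27,062 | 0.003 | 42,134 | 0.00467 |
| LTR | 350,446,038 | 38.845365 | 875,053 | 0.096996 | 97,471,904 | 10.804321 | 444,709,531 | 49.294048 |
| Other | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Unknown | 166,369 | 0.018441 | 4,101 | 0.000455 | 26,532,782 | 2.941039 | 26,703,252 | 2.959935 |
| Total | 464,380,822 | 51.474522 | 2,482,700 | 0.275196 | 150,582,070 | 16.691344 | 612,542,897 | 67.897621 |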

Note: Repbase TEs: RepeatMasker results based on Repbase; TE proteins: RepeatProteinMask results based on Repbase; De novo: repeats found de novo (RepeatModeler); Combined: the combined results of Repbase TEs, TE proteins and De novo.
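The Combined lengths are generally smaller than the sum of the three individual columns, which is what one would expect if bases annotated by more than one method are counted once. A minimal sketch of that kind of calculation, assuming half-open `(start, end)` coordinates on a single sequence (the function names and the toy coordinates are hypothetical, and the actual pipeline may differ):

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered

def percent_of_genome(bases, genome_size):
    """Express a base count as a percentage of the genome size."""
    return 100.0 * bases / genome_size

# Toy example: hits from three methods on a 100 bp sequence.
repbase_hits = [(0, 10), (20, 30)]
protein_hits = [(5, 15)]
de_novo_hits = [(25, 40)]
combined = merged_length(repbase_hits + protein_hits + de_novo_hits)
print(combined, percent_of_genome(combined, 100))  # 35 35.0
```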

Protein-coding genes prediction

To predict protein-coding genes, we used both homology-based and de novo prediction methods.
Homology based gene prediction
Protein sequences from all reference species (listed under Homology in the gene annotation statistics table) were used in the homology-based annotation. First, TBLASTN was run with an E-value cutoff of 1e-5 and low-complexity filtering disabled (“-F F”). BLAST hits corresponding to the same reference protein were concatenated using Solar, and low-quality records were filtered out. The genomic region matched by each reference protein was extended by 2,000 bp upstream and downstream to represent a protein-coding region. GeneWise was then used to predict the gene structure within each of these regions.
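The 2,000 bp extension step can be sketched as follows. This assumes 1-based inclusive coordinates and clamps the extended region to the sequence boundaries; the clamping and the function name `extend_region` are assumptions for illustration, not details stated above.

```python
def extend_region(start, end, seq_length, flank=2000):
    """Extend a 1-based inclusive region by `flank` bp on each side.

    The result is clamped to [1, seq_length] so it never runs off the sequence.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return max(1, start - flank), min(seq_length, end + flank)

print(extend_region(10_000, 12_000, 100_000))  # (8000, 14000)
print(extend_region(1_500, 5_000, 6_000))      # (1, 6000)
```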
de novo gene prediction
In addition, we used three de novo prediction programs: Augustus, GlimmerHMM and GENSCAN. Augustus was run with gene model parameters trained on Erianthus fulvus gene models set up by PASA. GlimmerHMM and GENSCAN were run with gene model parameters trained on Sorghum bicolor. Partial genes and small genes with coding lengths of less than 150 bp were filtered out.
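The final filter can be sketched as below. The `GeneModel` record and its fields are hypothetical, and treating a model as partial when it lacks a start or stop codon is an assumption; the text above only states that partial genes and genes with coding length under 150 bp were removed.

```python
from dataclasses import dataclass

@dataclass
class GeneModel:
    gene_id: str
    cds_length: int   # total coding sequence length in bp
    is_partial: bool  # assumed meaning: model lacks a start or stop codon

def filter_gene_models(models, min_cds_length=150):
    """Keep complete gene models whose coding length is at least min_cds_length bp."""
    return [m for m in models if not m.is_partial and m.cds_length >= min_cds_length]

models = [
    GeneModel("g1", 900, False),   # kept
    GeneModel("g2", 120, False),   # dropped: coding length below 150 bp
    GeneModel("g3", 1500, True),   # dropped: partial
    GeneModel("g4", 150, False),   # kept: exactly at the threshold
]
print([m.gene_id for m in filter_gene_models(models)])  # ['g1', 'g4']
```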
High-confidence gene model generation
Finally, the predictions above were combined with transcriptome alignment data produced by PASA (http://pasa.sourceforge.net/). All gene sets were merged using the integration software EVidenceModeler (EVM, http://evidencemodeler.sourceforge.net/) to obtain a non-redundant and more complete gene set.

Gene annotation statistics for Erianthus fulvus genome.
| Methods | Gene Number | Avg. mRNA Length | Total Exon Number | Avg. Exon Length | Avg. CDS Length | Avg. Exon Number | Total Intron Length |
|---|---|---|---|---|---|---|---|
| Ab initio | | | | | | | |
| augustus_busco | 184,675 | 2,419.225818 | 768,778 | 300.0223875 | 1,248.954168 | 4.162869907 | 216,119,917 |
| glimmerHMM | 458,735 | 890.4655171 | 802,674 | 347.3729285 | 607.8176246 | 1.749755305 | 129,660,481 |
| genscan | 150,293 | 4,507.772185 | 1,027,965 | 237.3783893 | 1,623.606396 | 6.839739708 | 433,469,929 |
| snap | 596,445 | 707.3504011 | 1,163,537 | 265.6400819 | 518.2071507 | 1.950786745 | 112,813,546 |
| Homology | | | | | | | |
| Arabidopsis thaliana | 27,304 | 2,800.446565 | 121,467 | 230.4133468 | 1,025.037284 | 4.448688837 | 48,475,775 |
| Sesamum indicum | 29,008 | 3,333.27844 | 128,974 | 240.2205561 | 1,068.057295 | 4.446152785 | 65,709,535 |
| Oryza sativa | 37,220 | 3,216.894546 | 164,427 | 269.1302767 | 1,188.938313 | 4.417705535 | 75,480,531 |
| Zea mays | 39,882 | 2,978.23043 | 172,446 | 262.5742493 | 1,135.346246 | 4.323905521 | 73,497,907 |
| Sorghum bicolor | 42,447 | 3,234.425919 | 186,722 | 267.8713274 | 1,178.35112 | 4.398944566 | 87,274,207 |
| Brachypodium distachyon | 35,979 | 3,256.63676 | 159,355 | 264.3051928 | 1,170.637149 | 4.429111426 | 75,052,180 |
| transcript | 11,366 | 4,974.840313 | 72,708 | 271.9961215 | 1,739.951962 | 6.39697343 | 25,946,868 |
| EVM | 189,635 | 2,513.742458 | 804,939 | 291.3776746 | 1,236.803617 | 4.244675297 | 242,152,297 |
| Final | 35,065 | 3,367.788849 | 186,937 | 264.0805779 | 1,407.85487 | 5.331156424 | 68,725,085 |

Non-protein-coding genes annotation

tRNA genes were annotated using tRNAscan-SE (version 1.23) with default eukaryote parameters. rRNA genes were annotated by homology to plant rRNAs using BLASTN with an E-value cutoff of 1e-5. miRNA and snRNA genes were predicted using INFERNAL (http://infernal.janelia.org, version 0.81) against the Rfam database (Release 11.0).

Statistics for Erianthus fulvus ncRNA annotation.
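| Type | Copy Number | Avg. Length (bp) | Total Length (bp) | % of genome |
|---|---|---|---|---|
| miRNA | 532 | 230.3271 | 122,534 | 0.013582 |
| tRNA | 1,184 | 75.6267 | 89,542 | 0.009925 |
| rRNA | 583 | 310.0566 | 180,763 | 0.020037 |
| 18S | 131 | 913.8473 | 119,714 | 0.01327 |
| 28S | 260 | 142.2654 | 36,989 | 0.0041 |
| 5.8S | 55 | 153.9091 | 8,465 | 0.000938 |
| 5S | 137 | 113.8321 | 15,595 | 0.001729 |
| snRNA | 2,578 | 109.9387 | 283,422 | 0.031416 |
| CD-box | 2,408 | 107.2018 | 258,142 | 0.028614 |
| HACA-box | 36 | 117.0556 | 4,214 | 0.000467 |
| Splicing | 134 | 157.209 | 21,066 | 0.002335 |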

Gene function annotation

Gene functions were assigned according to the best-matching alignment from BLASTP searches against the Swiss-Prot and TrEMBL databases. Gene motifs and domains were determined using InterProScan run against all of its protein databases. Gene Ontology IDs were obtained from the corresponding Swiss-Prot and TrEMBL entries. All genes were also aligned against the KEGG proteins, and the pathway to which a gene might belong was derived from its matching genes in KEGG.
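Best-match assignment can be sketched as below for BLAST tabular output (the standard 12-column format: query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). Ranking by highest bit score and then lowest E-value is an assumption, since the criterion for "best match" is not stated above; the function name `best_hits` and the sample identifiers are hypothetical.

```python
def best_hits(tabular_lines):
    """Map each query ID to its best subject ID from 12-column BLAST tabular lines."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        rank = (bitscore, -evalue)  # higher bit score first, then lower E-value
        if query not in best or rank > best[query][0]:
            best[query] = (rank, subject)
    return {query: subject for query, (_, subject) in best.items()}

sample = [
    "gene1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
    "gene2\tP3\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60",
]
print(best_hits(sample))  # {'gene1': 'P2', 'gene2': 'P3'}
```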

Statistics for Erianthus fulvus gene function annotation.
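| | Number | Percent (%) |
|---|---|---|
| Total | 35,065 | - |
| InterPro | 21,855 | 62.327107 |
| GO | 16,649 | 47.480394 |
| KEGG | 22,868 | 65.216027 |
| Swissprot | 19,779 | 56.406673 |
| TrEMBL | 26,148 | 74.570084 |
| Annotated | 27,763 | 79.175816 |
| Unannotated | 7,302 | 20.824184 |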