UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2019_08 STATISTICS


1.  INTRODUCTION

Release 2019_08 of 18-Sep-2019 of UniProtKB/TrEMBL contains 171501488 sequence entries,
comprising 57651951630 amino acids.

4227499 sequences have been added since release 2019_07, the sequence data of
18613 existing entries has been updated and the annotations of
20902200 entries have been revised. The newly added sequences represent an
increase of 3% in the number of entries.

Number of fragments: 16359479

Protein existence (PE):              entries      %
1: Evidence at protein level          150648     0.09%
2: Evidence at transcript level      1272252     0.74%
3: Inferred from homology           43351274    25.28%
4: Predicted                       126727314    73.89%
5: Uncertain                               0     0.00%
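
The percentage column above can be reproduced from the entry counts. The
following is a minimal Python sketch, assuming each percentage is simply the
count divided by the total number of entries and rounded to two decimals. The
names pe_counts and pe_percentages are illustrative and not part of any
UniProt tool.

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 150648,
    "2: Evidence at transcript level": 1272252,
    "3: Inferred from homology": 43351274,
    "4: Predicted": 126727314,
    "5: Uncertain": 0,
}

# The five PE levels partition the database, so their sum is the entry total.
total_entries = sum(pe_counts.values())


def pe_percentages(counts):
    """Return each PE level's share of all entries, in percent (2 decimals)."""
    total = sum(counts.values())
    return {level: round(100.0 * n / total, 2) for level, n in counts.items()}


percentages = pe_percentages(pe_counts)
for level, pct in percentages.items():
    print(f"{level:<34}{pe_counts[level]:>10}{pct:>9.2f}%")
```

The sum of the five counts equals the number of sequence entries given at the
start of this section, which is a quick consistency check on the table.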

The growth of the database is summarized below.



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 1179153

   The first twenty species represent 17770163 sequences:  10.4 % of the
   total number of entries.


   2.1  Table of the frequency of occurrence of species

        Species represented 1x: 668123
                            2x: 129369
                            3x:  68661
                            4x:  48613
                            5x:  30065
                            6x:  21619
                            7x:  16183
                            8x:  12820
                            9x:  10421
                           10x:  15771
                       11- 20x:  81080
                       21- 50x:  24083
                       51-100x:  15100
                         >100x:  37245
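
The frequency classes in table 2.1 should add up to the total number of
species reported at the start of section 2. A minimal Python sketch of that
check, assuming every species falls into exactly one class (the variable
names are illustrative):

```python
# Number of species observed a given number of times, copied from table 2.1.
species_by_frequency = {
    "1x": 668123, "2x": 129369, "3x": 68661, "4x": 48613, "5x": 30065,
    "6x": 21619, "7x": 16183, "8x": 12820, "9x": 10421, "10x": 15771,
    "11-20x": 81080, "21-50x": 24083, "51-100x": 15100, ">100x": 37245,
}

# Species total stated at the start of section 2.
REPORTED_SPECIES = 1179153

total_species = sum(species_by_frequency.values())

# Share of species that are represented by a single sequence only.
singleton_share = 100.0 * species_by_frequency["1x"] / total_species

print(f"species in table: {total_species}")
print(f"matches reported total: {total_species == REPORTED_SPECIES}")
print(f"species represented once: {singleton_share:.1f}% of all species")
```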


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1    1961734  compost metagenome
       2    1113123  Chernetidae sp. UAIC


   2.3  Taxonomic distribution of the sequences


   Kingdom        sequences (% of the database)
    Archaea         3500083 (  2%)
    Bacteria      120690603 ( 70%)
    Eukaryota      39558937 ( 23%)
    Viruses         4145685 (  2%)
    Other           3606180 (  2%)



   Within Eukaryota:


    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 152925 (  0%)           (  0%)
     Other Mammalia       3347661 (  8%)           (  2%)
     Other Vertebrata     4795749 ( 12%)           (  3%)
     Viridiplantae        8378028 ( 21%)           (  5%)
     Fungi               11196133 ( 28%)           (  7%)
     Insecta              3753265 (  9%)           (  2%)
     Nematoda             1712429 (  4%)           (  1%)
     Other                6222747 ( 16%)           (  4%)



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  2074577             1001-1100   1191819
                 51- 100 14643715             1101-1200    827630
                101- 150 17280593             1201-1300    558264
                151- 200 16568777             1301-1400    375096
                201- 250 16419950             1401-1500    291623
                251- 300 16370912             1501-1600    215514
                301- 350 14939526             1601-1700    161008
                351- 400 11568879             1701-1800    123726
                401- 450  9832601             1801-1900    107889
                451- 500  7871667             1901-2000     90543
                501- 550  5494407             2001-2100     71461
                551- 600  4149212             2101-2200     66860
                601- 650  3072291             2201-2300     53015
                651- 700  2428275             2301-2400     43184
                701- 750  2047910             2401-2500     37063
                751- 800  1742394             >2500        279101
                801- 850  1376518
                851- 900  1174308
                901- 950   900214
                951-1000   691487



   The average sequence length in UniProtKB/TrEMBL is   336 amino acids.

   The shortest sequence is     C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A316Q3J5_9FIRM: 74488 amino acids.
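
The average length quoted above is consistent with the totals given in
section 1. A minimal Python sketch, assuming the average is the total number
of amino acids divided by the total number of entries (fragments included,
unlike the size table), rounded to the nearest integer:

```python
# Totals copied from section 1 (Introduction).
TOTAL_AMINO_ACIDS = 57651951630
TOTAL_ENTRIES = 171501488

# Mean number of amino acids per sequence entry.
average_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES

print(f"average sequence length: {average_length:.0f} amino acids")
```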



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   203106663                1.18                                                    
   Submitted to EMBL/GenBank/DDBJ 135768163 122901866      0.79                                                    
   Journal                         59019631  55599148      0.34                                                    
   Submitted to other databases     8286240   8258997      0.05                                                    
   Book citation                      16600     16533     <0.01                                                    
   Thesis                             16028     15968     <0.01                                                    
   Patent                                 1         1     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 760088
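
The "Average per entry" column in the tables of this section can be derived
from the "Total number" column. A minimal Python sketch, assuming the average
is the total number of lines divided by the number of entries in the release,
and that values which round to 0.00 are printed as "<0.01" (a display rule
inferred from the tables, not a documented one). The function name
average_per_entry is illustrative.

```python
# Number of sequence entries in the release, from section 1.
TOTAL_ENTRIES = 171501488


def average_per_entry(total_lines, entries=TOTAL_ENTRIES):
    """Format lines-per-entry the way the tables show it (assumed rule)."""
    text = f"{total_lines / entries:.2f}"
    # Anything that would print as 0.00 is shown as "<0.01" instead.
    return "<0.01" if text == "0.00" else text


# Totals copied from the References (RL) table above.
reference_totals = {
    "References (RL)": 203106663,
    "Submitted to EMBL/GenBank/DDBJ": 135768163,
    "Journal": 59019631,
    "Submitted to other databases": 8286240,
    "Book citation": 16600,
}

for line_type, total in reference_totals.items():
    print(f"{line_type:<32}{total:>10}  {average_per_entry(total):>6}")
```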


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     250689786                1.46                                                    
   ACTIVITY REGULATION               479498    469805     <0.01    11                                              
   CATALYTIC ACTIVITY              20753304  18452395      0.12     4                                              
   CAUTION                        107068892 104653256      0.62     1                                              
   COFACTOR                        10082287   9148451      0.06     8                                              
   DOMAIN                           1580783   1225630      0.01     9                                              
   FUNCTION                        23663503  22506041      0.14     3                                              
   INTERACTION                         3055      3055     <0.01    12                                              
   MISCELLANEOUS                     931504    840696      0.01    10                                              
   PATHWAY                         10302312   9297016      0.06     7                                              
   SIMILARITY                      43508303  42933294      0.25     2                                              
   SUBCELLULAR LOCATION            20117127  19975527      0.12     5                                              
   SUBUNIT                         12199218  12056154      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     537175345                3.13                                                    
   ACT_SITE                        11077859   6742149      0.06    11                                              
   BINDING                         23720315   6154522      0.14     5                                              
   CARBOHYD                           32586     29498     <0.01    25                                              
   CHAIN                           12686709  12671077      0.07     9                                              
   COILED                          21218616  14638549      0.12     7                                              
   COMPBIAS                        43572195  19441655      0.25     4                                              
   CROSSLNK                           51386     47559     <0.01    24                                              
   DISULFID                         2767232    796575      0.02    16                                              
   DNA_BIND                         1364121   1343834      0.01    18                                              
   DOMAIN                         125143595  90345644      0.73     2                                              
   INIT_MET                           69389     69388     <0.01    22                                              
   INTRAMEM                            1578      1314     <0.01    27                                              
   LIPID                             403776    232858     <0.01    21                                              
   METAL                           19389110   5048216      0.11     8                                              
   MOD_RES                          3490570   3080194      0.02    14                                              
   MOTIF                            2028744   1389530      0.01    17                                              
   NON_STD                             9583      9331     <0.01    26                                              
   NON_TER                         23278199  16404483      0.14     6                                              
   NP_BIND                          9862439   6246167      0.06    12                                              
   PEPTIDE                              922       644     <0.01    28                                              
   PROPEP                             60461     60461     <0.01    23                                              
   REGION                          61052246  36699736      0.36     3                                              
   REPEAT                           6591738   1572179      0.04    13                                              
   SIGNAL                          12654530  12654519      0.07    10                                              
   SITE                             2962782   1767176      0.02    15                                              
   TOPO_DOM                          404600    184863     <0.01    20                                              
   TRANSIT                              161       161     <0.01    29                                              
   TRANSMEM                       152696027  33478280      0.89     1                                              
   ZN_FING                           583876    459894     <0.01    19                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1982006720               11.56                                                    
   ABCD                                 218       218     <0.01   113   Protocols and materials databases          
   Allergome                           3846      3111     <0.01    88   Protein family/group databases             
   ArachnoServer                        199       199     <0.01   117   Organism-specific databases                
   Araport                            15097     15031     <0.01    79   Organism-specific databases                
   BRENDA                              9500      9210     <0.01    82   Enzyme and pathway databases               
   Bgee                              518741    518471     <0.01    45   Gene expression databases                  
   BindingDB                            215       215     <0.01   114   Chemistry                                  
   BioCyc                          15635383  15013027      0.09    20   Enzyme and pathway databases               
   BioMuta                             1022      1021     <0.01   101   Polymorphism and mutation databases        
   CAZy                              128754    120495     <0.01    55   Protein family/group databases             
   CDD                             28290296  25188499      0.16    14   Family and domain databases                
   CGD                                20794     20728     <0.01    76   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   132   2D gel databases                           
   CORUM                                232       232     <0.01   112   Protein-protein interaction databases      
   CPTAC                                 22        15     <0.01   126   Proteomic databases                        
   CTD                              1380643   1379022      0.01    37   Organism-specific databases                
   CarbonylDB                           245       245     <0.01   110   PTM databases                              
   ChEMBL                              1001       998     <0.01   102   Chemistry                                  
   ChiTaRS                           130977    130976     <0.01    54   Other                                      
   CollecTF                             195       195     <0.01   118   Gene expression databases                  
   ComplexPortal                        201       149     <0.01   115   Protein-protein interaction databases      
   ConoServer                           157       157     <0.01   119   Organism-specific databases                
   DIP                                 3182      3181     <0.01    91   Protein-protein interaction databases      
   DNASU                              41226     40787     <0.01    68   Protocols and materials databases          
   DisProt                               96        96     <0.01   121   Family and domain databases                
   DrugBank                             753       447     <0.01   104   Chemistry                                  
   DrugCentral                          243       243     <0.01   111   Chemistry                                  
   ELM                                   94        94     <0.01   122   Protein-protein interaction databases      
   EMBL                           201985119 165720472      1.18     3   Sequence databases                         
   EPD                                12182     12182     <0.01    80   Proteomic databases                        
   ESTHER                             79931     79604     <0.01    59   Protein family/group databases             
   Ensembl                          3646652   3529504      0.02    29   Genome annotation databases                
   EnsemblBacteria                 37588331  35373416      0.22    12   Genome annotation databases                
   EnsemblFungi                     5913661   5771266      0.03    28   Genome annotation databases                
   EnsemblMetazoa                    892262    872098      0.01    39   Genome annotation databases                
   EnsemblPlants                    2728681   2487389      0.02    32   Genome annotation databases                
   EnsemblProtists                  1684744   1597291      0.01    36   Genome annotation databases                
   EuPathDB                          687412    686965     <0.01    41   Organism-specific databases                
   EvolutionaryTrace                   5910      5910     <0.01    86   Other                                      
   ExpressionAtlas                   637522    637522     <0.01    42   Gene expression databases                  
   FlyBase                            90757     90320     <0.01    58   Organism-specific databases                
   GO                             287136389 107611010      1.67     2   Ontologies                                 
   Gene3D                          79666837  65372953      0.46     8   Family and domain databases                
   GeneCards                           1340      1314     <0.01    99   Organism-specific databases                
   GeneDB                            105233    103575     <0.01    56   Genome annotation databases                
   GeneID                          11243736  11127299      0.07    24   Genome annotation databases                
   GeneTree                         3167449   3167239      0.02    30   Phylogenomic databases                     
   Genevisible                        15776     15768     <0.01    78   Gene expression databases                  
   GenomeRNAi                         32135     32135     <0.01    74   Other                                      
   GlyConnect                            16        16     <0.01   128   PTM databases                              
   Gramene                          2697534   2455877      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   133   Chemistry                                  
   HAMAP                           18797990  18564980      0.11    17   Family and domain databases                
   HGNC                               52908     52799     <0.01    64   Organism-specific databases                
   HOGENOM                          2981654   2981570      0.02    31   Phylogenomic databases                     
   InParanoid                       2223084   2223084      0.01    34   Phylogenomic databases                     
   IntAct                             19620     19484     <0.01    77   Protein-protein interaction databases      
   InterPro                       450457895 135352754      2.63     1   Family and domain databases                
   KEGG                            17763105  17312128      0.10    18   Genome annotation databases                
   KO                               7976656   7944726      0.05    26   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    96   Organism-specific databases                
   Leproma                             1271      1269     <0.01   100   Organism-specific databases                
   MEROPS                            234941    234939     <0.01    49   Protein family/group databases             
   MGI                                63378     62942     <0.01    61   Organism-specific databases                
   MINT                                2581      2581     <0.01    95   Protein-protein interaction databases      
   MalaCards                              6         6     <0.01   130   Organism-specific databases                
   MaxQB                              38811     38811     <0.01    69   Proteomic databases                        
   MoonDB                                 1         1     <0.01   136   Protein family/group databases             
   MoonProt                              59        59     <0.01   125   Protein family/group databases             
   NIAGADS                              263       263     <0.01   109   Organism-specific databases                
   OGP                                    3         3     <0.01   134   2D gel databases                           
   OMA                              7037155   7037124      0.04    27   Phylogenomic databases                     
   OpenTargets                        51100     51052     <0.01    65   Organism-specific databases                
   OrthoDB                         19265784  19265775      0.11    16   Phylogenomic databases                     
   PANTHER                         37650712  36538612      0.22    11   Family and domain databases                
   PATRIC                          16291446  16272731      0.09    19   Genome annotation databases                
   PDB                                43547     20204     <0.01    66   3D structure databases                     
   PDBsum                             42922     19774     <0.01    67   3D structure databases                     
   PIR                               162384    130157     <0.01    50   Sequence databases                         
   PIRSF                           14868721  14741058      0.09    21   Family and domain databases                
   PMAP-CutDB                           130       130     <0.01   120   Other                                      
   PRIDE                             386296    386296     <0.01    47   Proteomic databases                        
   PRINTS                          22012579  19931880      0.13    15   Family and domain databases                
   PRO                                 2229      2229     <0.01    97   Other                                      
   PROSITE                         85992281  57492452      0.50     7   Family and domain databases                
   PaxDb                             273752    273752     <0.01    48   Proteomic databases                        
   PeptideAtlas                      132796    132796     <0.01    53   Proteomic databases                        
   PeroxiBase                          2610      2594     <0.01    94   Protein family/group databases             
   Pfam                           173803359 124525180      1.01     4   Family and domain databases                
   PharmGKB                            3120      3120     <0.01    92   Organism-specific databases                
   PhosphoSitePlus                     2189      2189     <0.01    98   PTM databases                              
   PhylomeDB                         457111    457111     <0.01    46   Phylogenomic databases                     
   PomBase                                2         2     <0.01   135   Organism-specific databases                
   ProMEX                              2775      2775     <0.01    93   Proteomic databases                        
   Proteomes                      135925903 125191381      0.79     5   Other                                      
   ProteomicsDB                       35679     35625     <0.01    72   Proteomic databases                        
   PseudoCAP                           4441      4437     <0.01    87   Organism-specific databases                
   REBASE                             32929     32887     <0.01    73   Protein family/group databases             
   REPRODUCTION-2DPAGE                   62        61     <0.01   124   2D gel databases                           
   RGD                                21558     20688     <0.01    75   Organism-specific databases                
   Reactome                          137490     51340     <0.01    52   Enzyme and pathway databases               
   RefSeq                          48850414  47616206      0.28     9   Sequence databases                         
   SABIO-RK                             638       638     <0.01   105   Enzyme and pathway databases               
   SFLD                             1078404    838443      0.01    38   Family and domain databases                
   SGD                                    7         7     <0.01   129   Organism-specific databases                
   SIGNOR                                 5         5     <0.01   131   Enzyme and pathway databases               
   SMART                           41054580  31089642      0.24    10   Family and domain databases                
   SMR                              1686713   1686713      0.01    35   3D structure databases                     
   STRING                          12655956  12655901      0.07    23   Protein-protein interaction databases      
   SUPFAM                         113008516  89509209      0.66     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   137   2D gel databases                           
   SignaLink                           3759      3759     <0.01    89   Enzyme and pathway databases               
   SwissLipids                           81        81     <0.01   123   Chemistry                                  
   SwissPalm                           3501      3501     <0.01    90   PTM databases                              
   TAIR                               11782     11721     <0.01    81   Organism-specific databases                
   TCDB                                8401      8387     <0.01    83   Protein family/group databases             
   TIGRFAMs                        36059866  33151639      0.21    13   Family and domain databases                
   TopDownProteomics                    274       274     <0.01   108   Proteomic databases                        
   TreeFam                           544684    544648     <0.01    44   Phylogenomic databases                     
   TubercuList                          993       992     <0.01   103   Organism-specific databases                
   UCSC                               92378     92166     <0.01    57   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   127   PTM databases                              
   UniLectin                            201       201     <0.01   116   Protein family/group databases             
   UniPathway                      10018533   9263066      0.06    25   Enzyme and pathway databases               
   VGNC                              156580    156580     <0.01    51   Organism-specific databases                
   VectorBase                        596095    577101     <0.01    43   Genome annotation databases                
   WBParaSite                        860893    852465      0.01    40   Genome annotation databases                
   World-2DPAGE                         314       309     <0.01   107   2D gel databases                           
   WormBase                           56059     55676     <0.01    62   Organism-specific databases                
   Xenbase                            37061     36803     <0.01    71   Organism-specific databases                
   ZFIN                               54301     54149     <0.01    63   Organism-specific databases                
   dictyBase                           7988      7766     <0.01    85   Organism-specific databases                
   eggNOG                          13573650   6803736      0.08    22   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    60   Organism-specific databases                
   iPTMnet                             8260      8260     <0.01    84   PTM databases                              
   jPOST                              37368     37368     <0.01    70   Proteomic databases                        
   mycoCLAP                             447       447     <0.01   106   Protein family/group databases             

Number of explicitly cross-referenced databases: 137


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.23   Gln (Q) 3.77   Leu (L) 9.90   Ser (S) 6.64
   Arg (R) 5.77   Glu (E) 6.16   Lys (K) 4.90   Thr (T) 5.55
   Asn (N) 3.81   Gly (G) 7.35   Met (M) 2.37   Trp (W) 1.30
   Asp (D) 5.48   His (H) 2.19   Phe (F) 3.91   Tyr (Y) 2.90
   Cys (C) 1.19   Ile (I) 5.65   Pro (P) 4.87   Val (V) 6.92

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.04



   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Arg, Ile, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
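
The ordering in 5.2 follows directly from the percentages in 5.1. A minimal
Python sketch, assuming the classification is a plain descending sort of the
twenty standard amino acids by their composition value (the ambiguity codes
B, Z and X are left out):

```python
# Composition in percent for the complete database, copied from table 5.1.
composition = {
    "Ala": 9.23, "Gln": 3.77, "Leu": 9.90, "Ser": 6.64,
    "Arg": 5.77, "Glu": 6.16, "Lys": 4.90, "Thr": 5.55,
    "Asn": 3.81, "Gly": 7.35, "Met": 2.37, "Trp": 1.30,
    "Asp": 5.48, "His": 2.19, "Phe": 3.91, "Tyr": 2.90,
    "Cys": 1.19, "Ile": 5.65, "Pro": 4.87, "Val": 6.92,
}

# Most frequent amino acid first.
ranked = sorted(composition, key=composition.get, reverse=True)

print(", ".join(ranked))
```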



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 2220003
Total number of entries encoded on a Plasmid: 1098611
Total number of entries encoded on a Plastid: 164290
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 62
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: