UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2020_01 STATISTICS


1.  INTRODUCTION

Release 2020_01 of 26-Feb-2020 of UniProtKB/TrEMBL contains 177754527 sequence entries,
comprising 59974041839 amino acids.

2559360 sequences have been added since release 2019_11, the sequence data of
371 existing entries has been updated and the annotations of
35839156 entries have been revised. This represents an increase of 2%.

Number of fragments: 16988198

Protein existence (PE):              entries      %
1: Evidence at protein level          155939     0.09%
2: Evidence at transcript level      1304588     0.73%
3: Inferred from homology           45872701    25.81%
4: Predicted                       130421299    73.37%
5: Uncertain                               0     0.00%
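
The percentages in the table above follow directly from the entry counts.
A minimal Python sketch, assuming the counts are copied by hand from the
table (the variable names are illustrative and not part of any UniProt tool):

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 155939,
    "2: Evidence at transcript level": 1304588,
    "3: Inferred from homology": 45872701,
    "4: Predicted": 130421299,
    "5: Uncertain": 0,
}

# The PE levels partition the database, so their sum is the entry total.
total_entries = sum(pe_counts.values())

# Share of each PE level as a percentage, rounded to two decimals.
pe_percent = {
    level: round(100 * count / total_entries, 2)
    for level, count in pe_counts.items()
}

for level, pct in pe_percent.items():
    print(f"{level:<34}{pe_counts[level]:>10}{pct:>9.2f}%")
```

The sum of the five counts equals the 177754527 entries reported at the top
of this section.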

The growth of the database is summarized below.



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 1196468

   The twenty most represented species account for 17140044 sequences, or 9.6%
   of the total number of entries.


   2.1  Table of the frequency of occurrence of species

        Species represented 1x: 677414
                            2x: 131050
                            3x:  69383
                            4x:  49075
                            5x:  30400
                            6x:  21989
                            7x:  16500
                            8x:  12900
                            9x:  10489
                           10x:  15874
                       11- 20x:  81808
                       21- 50x:  24611
                       51-100x:  16265
                         >100x:  38710
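
As a consistency check, the frequency classes above can be summed and
compared with the total number of species given at the start of this section.
A small Python sketch, with the counts copied from table 2.1 (names are
illustrative):

```python
# Number of species per frequency class, copied from table 2.1.
species_by_frequency = {
    "1x": 677414,
    "2x": 131050,
    "3x": 69383,
    "4x": 49075,
    "5x": 30400,
    "6x": 21989,
    "7x": 16500,
    "8x": 12900,
    "9x": 10489,
    "10x": 15874,
    "11-20x": 81808,
    "21-50x": 24611,
    "51-100x": 16265,
    ">100x": 38710,
}

# Every species falls in exactly one class, so the classes sum to the total.
total_species = sum(species_by_frequency.values())

# Share of species represented by a single sequence, in percent.
singleton_share = 100 * species_by_frequency["1x"] / total_species

print(total_species)
print(f"{singleton_share:.1f}% of species are represented by one sequence")
```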


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1    1113123  Chernetidae sp. UAIC


   2.3  Taxonomic distribution of the sequences


   Kingdom        sequences (% of the database)
    Archaea         3949203 (  2%)
    Bacteria      126798814 ( 71%)
    Eukaryota      40977693 ( 23%)
    Viruses         4343839 (  2%)
    Other           1684978 ( <1%)



   Within Eukaryota:


    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 168088 (  0%)           (  0%)
     Other Mammalia       3239247 (  8%)           (  2%)
     Other Vertebrata     4832871 ( 12%)           (  3%)
     Viridiplantae        9536718 ( 23%)           (  5%)
     Fungi               11628189 ( 28%)           (  7%)
     Insecta              3830857 (  9%)           (  2%)
     Nematoda             1706849 (  4%)           (  1%)
     Other                6034874 ( 15%)           (  3%)
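
The two percentage columns of the Eukaryota table use different
denominators: the Eukaryota total and the total number of entries. The
Python sketch below recomputes both, assuming the percentages are rounded to
the nearest integer (an assumption that matches the values printed above; the
helper name `percent` is illustrative):

```python
# Totals taken from this release note.
total_entries = 177754527
eukaryota_entries = 40977693

# Sequence counts per category, copied from the "Within Eukaryota" table.
eukaryota_categories = {
    "Human": 168088,
    "Other Mammalia": 3239247,
    "Other Vertebrata": 4832871,
    "Viridiplantae": 9536718,
    "Fungi": 11628189,
    "Insecta": 3830857,
    "Nematoda": 1706849,
    "Other": 6034874,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# The categories partition Eukaryota, so they sum to the Eukaryota total.
categories_total = sum(eukaryota_categories.values())

for name, count in eukaryota_categories.items():
    of_eukaryota = percent(count, eukaryota_entries)
    of_database = percent(count, total_entries)
    print(f"{name:<18}{count:>10} ({of_eukaryota:>3}%) ({of_database:>3}%)")
```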



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  2061893             1001-1100   1243209
                 51- 100 14565399             1101-1200    861409
                101- 150 17774427             1201-1300    577627
                151- 200 17199072             1301-1400    387073
                201- 250 17105709             1401-1500    299289
                251- 300 17071480             1501-1600    220784
                301- 350 15611199             1601-1700    165389
                351- 400 12103732             1701-1800    127152
                401- 450 10292368             1801-1900    110486
                451- 500  8234808             1901-2000     92399
                501- 550  5746760             2001-2100     72823
                551- 600  4334623             2101-2200     67870
                601- 650  3209475             2201-2300     54517
                651- 700  2532879             2301-2400     43775
                701- 750  2139711             2401-2500     38013
                751- 800  1816452             >2500        282872
                801- 850  1436046
                851- 900  1223434
                901- 950   939364
                951-1000   722811



   The average sequence length in UniProtKB/TrEMBL is   337 amino acids.

   The shortest sequence is A0A1B0GX77_HUMAN:     7 amino acids.
   The longest sequence is  A0A5A9P0L4_9TELE: 45354 amino acids.
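
The figures in this section can be cross-checked against the totals in the
introduction. The Python sketch below makes two assumptions that are not
stated in this document but are consistent with its numbers: the size table
covers every non-fragment entry, and the average length is the total number
of amino acids divided by the total number of entries. All names are
illustrative.

```python
# Totals from the introduction of this release note.
total_entries = 177754527
total_amino_acids = 59974041839
fragments = 16988198

# (size range, number of non-fragment sequences), in ascending size order,
# copied from the table above.
size_bins = [
    ("1-50", 2061893), ("51-100", 14565399), ("101-150", 17774427),
    ("151-200", 17199072), ("201-250", 17105709), ("251-300", 17071480),
    ("301-350", 15611199), ("351-400", 12103732), ("401-450", 10292368),
    ("451-500", 8234808), ("501-550", 5746760), ("551-600", 4334623),
    ("601-650", 3209475), ("651-700", 2532879), ("701-750", 2139711),
    ("751-800", 1816452), ("801-850", 1436046), ("851-900", 1223434),
    ("901-950", 939364), ("951-1000", 722811), ("1001-1100", 1243209),
    ("1101-1200", 861409), ("1201-1300", 577627), ("1301-1400", 387073),
    ("1401-1500", 299289), ("1501-1600", 220784), ("1601-1700", 165389),
    ("1701-1800", 127152), ("1801-1900", 110486), ("1901-2000", 92399),
    ("2001-2100", 72823), ("2101-2200", 67870), ("2201-2300", 54517),
    ("2301-2400", 43775), ("2401-2500", 38013), (">2500", 282872),
]

non_fragment_total = sum(count for _, count in size_bins)


def median_bin(bins):
    """Return the label of the bin that contains the median sequence."""
    half = sum(count for _, count in bins) / 2
    running = 0
    for label, count in bins:
        running += count
        if running >= half:
            return label
    raise ValueError("empty histogram")


average_length = total_amino_acids / total_entries

print(non_fragment_total == total_entries - fragments)
print(median_bin(size_bins))
print(round(average_length))
```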



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   208333207                1.17                                                    
   Submitted to EMBL/GenBank/DDBJ 141578276 128264042      0.80                                                    
   Journal                         58321034  55076718      0.33                                                    
   Submitted to other databases     8395926   8370809      0.05                                                    
   Book citation                      22111     22044     <0.01                                                    
   Thesis                             15860     15800     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 782261
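
The "Average per entry" column in the tables of this section is consistent
with dividing the total number of lines by the total number of entries in the
release (177754527). A Python sketch of that calculation follows; the rule
that averages rounding to 0.00 are printed as "<0.01" is inferred from the
tables and is an assumption, as is the function name `format_average`.

```python
# Total number of entries in this release, from the introduction.
TOTAL_ENTRIES = 177754527


def format_average(total_lines, total_entries=TOTAL_ENTRIES):
    """Format lines-per-entry the way the tables in this section print it."""
    average = total_lines / total_entries
    # Assumption: values that would round to 0.00 are shown as "<0.01".
    if average < 0.005:
        return "<0.01"
    return f"{average:.2f}"


# Line totals copied from the References (RL) table.
reference_totals = {
    "References (RL)": 208333207,
    "Submitted to EMBL/GenBank/DDBJ": 141578276,
    "Journal": 58321034,
    "Submitted to other databases": 8395926,
    "Book citation": 22111,
    "Thesis": 15860,
}

for line_type, total in reference_totals.items():
    print(f"{line_type:<32}{total:>10}  {format_average(total):>6}")
```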


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     264619329                1.49                                                    
   ACTIVITY REGULATION               500900    490779     <0.01    11                                              
   CATALYTIC ACTIVITY              22123722  19610749      0.12     4                                              
   CAUTION                        112061794 109481254      0.63     1                                              
   COFACTOR                        10915191   9905120      0.06     8                                              
   DOMAIN                           1644009   1281344      0.01     9                                              
   FUNCTION                        25125478  23857969      0.14     3                                              
   INTERACTION                         3471      3471     <0.01    12                                              
   MISCELLANEOUS                     975862    883224      0.01    10                                              
   PATHWAY                         10931291   9862126      0.06     7                                              
   SIMILARITY                      46121481  45508795      0.26     2                                              
   SUBCELLULAR LOCATION            21278190  21134074      0.12     5                                              
   SUBUNIT                         12937940  12789731      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     566248245                3.19                                                    
   ACT_SITE                        12391039   7596957      0.07    11                                              
   BINDING                         25601934   6655500      0.14     5                                              
   CARBOHYD                           34809     30913     <0.01    25                                              
   CHAIN                           13844297  13658885      0.08     9                                              
   COILED                          22142688  15292058      0.12     7                                              
   COMPBIAS                        45386158  20391823      0.26     4                                              
   CROSSLNK                           54896     50822     <0.01    24                                              
   DISULFID                         2868109    840529      0.02    16                                              
   DNA_BIND                         1436268   1415590      0.01    18                                              
   DOMAIN                         131196551  94723339      0.74     2                                              
   INIT_MET                           72502     72501     <0.01    22                                              
   INTRAMEM                            1623      1377     <0.01    27                                              
   LIPID                             411674    237426     <0.01    21                                              
   METAL                           21403710   5541494      0.12     8                                              
   MOD_RES                          3770521   3344113      0.02    14                                              
   MOTIF                            2113733   1448064      0.01    17                                              
   NON_STD                            10996     10765     <0.01    26                                              
   NON_TER                         24054707  17022462      0.14     6                                              
   NP_BIND                         10555113   6641141      0.06    12                                              
   PEPTIDE                             1088       797     <0.01    28                                              
   PROPEP                             61231     61231     <0.01    23                                              
   REGION                          64117735  38663075      0.36     3                                              
   REPEAT                           7042788   1670322      0.04    13                                              
   SIGNAL                          13408813  13408802      0.08    10                                              
   SITE                             3259362   1990071      0.02    15                                              
   TOPO_DOM                          425376    190569     <0.01    20                                              
   TRANSIT                              167       167     <0.01    29                                              
   TRANSMEM                       159989787  34988914      0.90     1                                              
   ZN_FING                           590570    459838     <0.01    19                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             2124324218               11.95                                                    
   ABCD                                 249       249     <0.01   113   Protocols and materials databases          
   Allergome                           3864      3137     <0.01    88   Protein family/group databases             
   ArachnoServer                        199       199     <0.01   117   Organism-specific databases                
   Araport                            15049     14983     <0.01    79   Organism-specific databases                
   BRENDA                              9522      9241     <0.01    82   Enzyme and pathway databases               
   Bgee                              492866    492303     <0.01    45   Gene expression databases                  
   BindingDB                            551       551     <0.01   108   Chemistry                                  
   BioCyc                          15275811  14695585      0.09    22   Enzyme and pathway databases               
   BioMuta                              990       990     <0.01   104   Polymorphism and mutation databases        
   CAZy                              129128    120840     <0.01    55   Protein family/group databases             
   CDD                             29697940  26420967      0.17    14   Family and domain databases                
   CGD                                20793     20727     <0.01    77   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   133   2D gel databases                           
   CORUM                                229       229     <0.01   115   Protein-protein interaction databases      
   CPTAC                                 22        15     <0.01   128   Proteomic databases                        
   CTD                              1347164   1345485      0.01    37   Organism-specific databases                
   CarbonylDB                           229       229     <0.01   114   PTM databases                              
   ChEMBL                              1006      1003     <0.01   103   Chemistry                                  
   ChiTaRS                           173785    173783     <0.01    51   Other                                      
   CollecTF                             191       191     <0.01   119   Gene expression databases                  
   ComplexPortal                        219       163     <0.01   116   Protein-protein interaction databases      
   ConoServer                           157       157     <0.01   122   Organism-specific databases                
   DIP                                 3156      3155     <0.01    91   Protein-protein interaction databases      
   DNASU                              41195     40756     <0.01    70   Protocols and materials databases          
   DisProt                              181       181     <0.01   121   Family and domain databases                
   DrugBank                             751       445     <0.01   106   Chemistry                                  
   DrugCentral                          184       184     <0.01   120   Chemistry                                  
   ELM                                   93        93     <0.01   123   Protein-protein interaction databases      
   EMBL                           210461563 171869624      1.18     3   Sequence databases                         
   EPD                                11941     11941     <0.01    80   Proteomic databases                        
   ESTHER                             77356     77042     <0.01    60   Protein family/group databases             
   Ensembl                          3883499   3746174      0.02    30   Genome annotation databases                
   EnsemblBacteria                 35538010  33518473      0.20    13   Genome annotation databases                
   EnsemblFungi                     5912545   5770181      0.03    29   Genome annotation databases                
   EnsemblMetazoa                   1112696   1072446      0.01    39   Genome annotation databases                
   EnsemblPlants                    2851041   2607466      0.02    32   Genome annotation databases                
   EnsemblProtists                  1687531   1599971      0.01    36   Genome annotation databases                
   EuPathDB                          720621    720174     <0.01    42   Organism-specific databases                
   EvolutionaryTrace                   5888      5888     <0.01    85   Other                                      
   ExpressionAtlas                   799113    799113     <0.01    41   Gene expression databases                  
   FlyBase                            90740     90318     <0.01    58   Organism-specific databases                
   GO                             311794021 117656173      1.75     2   Ontologies                                 
   Gene3D                          87452926  71757360      0.49     8   Family and domain databases                
   GeneCards                           1298      1282     <0.01   101   Organism-specific databases                
   GeneDB                            105021    103363     <0.01    56   Genome annotation databases                
   GeneID                          11370505  11258794      0.06    25   Genome annotation databases                
   GeneTree                         3233417   3233173      0.02    31   Phylogenomic databases                     
   Genevisible                        15533     15532     <0.01    78   Gene expression databases                  
   GenomeRNAi                         31892     31892     <0.01    74   Other                                      
   GlyConnect                            45        45     <0.01   127   PTM databases                              
   Gramene                          2819891   2575951      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   132   Chemistry                                  
   HAMAP                           19796335  19549720      0.11    16   Family and domain databases                
   HGNC                               53715     53613     <0.01    66   Organism-specific databases                
   HOGENOM                         17659926  17659411      0.10    19   Phylogenomic databases                     
   InParanoid                       2200668   2200668      0.01    34   Phylogenomic databases                     
   IntAct                             28024     27888     <0.01    75   Protein-protein interaction databases      
   InterPro                       474279642 142306655      2.67     1   Family and domain databases                
   KEGG                            18163485  17704199      0.10    18   Genome annotation databases                
   KO                               8205484   8172867      0.05    27   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    96   Organism-specific databases                
   Leproma                             1271      1269     <0.01   102   Organism-specific databases                
   MEROPS                            230923    230919     <0.01    49   Protein family/group databases             
   MGI                                63702     63285     <0.01    62   Organism-specific databases                
   MINT                                2541      2541     <0.01    95   Protein-protein interaction databases      
   MalaCards                              6         6     <0.01   131   Organism-specific databases                
   MaxQB                              37337     37337     <0.01    71   Proteomic databases                        
   MoonDB                                 1         1     <0.01   137   Protein family/group databases             
   MoonProt                              58        58     <0.01   126   Protein family/group databases             
   NIAGADS                              262       262     <0.01   112   Organism-specific databases                
   OGP                                    3         3     <0.01   134   2D gel databases                           
   OMA                              7868213   7867708      0.04    28   Phylogenomic databases                     
   OpenTargets                        51616     51566     <0.01    67   Organism-specific databases                
   OrthoDB                         19040921  19040712      0.11    17   Phylogenomic databases                     
   PANTHER                         39473772  38309105      0.22    11   Family and domain databases                
   PATRIC                          15446566  15430182      0.09    21   Genome annotation databases                
   PDB                                46761     20787     <0.01    68   3D structure databases                     
   PDBsum                             46243     20596     <0.01    69   3D structure databases                     
   PIR                               162212    129990     <0.01    52   Sequence databases                         
   PIRSF                           15644888  15510100      0.09    20   Family and domain databases                
   PRIDE                             356213    356213     <0.01    47   Proteomic databases                        
   PRINTS                          22956004  20762180      0.13    15   Family and domain databases                
   PRO                                 2275      2275     <0.01    98   Other                                      
   PROSITE                         90480038  60486327      0.51     7   Family and domain databases                
   PaxDb                             262009    262009     <0.01    48   Proteomic databases                        
   PeptideAtlas                      132949    132949     <0.01    54   Proteomic databases                        
   PeroxiBase                          2729      2712     <0.01    93   Protein family/group databases             
   Pfam                           182780950 130905923      1.03     4   Family and domain databases                
   PharmGKB                            3114      3114     <0.01    92   Organism-specific databases                
   PhosphoSitePlus                     2168      2168     <0.01    99   PTM databases                              
   PhylomeDB                         456237    456237     <0.01    46   Phylogenomic databases                     
   PlantReactome                       2321      1466     <0.01    97   Enzyme and pathway databases               
   PomBase                                2         2     <0.01   135   Organism-specific databases                
   ProMEX                              1944      1944     <0.01   100   Proteomic databases                        
   Proteomes                      169696403 156836878      0.95     5   Other                                      
   ProteomicsDB                       35533     35475     <0.01    72   Proteomic databases                        
   PseudoCAP                           4409      4405     <0.01    87   Organism-specific databases                
   REBASE                             81718     78594     <0.01    59   Protein family/group databases             
   REPRODUCTION-2DPAGE                   62        61     <0.01   125   2D gel databases                           
   RGD                                21588     20675     <0.01    76   Organism-specific databases                
   RNAct                               2562      2562     <0.01    94   Other                                      
   Reactome                          144919     49626     <0.01    53   Enzyme and pathway databases               
   RefSeq                          48572256  47323517      0.27     9   Sequence databases                         
   SABIO-RK                             599       599     <0.01   107   Enzyme and pathway databases               
   SFLD                             1140381    886296      0.01    38   Family and domain databases                
   SGD                                    7         7     <0.01   130   Organism-specific databases                
   SIGNOR                                 1         1     <0.01   136   Enzyme and pathway databases               
   SMART                           43291309  32764337      0.24    10   Family and domain databases                
   SMR                              1722727   1722727      0.01    35   3D structure databases                     
   STRING                          12435495  12434991      0.07    24   Protein-protein interaction databases      
   SUPFAM                         119392804  94441136      0.67     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   138   2D gel databases                           
   SignaLink                           3737      3737     <0.01    89   Enzyme and pathway databases               
   SwissLipids                           67        67     <0.01   124   Chemistry                                  
   SwissPalm                           3535      3535     <0.01    90   PTM databases                              
   TAIR                               11738     11677     <0.01    81   Organism-specific databases                
   TCDB                                8427      8415     <0.01    83   Protein family/group databases             
   TIGRFAMs                        37967722  34910453      0.21    12   Family and domain databases                
   TopDownProteomics                    274       274     <0.01   111   Proteomic databases                        
   TreeFam                           531439    531034     <0.01    44   Phylogenomic databases                     
   TubercuList                          982       981     <0.01   105   Organism-specific databases                
   UCSC                               91963     91743     <0.01    57   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   129   PTM databases                              
   UniLectin                            191       191     <0.01   118   Protein family/group databases             
   UniPathway                      10626918   9825529      0.06    26   Enzyme and pathway databases               
   VGNC                              184551    184512     <0.01    50   Organism-specific databases                
   VectorBase                        595799    577012     <0.01    43   Genome annotation databases                
   WBParaSite                        880979    870062     <0.01    40   Genome annotation databases                
   World-2DPAGE                         314       309     <0.01   110   2D gel databases                           
   WormBase                           62674     62298     <0.01    64   Organism-specific databases                
   Xenbase                            63361     53926     <0.01    63   Organism-specific databases                
   ZFIN                               54144     54020     <0.01    65   Organism-specific databases                
   dictyBase                           7986      7764     <0.01    84   Organism-specific databases                
   eggNOG                          13430158   6730422      0.08    23   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    61   Organism-specific databases                
   iPTMnet                             4889      4889     <0.01    86   PTM databases                              
   jPOST                              35216     35216     <0.01    73   Proteomic databases                        
   mycoCLAP                             447       447     <0.01   109   Protein family/group databases             

Number of explicitly cross-referenced databases: 157
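
For readers who want to process the cross-reference table programmatically,
the following Python sketch parses one row into named fields. It assumes the
whitespace-separated layout shown above, with a database name that contains
no spaces; the names `CrossRefRow`, `ROW_PATTERN` and `parse_row` are
hypothetical and not part of any UniProt tool.

```python
import re
from typing import NamedTuple, Optional


class CrossRefRow(NamedTuple):
    database: str
    total: int
    entries: int
    average: str  # kept as text because some values are "<0.01"
    rank: int
    category: str


# database, total number, number of entries, average, rank, category
ROW_PATTERN = re.compile(
    r"^\s*(?P<database>\S+)\s+(?P<total>\d+)\s+(?P<entries>\d+)\s+"
    r"(?P<average><?\d+\.\d+)\s+(?P<rank>\d+)\s+(?P<category>.+?)\s*$"
)


def parse_row(line: str) -> Optional[CrossRefRow]:
    """Parse one table row; return None for headers and summary lines."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    return CrossRefRow(
        database=match["database"],
        total=int(match["total"]),
        entries=int(match["entries"]),
        average=match["average"],
        rank=int(match["rank"]),
        category=match["category"],
    )


row = parse_row(
    "   BioCyc                          15275811  14695585      0.09"
    "    22   Enzyme and pathway databases               "
)
print(row)
```

Lines that do not have all six columns, such as the "Cross-references (DR)"
summary line, do not match the pattern and yield None.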


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.26   Gln (Q) 3.75   Leu (L) 9.91   Ser (S) 6.63
   Arg (R) 5.80   Glu (E) 6.16   Lys (K) 4.88   Thr (T) 5.55
   Asn (N) 3.80   Gly (G) 7.36   Met (M) 2.36   Trp (W) 1.30
   Asp (D) 5.49   His (H) 2.19   Phe (F) 3.91   Tyr (Y) 2.90
   Cys (C) 1.18   Ile (I) 5.64   Pro (P) 4.88   Val (V) 6.93

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.02


   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Arg, Ile, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
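
The ordering above can be reproduced by sorting the composition values of
section 5.1 in descending order. A short Python sketch, with the percentages
copied from that table; Lys and Pro are tied at 4.88, and the stable sort
keeps them in their input (alphabetical) order, which is an incidental choice
rather than something this document specifies.

```python
# Composition in percent, copied from section 5.1 (alphabetical by name).
composition = [
    ("Ala", 9.26), ("Arg", 5.80), ("Asn", 3.80), ("Asp", 5.49),
    ("Cys", 1.18), ("Gln", 3.75), ("Glu", 6.16), ("Gly", 7.36),
    ("His", 2.19), ("Ile", 5.64), ("Leu", 9.91), ("Lys", 4.88),
    ("Met", 2.36), ("Phe", 3.91), ("Pro", 4.88), ("Ser", 6.63),
    ("Thr", 5.55), ("Trp", 1.30), ("Tyr", 2.90), ("Val", 6.93),
]

# Sort by percentage, highest first. Python's sort is stable, so residues
# with equal percentages (Lys and Pro) keep their relative input order.
ranking = [
    name
    for name, _ in sorted(composition, key=lambda item: item[1], reverse=True)
]

print(", ".join(ranking))
```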



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 2268515
Total number of entries encoded on a Plasmid: 1140896
Total number of entries encoded on a Plastid: 167211
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 62
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: