UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2020_06 STATISTICS


1.  INTRODUCTION

Release 2020_06 of 02-Dec-2020 of UniProtKB/TrEMBL contains 209157139 sequence entries,
comprising 71325856333 amino acids.

14469528 sequences have been added since release 2020_05, the sequence data of
52421 existing entries has been updated and the annotations of
48539594 entries have been revised. This represents an increase of 8%.

Number of fragments: 25729559

Protein existence (PE):              entries      %
1: Evidence at protein level          172052     0.08%
2: Evidence at transcript level      1344276     0.64%
3: Inferred from homology           64315725    30.75%
4: Predicted                       143325086    68.53%
5: Uncertain                               0     0.00%
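
The percentages in the protein existence table can be recomputed from the
entry counts. The following is a minimal Python sketch, assuming each
percentage is the count divided by the sum of all five counts and rounded to
two decimals. The names pe_counts and pe_percentages are illustrative and not
part of any UniProt tool.

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 172052,
    "2: Evidence at transcript level": 1344276,
    "3: Inferred from homology": 64315725,
    "4: Predicted": 143325086,
    "5: Uncertain": 0,
}


def pe_percentages(counts):
    """Return each PE level's share of all entries, rounded to 2 decimals."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 2) for level, n in counts.items()}


total_entries = sum(pe_counts.values())
pe_share = pe_percentages(pe_counts)

for level, share in pe_share.items():
    print(f"{level:<34}{pe_counts[level]:>10}{share:>9.2f}%")
```

The five counts add up to the number of sequence entries stated at the top of
this section, so the table accounts for every entry in the release.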

The growth of the database is summarized below.



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 1233899

   The first twenty species represent 18924436 sequences, or 9% of the
   total number of entries.


   2.1 Table of the frequency of occurrence of species

        Species represented 1x: 695609
                            2x: 136096
                            3x:  71482
                            4x:  50341
                            5x:  31160
                            6x:  22594
                            7x:  16983
                            8x:  13340
                            9x:  10796
                           10x:  16131
                       11- 20x:  83514
                       21- 50x:  25584
                       51-100x:  18420
                         >100x:  41849


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1    1501023  Escherichia coli
       2    1113123  Chernetidae sp. UAIC


   2.3  Taxonomic distribution of the sequences


   Kingdom        sequences (% of the database)
    Archaea         4675462 (  2%)
    Bacteria      140981583 ( 67%)
    Eukaryota      56313581 ( 27%)
    Viruses         4761529 (  2%)
    Other           2424984 (  1%)
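
   As a quick consistency check, the kingdom counts can be summed and
   compared with the release total, and each share recomputed. This is a
   minimal Python sketch, assuming the percentages are taken relative to the
   total number of entries given in the Introduction. The variable names are
   illustrative only.

```python
# Sequence counts per kingdom, copied from the table above.
kingdom_counts = {
    "Archaea": 4675462,
    "Bacteria": 140981583,
    "Eukaryota": 56313581,
    "Viruses": 4761529,
    "Other": 2424984,
}
TOTAL_ENTRIES = 209157139  # release total stated in the Introduction

kingdom_total = sum(kingdom_counts.values())
kingdom_share = {k: 100 * n / TOTAL_ENTRIES for k, n in kingdom_counts.items()}

print("Counts add up to release total:", kingdom_total == TOTAL_ENTRIES)
for kingdom, share in kingdom_share.items():
    print(f"{kingdom:<10}{kingdom_counts[kingdom]:>12}  {share:5.2f}%")
```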



   Within Eukaryota:


    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 173942 ( <1%)           ( <1%)
     Other Mammalia       4091468 (  7%)           (  2%)
     Other Vertebrata     6706371 ( 12%)           (  3%)
     Viridiplantae       11676509 ( 21%)           (  6%)
     Fungi               12550674 ( 22%)           (  6%)
     Insecta              4674453 (  8%)           (  2%)
     Nematoda             1818213 (  3%)           (  1%)
     Other               14621951 ( 26%)           (  7%)



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  2259633             1001-1100   1492274
                 51- 100 16262395             1101-1200   1045610
                101- 150 20062809             1201-1300    716486
                151- 200 19458723             1301-1400    480807
                201- 250 19358897             1401-1500    373132
                251- 300 19304535             1501-1600    277785
                301- 350 17724772             1601-1700    209893
                351- 400 13823783             1701-1800    163285
                401- 450 11778836             1801-1900    142880
                451- 500  9456651             1901-2000    122539
                501- 550  6647418             2001-2100     95590
                551- 600  5026027             2101-2200     86334
                601- 650  3741871             2201-2300     70485
                651- 700  2975578             2301-2400     56458
                701- 750  2517223             2401-2500     49277
                751- 800  2127425             >2500        377025
                801- 850  1699958
                851- 900  1450643
                901- 950  1121280
                951-1000   869263



   The average sequence length in UniProtKB/TrEMBL is 341 amino acids.

   The shortest sequence is A0A0G2JLJ8_HUMAN:     7 amino acids.
   The longest sequence is  A0A5A9P0L4_9TELE: 45354 amino acids.
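
   The average length quoted above is consistent with dividing the total
   number of amino acids by the total number of entries, both taken from the
   Introduction. A minimal Python sketch of that arithmetic, assuming the
   average is computed over all entries (fragments included):

```python
total_amino_acids = 71325856333  # from the Introduction
total_entries = 209157139        # from the Introduction

# Mean sequence length over every entry in the release.
average_length = total_amino_acids / total_entries

print(f"Average sequence length: {average_length:.0f} amino acids")
```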



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   246684427                1.18                                                    
   Submitted to EMBL/GenBank/DDBJ 168720947 151864849      0.81                                                    
   Journal                         65826748  62253579      0.31                                                    
   Submitted to other databases    12098841  12068778      0.06                                                    
   Book citation                      22111     22044     <0.01                                                    
   Thesis                             15780     15720     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 832815
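
The "Average per entry" column in the tables of this section appears to be the
total number of lines divided by the total number of entries in the release
(not by the number of entries that carry such a line). Values that would round
to 0.00 are shown as "<0.01". The following Python sketch reproduces that
convention under those assumptions. The function name average_per_entry is
hypothetical.

```python
TOTAL_ENTRIES = 209157139  # release total stated in the Introduction


def average_per_entry(total_lines, total_entries=TOTAL_ENTRIES):
    """Format lines-per-entry as in the tables: two decimals, or '<0.01'."""
    text = f"{total_lines / total_entries:.2f}"
    # Non-zero counts that round down to 0.00 are reported as "<0.01".
    if text == "0.00" and total_lines > 0:
        return "<0.01"
    return text


# A few rows taken from the tables in this section.
for name, total in [
    ("References (RL)", 246684427),
    ("Journal", 65826748),
    ("Book citation", 22111),
]:
    print(f"{name:<20}{total:>12}  {average_per_entry(total):>6}")
```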


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     335812016                1.61                                                    
   ACTIVITY REGULATION               486867    486865     <0.01    11                                              
   CATALYTIC ACTIVITY              29858644  25619157      0.14     4                                              
   CAUTION                        124360972 121538961      0.59     1                                              
   COFACTOR                        15934936  14153853      0.08     7                                              
   DOMAIN                           1863937   1451799      0.01     9                                              
   FUNCTION                        29147698  27598976      0.14     5                                              
   INTERACTION                         3816      3816     <0.01    12                                              
   MISCELLANEOUS                    1051283    949534      0.01    10                                              
   PATHWAY                         14177197  12846932      0.07     8                                              
   SIMILARITY                      65980352  64442819      0.32     2                                              
   SUBCELLULAR LOCATION            36509542  36356832      0.17     3                                              
   SUBUNIT                         16436772  16107693      0.08     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     643858836                3.08                                                    
   ACT_SITE                        13625487   8355859      0.07    11                                              
   BINDING                         28043922   7371986      0.13     6                                              
   CARBOHYD                           42752     34869     <0.01    25                                              
   CHAIN                           15396250  15196555      0.07     9                                              
   COILED                          25234947  17197311      0.12     7                                              
   COMPBIAS                        52399411  23232280      0.25     4                                              
   CROSSLNK                           62016     57190     <0.01    24                                              
   DISULFID                         3425098    952369      0.02    16                                              
   DNA_BIND                         1572130   1547464      0.01    18                                              
   DOMAIN                         147937087 105528535      0.71     2                                              
   INIT_MET                           79183     79182     <0.01    22                                              
   INTRAMEM                            1853      1607     <0.01    27                                              
   LIPID                             433313    250645     <0.01    21                                              
   METAL                           23388849   6030528      0.11     8                                              
   MOD_RES                          4065120   3599597      0.02    14                                              
   MOTIF                            2252500   1543159      0.01    17                                              
   NON_STD                            14421     14174     <0.01    26                                              
   NON_TER                         36215674  25813798      0.17     5                                              
   NP_BIND                         11593461   7214135      0.06    12                                              
   PEPTIDE                             1173       860     <0.01    28                                              
   PROPEP                             69610     69610     <0.01    23                                              
   REGION                          73151256  43687997      0.35     3                                              
   REPEAT                           8091904   1896303      0.04    13                                              
   SIGNAL                          14924705  14924694      0.07    10                                              
   SITE                             3502086   2141869      0.02    15                                              
   TOPO_DOM                          453280    202133     <0.01    20                                              
   TRANSIT                              212       212     <0.01    29                                              
   TRANSMEM                       177208737  38742470      0.85     1                                              
   ZN_FING                           672399    522640     <0.01    19                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             2412104560               11.53                                                    
   ABCD                                 454       454     <0.01   111   Protocols and materials databases          
   Allergome                           3865      3152     <0.01    90   Protein family/group databases             
   Antibodypedia                      75088     74996     <0.01    61   Protocols and materials databases          
   ArachnoServer                        198       198     <0.01   121   Organism-specific databases                
   Araport                            26865     26726     <0.01    76   Organism-specific databases                
   BMRB                                 328       328     <0.01   114   3D structure databases                     
   BRENDA                              9474      9194     <0.01    83   Enzyme and pathway databases               
   Bgee                              484960    484659     <0.01    44   Gene expression databases                  
   BindingDB                           2164      2164     <0.01   102   Chemistry databases                        
   BioCyc                           3514423   3493157      0.02    30   Enzyme and pathway databases               
   BioGRID                                1         1     <0.01   147   Protein-protein interaction databases      
   BioGRID-ORCS                       48161     47909     <0.01    71   Miscellaneous databases                    
   BioMuta                              969       969     <0.01   107   Polymorphism and mutation databases        
   CAZy                              128762    120516     <0.01    54   Protein family/group databases             
   CDD                             33690281  29971957      0.16    14   Family and domain databases                
   CGD                                20791     20725     <0.01    78   Organism-specific databases                
   CLAE                                 447       447     <0.01   112   Protein family/group databases             
   COMPLUYEAST-2DPAGE                     4         4     <0.01   142   2D gel databases                           
   CORUM                                228       228     <0.01   119   Protein-protein interaction databases      
   CPTAC                                 23        16     <0.01   133   Proteomic databases                        
   CTD                              1463087   1460854      0.01    37   Organism-specific databases                
   CarbonylDB                           229       229     <0.01   118   PTM databases                              
   ChEMBL                              1085      1081     <0.01   106   Chemistry databases                        
   ChiTaRS                           173484    173482     <0.01    50   Miscellaneous databases                    
   CollecTF                             191       191     <0.01   123   Gene expression databases                  
   ComplexPortal                        227       175     <0.01   120   Protein-protein interaction databases      
   ConoServer                           157       157     <0.01   126   Organism-specific databases                
   DIP                                 3117      3116     <0.01    93   Protein-protein interaction databases      
   DNASU                              40908     40494     <0.01    73   Protocols and materials databases          
   DisProt                              193       193     <0.01   122   Family and domain databases                
   DrugBank                             789       463     <0.01   109   Chemistry databases                        
   DrugCentral                          175       175     <0.01   125   Chemistry databases                        
   ELM                                   91        91     <0.01   128   Protein-protein interaction databases      
   EMBL                           285523065 199577932      1.37     3   Sequence databases                         
   EPD                                12897     12897     <0.01    80   Proteomic databases                        
   ESTHER                             83472     83140     <0.01    58   Protein family/group databases             
   Ensembl                          5098798   4953004      0.02    28   Genome annotation databases                
   EnsemblBacteria                 35198501  33115225      0.17    13   Genome annotation databases                
   EnsemblFungi                     5866801   5726260      0.03    27   Genome annotation databases                
   EnsemblMetazoa                   1601594   1544765      0.01    36   Genome annotation databases                
   EnsemblPlants                    3202069   2941156      0.02    32   Genome annotation databases                
   EnsemblProtists                  1686513   1598953      0.01    35   Genome annotation databases                
   EuPathDB                          769014    768420     <0.01    41   Organism-specific databases                
   EvolutionaryTrace                   5831      5831     <0.01    86   Miscellaneous databases                    
   ExpressionAtlas                   769480    769460     <0.01    40   Gene expression databases                  
   FlyBase                            87810     87403     <0.01    57   Organism-specific databases                
   GO                             363419678 132527964      1.74     2   Ontologies                                 
   Gene3D                         100797410  81591001      0.48     8   Family and domain databases                
   GeneCards                           1320      1308     <0.01   104   Organism-specific databases                
   GeneDB                             94310     92816     <0.01    55   Genome annotation databases                
   GeneID                          12820422  12709451      0.06    24   Genome annotation databases                
   GeneTree                         3341964   3322777      0.02    31   Phylogenomic databases                     
   Genevisible                        15490     15489     <0.01    79   Gene expression databases                  
   GenomeRNAi                         31844     31844     <0.01    74   Miscellaneous databases                    
   GlyConnect                            46        46     <0.01   131   PTM databases                              
   GlyGen                                16        16     <0.01   136   PTM databases                              
   Gramene                          3516313   2913812      0.02    29   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   141   Chemistry databases                        
   HAMAP                           21489035  21218980      0.10    16   Family and domain databases                
   HGNC                               54644     54544     <0.01    66   Organism-specific databases                
   HOGENOM                         17514975  17514517      0.08    19   Phylogenomic databases                     
   IDEAL                                 10        10     <0.01   137   Family and domain databases                
   InParanoid                       2169312   2169312      0.01    34   Phylogenomic databases                     
   IntAct                             27595     27595     <0.01    75   Protein-protein interaction databases      
   InterPro                       534130712 158368824      2.55     1   Family and domain databases                
   KEGG                            19237702  18729238      0.09    17   Genome annotation databases                
   LegioList                           2496      2483     <0.01    97   Organism-specific databases                
   Leproma                             1271      1269     <0.01   105   Organism-specific databases                
   MEROPS                            229265    229261     <0.01    48   Protein family/group databases             
   MGI                                63827     63413     <0.01    63   Organism-specific databases                
   MINT                                2462      2462     <0.01    99   Protein-protein interaction databases      
   MalaCards                              6         6     <0.01   139   Organism-specific databases                
   MaxQB                              42783     42783     <0.01    72   Proteomic databases                        
   MetOSite                             336       336     <0.01   113   PTM databases                              
   MoonDB                                 1         1     <0.01   146   Protein family/group databases             
   MoonProt                              56        56     <0.01   130   Protein family/group databases             
   NIAGADS                              261       261     <0.01   117   Organism-specific databases                
   OGP                                    3         3     <0.01   143   2D gel databases                           
   OMA                              8138026   8137909      0.04    26   Phylogenomic databases                     
   OpenTargets                        51740     51691     <0.01    68   Organism-specific databases                
   OrthoDB                         18628157  18628013      0.09    18   Phylogenomic databases                     
   PANTHER                         44962859  43383390      0.21    11   Family and domain databases                
   PATRIC                          15220524  15203268      0.07    21   Genome annotation databases                
   PCDDB                                 17        17     <0.01   135   3D structure databases                     
   PDB                                51000     22202     <0.01    69   3D structure databases                     
   PDBsum                             49048     21492     <0.01    70   3D structure databases                     
   PHI-base                            4721      4284     <0.01    88   Miscellaneous databases                    
   PIR                               161956    129736     <0.01    51   Sequence databases                         
   PIRSF                           16781703  16616376      0.08    20   Family and domain databases                
   PRIDE                             358967    358967     <0.01    46   Proteomic databases                        
   PRINTS                          25678216  23163202      0.12    15   Family and domain databases                
   PRO                                 2270      2270     <0.01   101   Miscellaneous databases                    
   PROSITE                        101007386  66793866      0.48     7   Family and domain databases                
   PaxDb                             258978    258978     <0.01    47   Proteomic databases                        
   PeptideAtlas                      140484    140484     <0.01    53   Proteomic databases                        
   PeroxiBase                          2584      2568     <0.01    95   Protein family/group databases             
   Pfam                           204313044 145576818      0.98     4   Family and domain databases                
   PharmGKB                            3111      3111     <0.01    94   Organism-specific databases                
   PhosphoSitePlus                     2152      2152     <0.01   103   PTM databases                              
   PhylomeDB                         443067    443067     <0.01    45   Phylogenomic databases                     
   PlantReactome                       2313      1462     <0.01   100   Enzyme and pathway databases               
   PomBase                                2         2     <0.01   144   Organism-specific databases                
   ProMEX                              2486      2486     <0.01    98   Proteomic databases                        
   Proteomes                      192046562 171824723      0.92     5   Miscellaneous databases                    
   ProteomicsDB                       58801     58734     <0.01    65   Proteomic databases                        
   PseudoCAP                           4379      4375     <0.01    89   Organism-specific databases                
   REBASE                             80676     77528     <0.01    59   Protein family/group databases             
   REPRODUCTION-2DPAGE                   62        61     <0.01   129   2D gel databases                           
   RGD                                21989     21083     <0.01    77   Organism-specific databases                
   RNAct                               2529      2529     <0.01    96   Miscellaneous databases                    
   Reactome                          153921     52624     <0.01    52   Enzyme and pathway databases               
   RefSeq                          53042816  51348142      0.25     9   Sequence databases                         
   SABIO-RK                             704       704     <0.01   110   Enzyme and pathway databases               
   SASBDB                               143       143     <0.01   127   3D structure databases                     
   SFLD                             1284261    987872      0.01    38   Family and domain databases                
   SGD                                    7         7     <0.01   138   Organism-specific databases                
   SIGNOR                                 6         6     <0.01   140   Enzyme and pathway databases               
   SMART                           49030580  36871576      0.23    10   Family and domain databases                
   SMR                              2181436   2181436      0.01    33   3D structure databases                     
   STRING                          12332530  12332390      0.06    25   Protein-protein interaction databases      
   SUPFAM                         135254370 106507014      0.65     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   145   2D gel databases                           
   SignaLink                           3713      3713     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           34        34     <0.01   132   Chemistry databases                        
   SwissPalm                           3604      3604     <0.01    92   PTM databases                              
   TAIR                               11651     11590     <0.01    81   Organism-specific databases                
   TCDB                                8536      8518     <0.01    84   Protein family/group databases             
   TIGRFAMs                        41867784  38499402      0.20    12   Family and domain databases                
   TopDownProteomics                    272       272     <0.01   116   Proteomic databases                        
   TreeFam                           517837    517773     <0.01    43   Phylogenomic databases                     
   TubercuList                          960       959     <0.01   108   Organism-specific databases                
   UCSC                               91387     91156     <0.01    56   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   134   PTM databases                              
   UniLectin                            184       184     <0.01   124   Protein family/group databases             
   UniPathway                      13620581  12617066      0.07    23   Enzyme and pathway databases               
   VGNC                              225043    224983     <0.01    49   Organism-specific databases                
   VectorBase                        585980    554434     <0.01    42   Genome annotation databases                
   WBParaSite                        901959    890661     <0.01    39   Genome annotation databases                
   World-2DPAGE                         312       307     <0.01   115   2D gel databases                           
   WormBase                           62332     61958     <0.01    64   Organism-specific databases                
   Xenbase                            63989     61679     <0.01    62   Organism-specific databases                
   ZFIN                               54113     53963     <0.01    67   Organism-specific databases                
   dictyBase                           7985      7763     <0.01    85   Organism-specific databases                
   eggNOG                          13621150  13163712      0.07    22   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    60   Organism-specific databases                
   iPTMnet                             5277      5277     <0.01    87   PTM databases                              
   jPOST                              11376     11376     <0.01    82   Proteomic databases                        

Number of explicitly cross-referenced databases: 147


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.23   Gln (Q) 3.78   Leu (L) 9.87   Ser (S) 6.72
   Arg (R) 5.83   Glu (E) 6.19   Lys (K) 4.89   Thr (T) 5.54
   Asn (N) 3.78   Gly (G) 7.34   Met (M) 2.35   Trp (W) 1.31
   Asp (D) 5.48   His (H) 2.20   Phe (F) 3.88   Tyr (Y) 2.87
   Cys (C) 1.24   Ile (I) 5.54   Pro (P) 4.93   Val (V) 6.90

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.02




   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Arg, Thr, Ile, Asp, Pro, Lys, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
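
   The ordering above can be derived by sorting the composition table in
   section 5.1 by descending percentage. A minimal Python sketch follows.
   Because the published values are rounded to two decimals, residues with
   equal percentages (Thr/Ile and Asn/Gln) cannot be ordered from these
   numbers alone, so their relative order in the output is not meaningful.

```python
# Composition in percent, copied from section 5.1 (rounded values).
composition = {
    "Ala": 9.23, "Arg": 5.83, "Asn": 3.78, "Asp": 5.48, "Cys": 1.24,
    "Gln": 3.78, "Glu": 6.19, "Gly": 7.34, "His": 2.20, "Ile": 5.54,
    "Leu": 9.87, "Lys": 4.89, "Met": 2.35, "Phe": 3.88, "Pro": 4.93,
    "Ser": 6.72, "Thr": 5.54, "Trp": 1.31, "Tyr": 2.87, "Val": 6.90,
}

# Most frequent residue first; ties keep their dictionary order.
ranked = sorted(composition, key=composition.get, reverse=True)

print(", ".join(ranked))
```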



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 2383848
Total number of entries encoded on a Plasmid: 1344425
Total number of entries encoded on a Plastid: 190691
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 62
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: