News

UniProt release 2015_08

Published July 22, 2015

Headline

Pseudo-allergy, real progress

Do you sniffle and sneeze as trees start to bloom and the pollen gets airborne? Your mast cells are to blame. These cells reside at strategic anatomical positions, such as skin, gastrointestinal tract and lung, and provide us with a first line of defence against potential harm from our environment. Besides their beneficial functions, mast cells can also react to compounds that do not represent any threat to our health, such as pollen. This process begins with the interaction of an antigen with immunoglobulin E (IgE) bound to high affinity Fc epsilon receptors at the mast cell surface. It ends with the release of histamine and various inflammatory and immunomodulatory substances, which causes allergy. Most adverse reactions to peptidergic and small molecule therapeutic agents, collectively called basic secretagogues, also rely on mast cell stimulation, but do not correlate with IgE antibody titer. They proceed through a different, not yet fully understood, IgE-independent mechanism called pseudo-allergy, which eventually also leads to the release of granule-stored histamine. In humans, MRGPRX2 has been proposed, among others, to serve as a receptor for basic secretagogues, but until recently there was no direct proof of its involvement.

Earlier this year, McNeil et al. showed that “basic secretagogues activate mouse mast cells in vitro and in vivo through a single receptor, Mrgprb2, the ortholog of the human G-protein-coupled receptor MRGPRX2”. The first achievement of this study was to prove the orthology of these 2 genes, which was not an easy task. In humans, MRGPRX2 is found in a cluster with 3 other MRGPRX family members. This cluster is dramatically expanded in mouse, with 22 potential protein-coding genes that show comparable sequence identity to MRGPRX2. To establish orthology, the authors used 2 criteria: expression pattern (expression in mast cells) and pharmacology (some 16 compounds were tested for mast cell activation). Then Mrgprb2a knockout mice were created. Gene targeting was performed using a zinc-finger-nuclease-based strategy, as a classical homologous recombination approach was impossible in this genomic locus due to too many repetitive sequences. The null animals showed no visible phenotype in normal conditions, but didn’t produce any pseudo-allergic reaction in response to small-molecule therapeutic drugs. Secretagogue-induced histamine release, inflammation and airway contraction were abolished.

This elegant study does not deal simply with the identification of “just another receptor”. It addresses an issue that may concern all of us at some point in our lives. Basic secretagogues are compounds that are frequently encountered either in natural fluids, such as the wasp venom toxin mastoparan, or in various drugs, such as cationic peptidergic drugs, antibiotics (fluoroquinolone family), neuromuscular blocking agents, etc. These latter are routinely used in surgery to reduce unwanted muscle movement and are responsible for nearly 60% of allergic reactions in a surgical setting. The majority of these compounds activate mast cells in an Mrgprb2-dependent manner. The animal model created by McNeil et al. could then be used for pre-clinical testing of new drugs in order to minimize pseudo-allergic risks. In addition, the identification of a motif common to several Mrgprb2 agonists may allow the prediction of side effects of clinically used compounds.

As of this release, primate MRGPRX2 and mouse Mrgprb2 entries have been updated and are publicly available.

UniProt service news

Programmatic access to UniProt with sparql.uniprot.org

We are happy to announce the public release of the UniProt SPARQL endpoint at sparql.uniprot.org, where you can also find links to the documentation of the UniProt RDF data model and an interactive query interface with sample queries to get you started.

For those unfamiliar with SPARQL, this is a W3C standardized query language for the Semantic Web. If you know SQL, it will look familiar to you and you can do similar types of queries with it. SPARQL also allows you to query and combine data from a variety of SPARQL endpoints, providing a valuable low-cost alternative to building your own data warehouse. You can combine UniProt data from sparql.uniprot.org with that from the SPARQL endpoints hosted by the EBI’s RDF platform, the SIB’s neXtProt SPARQL endpoint, etc.
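As a minimal sketch of what programmatic access might look like, the following Python snippet prepares a SPARQL protocol request with the standard library only. The endpoint path `https://sparql.uniprot.org/sparql`, the sample query and the helper name `build_sparql_request` are assumptions made for illustration; the documentation linked from sparql.uniprot.org remains the authoritative source.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL; see sparql.uniprot.org for the documented address.
ENDPOINT = "https://sparql.uniprot.org/sparql"

# Illustrative query: list a handful of protein resources.
QUERY = """\
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE { ?protein a up:Protein }
LIMIT 5
"""


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
# To run the query, pass `request` to urllib.request.urlopen()
# and decode the JSON body of the response.
```

The request is only constructed here, not sent, so the sketch can be adapted and inspected without network access.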

We look forward to feedback from the community to help us improve this service further.

UniProtKB news

Addition of human somatic protein altering variants from COSMIC

The Catalogue of Somatic Mutations in Cancer (COSMIC) is a database of manually curated somatic variants from peer reviewed publications and genome-wide studies. UniProt, in collaboration with COSMIC, has integrated COSMIC release v71 protein altering variants into the homo_sapiens_variation.txt.gz file. The COSMIC variants provide the standard information found in the homo_sapiens_variation.txt.gz file and, in the Phenotype/Disease field, additional information on the primary tissue(s) in which the variant was found.

Changes to the humdisease.txt file

We have added cross-references to MedGen to the humdisease.txt file. MedGen, the NCBI portal to information about human genetic disorders, aggregates multiple disease names, medical terms and information for the same disorder from various sources into a specific concept. Each MedGen concept has a Concept Unique Identifier (CUI) that allows computational access to global disease information. Together with disease nomenclature, this includes disease definitions, clinical findings, available clinical and research tests, molecular resources, professional guidelines, original and review literature, consumer resources, clinical trials, and Web links to other related resources. MedGen is a valuable resource that gives UniProtKB users access to an extensive range of biomedical data.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Blepharophimosis-ptosis-intellectual disability syndrome
  • Ehlers-Danlos syndrome 2

UniProt release 2015_07

Published June 24, 2015

Headline

Coding-non-coding RNAs: a game of hide-and-seek

It is well-established that microRNAs (miRNAs) are small eukaryotic non-coding RNA molecules that repress the expression of their target genes. miRNAs are transcribed by RNA polymerase II as large primary transcripts (pri-miRNA), that share the same characteristics as all other RNA polymerase II-transcribed RNAs, such as the presence of a 5’-cap and a 3’-poly(A) tail. pri-miRNAs are processed to smaller pre-miRNAs, which in turn are cleaved to produce mature miRNAs. In animals, this final maturation step occurs in the cytoplasm, while in plants it takes place in the nucleus. Cytosolic mature miRNAs guide the RNA-induced silencing complex (RISC) in repressing target genes through either cleavage or translational repression of their mRNAs.

A recent article published in Nature revealed that plant pri-miRNAs may not be as non-coding as previously assumed. Some do actually encode small regulatory peptides, called miPEPs, which enhance the accumulation of their corresponding mature miRNAs. This has been shown for Medicago truncatula pri-miR171b and Arabidopsis thaliana pri-miR165a which encode miPEP171b and miPEP165a, respectively. These two 20- and 18-amino acid-long peptides have been shown to be translated in vivo and to promote the transcription of their pri-miRNAs, resulting in the accumulation of mature miR171b and miR165a. This increase leads to the reduction of lateral root development in the case of miR171b and stimulation of main root growth for miR165a. The same effects were observed when synthetic peptides were applied to plants, suggesting that miPEPs might have agronomical applications.

Five other pri-miRNAs were experimentally shown to encode active miPEPs, suggesting that the presence of such small regulatory peptides may be widespread in plants. Computer analysis of the 5’-end of 50 pri-miRNAs in Arabidopsis thaliana revealed that all of them contained at least one ORF, which, if translated, could give rise to 3- to 59-amino acid-long peptides of unknown biological activity. No common signature was found among them, possibly due to the specificity of each putative miPEP for its own pri-miRNA.

Arabidopsis thaliana miPEP165a, miPEP160b, miPEP164a and miPEP319a and Medicago truncatula miPEP171b peptides have been manually annotated and are integrated into UniProtKB/Swiss-Prot as of this release. The sequences of the other 2 Medicago truncatula functionally characterized peptides, miPEP169d and miPEP171e, are unfortunately not available.

UniProtKB news

Cross-references to ESTHER

Cross-references have been added to ESTHER, a database of the Alpha/Beta-hydrolase fold superfamily of proteins.

ESTHER is available at http://bioweb.ensam.inra.fr/ESTHER/general?what=index.

The format of the explicit links is:

Resource abbreviation ESTHER
Resource identifier Gene locus.
Optional information 1 Family name.

Example: P0C064

Show all entries having a cross-reference to ESTHER.

Text format

Example: P0C064

DR   ESTHER; bacbr-grsb; Thioesterase.
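The explicit link format above (resource abbreviation, resource identifier, optional information) can be read from a text-format line with a few string operations. The following is a minimal sketch; the names `CrossReference` and `parse_dr_line` are hypothetical, and the code assumes that fields are separated by "; " and that the line ends with a period, as in the example.

```python
from typing import List, NamedTuple


class CrossReference(NamedTuple):
    resource: str      # resource abbreviation, e.g. "ESTHER"
    identifier: str    # resource identifier
    extra: List[str]   # optional information fields, possibly empty


def parse_dr_line(line: str) -> CrossReference:
    """Split one 'DR' line of the text format into its fields."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: " + line)
    # Drop the line code, then remove only the terminating period so that
    # periods inside identifiers are preserved.
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = body.split("; ")
    return CrossReference(fields[0], fields[1], fields[2:])


ref = parse_dr_line("DR   ESTHER; bacbr-grsb; Thioesterase.")
print(ref)
```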

XML format

Example: P0C064

<dbReference type="ESTHER" id="bacbr-grsb">
  <property type="family name" value="Thioesterase"/>
</dbReference>

Cross-references to Genevisible

Cross-references have been added to Genevisible, a search portal to normalized and curated expression data from GENEVESTIGATOR.

Genevisible is available at http://genevisible.com/search.

The format of the explicit links is:

Resource abbreviation Genevisible
Resource identifier Gene identifier.
Optional information 1 Organism code.

Example: P31946

Show all entries having a cross-reference to Genevisible.

Text format

Example: P31946

DR   Genevisible; P31946; HS.

XML format

Example: P31946

<dbReference type="Genevisible" id="P31946">
  <property type="organism ID" value="HS"/>
</dbReference>

Removal of the cross-references to Genevestigator

Cross-references to Genevestigator have been removed.

Change of the cross-references to PomBase

Cross-references to PomBase may now optionally indicate a gene designation in order to align them with the format of other model organism databases.

Text format

Example: Q9P3A7

DR   PomBase; SPAC1565.08; cdc48.

Example: O60058

DR   PomBase; SPBC56F2.07c; -.

XML format

Example: Q9P3A7

<dbReference type="PomBase" id="SPAC1565.08">
  <property type="gene designation" value="cdc48"/>
</dbReference>

Example: O60058

<dbReference type="PomBase" id="SPBC56F2.07c"/>

This change did not affect the XSD, but may nevertheless require code changes.
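To illustrate the kind of code change that may be needed, here is a minimal sketch that treats the gene designation as optional in both formats. The function names are hypothetical. The XML helper works on a bare `dbReference` fragment as shown above; elements in a complete UniProtKB XML file carry a namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def gene_designation_from_text(dr_line: str) -> Optional[str]:
    """Return the gene designation from a PomBase DR line, or None for '-'."""
    fields = dr_line[5:].strip().rstrip(".").split("; ")
    # Lines without a third field are treated like the '-' placeholder.
    designation = fields[2] if len(fields) > 2 else "-"
    return None if designation == "-" else designation


def gene_designation_from_xml(db_reference_xml: str) -> Optional[str]:
    """Return the 'gene designation' property value, or None if absent."""
    element = ET.fromstring(db_reference_xml)
    for prop in element.findall("property"):
        if prop.get("type") == "gene designation":
            return prop.get("value")
    return None


print(gene_designation_from_text("DR   PomBase; SPAC1565.08; cdc48."))
print(gene_designation_from_xml('<dbReference type="PomBase" id="SPBC56F2.07c"/>'))
```

Returning `None` in both cases lets calling code handle the text placeholder and the missing XML property in the same way.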

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Hypogonadism LHB-related

Changes to keywords

New keywords:

UniProt release 2015_06

Published May 27, 2015

Headline

POLQ, a new target for cancer therapy?

DNA double-strand breaks (DSBs) are our worst cellular enemy, yet they do occur all the time, often accidentally, as a result of endogenous metabolic reactions and replication stress. They can also be induced by exogenous sources, like radiation or exposure of cells to DNA-damaging agents, or serve as intermediates in a number of programmed recombination events, during meiosis or assembly of immunoglobulins or T-cell receptors. Whatever their origin, DSBs are highly toxic to cells if not repaired, and if repaired incorrectly, they can cause deletions, translocations, and fusions in the DNA, which can have dramatic consequences.

The most frequently used mechanisms for DSB repair are homologous recombination (HR) and non-homologous end-joining (NHEJ), but alternative forms of end-joining exist, such as microhomology-mediated end-joining (MMEJ). HR is highly accurate and therefore important for preserving genome integrity. NHEJ results in small, less than 10 bp deletions. The most error-prone is MMEJ, which promotes inter- and intrachromosome rearrangements associated with relatively large DNA deletions (30-200 bp).

While NHEJ preferentially acts on ‘blunt-ended’ DNA breaks, HR is preceded by resection of DNA around the 5’-ends of the break. RAD51 proteins bind to the resulting 3’ single-stranded overhangs and help them to recognize complementary (homologous) DNA in another intact DNA helix. The overhangs then invade the homologous double-strand and use it as a template for repair. MMEJ also starts with DNA resected ends, but in this case it is DNA polymerase theta (POLQ) that directly binds them and enables short (2-6 bp) homologous DNA sequences in overhangs to form base pairs. The homology can be either terminal, or internal, as far as 5 nucleotides away from the 3’ terminus. Once homology has been found, each DNA strand is extended from the base-paired region using the opposing overhang as a template, and, in case of internal homology, the terminal unpaired regions are removed.

Normal cells tend to down-regulate POLQ. Cancer cells, which exhibit HR deficiency due to mutations in genes involved in HR repair, tend to up-regulate POLQ. This allows them to limit DNA damage and survive, although at the expense of genome integrity. In these cells, increased levels of POLQ will further inhibit HR, by binding to RAD51 proteins and preventing their accumulation at resected DNA ends.

Cytotoxic drugs used for cancer therapy promote DSBs in order to overwhelm DNA repair mechanisms and induce cell death. Could the use of POLQ inhibitors, alone or in combination with other DNA damaging drugs, improve the treatment of HR-deficient tumors? It’s too early to tell, but preliminary results suggest that it is worth investigating. Indeed, knockdown of POLQ in HR-deficient cells reduces cell survival following treatment with cisplatin or mitomycin C, and human tumor cells expressing shRNA against both FANCD2 (HR knockdown) and POLQ (MMEJ knockdown) do not grow in mice.

At the beginning of this year, POLQ was in the spotlight thanks to 3 very interesting publications, which shed light on its role and mode of action. UniProtKB/Swiss-Prot POLQ entries have been updated accordingly and are publicly available as of this release.

UniProt release 2015_05

Published April 29, 2015

Headline

A never-ending race between evolution and genomic integrity

Primate evolution has been accompanied by several waves of retrotransposon insertions. Nowadays about 50% of our genome is composed of endogenous retroelements (EREs). Although many of them have lost their transposition ability, some remain quite active. For instance, among the 500,000 copies of long interspersed element-1 (LINE1 or L1) present in the human genome, about 100 are retrotransposition-competent, and over 40 of them are highly active. Other EREs, such as short interspersed nuclear elements (SINEs), including Alu repeats, and SINE-VNTR-Alu (SVA), a composite hominid-restricted ERE, also actively move in the genome. It is currently estimated that new, non-parental L1 integrations occur in nearly 1/100 births and roughly every 20th newborn baby has a new Alu retrotransposon somewhere in its DNA.

Obviously having DNA jumping around our genome may be quite harmful and our cells work hard to repress EREs. Transcriptional silencing is controlled by TRIM28 and KRAB domain-containing Zinc finger proteins (KRAB-ZNFs). TRIM28 forms a repressive complex (KAP1 complex) by interacting with CHD3, a subunit of the nucleosome remodeling and deacetylation (NuRD) complex, and SETDB1, which specifically methylates histone H3 at ‘Lys-9’, inducing heterochromatinization. KRAB-ZNFs bind DNA and recruit the KAP1 complex to target sites.

KRAB-ZNF genes are one of the fastest growing gene families in primates, possibly to limit the activity of newly emerged ERE classes. This hypothesis has gained support in an elegant study recently published in Nature. In this article, Jacobs et al. used a heterologous cell system in which murine embryonic stem cells harbored a copy of human chromosome 11, which contains a number of EREs, including SVA and the L1 subfamily L1PA. In this cellular environment, the primate-specific EREs were derepressed. Individual overexpression of highly expressed human KRAB-ZNFs, confirmed by reporter gene assays, allowed the identification of genes involved in the repression of specific ERE (sub)families: ZNF91 and ZNF93, which acted on SVA and L1PA4, respectively. The authors then traced back the phylogenetic history of these genes in the primate lineage and analyzed the parallel evolution of their target EREs. They could show that a new wave of L1PA insertions in great ape genomes was made possible through the deletion of a 129-bp element in L1PA3, which destroyed the ZNF93-binding site. This could be interpreted as an ERE response to a series of structural changes in ZNF93 that occurred soon before and improved host repression of L1PA activity.

In conclusion, the expansion of a new ERE drives the evolution of a host repressor which leads to a subsequent change in ERE to escape repression, and so on. It is a never-ending race of our genome with itself, which leads inexorably to greater and greater complexity.

As of this release, updated human ZNF91 and ZNF93 entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of IPI species proteome data sets from FTP site

Since the closure of IPI in 2011, UniProt has provided proteome data sets for IPI species on its FTP site. In UniProt release 2015_03, we started to provide new data sets for reference proteomes, which also cover the IPI species, and we have now removed the old ‘proteomes’ FTP directory that contained only data for the IPI species.

UniProtKB XSD change for evidence attribution

We have made the following changes to the UniProtKB XSD to allow a more fine-grained attribution of evidences to the parts of comment annotations that contain “free-text” descriptions:

  • The cardinality of all existing text elements was changed from maxOccurs="1" to maxOccurs="unbounded".
  • The phDependence, redoxPotential and temperatureDependence child elements of the bpcCommentGroup now have a sequence of text child elements.
  • The note child element of the isoformType was replaced by a sequence of text child elements.

The affected parts of the XSD are shown below:

    <xs:complexType name="commentType">
        ...
            <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
        ...
    <xs:group name="bpcCommentGroup">
       ...
             <xs:element name="absorption" minOccurs="0">
                ...
                        <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
                ...
            <xs:element name="kinetics" minOccurs="0">
                ...
                        <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
                ...

            <!-- The following 3 elements will in future each have a sequence of <text> child elements:
            <xs:element name="phDependence" type="evidencedStringType" minOccurs="0"/>
            <xs:element name="redoxPotential" type="evidencedStringType" minOccurs="0"/>
            <xs:element name="temperatureDependence" type="evidencedStringType" minOccurs="0"/>
            -->
            <xs:element name="phDependence" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="redoxPotential" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="temperatureDependence" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        ...
    <xs:complexType name="isoformType">
        ...
            <!-- The <note> element will be replaced by a sequence of <text> elements:
            <xs:element name="note" minOccurs="0">
                <xs:complexType>
                    <xs:simpleContent>
                        <xs:extension base="xs:string">
                            <xs:attribute name="evidence" type="intListType" use="optional"/>
                        </xs:extension>
                    </xs:simpleContent>
                </xs:complexType>
            </xs:element>
            -->
            <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
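For consumers of the XML, the practical effect is that code expecting a single string must now iterate over several `text` elements, each with its own evidence list. The following is a minimal sketch under stated assumptions: the sample fragment and the helper name `evidenced_texts` are hypothetical, the `evidence` attribute is assumed to be a space-separated list of integers, and the namespace used in complete UniProtKB XML files is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical comment fragment shaped like the revised XSD: <phDependence>
# now wraps one or more <text> elements, each with its own evidence list.
SAMPLE = """\
<comment type="biophysicochemical properties">
  <phDependence>
    <text evidence="1">Optimum pH is 7.5.</text>
    <text evidence="2 3">Stable between pH 6 and 9.</text>
  </phDependence>
</comment>
"""


def evidenced_texts(comment, child_name):
    """Return (text, evidence keys) pairs for every <text> under child_name."""
    pairs = []
    for text in comment.findall(f"{child_name}/text"):
        # A missing evidence attribute yields an empty list.
        evidence = [int(key) for key in text.get("evidence", "").split()]
        pairs.append((text.text, evidence))
    return pairs


comment = ET.fromstring(SAMPLE)
ph_texts = evidenced_texts(comment, "phDependence")
print(ph_texts)
```

The same helper can be called with "redoxPotential" or "temperatureDependence", since all three child elements now share this structure.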

Cross-references to BioMuta

Cross-references have been added to BioMuta, a curated single-nucleotide variation and disease association database.

BioMuta is available at https://hive.biochemistry.gwu.edu/tools/biomuta/.

The format of the explicit links is:

Resource abbreviation BioMuta
Resource identifier Gene name.

Example: P02787

Show all entries having a cross-reference to BioMuta.

Text format

Example: P02787

DR   BioMuta; TF; -.

XML format

Example: P02787

<dbReference type="BioMuta" id="TF"/>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Lipidation’ (‘LIPID’ in the flat file):

  • O-palmitoleyl serine

UniProt release 2015_04

Published April 1, 2015

Headline

Of CAT tails and protein translation by-products

Correct translation of mRNA into functional proteins is an essential cellular process. Defects in translation not only deprive cells of proteins needed for almost any task, but also produce by-products that can negatively impact these tasks and be toxic. Therefore translational garbage has to be removed.

One source of errors is defective ribosomes that stop during translation and hence produce incomplete polypeptide chains. All organisms have evolved mechanisms to manage translation arrest. In eukaryotes, ribosome stalling induces dissociation of the small 40S subunit and recruitment of the ‘ribosome quality control complex’ (RQC) to the large 60S subunit. RQC mediates the ubiquitination and degradation of the incompletely synthesized polypeptide chains.

Over the past few years, the mode of action of RQC has begun to be elucidated. The molecular components of RQC include listerin, an E3 ubiquitin ligase encoded by RKR1 in yeast and LTN1 in mammals, the AAA adenosine triphosphatase CDC48/VCP/p97 and ubiquitin-binding cofactors, as well as 2 proteins of unknown function. Listerin mediates the ubiquitination of the stalled polypeptide and subsequent recruitment of CDC48/VCP/p97 to the complex. The ATPase may provide the mechanical force to allow extraction of the nascent chain and its delivery to the proteasome for degradation.

Three recent studies have addressed the function of one of the uncharacterized proteins of the complex, called RQC2 in yeast and NEMF in mammals. In mammals, NEMF/RQC2 is responsible for the selective recognition of stalled 60S subunits. It does so by making multiple simultaneous contacts with 60S and peptidyl-tRNA to sense nascent chain occupancy. NEMF/RQC2 is also important for the stable association of listerin with the complex. Work in yeast not only corroborates these findings, but also reveals another unexpected function for NEMF/RQC2. NEMF/RQC2 recruits alanine- and threonine-charged tRNAs to the ribosomal A site and directs the elongation of stalled nascent chains independently of mRNA or 40S subunits, leading to non-templated C-terminal Ala and Thr extensions, aptly named CAT tails. The exact function of CAT tails is still under investigation, but they seem to induce an HSF1-dependent heat shock response in yeast through a mechanism that is yet to be determined. The heat shock response may help cells to buffer against malformed proteins. Alternatively, the extension at the C-terminus may serve to test the functional integrity of large ribosomal subunits, so that the cell can detect and dispose of defective large subunits that induce stalling.

mRNA-independent polypeptide biosynthesis has already been described in microorganisms. Classical examples of such peptides are peptide antibiotics, including actinomycin, bacitracin, colistin, and polymyxin B. In addition, in Staphylococcus aureus, pentaglycines acting as cross-linkers in the cell wall peptidoglycan are synthesized in the absence of mRNA. Although still considered a very marginal event, the assembly of amino acids without an mRNA blueprint might be more widespread than previously anticipated.

As of this release, updated yeast RQC2 and mammalian NEMF entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Reducing redundancy in proteomes

The UniProt Knowledgebase (UniProtKB) has witnessed an exponential growth in the last few years with a two-fold increase in the number of entries in 2014. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. This increase has been accompanied by a high level of redundancy in UniProtKB/TrEMBL and many sequences are over-represented in the database. This is especially true for bacterial species where different strains of the same species have been sequenced and submitted (e.g. 1,692 strains of Mycobacterium tuberculosis, corresponding to 5.97 million entries). To reduce this redundancy, we have developed a procedure to identify highly redundant proteomes within species groups using a combination of manual and automatic methods. We have applied this procedure to bacterial proteomes (which constituted 81% of UniProtKB/TrEMBL in release 2015_03) and sequences corresponding to redundant proteomes (47 million entries) have been removed from UniProtKB. These sequences are still available in the UniParc sequence archive dataset within UniProt. From now on, we will no longer create new UniProtKB/TrEMBL records for proteomes identified as redundant.

Protein sequences belonging to proteomes that are not identified as redundant remain in UniProtKB. All proteomes are searchable through the UniProt website’s Proteomes pages. Sequences corresponding to redundant proteomes are available for download from UniParc and you will also be directed to alternate non-redundant proteome(s) available for the same species. The history (i.e. previous versions) of redundant UniProtKB records is still available.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Acid phosphatase deficiency

Changes to keywords

Modified keyword:

UniMES news

Retirement of UniProt Metagenomic and Environmental Sequences (UniMES)

The UniProt Metagenomic and Environmental Sequences (UniMES) database was developed as a repository for metagenomic and environmental data. UniProt has retired UniMES as there is now a resource at the EBI that is dedicated to serving metagenomic researchers. Henceforth, we recommend using the EBI Metagenomics portal instead. In addition to providing a repository of metagenomics sequence data, EBI Metagenomics allows you to view functional and taxonomic analyses and to submit your own samples for analysis.

UniProt release 2015_03

Published March 4, 2015

Headline

Regulation of translation initiation through folding

Many physiopathological events, such as stress or nutrient deprivation, induce rapid changes in cellular protein levels. In these cases, cells preferentially use translational control of existing mRNAs over transcriptional control, since the latter generates a slower response. Translation can be divided into 4 steps, initiation, elongation, termination, and ribosome recycling, but most regulation occurs at the initiation level.

In eukaryotes, translation initiation involves recruitment of the 40S ribosome to mRNA by the eukaryotic initiation factor 4F (eIF4F) complex. This complex is composed of eIF4E, which binds to the mRNA 5’ cap structure, eIF4A, an RNA helicase, and eIF4G, a scaffolding protein. Availability of eIF4E is rate-limiting in this process and it is an important target for control. Under stress or starvation conditions, when translation has to be rapidly repressed, eIF4E binding proteins (4E-BPs) interact with eIF4E, outcompeting eIF4G and hence preventing eIF4F assembly and cap-dependent translation initiation. Three 4E-BPs have been identified in mammals. 4E-BP2 (EIF4EBP2) is one of them. It is an intrinsically disordered protein (IDP) that contains several phosphorylation sites. In its unphosphorylated state, 4E-BP2 interacts with eIF4E via 2 domains: a YXXXXLΦ motif (residues 54 through 60) and a secondary dynamic motif (residues 78 through 82). The unphosphorylated (or minimally phosphorylated), eIF4E-binding form of EIF4EBP2 is unstable and targeted for degradation via the ubiquitin-proteasome pathway. By contrast, highly phosphorylated 4E-BP2 is very stable, but only weakly binds to eIF4E and hence can be outcompeted by eIF4G, allowing translation to occur.

How does phosphorylation regulate 4E-BP2 interaction with eIF4E and its stability? It has been recently shown that phosphorylation induces a widespread disorder-to-order transition occurring in 2 steps. First, phosphorylation at Thr-37 and Thr-46 by MTOR induces folding of residues Pro-18 to Arg-62 into a four-stranded β-domain that sequesters the helical YXXXXLΦ motif into a partially buried β-strand, blocking accessibility to eIF4E. The folding also protects Lys-57 from ubiquitination, preventing proteasomal degradation. This ordered structure is further stabilized by phosphorylation at Ser-65, Thr-70 and Ser-83. The fully phosphorylated protein has an affinity for eIF4E 4,000 fold lower than the unphosphorylated form. This observation implies that binding must be coupled to unfolding in order to free the YXXXXLΦ motif, and it is indeed what is experimentally observed. When the phosphorylated form binds eIF4E, it undergoes an order-to-disorder transition, as suggested by NMR spectra that are similar to those of the unphosphorylated form.

Although it has long been suspected that the function of IDPs may be controlled by post-translational modifications (PTMs), this is the first report experimentally showing how a PTM can fold an entire domain. These new data have been annotated in UniProtKB/Swiss-Prot and, as of this release, the updated EIF4EBP2 entry is publicly available.

UniProtKB news

New proteomics mapping files

Mappings of UniProt Knowledgebase (UniProtKB) human sequences to identified human peptides from public mass spectrometry (MS) proteomics repositories can now be found in the new dedicated ‘proteomics_mapping’ directory on the UniProt FTP site together with a description of how the mappings were generated. The mappings are based on our analysis of the content of those MS proteomics repositories that openly share with us their data and quality metrics concerning peptide identifications.

Mass spectrometry provides direct experimental evidence for the existence of proteins and these new peptide mappings greatly increase the proportion of human sequences in UniProtKB whose existence is supported by experimental proteomics data. The human reference proteome currently contains 89383 sequences and our analysis provides mass spectrometry evidence for 68229 of those sequences.

In future UniProt releases, we expect to add data from more MS proteomics repositories and additional species. We very much welcome the feedback of the community on our efforts.

New FTP repository for reference proteomes

Based on a gene-centric perspective, the UniProt Knowledgebase (UniProtKB) now provides data sets for reference proteomes, which can be found in the new reference_proteomes directory.

As of release 2015_03, it encompasses 1933 species distributed in Eukaryota, Archaea and Bacteria. Viruses will be added in the next release.

Removal of the cross-references to PhosSite

Cross-references to PhosSite have been removed.

Removal of the cross-references to PptaseDB

Cross-references to PptaseDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Glycogen storage disease 14
  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B
  • Leigh syndrome, X-linked
  • Mental retardation, X-linked 59

Changes to keywords

New keyword:

UniParc news

UniParc cross-references with proteome identifier and component

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements. We have introduced two new types for cross-references to external database records from which UniProt proteomes are derived: The type "proteome_id" shows the identifier of the corresponding UniProt proteome and the type "component" the genomic component which encodes the protein. As a first step, we have added this information to bacterial ENA records.

Example:

<entry dataset="uniparc">
    <accession>UPI0000131B78</accession>
    <dbReference type="EMBL" id="AAK44239" version_i="1" active="Y" version="1" created="2003-03-12" last="2014-11-23">
        <property type="NCBI_GI" value="13879058"/>
        <property type="NCBI_taxonomy_id" value="83331"/>
        <property type="protein_name" value="serine/threonine protein kinase"/>
        <property type="gene_name" value="MT0017"/>
        <property type="proteome_id" value="UP000001020"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    <dbReference type="EMBL" id="ABQ71734" version_i="1" active="Y" version="1" created="2007-07-09" last="2014-11-23">
        <property type="NCBI_GI" value="148503925"/>
        <property type="NCBI_taxonomy_id" value="419947"/>
        <property type="protein_name" value="serine/threonine protein kinase"/>
        <property type="gene_name" value="pknB"/>
        <property type="proteome_id" value="UP000001988"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    ...
    <dbReference type="EMBL_CON" id="EFD75652" version_i="1" active="Y" version="2" created="2011-12-05" last="2014-11-23">
        <property type="NCBI_taxonomy_id" value="537209"/>
        <property type="protein_name" value="transmembrane serine/threonine-protein kinase B pknB"/>
        <property type="gene_name" value="TBIG_00439"/>
        <property type="proteome_id" value="UP000004676"/>
        <property type="component" value="Unassembled WGS sequence"/>
    </dbReference>
    ...
</entry>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
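As an illustration of the kind of code change this may involve, the following minimal Python sketch reads the two new property types from a trimmed copy of the example above. The function name `proteome_links` is hypothetical, and the sketch assumes the elements carry no XML namespace; real UniParc files may declare one, in which case the element names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above. The absence of an XML namespace is an
# assumption of this sketch.
SAMPLE = """<entry dataset="uniparc">
    <accession>UPI0000131B78</accession>
    <dbReference type="EMBL" id="AAK44239" version_i="1" active="Y" version="1" created="2003-03-12" last="2014-11-23">
        <property type="NCBI_taxonomy_id" value="83331"/>
        <property type="proteome_id" value="UP000001020"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    <dbReference type="EMBL_CON" id="EFD75652" version_i="1" active="Y" version="2" created="2011-12-05" last="2014-11-23">
        <property type="NCBI_taxonomy_id" value="537209"/>
        <property type="proteome_id" value="UP000004676"/>
        <property type="component" value="Unassembled WGS sequence"/>
    </dbReference>
</entry>"""


def proteome_links(entry_xml):
    """Return (database id, proteome_id, component) for each cross-reference.

    Properties that are absent are reported as None, so cross-references
    without the new property types are still handled.
    """
    entry = ET.fromstring(entry_xml)
    links = []
    for ref in entry.findall("dbReference"):
        props = {p.get("type"): p.get("value") for p in ref.findall("property")}
        links.append((ref.get("id"), props.get("proteome_id"), props.get("component")))
    return links


for link in proteome_links(SAMPLE):
    print(link)
```

Collecting the properties into a dictionary keyed by type keeps the code tolerant of property types it does not know about.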

UniProt RDF news

UniProt RDF files compressed with XZ instead of gzip

The UniProt RDF distribution has been available on the UniProt FTP site as gzip compressed RDF/XML files since 2008. We have now changed the compression algorithm from gzip to XZ, which has a number of features that make it a better choice for the UniProt RDF data:

  • It reduces the file size by approximately 23%, which improves FTP download time.
  • It can be decompressed in parallel, which can give faster decompression rates on current hardware with a minimum of 6-8 CPU cores.
  • It allows random access.
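For readers who process the files programmatically, here is a minimal Python sketch of streaming an XZ-compressed file line by line with the standard `lzma` module. The function name `iter_xz_lines`, the file name and the XML snippet are placeholders, not real UniProt data. The sketch reads the stream sequentially and does not exercise the parallel decompression or random access mentioned above.

```python
import lzma
import os
import tempfile


def iter_xz_lines(path):
    """Yield text lines from an XZ-compressed file, decompressing on the fly."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Self-contained demonstration: write a tiny XZ file, then stream it back.
# The file name and content are placeholders for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.rdf.xz")
with lzma.open(demo_path, mode="wt", encoding="utf-8") as out:
    out.write("<rdf:RDF>\n<!-- placeholder -->\n</rdf:RDF>\n")

demo_lines = list(iter_xz_lines(demo_path))
print(len(demo_lines), "lines read")
```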

Replacement of UniProt RDF file go.rdf with go.owl

The UniProt RDF distribution that is available on the UniProt FTP site previously contained a go.rdf file. This file has been replaced with a go.owl file that contains a subset of the official go.owl distribution of the Gene Ontology consortium, taken as a snapshot that is in sync with the GO annotations in the UniProt Knowledgebase.

In practical terms this means:

UniProt release 2015_02

Published February 4, 2015

Headline

Mosquitoes prefer humans

Blood-feeding is extremely unusual in insects. Among the 1 to 10 million insect species, only some 10,000 feed on blood, and among these, only 100 target humans. Not only is this behavior rare in terms of species, but within one species, it may be gender-specific. However, this small proportion of insects has a dramatic impact on human health. Female mosquitoes are major vectors of human diseases, such as malaria, dengue, yellow fever and chikungunya. The mosquito’s preference for humans is a matter of evolution. Aedes aegypti, the main vector of dengue and yellow fevers, actually exists as 2 subspecies, Aedes aegypti aegypti, feeding on human blood, and Aedes aegypti formosus, a generalist, zoophilic mosquito. It is currently thought that Aedes aegypti aegypti originated from a small population of forest-dwelling Aedes aegypti that became isolated in North Africa when a period of severe drought began in the Sahara approximately 4,000 years ago. The mosquito adapted to these harsh conditions, evolved a preference for breeding in artificial water storage containers and specialized in biting humans. This “domestic” form was reintroduced along the coast of East Africa following human movement and trade, and spread across much of the tropical and subtropical world. Today, along the coasts of Kenya, the 2 subspecies coexist, sometimes just a few hundred meters apart, with domestic Aedes aegypti aegypti found in homes, laying eggs in water stored in containers indoors, and the forest Aedes aegypti formosus avoiding human settlements, laying eggs in tree holes outdoors.

What is the genetic basis underlying the mosquito’s preference for humans? In order to answer this question, McBride et al. established 29 colonies of each Aedes aegypti subspecies. They observed that, contrary to their forest counterparts, domestic females showed a strong preference for human odor as compared to guinea pig, and were also more responsive in assays in which insects were directly exposed to live hosts, i.e. an anaesthetized guinea pig and a human arm (the owner of which should be congratulated for her commitment). Analysis of gene expression in antennae, the major olfactory organ, in both subspecies revealed almost 1,000 differentially expressed genes and among them, odorant receptors, a family of insect chemosensory receptors, were significantly overrepresented. Odorant receptor 4 (Or4) was of particular interest. It was upregulated in human-preferring mosquitoes, and was also the second most highly expressed odorant receptor in the antennae of domestic females. In addition, Or4 exhibited extensive variations that might affect its function. Or4 responds to sulcatone, a volatile odorant produced by a variety of animals and plants, but whose levels in humans are uniquely high. Seven major Or4 alleles have been identified. Alleles A, B, C, F, and G were highly sensitive to sulcatone, whereas D and E were much less sensitive. Interestingly, human-preferring colonies from various African, Asian and American countries were dominated by A-like alleles, whereas animal-preferring colonies were highly variable. This suggests that both Or4 expression levels and ligand-sensitivity play a role in human preference. Surprisingly, sulcatone has been described as a mosquito repellent at certain concentrations. McBride et al. hypothesized that it could be a repellent at high concentrations and an attractant at lower levels.

The important behavioral (r)evolution from the ancestral Aedes aegypti formosus to Aedes aegypti aegypti is unlikely to be due to a single gene, but at least Or4 is one genetic element clearly associated with these changes. The corresponding Or4 UniProtKB entry has been manually annotated and is publicly available as of this release.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Glycogen storage disease 14
  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Leigh syndrome, X-linked
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B
  • Mental retardation, X-linked 59

Changes to keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2015_01

Published January 7, 2015

Headline

Thalidomide, the pharmacological version of yin and yang

In the 1950s, the German company Chemie Gruenenthal brought a new drug to the market, thalidomide. It was primarily used as a sedative, but as it also had anti-emetic properties, it soon became popular to alleviate “morning sickness” in pregnant women. About 10,000 children were born to women taking thalidomide. They exhibited severe malformations, affecting limbs, ears, heart and other internal organs and only 50% survived. By the early sixties, the teratogenic effect of thalidomide had been established and its use discontinued. However, scientists’ interest in this molecule never stopped. In 1965, thalidomide was shown to have immunomodulatory and anti-inflammatory properties in patients with erythema nodosum leprosum, an inflammatory complication of leprosy. More recently, thalidomide proved effective against several hematological cancers, including multiple myeloma, by inhibiting cancer cell proliferation and modulating the immune system and the tumor microenvironment.

In 60 years, observations on thalidomide effects have accumulated, but its mode of action is still not fully elucidated. Nevertheless, some major steps have been accomplished to achieve this aim. A major breakthrough came in 2010 when thalidomide’s primary target, a protein called cereblon (CRBN), was identified. CRBN is a component of a ubiquitin E3 complex, called CRL4. This complex is made of at least 4 proteins, CUL4, DDB1, RBX1 and CRBN. Each protein has its specific function. CUL4 provides a scaffold for assembly of RBX1 and DDB1, RBX1 is the docking site for the activated E2 protein, and DDB1 recruits substrate-specificity receptors, such as CRBN, that form the substrate-presenting side of the CRL4 complex. The recently published CRL4 3D structure revealed that the ligase arm of CUL4 is quite mobile, establishing a ubiquitination zone. As it is a promiscuous enzyme, any lysine crossing this zone may be a target.

How does thalidomide affect CRBN activity within the CRL4 complex? In the presence of thalidomide, 2 transcription factors, IKZF1 and IKZF3, are recognized by CRBN and targeted for destruction by the proteasome. Neither of these proteins is a substrate in the absence of the drug. Under normal conditions, IKZF1 and IKZF3 regulate B- and T-cell development. IKZF1 suppresses the expression of IL2 in T-cells and stimulates the expression of IRF4. This observation sheds light upon the immunomodulatory effects of thalidomide. What about endogenous CRBN substrates? Until recently, none were known. Last July, Fischer et al. published the results of their search for proteins whose ubiquitination by CRL4/CRBN was inhibited by thalidomide (or thalidomide derivatives) and identified MEIS2, a homeodomain-containing protein. MEIS2 has been implicated in some aspects of normal human development. In bats, differential MEIS2 expression has been observed during limb development. A failure in limb development is a very striking feature of “thalidomide babies”. Hence MEIS2 may be a candidate for some aspects of thalidomide-induced teratogenicity.

Based on 3D structure analysis of the CRL4 complex, a model has been proposed in which thalidomide binds to CRBN at the canonical substrate-binding site. This interferes with the binding of endogenous CRBN substrates, impairs their ubiquitination and subsequent destruction, and results in their up-regulation. Conversely, the presence of thalidomide modifies the CRBN surface, creating a new binding site for neo-substrates, leading to their down-regulation.

As of this release, updated versions of the CRBN, DDB1, CUL4B and RBX1 entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to UniProt Proteomes

For several years now, UniProt has been providing ‘proteome’ sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced. In the past, these sets were based on the taxonomy of the organisms, but as more and more genomes of the same organism are being sequenced, we have recently introduced unique proteome identifiers to distinguish individual proteomes. These proteomes can be queried and downloaded from the new Proteomes section of the UniProt website. UniProtKB entries that are part of a proteome now have a cross-reference to their proteome and, where known, we also indicate the name of the component that encodes the respective protein.

UniProt Proteomes are available at http://www.uniprot.org/proteomes/.

The format of the explicit links is:

Resource abbreviation Proteomes
Resource identifier Proteome identifier.
Optional information 1 Component name.

Example: P78363

Text format

Example: P78363

DR   Proteomes; UP000005640; Chromosome 1.
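A DR line such as this one can be split into its fields with a few lines of Python. This is a minimal sketch, not an official parser: the function name `parse_dr_line` is hypothetical, and it assumes that fields are separated by semicolons, that the line ends with a full stop, and that a lone "-" marks an empty optional field.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, identifier, optional fields).

    Assumptions of this sketch: fields are separated by ';', the line ends
    with '.', and a lone '-' stands for an empty optional field.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    resource, identifier = fields[0], fields[1]
    extras = [field for field in fields[2:] if field and field != "-"]
    return resource, identifier, extras


print(parse_dr_line("DR   Proteomes; UP000005640; Chromosome 1."))
```

Returning the optional fields as a list lets the same function handle cross-references with no optional information, such as those written with a trailing "-".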

XML format

Example: P78363

<dbReference type="Proteomes" id="UP000005640">
  <property type="component" value="Chromosome 1"/>
</dbReference>

RDF format

In the RDF format, we have introduced a new property proteome to represent a proteomes resource. The component is indicated by a relative URI reference.

Example: P78363

uniprot:P78363
  up:proteome <http://purl.uniprot.org/proteomes/UP000005640#Chromosome%201> .
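Because the component is carried in the URI fragment, client code may want to separate it from the proteome identifier. The following Python sketch shows one way to do this with the standard library. The function name `split_proteome_uri` is hypothetical, and the sketch assumes the identifier is always the last path segment and the component is percent-encoded in the fragment, as in the example above.

```python
from urllib.parse import unquote, urldefrag


def split_proteome_uri(uri):
    """Return (proteome identifier, component name or None) for a proteome URI."""
    base, fragment = urldefrag(uri)
    # Assumption: the proteome identifier is the last path segment.
    proteome_id = base.rstrip("/").rsplit("/", 1)[-1]
    component = unquote(fragment) if fragment else None
    return proteome_id, component


print(split_proteome_uri("http://purl.uniprot.org/proteomes/UP000005640#Chromosome%201"))
```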

Cross-references to DEPOD

Cross-references have been added to DEPOD, the human DEPhOsphorylation Database.

DEPOD is available at http://www.koehn.embl.de/depod/.

The format of the explicit links is:

Resource abbreviation DEPOD
Resource identifier UniProtKB accession number.

Example: Q99502

Show all entries having a cross-reference to DEPOD.

Text format

Example: Q99502

DR   DEPOD; Q99502; -.

XML format

Example: Q99502

<dbReference type="DEPOD" id="Q99502"/>

Cross-references to MoonProt

Cross-references have been added to MoonProt, a manually curated database containing information about the known moonlighting proteins.

MoonProt is available at http://www.moonlightingproteins.org/.

The format of the explicit links is:

Resource abbreviation MoonProt
Resource identifier UniProtKB accession number.

Example: P31230

Show all entries having a cross-reference to MoonProt.

Text format

Example: P31230

DR   MoonProt; P31230; -.

XML format

Example: P31230

<dbReference type="MoonProt" id="P31230"/>

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt release 2014_11

Published November 26, 2014

Headline

Higher and higher

It is in human nature to push back the frontiers of what is possible. Modern humans left Africa and conquered the world. During their exploration, they met other humans who had already colonized the most improbable places tens of thousands of years earlier, maybe themselves being driven by the same urge to discover new horizons. Among the most challenging dwelling places is the Tibetan plateau, with an average elevation exceeding 4,500 meters. At this altitude, the oxygen concentration is only 60% of that available at sea level. Nevertheless, the Tibetan plateau is thought to have been inhabited for some 25,000 years.

To maintain oxygen homeostasis at high altitude (over 2,500 meters), the body responds in various ways, including increasing ventilation over the short term and increasing red blood cell production over the long term (see review). Hypoxia-inducible factor (HIF) plays a key role in the regulation of gene transcription in this process. HIF is a dimer composed of a common subunit beta, called ARNT, and 1 of 3 alpha subunits, called HIF1A, EPAS1, or HIF3A. Under normoxic conditions, HIFs-alpha are hydroxylated by prolyl hydroxylases EGLN1 (also known as PHD2), EGLN2 or EGLN3. Hydroxylation allows interaction with an E3-ubiquitin ligase, named VHL, followed by proteasomal degradation. Under hypoxic conditions, hydroxylation is arrested and HIFs-alpha are stabilized. They dimerize with ARNT and initiate the hypoxia response transcriptional program, which includes the stimulation of erythropoiesis. Strikingly, Tibetans exhibit a blunted erythropoietic response and their hemoglobin concentration is maintained at values expected at sea-level.

In 2010, 3 independent publications identified genes or loci showing evidence of hypoxia adaptation in Tibetans. All 3 studies pointed to 2 genes, among many others, being significantly associated with the decreased hemoglobin phenotype. They are EPAS1 and EGLN1. Interestingly, Tibetans may have inherited EPAS1 SNPs from Denisova man, an archaic Homo species identified in the Altai mountains of Siberia. The Tibetan-specific EGLN1 variant is more recent, currently estimated to have appeared some 8,000 years ago. It contains 2 single amino acid polymorphisms: p.Asp4Glu and p.Cys127Ser. Some characterization of this double variant came in September this year. Lorenzo et al. showed that it exhibited a lower K(m) value for oxygen, suggesting that it promotes increased HIF-alpha hydroxylation and degradation under hypoxic conditions. It could hence abrogate hypoxia-induced and HIF-mediated augmentation of erythropoiesis. Song et al. reported that the double variant specifically interferes with binding to PTGES3 (also called HSP90 cochaperone p23), but not to other known EGLN1 ligands, including FKBP8 or HSP90AB. As PTGES3-binding may facilitate HIF-alpha hydroxylation, a perturbation in this interaction would actually decrease HIF-alpha hydroxylation, hence decreased degradation and consequently increased HIF activity. The central question about the functional consequences of the Tibetan EGLN1 variant remains open…

It is not yet clear how high-altitude populations adapted to their harsh environment, but at least we begin to grasp the amazing complexity of this phenomenon. The scientific community has studied mostly 3 populations, Tibetans, Andeans and Ethiopians settled on the Simien plateau. They all exhibit patterns of genetic adaptation largely distinct from one another and the overlap is surprisingly low. The polymorphisms identified so far may not be straightforward loss- or gain-of-function, but they may instead fine tune complex interactions in which several proteins, possibly themselves carrying adaptive variations, are involved in a tissue-specific context.

As of this release, the UniProtKB/Swiss-Prot human EGLN1 entry has been updated with the new characterization data of the p.[Asp4Glu; Cys127Ser] polymorphism. On the new UniProt website, this information is to be found in the ‘Sequences’ section, ‘Polymorphism’ and ‘Natural variant’ subsections.

UniProtKB news

New mouse and zebrafish variation files

We would like to announce the release of two additional species, mouse and zebrafish, to the set of variation files available in the dedicated variants directory on the UniProt FTP sites. Both files catalogue protein altering Single Nucleotide Variants (SNVs or SNPs), stop-gained and stop-lost variants for UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences of each species. These variants have been automatically mapped to UniProtKB sequences, including isoform sequences, through Ensembl. We very much welcome the feedback of the community on our efforts.

Structuring of ‘cofactor’ annotations

We have structured the previously free text cofactor annotations in UniProtKB and mapped individual cofactors to ChEBI identifiers. How this affects different UniProtKB distribution formats is described below.

Text format

 CC   -!- COFACTOR:( <molecule>:)?
(CC       Name=<cofactor>; Xref=<database>:<identifier>;( Evidence={<evidence>};)?)* 
(CC       Note=<free text>;)?

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 0 or more times (*).

A cofactor annotation consists of:

  • An optional <molecule> value that indicates the isoform, chain or peptide to which this annotation applies.
  • Zero or more cofactors that are each described with:
    • A Name= field that shows the cofactor name.
    • A Xref= field that shows a cross-reference to the corresponding ChEBI record.
    • An optional Evidence= field that provides the evidence for the cofactor (see Evidences in the UniProtKB flat file format)
  • An optional Note= field that provides additional information.

Each cofactor description and the optional Note= field start on a new line. Lines are wrapped at a line length of 75 characters and indented to increase readability.

Examples:

  • Protein binds alternate/several cofactors
    CC   -!- COFACTOR:
    CC       Name=Mg(2+); Xref=ChEBI:CHEBI:18420;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Name=Co(2+); Xref=ChEBI:CHEBI:48828;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Note=Binds 2 divalent ions per subunit (magnesium or cobalt).
    CC       {ECO:0000255|HAMAP-Rule:MF_00086};
    CC   -!- COFACTOR:
    CC       Name=K(+); Xref=ChEBI:CHEBI:29103;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Note=Binds 1 potassium ion per subunit. {ECO:0000255|HAMAP-
    CC       Rule:MF_00086};
    
  • Isoforms
    CC   -!- COFACTOR: Isoform 1:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:16683188};
    CC       Note=Isoform 1 binds 3 Zn(2+) ions. {ECO:0000269|PubMed:16683188};
    CC   -!- COFACTOR: Isoform 2:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:16683188};
    CC       Note=Isoform 2 binds 2 Zn(2+) ions. {ECO:0000269|PubMed:16683188};
    
  • Chains
    CC   -!- COFACTOR: Serine protease NS3:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:9060645};
    CC       Note=Binds 1 zinc ion. {ECO:0000269|PubMed:9060645};
    CC   -!- COFACTOR: Non-structural protein 5A:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105; Evidence={ECO:0000250};
    CC       Note=Binds 1 zinc ion in the NS5A N-terminal domain.
    CC       {ECO:0000250};
    
  • Cofactor unknown
    CC   -!- COFACTOR:
    CC       Note=Does not require a metal cofactor.
    CC       {ECO:0000269|PubMed:24450804};
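The line pattern described above can be turned into a small parser. The following Python sketch is one possible reading of it, not a reference implementation: the names `parse_cofactor` and `ISOFORM_EXAMPLE` are hypothetical, and the rule used to rejoin wrapped lines (no space after a trailing hyphen, one space otherwise) is a heuristic inferred from the examples. Evidence tags inside the Note= field are left in the note text.

```python
import re

HEADER_RE = re.compile(r"^-!- COFACTOR:(?: (?P<molecule>.+):)?$")
COFACTOR_RE = re.compile(
    r"Name=(?P<name>[^;]+); Xref=(?P<db>[^:;]+):(?P<id>[^;]+);"
    r"(?: Evidence=\{(?P<evidence>[^}]*)\};)?"
)


def parse_cofactor(lines):
    """Parse the CC lines of one COFACTOR annotation into a dictionary."""
    # Drop the 'CC' line code and surrounding whitespace from every line.
    contents = [line.strip()[2:].strip() for line in lines if line.strip()]
    header = HEADER_RE.match(contents[0])
    if header is None:
        raise ValueError("not a COFACTOR annotation")
    # Rejoin wrapped lines (heuristic: no space after a trailing hyphen).
    text = ""
    for chunk in contents[1:]:
        if text and not text.endswith("-"):
            text += " "
        text += chunk
    body, _, note = text.partition("Note=")
    cofactors = [match.groupdict() for match in COFACTOR_RE.finditer(body)]
    return {
        "molecule": header.group("molecule"),
        "cofactors": cofactors,
        "note": note.rstrip(";").strip() or None,
    }


ISOFORM_EXAMPLE = [
    "CC   -!- COFACTOR: Isoform 1:",
    "CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;",
    "CC         Evidence={ECO:0000269|PubMed:16683188};",
    "CC       Note=Isoform 1 binds 3 Zn(2+) ions. {ECO:0000269|PubMed:16683188};",
]
print(parse_cofactor(ISOFORM_EXAMPLE))
```

Splitting at Note= before searching for Name= fields avoids picking up cofactor-like text that might appear inside a free-text note.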
    

XML format

We modified the XSD type commentType and introduced a new XSD type cofactorType, as shown below. We also moved the declaration of the molecule element – already used in the comment type "subcellular location" – to a more generic context so that it can also be used by other comment types such as "cofactor".

    <xs:complexType name="commentType">
        ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:choice minOccurs="0">
            ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'cofactor' annotations.</xs:documentation>
                    </xs:annotation>
                    <xs:element name="cofactor" type="cofactorType" maxOccurs="unbounded"/>
                </xs:sequence>

                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'subcellular location' annotations.</xs:documentation>
                    </xs:annotation>
                    <!-- <xs:element name="molecule" type="moleculeType" minOccurs="0"/> -->
                    <xs:element name="subcellularLocation" type="subcellularLocationType" maxOccurs="unbounded"/>
                </xs:sequence>
                ...
            </xs:choice>
            ...
            <xs:element name="text" type="evidencedStringType" minOccurs="0">
                <xs:annotation>
                    <xs:documentation>Used to store non-structured types of annotations,
                    as well as optional free-text notes of structured types of annotations.</xs:documentation>
                </xs:annotation>
            </xs:element>
            ...
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="cofactorType">
        <xs:annotation>
            <xs:documentation>Describes a cofactor.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="dbReference" type="dbReferenceType"/>
        </xs:sequence>
        <xs:attribute name="evidence" type="intListType" use="optional"/>
    </xs:complexType>

A cofactor annotation consists of a sequence of:

  • An optional molecule element that indicates the isoform, chain or peptide to which this annotation applies.
  • Zero or more cofactor elements that each describe an individual cofactor with the following child elements:
    • A name element shows the cofactor name.
    • A dbReference element represents a cross-reference to the corresponding ChEBI record.
  • An optional text element that provides additional information.

Examples:

  • Protein binds alternate/several cofactors
    <comment type="cofactor">
      <cofactor evidence="1">
        <name>Mg(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:18420"/>
      </cofactor>
      <cofactor evidence="1">
        <name>Co(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:48828"/>
      </cofactor>
      <text evidence="1">Binds 2 divalent ions per subunit (magnesium or cobalt).</text>
    </comment>
    <comment type="cofactor">
      <cofactor evidence="1">
        <name>K(+)</name>
        <dbReference type="ChEBI" id="CHEBI:29103"/>
      </cofactor>
      <text evidence="1">Binds 1 potassium ion per subunit.</text>
    </comment>
    ...
    <evidence key="1" type="ECO:0000255">
      <source>
        <dbReference type="HAMAP-Rule" id="MF_00086"/>
      </source>
    </evidence>
    
  • Isoforms
    <comment type="cofactor">
      <molecule>Isoform 1</molecule>
      <cofactor evidence="9">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="9">Isoform 1 binds 3 Zn(2+) ions.</text>
    </comment>
    <comment type="cofactor">
      <molecule>Isoform 2</molecule>
      <cofactor evidence="9">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="9">Isoform 2 binds 2 Zn(2+) ions.</text>
    </comment>
    ...
    <evidence key="9" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="16683188"/>
      </source>
    </evidence>
    
  • Chains
    <comment type="cofactor">
      <molecule>Serine protease NS3</molecule>
      <cofactor evidence="13">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="13">Binds 1 zinc ion.</text>
    </comment>
    <comment type="cofactor">
      <molecule>Non-structural protein 5A</molecule>
      <cofactor evidence="3">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="3">Binds 1 zinc ion in the NS5A N-terminal domain.</text>
    </comment>
    ...
    <evidence key="3" type="ECO:0000250"/>
    ...
    <evidence key="13" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="9060645"/>
      </source>
    </evidence>
    
  • Cofactor unknown
    <comment type="cofactor">
      <text evidence="1">Does not require a metal cofactor.</text>
    </comment>
    ...
    <evidence key="1" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="24450804"/>
      </source>
    </evidence>
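The structured XML representation can be read with a standard XML parser. The following Python sketch extracts the molecule, the cofactors and the free-text note from each cofactor comment. The function name `cofactor_comments` is hypothetical. The sample combines two of the examples above under a single wrapper element for illustration only, and it assumes no XML namespace; real UniProtKB XML files declare one, so element names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Two comments taken from the examples above, wrapped in one <entry> element
# purely for illustration. The missing XML namespace is an assumption.
SAMPLE = """<entry>
  <comment type="cofactor">
    <molecule>Isoform 1</molecule>
    <cofactor evidence="9">
      <name>Zn(2+)</name>
      <dbReference type="ChEBI" id="CHEBI:29105"/>
    </cofactor>
    <text evidence="9">Isoform 1 binds 3 Zn(2+) ions.</text>
  </comment>
  <comment type="cofactor">
    <text evidence="1">Does not require a metal cofactor.</text>
  </comment>
</entry>"""


def cofactor_comments(entry_xml):
    """Return one dictionary per cofactor comment found in the entry."""
    entry = ET.fromstring(entry_xml)
    results = []
    for comment in entry.findall("comment[@type='cofactor']"):
        cofactors = [
            (cofactor.findtext("name"), cofactor.find("dbReference").get("id"))
            for cofactor in comment.findall("cofactor")
        ]
        results.append({
            "molecule": comment.findtext("molecule"),  # None when absent
            "cofactors": cofactors,                    # may be empty
            "text": comment.findtext("text"),          # None when absent
        })
    return results


for item in cofactor_comments(SAMPLE):
    print(item)
```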
    

RDF format

We introduced a new cofactor property to list individual cofactors as ChEBI resource descriptions. As for other types of annotations, an optional sequence property may describe the molecule to which the annotation applies and an optional rdfs:comment property may provide additional information.

Examples:

Note: Evidences are omitted from the examples to make it easier to read them. They are represented as for all other types of annotations by reification of the concerned statements.

  • Protein binds alternate/several cofactors
    uniprot:Q5M434
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 2 divalent ions per subunit (magnesium or cobalt)." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_18420> ,
                  <http://purl.obolibrary.org/obo/CHEBI_48828> .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 potassium ion per subunit." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29103> .
    
  • Isoforms
    uniprot:O15304
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Isoform 1 binds 3 Zn(2+) ions." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence isoform:O15304-1 .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Isoform 2 binds 2 Zn(2+) ions." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence isoform:O15304-2 .
    
  • Chains
    uniprot:P26662
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 zinc ion." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence annotation:PRO_0000037644 .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 zinc ion in the NS5A N-terminal domain." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence annotation:PRO_0000037647 .
    
  • Cofactor unknown
    uniprot:A9CEQ7
      up:annotation SHA:1 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Does not require a metal cofactor." .
    

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2014_10

Published October 29, 2014

Headline

K for Koagulation

After several weeks of a cholesterol-free diet, chickens start bleeding. The phenotype cannot be reversed by the addition of purified cholesterol to their chow, suggesting that another compound could have been extracted along with cholesterol during food preparation. This observation made by Henrik Dam in 1929 led to the identification of a fat-soluble vitamin involved in coagulation, also known as vitamin K (K standing for Koagulationsvitamin, the original German name for this compound, since the initial observations were reported in a German journal). This discovery was awarded the Nobel prize in 1943, but vitamin K function and metabolism are still extensively studied.

In plants, vitamin K plays an essential role in photosynthesis, which is why it is particularly enriched in photosynthetic tissues, such as green leaves. In animals, vitamin K is essential for blood clotting and bone mineralization. It also prevents the calcification of arteries and other soft tissues. More recently, vitamin K has been shown to function as a mitochondrial electron carrier and to serve as a ligand for the nuclear receptor SXR, which controls the expression of genes involved in transport and metabolism of endo- and xenobiotics.

The most extensively studied vitamin K function is its role as a cosubstrate for vitamin K-dependent gamma-carboxylase (GGCX). This enzyme catalyzes gamma-carboxylation of glutamate residues in target proteins. The modification activates several blood factor proteins and leads to initiation of the blood coagulation cascade. Widely used anticoagulant drugs, called coumarins, take advantage of this property and act as vitamin K antagonists. For example, warfarin is thought to inhibit vitamin K epoxide reductase complex subunit 1 (VKORC1), blocking vitamin K recycling, hence depleting active vitamin K stores. Although life-saving, the use of warfarin is quite tricky, as inadequate dosage may have dramatic consequences, either embolism or thrombosis (underdosage), or potentially fatal hemorrhage (overdosage). Interindividual genetic variations greatly affect warfarin efficiency. Polymorphisms within VKORC1 and CYP2C9, a cytochrome P450 family member involved in coumarin inactivation, together account for approximately 30% of population dose variance. A genetic variant p.Val433Met in another P450 family member, CYP4F2, has also been reported to increase warfarin requirements. CYP4F2 has recently been shown to catalyze vitamin K omega-hydroxylation, a key step in vitamin K degradation. The p.Val433Met polymorphism produces a decrease of CYP4F2 protein in the liver. Lower CYP4F2 levels likely lead to an increase in hepatic vitamin K levels, hence more molecules that warfarin must antagonize, resulting in coumarin resistance in individuals bearing this polymorphism.

As of this release, an updated version of the UniProtKB/Swiss-Prot CYP4F2 entry is available. Proteins undergoing gamma-carboxylation can be retrieved using the keyword Gamma-carboxyglutamic acid.

UniProtKB news

Change of the cross-reference ArrayExpress to ExpressionAtlas

The Expression Atlas database provides information on baseline and differential gene expression patterns under different biological conditions. Experiments in Expression Atlas are selected from the ArrayExpress database of functional genomics experiments. Because UniProtKB entries cross-reference only this subset of experiments, we have changed the resource abbreviation for these cross-references from ArrayExpress to ExpressionAtlas. We have at the same time added a field to indicate the type of expression patterns for which information can be found in the ExpressionAtlas (see examples below).

Text format

Example: P15822

DR   ExpressionAtlas; P15822; baseline and differential.
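As a rough illustration of how a flat-file parser might read the new field, the Python sketch below splits a DR line into its resource abbreviation, identifier and any remaining fields. The function name `parse_dr_line` is hypothetical, and the code assumes that fields are separated by `; ` and that the line ends with a full stop, as in the example above.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, identifier, extra_fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    # Drop the two-letter line code and the trailing full stop (assumed format).
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = body.split("; ")
    return fields[0], fields[1], fields[2:]


resource, identifier, extra = parse_dr_line(
    "DR   ExpressionAtlas; P15822; baseline and differential."
)
print(resource, identifier, extra)
```

Code that previously matched on the abbreviation ArrayExpress would need to look for ExpressionAtlas instead and tolerate the additional field.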

XML format

Example: P15822

<dbReference type="ExpressionAtlas" id="P15822">
  <property type="expression patterns" value="baseline and differential"/>
</dbReference>

RDF format

Example: P15822

uniprot:P15822
  rdfs:seeAlso <http://purl.uniprot.org/expressionatlas/P15822> .
<http://purl.uniprot.org/expressionatlas/P15822>
  rdf:type Resource ;
  up:database database:ExpressionAtlas ;
  rdfs:comment "baseline and differential" .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Mental retardation, X-linked 59

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • (4R)-5-hydroxyleucine
  • (4R)-5-oxoleucine

Deleted term:

  • 5-methoxythiazole-4-carboxylic acid (Val-Cys)

UniProt release 2014_09

Published October 1, 2014

Headline

Small is beautiful (and useful)

In large-scale studies, small proteins tend to be overlooked. They are difficult to predict using software tools and they often escape detection by mass spectrometry. When cDNA sequences are submitted, short coding sequences (CDS) are only rarely annotated and hence do not appear in any protein databases, including UniProtKB/TrEMBL or GenPept, and their nucleotide sequences can be tagged as ‘non-coding RNAs’. In UniProtKB/Swiss-Prot, we are aware of the problem, but we are often reluctant to annotate uncharacterized small ORFs, for fear of introducing imaginary sequences into a database we wish to be as reliable as possible. That is why we are thrilled when new data become available that allow us to fill the gap.

This happened a few months ago, with the publication of 2 articles that brought the ‘noncoding transcript’ AK092578 under the spotlight. Pauli et al. were investigating inductive events during early embryogenesis in zebrafish. In order to find new signaling peptides, they sequenced RNAs extracted from embryos at different developmental stages and combined this approach with ribosome profiling to select for transcripts most likely to be translated. This led to the discovery of 399 novel coding genes. 28 of them contained a signal peptide, but no transmembrane domain, making them good candidates for signaling proteins. Pauli et al. focused their attention on one of them, apela, that they called toddler, encoded by AK092578, so far considered to be a noncoding transcript. A few weeks earlier, Chng et al. had already published the identification of the same protein, which they named elabela.

Apela is a highly conserved protein among vertebrates; this conservation is particularly striking in the 30 amino acid long mature peptide, the last 13 residues being nearly invariant in all vertebrate species studied. Apela is expressed in the zygote, with a peak during gastrulation, and becomes undetectable by 4 days post-fertilization. Its disruption leads to a dramatic phenotype, including small or absent hearts, posterior accumulation of blood cells, malformed pharyngeal endoderm, and abnormal left-right positioning and formation of the liver. Most mutant embryos eventually die between 5 and 7 days of development. Interestingly, this phenotype was reminiscent of that observed for apelin receptor (aplnr) deficiency.

The pathway leading to aplnr activation that could explain the observed mutant phenotype remained unsolved for several years. Indeed, aplnr disruption in zebrafish demonstrated that aplnr was required prior to the onset of gastrulation for proper cardiac morphogenesis, but its known ligand, apln, was not expressed until midgastrulation, too late to play a role in such a very early event. Along the same line, it had been reported that Aplnr mutant animals were not born in the expected Mendelian ratio, and many showed cardiovascular developmental defects, while Apln-deficient mice were viable, fertile, and showed normal development. Taken together, these observations suggested that Aplnr might have yet another ligand, expressed very early in embryonic development. The newly discovered apela protein seemed to fulfill the conditions and, using different strategies, both groups convincingly showed that apela is indeed aplnr’s first ligand.

Human, mouse and zebrafish Apela orthologs have been updated accordingly and these entries are now available.

UniProtKB news

Evidences in the UniProtKB flat file format

The evidence for annotations in UniProtKB entries has been available for several years in the XML and RDF representation of the data and we have now added this information also to the text format (aka flat file format).

Representation of evidences

This section describes how evidences are represented, independent of the context in which they can be found.

An individual evidence description consists of a mandatory evidence type, represented by a code from the Evidence Codes Ontology (ECO), and, where applicable, the source of the data. The source is usually another database record, represented by the database name and record identifier. For publications that are not in PubMed, we indicate the corresponding UniProtKB reference number instead.

Examples:

  • An evidence type without source: {type}, e.g.
    {ECO:0000305}
    {ECO:0000250}
    {ECO:0000255}
    
  • An evidence type with source: {type|source}, e.g.
    {ECO:0000269|PubMed:10433554}
    {ECO:0000303|Ref.6}
    {ECO:0000305|PubMed:16683188} 
    {ECO:0000250|UniProtKB:Q8WUF5}
    {ECO:0000312|EMBL:BAG16761.1}
    {ECO:0000313|EMBL:BAG16761.1}
    {ECO:0000255|HAMAP-Rule:MF_00205}
    {ECO:0000256|HAMAP-Rule:MF_00205}
    {ECO:0000244|PDB:1K83}
    {ECO:0000213|PDB:1K83}
    
  • Several evidences: {type|source, type|source, ...}, e.g.
    {ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}
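The representation above can be read with a few lines of Python. This is only a sketch: the names `EVIDENCE_RE` and `parse_evidences` are hypothetical, and the code assumes that neither evidence types nor sources contain commas, braces or a `|` character of their own.

```python
import re

# One {...} group; the content is a comma-separated list of type|source items.
EVIDENCE_RE = re.compile(r"\{([^{}]*)\}")


def parse_evidences(text):
    """Return (evidence_type, source) tuples for every {...} group in text.

    The source is None when the evidence consists of a type only.
    """
    result = []
    for group in EVIDENCE_RE.findall(text):
        for item in group.split(","):
            item = item.strip()
            if not item:
                continue
            eco, _, source = item.partition("|")
            result.append((eco, source or None))
    return result


print(parse_evidences("{ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}"))
print(parse_evidences("{ECO:0000305}"))
```

Keeping the type and the source as separate values makes it straightforward to filter annotations, for example to retain only those whose type is ECO:0000269.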
    

Change of the representation of different line and annotation types

This section describes in which line and annotation types evidences may be found and where they are placed. We use here the symbolic representation {evidence} as a placeholder for all evidence representations that are described in the previous section.

DE lines

Evidences may be found at the end of subcategory fields, e.g.

DE   RecName: Full=Palmitoyl-protein thioesterase-dolichyl pyrophosphate phosphatase fusion 1 {evidence};
DE   Contains:
DE     RecName: Full=Palmitoyl-protein thioesterase {evidence};
DE              Short=PPT {evidence};
DE              EC=3.1.2.22 {evidence};
DE     AltName: Full=Palmitoyl-protein hydrolase {evidence};
DE   Contains:
DE     RecName: Full=Dolichyldiphosphatase {evidence};
DE              EC=3.6.1.43 {evidence};
DE     AltName: Full=Dolichyl pyrophosphate phosphatase {evidence};
DE   Flags: Precursor;

GN lines

Evidences may be found after each gene designation, e.g.

GN   Name=cysA1 {evidence}; Synonyms=cysA {evidence};
GN   OrderedLocusNames=Rv3117 {evidence}, MT3199 {evidence};
GN   ORFNames=MTCY164.27 {evidence};
GN   and
GN   Name=cysA2 {evidence}; OrderedLocusNames=Rv0815c {evidence}, MT0837
GN   {evidence}; ORFNames=MTV043.07c {evidence};

OG lines

Evidences may be found after an organelle or plasmid, e.g.

OG   Mitochondrion {evidence}.
OG   Plasmid pWR100 {evidence}, Plasmid pINV_F6_M1382 {evidence}, and
OG   Plasmid pCP301 {evidence}.

OX lines

Evidences may be found after the taxonomy identifier, e.g.

OX   NCBI_TaxID=9606 {evidence};

RN lines

Evidences may be found after the reference number, e.g.

RN   [1] {evidence}

RC lines

Evidences may be found after each value, e.g.

RC   STRAIN=C57BL/6J {evidence}, and DBA/2J {evidence}; TISSUE=Brain
RC   {evidence};

KW lines

Evidences may be found after each keyword, e.g.

KW   ATP-binding {evidence}; Cell cycle {evidence}; Cell division {evidence};
KW   DNA replication {evidence};

CC lines

The evidence location depends on the annotation type.

Unstructured annotations:

Evidences may initially be found at the end of the annotations because this is how they have historically been attributed, e.g.

CC   -!- FUNCTION: Possesses kinase activity. May be involved in
CC       trafficking and/or processing of RNA. {evidence}.

At a later time, we intend to start attributing evidences at a more fine-grained level by placing them behind the sentences or paragraphs to which they apply, e.g.

CC   -!- FUNCTION: Possesses kinase activity. {evidence}. May be involved
CC       in trafficking and/or processing of RNA. {evidence}.

Structured annotations:

ALTERNATIVE PRODUCTS:

Evidences may be found behind the values of the Name= and Synonyms= fields. They may also be found in Comment= and Note= fields where they are placed as in unstructured annotations, e.g.

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=13;
CC         Comment=Additional isoforms seem to exist. {evidence};
CC       Name=1 {evidence}; Synonyms=LST1/A {evidence};
CC         IsoId=O00453-1; Sequence=Displayed;
..
CC       Name=12;
CC         IsoId=O00453-12; Sequence=VSP_047367;
CC         Note=No experimental confirmation available. {evidence};

BIOPHYSICOCHEMICAL PROPERTIES:

In the structured subtopics Absorption and Kinetic parameters evidences may be found at the end of the Abs(max)=, KM= and Vmax= fields. They may also be found in Note= fields and the unstructured subtopics pH dependence, Redox potential and Temperature dependence where they are placed as in unstructured annotations, e.g.

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=465 nm {evidence};
CC         Note=The above maximum is for the oxidized form. Shows a maximal
CC         peak at 330 nm in the reduced form. These absorption peaks are
CC         for the tryptophylquinone cofactor. {evidence};
CC       Kinetic parameters:
CC         KM=5.4 uM for tyramine {evidence};
CC         Vmax=17 umol/min/mg enzyme {evidence};
CC         Note=The enzyme is substrate inhibited at high substrate
CC         concentrations (Ki=1.08 mM for tyramine). {evidence};

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       pH dependence:
CC         Optimum pH is 7-8 for ATPase activity. Is more active at pH 8 to
CC         10 than at pH 5.5. {evidence};
CC       Temperature dependence:
CC         Optimum temperature is 80 degrees Celsius for ATPase activity.
CC         {evidence};

RNA EDITING:

Evidences may be found behind the modified positions as well as in the optional Note= field where they are placed as in unstructured annotations, e.g.

CC   -!- RNA EDITING: Modified_positions=207 {evidence}; Note=Partially
CC       edited. Target of Adar. {evidence};

(Please note that we have taken this occasion to make an additional small format change to this annotation type: We have replaced the full-stop at the end of the annotation with a semi-colon to be consistent with other structured annotation types that consist of a list of Field=Value; items.)

MASS SPECTROMETRY:

In MASS SPECTROMETRY annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidences are thus displayed in a separate field instead of adding them at the end of each field. A new Evidence= field has replaced the previously existing Source= field, e.g.

CC   -!- MASS SPECTROMETRY: Mass=2189.4; Method=Electrospray; Range=167-
CC       186; Note=Monophosphorylated.; Evidence={evidence};
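Because structured annotations such as this one are lists of Field=Value; items, they can be turned into a dictionary fairly directly. The sketch below is illustrative only: `parse_field_values` is a hypothetical name, the wrapped CC lines have been joined by hand (rejoining `167-` and `186` without a space, which is an assumption about how lines are wrapped), the `{evidence}` placeholder has been replaced by one of the example evidences listed earlier, and values are assumed not to contain `; ` themselves.

```python
def parse_field_values(annotation):
    """Parse 'Field=Value; Field=Value;' text into a dict of strings."""
    fields = {}
    for item in annotation.split("; "):
        item = item.strip()
        if item.endswith(";"):
            item = item[:-1]  # drop the terminating semi-colon of the last item
        if not item:
            continue
        name, _, value = item.partition("=")
        fields[name] = value
    return fields


annotation = (
    "Mass=2189.4; Method=Electrospray; Range=167-186; "
    "Note=Monophosphorylated.; Evidence={ECO:0000269|PubMed:10433554};"
)
fields = parse_field_values(annotation)
print(fields["Evidence"])
```

Code that used to read the Source= field would look up the Evidence= key instead.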

SEQUENCE CAUTION:

In SEQUENCE CAUTION annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidences are thus displayed in a separate new Evidence= field instead of adding them at the end of each field, e.g.

CC   -!- SEQUENCE CAUTION:
CC       Sequence=AAL25396.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=ABF70206.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=CAA32567.1; Type=Erroneous gene model prediction; Evidence={evidence};
CC       Sequence=CAA32568.1; Type=Erroneous gene model prediction; Evidence={evidence};

SUBCELLULAR LOCATION:

Evidences may be found at the same places where previously the non-experimental qualifiers By similarity, Probable and Potential were displayed (see Syntax modification of the ‘Subcellular location’ subtopic) as well as in the optional Note= field where they are placed as in unstructured annotations, e.g.

CC   -!- SUBCELLULAR LOCATION: Golgi apparatus, trans-Golgi network
CC       membrane {evidence}; Multi-pass membrane protein {evidence}.
CC       Note=Predominantly found in the trans-Golgi network (TGN). Not
CC       redistributed to the plasma membrane in response to elevated
CC       copper levels. {evidence}.
CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm {evidence}.
CC   -!- SUBCELLULAR LOCATION: WND/140 kDa: Mitochondrion {evidence}.

DISEASE:

Evidences may be found at the end of the disease description as well as in the optional Note= field where they are placed as in unstructured annotations, e.g.

CC   -!- DISEASE: Sarcoidosis 1 (SS1) [MIM:181000]: An idiopathic,
CC       systemic, inflammatory disease characterized by the formation of
CC       immune granulomas in involved organs. Granulomas predominantly
CC       invade the lungs and the lymphatic system, but also skin, liver,
CC       spleen, eyes and other organs may be involved. {evidence}.
CC       Note=Disease susceptibility is associated with variations
CC       affecting the gene represented in this entry. {evidence}.

FT lines

Evidences may be found at the end of the feature description, e.g.

FT   VARIANT     341    341       P -> L (in AH2; strongly reduced
FT                                activity). {evidence}.
FT                                /FTId=VAR_065665.
FT   CONFLICT     52     53       RT -> KI (in Ref. 8; AAD14329).
FT                                {evidence}.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Glycogen storage disease 14

Changes to keywords

New keyword:

Modified keyword:

UniProt release 2014_08

Published September 3, 2014

Headline

Ubiquitin caught at its own game

Ubiquitination is a widely used post-translational modification (PTM) in eukaryotic cells. It is involved in a plethora of cellular activities ranging from removal of misfolded and unwanted proteins to signaling in innate immunity, from transcriptional regulation to membrane trafficking. Ubiquitination is the covalent attachment of the small 76-residue protein ubiquitin onto a target protein, most often via an isopeptide bond between the amino group of a lysine side chain and the ubiquitin C-terminus. This process occurs in several steps: a ubiquitin-activation step catalyzed by E1 enzymes, a ubiquitin-conjugation step catalyzed by E2 enzymes, and a step ensuring target specificity that involves E3 ligases. Several types of ubiquitination exist: monoubiquitination, multi(mono)ubiquitination and polyubiquitination, each conveying a different signal. Polyubiquitination occurs via further ubiquitination of a single lysine residue on the substrate protein. Ubiquitin contains 7 lysines; each can serve as an acceptor for further elongation and each defines a distinct fate for the modified protein. The classic example is the Lys-48-linked chain, which targets the protein bearing it to degradation via the proteasome.

An additional step of complexity has been unveiled in 3 recent publications: Ubiquitin was discovered to be itself subjected to another PTM, namely phosphorylation, which confers on it the ability to activate the E3 ubiquitin-protein ligase Parkin (PARK2).

Parkin and the PINK1 kinase are involved in the signaling pathway leading to mitophagy, a specialized program which eliminates damaged mitochondria and hence maintains health. Indeed, defects in any of these proteins cause early-onset Parkinson disease.

Under normal conditions, PINK1 is imported into mitochondria, where it is processed and rapidly degraded. When mitochondria lose membrane potential or amass unfolded proteins, PINK1 accumulates on the outer membrane where it recruits cytosolic Parkin and activates its latent E3 activity. As a result, mitochondrial outer membrane proteins are ubiquitinated and the defective organelle is targeted for destruction.

It is in the Parkin activation step that phosphorylated ubiquitin comes into play. PINK1 directly phosphorylates ubiquitin at Ser-65. Of note, Parkin itself contains a ubiquitin-like domain that is also phosphorylated by PINK1 at Ser-65. All three publications agree that phosphorylated ubiquitin is involved in the PINK1/PARK2 pathway. Nevertheless Koyano and colleagues found that both ubiquitin and Parkin Ser-65 phosphorylations are needed for full Parkin activation, whereas Kane et al. observed Parkin activation with phospho-ubiquitin alone. While phospho-ubiquitin can be used by Parkin as a substrate for ubiquitination, its Parkin-binding and -activating abilities seem to be separated from its role as a substrate.

As of this release, human Parkin, PINK1 and ubiquitin entries have been updated accordingly and annotations have been transferred to orthologous entries based on sequence similarity. Proteins known to undergo ubiquitination can be retrieved with the keyword Ubl conjugation and proteins involved in the ubiquitination pathway, such as E1, E2 or E3 enzymes, with the keyword Ubl conjugation pathway.

UniProtKB news

New variant types in homo_sapiens_variation.txt.gz on the UniProt FTP site

UniProt would like to announce the addition of two variant types, stop lost and stop gained, to the set of protein altering variants from the 1000 Genomes Project available in the homo_sapiens_variation.txt.gz file. Stop lost and stop gained variants have been selected as the first structural variants to be added to the UniProt variant catalogue because they are two of the most commonly occurring variant types. UniProt expects to add further structural variant types and somatic variants to the available variant types and to include additional species. This file, along with the humsavar.txt file, can now be found in the new dedicated variants directory in the UniProt FTP site. We very much welcome the feedback of the community on our efforts.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • Isoaspartyl glycine isopeptide (Asn-Gly)
  • Isoaspartyl glycine isopeptide (Asp-Gly)

Deleted terms:

  • Aspartyl isopeptide (Asn)
  • Aspartyl isopeptide (Asp)

Changes to keywords

Modified keyword:

Website news

The UniProt website is changing

We would like to introduce you to the new UniProt website! We have been working on this site behind the scenes for a while and we’re glad it’s finally time to share it with you.

We redesigned the UniProt website following a user centered design process, involving over 250 users worldwide with varied research backgrounds and use cases. User centered design is a design approach that is grounded in the requirements and expectations of users. They are included at every stage of the process, from gathering requirements to testing the end product.

Some highlights of the changes and improvements:

  • A new homepage and advanced search functionality
  • A new results page interface with easy to use filters
  • A basket to store your favorite proteins and build up your own set
  • New protein entry page content classification and navigation bar
  • New tool output interfaces (e.g. BLAST results)
  • New ‘Proteomes’ pages for full protein sets from completely sequenced organisms

Contextual help is available on the site as well as UniProt help videos from the UniProt YouTube channel. We look forward to feedback from the scientific community to help improve the site further.

UniProt release 2014_07

Published July 9, 2014

Headline

Lark or owl? PER3 is the answer

Unless you are like Napoleon who never needed more than 4 hours of sleep at a stretch, being both an early bird and a night owl, you certainly have a diurnal preference. It is not a simple matter of taste, it is a matter of genetics, involving the PER3 gene.

In humans, the PER3 gene exists in 2 versions: a short one and a long one. The length variation depends upon the number of 18 amino-acid tandem repeats in the protein’s C-terminus: 4 in the short version, 5 in the long one. Roughly 10% of the population is homozygous for the long allele (PER3 5/5) and 50% for the short allele (PER3 4/4). This polymorphism correlates significantly with extreme diurnal preference, the longer allele being associated with morningness and the shorter allele with eveningness. In addition, PER3 5/5 individuals are more vulnerable to sleep deprivation than their PER3 4/4 counterparts, exhibiting greater cognitive performance impairment. When allowed to take naps, PER3 5/5 individuals show a greater ability to sleep independently of circadian phase, suggesting that the polymorphism modifies the sleep homeostatic response without influencing circadian parameters.

The molecular mechanism of this behavioral difference is not known and there was no animal model to investigate it until recently. Indeed, the 18 amino-acid polymorphism does not exist in non-primate mammals. Earlier this year, Hasan et al. published a study in which they created 2 knock-in mice. These mice contained a “humanized” PER3 exon 18 with either the 4-repeat or 5-repeat allele. The transgenic mice exhibited a phenotypic response to sleep deprivation and recovery consistent with the observations made in humans. 816 genes were differentially expressed in the cortex of Per3 4/4 and Per3 5/5 mice and a similar amount in the hypothalamus. At least some of these genes seem to be involved in the regulation of, or response to, sleep, as well as in neuronal development and function. For instance, some isoforms of the Homer1 gene, a marker of sleep homeostasis, were up-regulated in the Per3 5/5 compared to the Per3 4/4 hypothalamus.

With this tool in hand, we may be in a position to start identifying the genetic control of sleep architecture in humans and maybe unveil if Napoleon’s sleep ability was a true genetic oddity, the result of his iron will or just a historical myth.

As of this release, the human PER3 entry has been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to CCDS

Cross-references have been added to CCDS, the Consensus CDS project.

CCDS is available at http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi.

The format of the explicit links is:

Resource abbreviation CCDS
Resource identifier CCDS identifier

Example: O70554

Show all entries having a cross-reference to CCDS.

Text format

Examples:

O70554
DR   CCDS; CCDS38509.1; -.
P00750
DR   CCDS; CCDS6126.1; -. [P00750-1]
DR   CCDS; CCDS6127.1; -. [P00750-3]
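The text-format examples show that a CCDS cross-reference may be followed by an isoform identifier in square brackets. A minimal Python sketch for reading both forms is given below; `DR_CCDS_RE` and `parse_ccds_line` are hypothetical names, and the identifier pattern (`CCDS`, digits, a dot and a version number) is inferred from the examples rather than taken from a specification.

```python
import re

# Optional trailing "[isoform]" as in: DR   CCDS; CCDS6126.1; -. [P00750-1]
DR_CCDS_RE = re.compile(
    r"^DR   CCDS; (?P<ccds>CCDS\d+\.\d+); -\.(?: \[(?P<isoform>[^\]]+)\])?$"
)


def parse_ccds_line(line):
    """Return (ccds_id, isoform_id or None), or None if the line does not match."""
    match = DR_CCDS_RE.match(line.rstrip())
    if match is None:
        return None
    return match.group("ccds"), match.group("isoform")


for example in (
    "DR   CCDS; CCDS38509.1; -.",
    "DR   CCDS; CCDS6126.1; -. [P00750-1]",
):
    print(parse_ccds_line(example))
```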

XML format

Examples:

O70554
<dbReference type="CCDS" id="CCDS38509.1"/>
P00750
<dbReference type="CCDS" id="CCDS6126.1">
  <molecule id="P00750-1"/>
</dbReference>
<dbReference type="CCDS" id="CCDS6127.1">
  <molecule id="P00750-3"/>
</dbReference>

Cross-references to GeneReviews

Cross-references have been added to GeneReviews, a resource of expert-authored, peer-reviewed disease descriptions.

GeneReviews is available at http://www.ncbi.nlm.nih.gov/books/NBK1116/.

The format of the explicit links is:

Resource abbreviation GeneReviews
Resource identifier GeneReviews identifier

Example: O00555

Show all entries having a cross-reference to GeneReviews.

Text format

Example: O00555

DR   GeneReviews; CACNA1A; -.

XML format

Example: O00555

<dbReference type="GeneReviews" id="CACNA1A"/>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • L-isoglutamyl histamine

Modified term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N6-crotonyl-L-lysine -> N6-crotonyllysine

Changes to keywords

New keywords:

Modified keywords:

UniParc news

UniParc cross-references with protein and gene names

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements. We have introduced two new types, "protein_name" and "gene_name", to show the preferred protein and gene name of external database records that provide this information. In this release we have added names for cross-references to UniProtKB and RefSeq. For UniProtKB entries that have several protein or gene names, UniParc shows only the main one, which is the same name that is shown in the UniProtKB FASTA format. We will soon add names for cross-references to ENA, Ensembl, EnsemblGenomes and model organism databases (FlyBase, SGD, TAIR, WormBase).

Examples:

<dbReference type="UniProtKB/Swiss-Prot" id="P05067" version_i="3" active="Y" version="3" created="1991-11-01" last="2014-02-19">
  <property type="NCBI_GI" value="112927"/>
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>
...
<dbReference type="UniProtKB/Swiss-Prot protein isoforms" id="P05067-2" version_i="1" active="Y" created="2003-03-28" last="2014-02-19">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Isoform APP305 of Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
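As an illustration of the kind of code change that may be needed, the Python sketch below reads the two new property types from a single dbReference fragment with the standard library's `xml.etree.ElementTree`. The function name `names_from_dbreference` is hypothetical. The fragment is the first example above, used without an XML namespace; a complete UniParc file may declare one, in which case the element lookups would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<dbReference type="UniProtKB/Swiss-Prot" id="P05067" version_i="3" active="Y" version="3" created="1991-11-01" last="2014-02-19">
  <property type="NCBI_GI" value="112927"/>
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>"""


def names_from_dbreference(xml_text):
    """Return (protein_name, gene_name); either is None when not provided."""
    element = ET.fromstring(xml_text)
    # Index the property children by their type attribute.
    props = {p.get("type"): p.get("value") for p in element.findall("property")}
    return props.get("protein_name"), props.get("gene_name")


print(names_from_dbreference(FRAGMENT))
```

Looking properties up by type, rather than by position, keeps such code working when further property types are introduced.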

FTP site news

Every folder on our FTP server now contains a file called RELEASE.metalink that specifies the size and MD5 checksum of every file in that folder, e.g.
ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/RELEASE.metalink

Metalink is an extensible metadata file format that describes one or more computer files available for download. It facilitates file verification and recovery from data corruption and lists alternate download sources (mirror URIs).

Various command line download tools, e.g. cURL version 7.30 or higher and aria2, support metalink.

Example: The following command will download all files in the current_release/ folder and verify their MD5 checksums:

curl --metalink ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/RELEASE.metalink

The files will be downloaded from one of the alternative locations mentioned in the metalink file. If one FTP server goes down during a download, programs can automatically switch to another mirror location. Some programs can also download segments from several FTP locations at the same time, which can make downloads much faster.
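For tools without metalink support, a file can still be checked by hand against the MD5 checksum listed for it. The Python sketch below shows one way to do this with the standard `hashlib` module; the function names `md5_of_file` and `verify_md5` are hypothetical, and the expected checksum is assumed to have been copied from the RELEASE.metalink file as a hexadecimal string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True if the file's MD5 matches the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for the larger files on the FTP site.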

Please note that UniProt can be downloaded from the consortium member FTP sites at three different geographical locations:

USA: ftp://ftp.uniprot.org/pub/databases/uniprot
UK: ftp://ftp.ebi.ac.uk/pub/databases/uniprot
Switzerland: ftp://ftp.expasy.org/databases/uniprot

This information can be found in our FAQ.

UniProt release 2014_06

Published June 11, 2014

Headline

Everything you always wanted to know about… sperm-egg interaction

To reach the ultimate goal of sexual reproduction, which is egg fertilization, sperm cells have to run an obstacle course. They have to jump, or rather to swim, through a lot of hoops and hurdles before fusing with the oocyte and forming a zygote. The very first step of this race starts after ejaculation and involves sperm capacitation, a complex process characterized by a series of structural and functional changes, leading to sperm hypermotility that allows it to swim through oviductal mucus. In the ampulla of the fallopian tube, in the immediate surroundings of the oocyte, the spermatozoon meets a hyaluronic acid-rich matrix secreted by cumulus cells that it penetrates with the help of hyaluronidase PH-20/SPAM1. The next impediment is the egg’s coat, the zona pellucida. The interaction between the spermatozoon and zona pellucida leads to the acrosomal reaction, in which molecules required for penetrating the zona pellucida are secreted and molecules needed for sperm binding to the egg are exposed. Once through the coat, the spermatozoon reaches the perivitelline space and eventually the egg’s plasma membrane, called the oolemma. It binds to the oolemma, and the egg and sperm membranes fuse.

Although the overall fertilization process has been known for a long time, a large part of the detailed molecular mechanism is still mysterious. In 2005, Inoue et al. identified Izumo1 as the sperm-specific protein involved in egg attachment. Without Izumo1, fertilization does not occur, at least in mice. It took 9 more years to pinpoint Folr4 as the Izumo1 egg partner. Folr4 is widely conserved across mammals, including marsupials. Contrary to what its name might suggest, Folr4 is not a folate receptor, but it efficiently binds Izumo1 and hence has been renamed Juno, after Jupiter’s wife (and sister). The Juno and Izumo1 interaction is an absolute requirement for fertilization. In the absence of Juno, mice display no particular phenotype in daily life, but are totally sterile, although they mate normally.

After fertilization, the egg becomes refractory to further sperm fusion events to prevent polyspermy. This process involves biochemical changes of the oolemma occurring 30-45 minutes after the initial fusion event, as well as hardening of the zona pellucida in a second phase. Juno may play a role in establishing the membrane block to polyspermy. Indeed, it is rapidly shed from the oolemma and redistributed to vesicles within the perivitelline space where it may create an area of “decoy eggs” to neutralize incoming sperm.

This discovery is not yet “everything you always wanted to know about” fertilization; for instance, it does not unveil the fusion mechanism itself. It is nevertheless a major step forward.

As of this release, human and mouse Juno proteins have been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Extension of the UniProtKB accession number format

We have extended the UniProtKB accession number format to 10 alphanumerical characters by adding a third pattern for new UniProtKB accession numbers. Old UniProtKB accession numbers will not change. The valid patterns for UniProtKB accession numbers are:

accession  1          2      3          4          5          6      7      8          9          10
old        [O,P,Q]    [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
old        [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]
new        [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]  [A-Z]  [A-Z,0-9]  [A-Z,0-9]  [0-9]

The three patterns can be combined into the following regular expression:

[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}
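As an illustration only, the regular expression above could be used in Python to check a candidate accession number. The function name `is_valid_accession` is hypothetical, and the 10-character test string in the usage note was chosen simply because it fits the new pattern. The sketch uses `re.fullmatch` so that the whole string, not just a prefix, must match.

```python
import re

# Regular expression quoted in the release notes, compiled unchanged.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(text: str) -> bool:
    """Return True if the whole string matches one of the three patterns."""
    return ACCESSION_RE.fullmatch(text) is not None
```

For example, `is_valid_accession("P05067")` and `is_valid_accession("A0A022YWF9")` both hold, whereas a 9-character string such as `"A0A022YWF"` is rejected.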

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N6-glutaryllysine

UniProt DAS news

We have retired the SAAS data source from our DAS server.

UniProt release 2014_05

Published May 14, 2014

Headline

A flounder… on the rocks!

Some organisms, such as certain vertebrates, plants, fungi and bacteria, have to resist low, subzero temperatures. Their survival relies upon the production of antifreeze molecules. Some insects, like the beetle Upis ceramboides, tolerate freezing to -60°C in midwinter thanks to the production of a compound, called xylomannan, made of a sugar and a fatty acid and located in cell membranes. However, most organisms use antifreeze proteins (AFPs). All AFPs act by binding to small ice crystals to inhibit growth that would otherwise be fatal, but each type of AFP seems to arrive at this end by a different route.

Pseudopleuronectes americanus, commonly called ‘winter flounder’, is a very common variety of flounder in North America. It lives in cold water and survives thanks to the expression of the AFP Maxi. The 3D structure of the Maxi protein has been recently elucidated, unveiling some very unusual features.

Maxi belongs to the type-I AFP family and consists of a homodimer. Each monomer folds exactly in half so that its N- and C-termini are side by side, hence the dimer looks like a 4-helix rod. It is composed of tandem 11-residue repeats that exhibit the [T/I]-x3-A-x3-A-x2 motif, where x is any residue. The conserved threonine/isoleucine and alanine residues in this motif have been shown to bind ice in monomeric type-I AFPs. In the 3D structure, the internal space generated by the packing of the 4 helices in the 11-residue repeat regions is just wide enough to accommodate a single layer of water. Amazingly, the water layer that occupies the gap consists of over 400 molecules forming an extensive, mainly polypentagonal network. As is the case for most globular proteins, Maxi internal residues are nonpolar, mainly alanines, which obviously is far from optimal for hydrophilic contacts. To overcome this problem, Maxi takes advantage of its backbone carbonyl groups to anchor water molecules and the whole structure is stabilized by water-mediated hydrogen bonding rather than by direct protein association. The positioned water molecules extend outwards between all 4 helices from the core to the surface and they form a network of ordered molecules at the periphery. As a result, this rather hydrophobic protein remains highly solvated and freely soluble in flounder blood under physiological conditions, i.e. at low temperatures. When the temperature rises above 16°C, Maxi irreversibly denatures.

Another surprise came from the observation that the predicted ice-binding residues, expected to face the protein exterior, actually occur on the inward-pointing surfaces of all 4 helices where they cooperate to form and anchor the interior ordered waters. How then does Maxi bind to ice? The current working hypothesis is that the positioned water molecules that extend outwards may form a network available to merge and freeze with the quasi-liquid layer on the surface of ice.

As of this release, the winter flounder antifreeze protein Maxi has been annotated and integrated into UniProtKB/Swiss-Prot. All antifreeze proteins available in UniProtKB/Swiss-Prot can be retrieved with the keyword ‘Antifreeze protein’.

UniProtKB news

Update of ECO mapping for evidences

In 2011, we started to use the Evidence Codes Ontology (ECO) to describe the evidences for UniProtKB annotations. Since then, this ontology has been extended and the GO Consortium has published a mapping of their GO evidence codes to ECO. We have adapted our mapping to ECO accordingly to have equivalent evidence codes for UniProtKB and GO annotations. How this affects different UniProtKB distribution formats is described below.

XML and DAS format

In these two formats, ECO codes are used to describe the evidences for UniProtKB annotations. In the UniProtKB XML format, an evidence is represented by an evidence element with a type attribute whose value is an ECO code. In the DAS (features) representation of UniProtKB, an evidence is represented by a METHOD element with an optional cvId attribute whose value is an ECO code.

The table below shows the mapping of previous to new ECO codes.

Previous ECO code New ECO code
ECO:0000001 ECO:0000305
ECO:0000006 ECO:0000269
ECO:0000034 ECO:0000303
ECO:0000044 ECO:0000250
ECO:0000203 ECO:0000501 and ECO:0000256

The codes ECO:0000312 and ECO:0000313 remain unchanged.

In the future, we will also use ECO:0000255 for UniProtKB annotations.
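Code that still holds previous ECO codes may need a translation step. The following Python sketch encodes the table above as a dictionary. The names `ECO_UPDATE` and `new_eco_codes` are hypothetical, and returning a tuple is an assumption made here because ECO:0000203 maps to two new codes and the table does not say how to choose between them.

```python
# Previous -> new ECO codes, copied from the XML/DAS table above.
# Values are tuples because one previous code maps to two new codes.
ECO_UPDATE = {
    "ECO:0000001": ("ECO:0000305",),
    "ECO:0000006": ("ECO:0000269",),
    "ECO:0000034": ("ECO:0000303",),
    "ECO:0000044": ("ECO:0000250",),
    "ECO:0000203": ("ECO:0000501", "ECO:0000256"),
}


def new_eco_codes(code):
    """Return the candidate new codes for a previous ECO code.

    Codes that are absent from the table (for example ECO:0000312 and
    ECO:0000313, which remain unchanged) are returned as they are.
    """
    return ECO_UPDATE.get(code, (code,))
```

A caller that meets ECO:0000203 has to decide which of the two returned codes applies, which this sketch deliberately leaves open.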

RDF format

In the UniProtKB RDF format, ECO codes are used to describe the evidences for UniProtKB and GO annotations. An evidence is represented by an evidence property whose value is an ECO code. The evidence property is part of an attribution object which is assigned to a UniProtKB or GO annotation via reification.

The table below shows the mapping of previous to new ECO codes.

GO evidence code Previous ECO code New ECO code
EXP ECO:0000006 ECO:0000269
IBA ECO:0000308 ECO:0000318
IBD ECO:0000214 ECO:0000319
IC ECO:0000001 ECO:0000305
IDA ECO:0000002 ECO:0000314
IEA ECO:0000203 ECO:0000501
IEP ECO:0000008 ECO:0000270
IGC ECO:0000177 ECO:0000317
IGI ECO:0000011 ECO:0000316
IKR ECO:0000216 ECO:0000320
IMP ECO:0000015 ECO:0000315
IPI ECO:0000021 ECO:0000353
IRD ECO:0000215 ECO:0000321
ISA ECO:0000200 ECO:0000247
ISM ECO:0000202 ECO:0000255
ISO ECO:0000201 ECO:0000266
ISS ECO:0000044 ECO:0000250
NAS ECO:0000034 ECO:0000303
ND ECO:0000035 ECO:0000307
RCA ECO:0000053 ECO:0000245
TAS ECO:0000033 ECO:0000304
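The three-column table above can also be loaded directly from its text form. The following Python sketch does this under the assumption that each row holds exactly three whitespace-separated fields. The names `GO_ECO_TABLE`, `load_mapping` and `GO_TO_ECO` are hypothetical.

```python
# Rows copied from the RDF table above:
# GO evidence code, previous ECO code, new ECO code.
GO_ECO_TABLE = """\
EXP ECO:0000006 ECO:0000269
IBA ECO:0000308 ECO:0000318
IBD ECO:0000214 ECO:0000319
IC ECO:0000001 ECO:0000305
IDA ECO:0000002 ECO:0000314
IEA ECO:0000203 ECO:0000501
IEP ECO:0000008 ECO:0000270
IGC ECO:0000177 ECO:0000317
IGI ECO:0000011 ECO:0000316
IKR ECO:0000216 ECO:0000320
IMP ECO:0000015 ECO:0000315
IPI ECO:0000021 ECO:0000353
IRD ECO:0000215 ECO:0000321
ISA ECO:0000200 ECO:0000247
ISM ECO:0000202 ECO:0000255
ISO ECO:0000201 ECO:0000266
ISS ECO:0000044 ECO:0000250
NAS ECO:0000034 ECO:0000303
ND ECO:0000035 ECO:0000307
RCA ECO:0000053 ECO:0000245
TAS ECO:0000033 ECO:0000304
"""


def load_mapping(table):
    """Map each GO evidence code to a (previous, new) pair of ECO codes."""
    mapping = {}
    for line in table.splitlines():
        go_code, previous, new = line.split()
        mapping[go_code] = (previous, new)
    return mapping


GO_TO_ECO = load_mapping(GO_ECO_TABLE)
```

With this in place, `GO_TO_ECO["ISS"][1]` gives the new ECO code for the GO evidence code ISS.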

Cross-references for isoform sequences: RefSeq

We have added isoform-specific cross-references to the RefSeq database. The format of these cross-references is as described in release 2014_03.

Cross-references to MaxQB

Cross-references have been added to MaxQB, a database of large proteomics projects.

MaxQB is available at http://maxqb.biochem.mpg.de/mxdb/.

The format of the explicit links is:

Resource abbreviation MaxQB
Resource identifier UniProtKB accession number.

Example: Q6ZSR9

Show all entries having a cross-reference to MaxQB.

Text format

Example: Q6ZSR9

DR   MaxQB; Q6ZSR9; -.

XML format

Example: Q6ZSR9

<dbReference type="MaxQB" id="Q6ZSR9"/>

Removal of the cross-references to ProtClustDB

Cross-references to ProtClustDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Short rib-polydactyly syndrome 2B
  • Short rib-polydactyly syndrome 3

UniParc news

UniParc cross-references with multiple taxonomy identifiers

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements, e.g. the species is represented with a property of the type "NCBI_taxonomy_id" that stores an NCBI taxonomy identifier in its value attribute. In the past, all external database records described a single species.

Example:

<dbReference type="REFSEQ" id="ZP_06545872" version_i="1" active="Y" version="1" created="2010-03-07" last="2013-07-18">
  <property type="NCBI_GI" value="289827083"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
</dbReference>
<dbReference type="REFSEQ" id="ZP_18488583" version_i="1" active="Y" version="1" created="2012-11-25" last="2013-07-18">
  <property type="NCBI_GI" value="425085490"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>

With the introduction of WP-accessions in the NCBI Reference Sequence Project (RefSeq) database, UniParc needs to represent more than one species per dbReference element.

Example:

<dbReference type="REFSEQ" id="WP_001144069" version_i="1" active="Y" version="1" created="2013-07-19" last="2013-11-12">
  <property type="NCBI_GI" value="447066813"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
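The kind of code change that may be needed is sketched below in Python: a parser that used to read a single "NCBI_taxonomy_id" property per dbReference element would now collect all of them. The wrapping `<entry>` element without a namespace and the function name `taxonomy_ids` are simplifying assumptions made for this illustration; real UniParc XML files declare a namespace that would have to be handled.

```python
import xml.etree.ElementTree as ET

# The WP-accession example from above, wrapped in a namespace-free root
# element (an assumption made to keep the sketch short).
SNIPPET = """<entry>
<dbReference type="REFSEQ" id="WP_001144069" version_i="1" active="Y" version="1" created="2013-07-19" last="2013-11-12">
  <property type="NCBI_GI" value="447066813"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>
</entry>"""


def taxonomy_ids(db_reference):
    """Return every NCBI taxonomy identifier of a dbReference, in document order."""
    return [
        prop.get("value")
        for prop in db_reference.findall("property")
        if prop.get("type") == "NCBI_taxonomy_id"
    ]


root = ET.fromstring(SNIPPET)
ids = taxonomy_ids(root.find("dbReference"))
```

Returning a list in all cases means that records with a single species and records with several species can be handled by the same code path.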

UniProt release 2014_04

Published April 16, 2014

Headline

An old unwanted guest being shown the door

Poliomyelitis causes disabling paralysis, notably in children and adolescents. It is an old plague. An early case of poliomyelitis is shown on a 3,000-year-old Egyptian stele. The disease is caused by the poliovirus, an RNA virus that colonizes the gastro-intestinal tract without any symptoms. In rare cases, the virus enters the central nervous system, preferentially infecting and destroying motor neurons, leading to muscle weakness and acute flaccid paralysis.

In the late 1940s, John Enders showed that the virus could be grown in cells cultured in vitro. This observation provided the basis for the generation of poliovirus vaccines during the 1950s. Poliomyelitis is now virtually absent in economically developed countries, and the World Health Organization is currently using the vaccine in a far-reaching plan to eradicate the poliovirus worldwide.

Polioviruses are small-sized (30 nm), non-enveloped icosahedral viruses composed of a capsid and an 8 kb single-stranded RNA genome. Upon entry into a host cell, the poliovirus rearranges cytoplasmic membranes to create double membrane spherical vesicles in which the virus replicates, hidden from the antiviral detectors of the host cell. Once new viral particles are assembled, the host cell undergoes lysis, releasing poliovirus virions.

The poliovirus genome encodes a single polyprotein, which is processed by autocatalytic cleavage into 13 different products that ensure all viral functions from entry and replication to cell exit. The size constraint on the poliovirus genome is enormous, since it has to fit within a 30 nm wide capsid. In this context, the polyprotein coding strategy is ideal as it allows the greatest economy of genome length versus protein end products.

In order to reduce redundancy in the knowledgebase, UniProtKB/Swiss-Prot describes all the protein products encoded by one gene in a given species in a single entry. Viral proteins are no exception to the rule. Hence, the poliovirus polyprotein is represented in a single UniProtKB/Swiss-Prot entry, which contains the description of 13 final and 4 intermediate chains.

As of this release, the Genome polyprotein entry of poliovirus type 1 (strain Mahoney) has been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references for isoform sequences: Ensembl Genomes

We have added isoform-specific cross-references to the Ensembl Genomes sections EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists. The format of these cross-references is as described in release 2014_03.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2014_03

Published March 19, 2014

Headline

Minority report

We are a minority in our own body. Over 90% of our cells are actually not human, but microbial. The majority of these microbes reside in the gut. The gut microbiota is typically dominated by bacteria, more specifically by Bacteroidetes and Firmicutes. The exact composition of gut microbiota varies between individuals and depends upon lifestyle, diet, hygienic preferences, use of antibiotics, etc. Gut microbes have a profound influence on human physiology and nutrition. Among other things, they contribute to harvesting energy from food.

All guidelines for a healthy diet emphasize the necessity of eating fruit, vegetables and whole grains. These products are rich in dietary fibers, i.e. non-starch polysaccharides, most of which cannot be digested by the hydrolases encoded by our genome. Our inherent ability to digest carbohydrates is restricted to starch and simple saccharides, not xyloglucans (XyGs), a family of highly branched plant cell wall polysaccharides, which are abundant in plants. In view of the prevalence of XyGs in our diet, the mechanism of degradation of these complex polysaccharides by bacteria was expected to be important to human energy acquisition, but until recently it was still unclear. Very interesting work by Larsbrink et al., published in February, sheds light on XyG metabolism. The authors identified a polysaccharide utilization locus (PUL) in the genome of a common human gut symbiont, Bacteroides ovatus. PUL is transcriptionally upregulated in response to growth on galactoxyloglucan. It is predicted to encode 10 genes, including 8 glycoside hydrolases. All of them were subjected to in-depth molecular characterization through reverse genetics, in vitro protein biochemistry and enzymology. Finally, the 3D structure of the endo-xyloglucanase BoGH5A, which generates short XyG oligosaccharides, was solved. This study unraveled all the details of the enzymatic pathways by which the most common dietary polysaccharides are digested in our gut.

Although XyG utilization loci (XyGULs) have been identified in only a few other gut-resident Bacteroidetes, including B. cellulosyliticus, B. uniformis, B. fluxus, Dysgonomonas mossii and D. gadei, most human beings harbor at least one of these Bacteroides XyGULs in their gut, suggesting their importance in human nutrition.

The importance of the gut microbiome goes far beyond an active role in food digestion. It also acts on intestinal function, promoting gut-associated lymphoid tissue maturation, tissue regeneration, gut motility, and morphogenesis of the vascular system surrounding the gut. It additionally affects many other physiopathological aspects, such as the nervous system and bone homeostasis. Not surprisingly, changes in the microbiota composition or a complete lack of a gut microbiota have been shown to affect metabolism, tissue homeostasis and behavior.

As of this release, manually reviewed B. ovatus XyGUL gene products are available in UniProtKB/Swiss-Prot. Let’s bet that they will be followed by many more proteins encoded by our other genome(s) in the near future.

UniProtKB news

Cross-references for isoform sequences

Some of the resources to which we link contain information that is specific to an isoform sequence. Where this is known, we now indicate the corresponding UniProtKB isoform sequence identifier in our cross-references, as described below. The first resources for which we provide such isoform-specific cross-references are Ensembl and UCSC.

Text format

The UniProtKB isoform sequence identifier is shown in square brackets at the end of the DR line as an optional field:

DR   ResourceAbbreviation; ResourceIdentifier(; AdditionalField)+. [IsoId]

Examples:

DR   Ensembl; ENST00000281772; ENSP00000281772; ENSG00000144445. [A0AUZ9-1]
DR   Ensembl; ENST00000418791; ENSP00000405724; ENSG00000144445. [A0AUZ9-2]
DR   Ensembl; ENST00000452086; ENSP00000401408; ENSG00000144445. [A0AUZ9-3]
DR   Ensembl; ENST00000457374; ENSP00000393432; ENSG00000144445. [A0AUZ9-3]
DR   UCSC; uc002vds.3; human. [A0AUZ9-1]
DR   UCSC; uc002vdt.3; human. [A0AUZ9-2]
DR   UCSC; uc002vdx.1; human. [A0AUZ9-4]
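A flat file parser may need to be adapted to the optional field. The Python sketch below splits a DR line into the resource abbreviation, the remaining fields and the isoform identifier, if present. The function name `parse_dr_line` is hypothetical, and the code assumes the layout shown in the examples above ("DR" followed by three spaces, fields separated by "; ", a final dot).

```python
def parse_dr_line(line):
    """Split a DR line into (resource, fields, isoform_id).

    isoform_id is None when the line has no trailing [IsoId] field.
    """
    body = line[5:].strip()  # drop the "DR" line code and its padding
    isoform_id = None
    if body.endswith("]"):
        # The optional isoform identifier is the last bracketed field.
        body, _, bracketed = body.rpartition(" [")
        isoform_id = bracketed[:-1]
    if body.endswith("."):
        body = body[:-1]  # remove the dot that terminates the field list
    parts = body.split("; ")
    return parts[0], parts[1:], isoform_id
```

Because the isoform identifier is removed before the field list is split, lines without the optional field are parsed by the same code path.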

XML format

To show the UniProtKB isoform sequence identifier in dbReference elements, we added an optional molecule element to the dbReferenceType. For consistency, we also changed the type of the molecule element that is found in the commentType. The XSD has been changed as highlighted below:

    <xs:complexType name="commentType">
    ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'subcellular location' annotations.</xs:documentation>
                    </xs:annotation>
                    <!-- <xs:element name="molecule" type="xs:string" minOccurs="0"/> -->
                    <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
                    <xs:element name="subcellularLocation" type="subcellularLocationType" maxOccurs="unbounded"/>
                </xs:sequence>
    ...
    <xs:complexType name="dbReferenceType">
    ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:element name="property" type="propertyType" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="moleculeType">
        <xs:annotation>
            <xs:documentation>Describes a molecule by name or unique identifier.</xs:documentation>
        </xs:annotation>
        <xs:simpleContent>
            <xs:extension base="xs:string">
                <xs:attribute name="id" type="xs:string" use="optional"/>
            </xs:extension>
        </xs:simpleContent>
    </xs:complexType>

Examples:

<dbReference type="Ensembl" id="ENST00000281772">
  <molecule id="A0AUZ9-1"/>
  <property type="protein sequence ID" value="ENSP00000281772"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000418791">
  <molecule id="A0AUZ9-2"/>
  <property type="protein sequence ID" value="ENSP00000405724"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000452086">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000401408"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000457374">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000393432"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="UCSC" id="uc002vds.3">
  <molecule id="A0AUZ9-1"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdt.3">
  <molecule id="A0AUZ9-2"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdx.1">
  <molecule id="A0AUZ9-4"/>
  <property type="organism name" value="human"/>
</dbReference>
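As an illustration of how the new molecule element might be consumed, the Python sketch below groups cross-references by isoform identifier. It uses three of the example elements above, wrapped in a namespace-free `<entry>` root for brevity. That wrapper and the function name `references_by_isoform` are assumptions of this sketch, since real UniProtKB XML declares a namespace.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three of the example elements above, wrapped in a hypothetical root element.
SNIPPET = """<entry>
<dbReference type="Ensembl" id="ENST00000452086">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000401408"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000457374">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000393432"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdx.1">
  <molecule id="A0AUZ9-4"/>
  <property type="organism name" value="human"/>
</dbReference>
</entry>"""


def references_by_isoform(root):
    """Group (type, id) pairs of dbReference elements by isoform identifier.

    Cross-references without a molecule child are collected under None.
    """
    grouped = defaultdict(list)
    for ref in root.iter("dbReference"):
        molecule = ref.find("molecule")
        key = molecule.get("id") if molecule is not None else None
        grouped[key].append((ref.get("type"), ref.get("id")))
    return dict(grouped)


grouped = references_by_isoform(ET.fromstring(SNIPPET))
```

Here `grouped["A0AUZ9-3"]` holds both Ensembl transcripts, which reflects that several cross-references can point to the same isoform.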

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • Isoaspartyl lysine isopeptide (Lys-Asp)

UniProt release 2014_02

Published February 19, 2014

Headline

Epigenetics in the spotlight

In its active form, folate, commonly known as vitamin B9, is a methyl carrier, essential for the biosynthesis of methionine and nucleic acids, most notably thymine, but also purine bases. Methionine synthesis involves first the activation of methionine synthase (MTR) by methionine synthase reductase (MTRR) and then the MTR-catalyzed conversion of homocysteine into methionine concomitant with conversion of 5-methyltetrahydrofolate into tetrahydrofolate. Methionine can be further modified into S-adenosyl methionine which serves as a methyl donor in the biosynthesis of cysteine, carnitine, taurine, lecithin, and phospholipids, among others.

Folate deficiency can result in many health problems, the most notable one being neural tube defects in developing embryos, but the molecular mechanism linking folate metabolism to development remains poorly understood. This is what prompted Padmanabhan et al. to create an animal model to study the impact of abnormal folate metabolism. These authors produced a mouse that contained a gene trap vector inserted in Mtrr gene intron 9. Wild-type Mtrr mRNA was still produced in spite of the insertion, but at lower levels, and folate metabolism was impaired.

When mid-gestation embryos from heterozygous intercrosses were analyzed, it appeared that about half of them displayed developmental defects typical of folate deficiency, ranging from developmental delay to neural tube and heart defects. Surprisingly, wild-type embryos were affected to a similar extent as embryos bearing the mutated gene. Inheritance of the phenotype was not dependent upon the parental genotype, but instead upon that of the maternal grandparents. In other words, Mtrr mutations in either maternal grandparent disrupted the development of their grandchildren, even when the parents and the conceptus were wild-type. These congenital abnormalities persisted in wild-type progeny in generations 4 and 5 of Mtrr mutant maternal ancestors.

What could be the mechanism of this peculiar mode of inheritance? The answer is not yet definite. Because folate plays a key role in one-carbon metabolism, the authors investigated DNA methylation. As expected, global DNA hypomethylation was observed in livers, uteri and placentas. Imprinted loci (differentially methylated regions or DMRs) in wild-type placentas of mid-gestation embryos from heterozygous maternal grandparents were also analyzed. A large proportion of the DMRs assessed in placentas of severely affected embryos had CpG site methylation levels that were statistically different from unrelated wild-type C57BL/6 mice. Surprisingly however, the majority of these sites were hypermethylated and the associated genes down-regulated. There was a positive correlation between epigenetic instability and the severity of the phenotype. Hence, epigenetic instability leading to the misexpression of certain genes may be the cause of developmental phenotypes.

Epigenetic heredity has been reported for Kit and Sox9 genes. In this case, heredity was mediated by RNA, a mechanism rather unlikely for the Mtrr mutations described above. The RNA-mediated heredity observed for Kit and Sox9 required the presence of the tRNA-methyltransferase TRDMT1/DNMT2. Hence, for both phenomena, it seems that the common feature may be methylation, either at the DNA or RNA level.

While awaiting further exciting discoveries in the field of epigenetics, we have already updated MTRR entries with the current knowledge and made them available.

UniProtKB news

Change of the cross-references to PROSITE and HAMAP

The format of the cross-references to the PROSITE and HAMAP databases has been simplified in order to align it with the format of other InterPro member databases.

Text format

Changes for PROSITE:

The optional qualifiers "UNKNOWN", "FALSE_NEG" and "PARTIAL" have been removed. Only matches above the threshold were kept, i.e. cross-references with a "FALSE_NEG" or "PARTIAL" qualifier have been removed.

Examples:

A1RHR2:

Previous format:

DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; UNKNOWN_1.
DR   PROSITE; PS00922; TRANSGLYCOSYLASE; FALSE_NEG.

New format:

DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; 1.

O02781:

Previous format:

DR   PROSITE; PS00237; G_PROTEIN_RECEP_F1_1; PARTIAL.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.

New format:

DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
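The PROSITE change can be mimicked on lines in the previous format, for instance to compare old and new files. The Python sketch below is based only on the examples above: it drops "FALSE_NEG" and "PARTIAL" cross-references and strips the "UNKNOWN_" prefix from the match count. The function name `migrate_prosite_line` is hypothetical, and leaving any other status value untouched is an assumption.

```python
PREFIX = "DR   PROSITE; "


def migrate_prosite_line(line):
    """Rewrite a previous-format PROSITE DR line in the new format.

    Returns None for cross-references that were removed, and returns
    non-PROSITE lines unchanged.
    """
    if not line.startswith(PREFIX):
        return line
    body = line[len(PREFIX):]
    if body.endswith("."):
        body = body[:-1]
    accession, entry_name, status = body.split("; ")
    if status in ("FALSE_NEG", "PARTIAL"):
        return None  # only matches above the threshold were kept
    if status.startswith("UNKNOWN_"):
        status = status[len("UNKNOWN_"):]  # keep only the number of matches
    return f"{PREFIX}{accession}; {entry_name}; {status}."
```

Applied to the two A1RHR2 lines above, the first is rewritten with a plain "1" and the second yields `None`.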

Changes for HAMAP:

The optional field that described the nature of signature hits ("atypical", "fused" or "atypical/fused") has been removed. Only matches above the threshold were kept, i.e. "atypical" and "atypical/fused" cross-references have been removed if their match score was below the threshold.

Example:

Q9K3D6:

Previous format:

DR   HAMAP; MF_00006; Arg_succ_lyase; 1; fused.
DR   HAMAP; MF_01105; N-acetyl_glu_synth; 1; atypical/fused.

New format:

DR   HAMAP; MF_00006; Arg_succ_lyase; 1.

XML format

Changes for PROSITE:

The optional values "UNKNOWN", "FALSE_NEG" and "PARTIAL" that were stored in a property of type match status have been removed, so that the match status value has become an integer. Only matches above the threshold were kept, i.e. "FALSE_NEG" and "PARTIAL" cross-references have been removed.

Examples:

A1RHR2:

Previous format:

<dbReference type="PROSITE" id="PS51257">
  <property type="entry name" value="PROKAR_LIPOPROTEIN"/>
  <property type="match status" value="UNKNOWN_1"/>
</dbReference>
<dbReference type="PROSITE" id="PS00922">
  <property type="entry name" value="TRANSGLYCOSYLASE"/>
  <property type="match status" value="FALSE_NEG"/>
</dbReference>

New format:

<dbReference type="PROSITE" id="PS51257">
  <property type="entry name" value="PROKAR_LIPOPROTEIN"/>
  <property type="match status" value="1"/>
</dbReference>

O02781:

Previous format:

<dbReference type="PROSITE" id="PS00237">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_1"/>
  <property type="match status" value="PARTIAL"/>
</dbReference>
<dbReference type="PROSITE" id="PS50262">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_2"/>
  <property type="match status" value="1"/>
</dbReference>

New format:

<dbReference type="PROSITE" id="PS50262">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_2"/>
  <property type="match status" value="1"/>
</dbReference>

Changes for HAMAP:

The optional property of type flag that described the nature of signature hits ("atypical", "fused" or "atypical/fused") has been removed. Only matches above the threshold were kept, i.e. "atypical" and "atypical/fused" cross-references have been removed if their match score was below the threshold.

Example:

Q9K3D6:

Previous format:

<dbReference type="HAMAP" id="MF_00006">
  <property type="entry name" value="Arg_succ_lyase"/>
  <property type="flag" value="fused"/>
  <property type="match status" value="1"/>
</dbReference>
<dbReference type="HAMAP" id="MF_01105">
  <property type="entry name" value="N-acetyl_glu_synth"/>
  <property type="flag" value="atypical/fused"/>
  <property type="match status" value="1"/>
</dbReference>

New format:

<dbReference type="HAMAP" id="MF_00006">
  <property type="entry name" value="Arg_succ_lyase"/>
  <property type="match status" value="1"/>
</dbReference>

These changes did not affect the XSD, but may nevertheless require code changes.

Cross-references to TreeFam

Cross-references have been added to TreeFam, a database composed of phylogenetic trees inferred from animal genomes.

TreeFam is available at http://www.treefam.org.

The format of the explicit links is:

Resource abbreviation TreeFam
Resource identifier TreeFam unique identifier.

Example: Q8CFE6

Show all entries having a cross-reference to TreeFam.

Text format

Example: Q8CFE6

DR   TreeFam; TF328787; -.

XML format

Example: Q8CFE6

<dbReference type="TreeFam" id="TF328787"/>

Cross-references to BioGrid

Cross-references have been added to BioGrid, a public database that archives and disseminates genetic and protein interaction data from model organisms and humans.

BioGrid is available at http://thebiogrid.org.

The format of the explicit links is:

Resource abbreviation BioGrid
Resource identifier BioGrid unique identifier.
Optional information 1 Number of interactions.

Example: O46201

Show all entries having a cross-reference to BioGrid.

Text format

Example: O46201

DR   BioGrid; 69392; 1.

XML format

Example: O46201

<dbReference type="BioGrid" id="69392">
  <property type="interactions" value="1"/>
</dbReference>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N-methylglycine
  • N,N-dimethylglycine
  • N,N,N-trimethylglycine

Deleted term:

  • 3-hydroxyhistidine

UniRef news

Revision of the UniParc records used in the UniRef databases

We have stopped importing UniParc records that correspond to Ensembl proteome sequences into the UniRef databases, as the relevant sequences are now part of UniProtKB. Previously, some sequences from Ensembl proteomes (e.g. from Human, Chicken, Cow) were missing from UniProtKB, but we have recently completed their import into UniProtKB (see FAQ) and thus no longer need to import them via UniParc. The UniRef databases will continue to include UniParc records from the RefSeq and PDB databases that are not in UniProtKB to ensure complete coverage of sequence space.

UniProt release 2014_01

Published January 22, 2014

Headline

Mouse attacks!

In the arid lands of Arizona lives a fierce predator whose howls pierce the desert night, terrifying its prey. This predator is… a mouse, Onychomys torridus, also called the grasshopper mouse. It may sound like a tale sprung straight from the imagination of Tim Burton or Monty Python, but this mouse really exists. It is carnivorous and it regularly howls just before a kill, although the emitted sound is more a sustained whistle than the actual howl of a wolf. Its prey is no less astonishing, including crickets, other rodents, tarantulas and bark scorpions (Centruroides sculpturatus).

Bark scorpions are not easy prey. They are venomous and inflict intensely painful, sometimes lethal stings. Surprisingly, grasshopper mice do not seem to be seriously bothered by that, and it takes little time before the scorpion is captured, killed and eaten. How can O. torridus ignore the venom, while common house mice are sensitive to it? Overall, grasshopper mice do feel pain normally, but when they are injected with scorpion venom or with a physiological saline solution in their hind paws, they are much more irritated by the control saline solution than by the venom. In grasshopper mice, bark scorpion venom acts as an analgesic.

Venom from Buthidae scorpions initiates acute pain in sensitive mammals, such as house mice, rats and humans, by activating the voltage-gated sodium channel Nav1.7/SCN9A, but has no effect on the Nav1.8/SCN10A sodium channel. Recent experiments by Rowe et al. on freshly isolated O. torridus sensory neurons showed that, in this species, the venom strongly inhibits Nav1.8/SCN10A Na+ currents. These Na+ currents are necessary for sustained action potential firing and propagation. By inhibiting Nav1.8/SCN10A, the scorpion venom blocks pain transmission to the central nervous system, and hence induces analgesia. The diametrically opposed response of rodents towards scorpion venom seems to be due to only 2 residues within the Nav1.8/SCN10A sequence. In O. torridus, a glutamate residue is found at position 859 (E-859) and a glutamine residue at position 862 (Q-862), while in species known to be sensitive to the venom, these positions are reversed: Q-859 and E-862. Site-directed mutagenesis of these 2 residues in the O. torridus sequence (E859Q/Q862E) abolished venom sensitivity. Conversely, mutation of the glutamine position in Mus musculus (Q861E) conferred inhibition by C. sculpturatus venom.

Pain sensitivity is essential for survival, since it helps avoid damaging situations. Hence any change in pain perception has to be finely tuned in order not to be deleterious. O. torridus has evolved a brilliant strategy allowing it to exploit an abundant food resource in its environment, i.e. bark scorpions, while keeping intact its ability to feel the necessary pain.

Persistent pain can turn into a nightmare and improving our understanding of pain signaling may be a tremendous help in the discovery of new analgesic drugs. Nav1.7/SCN9A is already under close investigation as a potential target for pain prevention. The new and very exciting study by Rowe et al. shows that the Nav1.8/SCN10A channel also plays a crucial role in the transmission of pain signals and may be an interesting target for analgesic development.

As of this release, the fully annotated O. torridus Nav1.8/SCN10A protein is available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of the cross-references to IPI

Cross-references to IPI have been removed.

IPI closed in 2011. The last release is archived at ftp://ftp.ebi.ac.uk/pub/databases/IPI.

The Ensembl and Ensembl Genomes projects offer access to genomic data from vertebrate and non-vertebrate species respectively.

Complete proteome data is available from UniProtKB.

The last mapping table between UniProtKB and IPI is archived at ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/.

Documents and RSS feeds for UniProt Forthcoming changes and News

We have replaced the documents sp_soon.htm (“UniProt Knowledgebase – Forthcoming changes”) and xml_soon.htm (“UniProt Knowledgebase – Forthcoming changes in XML”) by a searchable section Forthcoming changes on our website to announce planned changes for all UniProt data sets and file formats in one place and to provide a common RSS feed. The same information can also be downloaded from our FTP site.

Changes that have been implemented are described in our “News archive”, which can be searched in the News section of our website, followed via an RSS feed and downloaded from the FTP site. These news items include the historical contents of sp_news.htm (“What’s new?”), but not those of xml_news.htm (“What’s new in XML?”). The latter file was renamed to xml_news_prior_2014_01.html to archive the XML changes that were implemented before 2014. This file will no longer be updated.

We have generated symbolic links on the FTP site for the files that have been replaced to give everyone time to update their FTP download procedures to the new files’ locations:

New version of DASty

Our DAS web client DASty has been redesigned. DASty provides a visual representation of the compilation of protein annotations from different third-party sources. This allows users to get a global overview of all protein annotation available for their protein of interest, from UniProt as well as other sources. The “Third-party data” link that is available on each UniProtKB entry now leads to this new version of DASty. Any bookmarks should be updated accordingly. For instance, the “Third-party data” link for UniProt accession P05067 now links to http://www.ebi.ac.uk/dasty/client/index.html?q=P05067

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

Modified terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain) -> 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain with W-...)
  • 5’-tyrosyl-5’-aminotyrosine (Tyr-Tyr) (interchain) -> 5’-tyrosyl-5’-aminotyrosine (Tyr-Tyr) (interchain with Y-...)
  • Glycyl threonine ester (Gly-Thr) (interchain with G-...) -> Glycyl threonine ester (Gly-Thr) (interchain with T-...)

Changes to keywords

New keywords:

Modified keywords:

Deleted keyword:

  • Phage maturation

UniProt release 2013_12

Published December 11, 2013

Headline

The aflatoxin biosynthetic pathway annotated in UniProtKB/Swiss-Prot

Aflatoxins are important members of the mycotoxin family that contaminate food and feed crops. More than 14 different aflatoxins have been identified so far. These secondary metabolites are mainly produced by the filamentous fungi Aspergillus flavus and Aspergillus parasiticus. These organisms grow in warm and humid locations, such as those where crops (e.g. rice, maize and ground nuts) are stored.

Intake of aflatoxins has both acute and long-term effects. Acute aflatoxin poisoning leads to effects such as hemorrhagic necrosis of the liver, bile duct proliferation, edema and lethargy. In addition, aflatoxins have immunosuppressive effects and interfere with nutrient uptake leading to malnutrition (kwashiorkor). The most toxic of the aflatoxins, aflatoxin B1, is the most potent naturally occurring carcinogen known. The carcinogenic effect of aflatoxins is mediated by 2 cytochrome P450 enzymes, CYP1A2 and CYP3A4. CYP1A2 and CYP3A4 turn the aflatoxins into much more reactive epoxides that react with DNA bases and induce mutations, leading, in the long term, to liver cancer. Overall, it is estimated that aflatoxins negatively impact up to 5 billion people who live in warm and humid climates. The presence of dietary aflatoxin is strongly associated with incidences of liver and lung cancers, HIV/AIDS, malaria, growth stunting and childhood malnutrition, and increased risk of adverse birth outcomes in Asia, Africa, and Central America.

To increase the ability to eliminate or reduce aflatoxin contamination, the mycotoxin biosynthetic pathway has been comprehensively studied. The pathway is composed of over 25 enzymatic steps, each step catalyzed by a different enzyme. 13 of these enzymes have been biochemically characterized in sufficient depth to allow the recent attribution of enzyme classification (EC) numbers.

EC numbers are part of a classification system managed by the International Union of Biochemistry and Molecular Biology (IUBMB). They are composed of 4 numbers separated by periods, which together identify the enzyme and describe the chemical reaction it catalyzes. In UniProtKB, enzymes are annotated with EC numbers (in ‘Names and origin’, ‘Protein names’, ‘Recommended name’, see for instance pksL1 entry), when these are available.

As of this release, the enzymes involved in aflatoxin biosynthesis have been manually annotated and are publicly available in UniProtKB/Swiss-Prot. The newly characterized enzymes from this pathway belong to oxidoreductase, transferase, hydrolase, and lyase classes of the EC classification system.
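As an illustration of how the first of the 4 numbers identifies the enzyme class, here is a minimal Python sketch. The function name ec_class and the lookup table are our own illustrative choices, not part of any UniProt tool. The table lists only the six top-level classes defined at the time of this release, and the two sample EC numbers are the ones quoted for magnesium chelatase and iron chelatase in the release 2013_07 headline.

```python
# Top-level EC classes, keyed by the first field of the EC number.
# Assumption: only the six classes defined at the time of this release.
EC_CLASSES = {
    "1": "oxidoreductase",
    "2": "transferase",
    "3": "hydrolase",
    "4": "lyase",
    "5": "isomerase",
    "6": "ligase",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number such as '6.6.1.1'."""
    fields = ec_number.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {ec_number!r}")
    try:
        return EC_CLASSES[fields[0]]
    except KeyError:
        raise ValueError(f"unknown EC class in {ec_number!r}") from None


if __name__ == "__main__":
    for ec in ("6.6.1.1", "4.99.1.1"):
        print(ec, ec_class(ec))
```

Only the first field is interpreted here; the remaining three fields narrow down the subclass and the individual enzyme and would need the full IUBMB list to resolve.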

UniProtKB news

New human 1000 Genomes Project variants file

UniProt would like to announce the release of a new extension to the humsavar.txt variant catalogue. This new variant file, homo_sapiens_variation.txt.gz, supplements the set of manually curated human variants in humsavar.txt with a catalogue of novel Single Nucleotide Variants (SNVs or SNPs) from the 1000 Genomes Project for both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences. These variants have been automatically mapped to UniProtKB sequences, including isoform sequences, through Ensembl. In addition to defining the position and amino acid change due to each variant, the new file maps each affected UniProtKB record to the corresponding Ensembl gene, transcript and protein identifiers, provides the chromosomal location with allele change and, where possible, gives a cross-reference to OMIM for the variant. This file, along with the humsavar.txt file, can now be found in the new dedicated ‘variants’ directory on the UniProt FTP site. We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to add further data sources for human variants that will include somatic variants, new data fields providing additional details about each variant, and variants from additional species.

Cross-references to GuidetoPHARMACOLOGY

Cross-references have been added to GuidetoPHARMACOLOGY, which provides an expert-driven guide to pharmacological targets and the substances that act on them.

GuidetoPHARMACOLOGY is available at http://www.guidetopharmacology.org/

The format of the explicit links in the flat file is:

Resource abbreviation: GuidetoPHARMACOLOGY
Resource identifier: GuidetoPHARMACOLOGY identifier
Example (Q08460):
DR   GuidetoPHARMACOLOGY; 380; -.
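
For readers who process the flat file programmatically, the following Python sketch shows one way to split such a DR line into its resource abbreviation and remaining fields. The function name parse_dr_line is hypothetical, and the sketch assumes the simple layout shown in the example above (semicolon-separated fields closed by a period); DR lines with extra trailing content would need additional handling.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource abbreviation, other fields).

    Assumes the layout 'DR   <resource>; <field>; ...; <field>.'
    as in the example above.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the closing period
    parts = [part.strip() for part in body.split(";")]
    return parts[0], parts[1:]


if __name__ == "__main__":
    resource, fields = parse_dr_line("DR   GuidetoPHARMACOLOGY; 380; -.")
    print(resource, fields)
```

The same function applies unchanged to the other cross-reference examples in this archive, since they follow the same field layout.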

Show all the entries having a cross-reference to GuidetoPHARMACOLOGY.

New cross-reference category: Chemistry

A new database category has been added: Chemistry.

Change of the category of the cross-references BindingDB, ChEMBL and DrugBank

The BindingDB, ChEMBL and DrugBank databases have been moved from the category “Other” to the category “Chemistry”.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to keywords

New keywords:

Modified keyword:

Deleted keyword:

  • Inhibition of host TBK1-IKBKE-DDX3 complex by virus

UniProt release 2013_11

Published November 13, 2013

Headline

Forever young and cancer-free… in a black hole

In East African grasslands and savannas lives a most bizarre rodent: the naked mole-rat (Heterocephalus glaber). Naked mole-rats are small burrowing rodents, about the size of a mouse. They inhabit underground tunnels, where they form colonies ranging in size from 20 to 300 individuals. Naked mole-rats exhibit eusociality, a lifestyle reminiscent of that of ants or some bees. The colony is ruled by a queen; it has 1 to 3 males who breed only with the queen, while the other female members of the colony are sterile workers or soldiers. But this is not the only singularity of this amazing mammal. Among many other unexpected features, naked mole-rats exhibit exceptional longevity, some reaching ages of 30 years, about 10 times longer than ordinary mice (in a protected environment). They show negligible senescence, no age-related increase in mortality, and high fecundity until death. In addition, they are highly resistant to cancer.

In 2009, it was reported that naked mole-rats may resist cancer thanks to an extremely efficient mechanism of cell contact inhibition, called early contact inhibition (ECI). Contact inhibition is a process that arrests cell growth when cells come in contact with each other or the extracellular matrix. It is a powerful anticancer mechanism. The process of ECI causes naked mole-rat cells to arrest at a much lower density than mouse cells, and the loss of ECI makes naked mole-rat cells more susceptible to malignant transformation.

When culturing naked mole-rat fibroblasts, Tian et al. observed that the culture media became very viscous after a few days, much more than the media conditioned by human, guinea-pig or mouse cells. This increase in viscosity was due to the increased production of an anionic, nonsulfated glycosaminoglycan: high-molecular-mass hyaluronan (HMM-HA). HMM-HA overproduction was not restricted to tissue culture conditions. It was also observed in vivo, including in brain, heart, kidney and skin. Increased HMM-HA production was due to robust synthesis, via the up-regulation of hyaluronan synthase 2 (Has2), the enzyme catalyzing HMM-HA production, combined with slower degradation, due to the down-regulation of HA-degrading enzymes.

Secreted HMM-HA binds to fibroblasts through the Cd44 cell surface receptor and triggers intracellular signaling, leading to the expression of the cyclin-dependent kinase inhibitor Cdkn2a/p16-INK4a and to the induction of ECI. In naked mole-rat cells, this signaling is further optimized, since these cells exhibit a 2-fold higher affinity for HA as compared to mouse or human cells.

HA is widely distributed and one of the main components of the extracellular matrix. The authors hypothesized that the increased HMM-HA production in the naked mole-rat could have evolved as an adaptation to a subterranean lifestyle to provide flexible skin needed to squeeze through underground tunnels. This adaptation to harsh living conditions would turn out to have additional benefits, such as contributing to cancer resistance.

As of this release, naked mole-rat Has2 has been manually annotated and is publicly available in UniProtKB/Swiss-Prot entry G5AY81.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:
  • Epileptic encephalopathy, Lennox-Gastaut type
  • Knobloch syndrome 2

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N-acetylated lysine
Modified terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-glutamyl N2-arginine -> 5-glutamyl N2-ornithine
  • 5-glutamyl N2-glutamate -> 5-glutamyl glutamate

Changes to keywords

New keyword:

UniProt release 2013_10

Published October 16, 2013

Headline

When the cat’s away…

For all creatures, early detection of predators is a matter of survival. Olfaction often plays a crucial role in this regard. Odorant molecules activate specific receptors on sensory neurons. The axons from neurons expressing the same olfactory receptor come together at the same glomeruli, near the surface of the olfactory bulb of the brain. It is generally thought that odorants can be recognized by different receptors and that each glomerulus makes only a small contribution to the global representation of a given odor. However, recent discoveries suggest that the olfactory system may not be as redundant as previously thought.

Mice exhibit innate aversion to volatile amines, such as beta-phenylethylamine (PEA) and isopentylamine (IPA), that are excreted in cat urine. Trace amines robustly activate trace amine-associated receptors (TAARs). There are 15 TAAR genes in mouse. Targeted concomitant deletion of 14 of them (TAAR2 through TAAR9) shows no apparent phenotype. Homozygous mutant mice are healthy and breed normally. The only difference from wild-type and heterozygous littermates is that their aversion to PEA and to cat urine is abolished. This effect is specific, since their response to compounds produced by red fox remains unchanged. Among TAAR genes, TAAR4 is of particular interest, since it is exquisitely sensitive to PEA, with apparent affinities rivaling those seen with mammalian pheromone receptors. Amazingly, knockout of this single gene produces a loss of aversion to PEA and to puma or lynx urine, although homozygous mutant animals still avoid other odorants, such as IPA, exactly as their wild-type and heterozygous littermates do. To our knowledge, this is the first report of an individual main olfactory receptor contributing substantially to odor perception.

This type of exciting discovery reported in the literature triggers yet another innate reaction, that of Swiss-Prot curators to update UniProtKB. The revised mouse TAAR4 entry is now publicly available.

UniProtKB news

Cross-references to PRO

Cross-references have been added to PRO (Protein Ontology), which provides an ontological representation of protein-related entities by explicitly defining them and showing the relationships between them.

PRO is available at http://pir.georgetown.edu/pro/pro.shtml

The format of the explicit links in the flat file is:

Resource abbreviation: PRO
Resource identifier: PRO identifier
Example (O42634):
DR   PRO; PR:O42634; -.

Show all the entries having a cross-reference to PRO.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:
  • Microphthalmia, isolated, with cataract, 4

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • Methionine (R)-sulfoxide
  • Methionine (S)-sulfoxide

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2013_09

Published September 18, 2013

Headline

With a little help from my… Lassa virus

Dystroglycan provides a physical link between components of the extracellular matrix, including laminin, and the intracellular actin cytoskeleton. This link is crucial for a number of cellular processes, including laminin and basement membrane assembly, sarcolemmal stability, cell survival, peripheral nerve myelination, cell migration and epithelial polarization.

The dystroglycan protein is extensively glycosylated at multiple sites, and an unusual O-linked glycan is required for proper interaction with extracellular matrix ligands including laminin. Glycosyltransferases responsible for this modification were first identified using classical biochemical techniques, and mutations in the associated genes were identified in patients presenting with one of a number of dystroglycanopathies. These are a heterogeneous group of disorders characterized by muscular dystrophy that can be associated with brain anomalies, mental retardation, eye malformations, and other clinical symptoms. However, until recently, some 50% of newly diagnosed cases of dystroglycanopathy showed no significant association with variants in known glycosyltransferase genes.

To address this issue, Jae et al. (2013) developed a powerful approach to dystroglycanopathy candidate gene identification that exploits another, less beneficial property of dystroglycan. The hemorrhagic Lassa virus binds to glycosylated dystroglycan during infection, and the efficiency of this binding depends on the glycosylation level. By using gene-trap insertion mutagenesis, the authors were able to identify genes whose inactivation conferred resistance to Lassa virus infection, which by extension may include regulators of the level of dystroglycan glycosylation. These genes included all those previously known to be associated with a dystroglycanopathy, as well as several novel candidates. Exon sequencing of a panel of patients with severe dystroglycanopathy identified variants in two of them, POMK/SGK196 and TMEM5, while confirming the absence of variants in known dystroglycanopathy genes. The other candidates await further characterization.

We may be about to witness the elucidation of the underlying genetic causes of a range of dystroglycanopathies, disorders associated with defective dystroglycan modification, through the use of a deadly virus that normally targets the affected protein.

As of this release, all proteins involved in dystroglycanopathies can be retrieved from UniProtKB/Swiss-Prot with the keyword Dystroglycanopathy.

UniProtKB news

Removal of the cross-reference to Pathway_Interaction_DB

Cross-references to Pathway_Interaction_DB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Cataract, pulverulent, juvenile-onset, MAF-related
  • 2-aminoadipic 2-oxoadipic aciduria

Changes to keywords

New keywords:

Modified keywords:

UniProt release 2013_08

Published July 24, 2013

Headline

Girls just want to have … IFNE

Interferons (IFNs) are proteins made and released in response to the presence of pathogens, such as viruses or bacteria, that trigger the protective defenses of the immune system. In other words, they “interfere” with infections, hence their name. Within the large IFN family, type I IFNs are clustered on a defined locus on chromosome 9p21 in humans and in a region of conserved synteny on chromosome 4 in mice. Their expression is induced by the activation of signaling pathways downstream of pattern-recognition receptors and they all bind to the IFN-alpha cell surface receptor complex consisting of IFNAR1 and IFNAR2 chains, leading to the expression of a whole set of genes.

There is, however, an alien on the type I IFN locus: IFN-epsilon (IFNE). IFNE shares less than 40% amino acid identity with bona fide type I IFNs, such as IFN-alpha or IFN-beta, but it does still bind to IFNAR, as expected for a type I IFN. However, unlike any of the other family members, it is not induced by the activation of any known pattern-recognition receptor pathway, including Toll-like receptor pathways. In addition, while other type I IFNs are mainly produced by haemopoietic cells, IFNE is constitutively expressed by epithelial cells of the female reproductive tract in humans and mice. At first glance, these observations seem to challenge a potential protective function for IFNE.

In a recent publication, Fung et al. reported that IFNE expression varied approximately 30-fold at different stages of the estrous cycle in the mouse uterus, with the highest levels at estrus (when estrogen levels are high), and was reduced during pregnancy (when progesterone levels are high). Similarly, in the human endometrium, IFNE levels were highest in the proliferative phase of the menstrual cycle and lowest in postmenopausal women (when estrogen levels are low). The suspected hormonal regulation could then be confirmed in mice and in humans: IFNE is induced by estrogens and reduced by progesterone. What about IFNE function? Fung et al. demonstrated that IFNE regulates IFN-regulated genes, including IRF7 and ISG15, as well as 2’-5’-oligoadenylate synthetase. What is more, Ifne-/- female mice, whose vaginas were infected with Chlamydia muridarum or herpes simplex virus 2, had more severe clinical disease than wild-type mice, as well as higher levels of virus or bacteria at defined time points after infection. Hence IFNE seems to play an important – though local – protective role against sexually transmitted infections.

These very interesting observations may have pinpointed the cause of susceptibility to infections of the reproductive tract in women on progesterone-containing contraception, i.e. a progesterone-induced decrease in IFNE expression.

In UniProtKB/Swiss-Prot, IFNE entries have been updated accordingly.

UniProtKB news

Cross-references to GeneWiki

Cross-references have been added to GeneWiki, an initiative that aims to create seed articles for every notable human gene.

GeneWiki is available at http://en.wikipedia.org/wiki/Gene_Wiki

The format of the explicit links in the flat file is:

Resource abbreviation: GeneWiki
Resource identifier: GeneWiki identifier
Example (Q96N67):
DR   GeneWiki; Dock7; -.

Show all the entries having a cross-reference to GeneWiki.

Change of the cross-reference GlycoSuiteDB to UniCarbKB

GlycoSuiteDB, an annotated and curated relational database of glycan structures, has been integrated into UniCarbKB, with a new user interface and added functionalities.

We therefore changed the corresponding resource abbreviation from GlycoSuiteDB to UniCarbKB.

Example (P02763):

Previous flat file format:
DR   GlycoSuiteDB; P02763; -.
New flat file format:
DR   UniCarbKB; P02763; -.
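
Local copies of older flat files can be brought in line with this change by rewriting the resource abbreviation. The Python sketch below is a minimal illustration under the assumption that the line layout is exactly as in the example above; the function name migrate_dr_line is hypothetical.

```python
OLD_PREFIX = "DR   GlycoSuiteDB;"
NEW_PREFIX = "DR   UniCarbKB;"


def migrate_dr_line(line):
    """Rewrite a GlycoSuiteDB DR line to the UniCarbKB abbreviation.

    Lines that are not GlycoSuiteDB cross-references are returned unchanged.
    """
    if line.startswith(OLD_PREFIX):
        return NEW_PREFIX + line[len(OLD_PREFIX):]
    return line


if __name__ == "__main__":
    print(migrate_dr_line("DR   GlycoSuiteDB; P02763; -."))
```

Matching on the full line prefix, rather than replacing the string anywhere in the line, avoids touching free-text comments that happen to mention GlycoSuiteDB.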

UniProtKB/Swiss-Prot is currently linked to this resource from the cross-reference section (DR lines), but we also have some site-specific links from the sequence annotation section (FT CARBOHYD) of relevant UniProtKB/Swiss-Prot entries. An increase in the number of cross-linked entries is planned, including more literature-based glycan data from UniCarbKB.

Removal of the cross-reference to GermOnline

Cross-references to GermOnline have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:
  • Cataract, congenital, cerulean type, 3
  • Cataract, congenital, non-nuclear polymorphic, autosomal dominant
  • Cataract, cortical, age-related, 2
  • Cataract-microcornea syndrome
  • Cataract, sutural, with punctate and cerulean opacities
  • Cataract, zonular
  • Hereditary non-polyposis colorectal cancer 3
  • Leukotriene C4 synthase deficiency
  • Neuropathy, congenital amyelinating
  • Pallido-ponto-nigral degeneration
  • Platyspondylic lethal skeletal dysplasia San Diego type
  • Thromboxane synthetase deficiency
  • Weaver syndrome 2

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-glutamyl N2-arginine
  • 5-glutamyl N2-glutamate

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2013_07

Published June 26, 2013

Headline

How to go green, or red?

Chlorophyll is the major photosynthetic pigment. It performs the essential processes of harvesting light energy in the antenna complexes and transferring this energy to the reaction centers to produce chemical energy.

The chlorophyll molecule is present in all photosynthetic organisms. It is made up of 2 moieties of distinct origin, chlorophyllide and phytol. The early enzymatic steps of chlorophyllide biosynthesis from glutamyl-tRNA to protoporphyrin IX are shared with the heme biosynthesis pathway. Hence, protoporphyrin IX is the last common reactant for the synthesis of both heme and chlorophyll. To produce chlorophyll, a magnesium chelatase (EC=6.6.1.1) inserts Mg(2+) into the protoporphyrin IX ring, while an iron chelatase (EC=4.99.1.1) inserts Fe(2+) into the ring during heme biosynthesis.

In Arabidopsis thaliana, there are 15 enzymes and 27 genes required for chlorophyll biosynthesis from glutamyl-tRNA to chlorophyll b. Nine proteins are encoded by single-copy genes, and the others are encoded by gene families consisting of two to three members. The magnesium chelatase is a complex of three subunits, CHLI, CHLD and CHLH encoded by 4 different genes. As of this release, all 27 proteins are manually annotated in UniProtKB/Swiss-Prot. They all contain the subtopic PATHWAY: Porphyrin-containing compound metabolism; chlorophyll biosynthesis in ‘General annotation (Comments)’ and the keyword Chlorophyll biosynthesis. This keyword also allows the retrieval of additional proteins involved in the regulation of the process or in the biosynthesis of the long phytol side chain, for example.

Enzymes involved in the biosynthesis of the porphyrins, common to both heme and chlorophyll, are also annotated with the comment PATHWAY: Porphyrin-containing compound metabolism; protoporphyrin-IX biosynthesis.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2013_06

Published May 29, 2013

Headline

Back to the wild

Nearly half of our genome consists of mobile elements and their recognizable remnants. These elements are thought to have shaped both our genes and our entire genome, driving genome evolution. However, mobile elements can undergo ‘molecular domestication’, whereby the transposon genes are incorporated into cellular gene expression programs, but are no longer mobile. They can also evolve cellular DNA recombination functions, such as the V(D)J antigen receptor-recombination system. The human genome contains some 50 genes that were derived from transposable elements or transposons, and many are now integral components of cellular gene expression programs.

Human THAP9 is one such transposon-derived gene. It is homologous to Drosophila P element DNA transposase. Both human and Drosophila proteins show a typical site-specific DNA-binding Zn finger domain. Human THAP9 is a single-copy gene and does not contain any terminal inverted repeats or target-site duplications, indicating that it constitutes a bona fide domesticated stationary sequence. It thus came as a surprise that this gene has nevertheless retained the catalytic activity to mobilize P transposable elements in Drosophila and human cells. The physiological relevance of this observation remains elusive, but what is clear is that domesticated transposons may have retained enough “wild” properties to keep our genome on the move.

The human THAP9 entry has been updated accordingly in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to SignaLink

Cross-references have been added to SignaLink, an integrated resource to analyze signaling pathway proteins, cross-talks, transcription factors, miRNAs and regulatory enzymes.

SignaLink is available at http://signalink.org/

The format of the explicit links in the flat file is:

Resource abbreviation: SignaLink
Resource identifier: UniProtKB accession number
Example (Q24306):
DR   SignaLink; Q24306; -.

Show all the entries having a cross-reference to SignaLink.

Removal of the cross-reference to HSSP

Cross-references to HSSP have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:
  • Ichthyosis, lamellar, 1

UniProt release 2013_05

Published May 1, 2013

Headline

Human genetic diseases in UniProtKB/Swiss-Prot

During the past decade, next-generation sequencing (NGS) technologies have accelerated the detection of genetic variants resulting in the rapid discovery of new disease-associated genes. More than 100 causative genes in various Mendelian disorders have been identified by means of whole exome sequencing. However, the wealth of variation data made available by NGS is not sufficient, alone, to understand the mechanisms underlying disease pathogenesis and manifestation. Diseases are the consequences of series of events that include not only primary mutations in disease-causing genes, but also variations in disease-modifying genes, as well as the combined effects of gene-gene and gene-environment interactions. That is why new approaches to unravel disease mechanisms are based on biological network analysis.

In addition to providing a large amount of information on protein functions, interactions and biological pathways, UniProt pays particular attention to the annotation of human genetic diseases and disease-linked variants. Information on genetic diseases is shown in the ‘Involvement in disease’ subsection of the ‘General Annotation (Comments)’ section. In the current release, over 4,600 phenotypes are described in close to 3,000 human entries. The great majority of UniProtKB disease descriptions have links to the Online Mendelian Inheritance in Man knowledgebase (OMIM), allowing users to retrieve more detailed information.

In order to improve the clarity of medical annotation and to facilitate the retrieval of disease information from UniProtKB, we have modified the format of the subsection ‘Involvement in disease’. The newly modified subsection is organized in 2 parts. Firstly, the disease name, acronym and features are defined using a controlled vocabulary. Secondly, the role of the gene/protein in the disease is described in a ‘Note:’, that allows discrimination between disease-causing, disease-modifying and susceptibility genes. This note, partly written in free text, provides information on the biological context or other interesting information that may not be directly related to the phenotype description, such as the involvement of different proteins in the pathological mechanism. For example, multiple sulfatase deficiency (MSD) is due to the simultaneous decrease of activity of all sulfatases. However, the primary cause is a mutation in SUMF1, an enzyme required for post-translational modification and catalytic activation of these enzymes. This additional information is stored in the ‘Involvement in disease’ note.

Genetic diseases annotated in UniProtKB/Swiss-Prot are indexed in the humdisease.txt file, available for our users as of this release. Each record in this file consists of a disease identifier, acronym, and description, as well as known disease synonyms, links to OMIM, Medical Subject Headings (MeSH) and associated UniProtKB keywords.

UniProtKB news

Complete proteomes for Ensembl species

For UniProt release 2013_05, 1 new species from Ensembl vertebrates and 3 new species from Ensembl Genomes have been made available. These are:

  • Felis catus (Cat)
  • Brassica rapa subsp. pekinensis (Chinese cabbage)
  • Hyaloperonospora arabidopsidis (Downy mildew agent)
  • Magnaporthe poae (Kentucky bluegrass fungus)

In addition to the new imports, existing proteomes derived from Ensembl species have been updated with data from Ensembl release 70.

All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genome cross-reference is added.

We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to make proteomes available for the remaining Ensembl and Ensembl Genomes species currently absent from UniProtKB.
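
The stringent mapping rule described above (100% identity over 100% of the length of both sequences) amounts to exact sequence equality. The Python sketch below illustrates that logic on toy data; it is not the UniProt production pipeline, and the function name, identifiers and sequences are all invented for the example.

```python
def map_predictions(predicted, uniprotkb):
    """Map predicted proteins to UniProtKB entries by exact sequence match.

    predicted: dict of predicted protein identifier -> sequence
    uniprotkb: dict of UniProtKB accession -> sequence
    Returns (mapped, to_import): 'mapped' links each predicted identifier
    to the accessions with an identical sequence; 'to_import' lists the
    predicted identifiers with no identical UniProtKB sequence.
    """
    # Index UniProtKB by sequence so each lookup is a dictionary access.
    by_sequence = {}
    for accession, sequence in uniprotkb.items():
        by_sequence.setdefault(sequence, []).append(accession)

    mapped, to_import = {}, []
    for identifier, sequence in predicted.items():
        hits = by_sequence.get(sequence)
        if hits:
            mapped[identifier] = hits
        else:
            to_import.append(identifier)
    return mapped, to_import


if __name__ == "__main__":
    # Toy identifiers and sequences, invented for illustration.
    predicted = {"pred_A": "MKT", "pred_B": "MKTA"}
    uniprotkb = {"ACC1": "MKT", "ACC2": "MKV"}
    print(map_predictions(predicted, uniprotkb))
```

In the toy data, pred_B contains the ACC1 sequence as a prefix but is longer, so it does not map and would be a candidate for import, which mirrors the requirement that identity hold over the full length of both sequences.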

Removal of the cross-reference to GenomeReviews

Cross-references to GenomeReviews have been removed.

Changes to keywords

New keywords:

Modified keyword:

UniProt release 2013_04

Published April 3, 2013

Headline

Major progress in adenovirus annotation

Adenoviruses were first isolated by Wallace Rowe in 1953 from adenoid tissue of sick children. These viruses infect a wide range of vertebrates, including humans. Infectious virions are spread primarily via respiratory droplets; however, they can also be spread by the fecal route. Most infections with Human Adenovirus (HAdV) result in upper respiratory tract diseases; they account for about 10% of acute respiratory infections in children. They can also cause fever, diarrhea, pink eye (conjunctivitis), bladder infection (cystitis), rash illness, etc.

HAdV are medium-sized (90-100 nm), non-enveloped icosahedral viruses composed of a capsid and a double-stranded linear DNA genome. The viral genome is approximately 36 kb long. It encodes 37 proteins which are produced by complex alternative splicing of 6 mRNA transcription units. The viral genome replicates in the host cell nucleus, but never integrates into the host genome. This is the reason why adenoviruses are widely used in gene therapy and anticancer virus vector trials.

The JCVI adenovirus project recently resulted in the sequencing of 150 new HAdV genomes. In order to support the annotation of these new genomes, the community needs a high quality set of data that can serve as a reference. In this context, a collaboration including UniProt, NCBI, JCVI and several field experts has been initiated to update reference adenovirus genomes and proteomes. Gene predictions have been corrected with the most recent proteomic and cDNA sequencing data. This major collaborative effort has resulted in a consistent and up-to-date annotation of the viral genome in NCBI RefSeq and of the HAdV reference proteome in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of MEDLINE identifiers

We have removed the MEDLINE identifiers from the bibliographic database cross-references of literature citations since they have been superseded by PubMed identifiers. The valid bibliographic database names and their associated identifiers are now:

Name        Identifier
PubMed      PubMed Unique Identifier (PMID)
DOI         Digital Object Identifier (DOI)
AGRICOLA    AGRICOLA Unique Identifier

UniProt release 2013_03

Published March 6, 2013

Headline

Latest from the prokaryotic world: bacterial Cas9, a new tool for genome engineering

The CRISPR system (Clustered Regularly Interspaced Short Palindromic Repeat) is a bacterial and archaeal, RNA-based adaptive immune system, which degrades invading genetic material. Very briefly, invading viruses or plasmids are recognized by their complementarity to CRISPR RNA (crRNA) and degraded by dedicated nucleases.

There are 3 major CRISPR systems, with a growing number of recognized subtypes depending on the Cas proteins (CRISPR-associated proteins) used to carry out the various steps of crRNA generation and invading nucleic acid destruction. In type I and III CRISPR systems, different specialized Cas endonucleases generate crRNAs, which then assemble with other Cas proteins to create large crRNA-protein complexes that recognize and degrade invading nucleic acids complementary to the crRNA. Type II CRISPR systems are a little different. In these systems, correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous RNase III and the Cas9 protein. The tracrRNA serves as a guide for RNase III-aided processing of pre-crRNA. Subsequently the Cas9/crRNA/tracrRNA complex endonucleolytically cleaves linear or circular dsDNA targets complementary to the crRNA. Degradation requires the Cas9 protein and both RNA species. Thus, in type II CRISPR systems, crRNA-guided degradation of DNA relies upon a single protein. This discovery has implications beyond the world of bacteria. Expressing Cas9 with specifically chosen crRNA should allow site-specific genome modifications, knocking out genes on demand not only in bacteria, where it is already relatively simple to do so, but also in higher organisms, such as vertebrates.

And indeed it works! In 2 back-to-back Science articles published online in January of this year, Streptococcus pyogenes strain SF370 Cas9 endonuclease was codon-optimized and targeted to the nucleus in human or mouse cells. In one article, RNase III was engineered in a similar fashion and the tracrRNA and pre-crRNA were expressed either separately or as a hybrid molecule, whereas in the other, only a hybrid crRNA-tracrRNA was expressed. In both papers, various gene targets were cloned into the crRNA locus, leading to site-specific target cleavage, which was subsequently repaired by either nonhomologous end-joining or homologous recombination. While the efficiency of the process varies, introducing multiple targets within a single gene or targeting multiple genes at a time is feasible, allowing for comparatively easy manipulation of a genome of interest. Additionally, no toxicity has been observed upon expression in human cells.

A similar approach has been successfully used not only in other bacteria, but also in zebrafish, as well as in different human cell lines.

The work described above has been carried out using Cas9 from Streptococcus pyogenes strain SF370, and the corresponding UniProtKB/Swiss-Prot entry has been updated, as have the experimentally characterized orthologous proteins in other bacteria (Streptococcus thermophilus strain DGCC7710, Streptococcus thermophilus strain ATCC BAA-491 / LMD-9 and Listeria innocua serovar 6a strain CLIP 11262). Additionally, a new HAMAP rule has been made for the Cas9 family (MF_01480).

UniProtKB news

Cross-references to ChiTaRS

Cross-references have been added to ChiTaRS, a database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data.

ChiTaRS is available at http://chitars.bioinfo.cnio.es/

The format of the explicit links in the flat file is:

Resource abbreviation ChiTaRS
Resource identifier gene name
Optional information 1 organism name
Example P16320:
DR   ChiTaRS; ATP6AP1; drosophila.
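The explicit link format above can be read with a few lines of code. The following is a minimal sketch, assuming only what the examples in these release notes show: fields separated by semicolons, a terminating period, and a dash for "no information". The function name and the returned dictionary layout are illustrative, not part of any UniProt tool.

```python
def parse_dr_line(line):
    """Parse one flat-file DR line into resource, identifier and optional fields."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    # Every DR line shown in these notes ends with a period; drop it.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier: %r" % line)
    return {
        "resource": fields[0],
        "identifier": fields[1],
        # A dash stands for an empty optional field.
        "optional": [None if field == "-" else field for field in fields[2:]],
    }


chitars = parse_dr_line("DR   ChiTaRS; ATP6AP1; drosophila.")
sabio = parse_dr_line("DR   SABIO-RK; P10172; -.")
```

The same function handles cross-references with more than one optional field, such as the Gene3D lines, because everything after the identifier is kept as a list.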

Show all the entries having a cross-reference to ChiTaRS.

Cross-references to SABIO-RK

Cross-references have been added to SABIO-RK, a database of biochemical reaction kinetics.

SABIO-RK is available at http://sabiork.h-its.org/

The format of the explicit links in the flat file is:

Resource abbreviation SABIO-RK
Resource identifier UniProtKB accession number
Example P10172:
DR   SABIO-RK; P10172; -.

Show all the entries having a cross-reference to SABIO-RK.

Removal of the cross-reference to 8 2D gel databases

Cross-references to 2DBase-Ecoli, Aarhus/Ghent-2DPAGE, ANU-2DPAGE, Cornea-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE, and Rat-heart-2DPAGE have been removed.

Removal of the cross-reference to AGD

Cross-references to AGD have been removed.

Gene3D

The Gene3D database no longer provides names for its signatures. The entry name previously displayed in the cross-references has therefore been replaced by a dash (’-’).

Examples: Q12933:

Previous format:
DR   Gene3D; 2.60.210.10; TRAF-type; 1.
DR   Gene3D; 3.30.40.10; Znf_RING/FYVE/PHD; 1.
New format:
DR   Gene3D; 2.60.210.10; -; 1.
DR   Gene3D; 3.30.40.10; -; 1.
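For anyone holding files in the previous format, the change can be reproduced with a small rewrite step. This is only a sketch based on the two example lines above; the function name is hypothetical, and it assumes Gene3D DR lines always carry exactly four "; "-separated fields.

```python
def to_new_gene3d_format(line):
    """Replace the entry-name field of a Gene3D DR line with a dash."""
    parts = line.rstrip().split("; ")
    head = parts[0].split()
    # Leave anything that is not a four-field Gene3D DR line untouched.
    if len(parts) != 4 or head != ["DR", "Gene3D"]:
        return line
    parts[2] = "-"
    return "; ".join(parts)


old = "DR   Gene3D; 2.60.210.10; TRAF-type; 1."
new = to_new_gene3d_format(old)
```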

Changes to keywords

New keyword: Modified keywords:

UniProt release 2013_02

Published February 6, 2013

Headline

The smoke's devils

The first written evidence of the therapeutic and psychoactive use of Cannabis is attributed to the legendary emperor of China Shen-nung who lived some 5,000 years ago. He stated in his famous herbal “Pen-ts’ao Ching” that “the fruits of hemp, if taken in excess will allow ‘seeing devils’. If taken over a long term, it makes one communicate with spirits and lightens one’s body” (in An archaeological and historical account of Cannabis in China). Until 1942, Cannabis was listed in the United States Pharmacopoeia and it was only in 1971 that most European countries banned Cannabis by adopting the Convention on Psychotropic Substances established by the United Nations.

Although marijuana has been used for centuries, the biological processes underlying its psychoactive effects have long remained a mystery. It is only recently that the cannabinoid biosynthetic pathway has been elucidated. The production of Cannabis’ major psychoactive ingredient, delta-9-tetrahydrocannabinol (THC), starts with the condensation of hexanoyl-CoA with three molecules of malonyl-CoA to yield olivetolic acid (OA). It was postulated that a type III polyketide synthase (PKS) catalyzes this reaction, although all type III PKSs from Cannabis characterized so far were only able to produce byproducts instead of OA. A few months ago, it was shown that the inability of OLS/TKS, a cloned tetraketide synthase, to synthesize OA was due to the absence of an accessory protein, olivetolic acid cyclase. In the presence of olivetolic acid cyclase, OA is synthesized. It is then geranylated to form cannabigerolic acid, which is further converted by oxidocyclase enzymes to the major cannabinoids, delta-9-tetrahydrocannabinolic acid (THCA) in “drug-type” Cannabis and cannabidiolic acid (CBDA) in “fiber-type” Cannabis. THCA and CBDA are decarboxylated by a non-enzymatic reaction during storage or smoking to give rise to their chemically neutral forms, THC (the neurologically active substance) and CBD, respectively.

Thanks to the recent publication of the complete sequence of the genome of Cannabis sativa, most enzymes involved in the THC/CBD biosynthetic pathway have been identified and manually annotated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to mycoCLAP

Cross-references have been added to mycoCLAP, a database of fungal genes encoding lignocellulose-active proteins.

mycoCLAP is available at https://mycoclap.fungalgenomics.ca/mycoCLAP/

The format of the explicit links in the flat file is:

Resource abbreviation mycoCLAP
Resource identifier mycoCLAP identifier
Example P55296:
DR   mycoCLAP; MAN26A_PIRSP; -.

Show all the entries having a cross-reference to mycoCLAP.

Changes to keywords

New keyword: Modified keywords:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • (3R)-3-hydroxyarginine
  • (3S)-3-hydroxyhistidine

UniProt release 2013_01

Published January 9, 2013

Headline

Hereditary sensory and autonomic neuropathy type IA: New dietary hope?

Hereditary neuropathies are common neurological conditions characterized by progressive loss of motor and/or sensory function. There are no effective treatments. HSAN1A is one of many hereditary peripheral neuropathies, characterized by axonal degeneration and disappearance of myelin sheaths. The prominent feature of this pathology is sensory abnormalities with a variable degree of motor and autonomic dysfunction. HSAN1A patients most frequently present with decreased sensation in the feet, as well as painless blisters and ulcers, often preceded by hyperpathia and spontaneous shooting or lancinating pain. The loss of sensation, especially pain, leads to the horrible complications of unheeded infections and painless ulcers that can result in amputations of the affected extremities.

The culprits are mutations in the SPTLC1 gene. SPTLC1 is a subunit of serine palmitoyltransferase. It catalyzes the condensation of serine and palmitoyl-CoA, the initial step in the de novo synthesis of sphingolipids.

The most frequent HSAN1A mutation is found at position 133 where a cysteine residue is substituted by a tryptophan (C133W). This mutation induces a shift in the substrate specificity, allowing the condensation of alanine or glycine, instead of serine, and subsequent formation of 2 atypical deoxysphingolipids: 1-deoxy-sphinganine and 1-deoxymethylsphinganine, respectively. These metabolites lack the C1 hydroxyl group of sphinganine and can therefore neither be converted to complex sphingolipids, nor degraded by the classical catabolic pathway. Accumulation of these metabolites is toxic for sensory neurons.

In cultured cells, as well as in transgenic mice, a serine-enriched medium/diet can force the defective enzyme to use serine, hence restoring the original reaction. A pilot study in 14 human patients showed a marked decrease in plasma deoxysphingolipid levels. Unfortunately, only the biochemical effects of the diet were evaluated, while the neurological outcome was not assessed. In addition, the number of patients is too small to draw any conclusion, but it opens a door for a new potentially efficient and simple treatment for a specific type of hereditary neuropathy.

Missense neutral polymorphisms and disease-causing mutations are annotated in UniProtKB/Swiss-Prot in ‘Sequence annotation (Features)’. The SPTLC1 variant C133W has now joined some 68,000 polymorphisms reported in the knowledgebase.

UniRef news

Modification of the UniRef clustering algorithm

UniRef clusters are formed in a hierarchical fashion by the serial application of the CD-HIT algorithm to sequences from UniProtKB and selected UniParc entries. Identical sequences (and sub-fragments) are first clustered to form UniRef100. Then the longest sequence is selected from each UniRef100 cluster as input for clustering in UniRef90. Each UniRef90 cluster in turn provides its longest sequence as input for clustering in UniRef50.
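The hand-over between levels described above (each cluster passes its longest sequence to the next clustering round) can be sketched as follows. This is an illustration only: the function name is hypothetical, the clustering step itself (CD-HIT) is not reproduced, and the tie-breaking rule for equally long sequences is an assumption, since the text does not specify one.

```python
def longest_representatives(clusters):
    """Return the longest member sequence of each cluster.

    `clusters` is a list of lists of sequences. Ties are resolved in favour
    of the first member encountered (an assumption, not a documented rule).
    """
    return [max(members, key=len) for members in clusters]


# Toy UniRef100-like clusters: identical sequences and sub-fragments grouped together.
uniref100_like = [["MKTAYIAK", "MKTAY"], ["MSSLLQ"]]
uniref90_input = longest_representatives(uniref100_like)
```

The resulting list would be the input to the next clustering round; applying the same function to the clusters of that round would give the input for the one after.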

Until now, UniRef90 and UniRef50 clusters have been computed using only identity thresholds of 90% and 50%, respectively. Starting with the first release of 2013, an 80% overlap threshold will also be used for the computation of UniRef90 and UniRef50 clusters. This means that the longest (seed) sequence of each UniRef90 and UniRef50 cluster will have a minimum length overlap of 80% with each of the other member sequences.

Our motivations for introducing this overlap threshold were:
  • to create tighter clusters to support use cases such as sequence similarity searches
  • to improve cluster computation performance by avoiding false positive sequence alignments that arise during clustering

Based on our analyses, this change will have a minimal impact on existing cluster topologies (less than 5% increase in the number of clusters and less than 2% change in representative sequences) and will at the same time provide a more than five-fold gain in computation time for UniRef50.
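To make the combined criterion concrete, here is a minimal sketch of a membership test that applies both an identity threshold and the 80% overlap threshold. It rests on two assumptions that the text does not spell out: overlap is measured as aligned length divided by seed length, and identity as identical positions divided by aligned length. The function and parameter names are hypothetical.

```python
def passes_thresholds(seed_length, aligned_length, identical_positions,
                      min_identity, min_overlap=0.80):
    """Check whether a candidate member may join the cluster of a seed sequence."""
    if seed_length <= 0 or aligned_length <= 0:
        return False
    overlap = aligned_length / seed_length           # assumed definition of overlap
    identity = identical_positions / aligned_length  # assumed definition of identity
    return identity >= min_identity and overlap >= min_overlap


# A short, highly similar fragment: 93% identical but covering only 75% of the seed.
short_fragment = passes_thresholds(200, 150, 140, min_identity=0.90)
# A longer alignment covering 85% of the seed with 94% identity.
long_alignment = passes_thresholds(200, 170, 160, min_identity=0.90)
```

Under these assumptions the first candidate would have been accepted by the identity-only rule but is rejected once the overlap threshold applies, which is the kind of tightening described above.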

UniProt release 2012_11

Published November 28, 2012

Headline

RALF, a growing family of plant peptide hormones

The first plant peptide hormone to be identified was systemin. Systemin regulates systemic wound signaling during herbivore and pathogen attacks. Since its discovery in 1991, several other polypeptide signals have been reported in plants, including phytosulfokines and CLAVATA3 and CLAVATA3-related proteins.

In 2001, Pearce et al. used a cell suspension culture assay to identify polypeptide hormones in plant extracts that cause alkalinization of the medium. In addition to systemins, the authors isolated a 5-kDa polypeptide from tobacco leaves that induced rapid alkalinization of the culture medium and the concomitant activation of an intracellular mitogen-activated protein kinase. The peptide has been called RALF for Rapid ALkalinization Factor. The 49-amino-acid-long active peptide is produced by processing of a 115-amino-acid-long preprotein. Genes encoding RALF preproproteins are expressed in various tissues and organs in many different plant species. In Arabidopsis thaliana, the RALF family consists of 36 members. As in tobacco, they are produced by the processing of precursors containing signal peptides and, for some of them, the cleavage of an additional propeptide is required. The presence of disulfide bonds contributes to their stabilization after secretion. One member of the family, RALF1, has been shown to induce an intracellular Ca(2+) increase, likely caused by both Ca(2+) influx across the plasma membrane and release of Ca(2+) from intracellular stores. This mechanism could be common to other RALFs.

Further studies are needed for a better understanding of RALF functions, but as of this release, all Arabidopsis thaliana RALF family members have been manually annotated with all available information.

UniProtKB news

Cross-references to ChEMBL

Cross-references have been added to ChEMBL, a database of bioactive drug-like small molecules.

ChEMBL is available at https://www.ebi.ac.uk/chembldb

The format of the explicit links in the flat file is:

Resource abbreviation ChEMBL
Resource identifier ChEMBL identifier
Example P69332:
DR   ChEMBL; CHEMBL4259; -.

Show all the entries having a cross-reference to ChEMBL.

Cross-references to PaxDb

Cross-references have been added to PaxDb (Protein Abundance Across Organisms), a comprehensive absolute protein abundance database, which contains whole genome protein abundance information across organisms.

PaxDb is available at http://pax-db.org

The format of the explicit links in the flat file is:

Resource abbreviation PaxDb
Resource identifier UniProtKB accession number
Example P85829:
DR   PaxDb; P85829; -.

Show all the entries having a cross-reference to PaxDb.

Removal of the cross-reference to ECO2DBASE

Cross-references to ECO2DBASE have been removed.

Removal of the cross-reference to TIGR

Cross-references to TIGR have been removed.

New format of the documentation files yeast.txt, yeast chromosome files, pombe.txt and calbican.txt

UniProtKB provides documentation files for some key species. These files list the relevant UniProtKB/Swiss-Prot entries with information like the primary accession number and entry name, gene designations, protein length, cross-references to organism-specific databases and whether a 3D structure is available or not.

We have slightly changed the file format so that all information from one protein is now found on a single line, which should make it easier to parse these files.

The following files are affected by this change:

Yeast
Yeast chromosome I
Yeast chromosome II
Yeast chromosome III
Yeast chromosome IV
Yeast chromosome V
Yeast chromosome VI
Yeast chromosome VII
Yeast chromosome VIII
Yeast chromosome IX
Yeast chromosome X
Yeast chromosome XI
Yeast chromosome XII
Yeast chromosome XIII
Yeast chromosome XIV
Yeast chromosome XV
Yeast chromosome XVI
Candida albicans
Schizosaccharomyces pombe

UniProt release 2012_10

Published October 31, 2012

Headline

CIA: on your Genome service

Life evolved in an anaerobic world and it is thought that iron-sulfur (Fe-S) clusters played a crucial role in this process by facilitating chemical transformations. Once photosynthesis evolved, oxygen became prevalent, threatening Fe-S clusters as they are susceptible to destruction by oxidation. Despite this potential problem, Fe-S clusters are still cofactors in hundreds of proteins. They are required in virtually all organisms from bacteria to humans and are involved not only in ‘redox’ catalysis in some enzymes, but also in many other functions. Interestingly, Fe-S clusters have been found in many proteins involved in DNA repair and replication and telomere length maintenance.

In eukaryotic cells, most biosynthesis of Fe-S clusters occurs in the mitochondria, but it may also occur in the cytosol and nucleus. In the cytosol, Fe-S clusters are escorted and presented to their cytoplasmic and nuclear apoproteins by the conserved cytoplasmic iron-sulfur assembly (CIA) machinery. However, it is not clear how Fe-S clusters are transferred to target apoproteins, nor how target specificity is achieved.

Two recent and elegant publications have shown that the MMS19 protein is associated with the CIA machinery. This protein also binds a subset of cellular Fe-S proteins, specifically nuclear ones involved in DNA metabolism, including the DNA helicases RTEL1, ERCC2 and ERCC3. MMS19 is required for in vivo incorporation of iron into various DNA repair enzymes and, in the absence of MMS19, cells become more sensitive to DNA damage. The authors suggest that MMS19 functions as a platform to facilitate Fe-S cluster transfer to proteins critical for DNA replication and repair. These experiments point to the importance of Fe-S clusters for the maintenance of genome integrity and imply a central role for mitochondria in genomic DNA metabolism.

In spite of their interest, we might have missed these publications, if not for one of our users who contacted us asking for their review and integration into UniProtKB/Swiss-Prot. We immensely value feedback and update requests and we would like to thank all users who are taking time to help us improve UniProtKB. If you would like to contribute, please use the ‘Send feedback’ button in the clickable box found at the top-right corner of each entry and we will handle your request with high priority.

UniProtKB news

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N6-crotonyl-L-lysine

UniProt release 2012_09

Published October 3, 2012

Headline

New discovery for an old virus: PA-X, influenza’s twelfth protein

Influenza A virus (IAV) remains a major cause of human mortality and morbidity due to its remarkable genetic variability, which limits vaccine effectiveness. Understanding the determinants of influenza virus molecular biology is a fundamental step for effective control of viral epidemics, and this virus has been the subject of intensive research efforts for more than 60 years. The viral RNA genome was first sequenced in the early 1980s. It comprises eight segments totaling 13.5 kb, which were thought to encode eleven proteins. The eleventh protein, PB1-F2, was characterized in 2001. Coinciding with the 30th anniversary of the first segment 3 sequence, Jagger et al. have published in Science the identification of the twelfth protein of influenza A virus. This protein is expressed through unusual ribosomal frameshifting in the polymerase acidic (PA) protein open reading frame encoded on segment 3. The frameshift product, called PA-X, comprises the endonuclease domain of the viral PA protein with a C-terminal domain encoded by the X-ORF. It represses cellular gene expression and, in a mouse infection model, modulates IAV virulence, acting to decrease pathogenicity. It is not surprising to discover a new open reading frame in a small RNA virus 30 years after it was first sequenced, because non-structural viral proteins are often difficult to identify in the midst of host proteins. Moreover, PA-X expression relies on an unusual ribosomal frameshift which could not be predicted. This new finding will allow a better understanding of host-virus interactions and improve the surveillance of new outbreaks.

As of this release, 87 new PA-X entries have been manually annotated in UniProtKB/Swiss-Prot.

UniProtKB news

Changes to keywords

New keywords: Modified keyword:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N6-(3,6-diaminohexanoyl)-5-hydroxylysine

UniProt release 2012_08

Published September 5, 2012

Headline

Prokaryotes do it too: CRISPR, an RNA-based adaptive immune system in UniProt

Like all other cellular organisms, bacteria and archaea are constantly bombarded by viruses, and unlike eukaryotes, many are also susceptible to infective plasmids. While we have known about defenses such as restriction-modification systems, blockage of adsorption and/or DNA injection and abortive infection for some time, new ways in which bacteria and archaea defend themselves against these infective agents have been found more recently. One of these relies on clustered regularly interspaced short palindromic repeat (CRISPR) sequences. CRISPR is an RNA-based adaptive immune system, which degrades invading genetic material. The system is mechanistically different from eukaryotic RNA interference (RNAi) and the proteins involved in prokaryotes are not homologous to those in eukaryotes (review).

CRISPRs are repetitive loci on the genome consisting of unique sequences 20-50 bases long (the spacer sequences) interspaced with repeated sequences of about the same length. Examination of the spacer sequences has shown that some are identical to viral and plasmid sequences; they are thought to serve as a “memory” of a previous infection. Bacteria and archaea can have from 0 to 18 CRISPR loci, with between 2 and 249 repeat-spacer units. While many pathogens have CRISPR loci, obligate parasites do not. The CRISPR loci are transcribed and processed to give short CRISPR-derived RNA (crRNA) complementary to a previously-encountered infective agent. It is this crRNA that is at the heart of adaptive prokaryotic immunity.
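The repeat-spacer layout described above can be illustrated with a toy example. The sketch below assumes that all repeats in a locus are perfectly identical and that the repeat never occurs inside a spacer, which real loci do not guarantee; the sequences and the function name are invented for illustration.

```python
def extract_spacers(locus, repeat):
    """Return the spacer sequences lying between consecutive copies of `repeat`."""
    parts = locus.split(repeat)
    # parts[0] and parts[-1] are the flanking regions outside the repeat array.
    return [part for part in parts[1:-1] if part]


# Toy locus: flank + (repeat + spacer) units + final repeat + flank.
repeat = "GTTTTAGAGC"
spacer_1 = "ACGTACGTACGTACGTACGTAA"
spacer_2 = "TTGACCATGGCATTCAGGCA"
locus = "AAA" + repeat + spacer_1 + repeat + spacer_2 + repeat + "CCC"
spacers = extract_spacers(locus, repeat)
```

Each recovered spacer could then be compared against viral or plasmid sequences, which is how the "memory" interpretation mentioned above was reached.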

There are a large number of proteins associated with CRISPR loci, the operon-encoded CRISPR-associated or Cas proteins. The Cas proteins present in each locus have allowed the definition of 3 major CRISPR-Cas systems with further division into a number of subtypes. The number of subtypes will probably continue to increase as more prokaryotic genomes are fully sequenced. Many of these proteins are predicted to be nucleases, helicases and/or RNA binding proteins as is to be expected given the function of CRISPR.

There are 3 stages in CRISPR-Cas mediated immunity:

Stage 1, adaptation or acquisition, is the least well characterized. A short piece of DNA homologous to an invading agent is integrated into the 5’ end of a CRISPR locus. This requires the metal-dependent Cas1 endoribonuclease, the only Cas protein found in all organisms with CRISPR loci, although almost all organisms also encode Cas2, another metal-dependent endoribonuclease which is also thought to be involved in adaptation.

Stage 2, expression or crRNA biogenesis, requires transcription and processing of the CRISPR loci to produce the crRNA. Type I CRISPR systems use one of the related, metal-independent Cas6, Cas6e or Cas6f endoribonucleases to process the precursor, while type III systems use endogenous RNase III to generate the crRNA. It is not yet known which protein produces crRNA in type II systems.

Stage 3, interference, is the destruction of the target (be it virus or plasmid) and is performed by a complex of crRNA and proteins. While this complex is generally thought to recognize invading DNA, the type III-B CRISPR system of Pyrococcus furiosus cleaves target RNA.

While CRISPR-Cas systems can now be assumed to be involved in adaptive immunity, there are tantalizing hints that they may perform other functions as well. In Pseudomonas aeruginosa UCBPP-PA14, the type I CRISPR system does not confer resistance to phages DMS3 or MP22, but is required for DMS3-dependent inhibition of biofilm formation and possibly motility, while in Myxococcus xanthus, a CRISPR system is involved in the regulation of fruiting body development.

We have recently annotated and updated characterized Cas proteins in UniProtKB/Swiss-Prot, although the field moves so quickly that it is impossible to be fully up-to-date with all the latest research. All manually annotated CRISPR-associated protein entries can be retrieved from UniProtKB/Swiss-Prot using the query term ‘CRISPR’ in ‘Protein name’.

UniProtKB news

Cross-references to GenomeRNAi

Cross-references have been added to GenomeRNAi, a database containing phenotypes from RNA interference (RNAi) screens in Drosophila and Homo sapiens.

GenomeRNAi is available at http://genomernai.de/GenomeRNAi/

The format of the explicit links in the flat file is:

Resource abbreviation GenomeRNAi
Resource identifier GenomeRNAi identifier
Example Q9BXP5:
DR   GenomeRNAi; 51593; -.

Show all the entries having a cross-reference to GenomeRNAi

Cross-references to UniPathway

Cross-references have been added to UniPathway, a fully manually curated resource for the representation and annotation of metabolic pathways.

UniPathway provides explicit representations of enzyme-catalyzed and spontaneous chemical reactions, as well as a hierarchical representation of metabolic pathways. All of the pathway data in UniPathway has been extensively cross-linked to existing pathway resources such as KEGG and MetaCyc, as well as sequence resources such as UniProtKB, for which UniPathway provides a controlled vocabulary for pathway annotation.

The format of the explicit links in the flat file is:

Resource abbreviation UniPathway
Resource identifier UniPathway pathway ID (UPA)
Optional information UniPathway enzymatic reaction ID (UER)
Examples Q8LL69:
DR   UniPathway; UPA00842; -.
Q9M6F0:
DR   UniPathway; UPA00842; UER00808.

Show all the entries having a cross-reference to UniPathway

Changes to keywords

New keywords:

UniProt release 2012_07

Published July 11, 2012

Headline

To pee or not to pee

There is a season and a time for every purpose. There is a time to sleep, and Nature has done its best to keep it from being interrupted by an urgent need to urinate. During a sound sleep, healthy humans produce less urine than during the daytime and also store more urine, as if the bladder had an increased capacity at night. This is not simply due to the fact that we usually drink less at night, since temporal variation in urine production is maintained in subjects who take food and drink evenly over 24 hours. This phenomenon is also observed in rodents, with an inverted clock, the active phase being at night and the resting phase during the day.

The contraction of smooth muscles of the urinary bladder on a sensation of fullness leads to micturition. This event is precisely controlled by regulation of the central and peripheral nerves. It has previously been reported that an increase in connexin-43/GJA1 enhances intercellular electrical and chemical transmission and sensitizes the response of bladder muscles to cholinergic neural stimuli. Connexin-43 is a gap junction protein expressed in the urinary bladder. Gap junctions are channels that directly connect the cytoplasm of two cells, allowing various molecules and ions to pass freely between cells and hence establishing a direct chemical and electrical communication between cells. An increase in connexin-43 levels leads to enhanced intercellular communication and a better response of bladder smooth muscle cells to signals from the nervous system.

Does connexin-43 link urinary bladder capacity to the circadian clock? The answer came from a recent publication by Negoro et al. The authors measured micturition frequency and urine volume using wild-type and heterozygous connexin-43 knockout mice. Both genotypes show the typical day/night variation, but, while the total urine volume is not significantly different, the heterozygous connexin-43 knockout animals exhibit a higher urine volume voided per micturition. This suggests that connexin-43 does not influence the urine volume, but determines the functional capacity of the urinary bladder. Interestingly, connexin-43 expression exhibits a circadian rhythm. mRNA levels peak at the beginning of the active phase and drop by the end of the night, closely followed by protein levels. Circadian connexin-43 expression seems to be transcriptionally regulated by the direct binding of NR1D1/Rev-erbA-alpha to SP1 sites in a biological clock-dependent manner. Connexin-43 expression levels closely correlate with cell-cell communication rates and show an inverse correlation with urine volume per micturition.

Now the pieces of the puzzle give a coherent, although probably still partial, picture: bladder muscle cells have an internal rhythm that generates an oscillation in gap junction function. During the active phase, the intercellular communication is optimal, the sensation of bladder fullness is readily perceived, and animals frequently urinate small volumes. When resting, the decrease in gap junctions leads to a decreased sensitivity to neuronal signals and hence to an increase in bladder capacity. This limits disturbance of sleep by micturition.

As of this release, this new information has been annotated in connexin-43/GJA1 UniProtKB/Swiss-Prot entries.

UniProtKB news

Removal of the cross-reference to CMR

Cross-references to CMR have been removed.

Changes to keywords

New keywords: Modified keywords: Deleted keyword:
  • Dephosphorylation of host translation factors by virus

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2012_06

Published June 13, 2012

Headline

Fungal prion proteins – disease or evolutionary motor?

The word “prion”, coined in 1982 by Stanley B. Prusiner, is derived from the words “protein” and “infection”. It is used to describe the infectious, non-chromosomal genetic elements that are at the heart of the mammalian transmissible spongiform encephalopathies (TSEs, including scrapie of sheep, “Mad cow disease”, and Creutzfeldt-Jakob disease of humans). It is believed that these diseases are caused by the self-propagating conformational change of a protein, PRNP, or its assembly into an amyloid form.

A prion is an infectious agent made of a protein in a misfolded form. This altered inactive form converts its normal active counterpart into the same inactive form. Three distinct genetic criteria have been defined that must be satisfied by a prion:
  1. “Curing” of a prion is reversible. In the appropriate conditions, for instance in the absence of a specific molecular chaperone, the protein can reacquire its active conformation. The prion form can nonetheless arise again de novo because the protein is still present in the cell.
  2. Overproduction of the protein should increase its frequency of conversion to the prion (infectious) form, whatever the mechanism.
  3. Prions being inactive forms of physiologically occurring proteins, the protein-encoding gene should be necessary for propagation of the prion, and inactivating mutations of this gene could produce a phenotype similar to that observed in the presence of the prion.

Based on these criteria, prion proteins were also identified in fungi, primarily in the yeast Saccharomyces cerevisiae, for example the well-studied [PSI+], [URE3], and [PIN] prions. These classical, amyloid-forming prion proteins provide an excellent model for the understanding of the disease-forming mammalian prions.

In recent years, several additional fungal prion proteins have been identified. Their study provided 2 fundamental insights into prion biology. First, a protein does not need to form amyloid aggregates to be infectious. Other mechanisms like covalent autoactivation of an enzyme ([beta]) or even the interaction between two proteins ([GAR+]) can turn proteins into prions. But even more interesting is the fact that some of the fungal prions are not associated with any disease state, but may even have a beneficial role for the host. The Podospora anserina [Het-s] prion confers heterokaryon incompatibility, a process that ensures that during spontaneous, vegetative cell fusion only compatible cells from the same colony survive (non-self-recognition). In S. cerevisiae, the prevalence of transcriptional regulators (Cyc8, Mot3, Sfp1, Swi1 and Ure2) among the yeast prions led to the speculation that prion properties of transcription factors may generate an optimized phenotypic heterogeneity that buffers yeast populations against diverse environmental insults. Even more recent results on the adaptation of cells to anti-fungal drugs by the prion form of the mitochondrial tRNA dimethylallyltransferase ([MOD+]) show that this may also be true for enzymes and support the hypothesis that fungal prions may be beneficial for the host and contribute to cellular adaptation in living organisms.

As of this release, all prion-forming fungal proteins known to date have been reviewed and updated, with a special emphasis put on the prion-forming mechanism and on the consequences and phenotypes of the intracellular prion form. To make a clear distinction between the prion form characteristics and the physiological properties of the soluble cellular protein, the annotations dealing with the prion form have been integrated in a separate subsection (‘Miscellaneous’) in ‘General annotation (Comments)’.

Fungal prion-forming proteins can be retrieved using the keyword ‘Prion’.

UniProtKB news

Complete proteomes for Ensembl Genomes species

Ensembl Genomes species were made available for the first time in UniProt release 2012_04.

For UniProt release 2012_06, 5 new Ensembl Genomes species have been made available:

Amphimedon queenslandica
Gibberella zeae
Brachypodium distachyon
Glycine max
Oryza glaberrima

All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genome cross-reference is added.
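The mapping rule described above (100% identity over 100% of the length of both sequences) amounts to exact string equality between the two sequences. The following is a minimal sketch of that idea, not the actual UniProt pipeline: the function name `map_predictions`, the dictionary inputs and all identifiers in the usage note are hypothetical and chosen only for illustration.

```python
def map_predictions(predicted, uniprotkb):
    """Map predicted sequences to existing entries by exact sequence match.

    predicted: dict of hypothetical gene-model id -> amino acid sequence
    uniprotkb: dict of accession -> amino acid sequence

    Returns (mapped, to_import), where `mapped` links each predicted id to
    the accessions with an identical sequence, and `to_import` lists the
    predicted ids that have no identical counterpart.
    """
    # Index existing entries by sequence so each lookup is a dictionary hit.
    by_sequence = {}
    for accession, sequence in uniprotkb.items():
        by_sequence.setdefault(sequence.upper(), []).append(accession)

    mapped = {}
    to_import = []
    for predicted_id, sequence in predicted.items():
        # Exact equality is the same as 100% identity over 100% of both lengths.
        hits = by_sequence.get(sequence.upper())
        if hits:
            mapped[predicted_id] = sorted(hits)
        else:
            to_import.append(predicted_id)
    return mapped, to_import
```

With made-up inputs such as `map_predictions({"g1": "MKT", "g2": "MKTA"}, {"X00001": "MKT", "X00002": "MKTAA"})`, the prediction `g1` maps to `X00001`, while `g2` has no identical counterpart and would be a candidate for import into UniProtKB/TrEMBL.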

We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to make proteomes available for the remaining Ensembl Genomes species currently absent from UniProtKB.

Genome submission for Bos taurus updated to be in line with Ensembl

The underlying genome submission for Bos taurus has been updated to be in sync with the third party assembly of the genome used by Ensembl for their annotations. For details of the Ensembl assembly for Bos taurus, see the Ensembl website.

Changes to cross-reference to PhosSite

The resource identifiers of the cross-references to the Phosphorylation Site Database for Archaea and Bacteria (PhosSite) have changed from a UniProtKB primary accession number to a Phosphorylation Site Database unique identifier for a phosphoprotein.

Example:
Previous format:
DR   PhosSite; P08839; -.
New format:
DR   PhosSite; P0810428; -.

Show all the entries having a cross-reference to PhosSite.
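For scripts that read these cross-references from the flat file, the DR line shown in the example can be split into its resource abbreviation and resource identifier. This is a minimal sketch that assumes only the layout visible in the example above (the line code `DR` followed by three spaces, fields separated by semicolons, and a closing full stop); the function name `parse_dr_line` is hypothetical.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, identifier, other_fields).

    Assumes the layout seen in the example: 'DR   Resource; Identifier; -.'
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    # Drop the full stop that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line has no identifier: %r" % line)
    return fields[0], fields[1], fields[2:]


resource, identifier, rest = parse_dr_line("DR   PhosSite; P0810428; -.")
print(resource, identifier, rest)
```

Code that previously treated the PhosSite identifier as a UniProtKB accession number would need to treat it as an opaque PhosSite identifier after this change.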

UniProt Gene Ontology Annotation

UniProt is a central member of the Gene Ontology Consortium, an initiative founded in 1998 to develop and use a set of ontologies to represent three aspects of biology carried out by gene products from any organism. Terms within the Gene Ontology (GO) describe those molecular functions and biological processes that gene products carry out and the subcellular locations in which they are located.

UniProt curators contribute manual GO annotations to proteins from a wide range of species. In addition, to ensure that UniProt provides a comprehensive GO annotation resource and to avoid duplication of effort, GO annotations are also integrated from more than 30 external model organism and multi-species databases including dictyBase, EcoCyc, FlyBase, Gramene, Human Protein Atlas, IntAct, LifeDB, MGI, PomBase, Reactome, RGD, TAIR, SGD, WormBase and ZFIN.

High-quality automatic GO annotations are also supplied to the UniProt GO annotation set by Ensembl, EnsemblGenomes, InterPro and UniProt prediction pipelines. Such automatic pipelines differently exploit gene orthology data, protein sequence signatures and existing cross-references or keywords from external controlled vocabularies, to infer that a protein has a particular function or subcellular location. The inclusion of such high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterised, non-model organism species. In the May 2012 UniProt release, a total of 125 million GO annotations are supplied for 14.8 million proteins from more than 338,000 taxonomic groups.

GO annotations are present in the ‘Ontologies’ section of UniProtKB entries (see for example P09960) and are available to download from the GOA ftp site. GO annotations can additionally be viewed via the QuickGO browser. We are pleased to announce an addition to the UniProt GO automatic annotation pipelines: UniPathway2GO.

New UniPathway2GO pipeline

In collaboration with the SIB Swiss Institute of Bioinformatics, INRIA (Rhone-Alpes) and Laboratoire d’Ecologie Alpine (Grenoble), UniProt is pleased to announce the inclusion of an additional 113,285 GO annotations that describe the pathway(s) in which 105,041 UniProtKB entries are involved.

UniPathway is a manually curated resource of enzyme-catalyzed and spontaneous chemical reactions that provides a hierarchical representation of metabolic pathways.

Currently 425 UniPathway pathway terms have been manually mapped to GO terms and 48% of these annotations apply a GO term that either uniquely describes a protein’s involvement in a certain process, or supplies a more granular term than is supplied by other automatic annotation methods.

UniProt release 2012_05

Published May 16, 2012

Headline

Sex by deception

All is fair in love and war and… species survival, including the most brazen cheating. In this context, strategies developed by orchids of the genus Ophrys to attract pollinators are astounding. While the majority of flowering plants achieve pollination by exploiting the food-seeking behavior of animals, Ophrys uses alternative ploys that exploit their mate-seeking behavior. These beautiful flowers imitate female insects to attract males, predominantly male hymenoptera. They mimic the insect body through one modified petal, called the labellum, but the misleading cues are not only visual and tactile: they are also chemical. During development the Ophrys labellum accumulates substances that mimic sex pheromones – which consist mostly of cuticular hydrocarbons, such as alkanes and alkenes – that induce the pollinator to attempt mating (pseudocopulation) with the labellum. During pseudocopulation, pollen becomes attached to the hapless suitor, which transfers this pollen to other flowers when it is once again enticed into pseudocopulation.

This pollination system is highly specialized, with each orchid species targeting a single pollinator with chemical cues consisting of alkenes whose specificity is determined by the precise position of the double bonds. This allows even closely related Ophrys species, living in the same environment and in the absence of geographic barriers, to remain reproductively separated, since they attract different insects. The enzymes involved in Ophrys alkene synthesis have been recently identified. The SAD2 desaturase has the catalytic activity and tissue-specific expression pattern (i.e. preferentially in the labellum) expected for a determinant of pollination specificity. Small differences in the expression level or sequence of SAD2 homologs could explain the observed differences in desaturation among Ophrys species, and hence the selective attraction of specific pollinators. Although alignments of orthologous SAD2 sequences from Ophrys sphegodes and O. exaltata indicate striking identity, as yet uncharacterized variations could conceivably affect the precise reaction products.

As of this release, SAD2 gene products have been manually annotated in UniProtKB. They can be retrieved by searching the Swiss-Prot section for SAD2 in ‘Gene names’ (gene:SAD2 AND reviewed:yes). Both available sequences (from O. sphegodes and O. exaltata) can be selected and aligned directly from the search output.

UniProtKB news

Update to Reference proteomes in UniProtKB

With the significant increase in the number of complete genomes sequenced, it is critically important to organize this data in a way that allows users to effectively navigate the growing number of available complete proteome sequences. In collaboration with Ensembl and the NCBI Reference Sequence collection, UniProt began this organization by defining a set of ‘reference proteomes’. These were first introduced in UniProt release 2011_09 and the keyword ‘Reference proteome’ was created to allow their easy retrieval.

The number of reference proteomes has grown from 455 in UniProt release 2011_09 to 549 in release 2012_05. The proteomes have been selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB.

The set of reference proteomes will be continuously reviewed as new proteomes of interest become available and as existing taxonomic classifications are revised. We would very much welcome feedback on our current list of reference proteomes and suggestions for new candidates via help@uniprot.org.

Link to complete and reference proteomes.

UniProt release 2012_04

Published April 18, 2012

Headline

Of serpents, humans and pain

Pain can be viewed as an indispensable communication tool to warn us that something is wrong, and to help us minimize physical harm to our body. Congenital insensitivity to pain leads to severe problems. Although pain is very useful, persistent pain can turn into a nightmare and the spontaneous reaction of most persons is to seek relief – often by reaching for painkillers. Understanding the mechanism of nociception could help develop treatments that provide relief for millions of people.

Surprisingly, a hint may come from a predator: the Texas coral snake. This beautiful snake with black, yellow and red banding lives in the southern United States and throughout most of Mexico. In the absence of antivenom treatment, the fatality rate of coral snake envenomations is estimated at 10%. Death is primarily due to respiratory or cardiovascular failure. In addition, coral snake bite causes excruciating and unremitting pain.

The culprit is MitTx, a venom toxin active as a heterodimer made of MitTx-alpha and MitTx-beta. MitTx-alpha contains a BPTI/Kunitz domain, found in many protease inhibitors. MitTx-beta belongs to the phospholipase A2 (PLA2) family, but it lacks critical catalytic residues normally found in the active site of related PLA2 enzymes and has been shown to be inactive as a phospholipase. The MitTx heterodimer activates acid-sensing ion channels (ASICs). ASICs are voltage-independent channels expressed in neurons and activated by acid. They are preferentially permeable to Na+, but to a lesser extent can also conduct other cations, such as Ca2+, K+, Li+ and H+. Physiologically, ASICs can be triggered by tissue injuries, inflammation or build-up of lactic acid. This alert system is hijacked by coral snake venom. Whereas protons elicit very transient responses, those evoked by MitTx are dramatically prolonged, reflecting both lack of desensitization and slow reversibility after washout.

At neutral pH, the most robust toxin-evoked responses are observed with the ACCN2 ASIC subtype. However, if the extracellular pH drops below neutrality, the toxin becomes an excellent ACCN3 agonist, essentially enhancing the potency of protons by three orders of magnitude.

Brazilian coral snake venom also activates ACCN2-expressing cells. This very channel had already been shown to be targeted by the PcTx1 toxin from the Trinidad chevron tarantula. In this case, the toxin does not activate the channel by itself, but rather serves as a functional antagonist of proton-evoked responses by locking the channel in a desensitized state.

Animal toxins often act on very restricted targets and have proven to be extremely useful tools for basic research. The identification of MitTx should allow further investigation of the role of ASICs in pain signaling, and eventually the development of new analgesics.

For more information on toxins in UniProtKB, see the Animal toxin annotation program.

UniProtKB news

Complete proteomes for Ensembl Genomes species

The sources of the UniProtKB complete proteomes are genomes in INSDC and Ensembl and now, to further increase the taxonomic coverage, species from Ensembl Genomes will also be incorporated. Ensembl Genomes aims to work with all sections of the scientific community to represent the best annotation for every genome. Its role varies according to the species, from displaying the genome assembly, gene prediction and functional annotation, through to providing a portal through which genomic data from model organism and community databases can be visualised and analysed in their wider context, and also integrated with other data stored in the core repositories maintained by the EBI.

The new species are:
Caenorhabditis japonica
Phytophthora ramorum
Pristionchus pacificus
Strongylocentrotus purpuratus

All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genomes cross-reference is added.

We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to make proteomes available for the remaining Ensembl Genomes species currently absent from UniProtKB.

Update of Complete proteomes with Ensembl release 66

Ensembl release 66 was made available at the end of February 2012 and, in response, the appropriate complete proteomes have been updated in UniProtKB. Of note, the human reference proteome has grown in size by just over 8,000 new UniProtKB entries. This growth is a consequence of the following updates:

  • Incorporation of the latest set of cDNAs from the European Nucleotide Archive and NCBI RefSeq. A total of 224,907 cDNAs are aligned to the current genome, an increase of 491 cDNAs compared to release 65.
  • New CCDS import – the updated gene set includes 26,437 transcript models.
  • The patches for GRCh37.p6 were annotated using a combination of manual annotation, annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence.
  • Update of Havana manual annotation representing data present in Vega release 46 which includes GENCODE release 11.

The proteomes of 35 chordate species are now fully synchronised with Ensembl 66. The species are:
Ailuropoda melanoleuca (Giant panda)
Anolis carolinensis (American chameleon)
Bos taurus (Cow)
Callithrix jacchus (White-tufted-ear marmoset)
Canis familiaris (Dog)
Cavia porcellus (Guinea pig)
Ciona intestinalis (Transparent sea squirt)
Ciona savignyi (Pacific transparent sea squirt)
Danio rerio (Zebrafish)
Equus caballus (Horse)
Gallus gallus (Chicken)
Gasterosteus aculeatus (Three-spined stickleback)
Gorilla gorilla (Lowland gorilla)
Homo sapiens (Human)
Latimeria chalumnae (West Indian ocean coelacanth)
Loxodonta africana (African elephant)
Macaca mulatta (Rhesus macaque)
Meleagris gallopavo (Common turkey)
Monodelphis domestica (Gray short-tailed opossum)
Mus musculus (Mouse)
Myotis lucifugus (Little brown bat)
Nomascus leucogenys (Northern white-cheeked gibbon)
Ornithorhynchus anatinus (Duckbill platypus)
Oryctolagus cuniculus (Rabbit)
Oryzias latipes (Medaka fish)
Otolemur garnettii (Garnett’s greater bushbaby)
Pan troglodytes (Chimpanzee)
Pongo abelii (Sumatran orangutan)
Rattus norvegicus (Rat)
Sarcophilus harrisii (Tasmanian devil)
Sus scrofa (Pig)
Taeniopygia guttata (Zebra finch)
Takifugu rubripes (Japanese pufferfish)
Tetraodon nigroviridis (Spotted green pufferfish)
Xenopus tropicalis (Western clawed frog)

Update to the Tetraodon nigroviridis proteome

The Tetraodon nigroviridis complete proteome has been updated with data from Ensembl release 66. Until now the proteome has reflected the Genoscope gene model annotations provided within the whole genome shotgun project (accession CAAE00000000) that were made available in March 2007. The proteome has been updated to reflect the annotations of the genome using Ensembl’s more conservative, evidence-based pipeline. Although a consequence of this update is a slightly reduced proteome size, the gene model predictions are high-quality and fit well into the Ensembl Compara gene trees. An example of an Ensembl sourced protein sequence is entry H3C526.

Cross-references to EvolutionaryTrace

Cross-references have been added to EvolutionaryTrace, which ranks amino acid residues in a protein sequence by their relative evolutionary importance.

EvolutionaryTrace is available at http://mammoth.bcm.tmc.edu/ETserver.html

The format of the explicit links in the flat file is:

Resource abbreviation EvolutionaryTrace
Resource identifier UniProtKB accession number
Example P06611:
DR   EvolutionaryTrace; P06611; -.

Show all the entries having a cross-reference to EvolutionaryTrace.

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 3-hydroxyhistidine
  • (3S)-3-hydroxyaspartate
  • (5R)-5-hydroxylysine
  • (5S)-5-hydroxylysine

UniProt release 2012_03

Published March 21, 2012

Headline

The importance of being manual

Manual annotation is a time-consuming and expensive process, but undoubtedly adds great value to knowledgebases like UniProtKB. A recent and very elegant study on sirtuin-5 illustrates how new functions continue to be discovered within what are thought to be well characterized protein families. Curating this information facilitates its dissemination as well as its subsequent (re)use in automatic annotation and function prediction systems.

Sirtuins, also called Sir2 proteins, are NAD-dependent deacetylases that regulate important biological processes. The name ‘Sir2’ comes from the yeast ‘silent information regulation 2’ gene, a gene involved, among other processes, in transcriptional repression. Sirtuins belong to a family of evolutionarily conserved proteins occurring in all kingdoms. Mammals have seven sirtuins, SIRT1 to SIRT7. Robust deacetylase activity has been demonstrated for mammalian SIRT1 to SIRT3 and the annotation concerning this potential function has been propagated to other paralogues on the basis of their sequence similarity. However, so far SIRT4 to SIRT7 have been shown to have only a very weak deacetylase activity, if any. While this could be due to an inappropriate choice of peptides for the analysis, it could also be envisioned that their physiological activity is different.

A major breakthrough in the field came from the study of the SIRT5 crystal structure. It appeared that the pocket used by SIRT2 to host acetyl groups was much larger in SIRT5, large enough to host a negatively charged acyl group instead. The most common acyl-CoA molecules with a carboxylate group in cells are malonyl-CoA and succinyl-CoA. Hence, malonyl- and succinyl-peptides were produced and tested as substrates for SIRT5. Goal! SIRT5 was actually able to catalyze their hydrolysis, proving it is a desuccinylase and a demalonylase, rather than a deacetylase. This discovery raised another question: do such post-translational modifications (PTMs) exist at all? Lysine succinylation had been shown on E. coli homoserine trans-succinylase, but not on mammalian proteins, and lysine malonylation had never been reported.

The presence of these PTMs was investigated in mitochondria, the organelle hosting SIRT5. Goal! Several proteins were found to be either succinylated or malonylated or both. Among them is CPS1 whose activity has been previously shown to be regulated by SIRT5.

UniProtKB/Swiss-Prot SIRT5 entries have been updated and lysine-succinylation and malonylation have been introduced in the UniProtKB controlled vocabulary of PTMs.

UniProtKB news

Cross-references to DNASU

Cross-references have been added to DNASU, a plasmid repository providing centralized archival and distribution of over 131,000 plasmids and empty vectors, including over 45,000 plasmids containing more than 7,000 human genes.

DNASU is available at http://dnasu.asu.edu

The format of the explicit links in the flat file is:

Resource abbreviation DNASU
Resource identifier DNASU identifier
Example A0EJG6:
DR   DNASU; 1400; -.

Show all the entries having a cross-reference to DNASU.

Changes to keywords

New keyword:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • Methionine sulfoxide

UniProt release 2012_02

Published February 22, 2012

Headline

Thiamine thiazole synthase: enzyme, catalyst or co-substrate?

Thiamine is a cofactor essential for many biochemical reactions in all living beings. Humans depend on their diet to supply it as vitamin B1, while bacteria, plants and yeast can make their own. They do so by coupling two precursor molecules: a sulfur-containing ring structure known as a thiazole and a nitrogenous pyrimidine.

In eukaryotes, it has been known for some time that thiamine thiazole synthase catalyzes thiazole biosynthesis, but the source of the thiazole sulfur at the heart of the reaction remained elusive. A recent publication unveiled a very unusual mechanism, whereby a sulfide ion is transferred from a conserved cysteine of thiamine thiazole synthase itself to become part of the thiazole precursor in Saccharomyces cerevisiae. This transfer is strictly dependent on the presence of Fe(2+). The donor cysteine, Cys-205, is irreversibly converted to dehydroalanine, leading to the inactivation of the enzyme. Surprisingly the inactivated protein is not degraded, but accumulates in the cell where it can form up to about 1.5% of total cellular protein. Could it have another physiological function? This has yet to be explored, but it has been suggested to play a role in mitochondrial DNA damage tolerance.

Although very rare, the use of a protein as a metabolic reagent has already been observed. The best characterized example is methylated-DNA--protein-cysteine methyltransferase, which repairs O-6 alkylated guanine lesions in DNA by stoichiometrically transferring the alkyl group to a cysteine residue in the enzyme. Here again we face a suicidal reaction, the enzyme being irreversibly inactivated. Interestingly, the inactive enzyme serves as a signal to induce other DNA repair enzymes.

Can such proteins be considered as “enzymes”? An enzyme is defined as a protein that catalyzes chemical reactions of other substances without itself being destroyed or altered upon completion of the reactions. Thiamine thiazole synthase functions as a “one-shot” reagent, and therefore does not comply with the definition. At most it can be considered as a catalyst, i.e. a reagent which promotes a reaction and may act repeatedly or only once.

Such an unusual mechanism has led to some inconsistencies. The Enzyme Commission attributed the EC number 2.1.1.63 to methylated-DNA--protein-cysteine methyltransferases, mentioning the ambiguity of this attribution: “This enzyme catalyzes only one turnover and therefore is not strictly catalytic.” Actually the protein is a catalyst, but it is not strictly an enzyme. The later decision not to provide an EC number to thiamine thiazole synthases is more consistent in view of the definition of an enzyme. This inconsistency is also visible in UniProtKB entries which show EC numbers in the ‘Names and origin’ section of methylated-DNA--protein-cysteine methyltransferases, but not in that of thiamine thiazole synthases.

As of this release, thiamine thiazole synthases have been updated in UniProtKB/Swiss-Prot and a new post-translational modification, 2,3-didehydroalanine, has been introduced.

UniProtKB news

Update to the human proteome

The human reference proteome has been updated with data from Ensembl release 65. Ensembl 65 has numerous updates to the human genome including an update of Havana manual annotation representing data present in Vega release 45. As a result, the human reference proteome has increased in size by over 7,000 entries. These new entries correspond to fragment entries that have transcription evidence captured by Havana and as such they are considered valid members of the proteome. Two examples of these fragment entries are H0Y5B1 and H0Y653.

Change of the cross-reference GeneDB_Spombe to PomBase

The Schizosaccharomyces pombe GeneDB was replaced by PomBase, the new model organism database for the fission yeast Schizosaccharomyces pombe. We have therefore changed the corresponding resource abbreviation from GeneDB_Spombe to PomBase.

Change of the category of the cross-reference KO

The KO database has been moved from the category “Family and domain databases” to the category “Phylogenomic databases”.

Removal of the cross-reference NMPDR

Cross-references to NMPDR have been removed.

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 2,3-didehydroalanine (Cys)
  • N6-malonyllysine

UniProt release 2012_01

Published January 25, 2012

Headline

What’s in a (species) name?

Carl Linnaeus, the father of taxonomy, was responsible during his life for the naming of nearly 8,000 plants, many animals and the scientific designation for humans: Homo sapiens. Linnaeus used many of his supporters and detractors as inspiration for naming plants. The most beautiful plants were often named in honor of his supporters while his detractors often supplied the names of common weeds or unattractive plants. Rather like an artist signing a painting, Linnaeus signed all his descriptions, his signature becoming over the centuries a simple L followed by a point. Sober. Even to this day taxonomically approved names may use this idea, but less soberly; the red alga Gracilaria chilensis was discovered in 1986 by C.J. Bird, J. McLachlan & E.C. Oliveira, giving us Gracilaria chilensis C.J. Bird, J. McLachlan & E.C. Oliveira, 1986.

Linnaeus advocated the use of commemorative personal names as botanical names. In ‘Critica Botanica’, he commented with humor about the naming of Linnaea borealis: “It is commonly believed that the name of a plant which is derived from that of a botanist shows no connection between the two… [but]... Linnaea was named by the celebrated [Jan Frederik] Gronovius and is a plant of Lapland, lowly, insignificant, disregarded, flowering but for a brief space – after Linnaeus who resembles it”. It may not be an excessively objective statement.

Thunbergia was named in 1780 by Retzius in honor of Carl Peter Thunberg (1743-1828), the Swedish naturalist, and perhaps the greatest pupil of Linnaeus. Kosteletzkya for Vincenz Franz Kosteletzky (1801-1887), Bohemian physician and botanist. Jacobsenia for Hermann Johannes Heinrich Jacobsen (1898-1978), German botanist and curator at Kiel botanic garden… there are many, many more examples.

Latin is still necessary at least to understand the species epithet. Ehrharta longiflora, longiflora referring to the elongate flowers of this species. And Ehrharta? J.F. Ehrhart (1742-1795) was a German botanist, yet another of Linnaeus’ pupils.

Sometimes scientific names bear the names of people who described the species or were instrumental in discovering them. Several archaebacteria have been named in honor of Carl Woese (1928-), famous for defining the archaea in 1977, such as Pyrococcus woesei, Methanobrevibacter woesei or Conexibacter woesei.

Euzebya tangerina, a tangerine-colored bacterium, was named in 2010 after Jean-Paul Euzéby, a French microbiologist who has contributed significantly to microbial systematics, including the Latinization of microbial names.

And Przewalskium albirostris? The Latin etymology of the name suggests that this creature has a white beak (albus: white and rostrum: beak, trunk or proboscis), or a white lip. It was formerly named Cervus albirostris. Cervus means deer!! Now we know: it is a white-lipped deer. But it was renamed Przewalskium albirostris after N.M. Przhevalsky (1839-1888), a Russian geographer.

In view of the names cited above, you may have the feeling of attending a popularity contest in the scientific community, but other kinds of tribute are also possible: a pheasant was named Chrysolophus amherstiae to commemorate Sarah Countess Amherst who sent the first specimen to London in 1828. More recently, a newly discovered bacterium was named Midichloria mitochondrii. Does Midichloria remind you of anything? Schoooooooooooooooo…Luke, I am your father. Midichloria, a gram-negative bacterium, takes its name from the Star Wars microbes, midi-chlorians, which grant the Jedi and the Sith the ability to use the Force. In real life, Midichloria mitochondrii are non-obligate symbionts that reside primarily in the mitochondria.

Of course the appreciation of people deserving a tribute remains questionable. A nice cactus has been called Rebutia einsteinii, but there is no Opuntia oppenheimerii. Why not oppenheimerii? Why a cactus? We leave the question open for the future generations of taxonomists.

UniProtKB news

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-hydroxy-3-methylproline (Ile)
Deleted term:
  • 5-hydroxy-3-methylproline

Website news

Clustal Omega replaces Clustal W as UniProt’s protein alignment program

We have upgraded the alignment web service used to align protein sequences in UniProt from Clustal W to Clustal Omega. This has been made possible with the help of a new bioinformatics analysis tools framework at EMBL-EBI. Clustal Omega is the latest addition to the Clustal family of programs. It offers a significant improvement upon Clustal W in the following areas:

  • Accuracy – Better quality protein sequence alignments.
  • Scalability – Better at aligning larger numbers of sequences.
  • Speed – Faster alignments, making use of multiple processors where present.

Clustal Omega is currently only suitable for aligning protein sequences and not DNA or RNA sequences.

UniProt release 2011_12

Published December 14, 2011

Headline

Between Charybdis and Cilia

A large number of genetic disorders, displaying a widely varying set of symptoms, are highly related in their root cause and can be grouped into a single category. This is the case for the ciliopathies in which the underlying cause is a cilium dysfunction. This emerging class of disease groups very different types of syndromes, including the Alstrom, Bardet-Biedl, Ellis-van Creveld, Joubert, Meckel, Sensenbrenner syndromes and many more.

Cilia are organelles found in almost all vertebrate cells. They contain a ciliary axoneme, i.e. a ring-shaped core of 9 microtubule doublets, which connects the base of the cilium to its tip. This axoneme is covered by the ciliary membrane and projects from a modified centriole, the basal body.

There are two types of cilia: motile cilia and non-motile (primary) cilia. Motile cilia are found in certain highly specialized cell types and are dedicated to driving a powerful flow of extracellular fluid, for example in the epithelial cells lining the trachea, where they sweep mucus and dirt out of the lungs. By contrast, the majority of cells develop a single, non-motile primary cilium, which typically serves as a sensory organelle. The primary cilium membrane harbours receptors for crucial signaling cascades, most prominently Hedgehog, Wnt, planar cell polarity, FGF, Notch, mTOR, PDGF and Hippo signaling. As a result, primary cilia play a role in cell proliferation, polarity, differentiation, tissue maintenance, and nerve growth.

The range of diseases due to cilia defects therefore includes multiple phenotypes that affect different organs (predominantly kidney, eye, liver, bone and brain) and often show overlapping clinical features. Commonly observed clinical manifestations are renal cysts, retinal degeneration, polydactyly, mental retardation, and obesity.

The genetics of ciliopathies is complex. In some cases, identical phenotypes are caused by mutations in different genes. For example, over 15 genes have been shown to be involved in Bardet-Biedl syndrome, and close to 10 and 15 genes in Meckel and Joubert syndromes, respectively. On the other hand, multiple allelism at a single locus can lead to different phenotypes. For example, mutations in CEP290, a centrosomal protein involved in ciliogenesis, cause Bardet-Biedl syndrome type 14, Joubert syndrome type 5, Senior-Loken syndrome type 6, Leber congenital amaurosis type 10, and Meckel syndrome type 4. Additionally, recent studies suggest that ciliopathy loci can be modulated by pathogenic lesions in other ciliary genes to either exacerbate overall severity or induce specific phenotypes.

Variations across multiple sites of the ciliary proteome may influence the clinical outcome and explain the variable penetrance and expressivity of ciliopathies. Examples are the TTC21B and KIF7 genes, which code for two ciliary proteins involved in the regulation of sonic hedgehog signaling. TTC21B mutations primarily cause nephronophthisis type 12 and asphyxiating thoracic dystrophy type 4, but have also been found in patients with Bardet-Biedl syndrome or Meckel-Gruber syndrome who carry disease-causing mutations in other ciliopathy genes. KIF7 mutations are primarily responsible for acrocallosal syndrome, Joubert syndrome type 12, and hydrolethalus syndrome type 2, but may also genetically interact with Bardet-Biedl syndrome genes and contribute to disease manifestation and severity in Bardet-Biedl syndrome patients.

A number of ciliopathies have been annotated in UniProtKB/Swiss-Prot. The newly created keyword Ciliopathy allows users to retrieve all proteins involved in these diseases. More specific keywords can be used to restrict the set of proteins to those associated with special types of ciliopathies, such as Bardet-Biedl syndrome, Joubert syndrome, Kartagener syndrome, Meckel syndrome, Nephronophthisis, Primary ciliary dyskinesia, or Senior-Loken syndrome.

Proteins involved in cilia formation, organization, maintenance and degradation can be retrieved with the keyword Cilium biogenesis/degradation.

UniProtKB news

Cross-references to DMDM

Cross-references have been added to DMDM (Domain Mapping of Disease Mutations), a database in which each disease mutation can be displayed by its gene, protein or domain location. DMDM provides a unique domain-level view where all human coding mutations are mapped on the protein domain.

DMDM is available at http://bioinf.umbc.edu/dmdm/.

The format of the explicit links in the flat file is:

Resource abbreviation: DMDM
Resource identifier: DMDM identifier
Example (Q9N2K0):
DR   DMDM; 44887889; -.

Show all the entries having a cross-reference to DMDM.

Cross-references to PATRIC

Cross-references have been added to PATRIC, a resource which integrates vital information on pathogens, provides key resources and tools to scientists, and helps researchers to analyze genomic, proteomic and other data arising from infectious disease research.

PATRIC is available at http://www.patricbrc.org/.

The format of the explicit links in the flat file is:

Resource abbreviation: PATRIC
Resource identifier: PATRIC identifier
Optional information 1: PATRIC locus tag
Example (A5A616):
DR   PATRIC; 32118368; VBIEscCol129921_1604.

Show all the entries having a cross-reference to PATRIC.
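The two link formats above share one layout: a resource abbreviation, a resource identifier, and optional information fields, separated by semicolons and closed by a period, with '-' standing in for an empty optional field. As a minimal sketch, assuming only that layout, such lines could be split as follows. The names CrossReference and parse_dr_line are illustrative and not part of any UniProt tool, and DR lines that carry extra trailing annotations are not handled.

```python
from typing import List, NamedTuple


class CrossReference(NamedTuple):
    resource: str         # resource abbreviation, e.g. "DMDM"
    identifier: str       # resource identifier
    optional: List[str]   # optional information fields, "-" placeholders dropped


def parse_dr_line(line: str) -> CrossReference:
    """Split one flat-file DR line into its fields (illustrative helper)."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    # The line is terminated by a single period; remove only that one.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"malformed DR line: {line!r}")
    resource, identifier, *rest = fields
    optional = [field for field in rest if field and field != "-"]
    return CrossReference(resource, identifier, optional)


if __name__ == "__main__":
    print(parse_dr_line("DR   DMDM; 44887889; -."))
    print(parse_dr_line("DR   PATRIC; 32118368; VBIEscCol129921_1604."))
```

Dropping the '-' placeholder keeps the optional list empty for resources such as DMDM, so callers can test it directly instead of comparing against a dash.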

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N2,N2-dimethylarginine

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • 2-(4-guanidinobutanoyl)-5-hydroxyimidazole-4-carbothionic acid (Arg-Cys)
  • 5-methyloxazole-4-carboxylic acid (Cys-Thr)
  • 5-methyloxazole-4-carboxylic acid (Thr-Thr)
  • 5-methyloxazoline-4-carboxylic acid (Ser-Thr)
  • Oxazole-4-carboxylic acid (Ile-Ser)
  • Oxazole-4-carboxylic acid (Ser-Ser)
  • Thiazole-4-carboxylic acid (Arg-Cys)
  • Threonine 5-hydroxy-oxazole-4-carbonthionic acid (Thr-Cys)

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2011_11

Published November 16, 2011

Headline

Who wants to be a millionaire? The first million HAMAP-annotated entries in UniProtKB/TrEMBL

As humanity explores more environmental and ecological niches, we are discovering a treasure-trove of organisms of which very little, if anything, is known. Sequencing genomes is becoming cheaper, and so to understand this diversity we sequence; but to begin to appreciate a genome’s possibilities, quality annotation is required. HAMAP is an annotation project started over 10 years ago to provide annotation for the massive influx of completely sequenced bacterial and archaeal genomes and is now an integral part of the UniProt Automatic Annotation program.

The HAMAP rules automatically annotate bacterial and archaeal proteins, as well as related plastid-encoded proteins, based on manually-annotated, characterized template entries. These latter entries are used to generate the HAMAP profiles. UniProtKB/TrEMBL entries that belong to a family, i.e. that match a HAMAP profile, acquire annotation based on the manually annotated templates as well as template-based feature propagation. The propagated annotation also includes protein and gene names, general annotation (comments), keywords and GO terms. The annotation templates (http://hamap.expasy.org/families.html), seed alignments used to generate the HAMAP profiles and much more are available on the HAMAP website and will be integrated into the www.uniprot.org automatic annotation portal in the future.

Two years ago we wrote a headline highlighting the incorporation of 300,000 HAMAP annotated entries into UniProtKB/Swiss-Prot. Since that time we have discontinued incorporation of these semi-automatically annotated entries into UniProtKB/Swiss-Prot; this annotation is now added to UniProtKB/TrEMBL entries instead, while manually annotated ‘template’ entries (see above) are still integrated into UniProtKB/Swiss-Prot. With this release there are over 1 million bacterial, archaeal and plastid-encoded proteins in UniProtKB/TrEMBL that have been annotated by the HAMAP rules. With each UniProt release, and as families and new template entries are created or updated based on new experiments, entries from all genomes are (re)annotated, enriching them beyond what was known when the genomes were originally submitted to the DNA databases. All these entries are thus improved by this high quality semi-automated annotation, rendering them more useful to the community.

UniProtKB news

Cross-references to KO (KEGG Orthology)

Cross-references have been added to KO consisting of manually defined ortholog groups that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes.

KO is available at http://www.genome.jp/kegg/ko.html.

The format of the explicit links in the flat file is:

Resource abbreviation: KO
Resource identifier: KO identifier
Example (P41932):
DR   KO; K06630; -.

Show all the entries having a cross-reference to KO.

Changes to keywords

New keyword:

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 4-hydroxyglutamate
New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • 3-hydroxypyridine-2,5-dicarboxylic acid (Ser-Cys) (with S-...)
  • 3-hydroxypyridine-2,5-dicarboxylic acid (Ser-Ser) (with C-...)
  • Thiazole-4-carboxylic acid (Glu-Cys)

UniProt release 2011_10

Published October 19, 2011

Headline

The sound of silence

Cytosine methylation is the major and best characterized epigenetic modification of metazoan DNA. It is implicated in long-term gene silencing, X chromosome inactivation, genomic imprinting, etc. 5-methylcytosine (5mC) is recognized by methyl-binding proteins (MBDs), which in turn recruit repressive histone modifiers, such as H3K9 methyltransferases, to establish a heterochromatin state.

Cytosine base methylation is catalyzed by the C5-methyltransferase enzyme family. DNMT3A and DNMT3B methylate DNA de novo. DNMT1 maintains the methylation status across cell divisions. In the absence of DNMT1 activity, DNA methylation is progressively lost since methylation is not replicated onto the newly synthesized strand, leading to passive DNA demethylation.

However, passive DNA demethylation cannot account for the rapid demethylation that occurs in the paternal genome of the zygote within the first 4 hours following fertilization, or for that observed in primordial germ cells, both of which are independent of DNA replication. While demethylases have been identified in Arabidopsis thaliana, the mechanism of active demethylation in mammals remained elusive (reviews). 2011 has unveiled the central role played by TET family members. These enzymes have already been shown to catalyze the conversion of 5mC into 5-hydroxymethylcytosine (5hmC). 5hmC can be further processed, either by G/T mismatch-specific thymine DNA glycosylase (TDG) or by deamination enzymes, such as APOBEC1 and AICDA/AID, and eventually removed and replaced by unmodified cytosine through the base excision repair mechanism.

Interestingly, TET1 has a role in transcriptional repression, independently of its enzymatic activity. It binds a significant proportion of Polycomb group target genes and associates and colocalizes with the SIN3A co-repressor complex.

These exciting new data pave the way for understanding transcriptional fine-tuning during embryonic development, as well as in adult organisms, and will keep us busy updating UniProtKB for quite a while.

UniProtKB news

Changes to keywords

Deleted keyword:

UniProt release 2011_09

Published September 21, 2011

Headline

Reference proteomes in UniProt

With the significant increase in the number of complete genomes sequenced, it is critically important to organise this data in a way that allows users to effectively navigate the growing number of available complete proteome sequences. The approach adopted by UniProt to meet this challenge is to define a set of “reference proteomes” which are “landmarks” in proteome space.

Reference proteomes have been selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms and other proteomes of interest for biomedical and biotechnological research. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.

Currently, UniProt has defined 455 reference proteomes in collaboration with Ensembl and the NCBI Reference Sequence collection. The keyword ‘Reference proteome’ has been created to allow their easy retrieval, and the keyword ‘Virus reference strain’ has been deprecated to reflect this.

The set of reference proteomes will be continuously reviewed as new proteomes of interest become available and as existing taxonomic classifications are revised. We would very much welcome feedback on our current list of reference proteomes and suggestions for new candidates via help@uniprot.org.

Link to complete and reference proteomes.

UniProtKB news

Changes to keywords

Replacement of the keyword ‘Virus reference strain’ by ‘Reference proteome’

We have introduced the more widely applicable keyword ‘Reference proteome’ to replace the keyword ‘Virus reference strain’. All ‘Virus reference strains’ are now defined as ‘Reference proteomes’. See preceding text for further information on ‘Reference proteomes’.

New keyword:

UniProt release 2011_08

Published July 27, 2011

Headline

UniProt collaboration with IMEx for the annotation of protein interactions to MIMIx standard

UniProt is committed to the development and application of workflows and standards in the curation of biological data, its dissemination and exchange, and works with other consortia and data providers to achieve this. An example of this ongoing effort is the collaboration between UniProt and the International Molecular Exchange (IMEx) consortium.

The IMEx consortium is an international collaboration between a group of major public interaction data providers who share curation effort, and work to common curation rules using common standards. Our collaboration with IMEx will increase the flow of curated interaction data into IMEx, and will allow UniProt to leverage existing standards for the curation of protein interaction data and to contribute to the future development of such standards.

The standard we have chosen to adopt for the curation of protein interaction data in UniProt is the “minimum information required for reporting a molecular interaction experiment” standard, or MIMIx. MIMIx provides a useful compromise between free-text descriptions of protein interactions (which are difficult to parse) and the very detailed curation performed within IMEx (which aims to capture most experimental parameters). MIMIx-level annotation requires the accession numbers of the interacting proteins as well as a number of key experimental annotations made using terms from the Proteomics Standards Initiative (PSI) molecular interaction (MI) vocabulary. These annotations cover the type of interaction, the methods used to detect the interaction and identify the participants, the experimental roles of the participants, and the host organism in which the interaction was observed. MIMIx provides information that should be sufficient to allow a trained biologist to evaluate the biological relevance of an experimentally observed interaction.

UniProt curators have begun to curate protein interaction data to MIMIx standards as part of their normal workflow. Interactions are curated directly within the IntAct database, which forms the contact point between UniProt and the wider IMEx consortium. These curated interactions form a small part of the larger IntAct dataset which can be accessed from the IntAct website. A subset of presumably reliable interactions is extracted from the IntAct dataset and made available within the ‘Binary interactions’ section of UniProtKB entries (see for example entry Q13426). From UniProt release 2011_09, export from IntAct to UniProt will be determined using a simple scoring system developed by IntAct, coupled to a score threshold that has been deliberately chosen to exclude interactions supported by only one experimental observation. Further details of how interactions are scored can be found at the IntAct website. This simple score-based filter will be used in combination with a set of defined rules that excludes certain types of data, such as interactions that have been inferred but not experimentally proven.
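The export step described above combines a score threshold with rule-based exclusions. The sketch below shows that combination only. It does not reproduce the IntAct scoring system, and the Interaction fields, the select_for_export function, the partner names, the scores and the threshold value are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Interaction:
    interactor_a: str  # accession of the first participant
    interactor_b: str  # accession of the second participant
    score: float       # score computed upstream (the IntAct formula is not reproduced here)
    inferred: bool     # True if inferred rather than experimentally shown


def select_for_export(interactions: Iterable[Interaction],
                      threshold: float) -> List[Interaction]:
    """Apply the exclusion rule first, then the score threshold."""
    return [
        interaction
        for interaction in interactions
        if not interaction.inferred and interaction.score >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical records: partner names and scores are placeholders.
    candidates = [
        Interaction("Q13426", "PARTNER_1", 0.72, inferred=False),
        Interaction("Q13426", "PARTNER_2", 0.31, inferred=False),
        Interaction("Q13426", "PARTNER_3", 0.90, inferred=True),
    ]
    for kept in select_for_export(candidates, threshold=0.5):
        print(kept.interactor_a, kept.interactor_b, kept.score)
```

Keeping the rule check separate from the numeric comparison means a high score can never override an exclusion rule, which matches the description of the two filters being used in combination.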

We anticipate that these developments will enhance the availability and usability of high quality protein interaction data within UniProtKB, and promote the use of MIMIx in reporting such data. We welcome feedback on this development and other curation standards.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • (2-aminosuccinimidyl)acetic acid (Asn-Gly)
  • N,N-(cysteine-1,S-diyl)phenylalanine (Cys-Phe)
New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • S-bacillithiol cysteine disulfide
Modified term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 4-amino-3-isothiazolidinone serine (Cys-Ser) > N,N-(cysteine-1,S-diyl)serine (Cys-Ser)

UniProt release 2011_07

Published June 28, 2011

Headline

Killing myself softly – Bacterial and Archaeal Type II Toxin-Antitoxin modules (TA)

Bacteria produce many kinds of toxins that attack other organisms: cholera toxin, botulinum toxin, aerolysins, insecticidal toxins, extracellular proteases, to name just a few. In the past 5 years it has become apparent that most free-living bacteria produce another kind of internal toxin. These toxins are almost always encoded as bicistronic antitoxin-toxin operons (TA modules), where the antitoxin is unstable and neutralizes the toxin. In type I and III systems the antitoxin is a small RNA, while in type II systems the antitoxin is a protein. If the antitoxin levels decrease, then the toxin levels increase and toxic effects can be seen at the cellular level. In most type II cases studied, the antitoxin acts as an autoregulator, repressing transcription; frequently but not always, the toxin acts as a corepressor. TA modules were first identified in bacteriophage and on plasmids, where their role is clearly in plasmid or phage maintenance; the role of the chromosomally encoded toxins in bacteria is hotly debated. Proposed functions include maintenance of mobile genetic elements, programmed cell death, induction of persistence (dormancy), stress response, virulence promotion in a host, and regulation of biofilm formation. The toxin’s role may in fact depend on the physiology of the organism in question. Although they are widespread in Archaea, no function for these toxins has been shown in vivo.

There are many toxin families. The best characterized so far are the bacterial ribonuclease toxins which belong to the MazF, RelE, MqsR, HigB, YoeB and VapC families. Most of these toxins degrade mRNA; some are sequence-specific, some work only in association with ribosomes, while for others the mode is unknown. VapC has been shown to degrade the anticodon loop of tRNAfMet. Other cellular functions are also toxin targets; DNA gyrase is targeted by the ParE toxin, HipA toxin probably acts by inappropriately phosphorylating cellular targets, RatA blocks ribosomal subunit association, PezT/zeta toxin corrupts peptidoglycan synthesis and CbtA (formerly YeeV) toxin binds FtsZ and MreB, inhibiting them, possibly simultaneously. While the toxins form distinct families, their cognate antitoxins do not, although almost all of them have a DNA-binding domain, in accordance with their probable role in operon regulation. To further complicate matters, in Mycobacterium tuberculosis H37Rv cross-talk between toxins and some non-cognate antitoxins has been seen, while in Caulobacter crescentus such cross-talk does not occur. Additionally, potential new toxins are detected quite frequently.

We recently performed a major update of many of the type II TA families in UniProtKB/Swiss-Prot, with particular attention given to the model organisms Mycobacterium tuberculosis strain H37Rv and E. coli K12 / MG1655. 65 TA modules have been annotated in M. tuberculosis and 15 in E. coli; gene names for M. tuberculosis were assigned in collaboration with the TubercuList database. Interestingly, TA modules are more abundant in pathogens than in related non-pathogenic strains. For instance, Mycobacterium smegmatis, a non-pathogenic mycobacterium, is predicted to encode only 3 TA modules. Since January 2011 the mode of action of at least 4 toxin families has been elucidated (CbtA, PezT/zeta, RatA and VapC).

Although the PezT/zeta toxin and associated antitoxin module have not been predicted to exist in M. tuberculosis, there are indeed loci belonging to this TA module encoded in the genome. This is currently such a hot topic that integrating the data will keep us busy for quite a while yet.

All manually annotated type II TA module entries can be retrieved from UniProtKB/Swiss-Prot using the query toxin-antitoxin (TA) module.

UniProtKB News

Provision of complete proteome data sets for IPI species by UniProt

Complete proteome data sets are now available for download from the FTP and web sites for the species in the International Protein Index (IPI) which is scheduled for closure this year. IPI is an integrated database which clusters protein sequences from different databases to provide non-redundant complete data sets for selected higher eukaryotic organisms. Since it was launched in 2001, IPI has covered the gaps in the gene predictions between different databases, but since then the situation has improved for many of the most-studied genomes. This is due to a close collaboration between Ensembl, RefSeq and UniProt which aims to provide a standard set of gene predictions for the genomes of interest. These new complete proteomes will therefore provide high coverage complete proteomes for IPI users. The complete UniProtKB proteomes will be based on existing UniProtKB sequences supplemented by missing high quality predictions imported from Ensembl.

For Homo sapiens, a first pass annotation of the complete proteome was completed by UniProt in 2008 and all entries were incorporated into UniProtKB/Swiss-Prot. Within this UniProtKB/Swiss-Prot complete H. sapiens proteome, approximately 20,000 putative protein-coding genes are represented by one canonical protein sequence, with some entries describing multiple isoform sequences. Since its initial release, the UniProtKB/Swiss-Prot complete H. sapiens proteome has been extensively curated and the Ensembl cross-references – mapped based on sequence identity – are in the process of being manually verified. All predicted protein sequences from Ensembl (except fragments) that were found to be absent from the UniProtKB/Swiss-Prot complete H. sapiens proteome were imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL (see release 2011_05 headline). These imported UniProtKB/TrEMBL entries were tagged with the keyword ‘Complete proteome’. The aim of this import was to increase the coverage of the existing complete proteome, by supplementing it with those Ensembl protein sequences that had no UniProtKB counterpart. The resulting UniProtKB complete H. sapiens proteome includes both reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome completed in 2008), now supplemented by unreviewed sequences from UniProtKB/TrEMBL. This process will enable the synchronization of the UniProt set with the CCDS project. This version of the complete H. sapiens proteome provides higher sequence coverage than the preceding version, but now includes sequences that have not been manually reviewed. Users can choose to opt either for this expanded complete H. sapiens proteome or a reduced version that derives exclusively from UniProtKB/Swiss-Prot.

For the other IPI species (mouse, rat, chicken, zebrafish, cow and dog), we added the keyword ‘Complete proteome’ to the existing UniProtKB/Swiss-Prot entries. We identified those entries in UniProtKB/TrEMBL which mapped to the complete genome in Ensembl and imported the predicted protein sequences (except fragments) from Ensembl which were found not to be present in UniProtKB. The keyword ‘Complete proteome’ was also added to these entries. As for the human counterpart, these proteomes can now be easily retrieved using this keyword. The Ensembl cross-references have been added to the UniProtKB entries on the basis of 100% sequence identity over their full length.

We will expand the coverage to other species of interest in the near future and expect this will be very useful for our users as it will eliminate the need to combine data from different databases.

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • 2-(3-methylbutanoyl)-5-hydroxy-oxazole-4-carbothionic acid (Leu-Cys)
  • Proline 5-hydroxy-oxazole-4-carbothionic acid (Pro-Cys)
New term for the feature key ‘Lipidation’ (‘LIPID’ in the flat file):
  • S-(15-deoxy-Delta12,14-prostaglandin J2-9-yl)cysteine

Changes to keywords

Modified keyword:

UniProt release 2011_06

Published May 31, 2011

Headline

New biocuration pages on UniProt website

One of the central activities of the UniProt Consortium is the biocuration of the UniProt Knowledgebase (UniProtKB). This involves the integration and interpretation of information from a variety of sources as well as accurate and comprehensive representation of the data. The biocuration process adds a wealth of information to UniProtKB records including information related to the role of a protein such as its function, structure, subcellular location, interactions with other proteins, and domain composition, as well as a wide range of sequence features such as active sites and post-translational modifications.

Both manual and automatic approaches are used to add information to UniProtKB records. Manual curation provides high-quality data for experimentally characterised proteins and consists of a critical review of experimental and predicted data for each protein as well as manual verification of each protein sequence. This information is included in the manually reviewed Swiss-Prot section of UniProtKB. In response to the ever-increasing amounts of sequence data, automated methods have been developed by the UniProt Consortium to annotate uncharacterised proteins with a high degree of accuracy and these methods are used to enhance the unreviewed records in UniProtKB/TrEMBL by enriching them with automatic classification and annotation.

In order to keep UniProt users informed of curation practices and priorities within the project, the UniProt website has been updated to include a new section describing UniProt biocuration. This section provides an overview of the manual curation process as well as details of current manual curation priorities. In addition, information is provided about the automatic annotation systems developed and used within the group. Additional useful information such as statistics, links to related resources and relevant publications are also provided.

The pages will continue to be updated on a regular basis to provide users with the latest information about the UniProt curation process and activities.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • S-(2-aminovinyl)-D-cysteine (Cys-Cys)

Changes to keywords

Modified keyword:

UniProt release 2011_05

Published May 3, 2011

Headline

Complete proteomes for Homo sapiens and Mus musculus

With the imminent closure of the International Protein Index (IPI), UniProt has pledged to provide comprehensive and non-redundant complete proteomes for all species that are currently covered by this soon-to-be-defunct resource. With this release of UniProtKB, we provide the first version of the complete proteome for Mus musculus and an updated version of the Homo sapiens set.

We describe here how each of these complete proteomes is produced, and outline their major characteristics.

For Homo sapiens, a first pass annotation of the complete proteome was completed by UniProt in 2008 and all entries were incorporated into UniProtKB/Swiss-Prot. Within this UniProtKB/Swiss-Prot complete H. sapiens proteome, approximately 20,000 putative protein-coding genes are represented by one canonical protein sequence, with some entries describing multiple isoform sequences. Since its initial release, the UniProtKB/Swiss-Prot complete H. sapiens proteome has been extensively curated and the Ensembl cross-references – mapped based on sequence identity – are in the process of being manually verified. All predicted protein sequences from Ensembl (except fragments) that were found to be absent from the UniProtKB/Swiss-Prot complete H. sapiens proteome were imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. These imported UniProtKB/TrEMBL entries were tagged with the keyword ‘Complete proteome’. The aim of this import was to increase the coverage of the existing complete proteome, by supplementing it with those Ensembl protein sequences that had no UniProtKB counterpart. The resulting UniProtKB complete H. sapiens proteome includes both reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome completed in 2008), now supplemented by unreviewed sequences from UniProtKB/TrEMBL. This process will enable the synchronization of the UniProt set with the CCDS project. This version of the complete H. sapiens proteome provides higher sequence coverage than the preceding version, but now includes sequences that have not been manually reviewed. Users can choose to opt either for this expanded complete H. sapiens proteome or a reduced version that derives exclusively from UniProtKB/Swiss-Prot.

For Mus musculus, we added the keyword ‘Complete proteome’ to the existing UniProtKB/Swiss-Prot mouse entries. We identified those entries in UniProtKB/TrEMBL which mapped to the complete genome in Ensembl and imported the predicted protein sequences (except fragments) from Ensembl which were found not to be present in UniProtKB. The keyword ‘Complete proteome’ was also added to these entries. As for the human counterpart, the UniProtKB complete Mus musculus proteome can now be easily retrieved using this keyword. The Ensembl cross-references have been added to the UniProtKB entries on the basis of 100% sequence identity over their full length.
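The matching logic described above can be sketched in a few lines: an Ensembl sequence that is identical over its full length to a UniProtKB sequence yields a cross-reference, and a non-fragment Ensembl sequence with no identical counterpart becomes a candidate for import. This is a minimal illustration under those assumptions, not the UniProt production pipeline; EnsemblProtein, reconcile and the identifiers in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class EnsemblProtein(NamedTuple):
    sequence: str
    is_fragment: bool


def reconcile(
    uniprot: Dict[str, str],
    ensembl: Dict[str, EnsemblProtein],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (cross-references per UniProtKB accession, Ensembl IDs to import)."""
    # Index UniProtKB accessions by exact sequence: string equality is
    # 100% identity over the full length.
    by_sequence: Dict[str, List[str]] = defaultdict(list)
    for accession, sequence in uniprot.items():
        by_sequence[sequence].append(accession)

    cross_refs: Dict[str, List[str]] = defaultdict(list)
    to_import: List[str] = []
    for ensembl_id, protein in ensembl.items():
        matches = by_sequence.get(protein.sequence)
        if matches:
            for accession in matches:
                cross_refs[accession].append(ensembl_id)
        elif not protein.is_fragment:
            # No identical UniProtKB sequence and not a fragment: import it.
            to_import.append(ensembl_id)
    return dict(cross_refs), to_import


if __name__ == "__main__":
    uniprot = {"ACC_1": "MKTAYIAK", "ACC_2": "MSLLTEVE"}
    ensembl = {
        "ENS_1": EnsemblProtein("MKTAYIAK", False),  # identical to ACC_1
        "ENS_2": EnsemblProtein("MGSSHHHH", False),  # absent from UniProtKB
        "ENS_3": EnsemblProtein("MGSS", True),       # fragment, never imported
    }
    print(reconcile(uniprot, ensembl))
```

Indexing by the sequence string keeps the comparison exact, so a near-identical sequence is treated as absent and imported, in line with the redundancy that the set tolerates at this stage.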

There has been a deliberate introduction of redundancy into the proteomes based on the complete genomes to ensure that all alternative protein variants and isoforms are presented in the set. Over time, these will be merged with the parent entry as is UniProtKB/Swiss-Prot curation policy. We are also evaluating the fragment sequences predicted in the Ensembl complete proteomes for future incorporation.

We very much welcome the feedback of the community on our efforts. We expect to make the remaining IPI species (Gallus gallus, Bos taurus, Danio rerio, Arabidopsis thaliana and Rattus norvegicus) and some additional species of interest (Sus scrofa, Canis familiaris) available soon.

UniProtKB News

Changes to the taxonomy of UniProtKB entries from the model fungal organisms Saccharomyces cerevisiae (YEAST) and Schizosaccharomyces pombe (SCHPO)

Historically, UniProt assigned species identification codes (i.e. the 5-letter mnemonic that forms the second part of the composite UniProtKB entry name) at species-level. All sequences of a given protein from a single species were merged into a single UniProtKB/Swiss-Prot entry, which could therefore contain sequences from many different strains of that species. Discrepancies between individual sequences were annotated as “conflicts” or “variants” in the feature table of the entry.

The number of complete genome submissions from different strains of individual species is now increasing at an ever-accelerating rate, and this data has shown that distinct strains of many species can differ considerably both in gene content and within individual shared genes. The sheer amount of complete proteome data and the associated variability between proteomes means that manual merging of individual strains is no longer sustainable, and this approach has been discontinued. UniProt now assigns mnemonic codes at strain-level for complete genome sequences. We are in the process of reassigning many proteomes corresponding to known strains from a species-level taxonomic identifier (defined by the NCBI_TaxID) to the appropriate strain-level taxonomic identifier. In parallel we are also resolving some of the most significant historical cases of strain-level merging.

Here we describe the changes which accompanied the reassignment of the proteomes of the model fungal organisms Saccharomyces cerevisiae (YEAST) and Schizosaccharomyces pombe (SCHPO) to a strain-level taxonomic identifier.

We have changed the taxonomy of all UniProtKB entries of the S. cerevisiae complete proteome from strain S288c (the reference genome sequence stored at SGD) from NCBI_TaxID=4932 to NCBI_TaxID=559292:

OS   Saccharomyces cerevisiae (Baker's yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
OC   Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
OX   NCBI_TaxID=4932;
to
OS   Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
OC   Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
OX   NCBI_TaxID=559292;

To facilitate entry recognition and tracking of the entries, those of strain S288c (NCBI_TaxID=559292) have kept the mnemonic YEAST.

All S. cerevisiae entries not originating from the genome strain S288c, or one of the other completely sequenced strains (strain RM11-1a (YEAS1), strain JAY291 (YEAS2), strain AWRI1631 (YEAS6), strain YJM789 (YEAS7) and Lalvin EC1118 (YEAS8)), remain at a species level taxonomic identifier (NCBI_TaxID=4932), for which the new mnemonic YEASX was created (e.g. MAL62_YEASX for P07265).

A similar procedure was applied to the S. pombe proteome. We have changed the taxonomy of all UniProtKB entries of the S. pombe complete proteome from strain 972 (the reference genome sequence stored at S. pombe GeneDB) from NCBI_TaxID=4896 to NCBI_TaxID=284812:

OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
to
OS   Schizosaccharomyces pombe (strain ATCC 38366 / 972) (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=284812;

All S. pombe entries not originating from the genome strain 972 retain a species-level taxonomic identifier (NCBI_TaxID=4896), for which the new mnemonic SCHPM was created.
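
The reassignment described above can be sketched in a few lines of Python that read the OX line of a flat-file entry and look up the mnemonic now associated with its taxonomic identifier. This is only an illustration: the function names and the lookup table are hypothetical, the table lists only the identifiers named in this note, and the entry for strain 972 assumes (by analogy with YEAST) that its entries have kept the mnemonic SCHPO.

```python
import re

# Identifiers and mnemonics taken from the note above.
# SCHPO for 284812 is an assumption made by analogy with YEAST.
MNEMONIC_BY_TAXID = {
    559292: "YEAST",  # S. cerevisiae strain ATCC 204508 / S288c
    4932: "YEASX",    # S. cerevisiae, species level
    284812: "SCHPO",  # S. pombe strain ATCC 38366 / 972 (assumed)
    4896: "SCHPM",    # S. pombe, species level
}

OX_PATTERN = re.compile(r"^OX\s+NCBI_TaxID=(\d+);")


def taxid_from_ox_line(line):
    """Return the NCBI taxonomy identifier of an OX line, or None."""
    match = OX_PATTERN.match(line)
    return int(match.group(1)) if match else None


def mnemonic_for_entry(flat_file_lines):
    """Return the mnemonic for the first OX line found, or None."""
    for line in flat_file_lines:
        taxid = taxid_from_ox_line(line)
        if taxid is not None:
            return MNEMONIC_BY_TAXID.get(taxid)
    return None


entry = [
    "OS   Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast).",
    "OX   NCBI_TaxID=559292;",
]
print(mnemonic_for_entry(entry))
```

Identifiers that are not in the table, such as those of the other completely sequenced S. cerevisiae strains, yield `None` here because their NCBI_TaxIDs are not given in this note.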

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain)
  • S-(2-aminovinyl)-L-cysteine (Cys-Cys)
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • (E)-2,3-didehydrobutyrine
  • (Z)-2,3-didehydrobutyrine
  • 5’-chlorotryptophan
  • L-allo-isoleucine

UniProt release 2011_04

Published April 5, 2011

Headline

The art of defining the unknown

When the human gene C22orf28 was predicted in 1998, its existence was supported by many cDNAs identified by large scale cDNA sequencing projects. Although nothing was known about the protein it encoded, it was conserved in many species in all 3 kingdoms. A consensus pattern was created in the PROSITE database which was distinctive for all related sequences, from bacteria to mammals. This allowed us to classify all these proteins into a single family, named ‘the Uncharacterized Protein Family (UPF) 0027 (UPF0027)’ to unambiguously indicate the lack of functional data.

For several years, the only annotation in the ‘General annotation (Comments)’ section of the C22orf28 entry was: “Belongs to the UPF0027 (rtcB) family.” Things changed at the beginning of this year with the publication of 3 articles that unraveled C22orf28 function in human, archaea and bacteria. In archaea and human, the protein was shown to be involved in the ligation step during tRNA splicing. In bacteria, the ligation activity may be used in the context of tRNA repair. As of this release, all entries belonging to this family have been updated with these new data and UPF0027 has been deleted from UniProtKB and replaced by the RtcB family.

In the course of manual annotation, we have encountered many examples of uncharacterized conserved proteins. This has led to the definition of a total of 765 UPFs which are listed in the upflist.txt file, available from the UniProt documentation pages, along with all associated entries and information concerning the taxonomic range concerned. Within this list, characterized, hence deleted UPFs represent about 28%. They are tagged with a comment indicating the reason for the deletion, for example, for UPF0027, the comment states that it is “now characterized as a family of RNA-splicing ligases”. It should be mentioned that, in parallel with our efforts to create UPFs, the Pfam database has established an analogous classification system based on ‘Domains of Unknown Function’ (DUFs). It currently reports some 3’000 DUFs.

For bench scientists, UPFs provide a pool of exciting targets for future research since protein sequence conservation suggests important, yet unknown, functions. For database maintenance, the classification of uncharacterized proteins into families presents the major advantage of simplifying the update when functional information becomes available for at least one member.

UniProtKB news

Changes to keywords

Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Modified subcellular locations:

UniProt release 2011_03

Published March 8, 2011

Headline

Dealing with erroneous information: a tricky task

There are many reasons for mistakes in databases, including annotation errors. However, sometimes the annotation is correct, but the original source of information contains erroneous data. Histone arginine demethylase JMJD6 is a good example of the problems raised when this happens.

Histone arginine demethylase JMJD6 was discovered in 2000. At the origin of this discovery lies a monoclonal antibody (mAb217) raised against stimulated macrophages. Phosphatidylserine-displaying liposomes inhibited the binding of mAb217 to macrophages and the antibody prevented the uptake of apoptotic cells. These characteristics suggested that mAb217 interacted with a receptor for phosphatidylserine on the membrane. Using this antibody, a 48-kDa protein was isolated and called “Phosphatidylserine receptor” (PTDSR). The effect of the deletion of the corresponding gene was investigated in knockout mice, but the reactivity of mAb217 was not compared between cells from knockout and wild-type animals. When this experiment was finally carried out, the result was quite surprising: similar staining patterns were observed with cells of both genotypes. It appeared that mAb217 could bind weakly to a PTDSR peptide, but that the antibody mainly recognized another membrane-associated protein. Parallel studies revealed that PTDSR was actually a nuclear protein, an unlikely location for a membrane receptor. After several years and many publications in high-profile journals, it was eventually demonstrated that the 48-kDa protein is not a phosphatidylserine receptor, but a dioxygenase that acts in the nucleus as a histone arginine demethylase and a lysyl-hydroxylase. Since then, other genes have emerged as candidates for the role of phosphatidylserine receptor, including STAB2, BAI1 and TIMD4.

Once the true function of the 48-kDa protein had been established, curators were faced with the challenge of updating the existing annotation to reflect this. First of all, the original recommended protein name “Phosphatidylserine receptor” had to be modified into “Bifunctional arginine demethylase and lysyl-hydroxylase JMJD6”. “Phosphatidylserine receptor” became an alternative name, as it is UniProtKB policy to keep all protein names, even obsolete ones, to facilitate the identification of the protein of interest. The problem in this case is that the obsolete name is misleading. In order to clarify this, the update of the ‘General annotation’ section included the addition of a ‘Caution’ comment, as well as the review of other subsections, such as ‘Function’ or ‘Subcellular location’. In the ‘Caution’ comment, the attention of the users is drawn to the ambiguity of the ‘Alternative name’, as well as to the erroneous conclusions reported in published references still cited in the entries. These references describe the original sequence of the protein and as such cannot be simply deleted from the entry, since this contribution has to be acknowledged. In addition, it confirms for the users that this information has been reviewed in the context of the protein and has not been overlooked.

We thus advise our users to carefully read the ‘Caution’ subsections found in entries which have had an interesting evolution (examples). Our users are also encouraged to send us feedback using the option “Contribute” at the top of each entry if they find mistakes or inconsistencies in our entries.

It is possible to track all changes occurring in an entry across releases by clicking on ‘History’, an option available at the top of each entry. For example, the major update of the human PTDSR entry can be visualized by comparing version 27 (which contained the original information) with the current one.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Cyclopeptide (His-Asn)
  • Cyclopeptide (His-Asp)
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N6-succinyllysine
  • Lysino-D-alanine (Lys)

UniProt release 2011_02

Published February 8, 2011

Headline

Automatic annotation of UniProtKB/TrEMBL using PDB-derived data

Producing manual annotations for UniProtKB/Swiss-Prot entries containing 3D-structural data has been a priority from the very start of the Swiss-Prot database in 1986, at which point the PDB archive contained a total of 213 structures. Almost 25 years later, the PDB archive had a record 7,971 structures deposited in 2010 (giving a total of 70,229 structures) of which 7,848 contained a polypeptide chain. Although not all polypeptides are mapped to UniProtKB (immunoglobulins and synthetic sequences are not mapped, for example, as these sequences are not within the scope of UniProtKB), the vast majority are. In addition, most vertebrate proteins whose 3D structure has been deposited to the PDB archive have been manually annotated in the Swiss-Prot section of UniProtKB.

Of these mapped polypeptides, about 85,000 PDB cross-references map to 25,000 UniProtKB entries. These are divided between approximately 16,000 UniProtKB/Swiss-Prot and 8,000 UniProtKB/TrEMBL entries with at least one PDB cross-reference. Only three years earlier, the UniProtKB/TrEMBL section contained close to 3,000 entries compared to UniProtKB/Swiss-Prot’s about 12,000 entries with a PDB cross-reference. The vast majority of these 8,000 TrEMBL entries are from bacteria and other microbes.

In UniProtKB/Swiss-Prot, 3D-structure data are manually annotated and integrated mainly in the ‘Sequence annotation (Features)’ section (see Q9C0B1 as an example). This enables users to find the salient structural information directly in the UniProtKB/Swiss-Prot entry. The situation is different for entries in UniProtKB/TrEMBL which have not yet benefited from the manual addition of these data. Furthermore, with the number of UniProtKB/TrEMBL entries with a PDB cross-reference more than doubling in three years, an ever-growing amount of new and potentially interesting PDB-derived data would remain difficult to access for many UniProtKB users. The new UniProt-PDB import pipeline addresses this issue and the UniProtKB/TrEMBL now contains annotations derived from the PDBe and PDBe Motif databases. These annotations include 15,000 new sequence annotations (‘Features’) where ligand interactions are shown for interactions with small molecules and metal ions (e.g. Q8U2I8, Q8U2V3, Q939U1), and 10,000 new citations in almost 8,000 entries.

The procedure produces feature annotations for about 200 types of small molecules in the PDB archive, which have been hand-picked to offer the most unambiguous biological activity. Typically, these include the most common metals, enzyme cofactors (as detailed in the CoFactor database), post-translationally modified residues, carbohydrates, flavins and nucleotide phosphates. None of this would have been possible without the UniProtKB-PDB mappings data produced and maintained in a collaborative effort between UniProt and the PDBe in the form of the SIFTS project which provides the crucial link between UniProtKB and PDB sequence coordinates down to the residue level.

By including these data, we hope to improve the accessibility to experimental, position-specific data about ligand binding sites for the scientific community.

All UniProtKB entries containing 3D-structure data can be retrieved using the keyword 3D-structure. They include UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entries.

UniProtKB news

Cross-references to neXtProt

Cross-references have been added to neXtProt, the human protein knowledge platform.

neXtProt is available at http://www.nextprot.org/.

The format of the explicit links in the flat file is:

Resource abbreviation neXtProt
Resource identifier neXtProt unique identifier.
Example P31946:
DR   neXtProt; NX_P31946; -.

Show all the entries having a cross-reference to neXtProt.

Cross-references to GeneTree

Cross-references have been added to the phylogenetic gene trees that are available at www.ensembl.org and www.ensemblgenomes.org.

The format of the explicit links in the flat file is:

Resource abbreviation GeneTree
Resource identifier GeneTree unique identifier.
Example P32234:
DR   GeneTree; EMGT00050000006238; -.

Show all the entries having a cross-reference to GeneTree.
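
The neXtProt and GeneTree cross-references share the same simple layout, which can be read with a small Python sketch such as the one below. The function `parse_dr_line` is a hypothetical helper written for this illustration, not part of any UniProt tool; it assumes that fields are separated by semicolons, that the line ends with a single period, and that a lone `-` marks an empty optional field, as in the examples above.

```python
def parse_dr_line(line):
    """Split a DR line into (resource, identifier, extra_fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop only the terminating period
    fields = [field.strip() for field in body.split(";")]
    resource, identifier = fields[0], fields[1]
    # A lone "-" marks an empty optional field and is not kept.
    extras = [field for field in fields[2:] if field != "-"]
    return resource, identifier, extras


for example in (
    "DR   neXtProt; NX_P31946; -.",
    "DR   GeneTree; EMGT00050000006238; -.",
):
    print(parse_dr_line(example))
```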

UniProt release 2011_01

Published January 11, 2011

Headline

An old-timer, but still trendy: 10’000 entries for Arabidopsis thaliana in UniProtKB/Swiss-Prot

Arabidopsis thaliana belongs to the Brassicaceae family that includes the well-known dietary staples cauliflower, broccoli, cabbage, turnip, radish, canola and mustard. The split between the Arabidopsis group and the other crops of the genus Brassica has been estimated at around 43 million years ago. A. thaliana has been widely studied since the 1980s, when the development of T-DNA mediated transformation made the generation of mutants and their study relatively easy. Since then, A. thaliana has become the model of choice for the study of many biological processes of flowering plants, as well as those specific to the Brassica species.

Its utility as a model organism was further boosted by the completion of the whole genome sequence, which revealed a relatively small genome with a low level of duplication compared to other flowering plants, and by the inception of a number of complementary efforts to sequence the transcriptome.

In 2001, the Swiss-Prot group created the Plant Proteome Annotation Program (PPAP) whose main focus is the annotation of proteins and protein families of A. thaliana and rice. One decade on, we have now annotated over 10’000 A. thaliana entries in UniProtKB/Swiss-Prot. This corresponds to around 36% of the A. thaliana proteome according to version 10 of the Arabidopsis thaliana genome annotation from The Arabidopsis Information Resource (TAIR), which puts the number of protein-coding genes in this organism at 27’416. According to the Multinational Arabidopsis Steering Committee (MASC) report 2010, at least one third of these genes still have no known function, so the work of experimentally characterizing and annotating each gene is far from finished.

All manually annotated A. thaliana entries can be retrieved from UniProtKB/Swiss-Prot using the organism name “Arabidopsis thaliana” (or the taxonomy identifier 3702), with the restriction: “reviewed:yes”.

UniProtKB news

Cross-references to Allergome

Cross-references have been added to Allergome, a platform for allergen knowledge.

Allergome is available at http://www.allergome.org/.

The format of the explicit link in the flat file is:

Resource abbreviation Allergome
Resource identifier Allergome unique identifier
Optional information 1 Allergen name
Example O76821
DR   Allergome; 2; Aca s 13.
DR   Allergome; 3051; Aca s 13.0101.
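
Because the allergen name can itself contain a period (as in `Aca s 13.0101`), a parser for this line should remove only the final terminating period. The Python sketch below illustrates this; `AllergomeXref` and `parse_allergome` are hypothetical names chosen for this example, and the code assumes the field layout shown above.

```python
from typing import NamedTuple, Optional


class AllergomeXref(NamedTuple):
    identifier: str     # Allergome unique identifier
    allergen_name: str  # "Optional information 1"


def parse_allergome(line: str) -> Optional[AllergomeXref]:
    """Parse an Allergome DR line; return None for any other line."""
    prefix = "DR   Allergome; "
    if not line.startswith(prefix):
        return None
    body = line[len(prefix):].rstrip()
    if body.endswith("."):
        body = body[:-1]  # remove the terminating period only
    identifier, allergen_name = body.split("; ", 1)
    return AllergomeXref(identifier, allergen_name)


for example in (
    "DR   Allergome; 2; Aca s 13.",
    "DR   Allergome; 3051; Aca s 13.0101.",
):
    print(parse_allergome(example))
```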

Changes to keywords

New keyword:

Modified keyword:

Deleted keywords:
  • Core protein
  • Fiber protein
  • Fusion protein
  • Hexon protein
  • Hexon-associated protein
  • Phage recognition

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N,N-dimethylleucine

UniProt release 2010_12

Published November 30, 2010

Headline

Fishing for new mutations in the human exome

Understanding the role of genetic variants in human health and disease is crucial in modern biology and medicine. The International HapMap Project and, more recently, the 1000 Genomes Project are progressively unveiling the map of human genome variation at the scale of the human population, generating a flood of interesting data. Smaller research projects focused on disease-causing mutations also contribute through the development of new fruitful approaches. One of the current trends in large and small scale projects is exome sequencing. The rationale is that the clear majority of allelic variants known to underlie mendelian disorders disrupt protein-coding sequences. Restricting sequencing to exons decreases the sample size to 2-5% of that of the whole genome, thus saving time and money, while allowing the identification of missense and nonsense mutations, of small insertions and deletions (indels), as well as of splice donor and acceptor site variants. By definition, exome sequencing does not permit the discovery of mutations in non-coding, regulatory or intronic genomic regions which are known to affect disease.

The exome sequencing strategy is proving to be quite effective, as it has recently been used to pinpoint several genes whose mutations are associated with diseases, including DHODH involved in postaxial acrofacial dysostosis (Ng et al., 2010), WDR62 in severe cerebral cortical malformations (Bilguvar et al., 2010) and MLL2 in Kabuki syndrome (Ng et al., 2010).

The annotation of single amino acid polymorphisms (SAPs) has always been a priority in UniProtKB/Swiss-Prot, including not only ‘neutral’ polymorphisms, resulting from normal variations among individuals, but also disease-associated mutations. Thus missense SAPs identified by the exome-sequencing strategy have been quickly annotated and integrated in the ‘Sequence annotation (Features)’ section of their respective entries (Q02127, O43379 and O14686). The associated phenotypes are described in the ‘General annotation (Comments)’ section in ‘Involvement in disease’ (Q02127, O43379 and O14686).

Over the years, we have developed a defined format to describe SAPs in the ‘Sequence annotation (Features)’ section, including dbSNP accession numbers, when they exist, and links to bibliographic references. Disease-causing mutations are tagged, whenever possible, with the official abbreviation of the phenotype provided by the OMIM database. In addition to missense mutations, in-frame indels are also reported (P35453, P02730 or P33897). When it is not possible to represent the whole variation landscape for a given protein within the UniProtKB entry, we try to provide cross-references to specialized resources (see for instance the ‘Web resources’ section in the human p53 entry). Our annotation effort does not include the representation of mutations that cause major changes to a protein sequence, such as frameshift mutations or variations at splice sites, as their deleterious effects on protein function are usually obvious.

Close to 63’000 human SAPs are currently stored in UniProtKB/Swiss-Prot and about 30% of them are reported as disease-associated in the literature. SAPs selected from this pool are mapped to reference nucleotide sequences from RefSeq and LRG, following the guidelines established by the Human Genome Variation Society for sequence variant designation, and submitted to dbSNP (see for instance dbSNP/Swiss-Prot variant rs121908210). Thanks to a tight collaboration with Ensembl, all human variants stored in UniProtKB and characterized by a dbSNP accession number (or submitted to dbSNP) can also be accessed from the Ensembl database and viewed in the context of their nucleotide sequence (see variant rs1269215 stored in UniProtKB entry Q9BVK8). Our ultimate goal is to spread information about protein variations to the broadest possible audience.

UniProtKB news

Line length limit

Historically, UniProtKB flat file entries were formatted to not exceed 75 characters per line. This limitation served both to display them nicely on small screens and to allow them to be processed by programs that had memory limitations. Since then, computers have become more powerful and most programs have been adapted accordingly. UniProt has already made a few exceptions to the line length limit for data that cannot be wrapped, such as URLs or DOIs, or where wrapping does not increase readability, such as for protein names and a few cross-references to other databases. For the latter in particular, we have a growing amount of additional information to incorporate. We will continue to wrap lines at 75 characters where it helps to increase readability, but allow for more characters where necessary. The new upper limit is 255 characters per line, as some users still depend on software with this limitation.
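
Users who want to check whether their own software copes with the two limits could start from a sketch like the following. It is a minimal Python illustration, not a UniProt tool: the constant and function names are hypothetical and the sample lines are synthetic.

```python
SOFT_LIMIT = 75   # historical wrapping width, kept where it aids readability
HARD_LIMIT = 255  # new upper limit per line


def classify_line(line):
    """Classify one flat-file line by its length (newline excluded)."""
    length = len(line.rstrip("\n"))
    if length > HARD_LIMIT:
        return "over-limit"
    if length > SOFT_LIMIT:
        return "unwrapped"
    return "wrapped"


def over_limit_lines(lines):
    """Return (line_number, length) for every line longer than HARD_LIMIT."""
    return [
        (number, len(line.rstrip("\n")))
        for number, line in enumerate(lines, start=1)
        if len(line.rstrip("\n")) > HARD_LIMIT
    ]


# Synthetic sample lines, for illustration only.
sample = [
    "ID   EXAMPLE\n",
    "DR   " + "x" * 100 + "\n",
    "CC   " + "y" * 300 + "\n",
]
print([classify_line(line) for line in sample])
print(over_limit_lines(sample))
```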

Changes to cross-references to RefSeq

We have introduced an additional field to the cross-reference (DR line in the flat file) to the NCBI Reference Sequences database to show the RefSeq nucleotide accession number.

The format of the explicit links in the flat file is:

DR   RefSeq; RefSeq protein accession number; RefSeq nucleotide accession number.

Example: P00816

Previous format in the flat file:

DR   RefSeq; AP_000992.1; -.
DR   RefSeq; NP_414874.1; -.

New format:

DR   RefSeq; AP_000992.1; AC_000091.1.
DR   RefSeq; NP_414874.1; NC_000913.2.
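
A parser that has to handle files from before and after this change can treat the third field as optional. The Python sketch below is one way to do so; `parse_refseq` is a hypothetical helper that assumes the layout shown above and returns `None` for the nucleotide accession when the field holds only `-`.

```python
def parse_refseq(line):
    """Return (protein_accession, nucleotide_accession or None).

    Lines that are not RefSeq cross-references yield None.
    """
    parts = line.split(None, 1)  # tolerate any spacing after "DR"
    if len(parts) != 2 or parts[0] != "DR":
        return None
    body = parts[1].rstrip()
    if body.endswith("."):
        body = body[:-1]  # accessions contain periods; drop only the last one
    fields = [field.strip() for field in body.split(";")]
    if fields[0] != "RefSeq" or len(fields) < 3:
        return None
    protein, nucleotide = fields[1], fields[2]
    return protein, (None if nucleotide == "-" else nucleotide)


print(parse_refseq("DR   RefSeq; AP_000992.1; -."))
print(parse_refseq("DR   RefSeq; AP_000992.1; AC_000091.1."))
```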

Changes to keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 3’-nitrotyrosine

UniProt release 2010_11

Published November 2, 2010

Headline

Pupylation: a ubiquitin-like tagging system in bacteria

While ubiquitin has been known for decades as a post-translationally conjugated protein degradation tag in eukaryotes, the first identified prokaryotic protein that is functionally analogous to ubiquitin, i.e. prokaryotic ubiquitin-like protein Pup, has only recently been discovered in mycobacteria.

Pup (64 residues) and ubiquitin (76 residues) show neither structural nor sequence homology except for a GG motif near or at the C-terminus. Although both Pup and ubiquitin are attached to the epsilon-amino group of lysine side chains in substrates and target the substrates for degradation by the proteasome, the enzymology of ubiquitination and pupylation and the chemistry of the coupling reaction appear completely different. Ubiquitin is coupled to substrates via the carboxyl group of its C-terminal glycine in a multistep reaction involving several enzymes (see release 2010_10 headline). In the mycobacterial pupylation pathway, the C-terminal glutamine of Pup is first deamidated to glutamate by Dop (deamidase of Pup) after which it is ligated to the substrate lysine of target proteins by proteasome accessory factor A (PafA). Neither Dop nor PafA is similar to ubiquitin-activating enzymes. The covalently Pup-modified protein is then recognized and unfolded by the proteasomal ATPase Mpa and degraded by the proteasome. The very recent discovery of a depupylase activity provided by Dop, able to remove conjugated Pup from target proteins in a manner analogous to the deconjugation of ubiquitin from eukaryotic proteins, strengthens the parallels between the Pup- and ubiquitin-tagging systems of prokaryotes and eukaryotes, respectively. However Mycobacterium appears to have a single Pup ligase to mediate all pupylation and a single depupylase for all pupylated substrates, in contrast to the human genome that encodes hundreds of ubiquitin ligases and dozens of deubiquitinating enzymes.

Taken together, prokaryotes and eukaryotes appear to have developed distinct but parallel mechanisms to regulate protein stability by a similar proteolytic machinery: the proteasome found in all eukaryotes and archaea, and in bacteria of the class Actinobacteria, including the genus Mycobacterium.

All the known pupylation-related proteins in bacteria have now been annotated in UniProtKB/Swiss-Prot.

UniProtKB news

Changes concerning keywords

New keywords:

Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:
  • Filopodium tip
  • Pseudopodium tip
  • Bleb
  • Phagocytic cup

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N,N,N-trimethylserine
  • N,N-dimethylserine
  • N-methylserine
  • CysO-cysteine adduct

UniProt release 2010_10

Published October 5, 2010

UniProtKB/Swiss-Prot ubiquitin pathway annotation

Post-translational modifications (PTMs) can have a profound effect on protein function. They act as switches to activate or inactivate polypeptides, change their subcellular location, modify protein-protein partnerships, etc. However, no PTM is as versatile as ubiquitination, i.e. the post-translational conjugation of ubiquitin. Ubiquitination can occur on a large range of proteins and not only controls their lifespan, but also expands their functional repertoire (see reviews). In view of its importance in many cellular events, we have decided to qualitatively and quantitatively improve our annotation of proteins involved in the ubiquitin and ubiquitin-like pathways in various species, ranging from plants to mammals. Bacteria and archaea, which have recently been shown to have a ubiquitin-like system for protein degradation, called pupylation, were not neglected (see next release’s headline).

Ubiquitin (see for instance entry P0CG47 featuring one of the human ubiquitin gene products) is a small 76 amino-acid protein that is ubiquitously expressed (hence its name) in all eukaryotic cells and highly conserved among eukaryotic species: human and yeast ubiquitin share 96% sequence identity. Ubiquitination most frequently occurs via an isopeptide bond between a lysine of the target protein and the C-terminal glycine of ubiquitin. Substrates can be monoubiquitinated, via the attachment of a single ubiquitin, or multiubiquitinated, when more than one amino acid is modified with monoubiquitin. Ubiquitin can also be added sequentially to substrates to form ubiquitin chains resulting in polyubiquitination. In ubiquitin polymers, the lysine side chain of one ubiquitin molecule is linked to the C terminus of another ubiquitin molecule, and so on. Ubiquitin contains 7 lysine residues, all of which can contribute to such linkages with a different functional outcome for the target protein. For instance the most prominent function of ubiquitin is labeling proteins for proteasomal degradation. This signal is conveyed by polyubiquitin chains linked through the ubiquitin lysine-48 side chain (‘Lys-48’-linked chains). ‘Lys-63’-linked polyubiquitin chain functions in signal transduction and DNA repair without functioning as a degradation signal. Monoubiquitination has recently been shown to have a signaling function in the endocytic pathway.

Three types of enzyme – E1, E2 and E3 – carry out ubiquitination. E1s activate ubiquitin, E2s pick up the ubiquitin from E1 and, in close collaboration with E3, conjugate it to substrates. E3s have a crucial role in recognition of the substrate. They are either catalytically active and directly transfer the activated ubiquitin to the target, or serve as a scaffold linking catalytic E2 to the appropriate substrate. All eukaryotes encode a very limited number of E1 enzymes (a single gene in many species, 3 in humans), but multiple isozymes of E2 and E3, up to several dozen E2s and many hundreds of E3s. This allows the modification of many proteins in a highly specific and controlled manner.

Ubiquitin modification is only transient: enzymes, known as deubiquitinating enzymes (DUBs), can remove ubiquitin molecules that are attached to proteins. They also show specificity towards the type of ubiquitin linkage. For instance, the BRCC3 metalloprotease specifically cleaves ‘Lys-63’-linked polyubiquitin chains, while the cysteine protease USP15 shows preference for ‘Lys-48’ chains.

The ubiquitin pathway turned out to be even more complex with the discovery of several ubiquitin-like proteins, including SUMO, ISG15, NEDD8, UFM1. These proteins also regulate a vast array of cellular events, such as nuclear transport, transcriptional regulation, apoptosis, protein stability, signalling, protein-protein interactions, etc.

The UniProtKB annotation marathon led to the integration of 940 new eukaryotic entries and the annotation of 942 new sites of ubiquitination. Close to 4’000 experimental GO terms have been manually added to UniProtKB entries. 469 proteins directly involved in the process of ubiquitination (and ubiquitin-like conjugation) have been annotated or updated and can now be retrieved with the keyword ‘Ubl conjugation pathway’, along with some 3’400 other manually reviewed entries. Proteins undergoing ubiquitination, including autoubiquitination classically observed in E3 proteins, are tagged with the keyword ‘Ubl conjugation’ and, when known, the effect of the PTM is indicated in the ‘Post-translational modification’ subsection of ‘General Annotation’ (see for instance entry Q9Y243).

UniProtKB News

Changes to cross-references to Ensembl

The cross-references to the Ensembl database have been modified. The optional field describing the species name has been removed, because it is no longer necessary to build a valid URL.

Example:

Previous format in the flat file:

DR   Ensembl; ENST00000220809; ENSP00000220809; ENSG00000104368; Homo sapiens.

New format:

DR   Ensembl; ENST00000220809; ENSP00000220809; ENSG00000104368.
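
Scripts that still hold lines in the previous format can be brought in line with the new one by dropping the trailing species field. The following Python sketch shows the idea; `drop_species_field` is a hypothetical helper that assumes the three Ensembl identifiers (transcript, protein and gene) always come first, as in the example above.

```python
def drop_species_field(line):
    """Rewrite an Ensembl DR line in the new format (no species name).

    Lines already in the new format, and non-Ensembl lines, are
    returned unchanged.
    """
    if not line.startswith("DR   Ensembl;"):
        return line
    body = line.rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body[5:].split(";")]
    # Keep the resource name plus transcript, protein and gene identifiers.
    kept = fields[:4]
    return "DR   " + "; ".join(kept) + "."


old = "DR   Ensembl; ENST00000220809; ENSP00000220809; ENSG00000104368; Homo sapiens."
print(drop_species_field(old))
```

Applying the function to a line that is already in the new format leaves it as it is, so it can be run safely over mixed input.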

UniProt release 2010_09

Published August 10, 2010

Headline

‘De-merge’ of multi-gene entries derived from a single species in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot has historically “merged” 100% identical protein sequences from different genes in the same species into one single record. The aim of this approach was to reduce sequence redundancy within the proteome of individual species, facilitating protein identification and the functional annotation of protein sequences. These merged entries provide extensive annotation of the protein sequence itself, as well as information on each of the individual source genes, including cross-references to external gene-centric resources that provide gene models and genomic information.

As the availability and usage of genomic information has greatly increased in recent years, UniProtKB is modifying its merging policy. We have already begun to “de-merge” entries containing multiple individual genes coding for 100% identical protein sequences into individual UniProtKB/Swiss-Prot entries containing a single gene. This will give a gene-centric view of protein space, where the same protein sequence can be represented multiple times by distinct UniProtKB/Swiss-Prot entries, each of which is based on the translation of a single distinct gene. It will allow a cleaner and more logical mapping of gene and genomic resources to UniProtKB, which provide the major point of entry to the resulting proteome for many users. It will also facilitate the annotation of protein features that are uniquely associated with specific copies of duplicated genes, such as alternative splice forms that are found in genes encoded by multiple exons but not in single exon copies derived from retro-transposed cDNAs. This type of information can be most effectively captured in a gene-centric view of protein space, providing a precise description of how genome evolution and structure impact the protein complement of a cell.

One consequence of this change in annotation policy is that the level of protein sequence redundancy in UniProtKB will slightly increase, as multiple identical instances of a given protein sequence may now exist within the proteome of a particular species or strain. The process of de-merging has already begun with a number of proteins from Escherichia coli and Homo sapiens and other vertebrates, and will be an ongoing process in UniProtKB. For pragmatic reasons, there are several multi-gene families which will not be targeted for de-merge in the near future, as the difficulties associated with maintaining these individual annotated sequences are significant. These include the human histone genes and the calmodulins, which will continue to be grouped into one entry for the time being. However, for simpler cases, especially those in which the genomic context of the gene affects the properties of the encoded protein, de-merging will be preferred.

The de-merge procedure

In simple cases, the de-merge procedure involves the creation of one new UniProtKB entry for each gene in the current merged UniProtKB entry. A new primary accession number is attributed to each de-merged entry, and the primary accession number of the formerly merged entry is retained as a secondary accession number in each of the resulting de-merged entries. To illustrate how the de-merge procedure affects the representation of protein sequences in UniProtKB, consider the example of the human ubiquitin protein. Ubiquitin in humans is encoded by four distinct genes: RPS27A, UBA52, UBB and UBC. RPS27A and UBA52 include a single ubiquitin moiety as an N-terminal fusion to a ribosomal protein, while UBB and UBC encode distinct poly-ubiquitin chains. In UniProtKB release 2010_08, the human ubiquitin protein sequence was represented by a single UniProtKB entry (UBIQ_HUMAN, P62988) that included the ubiquitin protein sequences derived from all four of the aforementioned genes. For UniProt release 2010_09, this entry was de-merged into four distinct UniProtKB entries, one for each of the four ubiquitin genes. Following the de-merge, the ubiquitin chains from RPS27A and UBA52 were re-merged into the entries describing their cognate ribosomal proteins, and are now represented as peptides derived from the translated ubiquitin-ribosomal protein fusion. The final result of this process is four distinct UniProtKB entries that include ubiquitin protein sequences derived from four loci: RPS27A, UBA52, UBB and UBC. Each of these entries retains the primary accession number of the old merged entry UBIQ_HUMAN (P62988) as a secondary accession number.
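
The accession bookkeeping of the simple case can be sketched in a few lines of Python. This is only an illustration, not UniProt's actual pipeline: the `Entry` class, the `demerge` function and the `NEW_*` accession strings are hypothetical placeholders, and only the gene names and the old accession P62988 are taken from the text. The later re-merge of the RPS27A and UBA52 ubiquitin chains into their ribosomal protein entries is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Toy model of a UniProtKB entry (hypothetical, for illustration only)."""
    primary_accession: str
    genes: list
    secondary_accessions: list = field(default_factory=list)


def demerge(merged, new_accessions):
    """Create one entry per gene of a merged entry.

    Each new entry receives its own primary accession (supplied by the
    caller) and keeps the accessions of the merged entry as secondary
    accessions.
    """
    inherited = [merged.primary_accession, *merged.secondary_accessions]
    return [
        Entry(
            primary_accession=new_accessions[gene],
            genes=[gene],
            secondary_accessions=list(inherited),
        )
        for gene in merged.genes
    ]


# The ubiquitin example from the text; the NEW_* accessions are placeholders.
ubiq = Entry("P62988", ["RPS27A", "UBA52", "UBB", "UBC"])
placeholders = {gene: f"NEW_{gene}" for gene in ubiq.genes}
demerged = demerge(ubiq, placeholders)
for entry in demerged:
    print(entry.primary_accession, entry.genes, entry.secondary_accessions)
```

Because every de-merged entry lists P62988 as a secondary accession, a lookup by the old accession now leads to four entries instead of one.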

UniProtKB news

Cross-references to Protein Model Portal

Cross-references have been added to Protein Model Portal, developed as a module of the PSI-Nature Structural Biology Knowledgebase (http://sbkb.org/). The Protein Model Portal provides a single interface to simultaneously query the existing precomputed models at various sites, and gives access to interactive services for template selection, target-template alignment, model building, and quality assessment. Models are provided by the PSI centers (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM) and by independent modeling groups. The task of the portal is to unify the model data from the different sites.

Protein Model Portal is available at http://www.proteinmodelportal.org/.

The format of the explicit links in the flat file is:

Resource abbreviation ProteinModelPortal
Resource identifier UniProtKB accession number
Examples P84155:
DR   ProteinModelPortal; P84155; -.
P27362:
DR   ProteinModelPortal; P27362; -.
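
For readers who process the flat file programmatically, the following minimal Python sketch shows one way to split such a DR line into its parts. The function name `parse_dr_line` is hypothetical, and the sketch assumes that fields never contain a semicolon and that a lone dash marks an empty optional field.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, identifier, optional fields).

    Empty optional fields (a lone dash) are returned as None so that the
    position of each field is preserved.
    """
    if not line.startswith("DR "):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    # Only the final period terminates the line; identifiers may contain dots.
    if body.endswith("."):
        body = body[:-1]
    fields = [part.strip() for part in body.split(";")]
    if len(fields) < 2:
        raise ValueError(f"DR line has too few fields: {line!r}")
    resource, identifier, *optional = fields
    optional = [None if value == "-" else value for value in optional]
    return resource, identifier, optional


print(parse_dr_line("DR   ProteinModelPortal; P84155; -."))
```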

Show all the entries having a cross-reference to Protein Model Portal.

Changes to keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Glycyl cysteine thioester (Cys-Gly) (interchain with G-...)

Website news

New BLAST features

We have updated the BLAST results view of uniprot.org:

  • All information that is visible or configurable in the results page of a text search in one of the UniProt core datasets is now also available in the results page of a BLAST search. Click the Customize display link on the BLAST results page to see which additional columns you can add to the Detailed BLAST results table. For instance, select Comment, press Show, tick Function, press Show to add the column comment(FUNCTION) to see the available functional annotation of your BLAST hits.
  • BLAST results can now be filtered by dataset and taxonomy. The Filter section will show you in brackets the number of hits for each dataset or taxonomy branch. For instance, after running BLAST against the full UniProtKB dataset, you can filter your results to show only hits that are from Bacteria.
  • When running a BLAST search against UniProtKB, it is possible to project the sequence annotations of the matched UniProtKB entries onto the alignments generated by BLAST. To see an alignment, click on it in the Alignments column of the Detailed BLAST results table, then tick the Annotations that you would like to highlight in the alignment. This allows you to see at a glance if important positions are conserved.

Another new feature is the option to run BLAST searches against UniParc. Please use this with caution, as UniParc is an archive that also contains pseudogenes and incorrect CDS predictions.

Updated look and feel

The website received a small face lift to improve the navigation. The UniProt entry views, as well as the various tools’ results views, now have blue navigation bars at the top and bottom with links that allow you to quickly access different sections of big views. Where applicable, the top bar features a Customize display link that lets you customize the view.

UniProt release 2010_08

Published July 13, 2010

Headline

Viral reference strains: a virtual vaccine against virus pandemic in sequence databases

Viruses are not only the most abundant biological entities on the planet, they are also the most represented taxonomic group in UniProtKB. Without contest the title holder is the HIV-1 virus with about 350’000 entries. Taking into account that the HIV genomes encode about 9 proteins, these entries correspond to the equivalent of about 35’000 complete genomes!

While these numbers reflect the tremendous sequence diversity of viruses, they also make it difficult to find one’s way around, and users looking for general information on a viral species face a dilemma: which one to choose? Retrieving only manually reviewed proteins will still leave the user in doubt as the same viral proteins can be present by the dozen in UniProtKB/Swiss-Prot. For example, which Influenza A Hemagglutinin proteins should be selected preferentially among the 170 reviewed entries?

The UniProt solution to this problem is to define viral reference strains, each being representative of one virus genus, to curate them to the highest quality standards and to continuously maintain their annotation. The reference strains that have been selected are those whose genomes belong to the NCBI Reference Sequence collection (RefSeq). Therefore not only their proteomes, but also their genomes are carefully reviewed. The keyword ‘Virus reference strain’ has been created to allow their easy retrieval. At the current time we have defined 355 viral reference strains. These reference strains contain 12’576 proteins, of which 4’500 entries, most representing double strand DNA viruses, have been tagged with the ‘Virus reference strain’ keyword. We are actively updating the remaining 8’000 entries to provide a full set of tagged entries reflecting the diversity of the virus world.

Reference strains allow users to identify the strain with the best and most up-to-date information for any given virus. For bioinformaticians, they present another interesting feature as they can serve as templates for high quality automated annotation of other viruses of the same genus, following a pipeline analogous to the one used in UniProtKB for microbial proteins (see HAMAP program).

The viral reference strains are also accessible via the ViralZone fact sheet which provides links to the corresponding UniProtKB proteome and RefSeq genome (see for instance Influenza A).

UniProtKB News

Format change in the cross-references to WormBase

C.elegans and C.briggsae entries used to have cross-references to both WormPep and WormBase databases. WormPep is no longer active, and all worm sequences are contained in WormBase, a comprehensive database for biological information on worm sequences and annotation. We have therefore removed cross-references to WormPep and modified the WormBase cross-references to include transcript and protein identifiers from WormPep. Proteins with alternative products have one WormBase cross-reference per gene product.

Previous format in the flat file:

DR   WormPep; TranscriptIdentifier; ProteinIdentifier.
DR   WormBase; GeneIdentifier; GeneName.

New format:

DR   WormBase; TranscriptIdentifier; ProteinIdentifier; GeneIdentifier; GeneName.

If there is no GeneName, a dash (’-’) is stored in that position.

Example: O45818

Previous format in the flat file:

DR   WormBase; WBGene00012019; dkf-2.
DR   WormPep; T25E12.4a; CE18967.
DR   WormPep; T25E12.4b; CE18283.
DR   WormPep; T25E12.4c; CE42507.

New format:

DR   WormBase; T25E12.4a; CE18967; WBGene00012019; dkf-2.
DR   WormBase; T25E12.4b; CE18283; WBGene00012019; dkf-2.
DR   WormBase; T25E12.4c; CE42507; WBGene00012019; dkf-2.
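
The mapping from the previous to the new format can be sketched in Python as follows. This is an illustrative sketch rather than the conversion code actually used by UniProt: the function name `to_new_wormbase_format` is hypothetical, and it assumes a single WormBase gene line per entry, as in the O45818 example.

```python
def to_new_wormbase_format(old_lines):
    """Combine old WormPep/WormBase DR lines into the new WormBase format.

    Assumes one WormBase gene line per entry and one WormPep line per
    gene product.
    """
    gene_id, gene_name = None, "-"
    products = []  # (transcript identifier, protein identifier) pairs
    for line in old_lines:
        body = line[2:].strip()  # drop the 'DR' line code
        if body.endswith("."):
            body = body[:-1]  # drop only the terminating period
        fields = [part.strip() for part in body.split(";")]
        if fields[0] == "WormBase":
            gene_id, gene_name = fields[1], fields[2]
        elif fields[0] == "WormPep":
            products.append((fields[1], fields[2]))
    if gene_id is None:
        raise ValueError("no WormBase gene cross-reference found")
    return [
        f"DR   WormBase; {transcript}; {protein}; {gene_id}; {gene_name}."
        for transcript, protein in products
    ]


old = [
    "DR   WormBase; WBGene00012019; dkf-2.",
    "DR   WormPep; T25E12.4a; CE18967.",
    "DR   WormPep; T25E12.4b; CE18283.",
    "DR   WormPep; T25E12.4c; CE42507.",
]
for line in to_new_wormbase_format(old):
    print(line)
```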

Show all the entries having a cross-reference to WormBase.

Cross-references to WormPep have been removed.

Changes concerning keywords

New keywords:

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • S-(coelenterazin-3a-yl)cysteine

Deleted terms:

  • Glutamyl lysine isopeptide (Gln-Lys) (interchain with K-...)
  • Glutamyl lysine isopeptide (Lys-Gln) (interchain with Q-...)

UniProt release 2010_07

Published June 15, 2010

Headlines

UniProt and the International Nucleotide Sequence Database Collaboration

UniProt has had a very beneficial and long-standing collaboration with the three members of the International Nucleotide Sequence Database Collaboration (INSDC) – the EMBL-Bank, GenBank and the DNA Data Bank of Japan (DDBJ). It began at the most basic level with an exchange of nucleotide and protein sequences, evolved through co-development of the nucleotide entry feature table definition to ensure efficient automatic integration of appropriate protein information into UniProt followed by reciprocal cross-references, and from there has recently progressed to a joint endorsement of protein naming guidelines. This was one outcome of the third NCBI Genome Annotation Workshop in Washington, USA in April 2010 where researchers from life science organizations world-wide collaborated to establish minimal standards for prokaryotic and viral annotation. Extremely productive discussions concerning annotation and underlying problems led to a number of resolutions that were adopted by the international microbial sequencing community. The highlight was the development and acceptance by the community of prokaryotic protein naming guidelines (see file proknameprot.txt) based on an initial proposal from the INSDC and UniProt. Following this agreement, INSDC and UniProt also created a more generalised protein naming guideline (see file gennameprot.txt) to make this useful for taxa outside cellular prokaryotes. The decision by the INSDC to provide these guidelines for adoption by all submitters to their databases will greatly enhance the annotation of complete genomes and proteomes and ensure that the user community can exploit this data to its full potential. This is a particularly timely and exciting development given the data avalanche. Future plans for the INSDC and UniProt involve collaboration with the NCBI’s Genome project and the Reference Sequence (RefSeq) collection groups to provide synchronized well-annotated genomes and proteomes.

The new files gennameprot.txt and proknameprot.txt are available in UniProt Documents, Nomenclature and guidelines section, and can be accessed from the Documentation/Help pages.

UniProtKB News

New feature key INTRAMEM in the flat file

In addition to the feature keys TOPO_DOM (which describes the topology of regions for transmembrane proteins that span membrane compartments) and TRANSMEM (which describes the extent of the region spanning a membrane), we have introduced a new feature key INTRAMEM in the flat file to describe the extent of a region located in a membrane without crossing it.

Cross-references to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists

Cross-references have been added to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists. These databases are part of Ensembl Genomes. Ensembl Genomes has been created to complement the existing Ensembl site, which focuses on vertebrate genomes.

The format of the explicit links in the flat file is:

Resource abbreviation EnsemblBacteria or EnsemblFungi or EnsemblMetazoa or EnsemblPlants or EnsemblProtists
Resource identifier Transcript ID
Optional information 1 Protein ID
Optional information 2 Gene ID
Examples Q53653:
DR   EnsemblBacteria; EBSTAT00000032812; EBSTAP00000031682; EBSTAG00000032810.
Q07163:
DR   EnsemblFungi; YDR365W-B; YDR365W-B; YDR365W-B.
Q9NDJ2:
DR   EnsemblMetazoa; FBtr0071602; FBpp0071528; FBgn0020306.
DR   EnsemblMetazoa; FBtr0071603; FBpp0071529; FBgn0020306.
DR   EnsemblMetazoa; FBtr0071604; FBpp0071530; FBgn0020306.
P49333:
DR   EnsemblPlants; AT1G66340.1-TAIR; AT1G66340.1-P; AT1G66340-TAIR-G.
Q54L85:
DR   EnsemblProtists; DDB0305146; DDB0305146; DDB_G0286833.
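
Since each line carries a transcript, a protein and a gene identifier, entries with several gene products (such as Q9NDJ2 above) can be regrouped by gene. The Python sketch below illustrates this; the function name `group_by_gene` is hypothetical, and it assumes that both optional fields are present on every line.

```python
from collections import defaultdict


def group_by_gene(dr_lines):
    """Group (transcript ID, protein ID) pairs by gene ID.

    Assumes every line has the form
    'DR   <resource>; <transcript ID>; <protein ID>; <gene ID>.'
    """
    genes = defaultdict(list)
    for line in dr_lines:
        body = line[2:].strip()  # drop the 'DR' line code
        if body.endswith("."):
            body = body[:-1]  # identifiers may contain dots, so strip only the last one
        _resource, transcript_id, protein_id, gene_id = [
            part.strip() for part in body.split(";")
        ]
        genes[gene_id].append((transcript_id, protein_id))
    return dict(genes)


q9ndj2 = [
    "DR   EnsemblMetazoa; FBtr0071602; FBpp0071528; FBgn0020306.",
    "DR   EnsemblMetazoa; FBtr0071603; FBpp0071529; FBgn0020306.",
    "DR   EnsemblMetazoa; FBtr0071604; FBpp0071530; FBgn0020306.",
]
print(group_by_gene(q9ndj2))
```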

Show all the entries having a cross-reference to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants or EnsemblProtists.

Changes concerning keywords

New keywords:

UniProt release 2010_06

Published May 18, 2010

Headlines

UniProt and Ensembl

The Ensembl project was launched in 2000 as a joint project between the EBI and the Wellcome Trust Sanger Institute, some years before the draft human genome was completed. Even at that early stage, it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data. The goal of Ensembl was therefore to automatically annotate the genome, integrate this annotation with other available biological data and make all this publicly available. Since the launch, many more genomes have been added and the range of available data has expanded to include comparative genomics, variation and regulatory data. A collaboration between UniProt and Ensembl was initiated in 2008 to contribute towards the goal of having the complete human proteome available in UniProtKB/Swiss-Prot. A pipeline was established to import those Ensembl sequences not yet in UniProtKB; it is updated with each Ensembl release, along with a quality assurance feedback loop which ensures that the Ensembl predictions benefit from the manual review in UniProtKB. Since then, the scope of Ensembl has been extended to include manual annotation by the Human And Vertebrate Analysis aNd Annotation (Havana) group at the Sanger Institute, which further adds value to the predictions. Ensembl and UniProt are pleased to announce that this collaboration has now been extended to Mus musculus and Rattus norvegicus and will shortly be extended to Gallus gallus and Bos taurus. The provision of a complete set of protein sequences to users is a priority for the UniProt Consortium and this collaboration contributes significantly to this effort.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Glycyl serine ester (Gly-Ser) (interchain with S-...)
  • Glycyl threonine ester (Gly-Thr) (interchain with T-...)
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 7’-hydroxytryptophan

UniProt release 2010_05

Published April 20, 2010

Headlines

Nonsense-mediated mRNA decay: To be or not to be… integrated in UniProtKB

It has been known for over 30 years that, in yeast, nonsense mutations reduce mRNA levels and that the strength of the reduction depends on the position of the nonsense codon within the locus. This observation, followed by many others in a great variety of eukaryotic organisms, led to the concept of ‘Nonsense-mediated mRNA decay’ (NMD), ‘a surveillance mechanism that detects and degrades mRNAs with premature termination codons (PTCs), thereby preventing the production of faulty proteins’. The key question was what Mother Nature considers a ‘premature’ stop. For mammals, a rule was established stating that ‘if a termination codon is more than about 50 nucleotides upstream of the final exon, it is a PTC and the mRNA that harbors it will be degraded’ (see Nagy and Maquat, 1998). Although we know today that NMD is a much more sophisticated mechanism than previously anticipated (see reviews), the ‘50 nucleotide rule’ is still used to predict potential NMD targets and, on this basis, some databases deleted them from their collections. Since many PTCs are generated by alternative splicing (at least one third of the human alternatively spliced mRNAs contain PTCs), several alternatively spliced isoforms have disappeared from databases, victims of the ‘50 nucleotide rule’.
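
As a rough illustration, the ‘50 nucleotide rule’ quoted above can be written as a one-line check in Python. This is a sketch of the heuristic only, not a model of NMD as it is understood today: the function name, the parameter names and the coordinate convention (offsets counted along the spliced mRNA) are assumptions made for this example.

```python
def is_predicted_ptc(stop_codon_pos, final_exon_start, threshold=50):
    """Apply the '50 nucleotide rule' to one termination codon.

    Both positions are offsets along the spliced mRNA (an assumed
    convention). The termination codon is predicted to be premature if it
    lies more than `threshold` nucleotides upstream of the final exon.
    """
    distance_upstream = final_exon_start - stop_codon_pos
    return distance_upstream > threshold


print(is_predicted_ptc(100, 200))  # stop codon 100 nt upstream of the final exon
print(is_predicted_ptc(170, 200))  # stop codon 30 nt upstream of the final exon
print(is_predicted_ptc(250, 200))  # stop codon inside the final exon
```

Under these assumptions the first call returns True and the other two return False; a termination codon located in the final exon is never flagged.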

Eukaryotic cells detect PTCs during the first round of translation undergone by mRNAs freshly exported from the nucleus. During this ‘pioneer’ round of translation, if the ribosome terminates at a termination codon (TC) in the vicinity of the poly(A) tail, PABPC1 – a poly(A)-binding protein – sends a signal which promotes proper termination of translation. This results in efficient reinitiation of the ribosome at the 5’ end of the mRNA, and the production of a stable mRNP. If the ribosome terminates at a TC that is too far away from the poly(A) tail for it to receive the PABPC1-mediated translation-termination-promoting signal, the UPF1 protein binds to the stalled ribosome instead, thereby marking this TC as premature. Subsequently, a PTC-specific protein complex forms around UPF1, promoting UPF1 phosphorylation and committing the mRNA to rapid degradation.

It is thought that the physical distance, rather than the number of nucleotides, between a TC and the poly(A) tail is a crucial determinant in defining a TC as premature (Eberle et al., 2008). This distance depends on the 3D structure of the mRNA 3’ UTR. This structure can be modified by altering (1) intramolecular base pairing, (2) interaction of the mRNA with RNA-binding proteins and (3) interactions between the involved proteins through post-translational modifications (PTMs). In other words, it can be regulated in a tissue-specific manner, during development, and by environmental cues.

In higher eukaryotes, an additional level of complexity exists which links PTC detection and mRNA splicing. During pre-mRNA processing, the spliceosome removes intron sequences and a set of proteins called the exon-junction complex (EJC) is deposited 20-24 nucleotides upstream of the sites of intron removal. EJCs located within the ORF are removed from the mRNA by elongating ribosomes, and only EJCs located downstream of the TC will still be present when the first ribosome terminates. In organisms producing a large number of PTC-containing mRNAs by extensive alternative pre-mRNA splicing, such as humans, the EJC may have evolved to facilitate efficient recognition and degradation of these transcripts. An EJC downstream of a TC functions as an NMD enhancer by shortening the time window between UPF1 binding and its phosphorylation, hence promoting mRNA degradation.

NMD rarely downregulates the expression of a transcript completely. More commonly, 10-30% of the PTC-containing transcripts survive and may allow the production of physiologically relevant levels of protein products (Neu-Yilik et al., 2004). This is why in UniProtKB, we favour a conservative approach when dealing with protein isoforms predicted to be encoded by an NMD target mRNA. We do not delete them from the database, but rather tag them with the comment: ‘May be produced at very low levels due to a premature stop codon in the mRNA, leading to nonsense-mediated mRNA decay.’ For instance, in entry Q9HB09 (human Bcl-2-like protein 12), 2 isoforms are described, one of which has been predicted to be an NMD target by Hillman et al., 2004 (see also the ‘References’ section of the entry). In some cases, despite the presence of a PTC in the encoding mRNA, the isoform produced seems to be the predominant form, at least in some tissues (see human Gamma-aminobutyric acid type B receptor subunit 1 isoform 1E in entry Q9UBS5).

Currently in UniProtKB/Swiss-Prot, over 300 protein entries describe isoforms that could be produced at low levels due to NMD. 228 proteins from different species are directly involved in the NMD process itself and can be retrieved from UniProtKB with the keyword ‘Nonsense mediated mRNA decay’.

UniProtKB News

Cross-references to UCD-2DPAGE

Cross-references have been added to the University College Dublin 2-DE Proteome Database (UCD-2DPAGE). The database HSC-2DPAGE, previously hosted at Harefield Hospital (and previously also cross-referenced from UniProtKB/Swiss-Prot), has been integrated into UCD-2DPAGE. UCD-2DPAGE currently contains data from Canis familiaris (dog), Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat) and Saccharomyces cerevisiae (baker’s yeast).

UCD-2DPAGE is available at http://proteomics-portal.ucd.ie:8082/cgi-bin/2d/2d.cgi.

The format of the explicit links in the flat file is:

Resource abbreviation UCD-2DPAGE
Resource identifier UCD-2DPAGE accession number (in most cases the primary UniProtKB accession number)
Examples P02648:
DR   UCD-2DPAGE; P02648; -.
O75112:
DR   UCD-2DPAGE; O75112; -.
DR   UCD-2DPAGE; Q9Y4Z5; -.

Show all the entries having a cross-reference to UCD-2DPAGE.

Changes concerning cross-references to HSC-2DPAGE

Cross-references to HSC-2DPAGE have been removed.

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Glycyl serine ester (interchain with G-Cter in ubiquitin)
  • Glycyl threonine ester (interchain with G-Cter in ubiquitin)
New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 2-(S-cysteinyl)pyruvic acid O-phosphothioketal

UniProt release 2010_04

Published March 23, 2010

Headlines

UniProtKB wonder web

UniProtKB/Swiss-Prot was the first biomolecular database to include cross-references in its entries, long before the advent of the internet, and a high level of integration with other databases is a hallmark of the resource. UniProtKB is indeed a general interest database, and the cross-references it includes provide users with easy access to relevant additional information from more specialized resources.

The number of cross-references keeps growing. Over the past year, 21 new databases have been added and 6 out of the 8 phylogenomic databases cross-referenced in UniProtKB have been added during the last 10 months. Today 126 databases are explicitly cross-referenced in the knowledgebase. Most links are stored in the ‘Cross-references’ section.

As of this release, the total number of cross-references in UniProtKB/Swiss-Prot has passed 13 million and the average number per entry is over 25. In TrEMBL, the unreviewed section of UniProtKB, the average number of cross-references per entry is approximately half that (over 11). For both sections, the most represented databases reflect our information sources and annotation strategies. They are:
  1. EMBL-Bank (on average 1.7 cross-references per entry): the vast majority of UniProtKB sequences come from translated CDS submitted to the EMBL-Bank/GenBank/DDBJ; it is therefore not surprising that more than 98% of UniProtKB/Swiss-Prot entries contain a cross-reference to the original nucleotide submission(s). For extensively studied organisms, such as human, the average number of EMBL-Bank cross-references may exceed 7.
  2. InterPro (on average 3.1 cross-references per entry): this integrated database classifies proteins at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. In UniProtKB, we have always paid special attention to domain and family annotation. InterPro predictions are automatically integrated into TrEMBL entries and domain/family annotation is later manually reviewed and completed before integration into Swiss-Prot.
  3. Gene Ontology (GO): UniProtKB annotators manually assign GO terms to all entries they curate and high-quality manually assigned GO terms from other GO Consortium groups are imported to ensure that a comprehensive collection of GO annotations is available through UniProtKB. In addition, UniProtKB incorporates GO terms generated from a range of electronic mapping methods. As a result, the number of GO cross-references per entry is expected to further grow significantly in the near future.

In addition to the “regular” ‘Cross-references’ section, the ‘Web resources’ section offers links to specific web pages or databases whose scope is too specialized to warrant the creation of specific cross-references. For instance, the IARC TP53 mutation database, a repository of somatic and germline TP53 mutations in human cancers, is only available from the human p53 entry. Currently more than 6’500 entries contain ‘Web resources’ sections, which represent some 8’500 additional links. Note that links to relevant databases pepper all sections of Swiss-Prot entries. Cross-references to ENZYME are available from the EC numbers provided in the ‘Protein names’ subsection, links to PubMed from the ‘References’ section, etc.

In conclusion, for a complete overview of a given protein, users should use different resources, each of them shedding complementary light on the field. The coexistence of various databases does not imply competition between them, but rather collaboration, to better serve the life science community. UniProtKB may be used to get a manually reviewed summary of the current knowledge and to direct users to more specialized databases, such as organism-oriented, phylogenomic or genome annotation databases, for more detailed information.

For detailed statistics on cross-references, see our release notes, section 5 (‘Statistics for some line types’).

UniProtKB News

Change of release numbers

In the past, we have distinguished major and minor releases of the UniProt knowledgebase and this was reflected in the release number format: major releases were numbered x.0, minor releases were x.1, x.2, etc. We have abandoned this distinction and changed the format to YYYY_XX where YYYY is the calendar year and XX a 2-digit number that is incremented for each release of a given year, e.g. 2010_01, 2010_02, etc. We will archive previous releases on our ftp site for at least 2 years.
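
Scripts that track UniProt releases may need to parse the new numbering. The following Python sketch shows one way to do so; the names `parse_release` and `format_release` are hypothetical and only the YYYY_XX pattern itself comes from the text above.

```python
import re

# YYYY_XX: a 4-digit calendar year, an underscore and a 2-digit release counter.
_RELEASE_RE = re.compile(r"(\d{4})_(\d{2})")


def parse_release(release):
    """Return (year, number) for a release string such as '2010_04'."""
    match = _RELEASE_RE.fullmatch(release)
    if match is None:
        raise ValueError(f"not a YYYY_XX release number: {release!r}")
    return int(match.group(1)), int(match.group(2))


def format_release(year, number):
    """Build a YYYY_XX release string from a year and a release counter."""
    return f"{year:04d}_{number:02d}"


print(parse_release("2010_04"))
print(format_release(2010, 9))
```

Old-style release numbers such as 15.15 do not match the pattern and raise a `ValueError` in this sketch.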

Change of release cycle

UniProt releases are now published every 4 weeks.

Cross-references to GenoList

Cross-references have been added to the GenoList Integrated Environment for the Analysis of Microbial Genomes. GenoList hosts numerous model organism databases for complete microbial genomes, including BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList which used to be cross-referenced from UniProtKB individually.
Relevant UniProtKB entries from the following organisms are therefore now linked to GenoList:

  • Mycobacterium ulcerans (strain Agy99) (formerly linked to BuruList)
  • Listeria monocytogenes and innocua (formerly linked to ListiList)
  • Mycoplasma pulmonis (formerly linked to MypuList)
  • Photorhabdus luminescens subsp. laumondii (formerly linked to PhotoList)
  • Streptococcus agalactiae serotype III (formerly linked to SagaList)
  • Bacillus subtilis (formerly linked to SubtiList)

GenoList is available at http://genodb.pasteur.fr/cgi-bin/WebObjects/GenoList.woa/

The format of the explicit links in the flat file is:

Resource abbreviation GenoList
Resource identifier Ordered locus name
Examples Q925X3:
DR   GenoList; LIN0124; -.
DR   GenoList; LIN2378; -.
DR   GenoList; LIN2564; -.
P37551:
DR   GenoList; BSU00470; -.

Show all the entries having a cross-reference to GenoList.

Changes concerning cross-references to BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList

Cross-references to BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList have been removed.

Cross-references to ConoServer

Cross-references have been added to the Cone snail toxin database ConoServer. The ConoServer database is a manually curated database dedicated to conopeptides. ConoServer uses standardized names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, the Protein Data Bank and the literature.

The ConoServer web site incorporates specialized features like the graphic display of post-translational modifications that are extensively present in conopeptides. ConoServer manages nucleic acid sequences, protein sequences, and 3D structures. The aim of this resource is to give a comprehensive overview of the diversity of conopeptides and their uses as drugs, drug leads and diagnostic tools.

ConoServer is available at http://www.conoserver.org/.

The format of the explicit links in the flat file is:

Resource abbreviation ConoServer
Resource identifier ConoServer identifier
Optional information 1 Toxin name
Examples P0C8R2:
DR   ConoServer; 2838; ArIA precursor.
DR   ConoServer; 3450; Sequence 299 from Patent EP1852440.
P0C1W3:
DR   ConoServer; 1574; RVIIIA.

Show all the entries having a cross-reference to ConoServer.

Cross-references to MINT

Cross-references have been added to the Molecular INTeraction database MINT, which focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.

MINT is available at http://mint.bio.uniroma2.it/mint/.

The format of the explicit links in the flat file is:

Resource abbreviation MINT
Resource identifier MINT interactor ID
Examples P00925:
DR   MINT; MINT-517950; -.
P0A887:
DR   MINT; MINT-1243319; -.

Show all the entries having a cross-reference to MINT.

Changes concerning keywords

New keywords:

Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:
  • Host basolateral cell membrane

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • S-(dipyrrolylmethanemethyl)cysteine

UniProt release 15.15

Published March 2, 2010

Headlines

Bacillus subtilis, a Gram-positive model bacterium fully annotated in UniProtKB/Swiss-Prot

We are all aware of the importance of model bacterial systems. Escherichia coli K12 is the paradigm for Gram-negative bacteria, but what of Gram-positive bacteria? There is a large variety of these bacteria that serve us, are neutral or infect us, and model systems for these bacteria are in demand.

Bacillus subtilis, a rod-shaped, soil- and water-dwelling bacterium originally described as Vibrio subtilis in 1835 by Ehrenberg and renamed in 1872 by Cohn, has served this role for over a century. B.subtilis differentiates to produce endospores, can be made naturally competent for DNA uptake and is a bacteriophage host. In the wild it has been seen to produce over 2 dozen different antibiotics. These characteristics make it an obvious choice as a model system for bacterial differentiation and genetics, as well as a model for other - often more dangerous - bacteria such as Bacillus anthracis, Mycobacterium tuberculosis or Staphylococcus aureus. Additionally, it is used for the production of various industrially interesting enzymes such as amylases and proteases. A substrain, B.subtilis natto, is used to prepare natto, a traditional Japanese dish made from fermented soybeans. Although B.subtilis is not considered pathogenic for any known organism, it has been isolated from patients suffering from various illnesses such as endocarditis, pneumonia etc., and also occasionally from spoiled food where it might be responsible for cases of food poisoning.

The genome of B.subtilis 168, a widely used laboratory strain, was sequenced by a large international consortium in 1997 - the 6th bacterium to be fully sequenced. The sequence was updated and reannotated in 2009 by the Institut Pasteur and the Génoscope. In coordination with them we have annotated the complete proteome, providing all 4,192 B.subtilis proteins in UniProtKB/Swiss-Prot, each of which has a cross-reference to the dedicated B.subtilis database SubtiList/GenoList as well as to other databases. A list of all B.subtilis UniProtKB/Swiss-Prot entries is available in the bacsu.txt file. This is, of course, only a snapshot of the knowledge about this first fully manually annotated Gram-positive model organism, and it will date quickly. Despite having been so intently studied for so long, there are many B.subtilis proteins about which we know very little. There will be work for years to come for the B.subtilis (and larger scientific) community as these proteins and their homologues are characterized.

All B.subtilis entries can be retrieved from UniProtKB/Swiss-Prot by combining the organism name "Bacillus subtilis" (or the taxonomy identifier 1423) with the keyword 'Complete proteome': organism:"Bacillus subtilis" AND keyword:"Complete proteome", or organism:1423 AND keyword:181.

UniProtKB News

Cross-references to EuPathDB

Cross-references have been added to the Eukaryotic Pathogen Database Resources EuPathDB (formerly ApiDB), an integrated database covering the eukaryotic pathogens of the genera Cryptosporidium, Giardia, Leishmania, Neospora, Plasmodium, Toxoplasma, Trichomonas and Trypanosoma. While each of these groups is supported by a taxon-specific database built upon the same infrastructure, the EuPathDB portal offers an entry point to all these resources ("child databases": e.g. ToxoDB, PlasmoDB, CryptoDB...), and the opportunity to leverage orthology for searches across genera.

EuPathDB is available at http://www.eupathdb.org/.

The format of the explicit links in the flat file is:

Resource abbreviation EuPathDB
Resource identifier Combination of the child database name and the accession number in this database concatenated by a ":".
Examples
P84155:
DR   EuPathDB; TritrypDB:LmjF06.1270; -.

Q38FA5:
DR   EuPathDB; TritrypDB:Tb09.160.2970; -.
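
The compound identifier described above can be unpacked mechanically. The following is a minimal Python sketch, not part of any UniProt tooling: the function name parse_eupathdb_dr is hypothetical, and the code assumes the line follows exactly the layout of the two examples.

```python
def parse_eupathdb_dr(line):
    """Parse one 'DR   EuPathDB; <child db>:<accession>; -.' flat-file line."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or fields[0] != "EuPathDB":
        raise ValueError("not an EuPathDB cross-reference: %r" % line)
    # Split on the first ':' only, so that any later ':' stays in the accession.
    child_db, sep, accession = fields[1].partition(":")
    if not sep:
        raise ValueError("identifier lacks a ':' separator: %r" % fields[1])
    return {"child_db": child_db, "accession": accession}


print(parse_eupathdb_dr("DR   EuPathDB; TritrypDB:LmjF06.1270; -."))
```

Keeping the child database name separate from the accession makes it straightforward to group cross-references by child database.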

Show all the entries having a cross-reference to EuPathDB.

Cross-references to ProtClustDB

Cross-references have been added to Entrez Protein Clusters ProtClustDB, a collection of related protein sequences (clusters) which consists of Reference Sequence proteins encoded by complete genomes. This database contains both curated and non-curated clusters. The Protein Clusters database provides easy access to annotation information, publications, domains, structures, and external links and analysis tools including multiple alignments, phylogenetic trees, and genomic neighborhoods (ProtMap).

ProtClustDB is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.

The format of the explicit links in the flat file is:

Resource abbreviation ProtClustDB
Resource identifier ProtClustDB accession number.
Examples
P99178:
DR   ProtClustDB; PRK05431; -.

P92693:
DR   ProtClustDB; MTH00098; -.

Show all the entries having a cross-reference to ProtClustDB.

Cross-references to SUPFAM

Cross-references have been added to SUPFAM, the Superfamily database of structural and functional annotation for all proteins and genomes. The SUPFAM annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from over 1,200 completely sequenced genomes against the hidden Markov models.

SUPFAM is available at http://supfam.org.

The format of the explicit links in the flat file is:

Resource abbreviation SUPFAM
Resource identifier SUPFAM superfamily identifier.
Optional information 1 SUPFAM superfamily domain name.
Optional information 2 Number of hits found.
Examples
P08519:
DR   SUPFAM; SSF57440; Kringle-like; 38.
DR   SUPFAM; SSF50494; Pept_Ser_Cys; 1.

P00967:
DR   SUPFAM; SSF56042; AIR_synth_C; 2.
DR   SUPFAM; SSF53328; formyl_transf; 1.
DR   SUPFAM; SSF52440; PreATP-grasp-like; 1.
DR   SUPFAM; SSF55326; PurM_N-like; 2.
DR   SUPFAM; SSF51246; Rudmnt_hyb_motif; 1.
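
Because the second optional field is a hit count, SUPFAM lines lend themselves to simple per-entry tallies. The sketch below is illustrative only: the names SupfamXref and parse_supfam_lines are hypothetical, and the code assumes every SUPFAM line carries both optional fields, as in the examples above.

```python
from typing import List, NamedTuple


class SupfamXref(NamedTuple):
    identifier: str   # SUPFAM superfamily identifier, e.g. 'SSF57440'
    domain_name: str  # optional information 1
    hits: int         # optional information 2


def parse_supfam_lines(lines: List[str]) -> List[SupfamXref]:
    """Collect the SUPFAM cross-references found among flat-file lines."""
    xrefs = []
    for line in lines:
        if not line.startswith("DR   SUPFAM;"):
            continue  # skip every other line type
        body = line[5:].strip()
        if body.endswith("."):
            body = body[:-1]  # drop the terminating period
        _, identifier, domain_name, hits = [f.strip() for f in body.split(";")]
        xrefs.append(SupfamXref(identifier, domain_name, int(hits)))
    return xrefs


# The two SUPFAM lines shown above for P08519.
p08519 = [
    "DR   SUPFAM; SSF57440; Kringle-like; 38.",
    "DR   SUPFAM; SSF50494; Pept_Ser_Cys; 1.",
]
xrefs = parse_supfam_lines(p08519)
total_hits = sum(x.hits for x in xrefs)
print(total_hits)
```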

Show all the entries having a cross-reference to SUPFAM.

Format change in the cross-references to HOVERGEN

The format of the cross-references to the HOVERGEN project has changed: the resource identifier, which was a UniProtKB accession number, has been replaced by a HOVERGEN identifier.

Example:

Previous format:

DR   HOVERGEN; P32754; -.

New format:

DR   HOVERGEN; HBG005987; -.

Show all the entries having a cross-reference to HOVERGEN.

Changes concerning keywords

New keywords:

Changes concerning the controlled vocabulary for PTMs

Modified term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

New terms:

  • Alanine isoaspartyl cyclopeptide (Ala-Asn)
  • Glycyl cysteine dithioester (Cys-Gly) (interchain with G-...)
  • Trithiocysteine (Cys-Cys)

Modified terms for the feature key 'Lipidation' ('LIPID' in the flat file):

New terms:

  • N-[(12R)-12-hydroxymyristoyl]cysteine
  • N-(12-oxomyristoyl)cysteine

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

New terms:

  • S-(4-hydroxycinnamyl)cysteine
  • S-cysteinyl cysteine
  • Tele-(1,2,3-trihydroxypropan-2-yl)histidine

UniProt release 15.14

Published February 9, 2010

Headlines

Bornavirus: another viral stowaway in the human genome

Analysis of the human genome sequence has revealed that our 'book of life' is multi-authored. About 0.5% of human genes are derived from bacteria and 8% of our total genetic material results from viral infections (see also release 2.1 headline). These genomic viral "fossils" are ancient retroviruses, which are known to insert their genetic information into host chromosomal DNA. They do so by producing a DNA copy from their RNA genome by use of a viral enzyme, called reverse transcriptase. The viral DNA then integrates into the host genome, becoming a permanent part of the cell.

A recent Japanese study has unveiled another viral stowaway in the human gene pool. Several copies of the bornavirus N gene turn out to be part of the human genome and of other mammalian genomes, including chimpanzees, gorillas and African elephants. These genes are remnants of a bornavirus which presumably infected proto-hominids, and other species, some forty million years ago. This ancient virus has disappeared and nowadays bornaviruses are known to infect mainly horses, inducing neurological diseases.

This discovery came as a surprise since the bornaviral RNA genome is not known to be retrocopied into DNA at any stage of the viral replication cycle, nor to integrate into the host genome. This unusual integration into our ancestors' genome may have helped them survive a pathogenic virus or may have played a role in primate evolution. As is often the case in evolutionary biology, there are many more questions than answers, but this serves as a useful reminder that human evolution does not rely only on our own intrinsic potential, but also on a tight interaction with other living species in our environment.

A bornavirus-derived gene is actually expressed in human cells. It is called 'Endogenous Borna-like N element' (EBLN-1) and can be retrieved from UniProtKB/Swiss-Prot using the accession number Q6P2I7.

UniProtKB News

Changes concerning keywords

New keyword:

Changes concerning the controlled vocabulary for PTMs

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

New terms:

  • Diiodotyrosine
  • Glycyl adenylate
  • Iodotyrosine
  • Threonine methyl ester

Modified term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

New term:

  • Glycyl cysteine dithioester (Gly-Cys) (interchain with C-...)

UniProt release 15.13

Published January 19, 2010

Headlines

XMRV complete proteome in UniProtKB/Swiss-Prot

Despite the 118 human pathogenic viruses identified so far, our knowledge of these pathogens is still incomplete. Several human pathologies are suspected to be induced by unknown viruses. In this context, a new virus was isolated from human prostate in 2006 and was named 'Xenotropic Moloney murine leukemia virus-Related Virus' (XMRV). This retrovirus is the first representative of the gammaretrovirus genus to be isolated in humans. These retroviruses are known to induce various cancers in their host and a causal link with prostate cancer was suspected. This link was experimentally established but later refuted and thus remains a matter of debate. The same virus has been recently associated with chronic fatigue syndrome (CFS): XMRV has been isolated in 4% of healthy subjects, and in 67% of CFS patients. Large scale epidemiological studies must be performed to establish with certainty whether these correlations are relevant.

Where did XMRV come from? Retroviruses identified in patients with CFS or prostate cancer are highly related (more than 90% DNA sequence identity) to a group of mouse viruses called xenotropic murine leukemia virus (MLV). Xenotropic MLVs are endogenous retroviruses, i.e. the viral DNA is stably integrated in the mouse genome. Mice produce low levels of the virus - a few infectious particles per ml of blood - but the virus cannot reinfect mouse tissues. Instead it spreads to other species, such as humans, which is the reason for the term 'xenotropic', meaning the virus can grow in species other than the species of origin. Therefore it makes sense to hypothesize that XMRV is a xenotropic MLV that crossed from mice to humans.

The mode of transmission of XMRV is largely unknown. It could be via transfusion, intravenous drug use, or by other blood-borne routes, but other modes of transmission (respiratory, sexual, etc.) cannot be excluded.

It will take time to answer the numerous questions raised by the discovery of XMRV. In terms of treatment, the good news is that some of the anti-retroviral drugs used for treating AIDS can immediately be tested for their efficacy against CFS. Indeed, susceptibility of XMRV to AZT has recently been demonstrated.

The complete proteome of XMRV has been annotated along with that of the well-studied MLV, which is 65% (env) to 85% (gag-pol) identical and has served as a model for XMRV functional annotation.

UniProtKB News

Cross-references to eggNOG

Cross-references have been added to eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups), a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (i.e. derived from the original COG/KOG categories).

eggNOG is available at http://eggnog.embl.de/.

The format of the explicit links in the flat file is:

Resource abbreviation eggNOG
Resource identifier eggNOG cluster identifier.
Example
P33887:
DR   eggNOG; maNOG10115; -.

Show all the entries having a cross-reference to eggNOG.

Format change in the cross-references to HAMAP

The format of the cross-references to the HAMAP database has changed in order to align it with the format of other InterPro member databases.

Previous format:

Resource abbreviation HAMAP
Resource identifier HAMAP unique identifier for a protein family.
Optional information 1 Nature of hits found. The values are either 'fused', 'atypical', 'atypical/fused' or '-': 'fused' indicates that the family signature does not cover the entire protein; 'atypical' means that the protein is divergent in sequence or has mutated functional sites and should not be included in family datasets; 'atypical/fused' is a combination of the previous two cases; '-' is a placeholder for an empty field.
Optional information 2 Number of hits found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.
Examples
P12743:
DR   HAMAP; MF_00326; -; 1.

Q9K3D6:
DR   HAMAP; MF_00006; fused; 1.
DR   HAMAP; MF_01105; atypical/fused; 1.

New format:

Resource abbreviation HAMAP
Resource identifier HAMAP unique identifier for a protein family signature.
Optional information 1 HAMAP entry name for a protein family.
Optional information 2 Number of hits found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.
Optional information 3 Nature of hits found. The values are either 'fused', 'atypical', 'atypical/fused' or '-': 'fused' indicates that the family signature does not cover the entire protein; 'atypical' means that the protein is divergent in sequence or has mutated functional sites and should not be included in family datasets; 'atypical/fused' is a combination of the previous two cases; '-' is a placeholder for an empty field.
Examples
P12743:
DR   HAMAP; MF_00326; Ribosomal_L7Ae; 1; -.

Q9K3D6:
DR   HAMAP; MF_00006; Arg_succ_lyase; 1; fused.
DR   HAMAP; MF_01105; N-acetyl_glu_synth; 1; atypical/fused.
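
Code that reads flat files from both before and after this change has to cope with both layouts. One possible approach, sketched here in Python under the assumption that the number of fields alone is enough to tell the layouts apart, is shown below; the function name parse_hamap_dr and the returned dictionary keys are hypothetical.

```python
def parse_hamap_dr(line):
    """Parse a HAMAP DR line in either the previous or the new layout."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = [f.strip() for f in body.split(";")]
    if fields[0] != "HAMAP":
        raise ValueError("not a HAMAP cross-reference: %r" % line)
    if len(fields) == 4:
        # Previous layout: identifier; nature of hits; number of hits.
        identifier, nature, hits = fields[1:]
        entry_name = None  # the previous layout has no entry name
    elif len(fields) == 5:
        # New layout: identifier; entry name; number of hits; nature of hits.
        identifier, entry_name, hits, nature = fields[1:]
    else:
        raise ValueError("unexpected number of fields: %r" % line)
    return {
        "identifier": identifier,
        "entry_name": entry_name,
        "hits": int(hits),
        # '-' is the placeholder for an empty field.
        "nature": None if nature == "-" else nature,
    }


old = parse_hamap_dr("DR   HAMAP; MF_01105; atypical/fused; 1.")
new = parse_hamap_dr("DR   HAMAP; MF_01105; N-acetyl_glu_synth; 1; atypical/fused.")
print(old["nature"] == new["nature"], new["entry_name"])
```

Both layouts yield the same identifier, hit count and nature of hits; only the entry name is absent from lines in the previous layout.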

Show all the entries having a cross-reference to HAMAP.

Format change in the cross-references to HOGENOM

The format of the cross-references to the HOGENOM project has changed: the resource identifier, which was a UniProtKB accession number, has been replaced by a HOGENOM identifier.

Example:

Previous format:

DR   HOGENOM; P0A9I1; -.

New format:

DR   HOGENOM; HBG676713; -.

Show all the entries having a cross-reference to HOGENOM.

Changes concerning keywords

New keywords:

Changes in controlled vocabulary for subcellular locations

New subcellular locations:

  • Barrier septum
  • Cell septum
  • Cell tip
  • Photoreceptor inner segment
  • Photoreceptor outer segment

Changes concerning the controlled vocabulary for PTMs

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

New terms:

  • 5-glutamyl 2-aminoadipic acid
  • 5-glutamyl N2-lysine

UniProt release 15.12

Published December 15, 2009

Headlines

Through the Looking-Glass

All amino acids but glycine can exist in either of two optical isomers, called L- or D-amino acids, which are mirror images of each other. However, we have been taught for decades that proteins that occur in nature are made out of L-forms. There are some well-known exceptions, of course, but these are restricted to prokaryotes. Indeed, D-forms are abundant components of the peptidoglycan cell walls of bacteria, and are also observed in bacterial natural antibiotics, such as actinomycin D, bacitracin or tetracycline. These latter are quite unusual peptides that are synthesized by multienzyme complexes in a stepwise fashion without the participation of mRNA. It has also been observed that the mammalian brain contains high levels of free D-serine, which appears to be a physiological coagonist of N-methyl D-aspartate receptors (NMDARs) and, as such, may act as a neurotransmitter in the brain, but this activity is carried out by the amino acid itself and does not occur within the context of a polypeptide. The isolation, in the 1980s, of naturally occurring animal peptides containing D-amino acids challenged the dogma, leading to the discovery of a new post-translational modification (PTM): L- to D-isomerization.

In 1981, Montecucchi et al., looking for enkephalin-related peptides in various amphibia, isolated dermorphin from the skin of Phyllomedusa sauvagei. Dermorphin is produced by 2 different precursors: cleavage of Dermorphin-1 gives rise to 4 mature dermorphins and that of Dermorphin-2 to 5 mature peptides, all of which have the identical sequence: YAFGYPS. This heptapeptide binds with high affinity and selectivity to mu-type opioid receptors and appears to be a thousand times more potent than morphine in inducing deep long-lasting analgesia when injected into mice or rats. Interestingly, the second amino acid of dermorphin is D-alanine. A synthetic isomer, containing L-alanine at that position, is virtually devoid of biological activity.

This discovery was followed by many others. Deltorphins, another class of frog opioid peptides, also characterized by a D-amino acid at position 2, were isolated. Another amphibian, Bombina variegata, was shown to express antimicrobial D-amino acid-containing peptides, called bombinins, on its skin. Arthropoda, such as spiders, lobsters and crayfish, and Mollusca entered the game. Cone snail peptide toxins have been extensively studied in this context and they currently represent 60% of all animal D-amino acid-containing proteins annotated in UniProtKB/Swiss-Prot. A single mammal appears on the list: the platypus, with 2 peptides, C-type natriuretic peptide 39 and Defensin-like peptide 2/4, expressed in its venom gland.

Animal D-amino acid-containing proteins are synthesized on ribosomes following a classical mRNA template; unusual codons have not been observed. In addition, some of them have been isolated from their biological source with both L- and D-amino acid at the appropriate position. These observations suggested that L- to D-amino acid isomerization is a bona fide PTM. An enzyme catalyzing the conversion of an Omega-agatoxin-Aa4b serine (at position 46 of the mature peptide, 81 in the precursor) from L- to D-form has been isolated from the funnel-web spider Agelenopsis aperta and its partial sequence is available in UniProtKB/TrEMBL. A similar mammalian activity has been characterized from platypus venom.

L- to D-amino acid isomerization presents significant advantages. The modified peptides become more resistant to protease degradation and hence much more stable. In addition, X-ray crystallography studies have shown that the isomerization creates new structures, such as peculiar beta-turns. The creation of these new structural elements seems crucial for interaction with specific partners, opiate receptors for instance, and may act as a switch that turns on protein activity.

L- to D-amino acid isomerization could be more frequent than initially thought. It cannot be predicted by software tools and is not detectable by any of the standard techniques used in proteomics. It was only discovered when a synthetic peptide with the same sequence of L-amino acids appeared to be biologically inactive. We could be facing a novel strategy of multicellular organisms to circumvent stereochemical limitations imposed by the genetic code in an effort to increase molecular diversity.

In UniProtKB, all D-amino acid-containing proteins can be retrieved using the keyword 'D-amino acid'. To restrict the search to animal proteins, add 'Metazoa' to the taxonomy field.

UniProtKB News

Cross-references to ArachnoServer

Cross-references have been added to ArachnoServer, a spider toxin database. ArachnoServer is a manually curated database containing information on the sequence, three-dimensional structure, and biological activity of protein toxins derived from spider venom.

ArachnoServer is available at http://www.arachnoserver.org/.

The format of the explicit links in the flat file is:

Resource abbreviation ArachnoServer
Resource identifier ArachnoServer unique identifier.
Optional information 1 Toxin name.
Examples
P61232:
DR   ArachnoServer; AS000384; beta-hexatoxin-Mg1a.
DR   ArachnoServer; AS000417; beta-hexatoxin-Mr1a.

Q7M485:
DR   ArachnoServer; AS000160; Sphingomyelinase D (LrSicTox1) (N-terminal fragment).
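
The toxin name is free text and, as the Q7M485 example shows, may contain spaces and parentheses, so it is safer to split off only the leading fields and keep the remainder intact. A minimal Python sketch follows; the function name parse_arachnoserver_dr is hypothetical, and the code assumes the fields are separated by "; " as in the examples.

```python
def parse_arachnoserver_dr(line):
    """Return (identifier, toxin name) from an ArachnoServer DR line."""
    prefix = "DR   ArachnoServer; "
    if not line.startswith(prefix):
        raise ValueError("not an ArachnoServer cross-reference: %r" % line)
    body = line[len(prefix):].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop only the single terminating period
    # Split once: everything after the identifier belongs to the toxin name.
    identifier, _, toxin_name = body.partition("; ")
    return identifier, toxin_name


print(parse_arachnoserver_dr(
    "DR   ArachnoServer; AS000160; Sphingomyelinase D (LrSicTox1) (N-terminal fragment)."
))
```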

Show all the entries having a cross-reference to ArachnoServer.

Cross-references to InParanoid

Cross-references have been added to InParanoid, a database of eukaryotic ortholog groups. The InParanoid database is a collection of pairwise comparisons that currently covers 35 complete proteomes. The InParanoid program constructs orthology groups from the pairwise similarity scores between two complete proteomes, calculated using NCBI-Blast.

InParanoid is available at http://inparanoid.sbc.su.se/.

The format of the explicit links in the flat file is:

Resource abbreviation InParanoid
Resource identifier UniProtKB accession number.
Example
P10038:
DR   InParanoid; P10038; -.

Show all the entries having a cross-reference to InParanoid.

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:

  • Host multivesicular body

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Glutamyl lysine isopeptide (Gln-Lys) (interchain with K-...)
  • Glutamyl lysine isopeptide (Lys-Gln) (interchain with Q-...)

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Glutamyl 5-glycerylphosphorylethanolamine -> 5-glutamyl glycerylphosphorylethanolamine

UniProt release 15.11

Published November 24, 2009

Headlines

Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?

More than 99% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources. These CDS are either generated by the application of gene prediction programs to genomic DNA sequences or via the hypothetical translation of cloned cDNAs (see FAQ 37). These methods themselves provide varying degrees of support for the existence of a protein, which may be further supplemented in some cases by other types of evidence (such as mass spectrometry data or evidence from direct protein sequencing).

In July 2007, a new topic was introduced into UniProtKB to indicate the evidence for the existence of a given protein, called 'Protein existence' (PE). 5 levels of evidence have been defined: 1. evidence at protein level (e.g. clear identification by mass spectrometry), 2. evidence at transcript level (e.g. the existence of a putative coding cDNA), 3. inferred by homology (a predicted protein which has been assigned membership of a defined protein family in UniProtKB), 4. predicted (a predicted protein which has not yet been assigned membership of a defined protein family in UniProtKB) and 5. uncertain (e.g. dubious sequences,