News

UniProt release 2017_08

Published August 30, 2017

Headline

Curation of human immunoglobulin genes: a fruitful collaboration between UniProtKB/Swiss-Prot and IMGT®

The existence of an agent in the blood that could neutralize diphtheria toxin was reported as early as 1890. Over a century after this major discovery, much is known about immunoglobulins (IG) or antibodies. They are large heterodimeric proteins made up of 2 heavy (H) chains and 2 light (L) kappa or lambda chains, held together by disulfide bonds to form a ‘Y’-shaped molecule. Each chain comprises one variable (V) domain at the N-terminal end and one or several (for L and H, respectively) constant (C) domains. The antigen binding site is formed by the V domain of one H chain, together with that of its associated L chain. Thus, each immunoglobulin has 2 antigen binding sites with remarkable affinity for a particular antigen. Each variable domain is encoded by a variable (V) gene, a diversity (D) gene (only for H) and a joining (J) gene which are assembled by a process called V-(D)-J rearrangement and can then be subjected to somatic hypermutations which, after exposure to antigen and selection, allow affinity maturation for a particular antigen. The resulting rearranged V-(D)-J genes are further spliced to C genes. The C region determines the effector properties and the mechanism used to destroy the antigen, such as activation of complement or binding to Fc receptors. An immunoglobulin is encoded by 7 genes (IGHV, IGHD, IGHJ, IGHC for the H chain and IGKV, IGKJ, IGKC for a kappa or IGLV, IGLJ and IGLC for a lambda L chain). The human genome contains 176 functional immunoglobulin genes clustered in 3 loci, IGH on chromosome 14 (50 V, 23 D, 6 J and 9 C), IGK on chromosome 2 (40 V, 5 J and 1 C) and IGL on chromosome 22 (32 V, 5 J and 5 C).
During the development of B cells, the mechanisms of diversity involved in the immunoglobulin synthesis (combinatorial V-(D)-J diversity, junctional diversity and somatic hypermutations) lead to the huge potential antibody repertoire of each individual, estimated to comprise 10^12 different immunoglobulins, the limiting factor being only the number of B cells that an organism is genetically programmed to produce.
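
As a back-of-the-envelope check of the numbers quoted above, the following Python sketch tallies the functional genes per locus and multiplies out the V-(D)-J gene combinations. The function names are illustrative, and the product is only a naive estimate of combinatorial diversity: it ignores junctional diversity and somatic hypermutation, and it assumes that any V, D and J gene of a locus can be combined freely.

```python
# Functional gene counts per locus, copied from the paragraph above.
LOCI = {
    "IGH": {"V": 50, "D": 23, "J": 6, "C": 9},  # chromosome 14
    "IGK": {"V": 40, "J": 5, "C": 1},           # chromosome 2
    "IGL": {"V": 32, "J": 5, "C": 5},           # chromosome 22
}


def total_genes(loci):
    """Sum all functional genes across the loci."""
    return sum(sum(counts.values()) for counts in loci.values())


def vdj_combinations(counts):
    """Number of V-(D)-J gene combinations for one locus (D only exists for IGH)."""
    return counts["V"] * counts.get("D", 1) * counts["J"]


def naive_pairings(loci):
    """Heavy-chain combinations times (kappa + lambda) light-chain combinations."""
    heavy = vdj_combinations(loci["IGH"])
    light = vdj_combinations(loci["IGK"]) + vdj_combinations(loci["IGL"])
    return heavy * light


print(total_genes(LOCI))     # 176
print(naive_pairings(LOCI))  # 2484000
```

The gene tally reproduces the 176 functional genes stated in the text. The naive pairing count is several orders of magnitude below 10^12, which is consistent with the text attributing the full repertoire to all three mechanisms together rather than to gene combinations alone.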

In 2008, we announced the first draft of the complete human proteome in UniProtKB/Swiss-Prot, and have been continuing to update this resource ever since. Recent work performed in collaboration with the IMGT® team has included a thorough review and update of the immunoglobulin genes, for which we now present a representative set of full-length germline immunoglobulin protein sequences. 15 entries showing the sequence of all C gene products and 122 representing all V gene products are now publicly available. These entries can be retrieved with the keywords ‘Immunoglobulin C region’ and ‘Immunoglobulin V region’, respectively. D and J gene products are extremely small, with an average of 5 amino acids for D genes and 15-30 for J. In other words, they are too short to be informative on their own. Therefore we have decided to curate a single peptide representative of D gene products and 3 of J gene products, one for H chains and 2 for L chains kappa and lambda. As for other human proteins, the sequences shown match the translation of the reference genome (Genome Reference Consortium GRCh38/hg38). The nomenclature used is the official one from IMGT/GENE-DB, approved by HGNC and endorsed by NCBI Gene and the IUIS-Nomenclature SubCommittee. Cross-references were implemented in the 141 UniProtKB/Swiss-Prot immunoglobulin entries, providing direct access to the dedicated IMGT® resource and its comprehensive sequence repertoire, which currently describes 927 alleles from 462 functional and non-functional genes together with a wealth of additional information concerning immunoglobulins. Reciprocal links to UniProtKB from IMGT® ensure easy navigation between both resources.

We also provide several examples of full-length rearranged immunoglobulins. Among the 10^12 predicted sequences, we have selected some of those that have been entirely sequenced at the amino acid level. However, the representation of the full repertoire is beyond the scope of our knowledgebase and UniProtKB users interested in these complex molecules are advised to visit IMGT®.

We would like to take this opportunity to thank Marie-Paule Lefranc, Sofia Kossida and the IMGT® team for this fruitful collaboration, which is beneficial not only for both resources, but hopefully also for the scientific community as a whole.

Cross-references to ELM

Cross-references have been added to the Eukaryotic Linear Motif (ELM) resource for functional sites in proteins.

ELM is available at http://elm.eu.org.

The format of the explicit links is:

Resource abbreviation ELM
Resource identifier UniProtKB accession number

Example: P12931

Show all entries having a cross-reference to ELM.

Text format

Example: P12931

DR   ELM; P12931; -.

XML format

Example: P12931

<dbReference type="ELM" id="P12931"/>

RDF format

Example: P12931

uniprot:P12931
  rdfs:seeAlso <http://purl.uniprot.org/elm/P12931> .
<http://purl.uniprot.org/elm/P12931>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ELM> .
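
For readers who process the flat file, the following Python sketch shows one way to pick up the new cross-reference from a DR line and to build the resource URI used in the RDF example above. The function names and the regular expression are illustrative assumptions, not part of any UniProt library.

```python
import re

# Matches a flat-file cross-reference line such as "DR   ELM; P12931; -."
DR_LINE = re.compile(r"^DR\s+(?P<db>[^;]+);\s*(?P<fields>.+?)\.\s*$")


def parse_dr_line(line):
    """Split a DR line into database name, primary identifier and extra fields.

    Returns None when the line is not a DR line. Fields that are just "-"
    (meaning "no value") are dropped from the extras.
    """
    match = DR_LINE.match(line)
    if match is None:
        return None
    fields = [field.strip() for field in match.group("fields").split(";")]
    return {
        "database": match.group("db").strip(),
        "id": fields[0],
        "extra": [field for field in fields[1:] if field != "-"],
    }


def elm_purl(accession):
    """Build the ELM resource URI following the pattern of the RDF example."""
    return "http://purl.uniprot.org/elm/" + accession


xref = parse_dr_line("DR   ELM; P12931; -.")
print(xref)                  # {'database': 'ELM', 'id': 'P12931', 'extra': []}
print(elm_purl(xref["id"]))  # http://purl.uniprot.org/elm/P12931
```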

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Mental retardation, X-linked, syndromic, 10

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniParc news

UniParc XSD change for InterPro annotations

To reduce the sequence redundancy in UniProtKB, we apply a procedure to identify highly redundant proteomes within selected species groups and exclude them from UniProtKB. Their sequences are still available for download from the UniParc sequence archive, which stores protein sequences that are 100% identical and the same length in a single record, with cross-references to the source database where the protein exists. UniParc also includes basic annotation data (taxonomy, gene and protein names, proteome identifier and component) to allow users interested in redundant proteomes to retrieve meaningful data sets. We have now further enhanced UniParc with InterPro annotations, and for this purpose we have extended the UniParc XSD with new elements and types, as shown below:

    <xs:element name="entry">
        <xs:complexType>
            <xs:sequence>
                ...
                <xs:element name="signatureSequenceMatch" type="seqFeatureType" minOccurs="0" maxOccurs="unbounded"/>
                ...
            </xs:sequence>
            ...
        </xs:complexType>
    </xs:element>
    ...
    <xs:complexType name="seqFeatureType">
        <xs:sequence>
            <xs:element name="ipr" type="seqFeatureGroupType" minOccurs="0" maxOccurs="1"/>
            <xs:element name="lcn" type="locationType" minOccurs="1" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="database" type="xs:string" use="required"/>
        <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:complexType name="seqFeatureGroupType">
        <xs:attribute name="name" type="xs:string"/>
        <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:complexType name="locationType">
        <xs:attribute name="start" type="xs:int" use="required"/>
        <xs:attribute name="end" type="xs:int" use="required"/>
    </xs:complexType>
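
To illustrate how the new elements might be consumed, here is a minimal Python sketch that reads signatureSequenceMatch elements from an entry. The sample fragment is hypothetical: its attribute values are placeholders, and it omits the XML namespace that real UniParc files declare, so the tag names would need a namespace prefix when reading actual data.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped after the XSD above; all values are placeholders.
SAMPLE = """
<entry>
  <signatureSequenceMatch database="Pfam" id="PF00000">
    <ipr name="Example domain" id="IPR000000"/>
    <lcn start="10" end="120"/>
    <lcn start="150" end="200"/>
  </signatureSequenceMatch>
</entry>
"""


def signature_matches(entry):
    """Collect signature matches with their InterPro group and locations."""
    results = []
    for match in entry.findall("signatureSequenceMatch"):
        ipr = match.find("ipr")  # minOccurs="0", so this may be missing
        results.append({
            "database": match.get("database"),
            "id": match.get("id"),
            "interpro_id": ipr.get("id") if ipr is not None else None,
            "interpro_name": ipr.get("name") if ipr is not None else None,
            # start and end are declared as xs:int in locationType
            "locations": [
                (int(lcn.get("start")), int(lcn.get("end")))
                for lcn in match.findall("lcn")
            ],
        })
    return results


matches = signature_matches(ET.fromstring(SAMPLE))
print(matches[0]["interpro_id"], matches[0]["locations"])
```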

UniProt release 2017_07

Published July 5, 2017

Headline

A pseudogene turns into an active DNA methyltransferase dedicated to male fertility

It is well established that in mammals, the DNA methylation machinery is composed of 3 DNA methyltransferase (DNMT) enzymes, DNMT1, DNMT3A, and DNMT3B, and one catalytically inactive cofactor, DNMT3L. Some 46 million years ago, in the last common ancestor of the muroid rodents, the DNMT3B gene was duplicated, giving rise to Gm14490. The genes share about 70% identity, but Gm14490 underwent pseudogenization, and there is no evidence for its transcription. Germline-specific knockouts of DNMT3A or DNMT3B demonstrate the crucial role of these genes in methylation of most imprinted loci in germ cells (and somatic tissues), but some transposon loci, such as minor satellite DNA and intracisternal A particle (IAP) repeats, are only minimally affected, an observation which can be attributed to the functional redundancy of the 2 genes. This is what was thought and published, until recently.

Retrotransposon silencing is of paramount importance, especially in the male germline. Indeed, in the absence of silencing, retrotransposon reactivation leads inexorably to meiotic failure, azoospermia, and sterility marked by small testis size, a phenotype called hypogonadism. It is therefore essential to understand which actors are involved in this process. Barau et al. tackled the issue by generating mutant mice through N-ethyl-N-nitrosourea (ENU) mutagenesis and screening hypogonadal male mice for ectopic retrotransposon activity, followed by whole genome sequencing to identify the culprits. This approach led to the discovery of an ENU-independent mutation, which was identified as a de novo IAP insertion located in an unexpected locus, the last intron of the Gm14490 pseudogene. Serendipity definitely is a scientist’s best friend!

This was only the beginning of the surprises. Contrary to what had been previously reported, the Gm14490 gene proved to be expressed, but exclusively in male germ cells. This restriction could explain the absence of corresponding ESTs in databases and the erroneous former assumption that it was untranscribed. During embryonic development, its expression peaks at the time of de novo DNA methylation (between 16.5 and 18.5 dpc) in prospermatogonia. Moreover, Gm14490 appeared to be catalytically active when transfected in ES cells. A new genuine DNA methylase was born and renamed DNMT3C!

In the absence of DNMT3C, either by knockout or by IAP insertion, retrotransposons, and more specifically some types of long interspersed nuclear elements (LINEs) and some endogenous retroviruses (ERV), are reactivated. Interestingly, this reactivation is particularly strong for evolutionarily ‘young’ subfamilies, indicating DNMT3C’s unique selectivity. The existence of a 5th DNA methylase selectively targeted at young retrotransposons, acting only in the context of fetal spermatogenesis, may be of particular relevance in Muroidea, including mice and rats. This lineage is particularly enriched in young transposons with about 25% that have integrated into the genome in the last 25 million years with currently thousands of active copies. In comparison, in the primate ancestor, massive integration occurred long before (80 million years ago for elements such as LINEs) and these transposons have since become extinct.

In view of these results, DNMT3C has been deleted from our pseudogene list, annotated and integrated into UniProtKB/Swiss-Prot, where it is available to you. The knowledgebase contains some other sequences derived from putative pseudogenes (see headline of November 2009). Like all other UniProtKB/Swiss-Prot entries, they are continuously reviewed. Some of them are deleted from UniProtKB, when data pointing at an inactive gene are overwhelming, but they can always be retrieved from UniParc. Other entries are progressively ‘upgraded’, when new data become available, to bona fide proteins as was the case for DNMT3C.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keyword:

UniProt release 2017_06

Published June 7, 2017

Headline

Sexual reproduction: good ideas shared with viruses

Sexual reproduction is a brilliant eukaryotic invention that allows the reassortment of alleles through recombination. The first step is the formation of haploid male and female gametes that unite to form a new individual. Most gametes unite by membrane fusion, a process mediated by specialized proteins, called fusogens. The study of these proteins is difficult, since they are often scarce. The few identified so far are clade-specific, such as bindin in echinoderms or izumo in mammals, suggesting that each clade has evolved its own fusion strategy. This is at least what was thought until the discovery of hapless-2 (HAP2¹), also called generative cell specific-1 (GCS1).

Hapless-2 is a single-span transmembrane protein located at the gamete cell surface, typically at mating structures. It is essential for gamete fusion in the green alga Chlamydomonas reinhardtii, but also in other plants, including Arabidopsis thaliana and Lilium longiflorum, and in protozoans, such as Plasmodium berghei or Tetrahymena thermophila. A thorough eukaryotic genome examination reveals the existence of this gene in many major eukaryotic taxa, from slime molds to the honey bee. It is however not present in fungi, nor in most animals, including humans. The wide evolutionary distribution of hapless-2 suggests it was present in the last eukaryotic common ancestor and lost in some clades later on. Disruption of hapless-2 blocks gamete fusion, but not adhesion to gametes of the opposite mating type (or sex), suggesting that gamete adhesion relies on proteins that are species-specific, but that fusion itself is mediated by an ancestral common gene product.

Earlier this year, the 3D-structure of Chlamydomonas reinhardtii hapless-2 was unraveled. The secondary and tertiary structures of the ectodomain are almost identical to viral class II proteins, such as the envelope protein E of flaviviruses, with which hapless-2 shares very low identity at the amino acid level, and which are also involved in membrane fusion. Fédry et al. hypothesize that these fusion proteins most certainly derived from a common ancestor, whose gene has likely been transferred via horizontal exchange.

Like the flavivirus class II proteins, the hapless-2 ectodomain trimerizes concomitantly with insertion into the membrane of the partner gamete. The trigger for trimerization of hapless-2 is not yet known, although acidification, which drives trimerization of flavivirus class II proteins in late endosomes, is not required.

Information gained from the 3D structure of hapless-2 may help in the development of transmission-blocking vaccines (TBVs), a new strategy to fight malaria (and other protozoan diseases). Successful transmission of Plasmodium from humans to mosquitoes relies on hapless-2-dependent fusion of the parasite gametes and fertilization, which occurs rapidly after ingestion by the mosquito. If TBVs could be designed to induce anti-hapless-2 antibodies in human hosts, these would be ingested by Anopheles mosquitoes along with blood Plasmodium gametocytes. The initial gamete fusion step could be prevented and the deadly cycle of transmission blocked. This approach has already been tested in model animals and, although the preliminary results look promising, they are not yet sufficient for clinical development. The identification of new peptides, that are both functionally crucial and immunogenic, may prove very helpful in the design of efficient anti-malaria TBVs.

As of this release, hapless-2 UniProtKB/Swiss-Prot entries have been created and are publicly available.

¹ The acronym HAP2 is somewhat unfortunate, since this protein has nothing to do with the yeast HAP2 transcription factor. These are the mysterious ways of nomenclature, which sometimes may be quite confusing…

UniProtKB news

Modification of cross-references to PATRIC

We have modified our cross-references to the PATRIC database in order to reflect the new PATRIC primary identifier scheme. The earlier identifier scheme used simple numeric ids, e.g. 32117610, which were replaced by more informative primary identifiers such as fig|1427269.3.peg.1028.

Text format

Example: Q9ZNI1

Previous format:

DR   PATRIC; 19579917; VBIStaAur99865_1117.

New format:

DR   PATRIC; fig|93061.5.peg.1117; -.

XML format

Example: Q9ZNI1

Previous format:

<dbReference type="PATRIC" id="19579917">
  <property type="gene designation" value="VBIStaAur99865_1117"/>
</dbReference>

New format:

<dbReference type="PATRIC" id="fig|93061.5.peg.1117"/>

RDF format

Example: Q9ZNI1

Previous format:

uniprot:Q9ZNI1
  rdfs:seeAlso <http://purl.uniprot.org/patric/19579917> .
<http://purl.uniprot.org/patric/19579917>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PATRIC> ;
  rdfs:comment "VBIStaAur99865_1117" .

New format:

uniprot:Q9ZNI1
  rdfs:seeAlso <http://purl.uniprot.org/patric/fig%7C93061.5.peg.1117> .
<http://purl.uniprot.org/patric/fig%7C93061.5.peg.1117>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PATRIC> .
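
Note that the new identifiers contain a vertical bar, which appears percent-encoded as %7C in the RDF resource URI. The following Python sketch shows how such URIs could be built and decoded with the standard library; the function names are illustrative assumptions, and the URI prefix is taken from the example above.

```python
from urllib.parse import quote, unquote

PATRIC_PURL_PREFIX = "http://purl.uniprot.org/patric/"


def patric_purl(identifier):
    """Percent-encode a PATRIC identifier and append it to the purl prefix."""
    # safe="" makes sure characters such as "|" and "/" are always encoded.
    return PATRIC_PURL_PREFIX + quote(identifier, safe="")


def patric_id_from_purl(uri):
    """Recover the PATRIC identifier from a purl URI."""
    if not uri.startswith(PATRIC_PURL_PREFIX):
        raise ValueError("not a PATRIC purl: " + uri)
    return unquote(uri[len(PATRIC_PURL_PREFIX):])


uri = patric_purl("fig|93061.5.peg.1117")
print(uri)                       # http://purl.uniprot.org/patric/fig%7C93061.5.peg.1117
print(patric_id_from_purl(uri))  # fig|93061.5.peg.1117
```

Old-style numeric identifiers contain no reserved characters, so they pass through the same functions unchanged.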

New file linking deleted entries to their subsequently reinstated versions

Since release 2015_04, we have been applying at each release a procedure to identify highly redundant proteomes within selected species groups using a combination of manual and automatic methods. This procedure prevents the creation of UniProtKB/TrEMBL entries from these redundant proteomes, but also means that a huge number of previously existing entries had to be deleted from UniProtKB when the procedure was put in place.

It may happen that proteomes that were identified as redundant are later reinstated as non-redundant, e.g. a proteome for a strain used as a model by a significant community or with proteins that have been crystallized. In the past, it has also happened on rare occasions that entries were deleted but later reinstated for other reasons. In such cases, the UniProtKB entries are created anew, with new accession numbers.

To help users to link deleted to subsequently reinstated entries, we are introducing a file that maps old to new accession numbers via their protein_ids. This file is available (in compressed format) by FTP at

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/reinstated_map.txt.gz

This mapping will also be used to make queries for obsolete identifiers on the UniProt website more meaningful.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • Cyclopeptide (Glu-Asn)

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • S-methylmethionine

Deleted term

  • N-acetylated lysine

Changes in subcellular location controlled vocabulary

New subcellular location:

Changes to keywords

New keyword:

UniProt release 2017_05

Published May 10, 2017

Headline

A certain taste for light

In most organisms, light perception is essential for survival. It not only mediates image-forming vision, but also performs other functions, such as phototaxis and circadian rhythm. Light-sensing function is carried out by photoreceptors, of which only 2 types are known in metazoans: opsins and cryptochromes. They are typically composed of two moieties: a protein and a prosthetic chromophore, the latter being responsible for light absorption. Consequently, photoreceptor denaturation, which targets the protein moiety, does not abolish light absorption, although it shifts absorbance peaks to different wavelengths. Photoreceptor activation by light induces a signaling pathway, called phototransduction, which involves the activation of a G-protein, the modulation of cGMP levels and ultimately a change in the permeability of cyclic nucleotide-gated channels.

It has long been thought that Caenorhabditis elegans, an eyeless, soil-dwelling nematode, could not sense light. This assumption turned out to be erroneous. Not only does C. elegans sense light, but it vigorously escapes from it. This behaviour is elicited only in response to blue or shorter wavelengths of light, with maximal responsiveness to UV light. This mechanism may have evolved to protect the animal against prolonged direct sunlight exposure that paralyzes and eventually kills it. Indeed, worms appear to spend much of their time above ground, living on small surface-dwelling animals or their carcasses and may therefore be frequently exposed to direct sunlight. From the very beginning of the discovery of phototransduction in C. elegans, it was obvious that the lite-1 gene was involved in this process, as its heterologous expression in muscle cells was sufficient to confer light responsiveness on these cells that were normally unresponsive. Lite-1 was also shown to act upstream of G proteins, but its exact function remained unclear. Is it a bona fide photoreceptor? Or is it just sensing light-produced chemicals? Like opsins, which are the most common photoreceptor proteins in metazoan photoreceptor cells, lite-1 contains a 7-transmembrane domain. However, it does not share any sequence similarity with opsins and its topology is opposite to conventional 7-transmembrane receptors, with its N-terminus located intracellularly and its C-terminus extracellularly. In fact, lite-1 belongs to the insect gustatory receptor family of chemoreceptors, rather than the opsin family. To clarify its role, Gong et al. purified lite-1 and showed that it directly absorbs photons with an efficiency 10 to 100 times that of all known photoreceptors, capturing both UVA and UVB light. Interestingly, absorption of UVA and UVB light can be separated. For instance, mutations at residues Ala-332 and Ser-226 disrupt UVA absorption, but do not affect UVB absorption.
In addition, prolonged light illumination, which bleaches conventional photoreceptors, abolishes lite-1 absorption of UVA, but does not affect that of UVB, which appears to be more stable and relatively resistant to photobleaching.

Another remarkable lite-1 feature is that it loses all photoabsorption abilities upon denaturation, suggesting that this activity strictly depends on its conformation and not upon the presence of a chromophore. Mutational analysis pointed at 2 tryptophan residues (Trp-77 and Trp-328) that are required for the absorption of both UVA and UVB light. In order to confirm the importance of these residues, Gong et al. introduced ‘Trp-77’ by mutagenesis at the equivalent position in a structurally related gustatory receptor, called gur-3, which contains ‘Trp-328’, but is not photosensitive. Amazingly, mutated gur-3 absorbs UVB light with an efficiency of about 30% of that of lite-1. All these observations indicate that lite-1 is a bona fide photoreceptor of a novel type.

The C. elegans lite-1 entry has been updated and is publicly available as of this release.

UniProtKB news

Extension of controlled vocabulary for PTM to glycosylation sites

Our controlled vocabulary for post-translational modification, so far used to standardize the annotation of modified residues, lipidation sites and protein cross-links, has been extended to include terms for glycosylation sites.

Change of the nomenclature for glycosylation sites

We have introduced a change to the nomenclature for glycosylation sites.

We previously described the occurrence of the attachment of a glycan (mono- or polysaccharide) to an amino-acid residue with the following elements:

  • The type of linkage (C-, N-, O- or S-linked) to the protein
  • The abbreviation of the reducing terminal sugar (shown between parentheses): If three dots ‘...’ follow the abbreviation, this indicates an extension of the carbohydrate chain. Conversely, the absence of dots means that a monosaccharide is linked.

To this we have added:

  • The name of the glycosylated amino acid

The new nomenclature is thus composed of three elements:

<linkage type> (<reducing carbohydrate>) <amino acid name>.

The valid values have been added to our controlled vocabulary for post-translational modifications and applied to all Glycosylation annotations.

Example: Q9HCN3

Previous nomenclature:

FT   CARBOHYD    144    144       N-linked (GlcNAc...).

New nomenclature:

FT   CARBOHYD    144    144       N-linked (GlcNAc...) asparagine.

Note that this information about the type of glycosylation can be complemented by

  • the name of the modified protein form,
  • information on whether the modification is carried out by a host protein,
  • the frequency of the modification or the relationship with another feature (‘partial’, ‘alternate’, ‘transient’),
  • evidence attribution

as documented for modified residues.
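
The three-element nomenclature lends itself to simple parsing. The Python sketch below splits a glycosylation description into linkage type, reducing carbohydrate, chain extension and amino acid name. It is an illustrative assumption rather than an official parser: the regular expression is derived only from the pattern described above, the optional second parenthesised element (e.g. ‘complex’) is inferred from the term list in this release, and the complementary qualifiers listed in the note above are not handled.

```python
import re
from typing import NamedTuple, Optional


class GlycoSite(NamedTuple):
    linkage: str                 # "C", "N", "O" or "S"
    sugar: str                   # abbreviation of the reducing terminal sugar
    extended: bool               # True when dots follow the abbreviation
    glycan_class: Optional[str]  # e.g. "complex", when present
    residue: str                 # name of the glycosylated amino acid


GLYCO_PATTERN = re.compile(
    r"^(?P<linkage>[CNOS])-linked "
    r"\((?P<sugar>[^()]+?)(?P<extended>\.\.\.|…)?\)"
    r"(?: \((?P<glycan_class>[^()]+)\))?"
    r" (?P<residue>[a-z]+)$"
)


def parse_glycosylation(description):
    """Parse a description in the new nomenclature; return None if it does not fit."""
    text = description.strip().rstrip(".")  # FT descriptions end with a period
    match = GLYCO_PATTERN.match(text)
    if match is None:
        return None
    return GlycoSite(
        linkage=match.group("linkage"),
        sugar=match.group("sugar"),
        extended=match.group("extended") is not None,
        glycan_class=match.group("glycan_class"),
        residue=match.group("residue"),
    )


print(parse_glycosylation("N-linked (GlcNAc...) asparagine."))
print(parse_glycosylation("N-linked (GlcNAc...)."))  # previous nomenclature: None
```

A description in the previous nomenclature lacks the amino acid name and is therefore rejected, which gives a quick way to spot annotations that have not been converted.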

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Glycosylation’ (‘CARBOHYD’ in the flat file):

  • C-linked (Man) hydroxytryptophan
  • C-linked (Man) tryptophan
  • N-linked (DATDGlc) asparagine
  • N-linked (GalNAc) asparagine
  • N-linked (GalNAc…) asparagine
  • N-linked (GalNAc…) (glycosaminoglycan) asparagine
  • N-linked (Glc) arginine
  • N-linked (Glc) asparagine
  • N-linked (Glc) (glycation) histidine
  • N-linked (Glc) (glycation) isoleucine
  • N-linked (Glc) (glycation) lysine
  • N-linked (Glc) (glycation) valine
  • N-linked (Glc…) arginine
  • N-linked (Glc…) asparagine
  • N-linked (GlcNAc) arginine
  • N-linked (GlcNAc) asparagine
  • N-linked (GlcNAc…) arginine
  • N-linked (GlcNAc…) asparagine
  • N-linked (GlcNAc…) (complex) arginine
  • N-linked (GlcNAc…) (complex) asparagine
  • N-linked (GlcNAc…) (high mannose) arginine
  • N-linked (GlcNAc…) (high mannose) asparagine
  • N-linked (GlcNAc…) (hybrid) arginine
  • N-linked (GlcNAc…) (hybrid) asparagine
  • N-linked (GlcNAc…) (keratan sulfate) arginine
  • N-linked (GlcNAc…) (keratan sulfate) asparagine
  • N-linked (GlcNAc…) (paucimannose) arginine
  • N-linked (GlcNAc…) (paucimannose) asparagine
  • N-linked (GlcNAc…) (polylactosaminoglycan) arginine
  • N-linked (GlcNAc…) (polylactosaminoglycan) asparagine
  • N-linked (Hex) arginine
  • N-linked (Hex) asparagine
  • N-linked (Hex) tryptophan
  • N-linked (Hex…) arginine
  • N-linked (Hex…) asparagine
  • N-linked (HexNAc) arginine
  • N-linked (HexNAc) asparagine
  • N-linked (HexNAc…) arginine
  • N-linked (HexNAc…) asparagine
  • N-linked (Lac) (glycation) lysine
  • N-linked (Man) tryptophan
  • O-linked (Ara) hydroxyproline
  • O-linked (Ara…) hydroxyproline
  • O-linked (DADDGlc) serine
  • O-linked (DATDGlc) serine
  • O-linked (GATDGlc) serine
  • O-linked (Fuc) serine
  • O-linked (Fuc) threonine
  • O-linked (Fuc…) serine
  • O-linked (Fuc…) threonine
  • O-linked (FucNAc) serine
  • O-linked (FucNAc…) serine
  • O-linked (Gal) hydroxylysine
  • O-linked (Gal) hydroxyproline
  • O-linked (Gal) serine
  • O-linked (Gal) threonine
  • O-linked (Gal…) hydroxylysine
  • O-linked (Gal…) hydroxyproline
  • O-linked (Gal…) serine
  • O-linked (Gal…) threonine
  • O-linked (GalNAc) serine
  • O-linked (GalNAc…) serine
  • O-linked (GalNAc…) (keratan sulfate) serine
  • O-linked (GalNAc) threonine
  • O-linked (GalNAc…) threonine
  • O-linked (GalNAc…) (keratan sulfate) threonine
  • O-linked (GalNAc) tyrosine
  • O-linked (GalNAc…) tyrosine
  • O-linked (Glc) hydroxylysine
  • O-linked (Glc) serine
  • O-linked (Glc…) serine
  • O-linked (Glc) tyrosine
  • O-linked (Glc…) tyrosine
  • O-linked (GlcA) serine
  • O-linked (GlcNAc) hydroxyproline
  • O-linked (GlcNAc…) hydroxyproline
  • O-linked (GlcNAc) serine
  • O-linked (GlcNAc…) serine
  • O-linked (GlcNAc) threonine
  • O-linked (GlcNAc…) threonine
  • O-linked (GlcNAc) tyrosine
  • O-linked (GlcNAc…) tyrosine
  • O-linked (GlcNAc1P) serine
  • O-linked (GlcNAc6P) serine
  • O-linked (Man) serine
  • O-linked (Man…) serine
  • O-linked (Man…) (keratan sulfate) serine
  • O-linked (Man) threonine
  • O-linked (Man…) threonine
  • O-linked (Man…) (keratan sulfate) threonine
  • O-linked (Man1P) serine
  • O-linked (Man1P…) serine
  • O-linked (Man6P) threonine
  • O-linked (Man6P…) threonine
  • O-linked (Xyl) serine
  • O-linked (Xyl…) serine
  • O-linked (Xyl…) (chondroitin sulfate) serine
  • O-linked (Xyl…) (dermatan sulfate) serine
  • O-linked (Xyl…) (heparan sulfate) serine
  • O-linked (Xyl…) (glycosaminoglycan) serine
  • O-linked (Xyl…) (keratan sulfate) threonine
  • O-linked (Xyl…) (glycosaminoglycan) threonine
  • O-linked (Hex) hydroxylysine
  • O-linked (Hex…) hydroxylysine
  • O-linked (Hex) hydroxyproline
  • O-linked (Hex…) hydroxyproline
  • O-linked (Hex) serine
  • O-linked (Hex…) serine
  • O-linked (Hex) threonine
  • O-linked (Hex…) threonine
  • O-linked (Hex) tyrosine
  • O-linked (Hex…) tyrosine
  • O-linked (HexNAc) hydroxyproline
  • O-linked (HexNAc…) hydroxyproline
  • O-linked (HexNAc) serine
  • O-linked (HexNAc…) serine
  • O-linked (HexNAc) threonine
  • O-linked (HexNAc…) threonine
  • O-linked (HexNAc) tyrosine
  • O-linked (HexNAc…) tyrosine
  • S-linked (Gal) cysteine
  • S-linked (Gal…) cysteine
  • S-linked (Glc) cysteine
  • S-linked (Glc…) cysteine
  • S-linked (GlcNAc) cysteine
  • S-linked (GlcNAc…) cysteine
  • S-linked (Hex) cysteine
  • S-linked (Hex…) cysteine
  • S-linked (HexNAc) cysteine
  • S-linked (HexNAc…) cysteine

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • Cysteine sulfonic acid (-SO3H)

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Cardiomyopathy, dilated 1T
  • Sarcoidosis early-onset

UniRef news

Addition of GO annotation to UniRef90 and UniRef50 clusters

We have started to compute Gene Ontology (GO) annotations for UniRef90 and UniRef50 clusters: A GO term is assigned to a cluster when it is found in all UniProtKB members that are annotated with this term, or when it is a common ancestor of at least one GO term of each such member.
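
The assignment rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term identifiers and the parent map below form a made-up toy hierarchy, members without any GO annotation are skipped (our reading of ‘each such member’), and no attempt is made to prune redundant ancestor terms, which the actual pipeline may well do.

```python
def ancestors_or_self(term, parents):
    """Return the term together with all of its ancestors in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def cluster_go_terms(member_terms, parents):
    """Terms that every annotated member carries directly or via an ancestor."""
    annotated = [terms for terms in member_terms.values() if terms]
    if not annotated:
        return set()
    closures = []
    for terms in annotated:
        closure = set()
        for term in terms:
            closure |= ancestors_or_self(term, parents)
        closures.append(closure)
    return set.intersection(*closures)


# Toy hierarchy with placeholder identifiers (not real GO terms).
PARENTS = {"T:leaf1": ["T:mid"], "T:leaf2": ["T:mid"], "T:mid": ["T:root"]}
MEMBERS = {
    "member1": {"T:leaf1"},
    "member2": {"T:leaf2"},
    "member3": set(),  # unannotated member, skipped under our assumption
}

print(sorted(cluster_go_terms(MEMBERS, PARENTS)))  # ['T:mid', 'T:root']
```

In the toy example, neither leaf term is shared by both annotated members, but their common ancestors are, so only the ancestors are assigned to the cluster.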

The UniRef XML format now represents the GO annotations with property elements. We have introduced three new types: "GO Molecular Function", "GO Biological Process", "GO Cellular Component". The values of these property elements are GO identifiers.

Example:

<entry id="UniRef50_B0KJL7" updated="2017-03-15">
  <name>Cluster: Animal haem peroxidase</name>
  ...
  <property type="GO Molecular Function" value="GO:0004601"/>
  <property type="GO Biological Process" value="GO:0006979"/>
  ...

This change does not affect the XSD, but may nevertheless require code changes.
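
As an example of the kind of code change that may be needed, the following Python sketch collects the GO identifiers from the property elements of a cluster entry. The sample is the example above with the elided parts removed and the entry element closed so that it parses on its own. It omits the XML namespace declared in real UniRef files, and the function name is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# The three property types introduced for GO annotations.
GO_PROPERTY_TYPES = {
    "GO Molecular Function",
    "GO Biological Process",
    "GO Cellular Component",
}

# Example from the text above, trimmed to a self-contained fragment.
SAMPLE = """
<entry id="UniRef50_B0KJL7" updated="2017-03-15">
  <name>Cluster: Animal haem peroxidase</name>
  <property type="GO Molecular Function" value="GO:0004601"/>
  <property type="GO Biological Process" value="GO:0006979"/>
</entry>
"""


def go_annotations(entry):
    """Group the GO identifiers of a cluster entry by property type."""
    result = {}
    for prop in entry.findall("property"):
        prop_type = prop.get("type")
        if prop_type in GO_PROPERTY_TYPES:  # ignore all other properties
            result.setdefault(prop_type, []).append(prop.get("value"))
    return result


annotations = go_annotations(ET.fromstring(SAMPLE))
print(annotations)
```

Code that iterates over all property elements and expects a fixed set of types is the most likely place where an adjustment is needed.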

UniProt release 2017_04

Published April 12, 2017

Headline

Death (by insulin) in paradise

Have you ever been lucky enough to see cone snails in their natural habitat? Their shells are beautiful and you may be tempted to pick them up to admire them. Try to resist: cone snails hate that! These venomous animals can fire their harpoons and inject toxins under your skin. In some cases, these injections can be fatal. Cone snails produce 100-200 distinct venom peptides, and most of the characterized ones target their prey’s nervous system, including specific receptors, ion channels and transporters.

Cone snails predominantly live in warm seas and feed on fish, worms or molluscs. Fish-hunting cone snails can be classified into 2 categories depending upon their hunting strategy. There are ‘hook-and-line hunters’, who use a venomous harpoon, which is shot into the fish, and there are ‘net hunters’, who protrude a sort of stretchy mouth, aim it at fish, and eventually engulf it. Cone snails move very slowly and this whole process takes some time, so why does the fish not simply swim away? It has been proposed that cone snails release a subset of narcotizing or relaxing toxins, called the ‘nirvana cabal’, into water, causing fish to become disoriented and to stop moving.

The analysis of the Conus geographus venom gland transcriptome led to the amazing discovery of 3 transcripts (Con-Ins G1, Con-Ins G2 and Con-Ins G3), expressed at high levels and sharing very high homology with vertebrate insulin. The N-terminal half of Con-Ins G1 is almost identical to that of the fish hormone. It is known that the addition of human insulin to water causes hypoglycemia in fish, which severely affects their swimming behavior, insulin being absorbed via the gills. The effect can be reversed by placing fish in a 2% glucose bath. A similar effect was observed with synthetic Con-Ins G1, suggesting that it is indeed a component of the ‘nirvana cabal’.

Venom insulins are widely used by cone snails. All mollusc eaters produce venom insulins, as do many worm hunters, though not all. Among fish hunters, all net hunters produce venom insulins, while hook-and-line hunters do not. Venom insulins found in fish-hunting cone snails closely resemble fish insulins, whereas those identified in snail-hunters share sequence and structural similarities with mollusc insulins. Interestingly, while cone snail insulin, produced in nerve rings to control their own glucose homeostasis, is highly conserved across all tested species, venom insulins diverge rapidly, suggesting adaptation to their specific prey.

Cone snail venom insulins are the smallest known insulins found in nature. They lack A- and B-chain C-terminal residues that, in vertebrates, are crucial for hormone storage and activity. In human pancreatic beta-cells, insulin is stored as a hexamer (a trimer of dimers), but it is the monomer that bears the hormonal activity. Hexamer-to-monomer conversion can cause a delay in insulin action that can lead to a delay in blood glucose control following insulin injection in diabetic patients. Attempts to shorten the C-terminus of human insulin B chain in order to abolish self-association have resulted in near-complete loss of activity. By contrast, Con-Ins G1 is monomeric, bypassing the hexamer conversion step, but it also potently binds to the human insulin receptor. It is not yet entirely clear how Con-Ins G1 achieves that. Like most conotoxins, C. geographus insulins are extensively post-translationally modified. In the absence of modifications, insulin receptor activation is reduced by approximately 8-fold. The study of Con-Ins G1 crystal structure shows how Con-Ins G1 can compensate for the lack of C-terminal key residues, paving the way for the design of fast-acting therapeutic insulins.

The use of insulins in venoms has not been reported in any animals other than cone snails. However, the Gila monster, a venomous lizard living in the southwestern United States and northwestern Mexico, also targets the glucose homeostasis of its prey. It produces a peptide, called exendin-4, which mimics the incretin hormone glucagon-like peptide 1 (GLP-1), and acts as a potent stimulator of glucose-dependent insulin release. Exendin-4 has been developed as a commercial drug, under the name ‘Exenatide’, for the treatment of type 2 diabetes.

As of this release, the Con-Ins G1 entry is publicly available in the safe conotoxin-free environment of your computer.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Ceroid lipofuscinosis, neuronal, 12

UniProt release 2017_03

Published March 15, 2017

Headline

Viral Short Message Service: peptide texting guides the outcome of infection

Communication is not simply a dispensable tool invented by Homo sapiens to do business and to have an enjoyable social life. Long before the advent of cell phones, most living organisms, from animals and plants to bacteria, were communicating with each other in order to ensure species survival. The recent discovery of a peptide-based communication system in some bacterial viruses extends this observation far beyond our wildest imaginings.

Some bacterial viruses, called temperate bacteriophages, have the ability to infect their host through a lytic (productive infection) or a lysogenic (latent) cycle. The lytic cycle leads to the lysis of the host bacterial cell and release of progeny virions. In the lysogenic cycle, on the other hand, the bacteriophage genome becomes integrated into the host genome as a prophage without any virion production. The decision between lysis and lysogeny is probabilistic in nature, but usually depends on the number of co-infecting viruses and the bacterial nutritional state. When uninfected bacteria are abundant and healthy, the lytic pathway is preferred. In later stages of infection, when the number of uninfected bacteria is reduced, progeny phages are at risk of no longer having a new host to infect. At this point, lysogeny is favoured. Although the molecular mechanism underlying the phage lytic or lysogenic decision is still largely unknown, even in well-studied bacteriophages like Lambda or Mu, a substantial leap forward was made earlier this year.

Erez et al. were investigating whether phage-infected bacteria may produce molecules to alert other bacterial cells of their infection, when they made an amazing discovery. A screening of the culture medium of Bacillus subtilis infected by Phi3T bacteriophages led to the identification not of a bacterial, but of a… viral hexapeptide! This peptide was called AimP. The bacteriophage also encodes a cytoplasmic receptor for AimP, called AimR. In the absence of AimP, the AimR receptor behaves as a DNA-binding homodimer which activates the transcription of a third phage component of the system, AimX. AimX is a regulatory non-coding RNA which favors lysis, either by inhibiting lysogeny or by promoting lysis, in an as yet undefined manner. In the presence of AimP, the AimR receptor becomes a peptide-bound, transcriptionally inactive monomer. As a result, the expression of AimX drops and lysogeny is promoted.

The current experimental data suggest the following model. AimP is synthesized in infected bacteria as a pre-pro-peptide. Its N-terminal signal sequence is recognized by the host secretion system and cleaved off upon secretion. Once released in the extracellular milieu, the inactive pro-peptide is further processed by bacterial extracellular proteases to yield the mature active 6 amino-acid long AimP peptide, which is internalized by surrounding bacteria through the oligopeptide permease transporter (OPP). AimP accumulates in the bacterial cytoplasm. When a phage infects an ‘AimP-rich’ bacterium, the expressed AimR receptor binds AimP and cannot activate the expression of AimX, leading to preferential lysogeny. In other words, a phage can “sense” the level of global infection in the environment and adapt to preserve chances for viable reproduction.

This viral mode of 3-membered communication has been called ‘arbitrium’ (after the Latin word meaning ‘decision’). It may not be restricted to Phi3T bacteriophages. Indeed, Erez et al. found 112 instances of AimR homologues in Bacillus phages and, in all cases, aimR homologues were found upstream of aimP candidate genes.

As of this release, Bacillus phage Phi3T AimP and AimR entries have been updated and are publicly available.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keywords:

Modified keyword:

UniProt release 2017_02

Published February 15, 2017

Headline

Freshwater fish see red

Vision relies on specialized neurons found in the retina, called photoreceptor cells. Vertebrate photoreceptor cells contain visual pigments consisting of a G-protein-coupled receptor, called opsin, and a covalently bound chromophore derived from vitamin A, most commonly 11-cis retinal (a derivative of vitamin A1). Light-induced isomerization of 11-cis retinal to all-trans triggers a conformational change leading to G-protein activation, release of all-trans retinal and activation of the phototransduction cascade.

Typical rod photopigments have a maximum light absorbance of around 500 nm. However, at the end of the 19th century, Köttgen and Abelsdorff observed that the rod pigments in certain freshwater fish were “red-shifted” by 20-30 nm towards longer wavelengths compared with those of marine fish and terrestrial animals. This difference is due to a change in chromophore. Instead of 11-cis retinal, freshwater vertebrates use 11-cis 3,4-didehydroretinal, a derivative of vitamin A2, whose only difference with vitamin A1 is an additional conjugated double bond within its beta-ionone ring. What is the evolutionary advantage of this modification? Fresh water, in lakes or streams, is often murky. As a result, the light environment is shifted to the red and infrared end of the spectrum. Switching light absorbance seems to be the appropriate response to optimize vision in this specific aquatic milieu.

The chromophore switch is not only specific for certain species, it can also be regulated during life. For example, many amphibians use 11-cis 3,4-didehydroretinal during the tadpole stage, which they spend in ponds. Upon metamorphosis, they switch to 11-cis retinal, which provides clear vision to the terrestrial adult they have become. Conversely, salmon live happily with 11-cis retinal in the open ocean. During spawning migration, however, 11-cis retinal is progressively replaced by 11-cis 3,4-didehydroretinal, possibly through the action of thyroid hormones. In zebrafish also, the switch to vitamin A2-based chromophores can be induced by thyroid hormone treatment. Maybe the most striking example of differential usage of visual chromophores is provided by the American bullfrog. This voracious predator spends a large part of its life floating or swimming at the surface of the water, looking for aquatic, as well as aerial prey, with its eyes just above the waterline. Its dorsal retina, steered towards water, contains 11-cis 3,4-didehydroretinal, while its ventral retina uses 11-cis retinal.

While much of this knowledge on vitamins A1 and A2 was acquired long ago, the identity of the dehydrogenase catalyzing the switch between both forms remained elusive until December 2015, when Enright et al. published the identification of the enzyme. The authors compared the expression profile of zebrafish retinal pigment epithelium (RPE) of thyroid hormone-treated versus control animals. The most highly up-regulated transcript was that encoding cyp27c1, a cytochrome P450 family member. cyp27c1 was also strongly expressed in dorsal, but not ventral bullfrog RPE, correlating with the distribution of vitamin A2. In vitro, purified cyp27c1 was able to very efficiently catalyze the conversion of vitamin A1 to vitamin A2. In vivo, cyp27c1 knockout zebrafish survive to adulthood without overt developmental abnormalities. However, upon treatment with thyroid hormone, the mutant fish eyes fail to produce any vitamin A2 and their photoreceptors do not undergo a red-shift in sensitivity. Thus, the expression of a single enzyme, cyp27c1, mediates the dynamic spectral tuning of the entire visual system by controlling the balance of vitamin A1 and A2 in the eye.

Obviously, humans are not adapted for aquatic vision. However, they do produce vitamin A2, as has been documented in keratinocytes, and they express CYP27C1 in liver, kidney and pancreas. The human enzyme catalyzes the same reaction as fish and amphibian orthologs, but the physiological relevance of this observation is not clear at present.

Zebrafish and bullfrog CYP27C1 entries have been annotated in UniProtKB/Swiss-Prot. The preliminary sequence of American bullfrog CYP27C1 was kindly provided by Professor Corbo and Dr. Enright and we would like to thank them sincerely. The human ortholog has been updated. All 3 entries are publicly available as of this release.

Cross-references to Araport

Cross-references have been added to the Arabidopsis Information Portal Araport, an open-access online community resource for Arabidopsis research.

Araport is available at https://www.araport.org/.

The format of the explicit links is:

Resource abbreviation Araport
Resource identifier AGI locus code

Example: Q43125

Show all entries having a cross-reference to Araport.

Text format

Example: Q43125

DR   Araport; AT4G08920; -.

XML format

Example: Q43125

<dbReference type="Araport" id="AT4G08920"/>

RDF format

Example: Q43125

uniprot:Q43125
  rdfs:seeAlso <http://purl.uniprot.org/araport/AT4G08920> .
<http://purl.uniprot.org/araport/AT4G08920> 
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Araport> .
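Readers who process the text format may find it convenient to split a cross-reference line such as the Araport example above into its fields. The following Python sketch is an illustration only: the function name parse_dr_line and the returned dictionary layout are assumptions made for this example and are not part of any UniProt tool. It assumes that fields are separated by semicolons, that the line ends with a period, and that a lone "-" marks an empty optional field.

```python
def parse_dr_line(line):
    """Split a UniProtKB text-format DR line into its fields.

    Hypothetical helper for illustration. It only handles the simple
    layout shown in these release notes: "DR   Database; Identifier; Extra."
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line has too few fields: %r" % line)
    # A lone "-" stands for an empty optional field and is dropped.
    extra = [field for field in fields[2:] if field != "-"]
    return {"database": fields[0], "id": fields[1], "extra": extra}


print(parse_dr_line("DR   Araport; AT4G08920; -."))
```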

Cross-references to IMGT/GENE-DB

Cross-references have been added to IMGT/GENE-DB, the genome database of the international Immunogenetics information system (IMGT) for genes encoding immunoglobulins and T-cell receptors.

IMGT/GENE-DB is available at http://www.imgt.org/genedb/.

The format of the explicit links is:

Resource abbreviation IMGT/GENE-DB in entry view, IMGT_GENE-DB in source formats
Resource identifier Gene name

Example: P01871

Show all entries having a cross-reference to IMGT/GENE-DB.

Text format

Example: P01871

DR   IMGT_GENE-DB; IGHM; -.

XML format

Example: P01871

<dbReference type="IMGT_GENE-DB" id="IGHM"/>

RDF format

Example: P01871

uniprot:P01871
  rdfs:seeAlso <http://purl.uniprot.org/imgt_gene-db/IGHM> .
<http://purl.uniprot.org/imgt_gene-db/IGHM> 
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/IMGT_GENE-DB> .
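
Because the resource abbreviation differs between the entry view (IMGT/GENE-DB) and the source formats (IMGT_GENE-DB), code that displays cross-references read from XML may want to translate one into the other. The Python sketch below shows one possible way to do this with the standard library. The function name and the lookup table are assumptions made for illustration, and the fragment is parsed without the XML namespace that complete UniProtKB XML files declare.

```python
import xml.etree.ElementTree as ET

# Display name used in the entry view, keyed by the source-format name.
# Only the IMGT/GENE-DB pair is taken from the text above.
DISPLAY_NAMES = {"IMGT_GENE-DB": "IMGT/GENE-DB"}


def read_db_reference(xml_fragment):
    """Return (display name, identifier) for one <dbReference> fragment."""
    element = ET.fromstring(xml_fragment)
    source_name = element.get("type")
    # Fall back to the source-format name when no display name is known.
    return DISPLAY_NAMES.get(source_name, source_name), element.get("id")


print(read_db_reference('<dbReference type="IMGT_GENE-DB" id="IGHM"/>'))
```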
 

Change of the cross-references to TAIR

We have modified our cross-references to the TAIR database, and now use the TAIR accession number as the primary resource identifier, while continuing to show the TAIR locus name in an additional field.

Text format

Example: Q9ZVI3

Previous format:

DR   TAIR; AT2G38610; -.

New format:

DR   TAIR; locus:2064097; AT2G38610.

XML format

Example: Q9ZVI3

Previous format:

<dbReference type="TAIR" id="AT2G38610"/>

New format:

<dbReference type="TAIR" id="locus:2064097">
  <property type="gene designation" value="AT2G38610"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
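
As an illustration of the kind of code change that may be needed, the Python sketch below retrieves the TAIR locus name from either the previous or the new XML format. The function name tair_locus_name is hypothetical, and the fragments are parsed without the XML namespace that complete UniProtKB XML files declare.

```python
import xml.etree.ElementTree as ET


def tair_locus_name(xml_fragment):
    """Return the TAIR locus name (e.g. AT2G38610) from either format."""
    element = ET.fromstring(xml_fragment)
    if element.get("type") != "TAIR":
        raise ValueError("not a TAIR cross-reference")
    # New format: the locus name is carried by a "gene designation" property.
    for prop in element.findall("property"):
        if prop.get("type") == "gene designation":
            return prop.get("value")
    # Previous format: the locus name was the identifier itself.
    identifier = element.get("id")
    # A "locus:" accession without a property carries no locus name.
    return None if identifier.startswith("locus:") else identifier


OLD = '<dbReference type="TAIR" id="AT2G38610"/>'
NEW = (
    '<dbReference type="TAIR" id="locus:2064097">'
    '<property type="gene designation" value="AT2G38610"/>'
    "</dbReference>"
)
print(tair_locus_name(OLD), tair_locus_name(NEW))
```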

RDF format

Example: Q9ZVI3

Previous format:

uniprot:Q9ZVI3
  rdfs:seeAlso <http://purl.uniprot.org/tair/AT2G38610> .
<http://purl.uniprot.org/tair/AT2G38610>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TAIR> .

New format:

uniprot:Q9ZVI3
  rdfs:seeAlso <http://purl.uniprot.org/tair/locus:2064097> .
<http://purl.uniprot.org/tair/locus:2064097>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TAIR> ;
  rdfs:comment "AT2G38610" .

Removal of sequence similarity annotations for domains

Sequence similarity annotations were mainly used to describe two types of information:

  1. A family to which the protein belongs, worded as:
    Belongs to FamilyName.
  2. A structural domain that the protein contains, worded as:
    Contains NumberOfOccurence DomainName.

The domains that a protein contains are also annotated in ‘Domain’, ‘Zinc finger’, ‘Repeat’, ‘Calcium binding’ or ‘DNA binding’ annotations, which describe a domain’s name and sequence coordinates. The ‘Sequence similarity’ annotations of type 2, however, described only a domain’s name and number of occurrences. We have therefore removed these less detailed annotations.
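
Users who still hold data from earlier releases may want to tell the two types of annotation apart by their wording. The Python sketch below is one possible approach, based only on the two wording patterns listed above. The function name, the returned tuples and the sample sentences are assumptions made for illustration.

```python
import re

# Wording patterns taken from the two annotation types described above.
FAMILY = re.compile(r"^Belongs to (?P<name>.+?)\.?$")
DOMAIN = re.compile(r"^Contains (?P<count>\d+) (?P<name>.+?)\.?$")


def classify_similarity(text):
    """Classify a legacy 'Sequence similarity' annotation by its wording."""
    text = text.strip()
    match = FAMILY.match(text)
    if match:
        return ("family", match.group("name"))
    match = DOMAIN.match(text)
    if match:
        return ("domain", match.group("name"), int(match.group("count")))
    # Anything else is returned unchanged for manual inspection.
    return ("other", text)


# Illustrative sentences, not copied from a specific entry.
print(classify_similarity("Belongs to the cytochrome P450 family."))
print(classify_similarity("Contains 2 SH3 domains."))
```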

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Thyroxine-binding globulin deficiency

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N,N,N-trimethylmethionine

UniProt release 2017_01

Published January 18, 2017

Headline

Sheep in wolves’ clothing: human variant reannotation in UniProtKB/Swiss-Prot with ExAC

Annotation of sequence variants has always been an important part of the curation of human proteins in UniProtKB/Swiss-Prot. As of this release, about 76,500 variants are annotated in the knowledgebase. 99% of them are single amino acid polymorphisms (SAPs), the rest are small indels. 38% of the SAPs are associated with a genetic disorder. This high percentage of rare SAPs reflects our strategy to prioritize the annotation of disease-causing and/or functionally characterized variants reported in peer-reviewed scientific literature. Most are annotated as involved in diseases (as disease-causing agents, susceptibility factors or disease modifiers), but for some, the role in the phenotype is not clear, although they have been found in patients and not (yet?) in healthy individuals. These variants are called Variants of Unknown Significance (VUS). In the ‘good old days’, we were quite confident and we associated SAPs with diseases provided some criteria were met, such as cosegregation of the mutation with the phenotype, and absence of the mutation in a reasonably high number of healthy controls. At that time, 100 control individuals, ethnically matched if possible, seemed acceptable. Those days are gone. Nowadays, these simple criteria have been replaced by a real roadmap, based on guidelines developed by Richards et al. The stumbling block remains the frequency of a given variant in the population in view of the occurrence of the disease. In other words, if a variant is not found in healthy individuals, is it because it is pathogenic, or simply because nobody has looked hard enough? In this context, the high-quality sequence of almost 61,000 exomes provided by the Exome Aggregation Consortium (ExAC) is a major achievement.

ExAC aims to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The sequence of 60,706 exomes from unrelated individuals is currently available on the ExAC website. Surprisingly, each ExAC exome donor harbored on average 54 mutations reported to be disease-causing in HGMD or ClinVar. The pathogenicity of most of them (41) could be ruled out due to high allele frequency. Take for instance the gene CLN8. Mutations in this gene have been shown to cause neuronal ceroid lipofuscinosis-8, an autosomal recessive neurodegenerative disorder with an onset age of 2 to 7 years. In view of the clinical synopsis, no ‘healthy’ adult homozygous for any disease-causing mutation is expected. ExAC observed 93 individuals homozygous for the p.Pro229Ala variant, which had formerly been reported to be pathogenic. An analogous result was obtained for the variant p.Met1444Ile in GLI2. This mutation was reported to cause holoprosencephaly-9 (HPE9), an autosomal dominant disorder characterized by a wide phenotypic spectrum of brain developmental defects. Although HPE9 has variable expressivity and incomplete penetrance, the presence of this mutation in 20 homozygous individuals analyzed by ExAC led to its reclassification as a benign polymorphism.

The ExAC publication has a fruitful impact on our annotation. First, 38 variants (in 36 gene entries) reported in UniProtKB/Swiss-Prot and thought to be pathogenic have been reclassified as either benign polymorphisms or VUS. Second, the ExAC database has become an invaluable tool for curators, helping them to tag human variants with the appropriate status ‘Disease’ (disease-associated), ‘Polymorphism’ (innocuous) or ‘Unclassified’ (i.e. VUS). Third, we are learning to be more and more cautious when annotating new variants. The result is an increased number of VUS in UniProtKB/Swiss-Prot (currently representing about 20% of the total number of variants identified in patients). Old variants will be progressively confirmed or reclassified as new knowledge becomes available.

As of this release, the variants updated thanks to ExAC data are available in UniProtKB/Swiss-Prot.

The UniProt team wishes you a Happy New Year!

Cross-references to SFLD

Cross-references have been added to the Structure Function Linkage Database (SFLD), a resource that links evolutionarily related sequences and structures from mechanistically diverse superfamilies of enzymes to their chemical reactions.

SFLD is available at http://sfld.rbvi.ucsf.edu/django/.

The format of the explicit links is:

Resource abbreviation SFLD
Resource identifier SFLD identifier
Optional information 1 SFLD model name
Optional information 2 Number of hits

Example: P00877

Show all entries having a cross-reference to SFLD.

Text format

Example: P00877

DR   SFLD; SFLDS00014; RuBisCO; 1.

XML format

Example: P00877

<dbReference type="SFLD" id="SFLDS00014">
  <property type="entry name" value="RuBisCO"/>
  <property type="match status" value="1"/>
</dbReference>

RDF format

Example: P00877

uniprot:P00877
  rdfs:seeAlso <http://purl.uniprot.org/sfld/SFLDS00014> .
<http://purl.uniprot.org/sfld/SFLDS00014>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SFLD> ;
  rdfs:comment "RuBisCO" ;
  up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/P00877-1#SFLD_SFLDS00014_match_1> .
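
The text and XML examples above carry the same information in different layouts. The Python sketch below reads both and checks that they agree. It is an illustration only: the function names are hypothetical, and the mapping of the XML properties "entry name" and "match status" to the SFLD model name and the number of hits is inferred from the P00877 example rather than taken from a formal specification.

```python
import xml.etree.ElementTree as ET

TEXT_LINE = "DR   SFLD; SFLDS00014; RuBisCO; 1."
XML_FRAGMENT = """
<dbReference type="SFLD" id="SFLDS00014">
  <property type="entry name" value="RuBisCO"/>
  <property type="match status" value="1"/>
</dbReference>
"""


def sfld_from_text(line):
    """Read identifier, model name and number of hits from a DR line."""
    fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
    _database, identifier, model_name, hits = fields
    return {"id": identifier, "model_name": model_name, "hits": int(hits)}


def sfld_from_xml(fragment):
    """Read the same three values from a <dbReference> fragment."""
    element = ET.fromstring(fragment)
    props = {p.get("type"): p.get("value") for p in element.findall("property")}
    return {
        "id": element.get("id"),
        "model_name": props["entry name"],    # assumed: optional information 1
        "hits": int(props["match status"]),   # assumed: optional information 2
    }


print(sfld_from_text(TEXT_LINE) == sfld_from_xml(XML_FRAGMENT))
```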

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • O-UMP-histidine
  • O-UMP-serine
  • O-UMP-threonine

Changes to keywords

Deleted keyword:

  • Cyclosporin

UniRef news

Change of the UniRef FASTA header

We have added the NCBI taxonomy identifier of the common taxon of a UniRef cluster to the UniRef FASTA header, which now has the format:

>UniqueIdentifier ClusterName n=Members Tax=TaxonName TaxID=TaxonIdentifier RepID=RepresentativeMember

Where:

  • UniqueIdentifier is the primary accession number of the UniRef cluster.
  • ClusterName is the name of the UniRef cluster.
  • Members is the number of UniRef cluster members.
  • TaxonName is the scientific name of the lowest common taxon shared by all UniRef cluster members.
  • TaxonIdentifier is the NCBI taxonomy identifier of the lowest common taxon shared by all UniRef cluster members.
  • RepresentativeMember is the entry name of the representative member of the UniRef cluster.
Example:
>UniRef50_Q9K794 Putative AgrB-like protein n=2 Tax=Bacillus TaxID=1386 RepID=AGRB_BACHD
MLERLALTLAHQVKALNAEETESVEVLTFGFTIILHYLFTLLLVLAVGLLHGEIWLFLQI
ALSFTFMRVLTGGAHLDHSIGCTLLSVLFITAISWVPFANNYAWILYGISGGLLIWKYAP
YYEAHQVVHTEHWERRKKRIAYILIVLFIILAMLMSTQGLVLGVLLQGVLLTPIGLKVTR
QLNRFILKGGETNEENS

This addresses the issue that scientific taxon names can be ambiguous. Example: “Bacillus” refers to both a genus of bacteria as well as a genus of insects.
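
Code that parses UniRef FASTA headers may need to be adapted to the additional field. The Python sketch below shows one way to read the header with a regular expression. The function name, the group names and the choice to treat TaxID as optional (so that headers written in the previous format still parse) are assumptions made for this illustration.

```python
import re

# TaxID is optional here so that headers in the previous format still match.
HEADER = re.compile(
    r"^>(?P<cluster_id>\S+) (?P<name>.*) n=(?P<members>\d+) "
    r"Tax=(?P<taxon>.*?)(?: TaxID=(?P<taxid>\d+))? RepID=(?P<representative>\S+)$"
)


def parse_uniref_header(header):
    """Return the fields of a UniRef FASTA header as a dictionary."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError("unrecognized UniRef header: %r" % header)
    fields = match.groupdict()
    fields["members"] = int(fields["members"])
    # taxid is None when the header lacks a TaxID field.
    if fields["taxid"] is not None:
        fields["taxid"] = int(fields["taxid"])
    return fields


print(parse_uniref_header(
    ">UniRef50_Q9K794 Putative AgrB-like protein n=2 "
    "Tax=Bacillus TaxID=1386 RepID=AGRB_BACHD"
))
```

Using the numeric identifier rather than the taxon name as a key avoids the ambiguity described above.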

UniProt release 2016_11

Published November 30, 2016

Headline

From mouth to gut, a new mechanism for fimbria assembly

Fighting the oral microbiome is a daily task. Ineffective oral hygiene leads not only to dental caries, but also to inflammatory gum diseases, such as gingivitis. In some cases, gingivitis can worsen and turn into periodontitis, which involves the chronic destruction of connective tissues, including that of the alveolar bone around the teeth, and consequently loosening and subsequent loss of teeth. We are not all equally affected by periodontal diseases. There are marked differences in disease progression rate and severity, reflecting personal susceptibility, diversity in virulence among the microorganism species (and subspecies) and environmental conditions. Despite these variables, Porphyromonas gingivalis is now recognized as a major contributor to periodontitis. This Gram-negative black-pigmented anaerobic rod resides in subgingival biofilms and harbors an arsenal of virulence factors, among which are fimbriae (also called pili). Described for the first time in the early 1950s, fimbriae are non-flagellar appendages, formed by the assembly of proteins called pilins at the bacterial surface. They are often involved in the initial adhesion of the bacteria to host tissues during colonization, and also in biofilm formation, cell motility (twitching motility), and transport of proteins and DNA across cell membranes. There are major (long) and minor (short) fimbriae, both containing a structural, stalk-forming subunit (FimA for the major fimbriae, Mfa1 for the minor fimbriae) and 3 accessory subunits (FimC, FimD and FimE for the major fimbriae; Mfa3, Mfa4 and Mfa5 for the minor fimbriae) thought to form the fimbria tip. The last subunit is FimB (major fimbriae) or Mfa2 (minor fimbriae), which anchors the pilus to the outer membrane.

A very thorough study published last April, combining X-ray structure, biochemical and mutational analyses, sheds new light on the fimbria assembly mechanism in several bacteria from the Bacteroidia class, including P. gingivalis. The assembly occurs from tip to base. A tip pilin monomer is incorporated first, followed by stalk-forming structural pilin subunits and finally an anchor pilin at the base. Tip and structural pilins are synthesized in the cytoplasm as lipoprotein precursors, and exported into the periplasm using the Sec pathway. In the periplasm, they are folded and become lipidated at the N-terminus. The modified pilins are then exported across the outer membrane. During this process, they undergo a cleavage that releases the lipid moiety and several amino acids from the N-terminus, creating a groove. At this stage, mature structural pilins adopt an extended “open” conformation, allowing the assembly of the fimbriae where a C-terminal extension binds to the N-terminal groove of the previous subunit, a little like interlocking Lego bricks. The tip pilins exhibit a similar N-terminal groove to accommodate the C-terminal extension from structural pilin, but their C-terminus remains buried. Anchor pilins do not undergo cleavage and remain tethered to the outer membrane. As with structural pilin subunits, their C-terminus is involved in their incorporation into fimbriae.

Although fimbria assembly has been studied in numerous phylogenetically distinct bacteria, until this recent publication, very little was known about pilin structure and assembly in human-associated Bacteroidales members. The reported mechanism was hitherto unseen, but it could be widespread. Indeed, FimA proteins represent a large and diverse superfamily, which is highly represented in the gut microbiome, suggesting that they may confer adaptive advantages in bacterial colonization of this environment.

Close to 30 entries have been updated in UniProtKB/Swiss-Prot to include these new findings. The entries can be consulted just as well before or after brushing your teeth!

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

RDF news

Change of URIs for Ensembl and Ensembl Genomes

For historic reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

We have therefore updated the URIs for the Ensembl and Ensembl Genomes databases from

http://purl.uniprot.org/ensembl/<identifier>
http://purl.uniprot.org/ensemblbacteria/<identifier>
http://purl.uniprot.org/ensemblfungi/<identifier>
http://purl.uniprot.org/ensemblmetazoa/<identifier>
http://purl.uniprot.org/ensemblplants/<identifier>
http://purl.uniprot.org/ensemblprotists/<identifier>
to
  • http://rdf.ebi.ac.uk/resource/ensembl/<identifier>
    for genes
  • http://rdf.ebi.ac.uk/resource/ensembl.transcript/<identifier>
    for transcripts
  • http://rdf.ebi.ac.uk/resource/ensembl.protein/<identifier>
    for proteins
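
For code that builds these URIs itself, the change can be summarized by the Python sketch below. It is an illustration only: the function name and the "kind" argument are assumptions, and the caller is expected to know whether an identifier denotes a gene, a transcript or a protein, since this cannot be derived reliably from the identifier for all Ensembl Genomes divisions.

```python
# New URI prefixes, copied from the list above.
NEW_PREFIXES = {
    "gene": "http://rdf.ebi.ac.uk/resource/ensembl/",
    "transcript": "http://rdf.ebi.ac.uk/resource/ensembl.transcript/",
    "protein": "http://rdf.ebi.ac.uk/resource/ensembl.protein/",
}


def ensembl_uri(identifier, kind):
    """Build the new-style URI for an Ensembl or Ensembl Genomes identifier."""
    if kind not in NEW_PREFIXES:
        raise ValueError("kind must be one of %s" % sorted(NEW_PREFIXES))
    return NEW_PREFIXES[kind] + identifier


print(ensembl_uri("ENSG00000157764", "gene"))
```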

UniProt release 2016_10

Published November 2, 2016

Headline

N-acyl amino acids: a new treatment for obesity?

Mitochondria play a fundamental role in energy production. After glycolysis, glucose products are imported into the mitochondrial matrix, where they go through the citric acid cycle. The electrons produced in this process are transported from one protein complex to the next in the mitochondrial inner membrane. The final electron acceptor is molecular oxygen, which is ultimately reduced to water. During electron transport, the participating protein complexes pump protons out of the matrix space into the intermembrane space and thus create a concentration gradient. This gradient is used by ATP synthase to power the phosphorylation of ADP into ATP. However, not all energy liberated from the oxidation of dietary substrates is converted into ATP. Protons can leak back to the matrix through the inner membrane independently of ATP synthase and the energy accumulated is dissipated as heat. Several proteins are known to be involved in this process, called “uncoupled respiration”. One of them, UCP1, has been most extensively studied in the context of thermogenesis mediated by brown and beige adipose tissues.

Adaptive thermogenesis does not rely exclusively upon UCP1. Adipose tissues secrete many bioactive proteins, some of which potentially play a role in the regulation of energy expenditure. Recently, Long et al. identified a protein secreted by brown and beige fat cells, PM20D1. This protein is co-expressed with UCP1 in adipocytes. When injected with PM20D1 viral expression vectors and placed on high fat diet for a period of 47 to 54 days, mice exhibited a blunted weight gain, due to a massive reduction in fat mass compared with control animals. There was no difference in food intake, nor in movement between treated and untreated animals, suggesting the activation of a thermogenic gene program in the classical brown fat (BAT), subcutaneous inguinal white fat (iWAT), or both. Interestingly, UCP1 levels were unchanged in these experiments.

In vitro, PM20D1 appeared to be a bidirectional N-acyl amino acid synthase and hydrolase, the synthase activity being lower than the hydrolase activity. In vivo, plasma levels of N-oleyl-phenylalanine (C18:1-Phe) were indeed elevated in mice injected with PM20D1 expression vector. But what is the effect of N-lipidated amino acids on cells? When treated with N-acyl amino acids, primary BAT adipocytes and differentiated iWAT cells showed increased oxygen consumption in a UCP1-independent manner, indicating respiratory uncoupling activity of these compounds. The N-acyl amino acids tested (N-arachidonyl-glycine (C20:4-Gly), C20:4-Phe, and C18:1-Phe) acted directly on mitochondria, possibly by interaction with mitochondrial transporter proteins, such as SLC25A4 and SLC25A5. Of note, SLC25A4 and SLC25A5 exhibit ADP/ATP antiport activity, but are also thought to translocate protons across the inner membrane. Finally, treatment of obese mice with C18:1-Leu induced weight loss through the reduction of fat mass and improved glucose tolerance.

In the 1930s, the mitochondrial uncoupler 2,4-dinitrophenol (DNP) was used in diet pills to stimulate metabolism and promote weight loss, and it can still be purchased on the internet for this purpose. Though quite efficient in terms of weight loss, this drug has severe side effects. It can cause an excessive rise in body temperature due to the heat produced during uncoupling. DNP overdose causes fatal hyperthermia, with body temperature rising to as high as 44°C shortly before death. Will N-acyl-amino acids become a new, this time innocuous, treatment of choice for obesity? It’s difficult to anticipate. Chronic treatment of mice with C18:1-Phe or C20:4-Gly not only increases energy expenditure, with no effects on movement, but also reduces food intake, which obviously also contributes to weight loss. However, several N-acyl-amino acids have other biological functions, besides respiratory uncoupling, and hence may have other (undesirable?) effects. Nevertheless, the study of Long et al. sheds light on new endogenous mitochondrial uncouplers and new thermogenic mechanisms that are undoubtedly worth further investigation.

As of this release, PM20D1 entries have been updated and are publicly available.

UniProtKB news

Cross-references to DisGeNET

Cross-references have been added to DisGeNET, a discovery platform for the dynamical exploration of human diseases and their genes.

DisGeNET is available at http://www.disgenet.org.

The format of the explicit links is:

Resource abbreviation DisGeNET
Resource identifier Gene identifier (corresponding to GeneID gene identifier)

Example: P02649

Show all entries having a cross-reference to DisGeNET.

Text format

Example: P02649

DR   DisGeNET; 348; -.

XML format

Example: P02649

<dbReference type="DisGeNET" id="348"/>

RDF format

Example: P02649

uniprot:P02649
  rdfs:seeAlso <http://identifiers.org/ncbigene/348> .
<http://identifiers.org/ncbigene/348>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/DisGeNET> .

Cross-references to OpenTargets

Cross-references have been added to OpenTargets. This Target Validation platform brings together information on the relationships between potential drug targets and diseases. The core concept is to identify evidence of an association between a target and disease from various data types.

OpenTargets is available at https://www.targetvalidation.org/.

The format of the explicit links is:

Resource abbreviation OpenTargets
Resource identifier Gene identifier (corresponding to Ensembl gene identifier)

Example: P15056

Show all entries having a cross-reference to OpenTargets.

Text format

Example: P15056

DR   OpenTargets; ENSG00000157764; -.

XML format

Example: P15056

<dbReference type="OpenTargets" id="ENSG00000157764"/>

RDF format

Example: P15056

uniprot:P15056
  rdfs:seeAlso <http://purl.uniprot.org/opentargets/ENSG00000157764> .
<http://purl.uniprot.org/opentargets/ENSG00000157764> 
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/OpenTargets> .

Change of the cross-references to PhosphoSite

The PhosphoSite resource has changed its name to PhosphoSitePlus and we have updated our cross-references to reflect this name change.

Change of the cross-references to SMR

We have modified our cross-references to the SWISS-MODEL Repository (SMR) database. These cross-references used to indicate the sequence ranges of the UniProt canonical sequence that can be modelled with high confidence. This information is no longer available in our cross-references, but you can get the most up-to-date data in SMR, which is now updated weekly for several model organisms, or by triggering the update of a specific entry in SMR yourself.

Text format

Example: Q00362

Previous format:

DR   SMR; Q00362; 4-376, 492-523.

New format:

DR   SMR; Q00362; -.
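
Code that used to read the residue ranges from the SMR line may need to tolerate both formats. The following Python sketch illustrates one way to do so. The function name smr_ranges is hypothetical, and the sketch assumes that ranges in the previous format are always written as comma-separated "start-end" pairs.

```python
def smr_ranges(dr_line):
    """Return the residue ranges of an SMR DR line as (start, end) tuples.

    Returns an empty list for the new format, where the last field is "-".
    """
    body = dr_line[5:].strip()
    if body.endswith("."):
        body = body[:-1]
    fields = body.split("; ")
    if fields[0] != "SMR":
        raise ValueError("not an SMR cross-reference: %r" % dr_line)
    last = fields[-1].strip()
    if len(fields) < 3 or last == "-":
        return []
    ranges = []
    for part in last.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges


print(smr_ranges("DR   SMR; Q00362; 4-376, 492-523."))  # previous format
print(smr_ranges("DR   SMR; Q00362; -."))               # new format
```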

XML format

Example: Q00362

Previous format:

<dbReference type="SMR" id="Q00362">
  <property type="residue range" value="4-376, 492-523"/>
</dbReference>

New format:

<dbReference type="SMR" id="Q00362"/>

RDF format

Example: Q00362

Previous format:

uniprot:Q00362
  rdfs:seeAlso <http://purl.uniprot.org/smr/Q00362> .
<http://purl.uniprot.org/smr/Q00362>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SMR> ;
  rdfs:comment "4-376, 492-523" .

New format:

uniprot:Q00362
  rdfs:seeAlso <http://purl.uniprot.org/smr/Q00362> .
<http://purl.uniprot.org/smr/Q00362>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SMR> .

Change of RDF representation of the cross-references to PDB

We have modified the representation of our cross-references to PDB. These cross-references indicate the sequence ranges of the UniProt canonical sequence that are covered by a PDB structure when this data is available. This piece of information was provided via a reification of the cross-reference statement and each range was represented with a chain property that had a string literal value. We have introduced a new chainSequenceMapping property to simplify this description.

Example: P00750

Previous format:

uniprot:P00750
  rdfs:seeAlso <http://rdf.wwpdb.org/pdb/1A5H> .

<http://rdf.wwpdb.org/pdb/1A5H>
  rdf:type up:Structure_Resource ;
  up:database <http://purl.uniprot.org/database/PDB> ;
  up:method up:X-Ray_Crystallography ;
  up:resolution "2.90"^^xsd:float .

<#_5030303735300036>
  rdf:type rdf:Statement ;
  rdf:type up:Structure_Mapping_Statement ;
  rdf:subject uniprot:P00750 ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://rdf.wwpdb.org/pdb/1A5H> ;
  up:chain "A/B=311-562" ,
           "C/D=298-304" .

New format:

uniprot:P00750
  rdfs:seeAlso <http://rdf.wwpdb.org/pdb/1A5H> .

<http://rdf.wwpdb.org/pdb/1A5H>
  rdf:type up:Structure_Resource ;
  up:database <http://purl.uniprot.org/database/PDB> ;
  up:method up:X-Ray_Crystallography ;
  up:resolution "2.90"^^xsd:float ;
  up:chainSequenceMapping isoform:P00750-1#PDB_1A5H_tt311tt562 ,
                          isoform:P00750-1#PDB_1A5H_tt298tt304 .

isoform:P00750-1#PDB_1A5H_tt311tt562
  up:chain "A/B=311-562" .

isoform:P00750-1#PDB_1A5H_tt298tt304
  up:chain "C/D=298-304" .
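
The up:chain literals keep the same string form in both representations. As an illustration only, the Python sketch below splits such a literal into chain identifiers and a residue range. The function name parse_chain_mapping is hypothetical, and the sketch assumes the "chains=start-end" layout seen in the example above.

```python
def parse_chain_mapping(value):
    """Split a chain literal such as "A/B=311-562" into (chains, start, end)."""
    chains, separator, span = value.partition("=")
    if not separator:
        raise ValueError("missing '=' in chain literal: %r" % value)
    start, dash, end = span.partition("-")
    if not dash:
        raise ValueError("missing range in chain literal: %r" % value)
    return chains.split("/"), int(start), int(end)


for literal in ("A/B=311-562", "C/D=298-304"):
    print(parse_chain_mapping(literal))
```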

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • Hydroxylated arginine
  • N6-(beta-hydroxybutyrate)lysine

UniProt website news

Web browser support update

UniProt strives to support all major web browsers up to the oldest version that is supported by the browser developers. Since Microsoft ended support for Internet Explorer versions older than 11 in January 2016, we have dropped support for these versions as of UniProt release 2016_10.

We recommend using one of the following major web browsers for the UniProt website:

  • Internet Explorer 11+
  • Firefox 45+
  • Chrome (latest update)
  • Safari 9+

Please note that for older versions of these browsers certain features of the website may not be available (you can check here which browser version you are using).

UniProt release 2016_09

Published October 5, 2016

Headline

Ki-67: the great leap from simple marker to functional actor

A marker is ‘something (such as a sign or an object) that shows the location, the presence or the existence of something’. Used daily in laboratories worldwide, from basic research to clinics, markers are a scientist/practitioner’s best friend and the community continuously seeks new markers, notably for improving diagnosis and prognosis in medicine. Take for instance Ki-67. This protein, encoded by the MKI67 gene, is present during all active phases of the cell cycle, G1, S, G2, and mitosis, but is absent from resting G0 cells. During interphase, it is predominantly present in the cortex and dense fibrillar components of the nucleolus. During mitosis, it relocates to the periphery of the condensed chromosomes. It is a widely used marker for cell proliferation, very valuable in cancer diagnosis and prognosis. In this case, the term “widely” seems an understatement. A search in the NCBI PubMed database retrieves over 22’200 publications, but hardly any deal with its actual function. Indeed, while Ki-67 association with cellular proliferation is well established, its precise role in this process was unknown until recently. It was quite tempting to suggest that it is ‘required for maintaining cell proliferation’, as it was cautiously stated in the human UniProtKB/Swiss-Prot entry. However, a marker is just a marker and drawing any functional conclusion from expression levels may be hazardous.

At the very beginning of mitosis, chromosomes are compacted into thick fibers. After nuclear envelope breakdown (NEBD), chromosomes separate from one another in the cytoplasm, attach to the mitotic spindle and align along the center of the cell during metaphase. The spindle pulls a set of chromosomes to each pole of the dividing cell. How do chromosomes maintain their structural individuality during this process? As the molecules responsible for chromosome compaction are by themselves unable to distinguish different chromosomes, what are the factors that prevent chromosome coalescence?

Earlier this year, Cuylen et al. tackled this issue. Using automated live-cell imaging, the authors analyzed the effect of removing different proteins from cells. Out of almost 1,300 candidate genes, the knockdown of only one caused the sought-after chromosome clustering phenotype: MKI67. The internal structure of mitotic chromosomes appeared unaffected by Ki-67 depletion, but soon after NEBD, chromosomes merged into a single mass of chromatin, whose access to spindle microtubules was impaired.

Ki-67 is a large, about 3’000 amino acid long, protein that localizes at the chromosome surface from prophase until telophase, as mentioned above. Cuylen et al. show that the protein’s adsorption at the chromosome surface is mediated by its C-terminal region. The elongated N-terminal portion orients perpendicular to the chromosomes, a little like bristles on a brush. Ki-67 size and overall electric charge may form a repulsive shield, preventing coalescence. The range of Ki-67-mediated chromosome repulsion seems to depend on molecular density. When Ki-67 was overexpressed, mitotic chromosomes were spaced further apart.

Hence natural proteins seem to be able to act as surfactants in intracellular compartmentalization. It would be interesting to investigate whether it is also the case for membrane-less organelles, such as nucleoli, with which Ki-67 was also shown to be associated.

As of this release, the human Ki-67 entry has been updated in UniProtKB/Swiss-Prot and is publicly available.

UniProtKB news

Change of RDF representation of the cross-references to family and domain databases

We have modified the representation of our cross-references to family and domain databases. These cross-references indicate the number of matches of the family or domain signature to the UniProt canonical sequence, and this piece of information was provided via a reification of the cross-reference statement. We have introduced a new Signature_Resource class with a signatureSequenceMatch property to describe each match as a resource and thereby simplify this description.

Example: A0AVT1

Previous format:

uniprot:A0AVT1
  rdfs:seeAlso <http://purl.uniprot.org/pfam/PF00899> .

<http://purl.uniprot.org/pfam/PF00899>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Pfam> ;
  rdfs:comment "ThiF" .

<#_4130415654310021>
  rdf:type rdf:Statement ;
  rdf:type up:Domain_Assignment_Statement ;
  rdf:subject uniprot:A0AVT1 ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/pfam/PF00899> ;
  up:hits 2 .

New format:

uniprot:A0AVT1
  rdfs:seeAlso <http://purl.uniprot.org/pfam/PF00899> .

<http://purl.uniprot.org/pfam/PF00899>
  rdf:type up:Signature_Resource ;
  up:database <http://purl.uniprot.org/database/Pfam> ;
  rdfs:comment "ThiF" ;
  up:signatureSequenceMatch isoforms:A0AVT1-1#Pfam_PF00899_match_1 ,
                            isoforms:A0AVT1-1#Pfam_PF00899_match_2 .

Change of RDF representation of the cross-references to EMBL

We have modified the representation of our cross-references to nucleotide CoDing Sequences (CDS) from the INSDC. When a CDS differs substantially from a reviewed UniProtKB/Swiss-Prot sequence, the UniProt curators indicate the nature of the difference in the corresponding cross-reference. This piece of information was provided via a reification of the cross-reference statement. We have introduced a new sequenceDiscrepancy property to simplify this description.

Example: P30154

Previous format:

uniprot:P30154
  rdfs:seeAlso <http://purl.uniprot.org/embl-cds/BAG59103.1> .

<http://purl.uniprot.org/embl-cds/BAG59103.1>
  rdf:type up:Nucleotide_Resource ;
  up:database <http://purl.uniprot.org/database/EMBL> ;
  up:locatedOn <http://purl.uniprot.org/embl/AK296455> .

<#_503330313534001A>
  rdf:type rdf:Statement ;
  rdf:type up:Nucleotide_Mapping_Statement ;
  rdf:subject uniprot:P30154 ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/embl-cds/BAG59103.1> ;
  rdfs:comment "Frameshift." .

New format:

uniprot:P30154
  rdfs:seeAlso <http://purl.uniprot.org/embl-cds/BAG59103.1> .

<http://purl.uniprot.org/embl-cds/BAG59103.1>
  rdf:type up:Nucleotide_Resource ;
  up:database <http://purl.uniprot.org/database/EMBL> ;
  up:locatedOn <http://purl.uniprot.org/embl/AK296455> ;
  up:sequenceDiscrepancy uniprot:P30154#EMBL_BAG59103.1 .

uniprot:P30154#EMBL_BAG59103.1
  rdfs:comment "Frameshift." .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2016_08

Published September 7, 2016

Headline

Butterfly fashion: all they need is cortex

Butterfly and moth wing patterns fulfill various functions, such as mate attraction, thermal regulation, and protection by concealment, mimicry or warning. Patterns are produced by a dust-like layer of tiny colored scales that cover an otherwise transparent membrane. Scales can be pigmented with melanins resulting in black and brown colors. Blue, red and iridescence are usually created by the microstructure of the scales, resulting in the scattering of light. Each scale is produced by a single cell on the wing surface.

Wing pattern and color can change in order to adapt to environmental changes. The classical example of such a phenomenon is provided by Biston betularia. This moth used to camouflage itself against lichen-covered tree trunks. Its peppered white wings make it almost invisible on this background. With the advent of the industrial revolution in the 19th century in Britain, trunks turned soot black and so did Biston betularia. The new melanic morph was described for the first time in Manchester in 1848 and called carbonaria. It spread all over England and its frequency was over 90% in the 1950s. Several years after the Clean Air Act, in the early 1970s, its frequency started to drop again and nowadays the maximum is estimated at less than 50%, and in most places below 10%.

The mutation that gave rise to Biston betularia industrial melanism has just been identified. It is the insertion of a large, tandemly repeated, transposable element into the first intron of the cort gene, which results in increased gene expression. The transposition event is thought to have occurred around 1819, which is consistent with the historical record. Surprisingly, the cort gene does not encode a transcription factor that would be involved in the expression of pigmentation genes. Its only known function has been reported in Drosophila, where the cort-encoded protein cortex is a cell-cycle regulator, required for the completion of meiosis in oocytes. In Heliconius numata tarapotensis and Heliconius melpomene rosina, 2 butterfly species, cortex is expressed in final instar larval hindwing discs, in regions fated to become black in the adult wing. Although the function of cortex in the regulation of pigmentation patterning is still unknown, the current hypothesis is that it may regulate scale cell development.

In other latitudes, butterflies escape from predators not by concealment, but by warning that they are unpalatable with bright and distinctive wing colors. Within a given area, experienced birds have been “educated” to avoid certain patterns. This pattern recognition varies with geographical location. As a result, in a given area, a number of butterfly species, edible or not, mimic each other and have the same color pattern, even though they may be only distantly related, while Lepidoptera of the same species found in other locations may exhibit very different patterns. A recent study focused on different Heliconius species living in South America. The result was quite striking. In these species too, the cort gene appeared to be a major regulator of color and pattern. This result suggests that the recruitment of cortex to wing patterning may have occurred before the major diversification of the Lepidoptera. This gene has repeatedly been targeted by natural selection to generate both cryptic, as in Biston betularia, and aposematic, as in the Heliconius genus, patterns.

As of this release, UniProtKB/Swiss-Prot Biston betularia, Heliconius melpomene and Heliconius erato cortex entries have been updated with this new knowledge and are publicly available.

UniProtKB news

Cross-references to Conserved Domains Database

Cross-references have been added to the Conserved Domains Database (CDD), a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.

CDD is available at https://www.ncbi.nlm.nih.gov/cdd.

The format of the explicit links is:

Resource abbreviation CDD
Resource identifier CDD identifier
Optional information 1 CDD model name
Optional information 2 Number of hits

Example: Q196W5

Show all entries having a cross-reference to CDD.

Text format

Example: Q196W5

DR   CDD; cd04278; ZnMc_MMP; 1.

XML format

Example: Q196W5

<dbReference type="CDD" id="cd04278">
  <property type="entry name" value="ZnMc_MMP"/>
  <property type="match status" value="1"/>
</dbReference>
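
As a rough illustration of how the XML fragment above could be read, the Python sketch below uses the standard library xml.etree.ElementTree module. The function name read_db_reference is hypothetical. The sketch assumes the fragment is parsed on its own, without an XML namespace; elements taken from a complete UniProt XML file would need namespace-qualified names.

```python
import xml.etree.ElementTree as ET

# The CDD cross-reference from the example above, as a standalone fragment.
SNIPPET = """<dbReference type="CDD" id="cd04278">
  <property type="entry name" value="ZnMc_MMP"/>
  <property type="match status" value="1"/>
</dbReference>"""


def read_db_reference(xml_text):
    """Return (database, identifier, properties) for one dbReference element."""
    element = ET.fromstring(xml_text)
    properties = {
        prop.get("type"): prop.get("value")
        for prop in element.findall("property")
    }
    return element.get("type"), element.get("id"), properties


print(read_db_reference(SNIPPET))
```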

RDF format

Example: Q196W5

uniprot:Q196W5
  rdfs:seeAlso <http://purl.uniprot.org/cdd/cd04278> .
<http://purl.uniprot.org/cdd/cd04278>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CDD> ;
  rdfs:comment "ZnMc_MMP" .

Change of the cross-references to VectorBase

We have modified our cross-references to the VectorBase database. We now use the VectorBase Transcript identifier as the primary resource identifier, while showing the VectorBase Protein and Gene identifiers in additional fields.

VectorBase is available at http://vectorbase.org.

The new format of the explicit links is:

Resource abbreviation VectorBase
Resource identifier Transcript identifier
Optional information 1 Protein identifier
Optional information 2 Gene identifier

Example: A7UVJ5

Show all entries having a cross-reference to VectorBase.

Text format

Example: A7UVJ5

Previous format:

DR   VectorBase; AGAP001789. Anopheles gambiae.

New format:

DR   VectorBase; AGAP001789-RA; AGAP001789-PA; AGAP001789.

XML format

Example: A7UVJ5

Previous format:

<dbReference type="VectorBase" id="AGAP001789">
  <property type="organism name" value="Anopheles gambiae"/>
</dbReference>

New format:

<dbReference type="VectorBase" id="AGAP001789-RA">
  <property type="protein sequence ID" value="AGAP001789-PA"/>
  <property type="gene ID" value="AGAP001789"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
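
To illustrate the kind of code change that may be needed, here is a minimal Python sketch that recovers the VectorBase gene identifier from either format. The function name vectorbase_gene_id is hypothetical. The sketch assumes namespace-free fragments like the ones shown above, and that the absence of a "gene ID" property means the element is in the previous format.

```python
import xml.etree.ElementTree as ET

OLD_FORMAT = """<dbReference type="VectorBase" id="AGAP001789">
  <property type="organism name" value="Anopheles gambiae"/>
</dbReference>"""

NEW_FORMAT = """<dbReference type="VectorBase" id="AGAP001789-RA">
  <property type="protein sequence ID" value="AGAP001789-PA"/>
  <property type="gene ID" value="AGAP001789"/>
</dbReference>"""


def vectorbase_gene_id(xml_text):
    """Return the gene identifier of a VectorBase dbReference element."""
    element = ET.fromstring(xml_text)
    if element.get("type") != "VectorBase":
        raise ValueError("not a VectorBase cross-reference")
    for prop in element.findall("property"):
        if prop.get("type") == "gene ID":
            # New format: the gene identifier is an additional property.
            return prop.get("value")
    # Previous format: the id attribute itself was the gene identifier.
    return element.get("id")


print(vectorbase_gene_id(OLD_FORMAT), vectorbase_gene_id(NEW_FORMAT))
```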

RDF format

Example: A7UVJ5

Previous format:

uniprot:A7UVJ5
  rdfs:seeAlso <http://purl.uniprot.org/vectorbase/AGAP001789> .
<http://purl.uniprot.org/vectorbase/AGAP001789>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/VectorBase> ;
  rdfs:comment "Anopheles gambiae" .

New format:

uniprot:A7UVJ5
  rdfs:seeAlso <http://purl.uniprot.org/vectorbase/AGAP001789-RA> .
<http://purl.uniprot.org/vectorbase/AGAP001789-RA>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/VectorBase> ;
  up:translatedTo <http://purl.uniprot.org/vectorbase/AGAP001789-PA> ;
  up:transcribedFrom <http://purl.uniprot.org/vectorbase/AGAP001789> .

Change of the cross-references to WormBase

Cross-references to WormBase may now be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: G5EG14

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt website news

Peptide search tool

We have introduced a new tool called Peptide search that is available from a link in the header of the UniProt website. You can enter one or several peptide sequences (for example from a proteomics experiment) into the search field and the tool quickly finds all UniProtKB sequences that exactly match one of your query sequences. Searches can be restricted to a taxonomic subset of UniProtKB to decrease the search time. The tool returns a results page showing the matched UniProtKB entries in a design consistent with the UniProtKB text search results page, including filters on the left, results on the right and an option to customise the results table through the ‘Columns’ button.
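
The idea of exact peptide matching can be sketched in a few lines of Python. This is only an illustration of the concept and says nothing about how the Peptide search tool is actually implemented. The function name peptide_search, the entry names and the toy sequences are all made up for the example.

```python
def peptide_search(peptides, sequences):
    """Return, per entry, the query peptides that occur exactly in its sequence.

    `sequences` maps an entry name to its protein sequence.
    """
    hits = {}
    for name, sequence in sequences.items():
        matched = [peptide for peptide in peptides if peptide in sequence]
        if matched:
            hits[name] = matched
    return hits


# Toy data, not real UniProtKB sequences.
toy_sequences = {"ENTRY1": "MKTAYIAKQR", "ENTRY2": "GGSLVPRGSH"}
print(peptide_search(["AYIAK", "PRGS", "WWW"], toy_sequences))
```

A linear scan like this is slow for a large database; restricting the input to a taxonomic subset, as the tool allows, reduces the number of sequences to scan.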

Publications view added to UniProtKB entries

UniProt Knowledgebase (UniProtKB) protein entries now have a dedicated view of publications relevant for a protein. UniProtKB contains more than 350,000 unique publications, with over 210,000 of these fully curated in UniProtKB/Swiss-Prot and the remainder imported in UniProtKB/TrEMBL. This set is complemented by more than 640,000 additional publications that have been computationally mapped from other resources to UniProtKB entries. The publications annotated in UniProtKB have previously been displayed in the main ‘Entry’ view and a link provided access to a separate page that listed the computationally mapped publications. We have now combined all publications into a new ‘Publications’ view that can be accessed from a link under the ‘Display’ heading on the left hand side of a UniProtKB page. In this view you can filter the publications list by source and categories that are based on the type of data a publication contains about the protein (such as function, interaction, sequence, etc.) or the number of proteins it describes (‘small scale’ vs ‘large scale’), see for example P10276.

UniProt release 2016_07

Published July 6, 2016

Headline

(Bacterial) immigration under control

Essentially all our mucosal surfaces are covered by microorganisms, not only bacteria, but also archaea, fungi, protozoans and viruses. Most of them reside within the gastrointestinal tract. Normal gut flora is largely responsible for overall health of the host and it does not trigger any inflammatory response… as long as it remains where it belongs. In order to maintain a subtle, though strict segregation, the colonic epithelium is covered by mucus. The latter is organized in 2 layers. The inner layer adheres firmly to the epithelial cells. It is dense and does not allow bacterial penetration, thus keeping the epithelial cell surface free from bacteria. The outer layer is the habitat of the commensal flora. The inner mucus layer is converted into the outer layer by proteolytic activities provided by the host and also probably by commensal bacterial proteases and glycosidases.

Colonic quietness is maintained not only by the physical barrier of the mucus: the immune system also plays a crucial role, among others through the secretion of IgA into the gut lumen. These dimeric immunoglobulins bind flagellin, a highly conserved protein component of the bacterial flagellum that is expressed by many different commensal species. This interaction limits the association of flagellated bacteria with the intestinal mucosa. The mechanism leading to IgA production by B cells in this context is not yet fully uncovered, but it is known that flagellin is sensed by at least 3 different innate immune receptors, including TLR5, which plays an instrumental role in this process.

In this peaceful, though cautious cohabitation, another host protein actor has recently been identified: LYPD8. In the absence of LYPD8, bacteria penetrate the inner mucus layer, despite normal production of mucin, the main building block of mucus, and move further into the crypts of the large intestine, causing severe inflammation. LYPD8 is a membrane protein, attached to the plasma membrane through a glycophosphatidylinositol (GPI) anchor. It is selectively expressed in epithelial cells at the uppermost layer of the large intestinal gland and can be released into the gut lumen by the action of specific phospholipases. Once in the extracellular milieu, it binds to flagellated bacteria, including Proteus mirabilis. Contrary to TLR5, this interaction seems to be specific to flagella, a higher order structure composed of polymerized flagellins, not to monomeric flagellins. This binding severely impairs bacterial swarming activity, thereby regulating gut homeostasis.

Until these recent observations, nothing was known about LYPD8. It had only been identified through large scale cDNA and genome sequencing. The sole annotations provided in UniProtKB were based on protein domain predictions, including that of the GPI anchor (UPAR/Ly6 domain) and of the signal peptide. As of this release, LYPD8 entries have been updated with this new functional information and are publicly available.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Ciliary dyskinesia, primary, 31
  • Jensen syndrome
  • Mental retardation, autosomal dominant 12
  • Thiopurine S-methyltransferase deficiency

UniProt release 2016_06

Published June 8, 2016

Headline

Strength through unity

Reversible phosphorylation of proteins is a fundamental regulatory mechanism for many processes across a wide range of taxa. It has been extensively studied in the context of intracellular events in the nucleus and in the cytoplasm. Less is known about extracellular phosphorylation, but a family of secretory pathway kinases has been identified within the Golgi apparatus and in the extracellular milieu in recent years. Among them, FAM20C has been shown to phosphorylate many secreted proteins involved in biomineralization, including enamel matrix proteins, such as AMBN, AMELX, AMTN and ENAM. The importance of extracellular phosphorylation in bone physiology is further supported by the observation that mutations in FAM20C are associated with Raine syndrome, an autosomal recessive osteosclerotic bone dysplasia with a neonatal lethal outcome.

FAM20A, FAM20C’s closest paralog, exhibits all characteristics of a kinase, except for one residue, a conserved glutamic acid residue which is replaced by a glutamine, causing a loss of enzyme activity. This is not a characteristic unique to FAM20A. About 10% of the proteins classified as protein kinases lack some of the key features required for activity. They are called “pseudokinases”. In spite of its lack of activity, mutations in FAM20A also produce a defect in biomineralization, namely amelogenesis imperfecta 1G.

This apparent paradox was solved by Cui et al. last year. They showed that in the absence of FAM20A, FAM20C activity dramatically drops. Moreover, FAM20A mutants associated with amelogenesis imperfecta 1G fail to activate FAM20C. The proteins have to form a complex for full FAM20C activity.

Kinases are synthesized as inactive proteins. Classically, their activation is achieved through the phosphorylation of a domain called the “activation loop” which induces a conformational change. FAM20C does not have an activation loop that could be phosphorylated. Yet another kind of activation, called “allosteric activation”, has already been reported for kinase-pseudokinase pairs. In this model, it is the pseudokinase binding that induces the shape change of the bona fide kinase into its active conformation. Although the exact mechanism of FAM20C activation is still unclear, experimental results suggest that it may join the growing list of kinases regulated by dimerization-induced allostery.

FAM20A and FAM20B are quite old enzymes, evolutionarily related to kinases found in bacteria and slime molds. The fact that they do not use activation loop phosphorylation suggests that the allosteric mode of kinase activation may be very ancient, before the activation loop evolved. The presence of many conserved pseudokinases in the genomes of higher organisms suggests that allosteric activation may still be an efficient regulatory mechanism.

As of this release, FAM20A and FAM20C have been updated and are publicly available.

UniProtKB news

Removal of the cross-references to NextBio

Cross-references to NextBio have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Deleted diseases

  • Epilepsy, progressive myoclonic 5

RDF news

Change of URIs for neXtProt

For historic reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

The URIs for the neXtProt database have therefore been updated from:

http://purl.uniprot.org/nextprot/<ID>

to:

http://nextprot.org/rdf/entry/<ID>

If required for backward compatibility, you can use the following query to add the old URIs:

PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX up:<http://purl.uniprot.org/core/>
INSERT
{
   ?protein rdfs:seeAlso ?old .
   ?old owl:sameAs ?new .
   ?old up:database <http://purl.uniprot.org/database/neXtProt> .
}
WHERE
{
   ?protein rdfs:seeAlso ?new .
   ?new up:database <http://purl.uniprot.org/database/neXtProt> .
   BIND(iri(concat('http://purl.uniprot.org/nextprot/', substr(str(?new),31))) AS ?old)
}

The dereferencing of existing http://purl.uniprot.org/nextprot/<ID> URIs will be maintained.
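
Outside a triple store, the same mapping can be applied to individual URIs. The Python sketch below mirrors the BIND expression of the query above: the new prefix is 30 characters long, which is why the SPARQL substr starts at position 31. The function name to_old_nextprot_uri is hypothetical, and the identifier used in the example is only illustrative.

```python
NEW_PREFIX = "http://nextprot.org/rdf/entry/"
OLD_PREFIX = "http://purl.uniprot.org/nextprot/"


def to_old_nextprot_uri(new_uri):
    """Map a new neXtProt URI back to the old purl.uniprot.org form."""
    if not new_uri.startswith(NEW_PREFIX):
        raise ValueError("not a new-style neXtProt URI: %r" % new_uri)
    # Keep everything after the new prefix, i.e. the <ID> part.
    return OLD_PREFIX + new_uri[len(NEW_PREFIX):]


# Illustrative identifier, not taken from the release notes.
print(to_old_nextprot_uri("http://nextprot.org/rdf/entry/NX_P00533"))
```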

UniProt release 2016_05

Published May 11, 2016

Headline

Slow/White and the 6 DWORFs

Striated muscle function relies on a cycle of contraction and relaxation. Upon electrical stimulation of the myocyte plasma membrane, Ca(2+) is released from the sarcoplasmic reticulum (SR) into the cytosol. The released calcium activates movement of the molecular motor myosin along actin filaments and contraction occurs. Cytosolic Ca(2+) is then pumped back into the SR, through the action of SERCA proteins, allowing actomyosin relaxation. The SERCA proteins are SR-resident transmembrane ATPases that couple the hydrolysis of ATP with Ca(2+) translocation.

Recent studies have highlighted a role for a network of (very) small ORFs (smORFs) in SERCA regulation. The first members of this exclusive but growing club were phospholamban (PLN, 52 amino acids) and sarcolipin (SLN, 31 amino acids), which were both isolated by classical biochemical approaches decades ago. Both bind SERCA and reduce the rate of calcium movement in heart and slow skeletal muscle fibers. More recently, the SERCA inhibitory micropeptide myoregulin (MRLN, 46 amino acids) was identified in fast muscle fibers by Anderson et al. These authors started by screening for skeletal muscle-specific RNAs and discovered MRLN in an apparent long non-coding RNA (lncRNA). Encouraged by this discovery, Olson lab members continued to look for smORFs in other muscle-specific lncRNAs and found DWORF (34 amino acids), encoded by 2 exons of a 795 bp-long transcript, which is very difficult to predict using current software. In mouse myocytes, DWORF expression stimulates Ca(2+) uptake in the SR, not by direct activation of SERCA, but rather by relieving MRLN-, PLN- and SLN-mediated inhibition. DWORF expression may be particularly beneficial for recovery from periods of prolonged contraction.

SERCA regulation by micropeptides encoded in supposed lncRNAs is not a vertebrate-specific phenomenon. In Drosophila melanogaster, a single muscle-specific transcript encodes 2 smORFs related to sarcolipin, sarcolamban A and B (SCLA, 28 amino acids, and SCLB, 29 amino acids). Computer simulations predicted that both peptides fit the groove of SERCA, and this has been experimentally verified. While mutant flies deficient in sarcolamban showed no behavioral or morphological muscle phenotype, they do exhibit significantly more arrhythmic cardiac contractions than wild-type flies.

The idea that smORFs may be overlooked in the current genome annotation is not new, and these recent advances in muscle physiology underscore the likelihood that many transcripts annotated as noncoding RNAs may actually encode peptides with important biological functions. These smORFs could represent fast-evolving key regulators of larger molecular complexes. They also highlight the need for expert biocuration to make these data available in databases, as they cannot be automatically predicted, retrieved, nor annotated at the current time.

The 6 dworfs have been curated and integrated into UniProtKB/Swiss-Prot and we continue to survey the literature for other hidden micropeptide treasures (motivated solely by biological interest and not by our desire to find a seventh member for the purposes of this headline).

UniProtKB news

Cross-references to SIGNOR

Cross-references have been added to SIGNOR, the Signaling Network Open Resource, a resource that organizes and stores, in a structured format, signaling information published in the scientific literature. The core of this project is a large collection of manually-annotated causal relationships between proteins that participate in signal transduction.

SIGNOR is available at http://signor.uniroma2.it/.

The format of the explicit links is:

Resource abbreviation SIGNOR
Resource identifier UniProtKB accession number.

Example: P00533

Show all entries having a cross-reference to SIGNOR.

Text format

Example: P00533

DR   SIGNOR; P00533; -.

XML format

Example: P00533

<dbReference type="SIGNOR" id="P00533"/>

RDF format

Example: P00533

uniprot:P00533
  rdfs:seeAlso <http://purl.uniprot.org/signor/P00533> .
<http://purl.uniprot.org/signor/P00533>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SIGNOR> .
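
The RDF example above follows a pattern that recurs for the simple cross-references in these release notes. As a minimal sketch, the Python below rebuilds the same Turtle statements from an accession, a resource abbreviation and an identifier. The function names are hypothetical, and the rule that the URI path is the lower-cased resource abbreviation is an assumption inferred from the examples shown here, not a documented guarantee.

```python
def xref_resource_uri(database: str, identifier: str) -> str:
    """Build the purl.uniprot.org URI for a cross-referenced resource.

    Assumption: the path segment is the lower-cased resource abbreviation,
    as seen in the examples of these release notes.
    """
    return f"http://purl.uniprot.org/{database.lower()}/{identifier}"


def xref_turtle(accession: str, database: str, identifier: str) -> str:
    """Render the Turtle statements for one simple cross-reference."""
    resource = xref_resource_uri(database, identifier)
    return (
        f"uniprot:{accession}\n"
        f"  rdfs:seeAlso <{resource}> .\n"
        f"<{resource}>\n"
        f"  rdf:type up:Resource ;\n"
        f"  up:database <http://purl.uniprot.org/database/{database}> .\n"
    )


# Reproduces the SIGNOR statements shown above for P00533.
print(xref_turtle("P00533", "SIGNOR", "P00533"))
```

Isoform-specific cross-references and resources with typed identifiers (such as the new Gramene format) carry additional statements and are not covered by this sketch.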

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt website news

Change of UniProt website job identifiers

To enable a more flexible and scalable infrastructure, we have extended the length of the UniProt website’s job identifiers.

Example:
M201604052M3YWGETHB
has become:
M2016040537D007A56D816107CE5B52C10342DB3700000452

We will continue to store job results for 7 days.

UniProt release 2016_04

Published April 13, 2016

Headline

Small changes, big effects

Our brain has the ability to reorganize itself by forming new neural connections throughout life. This plasticity allows neurons to adjust their activities in response to new situations, to changes in their environment, and to compensate for injury and disease. Plasticity is not only due to the creation/destruction of neuronal connections, but also to the modulation of synaptic strength depending upon its activity, a process called ‘short-term synaptic plasticity’ (STP). There are 2 types of STP, with opposite effects, known as ‘depression’ and ‘facilitation’. When neurons receive excitatory input, they generate strong electrical impulses (called spikes) which cause a release of neurotransmitters at the synaptic connections with other neurons. The neurotransmitters stimulate receptors on the postsynaptic neuron and trigger downstream electrical impulses. Action potential activity leads to the depletion of neurotransmitters consumed during the synaptic signaling process at the axon terminal of a presynaptic neuron, causing ‘depression’. It also induces an influx of calcium into the axon terminal. The calcium accumulation increases neurotransmitter release by the next presynaptic spike, facilitating synaptic transmission and temporarily potentiating the synapse (‘facilitation’).

Facilitation is important for the proper function of mammalian brains. It may form the basis of short-term working memory. In the hippocampus, it has been proposed to play a role in the acquisition of spatial information. In the auditory pathway, it allows the maintenance of linear transmission of rate-coded sound intensity.

Although synaptic facilitation was observed more than 70 years ago, the underlying mechanism is not yet fully elucidated. However, a major breakthrough was recently achieved and published in January in Nature. In their article, Jackman et al. identified a synaptotagmin-7 (SYT7) requirement for facilitation to occur in most central synapses. SYT7 is a calcium- and phospholipid-binding protein involved in the exocytosis of many secretory and synaptic vesicles. In SYT7-knockout mice, facilitation was eliminated at all synapses (except for mossy fiber synapses), although calcium influx was not affected by the mutation.

To rule out an indirect effect of SYT7 knockout, the authors tried to rescue facilitation through viral expression of SYT7 in hippocampal CA3 pyramidal cells. To do so, they used an adeno-associated virus that drove bicistronic expression of both channelrhodopsin-2 and SYT7. Channelrhodopsins are unicellular green algae proteins that serve as sensory photoreceptors. When expressed in the experimental setting established by Jackman et al., they enabled light to control electrical excitability only in the fibers expressing SYT7. The result was clear-cut: facilitation was restored. The identification of a protein required for synaptic facilitation may pave the way for future investigations on the functional role of this process.

As of this release, SYT7 proteins have been updated in UniProtKB/Swiss-Prot and are publicly available.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Inclusion body myopathy 2

UniProt service news

New UniProt JAPI

We have developed a new version of the UniProt JAPI. The legacy UniProt JAPI will be retired as of Wednesday, April 13th 2016. If you have any questions or concerns, please feel free to contact us at help@uniprot.org.

UniProt RDF news

Change of the UniProt RDF files distribution

The UniProt RDF distribution has been available on the UniProt FTP site since 2008 with data split into one file per dataset. Over time the size of the largest files has grown to over 80 Gigabytes. These large files are difficult to download and they also limit the maximum rate at which the data can be loaded into many RDF stores. We have therefore split the files of the three biggest datasets into sets of smaller files:

  • The UniProtKB dataset is split based on taxonomy and whether entries are active or not. The resulting files contain at most 1 million active or 10 million obsolete entries.
  • The UniRef dataset is split into files that contain at most 1 million entries.
  • The UniParc dataset is split into files of approximately 1 Gigabyte in size.

We also reduced the data redundancy between the datasets to further decrease the total data volume:

  • The UniProtKB dataset has always been fully normalized with respect to the taxonomy dataset and it is now also normalized with respect to the keywords, GO and citations datasets. The total number of unique triples across these datasets remains the same, but it means that if you have so far only loaded the UniProtKB and taxonomy RDF files into your RDF store, you must now also load the keywords.rdf.xz, go.owl.xz and citations.rdf.xz files in order to have the same data.
  • The UniRef dataset has been normalized with respect to the UniProtKB and UniParc datasets. It now only describes the UniRef cluster memberships. The sequence and entry information of UniProtKB and UniParc member entries is no longer repeated in the UniRef RDF files.
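
For readers who process the split files directly, the following is a minimal sketch of how an xz-compressed RDF/XML file could be read incrementally with the Python standard library, without decompressing it to disk first. The function name and the sample document are made up for illustration; the layout of the real files may differ, and counting elements with an rdf:about attribute is only a rough proxy for the number of described resources.

```python
import io
import lzma
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def count_described_resources(compressed_stream) -> int:
    """Count elements carrying an rdf:about attribute in an xz-compressed RDF/XML stream.

    The stream is decompressed and parsed incrementally; the content of each
    element is released once it has been processed.
    """
    count = 0
    with lzma.open(compressed_stream, "rb") as xml_stream:
        for _event, element in ET.iterparse(xml_stream, events=("end",)):
            if f"{{{RDF_NS}}}about" in element.attrib:
                count += 1
            element.clear()  # drop text, attributes and children already seen
    return count


# Invented two-resource document standing in for a real *.rdf.xz file.
sample = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/P00533">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Protein"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/P10599"/>
</rdf:RDF>"""
resource_count = count_described_resources(io.BytesIO(lzma.compress(sample)))
print(resource_count)
```

The same function accepts a file path instead of an in-memory stream, since `lzma.open` takes either.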

UniProt release 2016_03

Published March 16, 2016

Headline

From the Zika forest to the Amazon, news from a viral wanderer

In 2015, a large outbreak in Brazil put the Zika virus in the spotlight. Most people who become infected with Zika virus do not become sick and for those who do, the illness is generally mild. However, in some cases, complications can be quite severe. In addition, microcephaly has been reported in some babies born to mothers infected with Zika virus during pregnancy, pointing to the virus as an emerging human pathogen.

Although the Zika virus owes its worldwide infamy to its wandering to the Western hemisphere, it had been circulating in Africa long before. It was first discovered in Uganda in 1947, in rhesus monkeys living in the Zika Forest (after which it was named), and subsequently in humans in 1952. It is an RNA virus of the flavivirus genus, which also includes dengue, yellow fever and West Nile viruses. Like its relatives, it is transmitted by Aedes mosquitoes, originally in endemic regions of central Africa. Taking advantage of modern means of transportation, it started spreading, first to Micronesia in 2007, then to French Polynesia in 2013, and to Brazil and Central America in 2014.

As it has long been considered insignificant, the Zika virus has not been extensively studied and most of our current knowledge has been inferred from other viruses of the same genus. The Zika virus entry into target cells can be triggered by binding to AXL and TYRO3. Interestingly, these proteins are also involved in Ebola virus and Lassa virus entry in human cells. Attachment to the host receptors is followed by internalization by a process called ‘apoptotic mimicry’ whereby the virus manages to be recognized by the target cell as an apoptotic body. After fusion of the virus membrane with the host endosomal membrane, the RNA genome is released into the cytoplasm. Flaviviruses are remarkable in that their genome encodes a single polyprotein that inserts into the endoplasmic reticulum (ER) membrane forming a complex pattern. This polyprotein is subsequently cleaved into 13 molecules by viral and host peptidases. The non-structural proteins form membrane spherules, presumably to protect the double stranded RNA intermediate of viral replication. The genomic viral RNA is replicated and translated, leading to creation of new Zika virions in the ER. The virions bud by hijacking the host endosomal sorting complex required for transport (ESCRT) system. They are transported to the Golgi apparatus, where further maturation occurs. Eventually fusion-competent virions are released by exocytosis.

As of this release, a Zika virus reference proteome has been manually curated in UniProtKB, where it can be safely visited.

A page dedicated to Zika has also been created in ViralZone to offer a global view of how this particular virus functions and to provide access to other databases.

Cross-references to EPD

Cross-references have been added to EPD, the Encyclopedia of Proteome Dynamics, a resource that contains data from multiple, large-scale proteomics experiments aimed at characterising proteome dynamics in both human cells and model organisms.

EPD is available at https://www.peptracker.com/epd/analytics/.

The format of the explicit links is:

Resource abbreviation EPD
Resource identifier UniProtKB accession number.

Example: P00451

Show all entries having a cross-reference to EPD.

Text format

Example: P00451

DR   EPD; P00451; -.

XML format

Example: P00451

<dbReference type="EPD" id="P00451"/>

RDF format

Example: P00451

uniprot:P00451
  rdfs:seeAlso <http://purl.uniprot.org/epd/P00451> .
<http://purl.uniprot.org/epd/P00451>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/EPD> .

Cross-references to TopDownProteomics

Cross-references have been added to TopDownProteomics, a resource from the Consortium for Top Down Proteomics that hosts top down proteomics data presenting validated proteoforms to the scientific community.

TopDownProteomics is available at http://repository.topdownproteomics.org/.

The format of the explicit links is:

Resource abbreviation TopDownProteomics
Resource identifier UniProtKB accession number.

Example: P10599

Show all entries having a cross-reference to TopDownProteomics.

Cross-references to TopDownProteomics may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Text format

Example: P10599

DR   TopDownProteomics; P10599-1; -. [P10599-1]
DR   TopDownProteomics; P10599-2; -. [P10599-2]
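
As an illustration of the text format, here is a minimal Python sketch that splits a DR line into its fields and picks up the optional isoform identifier in square brackets. The class and function names are hypothetical, and the regular expression is derived only from the example lines in these release notes, so it should be treated as an assumption rather than a full flat-file grammar.

```python
import re
from typing import NamedTuple, Optional


class CrossReference(NamedTuple):
    database: str
    identifier: str
    properties: tuple
    isoform: Optional[str]


# "DR", three spaces, the "; "-separated fields, a final ".",
# then an optional " [isoform]" suffix.
_DR_LINE = re.compile(r"^DR   (?P<fields>.+?)\.(?: \[(?P<isoform>[^\]]+)\])?$")


def parse_dr_line(line: str) -> CrossReference:
    """Parse one DR line; raise ValueError if it does not look like one."""
    match = _DR_LINE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    database, identifier, *properties = match.group("fields").split("; ")
    return CrossReference(database, identifier, tuple(properties), match.group("isoform"))


isoform_ref = parse_dr_line("DR   TopDownProteomics; P10599-1; -. [P10599-1]")
plain_ref = parse_dr_line("DR   EPD; P00451; -.")
print(isoform_ref.isoform, plain_ref.isoform)
```

Lines without a bracketed suffix yield `isoform=None`, so the same parser can be used for cross-references that are not isoform-specific.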

XML format

Example: P10599

<dbReference type="TopDownProteomics" id="P10599-1">
  <molecule id="P10599-1"/>
</dbReference>
<dbReference type="TopDownProteomics" id="P10599-2">
  <molecule id="P10599-2"/>
</dbReference>

RDF format

Example: P10599

uniprot:P10599
  rdfs:seeAlso <http://purl.uniprot.org/topdownproteomics/P10599-1> ,
    <http://purl.uniprot.org/topdownproteomics/P10599-2> .

<http://purl.uniprot.org/topdownproteomics/P10599-1>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TopDownProteomics> .
<#_5030303735300040>
  rdf:type rdf:Statement ;
  rdf:subject <P10599> ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/topdownproteomics/P10599-1> ;
  up:sequence isoform:P10599-1 .
<http://purl.uniprot.org/topdownproteomics/P10599-2>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TopDownProteomics> .
<#_5030303735300040>
  rdf:type rdf:Statement ;
  rdf:subject <P10599> ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/topdownproteomics/P10599-2> ;
  up:sequence isoform:P10599-2 .

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt release 2016_02

Published February 17, 2016

Headline

Another one (antibiotic) bites the dust

Polymyxin E (also known as colistin) and other polymyxin antibiotics are among our last-resort drugs against multi-drug resistant Gram-negative bacteria such as Klebsiella pneumoniae, Pseudomonas aeruginosa and Acinetobacter.

The initial target of polymyxin antibiotics is the lipopolysaccharide layer (LPS) of the Gram-negative bacterial outer membrane. LPS has two 2-keto-3-deoxyoctonoic acid units bound to lipid A, which itself consists of 2 glucosamine units with attached fatty acyl chains and a phosphate group on each sugar. Lipid A acts as a hydrophobic anchor, in which the tight packing of the fatty acyl chains helps to stabilize the overall outer membrane structure. The positively charged L-2,4-diaminobutyric acid residues of polymyxins interact with the negatively charged phosphate groups on lipid A. The amphipathic antibiotics are thought to form pores that permeabilize the outer membrane. The polymyxins would then insert into and disrupt the inner membrane, leading to further pore formation. There is also some evidence that polymyxins have other intracellular targets.

As the initial contact of polymyxin antibiotics is with lipid A, resistance often occurs via its modification, frequently masking its negative charge. Before August 2015 a number of chromosomal resistance loci were known, but no resistance had been identified on a more easily transferred plasmid. During a routine surveillance of commensal Escherichia coli for antibiotic resistance, scientists in China identified mcr1, a plasmid-encoded gene which encodes a protein of the phosphoethanolamine transferase family. The gene confers both colistin and polymyxin B resistance by modifying lipid A, and probably originated in Paenibacillus. This would seem logical as Paenibacillus is the natural source of polymyxin antibiotics.

The gene was first identified from a pig farm in Shanghai in July 2013. Retrospective screening of isolated E. coli plasmids in China showed an alarming rise in its presence in pork, from 6% in 2011 to 22% in 2014. The gene has also been detected in chicken meat in China, rising from 5% in 2011 to 28% in 2014. Screening of hospital inpatients in 2014 revealed mcr1-containing plasmids in both E. coli (1.4%) and K. pneumoniae (0.7%). The gene was also detected in E. coli genomes from Malaysia. An in vivo test in mice showed that the gene was indeed able to confer colistin resistance. The original plasmid can transfer to other E. coli cells via conjugation, but could be introduced into K. pneumoniae or P. aeruginosa only via transformation; it is stable in the absence of selective pressure.

Since the online publication of the paper identifying mcr1 on November 15, 2015, numerous papers have appeared reporting retrospective screening for the gene. So far its earliest isolation is from a French calf in 2005, in which a worrying co-localization with a wide-spectrum beta-lactamase resistance gene was also reported. The gene has been found in human fecal samples dating from 2012 on, in Europe, Africa, South America and Asia. It was found in E. coli isolated from pigs in Germany in 2010, from Belgian calves in 2011-2012, in European food samples from June 2011 on, and from animal feces in Asia. The gene is not always isolated from the same plasmid background, and mcr1 is often associated with mobile genetic elements, probably aiding its dispersal.

In short, the gene has been slowly spreading around the world since before we were even aware of its existence. Colistin has been used in agriculture since the 1950s and is widely used in China, which is probably contributing to the steady dissemination of the gene. There are increasingly urgent calls for the agricultural use of colistin to be reevaluated before resistance spreads even further.

As of this release, Mcr-1 has been annotated and is available in UniProtKB/Swiss-Prot.

Cross-references to SwissPalm

Cross-references have been added to SwissPalm, a manually curated resource to study protein S-palmitoylation. It encompasses S-palmitoylated protein hits from more than 50 species and provides curated information and filters that increase the confidence in true positive hits. SwissPalm integrates predictions of S-palmitoylated cysteine scores, orthologs and isoform multiple alignments.

SwissPalm is available at http://swisspalm.epfl.ch/.

The format of the explicit links is:

Resource abbreviation SwissPalm
Resource identifier UniProtKB accession number.

Example: Q13530

Show all entries having a cross-reference to SwissPalm.

Text format

Example: Q13530

DR   SwissPalm; Q13530; -.

XML format

Example: Q13530

<dbReference type="SwissPalm" id="Q13530"/>

RDF format

Example: Q13530

uniprot:Q13530
  rdfs:seeAlso <http://purl.uniprot.org/swisspalm/Q13530> .
<http://purl.uniprot.org/swisspalm/Q13530>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SwissPalm> .

Change of the cross-references to Gramene

We have modified our cross-references to the Gramene database.

The new format of the explicit links is:

Resource abbreviation Gramene
Resource identifier Transcript identifier
Optional information 1 Protein identifier
Optional information 2 Gene identifier

Cross-references to Gramene may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

The Gramene database has also been moved from the category “Organism-specific databases” to the category “Genome annotation databases”.

Example: Q10DK7

Show all entries having a cross-reference to Gramene.

Text format

Example: Q10DK7

Previous format:

DR   Gramene; Q10DK7; -.

New format:

DR   Gramene; OS03T0727600-01; OS03T0727600-01; OS03G0727600.

XML format

Example: Q10DK7

Previous format:

<dbReference type="Gramene" id="Q10DK7"/>

New format:

<dbReference type="Gramene" id="OS03T0727600-01">
  <property type="protein sequence ID" value="OS03T0727600-01"/>
  <property type="gene ID" value="OS03G0727600"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
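
As a hedged illustration of the kind of code change this may involve, the Python sketch below reads a Gramene `dbReference` element in either format. The function name and the returned keys are hypothetical. It also makes two simplifying assumptions: a reference without `property` children is treated as the previous format, and the fragment carries no XML namespace (in a complete UniProtKB XML file the elements sit in a default namespace, which the `findall` call would have to include).

```python
import xml.etree.ElementTree as ET


def read_gramene_xref(xml_fragment: str) -> dict:
    """Read one Gramene <dbReference> written in the previous or the new format."""
    element = ET.fromstring(xml_fragment)
    if element.get("type") != "Gramene":
        raise ValueError("not a Gramene cross-reference")
    properties = {p.get("type"): p.get("value") for p in element.findall("property")}
    if properties:
        # New format: the id attribute holds a transcript identifier.
        return {
            "transcript": element.get("id"),
            "protein": properties.get("protein sequence ID"),
            "gene": properties.get("gene ID"),
        }
    # Previous format (assumed when no properties are present):
    # the id attribute held the UniProtKB accession number.
    return {"accession": element.get("id")}


new_style = read_gramene_xref(
    '<dbReference type="Gramene" id="OS03T0727600-01">'
    '<property type="protein sequence ID" value="OS03T0727600-01"/>'
    '<property type="gene ID" value="OS03G0727600"/>'
    "</dbReference>"
)
old_style = read_gramene_xref('<dbReference type="Gramene" id="Q10DK7"/>')
print(new_style["gene"], old_style["accession"])
```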

RDF format

Example: Q10DK7

Previous format:

uniprot:Q10DK7
  rdfs:seeAlso <http://purl.uniprot.org/gramene/Q10DK7> .
<http://purl.uniprot.org/gramene/Q10DK7>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Gramene> .

New format:

uniprot:Q10DK7
  rdfs:seeAlso <http://purl.uniprot.org/gramene/OS03T0727600-01> .
<http://purl.uniprot.org/gramene/OS03T0727600-01>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/Gramene> ;
  up:translatedTo <http://purl.uniprot.org/gramene/OS03T0727600-01> ;
  up:transcribedFrom <http://purl.uniprot.org/gramene/OS03G0727600> .

Removal of the cross-references to GeneFarm

Cross-references to GeneFarm have been removed.

Removal of the cross-references to GenoList

Cross-references to GenoList have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Periventricular nodular heterotopia 4
  • Transposition of the great arteries dextro-looped 2

UniProt website news

UniProt feature viewer added to UniProtKB entries

UniProt provides sequence annotations, a.k.a. protein features, to describe regions or sites of biological interest. Secondary structure regions, domains, post-translational modifications and binding sites, among others, play a critical role in understanding what the protein does. With the growth in biological data, integration and visualization become increasingly important for exposing different data aspects that might be otherwise hidden, unclear or difficult to grasp.

Hence we are introducing the UniProt feature viewer, a BioJS component bringing together protein sequence features in one compact view. Similar to genome viewers, the viewer uses tracks to display different protein features providing an intuitive picture of co-localized elements. Each track can be expanded to reveal a more in-depth view of the underlying data. The variant track offers a novel visualization and presents UniProt curated natural variants along with imported variants from large-scale studies (such as 1000 Genomes and COSMIC).

The UniProt feature viewer is available for every UniProtKB protein entry through the ‘Feature viewer’ link under the ‘Display’ heading on the left hand side.

If you would like to include the feature viewer in your own website or resource, you can find instructions in our technical documentation.

UniProt release 2016_01

Published January 20, 2016

Headline

cGAMP, a welcome stowaway

We are often amazed by the strategies deployed by viruses to trick our defences, but our immune system does not lag behind and it can also fool viral invaders. The detection of viruses by the innate immune system relies on the detection of intracellular DNA by pattern recognition receptors, including cyclic guanosine monophosphate (GMP) adenosine monophosphate (AMP) synthase (cGAS, also called MB21D1). In response to cytosolic DNA, this enzyme synthesizes 2’3’-cyclic GMP-AMP (cGAMP), which then binds to STING (also called TMEM173), an endoplasmic reticulum transmembrane protein, leading to the activation of the type I interferon (IFN) response, thereby inducing an antiviral state.

Last year, Gentili et al. made a puzzling observation. To study cGAS function, they transduced human monocyte-derived dendritic cells with a cGAS-expressing lentivirus. As expected, the cells were strongly activated, but the stimulatory property of the cGAS-encoding lentivirus did not correlate with the transduction efficiency. This led to the hypothesis that it was not cGAS itself that was responsible for the activation of the infected cells, but some other stimulatory signal, which was transferred by the viral vector. Indeed, when dendritic cells were challenged with virus-like particles (VLPs) that did not themselves encode cGAS, but were produced in the presence of cGAS, the cells were stimulated. This effect was abolished when VLPs were produced in the presence of a catalytically inactive cGAS mutant. Concomitantly, Bridgeman et al. found that the incubation of macrophages, epithelial cells or lung fibroblasts with lentiviral particles collected from cells overexpressing cGAS led to the STING-dependent up-regulation of type I interferons and interferon-stimulated genes. All this evidence pointed to cGAMP as the stimulatory signal and indeed both groups identified the dinucleotide in the viral particles by mass spectrometry, not only in their experimental system, but also in more physiological settings, using a herpes virus (MCMV) and a poxvirus (Modified Vaccinia Ankara virus). It is as yet unclear whether the incorporation of cGAMP into virus particles is a selective host-directed process or simply a consequence of random fluid-phase uptake of cytosolic material into viral particles.

cGAMP has previously been shown to diffuse through gap junctions, thereby alerting non-infected neighboring cells to pathogen threat. The discovery by Gentili et al. and Bridgeman et al. suggests that cells located far from the initial infection site may also benefit from cGAMP transfer and initiate rapid antiviral responses bypassing the need for cGAS activation.

Although the downstream fate of the dinucleotide does not directly depend on cGAS enzyme activity, this piece of information has been introduced into cGAS entries as of this release.

Cross-references to CollecTF

Cross-references have been added to the CollecTF database of bacterial transcription factor binding sites. CollecTF stores data on experimentally-validated TFBS and places special emphasis on providing a transparent curation process that captures the experimental support for sites as reported by authors in peer-reviewed publications.

CollecTF is available at http://www.collectf.org.

The format of the explicit links is:

Resource abbreviation CollecTF
Resource identifier CollecTF identifier

Example: A0KST7

Show all entries having a cross-reference to CollecTF.

Text format

Example: A0KST7

DR   CollecTF; EXPREG_00000150; -.

XML format

Example: A0KST7

<dbReference type="CollecTF" id="EXPREG_00000150"/>

RDF format

Example: A0KST7

uniprot:A0KST7
  rdfs:seeAlso <http://purl.uniprot.org/collectf/EXPREG_00000150> .
<http://purl.uniprot.org/collectf/EXPREG_00000150> 
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CollecTF> .

Cross-references to GeneDB

Cross-references have been added to GeneDB, the pathogen genome database from the Sanger Institute. GeneDB provides access to the latest sequence data and annotation/curation for the whole range of organisms sequenced by the Sanger Pathogen group.

GeneDB is available at http://www.genedb.org.

The format of the explicit links is:

Resource abbreviation GeneDB
Resource identifier GeneDB identifier

Example: Q8WPT5

Show all entries having a cross-reference to GeneDB.

Text format

Example: Q8WPT5

DR   GeneDB; H25N7.01:pep; -.

XML format

Example: Q8WPT5

<dbReference type="GeneDB" id="H25N7.01:pep"/>

RDF format

Example: Q8WPT5

uniprot:Q8WPT5
  rdfs:seeAlso <http://purl.uniprot.org/genedb/H25N7.01:pep> .
<http://purl.uniprot.org/genedb/H25N7.01:pep>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/GeneDB> .

Cross-references to iPTMnet

Cross-references have been added to iPTMnet, an integrated resource for PTMs in a systems biology context. iPTMnet connects multiple disparate bioinformatics tools and systems (text mining, data mining, analysis and visualization tools, and databases and ontologies) into an integrated resource to address the knowledge gaps in exploring and discovering PTM networks. The iPTMnet database currently contains phosphorylation information.

iPTMnet is available at http://pir.georgetown.edu/iPTMnet.

The format of the explicit links is:

Resource abbreviation iPTMnet
Resource identifier UniProtKB accession number.

Example: Q15796

Show all entries having a cross-reference to iPTMnet.

Text format

Example: Q15796

DR   iPTMnet; Q15796; -.

XML format

Example: Q15796

<dbReference type="iPTMnet" id="Q15796"/>

RDF format

Example: Q15796

uniprot:Q15796
  rdfs:seeAlso <http://purl.uniprot.org/iptmnet/Q15796> .
<http://purl.uniprot.org/iptmnet/Q15796> 
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/iPTMnet> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • ADP-ribosyl aspartic acid

UniProt release 2015_12

Published December 9, 2015

Headline

Host proteins SERINC3 and SERINC5 decrease HIV-1 infectivity

It has long been known that the HIV-1 nef (“negative regulatory factor”) protein increases the infectivity of the HIV-1 virion (PMID:7981973). This mysterious protein is only found in primate lentiviruses. Its function is to manipulate the host’s cellular machinery and thus to allow infection, survival or replication of the virus. The abundant research performed on this topic has unraveled many phenotypes associated with nef, mainly in restricting the expression of host proteins at the cell membrane. However, none of these functions has provided a clear understanding of the virion infectivity phenotype, although they have revealed the way HIV-1 avoids the host’s immune response.

Two recent papers in Nature have shown that nef actually prevents the incorporation of host SERINC3 and SERINC5 proteins into the HIV-1 virion. These proteins dramatically decrease virion infectivity when they are part of its membrane. These studies improve the understanding of nef function in virion infectivity. The means used by nef to achieve this function are still unknown, but are related to its capacity to prevent specific host proteins from reaching the plasma membrane. Human SERINC3 and SERINC5 functions are still not well understood, but further study of these proteins should help to reveal their antiviral action.

As of this release, HIV-1 nef and human proteins SERINC3 and SERINC5 have been updated and are publicly available.

UniProtKB news

Displaying human UniProtKB sequence annotations in genome browser tracks

Genome browser tracks allow users to align sequence annotations to the reference genome data and genome annotations. Both UCSC and Ensembl genome browsers have custom tracks for displaying external annotations in their browsers. UniProt would like to announce the beta release of new genome tracks which allow the alignment of protein sequence annotations in our resource to a reference genome. These UniProt genome tracks include genomic locations of protein sequences and annotations such as active sites, metal binding sites, post-translational modifications, variants and domains with supporting literature evidence where available. Each species represented by the genome annotation tracks resource will have protein sequences and annotations defined by the BED and bigBed formats.
The beta release is available in the new dedicated ‘genome_annotation_tracks’ directory on the UniProt FTP site and provides tracks for human, with additional species to be released in the future. UniProt would welcome your feedback on this new resource.
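
For readers unfamiliar with BED, the following minimal Python sketch parses the leading columns of a tab-separated BED record. The record shown is invented for illustration and is not taken from the UniProt tracks; only the standard meaning of the first columns (chromosome, 0-based start, exclusive end, name) is assumed, and any further columns in the real files are ignored here.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def parse_bed_line(line: str) -> BedInterval:
    """Parse the leading columns of one tab-separated BED record."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 3:
        raise ValueError("a BED record needs at least chrom, chromStart and chromEnd")
    name = columns[3] if len(columns) > 3 else ""
    return BedInterval(columns[0], int(columns[1]), int(columns[2]), name)


# Hypothetical record: the coordinates and the name are made up.
record = parse_bed_line("chr7\t1000\t1003\tEXAMPLE_active_site")
print(record.chrom, record.end - record.start)
```

Because BED intervals are half-open, the length of a feature is simply `end - start`, with no off-by-one correction.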

Cross-references to SwissLipids

Cross-references have been added to SwissLipids, a comprehensive reference database that links mass spectrometry-based lipid identifications to curated knowledge of lipid structures, metabolic reactions, enzymes and interacting proteins.

SwissLipids is available at http://www.swisslipids.org.

The format of the explicit links is:

Resource abbreviation SwissLipids
Resource identifier SwissLipids identifier

Cross-references to SwissLipids may be isoform-specific (e.g. Q08477). The general format of isoform-specific cross-references was described in release 2014_03.

Example: P52824

Show all entries having a cross-reference to SwissLipids.

Text format

Example: P52824

DR   SwissLipids; SLP:000000740; -.

XML format

Example: P52824

<dbReference type="SwissLipids" id="SLP:000000740"/>

RDF format

Example: P52824

uniprot:P52824
  rdfs:seeAlso <http://purl.uniprot.org/swisslipids/SLP:000000740> .
<http://purl.uniprot.org/swisslipids/SLP:000000740>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SwissLipids> . 

Cross-references to MalaCards

Cross-references have been added to MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and richness of the popular GeneCards database of human genes.

The MalaCards disease and disorders database is organized into “disease cards”, each integrating prioritized information, and listing numerous known aliases for each disease, along with a variety of annotations, as well as inter-disease connections.

MalaCards is available at http://www.malacards.org.

The format of the explicit links is:

Resource abbreviation MalaCards
Resource identifier Gene symbol

Example: P26439

Show all entries having a cross-reference to MalaCards.

Text format

Example: P26439

DR   MalaCards; HSD3B2; -.

XML format

Example: P26439

<dbReference type="MalaCards" id="HSD3B2"/>

RDF format

Example: P26439

uniprot:P26439
  rdfs:seeAlso <http://purl.uniprot.org/malacards/HSD3B2> .
<http://purl.uniprot.org/malacards/HSD3B2> 
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/MalaCards> .

Change of UniProtKB annotation cardinality constraints

Each UniProtKB entry may contain a variable number of different annotation topics. Most topics can be present more than once in a given entry (e.g. when a precursor protein is cleaved into chains/peptides with different functions, each one is described in a separate Function annotation). But some topics had been limited to occur no more than once per entry. We have lifted this restriction to allow for more flexibility and granularity in our annotations.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Fanconi anemia complementation group M
  • Paget disease of bone
  • Spinocerebellar ataxia, autosomal recessive, 5

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • PolyADP-ribosyl aspartic acid

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • 2-(S-cysteinyl)-methionine (Cys-Met)
  • Cyclopeptide (Cys-Ile)

UniProt service news

Retirement of UniProt BioMart

Based on user surveys and service evaluations, we decided to retire our UniProt BioMart service. For those who relied on the UniProt BioMart for tasks such as ID mapping, bulk retrieval of entries, or programmatic access to entry annotations, we have alternative services that will satisfy your needs. Please visit our YouTube channel and help pages for tutorials and more information about these services.

Please contact us if you have questions about this change.

Retirement of UniProt Distributed Annotation System (DAS)

The Distributed Annotation System (DAS) defines a communication protocol used to exchange annotations on genomic or protein sequences. It was first released in 2001, and UniProt started to provide its data via the DAS protocol in July 2004. DAS has fulfilled a valuable role in integrating distributed and varied data, particularly for display in genome browsers and other applications that feature data visualisation. Unfortunately, the level of usage of DAS in 2015 can no longer justify support and maintenance, and we have therefore retired the UniProt DAS server.

Documentation on programmatic access to UniProt data can be found on the UniProt website.

Please contact us if you have questions about this change.

UniProt release 2015_11

Published November 11, 2015

Headline

The sense of a motion

No need to be a great scientist to understand that when a hawk is circling in the sky looking for food, small rodents should run and hide. This requires not merely recognizing a static image or a global movement but, most importantly, sensing an asynchrony between a moving object (the hawk) and its background (the slow-moving clouds above it).

In vertebrates, visual motion sensing takes place in the retina and more specifically in a subset of retinal ganglion cells (RGCs). RGCs are located near the inner surface of the retina, where they receive visual information from photoreceptors via intermediate neurons, bipolar cells and amacrine cells. They extract salient features and send them deeper into the brain for further processing. The final picture is produced by the integration of many signals, each carried by a distinct population of RGCs. It is currently estimated that approximately 70 types of interneurons form specific synapses on roughly 30 types of RGCs. The discovery of the function of each RGC type and of their connections with specific interneurons is like trying to find the proverbial needle in a haystack.

Three years ago, Zhang et al. tackled this issue using a transgenic mouse line, called TYW3. In these mice, strong regulatory elements from the Thy1 gene drive the expression of yellow fluorescent protein (YFP). In the retina, YFP fluorescence could be detected in only a small subset of RGCs. The brightest cell population (W3-RGCs) was chosen for further characterization. Interestingly, these cells remained silent under most common visual inputs, including locomotion in a natural environment, reproduced with videos from a camera mounted on the head of a freely moving rat. The only condition that elicited reliable responses from W3-RGCs was the movement of small spots that differed from the movement of the background; the cells did not respond when the two movements coincided.

The canonical pathway for delivering visual input to RGCs involves direct connections between bipolar cells and RGCs. In other words, RGCs typically are two synapses away from a photoreceptor, which ensures the fastest transmission of the signal. Surprisingly, W3-RGCs receive strong and selective input from unusual excitatory amacrine cell type interneurons, called VG3-ACs. With the introduction of the VG3-AC partner to the circuit, W3-RGCs appear to be three synapses away from a photoreceptor, slowing visual information delivery to the cells. A possible explanation is that W3-RGCs compare motion in the center and surround of the receptive field, firing only when the two are asynchronous. For the comparison to be temporally precise, input from the surround must arrive at the cell rapidly and/or input from the center must be delayed.

The crucial connection between W3-RGCs and VG3-ACs is ensured by homophilic interactions between Sdk2 proteins expressed at the cell surface of both cell types. Sdk2 is a cell adhesion protein whose expression is detected in the embryonic retina shortly before birth and persists into adulthood, spanning the periods of lamina formation and synaptogenesis. Sdk2 knockout caused no alterations in retinal structure, but the strength of synaptic connections between VG3-ACs and W3-RGCs dropped about 20-fold.

For your eyes only, the Sdk2 entries have been updated and are publicly available as of this release.

UniProtKB news

Change of the cross-references to eggNOG

We have introduced an additional field in the cross-references to the eggNOG database to indicate the taxonomic scope of an orthologous group.

Text format

Example: U3JAG9

DR   eggNOG; ENOG410IEUN; Eukaryota.
DR   eggNOG; ENOG410YVPU; LUCA.

XML format

Example: U3JAG9

<dbReference type="eggNOG" id="ENOG410IEUN">
  <property type="taxonomic scope" value="Eukaryota"/>
</dbReference>
<dbReference type="eggNOG" id="ENOG410YVPU">
  <property type="taxonomic scope" value="LUCA"/>
</dbReference>

This change did not affect the XSD, but may nevertheless require code changes.
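
As an illustration of the kind of code change this may involve, the sketch below reads the new property with Python's standard `xml.etree.ElementTree` module. It is not an official parser: the `<entry>` wrapper and the function name `eggnog_scopes` are assumptions made for the example, and full UniProtKB XML files typically declare an XML namespace, in which case the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# The two dbReference elements are taken from the example above;
# the <entry> wrapper is only there to make the snippet well-formed.
SNIPPET = """
<entry>
  <dbReference type="eggNOG" id="ENOG410IEUN">
    <property type="taxonomic scope" value="Eukaryota"/>
  </dbReference>
  <dbReference type="eggNOG" id="ENOG410YVPU">
    <property type="taxonomic scope" value="LUCA"/>
  </dbReference>
</entry>
"""


def eggnog_scopes(root):
    """Map each eggNOG group ID to its taxonomic scope (None if absent)."""
    scopes = {}
    for ref in root.iter("dbReference"):
        if ref.get("type") != "eggNOG":
            continue
        scope = None
        for prop in ref.findall("property"):
            if prop.get("type") == "taxonomic scope":
                scope = prop.get("value")
        scopes[ref.get("id")] = scope
    return scopes


print(eggnog_scopes(ET.fromstring(SNIPPET)))
# {'ENOG410IEUN': 'Eukaryota', 'ENOG410YVPU': 'LUCA'}
```

Returning `None` for a missing property keeps the function usable on older data in which the eggNOG cross-reference has no child element.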

RDF format

Example: U3JAG9

uniprot:U3JAG9
  rdfs:seeAlso <http://purl.uniprot.org/eggnog/ENOG410IEUN> ,
               <http://purl.uniprot.org/eggnog/ENOG410YVPU> .
<http://purl.uniprot.org/eggnog/ENOG410IEUN>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/eggNOG> ;
  rdfs:comment "Eukaryota" .
<http://purl.uniprot.org/eggnog/ENOG410YVPU>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/eggNOG> ;
  rdfs:comment "LUCA" .

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2015_10

Published October 14, 2015

Headline

The smell of the sea in UniProtKB

Memories left by a walk on the seashore bring into play all our senses, of which smell is not the least. This characteristic ‘smell of the sea’ is carried by a little molecule, dimethylsulfide (DMS), which is an enzymatic cleavage product of dimethylsulfoniopropionate (DMSP).

DMSP is one of the most abundant organic molecules in the world, with a billion tons made and turned over every year. It is produced by marine macroalgae, as well as by single-cell phytoplankton species, such as diatoms, dinoflagellates and haptophytes, and occurs at high concentrations in their cytoplasm. The physiological function of DMSP is not yet fully established. It is thought to function as an osmolyte. It has also been proposed to serve as a cryoprotectant in polar algae. DMSP enzymatic cleavage products, DMS and acrylate, are quite effective at scavenging free radicals and other reactive oxygen species. Hence they may serve as an antioxidant system.

In healthy growing phytoplankton, DMSP freely diffuses in the cytoplasm, and only minute quantities are released. This amount is sufficient to attract zooplankton which start feeding on algae. Organisms grazed upon or infected by viruses as well as stressed or senescent cells release greater amounts of DMSP, which is taken up by bacterioplankton, metabolized into DMS and used as a source of carbon and sulfur. DMS is not only used by seawater microorganisms, it is also volatile and a small fraction of it is released into the atmosphere where it creates an olfactory landscape providing seabirds with orientation cues to potential food supplies. In the atmosphere, DMS is oxidized to sulfuric acid and becomes an important source of sulfate aerosols. These act as condensation nuclei, causing water molecules to coalesce and clouds to form. The cycle is closed when rain brings back the sulfur-containing particles into the ocean. Interestingly, phytoplankton appear to convert DMSP into DMS very rapidly when they are stressed by UV radiation. The local increase in volatile DMS increases cloud formation, hence decreasing direct sunlight exposure and relieving stress. Through this mechanism, plankton may shape local weather for their own benefit.

DMS release by seaweed was described in 1935 and DMSP was identified as its precursor almost 70 years ago, but the enzyme catalyzing the reaction remained elusive until last June. Using classical biochemical approaches, as well as genomic and proteomic analyses, Alcolombri et al. identified ALMA1 from the chloroplastic membrane fraction of the coccolithophore alga Emiliania huxleyi, an abundant bloom-forming marine phytoplankton. This enzyme is a redox-sensitive homotetramer that belongs to the aspartate/glutamate racemase superfamily and catalyzes DMSP cleavage into DMS and acrylate. Phylogenetic studies show the presence of numerous ALMA1 homologs in major, globally distributed phytoplankton taxa and in other marine organisms. This major discovery paves the way for future investigations on the physiological role of DMS and may allow quantification of the relative biogeochemical contribution of algae and bacteria to global DMS production.

If you want to take a deep, though virtual breath of sea smell, you can visit ALMA1 entries that are available to you as of this release.

UniProtKB news

Cross-references to WBParaSite

Cross-references have been added to WBParaSite, an open access resource providing access to the genome sequences, genome browsers, semi-automatic annotation and comparative genomics analysis of parasitic worms (helminths). WormBase ParaSite is closely integrated with and complementary to the main WormBase resource, the central focus of which is the model nematode Caenorhabditis elegans and its close relatives.

WBParaSite is available at http://parasite.wormbase.org.

The format of the explicit links is:

Resource abbreviation WBParaSite
Resource identifier Transcript identifier
Optional information 1 Protein identifier
Optional information 2 Gene identifier

Cross-references to WBParaSite may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: A8PGQ3

Show all entries having a cross-reference to WBParaSite.

Text format

Example: A8PGQ3

DR   WBParaSite; Bm6838; Bm6838; WBGene00227099.

XML format

Example: A8PGQ3

<dbReference type="WBParaSite" id="Bm6838">
  <property type="protein sequence ID" value="Bm6838"/>
  <property type="gene ID" value="WBGene00227099"/>
</dbReference>

RDF format

Example: A8PGQ3

uniprot:A8PGQ3
  rdfs:seeAlso <http://purl.uniprot.org/wbparasite/Bm6838> .
<http://purl.uniprot.org/wbparasite/Bm6838>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/WBParaSite> ;
  up:translatedTo <http://purl.uniprot.org/wbparasite/Bm6838> ;
  up:transcribedFrom <http://purl.uniprot.org/wbparasite/WBGene00227099> .

Removal of the cross-references to CYGD

Cross-references to CYGD have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniParc news

UniParc cross-reference types changes

UniParc and UniProtKB entries both contain cross-references to external databases. For consistency reasons we have adjusted the names of these databases in UniParc to the ones in UniProtKB. In particular we have changed the following types of cross-references in UniParc:

Old type            New type
ENSEMBL             Ensembl
FLYBASE             FlyBase
H_INV               H-InvDB
REFSEQ              RefSeq
TAIR_ARABIDOPSIS    TAIR
WORMBASE            WormBase
WormBase ParaSite   WBParaSite

Example:

Previous XML:

<dbReference type="WormBase ParaSite" id="A_03330" version_i="1" active="Y" created="2014-09-12" last="2015-07-09">
  <property type="NCBI_taxonomy_id" value="6185"/>
</dbReference>

New XML:

<dbReference type="WBParaSite" id="A_03330" version_i="1" active="Y" created="2014-09-12" last="2015-07-09">
  <property type="NCBI_taxonomy_id" value="6185"/>
</dbReference>
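
Code that has to read UniParc data produced both before and after this change could normalize the old type names to the new ones. The sketch below encodes only the renames announced above; the names `RENAMED_TYPES` and `normalize_type` are hypothetical and not part of any UniProt library.

```python
# Old UniParc cross-reference type -> new type, as listed in this release.
RENAMED_TYPES = {
    "ENSEMBL": "Ensembl",
    "FLYBASE": "FlyBase",
    "H_INV": "H-InvDB",
    "REFSEQ": "RefSeq",
    "TAIR_ARABIDOPSIS": "TAIR",
    "WORMBASE": "WormBase",
    "WormBase ParaSite": "WBParaSite",
}


def normalize_type(xref_type):
    """Return the new type name, leaving unaffected types unchanged."""
    return RENAMED_TYPES.get(xref_type, xref_type)


print(normalize_type("WormBase ParaSite"))  # WBParaSite
print(normalize_type("EMBL"))               # EMBL (not renamed)
```

The lookup is deliberately case-sensitive, since "WORMBASE" and "WormBase" are distinct old and new spellings.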

UniProt release 2015_09

Published September 16, 2015

Headline

Life (and death) in 2D

While the cinema industry struggles to produce ever more realistic 3D, even 4D, films out of 2D images, scientists have achieved the exact opposite: in a collection of (3D) vertebrate embryos, they have identified a mutant that flattens in the course of development.

Vertebrates have a defined body shape in which correct tissue and organ shape and alignment are essential for function. Correct morphogenesis depends on force generation, force transmission through the tissue, and the response of tissues and extracellular matrix to force. In addition, embryos must be able to withstand environmental perturbations, such as gravity. Already in 1917, in his master work “On Growth and Form”, Sir D’Arcy Wentworth Thompson postulated that “the forms as well as the actions of our bodies are entirely conditioned (save for certain exceptions in the case of aquatic animals) by the strength of gravity upon this globe”. It is actually from an “aquatic animal”, a fish, that the confirmation of this hypothesis came earlier this year. A screen of Japanese rice fish mutants identified an embryo that displayed pronounced body flattening around stage 25-28 (50-64 h post fertilization). Although general development was not delayed, the mutant exhibited delayed blastopore closure and progressive body collapse from mid-neurulation, surviving until just before hatching. This mutant was aptly named hirame, which means flatfish in Japanese. When embryos were grown in agarose, their collapse correlated with the direction of gravity, reflecting the mutant’s inability to withstand external forces. The mutants also showed defective fibronectin fibril formation.

The hirame mutation lies within the Yap1 gene and creates a premature stop codon at position 164. Yap1 is a transcriptional co-activator that promotes proliferation and inhibits cell death during embryonic development. Porazinski and colleagues showed that Yap1 is also essential for actomyosin-mediated tissue tension.

The hypothesis with the strongest experimental support is that YAP1 acts on ARHGAP18 expression (and possibly that of other ARHGAP18-related genes), which in turn regulates cortical actomyosin network formation. Actomyosin contraction promotes fibronectin assembly, which could be a critical in vivo mechanism for the integration of mechanical signals, such as tension generated by actomyosin, with biochemical signals, such as integrin signaling, ensuring proper tissue shape and alignment and appropriate organ and body shape.

YAP1 knockdown in the human cell line hTERT-RPE1 caused a phenotype reminiscent of the fish embryo phenotype. When cultured in a 3D spheroid system, these retinal epithelial cells also exhibited collapse upon exposure to external forces, marked reduction of cortical F-actin bundles and lack of typical fibronectin fibril pattern. This suggests that YAP1 orthologs may play a similar role in all vertebrates, and possibly beyond.

As of this release, YAP1 protein entries have been updated and are publicly available.

UniProtKB news

Release of variation files for 27 new species

In collaboration with Ensembl and Ensembl Genomes, UniProt would like to announce the release of variation files for 27 species in addition to the human, mouse and zebrafish files currently available in the dedicated variants directory on the UniProt FTP sites. This release includes a further 13 vertebrate species, including agriculturally important species: cow, chicken, pig and sheep. These new variant catalogues also expand the diversity of species, with variants for plant, fungal and protist species, including rice, bread wheat, barley and grape.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2015_08

Published July 22, 2015

Headline

Pseudo-allergy, real progress

Do you sniffle and sneeze as trees start to bloom and the pollen gets airborne? Your mast cells are to blame. These cells reside at strategic anatomical positions, such as skin, gastrointestinal tract and lung, and provide us with a first line of defence against potential harm from our environment. Besides their beneficial functions, mast cells can also react to compounds that do not represent any threat to our health, such as pollen. This process begins with the interaction of an antigen with immunoglobulin E (IgE) bound to high affinity Fc epsilon receptors at the mast cell surface. It ends with the release of histamine and various inflammatory and immunomodulatory substances, which causes allergy. Most adverse reactions to peptidergic and small molecule therapeutic agents, collectively called basic secretagogues, also rely on mast cell stimulation, but do not correlate with IgE antibody titer. They proceed through a different, not yet fully understood, IgE-independent mechanism called pseudo-allergy, which eventually also leads to the release of granule-stored histamine. In humans, MRGPRX2 has been proposed, among others, to serve as a receptor for basic secretagogues, but until recently there was no direct proof of its involvement.

Earlier this year, McNeil et al. showed that “basic secretagogues activate mouse mast cells in vitro and in vivo through a single receptor, Mrgprb2, the ortholog of the human G-protein-coupled receptor MRGPRX2”. The first achievement of this study was to prove the orthology of these 2 genes, which was not an easy task. In humans, MRGPRX2 is found in a cluster with 3 other MRGPRX family members. This cluster is dramatically expanded in mouse, with 22 potential protein-coding genes that show comparable sequence identity to MRGPRX2. To establish orthology, the authors used 2 criteria: expression pattern (expression in mast cells) and pharmacology (some 16 compounds were tested for mast cell activation). Then Mrgprb2a knockout mice were created. Gene targeting was performed using a zinc-finger-nuclease-based strategy, as a classical homologous recombination approach was impossible in this genomic locus due to too many repetitive sequences. The null animals showed no visible phenotype in normal conditions, but didn’t produce any pseudo-allergic reaction in response to small-molecule therapeutic drugs. Secretagogue-induced histamine release, inflammation and airway contraction were abolished.

This elegant study does not deal simply with the identification of “just another receptor”. It addresses an issue that may concern all of us at some point in our lives. Basic secretagogues are compounds that are frequently encountered either in natural fluids, such as the wasp venom toxin mastoparan, or in various drugs, such as cationic peptidergic drugs, antibiotics (fluoroquinolone family), neuromuscular blocking agents, etc. These latter are routinely used in surgery to reduce unwanted muscle movement and are responsible for nearly 60% of allergic reactions in a surgical setting. The majority of these compounds activate mast cells in an Mrgprb2-dependent manner. The animal model created by McNeil et al. could then be used for pre-clinical testing of new drugs in order to minimize pseudo-allergic risks. In addition, the identification of a motif common to several Mrgprb2 agonists may allow the prediction of side effects of clinically used compounds.

As of this release, primate MRGPRX2 and mouse Mrgprb2 entries have been updated and are publicly available.

UniProt service news

Programmatic access to UniProt with sparql.uniprot.org

We are happy to announce the public release of the UniProt SPARQL endpoint at sparql.uniprot.org, where you can also find links to the documentation of the UniProt RDF data model and an interactive query interface with sample queries to get you started.

For those unfamiliar with SPARQL, this is a W3C standardized query language for the Semantic Web. If you know SQL, it will look familiar to you and you can do similar types of queries with it. SPARQL also allows you to query and combine data from a variety of SPARQL endpoints, providing a valuable low-cost alternative to building your own data warehouse. You can combine UniProt data from sparql.uniprot.org with that from the SPARQL endpoints hosted by the EBI’s RDF platform, the SIB’s neXtProt SPARQL endpoint, etc.
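
To give a flavour of what such a query can look like, the sketch below builds a request URL for a small SPARQL query that lists proteins together with their eggNOG cross-references, using the `rdfs:seeAlso` and `up:database` patterns shown in the RDF examples elsewhere on this page. It is only a sketch under stated assumptions: the endpoint path in `ENDPOINT` and the function name `build_request_url` are assumptions, the query is passed in a `query` parameter as in the standard SPARQL protocol, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Assumed endpoint URL; check sparql.uniprot.org for the current address.
ENDPOINT = "https://sparql.uniprot.org/sparql"

QUERY = """\
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?protein ?xref
WHERE {
  ?protein rdfs:seeAlso ?xref .
  ?xref up:database <http://purl.uniprot.org/database/eggNOG> .
}
LIMIT 10
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the URL-encoded query text."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client. The `LIMIT` clause keeps the result small while experimenting.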

We look forward to feedback from the community to help us improve this service further.

UniProtKB news

Addition of human somatic protein altering variants from COSMIC

The Catalogue of Somatic Mutations in Cancer (COSMIC) is a database of manually curated somatic variants from peer reviewed publications and genome-wide studies. UniProt, in collaboration with COSMIC, has integrated the protein altering variants of COSMIC release v71 into the homo_sapiens_variation.txt.gz file. The COSMIC variants provide the standard information found in the homo_sapiens_variation.txt.gz file and, within the Phenotype/Disease field, additional information on the primary tissue(s) in which the variant was found.

Changes to the humdisease.txt file

We have added cross-references to MedGen to the humdisease.txt file. MedGen, the NCBI portal to information about human genetic disorders, conveys multiple disease names, medical terms and information for the same disorder from various sources into a specific concept. Each MedGen concept has a Concept Unique Identifier (CUI) that allows computational access to global disease information. Together with disease nomenclature, this includes disease definitions, clinical findings, available clinical and research tests, molecular resources, professional guidelines, original and review literature, consumer resources, clinical trials, and Web links to other related resources. MedGen is a valuable resource to allow UniProtKB users to access an extensive range of biomedical data.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Blepharophimosis-ptosis-intellectual disability syndrome
  • Ehlers-Danlos syndrome 2

UniProt release 2015_07

Published June 24, 2015

Headline

Coding-non-coding RNAs: a game of hide-and-seek

It is well established that microRNAs (miRNAs) are small eukaryotic non-coding RNA molecules that repress the expression of their target genes. miRNAs are transcribed by RNA polymerase II as large primary transcripts (pri-miRNAs) that share the same characteristics as all other RNA polymerase II-transcribed RNAs, such as the presence of a 5’-cap and a 3’-poly(A) tail. pri-miRNAs are processed to smaller pre-miRNAs, which in turn are cleaved to produce mature miRNAs. In animals, this final maturation step occurs in the cytoplasm, while in plants it takes place in the nucleus. Cytosolic mature miRNAs guide the RNA-induced silencing complex (RISC) in repressing target genes through either cleavage or translational repression of their mRNAs.

A recent article published in Nature revealed that plant pri-miRNAs may not be as non-coding as previously assumed. Some do actually encode small regulatory peptides, called miPEPs, which enhance the accumulation of their corresponding mature miRNAs. This has been shown for Medicago truncatula pri-miR171b and Arabidopsis thaliana pri-miR165a which encode miPEP171b and miPEP165a, respectively. These two 20- and 18-amino acid-long peptides have been shown to be translated in vivo and to promote the transcription of their pri-miRNAs, resulting in the accumulation of mature miR171b and miR165a. This increase leads to the reduction of lateral root development in the case of miR171b and stimulation of main root growth for miR165a. The same effects were observed when synthetic peptides were applied to plants, suggesting that miPEPs might have agronomical applications.

Five other pri-miRNAs were experimentally shown to encode active miPEPs, suggesting that the presence of such small regulatory peptides may be widespread in plants. Computer analysis of the 5’-end of 50 pri-miRNAs in Arabidopsis thaliana revealed that all of them contained at least one ORF, which, if translated, could give rise to 3- to 59-amino acid-long peptides of unknown biological activity. No common signature was found among them, possibly due to the specificity of each putative miPEP for its own pri-miRNA.

Arabidopsis thaliana miPEP165a, miPEP160b, miPEP164a and miPEP319a and Medicago truncatula miPEP171b peptides have been manually annotated and are integrated into UniProtKB/Swiss-Prot as of this release. The sequences of the other 2 Medicago truncatula functionally characterized peptides, miPEP169d and miPEP171e, are unfortunately not available.

UniProtKB news

Cross-references to ESTHER

Cross-references have been added to ESTHER, a database of the Alpha/Beta-hydrolase fold superfamily of proteins.

ESTHER is available at http://bioweb.ensam.inra.fr/ESTHER/general?what=index.

The format of the explicit links is:

Resource abbreviation ESTHER
Resource identifier Gene locus
Optional information 1 Family name

Example: P0C064

Show all entries having a cross-reference to ESTHER.

Text format

Example: P0C064

DR   ESTHER; bacbr-grsb; Thioesterase.

XML format

Example: P0C064

<dbReference type="ESTHER" id="bacbr-grsb">
  <property type="family name" value="Thioesterase"/>
</dbReference>

Cross-references to Genevisible

Cross-references have been added to Genevisible, a search portal to normalized and curated expression data from GENEVESTIGATOR.

Genevisible is available at http://genevisible.com/search.

The format of the explicit links is:

Resource abbreviation Genevisible
Resource identifier Gene identifier
Optional information 1 Organism code

Example: P31946

Show all entries having a cross-reference to Genevisible.

Text format

Example: P31946

DR   Genevisible; P31946; HS.

XML format

Example: P31946

<dbReference type="Genevisible" id="P31946">
  <property type="organism ID" value="HS"/>
</dbReference>

Removal of the cross-references to Genevestigator

Cross-references to Genevestigator have been removed.

Change of the cross-references to PomBase

Cross-references to PomBase may now optionally indicate a gene designation in order to align them with the format of other model organism databases.

Text format

Example: Q9P3A7

DR   PomBase; SPAC1565.08; cdc48.

Example: O60058

DR   PomBase; SPBC56F2.07c; -.

XML format

Example: Q9P3A7

<dbReference type="PomBase" id="SPAC1565.08">
  <property type="gene designation" value="cdc48"/>
</dbReference>

Example: O60058

<dbReference type="PomBase" id="SPBC56F2.07c"/>

This change did not affect the XSD, but may nevertheless require code changes.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Hypogonadism LHB-related

Changes to keywords

New keywords:

UniProt release 2015_06

Published May 27, 2015

Headline

POLQ, a new target for cancer therapy?

DNA double-strand breaks (DSBs) are our worst cellular enemy, yet they do occur all the time, often accidentally, as a result of endogenous metabolic reactions and replication stress. They can also be induced by exogenous sources, like radiation or exposure of cells to DNA-damaging agents, or serve as intermediates in a number of programmed recombination events, during meiosis or assembly of immunoglobulins or T-cell receptors. Whatever their origin, DSBs are highly toxic to cells if not repaired, and if repaired incorrectly, they can cause deletions, translocations, and fusions in the DNA, which can have dramatic consequences.

The most frequently used mechanisms for DSB repair are homologous recombination (HR) and non-homologous end-joining (NHEJ), but alternative forms of end-joining exist, such as microhomology-mediated end-joining (MMEJ). HR is highly accurate and therefore important for preserving genome integrity. NHEJ results in small, less than 10 bp deletions. The most error-prone is MMEJ, which promotes inter- and intrachromosome rearrangements associated with relatively large DNA deletions (30-200 bp).

While NHEJ preferentially acts on ‘blunt-ended’ DNA breaks, HR is preceded by resection of DNA around the 5’-ends of the break. RAD51 proteins bind to the resulting 3’ single-stranded overhangs and help them to recognize complementary (homologous) DNA in another intact DNA helix. The overhangs then invade the homologous double-strand and use it as a template for repair. MMEJ also starts with DNA resected ends, but in this case it is DNA polymerase theta (POLQ) that directly binds them and enables short (2-6 bp) homologous DNA sequences in overhangs to form base pairs. The homology can be either terminal, or internal, as far as 5 nucleotides away from the 3’ terminus. Once homology has been found, each DNA strand is extended from the base-paired region using the opposing overhang as a template, and, in case of internal homology, the terminal unpaired regions are removed.

Normal cells tend to down-regulate POLQ. Cancer cells, which exhibit HR deficiency due to mutations in genes involved in HR repair, tend to up-regulate POLQ. This allows them to limit DNA damage and survive, although at the expense of genome integrity. In these cells, increased levels of POLQ further inhibit HR, by binding to RAD51 proteins and preventing their accumulation at resected DNA ends.

Cytotoxic drugs used for cancer therapy promote DSBs in order to overwhelm DNA repair mechanisms and induce cell death. Could the use of POLQ inhibitors, alone or in combination with other DNA damaging drugs, improve the treatment of HR-deficient tumors? It’s too early to tell, but preliminary results suggest that it is worth investigating. Indeed, knockdown of POLQ in HR-deficient cells reduces cell survival following treatment with cisplatin or mitomycin C, and human tumor cells expressing shRNA against both FANCD2 (HR knockdown) and POLQ (MMEJ knockdown) do not grow in mice.

At the beginning of this year, POLQ was in the spotlight thanks to 3 very interesting publications, which shed light on its role and mode of action. UniProtKB/Swiss-Prot POLQ entries have been updated accordingly and are publicly available as of this release.

UniProt release 2015_05

Published April 29, 2015

Headline

A never-ending race between evolution and genomic integrity

Primate evolution has been accompanied by several waves of retrotransposon insertions. Nowadays about 50% of our genome is composed of endogenous retroelements (EREs). Although many of them have lost their transposition ability, some remain quite active. For instance, among the 500,000 copies of long interspersed element-1 (LINE1 or L1) present in the human genome, about 100 are retrotransposition-competent, and over 40 of them are highly active. Other EREs, such as short interspersed nuclear elements (SINEs), including Alu repeats, and SINE-VNTR-Alu (SVA), a composite hominid-restricted ERE, also actively move in the genome. It is currently estimated that new, non-parental L1 integrations occur in nearly 1/100 births and roughly every 20th newborn baby has a new Alu retrotransposon somewhere in its DNA.

Obviously, having DNA jumping around our genome may be quite harmful and our cells work hard to repress EREs. Transcriptional silencing is controlled by TRIM28 and KRAB domain-containing zinc finger proteins (KRAB-ZNFs). TRIM28 forms a repressive complex (KAP1 complex) by interacting with CHD3, a subunit of the nucleosome remodeling and deacetylation (NuRD) complex, and SETDB1, which specifically methylates histone H3 at ‘Lys-9’, inducing heterochromatinization. KRAB-ZNFs bind DNA and recruit the KAP1 complex to target sites.

KRAB-ZNF genes are one of the fastest growing gene families in primates, possibly to limit the activity of newly emerged ERE classes. This hypothesis has gained support in an elegant study recently published in Nature. In this article, Jacobs et al. used a heterologous cell system in which murine embryonic stem cells harbored a copy of human chromosome 11, which contains a number of EREs, including SVA and the L1 subfamily L1PA. In this cellular environment, the primate-specific EREs were derepressed. Individual overexpression of highly expressed human KRAB-ZNFs, confirmed by reporter gene assays, allowed the identification of genes involved in the repression of specific ERE (sub)families: ZNF91 and ZNF93, which acted on SVA and L1PA4, respectively. The authors then traced back the phylogenetic history of these genes in the primate lineage and analyzed the parallel evolution of their target EREs. They could show that a new wave of L1PA insertions in great ape genomes was made possible through the deletion of a 129-bp element in L1PA3, which destroyed the ZNF93-binding site. This could be interpreted as an ERE response to a series of structural changes in ZNF93 that occurred shortly before and improved host repression of L1PA activity.

In conclusion, the expansion of a new ERE drives the evolution of a host repressor which leads to a subsequent change in ERE to escape repression, and so on. It is a never-ending race of our genome with itself, which leads inexorably to greater and greater complexity.

As of this release, updated human ZNF91 and ZNF93 entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of IPI species proteome data sets from FTP site

Since the closure of IPI in 2011, UniProt has provided proteome data sets for IPI species on its FTP site. In UniProt release 2015_03, we started to provide new data sets for reference proteomes, which also cover the IPI species. We have now removed the old ‘proteomes’ FTP directory, which contained only data for the IPI species.

UniProtKB XSD change for evidence attribution

We have made the following changes to the UniProtKB XSD to allow a more fine-grained attribution of evidence to the parts of comment annotations that contain “free-text” descriptions:

  • The cardinality of all existing text elements was changed from maxOccurs="1" to maxOccurs="unbounded".
  • The phDependence, redoxPotential and temperatureDependence child elements of the bpcCommentGroup now have a sequence of text child elements.
  • The note child element of the isoformType was replaced by a sequence of text child elements.

The XSD changes are shown below:

    <xs:complexType name="commentType">
        ...
            <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
        ...
    <xs:group name="bpcCommentGroup">
       ...
             <xs:element name="absorption" minOccurs="0">
                ...
                        <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
                ...
            <xs:element name="kinetics" minOccurs="0">
                ...
                        <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
                ...

            <!-- The following 3 elements will in future each have a sequence of <text> child elements:
            <xs:element name="phDependence" type="evidencedStringType" minOccurs="0"/>
            <xs:element name="redoxPotential" type="evidencedStringType" minOccurs="0"/>
            <xs:element name="temperatureDependence" type="evidencedStringType" minOccurs="0"/>
            -->
            <xs:element name="phDependence" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="redoxPotential" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="temperatureDependence" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        ...
    <xs:complexType name="isoformType">
        ...
            <!-- The <note> element will be replaced by a sequence of <text> elements:
            <xs:element name="note" minOccurs="0">
                <xs:complexType>
                    <xs:simpleContent>
                        <xs:extension base="xs:string">
                            <xs:attribute name="evidence" type="intListType" use="optional"/>
                        </xs:extension>
                    </xs:simpleContent>
                </xs:complexType>
            </xs:element>
            -->
            <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
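Code that previously read a single text child from a comment may need to iterate over several. The following Python sketch illustrates one way to do this with the standard library. The sample fragment is hypothetical and written only for this example, and the sketch assumes that the optional evidence attribute holds a space-separated list of integers.

```python
import xml.etree.ElementTree as ET

# Namespace used by UniProtKB XML files.
NS = {"up": "http://uniprot.org/uniprot"}

# Hypothetical comment fragment written for this example (not a real entry).
SAMPLE = """\
<comment xmlns="http://uniprot.org/uniprot" type="function">
  <text evidence="1 2">First free-text statement.</text>
  <text evidence="3">Second free-text statement.</text>
  <text>Statement without evidence.</text>
</comment>
"""


def comment_texts(comment):
    """Return a (text, evidence keys) pair for every <text> child of a comment."""
    results = []
    for node in comment.findall("up:text", NS):
        # Assumption: 'evidence' is optional and space-separated when present.
        keys = [int(key) for key in node.get("evidence", "").split()]
        results.append((node.text or "", keys))
    return results


if __name__ == "__main__":
    for text, keys in comment_texts(ET.fromstring(SAMPLE)):
        print(keys, text)
```

Returning a list even when only one text element is present keeps calling code identical for old-style and new-style comments.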

Cross-references to BioMuta

Cross-references have been added to BioMuta, a curated single-nucleotide variation and disease association database.

BioMuta is available at https://hive.biochemistry.gwu.edu/tools/biomuta/.

The format of the explicit links is:

Resource abbreviation BioMuta
Resource identifier Gene name.

Example: P02787

Text format

Example: P02787

DR   BioMuta; TF; -.

XML format

Example: P02787

<dbReference type="BioMuta" id="TF"/>
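As an illustration, a text-format cross-reference line such as the one above can be split into its fields with a few lines of Python. This is a minimal sketch, not an official parser: the function name parse_dr_line is hypothetical, and the sketch assumes that fields are separated by semicolons, that the line ends with a period, and that a lone "-" marks an empty optional field.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, identifier, extra fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    # Drop the terminating period, if present.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier: %r" % line)
    resource, identifier = fields[0], fields[1]
    # Assumption: a lone "-" stands for "no optional information".
    extras = [field for field in fields[2:] if field != "-"]
    return resource, identifier, extras


if __name__ == "__main__":
    print(parse_dr_line("DR   BioMuta; TF; -."))
```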

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Lipidation’ (‘LIPID’ in the flat file):

  • O-palmitoleyl serine

UniProt release 2015_04

Published April 1, 2015

Headline

Of CAT tails and protein translation by-products

Correct translation of mRNA into functional proteins is an essential cellular process. Defects in translation not only deprive cells of proteins needed for almost any task, but also produce by-products that can negatively impact these tasks and be toxic. Therefore translational garbage has to be removed.

One source of errors is defective ribosomes that stop during translation and hence produce incomplete polypeptide chains. All organisms have evolved mechanisms to manage translation arrest. In eukaryotes, ribosome stalling induces dissociation of the small 40S subunit and recruitment of the ‘ribosome quality control complex’ (RQC) to the large 60S subunit. RQC mediates the ubiquitination and degradation of the incompletely synthesized polypeptide chains.

Over the past few years, the mode of action of RQC has begun to be elucidated. The molecular components of RQC include listerin, an E3 ubiquitin ligase encoded by RKR1 in yeast and LTN1 in mammals, the AAA adenosine triphosphatase CDC48/VCP/p97 and ubiquitin-binding cofactors, as well as 2 proteins of unknown function. Listerin mediates the ubiquitination of the stalled polypeptide and subsequent recruitment of CDC48/VCP/p97 to the complex. The ATPase may provide the mechanical force to allow extraction of the nascent chain and its delivery to the proteasome for degradation.

Three recent studies have addressed the function of one of the uncharacterized proteins of the complex, called RQC2 in yeast and NEMF in mammals. In mammals, NEMF/RQC2 is responsible for the selective recognition of stalled 60S subunits. It does so by making multiple simultaneous contacts with 60S and peptidyl-tRNA to sense nascent chain occupancy. NEMF/RQC2 is also important for the stable association of listerin with the complex. Work in yeast not only corroborates these findings, but also reveals another unexpected function for NEMF/RQC2. NEMF/RQC2 recruits alanine- and threonine-charged tRNAs to the ribosomal A site and directs the elongation of stalled nascent chains independently of mRNA or 40S subunits, leading to non-templated C-terminal Ala and Thr extensions, aptly named CAT tails. The exact function of CAT tails is still under investigation, but they seem to induce an HSF1-dependent heat shock response in yeast through a mechanism that is yet to be determined. The heat shock response may help cells to buffer against malformed proteins. Alternatively, the extension at the C-terminus may serve to test the functional integrity of large ribosomal subunits, so that the cell can detect and dispose of defective large subunits that induce stalling.

mRNA-independent polypeptide biosynthesis has already been described in microorganisms. Classical examples of such peptides are peptide antibiotics, including actinomycin, bacitracin, colistin, and polymyxin B. In addition, in Staphylococcus aureus, pentaglycines acting as cross-linkers in the cell wall peptidoglycan are synthesized in the absence of mRNA. Although still considered a very marginal event, the assembly of amino acids without an mRNA blueprint might be more widespread than previously anticipated.

As of this release, updated yeast RQC2 and mammalian NEMF entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Reducing redundancy in proteomes

The UniProt Knowledgebase (UniProtKB) has witnessed an exponential growth in the last few years with a two-fold increase in the number of entries in 2014. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. This increase has been accompanied by a high level of redundancy in UniProtKB/TrEMBL and many sequences are over-represented in the database. This is especially true for bacterial species where different strains of the same species have been sequenced and submitted (e.g. 1,692 strains of Mycobacterium tuberculosis, corresponding to 5.97 million entries). To reduce this redundancy, we have developed a procedure to identify highly redundant proteomes within species groups using a combination of manual and automatic methods. We have applied this procedure to bacterial proteomes (which constituted 81% of UniProtKB/TrEMBL in release 2015_03) and sequences corresponding to redundant proteomes (47 million entries) have been removed from UniProtKB. These sequences are still available in the UniParc sequence archive dataset within UniProt. From now on, we will no longer create new UniProtKB/TrEMBL records for proteomes identified as redundant.

Protein sequences belonging to proteomes that are not identified as redundant remain in UniProtKB. All proteomes are searchable through the UniProt website’s Proteomes pages. Sequences corresponding to redundant proteomes are available for download from UniParc and you will also be directed to alternate non-redundant proteome(s) available for the same species. The history (i.e. previous versions) of redundant UniProtKB records is still available.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Acid phosphatase deficiency

Changes to keywords

Modified keyword:

UniMES news

Retirement of UniProt Metagenomic and Environmental Sequences (UniMES)

The UniProt Metagenomic and Environmental Sequences (UniMES) database was developed as a repository for metagenomic and environmental data. UniProt has retired UniMES as there is now a resource at the EBI that is dedicated to serving metagenomic researchers. Henceforth, we recommend using the EBI Metagenomics portal instead. In addition to providing a repository of metagenomics sequence data, EBI Metagenomics allows you to view functional and taxonomic analyses and to submit your own samples for analysis.

UniProt release 2015_03

Published March 4, 2015

Headline

Regulation of translation initiation through folding

Many physiopathological events, such as stress or nutrient deprivation, induce rapid changes in cellular protein levels. In these cases, cells preferentially use translational control of existing mRNAs over transcriptional control, since the latter generates a slower response. Translation can be divided into 4 steps, initiation, elongation, termination, and ribosome recycling, but most regulation occurs at the initiation level.
In eukaryotes, translation initiation involves recruitment of the 40S ribosome to mRNA by the eukaryotic initiation factor 4F (eIF4F) complex. This complex is composed of eIF4E, which binds to the mRNA 5’ cap structure, eIF4A, an RNA helicase, and eIF4G, a scaffolding protein. Availability of eIF4E is rate-limiting in this process and it is an important target for control. Under stress or starvation conditions, when translation has to be rapidly repressed, eIF4E binding proteins (4E-BPs) interact with eIF4E, outcompeting eIF4G and hence preventing eIF4F assembly and cap-dependent translation initiation. Three 4E-BPs have been identified in mammals. 4E-BP2 (EIF4EBP2) is one of them. It is an intrinsically disordered protein (IDP) that contains several phosphorylation sites. In its unphosphorylated state, 4E-BP2 interacts with eIF4E via 2 domains: a YXXXXLΦ motif (residues 54 through 60) and a secondary dynamic motif (residues 78 through 82). The unphosphorylated (or minimally phosphorylated), eIF4E-binding form of EIF4EBP2 is unstable and targeted for degradation via the ubiquitin-proteasome pathway. By contrast, highly phosphorylated 4E-BP2 is very stable, but only weakly binds to eIF4E and hence can be outcompeted by eIF4G, allowing translation to occur.

How does phosphorylation regulate 4E-BP2 interaction with eIF4E and its stability? It has been recently shown that phosphorylation induces a widespread disorder-to-order transition occurring in 2 steps. First, phosphorylation at Thr-37 and Thr-46 by MTOR induces folding of residues Pro-18 to Arg-62 into a four-stranded β-domain that sequesters the helical YXXXXLΦ motif into a partially buried β-strand, blocking accessibility to eIF4E. The folding also protects Lys-57 from ubiquitination, preventing proteasomal degradation. This ordered structure is further stabilized by phosphorylation at Ser-65, Thr-70 and Ser-83. The fully phosphorylated protein has an affinity for eIF4E that is 4,000-fold lower than that of the unphosphorylated form. This observation implies that binding must be coupled to unfolding in order to free the YXXXXLΦ motif, and this is indeed what is experimentally observed. When the phosphorylated form binds eIF4E, it undergoes an order-to-disorder transition, as suggested by NMR spectra that are similar to those of the unphosphorylated form.

Although it has long been suspected that the function of IDPs may be controlled by post-translational modifications (PTMs), this is the first report experimentally showing how a PTM can fold an entire domain. These new data have been annotated in UniProtKB/Swiss-Prot and, as of this release, the updated EIF4EBP2 entry is publicly available.

UniProtKB news

New proteomics mapping files

Mappings of UniProt Knowledgebase (UniProtKB) human sequences to identified human peptides from public mass spectrometry (MS) proteomics repositories can now be found in the new dedicated ‘proteomics_mapping’ directory on the UniProt FTP site together with a description of how the mappings were generated. The mappings are based on our analysis of the content of those MS proteomics repositories that openly share with us their data and quality metrics concerning peptide identifications.

Mass spectrometry provides direct experimental evidence for the existence of proteins and these new peptide mappings greatly increase the proportion of human sequences in UniProtKB whose existence is supported by experimental proteomics data. The human reference proteome currently contains 89383 sequences and our analysis provides mass spectrometry evidence for 68229 of those sequences.

In future UniProt releases, we expect to add data from more MS proteomics repositories and additional species. We very much welcome the feedback of the community on our efforts.

New FTP repository for reference proteomes

Based on a gene-centric perspective, the UniProt Knowledgebase (UniProtKB) now provides data sets for reference proteomes, which can be found in the new reference_proteomes directory.

As of release 2015_03, it encompasses 1933 species distributed in Eukaryota, Archaea and Bacteria. Viruses will be added in the next release.

Removal of the cross-references to PhosSite

Cross-references to PhosSite have been removed.

Removal of the cross-references to PptaseDB

Cross-references to PptaseDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Glycogen storage disease 14
  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B
  • Leigh syndrome, X-linked
  • Mental retardation, X-linked 59

Changes to keywords

New keyword:

UniParc news

UniParc cross-references with proteome identifier and component

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements. We have introduced two new types for cross-references to external database records from which UniProt proteomes are derived: The type "proteome_id" shows the identifier of the corresponding UniProt proteome and the type "component" the genomic component which encodes the protein. As a first step, we have added this information to bacterial ENA records.

Example:

<entry dataset="uniparc">
    <accession>UPI0000131B78</accession>
    <dbReference type="EMBL" id="AAK44239" version_i="1" active="Y" version="1" created="2003-03-12" last="2014-11-23">
        <property type="NCBI_GI" value="13879058"/>
        <property type="NCBI_taxonomy_id" value="83331"/>
        <property type="protein_name" value="serine/threonine protein kinase"/>
        <property type="gene_name" value="MT0017"/>
        <property type="proteome_id" value="UP000001020"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    <dbReference type="EMBL" id="ABQ71734" version_i="1" active="Y" version="1" created="2007-07-09" last="2014-11-23">
        <property type="NCBI_GI" value="148503925"/>
        <property type="NCBI_taxonomy_id" value="419947"/>
        <property type="protein_name" value="serine/threonine protein kinase"/>
        <property type="gene_name" value="pknB"/>
        <property type="proteome_id" value="UP000001988"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    ...
    <dbReference type="EMBL_CON" id="EFD75652" version_i="1" active="Y" version="2" created="2011-12-05" last="2014-11-23">
        <property type="NCBI_taxonomy_id" value="537209"/>
        <property type="protein_name" value="transmembrane serine/threonine-protein kinase B pknB"/>
        <property type="gene_name" value="TBIG_00439"/>
        <property type="proteome_id" value="UP000004676"/>
        <property type="component" value="Unassembled WGS sequence"/>
    </dbReference>
    ...
</entry>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
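As a rough illustration of the kind of code change that may be needed, the Python sketch below collects the new proteome_id and component properties for each cross-reference. It uses a trimmed copy of the example above without any XML namespace declaration; real UniParc files may declare one, in which case the element lookups would need to be namespace-qualified. The function name proteome_links is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above, without namespace declarations.
SAMPLE = """\
<entry dataset="uniparc">
    <accession>UPI0000131B78</accession>
    <dbReference type="EMBL" id="AAK44239" version_i="1" active="Y">
        <property type="NCBI_taxonomy_id" value="83331"/>
        <property type="proteome_id" value="UP000001020"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    <dbReference type="EMBL_CON" id="EFD75652" version_i="1" active="Y">
        <property type="NCBI_taxonomy_id" value="537209"/>
        <property type="proteome_id" value="UP000004676"/>
        <property type="component" value="Unassembled WGS sequence"/>
    </dbReference>
</entry>
"""


def proteome_links(entry):
    """Return (record id, proteome id, component) for cross-references
    that carry a proteome_id property; other cross-references are skipped."""
    links = []
    for ref in entry.findall("dbReference"):
        props = {p.get("type"): p.get("value") for p in ref.findall("property")}
        if "proteome_id" in props:
            # 'component' may be absent, in which case None is returned.
            links.append((ref.get("id"), props["proteome_id"], props.get("component")))
    return links


if __name__ == "__main__":
    for link in proteome_links(ET.fromstring(SAMPLE)):
        print(link)
```

Reading the properties into a dictionary keyed by type means that code of this shape tolerates further property types being added later.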

UniProt RDF news

UniProt RDF files compressed with XZ instead of gzip

The UniProt RDF distribution has been available on the UniProt FTP site as gzip compressed RDF/XML files since 2008. We have now changed the compression algorithm from gzip to XZ, which has a number of features that make it a better choice for the UniProt RDF data:

  • It reduces the file size by approximately 23%, which improves FTP download time.
  • It can be decompressed in parallel, which can give faster decompression rates on current hardware with a minimum of 6-8 CPU cores.
  • It allows random access.
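XZ-compressed files can be read as a stream without first decompressing them to disk. The Python sketch below shows this with the standard lzma module; the file name and helper functions are hypothetical, and the sample file is generated by the script itself. Note that the lzma module decompresses in a single thread, so this sketch does not demonstrate the parallel decompression mentioned above.

```python
import lzma
import os
import tempfile


def write_sample_xz(path, lines):
    """Write the given lines to an XZ-compressed text file (demo helper)."""
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def count_lines_xz(path):
    """Stream an XZ-compressed text file and count its lines."""
    count = 0
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for _ in handle:
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.rdf.xz")  # hypothetical file name
        write_sample_xz(sample, ["<rdf:RDF>", "</rdf:RDF>"])
        print(count_lines_xz(sample))
```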

Replacement of UniProt RDF file go.rdf with go.owl

The UniProt RDF distribution on the UniProt FTP site previously contained a go.rdf file. This file has been replaced with a go.owl file that contains a subset of the official go.owl distribution of the Gene Ontology consortium, taken as a snapshot that is in sync with the GO annotations in the UniProt Knowledgebase.

In practical terms this means:

UniProt release 2015_02

Published February 4, 2015

Headline

Mosquitoes prefer humans

Blood-feeding is extremely unusual in insects. Among the 1 to 10 million insect species, only some 10,000 feed on blood, and among these, only 100 target humans. Not only is this behavior rare in terms of species, but within one species, it may be gender-specific. However, this small proportion of insects has a dramatic impact on human health. Female mosquitoes are major vectors of human diseases, such as malaria, dengue, yellow fever and chikungunya. The mosquito’s preference for humans is a matter of evolution. Aedes aegypti, the main vector of dengue and yellow fevers, actually exists as 2 subspecies, Aedes aegypti aegypti, feeding on human blood, and Aedes aegypti formosus, a generalist, zoophilic mosquito. It is currently thought that Aedes aegypti aegypti originated from a small population of forest-dwelling Aedes aegypti that became isolated in North Africa when a period of severe drought began in the Sahara approximately 4,000 years ago. The mosquito adapted to these harsh conditions, evolved a preference for breeding in artificial water storage containers and specialized in biting humans. This “domestic” form was reintroduced along the coast of East Africa following human movement and trade, and spread across much of the tropical and subtropical world. Today, along the coasts of Kenya, the 2 subspecies coexist, sometimes just a few hundred meters apart: the domestic Aedes aegypti aegypti is found in homes, laying eggs in water stored in containers indoors, while the forest Aedes aegypti formosus avoids human settlements, laying eggs in tree holes outdoors.

What is the genetic basis underlying the mosquito’s preference for humans? In order to answer this question, McBride et al. established 29 colonies of each Aedes aegypti subspecies. They observed that, contrary to their forest counterparts, domestic females showed a strong preference for human odor as compared to guinea pig, and were also more responsive in assays in which insects were directly exposed to live hosts, i.e. an anaesthetized guinea pig and a human arm (the owner of which should be congratulated for her commitment). Analysis of gene expression in antennae, the major olfactory organ, in both subspecies revealed almost 1,000 differentially expressed genes and among them, odorant receptors, a family of insect chemosensory receptors, were significantly overrepresented. Odorant receptor 4 (Or4) was of particular interest. It was upregulated in human-preferring mosquitoes, and was also the second most highly expressed odorant receptor in the antennae of domestic females. In addition, Or4 exhibited extensive variations that might affect its function. Or4 responds to sulcatone, a volatile odorant produced by a variety of animals and plants, but whose levels in humans are uniquely high. Seven major Or4 alleles have been identified. Alleles A, B, C, F, and G were highly sensitive to sulcatone, whereas D and E were much less sensitive. Interestingly, human-preferring colonies from various African, Asian and American countries were dominated by A-like alleles, whereas animal-preferring colonies were highly variable. This suggests that both Or4 expression levels and ligand-sensitivity play a role in human preference. Surprisingly, sulcatone has been described as a mosquito repellent at certain concentrations. McBride et al. hypothesized that it could be a repellent at high concentrations and an attractant at lower levels.

The important behavioral (r)evolution from the ancestral Aedes aegypti formosus to Aedes aegypti aegypti is unlikely to be due to a single gene, but at least Or4 is one genetic element clearly associated with these changes. The corresponding Or4 UniProtKB entry has been manually annotated and is publicly available as of this release.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Glycogen storage disease 14
  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Leigh syndrome, X-linked
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B
  • Mental retardation, X-linked 59

Changes to keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2015_01

Published January 7, 2015

Headline

Thalidomide, the pharmacological version of yin and yang

In the 1950s, the German company Chemie Gruenenthal brought a new drug to the market, thalidomide. It was primarily used as a sedative, but as it also had anti-emetic properties, it soon became popular to alleviate “morning sickness” in pregnant women. About 10,000 children were born to women taking thalidomide. They exhibited severe malformations, affecting limbs, ears, heart and other internal organs and only 50% survived. By the early sixties, the teratogenic effect of thalidomide had been established and its use discontinued. However, scientists’ interest in this molecule never stopped. In 1965, thalidomide was shown to have immunomodulatory and anti-inflammatory properties in patients with erythema nodosum leprosum, an inflammatory complication of leprosy. More recently, thalidomide was shown to be effective against several hematological cancers, including multiple myeloma, by inhibiting cancer cell proliferation and modulating the immune system and the tumor microenvironment.

In 60 years, observations on thalidomide effects have accumulated, but its mode of action is still not fully elucidated. Nevertheless, major progress has been made. A breakthrough came in 2010 when thalidomide’s primary target, a protein called cereblon (CRBN), was identified. CRBN is a component of a ubiquitin E3 complex, called CRL4. This complex is made of at least 4 proteins, CUL4, DDB1, RBX1 and CRBN. Each protein has its specific function. CUL4 provides a scaffold for assembly of RBX1 and DDB1, RBX1 is the docking site for the activated E2 protein, and DDB1 recruits substrate-specificity receptors, such as CRBN, that form the substrate-presenting side of the CRL4 complex. The recently published CRL4 3D structure revealed that the ligase arm of CUL4 is quite mobile, establishing a ubiquitination zone. As it is a promiscuous enzyme, any lysine crossing this zone may be a target.

How does thalidomide affect CRBN activity within the CRL4 complex? In the presence of thalidomide, 2 transcription factors, IKZF1 and IKZF3, are recognized by CRBN and targeted for destruction by the proteasome. Neither of these proteins is a substrate in the absence of the drug. Under normal conditions, IKZF1 and IKZF3 regulate B- and T-cell development. IKZF1 suppresses the expression of IL2 in T-cells and stimulates the expression of IRF4. This observation sheds light upon the immunomodulatory effects of thalidomide. What about endogenous CRBN substrates? Until recently, none were known. Last July, Fischer et al. published the results of their search for proteins whose ubiquitination by CRL4/CRBN was inhibited by thalidomide (or thalidomide derivatives) and identified MEIS2, a homeodomain-containing protein. MEIS2 has been implicated in some aspects of normal human development. In bats, differential MEIS2 expression has been observed during limb development. A failure in limb development is a very striking feature of “thalidomide babies”. Hence MEIS2 may be a candidate for some aspects of thalidomide-induced teratogenicity.

Based on 3D structure analysis of the CRL4 complex, a model has been proposed in which thalidomide binds to CRBN at the canonical substrate-binding site. This interferes with the binding of endogenous CRBN substrates, impairs their ubiquitination and subsequent destruction, and results in their up-regulation. Conversely, the presence of thalidomide modifies the CRBN surface, creating a new binding site for neo-substrates, leading to their down-regulation.

As of this release, the updated versions of CRBN, DDB1, CUL4B, RBX1 entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to UniProt Proteomes

For several years now, UniProt has been providing ‘proteome’ sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced. In the past, these sets were based on the taxonomy of the organisms, but as more and more genomes of the same organism are being sequenced, we have recently introduced unique proteome identifiers to distinguish individual proteomes. These proteomes can be queried and downloaded from the new Proteomes section of the UniProt website. UniProtKB entries that are part of a proteome now have a cross-reference to their proteome and, where known, we also indicate the name of the component that encodes the respective protein.

UniProt Proteomes are available at http://www.uniprot.org/proteomes/.

The format of the explicit links is:

Resource abbreviation Proteomes
Resource identifier Proteome identifier.
Optional information 1 Component name.

Example: P78363

Text format

Example: P78363

DR   Proteomes; UP000005640; Chromosome 1.

XML format

Example: P78363

<dbReference type="Proteomes" id="UP000005640">
  <property type="component" value="Chromosome 1"/>
</dbReference>

RDF format

In the RDF format, we have introduced a new property proteome to represent a proteomes resource. The component is indicated by a relative URI reference.

Example: P78363

uniprot:P78363
  up:proteome <http://purl.uniprot.org/proteomes/UP000005640#Chromosome%201> .
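Because the component is carried in the fragment of the URI, it can be recovered by splitting the URI and decoding the percent-escapes. The Python sketch below shows one way to do this with the standard library; the function name split_proteome_uri is hypothetical, and the sketch assumes that the proteome identifier is the last path segment of the URI, as in the example above.

```python
from urllib.parse import unquote, urldefrag


def split_proteome_uri(uri):
    """Split a proteome URI into (proteome identifier, component name)."""
    base, fragment = urldefrag(uri)
    # Assumption: the identifier is the last path segment of the URI.
    proteome_id = base.rstrip("/").rsplit("/", 1)[-1]
    # The fragment is percent-encoded, e.g. "%20" for a space.
    component = unquote(fragment) if fragment else None
    return proteome_id, component


if __name__ == "__main__":
    uri = "http://purl.uniprot.org/proteomes/UP000005640#Chromosome%201"
    print(split_proteome_uri(uri))
```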

Cross-references to DEPOD

Cross-references have been added to DEPOD, the human DEPhOsphorylation Database.

DEPOD is available at http://www.koehn.embl.de/depod/.

The format of the explicit links is:

Resource abbreviation DEPOD
Resource identifier UniProtKB accession number.

Example: Q99502

Text format

Example: Q99502

DR   DEPOD; Q99502; -.

XML format

Example: Q99502

<dbReference type="DEPOD" id="Q99502"/>

Cross-references to MoonProt

Cross-references have been added to MoonProt, a manually curated database containing information about the known moonlighting proteins.

MoonProt is available at http://www.moonlightingproteins.org/.

The format of the explicit links is:

Resource abbreviation MoonProt
Resource identifier UniProtKB accession number.

Example: P31230

Text format

Example: P31230

DR   MoonProt; P31230; -.

XML format

Example: P31230

<dbReference type="MoonProt" id="P31230"/>

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt release 2014_11

Published November 26, 2014

Headline

Higher and higher

It is in human nature to push back the frontiers of what is possible. Modern humans left Africa and conquered the world. During their exploration, they met other humans who had already colonized the most improbable places tens of thousands of years earlier, maybe themselves being driven by the same urge to discover new horizons. Among the most challenging dwelling places is the Tibetan plateau, with an average elevation exceeding 4,500 meters. At this altitude, the oxygen concentration is only 60% of that available at sea level. Nevertheless, the Tibetan plateau is thought to have been inhabited for some 25,000 years.

To maintain oxygen homeostasis at high altitude (over 2,500 meters), the body responds in various ways, including increasing ventilation over the short term and increasing red blood cell production over the long term (see review). Hypoxia-inducible factor (HIF) plays a key role in the regulation of gene transcription in this process. HIF is a dimer composed of a common subunit beta, called ARNT, and 1 of 3 alpha subunits, called HIF1A, EPAS1, or HIF3A. Under normoxic conditions, HIFs-alpha are hydroxylated by prolyl hydroxylases EGLN1 (also known as PHD2), EGLN2 or EGLN3. Hydroxylation allows interaction with an E3-ubiquitin ligase, named VHL, followed by proteasomal degradation. Under hypoxic conditions, hydroxylation is arrested and HIFs-alpha are stabilized. They dimerize with ARNT and initiate the hypoxia response transcriptional program, which includes the stimulation of erythropoiesis. Strikingly, Tibetans exhibit a blunted erythropoietic response and their hemoglobin concentration is maintained at values expected at sea-level.

In 2010, 3 independent publications identified genes or loci showing evidence of hypoxia adaptation in Tibetans. All 3 studies pointed to 2 genes, among many others, as being significantly associated with the decreased hemoglobin phenotype: EPAS1 and EGLN1. Interestingly, Tibetans may have inherited EPAS1 SNPs from Denisova man, an archaic Homo species identified in the Altai mountains of Siberia. The Tibetan-specific EGLN1 variant is more recent, currently estimated to have appeared some 8,000 years ago. It contains 2 single amino acid polymorphisms: p.Asp4Glu and p.Cys127Ser. Some characterization of this double variant came in September this year. Lorenzo et al. showed that it exhibited a lower K(m) value for oxygen, suggesting that it promotes increased HIF-alpha hydroxylation and degradation under hypoxic conditions. It could hence abrogate hypoxia-induced and HIF-mediated augmentation of erythropoiesis. Song et al. reported that the double variant specifically interferes with binding to PTGES3 (also called HSP90 cochaperone p23), but not to other known EGLN1 ligands, including FKBP8 or HSP90AB. As PTGES3-binding may facilitate HIF-alpha hydroxylation, a perturbation in this interaction would actually decrease HIF-alpha hydroxylation, leading to decreased degradation and consequently increased HIF activity. The central question about the functional consequences of the Tibetan EGLN1 variant remains open…

It is not yet clear how high-altitude populations adapted to their harsh environment, but at least we begin to grasp the amazing complexity of this phenomenon. The scientific community has studied mostly 3 populations, Tibetans, Andeans and Ethiopians settled on the Simien plateau. They all exhibit patterns of genetic adaptation largely distinct from one another and the overlap is surprisingly low. The polymorphisms identified so far may not be straightforward loss- or gain-of-function, but they may instead fine tune complex interactions in which several proteins, possibly themselves carrying adaptive variations, are involved in a tissue-specific context.

As of this release, the UniProtKB/Swiss-Prot human EGLN1 entry has been updated with the new characterization data of the p.[Asp4Glu; Cys127Ser] polymorphism. On the new UniProt website, this information can be found in the ‘Sequences’ section, under the ‘Polymorphism’ and ‘Natural variant’ subsections.

UniProtKB news

New mouse and zebrafish variation files

We would like to announce the addition of two species, mouse and zebrafish, to the set of variation files available in the dedicated variants directory on the UniProt FTP sites. Both files catalogue protein-altering Single Nucleotide Variants (SNVs or SNPs), stop-gained and stop-lost variants for UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences of each species. These variants have been automatically mapped to UniProtKB sequences, including isoform sequences, through Ensembl. We very much welcome feedback from the community on our efforts.

Structuring of ‘cofactor’ annotations

We have structured the previously free text cofactor annotations in UniProtKB and mapped individual cofactors to ChEBI identifiers. How this affects different UniProtKB distribution formats is described below.

Text format

 CC   -!- COFACTOR:( <molecule>:)?
(CC       Name=<cofactor>; Xref=<database>:<identifier>;( Evidence={<evidence>};)?)* 
(CC       Note=<free text>;)?

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 0 or more times (*).

A cofactor annotation consists of:

  • An optional <molecule> value that indicates the isoform, chain or peptide to which this annotation applies.
  • Zero or more cofactors that are each described with:
    • A Name= field that shows the cofactor name.
    • An Xref= field that shows a cross-reference to the corresponding ChEBI record.
    • An optional Evidence= field that provides the evidence for the cofactor (see Evidence in the UniProtKB flat file format).
  • An optional Note= field that provides additional information.

Each cofactor description and the optional Note= field start on a new line. Lines are wrapped at a line length of 75 characters and indented to increase readability.

Examples:

  • Protein binds alternate/several cofactors
    CC   -!- COFACTOR:
    CC       Name=Mg(2+); Xref=ChEBI:CHEBI:18420;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Name=Co(2+); Xref=ChEBI:CHEBI:48828;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Note=Binds 2 divalent ions per subunit (magnesium or cobalt).
    CC       {ECO:0000255|HAMAP-Rule:MF_00086};
    CC   -!- COFACTOR:
    CC       Name=K(+); Xref=ChEBI:CHEBI:29103;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Note=Binds 1 potassium ion per subunit. {ECO:0000255|HAMAP-
    CC       Rule:MF_00086};
    
  • Isoforms
    CC   -!- COFACTOR: Isoform 1:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:16683188};
    CC       Note=Isoform 1 binds 3 Zn(2+) ions. {ECO:0000269|PubMed:16683188};
    CC   -!- COFACTOR: Isoform 2:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:16683188};
    CC       Note=Isoform 2 binds 2 Zn(2+) ions. {ECO:0000269|PubMed:16683188};
    
  • Chains
    CC   -!- COFACTOR: Serine protease NS3:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:9060645};
    CC       Note=Binds 1 zinc ion. {ECO:0000269|PubMed:9060645};
    CC   -!- COFACTOR: Non-structural protein 5A:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105; Evidence={ECO:0000250};
    CC       Note=Binds 1 zinc ion in the NS5A N-terminal domain.
    CC       {ECO:0000250};
    
  • Cofactor unknown
    CC   -!- COFACTOR:
    CC       Note=Does not require a metal cofactor.
    CC       {ECO:0000269|PubMed:24450804};
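
The pattern and examples above could be read programmatically along the following lines. This is a minimal sketch, not an official parser: the names unwrap and parse_cofactor are hypothetical, the input is assumed to be the lines of a single COFACTOR annotation, and wrapped lines that end in a hyphen are assumed to have been broken inside a token (as in "HAMAP-" / "Rule:MF_00086") and are therefore rejoined without a space.

```python
import re

# One cofactor description: Name=...; Xref=<database>:<identifier>; with optional Evidence={...};
COFACTOR_RE = re.compile(
    r"Name=(?P<name>[^;]+); Xref=(?P<db>[^:;]+):(?P<id>[^;]+);"
    r"(?: Evidence=\{(?P<evidence>[^}]*)\};)?"
)


def unwrap(lines):
    """Drop the 'CC' line code and rejoin wrapped lines into one string."""
    text = ""
    for line in lines:
        chunk = line[2:].strip()
        # Assumption: no space is restored after a line that ends in '-'.
        text += chunk if (not text or text.endswith("-")) else " " + chunk
    return text


def parse_cofactor(lines):
    """Parse the lines of one COFACTOR annotation into a dictionary."""
    rest = unwrap(lines).partition("COFACTOR:")[2]
    # The optional molecule is whatever precedes the first Name= or Note= field.
    molecule = re.match(r"\s*(?:(?P<molecule>[^=]+?):)?\s*(?=Name=|Note=|$)", rest)
    note = re.search(r"Note=(?P<note>.*);\s*$", rest)
    return {
        "molecule": molecule.group("molecule") if molecule else None,
        "cofactors": [m.groupdict() for m in COFACTOR_RE.finditer(rest)],
        "note": note.group("note") if note else None,
    }


example = [
    "CC   -!- COFACTOR:",
    "CC       Name=K(+); Xref=ChEBI:CHEBI:29103;",
    "CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};",
    "CC       Note=Binds 1 potassium ion per subunit. {ECO:0000255|HAMAP-",
    "CC       Rule:MF_00086};",
]
print(parse_cofactor(example))
```

The note is returned with its trailing evidence tag still attached, so that callers can decide whether to keep or strip it.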
    

XML format

We modified the XSD type commentType and introduced a new XSD type cofactorType, as shown below. We also moved the declaration of the molecule element – already used in the comment type "subcellular location" – to a more generic context so that it can also be used by other comment types such as "cofactor".

    <xs:complexType name="commentType">
        ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:choice minOccurs="0">
            ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'cofactor' annotations.</xs:documentation>
                    </xs:annotation>
                    <xs:element name="cofactor" type="cofactorType" maxOccurs="unbounded"/>
                </xs:sequence>

                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'subcellular location' annotations.</xs:documentation>
                    </xs:annotation>
                    <!-- <xs:element name="molecule" type="moleculeType" minOccurs="0"/> -->
                    <xs:element name="subcellularLocation" type="subcellularLocationType" maxOccurs="unbounded"/>
                </xs:sequence>
                ...
            </xs:choice>
            ...
            <xs:element name="text" type="evidencedStringType" minOccurs="0">
                <xs:annotation>
                    <xs:documentation>Used to store non-structured types of annotations,
                    as well as optional free-text notes of structured types of annotations.</xs:documentation>
                </xs:annotation>
            </xs:element>
            ...
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="cofactorType">
        <xs:annotation>
            <xs:documentation>Describes a cofactor.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="dbReference" type="dbReferenceType"/>
        </xs:sequence>
        <xs:attribute name="evidence" type="intListType" use="optional"/>
    </xs:complexType>

A cofactor annotation consists of a sequence of:

  • An optional molecule element that indicates the isoform, chain or peptide to which this annotation applies.
  • Zero or more cofactor elements that each describe an individual cofactor with the following child elements:
    • A name element shows the cofactor name.
    • A dbReference element represents a cross-reference to the corresponding ChEBI record.
  • An optional text element that provides additional information.

Examples:

  • Protein binds alternate/several cofactors
    <comment type="cofactor">
      <cofactor evidence="1">
        <name>Mg(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:18420"/>
      </cofactor>
      <cofactor evidence="1">
        <name>Co(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:48828"/>
      </cofactor>
      <text evidence="1">Binds 2 divalent ions per subunit (magnesium or cobalt).</text>
    </comment>
    <comment type="cofactor">
      <cofactor evidence="1">
        <name>K(+)</name>
        <dbReference type="ChEBI" id="CHEBI:29103"/>
      </cofactor>
      <text evidence="1">Binds 1 potassium ion per subunit.</text>
    </comment>
    ...
    <evidence key="1" type="ECO:0000255">
      <source>
        <dbReference type="HAMAP-Rule" id="MF_00086"/>
      </source>
    </evidence>
    
  • Isoforms
    <comment type="cofactor">
      <molecule>Isoform 1</molecule>
      <cofactor evidence="9">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="9">Isoform 1 binds 3 Zn(2+) ions.</text>
    </comment>
    <comment type="cofactor">
      <molecule>Isoform 2</molecule>
      <cofactor evidence="9">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="9">Isoform 2 binds 2 Zn(2+) ions.</text>
    </comment>
    ...
    <evidence key="9" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="16683188"/>
      </source>
    </evidence>
    
  • Chains
    <comment type="cofactor">
      <molecule>Serine protease NS3</molecule>
      <cofactor evidence="13">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="13">Binds 1 zinc ion.</text>
    </comment>
    <comment type="cofactor">
      <molecule>Non-structural protein 5A</molecule>
      <cofactor evidence="3">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="3">Binds 1 zinc ion in the NS5A N-terminal domain.</text>
    </comment>
    ...
    <evidence key="3" type="ECO:0000250"/>
    ...
    <evidence key="13" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="9060645"/>
      </source>
    </evidence>
    
  • Cofactor unknown
    <comment type="cofactor">
      <text evidence="1">Does not require a metal cofactor.</text>
    </comment>
    ...
    <evidence key="1" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="24450804"/>
      </source>
    </evidence>
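
To show how the cofactor elements and their evidence keys fit together, here is a small Python sketch that reads a fragment modelled on the isoform example above. It is an illustration under assumptions: the enclosing entry element is a simplified wrapper without the XML namespace that full UniProtKB XML files declare, and the function name cofactors is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<entry>
  <comment type="cofactor">
    <molecule>Isoform 1</molecule>
    <cofactor evidence="9">
      <name>Zn(2+)</name>
      <dbReference type="ChEBI" id="CHEBI:29105"/>
    </cofactor>
    <text evidence="9">Isoform 1 binds 3 Zn(2+) ions.</text>
  </comment>
  <evidence key="9" type="ECO:0000269">
    <source>
      <dbReference type="PubMed" id="16683188"/>
    </source>
  </evidence>
</entry>"""


def cofactors(xml_text):
    """Return (molecule, name, ChEBI id, evidence labels) for each cofactor."""
    root = ET.fromstring(xml_text)
    # Map each evidence key to a "type|source" label, or just "type" if no source.
    evidence = {}
    for ev in root.iter("evidence"):
        label = ev.get("type")
        source = ev.find("source/dbReference")
        if source is not None:
            label += "|" + source.get("type") + ":" + source.get("id")
        evidence[ev.get("key")] = label
    rows = []
    for comment in root.iter("comment"):
        if comment.get("type") != "cofactor":
            continue
        molecule = comment.findtext("molecule")  # None when the element is absent
        for cof in comment.findall("cofactor"):
            # The evidence attribute is a space-separated list of keys.
            keys = cof.get("evidence", "").split()
            rows.append((molecule, cof.findtext("name"),
                         cof.find("dbReference").get("id"),
                         [evidence[k] for k in keys]))
    return rows


print(cofactors(XML))
```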
    

RDF format

We introduced a new cofactor property to list individual cofactors as ChEBI resource descriptions. As for other types of annotations, an optional sequence property may describe the molecule to which the annotation applies and an optional rdfs:comment property may provide additional information.

Examples:

Note: Evidence tags are omitted from the examples to make it easier to read them. They are represented as for all other types of annotations by reification of the concerned statements.

  • Protein binds alternate/several cofactors
    uniprot:Q5M434
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 2 divalent ions per subunit (magnesium or cobalt)." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_18420> ,
                  <http://purl.obolibrary.org/obo/CHEBI_48828> .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 potassium ion per subunit." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29103> .
    
  • Isoforms
    uniprot:O15304
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Isoform 1 binds 3 Zn(2+) ions." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence isoform:O15304-1 .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Isoform 2 binds 2 Zn(2+) ions." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence isoform:O15304-2 .
    
  • Chains
    uniprot:P26662
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 zinc ion." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence annotation:PRO_0000037644 .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 zinc ion in the NS5A N-terminal domain." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence annotation:PRO_0000037647 .
    
  • Cofactor unknown
    uniprot:A9CEQ7
      up:annotation SHA:1 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Does not require a metal cofactor." .
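
The ChEBI identifiers used in the text and XML formats and the resource URIs used in the RDF examples differ only in their prefix and separator. The following Python sketch converts between the two forms seen in the examples above; the function names are hypothetical and the mapping is inferred from those examples only.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def chebi_to_purl(chebi_id: str) -> str:
    """Turn 'CHEBI:18420' into the URI form used by the up:cofactor property."""
    prefix, _, number = chebi_id.partition(":")
    if prefix != "CHEBI" or not number.isdigit():
        raise ValueError(f"not a ChEBI identifier: {chebi_id!r}")
    return f"{OBO_PREFIX}CHEBI_{number}"


def purl_to_chebi(uri: str) -> str:
    """Turn the URI form back into a 'CHEBI:<number>' identifier."""
    stem = OBO_PREFIX + "CHEBI_"
    if not uri.startswith(stem):
        raise ValueError(f"not a ChEBI URI: {uri!r}")
    return "CHEBI:" + uri[len(stem):]


print(chebi_to_purl("CHEBI:18420"))
```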
    

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2014_10

Published October 29, 2014

Headline

K for Koagulation

After several weeks of a cholesterol-free diet, chickens start bleeding. The phenotype cannot be reversed by the addition of purified cholesterol to their chow, suggesting that another compound could have been extracted along with cholesterol during food preparation. This observation made by Henrik Dam in 1929 led to the identification of a fat-soluble vitamin involved in coagulation, also known as vitamin K (K standing for Koagulationsvitamin, the original German name for this compound, since the initial observations were reported in a German journal). This discovery was awarded the Nobel prize in 1943, but vitamin K function and metabolism are still extensively studied.

In plants, vitamin K plays an essential role in photosynthesis, which is why it is particularly enriched in photosynthetic tissues, such as green leaves. In animals, vitamin K is essential for blood clotting and bone mineralization. It also prevents the calcification of arteries and other soft tissues. More recently, vitamin K has been shown to function as a mitochondrial electron carrier and to serve as a ligand for the nuclear receptor SXR, which controls the expression of genes involved in transport and metabolism of endo- and xenobiotics.

The most extensively studied vitamin K function is its role as a cosubstrate for vitamin K-dependent gamma-carboxylase (GGCX). This enzyme catalyzes gamma-carboxylation of glutamate residues in target proteins. The modification activates several blood factor proteins and leads to initiation of the blood coagulation cascade. Widely used anticoagulant drugs, called coumarins, take advantage of this property and act as vitamin K antagonists. For example, warfarin is thought to inhibit vitamin K epoxide reductase complex subunit 1 (VKORC1), blocking vitamin K recycling, hence depleting active vitamin K stores. Although life-saving, the use of warfarin is quite tricky, as inadequate dosage may have dramatic consequences, either embolism or thrombosis (underdosage), or potentially fatal hemorrhage (overdosage). Interindividual genetic variations greatly affect warfarin efficiency. Polymorphisms within VKORC1 and CYP2C9, a cytochrome P450 family member involved in coumarin inactivation, together account for approximately 30% of population dose variance. A genetic variant p.Val433Met in another P450 family member, CYP4F2, has also been reported to increase warfarin requirements. CYP4F2 has recently been shown to catalyze vitamin K omega-hydroxylation, a key step in vitamin K degradation. The p.Val433Met polymorphism produces a decrease of CYP4F2 protein in the liver. Lower CYP4F2 levels likely lead to an increase in hepatic vitamin K levels, hence more molecules that warfarin must antagonize, resulting in coumarin resistance in individuals bearing this polymorphism.

As of this release, an updated version of the UniProtKB/Swiss-Prot CYP4F2 entry is available. Proteins undergoing gamma-carboxylation can be retrieved using the keyword Gamma-carboxyglutamic acid.

UniProtKB news

Change of the cross-reference ArrayExpress to ExpressionAtlas

The Expression Atlas database provides information on baseline and differential gene expression patterns under different biological conditions. Experiments in Expression Atlas are selected from the ArrayExpress database of functional genomics experiments. Because UniProtKB entries cross-reference only this subset of experiments, we have changed the resource abbreviation for these cross-references from ArrayExpress to ExpressionAtlas. We have at the same time added a field to indicate the type of expression patterns for which information can be found in the ExpressionAtlas (see examples below).

Text format

Example: P15822

DR   ExpressionAtlas; P15822; baseline and differential.

XML format

Example: P15822

<dbReference type="ExpressionAtlas" id="P15822">
  <property type="expression patterns" value="baseline and differential"/>
</dbReference>
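
As a sketch of how the new field might be read from the XML representation, the Python snippet below extracts the value of the "expression patterns" property from a standalone dbReference element. The function name expression_patterns is hypothetical, and the fragment omits the XML namespace that full UniProtKB XML files declare.

```python
import xml.etree.ElementTree as ET
from typing import Optional

FRAGMENT = """<dbReference type="ExpressionAtlas" id="P15822">
  <property type="expression patterns" value="baseline and differential"/>
</dbReference>"""


def expression_patterns(xml_text: str) -> Optional[str]:
    """Return the 'expression patterns' value of an ExpressionAtlas cross-reference."""
    ref = ET.fromstring(xml_text)
    if ref.get("type") != "ExpressionAtlas":
        return None
    for prop in ref.findall("property"):
        if prop.get("type") == "expression patterns":
            return prop.get("value")
    return None


print(expression_patterns(FRAGMENT))  # baseline and differential
```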

RDF format

Example: P15822

uniprot:P15822
  rdfs:seeAlso <http://purl.uniprot.org/expressionatlas/P15822> .
<http://purl.uniprot.org/expressionatlas/P15822>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ExpressionAtlas> ;
  rdfs:comment "baseline and differential" .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Mental retardation, X-linked 59

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • (4R)-5-hydroxyleucine
  • (4R)-5-oxoleucine

Deleted term:

  • 5-methoxythiazole-4-carboxylic acid (Val-Cys)

UniProt release 2014_09

Published October 1, 2014

Headline

Small is beautiful (and useful)

In large-scale studies, small proteins tend to be overlooked. They are difficult to predict using software tools and they often escape detection by mass spectrometry. When cDNA sequences are submitted, short coding sequences (CDS) are only rarely annotated and hence do not appear in any protein databases, including UniProtKB/TrEMBL or GenPept, and their nucleotide sequences can be tagged as ‘non-coding RNAs’. In UniProtKB/Swiss-Prot, we are aware of the problem, but we are often reluctant to annotate uncharacterized small ORFs, for fear of introducing imaginary sequences into a database we wish to be as reliable as possible. That is why we are thrilled when new data become available that allow us to fill the gap.

This happened a few months ago, with the publication of 2 articles that brought the ‘noncoding transcript’ AK092578 under the spotlight. Pauli et al. were investigating inductive events during early embryogenesis in zebrafish. In order to find new signaling peptides, they sequenced RNAs extracted from embryos at different developmental stages and combined this approach with ribosome profiling to select for transcripts most likely to be translated. This led to the discovery of 399 novel coding genes. 28 of them contained a signal peptide, but no transmembrane domain, making them good candidates for signaling proteins. Pauli et al. focused their attention on one of them, apela, that they called toddler, encoded by AK092578, so far considered to be a noncoding transcript. A few weeks earlier, Chng et al. had already published the identification of the same protein, which they named elabela.

Apela is a highly conserved protein among vertebrates; this conservation is particularly striking in the 30 amino acid long mature peptide, the last 13 residues being nearly invariant in all vertebrate species studied. Apela is expressed in the zygote, with a peak during gastrulation, and becomes undetectable by 4 days post-fertilization. Its disruption leads to a dramatic phenotype, including small or absent hearts, posterior accumulation of blood cells, malformed pharyngeal endoderm, and abnormal left-right positioning and formation of the liver. Most mutant embryos eventually die between 5 and 7 days of development. Interestingly, this phenotype was reminiscent of that observed for apelin receptor (aplnr) deficiency.

The pathway leading to aplnr activation that could explain the observed mutant phenotype remained unsolved for several years. Indeed, aplnr disruption in zebrafish demonstrated that aplnr was required prior to the onset of gastrulation for proper cardiac morphogenesis, but its known ligand, apln, was not expressed until midgastrulation, too late to play a role in such a very early event. Along the same line, it had been reported that Aplnr mutant animals were not born in the expected Mendelian ratio, and many showed cardiovascular developmental defects, while Apln-deficient mice were viable, fertile, and showed normal development. Taken together, these observations suggested that Aplnr might have yet another ligand, expressed very early in embryonic development. The newly discovered apela protein seemed to fulfill the conditions and, using different strategies, both groups convincingly showed that apela is indeed aplnr’s first ligand.

Human, mouse and zebrafish Apela orthologs have been updated accordingly and these entries are now available.

UniProtKB news

Evidence in the UniProtKB flat file format

The evidence for annotations in UniProtKB entries has been available for several years in the XML and RDF representations of the data, and we have now also added this information to the text format (aka flat file format).

Representation of evidence

This section describes how evidence is represented, independently of the context in which it is found.

An individual evidence description consists of a mandatory evidence type, represented by a code from the Evidence Codes Ontology (ECO), and, where applicable, the source of the data. The source is usually another database record, represented by the database name and record identifier. For publications that are not in PubMed, we indicate the corresponding UniProtKB reference number instead.

Examples:

  • An evidence type without source: {type}, e.g.
    {ECO:0000305}
    {ECO:0000250}
    {ECO:0000255}
    
  • An evidence type with source: {type|source}, e.g.
    {ECO:0000269|PubMed:10433554}
    {ECO:0000303|Ref.6}
    {ECO:0000305|PubMed:16683188} 
    {ECO:0000250|UniProtKB:Q8WUF5}
    {ECO:0000312|EMBL:BAG16761.1}
    {ECO:0000313|EMBL:BAG16761.1}
    {ECO:0000255|HAMAP-Rule:MF_00205}
    {ECO:0000256|HAMAP-Rule:MF_00205}
    {ECO:0000244|PDB:1K83}
    {ECO:0000213|PDB:1K83}
    
  • Several evidence attributions: {type|source, type|source, ...}, e.g.
    {ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}
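
The three forms listed above can be handled by one small routine. The following Python sketch is an illustration only: the Evidence tuple and the parse_evidence function are hypothetical names, and the code assumes that attributions are separated by commas and that a vertical bar separates the type from the source, as in the examples.

```python
from typing import NamedTuple, Optional


class Evidence(NamedTuple):
    eco: str                # evidence type, e.g. "ECO:0000269"
    source: Optional[str]   # e.g. "PubMed:10433554" or "Ref.6"; None if absent


def parse_evidence(tag: str) -> list[Evidence]:
    """Parse '{type}', '{type|source}' or '{type|source, type|source, ...}'."""
    tag = tag.strip()
    if not (tag.startswith("{") and tag.endswith("}")):
        raise ValueError(f"not an evidence tag: {tag!r}")
    result = []
    for item in tag[1:-1].split(","):
        eco, _, source = item.strip().partition("|")
        result.append(Evidence(eco, source or None))
    return result


print(parse_evidence("{ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}"))
```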
    

Change of the representation of different line and annotation types

This section describes in which line and annotation types evidence may be found and where it is placed. We use here the symbolic representation {evidence} as a placeholder for all evidence representations that are described in the previous section.

DE lines

Evidence may be found at the end of subcategory fields, e.g.

DE   RecName: Full=Palmitoyl-protein thioesterase-dolichyl pyrophosphate phosphatase fusion 1 {evidence};
DE   Contains:
DE     RecName: Full=Palmitoyl-protein thioesterase {evidence};
DE              Short=PPT {evidence};
DE              EC=3.1.2.22 {evidence};
DE     AltName: Full=Palmitoyl-protein hydrolase {evidence};
DE   Contains:
DE     RecName: Full=Dolichyldiphosphatase {evidence};
DE              EC=3.6.1.43 {evidence};
DE     AltName: Full=Dolichyl pyrophosphate phosphatase {evidence};
DE   Flags: Precursor;

GN lines

Evidence may be found after each gene designation, e.g.

GN   Name=cysA1 {evidence}; Synonyms=cysA {evidence};
GN   OrderedLocusNames=Rv3117 {evidence}, MT3199 {evidence};
GN   ORFNames=MTCY164.27 {evidence};
GN   and
GN   Name=cysA2 {evidence}; OrderedLocusNames=Rv0815c {evidence}, MT0837
GN   {evidence}; ORFNames=MTV043.07c {evidence};

OG lines

Evidence may be found after an organelle or plasmid, e.g.

OG   Mitochondrion {evidence}.
OG   Plasmid pWR100 {evidence}, Plasmid pINV_F6_M1382 {evidence}, and
OG   Plasmid pCP301 {evidence}.

OX lines

Evidence may be found after the taxonomy identifier, e.g.

OX   NCBI_TaxID=9606 {evidence};

RN lines

Evidence may be found after the reference number, e.g.

RN   [1] {evidence}

RC lines

Evidence may be found after each value, e.g.

RC   STRAIN=C57BL/6J {evidence}, and DBA/2J {evidence}; TISSUE=Brain
RC   {evidence};

KW lines

Evidence may be found after each keyword, e.g.

KW   ATP-binding {evidence}; Cell cycle {evidence}; Cell division {evidence};
KW   DNA replication {evidence};
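
To illustrate how keywords and their evidence could be separated, here is a short Python sketch that works on the example above, with the {evidence} placeholder standing in for real evidence tags. The function name parse_keywords is hypothetical; the sketch assumes that keywords are separated by semicolons, that evidence tags never contain a semicolon, and it drops a final full stop if one is present.

```python
import re


def parse_keywords(lines):
    """Return (keyword, evidence tag or None) pairs from a block of KW lines."""
    # Drop the 'KW' line code and rejoin wrapped lines.
    text = " ".join(line[2:].strip() for line in lines).rstrip().rstrip(".")
    pairs = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        # A keyword, optionally followed by one {...} evidence tag.
        match = re.fullmatch(r"(.*?)\s*(\{[^}]*\})?", item)
        pairs.append((match.group(1), match.group(2)))
    return pairs


kw_lines = [
    "KW   ATP-binding {evidence}; Cell cycle {evidence}; Cell division {evidence};",
    "KW   DNA replication {evidence};",
]
print(parse_keywords(kw_lines))
```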

CC lines

The evidence location depends on the annotation type.

Unstructured annotations:

Evidence may initially be found at the end of the annotations because this is how they have historically been attributed, e.g.

CC   -!- FUNCTION: Possesses kinase activity. May be involved in
CC       trafficking and/or processing of RNA. {evidence}.

At a later time, we intend to start attributing evidence at a more fine-grained level by placing them behind the sentences or paragraphs to which they apply, e.g.

CC   -!- FUNCTION: Possesses kinase activity. {evidence}. May be involved
CC       in trafficking and/or processing of RNA. {evidence}.

Structured annotations:

ALTERNATIVE PRODUCTS:

Evidence may be found behind the values of the Name= and Synonyms= fields. It may also be found in Comment= and Note= fields where it is placed as in unstructured annotations, e.g.

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=13;
CC         Comment=Additional isoforms seem to exist. {evidence};
CC       Name=1 {evidence}; Synonyms=LST1/A {evidence};
CC         IsoId=O00453-1; Sequence=Displayed;
..
CC       Name=12;
CC         IsoId=O00453-12; Sequence=VSP_047367;
CC         Note=No experimental confirmation available. {evidence};

BIOPHYSICOCHEMICAL PROPERTIES:

In the structured subtopics Absorption and Kinetic parameters evidence may be found at the end of the Abs(max)=, KM= and Vmax= fields. It may also be found in Note= fields and the unstructured subtopics pH dependence, Redox potential and Temperature dependence, where it is placed as in unstructured annotations, e.g.

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=465 nm {evidence};
CC         Note=The above maximum is for the oxidized form. Shows a maximal
CC         peak at 330 nm in the reduced form. These absorption peaks are
CC         for the tryptophylquinone cofactor. {evidence};
CC       Kinetic parameters:
CC         KM=5.4 uM for tyramine {evidence};
CC         Vmax=17 umol/min/mg enzyme {evidence};
CC         Note=The enzyme is substrate inhibited at high substrate
CC         concentrations (Ki=1.08 mM for tyramine). {evidence};

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       pH dependence:
CC         Optimum pH is 7-8 for ATPase activity. Is more active at pH 8 to
CC         10 than at pH 5.5. {evidence};
CC       Temperature dependence:
CC         Optimum temperature is 80 degrees Celsius for ATPase activity.
CC         {evidence};

RNA EDITING:

Evidence may be found behind the modified positions as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- RNA EDITING: Modified_positions=207 {evidence}; Note=Partially
CC       edited. Target of Adar. {evidence};

(Please note that we have taken this occasion to make an additional small format change to this annotation type: We have replaced the full-stop at the end of the annotation with a semi-colon to be consistent with other structured annotation types that consist of a list of Field=Value; items.)

MASS SPECTROMETRY:

In MASS SPECTROMETRY annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidence attributions are thus displayed in a separate field instead of adding them at the end of each field. A new Evidence= field has replaced the previously existing Source= field, e.g.

CC   -!- MASS SPECTROMETRY: Mass=2189.4; Method=Electrospray; Range=167-
CC       186; Note=Monophosphorylated.; Evidence={evidence};
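
Because the annotation is a list of Field=Value; items, it can be turned into a dictionary fairly directly. The Python sketch below uses the example above, again with the {evidence} placeholder; parse_fields is a hypothetical name, and lines ending in a hyphen are assumed to have been wrapped inside a token (here the range "167-186") and are rejoined without a space.

```python
import re


def parse_fields(lines):
    """Parse one structured CC annotation made of 'Field=Value;' items."""
    text = ""
    for line in lines:
        chunk = line[2:].strip()  # drop the 'CC' line code
        # Assumption: no space is restored after a line that ends in '-'.
        text += chunk if (not text or text.endswith("-")) else " " + chunk
    # Everything after the first colon is the list of fields.
    body = text.split(":", 1)[1]
    # A value is either a braced evidence block or any text up to the next ';'.
    return {m.group(1): m.group(2)
            for m in re.finditer(r"(\w+)=(\{[^}]*\}|[^;]*);", body)}


ms_lines = [
    "CC   -!- MASS SPECTROMETRY: Mass=2189.4; Method=Electrospray; Range=167-",
    "CC       186; Note=Monophosphorylated.; Evidence={evidence};",
]
print(parse_fields(ms_lines))
```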

SEQUENCE CAUTION:

In SEQUENCE CAUTION annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidence is thus displayed in a separate new Evidence= field instead of being added at the end of each field, e.g.

CC   -!- SEQUENCE CAUTION:
CC       Sequence=AAL25396.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=ABF70206.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=CAA32567.1; Type=Erroneous gene model prediction; Evidence={evidence};
CC       Sequence=CAA32568.1; Type=Erroneous gene model prediction; Evidence={evidence};

SUBCELLULAR LOCATION:

Evidence may be found at the same places where previously the non-experimental qualifiers By similarity, Probable and Potential were displayed (see Syntax modification of the ‘Subcellular location’ subtopic) as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- SUBCELLULAR LOCATION: Golgi apparatus, trans-Golgi network
CC       membrane {evidence}; Multi-pass membrane protein {evidence}.
CC       Note=Predominantly found in the trans-Golgi network (TGN). Not
CC       redistributed to the plasma membrane in response to elevated
CC       copper levels. {evidence}.
CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm {evidence}.
CC   -!- SUBCELLULAR LOCATION: WND/140 kDa: Mitochondrion {evidence}.

DISEASE:

Evidence may be found at the end of the disease description as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- DISEASE: Sarcoidosis 1 (SS1) [MIM:181000]: An idiopathic,
CC       systemic, inflammatory disease characterized by the formation of
CC       immune granulomas in involved organs. Granulomas predominantly
CC       invade the lungs and the lymphatic system, but also skin, liver,
CC       spleen, eyes and other organs may be involved. {evidence}.
CC       Note=Disease susceptibility is associated with variations
CC       affecting the gene represented in this entry. {evidence}.
FT lines

Evidence may be found at the end of the feature description, e.g.

FT   VARIANT     341    341       P -> L (in AH2; strongly reduced
FT                                activity). {evidence}.
FT                                /FTId=VAR_065665.
FT   CONFLICT     52     53       RT -> KI (in Ref. 8; AAD14329).
FT                                {evidence}.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Glycogen storage disease 14

Changes to keywords

New keyword:

Modified keyword:

UniProt release 2014_08

Published September 3, 2014

Headline

Ubiquitin caught at its own game

Ubiquitination is a widely used post-translational modification (PTM) in eukaryotic cells. It is involved in a plethora of cellular activities ranging from removal of misfolded and unwanted proteins to signaling in innate immunity, from transcriptional regulation to membrane trafficking. Ubiquitination is the covalent attachment of the small 76-residue protein ubiquitin onto a target protein, most often via an isopeptide bond between the amino group of a lysine side chain and the ubiquitin C-terminus. This process occurs in several steps: a ubiquitin-activation step catalyzed by E1 enzymes, a ubiquitin-conjugation step catalyzed by E2 enzymes, and a step ensuring the target specificity involving E3 ligases. Many different types of ubiquitination exist: monoubiquitination, multi(mono)ubiquitination and polyubiquitination, each type conveying a different signal. Polyubiquitination occurs via further ubiquitination of a single lysine residue on the substrate protein. Ubiquitin contains 7 lysine residues; each can serve as an acceptor for further elongation and each defines a distinct fate for the modified protein. The classic example is the Lys-48-linked chain, which targets the protein bearing it for degradation via the proteasome.

An additional step of complexity has been unveiled in 3 recent publications: Ubiquitin was discovered to be itself subjected to another PTM, namely phosphorylation, which confers on it the ability to activate the E3 ubiquitin-protein ligase Parkin (PARK2).

Parkin and the PINK1 kinase are involved in the signaling pathway leading to mitophagy, a specialized program which eliminates damaged mitochondria and hence maintains health. Indeed, defects in either of these proteins cause early-onset Parkinson disease.

Under normal conditions, PINK1 is imported into mitochondria, where it is processed and rapidly degraded. When mitochondria lose membrane potential or amass unfolded proteins, PINK1 accumulates on the outer membrane where it recruits cytosolic Parkin and activates its latent E3 activity. As a result, mitochondrial outer membrane proteins are ubiquitinated and the defective organelle is targeted for destruction.

It is in the Parkin activation step that phosphorylated ubiquitin comes into play. PINK1 directly phosphorylates ubiquitin at Ser-65. Of note, Parkin itself contains a ubiquitin-like domain that is also phosphorylated by PINK1 at Ser-65. All three publications agree that phosphorylated ubiquitin is involved in the PINK1/PARK2 pathway. Nevertheless Koyano and colleagues found that both ubiquitin and Parkin Ser-65 phosphorylations are needed for full Parkin activation, whereas Kane et al. observed Parkin activation with phospho-ubiquitin alone. While phospho-ubiquitin can be used by Parkin as a substrate for ubiquitination, its Parkin-binding and -activating abilities seem to be separated from its role as a substrate.

As of this release, human Parkin, PINK1 and ubiquitin entries have been updated accordingly and annotations have been transferred to orthologous entries based on sequence similarity. Proteins known to undergo ubiquitination can be retrieved with the keyword Ubl conjugation and proteins involved in the ubiquitination pathway, such as E1, E2 or E3 enzymes, with the keyword Ubl conjugation pathway.

UniProtKB news

New variant types in homo_sapiens_variation.txt.gz on the UniProt FTP site

UniProt would like to announce the addition of two variant types, stop lost and stop gained, to the set of protein altering variants from the 1000 Genomes Project available in the homo_sapiens_variation.txt.gz file. Stop lost and stop gained variants have been selected as the first structural variants to be added to the UniProt variant catalogue because they are two of the most commonly occurring variant types. UniProt expects to add further structural variant types and somatic variants to the available variant types and to include additional species. This file, along with the humsavar.txt file, can now be found in the new dedicated variants directory in the UniProt FTP site. We very much welcome the feedback of the community on our efforts.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • Isoaspartyl glycine isopeptide (Asn-Gly)
  • Isoaspartyl glycine isopeptide (Asp-Gly)

Deleted terms:

  • Aspartyl isopeptide (Asn)
  • Aspartyl isopeptide (Asp)

Changes to keywords

Modified keyword:

Website news

The UniProt website is changing

We would like to introduce you to the new UniProt website! We have been working on this site behind the scenes for a while and we’re glad it’s finally time to share it with you.

We redesigned the UniProt website following a user-centered design process, involving over 250 users worldwide with varied research backgrounds and use cases. User-centered design is a design approach that is grounded in the requirements and expectations of users. Users are included at every stage of the process, from gathering requirements to testing the end product.

Some highlights of the changes and improvements:

  • A new homepage and advanced search functionality
  • A new results page interface with easy to use filters
  • A basket to store your favorite proteins and build up your own set
  • New protein entry page content classification and navigation bar
  • New tool output interfaces (e.g. BLAST results)
  • New ‘Proteomes’ pages for full protein sets from completely sequenced organisms

Contextual help is available on the site as well as UniProt help videos from the UniProt YouTube channel. We look forward to feedback from the scientific community to help improve the site further.

UniProt release 2014_07

Published July 9, 2014

Headline

Lark or owl? PER3 is the answer

Unless you are like Napoleon who never needed more than 4 hours of sleep at a stretch, being both an early bird and a night owl, you certainly have a diurnal preference. It is not a simple matter of taste, it is a matter of genetics, involving the PER3 gene.

In humans, the PER3 gene exists in 2 versions: a short one and a long one. The length variation depends upon the number of 18 amino-acid tandem repeats in the protein’s C-terminus: 4 in the short version, 5 in the long one. Roughly 10% of the population is homozygous for the long allele (PER3 5/5) and 50% for the short allele (PER3 4/4). This polymorphism correlates significantly with extreme diurnal preference, the longer allele being associated with morningness and the shorter allele with eveningness. In addition, PER3 5/5 individuals are more vulnerable to sleep deprivation than their PER3 4/4 counterparts, exhibiting greater cognitive performance impairment. When allowed to take naps, PER3 5/5 individuals show a greater ability to sleep independently of circadian phase, suggesting that the polymorphism modifies the sleep homeostatic response without influencing circadian parameters.

The molecular mechanism of this behavioral difference is not known and there was no animal model to investigate it until recently. Indeed, the 18 amino-acid polymorphism does not exist in non-primate mammals. Earlier this year, Hasan et al. published a study in which they created 2 knock-in mice. These mice contained a “humanized” PER3 exon 18 with either the 4-repeat or 5-repeat allele. The transgenic mice exhibited a phenotypic response to sleep deprivation and recovery consistent with the observations made in humans. In total, 816 genes were differentially expressed in the cortex of Per3 4/4 and Per3 5/5 mice and a similar number in the hypothalamus. At least some of these genes seem to be involved in the regulation of, or response to, sleep, as well as in neuronal development and function. For instance, some isoforms of the Homer1 gene, a marker of sleep homeostasis, were up-regulated in the Per3 5/5 compared to the Per3 4/4 hypothalamus.

With this tool in hand, we may be in a position to start identifying the genetic control of sleep architecture in humans and maybe unveil if Napoleon’s sleep ability was a true genetic oddity, the result of his iron will or just a historical myth.

As of this release, the human PER3 entry has been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to CCDS

Cross-references have been added to CCDS, the Consensus CDS project.

CCDS is available at http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi.

The format of the explicit links is:

Resource abbreviation CCDS
Resource identifier CCDS identifier

Cross-references to CCDS may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: O70554

Show all entries having a cross-reference to CCDS.

Text format

Examples:

O70554
DR   CCDS; CCDS38509.1; -.
P00750
DR   CCDS; CCDS6126.1; -. [P00750-1]
DR   CCDS; CCDS6127.1; -. [P00750-3]

XML format

Examples:

O70554
<dbReference type="CCDS" id="CCDS38509.1"/>
P00750
<dbReference type="CCDS" id="CCDS6126.1">
  <molecule id="P00750-1"/>
</dbReference>
<dbReference type="CCDS" id="CCDS6127.1">
  <molecule id="P00750-3"/>
</dbReference>

Cross-references to GeneReviews

Cross-references have been added to GeneReviews, a resource of expert-authored, peer-reviewed disease descriptions.

GeneReviews is available at http://www.ncbi.nlm.nih.gov/books/NBK1116/.

The format of the explicit links is:

Resource abbreviation GeneReviews
Resource identifier GeneReviews identifier

Example: O00555

Show all entries having a cross-reference to GeneReviews.

Text format

Example: O00555

DR   GeneReviews; CACNA1A; -.

XML format

Example: O00555

<dbReference type="GeneReviews" id="CACNA1A"/>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • L-isoglutamyl histamine

Modified term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N6-crotonyl-L-lysine -> N6-crotonyllysine

Changes to keywords

New keywords:

Modified keywords:

UniParc news

UniParc cross-references with protein and gene names

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements. We have introduced two new types, "protein_name" and "gene_name", to show the preferred protein and gene name of external database records that provide this information. In this release we have added names for cross-references to UniProtKB and RefSeq. For UniProtKB entries that have several protein or gene names, UniParc shows only the main one, which is the same name that is shown in the UniProtKB FASTA format. We will soon add names for cross-references to ENA, Ensembl, EnsemblGenomes and model organism databases (FlyBase, SGD, TAIR, WormBase).

Examples:

<dbReference type="UniProtKB/Swiss-Prot" id="P05067" version_i="3" active="Y" version="3" created="1991-11-01" last="2014-02-19">
  <property type="NCBI_GI" value="112927"/>
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>
...
<dbReference type="UniProtKB/Swiss-Prot protein isoforms" id="P05067-2" version_i="1" active="Y" created="2003-03-28" last="2014-02-19">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Isoform APP305 of Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
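As an illustration of the kind of code change this may involve, the sketch below reads the two new property types with Python's standard `xml.etree.ElementTree`. The wrapping `<entry>` root and the function name `names_by_xref` are assumptions made for this example, and the fragment carries no XML namespace; real UniParc files may declare one, in which case the tag names would have to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Fragment taken from the example above, wrapped in a hypothetical <entry> root.
SAMPLE = """<entry>
<dbReference type="UniProtKB/Swiss-Prot" id="P05067" version_i="3" active="Y" version="3" created="1991-11-01" last="2014-02-19">
  <property type="NCBI_GI" value="112927"/>
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>
<dbReference type="UniProtKB/Swiss-Prot protein isoforms" id="P05067-2" version_i="1" active="Y" created="2003-03-28" last="2014-02-19">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Isoform APP305 of Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>
</entry>"""


def names_by_xref(root):
    """Map each dbReference id to its protein and gene name (None if absent)."""
    result = {}
    for ref in root.iter("dbReference"):
        # Collect property values by type; unknown types are simply kept aside.
        props = {}
        for prop in ref.findall("property"):
            props.setdefault(prop.get("type"), []).append(prop.get("value"))
        result[ref.get("id")] = {
            "protein_name": props.get("protein_name", [None])[0],
            "gene_name": props.get("gene_name", [None])[0],
        }
    return result


for xref_id, names in names_by_xref(ET.fromstring(SAMPLE)).items():
    print(xref_id, names["protein_name"], names["gene_name"])
```

Code that looks properties up by type, as here, tolerates new property types; code that relies on the position or count of property elements is the kind that may need updating.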

FTP site news

Every folder on our FTP server now contains a file called RELEASE.metalink that specifies the size and MD5 checksum of every file in that folder, e.g.
ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/RELEASE.metalink

Metalink is an extensible metadata file format that describes one or more computer files available for download. It facilitates file verification and recovery from data corruption and lists alternate download sources (mirror URIs).

Various command line download tools, e.g. cURL version 7.30 or higher and aria2, support metalink.

Example: The following command will download all files in the current_release/ folder and verify their MD5 checksums:

curl --metalink ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/RELEASE.metalink

They will be downloaded from one of the alternative locations mentioned in the metalink file. If one FTP server goes down during a download, programs can automatically switch to another mirror location. Some programs can also download segments from several FTP locations at the same time, which can make downloads much faster.
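For files fetched without a metalink-aware client, the MD5 checksum can also be checked by hand. The sketch below uses only Python's standard `hashlib`. The helper names are hypothetical, and the expected checksum is assumed to have been copied from the RELEASE.metalink file by the user; the sketch does not parse the metalink file itself.

```python
import hashlib


def md5_hex(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def iter_file_chunks(path, chunk_size=1024 * 1024):
    """Yield a file's content in fixed-size chunks to keep memory use low."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def verify_file(path, expected_md5):
    """Compare a file's MD5 digest with an expected value (case-insensitive)."""
    return md5_hex(iter_file_chunks(path)) == expected_md5.strip().lower()
```

Reading in chunks matters here because some of the release files are large and should not be loaded into memory at once.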

Please note that UniProt can be downloaded from the consortium member FTP sites at three different geographical locations:

USA: ftp://ftp.uniprot.org/pub/databases/uniprot
UK: ftp://ftp.ebi.ac.uk/pub/databases/uniprot
Switzerland: ftp://ftp.expasy.org/databases/uniprot

This information can be found in our FAQ.

UniProt release 2014_06

Published June 11, 2014

Headline

Everything you always wanted to know about… sperm-egg interaction

To reach the ultimate goal of sexual reproduction, which is egg fertilization, sperm cells have to run an obstacle course. They have to jump, or rather to swim, through a lot of hoops and hurdles before fusing with the oocyte and forming a zygote. The very first step of this race starts after ejaculation and involves sperm capacitation, a complex process characterized by a series of structural and functional changes, leading to sperm hypermotility that allows sperm to swim through oviductal mucus. In the ampulla of the fallopian tube, in the immediate surroundings of the oocyte, the spermatozoon meets a hyaluronic acid-rich matrix secreted by cumulus cells that it penetrates with the help of hyaluronidase PH-20/SPAM1. The next impediment is the egg’s coat, the zona pellucida. The interaction between the spermatozoon and zona pellucida leads to the acrosomal reaction, in which molecules required for penetrating the zona pellucida are secreted and molecules needed for sperm binding to the egg are exposed. Once through the coat, the sperm accesses the perivitelline space and eventually the egg’s plasma membrane, called the oolemma. The sperm binds to the oolemma and the egg and sperm membranes fuse.

Although the overall fertilization process has been known for a long time, a large part of the detailed molecular mechanism is still mysterious. In 2005, Inoue et al. identified Izumo1 as the sperm-specific protein involved in egg attachment. Without Izumo1, fertilization does not occur, at least in mice. It took 9 more years to pinpoint Folr4 as the Izumo1 egg partner. Folr4 is widely conserved across mammals, including marsupials. Contrary to what its name might suggest, Folr4 is not a folate receptor, but it efficiently binds Izumo1 and hence has been renamed Juno, after Jupiter’s wife (and sister). The Juno and Izumo1 interaction is an absolute requirement for fertilization. In the absence of Juno, mice display no particular phenotype in daily life, but are totally sterile, although they mate normally.

After fertilization, the egg becomes refractory to further sperm fusion events to prevent polyspermy. This process involves biochemical changes of the oolemma occurring 30-45 minutes after the initial fusion event, as well as hardening of the zona pellucida in a second phase. Juno may play a role in establishing the membrane block to polyspermy. Indeed, it is rapidly shed from the oolemma and redistributed to vesicles within the perivitelline space where it may create an area of “decoy eggs” to neutralize incoming sperm.

This discovery is not yet “everything you always wanted to know about” fertilization; for instance, it does not unveil the fusion mechanism itself. It is nevertheless a major step forward.

As of this release, human and mouse Juno proteins have been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Extension of the UniProtKB accession number format

We have extended the UniProtKB accession number format to 10 alphanumerical characters by adding a third pattern for new UniProtKB accession numbers. Old UniProtKB accession numbers will not change. The valid patterns for UniProtKB accession numbers are:

accession  1          2      3          4          5          6      7      8          9          10
old        [O,P,Q]    [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
old        [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]
new        [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]  [A-Z]  [A-Z,0-9]  [A-Z,0-9]  [0-9]

The three patterns can be combined into the following regular expression:

[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}
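A minimal sketch of how this regular expression might be used to validate accession numbers in Python follows. The function name `is_valid_accession` is hypothetical. `fullmatch` is used so that the whole string must match; a bare search would also accept longer strings that merely contain a valid accession.

```python
import re

# The combined pattern given above, compiled once for reuse.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(text):
    """Return True if text is a syntactically valid UniProtKB accession number."""
    return ACCESSION_RE.fullmatch(text) is not None


# 6-character accessions (old patterns), a 10-character one (new pattern),
# and two strings that should be rejected.
for candidate in ["P05067", "A0AUZ9", "A0A022YWF9", "P0506", "a0auz9"]:
    print(candidate, is_valid_accession(candidate))
```

The check is purely syntactic: a string that matches the pattern is well-formed, but need not correspond to an existing entry.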

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N6-glutaryllysine

UniProt DAS news

We have retired the SAAS data source from our DAS server.

UniProt release 2014_05

Published May 14, 2014

Headline

A flounder… on the rocks!

Some organisms, such as certain vertebrates, plants, fungi and bacteria, have to resist low, subzero temperatures. Their survival relies upon the production of antifreeze molecules. Some insects, like the beetle Upis ceramboides, tolerate freezing to -60°C in midwinter thanks to the production of a compound, called xylomannan, made of a sugar and a fatty acid and located in cell membranes. However, most organisms use antifreeze proteins (AFPs). All AFPs act by binding to small ice crystals to inhibit growth that would otherwise be fatal, but each type of AFP seems to arrive at this end by a different route.

Pseudopleuronectes americanus, commonly called ‘winter flounder’, is a very common variety of flounder in North America. It lives in cold water and survives thanks to the expression of the AFP Maxi. The 3D structure of the Maxi protein has been recently elucidated, unveiling some very unusual features.

Maxi belongs to the type-I AFP family and consists of a homodimer. Each monomer folds exactly in half so that its N- and C-termini are side by side, hence the dimer looks like a 4-helix rod. It is composed of tandem 11-residue repeats that exhibit the [T/I]-x3-A-x3-A-x2 motif, where x is any residue. The conserved threonine/isoleucine and alanine residues in this motif have been shown to bind ice in monomeric type-I AFPs. In the 3D structure, the internal space generated by the packing of the 4 helices in the 11-residue repeat regions is just wide enough to accommodate a single layer of water. Amazingly, the water layer that occupies the gap consists of over 400 molecules forming an extensive, mainly polypentagonal network. As is the case for most globular proteins, Maxi internal residues are nonpolar, mainly alanines, which obviously is far from optimal for hydrophilic contacts. To overcome this problem, Maxi takes advantage of its backbone carboxyl groups to anchor water molecules and the whole structure is stabilized by water-mediated hydrogen bonding rather than by direct protein association. The positioned water molecules extend outwards between all 4 helices from the core to the surface and they form a network of ordered molecules at the periphery. As a result, this rather hydrophobic protein remains highly solvated and freely soluble in flounder blood under physiological conditions, i.e. at low temperatures. When the temperature rises above 16°C, Maxi irreversibly denatures.

Another surprise came from the observation that the predicted ice-binding residues, expected to face the protein exterior, actually occur on the inward-pointing surfaces of all 4 helices where they cooperate to form and anchor the interior ordered waters. How then does Maxi bind to ice? The current working hypothesis is that the positioned water molecules that extend outwards may form a network available to merge and freeze with the quasi-liquid layer on the surface of ice.

As of this release, the winter flounder antifreeze protein Maxi has been annotated and integrated into UniProtKB/Swiss-Prot. All antifreeze proteins available in UniProtKB/Swiss-Prot can be retrieved with the keyword ‘Antifreeze protein’.

UniProtKB news

Update of ECO mapping for evidence

In 2011, we started to use the Evidence Codes Ontology (ECO) to describe the evidence for UniProtKB annotations. Since then, this ontology has been extended and the GO Consortium has published a mapping of their GO evidence codes to ECO. We have adapted our mapping to ECO accordingly to have equivalent evidence codes for UniProtKB and GO annotations. How this affects different UniProtKB distribution formats is described below.

XML and DAS format

In these two formats, ECO codes are used to describe the evidence for UniProtKB annotations. In the UniProtKB XML format, an evidence is represented by an evidence element with a type attribute whose value is an ECO code. In the DAS (features) representation of UniProtKB, an evidence is represented by a METHOD element with an optional cvId attribute whose value is an ECO code.

The table below shows the mapping of previous to new ECO codes.

Previous ECO code New ECO code
ECO:0000001 ECO:0000305
ECO:0000006 ECO:0000269
ECO:0000034 ECO:0000303
ECO:0000044 ECO:0000250
ECO:0000203 ECO:0000501 and ECO:0000256

The codes ECO:0000312 and ECO:0000313 remain unchanged.

In the future, we will also use ECO:0000255 for UniProtKB annotations.
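Software that stores the previous codes could translate them with a small lookup table. The sketch below encodes the XML/DAS mapping table above in Python. The names `PREVIOUS_TO_NEW_ECO` and `migrate_eco` are hypothetical, and returning a tuple is a design assumption made because ECO:0000203 maps to two new codes.

```python
# Mapping of previous to new ECO codes, as listed in the table above.
# Values are tuples because one previous code maps to two new codes.
PREVIOUS_TO_NEW_ECO = {
    "ECO:0000001": ("ECO:0000305",),
    "ECO:0000006": ("ECO:0000269",),
    "ECO:0000034": ("ECO:0000303",),
    "ECO:0000044": ("ECO:0000250",),
    "ECO:0000203": ("ECO:0000501", "ECO:0000256"),
}


def migrate_eco(code):
    """Return the new ECO code(s) for a previous code.

    Codes without an entry in the table (e.g. ECO:0000312 and ECO:0000313,
    which remain unchanged) are returned as they are.
    """
    return PREVIOUS_TO_NEW_ECO.get(code, (code,))


print(migrate_eco("ECO:0000203"))
print(migrate_eco("ECO:0000312"))
```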

RDF format

In the UniProtKB RDF format, ECO codes are used to describe the evidence for UniProtKB and GO annotations. An evidence is represented by an evidence property whose value is an ECO code. The evidence property is part of an attribution object which is assigned to a UniProtKB or GO annotation via reification.

The table below shows the mapping of previous to new ECO codes.

GO evidence code Previous ECO code New ECO code
EXP ECO:0000006 ECO:0000269
IBA ECO:0000308 ECO:0000318
IBD ECO:0000214 ECO:0000319
IC ECO:0000001 ECO:0000305
IDA ECO:0000002 ECO:0000314
IEA ECO:0000203 ECO:0000501
IEP ECO:0000008 ECO:0000270
IGC ECO:0000177 ECO:0000317
IGI ECO:0000011 ECO:0000316
IKR ECO:0000216 ECO:0000320
IMP ECO:0000015 ECO:0000315
IPI ECO:0000021 ECO:0000353
IRD ECO:0000215 ECO:0000321
ISA ECO:0000200 ECO:0000247
ISM ECO:0000202 ECO:0000255
ISO ECO:0000201 ECO:0000266
ISS ECO:0000044 ECO:0000250
NAS ECO:0000034 ECO:0000303
ND ECO:0000035 ECO:0000307
RCA ECO:0000053 ECO:0000245
TAS ECO:0000033 ECO:0000304

Cross-references for isoform sequences: RefSeq

We have added isoform-specific cross-references to the RefSeq database. The format of these cross-references is as described in release 2014_03.

Cross-references to MaxQB

Cross-references have been added to MaxQB, a database of large proteomics projects.

MaxQB is available at http://maxqb.biochem.mpg.de/mxdb/.

The format of the explicit links is:

Resource abbreviation MaxQB
Resource identifier UniProtKB accession number.

Example: Q6ZSR9

Show all entries having a cross-reference to MaxQB.

Text format

Example: Q6ZSR9

DR   MaxQB; Q6ZSR9; -.

XML format

Example: Q6ZSR9

<dbReference type="MaxQB" id="Q6ZSR9"/>

Removal of the cross-references to ProtClustDB

Cross-references to ProtClustDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Short rib-polydactyly syndrome 2B
  • Short rib-polydactyly syndrome 3

UniParc news

UniParc cross-references with multiple taxonomy identifiers

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements, e.g. the species is represented with a property of the type "NCBI_taxonomy_id" that stores an NCBI taxonomy identifier in its value attribute. In the past, all external database records described a single species.

Example:

<dbReference type="REFSEQ" id="ZP_06545872" version_i="1" active="Y" version="1" created="2010-03-07" last="2013-07-18">
  <property type="NCBI_GI" value="289827083"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
</dbReference>
<dbReference type="REFSEQ" id="ZP_18488583" version_i="1" active="Y" version="1" created="2012-11-25" last="2013-07-18">
  <property type="NCBI_GI" value="425085490"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>

With the introduction of WP-accessions in the NCBI Reference Sequence Project (RefSeq) database, UniParc needs to represent more than one species per dbReference element.

Example:

<dbReference type="REFSEQ" id="WP_001144069" version_i="1" active="Y" version="1" created="2013-07-19" last="2013-11-12">
  <property type="NCBI_GI" value="447066813"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
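The code change this may call for is typically to read every "NCBI_taxonomy_id" property rather than only the first. A minimal Python sketch follows. The function name `taxonomy_ids` is hypothetical, and the fragment is parsed without an XML namespace; real UniParc files may declare one, in which case tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# The WP-accession example above, parsed on its own.
SAMPLE = """<dbReference type="REFSEQ" id="WP_001144069" version_i="1" active="Y" version="1" created="2013-07-19" last="2013-11-12">
  <property type="NCBI_GI" value="447066813"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>"""


def taxonomy_ids(db_reference):
    """Return all NCBI taxonomy identifiers of a dbReference element as integers."""
    return [
        int(prop.get("value"))
        for prop in db_reference.findall("property")
        if prop.get("type") == "NCBI_taxonomy_id"
    ]


print(taxonomy_ids(ET.fromstring(SAMPLE)))
```

Returning a list in all cases, even for records with a single species, keeps the calling code the same for old-style and WP-accession cross-references.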

UniProt release 2014_04

Published April 16, 2014

Headline

An old unwanted guest being shown the door

Poliomyelitis causes disabling paralysis, notably in children and adolescents. It is an old plague. An early case of poliomyelitis is shown on a 3,000-year-old Egyptian stele. The disease is caused by the poliovirus, an RNA virus that colonizes the gastro-intestinal tract without any symptoms. In rare cases, the virus enters the central nervous system, preferentially infecting and destroying motor neurons, leading to muscle weakness and acute flaccid paralysis.

In the late 1940s, John Enders showed that the virus could be grown in cells cultured in vitro. This observation provided the basis for the generation of poliovirus vaccines during the 1950s. Poliomyelitis is now virtually absent in economically developed countries, and the World Health Organization is currently using the vaccine in a far-reaching plan to eradicate the poliovirus worldwide.

Polioviruses are small (30 nm), non-enveloped icosahedral viruses composed of a capsid and an 8 kb single-stranded RNA genome. Upon entry into a host cell, the poliovirus rearranges cytoplasmic membranes to create double-membrane spherical vesicles in which the virus replicates, hidden from the antiviral detectors of the host cell. Once new viral particles are assembled, the host cell undergoes lysis, releasing poliovirus virions.

The poliovirus genome encodes a single polyprotein, which is processed by autocatalytic cleavage into 13 different products that ensure all viral functions from entry and replication to cell exit. The size constraint on the poliovirus genome is enormous, since it has to fit within a 30 nm wide capsid. In this context, the polyprotein coding strategy is ideal as it allows the greatest economy of genome length versus protein end products.

In order to reduce redundancy in the knowledgebase, UniProtKB/Swiss-Prot describes all the protein products encoded by one gene in a given species in a single entry. Viral proteins are no exception to the rule. Hence, the poliovirus polyprotein is represented in a single UniProtKB/Swiss-Prot entry, which contains the description of 13 final and 4 intermediate chains.

As of this release, the Genome polyprotein entry of poliovirus type 1 (strain Mahoney) has been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references for isoform sequences: Ensembl Genomes

We have added isoform-specific cross-references to the Ensembl Genomes sections EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists. The format of these cross-references is as described in release 2014_03.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2014_03

Published March 19, 2014

Headline

Minority report

We are a minority in our own body. Over 90% of our cells are actually not human, but microbial. The majority of these microbes reside in the gut. The gut microbiota is typically dominated by bacteria, more specifically by Bacteroidetes and Firmicutes. The exact composition of gut microbiota varies between individuals and depends upon lifestyle, diet, hygienic preferences, use of antibiotics, etc. Gut microbes have a profound influence on human physiology and nutrition. Among others, they contribute to harvesting energy from food.

All guidelines for a healthy diet emphasize the necessity of eating fruit, vegetables and whole grains. These products are rich in dietary fibers, i.e. non-starch polysaccharides, most of which cannot be digested by the hydrolases encoded by our genome. Our inherent ability to digest carbohydrates is restricted to starch and simple saccharides, not xyloglucans (XyGs), a family of highly branched plant cell wall polysaccharides, which are abundant in plants. In view of the prevalence of XyGs in our diet, the mechanism of degradation of these complex polysaccharides by bacteria was expected to be important to human energy acquisition, but until recently it was still unclear. Very interesting work by Larsbrink et al., published in February, sheds light on XyG metabolism. The authors identified a polysaccharide utilization locus (PUL) in the genome of a common human gut symbiont, Bacteroides ovatus. PUL is transcriptionally upregulated in response to growth on galactoxyloglucan. It is predicted to encode 10 genes, including 8 glycoside hydrolases. All of them were subjected to in-depth molecular characterization through reverse genetics, in vitro protein biochemistry and enzymology. Finally, the 3D structure of the endo-xyloglucanase BoGH5A, which generates short XyG oligosaccharides, was solved. This study unraveled all the details of the enzymatic pathways by which the most common dietary polysaccharides are digested in our gut.

Although XyG utilization loci (XyGULs) have been identified in only a few other gut-resident Bacteroidetes, including B. cellulosyliticus, B. uniformis, B. fluxus, Dysgonomonas mossii and D. gadei, most human beings harbor at least one of these Bacteroides XyGULs in their gut, suggesting their importance in human nutrition.

The importance of the gut microbiome goes far beyond an active role in food digestion. It also acts on intestinal function, promoting gut-associated lymphoid tissue maturation, tissue regeneration, gut motility, and morphogenesis of the vascular system surrounding the gut. It additionally affects many other physiopathological aspects, such as the nervous system and bone homeostasis. Not surprisingly, changes in the microbiota composition or a complete lack of a gut microbiota have been shown to affect metabolism, tissue homeostasis and behavior.

As of this release, manually reviewed B. ovatus XyGUL gene products are available in UniProtKB/Swiss-Prot. Let’s bet that they will be followed by many more proteins encoded by our other genome(s) in the near future.

UniProtKB news

Cross-references for isoform sequences

Some of the resources to which we link contain information that is specific to an isoform sequence. Where this is known, we now indicate the corresponding UniProtKB isoform sequence identifier in our cross-references, as described below. The first resources for which we provide such isoform-specific cross-references are Ensembl and UCSC.

Text format

The UniProtKB isoform sequence identifier is shown in square brackets at the end of the DR line as an optional field:

DR   ResourceAbbreviation; ResourceIdentifier(; AdditionalField)+. [IsoId]

Examples:

DR   Ensembl; ENST00000281772; ENSP00000281772; ENSG00000144445. [A0AUZ9-1]
DR   Ensembl; ENST00000418791; ENSP00000405724; ENSG00000144445. [A0AUZ9-2]
DR   Ensembl; ENST00000452086; ENSP00000401408; ENSG00000144445. [A0AUZ9-3]
DR   Ensembl; ENST00000457374; ENSP00000393432; ENSG00000144445. [A0AUZ9-3]
DR   UCSC; uc002vds.3; human. [A0AUZ9-1]
DR   UCSC; uc002vdt.3; human. [A0AUZ9-2]
DR   UCSC; uc002vdx.1; human. [A0AUZ9-4]
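
The optional trailing field can be read with a small parser. The following is only a minimal sketch in Python, assuming DR lines shaped exactly like the examples above; the names `DR_LINE`, `CrossRef` and `parse_dr_line` are hypothetical and not part of any UniProt tool.

```python
import re
from typing import NamedTuple, Optional, Tuple

# "DR   Resource; Identifier; Field; ... ." optionally followed by " [IsoId]".
# The body is matched lazily so that dots inside identifiers (uc002vds.3)
# are not mistaken for the terminating period.
DR_LINE = re.compile(r"^DR\s+(?P<body>.+?)\.(?:\s+\[(?P<iso>[^\]]+)\])?\s*$")


class CrossRef(NamedTuple):
    resource: str
    identifier: str
    fields: Tuple[str, ...]
    isoform: Optional[str]  # None when no [IsoId] is present


def parse_dr_line(line: str) -> CrossRef:
    """Split one DR line into resource, identifier, extra fields and isoform."""
    match = DR_LINE.match(line)
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    parts = [part.strip() for part in match.group("body").split(";")]
    if len(parts) < 2:
        raise ValueError(f"DR line has no identifier: {line!r}")
    return CrossRef(parts[0], parts[1], tuple(parts[2:]), match.group("iso"))


if __name__ == "__main__":
    print(parse_dr_line("DR   UCSC; uc002vds.3; human. [A0AUZ9-1]"))
```

Because the isoform field is optional, code that consumes DR lines should treat a missing `[IsoId]` as a normal case rather than an error.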

XML format

To show the UniProtKB isoform sequence identifier in dbReference elements, we added an optional molecule element to the dbReferenceType. For consistency, we also changed the type of the molecule element that is found in the commentType. The XSD has been changed as highlighted below:

    <xs:complexType name="commentType">
    ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'subcellular location' annotations.</xs:documentation>
                    </xs:annotation>
                    <!-- <xs:element name="molecule" type="xs:string" minOccurs="0"/> -->
                    <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
                    <xs:element name="subcellularLocation" type="subcellularLocationType" maxOccurs="unbounded"/>
                </xs:sequence>
    ...
    <xs:complexType name="dbReferenceType">
    ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:element name="property" type="propertyType" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="moleculeType">
        <xs:annotation>
            <xs:documentation>Describes a molecule by name or unique identifier.</xs:documentation>
        </xs:annotation>
        <xs:simpleContent>
            <xs:extension base="xs:string">
                <xs:attribute name="id" type="xs:string" use="optional"/>
            </xs:extension>
        </xs:simpleContent>
    </xs:complexType>

Examples:

<dbReference type="Ensembl" id="ENST00000281772">
  <molecule id="A0AUZ9-1"/>
  <property type="protein sequence ID" value="ENSP00000281772"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000418791">
  <molecule id="A0AUZ9-2"/>
  <property type="protein sequence ID" value="ENSP00000405724"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000452086">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000401408"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000457374">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000393432"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="UCSC" id="uc002vds.3">
  <molecule id="A0AUZ9-1"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdt.3">
  <molecule id="A0AUZ9-2"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdx.1">
  <molecule id="A0AUZ9-4"/>
  <property type="organism name" value="human"/>
</dbReference>
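
Reading the new molecule element from such fragments might look like the sketch below. It is an illustration only: it uses Python's standard `xml.etree.ElementTree`, wraps the fragments in a hypothetical `<entry>` root without a namespace (the fragments above carry none, whereas complete UniProt XML files declare one), and `isoform_cross_references` is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Two fragments copied from the examples above plus one reference without a
# molecule element, wrapped in an assumed <entry> root for parsing.
SAMPLE = """<entry>
<dbReference type="Ensembl" id="ENST00000281772">
  <molecule id="A0AUZ9-1"/>
  <property type="protein sequence ID" value="ENSP00000281772"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdx.1">
  <molecule id="A0AUZ9-4"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="TreeFam" id="TF328787"/>
</entry>"""


def isoform_cross_references(xml_text: str) -> List[Tuple[str, str, Optional[str]]]:
    """Return (resource, identifier, isoform id or None) for each dbReference."""
    root = ET.fromstring(xml_text)
    result = []
    for ref in root.iter("dbReference"):
        molecule = ref.find("molecule")  # optional child element
        isoform = molecule.get("id") if molecule is not None else None
        result.append((ref.get("type"), ref.get("id"), isoform))
    return result


if __name__ == "__main__":
    for row in isoform_cross_references(SAMPLE):
        print(row)
```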

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • Isoaspartyl lysine isopeptide (Lys-Asp)

UniProt release 2014_02

Published February 19, 2014

Headline

Epigenetics in the spotlight

In its active form, folate, commonly known as vitamin B9, is a methyl carrier, essential for the biosynthesis of methionine and nucleic acids, most notably thymine, but also purine bases. Methionine synthesis involves first the activation of methionine synthase (MTR) by methionine synthase reductase (MTRR) and then the MTR-catalyzed conversion of homocysteine into methionine concomitant with conversion of 5-methyltetrahydrofolate into tetrahydrofolate. Methionine can be further modified into S-adenosyl methionine which serves as a methyl donor in the biosynthesis of cysteine, carnitine, taurine, lecithin, and phospholipids, among others.

Folate deficiency can result in many health problems, the most notable one being neural tube defects in developing embryos, but the molecular mechanism linking folate metabolism to development remains poorly understood. This is what prompted Padmanabhan et al. to create an animal model to study the impact of abnormal folate metabolism. These authors produced a mouse that contained a gene trap vector inserted in Mtrr gene intron 9. Wild-type Mtrr mRNA was still produced in spite of the insertion, but at lower levels, and folate metabolism was impaired.

When mid-gestation embryos from heterozygous intercrosses were analyzed, it appeared that about half of them displayed developmental defects typical of folate deficiency, ranging from developmental delay to neural tube and heart defects. Surprisingly, wild-type embryos were affected to a similar extent as embryos bearing the mutated gene. Inheritance of the phenotype was not dependent upon the parental genotype, but instead upon that of the maternal grandparents. In other words, Mtrr mutations in either maternal grandparent disrupted the development of their grandchildren, even when the parents and the conceptus were wild-type. These congenital abnormalities persisted in wild-type progeny in generations 4 and 5 of Mtrr mutant maternal ancestors.

What could be the mechanism of this peculiar mode of inheritance? The answer is not yet definite. Because folate plays a key role in one-carbon metabolism, the authors investigated DNA methylation. As expected, global DNA hypomethylation was observed in livers, uteri and placentas. Imprinted loci (differentially methylated regions or DMRs) in wild-type placentas of mid-gestation embryos from heterozygous maternal grandparents were also analyzed. A large proportion of the DMRs assessed in placentas of severely affected embryos had CpG site methylation levels that were statistically different from unrelated wild-type C57BL/6 mice. Surprisingly however, the majority of these sites were hypermethylated and the associated genes down-regulated. There was a positive correlation between epigenetic instability and the severity of the phenotype. Hence, epigenetic instability leading to the misexpression of certain genes may be the cause of developmental phenotypes.

Epigenetic heredity has been reported for Kit and Sox9 genes. In this case, heredity was mediated by RNA, a mechanism rather unlikely for the Mtrr mutations described above. The RNA-mediated heredity observed for Kit and Sox9 required the presence of the tRNA-methyltransferase TRDMT1/DNMT2. Hence, for both phenomena, it seems that the common feature may be methylation, either at the DNA or RNA level.

While awaiting further exciting discoveries in the field of epigenetics, we have already updated MTRR entries with the current knowledge and made them available.

UniProtKB news

Change of the cross-references to PROSITE and HAMAP

The format of the cross-references to the PROSITE and HAMAP databases has been simplified in order to align it with the format of other InterPro member databases.

Text format

Changes for PROSITE:

The optional qualifiers "UNKNOWN", "FALSE_NEG" and "PARTIAL" have been removed. Only matches above the threshold were kept, i.e. cross-references with a "FALSE_NEG" or "PARTIAL" qualifier have been removed.

Examples:

A1RHR2:

Previous format:

DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; UNKNOWN_1.
DR   PROSITE; PS00922; TRANSGLYCOSYLASE; FALSE_NEG.

New format:

DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; 1.

O02781:

Previous format:

DR   PROSITE; PS00237; G_PROTEIN_RECEP_F1_1; PARTIAL.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.

New format:

DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
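
For readers who keep files in the previous format, the PROSITE rules above can be expressed as a short conversion routine. This is a sketch under the assumption that the qualifier is always the last field of the line, as in the examples; the function names are hypothetical.

```python
from typing import List, Optional

DROPPED_QUALIFIERS = {"FALSE_NEG", "PARTIAL"}  # cross-references that were removed
UNKNOWN_PREFIX = "UNKNOWN_"                    # "UNKNOWN_1" becomes "1"


def convert_prosite_line(line: str) -> Optional[str]:
    """Return the new-style line, or None if the cross-reference is dropped."""
    if not line.startswith("DR   PROSITE;"):
        return line  # other lines are passed through unchanged
    fields = line[len("DR   "):].rstrip().rstrip(".").split("; ")
    status = fields[-1]
    if status in DROPPED_QUALIFIERS:
        return None
    if status.startswith(UNKNOWN_PREFIX):
        fields[-1] = status[len(UNKNOWN_PREFIX):]
    return "DR   " + "; ".join(fields) + "."


def convert_lines(lines: List[str]) -> List[str]:
    """Apply the conversion to many lines, omitting dropped cross-references."""
    converted = (convert_prosite_line(line) for line in lines)
    return [line for line in converted if line is not None]


if __name__ == "__main__":
    old = [
        "DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; UNKNOWN_1.",
        "DR   PROSITE; PS00922; TRANSGLYCOSYLASE; FALSE_NEG.",
    ]
    print("\n".join(convert_lines(old)))
```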

Changes for HAMAP:

The optional field that described the nature of signature hits ("atypical", "fused" or "atypical/fused") has been removed. Only matches above the threshold were kept, i.e. "atypical" and "atypical/fused" cross-references have been removed if their match score was below the threshold.

Example:

Q9K3D6:

Previous format:

DR   HAMAP; MF_00006; Arg_succ_lyase; 1; fused.
DR   HAMAP; MF_01105; N-acetyl_glu_synth; 1; atypical/fused.

New format:

DR   HAMAP; MF_00006; Arg_succ_lyase; 1.

XML format

Changes for PROSITE:

The optional values "UNKNOWN", "FALSE_NEG" and "PARTIAL" that were stored in a property of type match status have been removed, so that the match status value has become an integer. Only matches above the threshold were kept, i.e. "FALSE_NEG" and "PARTIAL" cross-references have been removed.

Examples:

A1RHR2:

Previous format:

<dbReference type="PROSITE" id="PS51257">
  <property type="entry name" value="PROKAR_LIPOPROTEIN"/>
  <property type="match status" value="UNKNOWN_1"/>
</dbReference>
<dbReference type="PROSITE" id="PS00922">
  <property type="entry name" value="TRANSGLYCOSYLASE"/>
  <property type="match status" value="FALSE_NEG"/>
</dbReference>

New format:

<dbReference type="PROSITE" id="PS51257">
  <property type="entry name" value="PROKAR_LIPOPROTEIN"/>
  <property type="match status" value="1"/>
</dbReference>

O02781:

Previous format:

<dbReference type="PROSITE" id="PS00237">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_1"/>
  <property type="match status" value="PARTIAL"/>
</dbReference>
<dbReference type="PROSITE" id="PS50262">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_2"/>
  <property type="match status" value="1"/>
</dbReference>

New format:

<dbReference type="PROSITE" id="PS50262">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_2"/>
  <property type="match status" value="1"/>
</dbReference>

Changes for HAMAP:

The optional property of type flag that described the nature of signature hits ("atypical", "fused" or "atypical/fused") has been removed. Only matches above the threshold were kept, i.e. "atypical" and "atypical/fused" cross-references have been removed if their match score was below the threshold.

Example:

Q9K3D6:

Previous format:

<dbReference type="HAMAP" id="MF_00006">
  <property type="entry name" value="Arg_succ_lyase"/>
  <property type="flag" value="fused"/>
  <property type="match status" value="1"/>
</dbReference>
<dbReference type="HAMAP" id="MF_01105">
  <property type="entry name" value="N-acetyl_glu_synth"/>
  <property type="flag" value="atypical/fused"/>
  <property type="match status" value="1"/>
</dbReference>

New format:

<dbReference type="HAMAP" id="MF_00006">
  <property type="entry name" value="Arg_succ_lyase"/>
  <property type="match status" value="1"/>
</dbReference>

These changes did not affect the XSD, but may nevertheless require code changes.
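
As an illustration of the kind of code change that may be needed, the sketch below reads the match status as an integer and tolerates values from older files. It uses Python's standard `xml.etree.ElementTree` on namespace-free fragments like those above; `match_count` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def match_count(db_reference: ET.Element) -> Optional[int]:
    """Return the 'match status' property as an int, or None if it is
    absent or holds a legacy value such as UNKNOWN_1 or FALSE_NEG."""
    for prop in db_reference.findall("property"):
        if prop.get("type") == "match status":
            value = prop.get("value", "")
            return int(value) if value.isdigit() else None
    return None


NEW_STYLE = ET.fromstring(
    '<dbReference type="PROSITE" id="PS51257">'
    '<property type="entry name" value="PROKAR_LIPOPROTEIN"/>'
    '<property type="match status" value="1"/>'
    "</dbReference>"
)
OLD_STYLE = ET.fromstring(
    '<dbReference type="PROSITE" id="PS00922">'
    '<property type="entry name" value="TRANSGLYCOSYLASE"/>'
    '<property type="match status" value="FALSE_NEG"/>'
    "</dbReference>"
)

if __name__ == "__main__":
    print(match_count(NEW_STYLE), match_count(OLD_STYLE))
```

Code that previously compared the match status with strings, or expected a property of type flag on HAMAP cross-references, is the most likely place to need such an adjustment.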

Cross-references to TreeFam

Cross-references have been added to TreeFam, a database composed of phylogenetic trees inferred from animal genomes.

TreeFam is available at http://www.treefam.org.

The format of the explicit links is:

Resource abbreviation TreeFam
Resource identifier TreeFam unique identifier.

Example: Q8CFE6

Show all entries having a cross-reference to TreeFam.

Text format

Example: Q8CFE6

DR   TreeFam; TF328787; -.

XML format

Example: Q8CFE6

<dbReference type="TreeFam" id="TF328787"/>

Cross-references to BioGrid

Cross-references have been added to BioGrid, a public database that archives and disseminates genetic and protein interaction data from model organisms and humans.

BioGrid is available at http://thebiogrid.org.

The format of the explicit links is:

Resource abbreviation BioGrid
Resource identifier BioGrid unique identifier.
Optional information 1 Number of interactions.

Example: O46201

Show all entries having a cross-reference to BioGrid.

Text format

Example: O46201

DR   BioGrid; 69392; 1.

XML format

Example: O46201

<dbReference type="BioGrid" id="69392">
  <property type="interactions" value="1"/>
</dbReference>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N-methylglycine
  • N,N-dimethylglycine
  • N,N,N-trimethylglycine

Deleted term:

  • 3-hydroxyhistidine

UniRef news

Revision of the UniParc records used in the UniRef databases

We have stopped importing UniParc records that correspond to Ensembl proteome sequences in the UniRef databases, as the relevant sequences are now part of UniProtKB. Previously, some sequences from Ensembl proteomes (e.g. from Human, Chicken, Cow) were missing from UniProtKB, but we have recently completed their import into UniProtKB (see FAQ) and thus no longer need to import them via UniParc. The UniRef databases will continue to include UniParc records from the RefSeq and PDB databases that are not in UniProtKB to ensure complete coverage of sequence space.

UniProt release 2014_01

Published January 22, 2014

Headline

Mouse attacks!

In the arid lands of Arizona lives a fierce predator whose howls pierce the desert night, terrifying its prey. This predator is… a mouse, Onychomys torridus, also called the grasshopper mouse. It may sound like a tale straight from the imagination of Tim Burton or Monty Python, but this mouse really exists. It is carnivorous and it regularly howls just before a kill, although the emitted sound is more a sustained whistle than the actual howl of a wolf. Its prey is no less astonishing, including crickets, other rodents, tarantulas and bark scorpions (Centruroides sculpturatus).

Bark scorpions are not easy prey. They are venomous and inflict intensely painful, sometimes lethal stings. Surprisingly, grasshopper mice do not seem to be seriously bothered by that, and it takes little time before the scorpion is captured, killed and eaten. How can O. torridus ignore the venom, while common house mice are sensitive to it? Overall, grasshopper mice do feel pain normally, but when they are injected with scorpion venom or with a physiological saline solution in their hind paws, they are much more irritated by the control saline solution than by the venom. In grasshopper mice, bark scorpion venom acts as an analgesic.

Venom from Buthidae scorpions initiates acute pain in sensitive mammals, such as house mice, rats and humans, by activating the voltage-gated sodium channel Nav1.7/SCN9A, but has no effect on the Nav1.8/SCN10A sodium channel. Recent experiments by Rowe et al. on freshly isolated O. torridus sensory neurons showed that, in this species, the venom strongly inhibits Nav1.8/SCN10A Na+ currents. These Na+ currents are necessary for action potential sustained firing and propagation. By inhibiting Nav1.8/SCN10A, the scorpion venom blocks pain transmission to the central nervous system, and hence induces analgesia. The diametrically opposed response of rodents towards scorpion venom seems to be due to only 2 residues within the Nav1.8/SCN10A sequence. In O. torridus, a glutamate residue is found at position 859 (E-859) and a glutamine residue at position 862 (Q-862), while in species known to be sensitive to the venom, these positions are reversed: Q-859 and E-862. Site-directed mutagenesis of these 2 residues in the O. torridus sequence (Q859E/E862Q) abolished venom sensitivity. Conversely, mutation of the glutamine position in Mus musculus (Q861E) conferred inhibition by C. sculpturatus venom.

Pain sensitivity is essential for survival, since it helps avoid damaging situations. Hence any change in pain perception has to be finely tuned in order not to be deleterious. O. torridus has evolved a brilliant strategy allowing it to exploit an abundant food resource in its environment, i.e. bark scorpions, while keeping intact its ability to feel the necessary pain.

Persistent pain can turn into a nightmare, and improving our understanding of pain signaling may be a tremendous help in the discovery of new analgesic drugs. Nav1.7/SCN9A is already under close investigation as a potential target for pain prevention. The new and very exciting study by Rowe et al. shows that the Nav1.8/SCN10A channel also plays a crucial role in the transmission of pain signals and may be an interesting target for analgesic development.

As of this release, the fully annotated O. torridus Nav1.8/SCN10A protein is available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of the cross-references to IPI

Cross-references to IPI have been removed.

IPI closed in 2011. The last release is archived at ftp://ftp.ebi.ac.uk/pub/databases/IPI.

The Ensembl and Ensembl Genomes projects offer access to genomic data from vertebrate and non-vertebrate species respectively.

Complete proteome data is available from UniProtKB.

The last mapping table between UniProtKB and IPI is archived at ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_01/.

Documents and RSS feeds for UniProt Forthcoming changes and News

We have replaced the documents sp_soon.htm (“UniProt Knowledgebase – Forthcoming changes”) and xml_soon.htm (“UniProt Knowledgebase – Forthcoming changes in XML”) by a searchable section Forthcoming changes on our website to announce planned changes for all UniProt data sets and file formats in one place and to provide a common RSS feed. The same information can also be downloaded from our FTP site.

Changes that have been implemented are described in our “News archive”, which can be searched in the News section of our website, followed via an RSS feed and downloaded from the FTP site. This archive includes the historical contents of sp_news.htm (“What’s new?”), but not those of xml_news.htm (“What’s new in XML?”). The latter file was renamed to xml_news_prior_2014_01.html to archive the XML changes that were implemented before 2014. This file will no longer be updated.

We have generated symbolic links on the FTP site for the files that have been replaced to give everyone time to update their FTP download procedures to the new files’ locations:

New version of DASty

Our DAS web client DASty has been redesigned. DASty provides a visual representation of the compilation of protein annotations from different third-party sources. This allows users to get a global overview of all protein annotation available for their protein of interest, from UniProt as well as other sources. The “Third-party data” link that is available on each UniProtKB entry now leads to this new version of DASty. Any bookmarks should be updated accordingly. For instance, the “Third-party data” link for UniProt accession P05067 now links to http://www.ebi.ac.uk/dasty/client/index.html?q=P05067

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

Modified terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain) -> 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain with W-...)
  • 5’-tyrosyl-5’-aminotyrosine (Tyr-Tyr) (interchain) -> 5’-tyrosyl-5’-aminotyrosine (Tyr-Tyr) (interchain with Y-...)
  • Glycyl threonine ester (Gly-Thr) (interchain with G-...) -> Glycyl threonine ester (Gly-Thr) (interchain with T-...)

Changes to keywords

New keywords:

Modified keywords:

Deleted keyword:

  • Phage maturation

UniProt release 2013_12

Published December 11, 2013

Headline

The aflatoxin biosynthetic pathway annotated in UniProtKB/Swiss-Prot

Aflatoxins are very important members of the family of mycotoxins that contaminate food and feed crops. More than 14 different aflatoxins have been identified so far. These secondary metabolites are mainly produced by the filamentous fungi Aspergillus flavus and Aspergillus parasiticus. These organisms grow in warm and humid locations, such as those where crops (e.g. rice, maize and groundnuts) are stored.

Intake of aflatoxins has both acute and long-term effects. Acute aflatoxin poisoning leads to effects such as hemorrhagic necrosis of the liver, bile duct proliferation, edema and lethargy. In addition, aflatoxins have immunosuppressive effects and interfere with nutrient uptake leading to malnutrition (kwashiorkor). The most toxic of the aflatoxins, aflatoxin B1, is the most potent naturally occurring carcinogen known. The carcinogenic effect of aflatoxins is mediated by 2 cytochrome P450 enzymes, CYP1A2 and CYP3A4. CYP1A2 and CYP3A4 turn the aflatoxins into much more reactive epoxides that react with DNA bases and induce mutations, leading, in the long term, to liver cancer. Overall it is estimated that aflatoxins negatively impact up to 5 billion people who live in warm and humid climates. The presence of dietary aflatoxin is strongly associated with incidences of liver and lung cancers, HIV/AIDS, malaria, growth stunting and childhood malnutrition, and increased risk of adverse birth outcomes in Asia, Africa, and Central America.

To increase the ability to eliminate or reduce aflatoxin contamination, the mycotoxin biosynthetic pathway has been comprehensively studied. The pathway is composed of over 25 enzymatic steps, each step catalyzed by a different enzyme. 13 of these enzymes have been biochemically characterized in sufficient depth to allow the recent attribution of enzyme classification (EC) numbers.

EC numbers are part of a classification system managed by the International Union of Biochemistry and Molecular Biology (IUBMB). Each is composed of 4 numbers separated by periods, which together identify the enzyme and describe the chemical reaction it catalyzes. In UniProtKB, enzymes are annotated with EC numbers (in ‘Names and origin’, ‘Protein names’, ‘Recommended name’, see for instance the pksL1 entry), when these are available.

As of this release, the enzymes involved in aflatoxin biosynthesis have been manually annotated and are publicly available in UniProtKB/Swiss-Prot. The newly characterized enzymes from this pathway belong to oxidoreductase, transferase, hydrolase, and lyase classes of the EC classification system.
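
To make the structure of an EC number concrete, here is a minimal Python sketch that splits one into its 4 levels and looks up the top-level class. The helper names are hypothetical, the class table lists the six top-level classes in use at the time of this release, and the example numbers are generic illustrations rather than the EC numbers assigned to the aflatoxin pathway enzymes. Preliminary numbers containing letters are not handled.

```python
from typing import Optional, Tuple

# Top-level EC classes, keyed by the first number of an EC number.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

ECLevels = Tuple[Optional[int], Optional[int], Optional[int], Optional[int]]


def parse_ec_number(ec: str) -> ECLevels:
    """Parse 'EC 1.1.1.1' or '4.2.1.-' into 4 levels; '-' becomes None."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    levels = text.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected 4 levels in EC number: {ec!r}")
    parsed = tuple(None if level == "-" else int(level) for level in levels)
    return parsed  # type: ignore[return-value]


def ec_class(ec: str) -> Optional[str]:
    """Return the name of the top-level class, or None if it is unknown."""
    return EC_CLASSES.get(parse_ec_number(ec)[0])


if __name__ == "__main__":
    print(parse_ec_number("EC 1.1.1.1"), ec_class("EC 1.1.1.1"))
```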

UniProtKB news

New human 1000 Genomes Project variants file

UniProt would like to announce the release of a new extension to the humsavar.txt variant catalogue. This new variant file, homo_sapiens_variation.txt.gz, supplements the set of manually curated human variants in humsavar.txt with a catalogue of novel Single Nucleotide Variants (SNVs or SNPs) from the 1000 Genomes Project for both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences. These variants have been automatically mapped to UniProtKB sequences, including isoform sequences, through Ensembl. In addition to defining the position and amino acid change due to each variant, the new file maps each affected UniProtKB record to the corresponding Ensembl gene, transcript and protein identifiers, provides the chromosomal location with allele change and, where possible, provides a cross-reference to OMIM for the variant. This file, along with the humsavar.txt file, can now be found in the new dedicated ‘variants’ directory on the UniProt FTP site. We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to add further data sources for human variants that will include somatic variants, new data fields providing additional details concerning each variant, and variants from additional species.

Cross-references to GuidetoPHARMACOLOGY

Cross-references have been added to GuidetoPHARMACOLOGY, which provides an expert-driven guide to pharmacological targets and the substances that act on them.

GuidetoPHARMACOLOGY is available at http://www.guidetopharmacology.org/

The format of the explicit links in the flat file is:

Resource abbreviation GuidetoPHARMACOLOGY
Resource identifier GuidetoPHARMACOLOGY identifier
Example Q08460:
DR   GuidetoPHARMACOLOGY; 380; -.

Show all the entries having a cross-reference to GuidetoPHARMACOLOGY.

New cross-reference category: Chemistry

A new database category has been added: Chemistry.

Change of the category of the cross-references BindingDB, ChEMBL and DrugBank

The BindingDB, ChEMBL and DrugBank databases have been moved from the category “Other” to the category “Chemistry”.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to keywords

New keywords:

Modified keyword:

Deleted keyword:

  • Inhibition of host TBK1-IKBKE-DDX3 complex by virus

UniProt release 2013_11

Published November 13, 2013

Headline

Forever young and cancer-free… in a black hole

In east African grasslands and savannas lives a most bizarre rodent: the naked mole-rat (Heterocephalus glaber). Naked mole-rats are small burrowing rodents, about the size of a mouse. They inhabit underground tunnels, where they form colonies ranging in size from 20 to 300 individuals. Naked mole-rats exhibit eusociality, a lifestyle reminiscent of that of ants or some bees. The colony is ruled by a queen; it has 1 to 3 males who breed only with the queen, while the other female members of the colony are sterile workers or soldiers. But this is not the only singularity of this amazing mammal. Among many other unexpected features, naked mole-rats exhibit exceptional longevity, some reaching ages of 30 years, about 10 times longer than ordinary mice (in a protected environment). They show negligible senescence, no age-related increase in mortality, and high fecundity until death. In addition, they are highly resistant to cancer.

In 2009, it was reported that naked mole-rats may resist cancer thanks to an extremely efficient mechanism of cell contact inhibition, called early contact inhibition (ECI). Contact inhibition is a process that arrests cell growth when cells come in contact with each other or the extracellular matrix. It is a powerful anticancer mechanism. The process of ECI causes naked mole-rat cells to arrest at a much lower density than mouse cells, and the loss of ECI makes naked mole-rat cells more susceptible to malignant transformation.

When culturing naked mole-rat fibroblasts, Tian et al. observed that the culture media became very viscous after a few days, much more so than the media conditioned by human, guinea-pig or mouse cells. This increase in viscosity was due to the increased production of an anionic, nonsulfated glycosaminoglycan: high-molecular-mass hyaluronan (HMM-HA). HMM-HA overproduction was not restricted to tissue culture conditions. It was also observed in vivo, including in brain, heart, kidney and skin. Increased HMM-HA production was due to robust synthesis, via the up-regulation of hyaluronan synthase 2 (Has2), the enzyme catalyzing HMM-HA production, combined with slower degradation, due to the down-regulation of HA-degrading enzymes.

Secreted HMM-HA binds to fibroblasts through the Cd44 cell surface receptor and triggers intracellular signaling, leading to the expression of the cyclin-dependent kinase inhibitor Cdkn2a/p16-INK4a and to the induction of ECI. In naked mole-rat cells, this signaling is further optimized, since these cells exhibit a 2-fold higher affinity for HA as compared to mouse or human cells.

HA is widely distributed and one of the main components of the extracellular matrix. The authors hypothesized that the increased HMM-HA production in the naked mole-rat could have evolved as an adaptation to a subterranean lifestyle to provide flexible skin needed to squeeze through underground tunnels. This adaptation to harsh living conditions would turn out to have additional benefits, such as contributing to cancer resistance.

As of this release, naked mole-rat Has2 has been manually annotated and is publicly available in UniProtKB/Swiss-Prot entry G5AY81.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Epileptic encephalopathy, Lennox-Gastaut type
  • Knobloch syndrome 2

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N-acetylated lysine
Modified terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-glutamyl N2-arginine -> 5-glutamyl N2-ornithine
  • 5-glutamyl N2-glutamate -> 5-glutamyl glutamate

Changes to keywords

New keyword:

UniProt release 2013_10

Published October 16, 2013

Headline

When the cat’s away…

For all creatures, early detection of predators is a matter of survival. Olfaction often plays a crucial role in this regard. Odorant molecules activate specific receptors on sensory neurons. The axons from neurons expressing the same olfactory receptor come together at the same glomeruli, near the surface of the olfactory bulb of the brain. It is generally thought that odorants can be recognized by different receptors and that each glomerulus makes only a small contribution to the global representation of a given odor. However, recent discoveries suggest that the olfactory system may not be as redundant as previously thought.

Mice exhibit innate aversion to volatile amines, such as beta-phenylethylamine (PEA) and isopentylamine (IPA) that are excreted in cat urine. Trace amines robustly activate trace amine-associated receptors (TAARs). There are 15 TAAR genes in mouse. Targeted concomitant deletion of 14 of them (TAAR2 through 9) shows no apparent phenotype. Homozygous mutant mice are healthy and breed normally. The only difference with wild-type and heterozygous littermates is that their aversion to PEA and to cat urine is abolished. This effect is specific, since their response to compounds produced by red fox remains unchanged. Among TAAR genes, TAAR4 is of particular interest, since it is exquisitely sensitive to PEA, with apparent affinities rivaling those seen with mammalian pheromone receptors. Amazingly, knockout of this single gene produces a loss of aversion to PEA and to puma or lynx urine, although homozygous mutant animals still avoid other odorants, such as IPA, exactly as their wild-type and heterozygous littermates do. To our knowledge, this is the first report of an individual main olfactory receptor contributing substantially to odor perception.

This type of exciting discovery reported in the literature triggers yet another innate reaction, that of Swiss-Prot curators to update UniProtKB. The revised mouse TAAR4 entry is now publicly available.

UniProtKB news

Cross-references to PRO

Cross-references have been added to PRO (Protein Ontology), which provides an ontological representation of protein-related entities by explicitly defining them and showing the relationships between them.

PRO is available at http://pir.georgetown.edu/pro/pro.shtml

The format of the explicit links in the flat file is:

Resource abbreviation PRO
Resource identifier PRO identifier
Example O42634:
DR   PRO; PR:O42634; -.

Show all the entries having a cross-reference to PRO.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Microphthalmia, isolated, with cataract, 4

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • Methionine (R)-sulfoxide
  • Methionine (S)-sulfoxide

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2013_09

Published September 18, 2013

Headline

With a little help from my… Lassa virus

Dystroglycan provides a physical link between components of the extracellular matrix, including laminin, and the intracellular actin cytoskeleton. This link is crucial for a number of cellular processes, including laminin and basement membrane assembly, sarcolemmal stability, cell survival, peripheral nerve myelination, cell migration and epithelial polarization.

The dystroglycan protein is extensively glycosylated at multiple sites, and an unusual O-linked glycan is required for proper interaction with extracellular matrix ligands including laminin. Glycosyltransferases responsible for this modification were first identified using classical biochemical techniques, and mutations in the associated genes were identified in patients presenting with one of a number of dystroglycanopathies. These are a heterogeneous group of disorders characterized by muscular dystrophy that can be associated with brain anomalies, mental retardation, eye malformations, and other clinical symptoms. However, until recently, some 50% of newly diagnosed cases of dystroglycanopathy showed no significant association with variants in known glycosyltransferase genes.

To address this issue, Jae et al., 2013 developed a powerful approach to dystroglycanopathy candidate gene identification that exploits another, less beneficial property of dystroglycan. The hemorrhagic Lassa virus binds to glycosylated dystroglycan during infection, the efficiency of which depends on the glycosylation level. By using gene-trap insertion mutagenesis, the authors were able to identify genes whose inactivation conferred resistance to Lassa virus infection, which by extension may include regulators of the level of dystroglycan glycosylation. These genes included all those previously known to be associated with a dystroglycanopathy, as well as several novel candidates. Exon sequencing of a panel of patients with severe dystroglycanopathy identified variants in two of them, POMK/SGK196 and TMEM5, while confirming the absence of variants in known dystroglycanopathy genes. The other candidates await further characterization.

We may be about to witness the elucidation of the underlying genetic causes of a range of dystroglycanopathies, disorders associated with defective dystroglycan modification, through the use of a deadly virus that normally targets the affected protein.

As of this release, all proteins involved in dystroglycanopathies can be retrieved from UniProtKB/Swiss-Prot with the keyword Dystroglycanopathy.

UniProtKB news

Removal of the cross-reference to Pathway_Interaction_DB

Cross-references to Pathway_Interaction_DB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Cataract, pulverulent, juvenile-onset, MAF-related
  • 2-aminoadipic 2-oxoadipic aciduria

Changes to keywords

New keywords:

Modified keywords:

UniProt release 2013_08

Published July 24, 2013

Headline

Girls just want to have … IFNE

Interferons (IFNs) are proteins made and released in response to the presence of pathogens, such as viruses or bacteria, that trigger the protective defenses of the immune system. In other words, they “interfere” with infections, hence their name. Within the large IFN family, type I IFNs are clustered on a defined locus on chromosome 9p21 in humans and in a region of conserved synteny on chromosome 4 in mice. Their expression is induced by the activation of signaling pathways downstream of pattern-recognition receptors and they all bind to the IFN-alpha cell surface receptor complex consisting of IFNAR1 and IFNAR2 chains, leading to the expression of a whole set of genes.

There is, however, an alien on the type I IFN locus: IFN-epsilon (IFNE). IFNE shares less than 40% amino acid identity with bona fide type I IFNs, such as IFN-alpha or IFN-beta, but it does still bind to IFNAR, as expected for a type I IFN. However, unlike any of the other family members, it is not induced by the activation of any known pattern-recognition receptor pathway, including Toll-like receptor pathways. In addition, while other type I IFNs are mainly produced by haemopoietic cells, IFNE is constitutively expressed by epithelial cells of the female reproductive tract in humans and mice. At first glance, these observations seem to challenge a potential protective function for IFNE.

In a recent publication, Fung et al. reported that IFNE expression varied approximately 30-fold at different stages of the estrous cycle in the mouse uterus, with the highest levels at estrus (when estrogen levels are high), and that it was reduced during pregnancy (when progesterone levels are high). Similarly, in the human endometrium, IFNE levels were highest in the proliferative phase of the menstrual cycle and lowest in postmenopausal women (when estrogen levels are low). The suspected hormonal regulation could then be confirmed in mice and in humans: IFNE is induced by estrogens and reduced by progesterone. What about IFNE function? Fung et al. demonstrated that IFNE regulates IFN-regulated genes, including IRF7 and ISG15, as well as 2’-5’-oligoadenylate synthetase. What is more, Ifne-/- female mice, whose vaginas were infected with Chlamydia muridarum or herpes simplex virus 2, had more severe clinical disease than wild-type mice, as well as higher levels of virus or bacteria at defined time points after infection. Hence IFNE seems to play an important – though local – protective role against sexually transmitted infections.

These very interesting observations may have pinpointed the cause of susceptibility to infections of the reproductive tract in women on progesterone-containing contraception, i.e. a progesterone-induced decrease in IFNE expression.

In UniProtKB/Swiss-Prot, IFNE entries have been updated accordingly.

UniProtKB news

Cross-references to GeneWiki

Cross-references have been added to GeneWiki, an initiative that aims to create seed articles for every notable human gene.

GeneWiki is available at http://en.wikipedia.org/wiki/Gene_Wiki

The format of the explicit links in the flat file is:

Resource abbreviation   GeneWiki
Resource identifier     GeneWiki identifier
Example Q96N67:
DR   GeneWiki; Dock7; -.

Show all the entries having a cross-reference to GeneWiki.

Change of the cross-reference GlycoSuiteDB to UniCarbKB

GlycoSuiteDB, an annotated and curated relational database of glycan structures, has been integrated into UniCarbKB, with a new user interface and added functionalities.

We therefore changed the corresponding resource abbreviation from GlycoSuiteDB to UniCarbKB.

Example: P02763:

Previous flat file format:
DR   GlycoSuiteDB; P02763; -.
New flat file format:
DR   UniCarbKB; P02763; -.
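
For readers who maintain local copies of older flat files, the rename shown above can be applied mechanically. The following is a minimal sketch under stated assumptions: the helper name `rename_dr_resource` is hypothetical, the input is a list of flat-file lines already read into memory, and only the resource abbreviation at the start of a DR line is rewritten while identifiers are left untouched.

```python
def rename_dr_resource(lines, old="GlycoSuiteDB", new="UniCarbKB"):
    """Return a copy of `lines` in which DR lines for `old` point to `new`."""
    old_prefix = "DR   " + old + ";"
    new_prefix = "DR   " + new + ";"
    renamed = []
    for line in lines:
        if line.startswith(old_prefix):
            # Swap only the resource abbreviation; keep identifier and options.
            line = new_prefix + line[len(old_prefix):]
        renamed.append(line)
    return renamed


if __name__ == "__main__":
    for line in rename_dr_resource(["DR   GlycoSuiteDB; P02763; -."]):
        print(line)
```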

UniProtKB/Swiss-Prot is currently linked to this resource from the cross-reference section (DR lines), but we also have some site-specific links from the sequence annotation section (FT CARBOHYD) of relevant UniProtKB/Swiss-Prot entries. An increase in the number of cross-linked entries is planned, including more literature-based glycan data from UniCarbKB.

Removal of the cross-reference to GermOnline

Cross-references to GermOnline have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:
  • Cataract, congenital, cerulean type, 3
  • Cataract, congenital, non-nuclear polymorphic, autosomal dominant
  • Cataract, cortical, age-related, 2
  • Cataract-microcornea syndrome
  • Cataract, sutural, with punctate and cerulean opacities
  • Cataract, zonular
  • Hereditary non-polyposis colorectal cancer 3
  • Leukotriene C4 synthase deficiency
  • Neuropathy, congenital amyelinating
  • Pallido-ponto-nigral degeneration
  • Platyspondylic lethal skeletal dysplasia San Diego type
  • Thromboxane synthetase deficiency
  • Weaver syndrome 2

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-glutamyl N2-arginine
  • 5-glutamyl N2-glutamate

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2013_07

Published June 26, 2013

Headline

How to go green, or red?

Chlorophyll is the major photosynthetic pigment. It performs the essential processes of harvesting light energy in the antenna complexes and transferring this energy to the reaction centers to produce chemical energy.

The chlorophyll molecule is present in all photosynthetic organisms. It is made up of 2 moieties of distinct origin, chlorophyllide and phytol. The early enzymatic steps of chlorophyllide biosynthesis from glutamyl-tRNA to protoporphyrin IX are shared with the heme biosynthesis pathway. Hence, protoporphyrin IX is the last common reactant for the synthesis of both heme and chlorophyll. To produce chlorophyll, a magnesium chelatase (EC=6.6.1.1) inserts Mg(2+) into the protoporphyrin IX ring, while an iron chelatase (EC=4.99.1.1) inserts Fe(2+) into the ring during heme biosynthesis.

In Arabidopsis thaliana, there are 15 enzymes and 27 genes required for chlorophyll biosynthesis from glutamyl-tRNA to chlorophyll b. Nine proteins are encoded by single-copy genes, and the others are encoded by gene families consisting of two to three members. The magnesium chelatase is a complex of three subunits, CHLI, CHLD and CHLH, encoded by 4 different genes. As of this release, all 27 proteins are manually annotated in UniProtKB/Swiss-Prot. They all contain the subtopic PATHWAY: Porphyrin-containing compound metabolism; chlorophyll biosynthesis in ‘General annotation (Comments)’ and the keyword Chlorophyll biosynthesis. This keyword also allows the retrieval of additional proteins involved in the regulation of the process or in the biosynthesis of the long phytol side chain, for example.

Enzymes involved in the biosynthesis of the porphyrins, common to both heme and chlorophyll, are also annotated with the comment PATHWAY: Porphyrin-containing compound metabolism; protoporphyrin-IX biosynthesis.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2013_06

Published May 29, 2013

Headline

Back to the wild

Nearly half of our genome consists of mobile elements and their recognizable remnants. These elements are thought to have shaped both our genes and our entire genome, driving genome evolution. However, mobile elements can undergo ‘molecular domestication’, whereby the transposon genes are incorporated into cellular gene expression programs, but are no longer mobile. They can also evolve cellular DNA recombination functions, such as the V(D)J antigen receptor-recombination system. The human genome contains some 50 genes that were derived from transposable elements or transposons, and many are now integral components of cellular gene expression programs.

Human THAP9 is one such transposon-derived gene. It is homologous to Drosophila P element DNA transposase. Both human and Drosophila proteins show a typical site-specific DNA-binding Zn finger domain. Human THAP9 is a single-copy gene and does not contain any terminal inverted repeats or target-site duplications, indicating that it constitutes a bona fide domesticated stationary sequence. It thus came as a surprise that this gene has nevertheless retained the catalytic activity to mobilize P transposable elements in Drosophila and human cells. The physiological relevance of this observation remains elusive, but what is clear is that domesticated transposons may have retained enough “wild” properties to keep our genome on the move.

The human THAP9 entry has been updated accordingly in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to SignaLink

Cross-references have been added to SignaLink, an integrated resource to analyze signaling pathway proteins, cross-talks, transcription factors, miRNAs and regulatory enzymes.

SignaLink is available at http://signalink.org/

The format of the explicit links in the flat file is:

Resource abbreviation   SignaLink
Resource identifier     UniProtKB accession number
Example Q24306:
DR   SignaLink; Q24306; -.

Show all the entries having a cross-reference to SignaLink.

Removal of the cross-reference to HSSP

Cross-references to HSSP have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:
  • Ichthyosis, lamellar, 1

UniProt release 2013_05

Published May 1, 2013

Headline

Human genetic diseases in UniProtKB/Swiss-Prot

During the past decade, next-generation sequencing (NGS) technologies have accelerated the detection of genetic variants resulting in the rapid discovery of new disease-associated genes. More than 100 causative genes in various Mendelian disorders have been identified by means of whole exome sequencing. However, the wealth of variation data made available by NGS is not sufficient, alone, to understand the mechanisms underlying disease pathogenesis and manifestation. Diseases are the consequences of series of events that include not only primary mutations in disease-causing genes, but also variations in disease-modifying genes, as well as the combined effects of gene-gene and gene-environment interactions. That is why new approaches to unravel disease mechanisms are based on biological network analysis.

In addition to providing a large amount of information on protein functions, interactions and biological pathways, UniProt pays particular attention to the annotation of human genetic diseases and disease-linked variants. Information on genetic diseases is shown in the ‘Involvement in disease’ subsection of the ‘General Annotation (Comments)’ section. In the current release, over 4,600 phenotypes are described in close to 3,000 human entries. The great majority of UniProtKB disease descriptions have links to the Online Mendelian Inheritance in Man knowledgebase (OMIM), allowing users to retrieve more detailed information.

In order to improve the clarity of medical annotation and to facilitate the retrieval of disease information from UniProtKB, we have modified the format of the subsection ‘Involvement in disease’. The newly modified subsection is organized in 2 parts. Firstly, the disease name, acronym and features are defined using a controlled vocabulary. Secondly, the role of the gene/protein in the disease is described in a ‘Note:’, that allows discrimination between disease-causing, disease-modifying and susceptibility genes. This note, partly written in free text, provides information on the biological context or other interesting information that may not be directly related to the phenotype description, such as the involvement of different proteins in the pathological mechanism. For example, multiple sulfatase deficiency (MSD) is due to the simultaneous decrease of activity of all sulfatases. However, the primary cause is a mutation in SUMF1, an enzyme required for post-translational modification and catalytic activation of these enzymes. This additional information is stored in the ‘Involvement in disease’ note.

Genetic diseases annotated in UniProtKB/Swiss-Prot are indexed in the humdisease.txt file, available for our users as of this release. Each record in this file consists of a disease identifier, acronym, and description, as well as known disease synonyms, links to OMIM, Medical Subject Headings (MeSH) and associated UniProtKB keywords.

UniProtKB news

Complete proteomes for Ensembl species

For UniProt release 2013_05, one new species from Ensembl vertebrates and 3 new Ensembl Genomes have been made available. These are:

Felis catus (Cat)
Brassica rapa subsp. pekinensis (Chinese cabbage)
Hyaloperonospora arabidopsidis (Downy mildew agent)
Magnaporthe poae (Kentucky bluegrass fungus)

In addition to the new imports, existing proteomes derived from Ensembl species have been updated with data from Ensembl release 70.
All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genome cross-reference is added.
We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to make proteomes available for the remaining Ensembl and Ensembl Genomes species currently absent from UniProtKB.
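
The mapping rule described above (100% identity over 100% of the length of both sequences) amounts to exact sequence equality, so it can be sketched with a simple lookup table. This is an illustrative sketch only, not the actual UniProt pipeline: the function name `map_to_uniprotkb`, the dictionary inputs and the identifiers in the usage example are all hypothetical.

```python
def map_to_uniprotkb(ensembl_proteins, uniprotkb_entries):
    """Map predicted proteins to UniProtKB entries with identical sequences.

    Both arguments are dicts of identifier -> amino acid sequence (assumed to
    use the same letter case). Returns (mapped, to_import), where `mapped`
    links each matched Ensembl protein to its UniProtKB accessions and
    `to_import` lists proteins with no identical UniProtKB sequence.
    """
    # Index UniProtKB by sequence; several entries may share one sequence.
    by_sequence = {}
    for accession, sequence in uniprotkb_entries.items():
        by_sequence.setdefault(sequence, []).append(accession)

    mapped, to_import = {}, []
    for protein_id, sequence in ensembl_proteins.items():
        accessions = by_sequence.get(sequence)
        if accessions:
            mapped[protein_id] = accessions
        else:
            # Absent from UniProtKB: candidate for import into TrEMBL.
            to_import.append(protein_id)
    return mapped, to_import


if __name__ == "__main__":
    ensembl = {"ENSP_A": "MKT", "ENSP_B": "MKTA"}      # toy sequences
    uniprot = {"ACC1": "MKT", "ACC2": "MKTAA"}         # toy sequences
    print(map_to_uniprotkb(ensembl, uniprot))
```

Because the criterion is strict equality, a sequence that differs by a single residue or by one extra terminal residue is treated as absent and would be imported rather than mapped.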

Removal of the cross-reference to GenomeReviews

Cross-references to GenomeReviews have been removed.

Changes to keywords

New keywords:

Modified keyword:

UniProt release 2013_04

Published April 3, 2013

Headline

Major progress in adenovirus annotation

Adenoviruses were first isolated by Wallace Rowe in 1953 from adenoid tissue of sick children. These viruses infect a wide range of vertebrates, including humans. Infectious virions are spread primarily via respiratory droplets; however, they can also be spread by the fecal route. Most infections with Human Adenovirus (HAdV) result in upper respiratory tract diseases; they account for about 10% of acute respiratory infections in children. They can also cause fever, diarrhea, pink eye (conjunctivitis), bladder infection (cystitis), rash illness, etc.

HAdV are medium-sized (90-100 nm), non-enveloped icosahedral viruses composed of a capsid and a double-stranded linear DNA genome. The viral genome is approximately 36 kb long. It encodes 37 proteins which are produced by complex alternative splicing of 6 mRNA transcription units. The viral genome replicates in the host cell nucleus, but never integrates into the host genome. This is the reason why adenoviruses are widely used in gene therapy and anticancer virus vector trials.

The JCVI adenovirus project recently resulted in the sequencing of 150 new HAdV genomes. In order to support the annotation of these new genomes, the community needs a high quality set of data that can serve as a reference. In this context, a collaboration including UniProt, NCBI, JCVI and several field experts has been initiated to update reference adenovirus genomes and proteomes. Gene predictions have been corrected with the most recent proteomic and cDNA sequencing data. This major collaborative effort has resulted in a consistent and up-to-date annotation of the viral genome in NCBI RefSeq and of the HAdV reference proteome in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of MEDLINE identifiers

We have removed the MEDLINE identifiers from the bibliographic database cross-references of literature citations since they have been superseded by PubMed identifiers. The valid bibliographic database names and their associated identifiers are now:

Name       Identifier
PubMed     PubMed Unique Identifier (PMID)
DOI        Digital Object Identifier (DOI)
AGRICOLA   AGRICOLA Unique Identifier
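
For older local files that still carry MEDLINE identifiers, the change above can be reproduced with a small filter. This sketch rests on assumptions not stated in this note: that bibliographic cross-references sit on `RX` lines as `Name=Identifier;` pairs separated by a space, and that dropping unknown names is the desired behaviour. The function name `strip_invalid_rx` and the identifiers in the usage example are made up for illustration.

```python
VALID_NAMES = ("PubMed", "DOI", "AGRICOLA")


def strip_invalid_rx(line, valid_names=VALID_NAMES):
    """Keep only the valid bibliographic cross-references on an RX line.

    Non-RX lines are returned unchanged. Returns None when an RX line has no
    valid cross-reference left.
    """
    if not line.startswith("RX   "):
        return line
    body = line[5:].rstrip()
    if body.endswith(";"):
        body = body[:-1]
    # Split on '; ' rather than ';' so identifiers containing a bare
    # semicolon (which can occur in DOIs) are not cut in two.
    pairs = body.split("; ")
    kept = [pair for pair in pairs if pair.split("=", 1)[0] in valid_names]
    if not kept:
        return None
    return "RX   " + " ".join(pair + ";" for pair in kept)


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(strip_invalid_rx("RX   MEDLINE=00000000; PubMed=11111111;"))
```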

UniProt release 2013_03

Published March 6, 2013

Headline

Latest from the prokaryotic world: bacterial Cas9, a new tool for genome engineering

The CRISPR system (Clustered Regularly Interspaced Short Palindromic Repeat) is a bacterial and archaeal, RNA-based adaptive immune system, which degrades invading genetic material. Very briefly, invading viruses or plasmids are recognized by their complementarity to CRISPR RNA (crRNA) and degraded by dedicated nucleases.

There are 3 major CRISPR systems, with a growing number of recognized subtypes depending on the Cas proteins (CRISPR-associated proteins) used to effect the various steps of crRNA generation and invading nucleic acid destruction. In type I and III CRISPR systems, different specialized Cas endonucleases generate crRNAs, which then assemble with other Cas proteins to create large crRNA-protein complexes that recognize and degrade invading nucleic acids complementary to the crRNA. Type II CRISPR systems are a little different. In these systems, correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous RNase III and the Cas9 protein. The tracrRNA serves as a guide for RNase III-aided processing of pre-crRNA. Subsequently the Cas9/crRNA/tracrRNA complex endonucleolytically cleaves linear or circular dsDNA targets complementary to the crRNA. Degradation requires the Cas9 protein and both RNA species. Thus, in type II CRISPR systems, crRNA-guided degradation of DNA relies upon a single protein. This discovery has implications beyond the world of bacteria. Expressing Cas9 with specifically chosen crRNA should allow site-specific genome modifications, knocking-out genes on demand not only in bacteria where it is already relatively simple to do so, but also in higher organisms, such as vertebrates.

And indeed it works! In 2 back-to-back Science articles published online in January of this year, Streptococcus pyogenes strain SF370 Cas9 endonuclease was codon-optimized and targeted to the nucleus in human or mouse cells. In one article, RNase III was engineered in a similar fashion while the tracrRNA and pre-crRNA were expressed either separately or as a hybrid molecule, while in the other, only a hybrid crRNA-tracrRNA was expressed. In both papers, various gene targets were cloned into the crRNA locus, leading to site-specific target cleavage which was subsequently repaired by either nonhomologous end-joining or homologous recombination. While the efficiency of the process varies, introducing multiple targets within a single gene or targeting multiple genes at a time is feasible, allowing for comparatively easy manipulation of a genome of interest. Additionally, no toxicity has been observed upon expression in human cells.

A similar approach has been successfully used not only in other bacteria, but also in zebrafish, as well as in different human cell lines.

The work described above has been carried out using Cas9 from Streptococcus pyogenes strain SF370, and the corresponding UniProtKB/Swiss-Prot entry has been updated, as have experimentally characterized orthologous proteins in other bacteria (Streptococcus thermophilus strain DGCC7710, Streptococcus thermophilus strain ATCC BAA-491 / LMD-9 and Listeria innocua serovar 6a strain CLIP 11262). Additionally, a new HAMAP rule has been made for the Cas9 family (MF_01480).

UniProtKB news

Cross-references to ChiTaRS

Cross-references have been added to ChiTaRS, a database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data.

ChiTaRS is available at http://chitars.bioinfo.cnio.es/

The format of the explicit links in the flat file is:

Resource abbreviation    ChiTaRS
Resource identifier      gene name
Optional information 1   organism name
Example P16320:
DR   ChiTaRS; ATP6AP1; drosophila.

Show all the entries having a cross-reference to ChiTaRS.

Cross-references to SABIO-RK

Cross-references have been added to SABIO-RK, a database of biochemical reaction kinetics.

SABIO-RK is available at http://sabiork.h-its.org/

The format of the explicit links in the flat file is:

Resource abbreviation   SABIO-RK
Resource identifier     UniProtKB accession number
Example P10172:
DR   SABIO-RK; P10172; -.

Show all the entries having a cross-reference to SABIO-RK.

Removal of the cross-references to 8 2D gel databases

Cross-references to 2DBase-Ecoli, Aarhus/Ghent-2DPAGE, ANU-2DPAGE, Cornea-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE, and Rat-heart-2DPAGE have been removed.
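
Software that reads older flat files may still meet DR lines for the eight resources listed above. A minimal sketch of how such lines could be dropped is given below. It assumes the usual `DR   RESOURCE; ...` layout, the function name `drop_removed_cross_references` is hypothetical, and the accession in the usage example is a placeholder.

```python
REMOVED_RESOURCES = frozenset([
    "2DBase-Ecoli", "Aarhus/Ghent-2DPAGE", "ANU-2DPAGE", "Cornea-2DPAGE",
    "PHCI-2DPAGE", "PMMA-2DPAGE", "Siena-2DPAGE", "Rat-heart-2DPAGE",
])


def drop_removed_cross_references(lines, removed=REMOVED_RESOURCES):
    """Return `lines` without DR lines that point to a removed resource."""
    kept = []
    for line in lines:
        if line.startswith("DR   "):
            # The resource abbreviation is the text before the first ';'.
            resource = line[5:].split(";", 1)[0].strip()
            if resource in removed:
                continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    entry = [
        "DR   Siena-2DPAGE; P00000; -.",   # placeholder accession
        "DR   SABIO-RK; P10172; -.",
    ]
    print(drop_removed_cross_references(entry))
```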