Removal of whitespace characters in the XML amino acid sequence representations

On September 18, 2019

For historic reasons, the <sequence> elements of the UniProtKB, UniParc and UniRef XML representations format the amino acid sequence with spaces and newlines. These whitespace characters have to be removed from the element content after it has been read with native XML tools. To avoid this complication, we are going to remove all whitespace characters from the <sequence> elements, so that they will contain only IUPAC amino acid codes.
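
As an illustration, the following minimal Python sketch shows one way a consumer might strip whitespace from the text of a <sequence> element with the standard library XML parser. The sample document, its sequence and the helper name sequence_text are hypothetical and heavily simplified (real UniProt XML declares a namespace and contains many more elements). The same code also works on input that contains no whitespace, so it should not need to change once the new format is in place.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample: a made-up 20-residue sequence that is
# split over two lines, as in the whitespace-formatted representation.
SAMPLE = """<entry>
  <sequence length="20">MKTAYIAKQR
QISFVKSHFS</sequence>
</entry>"""


def sequence_text(xml_string):
    """Return the content of the first <sequence> child without whitespace."""
    root = ET.fromstring(xml_string)
    elem = root.find("sequence")
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines); joining the pieces removes all of it.
    return "".join(elem.text.split())


print(sequence_text(SAMPLE))  # prints MKTAYIAKQRQISFVKSHFS
```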

Change of UniRef clustering method from CD-HIT to MMseqs2

On September 18, 2019

We will switch the clustering program for UniRef90 and UniRef50 from CD-HIT to MMseqs2 (Steinegger M. and Soeding J., Nat. Commun. 9 (2018)).

The clustering algorithm will remain “Greedy Incremental Clustering” with the same parameters (thanks to the MMseqs2 authors for making this available). UniRef100 will not be affected.
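
For readers unfamiliar with the term, the general idea of greedy incremental clustering can be sketched as follows: sequences are processed from longest to shortest, and each one either joins the first existing cluster whose representative it matches at or above a threshold, or becomes the representative of a new cluster. This is a toy illustration only and not the CD-HIT or MMseqs2 implementation. The function names and the similarity measure (difflib's ratio, used as a stand-in for alignment-based sequence identity) are assumptions made for the example.

```python
from difflib import SequenceMatcher


def toy_similarity(a, b):
    # Stand-in for a real sequence identity computed from an alignment.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_incremental_clustering(sequences, threshold):
    """Toy greedy incremental clustering.

    Returns a list of (representative, members) tuples.
    """
    clusters = []
    # Longest sequences are processed first, so every representative is
    # at least as long as the members of its cluster.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if toy_similarity(representative, seq) >= threshold:
                members.append(seq)  # join the first matching cluster
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters
```

Called with a few made-up sequences and a threshold of 0.9, for example greedy_incremental_clustering(["MKTAYIAKQRQISFVKSHF", "GGGGGGGGGG", "MKTAYIAKQRQISFVKSHFS"], 0.9), the sketch yields two clusters, with the longest sequence as the representative of the first.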