DECIPHER - Clusterize

Clusterize

This short example describes how to use Clusterize to cluster sequences, as described in:

ES Wright (2024) "Accurately clustering biological sequences in linear time by relatedness sorting." Nature Communications, doi:10.1038/s41467-024-47371-9.

For an in-depth tutorial on clustering, see the "Upsize Your Clustering with Clusterize" vignette, available from the Documentation page. Also, watch the video explaining Clusterize here.

How do I cluster sequences by similarity?

First it is necessary to install DECIPHER and load the library in R. Next, set the "fas" variable to the path to the FASTA file of unaligned sequences (e.g., "~/mySeqs.fas"). Then you can choose a distance cutoff for clustering the sequences. Clusterize will output a cluster number for each input sequence and print an estimate of the clustering effectiveness.

Hide output

# load the DECIPHER library in R
> library(DECIPHER)
> 
> # specify the path to the FASTA file (in quotes)
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
> 
> # load the sequences from the file
> # change "DNA" to "RNA" or "AA" as needed
> seqs <- readAAStringSet(fas)
> 
> # look at some of the sequences (optional)
> seqs
AAStringSet object of length 18976:
        width seq             names               
    [1]   567 MPYMGV...RRVPPK Seq1
    [2]   749 MRYIDD...MNQIES Seq2
    [3]   849 MLGILK...FGEKGT Seq3
    [4]   742 MLFSFS...IKEQNS Seq4
    [5]   499 MSSFTL...SAVSSL Seq5
    ...   ... ...
[18972]   927 MSRKVL...RGTDNE Seq18972
[18973]   465 MTFEER...GDDASF Seq18973
[18974]   502 MRTPKS...PHKTSV Seq18974
[18975]   527 MFFVPR...PGAAHS Seq18975
[18976]   475 MNRGRR...DLPARL Seq18976
> 
> # cluster the sequences
> clusters <- Clusterize(seqs,
+ cutoff=0.5, # < 50% distant
+ minCoverage=0.5, # > 50% coverage
+ processors=NULL) # use all CPUs
Partitioning sequences by 4-mer similarity:
  |========================================| 100%


Time difference of 6.05 secs


Sorting by relatedness within 15809 groups:
iteration 7 of up to 24 (100.0% stability) 


Time difference of 1.46 secs


Clustering sequences by 4-mer to 6-mer similarity:
  |========================================| 100%


Time difference of 52.73 secs


Clusters via relatedness sorting: 86.8% (0.3% exclusively)
Clusters via rare 4-mers: 99.7% (13.2% exclusively)
Estimated clustering effectiveness: 99.2%


> 
> # view the cluster numbers
> head(clusters)
     cluster
Seq1    6306
Seq2    1957
Seq3    3093
Seq4    4164
Seq5    7527
Seq6    1944
> 
> # compute cluster statistics
> max(clusters) # number of clusters
[1] 12559
> t <- table(clusters)
> mean(t) # average cluster size
[1] 1.510948
> tail(sort(t)) # biggest clusters
cluster
7451 8479 6757 2279 3414    6 
  47   49   51   52   64  110