Clusterize
This short example describes how to use Clusterize to cluster sequences, as described in:ES Wright (2024) "Accurately clustering biological sequences in linear time by relatedness sorting." Nature Communications, doi:10.1038/s41467-024-47371-9.
How do I cluster sequences by similarity?
First it is necessary to install DECIPHER and load the library in R. Next, set the "fas" variable to the path to the FASTA file of unaligned sequences (e.g., "~/mySeqs.fas"). Then you can choose a distance cutoff for clustering the sequences. Clusterize will output a cluster number for each input sequence and print an estimate of the clustering effectiveness.# load the DECIPHER library in R
> library(DECIPHER)
>
> # specify the path to the FASTA file (in quotes)
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
>
> # load the sequences from the file
> # change "DNA" to "RNA" or "AA" as needed
> seqs <- readAAStringSet(fas)
>
> # look at some of the sequences (optional)
> seqs
AAStringSet object of length 18976:
width seq names
[1] 567 MPYMGV...RRVPPK Seq1
[2] 749 MRYIDD...MNQIES Seq2
[3] 849 MLGILK...FGEKGT Seq3
[4] 742 MLFSFS...IKEQNS Seq4
[5] 499 MSSFTL...SAVSSL Seq5
... ... ...
[18972] 927 MSRKVL...RGTDNE Seq18972
[18973] 465 MTFEER...GDDASF Seq18973
[18974] 502 MRTPKS...PHKTSV Seq18974
[18975] 527 MFFVPR...PGAAHS Seq18975
[18976] 475 MNRGRR...DLPARL Seq18976
>
> # cluster the sequences
> clusters <- Clusterize(seqs,
+ cutoff=0.5, # < 50% distant
+ minCoverage=0.5, # > 50% coverage
+ processors=NULL) # use all CPUs
Partitioning sequences by 4-mer similarity:
|========================================| 100%
Time difference of 6.05 secs
Sorting by relatedness within 15809 groups:
iteration 7 of up to 24 (100.0% stability)
Time difference of 1.46 secs
Clustering sequences by 4-mer to 6-mer similarity:
|========================================| 100%
Time difference of 52.73 secs
Clusters via relatedness sorting: 86.8% (0.3% exclusively)
Clusters via rare 4-mers: 99.7% (13.2% exclusively)
Estimated clustering effectiveness: 99.2%
>
> # view the cluster numbers
> head(clusters)
cluster
Seq1 6306
Seq2 1957
Seq3 3093
Seq4 4164
Seq5 7527
Seq6 1944
>
> # compute cluster statistics
> max(clusters) # number of clusters
[1] 12559
> t <- table(clusters)
> mean(t) # average cluster size
[1] 1.510948
> tail(sort(t)) # biggest clusters
cluster
7451 8479 6757 2279 3414 6
47 49 51 52 64 110