Clusterize
This short example describes how to use Clusterize to cluster sequences, as described in:ES Wright (2024) "Accurately clustering biological sequences in linear time by relatedness sorting." Nature Communications, doi:10.1038/s41467-024-47371-9.
Instructions
First it is necessary to install DECIPHER and load the library in R. Next, set the "fas" variable to the path to the FASTA file of unaligned sequences (e.g., "~/mySeqs.fas"). Then you can choose a distance cutoff for clustering the sequences. Clusterize will output a cluster number for each input sequence and print an estimate of the clustering effectiveness.# load the DECIPHER library in R
> library(DECIPHER)
>
> # specify the path to the FASTA file (in quotes)
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
>
> # load the sequences from the file
> # change "DNA" to "RNA" or "AA" as needed
> seqs <- readAAStringSet(fas)
>
> # look at some of the sequences (optional)
> seqs
AAStringSet object of length 18976:
width seq names
[1] 567 MPYMGV...RRVPPK Seq1
[2] 749 MRYIDD...MNQIES Seq2
[3] 849 MLGILK...FGEKGT Seq3
[4] 742 MLFSFS...IKEQNS Seq4
[5] 499 MSSFTL...SAVSSL Seq5
... ... ...
[18972] 927 MSRKVL...RGTDNE Seq18972
[18973] 465 MTFEER...GDDASF Seq18973
[18974] 502 MRTPKS...PHKTSV Seq18974
[18975] 527 MFFVPR...PGAAHS Seq18975
[18976] 475 MNRGRR...DLPARL Seq18976
>
> # cluster the sequences
> clusters <- Clusterize(seqs,
+ cutoff=0.5, # < 50% distant
+ minCoverage=0.5, # > 50% coverage
+ processors=NULL) # use all CPUs
Partitioning sequences by 4-mer similarity:
|========================================| 100%
Time difference of 6.05 secs
Sorting by relatedness within 15809 groups:
iteration 7 of up to 24 (100.0% stability)
Time difference of 1.46 secs
Clustering sequences by 4-mer to 6-mer similarity:
|========================================| 100%
Time difference of 52.73 secs
Clusters via relatedness sorting: 86.8% (0.3% exclusively)
Clusters via rare 4-mers: 99.7% (13.2% exclusively)
Estimated clustering effectiveness: 99.2%
>
> # view the cluster numbers
> head(clusters)
cluster
Seq1 6306
Seq2 1957
Seq3 3093
Seq4 4164
Seq5 7527
Seq6 1944
>
> # compute cluster statistics
> max(clusters) # number of clusters
[1] 12559
> t <- table(clusters)
> mean(t) # average cluster size
[1] 1.510948
> tail(sort(t)) # biggest clusters
cluster
7451 8479 6757 2279 3414 6
47 49 51 52 64 110