Clusterize

This short example describes how to use Clusterize to cluster sequences, as described in:

ES Wright (2024) "Accurately clustering biological sequences in linear time by relatedness sorting." Nature Communications, doi:10.1038/s41467-024-47371-9.

For an in-depth tutorial on clustering, see the "Upsize Your Clustering with Clusterize" vignette, available from the Documentation page. A video explaining Clusterize is also available.

How do I cluster sequences by similarity?

First, install DECIPHER and load the library in R. Next, set the "fas" variable to the path of a FASTA file of unaligned sequences (e.g., "~/mySeqs.fas"). Then choose a distance cutoff for clustering; in the example below, cutoff=0.5 groups sequences that are less than 50% distant and minCoverage=0.5 requires more than 50% coverage. Clusterize returns a cluster number for each input sequence and prints an estimate of the clustering effectiveness.

> # load the DECIPHER library in R
> library(DECIPHER)
> 
> # specify the path to the FASTA file (in quotes)
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
> 
> # load the sequences from the file
> # change "AA" to "DNA" or "RNA" as needed
> seqs <- readAAStringSet(fas)
> 
> # look at some of the sequences (optional)
> seqs
AAStringSet object of length 18976:
        width seq             names               
    [1]   567 MPYMGV...RRVPPK Seq1
    [2]   749 MRYIDD...MNQIES Seq2
    [3]   849 MLGILK...FGEKGT Seq3
    [4]   742 MLFSFS...IKEQNS Seq4
    [5]   499 MSSFTL...SAVSSL Seq5
    ...   ... ...
[18972]   927 MSRKVL...RGTDNE Seq18972
[18973]   465 MTFEER...GDDASF Seq18973
[18974]   502 MRTPKS...PHKTSV Seq18974
[18975]   527 MFFVPR...PGAAHS Seq18975
[18976]   475 MNRGRR...DLPARL Seq18976
> 
> # cluster the sequences
> clusters <- Clusterize(seqs,
+ cutoff=0.5, # < 50% distant
+ minCoverage=0.5, # > 50% coverage
+ processors=NULL) # use all CPUs
Partitioning sequences by 4-mer similarity:
  |========================================| 100%

Time difference of 6.05 secs
Sorting by relatedness within 15809 groups: iteration 7 of up to 24 (100.0% stability)
Time difference of 1.46 secs
Clustering sequences by 4-mer to 6-mer similarity:
  |========================================| 100%

Time difference of 52.73 secs
Clusters via relatedness sorting: 86.8% (0.3% exclusively)
Clusters via rare 4-mers: 99.7% (13.2% exclusively)
Estimated clustering effectiveness: 99.2%
> 
> # view the cluster numbers
> head(clusters)
     cluster
Seq1    6306
Seq2    1957
Seq3    3093
Seq4    4164
Seq5    7527
Seq6    1944
> 
> # compute cluster statistics
> max(clusters) # number of clusters
[1] 12559
> t <- table(clusters)
> mean(t) # average cluster size
[1] 1.510948
> tail(sort(t)) # biggest clusters
cluster
7451 8479 6757 2279 3414    6 
  47   49   51   52   64  110 
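
Once each sequence has a cluster number, a common next step is to list the members of each cluster. The sketch below uses only base R and a small made-up data frame shaped like the output shown above (one "cluster" column, with sequence names as row names). The toy values and the variable names "members", "sizes", and "representatives" are assumptions for illustration and are not part of DECIPHER.

```r
# toy stand-in for the data frame returned by Clusterize (values are made up)
clusters <- data.frame(cluster = c(2L, 1L, 2L, 3L, 2L, 1L),
                       row.names = paste0("Seq", 1:6))

# group the sequence names by cluster number
members <- split(rownames(clusters), clusters$cluster)

# cluster sizes, largest first
sizes <- sort(lengths(members), decreasing = TRUE)

# one arbitrary representative per cluster (simply the first member listed)
representatives <- vapply(members, `[`, character(1), 1)
```

With real output, the same three lines should apply unchanged to the "clusters" object from the session above. Taking the first member as a representative is only a placeholder choice; another criterion, such as the longest sequence, may suit some analyses better.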