Sequence Search
This short example describes how to use DECIPHER to search biological sequences, as described in:ES Wright (2024) "Fast and Flexible Search for Homologous Biological Sequences with DECIPHER v3." The R Journal, doi:in Press.
Instructions
Searching is split into two main functions: IndexSeqs and SearchIndex. The IndexSeqs function builds an inverted index of the target sequences to accelerate querying. The SearchIndex function searches the inverted index for homologs to the input query sequences.# load the DECIPHER library in R
> library(DECIPHER)
>
> # load the target sequences from a file
> fas <- "<<REPLACE WITH PATH TO TARGET SEQUENCES>>"
> target <- readDNAStringSet(fas)
> target
DNAStringSet object of length 1:
width seq names
[1] 1042519 GCGGCCG...CGGGGTA NC_000117.1 Chlam...
>
> # construct an inverted index
> index <- IndexSeqs(target,
+ K=9, # set an appropriate k-mer size
+ processors=NULL) # use all CPUs
|========================================| 100%
Time difference of 0.36 secs
>
> # load the query sequences from a file
> fas <- "<<REPLACE WITH PATH TO QUERY SEQUENCES>>"
> query <- readDNAStringSet(fas)
> query
DNAStringSet object of length 317:
width seq names
[1] 819 ATGGCTT...AAGAAAA Rickettsia prowaz...
[2] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[3] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[4] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[5] 819 ATGGCTA...TGGTAAA Pasteurella multo...
... ... ...
[313] 819 ATGGCAA...TACTAAA Pectobacterium at...
[314] 822 ATGCCTA...CGTCAAG Acinetobacter sp....
[315] 864 ATGGGCA...TCAGTCT Thermosynechococc...
[316] 831 ATGGCAC...GAAGAAG Bradyrhizobium ja...
[317] 840 ATGGGCA...GCGAGGT Gloeobacter viola...
>
> # search for homologs of the query
> hits <- SearchIndex(query,
+ index, # target sequences
+ processors=NULL) # use all CPUs
|========================================| 100%
Time difference of 0.02 secs
>
> head(hits)
Pattern Subject Score Position
1 1 1 2.0656499 365, 381....
2 2 1 0.9781966 353, 367....
3 3 1 0.9781966 353, 367....
4 4 1 0.9781966 353, 367....
5 5 1 0.9506100 276, 285....
6 6 1 0.9506100 276, 285....
>
> # optionally, locally align the hits
> aligned <- AlignPairs(query, target, hits)
|========================================| 100%
Time difference of 0.18 secs
>
> head(aligned)
Pattern PatternStart PatternEnd Subject
1 1 266 389 1
2 2 341 397 1
3 3 341 397 1
4 4 341 397 1
5 5 217 601 1
6 6 217 601 1
SubjectStart SubjectEnd Matches Mismatches
1 978502 978623 60 62
2 165465 165521 34 23
3 165465 165521 34 23
4 165465 165521 34 23
5 82935 83316 149 233
6 82935 83316 149 233
AlignmentLength Score PatternGapPosition
1 124 40.8
2 57 45.0
3 57 45.0
4 57 45.0
5 385 31.8
6 385 31.8
PatternGapLength SubjectGapPosition
1 84
2
3
4
5 72, 171
6 72, 171
SubjectGapLength
1 2
2
3
4
5 1, 2
6 1, 2