Managing big biological sequence data - Workflow part #1
In part #1 of this workflow, the goal is to design optimal primers to differentiate all of the Pythium. Here the Cytochrome c oxidase subunit 1 (COI) gene is used as a phylogenetic marker to identify species. The COI gene is part of the mitochondrial genome of all eukaryotes. The first step is to download the COI gene sequences of known Pythium species from the Internet and import them into a sequence database, as shown below.# all paths are relative to the installed datasets
> data_dir <- "<<PATH TO DATA FOLDER>>"
>
> library(DECIPHER)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall,
clusterEvalQ, clusterExport, clusterMap,
parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply,
parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame,
cbind, colnames, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl,
intersect, is.unsorted, lapply, lengths,
Map, mapply, match, mget, order, paste,
pmax, pmax.int, pmin, pmin.int, Position,
rank, rbind, Reduce, rownames, sapply,
setdiff, sort, table, tapply, union,
unique, unsplit, which, which.max,
which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
colMeans, colSums, expand.grid, rowMeans,
rowSums
Loading required package: IRanges
Loading required package: XVector
Loading required package: RSQLite
Loading required package: DBI
>
> # Create a connection to an on-disk SQLite database
> dbConn <- dbConnect(SQLite(),
+ "./COIgenes.sqlite") # path to new database file
>
> # Import sequences from a GenBank formatted file
> Seqs2DB(paste(data_dir,
+ "/Pythium_spp_COI.gb",
+ sep=""),
+ type="GenBank",
+ dbFile=dbConn,
+ identifier="Pythium")
Reading GenBank file chunk 1
488 total sequences in table Seqs.
Time difference of 0.29 secs
>
> # View the database table that was constructed
> BrowseDB(dbConn)
>
> # Retrieve the imported sequences
> dna <- SearchDB(dbConn)
Search Expression:
select row_names, sequence from _Seqs where
row_names in (select row_names from Seqs)
DNAStringSet of length: 488
Time difference of 0.02 secs
> dna
A DNAStringSet instance of length 488
width seq names
[1] 1277 ATGAATTTT...GTTATTCTT 1
[2] 1277 ATGAATTTT...GTTATTTTT 2
[3] 1095 TATATAATG...TATTTTTTT 3
[4] 1299 ATGAATTTT...ATTACATTT 4
[5] 1109 CATCATTTA...TATAGGTGT 5
... ... ...
[484] 673 AAATCATAA...TTATTCCAA 484
[485] 680 AATCATAAA...ACATTTATT 485
[486] 680 AATCATAAA...ACATTTATT 486
[487] 680 AATCATAAA...ACATTTATT 487
[488] 680 AATCATAAA...ACATTTATT 488
>
> # Align the sequences based on their translations
> DNA <- AlignTranslation(dna)
Determining distance matrix based on shared 5-mers:
|============================================| 100%
Time difference of 0.65 secs
Clustering into groups by similarity:
|============================================| 100%
Time difference of 1.11 secs
Aligning Sequences:
|============================================| 100%
Time difference of 2.56 secs
Determining distance matrix based on alignment:
|============================================| 100%
Time difference of 0.27 secs
Reclustering into groups by similarity:
|============================================| 100%
Time difference of 0.64 secs
Realigning Sequences:
|============================================| 100%
Time difference of 2.43 secs
> DNA
A DNAStringSet instance of length 488
width seq names
[1] 1299 ATGAATTTT...--------- 1
[2] 1299 ATGAATTTT...--------- 2
[3] 1299 ---------...--------- 3
[4] 1299 ATGAATTTT...ATTACATTT 4
[5] 1299 ---------...--------- 5
... ... ...
[484] 1299 ---------...--------- 484
[485] 1299 ---------...--------- 485
[486] 1299 ---------...--------- 486
[487] 1299 ---------...--------- 487
[488] 1299 ---------...--------- 488
>
> # Display the sequences in a web browser
> BrowseSeqs(DNA)
> # show differences with the first sequence
> BrowseSeqs(DNA, highlight=1)
> # show differences with the consensus sequence
> BrowseSeqs(DNA, highlight=0)
> # change the degree of consensus
> BrowseSeqs(DNA, highlight=0, threshold=0.2)
>
> # note the pattern common to most sequences
> pattern <- DNAStringSet("TAGATTTAGCWATTTTTAGTTTACA")
> BrowseSeqs(DNA,
+ patterns=pattern)
>
> # The protein sequences are very similar
> AA <- AlignTranslation(dna, type="AAStringSet")
Determining distance matrix based on shared 5-mers:
|============================================| 100%
Time difference of 0.66 secs
Clustering into groups by similarity:
|============================================| 100%
Time difference of 0.93 secs
Aligning Sequences:
|============================================| 100%
Time difference of 2.51 secs
Determining distance matrix based on alignment:
|============================================| 100%
Time difference of 0.26 secs
Reclustering into groups by similarity:
|============================================| 100%
Time difference of 0.61 secs
Realigning Sequences:
|============================================| 100%
Time difference of 1.96 secs
> BrowseSeqs(AA, highlight=1)
>
> # Choose a reference for frameshift correction
> REF <- translate(dna[11]) # sequence #11
>
> # Correct the frameshift in sequence #12
> correct <- CorrectFrameshifts(myXStringSet=dna[12],
+ myAAStringSet=REF,
+ type="both")
Assessing frameshifts in nucleotide sequences:
|============================================| 100%
Time difference of 0.05 secs
> correct
$indels
$indels$`12`
$indels$`12`$insertions
integer(0)
$indels$`12`$deletions
[1] 813
$indels$`12`$distance
[1] 0
$indels$`12`$index
[1] 1
$sequences
A DNAStringSet instance of length 1
width seq names
[1] 1089 TATAATGTTG...TTGGTTATTC 12
> dna[12] <- correct$sequence
>
> # Sequence #11 is now identical to #12
> DNA <- AlignTranslation(dna)
Determining distance matrix based on shared 5-mers:
|============================================| 100%
Time difference of 0.63 secs
Clustering into groups by similarity:
|============================================| 100%
Time difference of 1.08 secs
Aligning Sequences:
|============================================| 100%
Time difference of 3.05 secs
Determining distance matrix based on alignment:
|============================================| 100%
Time difference of 0.32 secs
Reclustering into groups by similarity:
|============================================| 100%
Time difference of 0.67 secs
Realigning Sequences:
|============================================| 100%
Time difference of 1.88 secs
> BrowseSeqs(DNA, highlight=11)
>
> # Identify clusters for primer design
> d <- DistanceMatrix(DNA)
|============================================| 100%
Time difference of 0.37 secs
> dim(d) # a symmetric matrix
[1] 488 488
> c <- IdClusters(d,
+ method="UPGMA",
+ cutoff=0.05,
+ show=TRUE)
|============================================| 100%
Time difference of 1.66 secs
> head(c) # cluster numbers
cluster
1 21
2 21
3 20
4 32
5 2
6 20
>
> # Identify sequences by cluster name in the database
> Add2DB(data.frame(identifier=paste("cluster",
+ c$cluster,
+ sep="")),
+ dbConn)
Expression:
update Seqs set identifier = :identifier where
row_names = :row_names
Added to table Seqs: "identifier".
Time difference of 0.02 secs
> BrowseDB(dbConn)
>
> # Design primers for next-generation sequencing
> primers <- DesignSignatures(dbConn,
+ type="sequence",
+ resolution=5,
+ levels=5,
+ minProductSize=400,
+ maxProductSize=800,
+ annealingTemp=55,
+ maxPermutations=8)
Tallying 8-mers for 48 groups:
|============================================| 100%
Time difference of 2.15 secs
Designing primer sequences based on the group 'cluster2':
|============================================| 100%
Time difference of 130.93 secs
Selecting the most common primer sequences:
|============================================| 100%
Time difference of 48.39 secs
Determining PCR products from each group:
|============================================| 100%
Time difference of 110.41 secs
Scoring primer pair combinations:
|============================================| 100%
Time difference of 11.17 secs
Choosing optimal forward and reverse pairs:
|============================================| 100%
Time difference of 64.59 secs
> primers[1,] # the top scoring primer set
forward_primer reverse_primer score
1 KCRCAACCWGGTAATCAAAT AAWCCWGGWGCTCTCAT 0.047509....
coverage products
1 0.916666.... 424
similar_signatures
1 (cluster32, cluster6); (cluster36, cluster24); (cluster21, cluster20, cluster2, cluster12, cluster18, cluster44)
missing_signatures
1 cluster46, cluster45, cluster8, cluster41
>
> # Highlight the primers' target sites
> BrowseSeqs(DNA,
+ patterns=c(DNAStringSet(primers[1, 1]),
+ reverseComplement(DNAStringSet(primers[1, 2]))))