DECIPHER - Bioinformatics Video - Managing big biological sequence data

Managing big biological sequence data - Workflow part #1
In part #1 of this workflow, the goal is to design optimal primers to differentiate all of the Pythium. Here the Cytochrome c oxidase subunit 1 (COI) gene is used as a phylogenetic marker to identify species. The COI gene is part of the mitochondrial genome of all eukaryotes. The first step is to download the COI gene sequences of known Pythium species from the Internet and import them into a sequence database, as shown below.
Hide output
# all paths are relative to the installed datasets
> data_dir <- "<<PATH TO DATA FOLDER>>"
> 
> library(DECIPHER)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:


    clusterApply, clusterApplyLB, clusterCall,
    clusterEvalQ, clusterExport, clusterMap,
    parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply,
    parSapplyLB


The following objects are masked from ‘package:stats’:


    IQR, mad, xtabs


The following objects are masked from ‘package:base’:


    anyDuplicated, append, as.data.frame,
    cbind, colnames, do.call, duplicated, eval,
    evalq, Filter, Find, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths,
    Map, mapply, match, mget, order, paste,
    pmax, pmax.int, pmin, pmin.int, Position,
    rank, rbind, Reduce, rownames, sapply,
    setdiff, sort, table, tapply, union,
    unique, unsplit, which, which.max,
    which.min


Loading required package: S4Vectors
Loading required package: stats4


Attaching package: ‘S4Vectors’


The following objects are masked from ‘package:base’:


    colMeans, colSums, expand.grid, rowMeans,
    rowSums


Loading required package: IRanges
Loading required package: XVector
Loading required package: RSQLite
Loading required package: DBI
> 
> # Create a connection to an on-disk SQLite database
> dbConn <- dbConnect(SQLite(),
+    "./COIgenes.sqlite") # path to new database file
> 
> # Import sequences from a GenBank formatted file
> Seqs2DB(paste(data_dir,
+       "/Pythium_spp_COI.gb",
+       sep=""),
+    type="GenBank",
+    dbFile=dbConn,
+    identifier="Pythium")
Reading GenBank file chunk 1


488 total sequences in table Seqs.
Time difference of 0.29 secs


> 
> # View the database table that was constructed
> BrowseDB(dbConn)
> 
> # Retrieve the imported sequences
> dna <- SearchDB(dbConn)
Search Expression:
select row_names, sequence from _Seqs where
row_names in (select row_names from Seqs)


DNAStringSet of length: 488
Time difference of 0.02 secs


> dna
  A DNAStringSet instance of length 488
      width seq                   names               
  [1]  1277 ATGAATTTT...GTTATTCTT 1
  [2]  1277 ATGAATTTT...GTTATTTTT 2
  [3]  1095 TATATAATG...TATTTTTTT 3
  [4]  1299 ATGAATTTT...ATTACATTT 4
  [5]  1109 CATCATTTA...TATAGGTGT 5
  ...   ... ...
[484]   673 AAATCATAA...TTATTCCAA 484
[485]   680 AATCATAAA...ACATTTATT 485
[486]   680 AATCATAAA...ACATTTATT 486
[487]   680 AATCATAAA...ACATTTATT 487
[488]   680 AATCATAAA...ACATTTATT 488
> 
> # Align the sequences based on their translations
> DNA <- AlignTranslation(dna)
Determining distance matrix based on shared 5-mers:
  |============================================| 100%


Time difference of 0.65 secs


Clustering into groups by similarity:
  |============================================| 100%


Time difference of 1.11 secs


Aligning Sequences:
  |============================================| 100%


Time difference of 2.56 secs


Determining distance matrix based on alignment:
  |============================================| 100%


Time difference of 0.27 secs


Reclustering into groups by similarity:
  |============================================| 100%


Time difference of 0.64 secs


Realigning Sequences:
  |============================================| 100%


Time difference of 2.43 secs


> DNA
  A DNAStringSet instance of length 488
      width seq                   names               
  [1]  1299 ATGAATTTT...--------- 1
  [2]  1299 ATGAATTTT...--------- 2
  [3]  1299 ---------...--------- 3
  [4]  1299 ATGAATTTT...ATTACATTT 4
  [5]  1299 ---------...--------- 5
  ...   ... ...
[484]  1299 ---------...--------- 484
[485]  1299 ---------...--------- 485
[486]  1299 ---------...--------- 486
[487]  1299 ---------...--------- 487
[488]  1299 ---------...--------- 488
> 
> # Display the sequences in a web browser
> BrowseSeqs(DNA)
> # show differences with the first sequence
> BrowseSeqs(DNA, highlight=1)
> # show differences with the consensus sequence
> BrowseSeqs(DNA, highlight=0)
> # change the degree of consensus
> BrowseSeqs(DNA, highlight=0, threshold=0.2)
> 
> # note the pattern common to most sequences
> pattern <- DNAStringSet("TAGATTTAGCWATTTTTAGTTTACA")
> BrowseSeqs(DNA,
+    patterns=pattern)
> 
> # The protein sequences are very similar
> AA <- AlignTranslation(dna, type="AAStringSet")
Determining distance matrix based on shared 5-mers:
  |============================================| 100%


Time difference of 0.66 secs


Clustering into groups by similarity:
  |============================================| 100%


Time difference of 0.93 secs


Aligning Sequences:
  |============================================| 100%


Time difference of 2.51 secs


Determining distance matrix based on alignment:
  |============================================| 100%


Time difference of 0.26 secs


Reclustering into groups by similarity:
  |============================================| 100%


Time difference of 0.61 secs


Realigning Sequences:
  |============================================| 100%


Time difference of 1.96 secs


> BrowseSeqs(AA, highlight=1)
> 
> # Choose a reference for frameshift correction
> REF <- translate(dna[11]) # sequence #11
> 
> # Correct the frameshift in sequence #12
> correct <- CorrectFrameshifts(myXStringSet=dna[12],
+    myAAStringSet=REF,
+    type="both")


Assessing frameshifts in nucleotide sequences:
  |============================================| 100%


Time difference of 0.05 secs
> correct
$indels
$indels$`12`
$indels$`12`$insertions
integer(0)


$indels$`12`$deletions
[1] 813


$indels$`12`$distance
[1] 0


$indels$`12`$index
[1] 1






$sequences
  A DNAStringSet instance of length 1
    width seq                     names               
[1]  1089 TATAATGTTG...TTGGTTATTC 12


> dna[12] <- correct$sequence
> 
> # Sequence #11 is now identical to #12
> DNA <- AlignTranslation(dna)
Determining distance matrix based on shared 5-mers:
  |============================================| 100%


Time difference of 0.63 secs


Clustering into groups by similarity:
  |============================================| 100%


Time difference of 1.08 secs


Aligning Sequences:
  |============================================| 100%


Time difference of 3.05 secs


Determining distance matrix based on alignment:
  |============================================| 100%


Time difference of 0.32 secs


Reclustering into groups by similarity:
  |============================================| 100%


Time difference of 0.67 secs


Realigning Sequences:
  |============================================| 100%


Time difference of 1.88 secs


> BrowseSeqs(DNA, highlight=11)
> 
> # Identify clusters for primer design
> d <- DistanceMatrix(DNA)
  |============================================| 100%


Time difference of 0.37 secs


> dim(d) # a symmetric matrix
[1] 488 488
> c <- IdClusters(d,
+    method="UPGMA",
+    cutoff=0.05,
+    show=TRUE)
  |============================================| 100%


Time difference of 1.66 secs


> head(c) # cluster numbers
  cluster
1      21
2      21
3      20
4      32
5       2
6      20
> 
> # Identify sequences by cluster name in the database
> Add2DB(data.frame(identifier=paste("cluster",
+       c$cluster,
+       sep="")),
+    dbConn)
Expression:
update Seqs set identifier = :identifier where
row_names = :row_names


Added to table Seqs:  "identifier".
Time difference of 0.02 secs


> BrowseDB(dbConn)
> 
> # Design primers for next-generation sequencing
> primers <- DesignSignatures(dbConn,
+    type="sequence",
+    resolution=5,
+    levels=5,
+    minProductSize=400,
+    maxProductSize=800,
+    annealingTemp=55,
+    maxPermutations=8)
Tallying 8-mers for 48 groups:
  |============================================| 100%


Time difference of 2.15 secs


Designing primer sequences based on the group 'cluster2':
  |============================================| 100%


Time difference of 130.93 secs


Selecting the most common primer sequences:
  |============================================| 100%


Time difference of 48.39 secs


Determining PCR products from each group:
  |============================================| 100%


Time difference of 110.41 secs


Scoring primer pair combinations:
  |============================================| 100%


Time difference of 11.17 secs


Choosing optimal forward and reverse pairs:
  |============================================| 100%


Time difference of 64.59 secs
> primers[1,] # the top scoring primer set
        forward_primer    reverse_primer        score
1 KCRCAACCWGGTAATCAAAT AAWCCWGGWGCTCTCAT 0.047509....
      coverage products
1 0.916666....      424
                                                                                                similar_signatures
1 (cluster32, cluster6); (cluster36, cluster24); (cluster21, cluster20, cluster2, cluster12, cluster18, cluster44)
                         missing_signatures
1 cluster46, cluster45, cluster8, cluster41
> 
> # Highlight the primers' target sites
> BrowseSeqs(DNA,
+    patterns=c(DNAStringSet(primers[1, 1]),
+       reverseComplement(DNAStringSet(primers[1, 2]))))
< Previous Part Next Part >