盛夏不等式: Consensus and Profile (ID: CONS)

Problem

給定數條等長DNA序列（存於 FASTA 檔），求其 consensus string 及各位置的鹼基統計量

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

示範數據

# input
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

# output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

Background

Consensus string 是由數條等長字串各位置出現頻率最高的字符所串接而成的字串 [1]，而 profile matrix 則是紀錄字符於各字串不同位置出現頻率的矩陣。核酸和蛋白質序列的 consensus string 又被稱為 consensus sequence，常用於需要描述序列變異的場合。

譜系分析

若某批序列可追溯至一條序列，則這條序列即是該批序列的 common ancestor，即所有其他序列皆突變自 common ancestor 及其衍生序列。假設突變為隨機出現且發生的機率不高，則一條與其他序列鹼基差異最小的序列，即是最有可能的 common ancestor。是以，辨識一批序列的 consensus sequence 也是推論 most likely common ancestor 的手段之一。

基因變異分析

Single nucleotide polymorphism (SNP) 是指基因體某位點的核酸種類在族群中存在變異。由於 consensus sequence 反映了各位置出現頻率最高的核酸種類，因此比對測序列與得自族群的 consensus sequence，便能找出變異（與族群不同）的位點。

校正定序錯誤

PacBio 的 Single Molecule, Real-Time (SMRT) 定序技術可以實踐單分子即時定序，從而避免 PCR 偏誤並大幅增加定序長度。相較於次世代定序，SMRT 技術的定序錯誤率普遍較高，不過此技術產生的定序錯誤乃隨機分布於序列各位置，而不像次世代定序的定序錯誤往往出現在序列末端。

定序錯誤隨機分布的特性讓 SMRT 技術可以透過加深定序深度來改善結果品質。簡言之，若定序品質不太離譜，則序列上各位置的正確鹼基數會多於錯誤鹼基數。當定序數量提升到足以顯現正確鹼基和錯誤鹼基的數量差時，即可透過辨識 consensus string 來獲取正確的序列。

Solution

此題可分解為以下步驟：讀取 .fasta 檔、統計各位置鹼基出現頻率、彙整 profile matrix、生成 Consensus string、整理數據為 Rosalind 要求的格式。依照 Python 和 R 的資料類別差異，我採取了不同的方式解題。

Python

我的策略是依序取出各位置的鹼基，再統計四種鹼基的出現頻率，存為 dictionary（key：鹼基種類，value：出現頻率）。接著，將這些 dictionary 存為 list，即完成 profile matrix。因為 list 中每個 dictionary 都記錄著各位置的鹼基統計量，所以只要依序從 dictionary 取出 value 最大的 key，即可獲得 consensus sequence。

首先讀取 .fasta 為 dictionary，方法可參考先前的解題筆記：Computing GC Content (ID: GC)

# Read a .fasta file into a dictionary
def read_fasta (path):
    f = open(path,  "r")
    seqs = {}
    for line in f:
        if line.startswith(">"):
            ID = line.lstrip(">").replace("\n", "")
            seqs[ID] = "" 
        else:
            seqs[ID] = seqs[ID] + line.replace("\n", "")       
    return seqs

接著統計各位置四種鹼基的出現頻率。為了達成目的，除了計算鹼基的 function，還需要能取出指定位置字符的 function。簡言之，以 for 迴圈遍歷字串list，並取出每一字串指定位置的字符存入 list [2] 。

# Count DNA nuceotides
def count (seq):
    return {"A": seq.count("A"), "T": seq.count("T"), "C": seq.count("C"), "G": seq.count("G")}
# Extract the Nnd character in a list of strings
def charAt (s, n):
    return [s[i][n] for i in range(len(s))]

將所有序列各位置的鹼基統計量存入 list 以獲得 profile matrix。此處沒有用到實際上的矩陣，但概念上仍紀錄了鹼基統計量，只是要在輸出時調整成題目要求的格式。

# Generate profile matrix
def profiler (seqs):
    profile = []
    for j in range(len(seqs.values()[0])):
        profile.append(count(charAt(seqs.values(), j)))
    return profile

而辨識 consensus sequence 的方法則是依序從 profile matrix 當中，取出 value（出現頻率）最大的 key（鹼基種類）合併為字串。

# Extract the consensus sequence
def cons (seqs):
    return "".join([max(k, key = k.get) for k in profiler(seqs)])

最後，把 profile matrix 當中每個鹼基的統計量取出，前端加上鹼基種類與冒號，中間穿插空格，最後接上換行符，即可輸出題目要求的格式。

# Format profile matrix into the Rosalind-required format
def formater (profile):
    out = ""
    for nt in "ATCG":
         out = out + nt + ": " + " ".join([str(k[nt]) for k in profile]) + "\n"
    return(out)

示範數據的輸出入如下：

> dnas = read_fasta("./exemplified/rosalind_exp_cons.fasta")
> print cons(dnas)
ATGCAACT
> print formater(profiler(dnas))
A: 5 1 0 0 5 5 0 0
T: 1 5 0 0 0 1 1 6
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0

R

我的策略是把序列存為矩陣，其中每個 row 都代表一條序列，column 則紀錄各位置的鹼基種類。接著，統計每個 column 四種鹼基的數量，即可獲得 profile matrix（row: A、T、C、G，column：出現頻率）。最後，取出 profile matrix 各 column 數值最大者的 row names 合併為單一字串，便能輸出 consensus sequence。

首先，讀取 .fasta 檔並存為 list。此處簡化了之前的寫法，在讀取所有數據以後，以 ">" 符號為依據，透過 grepl 分別抓出序列名稱和序列字串，並於移除不必要的字符後，將帶有名稱的向量轉存為 list。然而，這種寫法必須確認序列字串存在同一行。若分為多行紀錄，則無法和序列名稱一一對應，以至於不能正確讀取檔案。

# Read .fasta into list
read_fasta <- function (path) {
  l <- readLines(path)
  seqs <- l[!grepl(">", l)]
  names(seqs) <- gsub(">", "", l[grepl(">", l)])
  return(as.list(seqs))
}

計算鹼基的任務我也用了跟之前 )不同的方法。先前的方法基於 table()，但 table() 只是統計字串內各種字符的數量，假如 A、T、C、G 任一鹼基沒出現的話，輸出的結果連 0 都不會顯示（畢竟 table() 本身不知道我們預期有四種鹼基要統計）。

# Count DNA nucleotides
count <- function(seq) {
  stat <- rep(0, 4)
  names(stat) <- c("A", "T", "C", "G")
  for (i in seq) {stat[i] <-  stat[i] + 1}
  return(stat)
}

由於序列存為 list，為了方便統計各位置不同鹼基的出現頻率，所以我先把序列存為 matrix，一個 row 儲存一條序列， column 則儲存該序列各位置的鹼基。具體的做法則是先把所有序列拆為字符向量，再使用 rbind 與 do.call 合併向量為 matrix。[3]

# Transfer a list of sequences into a matrix
seqs2matrix <- function (seqs) {
  d <- do.call(rbind, lapply(1:length(seqs), function (x) {unlist(strsplit(seqs[[x]], ""))}))
  row.names(d) <- names(seqs)
  return(as.matrix(d))
}

有了鹼基 matrix 以後，只要逐 column 統計鹼基數，再透過 cbind 與 do.call 合併輸出結果，即可彙整出 profile matrix。

# Generate profile matrix
profiler <- function (dnas) {
  dnas <- seqs2matrix(dnas)
  profile <- do.call(cbind, lapply(1:ncol(dnas), function (x) {count(dnas[, x])}))
  return(profile)
}

接著，透過 which.max 取出該位置出現頻率最高的鹼基 row number，即可從 profile matrix 的 row names 獲得該鹼基字符。將各位置最多的鹼基字符串聯起來，即可得出 consensus sequence。

# Find the consensus sequence
find_cons <- function (profile) {
  return(paste0(rownames(profile)[apply(profile, 2, which.max)], collapse = ""))
}

最後，移除 profile matrix 的 column names，並將 row names 加上冒號，即可輸出題目要求的格式。

# Format profile matrix into the Rosalind-required format
formater <- function(profile) {
  prmatrix(profile, 
           rowlab = paste0(row.names(profile), ": "), 
           collab = rep("", ncol(profile)))
}

示範數據的輸出如下：

dnas <- read_fasta("./exemplified/rosalind_exp_cons.fasta")
> find_cons(profiler(dnas))
[1] "ATGCAACT"
> formater(profiler(dnas))

A:  5 1 0 0 5 5 0 0
T:  1 5 0 0 0 1 1 6
C:  0 0 1 4 2 0 6 1
G:  1 1 6 3 0 1 0 0

Discussion

1. Biology: Consensus string and hamming distance

在所有序列等長的條件下，Consensus string 也可被定義為「與所有字串的 hamming distance 之和最小的字串」

2. Python: list 取值

Python 的 list 以 "[]" 表示，可儲存多樣的資料型態，存儲的項目間以 "," 分隔，各項目皆有其序號，編號的起始值為 0，向右逐項 +1。欲取出 list 指定項目可採以下方式：

# Exemplified data
l = ["A", "B", "C", "D", "E"]

取出指定項目，

# 取出0號位置的項目 
> print l[0]
A
# 取出最後一個項目
> print l[-1]
E

取出字串，要留意當寫成 l[0:2] 時，只會取到序號2之前的項目，而不會取到序號2的項目。

# l[start:end:sep]
> print l[0:2]
["A", "B"]
> print l[0:5:2]
["A", "C", "E"]
> print l[2:]
["C", "D", "E"]

3. R: do.call() and Reduce()

由於 lapply 的輸出結果為多個 list，所以無法直接以 rbind (or cbind) 合併輸出結果為 matrix 或 dataframe。這種情況下，可以配合 Reduce 和 do.call 來達到目的。

Reduce 是 R 實踐迴圈的一種方式，專門用於 binary function（即加、減、乘、除或合併等涉及兩個項目的運算或處理）。Reduce 的特性在於會保留執行 binary function 的結果再用於下一輪迴圈，所以能夠達成累加或連續合併資料的效果。 Reduce 的使用模式如下，f 是欲執行的 binary function，x 為輸入的向量。第一個輸入的項目會和 init 一起計算，所以 init 要設定為與 x 相同型態的變項。

Reduce(function, x, init)

# ((((1+2)+3)+4)+5)
> Reduce(f = "+", c(1, 2, 3, 4, 5))
[1] 15
# rbind(rbind(c(1, 2), c(3, 4)), c(5, 6))
>  Reduce(f = rbind, list(c(1, 2), c(3, 4), c(5, 6)), init = c())
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

do.call 則允許 function 接收以 list 儲存的參數。其用法如下，what 是 function，而 args 則是以 list 儲存的參數。

do.call(what, args)

# rbind(c(1,2), c(3,4), c(5, 6))
> do.call(rbind, list(c(1,2), c(3,4), c(5, 6)))
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

綜上所述，lapply 輸出的結果為 list，所以無法直接套用 rbind 或 cbind，Reduce 藉由輪番取出 list 裡接續的兩個項目來迴避 rbind 或 cbind 不能處理 list 的問題；而 do.call 則能允許 rbind 和 cbind 接收 list 存儲的參數來解決問題。

盛夏不等式

Consensus and Profile (ID: CONS)