1: Introduction
The goal of protHMM is to help integrate profile hidden Markov model (HMM) representations of proteins into machine learning and bioinformatics workflows. protHMM ports a number of features from use with Position Specific Scoring Matrices (PSSMs) to HMMs, along with implementing features designed for HMMs specifically, which to our knowledge has not been done before. The adoption of HMM representations of proteins derived from HHblits and HMMER also presents an opportunity for innovation; it has been shown that HMMs can benefit from better multiple sequence alignment than PSSMs and thus achieve better results than corresponding PSSMs using similar feature extraction techniques (Lyons et al. 2015). protHMM implements 20 different feature extraction techniques to provide a comprehensive set of features for bioinformatics tasks ranging from protein fold classification to protein-protein interaction prediction.
2: Autocorrelation: hmm_MB, hmm_MA, hmm_GA
protHMM implements normalized Moreau-Broto, Moran and Geary autocorrelation descriptors, each of which measures the correlation between two amino acid residues separated by a lag value (lg) along the sequence. Each of these features returns a vector of length \(20 \times lg\). Liang, Liu, and Zhang (2015) provide a mathematical representation for each of these features, having used them to predict protein structural class:
Moreau-Broto
\({N_{lg}^j = \frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} H_{i, j} \times H_{i+lg, j}, \space (j = 1,2..20, \space 0 < lg < L)}\)
In which \({N_{lg}^j}\) is the Moreau-Broto correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence and \(H\) represents the HMM matrix.
Moran
\({N_{lg}^j = \frac{\frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} (H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{\frac {1} {L} \sum_{i = 1} ^ {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2..20, \space 0 < lg < L)}\)
In which \({N_{lg}^j}\) is the Moran correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.
Geary
\({N_{lg}^j = \frac{\frac{1} {2(L-lg)} \sum_{i = 1} ^ {L-lg} (H_{i, j} - H_{i+lg, j})^2}{\frac {1} {L-1} \sum_{i = 1} ^ {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2..20, \space 0 < lg < L)}\)
In which \({N_{lg}^j}\) is the Geary correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.
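To make the descriptors concrete, below is a minimal sketch of the Moreau-Broto sum for a single lag value, assuming H is an L x 20 matrix of emission probabilities; the helper name moreau_broto is illustrative and not part of protHMM.
moreau_broto <- function(H, lg) {
  L <- nrow(H)
  # N[j] = (1 / (L - lg)) * sum over i of H[i, j] * H[i + lg, j]
  sapply(1:20, function(j) sum(H[1:(L - lg), j] * H[(1 + lg):L, j]) / (L - lg))
}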
Examples
library(protHMM)
## basic example code
h_MB <- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_MA <- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_GA <- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM"))
head(h_MB, 20)
#> [1] 0.0027399767 0.0028318487 0.0030151714 0.0027021756 0.0029396909
#> [6] 0.0028727712 0.0030186241 0.0028945067 0.0029009803 0.0006026105
#> [11] 0.0003812977 0.0004823361 0.0003340628 0.0005073259 0.0003092559
#> [16] 0.0002997833 0.0010281865 0.0004959552 0.0023601628 0.0021576905
head(h_MA, 20)
#> [1] -0.10731521 -0.08432884 0.03573254 -0.15878664 0.03660993 -0.02897298
#> [7] 0.05583864 -0.02260616 -0.01659614 -0.00317749 -0.02603402 -0.01744969
#> [13] -0.03281501 -0.01579394 -0.03647018 -0.03876510 0.03093880 -0.02196930
#> [19] 0.31531271 0.17506839
head(h_GA, 20)
#> [1] 1.0866849 1.0541761 0.9300136 1.1177010 0.9217161 0.9864679 0.9015713
#> [8] 0.9811766 0.9780934 1.0025380 1.0347499 1.0360329 1.0615057 1.0546791
#> [15] 1.0855622 1.0984717 1.0403336 1.1041239 0.6773725 0.8157369
3: Covariance: hmm_AC, hmm_CC
protHMM implements two covariance-based features, autocovariance (AC) and cross covariance (CC). These features calculate the covariance between two residues separated by a lag value lg along the sequence; AC does so for positions in the same column of the HMM, and CC for positions in different columns. It should also be noted that for these features, unlike most others, the HMM matrix is not converted to probabilities. Both features were used for protein fold recognition.
Autocovariance
Dong, Zhou, and Guan (2009) provide a mathematical representation of autocovariance:
\(AC(j, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{L-lg}, \space j = 1,2..20\)
In which \(AC(j,\space lg)\) is the autocovariance factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.
Cross covariance
Dong, Zhou, and Guan (2009) also provide a mathematical representation of cross covariance:
\(CC(j_1, j_2, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j_1} - \bar{H_{j_1}}) \times (H_{i+lg, j_2} - \bar{H_{j_2}})}{L-lg}, \space j_1,j_2 = 1, 2, 3...20, \space j_1 \neq j_2\)
In which \(CC(j_1,\space j_2,\space lg)\) is the cross covariance factor for columns \(j_1, \space j_2\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_{j_i}}\) represents the average of column \(j_i\) in matrix \(H\) and \(H\) represents the HMM matrix.
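As a rough illustration, the autocovariance for a single column j and lag lg can be written as below, assuming H is an L x 20 matrix; this is a sketch of the formula, not protHMM's internal implementation.
auto_cov <- function(H, j, lg) {
  L <- nrow(H)
  H_bar <- mean(H[, j])  # the column mean, H_j bar in the formula
  sum((H[1:(L - lg), j] - H_bar) * (H[(1 + lg):L, j] - H_bar)) / (L - lg)
}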
Examples
library(protHMM)
## basic example code
h_AC <- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_CC <- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM"))
head(h_AC, 20)
#> [1] 2027588 2048277 2069394 2090950 8161072 8244348 8329341 8416105 3815863
#> [10] 3854800 3894540 3935108 2055989 2076968 2098380 2120238 4671700 4719371
#> [19] 4768024 4817691
head(h_CC, 20)
#> [1] 432613.90 607445.71 679287.41 527806.56 446405.28 989099.80 188119.73
#> [8] 479781.58 547044.82 29788.07 892089.22 946319.61 520382.72 820507.43
#> [15] 297221.83 194817.35 309185.70 286636.41 873835.20 -23142.80
4: Bigrams and Trigrams
The bigrams and trigrams features are outlined by Lyons et al. (2015); they capture the likelihood of 2 (bi) or 3 (tri) amino acids occurring consecutively in the sequence. They take the shape of a 20 x 20 matrix for bigrams and a 20 x 20 x 20 array for trigrams, which are then flattened to create vectors of length 400 and 8000, respectively. These features were used by Lyons et al. (2015) for protein fold recognition.
Bigrams
Lyons et al. (2015) outline bigrams mathematically as \(B[i, j]\), where:
\({B[i, j] = \sum_{a = 1}^{L-1} H_{a, i}H_{a+1, j}}\)
where \({H}\) corresponds to the original HMM matrix and \({L}\) is the number of rows in \({H}\).
Trigrams
Lyons et al. (2015) outline trigrams mathematically as \(B[i, j, k]\), where:
\({B[i, j, k] = \sum_{a = 1}^{L-2} H_{a, i}H_{a+1, j}H_{a+2, k}}\)
where \({H}\) corresponds to the original HMM matrix and \({L}\) is the number of rows in \({H}\).
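The bigram sum can be expressed as a single matrix cross-product; the sketch below assumes H is an L x 20 matrix and mirrors the summation above rather than protHMM's exact code.
bigrams <- function(H) {
  L <- nrow(H)
  B <- t(H[1:(L - 1), ]) %*% H[2:L, ]  # B[i, j] = sum over a of H[a, i] * H[a + 1, j]
  as.vector(B)                         # flatten to a length-400 vector
}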
Examples
library(protHMM)
## basic example code
h_tri<- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_bi<- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
head(h_tri, 20)
#> [1] 0.005352191 0.011975355 0.019514992 0.008968399 0.018890547 0.008317675
#> [7] 0.013992176 0.016166307 0.022135295 0.004365804 0.013618584 0.020090544
#> [13] 0.008747969 0.012091065 0.024533920 0.019144562 0.017654663 0.006009293
#> [19] 0.006042909 0.011424255
head(h_bi, 20)
#> [1] 0.27125769 0.11864240 0.22349757 0.35406729 0.17926196 0.31284113
#> [7] 0.15635251 0.24893749 0.31070932 0.43189681 0.08388694 0.26433238
#> [13] 0.39767028 0.19981994 0.24877376 0.51622764 0.38255083 0.36514895
#> [19] 0.12440151 0.13735055
5: CHMM
The CHMM feature begins by constructing 4 matrices, \({A, B, C, D}\), from the original HMM \({H}\): \({A}\) contains the first 75 percent of the rows of \({H}\), \({B}\) the last 75 percent, \({C}\) the middle 75 percent and \({D}\) the entire original matrix (An et al. 2019). These are then merged to create the new CHMM \({Z}\). From there, the bigrams feature is calculated as a flattened 20 x 20 matrix \({B}\), in which
\({B[i, j] = \sum_{a = 1}^{L-1} Z_{a, i} \times Z_{a+1, j}}\)
\({Z}\) corresponds to the merged CHMM matrix, and \({L}\) is the number of rows in \({Z}\) (An et al. 2019). Local Average Group, or LAG, is calculated by first splitting the CHMM into \(j\) groups and then computing the following:
\(LAG(k) = \frac{20}{L}\sum_{p = 1}^{N/20} Mt[p+(i-1)\times(N/20),\space j]\)
Where \({L}\) is the number of rows in \({Z}\) and \(Mt[p+(i-1)\times(N/20),\space j]\) represents the row vector of the CHMM at the \(i\)th position in the \(j\)th group (An et al. 2019). These two features are then fused, and the fused vector of length 800 is returned along with the original length-400 bigrams and LAG vectors. The CHMM features were used to predict protein-protein interactions.
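A sketch of the CHMM construction step is given below, assuming H is an L x 20 matrix; the exact rounding of the 75 percent splits is an assumption here, with the precise boundaries defined by An et al. (2019).
make_chmm <- function(H) {
  L <- nrow(H)
  n <- round(0.75 * L)
  A <- H[1:n, ]                  # first 75 percent of rows
  B <- H[(L - n + 1):L, ]        # last 75 percent
  mid <- round(0.125 * L)
  C <- H[(mid + 1):(mid + n), ]  # middle 75 percent
  rbind(A, B, C, H)              # merge into the CHMM Z
}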
Examples
library(protHMM)
ch <- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_fused <- ch[[1]]
h_lag <- ch[[2]]
h_bigrams <- ch[[3]]
head(h_fused, 20)
#> [1] 0.050431366 0.008073699 0.025377739 0.080125307 0.037055099 0.034279129
#> [7] 0.022591078 0.088067417 0.056547767 0.088639559 0.020870630 0.016466719
#> [13] 0.114605135 0.031219286 0.040658673 0.079269757 0.066999681 0.136328812
#> [19] 0.009850417 0.024292606
head(h_lag, 20)
#> [1] 0.050431366 0.008073699 0.025377739 0.080125307 0.037055099 0.034279129
#> [7] 0.022591078 0.088067417 0.056547767 0.088639559 0.020870630 0.016466719
#> [13] 0.114605135 0.031219286 0.040658673 0.079269757 0.066999681 0.136328812
#> [19] 0.009850417 0.024292606
head(h_bigrams, 20)
#> [1] 0.6644992 0.2932670 0.5686021 0.8736529 0.4484777 0.7703030 0.3734919
#> [8] 0.6135072 0.7469687 1.0671193 0.2097486 0.6524548 0.9846256 0.5010583
#> [15] 0.6275425 1.2841556 0.9755920 0.8980086 0.3038199 0.3414741
6: Distance
The distance feature calculates the cosine distance matrix between two HMMs \({A}, \space{B}\), before dynamic time warp is applied to the distance matrix to calculate the cumulative distance between the HMMs, which acts as a measure of similarity (Lyons et al. 2016). The cosine distance matrix \({D}\) is calculated with:
\({D[a_i, b_j] = 1 - \frac{a_ib_j^{T}}{\sqrt{a_ia_i^T}\sqrt{b_jb_j^T}}}\)
in which \({a_i}\) and \({b_j}\) refer to row vectors of \({A}\) and \({B}\) respectively (Lyons et al. 2016). This in turn means that \(D\) is of dimensions \({nrow(A), nrow(B)}\). Dynamic time warp then calculates the cumulative distance by computing the matrix:
\(C[i, j] = min(C[i-1, j], C[i, j-1], C[i-1, j-1]) + D[i, j]\)
where \({C_{i,j}}\) is 0 when \({i}\) or \({j}\) is less than 1 (Lyons et al. 2016). The bottom-right element of the matrix \({C}\) is then returned as the cumulative distance between the proteins. The distance feature was used by Lyons et al. (2016) for protein fold recognition.
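The recurrence translates directly into R; the sketch below pads C with the zero boundary described above and works for any precomputed distance matrix D.
dtw_cumulative <- function(D) {
  n <- nrow(D); m <- ncol(D)
  C <- matrix(0, n + 1, m + 1)  # first row and column act as the zero boundary
  for (i in 1:n) for (j in 1:m)
    C[i + 1, j + 1] <- min(C[i, j + 1], C[i + 1, j], C[i, j]) + D[i, j]
  C[n + 1, m + 1]  # cumulative distance between the two proteins
}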
Example
library(protHMM)
# basic example code
h <- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"),
                  system.file("extdata", "1TEN-7", package="protHMM"))
h
#> [1] 100
7: FP_HMM
FP_HMM consists of two vectors, \({d, s}\). Vector \({d}\) corresponds to the sums across the sequence for each of the 20 amino acid columns (Zahiri et al. 2013). Vector \({s}\) corresponds to a flattened matrix \(S\), where:
\({S[i, j] = \sum_{k = 1}^{L} H[k, j] \times \delta[k, i]}\)
in which \({\delta[k, i] = 1}\) when the amino acid at position \(k\) of the sequence is \({A_i}\), and 0 otherwise (Zahiri et al. 2013). \({A}\) refers to a list of all possible amino acids, and \({i, j}\) span from \({1:20}\). This feature has been used for the prediction of protein-protein interactions.
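A rough sketch of both vectors follows, assuming H is an L x 20 matrix and residues is the length-L protein sequence as a character vector; the alphabetical ordering of aa is an assumption, and fp_sketch is not a protHMM function.
fp_sketch <- function(H, residues) {
  aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]  # assumed amino acid ordering
  d <- colSums(H)                                  # sums down each of the 20 columns
  S <- sapply(1:20, function(j)
    sapply(aa, function(a) sum(H[residues == a, j])))  # delta[k, i] = 1 where residues[k] == aa[i]
  list(d = d, s = as.vector(S))
}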
Example
library(protHMM)
# basic example code
h <- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[[1]]
#> [1] 5.327650 2.495163 4.226204 6.557727 3.430360 6.392750 2.638661 5.230106
#> [9] 5.887156 7.882747 1.510770 4.522352 6.408192 3.615053 4.486816 9.137523
#> [17] 7.110800 7.647033 1.971206 2.521788
head(h[[2]], 20)
#> [1] 0.00000000 0.06906144 0.25848195 0.61607168 0.21594299 0.21617140
#> [7] 0.24383966 0.22567252 0.18595546 0.51564480 0.00000000 0.15273093
#> [13] 0.63770076 0.00000000 0.24438102 0.21709957 0.46775998 0.89909902
#> [19] 0.10940189 0.05263446
8: HMM_GSD
This feature was created by Jin and Zhu (2021) and begins by creating a grouping matrix \(G\), assigning each position a number \(1, 2, 3\) based on the value at the corresponding position of HMM matrix \(H\); \(1\) represents the low probability group, \(2\) the medium and \(3\) the high probability group. The number of total points in each group for each column is then calculated, and the sequence is split based upon the positions of the 1st, 25th, 50th, 75th and 100th percentile (last) points for each of the three groups, in each of the 20 columns of the grouping matrix. Thus for column \(j\):
\({S(k, \space j, \space z) = \sum_{i = 1}^{0.25 \times z \times N} |G[i,\space j] = k|}\)
where \({k}\) is the group number, \({z = 1,2,3,4}\) and \({N}\) corresponds to number of rows in matrix \({G}\) (Jin and Zhu 2021).
Example
library(protHMM)
# basic example code
h <- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1:40]
#> [1] 1.11117557 0.83022094 -0.43407491 -1.50170251 -1.67027529 0.91450733
#> [7] 0.54926631 0.26831167 -0.01264296 -1.58598890 1.02688918 0.94260279
#> [13] 0.71783909 0.43688445 -1.61408437 1.11117557 0.32450260 -0.34978852
#> [19] -0.99598417 -1.67027529 0.83022094 0.71783909 0.52117084 0.24021621
#> [25] -1.38932066 0.63355270 0.63355270 0.63355270 0.63355270 -1.55789344
#> [31] 1.11117557 0.46497992 -0.34978852 -1.13646149 -1.67027529 1.05498465
#> [37] 0.85831640 0.52117084 0.18402528 -1.50170251
9: Pseudo HMM: IM_psehmm and pse_hmm
Both of the pseudo HMM features, pse_hmm and IM_psehmm, are based on Chou and Shen (2007)’s pseudo PSSM concept, which was engineered to create fixed-length representations of proteins while keeping sequence-order information. IM_psehmm stands for improved pseudo HMM and is based on Ruan et al. (2020)’s improvements to Chou and Shen (2007)’s initial offering; because Ruan et al. (2020)’s feature is still based on the PSSM matrix, this feature is a port for use with HMMs.
pse_hmm
Mathematically, pse_hmm is the fusion of 2 vectors. The first contains the means of the 20 amino acid emission frequency columns of the HMM. The second can be found in the following equation (Chou and Shen 2007):
\(V_{j}^{i} = \frac{1}{L-i}\sum_{k = 1}^{L-i}H[k, j] \times H[k+i, j], \space j = 1,2..20, \space i < L\)
\(H\) represents the HMM matrix, \(L\) represents the number of rows of \(H\), \(i\) represents the number of contiguous amino acids that correlation is measured across and \(V_{j}^{i}\) represents the value of the vector for the \(j\)th column and \(i\)th most contiguous amino acids. This results in a vector of length \(20 + g \times 20\), where \(g\) is the maximum value of \(i\).
IM_psehmm
IM_psehmm is also the fusion of two vectors, one being the same means vector described above. The second can be found as (Ruan et al. 2020):
\(V_{j}^{d} = \frac{1}{20-d}\sum_{k = 1}^{L}H[k, j] \times H[k, j+d], \space j = 1,2..20, \space d < 20\)
Where \(H\) is the HMM matrix, \(L\) represents the number of rows in \(H\) and \(V_{j}^{d}\) represents the value in the feature vector for column \(j\) and column distance parameter \(d\). This results in a vector of length \({20+20\times d-d\times\frac{d+1}{2}}\).
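For illustration, the column-correlation term for a single distance d might look like the sketch below, assuming H is an L x 20 matrix; im_component is a hypothetical helper, not a protHMM function.
im_component <- function(H, d) {
  # V[j] = (1 / (20 - d)) * sum over k of H[k, j] * H[k, j + d]
  sapply(1:(20 - d), function(j) sum(H[, j] * H[, j + d]) / (20 - d))
}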
10: Local Binary Pattern
This feature is based on Li et al. (2019)’s work using local binary patterns to extract features from PSSMs for use in protein-protein interaction prediction. The local binary pattern extraction used here can be defined as \(B = b(s(H_{m, n+1}-H_{m, n}), s(H_{m+1, n+1}-H_{m, n}), s(H_{m+1, n}-H_{m, n})...s(H_{m-1, n+1}-H_{m, n}))\), where \(H[m,n]\) signifies the center of a local binary pattern neighborhood (\(2<m<L-1, \space 2<n<20, \space L = nrow(H)\)), \(H\) refers to the HMM matrix, \(s(x) = 1\) if \(x \ge 0\) and \(s(x) = 0\) otherwise, and \(b\) refers to a function that converts a binary number to a decimal number. The local binary pattern features are then put into a histogram with 256 bins, one for each possible binary value, which is returned as a vector of length 256.
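The encoding of a single neighborhood can be sketched as below; the clockwise visiting order of the eight neighbors and the bit weights are assumptions, since only the first and last neighbors are spelled out above.
lbp_code <- function(H, m, n) {
  neighbors <- c(H[m, n + 1], H[m + 1, n + 1], H[m + 1, n], H[m + 1, n - 1],
                 H[m, n - 1], H[m - 1, n - 1], H[m - 1, n], H[m - 1, n + 1])
  bits <- as.integer(neighbors - H[m, n] >= 0)  # s(x) applied to each neighbor
  sum(bits * 2^(7:0))                           # b(): binary code to decimal
}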
Example
library(protHMM)
## basic example code
h <- hmm_LBP(system.file("extdata", "1TEN-7", package="protHMM"))
h[1:20]
#> [1] 166 31 16 18 35 8 4 7 12 6 3 1 7 0 2 0 19 4 5
#> [20] 2
11: Linear Predictive Coding
In this feature, linear predictive coding (LPC) is used to extract a vector \(d\) of length \(g\) from each of the 20 amino acid emission frequency columns of the HMM matrix. Linear predictive coding considers that position \(H[i, j]\) in the HMM matrix can be approximated by a linear combination of the \(g\) preceding positions in its column, \(\sum_{k = 1}^{g} d_{k, j}H[i-k, j]\). The feature vector extracted is thus the concatenation of the coefficient vectors \(d\); this is done through the phonTools package (Barreda 2015). The function assumes a \(g\) of 14, and thus the feature vector generated is of length \(14\times20 = 280\). This feature is based on Qin et al. (2015)’s work, in which they used an LPC feature set to predict protein structural class.
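As a stand-in illustration, base R's Yule-Walker autoregression fits coefficients of the same linear-prediction form for one column; protHMM itself relies on phonTools for this step, so the sketch below only mirrors the idea.
col_lpc <- function(x, g = 14) {
  fit <- ar.yw(x, aic = FALSE, order.max = g)  # fit an order-g linear predictor
  fit$ar                                       # the g coefficients for this column
}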
Example
library(protHMM)
## basic example code
h <- hmm_LPC(system.file("extdata", "1TEN-7", package="protHMM"))
h[1:20]
#> [1] 1.00000000 0.74635054 0.59126153 0.36331602 0.36010191 0.30277057
#> [7] 0.47978532 0.41120113 0.29511047 0.28022645 0.20143373 0.24002696
#> [13] 0.15302503 0.04457104 1.00000000 0.71845887 0.84727202 0.92128401
#> [19] 1.02748498 0.83287042
12: SCSH
This feature is documented by Mohammadi et al. (2022) and was used for protein-protein interaction prediction. The SCSH feature returns the k-mer composition between a protein’s consensus sequence, given by the HMM, and the protein’s actual sequence, for k = 2, 3. First, all possible k-mers for the 20 possible amino acids are calculated (\({20^2}\) and \({20^3}\) permutations for 2 and 3-mers respectively). From those permutations, vectors of length 400 and 8000 are created, vectors \(v_2, \space v_3\). Each position in the vectors corresponds to a specific k-mer, i.e. \(v_2[1] = AA\) and \(v_3[1] = AAA\). Then, the consensus sequence that corresponds to the HMM scores is extracted and put into a bipartite graph with the actual protein sequence. Each path of length 1 or 2 is found on the graph, and the corresponding vertices are noted as possible 2 and 3-mers. For each 2 or 3-mer found from these paths, 1 is added to the position that corresponds to that k-mer in the length 400 and 8000 vectors created previously. The vectors are then returned.
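To make the vector indexing concrete, each k-mer maps to one slot via its letters' positions in the amino acid alphabet, as in the sketch below; the exact ordering protHMM uses is an assumption here.
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]  # assumed ordering, so v_2[1] corresponds to "AA"
two_mer_index <- function(a1, a2) (match(a1, aa) - 1) * 20 + match(a2, aa)
two_mer_index("A", "A")  # 1
two_mer_index("Y", "Y")  # 400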
Examples
library(protHMM)
## basic example code
h <- hmm_SCSH(system.file("extdata", "1TEN-7", package="protHMM"))
## 2-mers
h[[1]][1:20]
#> [1] 0 0 0 1 0 2 0 0 1 1 1 0 2 0 1 0 2 1 0 0
## 3-mers: specific indexes as most of the vector is 0
h[[2]][313:333]
#> [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 0
13: Separated Dimers
Separated Dimers refers to the probability of an amino acid dimer occurring between residues separated by a distance \(l\) along the sequence. Saini et al. (2015) conceived of this feature and applied it to protein fold recognition; mathematically, separated dimers calculates matrix \(F\):
\({F[m, n] = \sum_{i = 1}^{L-l} H_{i, m}H_{i+l, n}}\)
\({H}\) corresponds to the original HMM matrix, and \(L\) is the number of rows in \(H\). Matrix \(F\) is then flattened to a feature vector of length 400, and returned.
Examples
library(protHMM)
## basic example code
h <- hmm_SepDim(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1:40]
#> [1] 0.28073204 0.13127205 0.22465162 0.30993504 0.17705352 0.30286268
#> [7] 0.13611039 0.27110569 0.26843594 0.42275865 0.08623105 0.23523189
#> [13] 0.28262058 0.17449219 0.23466631 0.45744566 0.38400417 0.41489217
#> [19] 0.06656876 0.12644926 0.12468473 0.02787985 0.17961578 0.17639461
#> [25] 0.04536077 0.33523934 0.07878719 0.08306272 0.13034049 0.14246328
#> [31] 0.04005923 0.13693382 0.09324879 0.10231668 0.09851687 0.29441110
#> [37] 0.16511460 0.15335638 0.03221440 0.03884258
14: Single Average Group
Nanni, Lumini, and Brahnam (2014) pioneer the Single Average Group feature and use it for protein classification. This feature groups together rows that are related to the same amino acid, using a vector \({SA(k)}\), in which \({k}\) spans \({1:400}\) and:
\({SA(k) = avg_{i \space = \space 1, 2... L}\space H[i, j] \times \delta(P(i), A(z))}\)
in which \({H}\) is the HMM matrix, \({P}\) is the protein sequence, \({A}\) is an ordered set of amino acids, the variables \({j, \space z = 1, 2, 3...20}\) and the variable \({k = j + 20 \times (z-1)}\) when creating the vector. \({\delta()}\) represents Kronecker’s delta.
Example
library(protHMM)
## basic example code
h <- hmm_Single_Average(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1:40]
#> [1] 5.327649515 2.495163167 4.226203618 6.557727068 3.430359915 6.392750239
#> [7] 2.638660734 5.230106250 5.887155542 7.882747181 1.510770387 4.522351690
#> [13] 6.408192096 3.615052651 4.486815933 9.137522926 7.110799532 7.647033087
#> [19] 1.971206424 2.521787664 0.069061438 1.448076555 0.000000000 0.000000000
#> [25] 0.040982786 0.037528025 0.034250848 0.033810905 0.000000000 0.033966298
#> [31] 0.018141963 0.012174447 0.061500560 0.009964409 0.031380236 0.049529900
#> [37] 0.045531222 0.055012065 0.008144264 0.011430244
15: Smoothed HMM
This feature extraction technique is found in Fang, Noguchi, and Yamana (2013), where it is used in conjunction with another technique for ligand binding site prediction. This feature smooths the HMM matrix \(H\) by using a sliding window of length \(sw\) to incorporate information from up and downstream residues into each row of the HMM matrix. For each HMM row:
\(r_i = \sum_{x \space = \space i-\frac{sw}{2}}^{i+\frac{sw}{2}}{r_x}\)
for \({i = 1, 2, 3...L}\), where \({L}\) is the number of rows in \({H}\). So that the window is defined at the beginning and ending rows, zero matrices of dimensions \(sw/2 \times 20\) are appended to each end of \({H}\).
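A minimal sketch of the smoothing step, assuming H is an L x 20 matrix and an even window size sw; the zero padding follows the description above.
smooth_hmm <- function(H, sw = 6) {
  L <- nrow(H)
  pad <- matrix(0, sw / 2, 20)  # zero rows so boundary windows are defined
  P <- rbind(pad, H, pad)
  # row i becomes the sum of original rows i - sw/2 through i + sw/2
  t(sapply(1:L, function(i) colSums(P[i:(i + sw), , drop = FALSE])))
}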
Example
library(protHMM)
## basic example code
h <- hmm_smooth(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1,]
#> [1] 0.11776641 0.01631112 0.05565213 0.11811762 0.02074633 0.18715081
#> [7] 0.11375485 0.18747835 0.25055389 0.12211269 0.05342971 0.12007436
#> [13] 0.07683993 0.16693724 0.01869764 0.24692156 0.30136984 0.67903030
#> [19] 0.04331468 0.10385260
16: Singular Value Decomposition
This feature extraction method is found in Song et al. (2018), and uses singular value decomposition (SVD) to extract a feature vector from an HMM matrix. SVD factorizes the matrix \(H\) of dimensions \({i, j}\) into
\({U[i, r] \times \Sigma[r, r] \times V[r, j]}\)
The diagonal values of \({\Sigma}\) are known as the singular values of matrix \(H\), and are what this function returns. This feature was used to predict protein-protein interactions.
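Since only the singular values are returned, the core of this feature reduces to base R's svd(); a one-line sketch, assuming H is the L x 20 matrix produced by hmm_read.
singular_values <- svd(H)$d  # length-20 vector of singular values of H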
Example
library(protHMM)
# basic example code
h <- hmm_svd(system.file("extdata", "1DLHA2-7", package="protHMM"))
h
#> [1] 2.4328414 1.1318859 1.0331369 0.9046638 0.8595964 0.6696056 0.6425755
#> [8] 0.5865227 0.5189927 0.5155567 0.4701584 0.4118091 0.3627740 0.3249188
#> [15] 0.2858293 0.2615811 0.2576213 0.2251676 0.2074293 0.1242763
17: Read
This function reads in the 20 amino acid emission frequency columns used in the feature extraction methods discussed previously, and converts the columns into probabilities.
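For reference, HH-suite .hhm files store emission frequencies as scores of the form -1000 x log2(p), with * marking zero probability; below is a sketch of the conversion under that assumption.
to_prob <- function(x) {
  # invert score = -1000 * log2(p); "*" entries denote probability zero
  p <- suppressWarnings(2^(-as.numeric(x) / 1000))
  p[x == "*"] <- 0
  p
}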
Example
library(protHMM)
## basic example code
h <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1,]
#> [1] 0.0000000 0.0000000 0.0000000 0.3410370 0.0000000 0.0000000 0.0000000
#> [8] 0.3382121 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.3208565 0.0000000 0.0000000