We tested the accuracy of several prediction methods on benchmark sets. Ten methods were then selected according to their prediction accuracy, their availability and how readily they could be integrated into a consensus method; we also tried to select methods based on different types of algorithm. The selected methods are: HMMTOP, Membrain, Memsat-SVM, Octopus, Philius, Phobius, Pro, Prodiv, Scampi and TMHMM.
We then searched each sequence in TOPDB against the TOPDB database itself using BLAST with an e-value threshold of \(10^{-10}\). A hit was accepted if all of the following conditions held: i) the length of the hit was above 80% of the length of the query sequence; ii) the alignment covered all TM helices in the homologous TOPDB entry; iii) sequence similarity was above 40% within the HSPs. Topology data of the homologous protein and of the entry itself were used in the constrained prediction by mirroring their sequential positions according to the positions of the HSPs.
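The three acceptance conditions can be sketched as a simple filter. This is a minimal illustration, not the pipeline used in this work: the `Hit` record, its field names and the `accept_hit` function are hypothetical, and the sketch assumes that a TM helix counts as covered when it lies entirely within a single HSP range on the homologous entry.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """Hypothetical summary of one BLAST hit (field names are illustrative)."""
    query_length: int                      # length of the query sequence
    hit_length: int                        # length of the hit
    similarity: float                      # similarity within HSPs, as a fraction 0..1
    tm_helices: List[Tuple[int, int]]      # helix (start, end) in the homologous entry
    hsp_ranges: List[Tuple[int, int]]      # HSP (start, end) on the homologous entry


def covers(ranges: List[Tuple[int, int]], start: int, end: int) -> bool:
    """Assumption: a segment is covered if one HSP range contains it completely."""
    return any(a <= start and end <= b for a, b in ranges)


def accept_hit(hit: Hit, min_length_ratio: float = 0.8,
               min_similarity: float = 0.4) -> bool:
    """Apply conditions i) to iii) from the text to a single hit."""
    long_enough = hit.hit_length > min_length_ratio * hit.query_length
    helices_covered = all(covers(hit.hsp_ranges, s, e) for s, e in hit.tm_helices)
    similar_enough = hit.similarity > min_similarity
    return long_enough and helices_covered and similar_enough
```

A hit that fails any single condition is rejected, so the order in which the conditions are evaluated does not affect the outcome.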
The search engine of the TOPDOM homepage was also used to locate those domains/motifs in the human sequences that had earlier been found conservatively on the same side of TMPs, and the position and topological localization of the result(s) were used as constraint(s).
The newly developed consensus prediction algorithm is based on the probabilistic framework provided by the hidden Markov model, so the HMMTOP method can be used for this task. Briefly, the results of the ten prediction methods, together with the available 3D or experimental topology data, are applied in HMMTOP as weighted constraints to obtain the constrained consensus prediction. The weights depend on the per-protein topology or topography accuracies of the methods. The results of the \(i\)th method are:
\(Pred_{i} = l_{1}, l_{2}, \ldots, l_{n}\), \(1\leq i\leq m\)
\(l_{j} \in \{"I","M","O","L","U"\}\), \(1\leq j\leq n\)
where \(m\geq 10\) (the ten prediction methods and zero, one or more 3D/experimental topology constraints), \(n\) is the length of the query sequence and the "I", "M", "O", "L", "U" labels correspond to cytoplasmic loops, membrane spanning segments, non-cytoplasmic loops, membrane re-entrant loops and unknown regions, respectively.
We calculated the per-protein topography (\(Acc_{Tpg}\)) and topology (\(Acc_{Top}\)) accuracies of each method on a "structure benchmark set", and used these values as weights for the constraints. \(Acc_{Tpg}\) was applied at those positions where the prediction method predicted a transmembrane segment or a re-entrant loop (label "M" or "L", respectively); otherwise \(Acc_{Top}\) was used (labels "I" and "O"). For 3D or experimental topology data, the weights were set to 20. For prediction methods, the results of a given prediction were used as constraints only if the prediction was valid, i.e. it contained at least one transmembrane region:
\[W_{i,j} = \left\{ \begin{array}{ll} Acc_{Top}(i), & \text{if } Pred_{i,j}\in \{"I","O"\} \text{ and } type(i) = \text{prediction method} \\ Acc_{Tpg}(i), & \text{if } Pred_{i,j}\in \{"M","L"\} \text{ and } type(i) = \text{prediction method} \\ 20, & \text{if } type(i) = \text{experimental result} \\ \end{array} \right. \]\( \ \ 1\leq i\leq m, \ 1\leq j\leq n\)
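The weighting rule can be illustrated with a short sketch. The function names and the `source_type` strings are hypothetical, and the accuracy values passed in are placeholders supplied by the caller, not measured values. Two assumptions go beyond the text: an "U" label from a prediction method receives weight 0, and an invalid prediction (no "M" label) is represented by all-zero weights, which has the same effect as discarding it.

```python
EXPERIMENTAL_WEIGHT = 20.0  # weight for 3D or experimental topology data, as in the text


def is_valid_prediction(labels: str) -> bool:
    """A prediction is valid if it contains at least one transmembrane label."""
    return "M" in labels


def position_weights(labels: str, source_type: str,
                     acc_top: float = 0.0, acc_tpg: float = 0.0) -> list:
    """Return one weight per sequence position for a single source.

    labels      -- string over the alphabet I, M, O, L, U
    source_type -- "prediction" or "experimental" (hypothetical identifiers)
    acc_top     -- per-protein topology accuracy of the method
    acc_tpg     -- per-protein topography accuracy of the method
    """
    if source_type == "experimental":
        return [EXPERIMENTAL_WEIGHT] * len(labels)
    if not is_valid_prediction(labels):
        # Assumption: invalid predictions contribute nothing.
        return [0.0] * len(labels)
    weights = []
    for label in labels:
        if label in ("I", "O"):
            weights.append(acc_top)
        elif label in ("M", "L"):
            weights.append(acc_tpg)
        else:
            # Assumption: unknown regions ("U") carry no weight.
            weights.append(0.0)
    return weights
```

For example, `position_weights("IIMMOO", "prediction", acc_top=0.9, acc_tpg=0.8)` assigns the topography accuracy to the two "M" positions and the topology accuracy to the remaining four.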
These weights were normalized to one in each sequential position, and were used as constraints in the HMM:
\(C_{j,k} = \cfrac{\sum\limits_{i=1}^{m} W_{i,j}\cdot\Delta (k, Pred_{i,j})}{\sum\limits_{k'=1}^{N} \sum\limits_{i=1}^{m} W_{i,j}\cdot\Delta (k', Pred_{i,j})}, \ \ 1\leq j\leq n, \ 1\leq k\leq N\)
where N is the number of states in the hidden Markov model and
\[\Delta (a,b) = \left\{ \begin{array}{ll} 1, & \text{if } Label (S_{a})=b\\ 0, & \text{if } Label (S_{a})\neq b\\ \end{array} \right. \]\( \ \ 1\leq a\leq N, \ 1\leq b\leq \hat{N}\)
where \(\hat{N} = 4\) is the number of main states (inside, outside, membrane and loop) and \(S\) denotes the states of the hidden Markov model. If the Memsat-SVM or Octopus methods predicted re-entrant loop regions, or re-entrant loop regions were used as 3D or experimental topology constraints, a modified architecture of the HMMTOP algorithm was used, which allows the extra "language rule" in the hidden Markov model.
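The normalization of the weights into per-state constraints can be sketched as follows. This is an illustration on a toy model, not the HMMTOP implementation: `state_labels` is a hypothetical list giving the main label of each HMM state, the denominator is summed over all states of the model, and positions at which no source provides a usable label are assumed to remain unconstrained (all zeros).

```python
def delta(state_label: str, predicted_label: str) -> float:
    """Indicator: 1 if the state's main label equals the predicted label."""
    return 1.0 if state_label == predicted_label else 0.0


def constraint_matrix(predictions, weights, state_labels):
    """Compute C[j][k] for every sequence position j and HMM state k.

    predictions  -- list of m label strings, each of length n
    weights      -- list of m weight lists, each of length n (W[i][j])
    state_labels -- main label of each of the N states (hypothetical toy model)
    """
    n = len(predictions[0])
    constraints = []
    for j in range(n):
        # Numerator of C[j][k] for each state k.
        row = [
            sum(w[j] * delta(label, pred[j]) for pred, w in zip(predictions, weights))
            for label in state_labels
        ]
        total = sum(row)  # denominator: the same sum taken over all states
        if total > 0:
            row = [value / total for value in row]
        # Assumption: if total == 0 the position stays unconstrained.
        constraints.append(row)
    return constraints


# Toy usage with made-up weights and a three-state model.
toy_constraints = constraint_matrix(
    predictions=["IM", "OM"],
    weights=[[0.9, 0.8], [0.6, 0.8]],
    state_labels=["I", "M", "O"],
)
```

In the toy call, the two sources disagree at the first position, so the constraint is split between the "I" and "O" states in proportion to the weights, while at the second position both agree and the "M" state receives the full constraint.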