gitsvnid: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/kebabs@98268 bc3139a867e503109ffcced21a209358
Ulrich Bodenhofer authored on 12/01/2015 14:33:34
\documentclass[article]{bioinf}

\usepackage[noae]{Sweave}
\usepackage{amsmath,amssymb}
\usepackage{hyperref}
\usepackage{float}
\usepackage[authoryear]{natbib}

\hypersetup{colorlinks=false,
   pdfborder=0 0 0,
   pdftitle={KeBABS - An R Package for Kernel-Based Analysis of Biological Sequences},
   pdfauthor={Johannes Palme}}

\title{KeBABS \\ An R Package for Kernel-Based Analysis of Biological Sequences}
\author{Johannes Palme and Ulrich Bodenhofer}
\affiliation{Institute of Bioinformatics, Johannes Kepler University
Linz\\Altenberger Str. 69, 4040 Linz, Austria\\
\email{kebabs@bioinf.jku.at}}

%\VignetteIndexEntry{KeBABS - An R Package for Kernel-Based Analysis of Biological Sequences}
%\VignetteKeywords{Kernel, kernels, similarity, sequences, biological sequences, prediction profiles, SVM, classification, regression, multiclass, kernlab, e1071, LiblineaR}

\newcommand{\KeBABS}{\text{KeBABS}}
\newcommand{\kebabs}{\texttt{kebabs}}
\newcommand{\kbsvm}{\texttt{kbsvm}}
\newcommand{\Biostrings}{\texttt{Biostrings}}
\newcommand{\method}[1]{{\fontfamily{phv}\fontshape{rm}\selectfont #1}}
\newcommand{\R}{R}
\newcommand{\Real}{\mathbb{R}}

\renewcommand{\vec}[1]{\mathbf{#1}}

\setkeys{Gin}{width=0.55\textwidth}

\SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=TRUE, keep.comment=TRUE}

\begin{document}

<<Init, echo=FALSE>>=
options(width=70)
options(useFancyQuotes=FALSE)
set.seed(0)
library(kebabs)
kebabsVersion <- packageDescription("kebabs")$Version
kebabsDateRaw <- packageDescription("kebabs")$Date
kebabsDateYear <- as.numeric(substr(kebabsDateRaw, 1, 4))
kebabsDateMonth <- as.numeric(substr(kebabsDateRaw, 6, 7))
kebabsDateDay <- as.numeric(substr(kebabsDateRaw, 9, 10))
kebabsDate <- paste(month.name[kebabsDateMonth], " ",
                    kebabsDateDay, ", ",
                    kebabsDateYear, sep="")
@
\newcommand{\kebabsVersion}{\Sexpr{kebabsVersion}}
\newcommand{\kebabsDate}{\Sexpr{kebabsDate}}
\manualtitlepage[Version \kebabsVersion, \kebabsDate]

\section*{Scope and Purpose of this Document}

This document is the user manual for the \R\ package \kebabs. It is meant as an introduction to the \KeBABS\ functionality that should acquaint the user with the basic functions of the package, both on a conceptual level and from the usage perspective. This means that the document neither goes into full detail for the described functions nor can it be considered a complete description of all package functions. As an additional source of information, the help pages including examples are available.

Please also consider that this manual is neither an introduction to \R\ nor an introduction to machine learning in general or to support vector machine-based learning in particular. The document provides the information in a form that also allows less experienced users to use the package. However, if you do not have any knowledge of these subjects, please read introductory literature first.

\section*{Authorship and Contributions}

Except for the header files {\tt khash.h} and {\tt ksort.h} and three small stub files for Bioconductor packages, this package was written entirely by Johannes Palme as part of his master's thesis in the Bioinformatics master's program at the Johannes Kepler University in Linz, Austria, under the supervision of Ulrich Bodenhofer. The author wants to express his sincere gratitude to Ulrich Bodenhofer for numerous lively discussions throughout the development of the package. The two header files mentioned above provide a fast hash table implementation and sorting algorithms implemented in pure C code and are part of the open-source C library {\tt klib} \citep{klib2011}. The R code for prediction profile plots was partly inspired by similar functionality in the package {\tt procoil} \citep{procoil2011}.

The datasets delivered with the package come from two sources:
\begin{itemize}
\item TFBS - transcription factor binding site data:
This dataset is based on supplementary data provided with the publication ``Discriminative prediction of mammalian enhancers from DNA sequence'' by \cite{Lee2011}.
\item CCoil - protein sequences for coiled-coil proteins:
This dataset is part of the data collection for the package {\tt procoil}\footnote{\url{http://www.bioinf.jku.at/software/procoil/data.html}}.
\end{itemize}

KeBABS relies heavily on functionality provided by other R packages, especially the packages
\begin{itemize}
\item {\tt Biostrings} for storing and manipulating biological sequences
\item {\tt e1071}, {\tt LiblineaR} and {\tt kernlab} for SVM implementations used in binary and multiclass classification and regression scenarios
\end{itemize}

Special thanks go to the persons who have contributed to these packages, to the C library {\tt klib} or to any of the data mentioned above.

\newpage

\vspace{1cm}

\newlength{\auxparskip}
\setlength{\auxparskip}{\parskip}
\setlength{\parskip}{0pt}
\tableofcontents
\clearpage
\setlength{\parskip}{\auxparskip}

\newlength{\Nboxwidth}
\setlength{\Nboxwidth}{\textwidth}
\addtolength{\Nboxwidth}{2\fboxrule}
\addtolength{\Nboxwidth}{2\fboxsep}

\newcommand{\notebox}[1]{%
\begin{center}
\fbox{\begin{minipage}{\Nboxwidth}
\noindent{\sffamily\bfseries Note:} #1
\end{minipage}}
\end{center}}

\section{Introduction}
\label{sec:Introduction}

The \kebabs\ package provides functionality for kernel-based analysis of biological sequences (DNA, RNA and amino acid sequences) via Support Vector Machine (SVM) based methods. Sequence kernels define similarity measures between sequences which can be used in various machine learning methods. The package provides four important kernels for sequence analysis in a very flexible and efficient implementation and extends their standard position-independent functionality in a novel way to take the position of patterns in the sequences into account. The following kernels are available in \KeBABS:

\begin{itemize}
\item Spectrum kernel \citep{Leslie2002}
\item Mismatch kernel \citep{Leslie2003}
\item Gappy pair kernel \citep{Kuksa2008}
\item Motif kernel \citep{BenHur2003}
\end{itemize}

All of the kernels except the mismatch kernel support position-independent and position-dependent similarity measures. Because of the flexibility of the kernel formulation, other kernels like the weighted degree kernel \citep{Raetsch2004} or the shifted weighted degree kernel with constant weighting of positions \citep{Raetsch2005} are included as special cases. The kernels also exist in an annotation-specific version which uses annotation information placed along the sequence. The package allows for the generation of a kernel matrix or an explicit representation for all available kernels, which can be used with methods implemented in other \R\ packages.

\KeBABS\ focuses on SVM-based methods and provides a framework that simplifies the usage of existing SVM implementations in other \R\ packages. In the current version of the package, the SVMs in the \R\ packages {\tt kernlab}, {\tt e1071} and {\tt LiblineaR} from the Comprehensive R Archive Network (CRAN)\footnote{\url{http://cran.r-project.org}} are supported. Binary and multiclass classification as well as regression tasks can be used with the SVMs in these packages in a unified way without having to deal with the different functions, parameters and formats of the selected SVM. The package provides cross validation, grid search and model selection functions as support for choosing hyperparameters. Grouped cross validation, a form of cross validation which respects group membership, is also available.

Another emphasis of this package is the biological interpretability of the machine learning results. This aspect is reflected in the computation of feature weights for all SVMs and of prediction profiles, which show the contribution of individual sequence positions to the prediction result and give an indication of the relevance of sequence sections for the learning result.

The flexible and efficient kernels and the supporting SVM-related functionality in \KeBABS, together with the supported SVMs, especially the very fast linear SVMs in the package LiblineaR, provide a fast, versatile and easy-to-use sequence analysis framework combined with improved biological interpretation possibilities of learning results through feature weights and prediction profiles.

\newpage

\section{Installation}
\label{sec:Installation}

\subsection{Installation via Bioconductor}

The \R\ package \kebabs\ (current version: \kebabsVersion) is part of the Bioconductor project\footnote{\url{http://www.bioconductor.org/}}. To install the package from this repository, please enter the following commands into your \R\ session:

<<eval=F>>=
source("http://bioconductor.org/biocLite.R")
biocLite("kebabs")
@

\subsection{Manual installation from source}

The install method described above is the standard way of installing the package. If, for example, you want to compile the C/C++ code included in the package with special compile flags, manual installation of the package from source might be necessary. This is done in the following way:

\begin{itemize}
\item{Open the Bioconductor page of the package\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/kebabs.html}} in your web browser, download the source package \texttt{kebabs\_\kebabsVersion.tar.gz} and save it on your hard disk.}
\item{Open a shell/terminal/command prompt window and change to the directory where the source package \texttt{kebabs\_\kebabsVersion.tar.gz} is stored. Then install the package with:\\\\
\texttt{R CMD INSTALL kebabs\_\kebabsVersion.tar.gz}}
\end{itemize}

\subsection{Installation of SVM packages}

Depending on which SVMs are needed for SVM-based learning, one, several or all of the following packages must be installed additionally:

\begin{description}
\item [e1071] {\tt > install.packages("e1071")}
\item [kernlab] {\tt > install.packages("kernlab")}
\item [LiblineaR] {\tt > install.packages("LiblineaR")}
\end{description}

\section{Getting Started}
\label{sec:GettingStarted}

To load the package, enter the following in your \R\ session:
<<>>=
library(kebabs)
@

When the package is loaded, several messages are displayed indicating that some objects are masked. This does not cause any problem for \kebabs. Once the \R\ session returns to the \R\ command prompt ">" after loading the package, \kebabs\ is ready for use.

The package includes this user manual which can be viewed in the \R\ session with

<<eval=F>>=
vignette("kebabs")
@

Help pages can be accessed with

<<eval=F>>=
help(kebabs)
@

In the following example, we perform SVM-based binary classification to predict EP300-bound enhancers from DNA sequence. From a set of sequences derived from ChIP-seq EP300 binding sites and a set of negative sequences sampled from the same genome, we train a model which allows us to predict enhancers for new sequences. In this example, a small subset of a much larger dataset is used for demonstration purposes.

The analysis in \kebabs\ usually starts with sequence data for DNA, RNA or amino acid sequences. In the following examples, we use DNA sequences which are included in the \kebabs\ package to demonstrate package functionality. In \kebabs, biological sequences are represented in one of the classes {\tt DNAStringSet}, {\tt RNAStringSet} or {\tt AAStringSet} from the package \Biostrings. Sequences in this format can be created from character vectors containing the sequence data or by reading sequences from a FASTA file with the functions {\tt readDNAStringSet}, {\tt readRNAStringSet} and {\tt readAAStringSet} from the package \Biostrings. In the first example, we load sample sequences provided with the \kebabs\ package with

<<>>=
data(TFBS)
@

The dataset contains 500 DNA sequence fragments with lengths between 277 and 2328 bases from the mouse genome mm9, used to study transcription factor binding for the EP300/CREBBP binding proteins, and a label vector which indicates whether the protein binds to a sequence fragment or not. The dataset is a sampled subset of the mouse enhancer forebrain data from \cite{Lee2011} based on a ChIP-seq study of \cite{Visel2009}. Let us look at the sequences stored under the name {\tt enhancerFB}, the number of sequences in the sequence set and the minimal and maximal length of the contained sequences:

<<>>=
enhancerFB
length(enhancerFB)
@

The length distribution of the sequences in the dataset can be shown with

<<eval=FALSE>>=
hist(width(enhancerFB), breaks=30, xlab="Sequence Length",
     main="Distribution of Sequence Lengths")
@

<<fig=FALSE,echo=FALSE,results=hide>>=
pdf("001.pdf")
hist(width(enhancerFB), breaks=30, main="Distribution of Sequence Lengths",
     xlab="Sequence Length")
dev.off()
@

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\columnwidth]{001.pdf}
\end{center}
\end{figure}

To look at individual sequences from this set or parts of them, e.g. for sample number 3, we can use

<<>>=
showAnnotatedSeq(enhancerFB, sel=3)
@

The label vector is named {\tt yFB} and describes the interaction of the DNA sequence fragment and the binding factor considered in the underlying experiment. It contains a "1" for all positive samples to which the transcription factor binds and a "-1" for all negative samples for which no binding occurs. We look at the first few elements of the label vector with

<<>>=
head(yFB)
@

and see that transcription factor binding occurs for the first and fourth sequence. With

<<>>=
table(yFB)
@

we check the balancedness of the dataset and find that it is balanced, i.e. that it contains approximately the same number of positive and negative samples. Unbalanced data can have a strong impact on the performance of SVM-based learning; therefore, the user should always be aware of unbalanced data. Now, being familiar with the data in the dataset, we can start with the sequence analysis.

To train a model, we split the dataset into a part used to train the model (the training set) and a part used to predict the binding behavior from the model (the test set). For the training set, we randomly select 70\% of the sequences. The remaining sequences become the test set.

<<>>=
numSamples <- length(enhancerFB)
trainingFraction <- 0.7
train <- sample(1:numSamples, trainingFraction * numSamples)
test <- c(1:numSamples)[-train]
@

The vectors {\tt train} and {\tt test} now contain the sample indices for the training set and test set samples. The elements in {\tt train} are unordered because {\tt sample} just picks elements randomly from the given set.

In the next step, we select a kernel for the analysis. The sequences are character strings which are converted to numerical data representing similarity values with the help of sequence kernels. These similarity values are used to train a model via SVMs. In this example, the spectrum kernel is used. This kernel is described in detail in Section \ref{sec:SpectrumKernel}. First we define a kernel object with

<<>>=
specK2 <- spectrumKernel(k=2)
@

which describes the type of kernel and the used setting of the kernel parameters. A spectrum kernel with $k=2$ considers substrings of length 2 in the sequences for the similarity consideration.

Now everything is available to train a model from the sequences in the training set with the \KeBABS\ function \kbsvm. In its simplest form, the function expects the training set, the labels for the training samples, the kernel object and the SVM together with SVM parameters. The meaning of most other parameters will be introduced later in this manual. In this example, we perform classification using the C-SVM from the package {\tt e1071} with a value of 15 for the cost hyperparameter. The \kebabs\ SVM name is {\tt C-svc}, which stands for "C support vector classification". The hyperparameter name {\tt cost} is the unified \kebabs\ name for the cost parameter called {\tt C} in the package {\tt kernlab} and {\tt cost} in the packages {\tt e1071} and {\tt LiblineaR}. The function \kbsvm\ returns the trained model.

<<>>=
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", cost=15)
@

Because of the large number of possible parameters of \kbsvm, the parameters are passed by name in the function call.

Now we have trained a linear model that separates the sequences for which transcription factor binding occurs from sequences without binding. The model parameters and all information necessary for prediction are stored in the model object. A description of the KeBABS model object is given in Section \ref{sec:Training}. Once the model is trained, it can be used to predict whether EP300 binding occurs for the new sequences in the test set with

<<>>=
pred <- predict(model, enhancerFB[test])
head(pred)
head(yFB[test])
@

The vector {\tt pred} contains the predictions for the samples in the test set. The true values for the test set are shown in the last output above. With the knowledge of the true labels for the test set, we can check the quality of our predictions, i.e. the performance of the model for the sequences in the test set, with

<<>>=
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))
@

The accuracy shows the percentage of the samples in the test set that were predicted correctly. For balanced datasets, the accuracy is the main performance criterion. For unbalanced datasets, the balanced accuracy or the Matthews correlation coefficient might be more appropriate. The confusion matrix in the top part of the output shows the predictions as rows and the correct labels as columns. The values on the diagonal are the correct predictions for the two classes, for which prediction and true label match.

For comparison, we now run the same training with the gappy pair kernel, which uses pairs of substrings of length $k$ separated by an arbitrary substring with a length of up to $m$. For details on this kernel, see Section \ref{sec:GappyPairKernel}.

<<>>=
gappyK1M3 <- gappyPairKernel(k=1, m=3)
model <- kbsvm(x=enhancerFB[train], y=yFB[train],
               kernel=gappyK1M3, pkg="e1071", svm="C-svc",
               cost=15)
pred <- predict(model, enhancerFB[test])
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))
@

If we want to switch to the C-SVM implementation in a different package, e.g. \texttt{LiblineaR}, only the package parameter is changed. \KeBABS\ internally converts the data structures and parameters to the package-specific format before calling the SVM implementation in the selected package.

<<>>=
gappyK1M3 <- gappyPairKernel(k=1, m=3)
model <- kbsvm(x=enhancerFB[train], y=yFB[train],
               kernel=gappyK1M3, pkg="LiblineaR", svm="C-svc",
               cost=15)
pred <- predict(model, enhancerFB[test])
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))
@

The accuracy is nearly identical, but the confusion matrix has changed. This example shows that completely identical results cannot be expected from the same SVM in different packages. The results between {\tt kernlab} and {\tt e1071} are usually identical because the implementations are very similar.

We now use a different SVM, the L1-regularized L2-loss SVM from \texttt{LiblineaR}.

<<>>=
gappyK1M3 <- gappyPairKernel(k=1, m=3)
model <- kbsvm(x=enhancerFB[train], y=yFB[train],
               kernel=gappyK1M3, pkg="LiblineaR", svm="l1rl2l-svc",
               cost=5)
pred <- predict(model, enhancerFB[test])
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))
@

We have changed the value of the hyperparameter {\tt cost} as we switched to a different SVM. Hyperparameter values depend on the actual SVM formulation, and the best performing values can be quite different between SVMs. The cross validation and grid search functions described in Section \ref{sec:HyperparSel} help to select good hyperparameter values. A description of the available SVMs in each package and the hyperparameters of each SVM can be found on the help page of the function \kbsvm, which is shown with

<<eval=F>>=
?kbsvm
@

Please do not be irritated by the large number of parameters on the help page of this function. As we have seen in this example, only a small subset of the parameters is needed for simple training. The other parameters are meant to tune the training behavior for specific situations or for more advanced usage scenarios. Most of them will be introduced in later sections of this manual, and you will become familiar with them quickly as you start using the package.

The next two sections describe the available sequence kernels in \KeBABS\ and the basic data representations in detail. Starting with Section \ref{sec:Training}, important aspects of SVM-based learning will be presented.

\section{Sequence Kernels}
\label{sec:SequenceKernels}

The sequence kernels defined in this package target biological sequences and represent different similarity measures between two sequences. All of the kernels in this package split the sequence into patterns. These patterns are also called features in the context of machine learning. The difference between the kernels lies primarily in the way in which this splitting occurs.

Described in a more general form, the sequence kernels extract the sequence characteristics related to the features defined by the kernel and compute a similarity value between two sequences from their feature characteristics. In more specific mathematical terminology, the kernel maps the sequences into a high-dimensional vector space where the individual feature values are the coordinates in this vector space. The kernel value is computed through the dot product of the feature vectors representing the two sequences in the high-dimensional space.

The sequence kernels in \KeBABS\ come in several flavors:

\begin{description}
\item [position-independent] kernels, for which the location of a pattern in the sequence is
   irrelevant for the determination of the similarity value; just its occurrence has an
   influence. This is the ``traditional'' type of kernel used for sequences.
\item [position-dependent] kernels, which in addition to the occurrence also take the position
   into account. This version again is split into two different variants:
   \begin{description}
   \item [position-specific] kernels, which only consider patterns when they are located at
      the same position in the two sequences
   \item [distance-weighted] kernels, for which the contribution of a pattern is defined by
      the weighted distance (measured in positions) of the pattern in the two
      sequences
   \end{description}
   Through the introduction of a parameterized distance weighting function, a flexible
   blending between the two extremes of exact position evaluation in the
   position-specific kernel and complete position independence in the
   position-independent kernel is possible.
\item [annotation-specific] kernels, which use annotation information in addition to the
   patterns in the sequence. The annotation consists of a character string with the same
   length as the sequence which is aligned to the sequence and delivers for each
   pattern in the sequence a corresponding pattern in the annotation. For the
   annotation-specific kernel, the feature consists of the concatenation of sequence
   pattern and annotation pattern. This means that annotation information is used
   together with the sequence information in the similarity consideration. One typical
   example is the splitting of gene sequences into intron and exon parts. The
   annotation-specific kernel is available in a position-independent version only.
\end{description}

All \KeBABS\ kernels except the mismatch kernel support all flavors. For the mismatch kernel, only the position-independent version without annotation is implemented.

First the position-independent kernel variants are described. The position-dependent variants are presented in Section \ref{sec:PosDepKernel}. The annotation-specific kernel variant is covered in Section \ref{sec:AnnSpecKernel}. The next sections describe the characteristics and the kernel parameters of the different kernels. The generation of a kernel matrix for a specific kernel is described in Section \ref{sec:KernelMatrix}. A description of the generation of the explicit representation is given in Section \ref{sec:ExRep}.

\subsection{Position-Independent Sequence Kernels}

For the position-independent kernels, the contribution of a single feature depends on the number of occurrences of the feature in the two sequences, but is independent of their locations. The unnormalized kernel for position-independent sequence kernels can generally be described as

\begin{align}
\label{eq:PosIndepKernel}
k(x, y) = \sum_{m \in \mathcal{M}} N(m, x) \cdot N(m, y)
\end{align}

where $x$ and $y$ are two sequences, $m$ is one pattern of the feature space $\mathcal{M}$ which contains all possible patterns, and $N(m, x)$ is the number of occurrences of pattern $m$ in sequence $x$. $|\mathcal{M}|$ is the size of this feature space.

A formal definition of $N(m, x)$ which opens up the way to position-dependent kernels is given by

\begin{align}
\label{eq:PatternOccurrences}
N(m, x) = \sum_{p=1}^{\mathrm{length}(x)} \mathbf{1}(m, x, p)
\end{align}

where $\mathbf{1}(m, x, p)$ is a kernel-specific indicator function with a value of 1 if the pattern $m$ matches the sequence $x$ at position $p$.

Then the feature mapping $\varphi$ is defined as

\begin{align}
\label{eq:FeatureMap}
\varphi(x) = (N(m,x))_{m \in \mathcal{M}}
\end{align}

i.e., for position-independent kernels, the feature vector contains the counts of the different patterns found in the sequence as vector elements.

Through kernel normalization, feature vectors are scaled to the unit sphere and the similarity value represents the cosine similarity, which is just the cosine of the angle between the feature vectors in the high-dimensional space. The kernel normalization is performed according to

\begin{align}
\label{eq:NormKernel}
k^\prime(x, y) = \frac{k(x,y)}{\sqrt{k(x,x) \cdot k(y,y)}}
\end{align}

\subsubsection{Spectrum Kernel}
\label{sec:SpectrumKernel}

For the spectrum kernel \citep{Leslie2002} with kernel parameter {\tt k}, the features are all substrings of fixed length $k$. The feature space is defined through all possible substrings of length $k$ for the given alphabet $\mathcal{A}$. If $|\mathcal{A}|$ is the number of characters in the character set, the feature space dimension is $|\mathcal{M}| = |\mathcal{A}|^k$.

For DNA sequences with the exact alphabet $\mathcal{A} = \{\rm{A,C,G,T}\}$ or RNA sequences with the exact alphabet $\mathcal{A} = \{\rm{A,C,G,U}\}$, the alphabet size $|\mathcal{A}|$ is 4; for amino acid sequences with the exact alphabet, it is 21. In addition to the 20 standard amino acids, the 21st amino acid selenocysteine with the abbreviation U or Sec is included in the alphabet, as it occurs throughout all kingdoms and also in 25 human proteins. The feature space grows exponentially with $k$.

Usually \kebabs\ only considers the $k$-mers that actually occur in the sequences to keep performance and memory needs as low as possible, but please be aware that both the runtime and the memory usage of kernel processing increase as $k$ becomes larger because of the drastically increasing number of features. This effect becomes more prominent for larger alphabet sizes. For amino acids with $k=5$, the feature space size is already $21^5 = 4{,}084{,}101$.

The spectrum kernel walks along the sequence position by position and counts the number of occurrences of each $k$-mer in the sequence. The kernel first determines $N(m, x)$ for all $k$-mers $m$ in the sequences $x$ and $y$ and then computes the value of the unnormalized kernel according to equation (\ref{eq:PosIndepKernel}), which is the dot product of the two feature vectors for the sequences $x$ and $y$. Through applying kernel normalization according to equation (\ref{eq:NormKernel}), the normalized kernel value is computed from the unnormalized kernel values.

416 
As already shown in Section \ref{sec:GettingStarted}, a kernel object for the spectrum kernel is created with 

417 
 

418 
<<>>= 

419 
specK2 < spectrumKernel(k=2) 

420 
@ 

421 
 

422 
This represents a normalized kernel for the spectrum kernel with kmer length 2 because the {\tt normalized} parameter is set by default to {\tt TRUE}. The kernel object for the unnormalized kernel is created with 

423 
 

424 
<<>>= 

425 
specK2 < spectrumKernel(k=2, normalized=FALSE) 

426 
@ 

427 
 

428 
When working with sequences of different lengths the normalized kernel usually leads to better results. Kernel objects are used for training as already shown in Section \ref{sec:GettingStarted} , invisibly for the user, also in prediction and for explicit creation of a kernel matrix or an explicit representation. 

429 
 

430 
With the kernel parameter {\tt exact} set to {\tt TRUE}, the exact character set without ambiguous characters is used for the analysis. This is the default setting. Any character in the sequence that is not part of the character set is interpreted as invalid; it causes open patterns to be dropped, and the pattern search restarts at the next sequence position. When this parameter is set to {\tt FALSE}, the IUPAC character set is used with exact matching on each of its characters. The IUPAC character set defines characters with ambiguous meaning in addition to A, C, G and T.

The kernel parameter {\tt ignoreLower}, with default setting {\tt TRUE}, defines that lowercase characters in the sequence are ignored and treated like the invalid characters described above. When the parameter is set to {\tt FALSE}, the kernel becomes case insensitive and treats lowercase characters like uppercase characters. The classes {\tt DNAStringSet}, {\tt RNAStringSet} and {\tt AAStringSet} derived from {\tt XStringSet} do not store lowercase characters but convert them to uppercase automatically on creation. Lowercase characters can be stored in the {\tt BioVector}-derived classes {\tt DNAVector}, {\tt RNAVector} and {\tt AAVector}. These classes are structured similarly to the {\tt XStringSet}-derived classes, and they are treated identically in \KeBABS{}. The {\tt BioVector}-derived classes should be used only when lowercase characters are needed, e.g. for representing repeat regions, or in the context of string kernels from the package {\tt kernlab}. In all other situations, sequence data should be represented through {\tt XStringSet}-derived classes.

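To illustrate the difference (a small sketch that is not evaluated during vignette generation): the {\tt XStringSet}-derived classes convert lowercase characters on creation, whereas the {\tt BioVector}-derived classes preserve them.

<<eval=FALSE>>=
## DNAStringSet converts lowercase characters to uppercase on creation
DNAStringSet("ACGTacgt")
## DNAVector preserves lowercase characters, e.g. for repeat regions
DNAVector("ACGTacgt")
@
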
With the kernel parameter {\tt presence} set to {\tt TRUE}, only the presence of a pattern in the sequence is relevant for the similarity consideration, not its actual number of occurrences.

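For example, a kernel object that only evaluates pattern presence could be created as follows (not evaluated here):

<<eval=FALSE>>=
## spectrum kernel counting each dimer at most once per sequence
specK2p <- spectrumKernel(k=2, presence=TRUE)
@
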
Reverse complements of patterns are considered identical to the forward versions of the patterns when the kernel parameter {\tt revComplement} is set to {\tt TRUE}. For example, in the case of a spectrum kernel with {\tt k=3}, the $k$-mer ACG and its reverse complement CGT are treated as a single feature. This parameter can be helpful when analyzing interactions with double-stranded DNA.

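A corresponding kernel object could, for instance, be created as follows (not evaluated here):

<<eval=FALSE>>=
## spectrum kernel treating each 3-mer and its reverse complement
## as a single feature
specK3rc <- spectrumKernel(k=3, revComplement=TRUE)
@
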
The parameter {\tt r} allows exponentiation of the kernel value. Only values larger than 0 are allowed. This parameter can only be different from 1 when kernel matrices are computed explicitly, but not as part of a regular SVM training or when generating an explicit representation with this kernel. Please be aware that only integer values ensure positive definiteness of the resulting kernel matrix.

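As a sketch (not evaluated here; {\tt seqs} stands for an arbitrary sequence set), such a kernel could be used for explicit kernel matrix computation as follows:

<<eval=FALSE>>=
## spectrum kernel with squared kernel values; usable only for
## explicit kernel matrix computation
specK2r2 <- spectrumKernel(k=2, r=2)
km <- getKernelMatrix(specK2r2, seqs)
@
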
When similarity values of multiple kernels are combined into a new similarity measure, we speak of a mixture kernel. In the mixed spectrum kernel, $k$-mers of multiple lengths are taken into account for the similarity value. Within \KeBABS{}, similarity values are only combined through weighted sums. The kernel parameter {\tt mixCoef} defines the mixture coefficients of a mixture kernel built from several spectrum kernels. The parameter must be a numeric vector with $k$ elements. For instance, for $k=3$, the spectrum kernel values for $k=1$, $k=2$ and $k=3$ are computed and mixed with the three elements of {\tt mixCoef} as mixture coefficients. If an individual kernel component should not be included in the mixture, the corresponding element of {\tt mixCoef} is set to zero.

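For example, a mixed spectrum kernel up to $k$-mer length 3 could be created as follows (not evaluated here):

<<eval=FALSE>>=
## mixture of spectrum kernels for k=1, 2 and 3 with the
## monomer contribution weighted strongest
specMixed <- spectrumKernel(k=3, mixCoef=c(0.5, 0.33, 0.17))
@
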
\subsubsection{Mismatch Kernel}
\label{sec:MismatchKernel}

The mismatch kernel \citep{Leslie2003} is similar to the spectrum kernel. The kernel parameter {\tt k} has the same meaning as in the spectrum kernel, and the mismatch kernel also extracts substrings of fixed length $k$. The kernel parameter {\tt m} defines the maximum number of mismatches allowed per $k$-mer. For example, with $k=2$ and $m=1$, an occurrence of the substring TA in the sequence matches the features AA, CA, GA, TA, TC, TG and TT with not more than one mismatch and therefore increases the feature count for these seven features. The feature space, with dimension $|\mathcal{M}| = |\mathcal{A}|^k$, is identical to that of the spectrum kernel. The parameter {\tt m} has no impact on the feature space size.

A kernel object for the normalized mismatch kernel with $k=3$ and $m=1$ is created in the following way:

<<>>=
mismK3M1 <- mismatchKernel(k=3, m=1)
@

Apart from the new kernel parameter {\tt m}, the kernel parameters {\tt k}, {\tt r}, {\tt normalized}, {\tt exact}, {\tt ignoreLower} and {\tt presence} of the mismatch kernel have the same functionality as described in Section \ref{sec:SpectrumKernel} for the spectrum kernel. The position-dependent kernel variants, the annotation-specific variant and the mixture kernel are not supported for the mismatch kernel.

\subsubsection{Gappy Pair Kernel}
\label{sec:GappyPairKernel}

The gappy pair kernel \citep{Kuksa2008, Mahrenholz2011} looks for pairs of $k$-mers separated by up to a given number of wildcard positions, each of which matches any character of the alphabet. For this kernel, the kernel parameter {\tt m} has a different meaning than in the mismatch kernel: it defines the maximum number of wildcard positions. A single feature of the gappy pair kernel is structured in the following way

\begin{align*}
\underbrace{kmer1}_{k} \underbrace{\hspace{5pt}.\hspace{5pt}.\hspace{5pt}.\hspace{5pt}.\hspace{5pt}.\hspace{5pt}.\hspace{5pt}.\hspace{5pt}}_{0 \le i \le m} \underbrace{kmer2}_{k}
\end{align*}

An irrelevant position $i$ can contain any character from the character set. In the feature names, the irrelevant positions are marked as dots. For example, for the sequence CAGAT, the gappy pair kernel with $k=1$ and $m=2$ finds pairs of monomers with zero to two irrelevant positions in between, i.e. it finds the features CA, C.G, C..A, AG, A.A, A..T, GA, G.T and AT. The gappy pair kernel object for $k=1$ and $m=2$ with normalization is created with:

<<>>=
gappyK1M2 <- gappyPairKernel(k=1, m=2)
@

The feature space size of the gappy pair kernel is $|\mathcal{M}| = (m + 1)\cdot|\mathcal{A}|^{2k}$. It is considerably larger than that of the spectrum or mismatch kernel with the same value of {\tt k}. Because of the factor 2 in the exponent, the dimensionality grows much faster with increasing $k$. For amino acids with $k=2$ and $m=10$, the feature space size is already larger than 2 million features. The feature space size grows exponentially with {\tt k} and linearly with {\tt m}, which means that the complexity of the kernel can be adjusted more finely through the parameter {\tt m} than for other sequence kernels.

The other kernel parameters are identical to those of the spectrum kernel as described in Section \ref{sec:SpectrumKernel}.

\subsubsection{Motif Kernel}
\label{sec:MotifKernel}

The motif kernel \citep{BenHur2003} is created from a predefined motif list. With this kernel, the sequence analysis can be restricted to a user-defined set of features. The kernel also gives greater flexibility concerning the definition of individual features.

A single motif is described, similarly to a regular expression, as a sequence of the following elements:

\begin{itemize}
  \item a single character from the character set, e.g. C for DNA sequences, which matches only this character
  \item the wildcard character '.', which matches any character of the character set
  \item the definition of a substitution group, which is a list of single characters from the character set enclosed in square brackets, e.g. $[$CT$]$, which matches any of the listed characters, in this example C or T. With a leading caret character in the list, the meaning is inverted, i.e. with $[\hspace{2pt}\hat{ }\hspace{2pt}$CT$]$ only the characters not listed, A and G, would be matched for the exact DNA character set.
\end{itemize}

The motif collection must be defined as a character vector, which is passed to the constructor of the motif kernel object as in the following example for a normalized motif kernel:

<<>>=
motifCollection1 <- c("A[CG]T","C.G","C..G.T","G[^A][AT]",
                      "GT.A[CA].[CT]G")
motif1 <- motifKernel(motifCollection1)
@

The feature space size of the motif kernel corresponds to the number of motifs in the motif collection. Here a motif kernel with a small collection of motifs is defined just to show some of the possibilities for pattern definition. In realistic scenarios, the motif collection is usually much larger and can contain up to several hundred thousand motifs. Often motif collections are created from a subset of entries in a motif database.

Apart from the kernel parameters {\tt revComplement} and {\tt mixCoef}, which are not available for the motif kernel, the other kernel parameters are identical to those of the spectrum kernel as described in Section \ref{sec:SpectrumKernel}. The motif kernel can also be used to run sequence analysis with a subset of the features of a spectrum or gappy pair kernel.

Hint: For a normalized motif kernel whose motifs are a feature subset of a normalized spectrum or gappy pair kernel, the explicit representation is not identical to the corresponding subset of the explicit representation of the normalized spectrum or gappy pair kernel, because the motif kernel is not aware of the other $k$-mers that additionally enter the normalization of the spectrum or gappy pair kernel.

\subsection{Position-Dependent Kernels}
\label{sec:PosDepKernel}

Position-dependent kernels take into account the positions at which a pattern occurs in the two sequences. To ensure that positions in both sequences are comparable, some kind of alignment is required as a prerequisite for all kinds of position-dependent kernels. In the simplest form, the kernel only considers patterns found at the same position in the similarity measure; this kernel is called the position-specific kernel. Patterns occur at exactly the same position in two sequences only if the sequences are evolutionarily or functionally related and no position-changing mutation has occurred. Once a single insertion or deletion happens, the correspondence of identical patterns at identical positions is disturbed. A more flexible approach is to define a neighborhood around a pattern in which the same pattern is searched on the other sequence. This neighborhood is defined in \KeBABS{} through the introduction of a parameterized distance weighting function. The following two subsections describe both kernel variants. More detailed information on the modeling of position dependency can be found in \cite{Bodenhofer2009}.

\subsubsection{Position-Specific Kernel}
\label{sec:PosSpecKernel}

For the position-specific kernel, only patterns that occur at the same position in both sequences are relevant. The feature mapping $\varphi$ of this kernel for a sequence $x$ of length $L$ is defined as

\begin{align}
\label{eq:FeatureMapPosSpec}
\varphi(x) = \left(\mathbf{1}(m,x,1), \dots , \mathbf{1}(m,x,L)\right)_{m \in \mathcal{M}}
\end{align}

The feature vector is a binary vector; the part associated with a given position contains a one for each pattern that occurs at this position and zeros otherwise. For the spectrum kernel and the mismatch kernel, a single pattern occurs at each position; for the gappy pair kernel and the motif kernel, multiple patterns can be relevant at a single position. The feature space size of the position-specific version of a kernel is $|\mathcal{M}_{\mathrm{pos}}| = L_m \cdot |\mathcal{M}_{\mathrm{nopos}}|$, where $L_m$ is the maximum sequence length in the sequence set and $|\mathcal{M}_{\mathrm{nopos}}|$ is the feature space size of the corresponding position-independent kernel. Because of the high dimensionality of the feature space of position-specific kernels, an explicit representation is not supported for this kernel variant.

The kernel can be expressed through the kernel-specific indicator function $\mathbf{1}(m, x, p)$ as

\begin{align}
\label{eq:PosSpecKernel}
k(x,y) = \sum_{m \in \mathcal{M}}\sum_{p=1}^{L} \mathbf{1}(m, x, p) \cdot \mathbf{1}(m, y, p)
\end{align}

with $L$ being the minimum length of the sequences $x$ and $y$. The unnormalized kernel version just counts the occurrences of the same pattern at the same position in both sequences.

The spectrum kernel, the gappy pair kernel, and the motif kernel can be used in the position-specific variant. For creating position-specific kernels, the parameter {\tt distWeight} is set to 1, as in the following example of a position-specific, unnormalized gappy pair kernel:

<<>>=
gappyK1M2ps <- gappyPairKernel(k=1, m=2, distWeight=1,
                               normalized=FALSE)
@

The reason for using the parameter {\tt distWeight} for position-specific kernels will become obvious in a few moments when the distance-weighted kernel is explained.

As mentioned above, the sequences analyzed with a position-specific kernel must be aligned in some way. Flexible alignment without changing the sequence set is possible through the assignment of an offset vector as metadata to the sequence set. This vector gives, for each sequence, the offset from the sequence start to position 1. The offset metadata is element metadata because it contains metadata for each element of the sequence set. For example, for a set of amino acid sequences given as an {\tt AAStringSet}, the offset vector is assigned with

<<>>=
seq1 <- AAStringSet(c("GACGAGGACCGA","AGTAGCGAGGT","ACGAGGTCTTT",
                      "GGACCGAGTCGAGG"))
positionMetadata(seq1) <- c(3, 5, 2, 10)
@

The offset vector must contain positive or negative integer values and must have the same length as the sequence set. If position metadata is not assigned to a sequence set, the position-specific kernel assumes that each sequence starts at position 1. In this case, the sequences must be stored aligned in the sequence set.

It should be mentioned that the weighted degree kernel is the mixture kernel of position-specific spectrum kernels. In the following example, we use a weighted degree kernel in unnormalized form and create the kernel matrix with this kernel for the small sequence set given above:

<<>>=
wdK3 <- spectrumKernel(k=3, distWeight=1, mixCoef=c(0.5,0.33,0.17),
                       normalized=FALSE)
km <- getKernelMatrix(wdK3, seq1)
km
@

We delete the position metadata through assignment of {\tt NULL} and recreate the kernel matrix, this time without alignment of the sequences via an offset.

<<>>=
positionMetadata(seq1) <- NULL
km <- getKernelMatrix(wdK3, seq1)
km
@

\subsubsection{Distance-Weighted Kernel}
\label{sec:DistWeightedKernel}

The distance-weighted kernel is the second variant of position-dependent kernels. This kernel also evaluates position information, and alignment of the sequences is important here as well. However, instead of considering only patterns located at the same position in the two sequences, this kernel searches for pattern occurrences in the proximity of a position. If the same pattern is found in the neighborhood on the second sequence, the contribution of this pattern depends on the distance between the pattern positions on the two sequences. For a flexible definition of the neighborhood characteristics, a parameterized distance weighting function is introduced. The contribution of a pattern occurrence to the similarity measure is computed as the weighted distance between the pattern positions.

The distance-weighted kernel is defined as

\begin{align}
\label{eq:DistWeightKernel}
k(x,y) = \sum_{m \in \mathcal{M}}\sum_{p=1}^{L_x}\sum_{q=1}^{L_y} \mathbf{1}(m, x, p) \cdot E(p,q) \cdot \mathbf{1}(m, y, q)
\end{align}

with $L_x$ and $L_y$ being the lengths of the sequences $x$ and $y$, and $E(p,q)$ the distance weighting function for positions $p$ and $q$. The position-specific kernel described in the previous subsection is a special case of this kernel with the following distance weighting function:

\begin{align}
\label{eq:PosSpecWeight}
E_{\mathrm{ps}}(p,q) = \left\{
  \begin{array}{ll}
  1 & \text{if} \hspace{3pt} p = q \\
  0 & \text{otherwise}
  \end{array}
  \right.
\end{align}

The position-independent kernel is also a special case of this kernel, with the trivial distance weighting function $E_{\mathrm{pi}}(p,q) = 1$ for all $p \in [1, \dots , L_x]$, $q \in [1, \dots , L_y]$.

With a parameterized distance weighting function, the neighborhood can be adjusted as needed. The package contains three predefined distance weighting functions, here defined as functions of the distance $d = p - q$:

\begin{itemize}
  \item {linear distance weighting:\\
  $E_{\mathrm{lin},\sigma}(d) = \max(1 - \frac{|d|}{\sigma}, 0)$}
  \item {exponential distance weighting:\\
  $E_{\mathrm{exp},\sigma}(d) = \exp(- \frac{|d|}{\sigma})$}
  \item {Gaussian distance weighting:\\
  $E_{\mathrm{gauss},\sigma}(d) = \exp(- \frac{d^2}{\sigma^2})$}
\end{itemize}

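A distance-weighted kernel using one of these predefined functions could, for instance, be created as follows (not evaluated here):

<<eval=FALSE>>=
## spectrum kernel with exponential distance weighting and sigma=10
specK3exp <- spectrumKernel(k=3, distWeight=expWeight(sigma=10))
@
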
In all three cases, the parameter $\sigma$ controls the proximity behavior, with a strict cutoff at $\sigma$ for linear distance weighting. For small $\sigma$ close to 0, the distance-weighted kernel becomes the position-specific kernel; for very large values of $\sigma$, the kernel becomes the position-independent kernel. The parameter $\sigma$ thus allows flexible blending between the two extremes of strict position evaluation and complete position independence for all three distance weighting functions. Positive definiteness of the resulting kernel matrix is ensured for these functions for any parameter values. The following plot shows the three parameterized functions for $\sigma = 5$.

<<eval=F>>=
curve(linWeight(x, sigma=5), from=-25, to=25, xlab="p - q", ylab="weight",
      main="Predefined Distance Weighting Functions", col="green")
curve(expWeight(x, sigma=5), from=-25, to=25, col="blue", add=TRUE)
curve(gaussWeight(x, sigma=5), from=-25, to=25, col="red", add=TRUE)
curve(swdWeight(x), from=-25, to=25, col="orange", add=TRUE)
legend('topright', inset=0.03, title="Weighting Functions", c("linWeight",
       "expWeight", "gaussWeight", "swdWeight"), fill=c("green", "blue",
       "red", "orange"))
text(17, 0.70, "sigma = 5")
@

<<fig=FALSE,echo=FALSE,results=hide>>=
pdf("002.pdf")
curve(linWeight(x, sigma=5), from=-25, to=25, xlab="p - q", ylab="weight",
      main="Predefined Distance Weighting Functions", col="green")
curve(expWeight(x, sigma=5), from=-25, to=25, col="blue", add=TRUE)
curve(gaussWeight(x, sigma=5), from=-25, to=25, col="red", add=TRUE)
curve(swdWeight(x), from=-25, to=25, col="orange", add=TRUE)
legend('topright', inset=0.03, title="Weighting Functions", c("linWeight",
       "expWeight", "gaussWeight", "swdWeight"), fill=c("green", "blue",
       "red", "orange"))
text(17, 0.70, "sigma = 5")
dev.off()
@

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\columnwidth]{002.pdf}
\end{center}
\end{figure}

The shifted weighted degree (SWD) kernel \citep{Raetsch2005} with equal position weighting is a special case of the distance-weighted mixture spectrum kernel, with the distance weighting function

\begin{align}
\label{eq:ShiftedWDKernel}
E_{\mathrm{SWD}}(d) = \frac{1}{2\cdot(|d| + 1)}
\end{align}

This weighting function is similar to exponential distance weighting with $\sigma = 2$, but has a lower peak and larger tails. For brevity, input parameter checks are omitted.

<<>>=
swdWeight <- function(d)
{
    if (missing(d))
        return(function(d) swdWeight(d))

    1 / (2 * (abs(d) + 1))
}
swdK3 <- spectrumKernel(k=3, distWeight=swdWeight(),
                        mixCoef=c(0.5,0.33,0.17))
@

We load sequence data provided with the package, assign shorter sample names for display purposes and compute the kernel matrix with this user-defined distance weighting for the first few sequences only, to save runtime during vignette generation. Position-dependent kernels are considerably slower than their position-independent counterparts.

<<>>=
data(TFBS)
names(enhancerFB) <- paste("Sample", 1:length(enhancerFB), sep="_")
enhancerFB
kmSWD <- getKernelMatrix(swdK3, x=enhancerFB, selx=1:5)
kmSWD[1:5,1:5]
@

The function {\tt swdWeight} is a simple example of a user-defined weighting function without parametrization. It is actually also contained in the package as a predefined weighting function, but it is used here to show how a user-defined function can be written. Please note that the weighting function returns itself as a function if the distance parameter {\tt d} is missing. If parameters are used in a user-defined weighting function, it must be defined as a closure, in which all arguments except the distance argument are fixed once the function is called to do the actual distance weighting. In this way, the function can be passed when creating a kernel, fixing just the parameters without actually passing the distance data, which is not available at that time. As an example, we use the following weighting function

\begin{align}
\label{eq:UserDefWeight}
E_{\mathrm{ud},b}(d) = b^{-|d|}
\end{align}

with the base $b$ as parameter. This example is meant to show the definition of parameterized user-defined weighting functions; of course, it does not lead to anything new, as it is just a scaled version of exponential weighting.

The new weighting function is defined as

<<>>=
udWeight <- function(d, base=2)
{
    if (missing(d))
        return(function(d) udWeight(d, base=base))

    return(base^(-abs(d)))
}

specudK3 <- spectrumKernel(k=3, distWeight=udWeight(base=4),
                           mixCoef=c(0,0.3,0.7))
kmud <- getKernelMatrix(specudK3, x=enhancerFB, selx=1:5)
@

User-defined distance weighting functions must consider three aspects:

\begin{description}
  \item [Definition as closure:]
  When all parameters are passed to the function or set by default, but the distance
  argument is missing, the function must return a new unnamed function with the
  distance argument as single argument and all parameters set to their fixed or
  predefined values.
  \item [Vectorized processing:]
  The function must be able to process numeric vectors containing multiple distances
  and return a numeric vector of weighted distances of identical length.
  \item [Positive definiteness:]
  The kernel matrix generated with this distance weighting function must
  be positive definite. One necessary, but not sufficient, precondition of positive
  definiteness is that the distance weighting function is symmetric, i.e.
  $E(p,q)=E(q,p)$ must be fulfilled. For the example given above, positive
  definiteness is ensured because it is only a scaled version of the
  exponential weighting provided by the package.
\end{description}

For all learning methods that require positive definite kernel matrices, especially the SVMs supported in this package, the third aspect is very important. Please also note that explicit representations are not available for the position-dependent kernel variants.

\subsection{Annotation-Specific Kernel}
\label{sec:AnnSpecKernel}

The annotation-specific kernel evaluates annotation information together with the patterns in the sequence. For example, the similarity considerations for gene sequences could be split according to exonic and intronic parts of the sequences. This is achieved by aligning to the sequence an annotation string that contains one annotation character for each sequence position. Each pattern in the sequence, concatenated with the corresponding pattern of annotation characters, is used as a feature. In the above-mentioned example of exon/intron annotation, the feature space contains separate features for identical patterns in exon and intron regions.

The following small toy example shows how to assign exon/intron annotation to a DNA sequence set. A sequence set with just two sequences for the very similar genes HBA1 and HBA2 from the human genome hg19 is generated. For this example, the packages {\tt BSgenome.Hsapiens.UCSC.hg19}, {\tt TxDb.Hsapiens.UCSC.hg19.knownGene} and {\tt BSgenome} must be installed. Each gene contains three exons. Again, checks on input parameters are omitted.

<<eval=FALSE>>=
getGenesWithExonIntronAnnotation <- function(geneList, genomelib,
                                             txlib)
{
    library(BSgenome)
    library(genomelib, character.only=TRUE)
    library(txlib, character.only=TRUE)
    genome <- getBSgenome(genomelib)
    txdb <- eval(parse(text=txlib))
    exonsByGene <- exonsBy(txdb, by="gene")

    ## generate exon/intron annotation
    annot <- rep("", length(geneList))
    geneRanges <- GRanges()
    exonsSelGenes <- exonsByGene[geneList]

    if (length(exonsSelGenes) != length(geneList))
        stop("some genes are not found")

    for (i in 1:length(geneList))
    {
        exons <- unlist(exonsSelGenes[i])
        exonRanges <- ranges(exons)
        chr <- as.character(seqnames(exons)[1])
        strand <- as.character(strand(exons)[1])
        numExons <- length(width(exonRanges))

        for (j in 1:numExons)
        {
            annot[i] <-
                paste(annot[i],
                      paste(rep("e", width(exonRanges)[j]),
                            collapse=""), sep="")

            if (j < numExons)
            {
                annot[i] <-
                    paste(annot[i],
                          paste(rep("i", start(exonRanges)[j+1] -
                                         end(exonRanges)[j] - 1),
                                collapse=""), sep="")
            }
        }

        geneRanges <-
            c(geneRanges,
              GRanges(seqnames=Rle(chr),
                      strand=Rle(strand(strand)),
                      ranges=IRanges(start=start(exonRanges)[1],
                                     end=end(exonRanges)[numExons])))
    }

    ## retrieve gene sequences
    seqs <- getSeq(genome, geneRanges)
    names(seqs) <- geneList
    ## assign annotation
    annotationMetadata(seqs, annCharset="ei") <- annot
    seqs
}

## get gene sequences for HBA1 and HBA2 with exon/intron annotation
## 3039 and 3040 are the geneID values for HBA1 and HBA2
hba <- getGenesWithExonIntronAnnotation(c("3039", "3040"),
                                        "BSgenome.Hsapiens.UCSC.hg19",
                                        "TxDb.Hsapiens.UCSC.hg19.knownGene")
@

For exon/intron annotation, we defined the annotation character set with the character "e" denoting an exon position and "i" denoting an intron position. Up to 32 characters can be defined by the user in the annotation character set. The character "-" is used to mask sequence parts that should not be evaluated by the sequence kernel. It must not be used as a regular annotation character in the annotation character set because it already has the masking function assigned. The annotation strings are built up from the lengths of the exons and introns of the two genes in the gene set. Annotation strings must have exactly as many characters as the corresponding sequences. With the function {\tt annotationMetadata}, the annotation character set and the annotation strings are assigned to the sequence set. Individual sequences or sequence parts can be shown together with annotation information in the following way:

<<eval=FALSE>>=
annotationCharset(hba)
showAnnotatedSeq(hba, sel=1, start=1, end=400)
@

Annotation metadata can be deleted from the sequence set through assignment of {\tt NULL}. Now we generate an explicit representation with and without annotation. First, kernel objects for the standard and the annotation-specific version of the spectrum kernel are created. The annotation metadata is only evaluated in the kernel when the kernel parameter {\tt annSpec} is set to {\tt TRUE}, i.e. for the second kernel object, which can be used for sequence analysis in the same way as the standard spectrum kernel object.

<<eval=FALSE>>=
specK2 <- spectrumKernel(k=2)
specK2a <- spectrumKernel(k=2, annSpec=TRUE)
erK2 <- getExRep(hba, specK2, sparse=FALSE)
erK2[,1:6]
erK2a <- getExRep(hba, specK2a, sparse=FALSE)
erK2a[,1:6]
@

We create kernel matrices from the explicit representations via the function {\tt linearKernel}.

<<eval=FALSE>>=
km <- linearKernel(erK2)
km
kma <- linearKernel(erK2a)
kma
@

For normalized sequence kernels, the similarity values range from 0 to 1. A value of 1 means that the sequences are considered identical with respect to the similarity measure defined by the kernel. The similarity between HBA1 and HBA2 is close to 1, i.e. the sequences are very similar with respect to their dimer content. For the annotation-specific kernel, the similarity between HBA1 and HBA2 decreases because the dimer content of exonic and intronic regions is considered separately.

The second example for annotation-specific kernels predicts the oligomerization state of coiled coil proteins from the sequence using a heptad annotation of the amino acid sequences. The sequence data available on the web page of the R package {\tt procoil} contain amino acid sequences for which the formation of dimers or trimers is known. We first predict the oligomerization state from the sequence alone. In a second step, a heptad pattern placed along the sequence is additionally used as annotation information. The heptad characters describe the positions of the amino acids within the coiled coil structure.

The analysis is done with the gappy pair kernel, first without annotation and then using the annotation-specific version of this kernel. The coiled coil data are 477 amino acid sequences with lengths between 8 and 123 amino acids. Roughly 80\% of the sequences form dimers and only 20\% trimers, which means that the data is very unbalanced. We will see how we can better deal with this unbalancedness. First the data is loaded; the sequences {\tt ccseq} are stored in an {\tt AAStringSet}, the annotation {\tt ccannot} in a character vector and the labels {\tt yCC} in a factor. The annotation is already assigned to the sequence set, but to demonstrate the assignment, it is deleted in the first step.

<<>>=
data(CCoil)
ccseq
ccannot[1:3]
head(yCC)
yCC <- as.numeric(yCC)
## delete annotation metadata
annotationMetadata(ccseq) <- NULL
annotationMetadata(ccseq)
gappy <- gappyPairKernel(k=1, m=10)
train <- sample(1:length(ccseq), 0.8*length(ccseq))
test <- c(1:length(ccseq))[-train]
model <- kbsvm(ccseq[train], y=yCC[train], kernel=gappy,
               pkg="LiblineaR", svm="C-svc", cost=100)
pred <- predict(model, ccseq[test])
evaluatePrediction(pred, yCC[test], allLabels=unique(yCC))
@

The high accuracy is misleading because it is distorted by the larger class. Because of the unbalancedness of the data, the balanced accuracy is the relevant performance measure here. Now annotation metadata is assigned to the sequence set, and the same test is performed with the annotation-specific gappy pair kernel. The annotation sequences are added as metadata to the sequence set together with the annotation character set, specified as a string containing all characters that appear in the annotation. The "-" character must not be used in this character set because it is reserved for masking sequence parts in the annotation strings.

<<>>=
## assign annotation metadata
annotationMetadata(ccseq, annCharset="abcdefg") <- ccannot
annotationMetadata(ccseq)[1:5]
annotationCharset(ccseq)
showAnnotatedSeq(ccseq, sel=2)
gappya <- gappyPairKernel(k=1, m=10, annSpec=TRUE)
model <- kbsvm(ccseq[train], y=yCC[train], kernel=gappya,
               pkg="LiblineaR", svm="C-svc", cost=100)
pred <- predict(model, ccseq[test])
evaluatePrediction(pred, yCC[test], allLabels=unique(yCC))
@

We see that the balanced accuracy decreases, which could indicate that the annotation-specific kernel performs worse than the kernel without annotation. However, we have silently assumed that the same cost value should be used for the annotated gappy pair kernel. Comparing different kernels with the same hyperparameter values is usually not adequate, because the best-performing hyperparameter value might be quite different for the two kernels. Hyperparameter selection must be performed individually for each kernel. Please also note that kernels should be compared based on multiple training runs on independent data or on cross validation, not on a single training run. In the following example, grid search (see section \ref{sec:GridSearch}) is performed for both kernels at the same time, with a kernel list and a vector of values for the {\tt cost} hyperparameter, using 5-fold cross validation at each grid point. With the parameter {\tt perfObjective}, the balanced accuracy is set as the objective for the parameter search. Multiple performance parameters (accuracy, balanced accuracy, and Matthews correlation coefficient) are recorded for each grid point when the parameter {\tt perfParameters} is set to {\tt "ALL"}.

<<>>=
## grid search with two kernels and 6 hyperparameter values
## using the balanced accuracy as performance objective
model <- kbsvm(ccseq[train], y=yCC[train],
               kernel=c(gappy, gappya), pkg="LiblineaR", svm="C-svc",
               cost=c(1,10,50,100,200,500), explicit="yes", cross=5,
               perfParameters="ALL", perfObjective="BACC",
               showProgress=TRUE)
result <- modelSelResult(model)
result
@

In grid search with balanced accuracy as performance objective, the annotated gappy pair kernel shows better performance. The detailed performance information for each grid point can be retrieved in the following way: 

<<>>=
perfData <- performance(result)
perfData
which(perfData$BACC[1,] == max(perfData$BACC[1,]))
which(perfData$BACC[2,] == max(perfData$BACC[2,]))
@

We see from the balanced accuracies that the annotated gappy pair kernel often performs better than the kernel without annotation. The best-performing hyperparameter value is different for the two kernels. This example also clearly shows that accuracy as a performance measure does not help at all here. For unbalanced datasets, the Matthews correlation coefficient can also be selected as performance objective and would have delivered similar results; the best MCC value is observed for the annotated gappy pair kernel at the same cost value. Finally, we can plot the grid performance data for one performance measure. This is especially helpful in situations where multiple kernels and a larger number of hyperparameter values are to be compared, as kernels and regions of good or worse performance can then be identified more easily.

<<eval=FALSE>>=
plot(result, sel="BACC")
@

<<fig=FALSE,echo=FALSE,results=hide>>=
pdf("005.pdf")
plot(result, sel="BACC")
dev.off()
@

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\columnwidth]{005.pdf}
\end{center}
\end{figure}

The color range in this plot is similar to that of heatmaps. The annotated kernel, which is {\tt K\_2} in the plot, shows a higher balanced accuracy in the grid search than the gappy pair kernel without annotation for several values of the hyperparameter.

Through user-definable annotation, the sequence kernels provide great flexibility for including additional sequence-related information in kernel processing. As shown in this example, the prediction of coiled-coil interactions in proteins, as proposed in \cite{Mahrenholz2011} and described in the vignette of the R package {\tt procoil}\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/procoil.html}}, can be performed with \KeBABS\ in a faster and more flexible way. The heptad information is assigned as sequence annotation to the amino acid sequences. The annotation-specific gappy pair kernel can be used instead of the {\tt PrOCoil} kernel, with regular and irregular heptad patterns.

\subsection{List of Spectrum Kernel Variants}
\label{sec:KernelVariants}

In this section, the different kernel variants are recapitulated in an exemplary manner for the spectrum kernel. Only the normalized versions are listed here; all kernels can be used unnormalized as well.

<<>>=
## position-independent spectrum kernel normalized
specK2 <- spectrumKernel(k=3)
#
## annotation specific spectrum kernel normalized
specK2a <- spectrumKernel(k=3, annSpec=TRUE)
#
## spectrum kernel with presence normalized
specK2p <- spectrumKernel(k=3, presence=TRUE)
#
## mixed spectrum kernel normalized
specK2m <- spectrumKernel(k=3, mixCoef=c(0.5, 0.33, 0.17))
#
## position-specific spectrum kernel normalized
specK2ps <- spectrumKernel(k=3, distWeight=1)
#
## mixed position-specific spectrum kernel normalized
## also called weighted degree kernel normalized
specK2wd <- spectrumKernel(k=3, distWeight=1,
                           mixCoef=c(0.5, 0.33, 0.17))
#
## distance-weighted spectrum kernel normalized
specK2lin <- spectrumKernel(k=3, distWeight=linWeight(sigma=10))
specK2exp <- spectrumKernel(k=3, distWeight=expWeight(sigma=10))
specK2gs <- spectrumKernel(k=3, distWeight=gaussWeight(sigma=10))
#
## shifted weighted degree kernel with equal position weighting normalized
specK2swd <- spectrumKernel(k=3, distWeight=swdWeight(),
                            mixCoef=c(0.5, 0.33, 0.17))
#
## distance-weighted spectrum kernel with user-defined distance
## weighting
udWeight <- function(d, base=2)
{
    if (!(is.numeric(base) && length(base) == 1))
        stop("parameter 'base' must be a single numeric value\n")

    if (missing(d))
        return(function(d) udWeight(d, base=base))

    if (!is.numeric(d))
        stop("'d' must be a numeric vector\n")

    return(base^(-d))
}
specK2ud <- spectrumKernel(k=3, distWeight=udWeight(base=2))
@

\subsection{Kernel Lists}
\label{sec:KernelLists}

For grid search (see section \ref{sec:GridSearch}) and model selection (see section \ref{sec:ModelSelection}), multiple kernels are often relevant. In such cases, a list of kernel objects can be passed to \kbsvm\ instead of a single kernel object. Kernel lists are created either implicitly, by passing a vector of values for a single kernel parameter to the kernel constructor, or explicitly, by putting multiple kernel objects into a list. The following example shows the first variant in a grid search with the hyperparameter {\tt cost}. The output shows the cross validation error of 5-fold cross validation for each combination of kernel and cost value.

<<>>=
specK25 <- spectrumKernel(k=2:5)
specK25
train <- 1:100
model <- kbsvm(x=enhancerFB[train], y=yFB[train],
               kernel=specK25, pkg="LiblineaR", svm="C-svc",
               cost=c(1,5,10,20,50,100), cross=5, explicit="yes",
               showProgress=TRUE)
modelSelResult(model)
@

A value range is supported in the spectrum kernel for the kernel parameter {\tt k} and in the gappy pair kernel for the parameters {\tt k} and {\tt m}. The range of {\tt k} for the gappy pair kernel is more limited because its feature space grows exponentially with an additional factor of 2 in the exponent. Kernel lists can also be created by putting multiple kernel objects into a list, for example

<<>>=
kernelList1 <- list(spectrumKernel(k=3), mismatchKernel(k=3, m=1),
                    gappyPairKernel(k=2, m=4))
@

or by concatenating existing kernel lists 

<<>>=
kernelList2 <- c(spectrumKernel(k=2:4), gappyPairKernel(k=1, m=2:5))
@

983 
 

984 
Apart from grid search and model selection kernel lists cannot be used instead of single kernel objects. 

\section{Data Representations}
\label{sec:DataRepresentations}

For SVM-based learning, the training and prediction data can be presented either as a numeric kernel matrix or as a numeric data matrix in which the rows represent the data samples and the columns the features. \KeBABS\ kernels support the generation of both data structures. For position-dependent sequence kernels, the explicit representation is not available. Learning via the explicit representation is possible with all supported packages {\tt kernlab}, {\tt e1071}, and {\tt LiblineaR}; learning via a kernel matrix is currently only available with package {\tt kernlab}.

The generation of these data structures is usually performed internally by \KeBABS\ as part of the \kbsvm\ function, invisibly for the user. For very large datasets, the data structures can be precomputed once and saved to a file to avoid repeated computation. For precomputing, or for usage in other machine learning methods or R packages not supported by \KeBABS, the following sections give a short overview of both data structures and describe how they can be created.

\subsection{Kernel Matrix}
\label{sec:KernelMatrix}

The kernel matrix contains similarity values between pairs of sequences computed with the kernel function. Within \KeBABS, kernel matrices are stored in the class {\tt KernelMatrix}, which is essentially a lightweight wrapper around the standard dense matrix class {\tt matrix}. \KeBABS\ supports only dense kernel matrices. For usage in other R packages, the kernel matrix must be converted to the format needed by the respective package. A kernel matrix can be created in two ways. We have already seen the first of them in previous sections.

<<>>=
specK2 <- spectrumKernel(k=2)
km <- getKernelMatrix(specK2, x=enhancerFB)
class(km)
dim(km)
km[1:3, 1:3]
@

If the argument {\tt y} is not specified in the call to getKernelMatrix, a quadratic, symmetric kernel matrix of all samples against all samples is created. For normalized sequence kernels, the kernel values in the matrix range between 0 and 1; for unnormalized kernels, the values are larger than or equal to 0. A heatmap of the kernel matrix showing a few clusters is plotted with

<<eval=FALSE>>=
heatmap(km, symm=TRUE)
@

%<<fig=FALSE,echo=FALSE,results=hide>>=
%pdf("003.pdf")
%heatmap(km, symm=TRUE)
%dev.off()
%@

% use pre-created heatmap with lower resolution to save space

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\columnwidth]{Heatmap.png}
\end{center}
\end{figure}

An alternative way to generate the kernel matrix, through direct invocation of the kernel matrix function via the kernel object, is shown below.

<<>>=
specK2 <- spectrumKernel(k=2)
km <- specK2(x=enhancerFB)
km[1:3, 1:3]
@

The kernel matrix can also be generated for a subset of the sequences, without subsetting the sequence set, by using the parameter {\tt selx}. In the following example, we use only 5 sequences.

<<>>=
km <- getKernelMatrix(specK2, x=enhancerFB, selx=c(1,4,25,137,300))
km
@

In the following example, we create a rectangular kernel matrix with similarity values between two sets of sequences. First we create two sets of sequences from the package-provided sequence set and then the kernel matrix.

<<>>=
seqs1 <- enhancerFB[1:200]
seqs2 <- enhancerFB[201:500]
km <- getKernelMatrix(specK2, x=seqs1, y=seqs2)
dim(km)
km[1:4, 1:5]
@

or the same rectangular kernel matrix via parameters {\tt selx} and {\tt sely} 

<<>>=
km <- getKernelMatrix(specK2, x=enhancerFB, selx=1:200,
                      y=enhancerFB, sely=201:500)
dim(km)
@

As mentioned above, for normal training and prediction in \KeBABS\ you do not need to compute the kernel matrices explicitly because this is done automatically.

\subsection{Explicit Representation}
\label{sec:ExRep}

For all kernels except the mismatch kernel and the position-dependent kernel variants, an explicit representation can be computed from the sequence set for a given kernel. The explicit representation contains the feature counts for all features occurring in the samples. By default, any feature from the feature space that does not occur in the sample set is not listed in the explicit representation. Explicit representations can be generated in sparse or dense matrix form.

\subsubsection{Linear Explicit Representation}
\label{sec:LinearExRep}

The explicit representation is stored in an object of class {\tt ExplicitRepresentationSparse} or {\tt ExplicitRepresentationDense}. Apart from the feature counts, it also contains the kernel object used for generating the explicit representation and a flag that indicates whether it is a linear or quadratic explicit representation. The feature count matrix can be extracted with a standard coercion to type {\tt matrix} if needed. In the following example, we generate a dense explicit representation for the first five package-provided sequences with the unnormalized spectrum kernel for $k=2$.

<<>>=
specK2 <- spectrumKernel(k=2, normalized=FALSE)
erd <- getExRep(enhancerFB, selx=1:5, kernel=specK2, sparse=FALSE)
erd
@

Because of the short $k$-mer length, all features are present in all sequences. If {\tt k} is increased to 6, the picture changes.

<<>>=
specK6 <- spectrumKernel(k=6, normalized=FALSE)
erd <- getExRep(enhancerFB, selx=1:5, kernel=specK6,
                sparse=FALSE)
dim(erd)
erd[,1:6]
@

Here we see that most features occur in only a few sequences. To save storage space, features that are not present in any sample are removed from the dense explicit representation. In spite of this storage improvement, a sparse storage format is much more efficient in this situation. A sparse explicit representation is created by setting the parameter {\tt sparse} to {\tt TRUE}. Now we create the explicit representation for all sequences, both in sparse and dense format, to see the space savings of the sparse format.

<<>>=
specK6 <- spectrumKernel(k=6, normalized=FALSE)
erd <- getExRep(enhancerFB, kernel=specK6, sparse=FALSE)
dim(erd)
object.size(erd)
ers <- getExRep(enhancerFB, kernel=specK6, sparse=TRUE)
dim(ers)
object.size(ers)
ers[1:5, 1:6]
@

At around 16 MB, the dense explicit representation is four times as large as the sparse explicit representation with about 4 MB. The memory efficiency factor becomes more prominent when the feature space dimension and the number of sequences increase. The sparse format can also lead to considerable speedups in learning. The SVMs in package {\tt kernlab} do not support the sparse format for explicit representations; the SVMs in the other packages support both variants. \KeBABS\ internally selects the appropriate format for the used SVM during training and prediction. With the parameter {\tt explicitType} in \kbsvm, dense or sparse processing can be enforced.

When the flag {\tt zeroFeatures} is set, zero-feature columns remain in the explicit representation, and the number of columns increases to the full feature space dimension. For sparse explicit representations, only the names of the features that do not occur are added when the flag is set. This can be necessary when using the explicit representation together with other software.
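
The effect can be sketched as follows; this assumes, as described above, that {\tt zeroFeatures} is accepted when generating the explicit representation (chunk not evaluated):

<<eval=FALSE>>=
## sketch: keep zero-feature columns; the number of columns then
## equals the full feature space dimension (4^6 for a DNA spectrum
## kernel with k=6)
erdFull <- getExRep(enhancerFB, selx=1:5, kernel=specK6,
                    sparse=FALSE, zeroFeatures=TRUE)
dim(erdFull)
@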

Please consider that, for all kernels except the motif kernel, the feature space dimension increases drastically with increasing kernel parameter {\tt k}, with an especially large increase for the gappy pair kernel. For the feature space size, the parameter {\tt m} is not relevant in the mismatch kernel, and for the gappy pair kernel the increase with {\tt m} is not as drastic as with {\tt k}, which also allows larger values of {\tt m}.

An explicit representation can also be generated for a subset of the features in the feature space. This subset is specified as a character vector in the {\tt features} argument when generating the explicit representation. For the normalization, all features of the feature space are used. Depending on the setting of the flag {\tt zeroFeatures}, features from this subset that do not occur in the sequences are either included in the explicit representation or not. It should be mentioned that an explicit representation generated from a feature subset with a normalized spectrum kernel or normalized gappy pair kernel is not identical to an explicit representation generated from the same features with a normalized motif kernel, because the feature space, and therefore the normalization, differs between the two cases.
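
As a minimal sketch, an explicit representation for a feature subset could be requested as shown below; the three 6-mers are arbitrary illustrative choices, and the subset is assumed to be passed via the {\tt features} argument (chunk not evaluated):

<<eval=FALSE>>=
## sketch: explicit representation restricted to a feature subset;
## the listed 6-mers are arbitrary examples
erSub <- getExRep(enhancerFB, selx=1:5, kernel=specK6, sparse=FALSE,
                  features=c("AAACGT", "CGTTAA", "TTTGCC"))
@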

Finally, a small example with clustering shows how a kernel matrix or an explicit representation generated with a \KeBABS\ kernel can be used in other machine learning methods. We compute a kernel matrix for the 500 samples with the gappy pair kernel and use affinity propagation clustering \citep{Bodenhofer2011a}, which finds cluster structures in the data based on the similarity matrix. The package {\tt apcluster}\footnote{\url{http://cran.r-project.org/package=apcluster}} must be installed for this example.

<<>>=
library(apcluster)
gappyK1M4 <- gappyPairKernel(k=1, m=4)
km <- getKernelMatrix(gappyK1M4, enhancerFB)
apres <- apcluster(s=km, p=0.8)
length(apres)
@

With the given setting of the preference parameter {\tt p=0.8}, thirteen clusters were found. The clustering result is stored in the result object {\tt apres}. We perform agglomerative clustering starting from the previous clustering result to generate a cluster dendrogram.

<<eval=FALSE>>=
aggres <- aggExCluster(km, apres)
plot(aggres)
@

<<fig=FALSE,echo=FALSE,results=hide>>=
aggres <- aggExCluster(km, apres)
pdf("004.pdf")
plot(aggres)
dev.off()
@

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\columnwidth]{004.pdf}
\end{center}
\end{figure}

The cluster dendrogram shows that either a two-cluster or a four-cluster solution is quite stable with respect to the preference parameter. The explicit representation could also have been used for clustering, together with one of the similarity measures defined in {\tt apcluster}. The simplest case, a linear kernel as similarity measure, is identical to clustering with the kernel matrix; with a changed preference value we get four clusters.

<<>>=
exrep <- getExRep(enhancerFB, gappyK1M4, sparse=FALSE)
apres1 <- apcluster(s=linearKernel, x=exrep, p=0.1)
length(apres1)
@

\subsubsection{Quadratic Explicit Representation}
\label{sec:QuadraticExrep}

Learning with a quadratic kernel, i.e. a polynomial kernel of degree 2 without offset, is based on looking at pairs of features instead of single features. An analysis with a quadratic kernel is performed when the parameter {\tt featureType} in {\tt kbsvm} is set to {\tt "quadratic"}. The SVMs in package {\tt LiblineaR} are linear SVMs and do not support nonlinear kernels. To allow the usage of the SVMs in {\tt LiblineaR} for quadratic analysis, a quadratic explicit representation can be computed from the linear explicit representation. The features in the quadratic explicit representation are computed from pairs of features of the linear explicit representation. In the feature names of the quadratic explicit representation, the individual features of a pair are separated by an underscore character.

Please note that symmetric feature pairs are combined into a single value to reduce the dimensionality of the quadratic explicit representation. The quadratic explicit representation can be used for training and prediction with {\tt LiblineaR} in the same way as a linear explicit representation. For the SVMs in the other packages, the quadratic explicit representation is not needed because the linear explicit representation is combined with a polynomial kernel of degree 2 for quadratic processing.

All of the necessary conversions from linear to quadratic explicit representation are handled within \KeBABS\ automatically. If the quadratic explicit representation is needed for usage in an external software package, it can be generated from any linear explicit representation in the following way, in this example again for only five sequences:

<<>>=
exrep <- getExRep(x=enhancerFB, selx=1:5, gappyK1M4, sparse=FALSE)
dim(exrep)
erquad <- getExRepQuadratic(exrep)
dim(erquad)
erquad[1:5, 1:5]
@

The number of features increases drastically in the quadratic explicit representation compared to the linear explicit representation. 

\section{Training and Prediction for Binary Classification}
\label{sec:Training}

As already seen in Section \ref{sec:GettingStarted}, the \KeBABS\ training method \kbsvm\ follows the usual SVM training logic. In addition to the sequences, the label vector, and a sequence kernel, the function allows for selecting one SVM from the supported packages. The SVM parameters passed to the function depend on the selected SVM. They are specified in a unified format to avoid changes in parameter names or formats when switching SVM implementations. \KeBABS\ translates the parameter names and formats for the specific SVM that is used in the training run. Details on the supported packages, the available SVMs, and the unified SVM parameters for each SVM can be found on the help page for \kbsvm. Please be aware that identically named parameters, e.g. {\tt cost}, do not have the same meaning in different SVM formulations. This means that the values of such SVM parameters need to be adapted when changing to an SVM with a different SVM formulation.

Please note that, in binary classification, the value used as the positive label must be selected in a specific way. For numeric labels, the positive class must have the larger value; for factor- or character-based labels, the positive label must come first when sorting the labels in descending order according to the C locale.
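
This rule can be illustrated with a small sketch using hypothetical labels (chunk not evaluated):

<<eval=FALSE>>=
## numeric labels: the larger value, here 1 in c(-1, 1), denotes the
## positive class; for character labels, sorting in descending order
## according to the C locale puts the positive label first
yChar <- c("pos", "neg", "pos")
sort(unique(yChar), decreasing=TRUE)[1]  ## "pos" is the positive label
@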

Instead of sequences, a precomputed kernel matrix or an explicit representation can also be passed as training data to the method \kbsvm. Precomputed kernel matrices can be interesting for large datasets or kernels with higher runtimes, when multiple training runs should be performed without the overhead of recreating the kernel matrix on each run. A sequence kernel should be passed to \kbsvm\ in this situation as well, as it might be needed for the computation of feature weights. The following example shows how to train and predict with a precomputed explicit representation.

<<>>=
gappyK1M4 <- gappyPairKernel(k=1, m=4)
exrep <- getExRep(enhancerFB, gappyK1M4, sparse=TRUE)
numSamples <- length(enhancerFB)
trainingFraction <- 0.7
train <- sample(1:numSamples, trainingFraction * numSamples)
test <- c(1:numSamples)[-train]
model <- kbsvm(x=exrep[train, ], y=yFB[train], kernel=gappyK1M4,
               pkg="LiblineaR", svm="C-svc", cost=15)
pred <- predict(model, exrep[test, ])
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))
@

The method {\tt kbsvm()} returns a \KeBABS\ model of class {\tt KBModel}, which is needed for prediction, the calculation of feature weights, and the generation of prediction profiles. It stores all data needed for these functions in an SVM-independent format. It also contains the native model returned by the selected SVM. This model can be retrieved from the \KeBABS\ model with the accessor {\tt svmModel}. Apart from this data, the \KeBABS\ model can also contain results returned from cross validation, grid search, and model selection (see Section \ref{sec:HyperparSel}).

The method {\tt predict()} is used to predict the class of new sequences from the model. The method returns the class labels by default. Controlled by the parameter {\tt predictionType}, the {\tt predict()} method can also return the decision values or class membership probability values instead of the class labels. For class membership probability values, a probability model is necessary, which is computed during training when the parameter {\tt probModel} is set to {\tt TRUE}. When feature weights (see Section \ref{sec:FeatureWeights}) are available in the model, prediction is performed entirely within \KeBABS. Native prediction with the selected SVM can be enforced by setting the flag {\tt native} to {\tt TRUE} in the {\tt predict()} method.
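
As a sketch, class membership probabilities could be obtained as follows; this assumes that {\tt "probability"} is the corresponding value of {\tt predictionType} and that the selected SVM supports probability models (chunk not evaluated):

<<eval=FALSE>>=
## sketch: train with a probability model and predict class
## membership probabilities instead of labels
model <- kbsvm(x=exrep[train, ], y=yFB[train], kernel=gappyK1M4,
               pkg="e1071", svm="C-svc", cost=15, probModel=TRUE)
predprob <- predict(model, exrep[test, ],
                    predictionType="probability")
@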

Prediction profiles (see Section \ref{sec:PredictionProfiles}) are computed during prediction for the predicted sequences when the parameter {\tt predProfiles} is set to {\tt TRUE} in {\tt predict}. Prediction profile generation requires the presence of feature weights (see Section \ref{sec:FeatureWeights}) in the \KeBABS\ model. 
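
A minimal sketch of requesting prediction profiles during prediction, assuming a model with feature weights (chunk not evaluated):

<<eval=FALSE>>=
## sketch: compute prediction profiles for the predicted sequences
pred <- predict(model, enhancerFB[test], predProfiles=TRUE)
@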

For classification, the {\tt evaluatePrediction()} method can be used to compute the performance measures accuracy (ACC), balanced accuracy (BACC), Matthews correlation coefficient (MCC), area under the ROC curve (AUC), sensitivity (SENS), specificity (SPEC), and precision (PREC) from the predicted values and the true labels. The AUC value is only computed when the decision values are additionally passed to the function, as in the following example. It is currently available for binary classification only.

<<>>=
preddec <- predict(model, exrep[test, ], predictionType="decision")
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB), decValues=preddec)
@

The Receiver Operating Characteristics (ROC) curve is supported for binary classification in \KeBABS\ as shown below. 

<<eval=FALSE>>=
rocdata <- computeROCandAUC(preddec, yFB[test], unique(yFB))
plot(rocdata, main="Receiver Operating Characteristics")
@

<<fig=FALSE,echo=FALSE,results=hide>>=
rocdata <- computeROCandAUC(preddec, yFB[test], unique(yFB))
pdf("008.pdf")
plot(rocdata, main="Receiver Operating Characteristics")
dev.off()
@

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\columnwidth]{008.pdf}
\end{center}
\end{figure}

\section{Selection of Kernel Parameters, Hyperparameters and Models}
\label{sec:HyperparSel}

For the selection of the kernel, the kernel parameters, and the SVM hyperparameters, \KeBABS\ provides cross validation and grid search. Both functionalities are implemented entirely in the {\tt kebabs} package. They are available for all supported SVMs in a package-independent way for binary and multiclass classification as well as regression. Model selection is also provided by \KeBABS\ for all supported SVMs. All of these functions require repeated training and prediction; therefore, \KeBABS\ was designed with a strong focus on good performance of individual training and prediction runs, allowing for a reasonable number of repetitions.

Under special circumstances, it might make sense to first narrow down the parameter range with the fast SVM implementations in package {\tt LiblineaR} and then refine the reduced range with the slower but likely slightly more accurate SVMs from package {\tt e1071} or {\tt kernlab}. 
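
Such a two-stage approach could be sketched as follows; the cost grids are arbitrary illustrations (chunk not evaluated):

<<eval=FALSE>>=
## stage 1: coarse cost grid with the fast LiblineaR implementation
model1 <- kbsvm(x=enhancerFB[train], y=yFB[train],
                kernel=spectrumKernel(k=3), pkg="LiblineaR",
                svm="C-svc", cost=c(0.01, 1, 100, 10000), cross=5)
## stage 2: refined grid around the best coarse value with e1071
model2 <- kbsvm(x=enhancerFB[train], y=yFB[train],
                kernel=spectrumKernel(k=3), pkg="e1071",
                svm="C-svc", cost=c(30, 100, 300), cross=5)
@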

\subsection{Cross Validation}
\label{sec:CrossValidation}

Cross validation (CV) is provided in \KeBABS\ in three different variants:

\begin{description}
 \item [$k$-fold cross validation:]
 For $k$-fold cross validation, the data set is split into $k$ roughly equally sized parts
 called folds. One of these folds is used as test set and the other $k-1$ folds as training
 set. In $k$ training runs, the test fold is changed cyclically, giving $k$ different
 performance values on different subsets of the data. This variant is selected by
 setting the parameter {\tt cross}, which defines the number of folds, to a value larger
 than 1 in \kbsvm.
 \item [Leave-one-out cross validation (LOOCV):]
 LOOCV is $k$-fold CV taken to the extreme, with a fold size of one sample. This
 variant is computationally quite demanding and leads to the same number of training runs
 as there are samples in the data set. It is selected by setting the parameter
 {\tt cross} to -1 in \kbsvm.
 \item [Grouped cross validation (GCV):]
 For this variant, samples must be assigned to groups before running cross validation.
 Grouping can be performed in a way that learning can focus on the relevant features
 and is not distracted by simple features which do not really give adequate information
 about the classes. GCV is a specialized version of $k$-fold CV which respects group
 membership when splitting the data into folds. This means that one group is never
 distributed over more than one fold. GCV is performed when passing the group
 assignment of the samples as an integer vector or factor with the parameter {\tt groupBy} to
 \kbsvm, together with a value of the parameter {\tt cross} that is smaller than or equal to the
 number of groups.
\end{description}

For all types of CV, the result can be retrieved from the model with the accessor function {\tt cvResult}. For classification, the cross validation error is the mean classification error over all CV runs. For regression, the average of the mean squared errors of the runs is used.

For classification tasks, the collection of additional performance parameters during the cross validation runs can be requested with the parameter {\tt perfParameters} in \kbsvm. This can be helpful especially when working with unbalanced datasets. For multiple repetitions of CV, the parameter {\tt noCross} is used.

In the following example, grouped CV for the coiled coil sequences is shown as 3-fold grouped cross validation with two repetitions. The groups were determined through single linkage clustering of sequence similarities derived from ungapped heptad-specific pairwise alignment of the sequences. The variable {\tt ccgroups} contains the precalculated group assignments for the individual sequences. The grouping is a measure against overrepresentation of specific sequences in the PDB and ensures that no pair of sequences from two different clusters matches to a degree of 60\% or more of the match positions relative to the length of the shorter sequence (see also the data web page of package {\tt procoil}\footnote{\url{http://www.bioinf.jku.at/software/procoil/data.html}}).


<<>>=
data(CCoil)
ccseq
head(yCC)
head(ccgroups)
gappyK1M6 <- gappyPairKernel(k=1, m=6)
model <- kbsvm(x=ccseq, y=as.numeric(yCC), kernel=gappyK1M6,
               pkg="LiblineaR", svm="C-svc", cost=30, cross=3,
               noCross=2, groupBy=ccgroups, perfObjective="BACC",
               perfParameters=c("ACC", "BACC"))
cvResult(model)
@


\subsection{Grid Search}
\label{sec:GridSearch}

The grid search functionality gives an overview of the performance behavior under different parameter settings and supports the user in selecting the sequence kernel and optimal values for the kernel parameters and SVM hyperparameters. The grid is defined as a rectangular data structure: each row corresponds to a single sequence kernel object with a specific setting of the kernel parameters, and the columns correspond to the SVM hyperparameters. \KeBABS\ automatically performs cross validation for each of these grid points and by default returns the CV error as performance measure. Grouped CV can also be used for each grid point, as shown above. The grid search results are retrieved from the model with the accessor {\tt modelSelResult}, which is identical to the accessor for model selection results.


<<>>=
specK24 <- spectrumKernel(k=2:4)
gappyK1M24 <- gappyPairKernel(k=1, m=2:4)
gridKernels <- c(specK24, gappyK1M24)
cost <- c(1, 10, 100, 1000, 10000)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gridKernels,
               pkg="LiblineaR", svm="C-svc", cost=cost, cross=3,
               explicit="yes", showProgress=TRUE)
modelSelResult(model)
@


In this example, only 3-fold CV was used to save time during vignette generation. With 10-fold CV, the differences between the kernels become more prominent because the effects of the random splits of the data average out better. When the parameter {\tt showProgress} is set to {\tt TRUE} in \kbsvm, a progress indication is displayed.

The default performance parameter in grid search is the CV error, as in cross validation. For datasets with unbalanced class distribution, other performance parameters, collected with the parameter {\tt perfParameters}, might be more expressive. For grid search, the performance objective can be defined explicitly with the parameter {\tt perfObjective} as the criterion for selecting the best setting of the grid parameters. The accuracy (ACC), the balanced accuracy (BACC), the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC) can be chosen as performance objective. The AUC is currently only supported for binary classification.
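
For instance, the balanced accuracy could serve as selection criterion for the grid search shown above. The following sketch (not evaluated here; it reuses the {\tt gridKernels} and {\tt cost} objects from the previous example) illustrates this:

<<eval=FALSE>>=
## select the best grid point by balanced accuracy instead of CV error
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gridKernels,
               pkg="LiblineaR", svm="C-svc", cost=cost, cross=3,
               explicit="yes", perfObjective="BACC",
               perfParameters=c("ACC", "BACC"))
modelSelResult(model)
@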


Grid search can also be performed over multiple SVMs from the same or different packages, as in the next example. In this case, vectors of identical length must be passed for the parameters {\tt pkg}, {\tt svm}, and the respective SVM hyperparameters. The corresponding entries in these vectors refer to one SVM with its hyperparameter(s) and represent one grid column. When the parameter {\tt showCVTimes} is set to {\tt TRUE} in \kbsvm, the cross validation runtimes are recorded for each individual grid point and displayed at the end of the grid search.


<<>>=
specK34 <- spectrumKernel(k=3:4)
gappyK1M34 <- gappyPairKernel(k=1, m=3:4)
gridKernels <- c(specK34, gappyK1M34)
pkgs <- c("e1071", "LiblineaR", "LiblineaR")
svms <- c("C-svc", "C-svc", "l1rl2l-svc")
cost <- c(50, 50, 12)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gridKernels,
               pkg=pkgs, svm=svms, cost=cost, cross=10,
               explicit="yes", showProgress=TRUE,
               showCVTimes=TRUE)
modelSelResult(model)
@


\subsection{Model Selection}
\label{sec:ModelSelection}

In \KeBABS, model selection is performed very similarly to grid search. It is based on nested cross validation with an inner cross validation loop that determines the best setting of the hyperparameters and an outer cross validation loop that tests the performance of the model with the best parameter settings on independent test sets. The result of model selection is the performance of the complete model selection process, not that of a single best model.

Model selection is triggered by setting the parameter {\tt nestedCross} in \kbsvm\ to a value larger than 1. This parameter defines the number of folds in the outer cross validation. The number of folds in the inner cross validation is defined through the parameter {\tt cross} as usual. Multiple repetitions of model selection can be requested with the parameter {\tt noNestedCross}.

In the following example, model selection is performed with 4-fold outer cross validation and 10-fold inner cross validation. The parameters of the best models selected in the inner cross validation are shown in the model selection result object, and the results of the outer cross validation in the CV result object extracted from the model.

