# A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNA-seq data

###### Abstract

We present a new algorithm for enumerating bubbles with length constraints in directed graphs. This problem arises in transcriptomics, where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq. This is the first polynomial-delay algorithm for this problem and we show that in practice, it is faster than previous approaches. This enables us to deal with larger instances and therefore to discover novel alternative splicing events, especially long ones, that were previously overseen using existing methods.

## 1 Introduction

Transcriptomes of model or non model species can now be studied by sequencing, through the use of RNA-seq, a protocol which enables us to obtain, from a sample of RNA transcripts, a (large) collection of (short) sequencing reads, using Next Generation Sequencing (NGS) technologies [Burge, Mortazavi]. Nowadays, a typical experiment produces 100M reads of 100nt each. However, the original RNA molecules are longer (typically 500-3000nt) and the general computational problem in the area is then to be able to assemble the reads in order to reconstruct the original set of transcripts. This problem is not trivial for mainly two reasons. First, genomes contain repeats that may be longer than the read length. Hence, a read does not necessarily enable to identify unambiguously the locus from which the transcript was produced. Second, each genomic locus may generate several types of transcripts, either because of genomic variants (i.e. there may exist several alleles for a locus) or because of transcriptomic variants (i.e. alternative splicing or alternative transcription start/end may generate several transcripts from a single locus that differ by the inclusion or exclusion of subsequences). Hence, if a read matches a subsequence shared by several alternative transcripts, it is a priori not possible to decide which of these transcripts generated the read.

General purpose transcriptome assemblers [Trinity, TransAbyss, Oases] aim at the general goal of identifying all alternative transcripts, but because of the extensive use of heuristics, they usually fail to identify infrequent transcripts, tend to report several fragments for each gene, or fuse genes that share repeats. Local transcriptome assemblers [Sacomoto2012], on the other hand, aim at a simpler goal, as they do not reconstruct full length transcripts. Instead, they focus on reporting all variable regions (polymorphisms): whether genomic (SNPs, indels) or transcriptomic (alternative splicing events). They are much less affected by the issue of repeats, since they focus only on the variable regions. They can afford to be exact and therefore are able to have access to infrequent transcripts. The fundamental idea is that each polymorphism corresponds to a recognizable pattern, called a bubble in the de Bruijn graph built from the RNA-seq reads. In practice, only bubbles with specific length constraints are of interest. However, even with this restriction, the number of such bubbles can be exponential in the size of the graph. Therefore, as with other enumeration problems, the best possible algorithm is one spending time polynomial in the input size between the output of two bubbles, i.e. a polynomial delay algorithm

In this paper, we introduce the first polynomial delay algorithm to enumerate all bubbles with length constraints in a weighted directed graph. Its complexity in the best theoretical case for general graphs is (Section 3) where is the number of vertices in the graph, the number of arcs. In the particular case of de Bruijn graphs, the complexity is (Section 4.1) where is a constant related to the length of the skipped part in an alternative splicing event. In practice, an algorithmic solution in (Section 4.2) appears to work better on de Bruijn graphs built from such data. We implemented the latter, show that it is more efficient than previous approaches and outline that it enables us to discover novel long alternative splicing events.

## 2 De Bruijn graphs and alternative splicing

A *de Bruijn graph* (DBG) is a directed graph whose
vertices are labeled by words of length over an alphabet
. An arc in links a vertex to a vertex if the
suffix of length of is equal to the prefix of . The out
and the in-degree of any vertex are therefore bounded by the size of
the alphabet . In the case of NGS data, the -mers
correspond to all words of length present in the reads of the
input dataset, and only those. In relation to the classical de Bruijn
graph for all possible words of size , the DBG for NGS data may
then not be complete. Given two vertices and in , an
-path is a path from to . As defined in [ourSPIRE],
by an -bubble, we mean two vertex-disjoint -paths. This
definition is, of course, not restricted to de Bruijn graphs.

As was shown in [Sacomoto2012], polymorphisms (i.e. variable
parts) in a transcriptome (including alternative splicing (AS) events)
correspond to recognizable patterns in the DBG that are precisely the
-bubbles. Intuitively, the variable parts correspond to
alternative paths and the common parts correspond to the beginning and
end points of those paths. More formally, any process generating
patterns and in the sequences, with , and and not sharing any
-mer, creates a -bubble in the DBG. In the special case of
AS events excluding mutually exclusive exons, since is empty, one
of the paths corresponds to the *junction* of , i.e. to
-mers that contain at least one letter of each sequence. Thus the
number of vertices of this path in the DBG is predictable: it is at
most^{1}^{1}1The size is *exactly* if has no common
prefix with and no common suffix with . . An example
is given in Fig. 1. In practice [Sacomoto2012],
an upper bound to the other path and a lower bound on
both paths is also imposed. In other words, an AS event corresponds to
a -bubble with paths and such that has at
most vertices, at most and both have at least
vertices.