DiProGB: The Dinucleotide Properties Genome Browser

The dinucleotide properties genome browser is part of my PhD project.

During the last 10 years, a large number of complete genomes have been sequenced. With these data at hand, the basic aim is now to convert this information into biological knowledge, which requires the identification of biologically meaningful motifs in genomic data. Computational motif discovery has been used with some success in simple organisms such as yeast. For higher organisms with more complex genomes, more sensitive methods are required. There is also a growing awareness that not single motifs but motif combinations, usually called modules, may be relevant to biological function.
In this project we developed a new type of genome browser that offers user-friendly tools for the statistical analysis of single and multiple sequences as well as for the visual exploration of single sequences. A distinctive feature is that sequences can be represented not only in the standard form using the bases A, T, G and C, but also in reduced form by purine/pyrimidine or AT/GC characteristics, and in terms of a large number of dinucleotide parameters that can encode, for example, geometrical information on DNA structure. All of these coding schemes can be converted into a signal representation that allows for very effective visual motif discovery. Analyses can be performed for the + and - strands as well as for the double strand. Combining these sequence- and signal-based representations offers a new approach to the detection of new regulatory elements. Algorithmically, standard sequence-based algorithms are combined with signal-based pattern recognition algorithms. Together, these functionalities make DiProGB a unique tool for the identification and analysis of functional motifs in genomes.
DiProGB is a standalone program written in Visual C++ (VC++) and has been optimized to cope with large genomes. It was developed under the Microsoft Windows operating system, but it can also be used under Linux, Mac, BSD and Solaris, for example by installing Wine (http://winehq.org).
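The coding schemes described above can be illustrated with a minimal Python sketch. It is not DiProGB code (the program itself is written in VC++), and all function names are hypothetical. The property table is a toy stand-in computed from the sequence letters (the number of G/C bases in each dinucleotide). A real analysis would use a measured property set instead, such as one from DiProDB.

```python
from itertools import product

# Reduced alphabets: R/Y = purine/pyrimidine, W/S = weak (A/T) / strong (G/C).
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
AT_GC = {"A": "W", "T": "W", "G": "S", "C": "S"}


def reduce_alphabet(seq, mapping):
    """Recode a DNA sequence with a reduced alphabet; unknown bases become N."""
    return "".join(mapping.get(base, "N") for base in seq.upper())


def reverse_complement(seq):
    """Return the minus-strand sequence (read 5' to 3') of a plus-strand sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy dinucleotide property (an assumption for illustration only):
# number of G/C bases in the dinucleotide, i.e. 0.0, 1.0 or 2.0.
TOY_PROPERTY = {
    a + b: float((a in "GC") + (b in "GC")) for a, b in product("ACGT", repeat=2)
}


def dinucleotide_signal(seq, table):
    """Map each overlapping dinucleotide to its property value.

    A sequence of length n yields a signal of length n - 1.
    Dinucleotides missing from the table (e.g. containing N) give None.
    """
    seq = seq.upper()
    return [table.get(seq[i:i + 2]) for i in range(len(seq) - 1)]


plus = "ATGC"
minus = reverse_complement(plus)
print(reduce_alphabet(plus, PURINE_PYRIMIDINE))
print(reduce_alphabet(plus, AT_GC))
print(dinucleotide_signal(plus, TOY_PROPERTY))
print(dinucleotide_signal(minus, TOY_PROPERTY))
```

The resulting list of numbers is the kind of signal that can be plotted along the genome and inspected visually or passed to signal-based pattern recognition.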

A more detailed description and the freely available program can be found at http://diprogb.fli-leibniz.de.
DiProGB is published in
Bioinformatics 2009; doi: 10.1093/bioinformatics/btp436

DiProDB: a database for dinucleotide properties

The dinucleotide property database is part of my PhD project.

We believe that encoding nucleic acid sequences by physicochemical properties of nucleotides may lead to new insights into the structure and function of genomes beyond the information obtained from a character-based sequence representation alone. We are thus currently developing a genome browser that encodes genomic sequences by conformational and thermodynamic dinucleotide properties. Even though higher-order nucleotide combinations may play a role, many properties can be understood by adopting the so-called nearest-neighbour approach, which corresponds to dinucleotide properties. The use of these data for different potential applications, and also their comparison, would be greatly facilitated if they were compiled in a database. We have thus set up the new database DiProDB, which offers information on more than 100 dinucleotide property sets. An option for submitting new data sets makes it easily extensible. The information is offered in tabular form and can be customized according to the user's needs. The database provides export functions and a user-friendly interface to search for specific entries. One especially interesting feature is the possibility of running different dinucleotide property correlation analyses in real time. Moreover, we present thorough clustering analyses of the dinucleotide property sets. With DiProDB we want to provide reliable, easily accessible and comprehensive information on dinucleotide properties that may stimulate the application of these data to a diversity of biological problems.
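As a rough illustration of a dinucleotide property correlation analysis, the sketch below computes a Pearson correlation between two property sets over the 16 dinucleotides. The choice of Pearson correlation is an assumption, since the text does not state which measures DiProDB offers. The three property sets are toy examples derived from the letters themselves, not entries from the database.

```python
from itertools import product
from math import sqrt

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def count_property(letters):
    """Toy property set: how many bases of a dinucleotide are in `letters`."""
    return {d: float(sum(base in letters for base in d)) for d in DINUCLEOTIDES}


def pearson(prop_a, prop_b):
    """Pearson correlation of two property sets over all 16 dinucleotides."""
    xs = [prop_a[d] for d in DINUCLEOTIDES]
    ys = [prop_b[d] for d in DINUCLEOTIDES]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical property sets for illustration only.
gc_count = count_property("GC")      # G/C bases per dinucleotide
at_count = count_property("AT")      # A/T bases per dinucleotide
purine_count = count_property("AG")  # purines per dinucleotide

print(pearson(gc_count, at_count))      # perfectly anti-correlated
print(pearson(gc_count, purine_count))  # uncorrelated
```

Computing this value for every pair of property sets gives a correlation matrix, which is also a natural starting point for clustering the sets.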

The database can be found at http://diprodb.fli-leibniz.de and is published in
Nucleic Acids Res. 2009 Jan; 37(Database issue):D37-40. Epub 2008 Sep 19.

DASS: efficient discovery and p-value calculation of substructures in unordered data

The DASS algorithm is a side project in cooperation with Dr. Jens Hollunder and Dr. Thomas Wilhelm.

Pattern identification in biological sequence data is one of the main objectives of bioinformatics research. However, few methods are available for detecting patterns (substructures) in unordered datasets. Data mining algorithms mainly developed outside the realm of bioinformatics have been adapted for that purpose, but typically do not determine the statistical significance of the identified patterns. Moreover, these algorithms do not exploit the often modular structure of biological data.
We developed the algorithm DASS (Discovery of All Significant Substructures), which first identifies all substructures in unordered data (DASSSub) in a manner that is especially efficient for modular data. In addition, DASS calculates the statistical significance of the identified substructures, either for sets with at most one element of each type (DASSPset) or for sets with multiple occurrences of elements (DASSPmset). The power and versatility of DASS are demonstrated by four examples: combinations of protein domains in multi-domain proteins, combinations of proteins in protein complexes (protein subcomplexes), combinations of transcription factor target sites in promoter regions, and evolutionarily conserved protein interaction subnetworks.
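To make the notion of a substructure in unordered data concrete, here is a naive Python sketch. It is not the DASSSub algorithm and has none of its efficiency for modular data. It simply closes the input sets under intersection and keeps every element combination shared by at least `min_support` sets. The function name, the `min_support` parameter and the toy domain sets are assumptions for illustration. Multisets and the p-value calculation (DASSPset, DASSPmset) are not covered.

```python
def shared_substructures(sets, min_support=2):
    """Naively find element combinations shared by several unordered sets.

    Candidates are the input sets plus all non-empty intersections of any
    number of them. Returns {substructure: support}, where support is the
    number of input sets that contain the substructure.
    """
    sets = [frozenset(s) for s in sets]
    candidates = set(sets)
    frontier = set(sets)
    # Repeatedly intersect new candidates with the input sets until
    # no further combinations appear (closure under intersection).
    while frontier:
        new = set()
        for candidate in frontier:
            for s in sets:
                common = candidate & s
                if common and common not in candidates:
                    new.add(common)
        candidates |= new
        frontier = new

    result = {}
    for candidate in candidates:
        support = sum(1 for s in sets if candidate <= s)
        if support >= min_support:
            result[candidate] = support
    return result


# Toy multi-domain proteins, each given as an unordered set of domains.
proteins = [
    {"SH2", "SH3", "kinase"},
    {"SH2", "SH3"},
    {"SH3", "kinase", "PH"},
    {"PH"},
]

for sub, support in sorted(
    shared_substructures(proteins).items(), key=lambda kv: (-kv[1], sorted(kv[0]))
):
    print(sorted(sub), support)
```

Each reported combination would then need a significance test against a suitable null model, which is the part that DASS adds on top of enumeration.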

The program code and additional data are available at http://www.fli-leibniz.de/tsb/DASS.
The algorithm is published in
Bioinformatics. 2007 Jan 1;23(1):77-83. Epub 2006 Oct 10.