How is bioinformatics helpful in sequencing DNA?

Investigation of binding sites for the activation of genes

Transcription factors play a central role in the regulation of genes. The Bioinformatics Department at the MPI for Molecular Genetics uses various mathematical methods to investigate the function and interaction of transcription factors and thus to obtain further knowledge about the regulation of genes.

Now that the human genome and numerous other genomes have been sequenced, major research efforts are focused on analyzing the resulting data. This analysis is not limited to the pure annotation of genes, i.e. their identification and assignment to certain functions. Rather, knowing the entire DNA sequence of an organism raises a number of new questions, especially about the mechanisms of gene regulation. Today we are interested not only in the function of a gene or the protein it encodes, but also in the mechanisms that cause the gene to be activated at all, i.e. to be read and translated into a protein.

Transcription factors as regulators of gene expression

The human organism has around 25,000 genes, which are present in each of its cells throughout its life. In every phase of development and in every cell type, however, different subsets of this gene pool are activated. Today we know at least some of the biological mechanisms responsible for this activation. A group of DNA-binding proteins, the so-called transcription factors, is of particular importance. They form a complex with the enzyme RNA polymerase, which is responsible for reading the DNA, and thereby activate it. The transcription factors recognize certain sequence patterns located on the DNA near the starting point of a gene (Fig. 1). The Bioinformatics Department at the Max Planck Institute for Molecular Genetics focuses on identifying such sequence patterns and on the question of which combinations of transcription factors activate certain genes by binding to these patterns.

Molecular biologists and biochemists have been working for years with sequence patterns to which transcription factors can bind. We now know that many transcription factors each bind to several different patterns in the genome. The transcription factor SRF (serum response factor), for example, binds to the base sequence CCTAATATGG in front of the gene junB and thereby contributes to its activation (Fig. 1). However, SRF also binds to other locations in the genome whose sequences differ from this one at various positions. Its binding sites are therefore generally described by a sequence that lists the possible base or base alternatives at each position. A more abstract description of a binding site uses so-called "weight matrices". These give the distribution of the four bases at each position of the binding site (Fig. 2).
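
The idea of a weight matrix can be illustrated with a short Python sketch. It is an illustration only and not the department's software: apart from CCTAATATGG, the example sites are invented for the demonstration, and the pseudocount, the uniform background frequency of 0.25 and all function names are assumptions. The matrix stores a log-odds score for each base at each position, and a candidate site is scored by summing the entries that match its bases.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping each base to a log-odds score."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Pseudocounts keep unseen bases from getting a score of minus infinity.
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        }
        matrix.append(column)
    return matrix

def score(matrix, window):
    """Sum the per-position scores of a window as long as the matrix."""
    return sum(column[base] for column, base in zip(matrix, window))

def best_hit(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Only the first site is taken from the text; the others are hypothetical variants.
EXAMPLE_SITES = ["CCTAATATGG", "CCATATATGG", "CCTTATATGG", "CCAAATAAGG"]
matrix = build_weight_matrix(EXAMPLE_SITES)
best_score, best_pos = best_hit(matrix, "TTTTCCTAATATGGAAAA")
print(best_pos, round(best_score, 2))
```

Unlike a fixed base sequence, the matrix assigns a graded score to every window, so sites that deviate at weakly conserved positions can still score well.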

Phylogenetic footprinting - identification of important biological signals through evolutionary conservation

A fundamental problem in the analysis of binding sites is that a given short sequence of bases occurs very often within the entire genome. Describing the binding site as a sequence of bases is therefore insufficient to predict where in the genome a transcription factor actually binds. However, important regulatory sequences are often evolutionarily conserved. Bioinformaticians therefore try to assess the importance of a certain sequence (= potential binding site) by comparing the genome sequences of different organisms. Conversely, conserved sequences within regulatory regions are primary candidates in the search for new transcription factor binding sites; accordingly, they are intensively examined for matches with known binding patterns. The sequence of the SRF binding site mentioned above would not be very informative if it were considered in the human genome alone. Interestingly, the same sequence occurs in a comparable position in front of a gene in the mouse genome. The comparison of the two genomes provides a strong indication that this pattern could be an important biological signal. This approach is known as phylogenetic footprinting. The scientists in the Bioinformatics Department at the MPI for Molecular Genetics have developed a series of computer programs to identify such conserved regions in front of orthologous genes from humans and mice and then to annotate them with known binding site patterns. The information is stored in the CORG (Comparative Regulatory Genomics) database and is publicly accessible at http://corg.molgen.mpg.de/.
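
A minimal sketch of the core idea behind phylogenetic footprinting is shown below. It assumes that the two upstream regions have already been aligned (gaps written as "-") and simply reports runs of identical columns; the two toy sequences, the minimum block length and the function name are invented for illustration, and the sketch does not reproduce the programs behind CORG.

```python
def conserved_blocks(aligned_a, aligned_b, min_len=8):
    """Return (start, end, sequence) for each run of identical, gap-free columns.

    Positions refer to alignment columns; end is exclusive.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    blocks = []
    start = None
    for i, (a, b) in enumerate(zip(aligned_a, aligned_b)):
        is_match = a == b and a != "-"
        if is_match and start is None:
            start = i                      # a new conserved run begins
        elif not is_match and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                blocks.append((start, i, aligned_a[start:i]))
            start = None
    # Handle a run that extends to the end of the alignment.
    if start is not None and len(aligned_a) - start >= min_len:
        blocks.append((start, len(aligned_a), aligned_a[start:]))
    return blocks

# Hypothetical aligned upstream regions; only the embedded motif comes from the text.
human = "ACGTTGCACCTAATATGGTCA-GT"
mouse = "ATGTCGAACCTAATATGGACAGGT"
blocks = conserved_blocks(human, mouse)
print(blocks)
```

A conserved block that also matches a known binding pattern is a much stronger candidate than a match in a single genome.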

Identification of the target genes of transcription factors

Predicting evolutionarily conserved binding sites can also help identify target genes of transcription factors. In a collaboration with A. Nordheim from the University of Tübingen, the putative target genes of SRF were first determined using DNA chip experiments (see below). These included genes that were directly regulated by SRF as well as genes that were activated only as a downstream consequence of activation by SRF. From this full set, the genes that are directly influenced by SRF could be singled out using the described methods for pattern search and the analysis of evolutionary conservation. This information helped to elucidate a previously unknown mechanism of muscle cell differentiation [1].

Analysis of activation patterns

For complex biological processes, however, it is not just the activation of individual genes that is important, but in particular the coordinated activation of entire groups of genes. Such "activation patterns" can be determined with the help of DNA chips (see von Heydebreck et al., Gene expression analysis of complex clinical phenotypes using DNA arrays, MPG yearbook 2003). These experiments show which genes behave similarly in defined cell types under defined conditions; such genes are referred to as co-expressed clusters. Among other things, the working group is concerned with the question of whether such co-expression can be traced back to common regulatory mechanisms, for example the same transcription factors. Using the CORG database, the scientists examine the regulatory regions of the co-expressed genes. As a first approximation, they catalog the evolutionarily conserved binding sites within these regions, because these indicate a regulatory function of the respective binding factor. However, the evolutionarily conserved binding sites within the regulatory regions are still too numerous to provide information about a specific function. The scientists therefore compare how often a certain binding site occurs within a co-expressed cluster with how often it would be expected to occur by chance. Such an analysis was carried out, for example, using publicly available DNA chip data on the cell cycle in a human cell line. Each co-expression cluster consisted of the genes that are strongly expressed in a certain phase of the cell cycle. The analysis of their regulatory patterns revealed a number of transcription factors which, according to the statistical reasoning described, play an important role in the cell cycle (Fig. 3). A comparison with experimental results confirmed this [2].
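
One common way to compare an observed frequency with chance is a hypergeometric test, sketched here in Python. This is a stand-in chosen for illustration and not necessarily the statistic used in [2]; the gene counts in the example are invented and the function name is an assumption.

```python
from math import comb

def enrichment_p_value(hits_in_cluster, cluster_size, genes_with_site, total_genes):
    """P(X >= hits_in_cluster) under the hypergeometric distribution.

    Models the cluster as cluster_size genes drawn at random, without
    replacement, from total_genes genes of which genes_with_site carry
    the binding site.
    """
    upper = min(cluster_size, genes_with_site)
    favourable = sum(
        comb(genes_with_site, i) * comb(total_genes - genes_with_site, cluster_size - i)
        for i in range(hits_in_cluster, upper + 1)
    )
    return favourable / comb(total_genes, cluster_size)

# Hypothetical numbers: 12 of 40 cluster genes carry the site,
# compared with 300 of 5000 genes overall.
p = enrichment_p_value(12, 40, 300, 5000)
print(p)
```

A small p-value means the site occurs in the cluster more often than random sampling of genes would explain, which is the kind of evidence the paragraph above describes.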

Interaction of transcription factors with one another

The scientists in the Bioinformatics Department are currently developing methods to study the interaction of transcription factors. For gene regulation in yeast, which is simpler than in mammals, they were able to show that transcription factors whose binding sites are often found in close proximity to one another on a sequence often also interact physically [3]. In search of similar principles in mammals, they are now looking for statistical tendencies in the combination of binding sites on the DNA. Here, evolutionary conservation again comes to the researchers' aid and reduces the number of false positive predictions. The list of transcription factor pairs filtered in this way, whose binding sites often occur close together, does indeed contain many factors that are known to interact [4].
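
Counting how often the binding sites of two factors lie close together can be sketched as follows. The annotation format, the distance threshold of 50 bases, the placeholder factor names TF_A, TF_B and TF_C and the positions are all assumptions made for this example; a real analysis would additionally compare the counts with a chance expectation, as in [4].

```python
from collections import Counter
from itertools import combinations

def count_close_pairs(annotations, max_distance=50):
    """Count, per factor pair, the regions in which their sites lie close together.

    annotations maps a region identifier to a list of (factor, position) tuples.
    Each pair is counted at most once per region.
    """
    pair_counts = Counter()
    for sites in annotations.values():
        close_pairs = set()
        for (factor_1, pos_1), (factor_2, pos_2) in combinations(sites, 2):
            if factor_1 != factor_2 and abs(pos_1 - pos_2) <= max_distance:
                # Sort the names so (A, B) and (B, A) count as the same pair.
                close_pairs.add(tuple(sorted((factor_1, factor_2))))
        pair_counts.update(close_pairs)
    return pair_counts

# Hypothetical conserved binding sites in three regulatory regions.
annotations = {
    "gene1": [("TF_A", 10), ("TF_B", 40), ("TF_C", 300)],
    "gene2": [("TF_A", 100), ("TF_B", 120)],
    "gene3": [("TF_A", 5), ("TF_C", 400)],
}
pair_counts = count_close_pairs(annotations)
print(pair_counts.most_common())
```

Restricting the input to evolutionarily conserved sites, as described above, shrinks the annotation lists and with them the number of spurious pairs.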

The work described is based on detailed processing and rigorous mathematical analysis of the existing, experimentally determined data. Known binding sites of transcription factors must be compared and grouped to avoid double counting. Their information content is calculated, and new mathematical procedures were developed within the working group to predict the expected number of false positives and hence the statistical significance [5]. A common problem when analyzing data from functional genome research is multiple testing: because of the large amount of data, a hypothesis can be tested on many cases. However, the impressive significance values this produces are only apparent. To obtain a realistic statement and an adequate evaluation of the biological data, the resulting significance values must be corrected for the number of tests carried out.
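
Two standard corrections for the number of tests can be sketched in Python: the Bonferroni correction and the Benjamini-Hochberg procedure. The text does not say which correction the working group uses, so both are shown purely as examples, and the p-values are invented.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical raw p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of even one false positive and is therefore strict; Benjamini-Hochberg controls the expected fraction of false positives among the reported results and is usually less conservative when many tests are performed.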

The expression of a gene is not regulated by transcription factors alone. Nevertheless, this level of regulation must be studied in order to build an overall picture of the various regulatory mechanisms of a cell. Doing so brings us a significant step closer to a comprehensive understanding of how a living cell functions.

Original publications

[1] U. Philippar, G. Schratt, C. Dieterich, J.M. Muller, P. Galgoczy, F.B. Engel, M.T. Keating, F. Gertler, R. Schule, M. Vingron, A. Nordheim:
The SRF target gene Fhl2 antagonizes RhoA/MAL-dependent activation of SRF.
Mol Cell 2004; 16: 867-880.
[2] C. Dieterich, S. Rahmann, M. Vingron:
Functional inference from non-random distributions of conserved predicted transcription factor binding sites.
Bioinformatics 2004; 20 (Suppl 1): I109-I115.
[3] T. Manke, R. Bringas, M. Vingron:
Correlating Protein-DNA and Protein-Protein Interaction Networks.
J Mol Biol 2003; 333: 75-85.
[4] K. Rateitschak, T. Müller, M. Vingron:
Annotating significant pairs of transcription factor binding sites in regulatory DNA.
In Silico Biology 2004; 4: 479-487.
[5] S. Rahmann, T. Müller, M. Vingron:
On the power of profiles for transcription factor binding site detection.
Statistical Applications in Genetics and Molecular Biology 2003; 2: Article 7.