
9 Sep 2000

General bioinformatics

Per Kraulis

Stockholm Bioinformatics Center, Stockholm University

What is bioinformatics?

Bioinformatics can be described as the science of collecting, modelling, storing, searching, annotating and analysing biological information. It involves a range of activities, from data handling and publication to data mining and analysis. An essential part of bioinformatics is to create new algorithms for the analysis of complex and/or large data sets.

Bioinformatics deals with the issues created by the massive amounts of new types of data obtained through novel biological experiments. The best-known example is, of course, the determination of the complete nucleotide sequence of the human genome, which has to a large extent been accomplished, although it is not yet finished.

The basic data has so far usually been sequence data (nucleotide or protein), but other types of data (microarray, functional analysis, interactions) are now rapidly coming into focus.

Three levels of bioinformatics

  1. Analysis of a single gene (protein) sequence.
  2. Analysis of complete genomes.
  3. Analysis of genes and genomes with respect to functional data.

Technical issues

Collecting, storing, searching and using biological information entails a number of technical problems, ranging from the trivial but important (file formats, database interactions) to the sophisticated (algorithm design, data modelling, ontologies).
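
As one illustration of the "trivial but important" end of this range, the following minimal Python sketch reads sequence records in the widely used FASTA text format and computes a simple per-sequence statistic. It is not taken from the original notes: the function names parse_fasta and gc_content and the two example records are hypothetical, chosen only to show the kind of file-format handling that even a basic analysis depends on.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {identifier: sequence} mapping.

    The identifier is taken to be the first whitespace-delimited token
    after the '>' on a header line (an assumption of this sketch).
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header line has no identifier")
            current_id = header[0]
            chunks = []
        else:
            if current_id is None:
                raise ValueError("sequence data found before first header")
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical example records, made up for illustration only.
example = """>seq1 hypothetical example record
ATGCGC
GGTA
>seq2
ATAT
""".splitlines()

records = parse_fasta(example)
for identifier, sequence in records.items():
    print(identifier, len(sequence), round(gc_content(sequence), 2))
```

Real data files vary in header conventions and line wrapping, so a production parser would need to be more defensive than this sketch.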

Legal issues

Bioinformatics databases, algorithms and software are covered by wildly different copyright terms, license conditions and, occasionally, patents. This sometimes causes considerable problems for academic research. For example, a novel data modelling scheme may use data from several sources and combine them into an entirely novel framework. This new database becomes difficult to distribute, since it builds on other databases. Of course, such legal protection may also be very important for the academic researcher.

A current trend is to distribute software (and also some databases) under some variant of the GNU General Public License (http://www.gnu.org/). Software patents in bioinformatics are not very common, so far. The large sequence databases and a few others are also publicly accessible and redistributable. But most other databases have conditions that make development and subsequent redistribution problematic.

Commercial uses of bioinformatics

Bioinformatics is of direct relevance for the pharmaceutical and agricultural industries. It is necessary for analysis of the experimental data. It can suggest hypotheses to be tested by experiment. There is a very close connection between basic bioinformatics science and its technological applications.

Most large pharmaceutical companies do not have any large groups doing original science in bioinformatics. They usually have enough expertise to use and apply academic software and commercial products, but not much more. Most bioinformatics efforts in industry are directed at finding interesting genes, or helping experimental scientists with handling and analysis of their data.

The relationship between academia and companies is basically good. There is much collaboration and communication. However, the very fast progress and movement of the field make some kinds of long-term collaborations, such as PhD projects, difficult. A company (especially a small company) is reluctant to commit to a four-year project; much can happen during that time.

It must be noted that different parts of industry have different interests: some companies sell information (databases), algorithms or analysis tools, while other companies use these tools in pharmaceutical and agricultural research. For example, the large pharmaceutical companies have no strong interest in software patents, while a small bioinformatics company may build its entire business case on proprietary software.

Challenges

An important issue that will probably be with us for a long time is the integration of the large (and increasing) number of biological databases. Although some technology developments (such as XML) will help, the fragmentation is more fundamental than that. It is due to different ways of viewing the data, and this is a genuine scientific problem.
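
The point can be made concrete with a small Python sketch. Everything in it is hypothetical: the two "databases", their accession numbers, gene names and annotations are invented for illustration and do not correspond to any real resource. One source takes a sequence-centred view and the other a function-centred view of the same genes. Joining them on a normalised gene name is technically easy, but records that one source simply does not describe remain incomplete however the data are formatted.

```python
from typing import Dict, List, Optional

# Hypothetical source A: a sequence-centred view, keyed by its own accessions.
sequence_db: Dict[str, Dict[str, object]] = {
    "A0001": {"gene_name": "geneX", "length": 1200},
    "A0002": {"gene_name": "geneY", "length": 850},
}

# Hypothetical source B: a function-centred view, keyed by upper-case gene symbol.
function_db: Dict[str, Dict[str, str]] = {
    "GENEX": {"function": "kinase"},
    "GENEZ": {"function": "transporter"},
}


def integrate(
    seq_db: Dict[str, Dict[str, object]],
    func_db: Dict[str, Dict[str, str]],
) -> List[Dict[str, Optional[object]]]:
    """Join the two sources on a normalised (upper-case) gene name.

    Sequence records without a functional annotation get function=None.
    """
    merged: List[Dict[str, Optional[object]]] = []
    for accession, seq_record in seq_db.items():
        key = str(seq_record["gene_name"]).upper()
        func_record = func_db.get(key)
        merged.append(
            {
                "accession": accession,
                "gene_name": seq_record["gene_name"],
                "length": seq_record["length"],
                "function": func_record["function"] if func_record else None,
            }
        )
    return merged


merged = integrate(sequence_db, function_db)
for row in merged:
    print(row)
```

In this toy case geneY ends up with no function and geneZ never appears in the merged view at all, which is the kind of mismatch that a common exchange format alone does not resolve.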

The complete genomes for several important organisms are basically in hand, and we know (in principle) how to deal with data of this type. Of course, many improvements must be made.

Much current work focuses on how to interpret microarray expression data. Which genes are active or inactive during different states of the cell? How is the regulation carried out? There are fundamental research issues here, as well as obvious direct applications in diagnosis.
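
A very simple way to ask which genes differ between two cell states is to compare expression levels as log2 ratios. The Python sketch below shows the idea under stated assumptions: the gene names and intensity values are invented, the pseudocount of 1.0 and the two-fold threshold are arbitrary illustrative choices, and the function names log2_ratio and classify_genes are hypothetical. Real microarray analysis also requires normalisation, replicates and statistical testing, none of which is attempted here.

```python
import math
from typing import Dict


def log2_ratio(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """Return log2((treated + pseudocount) / (control + pseudocount)).

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))


def classify_genes(
    control: Dict[str, float],
    treated: Dict[str, float],
    threshold: float = 1.0,
) -> Dict[str, str]:
    """Label each gene measured in both states as 'up', 'down' or 'unchanged'.

    A threshold of 1.0 on the log2 scale corresponds to a two-fold change.
    """
    calls: Dict[str, str] = {}
    for gene, control_level in control.items():
        if gene not in treated:
            continue  # only compare genes measured in both states
        ratio = log2_ratio(treated[gene], control_level)
        if ratio >= threshold:
            calls[gene] = "up"
        elif ratio <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Invented intensity values for two hypothetical cell states.
control_state = {"geneA": 99.0, "geneB": 399.0, "geneC": 100.0}
treated_state = {"geneA": 399.0, "geneB": 99.0, "geneC": 110.0}

calls = classify_genes(control_state, treated_state)
for gene, call in calls.items():
    print(gene, call)
```

Such a list says which genes change, but not why they change.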

However, expression experiments on their own do not unveil the mechanisms underlying the various processes in the cell. New approaches, both experimental and theoretical/algorithmic, must be developed.

There are many signs that bioinformatics will focus on the issues of how to describe and analyse the fundamental life processes in cells and organisms. Understanding the function of genes and proteins cannot be achieved without understanding the systems in which they work. This is needed for the fundamental understanding of living systems, as well as for successful technological application of the new biology in the agricultural and pharmaceutical industries.

This is one of the fundamental challenges of biology and bioinformatics: How do we describe, analyse, simulate and predict the dynamics of life processes?


Copyright © 2000 Per J. Kraulis. Last modified: 2000/12/12 16:31:35