Currently over 8 million ESTs have been generated from over 250 different
eukaryotes. In general, these data are not well organised and are difficult to
interpret in a genomic context, which prevents simple comparative analyses. Common problems include
significant redundancy in the datasets (some genes may have been sequenced multiple
times) and a lack of consistent annotation between projects. An effective way to
overcome these problems is to group ESTs into clusters that represent genes, and to
provide annotations for the clusters. Since ESTs provide only a fraction of the
available genes for a particular organism, we refer to these analysed datasets as
partial genomes.
To automate this process we have previously developed a novel bioinformatics
system, PartiGene, which
collates and identifies non-redundant sets of genes on a species-specific basis.
The PartiGene system provides an industry standard for these types of databases
and has since been adopted by the UK's Natural Environment Research Council. This
system has now been expanded into a fully automatable pipeline such that for the
first time, we have created a comprehensive database of ~250 sets of eukaryotic
genes. This database provides a powerful platform to perform comparative studies
that go well beyond those which can currently be achieved using only the fully
sequenced genomes.
Above is a schematic of the process used to generate PartiGeneDB.
(1) The process begins with the identification of organisms in GenBank for which >1000
ESTs exist. For each such organism, the current number of ESTs is compared with
the number of sequences in PartiGeneDB. If a significant number of new sequences
are available or the organism is new to the database, the sequences are downloaded
and screened for the presence of contaminating vector sequence and poly(dA) tails.
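
The update decision and the poly(dA) screen in step (1) can be sketched roughly as follows. This is an illustration only, not the PartiGeneDB code: the 10% threshold for a "significant" number of new sequences, the minimum tail length of 10 and both function names are assumptions, and vector screening is omitted.

```python
import re

MIN_ESTS = 1000          # threshold stated in the text (>1000 ESTs)
UPDATE_FRACTION = 0.10   # ASSUMPTION: what counts as "significant" growth

# ASSUMPTION: a run of 10 or more A residues at the 3' end is a poly(dA) tail.
POLY_A_TAIL = re.compile(r"A{10,}$", re.IGNORECASE)


def needs_update(genbank_count, stored_count):
    """Decide whether an organism should be downloaded and (re)processed."""
    if genbank_count <= MIN_ESTS:
        return False                     # too few ESTs to include
    if stored_count == 0:
        return True                      # organism is new to the database
    new_sequences = genbank_count - stored_count
    return new_sequences >= UPDATE_FRACTION * stored_count


def trim_poly_a(seq):
    """Remove a trailing poly(dA) tail, if one is present."""
    return POLY_A_TAIL.sub("", seq)
```

For example, under these assumptions an organism with 2000 ESTs in GenBank and 1500 already stored would be queued for an update, while one with 1500 and 1490 would not.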
(2) Screened sequences are then clustered on the basis of sequence
similarity using our in-house clustering software CLOBB. This
clustering step is incremental such that during subsequent rounds of
clustering, the original cluster identifiers associated with each organism
remain intact. This allows PartiGeneDB to be updated easily and
ensures that analyses are consistent between such updates.
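
The idea of incremental clustering with stable identifiers can be illustrated with a toy sketch. This is not CLOBB's algorithm: the shared k-mer test stands in for a proper sequence similarity search, and the function names, the thresholds and the species code "XYC" are hypothetical.

```python
def kmers(seq, k=8):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similar(a, b, k=8, min_shared=0.5):
    """Toy similarity test (ASSUMPTION): fraction of shared k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return False
    return len(ka & kb) / min(len(ka), len(kb)) >= min_shared


def add_sequences(clusters, new_seqs, prefix):
    """Add new sequences in place; existing cluster IDs are never renamed."""
    next_index = len(clusters) + 1
    for seq in new_seqs:
        for members in clusters.values():
            if any(similar(seq, m) for m in members):
                members.append(seq)      # joins an existing cluster
                break
        else:
            # No match: open a new cluster with the next free index.
            clusters[f"{prefix}{next_index:05d}"] = [seq]
            next_index += 1
    return clusters


clusters = {"XYC00001": ["ACGTACGTTAGCCGATAGGCTA"]}
add_sequences(clusters, ["ACGTACGTTAGCCGATAGG", "TTTTTTTTTTGGGGGGGGGGCCCC"], "XYC")
```

After the update, the first new sequence has joined XYC00001 and the second has opened XYC00002, so any analysis that referred to XYC00001 before the update still points at the same cluster.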
(3) Once the clusters have been generated, constituent sequences are assembled
to form consensus sequences (representing putative genes) using the publicly
available software tool PHRAP. These
consensus sequences form our putative 'genes'.
(4) Each set of non-redundant consensus sequences forms what we term a partial
genome of the organism concerned. At this stage the sequence and cluster data
are uploaded into PartiGeneDB and the process moves on to analysis of the next
organism.
(5) Having created the initial sequence database, we annotate the non-redundant
set of putative gene sequences. Again, automated pipelines have been set up
which perform a series of BLAST searches for each partial genome against a set
of user-defined databases including the non-redundant DNA database and the
non-redundant protein database. Currently only annotation involving the
non-redundant protein database is available. We also perform a series of
self-BLASTs of our datasets such that each partial genome is analysed in the
context of every other partial genome. Results from these analyses are helping
to explore levels of similarity between the various datasets and are also
being used to create phylogenetic profiles. In this latter instance it may be
possible to identify groups of genes of related function on the basis of these
profiles and potentially allow the identification of enzymes which appear to
have significantly diverged within one or more organisms. BLAST results are
extracted and also stored in the PartiGeneDB allowing the rapid identification
and retrieval of sequences with particular BLAST profiles.
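
As a rough illustration of how phylogenetic profiles might be derived from stored BLAST results, the sketch below turns each cluster's best E-values into a presence/absence vector and groups clusters that share a profile. The data layout, the function names and the 1e-5 cutoff are assumptions made for the example, not a description of the PartiGeneDB schema.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-5   # ASSUMPTION: threshold for a "significant" hit


def build_profiles(hits, organisms):
    """Map each cluster to a tuple of 1/0 flags, one per organism.

    hits: {cluster_id: {organism: best E-value against that organism}}
    """
    profiles = {}
    for cluster, by_org in hits.items():
        profiles[cluster] = tuple(
            1 if by_org.get(org, 1.0) <= E_VALUE_CUTOFF else 0
            for org in organisms
        )
    return profiles


def group_by_profile(profiles):
    """Collect clusters that share an identical presence/absence profile."""
    groups = defaultdict(list)
    for cluster, profile in sorted(profiles.items()):
        groups[profile].append(cluster)
    return dict(groups)


organisms = ["org_a", "org_b", "org_c"]          # hypothetical organisms
hits = {
    "XYC00001": {"org_a": 1e-50, "org_b": 1e-20},
    "XYC00002": {"org_a": 1e-8, "org_b": 1e-6, "org_c": 0.5},
    "XYC00003": {"org_c": 1e-30},
}
groups = group_by_profile(build_profiles(hits, organisms))
```

Clusters with the same profile, here XYC00001 and XYC00002, are candidates for genes of related function; a cluster missing from a single organism in an otherwise shared profile may point to a diverged gene.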
In addition to the raw nucleotide sequence, we are applying a
previously developed pipeline, prot4EST, to generate a peptide prediction for
each putative gene sequence. These sequences are then being analysed to identify
distinct protein domains, which enable gene ontology terms to be associated
with each sequence. Collation of these terms will allow us to build profiles
of the molecular functions and processes associated with the partial genomes
obtained for each organism. Comparative analyses of these profiles may then
enable us to understand more about the biology associated with each organism
or collection of organisms.
A cluster is a set of sequences (in this case expressed sequence tags (ESTs))
which share enough sequence homology to suggest that they may all derive from
the same gene. Hence we treat individual clusters as sets of sequences derived
from the same gene, and by obtaining their consensus, we aim to derive a single
'gene' sequence. Note that we use quotes around the word 'gene'. In some cases,
owing to incomplete sequencing, the 'gene' sequence may not be full length. Furthermore,
due to the nature of the clustering procedure, the 'gene' may actually derive
from a set of alternative transcripts or allelic variants. By clicking on the
consensus sequence in the cluster view (see below) you can retrieve the file
detailing how a consensus was created and hence gain your own insights into the
nature of the sequences used to create the cluster. Rarely, sequences from
members of the same gene family may get clustered together - however more
common is the reverse where alternative splicing can lead to the creation of
more than one cluster from the same gene. The level of expansion of clusters
due to alternative splicing is currently being investigated.
Clusters are designated with a three-letter species identifier followed by
an index number starting from 00001. The first two letters of the species
identifier indicate the genus and species; the third letter allows
differentiation between two or more species with similar IDs.
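
The identifier scheme can be expressed as a small formatter and parser. This is a sketch under assumptions: upper-case letters and zero padding to at least five digits are inferred from the "00001" example, and the species code "XYC" is hypothetical.

```python
import re

# ASSUMPTION: three upper-case letters followed by five or more digits.
ID_PATTERN = re.compile(r"([A-Z]{3})(\d{5,})")


def format_cluster_id(species_code, index):
    """Build an identifier such as 'XYC00001' from its two parts."""
    if not re.fullmatch(r"[A-Z]{3}", species_code):
        raise ValueError("species code must be three upper-case letters")
    if index < 1:
        raise ValueError("index numbers start from 1")
    return f"{species_code}{index:05d}"


def parse_cluster_id(cluster_id):
    """Split an identifier into (species code, index number)."""
    match = ID_PATTERN.fullmatch(cluster_id)
    if match is None:
        raise ValueError(f"not a valid cluster identifier: {cluster_id!r}")
    return match.group(1), int(match.group(2))
```

With these definitions, `format_cluster_id("XYC", 1)` gives `"XYC00001"` and `parse_cluster_id("XYC00042")` gives `("XYC", 42)`.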
Here we provide a web interface to facilitate searching of our available datasets
to allow researchers to identify genes of interest associated with an organism of
choice. At present we provide three portals of entry into the database:
1) View an entire organism's dataset
2) Search by Annotation
3) Search by Sequence Similarity
The first option allows users to view the entire partial genome of a selected
organism, with clusters ordered by their relative abundance. This enables users to
identify which genes are the most highly represented within the dataset.
The "Search by Annotation" page allows users to input text into a search box
and search against an organism of choice for genes which share significant sequence
similarity to a protein from the non-redundant protein database which has been
annotated with the user selected text. The "Search by Sequence Similarity" page
is a local BLAST server, which allows users to search our cluster datasets for
sequences which share significant similarity to a user specified sequence.
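
The first two portals amount to a sort and a text filter over stored cluster data. The sketch below is illustrative only: the in-memory dictionaries, the function names and the example annotations are assumptions, whereas the real interface queries the database.

```python
def rank_by_abundance(cluster_sizes):
    """Order cluster IDs from most to fewest member sequences.

    Ties are broken by cluster ID so the ordering is reproducible.
    """
    return sorted(cluster_sizes, key=lambda cid: (-cluster_sizes[cid], cid))


def search_by_annotation(annotations, text):
    """Return IDs of clusters whose BLAST hit description contains the text."""
    needle = text.lower()
    return sorted(
        cid for cid, description in annotations.items()
        if needle in description.lower()
    )


# Hypothetical data for one organism.
sizes = {"XYC00001": 3, "XYC00002": 10, "XYC00003": 3}
annotations = {
    "XYC00001": "Cathepsin L-like cysteine protease",
    "XYC00002": "Actin",
}
ranked = rank_by_abundance(sizes)
matches = search_by_annotation(annotations, "protease")
```

Here `ranked` starts with XYC00002, the most highly represented cluster, and `matches` contains only XYC00001.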
After selecting a cluster to view using any of the search options you will,
after a few moments, be presented with a view providing details of the cluster. At
the top of the cluster page you will see a section like the one shown below:
This section provides general information on the cluster of sequences used to
create the putative gene object including : cluster identifier (as
mentioned above this is related to the genus and species of the
organism from which it was derived); species name; number of sequences
in the cluster; types of sequences in the cluster (generally ESTs, although a
few organisms may have other types of sequence data incorporated); number of
contigs (consensus sequences) built from the sequences; and a panel detailing
the library composition of the sequences in the cluster. Also shown is a button
allowing the user to download all the sequences associated with the cluster. In
the Library panel, the user may click on the library numbers to obtain more
detailed information on the cDNA libraries associated with the sequences.
Below this panel, the user will see a window detailing any BLAST annotation
associated with the cluster. Under this, the user will see a graphic detailing
the makeup of the consensus sequence from its constituent sequences (see
below).
The graphic above details the sequences and the libraries from which they were
derived. Also shown is a view of how the sequences relate to the consensus
sequence. By clicking on the consensus sequence (red bar) the user may access
the multiple alignment of the sequences and how they relate to the consensus.
Clicking on the sequence names or sequence bars in the graphic links the user
to the GenBank entry for the sequence. The colour of the bars represents the
quality of the sequence: yellow = high quality regions of sequence, grey = low
quality regions. Below this graphic is a text box showing the consensus
sequence of this cluster. Note that if there is more than one contig associated with
a cluster, additional graphics and sequence boxes will be provided below.
Our initial interest in PartiGeneDB is in its application to parasite biology.
Parasites represent a major scourge on human health and economics, especially
in the third world. Due to the relatively poor economies of the countries which
bear the greatest burden, drug and vaccine programs have not attracted the
attention they merit. Despite this lack of investment, a large body of sequence
data in the form of ESTs currently exists and continues to be generated for
many of these organisms. Of the 250+ organism datasets in PartiGeneDB,
approximately 70 represent parasitic organisms. The multiple occurrence
of parasitism suggests that there are specific adaptations which enable a
parasitic lifestyle. By focusing on groups of parasite and non-parasite
comparators we aim to explore the evolutionary origins of parasites, gain
insights into their biology and identify parasite specific traits.
SimiTri is a Java/Perl-based application which allows simultaneous display
and analysis of relative similarity relationships of the dataset of interest to
three different databases.
To launch the SimiTri application, simply select the SimiTri search option
and select the organism and three databases to be used for making the
comparisons. The selected datasets will then be used to generate a graphic and
a list of clusters (a historical example is shown below).
From the above figure it can be seen that each node of the triangle represents
one set of sequences: C. elegans proteins, non-elegans nematode proteins and
non-nematode proteins. The position of a cluster (represented as a single
square coloured by the highest BLAST e-value) thus represents its relatedness
to each of the three categories (calculated from the mean of the BLAST scores).
The diagram below shows how the triangle may be interpreted.
Clusters found on one of the three edges do not have a
significant BLAST hit against the database located on the opposite node.
Clusters which were found to be unique (no significant BLAST score against any
of the three databases) and those which had a significant score for only one
database are listed in a table below the graphic.
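
One plausible way to place a cluster inside the triangle is to treat its three BLAST scores as barycentric weights on the three nodes. The sketch below shows that idea only; it is an assumption about the geometry, not SimiTri's actual code, and the node coordinates and function name are hypothetical. A cluster with no hit against one database then falls on the side of the triangle opposite that database's node.

```python
import math

# ASSUMPTION: nodes of an equilateral triangle with unit side length,
# one node per database.
NODES = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]


def simitri_position(scores):
    """Return (x, y) for a cluster, or None if it should not be plotted.

    scores: three non-negative BLAST scores, one per database, with 0
    meaning no significant hit. Clusters with hits against fewer than
    two databases are not plotted (they go in the table instead).
    """
    if sum(1 for s in scores if s > 0) < 2:
        return None
    total = sum(scores)
    # Score-weighted average of the node coordinates.
    x = sum(s * nx for s, (nx, _) in zip(scores, NODES)) / total
    y = sum(s * ny for s, (_, ny) in zip(scores, NODES)) / total
    return (x, y)
```

Equal scores against all three databases place a cluster at the centre of the triangle, while a higher score against one database pulls the cluster towards that node.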