
What is PartiGeneDB?

Currently over 8 million expressed sequence tags (ESTs) have been generated from over 250 different eukaryotes. In general, these data are not well organised, are difficult to interpret in a genomic context, and do not lend themselves to simple comparative analyses. Common problems include significant redundancy in the datasets (some genes may have been sequenced multiple times) and a lack of consistent annotation between projects. An effective way to overcome these problems is to group ESTs into clusters that represent genes, and to provide annotations for the clusters. Since ESTs provide only a fraction of the available genes for a particular organism, we refer to these analysed datasets as partial genomes. To automate this process we previously developed a novel bioinformatics pipeline, termed PartiGene, which collates and identifies non-redundant sets of genes on a species-specific basis. The PartiGene system provides an industry standard for these types of databases and has since been adopted by the UK's Natural Environment Research Council. This system has now been expanded into a fully automatable pipeline such that, for the first time, we have created a comprehensive database of ~250 sets of eukaryotic genes. This database provides a powerful platform for comparative studies that go well beyond those currently achievable using only the fully sequenced organisms.


How is it Created?



Above is a schematic of the process used to generate PartiGeneDB.
(1) The process begins with the identification of organisms in GenBank for which >1000 ESTs exist. For each such organism, the current number of ESTs is compared with the number of sequences in PartiGeneDB. If a significant number of new sequences are available, or the organism is new to the database, the sequences are downloaded and screened for contaminating vector sequence and poly(dA) tails.
(2) Screened sequences are then clustered on the basis of sequence similarity using our in-house clustering software, CLOBB. This clustering step is incremental, such that during subsequent rounds of clustering the original cluster identifiers associated with each organism remain intact. This allows PartiGeneDB to be updated easily and ensures that analyses are consistent between updates.
(3) Once the clusters have been generated, constituent sequences are assembled into consensus sequences using the publicly available software tool PHRAP. These consensus sequences form our putative 'genes'.
(4) Each set of non-redundant consensus sequences forms what we term a partial genome of the organism concerned. At this stage the sequence and cluster data are uploaded into PartiGeneDB and the process moves on to analysis of the next identified organism.
(5) Having created the initial sequence database, we annotate the non-redundant set of putative gene sequences. Again, automated pipelines have been set up to perform a series of BLAST searches for each partial genome against a set of user-defined databases, including the non-redundant DNA database and the non-redundant protein database. Currently only annotation involving the non-redundant protein database is available. We also perform a series of self-BLASTs of our datasets, such that each partial genome is analysed in the context of every other partial genome. Results from these analyses are helping to explore levels of similarity between the various datasets and are also being used to create phylogenetic profiles. In the latter case, these profiles may make it possible to identify groups of genes of related function, and potentially to identify enzymes which appear to have diverged significantly within one or more organisms. BLAST results are extracted and stored in PartiGeneDB, allowing the rapid identification and retrieval of sequences with particular BLAST profiles.
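To make the idea of a phylogenetic profile concrete, here is a minimal Python sketch. It assumes that stored BLAST results can be flattened into (cluster, organism, e-value) records and that an e-value cutoff decides presence or absence; the record layout, the cutoff value, the function names and the cluster and organism labels are all illustrative assumptions, not the PartiGeneDB schema or pipeline.

```python
from collections import defaultdict

# Hypothetical significance cutoff; the threshold used by PartiGeneDB is not stated.
E_VALUE_CUTOFF = 1e-5


def build_profiles(hits, organisms, cutoff=E_VALUE_CUTOFF):
    """Return {cluster_id: tuple of 0/1 flags}, one flag per organism.

    `hits` is an iterable of (cluster_id, organism, e_value) records, an
    assumed flat format for stored BLAST results. Clusters without any
    significant hit do not appear in the result.
    """
    present = defaultdict(set)
    for cluster_id, organism, e_value in hits:
        if e_value <= cutoff:
            present[cluster_id].add(organism)
    return {
        cluster_id: tuple(int(org in found) for org in organisms)
        for cluster_id, found in present.items()
    }


def group_by_profile(profiles):
    """Group cluster IDs that share an identical presence/absence profile."""
    groups = defaultdict(list)
    for cluster_id, profile in sorted(profiles.items()):
        groups[profile].append(cluster_id)
    return dict(groups)


# Made-up example data.
organisms = ["org_a", "org_b", "org_c"]
hits = [
    ("XXC00001", "org_a", 1e-40),
    ("XXC00001", "org_c", 1e-12),
    ("XXC00002", "org_a", 1e-30),
    ("XXC00002", "org_c", 1e-8),
    ("XXC00003", "org_b", 1e-20),
    ("XXC00003", "org_c", 0.5),  # above the cutoff, so treated as absent
]
profiles = build_profiles(hits, organisms)
groups = group_by_profile(profiles)
```

Clusters that end up in the same group share a pattern of presence and absence across the partial genomes, which is the kind of signal that could hint at related function.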
In addition to the raw nucleotide sequence, we are applying a previously developed pipeline, prot4EST, to generate a peptide prediction for each putative gene sequence. These peptide sequences are then analysed with the InterPro package to identify distinct protein domains, which in turn enable gene ontology terms to be associated with each sequence. Collation of these terms will allow us to build profiles of the molecular functions and processes associated with the partial genome obtained for each organism. Comparative analyses of these profiles may then enable us to understand more about the biology associated with each organism or collection of organisms.
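As a rough illustration of how gene ontology terms might be collated into a per-organism profile, the Python sketch below counts the fraction of clusters carrying each term. The input shape (a mapping from cluster ID to GO terms), the function name and the example assignments are assumptions made for illustration only.

```python
from collections import Counter


def go_term_profile(cluster_terms):
    """Return {GO term: fraction of clusters carrying that term}.

    `cluster_terms` maps a cluster ID to the GO terms assigned to it, an
    assumed shape for the collated domain annotations.
    """
    counts = Counter()
    for terms in cluster_terms.values():
        counts.update(set(terms))  # count a term once per cluster
    total = len(cluster_terms)
    return {term: n / total for term, n in counts.items()} if total else {}


# Illustrative data only; the cluster IDs and term assignments are made up.
organism_a = {
    "AAC00001": ["GO:0003824", "GO:0008152"],
    "AAC00002": ["GO:0003824"],
    "AAC00003": ["GO:0005515"],
    "AAC00004": ["GO:0005515", "GO:0005515"],
}
profile_a = go_term_profile(organism_a)
```

Profiles built this way for two organisms can be compared term by term to see which functions are over- or under-represented in one partial genome relative to the other.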


What is a Cluster?

A cluster is a set of sequences (in this case expressed sequence tags, ESTs) which share enough sequence similarity to suggest that they all derive from the same gene. Hence we treat individual clusters as sets of sequences derived from the same gene and, by obtaining their consensus, aim to derive a single 'gene' sequence. Note that we use quotes around the word 'gene'. In some cases, due to incomplete sequencing, the 'gene' sequence may not be full length. Furthermore, due to the nature of the clustering procedure, the 'gene' may actually derive from a set of alternative transcripts or allelic variants. By clicking on the consensus sequence in the cluster view (see below) you can retrieve the file detailing how a consensus was created, and hence gain your own insights into the nature of the sequences used to create the cluster. Rarely, sequences from members of the same gene family may be clustered together. More common is the reverse, where alternative splicing leads to the creation of more than one cluster from the same gene. The level of expansion of clusters due to alternative splicing is currently being investigated.
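The following Python sketch illustrates the general idea of grouping sequences by pairwise similarity. It uses plain single-linkage grouping with a union-find structure, which is a stand-in technique and not the algorithm implemented in CLOBB; the function name, the sequence names and the notion of a precomputed list of "similar pairs" are assumptions.

```python
def cluster_sequences(sequence_ids, similar_pairs):
    """Group sequences into clusters by single linkage.

    `similar_pairs` lists pairs of sequence IDs judged similar enough to
    belong together (for example, from a pairwise similarity search).
    """
    parent = {seq: seq for seq in sequence_ids}

    def find(seq):
        # Follow parent links to the cluster representative (path halving).
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two groups; keep the smaller name as representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for seq in sequence_ids:
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example: est1-est3 chain together, est4 and est5 pair up.
ests = ["est1", "est2", "est3", "est4", "est5"]
pairs = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters = cluster_sequences(ests, pairs)
```

Note that est1 and est3 land in the same cluster without being directly similar to each other. This transitive behaviour is one reason a cluster can contain alternative transcripts, allelic variants or, occasionally, members of the same gene family.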

Clusters are designated with a three-letter species identifier followed by an index number starting from 00001. The first two letters of the species identifier indicate the genus and species; the third letter allows differentiation between two or more species with similar IDs.
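The naming scheme can be sketched in Python as follows. The five-digit zero padding is inferred from the "00001" starting index; the use of upper-case letters, the behaviour beyond 99999, the function names and the species code "ABC" are assumptions made for illustration.

```python
import re

# Three letters followed by a zero-padded index of at least five digits.
CLUSTER_ID_PATTERN = re.compile(r"([A-Z]{3})(\d{5,})")


def format_cluster_id(species_code, index):
    """Build a cluster ID such as 'ABC00001' from its two parts."""
    if not re.fullmatch(r"[A-Za-z]{3}", species_code):
        raise ValueError("species code must be exactly three letters")
    if index < 1:
        raise ValueError("index numbers start from 1")
    return f"{species_code.upper()}{index:05d}"


def parse_cluster_id(cluster_id):
    """Split a cluster ID into (species code, integer index)."""
    match = CLUSTER_ID_PATTERN.fullmatch(cluster_id)
    if match is None:
        raise ValueError(f"not a valid cluster ID: {cluster_id!r}")
    return match.group(1), int(match.group(2))


first_id = format_cluster_id("abc", 1)
parts = parse_cluster_id("ABC00042")
```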


Searching the Database

Here we provide a web interface for searching our available datasets, allowing researchers to identify genes of interest associated with an organism of choice. At present we provide three portals of entry into the database:

1) View an entire organism's dataset
2) Search by Annotation
3) Search by Sequence Similarity

The first option allows users to view the entire partial genome of a selected organism, with its genes listed on the basis of their relative abundance. This enables users to identify which genes are the most highly represented within the dataset. The "Search by Annotation" page allows users to enter text into a search box and search an organism of choice for genes which share significant sequence similarity with a protein from the non-redundant protein database whose annotation contains the user-supplied text. The "Search by Sequence Similarity" page is a local BLAST server, which allows users to search our cluster datasets for sequences which share significant similarity with a user-specified sequence.
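One way to picture the abundance-based view is the short Python sketch below. It assumes that "relative abundance" means the number of sequences in each cluster; the function name, the input mapping and the example counts are hypothetical and do not describe the actual database query.

```python
def rank_by_abundance(cluster_sizes, top=None):
    """Return (cluster_id, size) pairs, most highly represented first.

    `cluster_sizes` maps a cluster ID to its number of member sequences.
    Ties are broken by cluster ID so the order is reproducible.
    """
    ranked = sorted(cluster_sizes.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top is None else ranked[:top]


# Made-up counts for a hypothetical organism.
sizes = {"ABC00001": 3, "ABC00002": 12, "ABC00003": 3}
ranking = rank_by_abundance(sizes)
```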


Cluster Views

After selecting a cluster to view using any of the search options, you will be presented after a few moments with a view providing details of the cluster. At the top of the cluster page you will see a section like the one shown below:



This section provides general information on the cluster of sequences used to create the putative gene object, including: the cluster identifier (as mentioned above, this is related to the genus and species of the organism from which it was derived); the species name; the number of sequences in the cluster; the types of sequences in the cluster (generally ESTs, although a few organisms may have other types of sequence data incorporated); the number of contigs (consensus sequences) built from the sequences; and a panel detailing the library composition of the sequences in the cluster. Also shown is a button allowing the user to download all the sequences associated with the cluster. In the Library panel, the user may click on the library numbers to obtain more detailed information on the cDNA libraries associated with the sequences.

Below this panel, the user will see a window detailing any BLAST annotation associated with the cluster. Under this, the user will see a graphic detailing the makeup of the consensus sequence from its constituent sequences (see below).



The graphic above details the sequences and the libraries from which they were derived. Also shown is a view of how the sequences relate to the consensus sequence. By clicking on the consensus sequence (red bar) the user may access the multiple alignment of the sequences and see how they relate to the consensus. Clicking on the sequence names or sequence bars in the graphic links the user to the GenBank entry for the sequence. The colour of the bars represents the quality of the sequence: yellow for high quality regions, grey for low quality regions. Below this graphic is a text box showing the consensus sequence of this cluster. Note that if there is more than one contig associated with a cluster, additional graphics and sequence boxes will be provided below.


PartiGeneDB and Parasite Research

Our initial interest in PartiGeneDB is in its application to parasite biology. Parasites represent a major burden on human health and economies, especially in the developing world. Due to the relatively poor economies of the countries which bear the greatest burden, drug and vaccine programs have not attracted the attention they merit. Despite this lack of investment, a large body of sequence data in the form of ESTs currently exists and continues to be generated for many of these organisms. Of the 250+ organism datasets in PartiGeneDB, approximately 70 represent parasitic organisms. The occurrence of parasitism in multiple lineages suggests that there are specific adaptations which enable a parasitic lifestyle. By focusing on groups of parasites and non-parasite comparators we aim to explore the evolutionary origins of parasites, gain insights into their biology and identify parasite-specific traits.


SimiTri

SimiTri is a Java/Perl-based application which allows simultaneous display and analysis of the relative similarity relationships of a dataset of interest to three different databases.

To launch the SimiTri application, simply select the SimiTri search option and select the organism and three databases to be used for making the comparisons. The selected datasets will then be used to generate a graphic and a list of clusters (a historical example is shown below).



In the figure above, each node of the triangle represents one set of sequences: C. elegans proteins, non-elegans nematode proteins and non-nematode proteins. The position of a cluster (represented as a single square coloured by the highest BLAST e-value) thus represents its relatedness to each of the three categories (calculated from the mean of the BLAST scores). The diagram below shows how the triangle may be interpreted.



Clusters lying on one of the three edges of the triangle have no significant BLAST hit against the database located at the opposite node. Clusters which were found to be unique (no significant BLAST score against any of the three databases) and those which had a significant score for only one database are listed in a table below the graphic.
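The geometry of such a triangular plot can be illustrated with a short Python sketch. It places a cluster by weighting the three corners of an equilateral triangle in proportion to one BLAST score per database (barycentric coordinates). This weighting, the corner coordinates, the database labels and the function name are assumptions chosen for illustration; the exact formula used by SimiTri may differ.

```python
import math

# Corner coordinates of an equilateral triangle, one corner per database.
# The labels are placeholders, not names used by SimiTri.
VERTICES = {
    "db_a": (0.0, 0.0),
    "db_b": (1.0, 0.0),
    "db_c": (0.5, math.sqrt(3) / 2),
}


def triangle_position(scores):
    """Return the (x, y) position of a cluster inside the triangle.

    `scores` maps each database label to a non-negative similarity score
    (0 when there is no significant hit). Returns None when the cluster
    has no significant hit at all, since it cannot be placed.
    """
    total = sum(scores[name] for name in VERTICES)
    if total <= 0:
        return None
    x = sum(scores[name] / total * VERTICES[name][0] for name in VERTICES)
    y = sum(scores[name] / total * VERTICES[name][1] for name in VERTICES)
    return (x, y)


centre = triangle_position({"db_a": 100, "db_b": 100, "db_c": 100})
on_side = triangle_position({"db_a": 50, "db_b": 150, "db_c": 0})
```

Under this weighting, equal scores place a cluster at the centre of the triangle, a higher score pulls it towards the corresponding corner, and a zero score for one database puts it on the side joining the other two corners.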
