PROTEIN DATABASES

Protein databases are more specialized than primary sequence databases.

They contain information derived from the primary sequence databases.

Some contain protein translations of the nucleic acid sequences.

Some contain sets of patterns and motifs derived from sequence homologs.

GenBank - the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

PIR - Protein Information Resource - a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database.

SWISS-PROT & TrEMBL - SWISS-PROT is a curated protein sequence database. TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.

TIGR - a collection of curated databases containing DNA and protein sequence, gene expression, cellular role, protein family, and taxonomic data for microbes, plants and humans.

MOTIF, PATTERN & PROFILE DATABASES

ALIGN - a compendium of sequence alignments: it is a companion resource to PRINTS.

BLOCKS - multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

DOMO - a database of homologous protein domain families.

HOMSTRAD - a curated database of structure-based alignments for homologous protein families.

InterPro - Integrated Resource of Protein Domains and Functional Sites - InterPro is an integrated documentation resource for protein families, domains and sites, developed initially as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. Each combined InterPro entry includes functional descriptions and literature references, and links are made back to the relevant member database(s), allowing users to see at a glance whether a particular family or domain has associated patterns, profiles, fingerprints, etc. Merged and individual entries (i.e., those that have no counterpart in the companion resources) are assigned unique accession numbers. Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1,000,000 hits in total). InterPro aims to reduce duplication of effort in the labour-intensive, rate-limiting process of annotation, and to facilitate communication between the disparate resources. By uniting these databases, InterPro capitalises on their individual strengths, producing a single entity that is far greater than the sum of its parts.

Pfam - a database of multiple alignments of protein domains or conserved protein regions. The alignments represent evolutionarily conserved structure that has implications for the protein's function. Profile hidden Markov models (profile HMMs) built from the Pfam alignments can be very useful for automatically recognizing that a new protein belongs to an existing protein family, even if the homology is weak.

PRINTS - Protein Fingerprint Database - a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family.

PRINTS-S - relational cousin of the PRINTS database.

ProDom - an automatic compilation of homologous domains. ProDom families were generated automatically using PSI-BLAST with a profile built from the seed alignments of Pfam-A 4.3 families.

PROSITE - a database of protein families and domains, consisting of biologically significant sites, patterns and profiles.

Protein Profiles - online cross-references to the Oxford University Press Protein Profiles project.

ProtoMap - offers an exhaustive classification of all the proteins in the SWISS-PROT and TrEMBL databases into groups of related proteins. The resulting classification splits the protein space into well-defined groups of proteins, most of which are closely correlated with natural biological families and superfamilies. The hierarchical organization may help to detect finer subfamilies that make up known families of proteins, as well as interesting relations between protein families.

SBASE - a library of protein domain sequences that contains 237,937 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.

SYSTERS - the SYSTERS cluster set contains sequences from SWISS-PROT, TrEMBL, PIR, Wormpep, and MIPS Yeast protein translations, which are sorted into disjoint clusters. Fragmentary sequences form single-sequence clusters, while the remaining sequences are grouped into clusters of non-redundant sequences.

PROTEIN STRUCTURE DATABASES

CATH Protein Structure Classification - a hierarchical domain classification of protein structures in the Brookhaven Protein Data Bank.

FSSP - Fold Classification based on Structure-Structure Alignment of Proteins - based on exhaustive all-against-all 3D structure comparison of protein structures currently in the Protein Data Bank (PDB).

Library of Protein Family Cores - structural alignments of protein families and computed average core structures for each family. Useful for building models, threading, and exploratory analysis.

ModBase - a database of three-dimensional protein models calculated by comparative modeling.

PRESAGE - a database of proteins, each of which has a collection of annotations reflecting current experimental status, structural assignments, models, and suggestions.

RCSB Protein Data Bank - single international repository for the processing and distribution of 3-D macromolecular structure data primarily determined experimentally.

Protein Loop Classification - Conformational clusters and consensus sequences for protein loops derived by computational analysis of their structures.

SCOP - Structural Classification of Proteins - a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.

Sloop Database - Sloop Database of Super Secondary Fragments - a classification of protein loops.

3Dee - Database of Protein Domain Definitions - contains structural domain definitions for all protein chains in the Protein Data Bank (PDB) that have 20 or more residues and are not theoretical models.

GENOMES

DEAMBULUM - contains the GENOMES: Viruses, Archaea, Bacteria, Fungi, Plants, Animals, and Man.

FlyBase - a comprehensive database for information on the genetics and molecular biology of Drosophila. It includes data from the Drosophila Genome Projects and data curated from the literature.

GeneCards - database of human genes, their products and their involvement in diseases.

GeneCensus - Genome Comparisons

GenDis - Human Genetic Disease Database

Genome Database - Regions of the human genome, including genes, clones, amplimers (PCR markers), breakpoints, cytogenetic markers, fragile sites, ESTs, syndromic regions, contigs and repeats. Maps of the human genome, including cytogenetic maps, linkage maps, radiation hybrid maps, content contig maps, and integrated maps. These maps can be displayed graphically via the Web. Variations within the human genome, including mutations and polymorphisms, plus allele frequency data.

KEGG: Kyoto Encyclopedia of Genes and Genomes - describes information pathways that consist of interacting molecules or genes, and provides links from the gene catalogs produced by genome sequencing projects.

PROTEOME - The BioKnowledge Library of Public Human PSD, Caenorhabditis elegans (WormPD), Saccharomyces cerevisiae (YPD) and S. pombe (PombePD).

Saccharomyces Genome Database - a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.

Whitehead Institute for Genomic Research - information on the Neurospora crassa Genome Database, Human SNP Database, Human Physical Mapping Project, Mouse Genetic and Physical Mapping Project, Rat Genetic Mapping Project, Mouse RH Mapping Project, and Genome Center ftp Archive (Data).

WORMBASE - a repository of mapping, sequencing and phenotypic information about the C. elegans nematode

TRANSCRIPTIONAL REGULATION DATABASES & ALGORITHMS

COMPEL - Database on composite regulatory elements affecting gene transcription in eukaryotes.

EPD - Eukaryotic Promoter Database - an annotated non-redundant collection of eukaryotic Pol II promoters, for which the transcription start site has been determined experimentally.

RegulonDB - A database on transcriptional regulation in Escherichia coli.

TRANSFAC - The Transcription Factor Database

TRRD - Transcription Regulatory Region Database

FastM and ModelInspector - Programs for the generation of models for regulatory regions in DNA sequences.

FunSiteP - Recognition and classification of eukaryotic promoters.

PatSearch - Search for potential transcription factor binding sites.

Promoter Inspector - Prediction of promoter regions in mammalian genomic sequences.

Mat Inspector - Search for potential transcription factor binding sites.

RSATools - Regulatory Sequence Analysis Tools

S_Compsearch for NFATp/AP-1 Comp. Elements

TRADAT - TRAnscription Databases and Analysis Tools

2Zip - Computational Approaches to Identify Leucine Zippers

OTHER

BIND - full descriptions of interactions, molecular complexes and pathways.

BioMagRes Bank - NMR-derived protein structures.

Cytomer - A relational database of physiological systems, organs and cell types.

ENZYME - Enzyme Nomenclature Database

Enzyme Structures Database - contains the known enzyme structures that have been deposited in the Brookhaven Protein Data Bank (the PDB).

Gene Ontology Consortium - attempts to produce a dynamic controlled vocabulary that can be applied to all eukaryotes.

Human Transcript Database - a curated source for information related to RNA molecules that have been sequenced.

LIGAND - Database for enzymes, compounds, and reactions.

Metabolic Pathways of Biochemistry - graphically represents all major metabolic pathways, primarily those important to human biochemistry.

NDB - Nucleic Acid Database Project - assembles and distributes structural information about nucleic acids.

PMD - Protein Mutant Database - covers natural as well as artificial mutants, including random and site-directed ones, for all proteins except members of the globin and immunoglobulin families.

REBase - Restriction Enzyme Database - contains detailed information about restriction enzymes, methylases, the microorganisms from which they have been isolated, recognition sequences, cleavage sites, methylation specificity, the commercial availability of the enzymes, and references.

Radar - Rapid Automatic Detection and Alignment of Repeats in protein sequences.

rRNA Database - all about ribosomal RNA.

S/MARtDB - information about scaffold/matrix attached regions.

TargetDB - database of peptides targeting proteins to cellular locations.

Transpath - Signal Transduction Browser - an information system on gene-regulatory pathways. Focuses on pathways involved in the regulation of transcription factors in different species, mainly human, mouse and rat. Elements of the relevant signal transduction pathways, like hormones, receptors, enzymes and transcription factors, are stored together with information about their interaction and references in an object-oriented database.

TOOLS

CLUSTALW - Multiple sequence alignment tool

ProteinProspector - Proteomics tools for mining sequence databases in conjunction with Mass Spectrometry experiments.

ReBASE Information Tool - ReBASE query tool.

SeqHound - database sequence fetch program.

SignalP - predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes.

SIMILARITY, HOMOLOGY SEARCH

These algorithms are designed for the comparison of a protein sequence against sequence databases to detect similar or homologous proteins. Conserved regions usually have similar amino acid sequence and/or structural similarities. Perform at least three separate searches using different algorithms. If default settings do not detect any similar proteins, try varying the PAM matrix values. Lower PAM matrix values are best for identifying short regions of sequence with very high similarity. Higher PAM matrices are able to detect longer, weaker matches. Simultaneously, adjust the gap penalty value around the default value.

BLAST - The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The BLAST search algorithm is designed to find close matches rapidly. It is faster than the Smith-Waterman (S-W) algorithm.

BLITZ - performs a sensitive and extremely fast comparison of a protein sequence against the SWISS-PROT protein sequence database using the Smith-Waterman algorithm. The Smith-Waterman algorithm is able to detect short matching regions such as binding sites in the middle of long sequences.

Bic-sw - Smith & Waterman algorithm implementation for protein database searches

BMC Search Launcher

FASTA - detects patches of regional similarity rather than the best alignment between the query sequence and the database sequences. Very fast, but complete sensitivity is sacrificed.

GeneMatcher - The Smith-Waterman (S-W) search algorithm used by the FDF server is about 5% more sensitive towards divergent matches than the BLAST algorithm. This significantly increases the chances of finding distant homologs of your query sequence in the databases. FDF software incorporates a frameshift-tolerant search algorithm. This feature is particularly useful when searching for potential coding sequences in low-quality DNA sequences, such as those found in EST databases.

MPsearch - MPSRCH is a biological sequence comparison tool that implements the true Smith and Waterman algorithm. This algorithm exhaustively compares every letter in a query sequence with every letter in the database.
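The exhaustive comparison described for the Smith-Waterman searches can be illustrated with a small dynamic-programming routine. This is only a minimal sketch: it assumes a flat match/mismatch score and a linear gap penalty rather than a PAM matrix, and the function name and default values are hypothetical, not taken from MPsearch or any other server listed here.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return (best local score, end position in query, end position in subject).

    Positions are 1-based; (0, 0, 0) means no positive-scoring alignment.
    """
    rows, cols = len(query) + 1, len(subject) + 1
    # H[i][j] = best score of a local alignment ending at query[i-1], subject[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == subject[j - 1]
            diag = H[i - 1][j - 1] + (match if same else mismatch)
            up = H[i - 1][j] + gap      # gap in the subject
            left = H[i][j - 1] + gap    # gap in the query
            # Clamping at zero is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_i, best_j = H[i][j], i, j
    return best, best_i, best_j


print(smith_waterman_score("GKT", "AAAGKTAAA"))  # (6, 3, 6)
```

Every query letter is compared with every subject letter, so the cost grows with the product of the two lengths. That is why a short match buried in a long sequence is still found, and also why heuristic tools such as BLAST and FASTA are faster.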

Paralign and SWMMX - searches a number of sequence databases for sequences similar to your amino acid query sequence using two very sensitive algorithms. You can choose between the well-known Smith-Waterman optimal local alignment algorithm or a new algorithm called ParAlign, which is much faster but still almost as sensitive.

Pfam - HMM Search - Unlike standard pairwise alignment methods (e.g. BLAST, FASTA), Pfam HMMs deal sensibly with multidomain proteins.

SAS - Sequences Annotated by Structure - will perform a FASTA search of the given sequence against the proteins of known structure in the PDB and return a multiple alignment of all hits, each annotated by structural features.

Scanps 2.3 - Fast implementation of the true Smith & Waterman algorithm for protein database searches.

MOTIF, PATTERN & PROFILE SEARCH

There are a limited number of families into which most proteins are grouped. Proteins within a given family generally have a shared function. Conserved regions are usually important for function or for maintaining a specific 3D structure. Conserved regions usually have similar amino acid sequence and/or structural similarities. Domains are distinct functional regions of a protein, often linked together by a flexible region. Motifs are recurring substructures found in many proteins. Proteins of 500 or more amino acids most likely contain discrete functional domains. Regions of low complexity often separate domains. Long stretches of repeated residues, particularly proline, glutamine, serine, or threonine, often indicate linker sequences. Approximately 2000-3000, out of a predicted 10,000-20,000, different protein families have been characterized. Roughly half of the proteins encoded in a new genome can be placed in a known family based on their amino acid sequence.
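As a rough illustration of the linker heuristic mentioned above, the following sketch flags runs made up only of proline, glutamine, serine and threonine. The function name, the choice to allow any mix of those four residues, and the minimum run length of 5 are all assumptions made for the example; they are not thresholds used by any tool listed here.

```python
import re


def find_linker_candidates(sequence, residues="PQST", min_len=5):
    """Return (start, end, segment) for each run of the given residues.

    start and end are 1-based and inclusive.
    """
    pattern = re.compile("[%s]{%d,}" % (residues, min_len))
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(sequence)]


print(find_linker_candidates("MKVLAPSTPQSSTGHL"))  # [(6, 13, 'PSTPQSST')]
```

A hit is only a hint that a flexible linker may be present; dedicated low-complexity filters use compositional statistics rather than a fixed residue set.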

CDD - A Conserved Domain Database and Search Service

eMatrix - fast and accurate sequence analysis using minimal-risk scoring matrices.

eMotif Scan - sequence database search using eMatrix regular expressions.

eMotif Search - protein classification search.

InterProScan - queries a protein sequence against InterPro.

Kangaroo - Kangaroo is a pattern search program. Given a sequence pattern the program will find all the records that contain that pattern.

MEME - Multiple EM for Motif Elicitation - a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.

MOTIF - finds sequence motifs in a query sequence, and provides functional and genomic information on the motifs found, using DBGET and LinkDB for the hyperlinked annotations. Results are presented graphically and, where available, 3D structures of the motifs found can be examined with the RasMol program when the hits are in the PROSITE database. Also, given a profile generated from a multiple sequence alignment, or retrieved from a motif library such as PROSITE or Pfam, you can align a protein sequence with the profile.

Network Protein Sequence Analysis - this multi-algorithm server offers two pattern and signature searches: PATTINPROT, which scans a protein sequence or a protein database for one or several pattern(s), and PROSCAN, which scans a sequence for sites/signatures against the PROSITE database.

Pfam HMM Search - Analyzes a protein query sequence to find Pfam domain matches.

PPSearch - Protein motif searches

PredictProtein - this multi-algorithm server searches the PROSITE Database to detect functional motifs and PRODOM to detect protein domains.

ProDom BLAST - BLAST homology search against all domain sequences in ProDom.

ProfileScan Server - compares a protein or nucleic acid sequence against a profile library (PROSITE or Pfam).

ProtoMap - classifies a new protein sequence.

Pscan - uses information derived from the PRINTS database to detect functional fingerprints in proteins.

P-val FingerPRINTScan - finds the closest matching PRINTS fingerprint/s to a query sequence.

ScanProsite - Scans a protein sequence for the occurrence of patterns stored in the PROSITE database.
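To show what scanning a sequence for a PROSITE-style pattern involves, here is a minimal sketch that translates the pattern notation into a Python regular expression. The helper names are hypothetical and this is not ScanProsite's implementation; it handles only the basic syntax (x, [..], {..}, repeat counts and end anchors) and reports non-overlapping matches only. The example pattern [AG]-x(4)-G-K-[ST] is the commonly cited ATP/GTP-binding P-loop motif; consult PROSITE itself for the authoritative entry.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4): repeat count
        element = element.replace("x", ".")                     # x: any residue
        element = element.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
        parts.append(element)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (start, end, match) for each non-overlapping hit; positions are 1-based."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("[AG]-x(4)-G-K-[ST]."))           # [AG].{4}GK[ST]
print(scan("LLYGPPGTGKTLLARAV", "[AG]-x(4)-G-K-[ST]."))  # [(4, 11, 'GPPGTGKT')]
```

Short patterns such as this one match many unrelated proteins by chance, so a hit should be treated as a lead to follow up with profile or HMM searches rather than as proof of function.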

SMART - Simple Modular Architecture Research Tool

SPRINT - Search the PRINTS-S Database.

3motif - searches by eMOTIF, PDB Structure or BLOCKS accession number.

SECONDARY STRUCTURE

Folding and coiling due to H-bond formation determines secondary structure. H-bonds form between the backbone carbonyl (C=O) and amide (N-H) groups of nonadjacent amino acids. A single polypeptide can have both helical and sheet regions. Regions that are neither helix nor sheet can form bends, loops or turns.

BTPRED - The Beta-Turn Prediction Server (temporarily down)

CPHModels - predicts protein structure using comparative (homology) modelling.

COILS - compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.

Garnier Peptide Structure Tool - an implementation of the original Garnier-Osguthorpe-Robson algorithm (GOR I) for predicting protein secondary structure. Secondary structure prediction is notoriously difficult to do accurately. The GOR I algorithm is one of the first semi-successful methods.

HTH - gives a practical estimation of the probability that the sequence is a helix-turn-helix motif.

Jpred2 - takes either a protein sequence or a multiple alignment of protein sequences, and predicts secondary structure. It works by combining a number of modern, high-quality prediction methods to form a consensus.

META PredictProtein - this multi-algorithm server utilizes eight different algorithms for predicting secondary structure.

MultiCoil - predicts the location of coiled-coil regions in amino acid sequences and classifies the predictions as dimeric or trimeric. The method is based on the PairCoil algorithm.

PairCoil - predicts the location of coiled-coil regions in amino acid sequences by use of Pairwise Residue Correlations.

PredictProtein - this multi-algorithm server utilizes two algorithms to predict secondary structure.

PREDATOR - an accurate algorithm for secondary structure prediction based on recognition of potentially hydrogen-bonded residues in the amino acid sequence.

PSA Protein Structure Prediction Server - determines the probable placement of secondary structural elements along a query sequence.

PSIPRED

Structure Prediction Server - this multi-algorithm server uses the PHD algorithm to predict secondary structure.

SOSUI

SSpro - Protein secondary structure prediction based on Bidirectional Recurrent Neural Networks (BRNNs).

Tandem Repeats Finder - a program to locate and display tandem repeats (two or more adjacent, approximate copies of a pattern of nucleotides) in DNA sequences.
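The idea of a tandem repeat can be illustrated with a short sketch that finds exact, back-to-back copies of a unit of fixed length. This is a deliberate simplification: Tandem Repeats Finder also detects approximate copies and does not need the unit length in advance, whereas the hypothetical function below requires both exact copies and a caller-supplied unit length.

```python
import re


def find_exact_tandem_repeats(dna, unit_len, min_copies=2):
    """Return (start, unit, copies) for exact tandem repeats; start is 1-based."""
    # Group 1 captures one unit; the backreference demands further identical copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for m in pattern.finditer(dna):
        hits.append((m.start() + 1, m.group(1), len(m.group()) // unit_len))
    return hits


print(find_exact_tandem_repeats("GGCACACACATT", 2))  # [(3, 'CA', 4)]
```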

Tmpred - Prediction of Transmembrane Regions and Orientation - makes a prediction of membrane-spanning regions and their orientation. The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins. The prediction is made using a combination of several weight-matrices for scoring.

TMHMM - predicts transmembrane helices and the predicted location of the intervening loop regions.

TERTIARY STRUCTURE

Tertiary structure results from folding of the secondary structural elements. Tertiary structure is stabilized by bonds formed between amino acid R groups (H-bonds, ionic interactions, covalent bonds, hydrophobic interactions).

Dali - compares the coordinates of a query protein structure against those in the Protein Data Bank. The output consists of a multiple alignment of structural neighbours.

SWEET - a program for constructing 3D models of saccharides from their sequences using standard nomenclature.

3D-PSSM - A Fast, Web-based Method for Protein Fold Recognition using 1D and 3D Sequence Profiles coupled with Secondary Structure and Solvation Potential Information.

TraDES - a New Way to Customize and Explore Protein Conformational Space.

PROTEIN CHEMISTRY

Compute pI/MW

CUTTER: A tool to generate and analyze proteolytic fragments.

FindMod Tool - predicts potential protein post-translational modifications (PTM) and finds potential single amino acid substitutions in peptides.

GlycoMod Tool - predicts the possible oligosaccharide structures that occur on proteins from their experimentally determined masses.

PEPSTATS: Protein Statistics - outputs a report of simple protein sequence information including: molecular weight, number of residues, average residue weight, charge, and isoelectric point; for each type of amino acid: number, molar percent, and DayhoffStat; for each physico-chemical class of amino acid: number and molar percent.
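A few of the simpler quantities in such a report can be sketched directly. The code below is an illustration only, not the PEPSTATS implementation: it assumes standard average residue masses (in daltons), adds one water molecule for the free termini, takes the average residue weight to be the molecular weight divided by the number of residues, and leaves out charge, isoelectric point and DayhoffStat. The names `RESIDUE_MASS` and `protein_stats` are hypothetical.

```python
from collections import Counter

# Average residue masses in daltons (assumed standard values; verify before relying on them).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added for the free N- and C-termini


def protein_stats(sequence):
    """Return residue count, molecular weight, average residue weight and molar percent."""
    counts = Counter(sequence)
    n = len(sequence)
    weight = sum(RESIDUE_MASS[aa] * c for aa, c in counts.items()) + WATER_MASS
    return {
        "residues": n,
        "molecular_weight": weight,
        "average_residue_weight": weight / n,
        "molar_percent": {aa: 100.0 * c / n for aa, c in counts.items()},
    }


stats = protein_stats("GPPGTGKT")
print(stats["residues"], round(stats["molecular_weight"], 1))
```

The function raises a `KeyError` for any character outside the 20 standard amino acids, which is a simple way to catch stray whitespace or ambiguity codes before interpreting the numbers.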

Phospepsort4

PredAcc - Protein side chains relative solvent accessibility prediction.

ProtParam Tool - allows the computation of various physical and chemical parameters for a given protein stored in SWISS-PROT or TrEMBL or for a user entered sequence.

YinOYang 1.2 Prediction Server - produces neural network predictions for O-β-GlcNAc attachment sites in eukaryotic protein sequences.

PROTEIN SEQUENCE FOR ANALYSIS

Analyze the following sequence. For each type of search, use three different search algorithms:

  • Similarity/homology searches
  • Motif, Pattern & Profile searches
  • Secondary Structure Prediction
  • Tertiary Structure Prediction

SRYPGQVSFGGIGGLNDQIRELREVIELPLKNPELFLRVGIKPPKGVLLYGPPGTGKTLLARAVASSLETNFLKVVSSAIVDKYIGESARLIREMFGYAKGTRALHHLHGRDRCHRWQAFQRGYICRQRNPAYTYGAPQPARRFRLSRQDQDHHGDEPPRYPRPCFAACRPSRSQD