Bioinformatics Questions and Answers Part-3

1. Which of the following statements about COG is incorrect regarding its features?
a) Currently, there are 4,873 clusters in the COG databases derived from unicellular organisms
b) It is constructed by comparing protein sequences encoded in forty-three completely sequenced genomes, which are mainly from prokaryotes, representing thirty major phylogenetic lineages
c) The interface for sequence searching in the COG database is the COGnitor program, which is based on gapped BLAST
d) It is a protein family database based on structural classification

Answer: d
Explanation: COG, which stands for Clusters of Orthologous Groups, is a protein family database based on phylogenetic classification, not structural classification. Because orthologous proteins shared by three or more lineages are considered to have descended through a vertical evolutionary scenario, if the function of one member is known, the function of the other members can be inferred.
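The function-transfer rule in this explanation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transfer_function`, the `(protein_id, lineage)` input layout and the example identifiers are hypothetical and are not part of the COG database or the COGnitor program.

```python
from typing import Dict, List, Tuple


def transfer_function(
    members: List[Tuple[str, str]],
    known: Dict[str, str],
    min_lineages: int = 3,
) -> Dict[str, str]:
    """Propose annotations for unannotated members of one orthologous cluster.

    members: (protein_id, lineage) pairs for the cluster (hypothetical layout).
    known:   protein_id -> function for members that are already characterised.
    Returns a dict of protein_id -> inferred function, or {} if no transfer is made.
    """
    lineages = {lineage for _, lineage in members}
    functions = {known[pid] for pid, _ in members if pid in known}
    # Transfer only when the cluster spans enough lineages and the known
    # members agree on a single function (an extra safety assumption).
    if len(lineages) < min_lineages or len(functions) != 1:
        return {}
    (function,) = functions
    return {pid: function for pid, _ in members if pid not in known}


cluster = [("p1", "Proteobacteria"), ("p2", "Firmicutes"), ("p3", "Euryarchaeota")]
inferred = transfer_function(cluster, {"p1": "DNA ligase"})
print(inferred)
```

A cluster covering fewer than three lineages returns an empty result, which mirrors the three-lineage condition stated above.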

2. Which of the following statements about ProtoNet is incorrect regarding its features?
a) Protein relatedness is defined by the P-values from the BLAST alignments
b) The most closely related sequences are grouped into the lowest level clusters
c) More distant protein groups are merged into higher levels of clusters
d) The outcome of this cluster merging is a tree-like structure of functional categories

Answer: a
Explanation: ProtoNet is a database of clusters of homologous proteins, similar to COG. Protein relatedness is defined by the E-values, not the P-values, from the BLAST alignments. The database further provides gene ontology information for the protein clusters at each level, as well as keywords from InterPro domains for functional prediction.
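The level-by-level merging described in options b) to d) can be illustrated with a small single-linkage sketch. It is not the database's actual clustering algorithm: the function `merge_levels`, the E-value thresholds and the toy protein names are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def merge_levels(
    proteins: List[str],
    evalues: Dict[Tuple[str, str], float],
    thresholds: List[float],
) -> List[List[List[str]]]:
    """Return one clustering per E-value threshold, strictest threshold first."""
    levels = []
    for cutoff in sorted(thresholds):
        parent = {p: p for p in proteins}

        def find(x: str) -> str:
            # Follow parent links to the cluster representative.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Join any two proteins whose BLAST E-value is at or below the cutoff.
        for (a, b), evalue in evalues.items():
            if evalue <= cutoff:
                parent[find(a)] = find(b)

        groups: Dict[str, List[str]] = {}
        for p in proteins:
            groups.setdefault(find(p), []).append(p)
        levels.append(sorted(sorted(g) for g in groups.values()))
    return levels


toy_evalues = {("A", "B"): 1e-50, ("B", "C"): 1e-10, ("C", "D"): 1e-2}
tree_levels = merge_levels(["A", "B", "C", "D"], toy_evalues, [1e-30, 1e-5, 1.0])
for level in tree_levels:
    print(level)
```

Reading the output from top to bottom shows closely related sequences grouped first and more distant groups merged at looser thresholds, which is the tree-like structure the question refers to.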

3. Pfam is available at four locations around the world. Which of the following is not one of them?
a) UK
b) Sweden
c) US
d) Japan

Answer: d
Explanation: Pfam is available at four locations around the world, each providing a core set of functionality for accessing each family. They are the US, the UK, Sweden and France. Documentation on the content and use of Pfam is available via the web.

4. Which of the following is not a member database of InterPro?
a) SCOP
b) HAMAP
c) PANTHER
d) Pfam

Answer: a
Explanation: The signatures in InterPro come from 11 member databases, namely CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. SCOP is not among them.

5. Which of the following statements about SCOP is incorrect regarding its features?
a) Proteins with the same shapes but having little sequence or functional similarity are placed in ‘superfamilies’, and are assumed to have only a very distant common ancestor
b) Proteins having the same shape and some similarity of sequence and/or function are placed in ‘families’, and are assumed to have a closer common ancestor
c) SCOP was created in 1994 in the Centre for Protein Engineering and the University College London
d) It aims to determine the evolutionary relationship between proteins

Answer: c
Explanation: SCOP (Structural Classification of Proteins) was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology, not at University College London. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England.

6. What is the source of protein structures in SCOP and CATH?
a) UniProt
b) Protein Data Bank
c) Ensembl
d) InterPro

Answer: b
Explanation: The source of protein structures in SCOP and CATH is the PDB (Protein Data Bank). The PDB is a primary archive of experimentally determined three-dimensional structures; SCOP and CATH are secondary databases that classify the structures deposited in it. UniProt is a protein sequence database, not a source of structures.

7. Which of the following statements about SUPERFAMILY database is incorrect regarding its features?
a) Sequences can be submitted in raw or FASTA format
b) Sequences must be submitted in FASTA format only
c) It searches the database using a superfamily, family, or species name plus a sequence, SCOP, PDB or HMM IDs
d) It has generated GO annotations for evolutionarily close domains and distant domains

Answer: b
Explanation: SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Sequences can be submitted in raw or FASTA format, so FASTA is not the only accepted format. They can be amino acids, a fixed-frame nucleotide sequence, or all frames of a submitted nucleotide sequence. Up to 1000 sequences can be run at a time.
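To make the difference between raw and FASTA input concrete, here is a minimal sketch of a reader that accepts either. It is not the SUPERFAMILY server's own parser: the function `read_sequences`, the default name `query` and the constant `MAX_SEQUENCES` are illustrative assumptions, with the limit of 1000 taken from the explanation above.

```python
from typing import List, Tuple

MAX_SEQUENCES = 1000  # limit quoted in the explanation; the name is hypothetical


def read_sequences(text: str) -> List[Tuple[str, str]]:
    """Parse raw or FASTA-formatted text into (name, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record; store the previous one first.
            if name is not None or chunks:
                records.append((name or "query", "".join(chunks)))
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else "unnamed"), []
        else:
            # Sequence line: drop internal spaces and normalise the case.
            chunks.append(line.replace(" ", "").upper())
    if name is not None or chunks:
        records.append((name or "query", "".join(chunks)))
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"at most {MAX_SEQUENCES} sequences per submission")
    return records


raw_records = read_sequences("mkv lla\nggh")
fasta_records = read_sequences(">p1 example\nMKV\nLLA\n>p2\nGGH")
print(raw_records)
print(fasta_records)
```

Raw input has no header line, so the whole text is treated as a single sequence, whereas each `>` line in FASTA input starts a new named record.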

8. Which of the following statements about the PRINTS and ProDom databases is incorrect regarding their features?
a) PRINTS is a compendium of protein fingerprints
b) Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space
c) Current versions of ProDom are built using a novel procedure based on recursive BLAST searches
d) ProDom domain database consists of an automatic compilation of homologous domains

Answer: c
Explanation: Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches, not plain BLAST searches. PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of UniProt.
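The idea of a fingerprint as several non-overlapping motifs occurring in order along a sequence can be sketched as follows. This is a deliberate simplification, not the PRINTS scanning method: motifs are written here as fixed-width regular expressions, and the function `match_fingerprint`, the patterns and the test sequence are invented for illustration.

```python
import re
from typing import List, Optional


def match_fingerprint(sequence: str, motifs: List[str]) -> Optional[List[int]]:
    """Return the start position of each motif if all occur in order, else None.

    Assumes every motif is a fixed-width pattern, so taking the leftmost
    match for each motif cannot hide a valid match for a later one.
    """
    position = 0
    starts: List[int] = []
    for pattern in motifs:
        hit = re.compile(pattern).search(sequence, position)
        if hit is None:
            return None
        starts.append(hit.start())
        # The next motif must begin after this one ends, so motifs never overlap.
        position = hit.end()
    return starts


toy_sequence = "MKCAAHGGWLLSTP"
toy_fingerprint = ["C..H", "W[LI]L", "S[TS]P"]
print(match_fingerprint(toy_sequence, toy_fingerprint))
```

Requiring all motifs together is what gives a fingerprint more diagnostic power than any single motif; a sequence that carries the motifs in the wrong order is rejected.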

9. Which of the following statements about the CATH-Gene3D and HAMAP databases is incorrect regarding their features?
a) CATH-Gene3D describes protein families and domain architectures in complete genomes
b) In CATH-Gene3D the functional annotation is provided to proteins from a single resource
c) HAMAP profiles are manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies
d) HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes

Answer: b
Explanation: In CATH-Gene3D, protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. Functional annotation is provided to proteins from multiple resources, not a single one. Functional prediction and analysis of domain architectures are available at the website.

10. Which of the following statements about the PANTHER and TIGRFAMs databases is incorrect regarding their features?
a) TIGRFAMs provides a tool for identifying functionally related proteins based on sequence homology
b) TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation
c) Hidden Markov models (HMMs) are not used in PANTHER
d) PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise

Answer: c
Explanation: In PANTHER, the subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (human-curated molecular function and biological process classifications and pathway diagrams), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily to classify additional protein sequences, so the statement that PANTHER does not use HMMs is incorrect.