
Clustering Procedure in MultiCoMP

Overview

The data stored in the database have been prepared using a two-stage clustering procedure: protein clustering and conformational clustering. The first clustering procedure is based on sequence similarity, and the second one is based on structural similarity.

Protein Clustering

First, all protein chains were clustered by percentage sequence identity using BLASTClust. Identity cutoffs from 20% to 90% in steps of 10% were used, and the minimum length coverage of BLASTClust was set to 0.5. See the NCBI website for details of BLASTClust. Only chains containing one or more transmembrane (TM) regions were used in the BLASTClust clustering.


We then clustered the PDB IDs such that each cluster (referred to as a "Protein Group") consists of a set of PDB entries that share at least one chain belonging to a common BLASTClust cluster. Note that this procedure resulted in overlapping Protein Groups. For example, suppose PDB entries 1abc, 1def and 1ghi each consist of chains A and B. If 1abcA and 1defA belong to one BLASTClust cluster and 1abcB and 1ghiB belong to another BLASTClust cluster, then 1abc and 1def form one Protein Group (PG1), and 1abc and 1ghi form another Protein Group (PG2).
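The grouping in the example above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the database: the function name protein_groups is hypothetical, and chain identifiers are assumed to be a four-character PDB ID followed by the chain label.

```python
def protein_groups(chain_clusters):
    """Turn chain clusters into (possibly overlapping) Protein Groups.

    chain_clusters: list of lists of chain IDs such as "1abcA".
    Assumption: the first four characters of a chain ID are the PDB ID.
    """
    groups = []
    for chains in chain_clusters:
        # Every PDB entry contributing a chain to this cluster joins the group.
        pdb_ids = frozenset(chain[:4] for chain in chains)
        if pdb_ids not in groups:  # skip exact duplicate groups
            groups.append(pdb_ids)
    return groups


# The example from the text: two chain clusters that both contain 1abc.
clusters = [["1abcA", "1defA"], ["1abcB", "1ghiB"]]
groups = protein_groups(clusters)
for number, group in enumerate(groups, start=1):
    print(f"PG{number}: {sorted(group)}")
```

Entry 1abc appears in both groups, which is the overlap noted above.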

Conformation Clustering

Given a Protein Group, we first reconstructed the biological unit of each member protein using the information from PDBTM. The biological unit of a protein consists of a set of chains. Thus, to compare two protein structures as biological units, we need to establish a one-to-one correspondence between chains.

Sequence filter

Before examining chain correspondences for a pair of proteins (proteins 1 and 2), the following sequence filter was applied:

1) We calculated the sequence similarity between all possible pairs of chains (one from protein 1 and the other from protein 2) and classified the pairs into either "potential" (if the percentage sequence identity ≥20%) or "reject" (otherwise).

2) The largest subset S1 of chains from protein 1 was defined, such that each chain had at least one “potential” matching partner from protein 2. Similarly, the largest subset S2 of chains from protein 2 was defined.

3) The number of chains in S1 (denoted as |S1|) and that in S2 (denoted as |S2|) may be different. This typically happens if protein 1 is a homo-oligomer (e.g., a tetramer) and protein 2 is a smaller homo-oligomer (e.g., a dimer). In such a situation, the best subset of protein 1 to match protein 2 cannot be determined uniquely. We currently ignore all these cases and label protein pairs with |S1| ≠ |S2| as "incomparable".
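The three filter steps above can be sketched as follows. This is a minimal illustration rather than the actual implementation: the function name sequence_filter and the input format (a dictionary of precomputed percentage identities keyed by chain pair) are assumptions made for the example.

```python
def sequence_filter(identity, threshold=20.0):
    """Apply the sequence filter to one pair of proteins.

    identity: dict mapping (chain_of_protein1, chain_of_protein2)
              to percentage sequence identity (hypothetical input format).
    Returns the "potential" pairs, S1, S2, and whether |S1| == |S2|.
    """
    # Step 1: pairs with identity >= threshold are "potential".
    potential = {pair for pair, pct in identity.items() if pct >= threshold}
    # Step 2: chains having at least one "potential" partner.
    s1 = {c1 for c1, _ in potential}
    s2 = {c2 for _, c2 in potential}
    # Step 3: pairs of proteins with |S1| != |S2| are "incomparable".
    comparable = len(s1) == len(s2)
    return potential, s1, s2, comparable


# Homo-tetramer (chains A-D) versus homo-dimer (chains E, F), all identical.
identity = {(c1, c2): 100.0 for c1 in "ABCD" for c2 in "EF"}
potential, s1, s2, comparable = sequence_filter(identity)
print(len(s1), len(s2), "comparable" if comparable else "incomparable")
```

In this example |S1| = 4 and |S2| = 2, so the pair is labeled "incomparable".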

Rigid-body superimposition

For all remaining protein pairs, all “potential” chain combinations of S1 and S2 were generated, and the structural similarity was calculated for the entire complexes. We currently use ProFit (Martin ACR and Porter CT) with the ALIGN option to perform rigid-body superimposition. Although ProFit is a well-tested, robust program, it occasionally fails to output an RMSD value for unknown reasons. If no RMSD value was obtained for any possible combination of chains, the pair of proteins was also labeled "incomparable". In the future, we plan to use other programs for structure comparison.


The chain combination with the lowest Cα RMSD (Å) after optimal superimposition was chosen as the chain correspondence for the pair of proteins, and the RMSD value was used as the measure of structural similarity.
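The search over chain combinations described above might look like the sketch below. It assumes that a "chain combination" is a one-to-one mapping between S1 and S2 in which every matched pair is "potential". The superimposition itself is not implemented here: rmsd_fn is a hypothetical callable standing in for a ProFit run, returning the Cα RMSD for a mapping or None when no value is produced.

```python
from itertools import permutations


def best_correspondence(s1, s2, potential, rmsd_fn):
    """Return (rmsd, mapping) with the lowest RMSD, or None if "incomparable".

    rmsd_fn: hypothetical stand-in for the superimposition program; takes a
             list of (chain1, chain2) tuples, returns an RMSD in Å or None.
    """
    chains1 = sorted(s1)
    best = None
    for perm in permutations(sorted(s2)):
        mapping = list(zip(chains1, perm))
        # Keep only mappings built entirely from "potential" pairs.
        if not all(pair in potential for pair in mapping):
            continue
        rmsd = rmsd_fn(mapping)
        if rmsd is None:  # superimposition produced no RMSD value
            continue
        if best is None or rmsd < best[0]:
            best = (rmsd, mapping)
    return best


# Toy data: two chains per protein, with made-up RMSD values.
potential = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}
toy_rmsd = {
    (("A", "C"), ("B", "D")): 1.2,
    (("A", "D"), ("B", "C")): 0.8,
}
result = best_correspondence({"A", "B"}, {"C", "D"}, potential,
                             lambda m: toy_rmsd.get(tuple(m)))
print(result)
```

Enumerating permutations grows factorially with the number of chains, so this brute-force form is only practical for small complexes.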

Hierarchical clustering

For a Protein Group PGi, we defined the largest subset PG_Si in which no pair of elements was labeled "incomparable". The remaining proteins in PGi (i.e., the complement of PG_Si in PGi) were excluded from the conformational clustering.


In summary, certain proteins can be excluded from the conformational clustering for two reasons:

- The numbers of chains that could be matched differed (|S1| ≠ |S2|).

- ProFit failed to produce RMSD values.


The entries excluded from conformational clustering (if any) are shown at the bottom of the "3. Explore conformations" page.
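Selecting the largest subset without "incomparable" pairs can be illustrated with a brute-force search. The text does not say how the subset was computed or how ties between equally large subsets were resolved, so the function below (largest_comparable_subset, a hypothetical name) simply returns the first largest subset it finds.

```python
from itertools import combinations


def largest_comparable_subset(members, incomparable):
    """Return a largest subset of members containing no "incomparable" pair.

    members: list of PDB IDs in one Protein Group.
    incomparable: iterable of 2-tuples of PDB IDs labeled "incomparable".
    Brute force: tries subsets from largest to smallest.
    """
    bad = {frozenset(pair) for pair in incomparable}
    for size in range(len(members), 0, -1):
        for subset in combinations(members, size):
            if not any(frozenset(p) in bad for p in combinations(subset, 2)):
                return set(subset)
    return set()


# Made-up example: 1jkl cannot be compared with two other members.
members = ["1abc", "1def", "1ghi", "1jkl"]
incomparable = [("1abc", "1jkl"), ("1def", "1jkl")]
kept = largest_comparable_subset(members, incomparable)
excluded = set(members) - kept
print(sorted(kept), sorted(excluded))
```

Here 1jkl is excluded and the other three entries proceed to conformational clustering.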


Finally, average linkage clustering was performed using the hclust function in R, and "Conformational Groups" were defined for a range of RMSD cutoff values between 0.5 and 3.5 Å.
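To show what average linkage clustering with a cutoff does, here is a small pure-Python sketch. It is not the hclust code used for the database; the function name, the input format, and the rule of stopping once the smallest average distance exceeds the cutoff are assumptions chosen to mirror cutting an average-linkage tree at a given height.

```python
def average_linkage_groups(names, dist, cutoff):
    """Agglomerative average-linkage clustering, stopped at a cutoff.

    names: list of structure identifiers.
    dist: dict mapping frozenset({a, b}) to the Cα RMSD (Å) between a and b.
    cutoff: clusters are merged only while their average distance <= cutoff.
    """
    clusters = [[name] for name in names]

    def average(c1, c2):
        # Mean of all pairwise distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, i, j = min(
            (average(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up RMSD values for three structures.
dist = {
    frozenset(("1abc", "1def")): 0.4,
    frozenset(("1abc", "1ghi")): 3.0,
    frozenset(("1def", "1ghi")): 3.2,
}
names = ["1abc", "1def", "1ghi"]
tight = average_linkage_groups(names, dist, cutoff=1.0)
loose = average_linkage_groups(names, dist, cutoff=3.5)
print(tight, loose)
```

With a cutoff of 1.0 Å the toy data yield two groups; with 3.5 Å all three structures fall into a single group.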

Citation

Coming soon.

If there is something you would like to know more about, please do not hesitate to contact us. Your feedback will help us further improve this Web application.

Settings

Protein Cutoff X (%)

Proteins having a percentage sequence identity (%) ≥ X with the reference protein are considered related and appear in the list on this page.

Conformation Cutoff Y (Å)

Structures with a Cα RMSD value (Å) > Y are considered distinct conformations and clustered separately as Conformational Group 01, 02, and so on.