Conformational ensemble generation, data representation and visualization of predicted flexibility properties using BioExcel Building Blocks (biobb) and FlexDyn tools

Workflow included in the ELIXIR 3D-Bioinfo Implementation Study:

Building on PDBe-KB to chart and characterize the conformation landscape of native proteins

This tutorial illustrates, step by step, how to generate protein conformational ensembles from 3D structures and analyse their molecular flexibility using the BioExcel Building Blocks library (biobb).

Conformational landscape of native proteins

Proteins are dynamic systems that adopt multiple conformational states, a property essential for many biological processes (e.g. binding other proteins, nucleic acids, small molecule ligands, or switching between functionally active and inactive states). Characterizing the different conformational states of proteins and the transitions between them is therefore critical for gaining insight into their biological function and can help explain the effects of genetic variants in health and disease and the action of drugs.

Structural biology has become increasingly efficient in sampling the different conformational states of proteins. The PDB has currently archived more than 170,000 individual structures, but over two thirds of these structures represent multiple conformations of the same or related protein, observed in different crystal forms, when interacting with other proteins or other macromolecules, or upon binding small molecule ligands. Charting this conformational diversity across the PDB can therefore be employed to build a useful approximation of the conformational landscape of native proteins.

A number of resources and tools describing and characterizing various, often complementary, aspects of protein conformational diversity in known structures have been developed, notably by groups in Europe. These tools include algorithms of varying sophistication for aligning the 3D structures of individual protein chains or domains, or of protein assemblies, and for evaluating their degree of structural similarity. Using such tools one can align structures pairwise, compute the corresponding similarity matrix, and identify ensembles of structures/conformations with a defined similarity level that tend to recur in different PDB entries, an operation typically performed using clustering methods. Such workflows underlie resources such as CATH, Contemplate, or PDBflex, which offer access to conformational ensembles composed of similar conformations clustered according to various criteria. Other types of tools focus on differences between protein conformations, identifying regions of proteins that undergo large collective displacements in different PDB entries, those that act as hinges or linkers, or regions that are inherently flexible.
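The align, compare and cluster steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by CATH, Contemplate or PDBflex: the conformations are assumed to be already superposed (the alignment step is skipped), similarity is measured with a plain RMSD, and clustering is a simple single-linkage grouping at a hypothetical cutoff. The coordinates and the function names (rmsd, rmsd_matrix, cluster) are made up for the example.

```python
import math
from itertools import combinations


def rmsd(a, b):
    """RMSD between two equally sized lists of (x, y, z) tuples.

    Assumption: the two conformations are already superposed.
    """
    if len(a) != len(b):
        raise ValueError("conformations must have the same number of atoms")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def rmsd_matrix(confs):
    """Symmetric matrix of pairwise RMSD values."""
    n = len(confs)
    mat = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        mat[i][j] = mat[j][i] = rmsd(confs[i], confs[j])
    return mat


def cluster(mat, cutoff):
    """Single-linkage clustering: merge any pair within the cutoff (union-find)."""
    parent = list(range(len(mat)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mat)), 2):
        if mat[i][j] <= cutoff:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(mat)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy three-atom "conformations" (made-up coordinates, in Angstrom)
closed_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
closed_b = [(0.0, 0.1, 0.0), (1.0, 0.0, 0.0), (2.0, 0.1, 0.0)]
open_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
open_b = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 3.1, 0.0)]

confs = [closed_a, closed_b, open_a, open_b]
matrix = rmsd_matrix(confs)
clusters = cluster(matrix, cutoff=0.5)
print(clusters)
```

With real structures, the superposition step and the choice of similarity measure and cutoff would largely determine which ensembles are recovered.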

To build a meaningful approximation of the conformational landscape of native proteins, the conformational ensembles (and the differences between them), identified on the basis of structural similarity/dissimilarity measures alone, need to be biophysically characterized. This may be approached at two different levels.

At the biological level, it is important to link observed conformational ensembles to their functional roles by evaluating the correspondence with protein family classifications based on sequence information and functional annotations in public databases, e.g. UniProt and the PDBe-Knowledge Base (KB). These links should provide valuable mechanistic insights into how the conformational and dynamic properties of proteins are exploited by evolution to regulate their biological function.
At the physical level, one needs to introduce energetic considerations to evaluate the likelihood that the identified conformational ensembles represent conformational states that the protein (or domain under study) samples in isolation. Such evaluation is notoriously challenging and can only be roughly approximated by using computational methods to evaluate the extent to which the observed conformational ensembles can be reproduced by algorithms that simulate the dynamic behavior of protein systems. These algorithms include the computationally expensive classical molecular dynamics (MD) simulations to sample local thermal fluctuations, but also faster, more approximate methods such as Elastic Network Models and Normal Mode Analysis (NMA) to model low energy collective motions. Alternatively, enhanced sampling molecular dynamics can be used to model complex types of conformational changes, but at a very high computational cost.

The ELIXIR 3D-Bioinfo Implementation Study "Building on PDBe-KB to chart and characterize the conformation landscape of native proteins" focuses on:

Mapping the conformational diversity of proteins and their homologs across the PDB.
Characterizing the different flexibility properties of protein regions, and linking this information to sequence and functional annotation.
Benchmarking computational methods that can predict a biophysical description of protein motions.

This notebook is part of the third objective, for which a list of computational resources able to predict protein flexibility and conformational ensembles has been collected, evaluated, and integrated in reproducible and interoperable workflows using the BioExcel Building Blocks library. Note that the list is not meant to be exhaustive; it reflects the expertise of the implementation study partners.

The selected tools are listed in the following table, classified by the type of underlying theoretical method and presented together with the corresponding publication DOIs and URLs:

Tool	URL	Reference	Conda	Type
Concoord	URL	Reference	Conda	Atomistic intra-molecular interactions
ProDy	URL	Reference	Conda	Vibrational Analysis
FlexServ	URL	Reference	Conda	Vibrational Analysis, Coarse-Grained MD
NOLB	URL	Reference	Conda	Vibrational Analysis, Atomistic intra-molecular interactions
iMod	URL	Reference	Conda	Vibrational Analysis, Atomistic intra-molecular interactions

where the theoretical method types are:

Vibrational analysis: Tools computing the vibrational normal modes of protein three-dimensional structures, taking as input PDB or mmCIF files. The movements associated with the vibrational modes (i.e. the eigenvectors) are known to be good descriptors of protein dynamics and flexibility. The type of output delivered by the different modes varies from protein motions to conformational ensembles.
Coarse-grained molecular simulations: Tools in this category make use of coarse-grained representations of the proteins and of molecular simulation techniques to generate conformational ensembles.
Atomistic intra-molecular interactions: Such tools make use of a potential to represent intra-molecular interactions and predict likely conformational changes in a protein structure.
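To make the vibrational analysis idea concrete, the sketch below builds a simple anisotropic elastic network on a handful of pseudo C-alpha positions, diagonalizes its Hessian with NumPy, and displaces the structure along the lowest non-rigid mode to produce a small ensemble. It is an illustration under stated assumptions and not the implementation used by ProDy, FlexServ, NOLB or iMod: the helix coordinates are made up, and the cutoff (15 Å), the uniform spring constant, the displacement amplitudes and the function name anm_hessian are all hypothetical choices.

```python
import numpy as np


def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Build the 3N x 3N Hessian of a simple anisotropic elastic network."""
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = float(d @ d)
            if dist2 > cutoff ** 2:
                continue  # no spring between nodes beyond the cutoff
            block = -gamma * np.outer(d, d) / dist2
            si, sj = slice(3 * i, 3 * i + 3), slice(3 * j, 3 * j + 3)
            hessian[si, sj] = block
            hessian[sj, si] = block
            hessian[si, si] -= block  # diagonal blocks are minus the row sum
            hessian[sj, sj] -= block
    return hessian


# Toy input: 10 pseudo C-alpha positions on an ideal helix (made-up data)
index = np.arange(10)
angles = np.deg2rad(100.0) * index
coords = np.column_stack(
    [2.3 * np.cos(angles), 2.3 * np.sin(angles), 1.5 * index]
)

hessian = anm_hessian(coords)
eigenvalues, eigenvectors = np.linalg.eigh(hessian)  # ascending eigenvalues

# Rigid-body translations and rotations show up as (near-)zero eigenvalues
n_rigid = int(np.sum(np.abs(eigenvalues) < 1e-6))

# Lowest-frequency internal mode, reshaped to one displacement vector per node
lowest_mode = eigenvectors[:, 6].reshape(-1, 3)

# Linear displacement along that mode gives a small conformational ensemble
amplitudes = np.linspace(-2.0, 2.0, 5)
ensemble = np.array([coords + a * lowest_mode for a in amplitudes])
print(n_rigid, ensemble.shape)
```

Linear displacement along a Cartesian mode distorts bond lengths at large amplitudes, so it is only a rough approximation of a physically plausible motion.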

The notebook is divided into two main blocks:

Conformational ensemble generation: where the different selected methods are used to build conformational ensembles that are then visualized using the NGL viewer.
Macromolecular flexibility analysis: where the generated ensembles are analysed to extract flexibility properties that will be used in subsequent comparisons with the conformational diversity and flexibility observed in the PDB.
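One common flexibility property that can be extracted from an ensemble is the per-atom root-mean-square fluctuation (RMSF), which can also be expressed as an isotropic B-factor through the standard relation B = (8π²/3)·RMSF². The sketch below is a minimal, hypothetical illustration in plain Python and not the analysis code used in the notebook: the frames are made-up coordinates, all frames are assumed to be already superposed on a common reference, and the function names rmsf and rmsf_to_bfactor are invented for the example.

```python
import math


def rmsf(ensemble):
    """Per-atom RMSF for frames given as ensemble[frame][atom] = (x, y, z).

    Assumption: all frames are already superposed on a common reference.
    """
    n_frames = len(ensemble)
    n_atoms = len(ensemble[0])
    values = []
    for atom in range(n_atoms):
        # Mean position of this atom over the ensemble
        mean = [
            sum(frame[atom][k] for frame in ensemble) / n_frames
            for k in range(3)
        ]
        # Mean squared deviation from that mean position
        msd = sum(
            (frame[atom][k] - mean[k]) ** 2
            for frame in ensemble
            for k in range(3)
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


def rmsf_to_bfactor(value):
    """Convert an RMSF (in Angstrom) to an isotropic B-factor (in Angstrom^2)."""
    return 8.0 * math.pi ** 2 / 3.0 * value ** 2


# Toy ensemble: atom 0 is fixed, atom 1 swings along y (made-up data)
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, -1.0, 0.0)],
]

fluctuations = rmsf(frames)
bfactors = [rmsf_to_bfactor(v) for v in fluctuations]
print(fluctuations, bfactors)
```

Profiles of this kind, computed per residue, are what would be compared against the variability observed across experimental structures.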

The particular structure used is the complex between adenylate kinase from Escherichia coli and the inhibitor AP5A (PDB code 1AKE).

Settings

Biobb modules used

biobb_flexserv: Tools to compute biomolecular flexibility on protein 3D structures.
biobb_flexdyn: Tools to study the conformational landscape of native proteins.
biobb_io: Tools to fetch biomolecular data from public databases.
biobb_structure_utils: Tools to modify or extract information from a PDB structure.
biobb_analysis: Tools to analyse Molecular Dynamics trajectories.
biobb_gromacs: Tools to setup and run Molecular Dynamics simulations with GROMACS MD package.

Auxiliary libraries used

jupyter: Free software, open standards, and web services for interactive computing across all programming languages.
plotly: Python interactive graphing library integrated in Jupyter notebooks.
nglview: Jupyter/IPython widget to interactively view molecular structures and trajectories in notebooks.
simpletraj: Lightweight coordinate-only trajectory reader based on code from GROMACS, MDAnalysis and VMD.
pandas: Open source data analysis and manipulation tool, built on top of the Python programming language.

Conda Installation and Launch

Note that this workflow has two environment files, one for Linux and one for macOS:

linux

git clone https://github.com/bioexcel/biobb_wf_flexdyn.git
cd biobb_wf_flexdyn
conda env create -f conda_env/environment.linux.yml
conda activate biobb_wf_flexdyn
jupyter-notebook biobb_wf_flexdyn/notebooks/biobb_wf_flexdyn.ipynb

macos

git clone https://github.com/bioexcel/biobb_wf_flexdyn.git
cd biobb_wf_flexdyn
conda env create -f conda_env/environment.macos.yml
conda activate biobb_wf_flexdyn
jupyter-notebook biobb_wf_flexdyn/notebooks/biobb_wf_flexdyn.ipynb

Tutorial

Click here to view tutorial in Read the Docs

Click here to execute tutorial in Binder

Version

2023.3 Release

Copyright & Licensing

This software has been developed in the MMB group at the BSC & IRB for the European BioExcel, funded by the European Commission (EU H2020 823830, EU H2020 675728, EU HORIZON-EUROHPC-JU 101093290).

(c) 2015-2023 Barcelona Supercomputing Center
(c) 2015-2023 Institute for Research in Biomedicine

Licensed under the Apache License 2.0, see the file LICENSE for details.