alphafold2

ALPHAFOLD2

Preparing to run Alphafold

The ALPHAFOLD2 source is an implementation of the inference pipeline of AlphaFold v2.0, which uses a completely new model that was entered in CASP14. This is not a production application per se, but a reference implementation that is capable of producing structures from a single amino acid sequence.

From the developers' original publication: "The provided inference script is optimized for predicting the structure of a single protein, and it will compile the neural network to be specialized to exactly the size of the sequence, MSA, and templates. For large proteins, the compile time is a negligible fraction of the runtime, but it may become more significant for small proteins or if the multi-sequence alignments are already precomputed. In the bulk inference case, it may make sense to use our make_fixed_size function to pad the inputs to a uniform size, thereby reducing the number of compilations required."

The SBGrid installation of Alphafold2 does not require Docker to run, but does require a relatively recent NVIDIA GPU and an up-to-date driver.

CUDA SDK is required

The alphafold2 pipeline requires that the CUDA Toolkit SDK be installed. We use version 11.1 and have also had success with more recent versions. The executables provided in the SDK are not redistributable, so we do not include them with our CUDA libraries.

Required Databases and Parameters

AlphaFold requires a set of (large) genetic databases that must be downloaded separately. See https://github.com/deepmind/alphafold#genetic-databases for more information.

These databases can be downloaded with the included download script and the aria2c program, both of which are available in the SBGrid collection. Note that these databases are large (> 2 TB) and may take a significant amount of time to download.

/programs/x86_64-linux/alphafold/2.2.0/alphafold/scripts/download_all_data.sh <destination path>

The database directory should look like this:

├── bfd     
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata                                                                                                 
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters_2018_12.fa
├── params
│   ├── LICENSE
│   ├── params_model_1_multimer.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── uniclust30_2018_08
├── uniprot
│   └── uniprot.fasta
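
A quick way to confirm that the download completed is to check for the top-level directories shown in the listing above. The following is a minimal sketch: the function name check_alphafold_dbs is hypothetical, and the directory list is copied from the listing, so extend it if your installation uses others (for example the uniref90 directory referenced in the example commands).

```shell
# Report which of the expected top-level database directories are missing.
# check_alphafold_dbs is a hypothetical helper, not part of AlphaFold or SBGrid.
check_alphafold_dbs() {
    local data_dir="$1"
    local missing=0
    local sub
    for sub in bfd mgnify params pdb70 pdb_mmcif pdb_seqres small_bfd uniclust30 uniprot; do
        if [ ! -d "${data_dir}/${sub}" ]; then
            echo "missing: ${data_dir}/${sub}"
            missing=$((missing + 1))
        fi
    done
    echo "${missing} of 9 expected directories missing"
    # Succeed only when every directory is present.
    [ "${missing}" -eq 0 ]
}
```

Called as check_alphafold_dbs /programs/local/alphafold, it returns a non-zero status if anything is missing, so it can guard a job script.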

Running the python script run_alphafold.py

The run_alphafold.py script requires all parameters to be set explicitly. Pass --helpshort or --helpfull to see help on flags. See examples below.

GPU Memory

Memory is going to be an issue with larger protein sizes. The original publication suggests some things to try:

"Inferencing large proteins can easily exceed the memory of a single GPU. For a V100 with 16 GB of memory, we can predict the structure of proteins up to ~1,300 residues without ensembling and the 256- and 384-residue inference times are using a single GPU’s memory."

"The memory usage is approximately quadratic in the number of residues, so a 2,500 residue protein involves using unified memory so that we can greatly exceed the memory of a single V100. In our cloud setup, a single V100 is used for computation on a 2,500 residue protein but we requested four GPUs to have sufficient memory."
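
As a rough illustration of the quadratic scaling described in the quotes above, the sketch below extrapolates from the quoted reference point (about 1,300 residues on a 16 GB V100). The helper name estimate_gpu_mem_gb is hypothetical, and treating 1,300 residues as exactly 16 GB is an assumption, so read the result as an order-of-magnitude guide rather than a measurement.

```shell
# Rough estimate only: scale the quoted reference point (about 1,300 residues
# on a 16 GB V100) quadratically with the number of residues.
# estimate_gpu_mem_gb is a hypothetical helper for illustration.
estimate_gpu_mem_gb() {
    awk -v n="$1" 'BEGIN { r = n / 1300; printf "%.0f\n", 16 * r * r }'
}

estimate_gpu_mem_gb 2500
```

For 2,500 residues this prints 59, which is roughly in line with the quote's request for four GPUs if each is assumed to have 16 GB.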

The following environment variable settings may help with larger polypeptide calculations (> 1,200 aa).

TF_FORCE_UNIFIED_MEMORY=1
XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
XLA_PYTHON_CLIENT_ALLOCATOR=platform

Thanks to Ci Ji Lim at Wisconsin for suggesting and testing these.
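
These settings only take effect if they are exported in the shell that launches run_alphafold.py. A minimal sketch in bash/zsh syntax follows (tcsh users would use setenv instead); the final check is just a convenience and not required.

```shell
# Settings quoted from the text above; exporting them makes them visible
# to run_alphafold.py when it is started from the same shell.
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
export XLA_PYTHON_CLIENT_ALLOCATOR=platform

# Show what the next command will inherit.
env | grep -E '^(TF_FORCE_UNIFIED_MEMORY|XLA_PYTHON_CLIENT_)' | sort
```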

Web portal

It is possible to run AlphaFold through a web portal. See https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb

Changes in latest versions

2.2.0

  • Added new AlphaFold-Multimer models with greatly reduced numbers of clashes on average and slightly increased accuracy.
  • Removed unused bias argument in GlobalAttention.
  • Removed prokaryotic MSA pairing algorithm as it didn’t improve accuracy on average.
  • Added the ability to run with multiple seeds per model to match the AlphaFold-Multimer paper.
  • Fixed degraded performance when using num_recycle=0 with models trained with recycling due to incorrect skipping of layers.
  • Added split_rng=False (current default) to sharded_map to support new Haiku release.
  • Removed unused code in amber_minimize.py.

New model parameters are required. They can be found here: https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar

2.1.2
Version 2.1.2 adds a few new features that must be explicitly defined on the command line. See the examples below and run_alphafold.py --help for more info.

2.1.1
On 4 Nov 2021 we added version 2.1.0 to the installation. This version allows prediction of multimers from FASTA files containing multiple sequences. This version is not currently the default, but will be after further testing.

To use it, set the ALPHAFOLD_X variable to 2.1.0 in the shell or in the ~/.sbgrid.conf file. This is the standard SBGrid version override method.
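
In a shell session the override might look like the sketch below. It assumes bash or zsh (tcsh users would use setenv ALPHAFOLD_X 2.1.0), and the echo is only there to show how the example commands on this page build the script path from the variable.

```shell
# Select the AlphaFold version for this shell session (bash/zsh syntax).
export ALPHAFOLD_X=2.1.0

# The example commands on this page build the script path from ALPHAFOLD_X.
echo "/programs/x86_64-linux/alphafold/${ALPHAFOLD_X}/bin.capsules/run_alphafold.py"
```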

NOTE

  • Databases must be redownloaded for 2.1.1, specifically UniProt and PDB seqres databases.
    • Run scripts/download_uniprot.sh <DOWNLOAD_DIR>.
    • Remove/rename <DOWNLOAD_DIR>/pdb_mmcif. PDB SeqRes and PDB mmCIF must be from exactly the same date; otherwise, errors may occur when searching for templates while running AlphaFold-Multimer.
    • Run scripts/download_pdb_mmcif.sh <DOWNLOAD_DIR>.
    • Run scripts/download_pdb_seqres.sh <DOWNLOAD_DIR>.
  • Update the model parameters.
    • Remove/rename the old model parameters in <DOWNLOAD_DIR>/params.
    • Download new model parameters using scripts/download_alphafold_params.sh <DOWNLOAD_DIR>.
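
The update steps in the list above can be sketched as a single dry-run script. Everything outside the script names is an assumption for illustration: the DOWNLOAD_DIR and SCRIPTS defaults (the latter guessed by analogy with the download_all_data.sh path shown earlier), the ".old" rename suffix, and the DRY_RUN switch. With DRY_RUN=1 (the default here) it only prints what it would do.

```shell
# Dry-run sketch of the database update steps listed above.
# DOWNLOAD_DIR, SCRIPTS, the ".old" suffix and DRY_RUN are assumptions.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-/programs/local/alphafold}"
SCRIPTS="${SCRIPTS:-/programs/x86_64-linux/alphafold/2.1.1/alphafold/scripts}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

update_alphafold_dbs() {
    # UniProt, then PDB mmCIF and PDB SeqRes so both come from the same date.
    run "${SCRIPTS}/download_uniprot.sh" "${DOWNLOAD_DIR}"
    run mv "${DOWNLOAD_DIR}/pdb_mmcif" "${DOWNLOAD_DIR}/pdb_mmcif.old"
    run "${SCRIPTS}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
    run "${SCRIPTS}/download_pdb_seqres.sh" "${DOWNLOAD_DIR}"
    # Move the old model parameters aside, then fetch the new ones.
    run mv "${DOWNLOAD_DIR}/params" "${DOWNLOAD_DIR}/params.old"
    run "${SCRIPTS}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
}

update_alphafold_dbs
```

Review the printed commands first, then rerun with DRY_RUN=0 to perform the update.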

Some command line flags have changed since version 2.0.0. We recommend running the run_alphafold.py command directly and are not providing a wrapper script (as we did for 2.0.0) at the present time.

Examples

Alphafold2 2.3.1

/programs/x86_64-linux/alphafold/${ALPHAFOLD_X}/bin.capsules/run_alphafold.py \
    --data_dir=/programs/local/alphafold/ \
    --output_dir=$(pwd) \
    --fasta_paths=${input} \
    --max_template_date=2020-05-14 \
    --db_preset=full_dbs \
    --bfd_database_path=/programs/local/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniref30_database_path=/programs/local/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --uniref90_database_path=/programs/local/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path=/programs/local/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=/programs/local/alphafold/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/programs/local/alphafold/pdb_mmcif/obsolete.dat \
    --use_gpu_relax=True \
    --model_preset=monomer \
    --pdb70_database_path=/programs/local/alphafold/pdb70/pdb70

where the input fasta is

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED

This is very similar to earlier 2.2.x releases, but the --uniclust30_database_path flag has been renamed to --uniref30_database_path.

NOTE: Alphafold2 introduced new flags for GPU-based relaxation (--use_gpu_relax in the example above) that must be specified. You can also reuse the MSAs from a previous run. See the alphafold2 GitHub repo for more info.

Alphafold2 2.1.2
Standard prediction example: assuming the Alphafold2 databases and parameters are in /programs/local/alphafold, use:

/programs/x86_64-linux/alphafold/${ALPHAFOLD_X}/bin.capsules/run_alphafold.py \
    --data_dir=/programs/local/alphafold/ \
    --output_dir=$(pwd) \
    --fasta_paths=${input} \
    --max_template_date=2020-05-14 \
    --db_preset=full_dbs \
    --bfd_database_path=/programs/local/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=/programs/local/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --uniref90_database_path=/programs/local/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path=/programs/local/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=/programs/local/alphafold/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/programs/local/alphafold/pdb_mmcif/obsolete.dat \
    --use_gpu_relax=True \
    --model_preset=monomer \
    --pdb70_database_path=/programs/local/alphafold/pdb70/pdb70

where the input fasta is

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED

Alphafold2 2.1.2
Multimer example:

/programs/x86_64-linux/alphafold/${ALPHAFOLD_X}/bin.capsules/run_alphafold.py \
    --data_dir=/programs/local/alphafold \
    --output_dir=$(pwd) \
    --fasta_paths=${input} \
    --max_template_date=2020-05-14 \
    --db_preset=full_dbs \
    --bfd_database_path=/programs/local/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=/programs/local/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
    --uniref90_database_path=/programs/local/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path=/programs/local/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=/programs/local/alphafold/pdb_mmcif/mmcif_files \
    --uniprot_database_path=/programs/local/alphafold/uniprot/uniprot.fasta \
    --pdb_seqres_database_path=/programs/local/alphafold/pdb_seqres/pdb_seqres.txt \
    --obsolete_pdbs_path=/programs/local/alphafold/pdb_mmcif/obsolete.dat \
    --use_gpu_relax=True \
    --model_preset=multimer

where the input multimer fasta is

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED
>T1084
MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH
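
Because the monomer and multimer examples differ mainly in the input FASTA and the --model_preset value, a job script can pick the preset from the number of records in the file. This is a minimal sketch: choose_model_preset is a hypothetical helper, and it only distinguishes the two presets used in the examples on this page.

```shell
# Pick --model_preset from the number of FASTA records (header lines
# starting with ">"). choose_model_preset is a hypothetical helper.
choose_model_preset() {
    local fasta="$1"
    local n
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    n=$(grep -c '^>' "${fasta}" || true)
    if [ "${n:-0}" -eq 0 ]; then
        echo "no FASTA records found in ${fasta}" >&2
        return 1
    elif [ "${n}" -eq 1 ]; then
        echo monomer
    else
        echo multimer
    fi
}
```

It could then be used as --model_preset="$(choose_model_preset "${input}")"; note that multimer runs also need the --uniprot_database_path and --pdb_seqres_database_path flags shown in the multimer example.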

You can download example scripts of the above here:
alphafold_monomer_2.1.2.sh
alphafold_multimer_2.1.2.sh

Alphafold2 2.1.1
Standard prediction example:

/programs/x86_64-linux/alphafold/2.1.1/bin.capsules/run_alphafold.py \
	--data_dir=/programs/local/alphafold/ \
	--output_dir=/scratch/data/sbgrid/alphafold/test_monomer \
	--fasta_paths=test_monomer.fasta \
	--max_template_date=2020-05-14 \
	--db_preset=full_dbs \
	--bfd_database_path=/programs/local/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
	--uniclust30_database_path=/programs/local/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
	--uniref90_database_path=/programs/local/alphafold/uniref90/uniref90.fasta \
	--mgnify_database_path=/programs/local/alphafold/mgnify/mgy_clusters_2018_12.fa \
	--template_mmcif_dir=/programs/local/alphafold/pdb_mmcif/mmcif_files \
	--pdb70_database_path=/programs/local/alphafold/pdb70/pdb70 \
	--obsolete_pdbs_path=/programs/local/alphafold/pdb_mmcif/obsolete.dat

where test_monomer.fasta is

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED

Alphafold2 2.1.1
Multimer example:

/programs/x86_64-linux/alphafold/2.1.1/bin.capsules/run_alphafold.py \
	--data_dir=/programs/local/alphafold \
	--output_dir=/scratch/data/sbgrid/alphafold/test_multimer \
	--fasta_paths=test_multimer.fasta \
	--max_template_date=2020-05-14 \
	--db_preset=full_dbs \
	--bfd_database_path=/programs/local/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
	--uniclust30_database_path=/programs/local/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
	--uniref90_database_path=/programs/local/alphafold/uniref90/uniref90.fasta \
	--mgnify_database_path=/programs/local/alphafold/mgnify/mgy_clusters_2018_12.fa \
	--template_mmcif_dir=/programs/local/alphafold/pdb_mmcif/mmcif_files \
	--model_preset=multimer \
	--uniprot_database_path=/programs/local/alphafold/uniprot/uniprot.fasta \
	--pdb_seqres_database_path=/programs/local/alphafold/pdb_seqres/pdb_seqres.txt \
	--obsolete_pdbs_path=/programs/local/alphafold/pdb_mmcif/obsolete.dat

where test_multimer.fasta is

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED
>T1084
MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH

Known issues

  • Unified memory across GPUs does not appear to work in the current version.
  • The ptxas executable is required to be in PATH in some cases, but not all. We cannot redistribute this binary since it is part of the NVIDIA CUDA SDK. Unfortunately, it must be installed separately and its directory, typically /usr/local/cuda/bin, added to the PATH environment variable. Version 11.0.3 works well in our hands, but other CUDA versions should also work. You can download the SDK here: https://developer.nvidia.com/cuda-toolkit-archive
  • Clashes between monomers have been reported in some cases in multimer mode.
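
For the ptxas issue above, a job script can check for the executable before starting a run. The sketch below assumes bash or zsh; ensure_ptxas is a hypothetical helper, and its default directory is the typical location named above, which may differ on your system.

```shell
# Make sure ptxas is on PATH, adding the CUDA bin directory if needed.
# ensure_ptxas is a hypothetical helper; pass your CUDA bin directory
# as the first argument if it is not /usr/local/cuda/bin.
ensure_ptxas() {
    local cuda_bin="${1:-/usr/local/cuda/bin}"
    if ! command -v ptxas >/dev/null 2>&1 && [ -x "${cuda_bin}/ptxas" ]; then
        PATH="${cuda_bin}:${PATH}"
        export PATH
    fi
    # Prints the resolved path, or fails if ptxas is still not found.
    command -v ptxas
}

ensure_ptxas || echo "ptxas not found; install the CUDA Toolkit SDK or pass its bin directory" >&2
```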