The ALPHAFOLD2 package provides an implementation of the inference pipeline of AlphaFold v2.0, using a completely new model that was entered in CASP14. This is not a production application per se, but a reference implementation capable of producing structures from a single amino acid sequence.
From the developers' original publication: "The provided inference script is optimized for predicting the structure of a single protein, and it will compile the neural network to be specialized to exactly the size of the sequence, MSA, and templates. For large proteins, the compile time is a negligible fraction of the runtime, but it may become more significant for small proteins or if the multi-sequence alignments are already precomputed. In the bulk inference case, it may make sense to use our make_fixed_size function to pad the inputs to a uniform size, thereby reducing the number of compilations required."
The SBGrid installation of AlphaFold2 does not require Docker to run, but it does require a relatively recent NVIDIA GPU and an up-to-date driver.
AlphaFold requires a set of (large) genetic databases that must be downloaded separately. See https://github.com/deepmind/alphafold#genetic-databases for more information.
These databases can be downloaded with the included download script and the aria2c program, both of which are available in the SBGrid collection. Note that these databases are large (> 2 TB) and may take a significant amount of time to download.
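Before starting the download, it is worth confirming that the destination has enough free space. A minimal sketch; the destination path below is a placeholder, substitute your own large volume:

```shell
# Sanity check before downloading: the destination needs > 2 TB free.
DEST=/tmp/alphafold_databases   # assumption: substitute a volume with sufficient space
mkdir -p "$DEST"
df -h "$DEST"                   # confirm available space before running download_all_data.sh
```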
/programs/x86_64-linux/alphafold/2.0.0/alphafold/scripts/download_all_data.sh <destination path>
The database directory should look like this:
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters.fa
├── params
│   ├── LICENSE
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   ├── obsolete.dat
│   └── raw
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── uniclust30_2018_08
└── uniref90
    └── uniref90.fasta
Once the databases are in place, AlphaFold can be run with the wrapper script run_alphafold.sh. The default location for the databases is
/programs/local/alphafold, but this can be changed with the ALPHAFOLD_DB environment variable. For example, bash users would set
/tmp/databases as the database location with:

export ALPHAFOLD_DB="/tmp/databases"
tcsh users would use:
setenv ALPHAFOLD_DB "/tmp/databases"
To use the run script, specify the path to the FASTA file and an output directory like so:
run_alphafold.sh <path to fasta file> <path to an output directory>
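As a concrete sketch, with placeholder paths (the FASTA file and output directory below are illustrative, not files shipped with the installation):

```shell
# Placeholder inputs; substitute your own sequence and output location.
FASTA=/tmp/query.fasta
OUTDIR=/tmp/af2_out
mkdir -p "$OUTDIR"
# Drop the leading "echo" to actually launch the prediction.
echo run_alphafold.sh "$FASTA" "$OUTDIR"
```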
Other useful variables recognized by this script:

|Variable|Effect|
|---|---|
|ALPHAFOLD_DB|Set an alternative path to the database files|
|ALPHAFOLD_PTM|Use the pTM models when set|
|ALPHAFOLD_PRESET|Use the reduced_dbs or CASP14 databases|
|ALPHAFOLD_TEMPLATE|Date string for limiting the template search|
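In bash, these can be combined before invoking the wrapper. The values below are illustrative assumptions, not defaults:

```shell
# Illustrative settings consumed by run_alphafold.sh (bash syntax).
export ALPHAFOLD_DB=/tmp/databases        # alternative database location
export ALPHAFOLD_PTM=1                    # request the pTM models
export ALPHAFOLD_PRESET=reduced_dbs       # smaller databases for quick tests
export ALPHAFOLD_TEMPLATE=2020-05-14     # ignore templates released after this date
```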
You can use our run_alphafold script template to create your own run script for the SBGrid installation of AlphaFold2.
run_alphafold.sh is a convenience wrapper that shortens the command arguments required by run_alphafold.py. The run_alphafold.py script is also available; it requires all parameters to be set explicitly but provides greater flexibility. Pass --helpshort or --helpfull to see help on its flags.
Memory becomes an issue with larger proteins. The original publication suggests some things to try:
"Inferencing large proteins can easily exceed the memory of a single GPU. For a V100 with 16 GB of memory, we can predict the structure of proteins up to ~1,300 residues without ensembling, and the 256- and 384-residue inference times are measured using a single GPU's memory."
"The memory usage is approximately quadratic in the number of residues, so a 2,500 residue protein involves using unified memory so that we can greatly exceed the memory of a single V100. In our cloud setup, a single V100 is used for computation on a 2,500 residue protein but we requested four GPUs to have sufficient memory."
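The quadratic scaling above can be put in rough numbers: going from ~1,300 residues (which fits in one 16 GB V100) to 2,500 residues multiplies the memory footprint by roughly (2500/1300)², which is why several GPUs' worth of unified memory is requested:

```shell
# Back-of-envelope check of the quadratic memory scaling quoted above:
# a 2,500-residue protein needs ~(2500/1300)^2 times the memory of the
# ~1,300-residue baseline that fits on a single 16 GB V100.
awk 'BEGIN { printf "%.1f\n", (2500/1300)^2 }'   # prints 3.7
```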
The following environment variable settings may help with larger polypeptide calculations (> 1,200 aa).
Thanks to Ci Ji Lim at Wisconsin for suggesting and testing these.
TF_FORCE_UNIFIED_MEMORY=1 XLA_PYTHON_CLIENT_MEM_FRACTION=0.5 XLA_PYTHON_CLIENT_ALLOCATOR=platform
The pTM scores are not calculated using the default model. To get pTM scored models you need to change the model names in the input. We have provided a template wrapper script (https://sbgrid.org//wiki/examples/alphafold2) which you can change to your requirements. To get pTM scores you will need to change the model_name line to "model_1_ptm,model_2_ptm" etc.
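As a sketch, the model list in a copy of the wrapper could be switched with sed. The file name and the exact form of the model_names line below are assumptions about your local copy of the template, not its actual contents:

```shell
# Create a stand-in for your copy of the wrapper template (assumed layout).
cat > my_run_alphafold.sh <<'EOF'
model_names=model_1,model_2,model_3,model_4,model_5
EOF

# Replace the model list with the pTM models.
MODELS="model_1_ptm,model_2_ptm,model_3_ptm,model_4_ptm,model_5_ptm"
sed -i "s/^model_names=.*/model_names=${MODELS}/" my_run_alphafold.sh
grep '^model_names=' my_run_alphafold.sh   # now lists the pTM models
```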
We include reference sequences from CASP14 in the installation; these can be used to test your setup.
It is also possible to run AlphaFold through a web portal. See https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb .
The ptxas executable is required to be in PATH in some cases, but not all. We cannot redistribute this binary since it is part of the NVIDIA SDK. It must be installed separately and its location added to the PATH environment variable; it is typically installed in
/usr/local/cuda/bin. Version 11.0.3 works well in our hands, but other versions should also work. You can download the SDK here: https://developer.nvidia.com/cuda-toolkit-archive
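A minimal sketch of making ptxas visible, assuming the toolkit landed in the default /usr/local/cuda prefix:

```shell
# Put the CUDA toolkit's bin directory (typical default location) on PATH.
export PATH=/usr/local/cuda/bin:$PATH
# Verify ptxas is now resolvable; prints a warning rather than failing if not.
command -v ptxas || echo "ptxas not found; install the CUDA toolkit first"
```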