baktfold Method
ProstT5 3Di Inference
- baktfold begins by predicting the Foldseek 3Di tokens for every hypothetical protein input using the ProstT5 protein language model
- Alternatively, this step is skipped if you supply pre-computed protein structures in .pdb or .cif format by passing the directory containing the structures with --structure-dir
- You can also annotate all proteins, rather than only the hypothetical ones, using -a
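As a rough illustration of the kind of input --structure-dir expects, the sketch below collects the .pdb and .cif files found directly inside a directory. The function name find_structures is hypothetical and is not part of baktfold; how baktfold itself matches structure files to proteins is not described here.

```python
from pathlib import Path

# File extensions named in the text above.
STRUCTURE_SUFFIXES = {".pdb", ".cif"}


def find_structures(structure_dir):
    """Return sorted paths of .pdb/.cif files directly inside structure_dir."""
    root = Path(structure_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in STRUCTURE_SUFFIXES
    )
```

A quick check like this before a run can confirm that the directory actually contains structures in one of the two supported formats.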
Foldseek Structural Comparison
- baktfold then creates a Foldseek database combining the AA and 3Di representations of each protein, and compares this to each of the four baktfold databases with Foldseek (detailed below)
- Alternatively, you can supply protein structures that you have pre-computed for your input proteins instead of using ProstT5, with the parameter --structure-dir
- You can also specify a custom Foldseek database with --custom-db
baktfold databases
- Swiss-Prot (n=590,183)
- AFDB Clusters (v6) (n=3,085,778)
- PDB (n=294,848)
- CATH (n=195,223)
Downstream Annotation Processing
- For each of the four searches, the top hit with an E-value below the threshold (default E=1e-03) is kept
- For each protein, the following logic is applied to select the annotation from the top hit(s) of the four database searches:
- If baktfold finds hits for multiple databases, the overall functional label will be taken from (in order of priority):
1. The user’s custom database (if specified)
2. Swiss-Prot
3. AFDB, if no Swiss-Prot hits were found
4. PDB, if neither Swiss-Prot nor AFDB hits were found
5. CATH, if no other databases were hit
In practice, cases 4 and 5 above (a label taken from PDB or CATH) are extremely rare.
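The selection logic above can be sketched in a few lines of Python. This is a minimal illustration, not baktfold's actual implementation: the function names, the dictionary keys ("custom", "swissprot", and so on) and the hit fields are all hypothetical, and it assumes that the "top hit" is the one with the lowest E-value.

```python
DEFAULT_EVALUE = 1e-3

# Priority order described above; "custom" only applies when a custom
# database was supplied. These keys are illustrative, not baktfold's own.
PRIORITY = ["custom", "swissprot", "afdb", "pdb", "cath"]


def top_hit(hits, evalue_threshold=DEFAULT_EVALUE):
    """Return the lowest-E-value hit below the threshold, or None."""
    passing = [h for h in hits if h["evalue"] < evalue_threshold]
    return min(passing, key=lambda h: h["evalue"]) if passing else None


def select_annotation(hits_by_db, evalue_threshold=DEFAULT_EVALUE):
    """Return (database, hit) for the highest-priority database with a passing hit."""
    for db in PRIORITY:
        hit = top_hit(hits_by_db.get(db, []), evalue_threshold)
        if hit is not None:
            return db, hit
    return None


# Made-up hits for one protein, for illustration only.
example = {
    "afdb": [{"target": "afdb_x", "evalue": 1e-20}],
    "swissprot": [
        {"target": "sp_y", "evalue": 1e-5},
        {"target": "sp_z", "evalue": 0.5},
    ],
    "cath": [{"target": "cath_w", "evalue": 0.2}],
}
print(select_annotation(example))
```

In this example the Swiss-Prot hit is selected even though the AFDB hit has a lower E-value, because database priority is applied after the per-database threshold filter.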
All top hit accessions are available in baktfold.inference.tsv, while the detailed Foldseek statistics for each database are available in the respective _tophit.tsv files in baktfold’s output.
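For downstream analysis, these tab-separated files can be loaded with Python's standard csv module. The helper below is a generic sketch: the name read_tsv is hypothetical, it assumes each file starts with a header row, and it makes no assumption about the actual column names.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Example usage (the path depends on your output directory):
# rows = read_tsv("baktfold.inference.tsv")
```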