baktfold Output Files
Main Outputs
- .faa which holds all amino acid sequences of CDSs (not only hypotheticals) (like Bakta)
- .ffn which holds all nucleotide sequences of CDSs (like Bakta)
- .fna which holds the input genome FASTA (parsed from the JSON) (like Bakta)
- _3di.fasta which holds all Foldseek 3Di sequences of predicted CDSs as predicted by ProstT5
- .gbff, .gff3 and .embl are GenBank, GFF3 and EMBL format output files (like Bakta) with full genome annotations, combining the input annotations with Baktfold's additional annotations
- {prefix}.tsv is the same as Bakta's .tsv output, containing all full genome annotations, combining the input annotations with Baktfold's additional annotations
- .json which is a Bakta JSON format output file with all genome annotations, combining the input annotations with Baktfold's additional annotations
- *_tophit.tsv which have detailed alignment statistics for the Foldseek top hits against each of the four constituent Baktfold databases (Swiss-Prot, AFDB clusters, CATH and PDB)
  - For example:
query target bitscore fident evalue qStart qEnd qLen qCov tStart tEnd tLen tCov
MEGJMN_070 AF-A0A1I3V7E0-F1-model_v6 292 0.41 2.619e-06 1 91 93 0.97 1 95 99 0.95
  - There will also be a custom_database_tophit.tsv if you have used --custom-db
- baktfold.inference.tsv which contains the newly annotated function (Function column) and all annotated top hits (if they exist) for every protein that Baktfold attempted to annotate
  - Note: This will contain an extra CustomDB column if you have used --custom-db with a custom database
  - For example:
ID             Length  Product                                 Swissprot         AFDBClusters             PDB       CATH
MEGJMNBEGN_27  162     HTH-type quorum-sensing regulator RhlR  swissprot_P54292  afdbclusters_A0A9E1VSB0  pdb_5l09  cath_3sztB01
MEGJMNBEGN_30  68      hypothetical protein
MEGJMNBEGN_70  94      hypothetical protein                                      afdbclusters_A0A1I3V7E0
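
As a rough illustration of how this table might be consumed downstream, the sketch below reads it with Python's standard csv module and lists which proteins have at least one database hit, and which of those are still labelled as hypothetical. It assumes the file is tab-separated with a single header row and that missing hits are empty cells; the column names are taken from the example header above, and the sample rows are embedded in the script only so that it runs on its own. The helper name summarise_inference is hypothetical and not part of Baktfold.

```python
import csv
import io

# Sample rows copied from the example above. Assumption: the real file is
# tab-separated, has one header row, and leaves missing hits as empty cells.
SAMPLE = (
    "ID\tLength\tProduct\tSwissprot\tAFDBClusters\tPDB\tCATH\n"
    "MEGJMNBEGN_27\t162\tHTH-type quorum-sensing regulator RhlR\t"
    "swissprot_P54292\tafdbclusters_A0A9E1VSB0\tpdb_5l09\tcath_3sztB01\n"
    "MEGJMNBEGN_30\t68\thypothetical protein\t\t\t\t\n"
    "MEGJMNBEGN_70\t94\thypothetical protein\t\tafdbclusters_A0A1I3V7E0\t\t\n"
)

# Database columns as named in the example header. A CustomDB column, if
# present, is simply ignored by this sketch.
DB_COLUMNS = ("Swissprot", "AFDBClusters", "PDB", "CATH")


def summarise_inference(handle):
    """Return (IDs with at least one hit, IDs with a hit but still hypothetical)."""
    reader = csv.DictReader(handle, delimiter="\t")
    with_hit = []
    still_hypothetical = []
    for row in reader:
        # Short rows yield None for missing cells, hence the `or ""`.
        hits = [db for db in DB_COLUMNS if (row.get(db) or "").strip()]
        if not hits:
            continue
        with_hit.append(row["ID"])
        if row["Product"].strip() == "hypothetical protein":
            still_hypothetical.append(row["ID"])
    return with_hit, still_hypothetical


with_hit, still_hypothetical = summarise_inference(io.StringIO(SAMPLE))
print("with hit:", with_hit)
print("hit but still hypothetical:", still_hypothetical)
```

To run this against a real output, replace io.StringIO(SAMPLE) with an open file handle, for example open("baktfold.inference.tsv", newline="").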
- baktfold.summary.txt which contains basic summary statistics showing how many CDS Baktfold was able to annotate
  - CDS beginning hypotheticals is the number of hypotheticals parsed from the input JSON
  - CDS annotated with Baktfold database hit is the number of CDS that have at least 1 hit to a constituent Baktfold DB
  - CDS annotated with Baktfold function is the number of CDS whose Baktfold tophit has a non-hypothetical function transferred
  - CDS remaining hypotheticals is the number of CDS remaining hypothetical after running Baktfold
  - For example:
Annotation:
CDS count: 2635
CDS beginning hypotheticals: 55
CDS annotated with Baktfold database hit: 12
CDS annotated with Baktfold function: 7
CDS remaining hypotheticals: 48
Baktfold:
Software: v0.2.0
Database: v0.1.0
DOI: https://doi.org/10.64898/2026.03.31.715528
URL: github.com/gbouras13/baktfold
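
The counts in the summary can be pulled out with a few lines of Python. The sketch below assumes the file consists of "key: value" lines as in the example above, and keeps only the lines whose value is an integer. It also checks one relationship that holds for the example numbers (55 beginning hypotheticals minus 7 with a transferred function leaves 48 remaining); that relationship is inferred from the example rather than stated as a guarantee. The helper name parse_summary is hypothetical.

```python
# Sample text copied from the example above. Assumption: every statistic is
# written as "key: value" on its own line.
SAMPLE = """\
Annotation:
CDS count: 2635
CDS beginning hypotheticals: 55
CDS annotated with Baktfold database hit: 12
CDS annotated with Baktfold function: 7
CDS remaining hypotheticals: 48
Baktfold:
Software: v0.2.0
Database: v0.1.0
"""


def parse_summary(text):
    """Collect the integer-valued 'key: value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


stats = parse_summary(SAMPLE)

# Inferred from the example: remaining = beginning - annotated with function.
expected_remaining = (
    stats["CDS beginning hypotheticals"]
    - stats["CDS annotated with Baktfold function"]
)
consistent = expected_remaining == stats["CDS remaining hypotheticals"]
print(stats)
print("counts consistent:", consistent)
```

Section headers such as "Annotation:" and version strings such as "v0.2.0" are skipped because their values are not plain integers.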
Supplementary Outputs
- _prostT5_3di_mean_probabilities.csv - contains the mean ProstT5 probability score for each CDS. Each score can be read as the predicted similarity between the overall ProstT5 3Di sequence and its AlphaFold2 baseline
- _prostT5_3di_all_probabilities.json - contains the ProstT5 probabilities for each residue of each CDS, in JSON format
Optional Outputs
- _embeddings_per_protein.h5 - contains the ProstT5 embeddings for each protein in the h5 format
- _embeddings_per_residue.h5 - contains the ProstT5 embeddings for each residue in the h5 format