Features
Some code adapted from @mheinzinger
https://github.com/mheinzinger/ProstT5/blob/main/scripts/generate_foldseek_db.py
create_foldseek_prostt5_gpu_db(fasta_aa, foldseek_db_path, db_dir, logdir)
Create a Foldseek database from an amino-acid FASTA file, with ProstT5 3Di predictions generated by Foldseek-GPU.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_aa | Path | Path to the amino-acid FASTA file. | required |
| foldseek_db_path | Path | Path to the directory where Foldseek database will be stored. | required |
| db_dir | Path | Path to the baktfold DB | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/create_foldseek_db.py
foldseek_tsv2db(in_tsv, out_db_name, db_type, logdir)
Convert a Foldseek TSV file to a Foldseek database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_tsv | Path | Path to the input TSV file. | required |
| out_db_name | Path | Path for the output Foldseek database. | required |
| db_type | int | Type of the output database. | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/create_foldseek_db.py
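As a rough illustration of what a TSV-to-database conversion involves, the sketch below only assembles a `foldseek tsv2db` command line and never runs it. It is not baktfold's implementation: the helper name `build_tsv2db_command` is hypothetical, and passing `db_type` through `--output-dbtype` is an assumption based on the ProstT5 script this module is adapted from.

```python
from pathlib import Path
from typing import List


def build_tsv2db_command(in_tsv: Path, out_db_name: Path, db_type: int) -> List[str]:
    """Assemble (but do not execute) a `foldseek tsv2db` command line.

    Hypothetical helper: the real function may add logging or other options.
    """
    return [
        "foldseek",
        "tsv2db",
        str(in_tsv),
        str(out_db_name),
        "--output-dbtype",  # assumed way of passing db_type
        str(db_type),
    ]


cmd = build_tsv2db_command(Path("aa.tsv"), Path("prefix_db"), 0)
print(" ".join(cmd))
```

Returning a list of arguments rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting.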
generate_foldseek_db_from_aa_3di(fasta_aa, fasta_3di, foldseek_db_path, logdir, prefix)
Generate Foldseek database from amino-acid and 3Di sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_aa | Path | Path to the amino-acid FASTA file. | required |
| fasta_3di | Path | Path to the 3Di FASTA file. | required |
| foldseek_db_path | Path | Path to the directory where Foldseek database will be stored. | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
| prefix | str | Prefix for the Foldseek database. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/create_foldseek_db.py
generate_foldseek_db_from_structures(fasta_aa, foldseek_db_path, structure_dir, logdir, prefix, proteins_flag)
Generate a Foldseek database from structure files (.pdb or .cif).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_aa | Path | Path to the amino-acid FASTA file. | required |
| foldseek_db_path | Path | Path to the directory where Foldseek database will be stored. | required |
| structure_dir | Path | Path to the directory containing .pdb or .cif structure files. | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
| prefix | str | Prefix for the Foldseek database. | required |
| proteins_flag | bool | True if proteins-compare is run. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/create_foldseek_db.py
create_result_tsv(query_db, target_db, result_db, result_tsv, logdir, foldseek_gpu, structures, threads)
Create a TSV file containing the results of a Foldseek search.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_db | Path | Path to the query database. | required |
| target_db | Path | Path to the target database. | required |
| result_db | Path | Path to the result database generated by the search. | required |
| result_tsv | Path | Path to save the resulting TSV file. | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
| foldseek_gpu | bool | Run Foldseek-GPU with the accelerated ungapped prefilter. | required |
| structures | bool | Whether structures were input (not ProstT5). | required |
| threads | int | Number of threads to use. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/run_foldseek.py
run_foldseek_search(query_db, target_db, result_db, temp_db, threads, logdir, evalue, sensitivity, max_seqs, ultra_sensitive, extra_foldseek_params, foldseek_gpu, structures, gpus=None)
Run a Foldseek search using given parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_db | Path | Path to the query database. | required |
| target_db | Path | Path to the target database. | required |
| result_db | Path | Path to store the result database. | required |
| temp_db | Path | Path to store temporary files. | required |
| threads | int | Number of threads to use for the search. | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
| evalue | float | E-value threshold for the search. | required |
| sensitivity | float | Sensitivity threshold for the search. | required |
| max_seqs | int | Maximum results per query sequence allowed to pass the Foldseek prefilter. | required |
| ultra_sensitive | bool | Whether to skip the Foldseek prefilter for maximum sensitivity. | required |
| extra_foldseek_params | str | Extra Foldseek search parameters. | required |
| foldseek_gpu | bool | Run Foldseek-GPU with the accelerated ungapped prefilter. | required |
| structures | bool | Run Foldseek with structures, not ProstT5 3Dis. | required |
| gpus | Optional[str] | Comma-separated CUDA indices (e.g. "0,2") to restrict foldseek's GPU prefilter to a subset of devices. When | None |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/run_foldseek.py
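The `gpus` parameter restricts the GPU prefilter to a subset of devices. One common way to achieve that for a child process is the standard `CUDA_VISIBLE_DEVICES` environment variable; whether baktfold does exactly this is an assumption, and the helper name `foldseek_env` is hypothetical. Since the parameter description above is cut off, the behaviour for `gpus=None` is also assumed here: the environment is left unchanged.

```python
import os
from typing import Dict, Optional


def foldseek_env(gpus: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the environment, optionally limited to the given CUDA devices."""
    env = dict(os.environ)
    if gpus is not None:
        # Normalise "0, 2" to "0,2" and reject anything that is not a plain index.
        indices = [part.strip() for part in gpus.split(",") if part.strip()]
        if not indices or not all(index.isdigit() for index in indices):
            raise ValueError(f"invalid gpus value: {gpus!r}")
        env["CUDA_VISIBLE_DEVICES"] = ",".join(indices)
    return env


restricted = foldseek_env("0, 2")
print(restricted["CUDA_VISIBLE_DEVICES"])
```

The returned mapping can be passed as `env=` to `subprocess.run`, so the restriction applies only to the search process and not to the calling Python process.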
summarise_hits(result_db, result_db_greedy_best_hits, logdir, threads)
Get all non-overlapping top hits covering a query (designed for CATH).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result_db | Path | Path to the result database generated by the search. | required |
| result_db_greedy_best_hits | Path | Path to save the greedy best hits results db. | required |
| logdir | Path | Path to the directory where logs will be stored. | required |
| threads | int | Number of threads to use. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/baktfold/features/run_foldseek.py
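To make "greedy non-overlapping top hits" concrete, here is a minimal pure-Python sketch of the general idea: take hits in order of decreasing score and keep each one only if its query range does not overlap a hit that was already kept. This illustrates the concept only; it is not the Foldseek module that `summarise_hits` invokes, and the `Hit` tuple and its fields are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    target: str
    qstart: int  # query start, 1-based inclusive
    qend: int  # query end, 1-based inclusive
    bitscore: float


def greedy_non_overlapping(hits: List[Hit]) -> List[Hit]:
    """Keep the best-scoring hits whose query ranges do not overlap."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.bitscore, reverse=True):
        # A hit is accepted only if it lies entirely before or after every kept hit.
        if all(hit.qend < k.qstart or hit.qstart > k.qend for k in kept):
            kept.append(hit)
    # Report the kept hits in query order.
    return sorted(kept, key=lambda h: h.qstart)


hits = [
    Hit("domA", 1, 100, 250.0),
    Hit("domB", 80, 180, 200.0),
    Hit("domC", 120, 220, 150.0),
]
selected = greedy_non_overlapping(hits)
print([h.target for h in selected])
```

In this example `domB` is dropped because it overlaps the higher-scoring `domA`, while `domC` is kept, which is the behaviour wanted when a multi-domain query should receive one annotation per region.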
3Di prediction for baktfold — wraps pholdlib's shared inference engine.
Baktfold-specific: flat cds_dict (no contig nesting), Bakta hypotheticals format with in-place annotation updates, has_duplicate_locus support.
Code adapted from @mheinzinger https://github.com/mheinzinger/ProstT5/blob/main/scripts/predict_3Di_encoderOnly.py
get_embeddings(hypotheticals, cds_dict, out_path, prefix, model_dir, model_name, checkpoint_path, output_3di, output_h5_per_residue, output_h5_per_protein, half_precision, max_residues=100000, max_seq_len=30000, max_batch=10000, cpu=False, output_probs=True, save_per_residue_embeddings=False, save_per_protein_embeddings=False, threads=1, mask_threshold=0, has_duplicate_locus=False, gpus=None)
Run ProstT5 + CNN 3Di prediction for all sequences in cds_dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hypotheticals | List[Dict] | List of Bakta feature dicts (mutated in-place with "3di"). | required |
| cds_dict | Dict[str, str] | Flat | required |
| out_path | Path | Directory for output files. | required |
| prefix | str | Filename prefix for CSV / JSONL outputs. | required |
| model_dir | Path | Directory where ProstT5 is cached. | required |
| model_name | str | HuggingFace model identifier. | required |
| checkpoint_path | Path | Path to the CNN | required |
| output_3di | Path | Output FASTA path for 3Di sequences. | required |
| output_h5_per_residue | Path | HDF5 path for per-residue embeddings. | required |
| output_h5_per_protein | Path | HDF5 path for per-protein embeddings. | required |
| half_precision | bool | If True, cast model + predictor to fp16 after loading. | required |
| max_residues | int | Max total residues per inference batch. | 100000 |
| max_seq_len | int | Sequences longer than this flush a batch immediately. | 30000 |
| max_batch | int | Max sequences per batch. | 10000 |
| cpu | bool | Force CPU inference. | False |
| output_probs | bool | Whether to write per-residue probability JSONL. | True |
| save_per_residue_embeddings | bool | Save per-residue HDF5. | False |
| save_per_protein_embeddings | bool | Save per-protein HDF5. | False |
| threads | int | Number of CPU threads for torch. | 1 |
| mask_threshold | float | Residues with max softmax prob < threshold/100 → 'X'. | 0 |
| has_duplicate_locus | bool | If True use feat["id"] rather than feat["locus"]. | False |
| gpus | Optional[str] | Comma-separated CUDA indices (e.g. "0,2"). None = auto-detect all visible CUDA GPUs. Overridden by | None |
Returns:
| Name | Type | Description |
|---|---|---|
| predictions | Dict | Flat |
Source code in src/baktfold/features/predict_3Di.py
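The three batching parameters (`max_residues`, `max_seq_len`, `max_batch`) interact, so a small sketch may help. The generator below is a simplified model of that logic, not the shared inference engine itself: the name `iter_batches` is hypothetical, and sorting sequences by decreasing length is an assumption borrowed from the upstream ProstT5 script.

```python
from typing import Dict, Iterator, List, Tuple


def iter_batches(
    cds_dict: Dict[str, str],
    max_residues: int = 100000,
    max_seq_len: int = 30000,
    max_batch: int = 10000,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches of (id, sequence) pairs under the three batching limits."""
    batch: List[Tuple[str, str]] = []
    n_residues = 0
    # Assumption: longest sequences first, as in the upstream ProstT5 script.
    ordered = sorted(cds_dict.items(), key=lambda kv: len(kv[1]), reverse=True)
    for seq_id, seq in ordered:
        batch.append((seq_id, seq))
        n_residues += len(seq)
        # Flush when any limit is hit; an over-long sequence flushes immediately.
        if (
            len(batch) >= max_batch
            or n_residues >= max_residues
            or len(seq) > max_seq_len
        ):
            yield batch
            batch, n_residues = [], 0
    if batch:
        yield batch  # leftover sequences that never reached a limit


toy = {"a": "M" * 5, "b": "M" * 3, "c": "M" * 2}
batches = list(iter_batches(toy, max_residues=6, max_seq_len=4, max_batch=10))
print([[seq_id for seq_id, _ in b] for b in batches])
```

With these toy limits, sequence `a` exceeds `max_seq_len` and is flushed on its own, while `b` and `c` share the final batch.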
write_embeddings(embeddings, out_path)
Write per-residue or per-protein embeddings to HDF5 (flat key structure).
Source code in src/baktfold/features/predict_3Di.py
write_predictions(hypotheticals, predictions, out_path, mask_threshold, has_duplicate_locus=False)
Write 3Di predictions to FASTA and update Bakta hypotheticals in-place.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hypotheticals | List[Dict] | List of Bakta feature dicts. Each is mutated in-place with a | required |
| predictions | Dict[str, Tuple] | Flat | required |
| out_path | Path | Output FASTA path. | required |
| mask_threshold | float | Residues with max softmax prob (0–100) below this threshold are replaced with 'X'. | required |
| has_duplicate_locus | bool | If True, use | False |
Source code in src/baktfold/features/predict_3Di.py
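The masking rule and the in-place update can be sketched in a few lines. This is an illustration under stated assumptions, not the actual `write_predictions` code: the helper names `mask_3di` and `annotate_in_place` are hypothetical, per-residue probabilities are assumed to be on a 0 to 1 scale and compared against `mask_threshold / 100`, and the feature key is assumed to be `"id"` or `"locus"` depending on `has_duplicate_locus`.

```python
from typing import Dict, List, Sequence


def mask_3di(seq_3di: str, max_probs: Sequence[float], mask_threshold: float) -> str:
    """Replace 3Di states whose max softmax probability is below the threshold with 'X'."""
    if len(seq_3di) != len(max_probs):
        raise ValueError("sequence and probabilities must have the same length")
    cutoff = mask_threshold / 100.0  # threshold is given on a 0-100 scale
    return "".join("X" if p < cutoff else s for s, p in zip(seq_3di, max_probs))


def annotate_in_place(
    hypotheticals: List[Dict],
    masked: Dict[str, str],
    has_duplicate_locus: bool = False,
) -> None:
    """Attach a "3di" entry to each feature dict that has a prediction."""
    key = "id" if has_duplicate_locus else "locus"
    for feat in hypotheticals:
        seq = masked.get(feat[key])
        if seq is not None:
            feat["3di"] = seq  # mutates the caller's dict


features = [{"id": "f1", "locus": "L_0001"}, {"id": "f2", "locus": "L_0002"}]
masked = {"L_0001": mask_3di("dvlq", [0.9, 0.4, 0.55, 0.2], mask_threshold=50)}
annotate_in_place(features, masked)
print(features[0]["3di"])
```

With the default `mask_threshold` of 0 the cutoff is 0, so no residue is masked.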
autotune_batching_real_data(model_dir, model_name, cpu, threads, probe_seqs, start_bs=1, max_bs=100, step=5, device=None)
Autotunes the batch size for a given model and set of sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_dir | str | The directory where the model is stored. | required |
| model_name | str | The name of the model. | required |
| cpu | bool | Whether to use the CPU or not. | required |
| threads | int | The number of threads to use. | required |
| probe_seqs | list | A list of sequences to use for probing. | required |
| start_bs | int | The starting batch size to use. | 1 |
| max_bs | int | The maximum batch size to use. | 100 |
| step | int | The step size to use when increasing the batch size. | 5 |
| device | Optional[str] | Torch device string (e.g. "cuda:1") to pin autotune to a specific GPU. None preserves the original auto-detection behaviour. Used by the multi-GPU caller. | None |
Returns:
| Type | Description |
|---|---|
| int | The optimal batch size. |
| int | The maximum number of residues per batch. |
Examples:
>>> autotune_batching_real_data("model_dir", "model_name", True, 4, ["ATCG", "GCTA"], 1, 100, 5)
(10, 100)
Source code in src/baktfold/features/autotune.py
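The role of `start_bs`, `max_bs` and `step` can be pictured with a toy search loop. How the real function decides what is optimal (for example by throughput or by memory headroom) is not documented on this page, so the sketch below makes a simplifying assumption: it returns the largest batch size that a probe callable accepts. The names `autotune_sketch` and `try_batch` are hypothetical, and `MemoryError` stands in for an out-of-memory failure on the device.

```python
from typing import Callable, List, Tuple


def autotune_sketch(
    probe_seqs: List[str],
    try_batch: Callable[[List[str]], None],
    start_bs: int = 1,
    max_bs: int = 100,
    step: int = 5,
) -> Tuple[int, int]:
    """Return (batch size, residues in that batch) for the largest batch that succeeds."""
    best_bs, best_residues = 0, 0
    bs = start_bs
    while bs <= max_bs:
        batch = probe_seqs[:bs]
        try:
            try_batch(batch)  # stand-in for one forward pass on the batch
        except MemoryError:
            break  # this batch size does not fit; keep the previous best
        best_bs = len(batch)
        best_residues = sum(len(s) for s in batch)
        if len(batch) < bs:
            break  # ran out of probe sequences
        bs += step
    return best_bs, best_residues


def fake_probe(batch: List[str]) -> None:
    """Pretend the device can hold at most 150 residues per batch."""
    if sum(len(s) for s in batch) > 150:
        raise MemoryError("simulated out-of-memory")


result = autotune_sketch(["M" * 10] * 20, fake_probe)
print(result)
```

Here batch sizes 1, 6 and 11 succeed and 16 fails, so the sketch settles on 11 sequences and 110 residues.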
run_autotune(input_path, model_dir, model_name, cpu, threads, step, min_batch, max_batch, sample_seqs, gpus=None)
Runs the batch size autotuning process.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_path | str | The path to the input file. | required |
| model_dir | str | The directory where the model is stored. | required |
| model_name | str | The name of the model. | required |
| cpu | bool | Whether to use the CPU or not. | required |
| threads | int | The number of threads to use. | required |
| step | int | The step size to use when increasing the batch size. | required |
| min_batch | int | The minimum batch size to use. | required |
| max_batch | int | The maximum batch size to use. | required |
| sample_seqs | int | The number of sequences to sample for probing. | required |
| gpus | Optional[str] | Comma-separated CUDA indices (e.g. "0,2"). When set, autotune runs on the lowest selected index. Default None = existing behaviour (cuda:0 / mps / xpu / cpu auto-detect). | None |
Returns:
| Type | Description |
|---|---|
| int | The optimal batch size. |
Source code in src/baktfold/features/autotune.py
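The rule that autotune "runs on the lowest selected index" maps naturally onto the `device` string accepted by `autotune_batching_real_data`. A minimal sketch of that mapping follows; the helper name `autotune_device` is hypothetical and the real code may validate its input differently.

```python
from typing import Optional


def autotune_device(gpus: Optional[str]) -> Optional[str]:
    """Translate a comma-separated GPU list into a torch device string."""
    if gpus is None:
        return None  # leave device auto-detection to the callee
    indices = [int(part) for part in gpus.split(",") if part.strip()]
    if not indices:
        raise ValueError(f"no GPU indices found in {gpus!r}")
    return f"cuda:{min(indices)}"


print(autotune_device("2,0"))
```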
sample_probe_sequences(seqs, n=5000, seed=0)
Samples probe sequences from `seqs` (signature defaults: `n=5000`, `seed=0`).
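Only the signature of this function is documented, so the following is a guess at the intent rather than the actual implementation: draw at most `n` sequences reproducibly, using a random generator seeded with `seed`. The name `sample_probe_sequences_sketch` is hypothetical.

```python
import random
from typing import List, Sequence


def sample_probe_sequences_sketch(
    seqs: Sequence[str], n: int = 5000, seed: int = 0
) -> List[str]:
    """Return at most n sequences, sampled without replacement and reproducibly."""
    pool = list(seqs)
    if len(pool) <= n:
        return pool  # nothing to sample: use every sequence
    # A private Random instance avoids disturbing the global random state.
    return random.Random(seed).sample(pool, n)


probe = sample_probe_sequences_sketch([f"SEQ{i}" for i in range(100)], n=10, seed=0)
print(len(probe))
```

Fixing the seed means repeated autotune runs probe the same subset, which keeps the chosen batch size comparable between runs.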