Machine learning methods that predict protein fitness from sequence remain sensitive to changes in data distributions, limiting generalization across conditions commonly encountered in protein engineering. In practice, protein engineers are therefore left uncertain about the real-world utility of ML tools. The FLIP benchmark established protocols for testing generalization under some domain shifts, but it was limited to measurements of thermostability, binding, and viral capsid viability.
We introduce FLIP2, a protein fitness benchmark spanning seven new datasets, including enzymes, protein-protein interactions, and light-sensitive proteins, as well as splits that measure generalization relevant to real-world protein engineering campaigns. Evaluating a suite of baseline models across these datasets and splits reveals that simpler models often match or outperform fine-tuned protein language models on FLIP2, challenging the utility of existing transfer learning techniques. Provenance for all datasets has been recorded, and we redistribute all data under permissive licenses (CC-BY 4.0 or MIT) to facilitate continued progress.
FLIP2 significantly expands the original FLIP benchmark with new datasets that better represent the diversity of protein engineering applications. While FLIP focused on thermostability, binding, and viral capsid viability, FLIP2 introduces datasets covering enzymatic activity, protein-protein interactions, and optogenetics applications.
The benchmark includes 16 distinct train-test splits across 5 categories that simulate realistic protein engineering scenarios: generalization to more mutations (number splits), to mutations at different sequence positions (position splits), to unseen mutations (mutation splits), to higher fitness variants (fitness splits), and to different wild-type proteins (wild-type splits).
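For instance, a number split such as one-to-many can be constructed directly from mutation counts. A minimal sketch, assuming equal-length variant and wild-type sequences (the helper names and toy sequences below are illustrative, not part of the benchmark code):

```python
# Sketch: derive a one-to-many "number" split from mutation counts.
# Assumes variants and the wild type have equal length; names are illustrative.

def count_mutations(variant: str, wild_type: str) -> int:
    """Hamming distance between a variant and the wild-type sequence."""
    return sum(a != b for a, b in zip(variant, wild_type))

def one_to_many_split(variants, wild_type):
    """Train on the wild type and single mutants, test on 2+ mutations."""
    train = [v for v in variants if count_mutations(v, wild_type) <= 1]
    test = [v for v in variants if count_mutations(v, wild_type) > 1]
    return train, test

wt = "MKTAYIA"
variants = ["MKTAYIA", "MKTAYIV", "MKTGYIV"]
train, test = one_to_many_split(variants, wt)
```

The other split categories follow the same pattern, replacing the mutation-count criterion with position, mutation identity, fitness threshold, or wild-type identity.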
FLIP2 includes seven datasets covering diverse protein functions and engineering challenges. Each dataset contains experimentally measured fitness values for hundreds to hundreds of thousands of protein variants.
Alpha amylases catalyze the breakdown of starches and are used in detergents to remove starch stains. This dataset studies stain removal activity for variants of Bacillus subtilis alpha amylase, including variants with up to eight mutations.
Source: Van der Flier et al., Computational and Structural Biotechnology Journal (2024)
License: MIT
Imine reductases reduce imines to amines and are employed in pharmaceutical production. This dataset contains activity measurements from a microfluidic screen of Streptosporangium roseum imine reductase variants from an error-prone PCR library, including variants with up to 15 mutations.
Source: Gantz et al., Microdroplet screening study (2024)
License: CC-BY 4.0
Endonucleases such as Bacillus licheniformis Nuclease B (NucB) degrade DNA and have potential applications in chronic wound care by degrading the extracellular DNA required for biofilm formation. Wild-type NucB activity drops to around 80% of its maximum at physiological pH. This dataset measures nuclease activity at pH 7 for variants from an error-prone PCR library. To mitigate the effect of assay noise, measurements are binned into four activity levels.
Source: Thomas et al., Engineering nuclease design (2025)
License: CC-BY 4.0
The β-subunit of tryptophan synthase (TrpB) synthesizes tryptophan from indole and serine and is essential for cell growth. This dataset contains growth-based fitness measurements for combinatorially complete landscapes across multiple sets of interacting residues, comprising ten different sub-landscapes across 20 positions.
Source: Johnston et al., Combinatorial fitness landscapes (2024)
License: CC-BY 4.0
The hydrophobic core of a protein is crucial for its function and stability. This dataset contains stability measurements for variants of three proteins (UniProt entries P06241, P01053, P0A9X9) where seven core residues in each protein were randomized to hydrophobic amino acids (phenylalanine, isoleucine, leucine, methionine, and valine).
Source: Escobedo et al., Hydrophobic core evolution (2024)
License: CC-BY 4.0
Rhodopsins are light-activated membrane proteins with applications in optogenetics. This dataset contains peak absorption wavelength measurements for variants and chimeras derived from 75 microbial rhodopsin sequences. Note that 41 wild types have no variants, while the remainder have between 1 and 181 variants, including sequences with between 1 and 6 mutations and chimeras.
Source: Karasuyama et al., Inoue et al., Sela et al. (2018-2024)
License: CC-BY 4.0
PDZ domains can bind to short linear motifs in intrinsically disordered regions (IDRs). This dataset measures binding affinity between mutant PDZ3 domains and mutant CRIPT peptides. From over 200,000 assayed double mutation pairs, the test set is filtered to 579 sequence pairs exhibiting significant non-additive binding effects (epistasis), where observed affinity significantly exceeds predictions from a simple additive model.
Source: Zarin & Lehner, Complete combinatorial screen (2024)
License: CC-BY 4.0
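The additive null model behind the epistasis filter can be made concrete: the expected effect of a double mutant is the sum of the two single-mutant effects, and epistasis is the deviation from that sum. A toy sketch with illustrative values (not taken from the dataset):

```python
# Sketch of an additive null model for double mutants (illustrative values).
# Epistasis = observed double-mutant effect minus the sum of single effects.

def additive_expectation(effect_a: float, effect_b: float) -> float:
    return effect_a + effect_b

def epistasis(observed_double: float, effect_a: float, effect_b: float) -> float:
    return observed_double - additive_expectation(effect_a, effect_b)

# Two single-mutant binding effects and an observed double mutant:
e = epistasis(observed_double=-0.1, effect_a=-0.5, effect_b=-0.3)
# e > 0: the pair binds better than the additive model predicts.
```

The test set keeps only pairs where this deviation is statistically significant, so models must capture interactions between positions rather than independent per-mutation effects.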
FLIP2 includes five categories of splits that test different types of generalization relevant to protein engineering:
Train on variants with fewer mutations and test on variants with more mutations. Tests ability to extrapolate to larger numbers of mutations where data becomes sparse.
Examples: one-to-many, two-to-many, three-to-many, single-to-double
Train and test on variants with mutations in different positions in the sequence. Tests ability to generalize to previously unperturbed positions targeted in subsequent engineering rounds.
Examples: close-to-far, far-to-close, by-position
Note: Position splits are among the most challenging in FLIP2.
Train and test on variants with different unique mutations. Mutations to different amino acids at the same position may be split across train and test sets.
Examples: by-mutation
Train on variants with lower fitness and test on variants with higher fitness. Simulates optimization campaigns where later variants have higher fitness.
Examples: low-to-high
Note: Fitness splits are among the most challenging in FLIP2.
Train and test on variants with different wild-type sequences. Tests ability to transfer mutational effects across homologous proteins, critical when limited data exists per wild-type.
Examples: by-wild-type, to-P06241, to-P0A9X9, to-P01053
Note: Wild-type splits are among the most challenging in FLIP2.
Evaluation on FLIP2 shows that wild-type, position, and fitness splits are much more challenging than number and mutation splits. This indicates that while models can generalize well to more mutations or different mutations at the same position, they struggle to generalize to new positions or new protein backbones.
We evaluated zero-shot protein language model likelihood scores, ridge regression baselines, and fine-tuned pLMs on all FLIP2 splits. Performance is measured using Spearman rank correlation and normalized discounted cumulative gain (NDCG) on held-out test sets. These baselines confirmed that FLIP2 splits are more challenging than random splits with the same number of training examples.
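Both metrics are available off the shelf; a minimal sketch using SciPy and scikit-learn with toy predictions (not benchmark results). Note that `ndcg_score` expects non-negative relevance values, so fitness targets may need shifting before use:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import ndcg_score

# Toy held-out targets and model predictions (illustrative values only)
y_true = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = np.array([0.05, 0.5, 0.3, 0.9])

# Spearman rank correlation: invariant to monotone rescaling of predictions
rho, _ = spearmanr(y_true, y_pred)

# NDCG: rewards ranking the highest-fitness variants near the top;
# expects 2D arrays (one row per query) of non-negative relevance scores
ndcg = ndcg_score(y_true[None, :], y_pred[None, :])
```

Spearman measures overall ranking quality, while NDCG emphasizes correct ordering among the top variants, which is what matters when selecting candidates for the next engineering round.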
FLIP2 datasets are provided in two formats: CSV for regression/classification tasks and FASTA for sequence-based tasks. All data files are gzip-compressed (.csv.gz and .fasta.gz) for efficient storage and transfer.
CSV files (*.csv.gz) contain four columns:
- `sequence`: Amino acid sequence of the variant
- `target`: Measured fitness value (continuous or binned)
- `set`: Data split assignment ("train", "test", or "validation")
- `validation`: Boolean indicating if the sequence is in the validation set

Example:

```
sequence,target,set,validation
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPIL,0.5123,train,False
MKTAYIAKQRQISFVKSHFSRQLEERLGLIKVQAPIL,-0.2341,test,False
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPIL,0.6789,train,True
```
FASTA files (*.fasta.gz) include metadata in the header line:
```
>[ID] TARGET=[value] SET=[train/test/validation] VALIDATION=[True/False]
```

Example:

```
>Q5LL55 TARGET=0.0 SET=train VALIDATION=False
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPIL
>H9L4N9 TARGET=1.0 SET=test VALIDATION=False
MKTAYIAKQRQISFVKSHFSRQLEERLGLIKVQAPIL
```
Example Python code to load FLIP2 data (gzipped files):
```python
import gzip

import pandas as pd
from Bio import SeqIO

# Load gzipped CSV format (pandas handles .gz automatically)
df = pd.read_csv('assets/splits/amylase/one_to_many.csv.gz')

# Split into train/test/validation
train_df = df[df['set'] == 'train']
test_df = df[df['set'] == 'test']
validation_df = df[df['validation'] == True]

# Alternative: load FASTA format
with gzip.open('assets/splits/bind/one_vs_many.fasta.gz', 'rt') as f:
    for record in SeqIO.parse(f, 'fasta'):
        # Parse TARGET/SET/VALIDATION metadata from the header
        metadata = dict(item.split('=') for item in record.description.split()[1:])
        sequence = str(record.seq)
        target = float(metadata['TARGET'])
        split = metadata['SET']
```
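If Biopython is unavailable, the same FASTA-with-metadata format can be parsed with a small pure-Python routine. A sketch (the two records below are illustrative, not real benchmark entries):

```python
import io

# Illustrative FLIP2-style FASTA content (not real benchmark entries)
fasta_text = """>Q5LL55 TARGET=0.0 SET=train VALIDATION=False
MKTAYIAKQ
>H9L4N9 TARGET=1.0 SET=test VALIDATION=False
MKTAYIAKV
"""

def parse_flip2_fasta(handle):
    """Yield one dict per record with id, sequence, and parsed metadata."""
    header, seq_parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, "".join(seq_parts))
            header, seq_parts = line[1:], []
        elif line:
            seq_parts.append(line)
    if header is not None:
        yield _make_record(header, "".join(seq_parts))

def _make_record(header, sequence):
    # Header format: >[ID] TARGET=[value] SET=[split] VALIDATION=[bool]
    fields = header.split()
    meta = dict(item.split("=") for item in fields[1:])
    return {
        "id": fields[0],
        "sequence": sequence,
        "target": float(meta["TARGET"]),
        "set": meta["SET"],
        "validation": meta["VALIDATION"] == "True",
    }

records = list(parse_flip2_fasta(io.StringIO(fasta_text)))
```

In practice, pass a handle from `gzip.open(path, 'rt')` instead of the `io.StringIO` used here for illustration.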
All FLIP2 datasets and splits are available for download. All data files are gzip-compressed for efficient transfer. Each dataset includes a README with complete provenance and attribution information. All data is licensed under CC-BY 4.0 or MIT (see individual README files).
| Dataset | Split | Format | Download |
|---|---|---|---|
| Alpha Amylase | one-to-many | CSV.GZ | Download |
| Alpha Amylase | close-to-far | CSV.GZ | Download |
| Alpha Amylase | far-to-close | CSV.GZ | Download |
| Alpha Amylase | by-mutation | CSV.GZ | Download |
| Imine Reductase | two-to-many | CSV.GZ | Download |
| Nuclease B | two-to-many | CSV.GZ | Download |
| TrpB | one-to-many | CSV.GZ | Download |
| TrpB | two-to-many | CSV.GZ | Download |
| TrpB | by-position | CSV.GZ | Download |
| Hydrophobic Core | three-to-many | CSV.GZ | Download |
| Hydrophobic Core | low-to-high | CSV.GZ | Download |
| Hydrophobic Core | to-P06241 | CSV.GZ | Download |
| Hydrophobic Core | to-P0A9X9 | CSV.GZ | Download |
| Hydrophobic Core | to-P01053 | CSV.GZ | Download |
| Rhodopsin | by-wild-type | CSV.GZ | Download |
| PDZ3 | single-to-double | CSV.GZ | Download |
The original FLIP benchmark (2021) included datasets for AAV capsid viability, GB1 domain stability/binding, and thermostability across multiple protein families. These datasets remain available for continuity and comparison.
Note: These legacy datasets are not part of the FLIP2 benchmark but are included for researchers who wish to compare with the original FLIP results or use them for other purposes.
If you use FLIP2 in your research, please cite:
@article{didi2025flip2,
title={FLIP2: Expanding Protein Fitness Landscape Benchmarks for Real-World Machine Learning Applications},
author={Didi, Kieran and Alamdari, Sarah and Lu, Alex X. and Wittmann, Bruce and Johnston, Kadina E. and Amini, Ava A. and Madani, Ali and Czeneszew, Maya and Dallago, Christian and Yang, Kevin K.},
journal={},
year={2026}
}
Please also cite the original FLIP paper:
@article{dallago2021flip,
title={FLIP: Benchmark tasks in fitness landscape inference for proteins},
author={Dallago, Christian and Mou, Jody and Johnston, Kadina E and Wittmann, Bruce J and Bhattacharya, Nicholas and Goldman, Samuel and Madani, Ali and Yang, Kevin K},
journal={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2021}
}
Individual Dataset Citations: Each dataset has specific attribution requirements. Please see the README file in each dataset directory for proper citations and licenses.