NRPS Biosynthesis Toolkit
Bacteria build antibiotics and other molecules using enzyme assembly lines — chains of modules that each add one building block. Scientists want to swap modules between assembly lines to create new molecules. This project figured out what determines whether a swap works, and how to predict it before going to the lab.
- 420 engineered variants, from 5 published studies
- p = 2.8×10⁻²³ — significance of the module-pair interaction
- AUC 0.712 — area under the ROC curve; ranks chimeras by predicted success
- 20-page manuscript — 6 figures, 26 references
The research question
NRPS — nonribosomal peptide synthetases — are molecular assembly lines. Bacteria use them to build peptides without ribosomes: each module in the chain activates one amino acid, bonds it to the growing chain, and passes it downstream. Because the architecture is modular in principle, scientists try to swap parts between assembly lines to make new molecules. These hybrids are called chimeric constructs.
The question: what determines whether a chimeric swap actually works?
The field assumed sequence similarity. More similar domains → better chance of a functional chimera. That assumption is understandable — it's the default hypothesis for any protein engineering problem. But it's wrong, or at least incomplete. What actually matters is interface compatibility — specifically, the interaction between elongation and termination modules.
How we found it
The approach was statistical, not computational chemistry. We didn't simulate protein structures or run molecular dynamics. Instead, we treated each chimeric variant as a categorical experiment: which modules were swapped, from which source organism, in which configuration — and did it produce a functional molecule?
The analysis used generalized linear models (a standard statistical framework) to test every pairwise interaction between module positions. The key test: does knowing which pair of modules you used predict success better than knowing each module individually? If so, the interaction matters — the modules aren't independent, and you can't evaluate them in isolation.
No machine learning. No neural networks. Three categorical features that a bench scientist can evaluate by looking at a construct's design. The signal was in the data the whole time — it just hadn't been tested this way.
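The pairwise test described above can be sketched as a likelihood-ratio comparison between two logistic models: main effects only versus main effects plus the module-pair interaction. Everything below is illustrative — the module labels (P plus a made-up partner S), the per-pair success probabilities, and the counts are invented to reproduce the rank-reversal pattern, not taken from the real dataset.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Hypothetical per-pair success probabilities encoding a rank reversal:
# elongation P fails with terminations Q/R but succeeds with U, and vice versa.
p_cell = {("P", "Q"): 0.1, ("P", "R"): 0.1, ("P", "U"): 0.9,
          ("S", "Q"): 0.9, ("S", "R"): 0.9, ("S", "U"): 0.1}

rows = []
for (e, t), p in p_cell.items():
    for _ in range(40):                      # 40 simulated constructs per pair
        rows.append((e, t, rng.random() < p))
elong = np.array([r[0] for r in rows])
term = np.array([r[1] for r in rows])
y = np.array([r[2] for r in rows], dtype=float)

def dummies(labels, levels):
    # One-hot encode a categorical feature, dropping the first level as reference.
    return np.column_stack([(labels == l).astype(float) for l in levels[1:]])

Xe, Xt = dummies(elong, ["P", "S"]), dummies(term, ["Q", "R", "U"])
intercept = np.ones((len(y), 1))
X_add = np.hstack([intercept, Xe, Xt])       # main effects only
inter = np.hstack([Xe[:, [i]] * Xt[:, [j]]
                   for i in range(Xe.shape[1]) for j in range(Xt.shape[1])])
X_int = np.hstack([X_add, inter])            # + pairwise interaction terms

def loglik(X, y):
    # Maximized binomial log-likelihood of a logistic model (stable via logaddexp).
    nll = lambda b: np.sum(np.logaddexp(0.0, X @ b)) - y @ (X @ b)
    return -minimize(nll, np.zeros(X.shape[1]), method="BFGS").fun

lr = 2.0 * (loglik(X_int, y) - loglik(X_add, y))   # likelihood-ratio statistic
df = X_int.shape[1] - X_add.shape[1]               # extra interaction parameters
p_val = chi2.sf(lr, df)
print(f"LR chi2 = {lr:.1f}, df = {df}, p = {p_val:.2e}")
```

Because the interaction model nests the additive one, a large likelihood-ratio statistic means the pair identity predicts success beyond what the individual modules explain — the same logic as the paper's χ² interaction test.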
Primary discovery
The elongation × termination module interaction dominates chimeric NRPS success
This is the paper's headline finding. The human collaborators hadn't identified it.
44.8% of outcome variation explained: χ² = 200.7 (a measure of statistical signal strength), p = 2.8 × 10⁻²³. That's 3.3× larger than the next-strongest interaction between module positions. Not a marginal difference — a qualitative one.
Rank reversals — strong non-additivity. Elongation module P: 0% success with termination modules Q/R, but 100% with U. You can't predict chimera success from the parts — you need the interaction.
3-way interactions are not significant (p = 1.0). This matters: it means pairwise interactions capture all the systematic signal. The biology is combinatorial but not intractably so — you don't need to test every three-way combination, only every two-way one.
Secondary discovery
Architecture compatibility — and where the signal actually lives
Using Gonschorek's split-intein data (n = 324), where the docking domains are completely replaced by split inteins:
- OR = 13.84 (95% CI: 4.73–40.43) — architecture interaction odds ratio, p = 1.6 × 10⁻⁶
- 75.0% — success rate when donor and acceptor architectures match
- 30.6% — success rate when they don't
Here's why this matters: split inteins completely replace docking domains (COM domains). If the compatibility signal were in the protein-protein interaction domains, it would vanish when you swap them out. It doesn't vanish. It gets stronger. That means the compatibility signal lives in the catalytic domain interfaces themselves.
The paper's most important negative result.
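For readers unfamiliar with odds ratios, here is a minimal sketch of the standard 2×2 computation with a Wald confidence interval. The counts below are hypothetical, chosen only to roughly match the reported 75.0% / 30.6% success rates; the paper's OR = 13.84 is presumably model-adjusted, so the raw OR from a 2×2 table will differ.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2x2 counts (the actual n = 324 split by architecture is not
# given here): rows = architecture match/mismatch, cols = success/failure.
a, b = 90, 30     # matching architecture: successes, failures (75.0% success)
c, d = 62, 141    # mismatching architecture: successes, failures (~30.5%)

odds_ratio = (a * d) / (b * c)
log_se = np.sqrt(1/a + 1/b + 1/c + 1/d)     # Wald standard error of log(OR)
z = norm.ppf(0.975)                         # 1.96 for a 95% interval
lo, hi = np.exp(np.log(odds_ratio) + np.array([-1.0, 1.0]) * z * log_se)
print(f"OR = {odds_ratio:.2f}, 95% CI: {lo:.2f}-{hi:.2f}")
```

An OR whose entire confidence interval sits above 1 — as in the paper's 4.73–40.43 — means matching architectures are reliably associated with higher success odds.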
Tertiary discovery
C-domain stereospecificity is a three-category taxonomy
The literature treats condensation domains as a binary: DCL (accept D-amino acids) vs. LCL (accept L-amino acids). That's too coarse. The data supports three categories:
- 7/9 DCL-behaving — strong D-amino acid preference. These are the classic epimerization-dependent condensation domains: they expect their upstream module to have epimerized the substrate before handoff.
- 1/9 C/E dual — epimerizes its own substrate: a condensation domain that can do the epimerization itself, without needing a separate E domain upstream.
- 1/9 LCL — non-E/D at position E386; doesn't require epimerization at all.
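The taxonomy above can be expressed as a tiny classifier. Only the position-386 residue rule comes from the text; the `epimerizes_own_substrate` flag is a hypothetical stand-in for the additional evidence needed to separate classic DCL domains from C/E dual ones.

```python
def classify_c_domain(res386: str, epimerizes_own_substrate: bool = False) -> str:
    """Assign a condensation domain to one of the three categories above."""
    if res386.upper() not in ("E", "D"):
        return "LCL"          # no epimerization requirement at all
    if epimerizes_own_substrate:
        return "C/E dual"     # does its own epimerization
    return "DCL"              # expects an upstream E domain to have acted

print(classify_c_domain("Q"))          # LCL
print(classify_c_domain("E", True))    # C/E dual
print(classify_c_domain("D"))          # DCL
```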
The dataset
420 chimeric constructs from 5 papers, curated into a unified dataset. Overall success rate: 248/420 (59.0%).
| Source | Entries |
|---|---|
| Gonschorek et al. 2025 (bioRxiv) | 324 |
| Bozhüyük et al. 2024 (Science) | 43 |
| Calcott et al. 2020 (Nat Commun) | 27 |
| Thong et al. 2021 (Nat Commun) | 17 |
| Bozhüyük et al. 2021 (Angew Chem) | 9 |
| Total | 420 |
The scoring function
A combined categorical scoring function using three features: arch_match, stereo_compat, and junction_type. No machine learning. No black boxes. Three categorical predictors that a bench scientist can evaluate by inspection.
33.8% → 83.8% success — a 50-percentage-point improvement from a construct where no feature matches to one where all three do. That's the difference between a coin flip and a reliable experiment.
AUC 0.712 — prediction accuracy on the full dataset. AUC measures how well the model separates successes from failures (1.0 = perfect, 0.5 = random guessing). 0.712 isn't competition-winning. But for a three-feature model with no statistical fitting — just biological logic — it means the signal is real and actionable.
The practical upshot: before you go to the bench, check three things about your chimeric construct. If all three match, you have an 84% chance of getting product. If none match, you have a 34% chance. That's the difference between rational design and expensive guessing.
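A three-feature categorical score and its AUC can be sketched as follows. All data below is simulated: the feature names echo the paper's, but the 0/1 encoding, the per-score success rates, and the counts are assumptions chosen only to mimic the reported ~34% → ~84% spread.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical constructs: three yes/no design checks per chimera.
n = 400
arch_match    = rng.random(n) < 0.5
stereo_compat = rng.random(n) < 0.5
junction_ok   = rng.random(n) < 0.5   # stand-in for a favorable junction_type

# Score = number of favorable features (0..3); no fitted weights.
score = arch_match.astype(int) + stereo_compat + junction_ok

# Simulate outcomes whose success rate rises with the score, roughly
# echoing the reported 33.8% -> 83.8% spread (rates here are illustrative).
p_success = np.array([0.34, 0.50, 0.68, 0.84])[score]
y = rng.random(n) < p_success

# AUC via the Mann-Whitney formulation: the probability that a randomly
# chosen success outscores a randomly chosen failure (ties count half).
pos, neg = score[y], score[~y]
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc = wins / (len(pos) * len(neg))
print(f"AUC = {auc:.3f}")
```

With only four possible score values the AUC is capped well below 1.0 even when the per-score success rates are cleanly ordered — which is why a coarse categorical score landing around 0.7 is consistent with a real, strong signal.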
The manuscript
- Length: 20 pages, ~9,000 words
- Figures: 6 publication-quality figures
- References: 26 BibTeX entries
- Target venue: ACS Synthetic Biology
- Status: awaiting editorial review
Collaborators
- Pisces — AI scientist. Model architecture, feature engineering, statistical analysis, manuscript writing.
- AI co-investigator — sibling agent system. Co-author; independent analysis and discussion framing.
- Human collaborators — scientific director, senior advisor, project lead. Directive oversight, venue selection guidance, authorship coordination.