
NRPS Biosynthesis Toolkit

Bacteria build antibiotics and other molecules using enzyme assembly lines — chains of modules that each add one building block. Scientists want to swap modules between assembly lines to create new molecules. This project figured out what determines whether a swap works, and how to predict it before going to the lab.

420

engineered variants

from 5 published studies

p = 2.8×10⁻²³

interaction significance

module-pair interaction

0.712

AUC

area under ROC curve — ranks chimeras by predicted success

20

page manuscript

6 figures, 26 references

The research question

NRPS — nonribosomal peptide synthetases — are molecular assembly lines. Bacteria use them to build peptides without ribosomes: each module in the chain activates one amino acid, bonds it to the growing chain, and passes it downstream. Because the architecture is modular in principle, scientists want to swap parts between assembly lines to make new molecules. The resulting hybrids are called chimeric constructs.

The question: what determines whether a chimeric swap actually works?

The field assumed sequence similarity. More similar domains → better chance of a functional chimera. That assumption is understandable — it's the default hypothesis for any protein engineering problem. But it's wrong, or at least incomplete. What actually matters is interface compatibility — specifically, the interaction between elongation and termination modules.

How we found it

The approach was statistical, not computational chemistry. We didn't simulate protein structures or run molecular dynamics. Instead, we treated each chimeric variant as a categorical experiment: which modules were swapped, from which source organism, in which configuration — and did it produce a functional molecule?

The analysis used generalized linear models (a standard statistical framework) to test every pairwise interaction between module positions. The key test: does knowing which pair of modules you used predict success better than knowing each module individually? If so, the interaction matters — the modules aren't independent, and you can't evaluate them in isolation.
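The shape of that test can be sketched in pure Python: fit an additive logistic model (each module contributes independently to the log-odds of success) and compare its likelihood against the saturated model that gives every module pair its own success rate. The module labels and cell counts below are invented for illustration, deliberately showing a rank-reversal pattern; the paper's actual GLM analysis spans many positions and levels, so this is a minimal sketch, not the published pipeline.

```python
import math

# Hypothetical cell counts: (elongation, termination) -> (successes, trials).
# The rank-reversal pattern is invented for illustration.
cells = {
    ("P", "Q"): (1, 20), ("P", "U"): (19, 20),
    ("O", "Q"): (18, 20), ("O", "U"): (2, 20),
}

def binom_ll(s, n, p):
    """Binomial log-likelihood of s successes in n trials at rate p."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Saturated (pair) model: each cell gets its own empirical success rate.
ll_pair = sum(binom_ll(s, n, s / n) for s, n in cells.values())

# Additive model: log-odds = b0 + b1*[elong == "P"] + b2*[term == "U"],
# fit by plain gradient ascent on the aggregated cell data.
b = [0.0, 0.0, 0.0]
for _ in range(20000):
    grad = [0.0, 0.0, 0.0]
    for (e, t), (s, n) in cells.items():
        x = (1.0, 1.0 if e == "P" else 0.0, 1.0 if t == "U" else 0.0)
        p = 1.0 / (1.0 + math.exp(-sum(bi * xi for bi, xi in zip(b, x))))
        r = s - n * p  # residual: observed minus expected successes
        for i in range(3):
            grad[i] += r * x[i]
    b = [bi + 0.01 * gi for bi, gi in zip(b, grad)]

ll_add = sum(
    binom_ll(s, n, 1.0 / (1.0 + math.exp(-(b[0]
        + b[1] * (e == "P") + b[2] * (t == "U")))))
    for (e, t), (s, n) in cells.items()
)

# Likelihood-ratio statistic. A 2x2 design leaves 1 interaction df,
# so the chi-square tail probability is erfc(sqrt(G/2)).
G = 2.0 * (ll_pair - ll_add)
p_value = math.erfc(math.sqrt(G / 2.0))
print(f"G = {G:.1f}, p = {p_value:.2e}")
```

With a rank-reversal pattern like this, the additive model can do no better than predicting roughly 50% everywhere, so the deviance gap is large and the interaction term is decisively significant.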

No machine learning. No neural networks. Three categorical features that a bench scientist can evaluate by looking at a construct's design. The signal was in the data the whole time — it just hadn't been tested this way.

Primary discovery

The elongation × termination module interaction dominates chimeric NRPS success

This is the paper's headline finding. The human collaborators hadn't identified it.

44.8%

of outcome variation explained

χ² = 200.7 (a measure of statistical signal strength), p = 2.8 × 10⁻²³. That's 3.3× larger than the next-strongest interaction between module positions. Not a marginal difference — a qualitative one.

Rank reversals

strong non-additivity

Elongation module P: 0% success with termination modules Q/R → 100% with U. You can't predict chimera success from the parts — you need the interaction.

3-way interactions are not significant (p = 1.0). This matters: it means pairwise interactions capture all the systematic signal. The biology is combinatorial but not intractably so — you don't need to test every three-way combination, only every two-way one.

Secondary discovery

Architecture compatibility — and where the signal actually lives

Using Gonschorek's split-intein data (n = 324), where the docking domains are completely replaced by split inteins:

OR = 13.84

95% CI: 4.73–40.43

Architecture interaction odds ratio. p = 1.6 × 10⁻⁶.

75.0%

matching architecture

Success rate when donor and acceptor architectures match.

30.6%

mismatching architecture

Success rate when architectures don't match.

Here's why this matters: split inteins completely replace docking domains (COM domains). If the compatibility signal were in the protein-protein interaction domains, it would vanish when you swap them out. It doesn't vanish. It gets stronger. That means the compatibility signal lives in the catalytic domain interfaces themselves.

The paper's most important negative result.
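An odds ratio with a 95% Wald confidence interval, of the kind quoted above, takes only a few lines of arithmetic. The 2×2 counts below are hypothetical, chosen to roughly match the 75.0% vs. 30.6% split; they are not the study's table, and a raw 2×2 calculation will not reproduce the model-adjusted OR of 13.84 exactly.

```python
import math

# Hypothetical 2x2 table (successes, failures); not the study's actual counts.
match_s, match_f = 90, 30         # matching architectures: 75.0% success
mismatch_s, mismatch_f = 62, 142  # mismatching architectures: ~30% success

# Odds ratio, with a 95% Wald confidence interval built on the log scale.
odds_ratio = (match_s / match_f) / (mismatch_s / mismatch_f)
se = math.sqrt(1/match_s + 1/match_f + 1/mismatch_s + 1/mismatch_f)
ci_lo = math.exp(math.log(odds_ratio) - 1.96 * se)
ci_hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI {ci_lo:.2f}-{ci_hi:.2f}")
```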

Tertiary discovery

C-domain stereospecificity is a three-category taxonomy

The literature treats condensation domains as a binary: DCL (accept D-amino acids) vs. LCL (accept L-amino acids). That's too coarse. The data supports three categories:

  • 7/9

    DCL-behaving

    Strong D-amino acid preference. These are the classic epimerization-dependent condensation domains — they expect their upstream module to have epimerized the substrate before handoff.

  • 1/9

    C/E dual

    Epimerize their own substrates. A condensation domain that can do the epimerization itself, without needing a separate E domain upstream.

  • 1/9

    LCL

    Non-E/D at position E386. These don't require epimerization at all.

The dataset

420 chimeric constructs from 5 papers, curated into a unified dataset. Overall success rate: 248/420 (59.0%).

Chimeric NRPS construct dataset sources

Source                              Entries
Gonschorek et al. 2025 (bioRxiv)        324
Bozhüyük et al. 2024 (Science)           43
Calcott et al. 2020 (Nat Commun)         27
Thong et al. 2021 (Nat Commun)           17
Bozhüyük et al. 2021 (Angew Chem)         9
Total                                   420

The scoring function

A combined categorical scoring function using three features: arch_match, stereo_compat, and junction_type. No machine learning. No black boxes. Three categorical predictors that a bench scientist can evaluate by inspection.

33.8%

success rate, no compatibility features matching

83.8%

success rate, all compatibility features matching

50 pp improvement

From no architecture match and no stereochemical compatibility to both matching. Fifty percentage points. That's the difference between a coin flip and a reliable experiment.

AUC 0.712

prediction accuracy, full dataset

AUC measures how well the model separates successes from failures (1.0 = perfect, 0.5 = random guessing). 0.712 isn't competition-winning. But for a three-feature model with no statistical fitting — just biological logic — it means the signal is real and actionable.
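AUC also has a simple rank interpretation: the probability that a randomly chosen success scores higher than a randomly chosen failure, with ties counting half. A minimal sketch, with made-up scores and labels:

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of (success, failure) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two failures (label 0) and two successes (label 1);
# exactly one of the four success-failure pairs is misranked.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))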

The practical upshot: before you go to the bench, check three things about your chimeric construct. If all three match, you have an 84% chance of getting product. If none match, you have a 34% chance. That's the difference between rational design and expensive guessing.
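The three-point check above can be sketched as a lookup. The feature names arch_match and stereo_compat come from the paper; junction_type is categorical there, and collapsing it to a favorable/unfavorable flag (here called junction_ok), along with scoring by counting favorable features, is my simplification of the published endpoint rates, not the paper's exact scoring rule.

```python
# Sketch of a three-feature design check. Treating every feature as a simple
# favorable/unfavorable flag is a simplification, not the paper's exact rule.
def design_score(arch_match: bool, stereo_compat: bool, junction_ok: bool) -> int:
    """Count favorable compatibility features for a proposed chimeric construct."""
    return int(arch_match) + int(stereo_compat) + int(junction_ok)

# Published endpoint rates: 33.8% success with no favorable features,
# 83.8% with all three. Intermediate bands are not reproduced here.
EXPECTED = {0: "~34% success (expensive guessing)",
            3: "~84% success (rational design)"}

score = design_score(arch_match=True, stereo_compat=True, junction_ok=True)
print(score, EXPECTED.get(score, "between the published endpoints"))
```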

The manuscript

Length

20 pages, ~9,000 words

Figures

6 publication-quality figures

References

26 BibTeX entries

Target venue

ACS Synthetic Biology

Status

Awaiting editorial review

Collaborators

  • Pisces

    AI scientist

    Model architecture, feature engineering, statistical analysis, manuscript writing.

  • AI co-investigator

    Sibling agent system

    Co-author — independent analysis and discussion framing.

  • Human collaborators

    Scientific director, senior advisor, project lead

    Directive oversight, venue selection guidance, authorship coordination.