Chemical LLM Fine-Tuning

This project studies two efficiency decisions together: which external molecules are worth adding to a training set, and which parameters of a large chemical language model should be updated. MolFormer is adapted to lipophilicity prediction with BitFit, LoRA, and (IA)³, while random, uncertainty, diversity, and influence-based selection compete for the same 300-point external pool.

The result is a 15-run experiment matrix that makes the accuracy–parameter trade-off visible instead of treating data selection and fine-tuning as unrelated choices.

Dataset and Chemical Space

The MoleculeNet benchmark^[1] collects molecular-property tasks for model comparison. This project uses its Lipophilicity dataset: 4,200 molecules represented as SMILES strings, each paired with an experimental measure of how readily the compound partitions into lipid-like environments.

4,200 labelled molecules

2,444 unique scaffolds

300 external candidates

From SMILES to Scaffolds

SMILES is a compact line notation that can be parsed into a molecular graph. A scaffold strips that graph back to its shared core. Phenol and toluene, for example, differ in their substituent but share the same benzene scaffold, which is exactly the kind of relationship a random split can leak across training and test data.

Example molecule: Phenol

Formula: C₆H₅OH
SMILES: Oc1ccccc1
Core: Benzene

Phenol and toluene are distinct molecules but reduce to the same benzene core; scaffold-aware splitting keeps that shared core on only one side of the train/test boundary.

Data Splitting

Molecular evaluation needs a harder boundary than a random row split. When the same scaffold occurs in both training and test sets, a model can appear to generalize while mostly recognizing a familiar structural core. Six DeepChem strategies were tested before choosing the final partition.

Random: Uniform assignment; useful as a baseline, but shared scaffolds can cross partitions.

Scaffold: Groups Bemis–Murcko cores so one molecular scaffold stays in a single partition.

Molecular weight: Partitions compounds according to molecular-weight ordering.

MaxMin: Maximizes fingerprint diversity between selected compounds.

Butina: Uses Butina clustering over a bulk Tanimoto fingerprint matrix.

Fingerprint: Separates molecules by ECFP4 fingerprint similarity.

Problem: Shared cores leak

Decision: Scaffold split

Result: ≈ 80 / 10 / 10

The chosen split assigns every Bemis–Murcko scaffold to a single partition, preventing structurally related molecules from crossing the evaluation boundary.
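The grouping rule can be illustrated with a minimal sketch in plain Python. It assumes each molecule already has a precomputed scaffold key (for example a Bemis–Murcko scaffold SMILES); the function name `scaffold_split`, the largest-group-first ordering, and the toy scaffold list are illustrative assumptions, not the project's actual DeepChem call.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test index lists.

    `scaffolds` holds one scaffold key per molecule (assumed precomputed).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first, so big scaffold families land in training.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = round(frac_train * n)
    valid_cap = round((frac_train + frac_valid) * n)

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy input: ten molecules that reduce to four scaffolds (hypothetical data).
example_scaffolds = (
    ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
)
train_idx, valid_idx, test_idx = scaffold_split(example_scaffolds)
```

Because groups are never divided, the realised proportions only approximate the requested fractions, which is consistent with the "≈ 80 / 10 / 10" result reported above.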

MolFormer

MolFormer^[2] is a transformer pretrained on roughly 1.1 billion unlabelled SMILES strings. Its linear-attention backbone supplies a 768-dimensional molecular representation; this project adds a dropout-regularized linear head for scalar lipophilicity prediction.

MoLFormer-XL-both-10pct

Linear attention encodes each SMILES sequence into a pooled 768-dimensional molecular representation.

Hidden size: 768
Pretraining corpus: ≈ 1.1B SMILES
Attention: Linear

Fine-Tuning Pipeline

Training proceeds in two deliberate stages. The first adapts MolFormer to the language distribution of this dataset without reading the labels; the second optimizes the regression objective.

Stage 1: Unsupervised MLM

Masked-language modelling adapts the chemical representation before property supervision begins.

Masking: 15%
Objective: Cross entropy
Learning rate: 1 × 10⁻⁴

Stage 2: Supervised regression

The adapted checkpoint learns a single continuous lipophilicity target.

Loss: MSE
Batch size: 64
Learning rate: 1 × 10⁻⁴

The schedule first adapts the chemical language model to the Lipophilicity SMILES distribution, then learns the scalar property target from labelled examples.

Domain adaptation followed by regression produces the lowest test MSE (0.532), narrowly improving on supervised-only fine-tuning (0.538) and substantially improving on the unadapted checkpoint (1.150).
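The 15% masking step of Stage 1 can be sketched as follows. This is a simplified illustration: the `<mask>` token string, the character-level tokenization, and the helper name `mask_tokens` are assumptions made for the example, and common refinements (such as occasionally substituting a random token instead of the mask) are left out.

```python
import random

MASK = "<mask>"  # hypothetical mask token; a real tokenizer defines its own


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-language-model example.

    labels[i] holds the original token at masked positions and None
    elsewhere, so a cross-entropy loss would be computed only where a
    token was hidden. Expects a non-empty token list.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Toy character-level tokenization of the caffeine SMILES string
# (an assumption; MolFormer ships its own tokenizer).
tokens = list("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
masked, labels = mask_tokens(tokens)
```

The labels never touch the lipophilicity values, which is why this stage can run on the dataset's SMILES strings alone.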

Influence Functions

Influence functions^[3] estimate how an individual training example changes held-out loss. The inverse Hessian-vector product is approximated with LiSSA so the external candidates can be ranked without fully retraining the model once per molecule.

\mathcal{I}(z_i) = -\nabla_\theta \mathcal{L}_{\text{test}}^\top \, H_\theta^{-1} \, \nabla_\theta \mathcal{L}(z_i)

H_θ is the Hessian of the training loss; the two gradient terms connect the candidate example to the test objective.

The influence score combines a candidate gradient with an approximate inverse-Hessian response; positive selections are expected to reduce held-out loss.
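To make the inverse Hessian-vector product concrete, here is a small deterministic sketch of the recursion that LiSSA is built on, applied to a toy 2 × 2 Hessian. Everything in it is an assumption for illustration: the matrix `H`, the gradients, the `scale` and `steps` values, and the helper names. The actual LiSSA procedure estimates the Hessian-vector product stochastically from mini-batches rather than from an explicit matrix.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lissa_inverse_hvp(hvp, v, scale=3.0, steps=100):
    """Approximate H^{-1} v using only Hessian-vector products.

    Iterates h <- v + h - hvp(h) / scale, a truncated Neumann series whose
    fixed point is scale * H^{-1} v, then divides by scale. The iteration
    converges when the eigenvalues of H / scale lie in (0, 2).
    """
    h = list(v)
    for _ in range(steps):
        hh = hvp(h)
        h = [v_i + h_i - hh_i / scale for v_i, h_i, hh_i in zip(v, h, hh)]
    return [h_i / scale for h_i in h]


# Toy symmetric positive-definite Hessian and gradients (hypothetical values).
H = [[2.0, 0.5], [0.5, 1.0]]
test_grad = [1.0, -1.0]
candidate_grads = [[1.0, 0.0], [0.0, 1.0]]

# s_test approximates H^{-1} * grad L_test; it is computed once and reused.
s_test = lissa_inverse_hvp(lambda x: matvec(H, x), test_grad)

# I(z_i) = - grad L_test^T H^{-1} grad L(z_i), one score per candidate.
scores = [-sum(s * g for s, g in zip(s_test, grad)) for grad in candidate_grads]
```

Computing `s_test` once and taking a dot product per candidate is what keeps the ranking cheap: the cost of the inverse-Hessian approximation does not grow with the size of the candidate pool.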

Data Selection Strategies

Each strategy selects roughly 185 candidates from the same 300-molecule pool. What changes is the signal used to decide which examples deserve the limited training budget.

Random: 185 points

Uniform sampling

A dependency-free baseline that draws from the external pool without replacement.

Active learning: 183 points

MC-dropout uncertainty

One hundred stochastic passes identify molecules for which the model is least certain.

Diversity: 185 points

MaxMin fingerprints

Selects a structurally varied subset that spreads coverage across chemical space.

Influence: 185 points

LiSSA influence score

Keeps candidates whose estimated contribution reduces the held-out loss.

Under full fine-tuning, influence selection gives the lowest test MSE (0.510); the random subset is highest at 0.576.
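As an illustration of the diversity strategy, the sketch below implements greedy MaxMin selection over Tanimoto distances. Fingerprints are represented as plain sets of on-bit indices, and the five toy fingerprints, the starting index, and the function names are hypothetical; a real pipeline would derive fingerprints from the molecules with a cheminformatics toolkit.

```python
def tanimoto_distance(a, b):
    """1 - |A & B| / |A | B| for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity


def maxmin_select(fingerprints, k, first=0):
    """Greedy MaxMin: repeatedly add the candidate farthest from the picks.

    "Farthest" means the largest distance to its nearest already-selected
    fingerprint, which spreads the selection across the pool.
    """
    selected = [first]
    # min_dist[i] = distance from candidate i to its nearest selected item.
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(selected) < min(k, len(fingerprints)):
        remaining = [i for i in range(len(fingerprints)) if i not in selected]
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[nxt]))
    return selected


# Toy pool of five fingerprints (hypothetical on-bit sets).
pool = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 6, 7, 8},
    {9, 10, 11, 12},
    {1, 2, 3, 4, 5},
]
picked = maxmin_select(pool, k=3)
```

In this toy pool the near-duplicates of the starting fingerprint (indices 1 and 4) are passed over in favour of the two structurally distant entries, which is the behaviour the diversity strategy relies on.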

Parameter-Efficient Fine-Tuning

Rather than update all 44.4 million parameters, PEFT freezes most of MolFormer and exposes a controlled adaptation surface. The three methods below span a wide range, from 28 thousand trainable values to 16.6 million.

BitFit

Train only biases and the regression head.

All weight matrices remain frozen, making BitFit the simplest method in the experiment.

Trainable: ≈ 75K
Share: 0.17%
Learning rate: 1 × 10⁻³

The three PEFT methods trade adaptation capacity for parameter cost: LoRA changes the most weights, while (IA)³ changes the fewest.

On a normalized ten-point view, (IA)³ leads parameter efficiency, LoRA leads predictive performance, and BitFit remains simplest to implement.
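The trainable shares follow directly from the parameter counts quoted in this section, as the short calculation below shows. The per-matrix LoRA formula is the standard one (a rank-r update to a d_out × d_in matrix adds r · (d_out + d_in) parameters); the rank of 8 and the 768 × 768 matrix shape are assumptions chosen only to illustrate the formula, since the configuration actually used is not stated here.

```python
TOTAL_PARAMS = 44_400_000  # full MolFormer parameter count quoted above


def trainable_share(trainable, total=TOTAL_PARAMS):
    """Percentage of all parameters that receive gradient updates."""
    return 100.0 * trainable / total


def lora_added_params(d_out, d_in, rank):
    """LoRA adds B (d_out x rank) and A (rank x d_in) per adapted matrix."""
    return rank * (d_out + d_in)


bitfit_share = trainable_share(75_000)     # about 0.17%
lora_share = trainable_share(16_600_000)   # about 37.4%
ia3_share = trainable_share(28_000)        # about 0.06%

# Hypothetical example: one 768 x 768 projection adapted at rank 8.
per_matrix = lora_added_params(768, 768, rank=8)
full_matrix = 768 * 768
```

For that single hypothetical matrix, the low-rank update trains 12,288 values instead of 589,824.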

Combined Results

Crossing three PEFT methods with five external-data choices yields 15 configurations. The comparison below shows that the fine-tuning method dominates the outcome more strongly than the selection strategy in this experiment.

LoRA stays below BitFit and (IA)³ across all five selection strategies and reaches the overall minimum, 0.083 MSE, with the random subset.

	BitFit (bias terms)	LoRA (low-rank adapters)	(IA)³ (learned rescaling)
Best MSE (lower is better)	0.477	0.083	0.522
Trainable parameters	0.17%	37.4%	0.06%
Winning strategy	Random	Random	Random

Best result per PEFT method. LoRA wins on MSE by a wide margin, while (IA)³ uses the smallest trainable share. Lower MSE is better.

What the Experiment Says

LoRA is the accuracy winner.

It remains best under every selection strategy and reaches the overall minimum test MSE of 0.083.

(IA)³ is the parameter winner.

Only 0.06% of parameters are trainable, although the reduced adaptation capacity leaves a larger performance gap.

Random selection wins this experiment.

The result is consistent across the three PEFT methods, but its seed-specific nature argues for repeated runs rather than a universal rule.

Domain adaptation still helps.

Masked-language-model adaptation before supervised regression improves test MSE from 0.538 to 0.532.

Conclusion

The experiment does not produce one universally optimal efficiency recipe. It shows a useful hierarchy instead: LoRA is the strong choice when predictive quality matters most, BitFit offers a restrained compromise, and (IA)³ minimizes trainable parameters. Data selection still changes the result, but repeated seeds would be necessary before treating the random subset’s win as a general molecular-learning principle.