This course project for "Neural Networks: Theory and Implementation" compares data selection strategies for Parameter-Efficient Fine-Tuning (PEFT) in molecular property prediction. The project fine-tunes MolFormer on the MoleculeNet Lipophilicity dataset with BitFit, LoRA, and (IA)³, augmenting the training set with points selected from an external dataset using criteria based on gradient norms, embedding distances, and uncertainty estimates. Results demonstrate reduced computational requirements while maintaining predictive performance comparable to full dataset fine-tuning, enabling more resource-conscious adaptation of pre-trained models to specific chemical prediction tasks.
Datasets
- 4200 data points
- 2444 unique scaffolds
MoleculeNet is a collection of datasets, curated from various sources, for benchmarking machine learning models on drug discovery and molecular property prediction tasks.
Lipophilicity Dataset
The Lipophilicity dataset contains experimental lipophilicity values for 4200 molecules represented as SMILES strings.
SMILES
Simplified Molecular Input Line Entry System (SMILES) is a line notation for representing molecules and reactions. It encodes molecular structures as ASCII strings that follow a fixed set of rules, so the strings can be parsed to generate molecular graphs.
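To make the "ASCII strings that follow rules" idea concrete, here is a minimal sketch that splits a SMILES string into tokens with a regular expression of the kind commonly used for chemical language models. Whether this pattern matches MolFormer's own tokenizer exactly is an assumption, and `tokenize_smiles` is a hypothetical helper name. The two inputs are standard SMILES for phenol and toluene.

```python
import re
from typing import List

# Pattern covering bracket atoms, two-letter halogens, aromatic atoms,
# branches, bonds, and ring-closure digits (an assumed, commonly used pattern).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("Unrecognised characters in SMILES: " + smiles)
    return tokens


phenol_tokens = tokenize_smiles("Oc1ccccc1")   # phenol: hydroxyl on an aromatic ring
toluene_tokens = tokenize_smiles("Cc1ccccc1")  # toluene: methyl on an aromatic ring
print(phenol_tokens)
print(toluene_tokens)
```

The digit `1` appears twice in each string because it opens and closes the six-membered ring; lowercase `c` denotes an aromatic carbon.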
Figure: Phenol (C6H5OH) and Toluene (C6H5CH3)
Lipophilicity Values
Lipophilicity is a measure of the ability of a chemical compound to dissolve in fats, oils, and lipids. It is a key property in drug discovery and design, influencing the absorption, distribution, metabolism, and excretion of drugs. The dataset contains experimental lipophilicity values for the corresponding SMILES strings.
What are Scaffolds?
Scaffolds are core structures shared across different molecules. In cheminformatics they are used to group molecules by structural similarity and to analyze and compare molecular structures.
For example, benzene, toluene, and phenol all share the same scaffold: benzene.
Figure: Benzene (C6H6)
Data Splitting
For datasets containing molecular structures, it is essential to split the data such that the evaluation measures generalization to novel molecules. Splits that contain overlapping scaffolds can lead to data leakage and overfitting.
Different data splitting strategies were tested, and the best-performing one was selected for the PEFT experiments. The main criterion for selecting the splitting strategy was that no scaffolds overlap between the sets. The following splits from DeepChem were tested:
- Scaffold-based data splits (Bemis-Murcko scaffold)
- Diversity splits based on the MaxMin diversity algorithm
- Data splits based on Butina clustering of a bulk Tanimoto fingerprint matrix
- Splits based on Tanimoto similarity between ECFP4 fingerprints
Observations
The scaffold-based splitting strategy was selected as it ensures that no overlapping scaffolds exist between train, validation, and test sets. This prevents data leakage from structurally similar molecules and ensures the model generalizes to novel molecular scaffolds. The final split achieves approximately 80/10/10 train/validation/test distribution.
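The no-overlap property can be illustrated with a minimal sketch of a greedy scaffold split. It assumes the scaffold of each molecule has already been computed (for example as a Bemis-Murcko scaffold string); the function name `scaffold_split` and the toy scaffold keys are hypothetical, and this is not the DeepChem implementation used in the project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split(
    scaffolds: Sequence[str], frac_train: float = 0.8, frac_valid: float = 0.1
) -> Tuple[List[int], List[int], List[int]]:
    """Assign whole scaffold groups to train/valid/test so no scaffold is shared."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index to keep the result deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy scaffold keys (hypothetical): ten molecules spread over four scaffolds.
toy_scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because whole groups are assigned at once, the resulting fractions are only approximately 80/10/10, which is consistent with the "approximately" in the observation above.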
MolFormer is a transformer-based model for molecular property prediction, pre-trained on a large corpus of ~1.1 billion unlabelled SMILES strings. It uses a linear attention mechanism to efficiently process molecular representations and can be fine-tuned for specific downstream tasks.
Model: MoLFormer-XL-both-10pct
Attention: Linear attention mechanism
Pre-training Data: ~1.1B unlabelled SMILES strings
A linear layer is added on top of the pooler output to predict continuous lipophilicity values.
Classifier Dropout: 0.5
Embedding Dropout: 0.2
Hidden Dropout: 0.2
Output: 1 (scalar)
Fine-Tuning Pipeline
The fine-tuning process follows a two-stage approach: first, unsupervised fine-tuning using Masked Language Modeling (MLM) to adapt the model to the domain-specific molecular distribution, followed by supervised fine-tuning with the regression head for lipophilicity prediction.
Stage 1: Unsupervised MLM
Masked Language Modeling adapts the pre-trained model to the chemical language patterns specific to the Lipophilicity training set, without using any labels.
Objective: CrossEntropyLoss
Masking Probability: 15%
Optimizer: AdamW
Learning Rate: 1×10−4
Early Stopping: Patience = 2
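As an illustration of the 15% masking step, the sketch below masks tokens independently with a fixed probability and records the original token as the label only at masked positions. It is a simplification: the mask symbol, the helper name `mask_tokens`, and the choice to always substitute the mask symbol (rather than occasionally keeping or randomising the token) are assumptions, not details taken from the project.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "<mask>"  # hypothetical mask symbol


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked_tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)  # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)   # position ignored by the cross-entropy loss
    return masked, labels


example_tokens = list("Cc1ccccc1")  # toluene, one character per token
masked_tokens, mlm_labels = mask_tokens(example_tokens, mask_prob=0.15, seed=0)
print(masked_tokens)
print(mlm_labels)
```

The cross-entropy objective is then evaluated only at positions whose label is not `None`.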
Stage 2: Supervised Regression
Fine-tuning with the regression head on labeled lipophilicity values, starting from the unsupervised-adapted checkpoint.
Loss Function: MSELoss
Evaluation Metric: MAE
Optimizer: AdamW
Learning Rate: 1×10−4
Batch Size: 64
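For reference, the training loss (MSE) and the evaluation metric (MAE) can be written out in a few lines. This is a plain-Python sketch with hypothetical function names and made-up numbers, not the framework loss objects used in the project.

```python
from typing import Sequence


def mean_squared_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Average of squared residuals; penalises large errors more strongly."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def mean_absolute_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Average of absolute residuals; in the same units as the target."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


# Made-up predictions and targets, purely for illustration.
toy_preds = [1.0, 2.0, 4.0]
toy_targets = [1.0, 3.0, 2.0]
print(mean_squared_error(toy_preds, toy_targets))
print(mean_absolute_error(toy_preds, toy_targets))
```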
Fine-Tuning Comparison
The two-stage approach (unsupervised MLM + supervised regression) outperforms direct supervised fine-tuning, demonstrating the benefit of domain adaptation before task-specific training.
Figure: Fine-Tuning Results (Test Set)
Influence functions approximate the effect of each training point on model predictions. By computing the inverse Hessian-vector product (iHVP) through LiSSA (Linear time Stochastic Second-order Algorithm), we estimate how including or removing each external data point would affect the test loss.
Influence Score Formula
The influence of a training point $z_i$ on the test loss:
$$I(z_i) = -\nabla_\theta L_{\text{test}}^{\top} \, H_\theta^{-1} \, \nabla_\theta L(z_i)$$
where $H_\theta$ is the Hessian of the training loss.
The iHVP is estimated iteratively:
$$\tilde{v}_{t+1} = v + (I - \delta H_\theta)\,\tilde{v}_t$$
Recursion depth = 200, damping = 0.01, scale = 0.04
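The recursion can be checked on a toy problem. The sketch below runs it with an explicit 2×2 Hessian, which is only feasible at toy scale; in practice Hessian-vector products would come from automatic differentiation on mini-batches. Taking the recursion exactly as written, its fixed point satisfies δ·H·ṽ = v, so the sketch multiplies the result by δ to recover H⁻¹v. How the project's damping and scale settings map onto δ is not stated, so δ here is an assumed value, and all function names are hypothetical.

```python
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[float]], vector: Sequence[float]) -> List[float]:
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lissa_ihvp(
    hessian: Sequence[Sequence[float]], v: Sequence[float], delta: float, depth: int = 200
) -> List[float]:
    """Approximate H^{-1} v via v_{t+1} = v + (I - delta * H) v_t."""
    estimate = list(v)
    for _ in range(depth):
        hv = matvec(hessian, estimate)
        estimate = [vi + ei - delta * hi for vi, ei, hi in zip(v, estimate, hv)]
    # The fixed point equals (delta * H)^{-1} v, so rescale by delta.
    return [delta * e for e in estimate]


def influence_score(
    grad_test: Sequence[float],
    hessian: Sequence[Sequence[float]],
    grad_point: Sequence[float],
    delta: float,
    depth: int = 200,
) -> float:
    """I(z_i) = -grad_test^T H^{-1} grad_point, using the iterative estimate."""
    ihvp = lissa_ihvp(hessian, grad_point, delta, depth)
    return -sum(g * s for g, s in zip(grad_test, ihvp))


# Toy symmetric positive-definite Hessian; delta chosen so that the
# eigenvalues of (I - delta * H) lie strictly inside (-1, 1).
toy_hessian = [[2.0, 1.0], [1.0, 3.0]]
print(lissa_ihvp(toy_hessian, [1.0, 2.0], delta=0.25))
print(influence_score([1.0, 0.0], toy_hessian, [1.0, 2.0], delta=0.25))
```

The iteration only converges when δ is small enough relative to the largest Hessian eigenvalue, which is one reason such scaling hyperparameters need tuning.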
Data Selection Strategies
Four strategies were used to select approximately 185 data points from the 300-point external dataset for augmenting the training set. Each strategy captures different properties of the data.
Uniform Sampling Baseline
Randomly samples 185 data points from the external dataset uniformly without replacement. Serves as the baseline strategy.
Selected: 185 points
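A minimal sketch of the uniform baseline, assuming the external pool is addressed by integer index; the function name and seed are hypothetical.

```python
import random
from typing import List


def uniform_select(pool_size: int = 300, k: int = 185, seed: int = 0) -> List[int]:
    """Draw k distinct indices uniformly at random from range(pool_size)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), k))


baseline_indices = uniform_select()
print(len(baseline_indices))
```

Different seeds yield different subsets, so results for this baseline depend on the seed used.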
Monte Carlo Dropout Uncertainty
Uses MC Dropout (100 forward passes) to estimate prediction uncertainty. Points with uncertainty above a threshold are selected, targeting samples where the model is least confident.
Selected: 183 points
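The selection rule can be sketched with a toy stand-in for the model: a linear predictor whose inputs are randomly dropped at inference time. The standard deviation across passes serves as the uncertainty estimate; that choice, the threshold value, and all names below are assumptions, since the project's exact uncertainty measure and threshold are not given here.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def dropout_forward(
    features: Sequence[float], weights: Sequence[float], p: float, rng: random.Random
) -> float:
    """One stochastic pass: drop each input with probability p, rescale the rest."""
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() >= p:
            total += w * x / (1.0 - p)
    return total


def mc_dropout_uncertainty(
    features: Sequence[float],
    weights: Sequence[float],
    p: float = 0.2,
    passes: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean prediction, standard deviation) over repeated stochastic passes."""
    rng = random.Random(seed)
    preds = [dropout_forward(features, weights, p, rng) for _ in range(passes)]
    return statistics.mean(preds), statistics.pstdev(preds)


def select_uncertain(
    pool: Sequence[Sequence[float]],
    weights: Sequence[float],
    threshold: float,
    p: float = 0.2,
    passes: int = 100,
    seed: int = 0,
) -> List[int]:
    """Indices of pool points whose predictive standard deviation exceeds threshold."""
    return [
        i
        for i, features in enumerate(pool)
        if mc_dropout_uncertainty(features, weights, p, passes, seed)[1] > threshold
    ]


toy_pool = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
print(select_uncertain(toy_pool, [1.0, 1.0, 1.0], threshold=0.1, p=0.5))
```

A threshold-based rule does not fix the number of selected points in advance, which is consistent with this strategy selecting 183 rather than 185 points.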
Structural Diversity Maximization
Uses the MaxMin diversity algorithm from DeepChem to select the most structurally diverse molecules based on molecular fingerprints, maximizing coverage of chemical space.
Selected: 185 points
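The MaxMin idea can be sketched in plain Python: starting from one molecule, repeatedly add the candidate whose minimum Tanimoto distance to the already selected set is largest. Fingerprints are modelled as sets of on-bit indices; the function names, the starting index, and the toy fingerprints are hypothetical, and this is not the DeepChem routine used in the project.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a ∩ b| / |a ∪ b| for fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_select(fingerprints: Sequence[FrozenSet[int]], k: int, first: int = 0) -> List[int]:
    """Greedy MaxMin: each pick maximises the distance to its nearest selected item."""
    selected = [first]
    # Distance from every item to its nearest selected item so far.
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(selected) < min(k, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        pick = max(candidates, key=lambda i: min_dist[i])
        selected.append(pick)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[pick]))
    return selected


toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
]
print(maxmin_select(toy_fps, k=3))
```

On the toy input the fingerprint with no bits in common with the starting molecule is picked first, which is the behaviour that spreads the selection over chemical space.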
Gradient-Based Selection
Selects points with positive influence scores (computed via LiSSA), meaning their inclusion is estimated to reduce the test loss. Points with negative influence are excluded.
Selected: 185 points
Strategy Comparison (Full Fine-Tuning)
Each selection strategy was combined with the full training set and used to fine-tune from the best checkpoint. Lower MSE indicates better predictive performance.
Figure: Data Selection Results (Test MSE)
Parameter-Efficient Fine-Tuning
Instead of updating all ~44.4M parameters during fine-tuning, PEFT techniques freeze most of the model and train only a small subset. Three PEFT methods were applied with each data selection strategy to evaluate their combined effectiveness.
Bias-Term Fine-Tuning (BitFit)
Only bias terms and the regression head are trainable. All weight matrices remain frozen, providing the simplest form of parameter-efficient adaptation.
Trainable: ~75K / 44.4M
Ratio: 0.17%
Learning Rate: 1×10−3
Classifier Dropout: 0.2
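The freezing rule amounts to a filter on parameter names. The sketch below applies it to a dictionary of parameter names and sizes; the names, the `regression_head.` prefix, and the layer sizes are hypothetical and do not reflect MolFormer's actual parameter layout.

```python
from typing import Dict, Set


def bitfit_trainable(param_sizes: Dict[str, int], head_prefix: str = "regression_head.") -> Set[str]:
    """Names that stay trainable: every bias term plus the whole regression head."""
    return {
        name
        for name in param_sizes
        if name.endswith("bias") or name.startswith(head_prefix)
    }


def trainable_ratio(param_sizes: Dict[str, int], trainable: Set[str]) -> float:
    """Fraction of all parameters that receive gradient updates."""
    total = sum(param_sizes.values())
    return sum(param_sizes[name] for name in trainable) / total


# Hypothetical parameter names and sizes for one encoder layer plus a head.
toy_params = {
    "encoder.layer.0.attention.query.weight": 768 * 768,
    "encoder.layer.0.attention.query.bias": 768,
    "encoder.layer.0.ffn.weight": 768 * 3072,
    "encoder.layer.0.ffn.bias": 3072,
    "regression_head.weight": 768,
    "regression_head.bias": 1,
}
toy_trainable = bitfit_trainable(toy_params)
print(sorted(toy_trainable))
print(trainable_ratio(toy_params, toy_trainable))
```

The same ratio computation applied to the reported counts gives 75K / 44.4M ≈ 0.17%.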
Low-Rank Adaptation (LoRA)
Injects trainable low-rank decomposition matrices into attention layers (Q, K, V) and FFN intermediate layers while keeping the original weights frozen.
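The low-rank update can be written out on small matrices. In this sketch the frozen weight W (d_out × d_in) is combined with trainable factors B (d_out × r) and A (r × d_in) as W + (alpha / r)·B·A. Initialising B to zeros, so that training starts from the frozen model, is the usual convention and is assumed here; all names and matrix values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(left: Matrix, right: Matrix) -> Matrix:
    """Dense matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)] for row in left]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int) -> Matrix:
    """Frozen weight plus the scaled low-rank update (alpha / rank) * B @ A."""
    update = matmul(b, a)
    scale = alpha / rank
    return [[wi + scale * ui for wi, ui in zip(w_row, u_row)] for w_row, u_row in zip(w, update)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one adapted weight matrix."""
    return rank * (d_in + d_out)


frozen_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # d_out = 2, d_in = 3
factor_a = [[1.0, 2.0, 3.0]]                   # r = 1
zero_b = [[0.0], [0.0]]                        # zero init: no change at the start
trained_b = [[1.0], [0.0]]                     # made-up values after some updates

print(lora_effective_weight(frozen_w, factor_a, zero_b, alpha=2.0, rank=1))
print(lora_effective_weight(frozen_w, factor_a, trained_b, alpha=2.0, rank=1))
print(lora_param_count(768, 768, rank=8), "vs", 768 * 768)
```

With rank 8 and alpha 16, as in the settings listed for this method, the update is scaled by 16 / 8 = 2.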
Trainable: ~16.6M / 44.4M
Ratio: 37.4%
Rank: 8
Alpha: 16
Learning Rate: 1×10−4
Learned Rescaling Vectors ((IA)³)
Introduces per-layer learned scaling vectors that multiply the outputs of attention key, value, and FFN intermediate layers. The most parameter-efficient method tested.
Trainable: ~28K / 44.4M
Ratio: 0.06%
Learning Rate: 1×10−3
Initialization: Ones
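The rescaling itself is an element-wise product, which is why initialising the vectors to ones leaves the pre-trained model unchanged at the start of training. A minimal sketch with hypothetical names and made-up activation values:

```python
from typing import List, Sequence


def ia3_rescale(activations: Sequence[float], scale: Sequence[float]) -> List[float]:
    """Multiply each activation by its learned scaling factor."""
    return [a * s for a, s in zip(activations, scale)]


toy_activations = [0.5, -1.0, 2.0]   # e.g. one key, value, or FFN intermediate vector
ones_init = [1.0, 1.0, 1.0]          # initial scaling vector
learned_scale = [1.0, 0.0, 2.0]      # made-up values after training

print(ia3_rescale(toy_activations, ones_init))
print(ia3_rescale(toy_activations, learned_scale))
```

Only one scalar per activation dimension is trained, which keeps the trainable parameter count very small.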
PEFT Trade-off Overview
Figure: Efficiency vs. Performance vs. Simplicity. Scores are normalised 0–10; higher is better on each axis.
Combined Results
Each PEFT technique was combined with each data selection strategy, resulting in 15 experiment configurations (3 PEFT × 5 selection methods). The combined training data consists of the original training set augmented with the selected external data points.
Figure: PEFT × Data Selection (Test MSE). Lower MSE is better. Lines show how each PEFT method responds to different data selection strategies.
- BitFit (0.17% trainable parameters): Strategy: Random Selection
- LoRA (37.4% trainable parameters): Strategy: Random Selection
- (IA)³ (0.06% trainable parameters): Strategy: Random Selection
Key Findings
- Random selection unexpectedly yields the best MSE across all PEFT methods, likely influenced by seed-specific subset selection
- Active learning (MC Dropout) is the second-best strategy for BitFit and LoRA, reinforcing the value of uncertainty-based selection
- Influence-based selection performs worst for (IA)³, highlighting method-specific sensitivity to data selection strategies
- Structured data selection provides more reliable improvements than random selection due to lower variability across different runs
- LoRA achieves near full-fine-tuning performance with 37.4% trainable parameters
- BitFit offers a good efficiency-performance trade-off with minimal implementation complexity
- (IA)³ is the most parameter-efficient (0.06%) but shows the largest performance gap
- Unsupervised pre-training (MLM) provides consistent improvement across all configurations
Conclusion
This project demonstrates that combining data selection with parameter-efficient fine-tuning enables effective molecular property prediction with significantly reduced computational requirements. Random selection paired with LoRA achieves the best overall performance (MSE 0.083), while BitFit and (IA)³ provide competitive alternatives with only 0.17% and 0.06% of parameters trainable, respectively, on carefully selected subsets of external data. These findings suggest practical pathways for resource-conscious adaptation of large pre-trained chemical language models to specific molecular prediction tasks.