
Course Project - Neural Networks: Theory & Implementation

Data Selection and PEFT

Selecting Influential Data for different Parameter-Efficient Fine-Tuning Techniques

built with
  • MoleculeNet Lipophilicity
  • MolFormer
  • PyTorch
  • RDKit
  • DeepChem
  • Transformers
This course project for "Neural Networks: Theory and Implementation" addresses different data selection strategies for Parameter-Efficient Fine-Tuning (PEFT) in molecular property prediction. By combining the best data selection method to select points from an external dataset with BitFit, LoRA, and (IA)3 techniques, the project fine-tunes MolFormer on the MoleculeNet Lipophilicity dataset using criteria based on gradient norms, embedding distances, and uncertainty estimates. Results demonstrate reduced computational requirements while maintaining comparable predictive performance to full dataset fine-tuning, enabling more resource-conscious adaptation of pre-trained models to specific chemical prediction tasks.

Datasets

Lipophilicity
  • 4200 data points
  • 2444 unique scaffolds
External
  • 300 data points

MoleculeNet Benchmark

MoleculeNet is a benchmark suite of datasets for molecular property prediction. The datasets are curated from various sources and are used to evaluate machine learning models for drug discovery across different prediction tasks.

Lipophilicity Dataset

The Lipophilicity dataset contains experimental lipophilicity values for 4200 molecules represented as SMILES strings.

SMILES

Simplified Molecular Input Line Entry System (SMILES) is a line notation that encodes molecules and reactions as ASCII strings. The notation follows a fixed set of rules, so a SMILES string can be parsed to reconstruct the molecular graph.
Phenol ($\text{C}_6\text{H}_5\text{OH}$) is Oc1ccccc1
Toluene ($\text{C}_6\text{H}_5\text{CH}_3$) is Cc1ccccc1

Lipophilicity Values

Lipophilicity is a measure of the ability of a chemical compound to dissolve in fats, oils, and lipids. It is a key property in drug discovery and design, influencing the absorption, distribution, metabolism, and excretion of drugs. The dataset contains experimental lipophilicity values for the corresponding SMILES strings.

What are Scaffolds?

Scaffolds are core structures shared by different molecules. In cheminformatics they are used to group, analyze, and compare molecules by structural similarity.
For example, benzene, toluene, and phenol all share the same scaffold: benzene.
Scaffold: Benzene ($\text{C}_6\text{H}_6$)

Data Splitting

For datasets containing molecular structures, it is essential to split the data so that the evaluation reflects how well the model generalizes to novel molecules. If the splits share scaffolds, this can lead to data leakage and overfitting.
Different data splitting strategies were tested, and the best-performing one was selected for the PEFT experiments. The main criterion for selecting the splitting strategy was that no scaffolds overlap between the sets. The following splits from DeepChem were tested:
  • random: Random data splits
  • scaffold: Scaffold-based data splits (Bemis-Murcko scaffold)
  • weight: Molecular weight splits
  • maxmin: Diversity splits based on the MaxMin diversity algorithm
  • butina: Data splits based on Butina clustering of a bulk Tanimoto fingerprint matrix
  • fingerprint: Splits based on Tanimoto similarity between ECFP4 fingerprints

Observations

The scaffold-based splitting strategy was selected as it ensures that no overlapping scaffolds exist between train, validation, and test sets. This prevents data leakage from structurally similar molecules and ensures the model generalizes to novel molecular scaffolds. The final split achieves approximately 80/10/10 train/validation/test distribution.
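A minimal sketch of how a scaffold-based split keeps whole scaffold groups together, assuming one scaffold string per molecule has already been computed (for instance with RDKit). The function name `scaffold_split`, the greedy largest-group-first fill, and the toy scaffolds are illustrative assumptions, not the DeepChem implementation used in the project.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold appears in more than one subset.

    `scaffolds` holds one precomputed scaffold string per molecule.
    """
    # Group molecule indices by their scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Place the largest groups first; a group is never divided across subsets.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy input: scaffold SMILES for twenty molecules (illustrative values only).
toy_scaffolds = (
    ["c1ccccc1"] * 8 + ["c1ccncc1"] * 5 + ["C1CCCCC1"] * 3
    + ["c1ccoc1"] * 2 + ["c1ccsc1"] + ["C1CCNCC1"]
)
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because groups are assigned whole, the resulting fractions are only approximately 80/10/10, which matches the "approximately" in the observation above.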

MolFormer

MolFormer is a transformer-based model for molecular property prediction, pre-trained on a large corpus of ~1.1 billion unlabelled SMILES strings. It uses a linear attention mechanism to efficiently process molecular representations and can be fine-tuned for specific downstream tasks.
Architecture
  • Model: MoLFormer-XL-both-10pct
  • Hidden Size: 768
  • Attention: Linear attention mechanism
  • Pre-training Data: ~1.1B unlabelled SMILES strings
Regression Head
A linear layer added on top of the pooler output for predicting continuous lipophilicity values.
  • Classifier Dropout: 0.5
  • Embedding Dropout: 0.2
  • Hidden Dropout: 0.2
  • Output: 1 (scalar)

Fine-Tuning Pipeline

The fine-tuning process follows a two-stage approach: first, unsupervised fine-tuning using Masked Language Modeling (MLM) to adapt the model to the domain-specific molecular distribution, followed by supervised fine-tuning with the regression head for lipophilicity prediction.
Stage 1: Unsupervised MLM
Masked Language Modeling adapts the pre-trained model to the chemical language patterns specific to the Lipophilicity training set, without using any labels.
  • Objective: CrossEntropyLoss
  • Masking Probability: 15%
  • Optimizer: AdamW
  • Learning Rate: $1 \times 10^{-4}$
  • Early Stopping: Patience = 2
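The masking step of Stage 1 can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are single SMILES characters rather than the output of MolFormer's own tokenizer, `MASK_TOKEN` and `mask_tokens` are hypothetical names, and every selected position is replaced by the mask symbol (no random-token or keep-original variants).

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol, not the real tokenizer's


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask each token independently with probability `mask_prob`.

    Returns the corrupted sequence and the labels: the original token at
    masked positions, None elsewhere (positions ignored by the loss).
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Character-level tokens for toluene (an assumption made for illustration).
tokens = list("Cc1ccccc1")
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked)
print(labels)
```

The cross-entropy objective is then evaluated only at positions whose label is not None, so the model learns to reconstruct the hidden tokens from their context.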
Stage 2: Supervised Regression
Fine-tuning with the regression head on labeled lipophilicity values, starting from the unsupervised-adapted checkpoint.
  • Loss Function: MSELoss
  • Evaluation Metric: MAE
  • Optimizer: AdamW
  • Learning Rate: $1 \times 10^{-4}$
  • Batch Size: 64

Fine-Tuning Comparison

The two-stage approach (unsupervised MLM + supervised regression) outperforms direct supervised fine-tuning, demonstrating the benefit of domain adaptation before task-specific training.
Fine-Tuning Results (Test Set)

Influence Functions

Influence functions approximate the effect of each training point on model predictions. By computing the inverse Hessian-vector product (iHVP) through LiSSA (Linear time Stochastic Second-order Algorithm), we estimate how including or removing each external data point would affect the test loss.
Influence Score Formula
The influence of a training point on the test loss:
$$\mathcal{I}(z_i) = -\nabla_\theta \mathcal{L}_{\text{test}}^\top \, H_\theta^{-1} \, \nabla_\theta \mathcal{L}(z_i)$$
where $H_\theta$ is the Hessian of the training loss.
LiSSA Approximation
The iHVP is iteratively estimated:
$$\tilde{v}_{t+1} = v + (I - \delta H_\theta) \, \tilde{v}_t$$
Recursion depth = 200, damping = 0.01, scale = 0.04
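The recursion above can be checked on a toy problem. The sketch below is an illustration under stated assumptions: the 2×2 Hessian, the gradients, and $\delta = 0.25$ are made-up values, the Hessian is formed explicitly instead of being applied through stochastic Hessian-vector products, and the damping and scale settings listed above are not modelled. The fixed point of the recursion as written is $(\delta H_\theta)^{-1} v$, so the sketch multiplies the final iterate by $\delta$ to recover $H_\theta^{-1} v$.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lissa_ihvp(H, v, delta, depth=200):
    """Iterate v_{t+1} = v + (I - delta * H) v_t and return delta * v_depth.

    Converges when every eigenvalue of delta * H lies strictly between 0 and 2.
    """
    est = list(v)
    for _ in range(depth):
        h_est = matvec(H, est)
        est = [vi + ei - delta * hi for vi, ei, hi in zip(v, est, h_est)]
    return [delta * e for e in est]


def influence(grad_test, H, grad_train, delta):
    """I(z_i) = -grad_test^T H^{-1} grad_train, using the iterative estimate."""
    ihvp = lissa_ihvp(H, grad_train, delta)
    return -sum(g * s for g, s in zip(grad_test, ihvp))


# Toy values (assumptions): a small positive-definite Hessian and two gradients.
H = [[2.0, 1.0], [1.0, 3.0]]
grad_train = [1.0, 0.0]
grad_test = [0.5, 1.0]

ihvp = lissa_ihvp(H, grad_train, delta=0.25)
score = influence(grad_test, H, grad_train, delta=0.25)
# The exact solution of H s = grad_train for this toy system is s = [0.6, -0.2].
print(ihvp, score)
```

In practice the Hessian of a model with tens of millions of parameters is never materialised; each step would instead use a Hessian-vector product computed on a mini-batch.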

Data Selection Strategies

Four strategies were used to select approximately 185 data points from the 300-point external dataset for augmenting the training set. Each strategy captures different properties of the data.
Random Selection
Uniform Sampling Baseline
Randomly samples 185 data points from the external dataset uniformly without replacement. Serves as the baseline strategy.
Selected: 185 points
Active Learning
Monte Carlo Dropout Uncertainty
Uses MC Dropout (100 forward passes) to estimate prediction uncertainty. Points with uncertainty above a threshold are selected, targeting samples where the model is least confident.
Selected: 183 points
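The uncertainty criterion above can be sketched with a toy stand-in for the network. Everything model-related here is an assumption made for illustration: a linear model with inverted dropout on its inputs replaces MolFormer, and the names `dropout_forward`, `mc_dropout_uncertainty`, and `select_uncertain` as well as the threshold value are hypothetical.

```python
import random
import statistics


def dropout_forward(features, weights, p, rng):
    """One stochastic forward pass of a toy linear model with inverted dropout."""
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() >= p:  # keep the input with probability 1 - p
            total += w * x / (1.0 - p)
    return total


def mc_dropout_uncertainty(features, weights, p=0.2, passes=100, seed=0):
    """Mean and standard deviation of predictions over repeated dropout passes."""
    rng = random.Random(seed)
    preds = [dropout_forward(features, weights, p, rng) for _ in range(passes)]
    return statistics.mean(preds), statistics.stdev(preds)


def select_uncertain(pool, weights, threshold):
    """Return indices of pool points whose predictive std exceeds the threshold."""
    selected = []
    for idx, features in enumerate(pool):
        _, std = mc_dropout_uncertainty(features, weights, seed=idx)
        if std > threshold:
            selected.append(idx)
    return selected


# Toy pool: the all-zero point always predicts 0, so its uncertainty is 0.
pool = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
weights = [1.0, 1.0, 1.0]
print(select_uncertain(pool, weights, threshold=0.5))
```

Because selection is driven by a threshold rather than a fixed budget, the number of selected points is not fixed in advance, which is consistent with this strategy selecting 183 rather than 185 points.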
Diversity (MaxMin)
Structural Diversity Maximization
Uses the MaxMin diversity algorithm from DeepChem to select the most structurally diverse molecules based on molecular fingerprints, maximizing coverage of chemical space.
Selected: 185 points
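A greedy MaxMin selection can be sketched in plain Python. This is an illustration rather than the library routine used in the project: fingerprints are modelled as sets of on-bit indices, distance is one minus Tanimoto similarity, and `maxmin_select` with its fixed starting index is a hypothetical helper.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def maxmin_select(fingerprints, k, first=0):
    """Greedily pick k items, each maximising its distance to the nearest pick."""
    chosen = [first]
    # min_dist[i] is the distance from item i to its closest chosen item.
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(chosen) < min(k, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[nxt]))
    return chosen


# Toy fingerprints (illustrative on-bit sets, not real ECFP output).
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
picked = maxmin_select(fps, k=3)
print(picked)
```

Starting from item 0, the item with no shared bits is picked first, and the near-duplicate `{1, 2, 3, 4}` is left for last, which is the behaviour that spreads the selection across chemical space.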
Influence Score
Gradient-Based Selection
Selects points with positive influence scores (computed via LiSSA), meaning their inclusion is estimated to reduce the test loss. Points with negative influence are excluded.
Selected: 185 points

Strategy Comparison (Full Fine-Tuning)

Each selection strategy was combined with the full training set and used to fine-tune from the best checkpoint. Lower MSE indicates better predictive performance.
Data Selection Results (Test MSE)

Parameter-Efficient Fine-Tuning

Instead of updating all ~44.4M parameters during fine-tuning, PEFT techniques freeze most of the model and train only a small subset. Three PEFT methods were applied with each data selection strategy to evaluate their combined effectiveness.
BitFit
Bias-Term Fine-Tuning
Only bias terms and the regression head are trainable. All weight matrices remain frozen, providing the simplest form of parameter-efficient adaptation.
Trainable: ~75K / 44.4M
Ratio: 0.17%
Learning Rate: $1 \times 10^{-3}$
Classifier Dropout: 0.2
Figure: BitFit schematic. SMILES → pretrained model (only bias terms b trainable) → dropout → linear layer (regression head) → predictions.
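The BitFit selection rule can be sketched without any framework. The parameter names, the `regressor.` prefix for the head, and the layer sizes below are hypothetical stand-ins and not MolFormer's actual parameter names.

```python
def bitfit_trainable(param_sizes, head_prefix="regressor."):
    """Mark a parameter trainable if it is a bias term or belongs to the head."""
    return {
        name: name.endswith(".bias") or name.startswith(head_prefix)
        for name in param_sizes
    }


# Hypothetical parameter names mapped to their element counts.
toy_params = {
    "encoder.layer.0.attention.query.weight": 768 * 768,
    "encoder.layer.0.attention.query.bias": 768,
    "encoder.layer.0.ffn.dense.weight": 768 * 3072,
    "encoder.layer.0.ffn.dense.bias": 3072,
    "regressor.weight": 768,
    "regressor.bias": 1,
}

flags = bitfit_trainable(toy_params)
trainable = sum(size for name, size in toy_params.items() if flags[name])
total = sum(toy_params.values())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```

In PyTorch the same rule would typically be applied by iterating over `model.named_parameters()` and setting `requires_grad` on each parameter accordingly.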
LoRA
Low-Rank Adaptation
Injects trainable low-rank decomposition matrices into attention layers (Q, K, V) and FFN intermediate layers while keeping the original weights frozen.
Trainable: ~16.6M / 44.4M
Ratio: 37.4%
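Rank: 8
Alpha: 16
Learning Rate: $1 \times 10^{-4}$
Figure: LoRA schematic. The input x passes through the frozen pretrained weights and, in parallel, through the rank-r matrices A (initialised from U(-b, b)) and B (initialised to 0); the two outputs are summed to give h.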
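A minimal sketch of the LoRA forward pass, assuming the commonly used scaling factor alpha / r. The dimensions are toy values chosen for readability, and `init_lora`, `lora_forward`, and `lora_param_count` are hypothetical helper names.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def init_lora(d_out, d_in, r, bound, rng):
    """A ~ U(-bound, bound) with shape (r, d_in); B = 0 with shape (d_out, r)."""
    A = [[rng.uniform(-bound, bound) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    frozen = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [f + (alpha / r) * u for f, u in zip(frozen, low_rank)]


def lora_param_count(d_out, d_in, r):
    """Number of trainable values added to one adapted weight matrix."""
    return r * (d_in + d_out)


rng = random.Random(0)
d_in, d_out, r, alpha = 4, 3, 2, 4
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
A, B = init_lora(d_out, d_in, r, bound=0.5, rng=rng)
x = [1.0, -2.0, 0.5, 3.0]

frozen_out = matvec(W, x)
h_init = lora_forward(W, A, B, x, alpha, r)  # equals frozen_out because B = 0

B[0][0] = 0.1  # stand-in for a training update; only output 0 can change
h_updated = lora_forward(W, A, B, x, alpha, r)

print(lora_param_count(768, 768, 8), "added values vs", 768 * 768, "frozen")
```

Initialising B to zero means the adapted layer starts out identical to the pretrained one, so fine-tuning begins from the pretrained behaviour.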
(IA)³
Learned Rescaling Vectors
Introduces per-layer learned scaling vectors that multiply the outputs of attention key, value, and FFN intermediate layers. The most parameter-efficient method tested.
Trainable: ~28K / 44.4M
Ratio: 0.06%
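Learning Rate: $1 \times 10^{-3}$
Initialization: Ones
Figure: (IA)³ schematic showing the rescaling of K and V in the attention block (Q, K, V, softmax) and of the intermediate activation in the FFN (dense, nonlinearity, dense).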
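The rescaling itself is an elementwise product, sketched below with toy numbers. The vector name `l_k`, the hidden size of 4, and the "trained" value are illustrative assumptions.

```python
def ia3_scale(activations, scale):
    """Elementwise rescaling: each activation is multiplied by its learned factor."""
    return [a * s for a, s in zip(activations, scale)]


hidden = 4
l_k = [1.0] * hidden  # learned vector for the keys, initialised to ones
keys = [0.5, -1.0, 2.0, 0.0]

keys_init = ia3_scale(keys, l_k)  # identical to `keys` at initialisation

l_k[2] = 0.5  # stand-in for a value learned during fine-tuning
keys_trained = ia3_scale(keys, l_k)
print(keys_init, keys_trained)
```

With all factors initialised to one, the model's outputs are unchanged before training, and each adapted activation adds only one trainable value per hidden dimension.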

PEFT Trade-off Overview

Efficiency vs. Performance vs. Simplicity
Scores are normalised 0–10. Higher is better on each axis.

Combined Results

Each PEFT technique was combined with each data selection strategy, resulting in 15 experiment configurations (3 PEFT × 5 selection methods). The combined training data consists of the original training set augmented with the selected external data points.
PEFT × Data Selection (Test MSE)
Lower MSE is better. Lines show how each PEFT method responds to different data selection strategies.
  • Best BitFit: Random Selection strategy, 0.17% trainable parameters, MSE 0.477
  • Best LoRA: Random Selection strategy, 37.4% trainable parameters, MSE 0.083
  • Best (IA)³: Random Selection strategy, 0.06% trainable parameters, MSE 0.522

Key Findings

Data Selection
  • Random selection unexpectedly yields the best MSE across all PEFT methods, likely influenced by seed-specific subset selection
  • Active learning (MC Dropout) is the second-best strategy for BitFit and LoRA, reinforcing the value of uncertainty-based selection
  • Influence-based selection performs worst for (IA)³, highlighting method-specific sensitivity to data selection strategies
  • Structured data selection provides more reliable improvements than random selection due to lower variability across different runs
PEFT Techniques
  • LoRA achieves near full-fine-tuning performance with 37.4% trainable parameters
  • BitFit offers a good efficiency-performance trade-off with minimal implementation complexity
  • (IA)³ is the most parameter-efficient (0.06%) but shows the largest performance gap
  • Unsupervised pre-training (MLM) provides consistent improvement across all configurations

Conclusion

This project demonstrates that combining data selection with parameter-efficient fine-tuning enables effective molecular property prediction with significantly reduced computational requirements. Random selection paired with LoRA achieves the best performance overall (MSE 0.083), while BitFit and (IA)³ provide competitive alternatives with only 0.17% and 0.06% of trainable parameters respectively on carefully selected subsets of external data. These findings suggest practical pathways for resource-conscious adaptation of large pre-trained chemical language models to specific molecular prediction tasks.