

Image Segmentation with Scribble Supervision

End-to-end binary segmentation from sparse scribbles using an EfficientNet-B2 U-Net with region-based losses and careful augmentation for weak supervision.

built with
  • PyTorch
  • segmentation_models.pytorch
  • timm
  • EfficientNet-B2
  • U-Net
  • AdamW
  • Albumentations
Binary image segmentation, which divides an image into foreground and background regions at the pixel level, is fundamental in computer vision, with applications in medical imaging, autonomous driving, and interactive editing. This project addresses sparse supervision, where only a few scribbles are provided instead of dense masks. By pairing a modern deep learning architecture with carefully designed loss functions, we reach competitive segmentation performance (0.8569 mIoU on the first test set) while substantially reducing annotation cost. The approach combines an EfficientNet-B2 encoder with a U-Net decoder, optimized using region-based losses (Dice/IoU) blended with binary cross-entropy for stable training under weak supervision.

Project Overview

Challenge
Predict dense binary segmentation masks from RGB images paired with sparse scribble annotations, where only a tiny fraction of pixels are labeled.
Approach
End-to-end deep learning with early fusion of scribbles, region-based composite losses, and synthetic scribble augmentation to maximize generalization from limited supervision.

Methodology Pipeline

Five-Stage Workflow

The segmentation pipeline turns sparse scribbles into dense masks via a structured five-stage process, moving from input preparation to model selection and rigorous evaluation.
Primary Goal
Preserve alignment between images, scribbles, and masks while maximizing mIoU under weak supervision.
Outputs
Dense binary masks with stable calibration across folds.
1. Input Encoding: Scribble mapping, channel fusion, resizing, normalization
2. Architecture: EfficientNet-B2 encoder, U-Net decoder, scSE attention
3. Loss Design: Region-based (Dice/IoU) + BCE composite with class reweighting
4. Augmentation: Geometric transforms, scribble synthesis, intensity variations
5. Evaluation: 5-fold CV, ensemble averaging, mIoU-based selection

Problem Formulation

We observe an RGB image and a sparse scribble mask with three label states. The goal is to predict a dense binary mask that maximizes mean Intersection over Union (mIoU) against the ground truth.
Label Semantics
0 = background, 1 = foreground, ∅ = unlabeled
Optimization Target
Maximize mIoU between prediction and ground truth over foreground and background classes.
Mathematical Setup
RGB image
$b \in \mathbb{R}^{n_x \times n_y \times 3}$
Scribble mask
$c \in \{0, 1, \emptyset\}^{n_x \times n_y}$
Prediction
$\hat{g} \in \{0, 1\}^{n_x \times n_y}$
Ground truth
$g \in \{0, 1\}^{n_x \times n_y}$
Intersection over Union (IoU)
IoU measures the overlap between predicted and ground truth regions:
$\text{IoU}_k = \frac{|\{g = k\} \cap \{\hat{g} = k\}|}{|\{g = k\} \cup \{\hat{g} = k\}|}$
Ranges from 0 (no overlap) to 1 (perfect match)
Mean IoU (mIoU)
mIoU averages IoU across foreground and background classes:
$\text{mIoU} = \frac{\text{IoU}_{\text{bg}} + \text{IoU}_{\text{fg}}}{2}$
Provides balanced evaluation across both classes
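The two definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's evaluation code: the helper names `iou` and `mean_iou` are hypothetical, and treating a class that is absent from both maps as a perfect match is an assumption.

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray, k: int) -> float:
    """IoU for class k between two integer label maps of equal shape."""
    p, t = pred == k, target == k
    union = np.logical_or(p, t).sum()
    # Assumption: a class missing from both maps counts as a perfect match.
    if union == 0:
        return 1.0
    return float(np.logical_and(p, t).sum() / union)

def mean_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Average of background (0) and foreground (1) IoU."""
    return 0.5 * (iou(pred, target, 0) + iou(pred, target, 1))

g = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1]])      # ground truth
g_hat = np.array([[0, 0, 0, 1],
                  [0, 0, 1, 1]])  # prediction misses one foreground pixel

score = mean_iou(g_hat, g)  # foreground IoU 3/4, background IoU 4/5
```

In this toy case a single missed pixel costs the foreground class more than the background class, which is why averaging the two keeps the smaller class from being ignored.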

Input Processing & Encoding

Preprocessing Pipeline

Raw inputs undergo systematic transformations to prepare them for neural network processing, including scribble encoding, spatial resizing, channel-wise normalization, and early fusion of visual and annotation information.
Step 1: Scribble Encoding & Resizing
Mapping Sparse Annotations to Network Input
Scribbles are encoded numerically and resized to match the network architecture constraints while preserving annotation integrity.
Scribble Mapping
Background: $0 \rightarrow -1$
Foreground: $1 \rightarrow +1$
Unlabeled: $\emptyset \rightarrow 0$
Spatial Resizing
Original: $500 \times 375$
Resized: $480 \times 352$
Largest multiples of 32 that do not exceed the original resolution
Interpolation Methods
RGB Image: Lanczos (preserves edges)
Masks: Nearest-neighbor (preserves labels)
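A minimal sketch of the scribble mapping and the target-size rule, assuming the unlabeled state is stored as the sentinel value 255 in the raw scribble array (the source does not say how ∅ is stored). The names `UNLABELED`, `encode_scribble`, and `floor_to_multiple` are hypothetical.

```python
import numpy as np

UNLABELED = 255  # assumption: sentinel used for unlabeled pixels in the raw file

def encode_scribble(raw: np.ndarray) -> np.ndarray:
    """Map {0: background, 1: foreground, UNLABELED} to {-1, +1, 0}."""
    encoded = np.zeros(raw.shape, dtype=np.float32)  # unlabeled pixels stay 0
    encoded[raw == 0] = -1.0
    encoded[raw == 1] = 1.0
    return encoded

def floor_to_multiple(size: int, base: int = 32) -> int:
    """Largest multiple of `base` that does not exceed `size`."""
    return (size // base) * base

raw = np.array([[0, 255, 1],
                [255, 255, 1]], dtype=np.uint8)
encoded = encode_scribble(raw)

# 500 -> 480 and 375 -> 352, matching the resized resolution above.
target_size = (floor_to_multiple(500), floor_to_multiple(375))
```

Mapping unlabeled pixels to 0 puts them midway between the two labeled states, so the scribble channel carries no class evidence wherever nothing was annotated.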
Step 2: Normalization & Channel Fusion
Early Fusion Strategy
The scribble mask is concatenated as a fourth channel to the RGB image, providing immediate access to annotation information throughout the encoder hierarchy.
Channel-wise Normalization
Mean: $(0.4589, 0.4589, 0.4215)$
Std: $(0.2618, 0.2633, 0.2822)$
Computed from training set RGB channels
4-Channel Input Tensor
Input = concat(RGB_norm, Scribble)
Shape: [B, 4, 480, 352]
Enables contextual scribble interpretation
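The fusion step can be sketched as follows. This NumPy version assumes RGB values are already scaled to [0, 1] before normalization and that the scribble channel is left unnormalized; the function name `fuse` is hypothetical.

```python
import numpy as np

# Channel statistics quoted above (computed from the training set).
MEAN = np.array([0.4589, 0.4589, 0.4215], dtype=np.float32)
STD = np.array([0.2618, 0.2633, 0.2822], dtype=np.float32)

def fuse(rgb: np.ndarray, scribble: np.ndarray) -> np.ndarray:
    """rgb: [H, W, 3] in [0, 1]; scribble: [H, W] in {-1, 0, +1}. Returns [4, H, W]."""
    rgb_norm = (rgb.astype(np.float32) - MEAN) / STD  # broadcasts over the channel axis
    rgb_chw = np.transpose(rgb_norm, (2, 0, 1))       # channels first
    scribble_chw = scribble[None].astype(np.float32)  # add a channel axis
    return np.concatenate([rgb_chw, scribble_chw], axis=0)

rng = np.random.default_rng(0)
batch = np.stack([
    fuse(rng.random((480, 352, 3)), np.zeros((480, 352)))
    for _ in range(2)
])  # shape [B, 4, 480, 352] with B = 2
```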

Model Output & Thresholding

Probabilistic Prediction

The network outputs a continuous probability map $\hat{p} \in (0,1)^{n_x \times n_y}$, where each pixel value is the foreground probability.

Binary Decision

A fixed threshold $\tau = 0.5$ converts probabilities to binary predictions without calibration.
Thresholding Function
The binary mask is obtained by applying a fixed threshold to the probability map.
This produces a hard decision used for evaluation and reporting.
Gradient descent optimizes the soft probabilities $\hat{p}$, not the binarized masks $\hat{g}$.
$\hat{g}_i = \begin{cases} 1 & \text{if } \hat{p}_i \ge \tau \\ 0 & \text{otherwise} \end{cases}$
$\tau = 0.5$

Loss Function Design

Composite Loss Strategy

Training optimizes a weighted combination of region-based and pixel-wise objectives. Region-based losses (Dice, IoU) align directly with the mIoU evaluation metric, while binary cross-entropy provides stable gradients early in training and under sparse supervision.
Region-Based Loss Functions
Maximizing Overlap Metrics
Dice and IoU losses operate on entire prediction regions, providing strong supervision signal even when pixel-level labels are sparse.
Dice Loss
$\mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_i \hat{p}_i g_i}{\sum_i (\hat{p}_i + g_i)}$
Properties: Numerically stable, balanced precision-recall
IoU Loss
$\mathcal{L}_{\text{IoU}} = 1 - \frac{\sum_i \hat{p}_i g_i}{\sum_i (\hat{p}_i + g_i - \hat{p}_i g_i)}$
Properties: Differentiable surrogate that directly targets the IoU metric
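Both region losses can be written directly from the formulas above. The sketch below uses NumPy for readability, and the small `eps` smoothing term is an assumption added to avoid division by zero on empty masks; it is not part of the stated formulas.

```python
import numpy as np

def dice_loss(p_hat: np.ndarray, g: np.ndarray, eps: float = 1e-7) -> float:
    """Soft Dice loss: 1 - 2*sum(p*g) / sum(p + g)."""
    inter = np.sum(p_hat * g)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p_hat + g) + eps))

def iou_loss(p_hat: np.ndarray, g: np.ndarray, eps: float = 1e-7) -> float:
    """Soft IoU loss: 1 - sum(p*g) / sum(p + g - p*g)."""
    inter = np.sum(p_hat * g)
    union = np.sum(p_hat + g - p_hat * g)
    return float(1.0 - (inter + eps) / (union + eps))

g = np.array([1.0, 1.0, 0.0, 0.0])      # ground truth
p_hat = np.array([0.5, 0.5, 0.5, 0.5])  # maximally uncertain prediction

l_dice = dice_loss(p_hat, g)  # 1 - 2*1/4 = 0.5
l_iou = iou_loss(p_hat, g)    # 1 - 1/3  = 0.667 (approx.)
```

For the same prediction the IoU loss is never smaller than the Dice loss, since the soft IoU ratio is bounded above by the soft Dice ratio.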
Composite Objective Function
Blending Region and Pixel Losses
The final objective combines region-based terms with binary cross-entropy for improved gradient flow and training stability.
Weighted Combination
$\mathcal{L}_{\text{total}} = \lambda_{\text{REG}} \mathcal{L}_{\text{REG}} + \lambda_{\text{BCE}} \mathcal{L}_{\text{BCE}}$
REG ∈ {Dice, IoU} selected via cross-validation
Hyperparameters
λ_REG = 1.0
λ_BCE = 0.2
Fixed across all experiments
Class Imbalance Handling
Challenge: 3.88× more background pixels
$\mathcal{L}_{\text{BCE}} = -\sum_i \left[ p_c \, g_i \log \hat{p}_i + (1 - g_i) \log(1 - \hat{p}_i) \right]$
Tested $p_c \in \{1, 2, 3, 3.8\}$
Rationale
BCE: Reliable pixel-level guidance
Region Loss: Captures global structure
Gradients from region losses can be weak when overlap is low
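A minimal sketch of the composite objective with the weights quoted above ($\lambda_{\text{REG}} = 1.0$, $\lambda_{\text{BCE}} = 0.2$), using the IoU term as the region loss. Two assumptions are made for illustration: the BCE term is averaged over pixels rather than summed, and probabilities are clipped by `eps` to keep the logarithms finite.

```python
import numpy as np

def iou_loss(p_hat, g, eps=1e-7):
    """Soft IoU loss on probabilities."""
    inter = np.sum(p_hat * g)
    union = np.sum(p_hat + g - p_hat * g)
    return 1.0 - (inter + eps) / (union + eps)

def weighted_bce(p_hat, g, p_c=3.0, eps=1e-7):
    """BCE with positive-class weight p_c on the foreground term."""
    p = np.clip(p_hat, eps, 1.0 - eps)  # keep log() finite
    per_pixel = -(p_c * g * np.log(p) + (1.0 - g) * np.log(1.0 - p))
    return per_pixel.mean()  # assumption: mean reduction instead of a sum

def total_loss(p_hat, g, lambda_reg=1.0, lambda_bce=0.2, p_c=3.0):
    """lambda_reg * region loss + lambda_bce * weighted BCE."""
    return lambda_reg * iou_loss(p_hat, g) + lambda_bce * weighted_bce(p_hat, g, p_c)

g = np.array([1.0, 1.0, 0.0, 0.0])
p_hat = np.array([0.5, 0.5, 0.5, 0.5])
loss = total_loss(p_hat, g)
```

With $p_c > 1$, a missed foreground pixel costs more than a false positive of equal confidence, which is how the BCE term compensates for the background-heavy pixel ratio.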

Data Augmentation

Augmentation Strategy

Data augmentation increases training set diversity and prevents overfitting. We employ standard semantic segmentation transforms alongside a novel synthetic scribble generation algorithm to simulate varied annotation patterns.
Semantic Segmentation Augmentations
Standard Geometric & Photometric Transforms
Augmentations are applied jointly to image, scribble, and mask to preserve spatial correspondence and annotation integrity.
Geometric
  • Rotation
  • Translation
  • Scaling
  • Horizontal / vertical flips
Maintains spatial alignment across image, scribble, and mask.
Photometric
  • Color jitter
  • Brightness / contrast
  • Saturation
Applied to the RGB image only.
Spatial Deformations
  • Random crop
  • Elastic transform
  • Grid distortion
Aggressive deformations used for higher augmentation settings.
Filtering
  • Gaussian blur
Image-only smoothing to reduce high-frequency noise.
Foreground-Aware
  • Random crop with foreground bias
Prioritizes regions containing foreground pixels.
Consistency Rule
  • Same transform applied to image, scribble, and mask
  • Preserves label alignment under weak supervision
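The consistency rule can be illustrated with a plain NumPy sketch: geometric transforms receive all three arrays, photometric transforms receive only the image. The helper names are hypothetical and this does not reproduce the project's Albumentations pipeline.

```python
import numpy as np

def joint_hflip(image, scribble, mask, rng, p=0.5):
    """Geometric transform: flip image, scribble, and mask together."""
    if rng.random() < p:
        image, scribble, mask = image[:, ::-1], scribble[:, ::-1], mask[:, ::-1]
    return image, scribble, mask

def brightness_jitter(image, rng, max_delta=0.2):
    """Photometric transform: changes the RGB image only."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))
mask = np.zeros((4, 6), dtype=np.uint8)
mask[:, :2] = 1                 # foreground occupies the two left columns
scribble = np.zeros((4, 6), dtype=np.int8)
scribble[1, 0] = 1              # foreground scribble pixel
scribble[2, 5] = -1             # background scribble pixel

image_f, scribble_f, mask_f = joint_hflip(image, scribble, mask, rng, p=1.0)
image_f = brightness_jitter(image_f, rng)
```

After the flip the foreground scribble still lies on a foreground pixel of the mask, which is the property the rule is meant to protect.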
Synthetic Scribble Generation
Simulating Human Annotation Patterns
An algorithmic pipeline generates realistic scribble annotations from ground truth masks, enabling training on full segmentation labels while simulating sparse supervision.
Foreground Scribbles
Step 1: Morphological closing of mask
Step 2: Extract connected components
Step 3: Sample random line inside each region
Step 4: Dilate lines to match thickness
Only components above minimum area threshold are annotated
Background Scribbles
Step 1: Compute foreground bounding box
Step 2: Define four background regions with margin
Step 3: Randomly select one region (area-weighted)
Step 4: Draw and dilate background line
Encoded as -1 by subtracting the background scribble map from the foreground scribble map
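The four background steps could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's algorithm: the `margin` and `radius` values, the top/bottom/left/right region layout, the straight-line sampling, and the square dilation kernel are all hypothetical choices, and the foreground steps (closing, connected components) are omitted.

```python
import numpy as np

def dilate(binary, radius):
    """Square-kernel dilation built from shifted ORs (NumPy only)."""
    h, w = binary.shape
    padded = np.pad(binary, radius)
    out = np.zeros_like(binary)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def background_scribble(mask, rng, margin=2, radius=1):
    """mask: bool [H, W] with at least one foreground pixel."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    # Step 1: foreground bounding box, grown by the margin.
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    # Step 2: four regions (top, bottom, left, right) as (y_lo, y_hi, x_lo, x_hi).
    regions = [(0, y0, 0, w), (y1, h, 0, w), (y0, y1, 0, x0), (y0, y1, x1, w)]
    areas = np.array([(b - a) * (d - c) for a, b, c, d in regions], dtype=float)
    if areas.sum() == 0:
        return np.zeros_like(mask)  # the box fills the frame: nothing to draw
    # Step 3: pick one region with probability proportional to its area.
    a, b, c, d = regions[rng.choice(4, p=areas / areas.sum())]
    # Step 4: rasterize a line between two random points, then dilate it.
    ya, yb = rng.integers(a, b, size=2)
    xa, xb = rng.integers(c, d, size=2)
    n = max(abs(yb - ya), abs(xb - xa)) + 1
    line = np.zeros_like(mask)
    rows = np.linspace(ya, yb, n).round().astype(int)
    cols = np.linspace(xa, xb, n).round().astype(int)
    line[rows, cols] = True
    return dilate(line, radius) & ~mask  # never mark foreground as background

rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=bool)
mask[12:20, 12:20] = True
bg = background_scribble(mask, rng)
fg = np.zeros_like(mask)  # placeholder: foreground scribbles come from the other steps
encoded = fg.astype(np.int8) - bg.astype(np.int8)  # background pixels become -1
```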

Augmentation Configurations

Three augmentation intensity levels were systematically evaluated to determine the optimal trade-off between diversity and label preservation.
Low Augmentation
  • Affine transforms
  • Flips
  • Color jitter
  • Scribble augmentation
Focus: Basic geometric invariance
Medium Augmentation
  • Low augmentation +
  • Random crop
Focus: Spatial context variation
High Augmentation
  • Medium augmentation +
  • Elastic transform
  • Grid distortion
  • Gaussian blur
Focus: Aggressive deformation robustness

Model Architecture

Encoder Selection

Multiple lightweight encoders were evaluated. EfficientNet-B2 emerged as the optimal choice, balancing parameter efficiency with strong feature representation capacity.

Decoder Architecture

Standard U-Net decoder with scSE (Spatial and Channel Squeeze & Excitation) attention blocks for refined feature recalibration at multiple scales.
Architecture Exploration
Systematic Encoder Evaluation
Initial experiments with MobileNetV3 and ResNet-18 revealed insufficient capacity for the sparse supervision regime. EfficientNet models provided superior performance through compound scaling.
MobileNetV3
Pros: Extremely lightweight
Cons: Limited representational power
Unable to achieve low training loss
ResNet-18
Pros: Well-established architecture
Cons: Suboptimal parameter efficiency
Decent but not optimal performance
EfficientNet-B2
Pros: Balanced efficiency & capacity
Selected: Best validation mIoU
Optimal trade-off for this task
Final Architecture Details
EfficientNet-B2 U-Net with scSE Attention
The architecture combines efficient compound scaling with spatial and channel attention mechanisms for precise multi-scale feature refinement.
Encoder
Backbone: EfficientNet-B2
Initialization: From scratch
Dropout: 0.2 per block
Depth-5 hierarchy, 32x downsampling
Decoder
Style: U-Net symmetric skip connections
Attention: scSE blocks
Concurrent spatial and channel recalibration
Implementation
Framework: segmentation_models.pytorch
Encoder Source: timm
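For orientation, the architecture described above might be instantiated with keyword arguments along these lines. This is a sketch: the encoder identifier `"timm-efficientnet-b2"` is an assumption about which timm-backed variant was used, the per-block dropout of 0.2 is not configured here, and `build_model` and `stage_sizes` are hypothetical helpers.

```python
# Keyword arguments for segmentation_models_pytorch.Unet.
UNET_KWARGS = dict(
    encoder_name="timm-efficientnet-b2",  # assumption: timm-backed EfficientNet-B2
    encoder_depth=5,                      # five stages, 32x total downsampling
    encoder_weights=None,                 # initialized from scratch
    decoder_attention_type="scse",        # scSE blocks in the decoder
    in_channels=4,                        # RGB + scribble channel
    classes=1,                            # single foreground logit map
)

def build_model():
    """Construct the network; needs segmentation_models_pytorch installed."""
    import segmentation_models_pytorch as smp
    return smp.Unet(**UNET_KWARGS)

def stage_sizes(height, width, depth=UNET_KWARGS["encoder_depth"]):
    """Spatial size after each stride-2 encoder stage."""
    return [(height >> i, width >> i) for i in range(1, depth + 1)]

sizes = stage_sizes(480, 352)  # ends at (15, 11) after 32x downsampling
```

Because both input dimensions are multiples of 32, every stage halves cleanly and the skip connections line up without padding or cropping.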

Training Configuration

Optimization Strategy

AdamW optimizer with cosine annealing learning rate schedule and early stopping based on validation mIoU plateau detection.

Computational Efficiency

Mixed-precision training (bfloat16) on A100 GPUs enables larger batch sizes and faster training steps without sacrificing numerical stability.
Hyperparameters
Optimization
Optimizer: AdamW
Learning rate: 5e-3
Weight decay: 1e-4
Schedule
Policy: Cosine annealing
Min lr: 1e-7
Training length: 8K–16K steps
Batching
Batch sizes: 6, 8, 12
Evaluated across configurations
Precision & Hardware
Precision: bfloat16
Accelerator: A100 GPUs
Early Stopping
Patience: 60 epochs
Metric: Validation mIoU
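The learning-rate schedule implied by these settings can be written in closed form. The sketch below assumes a single cosine cycle with no warmup over 8,000 steps (the short end of the stated range); in PyTorch the same shape would typically come from `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` and `eta_min`, though the project's exact setup is not given.

```python
import math

LR_MAX, LR_MIN = 5e-3, 1e-7
TOTAL_STEPS = 8_000  # assumption: short end of the 8K-16K range

def cosine_lr(step, total=TOTAL_STEPS, lr_max=LR_MAX, lr_min=LR_MIN):
    """Cosine annealing from lr_max at step 0 down to lr_min at `total`."""
    progress = min(step, total) / total
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(s) for s in (0, TOTAL_STEPS // 2, TOTAL_STEPS)]
```

The rate starts at 5e-3, passes roughly 2.5e-3 at the midpoint, and ends at the 1e-7 floor.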

Experimental Results

Evaluation Methodology

5-fold cross-validation was employed to robustly estimate performance and select the optimal configuration. Final test predictions used an ensemble of all five fold models to reduce variance and improve generalization.
Cross-Validation
Folds: 5
Strategy: Stratified split
Selection: Max mean mIoU across folds
Ensemble Prediction
Method: Average probability maps
Models: All 5 fold checkpoints
Threshold: τ = 0.5
Data Utilization
Training set: ≈181 images
Leakage: None (fold-based inference)
Efficiency: All data used for final model
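The ensemble step described above averages probabilities before thresholding. A minimal NumPy sketch, with made-up 2×2 probability maps standing in for the five fold models (the function name `ensemble_predict` is hypothetical):

```python
import numpy as np

def ensemble_predict(prob_maps, tau=0.5):
    """Average per-fold foreground probabilities, then threshold at tau."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_prob >= tau).astype(np.uint8), mean_prob

# Toy outputs from five fold models (illustrative values only).
fold_probs = [
    np.array([[0.9, 0.4], [0.2, 0.9]]),
    np.array([[0.8, 0.7], [0.1, 0.9]]),
    np.array([[0.7, 0.6], [0.3, 0.4]]),
    np.array([[0.9, 0.3], [0.2, 0.4]]),
    np.array([[0.8, 0.7], [0.1, 0.4]]),
]
mask, mean_prob = ensemble_predict(fold_probs)
```

The bottom-right pixel shows how averaging differs from majority voting: three of five models fall below 0.5, but two confident models lift the mean to 0.6, so the pixel is labeled foreground.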

Performance Summary

Test Set Performance
0.8569
mIoU on first test set
Background IoU
0.9215
Foreground IoU
0.7923
Cross-Validation mIoU
0.8758
Mean across 5 folds (best config)
IoU Loss + Low Aug + Batch 8 + $p_c = 3$
Standard Deviation
±0.0169
Fold variance (best config)
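As a quick consistency check, the reported test mIoU follows from the two per-class values under the mIoU definition:

```latex
\text{mIoU}
  = \frac{\text{IoU}_{\text{bg}} + \text{IoU}_{\text{fg}}}{2}
  = \frac{0.9215 + 0.7923}{2}
  = \frac{1.7138}{2}
  = 0.8569
```

The gap between the two classes (0.9215 vs. 0.7923) shows that most of the remaining error sits on the foreground.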

Segmentation Example

Visual comparison of model predictions on a test image demonstrates the quality of dense mask recovery from sparse scribble annotations.
Original Image
[Figure: original sheep image] Input image with background and foreground objects
Sparse Scribbles
[Figure: sparse scribble annotations] Minimal supervision: only scribbles provided
Predicted Mask
[Figure: dense prediction from model] Dense mask recovered from sparse input

Ablation Studies

Systematic experiments across 72 configurations revealed the impact of augmentation intensity, positive class weighting, batch size, and loss function on final performance, particularly in the low-data regime.
Loss Function vs Augmentation
IoU loss outperforms Dice loss across all augmentation levels. Lower augmentation yields better results for this task.
Positive Weight Sensitivity
Class weighting $p_c = 3$ provides optimal balance for foreground objects.
Batch Size Impact
Batch size 8 achieves best trade-off between stability and generalization.
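The count of 72 configurations is consistent with a full grid over the four factors named above (2 losses × 3 augmentation levels × 3 batch sizes × 4 positive weights). The sketch below enumerates such a grid; that the experiments were organized as a full factorial sweep is an assumption, and the dictionary keys are hypothetical names.

```python
from itertools import product

GRID = {
    "loss": ["dice", "iou"],
    "augmentation": ["low", "medium", "high"],
    "batch_size": [6, 8, 12],
    "p_c": [1, 2, 3, 3.8],
}

# One dict per configuration, in the order of the Cartesian product.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]

# Best configuration reported in the performance summary.
best = {"loss": "iou", "augmentation": "low", "batch_size": 8, "p_c": 3}
```

Each configuration would then be trained once per fold and ranked by its mean validation mIoU across the five folds.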
Augmentation Impact
Effect on Validation mIoU
Contrary to typical expectations, aggressive augmentation reduced validation mIoU in this sparse supervision setting.
Observation
Higher augmentation → Lower validation mIoU
Averaged across loss functions and folds
Hypothesis
Small training set (≈181 images)
Challenging sparse annotation regime
Aggressive transforms may corrupt weak supervision signal
Recommendation
Low augmentation performed best
Suggests careful augmentation design critical for scribble supervision
Positive Weight Sensitivity
Class Imbalance Reweighting
Tested BCE positive weights $p_c \in \{1, 2, 3, 3.8\}$ to address the 3.88× background-to-foreground pixel ratio.
Finding
Limited effect on validation mIoU across tested values
Region-based losses may already handle imbalance effectively
Interpretation
Dice and IoU are inherently robust to class imbalance
BCE contribution (λ=0.2) is secondary to region losses

Key Findings

Architecture Insights
  • EfficientNet-B2 provides optimal efficiency-performance trade-off
  • scSE attention enhances feature refinement at minimal cost
  • Early fusion of scribbles enables contextual interpretation
Loss Design Insights
  • Region-based losses align well with mIoU evaluation
  • BCE stabilizes training under sparse supervision
  • IoU loss outperforms Dice loss across all augmentation levels in the ablations
Augmentation Insights
  • Low augmentation outperforms aggressive transforms
  • Small dataset regime requires careful augmentation design
  • Synthetic scribble generation adds training diversity
Training Insights
  • Ensemble averaging improves test set stability
  • Early stopping on validation mIoU limits overfitting to the training set
  • Mixed-precision training enables efficient experimentation

Conclusion

The final system demonstrates that carefully designed loss functions and strong encoder-decoder backbones can achieve competitive segmentation from sparse scribbles. The approach reduces annotation burden while preserving high-quality masks, making it viable for large datasets with limited labeling budgets.