Scribble-Supervised Image Segmentation

Dense segmentation labels are expensive: every object boundary has to be traced pixel by pixel. This project asks a cheaper question—if an annotator marks only a few confident foreground and background regions, can a model recover the complete object?

The resulting system fuses a sparse scribble with an RGB image, predicts a full-resolution probability map, and turns it into a binary mask. The strongest configuration combined an EfficientNet-B2 encoder, a U-Net decoder with scSE attention, region-aware loss, restrained augmentation, and a five-fold ensemble.

181 training images

4 input channels

72 configurations

0.8569 test mIoU

Figure: Original photograph of three sheep standing in grass. One test example across the annotation pipeline: a handful of sparse strokes is converted into a coherent foreground mask.

The Supervision Problem

Each sample contains an image, a tri-state scribble map, and a dense mask used only for training and evaluation. The learning signal deliberately distinguishes “background,” “foreground,” and “not annotated” rather than treating every unmarked pixel as background.

Image: $b \in \mathbb{R}^{n_x \times n_y \times 3}$ (visual evidence)

Scribble: $c \in \{0,1,\emptyset\}^{n_x \times n_y}$ (0 background · 1 foreground · $\emptyset$ not annotated; encoded for the network as −1 background · 0 unknown · +1 foreground)

Prediction: $\hat{g} \in \{0,1\}^{n_x \times n_y}$ (dense output)

Evaluation target: mean intersection over union

$$\operatorname{mIoU}=\frac{\operatorname{IoU}_{bg}+\operatorname{IoU}_{fg}}{2}$$

Averaging both classes prevents the larger background region from hiding weak foreground recovery.
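To make the metric concrete, here is a minimal NumPy sketch of the two-class mean. The function names are illustrative, and the handling of a class that is absent from both masks is an assumption rather than something the project specifies.

```python
import numpy as np

def binary_iou(pred, target, cls):
    """IoU for one class label (0 = background, 1 = foreground)."""
    p = pred == cls
    t = target == cls
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # assumption: a class missing from both masks counts as a perfect match
    return np.logical_and(p, t).sum() / union

def mean_iou(pred, target):
    """Unweighted mean of background and foreground IoU."""
    return 0.5 * (binary_iou(pred, target, 0) + binary_iou(pred, target, 1))

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

fg = binary_iou(pred, target, 1)   # 3 shared pixels / 4 in the union
bg = binary_iou(pred, target, 0)   # 4 shared pixels / 5 in the union
score = mean_iou(pred, target)
```

One missed foreground pixel costs the foreground class 0.25 IoU but the background class only 0.20, which is why the per-class average is more sensitive to foreground errors than a pooled pixel count would be.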

From Sparse Input to Dense Mask

The pipeline is deliberately linear. Every stage changes one representation, which makes failures easier to localize than a collection of loosely connected preprocessing and training panels.

01 Encode: Fuse RGB pixels with the sparse scribble as a fourth channel. ([B, 4, 480, 352])

02 Extract: Build a five-level EfficientNet-B2 feature hierarchy. (32× downsample)

03 Recover: Decode with U-Net skip connections and scSE attention. (dense logits)

04 Optimize: Balance region overlap against pixel-wise classification. (IoU + 0.2 BCE)

05 Validate: Average five fold-specific probability maps before thresholding. (5-fold ensemble)

Input Encoding

Images are resized from 500 × 375 to 480 × 352, the largest size not exceeding the original in which both dimensions are divisible by 32, so that the encoder's 32× downsampling yields whole-number feature maps. Lanczos interpolation preserves RGB edges; nearest neighbor keeps mask labels discrete. The normalized image and encoded scribble are then concatenated before the first convolution.

RGB: 3 channels · dataset mean and standard deviation

Scribble: 1 channel · −1, 0, and +1 encoding

Network tensor: [B, 4, 480, 352] · early fusion

Probability map: $\hat{p} \in (0,1)^{n_x \times n_y}$

Decision rule: $\hat{g}_{ij}=\mathbb{1}[\hat{p}_{ij}\geq 0.5]$
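The encoding and decision rule can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, `MEAN` and `STD` are placeholder statistics because the real dataset values are not given here, and the axis order simply follows the quoted tensor shape.

```python
import numpy as np

def floor_to_multiple(size, base=32):
    """Largest multiple of `base` that does not exceed `size`."""
    return (size // base) * base

def encode_input(rgb, scribble, mean, std):
    """Fuse an RGB image [n_x, n_y, 3] with a scribble map [n_x, n_y] in {-1, 0, +1}."""
    normalized = (rgb - mean) / std                       # per-channel standardization
    hint = scribble.astype(np.float32)[..., None]         # [n_x, n_y, 1]
    fused = np.concatenate([normalized, hint], axis=-1)   # [n_x, n_y, 4]
    return np.transpose(fused, (2, 0, 1)).astype(np.float32)  # channels first

def binarize(prob, threshold=0.5):
    """Decision rule: 1 where the probability reaches the threshold."""
    return (prob >= threshold).astype(np.uint8)

# Placeholder statistics (assumption); substitute the dataset's own values.
MEAN = np.array([0.5, 0.5, 0.5])
STD = np.array([0.25, 0.25, 0.25])

n_x, n_y = floor_to_multiple(500), floor_to_multiple(375)   # 480, 352
rgb = np.random.default_rng(0).random((n_x, n_y, 3))        # stand-in for a resized image
scribble = np.zeros((n_x, n_y), dtype=np.int8)              # 0 = unknown
scribble[100:110, 50:60] = 1                                # foreground stroke
scribble[400:410, 300:310] = -1                             # background stroke

batch = encode_input(rgb, scribble, MEAN, STD)[None]        # [1, 4, 480, 352]
```

Keeping "unknown" at zero means unmarked pixels contribute nothing through the fourth channel at the first convolution, while the two stroke types push in opposite directions.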

Designing the Objective

Pixel accuracy is a poor guide when most pixels are background. The objective therefore combines an overlap loss, which reasons about the whole region, with a smaller weighted BCE term that keeps individual pixel decisions calibrated.

Reward the shared foreground mass.

$$\mathcal{L}_{Dice}=1-\frac{2\sum_i p_i y_i+\epsilon}{\sum_i p_i+\sum_i y_i+\epsilon}$$

Smooth and stable when the target occupies only a small part of the image.
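A sketch of how the region and pixel terms might be combined follows. The Dice term mirrors the formula above; the soft IoU form, the value of `eps`, and the function names are assumptions, since only the "IoU + 0.2 BCE" weighting and the foreground weight of 3.0 are stated in this write-up.

```python
import numpy as np

def soft_dice_loss(p, y, eps=1.0):
    """1 - (2 * sum(p*y) + eps) / (sum(p) + sum(y) + eps), as in the Dice formula."""
    inter = (p * y).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + y.sum() + eps)

def soft_iou_loss(p, y, eps=1.0):
    """Assumed soft IoU (Jaccard) form: 1 - (intersection + eps) / (union + eps)."""
    inter = (p * y).sum()
    union = p.sum() + y.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def weighted_bce(p, y, pos_weight=3.0, tiny=1e-7):
    """Binary cross-entropy with foreground pixels up-weighted by `pos_weight`."""
    p = np.clip(p, tiny, 1.0 - tiny)  # avoid log(0)
    per_pixel = -(pos_weight * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(per_pixel.mean())

def combined_loss(p, y, bce_weight=0.2, pos_weight=3.0):
    """Region overlap plus a smaller pixel-wise term (IoU + 0.2 BCE)."""
    return soft_iou_loss(p, y) + bce_weight * weighted_bce(p, y, pos_weight)

y = np.zeros((8, 8))
y[2:5, 2:5] = 1.0                      # small foreground square
perfect = combined_loss(y, y)          # prediction equal to the target
inverted = combined_loss(1.0 - y, y)   # every pixel wrong
```

Both region terms reach zero for a perfect prediction regardless of how small the object is, whereas the BCE term scales with the number of misclassified pixels; that difference is why the overlap loss carries most of the weight.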

Augmentation Without Breaking Labels

An image transform is only valid when it is applied identically to the RGB image, the sparse scribble, and the dense target. The experiments compared three recipes, increasing spatial distortion while keeping that alignment invariant.

Selected recipe: Preserve the annotation geometry.

affine · horizontal flip · vertical flip · color jitter · synthetic scribble

Small viewpoint and color changes improved generalization without destroying thin foreground hints.
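The alignment rule can be illustrated with a small NumPy sketch: geometric transforms touch all three arrays, photometric ones touch only the image. The function names, jitter ranges, and flip probability are illustrative assumptions, not the project's actual settings.

```python
import numpy as np

def joint_hflip(image, scribble, mask):
    """Mirror all three arrays along the width axis so labels stay aligned."""
    return image[:, ::-1].copy(), scribble[:, ::-1].copy(), mask[:, ::-1].copy()

def color_jitter(image, gain, bias):
    """Photometric change: applied to the RGB image only, never to labels."""
    return np.clip(image * gain + bias, 0.0, 1.0)

def augment(image, scribble, mask, rng, p_flip=0.5):
    if rng.random() < p_flip:
        image, scribble, mask = joint_hflip(image, scribble, mask)
    gain = rng.uniform(0.9, 1.1)     # assumed mild contrast range
    bias = rng.uniform(-0.05, 0.05)  # assumed mild brightness range
    return color_jitter(image, gain, bias), scribble, mask

rng = np.random.default_rng(0)
image = rng.random((6, 8, 3))
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:5, 0:3] = 1                    # object on the left edge
scribble = np.zeros((6, 8), dtype=np.int8)
scribble[2, 1] = 1                    # foreground hint inside the object
scribble[3, 6] = -1                   # background hint outside it

aug_image, aug_scribble, aug_mask = augment(image, scribble, mask, rng, p_flip=1.0)
```

After the flip the foreground hint still lands on an object pixel and the background hint on a background pixel, which is the invariant every recipe has to keep.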

Synthetic Scribble Generator

Dense masks can be converted into plausible sparse annotations, letting the model see a wider range of marking styles without requiring another annotation pass.

01 Find interior regions: morphology and connected components

02 Draw sparse traces: random lines with controlled dilation

03 Sample background: weighted regions outside the object bounds
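A heavily simplified version of these three steps is sketched below using only NumPy. It is not the project's generator: connected-component analysis, stroke dilation, and region weighting are omitted, and every helper name and the `margin` parameter are hypothetical.

```python
import numpy as np

def erode(mask, iterations=1):
    """4-neighbour binary erosion; pixels beyond the image border count as empty."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1, constant_values=False)
        m = p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m

def draw_line(shape, p0, p1):
    """Rasterize a straight segment between two (row, col) points."""
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    rows = np.round(np.linspace(p0[0], p1[0], n)).astype(int)
    cols = np.round(np.linspace(p0[1], p1[1], n)).astype(int)
    canvas = np.zeros(shape, dtype=bool)
    canvas[rows, cols] = True
    return canvas

def random_stroke(region, rng):
    """One line between two random pixels of `region`, clipped to stay inside it."""
    pts = np.argwhere(region)
    if len(pts) < 2:
        return np.zeros(region.shape, dtype=bool)
    a, b = pts[rng.choice(len(pts), size=2, replace=False)]
    return draw_line(region.shape, a, b) & region

def synthetic_scribble(mask, rng, margin=3):
    """Return a map with +1 foreground, -1 background, and 0 unknown."""
    fg_region = erode(mask, margin)                 # step 1: object interior
    bg_region = erode(~mask.astype(bool), margin)   # step 3: background away from the boundary
    scribble = np.zeros(mask.shape, dtype=np.int8)
    scribble[random_stroke(bg_region, rng)] = -1
    scribble[random_stroke(fg_region, rng)] = 1     # step 2: sparse trace inside the object
    return scribble

rng = np.random.default_rng(0)
mask = np.zeros((40, 40), dtype=np.uint8)
mask[10:30, 10:30] = 1
scribble = synthetic_scribble(mask, rng)
```

Eroding both regions before drawing keeps every synthetic stroke a few pixels away from the true boundary, which mimics an annotator marking only confident areas.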

Architecture

Early fusion lets every encoder level interpret appearance together with annotation. The decoder restores spatial detail through U-Net skip connections, while scSE blocks recalibrate both channels and locations before the final one-channel prediction.

Input: RGB + scribble · 4 × 480 × 352

Encoder: E1 · E2 · E3 · E4 · E5 (EfficientNet-B2)

Bottleneck: scSE · dropout 0.2

Decoder: D5 · D4 · D3 · D2 · D1 (U-Net recovery)

Output: sigmoid · 1 × 480 × 352

The selected architecture. Five encoder scales feed matching decoder stages through skip connections, preserving boundaries that would be lost at the 32× bottleneck.
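The link between the 32× bottleneck and the 480 × 352 input size can be checked with a short shape walk. This is a simplified model that assumes each of the five encoder levels exactly halves both spatial dimensions; real convolution layers pad odd sizes instead of failing, which is what produces mismatched skip connections.

```python
def feature_shapes(n_x, n_y, levels=5):
    """Spatial size after each stride-2 level, assuming exact halving."""
    shapes = []
    for level in range(1, levels + 1):
        if n_x % 2 or n_y % 2:
            raise ValueError(f"{n_x} x {n_y} cannot be halved cleanly at level {level}")
        n_x, n_y = n_x // 2, n_y // 2
        shapes.append((n_x, n_y))
    return shapes

encoder_shapes = feature_shapes(480, 352)
# [(240, 176), (120, 88), (60, 44), (30, 22), (15, 11)]

try:
    feature_shapes(500, 375)   # the original image size
    original_fits = True
except ValueError:
    original_fits = False      # 375 is odd, so the first halving already fails
```

At the deepest level each feature covers a 32 × 32 pixel patch, so the 15 × 11 bottleneck alone cannot localize thin boundaries; the decoder needs the higher-resolution skip features to place them.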

Choosing the Encoder

Efficiency baseline: MobileNetV3. Too little capacity for fine boundaries.

Balanced baseline: ResNet18. Competitive, but less stable across folds.

Selected encoder: EfficientNet-B2. Best accuracy-to-compute trade-off.

Training and Evaluation

Seventy-two configurations crossed loss function, augmentation strength, batch size, and positive-class weighting. Validation was stratified into five folds; at inference, the five probability maps were averaged before applying the 0.5 threshold.
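Averaging probabilities before thresholding is not the same as letting the five models vote. The sketch below uses made-up probabilities for two pixels to show the difference; the function names are illustrative.

```python
import numpy as np

def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-fold probability maps, then apply the decision threshold."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_prob >= threshold).astype(np.uint8), mean_prob

def majority_vote(prob_maps, threshold=0.5):
    """Alternative for comparison: threshold each fold first, then vote."""
    votes = np.mean([p >= threshold for p in prob_maps], axis=0)
    return (votes >= 0.5).astype(np.uint8)

# Hypothetical outputs of five fold models for a 1 x 2 image.
fold_probs = [np.array([[p, q]]) for p, q in
              [(0.95, 0.30), (0.95, 0.40), (0.45, 0.20), (0.45, 0.10), (0.45, 0.25)]]

mask, mean_prob = ensemble_mask(fold_probs)   # first pixel averages to 0.65
voted = majority_vote(fold_probs)             # only 2 of 5 folds exceed 0.5 there
```

For the first pixel, two confident folds outweigh three mildly doubtful ones under averaging, while a vote would discard that confidence information.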

Optimizer: AdamW
Learning rate: 5 × 10⁻³ → 1 × 10⁻⁷ cosine
Weight decay: 1 × 10⁻⁴
Precision: bfloat16 on A100
Training budget: 8K–16K steps
Early stopping: patience of 60 validation checks on mIoU
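The learning-rate line can be read as a single cosine decay between the two endpoints. The sketch below assumes a per-step update with no warmup and no restarts, and uses the lower end of the training budget (8,000 steps) as the schedule length; none of those details are stated here.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-3, lr_min=1e-7):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    t = min(max(step, 0), total_steps) / total_steps   # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

TOTAL_STEPS = 8_000  # assumption: lower end of the 8K-16K budget

lr_start = cosine_lr(0, TOTAL_STEPS)
lr_mid = cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS)
lr_end = cosine_lr(TOTAL_STEPS, TOTAL_STEPS)
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end, so the last portion of training makes only very small updates.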

Test mean IoU: 0.8569 (five-model ensemble)

Background IoU: 0.9215

Foreground IoU: 0.7923

Cross-validation: 0.8758 ± 0.0169
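As a consistency check, the two per-class test values reproduce the headline number under the mIoU definition used throughout:

$$\operatorname{mIoU}=\frac{0.9215+0.7923}{2}=\frac{1.7138}{2}=0.8569$$

The gap of roughly 0.13 between the two classes shows that most of the remaining error sits in foreground recovery.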

Connected-dot view of mean cross-validation IoU by augmentation strength and region loss. Restrained augmentation performed best for both objectives; aggressive warping consistently reduced overlap.

Foreground weighting under the low-augmentation recipe. IoU loss peaked at a weight of 3.0, while Dice changed more gradually across the tested range.

Dot plot of batch-size sensitivity for IoU loss with low augmentation. A focused vertical scale makes the small differences legible; exact values show that the full spread is only 0.0007 mean IoU.

What the Experiments Changed

Mild augmentation won.

Low-intensity geometric and photometric changes preserved the sparse supervisory signal better than aggressive warps.

A moderate positive weight was enough.

A foreground weight of 3.0 produced the strongest IoU run; pushing to the raw 3.88× class ratio did not help.

Batch size was not the main lever.

Batches of 6, 8, and 12 stayed within seven ten-thousandths of mean IoU once the loss and augmentation recipe were fixed.

Sparse input still recovered coherent objects.

The five-model ensemble reached 0.8569 test mIoU while using only scribbled pixels as explicit annotation.

The main result is not that scribbles behave like dense labels—they do not—but that a carefully aligned pipeline can extract much more supervision from them than their size suggests. The next useful step would be testing richer scribble policies and uncertainty-guided annotation, so the model can ask for the small number of additional strokes that would improve a difficult mask most.