Semantic Segmentation Base Models
- DeepLabV3+ (ResNet-50 backbone)
A robust baseline model for semantic segmentation, DeepLabV3+ with a ResNet-50 backbone strikes a good balance between speed and accuracy. Suitable for medium-scale applications where both performance and resource efficiency are important.
- SegFormer (MiT-B0 backbone)
A transformer-based semantic segmentation model that integrates hierarchical attention mechanisms and lightweight decoders, achieving efficient and accurate segmentation across diverse image resolutions and scales.
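A minimal sketch of instantiating this variant with Hugging Face `transformers`, building the model from a local config so no checkpoint download is needed. The encoder depths and channel widths below mirror the published MiT-B0 settings (and match the library defaults); `num_labels=19` is an assumption (e.g. Cityscapes) to replace with your own label count.

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# MiT-B0-scale config built locally (no pretrained weights).
config = SegformerConfig(
    num_labels=19,                     # assumed label count -- adjust
    depths=[2, 2, 2, 2],               # MiT-B0 encoder depths
    hidden_sizes=[32, 64, 160, 256],   # MiT-B0 channel widths per stage
    decoder_hidden_size=256,           # lightweight all-MLP decoder width
)
model = SegformerForSemanticSegmentation(config)
model.eval()

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    logits = model(pixel_values=x).logits

# SegFormer predicts at 1/4 of the input resolution; upsample
# (e.g. torch.nn.functional.interpolate) for full-size masks.
print(logits.shape)  # torch.Size([1, 19, 128, 128])
```

Swapping in the B4/B5 depths and widths (or loading a pretrained checkpoint with `from_pretrained`) gives the larger variants discussed below.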
- SegFormer-large (MiT-B4 backbone)
Offers significantly better accuracy than MiT-B0 by using a deeper backbone. A great choice for scenarios requiring high-quality segmentation results on complex scenes, while still being manageable on standard GPUs (12–16 GB).
- SegFormer-x-large (MiT-B5 backbone)
The most accurate SegFormer variant, especially effective on diverse, densely annotated datasets. Best suited for large-scale training or inference tasks where precision is critical. Requires high-end GPUs with larger memory (≥24 GB).
Best Overall Model (Top Performer)
SegFormer-x-large (MiT-B5):
- Use when: You have a large dataset and need the best possible accuracy.
- Why: Delivers top accuracy on large-scale benchmarks such as ADE20K and scales well with both data and model size.
Recommendations by Dataset Size
- Large Dataset (≥ 10k images)
✅ SegFormer-x-large (MiT-B5)
✅ SegFormer-large (MiT-B4)
✅ DeepLabV3+ (ResNet-101)
Why: These models scale well with large datasets and offer high accuracy on complex multi-class segmentation like ADE20K.
- Medium Dataset (2k–10k images)
✅ SegFormer-large (MiT-B4)
✅ DeepLabV3+ (ResNet-50)
Why: Balanced models offering a good trade-off between accuracy and training time on medium-scale datasets.
- Small Dataset (≤ 2k images)
✅ SegFormer (MiT-B0)
✅ DeepLabV3+ (ResNet-50)
Why: These are lightweight and stable models that avoid overfitting and work well with limited data and compute.
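The dataset-size recommendations above can be expressed as a small helper function. This is purely illustrative; the function name and thresholds simply encode the buckets from the lists above.

```python
def recommend_models(num_images: int) -> list[str]:
    """Return the model shortlist for a dataset size, per the guide above."""
    if num_images >= 10_000:      # Large dataset
        return [
            "SegFormer-x-large (MiT-B5)",
            "SegFormer-large (MiT-B4)",
            "DeepLabV3+ (ResNet-101)",
        ]
    if num_images > 2_000:        # Medium dataset (2k-10k)
        return ["SegFormer-large (MiT-B4)", "DeepLabV3+ (ResNet-50)"]
    # Small dataset (<= 2k): lightweight models that resist overfitting
    return ["SegFormer (MiT-B0)", "DeepLabV3+ (ResNet-50)"]

print(recommend_models(15_000)[0])  # SegFormer-x-large (MiT-B5)
```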