Fine-Tuning YOLOv8 on a Custom Dataset: A Practical Guide From Annotation to Deployment

Most YOLOv8 tutorials show you the happy path: a clean public dataset, default hyperparameters, a few training epochs, and a confusion matrix that looks great. Real production datasets are never that clean. Below is what actually worked across several custom object detection projects - from a defect-detection system on a manufacturing line to a retail shelf-monitoring pipeline - and the mistakes that wasted the most training time.

Start With the Dataset, Not the Model

The single biggest predictor of final model quality is annotation consistency, not architecture choice. Before touching a config file, I run a manual audit pass on 5% of the dataset, checking for three failure patterns: missed objects (false negatives baked into ground truth), inconsistent box tightness between annotators, and class confusion on visually similar categories. Fixing these after training starts is far more expensive than fixing them before.
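To make the 5% audit repeatable, I find it helps to draw the sample with a fixed seed so a second reviewer sees the same images. The helper below is a minimal sketch, not part of any library: the function name, the extension list, and the default seed are my own choices.

```python
import random
from pathlib import Path


def sample_for_audit(image_dir, fraction=0.05, seed=0):
    """Return a reproducible random sample of image paths for manual review."""
    exts = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed image extensions
    images = sorted(p for p in Path(image_dir).rglob("*") if p.suffix.lower() in exts)
    if not images:
        return []
    # Always review at least one image, even on tiny datasets.
    k = max(1, round(len(images) * fraction))
    return random.Random(seed).sample(images, k)
```

Calling `sample_for_audit("dataset/images/train")` returns the paths to open side by side with their label files.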

For class balance, aim for at least 200-300 labeled instances per class before expecting reliable mAP. Below that, the model tends to either overfit to background context or fail to generalize across object orientations and lighting conditions.
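A quick way to check the per-class floor is to count instances directly from the label files. This is a sketch that assumes YOLO-format labels (class index first on every line); the function names and the default threshold of 200 are illustrative.

```python
from collections import Counter
from pathlib import Path


def count_instances(label_dir):
    """Count labeled boxes per class index across YOLO-format .txt files."""
    counts = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1
    return counts


def underrepresented(counts, names, minimum=200):
    """Return {class_name: count} for every class below the minimum."""
    return {
        name: counts.get(i, 0)
        for i, name in enumerate(names)
        if counts.get(i, 0) < minimum
    }
```

Running `underrepresented(count_instances("dataset/labels/train"), ["box", "person", "forklift", "pallet"])` lists the classes that still need more annotation.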

Annotation Format and Folder Structure

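YOLOv8 expects one .txt label file per image. Each line holds a class index followed by the box center x, center y, width, and height, all normalized to the 0-1 range by the image dimensions. A data.yaml file points at the class names and directory paths.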

dataset/
  images/
    train/
    val/
  labels/
    train/
    val/
  data.yaml

# data.yaml
path: dataset          # dataset root; use an absolute path if it fails to resolve
train: images/train    # relative to path
val: images/val        # relative to path
nc: 4
names: ["box", "person", "forklift", "pallet"]
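If your annotations are stored as pixel corner coordinates, the conversion to a YOLO label line is a few divisions. The helper below is a sketch under that assumption; the function name and the six-decimal formatting are my own choices.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space corner box to a normalized YOLO label line."""
    if not (0 <= xmin < xmax <= img_w and 0 <= ymin < ymax <= img_h):
        raise ValueError("box must lie inside the image and have positive area")
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


print(to_yolo_line(2, 100, 200, 300, 400, 640, 480))
# prints: 2 0.312500 0.625000 0.312500 0.416667
```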

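If your annotations come from CVAT, Label Studio, or Roboflow, all three can export YOLO-format labels, though depending on the tool and version you may need to rearrange the exported folders to match the layout above. Roboflow in particular is worth using even just for its augmentation preview and dataset health checks before export.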

Choosing a Base Checkpoint

Always fine-tune from a COCO-pretrained checkpoint rather than training from scratch - the low-level feature extractors (edges, textures, gradients) transfer almost entirely regardless of your target domain. For most custom detection tasks, yolov8s or yolov8m strikes the right balance between accuracy and training speed on a single GPU.

from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # COCO-pretrained checkpoint

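results = model.train(
    data="dataset/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    patience=20,         # stop if validation has not improved for 20 epochs
    optimizer="AdamW",   # set explicitly: the default optimizer="auto" ignores lr0
    lr0=0.001,           # initial learning rate
    device=0,            # first CUDA GPU
)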

The Hyperparameters That Actually Move the Needle

Most default YOLOv8 hyperparameters are tuned for COCO-scale datasets with tens of thousands of images. Custom datasets with a few thousand images need different treatment:

- Lower the learning rate. Starting at 0.001 instead of the default 0.01 prevents the model from forgetting pretrained features too quickly on small datasets. Note that lr0 only takes effect when you also set optimizer explicitly; with the default optimizer="auto", Ultralytics picks the optimizer and learning rate itself.
- Use patience-based early stopping. Set patience to 15-20 epochs rather than training a fixed schedule. Custom datasets overfit fast once the loss curve flattens.
- Freeze early backbone layers initially. Freezing the first 10 layers for the first 10-15 epochs stabilizes training before unfreezing for full fine-tuning. The freeze argument applies to a whole training run, so in practice this means two consecutive runs.
- Match image size to your deployment target. Training at 640px when your production camera feed is 1280px wastes accuracy headroom - train at the resolution you will actually run inference at.
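The freeze-then-unfreeze schedule from the list above can be sketched as two consecutive training calls, since freeze is fixed for the duration of a run. This is an illustration, not a verified recipe: two_stage_plan and run_two_stage are hypothetical helper names, the AdamW choice is my assumption, and I am relying on the model object carrying its trained weights into the second call, which is worth confirming on your Ultralytics version.

```python
def two_stage_plan(data, warmup_epochs=15, total_epochs=100, frozen_layers=10):
    """Build train() keyword arguments for a frozen warm-up and a full fine-tune."""
    common = dict(data=data, imgsz=640, batch=16, optimizer="AdamW", lr0=0.001)
    # Stage 1: first `frozen_layers` layers frozen (the YOLOv8 backbone is layers 0-9).
    stage1 = dict(common, epochs=warmup_epochs, freeze=frozen_layers)
    # Stage 2: nothing frozen, early stopping enabled for the remaining epochs.
    stage2 = dict(common, epochs=total_epochs - warmup_epochs, freeze=None, patience=20)
    return stage1, stage2


def run_two_stage(weights="yolov8s.pt", data="dataset/data.yaml"):
    """Run both stages back to back on the same model object."""
    from ultralytics import YOLO  # lazy import so the plan can be inspected without it

    model = YOLO(weights)
    stage1, stage2 = two_stage_plan(data)
    model.train(**stage1)  # warm-up with frozen backbone
    model.train(**stage2)  # full fine-tuning
    return model
```

Keeping the plan separate from the training call makes it easy to print and review both argument sets before committing GPU time.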

Augmentation: Less Is Often More

YOLOv8's default augmentation pipeline (mosaic, HSV shifts, and random flips, with mixup available but off by default) is aggressive and tuned for large general-purpose datasets. On small custom datasets it can actively hurt convergence by introducing label noise faster than the model can learn the underlying signal.

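model.train(
    data="dataset/data.yaml",
    epochs=100,
    mosaic=0.5,       # reduce from default 1.0
    mixup=0.0,        # keep disabled for small datasets (also the default)
    hsv_h=0.01,       # hue jitter, default 0.015
    hsv_s=0.5,        # saturation jitter, default 0.7
    fliplr=0.5,       # horizontal flip probability (default)
    degrees=5.0,      # mild rotation only (default 0.0)
)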

Keep mixup disabled below roughly 2,000 training images (it is already off by default), and cut mosaic probability in half. Raise both once the model still overfits with these conservative settings - that is the signal you have outgrown them.
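One way to keep this rule out of ad hoc edits is a small function that picks the overrides from the training set size. It is a sketch: the small-dataset values are the ones shown above, while the 0.1 mixup value for larger datasets is a placeholder I chose for illustration and should be tuned.

```python
def augmentation_overrides(num_train_images, small_threshold=2000):
    """Pick augmentation keyword arguments for model.train() by dataset size."""
    if num_train_images < small_threshold:
        # Conservative settings for small datasets.
        return {
            "mosaic": 0.5,
            "mixup": 0.0,
            "hsv_h": 0.01,
            "hsv_s": 0.5,
            "fliplr": 0.5,
            "degrees": 5.0,
        }
    # Larger datasets: full mosaic again; mixup=0.1 is an assumed starting value.
    return {"mosaic": 1.0, "mixup": 0.1}
```

The result can be splatted into the training call, for example `model.train(data="dataset/data.yaml", epochs=100, **augmentation_overrides(1500))`.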

Evaluating Beyond mAP

mAP@0.5 is the headline metric everyone reports, but it hides failure modes that matter in production. I always pair it with a per-class confusion matrix and a manual review of the lowest-confidence true positives - these are the detections closest to flipping into false negatives under slightly different lighting or occlusion.

Metric	Strength	Limitation
mAP@0.5	Detects presence/absence well	Hides localization quality
mAP@0.5:0.95	Stricter box-tightness signal	Slower to compute, harder to interpret
Per-class confusion matrix	Reveals which classes get confused	Needs manual inspection
Low-confidence TP review	Surfaces near-miss failure modes; most predictive of production reliability	Needs manual inspection
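The low-confidence true-positive review needs a list of matched detections sorted by confidence. Below is a minimal per-image sketch in plain Python, assuming you have already converted predictions and ground truth to pixel corner boxes; the function names, the tuple layout, and the greedy matching are my own simplifications, not the matching Ultralytics uses internally.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lowest_confidence_tps(predictions, ground_truth, iou_thr=0.5, top_k=20):
    """Return up to top_k true positives for one image, lowest confidence first.

    predictions: list of (class_id, confidence, box)
    ground_truth: list of (class_id, box)
    """
    matched = set()
    true_positives = []
    # Greedy matching: highest-confidence predictions claim ground truth first.
    for cls, conf, box in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_cls, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_cls != cls:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            true_positives.append((conf, cls, box, best_iou))
    return sorted(true_positives, key=lambda t: t[0])[:top_k]
```

Each returned tuple is (confidence, class_id, box, iou), so the first entries are the detections to inspect by eye.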

From Training to Deployment

Once validation metrics stabilize, export to the format your inference target needs - ONNX for cross-platform serving, TensorRT for NVIDIA GPUs, or RKNN for Rockchip edge hardware. Export-time quantization (INT8 or FP16) typically costs 1-2 mAP points in exchange for 2-4x inference speedup, which is almost always the right trade for real-time applications.

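model.export(format="onnx", imgsz=640, half=True, device=0)  # FP16 ONNX export needs a GPU device
# or for edge deployment (name selects the Rockchip target; "rk3588" is an example):
model.export(format="rknn", imgsz=640, name="rk3588")  # check which export arguments your Ultralytics version accepts for RKNN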
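Whether the speedup is worth the accuracy cost depends on your hardware, so I would measure it rather than assume it. The timing helper below is a generic sketch that works with any zero-argument callable; the name median_latency_ms and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time


def median_latency_ms(infer, warmup=5, runs=50):
    """Median wall-clock latency in milliseconds of a zero-argument callable."""
    for _ in range(warmup):
        infer()  # warm-up calls are not timed (caches, lazy initialization)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Wrap each exported model in a lambda, for example `median_latency_ms(lambda: model.predict(frame, verbose=False))` where frame is a representative image from your camera, and compare the medians across export formats.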

If you are deploying to constrained edge hardware afterward, the quantization and runtime considerations are covered in more depth in my earlier post on deploying YOLOv8 at 30ms without a GPU.

Frequently Asked Questions

How many images do I need to fine-tune YOLOv8?

For a single well-defined class, 150-300 labeled instances is a reasonable starting point. For multi-class detection with visually similar categories, budget for 300-500 instances per class to reach production-grade mAP.

Should I train from scratch instead of fine-tuning?

Almost never for custom datasets under 50,000 images. Training from scratch discards transferable low-level features and requires far more data and compute to reach the same accuracy as fine-tuning a COCO-pretrained checkpoint.

How do I know if my model is overfitting?

Watch the gap between training and validation loss. If training loss keeps dropping while validation loss plateaus or rises for more than 10-15 epochs, the model is overfitting - stop earlier, strengthen augmentation, or add more data rather than training longer.
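The same check can be automated over per-epoch loss values, which Ultralytics writes to results.csv in the run directory (column names vary slightly between versions, so I leave the parsing out). This is a sketch of the rule as stated above; the function name and the default window of 10 epochs are my own choices.

```python
def overfitting_epoch(train_loss, val_loss, window=10):
    """Return the first epoch where validation loss has not improved for
    `window` epochs while training loss is lower than at the best epoch,
    or None if that never happens."""
    best_val = float("inf")
    best_epoch = 0
    for epoch, (t, v) in enumerate(zip(train_loss, val_loss)):
        if v < best_val:
            best_val, best_epoch = v, epoch
        elif epoch - best_epoch >= window and t < train_loss[best_epoch]:
            return epoch
    return None
```

A returned epoch index means the run should have stopped at roughly that point; None means the validation loss was still improving within the window.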