How I Deployed YOLOv8 in Under 30ms on Edge Hardware (No GPU Required)

When a client needed a real-time surveillance system in 2024, the requirement was clear: YOLOv8 inference on a Rockchip RV1126 board, under 30ms per frame, with no GPU. Running a standard PyTorch model on that CPU gave 380ms. Here's exactly how I brought it down to 24ms.

Why Standard PyTorch Fails on Edge Hardware

The RV1126 has a 1.5GHz Cortex-A7 quad-core CPU — nowhere near enough for real-time neural network inference. But it also has a dedicated NPU (Neural Processing Unit) rated at 2.0 TOPS. The trick is getting your model to actually run on the NPU, not the CPU.

That's what RKNN (Rockchip Neural Network) is for. It's Rockchip's model format and toolchain: the converter turns your ONNX model into a .rknn file the NPU can execute. Combined with INT8 quantization, it's the difference between 380ms and 24ms.

The Conversion Pipeline

The full pipeline from training to deployment:

PyTorch (.pt) → ONNX (.onnx) → RKNN (.rknn) → RV1126 NPU

Step 1: Export YOLOv8 to ONNX

Install Ultralytics (pip install ultralytics) and export your model:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.export(
    format="onnx",
    opset=12,       # conservative choice; newer toolkit releases may accept higher opsets
    simplify=True,  # removes ops that confuse the converter
    imgsz=640,      # lock input size — RKNN doesn't support dynamic shapes
)

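Two flags matter: opset=12 is a conservative choice that stays within the operator set RKNN converters handle reliably (check your toolkit's release notes before going higher), and simplify=True removes redundant ONNX nodes that cause conversion failures. Fixing imgsz matters as well, because the converted model needs a static input shape.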

Step 2: Install rknn-toolkit2 and Convert

On the x86 Linux host, install the toolkit from a shell:

pip install rknn-toolkit2

Then run the conversion in Python:

from rknn.api import RKNN

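rknn = RKNN(verbose=False)

# Normalization is baked into the model: (pixel - mean) / std maps 0-255 to 0-1
rknn.config(
    mean_values=[[0, 0, 0]],
    std_values=[[255, 255, 255]],
    target_platform="rv1126",
    quantized_algorithm="normal",
    quantized_method="channel",  # per-channel quantization
)

# Each call returns 0 on success, so stop at the first failure
if rknn.load_onnx(model="yolov8n.onnx") != 0:
    raise RuntimeError("load_onnx failed")

# Build with INT8 quantization, calibrated on the images listed in the text file
if rknn.build(do_quantization=True, dataset="./calibration_dataset.txt") != 0:
    raise RuntimeError("build failed")

if rknn.export_rknn("./yolov8n.rknn") != 0:
    raise RuntimeError("export_rknn failed")

rknn.release()  # free toolkit resources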

The calibration_dataset.txt is a text file listing one image path per line, covering 100–200 representative images from your deployment environment. The more domain-relevant these images are, the less accuracy you lose from quantization.
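Here's a minimal sketch of one way to build that list, assuming your captured frames sit in a single directory tree. The function name, the suffix set, and the default of 150 images are illustrative choices, not part of the RKNN toolkit.

```python
import random
from pathlib import Path

# Assumed set of image extensions; adjust to match your captures.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def write_calibration_list(image_dir, list_path, count=150, seed=0):
    """Write up to `count` randomly sampled image paths, one per line."""
    images = sorted(
        p for p in Path(image_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not images:
        raise ValueError(f"no images found under {image_dir}")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(images, min(count, len(images)))
    lines = [str(p.resolve()) for p in sorted(sample)]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines
```

Calling write_calibration_list("./site_frames", "./calibration_dataset.txt") would produce the file used in the build step, where ./site_frames is a hypothetical folder of frames from the deployment cameras. Random sampling is only a starting point; if some classes or lighting conditions are rare, picking frames by hand may represent them better.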

Step 3: What INT8 Quantization Actually Does

INT8 quantization stores the model's 32-bit float weights as 8-bit integers and runs the activations in 8-bit as well. This gives roughly 4× memory reduction (32 bits down to 8) and 3–5× inference speedup on NPU hardware, with typically less than 2% mAP drop on COCO benchmarks, an acceptable tradeoff for real-time systems.

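The calibration images are run through the model to record the value ranges each layer actually sees, and the quantization scale factors are derived from those ranges to minimize quantization error. Calibrating on random or unrepresentative images instead can cause up to 15% mAP drop.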
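To make the role of the calibration range concrete, here is a small sketch of generic affine INT8 quantization. It is a textbook scheme written for illustration, not RKNN's internal algorithm, and the activation values and ranges are made up.

```python
def quant_params(lo, hi, qmin=-128, qmax=127):
    """Scale and zero point that map the float range [lo, hi] onto INT8."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    if hi == lo:
        raise ValueError("range must have non-zero width")
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Round to the nearest integer step and clamp to the INT8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def mean_abs_error(values, lo, hi):
    """Average round-trip error when `values` are quantized with range [lo, hi]."""
    scale, zp = quant_params(lo, hi)
    errors = [abs(v - dequantize(quantize(v, scale, zp), scale, zp)) for v in values]
    return sum(errors) / len(errors)


# Made-up activations spread over [-1.0, 3.0]
activations = [0.02 * i for i in range(-50, 151)]

tight = mean_abs_error(activations, -1.0, 3.0)    # range matching the data
loose = mean_abs_error(activations, -20.0, 60.0)  # range inflated by bad calibration
print(f"tight range error: {tight:.4f}, loose range error: {loose:.4f}")
```

With only 256 integer levels available, a range 20× wider than the real data makes every step 20× coarser, which is why unrepresentative calibration images cost accuracy.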

Step 4: Run Inference on the RV1126

Deploy the .rknn file to the board and run with rknn-toolkit-lite2:

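from rknnlite.api import RKNNLite
import cv2
import numpy as np

rknn_lite = RKNNLite()
if rknn_lite.load_rknn("./yolov8n.rknn") != 0:
    raise RuntimeError("load_rknn failed")
# core_mask only matters on multi-core NPUs such as the RK3588
if rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO) != 0:
    raise RuntimeError("init_runtime failed")

img = cv2.imread("frame.jpg")  # BGR, HxWx3, uint8
if img is None:
    raise FileNotFoundError("frame.jpg could not be read")
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img_resized = cv2.resize(img_rgb, (640, 640))  # plain resize; aspect ratio is not preserved
img_input = np.expand_dims(img_resized, 0)     # add batch dimension: (1, 640, 640, 3)

outputs = rknn_lite.inference(inputs=[img_input])
# outputs contains raw tensors; apply YOLOv8 post-processing manually

rknn_lite.release()  # free the NPU runtime when done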
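The manual post-processing can be sketched as below. This assumes the model exposes a single output of shape (1, 4 + num_classes, N), with boxes as center-x, center-y, width, height in input pixels followed by per-class scores, which is the layout of a stock Ultralytics ONNX detection export. A converted model may expose a different layout, so check the shapes in outputs first. The function names and thresholds are illustrative, and the NMS here is class-agnostic for brevity.

```python
import numpy as np


def decode_yolov8(output, conf_thres=0.25):
    """Turn a (1, 4 + num_classes, N) tensor into xyxy boxes, scores, class ids."""
    preds = np.asarray(output, dtype=np.float32)[0].T  # (N, 4 + num_classes)
    class_scores = preds[:, 4:]
    class_ids = class_scores.argmax(axis=1)
    scores = class_scores[np.arange(len(preds)), class_ids]
    keep = scores >= conf_thres                         # drop low-confidence candidates
    cx, cy, w, h = preds[keep, :4].T
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    return boxes, scores[keep], class_ids[keep]


def nms(boxes, scores, iou_thres=0.45):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = scores.argsort()[::-1]                      # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        order = rest[iou <= iou_thres]                  # discard heavy overlaps
    return kept
```

Under those assumptions, usage would look like boxes, scores, class_ids = decode_yolov8(outputs[0]) followed by kept = nms(boxes, scores). The boxes come out in the 640×640 input space and still need scaling back to the original frame.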

Benchmark Results

Method	Hardware	Latency
YOLOv8n PyTorch FP32	RV1126 CPU	380ms
YOLOv8n ONNX Runtime FP32	RV1126 CPU	290ms
YOLOv8n RKNN FP32	RV1126 NPU	65ms
YOLOv8n RKNN INT8	RV1126 NPU	24ms

24ms per frame works out to about 41 FPS (1000 / 24 ≈ 41.7). Real-time at full 640×640 resolution on a $35 board drawing 5W.
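If you want to reproduce numbers like these on your own board, a simple timing harness is enough. The sketch below is one common approach (warm-up runs, then the median of many timed runs), not necessarily how the figures in the table were measured; the function name and defaults are illustrative.

```python
import statistics
import time


def benchmark(run_once, warmup=10, runs=100):
    """Time `run_once()` and return per-call latency statistics in milliseconds."""
    for _ in range(warmup):  # let caches and clock scaling settle first
        run_once()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    median = statistics.median(samples)
    return {
        "median_ms": median,
        "p95_ms": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
        "fps": 1000.0 / median if median > 0 else float("inf"),
    }
```

On the board, something like benchmark(lambda: rknn_lite.inference(inputs=[img_input])) would time the NPU call alone. Wrap preprocessing and post-processing in the same callable if you want the end-to-end figure, since those steps run on the CPU and add to the per-frame cost.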

Common Pitfalls

- Unsupported op errors: YOLOv8's SPPF module with large MaxPool kernels can fail. Always use opset=12 + simplify=True. Unsupported operators typically surface as errors from load_onnx() or build(); rknn.accuracy_analysis() answers a different question, namely which layers drift from the float model after quantization.
- Accuracy drops more than expected: If mAP drops more than 5% post-quantization, increase calibration dataset size and ensure all object classes are represented proportionally.
- Output parsing: RKNN returns raw tensors. YOLOv8 post-processing (decode boxes, apply NMS) must be done manually, since it's not included in the ONNX export by default.
- Dynamic input shapes: RKNN does not support dynamic shapes. Lock your input size at export time and stick to it at inference time.
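A fixed 640×640 input means camera frames of any other shape have to be fitted into that square. A plain resize stretches the image; letterboxing (scale to fit, then pad) keeps the aspect ratio. The sketch below covers only the geometry, with hypothetical helper names: how much to scale and pad, and how to map a detected box back to the original frame.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding that fit a src_w x src_h frame into a dst x dst square."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) // 2, (dst - new_h) // 2  # left and top padding
    return scale, new_w, new_h, pad_x, pad_y


def to_source_coords(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from model input space back to the original frame."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / scale,
        (y1 - pad_y) / scale,
        (x2 - pad_x) / scale,
        (y2 - pad_y) / scale,
    )


# Example: a 1920x1080 frame is resized to 640x360 and padded by 140 px top and bottom.
scale, new_w, new_h, pad_x, pad_y = letterbox_params(1920, 1080)
```

With OpenCV, the resize itself would be cv2.resize(frame, (new_w, new_h)) followed by cv2.copyMakeBorder to add the padding; Ultralytics pads with gray value 114, so matching that keeps inference inputs close to what the model saw in training.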

The Result

The final system: 41 FPS, 640×640 resolution, running on a $35 SBC in a weatherproof enclosure at 5W. The client got real-time detection across 4 simultaneous camera feeds — no cloud, no GPU, no monthly inference bill.
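As a rough budget check, the arithmetic below shows what each feed gets if the four cameras take turns on the single NPU. Round-robin sharing is an assumption made for this example, and CPU-side preprocessing and post-processing time is ignored.

```python
def per_feed_fps(latency_ms, feeds):
    """Frame rate each feed gets when feeds share one NPU in round-robin order."""
    total_fps = 1000.0 / latency_ms
    return total_fps / feeds


# 24 ms per inference and 4 feeds are the figures quoted in this post
print(round(per_feed_fps(24, 4), 1))  # prints 10.4
```

Around 10 FPS per camera is a common operating point for surveillance detection, but if your use case needs a higher per-feed rate, plan for fewer feeds per board or a smaller input size.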

Frequently Asked Questions

Which Rockchip chips support RKNN?

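RK1808, RV1109, RV1126, RK3399Pro, RK3566, RK3568, and RK3588. Rockchip splits these across two toolkit generations (the original RKNN-Toolkit for the older parts, RKNN-Toolkit2 for the newer ones), so confirm in the official support list which toolkit and runtime your exact chip needs. The RK3588 is the current flagship with up to 6.0 TOPS NPU performance.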

Can I deploy YOLOv9 or YOLOv10 with RKNN?

Yes. Export to ONNX with opset=12 and simplify=True. YOLOv9 converts cleanly. Some YOLOv10 variants need minor architecture adjustments: fix any unsupported ops reported during conversion, then use accuracy_analysis() to confirm the quantized model still tracks the float one.

Does INT8 quantization always work well?

For CNN-based detection models (YOLO family), yes — typically <2% mAP loss. Segmentation and keypoint models need more careful calibration. Transformer-based detectors (DETR) are harder to quantize; stick to YOLO variants for RKNN.

Do I need Linux for the conversion?

rknn-toolkit2 (the host converter) runs on x86 Linux only. rknn-toolkit-lite2 (for inference on the board) runs on the target ARM board. You can use WSL2 on Windows for the conversion step.