Deploying YOLO11 to the BM1684 Platform: ONNX Format Adaptation → INT8 Quantization → Runtime Inference

Author: 万物纵横

Published: 2025-09-09 09:52

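Deploying a YOLO11 model to the BM1684 platform follows three stages: model adaptation, quantization, and on-device deployment. This guide treats BM1684 as a Horizon J5 series edge AI chip (low power, cost-effective, aimed at object detection and classification) and uses Horizon's toolchain, Horizon OpenExplorer, for model conversion and hardware acceleration throughout. BM1684 is more widely known as a SOPHGO TPU and .bmodel as SOPHGO's model format, so check every tool name, option, and API call below against the SDK documentation shipped with your device. The step-by-step guide follows.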

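1. Background and Tool Preparation

Before deploying, pin down the core toolchain and environment dependencies, and make sure the host (used for model preparation) and the target (the BM1684 device) are compatible.

1.1 Core Tools and Environment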

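Category	Tool / Environment	Purpose
Host environment	Ubuntu 20.04/18.04	Model export, ONNX simplification, and quantization calibration (requires Python 3.8+ and PyTorch 2.0+)
Horizon toolchain	Horizon OpenExplorer (HOE)	Includes hb_mapper (quantization/compilation tool) and Horizon Runtime (on-device inference API)
Target environment	BM1684 device (e.g. Horizon J5 development board)	Requires the Horizon BPU SDK (v4.0+ recommended; supports YOLO-series operators)
Auxiliary tools	onnx-simplifier, OpenCV	ONNX model simplification; image preprocessing (resize / normalization)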

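1.2 Installing the Toolchain (Host)

Download Horizon OpenExplorer: register an account on the Horizon developer site and download the HOE toolchain for your Ubuntu version (select the "BM1684" chip model).

Install the dependencies:

# Install PyTorch and the YOLO11 dependencies
pip install torch torchvision ultralytics onnx onnx-simplifier opencv-python

# Extract the HOE toolchain and set up the environment variables
tar -zxvf horizon_open_explorer_xxx.tar.gz
source ${HOE_PATH}/envsetup.sh  # must be run in every new terminal session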
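
Before moving on, it can be worth confirming that the host meets the stated requirements. The following is a minimal sketch, not part of any vendor toolchain: the helper name check_host_env and the module list are assumptions derived from the pip command above (opencv-python is imported as cv2, onnx-simplifier as onnxsim). It only checks that the modules can be located and does not import them.

```python
import importlib.util
import sys

# Import names for the packages installed above (assumed mapping from pip names).
REQUIRED_MODULES = ["torch", "torchvision", "ultralytics", "onnx", "onnxsim", "cv2"]


def check_host_env(min_python=(3, 8), modules=REQUIRED_MODULES):
    """Return (python_ok, missing_modules) without importing the heavy packages."""
    python_ok = sys.version_info[:2] >= tuple(min_python)
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    return python_ok, missing


if __name__ == "__main__":
    python_ok, missing = check_host_env()
    print("Python >= 3.8:", python_ok)
    print("Missing modules:", missing or "none")
```

An empty "missing" list only shows that the packages are installed; it says nothing about version compatibility with the toolchain.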

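2. Step 1: Export the YOLO11 Model to ONNX

BM1684 does not load native PyTorch models (.pt) directly. YOLO11 must first be exported to ONNX, an intermediate format suited to quantization and compilation, and the export must only use operators that the BM1684 BPU supports.

2.1 Exporting the ONNX Model (Host)

Use the export interface of the Ultralytics library, paying attention to these settings:

Fixed input size: BM1684 has limited support for dynamic input shapes, so fix the size at 640x640 (the YOLO11 default) or 320x320 (lightweight).

No simplification at export time: disable simplify and run onnx-simplifier as a separate step afterwards, to avoid losing operators.

Raw output node: make sure the output is the raw tensor of boxes, classes, and confidences, without post-processing (post-processing runs on the CPU on the target).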

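Code example:

from pathlib import Path

from ultralytics import YOLO

# Load the YOLO11 model (official pretrained weights or a custom-trained model)
model = YOLO("yolo11n.pt")  # or a custom path such as "runs/detect/train/weights/best.pt"

# Export to ONNX. export() takes no output-path argument: it writes the file
# next to the weights (yolo11n.onnx here) and returns that path.
onnx_path = model.export(
    format="onnx",
    imgsz=640,          # fixed input size (width = height, must be a multiple of 32)
    dynamic=False,      # disable dynamic input shapes
    opset=12,           # ONNX opset version (opset 12-14 recommended for BM1684)
    simplify=False,     # do not simplify yet; handled separately in the next step
)

# Rename to the file name used in the following steps
Path(onnx_path).rename("yolo11n_640.onnx")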

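2.2 Simplifying the ONNX Model (Host)

The exported ONNX model may contain redundant nodes (such as Shape and Gather). Simplify it with onnx-simplifier so that BM1684 can recognize every operator. Recent onnxsim releases name the shape option --overwrite-input-shape; use that spelling if --input-shape is rejected or reported as renamed.

python -m onnxsim yolo11n_640.onnx yolo11n_640_simplified.onnx \
  --input-shape "1,3,640,640"  # fix the input shape (batch=1, 3 channels, 640x640)

After simplification, visualize the model with netron (pip install netron) and confirm that the input node is images and the output node is output0 (the default YOLO11 output node name).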
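
When inspecting output0 in netron, it helps to know which shape to expect. The sketch below assumes the standard Ultralytics detection head (strides 8, 16, and 32, one candidate per grid cell, 4 box values plus one score per class, and no separate objectness channel); the function names are illustrative only.

```python
def num_candidates(imgsz, strides=(8, 16, 32)):
    """Number of candidate boxes for a square input: one per grid cell per stride."""
    if any(imgsz % s for s in strides):
        raise ValueError("imgsz must be a multiple of every stride")
    return sum((imgsz // s) ** 2 for s in strides)


def expected_output_shape(imgsz, num_classes=80, batch=1):
    """Expected shape of output0: [batch, 4 box values + class scores, candidates]."""
    return (batch, 4 + num_classes, num_candidates(imgsz))


print(expected_output_shape(640))  # (1, 84, 8400)
print(expected_output_shape(320))  # (1, 84, 2100)
```

If netron shows a different layout (for example a transposed tensor or a custom class count), the post-processing code on the target has to index the buffer accordingly.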

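3. Step 2: Model Quantization (the Key Stage)

The main strength of the BM1684 BPU is low-precision acceleration (INT8/INT16). The FP32 ONNX model is therefore quantized to INT8 to balance accuracy and speed, and a calibration dataset is used during quantization to keep the accuracy loss as small as possible.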
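
To make the role of calibration concrete, here is a minimal sketch of symmetric "max" calibration for a single tensor. It illustrates the general technique only; the toolchain's actual scheme (per-channel scales, zero points, KL-based thresholds) is not described in this article and may differ.

```python
def max_calibration_scale(samples, num_bits=8):
    """Symmetric 'max' calibration: map the largest absolute value seen to the top of the integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    abs_max = max(abs(v) for batch in samples for v in batch)
    return abs_max / qmax if abs_max > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clip to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


# Two small "calibration batches" of activations (made-up numbers)
calib = [[-1.0, 0.25, 0.5], [2.54, -0.3, 0.0]]
scale = max_calibration_scale(calib)      # about 0.02 (2.54 / 127)
q = quantize([0.5, -1.0, 3.0], scale)     # 3.0 is outside the calibrated range and gets clipped
print(scale, q, dequantize(q, scale))
```

The clipped value shows why the calibration images need to cover the activation range seen in deployment: anything larger than the calibrated maximum saturates.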

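3.1 Preparing the Calibration Dataset

The calibration data should meet these requirements:

Matching distribution: the same distribution as the YOLO11 training data (for example a COCO subset, or the validation split of a custom dataset).

Sufficient quantity: 50-200 images are recommended (too few degrades quantization accuracy).

Format: convert the images to LMDB (recommended by the Horizon toolchain for batch reading), or use JPG/PNG files directly (with the path specified in the configuration file).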
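
One simple way to satisfy the distribution and quantity requirements above is to draw a reproducible random sample from the validation images. This is a sketch using only the standard library; the function name and the example paths in the comment are hypothetical.

```python
import random
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pick_calibration_images(src_dir, dst_dir, count=100, seed=0):
    """Copy a reproducible random sample of images from src_dir into dst_dir."""
    images = sorted(p for p in Path(src_dir).iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)
    if not images:
        raise FileNotFoundError(f"no images found in {src_dir}")
    chosen = random.Random(seed).sample(images, min(count, len(images)))
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for p in chosen:
        shutil.copy2(p, dst / p.name)
    return sorted(p.name for p in chosen)


# Example call (hypothetical source directory):
# pick_calibration_images("datasets/coco/images/val2017", "./calib_data", count=100)
```

A fixed seed keeps the calibration set stable between runs, which makes accuracy comparisons across quantization settings easier to interpret.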

Generate the LMDB dataset (using the script from the Horizon toolchain):

# Assumes the calibration images are in ./calib_data
python ${HOE_PATH}/tools/dataset_convert/convert_imageset.py \
  --image_dir ./calib_data \
  --lmdb_path ./calib_data_lmdb \
  --image_format jpg  # or png

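3.2 Writing the Quantization Configuration File (YAML)

Create quant_config.yaml with the quantization parameters. The essential entries are the chip model, the model path, and the calibration data path. Example:

model_parameters:
  onnx_model: "./yolo11n_640_simplified.onnx"   # simplified ONNX model
  input_name: "images"                          # input node name (must match the ONNX model)
  input_shape: "[1,3,640,640]"                  # input shape
  output_name: "output0"                        # output node name (must match the ONNX model)
quantization_parameters:
  calibration_data: "./calib_data_lmdb"         # calibration dataset (LMDB path)
  calibration_type: "max"                       # calibration method (max/min/kl; max recommended)
  bit_width: 8                                  # activation precision (INT8)
  weight_bit_width: 8                           # weight precision (INT8)
compilation_parameters:
  target_platform: "bm1684"                     # target chip model
  output_model: "./yolo11n_640_bm1684.bmodel"   # BM1684-specific output model (.bmodel)
  optimize: "speed"                             # optimization goal (speed/accuracy)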

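3.3 Running Quantization and Compilation (Host)

Use hb_mapper to run post-training quantization (PTQ) and compilation, producing a .bmodel file that BM1684 can load directly:

hb_mapper quantize \
  --config ./quant_config.yaml \
  --log-level info  # verbose log output, useful for troubleshooting

Success indicator: the file yolo11n_640_bm1684.bmodel is generated and the log contains no "unsupported operator" errors.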

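4. Step 3: Deployment on the BM1684 Target

Transfer the quantized .bmodel file to the BM1684 device and write the inference code against the Horizon Runtime API, covering the full flow: model loading, image preprocessing, inference, and post-processing.

4.1 Preparing the Target Environment (BM1684 Device)

Install the Horizon BPU SDK: download the SDK for BM1684 from the official site (for example horizon_bpu_sdk_v4.0.0_bm1684.tgz), extract it, and run the install script:

tar -zxvf horizon_bpu_sdk_v4.0.0_bm1684.tgz
cd horizon_bpu_sdk_v4.0.0_bm1684
sudo ./install.sh  # installs the driver, the Runtime, and dependent libraries

Verify the environment: run hb_sys_info. If it reports BPU Version: BM1684, the environment is working.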

4.2 Writing the Inference Code (C/C++, on the Target)

Horizon provides a C/C++ API (horizon_runtime.h). The core flow is shown below, with the key steps commented:

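#include <cstdlib>   // malloc, free
#include <cstring>   // memcpy
#include <iostream>
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>                    // cv::dnn::NMSBoxes
#include "horizon_runtime/horizon_runtime.h"  // Horizon Runtime header

using namespace std;
using namespace cv;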

// Preprocessing: letterbox resize -> normalize -> BGR to RGB -> HWC to CHW
Mat preprocess_image(const Mat& img, int target_size) {
    Mat resized, normalized, rgb_img;
    // 1. Resize with the aspect ratio preserved, then pad with black borders to avoid stretching
    float scale = min((float)target_size / img.cols, (float)target_size / img.rows);
    Size new_size((int)(img.cols * scale), (int)(img.rows * scale));
    resize(img, resized, new_size);
    Mat pad_img = Mat::zeros(Size(target_size, target_size), CV_8UC3);
    Rect roi((target_size - new_size.width) / 2, (target_size - new_size.height) / 2, new_size.width, new_size.height);
    resized.copyTo(pad_img(roi));
    // 2. Normalize (YOLO11 default: img / 255.0)
    pad_img.convertTo(normalized, CV_32F, 1.0 / 255.0);
    // 3. Channel order: OpenCV loads BGR, YOLO11 expects RGB
    cvtColor(normalized, rgb_img, COLOR_BGR2RGB);
    // 4. HWC -> CHW (BM1684 input layout is CHW): stack the three planes vertically in a
    //    single-channel matrix of 3*H rows by W columns, so the contiguous buffer is C, H, W
    vector<Mat> channels;
    split(rgb_img, channels);
    Mat chw_img(3 * target_size, target_size, CV_32FC1);
    for (int i = 0; i < 3; i++) {
        channels[i].copyTo(chw_img.rowRange(i * target_size, (i + 1) * target_size));
    }
    return chw_img;
}

// YOLO11 post-processing: parse the output tensor -> map coordinates back -> NMS
vector<Rect> postprocess_output(float* output, int target_size, const Mat& img, float conf_thres=0.25, float nms_thres=0.45) {
    vector<Rect> dets;
    vector<float> scores;
    // Default Ultralytics export: output0 is [1, 84, num_boxes], channel-major
    // (84 = 4 box values + 80 class scores; YOLO11 has no separate objectness value)
    const int num_classes = 80;
    const int num_boxes = 8400;  // 80*80 + 40*40 + 20*20 grid cells for a 640x640 input
    float scale = min((float)target_size / img.cols, (float)target_size / img.rows);
    int pad_w = (target_size - (int)(img.cols * scale)) / 2;
    int pad_h = (target_size - (int)(img.rows * scale)) / 2;
    for (int i = 0; i < num_boxes; i++) {
        // Highest class score for candidate i
        float max_cls_conf = 0;
        for (int j = 0; j < num_classes; j++) {
            float s = output[(4 + j) * num_boxes + i];
            if (s > max_cls_conf) max_cls_conf = s;
        }
        if (max_cls_conf < conf_thres) continue;
        // Map the letterboxed [cx, cy, w, h] back to OpenCV's [left, top, width, height] in the source image
        float x = (output[0 * num_boxes + i] - pad_w) / scale;  // centre x
        float y = (output[1 * num_boxes + i] - pad_h) / scale;  // centre y
        float w = output[2 * num_boxes + i] / scale;            // width
        float h = output[3 * num_boxes + i] / scale;            // height
        int left = max(0, (int)(x - w / 2));
        int top = max(0, (int)(y - h / 2));
        dets.push_back(Rect(left, top, (int)w, (int)h));
        scores.push_back(max_cls_conf);
    }
    // Non-maximum suppression (class-agnostic), ranked by the real scores
    vector<int> indices;
    cv::dnn::NMSBoxes(dets, scores, conf_thres, nms_thres, indices);
    // Keep only the boxes that survive NMS
    vector<Rect> final_dets;
    for (int idx : indices) {
        final_dets.push_back(dets[idx]);
    }
    return final_dets;
}

int main() {
    // 1. Initialize Horizon Runtime
    horizon_runtime_handle_t rt_handle;
    HORIZON_RUNTIME_CHECK(horizon_runtime_create(&rt_handle));

    // 2. Load the .bmodel model
    horizon_model_handle_t model_handle;
    HORIZON_RUNTIME_CHECK(horizon_runtime_load_model(rt_handle, "./yolo11n_640_bm1684.bmodel", &model_handle));

    // 3. Query the model's input/output tensor information
    horizon_tensor_info_t input_info, output_info;
    HORIZON_RUNTIME_CHECK(horizon_runtime_get_model_input_tensor_info(model_handle, 0, &input_info));
    HORIZON_RUNTIME_CHECK(horizon_runtime_get_model_output_tensor_info(model_handle, 0, &output_info));

    // 4. Allocate the input/output buffers
    void* input_buf = malloc(input_info.data_size);
    void* output_buf = malloc(output_info.data_size);
    horizon_tensor_t input_tensor = {input_info, input_buf};
    horizon_tensor_t output_tensor = {output_info, output_buf};

    // 5. Read the image and preprocess it
    Mat img = imread("./test.jpg");  // test image
    if (img.empty()) {               // imread returns an empty Mat if the file is missing or unreadable
        cerr << "Failed to read ./test.jpg" << endl;
        return 1;
    }
    Mat chw_img = preprocess_image(img, 640);
    // Copy the preprocessed data into the input buffer
    // (assumes input_info.data_size equals 3*640*640*sizeof(float))
    memcpy(input_buf, chw_img.data, input_info.data_size);

    // 6. Run inference
    HORIZON_RUNTIME_CHECK(horizon_runtime_infer(rt_handle, model_handle, &input_tensor, 1, &output_tensor, 1));

    // 7. Post-process and draw the results
    vector<Rect> dets = postprocess_output((float*)output_buf, 640, img);
    for (const Rect& det : dets) {
        rectangle(img, det, Scalar(0, 255, 0), 2);  // draw the detection box
    }
    imwrite("./result.jpg", img);  // save the result

    // 8. Release resources
    free(input_buf);
    free(output_buf);
    HORIZON_RUNTIME_CHECK(horizon_runtime_unload_model(model_handle));
    HORIZON_RUNTIME_CHECK(horizon_runtime_destroy(rt_handle));

    cout << "Inference done! Result saved to result.jpg" << endl;
    return 0;
}
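
The coordinate handling in the preprocessing and post-processing functions above is easy to get wrong, so a host-side reference can help when debugging. The sketch below reproduces the letterbox arithmetic and the read-out of one candidate in plain Python. It assumes the default Ultralytics layout for output0, a channel-major [4 + classes, candidates] buffer; the function names and the toy tensor are illustrative only.

```python
def letterbox_params(img_w, img_h, target=640):
    """Scale factor, resized size, and left/top padding for an aspect-preserving resize."""
    scale = min(target / img_w, target / img_h)
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    pad_w, pad_h = (target - new_w) // 2, (target - new_h) // 2
    return scale, (new_w, new_h), (pad_w, pad_h)


def decode_candidate(output, i, num_boxes, num_classes=80):
    """Read candidate i from a flat channel-major [4 + num_classes, num_boxes] buffer."""
    cx, cy, w, h = (output[c * num_boxes + i] for c in range(4))
    scores = [output[(4 + c) * num_boxes + i] for c in range(num_classes)]
    cls = max(range(num_classes), key=scores.__getitem__)
    return (cx, cy, w, h), cls, scores[cls]


def to_original(box, scale, pad):
    """Map a centre-format box in letterboxed pixels to (left, top, width, height) in the source image."""
    cx, cy, w, h = box
    x, y = (cx - pad[0]) / scale, (cy - pad[1]) / scale
    w, h = w / scale, h / scale
    return (x - w / 2, y - h / 2, w, h)


# A 1280x720 frame: scale 0.5, resized to 640x360, 140 px of padding above and below
scale, size, pad = letterbox_params(1280, 720)

# Toy tensor with 2 candidates and 3 classes, laid out as [4 + 3, 2]
flat = [320, 100,   320, 200,   64, 10,   32, 10,
        0.1, 0.0,   0.9, 0.2,   0.0, 0.7]
box, cls, score = decode_candidate(flat, 0, num_boxes=2, num_classes=3)
print(cls, score, to_original(box, scale, pad))
```

Running the same arithmetic on a few boxes from the device output and comparing against the drawn rectangles is a quick way to spot a transposed tensor or a padding offset error.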

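4.3 Building and Running (on the Target)

Transfer the inference code (yolo11_infer.cpp) and the .bmodel file to the BM1684 device, for example with scp.

Compile the code, linking against the Horizon Runtime library and the OpenCV libraries (opencv_dnn is needed for NMSBoxes):

g++ yolo11_infer.cpp -o yolo11_infer \
  -I/usr/include/horizon_runtime \
  -L/usr/lib -lhorizon_runtime -lopencv_core -lopencv_imgcodecs -lopencv_imgproc -lopencv_dnn

Run the inference program:

./yolo11_infer  # requires a test image named test.jpg in the working directory

Success indicator: result.jpg is generated with the detection boxes drawn correctly on the image.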
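
If result.jpg shows duplicate or missing boxes, comparing against a reference implementation of the suppression step can narrow down the cause. Below is a minimal sketch of greedy, class-agnostic NMS in plain Python, using the same default thresholds as the C++ code (0.25 and 0.45). It is a reference for checking results, not the routine the device code calls.

```python
def iou(a, b):
    """Intersection over union of two (left, top, width, height) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thres=0.25, iou_thres=0.45):
    """Greedy NMS: keep the best-scoring box, drop lower-scoring boxes that overlap it too much."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thres),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


boxes = [(100, 100, 50, 50), (105, 105, 50, 50), (300, 300, 40, 40), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # [0, 2]
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the last box falls below the score threshold.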

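5. Common Problems and Optimization

5.1 Troubleshooting

Unsupported-operator errors:

Cause: some operators in YOLO11 (such as the optimized implementation of SiLU) are not supported on BM1684.

Fix: disable operator fusion when exporting to ONNX, or replace the operator in the Ultralytics model definition with one the BPU supports (for example SiLU → ReLU6, which requires retraining).

Severe accuracy loss after quantization:

Cause: the calibration dataset is too small, or its distribution differs substantially from the training data.

Fix: enlarge the calibration set (≥100 images), or use quantization-aware training (QAT), which requires modifying the YOLO11 training code to add quantization hooks.

Slow inference:

Cause: image preprocessing runs on the CPU without hardware acceleration, or the batch size is 1.

Fix: use Horizon's hb_preprocess module to accelerate preprocessing, or switch to batch=2/4 (re-export the ONNX model with dynamic=False, batch=4).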
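
For the accuracy-loss case above, it helps to put a number on how far the quantized output has drifted from the FP32 output for the same image. The sketch below computes two common measures over flattened output tensors. The sample values are made up for illustration, and dumping the two tensors (from the ONNX model on the host and from the device) is left to whatever tooling is available.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized flat tensors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def max_abs_error(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


fp32_out = [0.12, 0.80, 0.05, 0.43]  # made-up stand-in for the FP32 ONNX output
int8_out = [0.10, 0.78, 0.06, 0.44]  # made-up stand-in for the dequantized device output
print(cosine_similarity(fp32_out, int8_out), max_abs_error(fp32_out, int8_out))
```

A similarity that drops sharply for particular images usually points to calibration data that does not cover those inputs.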

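5.2 Performance Optimization Suggestions

Input size: if speed matters most, reduce the input size from 640x640 to 320x320. Inference can run 2-3 times faster, at a small cost in accuracy.

Multi-threaded inference: use the multi-threading API of the Horizon Runtime to process several image streams at once (for example live camera feeds).

Model pruning: prune YOLO11 with the prune function in Ultralytics to remove redundant channels, then quantize again to further reduce model size and latency.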
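
As a rough check on the input-size suggestion above, the pixel count gives a first-order estimate of how the convolution workload scales. This is back-of-the-envelope arithmetic, assuming cost is roughly proportional to the number of input pixels; it is not a measurement on the device.

```python
def relative_cost(imgsz, baseline=640):
    """Pixel count relative to the baseline input size."""
    return (imgsz * imgsz) / (baseline * baseline)


for size in (640, 480, 320):
    print(size, relative_cost(size))
```

At 320x320 the network processes a quarter of the pixels, so a 2-3x end-to-end speedup is plausible once fixed costs such as preprocessing, data transfer, and post-processing are included.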

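6. Summary

Deploying YOLO11 to BM1684 comes down to three stages: ONNX format adaptation, INT8 quantization, and Runtime inference. The two points that matter most are keeping every model operator compatible with the BPU and using calibration data that is representative of the deployment inputs. With hb_mapper and Horizon Runtime from the Horizon toolchain, the low-power advantage of BM1684 can be put to full use for real-time object detection in embedded scenarios such as smart cameras and edge detection devices.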
