Performance comparison of CPU and GPU compute and OpenCV resize on the RK3588
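- I. Background
- II. Summary
- III. Related links
- IV. Procedure
  - 1. Environment setup
    - A. Install dependencies
    - B. Set the GPU to performance mode
    - C. Get GPU information
    - D. Get CPU information
  - 2. Query GPU information with the OpenCL SDK
  - 3. Matrix multiplication with the OpenCL API
  - 4. Measure GPU performance with clpeak
  - 5. Measure CPU compute throughput with OpenBLAS
  - 6. Benchmark OpenCV resize on the CPU and with OpenCL
    - A. Build OpenCV with OpenCL support
    - B. Run the OpenCV test program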
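I. Background
- The goal is to compare the performance of the CPU and the Mali GPU on the RK3588.
- Mali GPU compute throughput is measured with clpeak.
- CPU FP32 performance is measured with OpenBLAS (NEON optimizations enabled).
- OpenCV resize is benchmarked on the CPU and with OpenCL for each interpolation method: a 32x32 image is upscaled to 8192x8192 and then downscaled back to 32x32, repeated 100 times.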
II. Summary
- GPU model: Mali-LODX r0p0 (Mali-G610, 4 cores, r0p0, 0xA867)
- GPU FP32 (clpeak): 441.95 GFLOPS
- CPU FP32 (OpenBLAS + NEON): 53.68 GFLOPS
- Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
- Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
- Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
- Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
- Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
III. Related links
- Building OpenCV
IV. Procedure
1. Environment setup
A. Install dependencies
mv /lib/aarch64-linux-gnu/libOpenCL.so.1 /lib/aarch64-linux-gnu/libOpenCL.so.1.bk
ln -s /usr/lib/aarch64-linux-gnu/libmali.so /lib/aarch64-linux-gnu/libOpenCL.so.1
sudo apt install opencl-headers
sudo apt install ocl-icd-libopencl1
sudo apt install ocl-icd-opencl-dev
sudo apt install clinfo
B. Set the GPU to performance mode
echo performance > /sys/class/devfreq/fb000000.gpu/governor
echo performance > /sys/class/devfreq/fdab0000.npu/governor
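The same setting can be applied and verified from a benchmark program, so that a run does not start with the wrong governor. A minimal sketch in C, assuming the sysfs paths shown above; `write_governor` and `read_governor` are hypothetical helper names, and writing to sysfs normally requires root:

```c
#include <stdio.h>
#include <string.h>

/* Write a governor name (e.g. "performance") to a devfreq governor file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int write_governor(const char *path, const char *governor) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    int ok = fputs(governor, f) >= 0;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Read the active governor into buf and strip the trailing newline.
 * Returns 0 on success, -1 on failure. */
int read_governor(const char *path, char *buf, size_t len) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, (int)len, f)) { fclose(f); return -1; }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

For example, calling `write_governor("/sys/class/devfreq/fb000000.gpu/governor", "performance")` and then `read_governor` on the same path should give back `performance` before any timing starts.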
C. Get GPU information
cat /sys/class/misc/mali0/device/gpuinfo
clinfo
Output
Mali-G610 4 cores r0p0 0xA867
Number of platforms  1
  Platform Name  ARM Platform
  Platform Vendor  ARM
  Platform Version  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Platform Profile  FULL_PROFILE
  Platform Extensions  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Host timer resolution  1ns
  Platform Extensions function suffix  ARM

  Platform Name  ARM Platform
Number of devices  1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device Name  Mali-LODX r0p0
  Device Vendor  ARM
  Device Vendor ID  0xa8670000
  Device Version  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Driver Version  2.1
  Device OpenCL C Version  OpenCL C 2.0 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device Type  GPU
  Device Profile  FULL_PROFILE
  Device Available  Yes
  Compiler Available  Yes
  Linker Available  Yes
  Max compute units  4
  Max clock frequency  1000MHz
  Device Partition  (core)
    Max number of sub-devices  0
    Supported partition types  None
    Supported affinity domains  (n/a)
  Max work item dimensions  3
  Max work item sizes  1024x1024x1024
  Max work group size  1024
  Preferred work group size multiple  16
  Max sub-groups per work group  64
  Preferred / native vector sizes
    char  16 / 4
    short  8 / 2
    int  4 / 1
    long  2 / 1
    half  8 / 2 (cl_khr_fp16)
    float  4 / 1
    double  0 / 0 (n/a)
  Half-precision Floating-point support  (cl_khr_fp16)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
  Single-precision Floating-point support  (core)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support  (n/a)
  Address bits  64, Little-Endian
  Global memory size  16643870720 (15.5GiB)
  Error Correction support  No
  Max memory allocation  16643870720 (15.5GiB)
  Unified memory for Host and Device  Yes
  Shared Virtual Memory (SVM) capabilities  (core)
    Coarse-grained buffer sharing  Yes
    Fine-grained buffer sharing  No
    Fine-grained system sharing  No
    Atomics  No
  Minimum alignment for any data type  128 bytes
  Alignment of base address  1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM  0 bytes
    Global  0 bytes
    Local  0 bytes
  Max size for global variable  65536 (64KiB)
  Preferred total size of global vars  0
  Global Memory cache type  Read/Write
  Global Memory cache size  1048576 (1024KiB)
  Global Memory cache line size  64 bytes
  Image support  Yes
    Max number of samplers per kernel  16
    Max size for 1D images from buffer  65536 pixels
    Max 1D or 2D image array size  2048 images
    Base address alignment for 2D image buffers  32 bytes
    Pitch alignment for 2D image buffers  64 pixels
    Max 2D image size  65536x65536 pixels
    Max 3D image size  65536x65536x65536 pixels
    Max number of read image args  128
    Max number of write image args  64
    Max number of read/write image args  64
  Max number of pipe args  16
  Max active pipe reservations  1
  Max pipe packet size  1024
  Local memory type  Global
  Local memory size  32768 (32KiB)
  Max number of constant args  128
  Max constant buffer size  16643870720 (15.5GiB)
  Max size of kernel argument  1024
  Queue properties (on host)
    Out-of-order execution  Yes
    Profiling  Yes
  Queue properties (on device)
    Out-of-order execution  Yes
    Profiling  Yes
    Preferred size  2097152 (2MiB)
    Max size  16777216 (16MiB)
  Max queues on device  1
  Max events on device  1024
  Prefer user sync for interop  No
  Profiling timer resolution  1000ns
  Execution capabilities
    Run OpenCL kernels  Yes
    Run native kernels  No
    Sub-group independent forward progress  Yes
    IL version  SPIR-V_1.0
    SPIR versions  <printDeviceInfo:161: get CL_DEVICE_SPIR_VERSIONS size : error -30>
  printf() buffer size  1048576 (1024KiB)
  Built-in kernels  (n/a)
  Device Extensions  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)  Success [ARM]
  clCreateContext(NULL, ...) [default]  Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
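The `Max compute units` (4) and `Max clock frequency` (1000MHz) fields allow a rough estimate of the theoretical FP32 peak, which helps in judging measured numbers. The sketch below assumes 64 FP32 fused multiply-adds per core per clock for the Mali-G610. That figure does not appear in the output above and should be checked against Arm's documentation. `peak_gflops` is a hypothetical helper.

```c
/* Theoretical peak in GFLOPS:
 * cores * FMAs per core per clock * 2 FLOPs per FMA * clock in GHz.
 * The FMA-per-clock value is an assumption supplied by the caller. */
double peak_gflops(int cores, int fma_per_core_per_clk, double clock_mhz) {
    return cores * fma_per_core_per_clk * 2.0 * clock_mhz / 1000.0;
}
```

Under that assumption, `peak_gflops(4, 64, 1000.0)` evaluates to 512 GFLOPS.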
D. Get CPU information
lscpu
Output
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 3
Vendor ID: ARM
Model: 0
Model name: Cortex-A55
Stepping: r2p0
CPU max MHz: 2208.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 1 MiB
L3 cache: 3 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
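lscpu reports a single model name here (Cortex-A55), although the RK3588 combines four Cortex-A76 and four Cortex-A55 cores. The type of each core can be read from the `CPU part` lines of `/proc/cpuinfo`. A minimal sketch, assuming Arm's part numbers 0xd05 (Cortex-A55) and 0xd0b (Cortex-A76); `cpu_part_name` and `parse_cpu_part` are hypothetical helpers:

```c
#include <stdio.h>
#include <string.h>

/* Map an Arm "CPU part" number to a core name (only the two RK3588 core types). */
const char *cpu_part_name(unsigned part) {
    switch (part) {
    case 0xd05: return "Cortex-A55";
    case 0xd0b: return "Cortex-A76";
    default:    return "unknown";
    }
}

/* Parse a /proc/cpuinfo line such as "CPU part\t: 0xd0b".
 * Returns 1 and stores the part number on success, 0 for any other line. */
int parse_cpu_part(const char *line, unsigned *part) {
    if (strncmp(line, "CPU part", 8) != 0) return 0;
    const char *colon = strchr(line, ':');
    if (!colon) return 0;
    return sscanf(colon + 1, "%x", part) == 1;
}
```

Feeding every line of `/proc/cpuinfo` through `parse_cpu_part` and printing `cpu_part_name` for each hit lists the core type per logical CPU.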
2. Query GPU information with the OpenCL SDK
cat > cl_query.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms = 0;

    // Get the number of available platforms
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    if (clStatus != CL_SUCCESS || num_platforms == 0) {
        fprintf(stderr, "No OpenCL platform found (status %d)\n", clStatus);
        return 1;
    }
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * num_platforms);

    // Get all platform IDs
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);
    printf("Number of OpenCL platforms: %u\n", num_platforms);

    // Iterate over the platforms
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char buffer[10240];
        printf("\nPlatform %u:\n", i + 1);

        // Platform name
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(buffer), buffer, NULL);
        printf("  Name: %s\n", buffer);

        // Platform vendor
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(buffer), buffer, NULL);
        printf("  Vendor: %s\n", buffer);

        // Platform version
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(buffer), buffer, NULL);
        printf("  Version: %s\n", buffer);

        // Number of devices
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
        cl_device_id *devices = (cl_device_id*) malloc(sizeof(cl_device_id) * num_devices);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

        // Iterate over the devices
        for (cl_uint j = 0; j < num_devices; ++j) {
            printf("  Device %u:\n", j + 1);

            // Device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
            printf("    Name: %s\n", buffer);

            // Device type
            cl_device_type device_type;
            clGetDeviceInfo(devices[j], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
            if (device_type & CL_DEVICE_TYPE_CPU)
                printf("    Type: CPU\n");
            if (device_type & CL_DEVICE_TYPE_GPU)
                printf("    Type: GPU\n");
            if (device_type & CL_DEVICE_TYPE_ACCELERATOR)
                printf("    Type: accelerator\n");

            // Number of compute units
            cl_uint compute_units;
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            printf("    Compute units: %u\n", compute_units);

            // Global memory size
            cl_ulong global_mem;
            clGetDeviceInfo(devices[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            printf("    Global memory size: %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        }
        free(devices);
    }
    free(platforms);
    return 0;
}
EOF
gcc -o cl_query cl_query.c -lOpenCL
./cl_query
Output
Number of OpenCL platforms: 1

Platform 1:
  Name: ARM Platform
  Vendor: ARM
  Version: OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device 1:
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
    Name: Mali-LODX r0p0
    Type: GPU
    Compute units: 4
    Global memory size: 15872 MB
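The query program does not check most of its return codes, and the clinfo log in section 1.C contains a raw `error -30`. Mapping a numeric status to its symbolic name makes such failures easier to read. A minimal sketch covering a few codes defined in the OpenCL headers; `cl_status_name` is a hypothetical helper and the list is deliberately incomplete:

```c
/* Return the symbolic name of a few common OpenCL status codes.
 * The numeric values are those defined in CL/cl.h. */
const char *cl_status_name(int status) {
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "unknown OpenCL status";
    }
}
```

With this helper, the `error -30` printed by clinfo for `CL_DEVICE_SPIR_VERSIONS` reads as `CL_INVALID_VALUE`.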
3. Matrix multiplication with the OpenCL API
cat > matmul.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <time.h>
#include <sys/time.h>

#define MATRIX_SIZE 8192
#define TILE_SIZE 32

// Return the current time in seconds; used to measure elapsed time
double get_current_time() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)(tp.tv_sec) + (double)(tp.tv_usec) / 1e6;
}

#define xstr(s) str(s)
#define str(s) #s

const char *kernelSource = "\n" \
"__kernel void mat_mul_optimized(const int N,\n" \
"                                __global float* A,\n" \
"                                __global float* B,\n" \
"                                __global float* C) {\n" \
"    const int TILE_SIZE = " xstr(TILE_SIZE) ";\n" \
"    __local float Asub[TILE_SIZE][TILE_SIZE];\n" \
"    __local float Bsub[TILE_SIZE][TILE_SIZE];\n" \
"    int global_row = get_global_id(1);\n" \
"    int global_col = get_global_id(0);\n" \
"    int local_row = get_local_id(1);\n" \
"    int local_col = get_local_id(0);\n" \
"    float sum = 0.0f;\n" \
"    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;\n" \
"    for (int t = 0; t < numTiles; ++t) {\n" \
"        int tiled_row = global_row;\n" \
"        int tiled_col = t * TILE_SIZE + local_col;\n" \
"        if (tiled_row < N && tiled_col < N)\n" \
"            Asub[local_row][local_col] = A[tiled_row * N + tiled_col];\n" \
"        else\n" \
"            Asub[local_row][local_col] = 0.0f;\n" \
"        tiled_row = t * TILE_SIZE + local_row;\n" \
"        tiled_col = global_col;\n" \
"        if (tiled_row < N && tiled_col < N)\n" \
"            Bsub[local_row][local_col] = B[tiled_row * N + tiled_col];\n" \
"        else\n" \
"            Bsub[local_row][local_col] = 0.0f;\n" \
"        barrier(CLK_LOCAL_MEM_FENCE);\n" \
"        for (int k = 0; k < TILE_SIZE; ++k) {\n" \
"            sum += Asub[local_row][k] * Bsub[k][local_col];\n" \
"        }\n" \
"        barrier(CLK_LOCAL_MEM_FENCE);\n" \
"    }\n" \
"    if (global_row < N && global_col < N)\n" \
"        C[global_row * N + global_col] = sum;\n" \
"}\n";

int main() {
    int N = MATRIX_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize the matrices
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Get the platform and device
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

    // Create the OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);

    // Create the command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Create the device buffers
    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &ret);

    // Copy the input data to the device buffers
    ret = clEnqueueWriteBuffer(commandQueue, d_A, CL_TRUE, 0, bytes, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, d_B, CL_TRUE, 0, bytes, h_B, 0, NULL, NULL);

    // Start of kernel compilation
    double compile_start = get_current_time();

    // Create the program object
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&kernelSource, NULL, &ret);

    // Build the kernel program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Print the build log on failure
    if (ret != CL_SUCCESS) {
        size_t log_size;
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("CL Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }

    // End of kernel compilation
    double compile_end = get_current_time();
    double compile_time = compile_end - compile_start;

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "mat_mul_optimized", &ret);

    // Set the kernel arguments
    ret = clSetKernelArg(kernel, 0, sizeof(int), (void*)&N);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_A);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_B);
    ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&d_C);

    // Global and local work sizes (global size rounded up to a multiple of TILE_SIZE)
    size_t local[2] = {TILE_SIZE, TILE_SIZE};
    size_t global[2] = {(size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE,
                        (size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE};

    // Start of the first kernel execution
    double launch_start = get_current_time();

    // Run the kernel
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, local, 0, NULL, NULL);
    printf("clEnqueueNDRangeKernel:%d\n", ret);

    // Wait for the command queue to finish
    clFinish(commandQueue);

    // End of the first kernel execution
    double launch_end = get_current_time();
    double launch_time = launch_end - launch_start;

    // Read back the result
    ret = clEnqueueReadBuffer(commandQueue, d_C, CL_TRUE, 0, bytes, h_C, 0, NULL, NULL);

    // Compute GFLOPS: a matrix product takes 2*N^3 floating-point operations
    double total_ops = 2.0 * N * N * N;
    double gflops = (total_ops / 1e9) / launch_time;

    // Print the results
    printf("Compile time: %f s\n", compile_time);
    printf("First kernel execution time: %f s\n", launch_time);
    printf("Performance: %f GFLOPS\n", gflops);

    // Release resources
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseContext(context);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
EOF
gcc -o matmul matmul.c -lOpenCL
./matmul
Output
Compile time: 0.031085 s
First kernel execution time: 62.258528 s
Performance: 17.660418 GFLOPS
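The program reports a throughput figure but never inspects `h_C`. Because both inputs are filled with 1.0f, every element of the product should equal N (8192), which gives a cheap correctness check. A minimal sketch; `count_mismatches` is a hypothetical helper that could be called on `h_C` after `clEnqueueReadBuffer`:

```c
#include <stddef.h>

/* Count elements that differ from the expected value by more than tol.
 * NaN results are counted as mismatches because the comparison fails for them. */
size_t count_mismatches(const float *c, size_t count, float expected, float tol) {
    size_t bad = 0;
    for (size_t i = 0; i < count; ++i) {
        float d = c[i] - expected;
        if (d < 0.0f) d = -d;
        if (!(d <= tol)) ++bad;
    }
    return bad;
}
```

In `matmul.c` this would be `count_mismatches(h_C, (size_t)N * N, (float)N, 0.5f)`; a nonzero result means the timing belongs to a kernel that did not produce the right answer.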
4. Measure GPU performance with clpeak
git clone https://gitcode.com/gh_mirrors/cl/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./clpeak
Output
Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 25.71
      float2  : 24.45
      float4  : 23.70
      float8  : 12.05
      float16 : 12.01

    Single-precision compute (GFLOPS)
      float   : 441.77
      float2  : 470.27
      float4  : 466.52
      float8  : 435.65
      float16 : 411.38

    Half-precision compute (GFLOPS)
      half   : 441.96
      half2  : 878.25
      half4  : 911.51
      half8  : 886.19
      half16 : 846.44

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 124.96
      int2  : 125.71
      int4  : 125.16
      int8  : 123.82
      int16 : 124.24

    Integer compute Fast 24bit (GIOPS)
      int   : 125.16
      int2  : 125.63
      int4  : 125.20
      int8  : 123.73
      int16 : 124.33

    Integer char (8bit) compute (GIOPS)
      char   : 126.47
      char2  : 251.55
      char4  : 498.03
      char8  : 497.37
      char16 : 491.94

    Integer short (16bit) compute (GIOPS)
      short   : 126.31
      short2  : 250.90
      short4  : 249.47
      short8  : 248.51
      short16 : 245.30

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.54
      enqueueReadBuffer               : 9.97
      enqueueWriteBuffer non-blocking : 8.55
      enqueueReadBuffer non-blocking  : 9.99
      enqueueMapBuffer(for read)      : 61.66
        memcpy from mapped ptr        : 11.95
      enqueueUnmap(after write)       : 62.02
        memcpy to mapped ptr          : 11.89

    Kernel launch latency : 26.81 us
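The clpeak figures give a simple bound for the tiled kernel in section 3. For each tile step, a 32x32 work-group loads two 32x32 float tiles (8192 bytes) from global memory and performs 2 x 32^3 = 65536 floating-point operations on them, about 8 FLOPs per byte. The sketch below is a rough roofline estimate that assumes global-memory bandwidth and peak FP32 rate are the only limits; caches, scheduling and the write of C are ignored. `tiled_gemm_intensity` and `roofline_gflops` are hypothetical helpers.

```c
/* FLOPs per byte of global-memory traffic for one TxT float tile pair:
 * 2*T^3 FLOPs over 2*T*T*4 bytes, which simplifies to T/4. */
double tiled_gemm_intensity(int tile) {
    return tile / 4.0;
}

/* Attainable GFLOPS under a two-limit roofline model:
 * the smaller of the compute peak and bandwidth * arithmetic intensity. */
double roofline_gflops(double peak, double bandwidth_gbps, double flops_per_byte) {
    double memory_bound = bandwidth_gbps * flops_per_byte;
    return memory_bound < peak ? memory_bound : peak;
}
```

With the measured 441.77 GFLOPS and 25.71 GB/s, the bound is 25.71 x 8 = 205.68 GFLOPS, so the 17.66 GFLOPS obtained in section 3 is far below what even this simple model allows.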
5. Measure CPU compute throughput with OpenBLAS
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=ARMV8
make install
cd benchmark
make TARGET=ARMV8 sgemm
cc sgemm.o -o sgemm /opt/OpenBLAS/lib/libopenblas.so -Wl,-rpath=/opt/OpenBLAS/lib/
export OPENBLAS_NUM_THREADS=8
export OPENBLAS_LOOPS=10
export OPENBLAS_PARAM_M=8192
export OPENBLAS_PARAM_N=8192
export OPENBLAS_PARAM_K=8192
./sgemm
Output
From : 1 To : 200 Step=1 : Transa=N : Transb=N
SIZE Flops Time
M=8192, N=8192, K=8192 : 53485.68 MFlops 205.571220 sec
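The MFlops figure follows from the operation count of a matrix product, 2 x M x N x K per multiplication, divided by the elapsed time. The reported 205.571220 sec matches 53485.68 MFlops only if it covers all ten repetitions set by OPENBLAS_LOOPS, which is the assumption made here. `gemm_gflops` is a hypothetical helper:

```c
/* GFLOPS of `loops` matrix products of size MxK times KxN that took
 * total_seconds altogether; each product costs 2*M*N*K operations. */
double gemm_gflops(double m, double n, double k, int loops, double total_seconds) {
    return 2.0 * m * n * k * loops / total_seconds / 1e9;
}
```

`gemm_gflops(8192, 8192, 8192, 10, 205.571220)` gives about 53.49 GFLOPS, and the same helper with `loops = 1` and 62.258528 s reproduces the 17.66 GFLOPS of the OpenCL kernel in section 3.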
6. Benchmark OpenCV resize on the CPU and with OpenCL
A. Build OpenCV with OpenCL support
- OpenCV change (link against libmali.so)
diff --git a/cmake/OpenCVDetectOpenCL.cmake b/cmake/OpenCVDetectOpenCL.cmake
index 6ab2cae070..c3cf235e45 100644
--- a/cmake/OpenCVDetectOpenCL.cmake
+++ b/cmake/OpenCVDetectOpenCL.cmake
@@ -3,9 +3,8 @@ if(APPLE)
   set(OPENCL_LIBRARY "-framework OpenCL" CACHE STRING "OpenCL library")
   set(OPENCL_INCLUDE_DIR "" CACHE PATH "OpenCL include directory")
 else()
-  set(OPENCL_LIBRARY "" CACHE STRING "OpenCL library")
-  set(OPENCL_INCLUDE_DIR "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/1.2" CACHE PATH "OpenCL include directory")
-  ocv_install_3rdparty_licenses(opencl-headers "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/LICENSE.txt")
+  set(OPENCL_LIBRARY "/usr/lib/aarch64-linux-gnu/libmali.so")
+  set(OPENCL_INCLUDE_DIR "/usr/include")
 endif()
 mark_as_advanced(OPENCL_INCLUDE_DIR OPENCL_LIBRARY)
- Build OpenCV
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout bdb6a968ce69a2bf7c34724f9052c20e941ab47b
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=`pwd`/_install \
  -DWITH_OPENCL=ON -DWITH_NEON=ON \
  -DBUILD_SHARED_LIBS=ON \
  -D BUILD_opencv_world=ON -DBUILD_TESTS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_opencv_apps=OFF \
  -DBUILD_opencv_dnn=OFF -DBUILD_opencv_calib3d=OFF \
  -DBUILD_opencv_imgproc=ON -DBUILD_opencv_imgcodecs=ON ..
make -j4
make install
B. Run the OpenCV test program
cat > opencv_resize.cpp <<-'EOF'
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>
#include <string>
#include <vector>

void run(int resize_mode)
{
    // Create a 32x32 random image
    cv::Mat src = cv::Mat::zeros(32, 32, CV_8UC3);
    cv::randu(src, cv::Scalar::all(0), cv::Scalar::all(255));

    // ------------------------------------
    // Run on the CPU
    // ------------------------------------
    cv::ocl::setUseOpenCL(false);
    cv::Mat enlarged_cpu, resized_back_cpu;
    // Start time of the CPU loop
    int64 start_time_cpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Upscale to 8192x8192
        cv::resize(src, enlarged_cpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Downscale back to 32x32
        cv::resize(enlarged_cpu, resized_back_cpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // End time of the CPU loop
    int64 end_time_cpu = cv::getTickCount();
    // Total CPU time for the 100 iterations
    double time_resize_cpu = (end_time_cpu - start_time_cpu) / cv::getTickFrequency();

    // ------------------------------------
    // Run on the GPU (OpenCL)
    // ------------------------------------
    cv::ocl::setUseOpenCL(true);
    cv::UMat src_umat;
    src.copyTo(src_umat);
    cv::UMat enlarged_gpu, resized_back_gpu;
    // Start time of the GPU loop
    int64 start_time_gpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Upscale to 8192x8192
        cv::resize(src_umat, enlarged_gpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Downscale back to 32x32
        cv::resize(enlarged_gpu, resized_back_gpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // End time of the GPU loop.
    // Note: nothing waits for the OpenCL queue here (e.g. cv::ocl::finish()),
    // so queued work may still be running when the timer stops.
    int64 end_time_gpu = cv::getTickCount();
    // Total GPU time for the 100 iterations
    double time_resize_gpu = (end_time_gpu - start_time_gpu) / cv::getTickFrequency();

    std::cout << "CPU time (s): " << time_resize_cpu << " "
              << "GPU time (s): " << time_resize_gpu << std::endl;
}

int main() {
    // Check whether OpenCL is available
    if (!cv::ocl::haveOpenCL()) {
        std::cout << "OpenCL is not available on this system." << std::endl;
        return -1;
    }

    // Print the OpenCL device
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_GPU)) {
        std::cout << "No usable GPU device found, running on the CPU." << std::endl;
    } else {
        cv::ocl::Device device = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device in use: " << device.name() << std::endl;
    }

    // Interpolation methods to test
    std::vector<int> interpolation_methods = {
        cv::INTER_NEAREST,
        cv::INTER_LINEAR,
        cv::INTER_CUBIC,
        cv::INTER_AREA,
        cv::INTER_LANCZOS4
    };

    // Names of the interpolation methods, used in the output
    std::vector<std::string> interpolation_names = {
        "INTER_NEAREST",
        "INTER_LINEAR",
        "INTER_CUBIC",
        "INTER_AREA",
        "INTER_LANCZOS4"
    };

    for (size_t i = 0; i < interpolation_methods.size(); ++i) {
        int interpolation = interpolation_methods[i];
        std::string method_name = interpolation_names[i];
        std::cout << "Interpolation: " << method_name << " ";
        run(interpolation);
    }
    return 0;
}
EOF
g++ -o opencv_resize opencv_resize.cpp -I _install/include/opencv4 \
    _install/lib/libopencv_world.so -Wl,-rpath=_install/lib
export OPENBLAS_NUM_THREADS=8
./opencv_resize
Output
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
OpenCL device in use: Mali-LODX r0p0
Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
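A quick plausibility check on the GPU timings: every upscale writes an 8192x8192 three-channel 8-bit image, 201326592 bytes, so 100 iterations write about 20.1 GB. At the 25.71 GB/s global-memory bandwidth measured by clpeak in section 4, that alone needs roughly 0.78 s. This is more than the 0.0673 s and 0.0189 s reported for INTER_NEAREST and INTER_LINEAR. One possible explanation, not confirmed by the source, is that the timer stops before the queued OpenCL work has finished. `min_write_seconds` is a hypothetical helper:

```c
/* Lower bound on the time needed just to write the upscaled images,
 * given a memory bandwidth in GB/s (1 GB = 1e9 bytes, 1 byte per channel). */
double min_write_seconds(int width, int height, int channels,
                         int iterations, double bandwidth_gbps) {
    double bytes = (double)width * height * channels * iterations;
    return bytes / (bandwidth_gbps * 1e9);
}
```

`min_write_seconds(8192, 8192, 3, 100, 25.71)` is about 0.78 s. If missing synchronization is indeed the cause, calling `cv::ocl::finish()` before reading the end time would make the GPU figures include kernel execution.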