QuickLearnFromPicture


GPU-Accelerated Applications vs. CPU Applications

CUDA Thread Hierarchy

Thread Hierarchy Variables Provided by CUDA

Coordinating Parallel Threads

Using these variables (gridDim.x, blockDim.x, blockIdx.x, and threadIdx.x), the formula threadIdx.x + blockIdx.x * blockDim.x maps each thread to an element of the vector.

The code must check that the dataIndex computed by threadIdx.x + blockIdx.x * blockDim.x is less than N (the number of data elements), because the grid may contain more threads than there are elements.
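
A minimal sketch of this pattern (the kernel name, the per-element operation, and the parameter names here are illustrative, not taken from the original):

__global__ void doubleElementsGuarded(int *a, int N)
{
  // Map this thread to one element of the vector.
  int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;

  // The grid may contain more threads than elements,
  // so guard against out-of-range accesses.
  if (dataIndex < N)
  {
    a[dataIndex] *= 2;
  }
}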

Grid-Stride Loops

#include <stdio.h>

void init(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    a[i] = i;
  }
}

__global__
void doubleElements(int *a, int N)
{
  // Each thread starts at its global index and advances by the total
  // number of threads in the grid (the grid stride), so all N elements
  // are covered even when the grid has fewer threads than elements.
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  for (int i = idx; i < N; i += stride)
  {
    a[i] *= 2;
  }
}

bool checkElementsAreDoubled(int *a, int N)
{
  int i;
  for (i = 0; i < N; ++i)
  {
    if (a[i] != i*2) return false;
  }
  return true;
}

int main()
{
  int N = 10000;
  int *a;

  size_t size = N * sizeof(int);
  cudaMallocManaged(&a, size);

  init(a, N);

  size_t threads_per_block = 256;
  size_t number_of_blocks = 32;

  doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
  cudaDeviceSynchronize();

  bool areDoubled = checkElementsAreDoubled(a, N);
  printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

  cudaFree(a);
}

Concurrent CUDA Streams

Operations in the default stream do not overlap with operations in non-default streams.

The default stream is special. While any operation is executing in the default stream, no operation in any non-default stream may execute at the same time: the default stream waits for all non-default streams to finish before it starts running, and only after it finishes can other non-default streams begin.
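
A minimal sketch of issuing work into a non-default stream (the launch configuration and variable names here are illustrative; doubleElements is the kernel from the grid-stride example above):

cudaStream_t stream;
cudaStreamCreate(&stream);

// The 4th launch-configuration argument selects the stream;
// the 3rd is the dynamic shared-memory size (0 here).
doubleElements<<<32, 256, 0, stream>>>(a, N);

// Wait for all work issued to this stream, then release it.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);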

Overlapping Copy and Compute with Streams

Using the default stream, a typical three-step CUDA program performs the HtoD copy, the compute, and the DtoH copy one after another, as shown in the figure below:

A naive way to overlap these steps would be to issue the copies and the compute into separate non-default streams. But that clearly does not work. Recall that operations in different non-default streams have no fixed ordering relative to one another, so the following can happen:

the compute may begin before the data it needs has finished transferring to the GPU.

Another naive approach is to issue all operations into one and the same non-default stream, which does guarantee that the data transfers and the compute stay in order,

but this is no different from using the default stream: there is still no overlap. Now consider taking the existing program and splitting the data into 2 chunks.

If all of the operations for each chunk are moved into that chunk's own non-default stream, the ordering of transfer and compute is preserved within each chunk, while partial overlap across chunks becomes possible.

In principle, increasing the number of chunks should improve the overlap further. The best way to find the ideal number of chunks is to observe the program's performance.

When data is chunked across multiple streams, the indexing can be tricky, so let's walk through how to index with a few examples. First, allocate the memory needed for all of the chunks; to keep the example clear, we use a small amount of data.

  1. Allocate memory on the CPU and the GPU (N is the total amount of data);

cudaMallocHost(&data_cpu, N)
cudaMalloc(&data_gpu, N)
  2. Next, define the number of streams, then create them in a loop and collect them in an array (here the number of streams is 2);

num_streams = 2
for stream_i in num_streams
    cudaStreamCreate(stream)
    streams[stream_i] = stream
  3. The size of each chunk depends on the number of data entries and the number of streams;

chunk_size = N / num_streams
  4. Each stream processes one chunk, so we need to compute that chunk's starting index into the whole dataset. To do so, iterate over the streams starting from 0 and multiply the stream index by the chunk size. Starting at this lower index and taking one chunk's worth of data gives each stream its slice of the full dataset; this is applied to every stream_i;

for stream_i in num_streams
	lower = chunk_size*stream_i
  5. With these values computed, we can now perform the HtoD memory copy in each non-default stream (note that the destination, here the device pointer, is the first argument of cudaMemcpyAsync);

cudaMemcpyAsync(
    data_gpu+lower,
    data_cpu+lower, sizeof(uint64_t)*chunk_size, cudaMemcpyHostToDevice,
    streams[stream_i]
)

In the example above, N is evenly divisible by the number of streams. What if it is not? To handle that case, we compute the chunk size with round-up (ceiling) division. But this still leaves a problem, as shown in the figure below:

We can now reach all of the data, but a new problem appears: the last chunk holds less actual data than chunk_size, so processing a full chunk there would let threads access memory out of range.

The fix is as follows:

  1. Compute an upper index for each chunk (never exceeding N);

upper = min(lower+chunk_size, N)
  2. Then compute the chunk width from upper and lower;

width = upper - lower
  3. Now iterate using width instead of the chunk size.

This way the chunks fit the data exactly, regardless of the data size or the number of streams.
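
The round-up division used below is provided by an sdiv helper in the course's helper code, which is not shown on this page; a minimal sketch of such a helper (an assumption about its implementation, not the original) could be:

// Round-up (ceiling) integer division, e.g. sdiv(10, 3) == 4.
__host__ __device__ uint64_t sdiv(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}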

Code Example: Overlapping Copy and Compute

// Able to handle when `num_entries % num_streams != 0`.

const uint64_t num_entries = 10;
const uint64_t num_iters = 1UL << 10;

cudaMallocHost(&data_cpu, sizeof(uint64_t)*num_entries);
cudaMalloc    (&data_gpu, sizeof(uint64_t)*num_entries);

// Set the number of streams to not evenly divide num_entries.
const uint64_t num_streams = 3;

cudaStream_t streams[num_streams];
for (uint64_t stream = 0; stream < num_streams; stream++)
    cudaStreamCreate(&streams[stream]);

// Use round-up division (`sdiv`, defined in helper.cu) so `num_streams*chunk_size`
// is never less than `num_entries`.
// This can result in `num_streams*chunk_size` being greater than `num_entries`, meaning
// we will need to guard against out-of-range errors in the final "tail" stream (see below).
const uint64_t chunk_size = sdiv(num_entries, num_streams);

for (uint64_t stream = 0; stream < num_streams; stream++) {

    const uint64_t lower = chunk_size*stream;
    // For tail stream `lower+chunk_size` could be out of range, so here we guard against that.
    const uint64_t upper = min(lower+chunk_size, num_entries);
    // Since the tail stream width may not be `chunk_size`,
    // we need to calculate a separate `width` value.
    const uint64_t width = upper-lower;

    // Use `width` instead of `chunk_size`.
    cudaMemcpyAsync(data_gpu+lower, data_cpu+lower, 
           sizeof(uint64_t)*width, cudaMemcpyHostToDevice, 
           streams[stream]);

    // Use `width` instead of `chunk_size`.
    decrypt_gpu<<<80*32, 64, 0, streams[stream]>>>
        (data_gpu+lower, width, num_iters);

    // Use `width` instead of `chunk_size`.
    cudaMemcpyAsync(data_cpu+lower, data_gpu+lower, 
           sizeof(uint64_t)*width, cudaMemcpyDeviceToHost, 
           streams[stream]);
}

// Destroy streams.
for (uint64_t stream = 0; stream < num_streams; stream++)
    cudaStreamDestroy(streams[stream]);

The full code is as follows:

#include <cstdint>
#include <iostream>
#include "helpers.cuh"
#include "encryption.cuh"

void encrypt_cpu(uint64_t * data, uint64_t num_entries, 
                 uint64_t num_iters, bool parallel=true) {

    #pragma omp parallel for if (parallel)
    for (uint64_t entry = 0; entry < num_entries; entry++)
        data[entry] = permute64(entry, num_iters);
}

__global__ 
void decrypt_gpu(uint64_t * data, uint64_t num_entries, 
                 uint64_t num_iters) {

    const uint64_t thrdID = blockIdx.x*blockDim.x+threadIdx.x;
    const uint64_t stride = blockDim.x*gridDim.x;

    for (uint64_t entry = thrdID; entry < num_entries; entry += stride)
        data[entry] = unpermute64(data[entry], num_iters);
}

bool check_result_cpu(uint64_t * data, uint64_t num_entries,
                      bool parallel=true) {

    uint64_t counter = 0;

    #pragma omp parallel for reduction(+: counter) if (parallel)
    for (uint64_t entry = 0; entry < num_entries; entry++)
        counter += data[entry] == entry;

    return counter == num_entries;
}

int main (int argc, char * argv[]) {

    Timer timer;
    Timer overall;

    const uint64_t num_entries = 1UL << 26;
    const uint64_t num_iters = 1UL << 10;
    const bool openmp = true;

    // Define the number of streams.
    const uint64_t num_streams = 32;
    
    // Use round-up division to calculate chunk size.
    const uint64_t chunk_size = sdiv(num_entries, num_streams);

    timer.start();
    uint64_t * data_cpu, * data_gpu;
    cudaMallocHost(&data_cpu, sizeof(uint64_t)*num_entries);
    cudaMalloc    (&data_gpu, sizeof(uint64_t)*num_entries);
    timer.stop("allocate memory");
    check_last_error();

    timer.start();
    encrypt_cpu(data_cpu, num_entries, num_iters, openmp);
    timer.stop("encrypt data on CPU");

    timer.start();
    
    // Create array for storing streams.
    cudaStream_t streams[num_streams];
    
    // Create number of streams and store in array.
    for (uint64_t stream = 0; stream < num_streams; stream++)
        cudaStreamCreate(&streams[stream]);
    timer.stop("create streams");
    check_last_error();

    overall.start();
    timer.start();
    
    // For each stream...
    for (uint64_t stream = 0; stream < num_streams; stream++) {
        
        // ...calculate index into global data (`lower`) and size of data for it to process (`width`).
        const uint64_t lower = chunk_size*stream;
        const uint64_t upper = min(lower+chunk_size, num_entries);
        const uint64_t width = upper-lower;

        // ...copy stream's chunk to device.
        cudaMemcpyAsync(data_gpu+lower, data_cpu+lower, 
               sizeof(uint64_t)*width, cudaMemcpyHostToDevice, 
               streams[stream]);

        // ...compute stream's chunk.
        decrypt_gpu<<<80*32, 64, 0, streams[stream]>>>
            (data_gpu+lower, width, num_iters);

        // ...copy stream's chunk to host.
        cudaMemcpyAsync(data_cpu+lower, data_gpu+lower, 
               sizeof(uint64_t)*width, cudaMemcpyDeviceToHost, 
               streams[stream]);
    }

    for (uint64_t stream = 0; stream < num_streams; stream++)
	// Synchronize streams before checking results on host.
        cudaStreamSynchronize(streams[stream]);    
    
    // Note modification of timer instance use.
    timer.stop("asynchronous H2D->kernel->D2H");
    overall.stop("total time on GPU");
    check_last_error();
    
    timer.start();
    const bool success = check_result_cpu(data_cpu, num_entries, openmp);
    std::cout << "STATUS: test " 
              << ( success ? "passed" : "failed")
              << std::endl;
    timer.stop("checking result on CPU");

    timer.start();      
    for (uint64_t stream = 0; stream < num_streams; stream++)
        // Destroy streams.
        cudaStreamDestroy(streams[stream]);    
    timer.stop("destroy streams");
    check_last_error();

    timer.start();
    cudaFreeHost(data_cpu);
    cudaFree    (data_gpu);
    timer.stop("free memory");
    check_last_error();
}
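
Assuming helpers.cuh and encryption.cuh from the DLI course are on the include path and the file is saved as, for example, streams.cu (the file name is illustrative), a build command along these lines is typical; -Xcompiler -fopenmp enables the OpenMP pragmas used in the host code:

nvcc -O3 -Xcompiler -fopenmp streams.cu -o streams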

reference

https://blog.csdn.net/qq_31985307/article/details/126237070
https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+S-AC-04+V1-ZH