Custom C++ and CUDA Operators

Created On: Jun 18, 2024 | Last Updated: Jan 28, 2025 | Last Verified: Nov 05, 2024

What you will learn

How to integrate custom operators written in C++/CUDA with PyTorch
How to test custom operators using torch.library.opcheck

Prerequisites

PyTorch 2.4 or later
Basic understanding of C++ and CUDA programming

Note

This tutorial also works on AMD ROCm with no additional modifications.

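PyTorch offers a large library of operators that work on Tensors (for example, torch.add and torch.sum). However, you may wish to bring a new custom operator to PyTorch. This tutorial demonstrates the recommended path to authoring a custom operator written in C++/CUDA.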

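For our tutorial, we'll demonstrate how to author a fused multiply-add C++ and CUDA operator that composes with PyTorch subsystems. The semantics of the operation are as follows: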

def mymuladd(a: Tensor, b: Tensor, c: float):
    return a * b + c

You can find the end-to-end working example for this tutorial here.

Setting up the Build System

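If you are developing custom C++/CUDA code, it must be compiled. Note that if you are interfacing with a Python library that already has bindings to precompiled C++/CUDA code, you might consider writing a custom Python operator instead (see Custom Python Operators).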

Use torch.utils.cpp_extension to compile custom C++/CUDA code for use with PyTorch. C++ extensions may be built either "ahead of time" with setuptools, or "just in time" via load_inline; we'll focus on the "ahead of time" flavor.

Using cpp_extension is as simple as writing the following setup.py:

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension(
            "extension_cpp",
            ["muladd.cpp"],
            # define Py_LIMITED_API with min version 3.9 to expose only the stable
            # limited API subset from Python.h
            extra_compile_args={"cxx": ["-DPy_LIMITED_API=0x03090000"]},
            py_limited_api=True)],  # Build 1 wheel across multiple Python versions
      cmdclass={'build_ext': cpp_extension.BuildExtension},
      options={"bdist_wheel": {"py_limited_api": "cp39"}}  # 3.9 is minimum supported Python version
)

If you need to compile CUDA code (for example, .cu files), use torch.utils.cpp_extension.CUDAExtension instead. See extension-cpp for an example of how this is set up.

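The above example represents what we refer to as a CPython agnostic wheel, meaning we are building a single wheel that can be run across multiple CPython versions (similar to pure Python packages). CPython agnosticism is desirable because it minimizes the number of wheels your custom library needs to support and release. The minimum version we'd like to support is 3.9, since it is the oldest supported version currently, so we use the corresponding hex code and specifier throughout the setup code. We suggest building the extension in an environment that runs the minimum CPython version you'd like to support, to minimize unknown behavior; here, we build the extension in a CPython 3.9 environment. Once built, this single wheel will run in any CPython 3.9+ environment. To achieve this, there are three key lines to note.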

The first is the specification of Py_LIMITED_API in extra_compile_args, set to the minimum CPython version you would like to support:

extra_compile_args={"cxx": ["-DPy_LIMITED_API=0x03090000"]},

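Defining the Py_LIMITED_API flag helps verify that the extension is in fact only using the CPython Stable Limited API, which is a requirement for building a CPython agnostic wheel. If this requirement is not met, it is possible to build a wheel that looks CPython agnostic but will crash, or worse, be silently incorrect, in another CPython environment. Take care to avoid unstable CPython APIs, for example APIs from libtorch_python (in particular the pytorch/python bindings), and to use only APIs from libtorch (ATen objects, operators, and the dispatcher). We strongly recommend defining the Py_LIMITED_API flag to help ascertain that the extension is compliant and safe to build as a CPython agnostic wheel. Note that defining this flag does not fully guarantee that the built wheel is CPython agnostic, but it is better than having no restriction at all. The Python docs mention several caveats, and you should test and verify yourself that the wheel is truly agnostic for the relevant CPython versions.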
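
The hex value passed to Py_LIMITED_API follows CPython's PY_VERSION_HEX layout, in which the major version occupies the top byte and the minor version the next one. As a minimal sketch, the flag for any minimum version could be derived as follows; the helper names limited_api_hex and limited_api_flag are hypothetical and not part of PyTorch or setuptools.

```python
def limited_api_hex(major: int, minor: int) -> str:
    """Return the Py_LIMITED_API value for a minimum CPython version.

    Follows the PY_VERSION_HEX layout: major in bits 24-31, minor in
    bits 16-23. The micro, release-level, and serial fields stay zero.
    """
    value = (major << 24) | (minor << 16)
    return f"0x{value:08X}"


def limited_api_flag(major: int, minor: int) -> str:
    """Build the compiler define used in extra_compile_args."""
    return f"-DPy_LIMITED_API={limited_api_hex(major, minor)}"


print(limited_api_flag(3, 9))
```

For CPython 3.9 this reproduces the define used in the setup.py above.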

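The second and third lines, which specify py_limited_api, inform setuptools that you intend to build a CPython agnostic wheel and influence the naming of the wheel accordingly: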

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension(
            ...,
            py_limited_api=True)],  # Build 1 wheel across multiple Python versions
      ...,
      options={"bdist_wheel": {"py_limited_api": "cp39"}}  # 3.9 is minimum supported Python version
)

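It is necessary to specify py_limited_api=True as an argument to CppExtension/CUDAExtension and also as an option to the "bdist_wheel" command, together with the minimum supported CPython version (in this case, 3.9). Consequently, the setup in our tutorial builds one properly named wheel that can be installed across multiple CPython versions >=3.9.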
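
One way to confirm the naming is to inspect the compatibility tags in the built wheel's filename: a limited-API wheel carries the abi3 ABI tag, while a per-version wheel repeats the interpreter tag. The sketch below parses those tags; the helper names and both filenames are hypothetical and shown only for illustration.

```python
def parse_wheel_tags(filename: str) -> dict:
    """Split a wheel filename into its name, version, and compatibility tags.

    Wheel filenames have the form
    {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def is_cpython_agnostic(filename: str) -> bool:
    """A limited-API wheel carries the abi3 ABI tag."""
    return parse_wheel_tags(filename)["abi"] == "abi3"


# Hypothetical filenames, for illustration only.
agnostic = "extension_cpp-0.0.1-cp39-abi3-linux_x86_64.whl"
per_version = "extension_cpp-0.0.1-cp311-cp311-linux_x86_64.whl"
print(is_cpython_agnostic(agnostic), is_cpython_agnostic(per_version))
```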

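If your extension uses CPython APIs outside the stable limited set, then you cannot build a CPython agnostic wheel! You should build one wheel per CPython version instead, like so: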

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension(
            "extension_cpp",
            ["muladd.cpp"])],
      cmdclass={'build_ext': cpp_extension.BuildExtension},
)

Defining the custom op and adding backend implementations

First, let's write a C++ function that computes mymuladd:

at::Tensor mymuladd_cpu(const at::Tensor& a, const at::Tensor& b, double c) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = a_ptr[i] * b_ptr[i] + c;
  }
  return result;
}

In order to use this from PyTorch's Python frontend, we need to register it as a PyTorch operator using the TORCH_LIBRARY API. This automatically binds the operator to Python.

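Operator registration is a two-step process:

Define the operator: this step ensures that PyTorch is aware of the new operator.
Register backend implementations: in this step, implementations for various backends, such as CPU and CUDA, are associated with the operator.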

Defining an operator

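To define an operator, follow these steps:

Select a namespace for the operator. We recommend that the namespace be the name of your top-level project; we'll use "extension_cpp" in our tutorial.
Provide a schema string that specifies the input/output types of the operator and whether any input Tensors will be mutated. We support more types in addition to Tensor and float; see The Custom Operators Manual for more details.
- If you are authoring an operator that can mutate its input Tensors, see here (Creating mutable operators) for how to specify that.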

TORCH_LIBRARY(extension_cpp, m) {
   // Note that "float" in the schema corresponds to the C++ double type
   // and the Python float type.
   m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
 }

This makes the operator available from Python via torch.ops.extension_cpp.mymuladd.

Registering backend implementations for an operator

Use TORCH_LIBRARY_IMPL to register a backend implementation for the operator.

TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
}

If you also have a CUDA implementation of mymuladd, you can register it in a separate TORCH_LIBRARY_IMPL block:

__global__ void muladd_kernel(int numel, const float* a, const float* b, float c, float* result) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < numel) result[idx] = a[idx] * b[idx] + c;
}

at::Tensor mymuladd_cuda(const at::Tensor& a, const at::Tensor& b, double c) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CUDA);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CUDA);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();

  int numel = a_contig.numel();
  muladd_kernel<<<(numel+255)/256, 256>>>(numel, a_ptr, b_ptr, c, result_ptr);
  return result;
}

TORCH_LIBRARY_IMPL(extension_cpp, CUDA, m) {
  m.impl("mymuladd", &mymuladd_cuda);
}
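
The launch configuration in mymuladd_cuda uses a fixed block of 256 threads and rounds the number of blocks up so that every element is covered; the idx < numel guard in the kernel then masks off the surplus threads in the last block. The arithmetic can be sketched in plain Python (the function names are illustrative only):

```python
def num_blocks(numel: int, threads_per_block: int = 256) -> int:
    """Ceiling division: the smallest block count that covers numel elements."""
    return (numel + threads_per_block - 1) // threads_per_block


def idle_threads(numel: int, threads_per_block: int = 256) -> int:
    """Threads launched beyond numel, which the idx < numel guard masks off."""
    return num_blocks(numel, threads_per_block) * threads_per_block - numel


for n in (1, 256, 257, 1000):
    print(n, num_blocks(n), idle_threads(n))
```

With 1000 elements, for instance, four blocks (1024 threads) are launched and 24 threads do no work.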

Adding `torch.compile` support for an operator

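To add torch.compile support for an operator, we must add a FakeTensor kernel (also known as a "meta kernel" or "abstract impl"). FakeTensors are Tensors that have metadata (such as shape, dtype, and device) but no data: the FakeTensor kernel for an operator specifies how to compute the metadata of the output Tensors given the metadata of the input Tensors. The FakeTensor kernel should return dummy Tensors of your choice with the correct Tensor metadata (shape/strides/dtype/device).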

We recommend doing this from Python via the torch.library.register_fake API, though it is also possible from C++ (see The Custom Operators Manual for more details).

import torch

# Important: the C++ custom operator definitions should be loaded first
# before calling ``torch.library`` APIs that add registrations for the
# C++ custom operator(s). The following import loads our
# C++ custom operator definitions.
# Note that if you are striving for Python agnosticism, you should use
# the ``load_library(...)`` API call instead. See the next section for
# more details.
from . import _C

@torch.library.register_fake("extension_cpp::mymuladd")
def _(a, b, c):
    torch._check(a.shape == b.shape)
    torch._check(a.dtype == torch.float)
    torch._check(b.dtype == torch.float)
    torch._check(a.device == b.device)
    return torch.empty_like(a)

Setting up hybrid Python/C++ registration

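In this tutorial, we defined a custom operator in C++, added CPU/CUDA implementations in C++, and added FakeTensor kernels and backward formulas in Python. The order in which these registrations are loaded (or imported) matters: importing in the wrong order leads to an error.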

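To use the custom operator with hybrid Python/C++ registrations, we must first load the C++ library that holds the custom operator definition and then call the torch.library registration APIs. This can happen in three ways: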

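The first way to load the C++ library that holds the custom operator definition is to define a dummy Python module for _C. Then, when you import the module in Python with import _C, the .so file corresponding to the extension is loaded and the TORCH_LIBRARY and TORCH_LIBRARY_IMPL static initializers run. You can create a dummy Python module with PYBIND11_MODULE as shown below, but you will notice that this does not compile with Py_LIMITED_API, because pybind11 does not promise to use only the stable limited CPython API! With the code below, you sadly cannot build a CPython agnostic wheel for your extension! (Foreshadowing: I wonder what the second way is.)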

// in, say, not_agnostic/csrc/extension_BAD.cpp
#include <pybind11/pybind11.h>

// The module name is passed as a bare identifier, not a string literal.
PYBIND11_MODULE(_C, m) {}

# in, say, extension/__init__.py
from . import _C

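In this tutorial, because we value being able to build a single wheel across multiple CPython versions, we replace the unstable PYBIND11 call with stable API calls. The code below compiles with -DPy_LIMITED_API=0x03090000 and successfully creates a dummy Python module for our _C extension so that it can be imported from Python. See extension_cpp/__init__.py and extension_cpp/csrc/muladd.cpp for more details.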

#include <Python.h>

extern "C" {
  /* Creates a dummy empty _C module that can be imported from Python.
    The import from Python will load the .so consisting of this file
    in this extension, so that the TORCH_LIBRARY static initializers
    below are run. */
  PyObject* PyInit__C(void)
  {
      static struct PyModuleDef module_def = {
          PyModuleDef_HEAD_INIT,
          "_C",   /* name of module */
          NULL,   /* module documentation, may be NULL */
          -1,     /* size of per-interpreter state of the module,
                    or -1 if the module keeps state in global variables. */
          NULL,   /* methods */
      };
      return PyModule_Create(&module_def);
  }
}

# in, say, extension/__init__.py
from . import _C

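If you want to avoid Python.h entirely in your C++ custom operator, you can use torch.ops.load_library("/path/to/library.so") in Python to load the .so file(s) compiled from the extension. Note that this approach creates no _C Python module for the extension, so you cannot call import _C from Python. Instead of relying on the import statement to trigger registration of the custom operators, torch.ops.load_library("/path/to/library.so") does the job. The challenge then shifts to locating the .so files so that you can load them, which is not always trivial: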

import torch
from pathlib import Path

so_files = list(Path(__file__).parent.glob("_C*.so"))
assert (
    len(so_files) == 1
), f"Expected one _C*.so file, found {len(so_files)}"
torch.ops.load_library(so_files[0])

from . import ops

Adding training (autograd) support for an operator

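Use torch.library.register_autograd to add training support for an operator. Prefer this over directly using Python torch.autograd.Function or C++ torch::autograd::Function; those must be used in a very specific way to avoid silently incorrect results (see The Custom Operators Manual for more details).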

def _backward(ctx, grad):
    a, b = ctx.saved_tensors
    grad_a, grad_b = None, None
    if ctx.needs_input_grad[0]:
        grad_a = grad * b
    if ctx.needs_input_grad[1]:
        grad_b = grad * a
    return grad_a, grad_b, None

def _setup_context(ctx, inputs, output):
    a, b, c = inputs
    saved_a, saved_b = None, None
    if ctx.needs_input_grad[0]:
        saved_b = b
    if ctx.needs_input_grad[1]:
        saved_a = a
    ctx.save_for_backward(saved_a, saved_b)

# This code adds training support for the operator. You must provide us
# the backward formula for the operator and a `setup_context` function
# to save values to be used in the backward.
torch.library.register_autograd(
    "extension_cpp::mymuladd", _backward, setup_context=_setup_context)

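Note that the backward must be a composition of PyTorch-understood operators. If you wish to use another custom C++ or CUDA kernel in your backward pass, it must be wrapped into a custom operator.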
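
The backward formula registered above follows from the chain rule: for out = a * b + c, the gradient with respect to a is grad * b and the gradient with respect to b is grad * a, while the non-tensor argument c receives None. As a quick sanity check, the sketch below compares these expressions against central finite differences, using plain Python floats instead of tensors; all function names are illustrative.

```python
def muladd(a: float, b: float, c: float) -> float:
    return a * b + c


def analytic_grads(a: float, b: float, grad: float):
    """Scalar version of _backward: (grad * b, grad * a)."""
    return grad * b, grad * a


def numeric_grads(a: float, b: float, c: float, grad: float, eps: float = 1e-6):
    """Central finite differences, scaled by the upstream gradient."""
    da = (muladd(a + eps, b, c) - muladd(a - eps, b, c)) / (2 * eps)
    db = (muladd(a, b + eps, c) - muladd(a, b - eps, c)) / (2 * eps)
    return grad * da, grad * db


a, b, c, grad = 1.5, -2.0, 0.25, 3.0
print(analytic_grads(a, b, grad))
print(numeric_grads(a, b, c, grad))
```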

If we had our own custom mymul kernel, we would need to wrap it into a custom operator and then call that operator from the backward:

// New! a mymul_cpu kernel
at::Tensor mymul_cpu(const at::Tensor& a, const at::Tensor& b) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_CHECK(a.device().type() == at::DeviceType::CPU);
  TORCH_CHECK(b.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = a_ptr[i] * b_ptr[i];
  }
  return result;
}

TORCH_LIBRARY(extension_cpp, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
  // New! defining the mymul operator
  m.def("mymul(Tensor a, Tensor b) -> Tensor");
}


TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
  // New! registering the cpu kernel for the mymul operator
  m.impl("mymul", &mymul_cpu);
}

def _backward(ctx, grad):
    a, b = ctx.saved_tensors
    grad_a, grad_b = None, None
    if ctx.needs_input_grad[0]:
        grad_a = torch.ops.extension_cpp.mymul.default(grad, b)
    if ctx.needs_input_grad[1]:
        grad_b = torch.ops.extension_cpp.mymul.default(grad, a)
    return grad_a, grad_b, None


def _setup_context(ctx, inputs, output):
    a, b, c = inputs
    saved_a, saved_b = None, None
    if ctx.needs_input_grad[0]:
        saved_b = b
    if ctx.needs_input_grad[1]:
        saved_a = a
    ctx.save_for_backward(saved_a, saved_b)


# This code adds training support for the operator. You must provide us
# the backward formula for the operator and a `setup_context` function
# to save values to be used in the backward.
torch.library.register_autograd(
    "extension_cpp::mymuladd", _backward, setup_context=_setup_context)

Testing an operator

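Use torch.library.opcheck to test that the custom operator was registered correctly. Note that this function does not test that the gradients are mathematically correct; plan to write separate tests for that, either manually or by using torch.autograd.gradcheck.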

import torch

def sample_inputs(device, *, requires_grad=False):
    def make_tensor(*size):
        return torch.randn(size, device=device, requires_grad=requires_grad)

    def make_nondiff_tensor(*size):
        return torch.randn(size, device=device, requires_grad=False)

    return [
        [make_tensor(3), make_tensor(3), 1],
        [make_tensor(20), make_tensor(20), 3.14],
        [make_tensor(20), make_nondiff_tensor(20), -123],
        [make_nondiff_tensor(2, 3), make_tensor(2, 3), -0.3],
    ]

def reference_muladd(a, b, c):
    return a * b + c

device = "cpu"  # assumption: switch to "cuda" to exercise the CUDA kernel
samples = sample_inputs(device, requires_grad=True)
samples.extend(sample_inputs(device, requires_grad=False))
for args in samples:
    # Correctness test
    result = torch.ops.extension_cpp.mymuladd(*args)
    expected = reference_muladd(*args)
    torch.testing.assert_close(result, expected)

    # Use opcheck to check for incorrect usage of operator registration APIs
    torch.library.opcheck(torch.ops.extension_cpp.mymuladd.default, args)
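
For the gradient side, a minimal gradcheck sketch might look like the following. It runs against the pure-PyTorch reference function rather than the compiled operator, and it uses float64 inputs because gradcheck's default tolerances assume double precision. Since the tutorial's mymuladd accepts only float32, checking the real operator this way would likely require loosened tolerances, which is an assumption not covered here.

```python
import torch


def reference_muladd(a, b, c):
    return a * b + c


# gradcheck compares analytic gradients against numeric ones; its default
# tolerances assume double precision, so float64 inputs are used here.
a = torch.randn(4, dtype=torch.float64, requires_grad=True)
b = torch.randn(4, dtype=torch.float64, requires_grad=True)

ok = torch.autograd.gradcheck(lambda x, y: reference_muladd(x, y, 3.14), (a, b))
print(ok)
```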

Creating mutable operators

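You may wish to author a custom operator that mutates its inputs. Use Tensor(a!) to specify each mutable Tensor in the schema; otherwise, the behavior is undefined. If there are multiple mutated Tensors, use a different name for each one (for example, Tensor(a!), Tensor(b!), Tensor(c!)).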

Let's author a myadd_out(a, b, out) operator, which writes the contents of a+b into out.

// An example of an operator that mutates one of its inputs.
void myadd_out_cpu(const at::Tensor& a, const at::Tensor& b, at::Tensor& out) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(b.sizes() == out.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_CHECK(out.dtype() == at::kFloat);
  TORCH_CHECK(out.is_contiguous());
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(out.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = out.data_ptr<float>();
  for (int64_t i = 0; i < out.numel(); i++) {
    result_ptr[i] = a_ptr[i] + b_ptr[i];
  }
}

When defining the operator, we must specify in the schema that it mutates the out Tensor:

TORCH_LIBRARY(extension_cpp, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
  m.def("mymul(Tensor a, Tensor b) -> Tensor");
  // New!
  m.def("myadd_out(Tensor a, Tensor b, Tensor(a!) out) -> ()");
}

TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
  m.impl("mymul", &mymul_cpu);
  // New!
  m.impl("myadd_out", &myadd_out_cpu);
}

Note

Do not return any mutated Tensors as outputs of the operator, as this causes incompatibility with PyTorch subsystems such as torch.compile.
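
When testing a mutable operator such as myadd_out, a pure-PyTorch reference with the same contract can be handy: it writes into out and returns nothing. The sketch below is one such reference; reference_myadd_out is a hypothetical name and is not part of the extension.

```python
import torch


def reference_myadd_out(a, b, out):
    """Pure-PyTorch stand-in for myadd_out: writes a + b into out, returns None."""
    torch.add(a, b, out=out)


a = torch.randn(5)
b = torch.randn(5)
out = torch.empty(5)
result = reference_myadd_out(a, b, out)
torch.testing.assert_close(out, a + b)
print(result)
```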

Conclusion

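In this tutorial, we went over the recommended approach to integrating custom C++ and CUDA operators with PyTorch. The TORCH_LIBRARY/torch.library APIs are fairly low-level. For more information about how to use them, see The Custom Operators Manual.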