备注

点击此处下载完整示例代码

为 Python 运行时提供的 `torch.export` AOTInductor 教程（Beta 版）¶

Created On: Aug 23, 2024 | Last Updated: Jan 24, 2025 | Last Verified: Nov 05, 2024

作者: Ankith Gunapal, Bin Bao, Angela Yi

警告

torch._inductor.aoti_compile_and_package 和 torch._inductor.aoti_load_package 处于 Beta 状态，可能会有向后兼容性破坏的变更。本教程提供了一个使用这些 API 进行模型部署的示例，使用 Python 运行时。

之前已展示过如何使用 AOTInductor 对 PyTorch 导出的模型进行预编译，通过创建一个可以在非 Python 环境下运行的工件。在本教程中，您将学习如何在 Python 运行时中使用 AOTInductor 的端到端示例。

内容

前提条件
您将学到什么
模型编译
在 Python 中进行模型推理
何时使用带有 Python 运行时的 AOTInductor
结论

前提条件 ¶

PyTorch 2.6 或更高版本
对 torch.export 和 AOTInductor 的基础理解
完成 AOTInductor: Torch.Export 模型的预编译教程

您将学到什么 ¶

如何在 Python 运行时中使用 AOTInductor。
如何使用 torch._inductor.aoti_compile_and_package() 和 torch.export.export() 来生成编译后的工件
如何使用 torch._export.aot_load() 在 Python 运行时中加载并运行工件。
何时在 Python 运行时中使用 AOTInductor

模型编译 ¶

我们将使用 TorchVision 预训练的 ResNet18 模型作为示例。

第一步是通过 torch.export.export() 将模型导出为图表示形式。要了解有关此函数的更多信息，可以查看文档或教程。

一旦我们导出了 PyTorch 模型并获得了一个 ExportedProgram，我们可以使用 torch._inductor.aoti_compile_and_package() 将其传递给 AOTInductor，编译程序到指定的设备，并将生成的内容保存为一个 .pt2 工件。

备注

此 API 支持与 torch.compile() 相同的选项，例如 mode 和 ``max_autotune``（对于那些想启用 CUDA 图并利用基于 Triton 的矩阵乘法和卷积的用户）。

import os
import torch
import torch._inductor
from torchvision.models import ResNet18_Weights, resnet18

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.inference_mode():
    inductor_configs = {}

    if torch.cuda.is_available():
        device = "cuda"
        inductor_configs["max_autotune"] = True
    else:
        device = "cpu"

    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)

    exported_program = torch.export.export(
        model,
        example_inputs,
    )
    path = torch._inductor.aoti_compile_and_package(
        exported_program,
        package_path=os.path.join(os.getcwd(), "resnet18.pt2"),
        inductor_configs=inductor_configs
    )

使用 aoti_compile_and_package() 的结果是一个名为 “resnet18.pt2” 的工件，可以在 Python 和 C++ 中加载和执行。

该工件本身包含大量由 AOTInductor 生成的代码，例如生成的 C++ 运行文件、从 C++ 文件编译的共享库以及 CUDA 二进制文件（即 cubin 文件，如果针对 CUDA 优化）。

从结构上看，该工件是一个结构化的 .zip 文件，具有以下规范：

我们可以使用以下命令检查工件内容：

$ unzip -l resnet18.pt2

Archive:  resnet18.pt2
  Length      Date    Time    Name
---------  ---------- -----   ----
        1  01-08-2025 16:40   version
        3  01-08-2025 16:40   archive_format
    10088  01-08-2025 16:40   data/aotinductor/model/cagzt6akdaczvxwtbvqe34otfe5jlorktbqlojbzqjqvbfsjlge4.cubin
    17160  01-08-2025 16:40   data/aotinductor/model/c6oytfjmt5w4c7onvtm6fray7clirxt7q5xjbwx3hdydclmwoujz.cubin
    16616  01-08-2025 16:40   data/aotinductor/model/c7ydp7nocyz323hij4tmlf2kcedmwlyg6r57gaqzcsy3huneamu6.cubin
    17776  01-08-2025 16:40   data/aotinductor/model/cyqdf46ordevqhiddvpdpp3uzwatfbzdpl3auj2nx23uxvplnne2.cubin
    10856  01-08-2025 16:40   data/aotinductor/model/cpzfebfgrusqslui7fxsuoo4tvwulmrxirc5tmrpa4mvrbdno7kn.cubin
    14608  01-08-2025 16:40   data/aotinductor/model/c5ukeoz5wmaszd7vczdz2qhtt6n7tdbl3b6wuy4rb2se24fjwfoy.cubin
    11376  01-08-2025 16:40   data/aotinductor/model/csu3nstcp56tsjfycygaqsewpu64l5s6zavvz7537cm4s4cv2k3r.cubin
    10984  01-08-2025 16:40   data/aotinductor/model/cp76lez4glmgq7gedf2u25zvvv6rksv5lav4q22dibd2zicbgwj3.cubin
    14736  01-08-2025 16:40   data/aotinductor/model/c2bb5p6tnwz4elgujqelsrp3unvkgsyiv7xqxmpvuxcm4jfl7pc2.cubin
    11376  01-08-2025 16:40   data/aotinductor/model/c6eopmb2b4ngodwsayae4r5q6ni3jlfogfbdk3ypg56tgpzhubfy.cubin
    11624  01-08-2025 16:40   data/aotinductor/model/chmwe6lvoekzfowdbiizitm3haiiuad5kdm6sd2m6mv6dkn2zk32.cubin
    15632  01-08-2025 16:40   data/aotinductor/model/c3jop5g344hj3ztsu4qm6ibxyaaerlhkzh2e6emak23rxfje6jam.cubin
    25472  01-08-2025 16:40   data/aotinductor/model/chaiixybeiuuitm2nmqnxzijzwgnn2n7uuss4qmsupgblfh3h5hk.cubin
   139389  01-08-2025 16:40   data/aotinductor/model/cvk6qzuybruhwxtfblzxiov3rlrziv5fkqc4mdhbmantfu3lmd6t.cpp
       27  01-08-2025 16:40   data/aotinductor/model/cvk6qzuybruhwxtfblzxiov3rlrziv5fkqc4mdhbmantfu3lmd6t_metadata.json
 47195424  01-08-2025 16:40   data/aotinductor/model/cvk6qzuybruhwxtfblzxiov3rlrziv5fkqc4mdhbmantfu3lmd6t.so
---------                     -------
 47523148                     18 files

在 Python 中进行模型推理 ¶

要在 Python 中加载并运行工件，我们可以使用 torch._inductor.aoti_load_package()。

import os
import torch
import torch._inductor

model_path = os.path.join(os.getcwd(), "resnet18.pt2")

compiled_model = torch._inductor.aoti_load_package(model_path)
example_inputs = (torch.randn(2, 3, 224, 224, device=device),)

with torch.inference_mode():
    output = compiled_model(example_inputs)

何时使用带有 Python 运行时的 AOTInductor ¶

使用带有 Python 运行时的 AOTInductor 的主要原因有两个：

torch._inductor.aoti_compile_and_package 会生成单一序列化工件。这对于模型部署的版本管理以及跟踪模型性能变化很有用。
由于 torch.compile() 是一个 JIT 编译器，首次编译时会有一个启动成本。部署时需要考虑首次推理编译所需的时间。通过使用 AOTInductor，编译在部署之前通过 torch.export.export 和 torch._inductor.aoti_compile_and_package 完成。在部署期间，加载模型后运行推理不会产生额外的成本。

以下部分展示了使用 AOTInductor 在首次推理中获得的加速效果

我们定义了一个实用函数 timed 来测量推理所需时间

import time
def timed(fn):
    # Returns the result of running `fn()` and the time it took for `fn()` to run,
    # in seconds. We use CUDA events and synchronization for accurate
    # measurement on CUDA enabled devices.
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
    else:
        start = time.time()

    result = fn()
    if torch.cuda.is_available():
        end.record()
        torch.cuda.synchronize()
    else:
        end = time.time()

    # Measure time taken to execute the function in miliseconds
    if torch.cuda.is_available():
        duration = start.elapsed_time(end)
    else:
        duration = (end - start) * 1000

    return result, duration

让我们测量使用 AOTInductor 进行首次推理的时间

torch._dynamo.reset()

model = torch._inductor.aoti_load_package(model_path)
example_inputs = (torch.randn(1, 3, 224, 224, device=device),)

with torch.inference_mode():
    _, time_taken = timed(lambda: model(example_inputs))
    print(f"Time taken for first inference for AOTInductor is {time_taken:.2f} ms")

让我们测量使用 torch.compile 进行首次推理的时间

torch._dynamo.reset()

model = resnet18(weights=ResNet18_Weights.DEFAULT).to(device)
model.eval()

model = torch.compile(model)
example_inputs = torch.randn(1, 3, 224, 224, device=device)

with torch.inference_mode():
    _, time_taken = timed(lambda: model(example_inputs))
    print(f"Time taken for first inference for torch.compile is {time_taken:.2f} ms")

我们发现，与使用 torch.compile 相比，使用 AOTInductor 的首次推理时间显著减少

结论 ¶

在本示例中，我们学习了如何通过编译和加载预训练的 ResNet18 模型来有效地使用 AOTInductor 在 Python 运行时中运行。这一过程展示了生成编译工件并在 Python 环境中运行它的实际应用。我们还研究了在模型部署中使用 AOTInductor 的优势，比如首次推理时间的加速效果。

脚本的总运行时间： (0分钟 0.000秒)

由Sphinx-Gallery生成的图集

为 Python 运行时提供的 torch.export AOTInductor 教程（Beta 版）¶

前提条件¶

您将学到什么¶

模型编译¶

在 Python 中进行模型推理¶

何时使用带有 Python 运行时的 AOTInductor¶

结论¶

文档

教程

资源