Reducing torch.compile cold start compilation time with regional compilation

Created On: Oct 10, 2024 | Last Updated: Oct 16, 2024 | Last Verified: Oct 10, 2024

Author: Animesh Jain

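As deep learning models get larger, their compilation time also increases. This extended compilation time can lead to long startup times in inference services or wasted resources in large-scale training. This recipe shows how to reduce cold start compilation time by compiling a repeated region of the model instead of the entire model.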

Prerequisites

PyTorch 2.5 or later

Setup

Before we begin, install torch if it is not already available.

pip install torch

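Note

This feature is available starting with the 2.5 release. If you are using version 2.4, you can enable the configuration flag torch._dynamo.config.inline_inbuilt_nn_modules=True to prevent recompilations during regional compilation. In version 2.5, this flag is enabled by default.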

from time import perf_counter

Steps

In this recipe, we will follow these steps:

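Import all necessary libraries.
Define and initialize a neural network with repeated regions.
Understand the difference between full model and regional compilation.
Measure the compilation time of full model and regional compilation.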

First, let's import the necessary libraries:

import torch
import torch.nn as nn

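Next, let's define and initialize a neural network with repeated regions.

Typically, neural networks are composed of repeated layers. For example, a large language model is composed of many Transformer blocks. In this recipe, we will create a Layer using the nn.Module class as a proxy for a repeated region. We will then create a Model that is composed of 64 instances of this Layer class.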

class Layer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10, 10)
        self.relu1 = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(10, 10)
        self.relu2 = torch.nn.ReLU()

    def forward(self, x):
        a = self.linear1(x)
        a = self.relu1(a)
        a = torch.sigmoid(a)
        b = self.linear2(a)
        b = self.relu2(b)
        return b


class Model(torch.nn.Module):
    def __init__(self, apply_regional_compilation):
        super().__init__()
        self.linear = torch.nn.Linear(10, 10)
        # Apply compile only to the repeated layers.
        if apply_regional_compilation:
            self.layers = torch.nn.ModuleList(
                [torch.compile(Layer()) for _ in range(64)]
            )
        else:
            self.layers = torch.nn.ModuleList([Layer() for _ in range(64)])

    def forward(self, x):
        # In regional compilation, the self.linear is outside of the scope of `torch.compile`.
        x = self.linear(x)
        for layer in self.layers:
            x = layer(x)
        return x

Next, let's review the difference between full model compilation and regional compilation.

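In full model compilation, the entire model is compiled as a whole. This is the approach most users take with torch.compile. In this example, we apply torch.compile to the Model object. This effectively inlines the 64 layers, producing one large graph to compile. You can inspect the full graph by running this recipe with TORCH_LOGS=graph_code.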

model = Model(apply_regional_compilation=False).cuda()
full_compiled_model = torch.compile(model)

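Regional compilation, on the other hand, compiles only a portion of the model. By strategically choosing to compile a repeated region of the model, we can compile a much smaller graph and then reuse the compiled graph for all the regions. In this example, torch.compile is applied only to the layers and not to the full model.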

regional_compiled_model = Model(apply_regional_compilation=True).cuda()

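Applying compilation to a repeated region, instead of the full model, leads to large savings in compile time. Here, the code for a single layer is compiled once and then reused for all 64 layers in the Model object.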

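Note that with repeated regions, some parts of the model might not be compiled. For example, self.linear in the Model is outside the scope of regional compilation.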

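Also note that there is a tradeoff between performance speedup and compile time. Full model compilation involves a larger graph and, in theory, offers more scope for optimization. In practice, however, and depending on the model, we have observed many cases where the speedup difference between full model and regional compilation is minimal.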

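Next, let's measure the compilation time of full model and regional compilation.

torch.compile is a JIT compiler, which means that it compiles on the first invocation. In the code below, we measure the total time spent in the first invocation. While this method is not precise, it provides a good estimate because most of that time is spent in compilation.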
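To see how much of the first call is compilation rather than execution, one option is to time the first call separately from later calls. The helper below is a minimal sketch; `first_and_steady_latency` and `LazyWork` are hypothetical names, and `LazyWork` is only a stand-in that mimics a callable which is slow on its first invocation.

```python
from statistics import median
from time import perf_counter, sleep


def first_and_steady_latency(fn, arg, repeats=5, sync=None):
    """Return (first-call seconds, median seconds of `repeats` later calls)."""

    def timed_call():
        start = perf_counter()
        fn(arg)
        if sync is not None:
            sync()  # e.g. wait for pending GPU work before stopping the clock
        return perf_counter() - start

    first = timed_call()
    steady = median(timed_call() for _ in range(repeats))
    return first, steady


class LazyWork:
    """Stand-in for a JIT-compiled callable: slow on the first call only."""

    def __init__(self):
        self.ready = False

    def __call__(self, x):
        if not self.ready:
            sleep(0.05)  # pretend this is compilation
            self.ready = True
        return x


first, steady = first_and_steady_latency(LazyWork(), 1)
print(f"first call = {first:.4f} s, later calls = {steady:.6f} s")
```

With the models in this recipe, one would pass `full_compiled_model` or `regional_compiled_model` together with the input tensor, and `sync=torch.cuda.synchronize`. The difference between the two numbers gives a rough estimate of the compilation overhead.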

def measure_latency(fn, input):
    # Reset the compiler caches to ensure no reuse between different runs
    torch.compiler.reset()
    with torch._inductor.utils.fresh_inductor_cache():
        start = perf_counter()
        fn(input)
        torch.cuda.synchronize()
        end = perf_counter()
        return end - start


input = torch.randn(10, 10, device="cuda")
full_model_compilation_latency = measure_latency(full_compiled_model, input)
print(f"Full model compilation time = {full_model_compilation_latency:.2f} seconds")

regional_compilation_latency = measure_latency(regional_compiled_model, input)
print(f"Regional compilation time = {regional_compilation_latency:.2f} seconds")

assert regional_compilation_latency < full_model_compilation_latency

Conclusion

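This recipe shows how to control cold start compilation time when your model has repeated regions. The approach requires user modifications to apply torch.compile to the repeated regions instead of using the more common full model compilation. We are continually working on reducing cold start compilation time.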
