Profiling your PyTorch Module

Created On: Dec 30, 2020 | Last Updated: Jan 19, 2024 | Last Verified: Nov 05, 2024

Author: Suraj Subramanian

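PyTorch includes a profiler API that helps identify the time and memory costs of the various PyTorch operations in your code. The profiler can be easily integrated into your code, and the results can be printed as a table or returned in a JSON trace file.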

Note

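The profiler supports multithreaded models. It runs in the same thread as the operation, but it also profiles child operators that might run in another thread. Concurrently running profilers are scoped to their own thread to prevent their results from mixing.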

Note

PyTorch 1.8 introduced a new API that will replace the older profiler API in future releases. Check the new API at this page.

Head over to this recipe for a quicker walkthrough of profiler API usage.
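
As a rough sketch of what the newer API mentioned in the note looks like, the snippet below profiles a single layer with torch.profiler. The toy layer, the input sizes and the CPU-only activity list are assumptions made for illustration so that it runs without a GPU; they are not part of this tutorial's example.

```python
import torch
from torch import nn
from torch.profiler import profile, record_function, ProfilerActivity

# Hypothetical toy layer and input, chosen only for illustration.
model = nn.Linear(64, 8)
x = torch.rand(32, 64)

# Profile CPU activity only, so the sketch runs without a GPU.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("LINEAR PASS"):
        y = model(x)

# Summarize the run as a table, most expensive operators first.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```

The rest of this tutorial uses the older torch.autograd.profiler module, whose context managers follow the same pattern.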

import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

Performance debugging using Profiler

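The profiler can be used to identify performance bottlenecks in your models. In this example, we build a custom module that performs two sub-tasks:

a linear transformation on the input, and
use of the transformation result to get indices on a mask tensor.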

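We wrap the code for each sub-task in a separate labelled context manager using profiler.record_function("label"). In the profiler output, the aggregate performance metrics of all operations in a sub-task show up under its corresponding label.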

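Note that using the profiler incurs some overhead, so it is best used only while investigating code. Remember to remove it if you are benchmarking runtimes.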

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx
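
The labels can also be read back programmatically rather than from the printed table. The following is a minimal CPU-only sketch; the small layer, the input sizes and the "ROW SUM" label are hypothetical and stand in for the module above.

```python
import torch
from torch import nn
import torch.autograd.profiler as profiler

# Hypothetical small layer and input so the sketch runs on CPU.
linear = nn.Linear(16, 4)
x = torch.rand(8, 16)

with profiler.profile() as prof:
    with profiler.record_function("LINEAR PASS"):
        out = linear(x)
    with profiler.record_function("ROW SUM"):
        total = out.sum(dim=1)

# key_averages() returns one aggregated entry per operator or label name.
labels = {evt.key: evt for evt in prof.key_averages()}
for name in ("LINEAR PASS", "ROW SUM"):
    evt = labels[name]
    print(name, "calls:", evt.count, "total CPU time (us):", evt.cpu_time_total)
```

Each label's total time includes the operators that ran inside its context manager, which is what makes the labels useful for comparing sub-tasks.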

Profile the forward pass

We initialize random input and mask tensors, and the model.

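Before we run the profiler, we warm up CUDA to ensure accurate performance benchmarking. We wrap the forward pass of our module in the profiler.profile context manager. The with_stack=True parameter appends the file and line number of each operation to the trace.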

Warning

with_stack=True incurs additional overhead, so it is better suited to investigating code. Remember to remove it if you are benchmarking performance.

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

Print the profiler results

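Finally, we print the profiler results. profiler.key_averages aggregates the results by operator name, and optionally by input shapes and/or stack trace events. Grouping by input shapes is useful for identifying which tensor shapes the model uses.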

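Here, we use group_by_stack_n=5, which aggregates runtimes by the operation and its traceback (truncated to the most recent 5 events). By default, events are displayed in the order they were registered; the table can also be sorted by passing a sort_by argument, as we do here with self_cpu_time_total (refer to the docs for valid sorting keys).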

Note

When running the profiler in a notebook, you might see entries like <ipython-input-18-193a910735e8>(13): forward in the stack trace instead of filenames. These correspond to <notebook-cell>(line number): calling-function.

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------  ------------  ------------  ------------  ---------------------------------
         Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-------------  ------------  ------------  ------------  ---------------------------------
 MASK INDICES        87.88%        5.212s    -953.67 Mb  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(10): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::copy_        12.07%     715.848ms           0 b  <ipython-input-...>(12): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

  LINEAR PASS         0.01%     350.151us         -20 b  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(7): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::addmm         0.00%     293.342us           0 b  /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(8): forward
                                                         /mnt/xarfuse/.../torch/nn

   aten::mean         0.00%     235.095us           0 b  <ipython-input-...>(11): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

-------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 5.931s

"""

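Grouping by input shape and exporting a JSON trace follow the same pattern. The sketch below is a CPU-only illustration under stated assumptions: the toy layer, the two input sizes and the trace file name are hypothetical, and record_shapes=True is passed so that shape information is collected.

```python
import os
import tempfile

import torch
from torch import nn
import torch.autograd.profiler as profiler

linear = nn.Linear(16, 4)  # hypothetical toy layer

# record_shapes=True is needed for grouping by input shape.
with profiler.profile(record_shapes=True, profile_memory=True) as prof:
    linear(torch.rand(8, 16))
    linear(torch.rand(32, 16))

# Calls with different input shapes are reported on separate rows.
shape_table = prof.key_averages(group_by_input_shape=True).table(
    sort_by="self_cpu_memory_usage", row_limit=10
)
print(shape_table)

# Write the same run as a JSON trace (the file name is arbitrary).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
```

The exported file can be opened in a trace viewer such as the one built into Chromium-based browsers.
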
Improve memory performance

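Note that the most expensive operations, in terms of both memory and time, are at forward (10), which represents the operations within MASK INDICES. Let's tackle the memory consumption first. We can see that the .cpu() call at line 12 accounts for 953.67 Mb: it copies mask to the CPU, and a 500 × 500 × 500 tensor of 8-byte torch.double values occupies 10^9 bytes. Can we reduce the memory footprint by using torch.float instead?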

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-----------------  ------------  ------------  ------------  --------------------------------
             Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-----------------  ------------  ------------  ------------  --------------------------------
     MASK INDICES        93.61%        5.006s    -476.84 Mb  /mnt/xarfuse/.../torch/au
                                                             <ipython-input-...>(10): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/

      aten::copy_         6.34%     338.759ms           0 b  <ipython-input-...>(12): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

 aten::as_strided         0.01%     281.808us           0 b  <ipython-input-...>(11): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

      aten::addmm         0.01%     275.721us           0 b  /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(8): forward
                                                             /mnt/xarfuse/.../torch/nn

      aten::_local        0.01%     268.650us           0 b  <ipython-input-...>(11): forward
      _scalar_dense                                          /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

-----------------  ------------  ------------  ------------  --------------------------------
Self CPU time total: 5.347s

"""

The CPU memory footprint of this operation has halved.
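
The two figures reported by the profiler can be checked by hand. The short calculation below assumes only the mask shape used above and the standard element sizes of 8 bytes for torch.double and 4 bytes for torch.float.

```python
# Back-of-the-envelope check of the profiler's memory numbers.
num_elements = 500 * 500 * 500   # shape of mask
bytes_per_double = 8             # torch.double is a 64-bit type
bytes_per_float = 4              # torch.float is a 32-bit type
MIB = 1024 ** 2                  # the profiler's "Mb" is 2**20 bytes

double_mib = num_elements * bytes_per_double / MIB
float_mib = num_elements * bytes_per_float / MIB
print(f"double: {double_mib:.2f} MiB, float: {float_mib:.2f} MiB")
# prints "double: 953.67 MiB, float: 476.84 MiB"
```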

Improve time performance

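While the time consumed has also dropped a little, it is still too high. It turns out that copying a matrix from CUDA to the CPU is quite expensive! The aten::copy_ operator at forward (12) copies mask to the CPU so that it can be passed to NumPy's argwhere function. The aten::copy_ at forward (13) copies the resulting array back to CUDA as a tensor. We can eliminate both copies by using the torch function nonzero() here instead.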

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx


model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

--------------  ------------  ------------  ------------  ---------------------------------
          Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
--------------  ------------  ------------  ------------  ---------------------------------
      aten::gt        57.17%     129.089ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero        37.38%      84.402ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

   INDEX SCORE         3.32%       7.491ms    -119.21 Mb  /mnt/xarfuse/.../torch/au
                                                          <ipython-input-...>(10): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/

aten::as_strided         0.20%    441.587us          0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero
     _numpy             0.18%     395.602us           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/
--------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 225.801ms

"""

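Note that nonzero(as_tuple=True) returns the indices in a different layout from np.argwhere. The sketch below uses a small hypothetical mask on the CPU to show both layouts and how to convert between them.

```python
import torch

# Small hypothetical mask so the result is easy to inspect.
mask = torch.tensor([[0.1, 0.9],
                     [0.8, 0.2]])
threshold = 0.5

# Default: one row per match, one column per dimension (like np.argwhere).
idx_matrix = (mask > threshold).nonzero()

# as_tuple=True: one 1-D index tensor per dimension, ready for indexing.
idx_tuple = (mask > threshold).nonzero(as_tuple=True)

print(idx_matrix)       # rows [0, 1] and [1, 0]
print(mask[idx_tuple])  # the selected values 0.9 and 0.8

# Stacking the tuple along dim=1 recovers the matrix layout.
stacked = torch.stack(idx_tuple, dim=1)
print(torch.equal(stacked, idx_matrix))  # True
```

If downstream code expects the (N, ndim) matrix that the NumPy version produced, either drop as_tuple=True or stack the tuple as shown.
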
Further Reading

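We have seen how the profiler can be used to investigate time and memory bottlenecks in PyTorch models. The PyTorch profiler documentation covers the API in more detail.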
