Inductor CPU Backend Debugging and Profiling
Created On: Jul 01, 2023 | Last Updated: Jan 08, 2025 | Last Verified: Nov 05, 2024
Authors: Xuan Liao, Haozhe Zhu, Jiong Gong, Weihan Wang
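Overview
PyTorch 2.0 introduced the compilation API ``torch.compile``. With the default Inductor backend, it applies graph-level optimizations that give a significant speedup over eager mode execution.
This tutorial gives an in-depth introduction to debugging and performance profiling on the Inductor CPU backend by looking closely at how ``torch.compile`` works.
Meanwhile, you may also find related tutorials about ``torch.compile`` covering `basic usage <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_, comprehensive `troubleshooting <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html>`_, and GPU-specific knowledge such as `GPU performance profiling <https://pytorch.org/docs/stable/torch.compiler_inductor_profiling.html>`_.
We start debugging with a motivating example that triggers a compilation issue and an accuracy issue, and show the process of pinpointing each problem.
By enabling logging and exploring the generated low-level code, you can learn how to narrow down a failure step by step until you find the root cause.
After that, we discuss how to profile the compiled code and, through a performance comparison with eager mode, explain why ``torch.compile`` can provide an additional performance boost.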
Debugging
Here is a simple example that runs a function through ``torch.compile`` with Inductor, so that its result can be compared with eager mode:
import torch

def foo1(x1, x2):
    a = torch.neg(x1)
    b = torch.maximum(x2, a)
    y = torch.cat([b], dim=0)
    return y

x1 = torch.randint(256, (1, 8), dtype=torch.uint8)
x2 = torch.randint(256, (8390, 8), dtype=torch.uint8)
compiled_foo1 = torch.compile(foo1)
result = compiled_foo1(x1, x2)
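The correct implementation of ``neg`` in the ``cpp`` codegen is as follows:
def neg1(x):
    return f"decltype({x})(-{x})"
To demonstrate debugging, we will later replace this function with a wrong version.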
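Get more logging information
Running this simple example with default settings provides no debugging information. To get more useful debugging and logging information, we usually set the ``TORCH_COMPILE_DEBUG`` environment variable as shown below: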
TORCH_COMPILE_DEBUG=1 python xx.py
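This prints more debug information in the output logs and also dumps the intermediate IRs generated during codegen. You can find the path of the dumped files in the log, as shown below: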
torch._inductor.debug: [WARNING] model___20 debug trace: /tmp/torchinductor_root/rx/crxfi2ybd7yp5sbj2pnhw33wfhtdw7wumvrobyp5sjvdui5ktjc2.debug
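In this directory, the following files are saved for debugging purposes:
| File | Description |
|---|---|
| ``fx_graph_runnable.py`` | Executable FX graph, after decomposition, before pattern match |
| ``fx_graph_transformed.py`` | Transformed FX graph, after pattern match |
| ``ir_pre_fusion.txt`` | Inductor IR before fusion |
| ``ir_post_fusion.txt`` | Inductor IR after fusion |
| ``output_code.py`` | Generated Python code for the graph, with C++/Triton kernels |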
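When many graphs are compiled, picking the dump directory out of the log by hand gets tedious. The following is a minimal sketch that extracts the path from a log line like the one shown above. The regular expression and the helper name ``find_debug_dir`` are illustrative assumptions, not part of PyTorch, and the log format may differ between versions.

```python
import re
from pathlib import PurePosixPath

# Log line copied from the example output above.
LOG_LINE = (
    "torch._inductor.debug: [WARNING] model___20 debug trace: "
    "/tmp/torchinductor_root/rx/"
    "crxfi2ybd7yp5sbj2pnhw33wfhtdw7wumvrobyp5sjvdui5ktjc2.debug"
)

# Assumed pattern: the path is the first token after "debug trace:".
TRACE_RE = re.compile(r"debug trace:\s*(\S+)")


def find_debug_dir(log_text):
    """Return the first debug trace directory found in log_text, or None."""
    match = TRACE_RE.search(log_text)
    return PurePosixPath(match.group(1)) if match else None


debug_dir = find_debug_dir(LOG_LINE)
print(debug_dir)
# Path of the generated code inside the dump directory.
print(debug_dir / "output_code.py")
```

The same helper can be applied to a whole captured log, since ``re.search`` stops at the first match.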
Note that ``fx_graph_runnable.py`` and ``output_code.py`` are both runnable and editable, which makes debugging easier. Below are the main parts of the code extracted from these files, with the generated C++ lines correlated to the FX code lines.
``fx_graph_runnable``:
def forward1(self, arg0_1, arg1_1):
    neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None
    maximum = torch.ops.aten.maximum.default(arg1_1, neg); arg1_1 = neg = None
    clone = torch.ops.aten.clone.default(maximum); maximum = None
    return (clone,)
C++ kernel in ``output_code``:
import torch
from torch._inductor.async_compile import AsyncCompile
async_compile = AsyncCompile()

cpp_fused_cat_maximum_neg_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       const unsigned char* in_ptr1,
                       unsigned char* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];
                auto tmp1 = in_ptr1[static_cast<long>(i1)];
                // Corresponding FX code line: neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None
                auto tmp2 = decltype(tmp1)(-tmp1);
                // Corresponding FX code line: maximum = torch.ops.aten.maximum.default(arg1_1, neg); arg1_1 = neg = None
                auto tmp3 = max_propagate_nan(tmp0, tmp2);
                // Corresponding FX code line: clone = torch.ops.aten.clone.default(maximum); maximum = None
                out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;
            }
        }
    }
}''')
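Determine component of error
When you hit an error or an accuracy problem, a straightforward way to find the bug is to narrow down the problem. The first step is to determine the component in which the error occurs. Fortunately, this can be done simply by changing the backend of ``torch.compile``.
| Code | Description |
|---|---|
| ``torch.compile(fn, backend="eager")`` | Enable Dynamo |
| ``torch.compile(fn, backend="aot_eager")`` | Enable Dynamo + AOT Autograd |
| ``torch.compile(fn, backend="inductor")`` | Enable Dynamo + AOT Autograd + Inductor |
If the model runs successfully with the backend set to ``eager`` or ``aot_eager`` but fails with ``inductor``, we can narrow the failure down to Inductor.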
Compilation error
As described above, the lowering chain of graph-level optimization looks like this:
torch.neg (Python) -> torch.ops.aten.neg.default (within FX graph) -> ops.neg (within IR node) -> tmp2 = -tmp1 (within C++ kernel)
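A compile error means that something went wrong while compiling the C++ kernels in the output code. This type of error indicates that the bug was introduced while lowering IR nodes to output code. The root cause of a compile error is usually shown in the traceback log.
For instance, the ``neg`` function is modified as follows:
def neg2(x):
    return f"-{x}"
The log reports the following compile error, with a fairly clear reason.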
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CppCompileError: C++ compile error
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp: In function ‘void kernel(const unsigned char*, const unsigned char*, unsigned char*)’:
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: error: no matching function for call to ‘max_propagate_nan(unsigned char&, int&)’
17 | auto tmp3 = max_propagate_nan(tmp0, tmp2);
| ^
In file included from /tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:2:
/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note: candidate: ‘template<class scalar_t> scalar_t max_propagate_nan(scalar_t, scalar_t)’
27 | inline scalar_t max_propagate_nan(scalar_t a, scalar_t b) {
| ^~~~~~~~~~~~~~~~~
/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note: template argument deduction/substitution failed:
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: note: deduced conflicting types for parameter ‘scalar_t’ (‘unsigned char’ and ‘int’)
17 | auto tmp3 = max_propagate_nan(tmp0, tmp2);
| ^
Let us also look at the corresponding C++ kernel in the output code and the IR node.
C++ kernel:
#include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       const unsigned char* in_ptr1,
                       unsigned char* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];
                auto tmp1 = in_ptr1[static_cast<long>(i1)];
                auto tmp2 = -tmp1;
                auto tmp3 = max_propagate_nan(tmp0, tmp2);
                out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;
            }
        }
    }
}
IR node:
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep('buf0', c0, {c0: 67120})]
buf0.unmet_dependencies = []
buf0.met_dependencies =
    [   MemoryDep('arg0_1', c1, {c0: 8390, c1: 8}),
        MemoryDep('arg1_1', c0, {c0: 67120})]
buf0.users = [NodeUser(node=OUTPUT, can_inplace=False)]
buf0.group.device = cpu
buf0.group.iteration = ((8390, 8), ())
buf0.sizes = ([8390, 8], [])
class buf0_loop_body:
    var_ranges = {z0: 8390, z1: 8}
    index0 = 8*z0 + z1
    index1 = z1
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index1')
        load_1 = ops.load('arg0_1', get_index_1)
        neg = ops.neg(load_1)
        maximum = ops.maximum(load, neg)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf0', get_index_2, maximum, None)
        return store
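According to the traceback, the compile error is caused by inconsistent data types of the inputs to ``max_propagate_nan``. Checking the C++ kernel, ``tmp0`` is ``unsigned char``, but ``tmp2`` is not: applying unary ``-`` to the ``unsigned char`` value ``tmp1`` promotes the result to ``int``, which is exactly the conflict (``unsigned char`` versus ``int``) reported by the compiler. The ``-`` and ``max_propagate_nan`` in the C++ kernel are easily matched to ``ops.neg`` and ``ops.maximum`` in the IR node, respectively.
We have now found the root cause: the implementation of ``ops.neg`` in the ``cpp`` codegen, which silently changes the data type when performing ``neg``.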
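To see why the cast in the correct ``neg`` matters, the sketch below renders both versions of the generated expression and emulates, in plain Python, what C++ unary minus does to an ``unsigned char`` operand. The function names are illustrative, and the emulation assumes the usual 8-bit ``unsigned char`` and standard integral promotion to ``int``.

```python
def neg_with_cast(x):
    # Correct codegen: cast the negated value back to the operand's type.
    return f"decltype({x})(-{x})"


def neg_without_cast(x):
    # Buggy codegen from the example: the result keeps the promoted type.
    return f"-{x}"


def emulate_uint8_neg(value, cast_back):
    """Emulate C++ unary minus on an unsigned char holding `value`.

    The operand is promoted to int first, so the raw result is a
    non-positive int; casting back to unsigned char wraps it modulo 256.
    """
    promoted = -int(value)
    return promoted % 256 if cast_back else promoted


print(neg_with_cast("tmp1"))
print(neg_without_cast("tmp1"))
print(emulate_uint8_neg(3, cast_back=True))
print(emulate_uint8_neg(3, cast_back=False))
```

Without the cast, the value and the type both differ from what the rest of the kernel expects, which is why the template argument deduction for ``max_propagate_nan`` fails.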
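Accuracy debugging
If the model instead runs into other errors or accuracy problems, you can use the PyTorch debugging tool `Minifier <https://pytorch.org/functorch/stable/notebooks/minifier.html>`_.
The core idea of ``Minifier`` is to keep removing nodes and inputs of the graph until it finds the smallest graph that still shows the problem. It generates a minified problematic graph automatically using four strategies: truncating the suffix, delta debugging, eliminating dead code, and removing unused inputs.
We now show how to debug an accuracy problem with the help of ``Minifier``. An accuracy problem is a case where the outputs of the ``eager`` and ``inductor`` backends differ.
For instance, we modify the example as follows: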
import torch
from torch._dynamo.utils import same

def foo2(x1, x2):
    a = torch.neg(x1)
    b = torch.maximum(x2, a)
    y = torch.cat([b], dim=0)
    return y

x1 = torch.randn((1, 8), dtype=torch.float32)
x2 = torch.randn((8390, 8), dtype=torch.float32)
# Reference result from eager mode
expected_result = foo2(x1, x2)
compiled_foo2 = torch.compile(foo2)
actual_result = compiled_foo2(x1, x2)
assert same(expected_result, actual_result) == True
We also modify the ``neg`` function:
def neg3(x):
    return f"decltype({x})(2 * {x})"
The accuracy problem is then raised as follows:
torch._dynamo.utils: [ERROR] Accuracy failed: allclose not within tol=0.0001
Traceback (most recent call last):
File "test_script.py", line 18, in <module>
assert same(expected_result, actual_result) == True
AssertionError
To debug an accuracy problem with Minifier, two environment variables are needed:
TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4 python xx.py
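If you prefer to launch the minifier run from a Python driver instead of the shell, the two variables can be added to a copy of the current environment. This is a minimal sketch; the helper names ``minifier_env`` and ``minifier_command`` are hypothetical, and ``xx.py`` is the placeholder script name used above.

```python
import os
import sys


def minifier_env(repro_after="aot", repro_level=4, base_env=None):
    """Return a copy of the environment with the two minifier variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCHDYNAMO_REPRO_AFTER"] = repro_after
    env["TORCHDYNAMO_REPRO_LEVEL"] = str(repro_level)
    return env


def minifier_command(script="xx.py"):
    """Command line that reruns the script with the current interpreter."""
    return [sys.executable, script]


env = minifier_env()
print(env["TORCHDYNAMO_REPRO_AFTER"], env["TORCHDYNAMO_REPRO_LEVEL"])

# To actually launch the run (not executed in this sketch):
#   import subprocess
#   subprocess.run(minifier_command(), env=env, check=True)
```

Copying the environment rather than mutating ``os.environ`` keeps the settings from leaking into other compilations in the same process.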
This gives us log output that shows the minification steps:
Started off with 6 nodes
Trying granularity 2
Strategy: Truncate suffix (G: 2) (6 nodes, 2 inputs)
SUCCESS: Went from 6 to 4 nodes
Trying granularity 4
Strategy: Remove unused inputs (G: 4) (4 nodes, 2 inputs)
SUCCESS: Went from 4 to 3 nodes
After running, we get the final minified graph containing the target node ``neg``:
def forward2(self, arg0_1):
    neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None
    return (neg,)
For more usage details about Minifier, please refer to the `troubleshooting guide <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html>`_.
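The shrinking idea behind the minifier can be illustrated with a toy model. The sketch below is not the real Minifier: it treats a graph as a flat list of node names, ignores data dependencies between nodes, and implements only a suffix truncation step followed by one-node-at-a-time removal. The predicate ``still_fails`` stands in for "compile the candidate graph and compare against eager".

```python
def truncate_suffix(nodes, still_fails):
    """Return the shortest prefix of nodes that still reproduces the failure."""
    for end in range(1, len(nodes) + 1):
        if still_fails(nodes[:end]):
            return nodes[:end]
    return list(nodes)  # nothing shorter reproduces; keep everything


def remove_one_at_a_time(nodes, still_fails):
    """Greedily drop single nodes while the failure keeps reproducing."""
    current = list(nodes)
    index = 0
    while index < len(current):
        candidate = current[:index] + current[index + 1:]
        if candidate and still_fails(candidate):
            current = candidate  # node was irrelevant, keep it removed
        else:
            index += 1           # node is needed to reproduce, keep it
    return current


def minify(nodes, still_fails):
    return remove_one_at_a_time(truncate_suffix(nodes, still_fails), still_fails)


graph = ["neg", "maximum", "clone"]


def fails_if_neg_present(nodes):
    # Toy oracle: the bug lives in the "neg" node.
    return "neg" in nodes


print(minify(graph, fails_if_neg_present))
```

Each successful step shrinks the candidate, so the loop terminates with a graph from which no single node can be removed without losing the failure.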
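Performance profiling
In this section, we demonstrate the process of performance analysis for a model compiled with the Inductor CPU backend. In the example below, we benchmark the Hugging Face Transformer model ``MobileBertForQuestionAnswering`` in both ``eager`` mode and Inductor graph mode. The execution time and the speedup ratio of Inductor are printed after the benchmark. We use an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz and run the benchmark on the first socket to demonstrate the optimizations in this section. We set the following environment variables as a best practice for benchmarking on Intel(R) CPUs.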
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
numactl -C 0-31 -m 0 python bench.py
# bench.py
import timeit

import torch
from transformers import MobileBertForQuestionAnswering

# Initialize an eager model
model = MobileBertForQuestionAnswering.from_pretrained("csarron/mobilebert-uncased-squad-v2")
seq_length = 128
bs = 128
vocab_size = model.config.vocab_size
input = torch.randint(0, vocab_size, (bs, seq_length), dtype=torch.int64)
input_dict = {"input_ids": input}

# Initialize the inductor model; the first call triggers compilation
compiled_model = torch.compile(model)
with torch.no_grad():
    compiled_model(**input_dict)

NUM_ITERS = 50
with torch.no_grad():
    # warmup
    for _ in range(10):
        model(**input_dict)
    eager_t = timeit.timeit("model(**input_dict)", number=NUM_ITERS, globals=globals())

with torch.no_grad():
    # warmup
    for _ in range(10):
        compiled_model(**input_dict)
    inductor_t = timeit.timeit("compiled_model(**input_dict)", number=NUM_ITERS, globals=globals())

print(f"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter")
print(f"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter")
print(f"speed up ratio: {eager_t / inductor_t}")
Output:
eager use: 802.1023553796113 ms/iter
inductor use: 339.95180135127157 ms/iter
speed up ratio: 2.359459053287382
In our own testing, the Inductor CPU backend sped up the model by about 2.36x.
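The warmup-then-measure pattern used in the benchmark can be factored into a small helper. The sketch below is an illustration with hypothetical names (``ms_per_iter``, ``speedup``); it uses cheap stand-in workloads so that it runs without a model, and it recomputes the ratio from the two latencies reported above.

```python
import timeit


def ms_per_iter(fn, warmup=10, iters=50):
    """Average latency of fn() in milliseconds, measured after warmup calls."""
    for _ in range(warmup):
        fn()
    return timeit.timeit(fn, number=iters) * 1000 / iters


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# Stand-in workloads; in the benchmark above these would be
# `lambda: model(**input_dict)` and `lambda: compiled_model(**input_dict)`.
def slow():
    return sum(range(20000))


def fast():
    return sum(range(2000))


print(f"stand-in speed up ratio: {speedup(ms_per_iter(slow), ms_per_iter(fast)):.2f}")
# Ratio recomputed from the latencies printed by the benchmark above.
print(f"reported speed up ratio: {speedup(802.1023553796113, 339.95180135127157):.3f}")
```

Warming up before timing matters especially for the compiled model, because its first call includes compilation time.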
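Next, let us dive into operator-level performance to understand where the speedup comes from. `PyTorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_ is a good tool for this. The Inductor CPU backend can report the time of the fused kernels to the profiler when the ``enable_kernel_profile`` configuration option is enabled: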
from torch._inductor import config
config.cpp.enable_kernel_profile = True
Following the steps in `PyTorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_, we are able to get the profiling table and trace files.
# bench.py
import os

from torch.profiler import profile, schedule, ProfilerActivity

RESULT_DIR = "./prof_trace"
# Make sure the trace directory exists before traces are exported
os.makedirs(RESULT_DIR, exist_ok=True)

my_schedule = schedule(
    skip_first=10,
    wait=5,
    warmup=5,
    active=1,
    repeat=5)

def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20)
    print(output)
    p.export_chrome_trace(f"{RESULT_DIR}/{p.step_num}.json")

for _ in range(10):
    model(**input_dict)  # compiled_model(**input_dict) to get inductor model profiling

total = 0
with profile(
    activities=[ProfilerActivity.CPU],
    schedule=my_schedule,
    on_trace_ready=trace_handler
) as p:
    for _ in range(50):
        model(**input_dict)  # compiled_model(**input_dict) to get inductor model profiling
        p.step()
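To make the schedule above concrete, the following sketch models which of the 50 profiled iterations are actually recorded. It is a simplified, pure-Python model of the phase assignment as we understand ``torch.profiler.schedule`` (skip, then repeated cycles of wait, warmup, active); the function ``phase`` and its string labels are illustrative and not the profiler's own API.

```python
def phase(step, skip_first=10, wait=5, warmup=5, active=1, repeat=5):
    """Simplified model of the profiler phase for a zero-based step index."""
    if step < skip_first:
        return "skip"
    step -= skip_first
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "done"  # all requested cycles have completed
    position = step % cycle
    if position < wait:
        return "wait"
    if position < wait + warmup:
        return "warmup"
    return "active"


# Steps of the 50-iteration loop in which the profiler is recording.
active_steps = [s for s in range(50) if phase(s) == "active"]
print(active_steps)
```

Under this model each cycle takes 11 steps after the 10 skipped ones, so a 50-iteration loop completes only three of the five requested cycles; run more iterations if you want all five traces.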
We get the following performance profiling table for the ``eager`` mode model (some columns omitted):
------------------------- ------------ ------------ ------------
Name CPU total % CPU total # of Calls
------------------------- ------------ ------------ ------------
aten::addmm 45.73% 370.814ms 362
aten::add 19.89% 161.276ms 363
aten::copy_ 14.97% 121.416ms 488
aten::mul 9.02% 73.154ms 194
aten::clamp_min 8.81% 71.444ms 96
aten::bmm 5.46% 44.258ms 48
ProfilerStep* 100.00% 810.920ms 1
aten::div 2.89% 23.447ms 24
aten::_softmax 1.00% 8.087ms 24
aten::linear 46.48% 376.888ms 362
aten::clone 2.77% 22.430ms 98
aten::t 0.31% 2.502ms 362
aten::view 0.14% 1.161ms 850
aten::transpose 0.17% 1.377ms 386
aten::index_select 0.12% 952.000us 3
aten::expand 0.12% 986.000us 458
aten::matmul 8.31% 67.420ms 48
aten::cat 0.09% 703.000us 1
aten::as_strided 0.08% 656.000us 963
aten::relu 8.86% 71.864ms 96
------------------------- ------------ ------------ ------------
Self CPU time total: 810.920ms
Similarly, we also get the table for the model compiled with Inductor (some columns omitted):
----------------------------------------------- ------------ ------------ ------------
Name CPU total % CPU total # of Calls
----------------------------------------------- ------------ ------------ ------------
mkl::_mkl_linear 68.79% 231.573ms 362
aten::bmm 8.02% 26.992ms 48
ProfilerStep* 100.00% 336.642ms 1
graph_0_cpp_fused_constant_pad_nd_embedding_0 0.27% 915.000us 1
aten::empty 0.27% 911.000us 362
graph_0_cpp_fused__mkl_linear_add_mul_relu_151 0.27% 901.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_226 0.27% 899.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_361 0.27% 898.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_121 0.27% 895.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_31 0.27% 893.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_76 0.26% 892.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_256 0.26% 892.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_346 0.26% 892.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_241 0.26% 891.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_316 0.26% 891.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_91 0.26% 890.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_106 0.26% 890.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_211 0.26% 890.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_61 0.26% 889.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_286 0.26% 889.000us 1
----------------------------------------------- ------------ ------------ ------------
Self CPU time total: 336.642ms
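From the profiling table of the ``eager`` mode model, we can see that the most time-consuming ops are [``aten::addmm``, ``aten::add``, ``aten::copy_``, ``aten::mul``, ``aten::clamp_min``, ``aten::bmm``]. Comparing with the Inductor model profiling table, we notice an ``mkl::_mkl_linear`` entry and multiple fused kernels named ``graph_0_cpp_fused_*``. These are the major optimizations that the Inductor model applies. Let us discuss them separately.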
(1) Regarding ``mkl::_mkl_linear``: You may notice that the number of calls to this kernel is 362, which is exactly the same as ``aten::linear`` in the eager model profiling table. The CPU total of ``aten::linear`` is 376.888ms, while it is 231.573ms for ``mkl::_mkl_linear``. This suggests a ~1.63x speedup for the “linear” part. The speedup mainly comes from packing the weight tensor into a block memory format and invoking ``cblas_sgemm_compute`` within the Inductor CPU backend to get better cache behavior during GEMM computation.
(2) Regarding other memory-intensive ops: The end-to-end latency of the eager/inductor model is 802/339ms in our testing. Subtracting the “linear” time from each (about 425ms versus about 108ms), we can roughly infer that the speedup for the other memory-intensive ops is around 3.9x.
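The two estimates above can be reproduced with a few lines of arithmetic. All four inputs are taken from the benchmark output and profiling tables in this section; note that the estimate mixes ``timeit`` latencies with profiler totals, so it is only a rough attribution.

```python
# End-to-end latencies reported by the benchmark (ms/iter).
eager_total_ms = 802.1023553796113
inductor_total_ms = 339.95180135127157

# "Linear" time from the profiling tables (CPU total, ms).
eager_linear_ms = 376.888      # aten::linear in eager mode
inductor_linear_ms = 231.573   # mkl::_mkl_linear with Inductor

# Speedup of the linear part alone.
linear_speedup = eager_linear_ms / inductor_linear_ms

# Everything that is not linear is attributed to the other ops.
eager_other_ms = eager_total_ms - eager_linear_ms
inductor_other_ms = inductor_total_ms - inductor_linear_ms
other_speedup = eager_other_ms / inductor_other_ms

print(f"linear speedup: {linear_speedup:.2f}x")
print(f"other ops: {eager_other_ms:.1f} ms -> {inductor_other_ms:.1f} ms")
print(f"other ops speedup: {other_speedup:.2f}x")
```

With unrounded inputs the second ratio lands close to the rounded figure quoted above, which confirms that most of the end-to-end gain comes from the non-linear, memory-intensive ops.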
Let’s read the generated code to understand how Inductor achieves this impressive optimization. You can find the generated code by searching for ``cpp_fused__mkl_linear_add_mul_relu_151`` in ``output_code.py``.
cpp_fused__mkl_linear_add_mul_relu_151 = async_compile.cpp('''
#include <ATen/record_function.h>
#include "/tmp/torchinductor_root/lr/clrlgu27q4ggd472umdzwsu6qcpqxcuusjxqvx2hwitjbujiiz7z.h"
extern "C" void kernel(float* in_out_ptr0,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       const float* in_ptr3)
{
    RECORD_FUNCTION("graph_0_cpp_fused__mkl_linear_add_mul_relu_151", c10::ArrayRef<c10::IValue>({}));
    #pragma omp parallel num_threads(32)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(16384L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(512L); i1+=static_cast<long>(8L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (512L*i0)));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(i1));
                    auto tmp3 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(i1));
                    auto tmp7 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(i1));
                    auto tmp2 = tmp0 + tmp1;
                    auto tmp4 = tmp2 + tmp3;
                    auto tmp6 = tmp4 * tmp5;
                    auto tmp8 = tmp6 + tmp7;
                    tmp8.store(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));
                }
            }
        }
    }
}''')
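From the generated code above, we can see that this kernel performs a typical `loop fusion <https://en.wikipedia.org/wiki/Loop_fission_and_fusion>`_ on ``[add, add, mul, add]``. Executed unfused, as in eager mode, this pattern is a memory-bound bottleneck that prevents good performance. To get a more intuitive feeling for this optimization, we can infer the sizes and strides of the inputs and benchmark the ``[add, add, mul, add]`` pattern further.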
# bench.py
import timeit

import torch

def func(arg_0, arg_1, arg_2, arg_3, arg_4):
    add_0 = arg_0 + arg_1
    add_1 = add_0 + arg_2
    mul_1 = add_1 * arg_3
    add_2 = mul_1 + arg_4
    arg_2 = add_2
    return arg_2

# Shapes inferred from the generated kernel: 16384 rows of 512 floats,
# with the (1, 512) operands broadcast over the rows
arg_0 = torch.rand(16384, 512)
arg_1 = torch.rand(1, 512)
arg_2 = torch.zeros(16384, 512)
arg_3 = torch.rand(1, 512)
arg_4 = torch.rand(1, 512)
input = (arg_0, arg_1, arg_2, arg_3, arg_4)

# The first call triggers compilation
inductor_func = torch.compile(func)
with torch.no_grad():
    inductor_func(*input)

NUM_ITERS = 100
with torch.no_grad():
    # warmup
    for _ in range(10):
        func(*input)
    eager_t = timeit.timeit("func(*input)", number=NUM_ITERS, globals=globals())

with torch.no_grad():
    # warmup
    for _ in range(10):
        inductor_func(*input)
    inductor_t = timeit.timeit("inductor_func(*input)", number=NUM_ITERS, globals=globals())

print(f"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter")
print(f"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter")
print(f"speed up ratio: {eager_t / inductor_t}")
Output:
eager use: 5.780875144992024 ms/iter
inductor use: 0.9588955780491233 ms/iter
speed up ratio: 6.0286805751604735
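This is just an example. The profiling table shows that all element-wise ops in this model are fused automatically by Inductor. You can read more kernels in ``output_code.py``.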
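The effect of fusing ``[add, add, mul, add]`` can be sketched in plain Python. This is an illustration of loop fusion itself, not of the generated kernel: it uses flat lists of equal length, omits broadcasting and vectorization, and the function names are made up for this example.

```python
def unfused(a, b, c, d, e):
    """Four separate passes, each materializing a full intermediate list."""
    t0 = [x + y for x, y in zip(a, b)]    # add
    t1 = [x + y for x, y in zip(t0, c)]   # add
    t2 = [x * y for x, y in zip(t1, d)]   # mul
    return [x + y for x, y in zip(t2, e)]  # add


def fused(a, b, c, d, e):
    """One pass: every input element is read once, no intermediate is stored."""
    return [((a[i] + b[i]) + c[i]) * d[i] + e[i] for i in range(len(a))]


a = [1, 2, 3]
b = [10, 20, 30]
c = [0, 0, 0]
d = [2, 2, 2]
e = [1, 1, 1]
print(unfused(a, b, c, d, e))
print(fused(a, b, c, d, e))
```

Both versions compute the same values, but the unfused one writes and re-reads three intermediate arrays. For large tensors those extra memory round trips dominate the runtime, which is the cost that fusion removes.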
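Conclusion
This document gives an in-depth tutorial for the Inductor CPU backend.
With motivating examples, we walk through the process of debugging and profiling. The main idea is to narrow down the problem.
We demonstrate step by step how to delve into an issue and find the root cause of a failure, with the help of debugging logs and the Minifier tool. First determine the component in which the failure occurs, then try to generate the smallest snippet of code that reproduces the failure.
When performance with Inductor is better than that of ``eager`` mode, we provide a solid analytical method for performance profiling. We show how to find the time-consuming hotspots with PyTorch Profiler and how to explain the phenomenon at the operator level or kernel level.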