Distributed and Parallel Training Tutorials

Created On: Oct 04, 2022 | Last Updated: Oct 31, 2024 | Last Verified: Nov 05, 2024

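Distributed training is a model training paradigm that spreads the training workload across multiple worker nodes, which can significantly speed up training and, by making larger models and datasets practical, may also improve model accuracy. Distributed training can be used for any type of machine learning model, but it pays off most for large models and compute-intensive tasks such as deep learning.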

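There are several ways to perform distributed training in PyTorch, each with its own advantages in certain use cases:

Distributed Data Parallel (DDP)
Fully Sharded Data Parallel (FSDP)
Tensor Parallel (TP)
Device Mesh (DeviceMesh)
Remote Procedure Call (RPC) distributed training
Custom Extensions

Read more about these options in the Distributed Overview.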

Learn DDP

DDP Intro Video Tutorials

A step-by-step video series on how to get started with DistributedDataParallel and advance to more complex topics.

Code Video

https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro

Getting Started with Distributed Data Parallel

This tutorial provides a short and gentle introduction to PyTorch DistributedDataParallel.

Code

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
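
The core idea behind data-parallel training, where each worker computes gradients on its own slice of the batch and the gradients are averaged so that every replica applies the same update, can be sketched without PyTorch. The following is a minimal pure-Python simulation, not the DistributedDataParallel API: the helper names (`local_gradient`, `all_reduce_mean`, `data_parallel_step`) and the toy model y = w * x are assumptions made for illustration.

```python
# Pure-Python sketch of one data-parallel training step (no PyTorch needed).
# Toy model: y = w * x with mean squared error loss. All names are hypothetical.

def local_gradient(w, shard):
    """Gradient of the mean squared error over one worker's shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def data_parallel_step(w, batch, world_size, lr=0.01):
    """Run one step on `world_size` simulated workers; return each replica's weight."""
    # Split the batch into equally sized shards, one per simulated worker.
    shards = [batch[i::world_size] for i in range(world_size)]
    grads = [local_gradient(w, shard) for shard in shards]
    synced = all_reduce_mean(grads)
    # Every replica applies the same averaged gradient, so replicas stay identical.
    return [w - lr * g for g in synced]

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
replicas = data_parallel_step(w=0.0, batch=batch, world_size=2)
```

Because the shards have equal size, the average of the per-shard gradients equals the gradient over the full batch, which is why the replicas end up with the same weight a single worker would have computed.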

Distributed Training with Uneven Inputs Using the Join Context Manager

This tutorial describes the Join context manager and demonstrates its use with DistributedDataParallel.

Code

https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
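
To see why uneven inputs need special handling, consider ranks that hold different numbers of batches: a collective such as all-reduce needs every rank to participate, so a rank that runs out of data early has to keep "shadowing" the collective. The sketch below is a simplified pure-Python model of that behavior, not the Join implementation. The function name `simulate_join` is hypothetical, gradients are reduced to single floats, and the `divide_by_initial_world_size` flag is only loosely modeled on the DDP option of the same name.

```python
# Simplified model of training with uneven inputs. All names are illustrative.

def simulate_join(batches_per_rank, divide_by_initial_world_size=True):
    """Return the averaged "gradient" seen at each step.

    batches_per_rank[r] is the list of per-step gradient values for rank r;
    ranks may have different numbers of steps.
    """
    world_size = len(batches_per_rank)
    num_steps = max(len(batches) for batches in batches_per_rank)
    averaged = []
    for step in range(num_steps):
        total, active = 0.0, 0
        for batches in batches_per_rank:
            if step < len(batches):
                total += batches[step]  # rank still has data: real contribution
                active += 1
            # Otherwise the rank has joined: it contributes zero to the sum
            # but still takes part, so the collective does not hang.
        divisor = world_size if divide_by_initial_world_size else active
        averaged.append(total / divisor)
    return averaged

# Rank 0 has three batches, rank 1 has only one.
uneven = [[1.0, 2.0, 3.0], [3.0]]
by_world_size = simulate_join(uneven, divide_by_initial_world_size=True)
by_active_ranks = simulate_join(uneven, divide_by_initial_world_size=False)
```

The two divisors give different results once a rank has joined: dividing by the initial world size shrinks the later updates, while dividing by the number of active ranks keeps them at full scale.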

Learn FSDP

Getting Started with FSDP

This tutorial demonstrates how to perform distributed training with FSDP on the MNIST dataset.

Code

https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started
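
Conceptually, fully sharded data parallelism keeps only a shard of the parameters on each rank, gathers the full parameters when they are needed for computation, and then reduce-scatters gradients so each rank updates only the shard it owns. The following pure-Python sketch illustrates that cycle under strong simplifying assumptions: parameters are a flat list whose length divides evenly by the world size, and all helper names (`shard`, `all_gather`, `reduce_scatter_mean`, `fsdp_step`, `toy_grad`) are hypothetical rather than part of the FSDP API.

```python
# Conceptual sketch of one sharded training step. Not the FSDP API.

def shard(values, world_size):
    """Split a flat list into equal contiguous shards, one per rank."""
    size = len(values) // world_size  # assumes an even split
    return [values[r * size:(r + 1) * size] for r in range(world_size)]

def all_gather(shards):
    """Reconstruct the full flat parameter list from all shards."""
    return [v for s in shards for v in s]

def reduce_scatter_mean(per_rank_grads, world_size):
    """Average full gradients across ranks; each rank keeps only its shard."""
    n = len(per_rank_grads[0])
    avg = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    return shard(avg, world_size)

def fsdp_step(param_shards, grad_fn, lr=0.5):
    world_size = len(param_shards)
    full = all_gather(param_shards)  # materialize parameters for compute
    grads = [grad_fn(rank, full) for rank in range(world_size)]
    grad_shards = reduce_scatter_mean(grads, world_size)
    # Each rank updates only the shard it owns.
    return [[p - lr * g for p, g in zip(ps, gs)]
            for ps, gs in zip(param_shards, grad_shards)]

def toy_grad(rank, full_params):
    """Hypothetical per-rank gradient: rank r sees (r + 1) * p for each parameter."""
    return [(rank + 1) * p for p in full_params]

param_shards = shard([1.0, 2.0, 3.0, 4.0], world_size=2)
updated_shards = fsdp_step(param_shards, toy_grad)
```

The point of the pattern is memory: between steps, each rank stores only its own slice of the parameters and gradients rather than a full replica.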

FSDP Advanced

In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model with FSDP for text summarization.

Code

https://pytorch.org/tutorials/intermediate/FSDP_advanced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced

Learn Tensor Parallel (TP)

Large Scale Transformer model training with Tensor Parallel (TP)

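This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel together with Fully Sharded Data Parallel (FSDP).

Code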

https://pytorch.org/tutorials/intermediate/TP_tutorial.html
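
The arithmetic behind tensor parallelism can be checked on a tiny example: a weight matrix is split across ranks either by columns (each rank produces a slice of the output features) or by rows (each rank produces a partial sum that is then added together). The sketch below is a pure-Python illustration of that identity for y = x W, with W stored as an [in_features, out_features] matrix. The function names are hypothetical and do not correspond to the PyTorch tensor parallel API.

```python
# Column-wise and row-wise sharded matrix multiplication on plain lists.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def colwise_parallel(x, w, world_size):
    """Shard w by columns; each rank computes a slice of the output features."""
    cols = len(w[0]) // world_size  # assumes an even split
    outs = []
    for r in range(world_size):
        w_shard = [row[r * cols:(r + 1) * cols] for row in w]
        outs.append(matmul(x, w_shard))
    # Concatenate slices along the feature dimension (an all-gather in practice).
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

def rowwise_parallel(x, w, world_size):
    """Shard w by rows and x by columns; partial products are summed."""
    rows = len(w) // world_size  # assumes an even split
    partials = []
    for r in range(world_size):
        x_shard = [row[r * rows:(r + 1) * rows] for row in x]
        w_shard = w[r * rows:(r + 1) * rows]
        partials.append(matmul(x_shard, w_shard))
    # Sum the partial results element-wise (an all-reduce in practice).
    return [[sum(p[i][j] for p in partials) for j in range(len(w[0]))]
            for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 2], [3, 4]]
reference = matmul(x, w)
col_result = colwise_parallel(x, w, world_size=2)
row_result = rowwise_parallel(x, w, world_size=2)
```

Both sharded variants reproduce the unsharded product. Chaining a column-wise layer into a row-wise layer is a common pairing because the column-wise output is already split the way the row-wise layer expects its input.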

Learn DeviceMesh

Getting Started with DeviceMesh

In this tutorial, you will learn about DeviceMesh and how it can help with distributed training.

Code

https://pytorch.org/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh
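
A device mesh can be thought of as an N-dimensional grid of ranks, where each dimension defines a set of communication groups. The following pure-Python sketch shows the bookkeeping for a 2D mesh, assuming ranks are laid out in row-major order. The helpers `build_mesh` and `groups_along` are hypothetical and only illustrate the layout; they are not the DeviceMesh API.

```python
# Rank layout and communication groups of a 2D mesh. All names are illustrative.

def build_mesh(mesh_shape):
    """Arrange ranks 0..N-1 in a row-major 2D grid."""
    rows, cols = mesh_shape
    return [[r * cols + c for c in range(cols)] for r in range(rows)]

def groups_along(mesh, dim):
    """Return the rank groups that communicate along mesh dimension `dim`.

    A group along a dimension contains the ranks whose coordinates differ
    only in that dimension.
    """
    if dim == 0:
        return [list(col) for col in zip(*mesh)]  # same column, varying row
    if dim == 1:
        return [list(row) for row in mesh]        # same row, varying column
    raise ValueError("this sketch only supports 2D meshes")

mesh = build_mesh((2, 4))             # 8 ranks as a 2 x 4 grid
replicate_groups = groups_along(mesh, 0)
shard_groups = groups_along(mesh, 1)
```

With this layout, a hybrid setup could shard within each row of four ranks and replicate across the two rows, without tracking the individual process groups by hand.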

Learn RPC

Getting Started with Distributed RPC Framework

This tutorial demonstrates how to get started with RPC-based distributed training.

Code

https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started

Implementing a Parameter Server Using Distributed RPC Framework

This tutorial walks through a simple example of implementing a parameter server using PyTorch's Distributed RPC framework.

Code

https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial

Implementing Batch RPC Processing Using Asynchronous Executions

In this tutorial, you will build batch-processing RPC applications with the @rpc.functions.async_execution decorator.

Code

https://pytorch.org/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution
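
The batching pattern rests on handlers that return a future instead of a value: the reply to each caller is deferred until the future completes, so a server can collect several requests and answer them all at once. The sketch below shows that pattern with the standard library only. `concurrent.futures.Future` stands in for the future type used by the RPC framework, and the `BatchAverager` class is a hypothetical example rather than code from the tutorial.

```python
import threading
from concurrent.futures import Future

class BatchAverager:
    """Collects `batch_size` submissions, then answers all of them at once."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._lock = threading.Lock()
        self._pending = []  # list of (value, Future) pairs awaiting a full batch

    def submit(self, value):
        """Return a Future immediately; it completes once the batch is full."""
        fut = Future()
        with self._lock:
            self._pending.append((value, fut))
            if len(self._pending) < self.batch_size:
                return fut  # batch not full yet: caller keeps waiting
            # Batch is full: take ownership of it and reset the buffer.
            batch, self._pending = self._pending, []
        avg = sum(v for v, _ in batch) / len(batch)
        for _, pending_future in batch:
            pending_future.set_result(avg)  # releases every waiting caller
        return fut

server = BatchAverager(batch_size=2)
first = server.submit(2.0)
first_done_early = first.done()   # still pending: only one request so far
second = server.submit(4.0)       # completes the batch
```

Processing requests in batches like this lets the server amortize per-request overhead, for example by applying one combined update instead of one update per caller.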

Combining Distributed DataParallel with Distributed RPC Framework

In this tutorial, you will learn how to combine distributed data parallelism with distributed model parallelism.

Code

https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp

Custom Extensions

Customize Process Group Backends Using Cpp Extensions

In this tutorial, you will learn how to implement a custom ProcessGroup backend and plug it into the PyTorch distributed package using cpp extensions.

Code

https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp