Distributed Training
DataParallel is no longer recommended[1]; for multi-GPU training, use DistributedDataParallel (DDP) instead.
Key terms (the usual setup is one training process per GPU):
- rank / global_rank: the index of a process across all nodes
- gpu / local_rank: the index of a process, and of the GPU it drives, within one node
- world_size: the total number of processes
Launching:
- torchrun: the current launcher; it replaces torch.distributed.launch and sets RANK / LOCAL_RANK / WORLD_SIZE (plus MASTER_ADDR / MASTER_PORT) for every process it spawns
- --nproc_per_node: how many processes to start on each node, usually one per GPU (a minimal launch sketch follows this list)
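A minimal launch sketch, assuming torchrun with one process per GPU (the script name check_env.py is illustrative):

```python
# check_env.py -- how a torchrun-launched process sees its identity.
# Launch with, e.g.:  torchrun --nnodes=1 --nproc_per_node=4 check_env.py
import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node; used to pick a GPU
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

# NCCL for GPUs; fall back to gloo on CPU-only machines.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)  # reads MASTER_ADDR/MASTER_PORT set by torchrun

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
dist.destroy_process_group()
```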
What is DDP?
DistributedDataParallel (DDP) is PyTorch's recommended data-parallel scheme: one process per GPU, built on the torch.distributed package. The surrounding pieces:
- NCCL: NVIDIA's collective-communication library, the usual GPU backend for torch.distributed
- RDMA: remote direct memory access, used by fast interconnects (e.g. InfiniBand) for inter-node traffic
- NVML: NVIDIA Management Library, used to query and monitor GPUs
- torch.distributed: PyTorch's distributed-computing primitives (process groups and collectives)
- All-Reduce / Ring All-Reduce: the collective that sums (then averages) gradients across all ranks; the ring variant keeps per-GPU communication roughly constant as the number of GPUs grows (a short demo follows this list)
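To make the collective concrete, here is a hedged all-reduce demo (launched with torchrun, toy tensor); DDP performs the same operation on gradient buckets during backward():

```python
# all_reduce_demo.py -- average a tensor across ranks; launch with torchrun.
import os

import torch
import torch.distributed as dist

use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

if use_cuda:
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
else:
    device = torch.device("cpu")

# Each rank contributes a tensor filled with its own rank id.
t = torch.full((4,), float(rank), device=device)

dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank now holds the same sum
t /= world_size                           # turn the sum into a mean, like averaged gradients
print(f"rank {rank}: {t.tolist()}")

dist.destroy_process_group()
```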
Why one process per GPU? In CPython, the global interpreter lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode at once; it prevents race conditions and ensures thread safety, and it is needed mainly because CPython's memory management is not thread-safe. For training, this means a single-process, multi-threaded design such as DataParallel serializes on the interpreter, so DDP instead runs a separate process per GPU and synchronizes gradients with All-Reduce.
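A minimal DDP training sketch under these assumptions (toy linear model and random data; launch with torchrun --nproc_per_node=<num_gpus> ddp_minimal.py):

```python
# ddp_minimal.py -- one process per GPU; gradients are synchronized in backward().
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# DistributedSampler gives each rank a disjoint shard of the dataset.
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # different shuffling each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()   # All-Reduce of gradients happens here
        optimizer.step()

dist.destroy_process_group()
```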
Higher-level tooling:
- Distributed training with 🤗 Accelerate: https://huggingface.co/docs/transformers/accelerate (sketch below)
- TorchX — PyTorch/TorchX main documentation
- TorchElastic Kubernetes — PyTorch 2.1 documentation
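A hedged sketch of the Accelerate flow (toy model and data; the Accelerator object hides device placement and the DDP wrapping; launch with accelerate launch or torchrun):

```python
# accelerate_minimal.py -- sketch of a 🤗 Accelerate training loop (toy model/data).
# Launch with:  accelerate launch accelerate_minimal.py
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)), batch_size=16)

# prepare() moves everything to the right device and wraps the model for DDP if needed.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()

accelerator.print("done")  # prints only on the main process
```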
Example startup log from a PyTorch Lightning DDP run with 3 processes (one visible GPU each):
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------
0 | conv1 | Conv2d | 320
1 | conv2 | Conv2d | 18.5 K
2 | dropout1 | Dropout | 0
3 | dropout2 | Dropout | 0
4 | fc1 | Linear | 1.2 M
5 | fc2 | Linear | 1.3 K
-------------------------------------
1.2 M Trainable params
0 Non-trainable params
1.2 M Total params
4.800 Total estimated model params size (MB)
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
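The log above is from a PyTorch Lightning run with 3 processes, each seeing a single GPU. A hedged sketch of a Trainer configuration that produces this kind of output (the LightningModule and data here are toy assumptions):

```python
# lightning_ddp.py -- sketch of a PyTorch Lightning DDP setup over 3 processes,
# each with one visible GPU (matching the log above). Model/data are toy assumptions.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)), batch_size=32)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,       # 1 GPU per node -> LOCAL_RANK: 0 on every node
        num_nodes=3,     # 3 processes in total -> MEMBER: x/3
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(LitRegressor(), loader)
```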
PyTorch DDP
- Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.2.0+cu121 documentation
- Writing Distributed Applications with PyTorch — PyTorch Tutorials 2.2.0+cu121 documentation
- Saving and Loading Models — PyTorch Tutorials 2.2.0+cu121 documentation (see the checkpoint sketch after this list)
- Torch Distributed Elastic — PyTorch 2.1 documentation
- TorchElastic Kubernetes — PyTorch 2.1 documentation
- elastic/kubernetes at master · pytorch/elastic (github.com)
- Kubeflow Pipelines — PyTorch/TorchX main documentation
- TorchX — PyTorch/TorchX main documentation
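For the saving/loading entry above, the usual DDP checkpoint pattern is: write from rank 0 only, barrier, then load with map_location so each rank maps the weights onto its own GPU. A hedged sketch (the path and helper names are illustrative, and ddp_model is assumed to be a DDP-wrapped module as in the earlier example):

```python
# Hedged sketch of checkpointing under DDP (assumes dist is initialized and
# ddp_model is a DistributedDataParallel-wrapped module; the path is illustrative).
import torch
import torch.distributed as dist

CKPT = "model_ckpt.pt"  # illustrative path


def save_checkpoint(ddp_model):
    if dist.get_rank() == 0:
        # Save the underlying module's weights, not the DDP wrapper's.
        torch.save(ddp_model.module.state_dict(), CKPT)
    dist.barrier()  # make sure the file exists before anyone tries to read it


def load_checkpoint(ddp_model, local_rank):
    # Map the rank-0 checkpoint onto this rank's own device.
    map_location = {"cuda:0": f"cuda:{local_rank}"}
    state = torch.load(CKPT, map_location=map_location)
    ddp_model.module.load_state_dict(state)
    dist.barrier()
```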
Monitoring PyTorch
Metrics — PyTorch 2.1 documentation
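The Metrics page covers torchelastic's own metrics API; as one additional, hedged option, per-GPU utilization and memory can be polled on each node through NVML (via the pynvml package, mentioned above):

```python
# gpu_monitor.py -- hedged sketch: poll GPU utilization/memory via NVML (pynvml).
import time

import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(3):  # a few polling iterations for illustration
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(f"gpu{i}: util={util.gpu}% mem={mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```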
Trainer: the 🤗 Transformers Trainer (and Accelerate below) take care of the DDP plumbing, including process-group setup, gradient synchronization, and device placement, when the script is launched with torchrun or accelerate launch:
- Distributed training with 🤗 Accelerate (huggingface.co)
- PyTorch 分布式训练实现(DP/DDP/torchrun/多机多卡) - 知乎
- Pytorch - 分布式训练极简体验 - 知乎
- PyTorch分布式训练基础--DDP使用 - 知乎
- 开源一个 PyTorch 分布式(DDP)训练 mnist 的例子代码 - 知乎
- Distributed Data Parallel — PyTorch 2.1 documentation
- Machine Learning as a Flow: Kubeflow vs. Metaflow | by Roman Kazinnik | Medium
- Ring Allreduce - 简书
- GPU高效通信算法——Ring Allreduce
- Reduced ring - Wikipedia
- Machine Learning Distributed: Ring-Reduce vs. All-Reduce | by Roman Kazinnik | Medium
- 【转载】 Ring Allreduce (深度神经网络的分布式计算范式 -------------- 环形全局规约) - Angry_Panda - 博客园
- ddp 多卡训练torch 记录_torch ddp 卡死-CSDN博客
- pytorch多卡分布式训练简要分析 - 知乎
- Distributed data parallel training in Pytorch
- Pytorch中的Distributed Data Parallel与混合精度训练(Apex) - 知乎
- Pytorch 分散式訓練 DistributedDataParallel — 實作篇 | by 李謦伊 | 謦伊的閱讀筆記 | Medium
- Multi-GPU training — PyTorch Lightning 1.4.9 documentation
- Deepspeed 大模型分布式框架精讲 - 哔哩哔哩 bilibili