Distributed Training
DataParallel is no longer recommended[1]; for multi-GPU training, use DistributedDataParallel (DDP) instead.
Key terms (the usual setup is one training process per GPU):
- rank / global_rank: the index of a process across all nodes
- gpu / local_rank: the index of a process, and of the GPU it drives, within one node
- world_size: the total number of processes
Launching:
- torchrun: the current launcher; it replaces torch.distributed.launch and sets RANK / LOCAL_RANK / WORLD_SIZE (plus MASTER_ADDR / MASTER_PORT) for every process it spawns
- --nproc_per_node: how many processes to start on each node, usually one per GPU (a minimal launch sketch follows this list)
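A minimal launch sketch, assuming torchrun with one process per GPU (the script name check_env.py is illustrative):

```python
# check_env.py -- how a torchrun-launched process sees its identity.
# Launch with, e.g.:  torchrun --nnodes=1 --nproc_per_node=4 check_env.py
import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node; used to pick a GPU
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

# NCCL for GPUs; fall back to gloo on CPU-only machines.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)  # reads MASTER_ADDR/MASTER_PORT set by torchrun

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
dist.destroy_process_group()
```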
What is DDP?
DistributedDataParallel (DDP) is PyTorch's recommended data-parallel scheme: one process per GPU, built on the torch.distributed package. The surrounding pieces:
- NCCL: NVIDIA's collective-communication library, the usual GPU backend for torch.distributed
- RDMA: remote direct memory access, used by fast interconnects (e.g. InfiniBand) for inter-node traffic
- NVML: NVIDIA Management Library, used to query and monitor GPUs
- torch.distributed: PyTorch's distributed-computing primitives (process groups and collectives)
- All-Reduce / Ring All-Reduce: the collective that sums (then averages) gradients across all ranks; the ring variant keeps per-GPU communication roughly constant as the number of GPUs grows (a short demo follows this list)
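To make the collective concrete, here is a hedged all-reduce demo (launched with torchrun, toy tensor); DDP performs the same operation on gradient buckets during backward():

```python
# all_reduce_demo.py -- average a tensor across ranks; launch with torchrun.
import os

import torch
import torch.distributed as dist

use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

if use_cuda:
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
else:
    device = torch.device("cpu")

# Each rank contributes a tensor filled with its own rank id.
t = torch.full((4,), float(rank), device=device)

dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank now holds the same sum
t /= world_size                           # turn the sum into a mean, like averaged gradients
print(f"rank {rank}: {t.tolist()}")

dist.destroy_process_group()
```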
Why one process per GPU? In CPython, the global interpreter lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode at once; it prevents race conditions and ensures thread safety, and it is needed mainly because CPython's memory management is not thread-safe. For training, this means a single-process, multi-threaded design such as DataParallel serializes on the interpreter, so DDP instead runs a separate process per GPU and synchronizes gradients with All-Reduce.
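A minimal DDP training sketch under these assumptions (toy linear model and random data; launch with torchrun --nproc_per_node=<num_gpus> ddp_minimal.py):

```python
# ddp_minimal.py -- one process per GPU; gradients are synchronized in backward().
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# DistributedSampler gives each rank a disjoint shard of the dataset.
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # different shuffling each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()   # All-Reduce of gradients happens here
        optimizer.step()

dist.destroy_process_group()
```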
Higher-level tooling:
- Distributed training with 🤗 Accelerate: https://huggingface.co/docs/transformers/accelerate (sketch below)
- TorchX — PyTorch/TorchX main documentation
- TorchElastic Kubernetes — PyTorch 2.1 documentation
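A hedged sketch of the Accelerate flow (toy model and data; the Accelerator object hides device placement and the DDP wrapping; launch with accelerate launch or torchrun):

```python
# accelerate_minimal.py -- sketch of a 🤗 Accelerate training loop (toy model/data).
# Launch with:  accelerate launch accelerate_minimal.py
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)), batch_size=16)

# prepare() moves everything to the right device and wraps the model for DDP if needed.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()

accelerator.print("done")  # prints only on the main process
```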
Example startup log from a PyTorch Lightning DDP run with 3 processes (one visible GPU each):
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------
0 | conv1 | Conv2d | 320
1 | conv2 | Conv2d | 18.5 K
2 | dropout1 | Dropout | 0
3 | dropout2 | Dropout | 0
4 | fc1 | Linear | 1.2 M
5 | fc2 | Linear | 1.3 K
-------------------------------------
1.2 M Trainable params
0 Non-trainable params
1.2 M Total params
4.800 Total estimated model params size (MB)
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
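The log above is from a PyTorch Lightning run with 3 processes, each seeing a single GPU. A hedged sketch of a Trainer configuration that produces this kind of output (the LightningModule and data here are toy assumptions):

```python
# lightning_ddp.py -- sketch of a PyTorch Lightning DDP setup over 3 processes,
# each with one visible GPU (matching the log above). Model/data are toy assumptions.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)), batch_size=32)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,       # 1 GPU per node -> LOCAL_RANK: 0 on every node
        num_nodes=3,     # 3 processes in total -> MEMBER: x/3
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(LitRegressor(), loader)
```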
PyTorch DDP
- Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.2.0+cu121 documentation
- Writing Distributed Applications with PyTorch — PyTorch Tutorials 2.2.0+cu121 documentation
- Saving and Loading Models — PyTorch Tutorials 2.2.0+cu121 documentation (see the checkpoint sketch after this list)
- Torch Distributed Elastic — PyTorch 2.1 documentation
- TorchElastic Kubernetes — PyTorch 2.1 documentation
- elastic/kubernetes at master · pytorch/elastic (github.com)
- Kubeflow Pipelines — PyTorch/TorchX main documentation
- TorchX — PyTorch/TorchX main documentation
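For the saving/loading entry above, the usual DDP checkpoint pattern is: write from rank 0 only, barrier, then load with map_location so each rank maps the weights onto its own GPU. A hedged sketch (the path and helper names are illustrative, and ddp_model is assumed to be a DDP-wrapped module as in the earlier example):

```python
# Hedged sketch of checkpointing under DDP (assumes dist is initialized and
# ddp_model is a DistributedDataParallel-wrapped module; the path is illustrative).
import torch
import torch.distributed as dist

CKPT = "model_ckpt.pt"  # illustrative path


def save_checkpoint(ddp_model):
    if dist.get_rank() == 0:
        # Save the underlying module's weights, not the DDP wrapper's.
        torch.save(ddp_model.module.state_dict(), CKPT)
    dist.barrier()  # make sure the file exists before anyone tries to read it


def load_checkpoint(ddp_model, local_rank):
    # Map the rank-0 checkpoint onto this rank's own device.
    map_location = {"cuda:0": f"cuda:{local_rank}"}
    state = torch.load(CKPT, map_location=map_location)
    ddp_model.module.load_state_dict(state)
    dist.barrier()
```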
Monitoring PyTorch
Metrics — PyTorch 2.1 documentation
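The Metrics page covers torchelastic's own metrics API; as one additional, hedged option, per-GPU utilization and memory can be polled on each node through NVML (via the pynvml package, mentioned above):

```python
# gpu_monitor.py -- hedged sketch: poll GPU utilization/memory via NVML (pynvml).
import time

import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(3):  # a few polling iterations for illustration
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(f"gpu{i}: util={util.gpu}% mem={mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```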
Trainer: the 🤗 Transformers Trainer (and Accelerate below) take care of the DDP plumbing, including process-group setup, gradient synchronization, and device placement, when the script is launched with torchrun or accelerate launch:
- Distributed training with 🤗 Accelerate (huggingface.co)
- PyTorch 分布式训练实现(DP/DDP/torchrun/多机多卡) - 知乎
- Pytorch - 分布式训练极简体验 - 知乎
- PyTorch分布式训练基础--DDP使用 - 知乎
- 开源一个 PyTorch 分布式(DDP)训练 mnist 的例子代码 - 知乎
- Distributed Data Parallel — PyTorch 2.1 documentation
- Machine Learning as a Flow: Kubeflow vs. Metaflow | by Roman Kazinnik | Medium
- Ring Allreduce - 简书
- GPU高效通信算法——Ring Allreduce
- Reduced ring - Wikipedia
- Machine Learning Distributed: Ring-Reduce vs. All-Reduce | by Roman Kazinnik | Medium
- 【转载】 Ring Allreduce (深度神经网络的分布式计算范式 -------------- 环形全局规约) - Angry_Panda - 博客园
- ddp 多卡训练torch 记录_torch ddp 卡死-CSDN博客
- pytorch多卡分布式训练简要分析 - 知乎
- Distributed data parallel training in Pytorch
- Pytorch中的Distributed Data Parallel与混合精度训练(Apex) - 知乎
- Pytorch 分散式訓練 DistributedDataParallel — 實作篇 | by 李謦伊 | 謦伊的閱讀筆記 | Medium
- Multi-GPU training — PyTorch Lightning 1.4.9 documentation
- Deepspeed 大模型分布式框架精讲 - 哔哩哔哩 bilibili