# How to Run MLPerf Llama 2 70B Training on AMD MI325X Without SLURM
2026-01-17
admin
## Overview

This guide covers running the MLPerf Training v5.1 Llama 2 70B LoRA fine-tuning benchmark on a multi-node AMD Instinct MI325X cluster without a SLURM scheduler.

AMD provides an official MLPerf Training Docker image (`rocm/amd-mlperf:llama2_70b_training_5.1`) designed primarily for SLURM-managed clusters. However, many environments use simpler SSH-based orchestration. This post demonstrates how to run multi-node distributed training using PyTorch's rendezvous mechanism.

## Hardware Setup

- Cluster: 4× MI325X nodes
- GPUs: 8× AMD Instinct MI325X per node (32 total)
- Network: high-speed interconnect for RCCL communication
- Storage: shared NFS mount at /mnt/shared
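Before installing anything, it is worth confirming that every node exposes the ROCm device nodes and sees the shared filesystem. A minimal pre-flight check from the head node, assuming the node-0 through node-3 hostnames used by the orchestration script later in this post, might look like this:

```bash
#!/usr/bin/env bash
# Pre-flight check: confirm each node exposes the ROCm device nodes (/dev/kfd,
# /dev/dri) and can see the shared NFS mount. The node-0..node-3 hostnames are
# an assumption; substitute your own.
for i in 0 1 2 3; do
  echo "--- node-$i ---"
  ssh "node-$i" "ls /dev/kfd /dev/dri > /dev/null && echo 'ROCm device nodes present'"
  ssh "node-$i" "df -h /mnt/shared | tail -n 1"
done
```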
## Prerequisites

### Software Dependencies

- **ROCm installation**: Follow the ROCm installation guide for your Linux distribution.
- **Docker GPU access**: Ensure Docker can access the AMD GPUs:

  ```bash
  docker run --rm --device /dev/dri --device /dev/kfd rocm/pytorch:latest rocm-smi
  ```

- **Multi-node networking**:
  - Passwordless SSH between all nodes
  - High-speed network interface (InfiniBand/RoCE recommended)
  - Shared filesystem accessible from all nodes

### Host Setup

Pull the MLPerf container on all nodes:

```bash
docker pull rocm/amd-mlperf:llama2_70b_training_5.1
```

## Data Preparation

The benchmark requires ~270 GB for the Llama 2 70B model and the GovReport dataset. A HuggingFace token with Llama 2 license acceptance is required:

```bash
export HF_TOKEN=your_token_here
./finetune_llama.sh prepare
```

## Two Approaches for Multi-Node Training

AMD's container supports two launch methods.

### 1. SLURM-Based (AMD Default)

```bash
# Requires SLURM scheduler
sbatch run_with_docker_slurm.sh
```

### 2. Manual Multi-Node with Rendezvous

For non-SLURM environments, PyTorch's torchrun supports a rendezvous backend that handles rank assignment automatically:

```bash
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=MASTER_IP:29500 \
  --rdzv_id=mlperf_run \
  train.py
```

This command runs identically on all nodes; the c10d backend coordinates rank assignment.

## Implementation

### Container Launch Pattern

Our approach uses SSH to launch training on each node, passing the distributed configuration via environment variables:

```bash
# Start container with data mounts
docker run --rm --init --detach \
  --net=host --ipc=host \
  --device /dev/dri --device /dev/kfd \
  --name mlperf_llama2sft \
  -v $DATADIR/data:/data \
  -v $DATADIR/model:/ckpt \
  -v $RESULTS:/logs \
  -v $CODE_DIR:/workspace/code \
  rocm/amd-mlperf:llama2_70b_training_5.1 sleep infinity

# Execute training with distributed config
docker exec \
  -e MASTER_ADDR=$MASTER_IP \
  -e MASTER_PORT=29500 \
  -e SLURM_NNODES=$NUM_NODES \
  -e SLURM_NODEID=$NODE_RANK \
  -e NCCL_SOCKET_IFNAME=$NET_IF \
  mlperf_llama2sft \
  bash -c 'cd /workspace/code && source config_MI325X_4x8x1.sh && bash ./run_and_time_slurm.sh'
```

### Orchestration Script

The main script SSHs to each node in parallel:

```bash
for node_idx in 0 1 2 3; do
  ssh node-$node_idx "launch_training.sh $node_idx $NUM_NODES" &
done
wait
```
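The launch_training.sh helper invoked above is essentially a thin per-node wrapper around the `docker exec` shown in the Container Launch Pattern section. A minimal sketch follows; the master IP and network interface name are placeholders, not values from the actual cluster:

```bash
#!/usr/bin/env bash
# launch_training.sh <node_rank> <num_nodes>
# Hypothetical sketch of the per-node helper invoked over SSH. It assumes the
# mlperf_llama2sft container is already running with the mounts shown above;
# only the per-node distributed settings change between nodes.
set -euo pipefail

NODE_RANK=$1
NUM_NODES=$2
MASTER_IP=${MASTER_IP:-10.0.0.10}   # placeholder: address of node-0
NET_IF=${NET_IF:-ens1np0}           # placeholder: high-speed RCCL interface

docker exec \
  -e MASTER_ADDR="$MASTER_IP" \
  -e MASTER_PORT=29500 \
  -e SLURM_NNODES="$NUM_NODES" \
  -e SLURM_NODEID="$NODE_RANK" \
  -e NCCL_SOCKET_IFNAME="$NET_IF" \
  mlperf_llama2sft \
  bash -c 'cd /workspace/code && source config_MI325X_4x8x1.sh && bash ./run_and_time_slurm.sh'
```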
### Key Configuration

The config_MI325X_4x8x1.sh script sets the critical parameters:

```bash
export DGXNNODES=4
export DGXNGPU=8
export FP8=True
export LR=0.0004
export MBS=1          # micro batch size
export MAX_STEPS=1024
```

## Results

### Scaling Analysis

Near-linear throughput scaling validates that the network interconnect is not a bottleneck.

### Comparison with Official Results

Our single-node result (20.57 min) matches AMD's official MLPerf v5.1 submission (~21 min) for MI325X, confirming correct configuration.

## Key Takeaways

- **Container design**: AMD's container expects training scripts at /workspace/code; mount custom configs there rather than extracting files.
- **Network interface**: Set NCCL_SOCKET_IFNAME to your high-speed network interface for optimal RCCL performance.
- **SLURM variables**: The container's run_and_time_slurm.sh reads SLURM_NNODES and SLURM_NODEID; these can be set manually in non-SLURM environments.
- **Scaling**: Expect near-linear throughput scaling on properly configured clusters. Time-to-convergence scaling may differ due to batch-size effects on convergence dynamics.

## Resources

- AMD MLPerf Training v5.1 Technical Blog
- Reproducing AMD MLPerf Training Results
- ROCm Multi-Node Setup Guide
- PyTorch Distributed Training on AMD GPUs

## Full Script

The complete finetune_llama.sh script supports:

- Single and multi-node runs
- Configurable NEXP for MLPerf-compliant submissions
- Automatic config selection based on node count

```bash
./finetune_llama.sh run 4       # 4-node, single run
./finetune_llama.sh run 4 10    # 4-node, 10 runs (MLPerf submission)
```

Interested in the full script? Reach out via LinkedIn and I'll be happy to share.
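For readers who would rather build their own wrapper, the overall shape is straightforward. The sketch below is not the author's script; it only illustrates, under the naming assumptions used earlier (node-0..node-3 hosts, the mlperf_llama2sft container, a launch_training.sh helper), how a prepare/run dispatcher with a configurable number of experiments (NEXP) can be structured:

```bash
#!/usr/bin/env bash
# finetune_llama.sh <prepare|run> [num_nodes] [num_experiments]
# Simplified sketch of a non-SLURM wrapper: "prepare" stages the model and
# dataset, "run" fans training out to all nodes over SSH for NEXP experiments.
set -euo pipefail

MODE=$1
NUM_NODES=${2:-1}
NEXP=${3:-1}   # MLPerf submissions for this benchmark use 10 runs

case "$MODE" in
  prepare)
    # Requires HF_TOKEN with Llama 2 license acceptance in the environment.
    # The container's own data-preparation entrypoint goes here; its exact
    # name inside the AMD image is intentionally left out of this sketch.
    echo "TODO: run the container's model/dataset download step"
    ;;
  run)
    for exp in $(seq 1 "$NEXP"); do
      echo "=== experiment $exp/$NEXP on $NUM_NODES node(s) ==="
      for node_idx in $(seq 0 $((NUM_NODES - 1))); do
        ssh "node-$node_idx" "launch_training.sh $node_idx $NUM_NODES" &
      done
      wait
    done
    ;;
  *)
    echo "usage: $0 <prepare|run> [num_nodes] [num_experiments]" >&2
    exit 1
    ;;
esac
```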
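One last practical note: the time-to-train figures quoted in the Results section come from the MLPerf result logs written to the /logs mount. Assuming the standard MLPerf logging format (`:::MLLOG` JSON lines whose `run_start` and `run_stop` events carry millisecond timestamps in `time_ms`), a quick way to extract the number from a result file (the path shown is an assumption) is:

```bash
# Compute time-to-train in minutes from a single MLPerf result log.
LOG=/mnt/shared/results/result_0.txt   # assumption: adjust to your log location

start=$(grep '"key": "run_start"' "$LOG" | head -n 1 | sed 's/.*:::MLLOG //' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["time_ms"])')
stop=$(grep '"key": "run_stop"' "$LOG" | tail -n 1 | sed 's/.*:::MLLOG //' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["time_ms"])')
python3 -c "print(f'time to train: {($stop - $start) / 60000:.2f} min')"
```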