Tools: How to Run MLPerf Llama 2 70B Training on AMD MI325X Without SLURM

Source: Dev.to

## Overview

This guide covers running the MLPerf Training v5.1 Llama 2 70B LoRA fine-tuning benchmark on a multi-node AMD Instinct MI325X cluster without a SLURM scheduler.

AMD provides an official MLPerf Training Docker image (rocm/amd-mlperf:llama2_70b_training_5.1) designed primarily for SLURM-managed clusters. However, many environments use simpler SSH-based orchestration. This post demonstrates how to run multi-node distributed training using PyTorch's rendezvous mechanism.

## Hardware Setup

- Cluster: 4× MI325X nodes
- GPUs: 8× AMD Instinct MI325X per node (32 total)
- Network: high-speed interconnect for RCCL communication
- Storage: shared NFS mount at /mnt/shared
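Before launching anything, it is worth confirming that every node is reachable over passwordless SSH (see Prerequisites below) and exposes all eight GPUs. A minimal sanity check, assuming hostnames node-0 through node-3 as used by the orchestration script later in this post:

```bash
#!/usr/bin/env bash
# Quick cluster sanity check: passwordless SSH + GPU visibility per node.
# Hostnames node-0..node-3 are assumptions; adjust to your cluster.
for node in node-0 node-1 node-2 node-3; do
    echo "=== $node ==="
    # BatchMode=yes fails immediately if passwordless auth is not configured
    ssh -o BatchMode=yes "$node" hostname
    # Print the ROCm GPU table; expect 8 MI325X devices per node
    # (use the full /opt/rocm/bin path if rocm-smi is not on the remote PATH)
    ssh -o BatchMode=yes "$node" rocm-smi
done
```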
## Prerequisites

### Software Dependencies

- ROCm Installation: follow the ROCm installation guide for your Linux distribution.
- Docker GPU Access: ensure Docker can access AMD GPUs:

  ```bash
  docker run --rm --device /dev/dri --device /dev/kfd rocm/pytorch:latest rocm-smi
  ```

- Multi-Node Networking:
  - Passwordless SSH between all nodes
  - High-speed network interface (InfiniBand/RoCE recommended)
  - Shared filesystem accessible from all nodes

### Host Setup

Pull the MLPerf container on all nodes:

```bash
docker pull rocm/amd-mlperf:llama2_70b_training_5.1
```

## Data Preparation

The benchmark requires ~270 GB of storage for the Llama 2 70B model and the GovReport dataset. A HuggingFace token with Llama 2 license acceptance is required:

```bash
export HF_TOKEN=your_token_here
./finetune_llama.sh prepare
```

## Two Approaches for Multi-Node Training

AMD's container supports two launch methods.

### 1. SLURM-Based (AMD Default)

```bash
# Requires SLURM scheduler
sbatch run_with_docker_slurm.sh
```

### 2. Manual Multi-Node with Rendezvous

For non-SLURM environments, PyTorch's torchrun supports a rendezvous backend that handles rank assignment automatically:

```bash
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=MASTER_IP:29500 \
  --rdzv_id=mlperf_run \
  train.py
```

This command runs identically on all nodes; the c10d backend coordinates rank assignment.

## Implementation

Our approach uses SSH to launch training on each node, passing the distributed configuration via environment variables.

### Container Launch Pattern

```bash
# Start container with data mounts
docker run --rm --init --detach \
  --net=host --ipc=host \
  --device /dev/dri --device /dev/kfd \
  --name mlperf_llama2sft \
  -v $DATADIR/data:/data \
  -v $DATADIR/model:/ckpt \
  -v $RESULTS:/logs \
  -v $CODE_DIR:/workspace/code \
  rocm/amd-mlperf:llama2_70b_training_5.1 sleep infinity

# Execute training with distributed config
docker exec \
  -e MASTER_ADDR=$MASTER_IP \
  -e MASTER_PORT=29500 \
  -e SLURM_NNODES=$NUM_NODES \
  -e SLURM_NODEID=$NODE_RANK \
  -e NCCL_SOCKET_IFNAME=$NET_IF \
  mlperf_llama2sft \
  bash -c 'cd /workspace/code && source config_MI325X_4x8x1.sh && bash ./run_and_time_slurm.sh'
```
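The orchestration loop in the next section calls a per-node launcher, launch_training.sh, which wraps commands like the two above. A minimal sketch of such a launcher (not the exact script from this post; the paths, master IP, and interface name are placeholders):

```bash
#!/usr/bin/env bash
# launch_training.sh <node_rank> <num_nodes>
# Minimal per-node launcher sketch wrapping the docker run / docker exec
# pattern shown above. Placeholder values below -- adjust for your cluster.
set -euo pipefail

NODE_RANK=$1
NUM_NODES=$2

MASTER_IP=10.0.0.10          # address of node-0 on the high-speed fabric
NET_IF=eth0                  # interface RCCL should use for inter-node traffic
DATADIR=/mnt/shared/mlperf   # shared model + dataset location
RESULTS=/mnt/shared/results  # shared log directory
CODE_DIR=/mnt/shared/code    # configs mounted into /workspace/code

docker run --rm --init --detach \
  --net=host --ipc=host \
  --device /dev/dri --device /dev/kfd \
  --name mlperf_llama2sft \
  -v $DATADIR/data:/data \
  -v $DATADIR/model:/ckpt \
  -v $RESULTS:/logs \
  -v $CODE_DIR:/workspace/code \
  rocm/amd-mlperf:llama2_70b_training_5.1 sleep infinity

docker exec \
  -e MASTER_ADDR=$MASTER_IP \
  -e MASTER_PORT=29500 \
  -e SLURM_NNODES=$NUM_NODES \
  -e SLURM_NODEID=$NODE_RANK \
  -e NCCL_SOCKET_IFNAME=$NET_IF \
  mlperf_llama2sft \
  bash -c 'cd /workspace/code && source config_MI325X_4x8x1.sh && bash ./run_and_time_slurm.sh'

# The container idles on sleep infinity under a fixed name, so stop it once
# training finishes to avoid a name collision on the next launch.
docker rm -f mlperf_llama2sft
```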
### Orchestration Script

The main script SSHs to each node in parallel:

```bash
for node_idx in 0 1 2 3; do
  ssh node-$node_idx "launch_training.sh $node_idx $NUM_NODES" &
done
wait
```

### Key Configuration

The config_MI325X_4x8x1.sh file sets the critical parameters:

```bash
export DGXNNODES=4
export DGXNGPU=8
export FP8=True
export LR=0.0004
export MBS=1           # micro batch size
export MAX_STEPS=1024
```

## Results

### Single Node (8 GPUs)

Time to convergence: 20.57 minutes.

### Scaling Analysis

Near-linear throughput scaling across the four nodes (32 GPUs) validates that the network interconnect is not a bottleneck.

### Comparison with Official Results

Our single-node result (20.57 min) matches AMD's official MLPerf v5.1 submission (~21 min) for the MI325X, confirming a correct configuration.

## Key Takeaways

- Container Design: AMD's container expects training scripts at /workspace/code; mount custom configs there rather than extracting files.
- Network Interface: set NCCL_SOCKET_IFNAME to your high-speed network interface for optimal RCCL performance.
- SLURM Variables: the container's run_and_time_slurm.sh reads SLURM_NNODES and SLURM_NODEID; these can be set manually for non-SLURM environments.
- Scaling: expect near-linear throughput scaling on properly configured clusters. Time-to-convergence scaling may differ due to batch size effects on convergence dynamics.

## Resources

- AMD MLPerf Training v5.1 Technical Blog
- Reproducing AMD MLPerf Training Results
- ROCm Multi-Node Setup Guide
- PyTorch Distributed Training on AMD GPUs

## Full Script

The complete finetune_llama.sh script supports:

- Single and multi-node runs
- Configurable NEXP for MLPerf-compliant submissions
- Automatic config selection based on node count

```bash
./finetune_llama.sh run 4      # 4-node, single run
./finetune_llama.sh run 4 10   # 4-node, 10 runs (MLPerf submission)
```

Interested in the full script? Reach out via LinkedIn and I'll be happy to share.
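In the meantime, here is a hypothetical skeleton (not the author's script) showing how the documented features could fit together: a prepare/run mode switch, an optional NEXP repeat count, and config selection keyed on node count. The single-node config filename and node hostnames are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton only -- not the finetune_llama.sh from this post.
# Illustrates the documented features: prepare/run modes, node count,
# optional NEXP repeat count, and config selection by node count.
set -euo pipefail

MODE=${1:?usage: $0 prepare|run [num_nodes] [nexp]}
NUM_NODES=${2:-1}
NEXP=${3:-1}

# Pick a config based on node count (single-node filename is an assumption).
case "$NUM_NODES" in
  1) CONFIG=config_MI325X_1x8x1.sh ;;
  4) CONFIG=config_MI325X_4x8x1.sh ;;
  *) echo "unsupported node count: $NUM_NODES" >&2; exit 1 ;;
esac

if [[ "$MODE" == "prepare" ]]; then
  # Data preparation needs a HuggingFace token with Llama 2 access.
  : "${HF_TOKEN:?set HF_TOKEN before running prepare}"
  echo "would download the Llama 2 70B checkpoint and GovReport dataset here"
  exit 0
fi

for exp in $(seq 1 "$NEXP"); do
  echo "=== experiment $exp/$NEXP using $CONFIG on $NUM_NODES node(s) ==="
  for node_idx in $(seq 0 $((NUM_NODES - 1))); do
    # launch_training.sh is the per-node launcher sketched earlier; a fuller
    # version would source $CONFIG instead of a hard-coded config name.
    ssh "node-$node_idx" "./launch_training.sh $node_idx $NUM_NODES $CONFIG" &
  done
  wait
done
```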