1 Star 0 Fork 0

WuYiming/mdt_policy

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
文件
克隆/下载
sbatch_train_calvin_hk.slurm 2.74 KB
一键复制 编辑 原始数据 按行查看 历史
WuYiming 提交于 2025-03-02 12:42 +08:00 . update slurm script
#!/bin/bash
#SBATCH -J mdt_calvin_hk
#SBATCH -p gpux
#SBATCH -n 4 # Number of tasks
#SBATCH -c 16 # Number of cores per task
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time 72:00:00
#SBATCH --comment=LightVLA
#SBATCH -o /public/home/group_xudong/yimingwu/project/MyProjects/LightVLA/examples/mdt/jobs/mdt_calvin_hk/std.out.%j
#SBATCH -e /public/home/group_xudong/yimingwu/project/MyProjects/LightVLA/examples/mdt/jobs/mdt_calvin_hk/std.err.%j
#########################################################
### Slurm scripts for Sugon Portal_5.0 of BASE
### Version 1.0 | 2019-09-09 | created by Zhang Guoliang
### Version 1.0.1 | 2020-11-24 | modified by Zhang Guoliang
#########################################################
### Get parameters from GUI
# Remove references to portal files that don't exist
# MIDFILE_DIR=/public/home/group_xudong/yimingwu/project/MyProjects/LightVLA/examples/mdt/jobs/mdt_calvin_hk/.portal
# source $MIDFILE_DIR/job_portal.var
# source $MIDFILE_DIR/job_interface.var
### Set basic var ### MARK_slurm2pbs
WORK_DIR=/public/home/group_xudong/yimingwu/project/MyProjects/LightVLA/examples/mdt/jobs/mdt_calvin_hk
mkdir -p $WORK_DIR
JOB_NAME=mdt_calvin_hk
JOBID=$SLURM_JOB_ID
NP=$SLURM_NPROCS
NNODE=`srun hostname | sort | uniq | wc -l`
LOG_FILE=$WORK_DIR/job_${JOB_NAME}_${JOBID}.log
HOST_FILE=$WORK_DIR/job_${JOB_NAME}_${JOBID}_${NP}c_${NNODE}n.ma
# Create the directory if it doesn't exist
mkdir -p $WORK_DIR
# Generate host file
srun hostname | sort | uniq -c | awk '{print $2":"$1}' > $HOST_FILE
# Log basic job information
echo -e "The start time is: `date +"%Y-%m-%d %H:%M:%S"` \n" | tee -a $LOG_FILE
echo -e "My job ID is: $JOBID \n" | tee -a $LOG_FILE
echo -e "The total cores is: $NP \n" | tee -a $LOG_FILE
echo -e "The hosts is: \n" | tee -a $LOG_FILE
cat $HOST_FILE | tee -a $LOG_FILE
echo -e "\n" | tee -a $LOG_FILE
# Run the application
echo "Running the application..." | tee -a $LOG_FILE
# 设置分布式训练所需的环境变量
export MASTER_ADDR=$(hostname -i)
export MASTER_PORT=$(( 10000 + $RANDOM % 20000 ))
export WORLD_SIZE=4
# export PYTHONPATH=/public/home/group_xudong/yimingwu/project/MyProjects/LightVLA:$PYTHONPATH
# Change to the project directory
cd /public/home/group_xudong/yimingwu/project/MyProjects/LightVLA/examples/mdt
# Properly activate conda environment
# Use the full path to conda and source the conda.sh file
source /public/home/group_xudong/yimingwu/miniconda3/etc/profile.d/conda.sh
conda activate mdt_env
# Set environment variables
export TORCH_USE_CUDA_DSA=1
# Run the training script with full path
srun --ntasks=4 --ntasks-per-node=4 --cpus-per-task=16 --gres=gpu:4 \
python mdt/training.py
echo "The end time is: `date +"%Y-%m-%d %H:%M:%S"` | tee -a $LOG_FILE"
Loading...
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/weleen/mdt_policy.git
git@gitee.com:weleen/mdt_policy.git
weleen
mdt_policy
mdt_policy
main

搜索帮助