This page demonstrates the use of GNU parallel with oarsh, so that a user can execute batches of (parallel) jobs, each using a subset of the resources allocated to a bigger (OAR) job.

For example: one run per core (or GPU), within a job that has many nodes/cores/GPUs allocated.

This requires OAR 2.5.9, which is only available as a beta version for now (oar-2.5.9+g5k6).

PoC with cores

PoC in oar-docker

Create a job with 2 nodes (and all their cores, here 4)

docker@frontend ~$ oarsub -l nodes=2 "sleep 4h"
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=1

Create the parallel sshloginfile that defines the connector to each core

docker@frontend ~$ oarstat -j 1 -p | oarprint -f - core -P cpuset,host -F "1/OAR_JOB_ID=1 OAR_USER_CPUSET=% oarsh %" | tee .parallel/cores
1/OAR_JOB_ID=1 OAR_USER_CPUSET=3 oarsh node2
1/OAR_JOB_ID=1 OAR_USER_CPUSET=1 oarsh node1
1/OAR_JOB_ID=1 OAR_USER_CPUSET=2 oarsh node2
1/OAR_JOB_ID=1 OAR_USER_CPUSET=0 oarsh node2
1/OAR_JOB_ID=1 OAR_USER_CPUSET=3 oarsh node1
1/OAR_JOB_ID=1 OAR_USER_CPUSET=1 oarsh node2
1/OAR_JOB_ID=1 OAR_USER_CPUSET=2 oarsh node1
1/OAR_JOB_ID=1 OAR_USER_CPUSET=0 oarsh node1

We force the 1/ prefix at the beginning of each line so that parallel runs only one job at a time on each remote entry. Without it, parallel complains that it cannot determine by itself the number of CPUs (i.e. the number of simultaneous runs) of each remote. Even if it could, the value might be completely wrong, because parallel is not aware of the cgroups/cpusets.
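The structure of these lines can be reproduced without OAR at hand. The following is only a sketch: make_sshlogin_lines is a hypothetical helper, not part of OAR or GNU parallel, and it reads hand-written "cpuset host" pairs instead of calling oarprint.

```shell
#!/bin/bash
# Hypothetical helper (not part of OAR or GNU parallel): turn "cpuset host"
# pairs read from stdin into GNU parallel sshlogin lines for a given job id.
make_sshlogin_lines() {
  local job_id=$1 cpuset host
  while read -r cpuset host; do
    # "1/" limits the entry to one simultaneous job; the variable
    # assignments followed by "oarsh" are the command used to reach the host.
    printf '1/OAR_JOB_ID=%s OAR_USER_CPUSET=%s oarsh %s\n' \
      "$job_id" "$cpuset" "$host"
  done
}

# Sample input mimicking two of the cores listed above.
printf '%s\n' '3 node2' '1 node1' | make_sshlogin_lines 1
```

Each entry is thus one execution slot bound to one core, which is why the same host appears once per core in the file.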

Create a sample script

docker@frontend ~$ cat <<'EOF' > script.sh 
#!/bin/bash
HOSTNAME=$(hostname)
CPUSET=$(< /proc/self/cpuset)
CPUSET_CPUS=$(< /sys/fs/cgroup/cpuset/$CPUSET/cpuset.cpus)
echo "$HOSTNAME:$CPUSET:$CPUSET_CPUS> job $@"
EOF
docker@frontend ~$ chmod 755 script.sh 

Test a run with a batch of 10 inputs

docker@frontend ~$ seq 10 | parallel --slf cores ./script.sh 
node2:/oardocker/node2/oar/docker_1/2:2> job 1
node1:/oardocker/node1/oar/docker_1/3:3> job 4
node1:/oardocker/node1/oar/docker_1/1:1> job 2
node2:/oardocker/node2/oar/docker_1/3:3> job 3
node2:/oardocker/node2/oar/docker_1/0:0> job 5
node1:/oardocker/node1/oar/docker_1/0:0> job 8
node2:/oardocker/node2/oar/docker_1/1:1> job 7
node1:/oardocker/node1/oar/docker_1/2:2> job 6
node2:/oardocker/node2/oar/docker_1/2:2> job 9
node1:/oardocker/node1/oar/docker_1/3:3> job 10

As we can see, every job indeed runs in a cpuset with only 1 logical CPU available for its execution.
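For a larger batch, reading the output by eye becomes tedious. A possible way to summarize it, assuming the output format of the sample script above, is to count the jobs run by each (host, cpu) slot; count_jobs_per_slot is a hypothetical helper name.

```shell
#!/bin/bash
# Count jobs per (host, cpu) slot from lines such as
#   node2:/oardocker/node2/oar/docker_1/2:2> job 1
# Fields are split on ":" or ">": $1 = host, $2 = cpuset path, $3 = cpuset.cpus.
count_jobs_per_slot() {
  awk -F'[:>]' '{ n[$1 ":" $3]++ } END { for (s in n) print s, n[s] }' |
    LC_ALL=C sort
}

# First four lines of the run shown above, used as sample input.
count_jobs_per_slot <<'EOF'
node2:/oardocker/node2/oar/docker_1/2:2> job 1
node1:/oardocker/node1/oar/docker_1/3:3> job 4
node1:/oardocker/node1/oar/docker_1/1:1> job 2
node2:/oardocker/node2/oar/docker_1/3:3> job 3
EOF
```

A slot whose third field held more than one CPU id would indicate that the job was not confined to a single logical CPU.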

PoC with GPUs in Grid'5000

The same can be done with GPUs: run a batch of jobs, each of which executes on a single GPU only.

Here we have 2 nodes (chifflet-3 and chifflet-7) with 2 GeForce GPUs each.

Generate the parallel sshlogin file to execute on each GPU

From the head node of the OAR job, chifflet-3:

[pneyron@chifflet-3 ~](1733271-->56mn)$ oarprint gpu -P gpudevice,cpuset,host -C+ -F "1/OAR_USER_GPUDEVICE=% OAR_USER_CPUSET=% oarsh %" | tee ~/.parallel/gpus
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+14+12+16+0+6+26+24+22+10+4+2+8 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+12+14+16+0+26+24+22+6+10+4+8+2 oarsh chifflet-7.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=25+13+27+11+9+15+23+19+1+21+17+5+7+3 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=13+25+27+11+23+19+15+9+1+21+17+7+5+3 oarsh chifflet-7.lille.grid5000.fr

Here we use the -C+ option of oarprint, because GNU parallel does not accept , as the separator for the OAR_USER_CPUSET values in the sshlogin file. oarsh accepts + as well as , or . or : interchangeably.
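If a cpuset list is only available in comma-separated form (for instance from a file generated earlier), it can be converted afterwards. This is a minimal sketch using bash parameter expansion; the variable names and the shortened example value are assumptions for illustration.

```shell
#!/bin/bash
# Replace every "," with "+" in a cpuset list, so that the value can be
# used in a GNU parallel sshlogin line.
cpuset_csv="20,18,14,12"          # example value, shortened
cpuset_plus=${cpuset_csv//,/+}    # ${var//pattern/replacement}: replace all
echo "OAR_USER_CPUSET=$cpuset_plus"
```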

Create a new sample script

[pneyron@chifflet-3 ~](1733271-->56mn)$ cat <<'EOF' > ~/script.sh 
#!/bin/bash
echo ===============================================================================
echo "JOB: $@"
echo -n "BEGIN: "; date
echo -n "HOSTNAME: "; hostname
echo -n "CGROUPS CPUSET: "; grep -o -e "cpuset:.*" /proc/self/cgroup
echo -n "CPUs: "; cat /sys/fs/cgroup/cpuset/$(< /proc/self/cpuset)/cpuset.cpus
echo -n "CGROUPS DEVICES: "; grep -o -e "devices:.*" /proc/self/cgroup
echo -n "GPUs: "; nvidia-smi | grep -io -e " \(tesla\|geforce\)[^|]\+|[^|]\+" | sed 's/|/=/' | paste -sd+ -
sleep 3
echo -n "END: "; date
EOF
[pneyron@chifflet-3 ~](1733271-->56mn)$ chmod 755 ~/script.sh 

Run parallel

[pneyron@chifflet-3 ~](1733271-->-56mn)$ seq 5 | parallel --slf gpus ~/script.sh 
===============================================================================
JOB: 1
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
HOSTNAME: chifflet-3.lille.grid5000.fr
CGROUPS CPUSET: cpuset:/oar/pneyron_1733271/3,31,25,53,17,45,7,35,5,33,13,41,11,39,15,43,19,47,21,49,9,37,1,29,23,51,27,55
CPUs: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
CGROUPS DEVICES: devices:/oar/pneyron_1733271/1
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
JOB: 3
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
HOSTNAME: chifflet-3.lille.grid5000.fr
CGROUPS CPUSET: cpuset:/oar/pneyron_1733271/14,42,8,36,26,54,2,30,12,40,10,38,0,28,4,32,6,34,22,50,20,48,16,44,24,52,18,46
CPUs: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
CGROUPS DEVICES: devices:/oar/pneyron_1733271/0
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
JOB: 2
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
HOSTNAME: chifflet-7.lille.grid5000.fr
CGROUPS CPUSET: cpuset:/oar/pneyron_1733271/19,47,15,43,23,51,9,37,1,29,21,49,27,55,25,53,3,31,7,35,5,33,17,45,11,39,13,41
CPUs: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
CGROUPS DEVICES: devices:/oar/pneyron_1733271/1
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
JOB: 4
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
HOSTNAME: chifflet-7.lille.grid5000.fr
CGROUPS CPUSET: cpuset:/oar/pneyron_1733271/24,52,16,44,6,34,20,48,22,50,18,46,10,38,12,40,2,30,26,54,8,36,14,42,4,32,0,28
CPUs: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
CGROUPS DEVICES: devices:/oar/pneyron_1733271/0
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
JOB: 5
BEGIN: Thu 27 Feb 2020 10:01:56 PM CET
HOSTNAME: chifflet-3.lille.grid5000.fr
CGROUPS CPUSET: cpuset:/oar/pneyron_1733271/3,31,25,53,17,45,7,35,5,33,13,41,11,39,15,43,19,47,21,49,9,37,1,29,23,51,27,55
CPUs: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
CGROUPS DEVICES: devices:/oar/pneyron_1733271/1
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off 
END: Thu 27 Feb 2020 10:01:59 PM CET

As expected, every job has access to a single GPU only.
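This check can also be scripted. The sketch below assumes the output format of the sample script above, where the GPUs seen by a job are joined with "+" on the "GPUs:" line; count_gpus_per_job is a hypothetical helper name.

```shell
#!/bin/bash
# Report how many GPUs each job listed, from the output of script.sh.
# The GPUs are joined with "+", so the count is the number of
# "+"-separated entries (0 when the list is empty).
count_gpus_per_job() {
  awk '
    /^JOB: /  { job = $2 }
    /^GPUs: / {
      list = substr($0, 7)            # drop the "GPUs: " prefix
      gsub(/^ +| +$/, "", list)       # trim surrounding spaces
      n = (list == "") ? 0 : split(list, parts, "[+]")
      print "job " job ": " n
    }
  '
}

# Two blocks reduced to the relevant lines of the run shown above.
count_gpus_per_job <<'EOF'
JOB: 1
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off
JOB: 3
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off
EOF
```

With the real run, the full output of parallel can be piped into the function instead of the here-document.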

Regarding the logical CPUs, we see that we got those given by OAR along with their thread siblings. This is because OAR in Grid'5000 does not define the siblings in its resources (using a thread resource, or giving all siblings in the cpuset resource property), but uses the “COMPUTE_THREAD_SIBLINGS” option to compute them at execution time.
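To illustrate what computing the siblings means, the sketch below expands a list of logical CPUs with their thread siblings, as exposed by Linux in sysfs. It is not OAR's actual implementation: expand_siblings and the TOPO_ROOT override are hypothetical names introduced here.

```shell
#!/bin/bash
# Hypothetical helper: print the comma-separated union of the thread
# siblings of the given logical CPUs. TOPO_ROOT defaults to the Linux sysfs
# location and can be overridden, e.g. to point at a test tree.
# thread_siblings_list holds entries such as "0,28" or "0-1"; the awk step
# expands "a-b" ranges into individual ids and passes single ids through.
expand_siblings() {
  local root=${TOPO_ROOT:-/sys/devices/system/cpu} cpu
  for cpu in "$@"; do
    tr ',' '\n' < "$root/cpu$cpu/topology/thread_siblings_list"
  done |
    awk -F- '{ last = (NF > 1 ? $2 : $1) + 0
               for (i = $1 + 0; i <= last; i++) print i }' |
    sort -n -u | paste -sd, -
}

# Usage on a real node, for logical CPUs 0 and 2:
#   expand_siblings 0 2
```

On the nodes used above, the output suggests that logical CPU n and n+28 are siblings, so each core given by OAR brings two logical CPUs into the job's cpuset.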