Differences

This shows you the differences between two versions of the page.

wiki:coupling_oarsh_with_gnu_parallel_to_mimic_salloc_sbatch [2020/02/27 22:04] neyron
wiki:coupling_oarsh_with_gnu_parallel_to_mimic_salloc_srun [2020/03/03 13:28] – [PoC with cores] neyron
Line 3: Line 3:
 E.g. 1 batch per core (or GPU), in a job that has many nodes/cores/GPUs allocated.
  
-This requires the change made in this commit:
-https://github.com/oar-team/oar/commit/72a6cce952a3fa736872d69c6e72fb2d8ffac5de
-(not merged yet)
+This requires the changes made in this [[https://github.com/oar-team/oar/tree/2.5_oarsh_on_a_sub_set_of_resources_a_la_srun|branch]], especially this
+[[https://github.com/oar-team/oar/commit/72a6cce952a3fa736872d69c6e72fb2d8ffac5de|commit]]
+(not merged yet).
  
-==== PoC with cores ====
-== Create a job with 2 nodes (and all their cores, here 4) ==
+===== PoC with cores ====
+PoC in [[oar-docker]]
+==== Create a job with 2 nodes (and all their cores, here 4) ====
  
 <code bash>
Line 17: Line 18:
 </code>
  
-== Create the parallel sshloginfile that defines the connector to each core ==
+==== Create the parallel sshloginfile that defines the connector to each core ====
 <code bash>
 docker@frontend ~$ oarstat -j 1 -p | oarprint -f - core -P cpuset,host -F "1/OAR_JOB_ID=1 OAR_USER_CPUSET=% oarsh %" | tee .parallel/cores
Line 31: Line 32:
 We force the ''1/'' prefix at the beginning of each line, so that parallel runs only 1 job at a time on each remote. Without it, parallel complains that it cannot determine the number of CPUs (that is, the number of simultaneous runs per remote) by itself, and even if it could, the result might be completely wrong because parallel is not aware of the cgroups/cpusets.
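
As a rough illustration of the file format described above, the following sketch prints one sshlogin line per core from a list of cpuset/host pairs. The helper name ''make_sshlogin_line'' and the host names ''node1'' and ''node2'' are hypothetical; in the PoC the lines are produced by ''oarprint'' itself.

```shell
# Hypothetical helper: print one GNU parallel sshlogin line for a given
# OAR job id, cpuset value and host. The leading "1/" limits parallel to
# one job slot for that line.
make_sshlogin_line() {
    printf '1/OAR_JOB_ID=%s OAR_USER_CPUSET=%s oarsh %s\n' "$1" "$2" "$3"
}

# Example input: "cpuset host" pairs (made-up values, 2 cores on 2 hosts).
printf '%s\n' '0 node1' '1 node1' '0 node2' '1 node2' |
while read -r cpuset host; do
    make_sshlogin_line 1 "$cpuset" "$host"
done
```

Each output line has the same shape as the lines written to ''.parallel/cores'': a slot count, then the command parallel uses to reach that core.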
  
-== Create a sample script ==
+==== Create a sample script ====
 <code bash>
 docker@frontend ~$ cat <<'EOF' > script.sh
Line 43: Line 44:
 </code>
  
-== Test a run with a batch of 10 inputs ==
+==== Test a run with a batch of 10 inputs ====
 <code bash>
 docker@frontend ~$ seq 10 | parallel --slf cores ./script.sh
Line 59: Line 60:
 As we can see, every job indeed runs in a cpuset with only 1 logical CPU available for its execution.
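
One way to double-check this from inside a task, assuming a Linux system where the cpuset restriction is reflected in the ''Cpus_allowed_list'' field of ''/proc/self/status'', is to count the CPUs listed there. The helper ''count_cpus_in_list'' below is a hypothetical sketch and is not part of OAR or GNU parallel.

```shell
# Hypothetical helper: count the logical CPUs named in a Linux CPU list
# such as "0-3,8" (comma-separated single ids or first-last ranges).
# The function body runs in a subshell, so IFS and the variables it sets
# do not leak into the caller.
count_cpus_in_list() (
    IFS=','
    count=0
    for range in $1; do
        first=${range%-*}   # text before "-", or the whole id if no "-"
        last=${range#*-}    # text after "-", or the whole id if no "-"
        count=$((count + last - first + 1))
    done
    echo "$count"
)

# On Linux, report the CPUs the current process is allowed to run on.
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    echo "allowed: $allowed ($(count_cpus_in_list "$allowed") logical CPU(s))"
fi
```

Run through ''parallel --slf'' like the sample script, a check of this kind would be expected to report a single logical CPU per task if the cpuset is applied as intended.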
  
-==== PoC with GPUs in Grid'5000====
+===== PoC with GPUs in Grid'5000 ====
 The same can be done with GPUs: run a batch of jobs, each of which executes on a single GPU only.
  
 Here we have 2 nodes (chifflet-3 and chifflet-7) with 2 GeForce GPUs each.
  
-== Generate the parallel sshlogin file to execute on each GPU ==
+==== Generate the parallel sshlogin file to execute on each GPU ====
 From the head node of the OAR job, chifflet-3:
 <code bash>
Line 75: Line 78:
 Here we use the ''-C+'' option of ''oarprint'', because ''GNU parallel'' does not accept '','' as a separator for the ''OAR_USER_CPUSET'' values in the sshlogin file. ''oarsh'' accepts ''+'', '','', ''.'' and '':'' as separators interchangeably.
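
To illustrate the value format only: for a comma-separated cpuset obtained by some other means, the same substitution can be done with ''tr''. The helper name ''cpuset_for_parallel'' and the sample value are hypothetical; in the PoC, ''oarprint -C+'' already emits the ''+''-separated form.

```shell
# Hypothetical helper: replace "," with "+" in a cpuset value so that it
# can be placed in a GNU parallel sshlogin line.
cpuset_for_parallel() {
    printf '%s\n' "$1" | tr ',' '+'
}

cpuset_for_parallel "0,1,2,3"   # prints 0+1+2+3
```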
  
-== Create a new sample script ==
+==== Create a new sample script ====
 <code bash>
 [pneyron@chifflet-3 ~](1733271-->56mn)$ cat <<'EOF' > ~/script.sh
Line 93: Line 96:
 </code>
  
-== Run parallel ==
+==== Run parallel ====
 <code bash>
 [pneyron@chifflet-3 ~](1733271-->-56mn)$ seq 5 | parallel --slf gpus ~/script.sh
wiki/coupling_oarsh_with_gnu_parallel_to_mimic_salloc_srun.txt · Last modified: 2020/04/16 15:45 by neyron