This page presents some internals of OAR's management of resources, in order to give administrators some useful knowledge and help them create relevant resource definitions with regard to CPU/core topologies as well as CPU/GPU associations.

====== Managing processing unit topologies ======

The OAR database ''resources'' table provides several kinds of information:
  - the resources **states** (Alive/Absent/Dead/..., maintenance mode, availability, ...);
  - the resources **characteristics** (CPU speed, CPU model, memory size, ...);
  - the computing resources **hierarchy** (topology);
  - and finally some **hardware** related **identifiers** (e.g. cpusets).

They are called the OAR resource **properties**. In the database, these 4 kinds of resource properties are all stored as **columns** of the ''resources'' table. A row then gives the set of properties for one resource.

**Given a hierarchy** (chosen by the administrator for the cluster setup, for instance cluster/switch/host/cpu/core, but thread could be added, or other customizations made), one row in the table gives the information for the **lowest level of resources of the hierarchy** (e.g. core). Then **groups of rows** define the higher levels, like Russian dolls (e.g. ''#C'' rows per CPU if each CPU has ''#C'' cores, and so on for hosts, clusters, ...).

One rule must be kept in mind: **any unique object in the resources hierarchy must have a unique id among its set of objects**. For example:
  * any of the cores (of any of the CPUs, of any of the hosts, ...) must have a unique id among the whole set of cores;
  * any of the CPUs (of any of the nodes, of any of the clusters, ...) must have a unique id among the whole set of CPUs;
  * and so on for any other resource type.

Then, when it comes to the hardware identifiers (cpusets, or see below for the GPU device ids), the administrator must pay special attention so that a **correct mapping** is done between the **logical hierarchy** (e.g. ids of the hosts, CPUs, cores, hyperthreads) and the **hardware** processing unit **ids** (cpuset values). Using a tool such as ''hwloc'' may help at this point.

The basic commands to work with resources are:
  * the ''oarnodes'' command gives an extract of the resources table;
  * the ''oarnodesetting'' command can be used to modify or create rows in the resources table;
  * the ''oarproperty'' command can be used to modify resource properties (columns of the table).
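As an illustration of how these commands fit together, here is a minimal sketch that declares a small ''cpu''/''core'' hierarchy and registers one core of one host. The host name ''node1'' and all the values are made up for the example and must be adapted to your actual topology (e.g. as reported by ''hwloc''):

  # create two hierarchy properties (columns of the resources table)
  oarproperty -a cpu
  oarproperty -a core
  # register one resource (row): core 1 of cpu 1 on node1, mapped to hardware thread 0
  oarnodesetting -a -h node1 -p cpu=1 -p core=1 -p cpuset=0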
Two meta-commands are also provided to build the resources table (using the ''oarproperty'' and ''oarnodesetting'' commands underneath):
  * ''oar_resources_init'' first inspects the machines' hardware topology (connecting to them using SSH) and then generates the relevant ''oarnodesetting'' commands;
  * ''oar_resources_add'' builds ''oarnodesetting'' commands from a model described on the command line.

See the next section for an example of a resources table.

====== Managing GPUs ======

Support for Nvidia GPU devices was added to OAR and ships with OAR 2.5.8. OAR provides tools to help set up GPU resources (see the ''oar_resources_add'' command), but here are some explanations for the setup. In order to enable the mechanism, you have to:
  - use the latest version of the job resource manager (''job_resource_manager_cgroups.pl'' shipped with the latest version of OAR, at least version 2.5.8);
  - enable the devices cgroup mechanism in it (''$ENABLE_DEVICESCG = "YES";'');
  - add a resource property for the GPU devices (''oarproperty -a gpudevice'');
  - set the values of this new resource property for all resources.

===== First scenario, simple =====

For step 4, several scenarios are possible. Let's consider here a first scenario where you have 2 GPUs on your nodes: one attached to the first CPU's PCIe bus, and the second one to the second CPU's PCIe bus (see ''lstopo'''s output for instance to find that out). You can then use ''oarnodesetting'' to set the ''gpudevice'' property to 0 (matching the ''/dev/nvidia0'' Linux device) for the resources associated with the first CPU, and to 1 (matching the ''/dev/nvidia1'' Linux device) for those associated with the second CPU, on every host.

Users would then have access to 1 GPU on each of N hosts when requesting a single CPU per host, e.g.:

  $ oarsub -l /host=N/cpu=1 ...

or could even request GPUs directly, e.g.:

  $ oarsub -l /host=N/gpudevice=1

The result would be equivalent (1 GPU = 1 CPU). (Warning: giving ''host=N'' is mandatory in the second variant of the ''oarsub'' command, otherwise one would reserve all the hosts of the cluster having the same GPU id.) Running the ''nvidia-smi'' command in the job should show which GPU is available.

Also, if some nodes do not have any GPU, you could set the value of the property for the corresponding resources to ''gpudevice=-1'', and let the users add ''-p "gpudevice >= 0"'' to their ''oarsub'' command in order to get resources with GPUs.

But be **warned** that the following commands will mostly not provide what a user would expect:

  $ oarsub -l gpudevice=1

will give all the resources matching one single value of ''gpudevice'', which means either the first GPU of every node (''gpudevice=0''), or the second GPU of every node (''gpudevice=1''), or all the nodes which have no GPU (''gpudevice=-1'').

  $ oarsub -l gpudevice=N

with N > 1 makes even less sense. See the setup proposed in the section below if you want to let your users request N GPUs like that (using ''oarsub -l gpu=N'').

**We strongly suggest setting up the second scenario below, which defines the ''gpu'' hierarchy property along with the ''gpudevice'' hardware resource identifier.**

===== Second scenario, full featured =====

Let's assume now that you have a cluster of 3 nodes, each with 32 GB of RAM and:
  * 2 CPUs of 6 cores each;
  * 4 GPUs, 2 attached to the PCIe bus of the first CPU and 2 attached to the PCIe bus of the second CPU (e.g. as reported by ''lstopo'').

Obviously, cores and GPUs appear at the same level in the hierarchy of resources we can define, giving two possible hierarchy definitions:
  * /cluster/host/cpu/core
  * /cluster/host/cpu/gpu

In order to provide a single hierarchy, we have to make an arbitrary choice (not related to a real hardware hierarchy) and decide, for instance, to associate 3 cores with each GPU. The hierarchy then becomes:
  * /cluster/host/cpu/gpu/core

Let's translate that into technical words:
  * first we have to define the resource hierarchy levels: cluster, host, cpu, gpu, core;
  * then we have to define the GPU resource property for the system mapping: gpudevice (i.e. the GPU equivalent of cpuset, which identifies the hardware core ids);
  * finally any additional resource property can be added, like mem for the host memory, cpumodel for the CPU model, gpumodel, etc.

See the ''oarproperty'' command manual in order to create the cluster, host, cpu, core, gpu, gpudevice, cpumodel and gpumodel properties (a sketch is given below).

Finally, let's now assume that your OAR server also manages a second cluster with no GPUs, e.g. 2 nodes with 24 GB of RAM and 2 CPUs of 4 cores each.
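Before creating the rows shown in the table below, the corresponding property columns must exist. Here is a minimal sketch of their creation, assuming the ''-c'' option is used to create character-string (varchar) columns (check ''oarproperty --help'' on your installation):

  # hierarchy levels
  oarproperty -a cluster -c
  oarproperty -a host -c
  oarproperty -a cpu
  oarproperty -a core
  oarproperty -a gpu
  # hardware identifier for GPUs (cpuset already exists by default)
  oarproperty -a gpudevice
  # additional characteristics
  oarproperty -a mem
  oarproperty -a cpumodel -c
  oarproperty -a gpumodel -c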
We can now define the resources in the resources table of OAR as follows (1 line per core, 3 lines per GPU, 6 lines per CPU, 12 lines per host, ...):

^ cluster ^ host ^ cpu ^ core ^ cpuset ^ gpu ^ gpudevice ^ mem ^ cpumodel ^ gpumodel ^
| cluster1 | host1 | 1 | 1 | 0 | 1 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 1 | 2 | 1 | 1 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 1 | 3 | 2 | 1 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 1 | 4 | 3 | 2 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 1 | 5 | 4 | 2 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 1 | 6 | 5 | 2 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 2 | 7 | 6 | 3 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 2 | 8 | 7 | 3 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 2 | 9 | 8 | 3 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 2 | 10 | 9 | 4 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 2 | 11 | 10 | 4 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host1 | 2 | 12 | 11 | 4 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 3 | 13 | 0 | 5 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 3 | 14 | 1 | 5 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 3 | 15 | 2 | 5 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 3 | 16 | 3 | 6 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 3 | 17 | 4 | 6 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 3 | 18 | 5 | 6 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 4 | 19 | 6 | 7 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 4 | 20 | 7 | 7 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 4 | 21 | 8 | 7 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 4 | 22 | 9 | 8 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 4 | 23 | 10 | 8 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host2 | 4 | 24 | 11 | 8 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 5 | 25 | 0 | 9 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 5 | 26 | 1 | 9 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 5 | 27 | 2 | 9 | 0 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 5 | 28 | 3 | 10 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 5 | 29 | 4 | 10 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 5 | 30 | 5 | 10 | 1 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 6 | 31 | 6 | 11 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 6 | 32 | 7 | 11 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 6 | 33 | 8 | 11 | 2 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 6 | 34 | 9 | 12 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 6 | 35 | 10 | 12 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster1 | host3 | 6 | 36 | 11 | 12 | 3 | 32 | Xeon Gold 6128 | Geforce 1080ti |
| cluster2 | host4 | 7 | 37 | 0 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 7 | 38 | 1 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 7 | 39 | 2 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 7 | 40 | 3 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 8 | 41 | 4 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 8 | 42 | 5 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 8 | 43 | 6 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host4 | 8 | 44 | 7 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 9 | 45 | 0 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 9 | 46 | 1 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 9 | 47 | 2 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 9 | 48 | 3 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 10 | 49 | 4 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 10 | 50 | 5 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 10 | 51 | 6 | 0 | 0 | 24 | Xeon Gold 5122 | |
| cluster2 | host5 | 10 | 52 | 7 | 0 | 0 | 24 | Xeon Gold 5122 | |

(Remember that the OAR properties (columns) which define the resource hierarchy must have unique values for each unique entity, whereas the system properties (cpuset, gpudevice) restart from 0 on each host.)

See the ''oarnodesetting'' command manual in order to create the table rows and set the values. Help yourself by scripting the command calls (e.g. using shell ''for'' loops), as in the sketch below.
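Here is a minimal scripted sketch for cluster1 only (3 hosts, 2 CPUs of 6 cores and 4 GPUs each), to be run on the OAR server with the appropriate privileges. The counters mirror the table above, but double-check the generated values against ''lstopo'' before running it on a real cluster:

  # global counters keep hierarchy ids unique across the whole cluster
  core=0; cpu=0; gpu=0
  for host in host1 host2 host3; do
    for c in 1 2; do                          # 2 CPUs per host
      cpu=$((cpu+1))
      for g in 1 2; do                        # 2 GPUs per CPU
        gpu=$((gpu+1))
        gpudevice=$(( (c-1)*2 + g - 1 ))      # /dev/nvidiaN id, restarts on each host
        for k in 1 2 3; do                    # 3 cores per GPU (arbitrary association)
          core=$((core+1))
          cpuset=$(( (core-1) % 12 ))         # hardware core id, restarts on each host
          oarnodesetting -a -h $host \
            -p cluster=cluster1 -p cpu=$cpu -p core=$core -p cpuset=$cpuset \
            -p gpu=$gpu -p gpudevice=$gpudevice \
            -p mem=32 -p cpumodel="Xeon Gold 6128" -p gpumodel="Geforce 1080ti"
        done
      done
    done
  done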
Finally, you can look at the created rows using the ''oarnodes'' command, or even look at the resources table in the database.

Once this is done, users can request one of the 12 GPUs as follows:

  $ oarsub -p "gpu > 0" -l gpu=1

Or 2 GPUs (possibly not on the same host) as follows:

  $ oarsub -p "gpu > 0" -l gpu=2

Or 2 GPUs on the same host as follows:

  $ oarsub -p "gpu > 0" -l host=1/gpu=2

When reserving 1 GPU, the user obviously also gets the 3 cores associated with that GPU.

Finally, GPU jobs can be tied to GPU resources (where ''gpu > 0'') with the following admission rule (see ''oaradmissionrules''), so that users do not have to add ''-p "gpu > 0"'' to their command lines:

  # for each mold (alternative resource request) of the job...
  foreach my $mold (@{$ref_resource_list}){
      foreach my $r (@{$mold->[0]}){
          # check whether the request uses the gpu resource type
          my $gpu_request = 0;
          foreach my $resource (@{$r->{resources}}) {
              if ($resource->{resource} eq "gpu") {
                  $gpu_request = 1;
              }
          }
          # if so, restrict the request to resources that actually have a GPU
          if ($gpu_request) {
              if ($r->{property} ne ""){
                  $r->{property} = "($r->{property}) AND gpu > 0";
              }else{
                  $r->{property} = "gpu > 0";
              }
              print("[ADMISSION RULE] Tie job resource request for GPU to resources with GPU\n");
          }
      }
  }

Warning: make sure to look at ''lstopo'''s output in order to correctly associate cpuset and gpudevice values, i.e. do not associate cores with GPUs that are not attached to the same CPU.

Warning: mind the fact that with the defined hierarchy ''host/cpu/gpu/core'', it makes no sense to use an ''oarsub'' command such as:

  $ oarsub -l host=1/core=8/gpu=2

In that case, you select 1 host, 8 of its cores, and 2 GPUs of each core. But since each core is associated with at most 1 gpu value, that makes no sense. Also:

  $ oarsub -l host=1/core=8/gpu=1

is equivalent to:

  $ oarsub -l host=1/core=8

The user will get 1 host with 8 of its cores. Nothing is said about which or how many GPUs will be available in the job.

====== The oar_resources_add command ======

The ''oar_resources_add'' command provides some support to create GPU resources in OAR, as well as CPU/core resources with relevant topologies.