wiki:customization_tips (revisions 2020/03/25 15:19 and 15:24, by neyron)
</code>
====== Use cases ======
===== OpenMPI + affinity =====

We saw that the Linux kernel seems to be incapable of correctly using all the CPUs of a cpuset.

Indeed, when reserving 2 out of 8 cores on a node and running a code with 2 processes, these 2 processes were not properly pinned, one per reserved CPU. We had to pass the CPU map to OpenMPI so that it could set the CPU affinity itself:

<code bash>
i=0 ; oarprint core -P host,cpuset -F "% slot=%"
</code>

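The `oarprint` command above prints one "host slot=cpuset" line per reserved core, which can be turned into an OpenMPI rankfile. A minimal sketch of the idea, with the `oarprint` output simulated by a here-document (the node name and cpuset ids are hypothetical):

<code bash>
# Simulated output of: oarprint core -P host,cpuset -F "% slot=%"
# (one "host slot=cpuset" line per reserved core -- hypothetical values)
i=0
while read line; do
    # assign MPI rank $i to the i-th reserved core
    echo "rank $i=$line"
    i=$((i+1))
done <<'EOF' > rankfile
node006 slot=2
node006 slot=6
EOF
cat rankfile
# The rankfile can then be passed to OpenMPI, e.g.:
#   mpirun -np 2 -rf rankfile ./my_mpi_app
</code>

With a real job, the here-document is replaced by the actual `oarprint` pipeline, so the rankfile always matches the cores OAR granted.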
===== NUMA topology optimization =====
In this use case, we have a NUMA host (an Altix 450) whose interconnect topology is hierarchical:

{{:
In yellow, "

Routers interconnect IRUs (chassis) into which the nodes are plugged (4 or 5 nodes per IRU).

For jobs that fit into 2 IRUs or fewer, we want to minimize the distance between the allocated resources (i.e. use IRUs with only one router interconnection between them). The topology may be simplified as follows:

{{:

The idea is to use moldable jobs and an admission rule that automatically converts the user's request into a moldable job. This job uses 2 resource properties, **numa_x** and **numa_y**, which can be seen as the coordinates of a square grid. What we want, in fact, is whichever job ends soonest between a job placed along an X coordinate and one placed along a Y coordinate (we only want horizontally or vertically placed jobs).

The numa_x and numa_y properties are set up this way (pnode is a property corresponding to physical nodes):

^ pnode ^ iru ^ numa_x ^ numa_y ^
| itanium1 | 1 | 0 | 1 |
| itanium2 | 1 | 0 | 1 |
| itanium3 | 1 | 0 | 1 |
| itanium4 | 1 | 0 | 1 |
| itanium5 | 2 | 1 | 1 |
| itanium6 | 2 | 1 | 1 |
| itanium7 | 2 | 1 | 1 |
| itanium8 | 2 | 1 | 1 |
| itanium9 | 2 | 1 | 1 |
| itanium10 | 3 | 0 | 0 |
| itanium11 | 3 | 0 | 0 |
| itanium12 | 3 | 0 | 0 |
| itanium13 | 3 | 0 | 0 |
| itanium14 | 3 | 0 | 0 |
| itanium15 | 4 | 1 | 0 |
| itanium16 | 4 | 1 | 0 |
| itanium17 | 4 | 1 | 0 |
| itanium18 | 4 | 1 | 0 |
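The coordinate assignment follows a 2x2 grid of IRUs: two IRUs sharing a numa_x (or numa_y) value are only one router apart. A small sketch of that mapping as a shell helper (the function name and the 2x2 layout, deduced from the table, are our assumptions):

<code bash>
# Hypothetical helper: map an IRU number (1..4) to the
# "numa_x numa_y" coordinates used in the table above.
iru_coords() {
    case "$1" in
        1) echo "0 1" ;;
        2) echo "1 1" ;;
        3) echo "0 0" ;;
        4) echo "1 0" ;;
        *) return 1 ;;
    esac
}
</code>

A job constrained to a single numa_x value can thus only span IRUs 1 and 3, or IRUs 2 and 4; a single numa_y value gives IRUs 1 and 2, or 3 and 4 — exactly the "one router hop" pairs we want.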

For example, the following resource request:
<code>
-l /core=16
</code>
will result in:
<code>
-l /
</code>

Here is the admission rule making that optimization:

<code perl>
# Title : Numa optimization
# Description : Creates a moldable job to take into account the "
my $n_core_per_cpus=2;
my $n_cpu_per_pnode=2;
if (grep(/
    print "
    my $resources_def=$ref_resource_list->
    my $core=0;
    my $cpu=0;
    my $pnode=0;
        if ($resource->
        if ($resource->
        if ($resource->
    }
}
# Now, calculate the number of total cores
my $n_cores=0;
if ($pnode == 0 && $cpu != 0 && $core == 0) {
    $n_cores = $cpu * $n_core_per_cpus;
}
elsif ($pnode != 0 && $cpu == 0 && $core == 0) {
    $n_cores = $pnode * $n_cpu_per_pnode * $n_core_per_cpus;
}
elsif ($pnode != 0 && $cpu == 0 && $core != 0) {
    $n_cores = $pnode * $core;
}
elsif ($pnode == 0 && $cpu != 0 && $core != 0) {
    $n_cores = $cpu * $core;
}
elsif ($pnode == 0 && $cpu == 0 && $core != 0) {
    $n_cores = $core;
}
else { $n_cores = $pnode*$cpu*$core; }
print "
if ($n_cores > 32) {
    print "

    print "
/

    my @newarray=eval(Dumper(@{$ref_resource_list}->
    push (@{$ref_resource_list},
    }
}
</code>

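The core-counting part of the rule treats a count of 0 as "that level of the /pnode/cpu/core hierarchy was not specified". A standalone sketch of that logic, assuming 2 CPUs per pnode and 2 cores per CPU as in the rule (the function name is ours):

<code bash>
# Count the total cores a request asks for; 0 means "level not
# specified" in a /pnode/cpu/core hierarchy.
# Assumes 2 cpus per pnode and 2 cores per cpu.
n_cores() {
    local pnode=$1 cpu=$2 core=$3
    local per_pnode=2 per_cpu=2
    if   [ "$pnode" -eq 0 ] && [ "$cpu" -ne 0 ] && [ "$core" -eq 0 ]; then
        echo $(( cpu * per_cpu ))
    elif [ "$pnode" -ne 0 ] && [ "$cpu" -eq 0 ] && [ "$core" -eq 0 ]; then
        echo $(( pnode * per_pnode * per_cpu ))
    elif [ "$pnode" -ne 0 ] && [ "$cpu" -eq 0 ] && [ "$core" -ne 0 ]; then
        echo $(( pnode * core ))
    elif [ "$pnode" -eq 0 ] && [ "$cpu" -ne 0 ] && [ "$core" -ne 0 ]; then
        echo $(( cpu * core ))
    elif [ "$pnode" -eq 0 ] && [ "$cpu" -eq 0 ] && [ "$core" -ne 0 ]; then
        echo "$core"
    else
        echo $(( pnode * cpu * core ))
    fi
}
# e.g. "-l /core=16"  -> n_cores 0 0 16 -> 16
#      "-l /pnode=4"  -> n_cores 4 0 0  -> 16
</code>

So a request for 16 cores, however it is phrased, is recognized as fitting within 2 IRUs and becomes a candidate for the moldable numa_x/numa_y placement.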
====== Troubles and solutions ======
===== Can't do setegid! =====
Some distributions have perl_suid installed, but not set up correctly. The solution is something like this:
<code>
</code>

====== Users tips ======
===== oarsh completion =====
//Tip based on an idea from Jerome Reybert//

In order to complete node names in an oarsh command, add these lines to your .bashrc:

<code bash>
function _oarsh_complete_() {
  if [ -n "$OAR_NODEFILE" -a "$COMP_CWORD" -eq 1 ]; then
    local word=${COMP_WORDS[COMP_CWORD]}
    local list=$(cat $OAR_NODEFILE | uniq | tr '\n' ' ')
    COMPREPLY=($(compgen -W "$list" -- "$word"))
  fi
}
complete -o default -F _oarsh_complete_ oarsh
</code>

Then try oarsh <TAB>.
===== OAR aware shell prompt for Interactive jobs =====
If you want a bash prompt showing your job id and the remaining walltime, you can add this to your ~/.bashrc:

<code bash>
if [ "$PS1" ]; then
    __oar_ps1_remaining_time(){
        if [ -n "$OAR_JOB_WALLTIME_SECONDS" -a -n "$OAR_NODE_FILE" -a -r "$OAR_NODE_FILE" ]; then
            DATE_NOW=$(date +%s)
            DATE_JOB_START=$(stat -c %Y $OAR_NODE_FILE)
            DATE_TMP=$OAR_JOB_WALLTIME_SECONDS
            ((DATE_TMP = (DATE_TMP - DATE_NOW + DATE_JOB_START) / 60))
            echo -n "$DATE_TMP"
        fi
    }
    PS1='[\u@\h\W]($OAR_JOB_ID-->$(__oar_ps1_remaining_time)mn)\$ '
    if [ -n "$OAR_NODE_FILE" ]; then
        echo "[OAR] OAR_JOB_ID=$OAR_JOB_ID"
        echo "[OAR] Your nodes are:"
        sort $OAR_NODE_FILE | uniq -c | awk '{print "      " $2 "*" $1}'
    fi
fi
</code>

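The remaining-time computation is simple epoch arithmetic: walltime minus the time elapsed since the node file was created, converted to minutes. A worked example with hypothetical timestamps:

<code bash>
# Hypothetical values: job started at epoch 1000000, "now" is 600s
# later, walltime is 7200s -> (7200 - 600) / 60 = 110 minutes left.
DATE_NOW=1000600
DATE_JOB_START=1000000
DATE_TMP=7200
((DATE_TMP = (DATE_TMP - DATE_NOW + DATE_JOB_START) / 60))
echo $DATE_TMP   # prints 110
</code>

Using the node file's modification time as the job start avoids needing any extra OAR variable beyond what the job environment already provides.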
Then the prompt inside an Interactive job will look like:

<code bash>
[capitn@node006~](3101-->59mn)$
</code>

===== Many small jobs grouping =====
Many small jobs of a few seconds each may be painful for the OAR system: OAR may spend more time scheduling, allocating and launching than the actual computation time of each job.

Gabriel Moreau developed a script that may be useful when you have a large set of small jobs. It groups your jobs into a single bigger OAR job:
  * http://
You can download it from this page:
  * http://

For a more generic approach, you can use Cigri, a grid middleware running on top of OAR cluster(s) that is able to automatically group parametric jobs. Cigri is currently being rewritten and a new public release is planned for the end of 2012.

Please contact Bruno.Bzeznik@imag.fr for more information.

===== Environment variables through oarsh =====

  * http://
  * http://