====== Green computing ======
//In this section, you'll find tips for optimizing the energy consumption of your clusters//

===== Activating the dynamic on/off of nodes but keeping a few nodes always ready =====
**Warning:**

First of all, you have to set up the ecological feature as explained in the FAQ: [[http://

**Note:** if you have an ordinary cluster with nodes that are always available, you may set the cm_availability property to 2147483646 (infinite minus 1).

**Note:** once this feature has been activated, the **absent** status may not always really mean absent, but rather **standby**, as OAR may automatically power the node back on. To put a node into a truly absent state, set its cm_availability property to **0**.
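
Both settings can be applied with **oarnodesetting**, for example (the node name node1 is just an illustration):

<code bash>
# Mark node1 as always available (never automatically powered off)
oarnodesetting -h node1 -p cm_availability=2147483646

# Mark node1 as really absent (no automatic power-on)
oarnodesetting -h node1 -p cm_availability=0
</code>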

This tip assumes that your nodes are set up to switch to the Alive state automatically when they boot, and to the Absent state when they shut down. You may refer to the FAQ for this: [[http://

Here are 3 scripts that you may customize; they make your ecological configuration a bit smarter than the default, as it keeps a few nodes (4 in this example) powered on and ready for incoming jobs:

== wake_up_nodes.sh ==
<code bash>
#!/bin/bash

# Host and power-on command (site-specific: this example runs SGI's cpower
# from an "admin" host; adapt both to your hardware)
IPMI_HOST="admin"
POWER_ON_CMD="cpower --up"

# Node names to power on are given on stdin, one per line
NODES=`cat`

for NODE in $NODES
do
  ssh $IPMI_HOST $POWER_ON_CMD $NODE
done
</code>

This is a very simple script containing the command that powers on your nodes. In this example, suitable for an SGI Altix Ice, we run **cpower** from an **admin** host. You'll probably have to customize this. This script is to be put in front of the SCHEDULER_NODE_MANAGER_WAKE_UP_CMD option of the oar.conf file, like this (assuming the script is installed in /usr/local/sbin):
<code>
SCHEDULER_NODE_MANAGER_WAKE_UP_CMD="/usr/local/sbin/wake_up_nodes.sh"
</code>

== set_standby_nodes.sh ==
<code bash>
#!/bin/bash
set -e

# This script is intended to be used from the SCHEDULER_NODE_MANAGER_SLEEP_CMD
# variable of the oar.conf file.
# It halts the nodes given on stdin, but refuses to stop nodes if this
# results in less than $NODES_KEEP_ALIVE alive nodes, because you may
# want to have some nodes ready for treating some jobs immediately.

NODES_KEEP_ALIVE=4

NODES=`cat`

# Count the currently alive nodes (adapt the parsing to your oarnodes output)
ALIVE_NODES=`oarnodes | grep "state : Alive" | wc -l`

NODES_TO_SHUTDOWN=""

for NODE in $NODES
do
  if [ $ALIVE_NODES -gt $NODES_KEEP_ALIVE ]
  then
    NODES_TO_SHUTDOWN="$NODE $NODES_TO_SHUTDOWN"
    let ALIVE_NODES=ALIVE_NODES-1
  else
    echo "Not halting $NODE because I need to keep $NODES_KEEP_ALIVE alive nodes"
  fi
done

if [ "$NODES_TO_SHUTDOWN" != "" ]
then
  # Send the halt command to the selected nodes through sentinelle
  # (path and options may differ on your installation)
  echo -e "$NODES_TO_SHUTDOWN" | sed 's/ /\n/g' | /usr/lib/oar/sentinelle.pl -f - -p halt
fi
</code>

This is the script for shutting down nodes. It uses **sentinelle** to send the **halt** command to the nodes, as suggested by the default configuration. It is to be put in front of the SCHEDULER_NODE_MANAGER_SLEEP_CMD option of the oar.conf file, like this:

<code>
SCHEDULER_NODE_MANAGER_SLEEP_CMD="/usr/local/sbin/set_standby_nodes.sh"
</code>

== nodes_keepalive.sh ==
<code bash>
#!/bin/bash
set -e

# This script is intended to be run every 5 minutes from the crontab.
# It ensures that at least $NODES_KEEP_ALIVE nodes with at least 1 free
# resource are always alive and not shut down. It wakes up the nodes by
# submitting a dummy job. It does not submit jobs if all the resources are
# used or not available (cm_availability set to a low value)

NODES_KEEP_ALIVE=4
ADMIN_USER=bzeznik

# Locking
LOCK=/var/lock/nodes_keepalive.lock
### Locking for Debian (using lockfile-progs):
#lockfile-create $LOCK
#lockfile-touch $LOCK &
#BADGER="$!"
### Locking for others (using procmail's lockfile)
lockfile -r3 -l 43200 $LOCK

# Reconstructed guard: do nothing if a previous dummy job is still queued
if [ "`oarstat -u $ADMIN_USER 2>/dev/null | grep -c keepalive`" = "0" ]
then

  # Get the number of Alive nodes with at least 1 free resource
  # (adapt the parsing to your oarnodes output)
  ALIVE_NODES=`oarnodes | grep "state : Alive" | wc -l`

  # Get the number of nodes in standby: Absent, but with cm_availability
  # far enough in the future to be woken up (adapt the parsing as well)
  let AVAIL_DATE=`date +%s`+3600
  WAKEABLE_NODES=`oarnodes --sql "state='Absent' AND cm_availability > $AVAIL_DATE" | grep -c "state :"`

  if [ $ALIVE_NODES -lt $NODES_KEEP_ALIVE ]
  then
    if [ $WAKEABLE_NODES -gt 0 ]
    then
      if [ $NODES_KEEP_ALIVE -gt $WAKEABLE_NODES ]
      then
        NODES_KEEP_ALIVE=$WAKEABLE_NODES
      fi
      # Dummy job: its submission makes OAR power on standby nodes
      su - $ADMIN_USER -c "oarsub -n keepalive -l /nodes=$NODES_KEEP_ALIVE 'sleep 1'"
    fi
  fi
fi

### Unlocking for Debian:
#kill "${BADGER}"
#lockfile-remove $LOCK
### Unlocking for others:
rm -f $LOCK
</code>

This script is responsible for waking up (powering on) some nodes if there are not enough alive nodes. It can be run from the root crontab every 5 minutes (assuming it is installed in /usr/local/sbin):

<code>
*/5 * * * * /usr/local/sbin/nodes_keepalive.sh
</code>


====== Use cases ======
===== OpenMPI + affinity =====

We saw that the Linux kernel seems to be incapable of correctly using all the CPUs of the cpusets.

Indeed, when reserving 2 of the 8 cores of a node and running a code that uses 2 processes, these 2 processes were not well assigned, one to each CPU. We had to give the CPU map to OpenMPI to enforce CPU affinity:

<code bash>
# Build an OpenMPI rankfile from the job's cpuset
# (one "rank <n>=<host> slot=<cpuset_id>" line per allocated core)
i=0 ; oarprint core -P host,cpuset -F "% slot=%" | \
  while read line; do echo "rank $i=$line"; ((i++)); done > ~/rankfile

# Then pass the rankfile to mpirun (the program name is just an illustration)
mpirun -rf ~/rankfile ./my_mpi_program
</code>

===== NUMA topology optimization =====
In this use case, we've got a NUMA host (an Altix 450) having a "squared" topology, as shown on the figure below:

{{:

In yellow, the "IRUs" (chassis).

Routers interconnect IRUs (chassis) on which the nodes are plugged (4 or 5 nodes per IRU).

What we want, for jobs that can fit into 2 IRUs or less, is to minimize the distance between the resources (i.e. use IRUs that have only one router interconnection between them). The topology may be simplified as follows:

{{:

The idea is to use moldable jobs and an admission rule that automatically converts the user request into a moldable job. This job uses 2 resource properties, **numa_x** and **numa_y**, which may be seen as the coordinates of a square. What we want, in fact, is the job that ends the soonest between a job running on an X or on a Y coordinate (we only want vertically or horizontally placed jobs).

The numa_x and numa_y properties are set up this way (pnode is a property corresponding to physical nodes):
^ pnode ^ iru ^ numa_x ^ numa_y ^
| itanium1 | 1 | 0 | 1 |
| itanium2 | 1 | 0 | 1 |
| itanium3 | 1 | 0 | 1 |
| itanium4 | 1 | 0 | 1 |
| itanium5 | 2 | 1 | 1 |
| itanium6 | 2 | 1 | 1 |
| itanium7 | 2 | 1 | 1 |
| itanium8 | 2 | 1 | 1 |
| itanium9 | 2 | 1 | 1 |
| itanium10 | 3 | 0 | 0 |
| itanium11 | 3 | 0 | 0 |
| itanium12 | 3 | 0 | 0 |
| itanium13 | 3 | 0 | 0 |
| itanium14 | 3 | 0 | 0 |
| itanium15 | 4 | 1 | 0 |
| itanium16 | 4 | 1 | 0 |
| itanium17 | 4 | 1 | 0 |
| itanium18 | 4 | 1 | 0 |
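
The properties can be created and filled with the standard OAR tools, for example (values taken from the first row of the table; repeat for each node):

<code bash>
# Create the two custom properties once
oarproperty -a numa_x
oarproperty -a numa_y

# Set the coordinates on each physical node
oarnodesetting -h itanium1 -p numa_x=0 -p numa_y=1
</code>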

For example, the following requested resources:
<code>
-l /core=16
</code>
will result in the following moldable request:
<code>
-l /numa_x=1/core=16 -l /numa_y=1/core=16
</code>

Here is the admission rule making that optimization:

<code perl>
# Title : Numa optimization
# Description : Creates a moldable job to take into account the "squared" topology
my $n_core_per_cpus=2;
my $n_cpu_per_pnode=2;
# Reconstructed trigger: here the rule only applies to jobs of type "numa"
if (grep(/^numa$/, @{$type_list})) {
    print "[ADMISSION RULE] Optimizing the request for the NUMA topology\n";
    my $resources_def=$ref_resource_list->[0];
    my $core=0;
    my $cpu=0;
    my $pnode=0;
    # Walk through the requested resource hierarchy (reconstructed loop)
    foreach my $group (@{$resources_def->[0]}) {
        foreach my $resource (@{$group->{resources}}) {
            if ($resource->{resource} eq "core") { $core=$resource->{value}; }
            if ($resource->{resource} eq "cpu") { $cpu=$resource->{value}; }
            if ($resource->{resource} eq "pnode") { $pnode=$resource->{value}; }
        }
    }
    # Now, calculate the number of total cores
    my $n_cores=0;
    if ($pnode == 0 && $cpu != 0 && $core == 0) {
        $n_cores = $cpu*$n_core_per_cpus;
    }
    elsif ($pnode != 0 && $cpu == 0 && $core == 0) {
        $n_cores = $pnode*$n_cpu_per_pnode*$n_core_per_cpus;
    }
    elsif ($pnode != 0 && $cpu == 0 && $core != 0) {
        $n_cores = $pnode*$core;
    }
    elsif ($pnode == 0 && $cpu != 0 && $core != 0) {
        $n_cores = $cpu*$core;
    }
    elsif ($pnode == 0 && $cpu == 0 && $core != 0) {
        $n_cores = $core;
    }
    else { $n_cores = $pnode*$cpu*$core; }
    print "[ADMISSION RULE] $n_cores cores requested\n";
    if ($n_cores > 32) {
        print "[ADMISSION RULE] Too many cores for the NUMA optimization, leaving the request as is\n";
    }
    else {
        print "[ADMISSION RULE] Creating a moldable job, one alternative per coordinate\n";
        # Duplicate the resource description (deep copy through Data::Dumper)
        # and constrain each alternative to a single coordinate.
        # Reconstructed: the exact structure may differ with your OAR version.
        my @newarray=eval(Dumper(@{$ref_resource_list}->[0]));
        push (@{$ref_resource_list},\@newarray);
        unshift(@{$ref_resource_list->[0]->[0]->[0]->{resources}}, {resource => "numa_x", value => 1});
        unshift(@{$ref_resource_list->[1]->[0]->[0]->{resources}}, {resource => "numa_y", value => 1});
    }
}
</code>
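
With such a rule in place (and assuming the "numa" job type used as the trigger in the reconstruction above), a user would simply submit a flat core request, for example:

<code bash>
oarsub -t numa -l /core=16 ./my_program
</code>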

====== Troubles and solutions ======
===== Can't do setegid! =====
Some distributions have perl_suid installed, but not set up correctly. The solution is something like this (adapt the path to your Perl version):
<code>
chmod u+s /usr/bin/sperl5.8.8
</code>

====== Users tips ======
===== oarsh completion =====
//Tip based on an idea from Jerome Reybert//

In order to complete node names in an oarsh command, add these lines to your .bashrc:

<code bash>
function _oarsh_complete_() {
  if [ -n "$OAR_NODEFILE" -a "$COMP_CWORD" -eq 1 ]; then
    local word=${COMP_WORDS[COMP_CWORD]}
    local list=$(cat $OAR_NODEFILE | uniq | tr '\n' ' ')
    COMPREPLY=($(compgen -W "$list" -- "$word"))
  fi
}
complete -o default -F _oarsh_complete_ oarsh
</code>

Then try oarsh <TAB>.
===== OAR aware shell prompt for Interactive jobs =====
If you want a bash prompt with your job id and the remaining walltime, you can add this to your ~/.bashrc:

<code bash>
if [ "$PS1" ]; then
  __oar_ps1_remaining_time(){
    if [ -n "$OAR_JOB_WALLTIME_SECONDS" -a -n "$OAR_NODE_FILE" -a -r "$OAR_NODE_FILE" ]; then
      DATE_NOW=$(date +%s)
      DATE_JOB_START=$(stat -c %Y $OAR_NODE_FILE)
      DATE_TMP=$OAR_JOB_WALLTIME_SECONDS
      ((DATE_TMP = (DATE_TMP - DATE_NOW + DATE_JOB_START) / 60))
      echo -n "$DATE_TMP"
    fi
  }
  PS1='[\u@\h\W]$([ -n "$OAR_JOB_ID" ] && echo -n "($OAR_JOB_ID-->$(__oar_ps1_remaining_time)mn)")\$ '
  if [ -n "$OAR_NODE_FILE" ]; then
    echo "[OAR] OAR_JOB_ID=$OAR_JOB_ID"
    echo "[OAR] Your nodes are:"
    sort $OAR_NODE_FILE | uniq -c | awk '{printf("      %s*%d\n",$2,$1)}'
  fi
fi
</code>

Then the prompt inside an interactive job will look like:

<code bash>
[capitn@node006~](3101-->58mn)$
</code>

===== Many small jobs grouping =====
Many small jobs of a few seconds may be painful for the OAR system: OAR may spend more time scheduling, allocating and launching than the actual computation time of each job.

Gabriel Moreau developed a script that may be useful when you have a large set of small jobs. It groups your jobs into a unique bigger OAR job (see also the minimal sketch below):
  * http://
You can download it from this page:
  * http://
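
As a minimal hand-rolled sketch of the same idea (script and file names are just an illustration), you can make one OAR job run all the small tasks itself:

<code bash>
#!/bin/bash
# run_all.sh: run each small task one after the other inside a single job
while read params; do
  ./small_task.sh $params
done < params.txt
</code>

and submit it as a single job:

<code bash>
oarsub -l /nodes=1,walltime=2:00:00 ./run_all.sh
</code>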

For a more generic approach, you can use Cigri, a grid middleware running on top of OAR cluster(s) that is able to automatically group parametric jobs. Cigri is currently being re-written and a new public release is planned for the end of 2012.

Please contact Bruno.Bzeznik@imag.fr for more information.

===== Environment variables through oarsh =====

  * http://
  * http://