====== Useful commands and administration tasks ======
//Here you'll find useful commands, sometimes a bit tricky, to use in your scripts or administration tasks.//

===== List suspected nodes without running jobs =====
You may need this list of nodes if you want to reboot them automatically: when you don't know why nodes became suspected, a reboot is often the simplest way to clean things up:
<code>
 oarnodes --sql "state = 'Suspected' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN \
 (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u
</code>
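
For example, to simply turn those nodes back to the Alive state rather than rebooting them (the ''sed'' expression assumes the ''network_address : <host>'' output format of **oarnodes**, so check it on your installation first):

<code bash>
 oarnodes --sql "state = 'Suspected' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN \
 (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u \
  | sed 's/^network_address *: *//' \
  | while read node; do
      # turn the suspected node back to Alive; replace with your reboot command if needed
      sudo oarnodesetting -s Alive -h "$node"
    done
</code>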

===== List alive nodes without running jobs =====
<code>
 oarnodes --sql "state = 'Alive' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN \
 (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u
</code>

===== Oarstat display without best-effort jobs =====

<code>
 oarstat --sql "job_id NOT IN (SELECT job_id FROM job_types where types_index = 'CURRENT' AND type = 'besteffort') AND state != 'Error' AND state != 'Terminated'"
</code>

===== Setting some nodes in maintenance mode only when they are free =====

You may need to plan maintenance operations on some particular nodes (for example adding some memory or upgrading the BIOS) without interrupting currently running or scheduled user jobs. To do so, you can simply run a "sleep" job in the admin queue, wait for it to start, and then set the node to maintenance mode. But you can also use this trick to set the nodes to maintenance mode automatically when the admin job starts:
<code bash>
 oarsub -q admin -t cosystem -l /nodes=2 'uniq $OAR_NODE_FILE|awk "{print \"sudo oarnodesetting -m on -h \" \$1}"|bash'
</code>
This uses the "cosystem" job type, which does nothing but run your command on a given host. This host has to be configured in the //COSYSTEM_HOSTNAME// variable of the //oar.conf// file; for the current purpose, you can simply put //127.0.0.1//. You also need to install the oar-node package on this host.

The example above will disable 2 free nodes, but you may want to add a //-p// option to specify the nodes you want to disable, for example: ''-p "network_address in ('node-1','node-2')"''
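
Putting both together (''node-1'' and ''node-2'' are of course placeholders):

<code bash>
 oarsub -q admin -t cosystem -p "network_address in ('node-1','node-2')" -l /nodes=2 \
   'uniq $OAR_NODE_FILE|awk "{print \"sudo oarnodesetting -m on -h \" \$1}"|bash'
</code>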

**Note:** you can't simply do this within a "normal" job, as OAR would kill your job before all of the node's resources are set to maintenance mode.

===== Optimizing and re-initializing the database with Postgres =====
Sometimes the database contains so many jobs that you need to optimize it. Normally, a **vacuumdb** should be running daily from cron. You can manually run **vacuumdb -a -f -z ; reindexdb oar**, but don't forget to stop OAR before, and be aware that it may take some time. Still, the database may remain very big, which can be a problem for backups, or the nightly vacuum may take too long. A more radical solution is to start over with a new database, but keep the old one so that you can still connect to it for the jobs history. You can do this once a year, for example, and you only have to back up the current database. Here is a way to do this (a command sketch follows the list):

  * First of all, make a backup of your database! With Postgres, it is as easy as:
<code>
 create database oar_backup_2012 with template oar
</code>
It will create an exact copy of the "oar" database named "oar_backup_2012". Be sure that you have enough space left on the device hosting your Postgres data directory. Doing so allows you to query the backup database later if you need the history of old jobs.
  * Plan a maintenance window and make sure there are no more jobs in the system.
  * Make a dump of your "queues", "resources" and "admission_rules" tables.
  * Stop the oar server, drop the oar database and re-create it.
  * Finally, restore the "queues", "resources" and "admission_rules" tables into the new database.
  * And restart the server.
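
Here is a minimal sketch of these steps for PostgreSQL; user names, paths and especially the schema file location are assumptions to adapt to your installation:

<code bash>
 # 1. backup: an exact copy of the live database (needs free disk space!)
 sudo -u postgres psql -c "CREATE DATABASE oar_backup_2012 WITH TEMPLATE oar"
 # 2. dump the tables that must be preserved
 sudo -u postgres pg_dump -t queues -t resources -t admission_rules oar > oar_tables.sql
 # 3. stop OAR, then drop and re-create the database
 /etc/init.d/oar-server stop
 sudo -u postgres dropdb oar
 sudo -u postgres createdb -O oar oar
 # re-create the empty schema (the path depends on your OAR version/packaging)
 sudo -u postgres psql oar < /usr/lib/oar/database/pg_structure.sql
 # 4. restore the preserved tables
 sudo -u postgres psql oar < oar_tables.sql
 # 5. restart the server
 /etc/init.d/oar-server start
</code>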

====== Green computing ======
//In this section, you'll find tips for optimizing the power and cooling consumption of your clusters.//
===== Activating the dynamic on/off of nodes but keeping a few nodes always ready =====
**Warning:** this tip is now partly obsoleted by the new **hulot** module that comes with the latest OAR release. This energy saving module has a keepalive feature: take a look at the comments above the energy* variables in the oar.conf file.

First of all, you have to set up the ecological feature as explained in the FAQ: [[http://oar.imag.fr/admins/faq_admin.html#how-to-configure-a-more-ecological-cluster-or-how-to-make-some-power-consumption-economies|How to configure a more ecological cluster]].

**Note:** if you have an ordinary cluster with nodes that are always available, you may set the cm_availability property to 2147483646 (infinite minus 1).

**Note:** once this feature has been activated, the **absent** status may not always really mean absent, but rather **standby**, as OAR may automatically power the node back on. To put a node into a truly absent status, you have to set the cm_availability property to **0**.

This tip assumes that your nodes are set up to switch automatically to the Alive state when they boot and to the Absent state when they shut down. You may refer to the FAQ for this: [[http://oar.imag.fr/admins/faq_admin.html#how-to-manage-start-stop-of-the-nodes|How to manage start/stop of the nodes?]] or to this section of the [[#Start.2Fstop_of_nodes_using_ssh_keys|Customization tips]].

Here we provide 3 scripts that you may customize; they make your ecological configuration a bit smarter than the default, as they keep a few nodes (4 in this example) powered on and ready for incoming jobs:

== wake_up_nodes.sh ==
<code bash>
#!/bin/bash

IPMI_HOST="admin"
POWER_ON_CMD="cpower --up --quiet"

# OAR gives the list of nodes to wake up on stdin
NODES=`cat`

for NODE in $NODES
do
  ssh $IPMI_HOST $POWER_ON_CMD $NODE
done
</code>

A very simple script containing the command that powers on your nodes. This example, suitable for an SGI Altix Ice, does a **cpower** from an **admin** host; you'll probably have to customize it. The script is referenced by the SCHEDULER_NODE_MANAGER_WAKE_UP_CMD option of the oar.conf file, like this:
<code>
 SCHEDULER_NODE_MANAGER_WAKE_UP_CMD="/usr/lib/oar/oardodo/oardodo /usr/local/sbin/wake_up_nodes.sh"
</code>

== set_standby_nodes.sh ==
<code bash>
#!/bin/bash
set -e

# This script is intended to be used from the SCHEDULER_NODE_MANAGER_SLEEP_CMD
# variable of the oar.conf file.
# It halts the nodes given on stdin, but refuses to stop nodes if this
# would result in less than $NODES_KEEP_ALIVE alive nodes, because we generally
# want to keep some nodes ready to treat incoming jobs immediately.

NODES_KEEP_ALIVE=4

NODES=`cat`

# Number of Alive nodes with no running jobs (note the trailing "wc -l",
# needed to get a count instead of a list)
ALIVE_NODES=`oarnodes --sql "state = 'Alive' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u | wc -l`

NODES_TO_SHUTDOWN=""

for NODE in $NODES
do
  if [ $ALIVE_NODES -gt $NODES_KEEP_ALIVE ]
  then
    NODES_TO_SHUTDOWN="$NODE\n$NODES_TO_SHUTDOWN"
    let ALIVE_NODES=ALIVE_NODES-1
  else
    echo "Not halting $NODE because I need to keep $NODES_KEEP_ALIVE alive nodes"
  fi
done

if [ "$NODES_TO_SHUTDOWN" != "" ]
then
  echo -e "$NODES_TO_SHUTDOWN" |/usr/lib/oar/sentinelle.pl -f - -t 3 -p '/sbin/halt -p'
fi
</code>

This is the script for shutting down nodes. It uses **sentinelle.pl** to send the **halt** command to the nodes, as suggested by the default configuration, but it refuses to shut nodes down if that would leave fewer than 4 ready nodes. The script is referenced by the SCHEDULER_NODE_MANAGER_SLEEP_CMD option like this:

<code>
 SCHEDULER_NODE_MANAGER_SLEEP_CMD="/usr/lib/oar/oardodo/oardodo /usr/local/sbin/set_standby_nodes.sh"
</code>
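
Note that OAR feeds this command the list of nodes to shut down on its standard input, one node per line. A by-hand test is therefore roughly equivalent to the following (hypothetical node names; careful, this really halts the nodes when more than 4 free ones are alive):

<code bash>
 printf "node-3\nnode-4\n" | /usr/local/sbin/set_standby_nodes.sh
</code>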

== nodes_keepalive.sh ==
<code bash>
#!/bin/bash
set -e

# This script is intended to be run every 5 minutes from the crontab.
# It ensures that $NODES_KEEP_ALIVE nodes with at least 1 free resource
# are always alive and not shut down. It wakes up the nodes by submitting
# a dummy job. It does not submit jobs if all the resources are used or
# not available (cm_availability set to a low value).

NODES_KEEP_ALIVE=4
ADMIN_USER=bzeznik

# Locking
LOCK=/var/lock/`basename $0`
### Locking for Debian (using lockfile-progs):
#lockfile-create $LOCK || exit 1
#lockfile-touch $LOCK &
#BADGER="$!"
### Locking for others (using procmail's lockfile):
lockfile -r3 -l 43200 $LOCK

if [ "`oarstat |grep \"wake_up_.*node\"`" = "" ]
then

 # Get the number of Alive nodes with at least 1 free resource (note the trailing "wc -l")
 ALIVE_NODES=`oarnodes --sql "state = 'Alive' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u | wc -l`

 # Get the number of nodes in standby
 let AVAIL_DATE=`date +%s`+3600
 WAKEABLE_NODES=`oarnodes --sql "state = 'Absent' and cm_availability > $AVAIL_DATE" |grep "^network_address" |sort -u|wc -l`

 if [ $ALIVE_NODES -lt $NODES_KEEP_ALIVE ]
 then
   if [ $WAKEABLE_NODES -gt 0 ]
   then
     if [ $NODES_KEEP_ALIVE -gt $WAKEABLE_NODES ]
     then
       NODES_KEEP_ALIVE=$WAKEABLE_NODES
     fi
     su - $ADMIN_USER -c "oarsub -n wake_up_${NODES_KEEP_ALIVE}nodes -l /nodes=${NODES_KEEP_ALIVE}/core=1,walltime=00:00:10 'sleep 1'"
   fi
 fi
fi

### Unlocking for Debian:
#kill "${BADGER}"
#lockfile-remove $LOCK
### Unlocking for others:
rm -f $LOCK
</code>

This script is responsible for waking up (powering on) some nodes when there are not enough free alive nodes. The trick it uses is to submit a dummy job to force OAR to wake some nodes up. It is intended to be run periodically from the crontab, for example with an /etc/cron.d/nodes_keepalive file like this:

<code>
 */5 * * * *     root    /usr/local/sbin/nodes_keepalive.sh
</code>

====== Admission rules ======
//OAR offers a powerful mechanism called "admission rules", letting you customize the way jobs enter queues (or get rejected). An admission rule is a little Perl script that you insert into the admission_rules SQL table of the OAR database. Here you'll find some advanced and useful examples.//
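
For instance, on a typical PostgreSQL setup, a new rule can be appended from the shell as follows (the single ''rule'' column used here is an assumption: check the actual columns of your ''admission_rules'' table, as they vary between OAR versions):

<code bash>
sudo -u postgres psql oar <<'EOF'
INSERT INTO admission_rules (rule)
VALUES ('print "[ADMISSION RULE] hello from a custom rule";');
EOF
</code>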

===== Cluster routing depending on the name of the queue =====
<code perl>
 # Title : Cluster routing
 # Description : Send to the corresponding cluster
 my $cluster=$queue_name;
 if ($jobproperties ne ""){
   $jobproperties = "($jobproperties) AND cluster = '".$cluster."'";
 }
 else{
   $jobproperties = "cluster = '".$cluster."'";
 }
</code>
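
Assuming one queue per cluster exists (created with **oarnotify --add_queue**, as in the multi-queue example further below), a submission then selects its cluster through the queue name:

<code bash>
 # this job can only run on resources having the property cluster = 'nanostar'
 oarsub -q nanostar -l /nodes=1 "sleep 60"
</code>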

===== Cluster routing depending on the name of the submission host =====
<code perl>
 # Title : Cluster routing
 # Description : Send to the corresponding cluster and queue depending on the submission host
 use Sys::Hostname;
 my @h = split('\\.',hostname());
 my $cluster;
 if ($h[0] eq "service0") {
   $cluster="nanostar";
   print "[ADMISSION RULE] Routing to NANOSTAR cluster";
 }else {
   $cluster="foehn";
   print "[ADMISSION RULE] Routing to FOEHN cluster";
 }
 if ($queue_name eq "default") {
   $queue_name=$cluster;
 }
 if ($jobproperties ne ""){
   $jobproperties = "($jobproperties) AND cluster = '".$cluster."'";
 }
 else{
   $jobproperties = "cluster = '".$cluster."'";
 }
</code>

===== Best-effort automatic routing for some unprivileged users =====

Description: users who are not members of a given group are automatically directed to the besteffort queue.

<code perl>
 my $GROUP="nanostar";
 # grep -w matches the group name as a whole word in the space-separated group list
 system("id -Gn $user | grep -w $GROUP >/dev/null");
 if ($? != 0){
   print("[ADMISSION RULE] !!!! WARNING                                          !!!");
   print("[ADMISSION RULE] !!!! AS AN EXTERNAL USER, YOU HAVE BEEN AUTOMATICALLY !!!");
   print("[ADMISSION RULE] !!!! REDIRECTED TO THE BEST-EFFORT QUEUE              !!!");
   print("[ADMISSION RULE] !!!! YOUR JOB MAY BE KILLED WITHOUT NOTICE            !!!");
   $queue_name = "besteffort";
   push (@{$type_list},"besteffort");
   if ($jobproperties ne ""){
     $jobproperties = "($jobproperties) AND besteffort = \\'YES\\'";
   }else{
     $jobproperties = "besteffort = \\'YES\\'";
   }
   $reservationField="None";
 }
</code>

===== Automatic licence assignment by job type =====
Description: creates a **mathlab** job type that automatically adds a mathlab licence to the resource request.

<code perl>
 if (grep(/^mathlab$/, @{$type_list})){
   print "[LICENCE ADMISSION RULE] Adding a mathlab licence to the query";
   foreach my $mold (@{$ref_resource_list}){
     push(@{$mold->[0]},
                        {'resources' =>
                             [{'resource' => 'licence','value' => '1'}],
                         'property' => 'type = \\'mathlab\\''}
     );
   }
 }
</code>
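
A user then simply adds the job type, and the rule transparently appends the licence request. This assumes a resource of type ''mathlab'' has been created beforehand, and the script name is just a placeholder:

<code bash>
 oarsub -t mathlab -l /nodes=1/core=4,walltime=2:00:00 ./run_simulation.sh
</code>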

===== Walltime limit =====
Description: by default, an admission rule limits the walltime of interactive jobs to 2 hours. This modified rule also sets a maximum walltime for passive (batch) jobs.

<code perl>
 my $max_interactive_walltime = OAR::IO::sql_to_duration("12:00:00");
 # 7 days = 168 hours
 my $max_batch_walltime = OAR::IO::sql_to_duration("168:00:00");
 foreach my $mold (@{$ref_resource_list}){
     if (defined($mold->[1])){
         if (($jobType eq "INTERACTIVE") and ($reservationField eq "None") and ($max_interactive_walltime < $mold->[1])){
             print("[ADMISSION RULE] Walltime too big for an INTERACTIVE job, so it is set to $max_interactive_walltime.");
             $mold->[1] = $max_interactive_walltime;
         }elsif ($max_batch_walltime < $mold->[1]){
             print("[ADMISSION RULE] Walltime too big for a BATCH job, so it is set to $max_batch_walltime.");
             $mold->[1] = $max_batch_walltime;
         }
     }
 }
</code>

Thanks to Nicolas Capit.

===== Cpu time limit =====
Description: rejects jobs asking for more than a given cpu*walltime product. The current limit is set to 384 hours (16 days of CPU time).

Note: this rule is for an SMP host on which we only have a "pnode" property (physical nodes) and a "cpu" property (only one core per cpu). It should be adapted for a more conventional distributed-memory cluster having nodes with several cores per cpu.

<code perl>
 my $cpu_walltime=iolib::sql_to_duration("384:00:00");
 my $msg="";
 foreach my $mold (@{$ref_resource_list}){
  foreach my $r (@{$mold->[0]}){
    my $cpus=0;
    my $pnodes=0;
    # Catch the cpu and pnode resources
    foreach my $resource (@{$r->{resources}}) {
        if ($resource->{resource} eq "cpu") {
          $cpus=$resource->{value};
        }
        if ($resource->{resource} eq "pnode") {
          $pnodes=$resource->{value};
        }
    }
    # Calculate the number of cpus
    if ($pnodes == 0 && $cpus == 0) { $cpus=1; }
    if ($pnodes != 0) {
      if ($cpus == 0) { $cpus=$pnodes*2;}
      else {$cpus=$pnodes*$cpus;}
    }
    # Reject if walltime*cpus is too big
    if ($cpus * $mold->[1] > $cpu_walltime) {
      $msg="\n   [WALLTIME TOO BIG] The maximum allowed walltime for $cpus cpus is ";
      $msg.= $cpu_walltime / $cpus / 3600;
      $msg.= " hours.";
      die($msg);
    }
  }
 }
</code>

===== Jobs number limit =====
Description: limits the maximum number of simultaneous jobs allowed for each user on the cluster. The default is 50 jobs maximum per user.
  * Users allowed an unlimited number of jobs can be listed in the //~oar/unlimited_reservation.users// file (on the oar-server host).
  * You can also change the limit by writing your own value into //~oar/max_jobs// (on the oar-server host).
  * Note: array jobs are also counted by this rule.

<code perl>
 # Title : Limit the number of jobs per user to max_nb_jobs
 # Description : If the user is not listed in the unlimited users file, check that his current number of jobs stays under $max_nb_jobs, which is read from ~oar/max_jobs or defaults to 50
 my $unlimited=0;
 if (open(FILE, "< $ENV{HOME}/unlimited_reservation.users")) {
     while (<FILE>) {
         if (m/^\\s*$user\\s*$/m) {
             $unlimited=1;
         }
     }
     close(FILE);
 }
 if ($unlimited == 0) {
     my $max_nb_jobs = 50;
     if (open(FILE, "< $ENV{HOME}/max_jobs")) {
         while (<FILE>) {
             chomp;
             $max_nb_jobs=$_;
         }
         close(FILE);
     }
     my $nb_jobs = $dbh->selectrow_array(
         qq{ select count(job_id)
             FROM jobs
             WHERE job_user = ?
             AND (state = \\'Waiting\\'
             OR state = \\'Hold\\'
             OR state = \\'toLaunch\\'
             OR state = \\'toAckReservation\\'
             OR state = \\'Launching\\'
             OR state = \\'Running\\'
             OR state = \\'Suspended\\'
             OR state = \\'Resuming\\'
             OR state = \\'Finishing\\') },
         undef,
         $user);
     if (($nb_jobs + $array_job_nb) > $max_nb_jobs) {
         die("[ADMISSION RULE] Error: you cannot have more than $max_nb_jobs submitted jobs at the same time.");
     }
 }
</code>
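
The two companion files read by this rule live in the oar user's home directory on the server and can be managed like this (the user name is a placeholder):

<code bash>
 # allow "alice" to bypass the limit entirely
 echo "alice" >> ~oar/unlimited_reservation.users
 # raise the default limit from 50 to 100 jobs for everybody else
 echo 100 > ~oar/max_jobs
</code>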

===== Project assignment =====
If you want to automatically assign a project to users' submissions (replacing the --project oarsub option), simply set the **$project** variable to the desired value inside an admission rule.

===== Restricting the use of some resources to a list of users =====
For example, if you have defined a property "model" with the **oarproperty** command, you can enforce property constraints for some users:

<code perl>
  # Title : Restricts the use of resources for some users
  # Description : remember to adapt the user list in this admission rule
  my %allowed_users = (
      "toto" => 1,
      "titi" => 1,
      "tutu" => 0
  );
  if (!defined($allowed_users{$user}) or ($allowed_users{$user} == 0)){
      if ($jobproperties ne ""){
          $jobproperties = "($jobproperties) AND model != 'bullx'";
      }else{
          $jobproperties = "model != 'bullx'";
      }
      print("[ADMISSION RULE] Automatically adding the constraint to stay away from the bullx nodes");
  }
</code>

===== Limit the number of interactive jobs per user =====
<code perl>
  # Title : Limit the number of interactive jobs per user
  # Description : Limit the number of interactive jobs per user
  my $max_interactive_jobs = 2;
  if (($jobType eq "INTERACTIVE") and ($reservationField eq "None")){
      # count this user's current interactive jobs (selectrow_array returns the
      # count; $dbh->do() is not reliable for SELECT statements)
      my $nb_jobs = $dbh->selectrow_array("SELECT count(job_id)
                                  FROM jobs
                                  WHERE
                                        job_user = '$user' AND
                                        reservation = 'None' AND
                                        job_type = 'INTERACTIVE' AND
                                        (state = 'Waiting'
                                            OR state = 'Hold'
                                            OR state = 'toLaunch'
                                            OR state = 'toAckReservation'
                                            OR state = 'Launching'
                                            OR state = 'Running'
                                            OR state = 'Suspended'
                                            OR state = 'Resuming'
                                            OR state = 'Finishing')
                             ");
      if ($nb_jobs >= $max_interactive_jobs){
          die("You cannot have more than $max_interactive_jobs interactive jobs at a time.");
      }
  }
</code>

===== Auto property restriction for specific user groups =====

<code perl>
  # Title : Infiniband user restrictions
  # Description : set the ib property restriction depending on the user's groups
  if ((! grep(/^besteffort$/, @{$type_list})) and ($user ne "serviware")){
      print("[ADMISSION RULE] Checking which Infiniband network you can go on...");
      my ($user_name,$user_passwd,$user_uid,$user_gid,$user_quota,$user_comment,$user_gcos,$user_dir,$user_shell,$user_expire) = getpwnam($user);
      my ($primary_group,$primary_passwd,$primary_gid,$primary_members) = getgrgid($user_gid);
      my ($seiscope_name,$seiscope_passwd,$seiscope_gid,$seiscope_members) = getgrnam("seiscope");
      my %seiscope_hash = map { $_ => 1 } split(/\\s+/,$seiscope_members);
      my ($globalseis_name,$globalseis_passwd,$globalseis_gid,$globalseis_members) = getgrnam("globalseis");
      my %globalseis_hash = map { $_ => 1 } split(/\\s+/,$globalseis_members);
      my ($tohoku_name,$tohoku_passwd,$tohoku_gid,$tohoku_members) = getgrnam("tohoku");
      my %tohoku_hash = map { $_ => 1 } split(/\\s+/,$tohoku_members);
      my $sql_str = "ib = \\'none\\'";
      if (($primary_group eq "seiscope") or (defined($seiscope_hash{$user}))){
          print("[ADMISSION RULE] You are in the group seiscope so you can go on the QDR Infiniband nodes");
          $sql_str .= " OR ib = \\'QDR\\'";
      }
      if (($primary_group eq "globalseis") or (defined($globalseis_hash{$user})) or ($primary_group eq "tohoku") or (defined($tohoku_hash{$user}))){
          print("[ADMISSION RULE] You are in the group globalseis or tohoku so you can go on the DDR Infiniband nodes");
          $sql_str .= " OR ib = \\'DDR\\'";
      }
      if ($jobproperties ne ""){
          $jobproperties = "($jobproperties) AND ($sql_str)";
      }else{
          $jobproperties = "$sql_str";
      }
  }
</code>

===== Debug admission rule =====
When you play with admission rules, you can dump some data structures with YAML to get a readable view of, for example, the submission requests:

<code perl>
 # this assumes the YAML module is loaded (use YAML;)
 print "[DEBUG] Output of the resources query data structure:";
 print YAML::Dump(@{$ref_resource_list});
</code>
===== NUMA topology optimization =====
See the [[#NUMA_topology_optimization_2|NUMA topology optimization]] use case.
===== Short, medium and long queues =====

Description: the following is a set of admission rules that route jobs to 3 different queues having different priorities. Some core-number restrictions per queue are also set up.

Queues creation:
<code bash>
 oarnotify --add_queue short,9,oar_sched_gantt_with_timesharing_and_fairsharing
 oarnotify --add_queue medium,5,oar_sched_gantt_with_timesharing_and_fairsharing
 oarnotify --add_queue long,3,oar_sched_gantt_with_timesharing_and_fairsharing
</code>

Rules:
<code perl>
 ------
 Rule : 20
 # Title: Automatic routing into the short queue
 # Description: Short jobs are automatically routed into the short queue
 my $max_walltime="6:00:00";
 my $walltime=0;
 # Search for the max walltime of the moldable job alternatives
 foreach my $mold (@{$ref_resource_list}){
   foreach my $r (@{$mold->[0]}){
     if ($mold->[1] > $walltime) {
       $walltime = $mold->[1];
     }
   }
 }
 # Put into the short queue if the job is short
 if ($walltime <= OAR::IO::sql_to_duration($max_walltime)
                 && !(grep(/^besteffort$/, @{$type_list}))) {
   print "   [SHORT QUEUE] This job is routed into the short queue";
   $queue_name="short";
 }
 ------
 Rule : 21
 # Title: Automatic routing into the medium queue
 # Description: Medium jobs are automatically routed into the medium queue
 my $max_walltime="120:00:00";
 my $min_walltime="6:00:00";
 my $walltime=0;
 # Search for the max walltime of the moldable job alternatives
 foreach my $mold (@{$ref_resource_list}){
   foreach my $r (@{$mold->[0]}){
     if ($mold->[1] > $walltime) {
       $walltime = $mold->[1];
     }
   }
 }
 # Put into the medium queue if the job is medium
 if ($walltime <= OAR::IO::sql_to_duration($max_walltime)
     && $walltime > OAR::IO::sql_to_duration($min_walltime)
     && !(grep(/^besteffort$/, @{$type_list}))) {
   print "  [MEDIUM QUEUE] This job is routed into the medium queue";
   $queue_name="medium";
 }
 ------
 Rule : 22
 # Title: Automatic routing into the long queue
 # Description: Long jobs are automatically routed into the long queue
 my $max_walltime="360:00:00";
 my $min_walltime="120:00:00";
 my $walltime=0;
 # Search for the max walltime of the moldable job alternatives
 foreach my $mold (@{$ref_resource_list}){
   foreach my $r (@{$mold->[0]}){
     if ($mold->[1] > $walltime) {
       $walltime = $mold->[1];
     }
   }
 }
 # Put into the long queue if the job is long
 if ($walltime > OAR::IO::sql_to_duration($min_walltime)
     && !(grep(/^besteffort$/, @{$type_list}))) {
   print "    [LONG QUEUE] This job is routed into the long queue";
   $queue_name="long";
 }
 # Limit the walltime of the "long" queue
 if ($queue_name eq "long"){
   my $min_walltime="120:00:00";
   my $max_walltime="360:00:00";
   foreach my $mold (@{$ref_resource_list}){
     foreach my $r (@{$mold->[0]}){
       if ($mold->[1] > OAR::IO::sql_to_duration($max_walltime)) {
         print "\n   [WALLTIME TOO BIG] The maximum allowed walltime for the long queue is $max_walltime";
         exit(1);
       }
       if ($mold->[1] <= OAR::IO::sql_to_duration($min_walltime)) {
         print "\n   [WALLTIME TOO SHORT] The minimum allowed walltime for the long queue is $min_walltime";
         exit(1);
       }
     }
   }
 }
 ------
 Rule : 23
 # Title : Core number restrictions
 # Description : Count the number of requested cores and reject if the queue does not allow this
 # Check the resources
 my $resources_def=$ref_resource_list->[0];
 my $n_core_per_cpus=6;
 my $n_cpu_per_node=2;
 my $core=0;
 my $cpu=0;
 my $node=0;
 foreach my $r (@{$resources_def->[0]}) {
   foreach my $resource (@{$r->{resources}}) {
     if ($resource->{resource} eq "core") {$core=$resource->{value};}
     if ($resource->{resource} eq "cpu") {$cpu=$resource->{value};}
     if ($resource->{resource} eq "nodes") {$node=$resource->{value};}
     if ($resource->{resource} eq "network_address") {$node=$resource->{value};}
   }
 }
 # Now, calculate the total number of cores
 my $n_cores=0;
 if ($node == 0 && $cpu != 0 && $core == 0) {
     $n_cores = $cpu*$n_core_per_cpus;
 }elsif ($node != 0 && $cpu == 0 && $core == 0) {
     $n_cores = $node*$n_cpu_per_node*$n_core_per_cpus;
 }elsif ($node != 0 && $cpu == 0 && $core != 0) {
     $n_cores = $node*$core;
 }elsif ($node == 0 && $cpu != 0 && $core != 0) {
     $n_cores = $cpu*$core;
 }elsif ($node == 0 && $cpu == 0 && $core != 0) {
     $n_cores = $core;
 }
 else { $n_cores = $node*$cpu*$core; }
 print "   [CORES COUNT] You requested $n_cores cores";

 # Now the restrictions:
 my $short=132; # 132 cores = 11 nodes
 my $medium=132; # 132 cores = 11 nodes
 my $long=132; # 132 cores = 11 nodes
 if ("$queue_name" eq "long" && $n_cores > $long) {
   print "\n   [CORES COUNT] Too many cores for this queue (max is $long)!";
   exit(1);
 }
 if ("$queue_name" eq "medium" && $n_cores > $medium) {
   print "\n   [CORES COUNT] Too many cores for this queue (max is $medium)!";
   exit(1);
 }
 if ("$queue_name" eq "short" && $n_cores > $short) {
   print "\n   [CORES COUNT] Too many cores for this queue (max is $short)!";
   exit(1);
 }
 ------
 Rule : 24
 # Title : Long and medium jobs restriction
 # Description : Long and medium jobs cannot run on resources having the property long=NO
 if ("$queue_name" eq "long" || "$queue_name" eq "medium"){
     if ($jobproperties ne ""){
         $jobproperties = "($jobproperties) AND long = \\'YES\\'";
     }else{
         $jobproperties = "long = \\'YES\\'";
     }
     print "[ADMISSION RULE] Adding long/medium jobs resources restrictions";
 }
</code>

===== Naming interactive jobs by default =====
Description: interactive jobs with no name are automatically named "interactive unnamed job".

<code perl>
 if (($jobType eq "INTERACTIVE") and ($job_name eq "")){
    $job_name = 'interactive unnamed job';
 }
</code>

====== Use cases ======
===== OpenMPI + affinity =====

We saw that the Linux kernel does not always place processes correctly on the CPUs of a cpuset.

Indeed, when reserving 2 of the 8 cores of a node and running a code with 2 processes, these 2 processes were not properly bound each to its own cpu. We had to give the CPU map to OpenMPI to enforce CPU affinity:

<code bash>
 # build an OpenMPI rankfile from the reserved cores
 i=0 ; oarprint core -P host,cpuset -F "% slot=%" | while read line ; do echo "rank $i=$line"; ((i++)); done > affinity.txt

 [user@node12 tmp]$ mpirun -np 8 --mca btl openib,self -v -display-allocation -display-map  --machinefile $OAR_NODEFILE -rf affinity.txt /home/user/espresso-4.0.4/PW/pw.x < BeO_100.inp
</code>
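
With, for example, two reserved cores whose cpuset ids on node12 are 2 and 3, the generated ''affinity.txt'' rankfile looks like:

<code>
 rank 0=node12 slot=2
 rank 1=node12 slot=3
</code>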

===== NUMA topology optimization =====
In this use case, we've got a NUMA host (an Altix 450) with a "squared" topology: nodes are interconnected by routers as in this view:

{{:wiki:lfi_topology.png?nolink&300|}}
In yellow: routers; in magenta: nodes (2 dual-core processors per node).

Routers interconnect IRUs (chassis) into which the nodes are plugged (4 or 5 nodes per IRU).

What we want is that, for jobs fitting into 2 IRUs or less, the distance between the resources is minimized (i.e. use IRUs with only one router interconnection between them). The topology may be simplified as follows:

{{:wiki:lfi_topology_square.png?nolink&300|}}

The idea is to use moldable jobs and an admission rule that automatically converts the user's request into a moldable job. This job uses 2 resource properties, **numa_x** and **numa_y**, which can be seen as the coordinates on the square. What we want, in fact, is whichever job ends soonest between one placed on an X coordinate and one placed on a Y coordinate (we only want vertically or horizontally placed jobs).

The numa_x and numa_y properties are set up this way (pnode is a property corresponding to physical nodes):
^ pnode     ^ iru ^ numa_x ^ numa_y ^
| itanium1  | 1   | 0      | 1      |
| itanium2  | 1   | 0      | 1      |
| itanium3  | 1   | 0      | 1      |
| itanium4  | 1   | 0      | 1      |
| itanium5  | 2   | 1      | 1      |
| itanium6  | 2   | 1      | 1      |
| itanium7  | 2   | 1      | 1      |
| itanium8  | 2   | 1      | 1      |
| itanium9  | 2   | 1      | 1      |
| itanium10 | 3   | 0      | 0      |
| itanium11 | 3   | 0      | 0      |
| itanium12 | 3   | 0      | 0      |
| itanium13 | 3   | 0      | 0      |
| itanium14 | 3   | 0      | 0      |
| itanium15 | 4   | 1      | 0      |
| itanium16 | 4   | 1      | 0      |
| itanium17 | 4   | 1      | 0      |
| itanium18 | 4   | 1      | 0      |

For example, the following requested resources:
<code>
 -l /core=16
</code>
will result in:
<code>
 -l /numa_x=1/pnode=4/cpu=2/core=2 -l /numa_y=1/pnode=4/cpu=2/core=2
</code>

Here is the admission rule making that optimization:

<code perl>
 # Title : Numa optimization
 # Description : Creates a moldable job to take into account the "squared" topology of an Altix 450
 my $n_core_per_cpus=2;
 my $n_cpu_per_pnode=2;
 if (grep(/^itanium$/, @{$type_list}) && !grep(/^manual$/, @{$type_list}) && $#$ref_resource_list == 0){
   print "[ADMISSION RULE] Optimizing for numa architecture (use \"-t manual\" to disable)";
   my $resources_def=$ref_resource_list->[0];
   my $core=0;
   my $cpu=0;
   my $pnode=0;
   foreach my $r (@{$resources_def->[0]}) {
     foreach my $resource (@{$r->{resources}}) {
       if ($resource->{resource} eq "core") {$core=$resource->{value};}
       if ($resource->{resource} eq "cpu") {$cpu=$resource->{value};}
       if ($resource->{resource} eq "pnode") {$pnode=$resource->{value};}
     }
   }
   # Now, calculate the total number of cores
   my $n_cores=0;
   if ($pnode == 0 && $cpu != 0 && $core == 0) {
     $n_cores = $cpu*$n_core_per_cpus;
   }
   elsif ($pnode != 0 && $cpu == 0 && $core == 0) {
     $n_cores = $pnode*$n_cpu_per_pnode*$n_core_per_cpus;
   }
   elsif ($pnode != 0 && $cpu == 0 && $core != 0) {
     $n_cores = $pnode*$core;
   }
   elsif ($pnode == 0 && $cpu != 0 && $core != 0) {
     $n_cores = $cpu*$core;
   }
   elsif ($pnode == 0 && $cpu == 0 && $core != 0) {
     $n_cores = $core;
   }
   else { $n_cores = $pnode*$cpu*$core; }
   print "[ADMISSION RULE] You requested $n_cores cores";
   if ($n_cores > 32) {
     print "[ADMISSION RULE] Big job (>32 cores), no optimization is possible";
   }else{
     print "[ADMISSION RULE] Optimization produces: /numa_x=1/$pnode/$cpu/$core
                                         /numa_y=1/$pnode/$cpu/$core";

     # deep copy of the first moldable description via Data::Dumper, then
     # add it as a second moldable alternative
     my @newarray=eval(Dumper($ref_resource_list->[0]));
     push (@{$ref_resource_list},@newarray);
     # prepend the numa_x constraint to the first alternative and numa_y to the second
     unshift(@{$ref_resource_list->[0]->[0]->[0]->{resources}},{'resource' => 'numa_x','value' => '1'});
     unshift(@{$ref_resource_list->[1]->[0]->[0]->{resources}},{'resource' => 'numa_y','value' => '1'});
   }
 }
</code>

====== Troubles and solutions ======
===== Can't do setegid! =====
Some distributions have perl_suid installed, but not set up correctly. The solution is something like this:
<code>
 bzeznik@healthphy:~> sudo chmod u+s /usr/bin/sperl5.8.8
</code>

====== Users tips ======
===== oarsh completion =====
//Tip based on an idea from Jerome Reybert//

In order to complete node names in an oarsh command, add these lines to your .bashrc:

<code bash>
function _oarsh_complete_() {
  if [ -n "$OAR_NODEFILE" -a "$COMP_CWORD" -eq 1 ]; then
    local word=${COMP_WORDS[COMP_CWORD]}
    local list=$(cat $OAR_NODEFILE | uniq | tr '\n' ' ')
    COMPREPLY=($(compgen -W "$list" -- "${word}"))
  fi
}
complete -o default -F _oarsh_complete_ oarsh
</code>

Then try ''oarsh <TAB>''.
===== OAR aware shell prompt for interactive jobs =====
If you want a bash prompt showing your job id and the remaining walltime, you can add this to your ~/.bashrc:

<code bash>
if [ "$PS1" ]; then
    __oar_ps1_remaining_time(){
        if [ -n "$OAR_JOB_WALLTIME_SECONDS" -a -n "$OAR_NODE_FILE" -a -r "$OAR_NODE_FILE" ]; then
            DATE_NOW=$(date +%s)
            DATE_JOB_START=$(stat -c %Y $OAR_NODE_FILE)
            DATE_TMP=$OAR_JOB_WALLTIME_SECONDS
            # remaining time in minutes: walltime - (now - job start)
            ((DATE_TMP = (DATE_TMP - DATE_NOW + DATE_JOB_START) / 60))
            echo -n "$DATE_TMP"
        fi
    }
    PS1='[\u@\h|\W]$([ -n "$OAR_NODE_FILE" ] && echo -n "(\[\e[1;32m\]$OAR_JOB_ID\[\e[0m\]-->\[\e[1;34m\]$(__oar_ps1_remaining_time)mn\[\e[0m\])")\$ '
    if [ -n "$OAR_NODE_FILE" ]; then
        echo "[OAR] OAR_JOB_ID=$OAR_JOB_ID"
        echo "[OAR] Your nodes are:"
        sort $OAR_NODE_FILE | uniq -c | awk '{printf("      %s*%d", $2, $1)}END{printf("\n")}' | sed -e 's/,$//'
    fi
fi
</code>

Then the prompt inside an interactive job will look like:

<code bash>
  [capitn@node006~](3101-->29mn)$
</code>

===== Many small jobs grouping =====
Many small jobs of a few seconds each may be painful for the OAR system: OAR may spend more time scheduling, allocating and launching than the actual computation time of each job.

Gabriel Moreau developed a script that may be useful when you have a large set of small jobs. It groups your jobs into a single bigger OAR job:
  * http://servforge.legi.grenoble-inp.fr/pub/soft-trokata/oarutils/oar-parexec.html
You can download it from this page:
  * http://servforge.legi.grenoble-inp.fr/projects/soft-trokata/wiki/SoftWare/OarUtils

For a more generic approach, you can use Cigri, a grid middleware running on top of OAR cluster(s) that is able to automatically group parametric jobs. Cigri is currently being rewritten and a new public release is planned for the end of 2012.

Please contact Bruno.Bzeznik@imag.fr for more information.

===== Environment variables through oarsh =====

  * http://servforge.legi.grenoble-inp.fr/pub/soft-trokata/oarutils/oar-envsh.html
  * http://servforge.legi.grenoble-inp.fr/projects/soft-trokata/wiki/SoftWare/OarUtils