In this section, you'll find advanced configuration tips.
This tip is useful for clusters made of a few big nodes, such as NUMA hosts with many CPUs. When the cluster has a lot of free resources, users often wonder why their jobs are always sent to the first node while the others are completely free. With this simple trick, new jobs are preferably sent to the nodes that have the lowest 15-minute load average.
Caution: doing this significantly reduces the chances of jobs that want to use entire nodes or large parts of them (they may wait longer). Do so only if this is what you want!
oarproperty -a wload
#!/bin/bash
set -e
HOSTS="zephir alize"
for host in $HOSTS
do
  load=`ssh $host head -1 /proc/loadavg|awk '{print $3*100}'`
  /usr/local/sbin/oarnodesetting -h $host -p wload=$load
done
*/5 * * * * root /usr/local/sbin/update_workload.sh > /dev/null
SCHEDULER_RESOURCE_ORDER="wload ASC,scheduler_priority ASC, suspended_jobs ASC, switch ASC, network_address DESC, resource_id ASC"
That's it!
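The value stored in the wload property is the third field of /proc/loadavg (the 15-minute load average) multiplied by 100, so that it fits in an integer property. Below is a minimal sketch of that conversion, isolated in a helper so that it can be checked without ssh. The function name loadavg_to_wload and the sample line are illustrative assumptions, not part of OAR.

```shell
#!/bin/bash
# Convert one /proc/loadavg line into the integer stored in the wload property.
# Field 3 is the 15-minute load average; multiplying by 100 keeps two decimals.
loadavg_to_wload() {
    echo "$1" | awk '{print $3*100}'
}

# Illustrative sample line (not taken from a real node).
loadavg_to_wload "0.52 0.48 0.37 1/234 5678"
```

On a node, the same helper could be fed with the output of "head -1 /proc/loadavg", which is what the update script does through ssh.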
Supposing you have a NUMA (Non-Uniform Memory Access) system, you may want to associate memory banks with sockets (CPUs). This has 2 advantages:
If you have a UMA system, you may still want to confine small jobs (i.e. jobs not using the entire node) to a subset of the memory; the trick is to use fake NUMA so that this tip works for you.
All you have to do is customize the job_resource_manager. It is a Perl script, generally found in /etc/oar, that you specify in the JOB_RESOURCE_MANAGER_FILE variable of the oar.conf file.
Examples (differences from the original script are set in bold):
Depending on its version, the Linux kernel can split the memory into a predefined number of chunks, exactly as if each chunk were a memory bank. It is then possible to associate these “virtual” memory banks with a cpuset. As OAR creates cpusets to isolate the CPU workload of a job from other jobs, it can isolate memory usage in the same way. A job that tries to use more memory than the total amount of the virtual memory banks associated with its cpuset should swap or fail with an out-of-memory error.
Fake NUMA is activated at boot time by a kernel option, for example:
numa=fake=12
will create 12 slots of memory accessible from the cpusets filesystem:
bzeznik@gofree-8:~$ cat /dev/cpuset/mems
0-11
Each slot size is the total size of the node divided by 12.
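As a quick check of that arithmetic, here is a small sketch that computes the size of one slot. The function name fake_numa_slot_size and the 48 GiB figure are assumptions chosen for illustration; the kernel may round slot sizes differently.

```shell
#!/bin/bash
# Size of one fake-NUMA slot: total memory divided by the number of slots.
# Both arguments are integers; the result is in the same unit as the total
# (integer division, so any remainder is dropped).
fake_numa_slot_size() {
    local total=$1 slots=$2
    echo $(( total / slots ))
}

# Hypothetical node with 48 GiB of RAM (50331648 kB) booted with numa=fake=12.
fake_numa_slot_size 50331648 12
```

On a real node, the total in kB can be read from the MemTotal line of /proc/meminfo, for example with awk '/^MemTotal:/ {print $2}' /proc/meminfo.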
Once fake NUMA is activated in the kernel of your cluster's nodes, edit OAR's job manager script to take it into account. This is a Perl script, located in /etc/oar/job_resource_manager.pl on the OAR server. The easiest configuration is to create as many virtual memory banks as there are cores in your nodes. This way, you have one virtual memory bank per core and you can tell OAR to associate the corresponding memory bank with each core:
# Copy the original job manager:
cp /etc/oar/job_resource_manager.pl /etc/oar/job_resource_manager_with_mem.pl
# Edit job_resource_manager_with_mem.pl, around line 122, replace this line:
#   'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset_path_job.'/mems &&'.
# by this line:
#   '/bin/echo '.join(",",@Cpuset_cpus).' | cat > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
# (actually, it is the same line as for the "cpus", but into the "mems" file)
Once the new job manager is created, you can activate it by changing the JOB_RESOURCE_MANAGER_FILE variable of your oar.conf file:
JOB_RESOURCE_MANAGER_FILE="/etc/oar/job_resource_manager_with_mem.pl"
Now you can check that it works by creating a new job and looking at the mems file of its cpuset. For example:
bzeznik@gofree:~$ oarsub -l /nodes=1/core=2 -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=307855
Interactive mode : waiting...
Starting...
Connect to OAR job 307855 via the node gofree-8
bzeznik@gofree-8:~$ cat /proc/self/cpuset
/oar/bzeznik_307855
bzeznik@gofree-8:~$ cat /dev/cpuset/oar/bzeznik_307855/cpus
8-9
bzeznik@gofree-8:~$ cat /dev/cpuset/oar/bzeznik_307855/mems
8-9
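The manual check above can be wrapped in a small helper. This is only a sketch, assuming one virtual memory bank per core so that the two files are expected to hold the same list; the function name cpuset_mems_match_cpus is hypothetical.

```shell
#!/bin/bash
# Return 0 when the "cpus" and "mems" files of a cpuset directory hold the
# same list, 1 when they differ, and 2 when one of the files is not readable.
cpuset_mems_match_cpus() {
    local dir=$1
    [ -r "$dir/cpus" ] && [ -r "$dir/mems" ] || return 2
    [ "$(cat "$dir/cpus")" = "$(cat "$dir/mems")" ]
}
```

Inside a job, it could be called as cpuset_mems_match_cpus "/dev/cpuset$(cat /proc/self/cpuset)", which for the session above points at /dev/cpuset/oar/bzeznik_307855.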
Then you have to teach your users that each core comes with a fixed amount of memory. In this example, it is 4GB per core. So if a user has a memory-bound job that needs 17GB of memory, they should ask for 5 cores on the same node, even for a sequential job. This is generally not considered a waste in the HPC context, because CPU cores operate correctly only if memory I/O can operate correctly. It is also possible to create an admission rule that converts a query like “-l /memory=17” into “-l /nodes=1/core=5”. Finally, it should also be possible to create more virtual memory banks (2 or 4… per core), but you would then have to create your resources as memory slots and manage, for example, a memory_slot property in the job manager.
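The conversion from a memory need to a number of cores is a rounded-up division. Here is a minimal sketch of it, which users could apply by hand before calling oarsub; the function name cores_for_memory is hypothetical, and this is not the admission rule itself.

```shell
#!/bin/bash
# Number of cores to request so that the job gets at least mem_gb of memory,
# given gb_per_core of memory attached to each core (rounded up).
cores_for_memory() {
    local mem_gb=$1 gb_per_core=$2
    echo $(( (mem_gb + gb_per_core - 1) / gb_per_core ))
}

# The example from the text: 17GB with 4GB per core.
echo "-l /nodes=1/core=$(cores_for_memory 17 4)"
```

With the values from the text, this prints the resource request "-l /nodes=1/core=5".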
If you want to use the cpusets feature, the JOB_RESOURCE_MANAGER_PROPERTY_DB_FIELD variable from your oar.conf file must be uncommented and set to the property that gives the cpuset ids of the resources (generally cpuset). This property must be configured properly for each resource. You can use the oar_resources_init command.
Nodes can automatically set themselves to the Alive status at boot time and to the Absent status at shutdown. One efficient way to do this is to use dedicated ssh keys. The advantages are:
First of all, you need to add an ip property to the resources table and put the IP addresses of your nodes inside:
oarproperty -a ip -c
oarnodesetting -p ip=192.168.0.1 --sql "network_address='node1'"
oarnodesetting -p ip=192.168.0.2 --sql "network_address='node2'"
...
Then, you have to put 2 scripts into the /etc/oar directory (oarnodesetting_ssh_alive.sh and oarnodesetting_ssh_absent.sh, the names used in the authorized_keys entries further down):
#!/bin/sh
# oarnodesetting_ssh: oarnodesetting SSH wrapper
# $Id: oarnodesetting_ssh 949 2007-10-22 15:44:26Z capitn $
# This script is to be called from the node via SSH so that the server performs
# the oarnodesetting command and changes the state of the calling node.
#
# NB:
# 1- To get this script working, the oar resource database table must have an
#    `ip' field containing the IP address for all the nodes
# 2- A dedicated SSH key may be configured to restrict the ssh call capability
#    from the nodes to the server, by modifying the authorized_keys of oar on the
#    server as follows:
#    command="/usr/lib/oar/oarnodesetting_ssh" [dedicated pub key info]...
#
# Warning: if $IP does not exist in the database or every corresponding
#          resource states are 'Dead' then this script will return an exit code
#          of 12 not 0 (this is the default behaviour of "oarnodesetting").
IP=$(echo $SSH_CONNECTION | cut -d " " -f 1 )
OARNODESETTINGCMD=/usr/sbin/oarnodesetting
[ -n "$IP" ] || exit 1
# This updates matching core/cpuset based on /proc/cpuinfo
/etc/oar/update_cpuset_id.sh $IP
# Set the node Alive
exec $OARNODESETTINGCMD -s Alive --sql "ip = '$IP' AND state != 'Dead'"
exit 1
#!/bin/sh
# oarnodesetting_ssh: oarnodesetting SSH wrapper
# $Id: oarnodesetting_ssh 949 2007-10-22 15:44:26Z capitn $
# This script is to be called from the node via SSH so that the server performs
# the oarnodesetting command and changes the state of the calling node.
#
# NB:
# 1- To get this script working, the oar resource database table must have an
#    `ip' field containing the IP address for all the nodes
# 2- A dedicated SSH key may be configured to restrict the ssh call capability
#    from the nodes to the server, by modifying the authorized_keys of oar on the
#    server as follows:
#    command="/usr/lib/oar/oarnodesetting_ssh" [dedicated pub key info]...
#
# Warning: if $IP does not exist in the database or every corresponding
#          resource states are 'Dead' then this script will return an exit code
#          of 12 not 0 (this is the default behaviour of "oarnodesetting").
IP=$(echo $SSH_CONNECTION | cut -d " " -f 1 )
OARNODESETTINGCMD=/usr/sbin/oarnodesetting
[ -n "$IP" ] || exit 1
# Set the node Absent
exec $OARNODESETTINGCMD -s Absent --sql "ip = '$IP' AND state != 'Dead'"
exit 1
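Both wrappers identify the calling node from the SSH_CONNECTION variable, which sshd sets to "client_ip client_port server_ip server_port". The sketch below isolates that extraction so it can be tried outside an ssh session; the function name client_ip and the sample value are assumptions made for illustration.

```shell
#!/bin/bash
# Extract the first field of an SSH_CONNECTION value, i.e. the address of the
# machine that opened the ssh connection (the calling node).
client_ip() {
    echo "$1" | cut -d " " -f 1
}

# Illustrative value, as sshd would set it for a connection from node1.
client_ip "192.168.0.1 51234 172.23.0.3 6667"
```

An empty result means the script was not started through ssh, which is why the wrappers exit with status 1 when IP is empty.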
Then, create 2 ssh keys with no passphrase and put them inside the .ssh directory of the oar user's home on every node:
sudo su - oar
ssh-keygen -t rsa -f .ssh/oarnodesetting_alive.key
ssh-keygen -t rsa -f .ssh/oarnodesetting_absent.key
scp -P 6667 .ssh/oarnodesetting_a* node1:.ssh
...
Add the public keys, on your frontend, into the authorized_keys file of the oar user by prefixing them with the names of the scripts seen above:
environment="OAR_KEY=1",command="/etc/oar/oarnodesetting_ssh_alive.sh" ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAryzISWw4jbhphQfxWq2onrv8hZJlQo/aIjkDyh6wtriT9W289RB+SUNT7qnrDOcorgpwoCOdT6Y6ezlH2R2mLkbNyegV8q8wVTw0E96Rw7iBFXyyjsoq27E9J8ddlH6mE05G9vRaBDQiLJ76+lG20hnE1jhHiQX8DuFzG+qxmNiLGSIlYNCGNzP2RudQ6vdACzkOUw74dpwmJK0ko4YyHpxpbZ2/x66nJTINaIAPBJZ09FpUbWIRABOozr8u0GayiB06JOYnsbW0PqNUOGEvChYV8Kh3FJsM+geNh43I+uEo17p9DYhSGd1enPFOIv4VmPzZ3huT8TJH88FEz1F/zw ===== environment="OAR_KEY=1",command="/etc/oar/oarnodesetting_ssh_absent.sh" ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3cM8AUC5F8Olb/umgjDztTOOWiRHj3WMy+js2dowfkO0s1yNkXa+L93UOC0L/BTSTbr8ZqGWV+yNvx36T8tFjWVnd+wkjwl616SxfEQQ1YXQWS8m55vPpCs3dT4ZvtSceB9G3XCoGje+fsOpNb05X9DhX+2bXwe69SwK3e8J7QkDIeRwcEiv6vrteHE04qaVBXTJGLgJToxcPKdKDNhPUUoA+f4ZO3OG0exrfhWNfrLpVqc69nOGiTI9/9N/Dmw/V5oAEvKED2H/Ek1EaptW7hCgZTHoyj9OXbpofSro768ecymRBa6/qfEC/LvSp9e2HYIjn5rcL0WqlKBajpblmQ==
Finally, customize the oar-node init script (generally /etc/default/oar-node or /etc/sysconfig/oar-node) with the following script:
## Auto update node status at boot time
#
# OARREMOTE: machine where we remotely run oarnodesetting (e.g. the main oar+kadeploy frontend)
OARREMOTE="172.23.0.3"
# retry settings
MODSLEEP=8
MINSLEEP=2
MAXRETRY=30
start_oar_node() {
    test -n "$OARREMOTE" || exit 0
    echo " * Set the ressources of this node to Alive"
    local retry=0
    local sleep=0
    until ssh -t -oStrictHostKeyChecking=no -oPasswordAuthentication=no -i /var/lib/oar/.ssh/oarnodesetting_alive.key oar@$OARREMOTE -p 6667
    do
        # Note the space after "[": without it the test is a syntax error
        if [ $((retry+=sleep)) -gt $MAXRETRY ]; then
            echo "Failed."
            return 1
        fi
        ((sleep = $RANDOM % $MODSLEEP + $MINSLEEP))
        echo "Retrying in $sleep seconds..."
        sleep $sleep
    done
    return 0
}
stop_oar_node() {
    test -n "$OARREMOTE" || exit 0
    echo " * Set the ressources of this node to Absent"
    local retry=0
    local sleep=0
    until ssh -t -oStrictHostKeyChecking=no -oPasswordAuthentication=no -i /var/lib/oar/.ssh/oarnodesetting_absent.key oar@$OARREMOTE -p 6667
    do
        if [ $((retry+=sleep)) -gt $MAXRETRY ]; then
            echo "Failed."
            return 1
        fi
        ((sleep = $RANDOM % $MODSLEEP + $MINSLEEP))
        echo "Retrying in $sleep seconds..."
        sleep $sleep
    done
    return 0
}
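The two functions share the same retry logic: retry until the command succeeds, sleeping a random delay between attempts and giving up once the accumulated delay exceeds MAXRETRY seconds. As a sketch only, that logic could be factored into one helper that takes the command as arguments; the name retry_with_backoff is hypothetical and is not part of the OAR scripts.

```shell
#!/bin/bash
# Retry settings, same meaning as in the init script above.
MODSLEEP=${MODSLEEP:-8}
MINSLEEP=${MINSLEEP:-2}
MAXRETRY=${MAXRETRY:-30}

# Run the given command until it succeeds. Between attempts, sleep a random
# delay between MINSLEEP and MINSLEEP+MODSLEEP-1 seconds. Give up (return 1)
# once the total time already slept exceeds MAXRETRY seconds.
retry_with_backoff() {
    local waited=0 delay=0
    until "$@"; do
        waited=$((waited + delay))
        if [ "$waited" -gt "$MAXRETRY" ]; then
            echo "Failed." >&2
            return 1
        fi
        delay=$((RANDOM % MODSLEEP + MINSLEEP))
        echo "Retrying in $delay seconds..." >&2
        sleep "$delay"
    done
    return 0
}
```

start_oar_node could then reduce to a call such as retry_with_backoff ssh -t ... -i /var/lib/oar/.ssh/oarnodesetting_alive.key oar@$OARREMOTE -p 6667, with the same ssh options as in the script above.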
You can test by issuing the following from a node:
node1:~ # /etc/init.d/oar-node stop
Stopping OAR dedicated SSH server:
 * Set the ressources of this node to Absent
33 --> Absent
34 --> Absent
35 --> Absent
36 --> Absent
37 --> Absent
38 --> Absent
39 --> Absent
40 --> Absent
Check jobs to delete on resource 33 :
Check done
Check jobs to delete on resource 34 :
Check done
Check jobs to delete on resource 35 :
Check done
Check jobs to delete on resource 36 :
Check done
Check jobs to delete on resource 37 :
Check done
Check jobs to delete on resource 38 :
Check done
Check jobs to delete on resource 39 :
Check done
Check jobs to delete on resource 40 :
Check done
Connection to 172.23.0.3 closed.
node1:~ # /etc/init.d/oar-node start
Starting OAR dedicated SSH server:
 * Set the ressources of this node to Alive
33 --> Alive
34 --> Alive
35 --> Alive
36 --> Alive
37 --> Alive
38 --> Alive
39 --> Alive
40 --> Alive
Done
Connection to 172.23.0.3 closed.
You can manage several different clusters with a single OAR server. You may also choose to have one or several submission hosts. Simply install the oar-server package on the server and the oar-user package on all the submission hosts.
You can tag the resources to keep track of which resource belongs to which cluster. Simply create a new property (for example: “cluster”) and set it for each resource. Example:
oarproperty -c -a cluster
for i in `seq 1 32`; do oarnodesetting -r $i -p cluster="clusterA"; done
for i in `seq 33 64`; do oarnodesetting -r $i -p cluster="clusterB"; done
Users can choose on which cluster to submit by asking for a specific cluster value:
oarsub -I -l /nodes=2 -p "cluster='clusterA'"
If you have several submission hosts, you can write an admission rule that automatically sets the value of the cluster property. For example, the following admission rule should do the trick:
# Title : Cluster property management
# Description : Set the cluster property to the hostname of the submission host
use Sys::Hostname;
my @h = split('\\.',hostname());
# If you want to set up a queue per cluster, you can uncomment the following:
#if ($queue_name eq "default") {
#    $queue_name=$h[0];
#}
if ($jobproperties ne ""){
    $jobproperties = "($jobproperties) AND cluster = '".$h[0]."'";
}
else{
    $jobproperties = "cluster = '".$h[0]."'";
}
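To see what expression this rule produces, the same string handling can be reproduced outside OAR. The following shell sketch only mirrors the logic of the Perl rule for illustration; the function name add_cluster_constraint, the host name clusterA.example.org and the "mem > 4" constraint are assumptions.

```shell
#!/bin/bash
# Build the job property expression the way the admission rule does:
# keep the user's constraint, if any, and add the cluster constraint.
# The cluster name is the short host name (everything before the first dot).
add_cluster_constraint() {
    local jobproperties=$1 fqdn=$2
    local short=${fqdn%%.*}
    if [ -n "$jobproperties" ]; then
        echo "($jobproperties) AND cluster = '$short'"
    else
        echo "cluster = '$short'"
    fi
}

# Without and with a user-supplied constraint (hypothetical values).
add_cluster_constraint "" "clusterA.example.org"
add_cluster_constraint "mem > 4" "clusterA.example.org"
```

This assumes, like the rule itself, that the short host name of each submission host is exactly the value stored in the cluster property.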
Finally, you may also want to set up a queue per cluster, just because it's nicer in the oarstat output:
oarnotify --add_queue "clusterA,5,oar_sched_gantt_with_timesharing"
oarnotify --add_queue "clusterB,5,oar_sched_gantt_with_timesharing"
In /etc/oar/job_resource_manager.pl simply uncomment the #exit(0) line.
Note: this tip depends on the start/stop of nodes using ssh keys tip, so that a node is automatically set to the Alive state at boot time.
The OAR server now comes with a Perl script, located in /etc/oar/oar_phoenix.pl, that searches for fully suspected nodes and may send customized commands aimed at repairing them. It has a 2-level mechanism: first, it sends a 'soft' command; then, after a timeout, if the node is still suspected, it sends a 'hard' command. Here is how to install the script:
cluster:~# vi /etc/oar/oar_phoenix.pl
# Command sent to reboot a node (first attempt)
my $PHOENIX_SOFT_REBOOTCMD="ssh -p 6667 {nodename} oardodo reboot";
# Timeout for a soft rebooted node to be considered hard rebootable
my $PHOENIX_SOFT_TIMEOUT=300;
# Command sent to reboot a node (second attempt)
#my $PHOENIX_HARD_REBOOTCMD="oardodo ipmitool -U USERID -P PASSW0RD -H {nodename}-mgt power off;sleep 2;oardodo ipmitool -U USERID -P PASSW0RD -H {NODENAME}-mgt power on";
my $PHOENIX_HARD_REBOOTCMD="oardodo /etc/oar/reboot_node_hard.sh {nodename}";
cluster:~# vi /etc/cron.d/oar-phoenix
*/10 * * * * root /usr/sbin/oar_phoenix
Some distributions have perl_suid installed, but not set up correctly. The solution is something like this:
bzeznik@healthphy:~> which sperl5.8.8
/usr/bin/sperl5.8.8
bzeznik@healthphy:~> sudo chmod u+s /usr/bin/sperl5.8.8
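The two-level mechanism described above can be summarized as a small decision function. This is only a reading of that description, not the actual code of oar_phoenix.pl; the function name phoenix_next_action and its arguments are hypothetical.

```shell
#!/bin/bash
# Decide what to do with a node that is still suspected.
#   $1: seconds elapsed since the soft command was sent, or empty if no
#       soft command has been sent yet
#   $2: soft timeout in seconds (the role played by PHOENIX_SOFT_TIMEOUT)
# Prints "soft", "hard" or "wait".
phoenix_next_action() {
    local since_soft=$1 timeout=$2
    if [ -z "$since_soft" ]; then
        echo soft      # first attempt: send the soft command
    elif [ "$since_soft" -ge "$timeout" ]; then
        echo hard      # soft attempt timed out: send the hard command
    else
        echo wait      # soft command sent, timeout not reached yet
    fi
}
```

With a 300 second timeout and a cron entry every 10 minutes, a node that does not come back after the soft command would, under this reading, get the hard command on the next run.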