This page describes a proof of concept that uses OAR to couple "cpu" resources with "disk" resources. Each node has "cpus" and "disks"; we want to allow users to create jobs that reserve disks for a long period, and then request smaller compute jobs on the resources where those disks are reserved. This gives users the capability of keeping data on disk longer than the compute time, i.e. a more persistent storage for data.
We create disk resources, then set up a coupling so that a compute resource is tagged whenever a user has disks reserved on its host. Tagging is done in the disk property of the compute resource (type default), with the syntax: /userid1:#disks/userid2:#disks/.
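For illustration (hypothetical values), if user1 has 4 disks and user2 has 2 disks reserved on node1, the default resources of node1 would carry disk=/user1:4/user2:2/, while a host with no disk reservation keeps the initial value disk=/. The property can be inspected with oarnodes, e.g.:

# Hypothetical check: show the disk property of node1's compute resources.
# Expected value with the reservations above: disk=/user1:4/user2:2/
oarnodes --sql "host = 'node1' AND type = 'default'" | grep disk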
This setup was tested on oar-docker.
Create some disk resources:
oarproperty -c -a disk
oarproperty -a diskid
# The host property here takes the same hostname for the new resources of type 'disk'
# as for the resources of type 'default' (in the oardocker default setup: node1,
# node2 and node3). We create 6 disks per host.
for h in {1..3}; do
  # Set the default disk value for the default (compute) resources
  oarnodesetting -h node$h -p disk=/
  for d in {1..6}; do
    # Create the disk resources: disk is a unique value (hierarchical resource),
    # diskid is the disk identifier on the machine (used to set up the disk).
    oarnodesetting -a -h "''" -p type='disk' -p diskid=$d -p "disk=$(((h-1)*6 + d))" -p "host=node$h"
  done
done
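As a quick sanity check (a sketch, assuming the oardocker setup above), the new resources can be listed to verify that each host now carries 6 disk resources in addition to its default resources:

# List the disk resources (expected: 6 per node, 18 in total)
oarnodes --sql "type = 'disk'"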
Change OAR configuration on the server as follows:
cd /etc/oar
cat <<'EOF' | patch -b
--- /etc/oar/oar.conf.old	2016-11-25 16:43:45.000000000 +0100
+++ /etc/oar/oar.conf	2016-11-25 01:11:29.510798912 +0100
@@ -205,8 +205,8 @@
 
 # Files to execute before and after each job on the OAR server (by default
 # nothing is executed)
-#SERVER_PROLOGUE_EXEC_FILE="/path/to/prog"
-#SERVER_EPILOGUE_EXEC_FILE="/path/to/prog"
+SERVER_PROLOGUE_EXEC_FILE="/etc/oar/server_prologue"
+SERVER_EPILOGUE_EXEC_FILE="/etc/oar/server_epilogue"
 
 #
 # File to execute just after a group of resources has been supected. It may
EOF
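A quick way to verify that the patch applied (plain grep, nothing OAR-specific):

# Both variables should now be uncommented and point to /etc/oar:
grep -E '^SERVER_(PROLOGUE|EPILOGUE)_EXEC_FILE' /etc/oar/oar.conf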
Put the code below in both /etc/oar/server_prologue and /etc/oar/server_epilogue on the server:
#!/bin/bash
# Usage:
# Script is run under the uid of oar, who is sudo.
# argv[1] is the jobid.
# Other job information is meant to be retrieved using oarstat.

ACTION=${0##*/}
JOBID=$1

get_disk_resa() {
  if [ "$ACTION" == "server_epilogue" ]; then
    EPILOGUE_EXCEPT_JOB="AND j.job_id != $JOBID"
  fi
  cat <<EOF | oardodo su postgres -c 'psql -t -A -F: oar'
SELECT r.host, j.job_user, COUNT(r.diskid)
FROM resources r
LEFT JOIN assigned_resources a ON a.resource_id = r.resource_id
LEFT JOIN jobs j ON j.assigned_moldable_job = a.moldable_job_id
  AND j.state IN ('Launching','Running') $EPILOGUE_EXCEPT_JOB
WHERE r.type = 'disk' AND r.host IN (
  SELECT r.host
  FROM resources r, jobs j, assigned_resources a
  WHERE j.job_id = $JOBID
    AND j.assigned_moldable_job = a.moldable_job_id
    AND a.resource_id = r.resource_id
    AND r.type = 'disk'
  GROUP BY r.host
)
GROUP BY r.host, j.job_user
ORDER BY r.host ASC, j.job_user DESC
EOF
}

set_disk_property_on_hosts() {
  exec 3> /var/lib/oar/server_prologue_epilogue.lock
  flock -x 3
  declare -A host
  for l in $(get_disk_resa); do
    h=${l%%:*}
    ud=${l#*:}
    u=${ud%%:*}
    if [ -z "$u" ]; then
      # No user on this host: don't change the value, but make sure the array key exists
      host[$h]="${host[$h]}"
    else
      host[$h]="${host[$h]}/$ud"
    fi
  done
  for h in "${!host[@]}"; do
    echo "$h -> disk=${host[$h]}/"
    /usr/local/sbin/oarnodesetting -n -h $h -p "disk=${host[$h]}/"
  done
  flock -u 3
}

set_disk_property_on_hosts &
exit
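Since the script dispatches on its basename (ACTION=${0##*/}), a single file plus a symlink is enough. A minimal install sketch, assuming the script was saved as /tmp/disk_coupling.sh (hypothetical name):

# Install once, symlink for the second name, and make it executable:
cp /tmp/disk_coupling.sh /etc/oar/server_prologue
ln -s /etc/oar/server_prologue /etc/oar/server_epilogue
chmod 755 /etc/oar/server_prologue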
This code creates the coupling between the disk resources and the compute resources. Some complementary code has to be added to perform the actual disk setup (create a logical volume, mount a filesystem, etc.) at the beginning and end of the compute job; a sketch follows.
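As an illustration only, here is a minimal sketch of such complementary setup code, assuming one LVM volume group named vg_data per node and a helper that receives the user name and diskid (all names, sizes and paths are hypothetical, not part of the coupling above):

#!/bin/bash
# Hypothetical disk setup helper: create and mount one logical volume
# per reserved disk. Arguments: user name and diskid.
USER_NAME=$1
DISKID=$2
LV="data_${USER_NAME}_${DISKID}"
MNT="/data/${USER_NAME}/${DISKID}"
# Create and format the logical volume if it does not exist yet
# (100G is an arbitrary example size).
lvdisplay "vg_data/$LV" >/dev/null 2>&1 || {
  lvcreate -n "$LV" -L 100G vg_data
  mkfs.ext4 "/dev/vg_data/$LV"
}
mkdir -p "$MNT"
mountpoint -q "$MNT" || mount "/dev/vg_data/$LV" "$MNT"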
To support the compute job submission, a job type with_disk is added via the admission rules. This job type will be used when submitting the compute jobs.
First modify the job type checking: $ oaradmissionrules -m 15 -e
# Check if job types are valid
my @types = (
    qr/^container(?:=\w+)?$/,
    qr/^deploy(?:=standby)?$/,
    qr/^desktop_computing$/,
    qr/^besteffort$/,
    qr/^cosystem(?:=standby)?$/,
    qr/^idempotent$/,
    qr/^placeholder=\w+$/,
    qr/^allowed=\w+$/,
    qr/^inner=\w+$/,
    qr/^timesharing=(?:(?:\*|user),(?:\*|name)|(?:\*|name),(?:\*|user))$/,
    qr/^token\:\w+\=\d+$/,
    qr/^noop(?:=standby)?$/,
    qr/^(?:postpone|deadline|expire)=\d\d\d\d-\d\d-\d\d(?:\s+\d\d:\d\d(?::\d\d)?)?$/,
+   qr/^with_disk(?:=\d+)?$/,
);
foreach my $t ( @{$type_list} ) {
    my $match = 0;
    foreach my $r (@types) {
        if ($t =~ $r) { $match = 1; }
    }
    unless ( $match ) {
        die( "[ADMISSION RULE] Error: unknown job type: $t\n");
    }
}
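After this change, the type check accepts the new type: a submission such as the one below no longer dies with an "unknown job type" error (the disk property filter itself is only added by the next rule):

user1$ oarsub -t with_disk=4 -l host=1 "true"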
Then add a new rule: oaradmissionrules -n -e
# Handle the 'with_disk' type of job
foreach my $t ( @{$type_list} ) {
    if ($t =~ qr/^with_disk(?:=(\d+))?$/) {
        my $disk_count = $1;
        if (defined($disk_count)) {
            $jobproperties_applied_after_validation = "disk like \'%/$user:$disk_count/%\'";
        } else {
            $jobproperties_applied_after_validation = "disk like \'%/$user:%\'";
        }
        last;
    }
}
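One way to check that the rule fired is to look at the properties of a submitted job, e.g. assuming job 42 (hypothetical job id) was submitted by user1 with -t with_disk=4:

# The SQL filter added by the rule shows up in the job's full status:
oarstat -f -j 42 | grep properties
# -> properties = ... disk like '%/user1:4/%' ...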
Mind setting a relevant priority for the new admission rule.
Finally, just like for deploy jobs, we might want to allow only whole nodes for with_disk compute jobs. For that, we edit the corresponding rule: oaradmissionrules -m 9 -e
-# Restrict allowed properties for deploy jobs to force requesting entire nodes
+# Restrict allowed properties for deploy and with_disk jobs to force requesting entire nodes
my @bad_resources = ("cpu","core", "thread","resource_id",);
-if (grep(/^deploy$/, @{$type_list})){
+if (grep(/^(?:deploy|with_disk(?:=\d+)?)$/, @{$type_list})){
    foreach my $mold (@{$ref_resource_list}){
        foreach my $r (@{$mold->[0]}){
            my $i = 0;
            while (($i <= $#{$r->{resources}})){
                if (grep(/^$r->{resources}->[$i]->{resource}$/i, @bad_resources)){
-                   die("[ADMISSION RULE] '$r->{resources}->[$i]->{resource}' resource is not allowed with a deploy job\n");
+                   die("[ADMISSION RULE] '$r->{resources}->[$i]->{resource}' resource is not allowed with a deploy or with_disk job\n");
                }
                $i++;
            }
        }
    }
}
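With this modification, a with_disk submission requesting resources below the node level is refused, for instance:

user1$ oarsub -t with_disk -l core=2 "sleep 1h"
# -> [ADMISSION RULE] 'core' resource is not allowed with a deploy or with_disk job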
Change the drawgantt-svg configuration to visualize the disk resources:
cd /etc/oar
cat <<'EOF' | patch -b
--- /etc/oar/drawgantt-config.inc.php.old	2016-11-25 16:43:37.000000000 +0100
+++ /etc/oar/drawgantt-config.inc.php	2016-11-24 14:36:16.721873682 +0100
@@ -17,7 +17,7 @@
 $CONF['default_relative_start'] = ""; // default relative start and stop times ([+-]<seconds>), mind setting it
 $CONF['default_relative_stop'] = ""; // accordingly to the nav_forecast values below, eg -24*3600*0.1 and 24*3600*0.9
 $CONF['default_timespan'] = 6*3600; // default timespan, should be one of the nav_timespans below
-$CONF['default_resource_base'] = 'cpuset'; // default base resource, should be one of the nav_resource_bases below
+$CONF['default_resource_base'] = 'diskid'; // default base resource, should be one of the nav_resource_bases below
 $CONF['default_scale'] = 10; // default vertical scale of the grid, should be one of the nav_scales bellow
 
 // Navigation bar configuration
@@ -54,15 +54,16 @@
 );
 $CONF['nav_filters'] = array( // proposed filters in the "misc" bar
-  'all clusters' => 'resources.type = \'default\'',
+  'all clusters' => '',
   'cluster1 only' => 'resources.cluster=\'cluster1\'',
   'cluster2 only' => 'resources.cluster=\'cluster2\'',
   'cluster3 only' => 'resources.cluster=\'cluster3\'',
 );
 $CONF['nav_resource_bases'] = array( // proposed base resources
-  'network_address',
+  'host',
   'cpuset',
+  'diskid',
 );
 
 $CONF['nav_timezones'] = array( // proposed timezones in the "misc" bar (the first one will be selected by default)
@@ -85,22 +86,22 @@
 
 // Data display configuration
 $CONF['timezone'] = "UTC";
-$CONF['site'] = "oardocker resources for OAR 2.5.7-40-gfc66d71*"; // name for your infrastructure or site
-$CONF['resource_labels'] = array('network_address','cpuset'); // properties to describe resources (labels on the left). Must also be part of resource_hierarchy below
+$CONF['site'] = "oardocker resources for OAR 2.5.7-42-gb1b3b04*"; // name for your infrastructure or site
+$CONF['resource_labels'] = array('host','type', 'cpuset','diskid'); // properties to describe resources (labels on the left). Must also be part of resource_hierarchy below
 $CONF['cpuset_label_display_string'] = "%02d";
 $CONF['label_display_regex'] = array( // shortening regex for labels (e.g. to shorten node-1.mycluster to node-1
-  'network_address' => '/^([^.]+)\..*$/',
+  'host' => '/^([^.]+)\..*$/',
 );
 $CONF['label_cmp_regex'] = array( // substring selection regex for comparing and sorting labels (resources)
-  'network_address' => '/^([^-]+)-(\d+)\..*$/',
+  'host' => '/^([^-]+)-(\d+)\..*$/',
 );
 $CONF['resource_properties'] = array( // properties to display in the pop-up on top of the resources labels (on the left)
-  'deploy', 'cpuset', 'besteffort', 'network_address', 'type', 'drain');
+  'deploy', 'cpuset', 'disk', 'diskid', 'besteffort', 'host', 'type', 'drain');
 $CONF['resource_hierarchy'] = array( // properties to use to build the resource hierarchy drawing
-  'network_address','cpuset',
+  'host','type','cpuset', 'diskid',
 );
-$CONF['resource_base'] = "cpuset"; // base resource of the hierarchy/grid
-$CONF['resource_group_level'] = "network_address"; // level of resources to separate with blue lines in the grid
+$CONF['resource_base'] = "diskid"; // base resource of the hierarchy/grid
+$CONF['resource_group_level'] = "host"; // level of resources to separate with blue lines in the grid
 $CONF['resource_drain_property'] = "drain"; // if set, must also be one of the resource_properties above to activate the functionnality
 $CONF['state_colors'] = array( // colors for the states of the resources in the gantt
   'Absent' => 'url(#absentPattern)', 'Suspected' => 'url(#suspectedPattern)', 'Dead' => 'url(#deadPattern)',
   'Standby' => 'url(#standbyPattern)', 'Drain' => 'url(#drainPattern)');
EOF
Please note that this may not be the best way to display both the compute and disk resources: setting up a web page with two connected Gantt charts, one for the compute jobs and one for the disk jobs, can be more readable.
Reserve 4 disks for 10 days:
user1$ oarsub -t noop -l {"type='disk'"}/host=2/disk=4,walltime=240:0:0 "sleep 10d"
The disk property for the reserved hosts is now: disk=/user1:4/ (once the disk job ends, the server epilogue resets it to disk=/).
Submit a compute job on resources that have those disks:
user1$ oarsub -t with_disk=4 -l host=2 "sleep 1h"
or, if we do not care about matching the exact disk count, just:
user1$ oarsub -t with_disk -l host=2 "sleep 1h"
Under the hood, this adds a property filter: disk like '%/user1:4/%'.
Regarding the first command, it could be simplified using another custom job type, e.g.:
user1$ oarsub -t reserve_disk -l /host=2/disk=4,walltime=240:0:0 "sleep 10d"
(with the -t noop type and the {"type='disk'"} resource filter set underneath by yet another admission rule).
Note that the reserve_disk and with_disk job types do not allow reserving both the disk and the compute resources at once (in a single oarsub command).