OAR [wiki:disk_reservation]

This page describes a proof of concept that uses OAR to couple “cpu” resources with “disk” resources. For example, each node has “cpus” and “disks”: we want to allow users to create jobs that reserve disks for a long period, and then possibly submit smaller compute jobs on the resources where those disks are reserved. This lets users keep data on disk for longer than the compute time, i.e. it provides more persistent storage for data.

Setup

We create disk resources, then set up a coupling so that a compute resource is tagged when a user has disks reserved on it. Tagging is done in the disk property of the compute resource (type default), with the syntax: /userid1:#disks/userid2:#disks/.
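
To make the tag syntax concrete, here is a minimal sketch that splits a disk property value into its user and disk-count pairs. The function name parse_disk_property is only an illustration and is not part of OAR.

```shell
#!/bin/sh
# Print one "user count" pair per line from a disk property value
# such as "/user1:4/user2:2/". The default value "/" prints nothing.
parse_disk_property() {
  printf '%s\n' "$1" | tr '/' '\n' | while IFS=: read -r user count; do
    # Skip the empty fields produced by the leading and trailing slashes
    if [ -n "$user" ]; then
      printf '%s %s\n' "$user" "$count"
    fi
  done
}

parse_disk_property "/user1:4/user2:2/"
```

With the sample value, the function prints one line for user1 (4 disks) and one for user2 (2 disks).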

This setup was tested on oar-docker.

Disk resources creation

Create some disk resources:

oarproperty -c -a disk
oarproperty -a diskid
# The host property here takes the same hostname for the new resources of type //disk//,
# as for the resources of type //default// (in oardocker default setup: node1, node2 and node3).
# We create 6 disks per host.
for h in {1..3}; do
  # Set the default disk value for default (compute) resources
  oarnodesetting -h node$h -p disk=/
  for d in {1..6}; do
    # Create the disk resources: disk is a unique value (hierarchical resource), 
    # diskid is the disk identifier on the machine (used to setup the disk).
    oarnodesetting -a -h "''" -p type='disk' -p diskid=$d -p "disk=$(((h-1)*6 + d))" -p "host=node$h";
  done
done
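
Before running the loop for real, it can help to print the mapping it produces. The following dry run is only a sketch: it calls no OAR command, and the names DISKS_PER_HOST and disk_value are made up for the illustration. It shows that the formula gives each disk a unique value across hosts.

```shell
#!/bin/sh
# Dry run: no OAR command is called, the mapping is only printed.
DISKS_PER_HOST=6   # same value as in the creation loop

# Unique value of the disk property for disk number $2 of host number $1
disk_value() {
  echo $(( ($1 - 1) * DISKS_PER_HOST + $2 ))
}

for h in 1 2 3; do
  for d in 1 2 3 4 5 6; do
    printf 'host=node%s diskid=%s disk=%s\n' "$h" "$d" "$(disk_value "$h" "$d")"
  done
done
```

node1 gets the values 1 to 6, node2 gets 7 to 12 and node3 gets 13 to 18, while diskid restarts at 1 on each host.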

Server prologue/epilogue

Change OAR configuration on the server as follows:

cd /etc/oar
cat <<'EOF' | patch -b
--- /etc/oar/oar.conf.old	2016-11-25 16:43:45.000000000 +0100
+++ /etc/oar/oar.conf	2016-11-25 01:11:29.510798912 +0100
@@ -205,8 +205,8 @@
 
 # Files to execute before and after each job on the OAR server (by default
 # nothing is executed)
-#SERVER_PROLOGUE_EXEC_FILE="/path/to/prog"
-#SERVER_EPILOGUE_EXEC_FILE="/path/to/prog"
+SERVER_PROLOGUE_EXEC_FILE="/etc/oar/server_prologue"
+SERVER_EPILOGUE_EXEC_FILE="/etc/oar/server_epilogue"
 
 #
 # File to execute just after a group of resources has been supected. It may 
EOF

Put the code below in both /etc/oar/server_prologue and /etc/oar/server_epilogue on the server:

#!/bin/bash
# Usage:
# The script is run under the oar user, which has sudo rights.
# argv[1] is the jobid.
# Other job information is meant to be retrieved using oarstat.
 
ACTION=${0##*/}
JOBID=$1
 
get_disk_resa() {
if [ "$ACTION" == "server_epilogue" ]; then
  EPILOGUE_EXCEPT_JOB="AND j.job_id != $JOBID"
fi
cat <<EOF | oardodo su postgres -c 'psql -t -A -F:  oar'
SELECT r.host, j.job_user, COUNT(r.diskid)
FROM resources r
  LEFT JOIN assigned_resources a ON a.resource_id = r.resource_id
  LEFT JOIN jobs j ON j.assigned_moldable_job = a.moldable_job_id AND j.state in ('Launching','Running') $EPILOGUE_EXCEPT_JOB
WHERE
      r.type = 'disk' AND
      r.host IN (
          SELECT r.host 
          FROM resources r, jobs j, assigned_resources a 
          WHERE j.job_id=$JOBID AND
                j.assigned_moldable_job = a.moldable_job_id AND
                a.resource_id = r.resource_id AND
                r.type = 'disk'
          GROUP BY r.host
      )
GROUP BY r.host, j.job_user
ORDER BY r.host ASC, j.job_user DESC
EOF
}
set_disk_property_on_hosts() {
  exec 3> /var/lib/oar/server_prologue_epilogue.lock
  flock -x 3
  declare -A host
  for l in $(get_disk_resa); do 
    h=${l%%:*}
    ud=${l#*:}
    u=${ud%%:*}
    if [ -z "$u" ]; then
      # don't change the value, but make sure to create the array key
      host[$h]="${host[$h]}"
    else
      host[$h]="${host[$h]}/$ud"
    fi
  done
  for h in "${!host[@]}"; do
    echo "$h -> disk=${host[$h]}/"
    /usr/local/sbin/oarnodesetting -n -h "$h" -p "disk=${host[$h]}/"
  done
  flock -u 3
}
set_disk_property_on_hosts &
exit

This code creates the coupling between the disk resources and the compute resources. Complementary code still has to be added to actually set up the disks (create a logical volume, mount it, etc.) at the beginning and end of the compute job.
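
The aggregation step of the script can be tried without a database. The sketch below takes "host:user:count" lines, the format printed by psql -t -A -F:, and builds the disk property of each host. It uses awk instead of the bash associative array of the script, and the sample rows are made up.

```shell
#!/bin/sh
# Turn "host:user:count" lines into one "host disk=<value>" line per host.
build_disk_properties() {
  awk -F: '
    { if (!($1 in prop)) prop[$1] = "" }            # every host gets a key
    $2 != "" { prop[$1] = prop[$1] "/" $2 ":" $3 }  # append /user:count
    END { for (h in prop) print h " disk=" prop[h] "/" }
  ' | sort
}

# Made-up sample rows: node1 has disks reserved by two users plus one free disk,
# node2 has no running disk job (the LEFT JOIN then yields an empty user field).
printf '%s\n' 'node1:user2:1' 'node1:user1:4' 'node1::1' 'node2::6' \
  | build_disk_properties
```

Rows with an empty user only guarantee that the host appears in the result, so a host without any reservation is reset to disk=/.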

Admission rules

In order to:

- simplify the user interface
- allow one to submit a compute job before the disk job actually starts (not possible otherwise, because no resource exists yet with the requested property value)

a job type with_disk is added in the admission rules. This type of job will be used for the compute job submission.

First modify the job type checking: oaradmissionrules -m 15 -e

 # Check if job types are valid
 my @types = (
     qr/^container(?:=\w+)?$/,                 qr/^deploy(?:=standby)?$/,
     qr/^desktop_computing$/,                   qr/^besteffort$/,
     qr/^cosystem(?:=standby)?$/,                qr/^idempotent$/,
     qr/^placeholder=\w+$/,                    qr/^allowed=\w+$/,
     qr/^inner=\w+$/,                          qr/^timesharing=(?:(?:\*|user),(?:\*|name)|(?:\*|name),(?:\*|user))$/,
     qr/^token\:\w+\=\d+$/,                 qr/^noop(?:=standby)?$/,
     qr/^(?:postpone|deadline|expire)=\d\d\d\d-\d\d-\d\d(?:\s+\d\d:\d\d(?::\d\d)?)?$/,
+    qr/^with_disk(?:=\d+)?$/,
 );
 foreach my $t ( @{$type_list} ) {
     my $match = 0;
     foreach my $r (@types) {
         if ($t =~ $r) {
             $match = 1;
         }
     }
     unless ( $match ) {
         die( "[ADMISSION RULE] Error: unknown job type: $t\n");
     }
 }

Then add a new rule: oaradmissionrules -n -e.

# Handle the 'with_disk' type of job
 
foreach my $t ( @{$type_list} ) {
    if ($t =~ qr/^with_disk(?:=(\d+))?$/) {
        my $disk_count = $1;
        if (defined($disk_count)) {
            $jobproperties_applied_after_validation = "disk like \'%/$user:$disk_count/%\'";
        } else {
            $jobproperties_applied_after_validation = "disk like \'%/$user:%\'";
        }
        last;
    }
}

Make sure to set an appropriate priority for the new admission rule.
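
The mapping from job type to property filter can be checked outside of OAR. The following shell sketch mirrors the logic of the Perl rule. The function with_disk_filter is hypothetical, and returning a non-zero status for other job types is a choice made for this illustration (the Perl rule simply does nothing in that case).

```shell
#!/bin/sh
# Print the property filter the rule would apply for user $1 and job type $2.
# Return 1 if $2 is not a with_disk type.
with_disk_filter() {
  user=$1 type=$2
  case $type in
    with_disk)
      printf "disk like '%%/%s:%%'\n" "$user" ;;
    with_disk=*)
      count=${type#with_disk=}
      # Accept digits only, as the \d+ of the Perl regex does
      case $count in
        ''|*[!0-9]*) return 1 ;;
      esac
      printf "disk like '%%/%s:%s/%%'\n" "$user" "$count" ;;
    *)
      return 1 ;;
  esac
}

with_disk_filter user1 with_disk=4
with_disk_filter user1 with_disk
```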

Finally, just like for deploy jobs, we might want to allow only whole nodes for compute jobs with disks. For that, we edit the corresponding rule: oaradmissionrules -m 9 -e

-# Restrict allowed properties for deploy jobs to force requesting entire nodes
+# Restrict allowed properties for deploy and with_disk jobs to force requesting entire nodes
 my @bad_resources = ("cpu","core", "thread","resource_id",);
-if (grep(/^deploy$/, @{$type_list})){
+if (grep(/^(?:deploy|with_disk(?:=\d+)?)$/, @{$type_list})){
     foreach my $mold (@{$ref_resource_list}){
         foreach my $r (@{$mold->[0]}){
             my $i = 0;
             while (($i <= $#{$r->{resources}})){
                 if (grep(/^$r->{resources}->[$i]->{resource}$/i, @bad_resources)){
-                     die("[ADMISSION RULE] '$r->{resources}->[$i]->{resource}' resource is not allowed with a deploy job\n");
 
+                     die("[ADMISSION RULE] '$r->{resources}->[$i]->{resource}' resource is not allowed with a deploy or with_disk job\n");
                 }
                 $i++;
             }
         }
     }
 }

Drawgantt

Change the drawgantt-svg configuration to visualise the disk resources:

cd /etc/oar
cat <<'EOF' | patch -b
--- /etc/oar/drawgantt-config.inc.php.old	2016-11-25 16:43:37.000000000 +0100
+++ /etc/oar/drawgantt-config.inc.php	2016-11-24 14:36:16.721873682 +0100
@@ -17,7 +17,7 @@
 $CONF['default_relative_start'] = ""; // default relative start and stop times ([+-]<seconds>), mind setting it
 $CONF['default_relative_stop'] = "";  // accordingly to the nav_forecast values below, eg -24*3600*0.1 and 24*3600*0.9
 $CONF['default_timespan'] = 6*3600; // default timespan, should be one of the nav_timespans below
-$CONF['default_resource_base'] = 'cpuset'; // default base resource, should be one of the nav_resource_bases below
+$CONF['default_resource_base'] = 'diskid'; // default base resource, should be one of the nav_resource_bases below
 $CONF['default_scale'] = 10; // default vertical scale of the grid, should be one of the nav_scales bellow
 
 // Navigation bar configuration
@@ -54,15 +54,16 @@
 );
 
 $CONF['nav_filters'] = array( // proposed filters in the "misc" bar
-  'all clusters' => 'resources.type = \'default\'',
+  'all clusters' => '',
   'cluster1 only' => 'resources.cluster=\'cluster1\'',
   'cluster2 only' => 'resources.cluster=\'cluster2\'',
   'cluster3 only' => 'resources.cluster=\'cluster3\'',
 );
 
 $CONF['nav_resource_bases'] = array( // proposed base resources
-  'network_address',
+  'host',
   'cpuset',
+  'diskid',
 );
 
 $CONF['nav_timezones'] = array( // proposed timezones in the "misc" bar (the first one will be selected by default)
@@ -85,22 +86,22 @@
 
 // Data display configuration
 $CONF['timezone'] = "UTC";
-$CONF['site'] = "oardocker resources for OAR 2.5.7-40-gfc66d71*"; // name for your infrastructure or site
-$CONF['resource_labels'] = array('network_address','cpuset'); // properties to describe resources (labels on the left). Must also be part of resource_hierarchy below 
+$CONF['site'] = "oardocker resources for OAR 2.5.7-42-gb1b3b04*"; // name for your infrastructure or site
+$CONF['resource_labels'] = array('host','type', 'cpuset','diskid'); // properties to describe resources (labels on the left). Must also be part of resource_hierarchy below 
 $CONF['cpuset_label_display_string'] = "%02d";
 $CONF['label_display_regex'] = array( // shortening regex for labels (e.g. to shorten node-1.mycluster to node-1
-  'network_address' => '/^([^.]+)\..*$/',
+  'host' => '/^([^.]+)\..*$/',
   );
 $CONF['label_cmp_regex'] = array( // substring selection regex for comparing and sorting labels (resources)
-  'network_address' => '/^([^-]+)-(\d+)\..*$/',
+  'host' => '/^([^-]+)-(\d+)\..*$/',
   );
 $CONF['resource_properties'] = array( // properties to display in the pop-up on top of the resources labels (on the left)
-  'deploy', 'cpuset', 'besteffort', 'network_address', 'type', 'drain');
+  'deploy', 'cpuset', 'disk', 'diskid', 'besteffort', 'host', 'type', 'drain');
 $CONF['resource_hierarchy'] = array( // properties to use to build the resource hierarchy drawing
-  'network_address','cpuset',
+  'host','type','cpuset', 'diskid',
   ); 
-$CONF['resource_base'] = "cpuset"; // base resource of the hierarchy/grid
-$CONF['resource_group_level'] = "network_address"; // level of resources to separate with blue lines in the grid
+$CONF['resource_base'] = "diskid"; // base resource of the hierarchy/grid
+$CONF['resource_group_level'] = "host"; // level of resources to separate with blue lines in the grid
 $CONF['resource_drain_property'] = "drain"; // if set, must also be one of the resource_properties above to activate the functionnality
 $CONF['state_colors'] = array( // colors for the states of the resources in the gantt
   'Absent' => 'url(#absentPattern)', 'Suspected' => 'url(#suspectedPattern)', 'Dead' => 'url(#deadPattern)', 'Standby' => 'url(#standbyPattern)', 'Drain' => 'url(#drainPattern)');
 
EOF

Please note that this may not be the best way to display both compute and disk jobs: a web page with two connected Gantt charts, one for compute jobs and one for disk jobs, could be more readable.

Use case

Reserve 4 disks for 10 days:

user1$ oarsub -t noop -l {"type='disk'"}/host=2/disk=4,walltime=240:0:0 "sleep 10d"

The disk property for the reserved hosts is now: disk=/user1:4/

Submit a compute job on resources that have those disks:

user1$ oarsub -t with_disk=4 -l host=2 "sleep 1h"

or, if we do not care to specify the disk count, just

user1$ oarsub -t with_disk -l host=2 "sleep 1h"

Underneath, with_disk=4 adds the property filter disk like '%/user1:4/%', while a bare with_disk adds disk like '%/user1:%'.
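
To see which disk property values such a filter selects, the LIKE match can be emulated in the shell. This is a sketch under a simplifying assumption: only the % wildcard is handled, by translating it to the shell glob *. In real SQL, _ is also a wildcard, which matters for user names that contain an underscore. The function name sql_like is made up.

```shell
#!/bin/sh
# Emulate SQL "value LIKE pattern" for patterns that only use the % wildcard,
# by translating % into the shell glob *.
sql_like() {
  value=$1
  glob=$(printf '%s' "$2" | sed 's/%/*/g')
  case $value in
    $glob) return 0 ;;
    *) return 1 ;;
  esac
}

for prop in "/user1:4/" "/user2:2/user1:4/" "/user1:14/" "/user10:4/" "/"; do
  if sql_like "$prop" "%/user1:4/%"; then r="match"; else r="no match"; fi
  printf '%-20s %s\n' "$prop" "$r"
done
```

Only the first two values match: the slashes and the colon around the user name and the count prevent user10 or a count of 14 from being selected by mistake.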

The first command (the disk reservation) could be simplified using another custom job type, e.g.:

user1$ oarsub -t reserve_disk -l /host=2/disk=4,walltime=240:0:0 "sleep 10d"

(with -t noop and {"type='disk'"} being set underneath by yet another admission rule).
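
Until such an admission rule exists, a client-side helper can produce the same command line. The sketch below is not the admission rule described here and is not part of OAR: reserve_disk_cmd is a hypothetical function that only prints the oarsub command, converting a duration in days into the walltime in hours.

```shell
#!/bin/sh
# Hypothetical helper: print the oarsub command that reserves $2 disks
# on each of $1 hosts for $3 days. Nothing is submitted.
reserve_disk_cmd() {
  hosts=$1 disks=$2 days=$3
  hours=$(( days * 24 ))
  printf "oarsub -t noop -l \"{type='disk'}/host=%s/disk=%s,walltime=%s:0:0\" \"sleep %sd\"\n" \
    "$hosts" "$disks" "$hours" "$days"
}

reserve_disk_cmd 2 4 10
```

For 2 hosts, 4 disks and 10 days, the printed command carries walltime=240:0:0, as in the manual reservation of this use case.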

Limits

This workaround does not allow compute jobs to be placed and shown (scheduled) in the future, i.e. before the disk job is running. Compute jobs can however be submitted ahead of time, even though they are not scheduled. Also, the scheduler could place compute jobs after the end of the running disk job, so that they would ultimately never launch.
Users may want to select specific disks/hosts when using only part of the reserved disks for a given compute job. This mechanism does not allow it.
The reserve_disk and with_disk job types do not allow reserving both the disk and compute resources at once (in a single oarsub command).
A with_disk compute job could easily suffer from starvation, since it will only be scheduled once the disk job has started:
- While batch jobs that have not started yet will be moved with regard to previous scheduling decisions, some may have started before the disk property of the resources is changed, making the resources whose disks are reserved unavailable for the duration of those jobs.
- Advance reservations could also be accepted on the resources: resources are booked when an advance reservation is accepted at submission, possibly before the disk property is changed.
