This page describes a proof of concept to use OAR to couple "cpu" resources with some "disk" resources.
E.g. each node has both "cpus" and "disks": we want to allow users to create jobs that reserve disks for a long period, and then possibly submit shorter compute jobs on the resources where those disks are reserved.
This would give users the capability of keeping data on disks for a longer time than the compute time, i.e. a more persistent storage for data.
===== Setup =====
We create //disk// resources, then set up a coupling so that a //compute// resource is **tagged** whenever a user has disks reserved on it. Tagging is done in the //disk// property of the //compute// resource (type //default//), with the syntax: ''/userid1:#disks/userid2:#disks/''.
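For example (hypothetical values): if //user1// has 4 disks and //user2// has 2 disks reserved on //node1//, the //disk// property of node1's //default// (compute) resources would read ''disk=/user1:4/user2:2/''.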
This setup was tested on [[oar-docker]].
==== Disk resources creation ====
Create some //disk// resources:
oarproperty -c -a disk
oarproperty -a diskid
# The host property here takes the same hostname for the new resources of type //disk//,
# as for the resources of type //default// (in oardocker default setup: node1, node2 and node3).
# We create 6 disks per host.
for h in {1..3}; do
# Set the default disk value for default (compute) resources
oarnodesetting -h node$h -p disk=/
for d in {1..6}; do
# Create the disk resources: disk is a unique value (hierarchical resource),
# diskid is the disk identifier on the machine (used to setup the disk).
oarnodesetting -a -h "" -p type='disk' -p diskid=$d -p "disk=$(((h-1)*6 + d))" -p "host=node$h";
done
done
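To double-check the creation, the new resources can be listed, for instance using the ''--sql'' filter of ''oarnodes'' (the exact output format depends on the OAR version):
oarnodes --sql "type='disk'"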
==== Server prologue/epilogue ====
Change OAR configuration on the server as follows:
cd /etc/oar
cat <<'EOF' | patch -b
--- /etc/oar/oar.conf.old 2016-11-25 16:43:45.000000000 +0100
+++ /etc/oar/oar.conf 2016-11-25 01:11:29.510798912 +0100
@@ -205,8 +205,8 @@
# Files to execute before and after each job on the OAR server (by default
# nothing is executed)
-#SERVER_PROLOGUE_EXEC_FILE="/path/to/prog"
-#SERVER_EPILOGUE_EXEC_FILE="/path/to/prog"
+SERVER_PROLOGUE_EXEC_FILE="/etc/oar/server_prologue"
+SERVER_EPILOGUE_EXEC_FILE="/etc/oar/server_epilogue"
#
# File to execute just after a group of resources has been supected. It may
EOF
Put the code below in both ''/etc/oar/server_prologue'' and ''/etc/oar/server_epilogue'' on the server:
#!/bin/bash
# Usage:
# This script runs under the uid of oar, which has sudo rights.
# argv[1] is the jobid.
# Other job information is meant to be retrieved using oarstat.
ACTION=${0##*/}
JOBID=$1
get_disk_resa() {
if [ "$ACTION" == "server_epilogue" ]; then
EPILOGUE_EXCEPT_JOB="AND j.job_id != $JOBID"
fi
# Query the OAR database here and print one line per host having disk
# resources, in the format "host:user:count" for each user owning disks
# reserved by a currently running disk job ($EPILOGUE_EXCEPT_JOB excludes
# the ending job in the epilogue), or just "host:" when no disk is reserved.
}
set_disk_property_on_hosts() {
# Serialize concurrent prologue/epilogue executions with an exclusive lock.
exec 3> /var/lib/oar/server_prologue_epilogue.lock
flock -x 3
declare -A host
for l in $(get_disk_resa); do
h=${l%%:*}
ud=${l#*:}
u=${ud%%:*}
if [ -z "$u" ]; then
# Don't change the value but make sure the array key exists, so that the
# host's disk property is reset to "/" below when nothing is reserved on it.
host[$h]="${host[$h]}"
else
host[$h]="${host[$h]}/$ud"
fi
done
for h in "${!host[@]}"; do
echo "$h -> disk=${host[$h]}/"
/usr/local/sbin/oarnodesetting -n -h $h -p "disk=${host[$h]}/"
done
flock -u 3
}
set_disk_property_on_hosts &
exit
This code creates the coupling between the //disk// resources and the //compute// resources. Complementary code still has to be added to perform the actual disk setup (create a logical volume, mount a filesystem, whatever is relevant) at the beginning/end of the compute job.
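As an illustration only, here is a minimal sketch of such a setup script, assuming disks are exposed as ''/dev/disk<diskid>'' on the nodes (the device naming and mount point layout are site-specific assumptions, not provided by OAR):
#!/bin/bash
# Hypothetical disk setup at the beginning of a compute job:
# $1 is the diskid of a reserved disk, $2 is the owner of the disk reservation.
DISKID=$1
OWNER=$2
MOUNTPOINT=/mnt/disk$DISKID
mkdir -p "$MOUNTPOINT"
# The filesystem is assumed to have been created once when the disk job started.
mount "/dev/disk$DISKID" "$MOUNTPOINT"
chown "$OWNER:" "$MOUNTPOINT"
# The counterpart at the end of the compute job would umount "$MOUNTPOINT".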
==== Admission rules ====
In order to:
- simplify the user interface
- allow one to submit a compute job before the //disk// job actually starts (not possible otherwise, because no resource exists yet with the requested property value)
a job type ''with_disk'' is added in the admission rules. This job type will be used for the compute job submissions.
First modify the job type checking rule: ''oaradmissionrules -m 15 -e''
# Check if job types are valid
my @types = (
qr/^container(?:=\w+)?$/, qr/^deploy(?:=standby)?$/,
qr/^desktop_computing$/, qr/^besteffort$/,
qr/^cosystem(?:=standby)?$/, qr/^idempotent$/,
qr/^placeholder=\w+$/, qr/^allowed=\w+$/,
qr/^inner=\w+$/, qr/^timesharing=(?:(?:\*|user),(?:\*|name)|(?:\*|name),(?:\*|user))$/,
qr/^token\:\w+\=\d+$/, qr/^noop(?:=standby)?$/,
qr/^(?:postpone|deadline|expire)=\d\d\d\d-\d\d-\d\d(?:\s+\d\d:\d\d(?::\d\d)?)?$/,
+ qr/^with_disk(?:=\d+)?$/,
);
foreach my $t ( @{$type_list} ) {
my $match = 0;
foreach my $r (@types) {
if ($t =~ $r) {
$match = 1;
}
}
unless ( $match ) {
die( "[ADMISSION RULE] Error: unknown job type: $t\n");
}
}
Then add a new rule: ''oaradmissionrules -n -e''.
# Handle the 'with_disk' type of job
foreach my $t ( @{$type_list} ) {
if ($t =~ qr/^with_disk(?:=(\d+))?$/) {
my $disk_count = $1;
if (defined($disk_count)) {
$jobproperties_applied_after_validation = "disk like \'%/$user:$disk_count/%\'";
} else {
$jobproperties_applied_after_validation = "disk like \'%/$user:%\'";
}
last;
}
}
Mind setting a relevant priority for the new admission rule.
Finally, just like for deploy jobs, we might want to allow only whole nodes for //compute with disk// jobs. For that, we edit the corresponding rule: ''oaradmissionrules -m 9 -e''
-# Restrict allowed properties for deploy jobs to force requesting entire nodes
+# Restrict allowed properties for deploy and with_disk jobs to force requesting entire nodes
my @bad_resources = ("cpu","core", "thread","resource_id",);
-if (grep(/^deploy$/, @{$type_list})){
+if (grep(/^(?:deploy|with_disk(?:=\d+)?)$/, @{$type_list})){
foreach my $mold (@{$ref_resource_list}){
foreach my $r (@{$mold->[0]}){
my $i = 0;
while (($i <= $#{$r->{resources}})){
if (grep(/^$r->{resources}->[$i]->{resource}$/i, @bad_resources)){
- die("[ADMISSION RULE] '$r->{resources}->[$i]->{resource}' resource is not allowed with a deploy job\n");
+ die("[ADMISSION RULE] '$r->{resources}->[$i]->{resource}' resource is not allowed with a deploy or with_disk job\n");
}
$i++;
}
}
}
}
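With this rule in place, a //with_disk// submission requesting less than whole nodes should be rejected, e.g. (hypothetical session):
user1$ oarsub -t with_disk -l core=4 "sleep 1h"
# expected rejection: [ADMISSION RULE] 'core' resource is not allowed with a deploy or with_disk job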
==== Drawgantt ====
Change the drawgantt-svg configuration to visualize the disk resources:
cd /etc/oar
cat <<'EOF' | patch -b
--- /etc/oar/drawgantt-config.inc.php.old 2016-11-25 16:43:37.000000000 +0100
+++ /etc/oar/drawgantt-config.inc.php 2016-11-24 14:36:16.721873682 +0100
@@ -17,7 +17,7 @@
$CONF['default_relative_start'] = ""; // default relative start and stop times ([+-]), mind setting it
$CONF['default_relative_stop'] = ""; // accordingly to the nav_forecast values below, eg -24*3600*0.1 and 24*3600*0.9
$CONF['default_timespan'] = 6*3600; // default timespan, should be one of the nav_timespans below
-$CONF['default_resource_base'] = 'cpuset'; // default base resource, should be one of the nav_resource_bases below
+$CONF['default_resource_base'] = 'diskid'; // default base resource, should be one of the nav_resource_bases below
$CONF['default_scale'] = 10; // default vertical scale of the grid, should be one of the nav_scales bellow
// Navigation bar configuration
@@ -54,15 +54,16 @@
);
$CONF['nav_filters'] = array( // proposed filters in the "misc" bar
- 'all clusters' => 'resources.type = \'default\'',
+ 'all clusters' => '',
'cluster1 only' => 'resources.cluster=\'cluster1\'',
'cluster2 only' => 'resources.cluster=\'cluster2\'',
'cluster3 only' => 'resources.cluster=\'cluster3\'',
);
$CONF['nav_resource_bases'] = array( // proposed base resources
- 'network_address',
+ 'host',
'cpuset',
+ 'diskid',
);
$CONF['nav_timezones'] = array( // proposed timezones in the "misc" bar (the first one will be selected by default)
@@ -85,22 +86,22 @@
// Data display configuration
$CONF['timezone'] = "UTC";
-$CONF['site'] = "oardocker resources for OAR 2.5.7-40-gfc66d71*"; // name for your infrastructure or site
-$CONF['resource_labels'] = array('network_address','cpuset'); // properties to describe resources (labels on the left). Must also be part of resource_hierarchy below
+$CONF['site'] = "oardocker resources for OAR 2.5.7-42-gb1b3b04*"; // name for your infrastructure or site
+$CONF['resource_labels'] = array('host','type', 'cpuset','diskid'); // properties to describe resources (labels on the left). Must also be part of resource_hierarchy below
$CONF['cpuset_label_display_string'] = "%02d";
$CONF['label_display_regex'] = array( // shortening regex for labels (e.g. to shorten node-1.mycluster to node-1
- 'network_address' => '/^([^.]+)\..*$/',
+ 'host' => '/^([^.]+)\..*$/',
);
$CONF['label_cmp_regex'] = array( // substring selection regex for comparing and sorting labels (resources)
- 'network_address' => '/^([^-]+)-(\d+)\..*$/',
+ 'host' => '/^([^-]+)-(\d+)\..*$/',
);
$CONF['resource_properties'] = array( // properties to display in the pop-up on top of the resources labels (on the left)
- 'deploy', 'cpuset', 'besteffort', 'network_address', 'type', 'drain');
+ 'deploy', 'cpuset', 'disk', 'diskid', 'besteffort', 'host', 'type', 'drain');
$CONF['resource_hierarchy'] = array( // properties to use to build the resource hierarchy drawing
- 'network_address','cpuset',
+ 'host','type','cpuset', 'diskid',
);
-$CONF['resource_base'] = "cpuset"; // base resource of the hierarchy/grid
-$CONF['resource_group_level'] = "network_address"; // level of resources to separate with blue lines in the grid
+$CONF['resource_base'] = "diskid"; // base resource of the hierarchy/grid
+$CONF['resource_group_level'] = "host"; // level of resources to separate with blue lines in the grid
$CONF['resource_drain_property'] = "drain"; // if set, must also be one of the resource_properties above to activate the functionnality
$CONF['state_colors'] = array( // colors for the states of the resources in the gantt
'Absent' => 'url(#absentPattern)', 'Suspected' => 'url(#suspectedPattern)', 'Dead' => 'url(#deadPattern)', 'Standby' => 'url(#standbyPattern)', 'Drain' => 'url(#drainPattern)');
EOF
Please note that this may not be the best way to display both compute and disk jobs: setting up a web page with 2 connected gantt charts, one for the compute jobs and one for the disk jobs, could be more readable.
=====Use case=====
Reserve 4 disks on each of 2 hosts, for 10 days:
user1$ oarsub -t noop -l {"type='disk'"}/host=2/disk=4,walltime=240:0:0 "sleep 10d"
The //disk// property for the reserved hosts is now: ''disk=/user1:4/''
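This can be checked, for instance using the ''--sql'' filter of ''oarnodes'' (hypothetical check):
user1$ oarnodes --sql "type='default' AND disk LIKE '%/user1:%'"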
Submit a //compute// job on resources that have those disks:
user1$ oarsub -t with_disk=4 -l host=2 "sleep 1h"
or, if filtering on the exact disk count does not matter, just:
user1$ oarsub -t with_disk -l host=2 "sleep 1h"
Under the hood, the first form adds the property filter ''disk like '%/user1:4/%''' (the second form uses ''disk like '%/user1:%''').
Regarding the first command (the disk reservation), it could be made simpler using yet another custom job type, e.g.:
user1$ oarsub -t reserve_disk -l /host=2/disk=4,walltime=240:0:0 "sleep 10d"
(with ''-t noop'' and the ''{"type='disk'"}'' filter being set under the hood by yet another admission rule, as sketched below).
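A rough, untested sketch of what such a ''reserve_disk'' rule could look like, reusing the variables seen in the rules above (''$type_list'', ''$jobproperties_applied_after_validation''); the ''reserve_disk'' type would also have to be added to the type checking rule, and the default filtering on //type='default'// resources must not be applied to such jobs (assumption to be verified):
# Hypothetical 'reserve_disk' admission rule (untested sketch)
if (grep(/^reserve_disk$/, @{$type_list})) {
    # disk reservation jobs do not execute anything on the nodes
    push(@{$type_list}, "noop");
    # restrict the resource request to the disk resources
    $jobproperties_applied_after_validation = "type = 'disk'";
}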
=====Limits=====
* This workaround does not allow placing and showing (i.e. scheduling) the //compute// jobs in the future (that is, before the //disk// job is running). //Compute// jobs can however be submitted ahead of time, they just stay unscheduled until then. Also, the scheduling decision could place //compute// jobs after the end of the running //disk// job, so that in the end they would never actually start.
* Users may want to select specific disks/hosts when using only a part of the reserved disks for a given compute job. This mechanism does not allow it.
* The ''reserve_disk'' and ''with_disk'' job types do not allow reserving both the disk and the compute resources at once (in a single ''oarsub'' command).
* A //with_disk// compute job could easily suffer from starvation since it will only be scheduled once the disk job is started:
* While batch jobs that have not started yet can be moved with regard to previous scheduling decisions, some may have started before the disk property of the resources was changed, making the resources whose disks are reserved unavailable for the duration of those jobs.
* Advance reservations could also have been accepted on those resources: resources are //booked// for an advance reservation as soon as its submission is accepted, possibly before the disk property is changed.