OAR

What is OAR?

OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters and other computing infrastructures (such as experimental testbeds for distributed computing, where versatility is key).

Overview

The OAR architecture is based on a database (PostgreSQL, preferred, or MySQL), a scripting language (Perl) and an optional scalable administration tool (e.g. TakTuk). It is composed of modules that interact mainly through the database and run as independent programs. Formally, therefore, there is no API: system interaction is entirely defined by the database schema. This approach eases the development of specific modules; indeed, each module (such as a scheduler) may be written in any language that has a database access library.
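To illustrate what "no API, only a database schema" means in practice, here is a minimal Python sketch in which two independent "modules" cooperate solely by reading and writing rows. It assumes an in-memory SQLite database as a stand-in for PostgreSQL/MySQL, and the `jobs` table, its columns and its state names are a simplified, hypothetical schema rather than OAR's real one; `submit` and `schedule` are illustrative names, not OAR commands.

```python
import sqlite3

# In-memory SQLite stands in for OAR's PostgreSQL/MySQL database.
# The "jobs" table below is a simplified, HYPOTHETICAL schema, not OAR's.
SCHEMA = """
CREATE TABLE jobs (
    job_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    command TEXT NOT NULL,
    state   TEXT NOT NULL DEFAULT 'Waiting'
)
"""


def submit(conn: sqlite3.Connection, command: str) -> int:
    """Submission 'module': it only writes a row and never calls the scheduler."""
    cur = conn.execute("INSERT INTO jobs (command) VALUES (?)", (command,))
    conn.commit()
    return cur.lastrowid


def schedule(conn: sqlite3.Connection) -> list[int]:
    """Scheduler 'module': reads waiting jobs and updates their state in place."""
    waiting = [row[0] for row in conn.execute(
        "SELECT job_id FROM jobs WHERE state = 'Waiting' ORDER BY job_id")]
    conn.executemany("UPDATE jobs SET state = 'toLaunch' WHERE job_id = ?",
                     [(job_id,) for job_id in waiting])
    conn.commit()
    return waiting


def job_states(conn: sqlite3.Connection) -> dict[int, str]:
    """Any other module (e.g. a monitoring tool) just queries the same table."""
    return dict(conn.execute("SELECT job_id, state FROM jobs ORDER BY job_id"))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    submit(conn, "./my_app")
    submit(conn, "./other_app")
    print(schedule(conn))    # [1, 2]
    print(job_states(conn))  # {1: 'toLaunch', 2: 'toLaunch'}
```

Because neither function calls the other, either one could be rewritten in another language with a database driver without touching its counterpart.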

Main features

Batch and Interactive jobs
Advance Reservation
Admission rules
Walltime
Matching of resources (job/node properties)
Hold and resume jobs
Multi-scheduler support (simple FIFO and FIFO with matching)
Multi-queues with priority
Best-effort queues (for exploiting idle resources)
Check compute nodes before launching
Epilogue/Prologue scripts
Jobs and resources visualization tools (Monika, Drawgantt)
No daemon on compute nodes
SSH based remote execution protocols (managed by TakTuk)
Dynamic insertion/deletion of compute nodes
Logging
Backfilling
First-fit scheduler with resource matching
On demand OS deployment support with Kadeploy3 coupling
Grid computing support with Cigri
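The "matching of resources" and "first-fit" items can be sketched as follows: a job asks for a number of resources whose properties satisfy a constraint, and the scheduler takes the first free ones in resource order. This is a simplified illustration, not OAR's scheduler; the `Resource` class, the `mem` property and the use of a Python predicate to stand in for the job's property constraint are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Resource:
    resource_id: int
    host: str
    properties: dict = field(default_factory=dict)  # e.g. {"mem": 32} (hypothetical)
    busy: bool = False


def first_fit(resources: list[Resource], count: int,
              matches: Callable[[dict], bool]) -> Optional[list[int]]:
    """Return the ids of the first `count` free resources (in id order) whose
    properties satisfy `matches`, or None if too few resources qualify."""
    if count <= 0:
        return []
    chosen = []
    for res in sorted(resources, key=lambda r: r.resource_id):
        if not res.busy and matches(res.properties):
            chosen.append(res.resource_id)
            if len(chosen) == count:
                return chosen
    return None


if __name__ == "__main__":
    pool = [
        Resource(1, "node1", {"mem": 8}),
        Resource(2, "node2", {"mem": 32}, busy=True),
        Resource(3, "node3", {"mem": 32}),
        Resource(4, "node4", {"mem": 64}),
    ]
    print(first_fit(pool, 2, lambda p: p["mem"] >= 32))  # [3, 4]
    print(first_fit(pool, 3, lambda p: p["mem"] >= 32))  # None
```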

Why use OAR

A better resource management

By using a Linux kernel feature called cpusets, OAR 2 manages resources more reliably:

No unattended processes should remain from previous jobs.
Access to the resources is now restricted to the owner of the resources.

Besides, features such as job dependencies and checkpointing are now available, allowing better use of resources.

A cpuset is attached to every process and makes it possible:

to specify which processors and memory a process can use, e.g. the resources allocated to the job in the OAR 2 context;
to group and identify the processes that share the same cpuset, e.g. the processes of a job in the OAR 2 context, so that actions such as clean-up can be performed efficiently (here, cpusets replace the process group/session concept, which is not efficient in Linux).
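The two cpuset roles listed above (restricting where a job's processes may run, and grouping them so the whole job can be cleaned up at once) can be modelled with a toy in-memory structure. This Python sketch does not touch the kernel: real cpusets are managed through the cpuset/cgroup filesystem, and a real clean-up would signal every task listed there. The class, its methods and the cpuset name `job_42` are hypothetical.

```python
class CpusetModel:
    """Toy, in-memory model of the two cpuset roles (no kernel interaction)."""

    def __init__(self) -> None:
        self.cpus: dict[str, set[int]] = {}   # cpuset name -> CPUs it may use
        self.tasks: dict[str, set[int]] = {}  # cpuset name -> attached PIDs

    def create(self, name: str, cpus: list[int]) -> None:
        self.cpus[name] = set(cpus)
        self.tasks[name] = set()

    def attach(self, name: str, pid: int) -> None:
        # A process joins the job's cpuset (in Linux, forked children inherit it).
        self.tasks[name].add(pid)

    def allowed_cpus(self, pid: int) -> set[int]:
        # Role 1: a process may only run on the CPUs of its cpuset.
        for name, pids in self.tasks.items():
            if pid in pids:
                return self.cpus[name]
        raise KeyError(f"pid {pid} is not in any cpuset")

    def cleanup(self, name: str) -> list[int]:
        # Role 2: the cpuset identifies every process of the job, so none is
        # missed; return the PIDs that would be killed, then drop the cpuset.
        leftover = sorted(self.tasks.pop(name))
        del self.cpus[name]
        return leftover


if __name__ == "__main__":
    m = CpusetModel()
    m.create("job_42", [0, 1])
    m.attach("job_42", 1001)
    m.attach("job_42", 1002)     # e.g. a forked child of 1001
    print(m.allowed_cpus(1002))  # {0, 1}
    print(m.cleanup("job_42"))   # [1001, 1002]
```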

Multi-cluster

OAR 2 can manage complex hierarchies of resources, for example:

1. clusters
2. switches
3. nodes
4. CPUs
5. cores
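A hierarchical request walks such a tree level by level: for instance, asking for 1 switch, 2 nodes on that switch and 2 cores on each node (written in the style of OAR's `/switch=1/nodes=2/core=2` requests) only succeeds if a single switch can supply two suitable nodes. The Python sketch below illustrates this under stated assumptions: the topology is hypothetical, the cluster and CPU levels are omitted for brevity, and the request must give a count for every level down to the cores.

```python
from typing import Optional, Union

Tree = Union[dict, list]

# Hypothetical topology: switch -> node -> list of free core ids.
TOPOLOGY = {
    "sw1": {"node1": [0, 1], "node2": [0]},
    "sw2": {"node3": [0, 1, 2, 3], "node4": [2, 3]},
}


def select(tree: Tree, counts: list[int]) -> Optional[Tree]:
    """Pick counts[0] children of `tree`, each able to satisfy counts[1:].

    Returns the nested selection, or None if the request cannot be met.
    """
    count, rest = counts[0], counts[1:]
    if isinstance(tree, list):  # bottom level: free core ids
        if rest:
            raise ValueError("request is deeper than the topology")
        return sorted(tree)[:count] if len(tree) >= count else None
    if not rest:
        raise ValueError("request must give a count for every level")
    picked = {}
    for name in sorted(tree):
        sub = select(tree[name], rest)
        if sub is not None:  # keep only children that satisfy the whole sub-request
            picked[name] = sub
            if len(picked) == count:
                return picked
    return None


if __name__ == "__main__":
    # 1 switch / 2 nodes / 2 cores: sw1 fails (node2 has one free core), sw2 succeeds.
    print(select(TOPOLOGY, [1, 2, 2]))  # {'sw2': {'node3': [0, 1], 'node4': [2, 3]}}
    print(select(TOPOLOGY, [1, 1, 5]))  # None
```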

A modern cluster management system

By providing a mechanism to isolate jobs at the core level, OAR 2 is one of the most modern cluster management systems. Users developing cluster or grid algorithms and programs thus work in an up-to-date environment, similar to the ones they will meet with other recent cluster management systems, for instance on production platforms.

Optimization of the resources usage

Nowadays, machines with more than 4 cores are common, so it is very important to handle cores efficiently. By providing resource selection and process isolation at the core level, OAR 2 lets users whose experiments do not require exclusive use of a node (at least during a preparation phase) access many nodes while using only one core on each, leaving the remaining cores free for other users. This increases the number of resources available to everyone.
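As a worked example with hypothetical numbers: on a 10-node cluster with 8 cores per node, a job that needs one core on each of 4 nodes blocks 32 cores under node-exclusive allocation but only 4 under core-level allocation. The small Python function below performs that arithmetic; its name and parameters are illustrative only.

```python
def cores_left(cluster_nodes: int, cores_per_node: int,
               job_nodes: int, job_cores_per_node: int,
               node_exclusive: bool) -> int:
    """Cores still available to other users while the job runs."""
    total = cluster_nodes * cores_per_node
    # With node-exclusive allocation the job blocks every core of each node
    # it touches; with core-level allocation it blocks only the cores it uses.
    per_node = cores_per_node if node_exclusive else job_cores_per_node
    return total - job_nodes * per_node


if __name__ == "__main__":
    print(cores_left(10, 8, 4, 1, node_exclusive=True))   # 48
    print(cores_left(10, 8, 4, 1, node_exclusive=False))  # 76
```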

OAR 2 also provides a time-sharing feature that allows the same set of resources to be shared among users.

Easier access to the resources

When the OAR 2 OARSH connector is used to access job resources, basic usage no longer requires users to configure their SSH environment, since everything is handled internally (management of known host keys, etc.). Users who prefer not to use OARSH can still use SSH directly, at the cost of setting a few options (hiding these options is precisely one of the features of the OARSH wrapper).

Grid resources interconnection

Since access to a cluster's resources is restricted to the job attached to them, one may wonder whether connections from job to job, cluster to cluster, or site to site are still possible. OAR 2 provides a mechanism called job-key that allows inter-job communication, even across several sites managed by different OAR 2 servers (OARGrid2, for instance, uses this mechanism).

Management of abstract resources

OAR 2 features a mechanism to manage resources like software licenses or other non-material resources the same way it manages classical resources.
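One way to picture "the same way it manages classical resources" is to store license tokens in the same resource list as cores, distinguished only by a type, and to allocate both through the same all-or-nothing routine. This is a hypothetical Python sketch, not OAR's implementation; the `Resource` class and the `core`/`license` type names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    resource_id: int
    kind: str          # hypothetical type names: "core" or "license"
    busy: bool = False


def allocate(resources: list[Resource], kind: str, count: int) -> Optional[list[int]]:
    """Reserve `count` free resources of the given kind, all or nothing."""
    free = [r for r in resources if r.kind == kind and not r.busy][:count]
    if len(free) < count:
        return None  # nothing is marked busy on failure
    for r in free:
        r.busy = True
    return [r.resource_id for r in free]


if __name__ == "__main__":
    # Four cores and two license tokens share one resource list.
    pool = [Resource(i, "core") for i in range(1, 5)] + \
           [Resource(i, "license") for i in (5, 6)]
    print(allocate(pool, "core", 2))     # [1, 2]
    print(allocate(pool, "license", 2))  # [5, 6]
    print(allocate(pool, "license", 1))  # None: no token left
```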

More technical details

OAR is an open-source batch scheduler that provides simple and flexible exploitation of a cluster.

It manages cluster resources like other traditional batch schedulers (such as PBS/Torque, LSF or SGE). In other words, it does not execute your job on the resources; it manages them (reservation, access granting) so that you can connect to these resources and use them.

Its design is based on high-level tools:

the relational database engine MySQL or PostgreSQL,
the scripting language Perl,
the cpuset confinement mechanism,
the scalable remote execution tool TakTuk.

It is flexible enough to manage large HPC clusters as well as other computing infrastructures, such as research testbeds like Grid'5000.

Features:

Only requires an SSH daemon on nodes.
Not tied to any specific library such as MPI: OAR supports any sort of parallel user application.
Cpuset/cgroup integration to restrict jobs to their assigned resources (cores, memory banks)
Configurable for multicore, hyperthreading and GPUs
Remote procedure calls using TakTuk, a large-scale remote execution and deployment tool.
Hierarchical resource requests (multiple/heterogeneous clusters support).
Gantt scheduling (can visualize the internal scheduler decisions).
Full or partial time-sharing.
Checkpoint/resubmit.
Support for software licenses servers (license tokens).
Best effort jobs: such a job is stopped automatically as soon as another job requires the resources.
Special job types:
- deploy: support for in-job OS deployment with software such as Kadeploy
- cosystem: support for a delegation to another job and resource management system
- noop: reservation-only jobs (no execution)
- placeholder, container: meta jobs
Batch and Interactive jobs
Advance reservations
Admission rules
Job walltime
Multi-scheduler support.
Multi-queues with priority
First-Fit Scheduler with conservative backfilling
Moldable tasks.
Epilogue/Prologue scripts.
Dynamic resources definition
Logging/Accounting.
Suspend/resume jobs.
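The "First-Fit Scheduler with conservative backfilling" entry can be illustrated with a simplified model: jobs are considered in queue order, and each one is reserved the earliest start time at which enough processors are free for its whole walltime, given the reservations already made. A later job may therefore start before an earlier one (backfilling), but never in a way that delays an earlier job's reservation (the conservative part). This Python sketch assumes identical processors, all jobs submitted at time 0 and runtimes equal to walltimes; it is not OAR's scheduler code, and the function names are hypothetical.

```python
Placement = tuple[int, int, int]  # (start, end, procs) of an already reserved job


def usage_at(placed: list[Placement], t: int) -> int:
    """Processors in use at time t (a job occupies [start, end))."""
    return sum(p for s, e, p in placed if s <= t < e)


def fits(placed: list[Placement], start: int, walltime: int,
         procs: int, capacity: int) -> bool:
    end = start + walltime
    # Usage only rises when a job starts, so checking `start` and every
    # reserved start inside the window is enough.
    points = [start] + [s for s, _, _ in placed if start < s < end]
    return all(usage_at(placed, t) + procs <= capacity for t in points)


def conservative_backfill(jobs, capacity):
    """jobs: (job_id, procs, walltime) in queue order -> {job_id: start}."""
    placed: list[Placement] = []
    starts = {}
    for job_id, procs, walltime in jobs:
        if procs > capacity:
            raise ValueError(f"job {job_id} can never fit")
        # The earliest feasible start is 0 or the end of some reserved job.
        for t in sorted({0} | {e for _, e, _ in placed}):
            if fits(placed, t, walltime, procs, capacity):  # first fit in time
                placed.append((t, t + walltime, procs))
                starts[job_id] = t
                break
    return starts


if __name__ == "__main__":
    queue = [("A", 2, 10), ("B", 4, 5), ("C", 2, 10), ("D", 2, 11)]
    print(conservative_backfill(queue, capacity=4))
    # {'A': 0, 'B': 10, 'C': 0, 'D': 15}
```

Here job `C` is backfilled at time 0 because it ends exactly when `B` starts, while `D` is pushed to 15 because starting it earlier would delay `B`.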
