— | wiki:old:gsoc_advanced_scheduler_for_the_cigri_grid [2013/07/10 22:55] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | [[GSoC_Proposal_Cigri_Scheduler| See original proposal here]] | ||
+ | |||
+ | **Student and mentor ;), please read this page carefully...** | ||
+ | |||
+ | |||
+ | Student: Elton Mathias | ||
+ | |||
+ | Mentor: Bruno Bzeznik | ||
+ | |||
+ | Co-Mentor: Yiannis Georgiou | ||
+ | |||
+ | ---- | ||
+ | |||
+ | |||
+ | ===== Student: Things to do before starting ===== | ||
+ | |||
+ | * Already done: g5k account, svn account, branch into cigri svn | ||
+ | * Meeting in Grenoble, 11-12 May | ||
+ | * Agenda (to be completed): | ||
+ | * Presentation of CiGri and its code to Elton | ||
+ | * Setting up the development platform (g5k environment, test environment on CIMENT) | ||
+ | * Status review of the CiGri code | ||
+ | * **Versioning | ||
+ | * **Ongoing or experimental work, | ||
+ | * **Production platforms and recurring needs | ||
+ | * **CIRA: RaGrid (and its needs), Egee | ||
+ | * Review of scheduling in CiGri with Olivier Richard and Yiannis Georgiou | ||
+ | * **Prediction | ||
+ | * **Fairsharing | ||
+ | * **The small-jobs problem | ||
+ | * **Rules for | ||
+ | * **Priorities | ||
+ | * **Multiple queues | ||
+ | * Meeting with Arnaud Legrand and Rémi Bertin, specialists in | ||
+ | * Brainstorming with Arnaud, Olivier, Rémi, Yiannis, Joseph, Bruno,... | ||
+ | * **Define the required features | ||
+ | * **Architecture of the new CiGri scheduler | ||
+ | * Definition of tasks and ranking by priority | ||
+ | * Roadmap | ||
+ | |||
+ | ===== 29 April' | ||
+ | |||
+ | **Presents**: | ||
+ | |||
+ | We tried to list the items we have to work on and to check which of them have an impact on the scheduler. I (Bruno) don't want Elton to work on all of these items: the scheduler comes first, and the other subjects follow if there is time left. | ||
+ | |||
+ | * Things that have an impact on the scheduler: | ||
+ | * **Evolutive scheduler** that can be modified (plug-ins?) to test different algorithms (that' | ||
+ | |||
+ | * **Small test campaigns problem**: that's the main problem we want to solve: we want " | ||
+ | |||
+ | * **parallel jobs**: that's the second problem we want to solve: to be able to run campaigns of parallel jobs using several nodes of a given cluster. // | ||
+ | * **Priorities** | ||
+ | * **a simple example: when Lyon users and Grenoble users both have running campaigns, we want Lyon users' jobs to run on the Lyon clusters and Grenoble users' jobs to run on the Grenoble clusters. Grenoble jobs may be submitted to Lyon clusters if some Lyon resources are free. | ||
+ | * **we can also imagine another priority level by projects, like the OAR fairsharing currently does | ||
+ | * **Fairsharing**: | ||
+ | * **File transfer of input data per job**: the idea is to add a parameter (or something similar) to the JDL to match input data with jobs, so that only the needed data is transferred before a job starts. It may be interesting to have information such as data size and network bandwidth, so that scheduling can take the transfer time into account | ||
+ | * **End of campaigns optimization**: | ||
+ | * **Checkpointing**: | ||
+ | * **A checkpointed job may have to be restarted prior to a non-checkpointed one | ||
+ | * **A checkpointed job may have to be restarted on the same cluster | ||
+ | * **Data of the checkpoint needs to be there | ||
+ | * **Dependencies**: | ||
+ | * **Distributed filesystem** (may help for checkpointing, | ||
+ | |||
+ | * Things not directly linked to the scheduling | ||
+ | * **array jobs support**. Warning: contention may occur with "bad jobs", and it may be interesting to use the profiling data of the predictor (see Fairsharing above) to check whether we can run, for example, 1000 jobs on the same cluster without damaging the NFS server // | ||
+ | * **The sink effect**: the problem of jobs that finish in nearly 0 seconds on a given cluster without exiting with an error status | ||
+ | * **OAR API support**: we have to recode the clusterqueries to take advantage of the new RESTful OAR API and possibly stop using ssh and sudo. The OAR API should probably evolve to provide the necessary gateways for CiGri. // | ||
+ | |||
+ | |||
+ | // | ||
+ | |||
+ | |||
+ | //**(1)**// //Maybe, a way to do that would be to have a design like the gLite WM (but simplified), | ||
+ | |||
+ | //**(2)**// //Does this imply preemption, or is the idea just to bypass the campaign jobs in the queue? Either way, this seems to me to be something that should be handled at the scheduling algorithm level (thinking of a lazy strategy, I think I should call it a match maker) // | ||
+ | |||
+ | // | ||
+ | |||
+ | // | ||
+ | |||
+ | // | ||
+ | |||
+ | // | ||
+ | |||
+ | // | ||
+ | |||
+ | // | ||
+ | --[[User: | ||
+ | |||
+ | |||
+ | ===== Project Pseudo-Specification ===== | ||
+ | |||
+ | Following the brainstorming held on 11-13 May in Grenoble, we (Bruno, Olivier, Yiannis, Arnaud L. and Elton) have defined an initial architecture for the new CiGri scheduling module. | ||
+ | |||
+ | ==== Required Features (to be updated) ==== | ||
+ | * Support for different campaign types (for now: test and default) | ||
+ | * Support for ruby-based admission rules | ||
+ | * Multiple scheduling modes (for now: test, prediction, workqueue and finalization) | ||
+ | * Support for parallel jobs (jobs that require more than one resource) | ||
+ | * Support for job requirements in the JDL definition: | ||
+ | * resources properties (parallel to ' | ||
+ | * hierarchy of resources for parallel jobs (parallel to ' | ||
+ | |||
+ | ==== Campaign Types Definition ==== | ||
+ | The new CiGri infrastructure will include support for different campaign types. The campaign type is defined by users when submitting campaigns through the parameter ' | ||
+ | |||
+ | For now, we have defined two basic campaign types: | ||
+ | * **default campaigns**: | ||
+ | * **test campaigns**: | ||
+ | |||
+ | ==== General Architecture ==== | ||
+ | |||
+ | The general architecture of the new scheduling module is composed of three main entities: a metascheduler, | ||
+ | |||
+ | [[Image: | ||
+ | |||
+ | The next subsections present each of these entities and their operation in more detail. At the end of this section, a diagram gives a step-by-step example that clarifies these concepts. | ||
+ | |||
+ | === Updator === | ||
+ | The updator module works as a probe that updates a table on the CiGri DB with snapshots of the existing resource availability. <s>It works as an internal CiGri cache to avoid the cost of checking remotelly on every cluster the current state of resources</ | ||
+ | |||
+ | === Metascheduler === | ||
+ | Metascheduler is the brain of the whole scheduling mechanism. It is responsible for the orchestration of the different campaign schedulers based on resources availability, | ||
+ | The initial implementation of the metascheduler follows a FIFO-based approach (+ priority for test campaigns for the scheduling of the campaigns). But we expect, in the future, to provide a clever solution that might include campaign < | ||
+ | |||
+ | The initial pseudo-algorithm which describes the metascheduler operation is given by: | ||
+ | |||
+ | |||
+ | < | ||
+ | | ||
+ | if (there are resources available) | ||
+ | if (there is a test campaign to be scheduled on the available resources) | ||
+ | | ||
+ | else | ||
+ | | ||
+ | end else | ||
+ | end if | ||
+ | end while</ | ||
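The FIFO approach with priority for test campaigns described above can be illustrated with a minimal Python sketch. It is not the CiGri implementation: the `Campaign` record, the function names, and the rule that each launch consumes one free resource are assumptions made only for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Campaign:
    """Hypothetical campaign record; the field names are illustrative."""
    name: str
    kind: str  # "test" or "default"


def pick_next_campaign(queue):
    """Return the next campaign to schedule, or None if the queue is empty.

    Test campaigns are served first; otherwise plain FIFO order applies.
    """
    test = next((c for c in queue if c.kind == "test"), None)
    if test is not None:
        queue.remove(test)
        return test
    return queue.popleft() if queue else None


def metaschedule(queue, free_resources):
    """Yield campaigns in launch order while resources remain.

    Assumption for illustration only: each launch consumes one resource.
    """
    while free_resources > 0:
        campaign = pick_next_campaign(queue)
        if campaign is None:
            break
        free_resources -= 1
        yield campaign


queue = deque([
    Campaign("c1", "default"),
    Campaign("c2", "test"),
    Campaign("c3", "default"),
])
order = [c.name for c in metaschedule(queue, free_resources=2)]
print(order)
```

With two free resources, the test campaign `c2` is launched first, then `c1` in FIFO order, and `c3` stays queued for a later iteration.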
+ | |||
+ | NOTE --[[User: | ||
+ | |||
+ | //Well, isn't it:// | ||
+ | < | ||
+ | | ||
+ | | ||
+ | end while | ||
+ | | ||
+ | launch scheduler of the campaign on involved clusters | ||
+ | end while</ | ||
+ | //And the second fifo may be replaced by an order over users/ | ||
+ | |||
+ | //Supposing we have (user, | ||
+ | //We have:// | ||
+ | < | ||
+ | cluster1: | ||
+ | affinities: campaign1, campaign2 | ||
+ | others: campaign3 | ||
+ | cluster2: | ||
+ | affinities: campaign2 | ||
+ | others: campaign1, campaign3</ | ||
+ | //Then, the metascheduler launches:// | ||
+ | < | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | END NOTE --[[User: | ||
+ | |||
+ | === Scheduler === | ||
+ | The scheduler is the module which is responsible for the scheduling of a given campaign (the jobs defined on the campaign definition). This process includes resource matching based on the resources offered by the metascheduler and estimations provided by the ' | ||
+ | |||
+ | The scheduling process also depends on a scheduling policy, which can be: | ||
+ | * test scheduling: this is the scheduling policy specifically designed for test campaigns. It schedules just one job per required cluster, dropping all the other defined jobs. | ||
+ | * workqueue scheduling: this is the default policy. The scheduler checks the amount of available resources and launches an equivalent number of jobs to fill them up. This resource matching depends on a prediction of the resources required by each job (more about it in the predictor section) | ||
+ | * finalize scheduling: to accelerate the closing of campaigns, the scheduler promotes the replication of remaining jobs. This is especially important when jobs have failed for an unknown reason or were killed externally. Once these jobs have finished successfully, | ||
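The three policies can be pictured as interchangeable functions behind one interface. The Python sketch below works under stated assumptions: the function names, the `pending`/`free`/`job_ratio` arguments, the choice of the first pending job as the test job, and the "replicate on every cluster that has free resources" rule are all hypothetical, since the replication policy itself is still to be defined.

```python
def schedule_test(pending, free, job_ratio):
    """Test policy: one job per cluster, all other jobs are dropped.

    Assumption: the first pending job is the one sent to every cluster.
    """
    if not pending:
        return {}
    return {cluster: [pending[0]] for cluster in free}


def schedule_workqueue(pending, free, job_ratio):
    """Default policy: fill each cluster according to its job ratio."""
    plan, remaining = {}, list(pending)
    for cluster, free_count in free.items():
        n = int(free_count * job_ratio.get(cluster, 1.0))
        plan[cluster], remaining = remaining[:n], remaining[n:]
    return plan


def schedule_finalize(pending, free, job_ratio):
    """Finalize policy: replicate the remaining jobs.

    Assumption: each remaining job is replicated on every cluster that
    has free resources, capped by the number of free resources.
    """
    return {c: list(pending[:n]) for c, n in free.items() if n > 0}


# Strategy table: the policy is selected by name.
POLICIES = {
    "test": schedule_test,
    "workqueue": schedule_workqueue,
    "finalize": schedule_finalize,
}

free = {"a": 4, "b": 2}            # free resources per cluster
job_ratio = {"a": 0.5, "b": 1.0}   # jobs per free resource
pending = ["j1", "j2", "j3", "j4", "j5"]

workqueue_plan = POLICIES["workqueue"](pending, free, job_ratio)
test_plan = POLICIES["test"](pending, free, job_ratio)
finalize_plan = POLICIES["finalize"](["j5"], free, job_ratio)
print(workqueue_plan, test_plan, finalize_plan)
```

Keeping the policies behind a single call signature means a new policy only needs a new entry in the table, which matches the goal of an evolutive scheduler.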
+ | |||
+ | === Predictor === | ||
+ | The predictor module is responsible for following the progress of the campaigns scheduling, based on two estimations: | ||
+ | * the throughput of jobs (in jobs/time): this may offer users an estimate of when the campaign is expected to end, assuming an average elapsed time for the jobs; | ||
+ | * the ratio resources/ | ||
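As a rough illustration of the throughput-based estimate, the Python sketch below derives the remaining time of a campaign from the jobs finished so far. The function name and its arguments are assumptions for this example; the actual predictor may use different inputs.

```python
def estimate_remaining_seconds(finished_jobs, elapsed_seconds, remaining_jobs):
    """Estimate the time left for a campaign from its observed throughput.

    throughput = finished_jobs / elapsed_seconds (jobs per second), so the
    remaining time is remaining_jobs / throughput. Returns None when no job
    has finished yet, because no throughput can be measured.
    """
    if finished_jobs <= 0 or elapsed_seconds <= 0:
        return None
    # Same value as remaining_jobs / throughput, written to limit rounding.
    return remaining_jobs * elapsed_seconds / finished_jobs


# 120 jobs finished in one hour, 60 jobs still to run.
eta = estimate_remaining_seconds(120, 3600, 60)
print(eta)
```

The estimate assumes the throughput stays constant, which is only reasonable when the jobs of a campaign have similar durations.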
+ | |||
+ | ==== Scheduling Control/ | ||
+ | |||
+ | The following diagram depicts briefly the general operation and control/ | ||
+ | |||
+ | [[Image: | ||
+ | |||
+ | In order to follow the CiGri philosophy, the communication between different modules is done, whenever convenient, through the database. The scheduling follows an iterative operation given by the following steps: | ||
+ | |||
+ | * 1. Initially, the CiGri **Almighty** module activates the **Updator** module, which retrieves information about resource utilization and stores it into the // | ||
+ | * 2. Once the **Updator** has finished its work, the **Metascheduler** is activated and checks resource availability, | ||
+ | * 3. According to this match, the **Metascheduler** activates the **Campaign Schedulers**, | ||
+ | * 4. When the **Campaign Schedulers** are activated, they perform resource matching based on an estimate of the required resources available in the // | ||
+ | * 5. Based on the //ratio//, the different **Campaign Schedulers** activate the **Runner** module to handle the job submission; | ||
+ | * 6. The **Runner** module prepares the environment for job execution (with the help of external modules) and submits the jobs through the clusters' R.M.; | ||
+ | * 7. After this submission, the **Almighty** activates the **Predictor** module, which gets information from the R.M. job queues and the number of jobs submitted in the last scheduling iteration, and calculates a new ratio based on the number of jobs that are actually running and the number of waiting jobs. For information, | ||
+ | * 8. The **Predictor** module stores its data in the forecasts table. | ||
+ | * 9. Once the prediction is done, the **Runner** module (to be defined) is activated again to remove excess waiting jobs, while leaving a small number of waiting jobs to optimize resource usage. The aborted jobs will be rescheduled in a later iteration. | ||
+ | |||
+ | After this, the iteration starts again with the Updator module, the metascheduler and the campaign schedulers. The difference now is that the forecasts table contains valuable information from the previous iteration, so the scheduling can adapt to the resource requirements. | ||
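The iteration can be sketched in Python as an ordered list of modules that communicate only through a shared store, here a plain dictionary standing in for the database. Every function body below is a stub invented for this example, and the keys (`free`, `job_ratio`, `waiting`, `forecasts`) are assumptions rather than the CiGri schema.

```python
# Order in which the Almighty activates the modules during one iteration.
ITERATION_ORDER = (
    "updator", "metascheduler", "campaign_schedulers",
    "runner_submit", "predictor", "runner_cleanup",
)


def updator(db):
    db["free"] = 4  # stand-in for probing the clusters


def metascheduler(db):
    db["active"] = ["campaign1"] if db["free"] > 0 else []


def campaign_schedulers(db):
    # Number of jobs to submit follows the job ratio of the last forecast.
    db["to_submit"] = int(db["free"] * db["job_ratio"]) if db["active"] else 0


def runner_submit(db):
    db["waiting"] += db["to_submit"]


def predictor(db):
    db["forecasts"].append(
        {"submitted": db["to_submit"], "waiting": db["waiting"]}
    )


def runner_cleanup(db):
    # Remove excess waiting jobs but keep a small allowed backlog.
    db["waiting"] = min(db["waiting"], db["allowed_waiting"])


MODULES = {
    "updator": updator,
    "metascheduler": metascheduler,
    "campaign_schedulers": campaign_schedulers,
    "runner_submit": runner_submit,
    "predictor": predictor,
    "runner_cleanup": runner_cleanup,
}


def run_iteration(db):
    """Run one scheduling iteration and return the activation trace."""
    trace = []
    for name in ITERATION_ORDER:
        MODULES[name](db)
        trace.append(name)
    return trace


db = {"job_ratio": 1.0, "waiting": 0, "allowed_waiting": 2, "forecasts": []}
trace = run_iteration(db)
print(trace, db["waiting"], db["forecasts"])
```

Because the order lives in a single tuple, moving the predictor to run right after the updator only requires reordering `ITERATION_ORDER`.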
+ | |||
+ | |||
+ | NOTE --[[User: | ||
+ | //I think that step 7 should be placed just after step 1 (the predictor is run just after the updator), because the updator not only updates the status table but also the statuses of the jobs (running, waiting, finished, error...). The predictor will adjust the resources/ | ||
+ | END NOTE --[[User: | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Schedule Staging | ||
+ | |||
+ | < | ||
+ | TODO: Waiting for Bruno's approval to add this feature</ | ||
+ | |||
+ | |||
+ | ===== Prediction Algorithms | ||
+ | |||
+ | Prediction is used to optimize submission so that the launched jobs fully use the resources without flooding the RM waiting queue. The idea is to adapt job dispatch to the last prediction and the current state of the resources. The job_ratio is the inverse of the number of resources needed by a single job (so that, consistently, a bigger job_ratio leads to more jobs being submitted). | ||
+ | | ||
+ | The adjustment can be calculated as follows: | ||
+ | |||
+ | < | ||
+ | if (number of jobs waiting > allowed) | ||
+ | reduce job_ratio | ||
+ | else | ||
+ | increase job_ratio</ | ||
+ | |||
+ | What changes from one algorithm to another is how the job ratio is increased or decreased. Several algorithms are proposed: | ||
+ | |||
+ | **Slow-start**: | ||
+ | * increase is exponential: | ||
+ | < | ||
+ | JR = OLD_JR * 2</ | ||
+ | * decrease to nearly 0: if there are excess jobs waiting in the queue, the job_ratio is decreased to 1 single job/cluster | ||
+ | < | ||
+ | JR = 1/ | ||
+ | |||
+ | **Two-way Adaptive**: | ||
+ | * increase is adaptive: if there are no excess jobs in the queue, we calculate a new job ratio based on the number of free resources and the previous job ratio: | ||
+ | < | ||
+ | JR = 1/ | ||
+ | JR = 1/ | ||
+ | |||
+ | * decrease is adaptive: if there are excess jobs waiting in the queue, the job_ratio is adapted based on the previous job ratio (considering the previous resource status) and the load it generated: | ||
+ | < | ||
+ | JR = (NB_JOBS_SUBMITTED - EXCESS) / NB_JOBS_SUBMITTED | ||
+ | JR = ((OLD_JR * OLD_FREE) - (NB_WAITING_JOBS - NB_ALLOWED_WAITING_JOBS)) / (OLD_JR * OLD_FREE)</ | ||
+ | |||
+ | This prediction algorithm seems to adapt very well to parallel jobs as the number of resources/ | ||
+ | |||
+ | |||
+ | **Adaptive-down, | ||
+ | |||
+ | * increase is exponential: | ||
+ | < | ||
+ | JR = OLD_JR * 2</ | ||
+ | |||
+ | * decrease is adaptive: if there are excess jobs waiting in the queue, the job_ratio is adapted based on the previous job ratio (considering the previous resource status) and the load it generated: | ||
+ | < | ||
+ | JR = (NB_JOBS_SUBMITTED - EXCESS) / NB_JOBS_SUBMITTED | ||
+ | JR = ((OLD_JR * OLD_FREE) - (NB_WAITING_JOBS - NB_ALLOWED_WAITING_JOBS)) / (OLD_JR * OLD_FREE)</ | ||
+ | |||
+ | This algorithm is especially useful if a grid is used to execute a large number of small jobs. | ||
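To make the last variant concrete, here is a Python sketch of one adjustment step with an exponential increase and an adaptive decrease. The two formulas are transcribed literally from the text above; the function name, the argument names, the clamp at 0, and the handling of an empty previous submission are assumptions added for this example.

```python
def adjust_job_ratio(old_jr, old_free, nb_waiting, nb_allowed_waiting):
    """Return the new job ratio after one prediction step.

    Increase: JR = OLD_JR * 2 when the waiting queue is within its limit.
    Decrease: JR = (submitted - excess) / submitted, where
    submitted = OLD_JR * OLD_FREE and
    excess = NB_WAITING_JOBS - NB_ALLOWED_WAITING_JOBS.
    """
    excess = nb_waiting - nb_allowed_waiting
    if excess <= 0:
        return old_jr * 2
    submitted = old_jr * old_free
    if submitted <= 0:
        # Assumption: nothing was submitted, so keep the previous ratio.
        return old_jr
    # Assumption: clamp at 0 so the ratio never becomes negative.
    return max(0.0, (submitted - excess) / submitted)


# 10 jobs were submitted (ratio 1.0 on 10 free resources), 5 are waiting
# and 1 waiting job is allowed: 4 jobs are in excess.
decreased = adjust_job_ratio(1.0, 10, nb_waiting=5, nb_allowed_waiting=1)

# No excess in the queue: the ratio doubles.
increased = adjust_job_ratio(0.25, 8, nb_waiting=0, nb_allowed_waiting=1)
print(decreased, increased)
```

The text does not say whether the decrease formula is the new ratio itself or a factor to apply to OLD_JR; the sketch follows the formula as written and returns it as the new ratio.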
+ | |||
+ | --[[User: | ||
+ | |||
+ | ===== Initial implementation Workplan ===== | ||
+ | This section (to be completed as the specification evolves) contains the sequence of steps necessary to achieve the desired characteristics: | ||
+ | |||
+ | * < | ||
+ | * <s>2. Include a preliminary implementation of ruby-based admission rules for CiGri | ||
+ | * 2.1 Define/ | ||
+ | * 2.2 Create some simple rules for the ' | ||
+ | * 3. Support to job requirements at JDL definition (equivalent to OAR -p) | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * 8.2. The Metascheduler will be implemented using some well-known OO design patterns to improve extensibility and code reuse, mainly for the following activities. | ||
+ | * **< | ||
+ | * **< | ||
+ | * **< | ||
+ | * < | ||
+ | * 9.1 the behavior is to be defined through a strategy design pattern. | ||
+ | * < | ||
+ | * **< | ||
+ | * **< | ||
+ | * **9.2.3 finalize scheduling: replicate the remaining jobs of a given campaign (the policy to be defined) | ||
+ | |||
+ | Obs: | ||
+ | * phases 1-3 are self-contained and will be used to get a closer look at the CiGri code, as well as to learn how to test and evaluate the implementation. After phases 2 and 3, a working snapshot that solves the problem of small test campaigns is to be obtained. | ||
+ | * phases 4-7 are interdependent and will be enough to provide support for parallel jobs and an adequate prediction policy (even if the predictions are not used by the scheduler). After phase 7, a working snapshot is to be obtained. | ||
+ | * phases 8 and 9 depend on 1-7 and provide the core implementation of the new scheduling module. | ||
+ | * after phase 9, the implementation will be refined to support new scheduling strategies that might take into account campaign priorities, user accounting, ... | ||
+ | |||
+ | --[[User: | ||
+ | |||
+ | ===== Main Milestones ===== | ||
+ | |||
+ | **19th June: Campaign types and Parallel Jobs** | ||
+ | |||
+ | * Campaign types: | ||
+ | On r484 and r485 I included the implementation of CiGri campaign types. | ||
+ | |||
+ | * Parallel CiGri Jobs (r490): | ||
+ | Resources requirement can be defined on JDL, by means of " | ||
+ | The Nikita module is responsible for verifying jobs waiting on RM queue and control cluster job flooding (based on a CiGri parameter) | ||
+ | Now, as in OAR, CiGri exports environment variables, which can be used by CiGri jobs: | ||
+ | < | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | **10th July: New prediction (Spritz) module and extension of gridstat (r495) ** | ||
+ | |||
+ | * Changed forecasts DB: | ||
+ | * predictions are now per cluster, and not for the whole campaign | ||
+ | * forecast table includes job ratio prediction (to guide scheduling) | ||
+ | * forecast table includes historical data | ||
+ | |||
+ | * Added new Spritz implementation: | ||
+ | * the spritz now predicts: | ||
+ | * **end time (based on throughput and average duration time) | ||
+ | * **jobratio (based on submission, waiting jobs and flood rate) | ||
+ | * two prediction algorithms were included | ||
+ | * **slow-start (like TCP, flow congestion control) | ||
+ | * **adaptive | ||
+ | |||
+ | * Extended gridstat Qfunction, to ease handling big campaigns: | ||
+ | * normal gridstat just briefly displays the global information about a campaign (waiting/ | ||
+ | * gridstat now features ' | ||
+ | |||
+ | * Added jobratio + FIFO based scheduler: | ||
+ | * this perl scheduler takes into account job ratio predictions to solve the problem of parallel job scheduling and job flood control. | ||
+ | * the core is pretty small (10 lines), so, it'll be easy to translate it to ruby (next step of GSoC project) | ||
+ | |||
+ | * Changed Almighty execution order: | ||
+ | * now, Spritz module runs just after Updator, to help getting fresh data | ||
+ | |||
+ | |||
+ | **6th August: Included implementation of ruby MetaScheduler and refined Spritz algorithms (r503) ** | ||
+ | |||
+ | * Spritz (aka Predictor): included 3 prediction algorithms: | ||
+ | * slow-start | ||
+ | * two-way adaptive | ||
+ | * adaptive-down, | ||
+ | |||
+ | * Reimplement some parts of Iolib/ | ||
+ | * Iolib/ | ||
+ | * Iolib/ | ||
+ | * Iolib/ | ||
+ | |||
+ | * MetaScheduler: | ||
+ | * FIFO based | ||
+ | * prototype includes MetaScheduler + CampaignScheduler all-in-one | ||
+ | |||
+ | * Bugfixes for #8215, #8295, #8355 and #8355 | ||
+ | |||
+ | **13th August: Separation of MetaScheduler and CampaignSchedulers (r505) ** | ||
+ | * added implementation of " | ||
+ | * properly reimplemented support for test jobs in the new scheduling system | ||
+ | |||
+ | **15th August: Improvements on gridstat tool (r508) ** | ||
+ | * support for multi-campaign view through summary | ||
+ | |||
+ | **17th August: Improvements on gridstat tool (r508) ** | ||
+ | * Added visualization of errors to fix on gridstat | ||
+ | * Added summary visualization by user | ||
+ | |||
+ | --[[User: | ||
+ | |||
+ | ===== TODO list ===== | ||
+ | ==== Mentor ==== | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * Implement OAR array jobs support into cigri | ||
+ | * Move the CIGRI web site into OAR web site and update it | ||
+ | |||
+ | ==== Student ==== | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | * Study the fairsharing problem, scheduling algorithms, and " | ||
+ | * < | ||
+ | * < | ||
+ | * < | ||
+ | |||
+ | --[[User: | ||
+ | |||
+ | ===== Links to look at ===== | ||