Fault Tolerance Page

From WikiOAR

See original proposal here

Student, please read this page carefully...

Student: Joris Bremond

Mentor: Joseph Emeras

Co-Mentor: Olivier Richard

Student: Things to do before starting

get an account on grid5000: https://www.grid5000.fr/mediawiki/index.php/Grid5000:Get_an_account
get a svn account on the inria gforge: https://gforge.inria.fr/account/register.php (Mescal team)
connect to the g5k jabber and add mentor as contact

Project's specifications

MUST

work with security systems such as Kerberos
make the database fault tolerant
make the oar server fault tolerant
as lightweight as possible
well documented and packaged

SHOULD

independent of OAR's code

MAY

automatic redirection of requests to the backup database if the main database fails
automatic redirection of packets to the backup server if the main server crashes
load-balancing

How to start the project

Begin by studying DRBD and Ultra Monkey: what they are intended for, their limits, and their different configuration/setup options.
Test OAR: submission, job execution and management, see:
Test deploying an environment on OAR
Create your own environment, or modify an existing one, and save it.

Roadmap (and Timeline)

Official GSoC dates: 23 May to 17 August. Joris's availability: 1 June to 28 August. Since the dates almost match, the schedule follows the student's availability. The last 10 days, when GSoC is officially over but Joris's internship is not, are reserved for the documentation and packaging.

Important steps

23 May: Official gsoc start date

1 June: Joris begins his work

6-12 July: mid-term eval

17 August: Official gsoc end date

17-24 August: final eval

28 August: Joris ends his internship

3 Sept: Students can begin submitting required code samples to Google

Roadmap

Roadmap is available as a picture:

TODO list

Mentor

Student

Links to look at

priority:

http://www.drbd.org/

http://www.ultramonkey.org/

then also:

Proposition

Architecture

This proposition is based on:

Heartbeat: the daemon which makes resources highly available. We can use Heartbeat to manage the OAR server and the database (mysql or postgres).
DRBD: RAID 1 over IP, i.e. every write is mirrored to a disk on each server. The database data will live on this replicated disk.
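
To make the DRBD side of this architecture concrete, here is a minimal sketch that renders a two-node resource definition in DRBD 8-style drbd.conf syntax. The resource name, host names, backing partition, addresses and port are hypothetical placeholders, not values taken from this project.

```python
def render_drbd_resource(name, nodes, device="/dev/drbd0", port=7789):
    """Return a DRBD 8-style resource stanza mirroring one partition on two nodes.

    nodes: list of exactly two (hostname, backing_disk, ip_address) tuples.
    All values are supplied by the caller; nothing here is project-specific.
    """
    if len(nodes) != 2:
        raise ValueError("a RAID 1 over IP mirror needs exactly two nodes")
    # Protocol C is DRBD's synchronous mode: a write completes only once
    # both nodes have it on disk.
    lines = [f"resource {name} {{", "  protocol C;"]
    for hostname, disk, ip in nodes:
        lines += [
            f"  on {hostname} {{",
            f"    device    {device};",
            f"    disk      {disk};",
            f"    address   {ip}:{port};",
            "    meta-disk internal;",
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)
```

Calling `render_drbd_resource("oardb", [("node-a", "/dev/sda5", "192.168.0.1"), ("node-b", "/dev/sda5", "192.168.0.2")])` returns the text of one resource block with an `on` section per node.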

Abstract

With this solution, Heartbeat can detect when the other server is down and launch the services on the surviving node. It can also monitor the services running on the server and detect errors. For example, if a service fails, Heartbeat tries to restart it. If that is impossible, the whole resource group (virtual IP + DRBD + OAR server + database) is migrated to the backup server.
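
The restart-then-migrate behaviour described above can be sketched as a small decision function. This is only an illustration of the logic, not Heartbeat's actual code; the function name, the `restart` callback and the `max_attempts` parameter are hypothetical.

```python
# Resource group named in the text; it moves to the backup server as a unit.
RESOURCE_GROUP = ("virtual-ip", "drbd", "oar-server", "database")


def react_to_failure(service, restart, max_attempts=1):
    """Decide what to do when `service` is reported as failed.

    restart: callable taking the service name, returning True on success.
    Returns ("restarted", [service]) if a local restart worked, otherwise
    ("migrate", <whole resource group>) to signal a failover to the backup.
    """
    for _ in range(max_attempts):
        if restart(service):
            return ("restarted", [service])
    # Local recovery failed: the complete group must move, not just the service.
    return ("migrate", list(RESOURCE_GROUP))
```

The point of returning the whole group on failure is that the virtual IP, the DRBD device, the database and the OAR server only make sense together on one node.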

Progress

Script

I have written a script which installs and configures Heartbeat and DRBD on two servers. It takes the following parameters:

Role of the server: master or slave
Network interface used for communication between the two servers
Database type: mysql or postgres (the script can deploy HA in both configurations)
Size of the database partition
Virtual IP and CIDR netmask
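
The parameter list above could be modelled with a command-line parser like the following sketch. The option names (`--role`, `--interface`, `--db`, `--db-size`, `--virtual-ip`) are assumptions made for illustration; the real script's interface is not documented on this page.

```python
import argparse
import ipaddress


def build_parser():
    """Build a parser for the five parameters listed above (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Install and configure Heartbeat and DRBD on one of two servers."
    )
    parser.add_argument("--role", choices=["master", "slave"], required=True,
                        help="role of this server in the pair")
    parser.add_argument("--interface", required=True,
                        help="network interface used between the two servers")
    parser.add_argument("--db", choices=["mysql", "postgres"], required=True,
                        help="database type to make highly available")
    parser.add_argument("--db-size", required=True,
                        help="size of the database partition")
    # ip_interface validates the 'address/prefix' form and rejects malformed input.
    parser.add_argument("--virtual-ip", type=ipaddress.ip_interface, required=True,
                        help="virtual IP with CIDR netmask, e.g. 192.168.0.100/24")
    return parser
```

After parsing, `args.virtual_ip.ip` gives the address and `args.virtual_ip.network.prefixlen` gives the netmask length, so both values come from a single validated argument.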

Heartbeat communications are authenticated with SHA-1 (SHA is a hash function, so the messages are signed rather than encrypted). DRBD communications can also be secured, but I have not done that yet.
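
Heartbeat reads its shared secret from an authkeys file (typically /etc/ha.d/authkeys, readable by root only, mode 600), which must be identical on both servers. A minimal sketch of generating such a file's content follows; the function name and the way the secret is generated are assumptions, not taken from the script.

```python
import secrets


def render_authkeys(secret=None, key_id=1):
    """Return the text of a Heartbeat authkeys file using the sha1 method.

    The file has two lines: 'auth <id>' selects the active key, and
    '<id> sha1 <secret>' defines it. A random secret is generated if none
    is given; the same content must be copied to both servers.
    """
    if secret is None:
        secret = secrets.token_hex(20)  # 40 hex characters of randomness
    return f"auth {key_id}\n{key_id} sha1 {secret}\n"
```

Generating the content once and copying it to both nodes avoids the common mistake of each node creating its own, different secret.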

The configuration produced by the script works. We can deploy the High Availability solution in 2 different configurations:

OAR server and database on the same server --> 2 nodes (1 master / 1 backup)
OAR server and database on different servers --> 4 nodes (2 masters (OAR, DB) / 2 backups (OAR, DB))

TO DO --> My work now is to test the 2 master / 2 backup configuration, with 4 server nodes, 1 front-end, and N nodes. I must test network crashes, computer shutdowns, etc.

Tests :

disconnect the network on one node
crash the oar-server service
crash the mysql service
crash the mysql service while the OAR server is writing to it (difficult)
reserve a job, crash oar-server, close the job
...
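
For tests like these, it helps to measure how long the service stays unreachable while the failover happens. The following is a sketch of one way to do that by polling the virtual IP; it is not the procedure used for the tests on this page, and the function names, the polling interval and the TCP-connect probe are all assumptions.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Return a callable that is True when a TCP connection to host:port succeeds."""
    def probe():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return probe


def measure_downtime(probe, duration, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` for `duration` seconds and return the longest outage seen.

    An outage runs from the first failed probe to the next successful one,
    so the result is accurate to roughly one polling interval.
    """
    end = clock() + duration
    outage_start = None
    longest = 0.0
    while True:
        now = clock()
        if now >= end:
            break
        if probe():
            if outage_start is not None:
                longest = max(longest, now - outage_start)
                outage_start = None
        elif outage_start is None:
            outage_start = now
        sleep(interval)
    # Count an outage that is still ongoing when the measurement ends.
    if outage_start is not None:
        longest = max(longest, clock() - outage_start)
    return longest
```

A run would start `measure_downtime(tcp_probe(virtual_ip, db_port), duration=120)` on the front-end, then inject the fault on the master while it is polling.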

Test OK: the test results are pretty good. I have run the tests on four different configurations: 2 nodes with postgres or mysql, and 4 nodes with postgres or mysql.

I am now starting to write the documentation.

I also plan to synchronize oar.log between the OAR servers
If I have time, I will test HA on the CentOS distribution

DRBD Benchmark

To evaluate DRBD performance, I ran several benchmarks with mysql, in three setups:

without DRBD, mysql data mounted on the system filesystem
with DRBD, mysql data mounted on DRBD filesystem
with DRBD and saturated network
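
The benchmarks here were run with mysql. As a rough, database-independent proxy for the same comparison, one can time synchronous writes on each mount point, since a synchronous DRBD write also has to reach the peer. The sketch below is such a stand-in, not the benchmark used on this page; the function name, file name and sizes are assumptions.

```python
import os
import tempfile
import time


def sync_write_throughput(directory=None, total_bytes=1024 * 1024, block_size=4096):
    """Write total_bytes in block_size chunks, fsync after each, return MB/s.

    directory: where to create the temporary file, for example a DRBD-backed
    mount point or a plain local one. Defaults to the system temp directory.
    """
    if directory is None:
        directory = tempfile.gettempdir()
    path = os.path.join(directory, "drbd_bench.tmp")
    block = b"\0" * block_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        while written < total_bytes:
            written += os.write(fd, block)
            os.fsync(fd)  # force each block to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return written / max(elapsed, 1e-9) / 1e6
```

Running it once per setup (local filesystem, DRBD filesystem, DRBD with a loaded network) gives three comparable numbers, though a database workload can behave differently from raw synchronous writes.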

Results

These tests were run on genepi-31.grenoble.grid5000.fr.
The backup server (for DRBD) was genepi-32.grenoble.grid5000.fr.
The filesystem, with and without DRBD, was ext2.
The maximum transfer rate between these two nodes was 740 MB/s.
