This shows you the differences between two versions of the page.
— | wiki:old:gsoc_fault_tolerance [2013/07/10 22:55] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | [[GSoc_Proposal_Fault_Tolerance| See original proposal here]] | ||
+ | |||
+ | **Student, please read this page carefully...** | ||
+ | |||
+ | |||
+ | Student: Joris Bremond | ||
+ | |||
+ | Mentor: Joseph Emeras | ||
+ | |||
+ | Co-Mentor: Olivier Richard | ||
+ | |||
+ | ---- | ||
+ | |||
+ | |||
+ | ===== Student: Things to do before starting ===== | ||
+ | |||
+ | * get an account on grid5000: https:// | ||
+ | * get a svn account on the inria gforge: https:// | ||
+ | * connect to the g5k jabber and add mentor as contact | ||
+ | |||
+ | ===== Project' | ||
+ | |||
+ | ==== MUST ==== | ||
+ | * work with security systems such as Kerberos | ||
+ | * make the database fault tolerant | ||
+ | * make the oar server fault tolerant | ||
+ | * as lightweight as possible | ||
+ | * well documented and packaged | ||
+ | |||
+ | ==== SHOULD ==== | ||
+ | * independent of OAR's code | ||
+ | |||
+ | ==== MAY ==== | ||
+ | * automatic redirection of requests to the backup database if the main database fails | ||
+ | * automatic redirection of packets to server if the main server crashes | ||
+ | * load-balancing | ||
+ | |||
+ | ===== How to start the project ===== | ||
+ | |||
+ | * Begin by studying DRBD and UltraMonkey: what they are intended for, their limits, their different configuration/ | ||
+ | * Test OAR: submission, jobs execution and management, see: | ||
+ | * https:// | ||
+ | * https:// | ||
+ | * https:// | ||
+ | * and more... | ||
+ | * Test deploying environment on OAR | ||
+ | * Create your own environment or modify one existing and save it. | ||
+ | |||
+ | ===== Roadmap (and Timeline) ===== | ||
+ | Official GSoC dates: 23 May to 17 August. Joris's availability: | ||
+ | |||
+ | ==== Important steps ==== | ||
+ | * 23 May: Official gsoc start date | ||
+ | |||
+ | * 1 June: Joris begins his work | ||
+ | |||
+ | * 6-12 July: mid-term eval | ||
+ | |||
+ | * 17 August: Official gsoc end date | ||
+ | |||
+ | * 17-24 August: final eval | ||
+ | |||
+ | * 28 August: Joris terminates his internship | ||
+ | |||
+ | * 3 Sept: Students can begin submitting required code samples to Google | ||
+ | |||
+ | ==== Roadmap ==== | ||
+ | |||
+ | The roadmap is available as a picture: | ||
+ | |||
+ | {{: | ||
+ | |||
+ | ===== TODO list ===== | ||
+ | |||
+ | ==== Mentor ==== | ||
+ | |||
+ | ==== Student ==== | ||
+ | |||
+ | ===== Links to look at ===== | ||
+ | |||
+ | priority: | ||
+ | * http:// | ||
+ | * http:// | ||
+ | |||
+ | then also: | ||
+ | * http:// | ||
+ | * http:// | ||
+ | |||
+ | |||
+ | ===== Proposal ===== | ||
+ | ==== Architecture ==== | ||
+ | [[Image: | ||
+ | |||
+ | This proposal is based on: | ||
+ | * Heartbeat: the daemon that makes resources highly available. We can use Heartbeat to manage the OAR server and the database (MySQL or PostgreSQL). | ||
+ | * DRBD: RAID 1 over IP, i.e. every write is mirrored to the disk of the other server. The database data will live on this replicated disk. | ||
+ | |||
+ | ==== Abstract ==== | ||
+ | With this solution, Heartbeat can detect when another server is down and launch its services. It can also monitor the different services that run on the server and detect errors. | ||
+ | For example, if a service fails, Heartbeat tries to restart it. If that is impossible, the whole group of resources (virtual IP + DRBD + OAR server + database) is migrated to the backup server. | ||
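The restart-then-migrate behaviour described above can be sketched as a small simulation. This is only an illustration of the decision logic, not Heartbeat's actual API or configuration: `Node`, `handle_failure`, `try_restart` and the resource names are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical labels for the resource group named in the text.
RESOURCE_GROUP = ["virtual-ip", "drbd", "database", "oar-server"]


@dataclass
class Node:
    name: str
    resources: List[str] = field(default_factory=list)


def handle_failure(service: str, active: Node, backup: Node,
                   try_restart: Callable[[str], bool]) -> str:
    """Restart the failed service locally; if that fails, migrate the group."""
    if try_restart(service):
        return f"restarted {service} on {active.name}"
    # The restart failed: the resources move together as one group,
    # so the backup node takes over everything the active node held.
    backup.resources = list(active.resources)
    active.resources = []
    return f"migrated group to {backup.name}"
```

Moving the resources as one group reflects the dependency between them: the database needs the DRBD device, and clients reach both the database and the OAR server through the virtual IP.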
+ | |||
+ | ===== Progress ===== | ||
+ | |||
+ | ==== Script ==== | ||
+ | I have written a script that installs and configures Heartbeat and DRBD on two servers. It takes the following parameters: | ||
+ | |||
+ | * Role of the server: master or slave | ||
+ | * Network interface used for communication between the two servers | ||
+ | * Database type: mysql or postgres. The script can deploy HA in both configurations | ||
+ | * Size of the database partition | ||
+ | * Virtual IP and CIDR netmask | ||
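As an illustration of the parameter list above, here is a minimal sketch of how such a command line could be parsed and validated. It is not the actual script: the flag names (`--role`, `--interface`, `--db`, `--db-size`, `--virtual-ip`) are assumptions made for this example, and the partition size is kept as a plain string because the page does not say which unit the script expects.

```python
import argparse
import ipaddress


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the five parameters listed above (hypothetical flags)."""
    parser = argparse.ArgumentParser(
        description="Sketch of an HA setup command line (Heartbeat + DRBD)")
    parser.add_argument("--role", choices=["master", "slave"], required=True,
                        help="role of this server in the pair")
    parser.add_argument("--interface", required=True,
                        help="network interface used between the two servers")
    parser.add_argument("--db", choices=["mysql", "postgres"], required=True,
                        help="database type to make highly available")
    parser.add_argument("--db-size", required=True,
                        help="size of the database partition (unit assumed)")
    # ip_interface parses 'address/prefix' and rejects malformed values.
    parser.add_argument("--virtual-ip", required=True,
                        type=ipaddress.ip_interface,
                        help="virtual IP with CIDR netmask, e.g. 192.0.2.10/24")
    return parser
```

Parsing the virtual IP with `ipaddress.ip_interface` yields both the address (`args.virtual_ip.ip`) and the prefix length (`args.virtual_ip.network.prefixlen`) from a single argument.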
+ | |||
+ | Heartbeat communications are authenticated with a SHA signature (SHA is a hash function, so messages are signed rather than encrypted). | ||
+ | We can also encrypt DRBD communications, | ||
+ | |||
+ | Configuration through the script works. We can deploy the high-availability solution in two different configurations: | ||
+ | * OAR server and database on the same server --> 2 nodes (1 master / 1 backup) | ||
+ | * OAR server and database on different servers --> 4 nodes (2 masters (OAR, DB) / 2 backups (OAR, DB)) | ||
+ | |||
+ | |||
+ | |||
+ | **TO DO** --> **My work now is to test the configurations 2master/ | ||
+ | I must test network crashes, computer shutdown, etc. | ||
+ | |||
+ | **Tests :** | ||
+ | * disconnect the network on one node | ||
+ | * crash the oar-server service | ||
+ | * crash the mysql service | ||
+ | * crash the mysql service while the OAR server is writing to it (difficult) | ||
+ | * reserve a job, crash oar-server, close the job | ||
+ | * ... | ||
+ | |||
+ | **Test OK** | ||
+ | The test results are good. I ran the tests on four different configurations: 2 nodes with postgres or mysql, and 4 nodes with postgres or mysql. | ||
+ | |||
+ | I am now starting to write the documentation. | ||
+ | * I also plan to synchronize oar.log between the OAR servers | ||
+ | * If I have time, I will test HA on the CentOS distribution | ||
+ | |||
+ | ==== DRBD Benchmark ==== | ||
+ | To assess DRBD performance, | ||
+ | |||
+ | * without DRBD, mysql data mounted on the system filesystem | ||
+ | * with DRBD, mysql data mounted on DRBD filesystem | ||
+ | * with DRBD and saturated network | ||
+ | |||
+ | **Results** | ||
+ | |||
+ | [[Image: | ||
+ | |||
+ | * This test was run on genepi-31.grenoble.grid5000.fr. | ||
+ | * The backup server (for DRBD) was genepi-32.grenoble.grid5000.fr | ||
+ | * The filesystem, with and without DRBD, was ext2 | ||
+ | * The maximum rate between these two nodes was 740 MB/s | ||