wiki:useful_commands_and_administration_tasks [2020/03/25 15:15] (current)
neyron created
 +====== Useful commands and administration tasks ======
 +//Here, you'll find useful commands, sometimes a bit tricky, to put into your scripts or administration tasks//
 +
 +===== List suspected nodes without running jobs =====
 +This list is useful if you want to automatically reboot nodes that have been suspected for an unknown reason, on the assumption that a reboot is a simple way to clean things up:
 +<code>
 + oarnodes --sql "state = 'Suspected' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN \
 + (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u
 +</code>
 +
 +===== List alive nodes without running jobs =====
 +<code>
 + oarnodes --sql "state = 'Alive' and network_address NOT IN (SELECT distinct(network_address) FROM resources where resource_id IN \
 + (SELECT resource_id FROM assigned_resources WHERE assigned_resource_index = 'CURRENT'))" | grep '^network_address' | sort -u
 +</code>
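Both listings above print whole ''network_address'' lines. If a script needs bare host names (for example to feed a reboot tool), a small filter can strip the label. This is only a sketch: ''node_names'' is a hypothetical helper, and it assumes that each matching line looks like ''network_address : node-1'' or ''network_address: node-1'', which you should check against the output of your OAR version.

```shell
# Hypothetical helper: reduce "network_address ..." lines to unique host names.
# Assumption: label and value are separated by ":" or "=", with optional spaces.
node_names() {
    awk -F'[:=]' '/^network_address/ { gsub(/[ \t]/, "", $2); print $2 }' | sort -u
}

# Possible usage (not executed here):
#   oarnodes --sql "state = 'Suspected'" | node_names
```

Lines that do not start with ''network_address'' are ignored, so the filter can replace the ''grep'' and ''sort -u'' stages of the commands above.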
 +
 +===== Oarstat display without best-effort jobs =====
 +
 +<code>
 + oarstat --sql "job_id NOT IN (SELECT job_id FROM job_types where types_index = 'CURRENT' AND type = 'besteffort') AND state != 'Error' AND state != 'Terminated'"
 +</code>
 +
 +===== Setting some nodes in maintenance mode only when they are free =====
 +
 +You may need to plan maintenance operations on particular nodes (for example adding some memory or upgrading the BIOS) without interrupting users' running or planned jobs. To do so, you can simply submit a "sleep" job to the admin queue, wait for it to start running, and then set the node into maintenance mode. You can also use this trick to set the node into maintenance mode automatically when the admin job starts:
 +<code bash>
 + oarsub -q admin -t cosystem -l /nodes=2 'uniq $OAR_NODE_FILE|awk "{print \"sudo oarnodesetting -m on -h \" \$1}"|bash'
 +</code>
 +This uses the "cosystem" job type, which does nothing but start your command on a given host. This host has to be configured in the //COSYSTEM_HOSTNAME// variable of the //oar.conf// file; for the current purpose, you can simply use //127.0.0.1//. You also need to install the oar-node package on this host.
 +
 +The example above will disable 2 free nodes, but you may want to add a //-p// option to specify the nodes you want to disable, for example: ''-p "network_address in ('node-1','node-2')"''
 +
 +**Note:** you can't simply do this within a "normal" job, because OAR will kill your job before all the resources of the node have been set into maintenance mode.
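The nested quoting in the ''oarsub'' one-liner makes it hard to see what is actually piped into ''bash''. The sketch below reproduces only the command-generation step as a standalone function, so you can inspect the generated ''oarnodesetting'' commands before running anything. ''maintenance_commands'' is a hypothetical name, and ''sort -u'' is used here instead of ''uniq'' so that duplicate host lines are removed even if they are not adjacent.

```shell
# Hypothetical helper: print one "oarnodesetting" command per distinct host
# listed in a node file (one host name per line, possibly repeated per core).
maintenance_commands() {
    sort -u "$1" | awk '{print "sudo oarnodesetting -m on -h " $1}'
}

# Possible usage inside the job (not executed here):
#   maintenance_commands "$OAR_NODE_FILE"          # review the commands
#   maintenance_commands "$OAR_NODE_FILE" | bash   # actually run them
```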
 +
 +===== Optimizing and re-initializing the database with Postgres =====
 +Sometimes the database contains so many jobs that it needs to be optimized. Normally, you should have a **vacuumdb** running daily from cron. You can also run **vacuumdb -a -f -z ; reindexdb oar** manually, but don't forget to stop OAR first, and be aware that it may take some time. Even then, the database may remain very big, which can be a problem for backups or make the nightly vacuum take too long. A more radical solution is to start again with a new database while keeping the old one, so that you can still connect to it for job history. You can do this once a year, for example, and then you only have to back up the current database. Here is a way to do this:
 + 
 +  *  First of all, make a backup of your database! With postgres, it is as easy as:
 +<code>
 + create database oar_backup_2012 with template oar;
 +</code>
 +This creates an exact copy of the "oar" database named "oar_backup_2012". Make sure that you have enough space left on the device hosting your postgres data directory. The copy lets you run queries on the backup database when you need the history of old jobs. Note that PostgreSQL refuses to copy a database while other sessions are connected to it.
 +  *  Plan a maintenance window and make sure there are no more jobs in the system.
 +  *  Make a dump of your "queues", "resources" and "admission_rules" tables.
 +  *  Stop the oar server, drop the oar database and re-create it.
 +  *  Restore the "queues", "resources" and "admission_rules" tables into the new database.
 +  *  Finally, restart the server.
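The steps above can be sketched as a script. This is a minimal sketch under several assumptions, not a tested procedure: the service name ''oar-server'', the use of ''systemctl'', the dump file path and the ''DRY_RUN'' switch are all hypothetical and must be adapted to your site. The server is stopped before the copy so that no OAR session is connected to the database. Also note that ''createdb'' only creates an empty database; how the rest of the OAR schema is re-created is not specified above, so it is left as a placeholder comment. By default the script only prints the commands.

```shell
# Hypothetical settings; adapt them to your installation.
DB=oar
BACKUP_DB="oar_backup_$(date +%Y)"
DUMP_FILE="${DUMP_FILE:-/tmp/oar_static_tables.sql}"
OAR_SERVICE="${OAR_SERVICE:-oar-server}"   # assumption: service name

# Print each command unless DRY_RUN=0, in which case it is executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_oar_db() {
    run systemctl stop "$OAR_SERVICE"
    # Keep last year's data in a copy of the database.
    run psql -d postgres -c "CREATE DATABASE $BACKUP_DB WITH TEMPLATE $DB;"
    # Save the tables that must survive the reset.
    run pg_dump -t queues -t resources -t admission_rules -f "$DUMP_FILE" "$DB"
    run dropdb "$DB"
    run createdb "$DB"
    # Placeholder: re-create the remaining OAR tables here (site-specific).
    run psql -d "$DB" -f "$DUMP_FILE"
    run systemctl start "$OAR_SERVICE"
}

# Possible usage (run as a PostgreSQL superuser):
#   reset_oar_db               # dry run, prints the commands
#   DRY_RUN=0; reset_oar_db    # really executes them
```

Reviewing the dry-run output first is a cheap safeguard, since ''dropdb'' cannot be undone.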
 +
  
wiki/useful_commands_and_administration_tasks.txt · Last modified: 2020/03/25 15:15 by neyron