Reduce Precise OGE exec hosts to 10
Closed, Resolved (Public)

Description

We are making good progress on shifting grid jobs from Precise to Trusty. Approximately 2/3 of active jobs are now on Trusty. We can start to clean up the Precise hosts.

@yuvipanda has made a checklist for decommissioning an OGE host: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Decommission_a_node

A reasonable approach to choosing which hosts to remove is to start with the hosts with the fewest active jobs.
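One way to rank hosts by active jobs is sketched below. It is only an illustration, not the procedure actually used on this task: the helper name jobs_per_host is hypothetical, and it assumes the input is ordinary `qstat` job listing text in which each running job line carries a `queue@host` token. Hosts with no running jobs produce no line at all, so they would need to be picked up separately (for example from `qconf -sel`).

```shell
#!/bin/sh
# Hypothetical helper: read qstat-style output on stdin and print
# "count short-hostname" pairs, fewest running jobs first.
jobs_per_host() {
    # Extract every "queue@host" token, keep only the short host name
    # (so a truncated domain suffix does not matter), then tally.
    grep -oE '[A-Za-z0-9_.-]+@[A-Za-z0-9.-]+' \
        | cut -d@ -f2 \
        | cut -d. -f1 \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}' \
        | sort -k1,1n -k2,2
}
```

Assuming the usual gridengine options, it could be fed with something like `qstat -u '*' -s r | jobs_per_host`; the hosts at the top of the list are the cheapest ones to drain.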

Event Timeline

Looking at the nodes, I think it will be easiest to remove tools-exec-1201 through tools-exec-1211, which leaves the remaining hosts consecutively numbered.

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:55:59Z] <bd808> disabled queues on tools-exec-1202 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:56:19Z] <bd808> disabled queues on tools-exec-1203 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:56:34Z] <bd808> disabled queues on tools-exec-1204 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:56:47Z] <bd808> disabled queues on tools-exec-1205 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:57:14Z] <bd808> disabled queues on tools-exec-1206 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:57:30Z] <bd808> disabled queues on tools-exec-1207 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:57:44Z] <bd808> disabled queues on tools-exec-1208 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:58:02Z] <bd808> disabled queues on tools-exec-1209 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:58:16Z] <bd808> disabled queues on tools-exec-1210 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T04:58:36Z] <bd808> disabled queues on tools-exec-1211 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:00:16Z] <bd808> drained tools-exec-1202 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:00:45Z] <bd808> drained tools-exec-1203 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:04:32Z] <bd808> drained tools-exec-1204 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:07:14Z] <bd808> drained tools-exec-1205 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:10:57Z] <bd808> drained tools-exec-1206 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:12:29Z] <bd808> drained tools-exec-1207 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:13:00Z] <bd808> drained tools-exec-1208 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:14:26Z] <bd808> drained tools-exec-1209 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:18:32Z] <bd808> drained tools-exec-1211 (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T05:20:41Z] <bd808> rescheduled continuous jobs on tools-exec-1210; 2 task queue jobs remain (T151980)

The queues are disabled on tools-exec-1201 through tools-exec-1211. All continuous jobs have been rescheduled using qmod -rj. There are 2 task queue jobs still running on tools-exec-1210. I'll let them be for now and hope they finish in a reasonable amount of time.
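The disable-then-reschedule sequence described above can be sketched as a dry run that only prints the commands it would issue. This is a rough illustration under assumptions: the helper names disable_cmds and reschedule_cmds are hypothetical, the `*@host` queue pattern and the domain suffix are guesses about how the queues were addressed, and only `qmod -rj` is confirmed by the comment above. The exact commands run on this task are not recorded here, so the wikitech checklist remains the reference.

```shell
#!/bin/sh
# Hypothetical dry-run helpers: they print qmod commands, they do not run them.

# disable_cmds FIRST LAST DOMAIN
# Print one "disable all queue instances on this host" command per host number.
disable_cmds() {
    n=$1
    while [ "$n" -le "$2" ]; do
        printf "qmod -d '*@tools-exec-%s.%s'\n" "$n" "$3"
        n=$((n + 1))
    done
}

# reschedule_cmds
# Read job ids (one per line) on stdin and print a reschedule command for each.
reschedule_cmds() {
    while read -r job_id; do
        if [ -n "$job_id" ]; then
            printf 'qmod -rj %s\n' "$job_id"
        fi
    done
    return 0
}
```

For example, `disable_cmds 1201 1211 eqiad.wmflabs` prints eleven `qmod -d` lines that can be reviewed before being run by hand. Task queue jobs are not covered by this sketch; as noted above, they either finish on their own or have to be deleted.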

Mentioned in SAL (#wikimedia-labs) [2016-11-30T22:47:27Z] <bd808> Deleted 2 jobs running on tools-exec-1210 for many hours/days (T151980)

Mentioned in SAL (#wikimedia-labs) [2016-11-30T23:06:48Z] <bd808> Removed tools-exec-12[00-11] from gridengine (T151980)

Change 324623 had a related patch set uploaded (by BryanDavis):
toollabs: remove host aliases for tools-exec-12[01-11]

https://gerrit.wikimedia.org/r/324623

Just as a sanity check to prevent another T149634#2758566:

tools-bastion-02.tools:~
bd808$ sudo qconf -sel|grep -- -12
tools-exec-1212.eqiad.wmflabs
tools-exec-1213.eqiad.wmflabs
tools-exec-1214.eqiad.wmflabs
tools-exec-1215.eqiad.wmflabs
tools-exec-1216.eqiad.wmflabs
tools-exec-1217.eqiad.wmflabs
tools-exec-1218.eqiad.wmflabs
tools-exec-1219.eqiad.wmflabs
tools-exec-1220.tools.eqiad.wmflabs
tools-exec-1221.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1201.eqiad.wmflabs
tools-webgrid-lighttpd-1202.eqiad.wmflabs
tools-webgrid-lighttpd-1203.eqiad.wmflabs
tools-webgrid-lighttpd-1204.eqiad.wmflabs
tools-webgrid-lighttpd-1205.eqiad.wmflabs
tools-webgrid-lighttpd-1206.eqiad.wmflabs
tools-webgrid-lighttpd-1207.eqiad.wmflabs
tools-webgrid-lighttpd-1208.eqiad.wmflabs
tools-webgrid-lighttpd-1209.eqiad.wmflabs
tools-webgrid-lighttpd-1210.eqiad.wmflabs
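The same check can be made less dependent on eyeballing the list. The sketch below is an assumption-laden convenience, not part of the original procedure: the helper name leftover_exec_hosts is hypothetical, and it expects one host name per line on stdin, as in the `qconf -sel` output above.

```shell
#!/bin/sh
# Hypothetical helper: print any of tools-exec-1201 .. tools-exec-1211 that
# still appear in a host list on stdin. Exit status is 0 if at least one
# leftover host was found and non-zero otherwise (plain grep semantics).
leftover_exec_hosts() {
    # 0[1-9] covers 1201-1209, 1[01] covers 1210-1211; the trailing group
    # requires a dot or end of line so that longer numbers do not match.
    grep -E '^tools-exec-12(0[1-9]|1[01])(\.|$)'
}
```

A possible use would be `sudo qconf -sel | leftover_exec_hosts && echo "stale exec hosts still registered"`. Against the listing above it prints nothing, because only tools-exec-1212 and later remain and the tools-webgrid-lighttpd hosts do not match the pattern.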

Mentioned in SAL (#wikimedia-labs) [2016-12-05T16:53:28Z] <bd808> Terminated deprecated instances: "tools-exec-1201", "tools-exec-1202", "tools-exec-1203", "tools-exec-1205", "tools-exec-1206", "tools-exec-1207", "tools-exec-1208", "tools-exec-1209", "tools-exec-1210", "tools-exec-1211" (T151980)

Shinken does not seem to understand that these hosts were removed on purpose:

[16:53]  <shinken-wm>	PROBLEM - Host tools-exec-1207 is DOWN: CRITICAL - Host Unreachable (10.68.17.113)
[16:54]  <shinken-wm>	PROBLEM - Host tools-exec-1205 is DOWN: CRITICAL - Host Unreachable (10.68.17.91)
[16:55]  <shinken-wm>	PROBLEM - Host tools-exec-1202 is DOWN: CRITICAL - Host Unreachable (10.68.16.57)
[16:55]  <shinken-wm>	PROBLEM - Host tools-exec-1201 is DOWN: CRITICAL - Host Unreachable (10.68.17.49)
[16:55]  <shinken-wm>	PROBLEM - Host tools-exec-1209 is DOWN: CRITICAL - Host Unreachable (10.68.17.129)
[16:55]  <shinken-wm>	PROBLEM - Host tools-exec-1203 is DOWN: CRITICAL - Host Unreachable (10.68.16.133)
[16:55]  <shinken-wm>	PROBLEM - Host tools-exec-1211 is DOWN: CRITICAL - Host Unreachable (10.68.17.64)
[16:55]  <shinken-wm>	PROBLEM - Host tools-exec-1210 is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[16:57]  <shinken-wm>	PROBLEM - Host tools-exec-1206 is DOWN: CRITICAL - Host Unreachable (10.68.17.105)

(The Shinken configuration is regenerated on every Puppet run, so up to 30 minutes of false alarms are to be expected.)

Change 324623 merged by Yuvipanda:
toollabs: remove host aliases for tools-exec-12[01-11]

https://gerrit.wikimedia.org/r/324623

bd808 removed a project: Patch-For-Review.

Shinken seems to be happy now and the last patch has been merged.