South Pole Logbook


May 07, 2009

about power outage


Author: Erik
Mood: Excellent !

So, as promised, I did a little investigation into what happened yesterday. Unfortunately this led to the whole of rack 9 switching off again, so run #113667 crashed and #113668 is a failed restart attempt. In the meantime I rebooted the hubs on rack 9, pUp'ed the DOMs, recovered firmware fuses and started a new run, #113669. Sorry about that.
BUT here is what I found out (I am looking at IT here):
The facility engineer assured me this morning that the power delivery did not glitch or fail yesterday evening (1:09pm UTC). What Mr. Powerplant saw was the current drop produced by a whole rack shutting down, then slowly ramping up again until all the hubs were back online. Quite neat that their monitoring system is able to see this. So apparently the problem is local to the ICL.
Here is what I saw at the ICL when I arrived this morning: the UPS of rack 9 was off (UPS battery circuit breaker tripped), with no LEDs lit on the front bezel. This means there is direct power to the output, but as soon as there is a power glitch we'll lose rack 9 again.
I tried to switch it back ON, and that is when the whole of rack 9 (hubs) powered off, 30 minutes ago. Fortunately the GPS unit is not in this rack...
This weird state of the UPS is not described in the manual. But while it was briefly ON, two LEDs were blinking, the general alarm (red) and the utility LED (green), and there was no power on the output (hubs were off). The manual describes this as a "REPO condition" but does not say how to clear it. I tried holding each front button (on/off etc.) for 10 seconds, but nothing happened.
I manually tripped the breaker again to restore the previous faulty (but working) state and resume data taking as quickly as possible.
What does IT think of this UPS unit? Is it completely unusable? If it is, I think replacing it with the unit in Rack 4 once AMANDA is switched off, as we agreed before, would solve the problem. That is only a few days away anyway.
Another matter: yesterday's problem also switched off sps-itfreeze. I don't know how that is possible, but it happened. It is the only computer in rack 13, so I switched it on again.


Erik.

Erik Verhagen | 07 May 2009 19:19 GMT | Ice Cube/Power | | permalink