So as promised I did a little investigation on what happened yesterday.
Unfortunately this led to a switch-off of the whole rack 9 again, so
run #113667 crashed and #113668 is a failed attempt to restart. In the
meantime I rebooted the hubs on rack 9, pUp'ed the DOMs, recovered
firmware fuses and restarted a run, #113669.
Sorry about that.
BUT here is what I found out (my look turns towards IT):
The facility engineer assured me this morning that the power delivery
did not glitch or fail yesterday evening (1:09pm UTC). What Mr.
Powerplant saw was the current drop produced by a whole rack that shuts
down, and slowly ramps up again till all the hubs are back online. Quite
neat that their monitoring system is able to see this.
So apparently the problem is something local to the ICL.
Here is what I saw at the ICL when I arrived this morning. UPS of rack 9
is off (UPS battery circuit breaker tripped). No LEDs on the front
bezel. This means there is direct power to the output but as soon as
there is a power glitch, we'll lose rack 9 again.
I tried to switch it back ON and that is when the whole rack 9 (hubs)
powered off, 30 minutes ago. Fortunately the GPS unit is not in this
rack...
This weird state of the UPS is not described in the manual. But while it
was briefly ON, 2 LEDs were blinking, the general alarm (red) and the
utility LED (green) and no power on the output (hubs were off). This is
described in the manual as "REPO condition". The manual does not say how
to clear it. I tried holding each front button for 10 seconds (the
on/off button etc.), but nothing happened.
I manually tripped the breaker again to restore the previously faulty
(but working) state and to restore data taking as quickly as possible.
What does IT think of this UPS unit? Is it completely unusable? If it
is, I think replacing it with the unit in Rack 4 once AMANDA is switched
off, as we agreed before, would solve the problem. This is in a few days
anyway.
Another matter is that yesterday's problem switched off sps-itfreeze. I
don't know how that is possible but it happened. This is the only
computer in rack 13, so I switched it on again.
Erik Verhagen
07 May 2009 19:19 GMT
Ice Cube/Power