Got offline around 1:30ish am last 'night' after resetting autofs daemons, and didn't really boot up today until about 9:30am. The various alarums & excursions culminated around 5:30pm, after the IT Manager (who hadn't been back to his house and spouse in Stockton in about a week) had left.
Turns out the electricians had used *20* amp breakers, not 30 amp ones, on the PDU feeds to the racks. Our 'smart PDU' units were reading 22 – 25. Contrary to what the aforementioned IT Manager said, that is *not* the 'address of the unit', that is the amp draw. No good deed goes unpunished (thank you, Avon!), so my reward for not backing this guy against the wall on the machine room buildout was to spend an hour and a half just now working with the buildout temps on getting temporary power in and rearranged.
Fortunately the buildout folks (ManGo) are *highly* competent and cool, and brought the power thing to our attention during their walkthrough, and had a plan all organized on what to move and how to move it. Also fortunately we could move random hosts from the LSF compute cluster (badmin hclose hostname, halt it, replug it, bring it back up) without inconveniencing the engineers. OK, I had to bstop/bresume a couple of long jobs, but little other impact.
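A rough sketch of that per-host shuffle, written as a dry run that only lists the steps instead of running anything. The function name `host_move_plan` is made up for illustration, and reopening the host with `badmin hopen` afterwards is an assumption (it is the usual counterpart to `hclose`, but isn't spelled out above). The bstop/bresume handling is left out because where it fits depends on the job.

```python
def host_move_plan(hostname):
    """Return the ordered steps for moving one LSF compute host to a new PDU.

    Dry run only: nothing is executed, the steps are just listed.
    """
    return [
        f"badmin hclose {hostname}",   # stop LSF dispatching new jobs to the host
        f"(manual) halt {hostname}",   # shut the machine down cleanly
        f"(manual) replug {hostname} into the new PDU",
        f"(manual) power {hostname} back on",
        f"badmin hopen {hostname}",    # assumption: reopen the host once it is back up
    ]


if __name__ == "__main__":
    # "compute01" is a placeholder hostname, not a real machine from the cluster.
    for step in host_move_plan("compute01"):
        print(step)
```

One host at a time keeps the rest of the cluster taking jobs while each box is down.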
Then we got to poke at the specs for the Supermicro file servers, and make sure that the redundant power supplies were *really* redundant, e.g., independent cordage. That determined, we were able to move all 4 of the main fileservers to new PDUs by the simple expedient of one cord at a time. YES!!!! Ditto the couple of SunFire 280's we had to switch around.
So now instead of a little forest of red lights, we have a little forest of amber lights, but it is progress. PDUs are at 13 – 14 amps instead of high teens, low 20s. And why *does* a PDU get to draw 23 – 25 amps before it blows a 20 amp breaker, eh? Probably because the UPS is involved and somehow buffering the draw, is my guess. But if 25 is overloading the PDU, and thus redlighting it, why will getting 30's in the breaker box constitute a fix?
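To put rough numbers on that question, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that a breaker should only carry about 80% of its rating as a continuous load; that figure is an assumption, not something from the electricians or the PDU spec sheet, and the PDU's own rating isn't accounted for at all. The function name `breaker_headroom` is made up for illustration.

```python
def breaker_headroom(draw_amps, breaker_amps, continuous_percent=80):
    """Amps of headroom left under the assumed continuous-load limit.

    A negative result means the draw is over the limit.
    continuous_percent=80 is an assumed rule of thumb, not a measured value.
    """
    limit = breaker_amps * continuous_percent / 100
    return limit - draw_amps


if __name__ == "__main__":
    # Draw readings quoted above: 13-14 A after the move, 22-25 A before it.
    for breaker in (20, 30):
        for draw in (14, 25):
            headroom = breaker_headroom(draw, breaker)
            print(f"{draw} A on a {breaker} A breaker: {headroom:+.1f} A headroom")
```

Under that assumption, 25 amps is over the continuous limit even on a 30 amp breaker, so the bigger breakers alone wouldn't be a comfortable fix at the old draw.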
It isn't 'my' problem per se, because I was not involved with the buildout. Hell, I wasn't involved with the move until about 3 or 4 weeks ago, when I shoved my way in because of all the stuff I could see Not Happening (move coordination meetings, planning, etc). But I have some sympathy for the IT-M who is going to get called on the carpet in a really royal way. Dude, you can't just read the back of the box and say that's the power draw, especially when you start shoving racks of disks into the cases. Well, experience is what you get when you fuck up, and this move is full of experience for this guy.