
Re: [Phys-l] web server fiasco



On 09/04/2010 12:15 PM, Michael Edmiston wrote:
> we have something like six or seven servers for the
> network, and several different Raid-5 systems. We have generally stocked
> two replacement drives, but once we lost three in the same power failure. I
> think two were on one server and the third was on a different server. Since
> we only had two replacements in stock, we had to have the third drive
> shipped overnight in order to get the system back up. That is an easy way
> to make it take over 24 hours to get something back up... not having
> sufficient replacement parts on hand. How many spare disk drives and spare
> server computers do you think an organization should have on hand?

Well, since you ask:

A hosting service that has several hundred systems (as was
the case last Thursday) should have at least one entire
spare machine ready to go. The procedure is:

1) If a production machine is in trouble, and you know exactly
what the problem is, and it is easy to fix, go ahead and
fix it. Example: one element of a RAID array starts sending
SMART warnings. Hot-swap that element.

2) In all other cases, immediately start restoring files from
backup to the spare machine. This should be a non-laborious,
hands-off process (a rough sketch follows this list) ... although
it will take some time.

3) While that is going on, do whatever debugging you think is
worthwhile.

4) If the restore finishes before the problem is fully debugged,
switch over to the backup machine. Finish debugging, fixing,
and testing after the troubled machine is offline.
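
For concreteness, here is a rough Python sketch of that policy.
The smartctl health check, the rsync-based restore, and the host
name and paths are stand-ins of my own choosing, not anything any
particular site actually runs; substitute whatever monitoring and
backup tooling you have.

  import subprocess

  SPARE_HOST = "spare.example.edu"    # hypothetical standby machine
  BACKUP_ROOT = "/backups/latest/"    # hypothetical backup snapshot

  def smart_says_failing(device):
      """Step 1 check: ask smartctl for the overall health of one drive."""
      result = subprocess.run(["smartctl", "-H", device],
                              capture_output=True, text=True)
      # smartctl prints "PASSED" in its health summary when the drive looks OK.
      return "PASSED" not in result.stdout

  def start_restore_to_spare():
      """Step 2: kick off the hands-off restore and return immediately,
      so debugging (step 3) can proceed while the copy runs."""
      return subprocess.Popen(["rsync", "-a", "--delete",
                               BACKUP_ROOT, SPARE_HOST + ":/srv/"])

  if __name__ == "__main__":
      if smart_says_failing("/dev/sda"):
          # Known, easy problem: hot-swap the flagged RAID element (step 1).
          print("hot-swap the drive that is sending SMART warnings")
      else:
          restore = start_restore_to_spare()
          # Step 3: debug the troubled machine while rsync runs.
          restore.wait()
          # Step 4: the restore finished first, so switch over to the spare.
          print("restore done; switch traffic to " + SPARE_HOST)

The only point of the sketch is that the restore in step 2 runs in
the background, so the debugging in step 3 costs no extra downtime.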

> It can take 4 or 5 hours after power is restored just to figure out
> what all is working and what is not working. Then, if we indeed lost two
> Raid-5 disks, we have to restore from the backup, and that can indeed take a
> long time (another 4 or 5 hours). ... Then, once the system is up, the
> manager does some reliability testing before making things available
> to the public again.

The procedure outlined above guarantees that the downtime is
equal to the /lesser/ of the debug time or the restore time.
The new machine can be presumed reliable.

You do not want the downtime to be the /sum/ of the debug time
plus the restore time ... not to mention the build-a-machine
time and/or the reliability testing time.
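
To put numbers on it, using the 4-to-5 hour figures from the quoted
message (rounded to 5 hours each, which is my own rounding):

  # Rough arithmetic, using ~5 h of debugging and ~5 h of restoring.
  debug_hours = 5
  restore_hours = 5

  parallel_downtime = min(debug_hours, restore_hours)   # restore-first procedure
  serial_downtime = debug_hours + restore_hours         # debug, then restore

  print("restore-in-parallel: ~%d h" % parallel_downtime)   # ~5 h
  print("debug-then-restore:  ~%d h" % serial_downtime)     # ~10 h, plus build/test time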

If, after a reasonably short amount of time, the machine that
came off the line cannot be recertified as 100% good to go,
junk it and buy a new one. You really don't want your spare
to be unreliable.

======

For an organization with "only" several machines, as opposed to
several hundred, this procedure is not quite so obviously the
way to go, but given a history of long, painful outages, it
should be seriously considered.

Hardware is cheap. Downtime is expensive.

Google doesn't even attempt to debug hardware. Their machines
are considered modules unto themselves, and if anything goes
wrong, they just yank the whole machine out of the rack and
shove in a new one.


On 09/04/2010 01:23 PM, Stefan Jeglinski wrote:

> We were locked out of the
> old colo building for *10 days* and could not get a single phone call
> returned. Finally, the old colo owner, who knew the secrets of the
> building, helped us *break in,* whereupon we "stole" our box out in
> the dead of night, enabling us to bring it back up on a spare IP I
> have at work. Still no word from the new (obviously fired) colo, like
> for example "hey you can't just break in and take stuff!"
>
> I contend my story competes,

Yikes!