Fileserver Outage

DreamHost is having serious problems with one of its fileservers this weekend. The problems started early Saturday morning and are still not completely solved 48 hours later.

DreamHost has been pretty good at updating their DreamHost Status blog this time (Network fileserver troubles, File Server Issues, File Server Finale and More Files Issues!), but I just got the latest news directly from DreamHost support:

Hello,

As you may know, we recently had a network filing system crash causing severe downtime. When a filer like this crashes, it basically has to go through each of its disks to verify that they are ok, and can accept, read or change data. This takes an extremely long time, due to the sheer size of the machine (two terabytes!).

Once we got the machine up, fixed and serving files, everything seemed like it was ok, so we went back to making sure all content, data and websites were working normally.

Right about then, it crashed again! This time, however, it came back up correctly, so it didn’t take as long as it had previously.

That was 6am PST this morning. Since then, we have been constructing a new filer machine (we had to cannibalize two just to get this current one back up and running) to offload everyone. In the meantime, it has crashed again; however, it seems to come back up in the correct state. If possible, you may want to get any sensitive, or important data off of your account, just to be safe. We are working on getting everyone off the faulty machine; however, as you can imagine, it will take some time.

We are terribly sorry about the problems related to this disaster and hope to have everything stabilized and working ASAP. Please understand that we are doing everything humanly possible (including working 24-hour shifts, and sleeping in our data center!) to get every site back up and running. If your domain is down, or showing an error, we are working on reconfiguring all of the affected services, and should have them fixed soon.

Since the discussion forum also seems to be hit by the outage, it’s hard to tell how many customers are affected by this, but judging from the support queue (which has had an average of 400 customers in queue and a maximum of 600) the incident might not be as severe as it sounds. Hopefully DreamHost’s staff will soon have everything back under control.

28 Responses to “Fileserver Outage”

  1. James Silvester says:

    Thanks for the comment.

  2. R. Francis Smith says:

    Well, for what it’s worth, I’m one of the affected. I would describe the last 48 hours as downtime with sporadic bursts of uptime. It’s good to see this note, as the last official blog post on the subject was around 24 hours ago.

    I continue hoping things are settling down, but seconds before I read this, the services on my virtual host fell over again, so who knows…

    -R

  3. Daniel Drucker says:

    I sent the following to support:

    Dreamhost writes:
    [[ We’re doing everything we can to avoid data loss but backing up your email and site data is strongly encouraged. .... If possible, you may want to get any sensitive, or important data off of your account, just to be safe.]]

    I’m not on the affected system, but this statement worries me tremendously. I know that it’s always a good idea to do one’s own backups, and I do do my own backups – all my files are backed up not only to my own disks, but to Amazon S3 as well.

    That said – in the event you were to irretrievably lose a filer, please tell me it wouldn’t be MY responsibility to reload my data! You folks do actually do backups, right? Offsite, that sort of thing?

    You’re not so irresponsible as to not do backups, are you?

    – Daniel Drucker (manager of disaster recovery for Bristol-Myers
    Squibb Mail Engineering from 1999-2000)

  4. Unofficial DreamHost Blog says:

    Daniel – How do you back up to Amazon S3? I think it sounds very interesting… I’m just writing a short post about how to back up your data (to your own machine) and would like to add a section about backing up to a remote location.

  5. Chuck says:

    I was one of those affected too. Regardless of their technical skill in responding to the problem, their customer relations skills were severely lacking. As of now, Monday morning, they still haven’t updated the status blog since before 6:00 am Sunday. It’s nice that they’re sending you updates, but it would be more helpful to the rest of us if they’d post them to their own site.

  6. Unofficial DreamHost Blog says:

    Chuck – I fully agree that they should update their status blog even more often. Please notice that the support email wasn’t sent as an update to this blog, but as a response to a support request I submitted.

  7. Linked from: T. Longren

  8. Scott says:

    My Dreamhost sites have been down since Friday afternoon, which makes this day 4. Aside from the one boilerplate email response posted above (and which I also received and posted on my blog), there has been no word from them in 30 hours. THIRTY HOURS. The few complaints that have been posted on their boards have been met with posts from Dreamhost employees to “put it in a repair ticket”. They obviously have no clue how to handle customers. When people’s sites are down for this long, updates at LEAST every 2 hours are mandatory, especially with no phone support or live chat support. They will lose accounts strictly because of their failure to communicate during this outage.

  9. Scott says:

    One more note – I should let you know that “latest response” they sent you is a boilerplate reply they sent me on Sunday afternoon. I posted a copy at that time on my blogspot (as in off-dreamhost) blog, dropdebt.blogspot.com. So apparently then nobody has heard anything new from them in a day.

  10. matttail says:

    I heard this morning that they are failing all of the data over to new hardware. Basically they took parts from two different filers to try and stabilize this one. With new disks in it, the filer has to go through and verify all of the data, in some cases writing stuff back onto the new disks. Once that is finished, DH admins check for data integrity and the like. Now that process is done and they’re moving all of the affected data over to a brand new box. Things should be fully functional by tonight at the latest.

    One thing you need to keep in mind here is that there’s something between 2 and 3 terabytes of data on this filer. That’s a lot of stuff to shuffle around, and it takes time to copy that kind of stuff onto a new machine. Even with gigabit lan cards, you’re still waiting on drive access time.

    (oh, and the forums are ’staffed’ by customers. Unfortunately there’s only so much access we as customers have and even the best of us must sometimes rely on support to do their job. (-: )

    –Matttail
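
The point about copy time can be made concrete with a rough calculation. The sketch below is a best-case estimate only: it assumes the full gigabit link rate is sustained and ignores drive access time and protocol overhead, both of which make a real copy slower. The function name `transfer_hours` is purely illustrative.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Best-case hours needed to move size_bytes over a link of the given speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 3600


GIGABIT = 1e9   # bits per second
TERABYTE = 1e12  # bytes (decimal terabyte)

# The comment above estimates between 2 and 3 terabytes on the filer.
for terabytes in (2, 3):
    hours = transfer_hours(terabytes * TERABYTE, GIGABIT)
    print(f"{terabytes} TB over gigabit: at least {hours:.1f} hours")
```

Even this lower bound is several hours per pass, before any verification of the copied data.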

  11. Daniel Drucker says:

    I use Jungle Disk to mount S3 as a drive on my linux box at home. I then just rsync my Dreamhost account to that drive, after dumping my databases to sql.gz files.

    Jungle Disk works on OS X and Windows too, is free, and the filesystem format is open source – no vendor lock in. I can’t recommend it enough.
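
The routine described in this comment (dump the databases to sql.gz files, then rsync the account to a local or mounted drive) might be scripted roughly as follows. This is a minimal sketch, not the commenter's actual script: the host names, user names, and paths are placeholders, and it assumes SSH access to the hosting account and a MySQL password stored in `~/.my.cnf` on the remote side.

```python
import shlex
import subprocess


def dump_command(ssh_target, db_host, db_user, db_name, remote_dir="backups"):
    """Build an ssh command that writes <remote_dir>/<db_name>.sql.gz remotely.

    Assumption: mysqldump picks up the password from ~/.my.cnf on the remote
    account. The exit status of the pipeline is gzip's, so a failed dump may
    go unnoticed; check the file size if that matters.
    """
    dump_path = f"{remote_dir}/{db_name}.sql.gz"
    remote = (
        f"mkdir -p {shlex.quote(remote_dir)} && "
        f"mysqldump --host={shlex.quote(db_host)} --user={shlex.quote(db_user)} "
        f"{shlex.quote(db_name)} | gzip > {shlex.quote(dump_path)}"
    )
    return ["ssh", ssh_target, remote]


def rsync_command(ssh_target, remote_path, local_dir):
    """Build an rsync command that pulls remote_path into local_dir.

    --delete is deliberately left out: if the remote side looks empty during
    an outage, the local copy should not be wiped to match it.
    """
    return ["rsync", "-az", f"{ssh_target}:{remote_path}", local_dir]


def run_backup(ssh_target, db_host, db_user, db_names, remote_path, local_dir):
    """Dump every database on the remote side, then pull the files down."""
    for name in db_names:
        subprocess.run(dump_command(ssh_target, db_host, db_user, name), check=True)
    subprocess.run(rsync_command(ssh_target, remote_path, local_dir), check=True)


# Example call (placeholder values, not run here):
# run_backup("me@example.com", "mysql.example.com", "me", ["blog"],
#            "./", "/mnt/s3/dreamhost/")
```

Pointing `local_dir` at a mounted Jungle Disk volume would give the off-site copy described above; pointing it at an ordinary directory gives a plain local backup.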

  12. Waldorf says:

    Mine has been down for nearly 4 days too, completely down: email, ftp, http, everything. I cannot get access to make the recommended backups. Why does it seem some services work for others but I get nothing?
    This is beyond tolerance, I could dig out a dusty old 286, stick 1000 accounts on it, hook it up and serve websites better than Dreamhost does!
    Worst mistake I have made in a long time, hosting with Dreamhost. I was drawn in by the cheap prices, and as the saying goes, you get what you pay for, which is NOTHING with Dreamhost.
    The status page was not updated for 29 hours, support eventually mailed out canned responses and the Dreamhost forums don’t work.
    This episode will surely go down in the webhosting history book as the greatest all time farce, it will be legendary. FOUR DAYS DOWNTIME, and still counting!

  13. Unofficial DreamHost Blog says:

    Daniel – So the files are in fact downloaded to your computer and then uploaded to S3? Or are the files synced directly to your S3 account (which I guess would give you a lot faster download speed)? It looks very promising.

  14. Khangtoh says:

    Guys.. I absolutely have to agree with the guys that their support is lagging. I mean kudos to the tech support staff who work through the weekday to get this thing resolved, but how about customer service and relations? It seems that customer relations does not exist at Dreamhost! Just because we are one of the thousands of accounts on their hundreds of servers does not mean we are nothing. At least I would hope they would give the affected accounts some sort of coupon that we can use towards something like maybe getting another domain, or free IP for a year etc… But no… not a single word from the people at Customer Service.

    By the way, check out the Customer Service Guy at dreamhost at my blog … conjured up in my spare time while waiting for my server to get back up… ;)

  15. Jango says:

    what worries me is that i can’t get to the discussion forums at dreamhost, nor the wiki. scary stuff..

  16. R. Francis Smith says:

    The http://dreamhoststatus.com/ site says that there’s a separate problem with a firewall blocking access to the panel and the wiki and so on. Never rains but it pours, eh?

    -R

  17. FSFarm says:

    I’m quite tired of this. My business depends on my website and email access. Can anyone recommend another host?

  18. Jango says:

    i can’t recommend a good host, but i can tell you which ones to avoid (from personal experience). i had terrible experiences with powweb, hostnexus, and rootmode. i’ve yet to find a reliable host.. i guess we got what we paid for.

  19. FSFarm says:

    Just did a google search for web hosts and got this ranking. Has anyone heard of any of these at the top of the list?

    http://www.hostcritique.com/

    Thanks.

  20. vkimball says:

    Looks like they’re slowly getting things fixed.

    It appears that whatever system they use to store their customer / authentication data was unavailable to the rest of their network.

    Seems like the infrastructure hasn’t yet gotten ahead of all the growth that they’ve been experiencing.

  21. theMezz says:

    Hey poop happens sometimes
    They are fixing it fast anyway from what I see/read

  22. R. Francis Smith says:

    Not sure 72 hours of being largely unavailable is fast, as such, but okay.

  23. FSFarm says:

    Still no email. Anyone else???

  24. R. Francis Smith says:

    I do want to say on the flipside that at siteground, when my site was down, they’d done it on purpose (and not for the first time), blamed it on me (always), then accused me of using my hosting to pirate music (I used it to host my two podcasts.) That last bit was the last of my using them. :)

    And before them, I had endless mystery download failures when using godaddy. Whoever said we’re getting what we pay for is pretty much spot on, I’m afraid. Generally speaking dreamhost provides a better service than most of the other cheapies. Obviously, this wasn’t one of those times.

    -R

  25. FSFarm says:

    Working now!!!!!! Thanks DH team.

  26. Daniel Drucker says:

    Re Jungle Disk / S3:

    There’s currently no way to talk to a Jungle Disk directly on Dreamhost, though there’s no technical reason why you couldn’t write a script to do it – the source is available at the bottom of http://www.jungledisk.com/download.shtml and anyone who wants to write, say, a perl script to talk directly to S3 from Dreamhost is welcome to do it, e.g. by reading files and then writing the appropriate keys via Net::Amazon::S3. (You definitely couldn’t actually mount it as a filesystem on DreamHost – we don’t have FUSE, and I think that’s a good thing – thousands of users mounting filesystems on a shared web host wouldn’t be a great idea…)

    The way Jungle Disk works is via caching – when I rsync to my machine, it’s writing to a cache as fast as it can receive over the rsync connection. The Jungle Disk software is then pulling files out of that cache and writing them out to S3 as fast as IT can. However, it’s *not* removing files from the cache – it keeps them in the cache to be able to be read again if needed.

    The neat thing about JD is the way it lets you have an infinitely sized disk. One of my Jungle Disk volumes is a media library. I have a 150GB physical disk that used to contain a whole bunch of music, home videos, other large things. Before that, I had a 50GB disk, and before that… etc. When I ran out of space recently, instead of getting a bigger disk, I copied all my data to a Jungle Disk volume, and erased the disk. Then, I mounted the volume, and set the *cache* to reside on that 150GB disk. Now, my media library can grow as big as I want. The most recently accessed 150GB of my media library will always be instantly accessible, as it will be a “cache hit”. In the event of a “cache miss”, I have to wait for the file to arrive from Amazon (I get about 2MB/sec). It’s almost as good as having an infinitely sized disk.
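
The cache-hit and cache-miss behaviour described in this comment can be illustrated with a small size-bounded, least-recently-used cache placed in front of a slower remote store. This is only a sketch of the general idea, not Jungle Disk's actual implementation; the class name, the `fetch_remote` callback, and the eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class SizeBoundedCache:
    """Keep the most recently read files locally, up to capacity_bytes."""

    def __init__(self, capacity_bytes, fetch_remote):
        self.capacity = capacity_bytes
        self.fetch_remote = fetch_remote  # slow path, e.g. a download from remote storage
        self.entries = OrderedDict()      # name -> data, least recently used first
        self.used = 0
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self.entries:
            # Cache hit: served locally, and marked as most recently used.
            self.entries.move_to_end(name)
            self.hits += 1
            return self.entries[name]
        # Cache miss: wait for the remote copy, then keep it locally.
        self.misses += 1
        data = self.fetch_remote(name)
        self._store(name, data)
        return data

    def _store(self, name, data):
        if len(data) > self.capacity:
            return  # larger than the whole cache: serve it but do not keep it
        # Evict least recently used entries until the new one fits.
        while self.used + len(data) > self.capacity:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        self.entries[name] = data
        self.used += len(data)


# Toy example: a 100-byte cache in front of a "remote" dictionary.
remote = {"a": b"x" * 60, "b": b"y" * 60}
cache = SizeBoundedCache(100, remote.__getitem__)
cache.read("a")  # miss, fetched from remote
cache.read("a")  # hit
cache.read("b")  # miss, evicts "a" to make room
cache.read("a")  # miss again, evicts "b"
```

The remote store always holds the full library; the local disk only determines how much of it is available without waiting.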

  27. Brady J. Frey says:

    Still up and down… and it seems to happen to us monthly, sigh…

  28. vkimball says:

    They’re now saying that the new fileserver crashed. Huh?