So, you're hosted in a single zone, and if you're in US-East you presumably went down last night. Stop it. AWS has seven regions across the world, three of which are in the US. Each region is split into up to five availability zones. You need to use more than just multiple zones if you want to stay up during one of these outages. Cross-region is the answer.
The following is a rough guide to making your website available regardless of 'catastrophic' events.
Simple static sites and assets that don't change often have an easy option: just use CloudFront. You can do this by setting up a new CloudFront distribution. There are two common ways to use CloudFront for 'download' content - the first is S3 backed. This is great if your static content is already on S3.
Assuming so, use the AWS Console, click CloudFront, click create distribution, select download, and select your bucket from the dropdown list. If you've not set your HTTP caching headers then you should set the Object Caching to Custom and enter a TTL, i.e. how long to cache things for.
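If you'd rather set the caching headers on the objects themselves than rely on a custom TTL, here's a minimal sketch of how that could look at upload time. It only builds the extra arguments you'd hand to boto3; the helper name `s3_upload_args` and the one-day default are my assumptions, not anything AWS mandates.

```python
import mimetypes


def s3_upload_args(filename, max_age_seconds=86400):
    """Build an ExtraArgs dict carrying caching headers for an S3 upload.

    Hypothetical helper: the name and the one-day default are assumptions.
    """
    # Guess the MIME type from the extension so browsers render the file properly.
    content_type, _ = mimetypes.guess_type(filename)
    return {
        # CloudFront and browsers both honour max-age from Cache-Control.
        "CacheControl": f"public, max-age={max_age_seconds}",
        "ContentType": content_type or "application/octet-stream",
    }


if __name__ == "__main__":
    print(s3_upload_args("logo.png", max_age_seconds=3600))
```

With boto3 installed and credentials configured, the result would be passed along the lines of `boto3.client("s3").upload_file(path, bucket, key, ExtraArgs=s3_upload_args(path))`.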
The second, less often used way is to use CloudFront as a caching reverse proxy. Custom origins have been supported for some time now. It's pretty much the same procedure as above, but you point it at your webserver.
The process for this is more complex as you'll need to alter your webserver config and DNS, and set up the distribution.
Your website will now be significantly faster.
For most people reading this, those of you serving dynamic content, you'll need to do something more complex. Migrating to a multi-region setup isn't the simplest so I'll outline it - the major parts are:
You want to move to Route53.
Route53 is AWS's DNS offering. It's awesome. It uses anycast. It has a swanky API. There are libraries for most commonly used languages and more on GitHub. The most important feature for us is going to be LBR, or Latency Based Routing. This allows you to have DNS records whose answers change depending on the latency to a specific AWS region. Think of it as a CDN for DNS: it serves the lowest latency record for each user.
Side note: if you're not already, you should be using DNS for internal communication. Machines fail; using DNS means you can easily replace them without having to redeploy configuration. Route53 rocks at this due to its API. This is a separate discussion though.
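To make LBR a bit more concrete, here's a rough sketch of the change batch you'd send to Route53: one A record per region, all with the same name, each tagged with a region and a unique set identifier. The function names, hostname and IP addresses (documentation ranges) are placeholders I've made up; only the dictionary shape follows the Route53 API.

```python
def latency_record_change(name, region, ip_address, ttl=60):
    """One UPSERT for a latency-based A record in a single region."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            # Records sharing a name must each have a unique SetIdentifier.
            "SetIdentifier": f"{name}-{region}",
            # Region is what turns this into a latency-based record.
            "Region": region,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip_address}],
        },
    }


def latency_change_batch(name, endpoints):
    """Build a ChangeBatch from a {region: ip_address} mapping."""
    return {
        "Comment": "Latency-based records, one per region",
        "Changes": [
            latency_record_change(name, region, ip)
            for region, ip in sorted(endpoints.items())
        ],
    }


if __name__ == "__main__":
    # Hypothetical endpoints; swap in your own load balancer or instance IPs.
    batch = latency_change_batch(
        "www.example.com.",
        {"us-east-1": "192.0.2.10", "eu-west-1": "198.51.100.10"},
    )
    for change in batch["Changes"]:
        print(change["ResourceRecordSet"]["SetIdentifier"])
```

With boto3 this would go to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, with the hosted zone ID being your own.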
So, you want to be able to cope with a fixed (lol, jk - I mean you've guessed your capacity) amount of traffic, but without spending too much monies. You're engineering for minimal downtime but also want to minimise cost. I.e. you don't want to run two or more identical copies of your environment (If you do, you are either over provisioned 98% of the time by 2x or you've engineered something that will get overloaded for the 2%, which makes it all rather pointless).
The solution is Auto Scaling. Auto Scaling lets you scale the number of machines up or down based on metrics of your choosing. These can be application-specific or the standard metrics from CloudWatch.
tl;dr: When a region fails, traffic will mount up on the other one(s) and you'll need to scale up. Auto Scaling will do this.
However, it relies on you doing #3.
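A quick back-of-the-envelope sketch of what "scale up" means in numbers, assuming traffic is spread evenly across regions and redistributes evenly when one drops out (a simplification; latency-based routing won't split it that neatly). The function name is mine.

```python
import math


def instances_per_region(total_instances_needed, regions, failed=0):
    """Instances each surviving region must run to carry the whole load.

    Assumes an even split of traffic across surviving regions.
    """
    surviving = regions - failed
    if surviving < 1:
        raise ValueError("no regions left to serve traffic")
    # Round up: you can't run a fraction of a machine.
    return math.ceil(total_instances_needed / surviving)


if __name__ == "__main__":
    normal = instances_per_region(12, regions=3)
    degraded = instances_per_region(12, regions=3, failed=1)
    print(f"normal: {normal} per region, one region down: {degraded} per region")
```

The practical upshot is that the maximum size of each region's Auto Scaling group probably wants to be at least the degraded figure, otherwise the group hits its ceiling just when you need it.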
Automation is key for Auto Scaling, otherwise new machines start unconfigured. The easiest and fastest way to make your app work nicely with Auto Scaling is to use AMIs. However, this has one major issue: deployment. Releasing new code means rebuilding an AMI. If your release cycle is quick and your automation bad, this might suck more time than it's worth.
Using something like Puppet, Chef or CFEngine to manage configuration of your newly started machines is good - the main drawback compared to baking AMIs is speed: it can take up to 40 minutes to go from boot to usable. This might be too late if your application needs to deal with 100% extra traffic all of a sudden.
Automated deployment is essential if you're not baking AMIs.
I'm probably biased, but my current database of choice is MongoDB, so I'll assume it's yours too. If you're using MySQL or PostgreSQL I'll do a separate article in the future.
If you're using a single instance of Mongo, shame on you - you should migrate to at least two full machines + arbiter in a replica set. Replica sets allow your app to keep running almost seamlessly should your primary machine fail. They've also got pretty neat ways of going read only if nodes can't see the majority of the set.
Minimum sane setup:
Primary in one region, Secondary in another and an Arbiter in a third region. This will allow your application to keep writing if either the primary or the secondary region fails.
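Here's a small sketch of that layout: it builds a replica set config document and checks whether the set can still elect a primary when regions disappear. The hostnames, regions and function names are invented for illustration, and the majority check is a simplified model of Mongo's election rules, not the real thing.

```python
from collections import namedtuple

# Hypothetical members: two data-bearing nodes and one arbiter, one per region.
Member = namedtuple("Member", ["host", "region", "arbiter"])

MEMBERS = [
    Member("mongo-a.example.com:27017", "us-east-1", False),
    Member("mongo-b.example.com:27017", "us-west-2", False),
    Member("mongo-arb.example.com:27017", "eu-west-1", True),
]


def replica_set_config(name, members):
    """Build a replica set config document from the member list."""
    return {
        "_id": name,
        "members": [
            {"_id": i, "host": m.host, "arbiterOnly": m.arbiter}
            for i, m in enumerate(members)
        ],
    }


def can_elect_primary(members, failed_regions):
    """Simplified model: need a strict majority of voters plus a data node."""
    alive = [m for m in members if m.region not in failed_regions]
    has_majority = len(alive) > len(members) // 2
    has_data_node = any(not m.arbiter for m in alive)
    return has_majority and has_data_node


if __name__ == "__main__":
    print(replica_set_config("rs0", MEMBERS))
    for region in ("us-east-1", "us-west-2"):
        print(region, "down, still writable:", can_elect_primary(MEMBERS, {region}))
```

If you use pymongo, a document of this shape is what you'd hand to the `replSetInitiate` admin command when first creating the set.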
A better setup:
Replace the Arbiter with a full replica set member. This will allow your application to keep writing if any one region fails. If you have a sharded setup, the key is to make sure your shards are distributed across regions, as well as your config servers. Should you be super-paranoid, I'd suggest using a mix of EBS-backed and Instance Storage machines. This will help when EBS goes south; however, it can be more of a pain to set up.
One downside to all this is that you will have to pay for regional data transfer (but luckily not twice) and sort out your security (AWS doesn't currently support cross-region Security Groups).
In conclusion, you probably used a single zone because it's easy (hey - so do we for now!). There will come a point where the pain of getting shouted at by your boss, client or customers outweighs the pain of learning how to get your app set up properly yourself.
You can get your app on AWS to stay up even with a region wide failure. Why don't you?