In a world where image is everything, it is a breath of fresh air to see how companies like Facebook, Twitter and Foursquare handle the media response to their “downtime”. As Oscar Wilde would say: “Experience is simply the name we give our mistakes.” In an ever-expanding internet world, many of us face the same issues as the big guys on the block when scaling up and scaling out.
As Mashable’s Post Mortem section shows, no matter how strong our error-handling frameworks are, there is always room for improvement; Facebook’s article on their “worst outage ever” is a case in point. We are used to seeing the “FailWhale”, but what most people don’t see is how Twitter is usually quick to post on their blog what went wrong and how the problems were solved. And as many sites move to other database systems like MongoDB, we can learn a bit from Foursquare’s “re-indexing” problems.
I believe every development team should have a Post Mortem wiki or blog where new team members can learn from previous mistakes and the whole team can share a sense of collective knowledge.
Have you implemented this idea in your daily life? Is it part of your development team’s practices? Let me know.