Github recently suffered from an outage while upgrading a switch. During the update the network has been down for a few seconds but that caused file servers to suddenly appear partitioned.

I see that as a practical application of the CAP theorem: their distributed storage architecture decided to drop partition tolerance in favor of availability and consistency. So once the datacenter got partitioned, software turned bad. Still, choosing partition tolerance was to me a wise choice as availability is desirable and dropping consistency could lead to something very hard to reconcile once the network goes up. As they explain in the post it was a known risk, but hard to mitigate:

The fact that the cluster communication between fileserver nodes relies on any network infrastructure has been a known problem for some time

What I actually retain from this story is that even within a single datacenter, partitions occur. And it is very interesting to read Github’s postmortem posts.