Data failures: the database for your application has bad data, had a bad update (misconfiguration, botched binary update, etc.), replication failed or lagged, or a backup failed (or wasn't persisted, or was persisted for too short a time to be useful).

The fundamental way we build reliable systems is to group the set of failure domains the system straddles into a silo, then replicate that silo as multiple independent instances. The overall reliability of the resulting system depends on how independent we can make the replicas. In practice there is always some interdependence, and minimizing it is always a trade-off of cost against availability.

In modern cloud environments, the set of physical failure domains we need to worry about has been grouped into an easy handle to reason about: the availability zone. However, demonstrating the cost-vs-availability trade-off, we know that availability zones are not always truly isolated from each other (cloud provider outages demonstrate this all too frequently). Therefore cloud providers group sets of availability zones into a higher-level failure domain called a region. In practice it's not uncommon for multiple availability zones in the same region to fail, but it's incredibly rare for multiple regions to fail.

So with availability zones and regions in hand as the physical failure domains to reason about, we're left to think about the logical failure domains. Logical failures largely depend on our own application architecture, and tend to be much more complicated to reason about. The key pieces we need to think about as application developers are our application's deployment, configuration, data, dependencies, and how it's exposed (load balancing, DNS, etc.).
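
To make the point about replica independence concrete, here is a rough back-of-the-envelope sketch. It is not a real availability model: the function name, the numbers, and the assumption that a single "shared fate" event (for example, a region-wide outage) takes down every replica at once, independently of individual replica failures, are all illustrative assumptions.

```python
def replicated_availability(
    replica_availability: float,
    replicas: int,
    shared_failure_prob: float = 0.0,
) -> float:
    """Estimate the availability of a service replicated across silos.

    Assumptions (illustrative only):
    - each replica is up with probability `replica_availability`;
    - replicas fail independently of one another, EXCEPT for one
      shared-fate event with probability `shared_failure_prob`
      that takes all replicas down together.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be between 0 and 1")
    if not 0.0 <= shared_failure_prob <= 1.0:
        raise ValueError("shared_failure_prob must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")

    # Probability that every replica is down for independent reasons.
    all_replicas_down = (1.0 - replica_availability) ** replicas

    # The service is up only if the shared-fate event does not occur
    # AND at least one replica is up.
    return (1.0 - shared_failure_prob) * (1.0 - all_replicas_down)


if __name__ == "__main__":
    # Hypothetical numbers: 99% available replicas, with and without
    # a 0.1% chance of a failure that all replicas share.
    for count in (1, 2, 3):
        independent = replicated_availability(0.99, count)
        correlated = replicated_availability(0.99, count, shared_failure_prob=0.001)
        print(f"{count} replica(s): independent={independent:.6f} correlated={correlated:.6f}")
```

Under these assumptions, adding replicas quickly stops helping once the shared failure dominates: the result can never exceed `1 - shared_failure_prob`, no matter how many replicas we run. That is the cost-vs-availability trade-off in miniature, since reducing the shared term is what spreading across availability zones and regions buys us.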