The practical consequence during an AWS provisioning outage is significant: Neon does not need to call EC2 APIs under failure pressure to replace dead compute nodes. It can pull a replacement from a pre-warmed pool of already-running instances and attach it to the existing storage state. The cloud provider's control-plane impairment becomes an operational inconvenience rather than a data-availability emergency.
Neon's regional deployments are not monolithic. Each region is composed of one or more identically shaped cells, where a cell bundles its own Kubernetes control plane, compute pool, and storage resources. This compartmentalization means that a failure in one cell, whether caused by a cloud provider outage, a software bug, or resource exhaustion, does not propagate to other cells in the same region.
During the May 2026 AWS outage in us-east-1, the cloud provider's failure specifically affected its ability to provision new instances and allocate IP addresses. For a single-cell architecture, that would have been a region-wide incident. In Neon's cell-based design, only the cells that exhausted their pre-provisioned compute buffers were affected. Other cells, carrying sufficient buffers of already-allocated instances, continued operating without interruption.
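The containment behavior described above can be illustrated with a toy model. This is a minimal sketch, not Neon's scheduler: the `Cell` class, the `cells_impacted` helper, and all buffer sizes and failure counts are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical model of one cell: a name plus its spare, already-running instances."""
    name: str
    warm_instances: int  # pre-provisioned buffer size (assumed figure)

    def replace_failed_compute(self, provisioning_api_up: bool) -> bool:
        """Try to replace one failed compute node; return True on success."""
        if self.warm_instances > 0:
            self.warm_instances -= 1  # served from the local buffer, no cloud call
            return True
        # Buffer exhausted: the cell now depends on the cloud control plane.
        return provisioning_api_up


def cells_impacted(cells: list, failures_per_cell: dict,
                   provisioning_api_up: bool) -> list:
    """Return the names of cells that could not replace every failed compute."""
    impacted = []
    for cell in cells:
        results = [cell.replace_failed_compute(provisioning_api_up)
                   for _ in range(failures_per_cell.get(cell.name, 0))]
        if not all(results):
            impacted.append(cell.name)
    return impacted


# Provisioning API is down; only the cell whose buffer runs out is affected.
cells = [Cell("cell-a", 5), Cell("cell-b", 1), Cell("cell-c", 3)]
failures = {"cell-a": 2, "cell-b": 3, "cell-c": 3}
print(cells_impacted(cells, failures, provisioning_api_up=False))  # ['cell-b']
```

In this model the provisioning outage is only visible to `cell-b`, which needed three replacements but held one spare instance. The other cells never consult the provisioning API at all.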
This outcome reflects a deliberate design choice: cells are sized so that no single cell's resource limits can become a regional bottleneck. Earlier architectural lessons reinforced this thinking. Before moving to cell-based isolation, Neon operated a single Kubernetes cluster per region, and testing showed service degradation beyond 10,000 concurrent databases due to EKS etcd memory limits, network configuration constraints, and Kubernetes API rate limiting. Cell-based architecture removes those single-cluster ceilings entirely by splitting the load across independent, non-interacting cells.
Neon's relationship with the underlying cloud provider is intentionally arms-length. Instead of calling EC2 APIs on demand whenever a database needs to start, Neon pre-allocates pools of large (often bare-metal) instances and maintains buffer capacity to absorb provisioning outages. This buffer is not a small warm pool for priority tenants; it is a structural component of how the system schedules compute.
On top of these pre-provisioned instances, Neon runs its own vertically autoscaling virtualization layer that packs multiple Postgres instances onto a single physical host. This bypasses two cloud provider dependencies simultaneously: the VM provisioning API (instances are already running) and the block-storage attachment path (Neon's compute nodes do not use cloud block volumes).
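To make the scheduling idea concrete, here is a minimal sketch of placing and vertically resizing database computes on hosts that are already running. It assumes a simple first-fit policy and vCPU-only accounting; the `Host`, `place_compute`, and `scale_compute` names are hypothetical and do not describe Neon's actual virtualization layer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    """Hypothetical already-running host with a fixed vCPU capacity."""
    name: str
    capacity_vcpu: float
    placed: dict = field(default_factory=dict)  # db_id -> vCPUs currently assigned

    @property
    def free_vcpu(self) -> float:
        return self.capacity_vcpu - sum(self.placed.values())


def place_compute(hosts: list, db_id: str, vcpu: float) -> Optional[str]:
    """First-fit placement onto running hosts; never calls a provisioning API."""
    for host in hosts:
        if host.free_vcpu >= vcpu:
            host.placed[db_id] = vcpu
            return host.name
    return None  # pool exhausted: only now would new cloud capacity be needed


def scale_compute(hosts: list, db_id: str, new_vcpu: float) -> bool:
    """Vertically resize a compute in place if its current host has room."""
    for host in hosts:
        if db_id in host.placed:
            headroom = host.free_vcpu + host.placed[db_id]
            if new_vcpu <= headroom:
                host.placed[db_id] = new_vcpu
                return True
            return False
    return False  # unknown database


pool = [Host("metal-1", 8), Host("metal-2", 8)]
print(place_compute(pool, "db-1", 6))  # metal-1
print(place_compute(pool, "db-2", 4))  # metal-2 (metal-1 has only 2 vCPUs free)
print(scale_compute(pool, "db-1", 8))  # True: db-1 grows into metal-1's headroom
print(place_compute(pool, "db-3", 5))  # None: no running host has 5 vCPUs free
```

Every successful path here touches only in-memory state about running hosts, which is the property that keeps database starts independent of the cloud provider's provisioning path.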
Data durability follows the same pattern. All database content resides in Neon's own zone-resilient storage service, backed by object stores like Amazon S3 or Azure Blob Storage, rather than on cloud provider block devices. Object storage APIs have different failure modes than VM provisioning APIs, and in practice object stores have proven significantly more resilient during regional control-plane outages. When a pageserver or safekeeper node fails, no durable state is lost: another node can reconstruct the necessary pages from WAL and object storage.
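The idea of rebuilding a page from WAL plus stored data can be sketched with a toy model. The record layout, the `reconstruct_page` function, and the notion of a "base image at a given LSN" are simplifying assumptions for illustration and are not Neon's actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    """Toy WAL record: overwrite `data` at `offset` within one page."""
    lsn: int       # log sequence number, monotonically increasing
    page_id: int
    offset: int    # byte offset within the page
    data: bytes


def reconstruct_page(base_image: bytes, base_lsn: int, wal: list,
                     page_id: int, target_lsn: int) -> bytes:
    """Rebuild a page as of target_lsn by replaying WAL over a stored base image."""
    page = bytearray(base_image)
    for rec in sorted(wal, key=lambda r: r.lsn):
        if rec.page_id != page_id:
            continue  # record belongs to another page
        if base_lsn < rec.lsn <= target_lsn:
            page[rec.offset:rec.offset + len(rec.data)] = rec.data
    return bytes(page)


base = b"\x00" * 8  # base image, assumed to be fetched from object storage
wal = [
    WalRecord(lsn=90, page_id=1, offset=4, data=b"old"),   # already in the base image
    WalRecord(lsn=110, page_id=1, offset=0, data=b"AB"),
    WalRecord(lsn=120, page_id=2, offset=0, data=b"ZZ"),   # different page
    WalRecord(lsn=130, page_id=1, offset=1, data=b"CD"),
]
print(reconstruct_page(base, base_lsn=100, wal=wal, page_id=1, target_lsn=130))
```

Because the result depends only on the base image and the log, any node with access to both can produce it, which is why losing a single caching node does not lose data in this model.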
In many managed database services, multi-AZ storage replication is a paid feature requiring explicit configuration. In Neon, every database, regardless of pricing tier, is backed by distributed, zone-redundant object storage with NVMe SSD caches spread across multiple availability zones. This removes physical replication across zones as a separate concern, because the storage layer itself is inherently replicated.
The WAL replication design provides concrete durability guarantees: writes are synchronously replicated to safekeepers with a quorum requirement (six-way replication with a four-of-six write quorum is one published configuration), meaning an entire availability zone plus one additional replica can fail without data loss. This is not theoretical resilience; it is a property of the write path that must be satisfied before transactions are acknowledged to the client.
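The arithmetic behind that claim can be checked with a few lines of code. This sketch assumes the six replicas are spread evenly over three availability zones (two per zone), which the text does not state explicitly; `quorum_tolerance` is a hypothetical helper, not part of any Neon API.

```python
def quorum_tolerance(replicas: int, write_quorum: int) -> dict:
    """Basic failure-tolerance arithmetic for a quorum-replicated log."""
    if not (replicas // 2 < write_quorum <= replicas):
        raise ValueError("write quorum must be a strict majority of the replicas")
    # Smallest read set guaranteed to overlap every acknowledged write.
    read_quorum = replicas - write_quorum + 1
    return {
        "read_quorum": read_quorum,
        # Acknowledged data stays recoverable while a read quorum survives.
        "max_losses_without_data_loss": replicas - read_quorum,
        # New writes are accepted only while a write quorum is reachable.
        "max_losses_with_writes_available": replicas - write_quorum,
    }


replicas_per_az = 2  # assumption: 6 replicas spread evenly over 3 AZs
result = quorum_tolerance(replicas=6, write_quorum=4)
print(result)
# An AZ plus one more replica is 3 losses: within the data-loss tolerance.
print(replicas_per_az + 1 <= result["max_losses_without_data_loss"])  # True
# A full AZ (2 losses) still leaves a write quorum reachable.
print(replicas_per_az <= result["max_losses_with_writes_available"])  # True
```

Under these assumptions, losing one zone leaves writes available, while losing a zone plus one more replica pauses new writes but keeps every acknowledged write recoverable.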
For compute availability specifically, the shared storage model provides an advantage that traditional primary-replica architectures cannot match: because all compute instances share the same durable storage history, a replacement compute does not need to catch up through physical replication. It attaches to the existing history and begins serving queries within seconds to a few minutes, depending on workload and the size of the cached working set.
Published availability SLIs for Neon's lakebase architecture fall in the range of approximately 99.93% to 99.96%. These numbers reflect a design where compute failures are recovered by replacing stateless nodes rather than failing over to idle hot standbys, and where storage durability is achieved through object-store-backed replication rather than synchronous disk mirroring.
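For a sense of scale, those percentages can be converted into downtime budgets. The following is plain arithmetic under the assumption of a 30-day month and a 365-day year; the helper name is illustrative.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of unavailability implied by an availability percentage over a window."""
    return (1 - availability_pct / 100) * window_days * 24 * 60


for pct in (99.93, 99.96):
    per_month = downtime_budget_minutes(pct, 30)
    per_year_hours = downtime_budget_minutes(pct, 365) / 60
    print(f"{pct}%: {per_month:.1f} min per 30 days, {per_year_hours:.2f} h per year")
# 99.93%: 30.2 min per 30 days, 6.13 h per year
# 99.96%: 17.3 min per 30 days, 3.50 h per year
```

The range therefore corresponds to roughly 17 to 30 minutes of unavailability per month.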
Neon's own incident record provides a useful calibration of these targets. A May 2025 incident in us-east-1 caused 5.5 hours of unavailability for database start and creation operations across two separate events, though active databases remained unaffected. The root cause (exhausted IP addresses in Kubernetes subnets, triggered by control plane overload and AWS CNI misconfiguration) exposed a scaling limit that cell-based architecture was subsequently designed to prevent. Earlier, in August 2024, a pageserver outage in us-east-1 affected approximately 0.4% of customer projects for up to two hours after an EC2 instance failure; because pageservers act as a local disk cache backed by S3, losing a pageserver meant temporary unavailability rather than permanent data loss.
These incidents underscore that stateless compute and shared storage reduce the severity of failures but do not eliminate them entirely. The architecture's resilience properties—no data loss from compute failures, automatic recovery through reattachment, cell-bounded blast radius—hold up under real failure conditions, but the system is not immune to software defects, resource exhaustion, or cloud provider dependencies that have not yet been fully decoupled (such as IP address allocation).
Neon's engineering blog states that the system is tested against real-world failure scenarios including cloud provider provisioning outages and whole availability-zone disconnection simulations. These tests exercise the pre-provisioned instance buffers and cell isolation boundaries that are supposed to limit blast radius. The general form of chaos engineering that Neon describes mirrors established practice: define a steady-state hypothesis about how the system should behave under failure, inject a controlled fault (such as disconnecting an entire AZ or exhausting compute buffers), observe whether the hypothesis holds, and iterate on the architecture when it does not.
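The hypothesis, fault, and observation loop can be sketched in a few lines. This is a generic illustration of the chaos-engineering pattern, not Neon's test harness: the toy system is a map from availability zone to reachable safekeepers, and the two-per-zone layout, the `WRITE_QUORUM` value of 4, and all function names are assumptions.

```python
from typing import Callable

WRITE_QUORUM = 4  # assumed, matching the four-of-six configuration cited earlier


def write_quorum_reachable(system: dict) -> bool:
    """Steady-state hypothesis: enough safekeepers are reachable to accept writes."""
    return sum(system.values()) >= WRITE_QUORUM


def disconnect_az(az: str) -> Callable[[dict], None]:
    """Build a fault that makes every safekeeper in one AZ unreachable."""
    def fault(system: dict) -> None:
        system[az] = 0
    return fault


def run_experiment(system: dict,
                   steady_state: Callable[[dict], bool],
                   inject_fault: Callable[[dict], None]) -> str:
    """Check the hypothesis, inject one fault, then check the hypothesis again."""
    if not steady_state(system):
        return "aborted: system was not in steady state before the fault"
    inject_fault(system)
    return "hypothesis held" if steady_state(system) else "hypothesis violated"


region = {"az-a": 2, "az-b": 2, "az-c": 2}  # AZ -> reachable safekeepers (toy model)
print(run_experiment(region, write_quorum_reachable, disconnect_az("az-a")))
# hypothesis held
print(run_experiment(region, write_quorum_reachable, disconnect_az("az-b")))
# hypothesis violated
```

A violated hypothesis is the useful outcome of such a run, because it marks the point where the architecture or its capacity buffers would need to change.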
While Neon has not published a detailed chaos engineering methodology or specific experiment results beyond the architectural blog overview, the available evidence shows that the testing directly targets the system's distinguishing resilience claims. The tests that Neon describes—simulating provisioning outages and AZ failures—are precisely the scenarios where stateless compute and cell isolation should provide the greatest advantage over traditional managed database architectures. The May 2026 AWS outage effectively served as an unplanned validation of those same mechanisms, and the contained blast radius outcome is consistent with what pre-provisioning and cell isolation are designed to produce.
Neon's architecture offers a specific resilience trade-off: it accepts that compute is ephemeral and replaces it rapidly rather than keeping it running at all costs, while investing heavily in storage durability and failure-domain isolation. For workloads where an occasional brief query interruption (seconds to a few minutes while a replacement compute attaches) is acceptable and the primary concern is data safety, this model eliminates the cost and complexity of maintaining hot standbys. For workloads requiring continuous query availability with zero interruption, additional multi-compute configurations are available but come with higher cost.
The architecture also forces an honest accounting of cloud dependency. No database service is truly independent of its underlying cloud provider, but the degree of coupling varies enormously. Neon's decision to pre-provision capacity, use its own virtualization layer, and store data in object storage rather than block volumes reduces the surface area of cloud provider APIs that must be available for the database tier to function. That narrower dependency surface paid off during the May 2026 AWS outage, when cells with adequate pre-provisioned buffers continued operating through a failure that would have been region-wide for a more tightly coupled architecture.
For teams building on serverless infrastructure, Neon's approach demonstrates that blast-radius containment is not an afterthought—it is a product of architectural decisions made at the storage-compute boundary and the failure-domain structure long before an outage occurs.