Tools: Stop Paying a Streaming Bus to Carry Bytes That Live for Ninety Seconds - Full Analysis

Tools: Stop Paying a Streaming Bus to Carry Bytes That Live for Ninety Seconds - Full Analysis

How a shared filesystem became the cheapest, fastest outbox I've ever built — and why FSx for OpenZFS is the version of that idea that finally scales

What "transitional data" actually means

The Kinesis math at scale, written out plainly

MSK isn't the answer either

The mental shift: payload on a filesystem, pointer on the bus

Why FSx for OpenZFS, specifically

Intelligent-Tiering, and why it matches transitional-data access patterns perfectly

The architecture in two passes

Producer pass

Consumer pass

The safe-write protocol, in detail

Idempotent retry without re-reading the bytes

Byte-based backpressure, not request-based

Naming, layout, and the partition directory tree

The Java library question, and why "kernel NFS + plain NIO" wins

Failure semantics, in painful detail

Cost worked example at 100 TB/day

Observability and what to watch

When this pattern is the wrong answer

Closing I was staring at an AWS bill last quarter where a single Kinesis Data Streams line item was costing more than the entire S3 footprint sitting behind it. The events on that stream had a useful lifetime of about ninety seconds. They were written by one service, read by another, processed, and dropped. We were paying full streaming-bus price for bytes that barely outlived a TCP timeout. That bill is what got me thinking about transitional data as a category that deserves its own architecture, and about why every "use the right tool" instinct I had — Kinesis, Kafka, MSK — was the wrong tool for this particular shape of work. The right tool, it turns out, is a filesystem. Specifically, AWS FSx for OpenZFS, used as an outbox between producers and consumers, with only a tiny pointer message traveling through whatever messaging bus you already have. This article is the case for that pattern. It's also the design, the failure modes, the code, the cost math, and the honest list of when not to do it. I'll walk you through the architecture from first principles, show you the safe-write protocol that makes it correct under crashes and concurrent retries, compare the cost against Kinesis, MSK and EFS at a realistic petabyte-class workload, and explain why the recent addition of FSx Intelligent-Tiering changes the cost story in a way that makes the pattern attractive even for teams that don't ingest petabytes. If you've ever felt the queasy sensation of paying twice for the same bytes — once to land on a stream, again to land in storage — this is for you. Most data falls into one of two cleanly shaped buckets. Durable data is the stuff you keep — user records, orders, financial events, audit trails. It needs to live for years; you pay storage costs for those years and you get value over those years. Streaming data is data you process in motion — clickstreams, telemetry, alerts — where the value is in real-time consumption. Transitional data is the awkward middle child. It's data that: The classic examples are event-pipeline workloads: a producer ingests external events, enriches them, then hands them to one or more downstream consumers for further processing. The "stream" in the middle is conceptually a pipe, not a database. The events have ordering constraints within partitions and at-least-once delivery requirements, but nobody is querying the stream itself — it's purely a carrier. The instinct, drilled into us by ten years of cloud-native architecture talks, is to put transitional data on a streaming bus. Kinesis. MSK. Pub/Sub. Event Hubs. It's the default. It's what the conference slides recommend. It's what every reference architecture diagram shows. And it works fine — right up until your volume gets serious. At that point, the streaming bus stops being a transport and starts being the largest line item on your bill. Let me walk through the actual numbers. A Kinesis Data Streams shard in provisioned mode gives you, per the AWS service limits documentation, 1 MB/s of write throughput OR 1,000 records per second, whichever you hit first. Read throughput is 2 MB/s per shard (5 transactions per second). Whichever cap you smash into first is the cap you actually have. Suppose you're ingesting at a steady 700 MB/s, which is what a healthy event pipeline at ~100 TB/day looks like. You need 700 shards just to hold the write rate, before any consumer fan-out, before any headroom for spikes, before any consideration of hot keys. Hot keys make the picture worse. The shard you land on is determined by hash(partitionKey) % shardCount. If 5% of your traffic comes from one customer or one stream of telemetry, that 5% all goes to one shard. One shard, 1 MB/s. The other 699 shards sit there underutilized while your hot shard throttles. You can solve this with key spreading, but key spreading breaks per-key ordering, which is often the entire reason you picked a partitioned stream in the first place. So you scale up — to 1,000 shards, to 1,500 shards — to give the hot key room. Shards are billed per shard-hour. At list price the per-shard cost is small, but at 700–1,500 shards it adds up to tens of thousands of dollars per month before you've ingested a single byte. Add PUT payload unit charges (each 25 KB chunk counts as a unit, so 700 MB/s of small events is around 28,000 units per second), enhanced fan-out per consumer, extended retention, and you're well into the high five figures per month for a transport that exists solely to move bytes that won't be relevant in five minutes. On-demand mode looks nicer until you read the small print. It's billed per GB ingested, per GB retrieved, plus a baseline stream-hour fee. At 100 TB/day, the per-GB charges alone clear $30K/month before you count consumer reads. You can argue with the exact dollar figures — they vary by region, by reserved-capacity discounts, by your actual fan-out fanout pattern — but the shape is the same regardless. The cost grows linearly with the bytes you push through, and the bytes you push through are exactly the same bytes you have to store somewhere durable anyway. You are paying twice. The instinct after "Kinesis is expensive" is "let's use Kafka instead." Kafka is open-source, it's mature, the ecosystem is enormous. AWS gives us MSK so we don't have to operate the brokers ourselves. Surely that's cheaper. It's not, for this shape of workload. Let me show the math. MSK pricing, as of 2026, looks roughly like this for provisioned clusters: broker instances at around $0.20/hour for the small kafka.m7g.large class and proportionally more for larger ones, EBS storage at $0.10 per GB-month for the broker volumes, plus inter-AZ data transfer for replication. MSK Express adds another $0.01/GB ingested. MSK Serverless charges $0.75 per cluster-hour, $0.0015 per partition-hour, $0.10/GB in and $0.05/GB out. For a 100 TB/day workload you'll want at least 6–9 brokers of a non-trivial instance class to handle the throughput with replication-factor-3 durability. That's $3K–$6K/month in broker hours alone. Then you need EBS sized to hold a few hours of buffer per broker — say 2 TB per broker times 9 brokers times $0.10/GB-month, another $1,800/month. Then cross-AZ replication of every byte ingested: at 100 TB/day across three AZs you're shifting roughly 200 TB/day cross-AZ for replication, and AWS charges around $0.01/GB for that — call it $60K/month if you're unlucky on routing. Add the operational burden: MSK frees you from broker installs but not from partition rebalancing decisions, broker right-sizing, version upgrades, ZooKeeper-to-KRaft migration, ACL management, monitoring, and the eternal "why is my consumer lag spiking" investigations. That's not a billed line item but it's a real cost. Rough total for the same 100 TB/day shape: $70K–$90K/month, give or take. Comparable to Kinesis, more operational headache, no architectural advantage for transitional data because — and this is the key point — you are still paying a transport service to carry every single byte through brokers and across AZs, even though those bytes are going to be discarded within minutes. The rule of thumb in the industry has settled at MSK being roughly 3–5× more expensive than Kinesis once you count operational overhead. For specifically transitional data, where the value of the byte-time on the wire is approximately zero, either choice is the wrong economic shape. There has to be a different model. Here's the model that actually fits. It's the outbox pattern, but the outbox is a shared filesystem instead of an in-process table. The producer doesn't push bytes through a transport. Instead it writes a file to a shared filesystem — a real file, on a real path, with normal POSIX semantics. Then it publishes a tiny pointer message through whatever bus you have lying around — Kinesis, SNS, SQS, even a small Kafka topic. The message is a hundred bytes: file path, batch id, size, checksum, maybe a couple of routing keys. The consumer reads the pointer message, mounts the same filesystem, opens the file at the given path, streams the bytes, processes them, and acknowledges the message. The bytes never traverse the transport. The transport only carries metadata about where to find the bytes. That diagram is the entire pattern at the architectural level. Three boxes. The interesting part is everything you don't see in it: backpressure, atomic writes, idempotent retry, partitioning, batching, compression, file-system tuning. We'll get to all of that. But first, why this change of frame is so cost-effective: The bus carries small messages, so the bus cost collapses. Pointer messages are roughly two hundred bytes each. A bus that was strained to carry 700 MB/s of payload is now carrying maybe 200 KB/s of pointers. Shard counts can drop by an order of magnitude or more. One real architecture I've seen reduced a Kinesis shard count from 700 to 32 — a 95% reduction in transport spend, with no change to the actual byte volume being moved. The filesystem is billed for what it stores, not for what passes through it. You pay for the bytes that exist, not the bytes that have existed and have already been consumed and deleted. If your retention is one hour, you pay for one hour's worth of bytes resident at any given moment. The filesystem gives you primitives the bus doesn't. Snapshots. Random access. Concurrent multi-reader semantics. Standard POSIX tools to inspect, validate, and debug. Your data-engineering team can ls your transitional data. They cannot ls a Kinesis stream. The compute path is the same speed or faster. A modern NFS mount on the same VPC moves data faster than any HTTP-based streaming bus. Once you understand the math, the streaming-bus model is the slow option, not the fast one. So why didn't everyone do this ten years ago? Because the filesystem options didn't scale. EFS exists, but at the throughput required for petabyte-class workloads EFS becomes the most expensive option in the room. We'll get to that comparison. The reason this pattern is suddenly viable is that FSx for OpenZFS, particularly with Intelligent-Tiering, gives you the performance and the multi-AZ durability of a real production filesystem at a cost that beats every transport on the market. There are at least six AWS storage services you could plausibly use here. Let's eliminate them one by one. S3 direct from producers. Tempting because S3 is cheap, well-understood, and infinite. But: producers write small files (a few hundred KB to a few MB per micro-batch), and S3 has a small-files problem. Per-PUT overhead, eventual-consistency caveats on listings (mostly fixed but historically a footgun), and a multipart-upload model that's clumsy for the size of files you actually produce. Consumers also have to do an LIST or rely on S3 event notifications, which adds latency and complexity. S3 is the destination tier, not the working tier. EFS Standard. Native NFS, multi-AZ, easy to mount everywhere. At $0.30/GB-month for storage and elastic throughput costing $0.03/GB read and $0.06/GB written, a 100 TB/day workload runs about $300K/month in throughput charges alone in elastic mode. EFS Provisioned trades that for a fixed throughput fee — at the throughput levels we need, around $90K/month. Still expensive, and EFS latency is consistently in the single-digit-millisecond range rather than the sub-millisecond range you'd want for a hot working tier. FSx for Lustre. Genuinely fast — sub-millisecond latency, hundreds of GB/s aggregate throughput in larger deployments, the workhorse of HPC. But it's single-AZ only, and the failure model for transitional data really wants multi-AZ. You can stitch together cross-AZ Lustre with replication, but the cost balloons and you lose simplicity. Lustre also requires a kernel module on clients, which is operationally awkward in mixed container environments. FSx for Windows / FSx for ONTAP. Both work, both support multi-AZ, both add complexity (SMB licensing, ONTAP feature surface). Neither is wrong, neither is the obvious right. FSx for OpenZFS. Multi-AZ with synchronous replication to a standby in another AZ. NFS protocol (v3 / v4.0 / v4.1 / v4.2) — clients are the standard Linux kernel NFS client, no special drivers. SSD-backed. Sub-millisecond latency. Native LZ4 compression at the file-system level. POSIX semantics, including the all-important atomic-rename guarantee we need for safe writes. Snapshots, encryption at rest, KMS integration. And — this is the new part — an Intelligent-Tiering storage class that prices in at roughly 85% less than the SSD class. The recent Intelligent-Tiering announcement is what tipped this from "viable" to "obvious." Let's look at what it actually gives you. FSx Intelligent-Tiering is a separate storage class for FSx for OpenZFS that introduces three tiers within a single namespace. Per the AWS announcement: Multiply the discounts and you get a storage cost in the Archive tier that's roughly 80% lower than the Frequent Access baseline. The marketing line is "up to 85% lower than the existing SSD storage class," which checks out arithmetically. A file system using Intelligent-Tiering supports up to 400K IOPS and 20 GB/s of throughput, with a minimum provisioned throughput floor of 160 MBps. You can optionally provision an SSD read cache on top of the tiered storage to keep hot files at sub-millisecond latency even after they've technically migrated to a colder tier. Here's why this matches transitional data so perfectly: the access pattern for an outbox is "hot for minutes, cold forever after." A batch file is written, read once or twice by consumers within the first few minutes, and then never touched again unless someone is doing forensic analysis. That's exactly the pattern Intelligent-Tiering is optimized for. The hot files stay on the fast tier and serve consumers at sub-ms latency; the post-consumption files quietly slide down to the Archive Instant Access tier, where they cost almost nothing but are still readable on demand if you need to audit. You don't have to design this. You don't have to set up lifecycle rules. The file system does it. From the application's perspective, every file is at the same path in the same namespace. The pricing optimization happens underneath. Let's walk through the producer side and the consumer side in turn. This is the actual shape of a working implementation, lightly cleaned up from a real codebase. A producer's job is to take a batch of events, compress them, write the result durably to FSx, and publish a notification message to the bus. The full sequence is: Step 5 onwards is the safe-write protocol, and it's the heart of correctness. We'll spend the next section on it. Steps 1–4 are application logic and bookkeeping; they're conceptually simple but the byte-based backpressure deserves its own discussion. Consumers do the inverse: There's no LIST, no scan, no polling for new files. Consumers are driven by the bus. The bus tells them what to read; the filesystem holds what they read. This separation matters operationally. A slow consumer doesn't back up the producer's writes — the producer's writes are already complete on the filesystem. A bus outage doesn't stall the producer beyond the publish step; the file is already durable. A consumer crash doesn't lose data; the next consumer pulls the same notification (because we hadn't acked) and reads the same file. The single most important piece of code in this entire pattern is how a producer commits a batch to the filesystem. Get this wrong and you have torn writes, partial reads, racing retries deleting each other's work. Get it right and you have an outbox that survives crashes without complicated coordination. The protocol has six steps. Each one is doing real work. Step 1 — create the batch directory. Files.createDirectories(batchDirectory). POSIX mkdir -p. This is idempotent and cheap. The batch directory is computed deterministically from the partition id and a UUID-based batch id, with time-based folder buckets above it to avoid one directory eventually holding millions of entries. Step 2 — open a unique temp file. Critically, the temp filename includes a per-attempt unique token. The template I use is something like data.bin.tmp.<uuid>.<attempt>. The reason for the uniqueness is that retries must not collide with each other or with a previous crashed attempt. If attempt 1 crashed mid-write and left a data.bin.tmp.abc123.1 orphan, attempt 2 must open data.bin.tmp.def456.2 — a different name — so neither attempt steps on the other. The StandardOpenOption.CREATE_NEW flag enforces this: the open fails if the file already exists, which is exactly what we want. Step 3 — vectored write. This is where the implementation rewards careful engineering. A naïve writer would loop over composite buffer components and issue one write() per component. A composite payload of a hundred small buffers becomes a hundred syscalls. Instead we use FileChannel.write(ByteBuffer[]), which is the JDK's bridge to the kernel's writev syscall. The kernel takes a whole array of buffer descriptors and does the gather in one call. Many small components, one syscall. There's a chunking detail: most kernels cap the gather size somewhere around IOV_MAX (1024 on Linux). The code respects that with GATHER_WRITE_BUFFER_LIMIT = 1024. A payload with more components is gathered in 1024-buffer slices. The CRC32C is computed during the same pass — one walk over the bytes, two outputs. Computing the checksum as a separate post-pass would double the byte traffic for nothing. Step 4 — fsync. Optional. fileChannel.force(forceMetadata) issues fdatasync (or fsync if metadata is included) to push the bytes to persistent storage before returning. On a synchronously-replicated multi-AZ FSx, this also blocks until the standby AZ has acknowledged the write. It's expensive (in latency, not bytes) but it gives you durability before the rename. Most workloads can skip it if they tolerate the small window between rename and standby-AZ sync; high-durability workloads should turn it on. Step 5 — atomic rename. Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE). POSIX guarantees that a rename(2) is atomic on the same filesystem: an observer either sees the old name or the new name, never both, never neither, never a partial state. Translated to our case: a consumer either sees data.bin (and can read the full payload) or doesn't see it at all (and will retry later). It never sees a partial data.bin. This guarantee is what makes the whole pattern correct. There is no other coordination, no manifest file, no two-phase commit. The single atomic rename is the publish moment. There's a subtle case: ATOMIC_MOVE providers on some implementations will replace an existing target without erroring. Before issuing the move we therefore probe for an existing data.bin. If it exists, the producer treats this as an idempotent retry — more on idempotency in a moment. Step 6 — publish pointer. Only after the rename succeeds do we publish the notification message. The order matters: if we published first and then the rename failed, consumers would chase a phantom file. The pointer message is intentionally tiny: Two hundred bytes, give or take. The bus carries the pointer. The filesystem holds the payload. The transport bill goes from "scale with bytes" to "scale with message count" — and message count for a typical batching workload is in the hundreds-per-second range, well within any bus's comfort zone. Retries are unavoidable. Producers crash, networks blip, NFS metadata operations stall. The framework has to handle all of these without producing duplicate files or losing the in-flight write. The idempotency rule I use is intentionally simple: if the final data.bin already exists at the expected path, treat it as a successful prior write — without re-reading the bytes. The size-only check is deliberate. Re-reading a multi-megabyte compressed payload over NFS just to verify a checksum would double the IO cost of every retry, and the atomic-rename protocol already ensures that if the final file exists, it contains a complete, validated payload. The size check is a cheap sanity guardrail against bizarre cases where another process wrote a different file with the same name. If the size matches, we return success with an idempotentExistingWrite=true flag. The caller sees a normal success result and continues. The notification publish step is then idempotent on its own end — most buses dedup on a message id you can derive deterministically from the batch id. This is at-least-once delivery, not exactly-once. The consumer side has to dedup by batchId. That's the standard contract for event pipelines and it's fine; building exactly-once on top of at-least-once with idempotent consumers is a solved problem. Naive backpressure limits the number of concurrent operations. "No more than 16 writes at a time." That works when every write is roughly the same size. It breaks the moment one write is 10 KB and the next is 100 MB. The 16-operation limit lets a single bad batch consume gigabytes of in-flight memory while the limiter thinks it's still doing the right thing. Instead, the limiter I use bounds bytes in flight, not operation count: Both must be satisfied before a write is admitted. When neither is satisfied, the request is queued; when capacity becomes available (a running write completes), the queue is drained in FIFO order. The whole thing is non-blocking — callers get a CompletableFuture<Permit> and can compose it into their own async chain. The detail that matters in production: oversized requests (those larger than maxInFlightBytes on their own) need to be handled. The implementation clamps reservedBytes to the limit, so an oversized write can run, but it runs alone. The caller is responsible for not handing in 10 GB requests; the framework's protection is against the runaway accumulation of "merely large" requests, not against a single pathologically-large one. Why is this fancy enough to need its own class? Because the difference between request-based and byte-based limiting is the difference between "the limiter does what I think it's doing" and "memory exploded at 3 AM because one batch was unusually large." Subtle bugs there are expensive to debug. The filesystem layout matters more than it might seem. Here's the convention I use: A few rationales worth calling out: Partition as the top-level grouping. Within a single partition, writes are serialized by the producer's own logic. Across partitions, writes are independent. Putting partition first in the path lets consumers scan or process one partition without walking the whole tree, and lets retention policies operate at partition granularity. Hive-style time-bucket folders. year=Y/month=M/day=D/hour=H/ is the convention Hive and Spark and a hundred analytics tools recognize. If you ever want to plug a query engine over the FSx tree (DuckDB, Presto, anything), the partitioning is already in a shape it understands. More importantly, time-bucketing prevents any single directory from accumulating millions of entries — a real performance issue for NFS metadata operations. Per-batch directory. Each batch gets its own folder, not just its own file. This gives you room to add sidecar files later (index files, per-batch metadata, manifest JSON) without breaking the existing readers. The batch id is a UUID, so collisions are not a concern. data.bin as the final filename. Boring, predictable, descriptive. The temp file uses the same final name with a .tmp.<token>.<attempt> suffix so the atomic rename target is unambiguous. There's also a critical security check: the partition id is sanitized and validated to stay under the configured FSx root. A producer cannot pass a partition id like ../../../../etc/passwd and write outside the intended tree. The sanitizer rejects path-traversal characters and the framework asserts the resolved path is a descendant of the root before doing anything with it. When you start looking for a Java library to talk to FSx OpenZFS, you find a mess. There's no official AWS SDK for FSx data-plane operations — the AWS SDK only handles the management plane (create/destroy file systems). For the actual file I/O you're on your own. I went down the rabbit hole: aws-sdk-java-v2 + S3AsyncClient + S3 Access Points for FSx. Works if you expose your FSx file system through an S3 access point. Mature, async, multipart, retries, integrates with the AWS SDK ecosystem. The downside is the S3-style access has tens-of-milliseconds latency rather than the sub-millisecond NFS-mount latency. For batch jobs that's fine; for hot transitional data you give up most of the perf advantage. dCache/nfs4j. A pure-Java NFSv3 and NFSv4 implementation. The most serious Java NFS library available. It's actively maintained, has a JMH benchmarks module, runs on Java 17. If you absolutely need to write your own NFS client (or extend one), this is where you'd start. But it's not a turnkey FSx client — you'd be building protocol code, and AWS's own performance documentation is opinionated about mount-time options that are easier to apply with the kernel client. EMCECS/nfs-client-java. NFSv3 only, dependent on a years-old version of Netty. Workable for legacy use, not a foundation for a petabyte-scale system in 2026. SMBJ, jcifs. SMB protocol clients. Wrong protocol family — FSx for OpenZFS doesn't speak SMB. Hadoop / Spark NFS connectors. Useful for ideas about request pipelining, not a foundation. The conclusion I came to is the boring one: mount the FSx file system with the kernel NFS client, and use the standard Java NIO FileChannel / AsynchronousFileChannel against the mount point. AWS's tuning guidance is much more opinionated about client-side mount options than about language libraries. Specifically: Once you do those things, the kernel does the heavy lifting. Java NIO becomes the thinnest possible wrapper over the syscalls. You're competing with the kernel for performance, and the kernel wins. So the implementation pattern is: That's it. No custom protocol code. No special drivers. No language-specific clients. Just the things the kernel is already tuned to do well. The honest list of what happens when things go wrong. Producer crashes after writing the temp file but before the atomic rename. The temp file is an orphan. No consumer ever sees it (consumers only look for data.bin). The producer's next attempt uses a new unique temp filename, so it doesn't collide. The orphan is cleaned up by a TTL job or a periodic sweep — the framework explicitly does not handle this, because the cleanup cadence is a deployment decision. Producer crashes after the rename but before publishing the notification. This is the dangerous one. The file is on FSx, but no consumer knows it exists. On the next retry, the producer attempts to write again with the same batch id, sees the existing data.bin, recognizes it as an idempotent retry, and proceeds to publish the notification. The consumer dedups by batch id so it doesn't matter that the producer might publish twice. The net result is at-least-once delivery, with at-most-once processing preserved by the consumer's dedup. FSx write succeeds, notification publish fails. Same picture as above. The file stays on FSx. The framework returns the publish failure to the caller, who is expected to retry. Cleanup is deferred to the retention policy. The pattern is explicitly non-transactional: FSx and the bus are composed, not atomically coupled. Notification publish succeeds, FSx file is later corrupted or deleted. This shouldn't happen — FSx is durable storage — but if it did, the consumer would get a read error and fail the message. With a dead-letter queue it would surface as an actionable alert. The CRC32C in the pointer message lets the consumer detect a corrupted file before deserializing. Consumer crashes mid-read. Bus message isn't acked. After visibility timeout, another consumer picks it up and reads the same file. Same file, same bytes, same processing. Dedup by batch id at the consumer side keeps the processing semantics correct. FSx becomes unavailable mid-write. Writes fail with retryable exceptions. The framework retries with exponential backoff and jitter (the standard schedule: base delay, doubled on each failure, with a random jitter component bounded by the current ceiling). After the max attempts, the failure is propagated to the caller, who is expected to translate it into a rate-limit response or a circuit-break upstream. Critically, the framework does not sleep on the I/O executor thread. Retries are scheduled on a dedicated scheduler so the I/O threads stay free to handle other writes. FSx is healthy but slow (a metadata operation stalls). The outer write timeout protects callers. The framework wraps the write future in a withTimeout that completes the caller-visible future with a TimeoutException after a configured deadline, while the actual write is allowed to complete in the background. The framework holds the pooled direct buffer reference until the real write finishes, not until the timeout fires, so we never release memory that NIO is still using. The size of this subtle bookkeeping difference, in production, is the difference between "occasional timeouts" and "occasional segfaults." That last point is worth dwelling on. Caller-visible timeouts must not free resources that the underlying I/O still owns. The pattern I use: The payload.close() only runs when the inner writeFuture completes for real, regardless of whether the outer timeout-wrapped future has already returned. The buffer reference outlives the caller-visible future, by design. This is the kind of detail that doesn't show up in any architecture diagram but determines whether your system stays up under load. Let's do the actual math. The workload: 100 TB/day of compressed transitional data, 24-hour retention, multi-AZ durability required. List prices in us-east-1. Two clarifications. First, these are illustrative ballpark figures for the same shape of workload, not benchmarked actuals. Your numbers will differ. Second, the FSx Intelligent-Tiering number assumes a typical transitional-data access pattern — files are read within the first few minutes and then never touched again, so they migrate to the Archive tier quickly. If your access pattern is heavier (consumers re-reading historical data frequently), the savings shrink because more data stays warm. The headline numbers are real, though. Moving from a streaming-bus model to a filesystem-pointer model knocks the cost down by roughly an order of magnitude at this scale. Adding Intelligent-Tiering knocks it down by another factor of five-ish. For a 100 TB/day workload you're looking at moving from ~$70K/month to ~$5–27K/month. That's not a marginal optimization. That's the kind of saving that funds the project on its own. The framework I work with emits a small set of metrics, all tagged by partition id, that have proven repeatedly useful in production: Watching inflight.bytes against maxInFlightBytes tells you immediately whether you're sized correctly. Watching backpressure.pending tells you whether producers are being throttled by the limiter (a good sign that your downstream is saturated) versus by FSx itself (which would show up as slow write.latency). On the FSx side, watch the published CloudWatch metrics: DataReadBytes, DataWriteBytes, ClientConnections, NetworkThroughputUtilization, FileServerDiskIopsUtilization, FileServerCacheHitRatio. The cache hit ratio in particular is the early-warning signal for "I should provision an SSD read cache" if your access pattern starts re-touching aged files. The honest list, because the worst thing you can do with an architecture article is sell the pattern as universal. Your traffic is small. If you're moving gigabytes per day, not terabytes per day, the cost gap closes and the operational overhead of a shared filesystem outweighs the savings. Use the streaming bus. It's fine. Your producers and consumers are in different VPCs / regions. Cross-VPC NFS works but it's awkward. Cross-region defeats the whole point — at that point you're back to needing a real transport. If your topology is multi-region active-active, this pattern fits poorly. You need true streaming, with downstream subscribers wanting to react to each event individually. This pattern is fundamentally batched. The minimum unit of work is a batch file, not an event. If your downstream wants per-event push semantics within single-digit-millisecond latency, you want a real streaming bus. Your consumers are serverless functions that can't mount NFS. Lambda can't mount FSx for OpenZFS (it can mount EFS, but not FSx OpenZFS). If your consumer side is Lambda, your options are (a) use EFS instead, (b) front FSx with S3 access points and have Lambda read via S3 (sacrificing latency), or (c) use a small ECS task as an intermediary. None of those are great. Pick a different pattern. You need the bus to provide ordering, replay, or stream-processing semantics. A pointer-on-bus model gives you ordering only if the bus already provides it (Kinesis with strict shard partitioning, Kafka with partition keys). It does not give you stream replay or windowed aggregations or any of the operations that real stream processors do. If your processing needs those, you need a real stream. Your team's operational maturity doesn't include filesystem operations. Shared filesystems have failure modes. Stale NFS handles, mount drift, permission issues, capacity-planning surprises. If your team has historically only operated stateless services, they will be surprised by the first NFS-related incident. That's a fixable gap, but it's a gap. For the broad middle — large volumes, in-VPC consumers, batched processing, cost-sensitive ingest — this pattern is the right answer. The decisive factor is usually the cost math. Once you see your own version of the chart above, the choice tends to make itself. There's a recurring blind spot in cloud architecture that I've watched cost teams six- and seven-figure sums over the past few years: treating every data-in-motion problem as a streaming-bus problem. The streaming bus is a wonderful tool when its semantics — per-event delivery, low-latency push, multi-subscriber fan-out — actually match your workload. It is a remarkably expensive tool when your workload is "land bytes here, pick them up over there, then forget them." Transitional data is that second shape. It deserves a different tool. A shared filesystem with sub-millisecond latency, multi-AZ durability, native compression, atomic rename semantics, and now an intelligent-tiering storage class that automatically migrates cooling data to cheap storage — that's the tool. FSx for OpenZFS is the specific implementation that happens to package all of those properties together at AWS today, but the pattern works against any filesystem with the same properties (in another decade, on another cloud, it'll be something else). The architecture is small. Producers do an atomic write to FSx and publish a tiny pointer message. Consumers read the pointer and stream the bytes from a mount. The bus carries metadata, not payload. The cost shifts from "scale with byte volume" to "scale with stored volume," and the storage class itself takes care of the cooling story. The code that implements this well — backpressure by bytes not requests, safe write protocol with temp+rename, idempotent retry without re-reads, careful timeout/buffer-lifetime bookkeeping — is small enough to fit in one engineer's head and robust enough to run for years untouched. The hard part was never the protocol. The hard part was unlearning the reflex that says "data in motion = put it on a stream." If you take one thing from this, take this: put a calendar reminder for next week to look at your own AWS bill, find the line item with the highest cost-per-byte-of-useful-life, and ask yourself whether the bytes are actually getting their money's worth from the transport they're on. The answer might surprise you. The fix is usually simpler than the architecture diagram makes it look. Steal the pattern. It works. Templates let you quickly answer FAQs or store snippets for re-use. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuse

Code Block

Copy

1. Build payload → ZSTD-compressed protobuf in a direct ByteBuffer 2. Resolve partition folder → sanitize id, validate against FSx root 3. Create batch directory → /fsx/<part>/year=Y/month=M/day=D/hour=H/batch-<uuid>/ 4. Reserve in-flight bytes → byte-based backpressure (more on this below) 5. Write to a unique temp file → data.bin.tmp.<uuid>.1 6. fsync the temp file → optional, controlled by config 7. Atomic rename to data.bin → Files.move(tmp, final, ATOMIC_MOVE) 8. Publish pointer message → { filePath, batchId, sizeBytes, crc32c } 9. Release in-flight bytes → permit auto-closed 1. Build payload → ZSTD-compressed protobuf in a direct ByteBuffer 2. Resolve partition folder → sanitize id, validate against FSx root 3. Create batch directory → /fsx/<part>/year=Y/month=M/day=D/hour=H/batch-<uuid>/ 4. Reserve in-flight bytes → byte-based backpressure (more on this below) 5. Write to a unique temp file → data.bin.tmp.<uuid>.1 6. fsync the temp file → optional, controlled by config 7. Atomic rename to data.bin → Files.move(tmp, final, ATOMIC_MOVE) 8. Publish pointer message → { filePath, batchId, sizeBytes, crc32c } 9. Release in-flight bytes → permit auto-closed 1. Build payload → ZSTD-compressed protobuf in a direct ByteBuffer 2. Resolve partition folder → sanitize id, validate against FSx root 3. Create batch directory → /fsx/<part>/year=Y/month=M/day=D/hour=H/batch-<uuid>/ 4. Reserve in-flight bytes → byte-based backpressure (more on this below) 5. Write to a unique temp file → data.bin.tmp.<uuid>.1 6. fsync the temp file → optional, controlled by config 7. Atomic rename to data.bin → Files.move(tmp, final, ATOMIC_MOVE) 8. Publish pointer message → { filePath, batchId, sizeBytes, crc32c } 9. Release in-flight bytes → permit auto-closed 1. Pull notification message from bus 2. Validate pointer (path within allowed root, checksum optional) 3. Open the file via the local NFS mount 4. Stream the bytes — gather them, decompress, deserialize 5. Process the events 6. Acknowledge / delete the bus message 1. Pull notification message from bus 2. Validate pointer (path within allowed root, checksum optional) 3. Open the file via the local NFS mount 4. Stream the bytes — gather them, decompress, deserialize 5. Process the events 6. Acknowledge / delete the bus message 1. Pull notification message from bus 2. Validate pointer (path within allowed root, checksum optional) 3. Open the file via the local NFS mount 4. Stream the bytes — gather them, decompress, deserialize 5. Process the events 6. Acknowledge / delete the bus message private PayloadWriteOutcome writeGatheredBuffers(...) { final CRC32C checksumCrc32c = checksumEnabled ? new CRC32C() : null; long totalBytesWritten = 0L; int nextBufferIndex = nextReadableBufferIndex(payloadBuffers, 0); while (nextBufferIndex < payloadBuffers.size()) { ByteBuffer[] gatherSources = nextGatherSources(payloadBuffers, nextBufferIndex); updateChecksum(checksumCrc32c, gatherSources); while (hasRemaining(gatherSources)) { long bytesWritten = fileChannel.write(gatherSources); // writev under the hood totalBytesWritten += bytesWritten; } nextBufferIndex = nextReadableBufferIndex(payloadBuffers, nextBufferIndex + gatherSources.length); } return new PayloadWriteOutcome(totalBytesWritten, checksumCrc32c == null ? null : checksumCrc32c.getValue()); } private PayloadWriteOutcome writeGatheredBuffers(...) { final CRC32C checksumCrc32c = checksumEnabled ? new CRC32C() : null; long totalBytesWritten = 0L; int nextBufferIndex = nextReadableBufferIndex(payloadBuffers, 0); while (nextBufferIndex < payloadBuffers.size()) { ByteBuffer[] gatherSources = nextGatherSources(payloadBuffers, nextBufferIndex); updateChecksum(checksumCrc32c, gatherSources); while (hasRemaining(gatherSources)) { long bytesWritten = fileChannel.write(gatherSources); // writev under the hood totalBytesWritten += bytesWritten; } nextBufferIndex = nextReadableBufferIndex(payloadBuffers, nextBufferIndex + gatherSources.length); } return new PayloadWriteOutcome(totalBytesWritten, checksumCrc32c == null ? null : checksumCrc32c.getValue()); } private PayloadWriteOutcome writeGatheredBuffers(...) { final CRC32C checksumCrc32c = checksumEnabled ? new CRC32C() : null; long totalBytesWritten = 0L; int nextBufferIndex = nextReadableBufferIndex(payloadBuffers, 0); while (nextBufferIndex < payloadBuffers.size()) { ByteBuffer[] gatherSources = nextGatherSources(payloadBuffers, nextBufferIndex); updateChecksum(checksumCrc32c, gatherSources); while (hasRemaining(gatherSources)) { long bytesWritten = fileChannel.write(gatherSources); // writev under the hood totalBytesWritten += bytesWritten; } nextBufferIndex = nextReadableBufferIndex(payloadBuffers, nextBufferIndex + gatherSources.length); } return new PayloadWriteOutcome(totalBytesWritten, checksumCrc32c == null ? null : checksumCrc32c.getValue()); } { "filePath": "/fsx/<partition>/year=2026/month=05/day=24/hour=14/batch-abc123/data.bin", "batchId": "abc123", "partitionId": "<partition>", "sizeBytes": 524288, "crc32c": "f3c19a2b" } { "filePath": "/fsx/<partition>/year=2026/month=05/day=24/hour=14/batch-abc123/data.bin", "batchId": "abc123", "partitionId": "<partition>", "sizeBytes": 524288, "crc32c": "f3c19a2b" } { "filePath": "/fsx/<partition>/year=2026/month=05/day=24/hour=14/batch-abc123/data.bin", "batchId": "abc123", "partitionId": "<partition>", "sizeBytes": 524288, "crc32c": "f3c19a2b" } if (this.fileSystemOperations.exists(finalFilePath)) { return existingResultAsync(request, finalFilePath, payload, expectedBytes, attempt); } // in toExistingResult: long existingBytes = this.fileSystemOperations.size(finalFilePath); if (existingBytes != expectedBytes) { throw new FsxWriteException("Existing FSx file does not match expected payload size: " + finalFilePath); } return toWriteResult(request, finalFilePath, existingBytes, expectedChecksumCrc32c, attempt, true); if (this.fileSystemOperations.exists(finalFilePath)) { return existingResultAsync(request, finalFilePath, payload, expectedBytes, attempt); } // in toExistingResult: long existingBytes = this.fileSystemOperations.size(finalFilePath); if (existingBytes != expectedBytes) { throw new FsxWriteException("Existing FSx file does not match expected payload size: " + finalFilePath); } return toWriteResult(request, finalFilePath, existingBytes, expectedChecksumCrc32c, attempt, true); if (this.fileSystemOperations.exists(finalFilePath)) { return existingResultAsync(request, finalFilePath, payload, expectedBytes, attempt); } // in toExistingResult: long existingBytes = this.fileSystemOperations.size(finalFilePath); if (existingBytes != expectedBytes) { throw new FsxWriteException("Existing FSx file does not match expected payload size: " + finalFilePath); } return toWriteResult(request, finalFilePath, existingBytes, expectedChecksumCrc32c, attempt, true); final class FsxInFlightByteLimiter { private final long maxInFlightBytes; private final int maxInFlightOperations; private long inFlightBytes; private int inFlightOperations; private final Queue<PendingAcquire> pendingAcquires = new ArrayDeque<>(); CompletableFuture<Permit> acquire(long requestedBytes, Duration timeout, ScheduledExecutorService scheduler) { PendingAcquire pending = new PendingAcquire(toReservedBytes(requestedBytes)); synchronized (monitor) { if (pendingAcquires.isEmpty() && canAcquire(pending)) { acquireNow(pending); completeNow(pending); } else { pendingAcquires.add(pending); } } scheduler.schedule(() -> timeout(pending), timeout.toMillis(), TimeUnit.MILLISECONDS); return pending.future(); } private boolean canAcquire(PendingAcquire pending) { return inFlightOperations < maxInFlightOperations && inFlightBytes + pending.reservedBytes() <= maxInFlightBytes; } } final class FsxInFlightByteLimiter { private final long maxInFlightBytes; private final int maxInFlightOperations; private long inFlightBytes; private int inFlightOperations; private final Queue<PendingAcquire> pendingAcquires = new ArrayDeque<>(); CompletableFuture<Permit> acquire(long requestedBytes, Duration timeout, ScheduledExecutorService scheduler) { PendingAcquire pending = new PendingAcquire(toReservedBytes(requestedBytes)); synchronized (monitor) { if (pendingAcquires.isEmpty() && canAcquire(pending)) { acquireNow(pending); completeNow(pending); } else { pendingAcquires.add(pending); } } scheduler.schedule(() -> timeout(pending), timeout.toMillis(), TimeUnit.MILLISECONDS); return pending.future(); } private boolean canAcquire(PendingAcquire pending) { return inFlightOperations < maxInFlightOperations && inFlightBytes + pending.reservedBytes() <= maxInFlightBytes; } } final class FsxInFlightByteLimiter { private final long maxInFlightBytes; private final int maxInFlightOperations; private long inFlightBytes; private int inFlightOperations; private final Queue<PendingAcquire> pendingAcquires = new ArrayDeque<>(); CompletableFuture<Permit> acquire(long requestedBytes, Duration timeout, ScheduledExecutorService scheduler) { PendingAcquire pending = new PendingAcquire(toReservedBytes(requestedBytes)); synchronized (monitor) { if (pendingAcquires.isEmpty() && canAcquire(pending)) { acquireNow(pending); completeNow(pending); } else { pendingAcquires.add(pending); } } scheduler.schedule(() -> timeout(pending), timeout.toMillis(), TimeUnit.MILLISECONDS); return pending.future(); } private boolean canAcquire(PendingAcquire pending) { return inFlightOperations < maxInFlightOperations && inFlightBytes + pending.reservedBytes() <= maxInFlightBytes; } } /fsx-root/ <partitionId>/ year=2026/ month=05/ day=24/ hour=14/ batch-<uuid>/ data.bin [optional metadata files] /fsx-root/ <partitionId>/ year=2026/ month=05/ day=24/ hour=14/ batch-<uuid>/ data.bin [optional metadata files] /fsx-root/ <partitionId>/ year=2026/ month=05/ day=24/ hour=14/ batch-<uuid>/ data.bin [optional metadata files] producer/consumer container │ ▼ /fsx-root/ (Linux NFS mount, nconnect=16, NFSv4.1) │ ▼ AsynchronousFileChannel, FileChannel.write(ByteBuffer[]) producer/consumer container │ ▼ /fsx-root/ (Linux NFS mount, nconnect=16, NFSv4.1) │ ▼ AsynchronousFileChannel, FileChannel.write(ByteBuffer[]) producer/consumer container │ ▼ /fsx-root/ (Linux NFS mount, nconnect=16, NFSv4.1) │ ▼ AsynchronousFileChannel, FileChannel.write(ByteBuffer[]) CompletableFuture<FsxFileWriteResult> writeFuture = inFlightByteLimiter.acquire(...) .thenCompose(permit -> writeWithPermit(request, payload, payloadBytes, permit)); writeFuture.whenComplete((result, throwable) -> payload.close()); return withTimeout(writeFuture, configuration.getWriteTimeout()) .whenComplete((result, throwable) -> recordWriteMetrics(...)); CompletableFuture<FsxFileWriteResult> writeFuture = inFlightByteLimiter.acquire(...) .thenCompose(permit -> writeWithPermit(request, payload, payloadBytes, permit)); writeFuture.whenComplete((result, throwable) -> payload.close()); return withTimeout(writeFuture, configuration.getWriteTimeout()) .whenComplete((result, throwable) -> recordWriteMetrics(...)); CompletableFuture<FsxFileWriteResult> writeFuture = inFlightByteLimiter.acquire(...) .thenCompose(permit -> writeWithPermit(request, payload, payloadBytes, permit)); writeFuture.whenComplete((result, throwable) -> payload.close()); return withTimeout(writeFuture, configuration.getWriteTimeout()) .whenComplete((result, throwable) -> recordWriteMetrics(...)); - Is produced in continuous high volume. - Is consumed shortly after production — usually within seconds or minutes, almost never more than an hour or two. - After consumption, is either archived for compliance/audit or deleted entirely. - Has no value sitting on a transport between producer and consumer beyond getting from one side to the other. - The bus carries small messages, so the bus cost collapses. Pointer messages are roughly two hundred bytes each. A bus that was strained to carry 700 MB/s of payload is now carrying maybe 200 KB/s of pointers. Shard counts can drop by an order of magnitude or more. One real architecture I've seen reduced a Kinesis shard count from 700 to 32 — a 95% reduction in transport spend, with no change to the actual byte volume being moved. - The filesystem is billed for what it stores, not for what passes through it. You pay for the bytes that exist, not the bytes that have existed and have already been consumed and deleted. If your retention is one hour, you pay for one hour's worth of bytes resident at any given moment. - The filesystem gives you primitives the bus doesn't. Snapshots. Random access. Concurrent multi-reader semantics. Standard POSIX tools to inspect, validate, and debug. Your data-engineering team can ls your transitional data. They cannot ls a Kinesis stream. - The compute path is the same speed or faster. A modern NFS mount on the same VPC moves data faster than any HTTP-based streaming bus. Once you understand the math, the streaming-bus model is the slow option, not the fast one. - Frequent Access — data touched within the last 30 days. Baseline tier, sub-millisecond reads from cache, full performance. - Infrequent Access — data not touched for 30 to 90 days. Roughly 44% cheaper than Frequent Access. - Archive Instant Access — data not touched for 90+ days. Roughly 65% cheaper than Infrequent Access. Still online, no restore needed; first-byte latency in the tens of milliseconds. - maxInFlightOperations — caps the number of concurrent writes (still useful to prevent thundering herds on the file I/O thread pool). - maxInFlightBytes — caps the aggregate payload size of concurrent writes (the real protection against memory blow-up). - aws-sdk-java-v2 + S3AsyncClient + S3 Access Points for FSx. Works if you expose your FSx file system through an S3 access point. Mature, async, multipart, retries, integrates with the AWS SDK ecosystem. The downside is the S3-style access has tens-of-milliseconds latency rather than the sub-millisecond NFS-mount latency. For batch jobs that's fine; for hot transitional data you give up most of the perf advantage. - dCache/nfs4j. A pure-Java NFSv3 and NFSv4 implementation. The most serious Java NFS library available. It's actively maintained, has a JMH benchmarks module, runs on Java 17. If you absolutely need to write your own NFS client (or extend one), this is where you'd start. But it's not a turnkey FSx client — you'd be building protocol code, and AWS's own performance documentation is opinionated about mount-time options that are easier to apply with the kernel client. - EMCECS/nfs-client-java. NFSv3 only, dependent on a years-old version of Netty. Workable for legacy use, not a foundation for a petabyte-scale system in 2026. - SMBJ, jcifs. SMB protocol clients. Wrong protocol family — FSx for OpenZFS doesn't speak SMB. - Hadoop / Spark NFS connectors. Useful for ideas about request pipelining, not a foundation. - Use nconnect=16 (or up to your kernel's supported maximum) to parallelize NFS over 16 TCP connections. - Set rsize=1048576 and wsize=1048576 for 1 MiB read/write chunks. - Use NFSv4.1 for the locking and the cleaner failover semantics. - Place producers and consumers in the same AZ as the file system primary for sub-ms latency and to avoid cross-AZ data transfer charges. - fsx.write.success / fsx.write.failed — counters of completed writes - fsx.write.latency — write latency in milliseconds - fsx.write.bytes — total bytes written - fsx.write.retry — counter of retry attempts - fsx.write.inflight.bytes — current bytes reserved by running writes - fsx.write.inflight.operations — current running write count - fsx.write.backpressure.pending — current queue depth on the limiter