Three Identifier Constructions for a Human Research Pipeline: Hash, HMAC, NanoID
“If you think cryptography will solve your problem, then you don’t understand cryptography and you don’t understand your problem.” — Roger Needham
A few colleagues and I recently revisited a familiar debate. In most data systems, identifier generation is treated as undifferentiated infrastructure—one library, one default, applied everywhere. The implicit assumption is that a sufficiently random identifier is a sufficient answer to whatever question the system happens to ask of it; that the differences between identifier classes—surrogate keys, content fingerprints, anonymized handles, span identifiers, request correlations—are differences of usage rather than differences of construction. That assumption is convenient. It survives precisely until the system encounters constraints that aren’t engineering constraints.
For the past several years, I’ve designed and currently operate a clinical research data pipeline serving multiple concurrent Institutional Review Board (IRB)-approved studies against Electronic Health Record (EHR)-sourced data at production scale. Implementation specifics are abstracted from this post; the architectural pattern is not. In human-subjects research, identifier construction stops being a performance choice and becomes a regulatory and ethical primitive.
Why? Because the constraints are different from those that arise in commercial systems. Re-identification resistance is enforceable under the Health Insurance Portability and Accountability Act (HIPAA) and the Common Rule. Per-study isolation is an IRB requirement. Provenance MUST survive transformation but MUST NOT enable cross-study linkage. Workflows that depend on the operator’s inability to reverse a transformation can’t be retrofitted onto a single global identifier scheme; that inability has to be built into the construction itself. None of these properties emerge from a generic randomness primitive applied uniformly to every identifier the pipeline produces.
So let me state the pattern plainly. A real human-research pipeline needs at least three distinct identifier constructions. Each answers a different question. Each lives at a different boundary. Treating identifier generation as a single problem with a single answer breaks one architectural pillar in service of another. The failure mode isn’t that any single construction is wrong, but that no single construction is right for every boundary the pipeline must respect.
Three Pillars, Three Constructions
A human-subjects research pipeline must answer three questions of every row that flows through it. The questions are independent; every row is interrogated by all of them.
Fidelity. Is this row the same as before, or has it changed? A pipeline that can’t answer this can’t deduplicate, can’t detect drift, and can’t reason about whether a transformation preserved meaning. Fidelity is the substrate on which idempotency, change detection, and lineage are built. Without it, every other property the pipeline claims to provide is conjectural.
Governance. Can this row be re-linked to a person, or to the same person across studies? In a research context, the answer must be no, and the answer must be enforceable rather than promised. Policy is necessary but not sufficient; if the construction permits re-linkage, then policy is the only thing preventing it, and policy can be circumvented, misconfigured, or socially engineered. Governance asks whether the construction itself denies the operations that policy forbids.
Provenance. Which pipeline run produced this row, from which source, on which day, by which transformation? Provenance is the operational substrate of accountability. It allows any finding to be traced back to its derivation, any anomaly to be localized to a stage and a run, any audit to reconstruct what happened without depending on the cooperation of the system that produced the result.
Each question is answered by a different cryptographic construction at a different boundary:
- Content-addressed SHA-256 hashing answers fidelity at the ingest boundary.
- Per-tenant keyed HMAC hashing answers governance at the publication boundary.
- CSPRNG-backed identifier generation answers provenance at the operational layer.
The constructions are not interchangeable. Using any one of them outside its intended boundary either fails the question its boundary is supposed to answer or undermines a pillar that some other boundary was responsible for. Let me walk through each in turn, identify the constraint it answers, and explain why substituting a generic identifier scheme is an architectural mistake rather than a minor inefficiency.
Fidelity at the Ingest Boundary
At the ingest boundary, the pipeline receives raw data from a source system that doesn’t coordinate with the pipeline about identity. Files arrive on a schedule, are written to a lake in raw form, and become the substrate from which everything downstream is derived. The question fidelity asks of every row at this boundary is whether it’s the same row that arrived yesterday, last week, or in any prior incremental load. If the pipeline can’t answer that question cheaply and unambiguously, then:
- Deduplication becomes guesswork.
- Idempotency becomes a hope.
- The lake’s role as the recovery anchor for the entire pipeline is compromised.
The construction that answers this question is a SHA-256 hash computed over the row’s sorted business columns at the moment of lake ingestion. Simply put, “business columns” are all columns except the metadata columns added by the pipeline itself. The hash is written into a metadata column that is then carried unchanged through every downstream stage. It MUST NOT be recomputed at the warehouse boundary, MUST NOT be recomputed after foreign-key re-keying at segmentation, and MUST NOT be recomputed under any condition the pipeline operator may otherwise consider reasonable. The construction is content-addressed and immutable.
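To make the construction concrete, here is a minimal sketch of the idea in Go. The map-shaped row, the `pipeline_` metadata-column prefix, and the field-delimiter encoding are all assumptions of the sketch, not the production serialization; the properties that matter are that column order can never affect the hash and that pipeline metadata never participates in it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// rowHash computes a deterministic SHA-256 over the row's business
// columns. Metadata columns added by the pipeline are excluded, and
// column names are sorted so that column order in the source export
// can never change the hash.
func rowHash(row map[string]string) string {
	cols := make([]string, 0, len(row))
	for name := range row {
		if strings.HasPrefix(name, "pipeline_") { // assumed metadata convention
			continue
		}
		cols = append(cols, name)
	}
	sort.Strings(cols)

	h := sha256.New()
	for _, name := range cols {
		// Delimit name and value so adjacent fields cannot collide
		// (e.g., "ab"+"c" vs "a"+"bc").
		fmt.Fprintf(h, "%s\x1f%s\x1e", name, row[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string]string{"mrn_surrogate": "123", "lab_value": "4.2", "pipeline_ingested_at": "2025-01-01"}
	b := map[string]string{"lab_value": "4.2", "mrn_surrogate": "123", "pipeline_ingested_at": "2025-06-01"}
	fmt.Println(rowHash(a) == rowHash(b)) // true: same business content, same hash
}
```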
The non-randomness of this identifier is the point. Two rows with identical business-column content produce identical hashes. Two rows with any meaningful difference produce different hashes with overwhelming probability. The hash is unique by virtue of its content, not by virtue of its randomness, and that property is what makes change detection tractable across overlapping daily exports without requiring the source system to participate in a coordination protocol it was never designed to support. A randomized identifier would be wrong here in the strongest sense: it would foreclose exactly the property the boundary needs.
The same property that enables change detection also serves as a duplicate pre-filter. Rows that share identical business-column values are duplicates by definition, and the hash exposes that fact at the moment of ingestion rather than deferring it to a downstream reconciliation step. Duplicate detection becomes a comparison of fixed-width hashes rather than a comparison of arbitrarily wide row contents, and the cost of identifying duplicates collapses to the cost of comparing two values. The work the pipeline would otherwise have to do further downstream is done once, at the boundary, by the construction itself.
The hash also serves as the first link in the lineage chain that fidelity, governance, and provenance all eventually depend upon. Any row anywhere in the pipeline can be traced back to its source by following the hash. Any divergence between a stage’s input and its output can be diagnosed by comparing hashes. Any claim that two rows in different marts derive from the same source row is verifiable without consulting either source or mart, because the hash is the same in both places. The hash is the substrate; everything else is built on top of it.
A subtlety worth naming is that the hash construction itself sits on a hot path. At ingest scale, tens of millions of rows pass through the pipeline on every full load, and the hash is computed for every one of them. The disciplines that apply to identifier generation under sustained concurrency apply equally here: allocation behavior MUST be controlled, primitive selection MUST avoid implicit coordination, and the cost of the construction MUST NOT become the dominant cost of ingestion. The arguments I developed in my earlier post on NanoID hot-path optimization generalize directly to hash construction in pipelines of this kind. The construction is different; the discipline is the same.
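For illustration, here is one way that discipline can look in code, assuming rows arrive pre-serialized as bytes. The `sync.Pool` strategy is a sketch of a reasonable approach, not a claim about the production implementation: hash state is reused across rows, and no lock-protected central hasher exists for worker threads to contend on.

```go
package ingest

import (
	"crypto/sha256"
	"encoding/hex"
	"hash"
	"sync"
)

// hasherPool amortizes hash-state allocation across rows; each worker
// goroutine draws from the pool with no shared coordination.
var hasherPool = sync.Pool{
	New: func() any { return sha256.New() },
}

// hashBytes hashes a pre-serialized row without allocating fresh hash
// state per call. The stack-allocated sum buffer keeps the digest off
// the heap until hex encoding.
func hashBytes(serializedRow []byte) string {
	h := hasherPool.Get().(hash.Hash)
	defer hasherPool.Put(h)
	h.Reset()
	h.Write(serializedRow)

	var sum [sha256.Size]byte
	return hex.EncodeToString(h.Sum(sum[:0]))
}
```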
Governance at the Publication Boundary
The boundary between the warehouse and the data marts is where the pipeline transitions from internal computation to external publication. Inside the warehouse, identity is managed by the pipeline operator, and the SHA-256 hash from the ingest boundary serves every internal purpose well. Outside the warehouse, identity becomes a research artifact, and a fundamentally different requirement asserts itself: the identifier that reaches the mart MUST NOT enable re-identification of the underlying subject, and it MUST NOT enable cross-mart linkage of the same subject across separate research projects.
Both requirements are properties of the construction, not properties of the data. A SHA-256 hash carried untouched into a mart would satisfy neither. Why? The hash is deterministic; identical inputs produce identical outputs; an attacker with sufficient knowledge of the source schema could attempt to reconstruct the input space and brute-force the mapping. More immediately, the same warehouse row would carry the same hash into every mart, and any researcher with access to two marts could link rows trivially by comparing hashes. The construction permits exactly the operations that governance forbids.
Stay with me: the answer here is a small but decisive change. The construction that answers governance at this boundary is HMAC-SHA256, keyed with a per-mart secret stored in a Hardware Security Module (HSM)-backed secrets manager and never exposed to the pipeline operator. At the segmentation stage that produces each mart, the warehouse-side SHA-256 hash is replaced by HMAC-SHA256(key = per-mart secret, message = warehouse hash). The result is a new identifier that is:
- Deterministic with respect to the same inputs.
- Cryptographically opaque without the secret.
- Distinct across marts, because the secret is distinct across marts.
This construction has a property that’s easy to overlook and architecturally decisive: the HMAC operates on the SHA-256 hash, not on the raw business columns. An attacker who somehow obtained a per-mart secret would not face the relatively constrained problem of inverting the HMAC over a known schema of patient identifiers; they would face the problem of inverting it over the full output space of SHA-256, a space of 2^256 possible values. A dictionary attack against patient identifiers, plausible in principle against a single-layer keyed hash, is eliminated by construction. The two-layer design—content hash first, keyed hash of the hash second—is what makes the cryptographic claim survive contact with a realistic threat model.
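A minimal sketch of the two-layer construction in Go, assuming the warehouse hash arrives hex-encoded and the per-mart secret has already been fetched from the secrets manager (the fetch itself, HSM-backed in the architecture described here, is elided):

```go
package publish

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// martID derives the mart-facing identifier from the warehouse hash.
// The same warehouse row keyed with two different mart secrets yields
// two unlinkable identifiers; the same row keyed with the same secret
// yields the same identifier on every re-run.
func martID(warehouseHashHex string, martSecret []byte) (string, error) {
	warehouseHash, err := hex.DecodeString(warehouseHashHex)
	if err != nil {
		return "", err
	}
	mac := hmac.New(sha256.New, martSecret)
	mac.Write(warehouseHash) // input is the 32-byte hash, not raw business columns
	return hex.EncodeToString(mac.Sum(nil)), nil
}
```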
The architectural consequence is that the publication boundary becomes a place where identity is not merely transformed but severed. The original SHA-256 hash never reaches the mart. The mart receives the HMAC, and only the HMAC, as its identifier for that row. The pipeline operator, who has access to the warehouse and to the segmentation code, can’t reverse the transformation, because the operator doesn’t have the per-mart secret. The secret lives in an HSM-backed store that the operator can use but can’t extract. The same warehouse row, written into three different marts, carries three different identifiers, and no party—including the pipeline operator—can establish that the three identifiers refer to the same subject without simultaneous access to all three secrets and the cooperation of the secrets manager.
This is what makes the honest-broker workflow cryptographically tenable rather than merely operationally promised. In human-subjects research, there are workflows where the operator’s inability to reverse the transformation is structurally required: the team that operates the pipeline MUST be able to publish the data, but MUST NOT be able to re-identify the subjects whose data it published. Under a single-layer construction, the operator can always be compelled, suborned, or compromised into reversing the mapping. Under the two-layer construction described here, the operator simply doesn’t have the capability to reverse, and therefore can’t be compelled to exercise it. The construction enforces what policy could only request.
The bridge to my earlier work on AES-CTR-DRBG is direct. In that post I argued that deterministic cryptographic constructions belong in environments where reproducibility, auditability, and explicit state evolution are required. The HMAC-at-publication pattern is the same disposition applied to identity rather than to byte streams. The construction is deterministic with respect to its inputs; the state that determines its output evolves only when the per-mart secret is rotated; the auditor can reason about what the construction produces without having to reverse-engineer the library that implements it. Configuration is a contract. Boundaries are explicit. The cryptographic posture doesn’t change mid-flight.
A randomized identifier would be wrong here in a different way than at the ingest boundary, but no less wrong. A randomized identifier is non-deterministic; the same row processed twice would produce different identifiers; the mart’s referential integrity would collapse on the first re-run. What the boundary needs is deterministic-but-cryptographically-opaque, and the only construction that delivers both properties is a keyed hash.
Provenance at the Operational Layer
The constructions described so far handle the headline identity story—the identity of the rows themselves. They’re computed once per row and carried thereafter. They’re also, by a substantial margin, not the identifiers the pipeline generates most often.
A pipeline of any operational seriousness produces identifiers continuously for things that aren’t rows. Pipeline runs need identifiers. Stage executions need identifiers. Batch tasks under a thread pool need identifiers. Observability spans, if the pipeline exports traces, need identifiers. Synthetic sentinel rows injected at segmentation need identifiers distinguishable from real source identifiers. Project allocations need identifiers that survive across systems. Honest-broker requests need identifiers that the requesting system, the secrets manager, and the provisioning workflow can all reference unambiguously. Quality diagnostic runs need identifiers for the audit trail that links them to the pipeline run that triggered them.
These identifiers have a constraint regime distinct from anything the row-level constructions answer:
- They are allocated at high concurrency, often from many worker threads or pods simultaneously, with no central allocator available.
- They MUST be collision-resistant by virtue of their construction, because no coordinator exists to detect and resolve a collision after the fact.
- They MUST be URL-safe and human-tractable, because they will appear in logs, dashboards, audit trails, and the bodies of cross-system requests.
- They MUST be allocation-cheap, because the pipeline generates them at orders of magnitude higher volume than it generates SHA-256 row hashes or per-mart HMACs.
- They MUST NOT leak temporal patterns or sequence structure that an observer could exploit to infer pipeline behavior or correlate identifiers from independent runs.
In raw call volume, this class of identifier dominates the SHA-256 and HMAC constructions by orders of magnitude. The hash is computed once per source row at ingest. The HMAC is computed once per row per mart at segmentation. The operational identifiers are generated continuously: on every notebook run, on every span, on every parallel task, on every cross-system request. If the construction here is wrong, the cost is paid everywhere the pipeline does any work at all.
This is the design space my NanoID library was built for, and the same disposition that produced the library also governs its choice of randomness backends. The construction at the call site is identical regardless of the workload—a single function call returning a compact, URL-safe identifier—but the randomness source behind it is configurable, and the choice of source is made per identifier class rather than per call site. The architectural payoff is that the cryptographic posture of the identifier is determined at construction time and survives every downstream invocation without requiring the call site to be aware of it.
For the highest-volume operational identifiers—span identifiers, batch identifiers, stage execution identifiers, the thousands of small handles that a pipeline produces in the course of doing its work—the appropriate backend is the ChaCha20-based PRNG I developed in an earlier post. It’s designed for predictable behavior under sustained concurrency, with allocation discipline on the hot path and no shared coordination between worker threads. It doesn’t target any particular compliance regime, because none applies at this layer; what applies is the constraint that identifier generation MUST remain invisible in profiling output even at the volumes the pipeline reaches.
For the identifiers that participate in audit, IRB, or compliance trails—project allocations, honest-broker request identifiers, anything that ends up in a Key Vault audit log or a regulatory submission—the appropriate backend is the AES-CTR-DRBG construction I developed in the post before that. It targets FIPS 140-2-aligned cryptographic randomness with deterministic state evolution, allowing the identifiers it produces to be traceable, auditable, and defensible under scrutiny in ways that a general-purpose PRNG is not. The construction is the same NanoID library, configured with a different randomness source via the same WithRandReader option, called identically at the call site.
The two backends are not competing; they answer two different constraints within the same pipeline, on the same library surface, decided once at construction time. The same disposition that governs SHA-256 and HMAC at their boundaries—the construction is chosen for the boundary, not for the codebase—governs NanoID at the operational layer.
One Library, Three Backends
The library design that makes this tractable isn’t a convenience feature. It’s an architectural commitment that allows per-boundary construction choice to coexist with a single call-site surface. NanoID exposes a WithRandReader option that accepts any io.Reader as its randomness source. The choice of source is made once, at generator construction time, and survives every subsequent identifier the generator produces. The call site doesn’t change; the cryptographic posture does.
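As a sketch of what that looks like at construction time: the `WithRandReader` option is the library’s, but the exact constructor and generation-method shapes shown here are assumptions drawn from its documented surface, and `complianceReader` stands in for an AES-CTR-DRBG-backed `io.Reader`. Consult the repository README for the current API.

```go
package idents

import (
	"crypto/rand"
	"io"

	"github.com/sixafter/nanoid"
)

// wireGenerators builds two generators that are called identically but
// draw from different randomness sources, chosen once at construction.
func wireGenerators(complianceReader io.Reader) error {
	// Hot-path operational identifiers: crypto/rand here; a
	// ChaCha20-based reader would slot in the same way.
	opsGen, err := nanoid.NewGenerator(nanoid.WithRandReader(rand.Reader))
	if err != nil {
		return err
	}

	// Audit-trail identifiers: a FIPS-aligned DRBG as the reader.
	auditGen, err := nanoid.NewGenerator(nanoid.WithRandReader(complianceReader))
	if err != nil {
		return err
	}

	// The call sites are identical; only the construction differed.
	spanID, err := opsGen.New(21) // assumed generation method; check the README
	if err != nil {
		return err
	}
	requestID, err := auditGen.New(21)
	if err != nil {
		return err
	}
	_, _ = spanID, requestID
	return nil
}
```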
This mirrors the architectural move SHA-256 and HMAC-SHA256 make at their respective boundaries. Both are built on the same underlying SHA-2 hash primitive. The difference between them is whether a per-tenant key is involved, and whether the construction can therefore enforce the cryptographic break that publication requires. The primitive is shared; the construction is chosen for what the boundary MUST guarantee. Same disposition, same shape, different layer.
The negation is instructive. A pipeline that uses uuid_generate_v4() everywhere—or NanoID with a single hardcoded randomness source applied to every identifier it produces—collapses distinctions that the architecture is doing real work to preserve. Surrogate keys for fact tables receive the same construction as project identifiers feeding IRB submissions, which receive the same construction as observability spans. Each is correct for some workload and wrong for at least one other. The pipeline that adopts a single global construction hasn’t simplified its identifier story; it has merely declined to make decisions that its boundaries will eventually force it to revisit.
The pipeline that adopts the layered model described here makes those decisions once, at the boundary where each one belongs, and then stops thinking about them.
- The hash is content-addressed at ingest.
- The HMAC is per-tenant keyed at publication.
- The NanoID is randomly allocated at the operational layer, with the randomness backend chosen for the identifier class.
Each construction is correct for its boundary. Each construction is composable with the others, because the boundaries are explicit and the constructions don’t interfere. The architecture isn’t more complex than the alternative; it’s more honest about the complexity that was already present.
Wrapping it all up…
The earlier articles in this sequence each followed a single constraint to its conclusion. My 2024 article on NanoID followed the hot-path discipline that emerges when identifier generation moves from incidental utility to infrastructure. My 2025 article on AES-CTR-DRBG followed the requirements of compliance, auditability, and explicit state evolution. My article on the ChaCha20-based PRNG followed the constraints of portability, concurrency, and operational simplicity. In each case, a single set of requirements produced a single construction.
This article inverts that move. Given a domain—human-subjects research—that imposes multiple constraint regimes simultaneously, the architectural response isn’t a single construction but a layered one. Fidelity, governance, and provenance aren’t competing concerns to be balanced against one another. They’re independent questions answered by independent constructions at independent boundaries, and the pipeline that respects the independence of the questions can satisfy all three without trading any of them off against the others.
Make no mistake: the constructions themselves are unremarkable. SHA-256 has been a standard primitive for two decades. HMAC has been a standard primitive for longer. CSPRNG-backed identifier generation is a solved problem with well-understood implementations. The architectural work isn’t in the choice of primitives. It’s in the recognition that each primitive is correct only for the boundary that actually needs the property it provides, and in the discipline of refusing to substitute one for another where the substitution would be easier but wrong.
Good architectures emerge when constraints are taken seriously and followed to their conclusions, even when those conclusions lead to different constructions under different constraints. The earlier articles argued this for randomness. The pipeline architecture described here argues it for identity. The answer is the same in both cases: make the boundaries explicit, choose the construction the boundary needs, and let the disposition do the rest.
Good architectures don’t just happen; they are designed.
My next post will take a step back and explore the data analytics and insights pipeline from an architectural perspective.
Implementation: https://github.com/sixafter/nanoid