Infrastructure controls help to protect where data lives.
But they don't control what happens to it after that.
Many Azure data teams have a well-built data platform. They run separate subscriptions for dev and prod, put private endpoints on every storage account, manage their secrets in Key Vault, and scope RBAC correctly. The platform would pass a security review without too much trouble.
And then you peek behind the curtain and look at the data itself.
Raw customer records sit in a container that twelve people have access to, because someone needed to investigate a pipeline issue six months ago and the access was never revoked. Sensitive fields get carried through four transformation layers because it was easier to keep them than to make an explicit decision to drop them. A dataset from a project that ended two years ago still sits in production storage because nobody got around to deleting it, and now nobody is quite sure what still depends on it.
The platform is secure. The data is not governed.
Figure: Risk points shown across the different stages of the data lifecycle.
This is the gap that infrastructure controls don't close. They protect the boundary. They don't govern what happens to data once it's inside that boundary. That is the data lifecycle: how data moves, who touches it, where copies accumulate, and when it gets removed.
Why the gap exists
The way most teams approach security maps naturally onto how Azure is structured. You think in terms of resources: storage accounts, compute services, networking, identity. You apply controls at those layers, and in doing so you build a solid foundation.
The problem is that those controls don't follow the data once it starts moving.
A storage account can be completely locked down, but the data inside it can still be duplicated into another location with broader access. A Synapse workspace can be isolated within a managed virtual network, while the datasets it produces get exported into tools that operate entirely outside that boundary. Access can be granted correctly at the resource level and still result in unintended exposure once users start interacting with the data directly.
Over time the platform stays stable, but the data footprint expands. Copies accumulate. Access patterns drift. The original intent behind the design becomes harder to enforce, not because of any single bad decision, but because of a sequence of small, reasonable ones.
I think this also highlights a broader issue with how tech teams are structured. It's not enough for DevOps engineers to own infrastructure while data engineers work with the data. You need a truly cross-disciplinary team to run successful data operations.
The data lifecycle is the actual risk surface
Most teams think about data security in terms of DevOps and cloud security. That's necessary, but not sufficient. The real risk accumulates across the data lifecycle, from the moment data is ingested to the moment it's deleted. On many Azure data platforms, that lifecycle is largely ungoverned.
The pattern is usually the same.
- Data is ingested without classification, so nobody is certain what's sensitive and what isn't. That makes every downstream access decision a guess.
- It lands in a raw layer that was supposed to be restricted but gradually became accessible to anyone who asked, because each individual request seemed reasonable at the time.
- Transformation pipelines carry fields through multiple layers without a clear requirement, because removing data requires an explicit decision and keeping it doesn't.
- Analysts get read access to serving layer datasets, which means the data can be exported, downloaded, and moved outside the platform in whatever form they choose.
- Nothing gets deleted, because storage is cheap, deletion feels risky, and the cleanup never quite makes it onto the sprint.
Each of these decisions is understandable in isolation. Together, they define the actual security posture of the platform.
What a more intentional sequence looks like
None of what follows is exotic. These are standard practices. What's missing on most platforms isn't the tooling; it's the culture. Here's what I recommend if you want to govern the data lifecycle on your Azure data platform:
Classify at ingestion.
If you don't know what's sensitive when data enters the platform, every downstream decision is guesswork. Microsoft Purview can automate a meaningful portion of this through built-in sensitive information types and custom classifiers. Don't have access to (or the bandwidth to configure) Purview? Even a manual classification pass on known source systems is better than nothing. The goal is to establish sensitivity context before the data starts moving, not after it has already spread across three layers.
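A manual classification pass can start as something very small. The sketch below labels columns by matching their names against a few patterns. The pattern list, labels, and column names are hypothetical, and name matching is only a heuristic: a column that ends up "unclassified" has not been shown to be safe, it just hasn't matched anything yet.

```python
import re

# Hypothetical name patterns. A real rule set would come from your data
# owners or from Purview classifications, not from this list.
SENSITIVE_PATTERNS = {
    "contact": re.compile(r"email|phone|address", re.IGNORECASE),
    "identity": re.compile(r"ssn|passport|national_id|birth", re.IGNORECASE),
    "financial": re.compile(r"iban|card_number|salary", re.IGNORECASE),
}


def classify_columns(columns):
    """Map each column name to the first matching label, else 'unclassified'."""
    labels = {}
    for column in columns:
        labels[column] = "unclassified"
        for label, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(column):
                labels[column] = label
                break
    return labels


# Example: classify the columns of an incoming source table (made-up names).
source_columns = ["customer_id", "Email_Address", "date_of_birth", "order_total"]
for name, label in classify_columns(source_columns).items():
    print(f"{name}: {label}")
```

Storing the resulting labels alongside the dataset (as tags or metadata) is what lets later access decisions refer to something other than memory.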
Treat the raw layer like production.
The raw layer contains the full fidelity of your source systems: unfiltered, unmasked, and often including fields that will never be needed downstream but remain present because they came with the original dataset. Despite this, raw layer access tends to expand over time through a series of requests that each seem minor. The fix is straightforward: restrict access by default, use Entra ID groups rather than individual assignments so access is easier to audit and revoke, and review who has access on a regular cadence rather than waiting for an incident to prompt it. I like to do this audit quarterly.
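The review itself can be partly mechanical. As a rough sketch, assume you have exported the raw layer's role assignments into a list of records. The field names (`principal`, `principal_type`, `last_reviewed`) and the 90-day threshold are assumptions for illustration, not an Azure export format.

```python
from datetime import date, timedelta


def review_raw_layer_access(assignments, today, max_age_days=90):
    """Flag direct user assignments and assignments whose review is overdue."""
    findings = []
    for assignment in assignments:
        if assignment["principal_type"] == "User":
            # Individual grants are harder to audit than group membership.
            findings.append((assignment["principal"], "direct-user-assignment"))
        if today - assignment["last_reviewed"] > timedelta(days=max_age_days):
            findings.append((assignment["principal"], "review-overdue"))
    return findings


# Hypothetical export of who can read the raw container.
assignments = [
    {"principal": "alice@example.com", "principal_type": "User",
     "last_reviewed": date(2024, 1, 10)},
    {"principal": "grp-raw-readers", "principal_type": "Group",
     "last_reviewed": date(2024, 5, 1)},
]

for principal, issue in review_raw_layer_access(assignments, today=date(2024, 6, 1)):
    print(f"{principal}: {issue}")
```

The output is a worklist, not a decision: someone still has to confirm whether each flagged principal needs the access.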
Minimize data during processing.
Every transformation is a decision point about what should continue downstream. If a field isn't needed, it shouldn't travel. In practice, this means explicitly dropping columns rather than selecting everything, giving intermediate datasets defined retention periods rather than letting them persist indefinitely, and being deliberate about where temporary outputs land.
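In code, the difference between "select everything" and "select what's needed" is an explicit allow-list. The following is a minimal sketch in plain Python rather than any particular engine; the column names are made up, and in Spark or SQL the same idea is an explicit column list instead of `*`.

```python
def project(rows, required_columns):
    """Keep only allow-listed columns; fail loudly if one is missing."""
    projected = []
    for row in rows:
        missing = [c for c in required_columns if c not in row]
        if missing:
            raise KeyError(f"missing required columns: {missing}")
        projected.append({c: row[c] for c in required_columns})
    return projected


# Hypothetical raw records: the email and date of birth arrived with the
# source extract but are not needed for the downstream order report.
raw_rows = [
    {"order_id": 1, "total": 40.0, "email": "a@example.com", "dob": "1990-01-01"},
    {"order_id": 2, "total": 15.5, "email": "b@example.com", "dob": "1985-06-30"},
]

curated_rows = project(raw_rows, required_columns=["order_id", "total"])
print(curated_rows)
```

With an allow-list, a new sensitive field added upstream does not flow downstream until someone decides it should.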
Scope consumption access tightly.
Read access is often treated as low risk. In practice, it can still allow data to leave the platform through exports, downloads, screenshots, or downstream tooling unless additional controls are in place. Row-level security, column masking, and carefully scoped serving layer datasets all help reduce that risk, but only when they're implemented with a clear understanding of what each user actually needs access to. Granting someone access to an entire dataset when they only require three columns is still over-permissioning, and it's far more common than most teams realize.
One pattern worth enforcing here: treat Power BI semantic models and SQL endpoints as proper access boundaries, not transparent pass-throughs to the underlying data. What gets modelled in the semantic layer is what gets exposed, not everything in the lakehouse behind it.
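To make the row-level and column-level scoping above concrete, here is a small conceptual sketch. It is an illustration of the idea only: in practice the filter and the mask should be enforced by the serving engine (for example through database security policies or semantic model roles), not in application code. The `region` column and the `"***"` mask value are assumptions.

```python
def serve(rows, user_region, visible_columns, masked_columns=()):
    """Return only the user's rows, only the listed columns, masking some."""
    served = []
    for row in rows:
        if row["region"] != user_region:
            continue  # row-level filter: other regions are never returned
        view = {}
        for column in visible_columns:
            # Column masking: the column stays visible, the value does not.
            view[column] = "***" if column in masked_columns else row[column]
        served.append(view)
    return served


# Hypothetical serving-layer rows.
rows = [
    {"region": "EU", "customer_id": 1, "email": "a@example.com", "total": 10},
    {"region": "US", "customer_id": 2, "email": "b@example.com", "total": 20},
]

print(serve(rows, "EU", ["customer_id", "email"], masked_columns={"email"}))
```

Note that `total` never appears in the result because it was not in the visible column list, which is the "three columns, not the whole dataset" point in miniature.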
Define retention and enforce it with lifecycle policies.
Data that isn't actively used is still a liability. The more of it you have, the larger the blast radius of any security incident, and the harder it is to govern, because older datasets frequently fall outside current monitoring and classification practices. Azure Blob Storage lifecycle management policies, which also apply to ADLS Gen2 accounts, can handle this automatically once you've made the underlying decisions about how long different data categories should be kept. The hard part is making those decisions explicitly, rather than defaulting to keeping everything indefinitely.
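Once the retention decisions exist, turning them into a policy is mostly bookkeeping. The sketch below generates a lifecycle management policy document from a mapping of path prefix to retention days. The prefixes, day counts, and rule names are hypothetical, and the JSON shape follows the documented lifecycle policy schema as I understand it, so check it against the current documentation before applying it to a storage account.

```python
import json


def build_lifecycle_policy(retention_by_prefix):
    """Build one delete rule per prefix from {prefix: retention_days}."""
    rules = []
    for index, (prefix, days) in enumerate(sorted(retention_by_prefix.items())):
        rules.append({
            "enabled": True,
            "name": f"retentionRule{index}",  # hypothetical naming scheme
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    # Prefixes start with the container name.
                    "prefixMatch": [prefix],
                },
                "actions": {
                    "baseBlob": {
                        "delete": {"daysAfterModificationGreaterThan": days}
                    }
                },
            },
        })
    return {"rules": rules}


# Hypothetical retention decisions per data category.
policy = build_lifecycle_policy({
    "raw/landing/": 30,
    "curated/intermediate/": 90,
})
print(json.dumps(policy, indent=2))
```

Keeping the retention mapping in version control gives you a reviewable record of the decisions, separate from the mechanics of applying them.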
Design for deletion from the start.
Deletion is hard when you don't know where all copies of a dataset live. Reliable deletion is a capability that many Azure data platforms lack, not because it's technically complex, but because it requires lineage visibility that was never built in. Purview lineage, or even well-maintained pipeline documentation, is what makes reliable deletion possible. GDPR right-to-erasure requests, data subject access requests, and contractual data removal obligations all require this capability. This is especially true for me as someone who works in defense and airport operations.
If you can't delete data on demand, you don't fully control it, and at some point, that's not just an operational problem, it's a regulatory one.
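With lineage in hand, "where are all the copies?" becomes a graph traversal. The sketch below assumes lineage has been exported as a simple mapping from each dataset to the datasets derived from it. The dataset names are invented, and a real lineage export (from Purview or from pipeline documentation) will need to be reshaped into this form first.

```python
from collections import deque


def find_downstream_copies(lineage, source):
    """Breadth-first walk of the lineage graph, returning every dataset
    derived directly or indirectly from `source`."""
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen and child != source:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical lineage: dataset -> datasets produced from it.
lineage = {
    "raw.customers": ["curated.customers", "sandbox.customers_copy"],
    "curated.customers": ["serving.customer_360"],
    "serving.customer_360": [],
}

for dataset in find_downstream_copies(lineage, "raw.customers"):
    print(dataset)
```

The list is only as complete as the lineage behind it: copies made outside tracked pipelines, such as manual exports, will not appear.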
Governance is what sustains this
Getting the sequence right once isn't enough. The platform will evolve, the team will change, and without active governance the controls degrade. Here are some tips that make this sustainable in practice:
- Use Azure Policy to enforce baseline standards like no public access on storage accounts, required diagnostic settings, and tagging standards that make ownership visible. Policies assigned at the management group level are inherited by all child subscriptions automatically, which means the guardrails don't depend on individual engineers remembering the rules.
- Enable diagnostic logging for every data resource that matters. Capture storage access logs, Key Vault audit logs, Synapse pipeline runs, and Databricks cluster audit events. If you're not logging, you can't detect, which means you can't respond. This is configuration work, not complex infrastructure, but it has to be put in place intentionally from the start.
- Make regular access reviews a part of your routine. Anything ad hoc cannot be relied on, so schedule time to answer questions like: Who has access to the raw layer? Which service principals still exist from projects that finished? Which identities have data plane access in production that should be time-limited through Privileged Identity Management instead?
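As one example of the first tip, a tagging guardrail can be expressed as a small policy definition. The sketch below builds such a definition as a Python dictionary, modeled on the shape of the built-in "require a tag on resources" policy. The display name is hypothetical, and the definition should be validated against the current Azure Policy documentation before it is assigned anywhere.

```python
import json


def build_require_tag_policy(display_name):
    """Return a policy definition that denies resources missing a given tag.
    The tag name is supplied as a parameter at assignment time."""
    return {
        "displayName": display_name,
        "mode": "Indexed",  # evaluate only resource types that support tags
        "parameters": {
            "tagName": {
                "type": "String",
                "metadata": {"displayName": "Tag name"},
            }
        },
        "policyRule": {
            "if": {
                "field": "[concat('tags[', parameters('tagName'), ']')]",
                "exists": "false",
            },
            "then": {"effect": "deny"},
        },
    }


# Hypothetical display name; the tag itself (for example "owner") would be
# passed as the tagName parameter when the policy is assigned.
tag_policy = build_require_tag_policy("Require an ownership tag on resources")
print(json.dumps(tag_policy, indent=2))
```

Starting with the `audit` effect instead of `deny` is a common way to measure how many existing resources would fail before the rule begins blocking deployments.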
None of this is technically demanding. What it requires is treating data governance as part of the platform, not as an afterthought to be handled once the pipelines are working.
Final thoughts
A well-configured Azure data platform is table stakes. What determines whether that platform stays secure as it scales is whether the data inside it is governed with the same intentionality as the infrastructure around it.
Most teams invest heavily in getting the platform right. Fewer apply the same effort to managing data as it moves through it. That's where the problems accumulate and where the next round of incidents tends to come from.
If your team is working through this, either building data governance in from the start or retrofitting it onto a platform that grew faster than the controls around it, this is the work I do with clients. If that's relevant, let's talk.