Every infrastructure team has one. The person who knows why that array was configured the way it was, which workloads are sensitive to which changes, what the last migration broke and how it got fixed at two in the morning. The institutional memory isn't in a runbook, it's in their head.
For years that's been a manageable risk. You document what you can, you cross-train where you have time, and you hope the gaps never get tested. Mostly they don't.
But the ground has shifted, and it's worth being honest about why.
The expertise is getting scarcer, not more abundant
Storage administration has never been a crowded specialism, and the pipeline isn't refilling. The engineers who understand enterprise storage at a deep level are getting older, getting promoted, or getting hired away - and the people coming up behind them are, reasonably, more interested in cloud, platform, and AI roles than in the unglamorous work of keeping a SAN healthy.
So the dependency on individuals isn't easing over time, it's tightening. The person who knows your environment best is also the person who's hardest to replace, and increasingly the person most likely to be approached by someone offering more money to go and know their environment instead.
That's not a hypothetical risk you're managing down, it's a structural one that's quietly getting worse.
What actually walks out the door
When that person leaves, the spec sheet doesn't change. Same capacity, same hardware, same performance. What you lose is harder to see on an asset register.
You lose the judgement about which changes are safe to make and which need a maintenance window. You lose the pattern recognition that turns a vague alert into "oh, that's the same thing that happened in March." You lose the speed - the difference between resolving an issue in twenty minutes because someone recognised it, and resolving it in six hours because three people had to work it out from scratch.
And in a mission-critical environment, that gap isn't an inconvenience. It's the difference between a non-event and an incident that has the business asking why.
Documentation doesn't close the gap
The instinct is to document your way out of the risk, and documentation is worth doing. But anyone who's inherited a complex environment knows its limits. Runbooks capture the procedures someone thought to write down. They don't capture the reasoning. They tell you what to do in the situations someone anticipated - not how to think about the situation nobody did.
The knowledge that matters most is precisely the knowledge that's hardest to externalise. Which is why "we have good documentation" rarely survives contact with the first real problem after the expert has gone.
The question worth asking now
For a long time, the only answers available were the same ones that never quite worked: document more, cross-train more, pay more to retain the person, accept the risk. All of them are about distributing one person's knowledge across more people - and all of them run into the same wall, which is that there aren't more people, and the ones you have are stretched.
What's changed is that there's now a different kind of answer. If the platform itself can carry some of the operational knowledge - provisioning safely, optimising continuously, diagnosing problems before they escalate - then the environment no longer depends quite so heavily on whether the right human happens to be available and still employed.
That doesn't replace good people. It changes what your good people are exposed to. The question stops being "what happens when the person who knows this best leaves?" and starts being "how much of what they know does the platform need to depend on in the first place?"
That's a more comfortable question to be able to answer.
