Spotlight on SIG Storage (kubernetes.io)

by rss-bot · 4 days ago · 0 comments

In our ongoing SIG Spotlight series, we shine a light on the groups that keep the Kubernetes project
moving forward. This time, we catch up with <a href="https://github.com/kubernetes/community/tree/master/sig-storage">SIG
Storage</a>, the group responsible
for persistent data, volume management, and the interfaces that connect Kubernetes workloads to the
storage systems beneath them.
We spoke with <a href="https://github.com/xing-yang">Xing Yang</a>, Co-Chair of SIG Storage and Software
Engineer at VMware by Broadcom, about the SIG's history, the features shipping in recent Kubernetes
releases, and where storage in Kubernetes is headed as AI workloads become the norm.
<h2 id="introductions">Introductions<a class="td-heading-self-link" href="#introductions" aria-label="Heading self-link"></a></h2>
Could you introduce yourself and share your role(s) within SIG Storage?
I'm <a href="https://github.com/xing-yang">Xing Yang</a>, a software engineer at VMware by Broadcom. I'm a co-chair of SIG Storage,
alongside <a href="https://github.com/saad-ali">Saad Ali</a> from Google. There are also two Tech Leads in SIG Storage:
<a href="https://github.com/msau42">Michelle Au</a> from Google and <a href="https://github.com/jsafrane">Jan Šafránek</a> from Red Hat.
What first drew you to storage in Kubernetes, and how did you start contributing?
I have always worked in the storage domain, so SIG Storage was a natural place for me to get
started when I began to learn Kubernetes. I started attending <a href="https://github.com/kubernetes/community/blob/main/sig-storage/README.md#meetings">SIG Storage meetings</a>, trying to figure
out what I could do to help. This was before the first <a href="https://github.com/container-storage-interface/spec/blob/master/spec.md">Container Storage Interface</a> (CSI) release —
lots of things were still evolving. It was a very exciting time.
What subprojects or areas do you actively maintain or review today?
I'm a maintainer in Kubernetes CSI. There are multiple CSI sidecars — such as <code>csi-provisioner</code>,
<code>csi-attacher</code>, <code>csi-resizer</code>, and <code>csi-snapshotter</code> — that we need to release following every
Kubernetes release. I'm also a co-chair of the <a href="https://github.com/kubernetes/community/blob/main/wg-data-protection/README.md">Data Protection Working Group</a> (DP WG), co-sponsored by SIG
Storage and <a href="https://github.com/kubernetes/community/tree/main/sig-apps">SIG Apps</a>. Several features have come out of that WG aimed at filling gaps in data
protection support within Kubernetes. One is <a href="https://kubernetes.io/docs/concepts/storage/volume-group-snapshots/">Volume Group
Snapshot</a>, which provides
crash-consistent group snapshots for multiple volumes used by an application. <a href="https://github.com/kubernetes/enhancements/issues/3314">Changed Block
Tracking</a> (CBT) is another critical feature
from the DP WG designed to support efficient backups.
<h2 id="about-sig-storage">About SIG Storage<a class="td-heading-self-link" href="#about-sig-storage" aria-label="Heading self-link"></a></h2>
For folks who are new: what is SIG Storage, in your own words? What problems in Kubernetes are
you trying to solve?
SIG Storage is a <a href="https://github.com/kubernetes/community/blob/main/governance.md">Special Interest Group</a> focused on how to provide storage to containers running in
your Kubernetes cluster. We define standard interfaces so that a storage vendor can write a driver
and have its underlying storage system consumed by containers in Kubernetes.
Why does Kubernetes need a dedicated storage SIG? What makes storage hard in a distributed
system?
When Kubernetes was first introduced, it was meant for stateless workloads only. Container
applications were regarded as ephemeral and therefore did not need to persist data. However, that
changed drastically. Stateful workloads started running in Kubernetes, and we needed a dedicated
SIG to tackle the associated storage challenges. PersistentVolumeClaims, PersistentVolumes, and
StorageClasses were all introduced to provision data volumes for applications running in Kubernetes.
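As a rough illustration of how these objects relate, the sketch below builds a StorageClass and a PersistentVolumeClaim that refers to it, using plain Python dictionaries shaped like the corresponding manifests. The names `fast`, `db-data`, and the provisioner `csi.example.com` are hypothetical placeholders, not values from the interview.

```python
import json


def storage_class(name, provisioner):
    """Build a StorageClass manifest; the provisioner names the driver that creates volumes."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # Delay binding until a Pod using the claim is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def persistent_volume_claim(name, storage_class_name, size):
    """Build a PersistentVolumeClaim that requests `size` from the given StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names, for illustration only.
sc = storage_class("fast", "csi.example.com")
pvc = persistent_volume_claim("db-data", sc["metadata"]["name"], "10Gi")

if __name__ == "__main__":
    print(json.dumps([sc, pvc], indent=2))
```

The claim never names a specific PersistentVolume. It only states what it needs and which class should satisfy it, and the cluster provisions and binds a matching volume.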
How did SIG Storage originally form, and how has its mission changed over time?
SIG Storage was formed to address the challenges of handling persistent data within Kubernetes.
Initially, PersistentVolumes were implemented as in-tree plugins, and the SIG managed those plugins
while developing core storage primitives like PersistentVolumes and PersistentVolumeClaims.
Container Storage Interface (CSI) was introduced later and played a crucial role in simplifying
storage integration, enabling third-party storage providers to develop and maintain their own
out-of-tree plugins without modifying Kubernetes core code.
With basic integration addressed by CSI, the SIG's mission expanded to include advanced storage
features that leverage the new interface. The SIG has also expanded its scope to support object
storage through the <a href="https://github.com/kubernetes-sigs/container-object-storage-interface">Container Object Storage Interface</a> (COSI).
<h2 id="current-work-and-roadmap">Current work and roadmap<a class="td-heading-self-link" href="#current-work-and-roadmap" aria-label="Heading self-link"></a></h2>
What are the top features SIG Storage is actively working on right now?
The Data Protection WG has been working on a couple of exciting features:
<ul>
<li>
VolumeGroupSnapshot takes a crash-consistent, point-in-time snapshot of multiple
PersistentVolumes together. Capturing every volume in the group at the same point in time
preserves data integrity for applications, such as databases, that span multiple volumes.
It just moved to GA in Kubernetes v1.36.
</li>
<li>
CSI Changed Block Tracking (CBT) enables efficient, incremental backups. By allowing storage
systems to report only the blocks that have changed between two snapshots, it significantly
reduces the amount of data that needs to be transferred. It just moved to Beta in Kubernetes v1.36.
</li>
</ul>
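To make the idea behind changed block tracking concrete, here is a minimal conceptual sketch of an incremental backup. It is not the CSI API: a real storage system reports the changed blocks itself, whereas this example finds them by comparing two in-memory images. The block size and all function names are assumptions chosen for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only


def changed_blocks(base, target, block_size=BLOCK_SIZE):
    """Return the indices of blocks that differ between two equal-length images."""
    if len(base) != len(target):
        raise ValueError("images must have the same length")
    count = (len(base) + block_size - 1) // block_size
    return [
        i for i in range(count)
        if base[i * block_size:(i + 1) * block_size]
        != target[i * block_size:(i + 1) * block_size]
    ]


def incremental_backup(target, indices, block_size=BLOCK_SIZE):
    """Copy only the listed blocks out of the newer image."""
    return {i: target[i * block_size:(i + 1) * block_size] for i in indices}


def restore(base, delta, block_size=BLOCK_SIZE):
    """Rebuild the newer image by applying the saved blocks on top of the base."""
    image = bytearray(base)
    for i, data in delta.items():
        image[i * block_size:i * block_size + len(data)] = data
    return bytes(image)
```

With a base image of 16 zero bytes, a block size of 4, and a newer image that differs only in bytes 8 to 11, `changed_blocks` returns `[2]`, and the backup carries 4 bytes instead of 16.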
Another feature worth highlighting is Container Object Storage Interface (COSI). COSI provides
a standard interface for provisioning and consuming object storage buckets in Kubernetes —
standardizing object storage for containerized applications much like CSI did for block and file
storage. COSI is now transitioning to <code>v1alpha2</code>, with plans for promotion to Beta in a future
release.
What recent work from SIG Storage do you consider a "win" for users?
The graduation of <a href="https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/">VolumeAttributesClass</a>
to GA in Kubernetes v1.34 is a major win for users managing stateful workloads. Previously,
changing volume attributes like IOPS or throughput required out-of-band actions or disruptive
operations. Now, users can tune these properties directly through the Kubernetes API, scaling up
for peak loads or down to optimize costs, without external processes, downtime, or recreating
the volume.
Together with volume expansion, this completes the picture: users can tune both capacity and
other storage properties dynamically, just as they can now tune both CPU and memory for compute.
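A minimal sketch of what this looks like as API objects, written as Python dictionaries shaped like the manifests. The class names, the driver name `csi.example.com`, and the `iops` and `throughput` parameter keys are hypothetical. Real parameter names depend on the CSI driver in use.

```python
def volume_attributes_class(name, driver_name, parameters):
    """Build a VolumeAttributesClass manifest; parameters are driver-specific strings."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "VolumeAttributesClass",
        "metadata": {"name": name},
        "driverName": driver_name,
        "parameters": dict(parameters),
    }


def pvc_class_patch(class_name):
    """Build a merge patch that points an existing PVC at a different class."""
    return {"spec": {"volumeAttributesClassName": class_name}}


# Hypothetical classes for one hypothetical driver.
baseline = volume_attributes_class(
    "baseline", "csi.example.com", {"iops": "3000", "throughput": "125"}
)
peak = volume_attributes_class(
    "peak", "csi.example.com", {"iops": "16000", "throughput": "1000"}
)

# Switching the claim to the "peak" class asks the driver to modify the
# existing volume in place; the claim and its data stay where they are.
patch = pvc_class_patch(peak["metadata"]["name"])
```

Applying the patch to a bound PersistentVolumeClaim is the whole user-facing operation. Scaling back down is the same patch with the `baseline` class name.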
Looking ahead one or two releases, what's on the roadmap that people should watch for?
I'd like to draw attention to the <a href="https://github.com/kubernetes/enhancements/issues/1432">Volume Health</a> feature. This feature is designed to offer
critical visibility into the operational status and integrity of persistent volumes. By enabling
storage drivers and the Kubernetes control plane to report issues, it allows for proactive
monitoring and identification of volume-related problems.
Currently, volume health information is reported via non-persistent events. We are actively
investigating enhancements to this feature with the goal of supporting automated remediation
capabilities in the future.
Are there areas where you'd really like more discussion or help from the community?
We always welcome help from the community with fixing bugs, adding tests, and reviewing changes.
We'd also like to get feedback on the Alpha feature <a href="https://github.com/kubernetes/enhancements/issues/4762">Mutable PV
Affinity</a>, which was introduced in
Kubernetes v1.35. Use cases include migrating volumes from zonal to regional storage or migrating
from one disk type to another.
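The zonal-to-regional case can be sketched with the node affinity structure a PersistentVolume carries in `spec.nodeAffinity`. The example below is a simplified model, not scheduler code: it handles only the `In` operator, and the zone names are hypothetical.

```python
ZONE_KEY = "topology.kubernetes.io/zone"


def zone_affinity(zones):
    """Build a PV nodeAffinity value that restricts the volume to the given zones."""
    return {
        "required": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {"key": ZONE_KEY, "operator": "In", "values": list(zones)}
                    ]
                }
            ]
        }
    }


def _expression_matches(expression, node_labels):
    # Only the "In" operator is modeled in this sketch.
    if expression["operator"] != "In":
        raise NotImplementedError(expression["operator"])
    return node_labels.get(expression["key"]) in expression["values"]


def node_matches(affinity, node_labels):
    """Terms are ORed together; expressions inside one term are ANDed."""
    for term in affinity["required"]["nodeSelectorTerms"]:
        expressions = term.get("matchExpressions", [])
        if expressions and all(_expression_matches(e, node_labels) for e in expressions):
            return True
    return False


# Hypothetical zones: the volume starts zonal and becomes regional.
zonal = zone_affinity(["zone-a"])
regional = zone_affinity(["zone-a", "zone-b"])
node_in_b = {ZONE_KEY: "zone-b"}
```

Here `node_matches(zonal, node_in_b)` is false and `node_matches(regional, node_in_b)` is true. Without a mutable affinity, the volume object would keep advertising the old, narrower topology even after the underlying storage had been migrated.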
Another topic is volume replication. It was raised at <a href="https://www.cncf.io/reports/kubecon-cloudnativecon-north-america-2025/">KubeCon Atlanta</a> and has been discussed
in the Data Protection WG. Community members interested in this topic are encouraged to join the DP
WG meetings.
What are the biggest challenges users face today when running stateful workloads on Kubernetes?
While Kubernetes has moved stateful workloads — like databases and AI pipelines — into the
mainstream, managing "state" in a system designed for ephemerality remains difficult:
<ul>
<li>
Data Gravity and Storage Locality: Pods move in seconds, but data has gravity. If a node
fails, a pod using local storage is stuck. Operators must decide whether the failure is transient
or permanent — a high-stakes call. This is why we are enhancing the Volume Health feature to
provide the visibility needed to automate recovery choices.
</li>
<li>
Day 2 Complexity: Setting up a database is easy; maintaining its health over time is the real
challenge. Standard Kubernetes objects like StatefulSets offer a baseline, but they lack the
operational logic needed for tasks such as schema upgrades, engine patching, or cluster-wide
Kubernetes upgrades.
</li>
<li>
Data Mobility: Moving persistent data remains a significant hurdle — whether migrating between
storage tiers, shifting workloads across availability zones, or moving to a different cluster.
This challenge includes ongoing synchronization and replication for high availability and disaster
recovery across a distributed system.
</li>
</ul>
<h2 id="storage-and-ai">Storage and AI<a class="td-heading-self-link" href="#storage-and-ai" aria-label="Heading self-link"></a></h2>
How do you see storage evolving in Kubernetes over the next few years, especially as AI/ML
workloads grow?
I see several trends shaping storage in Kubernetes as it evolves from a container orchestrator into
the "Operating System" for AI:
<ul>
<li>
More Intelligent Data Management: We'll see a shift toward smarter CSI drivers and data
management tools offering advanced features like automatic tiering, snapshots, migration, and
replication — optimized specifically for high-performance AI/ML workflows and large data
platforms.
</li>
<li>
Object Storage as a First-Class Citizen: AI datasets now frequently reach exabyte scale,
making object storage the preferred choice for AI workloads. COSI is standardizing bucket
management just as CSI did for disks, allowing data scientists to use a BucketClaim to
provision S3-compatible storage natively and unifying object, file, and block storage into a
single workflow.
</li>
<li>
Performance and Low Latency: For AI/ML, storage needs to keep up with GPU processing speeds.
This will accelerate adoption of high-performance parallel file systems and NVMe-over-Fabrics
(NVMe-oF) technologies managed natively via Kubernetes. The line between traditional block/file
and memory-speed storage will continue to blur.
</li>
<li>
Data-Aware Scheduling: Instead of just considering CPU and RAM, the Kubernetes scheduler will
increasingly prioritize placing Pods based on data locality — calculating the cost of moving data
versus moving compute to keep massive data platforms performant.
</li>
</ul>
<hr>
SIG Storage continues to tackle some of the hardest problems in Kubernetes: keeping stateful
applications running reliably, making storage operations transparent and composable, and now
scaling up to meet the demands of AI-era workloads. Whether you're a user managing databases in
production or a developer curious about storage internals, there's a place for you in SIG Storage.
If you'd like to get involved, check out the <a href="https://www.kubernetes.dev/community/community-groups/sigs/storage/">SIG Storage community
page</a> and join the <a href="https://github.com/kubernetes/community/tree/master/sig-storage#meetings">bi-weekly
meetings</a>. You can also
find the SIG on Slack at
<a href="https://kubernetes.slack.com/messages/sig-storage">#sig-storage</a>.
<ul>
<li><a href="https://groups.google.com/a/kubernetes.io/g/sig-storage">SIG Storage Mailing List</a></li>
<li><a href="https://kubernetes.slack.com/messages/sig-storage">SIG Storage on Slack</a></li>
<li><a href="https://github.com/kubernetes/community/blob/master/wg-data-protection/README.md">Data Protection WG</a></li>
</ul>
