Building K8s workflows from a prompt
We added an MCP server to Tiny Systems so Claude and GPT can build and deploy Kubernetes workflows from natural language. Here's what that looks like.
I've been working on Tiny Systems for a while now. It's an open source visual workflow engine that runs on Kubernetes. You install it as CRDs and operators, wire up components in a graph (HTTP servers, K8s watchers, Slack, cron, JS eval, that kind of thing), and they deploy as real workloads in your cluster.
A few months ago we added an MCP server, and I've been surprised by how well it works.
The MCP server
MCP (Model Context Protocol) is a way for AI assistants to use external tools. Our server exposes about 30 of them: creating projects and flows, adding nodes, wiring edges with data mappings, reading port schemas, cloning solutions from the marketplace, checking execution traces. The AI isn't generating YAML for you to apply later. When it adds a node, that node exists in your cluster right then. When it wires an edge, data flows through it.
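To give a sense of the granularity, a tool-call sequence might look roughly like this. The tool and argument names are my shorthand for illustration, not the server's actual schema:

```yaml
# Illustrative only: tool and argument names are invented shorthand,
# not the real schema the MCP server exposes.
- tool: create_flow
  arguments: { project: alerts, name: pod-failure-alerts }
- tool: add_node
  arguments: { flow: pod-failure-alerts, component: pod-watcher }
- tool: add_edge
  arguments:
    flow: pod-failure-alerts
    from: pod-watcher.problems   # output port
    to: router.input             # input port
```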
Testing it for real
I wanted to see if "describe what you want and get a working flow" was actually true or just a nice idea. So I opened Claude and typed something like: "watch pods across all namespaces for CrashLoopBackOff and OOMKilled, grab the last 20 log lines from the failing container, and send me a Slack message."
It built the whole thing in about a minute. Six nodes:
- A Signal node at the front, holding all the config in one place: Slack token, channel ID, which namespaces to watch, which to ignore, cooldown period.
- A Pod Watcher connected to it, filtering for problems only, skipping kube-system and kube-public, with a 5-minute per-pod cooldown so you don't get 400 messages about the same crash.
- A Router that splits on the hasProblem flag, sending problem pods down to a Pod Logs node (fetches the last 20 lines) and then on to Slack Send (posts the formatted alert).
- Healthy pods drop into a Debug sink and disappear.
Every edge validated on the first try. The Slack credentials flow from the Signal through the entire chain without being hardcoded anywhere except that one node. I was honestly expecting to spend 20 minutes fixing edge configs, but it got the context passthrough pattern right.
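Because nodes and edges live in the cluster as resources, the finished flow is something you can presumably read back with kubectl. Here's a sketch of the topology it built; the API group and every field name are hypothetical, not the real CRD schema:

```yaml
# Hypothetical manifest: API group and field names are invented to show
# the shape of the flow, not Tiny Systems' actual CRD schema.
apiVersion: tinysystems.example/v1
kind: Flow
metadata:
  name: pod-failure-alerts
spec:
  nodes:
    - { name: config,  component: signal }       # token, channel, cooldown
    - { name: pods,    component: pod-watcher }
    - { name: route,   component: router }
    - { name: logs,    component: pod-logs }
    - { name: slack,   component: slack-send }
    - { name: healthy, component: debug }        # sink for healthy pods
  edges:
    - { from: config, to: pods }
    - { from: pods,   to: route }
    - { from: route,  to: logs,    when: hasProblem }
    - { from: route,  to: healthy, when: "!hasProblem" }
    - { from: logs,   to: slack }
```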
Things we've built and actually run
We have a solutions marketplace. All of these run in our own clusters or were built for real use cases:
Cluster cost saver scales down labeled deployments at night on a cron, brings them back in the morning. Every scale event posts to Slack. We run it on our GKE dev cluster. Knocked about 40% off the bill, which was a pleasant surprise because I was expecting maybe 20%.
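The opt-in mechanism is a label selector, so marking a deployment for nightly shutdown looks something like this. The label key is my invention for illustration; the real solution may expect a different convention:

```yaml
# The cost-saver label key below is hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-api
  labels:
    cost-saver/scale-down: "true"   # the cron flow scales these to 0 at night
spec:
  replicas: 2
  selector:
    matchLabels: { app: staging-api }
  template:
    metadata:
      labels: { app: staging-api }
    spec:
      containers:
        - name: api
          image: registry.example.com/staging-api:latest
```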
Image policy webhook is a MutatingAdmissionWebhook that rewrites container image references to point at your local registry. If you've done this with Kyverno, it's the same idea, except you build it visually instead of writing policy YAML. The flow generates its own TLS cert, registers the webhook itself, and runs with failurePolicy: Ignore, so if the flow crashes your cluster keeps admitting pods (there's a sketch of the registration after the next paragraph). We wrote about this in detail.

We built the image mirror for the same reason, after Docker Hub rate limits broke our staging deploys on a Friday afternoon. That one scans existing deployments, copies images to a local registry, and patches the specs. For new deployments we use the webhook. There's a full writeup on that too.
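For reference, the registration such a flow creates is a standard MutatingWebhookConfiguration. This is a minimal sketch with placeholder names and paths; failurePolicy is the line that matters:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: image-rewrite                  # placeholder name
webhooks:
  - name: image-rewrite.example.com    # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore              # if the flow is down, pods still admit
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: image-webhook            # placeholder service backed by the flow
        namespace: tiny-systems
        path: /mutate
      caBundle: ""                     # the flow injects its self-generated CA here
```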
Pod failure alerts watches for CrashLoopBackOff, OOMKilled, and scheduling failures. When something goes wrong, it grabs the last 30 log lines from the container and sends a Slack message with the pod name, namespace, node, reason, and the log snippet. The 5-minute per-pod cooldown is there because without it, a single crash-looping pod will destroy your Slack channel.

The log keyword monitor does something similar, but for specific phrases in logs rather than pod status changes. You configure the strings you care about (error patterns, stack trace signatures, whatever) and it tails the logs and alerts on match.
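A keyword-monitor config might look something like this; the field names are hypothetical, the shape is the point:

```yaml
# Hypothetical config: field names invented for illustration.
namespaces: [production, staging]
keywords:
  - "panic:"               # Go runtime panics
  - "OutOfMemoryError"     # JVM OOMs
  - "connection refused"
cooldownSeconds: 300       # same per-pod cooldown idea as above
```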
TLS cert expiry watcher reads cert-manager Certificate CRDs every morning, works out how many days each one has left, and alerts on anything below your threshold. We set ours to 14 days after nearly letting a wildcard cert lapse. That was not a fun morning.
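The nice part is that cert-manager already puts everything the watcher needs on the Certificate status, so "days left" is just a date subtraction. These status fields are real cert-manager v1 fields; the values are examples:

```yaml
# What the watcher reads off each Certificate (example values).
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
status:
  notAfter: "2025-09-01T12:00:00Z"      # expiry; compared against the 14-day threshold
  renewalTime: "2025-08-02T12:00:00Z"   # when cert-manager plans to renew
```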
GitHub deploy bot receives a webhook from GitHub Actions after an image build, checks the branch, updates the Deployment image, responds to the caller, and posts to Slack. There's a working example repo with the full GHA workflow if you want to try it.
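On the GitHub Actions side it's one step at the end of the build job. Everything here (the secrets, the payload shape, the image value) is a placeholder; the example repo has the real workflow:

```yaml
# Placeholder step: secret names, payload fields, and the image value
# are illustrative, not the example repo's actual contract.
- name: Notify deploy flow
  env:
    HOOK_URL: ${{ secrets.DEPLOY_HOOK_URL }}       # hypothetical secret
    HOOK_TOKEN: ${{ secrets.DEPLOY_HOOK_TOKEN }}   # hypothetical secret
    IMAGE: registry.example.com/app:${{ github.sha }}
  run: |
    curl -fsS -X POST "$HOOK_URL" \
      -H "Authorization: Bearer $HOOK_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"branch\":\"$GITHUB_REF_NAME\",\"image\":\"$IMAGE\"}"
```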
Service status page is a self-hosted uptime dashboard. It pings your endpoints every 30 seconds, stores state in K8s node metadata (so it survives restarts without a database), and renders an HTML page server-side. No JS on the frontend. The node-metadata trick is sketched after the next paragraph.

Firestore to ConfigMap sync watches a Firestore collection for feature flag changes and writes them to a K8s ConfigMap in real time. We use it to toggle features without redeploying.
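The no-database trick from the status page is just metadata on a Node object. The annotation key and payload here are hypothetical:

```yaml
# Hypothetical annotation: the key and JSON payload are invented
# to show the idea of using node metadata as a tiny state store.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  annotations:
    statuspage.example/checks: |
      {"api":{"up":true,"since":"2025-01-01T00:00:00Z"},"db":{"up":true}}
```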
What this isn't
I should be clear: this doesn't replace Terraform or Helm or ArgoCD. Those manage infrastructure state. Tiny Systems is for operational workflows, the stuff that happens at runtime. "When a pod crashes, do X." "Every night at 11pm, scale these deployments to zero." "When someone pushes to main, update the deployment."
The closest analogy is writing a custom controller in Go. If you've ever written a Go program that watches a K8s resource and reacts to changes, that's the use case. You just don't have to write the Go program.
The MCP server works with Claude Code, Claude Desktop, and any MCP-compatible client. You can also use the visual editor directly.
- Open source modules: github.com/tiny-systems/module
- Docs: docs.tinysystems.io