Helm: Why there's no good reason to use it

  • helm
  • kubernetes
  • development
  • technical

I've been running Kubernetes for years both professionally and personally. My on-premises K8s cluster has been going for over half a decade. To manage these clusters I've used Terraform, Ansible, Kustomize, Helm, Flux, plain manifests and even my own tooling in jinny. Out of all of them, Helm is the only tool I can't recommend. I haven't encountered a scenario where Helm has resulted in a net benefit; it has always been better to pick something else. Let's explore!

What is Helm

Helm is described as a package manager for Kubernetes. It bills itself as an easy way to install software onto Kubernetes clusters in the same way that package managers make it easy to install software onto servers or desktop computers. How this specifically breaks down is that Helm will manage:

  • Passing the resource definitions to the Kubernetes API
  • Upgrades and configuration of those resource definitions
  • Uninstalling any unrequired or unwanted resources
  • Management of dependencies

From my point of view, package managers aren't needed in containerised environments like Kubernetes. Containers should be self-contained, with explicit dependencies isolated inside each versioned image, and any implicit "dependencies" - such as database services - should be managed as separate entities. Package managers like APT and yum have to manage dependencies because they operate in non-contained environments where resources are shared across the machine. Aside from sharing the kernel and data, we don't need to share and manage resources like that when they can be wholly captured inside container images. However, let's put that to one side and imagine that we want an environment where a Kubernetes cluster is managed more like a desktop.

Helm explicitly works only with Kubernetes, it doesn’t interface with anything else. It can’t be used with your cloud infrastructure or for provisioning virtual machines, bare metal or any other sort of hosting.

Additionally, the only interaction external systems, including Helm, have with Kubernetes is via the Kubernetes API. If you want anything at all to happen in Kubernetes then you're sending a request to the API. The Kubernetes Control Plane will interpret your request and operate as best it can to implement it. All external Kubernetes tools act this way: they pass JSON to the Kubernetes API describing what you want. Therefore, all Helm does is manage resource definitions passed to and from Kubernetes, and it only needs to do two things:

  • Take our desired software, apply our desired configuration and determine the output JSON that should be passed to Kubernetes
  • Manage any differences, updates or drift between the desired configuration and what is in the cluster so that updates, uninstalls and modifications are reflected in the cluster - like a package manager

Noting that the Kubernetes Control Plane handles the actualised orchestration in the cluster, there’s not much work for external tools to do. You can handle a Kubernetes cluster entirely with pure text manifests and a method to push manifest alterations to the Kubernetes API on change. If that sounds like a git repo and some CI/CD then you understand what the Flux GitOps tool is. Yes, it can be as simple as that.
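A minimal sketch of that approach, assuming the manifests sit in a ./manifests directory and whatever runs the pipeline has credentials for the cluster:

# Optional: preview what would change against the live cluster
kubectl diff -f ./manifests/

# Push the manifests; the Control Plane handles the rest
kubectl apply -f ./manifests/ --recursive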

With this small zone to operate in, let's check out how Helm falls behind its contemporaries.

Caveats

Firstly, Helm is being continuously updated. It might be good some day, maybe even today. Even if that’s the case, the alternative options haven’t had the same problems that Helm has had, so I don’t believe it’s worth investing in Helm when other tools have been more dependable.

Secondly, I don't trust Helm. It has had major problems that I find unacceptable in released software, ranging from security to usability, with many dragging on for years. Even if all of them were fixed, they were severe enough for me to write off Helm permanently.

Finally, you can do as you like. This article is a place I can point to as needed. I regularly have discussions about Helm, so it’s good to have a collated list.

If Helm isn’t great, why is Helm the prominent tool?

I’m speculating, but I think it’s as simple as:

  • Helm was present early in Kubernetes’ lifetime
  • Most DevOps engineers don't have a serious programming background, if any. They don't have the exposure to templating languages and options, or the experience, to develop their own templating approach using pre-existing tools
  • Helm is built in Go - the new trendy language - and was built specifically for Kubernetes, which made it initially attractive

Therefore the first major Kubernetes manifests used Helm charts and, once implemented, things never change.

Furthermore, I think Helm use is reinforced by:

Large scale Helm Charts being too complex to replace

Take the decommissioned Helm chart for mysqldump which is available here. Inside are 6 templates with 253 lines of code between them. That's a light chart. In most cases a mysqldump one-off job can be deployed with one file:

apiVersion: batch/v1
kind: Job
metadata:
  name: mysqldump
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: populator
        image: "mysql"
        imagePullPolicy: Always
        command:
        - bash
        - '-c'
        - "echo \"Fetching dump\" && mysqldump -u mysqluser -psupersecretpassword -h mysqlhost --databases database > /whatever/volume.dump"

Mount a volume or add in creds to back up to S3 and you’re good. Cronjob it to make it regular if you like.
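If you do want it on a schedule, the same spec drops into a CronJob with a handful of extra lines. A rough sketch, with the schedule and credentials being whatever suits your setup:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysqldump
spec:
  schedule: "0 2 * * *"   # nightly at 02:00
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: populator
            image: "mysql"
            imagePullPolicy: Always
            command:
            - bash
            - '-c'
            - "echo \"Fetching dump\" && mysqldump -u mysqluser -psupersecretpassword -h mysqlhost --databases database > /whatever/volume.dump"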

Universally available Helm charts need to handle an order of magnitude more use cases than the single use case each installation is actually looking for. As a result the charts end up bloated and difficult to comprehend. Even Ops engineers with the skills don't get the time to break a chart down into just what they need. The time-efficient result is to simply use the available Helm chart and install whatever is in there. What exactly it's installing no one knows or cares; what's important is to just shove it out the door.

Everyone else uses it

Hiring is a nightmare so you take what you get, which tends to be people who default to Helm. People default to Helm because not using it doesn't help their CVs. When Helm is used in an organisation it doesn't get replaced, which means everyone, old and new, needs to know and use it. "We must be cohesive as an organisation" and "Helm is our preferred tool". So on, so forth.

Redoing things is shunned so hard that good developers will suggest it's the worst thing you could ever do. Whether you agree or not, the first choice you make is unlikely to be undone.

Helm’s issues are future loaded

By the time the issues occur, time has been sunk into making a chart repository, a couple of the guys on the Ops team are really committed to their templating, there are no other options and it'll take too much time to migrate. Now the teams are committed. Now it's harder to undo, practically and psychologically.

So what's wrong with it?

Helm is split into two parts: the templating of Kubernetes YAML files (charts) and the deployment of those files to the cluster. Helm's templating is just Go's standard library templating with some bonus functionality. It takes templates, takes some variables and vomits out completed text files that can be passed to the Kubernetes API to do things. Helm's templating is not amazing, but it's not absolutely awful either. So using Helm to render chart templates and then deploying the resulting manifests separately is OK. It's not pleasant, but it's workable, and it's generally how I use it in environments that are "committed" to Helm.
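For the unfamiliar, a chart template is just Go template syntax wrapped around YAML. A hypothetical, minimal template - the names and values here are made up purely for illustration:

# templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config
  labels:
    app: {{ .Chart.Name }}
data:
  environment: {{ .Values.environment | quote }}

Running helm template -f values.yml . renders this into plain YAML - output that any number of other templating tools could have produced just as well.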

Helm’s management of deployments is where the real problem lies as getting this wrong is what results in serious business impacting issues. A mangled template can be thrown away, a mangled deployment can take down companies.

Serious Issues

Helm Tolerates Nothing

Occasionally, you need to make changes outside of your deployment tool or pipeline. I’m talking “we deployed out a bad config value to prod that absolutely must change immediately”. I’ve worked in regulated finance and environments more dramatic than that. I’ve handled production environments both scrubbed up for prepared deployments and diving in raw for immediate emergencies. Bad situations happen and yes, sometimes changes need to be made “right now”.

If you do that, Helm throws a tantrum and invalidates the entire deployment. This is because Helm 'expects that you're not modifying manifests outside of the Helm system'. That also includes when Helm itself is the problem and you have to go outside of Helm. Helm will not tolerate you. This is 'a defensive decision' and was built into Helm from the beginning.

This wasn’t reconsidered until Helm v3. That’s 3+ years post full release where touching any Kubernetes resource outside of Helm potentially destroyed the entire deployment. For an operations tool, that’s wild. Helm doesn’t help orchestrate as much as it takes your environment hostage.

Whilst it's advertised that Helm now tolerates changes outside of its system, not tolerating changes was built into Helm from the beginning - this issue runs deep. You'll find this ghost crop up across a few threads where Helm fails to detect or accommodate changes made outside of specific deployments, and sometimes inside them as well. Helm was not built like Terraform or Ansible, both of which were designed to be idempotent no matter the source of the changes.
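A sketch of the kind of sequence that triggers it, with hypothetical release and deployment names:

# The "must change immediately" fix, made directly against the cluster
kubectl set env deployment/payments-api BAD_FLAG=false -n prod

# The next pipeline run compares against Helm's stored release state
# rather than the live cluster, so the hotfix can be silently reverted
# or the release left in a state Helm then complains about
helm upgrade payments-api ./chart -f values.yml -n prod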

More on that later!

Helm Expects You to Nuke Everything

In a related theme, Helm is built, and has a culture, around purging unsuccessful deployments should they exhibit problems. Don't attempt a fix in place, nuke it. This is so prominent that in the Helm v3 release the purge behaviour became the default. Here's the issue. A delete without --purge simply deletes the resources but keeps the previous deployment history; adding --purge deletes this history as well. This matters because Helm often complains about failed prior deployments even if there are no resources in the cluster. If you just want a plain install you probably don't care. If you want to reinstall a broken deployment, you'll need to --purge.
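For reference, the difference looks roughly like this - Helm v2 syntax first, release name hypothetical:

# Helm v2: remove the resources but keep the stored release history.
# A later install under the same name can then fail because Helm
# considers the name still in use.
helm delete my-release

# Helm v2: remove the resources and the release history
helm delete my-release --purge

# Helm v3: purging the history is the default; --keep-history opts
# back into the old behaviour
helm uninstall my-release
helm uninstall my-release --keep-history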

This tells me that most Helm users are completely reinstalling more often than they’re uninstalling, so much so that it’s the default to nuke it all and try again. I mean, it does work.

You’ll find this suggestion across all of the Helm issues. Nuking everything is built in.

As has been said a number of times, nuking everything is not acceptable for production. It would be best if Helm actually tried to fix problems in place.

:(

Helm Chooses When It Wants to Work

Helm broke?

Yep

Yeah, that happens a lot. No idea why?

Nope

Yeah, that happens a lot too

It's common to find issues on GitHub asking why Helm is finding releases that don't exist or not finding releases that do exist. The most commented issue asks why Helm thinks nothing is released. Why? Don't know. The suggestion is to nuke it and start again.

Sometimes it just works the second time. Why? Don’t know.

You will come across this if you use Helm. You'll come across Helm doing weird things if you develop with Helm. To prove the point, I searched Slack at a major technology company I was consulting at. The results were essentially the above dialogue, culminating in helm delete --purge.

Here’s a couple of fun discussions:


Problem: no available release name found

Key quote:

Hi folks i just don’t have any clue what is going wrong.

Solution: Reinstall Tiller manually. This solves the timeouts.


Problem: The template command fails to dump generated files if it cannot parse as valid yaml file

Key quote:

The helm template command in helm 3.0.0 tries to parse the results as yaml files. So if there is a problem with the generated yaml file, I cannot see the file itself to check the problem. With helm 2, the template command dumped the generated files without trying to parse it as yaml file, so I could check the problem. Now it is not possible.

Bonus Quote:

Trying to locate the cause of invalid YAML with Helm 3 is a nightmare.

Super Bonus Quote:

fixed in #7556, which will be in Helm 3.2.0. Use the --debug flag to display the invalid YAML.

Doesn’t work.

Solution: Use Helm v3.2. Maybe.


Critical Release Management is Stored in the Cluster

Helm likes to store its own state. It passes this state to the same API that can tell Helm exactly what the actual state is. An API designed specifically for that task. Yeah, I know.

Either way, if you delete the secret or configmap with Helm’s state, Helm gets weird. What sort of weird depends on your current Helm release and the chart. It’s unlikely on any version that you’ll be making a successful deployment if any of these critical, unreplicated Kubernetes resources were hit by a stray delete request.
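If you want to see what Helm is relying on, the release state is sitting right there in the cluster - Helm v3 keeps one Secret per release revision, while Helm v2 kept ConfigMaps in Tiller's namespace. The release and namespace names below are hypothetical:

# List Helm v3 release state objects in a namespace
kubectl get secrets -n prod -l owner=helm

# Inspect a specific revision's stored state
kubectl get secret sh.helm.release.v1.production-database.v3 -n prod -o yaml

# Delete one of these and Helm's view of the release no longer matches
# what is actually running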

My personal witnessed experience was the deletion of the ConfigMap storing the Helm release of a production database. When the next pipeline release rolled out helm upgrade --install production-database and that release failed for whatever stupid reason, Helm deleted all the production database resources that already existed in the cluster. You see, Helm's version of the state wasn't recorded in a ConfigMap with the release name, so clearly this was a brand new release and all the resources already in the cluster needed deletion as an appropriate rollback.

Then again, maybe this is a feature? Maybe you have bad templates but Helm still deploys out a release with this bad template and then complains about deploying the fixed template and thus you need to manually delete the Helm state in the cluster. That’s like a fix, right?

Helm v3 Adoption is no Terraform Import

Originally, Helm would not allow for adoption of resources. If you’re running a bank database and wanted to manage your Kubernetes hosted database with this new brilliant ‘Helm’ tool that everyone is using, then you need to reinstall your bank. No really.

Helm v3 is a bit more tolerant when working with resources that have been changed outside of Helm or resources that are then "adopted" by Helm. However, Helm's adoption has been flaky at best - adopted resources may not get updated when properties change, adoption may or may not work, use at your own risk.

In my case, the configuration for selectors on some “adopted” Kubernetes Services weren’t being updated. Service Selectors are important as they point your Kubernetes Service resources to applications. I needed to flip services from an old set of databases to a new set of databases. These were entirely new Kubernetes Services that needed to exist for a data migration and then be adopted by Helm to handle future traffic management. Then we can have that sweet organisational cohesiveness as every resource is included in the day to day release. Well, Helm v3 adopted the Services but didn’t update the selectors as required, so traffic kept going to the old database. I manually updated them and Helm didn’t seem to notice. That’s like, another fix right?
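For context, the change in question is tiny - a Service's selector just names the pods it should route to. A hypothetical sketch of the field Helm "adopted" but never actually updated:

apiVersion: v1
kind: Service
metadata:
  name: orders-db
spec:
  selector:
    app: orders-db
    tier: new-cluster    # was "old-cluster"; the one line that needed changing
  ports:
  - port: 5432
    targetPort: 5432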

The discussion on resource adoption is 6 years old.

Shout out to this absolute lad who forked Helm and added the functionality.

Helm Has Limits

As Helm stores its state in Kubernetes Secrets (or alternatively ConfigMaps), Helm charts have a hard limit where they break. This is due to the etcd resource size being limited to a few MB depending on your etcd version. Seeing as Helm stores:

  • All resource definitions
  • The original values template
  • The provided values for that release
  • Some fun extras

Depending on the setup you can hit 3x multiples of data storage for each value in your inputs. When you do hit the limit, you can't deploy: Kubernetes rejects updates to the ConfigMap/Secret because the data is too large to store. Awesome.

This isn't theoretical, I've seen it happen a couple of times. In both cases the managed resources were complex and hosted in production, leaving those resources stuck without a refactor. The actual ConfigMaps and Secrets that make up the solution are fine; it's Helm that's failing, as it tries to cram all the data into a single limited resource.
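If you suspect a release is creeping towards the limit, you can get a rough sense of the stored payload size directly - the release name, revision and namespace below are hypothetical:

# Helm v3 stores the release as a compressed, encoded blob under the
# "release" key of the revision's Secret; counting its bytes gives a
# rough idea of how close you are to the limit
kubectl get secret sh.helm.release.v1.big-release.v12 -n prod \
  -o jsonpath='{.data.release}' | wc -c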

You can use a database to store Helm's state but it's in beta and Postgres only. As a comparison, Terraform also stores its own state but can keep it across:

  • Mounted filesystems
  • Remote backends such as the Terraform Cloud Service
  • Consul
  • Tencent Cloud Storage
  • S3 and S3 compatible storage (every cloud provider under the sun, Minio, etc)
  • Alibaba
  • Postgres
  • Any compatible HTTP service
  • And as a Kubernetes Secret, but don’t do this

Terraform’s state is one of the things I dislike most about Terraform, but it’s lightyears ahead of Helm’s equivalent.

Silly Helm Things

Some of the following have been “fixed” or aren’t a big deal, however, it still feels wrong to have these issues in large scale released software. A good chunk of these issues should never have existed outside of early development.

Helm Templates What it Wants and then Complains About It

As originally introduced, Helm takes inputs and templates, mashes them together and checks the output. Obviously, that output is different from the inputs, so if there's a problem you need to know at what point the problem occurred and what the proto-output looked like at that point. Up until Helm v3.2 you couldn't see the output prior to parsing, meaning you had to replicate Helm's templating steps outside of Helm to work out where the problem might be. It took until April 2020 for v3.2 to release, at which point Helm could occasionally tell you what it was complaining about before it imploded. Until then you had invalid YAML on line 200 of your 60-line YAML template.
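As per the quote earlier, the workaround from v3.2 onwards is the --debug flag, which - assuming it behaves as advertised - at least shows the rendered output even when it won't parse:

# Render the chart and show the output even if it's invalid YAML
helm template -f values.yml --debug .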

I code in Go; implementing this is trivial. Here's the commit that implemented it in a dozen lines. I don't understand how this was left for so long.

When I first experimented with Helm off the clock, this single issue was enough to disqualify Helm from my personal use. That was early days, yet being complained at about invalid code on lines that only existed in the application's head was a no-go.

Helm doesn’t do Exit Codes

Helm either fails with a 1 or succeeds with a 0. Modules might pass through exit codes but who knows. Adding in and maintaining os.Exit(2) is immeasurably complex.
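This matters in pipelines, where you often want to branch on why Helm failed. A sketch of what you're left working with (release and chart names hypothetical):

helm upgrade --install my-release ./chart -f values.yml
case $? in
  0) echo "released" ;;
  *) echo "failed - but the exit code won't tell you why" ;;
esac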

The Helm Team Love Essays

To get changes into Helm you need to come up with a Helm Improvement Proposal, based on Python's equivalent. Unlike Helm, Python is a fully fledged programming language with numerous implementations, concerns and reach across all of computing. Helm is… Helm. If you want to improve Helm beyond a bug fix, happy reading!

Helm Timeouts

On applying Helm you need to keep an eye on the cluster to find any issues that may arise. If a pod is crashlooping or missing some configuration then Helm will sit there until it hits its timeout and then delete everything. It doesn't scan logs or output, or gather any real evidence of what happened during the deployment. Instead you just get:

Error: UPGRADE FAILED: timed out waiting for the condition

What does this mean? Fuck you, Helm has deleted everything so now you won’t know. If you’re running Helm inside a pipeline this is even more obnoxious. Set your --timeout argument to 1800 seconds and wait around watching the cluster to discover what the actual problem is.
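In practice that looks something like the below (names hypothetical), with the actual diagnosis done by hand alongside Helm:

# Crank the timeout so Helm doesn't nuke things before you can look
helm upgrade --install my-release ./chart -f values.yml --timeout 1800s

# ...and find out what's actually wrong yourself
kubectl get pods -n prod -w
kubectl describe pod my-release-api-6c9f7 -n prod
kubectl logs my-release-api-6c9f7 -n prod --previous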

Thou Shalt Helm

Helm requires a particular file structure. If you don't provide this file structure then Helm won't tolerate whatever it is that you're doing. One of these files is the values.yaml file that must sit at the root of the Helm chart. If you don't provide this file then Helm has a tantrum. What if your chart has no values because everything is hardcoded or there's no variance? Fuck you, give me a values file. I've seen a values file like the below rolled all the way out to prod just to keep Helm happy:

$ cat values.yaml

this: true

Tiller

Oh boy, here's something that's probably as bad as the move to Python 3. The Python migration was of course much more involved, but Helm v2 is going to hang around just as long.

Helm v2 included a server side application that sat in the cluster and handled Helm commands. This application was called Tiller. Helm v3 removed Tiller which is a vast improvement, yet all those old applications deployed under Helm v2 don’t seem to be going anywhere and migrating them into Helm v3 is… well, see above. So even now, I see about as many Helm v2 managed applications bouncing about as Helm v3. There is the Helm 2to3 migration tool but I’ve never seen anyone use it. After dealing with Helm for years, I fully understand. Maybe the migration tool works wonderfully, but it’s absolutely not worth the risk and consequently I’ve never seen anyone try. Expect to see Tiller still running in the 2030s.

Tiller Security

This has been patched, but it's still interesting as a bookmark for Helm security. Tiller would just give you anything you asked for if you managed to connect to it. No RBAC, no authorisation, nothing. If you had Tiller in your cluster you had a giant rat wearing a wire.

Helm v2 is End of Life

This is fair, but given that Helm v2 is still everywhere keep in mind that it’s been abandoned. Python2 had security updates for a long time after Python3 came along. Helm ditched v2 a year after v3's release.

OK, what should I use instead?

Anything else really. Remember that Kubernetes manages all resources; you just provide the definitions of what you want and the Control Plane manages the rest. I've handled production systems with plain manifests just fine. In my experience, replacing Helm with plain manifests has been an improvement. All of the security, deployment and usage issues disappeared. There are no charts, stuck releases, disappearing resources, unchanged resources or other problems. Plain manifests never had Tiller and haven't been decommissioned.

But how do I handle massive manifests?

Use another tool such as Terraform, Ansible, Kustomize, etc. You might need the complexity, you might not.
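As one example, a minimal Kustomize layout covers a surprising amount of what Helm charts get used for. A sketch, assuming a conventional base/overlay structure (the paths and patch file are hypothetical):

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- path: replica-count.yaml

Rendering and applying it is a single kubectl apply -k overlays/production, with no release state stored anywhere.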

What about Helm Charts I don’t have time to migrate?

I’ve been there and dealt with this a lot. Grab the output and keep Helm the hell away from deployments by templating the chart into the needed output and feeding that into your preferred solution:

helm template -f values.yml . > output.yaml

Voila, you’ve got the whole output that Helm was only going to pass to the Kubernetes API anyway. Now you can check it, manage it and deploy it as you like. I guarantee that if you run kubectl apply -f output.yaml you won’t get a release not found error.

Is there ever a time where Helm is the right tool?

Sure, I can think of two.

Vendor supplied charts that only offer support for releases via their chart

I've seen the statement but never come across it being enforced in reality. Ultimately the vendor is paid by the client, so it hasn't been enforced. I've also seen vendors' technical teams be fully aware of Helm's issues and dislike the clause themselves, so they ignore it, recommend you don't use the chart, or suggest grabbing the output via the helm template command above.

When you need to get as far as possible without knowing what you’re doing

This is the reason I see in the wild. Folks out of their depth or run off their feet, shoving out as much stuff as possible to meet the next funding goal or speed-running straight into the Age of Strife. No one cares, ram it out the door, get that box ticked.

Does it matter? I don’t know. It’s hard to put a price on the future problems that you may or may not face one day. Banks are still running on COBOL and Microsoft Exchange Server 2000 can still be found delivering M&A documents. Maybe it’s fine. Maybe it’s not. But if you can strive for better than I’d suggest doing so.
