On Package Manager Architectures, Part I

I am something of a package manager junkie. I even wrote my own package manager, Degasolv, which can manage dependencies across programming language boundaries. Along the way, I've noticed some significant patterns in package management architecture.

There are really three different kinds of package managers. Each fundamental type of package manager makes trade-offs. Each type is fairly successful in its own ecosystem. It is instructive to point out the differences.

Here I will treat the two most significant. The third type, which includes Ivy and Maven, deserves its own blog post (assuming I get to it).

The Main Differences

The first type is that of the system package manager. The archetype of this category is YUM, and more recently, DNF. In this model, the following facts are true:

  1. Packages install files into a shared file system, and are shared resources. Only one package of each package name may exist in an installation at any given time. Packages may NOT share ownership of a particular file (RPM is actually pretty strict about this), and generally do not share resources. 
  2. Because of this, there must be agreement on package semantics. Since the "libc" package is a shared resource, all packages must agree on what "libc" means. This means that all packages must agree on an acceptable version of libc.
  3. All packages are owned by the system, NOT by other packages. Java apps do not specify how to configure Java, for example. The systems administrator is in charge of configuring the Java package, not other packages that rely on Java.
  4. No domain knowledge exists at the package level of the "type" of thing being installed; all DNF knows about are files. Packages installing a C shared library and packages installing a tomcat service are in some sense created equal.
  5. At least at a theoretical level, circular dependencies are possible. The dependency chain "a -> b -> a" can be resolved by installing both "a" and "b".
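
To make the list above a bit more concrete, here is a minimal sketch of a flat, system-style resolver. It is not how YUM or DNF are actually implemented; the REPO table, the package names in it, and the resolve_flat function are all made up for illustration. It shows two of the points above: each package name is installed at most once, and a cycle like "a -> b -> a" resolves simply by installing both.

```python
# Hypothetical in-memory repository: name -> (version, dependency names).
# None of this reflects real YUM/DNF data structures.
REPO = {
    "a": ("1.0", ["b"]),
    "b": ("2.1", ["a"]),  # circular: a -> b -> a
    "libc": ("2.28", []),
}


def resolve_flat(roots, repo):
    """Return {name: version}, selecting each package name at most once."""
    installed = {}
    pending = list(roots)
    while pending:
        name = pending.pop()
        if name in installed:
            # Already selected, so skip it. This check is what ends a cycle.
            continue
        version, deps = repo[name]
        installed[name] = version
        pending.extend(deps)
    return installed


flat = resolve_flat(["a"], REPO)
print(flat)
```

Because the installed set is keyed by name alone, there is simply no place to put a second version of "b". That is the agreement on semantics from point 2, enforced by the data structure.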

Contrast this with a package manager like NPM. NPM is the archetype of the second type, a class which I will call the "hierarchical package managers". Others include GX and Helm. NPM is very popular, and very different:

  1. All packages are installed relative to the package requiring them. Therefore, packages are not shared resources. Package installations are only used by the package that depends on them. There are no shared resources anywhere (in theory); if A relies on B, it is the only package relying on B. If, for example, package A relies on both B and C, and package B relies on package C, then package B will get its own copy of C installed relative to B and package A will get its own copy of C installed relative to A. A and B do not share C at all.
  2. Because of this, no agreement on package semantics is necessary. A and B in the example above do not need to agree on what "C" means; A can use 2.0 of package C and B can use 1.0 of package C and no one cares.
  3. In this sense, packages are "owned" by parent packages. Within this category, this is easiest to see in the Helm package manager. Values in a sub-package's values.yaml (the package's configuration file) may be overridden by values in a parent package. Ownership, the sort of "care and feeding" that the depended-on package needs to work, is provided by the depending package.
  4. The package manager relies heavily on domain knowledge. It makes assumptions about what is being installed and only makes sense in the context of some programming language or system, such as JavaScript in NPM's case or Kubernetes in Helm's case.
  5. Because packages are installed relative to the packages that rely on them, circular dependencies are not possible.
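
The hierarchical model described in this list can be sketched in the same way. This is a simplified illustration, not NPM's real algorithm, and it ignores whatever optimizations real tools add. The MANIFESTS table and the install_nested function are hypothetical names. It uses the A, B and C example from points 1 and 2.

```python
# Hypothetical manifests: (name, version) -> {dependency name: required version}.
MANIFESTS = {
    ("A", "1.0"): {"B": "1.0", "C": "2.0"},
    ("B", "1.0"): {"C": "1.0"},
    ("C", "1.0"): {},
    ("C", "2.0"): {},
}


def install_nested(name, version, manifests):
    """Build a private dependency tree rooted at one package."""
    deps = manifests[(name, version)]
    return {
        "name": name,
        "version": version,
        # Every dependency is installed again, relative to its parent.
        "deps": {d: install_nested(d, v, manifests) for d, v in deps.items()},
    }


tree = install_nested("A", "1.0", MANIFESTS)
a_c = tree["deps"]["C"]["version"]               # A's own copy of C
b_c = tree["deps"]["B"]["deps"]["C"]["version"]  # B's own copy of C
print(a_c, b_c)
```

A and B each get their own C, at different versions, and nothing has to agree. Note also that in this naive sketch a cycle in MANIFESTS would recurse forever, which is one way to read point 5.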

On Shared Resources

The most interesting point in the above discussion, in my opinion, is that of whether or not a package is a shared resource. Consider this bug I filed with the Helm chart package manager. In it, I show that although the package manager Helm is of the second type, and therefore assumes that packages (in Helm nomenclature, "charts") are not shared resources, this is not always the case. In the default case, packages create services using names that assume that there exists only one package of a certain name in any given release. 

In short, for a package manager of the second type to work, packages must truly, and in every way, never be shared resources. There are subtle corner cases in every language or system where this is simply not true.

Consider the example of two packages with a "mutual friend" package. In this scenario, Package Alice and Package Ben have a mutual friend, Package Cat. Alice talks to Cat and asks Cat to tell her the right amount of money needed to buy a horse. Alice then asks Ben to buy a horse for her, giving him needed money. Ben then asks Cat if the money given to buy the horse is enough.

The above example is contrived and tells the story of the classic diamond: Package A, which relies on Package B and Package C. Package B also relies on Package C.

Now consider what happens if these are NPM packages. Package B's Package C is installed relative to B, and could be at an entirely different version than Package A's Package C. That version difference may be a breaking one. Suppose Cat's method of figuring out how much money is needed to buy a horse changed completely between Cat 2.0 and Cat 3.0. Alice's Cat 2.0 says she needs $1000, and she gives this to Ben. Ben asks his Cat 3.0 if that's enough, but Cat 3.0 says "no". Now Ben gives an error and no one knows why.
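
Here is the horse story as a small sketch. Everything in it is hypothetical: the class names, the method names, and in particular the 1500 figure for Cat 3.0, which I picked only so that the two versions disagree. The $1000 comes from the story above.

```python
class Cat2:
    """Hypothetical Cat 2.0: a horse costs 1000."""

    def horse_price(self):
        return 1000

    def is_enough(self, amount):
        return amount >= self.horse_price()


class Cat3:
    """Hypothetical Cat 3.0: the pricing rule changed (1500 is made up)."""

    def horse_price(self):
        return 1500

    def is_enough(self, amount):
        return amount >= self.horse_price()


class Ben:
    def __init__(self, cat):
        self.cat = cat  # Ben's private copy of Cat

    def buy_horse(self, money):
        if not self.cat.is_enough(money):
            raise ValueError(f"{money} is not enough to buy a horse")
        return "horse"


class Alice:
    def __init__(self, cat, ben):
        self.cat = cat  # Alice's private copy of Cat
        self.ben = ben

    def get_horse(self):
        money = self.cat.horse_price()  # Alice trusts her Cat
        return self.ben.buy_horse(money)  # Ben trusts his Cat


def try_purchase(alice):
    """Run the purchase and report either the result or the error."""
    try:
        return alice.get_horse()
    except ValueError as err:
        return f"error: {err}"


mismatched = try_purchase(Alice(Cat2(), Ben(Cat3())))
shared = try_purchase(Alice(Cat3(), Ben(Cat3())))
print(mismatched)
print(shared)
```

Neither Alice nor Ben contains a bug on its own. The failure only appears when the two private copies of Cat disagree, and it goes away as soon as both use the same version.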

This example shows that even though Ben and Alice do not share the same Cat package, they are sharing her as a resource. They rely on her expertise and believe that whatever Cat says is right. Therefore NPM's fundamental assumption that Cat is not a shared resource is violated. The developers of Alice don't talk to Ben's developers, but they're both in the same community and both have been told to use package Cat for all things horse. It's simply the de facto solution in the community.

Though contrived, this is a real problem. NPM tried to fix this by introducing "peerDependencies" for those packages where this really matters, and I think this isn't a bad plan. It allows packages to use the hierarchical method of managing dependencies, assuming packages aren't shared resources, for the general case, while allowing them to specify hard requirements when problems arise.

On Ownership

A second fundamental difference between these is ownership. I'm now considering the differences between RPM, a package format from the systems family, and Helm charts, packages from the hierarchical family. I do this because charts are unique (as far as I can tell) in that you can configure them through their values.yaml file from a parent chart. This means that, in some sense, depending packages in Helm can configure their depended-on packages.
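
The effect of a parent chart configuring a sub-chart can be sketched as a recursive merge of two dictionaries. This is only an illustration of the idea, not Helm's actual code. The deep_merge function, the sub-chart name "database", and all of the keys and values are assumptions I made up for the example. The one thing taken from Helm is that the parent's overrides live under a key named after the sub-chart.

```python
def deep_merge(defaults, overrides):
    """Return a copy of defaults with overrides applied recursively."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of the sub-chart's own values.yaml.
subchart_defaults = {
    "service": {"port": 5432, "type": "ClusterIP"},
    "replicas": 1,
}

# Hypothetical contents of the parent chart's values.yaml. Overrides for the
# sub-chart sit under a key named after it ("database" is made up).
parent_values = {
    "database": {"service": {"port": 6543}},
}

effective = deep_merge(subchart_defaults, parent_values.get("database", {}))
print(effective)
```

The sub-chart still supplies its own defaults, but the parent gets the last word on anything it chooses to set. That last word is what I mean by ownership.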

Contrast this with RPM. A Java app like Jenkins installed via RPM would never presume to configure Java for you. Things like Java heap size are entirely up to the administrator. A central authority emerges in this model, where the sysadmin or <insert favorite infrastructure-as-code tool here> is responsible for configuring the package and making it work.

Which is Better?

I think this boils down to a discussion around where the authority should be. Should packages be made to share resources owned by a central authority (the sysadmin), or should they be given their own sandbox, allowed to use what's in their box with no regard to the needs of other packages?

This question has a direct analogue in human communities and how they communicate, as they must. Should developer teams coordinate, often centrally? Or should we try to isolate developer teams so that they do not need to coordinate or communicate?

This question sounds harsh, but we do and must ask and answer it all the time. It is everywhere. This is particularly true in DevOps, which is my profession. Do we make two teams' products share a virtual machine when we deploy their code, or do they get separate VMs? We usually choose the latter, isolating them from each other. Do we make two teams' products agree on what version of PostgreSQL to use, though they will use separate databases? Yes; our support contract only supports certain versions of this backing, depended-on service. Both teams depend on the database admins, and the database admins' service won't work with theirs unless certain constraints are met.

So: as package manager designers, which one do we choose? Do we enforce some coordination between packages? If we do, all packages will agree on semantics (e.g., package "B" will mean the same thing to all depending packages, because it will be the same version), and all packages will need some central authority to configure them. If we don't, packages could be allowed to configure sub-packages, depending in turn on configuration given to them. Do we trust package authors to make the right configuration choices? They may be trustworthy, but they will have no knowledge of other packages within the same installation. Will this cause problems?
