Initially, only the root cgroup exists, and all processes belong to it. You create an empty child cgroup by adding a subdirectory:
mkdir /sys/fs/cgroup/cg1
Each cgroup has an interface file called cgroup.procs that lists the PIDs of all processes belonging to the cgroup, one per line. A process can be moved to a cgroup by writing its PID into the cgroup's cgroup.procs file:
echo 24982 > /sys/fs/cgroup/cg1/cgroup.procs
Only one process can be migrated on a single write call. If a process is composed of multiple threads, writing the PID of any thread migrates all threads of the process.
Note: A process can be in only one cgroup at a time.
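Because a single write migrates only one process, moving several processes takes one write per PID. The following is a minimal sketch, assuming a cgroup2 hierarchy mounted at /sys/fs/cgroup and sufficient privileges; the helper name move_pids is hypothetical and not part of any cgroup tooling:

```shell
# move_pids CGROUP_DIR PID...
# Hypothetical helper: migrates each PID into CGROUP_DIR with its own write,
# because one write to cgroup.procs moves only one process.
move_pids() {
    cg_dir=$1
    shift
    for pid in "$@"; do
        # Each echo opens the file and issues a separate write.
        echo "$pid" > "$cg_dir/cgroup.procs" || return 1
    done
}

# Example (requires root and an existing cgroup):
#   move_pids /sys/fs/cgroup/cg1 24982 24983
#   cat /sys/fs/cgroup/cg1/cgroup.procs
```

The helper stops at the first failed write, which can happen if a process has already exited or the caller lacks permission.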
Designing a cgroup hierarchy
Controllers can be enabled in any cgroup from the root to the leaves. Each controller distributes its system resource along the hierarchy, according to its configuration and the configuration of the hierarchy’s subtrees.
You specify controllers for each cgroup using two interface files that appear in every cgroup—including the root and all its children:
| cgroup.controllers | Lists the controllers available in a cgroup. In the root cgroup, it lists all the controllers available on the system. In child cgroups, it lists the controllers specified in its parent's cgroup.subtree_control file.
| cgroup.subtree_control | Lists the controllers that are active (enabled) in the cgroup’s subtrees. The controllers listed here are the ones available to descendant cgroups; they're listed in the cgroup.controllers files of its children.
You activate or deactivate controllers by writing their names to cgroup.subtree_control, each preceded by either a plus sign (+) to enable it, or a minus sign (-) to disable it, as in this example:
echo '+cpu -memory' > /sys/fs/cgroup/cg1/cgroup.subtree_control
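A controller can only be enabled for a cgroup's children if it appears in that cgroup's own cgroup.controllers file. The sketch below checks availability before writing; it assumes the space-separated format of cgroup.controllers, and the helper name enable_controllers is hypothetical:

```shell
# enable_controllers CGROUP_DIR CONTROLLER...
# Hypothetical helper: enables the named controllers for the children of
# CGROUP_DIR, refusing any controller not listed in cgroup.controllers.
enable_controllers() {
    cg_dir=$1
    shift
    available=$(cat "$cg_dir/cgroup.controllers") || return 1
    request=""
    for ctrl in "$@"; do
        case " $available " in
            *" $ctrl "*) request="$request +$ctrl" ;;
            *) echo "controller '$ctrl' is not available in $cg_dir" >&2
               return 1 ;;
        esac
    done
    # Nothing requested: leave the file untouched.
    [ -n "$request" ] || return 0
    # Write all requests at once, e.g. "+cpu +memory".
    echo "${request# }" > "$cg_dir/cgroup.subtree_control"
}

# Example (requires root):
#   enable_controllers /sys/fs/cgroup/cg1 cpu memory
```

Checking first gives a readable error message instead of a bare write failure from the kernel.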
In the hierarchy below, each cgroup.subtree_control file determines the set of controllers available to its child cgroups, i.e., the controllers that appear in the cgroup.controllers file of its children.
In this example, the root cgroup distributes resources to two partitions: system.slice, where system processes run, and workload.slice, where production workload apps typically run. The set of resource controllers available to child cgroups with services like smc_proxy.service is further restricted by the cgroup.subtree_control files of their respective parents.
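To see how controllers propagate through an existing hierarchy, you can print both interface files for every cgroup in a subtree. This is a sketch only; the helper name show_controllers is hypothetical, and the default path assumes cgroup2 is mounted at /sys/fs/cgroup:

```shell
# show_controllers [ROOT_DIR]
# Hypothetical helper: for each cgroup under ROOT_DIR, prints the controllers
# it received from its parent and the ones it passes on to its children.
show_controllers() {
    root=${1:-/sys/fs/cgroup}
    find "$root" -type d | sort | while IFS= read -r cg; do
        printf '%s\n' "$cg"
        printf '  controllers:     %s\n' "$(cat "$cg/cgroup.controllers" 2>/dev/null)"
        printf '  subtree_control: %s\n' "$(cat "$cg/cgroup.subtree_control" 2>/dev/null)"
    done
}

# Example:
#   show_controllers /sys/fs/cgroup/system.slice
```

For any cgroup in the output, the "controllers" line should match the "subtree_control" line of its parent.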
The fbtax2 cgroup hierarchy
The cgroup hierarchy used for the fbtax2 project is similar to the one above, but introduces some additional structures and best practices. It divides the hierarchy into three top-level cgroups, each with its own purpose. In addition to the system.slice cgroup where system binaries run, it also includes:
hostcritical.slice: This cgroup protects processes required to keep the host running. It contains critical host management functions like oomd, an alternative to the system OOM (out-of-memory) killer, that we'll look at in a later section of this case study.
workload.slice: Protecting the main workload from resource conflicts with system binaries was a primary goal of the project, and that's the main purpose of workload.slice. In this case, the main workload gets its own child cgroup, workload-container.slice, to protect it. Another cgroup, workload-support.slice, provides protection to some of the system binaries needed to keep the main workload running. For example, if binaries like workload-support.service fail, the main workload can also fail. The exact set of system binaries a workload needs, and the amount of resources allocated to them, will differ depending on the context. But protecting required binaries in a cgroup like workload-support.slice helps ensure they're available to keep the main workload running.
In the next section of this case study, we'll look at how fbtax2 uses PSI memory pressure metrics.
Additional notes about cgroup hierarchies
All controller behaviors are hierarchical: if you enable a controller on a cgroup, it affects all processes belonging to the cgroups in its sub-hierarchy.
Similarly, most of the statistics you can query in cgroups, such as current memory or CPU usage, show the sum for the entire subtree.
Restrictions set closer to the root in the hierarchy can't be overridden from further away. When you enable a controller on a nested cgroup, it always restricts the resource distribution further.
Note: The sum of the resources consumed by child cgroups can't exceed the total amount of resources available to the parent.
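One way to picture this rule is to compute the tightest limit that applies to a cgroup by walking from it up to the root. The sketch below does this for memory.max, the memory controller's hard limit file, which holds either a byte count or the string max. The file is not discussed above and is used here only as an illustration; the helper name effective_memory_max is hypothetical, and the kernel enforces the limits itself regardless of this calculation:

```shell
# effective_memory_max CGROUP_DIR [ROOT_DIR]
# Hypothetical helper: prints the smallest memory.max found on the path from
# CGROUP_DIR up to ROOT_DIR, or "max" if no ancestor sets a limit.
effective_memory_max() {
    cg=$1
    root=${2:-/sys/fs/cgroup}
    effective=max
    while :; do
        if [ -r "$cg/memory.max" ]; then
            limit=$(cat "$cg/memory.max")
            if [ "$limit" != max ]; then
                if [ "$effective" = max ] || [ "$limit" -lt "$effective" ]; then
                    effective=$limit
                fi
            fi
        fi
        # Stop at the hierarchy root (or at a path root as a safeguard).
        case $cg in
            "$root"|/|.) break ;;
        esac
        cg=$(dirname "$cg")
    done
    echo "$effective"
}

# Example:
#   effective_memory_max /sys/fs/cgroup/cg1/cg2
```

A child that sets a larger value than an ancestor still ends up bounded by the ancestor's smaller limit, which is what the walk reports.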
Avoiding resource conflicts between parent and child cgroups
To eliminate situations where child cgroups compete for resources against the internal processes of their parent, cgroups can control distribution of memory and IO to their children only when they contain no processes of their own. In other words, cgroups can either contain tasks or control resources for child cgroups, but not both.
Note: An exception to this rule is the root cgroup, which can both contain processes and control resource distribution.
Thus, if a parent cgroup controls resource distribution to its children (i.e., by having a non-empty cgroup.subtree_control file), any processes in the parent cgroup must be moved to their own child cgroups.
For instance, a cgroup like /cg1/cg2 can contain processes, but if /cg1 also contains any processes of its own, those processes should be moved to their own leaf node, e.g.,
For certain legacy configurations, the CPU controller can also contain processes and control resource distribution. See the sections on thread mode in the cgroups(7) man page for details.
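Moving a parent's own processes into a leaf can be sketched as follows. The leaf name "leaf" is a hypothetical default chosen for illustration, as is the helper name move_to_leaf; the sketch assumes root privileges on a mounted cgroup2 hierarchy:

```shell
# move_to_leaf PARENT_DIR [LEAF_NAME]
# Hypothetical helper: creates a child cgroup under PARENT_DIR and migrates
# every process currently in PARENT_DIR into it, so the parent holds no
# processes of its own.
move_to_leaf() {
    parent=$1
    leaf=$parent/${2:-leaf}
    mkdir -p "$leaf" || return 1
    # Snapshot the PID list first; the file changes as processes migrate.
    pids=$(cat "$parent/cgroup.procs") || return 1
    for pid in $pids; do
        # One PID per write. A process may exit before it is moved, so
        # report the failure and keep going.
        echo "$pid" 2>/dev/null > "$leaf/cgroup.procs" ||
            echo "could not move PID $pid" >&2
    done
}

# Example (requires root):
#   move_to_leaf /sys/fs/cgroup/cg1
```

Once the parent is empty, controllers can be enabled in its cgroup.subtree_control file without violating the rule described above.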
Other useful cgroup interface files
These cgroup interface files come in handy when designing and testing a cgroup hierarchy:
| cgroup.events | Contains key/value pairs that identify states or events for the cgroup. Events or state changes generate file modified events.
| cgroup.max.depth | Specifies the limit on the nesting depth of descendant cgroups.
| cgroup.max.descendants | Specifies the limit on the number of descendant cgroups that a cgroup may have. Writing the string "max" removes the limit.
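The key/value file described first in this list is named cgroup.events in the kernel documentation and exists only in non-root cgroups. A small sketch for reading a single key from it follows; the keys populated and frozen come from the kernel documentation, and the helper name cgroup_event is hypothetical:

```shell
# cgroup_event CGROUP_DIR KEY
# Hypothetical helper: prints the value stored for KEY in cgroup.events and
# returns non-zero if the key is absent.
cgroup_event() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1 }
        END { exit !found }
    ' "$1/cgroup.events"
}

# Example: check whether a cgroup or its descendants contain live processes.
#   if [ "$(cgroup_event /sys/fs/cgroup/cg1 populated)" = 1 ]; then
#       echo "cg1 still has processes"
#   fi
```

This kind of check is useful before removing a cgroup directory, since a cgroup that still holds processes cannot be removed.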