Protecting workloads with io.latency

You protect workloads with io.latency by specifying a latency target (e.g., 20ms). If the protected workload experiences average completion latency longer than its latency target, the controller throttles any peers that have a more relaxed latency target than the protected workload. The delta between the prioritized cgroup's target and the targets of the other cgroups is used to determine how hard the other cgroups are throttled: if a cgroup with io.latency set to 20ms is prioritized, cgroups with latency targets <= 20ms will never be throttled, while a cgroup with a 50ms target will get throttled harder than a cgroup with a 30ms target.

Interface

The interface for io.latency is in a format similar to the other controllers:

    MAJOR:MINOR target=<target time in microseconds>

When io.latency is enabled, you'll see additional stats in io.stat:

- depth=<integer>: the current queue depth for the group.
- avg_lat=<time in microseconds>: the running average IO latency for this group. This provides a general idea of the overall latency you can expect for this workload on the specified disk.

Note: All cgroup knobs can be configured through systemd. See the systemd.resource-control documentation for details.

Using io.latency

The limits are applied only at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and groups D and F will influence each other. Group G will influence nobody. Thus, a common way to configure this is to set io.latency in groups A, B, and C.

Configuration strategies

Generally you don't want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload: start higher than the expected latency for your device, and watch the avg_lat value in io.stat for your workload group to get an idea of the latency during normal operation. Use this value as a basis for your real setting, for example by setting it around 20% higher than the value in io.stat. Experimentation is key here, since avg_lat is a running average and subject to statistical anomalies.

Setting too tight a control (i.e., too low a latency target) provides greater protection to a workload, but it can come at the expense of overall system IO overhead if other workloads get throttled prematurely. Another important factor is that hard disk IO latency can fluctuate greatly: if the latency target is too low, other workloads can get throttled due to normal latency fluctuations, again leading to sub-optimal IO control. Thus, in most cases you'll want to set the latency target higher than the expected latency to avoid unnecessary throttling; the only question is by how much. Two general approaches have proven most effective:

- Setting io.latency 20-25% higher than the usual expected latency. This provides a tighter protection guarantee for the workload. However, the tighter control can sometimes mean the system pays more in terms of IO overhead, which leads to lower system-wide IO utilization. A setting like this can be effective for systems with SSDs.
- Setting io.latency several times higher than the usual expected latency, especially for hard disks. A hard disk's usual uncontended completion latencies are between 7 and 20ms, but when contention occurs, the completion latency balloons quickly, easily reaching 10 times normal. Because the latency is so volatile, workloads running on hard disks are usually not sensitive to small swings in completion latency; things break down only in extreme conditions, when latency jumps several times higher (which isn't difficult to trigger). Effective protection can be achieved in cases like this by setting a relaxed target on the protected group (e.g., 50 or 75ms) and a higher setting for lower priority groups (e.g., an additional 25ms over the higher priority group). This way, the workload can have reasonable protection without significantly compromising hard disk utilization by triggering throttling when it's not necessary (see the sketch below).
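To make this concrete, here is a minimal shell sketch of the second strategy using the raw cgroup2 interface. It assumes a cgroup2 hierarchy mounted at /sys/fs/cgroup and a rotating disk with device number 8:16; the device number, the group names "workload" and "background", and the placeholder WORKLOAD_PID are illustrative assumptions rather than values from this article.

    # Enable the io controller for children of the root cgroup.
    echo "+io" > /sys/fs/cgroup/cgroup.subtree_control

    # Create two peer groups; only peers influence each other under io.latency.
    mkdir -p /sys/fs/cgroup/workload /sys/fs/cgroup/background

    # Relaxed 50ms target for the protected group, written as
    # "MAJOR:MINOR target=<microseconds>" (find MAJOR:MINOR with lsblk).
    echo "8:16 target=50000" > /sys/fs/cgroup/workload/io.latency

    # An additional 25ms of slack (75ms) for the lower-priority peer.
    echo "8:16 target=75000" > /sys/fs/cgroup/background/io.latency

    # Move the workload into its group, then watch depth= and avg_lat= in io.stat.
    echo "$WORKLOAD_PID" > /sys/fs/cgroup/workload/cgroup.procs
    cat /sys/fs/cgroup/workload/io.stat

The avg_lat observed here during normal operation is also what you would feed back into the first strategy (setting the target roughly 20-25% above it) on SSD-backed systems.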
How throttling works

io.latency is work conserving: as long as everybody's meeting their latency target, the controller doesn't do anything. Once a group starts missing its target, it begins throttling any peer group that has a higher target than itself. This throttling takes two forms:

- Queue depth throttling: the number of outstanding IOs a group is allowed to have. The controller clamps down relatively quickly, starting at no limit and going all the way down to one IO at a time.
- Artificial delay induction: certain types of IO can't be throttled without possibly affecting higher priority groups adversely. This includes swapping and metadata IO. These types of IO are allowed to occur normally, but they are "charged" to the originating group.

Once the victimized group starts meeting its latency target again, it starts unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO, the global counter will unthrottle appropriately.

fbtax2 IO controller configuration

As discussed previously, the goal of the fbtax2 cgroup hierarchy was to protect workload.slice. In addition to the memory controller settings, the team found that IO protections were also necessary to make it all work. When memory pressure increases, it often translates into IO pressure: memory pressure leads to page evictions, and the higher the memory pressure, the more page evictions and re-faults, and therefore more IOs. It isn't hard to generate memory pressure high enough to saturate a disk with IOs, especially the rotating hard disks that were used on the machines in the fbtax2 project.

To correct for this, the team used a strategy similar to strategy 2 described above: they prioritized workload.slice by setting its io.latency higher than the expected latency, to 50ms. This provides more protection for workload.slice than for system.slice, whose io.latency is set to 75ms. When workload.slice has been delayed by lack of IO past its 50ms threshold, it gets IO priority: the kernel limits IO from system.slice and reallocates it to workload.slice so the main workload can keep running.

hostcritical.slice was given a similar level of protection as workload.slice, since any problems there can also impact the main workload. In this case it used memory.min to guarantee it will have enough memory to keep running. Though they knew system.slice needed lower IO priority, the team determined the 75ms number through trial and error, modifying it repeatedly until they achieved the right balance between protecting the main workload and ensuring the stability of system.slice.

In the final installment of this case study, we'll summarize the strategies used in the fbtax2 project and look at some of the utilization gains that resulted in Facebook's server farms.
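The 50ms/75ms split described above can also be expressed through systemd, per the earlier note that all cgroup knobs are configurable via systemd.resource-control. The sketch below is a hedged approximation rather than the fbtax2 team's actual configuration: the device path /dev/sda and the drop-in file names are assumptions, and it uses the IODeviceLatencyTargetSec= directive to set the targets on workload.slice and system.slice.

    # Assumed device path; substitute the disk that backs the workload.
    DEV=/dev/sda

    # 50ms target for the protected workload slice.
    mkdir -p /etc/systemd/system/workload.slice.d
    printf '[Slice]\nIODeviceLatencyTargetSec=%s 50ms\n' "$DEV" \
        > /etc/systemd/system/workload.slice.d/10-io-latency.conf

    # 75ms (more relaxed) target for lower-priority system services.
    mkdir -p /etc/systemd/system/system.slice.d
    printf '[Slice]\nIODeviceLatencyTargetSec=%s 75ms\n' "$DEV" \
        > /etc/systemd/system/system.slice.d/10-io-latency.conf

    # Reload unit configuration so the new drop-ins take effect.
    systemctl daemon-reload

hostcritical.slice could be given a target close to workload.slice's in the same way, in line with the similar level of protection described above.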