My Web Markups - richard yuwen
  • The company's DNA: it is precisely because they have long used containers themselves and practiced DevOps for years
  • A broad PaaS offering with OCP at its core
  • More than 400 container images
  • BuildConfig
  • When learning a technology, pick one that is alive and has a healthy ecosystem, so that market demand is still strong at least until you have fully figured it out
  • How the many open-source projects can be combined in an enterprise to quickly build a body of knowledge and turn it into real capability
  • Deployment, operations, and looking up documentation are all much easier than with plain Kubernetes
  • OCP is the culmination of many open-source projects
  • As an enterprise container platform, OpenShift opens the door from PaaS to DevOps and microservices
  • Win over development first, then bridge into operations
  • Oriented toward development
  • DevOps has long had an established methodology
  • Kubernetes was accepted by enterprise customers because of OCP
  • OCP was reborn because of Kubernetes
  • OCP has more than 1,000 customer cases worldwide
  • A PaaS platform product for DevOps
  • Customers' applications run faster and are developed faster
  • What is the value of a customer's agile-mode business?
  • Bimodal IT: agile mode plus steady mode
  • Docker gave containers good operability and portability, and Kubernetes made containers fit for enterprise use; the enterprise container platform in turn became the new generation of infrastructure for PaaS, DevOps, and microservices
  • Red Hat promptly announced a partnership with Google to promote Kubernetes, which laid the foundation for OpenShift's influence in the industry today
  • Technologies such as container image building and packaging
  • Operability and practicality
  • Once customers understand containers themselves, they look for vendors that can provide enterprise-grade container solutions
  • IT vendors drove the adoption of virtualization technology at customer sites
  • A container solution can be discussed with both the customer's operations and development departments
  • Containers are application-oriented from the very start
  • A container uses the host's operating system directly to "fool" the application inside it
  • A virtual machine fools the operating system with fake CPU, memory, and network
  • Red Hat's share is 44%
  • Enterprise Container Platform (ECP)
31 annotations
  • One generic chart combined with different values.yaml files achieves reuse (see the sketch after this list)
  • Application Configuration is for the operations role
  • When designing how an application's operational capabilities are defined, our main concern is how those capabilities are discovered and managed
  • Description of operational capabilities
  • Description of application components
  • The definition must not describe everything in a single YAML file
  • Distinguish the roles of the users
  • Define once, run anywhere
  • A single application definition must run unmodified in different runtime environments
  • Split and categorize the raw K8s APIs according to real-world collaboration patterns, then expose them separately to developers and to operators
  • When infrastructure engineers want to build on K8s to serve the application developers and operators above them
  • Describing operational capabilities, such as autoscaling, traffic switching, canary releases, and monitoring, involves a whole set of rules
  • Describe everything about the application itself: its image, startup parameters, the cloud resources it depends on, and so on
  • An "application definition" is in fact missing from the cloud-native community as a whole
  • Helm and the Application CRD merely compose K8s APIs together and cannot describe our dependencies on cloud resources
  • Access depends on an SLB
  • The database depends on RDS
  • On Kubernetes we also tried using Helm and the Application CRD to define applications
  • Docker solved single-machine application delivery: the Docker image defines a single-machine application very well
  • The OpenKruise project at the third layer
  • An efficient, concise application management and delivery system
  • Helm sits at the very top of the whole application management stack
  • Kubernetes itself does not provide a complete application management system
  • The layered model of application delivery
  • Once Dockerized, you can "package once, run anywhere"
  • Application delivery scenarios
  • Different operational capabilities in the same environment may actually conflict with one another
  • And how should the needs of application developers and application operators be conveyed to the infrastructure?
  • The PaaS platform lets developers fill in only a handful of Deployment fields
29 annotations
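
The chart-reuse idea above ("one generic chart, different values.yaml files") is easy to make concrete. Below is a minimal sketch, assuming a local chart directory ./app-chart and hypothetical values-staging.yaml / values-production.yaml files; it simply drives the standard helm upgrade --install command once per environment.

```python
# Sketch: reuse one generic Helm chart across environments by swapping values files.
# ./app-chart and the values-*.yaml file names are assumptions for illustration.
import subprocess

ENVIRONMENTS = {
    "staging": "values-staging.yaml",
    "production": "values-production.yaml",
}

for env, values_file in ENVIRONMENTS.items():
    # The same chart is rendered with environment-specific values,
    # producing one release per environment.
    subprocess.run(
        ["helm", "upgrade", "--install", f"app-{env}", "./app-chart",
         "--namespace", env, "-f", values_file],
        check=True,
    )
```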
  • io.latency
  • You protect workloads with io.latency by specifying a latency target (e.g., 20ms). If the protected workload experiences average completion latency longer than its latency target value, the controller throttles any peers that have a more relaxed latency target than the protected workload. The delta between the prioritized cgroup's target and the targets of other cgroups is used to determine how hard the other cgroups are throttled: if a cgroup with io.latency set to 20ms is prioritized, cgroups with latency targets <= 20ms will never be throttled, while a cgroup with a 50ms target will get throttled harder than a cgroup with a 30ms target.

Interface: the interface for io.latency is in a format similar to the other controllers: MAJOR:MINOR target=<target time in microseconds>. When io.latency is enabled, you'll see additional stats in io.stat: depth=<integer>, the current queue depth for the group; and avg_lat=<time in microseconds>, the running average IO latency for this group, which gives a general idea of the overall latency you can expect for this workload on the specified disk. Note: all cgroup knobs can be configured through systemd; see the systemd.resource-control documentation for details.

Using io.latency: the limits are applied only at the peer level in the hierarchy. This means that in the diagram from the original article (not reproduced here), only groups A, B, and C will influence each other, and groups D and F will influence each other; group G will influence nobody. Thus, a common way to configure this is to set io.latency in groups A, B, and C.

Configuration strategies: generally you don't want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload: start higher than the expected latency for your device, and watch the avg_lat value in io.stat for your workload group to get an idea of the latency during normal operation. Use this value as a basis for your real setting: try setting it, for example, around 20% higher than the value in io.stat. Experimentation is key here, since avg_lat is a running average and subject to statistical anomalies. Setting too tight a control (i.e., too low a latency target) provides greater protection to a workload, but it can come at the expense of overall system IO overhead if other workloads get throttled prematurely. Another important factor is that hard disk IO latency can fluctuate greatly: if the latency target is too low, other workloads can get throttled due to normal latency fluctuations, again leading to sub-optimal IO control. Thus, in most cases you'll want to set the latency target higher than the expected latency to avoid unnecessary throttling; the only question is by how much. Two general approaches have proven most effective. First, setting io.latency slightly higher (20-25%) than the usual expected latency: this provides a tighter protection guarantee for the workload, but the tighter control can sometimes mean the system pays more in IO overhead, which leads to lower system-wide IO utilization. A setting like this can be effective for systems with SSDs. Second, setting io.latency several times higher than the usual expected latency, especially for hard disks. A hard disk's usual uncontended completion latencies are between 7 and 20ms, but when contention occurs, the completion latency balloons quickly, easily reaching 10 times normal.
Because the latency is so volatile, workloads running on hard disks are usually not sensitive to small swings in completion latency; things break down only in extreme conditions when latency jumps several times higher (which isn't difficult to trigger). Effective protection can be achieved in cases like this by setting a relaxed target on the protected group (e.g., 50 or 75ms), and a higher setting for lower-priority groups (e.g., an additional 25ms over the higher-priority group). This way, the workload can have reasonable protection without significantly compromising hard disk utilization by triggering throttling when it isn't necessary.

How throttling works: io.latency is work conserving; as long as everybody is meeting their latency target, the controller doesn't do anything. Once a group starts missing its target, it begins throttling any peer group that has a higher target than itself. This throttling takes two forms. Queue depth throttling limits the number of outstanding IOs a group is allowed to have; the controller clamps down relatively quickly, starting at no limit and going all the way down to 1 IO at a time. Artificial delay induction covers certain types of IO that can't be throttled without possibly affecting higher-priority groups adversely, such as swapping and metadata IO; these types of IO are allowed to occur normally, but they are "charged" to the originating group. Once the victimized group starts meeting its latency target again, it will start unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO, the global counter will unthrottle appropriately.

fbtax2 IO controller configuration: as discussed previously, the goal of the fbtax2 cgroup hierarchy was to protect workload.slice. In addition to the memory controller settings, the team found that IO protections were also necessary to make it all work. When memory pressure increases, it often translates into IO pressure: memory pressure leads to page evictions, and the higher the memory pressure, the more page evictions and re-faults, and therefore more IOs. It isn't hard to generate memory pressure high enough to saturate a disk with IOs, especially the rotating hard disks that were used on the machines in the fbtax2 project. To correct for this, the team used a strategy similar to strategy 2 described above: they prioritized workload.slice by setting its io.latency higher than expected, to 50ms. This provides more protection for workload.slice than for system.slice, whose io.latency is set to 75ms. When workload.slice has been delayed by lack of IO past its 50ms threshold, it gets IO priority: the kernel limits IO from system.slice and reallocates it to workload.slice so the main workload can keep running. hostcritical.slice was given a similar level of protection as workload.slice, since any problems there can also impact the main workload; in this case it used memory.min to guarantee it will have enough memory to keep running. Though they knew system.slice needed lower IO priority, the team determined the 75ms number through trial and error, modifying it repeatedly until they achieved the right balance between protecting the main workload and ensuring the stability of system.slice. (A sketch of this configuration follows this list.) In the final installment of this case study, we'll summarize the strategies used in the fbtax2 project and look at some of the utilization gains that resulted in Facebook's server farms.
  • This is where you specify IO limits
  • O
  • accounting of all IOs per-cgroup
  • IOPS
  • system has the flexibility to limit IO to low priority workloads
7 annotations
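
A minimal sketch of the io.latency setup described above, following the fbtax2 numbers (50ms for workload.slice, 75ms for system.slice). The cgroup paths and the 8:0 device number are assumptions for illustration; in practice the same targets can also be set through systemd's resource-control options.

```python
# Sketch: prioritize workload.slice over system.slice with io.latency.
# Paths and the 8:0 (disk) MAJOR:MINOR pair are illustrative assumptions.
CGROUP_ROOT = "/sys/fs/cgroup"
DEVICE = "8:0"  # MAJOR:MINOR of the disk being protected

TARGETS_US = {
    "workload.slice": 50_000,  # protected workload: 50ms target
    "system.slice": 75_000,    # lower priority: 75ms target
}

def set_io_latency(cgroup: str, target_us: int) -> None:
    # io.latency takes lines of the form "MAJOR:MINOR target=<microseconds>"
    with open(f"{CGROUP_ROOT}/{cgroup}/io.latency", "w") as f:
        f.write(f"{DEVICE} target={target_us}\n")

def read_io_stat(cgroup: str) -> str:
    # Once io.latency is enabled, io.stat also reports depth= and avg_lat=
    with open(f"{CGROUP_ROOT}/{cgroup}/io.stat") as f:
        return f.read()

if __name__ == "__main__":
    for cgroup, target in TARGETS_US.items():
        set_io_latency(cgroup, target)
    # Watch avg_lat for the workload group over time to tune the target.
    print(read_io_stat("workload.slice"))
```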
  • CRI + containerd ShimV2 revolution
  • Container Runtime management engine
  • Sigma/Kubernetes
  • lower-layer Container Runtime
  • CRI + containerd shimv2
  • CRI is the first calling interface in Kubernetes to be made pluggable
  • decouple the complex features that were originally invasive to the main code from the core library, one by one, by splitting them into separate interfaces and plug-ins
  • how to connect containerd to the kata container
  • implementation of Shimv2 API
  • kata-Containerd-Shimv2
  • container-shim-v2 in Sandbox
  • a containerd shim
  • specify a shim for each Pod
  • containerd shim for each container
  • make KataContainers follow containerd
  • standard interface between the CRI shim and the containerd runtime
  • Containerd ShimV2
  • CRI-O
  • reuse the existing CRI shims
  • What can a CRI shim do? It can translate CRI requests into Runtime APIs
  • CRI shim
  • Dockershim
  • maintenance
  • we do not want a project like Docker to have to know what a Pod is and expose the API of a Pod
  • Containerd-centric API
  • Container Runtime Interface
  • multi-tenant
  • security
  • the kernel version run by your container is completely different from the one run by the host machine
  • each Pod now has an independent kernel
  • the more layers you build here, the worse your container performance is
  • SECCOMP
  • secure Container Runtime
  • we are concerned about security
  • each Pod like the KataContainer is a lightweight virtual machine with a complete Linux kernel
  • a compressed package of your program + data + all dependencies + all directory files
  • the Container Image
  • the Container Runtime
  • runC that helps you set up these namespaces and cgroups, and helps you chroot, building a container required by an application
  • binding operation
  • NodeName field of the Pod object
  • Pods are created, instead of containers
  • the designs of Kubernetes CRI and Containerd ShimV2
  • KataContainers
  • RuntimeClass (see the sketch after this list)
  • ShimV2
  • container runtime
  • CRI
  • design and implementation of key technical features
49 annotations
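
The RuntimeClass and shimv2 annotations above come together at the Pod spec: a Pod asks for a runtime class, the kubelet passes that choice down through CRI, and containerd launches the matching shimv2 runtime (such as Kata). Below is a minimal sketch using the Kubernetes Python client; the RuntimeClass name "kata" and the image are assumptions, and the cluster must already have such a RuntimeClass registered and a matching containerd shimv2 runtime installed on the node.

```python
# Sketch: request a sandboxed runtime per Pod via runtimeClassName.
# Assumes a RuntimeClass named "kata" exists on the cluster and maps to a
# containerd shimv2 runtime on the node (e.g. io.containerd.kata.v2).
from kubernetes import client, config

def create_kata_pod() -> None:
    config.load_kube_config()  # or load_incluster_config() inside a cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="kata-demo"),
        spec=client.V1PodSpec(
            runtime_class_name="kata",  # kubelet resolves this through CRI
            containers=[client.V1Container(name="app", image="nginx:alpine")],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

if __name__ == "__main__":
    create_kata_pod()
```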
  • Overcommitting on memory—promising more memory for processes than the total system memory—is a key technique for increasing memory utilization
  • demand exceeds the total memory available
  • outweigh the overhead of occasional OOM events
  • Load shedding is a technique to avoid overloading and crashing a system by temporarily rejecting new requests. The idea is that all loads will be better served if the system rejects a few and continues to run, instead of accepting all requests and crashing due to lack of resources. In a recent test, a team at Facebook that runs asynchronous jobs, called Async, used memory pressure as part of a load shedding strategy to reduce the frequency of OOMs. (See the sketch after this list.) The Async tier runs many short-lived jobs in parallel. Because there was previously no way of knowing how close the system was to invoking the OOM handler, Async hosts experienced excessive OOM kills. Using memory pressure as a proactive indicator of general memory health, Async servers can now estimate, before executing each job, whether the system is likely to have enough memory to run the job to completion. When memory pressure exceeds the specified threshold, the system ignores further requests until conditions stabilize. The chart (not reproduced here) shows how Async responds to changes in memory pressure: when memory.full (in orange) spikes, Async sheds jobs back to the Async dispatcher, shown by the blue async_execution_decision line. The results were significant: load shedding based on memory pressure decreased memory overflows in the Async tier and increased throughput by 25%. This enabled the Async team to replace larger servers with servers using less memory, while keeping OOMs under control.

oomd - memory pressure-based OOM: oomd is a new userspace tool similar to the kernel OOM handler, but it uses memory pressure to provide greater control over when processes start getting killed, and which processes are selected. The kernel OOM handler's main job is to protect the kernel; it's not concerned with ensuring workload progress or health. Consequently, it's less than ideal in terms of when and how it operates: it starts killing processes only after failing at multiple attempts to allocate memory, i.e., after a problem is already underway; it selects processes to kill using primitive heuristics, typically killing whichever one frees the most memory; it can fail to start at all when the system is thrashing, where memory utilization remains within normal limits but workloads don't make progress and the OOM killer never gets invoked to clean up the mess; and, lacking knowledge of a process's context or purpose, the OOM killer can even kill vital system processes, in which case the system is lost and the only solution is to reboot, losing whatever was running and taking tens of minutes to restore the host. Using memory pressure to monitor for memory shortages, oomd can deal more proactively and gracefully with increasing pressure by pausing some tasks to ride out the bump, or by performing a graceful app shutdown with a scheduled restart. In recent tests, oomd was an out-of-the-box improvement over the kernel OOM killer and is now deployed in production on a number of Facebook tiers.

Case study: oomd at Facebook: see how oomd was deployed in production at Facebook in this case study looking at Facebook's build system, one of the largest services running at Facebook.

oomd in the fbtax2 project: as discussed previously, the fbtax2 project team prioritized protection of the main workload by using memory.low to soft-guarantee memory to workload.slice, the main workload's cgroup. In this work-conserving model, processes in system.slice could use the memory when the main workload didn't need it.
There was a problem though: when a memory-intensive process in system.slice can no longer take memory due to the memory.low protection on workload.slice, the memory contention turns into IO pressure from page faults, which can compromise overall system performance. Because of limits set in system.slice's IO controller (which we'll look at in the next section of this case study), the increased IO pressure causes system.slice to be throttled. The kernel recognizes that the slowdown is caused by lack of memory, and memory.pressure rises accordingly. oomd monitors the pressure, and once it exceeds the configured threshold, kills one of the processes, most likely the memory hog in system.slice, and resolves the situation before the excess memory pressure crashes the system.
  • Load shedding
  • rejects a few and continues to run
  • oomd
  • The kernel OOM handler’s main job is to protect the kernel
  • out-of-the-box improvement over the kernel OOM killer
  • a memory-intensive process
10 annotations
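
The pressure-based load shedding described above boils down to checking PSI before accepting new work. Here is a minimal sketch, assuming cgroup2 with PSI available; the file path, the 10% threshold, and the function names are illustrative assumptions, not oomd's or the Async tier's actual implementation.

```python
# Sketch: shed new jobs while memory pressure is high, in the spirit of the
# Async-tier strategy above. Threshold and paths are illustrative only.
PRESSURE_FILE = "/sys/fs/cgroup/memory.pressure"  # or /proc/pressure/memory

FULL_AVG10_THRESHOLD = 10.0  # percent of time fully stalled over the last 10s

def full_avg10(path: str = PRESSURE_FILE) -> float:
    # memory.pressure lines look like:
    #   some avg10=0.12 avg60=0.08 avg300=0.02 total=12345
    #   full avg10=0.00 avg60=0.00 avg300=0.00 total=678
    with open(path) as f:
        for line in f:
            if line.startswith("full"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

def should_accept_job() -> bool:
    # Reject (shed) new jobs while the system spends too much time fully
    # stalled on memory; accept again once pressure subsides.
    return full_avg10() < FULL_AVG10_THRESHOLD
```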
  • "Grip points" (抓手), ecosystem, closed loop, alignment, sorting things out, iteration, owner mindset
  • Speaking well, writing well, and executing well are the three basic requirements for any professional
  • Infighting, blame-shifting, credit-grabbing, and work-poaching, all those vexing things, are rarely absent either
  • Skills in areas such as slides (PPT), communication, presentation, time management, design, and documentation
  • Alert configuration and monitoring cleanup
  • Good planning ability and a clear evolution roadmap
  • Building systems requires a global perspective
  • Some people can grow a small mandate into something much bigger
  • Thinking of things the leader has not thought of
  • Go talk to the right person directly; once they walk you through it you basically understand everything, which is much faster than reading documents or code
  • Communicating and giving feedback upward
  • Owner mindset
  • Proactively take on tasks, proactively communicate, proactively push projects forward, proactively coordinate resources, proactively report upward, proactively build influence
  • Take things on proactively and communicate feedback promptly
  • Step out of your comfort zone on your own initiative; the times you feel struggle and pressure are often the darkness before dawn, and that is when you grow fastest
  • Force yourself out of your comfort zone
  • Keep learning actively so that your technical ability and knowledge stay proportional to your years of experience; then what is there to be anxious about at 35?
  • Architecture should stay ahead of the business
  • How should engineers cultivate product thinking and help steer the product's direction?
  • System building: core system capabilities, system boundaries, system bottlenecks, layered service decomposition, service governance
  • At the code level there is even more you can do: resource pooling, object reuse, lock-free design, splitting large keys, deferred processing, encoding and compression, GC tuning, and all kinds of language-specific high-performance practices
  • At the architecture level you can apply caching, pre-processing, read/write splitting, asynchrony, parallelism, and so on
  • The journey from technique (术) to principle (道)
  • Knowledge that remains a few scattered points and never forms a system is not only easy to forget; it also narrows your perspective and limits how you see problems
24 annotations
  • Allocations can't be over-committed
  • Non-root cgroups can distribute domain resources to their children only when they don't have any processes of their own
  • Only one process can be migrated on a single write(2) call
  • use cases where multiple cgroups write to a single inode simultaneously are not supported well
  • cgroup writeback is implemented on ext2, ext4, btrfs, f2fs, and xfs
  • per-cgroup dirty memory states
  • dirty memory ratio
  • how much the workload is being impacted due to lack of memory
  • memory.pressure
  • memory.stat
  • memory.events
  • Memory usage hard limit
  • Memory usage throttle limit
  • Best-effort memory protection
  • Protections can be hard guarantees or best effort soft boundaries
  • Memory is stateful and implements both limit and protection models
  • cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner (see the sketch after this list)
  • "min" and "max"
  • "weight"
  • Limits can be over-committed
  • [0, max] and defaults to "max"
  • [1, 10000] with the default at 100
  • absolute resource guarantee
  • weight based resource distribution
  • The root cgroup should be exempt from resource control and thus shouldn't have resource control interface files
  • Consider cgroup namespaces as delegation boundaries
  • namespace root
  • all non-root "cgroup.subtree_control" files can only contain controllers which are enabled in the parent's "cgroup.subtree_control" file.
  • not subject to the no internal process constraint
  • threaded domain or thread root
  • The io controller, in conjunction with the memory controller, implements control of page cache writeback IOs
  • CPU
  • Memory
  • IO
  • PID
  • Cpuset
  • Device
  • RDMA
  • HugeTLB
  • Misc
  • A read-only flat-keyed file
  • allows limiting the HugeTLB usage per control group
  • controller limit during page fault
  • anon_thp
44 annotations
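
To tie the interface fragments above together, here is a minimal sketch of building a small cgroup v2 subtree: enabling controllers in the parent's cgroup.subtree_control, setting a CPU weight, a best-effort memory protection and a hard memory limit, and migrating a process. It assumes a cgroup2 mount at /sys/fs/cgroup and sufficient delegation rights; the names and values are illustrative.

```python
# Sketch: a tiny cgroup v2 subtree under /sys/fs/cgroup/demo, following the
# rules quoted above: controllers must be enabled in the parent's
# cgroup.subtree_control, and processes live only in leaf cgroups because of
# the no-internal-process constraint.
import os

ROOT = "/sys/fs/cgroup"
PARENT = f"{ROOT}/demo"
CHILD = f"{PARENT}/workload"

def write(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

os.makedirs(CHILD, exist_ok=True)

# Enable cpu, memory, and io for the children of the root and of "demo".
write(f"{ROOT}/cgroup.subtree_control", "+cpu +memory +io")
write(f"{PARENT}/cgroup.subtree_control", "+cpu +memory +io")

# Weight-based CPU distribution: range [1, 10000], default 100.
write(f"{CHILD}/cpu.weight", "200")

# Best-effort protection plus a hard limit (protection vs. limit models).
write(f"{CHILD}/memory.low", str(512 * 1024 * 1024))       # 512 MiB soft protection
write(f"{CHILD}/memory.max", str(2 * 1024 * 1024 * 1024))  # 2 GiB hard limit

# Migrate the current process into the leaf; one PID per write(2) call.
write(f"{CHILD}/cgroup.procs", str(os.getpid()))
```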