Notes on BPF & eBPF
Today it was Papers We Love, my favorite meetup! Today Suchakra Sharma (@tuxology on twitter/github)
gave a GREAT talk about the original BPF paper and recent work in Linux on
eBPF. It really made me want to go write eBPF programs!
I wanted to write some notes on the talk here because I thought it was super
To start, here are the
slides and a
pdf. The pdf is good
because there are links at the end and in the PDF you can click the links.
Before BPF, if you wanted to do packet filtering you had to copy all the
packets into userspace and then filter them there (with “tap”).
this had 2 problems:
if you filter in userspace, it means you have to copy all the packets into userspace, copying data is expensive
the filtering algorithms people were using were inefficient
The solution to problem #1 seems sort of obvious, move the filtering logic into
the kernel somehow. Okay. (though the details of how that’s done isn’t obvious,
we’ll talk about that in a second)
But why were the filtering algorithms inefficient! Well!!
If you run tcpdump host foo it actually runs a relatively complicated query,
which you could represent with this tree:
Evaluating this tree is kind of expensive. so the first insight is that you can
actually represent this tree in a simpler way, like this:
Then if you have ether.type = IP and ip.src = foo you automatically know
that the packet matches host foo, you don’t need to check anything else. So
this data structure (they call it a “control flow graph” or “CFG”) is a way
better representation of the program you actually want to execute to check
matches than the tree we started with.
How BPF works in the kernel
The main important here is that packets are just arrays of bytes. BPF programs
run on these arrays of bytes. They’re not allowed to have loops but they can
have smart stuff to figure out the length of the IP header (IPv6 & IPv4 are
different lengths!) and then find the TCP port based on that length
x = ip_header_length
port = *(packet_start + x + port_offset)
(it looks different from that but it’s basically the same). There’s a nice
description of the virtual machine in the paper/slides so I won’t explain it.
When you run tcpdump host foo this is what happens, as far as I understand
convert host foo into an efficient DAG of the rules
convert that DAG into a BPF program (in BPF bytecode) for the BPF virtual machine
Send the BPF bytecode to the Linux kernel, which verifies it
when packets come in, Linux runs the native code to decide if that packet should be filtered or not. It’l often run only 100-200 CPU instructions for each packet that needs to be processed, which is super fast!
the present: eBPF
But BPF has been around for a long time! Now we live in the EXCITING FUTURE
which is eBPF. I’d heard about eBPF a bunch before but I felt like this helped
me put the pieces together a little better. (i wrote this XDP & eBPF post back in April when I was at netdev)
some facts about eBPF:
eBPF programs have their own bytecode language, and are compiled from that
bytecode language into native code in the kernel, just like BPF programs
eBPF programs run in the kernel
eBPF programs can’t access arbitrary kernel memory. Instead the kernel
provides functions to get at some restricted subset of things.
they can communicate with userspace programs through BPF maps
there’s a bpf syscall as of Linux 3.18
kprobes & eBPF
You can pick a function (any function!) in the Linux kernel and execute a
program that you write every time that function happens. This seems really
amazing and magical.
For example! There’s this BPF program called disksnoop which tracks when you start/finish writing a block to disk.
Here’s a snippet from the code:
This basically declares a BPF hash (which the program uses to keep track of
when the request starts / finishes), a
function called trace_start which is going to be compiled into BPF bytecode,
and attaches trace_start to the blk_start_request kernel function.
This is all using the bcc framework which lets you write Python-ish programs
that generate BPF code. You can find it (it has tons of example programs) at