Traffic filtering is a common technique used to protect networks and hosts from malicious activity. The filtering can be done in the incoming and outgoing flows, better known as ingress and egress filtering. There are many examples of use cases for ingress filtering: mitigating DDoS attacks, avoiding SPAM, blocking access to services for specific geographic regions and so on.
One common application of egress filtering is to prevent applications and users from reaching remote hosts. These kinds of filters are usually based on a large deny-list of IP address ranges, like Reputation Block Lists . The technology used to perform this task has to be able to handle a high number of restricted hosts, up to millions in some cases.
Our friends at SAP asked us to perform a benchmark of the common Linux technologies available for this task. This blog post presents the methodology and results from benchmarking some of the Linux filtering technologies: eBPF, IP sets and iptables.
Update 2020-09-14: we have received some great feedback and suggestions about other options to try, including eBPF JIT optimization (that we left inadvertently disabled) and inline eBPF map lookup. We will take a look at this and update this post as soon as possible.
Update 2020-12-17: acting on the valuable feedback received, the latest test results by the benchmark are done with the eBPF JIT compiler enabled. The results showed a marked improvement by enabling eBPF JIT compiler. Latest versions of Flatcar Container Linux Alpha channel now come with eBPF JIT compiler enabled by default.
We had the following goals going into this study:
- Provide a reproducible benchmark framework that anyone else can download and use.
- Evaluate the different mechanisms in the Linux networking stack to filter out a large amount of IP addresses and assess their scalability when the amount of IP ranges to block reaches 1 million.
We aim to understand the cost of performing egress filtering based on the destination IP in a large set of IP ranges.
The scenario we consider is composed of a client and a server computer that communicate through an IP network. The egress filtering is performed on the client machine and there is no filtering performed on the server side.
An egress filter has to perform a matching operation to understand if a packet can continue its way or has to be dropped. This operation could be costly if the number of blocked IPs is high. This kind of filter impacts the throughput of IP connections, it consumes extra CPU and increases the latency. We decided to take into consideration these metrics under the following conditions:
- CPU usage
Linux Filtering Mechanisms
The Linux kernel provides different mechanisms to filter network packets. In this benchmark we considered the most known ones. iptables is the historical and probably more known filtering utility in Linux. IP sets support high speed matching for sets of specific types, like IP ranges for instance. eBPF is very flexible and customizable. The following section provides more details of how we used those mechanisms in our benchmark.
eBPF is a virtual machine built in the Linux kernel. In the networking scope, it allows the user to load programs that are executed each time a packet arrives or is sent out of a networking interface. Those eBPF programs can modify, redirect or drop the packets.
There are different types of networking eBPF programs, CGROUP_SKB for per-cgroup filtering, SCHED_CLS and SCHED_ACT for filtering at the traffic control level, and XDP for filtering at the network interface level.
In our scenario, we want to apply the same filtering regardless of the cgroup of the application generating the traffic. The traffic coming from non-local applications or forwarded traffic should be filtered as well. This makes filtering at the socket level unfit for our scenario. XDP programs can only be attached to the ingress1 path, removing it from consideration in our tests also. By process of elimination, we only consider eBPF programs attached to the traffic control layer.
Given that we want to keep the tests as simple as possible we decided to implement our own filtering program in eBPF that will be attached to the traffic control layer with a clsact qdisc. The clsact qdisc was introduced in Linux 4.5, it’s a pseudo qdisc that allows to attach eBPF programs in ingress and egress using the direct-action mode.
We tested the standard iptables implementation by adding DROP rules to the OUTPUT chain in the filter table. The rules we used have the following format
-A OUTPUT -d 192.168.0.0/16 -o eth0 -m comment --comment "benchmark" -j DROP
While implementing this filter we found that appending each rule independently was too slow and we had to use
iptables-restore utility to do it.
iptables uses a linear search algorithm for rule matching.
-A OUTPUT -o eth0 -m set --match-set myset dst -m comment --comment benchmark -j DROP.
The IP address to check is hashed and the result of the hash is used as the index in a hash table. Each bucket of the hash table contains an array of networks for that hash. When an IP address is checked against an IP set of type “hash:net”, it is not known what network size will match. The IP set implementation hashes the IP address for all possible network sizes. For instance, if the IP to match is 220.127.116.11, it looks for 18.104.22.168/32, for 22.214.171.124/31 and so on until all the 32 possible network sizes have been tested.
iperf3 to measure the performance of TCP, UDP and the CPU usage. The standard
ping utility was
used to measure the latency. We tested with 10, 100, 1k, 10k, 100k and 1M rules and each test was run 5 times
to get statistical robustness.
We used two bare metal servers - running on Packet - as client and server machines. The specifications of such servers are:
c2.medium.x86 1x AMD EPYC 7401P 24-Core Processor @ 2.0GHz 2x 120GB SSD 2x 480GB SSD 64GB RAM 2x 10Gbps
The exact versions of the tools we used are:
- Flatcar Container Linux by Kinvolk alpha ( 2705.0.0 )
- Linux kernel 5.9.11
- iperf 3.6 (in a Docker container with the host network)
- iptables v1.6.2
- ipset v7.6, protocol version: 7
https://github.com/kinvolk/egress-filtering-benchmark has all the tools and instructions to reproduce these tests.
Test #1 - TCP Throughput
The throughput graph shows that the test is reaching line rate, 10Gbps because we are using a single connection hence only one of the interfaces on the bonding is used in most of the cases. Only iptables with more than 10k rules is not able to handle that performance. From this test we can conclude that iptables doesn’t scale well after 100k, but unfortunately we cannot conclude anything from other tests as network performance is the bottleneck.
In order to stress the CPU more, we performed another test using UDP.
Test #2 - UDP Throughput
The goal of this test is to saturate the CPU. We used UDP packets with a size of 1470 bytes and the
parameter to try to reach line rate. The graph shows that none of the cases reaches the 10Gbps line rate, it’s
good for this test because it means the network is not the bottleneck anymore.
We can see that iptables does not scale well beyond 1k rules and that IP sets give a slightly higher throughput than eBPF on the clsact qdisc for high number of rules.
Test #3 - CPU usage
iperf with a target bandwidth of 1Gbps (
-b 1G) to avoid saturating the CPU and be able to compare
the CPU usage with the different filters. We take the CPU usage reported by
iperf that includes all the
applications running on the host. From the above graph we can see that the CPU usage with iptables increases
when using more than 1k rules. The CPU usage with eBPF and IP sets filters stays just about equal and almost
constant despite the increasing number of rules.
Test #4 - Latency
We used the standard Linux ping utility to measure the latency with the
-i 0.001 and
-c 1000 parameters,
sending 1000 pings at a 1 millisecond interval. The above graph shows that the latency is not affected by the
number of rules in the eBPF and IP sets filters, it’s ~27µs in those cases. On the other hand, it increases
linearly with the number of rules for the iptables case.
Test #5 - Setup Time
This test aims to measure the time it takes to install the filtering rules on the system. It doesn’t take into consideration fixed times like loading the eBPF program, creating the IP sets but instead focuses on measuring the time it takes to update the rules themselves and how it changes with the number of rules.
The above graph shows that the time required to set up the rules in all the filters increases linearly with the number of rules.
With 100.000 IPs, the setup time is still less than one second. For 1 million IPs, the setup time is less than 10 seconds. This shows that in practice, all 3 methods have an acceptable setup time.
The setup time is very dependent on the benchmark implementation rather than being inherent to the filtering method. Each filtering method could be optimised in the benchmark code if it were deemed necessary.
For the bpf filter: the benchmark performs one bpf() system call for every IP. From Linux 5.6, it is possible to perform several updates in a eBPF map with a single bpf() call .
For the iptables and IP sets filters, the benchmark executes external commands and feeds each IP range individually over a pipe. We could avoid that by communicating the list to the kernel using the Netlink protocol directly. There are some attempts to implement IP Sets support in Golang with the vishvananda/netlink library.
For this reason, the reader should be cautious before comparing the setup time of one filter to another. This should not be interpreted as one method being better than another but rather to show expected setup performance and that setup time is unlikely to be a problem for any filtering method.
Test #6 - Throughput with too Many Rules
The throughput in the tests one and two is almost the same regardless of the number of rules for the eBPF and IP Sets filters. We wanted to know if this behaviour changed with a higher number rules, i.e., up to 100 millions.
We found that it’s not possible to run a test with 100M entries in the eBPF case as this would require more than 4GB of memory for the LPM map and it’s not allowed by the Linux kernel2. We decided to test with 10M and 50M rules.
From the above chart we can see that the difference in throughput between IP sets and eBPF is almost negligible even with 10M and 50M rules.
Profiling the CPU usage
If you want to understand more about each implementation and read the relevant source code, a good way to get started is to profile the CPU usage to find out which functions take most of the CPU time.
In the UDP tests, the CPU is saturated and is not able to reach the 10Gbps line rate. We can use the BCC profile tool to get the list of functions that are executed in the tests for each filtering method.
We have to run the tool with the
-K parameter to get the traces from the kernel.
sudo /usr/share/bcc/tools/profile -K
For the iptables case, it prints the following trace:
ipt_do_table ipt_do_table nf_hook_slow __ip_local_out ip_local_out __ip_queue_xmit __tcp_transmit_skb
The order of calls is from the bottom to the top, we can see how the nf_hook_slow function calls ipt_do_table in an iterative manner. It is a simple “for” loop that iterates over each rule, explaining the performance decreasing linearly with the amount of rules.
We got the following trace for the eBPF case:
longest_prefix_match.isra.0 longest_prefix_match.isra.0 trie_lookup_elem cleanup_module cleanup_module tcf_classify __dev_queue_xmit ip_finish_output2 ip_output
In this case the first important call is to tcf_classify and the lookup in the LPM is performed in trie_lookup_elem and longest_prefix_match . It is walking the trie . This explains the better performance, since walking over the trie takes maximum 32 steps with IPv4 addresses.
Without eBPF JIT enabled we would have expected to see
__bpf_prog_run32 just before the
trie_lookup_elem. These calls interpret the eBPF program instructions and convert into eBPF bytecode. The
non-presence of these calls shows that eBPF JIT compiler is in effect.
The traces for IP sets are the following:
hash_net4_test hash_net4_test hash_net4_kadt ip_set_test set_match_v4 ipt_do_table iptable_filter_hook nf_hook_slow __ip_local_out ip_local_out ip_send_sk udp_send_skb.isra.0 udp_sendmsg
For the IP sets case we can see how the first part of the stack is the same as iptables, nf_hook_slow -> iptable_filter_hook-> ipt_do_table. Then set_match_v4 and after it specific IP set logic is executed: ip_set_test , hash_net4_kadt and hash_net4_test , that lookups all possible network sizes for a given IP.
The throughput, CPU usage and latency tests show that IP sets and eBPF filters scale well even with 1 million IPs. On the other hand, iptables doesn’t scale that well and even with a low number of rules like 1k or 10k shows a considerable difference to the other filters. This is expected as iptables performs a linear search on the rule list.
IP sets fared marginally better than eBPF with an LPM map on a clsact qdisc, it has a slightly higher
throughput in the UDP test and uses less CPU. We suspect this is because the random CIDR’s generated in the
benchmark test were with the subnet mask
/16, this resulted in faster lookups on the hash table.
The lookup time grows linearly with the number of different prefix values added to the set. Hence we cannot
compare the performance between IP sets and eBPF in the general case at this point.
After enabling the eBPF JIT compiler, eBPF managed to bridge the performance gap versus IP sets, being more noticable in nearly equal CPU usage and throughput when the number of rules reached upto 50M.
eBPF cannot handle more than ~89M entries due to restrictions in the Linux kernel.
We suggest the decision of what tool to use shouldn’t be strictly based on the results of this benchmark but should take into consideration other aspects like whether the technology is already used on the hosts, whether there are other needed features (now or in the future) that could be provided by some of the technologies, the way traffic statistics can be exposed to a monitoring tool, etc.
Keep in mind that this benchmark only tests a specific use case: filtering a large deny-list of IP address ranges on egress. A more realistic cluster would have other filters, including on ingress.
Each element on the LPM map requires 48 bytes in our case (40 from
sizeof(struct lpm_trie_node)+ 4 from
attr->value_sizeand 4 from
trie->data_size). It means the maximum number of possible elements in the map is (2^32/48) ~89M. ↩︎