
Traffic filtering is a common technique used to protect networks and hosts from malicious activity. Filtering can be applied to incoming and outgoing flows, better known as ingress and egress filtering. There are many use cases for ingress filtering: mitigating DDoS attacks, blocking spam, blocking access to services from specific geographic regions and so on.

One common application of egress filtering is to prevent applications and users from reaching remote hosts. These kinds of filters are usually based on a large deny-list of IP address ranges, such as Reputation Block Lists. The technology used to perform this task has to handle a high number of restricted hosts, up to millions in some cases.

Our friends at SAP asked us to perform a benchmark of the common Linux technologies available for this task. This blog post presents the methodology and results from benchmarking some of the Linux filtering technologies: eBPF, IP sets and iptables.

Update 2020-09-14: we have received some great feedback and suggestions about other options to try, including BPF JIT optimization (that we left inadvertently disabled) and inline BPF map lookup. We will take a look at this and update this post as soon as possible - expected at the end of the month.

Goals

We had the following goals going into this study:

  • Provide a reproducible benchmark framework that anyone else can download and use.
  • Evaluate the different mechanisms in the Linux networking stack to filter out a large number of IP addresses and assess their scalability as the number of IP ranges to block reaches 1 million.

Scenario

We aim to understand the cost of performing egress filtering based on the destination IP in a large set of IP ranges.

The scenario we consider is composed of a client and a server computer that communicate through an IP network. The egress filtering is performed on the client machine and there is no filtering performed on the server side.

Metrics

An egress filter has to perform a matching operation to decide whether a packet can continue on its way or has to be dropped. This operation can be costly when the number of blocked IPs is high: such a filter impacts the throughput of IP connections, consumes extra CPU and increases latency. We decided to measure the following metrics:

  • Throughput
  • CPU usage
  • Latency

Linux Filtering Mechanisms

The Linux kernel provides different mechanisms to filter network packets. In this benchmark we considered the best-known ones. iptables is the historical, and probably best-known, filtering utility in Linux. IP sets support high-speed matching for sets of specific types, such as IP ranges. eBPF is very flexible and customizable. The following sections provide more details on how we used those mechanisms in our benchmark.

eBPF

eBPF is a virtual machine built into the Linux kernel. In the networking scope, it allows the user to load programs that are executed each time a packet arrives at or is sent out of a networking interface. Those eBPF programs can modify, redirect or drop the packets.

There are different types of networking eBPF programs: CGROUP_SKB for per-cgroup filtering, SCHED_CLS and SCHED_ACT for filtering at the traffic control level, and XDP for filtering at the network interface level.

In our scenario, we want to apply the same filtering regardless of the cgroup of the application generating the traffic, and traffic coming from non-local applications or forwarded traffic should be filtered as well. This makes filtering at the socket level unfit for our scenario. XDP programs can only be attached to the ingress path1, removing them from consideration as well. By process of elimination, we only consider eBPF programs attached at the traffic control layer.

Given that we want to keep the tests as simple as possible, we decided to implement our own filtering program in eBPF and attach it to the traffic control layer with a clsact qdisc. The clsact qdisc, introduced in Linux 4.5, is a pseudo qdisc that allows attaching eBPF programs on ingress and egress using the direct-action mode.
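For reference, attaching an eBPF classifier on the egress path with a clsact qdisc looks roughly like the following (a sketch: filter.o, the section name and eth0 are placeholder names, not taken from our benchmark code):

```shell
# Create the clsact pseudo qdisc (Linux >= 4.5)
tc qdisc add dev eth0 clsact

# Attach a compiled eBPF object on the egress path in direct-action mode
tc filter add dev eth0 egress bpf direct-action obj filter.o sec classifier
```

In direct-action mode the eBPF program itself returns the verdict (for example TC_ACT_SHOT to drop the packet), without needing a separate tc action.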

Our filter implementation uses an LPM (longest prefix match) map to store the list of IP addresses to block. eBPF implements the LPM algorithm with a digital tree (trie).
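As an illustration of the matching semantics, here is a sketch in Python using the standard ipaddress module. It performs a longest-prefix match over a list of blocked ranges with a linear scan; the kernel's LPM map does the same lookup by walking a trie instead:

```python
import ipaddress

# Blocked ranges, as they would be stored in the LPM map (example values).
blocked = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "10.1.0.0/16", "192.168.0.0/24")]

def longest_prefix_match(ip):
    """Return the most specific blocked network containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in blocked if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)

print(longest_prefix_match("10.1.2.3"))  # 10.1.0.0/16 wins over 10.0.0.0/8
print(longest_prefix_match("8.8.8.8"))   # None: packet is not dropped
```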

iptables

We tested the standard iptables implementation by adding DROP rules to the OUTPUT chain in the filter table. The rules we used have the following format:

-A OUTPUT -d 192.168.0.0/16 -o eth0 -m comment --comment "benchmark" -j DROP

While implementing this filter we found that appending each rule independently was too slow, so we used the iptables-restore utility instead.
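iptables-restore loads all rules in a single commit; its input looks roughly like this (a sketch, with placeholder addresses):

```
*filter
-A OUTPUT -d 192.168.0.0/16 -o eth0 -m comment --comment "benchmark" -j DROP
-A OUTPUT -d 10.0.0.0/8 -o eth0 -m comment --comment "benchmark" -j DROP
COMMIT
```

It can be applied with iptables-restore --noflush to keep any existing rules in place.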

iptables uses a linear search algorithm for rule matching.

IP Sets

We also performed the testing with the IP sets framework. We used an IP set of type hash:net and linked it to iptables with the following rule:

-A OUTPUT -o eth0 -m set --match-set myset dst -m comment --comment benchmark -j DROP

The IP address to check is hashed and the result is used as an index into a hash table. Each bucket of the hash table contains an array of networks for that hash. When an IP address is checked against an IP set of type hash:net, it is not known in advance which network size will match, so the IP set implementation hashes the IP address for all possible network sizes. For instance, if the IP to match is 1.2.3.4, it looks for 1.2.3.4/32, then 1.2.3.4/31, and so on until all 32 possible network sizes have been tested.
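The probing described above can be sketched in Python: mask the address at each possible prefix length and look it up, most specific first (a simplification; the real implementation hashes into buckets holding arrays of networks):

```python
import ipaddress

# Simplified hash:net contents: (masked network address, prefix length).
ipset = {("1.2.3.0", 24), ("10.0.0.0", 8)}

def ipset_test(ip):
    """Check the IP against every possible network size, /32 down to /1."""
    addr = ipaddress.ip_address(ip)
    for prefixlen in range(32, 0, -1):
        net = ipaddress.ip_network(f"{addr}/{prefixlen}", strict=False)
        if (str(net.network_address), prefixlen) in ipset:
            return True
    return False

print(ipset_test("1.2.3.4"))  # True: matches 1.2.3.0/24
print(ipset_test("8.8.8.8"))  # False
```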

Benchmark Set-up

We used iperf3 to measure the performance of TCP, UDP and the CPU usage. The standard ping utility was used to measure the latency. We tested with 10, 100, 1k, 10k, 100k and 1M rules and each test was run 5 times to get statistical robustness.
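The measurement commands were along these lines (a sketch; $SERVER stands for the server's address, and the exact invocations in the benchmark repository may differ):

```shell
iperf3 -c $SERVER                 # TCP throughput
iperf3 -c $SERVER -u -b 10G       # UDP throughput with a 10Gbps target
ping -i 0.001 -c 1000 $SERVER     # latency: 1000 pings at 1ms interval
```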

We used two bare metal servers - running on Packet - as the client and server machines. The specifications of those servers are:

c2.medium.x86
1x AMD EPYC 7401P 24-Core Processor @ 2.0GHz
2x 120GB SSD
2x 480GB SSD
64GB RAM
2x 10Gbps

The exact versions of the tools we used are:

  • Flatcar Container Linux by Kinvolk alpha (2605.0.0)
  • Linux kernel 5.4.59
  • iperf 3.0.7 (in a Docker container with the host network)
  • iptables v1.6.2
  • ipset v6.20.1, protocol version: 6

Reproducibility

https://github.com/kinvolk/egress-filtering-benchmark has all the tools and instructions to reproduce these tests.

Results

Test #1 - TCP Throughput

The throughput graph shows that the test reaches line rate, 10Gbps, in most cases; since we are using a single connection, only one of the bonded interfaces is used. Only iptables with more than 10k rules is unable to sustain that rate. From this test we can conclude that iptables doesn't scale well beyond 100k rules, but unfortunately we cannot conclude anything about the other filters, as network performance is the bottleneck.

In order to stress the CPU more, we performed another test using UDP.

Test #2 - UDP Throughput

The goal of this test is to saturate the CPU. We used UDP packets and the -b 10G parameter to try to reach line rate. The graph shows that none of the cases reaches the 10Gbps line rate, which is good for this test because it means the network is no longer the bottleneck.

We can see that iptables does not scale well beyond 1k rules and that IP sets are a bit faster than eBPF on the clsact qdisc with a high number of rules.

Test #3 - CPU usage

We used iperf with a target bandwidth of 1Gbps (-b 1G) to avoid saturating the CPU and be able to compare the CPU usage of the different filters. We take the CPU usage reported by iperf, which includes all the applications running on the host. From the above graph we can see that the CPU usage with iptables increases when using more than 1k rules, while the CPU usage with the eBPF and IP sets filters stays almost constant, with slightly higher usage in the eBPF case.

Test #4 - Latency

We used the standard Linux ping utility to measure the latency with the -i 0.001 and -c 1000 parameters, sending 1000 pings at a 1 millisecond interval. The above graph shows that the latency is not affected by the number of rules for the eBPF and IP sets filters: it stays at ~27µs in both cases. For iptables, on the other hand, it increases linearly with the number of rules.

Test #5 - Setup Time

This test measures the time it takes to install the filtering rules on the system. It doesn't take into consideration fixed costs like loading the eBPF program or creating the IP set, but instead focuses on the time it takes to update the rules themselves and how that changes with the number of rules.

The above graph shows that the time required to set up the rules in all the filters increases linearly with the number of rules.

With 100,000 IPs, the setup time is still less than one second; for 1 million IPs, it is less than 10 seconds. This shows that in practice all 3 methods have an acceptable setup time.

The setup time is very dependent on the benchmark implementation rather than being inherent to the filtering method. Each filtering method could be optimised in the benchmark code if it were deemed necessary.

  • For the bpf filter: the benchmark performs one bpf() system call for every IP. From Linux 5.6, it is possible to perform several updates in an eBPF map with a single bpf() call.
  • For the iptables and IP sets filters, the benchmark executes external commands and feeds each IP range individually over a pipe. We could avoid that by communicating the list to the kernel using the Netlink protocol directly. There are some attempts to implement IP Sets support in Golang with the vishvananda/netlink library.

For this reason, the reader should be cautious before comparing the setup time of one filter to another. The results should not be interpreted as one method being better than another, but rather as an indication of expected setup performance, showing that setup time is unlikely to be a problem for any filtering method.

Test #6 - Throughput with too Many Rules

The throughput in tests #1 and #2 is almost the same regardless of the number of rules for the eBPF and IP Sets filters. We wanted to know whether this behaviour changes with a higher number of rules: 10 and 100 million.

We found that it’s not possible to run a test with 100M entries in the eBPF case, as this would require more than 4GB of memory for the LPM map, which is not allowed by the Linux kernel2. We decided to test with 10M and 50M rules instead.

From the above chart we can see that there is not a big difference in throughput, and the same trend holds even with 10M and 50M rules.

Profiling the CPU usage

If you want to understand more about each implementation and read the relevant source code, a good way to get started is to profile the CPU usage to find out which functions take most of the CPU time.

In the UDP tests, the CPU is saturated and is not able to reach the 10Gbps line rate. We can use the BCC profile tool to get the list of functions that are executed in the tests for each filtering method.

We have to run the tool with the -K parameter to get the traces from the kernel.

sudo /usr/share/bcc/tools/profile -K

Iptables

For the iptables case, it prints the following trace:

ipt_do_table
ipt_do_table
nf_hook_slow
__ip_local_out
ip_local_out
__ip_queue_xmit
__tcp_transmit_skb

The order of calls is from bottom to top: we can see how the nf_hook_slow function calls ipt_do_table, which iterates over the rules. It is a simple “for” loop over each rule, which explains why performance decreases linearly with the number of rules.

eBPF

We got the following trace for the eBPF case:

longest_prefix_match.isra.0
longest_prefix_match.isra.0
trie_lookup_elem
___bpf_prog_run
__bpf_prog_run32
cleanup_module
tcf_classify
__dev_queue_xmit
ip_finish_output2
ip_do_fragment
ip_output

In this case the first important call is to tcf_classify; then the eBPF program is run in ___bpf_prog_run, and the lookup in the LPM map is performed by trie_lookup_elem and longest_prefix_match, which walks the trie. This explains the better performance, since walking the trie takes at most 32 steps for IPv4 addresses.

IP Sets

The traces for IP sets are the following:

hash_net4_test
hash_net4_test
hash_net4_kadt
ip_set_test
set_match_v4
ipt_do_table
iptable_filter_hook
nf_hook_slow
__ip_local_out
ip_local_out
ip_send_sk
udp_send_skb.isra.0
udp_sendmsg

In the IP sets case we can see that the first part of the stack is the same as for iptables: nf_hook_slow -> iptable_filter_hook -> ipt_do_table. Then set_match_v4 is called, followed by the IP-set-specific logic: ip_set_test, hash_net4_kadt and hash_net4_test, which looks up all possible network sizes for a given IP.

Conclusion

The throughput, CPU usage and latency tests show that the IP sets and eBPF filters scale well even with 1 million IPs. On the other hand, iptables doesn’t scale that well, and even with a low number of rules like 1k or 10k it shows a considerable difference from the other filters. This is expected, as iptables performs a linear search over the rule list.

IP sets slightly outperform eBPF with an LPM map on a clsact qdisc: they achieve somewhat higher throughput in the UDP test and use less CPU.

Update 2020-09-14: we only ran the benchmark with the eBPF interpreter instead of generating efficient JIT code, so we cannot compare the performance between IP sets and eBPF in the general case at this point.

eBPF cannot handle more than ~89M entries due to restrictions in the Linux kernel.

We suggest the decision of what tool to use shouldn’t be strictly based on the results of this benchmark but should take into consideration other aspects like whether the technology is already used on the hosts, whether there are other needed features (now or in the future) that could be provided by some of the technologies, the way traffic statistics can be exposed to a monitoring tool, etc.

Keep in mind that this benchmark only tests a specific use case: filtering a large deny-list of IP address ranges on egress. A more realistic cluster would have other filters, including on ingress.

Notes


  1. There are some proposals to add support for the egress path but they aren’t merged at the time of writing. ↩︎

  2. Each element of the LPM map requires 48 bytes in our case (40 from sizeof(struct lpm_trie_node) + 4 from attr->value_size + 4 from trie->data_size). That means the maximum number of possible elements in the map is 2^32 / 48 ≈ 89M. ↩︎
