How to Build a (Cheaper) 100 Gbit Continuous Packet Recorder using Commodity Hardware


Those who follow this blog have probably read the posts where we described how to build a 100 Gbit continuous packet recorder using n2disk and PF_RING, providing specs for the recommended hardware and sample configurations (if you missed them, read part 1, part 2 and part 3). In those posts we recommended FPGA-based adapters (e.g. Napatech) with support for PCAP chunk mode, i.e. the ability of the NIC to assemble packets inside the adapter in pcap format, without the host having to read them packet-by-packet as with most network adapters, in addition to other nice features such as hardware timestamping and link aggregation. Capturing traffic in chunk mode improves bus utilisation and reduces the number of CPU cycles (and PCIe transactions) required to copy and process the data. This allows n2disk, our continuous traffic recorder, to process up to 50 Gbps per instance/stream, so that a full 100 Gbps can be handled by load-balancing the traffic to two instances/streams (in the same box).

As you probably know, we have pioneered packet capture on commodity hardware for many years, since the introduction of PF_RING, doing our best to get the most out of ASIC adapters (e.g. Intel) and close the gap with specialised and costly capture adapters. For this reason, in the past months we have been working hard to give our community a cheaper alternative for building a 100 Gbps recorder, using ASIC adapters in place of FPGAs.

For this project we selected an NVIDIA (formerly Mellanox) ASIC adapter from the ConnectX family, which is quite fast and provides nice features (including hardware timestamping and flexible packet filtering) at a price range similar to the Intel E810. This adapter, just like any other ASIC adapter on the market, does not support chunk mode: it uses one transaction per packet, delivering lower per-stream performance with respect to FPGA adapters. This means that in order to reach 100 Gbps we need to scale across many more (RSS) streams.
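As a side note, a quick way to check that the adapter is detected and to find the Linux interface name used in the configuration steps below is to look it up on the PCIe bus (the PCI address shown here is just an example, replace it with the one reported on your system):

lspci | grep -i mellanox
ls /sys/bus/pci/devices/0000:17:00.0/net/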

Since many RSS streams are needed, we have changed the internal n2disk architecture so that a single process can handle multiple interfaces (and RSS streams) by means of multiple capture threads. In order to keep the internal architecture simple, a configuration with multiple capture threads also requires multiple dump directories (and timelines), one per RSS stream. The extraction tool (npcapextract) can be seamlessly used to extract and aggregate traffic from all the timelines at extraction time, using the hardware timestamps with nanosecond resolution to merge packets in the proper order (see the sample extraction below).
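For instance, once the recorder is running with the timelines configured as in the sample configuration later in this post, a one-minute extraction across all streams could look like the sketch below. The time range, output file and BPF filter are just placeholders, and we are assuming here that your npcapextract version accepts a comma-separated list of timelines with -t (please check npcapextract -h on your installation):

npcapextract -t /storage1/n2disk/timeline,/storage2/n2disk/timeline \
 -b "2024-01-01 10:00:00" -e "2024-01-01 10:01:00" \
 -o /tmp/extraction.pcap -f "host 192.168.1.1"

List all eight timeline directories, comma-separated, to merge traffic from every stream into a single pcap.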

The hardware required for building such a system consists of a CPU with 16+ cores at about 3 GHz (in our tests we used a Xeon Gold 6526Y), with an optimal memory configuration (i.e. all memory channels populated) as already discussed in the previous posts.
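To double-check the memory configuration, one simple option is to inspect the DMI tables and verify that every memory channel has a DIMM installed (root privileges are required, and the output labels vary by vendor; slots reporting "No Module Installed" are empty):

dmidecode -t memory | grep -E "Size|Locator"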

Sample command for configuring 8 RSS queues on a ConnectX interface (use the Linux interface name of the adapter):

ethtool -L ens0f0 combined 8
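
You can then verify the change (and the maximum number of queues supported by the adapter) with the lowercase -l option; ens0f0 is just the example interface name used above:

ethtool -l ens0f0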

Sample n2disk configuration file for capturing from 8 RSS queues (mlx_5@[0-7]):

--interface=mlx:mlx_5@[0-7]
--dump-directory=/storage1/n2disk/pcap
--dump-directory=/storage2/n2disk/pcap
--dump-directory=/storage3/n2disk/pcap
--dump-directory=/storage4/n2disk/pcap
--dump-directory=/storage5/n2disk/pcap
--dump-directory=/storage6/n2disk/pcap
--dump-directory=/storage7/n2disk/pcap
--dump-directory=/storage8/n2disk/pcap
--timeline-dir=/storage1/n2disk/timeline
--timeline-dir=/storage2/n2disk/timeline
--timeline-dir=/storage3/n2disk/timeline
--timeline-dir=/storage4/n2disk/timeline
--timeline-dir=/storage5/n2disk/timeline
--timeline-dir=/storage6/n2disk/timeline
--timeline-dir=/storage7/n2disk/timeline
--timeline-dir=/storage8/n2disk/timeline
--disk-limit=80%
--max-file-len=1024
--buffer-len=8192
--index
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--writer-cpu-affinity=0
--reader-cpu-affinity=1
--reader-cpu-affinity=2
--reader-cpu-affinity=3
--reader-cpu-affinity=4
--reader-cpu-affinity=5
--reader-cpu-affinity=6
--reader-cpu-affinity=7
--reader-cpu-affinity=8
--indexer-cpu-affinity=9,10
--indexer-cpu-affinity=11,12
--indexer-cpu-affinity=13,14
--indexer-cpu-affinity=15,0
--indexer-cpu-affinity=25,26
--indexer-cpu-affinity=27,28
--indexer-cpu-affinity=29,30
--indexer-cpu-affinity=31,16
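
To start the recorder with this configuration, assuming the stock ntop packages where configuration files live under /etc/n2disk/ and instances are managed through the n2disk@ systemd service template (paths and unit names may differ on your installation), save the options above as /etc/n2disk/n2disk-mlx.conf and then run:

systemctl enable n2disk@mlx
systemctl start n2disk@mlx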

In our tests, this configuration was able to capture, index and dump traffic to disk with no packet loss at more than 100 Mpps (with an average packet size of 100 bytes), i.e. at full 100 Gbps. It is worth mentioning that this is still lower than what an FPGA with chunk mode can deliver, i.e. 148.8 Mpps (the theoretical maximum at 100 Gbps with minimum-size packets: each 60-byte packet occupies 84 bytes on the wire once FCS, preamble and inter-frame gap are added, and 100 Gbps divided by 84 bytes is roughly 148.8 Mpps), but for many people it is probably enough to cope with real traffic, where the average packet size is usually much higher than 100 bytes. Also note that the configuration looks a bit more complicated than the one used with FPGA adapters, as it requires more threads, and setting the affinity for all of them is necessary to make sure n2disk takes full advantage of all CPU cores and delivers the best performance. In essence, with an FPGA adapter you have to take into account the cost of the network adapter, but you can save money on the rest of the system (as you can use a cheaper CPU) and you have the guarantee that no packet is lost, as FPGA adapters are more efficient and have gigabytes of on-board memory. This said, you now have a cheaper option for your high-speed packet-to-disk activities.

Enjoy!