# Using (Suricata over) PF\_RING for NIC-Independent Acceleration

Luca Deri <deri@ntop.org> Alfredo Cardigliano <cardigliano@ntop.org>





#### Outlook

- About ntop.
- Introduction to PF\_RING.
- Integrating PF\_RING with Suricata.
- Using PF\_RING in real-life scenarios.





#### About ntop

- ntop develops open source network traffic monitoring applications.
- ntop (circa 1998) is the first app we released and it is a web-based network monitoring application.
- Today our products cover traffic monitoring, high-speed packet processing, deep-packet inspection, and IDS/IPS acceleration (Suricata, Snort, Bro).





#### ntop's Approach to Traffic Processing

- Ability to capture, process and (optionally) transmit traffic at <u>line rate</u>, any packet size.
- Leverage modern multi-core/NUMA architectures in order to promote <u>scalability</u>.
- Use <u>commodity hardware</u> to produce affordable, long-lived (no vendor lock-in), scalable (use new hardware as soon as it becomes available) monitoring solutions.
- Use <u>open source</u> to spread the software, and let the community test it in uncharted places.









# PF\_RING History

- PF\_RING is a home-grown open source packet processing framework for Linux.
- When it was born (2003), PF\_RING was designed to accelerate packet capture on commodity hardware without using FPGA-based network adapters.
- Later (2005/9) PF\_RING introduced DNA (Direct NIC Access) featuring kernel bypass for line rate RX/TX packet processing.
- With PF\_RING ZC (2013) we focused on efficient packet processing on multi-vendor NICs, as line rate was already guaranteed by PF\_RING.



# PF\_RING Flavours [1/2]



- Standard, in-kernel packet capture (1-copy mode, no kernel bypass): packets are received and processed inside the Linux kernel, then dispatched to user space via memory mapping. This operating mode is good for 1G interfaces.





# PF\_RING Flavours [2/2]



- PF\_RING ZC (zero-copy, kernel bypass): once the device is opened, packets are read directly by user-space applications without passing through the kernel. This is the preferred solution for traffic over 1 Gbit.



# PF\_RING API



- A single API, regardless of:
  - Interface vendor and speed.
  - ZC and non-ZC.
- If you code your application on top of the PF\_RING API, your development investment is preserved: you won't need to change a single line of code when moving from one adapter to another (e.g. prototype on a cheap on-board Ethernet port and then deploy on a 40 Gbit interface: just change the device name).



# PF\_RING ZC [1/2]



- The idea behind ZC is to create a playground for processing information (and in particular network packets) in zero-copy.
- ZC comes with <u>0-copy</u> user-space drivers (for 1 and 10G Intel NICs) that allow packets to be read in 0-copy.
- <u>1-copy</u> packets (e.g. received on non-Intel NICs or WiFi/Bluetooth devices) can be injected in ZC and from that time onwards, be used in 0-copy.





# PF\_RING ZC [2/2]



- Support of legacy pcap-based applications.
- ZC has simple yet powerful components (no complex patterns, just queue/consumer/balancer).
- KVM support: ability to set up Intra-VM clustering.
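As a conceptual illustration of the queue/consumer/balancer components listed above, the following minimal sketch distributes packets across consumer queues with a direction-independent flow hash, so both directions of a connection reach the same consumer. It is not the PF\_RING ZC API (which is a C library); the function `flow_queue`, the CRC-based hash, and the sample packets are all assumptions made for illustration.

```python
import zlib
from collections import deque

def flow_queue(src, dst, sport, dport, proto, n_queues):
    """Pick a queue index from a direction-independent hash of the 5-tuple.

    Hypothetical helper: sorting the two endpoints makes A->B and B->A
    hash to the same value, so one consumer sees the whole flow.
    """
    a, b = sorted([(src, sport), (dst, dport)])
    key = f"{a}|{b}|{proto}".encode()
    return zlib.crc32(key) % n_queues

N_QUEUES = 2
queues = [deque() for _ in range(N_QUEUES)]  # one queue per consumer

# Illustrative packets as (src, dst, sport, dport, proto) tuples.
packets = [
    ("10.0.0.1", "10.0.0.2", 40000, 80, "tcp"),
    ("10.0.0.2", "10.0.0.1", 80, 40000, "tcp"),  # reverse direction of the first
    ("10.0.0.3", "10.0.0.4", 5353, 53, "udp"),
]

# The "balancer": append each packet to the queue chosen by its flow hash.
for pkt in packets:
    queues[flow_queue(*pkt, n_queues=N_QUEUES)].append(pkt)

for i, q in enumerate(queues):
    print(f"queue {i}: {len(q)} packet(s)")
```

In a real zero-copy setup the queues would hold references to packet buffers rather than copies; the sketch only shows the distribution logic.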



- Native PF\_RING ZC support in many open-source applications such as Snort, Suricata, Bro, Wireshark.
- Ability to operate on top of sysdig.org for dispatching system events to PF\_RING applications.





# PF\_RING ZC on KVM

- With ZC, packets are captured in 0-copy from network adapters and deployed in 0-copy to VMs.
- ZC packets are deployed on the VM using virtual adapters dynamically attached to the VM through PCI hot-plug.
- When an application running inside the VM wants to open a ZC queue, via this virtual adapter, the application is attached in 0-copy with the packet producer.



# PF\_RING ZC and OpenStack [1/2]

- Most companies operating in OpenStack focus on Open vSwitch acceleration.
- At ntop we have instead prioritised the ability to:
  - Deliver packets in 0-copy at line rate to VMs.
  - Create arbitrary packet processing topologies in VMs, processes and threads.
- The idea is to create a framework that is able to run applications on a VM at the same speed as on the physical host.





# PF\_RING ZC and OpenStack [2/2]



(Host) \$ ./zpipeline\_ipc -i zc:eth2,0 -o zc:eth3,1 -n 2 -c 99 -r 1 -t 2 -Q /tmp/qmp0

(VM) \$ ./zbounce\_ipc -c 99 -i 0 -o 1 -g 3



# Multi-Vendor Support in PF\_RING [1/2]

- The PF\_RING user-space library is logically divided into modules, to support multi-vendor NICs.
- PF\_RING-based applications transparently select the correct module based on the interface name. Example:
  - pfcount -i eth1 [Vanilla Linux adapter]
  - pfcount -i zc:eth1 [Intel ZC drivers]
  - pfcount -i anic:1 [Accolade Technology]
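The prefix-based selection described above can be pictured with a short sketch. This is not the PF\_RING library code (which is written in C); the table `KNOWN_MODULES`, the function `select_module`, and the device names are assumptions used only to show how an interface name such as `zc:eth1` could map to a module.

```python
# Hypothetical prefix table: only the two prefixes shown on the slide.
KNOWN_MODULES = {
    "zc": "Intel ZC drivers",
    "anic": "Accolade Technology",
}

def select_module(device: str):
    """Split 'prefix:name' into (module, device name).

    A name without a known prefix falls back to the vanilla Linux adapter.
    """
    prefix, sep, rest = device.partition(":")
    if sep and prefix in KNOWN_MODULES:
        return KNOWN_MODULES[prefix], rest
    return "Vanilla Linux adapter", device

for dev in ("eth1", "zc:eth1", "anic:1"):
    module, name = select_module(dev)
    print(f"{dev} -> {module} (device {name})")
```

The application code stays identical in all three cases; only the string passed as the device name changes.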





# Multi-Vendor Support in PF\_RING [2/2]

 Currently PF\_RING natively supports the following vendors (1/10/40/100 Gbit)








## What Vendor Shall I Choose?

- ntop is an <u>independent</u> software company that is <u>not endorsing</u> any vendor.
- Users have to decide based on their requirements, budget, and preferences.
- Most modern adapters can now support 10G line-rate RX/TX and hardware packet time-stamping, so you should choose based on other features such as in-hardware port aggregation, packet filtering/balancing, memory buffers on the NIC (to absorb traffic spikes), pattern matching...



# ASIC vs FPGA-based NICs [1/3]

- Most commodity NICs use an ASIC chip to implement networking: no programmability, simple RX/TX operations, cheap.
- As commodity NICs are used for basic networking, they work per-packet, so the number of PCIe transactions and interrupts grows with the packet rate.
- Packet memory is allocated per-packet in the host and a pointer to it is passed to the NIC, which places incoming RX packets there.





# ASIC vs FPGA-based NICs [2/3]

- In FPGA-based NICs, packets are moved to the host in blocks (1 MB or more) instead of per-packet and thus the pressure on the PCIe bus and memory subsystem is greatly reduced.
- Packet memory is managed by the FPGA that passes packets in zero-copy to user-space applications.
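A rough back-of-the-envelope calculation shows why block transfers relieve the host. The figures below are illustrative assumptions, not measurements: minimum-size 64-byte packets at the 10 Gbit line rate of 14.88 Mpps, 1 MB blocks, and no per-packet metadata inside a block. "Hand-off" here means one NIC-to-host delivery event, which is a simplification of what actually happens on the PCIe bus.

```python
# Illustrative assumptions, not vendor figures.
PACKET_RATE_PPS = 14_880_000   # 10 Gbit line rate with minimum-size frames
PACKET_BYTES = 64              # minimum Ethernet frame, metadata ignored
BLOCK_BYTES = 1 << 20          # 1 MB block

packets_per_block = BLOCK_BYTES // PACKET_BYTES
per_packet_handoffs = PACKET_RATE_PPS                  # one hand-off per packet
block_handoffs = PACKET_RATE_PPS / packets_per_block   # one hand-off per block

print(f"packets per 1 MB block : {packets_per_block}")
print(f"per-packet hand-offs/s : {per_packet_handoffs:,}")
print(f"block hand-offs/s      : {block_handoffs:,.0f}")
```

Under these assumptions the host handles on the order of a thousand deliveries per second instead of millions.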





# ASIC vs FPGA-based NICs [3/3]

- (+) Moving packets in blocks reduces the load on the system and thus frees CPU cycles for other tasks.
- (-) FPGA memory management limitations: packets must be processed in order or, when that is not possible, copied (say goodbye to zero-copy).
- Bottom line: FPGA-based adapters are great if you use them the way the manufacturer has designed them, otherwise you will jeopardise the advantages (both in complexity and memory copies) you have paid for when purchasing the NIC.





# PF\_RING Advantages for NIC Vendors

- If you provide PF\_RING support for your NIC, users of ntop, Suricata, Snort, Bro, Argus, Wireshark... can immediately take advantage of your NIC without you individually supporting all these applications.
- You can provide a simple way to benchmark your NIC and show its advantages with respect to competitors using a single API.





# PF\_RING Advantages for Developers

Develop on top of PF\_RING so:

- Your time investment will be <u>preserved</u> over time as ntop will make sure that future vendor changes will be supported <u>transparently</u> by PF\_RING.
- No need to learn yet another API or sign NDAs.
- PF\_RING will take care of all differences between NICs, such as the hardware timestamp format.
- PF\_RING does not add any overhead (besides API wrapping) with respect to the native vendor API, but it offers you many advantages in terms of traffic processing facilities.



#### Accelerating Suricata with PF\_RING





# PF\_RING Support In Suricata

Maintained by William Metcalf and Eric Leblond

```
/**
* \file
*
* \author William Metcalf <william.metcalf@gmail.com>
* \author Eric Leblond <eric@regit.org>
*
*
* PF_RING packet acquisition support
*
```

ntop is contributing improvements to packet acquisition based on the latest PF\_RING technologies.





# Latest Contributions to Suricata [1/2]

- PF\_RING IPS/TAP Support #1587
- new PktAcqBreakLoop callback in TmModule #1696
- workers runmode: allow multiple input devices #1701
- pfring pkt acq: keep running on 'pfring\_set\_cluster' failure when cluster is not required #1713
- pfring pkt acq: use zero-copy recv in workers runmode #1706
- pfring pkt acq: removed reentrant flag #1707
- pfring pkt acq: capture loop optimisation #1708

Enabling support for non-Intel/ZC kernel-bypass

Performance improvements





# Latest Contributions to Suricata [2/2]

Enabling support for non-Intel kernel-bypass

Ability to run on top of FPGA-based kernel-bypass cards:











# Performance Improvements [1/2]

#### Packet acquisition performance tests (no processing)







# Performance Improvements [2/2]

Capture speed (no processing) with PF\_RING ZC (Intel 82599) using a single thread on E3-1230 v3 at 10 Gbit line-rate:

- Before patches: 9.52 Mpps
- After patches: 14.88 Mpps (line-rate)
- >55% performance boost (after latest improvements)
- More CPU cycles for real processing!
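The figures above can be checked with a little arithmetic. The sketch assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of per-frame overhead on the wire (8 bytes preamble/SFD and 12 bytes inter-frame gap); the variable names are illustrative.

```python
LINK_BPS = 10_000_000_000   # 10 Gbit/s
FRAME_BYTES = 64            # minimum Ethernet frame
OVERHEAD_BYTES = 8 + 12     # preamble/SFD + inter-frame gap

# Maximum packets per second the wire can carry at minimum frame size.
line_rate_pps = LINK_BPS / ((FRAME_BYTES + OVERHEAD_BYTES) * 8)

# Capture rates quoted on the slide, in Mpps.
before_mpps, after_mpps = 9.52, 14.88
boost_percent = (after_mpps / before_mpps - 1) * 100

print(f"10 Gbit line rate: {line_rate_pps / 1e6:.2f} Mpps")
print(f"boost after patches: {boost_percent:.1f}%")
```

The computed line rate matches the 14.88 Mpps "after patches" figure, and the ratio of the two capture rates is consistent with the ">55%" claim.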





# PF\_RING/Suricata Performance [1/4]

Packet acquisition performance - single thread - Intel Xeon E3-1230 v3 - Intel 82599 single-queue



# PF\_RING/Suricata Performance [2/4]

CPU Load at 1 Mpps - single thread - Intel Xeon E3-1230 v3 - Intel 82599 single-queue



# PF\_RING/Suricata Performance [3/4]

Packet acquisition performance - Single thread/core Intel Xeon E3-1230 v3 - kernel-bypass technologies









#### PF\_RING in Real Life Scenarios





#### Is Packet Capture Acceleration Still a Good Argument? [1/2]

- For years the industry and community <u>focused</u> mainly on packet capture/filtering.
- Applications were relatively <u>simple and self-contained</u>: network security, traffic monitoring, high-frequency trading, packet-to-disk...
- With the advent of big-data systems (and not only that) and the reduction of data center space, people like to <u>collapse multiple apps</u> on a single box.
- As previously stated, packet processing at 10 Gbit is now possible with commodity hardware.





#### Is Packet Capture Acceleration Still a Good Argument? [2/2]

- <u>100 Gbit</u> (and partially 40 Gbit) are raising the bar once more, and FPGA-based NICs are currently the <u>winners</u> in this scenario.
- In summary: "packet capture acceleration" is no longer enough, as we often need to combine it with multi-app traffic distribution, balancing, and pipelining in order to collapse onto one box, at high speed, functionalities that were previously implemented on multiple boxes.











# PF\_RING: Pipelining + Traffic Cleanup







# PF\_RING: Flows + DPI + Security [1/2]





# PF\_RING: Flows + DPI + Security [2/2]



cento -i zc:eth6 -9 127.0.0.1:1234 -t 5 -o 2 (500k flows, expiring every 5 seconds, export in NFv9)

taskset -c 2,3 suricata --pfring-int=zc:10@0 --pfring-int=zc:10@1 -c /etc/suricata/suricata.yaml --runmode=workers




#### Final Remarks





# The Big Picture

- PF\_RING implements a packet processing framework featuring:
  - <u>Single API</u> regardless of the network adapter being used.
  - Support for commodity and FPGA-based adapters.
  - Native support of many open-source applications as well as native Suricata integration (IDS and IPS modes).
  - Developers can focus on a <u>single API</u> and let their users choose the best NIC for their project.
  - Stable and <u>maintained</u> product (for 12 years).
  - Download it at https://github.com/ntop/PF\_RING/



