Why TNAPI (Threaded NAPI)?


PF_RING is a Linux kernel patch that improves packet capture by:

  • bypassing some kernel network layers
  • replacing system calls such as read() with DMA (Direct Memory Access)

One great PF_RING feature is that it is driver-independent: it sits in the kernel at the bottom of the networking stack, immediately above NAPI. This means that any driver can be used with PF_RING.

Unfortunately, the drawback is that Linux networking drivers have not changed much in the past years and have not taken much advantage of new features such as kernel threads and multicore. In particular, in the past couple of years Intel has developed a set of technologies known as I/OAT (I/O Acceleration Technology) that dramatically enhance performance when used. Furthermore, many modern Ethernet cards support MSI-X, which allows the incoming RX queue to be partitioned into several RX queues, one per core (i.e. each RX queue is mapped to a processor core). Thanks to RSS (Receive Side Scaling), the Ethernet card balances the traffic across the queues in hardware on a per-flow basis (i.e. a flow is always sent to the same core, not distributed in round-robin mode). This means that the more CPU cores are available, the more RX queues are available.
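
The per-flow balancing described above can be illustrated with a small sketch. It is not the RSS algorithm: the NIC computes its hash in hardware, and cksum is used here only as a stand-in. The function name flow_to_queue and the flow key format are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Map a flow key to an RX queue index in the range 0..queues-1.
# ASSUMPTION: cksum (a CRC) is only a stand-in for the hardware RSS hash.
flow_to_queue() {
    flow=$1
    queues=$2
    hash=$(printf '%s' "$flow" | cksum | cut -d ' ' -f 1)
    echo $((hash % queues))
}

# The same flow key always maps to the same queue, unlike round-robin.
flow_to_queue "10.0.0.1:1234-10.0.0.2:80-tcp" 4
flow_to_queue "10.0.0.1:1234-10.0.0.2:80-tcp" 4
```

Because the queue depends only on the flow key, all packets of a flow reach the same core, which is what keeps per-flow processing free of cross-core coordination.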

What is the advantage of all this? With vanilla Linux the advantage is very limited as you can see in the picture.

In fact:

  • The driver is still polling the various RX queues in sequence and not concurrently as it should be

  • The operating system collapses all the RX queues into a single tnapiX interface. This was a good idea for interfaces of limited speed, but with fast interfaces such as those running at 10 Gbit, it is a nightmare.

    The reason is pretty simple. In order to go fast, an application has to spawn several threads. As all threads need to fetch packets from the single tnapiX interface, they compete with each other, hence a semaphore (mutex) needs to be used. The result is that performance is degraded by context switches or active waiting (spin locks).

 

What is the TNAPI Advantage?


TNAPI attempts to solve the following problems:

  • Distribute the traffic across cores (i.e. the more cores you have, the more scalable your networking application is).
  • Poll packets simultaneously from each RX queue (contrary to sequential NAPI polling) to fetch packets as fast as possible, hence improving performance.
  • Through PF_RING, expose the RX queues to userland so that the application can spawn one thread per queue and avoid using semaphores altogether.

TNAPI achieves all this by starting one thread per RX queue. Received packets are then pushed to PF_RING (if available) or through the standard Linux stack. However, in order to fully exploit this technology it is necessary to use PF_RING, as it provides a straight packet path from kernel to userland. Furthermore, it allows a virtual Ethernet card to be created per RX queue.

How fast is PF_RING+TNAPI vs PF_RING?


Now that it’s clear what TNAPI is and that you need PF_RING to exploit it, it’s time to talk about numbers. The answer is very simple: PF_RING+TNAPI is at least 2X faster than plain PF_RING (which is itself much faster than native Linux NAPI). If you have more than two CPU cores, it’s even faster. With an old Core2Duo 1.86 GHz you can capture 2.8 Mpps using TNAPI and PF_RING 5.1.

Hardware Packet Filtering using TNAPI


Modern 10 Gbit network adapters such as the Intel X520 support hardware packet filtering. Please see this document to learn how to use hardware filters.

How to use TNAPI


The hardware prerequisites are

  • Multicore CPUs: the more cores the better.
  • Ethernet card able to support Intel I/OAT, MSI-X, DCA. Currently the following ethernet controllers are supported:
    • 1 Gbit: Intel 82575/76/80 (Linux driver igb 3.1.x)
    • 10 Gbit: Intel 82598/82599 (Linux driver ixgbe 3.3.9)

    Other, older controllers (such as those supported by the e1000 driver) are not supported.

Make sure that you use PF_RING 5.1.x or later.

  1. Compile the TNAPI driver
    cd <driver directory>/src
    make
  2. Install the driver
    cd <driver directory>/src
    su                                   # become root
    modprobe dca                         # prerequisite
    insmod ./<driver name>.ko IntMode=3  # IntMode=3 enables MSI-X
  3. Check that everything is OK
    • Check the /var/log/messages file:

      igb: tnapi0: igb_probe: Intel(R) Gigabit Ethernet Network Connection

      igb: tnapi0: igb_probe: (PCIe:2.5Gb/s:Width x4) 00:1b:21:30:81:d0

      igb: tnapi0: igb_probe: Using MSI-X interrupts. 2rx queue(s), 1 tx queue(s)

    • Use ‘ethtool -S tnapiX’ (tnapiX is the TNAPI-powered card) to check the traffic on the RX queues
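
The two checks above can be scripted. The following is a minimal sketch, not part of the driver package: the helper names rx_queue_count and rx_queue_packets are hypothetical, the first assumes the MSI-X log line has the format shown above, and the second assumes the per-queue counters are named rx_queue_N_packets, which may differ in your driver version.

```shell
#!/bin/sh
# Print the RX queue count found in driver probe lines read from stdin.
# ASSUMPTION: the line looks like "Using MSI-X interrupts. <N>rx queue(s), ...".
rx_queue_count() {
    sed -n 's/.*Using MSI-X interrupts\. *\([0-9][0-9]*\) *rx queue.*/\1/p' | tail -n 1
}

# Print "<counter name> <value>" for each per-queue RX packet counter
# found in 'ethtool -S' output read from stdin.
# ASSUMPTION: counters are named rx_queue_<N>_packets.
rx_queue_packets() {
    awk -F': *' '/rx_queue_[0-9]+_packets/ { gsub(/^[ \t]+/, "", $1); print $1, $2 }'
}

# Example on the sample log line from this page; prints 2.
echo 'igb: tnapi0: igb_probe: Using MSI-X interrupts. 2rx queue(s), 1 tx queue(s)' | rx_queue_count
```

On a live system these would typically be fed with `grep tnapi0 /var/log/messages | rx_queue_count` and `ethtool -S tnapi0 | rx_queue_packets`. If the counters grow on one queue only, the traffic is probably a single flow.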

It’s now time to start your multiqueue PF_RING application. Suppose you use tnapiX with Y RX queues: you can capture either from tnapiX (aggregated traffic, as PF_RING merges the packets from all RX queues) or from the individual queues tnapiX@0 … tnapiX@Y-1. For maximum performance, create a multithreaded application that captures from the individual queues. Examples (X=0, Y=2):

  • Capture from all queues and let PF_RING merge packets: pfcount -i tnapi0
  • Capture from a single RX queue: pfcount -i tnapi0@0
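
As a rough sketch of the per-queue approach, the script below builds the device names tnapiX@0 … tnapiX@Y-1 and starts one pfcount process per queue. The function names queue_devices and capture_per_queue, and the PFCOUNT override variable, are hypothetical helpers introduced for illustration. Separate processes stand in here for the threads of a real multithreaded application.

```shell
#!/bin/sh
# Print the per-queue device names for an interface: <dev>@0 ... <dev>@<Y-1>.
queue_devices() {
    dev=$1
    queues=$2
    q=0
    while [ "$q" -lt "$queues" ]; do
        echo "${dev}@${q}"
        q=$((q + 1))
    done
}

# Start one pfcount per RX queue in the background, then wait for all of them.
# PFCOUNT is a hypothetical override, e.g. PFCOUNT="echo pfcount" for a dry run.
capture_per_queue() {
    for d in $(queue_devices "$1" "$2"); do
        ${PFCOUNT:-pfcount} -i "$d" &
    done
    wait
}

# List the devices for tnapi0 with 2 RX queues; prints tnapi0@0 and tnapi0@1.
queue_devices tnapi0 2
```

Running `capture_per_queue tnapi0 2` as root would then start the two captures. Since each process reads from its own queue, no locking between them is needed.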

Multithreaded TNAPI vs NAPI-based TNAPI


Inside the driver in the xxx_main.c file (e.g. igb_main.c) you can comment out the line “#define ENABLE_TNAPI” so that the multithreaded packet polling is disabled but PF_RING multiqueue is still enabled. You might wonder why you might need to do this. In most cases multithreading is a good idea as it improves the packet capture performance; however if your PC is not particularly fast, the kernel packet polling might put extra load on the PC. This is a typical case where you might consider using multiqueue but not multithreading.
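
The edit described above can be done by hand or with a one-line filter. This is a minimal sketch, assuming the define sits on its own line exactly as quoted, with no trailing value or comment. The helper name disable_tnapi is hypothetical.

```shell
#!/bin/sh
# Comment out the "#define ENABLE_TNAPI" line in C source read from stdin.
# ASSUMPTION: the define is alone on its line; other lines pass through unchanged.
disable_tnapi() {
    sed 's|^[[:space:]]*#define[[:space:]][[:space:]]*ENABLE_TNAPI[[:space:]]*$|// &|'
}

# Example; prints: // #define ENABLE_TNAPI
printf '#define ENABLE_TNAPI\n' | disable_tnapi
```

A typical use would be `disable_tnapi < igb_main.c > igb_main.c.new`. Review the result, replace the original file, then rebuild the driver with make as in the compile step.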

 

FAQ


  • Q. Do I need just a TNAPI-compatible NIC or also a specific motherboard/chipset?
    A. The Intel 5000 chipset is required for maximum acceleration as it can fully exploit I/OAT. Nevertheless, any chipset can be used, although the speed-up will be more limited.
  • Q. Do I get the driver source code?
    A. Absolutely. The code is released under GPL and you get the full source code.
  • Q. Do I need to pay a fee for each host on which I use the TNAPI driver?
    A. Not at all. The little fee we ask is for carrying on research and it’s not a license fee whatsoever.
  • Q. What application can take advantage of TNAPI?
    A. Any application that has been compiled with PF_RING and the PF_RING-enabled libpcap (part of PF_RING). This means that you can use, for instance, tcpdump, wireshark, snort, etc.
  • Q. Do I need to change the source code in order to use TNAPI?
    A. No. TNAPI is fully transparent to the application. Just use tnapiX@queue_id if you want to capture from a specific RX queue.
  • Q. What PC hardware do I need for wire-speed 1G packet capture at any packet size?
    A. A single quad-core Xeon is more than enough.
  • Q. My computer has DCA disabled and the BIOS does not allow me to enable it. What shall I do?
    A. Have a look at this link.
  • Q. Is TNAPI useful for general-purpose networking?
    A. No. TNAPI is NOT designed for general purpose networking but ONLY for passive packet capture.

 

Get It


You can get your TNAPI driver from the web shop for a little fee. This work is the result of the last couple of years of self-funded research. We therefore ask you for a little help to keep the project running. Nevertheless, if you’re a non-profit organization, professor or university researcher, please contact us and, if you qualify, we’ll send it to you for free.

                      igb                       ixgbe
Capture Rate          1 Gigabit/Sec             >5 Gigabit/Sec
Supported Cards       Intel 82575/76/80-based   Intel 82598/82599-based
Operating System      Linux
Traffic Reception     Included
Traffic Injection     Not yet supported
Hw packet filtering   Intel 82599-based/Silicom Director only