Not All Servers Are Alike (With PF_RING ZC/DNA) – Part 3


In the first and second parts of this post we discussed some common issues you might encounter when doing high-performance packet processing. Most of the problems are related to multi-CPU servers (NUMA) and memory configuration. We have spent a lot of time creating the nBox web GUI, which is not just a graphical interface but also a way to automatically configure ntop applications and report common configuration issues. For those who want to live without it, we have some additional lessons learnt to share.

Lesson 1: Make sure all your memory channels are populated


Every CPU has a number of memory channels that influence the memory bandwidth available to applications. Although most people look only at the total memory (or at most at memory type), it is very important how you populate the memory slots. Using

cat /proc/cpuinfo | grep "model name" | head -1
model name: Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz

You can figure out the CPU you own (Xeon E5-2667 in the above case), and then going to http://ark.intel.com and searching for the CPU model, you can learn the number of channels the CPU has. In our case it has 4 memory channels, so we need to make sure that 4 channels per CPU (in case you have a NUMA system with more than one CPU) have been populated.

Using:

# dmidecode |grep "Speed:\|Locator:"|grep -v Bank

you can see which slots are populated. In the example below the first slot shown (CPU 1, DIMM 1) is NOT populated, whereas the second (CPU 2, DIMM 9) is populated.

Locator: PROC  1 DIMM  1 
Speed: Unknown
Configured Clock Speed: Unknown
Locator: PROC  2 DIMM  9 
Speed: 1600 MHz
Configured Clock Speed: 1600 MHz

Note that, depending on the motherboard, there is an exact sequence in which slots must be populated, but in general the first slots need to be filled. So in our case, as we have 4 channels, we must make sure that each CPU has its first 4 memory slots filled in. Again, what matters is not just the total amount of memory but how you populated the slots. One 16 GB module is not the same as four 4 GB modules: in the first case only one channel is used, whereas in the latter all 4 channels are.
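
As a rough sketch, the check can be automated by counting the populated DIMMs per CPU in the dmidecode output. The function name count_populated_dimms is hypothetical, and the parsing assumes the "PROC n DIMM m" locator format shown above, which differs between motherboard vendors.

```shell
# Count populated DIMM slots per CPU from dmidecode output read on stdin.
# Assumption: locators look like "Locator: PROC  1 DIMM  1" (vendor specific)
# and an empty slot reports "Speed: Unknown", as in the example above.
count_populated_dimms() {
  awk '
    /^[ \t]*Locator:/ { cpu = $3 }            # CPU number of the current slot
    /^[ \t]*Speed:/   {                       # skips "Configured Clock Speed:"
      if ($2 != "Unknown") n[cpu]++           # slot holds a module
      else n[cpu] += 0                        # make sure the CPU is listed
    }
    END { for (c in n) print "PROC " c ": " n[c] " populated" }
  ' | sort
}

# Usage (as root):
#   dmidecode | grep "Speed:\|Locator:" | grep -v Bank | count_populated_dimms
```

On a CPU with 4 memory channels, a count below 4 suggests that some channels are unused.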

Of course the type of memory (dual rank etc.) and the clock (the faster the better) are also important. There are tools such as stream that you can use to measure the memory bandwidth actually available on your system.

 

Lesson 2: Plug your NICs on the correct NUMA node


As you know, on NUMA systems you must avoid crossing the costly QPI bus, so your apps must be bound to a precise NUMA node. The same applies to PCIe slots: each node has some PCIe slots directly connected to it. So if your application runs on a given node, you must make sure that it is accessing packets received from a NIC connected to that same node. Supposing you want to know the node to which eth1 is attached, you need to do

# cat /sys/bus/pci/devices/`ethtool -i eth1 | grep bus-info | cut -d ' ' -f 2`/numa_node
1

or you can use the command

# hwloc-info -v

which is a bit more verbose but provides more detailed information.

In the example above eth1 is attached to node 1. This means that applications opening the device zc:eth1 must run on NUMA node 1, as otherwise they will have to cross QPI for each and every packet. Note that there is no software command you can use to change NIC affinity, as this is a physical/electrical property. If you want to change node you need to change PCIe slot.
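
To survey every NIC at once, the same information can be read from /sys/class/net. This is a minimal sketch: list_nic_nodes is a hypothetical helper name, and SYSFS_ROOT is an assumed override that exists only so the function can be tried against a fake directory tree.

```shell
# Print "<interface> <numa_node>" for every NIC backed by a PCI device.
# Virtual interfaces such as lo have no device/ entry and are skipped.
# SYSFS_ROOT is a hypothetical override (default /sys) used for testing.
list_nic_nodes() {
  root="${SYSFS_ROOT:-/sys}"
  for f in "$root"/class/net/*/device/numa_node; do
    [ -r "$f" ] || continue                 # glob did not match anything
    iface=${f#"$root"/class/net/}           # strip the leading path
    iface=${iface%%/*}                      # keep only the interface name
    printf '%s %s\n' "$iface" "$(cat "$f")"
  done
}
```

A value of -1 means that the kernel has no NUMA information for that device, as is typical on single-node systems.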

Please note that for dual-port (or even denser) NICs, all ports are on the same NUMA node, so if you want each NUMA node to take care of one NIC, you must buy two single-port NICs and install them on different nodes.

 

Lesson 3: Allocate the memory and bind your application on the correct NUMA node


Once you have selected the NUMA node to use and made sure your NIC is properly installed, it is time to allocate the memory properly. First of all you need to load the driver, telling it to which cores the memory should be bound. In our DNA and ZC drivers you will find a new option named numa_cpu_affinity that allows you to specify, for each NIC port, the core (not the NUMA node, but the core) the port will be bound to. In essence the core identifies which NUMA node will then use the NIC. For example, with

insmod ./ixgbe.ko numa_cpu_affinity=0,0,1,1

the first two ports will be used by core 0, and the other two by core 1. Note that it is not strictly necessary to specify the exact coreId, but it must be clear that the coreId is used to figure out the NUMA node the core belongs to, and thus the memory affinity.

Once the driver memory has been properly allocated, we need to start the application on the exact node. In order to do that, you must first bind your app to the correct node, and only then allocate memory and spawn threads. If instead you start your app, open your PF_RING ring and then bind it to the correct node, you have not done a great job: the application might already have allocated its memory on the wrong node, so the affinity was set too late. Please pay attention to this, as it is not a minor detail at all.
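
One way to get this ordering without touching the application code is to launch the application through numactl, which applies the CPU and memory policy before the program starts, so every allocation it makes already lands on the chosen node. The sketch below is only an illustration: run_on_node, the DRY_RUN switch and ./my_app are hypothetical names, while --cpunodebind and --membind are standard numactl options.

```shell
# Start a command with CPU and memory both restricted to one NUMA node.
# The binding is in place before the application allocates anything.
# DRY_RUN (hypothetical): when set, print the command instead of running it.
run_on_node() {
  node=$1
  shift
  ${DRY_RUN:+echo} numactl --cpunodebind="$node" --membind="$node" "$@"
}

# Example, with ./my_app standing in for any PF_RING application:
#   run_on_node 1 ./my_app -i zc:eth1
```

Applications that pin their own threads should do so at startup, before opening the ring, for the same reason.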

If you want to find out which cores belong to each NUMA node, run:

# numactl --hardware 
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 32733 MB
node 0 free: 722 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 32767 MB
node 1 free: 1375 MB
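
The core list of a given node can be extracted from this output, for instance to pass it to taskset -c. A minimal sketch, assuming the output layout shown above; node_cpus is a hypothetical helper name.

```shell
# Print the core IDs of NUMA node $1 as a comma-separated list,
# reading `numactl --hardware` output on stdin.
node_cpus() {
  awk -v n="$1" '$1 == "node" && $2 == n && $3 == "cpus:" {
    sub(/^node [0-9]+ cpus: */, "")   # drop the "node N cpus:" prefix
    sub(/ +$/, "")                    # drop trailing blanks, if any
    gsub(/ +/, ",")                   # 6 7 8 -> 6,7,8
    print
  }'
}

# Usage:
#   numactl --hardware | node_cpus 1
```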

Lesson 4: RAID controller and NICs must be connected to the same NUMA node


If you have decided to use applications like n2disk to dump traffic to disk, there is an extra thing to check. You must make sure that:

  1. the RAID controller you are using to dump packets to disk
  2. the NIC from which you are capturing traffic
  3. n2disk

are all bound/connected to the same NUMA node. A common mistake is to forget about the RAID controller, which can degrade the overall performance even though packet capture is running at full speed.
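
The hardware side of this check can be sketched by comparing the numa_node entries of the devices involved. The helper name same_numa_node, the SYSFS_ROOT override and the PCI addresses in the usage line are all hypothetical; the real addresses of your NIC and RAID controller can be listed with lspci -D.

```shell
# Report the NUMA node of each given PCI device and return 0 only if
# all of them sit on the same node (1 on mismatch, 2 on read error).
# SYSFS_ROOT is a hypothetical override (default /sys) used for testing.
same_numa_node() {
  first=""
  rc=0
  for dev in "$@"; do
    node=$(cat "${SYSFS_ROOT:-/sys}/bus/pci/devices/$dev/numa_node") || return 2
    echo "$dev is on NUMA node $node"
    if [ -z "$first" ]; then first=$node; fi      # remember the first node seen
    if [ "$node" != "$first" ]; then rc=1; fi     # flag any device that differs
  done
  return "$rc"
}

# Example with made-up addresses for the NIC and the RAID controller:
#   same_numa_node 0000:04:00.0 0000:82:00.0
```

n2disk itself must then be started on the node that the devices report.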

 

Conclusion


NUMA might not be your best friend if you expect the OS to do everything on your behalf. There are some things you must do yourself (e.g. PCIe slot selection) and others that can be done in software. Make sure you have everything properly configured before starting to do performance measurements on your application.