Not All Servers Are Alike (With DNA) – Part 2


Some time ago, in the first part of this post, we discussed why not all servers deliver the same performance with DNA. The conclusion was that besides the CPU, you need high memory bandwidth in order to move packets from/to the NIC. So in essence CPU and memory bandwidth are both necessary to achieve line-rate performance.

In this post we want to add some lessons learnt while playing with DNA on modern servers.

Lesson 1: Not all PCIe slots are alike


With the advent of PCIe gen3, computer manufacturers started mixing (older) PCIe gen2 and gen3 slots on the same machine. In order to keep the number of components small, manufacturers quite often use the same physical PCIe connector for gen2 and gen3 slots. The same applies to slot width, where x4 slots are shorter than x8 slots. So in essence, when you plug your 10G adapter into your brand new server, do not rely on the slot form factor to figure out whether it is the correct slot: read the slot speed printed on the motherboard (or in the companion manual) before plugging in your NIC. This will save you headaches.
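
If in doubt, you can also verify from Linux what the slot actually negotiated once the NIC is installed. A minimal check, assuming the 10G NIC sits at PCI address 04:00.0 (replace it with the address reported by lspci on your system):

    # LnkCap = what the device/slot supports, LnkSta = what was negotiated.
    # Speed 2.5GT/s = gen1, 5GT/s = gen2, 8GT/s = gen3; Width = number of lanes.
    lspci -s 04:00.0 -vv | grep -E 'LnkCap|LnkSta'

If LnkSta reports a lower speed or width than LnkCap, the card is sitting in a slower slot than it is capable of using.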

Lesson 2: 2 x single-port 10G NIC != 1 x dual-port 10G NIC


After you have selected a PCIe slot for your 10G card, you need to figure out what your plans are. Given the small difference in cost between a single-port and a dual-port 10G card, many people prefer to buy the dual-port (if not the quad- or six-port) model and believe that all those cards are alike. Not quite. Besides the NIC form factor, you need to understand that the hardware manufacturer had to interconnect all those ports in some way. Usually this happens via a PCIe bridge that links the various ports, similar to what a USB hub does for your PC devices. Whenever you pass through a bridge, the available bandwidth is reduced, so this can become a bottleneck. So make sure that your many-port 10G NIC uses a PCIe gen3 bridge, otherwise expect packet drops due to the physical design of your NIC.
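
You can see whether (and behind which bridge) the ports of a multi-port card are connected by printing the PCI topology as a tree; the bridge address below is just a placeholder:

    # Tree view of the PCI topology: the ports of a multi-port NIC typically
    # show up as sibling devices hanging off the same on-board PCIe bridge.
    lspci -tv

    # Then inspect that bridge's link, exactly as in Lesson 1
    lspci -s <bridge-address> -vv | grep -E 'LnkCap|LnkSta'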

With modern Sandy Bridge systems, PCIe slots are connected directly to the CPU. On a NUMA system (e.g. a system with two physical CPUs), connecting a dual-port 10G NIC to a slot means that you are binding both ports to that CPU. If the application accessing the NIC runs on the other CPU (i.e. not the one to which your dual-port card is connected), it will end up reaching the card through the first CPU and the QPI bus that interconnects the CPUs. In essence this is a bad idea, as your performance will be reduced by the longer journey and by memory coherency (packet memory allocated on the first CPU is accessed by the second CPU). If you plan to use this architecture, you are better off using two single-port 10G NICs instead, plugging one card into a slot attached to the first CPU and the other into a slot attached to the second CPU. This will grant you line rate at any packet size.
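
To figure out which NUMA node a given NIC is attached to, and to keep your application on the same node, something like the following is enough (the PCI address and core number are examples):

    # NUMA node to which the NIC's PCIe slot is attached (-1 means unknown)
    cat /sys/bus/pci/devices/0000:04:00.0/numa_node

    # Show which cores belong to which NUMA node
    numactl --hardware

    # Bind pfsend (-g <core>) to a core that belongs to the NIC's node,
    # e.g. core 3 if core 3 sits on the same node as dna0
    ./pfsend -i dna0 -g 3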

Lesson 3: Energy efficiency might not be your best friend


Modern CPUs such as the Intel E5 have a variable clock speed that changes according to the amount of work the CPU has to carry out, so that the CPU can save energy when possible. This means that the clock of your CPU is not fixed, but changes based on the CPU load. Tools like i7z allow you to monitor the CPU clock in real time. In the system BIOS you can set how you plan to use your CPU (energy efficient, performance, balanced), and in your Linux system you can set how software plans to use the CPU. Depending on the BIOS and kernel configuration, you can obtain very bad or very good results. For example, this is what happens on our single-socket Sandy Bridge system with an E5-2620.

  1. CPU Power Management Configuration disabled in the BIOS:

    root@nbox:/home/deri/PF_RING/drivers/DNA/ixgbe-3.10.16-DNA/src# /home/deri/PF_RING/userland/examples/pfsend -i dna0 -g 3
    Sending packets on dna0
    Using PF_RING v.5.5.2
    Using zero-copy TX
    TX rate: [current 12'850'010.49 pps/8.64 Gbps][average 12'850'010.49 pps/8.64 Gbps][total 12'850'139.00 pkts]
    TX rate: [current 12'852'422.34 pps/8.64 Gbps][average 12'851'216.44 pps/8.64 Gbps][total 25'703'114.00 pkts]
    
  2. CPU Power Management Configuration set in the BIOS to Performance

    root@nbox:/home/deri/PF_RING/drivers/DNA/ixgbe-3.10.16-DNA/src# /home/deri/PF_RING/userland/examples/pfsend -i dna0 -g 3 
    Sending packets on dna0
    Using PF_RING v.5.5.2
    Using zero-copy TX
    TX rate: [current 14'708'669.91 pps/9.88 Gbps][average 14'708'669.91 pps/9.88 Gbps][total 14'708'817.00 pkts]
    TX rate: [current 14'727'932.58 pps/9.90 Gbps][average 14'718'301.54 pps/9.89 Gbps][total 29'437'810.00 pkts]
    
  3. Same as 2 but we add -a (active polling) to pfsend

    root@nbox:/home/deri/PF_RING/drivers/DNA/ixgbe-3.10.16-DNA/src# /home/deri/PF_RING/userland/examples/pfsend -i dna0 -g 3 -a
    Sending packets on dna0
    Using PF_RING v.5.5.2
    Using zero-copy TX
    TX rate: [current 14'880'666.69 pps/10.00 Gbps][average 14'868'674.49 pps/9.99 Gbps][total 29'737'914.00 pkts]
    TX rate: [current 14'880'562.38 pps/10.00 Gbps][average 14'872'637.12 pps/9.99 Gbps][total 44'618'774.00 pkts]
    

As you can see, the performance is quite different. The reason is the CPU clock speed, which changes according to the power management configuration. In case 2. i7z reports

Socket [0] - [physical cores=6, logical cores=12, max online cores ever=6]
 TURBO ENABLED on 6 Cores, Hyper Threading ON
 Max Frequency without considering Turbo 2098.95 MHz (99.95 x [21])
 Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is 25x/25x/24x/24x/23x/23x
 Real Current Frequency 1226.46 MHz [99.95 x 12.27] (Max of below)
 Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % C7 % Temp
 Core 1 [0]: 1199.34 (12.00x) 1.22 0.969 0 0 98.3 29
 Core 2 [1]: 1199.59 (12.00x) 1 0.0213 0 0 100 33
 Core 3 [2]: 1198.16 (11.99x) 1 0.0263 0 0 100 28
 Core 4 [3]: 1199.38 (12.00x) 71 57.4 0 0 0 39
 Core 5 [4]: 1180.98 (11.82x) 1 0.0362 0 0 99.9 32
 Core 6 [5]: 1226.46 (12.27x) 1 0.0361 0 0 99.9 30

whereas in case 3. it reports

Socket [0] - [physical cores=6, logical cores=12, max online cores ever=6]
 TURBO ENABLED on 6 Cores, Hyper Threading ON
 Max Frequency without considering Turbo 2098.95 MHz (99.95 x [21]) 
 Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is 25x/25x/24x/24x/23x/23x
 Real Current Frequency 2498.64 MHz [99.95 x 25.00] (Max of below) 
 Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % C7 % Temp
 Core 1 [0]: 2485.39 (24.87x) 1 0.301 0 0 99 31
 Core 2 [1]: 2390.24 (23.91x) 1 0.041 0 0 99.9 34
 Core 3 [2]: 2318.51 (23.20x) 0 0.0343 0 0 100 30
 Core 4 [3]: 2498.64 (25.00x) 100 0 0 0 0 42
 Core 5 [4]: 2445.43 (24.47x) 1 0.0328 0 0 99.9 32
 Core 6 [5]: 2416.29 (24.18x) 1 0.0358 0 0 99.9 31

In essence, during test 2. core 3 (where pfsend was running) was clocked at 1.19 GHz, whereas in test 3. it was running at 2.49 GHz. This explains the difference in performance. If you are monitoring network traffic, you had better pay attention to these details, as otherwise you will be disappointed by the performance you achieve. Please note that you can also set the CPU speed and scaling governor from software using tools such as cpufreq-set.
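
For instance, with the cpufrequtils package installed, something along these lines lets you inspect and pin the governor of the core used by pfsend (core 3 here; the available governors depend on your kernel configuration):

    # Current governor, frequency limits and speed of core 3
    cpufreq-info -c 3

    # Keep core 3 at full clock by selecting the performance governor
    cpufreq-set -c 3 -g performance

    # The same information is also exposed via sysfs
    cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor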

Conclusion


In addition to what we discussed in the first part, make sure that you understand the topology of your computer and the power configuration of your system, otherwise you might obtain unexpected results from your speedy (and costly) modern computer system.