How Great Hashing Can (More Than) Double Application Performance

Posted · Add Comment

Most ntop applications (ntopng, nProbe, Cento) and libraries (FT) are based on the concept of flow processing, that merely means keeping track of all network communications. In order to implement this, network packets are decoded and, based on a “key” (usually a 5-tuple consisting of protocol and src/dst IP and port), clustered into flows (other keys such as VLAN can be added if necessary). This usually requires a lookup in an hash table, by using an hash function to translate the key into an index for an array with collision lists.
A good hash function should be collision resistant (it should be hard to find two inputs keys that hash to the same index), this to avoid huge collision lists which affect the performance of a lookup. However building a strong hash often translates in poor performance, in fact most of the well-known hash functions that are publicly available are computationally expensive.

A few days ago, we have been struggling optimising one of our applications, to make it processing traffic at 100 Gbps. This application had to perform flow classification and a bunch of other activities within the same thread. During this analysis, we ended up trying to accelerate hash lookups, and we faced with the hash function, which was taking most of the CPU. The application was using the Fowler-Noll-Vo (FNV) hash function  which is one of the fastest out of those available in literature, and with a fairly good distribution. This led to a discussion with our friend Logan oos Even, which is helping us a lot with applied cryptography in n2n and demonstrated a strong experience in the field. At the end of the discussion, he offered to help building a fast hash function from scratch for our application.

The requirements for the hash were clear:

  1. 12 byte input (this is what we need for a bidirectional flow key)
  2. 32 bit output (enough to index any hash table size)
  3. well distributed, for use with an hash table
  4. **fast**, the hash function itself has to be able to compute at least 600K hashes per second (we got this number with some math with required performance and cpu cycles available) on a E3 3 GHz CPU
  5. optionally consider SSE 4.2 or AVX 2 which are supported by most CPUs today

This sounded easy to Logan and he started considering to use round functions of some ARX ciphers, or otherwise take advantage of hardware accelerated AES-NI (that should be available on cpus supporting AVX2).

Only having a laptop computer (Intel i7 2860QM) available, Logan found that his favorite Pearson hashing  was too slow to meet the requirements. Having a quick look at FNV, which is what we initially selected for the application, Logan found that it walked byte-by-byte through the input, just like Pearson hashing. Thus is what the first idea was: what if we take the input in chunks of 32 or 64 bits instead? No problem as we had well structured 12 byte input which made three same-sized 32-bit input pieces.

As second idea, Logan was thinking to  the xorshift pseudo random number generator, with a round function offering a somewhat scrambling bijection. Unfortunately, in this approach zero get mapped to zero – something he wanted to avoid for the beauty of the output. Also, input states with only few bits set are transformed to output with only a few (more) bits set. To counter this, he decremented the input (zero becomes “all set”, and at least all trailing zero-bits turned to ‘1′), accumulated the result, and added the other input words between the scrambling rounds. The addition operation interrupted the sequence of otherwise linear shifts and exclusive-ors and thus kept out xor-shift-patterns from distribution to a certain degree.

So, the first draft was ready to be tested:

// 'in' points to some 12 byte memory data to be hashed
// uses round function of xorshift32 prng
uint32_t simple_hash (uint8_t *in) {
  uint32_t ret = 0;

  ret += *(uint32_t*)(in+0);
  ret--;

  ret ^= ret << 13; // 1st round function of xorshift32
  ret ^= ret >> 17;
  ret ^= ret << 5;

  ret += *(uint32_t*)(in+4);
  ret--;

  ret ^= ret << 13; // 2nd round function of xorshift32
  ret ^= ret >> 17;
  ret ^= ret << 5;

  ret ^= ret << 13; // another round for more scrambling
  ret ^= ret >> 17;
  ret ^= ret << 5;

  ret += *(uint32_t*)(in+8);

  return ret;
}

First of all we had to check the distribution, this has been done by applying empirical statistics. That was counting how often different groups of output appeared for certain input patterns, e.g. all inputs with a suffix `0x00`¹⁰ (varying the first two bytes only) that led to an output with a leading `F` nibble. It should be around 4096. This has been repeated for a lot of combinations, positions and lengths to see enough evidence of a somewhat rectangular distribution. However, it was important to have a look at vastly listed output of subsequent input data: one could find “regions” of repeating patterns or always zero digits for certain input ranges. The latter was the reason for placing an additional scrambling round.

Talking about distribution, Logan cared for bijectivity of all building-blocks because he did not want to miss a thing, on neither side – neither domain nor co-domain. Any two items mapped to the same value would negatively impact output distribution.

On Logan’s laptop, at a clock speed of 3.6GHz but with an old Intel architecture, this turned out to deliver slightly above 105 Mhps (Million hashes per second)  if compiled with `-O3` compiler optimizations, 40 Mhps without. Then we tried on a Xeon machine, more similar to a production machine, a Xeon E3-1230 v5 3.4GHz. On this system the function was capable of computing 260 Mhps!

As a first step of revision, Logan decided to replace the “2nd round function” of xorshift32 with another bijection that also relied on simple primitives such as rotation and bitwise operation: xor-rotate, assuming that the compiler would have translated the ROR macro (see the code below) into the assembly rotate instruction, accelerating the computation:

  #define ROR32(x,r) (((x)>>(r))|((x)<<(32-(r))))
  …
  ret ^= ROR32(ret, 17) ^ ROR32(ret, 13);

And yes, this was delivering 304M flows per second!

By the way, replacing the “2nd round function” and not “another round function” was a decision based on output observation.

Now, how to double performance to reach 600 Mhps as per the requirement? Getting rid of instructions! Hence, Logan reduced it all to one shift-xor round between whose steps he interleaved the additions of the three words, and used a final scramble to make output look good:

#define ROR32(x,r) (((x)>>(r))|((x)<<(32-(r))))

// 'in' points to some 12 byte memory data to be hashed;
// interleaves additions of data into one round function of xorshift32 prng
// and performs a bijective final xor-rotate
uint32_t simple_hash (uint8_t *in) {
  uint32_t ret = *(uint32_t*)in;
  ret--;

  ret ^= ret << 13; // 1st step of round function of xorshift32

  ret += *(uint32_t*)(in+4);
  ret--;

  ret ^= ret >> 17; // 2nd step of round function of xorshift32

  ret += *(uint32_t*)(in+8);

  ret ^= ret <<  5; // 3rd step of round function of xorshift32

  ret ^= ROR32(ret, 17) ^ ROR32(ret, 11); // final scramble xor-rotate

  return ret;
}

This did not fully double performance on Logan’s i7 2860QM but on his second computer which is powered by an Intel i7 7500U. Assuming this same benefit from a more modern CPU architecture, we tried on the Xeon, pretty sure that it would also have shown the doubling: 365M flows per second. That was definitely strange! We were not yet there.

Trying SSE to load 16 bytes at once (adding some padding to the input data) was a disaster. Even with the simplest SSE instruction applied to the data was way slower than any of the scalar versions above. Same result for one or two rounds of hardware accelerated AES applied to the data in SSE register. It seemed that performance was eaten up by the load and store operations. Reading longer data from memory seemed to result in extra cost. That was in line with Logan’s former experience and the reason why he preferably sticked to 32 bit – 64-bit operations were much slower on 2860QM even though considered a 64-bit CPU.

At this point Logan was concerned that he even tried an idea that he originally had ruled out: linear congruential generators. He feared the cost of multiplication and reading the constant factor from memory. Sticking with 32-bit and choosing some arbitrary, probably not correctly chosen constants, he was surprised to see that this resulted in doubled speed on i7 2860QM.

#define ROR32(x,r) (((x)>>(r))|((x)<<(32-(r))))

// 'in' points to some 12 byte memory data to be hashed;
// applies different linear congruential generators
// and performs a bijective final xor-rotate
uint32_t simple_hash (uint8_t *in) {
  uint32_t a, b, c;

  a = *(uint32_t*)(in + 0);
  b = *(uint32_t*)(in + 4);
  c = *(uint32_t*)(in + 8);

  a = a * 0x53731741 + 0x61949103;
  b = b * 0x7038151b + 0x29875275;
  c = c * 0xc2758245 + 0x28759251;

  a = (a + b + c);

  // final scramble
  a ^= ROR32(a, 17) ^ ROR32(a, 11);

  return a;
}

416 Mhps.

Seemingly, a step into the right direction. Based on our experience with Xeon CPUs, we suggested Logan to try using 64-bit arithmetic to save one multiplication. Logan, that had experience with his i7, was reluctant but gave it a shot, merging `a` and `b` into one 64-bit generator, keeping `c` at the presumably faster 32-bit. Indeed, this version turned out slower on i7. But, this was surprisingly delivering 530 Mhps! So, Logan decided to make it two 64-bit generators. These linear congruential generators would have bijective property if their parameters were chosen wisely.

The linear congruential generators shared the property that any suffix of a certain length sequentially repeats, e.g. the last nibble of a sequential output would always repeat after 16 steps. This would probably also hold true for the sum of two generator outputs. For that reason, a final scramble of a different method was required. Here, the xor-rotate came in handy again. This code has been added to the nDPI toolkit so every application using it can take advantage of it without having to duplicate code.

#define ROR64(x,r) (((x)>>(r))|((x)<<(64-(r))))

/*                                                                                                                                                                                                                      
  'in_16_bytes_long` points to some 16 byte memory data to be hashed;                                                                                                                                                   
  two independent 64-bit linear congruential generators are applied                                                                                                                                                     
  results are mixed, scrambled and cast to 32-bit                                                                                                                                                                       
*/
u_int32_t ndpi_quick_16_byte_hash(u_int8_t *in_16_bytes_long) {
  u_int64_t a = *(u_int64_t*)(in_16_bytes_long + 0);
  u_int64_t c = *(u_int64_t*)(in_16_bytes_long + 8);

  // multipliers are taken from sprng.org, addends are prime                                                                                                                                                            
  a = a * 0x2c6fe96ee78b6955 + 0x9af64480a3486659;
  c = c * 0x369dea0f31a53f85 + 0xd0c6225445b76b5b;

  // mix results                                                                                                                                                                                                        
  a += c;

  // final scramble                                                                                                                                                                                                     
  a ^= ROR64(a, 13) ^ ROR64(a, 7);

  // down-casting, also taking advantage of upper half                                                                                                                                                                  
  a ^= a >> 32;

  return((u_int32_t)a);
}

598 Mhps! We made it!

At the end of this journey, we learned a lot. Especially the proverbial expression “your mileage may vary” proved true along the way. We are pretty sure that there might be even faster ways to go. If you feel like sending your code to us, we can get back to you with a score (Mhps). :-)

Special thanks to Logan oos Even who helped ntop once again with passion and dedication.