Traffic monitoring sometimes requires data deduplication: because of topology or hardware constraints, some network traffic is monitored by multiple devices, while other traffic is monitored by a single device only. Unless some corrections are configured, traffic measurements are wrong and thus useless. Fortunately, we have implemented some features that allow you to avoid this problem by discarding duplicated traffic before it hits the collector. The collector is already busy with the various activities it has to carry out, so it is better to avoid duplicates at the source (i.e. on the nProbe side) rather than on the collector side, where deduplication rules can become complicated when multiple issues are mixed in the same network.
As there are multiple scenarios and solutions, the use cases listed below explain each of them in detail:
A. Packet Deduplication
This section applies when nProbe sniffs traffic and converts it into flows. For flow collection, please move to section B. Remember that PF_RING comes out of the box with utilities for aggregating, distributing, and dropping packets: see for instance zbalance for more information about this topic.
A1. Overlapping Packets/Networks
In some networks, merging packets coming from various sources produces some duplication, because some packets (e.g. those of a specific subnetwork only) are observed by multiple probes. If this problem is not addressed, data is partially duplicated for the networks/hosts that are observed multiple times. The simplest solution is to use packet filtering to discard the packets that can be duplicated. For instance, suppose that nProbe A sees traffic of network 172.16.0.0/16 in addition to other traffic, and that nProbe B also sees traffic of network 172.16.0.0/16 in addition to other traffic that nProbe A does not see (i.e. the only overlap is on network 172.16.0.0/16). In this case, on either nProbe A or B (not on both!) you can add -f "not net 172.16.0.0/16"
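The effect of the filter can be illustrated with a small simulation. This is only a sketch: the packets and addresses are made up, and the function models the usual BPF meaning of "net" (a match on either the source or the destination address), not nProbe internals.

```python
import ipaddress

# Network observed by both probes (from the example above).
OVERLAP = ipaddress.ip_network("172.16.0.0/16")

def touches_overlap(src: str, dst: str) -> bool:
    """True when either endpoint belongs to the overlapping network."""
    return (ipaddress.ip_address(src) in OVERLAP
            or ipaddress.ip_address(dst) in OVERLAP)

# Hypothetical (source, destination) pairs seen by each probe.
seen_by_a = [("172.16.1.5", "8.8.8.8"), ("10.1.0.1", "10.1.0.2")]
seen_by_b = [("172.16.1.5", "8.8.8.8"), ("192.168.5.1", "1.1.1.1")]

# Probe B runs with -f "not net 172.16.0.0/16", so it keeps only
# the packets that do not touch the overlapping network.
kept_by_b = [pkt for pkt in seen_by_b if not touches_overlap(*pkt)]

# Each packet now reaches the collector exactly once.
merged = seen_by_a + kept_by_b
```

If the filter were set on both probes, traffic for 172.16.0.0/16 would disappear entirely, which is why it must be applied on one side only.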
A2. Consecutive Duplicated Packets
In some cases packets are observed twice, meaning that the hardware (e.g. a packet broker or a mirror) emits the same packet twice. This is the simplest form of duplication, as you see some packets twice and others once. nProbe can deduplicate this traffic by discarding consecutive copies of the same packet with the following option: --enable-ipv4-deduplication
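To make the idea of "consecutive copies" concrete, here is a minimal sketch. It is not nProbe's implementation: comparing the full packet bytes against the previous packet is an assumption made for illustration, and the packet contents are placeholders.

```python
from typing import Iterable, Iterator, Optional

def drop_consecutive_duplicates(packets: Iterable[bytes]) -> Iterator[bytes]:
    """Yield packets, skipping any packet identical to the one just before it."""
    previous: Optional[bytes] = None
    for packet in packets:
        if packet == previous:
            continue  # same bytes as the previous packet: treat as a duplicate
        previous = packet
        yield packet

# Hypothetical stream where the hardware emitted some packets twice.
stream = [b"pkt-1", b"pkt-1", b"pkt-2", b"pkt-3", b"pkt-3", b"pkt-1"]
deduplicated = list(drop_consecutive_duplicates(stream))
```

Note that only back-to-back copies are removed: the last b"pkt-1" is kept because it does not immediately follow another b"pkt-1".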
B. Flow Deduplication
This section applies to nProbe when used in flow collector mode (i.e. when a router creates flows and they are collected by nProbe).
B1. Overlapping Flows
This is the same as A1 but for flows instead of packets. In case some networks are observed by multiple probes, you need to set a filter on all collectors but one (the one that will emit flows for the duplicated network) to discard duplicated flows. The option --collection-filter <filter> allows you to specify a flow collection filter. The filter can be a network or an AS; if you have multiple filters, you can separate them with a comma. Example: --collection-filter "192.168.0.0/24" means that flows where one of the peers (it doesn't matter if source or destination) belongs to 192.168.0.0/24 are discarded. Instead, --collection-filter "!192.168.0.0/24" means that flows where none of the peers belongs to 192.168.0.0/24 are discarded. You can also filter flows based on the autonomous system (remember to load the geoip dat files). Example: --collection-filter "!as12345"
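The two network examples can be modelled as follows. This is a sketch of the semantics as described in this section, not nProbe code; the function name is hypothetical, and it handles a single network filter only (no AS filters, no comma-separated lists).

```python
import ipaddress

def is_discarded(collection_filter: str, src: str, dst: str) -> bool:
    """Model the described behaviour for a single network filter."""
    negated = collection_filter.startswith("!")
    network = ipaddress.ip_network(collection_filter.lstrip("!"))
    peer_in_network = any(
        ipaddress.ip_address(peer) in network for peer in (src, dst)
    )
    # "net":  discard when at least one peer is in the network.
    # "!net": discard when no peer is in the network.
    return (not peer_in_network) if negated else peer_in_network

# Hypothetical flows, using the network from the examples above.
inside = ("192.168.0.10", "8.8.8.8")   # one peer in 192.168.0.0/24
outside = ("10.0.0.1", "8.8.8.8")      # no peer in 192.168.0.0/24

plain = [is_discarded("192.168.0.0/24", *f) for f in (inside, outside)]
negated = [is_discarded("!192.168.0.0/24", *f) for f in (inside, outside)]
```

With the plain filter only the first flow is discarded; with the negated filter only the second one is.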
B2. Partially Duplicated Flows
Due to high availability and routing, some flows can be observed more than once depending on traffic conditions. Flows can be duplicated constantly, or only when some conditions occur (e.g. the main path is down and a backup path is observed). As the duplication is completely dynamic and unpredictable, there is no rule of thumb for discarding duplicated flows, and the best option is to use --flow-deduplication <interval (sec)>. In essence, you create a sliding time window of X seconds: if the same flow is observed multiple times within the window, only the first copy is emitted and the following copies are discarded. Reasonable values for the time interval are 15 or 30 seconds, which is enough to deduplicate flows without making the deduplication cache too large.
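The time-window logic can be sketched as a small cache keyed by flow. This is an illustration under stated assumptions, not nProbe's implementation: the class name, the 5-tuple used as flow key, and the choice to measure the window from the first emitted copy are all hypothetical.

```python
from typing import Dict, Hashable

class FlowDeduplicator:
    """Emit the first copy of a flow, drop copies seen within the interval."""

    def __init__(self, interval_sec: float) -> None:
        self.interval_sec = interval_sec
        self._first_seen: Dict[Hashable, float] = {}

    def should_emit(self, flow_key: Hashable, now: float) -> bool:
        # Purge entries older than the interval so the cache stays bounded.
        expired = [key for key, seen in self._first_seen.items()
                   if now - seen >= self.interval_sec]
        for key in expired:
            del self._first_seen[key]

        if flow_key in self._first_seen:
            return False  # duplicate inside the time window: discard
        self._first_seen[flow_key] = now
        return True

dedup = FlowDeduplicator(interval_sec=15)
flow = ("10.0.0.1", "10.0.0.2", 12345, 443, "tcp")  # hypothetical flow key

# The same flow arrives at t=0s, t=5s and t=20s.
decisions = [dedup.should_emit(flow, t) for t in (0.0, 5.0, 20.0)]
```

The copy at 5 seconds is dropped, while the one at 20 seconds is emitted again because the 15-second window has expired. A longer interval catches duplicates that arrive later, but keeps every flow key in the cache for longer, which is the trade-off behind the suggested 15 or 30 seconds.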
Final Remarks
As you have read, there is no single solution to this problem, as there are many use cases. nProbe offers a plethora of solutions that should allow you to cover all the possible use cases.
We hope this article walked you through all the options nProbe offers. If you have questions, feedback, or anything else, get in touch with us and let us know.
Enjoy!