IPs and AS numbers have been randomized and are fictitious for privacy reasons.
More and more Master's degrees in technology are geared toward full-time professionals, making them coursework-oriented, terminal degrees; the professional track typically comprises 10–15 courses. Having been a researcher, however, both in academia and now in industry, I wanted something more challenging and lasting: an end goal that yields new skills, knowledge, and pragmatic experience, rather than teeny bits of “hands-on” work here and there in a course, for a letter grade.
Therefore, I asked the Georgia Tech administration to let me pursue the Project route for my remaining free electives. Thankfully, my request was approved!
The Project
The project, supervised by Dr. Maria Konte, consisted of analyzing a very large number of real-time flows (“NetFlow”). The flows first had to be captured (we are talking about upwards of 150 million NetFlow records over a ten-minute window alone) while minimizing, if not completely eliminating, any “drops”. The collected traffic was then to be analyzed for its volumetric characteristics, such as the flow size distribution, entropy, estimated cardinality, and heavy hitter counts.
The goal was to have aggregate data per destination AS (Autonomous System) for all the source IPs connecting to that AS, and to output these stats into a nice JSON file.
These network volumetric characteristics can have several applications: they can be used for research purposes, for generating graphs and analyzing trends, or be further used in Machine Learning (ML) cybersecurity projects, such as DoS detection.
Flow vs. Packet
By the way, while a packet refers to a “chunk” of data being transmitted across the network, a flow is metadata about that packet, without its actual “body” or “content”.
Therefore, think of a flow as all of the fields of a TCP packet except the Data field and a few others.
In summary, this meant the project consisted of at least the following tasks:
- Capturing millions of real-time network flows, pouring in from multiple sources per second, and saving them into text files, for example files with lines of the following format (a small parsing sketch follows this list):
srcIP|srcPort|dstIP|dstPort|Protocol|srcAS|dstAS|timestamp...
- Choosing an efficient, fast data structure to hold this information somewhere — for further processing.
- Processing these flows to extract ARE, cardinality, entropy, etc.
- Saving the extracted characteristics as JSON files.
- Repeating the above steps for another 10-minute iteration.
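To make the flow-line format above concrete, here is a minimal C++ sketch of splitting one such pipe-delimited line into its fields. The field order follows the example above, but the sample values and the splitFlowLine helper are ours, purely for illustration:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one pipe-delimited flow line into its fields:
// srcIP|srcPort|dstIP|dstPort|Protocol|srcAS|dstAS|timestamp
std::vector<std::string> splitFlowLine(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '|')) {
        fields.push_back(field);
    }
    return fields;
}

int main() {
    std::vector<std::string> f =
        splitFlowLine("10.0.0.1|443|10.0.0.2|51512|6|64500|65000|1543506480");
    std::cout << "srcIP=" << f[0] << " dstAS=" << f[6] << "\n";
    return 0;
}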
I admit, this is one of the nerdiest things I have ever accomplished.
The Challenge
One of the main challenges we faced was combating memory limitations and speed. Handling hundreds of millions of flows received over a 10-minute window may actually take a good 10–15 minutes or more, if your code sucks. We are also talking about C++ here, so one needs to be very careful: simple things like sloppy memory handling, excessive I/O operations, and a poor data structure choice yield punishing results as the size of the dataset increases.
The challenge was to get this processing time down as much as possible. With some of the methods, or “hacks”, mentioned below, I was able to get the time for everything down to ~4.5 minutes for the first iteration of 10-minute data and ~2 minutes for subsequent iterations. That may not sound ideal, but for the kind of data we were dealing with, it was good enough.
The Pipeline
Our pipeline consisted of the following steps. Listed below are the solutions and optimizations we used for each of them.
1. Flow Collection
For our flow collector, we used nProbe. Luckily, the folks at nTop were kind enough to approve our request for an academic license. nProbe quietly listens on a specified port on your machine (“the collector”) for incoming packets. For every packet, it generates the flow metadata containing vital information, as shown below:
nProbe comes with a variety of command-line options that let you enable packet sampling, specify the output format, alter the frequency of dumping files (e.g. every 2 minutes, 5 minutes, etc.), and minimize flow drops. With some experimentation, this challenge was solved.
2. Data Structure
Ah, another challenging part!
For every flow line extracted from the nProbe output shown above, we needed to hold this information in a suitable C++ data structure that would group the flows by a certain key: the destination AS. We needed a list of flows for every destination AS, something like:
AS Number: {list of flows belonging to this AS}
dstAS_1234: {"1.1.1.1|2.2.2.2|....", "2.7.8.9|5.6.7.8|23|...", ...}
dstAS_5678: {"2.2.2.2|1.1.1.1|....", "3.7.5.8|66.24.55.31|23|...", ...}
Now, I know the first thing that comes to your mind is a hashmap or a dictionary. In theory, hashmaps give you constant-time, O(1), access on average (degrading toward O(N) in the worst case, where N is the number of elements), but you don’t know how big that constant is until you actually try. It could be a very small value, or it could be such an ugly value that the notion of “constant time” becomes worthless in practice. We were optimistic and used unordered_map at first, but the practical results were very disappointing.
As a few thousand flows approached the counts of millions, every insert operation was drastically slowing the pipeline down. That is because, unlike a lookup, an insertion is only amortized constant time: whenever the map outgrows its load factor it must rehash, which is linear in the number of elements, and every insertion also pays for a heap allocation. So, the first few thousand flow insertions would only take ~10 seconds, the next few would take 2 minutes, and the subsequent ones made the entire program excruciatingly slow, to the point where the whole process took over 40 minutes to terminate.
So, we started experimenting with different kinds of hashmaps: boost::unordered_map, Facebook’s Folly AtomicHashMap, etc.
Was the performance maybe slightly better with these? Maybe so.
Was the performance acceptable? Far from it.
No matter how efficient the data structure, the constant stream of inserts kept growing and rehashing the map, making it bigger and bigger, and its performance worse and worse. This was killing our program!
This seemed to be one of the downsides of relying on dynamic memory allocation inside a hot loop. Memory allotted in advance for a fixed-size data structure, on the other hand, improved the performance significantly, for an obvious reason: a single up-front allocation avoids the repeated reallocations and rehashes that growing a structure on the fly incurs.
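As a rough illustration of why allocating up front helps (a toy benchmark of ours, not the actual pipeline code), compare growing a vector element by element against reserving its full capacity in advance:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Time how long it takes to push_back n elements, optionally reserving capacity first.
double fillVector(std::size_t n, bool preallocate) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::uint64_t> v;
    if (preallocate) v.reserve(n);                         // one allocation up front
    for (std::size_t i = 0; i < n; ++i) v.push_back(i);    // no reallocations if reserved
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    const std::size_t n = 10000000;  // scale up toward real flow counts as memory allows
    std::cout << "dynamic growth  : " << fillVector(n, false) << " s\n";
    std::cout << "reserved upfront: " << fillVector(n, true)  << " s\n";
    return 0;
}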
The Tradeoff
Could we do this with C-style arrays or the more flexible vectors instead of relying on maps? Hmm…
We could not have, had the key of our data been a really long “string”; we would then have been stuck with a dictionary or a map. But it wasn’t. An AS number is merely a 16-bit or a 32-bit integer.
Even though the AS range is pretty vast, over 99.8% of our dataset dealt with only 16-bit ASes. Therefore, a data structure like the following was adequate:
OurMap[index] = {list of tuples belonging to the AS index}.
We basically used the “index” of the array (or vector) as a data field itself: the AS number.
For our purposes, we used 2D vectors, with indices for the 1st dimension going from 0 up to 65535, representing 16-bit ASes. For the second dimension, the chosen type was a vector with no pre-allocated size. So, in practice, our “map” was something like:
vector<ourFlow> ourMap[65536]; //indices 0 to 65,535
Or, better yet, in vector form:
vector<vector<ourFlow>> ourMap;
//I'll explain what 'ourFlow' is below.
And, to create all 65,536 inner vectors up front for the outer vector, we used the resize function of the std::vector class (reserve alone only sets the capacity; resize actually constructs the elements, so any AS index can be accessed directly):
ourMap.resize(65536);
This combination of up-front allocation with dynamic data structures was the perfect fix for our purposes and made the entire program significantly faster. Reading 10 minutes of flows (~150 million or more) from ~1700 text files totaling 1.5 GB in size, converting the text flows into binary, and populating this data structure took around 4.5 minutes.
Another hack: we actually wanted to attach the epoch timestamp to every single flow being read, based on when it was collected. Instead of transforming this array into a 3D array, which would have destroyed the performance, we made the second dimension of this 2D vector a different type: a C-style struct instead of a string. This structure would now contain our flow information encoded as binary data (i.e. bytes represented as char) and the timestamp as an int. So the struct would look something like:
struct ourFlow {
char fiveTuple[13];
int epoch_timestamp;
};
The fiveTuple field contains the essential flow data in binary: source IP (32 bits), source port (16 bits), destination IP (32 bits), destination port (16 bits), protocol (8 bits). The total adds up to 104 bits = 13 bytes.
We could have made fiveTuple a bigger char array and in fact injected the epoch timestamp, in binary, at the end too. But because the fiveTuple field was to be consumed by Elastic Sketch, as explained below, we decided to keep the timestamps separate from the binary data.
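Putting the pieces together, here is a hedged sketch of how the 13-byte key could be packed and stored in the per-AS structure, assuming a POSIX system for inet_addr. The packFiveTuple helper and the sample values are ours, for illustration only (the ourFlow struct from above is repeated so the sketch is self-contained); the actual project code may differ:

#include <arpa/inet.h>   // inet_addr, htons
#include <cstdint>
#include <cstring>
#include <vector>

struct ourFlow {
    char fiveTuple[13];   // srcIP(4) | srcPort(2) | dstIP(4) | dstPort(2) | proto(1)
    int  epoch_timestamp; // when the flow was collected
};

// Pack the five tuple into the 13-byte binary key.
ourFlow packFiveTuple(const char* srcIP, uint16_t srcPort,
                      const char* dstIP, uint16_t dstPort,
                      uint8_t proto, int epoch) {
    ourFlow f{};
    uint32_t sip = inet_addr(srcIP);   // already in network byte order
    uint32_t dip = inet_addr(dstIP);
    uint16_t sp = htons(srcPort), dp = htons(dstPort);
    std::memcpy(f.fiveTuple + 0,  &sip, 4);
    std::memcpy(f.fiveTuple + 4,  &sp,  2);
    std::memcpy(f.fiveTuple + 6,  &dip, 4);
    std::memcpy(f.fiveTuple + 10, &dp,  2);
    f.fiveTuple[12] = static_cast<char>(proto);
    f.epoch_timestamp = epoch;
    return f;
}

int main() {
    // One inner vector per 16-bit AS number; the AS number itself is the index.
    std::vector<std::vector<ourFlow>> ourMap;
    ourMap.resize(65536);

    uint16_t dstAS = 64500;  // example private-range AS number
    ourMap[dstAS].push_back(packFiveTuple("10.0.0.1", 443, "10.0.0.2", 51512, 6, 1543506480));
    return 0;
}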
3. Processing with Elastic Sketch
Now we needed a framework capable of processing 150 million flows super-fast to give us its volumetric characteristics.
After having little to no luck with the outdated OpenSketch framework, and failing in our attempts to contact its creators, we decided to give Elastic Sketch a shot, and we couldn’t be more glad that we did. The researchers who created this framework are brilliant and quite responsive too: whenever we reached out for support in setting up the code, or pointed out bugs in it, they were quick to address them.
For those interested in the theory, the original paper is linked below, and there is a GitHub repository as well.
NOTE: If you do choose to use Elastic Sketch, make sure that your system supports the AVX/AVX2 instruction set. It is an often overlooked prerequisite, but perhaps the most important one. Your scripts will fail with errors that will take you an eternity to debug without any luck, simply because the failure is actually occurring at the assembly-instruction level. AVX/AVX2 instructions are simply not supported on some (older) systems, and the Elastic Sketch tasks (mainly elastic.cpp) will fail fatally in such a scenario.
Elastic Sketch, using the char fiveTuple[13]; field of the structure, which contains the most essential information about the flow, was able to give us the following characteristics for every destination AS (which is how we chose to group the flows). These terms are explained in the paper above, if needed; a naive reference computation of a couple of them follows the list.
- ARE (Average Relative Error)
- Flow Size Distribution
- Entropy
- Estimated Cardinality
- Heavy Hitters — the offending source IPs hitting the (destination) AS the most.
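For intuition only, here is an exact, naive computation of estimated cardinality and entropy over a toy per-flow packet-count map: with exact counts, the “cardinality” is just the number of distinct flows, and the entropy is the standard -Σ p_i log2(p_i) over each flow’s share of packets. Elastic Sketch approximates such statistics from compact sketches rather than exact counts, so this is a reference for the definitions, not its internal method:

#include <cmath>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    // flow key -> number of packets seen for that flow (toy data)
    std::map<std::string, uint64_t> flowCounts = {
        {"flowA", 50}, {"flowB", 30}, {"flowC", 20}
    };

    // With exact data, cardinality is simply the number of distinct flows.
    std::size_t cardinality = flowCounts.size();

    // Entropy over the flow-size distribution: -sum(p_i * log2(p_i)).
    uint64_t total = 0;
    for (const auto& kv : flowCounts) total += kv.second;
    double entropy = 0.0;
    for (const auto& kv : flowCounts) {
        double p = static_cast<double>(kv.second) / total;
        entropy -= p * std::log2(p);
    }

    std::cout << "cardinality = " << cardinality << ", entropy = " << entropy << "\n";
    return 0;
}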
4. The Output
We simply expressed the output returned by Elastic Sketch as JSON files.
As can be seen below, for every key (destination AS number), we now have aggregate data, with the heavy hitters being the source IPs hitting the destination AS the most (above a certain threshold). Note that the source IPs are IPv4 addresses expressed in integer format. So, for example, 967125800 would be 57.165.43.40. To test this theory, simply enter 967125800/ into your Chrome browser’s address bar and hit Return.
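If you would rather do that conversion in code than in the browser, here is a tiny sketch, assuming (as in the example above) that the integer encodes the address as a*2^24 + b*2^16 + c*2^8 + d:

#include <cstdint>
#include <iostream>
#include <string>

// Convert an IPv4 address stored as a decimal integer back into dotted-quad notation.
std::string intToDottedQuad(uint32_t ip) {
    return std::to_string((ip >> 24) & 0xFF) + "." +
           std::to_string((ip >> 16) & 0xFF) + "." +
           std::to_string((ip >> 8)  & 0xFF) + "." +
           std::to_string(ip & 0xFF);
}

int main() {
    std::cout << intToDottedQuad(967125800) << "\n";  // prints 57.165.43.40
    return 0;
}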
The total time taken by Elastic Sketch for generating the aggregates was around 2 minutes.
Each of these JSON files is only 2.8 MB in size but contains stats for 10 minutes of flows. That’s the beauty of Elastic Sketch: speeding through 1.5 gigs worth of flows to return aggregates.
5. Sliding Window
We used a “sliding window” approach in real time: our focus was to run Elastic Sketch on every 10 minutes of flows, with the window sliding forward by 1 minute. So, say we captured our first 150 million flows from 13:25 to 13:35; the next iteration of Elastic Sketch will run on the flows captured between 13:26 and 13:36. This means that, for subsequent iterations, we are only:
- removing the flows with a timestamp of 13:25 from the existing data structure,
- adding the flows for the minute 13:36 to the existing data structure from the nProbe text files, and
- re-running Elastic Sketch (task elastic.cpp) on this updated structure.
We need not repopulate the entire data structure from scratch, which would have taken another 4–5 minutes.
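A hedged sketch of what the eviction step could look like (the evictExpiredFlows name and the cutoff handling are illustrative, not the exact project code):

#include <algorithm>
#include <vector>

struct ourFlow {
    char fiveTuple[13];
    int  epoch_timestamp;
};

// Drop every flow older than the new window start from each per-AS bucket;
// the newest minute of flows is then appended with push_back as before,
// and Elastic Sketch is re-run on the updated structure.
void evictExpiredFlows(std::vector<std::vector<ourFlow>>& ourMap, int windowStartEpoch) {
    for (auto& flows : ourMap) {
        flows.erase(std::remove_if(flows.begin(), flows.end(),
                                   [windowStartEpoch](const ourFlow& f) {
                                       return f.epoch_timestamp < windowStartEpoch;
                                   }),
                    flows.end());
    }
}

int main() {
    std::vector<std::vector<ourFlow>> ourMap(65536);
    evictExpiredFlows(ourMap, 1543506480);  // epoch of the new window's start minute
    return 0;
}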
Hopefully, this explains why we were sneaking epoch timestamps into our original data structure for every single flow. The epoch timestamps also help keep track of the directory structure, because nProbe, depending on how it is configured, saves flow files into timestamped directories: e.g. /output/2018/11/29/15/48-121.flows (Year/Month/Day/Hour/Minute-fileNumber.flows).
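For example, the directory prefix for a flow file can be derived from an epoch timestamp along these lines (the /output prefix and the file-number suffix below are placeholders, and whether UTC or local time applies depends on how nProbe is configured):

#include <ctime>
#include <iostream>
#include <string>

// Build the nProbe-style Year/Month/Day/Hour/Minute directory prefix from an epoch timestamp.
std::string flowPathPrefix(std::time_t epoch) {
    std::tm* t = std::gmtime(&epoch);   // or std::localtime, depending on the setup
    char buf[64];
    std::strftime(buf, sizeof(buf), "/output/%Y/%m/%d/%H/%M", t);
    return std::string(buf);
}

int main() {
    std::cout << flowPathPrefix(1543506480) << "-121.flows\n";  // 2018-11-29 15:48 UTC
    return 0;
}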
That’s it!
Special Thanks!
I would like to express my sincerest gratitude towards:
Georgia Tech
Dr. Maria Konte, who supervised this project and has been nothing short of a supportive and reliable advisor. She challenged me intellectually at every level throughout this project.
Elastic Sketch
Check out the Elastic Sketch paper and source code on Github.
Amazing open-source framework for quickly gathering stats for large amounts of network data. Props to researchers Jie Jiang of the Institute of Network Computing and Information Systems at Peking University, China, and Dr. Steve Uhlig of Queen Mary University of London for their prompt responses when asked for help!
nTop
For granting us an academic license to use their sophisticated product, nProbe: https://www.ntop.org/products/netflow/nprobe/
Bye-bye Georgia Tech! Like all great things, our journey must come to an end…
I’ll miss you, but hey… I got out!