The main reason behind the rising popularity of data science is the incredible amount of digital data that gets stored and processed daily. Usually, this abundant data is referred to as "big data," and it's no surprise that data science and big data are often paired in the same discussion and used almost synonymously. The two are related but not the same: the existence of big data prompted the need for a more scientific approach (data science) to the consumption and analysis of this incredible wealth of data.
In order for cybersecurity professionals to see the greatest possibilities offered by big data and data science, it would be ideal to go Back to the Future to see how data insights will unfold. Lacking the time-travel expertise of that movie's Doc Brown, today’s data scientists must imagine the possibilities of how big-data analysis will inform and educate our world.
As I discussed in the first blog of this series, the application of data science techniques to cybersecurity relies on the prompt availability of massive amounts of data on which models can be built and tested to extract interesting insights.
To give you an idea of how much data needs to be processed, a medium-size network with 20,000 devices (laptops, smartphones and servers) will transmit more than 50 TB of data in a 24-hour period. That means that roughly 5 Gbits must be analyzed every second, on average, to detect cyberattacks, potential threats and malware attributed to malicious hackers! We can now understand Doc Brown’s amazement when he shouted “1.21 gigawatts!” in Back to the Future.
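The throughput estimate above can be checked with quick arithmetic. The sketch below assumes traffic is spread evenly across the day (real networks have peaks well above the average) and shows the result for both decimal and binary interpretations of "TB", since the post does not say which is meant. The function name is only illustrative.

```python
# Back-of-the-envelope check of the sustained bit rate implied by a daily volume.
SECONDS_PER_DAY = 24 * 60 * 60


def average_gbits_per_second(terabytes_per_day: float, binary_units: bool = False) -> float:
    """Average rate in Gbit/s needed to keep up with a given daily data volume."""
    # Assumption: 1 TB = 10**12 bytes (decimal) or 2**40 bytes (binary, i.e. TiB).
    bytes_per_tb = 2**40 if binary_units else 10**12
    bits_per_day = terabytes_per_day * bytes_per_tb * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


decimal_rate = average_gbits_per_second(50)
binary_rate = average_gbits_per_second(50, binary_units=True)
print(f"50 TB/day, decimal units: {decimal_rate:.2f} Gbit/s")
print(f"50 TB/day, binary units:  {binary_rate:.2f} Gbit/s")
```

Either way, the average lands in the neighborhood of 5 Gbits per second, and any volume above 50 TB pushes it higher.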
While dealing with such volumes of data in real time poses difficult challenges, we should also remember that analyzing large volumes of data is necessary to create data-science models that can detect cyberattacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
When discussing big data, the three big "V's" are often mentioned: Volume, Variety and Velocity. Let's see what these really mean in a cybersecurity context.
If a data scientist is relying on machine learning to build a model, large data samples are necessary to understand and extract new features, and properly estimate the performance of the model before deploying it in production environments. Also, when a given model is based on simple rules or heuristic findings, it is of paramount importance to test it out on large data samples to assess performance and the possible rate of false positives. When the data sample is "large" enough and, as I will discuss in the second point, has enough "variability", the data scientist can try to identify different ways of categorizing the data and unexpected properties of the data may become evident.
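To illustrate why sample size matters when assessing the rate of false positives, here is a minimal sketch that puts an approximate 95% confidence interval around a measured false positive rate. It uses the Wilson score interval, which is one common choice and not necessarily the method used in any particular product; the counts are made up for illustration.

```python
import math


def wilson_interval(false_alarms: int, benign_samples: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a false positive rate."""
    if benign_samples <= 0:
        raise ValueError("need at least one benign sample")
    n = benign_samples
    p = false_alarms / n  # observed false positive rate
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# The same observed 1% false positive rate, measured on samples of different sizes
# (hypothetical counts).
for n in (100, 10_000, 1_000_000):
    low, high = wilson_interval(false_alarms=n // 100, benign_samples=n)
    print(f"n={n:>9,}: false positive rate plausibly between {low:.3%} and {high:.3%}")
```

The observed rate is identical in all three cases, but only the larger samples pin it down tightly enough to predict how many false alarms a production deployment would generate.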
For cybersecurity data science models, "Variability" really matters more than "Variety." Variability refers to the range of values that a given feature could take in a data set.
The importance of having data with enough variability in building cybersecurity models cannot be stressed enough, and it's often underestimated. Network deployments in organizations (businesses, government agencies and private institutions) vary greatly. Commercial network applications are used differently across organizations, and custom applications are developed for specific purposes. If the data sample on which a given model is tested lacks variability, the risk of an incorrect assessment of the model’s performance is high. If a given machine learning model has been built properly (e.g., without "overtraining", which happens when the model picks up very specific properties of the data on which it has been trained), it should be able to generalize to "unseen" data. However, if the original data set lacks variability, the chance of improper modeling (for example, misclassification of a given data sample) is higher.
Volume, Variety, and Velocity (as well as Variability) are all essential characteristics of big data that have high relevance for applying data science to cybersecurity. More recent discussions on big data have also started to emphasize the concept of the "Value" of data.
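As a rough illustration of variability in the sense defined above (the range of values a feature takes in a data set), the sketch below summarizes one numeric feature for two samples. The feature (session duration in seconds) and all values are hypothetical, chosen only to contrast a sample drawn from one homogeneous network with one drawn from many different deployments.

```python
from statistics import pstdev


def variability_summary(values):
    """Summarize how widely a numeric feature ranges within a data sample."""
    values = list(values)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "std_dev": pstdev(values),       # population standard deviation
        "distinct": len(set(values)),    # number of distinct values observed
    }


# Hypothetical session durations in seconds (made-up values).
single_network = [30, 31, 29, 30, 32, 30]
many_networks = [2, 30, 45, 600, 3600, 12, 86400]

narrow = variability_summary(single_network)
broad = variability_summary(many_networks)
print("single network:", narrow)
print("many networks: ", broad)
```

A model evaluated only on the first sample has never been exercised on sessions lasting hours, so its measured performance says little about how it would behave on networks that look like the second sample.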
In the next post in this series I will start to discuss how machine learning can be applied to cybersecurity and the value of your network’s data.
Watch this video to learn how the Vectra Networks X-series platform provides something different. Instead of focusing on signatures, payloads, sandboxing, or reputations, the Vectra X-series breach detection platform looks for malicious behaviors on the network in real time using data science algorithms and machine learning. We track these behaviors regardless of device, operating system or application, and correlate multiple behaviors over time that could be missed by other solutions that monitor discrete events.