The main reason behind the rising popularity of data science is the incredible amount of digital data that gets stored and processed daily. This abundant data is usually referred to as "big data," and it's no surprise that data science and big data are often paired in the same discussion and used almost synonymously. The two are related but not the same: the existence of big data prompted the need for a more scientific approach, data science, to the consumption and analysis of this incredible wealth of data.
In order for cybersecurity professionals to see the greatest possibilities offered by big data and data science, it would be ideal to go Back to the Future to see how data insights will unfold. Lacking the time-travel expertise of that movie's Doc Brown, today’s data scientists must imagine the possibilities of how big-data analysis will inform and educate our world.
As I discussed in the first blog of this series, the application of data science techniques to cybersecurity relies on the prompt availability of massive amounts of data on which models can be built and tested to extract interesting insights.
To give you an idea of how much data needs to be processed, a medium-size network with 20,000 devices (laptops, smartphones and servers) will transmit more than 50 TB of data in a 24-hour period. That means that roughly 5 Gbits must be analyzed every second to detect cyberattacks, potential threats and malware attributed to malicious hackers! We can now understand Doc Brown’s amazement when he shouted “1.21 gigawatts!” in Back to the Future.
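The throughput figure can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes the 50 TB is spread evenly over the day (real traffic is bursty), and since the text does not say whether "TB" means decimal or binary terabytes, it computes both. The function name and constants are illustrative, not part of any product.

```python
# Back-of-the-envelope check of the average rate implied by 50 TB per day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_gbits_per_second(total_bytes, seconds=SECONDS_PER_DAY):
    """Average rate in gigabits per second (1 Gbit = 10**9 bits)."""
    return total_bytes * 8 / seconds / 1e9


# Assumption: the source does not specify which definition of "TB" it uses.
decimal_tb = 50 * 10**12  # 50 TB as decimal terabytes
binary_tb = 50 * 2**40    # 50 TB as binary terabytes (TiB)

print(f"decimal TB: {average_gbits_per_second(decimal_tb):.2f} Gbit/s")
print(f"binary TB:  {average_gbits_per_second(binary_tb):.2f} Gbit/s")
```

Under these assumptions the average lands between roughly 4.6 and 5.1 Gbit/s, so "about 5 Gbits every second" is the right order of magnitude, and peak rates would be higher.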
While dealing with such volumes of data in real time poses difficult challenges, we should also remember that analyzing large volumes of data is necessary to create data-science models that can detect cyberattacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
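To make the two error types concrete, here is a minimal sketch that computes the false-positive rate and false-negative rate from a list of true labels and a detector's predictions. The labels are made-up toy data and `error_rates` is a hypothetical helper, not something described in this post; it only illustrates what "minimizing both" is measuring.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate).

    Labels are binary: 1 = real attack, 0 = benign activity.
    """
    pairs = list(zip(y_true, y_pred))
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed threats
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Hypothetical toy data: 5 benign events and 3 attacks.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 0, 1]

fpr, fnr = error_rates(y_true, y_pred)
print(f"false-positive rate: {fpr:.2f}, false-negative rate: {fnr:.2f}")
```

In practice the two rates trade off against each other: a more sensitive detector misses fewer threats but raises more false alarms, which is one reason larger and more varied training data helps.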
When discussing big data, the three big "V's" are often mentioned: Volume, Variety and Velocity. Let's see what these really mean in a cybersecurity context.
Volume, Variety, and Velocity (as well as Variability) are all essential characteristics of big data that have high relevance for applying data science to cybersecurity. More recent discussions on big data have also started to emphasize the concept of the "Value" of data.
In the next post in this series, I will start to discuss how machine learning can be applied to cybersecurity and the value of your network’s data.
Watch this video to learn how the Vectra Networks X-series platform provides something different. Instead of focusing on signatures, payloads, sandboxing, or reputations, the Vectra X-series breach detection platform looks for malicious behaviors on the network in real time using data science algorithms and machine learning. We track these behaviors regardless of device, operating system or application, and correlate multiple behaviors over time that could be missed by other solutions that monitor discrete events.
David Pegna is the director of data science at Vectra AI with over ten years of experience in data analysis and mining, machine learning and predictive modeling. At Vectra he is responsible for the development of analytical models for malware detection and real-time insights into advanced persistent attacks. Before joining Vectra, he was a data science consultant for Apple. He received bachelor's, master's, and Ph.D. degrees in nuclear and subnuclear physics from the University of Pavia as well as an international certificate of doctorate studies in particle physics from the University of California, Berkeley.