
Politics and the bungling of big data

By:
David Pegna
November 17, 2016

We live in an age in which big data and data science are used to predict everything from what I might want to buy on Amazon to the outcome of an election.

The result of the Brexit referendum caught many by surprise because pollsters had suggested that a “remain” vote would prevail. We all know how that turned out.

History repeated itself on Nov. 8, when U.S. president-elect Donald Trump won his bid for the White House. Most polls and pundits had predicted a Democratic victory, and few questioned those predictions.

The Wall Street Journal article “Election Day Forecasts Deal Blow to Data Science” made three very important points about big data and data science:

  • Dark data, data that is unknown, can result in misleading predictions.
  • Asking simplistic questions yields a limited data set that produces ineffective conclusions.
  • “Without comprehensive data, you tend to get non-comprehensive predictions.”
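The first point is easy to see with a little arithmetic. Here is a minimal sketch in Python, using made-up numbers that do not come from any real poll: a pollster reaches 600 of 1,000 voters, and the 400 who are never reached (the dark data) lean the other way.

```python
def support_share(groups):
    """Return candidate A's share of the vote across the given groups.

    Each group is a (voters, supporters_of_A) pair.
    """
    voters = sum(v for v, _ in groups)
    supporters = sum(s for _, s in groups)
    return supporters / voters


# Hypothetical numbers, chosen only to illustrate the effect.
reached = (600, 330)     # voters the pollster could reach: 55% back A
unreached = (400, 140)   # "dark" voters the poll never saw: 35% back A

poll_estimate = support_share([reached])
actual_result = support_share([reached, unreached])

print(f"Poll estimate for A: {poll_estimate:.0%}")
print(f"Actual result for A: {actual_result:.0%}")
```

The poll is perfectly accurate about the people it reached, yet it calls the wrong winner because the missing voters were not like the ones who answered.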

Keep the baby, drain the bath water

A powerful new application of data science uses data to detect and stop cyber attacks in real time. Think of it as stopping the next Target, Anthem or Sony Pictures data breach.

Data science has produced critical discoveries like the Higgs boson, a scientific breakthrough to which I am proud to have contributed. Now, my team and I apply our data science minds to detecting hidden threats and cyber attacks on the businesses you trust.

So, from a data science perspective, what are the lessons learned from the big data blunders in election predictions? The lesson is all about using the right data for the problem at hand, and not about questioning whether the data is right. The same applies to cybersecurity.

Using the wrong data

Cybersecurity that relies on logs as the data source suffers the same fate as election predictions built on dark data.

Logs provide detailed information about user identity and computers. For example, a log can tell us that Kevin accessed a database at 10:03 p.m. or Emily visited a Russian website at 5:32 a.m.

The belief is that logs are the fingerprints that reveal a cyber attacker’s presence. However, data breach victims never knew the attacker was there. Sophisticated attackers are experts at hiding in plain sight and never leaving any evidence they were there.

Asking simplistic questions

Cybersecurity that relies on flow data like NetFlow is similar to relying on a pollster who asks simplistic questions.

Attackers who perpetrate the most sophisticated cyber heists like the Carbanak banking theft use remote access Trojans (RATs) to remotely control their attack. Flow data reveals that an internal computer communicated with an external one, when it started and ended, and how much data was sent and received. But flow data can’t distinguish between Web browsing and a RAT.
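To make that concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `Packet` and `FlowRecord` are simplified, hypothetical structures (not the actual NetFlow record format), the IP addresses are documentation placeholders, and both sessions are made up. The point is only that two very different conversations can collapse into the same flow summary.

```python
from dataclasses import dataclass
from typing import List, NamedTuple


class Packet(NamedTuple):
    time_s: int      # seconds since the start of the capture
    direction: str   # "out" = internal to external, "in" = external to internal
    size: int        # bytes


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical flow summary: who talked, when, and how much."""
    src: str
    dst: str
    start_s: int
    end_s: int
    bytes_sent: int
    bytes_received: int


def to_flow(src: str, dst: str, packets: List[Packet]) -> FlowRecord:
    """Collapse a session into flow-level totals, discarding packet detail."""
    return FlowRecord(
        src=src,
        dst=dst,
        start_s=min(p.time_s for p in packets),
        end_s=max(p.time_s for p in packets),
        bytes_sent=sum(p.size for p in packets if p.direction == "out"),
        bytes_received=sum(p.size for p in packets if p.direction == "in"),
    )


def directions(packets: List[Packet]) -> List[str]:
    """Who speaks at each step: detail that the flow summary throws away."""
    return [p.direction for p in packets]


# Made-up browsing session: the internal machine asks, the server answers.
browsing = [
    Packet(0, "out", 500), Packet(1, "in", 4000),
    Packet(30, "out", 500), Packet(31, "in", 4000),
    Packet(60, "out", 500), Packet(61, "in", 4000),
]

# Made-up remote-control session: the outside machine drives each exchange.
remote_control = [
    Packet(0, "in", 3000), Packet(1, "out", 375),
    Packet(20, "in", 3000), Packet(21, "out", 375),
    Packet(40, "in", 3000), Packet(41, "out", 375),
    Packet(60, "in", 3000), Packet(61, "out", 375),
]

browsing_flow = to_flow("10.0.0.5", "203.0.113.7", browsing)
remote_flow = to_flow("10.0.0.5", "203.0.113.7", remote_control)

print(browsing_flow == remote_flow)                        # same flow record
print(directions(browsing) == directions(remote_control))  # different behavior
```

Both sessions produce an identical flow record, yet the packet-level view shows that a different side is driving the conversation in each. Which side drives each exchange is used here as just one example of a signal that lives in the traffic itself, not as a complete way to detect a RAT.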

Using the right data to make comprehensive decisions

If you want to find a cyber threat in your computer network, then the most truthful source of big data is your computer network traffic.

Data science enables you to make very rapid decisions based on incredibly big data sets. In fact, data science recently enabled a robot to set a new record by solving a Rubik’s Cube in less than a second.

Data science likewise enables cybersecurity to listen to all the computer traffic on a network to find cyber attackers in the act and stop them before they steal personal, health or financial information.

The key is using the right “big data” – in this case, network traffic – for data science to make the right decisions.

Let’s hope that pollsters in the next election learn from the past and use the right data source to predict outcomes. In the meantime, you can learn more about how the right data and data science create security that thinks.

An aside: Protecting without prying

If you are worried about sacrificing the privacy of your email or Web browsing to protect your health, financial or personal information, then read how data science can actually protect without prying. You can also check out our white paper: The data science behind Vectra threat detections.


About the author

David Pegna

David Pegna is the director of data science at Vectra AI with over ten years of experience in data analysis and mining, machine learning and predictive modeling.
