In a recent article, Enterprise Security Startups Are Booming – So Why Is Security Getting Worse?, Tom Foremski recaps a dinner event he attended in Silicon Valley last week, organized by Eastwick Communications. The question posed in the article, "Why is security getting worse?", is easy to answer. Three reasons.
- The attack surface of a typical enterprise is enormous now thanks to mobility, the consumerization of IT, social media, BYOD, cloud services, etc.
- Defense is down. Of the ~$60B a year spent on global IT security, a huge chunk goes to signature-based prevention that is easily bypassed. Some say 50% effective at best; others say 5% at best. People? Do the math: a $1B enterprise spends 5% of revenue on IT, 10% of that on security, and 30% of that on security personnel. That's about 8-10 heads, give or take, to guard a massive and growing attack surface.
- The enemy is smart, organized, automated, relentless and numerous. Reportedly 400,000 nation-state supported hackers in China alone.
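The "do the math" exercise above can be sketched out explicitly. This is back-of-envelope only; the fully-loaded cost per security analyst is an assumed figure, not one from the article.

```python
# Back-of-envelope staffing math from the bullets above.
# The cost-per-head figure is an assumption for illustration.
revenue = 1_000_000_000                     # a $1B enterprise
it_budget = revenue * 0.05                  # 5% of revenue on IT -> $50M
security_budget = it_budget * 0.10          # 10% of IT on security -> $5M
personnel_budget = security_budget * 0.30   # 30% of that on people -> $1.5M

cost_per_head = 165_000                     # assumed fully-loaded cost per analyst
heads = personnel_budget / cost_per_head

print(f"Personnel budget: ${personnel_budget:,.0f}")
print(f"Approx. headcount: {heads:.1f}")    # lands in the 8-10 range
```

Plugging in any realistic fully-loaded cost lands you in the same ballpark: single-digit headcount defending the whole enterprise.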
Cybercrime numbers, although controversial (I'll leave that alone here), are coming in at 4x what we spend on defense. So the bad guys are shooting fish in a barrel.
So are we spending enough? Maybe, maybe not. Are we spending it wisely? Clearly not. Why are startups jumping in? The security status quo is ripe for disruption, and big data crunching, analysis, and visualization that arms security personnel to "hunt" for attackers is just now emerging. Will it get better, more cost-effectively, and with better C-suite understanding? Yes. In many organizations it already is; we just have not seen the trickle-down occur en masse. By analogy, we have far fewer physical bank robberies today than in years past – thanks to security technology – but we still have them.
This is the first in a series of posts about the new public Click Security GitHub project, Data Hacking. The project uses an open architecture based on Python and recent advances in data analysis, statistics, and machine learning. We investigate challenging security issues through a set of exercises that use open data sources and popular Python modules such as Pandas, scikit-learn, and StatsModels. All materials are presented in a set of publicly shared IPython notebooks.
Exercise: Detect Algorithmically Generated Domain Names
The Data Hacking GitHub project page already has several posted exercises, but we'll begin with an exercise to detect algorithmically generated domain names.
Python Modules Used:
- IPython: Architecture for interactive computing and presentation
- Pandas: Python Data Analysis Library
- Scikit Learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
- Matplotlib: Python 2D plotting library
- StatsModels: descriptive statistics, statistical tests, and plotting functions.
In this notebook we're going to use some great Python modules to explore, understand, and classify domains as being 'legit' or having a high probability of being generated by a DGA (Domain Generation Algorithm). The primary motivation is to explore the nexus of IPython, Pandas, and scikit-learn, with DGA classification as a vehicle for that exploration. The exercise intentionally shows common missteps, warts in the data, paths that didn't work out well, and results that could definitely be improved upon. In general, capturing what worked and what didn't is not only more realistic but often much more informative. :)
The DGA Notebook contains all the code and details of the exercise, but we'll summarize the work and approach here.

Data Used:
- Alexa 100k top domains (we also show results for top 1 Million).
- A mixture of ~3500 domains that were known to come from DGA sources.
Summary of Approach
- Data Ingestion, Cleanup, and Understanding: We compute both length and entropy, add those to our Pandas dataframe, and demonstrate the nice integration of IPython/Pandas/Matplotlib.
- We demonstrate the use of scikit-learn's CountVectorizer to compute NGrams on both the Alexa domains and on the English dictionary; those new features helped to increase feature differentiation (plots shown below).
- We utilize the scikit-learn machine learning library:
  - Random Forest: a popular ensemble machine learning classifier
  - We perform NumPy matrix operations to generate NGram count vectors.
  - New features are added to our dataframe and feature matrix for scikit-learn.
  - Train/Classify: we demonstrate the classification results on our expanded feature vectors.
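The steps above can be sketched end-to-end on toy data. This is not the notebook's actual code – the domain lists here are tiny stand-ins for the Alexa and DGA corpora, and the "alexa_grams" feature is approximated by matching character n-grams against the legit list:

```python
import math
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def entropy(s):
    """Shannon entropy of a string, in bits per character."""
    probs = [n / len(s) for n in Counter(s).values()]
    return -sum(p * math.log(p, 2) for p in probs)

# Toy stand-ins for the Alexa 100k and the ~3500 known-DGA domains.
legit = ['google', 'facebook', 'youtube', 'amazon', 'wikipedia']
dga = ['xkqwzpvb', 'qzvrtmlk', 'pzxqvwnb', 'wqzxkvpr', 'zqpxvkwm']
df = pd.DataFrame({'domain': legit + dga,
                   'label': ['legit'] * len(legit) + ['dga'] * len(dga)})

# Features: length, entropy, and character n-gram hits against the
# legit corpus (a stand-in for the notebook's alexa_ngrams feature).
df['length'] = df['domain'].str.len()
df['entropy'] = df['domain'].map(entropy)
vec = CountVectorizer(analyzer='char', ngram_range=(3, 5))
vec.fit(legit)  # vocabulary = n-grams seen in the legit corpus
alexa_grams = np.asarray(vec.transform(df['domain']).sum(axis=1)).ravel()

# Assemble the feature matrix and train/classify with a Random Forest.
X = np.column_stack([df['length'], df['entropy'], alexa_grams])
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, df['label'])
preds = clf.predict(X)
```

A real run would hold out a test set and use the actual Alexa and dictionary n-gram counts; see the DGA Notebook for the full treatment.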
For an exercise whose focus was to demonstrate the use of IPython, Pandas, scikit-learn, and Matplotlib, the results were reasonably good. Given a feature matrix of length, entropy, alexa_ngrams, and dict_ngrams, our classifier's predictive performance on our holdout set was the following:
Confusion Matrix Stats
legit/legit: 99.38% (6723/6765)
legit/dga: 0.62% (42/6765)
dga/legit: 14.61% (39/267)
dga/dga: 85.39% (228/267)

We can see that the 'false positive' rate (legit domains classified as DGA) is quite small at 0.62%. This is critical in a large-scale system where you don't want false alerts firing on legitimate domains.
That summarizes the results in a nutshell, but the DGA Notebook gives a thorough, in-depth treatment of the data, features, analysis, and machine learning behind this exercise.
Please visit the new Click Security Data Hacking GitHub site for additional exercises, code, and IPython notebooks.
Brian Wylie (Lifeform at Click Security)
Two of the top questions we get from customers on a consistent basis are:
1. What is a security analytic?
2. How is security analytics different from a SIEM?
Jon Oltsik, one of the leading analysts covering this market space, does a great job addressing these questions in his most recent blog post, http://www.networkworld.com/community/blog/big-data-security-analytics-faq
Both purpose-built security analytics solutions and SIEMs can fire alerts on rules and triggers. But the number of conditions crunched by the analytic has everything to do with how good your detection is against determined external attackers or crafty insiders. And then the questions become: how many of these analytics can you run in real-time, and how easy is it for the security analyst to interact with findings in order to confidently speed time to response? That is where a true real-time security analytics solution comes in.
This is a guest post from one of our Click Labs Senior Engineers, Lucas McLane, CISSP.
At Click Security we have the unique ability to gain insight into a diverse set of customer environments located worldwide. Most corporate networks generate some amount of network traffic to internet domains outside their local network; networks that don't communicate outside their LAN aren't very interesting. But what about networks that communicate with every site on the internet? Those would arguably be the most interesting, but as it turns out they are highly irregular. Most networks communicate with "common" sites. What if you could globally track all aggregate traffic to specific internet domains, using "commonly" visited sites as a baseline filter? What you would be left with is a handful of communications to "weird" or "uncommon" sites. Thanks to the Alexa 1 million, it's not an intractable problem to solve! Alexa keeps track of the top 1 million most visited internet sites and provides the list for others to leverage.
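The baseline-filter idea above can be sketched in a few lines. This is a minimal illustration, not Click's pipeline; the loader assumes the standard Alexa CSV format (rank,domain per row), and the domains below are made up:

```python
import csv

def load_alexa(path):
    """Load the Alexa top-1M CSV (rank,domain per row) into a set."""
    with open(path, newline='') as f:
        return {domain for _, domain in csv.reader(f)}

def uncommon(observed, alexa_domains):
    """Return observed domains that never appear on the Alexa list."""
    return sorted(set(observed) - alexa_domains)

# Toy stand-ins; a real run would call load_alexa() on the actual file.
alexa = {'google.com', 'facebook.com', 'wikipedia.org'}
seen = ['google.com', 'weird-site-xyz.info', 'facebook.com', 'qzvrtmlk.biz']
print(uncommon(seen, alexa))  # ['qzvrtmlk.biz', 'weird-site-xyz.info']
```

Everything common falls away, and the short list of "weird" domains is what's left to investigate.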
At Click we leverage this list along with a custom database containing hundreds of millions of internet domains, similar to what you'd find in a standard internet Whois registry. However, unlike standard Whois, our proprietary centralized cloud service is able to track domain information queries not listed on the Alexa 1 million. The result is a single web server log describing uncommon domain traffic seen across a diverse set of corporate networks. Why is this interesting, you might ask? Well, mostly because by definition what you're left with is a very short list of weird (and eyebrow-raising) internet traffic to sites such as:
Some of these sites have been listed on Trend Micro Malicious Top Ten!
From a security analyst's perspective, weird is always interesting because sometimes (not always, and not necessarily often) weird means incident! By leveraging a centralized service for tracking strange domain access, Click is able to globally correlate anomalous web traffic across multiple diverse network environments against current attack vectors! It's not a silver bullet for security – none exists – but it is one more power tool in the modern analyst's overflowing, yet always insufficient, box of tools.
We are excited to announce today that we have acquired VisibleRisk, an Austin-based information security analytics company! We will be combining our products and expertise to enrich and accelerate actionable security intelligence solutions for enterprise-class customers.
For more than a decade, organizations have relied upon signature-based malware detection products for defense, and security event management systems to investigate attacks that got through – but typically months after the fact. This is driving an industry shift towards advanced technologies able to intelligently mine big security data for anomalous and malicious activity – automatically and in real-time. We are focused on this shift, specifically enabling organizations to convert security big data into actionable intelligence.
VisibleRisk has significant experience providing enterprises with advanced security analysis – including flexible, advanced "hunting-based” analytics, custom threat intelligence, and incident response support – leading to faster and more accurate detection of incidents with a focus on reducing impact to the organization.
The acquisition of VisibleRisk complements Click Security’s security intelligence arm, Click Labs, and also provides breadth and depth to our data ingest and security analytics technology development. We have a lot of respect for this team’s talent and experience – and this will help us provide an even richer real time security analytics solution.
Together, VisibleRisk and Click Security can empower organizations to find and address a broader set of anomalies much faster and with greater situational awareness. Being able to team up with another company here in Austin that shares a similar vision of how to support the defender of the enterprise is exciting!
VisibleRisk founder and CEO, Rocky DeStefano, will join our leadership team as Vice President of Strategy and Technology, and all other VisibleRisk personnel will become full-time employees of Click Security.
Welcome to the team VisibleRisk!
Last week, Gartner held its 6th annual Security & Risk Management Summit at the Gaylord National Resort and Convention Center just outside of Washington, DC. With an audience of nearly 2,000 Gartner clients, the Fortune 1000 and critical government agencies alike were well represented. Combine that with the fact that virtually every Gartner security analyst spoke on one topic or another, and the conference made for a lively exchange on what is hot and what is not in 2013.
Of course it helped to get the juices flowing with a keynote speaker like Mike Mullen, a retired United States Navy admiral who served as the 17th Chairman of the Joint Chiefs of Staff under Presidents George W. Bush and Barack Obama from 2007 to 2011. Mr. Mullen is a captivating and focused speaker – you'd expect nothing less from a Navy admiral. Looking around the room, I was clearly not the only one who felt like I had better sit up straight and take notes.
But beyond the keynote, three very strong and specific statements heard from informed analysts were:
- “80-90% of budgets are devoted to prevention, leaving very little for detection and analysis. This needs to change.” Of course we are all increasingly aware of the demise of signature-based malware detection. But to hear clear advice endorsing a budget shift from traditional prevention towards rich data mining, data contextualization, and speed of understanding is a remarkable change from the same conference just a year ago.
- “The shift to cloud-based security services is now in high gear.” Let's put that into perspective. The entire security market – all products and services – is forecast to be ~$60B in 2013, with services ~65% of that figure, and cloud services ~1/3 of that, or ~$10B. By 2016, those figures are expected to grow to $78B, $48B, and $17B respectively. Now, we know that cloud security services have historically centered on email security and secure web gateways – well within the outsourcing “comfort zone”. What is interesting about these figures is the growing shift in cloud-based security services toward incident response, threat analysis, and risk assessment. A key stated driver was that “…advanced attacks are driving customers to the cloud.” There are really two forces at work here, depending on your company's security team profile. If you have in-house personnel capable of deep investigation and analysis, you want cloud security services to relieve you of the mundane security work, giving you more time to 'go hunting for the bad guys'. Alternatively, if you don't have the security chops, you want the cloud to augment (if not fully provide) the hunting for you.
- “End user behavior base-lining is in.” Of course this is a sensitive topic given the recent news articles associated with personal privacy. But in the corporate world, organizations are showing much greater demand for security technologies that can rapidly compare end-user behavior to baselines – whether associated with thresholds, indicators of attack, or external content use. There is keen interest in being able to rapidly process large numbers of events over longer time frames. And finally, Security Operations Centers (SOCs) are admitting a heightened need for analyst support – specifically in the areas of better decisions and advice, and outside intelligence/resource leverage.
Regardless of where each organization finds itself on the spectrum of evolution with respect to the above trends, one thing is becoming clear: a fundamental shift towards better, more leveraged techniques of detecting and halting hacktivists, criminals, and spies – in order to cost-effectively lower business risk – is well underway.
We all know the world of network security has changed. Script kiddies taking networks down for fun? Passé. Hackers coming in to shut you down? Sure, but they are far more interested in the “grab and go” of something valuable. Yes, we have nation-state attacks and hacktivism defacements. And, of course we need to be concerned about critical infrastructure siege and damage. But, the majority of the focus in this day and age has to be around protection of personal privacy information and intellectual property – whether the “secret sauce” recipe, your client list, or your key trading algorithm.
Hackers are smart. They know how your traditional defenses work. They know you hate a noisy IDS, so you turn the volume down. They can figure out the long tail of opportunity and exploit it with low-and-slow tactics, masking as legitimate user activity, sending their luggage into your shop via seemingly innocent files, connection types, protocols, etc. And there are so many combinations of these techniques that it becomes tough for your loosely bound traditional defenses, plus a few humans, to crawl loads of metadata and see where attackers are and what they are doing – until about six months after the breach. Keep in mind, a single firewall producing 15,000 events per second generates about 1.3 billion events per day for you to digest.
What happens when you want to take web proxy, IDS, windows authentication data, blacklist IP information, geo-location info and more, munge it together, find the anomalous threads, understand them, analyze them, and determine if they are benevolent or malicious? And if it requires deeper investigation to reach a highly probable conclusion, will it be obvious to you where and how you should start that effort? Typically not. It takes a lot of time, skill and inclination.
This is where security analytics can help. What exactly is a security analytic? Let’s consider two definitions: one technically-oriented and one business-oriented.
A simple technical description of a security analytic is “a software element that encodes all or part of the logic used to perform an analysis on data”. Of course there are degrees of complexity. A single factor analytic could be: tell me if a given IP address is blacklisted or not. A more complex analytic could be: tell me if any financial department members ever access the payroll server from a location other than this IP subnet in our corporate network. And if so, I want to know the device they are using, their geo-location, what files they are accessing, at what time, etc. Oh, and I also want to know every other IP address they are connected to – on or off my network – and with which specific protocols. This kind of information is extremely useful – particularly if it is formed automatically and in real-time, enabling a machine or analyst to take corrective action before damage or exfiltration occurs.
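The single-factor and multi-factor analytics described above can be sketched in a few lines. The field names, blacklist entries, and "expected" finance subnet here are illustrative assumptions, not Click's schema:

```python
from ipaddress import ip_address, ip_network

# Illustrative assumptions: a tiny blacklist and the subnet from which
# finance users are expected to reach the payroll server.
BLACKLIST = {'203.0.113.7'}
FINANCE_SUBNET = ip_network('10.1.2.0/24')

def blacklisted(event):
    """Single-factor analytic: is this source IP blacklisted?"""
    return event['src_ip'] in BLACKLIST

def payroll_anomaly(event):
    """Multi-factor analytic: a finance user reaching the payroll
    server from outside the expected corporate subnet."""
    return (event['user_dept'] == 'finance'
            and event['dst_host'] == 'payroll'
            and ip_address(event['src_ip']) not in FINANCE_SUBNET)

event = {'src_ip': '198.51.100.9', 'user_dept': 'finance', 'dst_host': 'payroll'}
print(blacklisted(event), payroll_anomaly(event))  # False True
```

A real system would also join in device, geo-location, file access, and peer-connection context, but the shape is the same: predicates over enriched event data, evaluated as events arrive.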
Now at the business decision-maker level, the above definition sounds nice, but a different definition of an analytic is usually required. Business decision makers just want to be shown something they did not know, and it needs to be scary – as in “intruders are in the safe” or “intruders have entered the building” or “no intruders, but the basement door is wide open – and it is after hours”. Then they want to know the risk, the time and cost to get it resolved, and the recommended action plan – and fast. Further, they don't want this to require their limited, overworked security staff to have to figure it out manually. That is just a big distraction. Their security staff is already plenty busy with firewall adds, moves and changes just to keep the business running.
So, a security analytic enables a large amount of telemetry data to be pieced together automatically and contextually such that security staff can cover more ground with less aggravation.
Security analytics are the key ingredient to “automating the analyst” – which Click Security believes is Job #1 given the enormous shortage of security staff at a time when the attack surface is growing at an exponential rate and hackers are automating their ability to “grab and go” by the minute.
We are happy to announce that Click Security has been named the winner of the NetEvents 2013 Innovation Award in the Security Solution category!
These prestigious awards recognize the very best in the technology industry and reward the leading individuals and organizations for innovation and performance in the networking and telecommunications sector. The winners were chosen by an independent panel of highly respected judges including IT industry gurus and professionals from the leading technology press and industry analysts from around the globe.
According to Mark Fox, CEO of NetEvents, the judges found the ability of Click Security's real-time security analytics to create a rich contextual understanding of network activity – and unearth anomalous activity early in the attack kill chain – to be a highly innovative approach to an increasingly difficult problem.
To be recognized with a NetEvents Innovation Award is special given the high caliber of participation at these events. We believe real-time security analytics will be a game-changer for how organizations protect themselves going forward, and we are delighted to be recognized as innovating in this space.
Read the full release here.
As the recent Forbes article, WordPress Under Attack: How to Avoid the Coming Botnet, states, “One in every six sites on the web runs on WordPress. That’s a lot of fodder to make a botnet out of! Don’t let yours be one of the trampled. Make this five-minute fix today.”
Of course, if we all managed login names, passwords, multifactor authentication, plug-ins, and software updates – as the writer recommends – then these types of hacks would be less likely. And then you wake up.
And before you say, “Well, we don’t use WordPress,” remember these types of attacks could just as easily be focused on any application.
So, how might Real-time Security Analytics help? Here are three analytics that would give you early indication of anomalous login activity:
Multiple Usernames per IP: This will tell you when you have an attacker who is cycling through a lot of different usernames, trying them all. The targets can be a single box or many.
Failed Logins: Tells you when an actor is generating a sequence of failed logins without a successful one. This is very useful if the attacker is trying the same username over and over (so the previous analytic will not fire), but using a different password each time. The targets can be a single box, or many.
Multiple IPs per Username: This will fire if many different IP addresses are all trying the same username. For example, if 90,000 bots all attempt to log into your WordPress server with the username "admin", this analytic will fire.
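The three analytics above can be sketched as simple counters over a stream of login events. The threshold and field names here are illustrative assumptions, not Click's actual implementation:

```python
from collections import defaultdict

THRESHOLD = 3  # illustrative; real thresholds would be tuned per environment
users_per_ip = defaultdict(set)      # analytic 1: many usernames from one IP
fails_per_actor = defaultdict(int)   # analytic 2: failed-login streaks
ips_per_user = defaultdict(set)      # analytic 3: many IPs trying one username

def observe(event):
    """Consume one login event; return the analytics that fire on it."""
    ip, user, ok = event['src_ip'], event['username'], event['success']
    alerts = []
    users_per_ip[ip].add(user)
    ips_per_user[user].add(ip)
    fails_per_actor[ip] = 0 if ok else fails_per_actor[ip] + 1
    if len(users_per_ip[ip]) >= THRESHOLD:
        alerts.append('multiple_usernames_per_ip')
    if fails_per_actor[ip] >= THRESHOLD:
        alerts.append('failed_logins')
    if len(ips_per_user[user]) >= THRESHOLD:
        alerts.append('multiple_ips_per_username')
    return alerts

# A botnet cycling passwords against "admin" from many IPs:
stream = [{'src_ip': f'10.0.0.{i}', 'username': 'admin', 'success': False}
          for i in range(5)]
alerts = [observe(e) for e in stream]
print(alerts[-1])  # ['multiple_ips_per_username']
```

Note that only the third analytic fires here – each bot individually stays under the failed-login and username-cycling thresholds, which is exactly why you want all three running together.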
Run all three together and you have pretty strong, early warning visibility into anomalous login activity. Further, Click’s system state tables could be used to harvest a list of the bot IP addresses and the usernames they are trying against which targets. The administrator will additionally be able to see if the attacker had any successful logins above and beyond the failed ones.
Finally, the above statement, "The targets can be a single box or many," is quite important. Here’s why. To be able to notice when an attacker is generating a threshold number of failed logins across multiple targets (and highlight those for the analyst) requires pre-correlation, something that is traditionally a difficult problem in this space. That is, the traditional way to do this is to look at each server in isolation, one after the other. Let’s consider how an analyst might go about this:
1. Identify the WordPress servers
2. Consider the first WordPress server:
a. List all the clients
b. Find the clients generating a lot of failed logins against that server
c. Write down the IP addresses of these clients
3. Consider the next WordPress server:
a. List all the clients
b. Find the clients generating a lot of failed logins against that server
c. Compare the IP addresses of these clients against the existing list and look for commonalities
4. Repeat step 3 for each WordPress server
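The manual steps above boil down to an intersection across servers, which pre-correlation can do in a single pass. Here is a minimal sketch of that idea (field names and the threshold are illustrative assumptions):

```python
from collections import defaultdict

def correlate_failed_logins(events, threshold=10):
    """Steps 1-4 above in one pass: count failed logins per
    (server, client) pair, then surface clients that hammered
    more than one server."""
    fails = defaultdict(int)
    for e in events:
        if not e['success']:
            fails[(e['server'], e['client'])] += 1
    noisy = defaultdict(set)  # client -> servers it hammered
    for (server, client), n in fails.items():
        if n >= threshold:
            noisy[client].add(server)
    return {c: srv for c, srv in noisy.items() if len(srv) > 1}

# Toy stream: one client fails repeatedly against two WordPress servers.
events = ([{'server': 'wp1', 'client': '203.0.113.5', 'success': False}] * 10 +
          [{'server': 'wp2', 'client': '203.0.113.5', 'success': False}] * 10 +
          [{'server': 'wp1', 'client': '192.0.2.8', 'success': True}])
print(correlate_failed_logins(events))  # flags 203.0.113.5 on both servers
```

The per-server iteration, list-keeping, and comparison the analyst would do by hand collapses into one grouping operation over the event stream.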
That’s a fair amount of time, energy, and therefore cost. Alternatively, Real-time Security Analytics does the pre-correlation work and surfaces the attacking actors for you automatically.
We are often asked...How is Real-time Security Analytics different from a Security Information and Event Manager (SIEM)?
In a word, it comes down to being designed for proactive – rather than reactive – security. I'm sure that very suggestion will spark controversy, so let's delve in a bit.
First, both proactive and reactive security tools are necessary. But they are very different.
SIEMs have been around for a decade plus. While they got their start doing basic alarm correlation – enabling security practitioners to consolidate multiple alerts for simplified monitoring and reporting – SIEMs were really put on the map in the early part of the last decade, when Governance, Risk and Compliance (GRC) was coming into prominence and organizations were looking for every avenue to prove best practices to any external party who might have claimed otherwise.
SIEMs are good at what they do: simple alerting / alarm correlation across a fairly well-defined set of SECURITY product logs, where the logs have been parsed down to relatively straightforward parameter normalization. This is because they are built upon a technology foundation of store off-line and retrieve as needed for ad hoc post-processing. Consequently, when you ask a SIEM to correlate and automate context around a very large number of disparate data sources, spread over a significant window of time – sometimes days, weeks, months, even years – AND do this in real-time, the story is usually something along the lines of: "Well, it can do it, but we would need to write a very complex script, retrieve a huge amount of data from disk, incur strenuous read/write penalties, have it crunch for a few hours, and see what it returns. And let's hope we get it right the first time; else we must start back at square one if we need to augment the analytic in some manner." Further, that is just for a SINGLE 'analytic'. Try performing that with tens or hundreds of analytics. The system architecture quickly hits a wall and becomes untenable. Worse, the analyst needs to be able to freely interact with the analysis – culling out records of insignificance, augmenting the data with additional contextual information like IP blacklists or geo-location data, or subjecting it to multiple visualizations – enabling the analyst to see the situation from different vantage points. Now we are talking about a very different type of system objective.
I'll stop there and relate what we've heard from a few industry analysts (not all, mind you, just a few). The argument is often put back to us as: "SIEMs could be better at real-time processing if the vendors had gotten feedback from the customer community that that is indeed what they wanted. So the fact that they have not gotten better at real-time analytics proves that customers simply do not want real-time analytics, even if it could be done." In many ways that rings of the same argument the IDS vendors made at the advent of IPS, and we know that simply was not borne out over time. Based on what we hear from savvy security practitioners – across a number of verticals including financial services, government, higher education, healthcare, and critical infrastructure – the desire is actually the opposite! They crave real-time security analytics capabilities. They just aren't being told it is possible.
But let's park that point too - I'll come back to it later. Let's focus on the fundamental problem at hand. Here it is - laid out nice and neat by Verizon in their 2012 DBIR report - a great read by any account:
This graphic shows that 85% of the time it takes hackers minutes or less to move from initial attack to initial compromise. Yet 85% of the time it takes organizations weeks to months, even years, to discover the compromise. This is the problem. We have the data before us, but we cannot piece it together, contextualize it, or interrogate it until, well... someone finds out the proverbial 'horse has left the barn'.
Click Security's Real-time Security Analytics is built upon a stream processing engine that is fundamentally different from relational database query architectures – and yes, different from distributed map-reduce technologies as well. We hold critical data in memory for long periods of time. We hold hundreds of broad correlation / automated context mapping analytics in memory. As a result, we can respond to the very next firewall log or IPS event as soon as it arrives, relative to a massive amount of data sitting directly in memory. No re-creation necessary. Super fast. Interactivity? That is our specialty. Because everything is sitting ready-to-use in a fully-stateful form, we can instantly pivot between complex visualizations including dynamic spreadsheet rows and columns, histograms, fanouts, and parallel coordinate views. Further, if the analyst creates a new anomaly rule set as a result of an analysis, that analytic (we call it a Click Module) can be instantly converted to a 24x7 persistent analytic working on behalf of the analyst in real-time – watching for those conditions and taking a designated action if they recur. That is not an ad-hoc query. It is not a weekly scheduled cron job. It is 24x7 persistent operation!
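The promote-a-rule-to-a-persistent-module idea can be illustrated with a toy in-memory engine. The registry/callback API here is invented for illustration; it is not Click's actual module interface:

```python
# Toy sketch: an in-memory, stateful engine where an analyst's ad-hoc
# rule becomes a persistent analytic that runs on every arriving event.
# The API names here are invented for illustration.
class AnalyticEngine:
    def __init__(self):
        self.modules = []   # persistent analytics, kept in memory
        self.state = []     # event history, held fully stateful in memory

    def register(self, name, predicate, action):
        """Promote a rule: from now on it runs on every arriving event."""
        self.modules.append((name, predicate, action))

    def ingest(self, event):
        self.state.append(event)
        for name, predicate, action in self.modules:
            if predicate(event, self.state):
                action(name, event)

fired = []
engine = AnalyticEngine()
# An analyst's ad-hoc finding, promoted to an always-on module:
engine.register('repeat_offender',
                lambda e, hist: sum(x['ip'] == e['ip'] for x in hist) >= 3,
                lambda name, e: fired.append((name, e['ip'])))

for _ in range(3):
    engine.ingest({'ip': '203.0.113.7'})
print(fired)  # [('repeat_offender', '203.0.113.7')]
```

Because both the state and the analytics live in memory, each new event is evaluated the instant it arrives – no disk retrieval, no scheduled batch job.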
So, now let's go back to the point of whether or not a SIEM can do this. On the theoretical surface, seemingly. In the real world, no. We just need to remember that a single simple alert checking for a limited set of parameters over a narrow time window, say 5 minutes, is not the same exercise as running hundreds of complex, high-correlation and contextualization analytics over a broad range of data sources and a protracted period of time AND returning a response in real-time. Those are no more the same than a Fiat and a Ferrari.
But, why should you care? Simply because really scouring big data for clues requires speed and accuracy, lest you chase a load of false positives. Speed comes from an engine designed for parallel stream processing of a large number of analytics. Accuracy comes from analytics that correlate a significant number of actor parameters. The more anomalies you can tie together, the more likely it is that the oddity in question is malicious, as opposed to just a new behavior by a legitimate employee or contractor.
But, don't take my word for it. The reality is that the customers buying Real-time Security Analytics already own a SIEM and are looking for something different to address the modern security threat. So I'll come back to where I started: one is reactive, one is proactive. We aren't here to say that SIEMs don't have their place. They clearly do. But Real-time Security Analytics solves a different problem set and, as a result, requires a different type of engine and approach to analytics at large. As such, it is far more than a 'feature addition' to a product designed for another purpose.
You may or may not really need a SIEM. But, revisiting the central problem as pointed out by the Verizon DBIR, it is doubtful you will want to be without proactive Real-time Security Analytics.