After a great deal of time spent on video calls to friends and family describing the rollercoaster year the SOC.OS team has just had, I often found myself answering questions such as:
Is a cybersecurity alert like a mobile notification? What do you do if you get an alert? What’s an AlienVault?
These are all valid questions with non-trivial answers – explaining the work a Network/Security Analyst must do to triage a deluge of cybersecurity alerts is not easy. It dawned on me that CISOs and Junior Analysts alike have this same conversation on a daily basis in an attempt to communicate the value of their work and remain sufficiently funded by their organisation.
I tended to answer these questions with a series of idioms that boil down the topic to the core problem. For example, one might say an analyst’s job is to: “Find the needle in the haystack”. Except that implies that only one cybersecurity alert is of importance. Really, an analyst is trying to: “Find the sharpest needle in an ever-expanding needlestack” i.e. they must sift through a huge number of alerts in order to find the most severe one, then the next most severe, etc. Even worse, what might be the most critical alert now might be the 10th most important after an hour.
Before a single line of SOC.OS code was written, we carried out extensive research into the alert triage problem space. The following is a consolidated list of the diverse, niche, and time-consuming set of steps that an analyst must go through in order to make an informed decision when triaging a cybersecurity alert. For you the reader, this will hopefully serve two purposes:
- Act as a set of crib notes to aid in cybersecurity budgeting discussions.
- Describe how SOC.OS automates and augments this process.
Each of the following sections will be interspersed with the metaphors, similes and sayings that I found myself using to summarise its problems.

Ingest & Collection
Log volume and velocity from security systems is far too high for any human to even begin to comprehend it – log storage is often sized on the order of terabytes, and rates measured in gigabytes per second. Before any triaging can start, analysts need a set of systems that analyse security logs and turn them into alerts. Some common examples might be the IPS/IDS modules on firewalls, network data analysis tools, or cloud service monitoring.
The issue is that expanding organisations, an increasing likelihood of attack, and a growing number of security tools all produce more alerts. More alerts mean more investigation and triage, and for many the alert rate becomes a constant burden. At SOC.OS we have two terms for this: “Alert whack-a-mole” – constantly dealing with alert after alert on a case-by-case basis – and “Swivel-chairing” – frequently switching between the screens of each tool in the triage process. This leaves less time to manage the source systems and configure them to produce fewer, higher-quality alerts – so the problem spirals.
The easy solution is to throw more people at the problem, but funding is limited. The clever solution is to make the most of the alerts you already have to paint a more vivid picture of weaknesses or breaches in your cyber estate. Some might opt for a SIEM or SOAR – however, these are expensive and have huge management overheads of their own – a “double-edged sword”. After a brief honeymoon period with these heavyweight tools, many will find themselves back in the same downward spiral, sometimes worse off than before.
SOC.OS aims to be the lightweight, low-maintenance, intelligent solution at the end of this toolchain – ingesting alerts from all of these systems through an on-premise syslog-forwarding agent or directly from cloud APIs. Automated swivel-chairing.
As mentioned previously, SIEM and SOAR tools like Splunk have a large management overhead – often in the form of mapping raw logs to the required format for that system. Some organisations choose to take this on themselves, others choose to use add-ons – but either way, the onus is on the customer to do their own data mapping.
Why is that the case? Well, everyone’s cyber data is subtly different. It’s hard for wide, overarching data platforms like Splunk to offer that kind of detailed expertise. Taking the “glass-half-full” view, the good news is that if alert data only differs subtly between organisations, there must be a broad common pattern underlying it too. The clever thing to do is to use an automated approach to take advantage of that pattern, and only require human input where a subtle difference needs to be overcome.
Why should 3 customers individually manage 3 slightly different mappings for the same tool, when SOC.OS can manage a single mapping and only ask customers when necessary what makes their data slightly different? Once mapped into SOC.OS, these alerts from different vendors can all be interpreted in a single common format for processing and presentation to the analyst.
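To make the idea concrete, here is a minimal sketch in Python of per-vendor field mapping. The vendor names, field names and common schema below are invented for illustration – they are not SOC.OS's actual mappings or format:

```python
# Hypothetical sketch of normalising vendor-specific alerts into one
# common format. Vendor and field names here are illustrative only.

# Per-vendor mapping: common field name -> vendor-specific field name.
FIELD_MAPPINGS = {
    "vendor_a_firewall": {"src_ip": "sourceAddress", "severity": "sev", "signature": "msg"},
    "vendor_b_ids": {"src_ip": "attacker_ip", "severity": "priority", "signature": "rule_name"},
}

def normalise(vendor: str, raw_alert: dict) -> dict:
    """Translate a raw vendor alert into the common format."""
    mapping = FIELD_MAPPINGS[vendor]
    return {common: raw_alert.get(vendor_field) for common, vendor_field in mapping.items()}

alert = normalise("vendor_b_ids", {"attacker_ip": "203.0.113.7", "priority": 2, "rule_name": "SQLi attempt"})
# alert now holds src_ip / severity / signature regardless of which vendor produced it.
```

Because every alert downstream of this step shares one schema, the rest of the pipeline only ever has to reason about one format, and the per-customer "subtle difference" reduces to a small tweak in a mapping table.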
An alert alone has very little meaning – it needs additional enrichment from external threat databases and internal business context for it to have any value. An external IP address means nothing until you know the country of origin or if it’s known to be malicious. An internal hostname means nothing until you know what that machine does and who it belongs to.
A large portion of an analyst’s time goes towards carrying out this investigation – much of it drawing on their own experience and detailed knowledge of the estate. They might use external services like AlienVault OTX and AbuseIPDB to determine who might be attacking them, or internal Active Directory databases and DHCP logs to determine what’s being attacked. Context may also reveal a false positive: what looks like an anomalous login from a foreign country could just be your organisation’s second office location. You might say “Don’t judge a book by its cover” – context is key.
This is something SOC.OS knows it can only augment, not automate, as there is no substitute for an experienced analyst. We can, however, help them come to decisions faster by allowing them to tag entities with business context (e.g. “10.0.24.46 is our customer database server”) and automate the lookup of key entities to services like AbuseIPDB (“188.8.131.52 looks malicious”) or internal AD databases (“I don’t think our legal team should have admin rights on that domain controller“).
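As a rough illustration of what this enrichment looks like in code – the tag table and reputation lookup below are hypothetical stand-ins, not real AbuseIPDB or Active Directory calls:

```python
# Illustrative enrichment step: combine internal business-context tags
# with an external reputation lookup. Both data sources are stand-ins.

BUSINESS_TAGS = {"10.0.24.46": "customer database server"}

def reputation_lookup(ip: str) -> str:
    # Stand-in for a real threat-intelligence API call (e.g. AbuseIPDB).
    known_bad = {"188.8.131.52"}
    return "malicious" if ip in known_bad else "unknown"

def enrich(alert: dict) -> dict:
    """Return a copy of the alert with context fields attached."""
    enriched = dict(alert)
    enriched["dst_tag"] = BUSINESS_TAGS.get(alert["dst_ip"], "untagged")
    enriched["src_reputation"] = reputation_lookup(alert["src_ip"])
    return enriched

result = enrich({"src_ip": "188.8.131.52", "dst_ip": "10.0.24.46"})
```

The point of the sketch: the raw alert alone says "one IP talked to another", while the enriched alert says "a known-malicious address touched the customer database server" – a far easier triage decision.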
Understanding where this alert falls in terms of the MITRE ATT&CK® framework might also prove very useful – see our earlier blog post, “Defending your castle with MITRE ATT&CK®”.
Enrichment is great for point-in-time understanding of an alert – but the picture becomes even clearer when considering a timeline of related alerts. Your firewall picking up an unknown external IP address carrying out a port scan on your public-facing devices might be a common occurrence and no cause for alarm – but seeing that same IP in your email monitoring solution carrying out phishing attempts might indicate a targeted attack and warrant further investigation. This normally requires analysts to have a memory of which alerts they saw on which system and when – more “swivel-chairing”.
A more senior analyst can also begin to consider the negative space when investigating an alert. Imagine your systems detected port scans on 12 machines, then a day later it detected and blocked remote code execution attempts on 11 of those 12 machines. Why didn’t that 12th machine detect anything? Was it not attacked, or did it not detect a successful attack? Without considering alerts in context with one another, an analyst might never have picked this up. Unfortunately, this is time consuming, and many analysts simply don’t have enough hours in the day to carry out this kind of investigation.
At SOC.OS, we call our solution to this stateful correlation – alerts are grouped into correlated clusters that represent an incident spanning a period of time. Analysts investigate clusters – not individual alerts – allowing them to “kill two birds with one stone”. We more often find SOC.OS customers are able to “kill hundreds of birds with one stone” – a tad morbid, but representative of analysts triaging many alerts in a cluster at once.
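A heavily simplified sketch of the clustering idea – real correlation considers far more than a single shared IP, and this is not SOC.OS's actual algorithm:

```python
# Minimal sketch of stateful correlation: alerts that share an entity
# (here, a source IP) are grouped into one cluster for joint triage.
from collections import defaultdict

def cluster_by_entity(alerts: list) -> dict:
    clusters = defaultdict(list)
    for alert in alerts:
        clusters[alert["src_ip"]].append(alert)
    return dict(clusters)

alerts = [
    {"src_ip": "203.0.113.7", "source": "firewall", "signature": "port scan"},
    {"src_ip": "203.0.113.7", "source": "email monitor", "signature": "phishing attempt"},
    {"src_ip": "198.51.100.9", "source": "firewall", "signature": "port scan"},
]
clusters = cluster_by_entity(alerts)
# 203.0.113.7 now shows up as one two-alert incident spanning two tools,
# rather than two unrelated alerts on two separate screens.
```

This is exactly the firewall-plus-email example above: in isolation each alert looks routine, but the cluster reveals one actor probing two attack surfaces.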
On top of this, analysts need to make quick decisions that can have major ramifications for your business down the line – one misclassified alert could mean a successful breach of sensitive business assets. The issue is that marking an alert as a false positive and closing it is just so definitive and final – there’s no going back to change it, even if you have more information now indicating it was in fact malicious.
When an analyst marks a cluster as a false positive in SOC.OS, the cluster is archived, not closed. New information in the form of an alert can reactivate a cluster, opening it up for further investigation. Reactivation is no more work than the analyst would normally do – they would still need to investigate this alert either way – except in SOC.OS, they have the benefit of being able to use the cluster’s entire history for greater context.
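The archive/reactivate lifecycle could be sketched like this – an illustrative model only, not SOC.OS's implementation:

```python
# Illustrative cluster lifecycle: archiving preserves history, and a new
# matching alert reopens the cluster rather than starting from scratch.

class Cluster:
    def __init__(self, entity: str):
        self.entity = entity
        self.alerts = []
        self.status = "active"
        self.archive_reason = None

    def archive(self, reason: str):
        self.status = "archived"
        self.archive_reason = reason

    def add_alert(self, alert: dict):
        self.alerts.append(alert)
        if self.status == "archived":
            # Full history (alerts, comments, archive reason) stays attached.
            self.status = "reactivated"

c = Cluster("203.0.113.7")
c.add_alert({"signature": "port scan"})
c.archive("believed false positive")
c.add_alert({"signature": "remote code execution attempt"})
```

The key design choice is that "false positive" is a reversible state: the second alert would have needed investigating anyway, but here the analyst opens it with the earlier port scan and the original archive reasoning already in view.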
Analysts have a list of alerts to work through from a number of different source systems. How do they know which alerts to work on first (the aforementioned “sharpest needle in the needlestack”)? Should they deal with the alert marked as severity “critical” on one system, or the alert marked as severity 9 on the other? Are 100 identical alerts of severity 1 as bad as 1 alert of severity 100? What’s “par for the course”?
Ultimately, analysts end up answering those questions themselves, as there is no normalisation across systems. Through no fault of their own, when it comes to security auditing, an analyst is unlikely to be able to quantitatively explain why they chose to investigate one alert over another.
SOC.OS assigns a base score to every incoming alert, normalising it across source systems. Once that alert has been clustered, a number of additional factors are considered which alter the final cluster score accordingly. This includes business context enrichment – customers who tagged 10.0.24.46 as their customer database server can choose to increase the score of any cluster which contains this IP address.
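A hypothetical sketch of how such score normalisation might work – the scales, multiplier and field names below are invented for illustration, not SOC.OS's actual scoring model:

```python
# Hypothetical scoring sketch: normalise each vendor's severity scale to a
# common 0-100 base score, then boost clusters touching high-value assets.

SEVERITY_SCALES = {
    "vendor_a": {"low": 20, "medium": 50, "high": 80, "critical": 95},
    "vendor_b": lambda sev: sev * 10,  # vendor_b rates severity 1-10
}

HIGH_VALUE_TAGS = {"10.0.24.46": 1.5}  # score multiplier for tagged assets

def base_score(vendor: str, severity) -> int:
    scale = SEVERITY_SCALES[vendor]
    return scale(severity) if callable(scale) else scale[severity]

def cluster_score(alerts: list) -> float:
    score = max(base_score(a["vendor"], a["severity"]) for a in alerts)
    boost = max((HIGH_VALUE_TAGS.get(a.get("dst_ip"), 1.0) for a in alerts), default=1.0)
    return score * boost

score = cluster_score([
    {"vendor": "vendor_a", "severity": "high", "dst_ip": "10.0.24.46"},
    {"vendor": "vendor_b", "severity": 9, "dst_ip": "10.0.1.1"},
])
```

With every cluster on one numeric scale, "which needle is sharpest right now?" becomes a sort, and the choice of what to investigate first is one an analyst can justify in an audit.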
Reporting is a necessary evil. Without being able to report on the amount of effort put into defending your cybersecurity estate, there would be no cyber budget. It also allows the team to understand where there might be a shortage or misconfiguration in security tooling. Unfortunately, it’s also very time consuming.
We’ve already described how MITRE ATT&CK® “acts as the ultimate security translation tool, facilitating effective communication and understanding between top level executives and technical security personnel, and everyone in between.” It’s a great starting point for businesses who want to standardise how they view and report on the effectiveness of their security tooling. The hard part is mapping the alerts to the framework – with 43% of enterprises struggling to map event-specific data to tactics and techniques. SOC.OS is built around the MITRE ATT&CK® framework, automating this mapping and allowing it to offer numerous dashboards and reporting capabilities with easily interpretable and useful information.
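As a toy illustration of what such an automated mapping involves – the signature strings below are invented, though the tactic and technique identifiers are real ATT&CK IDs:

```python
# Illustrative lookup from alert signatures to MITRE ATT&CK tactics and
# techniques. Signature patterns are hypothetical; IDs are real ATT&CK IDs.

SIGNATURE_TO_ATTACK = {
    "port scan": ("TA0043 Reconnaissance", "T1595 Active Scanning"),
    "phishing attempt": ("TA0001 Initial Access", "T1566 Phishing"),
    "password spray": ("TA0006 Credential Access", "T1110 Brute Force"),
}

def map_to_attack(signature: str):
    """Return (tactic, technique) for a signature, or 'unmapped'."""
    return SIGNATURE_TO_ATTACK.get(signature.lower(), ("unmapped", "unmapped"))

tactic, technique = map_to_attack("Phishing attempt")
```

Once alerts carry tactic and technique labels, dashboards can report coverage in the framework's shared vocabulary rather than in each vendor's private signature names.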
So you’re one of the lucky organisations that’s managed to recruit a new analyst in a labour market with a significant cyber skills gap. They log into the alerting systems for the first time and say “It’s all Greek to me”. You’re already working at 110% capacity. How do you find the time to upskill this new person, particularly with regards to sharing the years of organisation-specific knowledge you’ve been keeping in your head and haven’t had the time to write down? It’s a common problem and one of the reasons why hiring more people does not always solve the issue of alert fatigue for small teams.
SOC.OS is designed to be used by a small team – everything you do in the tool is shared between team members. Opening a reactivated cluster shows you the full history of what happened, including your team’s cluster comments and reasons for archiving. Tagging allows Senior Analysts to add business context so that Junior Analysts can learn as they go. Everything you put into SOC.OS as part of your normal alert triage process becomes a learning experience for someone else.
So that concludes my crib sheet which I intend to print out for all future events with friends and family – I’m sure to be the life of the party.
We’re all about community input so if you feel we’ve missed any important points please share these with us directly at email@example.com.