Like most big companies, Capital One relied on a security information and event management (SIEM) system to act as the centralized data hub for security-related information. When cyber criminals start hammering various perimeter defenses in the company’s network, the SIEM was supposed to collect all the various logs, organize them, and find correlations, via various heuristics and rules-based decision-making capabilities, that cyber security pros can use to protect the company.
Unfortunately, the SIEM dream of ultimate situational awareness has not played out nearly as well as its creators first intended, at Capital One and other companies. For a variety of reasons, many deployments of off-the-shelf SIEMs have proven themselves inadequate for dealing with the types of attacks, the volume of attacks, and the increasing complexity of attacks they’re tasked to defend against.
Capital One had all of these issues – and then some. According to Sagar Gaikwad, who heads up big data cybertech at Capital One, the company’s Web and mobile properties are subjected to 7 million HTTP requests per hour. All told, customer and employee requests generate 7TB of log data every single day.
Another problem with Capital One’s existing SIEM was the lack of a single plane of glass for security analysts, who would continually switch among more than 30 security solution to solve various forensics problems.
“Internally we used to call it ‘swivel chairing,’ where you shifted between different screens,” Gaikwad said during a keynote address at last month’s LBD + EHPC 2017 conference, where Gaikwad was named the Top Overall Contributing Attendee. “They had to go through logs. It wasn’t proactive at all. And the business rules that analysts put on the data were all static and were very hard to change on the fly.”
The company came to the conclusion that it needed to modernize its SIEM. However, the software market for SIEM solutions was lacking, Gaikwad said. “There were some tools which met 80 to 90 of the requirements, but there was no tool that met 100 of our requirements,” he said.
So the company decided to build its own.
Rolling ‘Purple Rain’
Once the Capital One group had the basic criteria set, it was time to hash out some of the characteristics it expected of the system, which at one point was referred to as “the system formerly known as,” hence the connection to the album Purple Rain from Prince, who died exactly a year ago today.
“We needed a platform that would ingest large amounts of data, meaning a buffering system that would queue the incoming logs,” Gaikwad said. “We needed a layer of enrichment that would further enrich the data with our business rules and enrich the data with the help of machine learning algorithms.”
Purple Rain would also need a persistence layer, as well as way to visualize the data. The new SIEM also needed an interface for Capital One analysts to perform search and exploratory analyses. “And we needed a dashboard where our analysts would look at the alerts that are coming in in real time,” Gaikwad said.
Gaikwad and his colleagues then sat down to map the specific requirements into core architectural components.
Everything would plug into the SIEM framework through Java, which would also power the core application server. Apache Kafka would serve as the log buffer, while Apache NiFi would collect everything into streams.
Apache Storm would provide the core intelligence to deal with incoming streams, while a project called Apache Metron, spearheaded by Hortonworks, would abstract some of Storm’s complexity and expose everything in Storm as a JSON document. H2O‘s machine learning framework was originally selected for machine learning (and is in the process of being replaced by Spark ML).
ElasticSearch would provide the indexing, while Elastic‘s Kibana would provide visualization of the data stored there. Purple Rain would rely on the ElastAlert rule engine developed by Yelp, while the company would build its own real-time visualization tool for analysts, called Alert UI.
HBase would serve as Purple Rain’s repository for storing and mixing various types of temporal and geospatial data. HDFS would have a very small role, as a configuration store, while Amazon Web Services S3 repository would provide the heavy lifting and storage of ORC files.
UBAs and DGAs
In addition to detecting traditional attacks, Purple Rain was designed to address several other new attack vectors, including those that fall under the rubrics of user behavior analysis (UBA) and domain generation algorithms (DGA).
Detecting malicious activity by internal users can be difficult, since they already have clearance into back-office applications. Law enforcement estimates upwards of 70 of hacks are perpetrated by internal users. “Working for a bank we’re very sensitive to what our associates are doing,” Gaikwad said. “We want to make sure we don’t have any cyber crooks.”
By setting machine learning algorithms to “learn” what regular employee behavior looks like, UBA solutions can more easily flag anomalous behavior, if and when it ever pops up. In Purple Rain’s case, the system is always on the hunt for big downloads – particularly those that occur outside of Capital One’s internal network.
The DGA threat represents a more challenging phenomenon for Capital One and other organizations that value the security of their systems and data. The way that Purple Rain handles the DGA threat will likely become a highly replicated piece of security engineering.
When a piece of malware successfully infects a PC these days, one of the first things it does is “phone home” to the cyber criminal’s mother ship: the command and control center (C&CC). Like everything on the Internet, the C&CC is represented by a URL.
In the olden days of the Internet, white hats would simply black list the evil mother ships, and that was that. But that was before the bad guys started using algorithms to generate URLs so fast that no human could ever keep up with them. The good guys responded by buying up domains, thereby keeping them out of the hands of the cyber crooks. But eventually, the DGAs started generating upwards of 25,000 domains per day.
“At that point it becomes practically and technically impossible for any company to reverse engineer that many domains, or buy out that domain, or even put it in a block list,” Gaikwad said. “That’s where you need the power of big data and machine learning, where you can protect this at the same speed as the DGA is updating domains.”
The core of Purple Rain’s DGA-fighting super power is a natural language processing (NLP) algorithm that essentially mimics a human’s ability to understand language. Most DGAs are fairly easy for humans to spot because they’re just a bunch of random letters. After training the NLP algorithm, Purple Rain could successfully identify 99.7 of DGA-generated domains at a scale of about 30,000 events per second. However, that was in the laboratory. In the real world, it would need to handle 70,000 events per second.
“And that’s where we saw a lot of the limitations on the solution we had in open source,” Gaikwad said.
Tweaking Open Source
To speed up the model, Gaikwad and his colleagues leaned heavily on their experience as data engineers. The first thing Gaikwad did was build a “model as a service” solution. It brought in an Apache Spark cluster and parked it behind an Amazon Elastic Node Balancer.
“Amazon Elastic Node Balancer provides us the flexibility to scale in and scale out our model based on our needs,” Gaikwad said. “So if we see a spike, we can easily add two servers behind an Elastic Node Balancer. Our streaming engine is completely unaware of what’s happening in the background.”
The company also shifted to Google’s GPRC protocol for internal communication. “It’s a leaner version of REST calls. It uses compression even during communication calls, so it’s much faster,” Gaikwad said. It made several other tweaks to get Purple Rain where it likes it.
While the software provided some operational challenges, Capital One worked through them. The nature of open source software itself proved to be a big challenge for the company, which had to adapt its culture to accept it. As Gaikwad and his team used the open source software and became more familiar with it, they found themselves contributing changes back into the various projects.
Capital One started building Purple Rain in June 2015, and has been in production with it since June of 2016. As June 2017 rolls around, Purple Rain will be a regular part of Capital One’s security infrastructure, and one that should be there for years to come.
The company is planning on releasing some components of the Purple Rain framework as open source. Users won’t be able to see the actual machine learning models the company uses (which could give the bad guys a leg up on the company). But any organization that wants to use Purple Rain to help protect their computing infrastructure will be free to do so.