Bot Detection — The Works Way

In this article, we focus on a high-impact cyber security use case: bot detection in single enterprise networks. We demonstrate step by step how PredictiveWorks. can be used to implement a comprehensible and reusable solution that identifies C&C domains without writing a single line of code.

PredictiveWorks.

PredictiveWorks. is built on top of Google’s CDAP and ships with 200+ data connectors & machine intelligence operators. It supports the new “predict as you train” paradigm and eliminates the expensive and time-consuming gap between the training and deployment phases of AI models.

Background

C&C channels leverage a variety of protocols, but HTTP is still very popular. In an HTTP botnet, the master uses a couple of C&C domains to publish commands, and bots search for them to establish a communication channel.

Botnets generate hundreds or even thousands of domains every day but register only a few of them.

Domain generation means that botnets are one step ahead of security experts: learning the algorithm requires time-consuming reverse engineering of the bot binary. Even worse, the bot master can change the algorithm at any time by patching its bots.

Bots randomly query the generated domains, and the master can register any of them. To expose the bots’ communication channel, a huge number of domains would have to be registered, which is hardly affordable.

Scenario

Semantic Framework

PredictiveWorks. always takes a business-centric perspective and organizes projects along a 4-level AI taxonomy that tightly connects business problems, drivers and desired outcomes to tasks, and tasks down to technical AI application blueprints.

At any time, business users find answers to why a certain data connector or operator has been selected, how it fits a certain project task, and what all of this has to do with the originating problem.

The image illustrates how we expose the DNA of the selected botnet detection algorithm in terms of business case, tasks, blueprints and pre-built plugins: comprehensible and reusable business knowledge.

Cases

Cases prepare the ground for reusing projects when similar problems arise and, for example, accelerate the onboarding of recently hired security experts.

Tasks

Tasks are organized along two proven process models: the CRoss-Industry Standard Process for Data Mining (CRISP-DM) on the one hand, and the Analytics Solutions Unified Method for Data Mining & Predictive Analytics (ASUM-DM) on the other hand.

We support five phases: Understand, Prepare, Build, Evaluate & Operate.

Solving business problems usually starts top down, with tasks to understand the problem and derive ideas for how to solve it. These tasks, too, can end up in AI applications.

Hypothesis

We leverage an existing bot detection algorithm. Therefore, the data-driven problem understanding, the solution idea and the hypothesis already exist:

Algorithmically generated domains are linguistically different from those selected by humans. The latter are made to be easy to remember, and comparing them linguistically usually does not expose any similarity patterns. Suspicious domains, on the contrary, are expected to expose themselves by sharing similar linguistic attributes.

Solution Idea

The idea is to build a linguistic model of suspicious domains from the DNS traffic of a single network and use similarity and segmentation to distinguish between legitimate and suspicious domains. Segments with anomalously high member counts are considered algorithmically generated.

What tasks have to be fulfilled to implement this idea?

We must build a linguistic model to describe algorithmically generated domains & evaluate whether our hypothesis holds. Before starting model building, however, we want to reduce (prepare) the DNS traffic so that it becomes more focused on potentially suspicious domains.

To this end, external knowledge about legitimate domains on the one hand and known behavior patterns of botnets on the other hand is applied to the DNS data, reducing it to the part that more likely contains bot traffic.

In this scenario, the idea of how to prepare the DNS traffic already exists. Otherwise, we would have to perform ‘understand’ tasks to e.g. discover bot behavior patterns.

To summarize, the business case to detect bots in a single network can be decomposed and organized into ‘prepare’, ‘build’, ‘evaluate’ & ‘operate’ business tasks.

Blueprints

Blueprints are categorized by two semantic dimensions:

  • Analytics. Descriptive, Diagnostic, Declarative, Predictive & Prescriptive.
  • Purpose. Discover, Ingest, Learn & Predict.

The ‘prepare’ business task is solved by ‘descriptive’ blueprints that e.g. leverage known bot behavior to ‘discover’ domain candidates for a linguistic model. This is a response to the following requirement:

“As a security analyst, I want to prepare a DNS traffic dump of a single network for linguistic analytics by discovering suspicious domains that can be described by known bot (traffic) behavior.”

Plugins

More than 200 pre-built & reusable Lego-like building blocks (plugins) exist that can be configured and plugged together with a point-and-click data workflow designer to specify logical plans (blueprints). An application generator then turns every blueprint into an executable AI binary.

The image is a sketch of ‘The Works Way’; what we discuss in this article is a template that may be registered in the template hub.

Plugins define the fourth level of our AI taxonomy. They can be categorized as Source or Sink of a data workflow, or as Action, Alert, Analytics, Condition, Lookup & Transform.

The remaining part of this article describes bot detection in more detail, based on the template components introduced above.

Algorithm

  • Bots query DNS servers with a large number of algorithmically generated domains to search for registered C&C domains. The overwhelming majority of these domains are NXDomains (non-existing domains, not resolved by a DNS server); a minimal filtering sketch follows this list.
  • Not every NXDomain is suspicious. A variety of known and proven reasons exist for why this happens. This knowledge is used to remove legitimate NXDomains and reduce the amount of DNS traffic for linguistic analytics.
  • Infected hosts show characteristic quantitative & temporal behavior when searching for C&C domains. This knowledge is also used to reduce the amount of DNS traffic for linguistic analytics.
  • Algorithmically generated domains are linguistically different from human-generated domains, as they are significantly more similar to each other.
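
To make the NXDomain filtering mentioned in the first bullet concrete, here is a minimal Python sketch that narrows raw DNS traffic down to NXDomain responses. It assumes a Zeek dns.log written in JSON-lines format; the field names are standard Zeek dns.log fields, while the file name and the surrounding workflow are illustrative, as the blueprint itself handles this with pre-built plugins:

```python
import json

# Keep only DNS responses that could not be resolved (NXDomain).
# Assumes a Zeek dns.log in JSON-lines format; 'rcode_name', 'query',
# 'id.orig_h' and 'ts' are standard Zeek dns.log fields.
def nxdomains(path):
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("rcode_name") == "NXDOMAIN":
                yield event["id.orig_h"], event["query"], event["ts"]

# Example: collect (host, domain, timestamp) triples for the reduction steps.
records = list(nxdomains("dns.log"))  # file name is illustrative
```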

Case

The Template Dashboard is one option to access registered business cases, and, as you can imagine, PredictiveWorks. is not restricted to cyber defense.

Tasks

The business case is organized in tasks of the following categories:

  • Prepare. Discover NXDomains that are more likely suspicious than legitimate. To this end, known legitimate NXDomains are removed, and known bot behavior is used to prepare the ground for linguistic analytics.
  • Build. Build a linguistic domain model to describe algorithmically generated domains instead of doing binary bot reverse engineering.
  • Evaluate. Evaluate the quality of the linguistic model.
  • Operate. Use the linguistic model in a chain of evidence to detect C&C domains.

Blueprints

For this scenario, we assume that the DNS traffic dump is already stored in a database (CrateDB). Otherwise, there would be another blueprint for data ingestion that e.g. connects to a real-time network monitor such as Zeek and writes network traffic events to a CrateDB database instance.

PredictiveWorks. provides connectors to many data sources; Aerospike and CrateDB are appropriate candidates for storing IoT-scale data.

The ‘prepare’ business task leverages two different strategies to reduce the DNS traffic data: known legitimate NXDomains on the one hand and known bot behavior on the other hand.

We use ‘descriptive & discover’ blueprints to implement each of these strategies.

KNOWN LEGITIMATE NXDOMAINS

The knowledge of legitimate NXDomains originates from external data sources:

  • An NXDomain is considered legitimate if its top-level domain (TLD) is not in the list of IANA TLDs.
  • An NXDomain is considered legitimate if it contains one or more stop words from known network configurations.
  • An NXDomain is considered legitimate if it is a typo of a popular domain (from Alexa’s top domains and the websites of the world’s biggest 500 companies from Forbes).
  • An NXDomain is considered legitimate if it results from an overloaded DNS (as provided by blacklist services for overloaded DNS).

These external data sources (Alexa, Forbes, IANA, stop words etc.) are made available to the blueprint by PredictiveWorks’ common-purpose Lookup plugin. This plugin reads data from files and supports various text comparison methods (including the Levenshtein distance for typo discovery) to make a yes-or-no decision.
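
To illustrate the typo check, the following Python sketch compares an NXDomain against a list of popular domains using the Levenshtein distance. The popular-domain set and the edit-distance threshold are assumptions for the example, not the plugin’s actual configuration:

```python
# Classic dynamic-programming edit distance between two strings.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Stand-in for the Alexa/Forbes lists used by the Lookup plugin.
POPULAR = {"google.com", "facebook.com", "amazon.com"}

def is_probable_typo(domain: str, max_distance: int = 1) -> bool:
    return any(levenshtein(domain, pop) <= max_distance for pop in POPULAR)

print(is_probable_typo("gogle.com"))  # True: one edit away from google.com
```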

In addition to applying external knowledge in the form of repeated use of the Lookup plugin, we also reduce the DNS traffic by removing NXDomains with repeated TLDs that originate from misconfiguration (e.g. “www.example.com.example.com”) rather than from botnets.

This kind of pattern matching can be added to the blueprint by using either the Rule or the SQL plugin. In this scenario, we decided to use the SQL plugin. It is based on Spark SQL and provides various pre-built user-defined functions to enrich SQL statements.
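
The repeated-TLD pattern can be sketched with a single regular expression. The following Python snippet mirrors what such a pattern match does; the concrete regex is an illustrative assumption, not the plugin’s built-in rule:

```python
import re

# A domain whose registered suffix occurs twice in a row (a typical
# search-list misconfiguration) is removed from the suspicious set.
REPEATED_SUFFIX = re.compile(r"(\.[^.]+\.[^.]+)\1$")

for domain in ["www.example.com.example.com", "xkq7fzt.biz"]:
    verdict = "misconfiguration" if REPEATED_SUFFIX.search(domain) else "keep for analytics"
    print(domain, verdict)
```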

Besides the repeated use of the Lookup plugin and the SQL plugin (enriched with pre-built user-defined functions), this blueprint uses the CrateDB connector to read the DNS traffic data and finally write the reduced DNS traffic back to the same database instance.

KNOWN BOT BEHAVIOR

In contrast to bots, most legitimate hosts do not query many NXDomains. Therefore, infected hosts are expected to expose themselves as outliers with respect to the total number of queried domains. When a bot searches for registered C&C domains by querying its algorithmically generated domains, we expect a sudden increase of queries and, once it hits a registered C&C domain, a sudden decrease.

For our botnet detection scenario, the time interval that bounds this increase-decrease behavior is important, as domains that fall into this interval are candidates for the linguistic bot model.

The bot behavior is implemented by another ‘descriptive & discover’ blueprint. It starts with a CrateDB connector to read the (reduced) DNS traffic from the previous blueprint.

Then, the SQL plugin is used to apply the known bot behavior to the DNS traffic with the help of two enrichments: one is a registered Tukey method for statistical outlier detection, and the other is a change point detection method to determine the beginning and end (in time) of a peak.

Both registered (pre-built) methods are used to determine the anomalous hosts and their associated peak intervals.
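
The following Python sketch illustrates both methods on toy data: Tukey fences flag hosts with anomalously high NXDomain counts, and a simple fence-based scan stands in for change point detection to bracket the peak interval. The 1.5 * IQR factor is the conventional Tukey choice, not necessarily the plugin’s parameter:

```python
import numpy as np

# Tukey upper fence: values above Q3 + 1.5 * IQR are outliers.
def tukey_upper_fence(values):
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + 1.5 * (q3 - q1)

# NXDomain counts per host (toy data): host 'h3' is the outlier.
counts = {"h1": 4, "h2": 7, "h3": 950, "h4": 5, "h5": 6}
fence = tukey_upper_fence(list(counts.values()))
anomalous_hosts = [h for h, c in counts.items() if c > fence]

# For an anomalous host, bracket the peak: the time buckets whose query
# counts exceed the fence computed over that host's own time series.
series = [2, 3, 2, 2, 3, 2, 180, 240, 210, 3, 2, 2]  # queries per minute (toy data)
host_fence = tukey_upper_fence(series)
peak = [i for i, c in enumerate(series) if c > host_fence]
start, end = peak[0], peak[-1]                        # begin/end of the peak interval
print(anomalous_hosts, (start, end))
```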

Then, CDAP’s built-in Joiner plugin joins these data with the previously reduced DNS traffic to determine each host and the domains it queried within the peak interval.
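
The join step can be sketched in a few lines of pandas (the Joiner plugin does the equivalent inside the blueprint); column names and toy data are illustrative assumptions:

```python
import pandas as pd

# Reduced NXDomain traffic and the anomalous hosts with their peak intervals.
traffic = pd.DataFrame({
    "host":   ["h3", "h3", "h1"],
    "domain": ["xkq7fzt.biz", "p0qm2vd.biz", "intranet.local"],
    "ts":     [100, 110, 90],
})
peaks = pd.DataFrame({"host": ["h3"], "peak_start": [95], "peak_end": [120]})

joined = traffic.merge(peaks, on="host")  # inner join drops normal hosts
in_peak = joined[(joined.ts >= joined.peak_start) & (joined.ts <= joined.peak_end)]
print(in_peak[["host", "domain"]])        # domains queried within the peak
```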

Then, the CrateDB connector is used to write the data prepared for linguistic analytics back to this database instance.

List of Plugin Types:

  1. CrateDB connector
  2. Lookup plugin
  3. SQL plugin
  4. Joiner plugin

Implement ‘Build’ Task

The selected detection algorithm determines generated domains without a priori knowledge of the underlying (reverse-engineered) domain generation algorithm, but with a comprehensive picture of the linguistic features of the generated domains, deployed as a linguistic model.

NXDomains and C&C domains are generated by the same underlying algorithm; the only difference is that the latter are registered. The expectation is that the derived linguistic model also detects C&C domains when applied to real-time DNS traffic.

The ‘build’ business task is solved by two ‘declarative & learn’ blueprints that both start with a CrateDB connector to read the domains queried within the peak interval.

The first builds a model of the domains’ linguistic attributes, and the second uses this attribute model to determine the similarity of different domains. Linguistic attributes that are shared by significantly more domains than others are considered linguistic signatures of bots and are stored as the linguistic bot model.

In more detail:

The first blueprint is composed of the CrateDB connector and the N-Gram and Word2Vec plugins to build the attribute model.

The second blueprint reuses the CrateDB connector and the N-Gram decomposer and applies the attribute model to represent each domain as an attribute vector. Then the K-Means plugin is used to build a cluster model from these vectors.
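
A minimal PySpark sketch of what the two ‘build’ blueprints compute, using toy data and hypothetical model paths (in the platform, the blueprints are assembled from the pre-built plugins rather than written as code):

```python
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml.feature import NGram, Word2Vec
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the domains queried within peak intervals;
# in the blueprint these are read via the CrateDB connector.
domains = spark.createDataFrame(
    [("xkq7fzt.biz",), ("p0qm2vd.biz",), ("mail.example.org",)], ["domain"])

# Decompose each domain into characters, then character n-grams.
to_chars = F.udf(lambda d: list(d), T.ArrayType(T.StringType()))
chars = domains.withColumn("chars", to_chars("domain"))
ngrams = NGram(n=2, inputCol="chars", outputCol="ngrams").transform(chars)

# Attribute model: Word2Vec embeds the n-grams and averages them per domain.
w2v_model = Word2Vec(vectorSize=16, minCount=1,
                     inputCol="ngrams", outputCol="features").fit(ngrams)
vectors = w2v_model.transform(ngrams)

# Cluster model: K-Means segments the attribute vectors.
kmeans_model = KMeans(k=2, seed=42, featuresCol="features").fit(vectors)

# Persist both models for reuse (the platform adds version management).
w2v_model.write().overwrite().save("/tmp/attribute-model")    # path is illustrative
kmeans_model.write().overwrite().save("/tmp/cluster-model")   # path is illustrative
```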

This cluster model holds the linguistic signatures of the domains’ similarities. Both models (attribute & cluster) are stored for reuse under the control of an internal version management.

List of Plugin Types:

  1. CrateDB connector
  2. N-Gram plugin
  3. Word2Vec (train) plugin
  4. K-Means (train) plugin

The expectation of this approach is that algorithmically generated domains are significantly more similar to each other than those defined by humans. This means we expect some clusters to contain significantly more members than others.

Implement ‘Evaluate’ Task

To verify this hypothesis on the one hand, and to detect these clusters on the other hand, it is sufficient to define a single ‘predictive & predict’ blueprint:

This blueprint reuses the CrateDB connector and the N-Gram decomposer, as well as the attribute & cluster models, to predict a cluster for each domain. Then the SQL plugin, associated with the Tukey method, is used to detect anomalous clusters.

The presence of anomalies confirms the algorithm’s hypothesis.
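
Sketched in PySpark, and reusing the ngrams data frame and the tukey_upper_fence helper from the earlier sketches (paths remain illustrative), the ‘evaluate’ blueprint boils down to:

```python
from pyspark.ml.feature import Word2VecModel
from pyspark.ml.clustering import KMeansModel

# Reuse both trained models and predict a cluster per domain.
w2v_model = Word2VecModel.load("/tmp/attribute-model")
kmeans_model = KMeansModel.load("/tmp/cluster-model")
predictions = kmeans_model.transform(w2v_model.transform(ngrams))

# Flag clusters whose member counts are Tukey outliers.
sizes = predictions.groupBy("prediction").count().collect()
fence = tukey_upper_fence([row["count"] for row in sizes])
anomalous_clusters = [row["prediction"] for row in sizes if row["count"] > fence]
print(anomalous_clusters)  # non-empty output supports the hypothesis
```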

List of Plugin Types:

  1. CrateDB connector
  2. N-Gram plugin
  3. Word2Vec (predict) plugin
  4. K-Means (predict) plugin
  5. SQL plugin

Before the ‘operate’ task is implemented by another blueprint, let’s summarize what has been achieved so far:

  1. The solution to this rather sophisticated botnet detection problem can be organized into four business tasks, each covering a well-defined phase of a cross-industry analytics process.
  2. The implementation is based on six blueprints in total: ‘prepare’ (2), ‘build’ (2), ‘evaluate’ (1) and ‘operate’ (1).

Business case, tasks, blueprints and plugins are tied together in a comprehensible, machine-readable and reusable business template that is offered on a template marketplace.

Implement ‘Operate’ Task

We decided to detect C&C domains in real time. So, different from the previous blueprints, the final ‘predictive & predict’ blueprint starts with the Zeek connector to subscribe to DNS traffic events.

The SQL plugin is used to restrict the traffic to resolved domains. Then most of the blueprint that solves the ‘evaluate’ task is reused, and each domain is assigned to one of the computed similarity clusters.

Finally, the SQL plugin is used again to restrict the output to those events and resolved domains that refer to anomalous clusters. We use the Kafka connector to publish our findings.
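
A minimal Python sketch of this real-time flow, using the kafka-python client; the event source, the topic name and the predict_cluster helper are illustrative assumptions, not the blueprint’s actual components:

```python
import json
from kafka import KafkaProducer  # kafka-python

# Publish suspected C&C domains found in live Zeek DNS events.
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

def handle_event(event, anomalous_clusters):
    if event.get("rcode_name") != "NOERROR":   # keep resolved domains only
        return
    # predict_cluster is a hypothetical helper wrapping the trained
    # Word2Vec and K-Means (predict) steps from the sketches above.
    cluster = predict_cluster(event["query"])
    if cluster in anomalous_clusters:
        producer.send("suspected-cc-domains", {"domain": event["query"],
                                               "host": event["id.orig_h"],
                                               "cluster": cluster})
```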

List of Plugin Types:

  1. Zeek connector
  2. SQL plugin
  3. N-Gram plugin
  4. Word2Vec (predict) plugin
  5. K-Means (predict) plugin
  6. Kafka connector

Summary

Detecting bots in the DNS traffic of a single network is achieved by orchestrating a few blueprints built from a couple of pre-built data connectors and operators (plugins).

Once orchestrated, each blueprint is turned into an executable AI binary by running the application generator with a single click.

What is relevant for identifying C&C domains in production and in real time is the single blueprint that implements the ‘operate’ task. Its Word2Vec and K-Means (predict) plugins have immediate access to each selected version of the trained attribute and cluster models. Using another version of these models in production is just a click away.

This is ‘The Works Way’ to implement the ‘predict as you train’ paradigm.

Originally published at https://www.linkedin.com.

The world’s first AI-prediction template provisioning and sharing platform for advanced data analytics.