
Deep Dive: Are Top Domain Lists Useful for Machine Learning Projects in Protective DNS?

Written by David Elkind | Jul 7, 2023 5:49:08 PM

 

One area of computer security research assesses how effectively machine learning models can automate the detection or blocking of malicious activity, and protective DNS is no exception. However, it is common practice for researchers to employ Top Domain Lists as the sole source of benign domains when training machine learning algorithms. Top Domain Lists are lists of domains compiled by technology companies that measure, in some specific sense, how “popular” a domain is. The best-known example of a Top Domain List is the Amazon Alexa Top 1 Million, which has now been deprecated. With Alexa retired, we hope that computer security researchers will shift their labeling strategies to more representative methods; we fear, however, that researchers will instead simply shift to alternative Top Domain Lists. This article aims to persuade computer security researchers to adopt a more representative methodology.

We conjecture that because Top Domains are, by definition, the most popular domains on the Internet, they tend to exhibit systematic differences compared to less-popular benign domains; furthermore, we suspect that these systematic differences will reduce the effectiveness of any machine learning model trained solely on Top Domain lists.

Specifically, there are two primary reasons that using Top Domain lists as the sole source of benign labels results in machine learning models that do not generalize: (1) to the extent that a machine learning model is based on patterns of user behavior in querying a domain, Top Domains are, by definition, identifiable by the large number of queries to those domains; (2) to the extent that a machine learning model relies on lexical characteristics of the domain string, Top Domains tend to be written in certain languages and have specific TLDs, whereas new benign domains arise from a broader range of languages and TLDs.

 

A standard part of undertaking a machine learning or artificial intelligence project is to train a model to recognize categories of objects. The most straightforward way to do this is to use supervised learning methods, which are characterized by their reliance on a corpus of labeled examples. As an example, a machine learning model to categorize the animals in an image would be trained on many images of animals, each annotated with appropriate labels (e.g., cat, dog, bird). 

In the context of protective DNS, a standard approach is to train a model that infers whether a domain is malicious. The data that the model uses to make its decision are called “features” and can take essentially any form (e.g., lexical statistics of the domain string, or patterns of behavior in query logs). Framed this way, the researcher labels domains as malicious or benign, and modeling proceeds as the estimation of a binary classifier, for which numerous well-known tools and algorithms are available.
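To make that framing concrete, here is a minimal sketch of the supervised setup, assuming scikit-learn is available. The feature choices, feature values, and labels below are toy placeholders invented for this post, not a description of any production system.

    # Each domain becomes a feature vector plus a binary label; any
    # off-the-shelf classifier can then be fit to the labeled examples.
    from sklearn.ensemble import RandomForestClassifier

    # Features per domain: [queries_per_day, unique_clients, domain_length]
    X = [
        [12000.0, 3500.0, 10.0],  # a popular, known-benign domain
        [8500.0, 2100.0, 13.0],   # another known-benign domain
        [3.0, 1.0, 22.0],         # a known-malicious domain
        [5.0, 2.0, 19.0],         # another known-malicious domain
    ]
    y = [0, 0, 1, 1]  # 0 = benign, 1 = malicious

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.predict_proba([[40.0, 15.0, 17.0]]))  # score a previously unseen domain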

However, the apparent simplicity of the labeling process belies some important nuance. The choice of how to collect data and how to label the data is a crucial step. The labels are the only resource that the model has to understand how to interpret its features to distinguish between malicious and benign domains, so providing the model with high-quality labels is of paramount importance.

The process of labeling data can be time-consuming and expensive, especially when expert knowledge is required to make a sound judgment. As a result, it’s common for academic researchers to “outsource” this labor: using domains appearing on one or more “Top Domain” lists as the sole source of labels for benign domains is a widespread practice in academic research.

On the other hand, when curating labels for malicious domains, researchers turn to the wealth of lists of newly-detected malicious domains. This is good practice, because the tactics, techniques, and procedures (TTPs) of adversaries are constantly evolving to evade computer security researchers’ detection strategies. We believe that there is a similar need for currency and adaptability on the part of computer security researchers with respect to benign domains. The appearance and behavior of newly-observed benign domains also changes over time, so the data sources that are used to construct machine learning models for domain classification should reflect that.

First, this article undertakes a literature review to demonstrate that the practice of using Top Domain lists in academic research on protective DNS is widespread.

Second, this article conjectures two core conceptual flaws of using Top Domains as the sole source of labels for a machine learning classifier for protective DNS.

Third, we suggest alternative sources and methods that researchers can use in place of Top Domain Lists.

 

Using top domain lists as a source of labels for benign domains is a widespread practice among papers that apply machine learning to security problems related to the domain name system.

As a starting point, we conducted a series of searches on Google Scholar to characterize how widespread different top domain lists are in academic research pertaining to machine learning on DNS. We used the search string $PRODUCT top "domain name system" ("machine learning" OR "neural network") where $PRODUCT is replaced by the name of the respective resource. All searches were carried out on 06-01-2023. The search results are summarized in Table 1.

Table 1: Number of academic articles returned for each search string.

Source of Top Domain List     Number of Google Scholar Hits (01-01-2019 to 06-01-2022)
Alexa                         701
Common Crawl                  445
Cisco Umbrella                 63
Tranco                         40
Majestic                       39

 

For context, omitting $PRODUCT from the query string and instead searching for "domain name system" ("machine learning" OR "neural network") turns up 4,450 papers, so roughly 15% of these papers use Alexa data and 10% use Common Crawl data. (A single paper could match both the Alexa and Common Crawl search strings, however, so we cannot easily characterize the extent to which the result sets for two or more queries overlap.1)

Naturally, not all of the papers returned by this search will use Alexa or other top domain lists as their sole source of benign labels, but this method does provide a starting point for understanding the extent to which Top Domain lists are used in machine learning research for DNS.

Literature reviews appearing in articles about protective DNS find many examples of academic articles using Top Domain Lists as the source of benign domain labels. Vranken and Alizadeh conduct a comprehensive literature review and summarize other works aimed at DGA (Domain Generation Algorithm) detection.2 They identify roughly four dozen recent papers about DGA detection. In only two cases is the source of benign labels not a top domain list; in those cases, the source is termed “private” or “passive DNS,” without elaboration. In nearly every other article, the source of benign data was Alexa. The Cisco Umbrella, OpenDNS (the previous name for Cisco Umbrella), and Majestic Top Domain Lists appear in a small minority of the papers the authors surveyed.

The review of Li et al. summarizes the common sources of benign labels. They write “In most related studies, the top lists provided by the well-known Internet portal Alexa, are used to build a dataset of clean domain names. Alexa provides various top rankings which are classified under different criteria, such as per country and so on.”3

Rarely, researchers acknowledge the deficiency of using Alexa and similar Top Domain Lists as a source for benign labels. Sivaguru et al. write

“We note that it is common in earlier research on DGA classification to construct a training dataset with negative examples from whitelists such as Alexa. Alexa ranks websites based on their popularity in terms of number of page views and number of unique visitors. For example, according to Alexa, the three highest ranked domain names in terms of popularity on 07-08-2020 are google.com, youtube.com, and tmall.com. In our previous research we observed that DGA classifiers trained on a dataset with domains pulled from whitelists and blacklists tend to do well when evaluated on a similar test dataset, but don’t fare well at all when deployed on real traffic. We found that a whitelist such as Alexa, which consists of domains that are collected at the browser level, is not sufficiently representative of all non-malicious domain names that appear in real traffic, causing DGA classifiers trained on Alexa to yield many false positives. This is the main reason why in the current study we opt to use benign samples collected from real traffic with a predefined set of heuristics instead.”4

1 Another complication to this analysis is that each service provider might have launched, relaunched, changed the name of the product, or changed the ability of members of the public to access the data at one or more times during the period between 2019-01-01 and today. Likewise, Amazon’s Alexa Top Domain service has been retired; Amazon published the last Alexa Top Domain list on 2023-02-01, and has announced that it will cease to host the data as of 2023-07-31. It would be challenging to account for the consequences of these launches and renamings without creating our own annotations for each of hundreds of papers, so we simply use the current name of each service in these queries.

2 Vranken, Harald, and Hassan Alizadeh. "Detection of DGA-generated domain names with TF-IDF." Electronics 11.3 (2022): 414.

3 Li, K., Yu, X., Wang, J. (2021). A Review: How to Detect Malicious Domains. In: Sun, X., Zhang, X., Xia, Z., Bertino, E. (eds) Advances in Artificial Intelligence and Security. ICAIS 2021. Communications in Computer and Information Science, vol 1424. Springer, Cham. https://doi.org/10.1007/978-3-030-78621-2_12

4 Sivaguru, Raaghavi, et al. "Inline detection of DGA domains using side information." IEEE Access 8 (2020): 141910-141922.

 

In protective DNS, there are two broad categories of research. 

  1. Classification of domains according to patterns of behavior among users who query the domains. The main hypothesis here is that the pattern of requests will be measurably different for benign and malicious domains. This can take the form of a temporal pattern of behavior (i.e., statistics that characterize queries over time will differ), or a pattern of association when the query logs are viewed as a graph (i.e., that there is homophily in some graph representation of the query logs, such as host-domain or domain-resolved IP address).
  2. Classification of domains according to the lexical structure of the domain string. The main hypothesis here is that procedurally-generated strings (such as those produced by a domain generation algorithm) tend to “look different” from other domain strings. In some cases, DGA-generated domains do not contain real words; in others, they concatenate words or parts of words to create a short phrase. A sketch of the kind of lexical features such a model might use appears after this list.
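As a rough illustration of the lexical route, the snippet below computes a few simple statistics of a domain string. The particular features and example domains are illustrative choices for this post, not the feature set of any specific published classifier.

    import math

    def lexical_features(domain: str) -> dict:
        """Simple lexical statistics of the leftmost label of a domain."""
        label = domain.split(".")[0]
        freqs = [label.count(c) / len(label) for c in set(label)]
        return {
            "length": len(label),
            "entropy": -sum(p * math.log2(p) for p in freqs),
            "digit_ratio": sum(c.isdigit() for c in label) / len(label),
            "vowel_ratio": sum(c in "aeiou" for c in label) / len(label),
        }

    print(lexical_features("google.com"))          # dictionary-word-like label
    print(lexical_features("xjq3k9vmsl20a.info"))  # DGA-like label scores very differently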

The essence of the domain classification problem is deciding whether or not a domain represents a security risk for one’s organization or users. In many circumstances, this can be simplified to observing a domain for the first time and making a decision.5 New domains are, necessarily, characterized by their recent appearance in query logs. Likewise, statistics about how often they are queried and how many users query them will tend strongly toward small numbers of queries from a small number of users. Increasingly, newly-observed domains also come from a broader set of languages than those that dominate Top Domain lists, and they often use more recently-provisioned TLDs. (For instance, Google recently announced that it will start selling domains with .zip and .mov TLDs.6 Necessarily, there were zero domains with these TLDs before Google made them available.)

We conjecture that using Top Domain Lists as the sole source of benign labels undermines any resulting model because it creates a systematic inductive bias in the resulting machine learning classifier. Top Domain Lists are systematically different from newly-observed domains: by the nature of their compilation process, they are not a representative sample of the kinds of domains that the model will be asked to classify in the future, so relying on them is poor practice and will tend to produce unhelpful machine learning models.

In the first category of domain classification research, characterizing behavior patterns, we would expect domains on Top Domain Lists to exhibit different traffic patterns and graph associations precisely because they are the most popular domains on the Internet.

In the second category of domain classification research, the lexical traits of domains on Top Domain Lists will tend to differ from those of newly-observed domains because the lists are, by their nature, strongly influenced by the data used to compile them. Alexa, Cisco Umbrella, and every other provider is at the mercy of its own methodology in creating these lists. As a starting point, the lists tend to be dominated by the languages spoken by the societies that supply data to the Top Domain List aggregator. Partly due to the history of how the Internet came about, and partly because of this data collection effect, few punycode domains appear on Top Domain Lists. Regardless of the cause, relying on a Top Domain List presents a problem for model-building: without a rich variety of domain strings, including punycode domains, a model will not have enough information to make accurate decisions in a production setting.
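To illustrate why punycode matters for lexical models, the short sketch below uses Python’s built-in idna codec to show how an internationalized domain is rendered as an ASCII string. The example domain is a hypothetical placeholder.

    # Internationalized domain names are carried in DNS as punycode, which
    # looks nothing like the ASCII-word domains that dominate Top Domain Lists.
    unicode_domain = "münchen.example"
    ascii_form = unicode_domain.encode("idna").decode("ascii")
    print(ascii_form)                     # xn--mnchen-3ya.example
    print(ascii_form.startswith("xn--"))  # cheap check for a punycode label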

Moreover, as the availability of “classic” TLDs such as .com and .org becomes more scarce, newcomers seeking attractive domain names will naturally shift toward newer TLDs. Likewise, adversaries often favor specific TLDs for reasons of cost, or because the registrars maintaining those TLDs are lax about security and shield the adversaries’ personal information.7 There is therefore a risk of developing machine learning models that systematically overestimate the riskiness of some TLDs while underestimating the riskiness of others, especially when the sources of labels are generated in a way that creates a systematic disparity in which TLDs appear in which label set.

In other words, the real-world task for a machine learning model is not to discern malicious domains from domains that appear on a Top Domains List; it is to detect malicious domains in a sea of new, unusual domains. The distinction is subtle but crucial to producing a model that addresses a real-world problem.

5 Sophisticated threat actors may employ various strategies to evade detection aimed at newly-registered or newly-observed domains, such as strategic domain aging. These strategies emphasize the dynamic nature of the Internet. Because the content of a domain can change over time, security practitioners will likewise need to develop tools to update risk assessments for domains over time to address this scenario.

6 Lily Hay Newman. “The Real Risks in Google’s New .Zip and .Mov Domains.” Wired. 2023-05-01.

7 Brian Krebs writes in a blog post “The number of phishing websites tied to domain name registrar Freenom dropped precipitously in the months surrounding a recent lawsuit from social networking giant Meta, which alleged the free domain name provider has a long history of ignoring abuse complaints about phishing websites while monetizing traffic to those abusive domains.” Brian Krebs. Krebs on Security. “Phishing Domains Tanked After Meta Sued Freenom.” 2023-05-23.

 

There are remedies to the systematic bias that can arise from using Top Domain Lists as the sole source of benign labels. We outline several alternative approaches here, so that researchers might select the best one for their particular needs.

 

When a researcher has access to a large volume of DNS queries generated by real users, it is feasible to discover which domains are “newly-observed” (or, at least, queried infrequently enough that the difference is not enormously important). One strategy is to simply record, for each day in the study period, the domains that were queried on that day, but not queried for a large number of previous days.
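A minimal sketch of that strategy follows, assuming the researcher already has one set of queried domains per day; the window length and data structures are illustrative choices.

    from collections import deque

    def newly_observed(daily_domains, window=30):
        """Yield (day_index, domain) for domains not queried in the prior `window` days."""
        history = deque(maxlen=window)  # sets of domains from the last `window` days
        for day, todays_domains in enumerate(daily_domains):
            seen_recently = set().union(*history) if history else set()
            for domain in todays_domains - seen_recently:
                yield day, domain
            history.append(todays_domains)
        # Note: the first days of the study period will naturally look "new"
        # until the window fills; in practice one would discard that warm-up.

    # Toy example: "new.example" first appears on day 2.
    days = [{"a.com", "b.org"}, {"a.com"}, {"a.com", "new.example"}]
    print(list(newly_observed(days)))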

The effect is to identify “new” domains and then label them: by construction, these newly-observed domains are new, or at least infrequently queried among the user base. Computer security experts can manually review these newly-observed domains and determine whether they are malicious or benign.

Manual curation of large-scale benign datasets may not be practical given the scarcity of experienced security professionals. However, even modest curated sets can be used to construct or evaluate machine learning models built with tools specifically designed for problems where labels are scarce, or present for only a single class.

 

To facilitate curating a suitable set of domains, benign or malicious, researchers might consider using heuristics to exclude some domains from being labeled benign. As an example, Sivaguru et al. outline about a dozen criteria for a domain to be considered for inclusion in their set of benign domains.8 The crucial task when using heuristics, however, is to avoid committing egregious statistical sins that reproduce the very bias one is attempting to mitigate.9
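As a purely illustrative sketch of heuristic filtering, the function below applies a few plausible rules; the specific fields and thresholds are assumptions made for this post, not the criteria used by Sivaguru et al.

    def benign_candidate(stats: dict, blocklist: set) -> bool:
        """Return True if a domain looks like a reasonable candidate for a benign label."""
        return (
            stats["domain"] not in blocklist   # not on any known threat feed
            and stats["unique_clients"] >= 5   # queried by several distinct clients
            and stats["days_observed"] >= 14   # seen consistently, not a one-off
            and stats["resolves"]              # currently resolves in DNS
        )

    example = {"domain": "smallbakery.example", "unique_clients": 9,
               "days_observed": 30, "resolves": True}
    print(benign_candidate(example, blocklist=set()))  # True

Note that even rules this simple can quietly reintroduce bias; a strict minimum-popularity threshold, for instance, begins to recreate a Top Domain List.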

In both cases, whether curation is fully manual or assisted by heuristic filters, assembling a set of benign labels can be expensive and time-consuming. Careful attention should be paid to which domains are selected for labeling, so that considerable effort is not spent producing a biased dataset. After all, the goal of this effort is not to recreate the problems of using Top Domain Lists as the sole source of benign labels.

8 Sivaguru, Raaghavi, et al. "Inline detection of DGA domains using side information." IEEE Access 8 (2020): 141910-141922.

9 An extended discussion of how to construct a good sample would not fit in this blog post, but a good resource is Phillip I. Good, James William Hardin (2003). Common errors in statistics (and how to avoid them). Wiley.

 

Machine learning researchers have developed methods for training models when only partial label data is available. As a schematic, we can categorize these methods as semi-supervised classification, positive-unlabeled classification, and one-class classification, according to what kind of data and labels are available.

These methods tend to be more complex, and they come with the caveat that it is not always possible to do better than simply training a model on what little labeled data is available. Careful evaluation and testing criteria are crucial components of successful model development.

Semi-supervised machine learning addresses the situation where the researcher has some small amount of labeled data for both classes, as well as a much larger amount of unlabeled data. The goal is to use the labeled data and unlabeled data together to train a classifier that is better than just using the labeled data alone.
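As a small sketch of this setup, scikit-learn’s SelfTrainingClassifier wraps an ordinary classifier and uses the label -1 to mark unlabeled rows; the feature values below are synthetic placeholders.

    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    # Labeled rows carry 0 (benign) or 1 (malicious); unlabeled rows carry -1.
    X = [[10.0, 3.0], [12.0, 2.5], [1.0, 9.0], [0.5, 8.5],
         [11.0, 3.2], [0.8, 9.1], [6.0, 5.0]]
    y = [0, 0, 1, 1, -1, -1, -1]

    model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
    print(model.predict([[9.0, 3.5]]))  # classify the features of a new domain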

The major caveat to semi-supervised learning is that augmenting a dataset with unlabeled data does not always improve the accuracy of the resulting model. It is not well understood when semi-supervised learning provides improvements over the simpler method of training on only the labeled data (even when the amount of labeled data is relatively small), so researchers should take care to verify whether they are actually garnering any improvement from this approach.

Positive-unlabeled (PU) learning is an active area of machine learning research that develops methods for training binary classifiers when only one class has labeled examples. This scenario arises naturally in computer security: researchers and vendors put a great deal of effort into discovering and distributing lists of malicious domains so that they can be blocked, while similar efforts for benign websites are rarer and less reliable. DNS query logs, however, provide numerous examples of unlabeled domains, each of which may be either benign or malicious. The successful application of positive-unlabeled learning would allow researchers to train a model when only the malicious class is labeled.
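One classic recipe for this setting is the Elkan and Noto (2008) approach: fit a classifier to separate labeled positives from unlabeled examples, estimate the labeling frequency on held-out positives, and rescale the scores. The sketch below uses synthetic data purely for illustration and is only one of several possible PU strategies.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    labeled_pos = rng.normal(loc=2.0, size=(200, 2))             # known-malicious feature vectors
    unlabeled = np.vstack([rng.normal(loc=2.0, size=(100, 2)),   # hidden positives
                           rng.normal(loc=-2.0, size=(300, 2))]) # hidden negatives

    X = np.vstack([labeled_pos, unlabeled])
    s = np.array([1] * len(labeled_pos) + [0] * len(unlabeled))  # s=1 means "labeled positive"

    X_train, X_hold, s_train, s_hold = train_test_split(X, s, random_state=0)
    g = LogisticRegression().fit(X_train, s_train)

    # Estimate c = P(labeled | positive) on held-out labeled positives, then
    # approximate P(malicious | x) as g(x) / c for the unlabeled pool.
    c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()
    scores = g.predict_proba(unlabeled)[:, 1] / c
    print("estimated malicious fraction among unlabeled domains:", round((scores > 0.5).mean(), 2))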

One-class classification is a machine learning technique that requires only examples belonging to a single class. This is similar to PU learning, except that there is no unlabeled data. A researcher who only has access to the domains in a threat feed, for instance, may wish to apply one-class classification to their research. 
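As a brief sketch, a one-class model such as scikit-learn’s OneClassSVM can be fit on feature vectors from threat-feed domains alone and then asked whether new domains resemble that class. All feature values here are synthetic placeholders.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(1)
    malicious_features = rng.normal(loc=3.0, size=(150, 2))  # only the malicious class is available

    model = OneClassSVM(nu=0.1).fit(malicious_features)

    new_domains = np.array([[3.1, 2.9],    # resembles the malicious class
                            [-2.0, -2.5]]) # does not
    print(model.predict(new_domains))      # +1 = inside the learned class, -1 = outside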

The primary drawback to one-class classification is that it must, necessarily, make very strong assumptions about the data-generating processes that produce benign and malicious domains. To the extent that the assumptions of a particular method are out of step with reality, one-class classification will tend to produce models with lower accuracy than models trained with more informative label information.

 

Constructing a useful machine learning model requires careful attention to data collection methods and data quality. Poor data collection can lead to machine learning models that do not generalize to real-world situations. This post outlines the pitfalls of using a poor data curation strategy.

In part 2 of this deep dive, we will compare the efficacy of two models: One trained using a Top Domain List as its source of benign labels and a second trained using curation of novel domains.