Anycast Resolution Latency and Our Commitment to Transparency

by DNSFilter on Mar 23, 2023 4:21:00 PM

Early today at 11:40 a.m. UTC, we detected degraded performance across the DNS2 anycast network. Our team immediately escalated the issue to our hosting provider and took action to implement a fix by 1:00 p.m. UTC. Performance was fully restored by 1:44 p.m. UTC, and our team continued to monitor the situation. You can review the updates on our status page.

In the interest of transparency, I wanted to write this article to give our customers a detailed account of exactly what we experienced, along with additional information about this somewhat unusual issue.

The complete incident details

At 11:49 a.m. UTC we detected degraded performance on part of our DNS2 anycast network. One of our hosting providers stopped sending our secondary prefixes, pushing the majority of DNS2 traffic to our DNS1 anycast network, which is the initial cause of this degradation.

During the shift from DNS2 to DNS1, much of that traffic moved to nodes in Copenhagen, Prague, Marseille, and Stockholm. Those nodes could not handle the entire surge from DNS2, so traffic was rerouted again to Sydney and Miami. While this failover mechanism maintained DNS resolution for our customers, it also added latency, primarily in central and eastern US time zones. At its peak, DNS resolution time rose to roughly 300ms (3/10 of a second), though the average response time in that window was 11ms.
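A 300ms peak and an 11ms average can coexist when only a small share of queries take the slow path. The sketch below illustrates that arithmetic with an invented sample (the counts and the 8ms baseline are assumptions made for illustration, not measurements from this incident):

```python
from statistics import mean

# Hypothetical sample, invented for illustration only:
# 990 queries answered in 8 ms and 10 queries that took 300 ms.
latencies_ms = [8.0] * 990 + [300.0] * 10


def summarize(samples):
    """Return (average, peak) latency in milliseconds."""
    return mean(samples), max(samples)


average_ms, peak_ms = summarize(latencies_ms)
print(f"average: {average_ms:.2f} ms, peak: {peak_ms:.0f} ms")
```

In this made-up sample, roughly 1% of queries hitting the slow path is enough to produce a 300ms peak while the average stays near 11ms.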

Since we use our own service internally, we also experienced this incident firsthand. While you might not have noticed the impact if you were browsing a news site at the time, sites that load many dynamic resources may have seemed slow because of the knock-on effects of slower resolution.

Because we saw this incident occur in real time, we immediately escalated the issue to our provider and collaborated with them to resolve the problem. Our hosting provider is also conducting a further RCA (root cause analysis) to understand what led to the routing interruption of our secondary prefixes.

Our fully redundant architecture allowed DNS resolution to continue, albeit with increased latency.

Changes we’re making

We are still investigating this incident with our hosting provider, as mentioned above. One thing we want to do a better job of is decreasing the MTTR (mean time to recovery) for these types of situations. We believe we can resolve these issues significantly faster, even when the impact is low.
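For readers less familiar with the metric, MTTR is the average time between detecting an incident and restoring service. A minimal sketch, assuming the 11:49 a.m. UTC detection time from the incident details and the 1:44 p.m. UTC restoration time reported in this post (the function name and the list structure are illustrative, not part of any DNSFilter tooling):

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (restored - detected) across a list of (detected, restored) pairs."""
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


# One incident, using the timestamps given in this post (UTC).
incidents = [
    (datetime(2023, 3, 23, 11, 49), datetime(2023, 3, 23, 13, 44)),
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:55:00
```

With a single incident the mean is simply that incident's recovery time. The metric becomes more useful as further incidents are added to the list.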

We are also reviewing internal processes and how we’ve structured our architecture to determine what changes we can make to reduce the impact surface area if an anycast node goes down.

When we built our anycast network, we purposefully created two parallel BGP networks so that if one network experienced failures or latency, the other network would pick up the slack. In one way, this incident was a testament to the success of that strategy; but in another way, this incident will allow us to build further improvements to account for the endless range of problems that come with running a complex global anycast network.
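The idea of one network picking up the slack for the other can be sketched from a client's point of view. This is a simplified illustration only: it does not describe how BGP routing or DNSFilter's network actually shifts traffic, the addresses come from reserved documentation ranges, and `fake_query` is a hypothetical stand-in for a real DNS lookup.

```python
from typing import Callable, Optional, Sequence, Tuple


def resolve_with_failover(
    name: str,
    resolvers: Sequence[str],
    query: Callable[[str, str], Optional[str]],
) -> Tuple[str, str]:
    """Try each resolver in order; return (resolver, answer) from the first that responds."""
    for resolver in resolvers:
        try:
            answer = query(resolver, name)
        except TimeoutError:
            continue  # this network did not respond; fall through to the next one
        if answer is not None:
            return resolver, answer
    raise RuntimeError(f"no resolver answered for {name}")


def fake_query(resolver: str, name: str) -> Optional[str]:
    """Hypothetical stand-in for a DNS query: the first network times out."""
    if resolver == "192.0.2.1":
        raise TimeoutError
    return "203.0.113.10"


print(resolve_with_failover("example.com", ["192.0.2.1", "198.51.100.1"], fake_query))
```

Here the first address never answers, so the lookup is served by the second one. Resolution still succeeds, but the extra attempt is one source of added latency.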

I keep saying transparency

I often compare the service we provide to oxygen. If we’re controlling the oxygen flow for other companies like ours, we need all of the gauges to report accurately, and every tank has to be full.

Providing our customers with a reliable, high-performance service remains a core value of ours. We know that we are an integral part of your technology stack, one that you need to simply work. That’s why we take incidents like this very seriously.

But I also recognize the need to share information when things like this occur. I’m a software user, too. I get impacted by incidents, too. As a technical user, I want answers to why these things occur. That is what we strive to do here: Be honest and responsive when incidents of this type do occur.

We are committed to our customers beyond the product itself. Each of you has chosen to partner with DNSFilter as your DNS resolution and filtering provider, deploying security to your organization via DNS through us. Thank you for choosing us, and we will continue to work hard to ensure that oxygen levels are at full capacity. And if the readings are ever off, we will always let you know.

Visit DNSFilter’s status page for details on this incident.

Topics: Product & Features
