Anycast Resolution Latency and Our Commitment to Transparency

Earlier today, at 11:40 a.m. UTC, we detected degraded performance across the DNS2 anycast network. Our team immediately escalated the issue to our hosting provider and implemented a fix by 1:00 p.m. UTC. Performance was fully restored by 1:44 p.m. UTC, and our team continued to monitor the situation. You can review the updates on our status page.

In the interest of transparency, I wanted to write this article to tell our customers exactly what we experienced and to provide additional information about this somewhat unusual incident.

The complete incident details

At 11:49 a.m. UTC, we detected degraded performance on part of our DNS2 anycast network. One of our hosting providers stopped sending our secondary prefixes, which pushed the majority of DNS2 traffic to our DNS1 anycast network. That shift was the initial cause of the degradation.

As traffic moved from DNS2 to DNS1, much of it landed on nodes in Copenhagen, Prague, Marseille, and Stockholm. Those nodes could not absorb the entire surge from DNS2, so traffic was rerouted again, this time to Sydney and Miami. While this failover mechanism maintained DNS resolution for our customers, it also added latency, primarily in the central and eastern US time zones. At their peak, DNS resolution times rose to roughly 300ms (3/10 of a second), though the average response time in that window was 11ms.
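A peak of roughly 300ms and an average of 11ms can coexist when only a small share of queries take the slow path. The sketch below works through that arithmetic. The 8ms baseline for an unaffected query is an assumption made purely for illustration, not a figure from our measurements, and the model treats every slow query as taking the full peak time, which is a simplification.

```python
# Illustrative arithmetic only. BASELINE_MS is an assumed value, not a measurement.
PEAK_MS = 300.0      # peak resolution time reported in this article
MEAN_MS = 11.0       # average response time reported in this article
BASELINE_MS = 8.0    # hypothetical latency of an unaffected query


def slow_fraction(mean_ms: float, baseline_ms: float, peak_ms: float) -> float:
    """Fraction of queries at peak_ms needed to pull the mean up to mean_ms.

    Solves mean = (1 - f) * baseline + f * peak for f.
    """
    if baseline_ms >= peak_ms or not (baseline_ms <= mean_ms <= peak_ms):
        raise ValueError("expected baseline <= mean <= peak and baseline < peak")
    return (mean_ms - baseline_ms) / (peak_ms - baseline_ms)


fraction = slow_fraction(MEAN_MS, BASELINE_MS, PEAK_MS)
print(f"{fraction:.2%} of queries at {PEAK_MS:.0f}ms yields a {MEAN_MS:.0f}ms mean")
```

Under that assumed baseline, about 1% of queries at the peak latency would be enough to produce the reported average, which is why an average alone can hide what the slowest users experienced.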

Since we use our own service internally, we also experienced this incident firsthand. While you might not have noticed the impact if you were browsing a news site at the time, sites that load many dynamic resources may have seemed slow because of the knock-on effects of slower resolution.

Because we saw this incident occur in real time, we immediately escalated the issue to our provider and worked with them to resolve the problem. Our hosting provider is also conducting a further RCA (root cause analysis) to understand what led to the routing interruption of our secondary prefixes.

Our fully redundant architecture allowed DNS resolution to continue, despite the increased resolution latency.

Changes we’re making

We are still investigating this incident with our hosting provider, as mentioned above. One thing we want to do better is decrease the MTTR (mean time to recovery) for these types of situations. We believe we can resolve issues like this significantly faster, even when the impact is low.
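For reference, the recovery time for this incident can be worked out from the timestamps given at the top of this article. The sketch below uses the 11:40 a.m. detection time from the summary; starting from the 11:49 a.m. time in the detailed timeline would shorten each figure by nine minutes. The calendar date is a placeholder, since only the differences between times matter, and the helper names are made up for this example. MTTR proper is a mean across many incidents, so this is a single data point.

```python
from datetime import datetime, timezone

# The incident's calendar date is not used here; this is an arbitrary placeholder.
PLACEHOLDER_DATE = (2000, 1, 1)


def utc(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on the placeholder date."""
    return datetime(*PLACEHOLDER_DATE, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


detected = utc(11, 40)     # degraded performance detected
fix_applied = utc(13, 0)   # fix implemented
restored = utc(13, 44)     # performance fully restored

time_to_fix = minutes_between(detected, fix_applied)
time_to_recovery = minutes_between(detected, restored)

print(f"Detection to fix:      {time_to_fix} minutes")
print(f"Detection to recovery: {time_to_recovery} minutes")
```

This gives 80 minutes from detection to the fix and 124 minutes from detection to full recovery.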

We are also reviewing internal processes and how we’ve structured our architecture to determine what changes we can make to reduce the impact surface area if an anycast node goes down.
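One simple way to frame that kind of review is a headroom check: for each node, ask whether the spare capacity on the remaining nodes could absorb its load. The sketch below is a toy version of that question. The node names, capacities, and loads are all hypothetical, and it ignores routing topology entirely (real anycast traffic moves to whichever surviving node BGP selects, not evenly across the network), so it is an illustration rather than a description of our tooling.

```python
# Hypothetical node names, capacities, and loads (queries per second).
nodes = {
    "node-a": {"capacity": 100_000, "load": 60_000},
    "node-b": {"capacity": 80_000, "load": 50_000},
    "node-c": {"capacity": 50_000, "load": 45_000},
}


def survives_loss(nodes: dict[str, dict[str, int]], failed: str) -> bool:
    """True if the other nodes' combined spare capacity covers the failed node's load."""
    spare = sum(
        node["capacity"] - node["load"]
        for name, node in nodes.items()
        if name != failed
    )
    return spare >= nodes[failed]["load"]


report = {name: survives_loss(nodes, name) for name in nodes}
for name, ok in report.items():
    print(f"{name}: {'absorbed' if ok else 'overloaded'} if lost")
```

With these made-up numbers, only the loss of node-c could be absorbed. Losing node-a or node-b would overload the rest, which is the kind of gap a review like this is meant to surface.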

When we built our anycast network, we purposefully created two parallel BGP networks so that if one network had any failures or latencies, the other network would pick up the slack. In one way, this incident was a testament to the success of that strategy. In another, it gives us the opportunity to build further improvements that account for the endless variety of problems that come with running a complex global anycast network.

I keep saying transparency

I often compare the service we provide to oxygen. If we’re controlling the oxygen flow for other companies like ours, we need all of the gauges to report accurately and every tank to be filled.

Providing our customers with a reliable, high-performance service remains a core value of ours. We know that we are an integral part of your technology stack, one that you need to simply work. That’s why we take incidents like this very seriously.

But I also recognize the need to share information when things like this occur. I’m a software user, too. I get impacted by incidents, too. As a technical user, I want answers to why these things occur. That is what we strive to do here: Be honest and responsive when incidents of this type do occur.

We are committed to our customers beyond the product itself. Each of you has chosen to partner with DNSFilter as your DNS resolution and filtering provider, deploying security to your organization via DNS through us. Thank you for choosing us, and we will continue to work hard to ensure that oxygen levels are at full capacity. And if the readings are ever off, we will always let you know.

Visit DNSFilter’s status page for details on this incident.
