2014-06-13

[vc_row][vc_column width="1/1"][vc_column_text]

Google Public DNS is a free, global Domain Name System (DNS) resolution service that you can use as an alternative to your current DNS provider.

To try it out:

Configure your network settings to use the IP addresses 8.8.8.8 and 8.8.4.4 as your DNS servers or

Read our configuration instructions (IPv6 addresses supported too).

If you decide to try Google Public DNS, your client programs will perform all DNS lookups using Google Public DNS.[/vc_column_text][vc_column_text]The DNS protocol is an important part of the web’s infrastructure, serving as the Internet’s phone book: every time you visit a website, your computer performs a DNS lookup. Complex pages often require multiple DNS lookups before they start loading, so your computer may be performing hundreds of lookups a day.[/vc_column_text][vc_column_text]By using Google Public DNS you can:

Speed up your browsing experience.

Improve your security.

Get the results you expect with absolutely no redirection.

[/vc_column_text][vc_column_text]

Introduction: causes and mitigations of DNS latency

As web pages become more complex, referencing resources from numerous domains, DNS lookups can become a significant bottleneck in the browsing experience. Whenever a client needs to query a DNS resolver over the network, the latency introduced can be significant, depending on the proximity and number of nameservers the resolver has to query (more than two is rare, but it can happen). As an example, the following screenshot shows the timings reported by the Page Speed web performance measurement tool. Each bar represents a resource referenced from the page; the black segments indicate DNS lookups. In this page, 13 lookups are made in the first 11 seconds of loading. Although several of the lookups are done in parallel, the screenshot shows that 5 serial lookups are required, accounting for several seconds of the total 11-second page load time.



There are two components to DNS latency:

Latency between the client (user) and DNS resolving server. In most cases this is largely due to the usual round-trip time (RTT) constraints in networked systems: geographical distance between client and server machines; network congestion; packet loss and long retransmit delays (one second on average); overloaded servers, denial-of-service attacks and so on.

Latency between resolving servers and other nameservers. This source of latency is caused primarily by the following factors:

Cache misses. If a response cannot be served from a resolver’s cache, but requires recursively querying other nameservers, the added network latency is considerable, especially if the authoritative servers are geographically remote.

Underprovisioning. If DNS resolvers are overloaded, they must queue DNS resolution requests and responses, and may begin dropping and retransmitting packets.

Malicious traffic. Even if a DNS service is overprovisioned, DoS traffic can place undue load on the servers. Similarly, Kaminsky-style attacks can involve flooding resolvers with queries that are guaranteed to bypass the cache and require outgoing requests for resolution.

We believe that the cache miss factor is the most dominant cause of DNS latency, and discuss it further below.

Cache misses

Even if a resolver has abundant local resources, the fundamental delays associated with talking to remote nameservers are hard to avoid. In other words, assuming the resolver is provisioned well enough so that cache hits take zero time on the server-side, cache misses remain very expensive in terms of latency. To handle a miss, a resolver has to talk to at least one, but often two or more external nameservers. Operating the Googlebot web crawler, we have observed an average resolution time of 130 ms for nameservers that respond. However, a full 4-6% of requests simply time out, due to UDP packet loss and servers being unreachable. If we take into account failures such as packet loss, dead nameservers, DNS configuration errors, etc., the actual average end-to-end resolution time is 300-400 ms. However, there is high variance and a long tail.

Though the cache miss rate may vary among DNS servers, cache misses are fundamentally difficult to avoid, for the following reasons:

Internet size and growth. Quite simply, as the Internet grows through the addition of new users and new sites, most content becomes of marginal interest to any single user. While a few sites (and consequently DNS names) are very popular, most are of interest to only a few users and are accessed rarely, so the majority of requests result in cache misses.

Low time-to-live (TTL) values. The trend towards lower DNS TTL values means that resolutions need more frequent lookups.

Cache isolation. DNS servers are typically deployed behind load balancers which assign queries to different machines at random. This results in each individual server maintaining a separate cache rather than being able to reuse cached resolutions from a shared pool.

[/vc_column_text][vc_column_text]

Mitigations

In Google Public DNS, we have implemented several approaches to speeding up DNS lookup times. Some of these approaches are fairly standard; others are experimental:

Provisioning servers adequately to handle the load from client traffic, including malicious traffic.

Preventing DoS and amplification attacks. Although this is mostly a security issue, and affects closed resolvers less than open ones, preventing DoS attacks also has a benefit for performance by eliminating the extra traffic burden placed on DNS servers. For information on the approaches we are using to minimize the chance of attacks, see the page on security benefits.

Load-balancing for shared caching, to improve the aggregated cache hit rate across the serving cluster.

Providing global coverage for proximity to all users.

[/vc_column_text][vc_column_text]

Provisioning serving clusters adequately

Caching DNS resolvers have to perform more expensive operations than authoritative nameservers, since many responses cannot be served from memory; instead, they require communication with other nameservers and thus demand a lot of network input/output. Furthermore, open resolvers are highly vulnerable to cache poisoning attempts, which increase the cache miss rate (such attacks specifically send requests for bogus names that can’t be resolved from cache), and to DoS attacks, which add to the traffic load. If resolvers are not provisioned adequately and cannot keep up with the load, this can have a very negative impact on performance. Packets get dropped and need to be retransmitted, nameserver requests have to be queued, and so on. All of these factors add to delays.

Therefore, it’s important for DNS resolvers to be provisioned for high-volume input/output. This includes handling possible DDoS attacks, for which the only effective solution is to over-provision with many machines. At the same time, however, it’s important not to reduce the cache hit rate when you add machines; this requires implementing an effective load-balancing policy, which we discuss below.[/vc_column_text][vc_column_text]

Load-balancing for shared caching

Scaling resolver infrastructure by adding machines can actually backfire and reduce the cache hit rate if load balancing is not done properly. In a typical deployment, multiple machines sit behind a load balancer that distributes traffic equally to each machine, using a simple algorithm such as round robin. As a result, each machine maintains its own independent cache, and cached content is isolated across machines. If each incoming query is distributed to a random machine, then depending on the nature of the traffic, the effective cache miss rate can increase proportionally. For example, for names with long TTLs that are queried repeatedly, the cache miss rate can increase by a factor of the number of machines in the cluster. (For names with very short TTLs, names that are queried very infrequently, or names that result in uncacheable responses (0 TTL and errors), the cache miss rate is not really affected by adding machines.)
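One way to avoid this fragmentation is to route queries by name rather than at random, so that all queries for the same name land on the same machine. A minimal sketch of this idea, using a hash of the query name over a hypothetical pool of cache machines (the machine names here are illustrative, not Google’s actual cluster layout):

```python
import hashlib

# Hypothetical second-level cache pool; the machine names are illustrative.
CACHE_POOL = ["cache-0", "cache-1", "cache-2", "cache-3"]

def pick_cache_machine(qname: str) -> str:
    """Send every query for the same name to the same machine, so the
    cluster's caches act as one shared pool instead of isolated copies."""
    digest = hashlib.sha1(qname.lower().encode()).digest()
    return CACHE_POOL[int.from_bytes(digest[:4], "big") % len(CACHE_POOL)]
```

Because the hash is deterministic, repeated queries for a popular name hit a warm cache regardless of which front-end received them.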

To boost the hit rate for cacheable names, it’s important to load-balance servers so that the cache is not fragmented. In Google Public DNS, we have two levels of caching. In one pool of machines, very close to the user, a small per-machine cache contains the most popular names. If a query cannot be satisfied from this cache, it is sent to another pool of machines that partition the cache by names. For this second level cache, all queries for the same name are sent to the same machine, where the name is either cached or it isn’t.[/vc_column_text][vc_column_text]

Distributing serving clusters for wide geographical coverage

For closed resolvers, this is not really an issue. For open resolvers, the closer your servers are located to your users, the less latency they will see at the client end. In addition, having sufficient geographical coverage can indirectly improve end-to-end latency, as nameservers typically return results optimized for the DNS resolver’s location. That is, if a content provider hosts mirrored sites around the world, that provider’s nameservers will return the IP address in closest proximity to the DNS resolver.

Google Public DNS is hosted in data centers worldwide, and uses anycast routing to send users to the geographically closest data center.

Note, however, that because nameservers geolocate according to the resolver’s IP address rather than the user’s, Google Public DNS has the same limitations as other open DNS services: that is, the server to which a user is referred might be farther away than one to which a local DNS provider would have referred. This could cause a slower browsing experience for certain sites.[/vc_column_text][vc_column_text]

Introduction: DNS security threats and mitigations

Because of the open, distributed design of the Domain Name System, and its use of the User Datagram Protocol (UDP), DNS is vulnerable to various forms of attack. Public or “open” recursive DNS resolvers are especially at risk, since they do not restrict incoming packets to a set of allowable source IP addresses. We are mostly concerned with two common types of attacks:

Spoofing attacks leading to DNS cache poisoning. Various types of DNS spoofing and forgery exploits abound, aiming to redirect users from legitimate sites to malicious websites. These include so-called “Kaminsky attacks”, in which attackers take authoritative control of an entire DNS zone.

Denial-of-service (DoS) attacks. Attackers may launch DDoS attacks against the resolvers themselves, or hijack resolvers to launch DoS attacks on other systems. Attacks that use DNS servers to launch DoS attacks on other systems by exploiting large DNS record/response size are known as amplification attacks.

Each class of attack is discussed further below.

Cache poisoning attacks

There are several variants of DNS spoofing attacks that can result in cache poisoning, but the general scenario is as follows:

The attacker sends a target DNS resolver multiple queries for a domain name for which they know the server is not authoritative, and that is unlikely to be in the server’s cache.

The resolver sends out requests to other nameservers (whose IP addresses the attacker can also predict).

In the meantime, the attacker floods the victim server with forged responses that appear to originate from the delegated nameserver. The responses contain records that ultimately resolve the requested domain to IP addresses controlled by the attacker. They might contain answer records for the resolved name or, worse, they may further delegate authority to a nameserver owned by the attacker, so that they take control of an entire zone.

If one of the forged responses matches the resolver’s request (for example, by query name, type, ID and resolver source port) and is received before a response from the genuine nameserver, the resolver accepts the forged response and caches it, and discards the genuine response.

Future queries for the compromised domain or zone are answered with the forged DNS resolutions from the cache. If the attacker has specified a very long time-to-live on the forged response, the forged records stay in the cache for as long as possible without being refreshed.

[/vc_column_text][vc_column_text]

DoS and amplification attacks

DNS resolvers are subject to the usual DoS threats that plague any networked system. However, amplification attacks are of particular concern because DNS resolvers are attractive targets to attackers who exploit the resolvers’ large response-to-request size ratio to gain additional free bandwidth. Resolvers that support EDNS0 (Extension Mechanisms for DNS) are especially vulnerable because of the substantially larger packet size that they can return.

In an amplification scenario, the attack proceeds as follows:

The attacker sends a victim DNS server queries using a forged source IP address. The queries may be sent from a single system or a network of systems all using the same forged IP address. The queries are for records that the attacker knows will result in much larger responses, up to several dozen times [1] the size of the original queries (hence the name “amplification” attack).

The victim server sends the large responses to the source IP address passed in the forged requests, overwhelming the system and causing a DoS situation.

[1] See the paper DNS Amplification Attacks for examples and a good discussion of the problem in general.
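The ratio that drives these attacks is easy to quantify. A quick calculation with illustrative sizes (these numbers are examples, not measurements):

```python
# Illustrative sizes: a small UDP query vs. a large EDNS0 response.
query_bytes = 64
response_bytes = 4096   # EDNS0 permits responses of this size

amplification = response_bytes / query_bytes
print(f"amplification factor: {amplification:.0f}x")  # prints "amplification factor: 64x"
```

A botnet sending small forged queries thus causes the victim at the spoofed address to receive many times the attacker’s own outbound bandwidth.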

Mitigations

The standard system-wide solution to DNS vulnerabilities is the DNSSEC protocol. However, until it is universally implemented, open DNS resolvers need to take independent measures to mitigate known threats. Many techniques have been proposed; see IETF RFC 5452: Measures for making DNS more resilient against forged answers for an overview of most of them. In Google Public DNS, we have implemented, and we recommend, the following approaches:

Securing your code against buffer overflows, particularly the code responsible for parsing and serializing DNS messages.

Overprovisioning machine resources to protect against direct DoS attacks on the resolvers themselves. Since IP addresses are trivial for attackers to forge, it’s impossible to block queries based on IP address or subnet; the only effective way to handle such attacks is to simply absorb the load.

Implementing basic validity-checking of response packets and of nameserver credibility, to protect against simple cache poisoning. These are standard mechanisms and sanity checks that any standards-compliant caching resolver should perform.

Adding entropy to request messages, to reduce the probability of more sophisticated spoofing/cache poisoning attacks such as Kaminsky attacks. There are many recommended techniques for adding entropy, including randomizing source ports; randomizing the choice of nameservers (destination IP addresses); randomizing case in name requests; and appending nonce prefixes to name requests. Below, we give an overview of the benefits, limitations, and challenges of each of these techniques, and discuss how we implemented them in Google Public DNS.

Removing duplicate queries, to reduce the probability of successful “birthday attacks”.

Rate-limiting requests, to prevent DoS and amplification attacks.

Monitoring the service for the client IPs using the most bandwidth and experiencing the highest response-to-request size ratio.

Supporting the DNSSEC protocol

The Domain Name System Security Extensions (DNSSEC) standard is specified in several IETF RFCs: 4033, 4034, 4035, and 5155.

Resolvers that implement DNSSEC counter cache poisoning attacks by verifying the authenticity of responses received from nameservers. Each DNS zone maintains a set of private/public key pairs, and each DNS record set is signed using a private key. The corresponding public key is then authenticated via a chain of trust by keys belonging to parent zones. DNSSEC-compliant resolvers reject responses that do not contain the correct signatures. DNSSEC effectively prevents responses from being tampered with, because in practice signatures are almost impossible to forge without access to private keys.

As of January 2013, Google Public DNS fully supports the DNSSEC protocol. We accept and forward DNSSEC-formatted messages and validate responses for correct authentication. We strongly encourage other resolvers to do the same.

Implementing basic validity checking

Some DNS cache corruption can be due to unintentional, and not necessarily malicious, mismatches between requests and responses (perhaps because of a misconfigured nameserver, a bug in the DNS software, and so on). At a minimum, DNS resolvers should implement checks to verify the credibility and relevance of nameservers’ responses. We recommend (and implement) all of the following defenses:

Do not set the recursive bit in outgoing requests, and always follow delegation chains explicitly. Disabling the recursive bit ensures that your resolver operates in “iterative” mode so that you query each nameserver in the delegation chain explicitly, rather than allowing another nameserver to perform these queries on your behalf.

Reject suspicious response messages. See below for details of what we consider to be “suspicious”.

Do not return A records to clients based on glue records cached from previous requests. For example, if you receive a client query for ns1.example.com, you should re-resolve the address, rather than sending an A record based on cached glue records returned from a .com TLD nameserver.

Rejecting responses that do not meet required criteria

Google Public DNS rejects all of the following:

Unparseable or malformed responses.

Responses in which the query ID, source IP, source port, or query name do not match those of the request.

Records which are not relevant to the request.

Answer records for which we cannot reconstruct the CNAME chain.

Records (in the answer, authority, or additional sections) for which the responding nameserver is not credible. We determine the “credibility” of a nameserver by its place in the delegation chain for a given domain. Google Public DNS caches delegation chain information, and we verify each incoming response against the cached information to determine the responding nameserver’s credibility for responding to a particular request.

Adding entropy to requests

Once a resolver enforces basic sanity checks, an attacker has to flood the victim resolver with responses in an effort to match the query ID, UDP port (of the request), IP address (of the response), and query name of the original request before the legitimate nameserver responds.

Unfortunately, this is not difficult to achieve: the one uniquely identifying field, the query ID, is only 16 bits long (i.e. a 1-in-65,536 chance of guessing it correctly). The other fields are also limited in range, making the total number of unique combinations relatively low. See IETF RFC 5452, Section 7 for a calculation of the combinatorics involved.
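A back-of-the-envelope version of that calculation, counting only the query ID plus a randomized source port (the ~15 bits of port entropy mentioned later in this document; other fields add more):

```python
QID_BITS = 16    # DNS query ID field
PORT_BITS = 15   # ~32,000 randomized source ports

combinations = 2 ** (QID_BITS + PORT_BITS)
print(f"{combinations:,} ID/port combinations")   # prints "2,147,483,648 ID/port combinations"
print(f"per-packet match probability: {1 / combinations:.1e}")
```

With only the 16-bit query ID, an attacker needs tens of thousands of forged packets; each extra bit of entropy doubles that cost.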

Therefore, the challenge is to add as much entropy to the request packet as possible, within the standard format of the DNS message, to make it more difficult for attackers to successfully match a valid combination of fields within the window of opportunity. We recommend, and have implemented, all the techniques discussed in the following sections.

Randomizing source ports

As a basic step, never allow outgoing request packets to use the default UDP port 53, or a predictable algorithm for assigning multiple ports (e.g. simple incrementing). Use as wide a range of ports from 1024 to 65535 as your system allows, and use a reliable random number generator to assign ports. For example, Google Public DNS uses ~15 bits, allowing for approximately 32,000 different port numbers.
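A minimal sketch of this, using the operating system’s entropy source rather than a predictable PRNG (the retry-on-collision loop is one simple way to handle ports that are already in use):

```python
import random
import socket

def open_randomized_socket() -> socket.socket:
    """Bind an outgoing UDP socket to a cryptographically random high
    port, retrying if the chosen port is already taken."""
    rng = random.SystemRandom()   # OS entropy, not the default PRNG
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind(("0.0.0.0", rng.randint(1024, 65535)))
            return sock
        except OSError:
            sock.close()          # port in use; pick another
```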

Note that if your servers are deployed behind firewalls, load-balancers, or other devices that perform network address translation (NAT), those devices may de-randomize ports on outgoing packets. Make sure you configure NAT devices to disable port de-randomization.

Randomizing choice of nameservers

Some resolvers, when sending out requests to root, TLD, or other nameservers, select the nameserver’s IP address based on the shortest distance (lowest latency). We recommend that you randomize destination IP addresses to add entropy to the outgoing requests. In Google Public DNS, we simply pick a nameserver randomly among configured nameservers for each zone, somewhat favoring fast and reliable nameservers.

If you are concerned about latency, you can use round-trip time (RTT) banding, which consists of randomizing within a range of addresses that are below a certain latency threshold (e.g. 30 ms, 300 ms, etc.).
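RTT banding can be sketched in a few lines. The RTT values below are hypothetical (documentation addresses), and the 30 ms band is just the example threshold from the text:

```python
import random

# Hypothetical measured RTTs, in milliseconds, for one zone (example addresses).
rtts = {"192.0.2.1": 12.0, "192.0.2.2": 19.0,
        "203.0.113.9": 45.0, "198.51.100.1": 150.0}

def pick_nameserver(rtts: dict, band_ms: float = 30.0) -> str:
    """RTT banding: choose uniformly at random among all servers within
    band_ms of the fastest, gaining entropy without a big latency cost."""
    best = min(rtts.values())
    return random.choice([ip for ip, rtt in rtts.items() if rtt <= best + band_ms])
```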

Randomizing case in query names

The DNS standards require that nameservers treat names case-insensitively. That is, the names example.com and EXAMPLE.COM should resolve to the same IP address [3]. However, in the response, most nameservers echo back the name as it appeared in the request, preserving the original case.

Therefore, another way to add entropy to requests is to randomly vary the case of letters in domain names queried. This technique, also known as “0x20” because bit 0x20 is used to set the case of US-ASCII letters, was first proposed in the IETF internet draft Use of Bit 0x20 in DNS Labels to Improve Transaction Identity. With this technique, the nameserver response must match not only the query name but the case of every letter in the name string; for example, wWw.eXaMpLe.CoM or WwW.ExamPLe.COm. This may add little or no entropy to queries for the top-level and root domains, but it’s effective for most hostnames.
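The encoding step is a one-liner per character. A minimal sketch (verification against a response is then plain byte-for-byte string equality):

```python
import random

def randomize_case(qname: str) -> str:
    """Randomly flip bit 0x20 of each ASCII letter in the name;
    digits, dots, and hyphens are unaffected by upper()/lower()."""
    rng = random.SystemRandom()
    return "".join(c.upper() if rng.random() < 0.5 else c.lower()
                   for c in qname)
```

A genuine response from a standards-compliant nameserver echoes the exact mixed-case name, so a forger must also guess one extra bit per letter.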

One significant challenge we discovered when implementing this technique is that some nameservers do not follow the expected response behavior:

Some nameservers respond with complete case-insensitivity: that is, they return the same results for equivalent names with different cases in the request; but they do not match the exact case of the name in the response.

Other nameservers respond with complete case-sensitivity (in violation of the DNS standards): that is, they match the exact case of the name in the response; but return different results for equivalent names with different cases in the request (typically NXDOMAIN)!

For both of these types of nameservers, altering the case of the query name would produce undesirable results: for the first group, the response would be indistinguishable from a forged response; for the second group, the response could be totally invalid.

Our current solution to this problem is to maintain a whitelist of nameservers that we know apply the standards correctly, and to apply the case randomization technique only in requests to those servers. We also list the appropriate exception subdomains for each of them, based on analyzing our logs. If a response that appears to come from those servers does not contain the correct case, we reject the response. The whitelisted nameservers account for more than 70% of our traffic.

[3] RFC 1034, Section 3.5 says:

Note that while upper and lower case letters are allowed in domain names, no significance is attached to the case. That is, two names with the same spelling but different case are to be treated as if identical.

Prepending nonce labels to query names

If a resolver cannot directly resolve a name from the cache, or cannot directly query an authoritative nameserver, then it must follow referrals from a root or TLD nameserver. In most cases, requests to the root or TLD nameservers will result in a referral to another nameserver, rather than an attempt to resolve the name to an IP address. For such requests, it should therefore be safe to attach a random label to a query name to increase the entropy of the request, while not risking a failure to resolve a non-existent name. That is, sending a request to a referring nameserver for a name prefixed with a nonce label, such as entriih-f10r3.www.google.com, should return the same result as a request for www.google.com.
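Attaching and removing the nonce is straightforward; the subtlety, as described below, is deciding when it is safe to do so. A minimal sketch (the 12-character label length is an arbitrary choice for illustration):

```python
import random
import string

_ALPHABET = string.ascii_lowercase + string.digits

def add_nonce_label(qname: str, length: int = 12) -> str:
    """Prepend a random label; a referral from a root or TLD server is
    unaffected, but a forger must now also guess this label."""
    rng = random.SystemRandom()
    nonce = "".join(rng.choice(_ALPHABET) for _ in range(length))
    return nonce + "." + qname

def strip_nonce_label(qname: str) -> str:
    """Remove the nonce before caching or returning the answer."""
    return qname.split(".", 1)[1]
```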

Although in practice such requests make up less than 3% of outgoing requests, assuming normal traffic (since most queries can be answered directly from the cache or by a single query), these are precisely the types of requests that an attacker tries to force a resolver to issue. Therefore, this technique can be very effective at preventing Kaminsky-style exploits.

Implementing this technique requires that nonce labels only be used for requests that are guaranteed to result in referrals; that is, responses that do not contain records in the answers section. However, we encountered several challenges when attempting to define the set of such requests:

Some country-code TLD (ccTLD) nameservers are actually authoritative for other second-level TLDs (2LDs). Although they have two labels, 2LDs behave just like TLDs, which is why they are often handled by ccTLD nameservers. For example, the .uk nameservers are also authoritative for the mod.uk and nic.uk zones, and, hence, hostnames contained in those zones, such as www.nic.uk, www.mod.uk, and so on. In other words, requests to ccTLD nameservers for resolution of such hostnames will not result in referrals, but in authoritative answers; appending nonce labels to such hostnames will cause the names to be unresolvable.

Sometimes generic TLD (gTLD) nameservers return non-authoritative responses for nameservers. That is, there are some nameserver hostnames that happen to live in a gTLD zone rather than in the zone for their domain. A gTLD will return a non-authoritative answer for these hostnames, using whatever glue record it happens to have in its database, rather than returning a referral. For example, the nameserver ns3.indexonlineserver.com lives in a gTLD zone rather than in the indexonlineserver.com zone. If we issue a request to a gTLD server for ns3.indexonlineserver.com, we get an IP address for it, rather than a referral. However, if we prepend a nonce label, we get a referral to indexonlineserver.com, which is then unable to resolve the hostname. Therefore, we cannot append nonce labels for nameservers which require a resolution from a gTLD server.

Authorities for zones and hostnames change over time. This can cause a nonce-prepended hostname that was once resolvable to become unresolvable if the delegation chain changes.

To address these challenges, we created a “blacklist” file containing exceptions for which we cannot append nonce labels. The file is populated with hostnames for which TLD nameservers return non-referring responses, according to our server logs. We continually review the exceptions list to ensure that it stays valid over time.

Removing duplicate queries

DNS resolvers are vulnerable to “birthday attacks”, so called because they exploit the mathematical “birthday paradox”, in which the likelihood of a match does not require a large number of inputs. Birthday attacks involve flooding the victim server not only with forged responses but also with initial queries, counting on the resolver to issue multiple requests for a single name resolution. The greater the number of issued outgoing requests, the greater the probability that the attacker will match one of those requests with a forged response: an attacker only needs on the order of 300 in-flight requests for a 50% success chance at matching a forged response, and 700 requests for close to 100% success.
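The figures above are order-of-magnitude estimates. A rough rule-of-thumb model, considering only a 16-bit query-ID space (a real resolver’s other entropy sources make the attack much harder), is the standard birthday-style approximation:

```python
import math

SPACE = 2 ** 16   # query-ID space only; ports, case, etc. add more entropy

def match_probability(outstanding: int, forged: int) -> float:
    """Approximate chance that at least one of `forged` spoofed
    responses matches one of `outstanding` identical queries:
    1 - (1 - outstanding/SPACE)**forged, via the exponential form."""
    return 1 - math.exp(-outstanding * forged / SPACE)
```

With a few hundred outstanding duplicate queries and a matching flood of forged responses, the success probability climbs past one half, consistent with the figures in the text.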

To guard against this attack strategy, be sure to discard all duplicate queries from the outbound queue. For example, Google Public DNS never allows more than a single outstanding request for the same query name, query type, and destination IP address.
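The suppression itself amounts to a set membership test keyed exactly as described above. A minimal sketch (a production resolver would also attach waiting clients to the outstanding request and time entries out):

```python
# Outstanding outgoing requests, keyed as the text describes:
# (query name, query type, destination nameserver IP).
in_flight = set()

def should_send(qname: str, qtype: int, dest_ip: str) -> bool:
    """Return False while an identical request is already outstanding."""
    key = (qname.lower(), qtype, dest_ip)
    if key in in_flight:
        return False
    in_flight.add(key)
    return True

def mark_done(qname: str, qtype: int, dest_ip: str) -> None:
    in_flight.discard((qname.lower(), qtype, dest_ip))
```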

Rate-limiting queries

Preventing denial-of-service attacks poses several particular challenges for open recursive DNS resolvers:

Open recursive resolvers are attractive targets for launching amplification attacks. They are high-capacity, high-reliability servers and can produce larger responses than a typical authoritative nameserver — especially if an attacker can inject a large response into their cache. It is incumbent on any developer of an open DNS service to prevent their servers from being used to launch attacks on other systems.

Amplification attacks can be difficult to detect while they are occurring. Attackers can launch an attack via thousands of open resolvers, so that each resolver only sees a small fraction of the overall query volume and cannot extract a clear signal that it has been compromised.

Malicious traffic must be blocked without any disruption or degradation of the DNS service for normal users. DNS is an essential network service, so shutting down servers to cut off an attack is not an option, nor is denying service to any given client IP for too long. Resolvers must be able to quickly block an attack as soon as it starts, and restore fully operational service as soon as the attack ends.

The best approach for combating DoS attacks is to impose a rate-limiting or “throttling” mechanism. Google Public DNS implements two kinds of rate control:

Rate control of outgoing requests to other nameservers. To protect other DNS nameservers against DoS attacks that could be launched from our resolver servers, Google Public DNS enforces per-nameserver QPS limits on outgoing requests from each serving cluster.

Rate control of outgoing responses to clients. To protect any other systems against amplification and traditional distributed DoS (botnet) attacks that could be launched from our resolver servers, Google Public DNS performs two types of rate limiting on client queries:

To protect against traditional volume-based attacks, each server imposes per-client-IP QPS and average bandwidth limits.

To guard against amplification attacks, in which large responses to small queries are exploited, each server enforces a per-client-IP maximum average amplification factor. The average amplification factor is a configurable ratio of response-to-query size, determined from historical traffic patterns observed in our server logs.

If queries from a specific source IP address exceed the maximum QPS, or exceed the average bandwidth or amplification limit consistently (the occasional large response will pass), we return (small) error responses or no response at all.
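The two client-side limits can be sketched together. This is a simplified model, not Google’s implementation, and the thresholds are illustrative (the `max_qps` and `max_amplification` values here are invented for the example):

```python
import collections
import time

class ClientLimiter:
    """Per-client-IP QPS cap plus an average amplification-factor cap.
    Thresholds are illustrative, not Google Public DNS's actual values."""

    def __init__(self, max_qps: int = 100, max_amplification: float = 4.0):
        self.max_qps = max_qps
        self.max_amplification = max_amplification
        self.times = collections.defaultdict(list)            # ip -> recent timestamps
        self.sizes = collections.defaultdict(lambda: [0, 0])  # ip -> [query, response] bytes

    def allow(self, ip: str, query_len: int, response_len: int) -> bool:
        now = time.monotonic()
        window = [t for t in self.times[ip] if now - t < 1.0]
        window.append(now)
        self.times[ip] = window
        if len(window) > self.max_qps:
            return False              # over the per-second query cap
        totals = self.sizes[ip]
        totals[0] += query_len
        totals[1] += response_len
        # The occasional large response passes; the running average must not.
        return totals[1] / totals[0] <= self.max_amplification
```

Because the amplification check is an average over the client’s history, a single legitimately large response does not trip the limit, while a sustained stream of amplified responses does.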

[/vc_column_text][/vc_column][/vc_row]
