CacheMakers: A Cooperative DNS Caching Service

Saumitra M. Das
School of Electrical and Computer Engineering, Purdue University
1285 Electrical Engineering Bldg., Box 479
West Lafayette, IN USA 47907
765-430-4335
smdas@purdue.edu
Michel Imhasly
Dept. of Electrical and Computer Engineering, Carnegie Mellon University
Pittsburgh, PA USA 15213
m.imhasly@rhone.ch
Rohyt Belani
Dept. of Electrical and Computer Engineering, Carnegie Mellon University
Pittsburgh, PA USA 15213
rohyt@cmu.edu

ABSTRACT

The Domain Name System (DNS) maps easy-to-remember hostnames to Internet Protocol (IP) addresses on the Internet. When a user enters a URL in the browser, the browser's cache is first checked for the corresponding mapping. On a cache miss, the request is passed to the configured local name server (LNS). If the LNS cache also cannot resolve the URL, the LNS recursively contacts several name servers on the user's behalf. This recursive resolution is time consuming and overloads frequently visited name servers. To overcome these drawbacks we propose an approach called "Cooperative DNS Caching", in which the LNSs of several autonomous networks pool their cache entries in a centralized cache that is then available to all participating LNSs. The intuition is that this central cache behaves analogously to an L2 cache in a microprocessor, leading to an increase in cache hit rate. As a result, we expect a reduction in the mean response time seen by the user. Furthermore, if extensively deployed, the central caches are also expected to ease the load on frequently visited name servers (e.g., the root name servers and the .edu and .com top-level servers). In this paper, we present the design and implementation of CacheMakers, a distributed system that exploits reference locality in DNS requests to reduce the latency due to DNS lookups.

Keywords

Client/server, Internet applications

1. INTRODUCTION

Advances in technology are making large amounts of bandwidth available for data transfer over the Internet. As a result, data transmission times have dropped significantly, and user response times are sometimes dominated by the resolution of hostnames by DNS. Thus, the delay seen by the user can be reduced significantly by optimizing the DNS lookup procedure, under the constraint of operating within the existing infrastructure. The current DNS operates as follows. When a client issues a DNS query (i.e., asks its LNS to resolve a hostname), the LNS first checks whether the hostname is in its own domain (i.e., whether it is the authoritative server for that hostname). If not, the LNS checks its cache for the mapping (hostname -> IP address). If the entry is not cached, the LNS performs a recursive lookup on the client's behalf. This lookup methodology not only results in user response times that may be on the order of seconds, but also loads frequently queried name servers such as the root name servers and the .com authoritative name server.

    Caching of DNS queries has proven to be an effective way to reduce client latency because of the high locality in the request stream [2]. In order to reduce this lookup time, and hence user response time, we propose deploying a second level of caching in the form of a centralized cache shared by a number of participating LNSs. This central cache pools the cache entries of all participating LNSs. A miss in an LNS's cache is therefore referred first to this larger centralized cache rather than to the root server. This approach not only lets the LNS query a nearer source (the centralized cache), but also avoids the delay of a recursive lookup in case of a cache hit. Intuitively, the approach should work best for collaboration between domains that are close to each other and distant from a root name server. However, even if the collaborating LNSs are distant from each other and close to a root name server, our system may replace a recursive lookup, which can entail several round trips to authoritative name servers, with a single distant lookup; the user still sees a reduction in response time. Recent studies on DNS performance [1] indicate that many DNS lookups receive no answer (around 23% for MIT). Our design could potentially enhance the availability of the DNS service and reduce the number of unanswered lookups, further reducing client latency.

2. METHODOLOGY

Our system aims to provide the desirable properties of a good distributed system. The goals of our design were:

2.1 Scalability

We intend to achieve scalability by making the central cache servers non-blocking and by using an event-driven design. Non-blocking I/O ensures that there is a finite upper bound on the time for which a request keeps the server occupied. Further, to avoid stalling on disk I/O, we fork additional processes and continue serving other requests. The event-driven design means that we use a select()-based server to handle multiple user requests, rather than one thread per request. The former architecture has been shown to be more scalable, as the latter may require an extremely large number of context switches as the number of concurrent requests grows. Our current prototype is written in Java, since a proof-of-concept implementation was desired in the short time available. However, we will need to look into libraries for event-driven I/O to replace the current multithreaded implementation. Recent work on frameworks for efficient event-driven I/O could be applied to our system in the future.
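
For illustration, the following is a minimal sketch of such an event-driven server loop using Java NIO's Selector (Java's analogue of select()). It is not our prototype's code; the port number is arbitrary and the request-handling step is elided.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    public class EventLoopServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(5353));   // port is illustrative
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();                      // wait for ready channels
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {           // new LNS connection
                        SocketChannel client = server.accept();
                        if (client == null) continue;
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {      // request bytes available
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(512);
                        if (client.read(buf) < 0) { client.close(); continue; }
                        // parse the request and reply without blocking (elided)
                    }
                }
            }
        }
    }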

2.2 Fault Tolerance

The centralized system consists of a load balancer and n replicas of the central cache. The load balancer receives two types of messages from the LNSs: updates and requests. An update message carries an entry newly cached by a particular LNS; the new entry has to be incorporated into all replicas of the central cache and is therefore multicast to all of them. A request message, on the other hand, is sent to only one replica, as selected by the load balancer. We use the primary-backup approach to provide fault tolerance for the load balancer itself. If the primary load balancer goes down, one of the replicas takes over as the primary and informs its peers (the other replicas) as well as the clients. The next backup is decided in a distributed fashion as follows. Every replica is configured with a unique identification number, and the replicas exchange these numbers initially. The replica that does not see an id smaller than its own becomes the next backup. Furthermore, a replica that takes over as load balancer after a failure no longer performs cache lookups.
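
A minimal sketch of this election rule follows; the class and method names are ours, not the prototype's.

    import java.util.Set;

    /** A replica is the next backup load balancer iff no peer
     *  advertises an id smaller than its own. */
    public class BackupElection {
        public static boolean isNextBackup(int myId, Set<Integer> peerIds) {
            for (int id : peerIds) {
                if (id < myId) {
                    return false;   // a lower-id replica outranks this one
                }
            }
            return true;            // smallest id seen: this replica is the backup
        }
    }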

2.3 Security

There are two main reasons to implement security in our system. First, the central cache system (CCS) receives critical requests such as "update a DNS entry". We need to ensure that an attacker cannot submit such a request, since otherwise he would be able to map a domain name to an IP address of his choice. Second, we want to provide our service only to registered users.
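
One plausible way to address both concerns is to authenticate every update packet with an HMAC. The sketch below assumes that each registered LNS shares a secret key with the CCS; it is an illustration, not necessarily the scheme used in our prototype.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    /** Illustrative authentication of update packets with a per-LNS
     *  shared key; key registration and distribution are out of scope. */
    public class UpdateAuthenticator {
        private final SecretKeySpec key;

        public UpdateAuthenticator(byte[] sharedKey) {
            this.key = new SecretKeySpec(sharedKey, "HmacSHA256");
        }

        /** Computes the tag an LNS attaches to an update packet. */
        public byte[] sign(String updatePayload) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            return mac.doFinal(updatePayload.getBytes(StandardCharsets.UTF_8));
        }

        /** Verifies a received tag; rejects forged or tampered updates. */
        public boolean verify(String updatePayload, byte[] tag) throws Exception {
            return MessageDigest.isEqual(sign(updatePayload), tag);
        }
    }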

3. IMPLEMENTATION

All LNSs first send their requests to the primary server, which performs load balancing on the incoming requests. The load-balancing policy is encapsulated in a load balancer class, which can be modified to implement other distribution algorithms. When a request is received from an LNS, the primary server hands the connection off to the cache server it has selected for that transaction. All cache servers hold consistent, replicated caches of DNS entries. Our system keeps the caches consistent because the special update packets sent by the LNSs are broadcast to all cache servers; the cache servers 'snoop' for these packets on their network interfaces and update their cache entries as required. Client requests are simulated using a URL collection database, which can be dynamically programmed by the simulator for different request structures and degrees of overlap. The LNSs respond either by reporting the IP address or by making a request to our infrastructure.
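
The pluggable load-balancing policy can be pictured as follows. This is a sketch with illustrative names, showing a simple round-robin default rather than our exact class.

    import java.util.List;

    /** Pluggable policy for choosing the replica that serves a request. */
    interface LoadBalancer {
        CacheServerHandle pick(List<CacheServerHandle> replicas);
    }

    /** A simple round-robin policy as a default implementation. */
    class RoundRobinBalancer implements LoadBalancer {
        private int next = 0;

        public synchronized CacheServerHandle pick(List<CacheServerHandle> replicas) {
            next = (next + 1) % replicas.size();
            return replicas.get(next);
        }
    }

    /** Placeholder for a replica endpoint (host, port, etc.). */
    class CacheServerHandle { }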

4. OPERATIONAL SCENARIOS

Cache Hit: If there is an LNS cache miss, the request for hostname resolution is forwarded to the load balancer of our system. The load balancer then forwards the request to a cache server according to its distribution algorithm. The selected server queries its cache to determine whether the requested mapping exists. In case of a cache hit, the server replies to the requesting LNS, avoiding the extensive delays of recursively querying authoritative name servers.

Cache Miss: As in the above case, when a query cannot be resolved by the LNS it is forwarded to our system. However, if the centralized cache does not contain the mapping either, it sends a negative reply to the requesting LNS. The LNS then resolves the name by the conventional recursive lookup method. In addition to caching the newly acquired mapping, the LNS also sends it to the central system in an update packet so that it can be cached there.
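
Both scenarios can be summarized by the following LNS-side sketch; the class and method names are illustrative and do not correspond to the prototype's actual API.

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative LNS-side resolution flow for the two scenarios above. */
    public class LnsResolver {
        /** Hypothetical interface to the centralized cache system. */
        interface CentralCache {
            String query(String hostname);              // null on central miss
            void update(String hostname, String addr);  // the "update packet"
        }

        private final Map<String, String> localCache = new HashMap<>();
        private final CentralCache central;

        public LnsResolver(CentralCache central) { this.central = central; }

        public String resolve(String hostname) {
            String addr = localCache.get(hostname);
            if (addr != null) return addr;              // local cache hit

            addr = central.query(hostname);             // try the shared cache
            if (addr == null) {                         // central miss
                addr = recursiveResolve(hostname);      // conventional recursion
                central.update(hostname, addr);         // share the new mapping
            }
            localCache.put(hostname, addr);
            return addr;
        }

        private String recursiveResolve(String hostname) {
            // Stand-in for the conventional recursive lookup (elided).
            throw new UnsupportedOperationException("elided in this sketch");
        }
    }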

Load Balancer Failure: The cache servers, in addition to performing the required task of cache lookup, are also capable of acting as the load balancer. When the system is initialized, each cache server replica is assigned a unique identifier and broadcasts it to all the other replicas. From the identifiers it receives, each replica can determine whether it is the backup load balancer (it is, if its identifier is smaller than all the identifiers it receives during the broadcast). Thus, when the primary load balancer fails, which the backup detects by the absence of heartbeat "keep alive" messages, the backup takes over as primary. In doing so it informs the other replicas so that they may determine the next backup. The new primary load balancer also informs the clients of the change so that all further request and update messages are directed to it rather than to the failed load balancer. The system is thus n-fault tolerant.
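
The failure-detection step can be sketched as follows; the timeout value and the names are our own illustrative choices.

    /** Heartbeat monitor run on the backup: if no "keep alive" message
     *  arrives within TIMEOUT_MS, the primary is assumed to have failed
     *  and the backup initiates takeover. */
    public class HeartbeatMonitor {
        private static final long TIMEOUT_MS = 3000;    // illustrative value
        private volatile long lastHeartbeat = System.currentTimeMillis();

        /** Invoked whenever a heartbeat is received from the primary. */
        public void onHeartbeat() {
            lastHeartbeat = System.currentTimeMillis();
        }

        /** Polled periodically by the backup's main loop. */
        public boolean primarySuspectedDown() {
            return System.currentTimeMillis() - lastHeartbeat > TIMEOUT_MS;
        }
    }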

5. EXPERIMENTAL RESULTS

In the normal operation of DNS, when a requested DNS mapping misses the LNS cache it is resolved recursively through well-known servers such as the root servers and the .edu server. In our system, a miss in an LNS cache is instead first sent to the centralized cache server, which pools the LNS caches of various domains. If there is a cache hit in the central server, the query does not have to be recursively resolved. To evaluate the performance improvement due to deployment of our system, we derive the expected reduction in DNS resolution time per query as a function of the hit rate. Measuring the improvement directly in terms of round-trip time reductions would not be fair, given the disparate computing power of the servers deployed for the simulation of our system and of actual DNS root servers. The following assumptions are made in the performance analysis:

1. The system is expected to be deployed in close proximity to the cooperating LNSs, resulting in negligible propagation delays between the two.

2. Cache misses in the central server's cache result in a negligible performance penalty, due to assumption 1.

3. Typical hostnames consist of 3 tokens (e.g. www.xyz.com). Thus, normal DNS resolution would require three round trips from the LNS to authoritative servers (root, top-level domain, and domain name servers).

4. Our system caches NS records rather than A records, to support DNS-based load balancing and dynamic DNS. Thus, a resolution that hits in our system's cache entails only one round trip from the LNS to an authoritative name server.

    Thus, since a hit in our system saves two of the three round trips, the performance gain in terms of reduction in user response time can be expressed as:

    Performance Gain = Hit Rate * (2 RTT / 3 RTT) = (2/3) * Hit Rate

    Further, if N is the number of cooperating LNSs, then Hit Rate = (ratio of overlap in requests) * (N - 1) / N

    It is evident from the above equations that the greater the overlap in requests from the various domains, the greater the performance gain achieved; increasing the number of participating LNSs also enhances the gain. To summarize, if we have 5 cooperating LNSs with a request overlap of 0.5 (i.e., similar environments of operation - all educational institutions or all corporate environments), a realistic scenario, the performance gain is approximately 26.7%. This gain, however, only accounts for the improvement seen by the user. There is also a substantial reduction in load on the frequently visited authoritative name servers: in the above scenario, the root name server and the next-level authoritative server experience a decrease in load equal to the hit rate of 40%. Widespread deployment of instances of the system could thus make the existing DNS infrastructure significantly more clustered and significantly reduce DNS traffic on the Internet backbones; a gain for ISPs too.
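
    For concreteness, substituting N = 5 and an overlap ratio of 0.5 into the two equations above gives:

    Hit Rate = 0.5 * (5 - 1) / 5 = 0.4

    Performance Gain = (2/3) * 0.4 ≈ 0.267, i.e. about 26.7%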

6. CONCLUSIONS

The CacheMakers cooperative DNS caching system shows a substantial performance gain and is thus a useful technique. We believe that our architecture can easily accommodate other web content and provide a scalable, complete solution for the bandwidth-hungry applications of the future. We are currently deploying the system over a larger area to quantify more accurately the locality in requests and the performance of the system.
 

7. REFERENCES

  1. Jaeyeon Jung, Emil Sit, Hari Balakrishnan and Robert Morris. DNS Performance and the Effectiveness of Caching. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop '01.
  2. E. Sit. A Study of Caching in the Internet Domain Name System. Master's thesis, Massachusetts Institute of Technology, 2000.