WWW94: a caching relay for the World Wide Web
presented by steven glassman, digital equipment corporation, palo alto
steven glassman presented the design and performance of a caching relay for
the web. DEC has set up such a relay at palo alto because DEC has a security
firewall between internal computers and machines outside of DEC's network.
they added caching to improve performance and to reduce network traffic.
in addition, caching can reduce the latency of page requests, and cached
pages remain available even if the server where the documents are stored is
currently not reachable.
on the other hand, caching introduces some problems: it may return a stale
version of a page if the page has been changed since the last transmission, it
increases latency if the page has not been cached yet, and it needs
additional local resources.
cached pages are stored as UNIX files with their URL as the filename. the files
are hashed into 4096 subdirectories which are organized into a three-level
hierarchy.
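
the storage layout described here can be sketched roughly as follows. this is only an illustration and not DEC's code: the hash function (SHA-256), the split into three levels of 16 directories each (16 x 16 x 16 = 4096), the percent-encoding of the URL and the name `CACHE_ROOT` are all assumptions that the talk summary does not confirm.

```python
import hashlib
from pathlib import Path
from urllib.parse import quote

# hypothetical cache location; the real path is not given in the summary
CACHE_ROOT = Path("/var/cache/relay")


def cache_path(url: str, root: Path = CACHE_ROOT) -> Path:
    """Map a URL to a file inside one of 16**3 = 4096 leaf directories."""
    # assumption: the first three hex digits of a digest pick the directory,
    # one digit per level, giving a three-level tree with fan-out 16
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    level1, level2, level3 = digest[0], digest[1], digest[2]
    # a URL contains "/" which cannot appear in a UNIX filename, so it is
    # percent-encoded here; very long URLs could still exceed the
    # filesystem's filename length limit
    filename = quote(url, safe="")
    return root / level1 / level2 / level3 / filename
```

spreading the files over many small directories keeps each directory short, which matters on filesystems where lookups in one large directory are slow.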
the main problem with caching is whether the cached information is still valid
or whether the original document has been changed since it was copied.
unfortunately, there is no mechanism that tells the relay when a document has been changed.
there is also no reliable expiration date in the document. therefore the time
until a page has to be re-cached has to be estimated somehow. DEC uses an
algorithm based on the last modification date. if the document was not changed
recently, it will probably not change for the next few days. a cached document
will be marked with an expiration date. if the relay receives a request for a
page and if the page is in the cache, the relay checks the expiration date.
if the page has not yet expired, it will be sent to the client without any
further tests. if the page has expired, the relay asks the remote server
whether the page has been changed and, if so, fetches it again.
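
the lookup logic of this paragraph might look like the sketch below. the summary does not give the actual formula, so the constants `FRESHNESS_FACTOR` and `MAX_LIFETIME`, the callbacks `fetch` and `fetch_if_modified`, and the decision to compute a new expiration date after an "unchanged" answer are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

DAY = 86400.0
FRESHNESS_FACTOR = 0.1   # assumed: lifetime is 10% of the document's age
MAX_LIFETIME = 7 * DAY   # assumed upper bound on the estimated lifetime


@dataclass
class Entry:
    body: bytes
    last_modified: float  # seconds since the epoch
    expires: float        # seconds since the epoch


def estimate_expiry(now: float, last_modified: float) -> float:
    """The longer a document has been unchanged, the longer it is trusted."""
    age = max(0.0, now - last_modified)
    return now + min(FRESHNESS_FACTOR * age, MAX_LIFETIME)


# hypothetical callbacks standing in for the network side of the relay
Fetch = Callable[[str], Tuple[bytes, float]]
FetchIfModified = Callable[[str, float], Optional[Tuple[bytes, float]]]


class Relay:
    def __init__(self, fetch: Fetch, fetch_if_modified: FetchIfModified) -> None:
        self.cache: Dict[str, Entry] = {}
        self.fetch = fetch
        self.fetch_if_modified = fetch_if_modified

    def get(self, url: str, now: float) -> bytes:
        entry = self.cache.get(url)
        if entry is None:
            # not cached yet: fetch from the remote server
            body, modified = self.fetch(url)
        elif now < entry.expires:
            # cached and not expired: answer without any further tests
            return entry.body
        else:
            # expired: ask the server; None means "not changed"
            changed = self.fetch_if_modified(url, entry.last_modified)
            if changed is not None:
                body, modified = changed
            else:
                body, modified = entry.body, entry.last_modified
        self.cache[url] = Entry(body, modified, estimate_expiry(now, modified))
        return body
```

with these assumed constants, a page last modified 100 days ago would be served from the cache for 7 days before the relay asks the server again, while a page modified one day ago would be rechecked after about 2.4 hours.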
the log shows that 30 to 50% of all requests were satisfied from the cache
and that pages supplied by the relay itself took only about 15 to 25% of the
time needed for transmission over the net. the cache has a capacity of two
gigabytes, which should be enough space to hold about 80,000 documents. currently
the cache holds about 630 megabytes of data and is constantly growing.
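
a quick back-of-the-envelope calculation with these figures, as a rough sketch: it assumes decimal units (1 gigabyte = 10^9 bytes), assumes that the documents currently cached have the same average size as implied by the capacity estimate, and ignores the extra latency a cache miss adds. none of these assumptions come from the talk.

```python
CAPACITY_BYTES = 2 * 10**9   # two gigabytes, decimal units assumed
CAPACITY_DOCS = 80_000       # documents the full cache should hold
USED_BYTES = 630 * 10**6     # data currently in the cache

# average document size implied by the capacity estimate
avg_doc_bytes = CAPACITY_BYTES / CAPACITY_DOCS

# rough number of documents cached today, if that average holds
docs_now = USED_BYTES / avg_doc_bytes


def relative_mean_time(hit_rate: float, hit_cost: float) -> float:
    """Mean request time relative to always going over the net (= 1.0).

    hit_cost is the time of a cache hit as a fraction of a direct transfer;
    a miss is assumed to cost exactly 1.0.
    """
    return hit_rate * hit_cost + (1.0 - hit_rate) * 1.0


if __name__ == "__main__":
    print(f"average document size: {avg_doc_bytes:.0f} bytes")
    print(f"documents cached now:  {docs_now:.0f}")
    print(f"best case:  {relative_mean_time(0.5, 0.15):.3f}")
    print(f"worst case: {relative_mean_time(0.3, 0.25):.3f}")
```

under these assumptions the average document is about 25 kilobytes, and the mean request time lies somewhere between roughly 58% and 78% of the time without a cache.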
the statistics show that the popularity of pages follows a zipf distribution.
in other words, on a log-log scale the number of requests per page plotted
against the popularity rank of the page is almost a straight line. this gives a
good indication for the number of requests per page one can expect for a
certain number of pages which in turn gives some hints for a reasonable
cache size.
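
to illustrate how a zipf distribution can guide such an estimate, here is a minimal sketch. it assumes the classic form where the page with rank r receives requests proportional to 1/r, and an ideal cache that always holds the most popular pages; the function names are made up for this example and the talk did not state an exponent.

```python
def harmonic(n: int) -> float:
    """Sum of 1/r for r = 1..n, the total request weight of the top n pages."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_hit_fraction(cached_pages: int, total_pages: int) -> float:
    """Fraction of requests that go to the `cached_pages` most popular pages.

    Assumes request frequency proportional to 1/rank. This is an upper bound
    for a real cache, which also misses on first requests and expired pages.
    """
    k = min(cached_pages, total_pages)
    return harmonic(k) / harmonic(total_pages)


if __name__ == "__main__":
    total = 100_000  # hypothetical number of distinct pages requested
    for cached in (100, 1_000, 10_000, 100_000):
        share = zipf_hit_fraction(cached, total)
        print(f"{cached:>7} pages cached: {share:.0%} of requests")
```

because the harmonic sum grows only logarithmically, each tenfold increase in cached pages adds a roughly constant share of requests, so a fairly small cache already covers a large part of the traffic.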
i found this an extremely interesting speech.
this paper is available on the web.
13-jun-94 (ra)