Lock-Free Consistency Control for Web 2.0 Applications

Jiangming Yang1,3, Haixun Wang2, Ning Gu1, Yiming Liu1, Chunsong Wang1, Qiwei Zhang1

1Fudan University
2IBM T. J. Watson Research Center
3Microsoft Research Asia

ABSTRACT

Online collaboration and sharing is the central theme of many web-based services that create the so-called Web 2.0 phenomenon. Using the Internet as a computing platform, many Web 2.0 applications set up mirror sites to provide large-scale availability and to achieve load balance. However, in the age of Web 2.0, where every user is also a writer and publisher, the deployment of mirror sites makes consistency maintenance a Web-scale problem. Traditional concurrency control methods (e.g., two-phase locking, serialization) are not up to the task for several reasons. First, the large network latency between mirror sites makes two-phase locking a throughput bottleneck. Second, locking blocks a large portion of concurrent operations, which makes it impossible to provide large-scale availability. On the other hand, most Web 2.0 operations do not need strict serializability: a user correcting a typo in a shared document does not intend to block another user who is adding a comment, as long as consistency can still be achieved. Thus, in order to enable maximal online collaboration and sharing, we need a lock-free mechanism that can maintain consistency among mirror sites on the Web. In this paper, we propose a flexible and efficient method for consistency maintenance in the Web 2.0 world. Our experiments show that it achieves a significant performance improvement over existing methods based on distributed locks.

Categories & Subject Descriptors

C.2.4 [Computer-Communication Networks]: Distributed Systems - distributed applications; H.2.4 [Database Management]: Systems - concurrency, transaction processing

General Terms

Algorithms

1 INTRODUCTION

Many popular web sites, including commercial sites and community sites (e.g. GNU, SourceForge, etc.), have multiple mirrors across the Internet to support hundreds of millions of visits per day. Mirror sites are an easy solution to scalability as users are able to choose the nearest mirrors to get faster response time.

Unfortunately, using mirror sites for scalability may not be effective for Web 2.0 applications. In the Web 2.0 environment, each user is an author and publisher. The sheer amount of user interaction, and the fact that shared data are edited simultaneously at mirror sites physically far apart, make consistency control very difficult. In fact, the performance of some popular Web 2.0 sites (e.g. Wikipedia, Open Directory Project, YouTube, etc.) can vary considerably across locations and over time.

Traditional distributed database systems maintain consistency by enforcing the following semantics: the result of a concurrent execution must be equivalent to a serial one [16]. A common way to achieve serialized transactions is to use Two Phase Locking (2PL) [7]: transactions on different objects are executed simultaneously while transactions on a same object are serialized by using locks.

In the Web environment, the main weaknesses of the above approaches are the following: (1) Due to Internet latency between mirror sites, it is very costly to obtain locks and commit transactions. To ensure data integrity during the process, the system must block all other operations (including read-only operations), which reduces performance and is unacceptable in the Web 2.0 environment. This defeats the purpose of creating mirror sites, which is to increase scalability. (2) The locking mechanism requires all mirror sites to wait for the slowest site to commit the transaction and free the lock. Thus, the vulnerability of a single mirror site may affect the efficiency of the entire system.

Web 2.0 users need a more flexible concurrency control mechanism. Many current studies on user collaboration [6,14,11] show that in order to facilitate free and natural information flow among collaborating users, it is important to allow users to concurrently edit any part of the shared document at any time. Traditional methods are unfit because concurrent operations (in transaction mode) always block one another. Thus, it is clear that lock-based mechanisms are unsuitable for maintaining consistency of mirror sites that run Web 2.0 applications. This motivates us to explore novel approaches for consistency control in the Web 2.0 environment.

In this paper, we introduce a light-weight, lock-free approach to maintain consistency of mirrored sites.

This paper is organized as follows. In Section 2, we describe the replicated architecture of mirrored sites. Section 3 introduces the causality concept. Lock-free consistency control for transactional and non-transactional operations is discussed in Sections 4 and 5. Section 6 discusses experiments. We discuss related work in Section 7 and conclude in Section 8.

2 PROBLEM SETTING

In this section, we describe the consistency control problem in the web environment.


2.1 Representations of Data and Operations

Data shared in mirrored sites can be in varied forms. In this paper, we assume the shared data on mirrored sites are XML documents. Consequently, operations on the data are expressed by XML queries and updates.

There are two W3C standards for XML queries: XPath [17] and XQuery [18]. To process an XPath query, we start from the root node, match all paths satisfying the XPath pattern, and return the nodes at the ends of these paths. An XQuery is represented as a FLWR (For-Let-Where-Return) pattern, and is processed in two phases: a twig pattern matching phase defined by FLW and a result construction phase defined by R. XQuery is an extension of XPath and is more powerful than XPath.

Update operations allow users to modify XML documents. Currently, there is no standard for update operations. We use FLWU, an XQuery extension introduced by Tatarinov et al. [15], to express updates. An FLWU is processed in two phases: in the FLW phase, the multiple subtrees that match the pattern specified by FLW are located, which is similar to the first part of XQuery [18]; then, in the U (update) phase, certain nodes of these subtrees are modified. The actions of the second phase (U) are based on the result of the first phase (FLW); we therefore call the result of the first phase the execution set of the operation. In mirrored sites, as long as the FLW phase produces the same execution set, the final results will be the same across all sites.

Definition 1 (Execution Set). Following the W3C definitions [18], we call the set of subtrees matched in the FLW phase of an FLWU operation the execution set, or $ ES$, of the FLWU operation.

<Root>
    <book title="Introduction to Algorithm">
        <category>CS</category>
        <tag>Hot</tag>
    </book>
    <book title="Advanced Statistical Learning">
        <category>Unknown</category>
    </book>
    <book title="Linear Algebra">
        <category>Math</category>
    </book>
</Root>

In this paper, we use the above XML document, consisting of three books, together with the four update operations $ U_{1}$, $ U_{2}$, $ U_{3}$ and $ U_{4}$ shown in Figure 1, as a running example for explaining consistency control in mirrored sites.

Figure 1: Four Update Operations
Operation $ U_{1}$:
Change the title ``Advanced
Statistical Learning'' to
``Statistical Learning''.
FOR $title in
  /root//title
WHERE $title =
 "Advanced
 Statistical Learning"
UPDATE $title {
  REPLACE $title WITH
  "Statistical Learning"
}
Operation $ U_{2}$:
Set the category of the
``Statistical Learning''
book to ``Math''.
FOR $book in /root/book,
 $title = $book/title,
 $category = $book/category
WHERE $title =
    "Statistical Learning"
UPDATE $book {
  REPLACE $category WITH
  <category>Math</category>
}
Operation $ U_{3}$:
Add a discount tag
to books in ``Math''
category.
FOR $book in /root/book,
 $category =
    $book/category
WHERE $category = "Math"
UPDATE $book {
  INSERT
  <tag>Discount</tag>
}
Operation $ U_{4}$:
Set the category of the
``Linear Algebra'' book
to ``CS''.
FOR $book in /root/book,
 $title = $book/title,
 $category = $book/category
WHERE $title =
    "Linear Algebra"
UPDATE $book {
  REPLACE $category WITH
  <category>CS</category>
}
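As an illustration of the two-phase FLWU semantics, the following Python sketch executes $ U_1$ on the running example using the standard library's xml.etree.ElementTree. It is only a sketch: it models the book title as an attribute, and the helper name execute_u1 is ours, not part of any XQuery standard.

```python
import xml.etree.ElementTree as ET

DOC = """
<Root>
    <book title="Introduction to Algorithm">
        <category>CS</category>
        <tag>Hot</tag>
    </book>
    <book title="Advanced Statistical Learning">
        <category>Unknown</category>
    </book>
    <book title="Linear Algebra">
        <category>Math</category>
    </book>
</Root>
"""

def execute_u1(root):
    # FLW phase: compute the execution set -- the subtrees matching the pattern.
    es = [b for b in root.findall("book")
          if b.get("title") == "Advanced Statistical Learning"]
    # U phase: apply the update to every subtree in the execution set.
    for book in es:
        book.set("title", "Statistical Learning")
    return es

root = ET.fromstring(DOC)
execution_set = execute_u1(root)
print([b.get("title") for b in root.findall("book")])
# → ['Introduction to Algorithm', 'Statistical Learning', 'Linear Algebra']
```

As long as every replica computes the same execution set in the FLW phase, the U phase produces the same document everywhere.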

2.2 Architecture

We assume that shared data on each mirrored site is represented as a huge XML document. Our goal is to achieve good query/update response and support unconstrained collaboration. This leads us to adopt a replicated architecture for storing shared documents: the shared document is replicated at the local storage of each mirrored site. A user can operate on the shared data at any replica site, but usually at the site closest to the user. Non-transactional user operations are executed immediately at that site, and then dispatched to other sites. Transactional operations require more complicated scheduling. We discuss both cases in this paper.

Figure 2: The Replicated Architecture

As an example, Figure 2 shows three replicas in a groupware environment: $ SR_{1}$, $ SR_{2}$ and $ SR_{3}$. Replica $ SR_{3}$ receives and executes two operations $ U_{1}$ and $ U_{2}$. We assume $ U_{1}$ and $ U_{2}$ are causal operations, since $ U_2$ operates on the results of $ U_1$, and they are initiated from the same site, possibly by the same user. Operations $ U_{3}$, $ U_{4}$ and $ T_{1}$ arrive at roughly the same time on $ SR_1$, $ SR_2$, and $ SR_3$ respectively. Here, we assume $ T_1$ is a transaction. $ U_{3}$, $ U_{4}$ and $ T_{1}$ are concurrent operations.

To be clear, given a user operation, we call the replica receiving the operation the local replica and all other replicas remote replicas. For the local replica, we also call this operation its local operation, and operations dispatched from other replicas remote operations. In Section 3, we define a causal (partial) ordering relationship on operations based on their generation and execution sequences.


2.3 The Need for Consistency Control

We use three examples to illustrate the need for consistency control based on the above settings.

  1. (CAUSAL RELATIONSHIPS) Consider replica $ SR_{3}$, where $ U_{1}$ changes the title of the book from ``Advanced Statistical Learning'' to ``Statistical Learning'', and then $ U_{2}$ changes the category of ``Statistical Learning'' to ``Math''. Obviously, there is a causal relationship between $ U_1$ and $ U_{2}$, as the user intends to execute $ U_{2}$ based on $ U_{1}$'s result. Thus, operations with causal relationships must be executed in the right order.
  2. (TRANSACTIONS) Operation $ T_1$ is critical and must be executed in transactional mode. Here, $ T_1$ arrives in different order with regard to $ U_3$ and $ U_4$: on $ SR_1$, it arrives after $ U_3$, on $ SR_2$, after $ U_3$ and $ U_4$, and on $ SR_3$, it arrives first. To guarantee transaction semantics, we must ensure $ T_1$ is executed in the same order on all sites with regard to its concurrent operations.
  3. (CONCURRENT OPERATIONS) Operations $ U_3$ and $ U_4$ are non-transactional and do not have causal relationships. But they still need consistency control. At $ SR_1$, $ U_{3}$ adds a ``Discount'' tag to books in the ``Math'' category. But at $ SR_{2}$, because $ U_{4}$ arrives earlier and changes one of the ``Math'' books to the ``CS'' category, $ U_3$ will add the ``Discount'' tag to the ``Statistical Learning'' book alone. Thus, the results on $ SR_1$ and $ SR_2$ are not consistent.

The three cases require different handling in consistency control. In the rest of the paper, Section 3 discusses consistency control issues for causal relationships, Section 4 for transactions, and Section 5 for non-transactional concurrent operations. Overall, we generalize the above cases and show our method ensures consistency in a replicated Web architecture.


3 CAUSALITY PRESERVATION

In this section, we present the concept of causality, as well as a known solution for causality preservation (ensuring causal operations such as $ U_1$ and $ U_2$ are executed in the same order across all replicated sites) without using locking mechanisms. The techniques we discuss here will be used as building blocks for ensuring consistency of transaction and non-transactional concurrent operations.

Consider operations $ U_1$, $ U_2$ in Figure 1, and replica $ SR_{3}$ in Figure 2. Operation $ U_{1}$ changes a book title from ``Advanced Statistical Learning'' to ``Statistical Learning'', and then $ U_{2}$ sets the category of book ``Statistical Learning'' to ``Math''. Obviously, $ U_{2}$ is causally after $ U_{1}$ since the user intends to execute $ U_{2}$ based on $ U_{1}$'s result. However, because of internet latency, replica sites such as $ SR_1$ may receive $ U_2$ before $ U_1$. To address this problem, we apply the causality preservation strategy.


Definition 2 (Causal Ordering Relation ``$ \rightarrow$''). Given two operations $ O_{a}$ and $ O_{b}$, generated at sites $ i$ and $ j$ respectively, $ O_{a} \rightarrow O_{b}$ iff (a) $ i=j$ and $ O_{a}$ was generated before $ O_{b}$; or (b) $ i \neq j$ and $ O_{a}$ was executed at site $ j$ before $ O_{b}$ was generated; or (c) there exists an operation $ O_{x}$ such that $ O_{a} \rightarrow O_{x}$ and $ O_{x} \rightarrow O_{b}$.


Definition 3 (Concurrent Relation ``$ \parallel$''). Given two operations $ O_{a}$ and $ O_{b}$, $ O_{a} \parallel O_{b}$ iff neither $ O_{a} \rightarrow O_{b}$, nor $ O_{b} \rightarrow O_{a}$.

Figure 3 illustrates the 3 cases of causality given by Definition 2.

Figure 3: Causal Relationships

Intuitively, in order for every site to execute causal operations such as $ U_1$ and $ U_2$ in the same order, all they need to know is that $ U_1$ precedes $ U_2$ at the site where they are generated; then, if $ U_2$ arrives first, a site can simply wait until $ U_1$ arrives. But in a distributed environment, it is difficult to implement a global, precise clock that informs each site of the precedence of operations. The challenge is thus to define the ``happened before'' relation without using physical clocks.

To do this, we use a timestamping scheme based on State Vectors ($ SV$) [6,14]. Let $ N$ be the number of replica sites (we assume $ N$ is a constant), and assume replica sites are identified by their unique replica IDs: $ 1, \cdots, N$. Each site $ i$ maintains a vector $ SV_i$ with $ N$ components; initially, $ SV_i=\langle 0, \dots, 0 \rangle$. Whenever site $ i$ executes an operation generated at site $ j$ (including its own local operations, where $ j=i$), it sets $ SV_i[j]=SV_i[j]+1$.

In our approach, all (non-transactional) operations are executed immediately after their generation; this localized execution ensures good response time and supports unconstrained collaboration. The operation is then dispatched to remote sites with a timestamp (state vector) equal to the local state vector. Specifically, an operation $ O$ generated at site $ j$ is dispatched to other replicas with the state vector timestamp $ SV_O=SV_j$.


Definition 4 (Execution Condition). Let $ O$ be an operation generated at site $ i$ and timestamped with $ SV_{O}$. Site $ j$ ($ j \neq i$) can execute $ O$ iff: (1) $ SV_{O}[i] = SV_{j}[i] + 1$; and (2) $ SV_{O}[k] \leq SV_{j}[k]$, for all $ 1 \leq k \leq N$ and $ k \neq i$.

Intuitively, the first condition ensures that $ O$ is the next operation from site $ i$, that is, no operation originated from site $ i$ before $ O$ has been missed by site $ j$. The second condition ensures that all operations originated from other sites and executed at site $ i$ before the generation of $ O$ have already been executed at site $ j$. Together, the two conditions ensure that $ O$ can be executed at site $ j$ without violating causality [14].

Theorem 1 The execution condition ensures that causality as defined in Definition 2 is preserved.

Take Figure 2 as an example. Assume the state vector of each site is $ \langle 0,0,0 \rangle$ initially. When $ U_2$ is dispatched to $ SR_1$, it carries a state vector with $ SV_{U_2}[3] = 2$, because $ SR_3$ has executed 2 local operations $ U_1$ and $ U_2$. However, $ SV_1[3]=0$ because site $ SR_1$ has not executed any operation from $ SR_3$. Thus the first execution condition, $ SV_{U_{2}}[3] = SV_1[3]+1$, is not satisfied, which means $ U_2$ cannot be executed on $ SR_1$ yet. Now $ U_{1}$ arrives at $ SR_1$ with state vector $ SV_{U_1}[3]=1$. It is executed immediately and $ SV_1[3]$ is increased by 1. Then, $ U_{2}$ is ready for execution because $ SV_{U_{2}}[3] = SV_1[3]+1$. Thus, $ U_1$ and $ U_2$ will be executed on $ SR_1$ and $ SR_2$ in the same order as they are executed on $ SR_3$. We have thus solved the first problem in Section 2.3.
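The state-vector bookkeeping and the execution condition can be captured in a few lines of Python. This is a sketch of the scheme above with 0-based site indices; the function name causally_ready is ours.

```python
N = 3  # number of replica sites

def causally_ready(sv_op, origin, sv_site):
    """Execution condition: may an operation generated at site `origin`
    and timestamped `sv_op` be executed at a site whose vector is `sv_site`?"""
    if sv_op[origin] != sv_site[origin] + 1:
        return False  # an earlier operation from `origin` has not arrived yet
    # every operation seen at `origin` before the op was generated has arrived
    return all(sv_op[k] <= sv_site[k] for k in range(N) if k != origin)

# Replay the example: SR_3 (index 2) generates U_1 and then U_2.
sv_u1 = [0, 0, 1]    # timestamp carried by U_1
sv_u2 = [0, 0, 2]    # timestamp carried by U_2
sv_sr1 = [0, 0, 0]   # state vector of SR_1 before anything arrives

assert not causally_ready(sv_u2, 2, sv_sr1)  # U_2 must wait for U_1
assert causally_ready(sv_u1, 2, sv_sr1)      # U_1 can execute
sv_sr1[2] += 1                                # SR_1 executes U_1
assert causally_ready(sv_u2, 2, sv_sr1)      # now U_2 is ready
```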

We omit the proof of Theorem 1 in this paper; details related to this result can be found in [6,10]. Specifically, Lamport [10] first introduced logical timestamps to describe causal relationships, and C. A. Ellis [6] first proposed an execution condition to preserve causality among operations.


4 TRANSACTIONS

Some critical operations need to be executed in transaction mode to guarantee integrity across multiple replicas. For example, in a money transfer, if an amount is debited from the withdrawal account, it must also be credited to the deposit account. Also, transactions should not interfere with each other. For more on desirable transaction properties, please refer to [16,7]. In this section, we discuss a light-weight approach to achieve this in a replicated environment.

4.1 Semantics

The transaction model is introduced to achieve concurrency transparency [16,7]. Under transaction semantics, although the system runs many transactions concurrently, each user can still assume that his/her transaction is the only operation in the system. To achieve the same goal in a distributed environment, we must guarantee that a transaction is serializable with other operations (both transactional and non-transactional).


Definition 5 (Serialized Transaction). Let $ T$ be a transaction, and $ O$ be any other operation. $ T$ is serialized iff either $ O$ is executed before $ T$ at all sites, or $ O$ is executed after $ T$ at all sites.

According to our definition of concurrent operations (Definition 3), it is easy to see that a serialized transaction does not have concurrent operations. This ensures that transaction integrity is guaranteed across multiple replicas. But in a fully replicated architecture, it is very hard to achieve global serialization. In the following, we describe in detail how we achieve global serialization in such a replicated environment.

4.2 A Causality-Based Approach

The state vector based approach for preserving causality is elegant and light-weight (Section 3). In this section, we implement transaction semantics based on causal relationships.

Of course, existing causal relationships are not enough to enforce serialization of operations across all sites. To solve this problem, we create dummy operations (NOOPs) to introduce a rich set of virtual causal relationships. With these virtual relationships, we implement transaction semantics.

Assume $ T_{1}$ is submitted to replica $ SR_{3}$. Because $ T_1$ is a transaction, it is not executed immediately at its local replica, as we must enforce transaction semantics across all sites. Each site $ SR_i$, including $ SR_3$ itself, creates a dummy NOOP$ _i$ upon receiving $ T_1$, and dispatches NOOP$ _i$ to other sites. This is shown in Figure 4, where a solid line represents the dispatching of a transaction from a local site to a remote site, a dashed line represents the dispatching of a NOOP, and a circle represents the dispatching of a NOOP to the local site itself.

Figure 4: Flexible Transaction Model

The NOOPs introduce a rich set of causal relationships. Assume $ NOOP_1$ is initiated at $ SR_1$ between its local operations $ O_1$ and $ O_3$, and $ NOOP_2$ is initiated at $ SR_2$ between its local operations $ O_2$ and $ O_4$. The NOOPs create the following causal relationships:

$\displaystyle O_1 \rightarrow NOOP_1 \rightarrow O_3$ and $\displaystyle O_2 \rightarrow NOOP_2 \rightarrow O_4$

Under the execution condition, $ NOOP_{1}$ and $ NOOP_{2}$ will be executed after $ O_{1}$ and $ O_{2}$, while $ O_{3}$ and $ O_{4}$ will be executed after $ NOOP_{1}$ and $ NOOP_{2}$. At any site, we execute $ T_1$ immediately after all NOOPs corresponding to $ T_{1}$ are executed. It is easy to see that at each site, $ T_1$ is executed after $ O_{1}$ and $ O_{2}$ but before $ O_{3}$ and $ O_{4}$; that is, $ T_1$ is serialized with respect to these concurrent operations.

The above procedure is not limited to this example; it enforces serializability in all cases. Intuitively, when NOOP$ _i$ is generated on $ SR_i$, it divides the concurrent operations local to $ SR_i$ into two sets: $ Before_i$ and $ After_i$, the concurrent operations that arrive before and after NOOP$ _i$ at $ SR_i$, respectively. The causality introduced by NOOP$ _i$ at site $ SR_i$ can be expressed as

$\displaystyle Before_i \rightarrow$ NOOP$\displaystyle _i \rightarrow After_i$
On any site $ SR_i$, at the time when $ T$ is causally ready to execute - i.e., all NOOPs have arrived and are causally ready to execute - we are assured that all concurrent operations in $ B = Before_1 \cup \cdots \cup Before_{N}$ have been executed on $ SR_i$, and no concurrent operation in $ A = After_1 \cup \cdots \cup After_{N}$ has been executed on $ SR_i$. Note that $ A \cup B$ is the entire set of concurrent operations and $ A\cap B=\emptyset$, as any operation is local to one and only one replica. Thus, for any operation $ O$, either $ O\in A$ or $ O\in B$ must be true, which means $ O$ executes either before $ T$ or after $ T$ on every site $ SR_i$. Thus, $ T$ is serialized.

There is one pitfall in the above reasoning: it is possible that on a site, the set of NOOPs will never be causally ready to execute. This happens when there are two or more concurrent transactions. Figure 5 shows an example. Two concurrent transactions arrive at replica sites in different orders. Let NOOP$ ^T_i$ denote the NOOP operation generated for transaction $ T$ at replica $ SR_i$. At site $ SR_{1}$, we have NOOP $ _{1}^{T_{2}}\rightarrow$ NOOP $ _{1}^{T_{1}}$, and at site $ SR_{3}$, we have NOOP $ _{3}^{T_{1}}\rightarrow$ NOOP $ _{3}^{T_{2}}$. Thus, neither the entire set of { NOOP$ _i^{T_1}$ } nor the entire set of { NOOP$ _i^{T_2}$ } will ever be causally ready for execution: the two transactions block each other.

Figure 5: Two Concurrent Transactions

Our goal is to achieve global serialization. Suppose the final order is $ T_1 \rightarrow T_2$. Clearly, causality of the form $ NOOP_{i}^{T_{2}}\rightarrow NOOP_{i}^{T_{1}}$ is inconsistent with this order. The solution is simply to remove such causalities. However, this requires all sites to agree on the final order $ T_1 \rightarrow T_2$.

The State Vector timestamping mechanism we introduced in Section 3 creates a virtual global clock, so that each replica site agrees on the ``happened-before'' relationship for causal operations. This gives us a way to order non-transactional operations. Transactions, however, do not have State Vector timestamps (they are not executed immediately on local replicas). In our case, transactions trigger non-transactional NOOP operations, which enable us to extend the order among non-transactional operations to transactions.

A non-transactional operation is assigned a state vector timestamp at the site it is generated, and dispatched to all other replicas with the same state vector timestamp. We define a $ TOrder$ function [14] based on such timestamps.
Definition 6 (TOrder: total order for non-transactional operations). Let $ O_a$ and $ O_b$ be two non-transactional operations timestamped with $ SV_a$ and $ SV_b$ and generated at sites $ i$ and $ j$ respectively. $ O_a \prec O_b$ iff (1) $ sum(SV_{a}) < sum(SV_{b})$, or (2) $ i < j$ and $ sum(SV_{a})=sum(SV_{b})$, where $ sum(SV)= \sum_{i=1}^{N} SV[i]$.

It is easy to see that, because each non-transactional operation has the same state vector timestamp at each replica, the total order among non-transactional operations defined above is agreed by all replicas.
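A minimal sketch of the TOrder comparison in Python. The key function and the representation of an operation as a (state vector, origin site id) pair are our own; we assume, as in Definition 6, that ties on the vector sum are broken by the origin site id.

```python
def torder_key(op):
    sv, origin = op           # state vector timestamp and origin site id
    return (sum(sv), origin)  # order by vector sum, break ties by site id

# Three operations originated at sites 2, 1 and 0:
ops = [([1, 0, 1], 2), ([0, 1, 0], 1), ([1, 1, 0], 0)]
ordered = sorted(ops, key=torder_key)
print(ordered)
# → [([0, 1, 0], 1), ([1, 1, 0], 0), ([1, 0, 1], 2)]
```

Because the timestamp travels with the operation, every replica computes the same key and hence the same total order.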

We then extend TOrder to create a total order among transactions.


Definition 7 (TOrder: total order for transactions). Let $ T_1$ and $ T_2$ be two transactions. $ T_1\prec T_2$ iff $ NOOP_1^{T_1}\prec NOOP_1^{T_2}$.


Algorithm 1: Handling operations generated at the local site $ SR_i$ (a transaction triggers a NOOP dispatched to all sites, including $ SR_i$).


Algorithm 2: Handling operations dispatched from remote sites (executing a remote operation from $ SR_j$ increments $ SV_i[j]$).

With a total order among transactions, we can solve the problem above. Algorithm 1 handles local operations, and Algorithm 2 handles operations dispatched from other sites. Finally, Theorem 2 shows that they achieve serializability.
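The NOOP barrier underlying these algorithms can be simulated abstractly. The sketch below is ours, not a transcription of Algorithms 1 and 2: it omits the causality bookkeeping and the resolution of concurrent transactions, and only shows that a transaction executes at a site once that site has executed one NOOP per replica.

```python
class Site:
    def __init__(self, site_id, n_sites):
        self.id = site_id
        self.n = n_sites
        self.noops_seen = {}  # transaction id -> set of NOOP origin sites
        self.log = []         # execution order at this site

    def emit_noop(self, tid):
        # On receiving transaction tid, every site (including the
        # transaction's local site) generates and dispatches a NOOP.
        return ("NOOP", tid, self.id)

    def execute_noop(self, noop):
        _, tid, origin = noop
        self.noops_seen.setdefault(tid, set()).add(origin)
        if len(self.noops_seen[tid]) == self.n:
            self.log.append(("T", tid))  # barrier complete: execute tid

sites = [Site(i, 3) for i in range(3)]
noops = [s.emit_noop("T1") for s in sites]
for s in sites:
    for noop in noops:
        s.execute_noop(noop)

assert all(s.log == [("T", "T1")] for s in sites)
```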

Theorem 2 Algorithms 1 and 2 ensure that transactions are serialized.
Proof. See Lemma 1 and Theorem 1 in the Appendix. $ \qedsymbol$


5 CONSISTENCY CONTROL

In this section, we discuss how to perform consistency control for concurrent, non-transactional operations.

5.1 Overview

To achieve faster response times, non-transactional operations are executed at their local replicas immediately after they are submitted. Hence, the execution order of concurrent operations differs from site to site. As we showed in the third example in Section 2.3, concurrent operations may create inconsistent results on different sites.
Figure 6: With regard to site $ SR_2$ in Figure 2, (a) is the XML document after the execution of $ U_4$. We retrace to the state shown in (b), which is before the execution of $ U_4$, and (c) is the result of executing $ U_3$ and then $ U_4$ on the retraced state.

To ensure consistency without using locks, we propose an XML storage model that allows us to recover XML documents to states before a previous concurrent operation was executed. Figure 6 gives an example. Figure 6(a) shows the XML document on site $ SR_2$ after the execution of operation $ U_4$, which changes the category of ``Statistical Learning'' from ``Math'' to ``CS''. Later, $ SR_2$ receives operation $ U_3$ and determines that $ U_3$ has an earlier timestamp than $ U_4$ (as indicated by TOrder). It therefore retraces the XML document to the state shown in Figure 6(b), which is the state before $ U_{4}$ was executed. Then $ U_3$ is executed on the retraced XML document, which adds a ``Discount'' tag to both ``Statistical Learning'' and ``Linear Algebra'' since both of them are in category ``Math'', and finally $ U_4$ is executed to change the category of ``Statistical Learning'' to ``CS''. The resulting XML document, shown in Figure 6(c), is the same as that on site $ SR_1$, where $ U_3$ is generated and executed first.

The high-level procedures for the local site and remote sites are shown in Algorithms 3 and 4, respectively. Both algorithms are concerned with non-transactional operations, as we have discussed transactions in Section 4. Algorithm 3 executes a local operation immediately once it is generated. Algorithm 4 retraces to earlier states when remote operations arrive with an earlier timestamp. Note that in neither algorithm are operations forced to wait.

In the rest of the section, we describe in detail the storage model and the retracing algorithm.


Algorithm 3: Executing a local non-transactional operation at site $ SR_i$ (execute immediately and set $ SV_{i}[i] \Leftarrow SV_{i}[i]+1$).


Algorithm 4: Integrating a non-transactional operation dispatched from a remote site $ SR_j$ (retrace if necessary and set $ SV_{i}[j] \Leftarrow SV_{i}[j]+1$).

5.2 The Storage Model

In this section, we introduce an efficient XML query processing model that supports retracing.


Range Labels and Inverted Lists

XML queries are processed by twig pattern matching [2]. Several approaches exist, including the relational approach [15], the native XML approach [2,3], and a mixture of the two [19]. In general, native XML data representation has better query efficiency. In this paper, we base our approach on native XML data storage.

The core of XML query processing relies on one fundamental operation: determining the ancestor-descendant relationship between two nodes in an XML document. An efficient way of telling whether one node is an ancestor (or descendant) of another is to use interval-based labels [4]. In the example shown in Figure 7 (which represents the XML document on site $ SR_2$ after the execution of $ U_4$ but before the execution of $ U_3$), each node $ n$ is labeled by an interval $ (start_n, end_n)$. Using the labels, we can immediately tell the ancestor-descendant relationships: node $ x$ is a descendant of node $ y$ iff $ start_x \in (start_y, end_y)$. The labels in Figure 7 are real values in the range $ [0, 1]$, which allows us to subdivide a range when new nodes are inserted.
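The containment test on range labels is a one-liner. Here is a sketch using the labels of the second book in Figure 7 (the helper name is ours):

```python
def is_descendant(x, y):
    """Node x is a descendant of node y iff start_x lies in (start_y, end_y)."""
    return y[0] < x[0] < y[1]

book = (0.4, 0.6)        # the second book in Figure 7
category = (0.53, 0.57)  # its category node

assert is_descendant(category, book)
assert not is_descendant(book, category)
```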

Figure 7: The range labels of XML document nodes

To support efficient twig matching, we employ inverted lists to organize labeled nodes. More specifically, for each element/attribute/value in the XML document, we create a linked list which includes the labels of nodes that have the same element/attribute/value. For instance, in Figure 8, the linked list for the book element has three members: $ (0.1, 0.3), (0.4, 0.6), (0.7, 0.9)$, corresponding to the 3 books in Figure 7.

Figure 8: Inverted list for XML in Figure 7

The inverted lists enable us to efficiently find the nodes involved in a query and match the twig pattern by joining the nodes. The inverted lists, however, represent a certain state of the XML document only. In order to retrace its temporal history, i.e., states of the XML document before certain update operations are executed, we need some additional mechanisms.


Timestamps

Assume an XML document has gone through a series of updates $ u_1, \cdots, u_i, \cdots, u_k$ (the updates in the series follow TOrder). Now a new update $ u'$ comes in, with $ u' \prec u_i$. We recover the document to the state before $ u_i$ was applied, and then apply $ u', u_i, \cdots, u_k$ on the recovered document.
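The recover-and-replay step can be sketched abstractly. Below, an operation is just its TOrder number, the document state records the order in which operations were applied, and integrate, apply_op, and undo are illustrative names of our own:

```python
state = []  # stand-in for the document: the sequence of applied operations

def apply_op(op):
    state.append(op)

def undo(op):
    assert state and state[-1] == op  # undo must proceed in reverse order
    state.pop()

def integrate(history, new_op):
    """Insert new_op into the TOrder-sorted history, undoing and
    redoing the suffix of operations that follow it in TOrder."""
    i = len(history)
    while i > 0 and history[i - 1] > new_op:
        i -= 1
    for op in reversed(history[i:]):
        undo(op)                 # retrace to the state before u_i
    apply_op(new_op)             # apply u'
    for op in history[i:]:
        apply_op(op)             # re-apply u_i, ..., u_k
    history.insert(i, new_op)

history = []
for op in [1, 2, 4]:             # updates already executed, in TOrder
    apply_op(op)
    history.append(op)
integrate(history, 3)            # a remote update with earlier TOrder arrives
assert state == [1, 2, 3, 4]
```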

In order to recover the document to a previous state, we assign each node in the inverted lists a pair of timestamps (create, delete), where create and delete are the TOrder numbers of the operations that create and delete the node. It is clear that create $ \prec$ delete. In other words, when an operation deletes a node, instead of removing the node from the inverted lists, we assign it a delete timestamp but keep it there. In Section 5.4, we show that we only need to keep a limited number of states in the inverted lists: when an operation has been executed on all sites, it has no concurrent operations, and its effect is made permanent on the inverted lists. In the next section, we discuss how retracing and query processing are performed on timestamped inverted lists.

Figure 9: State changes in inverted lists


5.3 Query Processing

In this section, we discuss how to augment traditional XML query processing engines to support retracing, querying, and updating on timestamped inverted lists.


Retracing

The (create, delete) timestamps enable us to retrace the temporal history of an XML document. Let us first examine an example. The inverted lists shown in Figure 9 correspond to the XML documents in Figure 6. Figure 9(a) shows the inverted lists on site $ SR_2$ after the execution of $ U_1$, $ U_2$, and $ U_4$. The Math node $ (0.53, 0.57)$ is created by operation $ U_2$ and deleted by operation $ U_4$; thus, it has timestamp $ (U_2, U_4)$. Since $ U_4$ changes the Math node to a CS node, a new node with timestamp $ (U_4,-)$ is created in the inverted list of CS. When $ U_3$ arrives, we find that, according to the total order, $ U_3\prec U_4$. We thus retrace to the state before $ U_4$ is executed, as shown in Figure 6(b). Since the Math node $ (0.53, 0.57)$ is timestamped $ (U_2, U_4)$, which means it was deleted by $ U_4$, its timestamp is rolled back to $ (U_2, -)$, making the node current again. On the other hand, the CS node with timestamp $ (U_4,-)$ is removed, as it was created at time $ U_4$, which is a future timestamp at time $ U_3$.

In summary, the inverted lists keep multiple versions, or multiple states, for the category of the book ``Statistical Learning''. This is shown in Figure 10. First, $ U_2$ changes its category from Unknown to Math, then $ U_4$ changes it again to CS. Thus, the three states can be arranged into a sequence. It is clear that there is no overlap in their timestamps, and at any time only one of them is valid.

Figure 10: Dependence Relationship
\begin{figure}\centering\epsfig{file=images/F7.eps,width=0.8\textwidth} \end{figure}

In general, only nodes whose timestamp $ (t_{create}, t_{delete})$ satisfies $ t\in(t_{create}, t_{delete})$ are valid at $ t$. For the example above, given $ U_{3}$, whose timestamp precedes that of $ U_4$, we know only the Math state is valid, and the Unknown and CS states are invalid.
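The validity test is a constant-time interval check. A minimal sketch, assuming TOrder numbers 0, 2, and 4 for the operations that produced the Unknown, Math, and CS states (the 0 for the initial state is our assumption for illustration):

```python
def valid_at(t, create, delete):
    # A node is visible to an operation with TOrder number t iff t falls
    # strictly inside (create, delete); an open delete end acts as +infinity.
    return create < t and (delete is None or t < delete)

# Three states of the category element, as in Figure 10.
states = {"Unknown": (0, 2), "Math": (2, 4), "CS": (4, None)}
visible = [name for name, (c, d) in states.items() if valid_at(3, c, d)]
print(visible)  # ['Math'] -- only the Math state is valid at U3
```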

Querying

A common way to tackle twig pattern matching is to decompose the twig pattern into a set of parent-child and ancestor-descendant relationships, query each relationship independently, and then join the results.

We introduce an algorithm called State-Join (Algorithm 5) to process basic parent/child relationships, ``$ A/D$'', or ancestor/descendant relationships, ``$ A//D$''. The algorithm takes into consideration the state of each XML node as defined by its timestamp. As finding parent-child and ancestor-descendant relationships is the most fundamental operation in XML query processing, State-Join can be used transparently by other XML query algorithms, such as the stack-based algorithm Stack-Tree-Desc [2], for twig-pattern matching.

Consider the example shown in Figure 8 (which represents the XML document on site $ SR_{2}$ after the execution of $ U_4$ but before the execution of $ U_3$). In order to process $ U_3$, we first fetch the inverted lists of ``book'', ``title'', ``category'' and ``Math''. Though the category of the book ``Linear Algebra'' has been changed to ``CS'', the original node (0.53, 0.57) still exists in the inverted list of ``Math'' and is timestamped ($ U_2$, $ U_4$). According to the total order, $ U_2\prec U_3 \prec U_4$, which means the node is valid as far as $ U_3$ is concerned. Since timestamp checking has $ O(1)$ cost, the State-Join algorithm does not increase the complexity of the original Stack-Tree-Desc algorithm.
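The idea behind State-Join can be sketched as follows. This is not the paper's Algorithm 5: for brevity a naive nested-loop containment join stands in for the stack-based Stack-Tree-Desc [2], but the timestamp filtering is the same $ O(1)$-per-node check, and the range labels for the category node are our own illustration.

```python
def state_join(alist, dlist, t):
    """Timestamp-aware ancestor/descendant join, in the spirit of State-Join.

    Entries are (start, end, create, delete): range labels plus TOrder
    timestamps. Nodes invalid at t are filtered out with an O(1) check per
    node; a naive containment join then pairs ancestors with descendants
    (the real algorithm would run Stack-Tree-Desc [2] here instead).
    """
    def valid(n):
        _, _, create, delete = n
        return create < t and (delete is None or t < delete)

    alive_a = [a for a in alist if valid(a)]
    alive_d = [d for d in dlist if valid(d)]
    # Range-label containment: a is an ancestor of d iff
    # a.start < d.start and d.end < a.end.
    return [(a, d) for a in alive_a for d in alive_d
            if a[0] < d[0] and d[1] < a[1]]

category = [(0.5, 0.6, 1, None)]  # a category element, still current
math = [(0.53, 0.57, 2, 4)]       # the Math node, timestamped (U2, U4)
print(len(state_join(category, math, 3)))  # 1 -- valid for U3 (U2 < U3 < U4)
print(len(state_join(category, math, 5)))  # 0 -- already deleted by U4
```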


Algorithm 5: State-Join


Updating

An update operation may change nodes in the inverted lists in two ways: (1) it creates new nodes with appropriate range labels and timestamps; (2) it revises the timestamps of existing nodes. The procedure is shown in Algorithm 6.


Algorithm 6: Update

Instead of modifying a node in place, an update operation creates a new state for the node in the state-based inverted list and leaves the original node intact (except for changing its timestamp). For example, in order to execute $ U_3$ at the state of Figure 8, it (i) creates two new nodes (0.585, 0.595) and (0.885, 0.895) and inserts them into the inverted list for ``tag''; (ii) creates two new nodes (0.588, 0.592) and (0.888, 0.892) and inserts them into the inverted list for ``Discount''; and (iii) sets the timestamps of all of them to ($ U_3$, -).
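The update behavior can be sketched as follows. The helper name and the argument shapes are hypothetical, chosen only to mirror the example above; the real Algorithm 6 also integrates retracing.

```python
def apply_update(inverted_lists, u, inserts, deletes):
    """Apply an update with TOrder number u to state-based inverted lists.

    `inserts` maps a term to a list of new (start, end) range labels;
    `deletes` maps a term to the (start, end) label of a node to delete.
    Nothing is physically removed: new nodes get timestamp (u, None) and
    deleted nodes merely receive u as their delete timestamp, so the old
    state is preserved for retracing.
    """
    for term, labels in inserts.items():
        for start, end in labels:
            inverted_lists.setdefault(term, []).append([start, end, u, None])
    for term, (start, end) in deletes.items():
        for node in inverted_lists.get(term, []):
            if node[0] == start and node[1] == end and node[3] is None:
                node[3] = u  # stamp the node, but keep it

lists = {}
# U3 inserts two "tag" nodes and two "Discount" nodes, as in the example
apply_update(lists, 3,
             {"tag": [(0.585, 0.595), (0.885, 0.895)],
              "Discount": [(0.588, 0.592), (0.888, 0.892)]},
             {})
print(lists["tag"][0])  # [0.585, 0.595, 3, None]
```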

XML documents are sensitive to the order of sibling nodes under a parent node. If two operations insert two nodes under the same parent node, we must ensure that their order is the same at all sites. We again use the $ TOrder$ function to determine the sibling order of nodes inserted by concurrent operations. A rigorous treatment of this problem may use techniques such as the scan function [8] to achieve global consistency for insert operations in a linear structure. We omit the details here; they can be found in [8].
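The tie-breaking rule can be sketched in a few lines. This is a hypothetical helper; the full treatment integrates this with the mark & retrace scan [8].

```python
def sibling_order(concurrent_inserts):
    """Deterministic sibling order for concurrent inserts under one parent.

    Each insert is a (torder, node) pair; sorting by the TOrder number
    yields the same sibling sequence on every replica, regardless of the
    order in which the inserts arrived.
    """
    return [node for torder, node in sorted(concurrent_inserts)]

# Two concurrent inserts arriving in opposite orders at two replicas:
print(sibling_order([(4, "b"), (2, "a")]))  # ['a', 'b']
print(sibling_order([(2, "a"), (4, "b")]))  # ['a', 'b']
```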


4 The Size of the Inverted Lists

To support rollback, the inverted lists keep multiple states of the documents. One question is, will the size of the inverted lists keep growing? The answer is no. The reason is that once an operation has executed on all replicas, it can no longer be concurrent with any other operation, so we can purge all timestamps related to it. In our approach, we store operations in an Operation History List (OHL). Each time a replica receives a new remote operation, it updates the OHL. When an operation has been executed on all replicas, it is removed from the OHL and its related state information is removed from the inverted lists. If the operation deleted any nodes, we remove those nodes permanently. Since the number of concurrent operations is bounded, the number of states we maintain in the inverted lists is bounded.
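The purging scheme can be sketched as follows. This is a hypothetical sketch: `executed_everywhere` stands in for whatever acknowledgment mechanism (e.g. ack vectors) tells a replica that an operation has reached all sites.

```python
def purge(ohl, inverted_lists, executed_everywhere):
    """Garbage-collect states once operations have reached every replica.

    An operation that has executed on all replicas can no longer be
    concurrent with anything, so its timestamps become permanent: nodes it
    deleted are physically removed, and its entry leaves the OHL.
    `executed_everywhere(op)` is an assumed acknowledgment oracle.
    """
    for op in [o for o in ohl if executed_everywhere(o)]:
        ohl.remove(op)
        for term in inverted_lists:
            # Drop nodes whose deletion by `op` is now permanent.
            inverted_lists[term] = [n for n in inverted_lists[term]
                                    if n[3] != op]

lists = {"Math": [[0.53, 0.57, 2, 4]], "CS": [[0.58, 0.6, 4, None]]}
ohl = [2, 4]
purge(ohl, lists, lambda op: True)  # assume every op reached all sites
print(lists["Math"], ohl)           # [] []
```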

5 Convergence

Our approach ensures fast response time: each non-transactional concurrent operation is executed immediately on its local replica. As a consequence, concurrent operations may be executed in different orders on different replicas, which means transient inconsistency exists across mirrored sites. However, we guarantee that at any moment when all operations have been executed on all replicas, the mirrored data on every replica converges to the same state. In essence, the convergence property ensures the consistency of the final results at the end of a cooperative editing session [14].

Theorem 3 At any time when all existing operations have been executed on all replicas, each replica has the same mirrored data (consistent).
Proof. See Lemma 2 and Theorem 2 in the Appendix. $ \qedsymbol$

6 EXPERIMENTS

In order to evaluate the efficiency of XML updates, we built a prototype system and compared it to a lock-based distributed system. The experiments were conducted on a 100 Mbps LAN with 2-4 sites, each equipped with a P4 2.4 GHz CPU and 1 GB of main memory, running Windows 2003. Both the prototype system and the lock-based distributed system were implemented in Java (version 1.6.0-b105). We adjust the frequency at which operations are issued, and simulate network latency by introducing artificial delays.

The experimental data comes from the DBLP website [1]; its current size is 133 MB. We create 70 queries and 70 updates and execute them repeatedly. An additional computer is used to submit these operations to each site at a given frequency and to record the average response time. The experiments are described as follows.

There are three types of operations: Query, Non-Transactional Update, and Transactional Update, labeled $ AST_{Query}$, $ AST_{NU}$ and $ AST_{TU}$, respectively. In the lock-based distributed system, all operations are executed in transaction mode, so it has only two types of operations: $ Lock_{Query}$ and $ Lock_{Update}$.

First, we evaluate the effect of different operation frequencies. The parameters are: 3 sites; the proportions of $ AST_{Query}$, $ AST_{NU}$ and $ AST_{TU}$ are 90%, 8% and 2%, respectively; and the network latency between any two sites is 200 ms. The result is shown in Figure 11:

Figure 11: Sending Command Frequency
\begin{figure}\centering\epsfig{file=images/Experiment2.eps,height=6cm} \end{figure}
Figure 12: Network Latency Between Sites
\begin{figure}\centering\epsfig{file=images/Experiment3.eps,height=6cm} \end{figure}

When the frequency is one operation every 50 ms or 100 ms, operations cannot all be executed immediately and are blocked in the waiting queue. $ AST_{TU}$ and $ Lock_{Update}$ involve more dependencies than $ AST_{NU}$, $ AST_{Query}$ and $ Lock_{Query}$ do, so the response times of $ AST_{TU}$ and $ Lock_{Update}$ rise rapidly. On the other hand, when the frequency drops to one operation every 200-600 ms, the response time of $ AST_{NU}$ is similar to that of query operations and far below that of $ AST_{TU}$ and $ Lock_{Update}$. By using $ AST_{NU}$ instead of $ AST_{TU}$, so that more operations execute at the local site, we can significantly improve the response time.

Second, we evaluate the effect of different network latencies. The parameters are: 3 sites; the proportions of $ AST_{Query}$, $ AST_{NU}$ and $ AST_{TU}$ are 90%, 8% and 2%; and the sending frequency is one operation every 200 ms. The network latency between sites is simulated by forcing a given delay. The result is shown in Figure 12.

As latency increases, the only notable growth is for $ AST_{TU}$ and $ Lock_{Update}$, as they require the participation of all sites. As noted before, one of the benefits of deploying more sites is that a user can select the nearest site for faster access and shorter response time. Relying on $ AST_{NU}$, which executes locally, is a good way to achieve this goal. Lock-based update operations, in contrast, need to request locks from all replicated sites and block all other conflicting operations. Even when more sites are deployed, the response time depends not only on the latency between the user and his/her entry site but also on the network latency among all replicated sites. Furthermore, network latency on the Internet is unstable and unavoidable.

Third, we evaluate the effect of different compositions of operation types. Similarly, the parameters are: 3 sites; the network latency between sites is 200 ms; and the frequency is one operation every 200 ms. The results are shown in Figure 13 and Figure 14.

Figure 13: Operations' Proportion
\begin{figure}\centering\epsfig{file=images/Experiment4.eps,height=6cm} \end{figure}
Figure 14: Effect for Average Responding Time in Different Operations' Proportion
\begin{figure}\centering\epsfig{file=images/Experiment5.eps,height=6cm} \end{figure}
Figure 15: Number of Sites
\begin{figure}\centering\epsfig{file=images/Experiment1.eps,height=6cm} \end{figure}

The figures show that as the ratio of $ AST_{NU}$ increases, the conflicts among $ AST_{TU}$ operations decrease, which lowers the response time of both $ AST_{NU}$ and $ AST_{TU}$. The figures also show that by increasing the ratio of $ AST_{NU}$, we can significantly reduce the average response time.

Finally, we evaluate the effect of the number of sites. The parameters are: the network latency is 200 ms; the average frequency is one operation every 50 ms; and the proportions of $ AST_{Query}$, $ AST_{NU}$ and $ AST_{TU}$ operations are 90%, 8% and 2%. The result is shown in Figure 15:

We choose the higher frequency of one operation every 50 ms so that the effect of the number of sites is visible. The result shows that as the number of sites increases, the average response time decreases.

In summary, the four experiments show that the algorithm meets our expectations. By executing $ AST_{NU}$ locally and avoiding the dependence on network latency, we achieve better load balance, high-speed access, and shorter response times.

7 RELATED WORK

Our approach is similar in spirit to some existing non-locking methods [14,12,5,9] that rely on operation transformation for consistency maintenance, where an operation is executed immediately at its local site and then dispatched to the remote sites. By including or excluding the effects of concurrent operations, a remote operation finds the correct state on which to execute. A major difference, however, is that these approaches only handle atomic operations (e.g. inserting or deleting an atomic object), no matter which structure they use (linear or tree). They do not support queries or updates with conditions, nor do they support transactions.

There are a few multi-replica consistency maintenance solutions. Bayou [13] implements a multi-replica mobile database, but for concurrent operations its repeated undo and redo lead to high system cost. TACT [13] tries to limit the divergence between replicas, but when a replica exceeds the limit, operations are blocked and response time increases. TSAE [13] uses ack vectors and vector clocks to learn about the progress of other replicas; the execution of an update operation is blocked until it arrives at all replicas. XStamps [20] proposes a timestamp-based concurrency control solution for multi-replica XML data, but some conflicting operations will be aborted. Though this would be avoidable in a serial execution, the authors point out that, in general, aborting is inevitable in XStamps.

Our work focuses on consistency control of structured data such as XML, and uses the stack-based algorithm for XML query processing. More recent work, including Holistic Twig Joins [3], further improves XML query performance. These approaches rely on the same inverted lists as the stack-based algorithm, and are thus compatible with our retracing approach.

8 CONCLUSIONS

In this paper, we proposed a lock-free approach for consistency maintenance in the Web 2.0 environment. Traditional concurrency control methods (e.g. two phase locking, serialization, etc.) create a throughput bottleneck among distributed mirror sites and block a large portion of user operations. Our method has two significant features. First, we do not use a locking mechanism, so concurrent operations are executed as soon as they arrive. Our algorithm ensures the convergence of the state of the shared documents, which satisfies the needs of most non-critical user operations in the Web 2.0 environment. Second, we support transaction semantics for critical operations without using a locking mechanism; instead, we rely on a lock-free causality preservation approach. We proved that our approach achieves both convergence and transaction semantics. Experimental results show that our approach achieves better load balance, high-speed access, and shorter response times.

REFERENCES

1
DBLP Computer Science Bibliography.
http://www.informatik.uni-trier.de/~ley/db/, 2007.
2
S. Al-Khalifa, H. V. Jagadish, J. M. Patel, Y. Wu, N. Koudas, and D. Srivastava.
Structural joins: A primitive for efficient XML query pattern matching.
In ICDE, 2002.
3
N. Bruno, N. Koudas, and D. Srivastava:.
Holistic twig joins: Optimal XML pattern matching.
In SIGMOD, 2002.
4
E. Cohen, H. Kaplan, and T. Milo.
Labeling dynamic XML trees.
In PODS, 2002.
5
A. H. Davis, C. Sun, and J. Lu.
Generalizing operational transformation to the standard general markup language.
In CSCW, 2002.
6
C. A. Ellis and S. J. Gibbs.
Concurrency control in groupware systems.
In SIGMOD, 1989.
7
K. Eswaran, J. Gray, R. Lorie, and I. Traiger.
The notions of consistency and predicate locks in a database system.
Communications of the ACM, 19(11):624-633, 1976.
8
N. Gu, J. Yang, and Q. Zhang.
Consistency maintenance based on the mark & retrace technique in groupware systems.
In SIGGROUP, 2005.
9
C.-L. Ignat and M. C. Norrie.
Customizable collaborative editor relying on treeOPT algorithm.
In ECSCW, 2003.
10
L. Lamport.
Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM, 21(7):558-565, 1978.
11
D. Li and R. Li.
Transparent sharing and interoperation of heterogeneous single-user applications.
In CSCW, 2002.
12
D. Li and R. Li.
Preserving operation effects relation in group editors.
In CSCW, 2004.
13
Y. Saito and M. Shapiro.
Optimistic replication.
ACM Computing Surveys, 37(1):42-81, Mar 2005.
14
C. Sun, X. Jia, Y. Zhang, Y. Yang, and D. Chen.
Achieving convergence, causality-preservation, and intention-preservation in real-time cooperative editing systems.
ACM Transactions on Computer-Human Interaction, 5(1):63-108, Mar 1998.
15
I. Tatarinov, Z. G. Ives, A. Y. Halevy, and D. S. Weld.
Updating XML.
In SIGMOD, 2001.
16
I. Traiger, J. Gray, C. Galthier, and B. Lindsay.
Transactions and consistency in distributed database systems.
ACM Transactions on Database Systems, 7(3):323-342, 1982.
17
W3C.
XML path language (XPath) 2.0.
http://www.w3.org/TR/xpath20/, 2005.
18
W3C.
XQuery 1.0: an XML query language.
http://www.w3.org/TR/xquery/, 2005.
19
H. Wang and X. Meng.
The sequencing of tree structures for XML indexing.
In ICDE, 2005.
20
K. Win, W. Ng, Q. Liu, and E. Lim.
XStamps: A multiversion timestamps concurrency control protocol for XML data.
In ICICS/PCM, 2003.
Lemma 1 The execution order of all transactional operations is consistent on all replica sites.
Proof. Suppose the execution order of any $ n-1$ transactions at all sites is consistent. We want to extend the result to $ n$ transactions.

Assume there are $ n$ transactions in the system: $ T_{1}, T_{2}, \cdots, T_{n}$. Without loss of generality, assume $ T_{1} \prec T_i$ for all $ 2 \leq i \leq n$. There may exist one or more $ NOOP_{a}^{T_{i}}$ with $ NOOP_{a}^{T_{i}}\rightarrow NOOP_{a}^{T_{1}}$, $ 2\leq i\leq n$. Since $ T_{1} \prec T_i$, we switch the timestamps of the involved NOOPs to reverse the causality. Specifically, $ NOOP_{b}^{T_{1}}$ switches its timestamp with $ NOOP_{b}^{T_{i}}$ until $ NOOP_{b}^{T_{1}}$ causally precedes all $ NOOP_{b}^{T_{i}}$, $ 2\leq i\leq n$.

Thus, $ T_{1}$ must be executed first among $ \{T_{1}, T_{2}, \cdots, T_{n}\}$ on all replica sites. The remaining operations are $ \{T_{2}, T_{3}, \cdots, T_{n}\}$, and by the induction hypothesis the execution order of these $ n-1$ operations at every site is consistent. Thus, the execution order of the $ n$ transactions is consistent at all replica sites. $ \qedsymbol$

Theorem 1 Transactions in our system are serialized (Definition 5).
Proof. According to Lemma 1, the execution order of transactions is consistent on all replicas. To prove the theorem, we only need to prove that the order between each transaction and each non-transactional operation is consistent on all replicas. Let $ T$ be a transaction and $ O$ be a non-transactional operation.

We consider the situation after all timestamp switches have taken place for transactional operations. Without loss of generality, assume $ O$ is generated at replica site $ a$. Since the switch rule only applies to NOOP operations from the same replica site, after the switches $ T$ still has one corresponding NOOP operation from each replica site. Let $ NOOP_{a}^{T}$ denote the NOOP operation from site $ a$. Since both $ O$ and $ NOOP_{a}^{T}$ are local operations of replica site $ a$, their relationship can only be a Causal Relationship (Definition 2), and their execution order is determined by the Execution Condition for remote operations (Definition 4). Since transaction $ T$ is executed together with $ NOOP_{a}^{T}$ at all sites, the execution order between $ T$ and $ O$ is fixed.

In summary, no matter which type an operation is, its execution order is determined. This satisfies Definition 5. $ \qedsymbol$

Lemma 2 Assume the system contains two concurrent operations and two replicas. The results are the same even though the two operations are executed in different order on the two replicas.
Proof. Assume $ U_1 \prec U_2$, and that on one of the replicas $ U_2$ executes first. We want to show that executing $ U_1$ after $ U_2$ leads to the same result. Consider any node $ n$ in the document. If $ n$'s timestamp is not changed by $ U_2$, then $ U_2$ has no effect on $ n$, so executing $ U_1$ before or after $ U_2$ makes no difference. Suppose $ n$'s timestamp is changed to $ (U_2, -)$, that is, $ n$ is created by $ U_2$. Since $ U_1 \prec U_2$, i.e., $ U_1 \not\in (U_2, -)$, the retracing algorithm makes $ n$ invisible to $ U_1$. Suppose instead $ n$'s timestamp is $ (U', U_2)$, i.e., $ n$ is deleted by $ U_2$. Assume $ n$'s original timestamp is $ (U', U'')$, so $ U' \prec U_2 \prec U''$. There are only two cases: (i) $ U'\prec U_1 \prec U''$ and (ii) $ U_1\prec U'$. In case (i) $ n$ is visible to $ U_1$ and in case (ii) $ n$ is invisible to $ U_1$, regardless of whether $ U_2$ has executed. This means that, with retracing, $ U_1$ sees the document as if $ U_2$ had never occurred. $ \qedsymbol$
Theorem 2 Assume the system contains $ n$ operations and any number of replicas. After the $ n$ operations have executed on all replicas, the results on all replicas are the same.

Proof. Lemma 2 proves consistency for two operations in a distributed environment. Based on Lemma 2, we apply the results in our previous work (Theorem 3 of [8]) to extend the proof to $ n$ operations. $ \qedsymbol$



Footnotes

...Web 2.0 Applications 1
This paper is partly supported by the National Natural Science Foundation of China (NSFC) under Grants No. 90612008 and No. 60736020, and by the National Grand Fundamental Research 973 Program of China under Grant No. 2005CB321905.

... $ (U_2, U_4)$2
For simplicity, we use $ U_2$ to denote the TOrder number of $ U_2$, which is used as timestamp.