Syntax for Semantic Enriching of Web Pages

Silvia Martelli
ISTI-CNR
Area della Ricerca di Pisa - Via G. Moruzzi 1, Pisa - Italy
+39 050 3152939
silvia.martelli@guest.cnuce.cnr.it
Jeremy J. Carroll
Hewlett-Packard Labs
Bristol UK, BS34 12QZ
jjc@hpl.hp.com
Oreste Signore
ISTI-CNR
Area della Ricerca di Pisa - Via G. Moruzzi 1, Pisa - Italy
+39 050 3152995
oreste.signore@cnuce.cnr.it

ABSTRACT

Linking is a core technology shared by the hypertext web and the semantic web. Extended XLinks can encode RDF graphs in the head of XHTML documents. These XLinks carry the semantic markup related to the document, typically using elements from Dublin Core. XLinks from the head into the body allow the document's own displayed metadata to be reused. XLink thus permits the use of RDF without the dreadful RDF/XML syntax. RDF/XML cannot be validated against an XML Schema or DTD, and hence does not embed into validated XHTML. The XLinks are 'harvested' as RDF statements.

Keywords

RDF, XHTML, XLink, metadata, Dublin Core.

1. INTRODUCTION

For the International Semantic Web Conference 2002, authors were encouraged to provide semantic markup for their abstracts, and a tool was provided to help with this. The semantic markup was created in RDF, and the result was RDF/XML stored as comments within the HTML (and comments within the RDF/XML caused formatting problems). This work addresses that syntactic discontinuity between the Web and the Semantic Web. A major use case for semantic web technologies, particularly RDF, has been semantic markup for the annotation of web pages, and basic web metadata is typically encoded in Dublin Core [2]. Several solutions to the embedding problem have been proposed, but all of them are less than ideal.

The solution we propose is to use XLink as a syntax for representing RDF graphs. The problems solved in this work, which concern embedding RDF within XHTML, are generic problems of using the dreadful RDF/XML syntax to layer RDF over XML. The XLink technique described here can encode arbitrary ad hoc semantic markup (as an RDF graph) in XML documents. Thus we do not restrict XLink to marking up hypertext links, but find it a workable syntax for marking up semantic links as well. Moreover, we see linking as one of the core technologies shared across the hypertext web and the semantic web.

2. METADATA FOR WEB PAGES

For metadata in HTML documents, the RDF Model and Syntax Recommendation recommends "simply to insert the RDF in-line". Another common approach is to use Dublin Core metadata within <meta> tags in the head element of the document. While it is possible to read such markup as RDF, it is only useful for metadata that conforms to the Dublin Core schema, or to a schema written with Dublin Core interoperation in mind. Hence it lacks the open-endedness of RDF metadata, for which any schema, or no schema at all, can be used.

Reference [3] shows how HTML span elements can be used to pick out of the document body the key data that forms the document metadata. An advantage its authors identify is that this avoids duplication. We note that the practice of including metadata inside the document goes back millennia, and suggest that supporting such a well-established practice is a must for any solution to web metadata. The practical advantage is that the human-readable document metadata and the machine-readable metadata are the same bits.

A problem that emerged after the publication of the RDF Model and Syntax Recommendation is that RDF/XML cannot be embedded in DTD-valid HTML 4.0. Instead, the revised RDF/XML syntax recommendation suggests that the metadata should form a separate document, related to the original document by a link element in the head. A long tradition can also be pointed to for this practice. We note that a DTD has been developed for RDF/XML expressing simple Dublin Core. This could, in principle, be added to the XHTML DTD; however, it does not extend to the more sophisticated requirements of qualified Dublin Core.

Two requirements for Web metadata markup are the ability to represent Dublin Core metadata (both simple and qualified) and compatibility with validated XHTML. None of the approaches surveyed in [4] has both properties. We suggest the use of XLink for metadata markup within XHTML. This solution, combined with "harvesting" techniques to extract RDF statements inspired by [5], does meet these requirements.

3. SOWING SEMANTICS

The author of a web document can directly identify the semantic information within the page, preferably using a tool such as OntoMat Annotizer [6]. To mark the position of this information in the page, a <span> element with a unique id attribute can be used. This markup acts as an anchor for the semantic description, which is built as an extended XLink in the <head> section. We preferred not to introduce new elements, in order to maintain higher compatibility with the XHTML DTD. Figure 1 shows a fragment of XHTML that uses URIrefs from the RDF Concepts and Abstract Syntax document.

Using XLink it is possible to describe any RDF statement: the subject may be a URI reference or a blank node; the property is a URI reference; the object may be a URI reference, a blank node, or a literal. Moreover, we have the ability to indicate the datatype URI of typed literals in RDF. A special case that is important in XHTML is the literal type rdf:XMLLiteral, which corresponds to the RDF/XML construction rdf:parseType="Literal". Recognizing this special datatype permits the inclusion of text with XHTML markup as literal values in the RDF graph. Such values matter for certain kinds of text, e.g. Japanese text marked up with ruby annotation (see [1]). If a literal value needed in the metadata is not available in the body of the document, it can be included explicitly in the head of the document. For more information about this topic see [4].

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" 
         xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en"
         xml:base="http://weblabsrv.cnuce.cnr.it/www2003/abstract.html"
         base="http://weblabsrv.cnuce.cnr.it/www2003/abstract.html">
<!-- xlink:href="" is a same document reference -RFC 2396- 
  and refers to this document-->
  <head xlink:type="extended">
    <link xlink:type="locator" xlink:href="" xlink:label="doc"
             xlink:role="http://www.w3.org/TR/rdf-concepts/#dfn-URI-reference"/>
    <link xlink:type="locator" xlink:href="#id-title" xlink:label="title"
             xlink:role="http://www.w3.org/TR/rdf-concepts/#dfn-plain-literal"/>
    <link xlink:type="arc" xlink:from="doc" xlink:to="title"
             xlink:arcrole="http://purl.org/dc/elements/1.1/title"/>
    ...
  </head>
  <body>
    <h1><span id="id-title">
  Syntax for Semantic Enriching of Web Pages</span></h1>
    <p> ...</p>
  </body>
</html>
Figure 1: Example Marked Up Document
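
The head markup of Figure 1 follows a regular pattern: one locator per node of the RDF graph and one arc per statement. As a minimal sketch of that pattern, the Python fragment below builds the same three link elements with the standard library. The helper names add_locator and add_arc are hypothetical and not part of the paper, and for brevity the elements are created without the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
RDFC = "http://www.w3.org/TR/rdf-concepts/#"
ET.register_namespace("xlink", XLINK_NS)


def xl(name):
    """Qualify an attribute name with the XLink namespace."""
    return "{%s}%s" % (XLINK_NS, name)


def add_locator(head, label, href, role):
    """Hypothetical helper: add a locator for one node of the RDF graph."""
    return ET.SubElement(head, "link", {
        xl("type"): "locator",
        xl("href"): href,
        xl("label"): label,
        xl("role"): RDFC + role,
    })


def add_arc(head, from_label, to_label, predicate):
    """Hypothetical helper: add an arc, i.e. one RDF statement."""
    return ET.SubElement(head, "link", {
        xl("type"): "arc",
        xl("from"): from_label,
        xl("to"): to_label,
        xl("arcrole"): predicate,
    })


# Rebuild the extended link of Figure 1: the document has a dc:title
# whose value is the text of the element with id "id-title".
head = ET.Element("head", {xl("type"): "extended"})
add_locator(head, "doc", "", "dfn-URI-reference")
add_locator(head, "title", "#id-title", "dfn-plain-literal")
add_arc(head, "doc", "title", "http://purl.org/dc/elements/1.1/title")

markup = ET.tostring(head, encoding="unicode")
print(markup)
```

An annotation tool could emit such elements automatically once the author has selected the span that holds each metadata value.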

4. EXTRACTING SEMANTICS

Both XLink and RDF provide a way of asserting relationships between resources, so it is possible to define a mapping between links and RDF statements. This process is referred to as harvesting. The key insight is that the meaning of the arcrole attribute in XLink matches the meaning of the predicate in an RDF statement. The underlying principles for harvesting are: each arc with an xlink:arcrole attribute gives rise to at least one RDF statement; the starting resource is the subject of the RDF statement; the ending resource is the object of the RDF statement; the value of the xlink:arcrole attribute is the predicate of the RDF statement. The full harvesting process can be implemented using appropriate XSLT transformations, as described in the pseudocode (Figure 2).

xmlns:rdfc = "http://www.w3.org/TR/rdf-concepts/#"
foreach (relevantArc) {'*[@xlink:type="extended"]/*[@xlink:type="arc" and @xlink:arcrole!=""]'}
  predicate := value-of(xlink:arcrole)
  subjectLabel := value-of(xlink:from)
  objectLabel := value-of(xlink:to)
  foreach (relevantSubjectLinkElement) {"//*[@xlink:label=$subjectLabel]"}
    choose
      when (xlink:role="rdfc:dfn-URI-reference")
        subjectType := "rdfc:dfn-URI-reference"
        subjectValue := value-of(xlink:href)
      when (xlink:role="rdfc:dfn-blank-node")
        subjectType := "rdfc:dfn-blank-node"
        subjectValue := value-of(xlink:label)
    endchoose
    foreach (relevantObjectLinkElement) {"//*[@xlink:label=$objectLabel]"}
      choose
        when (xlink:role="rdfc:dfn-URI-reference")
          objectType := "rdfc:dfn-URI-reference"
          objectValue := value-of(xlink:href)
        when (xlink:role="rdfc:dfn-blank-node")
          objectType := "rdfc:dfn-blank-node"
          objectValue := value-of(xlink:label)
        when (xlink:role="rdfc:dfn-plain-literal")
          objectType := "rdfc:dfn-plain-literal"
          objectValue := value-of(element that has id equal to xlink:href minus '#')
        when (xlink:role="rdf:XMLLiteral")
          objectType := "rdf:XMLLiteral"
          objectValue := value-of(element that has id equal to xlink:href minus '#')
      endchoose
      createTriple(subjectType, subjectValue, predicate, objectType, objectValue)
    endforeach
  endforeach
endforeach

Figure 2: Pseudocode: RDF Statements from XLinks
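
For readers who prefer something executable, the harvesting rules of Figure 2 can be sketched in Python rather than XSLT. This is an illustration under stated assumptions, not the transformation we developed: the base URI http://example.org/abstract.html is a placeholder, the role for XML literals is assumed to be the full rdf:XMLLiteral URI, labels are looked up only among the children of the same extended link, and literal values are taken as plain text content, as value-of does in the figure.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XLINK = "{http://www.w3.org/1999/xlink}"
RDFC = "http://www.w3.org/TR/rdf-concepts/#"
# Assumption: the rdf: prefix in Figure 2 denotes the usual RDF namespace.
XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <head xlink:type="extended">
    <link xlink:type="locator" xlink:href="" xlink:label="doc"
          xlink:role="http://www.w3.org/TR/rdf-concepts/#dfn-URI-reference"/>
    <link xlink:type="locator" xlink:href="#id-title" xlink:label="title"
          xlink:role="http://www.w3.org/TR/rdf-concepts/#dfn-plain-literal"/>
    <link xlink:type="arc" xlink:from="doc" xlink:to="title"
          xlink:arcrole="http://purl.org/dc/elements/1.1/title"/>
  </head>
  <body><h1><span id="id-title">Syntax for Semantic Enriching of Web Pages</span></h1></body>
</html>"""


def element_by_id(root, fragment):
    """Return the first element whose id matches a '#id' fragment, or None."""
    wanted = fragment.lstrip("#")
    return next((el for el in root.iter() if el.get("id") == wanted), None)


def describe(root, locator, base, allow_literal):
    """Map a locator to a (kind, value) pair, or None if its role is unusable."""
    role = locator.get(XLINK + "role")
    href = locator.get(XLINK + "href", "")
    if role == RDFC + "dfn-URI-reference":
        return ("uri", urljoin(base, href))
    if role == RDFC + "dfn-blank-node":
        return ("bnode", locator.get(XLINK + "label"))
    if allow_literal and role in (RDFC + "dfn-plain-literal", XML_LITERAL):
        target = element_by_id(root, href)
        if target is not None:
            # Text content only; a fuller version would keep child markup
            # when the role is rdf:XMLLiteral.
            kind = "xml-literal" if role == XML_LITERAL else "literal"
            return (kind, "".join(target.itertext()).strip())
    return None


def harvest(root, base):
    """Return one (subject, predicate, object) triple per usable arc pairing."""
    triples = []
    for ext in root.iter():
        if ext.get(XLINK + "type") != "extended":
            continue
        links = list(ext)
        for arc in links:
            predicate = arc.get(XLINK + "arcrole")
            if arc.get(XLINK + "type") != "arc" or not predicate:
                continue
            for s in links:
                if s.get(XLINK + "label") != arc.get(XLINK + "from"):
                    continue
                subject = describe(root, s, base, allow_literal=False)
                if subject is None:
                    continue
                for o in links:
                    if o.get(XLINK + "label") != arc.get(XLINK + "to"):
                        continue
                    obj = describe(root, o, base, allow_literal=True)
                    if obj is not None:
                        triples.append((subject, predicate, obj))
    return triples


triples = harvest(ET.fromstring(SAMPLE), "http://example.org/abstract.html")
for triple in triples:
    print(triple)
```

On the sample, which mirrors Figure 1, the sketch yields a single statement: the document has a dc:title whose value is the text of the span with id "id-title".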

5. CONCLUSION AND FUTURE WORK

The techniques we used concern both the semantic web and the (HTML) web; we do not restrict our architectural thoughts to one domain or the other. The W3C's Technical Architecture Group has not treated linking as primary in its Architectural Principles; we, instead, consider linking to be primary in web architecture. We have shown that XLink can be used with validated XHTML pages to encode semantic markup. The use of a single web language (XLink) for both semantic and hypertextual links is a distinct advantage. The XLink markup is verbose, but the use of XML entities could shorten it. This markup needs to be combined with semi-automatic tools for semantic markup in order to be deployable. With the XSLT transformation we developed, marked-up pages can easily be loaded into semantic web toolkits. XLink can also be used for embedding RDF graphs in generic XML documents.

We gained compatibility with HTML or XHTML. XLinks are described using XML attributes, which can be added to XHTML elements as shown. The same techniques could be used on HTML, but harvesting them would require more forgiving parsing technology than a standard XML parser. With current DTDs the XLinks that we add are not valid. However, the changes required to permit XLinks everywhere in XHTML are simple, and appear to be a requirement for XHTML 2.0. We also gained compatibility with RDF: our data is RDF, and we have defined an alternative XML serialization of the same abstract syntax.
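
The remark that XML entities could shorten the markup can be illustrated with a small sketch. It assumes two entity names, rdfc and dc, declared in an internal DTD subset; these names and the declaration are illustrative choices, not something the paper prescribes. The Python standard-library parser expands the entities, so a harvester sees the full URIs.

```python
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"

# Assumption: the entity names "rdfc" and "dc" are hypothetical and are
# declared in an internal subset purely for illustration.
SHORT_HEAD = """<!DOCTYPE head [
  <!ENTITY rdfc "http://www.w3.org/TR/rdf-concepts/#">
  <!ENTITY dc "http://purl.org/dc/elements/1.1/">
]>
<head xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
  <link xlink:type="locator" xlink:href="" xlink:label="doc"
        xlink:role="&rdfc;dfn-URI-reference"/>
  <link xlink:type="locator" xlink:href="#id-title" xlink:label="title"
        xlink:role="&rdfc;dfn-plain-literal"/>
  <link xlink:type="arc" xlink:from="doc" xlink:to="title"
        xlink:arcrole="&dc;title"/>
</head>"""

head = ET.fromstring(SHORT_HEAD)

# After parsing, every entity reference has been replaced by its full URI.
roles = [link.get(XLINK + "role") for link in head
         if link.get(XLINK + "type") == "locator"]
arcroles = [link.get(XLINK + "arcrole") for link in head
            if link.get(XLINK + "type") == "arc"]
print(roles)
print(arcroles)
```

The saving grows with the number of statements, since each namespace URI is written once in the declaration instead of once per attribute.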

6. REFERENCES

  1. W3C Recommendations. http://www.w3.org/TR/
  2. Encoding Dublin Core metadata in HTML. RFC 2731, IETF, December 1999.
  3. J.A. Kunze, F. Van Harmelen and D. Fensel. Practical Knowledge Representation for the Web. IJCAI 99.
  4. J.J. Carroll, S. Martelli and O. Signore. XLink in XHTML to Represent RDF. Technical Report ISTI-2003-TR-02.
  5. R.J. Daniel. Harvesting RDF Statements from XLinks. W3C Note, 2000. http://www.w3.org/TR/xlink2rdf
  6. S. Handschuh and S. Staab. Authoring and Annotation of Web Pages in CREAM. WWW2002.