Journal reference: Computer Networks and ISDN Systems, Volume 28, issues 7–11, p. 1559.

Interactive Video on WWW: Beyond VCR-like Interfaces

Arun Katkere - Jennifer Schlenzig - Amarnath Gupta - Ramesh Jain

Contact email: katkere@ucsd.edu
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407, USA

Abstract:

The WWW is evolving into a predominantly visual medium. The demand for access to images and video has been increasing rapidly. Interactive Video systems, which provide access to the content in video archives, are starting to emerge on the WWW. Partly due to the two-dimensional nature of the web, and partly due to the fact that images that comprise the video are two dimensional, most of these systems provide a VCR-like interface (play, fast-forward, reverse, etc., with additions like object selection, motion specification in the image space, and viewpoint selection). The basis of this paper is the realization that the video streams represent projections of a three-dimensional world, and the user is interested in this three-dimensional content and not the actual configuration of pixels in the image space. In this paper, we justify this intuition by enumerating the information-bearing entities that the user is interested in, and the information specification mechanisms that allow the user to query upon these entities. We will describe how such a intuitive system could be implemented using WWW technologies -- VRML, HTML, and HTTP -- and present our current WWW prototype which is based on extensions to some of these standards. This system is built on top of our multiple perspective interactive video (MPI Video) paradigm which provides a framework for the management of and interactive access to multiple streams of video data capturing different perspectives of related events.

1. Introduction

In a very short time, the World Wide Web has emerged as the most powerful framework for locating and accessing remote, distributed information. A number of protocols and interfaces have been designed for each of the many different kinds of information. For navigational access to documents with text, images and references, the hypertext metaphor for information request has been most popular. While for database-style search, both keyword and forms-based interfaces have been developed and are essentially Web-extended (or HTML enhanced) versions of individual native database languages[1, 24]. For most applications, these different needs of information access and manipulation do not cross boundaries. Thus virtual reality users do not make information browsing queries and hypertext document surfers typically do not navigate in a three-dimensional world. But why not? A truly collaborative virtual work environment must allow users to access documents, and 3D visualization of schema would surely improve user-database interaction. The purpose of this paper is to advocate the use of three dimensional user interfaces as a means of accessing various types of data on the World Wide Web. Specifically we address these issues in the context of multiple perspective interactive video (MPI Video)[6, 9].

Currently, the popular interface to interactive video resembles a enhanced VCR interface which allows only brief, sporadic feedback from the user. This limited interaction provides no support for the purpose of querying a database beyond simple who queries. To achieve interactive video we must empower the user with the capability of manipulating the spatio-temporal content of the video. In addition, it is within the province of the interface to offer more than button clicks and mouse movements. Environments such as ALIVE and those that incorporate gesture understanding[17] will have the greatest potential as an interactive interface.

It is the paradigm of MPI Video, described in more detail in Section 2, which demands and in fact enables this level of interaction. More than just a collection of video streams, the MPI Video environment is a heterogeneous, distributed information structure. The primary source of information is a number of live video streams acquired from a set of cameras covering a closed environment such as a football game. This environment has a static component consisting of a model of the environment which resides on a server. The server also contains a library of possible dynamic objects that can appear in the environment. Multiple sensors capture the event and the system dynamically reconstructs a sequence of camera-independent three-dimensional scenes from the video streams using computer vision techniques[7]. In MPI Video the role of the user is to view and navigate in this world as the real-life event unfolds. While remaining in this world, the user may also request additional information on any static or dynamic object. Secondary information resources such as hyper-linked HTML documents, databases of static images, and ftp sites of reference archives are available to the system and may need to be accessed either to initiate a user query or as the result of a query.

In Section 3 of this paper we propose a set of information classes that can be formulated in the MPI Video environment. We demonstrate why without a three-dimensional interface the user would lose the potential expressive power required in this paradigm. In Section 4, we elaborate on our information exchange architecture and how it supports the current query specification interface. In Section 5 we conclude the paper with a discussion of our future work plan.

2. The MPI Video paradigm

Figure 1: MPI Video System Architecture Overview

Multiple Perspective Interactive Video[6], MPI Video, provides a framework for the management of and interactive access to multiple streams of video data capturing different perspectives of related events[9]. MPI Video has dominant database and hypermedia components which allow a user to not only interact with live events but browse the underlying database for similar or related events or construct interesting queries.

Figure 2: Different Layers of the environment model. Arrows indicate data input either from other layers or from sensors.

The MPI Video architecture shown in Figure 1[6, 9] has the following components:

Video Data Analyzer: The MPI Video system must detect and recognize objects of potential interest and their locations in the scene. This requires powerful image segmentation methods. For structured applications, one may use knowledge of the domain and may even change or label objects to make the segmentation task easier.
Environment Model Builder: Individual camera scenes will be combined in this system to form a model of the environment. All potential objects of interest and their locations will be recorded in the environment model. The representation of the environment model depends on the facilities provided to the viewer.
Viewer Interface: A viewer is able to select the perspective that he or she desires. This information should be obtained from the user in a friendly but directed manner.
View Selector: The view selector responds to the user's request by selecting appropriate images to be displayed. These images may all come from one perspective or the system may have to select the best camera at every point in time to display the selected view and perspective.
Video Database: If the event is not a real time event, then it is possible to store the episode in a video database. Each camera sequence will be stored along with its metadata. Some of the metadata is feature based and allows content-based operations[5, 21]. Data can also be collected during a real time event and stored for later use.
Virtual View Builder: A particularly important component of MPI Video is Immersive Video[12], where a virtual camera is created for the viewer by combining the extracted model with the original video streams thus giving a sense of omniscient presence[12]. The viewer in an Immersive Video environment is no longer controlled by the limitations of a physical camera.

Three aspects central to this architecture are[9]:

Video data analysis and the assimilation of the multiple streams to form a single integrated world-representation. Selection of a ``best view'' from the input data stream.
A database subsystem which stores the raw video data, the derived data generated by the video analysis portion and any meta-data input by the user. The database supports content-based query operations by the user or software agents.
A hypermedia interface which supports navigation and querying of the wealth of data input to and derived by the system[22].

In this paper, we describe an interface to MPI Video in which the user primarily interacts with the system using an intuitive three-dimensional metaphor[8]. A basic WWW system has been built using behavior-enhanced VRML (e.g., VRBS[13], the upcoming VRML 2.x standard, etc.), HTML forms, and CGI scripts. Extensions to some of these are suggested in order to achieve a functional interactive video interface on the WWW.

2.1 MPI Video modeling

An important component of an MPI Video system is the Environment Model, a coherent, dynamic, multi-layered, three-dimensional representation of the content in the video streams (Figure 2[7]). It is this view-independent, task-dependent model that bridges the gap between two-dimensional image arrays which by themselves have no meaning and the complex information requirements placed by users and other components on the system.

The transformation of video data into objects in the environment model has been an ill-posed and difficult problem to solve. With the extra information provided by the multiple perspective video data, and with certain realistic assumptions (that hold in a large class of applications), however, it is possible to construct accurate models in a robust and quick fashion. In our current MPI Video systems, we make the following assumptions[8]:

objects are in motion most of the time,
objects move on known planar surfaces, and
objects are visible from at least two viewpoints.

In addition, we use the following sources of information extensively:

a priori knowledge of the geometry of the static environment,
knowledge about shapes and dynamic behaviors of moving objects in the domain, and
precomputed internal and external camera calibration models.

3. Information Specification in MPI Video

Expressiveness of interaction is fundamental to the design of any user interface model. This expressiveness can be achieved by using an information visualization metaphor as used by several database browsing and visualization groups[3]. Our motivation for developing a three-dimensional interface for MPI Video stems from the intuition that if is user is given a three-dimensional world changing dynamically with time, he or she would meaningfully operate in this world (i.e., can specify an object, or a space, or a search condition) only when he or she has the ability to navigate and act in it.

Intuitively, a three-dimensional interface would be extremely useful because:

Query Specification: it provides a natural way for specification of several types of queries such as ones involving spatial relationship specification
Infinite Perspectives: unlimited control over viewpoint allows a viewer to observe ``interesting'' actions from a convenient perspective
Selective Viewing: unlike video which is often cluttered, only interesting objects can be selectively displayed
Query Result Visualization: results of many types of queries are presented better in 3D

3.1 Information-bearing entities in MPI Video

To substantiate this intuition, let us first specify the information-bearing entities in MPI Video.

scene-object: A scene-object is an observable element in the reconstructed world, whose type (e.g., human or bicycle), shape, and motion history during the live event is known, but whose identity may not have been established.
object-part: Many of the objects reconstructed by computer vision techniques in MPI Video can be decomposed into subparts. Moreover, each subpart is likely to exhibit an articulated motion (like arm movement during walking) or a shape deformation happening over time. (e.g., in a changing facial expression). This requires us to treat parts of an object as separate information bearing entities distinct from the objects they belong to.
graphic-object: Languages such as VRML for the World Wide Web allow systems to efficiently and transport graphic models from one computer to another. MPI Video implies the ability to transport preconstructed static environment and the reconstructed scene objects across a network. It also implies that a user at a client computer can employ a three-dimensional interface as a data definition mechanism. For example, it can be used for a query by three-dimensional sketch mechanism. In that context, its role is that of a attributes generation which are then submitted to the system to be matched with scene objects appearing in the course of the live event.
space: Space is an intrinsic aspect of the world, which can be used to model a region of interest (this part of the world is called a ``corridor''). It can also be used to query a region of interest in the world. Space can be specified as a user-depicted region, by direction sensitive predicates (e.g., behind and to the left of the yellow building), distance-specific predicates (e.g., within 200 yards of the entrance), or by topological predicates (e.g., outside the ring). An important element for space is the concept of scale. It is possible that a three-dimensional world model is constructed at different granularity levels which can be structured hierarchically. Thus, information such as a ``play'' in football cannot be defined except in the context of a minimal scale of viewing. A related element in space specification is a ``viewpoint'' of observation. Depending on the observer's viewpoint the same query on an event may yield very different answers (e.g, how many persons left the building given this observer's viewpoint?).
time: Time specification is in many respects similar to space, with similar considerations for user-provided value ranges, temporal predicates, scale and resolution issues.
motion: Video offers the unique ability to visualize object and camera motions in the spatio-temporal scene. Many visually perceptible events can be expressed using motion. For example, Campbell and Bobick[2] has reported characterizing ballet movements using phase diagrams of velocity vectors. A difficult interface issue is to determine how a user may exactly articulate the motion desired. Ideally, motion may need to be specified in two complementary ways: by using displacement patterns over time and by specifying translational and rotational components of velocity for objects of interest. For both of these cases, a three-dimensional interface is necessary because visualizing a three dimensional displacement pattern is simpler than visualizing it from their planar projections with a two dimensional interface [19]. As a third case, the user may also need to specify a motion to indicate a time-varying viewpoint. In this case the requirement on the interface is to maintain a client-side distinction between the object of the scene and the observer of the scene.
raw video: Although many of the queries to the MPI Video system can be answered using the reconstructed spatio-temporal model of the real world, the raw video source of the information maintained at the server can still be very significant. The primary reason is that the computer vision procedures underlying the three-dimensional reconstruction are often task specific and may not use all information contained in the video. As a typical example, it may disregard the movement in the crowd for a sports event coverage. Thus the user may request simultaneous viewing of the raw video sources and their reconstructed form. Alternately the user may selectively view the video stream from a specific camera, over a defined sequence of frames. Still more significantly, the user may specify a viewing criteria (e.g., best view of a person, where ``best'' may mean ``occupying the center of the scene and covering more than x% of the frame''), and the system has to select the camera that satisfies the criteria. We note here that the system will possibly have to keep changing the camera in order to maintain the viewing criteria set by the user. In all of these cases, the critical implication to the World Wide Web framework is the desired client-server interactions, because the client must change the context of the video buffer from one channel to the another as soon as the viewing criteria forces it to.
Raw video can also be used for directed searching for some information that is not already extracted from the video. For instance, the system cannot be expected to have readily available information to answer an atypical query such as ``what is the dominant color of clothing among the crowd in the east side of the field?''. The querying client could apply this directed searching criterion to the raw video to obtained desired results.
image: Images are two-dimensional, yet they may play a critical role in specifying information in a three-dimensional world. An image from a key frame archive may be used to initiate a search through the environment of a live event video. Similarly an image of a modeled object from an image database may be used to identify a scene object.
document: Documents are omnipresent entities in the World Wide Web. In our application a document serves two purposes. Either it serves as an existing source of information, semantically linked to the environment and the objects being modeled, or it is a format for presentation of results. In both cases, the system has knowledge of the document structure used to index precomputed information or to compose the presentation, or both.

3.2 Functional requirements

Next, let us explore what operations need to be performed in the World Wide Web to define, update and manipulate the above information categories. The interface needs to allow:

client-controlled server-side update: Several different architectures are possible between the client system and its surrounds. In the foregoing discussion we generically refer to an information provider as a server as though it is connected point to point with a client. In reality communication with the server can be in a broadcast mode, or the system may communicate with a network of ``friends'' in a multicast mode. Regardless of the actual information exchange architecture, an important effect of having a three- dimensional interface for MPI Video is that the server can transport the full three-dimensional model to the client at the beginning of a session and subsequently send incremental updates to maintain consistency. At the level of the interface the client can specify update resolution to control the server-side update. Update resolution has three components -- the base update rate, the spatial resolution profile and the object granularity profile. The base update rate refers to the time interval between successive updates. The spatial resolution profile is a filter which partitions the spatial environment into blocks, and associates a priority value with each block. The priority value indicates the updates for certain regions of the space is more critical than others, and system should provide a more frequent, more secure update for these regions. For example, in a security application, a region near the entrance needs to have a better guaranteed update, compared to the adjoining courtyard. This may be implemented by an event-triggered, interrupt-based update rather than a scheduler-controlled periodic update at the base rate. The object granularity profile states the idea that not all scene object classes need to have the same granularity of description. Thus an unimportant object may be updated at a coarser shape granularity than a closely watched object. In a alerting application (an application which has registered a set of events to the system, and the system need to alert the user as soon as these events occur) users may also express combinations of the last two profiles. In this case the specification can be stated as ``Update with priority 1.0 as soon as an object of the type human enters this region and display the object with a voxel resolution of 1 cm'' (while the default resolution may be 10 cm).

Figure 3: Queries about entities such as plays are best described from a camera independent perspective
support for user interaction: Modes of user operation for a three-dimensional interface goes beyond what current HTML-based systems offer. For such systems, it is only possible to perform object selection (point and click operation), textual input, VCR-like operations (play, fast-forward, reverse, etc.), motion specification in image space, and viewpoint selection. For our set of intended objects, we additionally need the following operations:
1. paths, regions, and volumes of interest: The user interface needs to have a mechanism for selecting paths and regions of interest in three dimensions. The marking should be made either in an empty static environment model, which acts as a three-dimensional query from, or on an overlay of the reconstructed scene, or on an overlay of the replayed video. As shown in Figure 3, spatial paths are used to characterize a ``play''.
2. object construction and rendering: To implement a query (or model) by three dimensional sketch, the user must have a three-dimensional sketchpad, with a set of primitives, methods of deforming, scaling and combining them into create a meaningful object. This object is then rendered with rendering tools and placed in a three-dimensional environment. The task of the rendering requires one to specify the color, texture, reflectivity or any such other property that the queried (or modeled) object would have. The method to place them would need the ability to cut and paste, or copy and paste, or drag in the from the point of creation to the point of placement.
3. drag and drop: An effective way to perform on object based query is to use a library of predefined objects, or use a scene object and create an equivalent object with similar properties. The ability to perform inter-window and inter-application drag and drop facilitates this process. Specifically, one should be able to perform drag and drop objects between HTML documents and image databases, between image databases and static environment, between the reconstructed dynamic world and the graphical query form mentioned before, and so on.
4. velocity control: Motion can be implicitly specified by first drawing out the space covered and then placing a time constraint on it. However, it is more natural to provide a method to express the motion directly in terms of the displacement, velocity and acceleration. In this respect, video games are often equipped with velocity control mechanisms for changing both the magnitude and the direction of observer motion. The effect of the control is manifested by presenting a sequence of screens at the selected rate. In our case, methods are required to specify the motion components and not just to present the motion effect. It is however possible to create an interface where the user selects an object whose motion is of interest, and specifies the requirement by gesturing. In this case the system displays and keeps track of the intended velocity profile. This profile can be plotted for the user who then can then either choose to edit it, or put a range restriction by drawing an envelope around the curve.
5. texture mapping: Texture mapping has been cited a primitive computer graphic operation and may serve as a tool to specify image-level attributes of a query graphic object. For example, the user may create a human shaped graphic object, select its face object-part, and ascribe its identity by texture mapping a face image from an database.
maintenance of interaction history: The caching mechanism employed by current browsers on the World Wide Web is primitive because its unit of reference is the URL, and not the contents of a page. The VRML interfaces follow a ``traverse within the model space'' principle,and hence their target for caching is to quickly reconstruct the view given its previous state. Evidently this mechanism is ``closed''in the sense that only the model and at most the observer states need to be cached. The MPI Video paradigm needs at least three distinct uses of the information history which affects the interface and the caching policy. First, the user can perform several levels of undo operations while specifying the information requirement. As an example,the user can retrace the steps of an interactive three-dimensional query by sketch to make modifications before submitting a query. Second, the user may wish to refine a previous query, by going back a number of steps for interaction, modifying search parameters, and resubmitting it for evaluation. For this operation the client-side interface has to maintain both a history of the user interaction, as well as a cached sequence of previous results, to determine how much of the cached results can be reused for the modified query. The third situation for use of the interaction history is a replay mode, where the presentation condition is altered. For example, in a sports application the new replay request may require presentation of the last 30 seconds of action to be in slow motion, zoomed to show the best view of a specific player's body profile, with a greater display accuracy of his moving hand. The interface must provide controls to re-select the previous results, select objects and object parts of interest, enquire and respecify presentation parameters such as speed and viewing condition.
visualization of meta-information network: At any time during the user's interaction with the client system, the user needs to know the set of other information-bearing objects that are related to the information currently in use. For example, while watching a sports presentation, the information about the player statistics may also be available to the user although no explicit request has been made about it in the current session. This extra information may arise due to the fact that the server (or peer) maintains it as part of the total information context of the event. It could also arise because the user has a profile of his or her custom information needs, and the client had implicitly sent a request from this profile at the start of the session. The client has to have a mechanism of capturing and presenting this relation as a graph or network, which the user may wish to traverse by the way of browsing. It is possible that the interaction history and the meta-information network graph will be large enough to introduce performance penalties. However alternatives for cache and history management are beyond the scope of this paper.

4. MPI Video information exchange architecture

For a large number of users to access MPI Video archives, in addition to being intuitive and easy to use, the interface has to be widely accessible. In this section, we will describe how we can accomplish the user interactions described in Section 3 using the existing WWW protocols (such as HTTP) and languages (such as HTML, VRML, Java). With languages such as VRML and Java in a nascent stage, some enhancements are needed to implement even a rudimentary system. Wherever possible, our current implementations and proposed systems are based upon expected language enhancements.

4.1 Current prototype

Figure 4: Schematic of MPI Video interface showing the video data streams and the remote server, and the interface at the local user site. The local interface uses a HTML browser for initiating form-based queries and displaying text and image based system information, and a VRML browser for interacting with a three dimensional dynamic model of the underlying video data. Interactions between the different components are also shown. The client side is made dynamic and intelligent using behaviors.

Figure 4 shows schematically the various components of a WWW-based MPI Video system and some interactions between the components. To implement such a WWW-based MPI Video system, we need technologies for:

presentation of static and dynamic three-dimensional content
presentation of static and dynamic two-dimensional content
interaction in 3D domain
interaction in 2D domain

Unfortunately, the current WWW technology is designed to present and provide rudimentary interactions with two dimensional layouts. While this has proven to be sufficient for most of the current set of WWW-based interactive video and video database systems[23, 19], our user information specification paradigm, which allows the users to interact with the system at a content level instead of at the data level, cannot be easily implemented with this technology.

VRML[14], which is being designed primarily for multiuser interactions (``a scalable, fully interactive cyberspace''[15] such as Stephenson's Metaverse[20]), is currently usable as a way of presenting static 3D content on WWW. For our current implementations, we use a behavior-enhanced VRML prototype, VRBS[13] to present dynamic 3D content as well as provide rudimentary 3D interactions. Because this implementation uses VRBS[13], a experimental VRML behavior system that is not widely used, we cannot make it available on the WWW for wide accessibility until it is reimplemented using VRML 2.0 in a few months. At the time of writing, standardization of behaviors in VRML is underway, and the features we need to provide complex user interactions are being discussed (e.g, Moving Worlds proposal[18]).

Figure 5: Snapshot of an MPI Video WWW interface prototype showing dynamic models based on live events in a 3D (VRML) browser, a set of associated queries in an HTML browser and results of some queries.

Figure 5 shows a sample session at a client (user) site which uses the VRML browser, the HTML browser, and other applications[8]. The information is presented using the VRML browser, with interesting dynamic and static portions of the scene hyper-linked to either related VRML worlds or HTML query forms.

For the session shown in Figure 5, we used a campus courtyard with pedestrians covered with six video-resolution cameras. The video sequence was digitized at 10 frames per second and processed by the MPI Video modeling system. (A Quicktime video segment is available). The dynamic objects in the environment were detected and tracked to create a database of object ids, object locations, video clips, and related information built using flat files and dbm files.

Interface to this database is at two levels: via CGI scripts that interact with the underlying database, and via the VRML and HTML browsers that construct queries based on user input. All context information required to answer these queries is encoded as parameters to the CGI programs. Currently, the server answers all queries. As we will discuss in the Section 4.3, since information about objects, their locations, and their structure is continuously sent to each client, the client has the necessary information to answer a large class of queries without going to the server.

Using a combination of VRML and HTML browsers, the current system handles several types of queries:

Who is ``this'', where ``this'' is specified in 3D.
Where is Ms. X
- now?
- at time t?
Show me the best {profile|front} view(s) of Mr. Y
- now
- from time to time
This ability to specify the criterion for camera selection (instead of a video editor making the decision) is an important step towards empowering users[6].
Instead of selecting from a small number of real camera perspectives, a generalization of this query[7] would be to combine the virtual camera concept of Immersive Video[12] with an unconstrained or loosely constrained best-view criterion. Two examples of this type of query would be: show me this play from a panoramic point[25], and show me this play from the viewpoint of Player Z.
Show me the best closeup view of Mr. Z from {profile|front}.
Monitor this area and take appropriate action when a certain event (like a person entering the area) occurs. Currently, for the lack of better techniques, the center and radius of the area is typed in. As we argue in this paper, such queries are best specified in the 3D domain.
Show me:
- tracks of all dynamic objects
- tracks of Person X
- no tracks

To handle these queries, two client-side behaviors are implemented:

UpdateState: This behavior is called periodically to update the state of the world. The new state is downloaded from the server.
Monitor: This behavior is created on demand when the user requests monitoring of a certain region. Currently, since the behavior system (VRBS) does not have any ``sensors'', this behavior has to be called periodically. With the addition of sensors, this behavior could be called only when an event such as an object entering the specified region occurs.

4.2 Handling user interactions

While the prototype described above is useful in testing all forms of communication -- server and VRML browser, server and VRML behaviors, VRML behaviors and browser, server and HTML browser, VRML browser and HTML browser, etc. -- it contained only limited user interaction in 3D: viewpoint control and object selection. Even with the other queries being handled using HTML forms, this prototype is a significant step forward towards intuitive interactive video interfaces. To further advance the interface, we need to handle the types of user interactions described in Section 3.2 using VRML. The discussion in this section is based on the current version of one of the VRML 2.0 proposals[18]. We can safely assume that the VRML 2.0 standard will provide similar functionality.

A toolkit of editable basic shapes (such as lines, cubes, cylinders) is used to define paths, regions, and volumes of interest and to construct query objects. With a version of VRML that supports scripting, it is possible to implement a simple object construction suite. A more interesting question is how these are associated with queries. For example, if a user wants to ask the system ``did anybody come here, the user's definition of here has to be somehow associated with the form where the query is being constructed. A somewhat circuitous but feasible method is to ask the user to label each object she creates and to use the same label as parameters to queries. A more elegant solution is to allow the user to drag and drop objects. This requires the browsers to espouse a standard such as OpenDoc[16] or different browsers to be integrated.

4.3 Clients and servers

WWW is currently based on a client-server model[4]. A MPI Video system could be implemented in this framework. Our current prototypes use this model of interaction. When the system is to be accessed by a large number of users, especially in a live-event scenario, several problems arise: the load on the server increases as the number of querying clients increases, the problem of handling user's context information using a protocol free of context such as HTTP becomes more difficult, appreciable delay in the query response is counterproductive.

4.3.1 Intelligent clients

In our case, since the server is already sending the clients information about object position and structure continuously, if each client caches this information and has intelligence to answer frequently asked queries by itself, the load on the server will be reduced. The response to the query in this case is faster than the case where the client has to forward the query to the server and wait for a response. The user's context in this case is stored at the client, and this information is passed when necessary to the server. This model raises two key issues: specifying client side intelligence, and determining the default environment model entities that are sent to a client at every time instant so that most queries are handled by the client.

Client-side intelligence is specified using a safe language such as Java. Typically this is the same language as the one used for scripting behaviors in VRML. Because we do not want to send a general-purpose query handling engine to each client, to a certain extent, the client-side query handling logic is domain dependent. Determining the default environment model is a much harder problem. This is highly domain dependent. For example, the frequently queried upon entities in a football game are not the same as the entities queries upon in a interactive drama. An interesting strategy the could be followed is to start with a minimal default set, with minimal client query handling, and to incrementally enhance both based on server access statistics. How this is exactly achieved is currently being investigated.

4.3.2 Multicasting the environment model

If we have the client-side functionality described above, for live events, incremental updates to the default environment model may be multicast to the clients. This approach reduces the load on the server and engenders scalability. In this scenario, a new client joining the multicast session contacts the server (a nearby server in case of multiple servers) to download the current environment model and a default context. Alternatively, the client may choose to download the environment model and the context from a nearby ``friend''. After this bootstrapping, the client may monitor the current state by listening to the multicast channel for environment model updates. When the user queries, the client first checks its local environment model to check if it could answer the query locally. If the design of the system is correct, the information will be available locally most of the times. Queries that cannot be handled will be passed off to the server. The client may chose to augment its environment model with the results of this query.

This approach can be extended to handle clients (and networks) with different capabilities. Environment model, shown in Figure 2, is made up of easily decomposable layers. Hence, akin to the layered video concept of MBONE[10], we can multicast the environment model on several channels. Based on the client's capabilities (whether it can handle 10k polygons per second or a 4-joint articulated model) or the network bandwidth, the client can choose to listen to a subset of available channels.

4.4 Language for the environment model

Information in the MPI Video paradigm is shared using the environment model. How can this information be exchanged on the WWW? VRML, which is convenient for representing graphical entities is a good start for this. In addition, we need the ability to represent multi-modal information seamlessly, both in the continuous and discrete domain (``all media are equal''[11]). We also should be able to add semantics into the environment model which establishes links between different entities in the environment (e.g., this set of polygons is Person A's left hand).

Another issue is the decoding and the encoding of the environment model. In a WWW-based scenario, the server should spend time on the creation of the environment model if this helps the client in transforming the environment model into usable information faster. For instance, several higher-level entities in the environment model are deducible from the voxels that represent occupancy. But, from a WWW perspective, it is more efficient to compute these higher-level entities at the server.

5. Conclusion

Video is spatio-temporal data. To fully access the available information we must move beyond the preconceived notions of the VCR interface and keep in mind that:

Interactive TV is more than video-on-demand. Providing the user with only the capability to download videos at a convenient time, or select merchandise for purchase, ignores the fact that the scene captured by the video is inherently three-dimensional. It is this 3D data which the user wishes to manipulate.
User desired interactions require a 3D interface. Only 3D will support the desirable query by example.
Our current implementation is a step towards this goal, but assistance from the World Wide Web community is needed to enhance protocols which can support MPI Video. The worldwide success of any web-based application depends on the presence of standards which allow communication in a heterogeneous environment.

References

1: J. Boyle, J. E. Fothergill, and P. M. Gray. Design of a 3D user interface to a database. In J. Lee and G. Grinstein, editors, Database Issues for Data Visualization. IEEE Visualization '93 Workshop. Berlin, Germany: Springer-Verlag, 1994.
2: L. Campbell and A. Bobick. Recognition of human body motion using phase space constraints. Technical Report 309, MIT Media Laboratory, Perceptual Computing Section, MIT, Cambridge, MA, 1995.
3: C. Graham. Database visualization and VRML. In S. N. Spencer, editor, First Annual Symposium on the Virtual Reality Modeling Language, pages 21-24, San Diego, CA, 13-15 Dec. 1995. ACM Press.
4: K. Hughes. Entering the World-Wide Web: A Guide to Cyberspace. WWW document, Oct. 1993.
5: R. Jain and A. Hampapur. Metadata in Video Databases. In SIGMOD Record: Special Issue On Metadata For Digital Media. ACM: SIGMOD, Dec. 1994.
6: R. Jain and K. Wakimoto. Multiple Perspective Interactive Video. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 202-211, Washington, DC, USA, May 15-18 1995. Los Alamitos, CA, USA: IEEE Computer Society Press.
7: A. Katkere, S. Moezzi, D. Kuramura, P. Kelly, and R. Jain. Towards video-based immersive environments. ACM-Springer Multimedia Systems Journal: Special Issue on Multimedia and Multisensory Virtual Worlds, Spring 1996.
8: A. Katkere, J. Schlenzig, and R. Jain. VRML-Based WWW interface to MPI Video. In S. N. Spencer, editor, First Annual Symposium on the Virtual Reality Modeling Language, pages 25-32, 137, San Diego, CA, Dec. 13-15 1995. ACM Press.
9: P. H. Kelly, A. Katkere, D. Y. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain. An architecture for Multiple Perspective Interactive Video. In ACM Multimedia 1995 Proceedings, pages 201-212, San Francisco, CA, Nov. 5-9 1995.
10: S. McCanne. Layered Video. WWW document, Dec. 1995.
11: Microsoft Corporation. ActiveVRML white paper, Dec. 1995.
12: S. Moezzi, A. Katkere, D. Y. Kuramura, and R. Jain. Immersive Video. In Proceedings of the IEEE Virtual Reality Annual International Symposium 1996, Mar. 1996. To be published.
13: D. R. Nadeau and J. L. Moreland. The Virtual Reality Behavior System (VRBS): a behavior language protocol for VRML. In S. N. Spencer, editor, First Annual Symposium on the Virtual Reality Modeling Language, pages 53-61, San Diego, CA, Dec. 13-15 1995. ACM Press.
14: M. Pesce. VRML: browsing and building cyberspace. New Riders, 1995.
15: M. D. Pesce. VRML Architecture Group. WWW document, 1996.
16: K. Piersol. A Close-Up of OpenDoc. BYTE, Mar. 1994.
17: J. Schlenzig, E. Hunter, and R. Jain. Recursive identification of gesture inputs using hidden Markov models. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 187-194. IEEE Computer Society Press, 5-7 Dec. 1994.
18: Silicon Graphics, WorldMaker, Sony, OnLive, Black Sun, Visual Software, and Paper, Inc. The Moving Worlds Proposal for VRML 2.0. WWW document, Jan. 1996. maintained by Chris Marrin.
19: J. R. Smith and S.-F. Chang. VisualSEEk: a Content-Based Image/Video Retrieval System. Java-based WWW demo, 1996.
20: N. Stepenson. Snow Crash. Bantam Books, 1992.
21: D. Swanberg, T. Weymouth, and R. Jain. Domain information model: an extended data model for insertions and query. In Proceedings of the Multimedia Information Systems, pages 39-51, Feb. 1992.
22: L.-C. Tai. Hypermedia in Multiple Perspective Interactive Video. Visual Computing Laboratory internal document, 1996. version 0.7.
23: Telemedia, Networks, and Systems Group, MIT LCS. TNS Technology Demonstrations. WWW demo, 1996.
24: C. Varela, D. Nekhayev, P. Chandrasekharan, C. Krishnan, V. Govindan, D. Modgil, S. Siddiqui, O. Nickolayev, D. Lebedenko, and M. Winslett. DB: browsing object-oriented databases over the web. In Proceedings of the Fourth International World Wide Web Conference,, 1994.
25: D. Yow, B. Yeo, M. M. Yeung, and B. Liu. Analysis and Presentation of Soccer Highlights from Digital Video. In Proceedings, Second Asian Conference on Computer Vision, Dec. 1995.

Arun Katkere
Mon Jan 29 17:09:18 PST 1996

About the authors

Arun Katkere is a graduate student researcher in the Visual Computing Laboratory, University of California, San Diego and a doctoral student in the Department of Electrical and Computer Engineering, University of California, San Diego. He received an M.S.E. degree in Computer Science and Engineering from the University of Michigan, Ann Arbor in 1993 and a B. Tech degree in Computer Engineering from the Mangalore University, India in 1991. He worked on robotic vision and simulation systems as a research assistant at the Artificial Intelligence Laboratory, University of Michigan from 1991–1993. His current research interests include information assimilation, environment modeling, interactive video and multimedia systems, immersive telepresence, WWW and Internet technologies, computer vision and autonomous outdoor robotics.
http://vision.ucsd.edu/~katkere

Jennifer Schlenzig
No biographical information available.
http://vision.ucsd.edu/~schlenz

Amarnath Gupta
No biographical information available.
http://www.virage.com/emps/amarnath.htm

Ramesh Jain is currently a Professor of Electrical and Computer Engineering, and Computer Science and Engineering at University of California at San Diego. Before joining UCSD, he was a Professor of Electrical Engineering and Computer Science and the founding Director of the Artificial Intelligence Laboratory at the University of Michigan, Ann Arbor, MI 48109. He has also been affiliated with Stanford University, IBM Almaden Research Labs, General Motors Research Labs, Wayne State University, University of Texas at Austin, University of Hamburg, West Germany, and Indian Institute of Technology, Kharagpur, India. His current research interest include multimedia information systems, image databases, machine vision, and intelligent systems.

Professor Jain is the founding chairman of Imageware Inc., an Arbor based company dedicated to revolutionize software interfaces for emerging sensor technologies. He is also the founding chairman of Virage, a San Diego based company developing systems for visual information retrieval.

Professor Jain is a Fellow of IEEE, AAAI, and Society of Photo-optical Instrumentation Engineers, and member of ACM, Pattern Recognition Society, Cognitive Science Society, Optical Society of America, and Society of Manufacturing Engineers. He has been involved in the organization of several professional conferences and workshops, and served on editorial boards of many journals. Currently, he is the Editor-in-Chief of IEEE MultiMedia Magazine, and is on the editorial boards of Machine Vision and Applications, Pattern Recognition, and Image and Vision Computing. He received his Ph.D. from IIT, Kharagpur in 1975 and his B.E. from Nagpur University in 1969.
http://vision.ucsd.edu/~jain