Reachable Subwebs for Traversal-Based Query Execution

Olaf Hartig, M. Tamer Özsu
Database Research Group
Cheriton School of Computer Science
University of Waterloo

This page provides all digital artifacts related to our paper in the WebScience track of the 23rd Int. World Wide Web Conference (WWW 2014). All content on this page is licensed under the Creative Commons Attribution-Share Alike 3.0 License.



Paper (PDF, 6 pages)

Technical Report (PDF, 14 pages; extended version of the paper with all test queries and all measurements)

Visualizations of QPGs

Each of the following PDF documents provides a visualization of the QPG that we constructed using the information collected during the execution of the given FedBench-based CLD query (under cMatch-semantics). The blue vertices in these visualizations represent seed documents, the green vertices represent reachable LD documents that are relevant (as per the definition in the paper), and the red vertices are the reachable LD documents that are not relevant.

Note that some of these documents are very complex and may not render properly in some PDF viewers.

To generate these documents, we used the Graphviz tools (our query execution log processor SQUIN-webviz generates the necessary input files for these tools).
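As an illustration of the kind of Graphviz input involved, the following is a minimal sketch in Python that writes a DOT description using the color scheme described above. It is not the code of SQUIN-webviz: the function name qpg_to_dot, the role labels, the example URIs, and the assumption that edges are plain (source, target) pairs of document URIs are all hypothetical and chosen only for this example.

```python
# Hypothetical mapping from vertex role to fill color, following the
# color scheme described above (seed = blue, relevant = green,
# not relevant = red). The role names are assumptions of this sketch.
ROLE_COLORS = {"seed": "blue", "relevant": "green", "non-relevant": "red"}


def _quote(uri):
    """Return the URI as a double-quoted DOT identifier."""
    return '"' + uri.replace('"', '\\"') + '"'


def qpg_to_dot(roles, links):
    """Build DOT text for a directed graph of documents.

    roles: dict mapping a document URI to one of the ROLE_COLORS keys.
    links: iterable of (source_uri, target_uri) pairs (assumed edge format).
    """
    lines = ["digraph QPG {", '  node [style=filled, label=""];']
    for uri in sorted(roles):
        color = ROLE_COLORS[roles[uri]]
        lines.append(f"  {_quote(uri)} [fillcolor={color}];")
    for source, target in links:
        lines.append(f"  {_quote(source)} -> {_quote(target)};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example_roles = {
        "http://example.org/seed": "seed",
        "http://example.org/doc1": "relevant",
        "http://example.org/doc2": "non-relevant",
    }
    example_links = [
        ("http://example.org/seed", "http://example.org/doc1"),
        ("http://example.org/seed", "http://example.org/doc2"),
    ]
    print(qpg_to_dot(example_roles, example_links))
```

If the printed text is saved to a file such as qpg.dot (a hypothetical file name), the Graphviz command `dot -Tpdf qpg.dot -o qpg.pdf` renders it as a PDF document.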


Query Execution System

To execute CLD queries, we used our link-traversal-based query execution system SQUIN, which is Free Software, licensed under the Apache License, Version 2.0.

The SQUIN code depends on Apache Jena (for the experiments we used version 2.10.1) and the Norbert library (we used version 0.3.2).

We emphasize that the version of SQUIN used for the experiments is based on a new query engine. This engine guarantees completeness of query results for CLD queries under cMatch-semantics (in contrast to the previous, iterator-based engine studied in our earlier research papers). Furthermore, this new engine can be instructed to log information that is necessary for constructing QPGs. This logging functionality is enabled by adding the following lines to the file:




Hence, if this logging functionality is enabled, SQUIN generates three different log files, LogLinks.csv, LogIntSol.csv, and LogPrv.csv. We use these log files as input for the following tool.

Query Execution Log Processor

The source code of the tool that uses the aforementioned three log files to generate QPGs, visualize them, and compute measurements such as those given in the paper, can be found in the following package:

This tool depends on the JUNG framework (we used version 2.0.1), the JGraphT library (we used version 0.8.3), and the Grappa library (we used version 1.2).


To generate test Webs from a given base dataset and to simulate such a test Web using a Java servlet container (such as Apache Tomcat), we developed the WODSim framework, which is Free Software, licensed under the Apache License, Version 2.0.

The WODSim code depends on SQUIN and Apache Jena.

Test Webs

Each of the following packages contains a materialized version of one of the test Webs used for our study. To simulate such a test Web using WODSim (see above), unpack the package and refer to the resulting directory in the configuration file of the WODSim servlet.