DBTR-1: Giedrius Slivinskas, Christian S. Jensen, and Richard T. Snodgrass, Bringing Order to Query Optimization

A variety of developments combine to highlight the need for respecting order when manipulating relations. For example, new functionality is being added to SQL to support OLAP-style querying in which order is frequently an important aspect. The set- or multiset-based frameworks for query optimization that are currently being taught to database students are increasingly inadequate.
This paper presents a foundation for query optimization that extends existing frameworks to also capture ordering. A list-based relational algebra is provided along with three progressively stronger types of algebraic equivalences, concrete query transformation rules that obey the different equivalences, and a procedure for determining which types of transformation rules are applicable for optimizing a query. The exposition follows the style chosen by many textbooks, making it relatively easy to teach this material in continuation of the material covered in the textbooks, and to integrate this material into the textbooks.
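
To make the distinction between the equivalence types concrete, the following small sketch (invented relation and predicates, not taken from the paper) models relations as Python lists: one transformation preserves the list result, while another preserves only the multiset.

```python
# A tiny illustration of list- vs. multiset-equivalence of query expressions.
# Relations are Python lists of dicts; all data and predicates are invented.

R = [
    {"id": 1, "price": 30},
    {"id": 2, "price": 10},
    {"id": 3, "price": 20},
]

def select(rel, pred):       # sigma
    return [t for t in rel if pred(t)]

def sort(rel, key):          # order-by (Python's sort is stable)
    return sorted(rel, key=key)

def as_multiset(rel):
    return sorted(tuple(sorted(t.items())) for t in rel)

cheap = lambda t: t["price"] <= 20

# 1) Pushing a selection through a sort preserves the list result.
e1 = sort(select(R, cheap), key=lambda t: t["price"])
e2 = select(sort(R, key=lambda t: t["price"]), cheap)
assert e1 == e2                               # list-equivalent

# 2) Two sorts on different attributes commute only up to multiset equivalence.
e3 = sort(sort(R, key=lambda t: t["id"]), key=lambda t: t["price"])
e4 = sort(sort(R, key=lambda t: t["price"]), key=lambda t: t["id"])
assert as_multiset(e3) == as_multiset(e4)     # multiset-equivalent
assert e3 != e4                               # ...but not list-equivalent
print("list-equivalent:", e1 == e2, " multiset- but not list-equivalent:", e3 != e4)
```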

DBTR-2: Christian S. Jensen, Augustas Kligys, Torben Bach Pedersen, and Igor Timko, Multidimensional Data Modeling for Location-Based Services

With the recent and continuing advances in areas such as wireless communications and positioning technologies, mobile, location-based services are becoming possible. Such services deliver location-dependent content to their users. More specifically, these services may capture the movements of their users in multidimensional databases, and their delivery of content in response to user requests may be based on the issuing of complex, multidimensional queries.
The application of multidimensional technology in this context poses a range of new challenges. The specific challenge addressed here concerns the provision of an appropriate multidimensional data model. In particular, the paper extends an existing multidimensional data model and algebraic query language to accommodate spatial values that exhibit partial containment relationships instead of the total containment relationships normally assumed in multidimensional data models. Partial containment introduces imprecision in aggregation paths. The paper proposes a method for evaluating the imprecision of such paths. The paper also offers transformations of dimension hierarchies with partial containment relationships to simple hierarchies, to which existing precomputation techniques are applicable.
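
As a minimal illustration of why partial containment introduces imprecision, the following sketch (invented data and weights, not the paper's model or algebra) rolls a measure up from child values that are only partially contained in their parents and reports a lower bound, a weighted estimate, and an upper bound per parent.

```python
# Minimal sketch (invented data) of aggregation over a dimension level with
# partial containment: an antenna's coverage area, say, overlaps several
# districts, so content counted at the antenna only partially belongs to each.

# child -> list of (parent, degree of containment); degrees sum to <= 1.0
containment = {
    "antenna_A": [("district_1", 0.7), ("district_2", 0.3)],
    "antenna_B": [("district_2", 1.0)],            # total containment
}
measure = {"antenna_A": 100, "antenna_B": 40}       # e.g., number of users

def roll_up(measure, containment):
    """Return per-parent (lower bound, weighted estimate, upper bound)."""
    result = {}
    for child, parents in containment.items():
        for parent, degree in parents:
            lo, est, hi = result.get(parent, (0.0, 0.0, 0.0))
            m = measure[child]
            # lower bound: count the child only if it is fully contained;
            # estimate: distribute the measure by degree; upper bound: count it fully.
            result[parent] = (lo + (m if degree == 1.0 else 0.0),
                              est + m * degree,
                              hi + m)
    return result

print(roll_up(measure, containment))
# district_1: (0.0, 70.0, 100.0)   district_2: (40.0, 70.0, 140.0)
```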

DBTR-3: Arturas Mazeika, Michael Böhlen, and Peer Mylov, The Density Surfaces Module

DBTR-4: Albrecht Schmidt and Michael H. Böhlen, Parameter Estimation for Interactive Visualisation of Scientific Data

This paper presents a method for accelerating algorithms for computing common statistical operations like parameter estimation or sampling on B-Tree indexed data in the context of visualisation of large scientific data sets. The method builds heavily on technology that is already part of relational database management systems and requires only small extensions. The technical goal is the following: Given a massive set of scientific data like sensor data stored in a Relational Database Management System, enable interactive exploration and visualisation of the data by exploiting the technology already in place in the database back-end. The main underlying idea is the following: the shape of balanced data structures like B-Trees encodes and reflects data semantics according to the balance criterion. For example, clusters in the index attribute are likely to be present not only at the data (leaf) level of the tree but also to propagate up into the interior levels. The paper investigates opportunities and limitations of this approach for visualisation. The advantages of the method are manifold. Not only does it enable advanced algorithms through a performance boost for basic operations, but it also builds on functionality that is already present to a large degree in current RDBMSs; furthermore, it is fully dynamic by avoiding redundancy: when the underlying source data change, the index and therefore the estimates adapt accordingly. Finally, we show that the sample quality is data-independent and never worse than a uniform sample if some basic prerequisites are ensured.
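
The core observation can be illustrated without a full B-tree: because the leaves of a balanced index are roughly equally full, the separator keys one level above the leaves behave like approximate quantiles of the indexed attribute, so a coarse density estimate can be read off the index shape alone. A small sketch on synthetic data with a simulated leaf level; it is not the paper's algorithm.

```python
# Sketch: separator keys of a (simulated) B-tree leaf level act as approximate
# quantiles, so a coarse density estimate needs no scan of the data itself.
import random

random.seed(0)
data = sorted(random.gauss(50, 10) for _ in range(100_000))   # synthetic sensor values

LEAF_CAPACITY = 256
# The first key of every leaf plays the role of the separators stored one level
# up; with (roughly) equally full leaves they approximate quantiles.
separators = [data[i] for i in range(0, len(data), LEAF_CAPACITY)]

def estimated_fraction(lo, hi):
    """Estimate the fraction of values in [lo, hi) from the separators only."""
    inside = sum(lo <= s < hi for s in separators)
    return inside / len(separators)

lo, hi = 40.0, 60.0
true_fraction = sum(lo <= v < hi for v in data) / len(data)
print(f"estimate from index shape: {estimated_fraction(lo, hi):.3f}"
      f"   true fraction: {true_fraction:.3f}")
```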

DBTR-5: Alminas Civilis, Christian S. Jensen, Jovita Nenortaite, and Stardas Pakalnis, Efficient Tracking of Moving Objects with Precision Guarantees

We are witnessing continued improvements in wireless communications and geo-positioning. In addition, the performance/price ratio for consumer electronics continues to improve. These developments pave the way to a kind of location-based service that relies on the tracking of the continuously changing positions of the entire population of service users. This type of service is characterized by large volumes of updates, giving prominence to techniques for location representation and update.
In this paper, we present several representations, along with associated update techniques, that predict the future positions of moving objects. For all representations, the predicted position of a moving object is updated whenever the deviation between it and the actual position of the object exceeds a given threshold. For the case where the road network, in which the object is moving, is known, we propose a so-called segment-based policy that represents and predicts an object’s movement according to the road’s shape. Map matching is used for determining the road on which an object is moving. Empirical performance studies and comparisons of the proposed techniques based on a real road network and GPS logs from cars are reported.
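
The shared update principle can be sketched as follows: the client mirrors the server-side prediction and reports only when the prediction drifts too far from the GPS position; the point policy predicts a constant position, while the vector policy extrapolates with the last reported velocity. The log and threshold below are invented, and the segment-based policy (which additionally snaps the prediction to the road geometry) is omitted.

```python
# Sketch of threshold-driven position updates: the client mirrors the
# server's prediction and reports only when the deviation exceeds a bound.
import math

THRESHOLD = 200.0          # metres

def point_prediction(report, t):
    return report["pos"]                                    # constant position

def vector_prediction(report, t):
    dt = t - report["t"]
    return (report["pos"][0] + report["vel"][0] * dt,
            report["pos"][1] + report["vel"][1] * dt)       # linear extrapolation

def track(gps_log, predict):
    """Return the number of updates a policy issues on a GPS log."""
    report = {"t": gps_log[0][0], "pos": gps_log[0][1:3], "vel": gps_log[0][3:5]}
    updates = 1
    for t, x, y, vx, vy in gps_log[1:]:
        px, py = predict(report, t)
        if math.hypot(px - x, py - y) > THRESHOLD:
            report = {"t": t, "pos": (x, y), "vel": (vx, vy)}
            updates += 1                                     # send to the server
    return updates

# Invented log: (time [s], x, y, vx, vy); a car driving roughly east at 20 m/s.
log = [(t, 20.0 * t, 5.0 * math.sin(t / 30.0), 20.0, 0.0) for t in range(0, 600, 5)]
print("point policy:", track(log, point_prediction),
      "updates; vector policy:", track(log, vector_prediction), "updates")
```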

DBTR-6: Mong Li Lee, Wynne Hsu, Christian S. Jensen, Bin Cui, and Keng Lik Teo, Supporting Frequent Updates in R-Trees: A Bottom-Up Approach

Advances in hardware-related technologies promise to enable new data management applications that monitor continuous processes. In these applications, enormous amounts of state samples are obtained via sensors and are streamed to a database. Further, updates are very frequent and may exhibit locality. While the R-tree is the index of choice for multi-dimensional data with low dimensionality, and is thus relevant to these applications, R-tree updates are also relatively inefficient. We present a bottom-up update strategy for R-trees that generalizes existing update techniques and aims to improve update performance. It has different levels of reorganization ranging from global to local during updates, avoiding expensive top-down updates. A compact main-memory summary structure that allows direct access to the R-tree index nodes is used together with efficient bottom-up algorithms. Empirical studies indicate that the bottom-up strategy outperforms the traditional top-down technique, leads to indices with better query performance, achieves higher throughput, and is scalable.
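
The bottom-up idea can be sketched independently of a full R-tree implementation: a main-memory table maps each object ID directly to its leaf, so an update that stays within the leaf's bounding rectangle is applied in place, and only updates that leave it fall back to a more expensive top-down reinsertion. The structures below are simplified stand-ins, not the paper's data structures or algorithms.

```python
# Sketch of a bottom-up update path for an R-tree-like index: a direct-access
# table (oid -> leaf) lets local updates bypass the top-down descent.

class Leaf:
    def __init__(self, mbr):
        self.mbr = mbr                 # (xmin, ymin, xmax, ymax)
        self.entries = {}              # oid -> (x, y)

def inside(mbr, x, y):
    xmin, ymin, xmax, ymax = mbr
    return xmin <= x <= xmax and ymin <= y <= ymax

class BottomUpIndex:
    def __init__(self, leaves):
        self.leaves = leaves
        self.oid_to_leaf = {}          # compact main-memory summary structure
        self.topdown_operations = 0

    def insert(self, oid, x, y):
        # stand-in for a normal top-down R-tree insertion
        self.topdown_operations += 1
        leaf = next(l for l in self.leaves if inside(l.mbr, x, y))
        leaf.entries[oid] = (x, y)
        self.oid_to_leaf[oid] = leaf

    def update(self, oid, x, y):
        leaf = self.oid_to_leaf.get(oid)
        if leaf is not None and inside(leaf.mbr, x, y):
            leaf.entries[oid] = (x, y)          # cheap, purely local update
        else:
            if leaf is not None:
                del leaf.entries[oid]           # fall back: delete ...
            self.insert(oid, x, y)              # ... and reinsert top-down

# Two fixed leaves covering the unit square; an object jitters inside one leaf.
idx = BottomUpIndex([Leaf((0, 0, 0.5, 1)), Leaf((0.5, 0, 1, 1))])
idx.insert(42, 0.1, 0.1)
for i in range(1000):
    idx.update(42, 0.1 + (i % 10) / 1000, 0.1)   # exhibits update locality
print("top-down operations:", idx.topdown_operations)   # 1: only the initial insert
```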

DBTR-7: Dennis Pedersen and Torben Bach Pedersen, Synchronizing XPath Views

The increasing availability of XML-based data sources, e.g., for publishing data on the WWW, means that more and more applications (data consumers) rely on accessing and using XML data. Typically, the access is achieved by defining views over the XML data, and accessing data through these views. However, the XML data sources are often independent of the data consumers and may change their schemas without notification, invalidating the XML views defined by the data consumers. This requires the view definitions to be updated to reflect the new structure of the data sources, a process termed view synchronization. XPath is the most commonly used language for retrieving parts of XML documents, and is thus an important cornerstone for XML view definitions. This paper presents techniques for discovering schema changes in XML data sources and synchronizing XPath-based views to reflect these schema changes. In many cases, this allows the XML data consumers to continue their operation without interruption. Experiments show that the techniques work well even if both schema and data change at the same time. To our knowledge, this is the first presented technique for synchronizing views over XML data.

DBTR-8: Dennis Pedersen, Jesper Pedersen, and Torben Bach Pedersen, Integrating XML Data In The TARGIT OLAP System

This paper presents the results of industrial work on the logical integration of OLAP and XML data sources, carried out in cooperation between TARGIT, a Danish OLAP client vendor, and Aalborg University. A prototype has been developed that allows XML data stored outside the OLAP system to be used as dimensions and measures in the OLAP system in the same way as ordinary dimensions and measures. This provides a powerful and flexible way to handle unexpected or short-term data requirements as well as rapidly changing data. Compared to earlier work, this paper presents several major extensions that resulted from TARGIT’s requirements. These include the ability to use XML data as measures, as well as a novel multigranular data model and query language that formalizes and extends the TARGIT data model and query language.

DBTR-9: Agne Brilingaite, Christian S. Jensen, and Nora Zokaite, Enabling Routes as Context in Mobile Services

With the continuing advances in wireless communications, geo-positioning, and portable electronics, an infrastructure is emerging that enables the delivery of on-line, location-enabled services to very large numbers of mobile users. A typical usage situation for mobile services is one characterized by a small screen and no keyboard, and by the service being only a secondary focus of the user. Under such circumstances, it is particularly important to deliver the “right” information and service at the right time, with as little user interaction as possible. This may be achieved by making services context aware. Mobile users frequently follow the same route to a destination as they did during previous trips to the destination, and the route and destination constitute important aspects of the context for a range of services. This paper presents key concepts underlying a software component that identifies and accumulates the routes of a user along with their usage patterns and that makes the routes available to services. Experiences from using the component on logs of GPS positions acquired from vehicles traveling within a real road network are reported.

DBTR-10: Alminas Civilis, Christian S. Jensen, and Stardas Pakalnis, Techniques for Efficient Road-Network-Based Tracking of Moving Objects

With the continued advances in wireless communications, geo-positioning, and consumer electronics, an infrastructure is emerging that enables location-based services that rely on the tracking of the continuously changing positions of entire populations of service users, termed moving objects. This scenario is characterized by large volumes of updates, for which reason location update technologies become important. A setting is assumed in which a central database stores a representation of each moving object’s current position. This position is to be maintained so that it deviates from the user’s real position by at most a given threshold. To do so, each moving object stores locally the central representation of its position. Then an object updates the database whenever the deviation between its actual position (as obtained from a GPS device) and the database position exceeds the threshold. The main issue considered is how to represent the location of a moving object in a database so that tracking can be done with as few updates as possible. The paper proposes to use the road network within which the objects are assumed to move for predicting their future positions. The paper presents algorithms that modify an initial road-network representation, so that it works better as a basis for predicting an object’s position; it proposes to use known movement patterns of the object, in the form of routes; and it proposes to use acceleration profiles together with the routes. Using real GPS-data and a corresponding real road network, the paper offers empirical evaluations and comparisons that include three existing approaches and all the proposed approaches.
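
A road-network-based prediction can be sketched as follows: the shared representation is a route polyline plus a report time and an assumed speed, and the predicted position at time t is found by advancing the corresponding distance along the polyline. The route and speed below are invented; the paper's representations additionally modify the network geometry and use acceleration profiles.

```python
# Sketch: predict a position at time t by advancing along a route polyline
# at an assumed speed; an update is needed only when the GPS position
# deviates from this road-shaped prediction by more than the threshold.
import math

def advance_along(polyline, distance):
    """Point reached after travelling `distance` metres along the polyline."""
    for (x1, y1), (x2, y2) in zip(polyline, polyline[1:]):
        seg = math.hypot(x2 - x1, y2 - y1)
        if distance <= seg:
            f = distance / seg
            return (x1 + f * (x2 - x1), y1 + f * (y2 - y1))
        distance -= seg
    return polyline[-1]                       # past the end of the route

def predicted_position(report, t):
    return advance_along(report["route"], report["speed"] * (t - report["t"]))

# Invented route (metres) with a 90-degree turn, followed at 15 m/s.
route = [(0.0, 0.0), (600.0, 0.0), (600.0, 600.0)]
report = {"route": route, "speed": 15.0, "t": 0.0}

for t in (10, 30, 50):
    print(t, predicted_position(report, t))
# A straight-line (vector) prediction would leave the road after the turn at
# x = 600 m, i.e., after 40 s, while the route-based prediction follows it.
```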

DBTR-11: Steffen Ulsø Knudsen, Torben Bach Pedersen, Christian Thomsen, and Kristian Torp, RelaXML: Bidirectional Transfer Between Relational and XML Data

In modern enterprises, almost all data is stored in relational databases. Additionally, most enterprises increasingly collaborate with other enterprises in long-running read-write workflows, primarily through XML-based data exchange technologies such as web services. However, bidirectional XML data exchange is cumbersome and must often be hand-coded, at considerable expense. This paper remedies the situation by proposing RELAXML, an automatic and effective approach to bidirectional XML-based exchange of relational data. RELAXML supports re-use through multiple inheritance, and handles both export of relational data to XML documents and (re-)import of XML documents with a large degree of flexibility in terms of the SQL statements and XML document structures supported. Import and export are formally defined so as to avoid semantic problems, and algorithms to implement both are given. A performance study shows that the approach has a reasonable overhead compared to hand-coded programs.
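
The export direction can be illustrated with the Python standard library alone; this generic sketch is not RELAXML's document structure, concept definitions, or algorithms.

```python
# Generic sketch of exporting the result of an SQL statement to XML
# (standard library only; not RELAXML's concept/structure definitions).
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Alice", 99.5), (2, "Bob", 12.0)])

def export(conn, sql, root_tag, row_tag):
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    root = ET.Element(root_tag)
    for row in cur:
        elem = ET.SubElement(root, row_tag)
        for col, val in zip(cols, row):
            ET.SubElement(elem, col).text = str(val)
    return ET.tostring(root, encoding="unicode")

print(export(conn, "SELECT id, customer, total FROM orders", "orders", "order"))
# Re-import would parse the document and map the elements back to INSERT/UPDATE
# statements -- the direction that requires the formal definitions in the paper.
```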

DBTR-12: Xuepeng Yin and Torben Bach Pedersen, What can Hierarchies do for Data Streams

Much effort has been put into building data stream management systems for querying data streams. Here, data streams have been viewed as a flow of low-level data items, e.g., sensor readings or IP packet data. Stream query languages have mostly been SQL-based, with the STREAM and TelegraphCQ languages as examples. However, there has been little work on supporting OLAP-like queries that provide multi-dimensional and summarized views of stream data. In this paper, we introduce a multidimensional stream query language and its formal semantics. Our approach enables powerful OLAP queries against data streams with dimension hierarchies, thus turning low-level data streams into informative high-level aggregates. A comparison with STREAM shows that our approach is more flexible and powerful for high-level OLAP queries, as well as far more compact and concise.
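
The gist of turning a low-level stream into high-level aggregates can be sketched as a per-window roll-up along a dimension hierarchy (here an invented sensor-room-floor hierarchy); the sketch does not reflect the proposed query language or its formal semantics.

```python
# Sketch: rolling low-level stream readings up a dimension hierarchy
# (sensor -> room -> floor) per time window, OLAP style.
from collections import defaultdict

sensor_to_room = {"s1": "room_101", "s2": "room_101", "s3": "room_202"}
room_to_floor = {"room_101": "floor_1", "room_202": "floor_2"}

def rollup_per_window(stream, window_size):
    """stream: time-ordered (timestamp, sensor, value) tuples; yields
    (window, {floor: average value}) each time a window closes."""
    current, sums, counts = None, defaultdict(float), defaultdict(int)
    for ts, sensor, value in stream:
        window = ts // window_size
        if current is not None and window != current:
            yield current, {f: sums[f] / counts[f] for f in sums}
            sums, counts = defaultdict(float), defaultdict(int)
        current = window
        floor = room_to_floor[sensor_to_room[sensor]]   # roll up the hierarchy
        sums[floor] += value
        counts[floor] += 1
    if current is not None:
        yield current, {f: sums[f] / counts[f] for f in sums}

readings = [(0, "s1", 20.0), (1, "s2", 22.0), (2, "s3", 18.0),
            (10, "s1", 21.0), (12, "s3", 19.0)]
for window, aggregates in rollup_per_window(readings, window_size=10):
    print("window", window, aggregates)
# window 0 {'floor_1': 21.0, 'floor_2': 18.0}
# window 1 {'floor_1': 21.0, 'floor_2': 19.0}
```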

DBTR-13: Igor Timko, Curtis E. Dyreson, and Torben Bach Pedersen, Probabilistic Data Modeling and Querying for Location-Based Data Warehouses

Motivated by the increasing need to handle complex, dynamic, uncertain multidimensional data in location-based warehouses, this paper proposes a novel probabilistic data model that can address the complexities of such data. The model provides a foundation for handling complex hierarchical and uncertain data, e.g., data from the location-based services domain such as transportation infrastructures and the attached static and dynamic content such as speed limits and vehicle positions. The paper also presents algebraic operators that support querying of such data. Use of pre-aggregation for implementation of the operators is also discussed. The work is motivated with a real-world case study, based on our collaboration with a leading Danish vendor of location-based services.

DBTR-14: Igor Timko, Curtis E. Dyreson, and Torben Bach Pedersen, Probability Distributions as Pre-Aggregated Data in Data Warehouses

DBTR-15: Claus A. Christensen, Steen Gundersborg, Kristian de Linde, and Kristian Torp, A Unit-Test Framework for Database Applications

The outcome of a test of an application that stores data in a database naturally depends on the state of the database. It is therefore important that test developers are able to set up and tear down database states in a simple and efficient manner. In existing unit-test frameworks, setting up and tearing down such test fixtures is labor intensive and often requires copy-and-paste of code. This paper presents an extension to existing unit-test frameworks that allows unit tests to reuse data inserted by other unit tests in a very structured fashion. With this approach, the test fixture for each unit test can be minimized. In addition, the reuse between unit tests can speed up the execution of test suites. A performance test on a medium-size project shows a 40% speed up and an estimated 25% reduction in the number of lines of test code.
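
The reuse idea can be sketched in a few lines: each test declares which other test's inserted data it depends on, and the prerequisite is executed once, leaving its rows in place for the consumers. The decorator below is hypothetical and only illustrates the structure; the paper extends existing unit-test frameworks.

```python
# Hypothetical sketch of reusing data inserted by other unit tests: a test
# declares its prerequisites, which are executed (once) before it runs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, cust INTEGER, total REAL)")

_done = set()

def depends_on(*prereqs):                      # hypothetical framework decorator
    def wrap(test):
        def run():
            if run in _done:                   # each fixture is inserted only once
                return
            for p in prereqs:                  # reuse data inserted by other tests
                p()
            _done.add(run)
            test()
        run.__name__ = test.__name__
        return run
    return wrap

@depends_on()
def test_insert_customer():
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    assert conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0] == 1

@depends_on(test_insert_customer)              # minimal fixture: no copy-pasted setup
def test_insert_order():
    conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")
    name = conn.execute("SELECT name FROM orders JOIN customers ON cust = customers.id"
                        " WHERE orders.id = 10").fetchone()[0]
    assert name == "Alice"

test_insert_order()       # runs its prerequisite first and reuses its rows
test_insert_customer()    # already executed as a fixture; not run again
print("ok")
```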

DBTR-16: Xuegang Huang, Christian S. Jensen, and Simonas Saltenis, The Islands Approach to Nearest Neighbor Querying in Spatial Networks

Much research has recently been devoted to the data management foundations of location-based mobile services. In one important scenario, the service users are constrained to a transportation network. As a result, query processing in spatial road networks is of interest. In this paper, we propose a versatile approach to k nearest neighbor computation in spatial networks, termed the Islands approach. By offering flexible yet simple means of balancing re-computation and pre-computation, this approach is able to manage the trade-off between query and update performance, and it offers better overall query and update performance than do its predecessors. The result is a single, efficient, and versatile approach to k nearest neighbor computation that obviates the need for using several k nearest neighbor approaches for supporting a single service scenario. The experimental comparison with the existing techniques uses real-world road network data and considers both I/O and CPU performance, for both queries and updates.

DBTR-17: Xuepeng Yin and Torben Bach Pedersen, Algebra-Based Optimization of XML-Extended OLAP Queries

In today’s OLAP systems, integrating fast-changing data, e.g., stock quotes, physically into a cube is complex and time-consuming. The widespread use of XML makes it likely that such data is available in XML format on the WWW; thus, logically federating XML data with OLAP systems is desirable. This report presents a complete foundation for such OLAP-XML federations. This includes a prototypical query engine, a simplified query semantics based on previous work, and a complete physical algebra which enables precise modeling of the execution tasks of an OLAP-XML query.
Effective algebra-based and cost-based query optimization and implementation are also proposed, as well as execution techniques. Finally, experiments with the prototypical query engine w.r.t. federation performance, optimization effectiveness, and feasibility suggest that our approach, unlike physical integration, is a practical solution for integrating fast-changing data into OLAP systems.

DBTR-18: Juan Manuel Perez, Rafael Berlanga, Maria Jose Aramburu, and Torben Bach Pedersen, Integrating Data Warehouses with Web Data: A Survey

This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query, and retrieve web data, and their application to data warehouses. The paper addresses the problem of integrating heterogeneous DWs and explains how to deal with both semi-structured and unstructured data in DWs and On-Line Analytical Processing.

DBTR-19: Xuegang Huang and Christian S. Jensen, A Streams-Based Framework for Defining Location-Based Queries

An infrastructure is emerging that supports the delivery of on-line, location-enabled services to mobile users. Such services involve novel database queries, and the database research community is quite active in proposing techniques for the efficient processing of such queries. In parallel to this, the management of data streams has become an active area of research. While most research in mobile services concerns performance issues, this paper aims to establish a formal framework for defining the semantics of queries encountered in mobile services, most notably the so-called continuous queries that are particularly relevant in this context. Rather than inventing an entirely new framework, the paper proposes a framework that builds on concepts from data streams and temporal databases. Definitions of example queries demonstrate how the framework enables clear formulation of query semantics and the comparison of queries. The paper also proposes a categorization of location-based queries.

DBTR-20: Gyozo Gidofalvi, Xuegang Huang, and Torben Bach Pedersen, Privacy-Preserving Data Mining on Moving Object Trajectories

The popularity of embedded positioning technologies in mobile devices and the development of mobile communication technology have paved the way for powerful location-based services (LBSs). To make LBSs useful and user-friendly, heavy use is made of context information, including patterns in user location data which are extracted by data mining methods. However, there is a potential conflict of interest: the data mining methods want as precise data as possible, while the users want to protect their privacy by not disclosing their exact movements. This paper aims to resolve this conflict by proposing a general framework that allows user location data to be anonymized, thus preserving privacy, while still allowing interesting patterns to be discovered. The framework allows users to specify individual desired levels of privacy that the data collection and mining system will then meet. Privacy-preserving methods are proposed for two core data mining tasks, namely finding dense spatiotemporal regions and finding frequent routes. An extensive set of experiments evaluate the methods, comparing them to their non-privacy-preserving equivalents. The experiments show that the framework still allows most patterns to be found, even when privacy is preserved.
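
One simple instance of the general idea, trading location precision for privacy while keeping patterns discoverable, is generalizing positions to grid cells whose size reflects the desired privacy level and then mining the generalized data. A toy sketch with invented samples; the paper's anonymization methods and pattern definitions are considerably more involved.

```python
# Toy sketch: anonymize GPS samples to grid cells (cell size = privacy level),
# then find dense spatio-temporal regions on the anonymized data.

def anonymize(samples, cell_size, time_slot):
    """(user, t, x, y) -> (user, time slot, cell); exact positions are dropped."""
    return [(u, t // time_slot, (int(x // cell_size), int(y // cell_size)))
            for u, t, x, y in samples]

def dense_regions(anonymized, min_users):
    """Cells visited by at least `min_users` distinct users in a time slot."""
    users = {}
    for u, slot, cell in anonymized:
        users.setdefault((slot, cell), set()).add(u)
    return {k: len(v) for k, v in users.items() if len(v) >= min_users}

# Invented samples: three users near the same square around t = 100 s.
samples = [("u1", 100, 120.0, 80.0), ("u2", 110, 130.0, 90.0),
           ("u3", 105, 140.0, 70.0), ("u1", 400, 900.0, 900.0)]
print(dense_regions(anonymize(samples, cell_size=100.0, time_slot=300), min_users=3))
# {(0, (1, 0)): 3}  -- a dense region found without exact positions
```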

DBTR-21: Xuegang Huang and Hua Lu, Snapshot Density Queries on Location Sensors

Density queries are of practical importance in many mobility-related applications, including traffic monitoring. Previous work has so far assumed a client/server architecture for answering such queries. In this paper, in contrast, carefully constructed sensor networks, which consist of both lightweight location sensors and more powerful processing nodes, are proposed to answer density queries. We assume that the location sensors are placed in a geographical area and can only detect the number of objects moving in their vicinity. This paper focuses on estimating the dense regions of objects. Our approach partitions the region of interest into subregions, and deploys in each subregion both sensor nodes detecting moving objects and a processing node issuing and answering density queries. Three algorithms, CF, VCF, and GF, are proposed to efficiently process density queries on these processing nodes. The findings of an extensive empirical evaluation are threefold. First, our solution is effective, as the accuracy achieved is acceptably high. Second, our solution is efficient, as it incurs little CPU time. Third, different query processing algorithms suit different scenarios with specific environment settings.

DBTR-22: Kostas Tzoumas, Timos Sellis, and Christian S. Jensen, A Reinforcement Learning Approach for Adaptive Query Processing

In adaptive query processing, query plans are improved at runtime by means of feedback. In the very flexible approach based on so-called eddies, query execution is treated as a process of routing tuples to the query operators that combine to compute a query. This makes it possible to alter query plans at the granularity of tuples. Further, the complex task of searching the query plan space for a suitable plan now resides in the routing policies used. These policies must adapt to the changing execution environment and must converge at a near-optimal plan when the environment stabilizes. This paper advances adaptive query processing in two respects. First, it proposes a general framework for the routing problem that may serve the same role for adaptive query processing as does the framework of search in query plan space for conventional query processing. It thus offers an improved foundation for research in adaptive query processing. The framework leverages reinforcement learning theory and formalizes a tuple routing policy as a mapping from a state space to an action space, capturing query semantics as well as routing constraints. In effect, the framework transforms query optimization from a search problem in query plan space to an unsupervised learning problem with quantitative rewards that is tightly coupled with the query execution. The framework covers selection queries as well as joins that use all proposed join execution mechanisms (SHJs, SteMs, STAIRs). Second, in addition to showing how existing routing policies can fit into the framework, the paper demonstrates new routing policies that build on advances in reinforcement learning. By means of empirical studies, it is shown that the proposed policies embody the desired adaptivity and convergence characteristics, and that they are capable of clearly outperforming existing policies.
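
The learning view of routing can be illustrated with a toy eddy over selection operators: the remaining operators form the state, the choice of the next operator is the action, and dropping a tuple early yields the reward. The sketch below uses a plain epsilon-greedy policy over estimated drop rates; it conveys the flavor of the framework, not the policies proposed in the paper.

```python
# Toy eddy: route each tuple through selection operators in an order chosen
# by an epsilon-greedy policy that learns which operators drop tuples early.
import random

random.seed(1)

# Invented selections with different (unknown to the router) selectivities.
operators = {
    "sigma_a": lambda t: t["a"] < 0.9,     # passes ~90% of tuples
    "sigma_b": lambda t: t["b"] < 0.2,     # passes ~20% -> should be routed to first
    "sigma_c": lambda t: t["c"] < 0.5,     # passes ~50%
}

drop = {op: 1.0 for op in operators}       # optimistic priors for the estimates
seen = {op: 2.0 for op in operators}
EPSILON, work = 0.1, 0

def next_operator(remaining):
    if random.random() < EPSILON:                                # explore
        return random.choice(sorted(remaining))
    return max(remaining, key=lambda op: drop[op] / seen[op])    # exploit

for _ in range(20_000):
    tup = {"a": random.random(), "b": random.random(), "c": random.random()}
    remaining = set(operators)
    while remaining:
        op = next_operator(remaining)
        work += 1                                # one operator invocation
        passed = operators[op](tup)
        seen[op] += 1
        drop[op] += 0 if passed else 1           # reward: the tuple was eliminated
        if not passed:
            break                                # tuple dropped; stop routing it
        remaining.discard(op)

print("estimated drop rates:",
      {op: round(drop[op] / seen[op], 2) for op in operators}, "work:", work)
# The policy converges to routing tuples to sigma_b first, minimizing work.
```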

DBTR-23: Christian Thomsen and Torben Bach Pedersen, A Survey of Open Source Tools for Business Intelligence

The industrial use of open source Business Intelligence (BI) tools is becoming more common, but is still not as widespread as for other types of software. It is therefore of interest to explore which possibilities are available for open source BI and compare the tools. In this survey paper, we consider the capabilities of a number of open source tools for BI. In the paper, we consider a number of Extract‐Transform‐Load (ETL) tools, database management systems (DBMSs), On‐Line Analytical Processing (OLAP) servers, and OLAP clients. We find that, unlike the situation a few years ago, there now exist mature and powerful tools in all these categories. However, the functionality still falls somewhat short of that found in commercial tools.

DBTR-24: Morten Middelfart and Torben Bach Pedersen, Discovering Sentinel Rules for Business Intelligence

This paper proposes the concept of sentinel rules for multi-dimensional data that warn users when measure data concerning the external environment changes. For instance, a surge in negative blogging about a company could trigger a sentinel rule warning that revenue will decrease within two months, so a new course of action can be taken. Hereby, we expand the window of opportunity for organizations and facilitate successful navigation even though the world behaves chaotically. Since sentinel rules are at the schema level as opposed to the data level, and operate on data changes as opposed to absolute data values, we are able to discover strong and useful sentinel rules that would otherwise be hidden when using sequential pattern mining or correlation techniques. We present a method for sentinel rule discovery and an implementation of this method that scales linearly on large data volumes.
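
The schema-level, change-based nature of sentinel rules can be illustrated with a toy miner that checks how often a change in a source measure is followed, a fixed number of periods later, by a change in a target measure (invented data and thresholds; not the paper's discovery method).

```python
# Toy sketch: score a change-based rule of the form
# "if the source measure increases, the target measure decreases `lag` months later".

def changes(series):
    """Map a time series to +1/0/-1 change indications per period."""
    return [(0 if b == a else (1 if b > a else -1)) for a, b in zip(series, series[1:])]

def confidence(source, target, src_dir, tgt_dir, lag):
    src, tgt = changes(source), changes(target)
    hits = total = 0
    for i, s in enumerate(src):
        if s == src_dir and i + lag < len(tgt):
            total += 1
            hits += tgt[i + lag] == tgt_dir
    return hits / total if total else 0.0

# Invented monthly measures: negative blogging (count) and revenue (kEUR).
negative_blogs = [10, 30, 30, 30, 50, 50, 80, 80, 80, 80]
revenue        = [90, 90, 90, 75, 90, 90, 70, 90, 50, 50]

conf = confidence(negative_blogs, revenue, src_dir=+1, tgt_dir=-1, lag=2)
print(f"'blogging surge => revenue drop two months later': confidence {conf:.2f}")
# 1.00 on this invented data; the paper mines such rules at scale.
```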

DBTR-25: Christian Thomsen and Torben Bach Pedersen, pygrametl: A Powerful Programming Framework for Extract-Transform-Load Programmers

Extract-Transform-Load (ETL) processes are used for extracting data, transforming it, and loading it into data warehouses (DWs). Many tools for creating ETL processes exist. The dominating tools all use graphical user interfaces (GUIs) where the developer visually defines the data flow and operations. In this paper, we challenge this approach and propose to do ETL programming by writing code. To make the programming easy, we present the (Python-based) framework pygrametl which offers commonly used functionality for ETL development. By using the framework, the developer can efficiently create effective ETL solutions in which the full power of programming can be exploited. Our experiments show that when pygrametl is used, both the development time and running time are short compared to an existing GUI-based tool.
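
A sketch in the style of pygrametl's documented usage is shown below; the table, column, and file names are invented, the target tables are assumed to exist in the data warehouse, and details may deviate from the current API.

```python
# Sketch of code-based ETL with pygrametl (invented names; target tables
# are assumed to exist; any PEP 249 connection can be used).
import sqlite3
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import Dimension, FactTable

dwconn = sqlite3.connect("dw.db")
connection = pygrametl.ConnectionWrapper(connection=dwconn)

productdim = Dimension(name="product", key="productid",
                       attributes=["name", "category"], lookupatts=["name"])
salesfact = FactTable(name="sales", keyrefs=["productid"], measures=["amount"])

# The full power of the host language is available for transformations:
def cleanse(row):
    row["amount"] = float(row["amount"])
    row["category"] = row["category"].strip().lower()
    return row

# Assumed CSV columns: name, category, amount.
for row in CSVSource(open("sales.csv"), delimiter=","):
    row = cleanse(row)
    row["productid"] = productdim.ensure(row)   # look up or insert the member
    salesfact.insert(row)
connection.commit()
```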

DBTR-26: Darius Šidlauskas, Simonas Šaltenis, Christian W. Christiansen, Jan M. Johansen, and Donatas Šaulys, Trees or Grids? Indexing Moving Objects in Main Memory

New application areas, such as location-based services, rely on the efficient management of large collections of mobile objects. Maintaining accurate, up-to-date positions of these objects results in massive update loads that must be supported by spatial indexing structures, and main-memory indexes are usually necessary to provide sufficiently high update performance. Traditionally, the R-tree and its variants were used for indexing spatial data, but most of the recent research assumes that a simple, uniform grid is the best choice for managing moving objects in main memory.
We perform an extensive experimental study to compare the two approaches on modern hardware. As the result of numerous design-and-experiment iterations, we propose update- and query-efficient variants of the R-tree and the grid. The experiments with these indexes reveal a number of interesting insights. First, the coupling of a spatial index, grid or R-tree, with a secondary index on object IDs boosts update performance significantly. Next, the R-tree, when combined with such a secondary index, can provide update performance competitive with that of the grid. Finally, the grid can compete with the R-tree in terms of query performance, and it is surprisingly robust to varying workload parameters. In summary, the study shows that, in most cases, the choice of index boils down to issues such as the ease of implementation or the support for spatially extended objects.
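
The study's central structural point, that coupling the spatial index with a secondary index on object IDs makes updates cheap because the old entry can be located directly, can be illustrated with a minimal uniform grid (invented parameters, single-threaded, and far simpler than the evaluated indexes).

```python
# Minimal uniform grid over the unit square with a secondary index on object
# IDs, so an update touches at most two cells instead of searching the grid.
from collections import defaultdict

class Grid:
    def __init__(self, cells_per_side):
        self.n = cells_per_side
        self.cells = defaultdict(dict)      # (cx, cy) -> {oid: (x, y)}
        self.oid_to_cell = {}               # secondary index on object IDs

    def _cell(self, x, y):
        return (min(int(x * self.n), self.n - 1), min(int(y * self.n), self.n - 1))

    def update(self, oid, x, y):
        new = self._cell(x, y)
        old = self.oid_to_cell.get(oid)
        if old is not None and old != new:
            del self.cells[old][oid]        # old entry found directly; no spatial search
        self.cells[new][oid] = (x, y)
        self.oid_to_cell[oid] = new

    def range_query(self, xmin, ymin, xmax, ymax):
        (cx1, cy1), (cx2, cy2) = self._cell(xmin, ymin), self._cell(xmax, ymax)
        hits = []
        for cx in range(cx1, cx2 + 1):
            for cy in range(cy1, cy2 + 1):
                for oid, (x, y) in self.cells[(cx, cy)].items():
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(oid)
        return hits

g = Grid(cells_per_side=100)
g.update(7, 0.110, 0.52)
g.update(7, 0.115, 0.52)                    # stays in its cell: a pure in-place update
print(g.range_query(0.1, 0.5, 0.2, 0.6))    # [7]
```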

DBTR-27: Hua Lu, Christian S. Jensen, and Zhenjie Zhang, Skyline Ordering: A Flexible Framework for Efficient Resolution of Size Constraints on Skyline Queries

Given a set of multi-dimensional points, a skyline query returns the interesting points that are not dominated by other points. It has been observed that the actual cardinality (s) of a skyline query result may differ substantially from the desired result cardinality (k), which has prompted studies on how to reduce s for the case where k < s.
This paper goes further by addressing the general case where the relationship between k and s is not known beforehand. Due to their complexity, the existing pointwise ranking and set-wide maximization techniques are not well suited for this problem. Moreover, the former often incurs too many ties in its ranking, and the latter is inapplicable for k > s. Based on these observations, the paper proposes a new approach, called skyline ordering, that forms a skyline-based partitioning of a given data set, such that an order exists among the partitions. Then set-wide maximization techniques may be applied within each partition. Efficient algorithms are developed for skyline ordering and for resolving size constraints using the skyline order. The results of extensive experiments show that skyline ordering yields a flexible framework for the efficient and scalable resolution of arbitrary size constraints on skyline queries.
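
The skyline-based partitioning can be sketched by repeatedly peeling off the skyline of the remaining points; resolving a size constraint k then takes whole partitions while they fit and selects within the last, partially used one. The naive quadratic skyline and the trivial within-partition pick below only illustrate the structure, not the paper's algorithms.

```python
# Sketch: skyline ordering by repeatedly peeling off the skyline (smaller is
# better in every dimension), then resolving a size constraint k layer by layer.

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

def skyline_order(points):
    layers, rest = [], list(points)
    while rest:
        layer = skyline(rest)
        layers.append(layer)
        rest = [p for p in rest if p not in layer]
    return layers

def resolve(points, k):
    """Return exactly k points: whole layers first, then a simple pick
    (here: by coordinate sum) inside the partially used layer."""
    result = []
    for layer in skyline_order(points):
        if len(result) + len(layer) <= k:
            result += layer
        else:
            result += sorted(layer, key=sum)[: k - len(result)]
            break
    return result

pts = [(1, 9), (2, 4), (3, 3), (5, 1), (4, 6), (6, 5), (7, 7), (8, 2)]
print([len(l) for l in skyline_order(pts)])   # layer sizes: [4, 3, 1]
print(resolve(pts, 6))                        # 6 points even though |skyline| != 6
```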

DBTR-28: Man Lung Yiu, Ira Assent, Christian S. Jensen, and Panos Kalnis, Outsourced Similarity Search on Metric Data Assets

This paper considers a cloud computing setting in which similarity querying of metric data is outsourced to a service provider. The data is to be revealed only to trusted users, not to the service provider or anyone else. Users query the server for the most similar data objects to a query example. Outsourcing offers the data owner scalability and a low initial investment. The need for privacy may be due to the data being sensitive (e.g., in medicine), valuable (e.g., in astronomy), or otherwise confidential. Given this setting, the paper presents techniques that transform the data prior to supplying it to the service provider for similarity queries on the transformed data. Our techniques provide interesting trade-offs between query cost and accuracy. They are then further extended to offer two intuitive privacy guarantees. Empirical studies with real data demonstrate that the techniques are capable of offering privacy while enabling efficient and accurate processing of similarity queries.

DBTR-29: Xiufeng Liu, Christian Thomsen, and Torben Bach Pedersen, ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This report presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The report describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.

DBTR-30: Sari Haj Hussein, Hua Lu, and Torben Bach Pedersen, Unified Modeling and Reasoning in Constrained Outdoor and Indoor Spaces

Geographic information systems have traditionally dealt only with outdoor spaces. In recent years, indoor spatial information systems have started to attract attention, partly due to the increasing use of receptor devices (e.g., RFID readers or wireless sensor networks) in both outdoor and indoor spaces. Applications that employ these devices are expected to span uniformly across, and supply seamless functionality in, both outdoor and indoor spaces. What makes this impossible is the current absence of a unified account of these two types of spaces, both in terms of modeling and in terms of reasoning about the models. This paper presents a unified model of outdoor and indoor spaces and of receptor deployments in these spaces. The model is expressive, flexible, and invariant to the segmentation of a space plan and the receptor deployment policy. It focuses on partially constrained outdoor and indoor motion. On top of this model, a clear representation of routes is honed, and a powerful route observability function is derived. The model facilitates probabilistic incorporation of receptor data through a probabilistic trajectory-to-route translator that enables high-level reasoning about points of potential traffic (over)load in outdoor and indoor spaces, so-called bottleneck points. An experimental evaluation corroborates the accuracy of the translator and the sensibleness of the reasoning when applied to synthetic data and, importantly, to uncleansed, real-world receptor data obtained from tracking RFID-tagged flight baggage.

DBTR-30-Rev: Sari Haj Hussein, Hua Lu, and Torben Bach Pedersen, Unified Modeling and Reasoning in Constrained Outdoor and Indoor Spaces

In recent years, indoor spatial data management has started to attract attention, partly due to the increasing use of receptor devices (e.g., RFID readers and wireless sensor networks) in both outdoor and indoor spaces. Applications that employ these devices are expected to span uniformly across, and supply seamless functionality in, both outdoor and indoor spaces. What makes this impossible is the current absence of a unified account of these two types of spaces, both in terms of modeling and in terms of reasoning about the models. This paper reviews and extends a recent unified model of outdoor and indoor spaces and of receptor deployments in these spaces. The extended model enables modelers to capture various pieces of information from the physical world. On top of the extended model, this paper hones the route observability concept, derives its powerful, bounded, information-theoretic function, and demonstrates its usefulness in enhancing the reading environment. Additionally, this paper establishes a conclusive relation between route observability and the uncertainty in tracking moving objects. The extended model enables the incorporation of receptor data through a probabilistic trajectory-to-route translator. This translator first facilitates the tracking of moving objects, enabling the search for them to be optimized, and second permits high-level reasoning about points of potential traffic (over)load in outdoor and indoor spaces, so-called bottleneck points. A functional analysis illustrates the behavior of the route observability function. An experimental evaluation then corroborates the competitive accuracy of the translator, the high quality of the inference, and the sensibleness of the reasoning when applied to synthetic data and to uncleansed, real-world data obtained from tracking RFID-tagged flight baggage.

DBTR-31: Xiufeng Liu, Christian Thomsen, and Torben Bach Pedersen, CloudETL: Scalable Dimensional ETL for Hadoop and Hive

Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce, a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as an RDBMS-like system for DWs and provides good and scalable analytical features. It is, however, still challenging to do proper dimensional ETL processing with (relational) Hive; for example, the concept of slowly changing dimensions (SCDs) is not supported (and due to the lack of support for UPDATEs, SCDs are complex and hard to handle manually). To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about the technical details of MapReduce. CloudETL provides built-in support for different dimensional concepts, including star schemas and SCDs. In the paper, we present how CloudETL works. We present different performance optimizations, including a purpose-specific data placement policy that co-locates data. Further, we present an extensive performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and significantly outperforms the dimensional ETL capabilities of Hive with respect to both performance and programmer productivity. For example, in one experiment, Hive takes 3.9 times as long to load an SCD and needs 112 statements, while CloudETL needs only 4.

DBTR-32: Ove Andersen and Kristian Torp, An Open-Source ITS Platform

In this report, a complete system used to compute travel times from GPS data is described. Two approaches to computing travel times are proposed: one based on points and one based on trips. Overall, both approaches give reasonable results compared to existing, manually estimated travel times. However, the trip-based approach requires more GPS data, and of higher quality, than the point-based approach. The system has been completely implemented using open-source software and is in production. A detailed performance study, using a desktop PC, shows that the system can handle large data sizes and that the performance scales, for some components, linearly with the number of processor cores available. The main conclusion is that large quantities of GPS data can, with a very limited budget, be used for estimating travel times, if enough GPS data is available.
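
The two approaches can be contrasted on a single road segment: the point-based approach divides the segment length by the average of the instantaneous GPS speeds matched to the segment, while the trip-based approach averages the elapsed time between entering and leaving the segment. A toy sketch with invented numbers; map matching is assumed to have been done.

```python
# Toy contrast of point-based vs. trip-based travel-time estimation for one
# road segment (invented GPS data; map matching is assumed to have been done).

SEGMENT_LENGTH = 500.0          # metres

# Point-based input: speeds (m/s) of GPS points matched to the segment.
point_speeds = [12.0, 14.0, 13.0, 6.0, 15.0]

# Trip-based input: (entry time, exit time) in seconds for trips that
# traversed the whole segment.
trips = [(0.0, 45.0), (100.0, 130.0), (300.0, 360.0)]

point_based = SEGMENT_LENGTH / (sum(point_speeds) / len(point_speeds))
trip_based = sum(t_out - t_in for t_in, t_out in trips) / len(trips)

print(f"point-based estimate: {point_based:.1f} s, trip-based: {trip_based:.1f} s")
# The trip-based estimate needs complete traversals (more, higher-quality GPS
# data), but it reflects stops and turn delays that point speeds can miss.
```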

DBTR-33: Xike Xie, Man Lung Yiu, Reynold Cheng, and Hua Lu, Trajectory Possible Nearest Neighbor Queries over Imprecise Location Data

Trajectory queries, which retrieve nearby objects for every point of a given route, can be used to identify alerts of potential threats along a vessel route, or to monitor the rescuers adjacent to a travel path. However, the locations of these objects (e.g., threats, rescuers) may not be obtained precisely due to hardware limitations of measuring devices, as well as the complex nature of the surroundings. For such data, we consider a common model where the possible locations of an object are bounded by a closed region, called the “imprecise region”. Ignoring or coarsely wrapping imprecision can lower query quality and cause undesirable consequences such as missed threat alerts and poor rescue response times. Also, the query is quite time-consuming, since all points on the trajectory are considered. In this paper, we study how to efficiently evaluate trajectory queries over imprecise objects by proposing a novel concept, the u-bisector, which is an extension of the bisector to imprecise data. Based on the u-bisector, we provide an efficient and versatile solution which supports different shapes of commonly used imprecise regions (e.g., rectangles, circles, and line segments). Extensive experiments on real datasets show that our proposal achieves better efficiency, quality, and scalability than its competitors.

DBTR-34: Ove Andersen, Benjamin B. Krogh, and Kristian Torp, Analyse af elbilers forbrug (in Danish)


DBTR-35: Alex B. Andersen, Nurefşan Gür, Katja Hose, Kim A. Jakobsen, and Torben Bach Pedersen, Publishing Danish Agricultural Government Data as Semantic Web Data

Recent advances in Semantic Web technologies have led to a growing popularity of the (Linked) Open Data movement. Only recently, the Danish government has joined the movement and published several data sets, formerly only accessible for a fee, as Open Data in various formats, such as CSV and text files. These raw data sets are difficult to process automatically and to combine with other data sources on the Web. Hence, our goal is to convert such data into RDF and make it available to a broader range of users and applications as Linked Open Data. In this paper, we discuss our experiences based on the particularly interesting use case of agricultural data, as agriculture is one of the most important industries in Denmark. We describe the process of converting the data and discuss the particular problems that we encountered with respect to the considered data sets. We additionally evaluate our result based on several queries that could not be answered based on existing sources before.
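
A minimal flavour of such a conversion is sketched below using the rdflib library; the namespace, predicates, and field layout are invented and do not reflect the actual data sets or the vocabulary chosen in the paper.

```python
# Minimal CSV-to-RDF sketch with rdflib; the namespace, predicates, and field
# layout are invented and do not reflect the actual agricultural data sets.
import csv
import io
from rdflib import Graph, Literal, Namespace, RDF, URIRef

AGRI = Namespace("http://example.org/agriculture/")

raw = io.StringIO("field_id,crop,area_ha\n101,barley,12.5\n102,wheat,30.0\n")

g = Graph()
for row in csv.DictReader(raw):
    field = URIRef(AGRI + "field/" + row["field_id"])
    g.add((field, RDF.type, AGRI.Field))
    g.add((field, AGRI.crop, Literal(row["crop"])))
    g.add((field, AGRI.areaInHectares, Literal(float(row["area_ha"]))))

print(g.serialize(format="turtle"))
# Once published as Linked Open Data, the triples can be combined with other
# sources and queried with SPARQL.
```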

DBTR-36: Ove Andersen, Benjamin B. Krogh, and Kristian Torp, Analyse af elbilers forbrug for perioden 2012-2013 (in Danish)

DBTR-37: Jovan Varga, Ekaterina Dobrokhotova, Oscar Romero, Torben Bach Pedersen, and Christian Thomsen, SM4MQ: A Semantic Model for Multidimensional Queries

On-Line Analytical Processing (OLAP) is a data analysis approach to support decision-making. On top of that, Exploratory OLAP is a novel initiative for the convergence of OLAP and the Semantic Web (SW) that enables the use of OLAP techniques on SW data. Moreover, OLAP approaches exploit different metadata artifacts (e.g., queries) to assist users with the analysis. However, modeling and sharing of most of these artifacts are typically overlooked. Thus, in this paper we focus on the query metadata artifact in the Exploratory OLAP context and propose an RDF-based vocabulary for its representation, sharing, and reuse on the SW. As OLAP is based on the underlying multidimensional (MD) data model, we denote such queries as MD queries and define SM4MQ: A Semantic Model for Multidimensional Queries. Furthermore, we propose a method to automate the exploitation of queries by means of SPARQL. We apply the method to a use case of transforming queries from SM4MQ to a vector representation. For the use case, we developed a prototype and performed an evaluation that shows how our approach can significantly ease and support user assistance tasks such as query recommendation.

DBTR-38: Emmanouil Valsomatzis, Torben Bach Pedersen, and Alberto Abello, Trading Aggregated Flex-Offers via Flexible Orders

Flexibility of small loads, in particular from Electric Vehicles (EVs), has recently attracted a lot of interest due to the possibility of participating in the energy market and the resulting new commercial potential. Unlike existing works, the aggregation techniques proposed in this paper produce flexible aggregated loads from EVs while taking technical market requirements into account. The produced aggregated flexible loads fulfill the energy market requirements and can be further transformed into so-called flexible orders and traded in the day-ahead market by a Balance Responsible Party (BRP). As a result, the BRP achieves more than a 27% cost reduction in energy purchases based on 2016 real electricity prices.