Research and Development at WorDS

                                      A problem well stated is a problem half solved.

                                                                                   --Charles Kettering

Highlights

  • The Kepler Workflow Environment collaboratively developed by the WorDS team has been the most cited (according to Web of Science) and one of the most well-adopted and influential scientific workflow products with hundreds of thousands of downloads.
  • Innovative techniques developed by the WorDS team on utilization of distributed data parallel (DDP) programming patterns as a programming model for big data scientific workflows are a first in applying such techniques to scientific workflow development.
  • Long-term leadership of the WorDS team on application of diverse distributed computing techniques to scientific workflows within the Kepler environment has resulted in a number of international collaborations with modules contributed by the partners as well as the WorDS team.
  • Our provenance work for modeling and utilization of provenance of e-science applications was influential in the discussions for producing standards for provenance.

The research and development efforts at the WorDS Center is built around application-driven computing to advance computational practices involving process management and scientific workflows. To achieve this goal, our current research encompasses diverse areas of scientific computing and data science at the intersection of scientific workflows, provenance, distributed computing ranging across clusters to clouds, bioinformatics, observatory systems, conceptual data querying, and software modeling. These sub-categories of our research are explained in more detailed at this page. 

Distributed Data-Parallel (DDP) Task Execution

DDP patterns such as MapReduce have become increasingly popular for building  Big Data applications. We integrated Map, Reduce and other Big Data programing patterns, such as Match, CoGroup and Cross, into the Kepler Scientific Workflow System to execute Big Data workflows on top of Hadoop and Stratosphere systems. The advantages of the integration include: (i) its heterogeneous nature in which Big Data programming patterns are placed as part of other workflow tasks; (ii) its visual programming approach that does not require scripting of Big Data patterns; (iii) its adaptability for execution of distributed data-parallel applications on different execution engines; (iv) our analysis on how to get the best application performances by configuring different factors, including selected data parallel pattern and data parallel number.

Data and Process Provenance for Reproducible Science

A major benefit of using scientific workflow systems is the ability to make provenance collection a part of the workflow. Such provenance includes not only the standard data lineage information but also information about the context in which the workflow was used, execution that processed the data, and the evolution of the workflow design. We propose a framework for data and process provenance in the Kepler Scientific Workflow System. Based on the requirements and issues related to data and workflow provenance in a multi-disciplinary workflow system, we introduce how generic provenance capture can be facilitated in Kepler’s actor-oriented workflow environment. We describe the usage of the stored provenance information for efficient rerun of scientific workflows. We also study provenance of Big Data workflows and its effects on workflow execution performance by recording the provenance of MapReduce workflows.

Workflow Fault Tolerance

We build fault tolerance framework for Kepler-based scientific workflows to support implementation of a range of comprehensive failure coverage and recovery options. The framework is divided into three major components: (i) a general contingency Kepler actor that provides a recovery block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components, and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. This framework takes advantage of the provenance data collected by the Kepler-based workflows to detect failures and help in fault-tolerance decision making.

Workflow as a Service (WFaaS)

We build WFaaS service model to support workflow publishing, query and execution in the Cloud. WFaaS enables composition of multiple software packages and services based on the logic encapsulated within a workflow service. WFaaS facilitates a service and management environment for flexible application integration via workflows. Utilizing other cloud services, such as IaaS and DaaS, WFaaS provides interfaces to get proper data, software packages, and VMs for workflow execution. Our WFaaS architecture allows for management of multiple concurrently running workflows and virtual machines in the Cloud. Based on this architecture, we research and benchmark heuristic workflow scheduling algorithms for optimized workflow execution in the Cloud. For different application scenarios, we  experiment with multiple algorithms to find proper configurations for reduced cost and increase price/performance ratio without affecting much performance.

Sensor Data Management and Streaming Data Analysis

Sensor networks are increasingly being deployed to create field-based environmental observatories. As the size and complexity of these networks increase, many challenges arise including monitoring and controlling sensor devices, archiving large volumes of continuously generated data, and the management of heterogeneous hardware devices. The Kepler Sensor Platform (SensorView), an open-source, vender- neutral extension to a scientific workflow system for full-lifecycle management of sensor networks. This extension addresses many of the challenges that arise from sensor site management by providing a suite of tools for monitoring and controlling deployed sensors, as well as for sensor data analysis, modeling, visualization, documentation, archival, and retrieval. An integrated scheduler interface has been developed allowing users to schedule workflows for periodic execution on remote servers. We evaluate the scalability of periodically executed sensor archiving workflows that automatically download, document, and archive data from a sensor site.

Distributed Execution of Scientific Workflows

To facilitate scientific workflow execution in distributed environments including HPC Cluster, Grid and Cloud, we develop a high-level distributed execution framework for dataflow/workflow applications, and the mechanisms to make the framework easy-to-use, comprehensive, adaptable, extensible, and efficient. To apply and refine the high-level framework based on different execution environments or requirement characteristics, we propose several concrete distributed architectures/approaches. The first one is the distributed data-parallel architecture explained above. The second one, named Master-Slave, can accelerate parameter sweep workflows by utilizing ad-hoc network computing resources. For a typical parameter sweep workflow, this architecture can realize concurrent independent sub-workflow executions with minimal user configuration, allowing large gains in productivity with little of the typical overhead associated with learning distributed computing systems. The third one is to facilitate e-Science discovery using scientific workflows in Grid environments. It presents a scientific workflow based approach to facilitate computational chemistry processes in the Grid execution environment by providing features such as resource consolidation, task parallelism, provenance tracking, fault tolerance and workflow reuse. The fourth one is to balance compute resource load for data-parallel workflow execution in Cluster and Cloud environments. It discusses how to interact with traditional cluster resource managers like Oracle Grid Engine (formerly SGE) and recent ones like Hadoop using a dataflow-based scheduling approach to balance physical/virtual compute resource load for data-parallel workflow executions.