Stratosphere Team Visit

Venue: SDSC Meeting Room 145-E (Please see http://www.sdsc.edu/about_sdsc/visitor_info.html for directions.)

Date: October 4th, 2012

Attendees:

Alexander Alexandrov (TU Berlin)
Ilkay Altintas (UCSD)
Thomas Bodner (TU Berlin)
Daniel Crawl (UCSD)
Alin Deutsch (UCSD)
Stephan Ewen (TU Berlin)
Amarnath Gupta (UCSD)
Volker Markl (TU Berlin)
Yannis Papakonstantinou (UCSD)
Kostas Tzoumas (TU Berlin)
Jianwu Wang (UCSD)

Agenda:

9:00: Introductions
9:10: Goals for the day (Ilkay Altintas)
9:20: Introduction (Volker Markl)
9:30: Query Optimization with MapReduce Functions (Kostas Tzoumas)
10:00: Spinning Fast Iterative Data Flows (Stephan Ewen)
10:30: Break
10:45: Generating a Myriad of Atoms in the Blink of an Eye (Alexander Alexandrov)
11:15: A Taxonomy of Platforms for Analytics on Big Data (Thomas Bodner)
11:45: Group Discussion
12:15: Lunch Break
13:30: bioKepler and Disributed Data Parallel (DDP) Framework in Kepler (Ilkay Altintas)
14:00: Using Stratosphere for DDP Workflows in Kepler + Short Demo (Daniel Crawl)
14:30: Data-Intensive Bioinformatics Workflows using DDP Framework + Short Demo (Jianwu Wang)
15:00: Coffee Break
15:30: Group Discussion
17:00: Workshop adjourns


Introduction (Volker Markl, 5min)

Bio: http://www.dima.tu-berlin.de/menue/mitarbeiter/volker_markl/


Part 1 (Kostas Tzoumas, 25min):

Title: Query Optimization with MapReduce Functions

Abstract: Many systems for big data analytics employ a data flow programming abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators using a shallow static code analysis pass based on reverse data and control flow analysis over the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today's systems.

Bio: Kostas Tzoumas is a postdoctoral researcher co-leading the Stratosphere research project at the Technische Universität Berlin. He received his PhD from Aalborg University in 2011 with a thesis on discovering and exploiting correlations for query optimization. He was a visiting researcher at the University of Maryland, College Park, and an intern at Microsoft Research. He received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens in 2007. His research interests are centered around systems for data analytics, including query processing and optimization in massively parallel environments.


Part 2 (Stephan Ewen, 25min):

Title: Spinning Fast Iterative Data Flows

Abstract: Parallel data flow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel data flow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate "incremental iterations", a form of workset iterations, with parallel data flows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the "sparse computational dependencies" inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified data flow abstraction.

Bio: Stephan Ewen is a research associate at the department for Database Systems and Information Management (DIMA) at the Technische Universität Berlin. He is working on the Stratosphere Project that aims at creating a versatile and efficient analytics engine for deep analysis of Big Data on cloud platforms. Within the project, Stephan works on the system's data flow programming abstraction, the data flow optimization and the parallel runtime system. Prior to joining the DIMA group, Stephan completed the “Applied Computer Science” program at the University of Cooperative Education Stuttgart jointly with IBM Germany and got his Diploma from the University of Stuttgart. In the course of his studies, Stephan Ewen worked, among others, for the IBM Almaden Research Centre and the IBM Development Laboratory Böblingen.


Part 3 (Alexander Alexandrov, 25min):

Title: Generating a Myriad of Atoms in the Blink of an Eye

Abstract: Data from real-world applications is regarded as the golden standard for database systems evaluation. Unfortunately, finding appropriate real-world datasets is often hard due to various privacy-related constraints. To overcome this problem, we developed the Myriad Parallel Data Generator Toolkit - a generic toolkit for declarative specification of synthetic data generators that provides built-in parallelization support for the specified data generation programs. In this talk, I will motivate and present the main technical challenges solved by the highly-parallel execution model of the Myriad Toolkit. In addition, to demonstrate the usability of the toolkit, I will also give a brief overview of the supported data generator specification syntax and explain how different statistical constraints for the generated data can be implemented using the appropriate combination of specification routines.

Bio: Alexander Alexandrov is a research associate at the Database Systems and Information Management research group at the Technische Universität Berlin. Before moving to Berlin for a Master in Computer Science at TU Berlin, he received his Bachelor of Science in Software and Internet Technologies at the University of Mannheim. Alexander has been working on the Stratosphere project both as student and research assistant since 2009. His research interests include data generation, evaluation, and query optimization for large-scale parallel batch processing systems with partial operator semantics.


Part 4 (Thomas Bodner, 25min):

Title: A Taxonomy of Platforms for Analytics on Big Data

Abstract: Within the past few years, industrial and academic organizations designed a wealth of systems for data-intensive analytics including MapReduce, SCOPE/Dryad, ASTERIX, Stratosphere, Spark, and many others. These systems are being applied to new applications from diverse domains other than (traditional) relational OLAP, making it difficult to understand the tradeoffs between them and the workloads for which they were built. We present a taxonomy of existing system stacks based on their architectural components and the design choices made related to data processing and programmability to sort this space. We further demonstrate a web repository for sharing Big Data analytics platform information and use cases. The repository enables researchers and practitioners to store and retrieve data and queries for their use case, and to easily reproduce experiments from others on different platforms, simplifying comparisons.

Bio: Thomas Bodner is a second year Master's student in the computer science department at the Technische Universität Berlin working in the Database Systems and Information Management (DIMA) group on the Stratosphere project. He received his B.S. from the University of Cooperative Education at Stuttgart. In the course of his studies, Thomas Bodner studied abroad at University of California, Irvine and Royal Melbourne Institute of Technology. He worked as an intern at the IBM Almaden Research Center and the IBM Böblingen Laboratory. His research interests include benchmarking of and query optimization for Big Data analytics systems.


Part 5 (Ilkay Altintas, 25min):

Title: bioKepler and Distributed Data Parallel (DDP) Framework in Kepler

Abstract: For enabling bioinformaticians and computational biologists to conduct efficient analysis, there still remains a need for higher-level abstractions on top of scientific workflow systems and distributed computing methods. The bioKepler project is a three year long project that builds scientific workflow components to execute a set of bioinformatics tools using distributed execution patterns. Once customized, these components are executed on multiple distributed platforms including various Cloud and Grid computing platforms. This talk presents the current status of the bioKepler project and a framework for distributed data-parallel execution in the Kepler scientific workflow system that enables users to easily switch between different DDP execution engines.

Bio: http://users.sdsc.edu/~altintas


Part 6 (Daniel Crawl, 25min):

Title: Using Stratosphere for DDP Workflows in Kepler

Abstract: The Kepler DDP framework includes a set of specialized actors that provide an interface to data-parallel patterns. Each DDP actor corresponds to a particular data-parallel pattern, and there are actors for Map, Reduce, Cross, CoGroup, and Match. The semantics of these data-parallel patterns are the same with the input contracts in the PACT programming model. This talk presents the DDP actors and the Stratosphere director in Kepler.

Bio: Daniel Crawl is the lead architect for the overall integration of distributed data parallel (DDP) execution patterns and the Kepler Scientific Workflow System. He conducts research and development of execution patterns, bioActors, and distributed directors. He received his PhD from the University of Colorado in 2006. His research interests include scientific workflows, distributed computing, provenance, and fault-tolerance.


Part 7 (Jianwu Wang, 25min):

Title: Data-Intensive Bioinformatics Workflows using DDP Framework

Abstract: The Kepler DDP framework has been used in bioinformatics. We found many existing bioinformatics tools could be easily parallelized using DDP framework. This talk presents some demo data-intensive bioinformatics workflows using DDP Framework and their experiment data.

Bio: Jianwu Wang is an assistant project scientist at San Diego Supercomputer Center, University of California, San Diego, U.S. He is also an adjunct professor at North China University of Technology, China. He got his Ph.D. degree from Institute of Computing Technology, Chinese Academy of Sciences in 2007. His research interests include Service-Oriented Computing, End-User Programming, Scientific Workflows, Distributed Computing, Data-Intensive Computing. He has published over 30 papers with more than 270 citations. He is associate editor or editorial board member of four international journals, co-chair of two related workshops. He is also program committee member for 20 conferences/workshops, and reviewer of over 10 journals or books.