WorDS of Data Science beginning with P
Generally speaking, provenance is about the origin of data, and it plays an important role in data science. Data science experiments are often developed iteratively, involving multiple executions with different versions of data sources and access to multiple applications and cyberinfrastructure components. The validity and authenticity of most such experiments hinge on the ability to reproduce the results consistently.
In the context of scientific workflows, provenance usually means the lineage and processing history of a data product, together with the record of the processes that led to it. Provenance captures workflow design and execution history. It helps in tracking workflow inputs, outputs, and the points where processes and data intersect, so that experiments can be verified, replayed, and, when possible, reproduced precisely. Provenance also enables comparison between different workflow versions, smart re-runs, and failure recovery.
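As a minimal sketch of what "lineage" means here, the following Python traces a data product back through the processes and inputs that produced it. The file names, process names, and the `PRODUCED_BY` mapping are all hypothetical, invented for illustration; this is not how any particular workflow system stores provenance.

```python
from collections import deque

# Hypothetical provenance records: each data product maps to the process
# that produced it and the inputs that process consumed.
PRODUCED_BY = {
    "plot.png": ("render_plot", ["stats.csv"]),
    "stats.csv": ("compute_stats", ["clean.csv"]),
    "clean.csv": ("filter_rows", ["raw.csv", "filter_config.json"]),
}


def lineage(product, produced_by):
    """Return (process, input, output) steps that led to `product`, nearest first."""
    steps = []
    seen = {product}
    queue = deque([product])
    while queue:
        current = queue.popleft()
        if current not in produced_by:
            continue  # an original input with no recorded producer
        process, inputs = produced_by[current]
        for source in inputs:
            steps.append((process, source, current))
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return steps


for process, source, output in lineage("plot.png", PRODUCED_BY):
    print(f"{output} <- {process} <- {source}")
```

Walking the records backwards from the final product is what makes verification and replay possible: every intermediate file is tied to the process and inputs that created it.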
Kepler Provenance Framework
Kepler, a scientific workflow platform, provides a simple and smart way to keep track of provenance information for complex scientific computational experiments. Kepler Provenance reduces the effort of manually tracking input data sets, order of execution, computational outcomes, and compute infrastructure information, and it provides simple ways to share reproducible experiments within the scientific community.
Kepler provenance can be broadly classified into three categories: Workflow Specification, Workflow Evolution, and Workflow Execution. Workflow Specification captures information such as actors, ports, connections, and parameters. Workflow Evolution tracks the changes made over the development period, such as parameter values that change over time and the addition or removal of actors, ports, and so on. Workflow Execution records the execution history, such as the start and stop of a workflow, individual actor executions, and the data exchanged between actors.
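The three categories can be illustrated with a small Python sketch. The class and function names (`Specification`, `evolution`, `ExecutionLog`) and the actor names are assumptions made up for this example; they are not Kepler's classes or its internal data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Specification:
    """Workflow Specification: actors, port-to-port connections, parameters."""
    actors: frozenset
    connections: frozenset  # (source "actor.port", target "actor.port") pairs
    parameters: tuple       # sorted (name, value) pairs


def make_spec(actors, connections, parameters):
    return Specification(
        frozenset(actors), frozenset(connections), tuple(sorted(parameters.items()))
    )


def evolution(old, new):
    """Workflow Evolution: what changed between two specification versions."""
    old_params, new_params = dict(old.parameters), dict(new.parameters)
    changed = {
        name: (old_params.get(name), new_params.get(name))
        for name in sorted(old_params.keys() | new_params.keys())
        if old_params.get(name) != new_params.get(name)
    }
    return {
        "actors_added": sorted(new.actors - old.actors),
        "actors_removed": sorted(old.actors - new.actors),
        "parameters_changed": changed,
    }


@dataclass
class ExecutionLog:
    """Workflow Execution: an ordered list of (event kind, detail) records."""
    events: list = field(default_factory=list)

    def record(self, kind, detail):
        self.events.append((kind, detail))


# Two hypothetical versions of the same workflow.
v1 = make_spec(["Reader", "Filter"], [("Reader.output", "Filter.input")], {"threshold": 0.5})
v2 = make_spec(
    ["Reader", "Filter", "Plotter"],
    [("Reader.output", "Filter.input"), ("Filter.output", "Plotter.input")],
    {"threshold": 0.8},
)
changes = evolution(v1, v2)

# A hypothetical run of the workflow.
log = ExecutionLog()
log.record("workflow_start", "run 1")
log.record("actor_fired", "Reader")
log.record("token_sent", "Reader.output -> Filter.input")
log.record("workflow_stop", "run 1")

print(changes)
print(log.events)
```

Keeping the specification immutable and deriving evolution as a difference between versions is one possible design; a real system might instead log each edit as it happens.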
Kepler supports recording provenance to different output types, including text, XML, and a SQL database such as MySQL, Postgres, Oracle, or HSQL. The recorded provenance data can be accessed during workflow execution in text or XML format, or explored using SQL queries and tools such as the Workflow Run Manager and the Reporting tool. The Workflow Run Manager helps to display and search past executions, and the Reporting tool helps to create reports based on workflow results.
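To give a feel for exploring provenance with SQL queries, here is a hedged Python sketch. It uses an in-memory SQLite database as a stand-in for the databases listed above, and the tables `workflow_run` and `actor_firing`, their columns, and the sample rows are all invented for illustration. Kepler's actual provenance schema is not shown in this text and will differ.

```python
import sqlite3

# In-memory SQLite stands in for a real provenance database.
conn = sqlite3.connect(":memory:")

# Hypothetical schema, NOT Kepler's real one.
conn.executescript("""
CREATE TABLE workflow_run (
    id INTEGER PRIMARY KEY,
    workflow TEXT,
    started TEXT,
    status TEXT
);
CREATE TABLE actor_firing (
    run_id INTEGER REFERENCES workflow_run(id),
    actor TEXT,
    seconds REAL
);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO workflow_run VALUES (?, ?, ?, ?)",
    [
        (1, "align_reads", "2024-01-01", "completed"),
        (2, "align_reads", "2024-01-02", "failed"),
        (3, "plot_results", "2024-01-03", "completed"),
    ],
)
conn.executemany(
    "INSERT INTO actor_firing VALUES (?, ?, ?)",
    [(1, "Reader", 0.5), (1, "Aligner", 1.5), (1, "Aligner", 2.5), (2, "Reader", 0.5)],
)


def failed_runs(conn, workflow):
    """Search past executions of one workflow that did not complete."""
    return conn.execute(
        "SELECT id, started FROM workflow_run "
        "WHERE workflow = ? AND status = 'failed' ORDER BY id",
        (workflow,),
    ).fetchall()


def time_per_actor(conn, run_id):
    """Total recorded execution time of each actor in one run."""
    return conn.execute(
        "SELECT actor, SUM(seconds) FROM actor_firing "
        "WHERE run_id = ? GROUP BY actor ORDER BY actor",
        (run_id,),
    ).fetchall()


print(failed_runs(conn, "align_reads"))
print(time_per_actor(conn, 1))
```

Queries of this kind are the sort of search and summary that a run manager or reporting tool would otherwise perform on the user's behalf.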