
Cooking With a Laptop?


Tuesday, July 18, 2017, by Jen Johnson

[Jen at her computer]

How are data analysis and the collection of provenance like cooking?

Data analysis starts with datasets, like those collected in the field and laboratory. These datasets form the basis for the rest of the analysis and represent the raw ingredients of a meal.

Next, analyses are performed on these datasets. There is a wide variety of possible analyses to perform, comparable to the multitude of ways to clean, slice, and flavor even the most basic combinations of ingredients.  

But have you ever tried to make a dish with only a list of the ingredients? While stews and smoothies may work out, many dishes require certain steps to be carried out in a certain order or in a certain manner. Therefore, it is much easier (especially for an inexperienced chef like me!) if the recipe is more detailed.

Similarly, the logic of a data analysis is difficult to follow if only the raw data is provided, or if the description of the statistical analyses used is not sufficiently detailed. In the data analysis world, this detailed “recipe” is called provenance, or information that describes steps of the analysis and the history behind the final results. The goal of collecting and visualizing provenance, like that of a recipe, is to facilitate the reproducibility of the process that it describes.

[An R script represented in DDG Explorer, a tool to visualize the provenance of R scripts. A DDG, or Data Derivation Graph, represents datasets and processes as nodes, and the relationships between them as edges. If the R script were a recipe, the yellow process nodes would be the cleaning, slicing, and other activities performed on the ingredients, which would be the pink and purple data nodes.]

Ecologists often use the programming language R to perform data analyses. My mentors, Emery Boose, Barbara Lerner, and Matthew Lau, have been collecting and visualizing data provenance for R scripts using the tools RDataTracker and DDG Explorer. A script is a set of instructions that can be applied to any input of the same format to produce a corresponding output. RDataTracker parses an R script, converting each line into a process node in a Data Derivation Graph (DDG). It records files accessed by the script, as well as intermediate datasets, as data nodes, and it represents the relationships between process nodes and data nodes as edges in the graph. Nodes and edges are written to a Prov-JSON file, which provides a standard textual representation of a DDG. Finally, the file is read by DDG Explorer, an application that lets the user interact with a visual representation of the DDG, with expandable and collapsible nodes, associated line numbers, and views of intermediate data nodes.
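
To make that last step a bit more concrete, here is a minimal Python sketch that loads a Prov-JSON file and counts its nodes and edges. It assumes only the standard W3C PROV-JSON layout (top-level "activity" and "entity" sections, plus relation sections such as "used" and "wasGeneratedBy"); the file name is hypothetical, and RDataTracker's own output may carry additional attributes.

    import json

    # Load a Prov-JSON file (the file name here is hypothetical).
    with open("analysis.ddg.json") as f:
        prov = json.load(f)

    # In W3C PROV-JSON, "activity" entries correspond to process nodes
    # and "entity" entries to data nodes.
    process_nodes = prov.get("activity", {})
    data_nodes = prov.get("entity", {})

    # Relations such as "used" (a process read some data) and
    # "wasGeneratedBy" (some data was produced by a process) become edges.
    edges = []
    for relation in ("used", "wasGeneratedBy", "wasInformedBy"):
        edges.extend(prov.get(relation, {}).values())

    print(len(process_nodes), "process nodes")
    print(len(data_nodes), "data nodes")
    print(len(edges), "edges")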

However, R is not the only programming language used for data analysis; Python is another common choice. Because of differences between the languages, they have different strengths that complement each other in complicated analyses.

Think of Python and R as two chefs with different preparation techniques, tasked with making a single cake. One way to divide the work might be to assign each chef a different sub-process of the task: Chef Python is responsible for making the batter, while Chef R is responsible for the icing. To work efficiently, the two chefs need to communicate with each other to avoid repeating steps.

 A workflow is a sequence of sub-processes required to complete a process. For example, the workflow for baking a cake includes the sub-processes of mixing the batter and making the icing. The goal for this project is to provide provenance not only for individual Python and R scripts, but also for complete workflows consisting of multiple scripts. Provenance of an entire workflow provides a general overview and abstraction of the analysis.
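
As a toy illustration of such a workflow, the sketch below uses Python to drive both sub-processes: it writes an intermediate dataset and then hands it to an R script through Rscript. The file names and the make_icing.R script are invented; the point is only to show how scripts in two languages can be chained into one workflow whose inputs and outputs provenance tools would need to track.

    import csv
    import subprocess

    # Sub-process 1 (Chef Python): prepare the "batter" -- write an
    # intermediate dataset for the next step to consume.
    rows = [("sugar_g", "flour_g"), (200, 350), (180, 300)]
    with open("batter.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Sub-process 2 (Chef R): hand the intermediate file to an R script.
    # make_icing.R is a hypothetical script that reads batter.csv and
    # writes its own output; Rscript must be available on the PATH.
    subprocess.run(["Rscript", "make_icing.R", "batter.csv"], check=True)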

One challenging aspect of provenance collection is granularity, the level of detail at which provenance is recorded. Choosing the right granularity lets the provenance be as detailed or as abstract as the viewer needs. If you are already an experienced chef, you will not need instructions on how to separate egg whites from yolks or the best technique for whisking. However, an inexperienced chef might need step-by-step directions.

[A workflow for baking a cake at a) coarse, b) moderate, and c) fine granularity. Each view of the workflow provides different information that is useful for different purposes. Figure 2a links the sub-processes together and provides a broad overview. Figure 2b expands one of these sub-processes. Figure 2c expands one of the steps within the first sub-process and provides the most detailed level of provenance.]

In the data analysis world, finer-grained provenance provides information that is useful for debugging a script line by line, because each operation performed on the dataset can be examined individually to confirm that its result is as intended. Coarser-grained provenance, on the other hand, provides a simple flowchart that supports high-level understanding and clear communication.
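
To see the contrast, here is a sketch, written as plain Python lists, of the same tiny analysis recorded at both granularities; the node names are invented for illustration.

    # Coarse granularity: only the script and the files it touched appear,
    # like a recipe that just says "bake a cake".
    coarse_nodes = ["field_data.csv", "analysis.py", "results.csv"]
    coarse_edges = [("field_data.csv", "analysis.py"),
                    ("analysis.py", "results.csv")]

    # Fine granularity: every operation on the dataset becomes its own
    # process node, so each intermediate result can be checked when debugging.
    fine_nodes = ["field_data.csv", "read data", "raw table",
                  "remove missing values", "cleaned table",
                  "fit model", "model summary", "write results", "results.csv"]
    fine_edges = [("field_data.csv", "read data"),
                  ("read data", "raw table"),
                  ("raw table", "remove missing values"),
                  ("remove missing values", "cleaned table"),
                  ("cleaned table", "fit model"),
                  ("fit model", "model summary"),
                  ("model summary", "write results"),
                  ("write results", "results.csv")]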

The first tool I tried for Python provenance was StarFlow, which collects provenance from Python scripts at a coarse granularity, recording only when files are accessed. Like an over-simplified recipe, this does not provide much insight into such a simple workflow. So I also tried another tool, noWorkflow, which records provenance at a much finer default granularity. The ongoing challenge is to make that information adaptable to the viewer's needs: control-flow loops, functions, and the scripts themselves become collapsible nodes, producing a dynamic visual representation of the workflow in DDG Explorer.
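
For a sense of what the finer-grained view captures, here is a toy script of the kind one might record provenance for (the data and the cleaning rule are invented). Its function call and control-flow loop are the structures that appear as collapsible nodes in the visualization.

    # A toy analysis script: the function and the loop below are the kinds
    # of structures that become collapsible nodes in the provenance graph.
    def clean(value):
        """Drop obviously bad readings (an invented rule, for illustration)."""
        return value if value >= 0 else None

    readings = [3.2, -1.0, 4.8, 2.7]    # invented field measurements

    cleaned = []
    for r in readings:                   # control-flow loop
        c = clean(r)                     # function call
        if c is not None:
            cleaned.append(c)

    print("mean reading:", sum(cleaned) / len(cleaned))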

[A Python script represented in DDG Explorer, based on coarse-grained provenance collected by StarFlow.]

[A Python script represented in DDG Explorer, based on fine-grained provenance collected by noWorkflow. The control-flow loops, functions, and scripts can be collapsed or expanded by the user.]

Once Python scripts can be linked together into a workflow, the next step is to integrate provenance collection tools for Python and R into a single workflow. This will allow data scientists to collect provenance for complete and complex workflows. And when recipes get big, the results get even bigger and better. Yum.

Jen is a rising senior at Middlebury College studying Computer Science and Molecular Biology.