Where did that data come from, anyway?

Thursday, July 13, 2017, by Connor Gregorich-Trevor

[A typical workday. Photo by Jill Fusco]

Imagine that you've found an interesting piece of research, but you feel that it left certain questions unanswered. So you decide to start your own project based on this research. But when you go to begin, you find out that the authors gave almost no information about how they obtained their data. You don't know what kind of programs they used, what analyses they ran, or even what tools they used to collect the data. Because you have no way of knowing this, you will be unable to replicate their experiment, and you will be left unable to complete your project.

One of the most frustrating parts of extending research is a lack of proper documentation. Studies may contain data with no information about the analyses that were run, how the data was obtained, or the conditions under which it was collected. Without this crucial information, called provenance, replicating a study or verifying that its results are reliable becomes extremely difficult. This is the problem that I have set out to help solve this summer.

[A file workflow produced by DDGExplorer. Photo by Connor Gregorich-Trevor]

I am spending this summer at Harvard Forest, where I work under my mentors Emery Boose and Barbara Lerner towards solving this problem. I spend each day in Shaler Hall, programming new features for RDataTracker, a program which collects provenance for analyses done in the R programming language. With RDataTracker, users can easily keep track of when their scripts were run and the inputs to those scripts. I also work on a companion program to RDataTracker called DDGExplorer, with which users can display the flow of a program and visually examine the sequence of transformations their data undergoes.

I specifically added functionality so that as the user runs multiple scripts with RDataTracker, they can use DDGExplorer to visualize how their scripts interact with one another. For example, if one script writes data to a file and another script reads from that same file, those two scripts will be connected in DDGExplorer with arrows. Clicking on a menu item displays a web of the connections between different scripts, which allows the user to keep track of what happens to their data over time.
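The linking idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how RDataTracker or DDGExplorer is actually implemented: the `link_scripts` function, the record layout, and the script and file names are all hypothetical.

```python
from collections import defaultdict


def link_scripts(runs):
    """Return sorted (writer, file, reader) edges between scripts.

    `runs` is a hypothetical list of records, one per script run, each
    listing the files that script read and the files it wrote.
    """
    # Map each file path to the set of scripts that wrote it.
    writers = defaultdict(set)
    for run in runs:
        for path in run["writes"]:
            writers[path].add(run["script"])

    # A script that reads a file is linked to every other script that wrote it.
    edges = set()
    for run in runs:
        for path in run["reads"]:
            for writer in writers.get(path, ()):
                if writer != run["script"]:
                    edges.add((writer, path, run["script"]))
    return sorted(edges)


# Made-up example: one script cleans the data, another plots it.
runs = [
    {"script": "clean.R", "reads": ["raw.csv"], "writes": ["clean.csv"]},
    {"script": "plot.R", "reads": ["clean.csv"], "writes": ["plot.png"]},
]

edges = link_scripts(runs)
for writer, path, reader in edges:
    print(f"{writer} --[{path}]--> {reader}")
```

Here the two scripts are joined by a single arrow through `clean.csv`, which is the kind of connection described above.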

[Taking measurements in the field. Photo by Jen Johnson]

When I am not hard at work or fixing bugs I unintentionally created (such as “When I check the box to show line numbers all the arrows flip around for some reason”), I spend my time long-distance running on the miles of beautiful trails in the forest. I have had the opportunity to see all kinds of wildlife up close when out on my runs, including porcupines, foxes, newts, and (thankfully only once) a bear. I also spend a few hours each week in the field taking water level readings and ensuring that data collection tools are functioning properly. On weekends, the other REU students and I hang out and go on trips together. We have gone camping and traveled to Boston, Salem, New York City, and more.

Overall, I am having an amazing summer at Harvard Forest and learning an incredible amount through my work, projects done in my own time, and the amazing seminar speakers who come to talk to us. I have made a ton of great friends here, and I know I will forever value my experiences this summer.

Connor Gregorich-Trevor is a rising fourth-year student at Grinnell College, where he is enrolled as a Computer Science and Psychology double major.