Harvard Forest Data Archive

HF336

Evaluation of Mask R-CNN Model for Counting Reproductive Structures of Six Plant Species 1895-2018

Overview

  • Lead: Charles Davis, Julien Champ, Daniel Park, Alexis Joly, Aaron Ellison, Pierre Bonnet
  • Investigators:
  • Contact: Information Manager
  • Start date: 1895
  • End date: 2018
  • Status: complete
  • Location: Global
  • Latitude: -90 to +90 degrees
  • Longitude: -180 to +180 degrees
  • Elevation:
  • Datum: WGS84
  • Taxa: Anemone canadensis, Anemone hepatica, Anemone quinquefolia, Trillium erectum, Trillium grandiflorum, Trillium undulatum
  • Release date: 2023
  • Language: English
  • EML file: knb-lter-hfr.336.4
  • DOI: digital object identifier
  • EDI: data package
  • DataONE: data package
  • Related links:
  • Study type: long-term measurement, historical, modeling
  • Research topic: ecological informatics and modelling; historical and retrospective studies; international research projects; large experiments and permanent plot studies
  • LTER core area: primary production
  • Keywords: climate change, flowering, future scenarios, imagery, life history, models, phenology, reproduction
  • Abstract:

    Phenology, the timing of life-history events, is a key trait for understanding responses of organisms to climate. The digitization and online mobilization of herbarium specimens is rapidly advancing our understanding of plant phenological response to climate and climatic change. The current common practice of manually harvesting data from individual specimens greatly restricts our ability to scale data collection to entire collections. Recent investigations have demonstrated that machine-learning models can facilitate data collection from herbarium specimens. However, present attempts have focused largely on simplistic binary coding of reproductive phenology (e.g., flowering or not). Here, we use crowd-sourced phenological data of numbers of buds, flowers, and fruits of more than 3000 specimens of six common wildflower species of the eastern United States (Anemone canadensis, A. hepatica, A. quinquefolia, Trillium erectum, T. grandiflorum, and T. undulatum) to train a model using Mask R-CNN to segment and count phenological features. A single global model was able to automate the binary coding of reproductive stage with greater than 90% accuracy. Segmenting and counting features were also successful, but accuracy varied with phenological stage and taxon. Counting buds was significantly more accurate than flowers or fruits. Moreover, botanical experts provided more reliable data than either crowd-sourcers or our Mask R-CNN model, highlighting the importance of high-quality human training data. Finally, we also demonstrated the transferability of our model to automated phenophase detection and counting of the three Trillium species, which have large and conspicuously shaped reproductive organs. These results highlight the promise of our two-phase crowd-sourcing and machine-learning pipeline to segment and count reproductive features of herbarium specimens, providing high-quality data with which to study responses of plants to ongoing climatic change.

  • Methods:

    Training

    This dataset is based on a subset of the data used in Park et al. (2018, 2019), and is dedicated to evaluating a deep learning model (specifically Mask R-CNN) for counting reproductive structures of six species in two genera (Anemone and Trillium) of common spring-flowering herbs in the US. This subset includes 3073 specimens: Anemone canadensis (108), Anemone hepatica (524), Anemone quinquefolia (686), Trillium erectum (862), Trillium grandiflorum (226), and Trillium undulatum (667). Each specimen (given as a URL providing access to the image of a herbarium sheet) is described with X, Y coordinates that locate the masks of the three types of reproductive structures (buds, flowers, and fruits) visible on the specimen.

    For the purposes of the study, these specimens were used to train several deep learning models so that they could be evaluated on counting tasks. This dataset (hf336-02-tasks.zip) is composed of 10 folders, one for each experiment that was conducted: (i) one folder, “all”, contains all data used to train a model on specimens of the 6 species together; (ii) 6 folders contain the data used to train species-specific models (one for each species); (iii) 3 other folders were used to evaluate models trained on data from 2 species of Trillium and tested on the third species of this genus.

    Each folder contains 6 JSON documents, which provide access to the data and allow our experiments to be reproduced. Each JSON document is described below:

    CONFIG – settings used by Facebook’s implementation of Mask R-CNN, giving access to the parameters of the training and testing phases of the model

    fulltrain – metadata of the dataset used during the training phase, split into 2 subsets, “train” and “cv”

    train – metadata related to the initial training set

    cv – metadata related to the validation set, used to monitor training performance on data external to the training set and so avoid overfitting to the training dataset

    test – metadata related to the test set used to evaluate model performance

    INFOS – Experimental settings about the different sub-datasets
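
    As a minimal sketch of how one experiment folder might be inspected after extracting hf336-02-tasks.zip, the Python below parses every JSON document in a folder and reports which of the six documents described above are absent. The file names (for example CONFIG.json) and the folder layout are assumptions, since they are not stated here; load_task_documents and missing_documents are hypothetical helper names.

```python
import json
from pathlib import Path

# Document names as listed in the description above. The actual file names
# inside hf336-02-tasks.zip are an assumption and may differ.
EXPECTED_DOCS = ("CONFIG", "fulltrain", "train", "cv", "test", "INFOS")


def load_task_documents(task_dir):
    """Parse every .json file in one experiment folder, keyed by file stem."""
    docs = {}
    for path in sorted(Path(task_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            docs[path.stem] = json.load(handle)
    return docs


def missing_documents(docs, expected=EXPECTED_DOCS):
    """Return the expected document names that were not found in the folder."""
    return [name for name in expected if name not in docs]
```

    For example, load_task_documents("tasks/all") would return a dictionary of parsed documents for the “all” experiment, assuming the archive was extracted to a local folder named tasks.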

    Results

    The file hf336-01-model-results.csv contains the raw results for the 10 models related to the 10 tasks plus crowdsourcing and expert counts. This data table is described as follows:

    image.id – filename of the .jpg images in hf336-03-images.csv

    image.url – URL for the image

    genus and species – genus name (Trillium or Anemone) and species name for each image

    exp.bud, exp.flower and exp.fruit – true numbers of buds, flowers or fruits, respectively

    pred.bud, pred.flower and pred.fruit – the total number of buds, flowers or fruits predicted by Mask R-CNN using the global model

    pred.2tr.bud, pred.2tr.flower, pred.2tr.fruit, pred.2tr.bis.bud, pred.2tr.bis.flower, pred.2tr.bis.fruit, pred.2tr.ter.bud, pred.2tr.ter.flower and pred.2tr.ter.fruit – total number of buds, flowers or fruits on the given Trillium species predicted by Mask R-CNN using a model trained on the other two Trillium species

    pred.(genus-species abbreviation).bud, pred.(genus-species abbreviation).flower, pred.(genus-species abbreviation).fruit – total number of buds, flowers, or fruits predicted by Mask R-CNN using the species-specific model for Anemone canadensis (anca), Anemone hepatica (anhe), Anemone quinquefolia (anqu), Trillium erectum (trer), Trillium grandiflorum (trgr) or Trillium undulatum (trun)

    crowd1.pred.bud, crowd1.pred.flower, crowd1.pred.fruit – total number of buds, flowers or fruits estimated by crowd-sourcer 1 (repeated for crowd-sourcers 2-15)

    expert1.pred.bud, expert1.pred.flower, expert1.pred.fruit – total number of buds, flowers or fruits estimated by expert 1 (repeated for expert 2)
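
    The columns above can be compared pairwise. The sketch below, which uses only the Python standard library, computes the mean absolute counting error and a simple presence/absence agreement between the exp.* columns and any prediction column family (for example pred, pred.2tr, pred.trer, crowd1.pred or expert1.pred). It is an illustration rather than the study's own evaluation code: the treatment of blank and NA cells as missing values is an assumption about the file, and parse_count, count_metrics and load_rows are hypothetical helper names.

```python
import csv


def parse_count(value):
    """Convert a CSV cell to a float; blank/NA cells (assumed codes) become None."""
    text = (value or "").strip()
    if text == "" or text.upper() in {"NA", "NAN"}:
        return None
    return float(text)


def count_metrics(rows, structure, prefix="pred"):
    """Compare exp.<structure> with <prefix>.<structure> where both are present.

    Returns (n, mean absolute error, presence/absence agreement).
    """
    abs_errors = []
    presence_hits = 0
    for row in rows:
        expected = parse_count(row.get(f"exp.{structure}"))
        predicted = parse_count(row.get(f"{prefix}.{structure}"))
        if expected is None or predicted is None:
            continue  # skip rows lacking either value
        abs_errors.append(abs(expected - predicted))
        # Agreement on whether at least one structure is present.
        presence_hits += (expected > 0) == (predicted > 0)
    n = len(abs_errors)
    if n == 0:
        return 0, None, None
    return n, sum(abs_errors) / n, presence_hits / n


def load_rows(csv_path):
    """Read the results table into a list of dictionaries keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

    For example, count_metrics(load_rows("hf336-01-model-results.csv"), "bud") would summarize the global model's bud counts, and passing prefix="crowd1.pred" would do the same for the first crowd-sourcer.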

    Publications cited

    Park, D., Williams, A., Law, E., Ellison, A., and Davis, C. (2018). Assessing Plant Phenological Patterns in the Eastern United States Over the Last 120 Years [Data set]. Environmental Data Initiative. https://doi.org/10.6073/PASTA/F6AFA728BB5EDFD79F458D7D5E23F559

    Park, D. S., Breckheimer, I., Williams, A. C., Law, E., Ellison, A. M., and Davis, C. C. (2019). Herbarium specimens reveal substantial and unexpected variation in phenological sensitivity across the eastern United States. Philosophical Transactions of the Royal Society B, 374(1763), 20170394.

  • Organization: Harvard Forest. 324 North Main Street, Petersham, MA 01366, USA. Phone (978) 724-3302. Fax (978) 724-3595.

  • Project: The Harvard Forest Long-Term Ecological Research (LTER) program examines ecological dynamics in the New England region resulting from natural disturbances, environmental change, and human impacts. (ROR).

  • Funding: National Science Foundation LTER grants: DEB-8811764, DEB-9411975, DEB-0080592, DEB-0620443, DEB-1237491, DEB-1832210.

  • Use: This dataset is released to the public under Creative Commons CC0 1.0 (No Rights Reserved). Please keep the dataset creators informed of any plans to use the dataset. Consultation with the original investigators is strongly encouraged. Publications and data products that make use of the dataset should include proper acknowledgement.

  • License: Creative Commons Zero v1.0 Universal (CC0-1.0)

  • Citation: Davis C, Champ J, Park D, Joly A, Ellison A, Bonnet P. 2023. Evaluation of Mask R-CNN Model for Counting Reproductive Structures of Six Plant Species 1895-2018. Harvard Forest Data Archive: HF336 (v.4). Environmental Data Initiative: https://doi.org/10.6073/pasta/652360975027243e84db125b51bb8645.

Detailed Metadata

hf336-01: raw results for models

  • Compression: none
  • Format: csv
  • Type: document

hf336-02: training data

  • Compression: zip
  • Format: JSON
  • Type: document

hf336-03: herbarium images

  • Compression: none
  • Format: csv
  • Type: document

hf336-04: images

  • Compression: zip
  • Format: jpeg
  • Type: image