uid=HFR,o=lter,dc=ecoinformatics,dc=org
all
public
read
doi:10.6073/pasta/9c0625930a4086f4df39d549e98a9443
Soil Bacteria and Archaea in Macrosystems Biodiversity Project at Harvard Forest 2012
Jizhong
Zhou
https://orcid.org/0000-0003-2014-0564
Robert
Waide
https://orcid.org/0000-0001-8127-8634
James
Brown
Ye
Deng
https://orcid.org/0000-0002-7584-0632
Researcher
2023
English
Patterns of biodiversity, such as the increase toward the tropics and the peaked curve during ecological succession, are fundamental phenomena for ecology. Such patterns have multiple, interacting causes, but temperature emerges as a dominant factor across organisms from microbes to trees and mammals, and across terrestrial, marine, and freshwater environments. However, there is little consensus on the underlying mechanisms, even as global temperatures increase and the need to predict their effects becomes more pressing.
The purpose of this project is to generate and test theory for how temperature impacts biodiversity through its effect on biochemical processes and metabolic rate. A combination of standardized surveys in the field and controlled experiments in the field and laboratory measure diversity of three taxa -- trees, invertebrates, and microbes -- and key biogeochemical processes of decomposition in seven forests distributed along a geographic gradient of increasing temperature from cold temperate to warm tropical.
This field experiment focused on soil microbes. DNA was extracted and purified from soil cores from an array of 21 1-m2 subplots. The V4 region of the 16S rRNA genes for bacteria and archaea was amplified and sequenced using Illumina MiSeq by the University of Oklahoma Institute for Environmental Genomics as part of a macrosystems biodiversity and latitude project supported by the National Science Foundation under Cooperative Agreement DEB#1065836.
abundance
bacteria
biodiversity
genetics
microbes
soil
soil organic matter
LTER controlled vocabulary
populations
organic matter
LTER core area
Harvard Forest
HFR
LTER
USA
HFR default
This dataset is released to the public under Creative Commons CC0 1.0 (No Rights Reserved). Please keep the dataset creators informed of any plans to use the dataset. Consultation with the original investigators is strongly encouraged. Publications and data products that make use of the dataset should include proper acknowledgement.
Creative Commons Zero v1.0 Universal
https://spdx.org/licenses/CC0-1.0.html
CC0-1.0
https://harvardforest.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf262
Harvard Forest. Coordinates based on WGS84 datum.
-72.17899
-72.17329
+42.54054
+42.53780
352
363
meter
2012
2012
complete
Information Manager
Harvard Forest
324 North Main Street
Petersham
MA
01366
USA
(978) 724-3302
hf-im@lists.fas.harvard.edu
Harvard Forest
324 North Main Street
Petersham
MA
01366
USA
(978) 724-3302
(978) 724-3595
https://harvardforest.fas.harvard.edu
Overview
A nested sampling design was implemented to survey the background pools of regional taxonomic diversity at six forest sites across America along a latitudinal gradient of increasing temperature: Niwot, Andrews, Harvard, Coweeta, Luquillo and Barro Colorado Island. At each site, we first located a central subplot and then laid out 1-m2 subplots in four directions at distances of 1 m, 10 m, 50 m, 100 m and 200 m from the central subplot. In each 1-m2 subplot, 9 soil cores were collected and pooled to form one soil sample. With this design, 21 samples were collected at each site, for a total of 126 samples.
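The layout described above can be enumerated in a few lines. This is a minimal sketch, not part of the original protocol: the function name and the site prefix argument are illustrative assumptions, and the label pattern (for example hc, h1e, h200w) simply mirrors the sample column names used in this dataset's data table.

```python
from itertools import product

# Distances (in meters) and compass directions taken from the sampling design.
DISTANCES_M = (1, 10, 50, 100, 200)
DIRECTIONS = ("n", "e", "s", "w")


def subplot_labels(site_prefix="h"):
    """Return the central subplot label plus one label per distance and direction."""
    labels = [f"{site_prefix}c"]  # central subplot
    labels += [
        f"{site_prefix}{distance}{direction}"
        for distance, direction in product(DISTANCES_M, DIRECTIONS)
    ]
    return labels


labels = subplot_labels()
print(len(labels))      # subplots per site: 1 central + 5 distances x 4 directions
print(len(labels) * 6)  # samples across the six sites
```

With the default prefix, the list contains 21 labels, which matches the 21 samples per site and 126 samples overall stated in the text.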
To determine the biodiversity of microbial communities, three genes targeting different taxonomic groups with different taxonomic resolutions were sequenced. The first of these is the V3-V4 region of the 16S rRNA gene, used to determine the biodiversity of bacteria and archaea. All of the target genes were amplified and sequenced to great depth using Illumina MiSeq. An average of 64K sequence reads was obtained for the 16S gene. The number of OTUs (Operational Taxonomic Units) obtained varied considerably (1.6-5.8 times) depending on the sequence similarity threshold. This sequencing effort appears sufficient to estimate the diversity of the microbial communities examined, at least for the dominant populations: rarefaction analysis showed that the rarefaction curves approached saturation at different cutoffs for the three target genes.
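As an illustration of the rarefaction idea mentioned above, the following minimal sketch counts how many distinct OTUs appear in random subsamples of increasing size. It is not the analysis code used for this dataset: the function name, the toy counts, and the single random draw per depth (rather than an average over many draws) are all assumptions made for brevity.

```python
import random


def rarefaction_curve(otu_counts, depths, seed=0):
    """Count distinct OTUs in one random subsample (without replacement) per depth."""
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by its OTU.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    curve = []
    for depth in depths:
        if depth > len(reads):
            break  # cannot subsample more reads than the sample contains
        sample = rng.sample(reads, depth)
        curve.append((depth, len(set(sample))))
    return curve


# Hypothetical toy sample: 100 reads spread unevenly over five OTUs.
toy_counts = {"OTU_1": 50, "OTU_2": 30, "OTU_3": 15, "OTU_4": 4, "OTU_5": 1}
curve = rarefaction_curve(toy_counts, depths=[10, 25, 50, 100])
for depth, richness in curve:
    print(depth, richness)
```

A curve that flattens as depth increases suggests that additional sequencing would reveal few new OTUs, which is the sense in which the text speaks of curves approaching saturation.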
Amplicon Sequencing Analysis Protocols
1. Raw data. The raw data include three files: R1.fastq, R2.fastq, and I1.fastq. The first two are read files containing the actual sequences in FASTQ format; the third is the index file, which contains the barcode sequence for each corresponding read in the first two files. Here, as an example, we trimmed the files to include only 10000 entries each.
2. Upload. All the files need to be uploaded to our sequencing analysis server. Files larger than 2 GB need to be uploaded through an FTP server.
3. Remove PhiX sequences. The PhiX sequences are added to increase the diversity of the nucleotides in each position and enhance the sequencing accuracy. These sequences are removed here using BLAST against the PhiX genome sequences. (Output: Galaxy8-10)
4. Split sequences by barcodes. A barcode list needs to be provided to extract the barcodes from the index file so that each sequence is assigned to the sample its barcode represents. After this step, the sample name (barcode name) is attached to every sequence ID, linked by “--”. (Output: Galaxy13, 14)
5. Remove primers (optional). This step is needed only if the primers are included in the reads, which depends on the sequencing strategy. If it is run, the primers are removed from the beginning of the reads within a certain range. (Output: Galaxy16,17)
6. Join paired-end reads. A program called FLASH is used to join the forward and reverse reads. (Output: Galaxy20)
7. Quality trim. The quality trimming process is carried out by Btrim. (Output: Galaxy24)
8. Extract FASTA from FASTQ. The FASTQ format contains both the sequences and their quality scores. Further analysis needs only the sequences in FASTA format, though users can still obtain the quality files here if they want. (Output: Galaxy26)
9. Remove sequences containing N. N is an undetermined base, which may indicate that the sequence after it is unreliable. Users can choose to remove such sequences completely or to trim at the position of the N and keep the remaining part of the sequence. (Output: Galaxy27)
10. Filter sequences by length. Sequences that are too short (which rarely happens for Illumina sequences after joining paired-end reads) greatly affect the clustering results when generating OTUs and introduce errors, while sequences that are too long usually indicate sequencing errors. (Output: Galaxy29)
11. Remove potential chimeras using UCHIME. Chimeras, which are commonly created during PCR amplification of DNA samples, are removed with UCHIME. The UCHIME algorithm has two modes: de novo and reference database. The de novo mode uses abundance information to detect chimeras, on the assumption that chimeras are less abundant than their parents because they have undergone fewer rounds of amplification. We usually use the reference database mode to save computational time. The Greengenes 16S dataset is used as the reference for 16S sequencing analysis. To save further computational time, duplicate copies of identical sequences are removed from the dataset, and a redundancy map listing the IDs of all identical sequences is generated for later use. (Output: Galaxy32,33)
12. Correct frame shifts for protein-encoding sequences (optional). When the sequences are from protein-encoding genes, correct open reading frames (ORFs) must be ensured to allow accurate translation and alignment in later analysis. A program called FrameBot (developed by RDP, not published yet) is used to perform this process. A reference file of protein sequences needs to be provided, and frame shift-corrected protein and DNA sequences are generated.
13. Resample (optional). Resample the sequences so that each sample has the same number of sequences.
14. Pick OTUs. This step uses clustering methods to form OTUs (operational taxonomic units) based on sequence similarity. Users can currently choose from three methods: CD-HIT, UCLUST, and McClust. McClust is a complete-linkage algorithm that provides more accurate clustering results, but it is very time and space consuming. For large sequence datasets, CD-HIT and UCLUST are recommended. For the millions of sequences generated by the Illumina platform, UCLUST should be chosen for the clustering. The clusters generated in this step are treated as OTUs for further analysis. (Output: Galaxy35)
15. Generate OTU table. In the OTU table, rows represent OTUs and columns represent samples; each value is the number of sequences belonging to the corresponding OTU and sample. The redundancy map from the UCHIME step is also needed to add back the identical sequences. OTU tables w/o singlets are generated, along with their corresponding representative sequences. (Output: Galaxy36-39)
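Several of the steps above (removing reads that contain N, filtering by length, collapsing identical sequences into a redundancy map, and building the OTU table) are simple enough to sketch in plain Python. This is an illustrative sketch under stated assumptions, not the Galaxy pipeline's code: all function names are hypothetical, the sequence IDs are assumed to have the form sample--read (the protocol says only that the two parts are linked by “--”), and the clustering itself (CD-HIT, UCLUST, or McClust) is replaced by a hand-written cluster assignment.

```python
from collections import Counter, defaultdict


def drop_n_and_filter_length(seqs, min_len, max_len):
    """Steps 9-10: discard reads containing N, then reads outside [min_len, max_len]."""
    return {
        seq_id: seq
        for seq_id, seq in seqs.items()
        if "N" not in seq.upper() and min_len <= len(seq) <= max_len
    }


def dereplicate(seqs):
    """Keep one representative per identical sequence and record a redundancy map."""
    by_sequence = defaultdict(list)
    for seq_id, seq in seqs.items():
        by_sequence[seq].append(seq_id)
    unique = {ids[0]: seq for seq, ids in by_sequence.items()}
    redundancy = {ids[0]: ids for ids in by_sequence.values()}
    return unique, redundancy


def build_otu_table(clusters, redundancy):
    """Step 15: rows are OTUs, columns are samples, values are read counts."""
    table = {}
    for otu_id, representatives in clusters.items():
        counts = Counter()
        for rep in representatives:
            # Add back every read that was collapsed into this representative.
            for seq_id in redundancy.get(rep, [rep]):
                sample = seq_id.split("--", 1)[0]  # assumed ID layout: sample--read
                counts[sample] += 1
        table[otu_id] = dict(counts)
    return table


# Hypothetical toy reads; r5 contains an N and r6 is too short.
reads = {
    "h1e--r1": "ACGTACGT",
    "h1e--r2": "ACGTACGT",
    "hc--r3": "ACGTACGT",
    "hc--r4": "ACGTTCGT",
    "hc--r5": "ACNTACGT",
    "hc--r6": "ACG",
}
clean = drop_n_and_filter_length(reads, min_len=6, max_len=10)
unique, redundancy = dereplicate(clean)
# Hypothetical clustering result: both unique sequences fall into one OTU.
clusters = {"OTU_1": list(unique)}
table = build_otu_table(clusters, redundancy)
print(table)
```

Carrying the redundancy map through to the table step is what lets clustering run on unique sequences only while the final counts still reflect every original read.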
Taxa
Some taxa remain unclassified. For those OTUs classified to family, the taxa are as follows: Enterobacteriaceae, Gemmatimonadaceae, Spirochaetaceae, Planctomycetaceae, Hyphomicrobiaceae, Rhodospirillaceae, Acetobacteraceae, Rickettsiaceae, Sphingomonadaceae, Ktedonobacteraceae, Polyangiaceae, Nocardioidaceae, Myxococcaceae, Opitutaceae, Legionellaceae, Bradyrhizobiaceae, Cystobacteraceae, Phaselicystidaceae, Acidimicrobiaceae, Rhodocyclaceae, Coxiellaceae, Bdellovibrionaceae, Parachlamydiaceae, Conexibacteraceae, Xanthomonadaceae, Ruminococcaceae, Thermoleophilaceae, Acidimicrobineae_incertae_sedis, Nocardiaceae, Micromonosporaceae, Solirubrobacteraceae, Nannocystaceae, Xanthobacteraceae, Pseudonocardiaceae, Oxalobacteraceae, Neisseriaceae, Chitinophagaceae, Puniceicoccaceae, Burkholderiaceae, Methylococcaceae, Streptosporangineae_incertae_sedis, Cytophagaceae, Catenulisporaceae, Chthonomonadaceae, Aurantimonadaceae, Caulobacteraceae, Thermomonosporaceae, Sinobacteraceae, Pseudomonadaceae, Comamonadaceae, Kofleriaceae, Rhodobacteraceae, Anaerolineaceae, Actinospicaceae, Bacillaceae 2, Beijerinckiaceae, Holophagaceae, Streptosporangiaceae, Verrucomicrobiaceae, Nitrospiraceae, Pasteuriaceae, Geobacteraceae, Peptococcaceae 2, Microbacteriaceae, Armatimonadaceae, Geodermatophilaceae, Simkaniaceae, Cryomorphaceae, Caldilineaceae, Nitrosomonadaceae, Rhodobiaceae, Clostridiaceae 1, Chloroplast, Sphaerobacteraceae, Hydrogenophilaceae, Thiotrichales_incertae_sedis, Oceanospirillaceae, Family II, Family I, Leptotrichiaceae, Rikenellaceae, Haliangiaceae, Methylocystaceae, Sphingobacteriaceae, Intrasporangiaceae, Alcaligenaceae, Erythrobacteraceae, Phyllobacteriaceae, Family XIII, Alteromonadaceae, Gracilibacteraceae, Campylobacteraceae, Bacteriovoracaceae, Sporichthyaceae, Iamiaceae, Saprospiraceae, Demequinaceae, Ectothiorhodospiraceae, Nakamurellaceae, Syntrophaceae, Chromatiaceae, Desulfobacteraceae, Flammeovirgaceae, Micrococcaceae, Planococcaceae, Rhizobiaceae, 
Methylobacteriaceae, Burkholderiales_incertae_sedis, Acidothermaceae, Phycisphaeraceae, Cellulomonadaceae, Streptomycetaceae, Kineosporiaceae, Syntrophobacteraceae, Rhizobiales_incertae_sedis, Thermotogaceae, Veillonellaceae, Porphyromonadaceae, Mycobacteriaceae, Peptococcaceae 1, Moraxellaceae, Lachnospiraceae, Methylophilaceae, Shewanellaceae, Trueperaceae, Mycoplasmataceae, Flavobacteriaceae, Beutenbergiaceae, Bacillales_Incertae Sedis XII, Cryptosporangiaceae, Rhodothermaceae, Desulfohalobiaceae, Family IX, Celerinatantimonadaceae, Leptospiraceae, Prevotellaceae, Rubrobacteraceae, Alicyclobacillaceae, Herpetosiphonaceae, Bacillaceae 1, Thermaceae, Fervidicoccaceae, Hyphomonadaceae, Natranaerobiaceae, Tsukamurellaceae, Syntrophorhabdaceae, Marinilabiaceae, Enterococcaceae, Jiangellaceae, Chlamydiaceae, Halomonadaceae, Ignavibacteriaceae, Cyclobacteriaceae, Deinococcaceae, Brevibacteriaceae, Brucellaceae, Thermoanaerobacteraceae, Corynebacteriaceae, Desulfovibrionaceae, Methanomicrobiaceae, Propionibacteriaceae, Family V, Coriobacteriaceae, Sanguibacteraceae, Thermosporotrichaceae, Thiotrichaceae, Anaplasmataceae, Dermacoccaceae, Leuconostocaceae, Elusimicrobiaceae, Paenibacillaceae 1, Thermoplasmatales_incertae_sedis, Clostridiales_Incertae Sedis XVIII, Oceanospirillales_incertae_sedis, Actinomycetaceae, Desulfuromonadaceae, Victivallaceae, Methanobacteriaceae, Thermoactinomycetaceae 1, Listeriaceae, Alcanivoracaceae, Erysipelotrichaceae, Bacteroidaceae, Family IV, Desulfobulbaceae, Thermogemmatisporaceae, Synergistaceae, Family XI, Promicromonosporaceae, Pseudomonadales_incertae_sedis, Carnobacteriaceae, Euzebyaceae, Thermofilaceae, Patulibacteraceae, Lactobacillaceae, Methanocorpusculaceae, Methanocellaceae, Deferribacteraceae, Glycomycetaceae, Chloroflexaceae, Eubacteriaceae, Aeromonadaceae, Helicobacteraceae, Methanomicrobiales_incertae_sedis, Syntrophomonadaceae, Jonesiaceae, Halobacteriaceae, Piscirickettsiaceae, Staphylococcaceae, Cohaesibacteraceae, 
Desulfarculaceae, Halothiobacillaceae, Ruaniaceae, Streptococcaceae, Thermomicrobiaceae, Paenibacillaceae 2, Cardiobacteriaceae, Psychromonadaceae, Vibrionaceae, Colwelliaceae, Hahellaceae, Sutterellaceae, Actinopolysporaceae, Francisellaceae, Peptostreptococcaceae, Fusobacteriaceae, Clostridiales_Incertae Sedis XI, Clostridiales_Incertae Sedis XIII, Acidaminococcaceae, Bacteroidales_incertae_sedis, Clostridiaceae 4, Entomoplasmataceae, Spiroplasmataceae, Clostridiaceae 2, Acholeplasmataceae, Thermoactinomycetaceae 2, Family VIII, Bogoriellaceae, Pasteurellaceae, Methanosarcinaceae, Idiomarinaceae, Nocardiopsaceae, Bacillales_Incertae Sedis XI, Chlorobiaceae, Pseudoalteromonadaceae, Bacillales_incertae_sedis, Dietziaceae, Clostridiaceae 3, Waddliaceae, Moritellaceae, Desulfurellaceae, Methanosaetaceae, Dermabacteraceae, Saccharospirillaceae, Hydrogenothermaceae, Segniliparaceae, Succinivibrionaceae.
Harvard Forest Long-Term Ecological Research
Harvard Forest
324 North Main Street
Petersham
MA
01366
USA
(978) 724-3302
(978) 724-3595
https://harvardforest.fas.harvard.edu
https://ror.org/059cpzx98
pointOfContact
The Harvard Forest Long-Term Ecological Research (LTER) program examines ecological dynamics in the New England region resulting from natural disturbances, environmental change, and human impacts.
National Science Foundation LTER grants: DEB-8811764, DEB-9411975, DEB-0080592, DEB-0620443, DEB-1237491, DEB-1832210.
hf262-01-hf-16s-rrna.csv
hf16s rRNA bacteria archaea
hf262-01-hf-16s-rrna.csv
34973096
409e4a14449e3941a3374be02871a2e6
1
\r\n
column
,
https://harvardforest.fas.harvard.edu/data/p26/hf262/hf262-01-hf-16s-rrna.csv
otu
the ID of the Operational Taxonomic Unit (OTU)
the ID of the Operational Taxonomic Unit (OTU)
NA
missing value
domain
classified taxonomic domain of the OTU (bacteria or archaea)
classified taxonomic domain of the OTU (bacteria or archaea)
NA
missing value
phylum
classified taxonomic phylum of the OTU
classified taxonomic phylum of the OTU
NA
missing value
class
classified taxonomic class of the OTU
classified taxonomic class of the OTU
NA
missing value
order
classified taxonomic order of the OTU
classified taxonomic order of the OTU
NA
missing value
family
classified taxonomic family of the OTU
classified taxonomic family of the OTU
NA
missing value
genus
classified taxonomic genus of the OTU
classified taxonomic genus of the OTU
NA
missing value
h100e
number of OTUs of a specific classification found in the soil sampled at
the H100E location (Harvard Forest, 100 meters east of the central
subplot)
number
1
whole
NA
missing value
h100n
number of OTUs of a specific classification found in the soil sampled at
the H100N location (Harvard Forest, 100 meters north of the central
subplot)
number
1
whole
NA
missing value
h100s
number of OTUs of a specific classification found in the soil sampled at
the H100S location (Harvard Forest, 100 meters south of the central
subplot)
number
1
whole
NA
missing value
h100w
number of OTUs of a specific classification found in the soil sampled at
the H100W location (Harvard Forest, 100 meters west of the central
subplot)
number
1
whole
NA
missing value
h10e
number of OTUs of a specific classification found in the soil sampled at
the H10E location (Harvard Forest, 10 meters east of the central
subplot)
number
1
whole
NA
missing value
h10n
number of OTUs of a specific classification found in the soil sampled at
the H10N location (Harvard Forest, 10 meters north of the central
subplot)
number
1
whole
NA
missing value
h10s
number of OTUs of a specific classification found in the soil sampled at
the H10S location (Harvard Forest, 10 meters south of the central
subplot)
number
1
whole
NA
missing value
h10w
number of OTUs of a specific classification found in the soil sampled at
the H10W location (Harvard Forest, 10 meters west of the central
subplot)
number
1
whole
NA
missing value
h1e
number of OTUs of a specific classification found in the soil sampled at
the H1E location (Harvard Forest, 1 meter east of the central subplot)
number
1
whole
NA
missing value
h1n
number of OTUs of a specific classification found in the soil sampled at
the H1N location (Harvard Forest, 1 meter north of the central subplot)
number
1
whole
NA
missing value
h1s
number of OTUs of a specific classification found in the soil sampled at
the H1S location (Harvard Forest, 1 meter south of the central subplot)
number
1
whole
NA
missing value
h1w
number of OTUs of a specific classification found in the soil sampled at
the H1W location (Harvard Forest, 1 meter west of the central subplot)
number
1
whole
NA
missing value
h200e
number of OTUs of a specific classification found in the soil sampled at
the H200E location (Harvard Forest, 200 meters east of the central
subplot)
number
1
whole
NA
missing value
h200n
number of OTUs of a specific classification found in the soil sampled at
the H200N location (Harvard Forest, 200 meters north of the central
subplot)
number
1
whole
NA
missing value
h200s
number of OTUs of a specific classification found in the soil sampled at
the H200S location (Harvard Forest, 200 meters south of the central
subplot)
number
1
whole
NA
missing value
h200w
number of OTUs of a specific classification found in the soil sampled at
the H200W location (Harvard Forest, 200 meters west of the central
subplot)
number
1
whole
NA
missing value
h50e
number of OTUs of a specific classification found in the soil sampled at
the H50E location (Harvard Forest, 50 meters east of the central subplot)
number
1
whole
NA
missing value
h50n
number of OTUs of a specific classification found in the soil sampled at
the H50N location (Harvard Forest, 50 meters north of the central
subplot)
number
1
whole
NA
missing value
h50s
number of OTUs of a specific classification found in the soil sampled at
the H50S location (Harvard Forest, 50 meters south of the central
subplot)
number
1
whole
NA
missing value
h50w
number of OTUs of a specific classification found in the soil sampled at
the H50W location (Harvard Forest, 50 meters west of the central subplot)
number
1
whole
NA
missing value
hc
number of OTUs of a specific classification found in the soil sampled at
the HC location (Harvard Forest, central subplot)
number
1
whole
NA
missing value
257867
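The data table described above is a comma-delimited file with one header line, seven taxonomy columns, one count column per subplot, and NA as the missing value code. The following minimal sketch shows one way such a file might be summarized per sample using only the Python standard library. The function name and the three-row excerpt are hypothetical (the excerpt carries only two of the 21 sample columns and its values are invented for illustration); only the column names and the NA code come from the metadata.

```python
import csv
import io

# Taxonomy columns listed in the attribute descriptions; all others are samples.
TAXON_COLUMNS = ("otu", "domain", "phylum", "class", "order", "family", "genus")


def summarize_samples(csv_file):
    """Return {sample: (total_count, rows_with_nonzero_count)}, skipping NA values."""
    reader = csv.DictReader(csv_file)
    sample_columns = [c for c in reader.fieldnames if c not in TAXON_COLUMNS]
    totals = {c: 0 for c in sample_columns}
    present = {c: 0 for c in sample_columns}
    for row in reader:
        for col in sample_columns:
            value = row[col]
            if value in ("", "NA"):
                continue  # missing value code from the metadata
            count = int(value)
            totals[col] += count
            if count > 0:
                present[col] += 1
    return {c: (totals[c], present[c]) for c in sample_columns}


# Hypothetical excerpt standing in for hf262-01-hf-16s-rrna.csv.
example = io.StringIO(
    "otu,domain,phylum,class,order,family,genus,h1e,hc\r\n"
    "OTU_1,Bacteria,Proteobacteria,NA,NA,NA,NA,12,0\r\n"
    "OTU_2,Bacteria,Acidobacteria,NA,NA,NA,NA,3,7\r\n"
    "OTU_3,Archaea,NA,NA,NA,NA,NA,NA,1\r\n"
)
summary = summarize_samples(example)
print(summary)
```

To run this against the real file, the in-memory excerpt would be replaced by the downloaded CSV opened with newline="" as the csv module documentation recommends.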
biodiversity
international
regional
short-term measurements
https://harvardforest.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf261
https://harvardforest.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf263
https://harvardforest.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf260
https://harvardforest.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf264