Dataset for modeling spatial and temporal variation in natural background specific conductivity

This file contains the data set used to develop a random forest model predict background specific conductivity for stream segments in the contiguous United States. This Excel readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definition of the abbreviations and the measurement units. Each row is a unique sample described as R** which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month. To develop models that make stream-specific predictions across the contiguous United States, we used StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km2 and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) system (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015 thus coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month. SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded. Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered as potentially minimally stressed where watersheds had 0 - 0.5% impervious surface, 0 – 5% urban, 0 – 10% agriculture, and population densities from 0.8 – 30 people/km2 (Table S3). Watersheds with observations with large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds, with a tidal influence or unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration.

This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).

Data and Resources

Field Value
accessLevel public
bureauCode {020:00}
catalog_conformsTo https://project-open-data.cio.gov/v1.1/schema
describedBy https://pasteur.epa.gov/uploads/10.23719/1500945/documents/Data_Dictionary_RefModelData_20180611_508.docx
describedByType application/vnd.openxmlformats-officedocument.wordprocessingml.document
identifier https://doi.org/10.23719/1500945
license https://pasteur.epa.gov/license/sciencehub-license.html
modified 2018-06-14
programCode {020:096}
publisher U.S. EPA Office of Research and Development (ORD)
publisher_hierarchy U.S. Government > U.S. Environmental Protection Agency > U.S. EPA Office of Research and Development (ORD)
references {https://doi.org/10.1021/acs.est.8b06777}
resource-type Dataset
source_datajson_identifier true
source_hash d96bf15282072d404d6881af0199fc82e50c13df
source_schema_version 1.1
Groups
  • AmeriGEOSS
  • National Provider
  • North America
Tags
  • amerigeo
  • amerigeoss
  • carbonate-salts
  • ckan
  • conductivity
  • drought
  • geo
  • geoss
  • modeled-conductivity
  • national
  • north-america
  • streamcat
  • united-states
isopen False
license_id other-license-specified
license_title other-license-specified
maintainer Susan Cormier
maintainer_email cormier.susan@epa.gov
metadata_created 2025-11-22T02:21:30.415475
metadata_modified 2025-11-22T02:21:30.415479
notes This file contains the data set used to develop a random forest model predict background specific conductivity for stream segments in the contiguous United States. This Excel readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definition of the abbreviations and the measurement units. Each row is a unique sample described as R** which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month. To develop models that make stream-specific predictions across the contiguous United States, we used StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km2 and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) system (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015 thus coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month. SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded. Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered as potentially minimally stressed where watersheds had 0 - 0.5% impervious surface, 0 – 5% urban, 0 – 10% agriculture, and population densities from 0.8 – 30 people/km2 (Table S3). Watersheds with observations with large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds, with a tidal influence or unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).
num_resources 1
num_tags 13
title Dataset for modeling spatial and temporal variation in natural background specific conductivity