Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters

As hydrocarbon production from hydraulic fracturing and other methods generates increasingly large volumes of water, innovative methods must be explored for the treatment and reuse of these waters. Understanding the general chemistry of these fluids is essential to selecting treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine whether systematic variations exist between produced waters and geologic environment that could be used to accurately assign a water sample to a given geologic province. Two datasets were used: PWGD7, with fewer attributes (n = 7) but more samples (n = 58,541), and PWGD9, with more attributes (n = 9) but fewer samples (n = 33,271). The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. PWGD7 and PWGD9 contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. PWGD7 and PWGD9 were divided into training and test sets using an 80/20 split and a 90/10 split, respectively.
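The train/test division described above can be illustrated with a minimal sketch. It assumes a split stratified by province, which the abstract does not state; the function name `stratified_split`, the toy records, and the seed are all hypothetical and are not taken from the released code.

```python
import random
from collections import defaultdict

def stratified_split(samples, label_of, train_fraction, seed=0):
    """Split samples into (train, test), keeping each label's share roughly equal.

    Assumption: stratification by label (province). The source only gives the
    80/20 and 90/10 proportions, not how samples were drawn.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]          # copy so the input is not mutated
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy records of the form (province, value); the values are placeholders,
# not measurements from the USGS PWGD.
toy = [("Permian", 100 + i) for i in range(10)] + \
      [("Williston", 200 + i) for i in range(20)]

train_80, test_20 = stratified_split(toy, lambda s: s[0], 0.8)
train_90, test_10 = stratified_split(toy, lambda s: s[0], 0.9)
print(len(train_80), len(test_20), len(train_90), len(test_10))
```

Splitting within each province keeps rare provinces represented in both sets, which matters when per-province accuracies are reported.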
Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on the PWGD9 dataset, suggesting that a larger sample size, fewer attributes, or both lead to a more successful predictive algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may lack information about their true geochemical diversity, while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province, informing the design of future treatment facilities to reclaim wastewater. We anticipate that this classification model will improve over time as more diverse data are added to the USGS PWGD.
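The per-province balanced accuracies quoted above can be read through the following sketch. It assumes the common one-vs-rest definition, the mean of sensitivity and specificity for a single class; the abstract does not say which definition was used. The function name and the toy labels "A", "B", and "C" are hypothetical.

```python
def balanced_accuracy(actual, predicted, positive):
    """One-vs-rest balanced accuracy for the class `positive`.

    Assumed definition: (sensitivity + specificity) / 2.
    Returns None when either rate is undefined (no positives or no negatives).
    """
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # class sample correctly assigned
            else:
                fn += 1   # class sample assigned elsewhere
        else:
            if p == positive:
                fp += 1   # other sample wrongly assigned to this class
            else:
                tn += 1   # other sample correctly kept out of this class
    if tp + fn == 0 or tn + fp == 0:
        return None
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Toy labels standing in for provinces; not results from the study.
actual    = ["A", "A", "B", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "A", "C"]

for province in ("A", "B", "C"):
    print(province, balanced_accuracy(actual, predicted, province))
```

Under this definition a province is scored on both how many of its own samples are recovered and how few samples from other provinces are wrongly assigned to it, so the value is less sensitive to class size than overall accuracy.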

Data and Resources

Field Value
accessLevel public
bureauCode {010:12}
catalog_@context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
catalog_conformsTo https://project-open-data.cio.gov/v1.1/schema
catalog_describedBy https://project-open-data.cio.gov/v1.1/schema/catalog.json
identifier USGS:60ec47c6d34e3bf20b41756f
metadata_type geospatial
modified 20210726
old-spatial -112.15118408104, 25.774766586665, -77.082824707442, 48.614464422854
publisher U.S. Geological Survey
publisher_hierarchy Department of the Interior > U.S. Geological Survey
resource-type Dataset
source_datajson_identifier true
source_hash c0f6e9250a13ba03ab6e2dd02548d87ade226dd9
source_schema_version 1.1
spatial {"type": "Polygon", "coordinates": [[[-112.15118408104, 25.774766586665], [-112.15118408104, 48.614464422854], [ -77.082824707442, 48.614464422854], [ -77.082824707442, 25.774766586665], [-112.15118408104, 25.774766586665]]]}
theme {geospatial}
Groups
  • AmeriGEOSS
  • National Provider
  • North America
Tags
  • alabama
  • alaska
  • alaska-region
  • amerigeo
  • amerigeoss
  • arizona
  • arkansas
  • california
  • cambrian-period
  • carboniferous-period
  • cenozoic-era
  • ckan
  • colorado
  • colorado-plateau-and-basin-and-range-region
  • cretaceous-period
  • devonian-period
  • eastern-region
  • elk-point-group
  • florida
  • geo
  • geochemical-analysis
  • geochemistry
  • georgia
  • geoss
  • gulf-coast-region
  • idaho
  • illinois
  • indiana
  • iowa
  • jefferson-group
  • jurassic-period
  • k-nearest-neighbors
  • kansas
  • kentucky
  • louisiana
  • lousiana
  • machine-learning
  • madison-group
  • mannville-group
  • maryland
  • mesozoic-era
  • michigan
  • mid-continent-region
  • mississippi
  • mississippian-period
  • missouri
  • montana
  • naive-bayes
  • national
  • nebraska
  • neogene-period
  • nevada
  • new-mexico
  • new-york
  • north-america
  • north-dakota
  • ohio
  • oklahoma
  • ontario
  • ordovician-period
  • oregon
  • pacific-coast-region
  • paleogene-period
  • paleozoic-era
  • paleozoic-period
  • pennsylvania
  • pennsylvanian-period
  • permian-period
  • precambrian-period
  • produced-waters
  • proterozoic-era
  • random-forest
  • rocky-mountains-and-northern-great-plains-region
  • saskatchewan
  • silurian-period
  • south-dakota
  • tennessee
  • tertiary-period
  • texas
  • three-forks-group
  • triassic-period
  • united-states
  • usgs-60ec47c6d34e3bf20b41756f
  • utah
  • virginia
  • washington
  • west-texas-and-eastern-new-mexico-region
  • west-virginia
  • wyoming
isopen False
license_id notspecified
license_title License not specified
maintainer Jenna L Shelton
maintainer_email jlshelton@usgs.gov
metadata_created 2025-11-19T21:14:00.100333
metadata_modified 2025-11-19T21:14:00.100340
num_resources 2
num_tags 89
title Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters
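
The `old-spatial` and `spatial` fields in the table above describe the same extent in two forms: a comma-separated bounding box and a GeoJSON polygon. The sketch below shows how one maps to the other, assuming the bounding box is ordered west, south, east, north (consistent with the values in this record). The function name `bbox_to_polygon` is hypothetical.

```python
import json

def bbox_to_polygon(bbox_text):
    """Convert a 'west, south, east, north' string into a GeoJSON Polygon dict.

    Assumption: the corner order of the ring follows the `spatial` field of
    this record (SW, NW, NE, SE, back to SW).
    """
    west, south, east, north = (float(v) for v in bbox_text.split(","))
    ring = [
        [west, south],
        [west, north],
        [east, north],
        [east, south],
        [west, south],  # a GeoJSON linear ring repeats its first position
    ]
    return {"type": "Polygon", "coordinates": [ring]}

old_spatial = "-112.15118408104, 25.774766586665, -77.082824707442, 48.614464422854"
polygon = bbox_to_polygon(old_spatial)
print(json.dumps(polygon))
```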