Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters

As hydraulic fracturing and other hydrocarbon production methods generate large volumes of produced water, innovative methods must be explored for treating and reusing these waters. Understanding the general chemistry of these fluids is essential to selecting treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine whether systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used: PWGD7, with fewer attributes (n = 7) but more samples (n = 58,541), and PWGD9, with more attributes (n = 9) but fewer samples (n = 33,271). The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. PWGD7 and PWGD9 contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. PWGD7 and PWGD9 were divided into training and test sets using an 80/20 split and a 90/10 split, respectively.
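The preprocessing described above (per-province outlier removal followed by a train/test split) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code distributed with this dataset: it reads "99% confidence interval" as mean ± 2.576 standard deviations within each province, and it assumes the split is stratified by province. The function names and the `province` key are hypothetical.

```python
import random
import statistics
from collections import defaultdict

# Assumption: "99% confidence interval" is read as mean +/- 2.576 standard deviations.
Z_99 = 2.576


def group_by_province(samples):
    """Group sample dicts by their (hypothetical) 'province' key."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["province"]].append(sample)
    return groups


def remove_outliers(samples, attributes, z=Z_99):
    """Keep samples that fall within mean +/- z*sd of their own province for every attribute."""
    kept = []
    for group in group_by_province(samples).values():
        bounds = {}
        for attr in attributes:
            values = [s[attr] for s in group]
            mean = statistics.fmean(values)
            sd = statistics.pstdev(values)
            bounds[attr] = (mean - z * sd, mean + z * sd)
        for s in group:
            if all(bounds[a][0] <= s[a] <= bounds[a][1] for a in attributes):
                kept.append(s)
    return kept


def stratified_split(samples, train_fraction, seed=0):
    """Split each province separately so both sets contain every province (an assumption)."""
    rng = random.Random(seed)
    train, test = [], []
    for group in group_by_province(samples).values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test
```

Under these assumptions, `stratified_split(clean, 0.8)` would correspond to the 80/20 split used for PWGD7 and `stratified_split(clean, 0.9)` to the 90/10 split used for PWGD9.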
Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on the PWGD9 dataset, suggesting that a larger sample size, fewer attributes, or both lead to a more successful predictive algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity, while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province, which could inform the design of future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
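To make the two kinds of accuracy figures above concrete, the sketch below computes overall accuracy and a per-province balanced accuracy from lists of actual and predicted provinces. The description does not state how balanced accuracy was defined; this sketch assumes the common one-vs-rest definition, the mean of sensitivity and specificity for each class. The function names are hypothetical.

```python
def overall_accuracy(actual, predicted):
    """Fraction of samples assigned to the correct province."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def balanced_accuracy_by_class(actual, predicted):
    """One-vs-rest balanced accuracy: (sensitivity + specificity) / 2 per province."""
    pairs = list(zip(actual, predicted))
    result = {}
    for cls in sorted(set(actual) | set(predicted)):
        tp = sum(a == cls and p == cls for a, p in pairs)
        fn = sum(a == cls and p != cls for a, p in pairs)
        fp = sum(a != cls and p == cls for a, p in pairs)
        tn = sum(a != cls and p != cls for a, p in pairs)
        # A class with no positives (or no negatives) has an undefined rate.
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        result[cls] = (sensitivity + specificity) / 2
    return result


actual = ["Raton", "Raton", "Anadarko", "Anadarko"]
predicted = ["Raton", "Raton", "Raton", "Anadarko"]
scores = balanced_accuracy_by_class(actual, predicted)
```

Under this definition, a province whose samples are almost never recognized but that is rarely predicted by mistake scores close to 50%, so a value near 50% indicates little discriminating power for that province.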

Data and Resources

Original Metadata (XML)
The metadata in its original format
Digital Data (XML)
Landing page for access to the data

Field	Value
accessLevel	public
bureauCode	{010:12}
catalog_@context	https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
catalog_conformsTo	https://project-open-data.cio.gov/v1.1/schema
catalog_describedBy	https://project-open-data.cio.gov/v1.1/schema/catalog.json
identifier	USGS:60ec47c6d34e3bf20b41756f
metadata_type	geospatial
modified	20210726
old-spatial	-112.15118408104, 25.774766586665, -77.082824707442, 48.614464422854
publisher	U.S. Geological Survey
publisher_hierarchy	Department of the Interior > U.S. Geological Survey
resource-type	Dataset
source_datajson_identifier	true
source_hash	c0f6e9250a13ba03ab6e2dd02548d87ade226dd9
source_schema_version	1.1
spatial	{"type": "Polygon", "coordinates": [[[-112.15118408104, 25.774766586665], [-112.15118408104, 48.614464422854], [ -77.082824707442, 48.614464422854], [ -77.082824707442, 25.774766586665], [-112.15118408104, 25.774766586665]]]}
theme	{geospatial}
Groups	AmeriGEOSS National Provider North America
Tag	alabama alaska alaska-region amerigeo amerigeoss arizona arkansas california cambrian-period carboniferous-period cenozoic-era ckan colorado colorado-plateau-and-basin-and-range-region cretaceous-period devonian-period eastern-region elk-point-group florida geo geochemical-analysis geochemistry georgia geoss gulf-coast-region idaho illinois indiana iowa jefferson-group jurassic-period k-nearest-neighbors kansas kentucky louisiana lousiana machine-learning madison-group mannville-group maryland mesozoic-era michigan mid-continent-region mississippi mississippian-period missouri montana naive-bayes national nebraska neogene-period nevada new-mexico new-york north-america north-dakota ohio oklahoma ontario ordovician-period oregon pacific-coast-region paleogene-period paleozoic-era paleozoic-period pennsylvania pennsylvanian-period permian-period precambrian-period produced-waters proterozoic-era random-forest rocky-mountains-and-northern-great-plains-region saskatchewan silurian-period south-dakota tennessee tertiary-period texas three-forks-group triassic-period united-states usgs-60ec47c6d34e3bf20b41756f utah virginia washington west-texas-and-eastern-new-mexico-region west-virginia wyoming
isopen	False
license_id	notspecified
license_title	License not specified
maintainer	Jenna L Shelton
maintainer_email	jlshelton@usgs.gov
metadata_created	2025-11-19T21:14:00.100333
metadata_modified	2025-11-19T21:14:00.100340
num_resources	2
num_tags	89
title	Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters