Determining the Predictive Limit of QSAR Models

The research evaluating how model predictivity is affected by error in either the training or the test set is simple to describe conceptually. Benchmark datasets are downloaded from reputable sources and split into training and test sets. Randomized error is then added, and models are built on both the error-laden and the native training sets. Those models are used to predict both the error-laden and the native test sets, and differences in the standard statistics commonly used to assess predictivity are observed.
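
The workflow described above can be illustrated with a minimal sketch. It is not the code from the associated repository or publication. The synthetic one-descriptor dataset, the Gaussian form of the added error, the least-squares line used as the model, and every function and parameter name (run_experiment, sigma, and so on) are assumptions made for illustration only.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def add_noise(values, sigma, rng):
    """Return a copy of values with zero-mean Gaussian error added (assumed error model)."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def run_experiment(n=400, test_fraction=0.25, sigma=1.0, seed=0):
    """Train on native and noisy data, then score each model on native and noisy test data."""
    rng = random.Random(seed)

    # Hypothetical stand-in for a benchmark dataset: one descriptor, a linear
    # endpoint, and a small amount of intrinsic scatter in the "native" values.
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.2) for x in xs]

    # Random split into training and test sets.
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(n * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [xs[i] for i in train_idx]
    x_test = [xs[i] for i in test_idx]
    y_train = {"native": [ys[i] for i in train_idx]}
    y_test = {"native": [ys[i] for i in test_idx]}

    # Error-laden copies of both sets.
    y_train["noisy"] = add_noise(y_train["native"], sigma, rng)
    y_test["noisy"] = add_noise(y_test["native"], sigma, rng)

    # One model per training set, each evaluated against both test sets.
    results = {}
    for train_label, train_values in y_train.items():
        slope, intercept = fit_line(x_train, train_values)
        predicted = [slope * x + intercept for x in x_test]
        for test_label, test_values in y_test.items():
            results[(train_label, test_label)] = {
                "rmse": rmse(test_values, predicted),
                "r2": r_squared(test_values, predicted),
            }
    return results


if __name__ == "__main__":
    for (train_label, test_label), stats in run_experiment().items():
        print(f"train={train_label:<6} test={test_label:<6} "
              f"RMSE={stats['rmse']:.3f} R2={stats['r2']:.3f}")
```

Comparing the four train/test combinations side by side shows how much of an apparent loss in predictivity comes from error in the test values rather than from the model itself.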

This dataset is associated with the following publication: Kolmar, S., and C. Grulke. The Effect of Noise on the Predictive Limit of QSAR Models. Journal of Cheminformatics. Springer, New York, NY, USA, 13: 92, (2021).

Data and Resources

https://github.com/USEPA/CompTox-ChemInf-ModelE...

Field	Value
accessLevel	public
bureauCode	{020:00}
catalog_conformsTo	https://project-open-data.cio.gov/v1.1/schema
identifier	https://doi.org/10.23719/1524279
license	https://pasteur.epa.gov/license/sciencehub-license.html
modified	2021-06-21
programCode	{020:000}
publisher	U.S. EPA Office of Research and Development (ORD)
publisher_hierarchy	U.S. Government > U.S. Environmental Protection Agency > U.S. EPA Office of Research and Development (ORD)
references	{https://doi.org/10.1186/s13321-021-00571-7}
resource-type	Dataset
source_datajson_identifier	true
source_hash	980bb136a083e64c00da59d6bbf83b90c67b3e31
source_schema_version	1.1
Groups	AmeriGEOSS National Provider North America
Tags	amerigeo amerigeoss ckan error gaussian-process geo geoss model-evaluation national north-america prediction-error united-states
isopen	False
license_id	other-license-specified
license_title	other-license-specified
maintainer	Scott Kolmar
maintainer_email	kolmar.scott@epa.gov
metadata_created	2025-11-22T19:45:21.458979
metadata_modified	2025-11-22T19:45:21.458983
num_resources	1
num_tags	12
title	Determining the Predictive Limit of QSAR Models