Predict_Organ_Toxicity_ChemResTox_Data

We use a supervised machine learning strategy to systematically investigate the relative importance of study type, machine learning algorithm, and type of descriptor on predicting in vivo repeat-dose toxicity at the organ level. A total of 985 compounds were represented using chemical structural descriptors, ToxPrint chemotype descriptors, and bioactivity descriptors from ToxCast in vitro high-throughput screening assays. Using ToxRefDB, a total of 35 target organ outcomes were identified that contained at least 100 chemicals (50 positive and 50 negative). Supervised machine learning was performed using Naïve Bayes, k-nearest neighbor, random forest, classification and regression trees, and support vector classification approaches. Model performance was assessed based on F1 scores using five-fold cross-validation with balanced bootstrap replicates. Fixed effects modeling showed that the variance in F1 scores was explained mostly by target organ outcome, followed by descriptor type, machine learning algorithm, and interactions between these three factors. A combination of bioactivity and chemical structure or chemotype descriptors was the most predictive. Model performance improved with more chemicals (up to a maximum of 24%), and these gains were correlated (ρ = 0.92) with the number of chemicals.
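The evaluation terms used in the description above (F1 score, five-fold cross-validation, balanced bootstrap replicates) can be illustrated with a minimal sketch. It is not the authors' code: the function names `f1_score`, `balanced_bootstrap`, and `five_fold_indices` are hypothetical, and the resampling scheme (drawing an equal number of positive and negative chemicals with replacement) is an assumption about what "balanced bootstrap" means here, since the description does not spell it out.

```python
import random


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positives and no positive predictions.
    return 2 * tp / denom if denom else 0.0


def balanced_bootstrap(labels, n_per_class, rng):
    """Assumed scheme: draw n_per_class positives and n_per_class negatives
    with replacement, returning their indices into `labels`."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return ([rng.choice(pos) for _ in range(n_per_class)]
            + [rng.choice(neg) for _ in range(n_per_class)])


def five_fold_indices(n, rng, k=5):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


rng = random.Random(0)

# One true positive, one false positive, one false negative: F1 = 2/4.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5

# Toy outcome with 3 positive and 7 negative chemicals (made-up labels).
toy_labels = [1] * 3 + [0] * 7
replicate = balanced_bootstrap(toy_labels, n_per_class=50, rng=rng)
folds = five_fold_indices(len(replicate), rng)
print(len(replicate), [len(f) for f in folds])
```

In a full workflow each fold would in turn be held out for testing while a classifier is trained on the other four, and the F1 scores would be averaged over folds and replicates. The classifier itself is left out of this sketch.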

This dataset is associated with the following publication: Liu, J., G. Patlewicz, A. Williams, R. Thomas, and I. Shah. Predicting organ toxicity using in vitro bioactivity data and chemical structure. Chemical Research in Toxicology. American Chemical Society, Washington, DC, USA, 30: 2046–2059, (2017).

Data and Resources

https://gaftp.epa.gov/comptox/NCCT_Publication_...

Field	Value
accessLevel	public
bureauCode	{020:00}
catalog_conformsTo	https://project-open-data.cio.gov/v1.1/schema
identifier	https://doi.org/10.23719/1407008
license	https://pasteur.epa.gov/license/sciencehub-license.html
modified	2017-09-28
programCode	{020:095}
publisher	U.S. EPA Office of Research and Development (ORD)
publisher_hierarchy	U.S. Government > U.S. Environmental Protection Agency > U.S. EPA Office of Research and Development (ORD)
references	{https://doi.org/10.1021/acs.chemrestox.7b00084}
resource-type	Dataset
source_datajson_identifier	true
source_hash	bcc4bd45587e4e7a387a983436fe09e4357d47a5
source_schema_version	1.1
Groups	AmeriGEOSS National Provider North America
Tags	amerigeo amerigeoss bioactivity chemotypes ckan geo geoss high-throughput-screening high-throughput-toxicology machine-learning national north-america qsar toxcast toxrefdb united-states
isopen	False
license_id	other-license-specified
license_title	other-license-specified
maintainer	Keith Houck
maintainer_email	houck.keith@epa.gov
metadata_created	2025-11-22T17:22:44.677106
metadata_modified	2025-11-22T17:22:44.677110
num_resources	1
num_tags	16
title	Predict_Organ_Toxicity_ChemResTox_Data