FastGroup: A program to dereplicate libraries of 16S rDNA sequences

Background
Ribosomal 16S DNA sequences are an essential tool for identifying and classifying microbes. High-throughput DNA sequencing now makes it economically possible to produce very large datasets of 16S rDNA sequences in short time periods, necessitating new computer tools for analyses. Here we describe FastGroup, a Java program designed to dereplicate libraries of 16S rDNA sequences. By dereplication we mean to: 1) compare all the sequences in a data set to each other, 2) group similar sequences together, and 3) output a representative sequence from each group. In this way, duplicate sequences are removed from a library.
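
The three dereplication steps can be illustrated with a minimal Python sketch. This is not FastGroup's own code (FastGroup is a Java program). The function names, the ungapped position-by-position identity measure, and the greedy strategy of comparing each sequence only against existing group representatives (rather than every sequence against every other) are simplifying assumptions made for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (ungapped).

    Assumption: both sequences start at the same position, so no alignment
    step is needed. Real 16S reads generally need trimming or alignment first.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def dereplicate(seqs, threshold=0.97):
    """Greedily group sequences; the first member of each group represents it."""
    groups = []
    for s in seqs:
        for g in groups:
            # Compare against the representative only (a simplification of
            # the all-against-all comparison described in the text).
            if percent_identity(s, g[0]) >= threshold:
                g.append(s)
                break
        else:
            groups.append([s])  # no similar group found: start a new one
    return groups


def representatives(groups):
    """Return one representative sequence per group."""
    return [g[0] for g in groups]


if __name__ == "__main__":
    library = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
    for rep in representatives(dereplicate(library, threshold=0.85)):
        print(rep)
```

Because the grouping is greedy, the result can depend on input order; an all-pairs comparison followed by clustering would avoid that at a higher computational cost.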

Results
FastGroup was tested using a library of single-pass, bacterial 16S rDNA sequences cloned from coral-associated bacteria. We found that the optimal strategy for dereplicating these sequences was to: 1) trim ambiguous bases from the 5' end of the sequences and all sequence 3' of the conserved Bact517 site, 2) match the sequences from the 3' end, and 3) group sequences ≥97% identical to each other.
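
The trimming and 3'-anchored matching described above can be sketched in Python as follows. This is an illustrative sketch, not FastGroup's implementation: the function names are hypothetical, "ambiguous" is taken to mean any character other than A, C, G, or T, and the conserved site is passed in as a `site` parameter because its exact sequence is not given here. The motif `GGCC` in the usage lines is a toy placeholder, not the real Bact517 sequence.

```python
UNAMBIGUOUS = set("ACGT")


def trim_5prime_ambiguous(seq: str) -> str:
    """Drop the leading run of ambiguous (non-ACGT) bases from the 5' end."""
    i = 0
    while i < len(seq) and seq[i] not in UNAMBIGUOUS:
        i += 1
    return seq[i:]


def trim_3prime_of_site(seq: str, site: str) -> str:
    """Remove everything 3' of the first occurrence of the conserved site.

    The site itself is kept. Assumption: a read without the site is
    returned unchanged; a real pipeline might discard or flag it instead.
    """
    pos = seq.find(site)
    if pos == -1:
        return seq
    return seq[: pos + len(site)]


def prepare(seq: str, site: str) -> str:
    """Uppercase the read, then apply the 5' and 3' trimming steps in order."""
    return trim_3prime_of_site(trim_5prime_ambiguous(seq.upper()), site)


def identity_from_3prime(a: str, b: str) -> float:
    """Ungapped identity with both sequences anchored at their 3' ends."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(reversed(a), reversed(b)) if x == y)
    return matches / n


if __name__ == "__main__":
    toy_site = "GGCC"  # placeholder motif, NOT the real Bact517 sequence
    a = prepare("NNACGTTGGCCAAAT", toy_site)
    b = prepare("nCGTTGGCCTTTT", toy_site)
    print(a, b, identity_from_3prime(a, b) >= 0.97)
```

Once every read ends at the same conserved site, comparing from the 3' end lines up homologous positions even when the reads begin at different points on the 5' side, which is one plausible reason this strategy worked well for single-pass reads.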


Conclusions
The FastGroup program simplifies the dereplication of 16S rDNA sequence libraries and prepares the raw sequences for subsequent analyses.

Data and Resources

Field Value
accessLevel public
bureauCode {009:25}
catalog_@context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
catalog_@id https://healthdata.gov/data.json
catalog_conformsTo https://project-open-data.cio.gov/v1.1/schema
catalog_describedBy https://project-open-data.cio.gov/v1.1/schema/catalog.json
identifier https://healthdata.gov/api/views/4y98-uvbn
issued 2025-07-14
landingPage https://healthdata.gov/d/4y98-uvbn
modified 2025-09-06
programCode {009:033}
publisher National Institutes of Health
resource-type Dataset
source_datajson_identifier true
source_hash c0f07686f6065716217350c7466c77fd22a6b53f1003e797dcebd7e7f141c29f
source_schema_version 1.1
theme {NIH}
Groups
  • AmeriGEOSS
  • National Provider
  • North America
Tags
  • 16s-rdna
  • AmeriGEO
  • AmeriGEOSS
  • CKAN
  • GEO
  • GEOSS
  • National
  • North America
  • United States
  • bacterial-sequences
  • microbial-identification
  • nih
  • sequence-analysis
isopen False
license_id notspecified
license_title License not specified
maintainer NIH
maintainer_email info@nih.gov
metadata_created 2025-09-23T16:43:11.794495
metadata_modified 2025-09-23T16:43:11.794502
num_resources 1
num_tags 13
title FastGroup: A program to dereplicate libraries of 16S rDNA sequences