Antibiotics Are Man's Greatest Invention

Thirty complete Streptomyces genome sequences for mining novel secondary metabolite biosynthetic gene clusters

Genomic DNA (gDNA) extraction

A total of 30 streptomycetes were purchased from the Korean Collection for Type Cultures (KCTC, Korea). A stock of each strain was inoculated into 50 mL of liquid culture medium with 0.16 g mL−1 of glass beads (3 ± 0.3 mm diameter) in a 250 mL baffled flask and grown at 30 °C in an orbital shaker at 200 rpm. Each strain was grown in one of four culture media: R5(–) medium (25 mM TES (pH 7.2), 103 g L−1 sucrose, 1% (w/v) glucose, 5 g L−1 yeast extract, 10.12 g L−1 MgCl2∙6H2O, 0.25 g L−1 K2SO4, 0.1 g L−1 casamino acids, 0.08 g L−1 ZnCl2, 0.4 mg L−1 FeCl3, 0.02 mg L−1 CuCl2∙2H2O, 0.02 mg L−1 MnCl2∙4H2O, 0.02 mg L−1 Na2B4O7∙10H2O, and 0.02 mg L−1 (NH4)6Mo7O24∙4H2O), 1 × sporulation medium (3.33 g L−1 glucose, 1 g L−1 yeast extract, 1 g L−1 beef extract, 2 g L−1 tryptose, and 0.006 g L−1 FeSO4∙7H2O), YEME medium (340 g L−1 sucrose, 10 g L−1 glucose, 3 g L−1 yeast extract, 5 g L−1 bacto peptone, and 3 g L−1 oxoid malt extract), or MYM medium (4 g L−1 maltose, 4 g L−1 yeast extract, 10 g L−1 malt extract). For gDNA extraction, 25 mL of cultured cells were harvested at the exponential growth phase and washed twice with the same volume of 10 mM EDTA, followed by lysozyme (10 mg mL−1) treatment at 37 °C for 45 min. gDNA was extracted using a Wizard Genomic DNA Purification Kit (Promega, Madison, WI, USA) according to the manufacturer’s instructions. The quality and quantity of the extracted gDNA samples were evaluated using 1% agarose gel electrophoresis and a Nanodrop (Thermo Fisher Scientific, Waltham, MA, USA), respectively.

Short-read (Illumina) genome sequencing

To construct the short-read genome sequencing libraries, 2.5 μg of gDNA was sheared to approximately 350 bp with a Covaris instrument (Covaris Inc., Woburn, MA, USA) under the following conditions: power 175, duty factor 20%, cycles per burst 200, time 23 s, repeated 8 times. The libraries were constructed using a TruSeq DNA PCR-Free LT kit (Illumina Inc., San Diego, CA, USA) following the manufacturer’s instructions. Briefly, the fragmented DNA samples were cleaned and end-repaired, followed by adaptor ligation and bead-based size selection in the range of 400 to 500 bp. The quantity of the final libraries was measured using a Qubit® dsDNA HS Assay Kit (Thermo Fisher Scientific) and the library size was determined using an Agilent 2200 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Of the constructed sequencing libraries, 29 were sequenced on the HiSeq 2500 (Illumina Inc.) as 100 bp single-end reads, and the remaining library, for S. tsukubaensis, was sequenced on the MiSeq v.2 (Illumina Inc.) with a 50 bp single-read recipe. In total, 0.46 to 5.18 Gbp of raw sequence data were obtained per library, and read quality was examined using the sequencing QC report function of CLC Genomics Workbench version 6.5.1 (CLC bio, Denmark) (Online-only Table 1 and Fig. 1a).
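Because the Illumina read quality in Fig. 1a is reported as Phred scores, it may help to recall how a Phred score relates to a base-calling error probability, Q = −10 log10(p). The following is a minimal sketch of that conversion, not part of the published workflow. The function names are hypothetical, and the Phred+33 ASCII offset is an assumption about how the FASTQ quality strings are encoded.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred score Q = -10*log10(p)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, ASSUMING Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


if __name__ == "__main__":
    # Hypothetical quality string, for illustration only.
    scores = decode_phred33("I5!")
    for q in scores:
        print(q, phred_to_error_prob(q))
```

Under this relation, a Phred score of 20 corresponds to a 1% chance that the base call is wrong, and an error probability of 0.05 corresponds to a Phred score of about 13.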

Fig. 1

Quality of the genome sequencing data. (a) Distribution of Illumina read quality based on Phred score. (b) Read quality distribution of PacBio reads. The black line indicates the total number of bases in reads whose read quality exceeds the corresponding value on the x-axis.

Long-read (PacBio) genome sequencing

A total of 5 μg of gDNA was used as input for PacBio genome sequencing library preparation. The sequencing library was constructed with the PacBio SMRTbell™ Template Prep Kit (Pacific Biosciences, Menlo Park, CA, USA) following the manufacturer’s instructions. Fragments smaller than 20 kbp were removed using the BluePippin size selection system (Sage Science, Beverly, MA, USA) and the constructed libraries were validated using an Agilent 2100 Bioanalyzer (Agilent Technologies). The final SMRTbell libraries were sequenced using one or two SMRT cells with P6-C4 chemistry (DNA Sequencing Reagent 4.0) on the PacBio RS II sequencing platform (Pacific Biosciences). Approximately 0.5 to 3.0 Gbp of raw sequence data were generated per strain (Online-only Table 1).

Genome assembly

Of the raw PacBio sequencing reads, only those with a read quality value greater than 0.75 and a length longer than 50 bp were retained (Fig. 1b). The filtered reads were assembled with the hierarchical genome assembly process workflow (HGAP, version 2.3), including consensus polishing with Quiver [18]. For each assembled contig, error correction was performed based on the estimated genome size and average coverage. Raw reads from Illumina sequencing were quality trimmed using CLC Genomics Workbench version 6.5.1 (ambiguous limit 2 and quality limit 0.05) and assembled using the de novo assembly function of CLC Genomics Workbench version 6.5.1 with default parameters. To extend the assembled contigs, all assembled PacBio and Illumina contigs were aligned using MAUVE 2.4.0 [19] and linked using the GAP5 program (Staden package) [20].
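The PacBio read filter described above can be sketched as a simple predicate over each read. This is an illustrative sketch only, not the filtering code of the HGAP workflow. The `Read` class and the function names are hypothetical, and the thresholds are taken from the text (read quality greater than 0.75, length longer than 50 bp).

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical container for one PacBio read."""
    name: str
    sequence: str
    read_quality: float  # per-read quality value between 0 and 1


def passes_filter(read: Read, min_quality: float = 0.75, min_length: int = 50) -> bool:
    """Keep a read only if it exceeds BOTH thresholds (strict comparisons, as in the text)."""
    return read.read_quality > min_quality and len(read.sequence) > min_length


def filter_reads(reads: list[Read], min_quality: float = 0.75, min_length: int = 50) -> list[Read]:
    """Return the reads that pass the quality and length filter, in input order."""
    return [r for r in reads if passes_filter(r, min_quality, min_length)]


if __name__ == "__main__":
    # Made-up reads for illustration only.
    reads = [
        Read("good", "A" * 2000, 0.86),
        Read("low_quality", "A" * 2000, 0.70),
        Read("too_short", "A" * 40, 0.90),
    ]
    print([r.name for r in filter_reads(reads)])
```

Both comparisons are strict, so a read with a quality of exactly 0.75 or a length of exactly 50 bp is discarded in this sketch.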

Genome correction

Quality-trimmed Illumina sequencing reads were mapped to the assembled genome using CLC Genomics Workbench version 6.5.1 (mismatch cost 2, insertion cost 3, deletion cost 3, length fraction 0.9, and similarity fraction 0.9). Positions at which more than 80% of the mapped Illumina reads conflicted with the assembly were corrected to the Illumina sequence (Table 1). In addition, the percentage of Illumina reads mapped onto the assembled genome was used as an indicator of assembly completeness (Table 1 and Fig. 2b). The completeness of the gene space was estimated using BUSCO v3 (Table 2) [21].
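The correction rule above (replace an assembled base when more than 80% of the mapped Illumina reads support a different base) can be illustrated for single-base substitutions as follows. This is a simplified sketch under stated assumptions, not the procedure implemented in CLC Genomics Workbench. The function names are hypothetical, insertions and deletions are not modeled, and the input is assumed to be the list of read bases already piled up at one genomic position.

```python
from collections import Counter


def correct_base(assembled_base: str, read_bases: list[str], min_frequency: float = 0.8) -> str:
    """Return the corrected base at one position.

    The assembled base is replaced only if a different base is supported by
    MORE than `min_frequency` of the mapped reads covering the position.
    """
    if not read_bases:
        return assembled_base  # no read evidence, keep the assembly
    base, count = Counter(read_bases).most_common(1)[0]
    if base != assembled_base and count / len(read_bases) > min_frequency:
        return base
    return assembled_base


def mapped_percentage(mapped_reads: int, total_reads: int) -> float:
    """Percentage of reads mapped to the assembly, used as a completeness indicator."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


if __name__ == "__main__":
    # Made-up pileup: 9 of 10 reads show G where the assembly has A.
    print(correct_base("A", list("GGGGGGGGGA")))
    print(mapped_percentage(990, 1000))
```

With the strict comparison, a conflict supported by exactly 80% of the reads leaves the assembled base unchanged.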

Table 1 The statistics of genome assembly and correction.

Fig. 2

Genome assembly of 30 streptomycetes. (a) Strategy for genome assembly and correction. (b) Profile of Illumina reads mapped onto the assembled genomes. Data were visualized using SignalMap (Roche NimbleGen, Inc.). The red line indicates the average Illumina read coverage across all genomic positions.

Table 2 Gene space completeness of completed genomes.

Genome annotation and secondary metabolite biosynthetic gene cluster prediction

The complete genome sequences of the streptomycetes were submitted to the NCBI GenBank database and annotated using the latest version of the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [22]. Using the GenBank-formatted file of each genome as input, secondary metabolite biosynthetic gene clusters were predicted using antiSMASH 4.0 [23].