NaPDoS

Short Description: 

The Natural Product Domain Seeker (NaPDoS) pipeline is a bioinformatic tool for the rapid detection and analysis of secondary metabolite genes associated with polyketide synthase and non-ribosomal peptide synthesis pathways. This tool detects, extracts, and classifies ketosynthase (KS) and condensation (C) domains from DNA or amino acid sequence data, using methods that are particularly robust in cases of fragmented, incomplete draft genomes and environmental sequences.

Candidate secondary metabolite domains are identified and classified by sequence comparison to a broad set of manually curated reference genes from well-characterized chemical pathways. Candidate gene sequences are extracted, trimmed, translated (if necessary) and subjected to domain-specific phylogenetic clustering to predict what their putative products might be, and to determine whether these products are likely to produce compounds similar to or different from previously known biosynthetic pathways.

The current web-based NaPDoS pipeline can only accommodate relatively small input data sets, due to computational resource limitations. We wish to overcome these limitations to analyze larger data sets by converting the pipeline into a parallelized BioKepler workflow.

PUBLICATION: Ziemert N, Podell S, Penn K, Badger JH, Allen E, Jensen PR The Natural Product Domain Seeker NaPDoS: A Phylogeny Based Bioinformatic Tool to Classify Secondary Metabolite Gene Diversity. PLoS One. 2012;7(3):e34064 Epub 2012 Mar 29. PubMed PMID: 22479523

PUBLIC WEBSERVER: http://npdomainseeker.ucsd.edu

Workflow Inputs: 

user-entered multi-fasta file of nucleic acid sequences

User alterable parameters: 

1. Minimum match length for BLAST alignment

2. Maximum e-value for HMMSEARCH domain identification  

3. Maximum e-value for pathway assignment using BLAST

Reference information (not alterable by users): 

1. Individual protein sequences for ~700 reference examples in faa format, to be used in BLAST searches (ungapped fasta amino acids)

2. Curated alignment for ~400 classified reference sequences in afa format (gapped fasta amino acids). These are a subset of the reference sequence examples.

3. Custom HMM pattern(s) for high stringency domain search.

Output: 

 1. Trimmed candidate protein sequences inserted into reference tree

2. Table of results, showing numbers and positions of candidates on input sequences

  • Candidate ID
  • Parent (input) nucleic acid sequence id
  • Start position, End position, Reading frame
  • Reference database match id, common name
  • Percent identity. Align length, E-value
  • Pathway product info (link to metadata)

3. Trimmed nucleic acid sequences of matches

4. Trimmed amino acid sequences of matches

5. Trimmed amino acid alignment of candidate sequences with reference sequences

Validation and Test Plan: 

1. Test positive and negative control input sequences absent from reference database (including close and distant relatives, with target domains at known locations)

2. Test handling of case where no hits are found.

3. Test handling of large hit numbers (thousands) and large input files (multi-GB).

4. Verify correctness of trimmed sequences, coordinates, number of hits, alignments, and positions in final tree by comparing pipeline to results on same input data using gold standard (existing website http://napdos.ucsd.edu).

Parallelization Opportunities and Potential Future Extensions: 

Rate limiting steps are currently BLAST search and MUSCLE alignment.

Executable Codes Used and Commands to Execute Them: 
  1. BLASTX (parallel) input versus reference sequences.

    formatdb -i KS_ref_seqs_12062011.faa

    blastall -p blastx -d KS_ref_seqs_12062011.faa -i test1_input.fna -b1 -v1 -m8 -e $max_evalue -o test1_input.m8

  2. Parse, extract and trim preliminary hits, translate into protein sequences (custom perl scripts)

    select_tab_quant_greater_than.pl test1_input.m8 3 $min_aln > test1_input.m8.size_select

    cut -f1,7,8 test1_input.m8.size_select |sort |uniq > getlist

    get_subsequence.pl getlist test1_input.fna > test1_input_trim_cands.fna

    transeq -sequence test1_input_trim_cands.fna -frame 1 -table 11 -outseq test1_input_trim_cands.faa

  3. HMMSEARCH on preliminary protein hits to remove false positives (parallel - optional)

    hmmpfam -E 1e-5 PKS_KS.hmm test1_input_trim_cands.faa > test1_input_trim_cands.pfam

    parse_pfam.pl test1_input_trim_cands.pfam |cut -f1 > hmm_getlist

    getseq_multiple.pl hmm_getlist test1_input_trim_cands.faa > test1_input_final_cands.faa

  4. MUSCLE insertion into pre-existing alignment  (parallel - optional)

    muscle -in test1_input_final_cands.faa -out test1_input_final_cands.afa

    muscle -profile -in1 classified_KS_align_12072011.afa -in2 test1_input_final_cands.afa -out final_align.afa

  5. FastTree to obtain newick-format tree (parallel - optional)

    FastTree final_align.afa > final_result.tre