flowcraft.templates.process_assembly module

Purpose

This module processes the assembly output of a single sample produced by assemblers such as SPAdes or SKESA. The main input is an assembly file produced by the assembler, which is then filtered according to user-specified parameters.

Expected input

The following variables are expected whether using Nextflow or the main() executor.

  • sample_id: Sample Identification string.
    • e.g.: 'SampleA'
  • assembly: Fasta file with the assembly.
    • e.g.: 'contigs.fasta'
  • opts: List of options for processing spades assembly.
    1. Minimum contig length.
      • e.g.: '150'
    2. Minimum k-mer coverage.
      • e.g.: '2'
    3. Maximum number of contigs per 1.5Mb.
      • e.g.: '100'
  • assembler: The name of the assembler.
    • e.g.: 'spades'
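
Since the example values above are given as strings, a caller would typically convert each entry of opts before using it. The following is a minimal sketch under that assumption. parse_opts and max_contigs_for are hypothetical helpers, not part of this module, and scaling the contig limit linearly with the assembly size is only one possible reading of "per 1.5Mb".

```python
def parse_opts(opts):
    """Convert the three string options into integers (hypothetical helper)."""
    min_contig_len, min_kmer_cov, max_contigs = (int(x) for x in opts)
    return min_contig_len, min_kmer_cov, max_contigs


def max_contigs_for(assembly_len, max_contigs_per_unit, unit=1_500_000):
    """Scale the contig limit to the assembly size (assumed linear scaling)."""
    return assembly_len * max_contigs_per_unit / unit


# Example values taken from the list above.
min_len, min_cov, max_contigs = parse_opts(["150", "2", "100"])

# A 3 Mb assembly spans two 1.5 Mb units, so the limit doubles.
threshold = max_contigs_for(3_000_000, max_contigs)
```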

Generated output

(Values within ${} are substituted by the corresponding variable.)

  • '${sample_id}.assembly.fasta' : Fasta file with the filtered assembly.
    • e.g.: 'Sample1.assembly.fasta'
  • '${sample_id}.report.csv' : CSV file with the results of the filters for each contig.
    • e.g.: 'Sample1.report.csv'
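
The ${} substitution can be mimicked in plain Python with string.Template, which uses the same placeholder syntax. This is only an illustration of how the output names are derived from sample_id. In a pipeline the substitution is performed by the workflow engine, not by this snippet.

```python
from string import Template

sample_id = "Sample1"

# Same placeholder syntax as in the output names listed above.
assembly_name = Template("${sample_id}.assembly.fasta").substitute(sample_id=sample_id)
report_name = Template("${sample_id}.report.csv").substitute(sample_id=sample_id)
```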

Code documentation

class flowcraft.templates.process_assembly.Assembly(assembly_file, min_contig_len, min_kmer_cov, sample_id)

Bases: object

Class that parses and filters a Fasta assembly file.

This class parses an assembly fasta file, collects a number of summary statistics and metadata from the contigs, filters contigs based on user-defined metrics and writes filtered assemblies and reports.

Parameters:
assembly_file : str

Path to assembly file.

min_contig_len : int

Minimum contig length when applying the initial assembly filter.

min_kmer_cov : int

Minimum k-mer coverage when applying the initial assembly filter.

sample_id : str

Name of the sample for the current assembly.
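
To illustrate the kind of per-contig data such a class might collect, here is a minimal sketch that parses a Fasta file into a dictionary. It is not the class's actual implementation: parse_assembly is a hypothetical function, the stored keys and integer contig ids are illustrative, and the k-mer coverage is read from SPAdes-style headers such as NODE_1_length_8_cov_12.5, which is an assumption about the input.

```python
import re
import tempfile


def parse_assembly(path):
    """Return a dict mapping contig ids to header, sequence and statistics."""
    contigs = {}
    current = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                # Assumed SPAdes-style header; None if no coverage field.
                cov = re.search(r"_cov_(\d+(?:\.\d+)?)", header)
                current = {
                    "header": header,
                    "sequence": "",
                    "kmer_cov": float(cov.group(1)) if cov else None,
                }
                contigs[len(contigs) + 1] = current
            elif current is not None:
                # Sequences may be wrapped over several lines.
                current["sequence"] += line.upper()
    for data in contigs.values():
        seq = data["sequence"]
        data["length"] = len(seq)
        data["gc_prop"] = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    return contigs


# Small demonstration file with two contigs.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">NODE_1_length_8_cov_12.5\nACGTACGT\n")
    tmp.write(">NODE_2_length_4_cov_1.0\nAT\nAT\n")

contigs = parse_assembly(tmp.name)
```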

Methods

  • filter_contigs(self, *comparisons): Filters the contigs of the assembly according to user provided comparisons.
  • get_assembly_length(self): Returns the length of the assembly, without the filtered contigs.
  • write_assembly(self, output_file[, filtered]): Writes the assembly to a new file.
  • write_report(self, output_file): Writes a report with the test results for the current assembly.

contigs = None

dict: Dictionary storing data for each contig.

filtered_ids = None

list: List of filtered contig_ids.

min_gc = None

float: Minimum GC content of a contig.

sample = None

str: The name of the sample for the assembly.

report = None

dict: Will contain the filtering results for each contig.

filters = None

list: Initial filters to check when parsing the assembly file. These can be changed later using the 'filter_contigs' method.

filter_contigs(self, *comparisons)

Filters the contigs of the assembly according to user provided comparisons.

Each comparison must be a list of three elements: the contig key, the operator and the test value. For example, to filter contigs with a minimum length of 250, a comparison would be:

self.filter_contigs(["length", ">=", 250])

The filtered contig ids will be stored in the filtered_ids list.

The result of the test for all contigs will be stored in the report dictionary.

Parameters:
comparisons : list

List with contig key, operator and value to test.
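
One way such comparison triples could be evaluated is by mapping the operator string to a function from the standard operator module. The sketch below assumes that a contig failing any comparison is treated as filtered. apply_comparisons, the OPERATORS table and the report strings are hypothetical and do not reproduce the method's actual code.

```python
import operator

# Map operator strings to the corresponding comparison functions.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def apply_comparisons(contigs, *comparisons):
    """Return (filtered_ids, report) for [key, operator, value] comparisons."""
    filtered_ids = []
    report = {}
    for contig_id, data in contigs.items():
        # Collect every comparison this contig does not satisfy.
        failed = [
            f"{key} {op} {value}"
            for key, op, value in comparisons
            if not OPERATORS[op](data[key], value)
        ]
        if failed:
            filtered_ids.append(contig_id)
            report[contig_id] = "fail: " + "; ".join(failed)
        else:
            report[contig_id] = "pass"
    return filtered_ids, report


contigs = {
    1: {"length": 300, "kmer_cov": 12.5},
    2: {"length": 120, "kmer_cov": 1.0},
}
filtered_ids, report = apply_comparisons(
    contigs, ["length", ">=", 250], ["kmer_cov", ">=", 2]
)
```

Here contig 2 fails both tests and ends up in filtered_ids, while the report records the outcome for every contig.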

get_assembly_length(self)

Returns the length of the assembly, without the filtered contigs.

Returns:
x : int

Total length of the assembly.
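
The computation can be sketched as a sum over the contigs that were not filtered. The dictionary layout (a "length" key per contig) and the assembly_length function are assumptions made for illustration.

```python
def assembly_length(contigs, filtered_ids):
    """Sum the lengths of all contigs whose id is not in filtered_ids."""
    skip = set(filtered_ids)
    return sum(
        data["length"] for contig_id, data in contigs.items() if contig_id not in skip
    )


contigs = {1: {"length": 300}, 2: {"length": 120}, 3: {"length": 500}}

# Contig 2 is filtered, so only contigs 1 and 3 are counted.
total = assembly_length(contigs, filtered_ids=[2])
```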

write_assembly(self, output_file, filtered=True)

Writes the assembly to a new file.

The filtered option controls whether the new assembly will be filtered or not.

Parameters:
output_file : str

Name of the output assembly file.

filtered : bool

If True, contigs whose ids are in filtered_ids are excluded from the output.
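
The effect of the filtered flag can be sketched with a small writer that skips filtered contig ids. write_fasta is a hypothetical stand-in that writes to any text handle. The "header" and "sequence" keys are assumed for illustration.

```python
import io


def write_fasta(contigs, handle, filtered_ids=(), filtered=True):
    """Write contigs in Fasta format, skipping filtered ids when requested."""
    skip = set(filtered_ids) if filtered else set()
    for contig_id, data in contigs.items():
        if contig_id in skip:
            continue
        handle.write(f">{data['header']}\n{data['sequence']}\n")


contigs = {
    1: {"header": "NODE_1", "sequence": "ACGTACGT"},
    2: {"header": "NODE_2", "sequence": "ATAT"},
}

# filtered=True (the default) leaves out contig 2.
kept = io.StringIO()
write_fasta(contigs, kept, filtered_ids=[2])

# filtered=False writes every contig regardless of filtered_ids.
everything = io.StringIO()
write_fasta(contigs, everything, filtered_ids=[2], filtered=False)
```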

write_report(self, output_file)

Writes a report with the test results for the current assembly.

Parameters:
output_file : str

Name of the output report file.
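
A CSV report of per-contig results could be produced with the standard csv module, as in the sketch below. The two-column layout (contig id, result) and the write_csv_report function are assumptions. The actual column layout of the report is not specified here.

```python
import csv
import io


def write_csv_report(report, handle):
    """Write one CSV row per contig: contig id, then the test result."""
    writer = csv.writer(handle, lineterminator="\n")
    for contig_id, result in report.items():
        writer.writerow([contig_id, result])


report = {1: "pass", 2: "fail: length >= 250"}

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_csv_report(report, buffer)
```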