flowcraft.templates.process_assembly_mapping module

Purpose

This module is intended to process the coverage report from the assembly_mapping process.

TODO: Better purpose

Expected input

The following variables are expected whether using NextFlow or the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • assembly : Fasta assembly file.
    • e.g.: 'SH10761A.assembly.fasta'
  • coverage : TSV file with the average coverage for each assembled contig.
    • e.g.: 'coverage.tsv'
  • coverage_bp : TSV file with the coverage for each assembled bp.
    • e.g.: 'coverage.tsv'
  • bam_file : BAM file with the alignment of reads to the genome.
    • e.g.: 'sorted.bam'
  • opts : List of options for processing assembly mapping output.
    1. Minimum coverage for assembled contigs. Can be``auto``.
      • e.g.: 'auto' or '10'
    2. Maximum number of contigs.
      • e.g.: ‘100’
  • gsize: Expected genome size.
    • e.g.: '2.5'

Generated output

  • ${sample_id}_filtered.assembly.fasta : Filtered assembly file in Fasta format.
    • e.g.: 'SampleA_filtered.assembly.fasta'
  • filtered.bam : BAM file with the same filtering as the assembly file.
    • e.g.: filtered.bam

Code documentation

flowcraft.templates.process_assembly_mapping.parse_coverage_table(coverage_file)[source]

Parses a file with coverage information into objects.

This function parses a TSV file containing coverage results for all contigs in a given assembly and will build an OrderedDict with the information about their coverage and length. The length information is actually gathered from the contig header using a regular expression that assumes the usual header produced by Spades:

contig_len = int(re.search("length_(.+?)_", line).group(1))
Parameters:
coverage_file : str

Path to TSV file containing the coverage results.

Returns:
coverage_dict : OrderedDict

Contains the coverage and length information for each contig.

total_size : int

Total size of the assembly in base pairs.

total_cov : int

Sum of coverage values across all contigs.

flowcraft.templates.process_assembly_mapping.filter_assembly(assembly_file, minimum_coverage, coverage_info, output_file)[source]

Generates a filtered assembly file.

This function generates a filtered assembly file based on an original assembly and a minimum coverage threshold.

Parameters:
assembly_file : str

Path to original assembly file.

minimum_coverage : int or float

Minimum coverage required for a contig to pass the filter.

coverage_info : OrderedDict or dict

Dictionary containing the coverage information for each contig.

output_file : str

Path where the filtered assembly file will be generated.

flowcraft.templates.process_assembly_mapping.filter_bam(coverage_info, bam_file, min_coverage, output_bam)[source]

Uses Samtools to filter a BAM file according to minimum coverage

Provided with a minimum coverage value, this function will use Samtools to filter a BAM file. This is performed to apply the same filter to the BAM file as the one applied to the assembly file in filter_assembly().

Parameters:
coverage_info : OrderedDict or dict

Dictionary containing the coverage information for each contig.

bam_file : str

Path to the BAM file.

min_coverage : int

Minimum coverage required for a contig to pass the filter.

output_bam : str

Path to the generated filtered BAM file.

flowcraft.templates.process_assembly_mapping.check_filtered_assembly(coverage_info, coverage_bp, minimum_coverage, genome_size, contig_size, max_contigs, sample_id)[source]

Checks whether a filtered assembly passes a size threshold

Given a minimum coverage threshold, this function evaluates whether an assembly will pass the minimum threshold of genome_size * 1e6 * 0.8, which means 80% of the expected genome size or the maximum threshold of genome_size * 1e6 * 1.5, which means 150% of the expected genome size. It will issue a warning if any of these thresholds is crossed. In the case of an expected genome size below 80% it will return False.

Parameters:
coverage_info : OrderedDict or dict

Dictionary containing the coverage information for each contig.

coverage_bp : dict

Dictionary containing the per base coverage information for each contig. Used to determine the total number of base pairs in the final assembly.

minimum_coverage : int

Minimum coverage required for a contig to pass the filter.

genome_size : int

Expected genome size.

contig_size : dict

Dictionary with the len of each contig. Contig headers as keys and the corresponding lenght as values.

max_contigs : int

Maximum threshold for contig number. A warning is issued if this threshold is crossed.

sample_id : str

Id or name of the current sample

Returns:
x : bool

True if the filtered assembly size is higher than 80% of the expected genome size.

flowcraft.templates.process_assembly_mapping.get_coverage_from_file(coverage_file)[source]
Parameters:
coverage_file
flowcraft.templates.process_assembly_mapping.evaluate_min_coverage(coverage_opt, assembly_coverage, assembly_size)[source]

Evaluates the minimum coverage threshold from the value provided in the coverage_opt.

Parameters:
coverage_opt : str or int or float

If set to “auto” it will try to automatically determine the coverage to 1/3 of the assembly size, to a minimum value of 10. If it set to a int or float, the specified value will be used.

assembly_coverage : int or float

The average assembly coverage for a genome assembly. This value is retrieved by the :py:func:parse_coverage_table function.

assembly_size : int

The size of the genome assembly. This value is retrieved by the py:func:get_assembly_size function.

Returns:
x: int

Minimum coverage threshold.

flowcraft.templates.process_assembly_mapping.get_assembly_size(assembly_file)[source]

Returns the number of nucleotides and the size per contig for the provided assembly file path

Parameters:
assembly_file : str

Path to assembly file.

Returns:
assembly_size : int

Size of the assembly in nucleotides

contig_size : dict

Length of each contig (contig name as key and length as value)