flowcraft.templates.process_assembly_mapping module¶
Purpose¶
This module is intended to process the coverage report from the
assembly_mapping
process.
TODO: Better purpose
Expected input¶
The following variables are expected whether using NextFlow or the
main()
executor.
sample_id
: Sample Identification string.- e.g.:
'SampleA'
- e.g.:
assembly
: Fasta assembly file.- e.g.:
'SH10761A.assembly.fasta'
- e.g.:
coverage
: TSV file with the average coverage for each assembled contig.- e.g.:
'coverage.tsv'
- e.g.:
coverage_bp
: TSV file with the coverage for each assembled bp.- e.g.:
'coverage.tsv'
- e.g.:
bam_file
: BAM file with the alignment of reads to the genome.- e.g.:
'sorted.bam'
- e.g.:
opts
: List of options for processing assembly mapping output.- Minimum coverage for assembled contigs. Can be``auto``.
- e.g.:
'auto'
or'10'
- e.g.:
- Maximum number of contigs.
- e.g.: ‘100’
gsize
: Expected genome size.- e.g.:
'2.5'
- e.g.:
Generated output¶
${sample_id}_filtered.assembly.fasta
: Filtered assembly file in Fasta format.- e.g.:
'SampleA_filtered.assembly.fasta'
- e.g.:
filtered.bam
: BAM file with the same filtering as the assembly file.- e.g.:
filtered.bam
- e.g.:
Code documentation¶
-
flowcraft.templates.process_assembly_mapping.
parse_coverage_table
(coverage_file)[source]¶ Parses a file with coverage information into objects.
This function parses a TSV file containing coverage results for all contigs in a given assembly and will build an
OrderedDict
with the information about their coverage and length. The length information is actually gathered from the contig header using a regular expression that assumes the usual header produced by Spades:contig_len = int(re.search("length_(.+?)_", line).group(1))
Parameters: - coverage_file : str
Path to TSV file containing the coverage results.
Returns: - coverage_dict : OrderedDict
Contains the coverage and length information for each contig.
- total_size : int
Total size of the assembly in base pairs.
- total_cov : int
Sum of coverage values across all contigs.
-
flowcraft.templates.process_assembly_mapping.
filter_assembly
(assembly_file, minimum_coverage, coverage_info, output_file)[source]¶ Generates a filtered assembly file.
This function generates a filtered assembly file based on an original assembly and a minimum coverage threshold.
Parameters: - assembly_file : str
Path to original assembly file.
- minimum_coverage : int or float
Minimum coverage required for a contig to pass the filter.
- coverage_info : OrderedDict or dict
Dictionary containing the coverage information for each contig.
- output_file : str
Path where the filtered assembly file will be generated.
-
flowcraft.templates.process_assembly_mapping.
filter_bam
(coverage_info, bam_file, min_coverage, output_bam)[source]¶ Uses Samtools to filter a BAM file according to minimum coverage
Provided with a minimum coverage value, this function will use Samtools to filter a BAM file. This is performed to apply the same filter to the BAM file as the one applied to the assembly file in
filter_assembly()
.Parameters: - coverage_info : OrderedDict or dict
Dictionary containing the coverage information for each contig.
- bam_file : str
Path to the BAM file.
- min_coverage : int
Minimum coverage required for a contig to pass the filter.
- output_bam : str
Path to the generated filtered BAM file.
-
flowcraft.templates.process_assembly_mapping.
check_filtered_assembly
(coverage_info, coverage_bp, minimum_coverage, genome_size, contig_size, max_contigs, sample_id)[source]¶ Checks whether a filtered assembly passes a size threshold
Given a minimum coverage threshold, this function evaluates whether an assembly will pass the minimum threshold of
genome_size * 1e6 * 0.8
, which means 80% of the expected genome size or the maximum threshold ofgenome_size * 1e6 * 1.5
, which means 150% of the expected genome size. It will issue a warning if any of these thresholds is crossed. In the case of an expected genome size below 80% it will return False.Parameters: - coverage_info : OrderedDict or dict
Dictionary containing the coverage information for each contig.
- coverage_bp : dict
Dictionary containing the per base coverage information for each contig. Used to determine the total number of base pairs in the final assembly.
- minimum_coverage : int
Minimum coverage required for a contig to pass the filter.
- genome_size : int
Expected genome size.
- contig_size : dict
Dictionary with the len of each contig. Contig headers as keys and the corresponding lenght as values.
- max_contigs : int
Maximum threshold for contig number. A warning is issued if this threshold is crossed.
- sample_id : str
Id or name of the current sample
Returns: - x : bool
True if the filtered assembly size is higher than 80% of the expected genome size.
-
flowcraft.templates.process_assembly_mapping.
get_coverage_from_file
(coverage_file)[source]¶ Parameters: - coverage_file
-
flowcraft.templates.process_assembly_mapping.
evaluate_min_coverage
(coverage_opt, assembly_coverage, assembly_size)[source]¶ Evaluates the minimum coverage threshold from the value provided in the coverage_opt.
Parameters: - coverage_opt : str or int or float
If set to “auto” it will try to automatically determine the coverage to 1/3 of the assembly size, to a minimum value of 10. If it set to a int or float, the specified value will be used.
- assembly_coverage : int or float
The average assembly coverage for a genome assembly. This value is retrieved by the :py:func:parse_coverage_table function.
- assembly_size : int
The size of the genome assembly. This value is retrieved by the py:func:get_assembly_size function.
Returns: - x: int
Minimum coverage threshold.
-
flowcraft.templates.process_assembly_mapping.
get_assembly_size
(assembly_file)[source]¶ Returns the number of nucleotides and the size per contig for the provided assembly file path
Parameters: - assembly_file : str
Path to assembly file.
Returns: - assembly_size : int
Size of the assembly in nucleotides
- contig_size : dict
Length of each contig (contig name as key and length as value)