flowcraft.templates.integrity_coverage module
Purpose
This module receives paired FastQ files, a genome size estimate and a minimum coverage threshold. While iterating over the FastQ files, it serves three purposes:
- Checks the integrity of the FastQ files (detects corrupted files).
- Guesses the encoding of the FastQ files (this can be turned off in the opts argument).
- Estimates the coverage for each sample.
Expected input
The following variables are expected whether using NextFlow or the main() executor.
- sample_id : Sample identification string.
  - e.g.: 'SampleA'
- fastq_pair : Pair of FastQ file paths.
  - e.g.: 'SampleA_1.fastq.gz SampleA_2.fastq.gz'
- gsize : Expected genome size.
  - e.g.: '2.5'
- cov : Minimum coverage threshold.
  - e.g.: '15'
- opts : Additional arguments for executing integrity_coverage, given as a string of command line arguments such as '-e'. The accepted arguments are:
  - '-e' : Skip encoding guess.
Generated output
The output consists of files, each containing a single object, usually a string.
(Values within ${} are substituted by the corresponding variable.)
- ${sample_id}_encoding : Stores the encoding for the sample FastQ. If no encoding could be guessed, 'None' is written to the file.
  - e.g.: 'Illumina-1.8' or 'None'
- ${sample_id}_phred : Stores the phred value for the sample FastQ. If no phred could be guessed, 'None' is written to the file.
  - e.g.: '33' or 'None'
- ${sample_id}_coverage : Stores the expected coverage of the sample, based on the given genome size.
  - e.g.: '112' or 'fail'
- ${sample_id}_report : Stores the report on the expected coverage estimation. The string written to this file will appear in the coverage report.
  - e.g.: '${sample_id}, 112, PASS'
- ${sample_id}_max_len : Stores the maximum read length for the current sample.
  - e.g.: '152'
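The coverage estimate and the PASS entry in the report can be illustrated with a minimal sketch. It is not the module's code: the function name `estimate_coverage` and the `'FAIL'` label are hypothetical, and it assumes that gsize is expressed in megabases and that coverage is the total number of sequenced bases divided by the genome size. This page does not state either assumption.

```python
def estimate_coverage(total_bases, gsize_mb, min_cov):
    """Return (expected coverage, status) for one sample.

    Assumption: gsize_mb is the genome size in megabases, so the genome
    length in bases is gsize_mb * 1e6.
    """
    expected = round(total_bases / (gsize_mb * 1e6), 2)
    # 'FAIL' is a hypothetical label; the page only shows 'PASS'.
    status = "PASS" if expected >= min_cov else "FAIL"
    return expected, status


# 280 million bases over a 2.5 Mb genome, with a minimum coverage of 15.
coverage, status = estimate_coverage(280_000_000, 2.5, 15)
print(coverage, status)
```

Under these assumptions, the sample in the example clears the threshold of 15 by a wide margin, which corresponds to a PASS entry in the report file.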
Notes
In case of a corrupted sample, all expected output files should have 'corrupt' written.
Code documentation
- flowcraft.templates.integrity_coverage.RANGES = {'Illumina-1.3': [64, (64, 104)], 'Illumina-1.5': [64, (66, 105)], 'Illumina-1.8': [33, (33, 74)], 'Sanger': [33, (33, 73)], 'Solexa': [64, (59, 104)]}
  dict: Dictionary containing the encoding values for several FastQ formats. Each key is a format name and each value is a list with the corresponding phred score and a tuple with the range of encodings.
- flowcraft.templates.integrity_coverage.MAGIC_DICT = {b'\\x1f\\x8b\\x08': 'gz', b'\\x42\\x5a\\x68': 'bz2', b'\\x50\\x4b\\x03\\x04': 'zip'}
  dict: Dictionary containing the binary signatures for three compression formats (gzip, bzip2 and zip).
- flowcraft.templates.integrity_coverage.guess_file_compression(file_path, magic_dict=None)
  Guesses the compression of an input file.
  This function guesses the compression of a given file by checking for a binary signature at the beginning of the file. These signatures are stored in the MAGIC_DICT dictionary. The supported compression formats are gzip, bzip2 and zip. If none of the signatures in this dictionary is found at the beginning of the file, it returns None.
  Parameters:
  - file_path : str
    Path to input file.
  - magic_dict : dict, optional
    Dictionary containing the signatures of the compression types. The key should be the binary signature and the value should be the compression format. If left as None, it falls back to MAGIC_DICT.
  Returns:
  - file_type : str or None
    If a compression type is detected, returns a string with the format. If not, returns None.
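The signature check described for guess_file_compression can be sketched as follows. This is an illustrative reimplementation, not the module's source: the names `SIGNATURES` and `sketch_guess_compression` are hypothetical, and the signatures are written here as real bytes with single backslashes.

```python
import gzip
import os
import tempfile

# Hypothetical stand-in for MAGIC_DICT, written as real byte signatures.
SIGNATURES = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip",
}


def sketch_guess_compression(file_path, magic_dict=None):
    """Return the compression format of file_path, or None if unknown."""
    if magic_dict is None:
        magic_dict = SIGNATURES
    # Read only as many bytes as the longest signature requires.
    max_len = max(len(signature) for signature in magic_dict)
    with open(file_path, "rb") as handle:
        file_start = handle.read(max_len)
    for signature, file_type in magic_dict.items():
        if file_start.startswith(signature):
            return file_type
    return None


# Usage: compare a gzip-compressed file with a plain-text one.
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "SampleA_1.fastq.gz")
    plain_path = os.path.join(tmp_dir, "SampleA_1.fastq")
    record = b"@read1\nACGT\n+\nIIII\n"
    with gzip.open(gz_path, "wb") as handle:
        handle.write(record)
    with open(plain_path, "wb") as handle:
        handle.write(record)
    print(sketch_guess_compression(gz_path))
    print(sketch_guess_compression(plain_path))
```

Reading only the first few bytes keeps the check cheap even for large FastQ files, since the file never has to be decompressed to identify its format.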
- flowcraft.templates.integrity_coverage.get_qual_range(qual_str)
  Gets the range of Unicode code points for a given string of characters.
  The code point of each character is determined with the ord() built-in.
  Parameters:
  - qual_str : str
    Arbitrary string.
  Returns:
  - x : tuple
    (Minimum Unicode code, Maximum Unicode code).
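As a small illustration of the description above, the range can be computed by applying ord() to each character and taking the extremes. The function name `sketch_qual_range` is hypothetical and the body is a sketch rather than the module's source.

```python
def sketch_qual_range(qual_str):
    """Return (min, max) of the Unicode code points in qual_str."""
    codes = [ord(char) for char in qual_str]
    return min(codes), max(codes)


# '#' has code 35 and 'I' has code 73, the extremes of this quality string.
print(sketch_qual_range("II5?#"))  # (35, 73)
```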
- flowcraft.templates.integrity_coverage.get_encodings_in_range(rmin, rmax)
  Returns the valid encodings for a given encoding range.
  The encoding ranges are stored in the RANGES dictionary, with the encoding name as a string key and a list as a value containing the phred score and a tuple with the encoding range. For a given encoding range provided via the first two arguments, this function will return all possible encodings and phred scores.
  Parameters:
  - rmin : int
    Minimum Unicode code in range.
  - rmax : int
    Maximum Unicode code in range.
  Returns:
  - valid_encodings : list
    List of all possible encodings for the provided range.
  - valid_phred : list
    List of all possible phred scores.
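The lookup against RANGES can be sketched as below. This is a hedged illustration, not the module's source: the function name `sketch_encodings_in_range` is hypothetical, and it assumes that an encoding is valid when the observed range lies entirely within the encoding's range and that one phred score is returned per valid encoding.

```python
# Copy of the RANGES dictionary documented above.
RANGES = {
    "Illumina-1.3": [64, (64, 104)],
    "Illumina-1.5": [64, (66, 105)],
    "Illumina-1.8": [33, (33, 74)],
    "Sanger": [33, (33, 73)],
    "Solexa": [64, (59, 104)],
}


def sketch_encodings_in_range(rmin, rmax, ranges=RANGES):
    """Return (valid_encodings, valid_phred) for the range [rmin, rmax]."""
    valid_encodings = []
    valid_phred = []
    for name, (phred, (low, high)) in ranges.items():
        # Assumption: the observed range must fit inside the encoding range.
        if rmin >= low and rmax <= high:
            valid_encodings.append(name)
            valid_phred.append(phred)
    return valid_encodings, valid_phred


# A quality string spanning codes 35 to 73 fits two encodings.
print(sketch_encodings_in_range(35, 73))
# (['Illumina-1.8', 'Sanger'], [33, 33])
```

Several encodings can match the same range, as in the example, which is why lists are returned instead of a single value.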