flowcraft.templates.fastqc_report module
Purpose
This module is intended to parse the results of FastQC for paired-end FastQ samples. It parses two reports:
- Categorical report
- Nucleotide level report.
Expected input
The following variables are expected whether using NextFlow or the main() executor.
- sample_id: Sample identification string.
  - e.g.: 'SampleA'
- result_p1: Paths to both FastQC result files for pair 1.
  - e.g.: 'SampleA_1_data SampleA_1_summary'
- result_p2: Paths to both FastQC result files for pair 2.
  - e.g.: 'SampleA_2_data SampleA_2_summary'
- opts: Specify additional arguments for executing fastqc_report. The arguments should be a string of command line arguments. The accepted arguments are:
  - '--ignore-tests': Ignores test results from the FastQC categorical summary. This is used in the first run of FastQC.
Generated output
The generated output consists of files, each of which contains a single object, usually a string.
- fastqc_health: Stores the health check for the current sample. If it passes all checks, it contains only the string ‘pass’. Otherwise, it contains the summary categories and their respective results.
  - e.g.: 'pass'
- optimal_trim: Stores a tuple with the optimal trimming positions for the 5’ and 3’ ends of the reads.
  - e.g.: '15 151'
Code documentation
flowcraft.templates.fastqc_report.write_json_report(sample_id, data1, data2)
Writes the report.
Parameters:
- data1
- data2
flowcraft.templates.fastqc_report.get_trim_index(biased_list)
Returns the trim index from a bool list.
Provided with a list of bool elements (e.g. [False, False, True, True]), this function will assess the index of the list that minimizes the number of True elements (biased positions) at the extremities. To do so, it will iterate over the boolean list and find an index position where there are two consecutive False elements after a True element. This will be considered the optimal trim position. For example, in the following list:
[True, True, False, True, True, False, False, False, False, ...]
The optimal trim index will be position 4 (counting from 0), since it is the first occurrence of a True element with two False elements after it.
If the provided bool list has no True elements, then the 0 index is returned.
Parameters:
- biased_list: list
  List of bool elements, where True means a biased site.
Returns:
- x: index position of the biased list for the optimal trim.
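The search described for get_trim_index can be illustrated with a minimal sketch. The function name trim_index_sketch and the fallback for a list with no qualifying position are assumptions made for this example; they are not taken from the module. The text also does not say whether the real function returns the index of the True element itself or the position right after it, so check the source before using the value for slicing.

```python
def trim_index_sketch(biased_list):
    """Return the index of the first True element followed by two False elements.

    Hypothetical helper written from the description above; it is not the
    module's own implementation.
    """
    # No biased positions at all: nothing needs trimming.
    if True not in biased_list:
        return 0

    for i, biased in enumerate(biased_list):
        # Look at the two elements that follow position i.
        following = biased_list[i + 1:i + 3]
        if biased and following == [False, False]:
            return i

    # Assumption for this sketch only: if no such position exists,
    # treat the whole list as biased.
    return len(biased_list)


example = [True, True, False, True, True, False, False, False, False]
print(trim_index_sketch(example))
```

For the example list from the description, the sketch stops at the True element at index 4, which is the first one followed by two False elements.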
flowcraft.templates.fastqc_report.trim_range(data_file)
Assess the optimal trim range for a given FastQC data file.
This function will parse a single FastQC data file, namely the ‘Per base sequence content’ category. It will retrieve the A/T and G/C content for each nucleotide position in the reads, and check whether the G/C and A/T proportions are between 80% and 120%. If they are not, that nucleotide position is marked as biased for future removal.
Parameters:
- data_file: str
  Path to FastQC data file.
Returns:
- trim_nt: list
  List containing the range with the best trimming positions for the corresponding FastQ file. The first element is the 5’ end trim index and the second element is the 3’ end trim index.
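The per-position bias check can be sketched as follows. This is an illustration rather than the module's code: the function name biased_positions is hypothetical, the input is an iterable of lines instead of a file path, and the proportions are read here as the ratios G/C and A/T. It assumes the FastQC data file layout in which the module starts with a '>>Per base sequence content' line, has a '#Base G A T C' header, and ends with '>>END_MODULE'. Treating a zero denominator as biased is also an assumption.

```python
def biased_positions(lines, low=0.8, high=1.2):
    """Return one bool per read position; True marks a biased position.

    Hypothetical helper: parses only the 'Per base sequence content'
    module from the lines of a FastQC data file.
    """
    biased = []
    in_module = False

    for line in lines:
        line = line.strip()
        if line.startswith(">>Per base sequence content"):
            in_module = True
            continue
        if not in_module:
            continue
        if line.startswith(">>END_MODULE"):
            break
        # Skip blank lines and the '#Base G A T C' header.
        if not line or line.startswith("#"):
            continue

        # Assumed tab-separated columns: Base, G, A, T, C (percentages).
        _, g, a, t, c = line.split("\t")[:5]
        g, a, t, c = (float(v) for v in (g, a, t, c))

        # Assumption: a zero denominator counts as a biased position.
        if c == 0 or t == 0:
            biased.append(True)
            continue

        balanced = low <= g / c <= high and low <= a / t <= high
        biased.append(not balanced)

    return biased


sample = [
    ">>Per base sequence content\tfail",
    "#Base\tG\tA\tT\tC",
    "1\t40.0\t10.0\t30.0\t20.0",
    "2\t25.0\t25.0\t25.0\t25.0",
    ">>END_MODULE",
]
print(biased_positions(sample))
```

In the sample, position 1 has a G/C ratio of 2.0 and is marked as biased, while position 2 has both ratios at 1.0 and is not.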
flowcraft.templates.fastqc_report.get_sample_trim(p1_data, p2_data)
Get the optimal read trim range from data files of paired FastQ reads.
Given the FastQC data report files for paired-end FastQ reads, this function will assess the optimal trim range for the 5’ and 3’ ends of the paired-end reads. This assessment will be based on the ‘Per base sequence content’ category.
Parameters:
- p1_data: str
  Path to FastQC data report file from pair 1.
- p2_data: str
  Path to FastQC data report file from pair 2.
Returns:
- optimal_5trim: int
  Optimal trim index for the 5’ end of the reads.
- optimal_3trim: int
  Optimal trim index for the 3’ end of the reads.
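The description does not state how the trim ranges of the two read files are merged into a single pair of indices. One conservative option, assumed purely for illustration, is to keep the larger 5’ index and the smaller 3’ index so that the retained window is unbiased in both files. The function name combine_trim_ranges and the example values are hypothetical.

```python
def combine_trim_ranges(range_p1, range_p2):
    """Merge two [5' index, 3' index] ranges into one conservative range.

    Assumption for this sketch: trim as much as the more biased file
    requires at each end.
    """
    optimal_5trim = max(range_p1[0], range_p2[0])
    optimal_3trim = min(range_p1[1], range_p2[1])
    return optimal_5trim, optimal_3trim


five, three = combine_trim_ranges([15, 151], [10, 148])
# Space-separated form, matching the 'optimal_trim' output example.
print("{} {}".format(five, three))
```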
flowcraft.templates.fastqc_report.get_summary(summary_file)
Parses a FastQC summary report file and returns it as a dictionary.
This function parses a typical FastQC summary report file, retrieving only the information in the first two columns. For instance, a line could be:
'PASS Basic Statistics SH10762A_1.fastq.gz'
This parser will build a dictionary with the string in the second column as the key and the QC result as the value. In this case, the returned dict would be something like:
{"Basic Statistics": "PASS"}
Parameters:
- summary_file: str
  Path to FastQC summary report.
Returns:
- summary_info: OrderedDict
  The information of the FastQC summary report as an ordered dictionary, with the categories as keys and the QC results as values.
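A minimal sketch of this kind of parser is shown below. It is not the module's implementation: the name parse_summary_lines is hypothetical, and it takes an iterable of lines rather than a file path so that it can be run without a FastQC report on disk. It assumes the columns of the summary file are tab-separated.

```python
from collections import OrderedDict


def parse_summary_lines(lines):
    """Map each FastQC category (column 2) to its QC result (column 1)."""
    summary_info = OrderedDict()

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines with fewer than two columns.
        if len(fields) < 2:
            continue
        result, category = fields[0], fields[1]
        summary_info[category] = result

    return summary_info


lines = ["PASS\tBasic Statistics\tSH10762A_1.fastq.gz\n"]
print(dict(parse_summary_lines(lines)))
```

With a real report, the same function could be called on an open file handle, since iterating over a file yields its lines.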
flowcraft.templates.fastqc_report.check_summary_health(summary_file, **kwargs)
Checks the health of a sample from the FastQC summary file.
Parses the FastQC summary file and tests whether the sample is good or not. There are four categories that cannot fail, and two that must pass in order for the sample to pass this check. If the sample fails the quality checks, a list with the failing categories is also returned.
Categories that cannot fail:
fail_sensitive = [ "Per base sequence quality", "Overrepresented sequences", "Sequence Length Distribution", "Per sequence GC content" ]
Categories that must pass:
must_pass = [ "Per base N content", "Adapter Content" ]
Parameters:
- summary_file: str
  Path to FastQC summary file.
Returns:
- x: bool
  True if the sample passes all tests, False if not.
- summary_info: list
  A list with the FastQC categories that failed the tests. It is empty if the sample passes all tests.
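The two category lists can be applied as in the sketch below. It rests on one reading of the description, which is an assumption: a category that "cannot fail" may be PASS or WARN but not FAIL, while a category that "must pass" has to be exactly PASS. The function name summary_health_sketch is hypothetical, and it takes an already parsed dictionary instead of a file path.

```python
FAIL_SENSITIVE = [
    "Per base sequence quality",
    "Overrepresented sequences",
    "Sequence Length Distribution",
    "Per sequence GC content",
]
MUST_PASS = [
    "Per base N content",
    "Adapter Content",
]


def summary_health_sketch(summary_info):
    """Return (passed, failed_categories) for a parsed FastQC summary.

    summary_info maps category names to 'PASS', 'WARN' or 'FAIL'.
    """
    failed = []

    for category, result in summary_info.items():
        # Assumed reading: these categories tolerate WARN but not FAIL.
        if category in FAIL_SENSITIVE and result == "FAIL":
            failed.append(category)
        # Assumed reading: these categories tolerate only PASS.
        elif category in MUST_PASS and result != "PASS":
            failed.append(category)

    return not failed, failed


print(summary_health_sketch({
    "Per base sequence quality": "WARN",
    "Adapter Content": "PASS",
}))
```

Categories that appear in neither list are ignored by this sketch, whatever their result.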