a2ihelper package
Submodules
a2ihelper.call_reditools2 module
- a2ihelper.call_reditools2.get_genes_positions(genes: list, path_ref_annotation: str, gzip_file: bool = False) → list
Return the coordinates of a gene symbol from a GTF file. It can be used as input to run_per_gene_position_list.
- Parameters:
genes (list) – list of genes to get coordinates
path_ref_annotation (str) – full reference GTF file path.
- Returns:
a list with the coordinates of each gene symbol (chr:start-end).
- Return type:
list
Example
Using the GTF file from gencode
>>> get_genes_positions(['B2m', 'Apol1'], '/.../GRCh38.p14.genome.fa')
['chr2:122147686-122153083', 'chr18:60803848-60812646']
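As a rough illustration of how gene coordinates can be read from GTF records, the sketch below scans GTF-formatted lines for gene-level features and formats each match as chr:start-end. The helper name positions_from_gtf_lines and the sample records are hypothetical; this is not a2ihelper's own implementation.

```python
import re

def positions_from_gtf_lines(lines, genes):
    """Return 'chr:start-end' strings for the requested gene symbols (hypothetical helper)."""
    wanted = set(genes)
    found = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level records
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match and match.group(1) in wanted:
            # GTF columns: 0 = seqname, 3 = start, 4 = end (1-based, inclusive)
            found[match.group(1)] = f"{fields[0]}:{fields[3]}-{fields[4]}"
    # Preserve the order of the requested symbols; symbols not found are skipped
    return [found[g] for g in genes if g in found]

# Made-up records, for illustration only
gtf_lines = [
    "##description: toy annotation",
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";',
    'chr2\tHAVANA\tgene\t1000\t2500\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";',
]
positions = positions_from_gtf_lines(gtf_lines, ["GeneB", "GeneA"])
print(positions)
```

Exon records are ignored here so that each gene contributes a single span covering the whole locus.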
- a2ihelper.call_reditools2.get_utr_genes_positions(genes: list, path_ref_annotation: str, gzip_file: bool = False) → list
Return the coordinates of a gene symbol from a GTF file. It can be used as input to run_per_gene_position_list.
- Parameters:
genes (list) – list of genes to get coordinates
path_ref_annotation (str) – full reference GTF file path.
- Returns:
a list with the coordinates of each gene symbol (chr:start-end).
- Return type:
list
Example
Using the GTF file from gencode
>>> get_genes_positions(['B2m', 'Apol1'], '/.../GRCh38.p14.genome.fa')
['chr2:122147686-122153083', 'chr18:60803848-60812646']
- a2ihelper.call_reditools2.indexing_ref(path_ref_genome)
- a2ihelper.call_reditools2.run_per_gene_position(gene_position: str, in_bam_file: str, path_out_res: str, ref_genome_file: str, path_reditools: str, reditools_options: str) → None
Run reditools2.0 via Python
- Parameters:
gene_position (str) – chromosome:start-end coordinates in which to search for editing sites. Example: ‘chr2:122147686-122153083’. The coordinates of a gene symbol can be obtained with the get_genes_positions function.
in_bam_file (str) – full input BAM file path
path_out_res (str) – output directory
ref_genome_file (str) – full reference FASTA file path. Must be the same file used to build the aligned BAM file.
path_reditools (str) – full directory where reditools.py is installed. It is usually a path similar to ‘/../reditools2.0/src/cineca’
reditools_options (str) – optional arguments to run reditools2. All the options are explained at https://github.com/BioinfoUNIBA/REDItools2
- Returns:
Nothing; the function only runs reditools2.
- Return type:
None
Example
Using the toyfile from https://github.com/guilhermetabordaribas/a2iHelperPy
>>> gene_position = 'chr2:122147686-122153083'
>>> in_bam_file = '/.../sample1.sortedByCoord.out.bam'
>>> path_out_res = '/.../out/'
>>> ref_genome_file = '/.../GRCh38.p14.genome.fa'
>>> path_reditools = '/.../reditools2.0/src/cineca/'
>>> reditools_options = '--strict'
>>> run_per_gene_position(gene_position, in_bam_file, path_out_res, ref_genome_file, path_reditools, reditools_options='--strict')
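Because gene_position must follow the chromosome:start-end form, it can help to validate the string before launching a run. The sketch below is one way to do that; parse_region is a hypothetical helper and is not part of a2ihelper.

```python
import re

REGION_PATTERN = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")

def parse_region(gene_position):
    """Split 'chr:start-end' into (chromosome, start, end); hypothetical helper."""
    match = REGION_PATTERN.match(gene_position)
    if match is None:
        raise ValueError(f"expected 'chr:start-end', got {gene_position!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not be greater than end")
    return chrom, start, end

region = parse_region("chr2:122147686-122153083")
print(region)
```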
- a2ihelper.call_reditools2.run_per_gene_position_list(genes_positions: list, in_bam_file: str, path_out_res: str, ref_genome_file: str, path_reditools: str, reditools_options: str, n_jobs: int = 4) → None
Run run_per_gene_position for a list of gene coordinates (genes_positions)
- Parameters:
genes_positions (list) – list of chromosome:start-end coordinates in which to search for editing sites. Example: [‘chr2:122147686-122153083’, ‘chr18:60803848-60812646’, ‘chr6:65671590-65712326’]. The coordinates of gene symbols can be obtained with the get_genes_positions function.
in_bam_file (str) – full input BAM file path
path_out_res (str) – output directory
ref_genome_file (str) – full reference FASTA file path. Must be the same file used to build the aligned BAM file.
path_reditools (str) – full directory where reditools.py is installed. It is usually a path similar to ‘/../reditools2.0/src/cineca’
reditools_options (str) – optional arguments to run reditools2. All the options are explained at https://github.com/BioinfoUNIBA/REDItools2
n_jobs (int) – number of jobs to run in parallel
- Returns:
Nothing; the function only runs reditools2 for a list of coordinates.
- Return type:
None
Example
Using the toyfile from https://github.com/guilhermetabordaribas/a2iHelperPy
>>> genes_positions = ['chr2:122147686-122153083', 'chr6:65671590-65712326', 'chr15:78191114-78206400']
>>> in_bam_file = '/.../sample1.sortedByCoord.out.bam'
>>> path_out_res = '/.../out/'
>>> ref_genome_file = '/.../GRCh38.p14.genome.fa'
>>> path_reditools = '/.../reditools2.0/src/cineca/'
>>> reditools_options = '--strict'
>>> run_per_gene_position_list(genes_positions, in_bam_file, path_out_res, ref_genome_file, path_reditools, reditools_options='--strict', n_jobs=4)
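The n_jobs parameter suggests that the regions are processed concurrently. How a2ihelper does this internally is not stated here, so the sketch below is only an assumption: it uses a thread pool from the standard library, and run_one is a stand-in for a single per-region job.

```python
from concurrent.futures import ThreadPoolExecutor

def run_one(gene_position):
    """Stand-in for one per-region job; a real job would launch REDItools2."""
    return f"done:{gene_position}"

def run_many(genes_positions, n_jobs=4):
    # executor.map preserves the input order even though jobs run concurrently
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        return list(executor.map(run_one, genes_positions))

results = run_many(["chr2:122147686-122153083", "chr6:65671590-65712326"], n_jobs=2)
print(results)
```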
a2ihelper.filter module
- a2ihelper.filter.call_snp_vep(coordinates: list = [], species: str = 'homo_sapiens', sub: str = 'G')
Query SNPs in VEP (Ensembl tool).
- Parameters:
coordinates (list) – List of strings to query in VEP. Each string must have the form chr_coordinate (e.g: [‘9_129401662’, ‘1_6524705’]).
species (str) – String with the species that you are requesting. It is based on Ensembl names like homo_sapiens, mus_musculus, rattus_norvegicus, zebrafish. Other genomes can be searched here: https://rest.ensembl.org/documentation/info/species
sub (str) – Substitution base, can be G (A>G) or C (T>C).
- Returns:
List of chr_coordinate and respective rsID
- Return type:
list
- a2ihelper.filter.filter_gtest(df_a, df_g, pvalue_filter_limit=0.05, gtest_filter_limit=0, bh_correction=False)
Filter positions by limiting the number of samples with significant p-values in the G-test of independence.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts (A). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions
df_g (df) – pandas DataFrame of Guanine (Inosine) counts (G). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions
pvalue_filter_limit (float) – p-value threshold below which the test is considered significant.
gtest_filter_limit (int) – Maximum number of conditions with a significant p-value allowed at one position.
bh_correction (bool) – If True, all p-values will be corrected by false_discovery_control
- Returns:
The inputs df_a and df_g with the filtered coordinates removed.
- Return type:
tuple
- a2ihelper.filter.filter_positions(df, nan_filter=True, nan_filter_limit=0, zero_filter=True, zero_filter_limit=0, hundred_filter=True, hundred_filter_limit=0, per_condition=False)
Filter positions limiting the quantity of samples with nan values, zero editing and 100% of editing across samples.
- Parameters:
df (df) – Frequency editing DataFrame of all samples analyzed. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
nan_filter (bool) – Must be True to filter by the number of NaN values across samples in each condition
nan_filter_limit (int) – Maximum number of samples with NaN allowed at one position in each condition
zero_filter (bool) – Must be True to filter by the number of zero editing values across samples in each condition
zero_filter_limit (int) – Maximum number of samples with zero frequency allowed at one position in each condition
hundred_filter (bool) – Must be True to filter by the number of 100% editing values across samples in each condition
hundred_filter_limit (int) – Maximum number of samples with 100% frequency allowed at one position in each condition
per_condition (bool) – Must be True to apply the limits to each condition individually, and False to apply them across all samples.
- Returns:
DataFrame without filtered positions
- Return type:
df
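The three limits can be read as simple per-position counts. The sketch below applies them to plain lists rather than a DataFrame. It assumes frequencies are stored as fractions in [0, 1], so that 100% editing is 1.0; keep_position and the position '2_100' are hypothetical.

```python
import math

def keep_position(values, nan_limit=0, zero_limit=0, hundred_limit=0):
    """Return True when a position stays within all three limits (hypothetical helper)."""
    n_nan = sum(1 for v in values if math.isnan(v))
    n_zero = sum(1 for v in values if v == 0)
    n_hundred = sum(1 for v in values if v == 1.0)  # assumes fractions, not percentages
    return n_nan <= nan_limit and n_zero <= zero_limit and n_hundred <= hundred_limit

# Editing frequencies of three samples at each position (made-up values)
freqs = {
    "9_129401662": [0.10, 0.25, 0.30],
    "1_6524705": [math.nan, 0.0, 0.20],
    "2_100": [1.0, 1.0, 0.9],
}
kept = [pos for pos, vals in freqs.items() if keep_position(vals)]
print(kept)
```

With the default limits of zero, a single NaN, zero, or 100% value is enough to drop a position; raising a limit tolerates that many samples.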
- a2ihelper.filter.filter_snps_vep(df, species: str = 'homo_sapiens', sub: str = 'G')
Filter positions by querying SNPs in VEP (Ensembl tool).
- Parameters:
df (DataFrame) – Frequency editing DataFrame of all samples analyzed. The DataFrame must be like merge_files output. The columns must be chr_coordinate (e.g: 9_129401662). The last two columns must be region and conditions.
species (str) – String with the species that you are requesting. It is based on Ensembl names like homo_sapiens, mus_musculus, rattus_norvegicus, zebrafish. Other genomes can be searched here: https://rest.ensembl.org/documentation/info/species
sub (str) – Substitution base, can be G (A>G) or C (T>C).
- Returns:
df_1 – DataFrame df without filtered positions
df_2 – DataFrame with positions and rsID
- a2ihelper.filter.independency_gtest(df_a, df_g, only_pvalue=True)
Perform the G-test of independence to check for deviation from the expected proportions and for significant variation among the replicates.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts (A). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions
df_g (df) – pandas DataFrame of Guanine (Inosine) counts (G). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions
only_pvalue (bool) – If True, the function will return two DataFrames with p-values only. Otherwise it will return the attributes of the Chi2ContingencyResult object (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html).
- Returns:
DataFrame with scipy.stats.chi2_contingency results.
- Return type:
df
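For reference, the G statistic is G = 2 * sum(O * ln(O / E)), where O are the observed counts and E the counts expected under independence. SciPy's chi2_contingency computes this statistic when called with lambda_="log-likelihood". The sketch below computes only the statistic and the degrees of freedom with the standard library; g_statistic is a hypothetical helper and the counts are made up.

```python
import math

def g_statistic(table):
    """Return (G, degrees of freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return 2.0 * g, dof

# Rows are replicates of one condition; columns are (A count, G count)
replicates = [[90, 10], [88, 12], [92, 8]]
g_value, dof = g_statistic(replicates)
print(g_value, dof)
```

A small G relative to its degrees of freedom indicates that the replicates share a similar A/G proportion.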
- a2ihelper.filter.merge_files_all_regions(meta, coverage_q30: int = 10)
Merge all RES files for ALL regions (output of REDItools2) into three pandas DataFrames of frequency or count per position. The first DataFrame is the frequency of editing (A-G or T-C). The second DataFrame is the count of A (or T) per position. The last one is the count of G (or C) per position.
- Parameters:
meta (df) – A pandas DataFrame with metadata information for ALL regions of interest. The first four columns are mandatory:
First: full path file names of REDItools2 results tables
Second: sample names
Third: regions (gene symbols)
Fourth: conditions
coverage_q30 (int) – Minimum number of reads with quality higher than 30 required at each position
- Returns:
a tuple of three pandas DataFrames (df, df_a, df_g) and region_list.
df: frequency of editing
df_a: counts of A or T
df_g: counts of G or C
region_list: list of non-empty regions
- Return type:
tuple
- a2ihelper.filter.merge_files_one_region(meta, coverage_q30: int = 10)
Merge all RES files for one UNIQUE region (output of REDItools2) into three pandas DataFrames of frequency or count per position. The first DataFrame is the frequency of editing (A-G or T-C). The second DataFrame is the count of A (or T) per position. The last one is the count of G (or C) per position.
- Parameters:
meta (df) – A pandas DataFrame with metadata information for a UNIQUE region. The first four columns are mandatory:
First: full path file names of REDItools2 results tables
Second: sample names
Third: region (gene symbol)
Fourth: conditions
coverage_q30 (int) – Minimum number of reads with quality higher than 30 required at each position
- Returns:
a tuple of three pandas DataFrames (df, df_a, df_g).
df: frequency of editing
df_a: counts of A or T
df_g: counts of G or C
- Return type:
tuple
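The relation between the three merged tables can be sketched for a single position: the frequency is the G count over the A plus G count, kept only when the coverage requirement is met. This is an assumption about the merge logic, not a2ihelper's code; editing_frequency is a hypothetical helper, and treating the threshold as inclusive is also an assumption.

```python
def editing_frequency(a_count, g_count, coverage_q30, min_coverage=10):
    """Return G / (A + G), or None when the position lacks coverage (hypothetical helper)."""
    if coverage_q30 < min_coverage or (a_count + g_count) == 0:
        return None  # position would be left empty in the merged table
    return g_count / (a_count + g_count)

freq = editing_frequency(a_count=30, g_count=10, coverage_q30=40)
print(freq)
```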
- a2ihelper.filter.pool_positions(df_a, df_g, pvalue_filter_limit=0.05, gtest_filter_limit=0, bh_correction=False)
Pool the Adenine (A) and Guanine (G) counts per position and per condition, to improve the Fisher exact test or Chi2 test between conditions.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts (A). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions
df_g (df) – pandas DataFrame of Guanine (Inosine) counts (G). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions
pvalue_filter_limit (float) – p-value threshold below which the test is considered significant.
gtest_filter_limit (int) – Maximum number of conditions with a significant p-value allowed at one position.
bh_correction (bool) – If True, all p-values will be corrected by false_discovery_control
- Returns:
The inputs df_a and df_g with the filtered coordinates removed.
- Return type:
tuple
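Pooling can be pictured as summing the per-sample counts within each condition, which leaves one row of counts per condition. The sketch below does this with plain dictionaries; pool_by_condition, the condition labels, and the counts are hypothetical, and the G-test filtering step is left out.

```python
from collections import defaultdict

def pool_by_condition(counts, conditions):
    """Sum per-sample counts within each condition (hypothetical helper).

    counts: one dict per sample mapping position -> count
    conditions: condition label of each sample, in the same order
    """
    pooled = defaultdict(lambda: defaultdict(int))
    for sample_counts, condition in zip(counts, conditions):
        for position, value in sample_counts.items():
            pooled[condition][position] += value
    return {cond: dict(pos_counts) for cond, pos_counts in pooled.items()}

a_counts = [{"9_129401662": 40}, {"9_129401662": 35}, {"9_129401662": 20}, {"9_129401662": 25}]
conditions = ["control", "control", "treated", "treated"]
pooled_a = pool_by_condition(a_counts, conditions)
print(pooled_a)
```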
- a2ihelper.filter.replace_nan_by_zero(df, min_ratio_nan: float = 0.6666666666666666)
Replace NaN values with zero if the ratio of NaNs to total samples in one condition is greater than min_ratio_nan.
- Parameters:
df (df) – Frequency editing, Adenine or Guanine counts DataFrame of all samples analyzed. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
min_ratio_nan (float) – Must be a float representing the minimum ratio of NaN values of each condition that can be replaced by zero. Each column will be treated individually.
- Returns:
DataFrame with NaN values replaced by zeros
- Return type:
df
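The rule can be sketched for the values of one position within one condition: NaNs are zero-filled only when their share is strictly greater than min_ratio_nan. The helper replace_nan_in_condition is hypothetical and works on a plain list rather than a DataFrame.

```python
import math

def replace_nan_in_condition(values, min_ratio_nan=2 / 3):
    """Zero-fill NaNs when their share within one condition exceeds min_ratio_nan."""
    n_nan = sum(1 for v in values if math.isnan(v))
    if values and n_nan / len(values) > min_ratio_nan:
        return [0.0 if math.isnan(v) else v for v in values]
    return list(values)  # ratio too low: leave the NaNs untouched

filled = replace_nan_in_condition([math.nan, math.nan, math.nan, 0.2])
print(filled)
```

With the default of 2/3, a condition with two NaNs out of three samples is left unchanged, because the ratio is not strictly greater than the threshold.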
a2ihelper.stats module
- a2ihelper.stats.anova_tukey_test(df, only_pvalue: bool = True, pvalue_filter_limit_anova: float = 0.05, pvalue_filter_limit_tukey: float = 0.05, return_only_significant: bool = True)
ANOVA with post-hoc test for more than two conditions. Note: this function still needs to be tested with more than two conditions and to return a DataFrame.
- Parameters:
df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.
pvalue_filter_limit_anova (float) – Limit of ANOVA p-value to be considered statistically significant
pvalue_filter_limit_tukey (float) – Limit of post-hoc p-value to be considered statistically significant.
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns p-values for Anova post-hoc test.
- Return type:
DataFrame
- a2ihelper.stats.chi2_test(df_a, df_g, only_pvalue: bool = True, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)
Chi-square test for two or more conditions.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.
pvalue_filter_limit (float) – Limit of Chi2 p-value to be considered statistically significant.
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns p-values for Chi2 test.
- Return type:
DataFrame
- a2ihelper.stats.conditon_pearson_corr(df, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)
Calculate the Pearson correlation coefficient (R) of editing frequencies in each condition.
- Parameters:
df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
pvalue_filter_limit (float) – Limit of R Pearson p-value to be considered statistically significant.
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns a DataFrame of R and p-values.
- Return type:
DataFrame
- a2ihelper.stats.entropy_calculation(df_a, df_g)
Calculate the Shannon entropy of the Adenine and Guanine counts in each condition.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
- Returns:
returns Shannon entropy values.
- Return type:
DataFrame
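For a single position, the Shannon entropy of the A/G split is H = -sum(p * log2(p)) over the two proportions. The sketch below assumes base-2 logarithms (entropy in bits); the base used by a2ihelper is not stated here, and shannon_entropy is a hypothetical helper.

```python
import math

def shannon_entropy(a_count, g_count):
    """Entropy in bits of the A/G split at one position (hypothetical helper)."""
    total = a_count + g_count
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in (a_count, g_count):
        if count > 0:  # a zero count contributes nothing to the sum
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

balanced = shannon_entropy(5, 5)
unedited = shannon_entropy(10, 0)
print(balanced, unedited)
```

Under this definition, entropy is highest when A and G are equally frequent and zero when only one base is observed.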
- a2ihelper.stats.fisher_test(df_a, df_g, only_pvalue=True, return_only_significant=True, pvalue_filter_limit=0.05)
Fisher exact test for two conditions.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.
pvalue_filter_limit (float) – Limit of Fisher p-value to be considered statistically significant.
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns p-values for Fisher Exact test.
- Return type:
DataFrame
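To make the test concrete, the sketch below computes a two-sided Fisher exact p-value for one position from a 2x2 table whose rows are the two conditions and whose columns are the pooled A and G counts. It is a standard-library illustration with made-up counts, not the routine a2ihelper calls; fisher_exact_two_sided is a hypothetical helper.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d

    def weight(k):
        # Unnormalized hypergeometric probability of k in the top-left cell
        return comb(col1, k) * comb(total - col1, row1 - k)

    k_min, k_max = max(0, row1 - (total - col1)), min(row1, col1)
    observed = weight(a)
    # Sum the tables that are no more likely than the observed one
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / comb(total, row1)

# Rows: condition 1, condition 2; columns: pooled A count, pooled G count
p_value = fisher_exact_two_sided([[8, 2], [1, 5]])
print(p_value)
```

Comparing integer weights instead of floating-point probabilities avoids rounding problems when two tables are exactly as likely as each other.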
- a2ihelper.stats.gene_pearson_corr(df, exprs: list, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)
Calculate the Pearson correlation coefficient (R) between editing frequencies and the expression of a specific gene.
- Parameters:
df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
exprs (list) – List of expression values of the gene of interest, in the same sample order as df.
pvalue_filter_limit (float) – Limit of R Pearson p-value to be considered statistically significant.
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns a DataFrame of R and p-values.
- Return type:
DataFrame
- a2ihelper.stats.kruskal_dunn_test(df, only_pvalue: bool = True, pvalue_filter_limit_kruskal: float = 0.05, pvalue_filter_limit_dunn: float = 0.05, return_only_significant: bool = True, p_adjust: str = None)
Kruskal-Wallis with post-hoc test for more than two conditions. Note: this function still needs to be tested with more than two conditions and to return a DataFrame.
- Parameters:
df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.
pvalue_filter_limit_kruskal (float) – Limit of Kruskal-Wallis p-value to be considered statistically significant
pvalue_filter_limit_dunn (float) – Limit of Dunn post-hoc p-value to be considered statistically significant.
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns p-values for Kruskal-Wallis post-hoc test.
- Return type:
DataFrame
- a2ihelper.stats.mannwhitney_test(df, only_pvalue: bool = True, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)
Perform the Mann-Whitney U rank test on two independent samples between two conditions.
- Parameters:
df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.
pvalue_filter_limit (float) – Limit of p-value to be considered statistically significant
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns p-values for Mann-Whitney U rank test.
- Return type:
DataFrame
- a2ihelper.stats.odds_r(df_a, df_g)
Compute the odds ratio for two conditions.
- Parameters:
df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must be pooled beforehand with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.
- Returns:
returns odds ratio values.
- Return type:
DataFrame
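For one position, the odds ratio compares the odds of editing (G over A) between the two conditions. The sketch below shows the arithmetic on pooled counts; which condition goes in the numerator is an assumption here, and odds_ratio is a hypothetical helper with made-up counts.

```python
import math

def odds_ratio(a1, g1, a2, g2):
    """Odds of editing (G over A) in condition 1 relative to condition 2."""
    numerator = g1 * a2
    denominator = a1 * g2
    if denominator == 0:
        # Undefined or unbounded ratio when a denominator count is zero
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator

# Condition 1: 80 A and 20 G reads; condition 2: 90 A and 10 G reads
ratio = odds_ratio(80, 20, 90, 10)
print(ratio)
```

A ratio above 1 means the odds of observing G are higher in the first condition than in the second.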
- a2ihelper.stats.t_student_test(df, only_pvalue: bool = True, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)
Perform Student's t-test on two independent samples between two conditions.
- Parameters:
df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.
only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.
pvalue_filter_limit (float) – Limit of p-value to be considered statistically significant
return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.
- Returns:
returns p-values for Student's t-test.
- Return type:
DataFrame
a2ihelper.plot module
- a2ihelper.plot.boxplot(df, positions_to_plot: list = None, log_scale: bool = False, ax=None, pvalue_list=None, figsize: tuple = None, order: list = None, hue_order: list = None, palette: str = None)
- a2ihelper.plot.confidence_ellipse(x, y, ax, n_std=3.0, facecolor='none', **kwargs)
Create a plot of the covariance confidence ellipse of x and y. This function is borrowed from https://matplotlib.org/stable/gallery/statistics/confidence_ellipse.html
- Parameters:
x (array-like, shape (n, )) – Input data.
y (array-like, shape (n, )) – Input data.
ax (matplotlib.axes.Axes) – The axes object to draw the ellipse into.
n_std (float) – The number of standard deviations to determine the ellipse’s radii.
**kwargs – Forwarded to ~matplotlib.patches.Ellipse
- Return type:
matplotlib.patches.Ellipse
- a2ihelper.plot.corr_pearson_plot(p_corr, log_scale: bool = False, p_value_line: float = None, ax=None, figsize: tuple = None)
- a2ihelper.plot.entropy_plot(entr, n_top: int = 50, log_scale: bool = False, ax=None, figsize: tuple = None, order: list = None, hue_order: list = None, palette: str = None)
- a2ihelper.plot.manhattanplot(df_or, df_pv, p_value_line: float = None, chr_order: list = [], ax=None, figsize: tuple = None)
- a2ihelper.plot.pca(data, condition, hue=False, ax=None, figsize=None, conf_ellipse=False)
- a2ihelper.plot.tsne(data, condition, hue=False, ax=None, figsize=None, conf_ellipse=False)
- a2ihelper.plot.volcanoplot(data, pv_col='padj', pv_lim=0.1, logFC_col='log2FoldChange', logFC_lim=1.5, gene_col=False, figsize=None, ax=None, use_adjusttext=False, text=None)