a2ihelper package

Submodules

a2ihelper.call_reditools2 module

a2ihelper.call_reditools2.get_genes_positions(genes: list, path_ref_annotation: str, gzip_file: bool = False) list

Return the coordinates of each gene symbol from a GTF file. The result can be used as input to run_per_gene_position_list.

Parameters:
  • genes (list) – list of gene symbols for which to get coordinates

  • path_ref_annotation (str) – full reference GTF file path.

Returns:

a list of coordinates, one per gene symbol (chr:start-end).

Return type:

list

Example

Using the GTF file from GENCODE

>>> get_genes_positions(['B2m', 'Apol1'], '/.../GRCh38.p14.genome.fa')
['chr2:122147686-122153083', 'chr18:60803848-60812646']
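
As a rough illustration of where such coordinates come from, the sketch below scans GTF lines for `gene` records and formats them as chr:start-end. It is not the package's implementation: the helper name `gene_coordinates_from_gtf_lines`, the toy annotation, and the reliance on a `gene_name` attribute are assumptions made for the example.

```python
import re

# Hypothetical helper: NOT the package's implementation.
def gene_coordinates_from_gtf_lines(lines, genes):
    """Return 'chr:start-end' strings for the requested gene symbols."""
    wanted = set(genes)
    found = {}
    for line in lines:
        if line.startswith("#"):          # skip GTF header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue                      # keep only whole-gene records
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match and match.group(1) in wanted:
            found[match.group(1)] = f"{fields[0]}:{fields[3]}-{fields[4]}"
    # Preserve the order of the input gene list; skip symbols not found.
    return [found[g] for g in genes if g in found]


# Toy annotation (made-up genes and coordinates).
toy_gtf = [
    "##description: toy annotation\n",
    'chr2\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n',
    'chr2\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n',
    'chr18\tHAVANA\tgene\t5000\t7000\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";\n',
]
toy_result = gene_coordinates_from_gtf_lines(toy_gtf, ["GeneB", "GeneA"])
print(toy_result)
```

The result follows the order of the requested symbols, which is convenient when the list is passed on to a per-region runner.
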
a2ihelper.call_reditools2.get_utr_genes_positions(genes: list, path_ref_annotation: str, gzip_file: bool = False) list

Return the coordinates of each gene symbol from a GTF file. The result can be used as input to run_per_gene_position_list.

Parameters:
  • genes (list) – list of gene symbols for which to get coordinates

  • path_ref_annotation (str) – full reference GTF file path.

Returns:

a list of coordinates, one per gene symbol (chr:start-end).

Return type:

list

Example

Using the GTF file from GENCODE

>>> get_genes_positions(['B2m', 'Apol1'], '/.../GRCh38.p14.genome.fa')
['chr2:122147686-122153083', 'chr18:60803848-60812646']
a2ihelper.call_reditools2.indexing_ref(path_ref_genome)
a2ihelper.call_reditools2.run_per_gene_position(gene_position: str, in_bam_file: str, path_out_res: str, ref_genome_file: str, path_reditools: str, reditools_options: str) None

Run REDItools2 from Python for a single genomic region.

Parameters:
  • gene_position (str) – chromosome:start-end coordinate in which to search for editing sites. Example: ‘chr2:122147686-122153083’. The coordinates of a gene symbol can be obtained with the get_genes_positions function.

  • in_bam_file (str) – full input BAM file path

  • path_out_res (str) – output directory

  • ref_genome_file (str) – full reference FASTA file path. It must be the same file used to build the aligned BAM file.

  • path_reditools (str) – full path of the directory where reditools.py is installed. It is usually similar to ‘/../reditools2.0/src/cineca’

  • reditools_options (str) – optional arguments for REDItools2. All options are explained at https://github.com/BioinfoUNIBA/REDItools2

Returns:

Nothing; the function only runs REDItools2.

Return type:

None

Example

Using the toyfile from https://github.com/guilhermetabordaribas/a2iHelperPy

>>> gene_position = 'chr2:122147686-122153083'
>>> in_bam_file = '/.../sample1.sortedByCoord.out.bam'
>>> path_out_res = '/.../out/'
>>> ref_genome_file = '/.../GRCh38.p14.genome.fa'
>>> path_reditools = '/.../reditools2.0/src/cineca/'
>>> reditools_options = '--strict'
>>> run_per_gene_position(gene_position, in_bam_file, path_out_res, ref_genome_file, path_reditools, reditools_options='--strict')
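
To make the role of each argument more concrete, here is a minimal sketch of how a REDItools2 command line could be assembled for one region. It only builds the argument list and does not run anything. The flags `-f`, `-r`, `-g` and `-o` are taken from memory of the REDItools2 README and should be checked against it; the helper name, the output file naming, and the short file names are hypothetical, and the package may build its command differently.

```python
import os
import shlex

# Hypothetical helper: NOT the package's implementation.
def build_reditools_command(gene_position, in_bam_file, path_out_res,
                            ref_genome_file, path_reditools, reditools_options=""):
    """Assemble (but do not run) a REDItools2 command line for one region."""
    script = os.path.join(path_reditools, "reditools.py")
    # Hypothetical output naming scheme: one table per region.
    out_file = os.path.join(path_out_res, gene_position.replace(":", "_") + ".txt")
    cmd = ["python", script,
           "-f", in_bam_file,       # input BAM (assumed flag)
           "-r", ref_genome_file,   # reference FASTA (assumed flag)
           "-g", gene_position,     # region chr:start-end (assumed flag)
           "-o", out_file]          # output table (assumed flag)
    cmd += shlex.split(reditools_options)   # e.g. '--strict'
    return cmd


cmd = build_reditools_command("chr2:122147686-122153083", "sample1.bam", "out",
                              "genome.fa", "cineca", "--strict")
print(" ".join(cmd))
# The list could then be handed to subprocess.run(cmd, check=True).
```
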
a2ihelper.call_reditools2.run_per_gene_position_list(genes_positions: list, in_bam_file: str, path_out_res: str, ref_genome_file: str, path_reditools: str, reditools_options: str, n_jobs: int = 4) None

Run run_per_gene_position for a list of gene coordinates (genes_positions).

Parameters:
  • genes_positions (list) – list of chromosome:start-end coordinates in which to search for editing sites. Example: [‘chr2:122147686-122153083’, ‘chr18:60803848-60812646’, ‘chr6:65671590-65712326’]. The coordinates of gene symbols can be obtained with the get_genes_positions function.

  • in_bam_file (str) – full input BAM file path

  • path_out_res (str) – output directory

  • ref_genome_file (str) – full reference FASTA file path. It must be the same file used to build the aligned BAM file.

  • path_reditools (str) – full path of the directory where reditools.py is installed. It is usually similar to ‘/../reditools2.0/src/cineca’

  • reditools_options (str) – optional arguments for REDItools2. All options are explained at https://github.com/BioinfoUNIBA/REDItools2

  • n_jobs (int) – number of parallel jobs

Returns:

Nothing; the function only runs REDItools2 for a list of coordinates.

Return type:

None

Example

Using the toyfile from https://github.com/guilhermetabordaribas/a2iHelperPy

>>> genes_positions = ['chr2:122147686-122153083', 'chr6:65671590-65712326', 'chr15:78191114-78206400']
>>> in_bam_file = '/.../sample1.sortedByCoord.out.bam'
>>> path_out_res = '/.../out/'
>>> ref_genome_file = '/.../GRCh38.p14.genome.fa'
>>> path_reditools = '/.../reditools2.0/src/cineca/'
>>> reditools_options = '--strict'
>>> run_per_gene_position_list(genes_positions, in_bam_file, path_out_res, ref_genome_file, path_reditools, reditools_options='--strict', n_jobs=4)
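
The n_jobs idea can be sketched with the standard library: apply one worker per region, with at most n_jobs running at once. This is an illustration only; how the package actually parallelizes is not stated here, and `run_for_each_region` and `fake_worker` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical helper: NOT the package's implementation.
def run_for_each_region(worker, genes_positions, n_jobs=4):
    """Call worker(region) for every region, at most n_jobs at a time."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() keeps the results in the same order as the input list.
        return list(pool.map(worker, genes_positions))


def fake_worker(region):
    # Stand-in for run_per_gene_position; it only reports what it would do.
    return f"processed {region}"


regions = ["chr2:122147686-122153083", "chr6:65671590-65712326"]
results = run_for_each_region(fake_worker, regions, n_jobs=2)
print(results)
```

Threads are a reasonable choice in this sketch because each real job would spend its time waiting on an external process rather than running Python code.
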

a2ihelper.filter module

a2ihelper.filter.call_snp_vep(coordinates: list = [], species: str = 'homo_sapiens', sub: str = 'G')

Query SNPs in VEP (the Ensembl Variant Effect Predictor).

Parameters:
  • coordinates (list) – List of strings to query in VEP. Each string must have the form chr_coordinate (e.g. [‘9_129401662’, ‘1_6524705’]).

  • species (str) – String with the species that you are requesting. It is based on Ensembl names like homo_sapiens, mus_musculus, rattus_norvegicus, zebrafish. Other genomes can be searched here: https://rest.ensembl.org/documentation/info/species

  • sub (str) – Substitution base, can be G (A>G) or C (T>C).

Returns:

List of chr_coordinate and respective rsID

Return type:

list
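
The chr_coordinate strings and the sub argument together describe a single-base variant. The sketch below turns them into VCF-like lines of the kind the Ensembl REST VEP region endpoint is understood to accept in a POST body. That input format is an assumption to verify against the Ensembl documentation, the helper name is hypothetical, and no request is sent.

```python
# Hypothetical helper: NOT the package's implementation.
def to_vep_variants(coordinates, sub="G"):
    """Turn 'chr_coordinate' strings into VCF-like variant lines."""
    # Reference base implied by the substitution: A>G or T>C.
    ref_for_alt = {"G": "A", "C": "T"}
    if sub not in ref_for_alt:
        raise ValueError("sub must be 'G' (A>G) or 'C' (T>C)")
    variants = []
    for coord in coordinates:
        # Split on the last underscore so contig names containing '_' survive.
        chrom, pos = coord.rsplit("_", 1)
        variants.append(f"{chrom} {int(pos)} . {ref_for_alt[sub]} {sub} . . .")
    return variants


vep_lines = to_vep_variants(["9_129401662", "1_6524705"], sub="G")
print(vep_lines)
```
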

a2ihelper.filter.filter_gtest(df_a, df_g, pvalue_filter_limit=0.05, gtest_filter_limit=0, bh_correction=False)

Filter positions by limiting the number of samples with significant p-values in the G-test of independence.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts (A). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions

  • df_g (df) – pandas DataFrame of Guanine (Inosine) counts (G). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions

  • pvalue_filter_limit (float) – p-value threshold below which the test is considered significant.

  • gtest_filter_limit (int) – Maximum number of conditions with a significant p-value allowed at one position

  • bh_correction (bool) – If True, all p-values will be corrected with false_discovery_control

Returns:

The input DataFrames df_a and df_g without the filtered coordinates.

Return type:

tuple
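
For readers unfamiliar with the correction behind bh_correction, the following is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. It illustrates the technique only; the package is described as relying on false_discovery_control, and the helper name here is hypothetical.

```python
# Hypothetical helper illustrating Benjamini-Hochberg; NOT the package's code.
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adj])
```
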

a2ihelper.filter.filter_positions(df, nan_filter=True, nan_filter_limit=0, zero_filter=True, zero_filter_limit=0, hundred_filter=True, hundred_filter_limit=0, per_condition=False)

Filter positions by limiting the number of samples with NaN values, zero editing, or 100% editing.

Parameters:
  • df (df) – Editing frequency DataFrame of all samples analyzed. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • nan_filter (bool) – Must be True to filter by the number of NaN values across samples in each condition

  • nan_filter_limit (int) – Maximum number of samples with NaN at one position in each condition

  • zero_filter (bool) – Must be True to filter by the number of zero-editing values across samples in each condition

  • zero_filter_limit (int) – Maximum number of samples with zero frequency at one position in each condition

  • hundred_filter (bool) – Must be True to filter by the number of 100% editing values across samples in each condition

  • hundred_filter_limit (int) – Maximum number of samples with 100% frequency at one position in each condition

  • per_condition (bool) – True to apply the limits to each condition individually; False to apply them across all samples.

Returns:

DataFrame without filtered positions

Return type:

df
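
The three count-based limits can be illustrated on plain lists. In this sketch a position is kept only if the number of NaN, zero, and fully edited samples stays within each limit. It assumes that 100% editing is stored as 1.0 (the `full_value` argument) and uses made-up positions; the helper is hypothetical and works on lists rather than the package's DataFrames.

```python
import math

# Hypothetical helper: NOT the package's implementation.
def keep_position(values, nan_limit=0, zero_limit=0, hundred_limit=0, full_value=1.0):
    """Decide whether one position passes the three count-based filters."""
    n_nan = sum(1 for v in values if math.isnan(v))
    n_zero = sum(1 for v in values if v == 0)
    n_full = sum(1 for v in values if v == full_value)
    return n_nan <= nan_limit and n_zero <= zero_limit and n_full <= hundred_limit


nan = float("nan")
freqs = {                       # position -> editing frequency per sample (toy)
    "9_129401662": [0.10, 0.25, 0.30],
    "1_6524705": [nan, 0.20, 0.40],
    "2_1000": [0.0, 0.0, 0.5],
}
kept = [pos for pos, vals in freqs.items() if keep_position(vals)]
print(kept)
```

Raising a limit relaxes the corresponding filter; for example, `nan_limit=1` would also keep the second position.
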

a2ihelper.filter.filter_snps_vep(df, species: str = 'homo_sapiens', sub: str = 'G')

Filter positions by querying SNPs in VEP (the Ensembl Variant Effect Predictor).

Parameters:
  • df (DataFrame) – Frequency editing DataFrame of all samples analyzed. The DataFrame must be like merge_files output. The columns must be chr_coordinate (e.g: 9_129401662). The last two columns must be region and conditions.

  • species (str) – String with the species that you are requesting. It is based on Ensembl names like homo_sapiens, mus_musculus, rattus_norvegicus, zebrafish. Other genomes can be searched here: https://rest.ensembl.org/documentation/info/species

  • sub (str) – Substitution base, can be G (A>G) or C (T>C).

Returns:

  • df_1 – DataFrame df without filtered positions

  • df_2 – DataFrame with positions and rsID

a2ihelper.filter.independency_gtest(df_a, df_g, only_pvalue=True)

Perform the G-test of independence to check for deviation from the expected proportions and for significant variation among the replicates.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts (A). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions

  • df_g (df) – pandas DataFrame of Guanine (Inosine) counts (G). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions

  • only_pvalue (bool) – If True, the function will return two DataFrames only with p-values. Otherwise will return the attributes of Chi2ContingencyResult object (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html).

Returns:

DataFrame with scipy.stats.chi2_contingency results.

Return type:

df
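
As a worked illustration of the statistic itself, the sketch below computes G = 2 Σ O ln(O/E) for a table whose rows are replicates and whose columns are the A and G counts at one position. It is a hand-rolled example with a hypothetical helper name and toy counts. For p-values, scipy.stats.chi2_contingency accepts lambda_="log-likelihood" to run a G-test.

```python
import math

# Hypothetical helper: NOT the package's implementation.
def g_statistic(table):
    """G statistic and degrees of freedom for a table of observed counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if observed > 0:                 # 0 * ln(0) is taken as 0
                g += observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * g, dof


# Rows are replicates, columns are (A count, G count) at one position.
g_same, dof_same = g_statistic([[10, 20], [20, 40]])   # identical proportions
g_diff, dof_diff = g_statistic([[10, 10], [10, 30]])   # differing proportions
print(g_same, dof_same)
print(round(g_diff, 3), dof_diff)
```

Replicates with identical A/G proportions give a statistic of zero, while diverging replicates push it upward.
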

a2ihelper.filter.merge_files_all_regions(meta, coverage_q30: int = 10)

Merge all RES files for ALL regions (output of REDItools2) into three pandas DataFrames of frequency or count per position. The first DataFrame is the frequency of editing (A-G or T-C). The second DataFrame is the count of A (or T) per position. The last one is the count of G (or C) per position.

Parameters:
  • meta (df) –

    A pandas DataFrame with metadata information for ALL regions of interest. The first four columns are mandatory:

    First: full paths of the REDItools2 result tables
    Second: sample names
    Third: regions (gene symbols)
    Fourth: conditions

  • coverage_q30 (int) – Minimum number of reads above the Q30 quality threshold required at each position

Returns:

a tuple of three pandas DataFrames (df, df_a, df_g) together with region_list.

  • df: frequency of editing

  • df_a: counts of A or T

  • df_g: counts of G or C

  • region_list: list of non-empty regions

Return type:

tuple

a2ihelper.filter.merge_files_one_region(meta, coverage_q30: int = 10)

Merge all RES files for a single, UNIQUE region (output of REDItools2) into three pandas DataFrames of frequency or count per position. The first DataFrame is the frequency of editing (A-G or T-C). The second DataFrame is the count of A (or T) per position. The last one is the count of G (or C) per position.

Parameters:
  • meta (df) –

    A pandas DataFrame with metadata information for a UNIQUE region. The first four columns are mandatory:

    First: full paths of the REDItools2 result tables
    Second: sample names
    Third: region (gene symbol)
    Fourth: conditions

  • coverage_q30 (int) – Minimum number of reads above the Q30 quality threshold required at each position

Returns:

a tuple of three pandas DataFrames (df, df_a, df_g).

  • df: frequency of editing

  • df_a: counts of A or T

  • df_g: counts of G or C

Return type:

tuple
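
The relation between the three tables can be shown with simple arithmetic. The sketch below assumes that the editing frequency at a position is G / (A + G) and that positions below the coverage threshold are reported as NaN. Both points, and the use of A + G as the coverage, are assumptions for illustration rather than confirmed package behavior.

```python
# Hypothetical helper: NOT the package's implementation.
def editing_frequency(a_count, g_count, coverage_min=10):
    """Editing frequency G / (A + G), or NaN when coverage is too low."""
    coverage = a_count + g_count      # assumed stand-in for the Q30 coverage
    if coverage < coverage_min:
        return float("nan")
    return g_count / coverage


freq_ok = editing_frequency(30, 10)   # 10 of 40 reads carry G
freq_low = editing_frequency(3, 2)    # only 5 reads: below the threshold
print(freq_ok, freq_low)
```
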

a2ihelper.filter.pool_positions(df_a, df_g, pvalue_filter_limit=0.05, gtest_filter_limit=0, bh_correction=False)

Pool the counts per position and per condition of the Adenine (A) and Guanine (G) DataFrames, to improve the Fisher exact test or Chi2 test between conditions.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts (A). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions

  • df_g (df) – pandas DataFrame of Guanine (Inosine) counts (G). Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions

  • pvalue_filter_limit (float) – p-value threshold below which the test is considered significant.

  • gtest_filter_limit (int) – Maximum number of conditions with a significant p-value allowed at one position

  • bh_correction (bool) – If True, all p-values will be corrected with false_discovery_control

Returns:

The input DataFrames df_a and df_g without the filtered coordinates.

Return type:

tuple
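
Pooling here means adding up the counts of all samples that share a condition, position by position. The sketch below shows that step on nested dictionaries with toy numbers. It leaves out the G-test filtering controlled by the other parameters, and the helper name and data layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical helper: NOT the package's implementation.
def pool_by_condition(counts, conditions):
    """Sum per-sample counts into one total per condition and position."""
    pooled = defaultdict(lambda: defaultdict(int))
    for sample, per_position in counts.items():
        cond = conditions[sample]
        for position, n in per_position.items():
            pooled[cond][position] += n
    return {cond: dict(pos) for cond, pos in pooled.items()}


a_counts = {   # sample -> position -> Adenine count (toy numbers)
    "s1": {"9_129401662": 30, "1_6524705": 12},
    "s2": {"9_129401662": 25, "1_6524705": 18},
    "s3": {"9_129401662": 10, "1_6524705": 20},
}
conditions = {"s1": "control", "s2": "control", "s3": "treated"}
pooled_a = pool_by_condition(a_counts, conditions)
print(pooled_a)
```
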

a2ihelper.filter.replace_nan_by_zero(df, min_ratio_nan: float = 0.6666666666666666)

Replace NaN values with zero if the ratio of NaNs to total samples in one condition is greater than min_ratio_nan.

Parameters:
  • df (df) – DataFrame of editing frequency, Adenine counts, or Guanine counts for all samples analyzed. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • min_ratio_nan (float) – Minimum ratio of NaN values within a condition above which the NaNs are replaced with zero. Each column is treated individually.

Returns:

DataFrame with NaN values replaced by zeros

Return type:

df
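
The rule can be illustrated for one position within one condition: count the NaNs, compare their ratio with min_ratio_nan, and fill with zeros only when the ratio is exceeded. This is a sketch on plain lists with a hypothetical helper name, not the package's DataFrame logic.

```python
import math

# Hypothetical helper: NOT the package's implementation.
def replace_nan_in_condition(values, min_ratio_nan=2 / 3):
    """Zero-fill NaNs of one condition at one position when NaNs dominate."""
    n_nan = sum(1 for v in values if math.isnan(v))
    if values and n_nan / len(values) > min_ratio_nan:
        return [0.0 if math.isnan(v) else v for v in values]
    return list(values)       # ratio too low: leave the NaNs untouched


nan = float("nan")
filled = replace_nan_in_condition([nan, nan, nan, 0.2])      # 3/4 > 2/3
untouched = replace_nan_in_condition([nan, 0.1, 0.3, 0.2])   # 1/4 <= 2/3
print(filled)
print(untouched)
```
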

a2ihelper.stats module

a2ihelper.stats.anova_tukey_test(df, only_pvalue: bool = True, pvalue_filter_limit_anova: float = 0.05, pvalue_filter_limit_tukey: float = 0.05, return_only_significant: bool = True)

ANOVA with post-hoc test for more than two conditions. Note: this function still needs to be tested with more than two conditions and to return a DataFrame.

Parameters:
  • df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.

  • pvalue_filter_limit_anova (float) – Limit of ANOVA p-value to be considered statistically significant

  • pvalue_filter_limit_tukey (float) – Limit of post-hoc p-value to be considered statistically significant.

  • return_only_significant (bool) – If True, it will return only the p-values less than the corresponding limit (pvalue_filter_limit_anova or pvalue_filter_limit_tukey).

Returns:

returns p-values for the ANOVA post-hoc test.

Return type:

DataFrame

a2ihelper.stats.chi2_test(df_a, df_g, only_pvalue: bool = True, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)

Chi-square test for two or more conditions.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

  • df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

  • only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.

  • pvalue_filter_limit (float) – Limit of Chi2 p-value to be considered statistically significant.

  • return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.

Returns:

returns p-values for Chi2 test.

Return type:

DataFrame

a2ihelper.stats.conditon_pearson_corr(df, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)

Calculate the Pearson correlation coefficient (r) of editing frequencies in each condition.

Parameters:
  • df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • pvalue_filter_limit (float) – Limit of the Pearson correlation p-value to be considered statistically significant.

  • return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.

Returns:

returns a DataFrame of R and p-values.

Return type:

DataFrame

a2ihelper.stats.entropy_calculation(df_a, df_g)

Calculate the Shannon entropy between Adenine and Guanine presence in each condition.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

  • df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

Returns:

returns Shannon entropy values.

Return type:

DataFrame
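
For a single position, the Shannon entropy of the A/G mix is H = -Σ p log2(p) over the two proportions. The sketch below computes it from pooled counts. The helper name is hypothetical, and the choice of base 2 (bits) is an assumption, since the unit used by the package is not stated.

```python
import math

# Hypothetical helper: NOT the package's implementation.
def shannon_entropy(a_count, g_count):
    """Shannon entropy (bits) of the A/G proportions at one position."""
    total = a_count + g_count
    entropy = 0.0
    for count in (a_count, g_count):
        if count > 0:                      # 0 * log2(0) is taken as 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


h_mixed = shannon_entropy(50, 50)    # A and G equally frequent
h_pure = shannon_entropy(100, 0)     # no editing observed
print(h_mixed, h_pure)
```

Entropy is highest when A and G are equally frequent and drops to zero when only one base is observed.
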

a2ihelper.stats.fisher_test(df_a, df_g, only_pvalue=True, return_only_significant=True, pvalue_filter_limit=0.05)

Fisher exact test for two conditions.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

  • df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

  • only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.

  • pvalue_filter_limit (float) – Limit of Fisher p-value to be considered statistically significant.

  • return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.

Returns:

returns p-values for Fisher Exact test.

Return type:

DataFrame
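
To show what the test computes at one position, the sketch below evaluates a two-sided Fisher exact p-value for a 2x2 table of pooled counts (rows: two conditions; columns: A and G). It is a hand-written illustration with toy counts and a hypothetical helper name; the package presumably delegates to a statistics library.

```python
from math import comb

# Hypothetical helper: NOT the package's implementation.
def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Rows: two conditions; columns: pooled (A count, G count).
p_fisher = fisher_exact_two_sided([[3, 1], [1, 3]])
print(round(p_fisher, 4))
```
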

a2ihelper.stats.gene_pearson_corr(df, exprs: list, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)

Calculate the Pearson correlation coefficient (r) between editing frequencies and the expression of a specific gene.

Parameters:
  • df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • exprs (list) – List of expression values of the gene of interest, in the same sample order as df.

  • pvalue_filter_limit (float) – Limit of the Pearson correlation p-value to be considered statistically significant.

  • return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.

Returns:

returns a DataFrame of R and p-values.

Return type:

DataFrame
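
The coefficient itself is straightforward to compute for one position: pair each sample's editing frequency with the expression value in the same sample order. The sketch below does this with the standard library only and leaves out the p-value. The data are toy values and the helper name is hypothetical.

```python
import math

# Hypothetical helper: NOT the package's implementation.
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


editing = [0.10, 0.20, 0.30, 0.40]      # editing frequency per sample (toy)
exprs = [5.0, 7.0, 9.0, 11.0]           # expression in the same sample order
r_value = pearson_r(editing, exprs)
print(round(r_value, 6))
```
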

a2ihelper.stats.kruskal_dunn_test(df, only_pvalue: bool = True, pvalue_filter_limit_kruskal: float = 0.05, pvalue_filter_limit_dunn: float = 0.05, return_only_significant: bool = True, p_adjust: str = None)

Kruskal-Wallis with post-hoc test for more than two conditions. Note: this function still needs to be tested with more than two conditions and to return a DataFrame.

Parameters:
  • df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.

  • pvalue_filter_limit_kruskal (float) – Limit of Kruskal-Wallis p-value to be considered statistically significant

  • pvalue_filter_limit_dunn (float) – Limit of Dunn post-hoc p-value to be considered statistically significant.

  • return_only_significant (bool) – If True, it will return only the p-values less than the corresponding limit (pvalue_filter_limit_kruskal or pvalue_filter_limit_dunn).

Returns:

returns p-values for the Kruskal-Wallis post-hoc (Dunn) test.

Return type:

DataFrame

a2ihelper.stats.mannwhitney_test(df, only_pvalue: bool = True, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)

Perform the Mann-Whitney U rank test on two independent samples between two conditions.

Parameters:
  • df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.

  • pvalue_filter_limit (float) – Limit of p-value to be considered statistically significant

  • return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.

Returns:

returns p-values for Mann-Whitney U rank test.

Return type:

DataFrame
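
The U statistic underlying this test can be obtained by direct pair counting: for every pair of samples from the two conditions, count how often the first exceeds the second, with ties counting one half. The sketch below shows only the statistic, not the p-value, on toy frequencies; the helper name is hypothetical.

```python
# Hypothetical helper: NOT the package's implementation.
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) by direct pair counting; ties count one half."""
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    return u_x, len(xs) * len(ys) - u_x


control = [0.10, 0.20, 0.30]    # editing frequencies at one position (toy)
treated = [0.40, 0.50, 0.60]
u_stats = mann_whitney_u(control, treated)
print(u_stats)
```

Complete separation of the two groups, as in this toy case, yields the most extreme possible pair of U values.
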

a2ihelper.stats.odds_r(df_a, df_g)

Compute the odds ratio for two conditions.

Parameters:
  • df_a (df) – pandas DataFrame of Adenine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

  • df_g (df) – pandas DataFrame of Guanine counts. Rows are samples and columns are coordinates. The DataFrame must first be pooled with a2ihelper.filter.pool_positions(). The last two columns must be region and conditions.

Returns:

returns odds ratio values.

Return type:

DataFrame
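
As a worked example, the odds ratio at one position compares the odds of observing G rather than A in one condition with the same odds in the other. Which condition goes in the numerator is an assumption in this sketch, as are the toy counts and the helper name.

```python
# Hypothetical helper: NOT the package's implementation.
def odds_ratio(a1, g1, a2, g2):
    """Odds of G versus A in condition 1 divided by the odds in condition 2."""
    return (g1 / a1) / (g2 / a2)


# Pooled counts (toy): condition 1 has A=80, G=20; condition 2 has A=90, G=10.
or_value = odds_ratio(80, 20, 90, 10)
print(round(or_value, 3))
```

A value above 1 means editing is relatively more frequent in the first condition; zero counts would need separate handling, which this sketch omits.
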

a2ihelper.stats.t_student_test(df, only_pvalue: bool = True, pvalue_filter_limit: float = 0.05, return_only_significant: bool = True)

Perform Student’s t-test on two independent samples between two conditions.

Parameters:
  • df (df) – pandas DataFrame of editing frequency. Rows are samples and columns are coordinates. The DataFrame must be like merge_files output. The last two columns must be region and conditions.

  • only_pvalue (bool) – If True, it will return only the p-values. Otherwise it will return a tuple with statistic and pvalue.

  • pvalue_filter_limit (float) – Limit of p-value to be considered statistically significant

  • return_only_significant (bool) – If True, it will return only the p-values less than pvalue_filter_limit.

Returns:

returns p-values for Student’s t-test.

Return type:

DataFrame

a2ihelper.plot module

a2ihelper.plot.boxplot(df, positions_to_plot: list = None, log_scale: bool = False, ax=None, pvalue_list=None, figsize: tuple = None, order: list = None, hue_order: list = None, palette: str = None)
a2ihelper.plot.confidence_ellipse(x, y, ax, n_std=3.0, facecolor='none', **kwargs)

This function is borrowed from https://matplotlib.org/stable/gallery/statistics/confidence_ellipse.html. It creates a plot of the covariance confidence ellipse of x and y.

Parameters:
  • x (array-like, shape (n, )) – Input data.

  • y (array-like, shape (n, )) – Input data.

  • ax (matplotlib.axes.Axes) – The axes object to draw the ellipse into.

  • n_std (float) – The number of standard deviations to determine the ellipse’s radii.

  • **kwargs – Forwarded to ~matplotlib.patches.Ellipse

Return type:

matplotlib.patches.Ellipse
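
The geometry behind the ellipse can be sketched without plotting. Following the approach of the Matplotlib gallery example cited above, a unit ellipse with radii sqrt(1 + r) and sqrt(1 - r) is built from the Pearson coefficient r, then scaled by n_std standard deviations along each axis and centered on the means. The helper below only computes those numbers; its name and the returned dictionary are hypothetical.

```python
import math

# Hypothetical helper: NOT the package's implementation.
def ellipse_parameters(x, y, n_std=3.0):
    """Pearson r, unit-ellipse radii, and axis scales for a confidence ellipse."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample (co)variances with the n - 1 denominator.
    var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    pearson = max(-1.0, min(1.0, cov_xy / math.sqrt(var_x * var_y)))
    return {
        "pearson": pearson,
        "radius_x": math.sqrt(1 + pearson),
        "radius_y": math.sqrt(1 - pearson),
        "scale_x": math.sqrt(var_x) * n_std,
        "scale_y": math.sqrt(var_y) * n_std,
        "center": (mean_x, mean_y),
    }


params = ellipse_parameters([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
print(params)
```
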

a2ihelper.plot.corr_pearson_plot(p_corr, log_scale: bool = False, p_value_line: float = None, ax=None, figsize: tuple = None)
a2ihelper.plot.entropy_plot(entr, n_top: int = 50, log_scale: bool = False, ax=None, figsize: tuple = None, order: list = None, hue_order: list = None, palette: str = None)
a2ihelper.plot.manhattanplot(df_or, df_pv, p_value_line: float = None, chr_order: list = [], ax=None, figsize: tuple = None)
a2ihelper.plot.pca(data, condition, hue=False, ax=None, figsize=None, conf_ellipse=False)
a2ihelper.plot.tsne(data, condition, hue=False, ax=None, figsize=None, conf_ellipse=False)
a2ihelper.plot.volcanoplot(data, pv_col='padj', pv_lim=0.1, logFC_col='log2FoldChange', logFC_lim=1.5, gene_col=False, figsize=None, ax=None, use_adjusttext=False, text=None)

Module contents