hgmd.hgmd

Set of modularized components of COMET’s HGMD testing.

For marker expression, float comparisons are fuzzy to 1e-3. Marker expression must therefore be normalized so that a difference of 0.001 is insignificant. For example, 15.001 and 15.000 are treated as equivalent expression values.
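
The fuzzy comparison described above can be illustrated with a small helper. This is a minimal sketch, not the module's own code: the function name fuzzy_equal is hypothetical, and the use of an absolute tolerance is an assumption. The constant name FLOAT_PRECISION is taken from the description of mhg_cutoff_value.

```python
import math

# Tolerance quoted in the text above.
FLOAT_PRECISION = 1e-3


def fuzzy_equal(a, b, tol=FLOAT_PRECISION):
    """Hypothetical helper: treat two expression values as equal if they
    differ by no more than `tol` (absolute tolerance only)."""
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)


# Values well inside the tolerance compare equal; larger gaps do not.
close_pair = fuzzy_equal(15.0004, 15.0)
far_pair = fuzzy_equal(15.01, 15.0)
# Un-normalized data on a tiny scale collapses: 0.0002 and 0.0009 look identical.
tiny_pair = fuzzy_equal(0.0002, 0.0009)
```

The last comparison shows why normalization matters: if expression values live on a scale where 0.001 is a meaningful difference, distinct values become indistinguishable.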

hgmd.hgmd.add_complements(marker_exp)

Adds columns representing gene complement to a gene expression matrix.

Gene complements are represented simplistically: gene expression values for a given gene X are multiplied by -1 and become a new column, labeled X_c. “High” expression values of X become “low” values of X_c, and vice versa, where discrete expression corresponds to a “high” value, and discrete non-expression to a “low” value.

marker_exp should have cell row labels, gene column labels, gene expression float values.

Parameters:
  • marker_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are float values representing gene expression.
Returns:

A DataFrame of the same format as marker_exp, but with a new column added for each existing column label, representing the complement of that column's gene.

Return type:

pandas.DataFrame

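
The complement construction described above (negate each column and label it X_c) can be sketched as follows. The function name add_complements_sketch is hypothetical and this is not necessarily how the module implements it.

```python
import pandas as pd


def add_complements_sketch(marker_exp):
    """Illustrative only: append a negated copy of every gene column,
    labeled with a '_c' suffix."""
    complements = (-marker_exp).add_suffix("_c")
    return pd.concat([marker_exp, complements], axis=1)


exp = pd.DataFrame(
    {"X": [1.5, 0.0], "Y": [2.0, 3.0]},
    index=["cell_1", "cell_2"],
)
with_complements = add_complements_sketch(exp)
```

Here cell_1 has the highest value of X and therefore the lowest value of X_c.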
hgmd.hgmd.batch_t(marker_exp, c_list, coi)

Applies t test to a gene expression matrix, gene by gene.

Parameters:
  • marker_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are float values representing gene expression.
  • c_list – A Series whose indices are cell identifiers, and whose values are the cluster which that cell is part of.
  • coi – The cluster of interest.
Returns:

A matrix with arbitrary row indices whose columns are the gene, the t statistic, and the t p-value, the last two being of float type. Their names are ‘gene’, ‘t_stat’ and ‘t_pval’.

Return type:

pandas.DataFrame
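
A gene-by-gene t test with this input and output shape might look like the sketch below. The source does not say which groups are compared or which t test variant is used; comparing the cluster of interest against all other cells with Welch's test (scipy.stats.ttest_ind with equal_var=False) is an assumption, and batch_t_sketch is a hypothetical name.

```python
import pandas as pd
from scipy import stats


def batch_t_sketch(marker_exp, c_list, coi):
    """Illustrative only: t test of cluster-of-interest cells against
    all other cells, one test per gene column."""
    in_cluster = (c_list.reindex(marker_exp.index) == coi).to_numpy()
    sample = marker_exp[in_cluster].to_numpy()
    rest = marker_exp[~in_cluster].to_numpy()
    # axis=0 runs one test per column (gene); equal_var=False is Welch's test.
    t_stat, t_pval = stats.ttest_ind(sample, rest, axis=0, equal_var=False)
    return pd.DataFrame(
        {"gene": marker_exp.columns, "t_stat": t_stat, "t_pval": t_pval}
    )


cells = ["c1", "c2", "c3", "c4", "c5", "c6"]
exp = pd.DataFrame(
    {"g1": [5.0, 6.0, 7.0, 1.0, 2.0, 1.0], "g2": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]},
    index=cells,
)
clusters = pd.Series([1, 1, 1, 2, 2, 2], index=cells)
t_result = batch_t_sketch(exp, clusters, coi=1)
```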

hgmd.hgmd.batch_xlmhg(marker_exp, c_list, coi, X=None, L=None)

Applies XL-mHG test to a gene expression matrix, gene by gene.

Outputs a DataFrame with a gene column followed by three columns holding the statistical results of XL-mHG.

Parameters:
  • marker_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are float values representing gene expression.
  • c_list – A Series whose indices are cell identifiers, and whose values are the cluster which that cell is part of.
  • coi – The cluster of interest.
  • X – An integer to be used as argument to the XL-mHG test.
  • L – An integer to be used as argument to the XL-mHG test.
Returns:

A matrix with arbitrary row indices, whose columns are the gene name followed by the stat, cutoff, and pval outputs of the XL-mHG test (of float, int, and float type respectively). Their names are ‘gene’, ‘HG_stat’, ‘mHG_cutoff’, and ‘mHG_pval’.

Return type:

pandas.DataFrame

hgmd.hgmd.discrete_exp(marker_exp, cutoff_val)

Converts a continuous gene expression matrix to discrete.

As a note: cutoff values correspond to the “top” of non-expression. Only cells expressing at values greater than the cutoff are marked as “expressing”; cells expressing at the cutoff exactly are not.

Parameters:
  • marker_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are float values representing gene expression.
  • cutoff_val – A Series whose rows are gene identifiers, and values are cutoff values.
Returns:

A gene expression matrix identical to marker_exp, but with boolean rather than float expression values.

Return type:

pandas.DataFrame
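
The strict "greater than the cutoff" rule described above can be sketched with pandas' column-aligned comparison. The name discrete_exp_sketch is hypothetical, and the sketch uses an exact comparison rather than the module's fuzzy float comparison.

```python
import pandas as pd


def discrete_exp_sketch(marker_exp, cutoff_val):
    """Illustrative only: a cell expresses a gene if its value is strictly
    greater than that gene's cutoff."""
    # axis="columns" aligns the Series index (genes) with the column labels.
    return marker_exp.gt(cutoff_val, axis="columns")


exp = pd.DataFrame(
    {"A": [0.0, 1.0, 2.0], "B": [3.0, 3.5, 4.0]},
    index=["c1", "c2", "c3"],
)
cutoffs = pd.Series({"A": 1.0, "B": 3.5})
discrete = discrete_exp_sketch(exp, cutoffs)
```

Cell c2 sits exactly at the cutoff for both genes, so it is marked as non-expressing in both columns.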

hgmd.hgmd.mhg_cutoff_value(marker_exp, cutoff_ind)

Finds discrete expression cutoff value, from given cutoff index.

The XL-mHG test outputs the index of the cutoff of highest significance between a sample and population. This function finds the expression value which corresponds to this index. Cells above this value we define as expressing, and cells below this value we define as non-expressing. We therefore choose this value to be between the expression at the index and the expression of the “next-highest” cell.

For example, for expression [3.0 3.0 1.5 1.0 1.0] and index 4, we should choose a cutoff between 1 and 1.5. This implementation adds epsilon (the value of FLOAT_PRECISION) to the lower bound. In our example, the output will be 1.0 + FLOAT_PRECISION. For FLOAT_PRECISION = 0.001, this is 1.001.

Parameters:
  • marker_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are float values representing gene expression.
  • cutoff_ind – A DataFrame whose ‘gene’ column contains gene identifiers, and whose ‘mHG_cutoff’ column contains cutoff indices.
Returns:

A DataFrame whose ‘gene’ column contains gene identifiers, and whose ‘cutoff_val’ column contains the cutoff values corresponding to the input cutoff indices.

Return type:

pandas.DataFrame
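
The worked example above can be reproduced for a single gene with the sketch below. It assumes the cutoff index is a 0-based position in the expression list sorted in descending order, which is consistent with the example but not stated by the source; cutoff_value_sketch is a hypothetical name.

```python
FLOAT_PRECISION = 0.001


def cutoff_value_sketch(expression, cutoff_ind):
    """Illustrative only: expression value at the cutoff position
    (0-based, descending order) plus FLOAT_PRECISION."""
    ranked = sorted(expression, reverse=True)
    return ranked[cutoff_ind] + FLOAT_PRECISION


# The example from the text: the result lies between 1.0 and 1.5.
cutoff = cutoff_value_sketch([3.0, 3.0, 1.5, 1.0, 1.0], 4)
```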

hgmd.hgmd.mhg_slide(marker_exp, cutoff_val)

Slides cutoff indices in XL-mHG output out of uniform expression groups.

The XL-mHG test may place a cutoff index that “cuts” across a group of uniform expression inside the sorted expression list. I.e. for a population of cells of which many have zero expression, the XL-mHG test may demand that we sample some of the zero-expression cells and not others. This is impossible because the cells are effectively identical. This function therefore moves the XL-mHG cutoff index so that it falls on a measurable gene expression boundary.

Example: for a sorted gene expression list [5, 4, 1, 0, 0, 0] and XL-mHG cutoff index 4, this function will “slide” the index to 3; marking the boundary between zero expression and expression value 1.

Parameters:
  • marker_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are float values representing gene expression.
  • cutoff_val – A DataFrame whose ‘gene’ column contains gene identifiers, and whose ‘cutoff_val’ column contains the cutoff values corresponding to the input cutoff indices.
Returns:

A DataFrame with ‘gene’, ‘mHG_cutoff’, and ‘cutoff_val’ columns, in which the cutoffs have been slid to expression boundaries.

Return type:

pandas.DataFrame
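
The sliding step in the example above can be sketched for a single sorted expression list. The direction of the slide (toward the start of the uniform group) and the 0-based indexing are inferred from that one example and are assumptions; slide_index_sketch is a hypothetical name, and exact equality is used in place of the module's fuzzy comparison.

```python
def slide_index_sketch(sorted_exp, cutoff_ind):
    """Illustrative only: move a cutoff index to the first position of the
    run of equal values it falls in (list sorted in descending order)."""
    value = sorted_exp[cutoff_ind]
    while cutoff_ind > 0 and sorted_exp[cutoff_ind - 1] == value:
        cutoff_ind -= 1
    return cutoff_ind


expression = [5, 4, 1, 0, 0, 0]
slid = slide_index_sketch(expression, 4)         # inside the run of zeros
already_on_boundary = slide_index_sketch(expression, 3)
unique_value = slide_index_sketch(expression, 2)
```

An index that already sits on a boundary, or on a value that occurs only once, is left unchanged.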

hgmd.hgmd.pair_hg(gene_map, in_cls_count, pop_count, in_cls_product, total_product, upper_tri_indices)

Finds hypergeometric statistic of gene pairs.

Takes in discrete single-gene expression matrix, and finds the hypergeometric p-value of the sample that includes cells which express both of a pair of genes.

Parameters:
  • gene_map – An Index mapping index values to gene names.
  • in_cls_count – The number of cells in the cluster.
  • pop_count – The number of cells in the population.
  • in_cls_product – The cluster paired expression count matrix.
  • total_product – The population paired expression count matrix.
  • upper_tri_indices – An array specifying the upper triangular (UT) indices, e.g. as produced by numpy.triu_indices.
Returns:

A matrix with three columns: the two genes of the pair and the hypergeometric test statistic for that pair. Their names are ‘gene_1’, ‘gene_2’, ‘HG_stat’.

Return type:

pandas.DataFrame
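
For a single gene pair, the hypergeometric p-value can be sketched with the standard library alone. The tail convention (probability of seeing at least the observed number of co-expressing cells in the cluster) is an assumption, as are the function and argument names; the source does not spell out either.

```python
from math import comb


def pair_hg_pvalue_sketch(in_cls_both, total_both, in_cls_count, pop_count):
    """Illustrative only: P(X >= in_cls_both) when in_cls_count cells are
    drawn from pop_count cells, of which total_both express both genes."""
    upper = min(total_both, in_cls_count)
    favorable = sum(
        comb(total_both, i) * comb(pop_count - total_both, in_cls_count - i)
        for i in range(in_cls_both, upper + 1)
    )
    return favorable / comb(pop_count, in_cls_count)


# 10 cells, 4 express both genes, cluster of 3 cells, all 3 co-express.
p_all_three = pair_hg_pvalue_sketch(3, 4, 3, 10)
# Observing "at least zero" co-expressing cells is certain.
p_at_least_zero = pair_hg_pvalue_sketch(0, 4, 3, 10)
```

In the first case the p-value is 4/120, since there are 4 ways to pick 3 of the 4 co-expressing cells out of 120 possible clusters of size 3.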

hgmd.hgmd.pair_product(discrete_exp, c_list, coi)

Finds paired expression counts. Returns in matrix form.

The product of the transpose of the discrete_exp DataFrame with discrete_exp itself is a matrix whose rows and columns correspond to individual genes. Each value is the number of cells which express both genes (i.e. the dot product of two lists of 1s and 0s encoding expression/nonexpression for their respective genes in the population). The product therefore encodes joint expression counts for any possible gene pair (including a single gene paired with itself).

This function produces two matrices: one considering only cells inside the cluster of interest, and one considering all cells in the population.

This function also produces a list mapping integer indices to gene names, and the population cell count.

Additionally, only the upper triangular part of the output matrices is unique. This function therefore also returns the upper triangular indices for use by other functions. This is a lazy workaround for a problem with storing gene pairs in the columns ‘gene_1’ and ‘gene_2’: the gene pair (A, B) would otherwise be treated as distinct from (B, A). Restricting to the upper triangular part prevents (B, A) from existing.

TODO: fix this redundancy using multi-indices

Parameters:
  • discrete_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are boolean values representing gene expression.
  • c_list – A Series whose indices are cell identifiers, and whose values are the cluster which that cell is part of.
  • coi – The cluster of interest.
Returns:

(gene mapping list, cluster count, total count, cluster paired expression count matrix, population paired expression count matrix, upper triangular matrix index)

Return type:

(pandas.Index, int, int, numpy.ndarray, numpy.ndarray, numpy.ndarray)
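
The matrix product described above can be sketched with NumPy. The name pair_product_sketch is hypothetical. Using numpy.triu_indices with k=1 (which excludes the diagonal, i.e. a gene paired with itself) is an assumption; the source does not say whether the diagonal is kept.

```python
import numpy as np
import pandas as pd


def pair_product_sketch(discrete_exp, c_list, coi):
    """Illustrative only: joint expression counts for every gene pair,
    inside the cluster of interest and in the whole population."""
    gene_map = discrete_exp.columns
    in_cluster = (c_list.reindex(discrete_exp.index) == coi).to_numpy()
    # Encode expression as 1/0 so dot products count co-expressing cells.
    pop = discrete_exp.to_numpy().astype(int)
    cls = pop[in_cluster]
    total_product = pop.T @ pop
    in_cls_product = cls.T @ cls
    # Assumption: k=1 drops the diagonal; returns a tuple of index arrays.
    upper_tri_indices = np.triu_indices(len(gene_map), k=1)
    return (gene_map, int(in_cluster.sum()), len(pop),
            in_cls_product, total_product, upper_tri_indices)


exp = pd.DataFrame(
    {"A": [True, True, False, True], "B": [True, False, False, True]},
    index=["c1", "c2", "c3", "c4"],
)
clusters = pd.Series([1, 1, 2, 2], index=exp.index)
gene_map, n_cls, n_pop, cls_prod, tot_prod, ut = pair_product_sketch(exp, clusters, 1)
```

In this toy data, tot_prod[0, 1] counts the two cells (c1 and c4) that express both A and B, while the diagonal entries count the cells expressing each gene on its own.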

hgmd.hgmd.pair_tp_tn(gene_map, in_cls_count, pop_count, in_cls_product, total_product, upper_tri_indices)

Finds simple true positive/true negative values for the cluster of interest, for all possible pairs of genes.

Parameters:
  • gene_map – An Index mapping index values to gene names.
  • in_cls_count – The number of cells in the cluster.
  • pop_count – The number of cells in the population.
  • in_cls_product – The cluster paired expression count matrix.
  • total_product – The population paired expression count matrix.
  • upper_tri_indices – An array specifying the upper triangular (UT) indices, e.g. as produced by numpy.triu_indices.
Returns:

A matrix with arbitrary row indices and 4 columns, containing the two genes of the pair, then the true positive and true negative values respectively. Their names are ‘gene_1’, ‘gene_2’, ‘TP’, and ‘TN’.

Return type:

pandas.DataFrame
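
Given the count matrices from pair_product, pairwise TP/TN values might be derived as in the sketch below. The source does not define TP and TN; the definitions used here (TP as the fraction of cluster cells expressing both genes, TN as the fraction of non-cluster cells not expressing both) are assumptions, and pair_tp_tn_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def pair_tp_tn_sketch(gene_map, in_cls_count, pop_count,
                      in_cls_product, total_product, upper_tri_indices):
    """Illustrative only: TP/TN rates per gene pair under the assumed
    definitions stated in the lead-in."""
    rows, cols = upper_tri_indices
    in_cls = in_cls_product[rows, cols]
    out_cls = total_product[rows, cols] - in_cls
    names = np.asarray(gene_map)
    return pd.DataFrame({
        "gene_1": names[rows],
        "gene_2": names[cols],
        "TP": in_cls / in_cls_count,
        "TN": 1 - out_cls / (pop_count - in_cls_count),
    })


pair_rates = pair_tp_tn_sketch(
    pd.Index(["A", "B"]), 2, 4,
    np.array([[2, 1], [1, 1]]), np.array([[3, 2], [2, 2]]),
    np.triu_indices(2, k=1),
)
```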

hgmd.hgmd.tp_tn(discrete_exp, c_list, coi)

Finds simple true positive/true negative values for the cluster of interest.

Parameters:
  • discrete_exp – A DataFrame whose rows are cell identifiers, columns are gene identifiers, and values are boolean values representing gene expression.
  • c_list – A Series whose indices are cell identifiers, and whose values are the cluster which that cell is part of.
  • coi – The cluster of interest.
Returns:

A matrix with arbitrary row indices and 3 columns: one for the gene name, then 2 containing the true positive and true negative values respectively. Their names are ‘gene’, ‘TP’, and ‘TN’.

Return type:

pandas.DataFrame
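
A single-gene version can be sketched directly from the boolean expression matrix. As the source does not define the two quantities, this sketch assumes TP is the fraction of cluster cells that express the gene and TN is the fraction of non-cluster cells that do not; tp_tn_sketch is a hypothetical name.

```python
import pandas as pd


def tp_tn_sketch(discrete_exp, c_list, coi):
    """Illustrative only: per-gene TP/TN rates under the assumed
    definitions stated in the lead-in."""
    in_cluster = (c_list.reindex(discrete_exp.index) == coi).to_numpy()
    # The mean of a boolean column is the fraction of True values.
    tp = discrete_exp[in_cluster].mean(axis=0)
    tn = 1.0 - discrete_exp[~in_cluster].mean(axis=0)
    return pd.DataFrame(
        {"gene": discrete_exp.columns, "TP": tp.to_numpy(), "TN": tn.to_numpy()}
    )


exp = pd.DataFrame(
    {"A": [True, True, False, False], "B": [True, False, False, True]},
    index=["c1", "c2", "c3", "c4"],
)
clusters = pd.Series([1, 1, 2, 2], index=exp.index)
rates = tp_tn_sketch(exp, clusters, coi=1)
```

Gene A is a perfect marker for cluster 1 in this toy data (TP and TN both 1.0), while gene B is uninformative (both 0.5).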