py_research.stats module

Helper functions for statistical evaluation of (dataframe-based) data.

dist_table(df, category_cols, id_cols=None, value_col=None, domains={}, category_parent_cols=None)

Return a frequency table of the distribution of unique entities.

Entities are identified by id_cols. The distribution is presented over the unique categories in category_cols.

Parameters:
  • df (DataFrame) – Dataframe to evaluate.

  • category_cols (str | list[str]) – Columns to evaluate distribution over.

  • id_cols (str | list[str] | None) – Columns to identify entities by.

  • value_col (str | None) – Column holding a unique value per entity; if given, these values are summed up.

  • domains (dict[str, list[Hashable] | ndarray | Index]) – Force the distribution to be evaluated over these domains, filling missing values with 0.

  • category_parent_cols (str | dict[str, str] | None) – If category values are discrete and hierarchical, you may supply a parent column for each category column. This will be used to aggregate the distribution over the parent categories.

Returns:

Series of the distribution’s values (count or sum), indexed by the categories.

Return type:

Series
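
The behaviour described above can be approximated in plain pandas. The following is a minimal sketch, not the library's implementation: the helper `dist_sketch`, its parameter names, and the sample columns are all hypothetical, and it handles only a single category column and a single id column. It assumes that an entity is counted once per category and that a forced domain fills absent categories with 0.

```python
import pandas as pd


def dist_sketch(df, category_col, id_col, value_col=None, domain=None):
    """Hypothetical stand-in for illustration; not py_research's dist_table."""
    # Keep one row per (entity, category) pair so repeated rows are not double-counted.
    unique = df.drop_duplicates(subset=[id_col, category_col])
    grouped = unique.groupby(category_col)
    # Count unique entities, or sum their values when a value column is given.
    dist = grouped[value_col].sum() if value_col else grouped[id_col].nunique()
    if domain is not None:
        # Force the index onto the given domain, filling absent categories with 0.
        dist = dist.reindex(domain, fill_value=0)
    return dist


# Made-up sample data: user "a" appears twice with the same plan.
df = pd.DataFrame(
    {
        "user": ["a", "a", "b", "c"],
        "plan": ["free", "free", "pro", "free"],
        "spend": [0, 0, 30, 5],
    }
)

plans = ["free", "pro", "team"]
counts = dist_sketch(df, "plan", "user", domain=plans)
sums = dist_sketch(df, "plan", "user", value_col="spend", domain=plans)
```

Here `counts` holds the number of distinct users per plan and `sums` holds the summed spend per plan; in both, the unused `"team"` category appears with a value of 0 because it is part of the forced domain.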