py_research.stats module
Helper functions for statistical evaluation of (dataframe-based) data.
- dist_table(df, category_cols, id_cols=None, value_col=None, domains={}, category_parent_cols=None)
Return a frequency table of the distribution of unique entities.
Entities are identified by id_cols. The distribution is presented over the unique categories in category_cols.
- Parameters:
df (DataFrame) – Dataframe to evaluate.
category_cols (str | list[str]) – Columns to evaluate distribution over.
id_cols (str | list[str] | None) – Columns to identify entities by.
value_col (str | None) – Column whose unique values per entity are summed up (instead of counting entities).
domains (dict[str, list[Hashable] | ndarray | Index]) – Force the distribution to be evaluated over these domains, filling missing values with 0.
category_parent_cols (str | dict[str, str] | None) – If category values are discrete and hierarchical, you may supply a parent column for each category column. This will be used to aggregate the distribution over the parent categories.
- Returns:
Series of the distribution’s values (count or sum) given the categories in the index.
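The counting behaviour and the effect of domains described above can be approximated in plain pandas. The following is only a rough sketch under stated assumptions, not the library's implementation: `sketch_dist_table` is a hypothetical helper, it handles a single category column and a single id column, and it ignores value_col and category_parent_cols.

```python
import pandas as pd


def sketch_dist_table(df, category_col, id_col, domain=None):
    """Count unique entities per category.

    Hypothetical helper for illustration; not part of py_research.
    """
    # Count distinct entity ids within each category.
    counts = df.groupby(category_col)[id_col].nunique()
    if domain is not None:
        # Force the index onto the given domain; categories that are
        # absent from the data are filled with 0.
        counts = counts.reindex(domain, fill_value=0)
    return counts


# Toy data (made up): user "a" appears twice but is counted once.
df = pd.DataFrame(
    {
        "user": ["a", "a", "b", "c"],
        "plan": ["free", "free", "pro", "free"],
    }
)

result = sketch_dist_table(df, "plan", "user", domain=["free", "pro", "team"])
print(result)
```

Here the result is a Series indexed by category, and the "team" entry exists only because it was listed in the domain. How dist_table itself treats multiple columns or duplicate values may differ from this sketch.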
- Return type: