puffbird.FrameEngine

class puffbird.FrameEngine(table, datacols=None, indexcols=None, inplace=False, handle_column_types=True, enforce_identifier_string=False, fastpath=False)

Class to handle and transform a pandas.DataFrame object.

Parameters
table : DataFrame

A table with singular Index columns, where each column corresponds to a specific data type. MultiIndex columns will be made singular with the to_flat_index method. It is recommended that all columns and index names are identifier string types. Individual cells within datacols columns may have arbitrary objects in them, but cells within indexcols columns must be hashable.
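The flattening of MultiIndex columns mentioned above can be seen in plain pandas. This is a minimal sketch of what to_flat_index does on its own, not of FrameEngine internals; the column labels are made up for illustration.

```python
import pandas as pd

# Hypothetical table with two-level (MultiIndex) columns.
columns = pd.MultiIndex.from_tuples([("trace", "raw"), ("trace", "filtered")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=columns)

# to_flat_index collapses each multi-level label into one tuple label,
# giving an ordinary single-level Index.
flat_columns = df.columns.to_flat_index()
print(list(flat_columns))
```

After flattening, each column label is a tuple such as ('trace', 'raw') rather than a pair of levels.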

datacols : list-like, optional

The columns in table that are considered “data”, for example columns where each cell is a numpy.ndarray object. If None, all columns are considered datacols columns, unless indexcols is specified. Defaults to None.

indexcols : list-like, optional

The columns in table that hold immutable or hashable types, e.g. strings or integers. These may correspond to “metadata” that describes or specifies the datacols columns. If None, only the index of the table, which may be a MultiIndex, is considered to hold indexcols columns. If datacols is specified and indexcols is None, then the remaining columns are also added to the index of table. Defaults to None.
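To judge whether a column is a candidate for indexcols, it can help to test whether its cell values are hashable. The helper below is a hypothetical illustration in plain Python and is not part of puffbird; the example values are invented.

```python
def is_hashable(value):
    """Return True if Python can compute a hash for value."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


# Hypothetical cell values, one per candidate column.
candidates = {
    "subject": "mouse_1",     # str: hashable
    "trial": 3,               # int: hashable
    "key": (1, "a"),          # tuple of hashables: hashable
    "trace": [1, 2, 3],       # list: not hashable
    "params": {"gain": 2},    # dict: not hashable
}
hashable = {name: is_hashable(value) for name, value in candidates.items()}
print(hashable)
```

Columns whose cells fail this check, such as lists or dicts, are better treated as datacols.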

inplace : bool, optional

If True, avoid copying the table object where possible. Defaults to False.

handle_column_types : bool, optional

If True, convert non-string column types to strings. Defaults to True.

enforce_identifier_string : bool, optional

If True, try to convert all types to identifier string types and check that all columns are identifier string types. Enforcement only works if column types are str, Number, or tuple object types. Raises an error if enforcement fails. Defaults to False.
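As a rough guide to which labels already count as identifier strings, the sketch below uses Python's built-in str.isidentifier. It assumes that "identifier string" means a valid Python identifier; the helper name is hypothetical, and it does not reproduce the conversion FrameEngine attempts.

```python
def is_identifier_string(label):
    """Return True if label is a str that is a valid Python identifier."""
    return isinstance(label, str) and label.isidentifier()


# Hypothetical column labels of mixed types.
labels = ["a", "b_level0", "my col", 0, ("x", 1)]
checks = [is_identifier_string(label) for label in labels]
print(checks)
```

Labels containing spaces, and non-string labels such as numbers or tuples, fail the check and would need conversion.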

Notes

A table has singular Index columns, where each column corresponds to a specific data type. Tables of this kind are often fetched from databases that use data models such as datajoint. The table often needs to be transformed so that computations such as groupby can be performed or the data can be plotted easily with packages such as seaborn.

In the table, the columns and the index names are considered together and divided into datacols and indexcols. “Data columns” are usually columns that contain iterable Python objects, which need to be “exploded” to convert these columns into numeric or other immutable data types. This is why I call these types of tables “puffy” dataframes. “Index columns” usually contain other information, often considered “metadata”, that uniquely identifies each row.

Every row of a given column is assumed to hold the same data type, so all cells in that column can be “exploded” the same way. Missing data (NaNs) are allowed.
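The idea of “exploding” a data column can be illustrated with pandas alone. The sketch below uses DataFrame.explode on a made-up table with one index column and one list-valued data column; it only shows the concept and is not how FrameEngine is implemented.

```python
import pandas as pd

# Hypothetical "puffy" table: one index column, one list-valued data column.
df = pd.DataFrame(
    {"subject": ["s1", "s2"], "trace": [[1, 2, 3], [4, 5]]}
).set_index("subject")

# explode gives each list element its own row and repeats the index value.
long_df = df.explode("trace").reset_index()
print(long_df)
```

Each element of a list now sits in its own row next to the “metadata” that identifies it, which is the shape groupby and plotting libraries usually expect.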

Examples

>>> import pandas as pd
>>> import puffbird as pb
>>> df = pd.DataFrame({
...     'a': [[1,2,3], [4,5,6,7], [3,4,5]],
...     'b': [{'c':['asdf'], 'd':['ret']}, {'d':['r']}, {'c':['ff']}],
... })
>>> df
              a                              b
0     [1, 2, 3]  {'c': ['asdf'], 'd': ['ret']}
1  [4, 5, 6, 7]                   {'d': ['r']}
2     [3, 4, 5]                  {'c': ['ff']}
>>> engine = pb.FrameEngine(df)

The FrameEngine instance has various methods that allow for quick manipulation of this “puffy” dataframe. For example, we can create a long dataframe using the to_long() method:

>>> engine.to_long()
    index_col_0 b_level0  b_level1     b  a_level0    a
0             0        c         0  asdf         0  1.0
1             0        c         0  asdf         1  2.0
2             0        c         0  asdf         2  3.0
3             0        d         0   ret         0  1.0
4             0        d         0   ret         1  2.0
5             0        d         0   ret         2  3.0
6             1        d         0     r         0  4.0
7             1        d         0     r         1  5.0
8             1        d         0     r         2  6.0
9             1        d         0     r         3  7.0
10            2        c         0    ff         0  3.0
11            2        c         0    ff         1  4.0
12            2        c         0    ff         2  5.0

Attributes

cols

Tuple of “data columns” and “index columns” in the table.

cols_rename

Mapping of renamed “data columns” and “index columns” in table.

datacols

Tuple of the “data columns” in the table.

datacols_rename

Mapping of renamed “data columns” in table.

indexcols

Tuple of the “index columns” in the table.

indexcols_rename

Mapping of renamed “index columns” in table.

table

DataFrame passed during initialization.

Methods

apply(func, new_col_name, *args[, …])

Apply a function to each row in the table.

col_apply(func, col[, new_col_name, …])

Apply a function to a specific column in each row in the table.

drop(*cols[, skip, skip_index, skip_data])

Drop columns in place.

expand_col(col[, reset_index, dropna, …])

Expand a column that contains DataFrame or Series object types to create a single long-format DataFrame.

multid_pivot([values])

Pivot the table to create a multidimensional xarray.DataArray or xarray.Dataset object.

rename(**rename_kws)

Rename columns in place.

to_long(*cols[, iterable, max_depth, …])

Transform the “puffy” table into a long-format DataFrame.

to_puffy(*indexcols[, keep_missing_idcs, …])

Make the table “puffier” by aggregating across unique sets of “index columns”.
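To give a feel for what making a table “puffier” means, the sketch below aggregates a long-format table back into list-valued cells with a plain pandas groupby. It is an analogy under the assumption that aggregation collects values per unique index combination; it does not call to_puffy, and the data are invented.

```python
import pandas as pd

# Hypothetical long-format table.
long_df = pd.DataFrame(
    {
        "subject": ["s1", "s1", "s1", "s2", "s2"],
        "value": [1, 2, 3, 4, 5],
    }
)

# Collect all values belonging to each unique "index column" entry in a list.
puffy = long_df.groupby("subject")["value"].agg(list)
print(puffy)
```

The result has one row per subject, with each cell holding the list of values for that subject.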