`fsspec_utils.utils.polars` API Reference

`opt_dtype()`

Optimize data types of a Polars DataFrame for performance and memory efficiency.

This function analyzes each column and converts it to the most appropriate data type based on content, handling string-to-type conversions and numeric type downcasting.

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame to optimize.
`include`	`list[str]` or `None`	Optional list of column names to include in the optimization process. If None, all columns are considered.
`exclude`	`list[str]` or `None`	Optional list of column names to exclude from the optimization process.
`time_zone`	`str` or `None`	Optional time zone string for datetime parsing.
`shrink_numerics`	`bool`	If True, numeric columns are downcast to smaller data types where this is possible without losing precision.
`allow_unsigned`	`bool`	If True, unsigned integer types will be considered for numeric column optimization.
`allow_null`	`bool`	If True, columns containing only null values will be cast to the Null type.
`strict`	`bool`	If True, an error will be raised if any column cannot be optimized (e.g., due to type inference issues).

Example:

import polars as pl
from fsspec_utils.utils.polars import opt_dtype

df = pl.DataFrame({
    "col_int": ["1", "2", "3"],
    "col_float": ["1.1", "2.2", "3.3"],
    "col_bool": ["True", "False", "True"],
    "col_date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "col_str": ["a", "b", "c"],
    "col_null": [None, None, None]
})
optimized_df = opt_dtype(df, shrink_numerics=True)
print(optimized_df.schema)
# Expected output similar to:
# Schema({
#     'col_int': Int8,
#     'col_float': Float32,
#     'col_bool': Boolean,
#     'col_date': Date,
#     'col_str': Utf8,
#     'col_null': Null
# })

Returns:

polars.DataFrame: DataFrame with optimized data types

`unnest_all()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.
`seperator`	`str`	The separator used to flatten nested column names. Defaults to '_'.
`fields`	`list[str]` or `None`	Optional list of specific fields (structs) to unnest. If None, all struct columns will be unnested.

Example:

import polars as pl
from fsspec_utils.utils.polars import unnest_all

df = pl.DataFrame({
    "id": [1, 2],
    "data": [
        {"a": 1, "b": {"c": 3}},
        {"a": 4, "b": {"c": 6}}
    ]
})
unnested_df = unnest_all(df, seperator='__')
print(unnested_df)
# shape: (2, 3)
# ┌─────┬─────────┬────────────┐
# │ id  ┆ data__a ┆ data__b__c │
# │ --- ┆ ---     ┆ ---        │
# │ i64 ┆ i64     ┆ i64        │
# ╞═════╪═════════╪════════════╡
# │ 1   ┆ 1       ┆ 3          │
# │ 2   ┆ 4       ┆ 6          │
# └─────┴─────────┴────────────┘

Returns:

polars.DataFrame: DataFrame with the struct columns unnested

`explode_all()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.

Example:

import polars as pl
from fsspec_utils.utils.polars import explode_all

df = pl.DataFrame({
    "id": [1, 2],
    "values": [[10, 20], [30]]
})
exploded_df = explode_all(df)
print(exploded_df)
# shape: (3, 2)
# ┌─────┬────────┐
# │ id  ┆ values │
# │ --- ┆ ---    │
# │ i64 ┆ i64    │
# ╞═════╪════════╡
# │ 1   ┆ 10     │
# │ 1   ┆ 20     │
# │ 2   ┆ 30     │
# └─────┴────────┘

Returns:

polars.DataFrame: DataFrame with the list columns exploded into rows

`with_strftime_columns()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.
`strftime`	`str`	The `strftime` format string (e.g., "%Y-%m-%d" for date, "%H" for hour).
`timestamp_column`	`str`	The name of the timestamp column to use. Defaults to 'auto' (attempts to infer).
`column_names`	`list[str]` or `None`	Optional list of new column names to use for the generated columns. If None, names are derived from the `strftime` format.

Returns:

None

`with_truncated_columns()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.
`truncate_by`	`str`	The duration string to truncate by (e.g., "1h", "1d", "1mo").
`timestamp_column`	`str`	The name of the timestamp column to truncate. Defaults to 'auto' (attempts to infer).
`column_names`	`list[str]` or `None`	Optional list of new column names for the truncated columns. If None, names are derived automatically.

Returns:

None

`with_datepart_columns()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.
`timestamp_column`	`str`	The name of the timestamp column to extract date parts from. Defaults to 'auto' (attempts to infer).
`year`	`bool`	If True, extract the year as a new column.
`month`	`bool`	If True, extract the month as a new column.
`week`	`bool`	If True, extract the week of the year as a new column.
`yearday`	`bool`	If True, extract the day of the year as a new column.
`monthday`	`bool`	If True, extract the day of the month as a new column.
`day`	`bool`	If True, extract the day of the week (1-7, Monday=1) as a new column.
`weekday`	`bool`	If True, extract the weekday (0-6, Monday=0) as a new column.
`hour`	`bool`	If True, extract the hour as a new column.
`minute`	`bool`	If True, extract the minute as a new column.
`strftime`	`str` or `None`	Optional `strftime` format string to apply to the timestamp column before extracting parts.

Returns:

None

`with_row_count()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.
`over`	`list[str]` or `None`	Optional list of column names to partition the data by before adding row counts. If None, a global row count is added.

Returns:

None

`drop_null_columns()`

Remove columns with all null values from the DataFrame.

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame.

Example:

import polars as pl
from fsspec_utils.utils.polars import drop_null_columns

df = pl.DataFrame({
    "col1": [1, 2, 3],
    "col2": [None, None, None],
    "col3": ["a", None, "c"]
})
df_cleaned = drop_null_columns(df)
print(df_cleaned)
# shape: (3, 2)
# ┌──────┬───────┐
# │ col1 ┆ col3  │
# │ ---  ┆ ---   │
# │ i64  ┆ str   │
# ╞══════╪═══════╡
# │ 1    ┆ a     │
# │ 2    ┆ null  │
# │ 3    ┆ c     │
# └──────┴───────┘

Returns:

polars.DataFrame: DataFrame with the all-null columns removed

`unify_schemas()`

Parameters:

Name	Type	Description
`dfs`	`list[polars.DataFrame]`	A list of Polars DataFrames whose schemas should be unified.

Returns:

None

`cast_relaxed()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame to cast.
`schema`	`dict` or `polars.Schema`	The target schema to cast the DataFrame to. Can be a dictionary mapping column names to data types or a Polars Schema object.

Returns:

None

`delta()`

Parameters:

Name	Type	Description
`df1`	`polars.DataFrame`	The first Polars DataFrame.
`df2`	`polars.DataFrame`	The second Polars DataFrame.
`subset`	`list[str]` or `None`	Optional list of column names to consider when calculating the delta. If None, all columns are used.
`eager`	`bool`	If True, the delta calculation is performed eagerly. Defaults to False (lazy).

Returns:

None

`partition_by()`

Parameters:

Name	Type	Description
`df`	`polars.DataFrame`	The input Polars DataFrame to partition.
`timestamp_column`	`str` or `None`	The name of the timestamp column to use for time-based partitioning. Defaults to None.
`columns`	`list[str]` or `None`	Optional list of column names to partition by. Defaults to None.
`strftime`	`str` or `None`	Optional `strftime` format string for time-based partitioning. Defaults to None.
`timedelta`	`str` or `None`	Optional timedelta string (e.g., "1h", "1d") for time-based partitioning. Defaults to None.
`num_rows`	`int` or `None`	Optional number of rows per partition for row-based partitioning. Defaults to None.

Returns:

None
