fsspec_utils.utils.polars API Reference

opt_dtype()

Optimize data types of a Polars DataFrame for performance and memory efficiency.

This function analyzes each column and converts it to the most appropriate data type based on content, handling string-to-type conversions and numeric type downcasting.

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame to optimize.
include list[str] or None Optional list of column names to include in the optimization process. If None, all columns are considered.
exclude list[str] or None Optional list of column names to exclude from the optimization process.
time_zone str or None Optional time zone string for datetime parsing.
shrink_numerics bool If True, numeric columns are downcast to smaller data types where this is possible without losing precision.
allow_unsigned bool If True, unsigned integer types will be considered for numeric column optimization.
allow_null bool If True, columns containing only null values will be cast to the Null type.
strict bool If True, an error will be raised if any column cannot be optimized (e.g., due to type inference issues).

Example:

import polars as pl
from fsspec_utils.utils.polars import opt_dtype

df = pl.DataFrame({
    "col_int": ["1", "2", "3"],
    "col_float": ["1.1", "2.2", "3.3"],
    "col_bool": ["True", "False", "True"],
    "col_date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "col_str": ["a", "b", "c"],
    "col_null": [None, None, None]
})
optimized_df = opt_dtype(df, shrink_numerics=True)
print(optimized_df.schema)
# Expected output similar to:
# Schema({
#     'col_int': Int8,
#     'col_float': Float32,
#     'col_bool': Boolean,
#     'col_date': Date,
#     'col_str': Utf8,
#     'col_null': Null
# })

Returns:

  • polars.DataFrame: DataFrame with optimized data types
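
The downcasting controlled by shrink_numerics and allow_unsigned can be pictured as choosing the narrowest integer type whose range covers the column's minimum and maximum. The following is a minimal plain-Python sketch of that idea, not the library's implementation; the helper name smallest_int_type and its default of allow_unsigned=False are assumptions made for illustration.

```python
# Integer type ranges, narrowest first.
SIGNED = [("Int8", -2**7, 2**7 - 1), ("Int16", -2**15, 2**15 - 1),
          ("Int32", -2**31, 2**31 - 1), ("Int64", -2**63, 2**63 - 1)]
UNSIGNED = [("UInt8", 0, 2**8 - 1), ("UInt16", 0, 2**16 - 1),
            ("UInt32", 0, 2**32 - 1), ("UInt64", 0, 2**64 - 1)]


def smallest_int_type(values, allow_unsigned=False):
    """Return the name of the narrowest integer type that holds all values.

    Hypothetical helper: it only illustrates the range check behind
    numeric downcasting.
    """
    present = [v for v in values if v is not None]
    if not present:
        # Nothing but nulls: there is no numeric range to fit.
        return "Null"
    lo, hi = min(present), max(present)
    # Unsigned types are only candidates when no value is negative.
    candidates = (UNSIGNED if allow_unsigned and lo >= 0 else []) + SIGNED
    for name, type_min, type_max in candidates:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")


small = smallest_int_type([1, 2, 3])
unsigned = smallest_int_type([0, 200], allow_unsigned=True)
signed = smallest_int_type([0, 200])
```

With these inputs, 200 does not fit in Int8, so the signed result is one step wider than the unsigned one.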

unnest_all()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.
seperator str The separator used to flatten nested column names. Defaults to '_'.
fields list[str] or None Optional list of specific fields (structs) to unnest. If None, all struct columns will be unnested.

Example:

import polars as pl
from fsspec_utils.utils.polars import unnest_all

df = pl.DataFrame({
    "id": [1, 2],
    "data": [
        {"a": 1, "b": {"c": 3}},
        {"a": 4, "b": {"c": 6}}
    ]
})
unnested_df = unnest_all(df, seperator='__')
print(unnested_df)
# shape: (2, 3)
# ┌─────┬─────────┬────────────┐
# │ id  ┆ data__a ┆ data__b__c │
# │ --- ┆ ---     ┆ ---        │
# │ i64 ┆ i64     ┆ i64        │
# ╞═════╪═════════╪════════════╡
# │ 1   ┆ 1       ┆ 3          │
# │ 2   ┆ 4       ┆ 6          │
# └─────┴─────────┴────────────┘

Returns:

  • polars.DataFrame: DataFrame with struct columns unnested into flat columns
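
The naming of flattened columns in the unnest_all example (data__a, data__b__c) follows from joining the path of nested field names with the separator. A minimal plain-Python sketch of that naming rule on a single record is shown here; flatten_record is a hypothetical helper for illustration and says nothing about how the library performs the unnesting.

```python
def flatten_record(record, separator="_", prefix=""):
    """Flatten nested dicts into one level, joining key paths with `separator`."""
    flat = {}
    for key, value in record.items():
        # Top-level keys keep their name; nested keys get the parent path prepended.
        name = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, separator, name))
        else:
            flat[name] = value
    return flat


row = {"id": 1, "data": {"a": 1, "b": {"c": 3}}}
flat_row = flatten_record(row, separator="__")
```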

explode_all()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.

Example:

import polars as pl
from fsspec_utils.utils.polars import explode_all

df = pl.DataFrame({
    "id": [1, 2],
    "values": [[10, 20], [30]]
})
exploded_df = explode_all(df)
print(exploded_df)
# shape: (3, 2)
# ┌─────┬────────┐
# │ id  ┆ values │
# │ --- ┆ ---    │
# │ i64 ┆ i64    │
# ╞═════╪════════╡
# │ 1   ┆ 10     │
# │ 1   ┆ 20     │
# │ 2   ┆ 30     │
# └─────┴────────┘

Returns:

  • polars.DataFrame: DataFrame with list columns exploded into rows
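
Exploding a list column turns each list element into its own row and repeats the other values of that row. The sketch below shows this on plain Python dictionaries for a single column; explode_rows is a hypothetical helper, and how explode_all treats several list columns at once is not covered here.

```python
def explode_rows(rows, column):
    """Emit one output row per element of `column`, copying the other fields."""
    out = []
    for row in rows:
        # An empty or missing list still yields one row with a null value,
        # which mirrors how Polars' own explode treats empty lists.
        items = row[column] if row[column] else [None]
        for item in items:
            out.append({**row, column: item})
    return out


rows = [{"id": 1, "values": [10, 20]}, {"id": 2, "values": [30]}]
exploded = explode_rows(rows, "values")
```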

with_strftime_columns()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.
strftime str The strftime format string (e.g., "%Y-%m-%d" for date, "%H" for hour).
timestamp_column str The name of the timestamp column to use. Defaults to 'auto' (attempts to infer).
column_names list[str] or None Optional list of new column names to use for the generated columns. If None, names are derived from the strftime format.

Returns:

  • None
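
To show what a strftime format string yields per timestamp, here is a standard-library sketch that derives one list of formatted strings per format. The sample timestamps are made up, and using the format string itself as the key is an assumption; the column names the library derives are not specified in this reference.

```python
from datetime import datetime

timestamps = [datetime(2023, 1, 1, 9, 30), datetime(2023, 1, 2, 17, 5)]
formats = ["%Y-%m-%d", "%H"]

# One derived "column" (a list of strings) per format string.
derived = {fmt: [ts.strftime(fmt) for ts in timestamps] for fmt in formats}
```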

with_truncated_columns()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.
truncate_by str The duration string to truncate by (e.g., "1h", "1d", "1mo").
timestamp_column str The name of the timestamp column to truncate. Defaults to 'auto' (attempts to infer).
column_names list[str] or None Optional list of new column names for the truncated columns. If None, names are derived automatically.

Returns:

  • None
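
Truncating by a duration rounds each timestamp down to the start of its interval. The standard-library sketch below covers only the three durations named in the parameter description ("1h", "1d", "1mo"); the truncate helper and its lookup table are hypothetical and much narrower than a general duration parser.

```python
from datetime import datetime

# Fields to reset for each supported duration string.
ZEROED = {
    "1h": dict(minute=0, second=0, microsecond=0),
    "1d": dict(hour=0, minute=0, second=0, microsecond=0),
    "1mo": dict(day=1, hour=0, minute=0, second=0, microsecond=0),
}


def truncate(ts, truncate_by):
    """Round `ts` down to the start of its hour, day, or month."""
    return ts.replace(**ZEROED[truncate_by])


ts = datetime(2023, 3, 15, 14, 37, 22)
by_hour = truncate(ts, "1h")
by_day = truncate(ts, "1d")
by_month = truncate(ts, "1mo")
```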

with_datepart_columns()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.
timestamp_column str The name of the timestamp column to extract date parts from. Defaults to 'auto' (attempts to infer).
year bool If True, extract the year as a new column.
month bool If True, extract the month as a new column.
week bool If True, extract the week of the year as a new column.
yearday bool If True, extract the day of the year as a new column.
monthday bool If True, extract the day of the month as a new column.
day bool If True, extract the day of the week (1-7, Monday=1) as a new column.
weekday bool If True, extract the weekday (0-6, Monday=0) as a new column.
hour bool If True, extract the hour as a new column.
minute bool If True, extract the minute as a new column.
strftime str or None Optional strftime format string to apply to the timestamp column before extracting parts.

Returns:

  • None
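
The day and weekday flags differ only in their numbering, which is easy to mix up. The standard-library sketch below computes each part for one timestamp so the two conventions can be compared side by side. It assumes that week means the ISO week number; the date_parts helper is hypothetical.

```python
from datetime import datetime


def date_parts(ts):
    """Return the date parts listed in the parameter table for one timestamp."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "year": ts.year,
        "month": ts.month,
        "week": iso_week,                   # assumed to be the ISO week number
        "yearday": ts.timetuple().tm_yday,  # day of the year, starting at 1
        "monthday": ts.day,
        "day": iso_weekday,                 # 1-7, Monday=1
        "weekday": ts.weekday(),            # 0-6, Monday=0
        "hour": ts.hour,
        "minute": ts.minute,
    }


# 4 January 2023 was a Wednesday.
parts = date_parts(datetime(2023, 1, 4, 13, 45))
```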

with_row_count()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.
over list[str] or None Optional list of column names to partition the data by before adding row counts. If None, a global row count is added.

Returns:

  • None
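
The difference between a global row count and one computed over partition columns can be sketched with a counter per partition key. In this plain-Python illustration the output column name row_nr and counting from 1 are assumptions, as this reference does not state either.

```python
from collections import defaultdict


def row_counts(rows, over=None):
    """Number rows globally, or restart the numbering within each `over` group."""
    counters = defaultdict(int)
    out = []
    for row in rows:
        # With no partition columns every row shares the same (empty) key.
        key = tuple(row[name] for name in over) if over else ()
        counters[key] += 1
        out.append({**row, "row_nr": counters[key]})
    return out


rows = [{"g": "a"}, {"g": "b"}, {"g": "a"}]
global_counts = [r["row_nr"] for r in row_counts(rows)]
grouped_counts = [r["row_nr"] for r in row_counts(rows, over=["g"])]
```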

drop_null_columns()

Remove columns with all null values from the DataFrame.

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame.

Example:

import polars as pl
from fsspec_utils.utils.polars import drop_null_columns

df = pl.DataFrame({
    "col1": [1, 2, 3],
    "col2": [None, None, None],
    "col3": ["a", None, "c"]
})
df_cleaned = drop_null_columns(df)
print(df_cleaned)
# shape: (3, 2)
# ┌──────┬───────┐
# │ col1 ┆ col3  │
# │ ---  ┆ ---   │
# │ i64  ┆ str   │
# ╞══════╪═══════╡
# │ 1    ┆ a     │
# │ 2    ┆ null  │
# │ 3    ┆ c     │
# └──────┴───────┘

Returns:

  • polars.DataFrame: DataFrame without the columns that contain only null values
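
The rule is that a column is dropped only when every value is null; a column with some nulls is kept. A minimal plain-Python sketch of that check on a dict of column lists follows, with drop_all_null as a hypothetical helper name.

```python
def drop_all_null(columns):
    """Keep only the columns that contain at least one non-null value."""
    return {
        name: values
        for name, values in columns.items()
        if any(value is not None for value in values)
    }


columns = {
    "col1": [1, 2, 3],
    "col2": [None, None, None],
    "col3": ["a", None, "c"],
}
cleaned = drop_all_null(columns)
```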

unify_schemas()

Parameters:

Name Type Description
dfs list[polars.DataFrame] The list of Polars DataFrames whose schemas should be unified.

Returns:

  • None

cast_relaxed()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame to cast.
schema dict or polars.Schema The target schema to cast the DataFrame to. Can be a dictionary mapping column names to data types or a Polars Schema object.

Returns:

  • None

delta()

Parameters:

Name Type Description
df1 polars.DataFrame The first Polars DataFrame.
df2 polars.DataFrame The second Polars DataFrame.
subset list[str] or None Optional list of column names to consider when calculating the delta. If None, all columns are used.
eager bool If True, the delta calculation is performed eagerly. Defaults to False (lazy).

Returns:

  • None
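
This reference does not define the delta precisely. The plain-Python sketch below assumes it means the rows of df1 that have no matching row in df2 on the compared columns, which corresponds to an anti-join. Both that reading and the delta_rows helper are assumptions for illustration.

```python
def delta_rows(rows1, rows2, subset=None):
    """Return the rows of rows1 with no match in rows2 on the compared columns."""
    def key(row):
        # Compare on `subset` if given, otherwise on every column.
        names = subset if subset is not None else sorted(row)
        return tuple(row[name] for name in names)

    seen = {key(row) for row in rows2}
    return [row for row in rows1 if key(row) not in seen]


rows1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
rows2 = [{"id": 1, "v": "a"}, {"id": 2, "v": "x"}]
changed_or_new = delta_rows(rows1, rows2)
new_ids_only = delta_rows(rows1, rows2, subset=["id"])
```

Restricting the comparison to id ignores the changed value in row 2, so only the row with a new id remains.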

partition_by()

Parameters:

Name Type Description
df polars.DataFrame The input Polars DataFrame to partition.
timestamp_column str or None The name of the timestamp column to use for time-based partitioning. Defaults to None.
columns list[str] or None Optional list of column names to partition by. Defaults to None.
strftime str or None Optional strftime format string for time-based partitioning. Defaults to None.
timedelta str or None Optional timedelta string (e.g., "1h", "1d") for time-based partitioning. Defaults to None.
num_rows int or None Optional number of rows per partition for row-based partitioning. Defaults to None.

Returns:

  • None
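
Two of the partitioning modes listed above, row-based (num_rows) and time-based via a format string (strftime), can be sketched in plain Python. Both helpers are hypothetical, and the shape of the result returned by the library is not specified in this reference.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_rows(rows, num_rows):
    """Split rows into consecutive chunks of at most `num_rows` rows."""
    return [rows[i:i + num_rows] for i in range(0, len(rows), num_rows)]


def partition_by_strftime(rows, timestamp_column, fmt):
    """Group rows by the formatted value of their timestamp."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[timestamp_column].strftime(fmt)].append(row)
    return dict(groups)


rows = [
    {"ts": datetime(2023, 1, 31), "x": 1},
    {"ts": datetime(2023, 2, 1), "x": 2},
    {"ts": datetime(2023, 2, 15), "x": 3},
]
chunks = partition_by_rows(rows, num_rows=2)
by_month = partition_by_strftime(rows, "ts", "%Y-%m")
```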