`fsspec_utils.utils.pyarrow` API Reference¶

`dominant_timezone_per_column()`¶

For each timestamp column (matched by name) across all schemas, detect the most frequent timezone, counting None (timezone-naive) as a candidate.

If None and a timezone are tied, the timezone is preferred. Returns a dict mapping each column name to its dominant timezone: {column_name: dominant_timezone}

Parameters:

Name	Type	Description
`schemas`	`list[pyarrow.Schema]`	A list of PyArrow schemas to analyze.

Example:

import pyarrow as pa
from fsspec_utils.utils.pyarrow import dominant_timezone_per_column

schema1 = pa.schema([("ts", pa.timestamp("ns", tz="UTC"))])
schema2 = pa.schema([("ts", pa.timestamp("ns", tz="Europe/Berlin"))])
schema3 = pa.schema([("ts", pa.timestamp("ns"))]) # naive
schemas = [schema1, schema2, schema3]

dominant_tz = dominant_timezone_per_column(schemas)
print(dominant_tz)
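# 'UTC', 'Europe/Berlin', and None each occur once. A timezone is preferred over None,
# so the result is {'ts': 'UTC'} or {'ts': 'Europe/Berlin'}, depending on how ties between timezones are broken.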

Returns:

dict: {column_name: dominant_timezone}
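
The tie-breaking rule described for this function can be sketched in plain Python. This is only an illustration, not the library's implementation: `pick_dominant` is a hypothetical helper, and taking the first-seen timezone when several timezones are tied is an assumption, since the documented rule only covers a tie between None and a timezone.

```python
from collections import Counter


def pick_dominant(timezones):
    """Return the most frequent entry, preferring a timezone over None on a tie.

    Hypothetical helper for illustration only.
    """
    counts = Counter(timezones)
    top = max(counts.values())
    # All entries that share the highest count, in first-seen order.
    candidates = [tz for tz, n in counts.items() if n == top]
    # A real timezone wins over None when both are among the tied candidates.
    named = [tz for tz in candidates if tz is not None]
    # Assumption: among several tied timezones, take the first one seen.
    return named[0] if named else None


result_tie = pick_dominant(["UTC", None])             # tie between a timezone and None
result_majority = pick_dominant([None, None, "UTC"])  # None is the clear majority
```

In this sketch the tie resolves to the timezone, while a clear majority of naive columns still yields None.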

`standardize_schema_timezones_by_majority()`¶

For each timestamp column (matched by name) across all schemas, set the timezone to the most frequent one, with tie-breaking when counts are equal.

Parameters:

Name	Type	Description
`schemas`	`list[pyarrow.Schema]`	A list of PyArrow schemas to standardize.

Example:

import pyarrow as pa
from fsspec_utils.utils.pyarrow import standardize_schema_timezones_by_majority

schema1 = pa.schema([("ts", pa.timestamp("ns", tz="UTC"))])
schema2 = pa.schema([("ts", pa.timestamp("ns", tz="Europe/Berlin"))])
schemas = [schema1, schema2]

standardized_schemas = standardize_schema_timezones_by_majority(schemas)
print(standardized_schemas[0].field("ts").type)
print(standardized_schemas[1].field("ts").type)
# Both lines print the same type: timestamp[ns, tz=Europe/Berlin] or timestamp[ns, tz=UTC], depending on tie-breaking

Returns:

list[pyarrow.Schema]: A new list of schemas with updated timestamp timezones.

`standardize_schema_timezones()`¶

Standardize timezone info for all timestamp columns in a list of PyArrow schemas.

Parameters:

Name	Type	Description
`schemas`	`list[pyarrow.Schema]`	The list of PyArrow schemas to process.
`timezone`	`str` or `None`	The target timezone to apply to timestamp columns. If None, timezones are removed. If "auto", the most frequent timezone across schemas is used.

Example:

import pyarrow as pa
from fsspec_utils.utils.pyarrow import standardize_schema_timezones

schema1 = pa.schema([("ts", pa.timestamp("ns", tz="UTC"))])
schema2 = pa.schema([("ts", pa.timestamp("ns"))]) # naive
schemas = [schema1, schema2]

# Remove timezones
new_schemas_naive = standardize_schema_timezones(schemas, timezone=None)
print(new_schemas_naive[0].field("ts").type)
# Expected: timestamp[ns]

# Set a specific timezone
new_schemas_berlin = standardize_schema_timezones(schemas, timezone="Europe/Berlin")
print(new_schemas_berlin[0].field("ts").type)
# Expected: timestamp[ns, tz=Europe/Berlin]

Returns:

list[pyarrow.Schema]: New schemas with standardized timezone info.

`unify_schemas()`¶

Unify a list of PyArrow schemas into a single schema.

Parameters:

Name	Type	Description
`schemas`	`list[pyarrow.Schema]`	List of PyArrow schemas to unify.
`use_large_dtypes`	`bool`	If True, keep large types like large_string.
`timezone`	`str` or `None`	If specified, standardize all timestamp columns to this timezone. If "auto", use the most frequent timezone across schemas. If None, remove timezone from all timestamp columns.
`standardize_timezones`	`bool`	If True, standardize all timestamp columns to the most frequent timezone.

Returns:

pyarrow.Schema: A unified PyArrow schema.

`cast_schema()`¶

Cast a PyArrow table to a given schema, updating the schema to match the table's columns.

Parameters:

Name	Type	Description
`table`	`pyarrow.Table`	The PyArrow table to cast.
`schema`	`pyarrow.Schema`	The target schema to cast the table to.

Returns:

pyarrow.Table: A new PyArrow table with the specified schema.

`convert_large_types_to_normal()`¶

Convert large types in a PyArrow schema to their standard types.

Parameters:

Name	Type	Description
`schema`	`pyarrow.Schema`	The PyArrow schema to convert.

Returns:

pyarrow.Schema: A new PyArrow schema with large types converted to standard types.

`opt_dtype()`¶

Optimize data types of a PyArrow Table for performance and memory efficiency.

Parameters:

Name	Type	Description
`table`	`pyarrow.Table`
`include`	`list[str]`, optional
`exclude`	`list[str]`, optional
`time_zone`	`str`, optional
`shrink_numerics`	`bool`
`allow_unsigned`	`bool`
`use_large_dtypes`	`bool`
`strict`	`bool`
`allow_null`	`bool`	If False, columns that only hold null-like values will not be converted to pyarrow.null().

Returns:

pyarrow.Table: A new table cast to the optimal schema.
