fsspec_utils.utils.pyarrow
API Reference
dominant_timezone_per_column()
For each timestamp column (by name) across all schemas, detect the most frequent timezone (including None).
If None and a timezone are tied, prefer the timezone. Returns a dict: {column_name: dominant_timezone}
Parameters:
Name | Type | Description |
---|---|---|
schemas | list[pyarrow.Schema] | A list of PyArrow schemas to analyze. |
Example:
Returns:
dict
: {column_name: dominant_timezone}
standardize_schema_timezones_by_majority()
For each timestamp column (by name) across all schemas, set the timezone to the most frequent (with tie-breaking).
Parameters:
Name | Type | Description |
---|---|---|
schemas | list[pyarrow.Schema] | A list of PyArrow schemas to standardize. |
Example:
Returns:
list[pyarrow.Schema]
: A new list of schemas with updated timestamp timezones.
standardize_schema_timezones()
Standardize timezone info for all timestamp columns in a list of PyArrow schemas.
Parameters:
Name | Type | Description |
---|---|---|
schemas | list[pyarrow.Schema] | The list of PyArrow schemas to process. |
timezone | str or None | The target timezone to apply to timestamp columns. If None, timezones are removed. If "auto", the most frequent timezone across schemas is used. |
Example:
Returns:
list[pyarrow.Schema]
: New schemas with standardized timezone info.
unify_schemas()
Unify a list of PyArrow schemas into a single schema.
Parameters:
Name | Type | Description |
---|---|---|
schemas | list[pyarrow.Schema] | List of PyArrow schemas to unify. |
use_large_dtypes | bool | If True, keep large types like large_string. |
timezone | str or None | If specified, standardize all timestamp columns to this timezone. If "auto", use the most frequent timezone across schemas. If None, remove timezone from all timestamp columns. |
standardize_timezones | bool | If True, standardize all timestamp columns to the most frequent timezone. |
Returns:
pyarrow.Schema
: A unified PyArrow schema.
cast_schema()
Cast a PyArrow table to a given schema, updating the schema to match the table's columns.
Parameters:
Name | Type | Description |
---|---|---|
table | pyarrow.Table | The PyArrow table to cast. |
schema | pyarrow.Schema | The target schema to cast the table to. |
Returns:
pyarrow.Table
: A new PyArrow table with the specified schema.
convert_large_types_to_normal()
Convert large types in a PyArrow schema to their standard types.
Parameters:
Name | Type | Description |
---|---|---|
schema | pyarrow.Schema | The PyArrow schema to convert. |
Returns:
pyarrow.Schema
: A new PyArrow schema with large types converted to standard types.
opt_dtype()
Optimize data types of a PyArrow Table for performance and memory efficiency.
Parameters:
Name | Type | Description |
---|---|---|
table | pyarrow.Table | |
include | list[str], optional | |
exclude | list[str], optional | |
time_zone | str, optional | |
shrink_numerics | bool | |
allow_unsigned | bool | |
use_large_dtypes | bool | |
strict | bool | |
allow_null | bool | If False, columns that only hold null-like values will not be converted to pyarrow.null(). |
Returns:
pyarrow.Table
: A new table cast to the optimal schema.