ak.to_parquet
Defined in awkward.operations.convert on line 2876.
ak.to_parquet(array, where, explode_records=False, list_to32=False, string_to32=True, bytestring_to32=True)

Parameters
array – Data to write to a Parquet file.
where (str, Path, or file-like object) – Where to write the Parquet file.
explode_records (bool) – If True, lists of records are written as records of lists, so that nested fields become top-level fields (which can be zipped when read back); see the sketch after the basic example below.
list_to32 (bool) – If True, convert Awkward lists into 32-bit Arrow lists if they're small enough, even if it means an extra conversion. Otherwise, signed 32-bit ak.layout.ListOffsetArray maps to Arrow ListType and all others map to Arrow LargeListType.
string_to32 (bool) – Same as the above for Arrow string and large_string.
bytestring_to32 (bool) – Same as the above for Arrow binary and large_binary.
options – All other options are passed to pyarrow.parquet.ParquetWriter. In particular, if no schema is given, a schema is derived from the array type.
Writes an Awkward Array to a Parquet file (through pyarrow).
>>> array1 = ak.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]])
>>> ak.to_parquet(array1, "array1.parquet")
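With explode_records=True, nested record fields become top-level Parquet columns. A minimal sketch (the file name and sample values here are hypothetical, not from the documentation):

>>> listy_records = ak.Array([[{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}], [], [{"x": 3, "y": 3.3}]])
>>> ak.to_parquet(listy_records, "exploded.parquet", explode_records=True)
>>> exploded = ak.from_parquet("exploded.parquet")   # a record of lists: top-level fields "x" and "y"
>>> rezipped = ak.zip({"x": exploded.x, "y": exploded.y})   # zip the fields back into lists of records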
If the array does not contain records at top-level, the Arrow table will consist of one field whose name is "".
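As a quick sanity check (a sketch, not from the documentation), the schema can be inspected with pyarrow:

>>> import pyarrow.parquet as pq
>>> pq.read_schema("array1.parquet")   # expect a single field whose name is the empty string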
Parquet files can maintain the distinction between "option-type but no elements are missing" and "not option-type" at all levels, including the top level. However, there is no distinction between ?union[X, Y, Z] type and union[?X, ?Y, ?Z] type. Be aware of these type distinctions when passing data through Arrow or Parquet.
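For example, a minimal sketch of an option-type round trip (the file name is hypothetical):

>>> optional = ak.Array([[1, 2, 3], None, [4, 5]])   # option-type lists, with one missing entry
>>> ak.to_parquet(optional, "optional.parquet")
>>> ak.from_parquet("optional.parquet").type   # the option-type should survive the round trip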
To make a partitioned Parquet dataset, use this function to write each Parquet file to a directory (as separate invocations, probably in parallel with multiple processes), then give them common metadata by calling ak.to_parquet.dataset.
>>> ak.to_parquet(array1, "directory-name/file1.parquet")
>>> ak.to_parquet(array2, "directory-name/file2.parquet")
>>> ak.to_parquet(array3, "directory-name/file3.parquet")
>>> ak.to_parquet.dataset("directory-name")
Then all of the files in the collection can be addressed as one array. For example,
>>> dataset = ak.from_parquet("directory-name", lazy=True)
(If it is large, you will likely want to load it lazily.)
See also ak.to_arrow, which is used as an intermediate step.
See also ak.from_parquet.