Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.
Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they’re not.
The Awkward Array project is divided into 3 layers with 5 main components.
The high-level interface and Numba implementation are described in the Python API reference, though the Numba implementation is considered an internal detail (in the _connect submodule, which starts with an underscore and is therefore private in Python).
This reference describes the
An Awkward Array is a tree of layout nodes nested within each other, some of which carry one-dimensional arrays that get reinterpreted as part of the structure. These nodes descend from the abstract base class ak::Content and are reflected through pybind11 as ak.layout.Content and subclasses in Python.
T. RawArray.h is a header-only implementation, and ak::RawArrayOf<T> is not used anywhere else in the Awkward Array project, Python in particular. This node type exists only for pure C++ dependent projects.
std::vectors and a std::string.
None in Python (a subclass of ak::Content in C++).
The ak::Record, ak::None, and ak::NumpyArray with empty shape are technically ak::Content in C++ even though they represent scalar data rather than arrays. (This can be checked with the isscalar method.) They are possible return values of methods that would ordinarily return ak::Content instances, so making them subclasses simplifies the type hierarchy. In the Python layer, however, they are converted directly into Python scalars: ak.layout.Record (which isn't an ak.layout.Content subclass), Python's None, or a Python number/bool.
Most layout nodes contain another content node (ak::RecordArray and ak::UnionArrayOf<T, I> can contain more than one), thus forming a tree. Only ak::EmptyArray, ak::RawArrayOf<T>, and ak::NumpyArray cannot contain a content, and hence these are leaves of the tree.
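As a rough illustration of how a tree of layout nodes reinterprets a flat one-dimensional array, the sketch below uses two hypothetical pure-Python classes, `Leaf` and `ListOffsets`. They are stand-ins invented for this example, not the ak::NumpyArray or ak::ListOffsetArrayOf<T> implementations, and the offsets convention (list i spans offsets[i] to offsets[i + 1]) is an assumption of the sketch.

```python
class Leaf:
    """Hypothetical leaf node: holds a flat one-dimensional buffer, no content."""

    def __init__(self, data):
        self.data = list(data)

    def to_list(self):
        return list(self.data)


class ListOffsets:
    """Hypothetical interior node: offsets reinterpret its content as lists."""

    def __init__(self, offsets, content):
        self.offsets = list(offsets)
        self.content = content  # another node, so nodes form a tree

    def __len__(self):
        return len(self.offsets) - 1

    def to_list(self):
        flat = self.content.to_list()
        return [flat[self.offsets[i]:self.offsets[i + 1]] for i in range(len(self))]


leaf = Leaf([1.1, 2.2, 3.3, 4.4, 5.5])
one_level = ListOffsets([0, 3, 3, 5], leaf)      # three lists over five numbers
two_levels = ListOffsets([0, 2, 3], one_level)   # two lists over three lists

print(one_level.to_list())
print(two_levels.to_list())
```

Only the leaf stores numbers; each interior node adds one level of nesting by indexing into whatever its content produces.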
Note that ak::PartitionedArray and its concrete class, ak::IrregularlyPartitionedArray, are not ak::Content because they cannot be nested within a tree. Partitioning is only allowed at the root of the tree.
Iterator for layout nodes: ak::Iterator.
Index for layout nodes: integer and boolean arrays that define the shape of the data structure, such as boolean masks in ak::ByteMaskedArray, are not ak::NumpyArray but a more constrained type called ak::IndexOf<T>.
Identities for layout nodes: ak::IdentitiesOf<T> is an optional surrogate key for certain join operations. (Not yet used.)
This is the type of data in a high-level ak.Array or ak.Record as reported by ak.type. It represents as much information as a data analyst needs to know (e.g. the distinction between variable and fixed-length lists, but not the distinction between ak::ListArrayOf<T> and ak::ListOffsetArrayOf<T>).
size is part of the type description.
All concrete ak::Type subclasses are composable except ak::ArrayType.
This is the type of an ak::Content array expressed with low-level granularity (e.g. including the distinction between ak::ListArrayOf<T> and ak::ListOffsetArrayOf<T>). There is a one-to-one relationship between ak::Content subclasses and ak::Form subclasses, and each ak::Form maps to only one ak::Type.
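The many-to-one relationship between forms and types can be sketched with plain dictionaries. Everything here is illustrative: the dictionary keys, the `form_to_type` helper, and the type spellings ("var * ...", "3 * ...") are assumptions of the sketch, not the ak::Form or ak::Type classes.

```python
def form_to_type(form):
    """Map a hypothetical nested-dict form to a type string."""
    kind = form["class"]
    if kind == "NumpyArray":
        return form["primitive"]
    if kind in ("ListArray", "ListOffsetArray"):
        # Two distinct low-level forms collapse into one high-level type.
        return "var * " + form_to_type(form["content"])
    if kind == "RegularArray":
        # A fixed size is kept because it is part of the type description.
        return str(form["size"]) + " * " + form_to_type(form["content"])
    raise ValueError("unhandled form: " + kind)


numbers = {"class": "NumpyArray", "primitive": "float64"}
as_list_array = {"class": "ListArray", "content": numbers}
as_list_offset_array = {"class": "ListOffsetArray", "content": numbers}
as_regular_array = {"class": "RegularArray", "size": 3, "content": numbers}

print(form_to_type(as_list_array))
print(form_to_type(as_list_offset_array))
print(form_to_type(as_regular_array))
```

Each form yields exactly one type, but the reverse lookup is ambiguous: the first two forms print the same type string.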
ISOPTION = false.
ISOPTION = true.
The ak.ArrayBuilder is an append-only array for generating data, backed by ak.layout.ArrayBuilder (layout-level ArrayBuilder) and ak::ArrayBuilder (C++ implementation).
ak::ArrayBuilder is the front-end for a tree of ak::Builder instances. The structure of this tree indicates the current state of knowledge about the type of the data it's being filled with, and this tree can grow from any node. Types always grow in the direction of more generality, so the tree only gets bigger.
Here is an example that illustrates how knowledge about the type grows.
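A minimal sketch of this growth, using a hypothetical `TinyBuilder` class that tracks only a type string. It is not ak::ArrayBuilder, it stores no data, and the type spellings ("unknown", "int64", "?float64") are assumptions made for the illustration.

```python
class TinyBuilder:
    """Hypothetical builder that records only what it has learned about the type."""

    def __init__(self):
        self.primitive = "unknown"   # nothing seen yet
        self.is_option = False       # becomes True once a missing value is seen

    def integer(self, value):
        # An integer refines "unknown"; int64 and float64 already cover it.
        if self.primitive == "unknown":
            self.primitive = "int64"

    def real(self, value):
        # A real number generalizes both "unknown" and "int64" to float64.
        self.primitive = "float64"

    def null(self):
        # A missing value makes the type an option type; this is never undone.
        self.is_option = True

    @property
    def type(self):
        return ("?" if self.is_option else "") + self.primitive


builder = TinyBuilder()
history = [builder.type]
for fill in (lambda: builder.integer(1),
             lambda: builder.real(2.2),
             lambda: builder.null(),
             lambda: builder.integer(3)):
    fill()
    history.append(builder.type)

print(history)
```

The last call appends an integer after the type has already generalized, and the type does not narrow back.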
The ak::Builder instances contain arrays of accumulated data, and thus store both data (in these arrays) and type (in their tree structure). The hierarchy is not exactly the same as ak::Content/ak::Form (which are identical to each other) or ak::Type, since it reflects the kinds of data to be encountered in the input data: mostly JSON-like, but with a distinction between records with named fields and tuples with unnamed slots.
"__array__" equal to "string" or "bytestring".
ISOPTION = true.
Options for building an array: ak::ArrayBuilderOptions are passed to every ak::Builder in the tree.
Buffers for building an array: ak::GrowableBuffer<T> is a one-dimensional array with append-only semantics, used both for buffers that will become ak::IndexOf<T> and buffers that will become ak::NumpyArray. It works like std::vector in that it replaces its underlying storage at logarithmically frequent intervals, but unlike a std::vector in that the underlying storage is a std::shared_ptr<void*>.
After an ak::ArrayBuilder is turned into an ak::Content with snapshot, the read-only data and its append-only source share buffers (with std::shared_ptr<void*>). This makes the snapshot operation fast and capable of being called frequently, and the std::shared_ptr manages the lifetime of buffers that stay in scope because an ak::Content is using them, even if an ak::GrowableBuffer<T> is not (because it reallocated its internal buffer).
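The sharing between a growable buffer and its snapshots can be sketched in pure Python, where ordinary object references play the role of std::shared_ptr. The `GrowableBuffer` and `Snapshot` classes below, and the `initial` and `resize` parameters, are hypothetical names for this illustration and do not reproduce the C++ implementation.

```python
class Snapshot:
    """Read-only view: a shared storage block plus the length valid at snapshot time."""

    def __init__(self, storage, length):
        self.storage = storage
        self.length = length

    def to_list(self):
        return self.storage[:self.length]


class GrowableBuffer:
    """Hypothetical append-only buffer backed by a fixed-capacity block."""

    def __init__(self, initial=4, resize=2):
        self.storage = [None] * initial
        self.length = 0
        self.resize = resize

    def append(self, value):
        if self.length == len(self.storage):
            # Full: allocate a larger block and copy. The old block stays
            # alive for as long as any snapshot still references it.
            bigger = [None] * (len(self.storage) * self.resize)
            bigger[:self.length] = self.storage[:self.length]
            self.storage = bigger
        self.storage[self.length] = value
        self.length += 1

    def snapshot(self):
        # No copy: share the current block and record how much of it is valid.
        return Snapshot(self.storage, self.length)


buffer = GrowableBuffer(initial=2)
buffer.append(1)
buffer.append(2)
first = buffer.snapshot()
shared_before = first.storage is buffer.storage   # same block, nothing copied
buffer.append(3)                                  # block is full, so it reallocates
second = buffer.snapshot()
shared_after = first.storage is buffer.storage    # buffer moved to a new block

print(first.to_list(), second.to_list(), shared_before, shared_after)
```

The first snapshot keeps reading the old block after the buffer has moved on, which is the lifetime situation described above.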
Array building is not as efficient as computing with pre-built arrays because the type-discovery makes each access a tree-descent. Array building is also an exception to the rule that C++ implementations do not touch array data (do not dereference pointers in ak::RawArrayOf<T>, ak::NumpyArray, ak::IndexOf<T>, and ak::IdentitiesOf<T>). The ak::Builder instances append to their ak::GrowableBuffer<T>, which are assumed to exist in main memory, not a GPU.
Reducer operations like ak.sum and ak.max are implemented as a group for code reuse. The most complex part is how elements from variable-length lists are grouped when axis != -1 (see ak::Content::reduce_next). The problem of actually computing sums and maxima is comparatively simple.
Each reducer has a class that implements the sum, max, etc.
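To see why grouping is the hard part, the sketch below sums a jagged array stored as offsets plus flat content, once over the last axis and once over the first. The two helper functions are hypothetical and are not the ak::Content::reduce_next algorithm; they only show which elements end up in the same group.

```python
def sum_last_axis(offsets, content):
    """axis=-1: each list is its own group, so grouping follows the offsets."""
    return [sum(content[offsets[i]:offsets[i + 1]])
            for i in range(len(offsets) - 1)]


def sum_first_axis(offsets, content):
    """axis=0: the k-th elements of all lists form a group, across list boundaries."""
    out = []
    for i in range(len(offsets) - 1):
        for position, value in enumerate(content[offsets[i]:offsets[i + 1]]):
            if position == len(out):
                out.append(0)      # first list long enough to reach this position
            out[position] += value
    return out


# Represents [[1, 2, 3], [], [4, 5]]
offsets = [0, 3, 3, 5]
content = [1, 2, 3, 4, 5]

last = sum_last_axis(offsets, content)
first = sum_first_axis(offsets, content)
print(last, first)
```

With the last axis, groups are contiguous slices of the content. With the first axis, each group gathers elements from different lists, and lists that are too short simply contribute nothing.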
Many Python objects can be used as slices; they are each converted into a specialized C++ type before being passed to ak::Content::getitem.
slice object with start, stop, and step, which selects a range of elements.
np.newaxis (a.k.a. None), which inserts a new regular dimension of length 1 in the sliced output.
Ellipsis (a.k.a. ...), which skips enough dimensions to put the rest of the slice items at the deepest possible level.
Note: ak::SliceGenerator is not an ak::SliceItem; it's an ak::ArrayGenerator. (A case of clashing naming conventions.)
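The conversion step can be pictured as a classification of each item in the slice tuple. The sketch below is a hypothetical pure-Python analogue: the `classify` function, the `Probe` class, and the string labels are invented for illustration and are not the C++ slice item types. It covers only the three kinds of item described above.

```python
def classify(item):
    """Label one slice item by kind; anything else is rejected in this sketch."""
    if item is Ellipsis:
        return "ellipsis"
    if item is None:               # np.newaxis is an alias for None
        return "newaxis"
    if isinstance(item, slice):
        return "range"
    raise TypeError("slice item not handled in this sketch: " + repr(item))


class Probe:
    """Hypothetical object that reports how its subscript would be classified."""

    def __getitem__(self, key):
        # Python passes a single item as-is and several items as a tuple.
        items = key if isinstance(key, tuple) else (key,)
        return [classify(item) for item in items]


probe = Probe()
single = probe[1:10:2]
several = probe[..., None, :3]
print(single, several)
```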
The following classes hide Awkward Array's dependence on RapidJSON, so that it's a swappable component. Header files do not reference RapidJSON, but the implementations of these classes do: ak::ToJson, ak::ToJsonString, ak::ToJsonPrettyString, ak::ToJsonFile, ak::ToJsonPrettyFile.
The kernels library is separated from the C++ codebase by a pure C interface, and thus the kernels could be used by other languages.
The CPU kernels are implemented in these files:
The GPU kernels follow exactly the same interface, though with a different implementation.
FIXME: kernel function names and arguments should be systematized in some searchable way.