Awkward Array Documentation

Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they’re not.

The Awkward Array project is divided into 3 layers with 5 main components.

The high-level interface and Numba implementation are described in the Python API reference, though the Numba implementation is considered an internal detail (in the _connect submodule, which starts with an underscore and is therefore private in Python).

This reference describes the

  • C++ classes: in namespace awkward (often abbreviated as "ak"), which are compiled into (or dylib or lib).
  • pybind11 interface: no namespace, but contained entirely within the python directory, which is compiled into awkward._ext for use in Python.
  • CPU kernels: no namespace, but contained entirely within the kernels directory, which are compiled into (or dylib or lib). This library is fully usable from any language that can call functions through FFI.
  • GPU kernels: FIXME! (not implemented yet)

Awkward Arrays are a tree of layout nodes nested within each other, some of which carry one-dimensional arrays that get reinterpreted as part of the structure. These descend from the abstract base class ak::Content and are reflected through pybind11 as ak.layout.Content and subclasses in Python.

  • ak::Content: the abstract base class.
  • ak::EmptyArray: an array of unknown type with no elements (usually produced by ak::ArrayBuilder, which can't determine type at a given level without samples).
  • ak::RawArrayOf<T>: a one-dimensional array of any fixed-bytesize C++ type T. RawArray.h is a header-only implementation, and ak::RawArrayOf<T> is not used anywhere else in the Awkward Array project, Python in particular. This node type exists only for pure C++ dependent projects.
  • ak::NumpyArray: any NumPy array (e.g. multidimensional shape, arbitrary dtype), though usually only one-dimensional arrays of numbers. Contrary to its name, ak::NumpyArray can be used in pure C++, without dependence on pybind11 or Python. The NumPy array is described in terms of std::vectors and a std::string.
  • ak::RegularArray: splits its nested content into equal-length lists.
  • ak::ListArrayOf<T>: splits its nested content into variable-length lists with full generality (may use its content non-contiguously, overlapping, or out-of-order).
  • ak::ListOffsetArrayOf<T>: splits its nested content into variable-length lists, assuming contiguous, non-overlapping, in-order content.
  • ak::RecordArray: represents a logical array of records with a "struct of arrays" layout in memory.
  • ak::Record: represents a single record (a subclass of ak::Content in C++, but not Python).
  • ak::IndexedArrayOf<T, ISOPTION>: rearranges and/or duplicates its content by lazily applying an integer index.
  • ak::ByteMaskedArray: marks missing values in its content with an 8-bit boolean mask (one byte per element).
  • ak::BitMaskedArray: marks missing values in its content with a 1-bit boolean mask (one bit per element).
  • ak::UnmaskedArray: specifies that its content can contain missing values in principle, but no mask is supplied because all elements are non-missing.
  • ak::UnionArrayOf<T, I>: interleaves a set of arrays as a tagged union, so it can represent heterogeneous data.
  • ak::VirtualArray: generates an array on demand from an ak::ArrayGenerator or an ak::SliceGenerator and optionally caches the generated array in an ak::ArrayCache.
  • ak::None: represents a missing value that will be converted to None in Python (a subclass of ak::Content in C++).
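
To make the difference between the three list node types concrete, here is a minimal pure-Python sketch. It is not the C++ implementation: the function names (regular_lists, list_offset_lists, list_lists) and the sample buffers are hypothetical, and the buffer names size, offsets, starts, and stops are assumptions used for illustration. It only shows how the same flat content can be split into equal-length lists, contiguous variable-length lists, or fully general variable-length lists.

```python
def regular_lists(content, size):
    """Split content into equal-length lists (RegularArray-like)."""
    count = len(content) // size
    return [content[i * size:(i + 1) * size] for i in range(count)]


def list_offset_lists(content, offsets):
    """Split content at consecutive offsets (ListOffsetArray-like).

    List i spans content[offsets[i]:offsets[i + 1]], so the lists are
    contiguous, non-overlapping, and in order by construction.
    """
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def list_lists(content, starts, stops):
    """Split content with independent starts and stops (ListArray-like).

    Each list has its own start and stop, so lists may overlap,
    skip elements, or appear out of order.
    """
    return [content[a:b] for a, b in zip(starts, stops)]


content = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

print(regular_lists(content, 2))                  # three lists of length 2
print(list_offset_lists(content, [0, 3, 3, 6]))   # middle list is empty
print(list_lists(content, [3, 0, 2], [6, 2, 4]))  # out of order and overlapping
```

In this sketch, the offsets form needs one buffer of length n + 1 for n lists, while the starts/stops form needs two buffers of length n but places no constraint on how the content is used.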

The ak::Record, ak::None, and ak::NumpyArray with empty shape are technically ak::Content in C++ even though they represent scalar data, rather than arrays. (This can be checked with the isscalar method.) This is because they are possible return values of methods that would ordinarily return ak::Contents, so they are subclasses to simplify the type hierarchy. However, in the Python layer, they are converted directly into Python scalars, such as ak.layout.Record (which isn't an ak.layout.Content subclass), Python's None, or a Python number/bool.

Most layout nodes contain another content node (ak::RecordArray and ak::UnionArrayOf<T, I> can contain more than one), thus forming a tree. Only ak::EmptyArray, ak::RawArrayOf<T>, and ak::NumpyArray cannot contain a content, and hence these are leaves of the tree.

Note that ak::PartitionedArray and its concrete class, ak::IrregularlyPartitionedArray, are not ak::Content because they cannot be nested within a tree. Partitioning is only allowed at the root of the tree.

Iterator for layout nodes: ak::Iterator.

Index for layout nodes: integer and boolean arrays that define the shape of the data structure, such as boolean masks in ak::ByteMaskedArray, are not ak::NumpyArray but a more constrained type called ak::IndexOf<T>.
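
As a rough illustration of how such index and mask buffers define structure rather than hold data, the following pure-Python sketch interprets a byte mask, a bit mask, and an integer index over the same content. It is a sketch under stated assumptions, not the library's code: the function names are hypothetical, the valid_when and lsb_order parameters are assumptions about how mask polarity and bit order might be expressed, and treating a negative index as "missing" is likewise an assumption of the sketch.

```python
def byte_masked(content, mask, valid_when=True):
    """One mask byte per element; element i is present when mask[i] == valid_when."""
    return [x if bool(m) == valid_when else None for x, m in zip(content, mask)]


def bit_masked(content, mask_bytes, valid_when=True, lsb_order=True):
    """One mask bit per element, packed eight to a byte."""
    out = []
    for i, x in enumerate(content):
        byte = mask_bytes[i // 8]
        # Pick bit i % 8, counting from the least or most significant end.
        shift = (i % 8) if lsb_order else 7 - (i % 8)
        bit = (byte >> shift) & 1
        out.append(x if bool(bit) == valid_when else None)
    return out


def indexed(content, index, is_option=False):
    """Rearrange or duplicate content through an integer index.

    With is_option=True (an assumption of this sketch), a negative
    index stands for a missing value.
    """
    return [None if (is_option and i < 0) else content[i] for i in index]


content = [10, 20, 30, 40]

print(byte_masked(content, [1, 0, 1, 1]))
print(bit_masked(content, [0b00001101]))
print(indexed(content, [3, 3, 0, -1], is_option=True))
```

The byte mask and the bit mask above encode the same information; the bit mask uses one eighth of the memory at the cost of shifting and masking on every access.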

Identities for layout nodes: ak::IdentitiesOf<T> are an optional surrogate key for certain join operations. (Not yet used.)

This is the type of data in a high-level ak.Array or ak.Record as reported by ak.type. It represents as much information as a data analyst needs to know (e.g. the distinction between variable and fixed-length lists, but not the distinction between ak::ListArrayOf<T> and ak::ListOffsetArrayOf<T>).

All concrete ak::Type subclasses are composable except ak::ArrayType.

This is the type of an ak::Content array expressed with low-level granularity (e.g. including the distinction between ak::ListArrayOf<T> and ak::ListOffsetArrayOf<T>). There is a one-to-one relationship between ak::Content subclasses and ak::Form subclasses, and each ak::Form maps to only one ak::Type.
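
The many-to-one relationship between forms and types can be sketched in a few lines of Python. This is only an illustration: forms are modeled as hypothetical nested tuples, the function name form_to_type is made up, only a handful of node kinds are handled, and the type strings merely imitate the notation used elsewhere in this document (var * float64, ?float64).

```python
def form_to_type(form):
    """Map a low-level form (nested tuples) to a high-level type string."""
    kind = form[0]
    if kind == "NumpyArray":
        return form[1]  # primitive name, e.g. "float64"
    if kind in ("ListArray", "ListOffsetArray"):
        # Both variable-length list layouts collapse to the same type.
        return "var * " + form_to_type(form[1])
    if kind == "RegularArray":
        return f"{form[1]} * " + form_to_type(form[2])
    if kind in ("ByteMaskedArray", "BitMaskedArray", "UnmaskedArray"):
        inner = form_to_type(form[1])
        # Short "?" prefix for primitives, bracketed spelling otherwise.
        return "?" + inner if " " not in inner else f"option[{inner}]"
    raise ValueError(f"unhandled form kind: {kind}")


list_form = ("ListArray", ("NumpyArray", "float64"))
offset_form = ("ListOffsetArray", ("NumpyArray", "float64"))

print(form_to_type(list_form))
print(form_to_type(offset_form))
print(form_to_type(("RegularArray", 3, ("NumpyArray", "int64"))))
print(form_to_type(("ByteMaskedArray", ("NumpyArray", "float64"))))
```

Two different forms produce the same type string here, but no form produces two types, which is the sense in which each form maps to only one type.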

The ak.ArrayBuilder is an append-only array for generating data backed by ak.layout.ArrayBuilder (layout-level ArrayBuilder) and ak::ArrayBuilder (C++ implementation).

ak::ArrayBuilder is the front-end for a tree of ak::Builder instances. The structure of this tree indicates the current state of knowledge about the type of the data it's being filled with, and this tree can grow from any node. Types always grow in the direction of more generality, so the tree only gets bigger.

Here is an example that illustrates how knowledge about the type grows.

import awkward as ak

b = ak.ArrayBuilder()
# fill commands  # as JSON # current array type
b.begin_list()   # [       # 0 * var * unknown (initially, the type is unknown)
b.integer(1)     # 1,      # 0 * var * int64
b.integer(2)     # 2,      # 0 * var * int64
b.real(3)        # 3.0     # 0 * var * float64 (all the integers have become floats)
b.end_list()     # ],      # 1 * var * float64
b.begin_list()   # [       # 1 * var * float64
b.end_list()     # ],      # 2 * var * float64
b.begin_list()   # [       # 2 * var * float64
b.integer(4)     # 4,      # 2 * var * float64
b.null()         # null,   # 2 * var * ?float64 (now the floats are nullable)
b.integer(5)     # 5       # 2 * var * ?float64
b.end_list()     # ],      # 3 * var * ?float64
b.begin_list()   # [       # 3 * var * ?float64
b.begin_record() # {       # 3 * var * ?union[float64, {}]
b.field("x")     # "x":    # 3 * var * ?union[float64, {"x": unknown}]
b.integer(1)     # 1,      # 3 * var * ?union[float64, {"x": int64}]
b.field("y")     # "y":    # 3 * var * ?union[float64, {"x": int64, "y": unknown}]
b.begin_list()   # [       # 3 * var * ?union[float64, {"x": int64, "y": var * unknown}]
b.integer(2)     # 2,      # 3 * var * ?union[float64, {"x": int64, "y": var * int64}]
b.integer(3)     # 3       # 3 * var * ?union[float64, {"x": int64, "y": var * int64}]
b.end_list()     # ]       # 3 * var * ?union[float64, {"x": int64, "y": var * int64}]
b.end_record()   # }       # 3 * var * ?union[float64, {"x": int64, "y": var * int64}]
b.end_list()     # ]       # 4 * var * ?union[float64, {"x": int64, "y": var * int64}]
# [[1.0, 2.0, 3.0], [], [4.0, None, 5.0], [{'x': 1, 'y': [2, 3]}]]

The ak::Builder instances contain arrays of accumulated data, and thus store both data (in these arrays) and type (in their tree structure). The hierarchy is not exactly the same as ak::Content/ak::Form (which are identical to each other) or ak::Type, since it reflects the kinds of data to be encountered in the input data: mostly JSON-like, but with a distinction between records with named fields and tuples with unnamed slots.
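
The rule that types only grow toward more generality can be sketched as a small merge function. This is a toy model, not how the ak::Builder tree is implemented: the function name generalize is hypothetical, types are plain strings, and only unknown, int64, float64, nullability, and a single level of union are handled.

```python
def generalize(current, incoming):
    """Return a type at least as general as both current and incoming."""
    option = current.startswith("?")
    base = current[1:] if option else current
    if incoming == "null":
        # A null makes the type nullable without changing its content.
        return "?" + base
    if base == "unknown":
        merged = incoming              # first sample decides the type
    elif base == incoming:
        merged = base                  # nothing new learned
    elif {base, incoming} == {"int64", "float64"}:
        merged = "float64"             # integers are promoted to floats
    else:
        merged = f"union[{base}, {incoming}]"  # one level of union only
    return "?" + merged if option else merged


fills = ["int64", "int64", "float64", "int64", "null", "int64"]
history = []
current = "unknown"
for incoming in fills:
    current = generalize(current, incoming)
    history.append(current)

print(history)
print(generalize(current, "{}"))
```

Every step either leaves the type unchanged or replaces it with a more general one; nothing in the function can turn float64 back into int64 or drop the leading "?".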

Options for building an array: ak::ArrayBuilderOptions are passed to every ak::Builder in the tree.

Buffers for building an array: ak::GrowableBuffer<T> is a one-dimensional array with append-only semantics, used both for buffers that will become ak::IndexOf<T> and buffers that will become ak::NumpyArray. It works like std::vector in that it replaces its underlying storage at logarithmically frequent intervals, but unlike a std::vector in that the underlying storage is a std::shared_ptr<void*>.

After an ak::ArrayBuilder is turned into an ak::Content with snapshot, the read-only data and its append-only source share buffers (with std::shared_ptr<void*>). This makes the snapshot operation fast and capable of being called frequently, and the std::shared_ptr manages the lifetime of buffers that stay in scope because an ak::Content is using them, even if an ak::GrowableBuffer<T> is not (because it reallocated its internal buffer).
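
The buffer-sharing behavior described above can be imitated in pure Python, with a list object standing in for the shared allocation and Python's reference counting standing in for std::shared_ptr. The class name GrowableBufferSketch and the initial and resize parameters are hypothetical; this is a sketch of the idea, not the C++ class.

```python
class GrowableBufferSketch:
    """Append-only buffer whose storage is shared with its snapshots."""

    def __init__(self, initial=4, resize=2.0):
        self._storage = [None] * initial  # stands in for the shared allocation
        self._length = 0
        self._resize = resize

    def append(self, value):
        if self._length == len(self._storage):
            # Out of room: allocate larger storage and copy the old data.
            # The old storage is left untouched for snapshots that hold it.
            capacity = max(self._length + 1, int(len(self._storage) * self._resize))
            new_storage = [None] * capacity
            new_storage[:self._length] = self._storage[:self._length]
            self._storage = new_storage
        self._storage[self._length] = value
        self._length += 1

    def snapshot(self):
        """Return (storage, length) without copying; read-only by convention."""
        return self._storage, self._length


buf = GrowableBufferSketch(initial=2)
buf.append(1)
first_storage, first_length = buf.snapshot()

buf.append(2)   # fits in the same storage; the first snapshot still sees 1 item
second_storage, second_length = buf.snapshot()

buf.append(3)   # exceeds capacity, so the buffer moves to new storage
third_storage, third_length = buf.snapshot()

print(first_storage[:first_length])
print(second_storage[:second_length])
print(third_storage[:third_length])
print(first_storage is second_storage, second_storage is third_storage)
```

Because a snapshot records its own length, later appends into the shared storage are invisible to it, and after a reallocation the old storage stays alive for as long as some snapshot refers to it.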

Array building is not as efficient as computing with pre-built arrays because the type-discovery makes each access a tree-descent. Array building is also an exception to the rule that C++ implementations do not touch array data (do not dereference pointers in ak::RawArrayOf<T>, ak::NumpyArray, ak::IndexOf<T>, and ak::IdentitiesOf<T>). The ak::Builder instances append to their ak::GrowableBuffer<T>, which are assumed to exist in main memory, not a GPU.

Reducer operations like ak.sum and ak.max are implemented as a group for code reuse. The most complex part is how elements from variable-length lists are grouped when axis != -1 (see ak::Content::reduce_next). The problem of actually computing sums and maxima is comparatively simple.
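
To show why grouping is the hard part, here is a pure-Python sketch of summing variable-length lists described by a flat content and an offsets buffer. It is not ak::Content::reduce_next; the function names and the offsets representation are assumptions for illustration. Summing within each list (the axis == -1 case) is a single pass over contiguous ranges, whereas summing across lists has to match up elements by their position inside each list.

```python
def sum_innermost(content, offsets):
    """Sum each list separately (like axis=-1)."""
    return [
        sum(content[offsets[i]:offsets[i + 1]])
        for i in range(len(offsets) - 1)
    ]


def sum_across_lists(content, offsets):
    """Sum elements that share the same position within their list (like axis=0)."""
    lengths = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
    out = [0] * max(lengths, default=0)
    for i in range(len(offsets) - 1):
        for position, j in enumerate(range(offsets[i], offsets[i + 1])):
            # Elements from different lists land in the same output slot
            # when they sit at the same position within their list.
            out[position] += content[j]
    return out


# Represents [[1, 2, 3], [], [4, 5]]
content = [1, 2, 3, 4, 5]
offsets = [0, 3, 3, 5]

print(sum_innermost(content, offsets))
print(sum_across_lists(content, offsets))
```

Swapping the addition for another operation (for example max with a suitable identity) would leave the grouping logic unchanged, which is consistent with implementing the reducers as a group.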

Each reducer has a class that implements the sum, max, etc.

Many Python objects can be used as slices; they are each converted into a specialized C++ type before being passed to ak::Content::getitem.

  • ak::Slice: represents a Python tuple of slice items, which selects multiple dimensions at once. A non-tuple slice in Python is wrapped as a single-item ak::Slice in C++.
  • ak::SliceItem: abstract base class for non-tuple slice items.
  • ak::SliceAt: represents a single integer as a slice item, which selects an element.
  • ak::SliceRange: represents a Python slice object with start, stop, and step, which selects a range of elements.
  • ak::SliceField: represents a single string as a slice item, which selects a record field or projects across all record fields in an array.
  • ak::SliceFields: represents an iterable of strings as a slice item, which selects a set of record fields or projects across a set of all record fields in an array.
  • ak::SliceNewAxis: represents np.newaxis (a.k.a. None), which inserts a new regular dimension of length 1 in the sliced output.
  • ak::SliceEllipsis: represents Ellipsis (a.k.a. ...), which skips enough dimensions to put the rest of the slice items at the deepest possible level.
  • ak::SliceArrayOf: represents an integer or boolean array (boolean arrays are converted into integers early), which does general element-selection (rearrangement, filtering, and duplication).
  • ak::SliceJaggedOf: represents one level of jaggedness in an array as a slice item.
  • ak::SliceMissingOf: represents an array with missing values as a slice item.
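
The conversion from Python slice objects to these classes can be pictured with a small classifier. This sketch only returns the class names listed above as strings; the function names classify_item and classify_slice are hypothetical, the dispatch rules are a simplified guess at the behavior described in the list, and jagged and missing-value arrays are deliberately left out.

```python
def classify_item(item):
    """Name the slice-item class that a single Python object would map to."""
    if item is None:
        return "SliceNewAxis"
    if item is Ellipsis:
        return "SliceEllipsis"
    if isinstance(item, slice):
        return "SliceRange"
    if isinstance(item, str):
        return "SliceField"
    if isinstance(item, int) and not isinstance(item, bool):
        return "SliceAt"
    if isinstance(item, list):
        if item and all(isinstance(x, str) for x in item):
            return "SliceFields"
        if all(isinstance(x, int) for x in item):  # bools count as ints here
            return "SliceArrayOf"
    raise TypeError(f"not handled in this sketch: {item!r}")


def classify_slice(where):
    """Wrap a non-tuple slice as a one-item tuple, then classify each item."""
    items = where if isinstance(where, tuple) else (where,)
    return [classify_item(item) for item in items]


print(classify_slice(3))
print(classify_slice((..., slice(0, 2), "x")))
print(classify_slice([True, False, True]))
print(classify_slice((None, ["x", "y"])))
```

The wrapping step in classify_slice mirrors the statement that a non-tuple slice in Python becomes a single-item ak::Slice.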

Note: ak::SliceGenerator is not an ak::SliceItem; it's an ak::ArrayGenerator. (A case of clashing naming conventions.)

The following classes hide Awkward Array's dependence on RapidJSON, so that it's a swappable component. Header files do not reference RapidJSON, but the implementations of these classes do: ak::ToJson, ak::ToJsonString, ak::ToJsonPrettyString, ak::ToJsonFile, ak::ToJsonPrettyFile.

The kernels library is separated from the C++ codebase by a pure C interface, and thus the kernels could be used by other languages.

The CPU kernels are implemented in these files:

The GPU kernels follow exactly the same interface, though with a different implementation.

FIXME: kernel function names and arguments should be systematized in some searchable way.