How to filter arrays by number of items

import awkward as ak

In general, arrays are filtered using NumPy-like slicing. Numerical values can be filtered by numerical expressions in a way that is very similar to NumPy:

array = ak.Array([
    [[0, 1.1, 2.2], []], [[3.3, 4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]
])

array[array > 4]

[[[], []],
 [[4.4]],
 [],
 [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
type: 4 * var * var * float64

However, it’s also common to filter arrays by the number of items in each list, for two reasons:

- to exclude empty lists so that subsequent slices can select the item at index 0,
- to make the list lengths rectangular for computational steps that require rectangular arrays (such as most forms of machine learning).

There are two functions that provide the lengths of lists: ak.num() and ak.count(). To filter arrays, you’ll most likely want ak.num().

Use `ak.num`

ak.num() can be applied at any axis. It returns the number of items in each list at that axis, keeping the nesting structure of all levels above that axis.

ak.num(array, axis=0)

array(4)

ak.num(array, axis=1)   # default

[2,
 1,
 0,
 2]
---------------
type: 4 * int64

ak.num(array, axis=2)

[[3, 0],
 [2],
 [],
 [1, 4]]
---------------------
type: 4 * var * int64

Thus, to select the outer lists of array that have length 2, use axis=1 (the default):

array[ak.num(array) == 2]

[[[0, 1.1, 2.2], []],
 [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
type: 2 * var * var * float64

And if you want to select inner lists of array with length greater than 2, you would use axis=2:

array[ak.num(array, axis=2) > 2]

[[[0, 1.1, 2.2]],
 [],
 [],
 [[6.6, 7.7, 8.8, 9.9]]]
-----------------------------
type: 4 * var * var * float64

The ragged array of booleans that you get from comparing ak.num() with a number is exactly what is needed to slice the array.

Don’t use `ak.count`

By contrast, ak.count() returns structures that you can’t use this way (for all but axis=-1):

ak.count(array, axis=None)   # default

10

ak.count(array, axis=0)

[[3, 2, 1],
 [1, 1, 1, 1]]
---------------------
type: 2 * var * int64

ak.count(array, axis=1)

[[1, 1, 1],
 [1, 1],
 [],
 [2, 1, 1, 1]]
---------------------
type: 4 * var * int64

ak.count(array, axis=2)   # equivalent to axis=-1 for this array

[[3, 0],
 [2],
 [],
 [1, 4]]
---------------------
type: 4 * var * int64

Also, ak.num() can be used on arrays that contain records, whereas ak.count(), like other reducers, can’t.

As a reducer, ak.count() is intended to be used in a mathematical formula with other reducers, like ak.sum(), ak.max(), etc. (usually as a denominator). Its axis behavior matches that of other reducers, which is important for the shapes of nested lists to align.
