Sort one array by another#

A common task in data analysis is to select the items from one array that minimize or maximize the values in another, or to sort one array by the values of another.

import awkward as ak

Naive attempt goes wrong#

For instance, in

data = ak.Array([
    [
        {"title": "zero", "x": 0, "y": 0},
        {"title": "two", "x": 2, "y": 2.2},
        {"title": "one", "x": 1, "y": 1.1},
    ],
    [],
    [
        {"title": "four", "x": 4, "y": 4.4},
        {"title": "three", "x": 3, "y": 3.3},
    ],
    [
        {"title": "five", "x": 5, "y": 5.5},
    ],
    [
        {"title": "eight", "x": 8, "y": 8.8},
        {"title": "six", "x": 6, "y": 6.6},
        {"title": "nine", "x": 9, "y": 9.9},
        {"title": "seven", "x": 7, "y": 7.7},
    ],
])

you may want to score each record with a computed value, such as x**2 + y**2, and then select the record with the highest score from each list.

score = data.x**2 + data.y**2
score

[[0, 8.84, 2.21],
 [],
 [35.4, 19.9],
 [55.2],
 [141, 79.6, 179, 108]]
-----------------------
type: 5 * var * float64

At first, it would seem that ak.argmax() is what you need to identify the item with the highest score from each list and select it from data.

best_index = ak.argmax(score, axis=1)
best_index

[1,
 None,
 0,
 0,
 2]
----------------
type: 5 * ?int64

However, if you attempt to slice the data with this, you’ll either get an indexing error or lists instead of records:

data[best_index]

[[],
 None,
 [{title: 'zero', x: 0, y: 0}, {...}, {title: 'one', x: 1, y: 1.1}],
 [{title: 'zero', x: 0, y: 0}, {...}, {title: 'one', x: 1, y: 1.1}],
 [{title: 'four', x: 4, y: 4.4}, {title: 'three', x: 3, y: 3.3}]]
--------------------------------------------------------------------
type: 5 * option[var * {
    title: string,
    x: int64,
    y: float64
}]

What happened?#

Following the logic for reducers, the ak.argmax() function returns an array with one fewer dimension than its input: data is an array of lists of records, but best_index is an array of integers. To pick one record out of each list, we need an array of lists of integers.

The keepdims=True parameter can ensure that the output has the same number of dimensions as the input:

best_index = ak.argmax(score, axis=1, keepdims=True)
best_index

[[1],
 [None],
 [0],
 [0],
 [2]]
--------------------
type: 5 * 1 * ?int64

Now these integers are at the same level of depth as the records that we want to select:

result = data[best_index]
result

[[{title: 'two', x: 2, y: 2.2}],
 [None],
 [{title: 'four', x: 4, y: 4.4}],
 [{title: 'five', x: 5, y: 5.5}],
 [{title: 'nine', x: 9, y: 9.9}]]
---------------------------------
type: 5 * var * ?{
    title: string,
    x: int64,
    y: float64
}

In the above, each length-1 list contains the record with the highest score. Even the empty list, for which ak.argmax() returns a missing value (None), is now a length-1 list containing None. We can remove this length-1 list structure with a slice:

result[:, 0]

[{title: 'two', x: 2, y: 2.2},
 None,
 {title: 'four', x: 4, y: 4.4},
 {title: 'five', x: 5, y: 5.5},
 {title: 'nine', x: 9, y: 9.9}]
-------------------------------
type: 5 * ?{
    title: string,
    x: int64,
    y: float64
}

To summarize this as a handy idiom, the way to get the record with maximum data.x**2 + data.y**2 from an array of lists of records named data is

data[ak.argmax(data.x**2 + data.y**2, axis=1, keepdims=True)][:, 0]

[{title: 'two', x: 2, y: 2.2},
 None,
 {title: 'four', x: 4, y: 4.4},
 {title: 'five', x: 5, y: 5.5},
 {title: 'nine', x: 9, y: 9.9}]
-------------------------------
type: 5 * ?{
    title: string,
    x: int64,
    y: float64
}

For an array of lists of lists of records, axis=2 and the final slice would be [:, :, 0], and so on.

Sorting by another array#

In addition to selecting items corresponding to the minimum or maximum of some other array, we may want to sort by another array. Just as ak.argmin() and ak.argmax() convey indexes from one array to another, ak.argsort() conveys sorted indexes from one array to another. However, ak.argsort() always maintains the total number of dimensions, so we don’t need to worry about keepdims.

sorted_indexes = ak.argsort(score)
sorted_indexes

[[0, 2, 1],
 [],
 [1, 0],
 [0],
 [1, 3, 0, 2]]
---------------------
type: 5 * var * int64

data[sorted_indexes]

[[{title: 'zero', x: 0, y: 0}, {...}, {title: 'two', x: 2, y: 2.2}],
 [],
 [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}],
 [{title: 'five', x: 5, y: 5.5}],
 [{title: 'six', x: 6, y: 6.6}, {...}, ..., {title: 'nine', x: 9, y: 9.9}]]
---------------------------------------------------------------------------
type: 5 * var * {
    title: string,
    x: int64,
    y: float64
}

This sorted data has the same type as data:

data.type.show()

5 * var * {
    title: string,
    x: int64,
    y: float64
}

It’s exactly what we want. ak.argsort() is easier to use than ak.argmin() and ak.argmax() because it needs neither keepdims nor a trailing slice.

Getting the top n items#

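The ak.min(), ak.max(), ak.argmin(), and ak.argmax() functions select one extreme value. If you want the top n items (with n ≠ 1), you can use ak.sort() or ak.argsort(), followed by a slice. Both functions sort in ascending order by default, so the slice below keeps the two lowest-scoring records from each list; pass ascending=False to keep the highest-scoring ones instead: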

top2 = data[ak.argsort(score)][:, :2]
top2

[[{title: 'zero', x: 0, y: 0}, {title: 'one', x: 1, y: 1.1}],
 [],
 [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}],
 [{title: 'five', x: 5, y: 5.5}],
 [{title: 'six', x: 6, y: 6.6}, {title: 'seven', x: 7, y: 7.7}]]
-----------------------------------------------------------------
type: 5 * var * {
    title: string,
    x: int64,
    y: float64
}

Notice, though, that not all of these lists have length 2. The lists with 0 or 1 input items have 0 or 1 output items, so each list has at most 2 items. That may be fine, but the example with ak.argmax(), above, resulted in None for an empty list. We could emulate that with ak.pad_none().

padded = ak.pad_none(top2, 2, axis=1)
padded

[[{title: 'zero', x: 0, y: 0}, {title: 'one', x: 1, y: 1.1}],
 [None, None],
 [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}],
 [{title: 'five', x: 5, y: 5.5}, None],
 [{title: 'six', x: 6, y: 6.6}, {title: 'seven', x: 7, y: 7.7}]]
-----------------------------------------------------------------
type: 5 * var * ?{
    title: string,
    x: int64,
    y: float64
}

The data type still says “var *”, meaning that the lists are allowed to be variable-length, even though they happen to all have length 2. At this point, we might not care because that’s all we need in order to convert these fields into NumPy arrays (e.g. for some machine learning process):

ak.to_numpy(padded.x)

masked_array(
  data=[[0, 1],
        [--, --],
        [3, 4],
        [5, --],
        [6, 7]],
  mask=[[False, False],
        [ True,  True],
        [False, False],
        [False,  True],
        [False, False]],
  fill_value=999999)

ak.to_numpy(padded.y)

masked_array(
  data=[[0.0, 1.1],
        [--, --],
        [3.3, 4.4],
        [5.5, --],
        [6.6, 7.7]],
  mask=[[False, False],
        [ True,  True],
        [False, False],
        [False,  True],
        [False, False]],
  fill_value=1e+20)

Or we might want to force the data type to ensure that the lists have length 2, using ak.to_regular(), ak.enforce_type(), or just by passing clip=True in the original ak.pad_none().

ak.to_regular(padded, axis=1)

[[{title: 'zero', x: 0, y: 0}, {title: 'one', x: 1, y: 1.1}],
 [None, None],
 [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}],
 [{title: 'five', x: 5, y: 5.5}, None],
 [{title: 'six', x: 6, y: 6.6}, {title: 'seven', x: 7, y: 7.7}]]
-----------------------------------------------------------------
type: 5 * 2 * ?{
    title: string,
    x: int64,
    y: float64
}

(Now the list lengths are “2 *”, rather than “var *”.)