How to find the best match between two collections using Cartesian (cross) product

In high energy physics (HEP), ak.cartesian() is often needed to find particles whose trajectories are close to each other, separately within each of many high-energy collision events (axis=1). In some applications, the two collections that need to be matched are simulated particles and reconstructed versions of those particles (“gen-reco matching”), and in other applications, the two collections are different types of particles, such as electrons and jets.

I’ll describe how to solve such a problem on this page, but avoid domain-specific jargon by casting it as a problem of finding the distance between bunnies and foxes—if a bunny is too close to a fox, it will get eaten!

import awkward as ak
import numpy as np

Setting up the problem

In 1000 separate yards (a big suburb), there is a random number of bunnies and a random number of foxes, each with random x, y positions. We build ragged arrays of records using ak.unflatten() and ak.zip().

np.random.seed(12345)

number_of_bunnies = np.random.poisson(3.5, 1000)   # average of 3.5 bunnies/yard
number_of_foxes = np.random.poisson(1.5, 1000)     # average of 1.5 foxes/yard

bunny_xy = np.random.normal(0, 1, (number_of_bunnies.sum(), 2))
fox_xy = np.random.normal(0, 1, (number_of_foxes.sum(), 2))

bunnies = ak.unflatten(ak.zip({"x": bunny_xy[:, 0], "y": bunny_xy[:, 1]}), number_of_bunnies)
foxes = ak.unflatten(ak.zip({"x": fox_xy[:, 0], "y": fox_xy[:, 1]}), number_of_foxes)

bunnies

[[{x: -1, y: -0.727}, {x: 0.735, y: -0.594}, {x: -0.588, y: 0.718}],
 [{x: -0.449, y: -1.61}, {x: 1.59, y: -0.627}, ..., {...}, {x: 1.49, y: -1.42}],
 [{x: 0.441, y: -0.705}, {x: -1.96, y: 0.31}],
 [{x: -0.121, y: -0.232}, {x: -0.44, y: -2.03}, ..., {x: 0.0405, y: 1.24}],
 [{x: -0.703, y: -0.049}, {x: -0.209, y: ..., ...}, ..., {x: 0.525, y: 0.265}],
 [{x: -0.958, y: 0.119}, {x: 0.885, y: -0.266}],
 [{x: -1.28, y: 0.235}, {x: 0.625, y: ..., ...}, {x: -0.748, y: -0.336}],
 [{x: 0.37, y: 0.582}, {x: -0.594, y: -2.68}, ..., {...}, {x: 0.412, y: 0.32}],
 [{x: 0.57, y: 0.342}, {x: -0.416, y: -0.973}],
 [{x: -0.0416, y: 0.802}, {x: -0.144, y: ..., ...}, ..., {x: -0.635, y: -1.71}],
 ...,
 [{x: 1.7, y: -0.564}],
 [{x: 1.11, y: -0.309}, {x: 0.155, y: -0.458}, ..., {x: 0.352, y: -1.18}],
 [{x: -0.534, y: 0.555}, {x: -1.07, y: -1.37}, {x: -0.771, y: 0.699}],
 [{x: -0.586, y: 1.86}],
 [{x: 0.0432, y: -1.22}, {x: -1.15, y: 0.45}],
 [{x: -0.868, y: 1.79}, {x: 0.999, y: -1.16}, ..., {x: -0.877, y: 0.203}],
 [{x: 0.73, y: 0.335}, {x: -0.372, y: -0.659}, {x: 0.946, y: -1.1}],
 [{x: -0.622, y: -1.26}],
 [{x: 1.47, y: 1.5}, {x: 1.94, y: -2.37}]]
--------------------------------------------------------------------------------
type: 1000 * var * {
    x: float64,
    y: float64
}

foxes

[[{x: 0.183, y: -1.17}],
 [{x: 0.497, y: 0.761}],
 [{x: -1.06, y: 0.0902}],
 [{x: -0.106, y: -1.28}],
 [{x: 0.338, y: 0.642}],
 [{x: -0.833, y: -1.11}, {x: -0.163, y: ..., ...}, {x: -0.548, y: -0.546}],
 [{x: 0.89, y: 0.67}, {x: -1.24, y: 0.543}, {...}, {x: 0.211, y: -1.03}],
 [{x: 0.411, y: 1.23}, {x: -0.51, y: 0.931}, {...}, {x: 0.953, y: 1.18}],
 [{x: -3.47, y: -0.273}],
 [{x: -0.553, y: 0.339}],
 ...,
 [{x: 0.856, y: 1.06}, {x: 1.12, y: -0.0269}],
 [],
 [],
 [],
 [{x: -1.51, y: 0.912}],
 [{x: -0.665, y: -0.175}],
 [{x: -0.401, y: -1.2}, {x: -1.33, y: -1.06}],
 [],
 [{x: -0.0623, y: -0.596}, {x: 1.35, y: ..., ...}, ..., {x: 0.052, y: -0.242}]]
-------------------------------------------------------------------------------
type: 1000 * var * {
    x: float64,
    y: float64
}

Find all combinations

In each yard, we find all bunny-fox pairs with ak.cartesian(), regardless of whether they’re close or not, and then unpack the pairs with ak.unzip().

pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes]))

These two arrays, pair_bunnies and pair_foxes, have the same type as bunnies and foxes, but different numbers of items in each list because now they’re paired to match each other. Both kinds of animals are duplicated to enable this match: each bunny appears once for every fox in its yard, and each fox appears once for every bunny.

pair_bunnies

[[{x: -1, y: -0.727}, {x: 0.735, y: -0.594}, {x: -0.588, y: 0.718}],
 [{x: -0.449, y: -1.61}, {x: 1.59, y: -0.627}, ..., {...}, {x: 1.49, y: -1.42}],
 [{x: 0.441, y: -0.705}, {x: -1.96, y: 0.31}],
 [{x: -0.121, y: -0.232}, {x: -0.44, y: -2.03}, ..., {x: 0.0405, y: 1.24}],
 [{x: -0.703, y: -0.049}, {x: -0.209, y: ..., ...}, ..., {x: 0.525, y: 0.265}],
 [{x: -0.958, y: 0.119}, {x: -0.958, y: ..., ...}, ..., {x: 0.885, y: -0.266}],
 [{x: -1.28, y: 0.235}, {x: -1.28, y: 0.235}, ..., {x: -0.748, y: -0.336}],
 [{x: 0.37, y: 0.582}, {x: 0.37, y: 0.582}, ..., {...}, {x: 0.412, y: 0.32}],
 [{x: 0.57, y: 0.342}, {x: -0.416, y: -0.973}],
 [{x: -0.0416, y: 0.802}, {x: -0.144, y: ..., ...}, ..., {x: -0.635, y: -1.71}],
 ...,
 [{x: 1.7, y: -0.564}, {x: 1.7, y: -0.564}],
 [],
 [],
 [],
 [{x: 0.0432, y: -1.22}, {x: -1.15, y: 0.45}],
 [{x: -0.868, y: 1.79}, {x: 0.999, y: -1.16}, ..., {x: -0.877, y: 0.203}],
 [{x: 0.73, y: 0.335}, {x: 0.73, y: 0.335}, ..., {...}, {x: 0.946, y: -1.1}],
 [],
 [{x: 1.47, y: 1.5}, {x: 1.47, y: 1.5}, {...}, ..., {...}, {x: 1.94, y: -2.37}]]
--------------------------------------------------------------------------------
type: 1000 * var * {
    x: float64,
    y: float64
}

pair_foxes

[[{x: 0.183, y: -1.17}, {x: 0.183, y: -1.17}, {x: 0.183, y: -1.17}],
 [{x: 0.497, y: 0.761}, {x: 0.497, y: 0.761}, ..., {...}, {x: 0.497, y: 0.761}],
 [{x: -1.06, y: 0.0902}, {x: -1.06, y: 0.0902}],
 [{x: -0.106, y: -1.28}, {x: -0.106, y: ..., ...}, ..., {x: -0.106, y: -1.28}],
 [{x: 0.338, y: 0.642}, {x: 0.338, y: 0.642}, ..., {...}, {x: 0.338, y: 0.642}],
 [{x: -0.833, y: -1.11}, {x: -0.163, y: ..., ...}, ..., {x: -0.548, y: -0.546}],
 [{x: 0.89, y: 0.67}, {x: -1.24, y: 0.543}, ..., {...}, {x: 0.211, y: -1.03}],
 [{x: 0.411, y: 1.23}, {x: -0.51, y: 0.931}, ..., {...}, {x: 0.953, y: 1.18}],
 [{x: -3.47, y: -0.273}, {x: -3.47, y: -0.273}],
 [{x: -0.553, y: 0.339}, {x: -0.553, y: ..., ...}, ..., {x: -0.553, y: 0.339}],
 ...,
 [{x: 0.856, y: 1.06}, {x: 1.12, y: -0.0269}],
 [],
 [],
 [],
 [{x: -1.51, y: 0.912}, {x: -1.51, y: 0.912}],
 [{x: -0.665, y: -0.175}, {x: -0.665, ...}, {...}, {x: -0.665, y: -0.175}],
 [{x: -0.401, y: -1.2}, {x: -1.33, y: -1.06}, ..., {...}, {x: -1.33, y: -1.06}],
 [],
 [{x: -0.0623, y: -0.596}, {x: 1.35, y: ..., ...}, ..., {x: 0.052, y: -0.242}]]
--------------------------------------------------------------------------------
type: 1000 * var * {
    x: float64,
    y: float64
}

The two arrays have the same list lengths as each other because they were unzipped from the same array of pairs.

ak.num(pair_bunnies), ak.num(pair_foxes)

(<Array [3, 8, 2, 8, 5, 6, 12, 20, ..., 0, 0, 2, 4, 6, 0, 8] type='1000 * int64'>,
 <Array [3, 8, 2, 8, 5, 6, 12, 20, ..., 0, 0, 2, 4, 6, 0, 8] type='1000 * int64'>)

Calculating distances

Since the arrays have the same shapes, they can be used in the same mathematical formula. Here’s the formula for distance:

distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances

[[1.27, 0.799, 2.04],
 [2.55, 1.77, 3.37, 1.57, 0.352, 2, 0.758, 2.39],
 [1.7, 0.924],
 [1.05, 0.819, 1.38, 3.03, 1.44, 2.22, 2.89, 2.53],
 [1.25, 2.61, 0.953, 0.685, 0.421],
 [1.24, 0.88, 0.781, 1.92, 1.3, 1.46],
 [2.21, 0.31, 1.36, 1.95, 0.656, 1.93, ..., 1.17, 1.92, 1.01, 0.788, 1.18],
 [0.645, 0.947, 1.42, 0.834, 4.03, 3.61, ..., 3.12, 0.906, 1.11, 1.48, 1.02],
 [4.08, 3.13],
 [0.69, 2.84, 0.312, 2.05],
 ...,
 [1.83, 0.789],
 [],
 [],
 [],
 [2.64, 0.585],
 [1.97, 1.94, 2.51, 0.434],
 [1.91, 2.49, 0.545, 1.04, 1.35, 2.28],
 [],
 [2.6, 0.607, 2.03, 2.25, 2.67, 3.33, 2.26, 2.84]]
-----------------------------------------------------------------------------
type: 1000 * var * float64

Let’s say that 1 unit is close enough for a bunny to be eaten.

eaten = (distances < 1)
eaten

[[False, True, False],
 [False, False, False, False, True, False, True, False],
 [False, True],
 [False, True, False, False, False, False, False, False],
 [False, False, True, True, True],
 [False, True, True, False, False, False],
 [False, True, False, False, True, ..., False, False, False, True, False],
 [True, True, False, True, False, False, ..., False, True, False, False, False],
 [False, False],
 [True, False, True, False],
 ...,
 [False, True],
 [],
 [],
 [],
 [False, True],
 [False, False, False, True],
 [False, False, True, False, False, False],
 [],
 [False, True, False, False, False, False, False, False]]
--------------------------------------------------------------------------------
type: 1000 * var * bool

This is great (not for the bunnies, but perhaps for the foxes). However, if we want to use this information on the original arrays, we’re stuck: this array has a different shape from the original bunnies (and the original foxes).

Perhaps the question we really wanted to ask is, “For each bunny, is there any fox that can eat it?”

Combinations with `nested=True`

Asking a question about any fox means applying a reducer, ak.any(), over lists, with one list per bunny. Each list would contain all of the foxes in that bunny’s yard. To get that structure, we’ll need to pass nested=True to ak.cartesian().

pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes], nested=True))

Now pair_bunnies and pair_foxes are one list-depth deeper than the original bunnies and foxes.

pair_bunnies

[[[{x: -1, y: -0.727}], [{x: 0.735, ...}], [{x: -0.588, y: 0.718}]],
 [[{x: -0.449, y: -1.61}], [{x: 1.59, ...}], ..., [{x: 1.49, y: -1.42}]],
 [[{x: 0.441, y: -0.705}], [{x: -1.96, y: 0.31}]],
 [[{x: -0.121, y: -0.232}], [{x: -0.44, ...}], ..., [{x: 0.0405, y: 1.24}]],
 [[{x: -0.703, y: -0.049}], [{x: -0.209, ...}], ..., [{x: 0.525, y: 0.265}]],
 [[{x: -0.958, y: 0.119}, {x: -0.958, ...}, {x: -0.958, y: 0.119}], [...]],
 [[{x: -1.28, y: 0.235}, {x: -1.28, ...}, ..., {x: -1.28, y: 0.235}], ...],
 [[{x: 0.37, y: 0.582}, {x: 0.37, y: ..., ...}, ..., {x: 0.37, y: 0.582}], ...],
 [[{x: 0.57, y: 0.342}], [{x: -0.416, y: -0.973}]],
 [[{x: -0.0416, y: 0.802}], [{x: -0.144, ...}], ..., [{x: -0.635, y: -1.71}]],
 ...,
 [[{x: 1.7, y: -0.564}, {x: 1.7, y: -0.564}]],
 [[], [], [], []],
 [[], [], []],
 [[]],
 [[{x: 0.0432, y: -1.22}], [{x: -1.15, y: 0.45}]],
 [[{x: -0.868, y: 1.79}], [{x: 0.999, ...}], ..., [{x: -0.877, y: 0.203}]],
 [[{x: 0.73, y: 0.335}, {x: 0.73, y: 0.335}], ..., [{x: 0.946, ...}, ...]],
 [[]],
 [[{x: 1.47, y: 1.5}, {x: 1.47, y: 1.5}, {...}, {x: 1.47, y: 1.5}], [...]]]
--------------------------------------------------------------------------------
type: 1000 * var * var * {
    x: float64,
    y: float64
}

pair_foxes

[[[{x: 0.183, y: -1.17}], [{x: 0.183, ...}], [{x: 0.183, y: -1.17}]],
 [[{x: 0.497, y: 0.761}], [{x: 0.497, ...}], ..., [{x: 0.497, y: 0.761}]],
 [[{x: -1.06, y: 0.0902}], [{x: -1.06, y: 0.0902}]],
 [[{x: -0.106, y: -1.28}], [{x: -0.106, ...}], ..., [{x: -0.106, y: -1.28}]],
 [[{x: 0.338, y: 0.642}], [{x: 0.338, ...}], ..., [{x: 0.338, y: 0.642}]],
 [[{x: -0.833, y: -1.11}, {x: -0.163, ...}, {x: -0.548, y: -0.546}], ...],
 [[{x: 0.89, y: 0.67}, {x: -1.24, ...}, {...}, {x: 0.211, y: -1.03}], ...],
 [[{x: 0.411, y: 1.23}, {x: -0.51, ...}, {...}, {x: 0.953, y: 1.18}], ...],
 [[{x: -3.47, y: -0.273}], [{x: -3.47, y: -0.273}]],
 [[{x: -0.553, y: 0.339}], [{x: -0.553, ...}], ..., [{x: -0.553, y: 0.339}]],
 ...,
 [[{x: 0.856, y: 1.06}, {x: 1.12, y: -0.0269}]],
 [[], [], [], []],
 [[], [], []],
 [[]],
 [[{x: -1.51, y: 0.912}], [{x: -1.51, y: 0.912}]],
 [[{x: -0.665, y: -0.175}], [{x: -0.665, ...}], ..., [{x: -0.665, y: -0.175}]],
 [[{x: -0.401, y: -1.2}, {x: -1.33, y: -1.06}], ..., [{x: -0.401, ...}, ...]],
 [[]],
 [[{x: -0.0623, y: -0.596}, {x: 1.35, ...}, ..., {x: 0.052, y: -0.242}], ...]]
-------------------------------------------------------------------------------
type: 1000 * var * var * {
    x: float64,
    y: float64
}

We can compute distances in the same way, though the result is also one list-depth deeper.

distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances

[[[1.27], [0.799], [2.04]],
 [[2.55], [1.77], [3.37], [1.57], [0.352], [2], [0.758], [2.39]],
 [[1.7], [0.924]],
 [[1.05], [0.819], [1.38], [3.03], [1.44], [2.22], [2.89], [2.53]],
 [[1.25], [2.61], [0.953], [0.685], [0.421]],
 [[1.24, 0.88, 0.781], [1.92, 1.3, 1.46]],
 [[2.21, 0.31, 1.36, 1.95], [0.656, ..., 1.17], [1.92, 1.01, 0.788, 1.18]],
 [[0.645, 0.947, 1.42, 0.834], [4.03, ...], ..., [0.906, 1.11, 1.48, 1.02]],
 [[4.08], [3.13]],
 [[0.69], [2.84], [0.312], [2.05]],
 ...,
 [[1.83, 0.789]],
 [[], [], [], []],
 [[], [], []],
 [[]],
 [[2.64], [0.585]],
 [[1.97], [1.94], [2.51], [0.434]],
 [[1.91, 2.49], [0.545, 1.04], [1.35, 2.28]],
 [[]],
 [[2.6, 0.607, 2.03, 2.25], [2.67, 3.33, 2.26, 2.84]]]
----------------------------------------------------------------------------
type: 1000 * var * var * float64

Similarly for eaten.

eaten = (distances < 1)
eaten

[[[False], [True], [False]],
 [[False], [False], [False], [False], [True], [False], [True], [False]],
 [[False], [True]],
 [[False], [True], [False], [False], [False], [False], [False], [False]],
 [[False], [False], [True], [True], [True]],
 [[False, True, True], [False, False, False]],
 [[False, True, False, False], [True, ...], [False, False, True, False]],
 [[True, True, False, True], [False, ...], ..., [True, False, False, False]],
 [[False], [False]],
 [[True], [False], [True], [False]],
 ...,
 [[False, True]],
 [[], [], [], []],
 [[], [], []],
 [[]],
 [[False], [True]],
 [[False], [False], [False], [True]],
 [[False, False], [True, False], [False, False]],
 [[]],
 [[False, True, False, False], [False, False, False, False]]]
-----------------------------------------------------------------------------
type: 1000 * var * var * bool

Now each inner list of booleans answers the questions, “Can fox 0 eat me?”, “Can fox 1 eat me?”, …, “Can fox n eat me?”, and there are exactly as many of these lists as there are bunnies. Applying ak.any() over the innermost lists (axis=-1) reduces each of them to a single boolean per bunny. A bunny in a yard with no foxes has an empty list, which reduces to False.

bunny_eaten = ak.any(eaten, axis=-1)
bunny_eaten

[[False, True, False],
 [False, False, False, False, True, False, True, False],
 [False, True],
 [False, True, False, False, False, False, False, False],
 [False, False, True, True, True],
 [True, False],
 [True, True, True],
 [True, False, False, False, True],
 [False, False],
 [True, False, True, False],
 ...,
 [True],
 [False, False, False, False],
 [False, False, False],
 [False],
 [False, True],
 [False, False, False, True],
 [False, True, False],
 [False],
 [True, False]]
---------------------------------------------------------
type: 1000 * var * bool

We’ve now answered the question, “Can any fox eat me?” for each bunny. After the mayhem, these are the bunnies we have left:

bunnies[~bunny_eaten]

[[{x: -1, y: -0.727}, {x: -0.588, y: 0.718}],
 [{x: -0.449, y: -1.61}, {x: 1.59, y: -0.627}, ..., {...}, {x: 1.49, y: -1.42}],
 [{x: 0.441, y: -0.705}],
 [{x: -0.121, y: -0.232}, {x: 0.23, y: 0.0559}, ..., {x: 0.0405, y: 1.24}],
 [{x: -0.703, y: -0.049}, {x: -0.209, y: -1.91}],
 [{x: 0.885, y: -0.266}],
 [],
 [{x: -0.594, y: -2.68}, {x: 0.235, y: -1.13}, {x: 0.878, y: -1.94}],
 [{x: 0.57, y: 0.342}, {x: -0.416, y: -0.973}],
 [{x: -0.144, y: -2.47}, {x: -0.635, y: -1.71}],
 ...,
 [],
 [{x: 1.11, y: -0.309}, {x: 0.155, y: -0.458}, ..., {x: 0.352, y: -1.18}],
 [{x: -0.534, y: 0.555}, {x: -1.07, y: -1.37}, {x: -0.771, y: 0.699}],
 [{x: -0.586, y: 1.86}],
 [{x: 0.0432, y: -1.22}],
 [{x: -0.868, y: 1.79}, {x: 0.999, y: -1.16}, {x: -0.389, y: 2.32}],
 [{x: 0.73, y: 0.335}, {x: 0.946, y: -1.1}],
 [{x: -0.622, y: -1.26}],
 [{x: 1.94, y: -2.37}]]
--------------------------------------------------------------------------------
type: 1000 * var * {
    x: float64,
    y: float64
}

Whereas there was originally an average of 3.5 bunnies per yard, by construction,

ak.mean(ak.num(bunnies, axis=1))

3.527

Now there’s only

ak.mean(ak.num(bunnies[~bunny_eaten], axis=1))

2.557

left.

Asymmetry in the problem

The way we performed this calculation was asymmetric: for each bunny, we asked if it was eaten. We could have performed a similar, but different, calculation to ask, “Which foxes get to eat?” To do that, we must reverse the order of arguments because nested=True groups from the left: it makes one inner list per item of the first array.

pair_foxes, pair_bunnies = ak.unzip(ak.cartesian([foxes, bunnies], nested=True))

distances = np.sqrt((pair_foxes.x - pair_bunnies.x)**2 + (pair_foxes.y - pair_bunnies.y)**2)

eating = (distances < 1)

fox_eats = ak.any(eating, axis=-1)

foxes[fox_eats]

[[{x: 0.183, y: -1.17}],
 [{x: 0.497, y: 0.761}],
 [{x: -1.06, y: 0.0902}],
 [{x: -0.106, y: -1.28}],
 [{x: 0.338, y: 0.642}],
 [{x: -0.163, y: 0.496}, {x: -0.548, y: -0.546}],
 [{x: 0.89, y: 0.67}, {x: -1.24, y: 0.543}, {x: 0.0186, y: -0.154}],
 [{x: 0.411, y: 1.23}, {x: -0.51, y: 0.931}, {x: 0.953, y: 1.18}],
 [],
 [{x: -0.553, y: 0.339}],
 ...,
 [{x: 1.12, y: -0.0269}],
 [],
 [],
 [],
 [{x: -1.51, y: 0.912}],
 [{x: -0.665, y: -0.175}],
 [{x: -0.401, y: -1.2}],
 [],
 [{x: 1.35, y: 0.905}]]
--------------------------------------------------------------------
type: 1000 * var * {
    x: float64,
    y: float64
}