How to create arrays by “unflattening” or “grouping”

import awkward as ak
import pandas as pd
import numpy as np
from urllib.request import urlopen

Finding runs in an array

We often have an array of data that we wish to subdivide into groups that share a common label. Let’s imagine that we’re looking at NASA’s Earth Meteorite Landings dataset, and that we wish to find the largest meteorite in each classification. This is a groupby operation followed by a reduction.

First, we load the data:

with open("../data/y77d-th95.json", "rb") as f:
    landing = ak.from_json(f)
landing.fields
['name',
 'id',
 'nametype',
 'recclass',
 'mass',
 'fall',
 'year',
 'reclat',
 'reclong',
 'geolocation',
 ':@computed_region_cbhk_fwbd',
 ':@computed_region_nnqa_25f4']

In order to find the largest meteorite in each category, we must first group the entries by category. A groupby operation partitions the entire array into subgroups that share a particular label. To perform a groupby in Awkward Array, we first sort the array by the category:

landing_sorted_class = landing[ak.argsort(landing.recclass)]
landing_sorted_class
[{name: 'Acapulco', id: '10', nametype: 'Valid', recclass: 'Acapulcoite', ...},
 {name: 'Silistra', id: '55584', nametype: 'Valid', recclass: ..., ...},
 {name: 'Angra dos Reis (stone)', id: '2302', nametype: 'Valid', ...},
 {name: 'Aubres', id: '4893', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Bishopville', id: '5059', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Bustee', id: '5181', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Cumberland Falls', id: '5496', nametype: 'Valid', recclass: ..., ...},
 {name: 'Khor Temiki', id: '12299', nametype: 'Valid', recclass: ..., ...},
 {name: 'Mayo Belwa', id: '15451', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Norton County', id: '17922', nametype: 'Valid', recclass: ..., ...},
 ...,
 {name: 'Aire-sur-la-Lys', id: '425', nametype: 'Valid', recclass: ..., ...},
 {name: 'Lusaka', id: '14759', nametype: 'Valid', recclass: 'Unknown', ...},
 {name: 'Dyalpur', id: '7757', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Haverö', id: '11859', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Jalanash', id: '12068', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Lahrauli', id: '12433', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Novo-Urei', id: '17933', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Almahata Sitta', id: '48915', nametype: 'Valid', recclass: ..., ...},
 {name: 'Pontlyfni', id: '18865', nametype: 'Valid', recclass: ..., ...}]
--------------------------------------------------------------------------------
type: 1000 * {
    name: string,
    id: string,
    nametype: string,
    recclass: string,
    mass: ?string,
    fall: string,
    year: ?string,
    reclat: ?string,
    reclong: ?string,
    geolocation: ?{
        type: string,
        coordinates: var * float64
    },
    ":@computed_region_cbhk_fwbd": ?string,
    ":@computed_region_nnqa_25f4": ?string
}

This sorted array can be subdivided into sublists of the same category. To determine how long each of these sublists must be, Awkward provides the function ak.run_lengths(), which, as the name implies, finds the lengths of consecutive runs of equal values in an array, e.g.

ak.run_lengths([1, 1, 1, 3, 3, 2, 4, 4, 4])
[3,
 2,
 1,
 3]
---------------
type: 4 * int64
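
To make the semantics concrete, here is a minimal pure-Python sketch of the same computation built on itertools.groupby. The helper name run_lengths_py is hypothetical and is not part of Awkward Array; it only illustrates what ak.run_lengths() returns for a flat list.

```python
from itertools import groupby


def run_lengths_py(values):
    """Return the lengths of consecutive runs of equal values (illustrative helper)."""
    # groupby yields one (key, run) pair per maximal run of equal neighbours
    return [sum(1 for _ in run) for _, run in groupby(values)]


lengths_py = run_lengths_py([1, 1, 1, 3, 3, 2, 4, 4, 4])
print(lengths_py)  # [3, 2, 1, 3]
```

Equal values that are not adjacent produce separate runs, which is why the array is sorted by category before its run lengths are computed.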

The function does not accept an axis argument; Awkward Array only supports finding runs along the innermost axis (axis=-1) of the array. Let’s find the length of each category sublist using ak.run_lengths():

lengths = ak.run_lengths(landing_sorted_class.recclass)
lengths
[1,
 1,
 1,
 9,
 1,
 3,
 1,
 1,
 4,
 2,
 ...,
 21,
 2,
 2,
 1,
 34,
 2,
 5,
 1,
 1]
-----------------
type: 118 * int64

Dividing an array into sublists

Awkward Array provides an ak.unflatten() operation that adds a new dimension to an array, using either a single integer denoting the (regular) size of the dimension, or a list of integers representing the lengths of the sublists to create, e.g.

ak.unflatten(
    ["Do", "re", "mi", "fa", "so", "la"],
    [1, 2, 2, 1]
)
[['Do'],
 ['re', 'mi'],
 ['fa', 'so'],
 ['la']]
----------------------
type: 4 * var * string

If we pass an integer instead of a list of lengths, we get a regular array:

ak.unflatten(
    ["Do", "re", "mi", "fa", "so", "la"],
    2
)
[['Do', 're'],
 ['mi', 'fa'],
 ['so', 'la']]
--------------------
type: 3 * 2 * string
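
For readers who prefer to see the splitting spelled out, here is a rough pure-Python sketch of both calling conventions. The helper unflatten_py is hypothetical and is not an Awkward Array function; unlike ak.unflatten(), it returns plain lists and so cannot distinguish regular from variable-length dimensions.

```python
def unflatten_py(flat, counts):
    """Split `flat` into consecutive sublists (illustrative helper, not an Awkward API)."""
    if isinstance(counts, int):
        # a single positive integer means every sublist has that same length
        if counts <= 0 or len(flat) % counts != 0:
            raise ValueError("length of flat must be a multiple of counts")
        counts = [counts] * (len(flat) // counts)
    if sum(counts) != len(flat):
        raise ValueError("counts must add up to the length of flat")
    sublists, start = [], 0
    for n in counts:
        # take the next n items as one sublist
        sublists.append(flat[start:start + n])
        start += n
    return sublists


notes = ["Do", "re", "mi", "fa", "so", "la"]
print(unflatten_py(notes, [1, 2, 2, 1]))  # [['Do'], ['re', 'mi'], ['fa', 'so'], ['la']]
print(unflatten_py(notes, 2))             # [['Do', 're'], ['mi', 'fa'], ['so', 'la']]
```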

We can unflatten our sorted array using the run lengths of each classification, in order to finalise our groupby operation:

landing_by_class = ak.unflatten(
    landing_sorted_class, 
    lengths
)
landing_by_class
[[{name: 'Acapulco', id: '10', nametype: 'Valid', recclass: ..., ...}],
 [{name: 'Silistra', id: '55584', nametype: 'Valid', recclass: ..., ...}],
 [{name: 'Angra dos Reis (stone)', id: '2302', nametype: 'Valid', ...}],
 [{name: 'Aubres', id: '4893', nametype: 'Valid', recclass: ..., ...}, ...],
 [{name: "Sutter's Mill", id: '55529', nametype: 'Valid', recclass: 'C', ...}],
 [{name: 'Bells', id: '5005', nametype: 'Valid', recclass: 'C2-ung', ...}, ...],
 [{name: 'Ningqiang', id: '16981', nametype: 'Valid', recclass: 'C3-ung', ...}],
 [{name: 'Gujba', id: '11449', nametype: 'Valid', recclass: 'CBa', ...}],
 [{name: 'Alais', id: '448', nametype: 'Valid', recclass: 'CI1', ...}, ...],
 [{name: 'Karoonda', id: '12264', nametype: 'Valid', recclass: ..., ...}, ...],
 ...,
 [{name: 'Barcelona (stone)', id: '4944', nametype: 'Valid', ...}, ..., {...}],
 [{name: 'Cumulus Hills 04075', id: '32531', nametype: 'Valid', ...}, ...],
 [{name: 'Marjalahti', id: '15426', nametype: 'Valid', ...}, {...}],
 [{name: 'Rumuruti', id: '22782', nametype: 'Valid', recclass: 'R3.8-6', ...}],
 [{name: 'Andhara', id: '2294', nametype: 'Valid', recclass: ..., ...}, ...],
 [{name: 'Aire-sur-la-Lys', id: '425', nametype: 'Valid', ...}, {...}],
 [{name: 'Dyalpur', id: '7757', nametype: 'Valid', recclass: ..., ...}, ...],
 [{name: 'Almahata Sitta', id: '48915', nametype: 'Valid', recclass: ..., ...}],
 [{name: 'Pontlyfni', id: '18865', nametype: 'Valid', recclass: ..., ...}]]
--------------------------------------------------------------------------------
type: 118 * var * {
    name: string,
    id: string,
    nametype: string,
    recclass: string,
    mass: ?string,
    fall: string,
    year: ?string,
    reclat: ?string,
    reclong: ?string,
    geolocation: ?{
        type: string,
        coordinates: var * float64
    },
    ":@computed_region_cbhk_fwbd": ?string,
    ":@computed_region_nnqa_25f4": ?string
}

We can see the categories of this grouped array by pulling out the first item of each sublist:

landing_by_class.recclass[..., 0]
['Acapulcoite',
 'Achondrite-ung',
 'Angrite',
 'Aubrite',
 'C',
 'C2-ung',
 'C3-ung',
 'CBa',
 'CI1',
 'CK4',
 ...,
 'OC',
 'Pallasite',
 'Pallasite, PMG',
 'R3.8-6',
 'Stone-uncl',
 'Unknown',
 'Ureilite',
 'Ureilite-an',
 'Winonaite']
------------------
type: 118 * string

The above three steps:

  1. Sort the array

  2. Compute the length of runs within the sorted array

  3. Unflatten the sorted array by the run lengths

form a groupby operation.

Computing the mass of the largest meteorites

Now that we have grouped our meteorite landings by classification, we can find the largest mass meteorite in each group. If we look at the type of the array, we can see that the mass field is actually a string:

landing_by_class.type.show()
118 * var * {
    name: string,
    id: string,
    nametype: string,
    recclass: string,
    mass: ?string,
    fall: string,
    year: ?string,
    reclat: ?string,
    reclong: ?string,
    geolocation: ?{
        type: string,
        coordinates: var * float64
    },
    ":@computed_region_cbhk_fwbd": ?string,
    ":@computed_region_nnqa_25f4": ?string
}

Let’s convert it to a floating point number:

landing_by_class['mass'] = ak.strings_astype(landing_by_class.mass, np.float64)

Now we can find the index of the largest mass in each sublist. We’ll use keepdims=True in order to be able to use this array to index landing_by_class and pull out the corresponding record.

i_largest_mass = ak.argmax(landing_by_class.mass, axis=-1, keepdims=True)

Finding the largest meteorite is then a simple case of using i_largest_mass as an index, and flattening the result to drop the unneeded dimension:

largest_meteorite = ak.flatten(
    landing_by_class[i_largest_mass], 
    axis=1,
)
largest_meteorite
[{name: 'Acapulco', id: '10', nametype: 'Valid', recclass: 'Acapulcoite', ...},
 {name: 'Silistra', id: '55584', nametype: 'Valid', recclass: ..., ...},
 {name: 'Angra dos Reis (stone)', id: '2302', nametype: 'Valid', ...},
 {name: 'Norton County', id: '17922', nametype: 'Valid', recclass: ..., ...},
 {name: "Sutter's Mill", id: '55529', nametype: 'Valid', recclass: 'C', ...},
 {name: 'Tagish Lake', id: '23782', nametype: 'Valid', recclass: 'C2-ung', ...},
 {name: 'Ningqiang', id: '16981', nametype: 'Valid', recclass: 'C3-ung', ...},
 {name: 'Gujba', id: '11449', nametype: 'Valid', recclass: 'CBa', ...},
 {name: 'Orgueil', id: '18026', nametype: 'Valid', recclass: 'CI1', ...},
 {name: 'Karoonda', id: '12264', nametype: 'Valid', recclass: 'CK4', ...},
 ...,
 {name: 'Kushiike', id: '12381', nametype: 'Valid', recclass: 'OC', ...},
 {name: 'Mineo', id: '16696', nametype: 'Valid', recclass: 'Pallasite', ...},
 {name: 'Omolon', id: '18019', nametype: 'Valid', recclass: ..., ...},
 {name: 'Rumuruti', id: '22782', nametype: 'Valid', recclass: 'R3.8-6', ...},
 {name: 'Hatford', id: '11855', nametype: 'Valid', recclass: 'Stone-uncl', ...},
 None,
 {name: 'Novo-Urei', id: '17933', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Almahata Sitta', id: '48915', nametype: 'Valid', recclass: ..., ...},
 {name: 'Pontlyfni', id: '18865', nametype: 'Valid', recclass: ..., ...}]
--------------------------------------------------------------------------------
type: 118 * ?{
    name: string,
    id: string,
    nametype: string,
    recclass: string,
    fall: string,
    year: ?string,
    reclat: ?string,
    reclong: ?string,
    geolocation: ?{
        type: string,
        coordinates: var * float64
    },
    ":@computed_region_cbhk_fwbd": ?string,
    ":@computed_region_nnqa_25f4": ?string,
    mass: ?float64
}

Here are their names:

largest_meteorite.name
['Acapulco',
 'Silistra',
 'Angra dos Reis (stone)',
 'Norton County',
 "Sutter's Mill",
 'Tagish Lake',
 'Ningqiang',
 'Gujba',
 'Orgueil',
 'Karoonda',
 ...,
 'Kushiike',
 'Mineo',
 'Omolon',
 'Rumuruti',
 'Hatford',
 None,
 'Novo-Urei',
 'Almahata Sitta',
 'Pontlyfni']
--------------------------
type: 118 * ?string