Reorganizing observational dataset with new dimensions #10708

rmshkv · 2025-09-05T21:13:02Z

Hi!

I've been working on bringing an observational oceanographic dataset into xarray and am having trouble getting the dimensions right. It starts out as a pandas DataFrame with one row per sample, identified by sample number ("unique_ID"). Each row carries metadata giving the depth and station number at which the sample was collected, and there are multiple replicate samples for a given (depth, station number) combination. Each row has a number of data columns unique to that sample, plus other values that are unique to the (depth, station number) combination but shared between replicate samples (e.g. salinity, which is measured separately from the samples but copied to match the rows). Additionally, certain columns "re-label" the depth and station number dimensions: potential density for the depth values, and lat and lon values for the station numbers.

When I bring this into xarray, I can set the sample number ("unique_ID") as the index, which becomes the single dimension for the dataset. I then use set_coords() to set the station number and depth as coordinates:

<xarray.Dataset> Size: 266kB
Dimensions:                 (unique_ID: 665)
Coordinates:
    station_ID              (unique_ID) int64 5kB 2 2 2 2 2 2 ... 50 50 50 50 50
  * unique_ID               (unique_ID) object 5kB 1 2 3 4 5 ... 681 682 683 684
    depth                   (unique_ID) object 5kB 130.0 130.0 100.0 ... 0.0 0.0
Data variables: (12/47)
...
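
A minimal sketch of the setup described above, using a small synthetic table. Only unique_ID, station_ID, and depth come from the post; the nitrate and salinity columns and all of the values are made up for illustration.

```python
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real table. "nitrate" is a hypothetical
# per-sample measurement; "salinity" is a hypothetical value shared by
# all replicates at the same (station_ID, depth).
df = pd.DataFrame(
    {
        "unique_ID": [1, 2, 3, 4, 5, 6],
        "station_ID": [2, 2, 2, 2, 3, 3],
        "depth": [0.0, 0.0, 10.0, 10.0, 0.0, 0.0],
        "nitrate": [1.0, 1.2, 2.0, 2.2, 0.8, 1.0],
        "salinity": [35.0, 35.0, 35.4, 35.4, 34.8, 34.8],
    }
)

# The sample number becomes the single dimension of the Dataset.
ds = df.set_index("unique_ID").to_xarray()

# Promote station number and depth from data variables to coordinates.
# They still vary along unique_ID; they are not dimensions yet.
ds = ds.set_coords(["station_ID", "depth"])
print(ds)
```

At this point station_ID and depth are non-index coordinates along unique_ID, which matches the repr shown above.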

This allows me to do a groupby mean operation on (depth, station number), which gives me a dataset with depth and station number as coordinates and the means of all the sample values. However, it drops all the data that is common to each (depth, station number) combination (e.g. the potential densities, which correspond to depths but differ between stations, or the salinities, which are collected for each (depth, station number) but don't actually correspond to sample numbers).

ds_mean = ds.groupby(group=["station_ID", "depth"]).mean()

<xarray.Dataset> Size: 105kB
Dimensions:       (station_ID: 18, depth: 66)
Coordinates:
  * station_ID    (station_ID) int64 144B 2 3 4 7 8 9 20 ... 26 46 47 48 49 50
  * depth         (depth) float64 528B 0.0 10.0 20.0 ... 4.1e+03 4.3e+03 4.5e+03
Data variables:
    run_date      (station_ID, depth) float64 10kB 2.41e+05 2.41e+05 ... nan nan
...

My question is how do I reorganize this dataset so that I keep all the data variables when I do a groupby operation like this? Do I need to set station_ID and depth as dimensions before doing the groupby, and then make the data variables that only depend on those be indexed on those instead of on unique_ID alone? What functions would I use to do those steps?

Thanks in advance, and please let me know if I can clarify anything!


dcherian (Maintainer) · 2025-09-12T15:44:54Z

Hi @rmshkv !

> the potential densities which correspond to depths but are different between stations, or salinities which are collected for each (depth, station number) but don't actually correspond to sample numbers

IIUC, unique_ID here is basically a pandas MultiIndex, so I'd explore constructing a MultiIndex on station_ID and depth and unstacking it.

To help more, we'd need a sample dataset (synthetic is fine).
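
One way the MultiIndex-and-unstack idea might look in xarray, as a sketch on synthetic data rather than a confirmed recipe for the real dataset. The salinity and nitrate columns are hypothetical. Because replicate samples repeat the same (station_ID, depth) pair, the sketch only unstacks the variables that are shared between replicates and drops the duplicate entries first, since unstacking requires unique index entries.

```python
import pandas as pd
import xarray as xr

# Synthetic data; "nitrate" and "salinity" are hypothetical column names.
df = pd.DataFrame(
    {
        "unique_ID": [1, 2, 3, 4, 5, 6],
        "station_ID": [2, 2, 2, 2, 3, 3],
        "depth": [0.0, 0.0, 10.0, 10.0, 0.0, 0.0],
        "nitrate": [1.0, 1.2, 2.0, 2.2, 0.8, 1.0],
        "salinity": [35.0, 35.0, 35.4, 35.4, 34.8, 34.8],
    }
)
ds = df.set_index("unique_ID").to_xarray().set_coords(["station_ID", "depth"])

# Keep only the variables that are constant across replicates.
shared = ds[["salinity"]]

# Replace the sample-number index with a MultiIndex built from the
# station_ID and depth coordinates.
shared = shared.set_index(unique_ID=["station_ID", "depth"])

# Replicates produce repeated (station_ID, depth) entries; keep the first
# of each so the MultiIndex is unique before unstacking.
shared = shared.drop_duplicates("unique_ID")

# Unstack the MultiIndex into two real dimensions.
shared_2d = shared.unstack("unique_ID")
print(shared_2d)
```

Combinations that were never sampled (here station 3 at 10 m) come out as NaN after unstacking.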

1 reply

rmshkv Sep 16, 2025
Author

Hi Deepak,

Thanks! That's actually pretty much what I ended up doing. To record it in case it's useful to anyone else: it ended up being more straightforward to do this in pandas before converting to xarray. I set station_ID and depth as a pandas MultiIndex on my metadata dataset, then converted that to xarray. From there, the dimensions were the same as those of my data, so I was able to merge the two xarray datasets (metadata and data) and get what I wanted.
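
The pandas-first approach described above could be sketched roughly as follows. This is an illustration on synthetic data, not the author's actual code: the nitrate and salinity columns are hypothetical, and the replicate means are computed with a pandas groupby here as a stand-in for the xarray groupby used earlier in the thread.

```python
import pandas as pd
import xarray as xr

# Synthetic data; "nitrate" and "salinity" are hypothetical column names.
df = pd.DataFrame(
    {
        "unique_ID": [1, 2, 3, 4, 5, 6],
        "station_ID": [2, 2, 2, 2, 3, 3],
        "depth": [0.0, 0.0, 10.0, 10.0, 0.0, 0.0],
        "nitrate": [1.0, 1.2, 2.0, 2.2, 0.8, 1.0],
        "salinity": [35.0, 35.0, 35.4, 35.4, 34.8, 34.8],
    }
)
keys = ["station_ID", "depth"]

# Metadata shared by replicates: keep one row per (station_ID, depth),
# index it with a pandas MultiIndex, and let to_xarray() turn the two
# index levels into dimensions.
meta = df[keys + ["salinity"]].drop_duplicates().set_index(keys).to_xarray()

# Per-sample measurements averaged over replicates, on the same two dimensions.
means = df.groupby(keys)[["nitrate"]].mean().to_xarray()

# Both datasets now share (station_ID, depth), so they can be merged.
merged = xr.merge([means, meta])
print(merged)
```

The drop_duplicates() step assumes the shared columns really are identical across replicates. If they are not, set_index(keys) yields a non-unique MultiIndex and the conversion to xarray will not produce a clean (station_ID, depth) grid.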

Answer selected by dcherian
