xarray logo

Xarray Indexes: Exciting new ways to slice and dice your data!

Monday, August 11th, 2025 (3 days ago)



TL;DR: xarray>=2025.6 has been through a major refactoring of its internals that makes coordinate-based data selection and alignment customizable, enabling more efficient handling of both traditional and more exotic data structures. In this post we highlight a few examples that take advantage of this new superpower!

Summary schematic from Deepak Cherian's 2025 SciPy Presentation highlighting new custom Indexes and usecases. Link to full slide deck

Indexing basics#

First thing's first, what is an index and why is it helpful?

Examples of indexes are all around you and are a fundamental way to organize and simplify access to information. In the United States, if you want a book about Natural Sciences, you can go to your local library branch and head straight to section 500. Or if you're in the mood for a classic novel go to section 800. Connecting thematic labels with numbers ({'Natural Sciences': 500, 'Literature': 800}) is a classic indexing system that's been around for hundreds of years (Dewey Decimal System, 1876). The need for an index becomes critical as the size of data grows - just imagine the time it would take to find a specific novel amongst a million uncategorized books!

The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurements Y=[10,20,30,40,50,60] at six coordinate positions X=[1, 2, 4, 8, 16, 32]. What was our measurement at X=8? To answer this in code, we need an index that is simply a key:value mapping or "hash table" between the coordinate values and integer positions i=[0,1,2,3,4,5] in the coordinates array. With only 6 coordinates, we easily see X[3]=8 so our measurement of interest is Y[3]=40.

pandas.Index#

Xarray's label-based selection allows a more expressive and simple syntax in which you don't have to think about the index (da.sel(x=8) = 40). Up until now, Xarray has relied exclusively on pandas.Index, which is still used by default:

1x = np.array([1, 2, 4, 8, 16, 32])
2y = np.array([10, 20, 30, 40, 50, 60])
3da = xr.DataArray(y, coords={'x': x})
4da
5
Loading data...
1da.sel(x=8)
2# 40
3

Alternatives to pandas.Index#

Importantly, a loop over all the coordinate values is not the only way to create an index. You might recognize that our coordinates can in fact be represented by a function X(i)=2**i where i is the integer position! Given that function we can quickly get measurement values at any position: Y(X=8) = Y[log2(8)] = Y[3]=40. Xarray now has a CoordinateTransformIndex to handle this type of on-demand lookup of coordinate positions!

xarray RangeIndex#

A simple special case of CoordinateTransformIndex is a RangeIndex where coordinates can be defined by a start, stop, and uniform step size. pandas.RangeIndex only supports integers, whereas Xarray handles floating-point values. Coordinate look-up is performed on-the-fly rather than loading all values into memory up-front when creating a Dataset, which is critical for the example below that has a coordinate array of 7TB!

1from xarray.indexes import RangeIndex
2
3index = RangeIndex.arange(0.0, 1000.0, 1e-9, dim='x') # 7TB coordinate array!
4ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index))
5ds
6
Loading data...

Selection preserves the RangeIndex and does not require loading all the coordinates into memory.

sliced = ds.isel(x=slice(1_000, 50_000, 100))
sliced.x
Loading data...

Third-party custom Indexes#

In addition to a few new built-in indexes, xarray.Index provides an API that allows dealing with coordinate data and metadata in a highly customizable way for the most common Xarray operations such as selalignconcatstack. This is a powerful extension mechanism that is very important for supporting a multitude of domain-specific data structures.

rasterix RasterIndex#

Earlier we mentioned that coordinates often have a functional representation. For 2D raster images, this function often takes the form of an Affine Transform. The rasterix library extends Xarray with a RasterIndex which computes coordinates for geospatial images such as GeoTiffs via Affine Transform.

Below is a simple example of slicing a large mosaic of GeoTiffs without ever loading the coordinates into memory, note that a new Affine is defined after the slicing operation:

1# 811816322401 values!
2import rasterix
3
4#26475 GeoTiffs represented by a GDAL VRT
5da = xr.open_dataarray('https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt',
6                       engine='rasterio',
7                       parse_coordinates=False).squeeze().pipe(
8    rasterix.assign_index
9)
10da
11
Loading data...
1print('Original geotransform:\n', da.xindexes['x'].transform())
2da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1,-49.0))
3print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform())
4
Original geotransform:
 | 0.00, 0.00,-180.00|
| 0.00,-0.00, 84.00|
| 0.00, 0.00, 1.00|

Sliced geotransform:
 | 0.00, 0.00,-122.40|
| 0.00,-0.00,-47.10|
| 0.00, 0.00, 1.00|

XProj CRSIndex#

We often think about metadata providing context for measurement values but metadata is also critical for coordinates! In particular, to align two different datasets we must ask if the coordinates are in the same coordinate system. In other words, do they share the same origin and scale?

There are currently over 7000 commonly used Coordinate Reference Systems (CRS) for geospatial data in the authoritative EPSG database! And of course an infinite number of custom-defined CRSs. xproj.CRSIndex gives Xarray objects an automatic awareness of the coordinate reference system operations like xr.align(), which can raise an informative error when there is a CRS mismatch:

1from xproj import CRSIndex
2lons1 = np.arange(-125, -120, 1)
3lons2 = np.arange(-122, -118, 1)
4ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267)
5ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326)
6ds1 + ds2
7
1MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs'
2

XVec GeometryIndex#

A "vector data cube" is an n-D array that has at least one dimension indexed by a 2-D array of vector geometries. Large vector cubes can take advantage of an R-tree spatial index for efficiently selecting vector geometries within a given bounding box. The xvec.GeometryIndex provides this functionality, below is a short code snippet but please refer to the documentation for more!

1import xvec
2import geopandas as gpd
3from geodatasets import get_path
4
5# Dataset that contains demographic data indexed by U.S. counties
6counties = gpd.read_file(get_path("geoda.natregimes"))
7
8cube = xr.Dataset(
9    data_vars=dict(
10        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
11        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
12    ),
13    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
14).xvec.set_geom_indexes("county", crs=counties.crs)
15cube
16
Loading data...
1# Efficient selection using shapely.STRtree
2from shapely.geometry import box
3
4subset = cube.xvec.query(
5    "county",
6    box(-125.4, 40, -120.0, 50),
7    predicate="intersects",
8)
9
10subset['population'].xvec.plot(col='year');
11

Even more examples!#

Be sure to check out the Gallery of Custom Index Examples for more detailed examples of all the indexes mentioned in this post and more!

What's next?#

While we're extremely excited about what can already be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work.

Have an idea for your own custom index? Check out this section of the Xarray documentation.

Acknowledgments#

This work would not have been possible without technical input from the Xarray core team and community! Several developers received essential funding from a CZI Essential Open Source Software for Science (EOSS) grant as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345.

Back to Blog

xarray logo

© 2025, Xarray core developers. Apache 2.0 Licensed.

caf7663

TwitterGitHubYouTubeBlog RSS Feed

Xarray

DashboardBlogTeamCiting XarrayRoadmapBrand Assets
Powered by Vercel