Merge arrays with missing data
Claire Carouge, CLEX CMS
Suppose you have two datasets that come from different sources but represent the same quantity. You would like to merge them into a single dataset by averaging, but both have missing data at different times and places. The merged dataset should therefore follow these rules:
if both original datasets have data, we take the mean of the two values
if only one dataset has data, we take that value
if the data is missing in both original datasets, the merged value stays missing
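As a point of comparison, the three rules can be written out literally. The following is only a sketch on two small sample arrays; the names `both` and `merged` are illustrative. It uses `xr.where()` to pick the mean where both arrays have data, and `fillna()` to fall back on whichever array has a value otherwise.

```python
import numpy as np
import xarray as xr

# Two small sample arrays with missing data in different places
aa = xr.DataArray([[0, 1, 2], [3, 4, np.nan]], dims=("x", "y"))
bb = xr.DataArray([[5, np.nan, 6], [np.nan, 7, np.nan]], dims=("x", "y"))

# Rule 1: points where both arrays have data
both = aa.notnull() & bb.notnull()

# Rule 1 -> mean of the two values.
# Rules 2 and 3 -> aa.fillna(bb) takes aa where present, otherwise bb;
# it stays NaN where both are missing.
merged = xr.where(both, (aa + bb) / 2, aa.fillna(bb))
print(merged.values)
```

This works for two arrays but becomes awkward with more sources, which is one reason to prefer a NaN-aware mean over a stacked array.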
The strategy using xarray is to open each dataset in a DataArray, concatenate both arrays on a new dimension and then average along this dimension.
import xarray as xr
import numpy as np
First, define two arrays with the same dimensions and with missing data in different places:
aa = xr.DataArray([[0,1,2],[3,4,np.nan]],dims=('x','y'))
bb = xr.DataArray([[5,np.nan,6],[np.nan,7,np.nan]],dims=('x','y'))
aa
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.,  1.,  2.],
       [ 3.,  4., nan]])
Dimensions without coordinates: x, y
bb
<xarray.DataArray (x: 2, y: 3)>
array([[ 5., nan,  6.],
       [nan,  7., nan]])
Dimensions without coordinates: x, y
Now, if we simply add the arrays together, we do not get what we want: the missing value takes precedence. That is, if either array has a missing value at a point, the sum at that point is missing too. So summing and dividing by the number of arrays won’t work.
aa+bb
<xarray.DataArray (x: 2, y: 3)>
array([[ 5., nan,  8.],
       [nan, 11., nan]])
Dimensions without coordinates: x, y
In contrast, taking a mean will work, because xarray's mean skips missing values by default (mean(1, nan) = 1). To take that mean, we first need to combine the arrays into a single array. For this we’ll use the xarray.concat() function.
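The difference between the two behaviours can be seen on a single pair of values. This is a minimal sketch using plain NumPy rather than xarray; `np.nanmean()` stands in here for a NaN-aware mean, and the variable names are illustrative.

```python
import numpy as np

# Ordinary arithmetic: NaN propagates, so sum-then-divide loses the valid value
naive = (1.0 + np.nan) / 2
print(naive)  # nan

# NaN-aware mean: the missing value is ignored and not counted
aware = np.nanmean([1.0, np.nan])
print(aware)  # 1.0

# With two valid values, both approaches agree
print((1.0 + 3.0) / 2, np.nanmean([1.0, 3.0]))  # 2.0 2.0
```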
Concatenate the arrays along a new dimension we’ll call z
cc = xr.concat((aa,bb),'z')
cc
<xarray.DataArray (z: 2, x: 2, y: 3)>
array([[[ 0.,  1.,  2.],
        [ 3.,  4., nan]],

       [[ 5., nan,  6.],
        [nan,  7., nan]]])
Dimensions without coordinates: z, x, y
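As the output above shows, the concatenation stacks the two arrays along the new dimension z. Now we take advantage of the way xarray handles missing data: by default, a mean over floating-point data skips missing values, so only the available data contribute. Where every value along z is missing, the result stays missing.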
cc.mean(dim='z')
<xarray.DataArray (x: 2, y: 3)>
array([[2.5, 1. , 4. ],
       [3. , 5.5, nan]])
Dimensions without coordinates: x, y
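To see what the NaN-skipping mean does, it can be broken down into a sum and a count of valid values. The sketch below rebuilds the arrays so it runs on its own; `n_valid`, `total`, `manual_mean` and `strict_mean` are illustrative names. It also shows `skipna=False`, which switches the skipping off and brings back the behaviour of the plain sum.

```python
import numpy as np
import xarray as xr

aa = xr.DataArray([[0, 1, 2], [3, 4, np.nan]], dims=("x", "y"))
bb = xr.DataArray([[5, np.nan, 6], [np.nan, 7, np.nan]], dims=("x", "y"))
cc = xr.concat((aa, bb), dim="z")

# Number of non-missing values available at each (x, y) point: 2, 1 or 0
n_valid = cc.count(dim="z")

# Sum of the non-missing values (an all-NaN point sums to 0)
total = cc.sum(dim="z")

# Divide by the count; mask points with no data so they stay missing
manual_mean = total / n_valid.where(n_valid > 0)
print(manual_mean.values)

# With skipna=False, a single NaN along z makes the result NaN
strict_mean = cc.mean(dim="z", skipna=False)
print(strict_mean.values)
```

The count is also useful on its own, for example to flag points where the merged value rests on a single source.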
Usually you would see these last two operations combined, since you don’t need to store the result of the concat operation.
xr.concat((aa,bb),'z').mean(dim='z')
<xarray.DataArray (x: 2, y: 3)>
array([[2.5, 1. , 4. ],
       [3. , 5.5, nan]])
Dimensions without coordinates: x, y
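The same one-liner extends naturally to more than two sources. The sketch below wraps it in a small helper; `merge_mean`, the dimension name `source` and the third array `dd` are hypothetical and introduced only for illustration. It assumes all arrays share the same dimensions, as in the example above.

```python
import numpy as np
import xarray as xr


def merge_mean(*arrays, dim="source"):
    """Hypothetical helper: NaN-skipping mean of DataArrays with the same dimensions."""
    # Stack the inputs along a new dimension, then average along it
    return xr.concat(arrays, dim=dim).mean(dim=dim)


aa = xr.DataArray([[0, 1, 2], [3, 4, np.nan]], dims=("x", "y"))
bb = xr.DataArray([[5, np.nan, 6], [np.nan, 7, np.nan]], dims=("x", "y"))
# A third, made-up source that fills the point missing in both aa and bb
dd = xr.DataArray([[np.nan, 2, np.nan], [3, np.nan, 9]], dims=("x", "y"))

merged2 = merge_mean(aa, bb)
merged3 = merge_mean(aa, bb, dd)
print(merged3.values)
```

With three sources, each point is averaged over however many sources have data there, so the last point now takes its value from `dd` alone.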