Basic event statistics#
Scott Wales, CLEX CMS
We have a set of timeseries from multiple models, and we’d like to compute some basic statistics on where the values exceed some threshold
import numpy
import xarray
import pandas
Sample data#
Some random data as a sample, we have dimensions of ‘model’ and ‘time’
da = xarray.DataArray(numpy.random.random((5,10))-0.5, dims=['model','time'])
da['time'] = pandas.date_range('20010101', periods=da.sizes['time'], freq='MS')
da['model'] = ['A','B','C','D','E']
da
<xarray.DataArray (model: 5, time: 10)> array([[-0.32756177, -0.26286746, 0.40030123, 0.35854755, 0.43113588, 0.19164228, -0.24890014, -0.16822426, 0.0288118 , 0.30851645], [ 0.02352896, 0.01700063, 0.40107743, 0.09688021, -0.34731748, 0.33646959, -0.22218415, 0.00075536, 0.41923523, 0.46450296], [-0.36606797, 0.13691949, -0.24407689, 0.26593951, -0.36044558, -0.09755949, 0.23047747, -0.30123992, 0.49773007, -0.48146704], [-0.28529412, -0.4137749 , -0.49678032, -0.20963922, -0.37296851, 0.32198888, -0.48008865, -0.35185213, -0.19133728, 0.48969387], [ 0.35537076, -0.34188436, -0.46208259, 0.41417063, -0.07029039, -0.39533084, 0.21379516, -0.18512164, -0.43433092, -0.29407475]]) Coordinates: * time (time) datetime64[ns] 2001-01-01 2001-02-01 ... 2001-10-01 * model (model) <U1 'A' 'B' 'C' 'D' 'E'
- model: 5
- time: 10
- -0.3276 -0.2629 0.4003 0.3585 ... 0.2138 -0.1851 -0.4343 -0.2941
array([[-0.32756177, -0.26286746, 0.40030123, 0.35854755, 0.43113588, 0.19164228, -0.24890014, -0.16822426, 0.0288118 , 0.30851645], [ 0.02352896, 0.01700063, 0.40107743, 0.09688021, -0.34731748, 0.33646959, -0.22218415, 0.00075536, 0.41923523, 0.46450296], [-0.36606797, 0.13691949, -0.24407689, 0.26593951, -0.36044558, -0.09755949, 0.23047747, -0.30123992, 0.49773007, -0.48146704], [-0.28529412, -0.4137749 , -0.49678032, -0.20963922, -0.37296851, 0.32198888, -0.48008865, -0.35185213, -0.19133728, 0.48969387], [ 0.35537076, -0.34188436, -0.46208259, 0.41417063, -0.07029039, -0.39533084, 0.21379516, -0.18512164, -0.43433092, -0.29407475]])
- time(time)datetime64[ns]2001-01-01 ... 2001-10-01
array(['2001-01-01T00:00:00.000000000', '2001-02-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-04-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-06-01T00:00:00.000000000', '2001-07-01T00:00:00.000000000', '2001-08-01T00:00:00.000000000', '2001-09-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000'], dtype='datetime64[ns]')
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
Count below threshold for each model#
First select only the values below our threshold, then get the count of unmasked values along the time dimension
da.where(da < 0).count('time')
<xarray.DataArray (model: 5)> array([4, 2, 6, 8, 7]) Coordinates: * model (model) <U1 'A' 'B' 'C' 'D' 'E'
- model: 5
- 4 2 6 8 7
array([4, 2, 6, 8, 7])
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
Minimum values for each model#
The same idea, this time using ‘min’ instead of ‘count’
da.where(da < 0).min('time')
<xarray.DataArray (model: 5)> array([-0.32756177, -0.34731748, -0.48146704, -0.49678032, -0.46208259]) Coordinates: * model (model) <U1 'A' 'B' 'C' 'D' 'E'
- model: 5
- -0.3276 -0.3473 -0.4815 -0.4968 -0.4621
array([-0.32756177, -0.34731748, -0.48146704, -0.49678032, -0.46208259])
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
Date of minimum value for each model#
Start off by finding the index of the minimum value - argmin.
min_index = da.where(da < 0).argmin('time')
min_index
<xarray.DataArray (model: 5)> array([0, 4, 9, 2, 2]) Coordinates: * model (model) <U1 'A' 'B' 'C' 'D' 'E'
- model: 5
- 0 4 9 2 2
array([0, 4, 9, 2, 2])
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
These indices can be converted to time values all at once by using min_index
as an array index
da.time[min_index]
<xarray.DataArray 'time' (model: 5)> array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]') Coordinates: time (model) datetime64[ns] 2001-01-01 2001-05-01 ... 2001-03-01 * model (model) <U1 'A' 'B' 'C' 'D' 'E'
- model: 5
- 2001-01-01 2001-05-01 2001-10-01 2001-03-01 2001-03-01
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
- time(model)datetime64[ns]2001-01-01 ... 2001-03-01
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
There’s a “ghost” time axis in the result which might cause problems, let’s get rid of it. Our returned values aren’t time dependent
da.time[min_index].drop('time')
<xarray.DataArray 'time' (model: 5)> array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]') Coordinates: * model (model) <U1 'A' 'B' 'C' 'D' 'E'
- model: 5
- 2001-01-01 2001-05-01 2001-10-01 2001-03-01 2001-03-01
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
Putting it together#
It could be helpful to put all of the results into a single dataset
event_stats = xarray.Dataset({
'count': da.where(da < 0).count('time'),
'min': da.where(da < 0).min('time'),
'min_date': da.time[min_index].drop('time'),
})
event_stats
<xarray.Dataset> Dimensions: (model: 5) Coordinates: * model (model) <U1 'A' 'B' 'C' 'D' 'E' Data variables: count (model) int64 4 2 6 8 7 min (model) float64 -0.3276 -0.3473 -0.4815 -0.4968 -0.4621 min_date (model) datetime64[ns] 2001-01-01 2001-05-01 ... 2001-03-01
- model: 5
- model(model)<U1'A' 'B' 'C' 'D' 'E'
array(['A', 'B', 'C', 'D', 'E'], dtype='<U1')
- count(model)int644 2 6 8 7
array([4, 2, 6, 8, 7])
- min(model)float64-0.3276 -0.3473 ... -0.4968 -0.4621
array([-0.32756177, -0.34731748, -0.48146704, -0.49678032, -0.46208259])
- min_date(model)datetime64[ns]2001-01-01 ... 2001-03-01
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000', '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
That way you can e.g. get all the values for a specific model
event_stats.sel(model='A')
<xarray.Dataset> Dimensions: () Coordinates: model <U1 'A' Data variables: count int64 4 min float64 -0.3276 min_date datetime64[ns] 2001-01-01
- model()<U1'A'
array('A', dtype='<U1')
- count()int644
array(4)
- min()float64-0.3276
array(-0.32756177)
- min_date()datetime64[ns]2001-01-01
array('2001-01-01T00:00:00.000000000', dtype='datetime64[ns]')
Climtas event detection#
The climtas library has some functions for event detection - it will find runs where a condition is true for multiple times, then let you run a function to calculate statistics for each event
import climtas
events = climtas.event.find_events(da < 0, min_duration=2)
events
time | model | event_duration | |
---|---|---|---|
0 | 0 | 0 | 2 |
1 | 1 | 4 | 2 |
2 | 0 | 3 | 5 |
3 | 4 | 2 | 2 |
4 | 4 | 4 | 2 |
5 | 6 | 0 | 2 |
6 | 6 | 3 | 3 |
7 | 7 | 4 | 3 |
coords = climtas.event.event_coords(da, events)
def stat_func(event_data):
return {'min': event_data.min().values,
'min_date': event_data.time[event_data.argmin()].values}
minimum = climtas.event.map_events(da, events, stat_func)
coords.join(minimum)
model | time | event_duration | min | min_date | |
---|---|---|---|---|---|
0 | A | 2001-01-01 | 31 days | -0.32756177046349233 | 2001-01-01 |
1 | E | 2001-02-01 | 28 days | -0.4620825909697166 | 2001-03-01 |
2 | D | 2001-01-01 | 120 days | -0.49678032011641426 | 2001-03-01 |
3 | C | 2001-05-01 | 31 days | -0.36044557563610247 | 2001-05-01 |
4 | E | 2001-05-01 | 31 days | -0.395330843938117 | 2001-06-01 |
5 | A | 2001-07-01 | 31 days | -0.2489001360755888 | 2001-07-01 |
6 | D | 2001-07-01 | 62 days | -0.48008864559888853 | 2001-07-01 |
7 | E | 2001-08-01 | 61 days | -0.4343309190867375 | 2001-09-01 |