Group Data in Time Intervals — hsGroupByInterval • kwb.base

Groups rows that belong to the same time interval and aggregates the values within each group using a given function (e.g. sum, mean, min, max).

Usage

hsGroupByInterval(
  data,
  interval,
  FUN,
  tsField = names(data)[1],
  offset1 = 0,
  offset2 = interval/2,
  limits = FALSE,
  ...,
  dbg = FALSE
)

Arguments

data: data frame containing a timestamp field and data fields to be aggregated over time.
interval: length of time interval in seconds
FUN: function used to aggregate the values within one and the same interval, e.g. sum, mean, min, max
tsField: name of timestamp column, default: name of first column
offset1: number of seconds by which all timestamps are shifted before they are grouped into intervals. Grouping is done by dividing the timestamps (converted to the number of seconds since 1970-01-01) by the interval length and taking the integer part of the quotient as the interval number. Thus, with offset1 = 0 and an interval length of e.g. 60 seconds, the first interval runs from 00:00:00 to 00:00:59, the second from 00:01:00 to 00:01:59, etc., whereas offset1 = 30 in this case would lead to the intervals 00:00:30 to 00:01:29, 00:01:30 to 00:02:29, etc.
offset2: value given in seconds determining which of the timestamps in an interval represents the interval in the output. If 0, each time interval is represented by the smallest timestamp belonging to the interval. By default, offset2 is half the interval length, meaning that each time interval is represented by the timestamp in the middle of the interval.
limits: if TRUE, two additional columns (t.beg and t.end in the examples) are added, showing the first and the last timestamp of each interval
...: further arguments passed to aggregate, the internally called function
dbg: if TRUE, debug messages are shown
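
The interval arithmetic described for offset1 and offset2 can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used by hsGroupByInterval: the helper name intervalLimits is hypothetical, the time zone is fixed to UTC, and the sign with which offset1 is applied is assumed (for an offset of half the interval length, as in the examples, both sign conventions give the same intervals).

```r
# Hypothetical helper, not part of kwb.base: reproduces the interval
# arithmetic described for offset1 and offset2.
intervalLimits <- function(times, interval, offset1 = 0, offset2 = interval / 2)
{
  # Timestamps as seconds since 1970-01-01
  secs <- as.numeric(times)

  # Interval number: integer part of the shifted timestamp divided by the
  # interval length (assumption: offset1 is subtracted before dividing)
  number <- floor((secs - offset1) / interval)

  # First second of each interval
  begin <- number * interval + offset1

  toPosix <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")

  data.frame(
    t = toPosix(begin + offset2),          # timestamp representing the interval
    t.beg = toPosix(begin),                # first timestamp of the interval
    t.end = toPosix(begin + interval - 1)  # last timestamp of the interval
  )
}

times <- as.POSIXct("2012-01-01 12:00:00", tz = "UTC") + c(0, 60, 150, 299, 300)

# 5-minute intervals starting at full multiples of five minutes
intervalLimits(times, interval = 300)

# Interval limits shifted by 2.5 minutes
intervalLimits(times, interval = 300, offset1 = 150)
```

For the first timestamp (12:00:00), the sketch yields an interval beginning at 12:00:00 with offset1 = 0 and at 11:57:30 with offset1 = 150, in line with the t.beg columns shown in the Examples section.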

Examples


## Get an example time series with one value per minute
step <- 60
df <- hsExampleTSeries(step)

## Calculate 5-min-means with
## offset1 = 0 (default), offset2 = interval/2 (default)
df.agg1 <- hsGroupByInterval(df, interval = 5*step, mean, limits = TRUE)
df.agg1
#>                     t               t.beg               t.end             y
#> 1 2012-01-01 12:02:30 2012-01-01 12:00:00 2012-01-01 12:04:59  5.313752e-01
#> 2 2012-01-01 12:07:30 2012-01-01 12:05:00 2012-01-01 12:09:59  7.313752e-01
#> 3 2012-01-01 12:12:30 2012-01-01 12:10:00 2012-01-01 12:14:59 -5.313752e-01
#> 4 2012-01-01 12:17:30 2012-01-01 12:15:00 2012-01-01 12:19:59 -7.313752e-01
#> 5 2012-01-01 12:22:30 2012-01-01 12:20:00 2012-01-01 12:24:59 -2.449213e-16

## Shift the interval limits with
## offset1 = 2.5*60, offset2 = interval/2 (default)
df.agg2 <- hsGroupByInterval(df, interval = 5*step, mean, limits = TRUE,
                             offset1 = 2.5*step)
df.agg2
#>                     t               t.beg               t.end             y
#> 1 2012-01-01 12:00:00 2012-01-01 11:57:30 2012-01-01 12:02:29  2.989341e-01
#> 2 2012-01-01 12:05:00 2012-01-01 12:02:30 2012-01-01 12:07:29  9.040294e-01
#> 3 2012-01-01 12:10:00 2012-01-01 12:07:30 2012-01-01 12:12:29  1.910147e-16
#> 4 2012-01-01 12:15:00 2012-01-01 12:12:30 2012-01-01 12:17:29 -9.040294e-01
#> 5 2012-01-01 12:20:00 2012-01-01 12:17:30 2012-01-01 12:22:29 -2.989341e-01

## Shift the timestamps representing the intervals with
## offset1 = 0, offset2 = 0
df.agg3 <- hsGroupByInterval(df, interval = 5*step, mean, limits = TRUE,
                             offset1 = 0, offset2 = 0)
df.agg3
#>                     t               t.beg               t.end             y
#> 1 2012-01-01 12:00:00 2012-01-01 12:00:00 2012-01-01 12:04:59  5.313752e-01
#> 2 2012-01-01 12:05:00 2012-01-01 12:05:00 2012-01-01 12:09:59  7.313752e-01
#> 3 2012-01-01 12:10:00 2012-01-01 12:10:00 2012-01-01 12:14:59 -5.313752e-01
#> 4 2012-01-01 12:15:00 2012-01-01 12:15:00 2012-01-01 12:19:59 -7.313752e-01
#> 5 2012-01-01 12:20:00 2012-01-01 12:20:00 2012-01-01 12:24:59 -2.449213e-16

## Show a plot demonstrating the effect of offset1 and offset2
if (FALSE) {
demoGroupByInterval(df)
}

## Handling NA values...

## Set y to NA at 2 random positions
df[sample(nrow(df), 2), 2] <- NA
df ## Let's have a look at df
#>                      t             y
#> 1  2012-01-01 12:00:00  0.000000e+00
#> 2  2012-01-01 12:01:00  3.090170e-01
#> 3  2012-01-01 12:02:00  5.877853e-01
#> 4  2012-01-01 12:03:00  8.090170e-01
#> 5  2012-01-01 12:04:00  9.510565e-01
#> 6  2012-01-01 12:05:00  1.000000e+00
#> 7  2012-01-01 12:06:00  9.510565e-01
#> 8  2012-01-01 12:07:00  8.090170e-01
#> 9  2012-01-01 12:08:00  5.877853e-01
#> 10 2012-01-01 12:09:00  3.090170e-01
#> 11 2012-01-01 12:10:00  1.224606e-16
#> 12 2012-01-01 12:11:00 -3.090170e-01
#> 13 2012-01-01 12:12:00 -5.877853e-01
#> 14 2012-01-01 12:13:00 -8.090170e-01
#> 15 2012-01-01 12:14:00            NA
#> 16 2012-01-01 12:15:00 -1.000000e+00
#> 17 2012-01-01 12:16:00 -9.510565e-01
#> 18 2012-01-01 12:17:00 -8.090170e-01
#> 19 2012-01-01 12:18:00 -5.877853e-01
#> 20 2012-01-01 12:19:00 -3.090170e-01
#> 21 2012-01-01 12:20:00            NA

## Count NA values per group
hsGroupByInterval(df, interval = 300, function(x){sum(is.na(x))})
#>                     t y
#> 1 2012-01-01 12:02:30 0
#> 2 2012-01-01 12:07:30 0
#> 3 2012-01-01 12:12:30 1
#> 4 2012-01-01 12:17:30 0
#> 5 2012-01-01 12:22:30 1

## default behaviour: mean(values containing at least one NA) = NA
hsGroupByInterval(df, interval = 300, mean)
#>                     t          y
#> 1 2012-01-01 12:02:30  0.5313752
#> 2 2012-01-01 12:07:30  0.7313752
#> 3 2012-01-01 12:12:30         NA
#> 4 2012-01-01 12:17:30 -0.7313752
#> 5 2012-01-01 12:22:30         NA

## ignore NA values by passing na.rm = TRUE on to the aggregation function (here: mean)
hsGroupByInterval(df, interval = 300, mean, na.rm = TRUE)
#>                     t          y
#> 1 2012-01-01 12:02:30  0.5313752
#> 2 2012-01-01 12:07:30  0.7313752
#> 3 2012-01-01 12:12:30 -0.4264548
#> 4 2012-01-01 12:17:30 -0.7313752
#> 5 2012-01-01 12:22:30        NaN
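
The NaN in the last row arises because the only value in that interval is NA: once it is removed, mean has nothing left to average. The following base R sketch shows this and one possible way to return NA instead. The wrapper meanOrNA is a hypothetical name introduced here for illustration, not a function of kwb.base.

```r
x <- NA_real_          # a group whose only value is missing

mean(x)                # NA: missing values propagate by default
mean(x, na.rm = TRUE)  # NaN: no values remain after removing the NA

# Hypothetical wrapper: mean of the non-missing values, or NA if none remain
meanOrNA <- function(x, ...)
{
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else mean(x, ...)
}

meanOrNA(x)             # NA instead of NaN
meanOrNA(c(1, NA, 3))   # 2

# Possible use as the aggregation function (requires kwb.base and df):
# hsGroupByInterval(df, interval = 300, meanOrNA)
```

Whether NaN or NA is preferable depends on how downstream code treats the two. Note that is.na() is TRUE for both, while is.nan() is TRUE only for NaN.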