# Feature engineering for time series data using Numba

Given a sequence of measurements `values: np.ndarray`

and observation times `times: np.ndarray`

, we want to engineer features for a machine learning model that captures the temporal trends and statistics over different temporal windows. Our data is irregularly sampled and have missing data.

Features representing the magnitude, dispersion, direction of change, and temporal trends are derived. Every time point is represented with 4 categories of features:

- Magnitude of the most recent observation within the last 6h for vital signs and 24h for laboratory measurements.
- Dispersion measured as the range over a short and long-time window.
- Direction of change (increasing, decreasing, no change) over a short and long-time window.
- Exponential moving averages (EMA) with varying decay rates that specify how the much impact each past observation has on the current mean. EMA features were calculated on the forward filled magnitudes and using an EMA algorithm specifically for irregularly sampled time series.

The engineered features include:

`dt`

: time elapsed since the measurement was made`val`

: most recent measurement, measurements are forward filled up to a maximum duration after`srng`

: range as \(\frac{x_{max}-x_{min}}{x_{max}+x_{min}}100\) over a*short window*`ssgn`

: sign of the change (-1 or +1) between the first and last measurement in a*short window*, windows without measurements are filled with 0`lrng`

: range as (max-min)/(max+min) * 100 over a*long window*`lsgn`

: sign of the change (-1 or +1) between the first and last measurement in a*long window*, windows without measurements are filled with 0`sema`

: slow exponential moving average, calculated after forward filling`fema`

: fast exponential moving average, calculated after forward filling

We also need to define the following settings for every variable:

`forward_fill_duration`

: duration in minutes to forward fill a missing variable`short_rolling_window_size`

: short rolling window size in minutes`long_rolling_window_size`

: long rolling window size in minutes`slow_ema_tau`

: decay rate for slow exponential moving average in minutes`fast_ema_tau`

: decay rate for fast exponential moving average in minutes

Some derived features have no missing data, including `dt`

, `ssgn`

, `srng`

, `lsgn`

, and `lrng`

. Other variables like `val`

, `sema`

, `fema`

have missing values encoded as NaN. Missing data is forward filled up to `forward_fill_duration`

and all missing values after the forward filling time are NaN.

```
from numba import njit, prange
@njit()
def ema(times, values, tau):
"""Exponential moving average for irregularly sampled time series.
Units for ``times`` and ``tau`` should be in hours.
Args:
times (np.ndarray): 1D array of measurement times in hours
values (np.ndarray): 1D array of measurements
tau (float): time decay in hours
"""
n = len(values)
ret = np.empty(n, dtype=np.float64) * np.nan
ret[0] = values[0]
last_i = 0
for i in range(1, n):
if np.isnan(times[i]) | np.isnan(values[i]):
continue
alpha = (times[i] - times[last_i]) / tau
w = np.exp(-alpha)
if alpha > 1e-6:
w2 = (1 - w) / alpha
else:
# use Taylor expansion for numerical stability
w2 = 1 - (alpha / 2) + (alpha * alpha / 6) - (alpha * alpha * alpha / 24)
ret[i] = (ret[last_i] * w) + (values[i] * (1 - w2)) + (values[last_i] * (w2 - w))
last_i = i
return ret
@njit()
def ffill(times, values):
"""Forward fill an array of values and times.
Times indicate the time of last observation.
Args:
arr (np.ndarray): array of values with nan
times (np.ndarray): array of times
Returns:
2D array with forward filled times (index 0)
and values (index 1).
"""
out_values = values.copy()
out_times = times.copy()
n = values.shape[0]
out = np.zeros((n, 2), dtype=np.float64)
lastval = np.nan
lasttime = np.nan
for row_idx in range(n):
if np.isfinite(values[row_idx]):
lastval = values[row_idx]
lasttime = times[row_idx]
out_values[row_idx] = lastval
out_times[row_idx] = lasttime
out[:, 0] = out_times
out[:, 1] = out_values
return out
@njit()
def sliding_windows(times, window_size):
tdiff = times - times.reshape(-1, 1)
twindows = (tdiff >= 0) & (tdiff <= window_size)
return twindows
@njit()
def feature_engineering(
times,
values,
forward_fill_duration,
short_rolling_window_size,
long_rolling_window_size,
slow_ema_tau,
fast_ema_tau,
):
"""Feature engineering.
1. Age of the observation
2. Most recent forward-filled measurement
3. Dispersion as (max-min)/(max+min) over short window
4. Sign of change between first and last value in short window
5. Dispersion as (max-min)/(max+min) over long window
6. Sign of change between first and last value in long window
7. slow EMA
8. fast EMA
Forward filling replaces nan values up to a maximum duration given by
`forward_fill_duration`. Any remaining missing values should be imputed
with the median or by sampling from a standard reference range.
Args:
values (np.ndarry): 2D array of observations
times (np.ndarry): 1D array of observation times
forward_fill_duration (float): duration to forward fill missing values
short_rolling_window_size (float): observation duration for rolling
statistics with a short lookback window
long_rolling_window_size (float): observation duration for rolling
statistics with a long lookback window
slow_ema_tau (float): slow decay rate weights more of the past
fast_ema_tau (float): fast decay rate weights more of the present
Returns:
2D matrix with shape [n_samples, 8], where the features are:
suffixes = ['dt', 'val', 'short_rng', 'short_sgn', 'long_rng', 'long_sgn', 'slow_ema', 'fast_ema']
"""
DT_IX = 0
VAL_IX = 1
SRNG_IX = 2
SSGN_IX = 3
LRNG_IX = 4
LSGN_IX = 5
SEMA_IX = 6
FEMA_IX = 7
n = values.shape[0]
derived = np.zeros((n, 8)) * np.nan
missing = np.isnan(values)
# return a median imputed array if there are no observations
if np.all(missing):
derived[:, SSGN_IX] = 0.
derived[:, LSGN_IX] = 0.
derived[:, SRNG_IX] = 0.
derived[:, LRNG_IX] = 0.
return derived
# indicate samples within a fixed window for each sample
short_windows = sliding_windows(times, short_rolling_window_size)
short_windows = short_windows & ~missing.reshape(-1,1)
long_windows = sliding_windows(times, long_rolling_window_size)
long_windows = long_windows & ~missing.reshape(-1,1)
# forward fill
xfill = ffill(times, values)
dt = times - xfill[:,0]
# 0. Age of the measurement
derived[:, DT_IX] = dt
# 1. last measured value
derived[:, VAL_IX] = xfill[:, 1]
expired = ~np.isnan(dt)
expired[expired] = dt[expired] >= forward_fill_duration
derived[expired, VAL_IX] = np.nan
# rolling statistics
for t in range(n):
# samples in this window
short_inds = short_windows[:, t]
long_inds = long_windows[:, t]
short_vals = values[short_inds]
long_vals = values[long_inds]
if short_vals.size > 0:
# 2. rolling dispersion
high = np.max(short_vals)
low = np.min(short_vals)
total = high + low
if total > 1e-6:
derived[t, SRNG_IX] = (high - low) / (total + 1e-6) * 100
else:
derived[t, SRNG_IX] = 0.
# 3. sign of change between last and first value
if short_vals.size >= 2:
derived[t, SSGN_IX] = np.sign(short_vals[-1] - short_vals[0])
else:
derived[t, SSGN_IX] = 0.
else:
derived[t,SRNG_IX] = 0.
derived[t,SSGN_IX] = 0.
if long_vals.size > 0:
# 4. rolling dispersion
high = np.max(long_vals)
low = np.min(long_vals)
total = high + low
if total > 1e-6:
derived[t, LRNG_IX] = (high - low) / (total + 1e-6) * 100
else:
derived[t, LRNG_IX] = 0.
# 5. sign of change between last and first value
if long_vals.size >= 2:
derived[t, LSGN_IX] = np.sign(long_vals[-1] - long_vals[0])
else:
derived[t, LSGN_IX] = 0.
else:
derived[t, LRNG_IX] = 0.
derived[t, LSGN_IX] = 0.
# exponential moving average using the forward filled data
ema_slow = np.empty(n, dtype=np.float64) * np.nan
ema_fast = np.empty(n, dtype=np.float64) * np.nan
m = np.isfinite(derived[:, 1])
vals = derived[m, 1]
if vals.size > 0:
ema_slow = ema(times[m], derived[m, 1], slow_ema_tau)
ema_fast = ema(times[m], derived[m, 1], fast_ema_tau)
# 6. slow EMA
derived[m, SEMA_IX] = ema_slow
# 7. fast EMA
derived[m, FEMA_IX] = ema_fast
return derived
```