# Feature engineering for time series data using Numba

Given a sequence of measurements `values: np.ndarray`

and observation times `times: np.ndarray`

, we want to engineer features for a machine learning model that captures the temporal trends and statistics over different temporal windows. Our data is irregularly sampled and have missing data.

Features representing the magnitude, dispersion, direction of change, and temporal trends are derived. Every time point is represented with 4 categories of features:

- Magnitude of the most recent observation within the last 6h for vital signs and 24h for laboratory measurements.
- Dispersion measured as the range over a short and long-time window.
- Direction of change (increasing, decreasing, no change) over a short and long-time window.
- Exponential moving averages (EMA) with varying decay rates that specify how the much impact each past observation has on the current mean. EMA features were calculated on the forward filled magnitudes and using an EMA algorithm specifically for irregularly sampled time series.

The engineered features include:

`dt`

: time elapsed since the measurement was made`val`

: most recent measurement, measurements are forward filled up to a maximum duration after`srng`

: range as \(\frac{x_{max}-x_{min}}{x_{max}+x_{min}}100\) over a*short window*`ssgn`

: sign of the change (-1 or +1) between the first and last measurement in a*short window*, windows without measurements are filled with 0`lrng`

: range as (max-min)/(max+min) * 100 over a*long window*`lsgn`

: sign of the change (-1 or +1) between the first and last measurement in a*long window*, windows without measurements are filled with 0`sema`

: slow exponential moving average, calculated after forward filling`fema`

: fast exponential moving average, calculated after forward filling

We also need to define the following settings for every variable:

`forward_fill_duration`

: duration in minutes to forward fill a missing variable`short_rolling_window_size`

: short rolling window size in minutes`long_rolling_window_size`

: long rolling window size in minutes`slow_ema_tau`

: decay rate for slow exponential moving average in minutes`fast_ema_tau`

: decay rate for fast exponential moving average in minutes

Some derived features have no missing data, including `dt`

, `ssgn`

, `srng`

, `lsgn`

, and `lrng`

. Other variables like `val`

, `sema`

, `fema`

have missing values encoded as NaN. Missing data is forward filled up to `forward_fill_duration`

and all missing values after the forward filling time are NaN.

from numba import njit, prange @njit() def ema(times, values, tau): """Exponential moving average for irregularly sampled time series. Units for ``times`` and ``tau`` should be in hours. Args: times (np.ndarray): 1D array of measurement times in hours values (np.ndarray): 1D array of measurements tau (float): time decay in hours """ n = len(values) ret = np.empty(n, dtype=np.float64) * np.nan ret[0] = values[0] last_i = 0 for i in range(1, n): if np.isnan(times[i]) | np.isnan(values[i]): continue alpha = (times[i] - times[last_i]) / tau w = np.exp(-alpha) if alpha > 1e-6: w2 = (1 - w) / alpha else: # use Taylor expansion for numerical stability w2 = 1 - (alpha / 2) + (alpha * alpha / 6) - (alpha * alpha * alpha / 24) ret[i] = (ret[last_i] * w) + (values[i] * (1 - w2)) + (values[last_i] * (w2 - w)) last_i = i return ret @njit() def ffill(times, values): """Forward fill an array of values and times. Times indicate the time of last observation. Args: arr (np.ndarray): array of values with nan times (np.ndarray): array of times Returns: 2D array with forward filled times (index 0) and values (index 1). """ out_values = values.copy() out_times = times.copy() n = values.shape[0] out = np.zeros((n, 2), dtype=np.float64) lastval = np.nan lasttime = np.nan for row_idx in range(n): if np.isfinite(values[row_idx]): lastval = values[row_idx] lasttime = times[row_idx] out_values[row_idx] = lastval out_times[row_idx] = lasttime out[:, 0] = out_times out[:, 1] = out_values return out @njit() def sliding_windows(times, window_size): tdiff = times - times.reshape(-1, 1) twindows = (tdiff >= 0) & (tdiff <= window_size) return twindows @njit() def feature_engineering( times, values, forward_fill_duration, short_rolling_window_size, long_rolling_window_size, slow_ema_tau, fast_ema_tau, ): """Feature engineering. 1. Age of the observation 2. Most recent forward-filled measurement 3. Dispersion as (max-min)/(max+min) over short window 4. Sign of change between first and last value in short window 5. Dispersion as (max-min)/(max+min) over long window 6. Sign of change between first and last value in long window 7. slow EMA 8. fast EMA Forward filling replaces nan values up to a maximum duration given by `forward_fill_duration`. Any remaining missing values should be imputed with the median or by sampling from a standard reference range. Args: values (np.ndarry): 2D array of observations times (np.ndarry): 1D array of observation times forward_fill_duration (float): duration to forward fill missing values short_rolling_window_size (float): observation duration for rolling statistics with a short lookback window long_rolling_window_size (float): observation duration for rolling statistics with a long lookback window slow_ema_tau (float): slow decay rate weights more of the past fast_ema_tau (float): fast decay rate weights more of the present Returns: 2D matrix with shape [n_samples, 8], where the features are: suffixes = ['dt', 'val', 'short_rng', 'short_sgn', 'long_rng', 'long_sgn', 'slow_ema', 'fast_ema'] """ DT_IX = 0 VAL_IX = 1 SRNG_IX = 2 SSGN_IX = 3 LRNG_IX = 4 LSGN_IX = 5 SEMA_IX = 6 FEMA_IX = 7 n = values.shape[0] derived = np.zeros((n, 8)) * np.nan missing = np.isnan(values) # return a median imputed array if there are no observations if np.all(missing): derived[:, SSGN_IX] = 0. derived[:, LSGN_IX] = 0. derived[:, SRNG_IX] = 0. derived[:, LRNG_IX] = 0. return derived # indicate samples within a fixed window for each sample short_windows = sliding_windows(times, short_rolling_window_size) short_windows = short_windows & ~missing.reshape(-1,1) long_windows = sliding_windows(times, long_rolling_window_size) long_windows = long_windows & ~missing.reshape(-1,1) # forward fill xfill = ffill(times, values) dt = times - xfill[:,0] # 0. Age of the measurement derived[:, DT_IX] = dt # 1. last measured value derived[:, VAL_IX] = xfill[:, 1] expired = ~np.isnan(dt) expired[expired] = dt[expired] >= forward_fill_duration derived[expired, VAL_IX] = np.nan # rolling statistics for t in range(n): # samples in this window short_inds = short_windows[:, t] long_inds = long_windows[:, t] short_vals = values[short_inds] long_vals = values[long_inds] if short_vals.size > 0: # 2. rolling dispersion high = np.max(short_vals) low = np.min(short_vals) total = high + low if total > 1e-6: derived[t, SRNG_IX] = (high - low) / (total + 1e-6) * 100 else: derived[t, SRNG_IX] = 0. # 3. sign of change between last and first value if short_vals.size >= 2: derived[t, SSGN_IX] = np.sign(short_vals[-1] - short_vals[0]) else: derived[t, SSGN_IX] = 0. else: derived[t,SRNG_IX] = 0. derived[t,SSGN_IX] = 0. if long_vals.size > 0: # 4. rolling dispersion high = np.max(long_vals) low = np.min(long_vals) total = high + low if total > 1e-6: derived[t, LRNG_IX] = (high - low) / (total + 1e-6) * 100 else: derived[t, LRNG_IX] = 0. # 5. sign of change between last and first value if long_vals.size >= 2: derived[t, LSGN_IX] = np.sign(long_vals[-1] - long_vals[0]) else: derived[t, LSGN_IX] = 0. else: derived[t, LRNG_IX] = 0. derived[t, LSGN_IX] = 0. # exponential moving average using the forward filled data ema_slow = np.empty(n, dtype=np.float64) * np.nan ema_fast = np.empty(n, dtype=np.float64) * np.nan m = np.isfinite(derived[:, 1]) vals = derived[m, 1] if vals.size > 0: ema_slow = ema(times[m], derived[m, 1], slow_ema_tau) ema_fast = ema(times[m], derived[m, 1], fast_ema_tau) # 6. slow EMA derived[m, SEMA_IX] = ema_slow # 7. fast EMA derived[m, FEMA_IX] = ema_fast return derived