Asif Rahman

Time series similarity with random convolutional features and locality-sensitive hashing

2022-07-01

Given a time series, like temperature readings from a collection of sensors, we want to find sensors with similar readings. This is a common problem in applications like sensor networks, IoT, and monitoring systems.

One approach is to encode each signal with random convolutional features and then use locality-sensitive hashing (LSH) to find similar signals. This approach is fast enough to search through millions of time series signals. The time series neural hashing technique introduced here is a fast, general-purpose search and retrieval algorithm.

Our specifications are:

  1. We want to index a large number of time series signals and quickly retrieve similar time series in a way that scales computationally and has low storage requirements
  2. Signals are of variable length with stretches of missing values. Imputation of missing values is not feasible
  3. The representation should capture both the shape/structure and magnitude of the signal
  4. Minimal feature engineering

Random convolutional neural hashing

Given a time series \(X \in \mathbb{R}^{M \times N}\) of \(M\) samples, each of length \(N\) and possibly containing missing values, we want to encode each sequence \(x_m \in \mathbb{R}^{N}\) into a fixed-length embedding vector that we can use for fast similarity search.

  1. Normalize \(x_m\) and clip it to the range \([0, 1]\).
  2. Fit a random convolutional encoder and save the parameters.
  3. Embed \(x_m\) to \(e_m \in \mathbb{R}^{k}\) using the convolutional encoder.
  4. Concatenate the sample stats \(s_m = \{\min, \max, \text{mean}, P_{10}, P_{25}, P_{50}, P_{90}\}\) of the normalized sequence to \(e_m\), so that \([e_m; s_m] \in \mathbb{R}^{k+7}\).
  5. Weight the sample stats \(s_m\) by \(\alpha\) and the convolutional encodings \(e_m\) by \(1-\alpha\) (we use \(\alpha=0.75\)).
  6. Define a seed matrix of shape \(D \times (k + 7)\) drawn from a standard normal distribution (\(D=256\)).
  7. Calculate the hash string for each sample by projecting \([e_m; s_m]\) onto the seed matrix and taking the sign of each projection as one bit (e.g. e9df77eb7c692f16).
  8. Retrieval: given a query hash, compute the Hamming distance to the indexed hashes to find the nearest neighbors. (All eight steps are sketched in code below.)
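Here is a minimal NumPy sketch of the eight steps above. The kernel sampling loosely follows ROCKET-style random convolutions; the normalization bounds, the missing-value handling (zero-filling before convolution, NaN-aware statistics), and all helper names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, lo, hi):
    # Step 1: scale to [0, 1] with fixed bounds and clip; NaNs (missing values) pass through
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

def fit_random_kernels(n_kernels=512, lengths=(7, 9, 11), max_dilation_exp=5):
    # Step 2: sample kernel weights, lengths, and dilations once, then reuse for every signal
    kernels = []
    for _ in range(n_kernels):
        length = int(rng.choice(lengths))
        weights = rng.standard_normal(length)
        weights -= weights.mean()                          # zero-mean kernels
        dilation = int(2 ** rng.integers(0, max_dilation_exp + 1))
        kernels.append((weights, dilation))
    return kernels

def conv_encode(x, kernels):
    # Step 3: convolve with each random dilated kernel and max-pool to one feature per kernel
    x = np.nan_to_num(x, nan=0.0)                          # missing values contribute nothing
    feats = np.empty(len(kernels))
    for i, (w, d) in enumerate(kernels):
        dilated = np.zeros((len(w) - 1) * d + 1)
        dilated[::d] = w                                   # insert zeros to dilate the kernel
        feats[i] = np.convolve(x, dilated, mode="valid").max()
    return feats

def sample_stats(x):
    # Step 4: magnitude/distribution statistics, ignoring missing values
    return np.array([np.nanmin(x), np.nanmax(x), np.nanmean(x),
                     *np.nanpercentile(x, [10, 25, 50, 90])])

def embed(x, kernels, alpha=0.75):
    # Steps 4-5: weighted concatenation of convolutional encodings and sample stats
    return np.concatenate([(1 - alpha) * conv_encode(x, kernels),
                           alpha * sample_stats(x)])

def hash_embedding(e, seed_matrix):
    # Steps 6-7: sign of D random projections -> D bits -> hex hash string
    bits = (seed_matrix @ e) > 0
    return np.packbits(bits).tobytes().hex()

def hamming(h1, h2):
    # Step 8: bitwise Hamming distance between two hex hash strings
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")
```

Indexing a signal then looks like this (the toy signal and normalization bounds are made up for illustration):

```python
x = np.concatenate([np.sin(np.linspace(0, 20, 300)), np.full(50, np.nan)])  # signal with a missing stretch
kernels = fit_random_kernels()
seed = rng.standard_normal((256, len(kernels) + 7))    # D x (k + 7) seed matrix
h = hash_embedding(embed(normalize(x, -1.0, 1.0), kernels), seed)
print(h)                                               # D-bit hash rendered as a hex string
```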

The random convolutional encoder is very fast to fit and evaluate. It is generally good at capturing the structure of the signal, but the magnitude may be lost because the max pooling layer introduces scale and shift invariance. Our similarity search, however, must take magnitude into account, which is important for our specific use case. Therefore, we concatenate simple statistics describing the magnitude and distribution of each sample's observations.
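As a quick sanity check of that trade-off, reusing the illustrative helpers from the sketch above: a time-shifted copy of a periodic signal yields essentially identical max-pooled convolutional features, while the sample statistics clearly respond to a change in magnitude.

```python
t = np.linspace(0, 10, 500, endpoint=False)   # exactly 10 periods -> circular shift is a clean phase shift
x = 0.5 + 0.5 * np.sin(2 * np.pi * t)         # already in [0, 1]
x_shift = np.roll(x, 37)                      # same shape and magnitude, shifted in time
x_scaled = 0.5 * x                            # same shape, half the magnitude

kernels = fit_random_kernels(128)
print(np.abs(conv_encode(x, kernels) - conv_encode(x_shift, kernels)).max())
# ~0 (up to floating point): max pooling is largely shift invariant
print(sample_stats(x) - sample_stats(x_scaled))
# clearly nonzero: the stats track magnitude that the pooled features may miss
```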

Pros:

  1. Captures both signal structure (e.g. periodicity, stretches of missing data) and magnitude.
  2. Fast and scalable both computationally and storage-wise.
  3. Works on time series with missing values and of variable length.

Cons:

  1. Sacrifices some precision, since we use a random convolutional network whose kernel weights, dilations, and strides are randomly selected rather than learned.
  2. Hashing-based retrieval is an approximate nearest neighbor method, which trades precision for speed compared to exact nearest neighbor search or tree-based similarity search with a distance metric.
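For completeness, here is one way the approximate retrieval step might be vectorized: pack the hex hashes into a byte matrix and linearly scan Hamming distances with a popcount lookup table. This is a sketch; a production deployment might instead bucket hashes by bands, as in classic LSH, to avoid scanning every row. The helper names are illustrative.

```python
import numpy as np

def pack_hashes(hashes):
    # Stack hex hash strings into an (n, D/8) uint8 matrix for vectorized search
    return np.stack([np.frombuffer(bytes.fromhex(h), dtype=np.uint8) for h in hashes])

# Popcount lookup table: number of set bits for every possible byte value
POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(axis=1)

def nearest(query_hash, packed, top_k=5):
    # XOR the query against every stored hash, then count differing bits per row
    q = np.frombuffer(bytes.fromhex(query_hash), dtype=np.uint8)
    dists = POPCOUNT[packed ^ q].sum(axis=1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]
```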