Asif Rahman

Python Dependency Injection

Sun, 15 Jun 2025 08:14:56 -0400

Dependency Injection (DI) is a software design pattern that lets you pass instances of a service rather than creating them directly within a class or function. The FastAPI framework provides a neat way to pass dependencies using Python’s type hints and the Depends function. The fast-depends package extracts the FastAPI code and strips out all the web framework-specific code into a small library that can be used in any Python project.

The example below uses the single-file implementation of dependency injection given at the bottom of this page.

import typing as t

from depends import Depends, inject  # the depends.py module from the bottom of this page

def get_settings() -> Settings:
    return Settings()

@inject
def get_db(
    settings: t.Annotated[Settings, Depends(get_settings)],
) -> DatabaseConnection:
    db = DatabaseConnection(settings)
    return db

@inject
def compute_something(
    db: t.Annotated[DatabaseConnection, Depends(get_db)]
):
    # Use db connection here
    pass

# Call the function with dependencies injected
result = compute_something()  # Automatically resolves dependencies

db = get_db()  # You can also call the dependency directly
result = compute_something(db=db)  # Pass the db directly if needed

The get_* functions return instances of the required services, and the Depends function is used to declare dependencies. The inject decorator enables dependency injection for the function by automatically resolving the dependencies when the function is called. The default behavior is to cache the results of dependencies, so if a dependency is called multiple times, it will return the cached result instead of executing the function again (e.g. Settings and DB connection are created once and reused).
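To make the mechanics concrete, here is a deliberately simplified sketch of what an inject-style decorator does. It is not the implementation given at the bottom of this page: ToyDepends, toy_inject, and _resolve are hypothetical names used only for illustration, and the sketch handles only sync functions, keyword arguments, and default-value markers, with no validation.

```python
import functools
import inspect
import typing as t


class ToyDepends:
    """Hypothetical stand-in for Depends: it only remembers the provider callable."""

    def __init__(self, dependency: t.Callable[..., t.Any], *, use_cache: bool = True) -> None:
        self.dependency = dependency
        self.use_cache = use_cache


def _resolve(func: t.Callable[..., t.Any], given: t.Dict[str, t.Any], cache: t.Dict[t.Any, t.Any]) -> t.Any:
    """Call func, filling every ToyDepends default that the caller did not supply."""
    kwargs = dict(given)
    for name, param in inspect.signature(func).parameters.items():
        marker = param.default
        if name in kwargs or not isinstance(marker, ToyDepends):
            continue
        dep = marker.dependency
        if marker.use_cache and dep in cache:
            kwargs[name] = cache[dep]  # reuse the value computed earlier in this call
            continue
        value = _resolve(dep, {}, cache)  # providers may declare dependencies of their own
        if marker.use_cache:
            cache[dep] = value
        kwargs[name] = value
    return func(**kwargs)


def toy_inject(func: t.Callable[..., t.Any]) -> t.Callable[..., t.Any]:
    """Hypothetical stand-in for inject: resolve markers with a fresh cache per call."""

    @functools.wraps(func)
    def wrapper(**kwargs: t.Any) -> t.Any:  # keyword arguments only, to keep the sketch short
        return _resolve(func, kwargs, cache={})

    return wrapper


calls: t.List[str] = []


def get_settings() -> dict:
    calls.append("settings")
    return {"dsn": "sqlite://"}


def get_db(settings: dict = ToyDepends(get_settings)) -> str:
    return f"connection to {settings['dsn']}"


@toy_inject
def compute_something(
    db: str = ToyDepends(get_db),
    settings: dict = ToyDepends(get_settings),
) -> str:
    return f"using {db}"


print(compute_something())  # get_settings runs once even though two parameters need it
print(compute_something(db="fake connection"))  # an explicit argument skips get_db
```

The cache in this sketch is keyed by the provider function and lives for one call of the decorated function, which is why get_settings runs only once per call even though both get_db and compute_something ask for it.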

This pattern allows for better separation of concerns (instantiating dependencies outside of the function) and makes unit testing easier. For example, we can create different implementations of the database and settings classes and pass them to the compute_something function without changing its signature.
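As a rough illustration of the testing benefit, the sketch below passes a hand-written fake in place of a real connection. The names here (Database, FakeDatabase, count_users, query) are hypothetical, and the function is plain Python so that the example runs on its own, without the inject decorator.

```python
import typing as t


class Database(t.Protocol):
    """Hypothetical interface that the function depends on."""

    def query(self, sql: str) -> t.List[t.Tuple[t.Any, ...]]: ...


class FakeDatabase:
    """Test double: records queries and returns canned rows."""

    def __init__(self, rows: t.List[t.Tuple[t.Any, ...]]) -> None:
        self.rows = rows
        self.queries: t.List[str] = []

    def query(self, sql: str) -> t.List[t.Tuple[t.Any, ...]]:
        self.queries.append(sql)
        return self.rows


def count_users(db: Database) -> int:
    # The function only sees the interface, never how the connection was built.
    return len(db.query("SELECT id FROM users"))


fake = FakeDatabase(rows=[(1,), (2,), (3,)])
print(count_users(db=fake))  # 3
print(fake.queries)  # ['SELECT id FROM users']
```

Because count_users receives its database as an argument, the test never has to open a real connection or patch a module-level global.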

Below is the full implementation of dependency injection as a standalone Python module. You can copy it into a depends.py file and use it in your projects.

Key Components:

Depends: Marks a parameter as a dependency to be injected
inject: Decorator that enables dependency injection for a sync or async function
CustomField: Base class for creating custom parameter extractors
dependency_provider: Global provider for managing dependency overrides

Features:

Automatic dependency resolution and injection
Support for both sync and async functions
Dependency caching (can be disabled per dependency)
Type validation and casting using Pydantic
Context manager support for resource management
Custom field extractors for complex parameter handling
Dependency override system for testing and configuration

Table of Contents:

Usage
Implementation Code

Usage

Simple Dependency Injection

Dependencies can be injected into functions using the Depends class and the inject decorator.

def get_database():
    return "database_connection"

def get_user(db: str = Depends(get_database)):
    return f"user_from_{db}"

@inject
def handler(user: str = Depends(get_user)):
    return f"Hello, {user}!"

result = handler()  # "Hello, user_from_database_connection!"
print(result)

Async Dependencies

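import asyncio

async def get_async_db():
    await asyncio.sleep(0.1)
    return "async_database"

@inject
async def async_handler(db: str = Depends(get_async_db)):
    return f"DB: {db}"

async def main():
    # await is only valid inside a coroutine, so drive the handler from main()
    result = await async_handler()  # "DB: async_database"
    print(result)

asyncio.run(main())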

Dependency Caching

Caching is enabled by default: if the same dependency is needed more than once while resolving a single call (the equivalent of a request in FastAPI), the dependency function runs once and its result is reused.

call_count = 0

def expensive_operation():
    global call_count
    call_count += 1
    return f"result_{call_count}"

@inject
def handler(
    a: str = Depends(expensive_operation),
    b: str = Depends(expensive_operation),  # Same result due to caching
):
    return f"{a}, {b}"

result = handler()  # "result_1, result_1"
print(result)  # Output: "result_1, result_1"

Disable Caching

The use_cache parameter can be set to False to disable caching for specific dependencies.

@inject
def handler(
    a: str = Depends(expensive_operation, use_cache=False),
    b: str = Depends(expensive_operation, use_cache=False),
):
    return f"{a}, {b}"

# Assuming call_count starts at 0 (a fresh run, not continued from the previous example)
result = handler()  # "result_1, result_2"
print(result)  # Output: "result_1, result_2"

Custom Fields

class HeaderExtractor(CustomField):
    def __init__(self, header_name: str):
        super().__init__()
        self.header_name = header_name

    def use(self, **kwargs: t.Any) -> t.Dict[str, t.Any]:
        # Extract from some global context
        kwargs[self.param_name] = f"header_value_for_{self.header_name}"
        return kwargs

@inject
def api_handler(
    auth: str = HeaderExtractor("Authorization"),
    content_type: str = HeaderExtractor("Content-Type"),
):
    return {"auth": auth, "content_type": content_type}

Dependency Overrides

def original_dep():
    return "original"

def override_dep():
    return "overridden"

@inject
def handler(value: str = Depends(original_dep)):
    return value

# Override dependency
dependency_provider.override(original_dep, override_dep)
result = handler()  # "overridden"

# Clear overrides
dependency_provider.clear()
result = handler()  # "original"
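The Provider class in the implementation code also defines a scope context manager, which is convenient in tests because the override only lasts for the duration of a with block. The sketch below is a standalone miniature of that idea rather than the module's own code: MiniProvider and resolve are hypothetical names, and it uses try/finally so the override is removed even if the block raises.

```python
import typing as t
from contextlib import contextmanager

AnyCallable = t.Callable[..., t.Any]


class MiniProvider:
    """Hypothetical miniature of an override registry."""

    def __init__(self) -> None:
        self.dependency_overrides: t.Dict[AnyCallable, AnyCallable] = {}

    @contextmanager
    def scope(self, original: AnyCallable, override: AnyCallable) -> t.Iterator[None]:
        self.dependency_overrides[original] = override
        try:
            yield
        finally:
            # Always undo the override, even when the with-block raises.
            self.dependency_overrides.pop(original, None)

    def resolve(self, call: AnyCallable) -> AnyCallable:
        # Fall back to the original callable when no override is registered.
        return self.dependency_overrides.get(call, call)


provider = MiniProvider()


def original_dep() -> str:
    return "original"


def override_dep() -> str:
    return "overridden"


with provider.scope(original_dep, override_dep):
    inside = provider.resolve(original_dep)()

outside = provider.resolve(original_dep)()
print(inside, outside)  # overridden original
```

The lookup in resolve mirrors the dependency_overrides.get(self.call, self.call) line in the implementation's _solve method: an override is a plain dictionary entry keyed by the original function.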

Generator Dependencies (Context Managers)

A dependency can be a generator function, which allows for resource management (like opening and closing database connections). In this example, database_session is a generator function: the code before yield opens the connection, and the code after it closes the connection. The inject decorator treats the generator as a context manager, entering it before handler runs and exiting it afterwards.

def database_session() -> t.Generator[str, None, None]:
    print("Opening connection")
    yield "db_session"
    print("Closing connection")

@inject
def handler(db: str = Depends(database_session)):
    return f"Using {db}"

result = handler()
# Output: Opening connection
# Output: Closing connection
print(result)  # "Using db_session"
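Under the hood, the implementation wraps generator dependencies with contextlib.contextmanager and enters them on an ExitStack (see solve_generator_sync in the implementation code). The standalone sketch below shows just that mechanism; the events list and the run_handler function are hypothetical names used to make the ordering visible.

```python
import typing as t
from contextlib import ExitStack, contextmanager

events: t.List[str] = []


def database_session() -> t.Generator[str, None, None]:
    events.append("open")
    try:
        yield "db_session"
    finally:
        events.append("close")  # runs when the ExitStack unwinds


def run_handler() -> str:
    # Leaving the with-block closes every context that was entered on the stack.
    with ExitStack() as stack:
        # contextmanager() turns the generator function into a context manager factory.
        db = stack.enter_context(contextmanager(database_session)())
        events.append(f"handler uses {db}")
        return f"Using {db}"


result = run_handler()
print(events)  # ['open', 'handler uses db_session', 'close']
print(result)  # Using db_session
```

Putting the cleanup in a finally block, as this sketch does, means the connection is closed even when the handler raises.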

Type Validation with Pydantic

Injected arguments are validated and cast with Pydantic according to their type annotations. In this example, the get_user_id function returns a string, but the value is cast to an integer before it reaches handler. Casting is enabled by default (cast=True). The Annotated type from the typing module is used to attach the dependency to the parameter's type.

from typing import Annotated

def get_user_id() -> int:
    return "123"  # Wrong type!

@inject
def handler(user_id: Annotated[int, Depends(get_user_id)]):
    return f"User ID: {user_id}"

result = handler()  # user_id will be cast to int(123)

Disable Type Casting

To disable type casting for a function and the dependencies it declares, pass cast=False to the inject decorator. To disable it for a single dependency only, pass cast=False to Depends instead.

@inject(cast=False)
def handler(user_id: Annotated[int, Depends(get_user_id)]):
    return f"User ID type: {type(user_id)}"

result = handler()  # user_id remains as string "123"

Implementation Code

Requirements:

Python 3.11+
anyio for async I/O operations
pydantic for data validation and settings management
typing_extensions for type annotations

import anyio
import asyncio
import inspect
import functools
import typing as t
from abc import ABC
from copy import deepcopy
from itertools import chain
from pydantic import ConfigDict
from collections import namedtuple
from functools import wraps, partial
from pydantic import BaseModel, create_model
from typing_extensions import Annotated, ParamSpec, get_args, get_origin
from pydantic._internal._typing_extra import try_eval_type as evaluate_forwardref
from contextlib import AsyncExitStack, ExitStack, asynccontextmanager, contextmanager

P = ParamSpec("P")
T = t.TypeVar("T")
Cls = t.TypeVar("Cls", bound="CustomField")

default_pydantic_config = {"arbitrary_types_allowed": True}


def get_config_base(config_data: t.Optional[ConfigDict] = None) -> ConfigDict:
 return config_data or ConfigDict(**default_pydantic_config)


def get_aliases(model: t.Type[BaseModel]) -> t.Tuple[str, ...]:
 return tuple(f.alias or name for name, f in model.model_fields.items())


class Depends:
 """Mark a parameter as a dependency to be injected."""

 use_cache: bool
 cast: bool

 def __init__(
 self,
 dependency: t.Callable[..., t.Any],
 *,
 use_cache: bool = True,
 cast: bool = True,
 ) -> None:
 self.dependency = dependency
 self.use_cache = use_cache
 self.cast = cast

 def __repr__(self) -> str:
 attr = getattr(self.dependency, "__name__", type(self.dependency).__name__)
 cache = "" if self.use_cache else ", use_cache=False"
 return f"{self.__class__.__name__}({attr}{cache})"


class CustomField(ABC):
 """Base class for custom field extractors."""

 param_name: t.Optional[str]
 cast: bool
 required: bool

 __slots__ = (
 "cast",
 "param_name",
 "required",
 "field",
 )

 def __init__(
 self,
 *,
 cast: bool = True,
 required: bool = True,
 ) -> None:
 self.cast = cast
 self.param_name = None
 self.required = required
 self.field = False

 def set_param_name(self: Cls, name: str) -> Cls:
 self.param_name = name
 return self

 def use(self, /, **kwargs: t.Any) -> t.Dict[str, t.Any]:
 assert self.param_name, "You should specify `param_name` before using"
 return kwargs

 def use_field(self, kwargs: t.Dict[str, t.Any]) -> None:
 raise NotImplementedError("You should implement `use_field` method.")


# Provider for dependency overrides


class Provider:
    """Provider for dependency overrides."""

    dependency_overrides: t.Dict[t.Callable[..., t.Any], t.Callable[..., t.Any]]

    def __init__(self) -> None:
        self.dependency_overrides = {}

    def clear(self) -> None:
        # Clear in place: injected functions keep a reference to this dict,
        # so rebinding the attribute would leave their overrides untouched.
        self.dependency_overrides.clear()

    def override(
        self,
        original: t.Callable[..., t.Any],
        override: t.Callable[..., t.Any],
    ) -> None:
        self.dependency_overrides[original] = override

    @contextmanager
    def scope(
        self,
        original: t.Callable[..., t.Any],
        override: t.Callable[..., t.Any],
    ) -> t.Iterator[None]:
        self.dependency_overrides[original] = override
        try:
            yield
        finally:
            # Remove the override even if the with-block raises
            self.dependency_overrides.pop(original, None)


dependency_provider = Provider()


def is_coroutine_callable(call: t.Callable[..., t.Any]) -> bool:
 if inspect.isclass(call):
 return False
 if asyncio.iscoroutinefunction(call):
 return True
 dunder_call = getattr(call, "__call__", None)
 return asyncio.iscoroutinefunction(dunder_call)


def is_gen_callable(call: t.Callable[..., t.Any]) -> bool:
 if inspect.isgeneratorfunction(call):
 return True
 dunder_call = getattr(call, "__call__", None)
 return inspect.isgeneratorfunction(dunder_call)


def is_async_gen_callable(call: t.Callable[..., t.Any]) -> bool:
 if inspect.isasyncgenfunction(call):
 return True
 dunder_call = getattr(call, "__call__", None)
 return inspect.isasyncgenfunction(dunder_call)


async def run_async(
 func: t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 *args: P.args,
 **kwargs: P.kwargs,
) -> T:
 if is_coroutine_callable(func):
 return await t.cast(t.Callable[P, t.Awaitable[T]], func)(*args, **kwargs)
 else:
 return await run_in_threadpool(t.cast(t.Callable[P, T], func), *args, **kwargs)


async def run_in_threadpool(func: t.Callable[P, T], *args: P.args, **kwargs: P.kwargs) -> T:
 if kwargs:
 func = functools.partial(func, **kwargs)
 return await anyio.to_thread.run_sync(func, *args)


def get_typed_annotation(
    annotation: t.Any,
    globalns: t.Dict[str, t.Any],
    locals: t.Dict[str, t.Any],
) -> t.Any:
    if isinstance(annotation, str):
        annotation = t.ForwardRef(annotation)

    if isinstance(annotation, t.ForwardRef):
        # try_eval_type returns a (value, succeeded) tuple; keep only the value
        annotation, _ = evaluate_forwardref(annotation, globalns, locals)

    if get_origin(annotation) is Annotated and (args := get_args(annotation)):
        solved_args = [get_typed_annotation(x, globalns, locals) for x in args]
        annotation.__origin__, annotation.__metadata__ = solved_args[0], tuple(solved_args[1:])

    return annotation


def collect_outer_stack_locals() -> t.Dict[str, t.Any]:
 """
 Collect local variables from outer stack frames to resolve type annotations.

 This function walks up the call stack and collects all local variables
 from frames outside of this module. This is necessary for resolving
 forward references and string annotations that might reference variables
 defined in the calling code.
 """
 frame = inspect.currentframe()
 frames: t.List[t.Any] = []
 current_filename = __file__ if "__file__" in globals() else None

 while frame is not None:
 frame_filename = frame.f_code.co_filename
 # Skip frames from this module to avoid internal variables
 if current_filename is None or frame_filename != current_filename:
 frames.append(frame)
 frame = frame.f_back

 locals = {}
 for f in frames[::-1]:
 locals.update(f.f_locals)

 return locals


def get_typed_signature(call: t.Callable[..., t.Any]) -> t.Tuple[inspect.Signature, t.Any]:
 signature = inspect.signature(call)
 locals = collect_outer_stack_locals()
 call = inspect.unwrap(call)
 globalns = getattr(call, "__globals__", {})

 typed_params = [
 inspect.Parameter(
 name=param.name,
 kind=param.kind,
 default=param.default,
 annotation=get_typed_annotation(
 param.annotation,
 globalns,
 locals,
 ),
 )
 for param in signature.parameters.values()
 ]

 return inspect.Signature(typed_params), get_typed_annotation(
 signature.return_annotation,
 globalns,
 locals,
 )


async def solve_generator_async(
 *sub_args: t.Any, call: t.Callable[..., t.Any], stack: AsyncExitStack, **sub_values: t.Any
) -> t.Any:
 if is_gen_callable(call):
 cm = contextmanager_in_threadpool(contextmanager(call)(**sub_values))
 elif is_async_gen_callable(call):
 cm = asynccontextmanager(call)(*sub_args, **sub_values)
 return await stack.enter_async_context(cm)


def solve_generator_sync(
 *sub_args: t.Any, call: t.Callable[..., t.Any], stack: ExitStack, **sub_values: t.Any
) -> t.Any:
 cm = contextmanager(call)(*sub_args, **sub_values)
 return stack.enter_context(cm)


@asynccontextmanager
async def contextmanager_in_threadpool(
 cm: t.ContextManager[T],
) -> t.AsyncGenerator[T, None]:
 exit_limiter = anyio.CapacityLimiter(1)
 try:
 yield await run_in_threadpool(cm.__enter__)
 except Exception as e:
 ok = bool(await anyio.to_thread.run_sync(cm.__exit__, type(e), e, None, limiter=exit_limiter))
 if not ok:
 raise e
 else:
 await anyio.to_thread.run_sync(cm.__exit__, None, None, None, limiter=exit_limiter)


async def async_map(func: t.Callable[..., T], async_iterable: t.AsyncIterable[t.Any]) -> t.AsyncIterable[T]:
 async for i in async_iterable:
 yield func(i)


class solve_wrapper(partial[T]):
 call: t.Callable[..., T]

 def __new__(
 cls,
 func: t.Callable[..., T],
 *args: t.Any,
 **kwargs: t.Any,
 ) -> "solve_wrapper[T]":
 assert len(args) > 0, "Model should be passed as first argument"
 model = args[0]
 self = super().__new__(cls, func, *args, **kwargs)
 self.call = model.call
 return self


# Core Models

PriorityPair = namedtuple("PriorityPair", ("call", "dependencies_number", "dependencies_names"))


class ResponseModel(BaseModel, t.Generic[T]):
 response: T


class CallModel(t.Generic[P, T]):
 """Model representing a callable with dependency injection."""

 call: t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]]
 is_async: bool
 is_generator: bool
 model: t.Optional[t.Type[BaseModel]]
 response_model: t.Optional[t.Type[ResponseModel[T]]]

 params: t.Dict[str, t.Tuple[t.Any, t.Any]]
 alias_arguments: t.Tuple[str, ...]

 dependencies: t.Dict[str, "CallModel[..., t.Any]"]
 extra_dependencies: t.Iterable["CallModel[..., t.Any]"]
 sorted_dependencies: t.Tuple[t.Tuple["CallModel[..., t.Any]", int], ...]
 custom_fields: t.Dict[str, CustomField]
 keyword_args: t.Tuple[str, ...]
 positional_args: t.Tuple[str, ...]
 var_positional_arg: t.Optional[str]
 var_keyword_arg: t.Optional[str]

 use_cache: bool
 cast: bool

 __slots__ = (
 "call",
 "is_async",
 "is_generator",
 "model",
 "response_model",
 "params",
 "alias_arguments",
 "keyword_args",
 "positional_args",
 "var_positional_arg",
 "var_keyword_arg",
 "dependencies",
 "extra_dependencies",
 "sorted_dependencies",
 "custom_fields",
 "use_cache",
 "cast",
 )

 @property
 def call_name(self) -> str:
 call = inspect.unwrap(self.call)
 return getattr(call, "__name__", type(call).__name__)

 @property
 def flat_params(self) -> t.Dict[str, t.Tuple[t.Any, t.Any]]:
 params = self.params
 for d in (*self.dependencies.values(), *self.extra_dependencies):
 params.update(d.flat_params)
 return params

 @property
 def flat_dependencies(
 self,
 ) -> t.Dict[
 t.Callable[..., t.Any],
 t.Tuple["CallModel[..., t.Any]", t.Tuple[t.Callable[..., t.Any], ...]],
 ]:
 flat: t.Dict[
 t.Callable[..., t.Any],
 t.Tuple[CallModel[..., t.Any], t.Tuple[t.Callable[..., t.Any], ...]],
 ] = {}

 for i in (*self.dependencies.values(), *self.extra_dependencies):
 flat.update(
 {
 i.call: (
 i,
 tuple(j.call for j in i.dependencies.values()),
 )
 }
 )
 flat.update(i.flat_dependencies)

 return flat

 def __init__(
 self,
 /,
 call: t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 model: t.Optional[t.Type[BaseModel]],
 params: t.Dict[str, t.Tuple[t.Any, t.Any]],
 response_model: t.Optional[t.Type[ResponseModel[T]]] = None,
 use_cache: bool = True,
 cast: bool = True,
 is_async: bool = False,
 is_generator: bool = False,
 dependencies: t.Optional[t.Dict[str, "CallModel[..., t.Any]"]] = None,
 extra_dependencies: t.Optional[t.Iterable["CallModel[..., t.Any]"]] = None,
 keyword_args: t.Optional[t.List[str]] = None,
 positional_args: t.Optional[t.List[str]] = None,
 var_positional_arg: t.Optional[str] = None,
 var_keyword_arg: t.Optional[str] = None,
 custom_fields: t.Optional[t.Dict[str, CustomField]] = None,
 ):
 self.call = call
 self.model = model

 if model:
 self.alias_arguments = get_aliases(model)
 else:
 self.alias_arguments = ()

 self.keyword_args = tuple(keyword_args or ())
 self.positional_args = tuple(positional_args or ())
 self.var_positional_arg = var_positional_arg
 self.var_keyword_arg = var_keyword_arg
 self.response_model = response_model
 self.use_cache = use_cache
 self.cast = cast
 self.is_async = is_async or is_coroutine_callable(call) or is_async_gen_callable(call)
 self.is_generator = is_generator or is_gen_callable(call) or is_async_gen_callable(call)

 self.dependencies = dependencies or {}
 self.extra_dependencies = extra_dependencies or ()
 self.custom_fields = custom_fields or {}

 sorted_dep: t.List[CallModel[..., t.Any]] = []
 flat = self.flat_dependencies
 for calls in flat.values():
 _sort_dep(sorted_dep, calls, flat)

 self.sorted_dependencies = tuple((i, len(i.sorted_dependencies)) for i in sorted_dep if i.use_cache)
 for name in chain(self.dependencies.keys(), self.custom_fields.keys()):
 params.pop(name, None)
 self.params = params

 def _solve(
 self,
 /,
 *args: t.Tuple[t.Any, ...],
 cache_dependencies: t.Dict[
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 T,
 ],
 dependency_overrides: t.Optional[
 t.Dict[
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 ]
 ] = None,
 **kwargs: t.Dict[str, t.Any],
 ) -> t.Generator[
 t.Tuple[t.Sequence[t.Any], t.Dict[str, t.Any], t.Callable[..., t.Any]],
 t.Any,
 T,
 ]:
 if dependency_overrides:
 call = dependency_overrides.get(self.call, self.call)
 assert self.is_async or not is_coroutine_callable(call), (
 f"You cannot use async dependency `{self.call_name}` at sync main"
 )
 else:
 call = self.call

 if self.use_cache and call in cache_dependencies:
 return cache_dependencies[call]

 kw: t.Dict[str, t.Any] = {}

 for arg in self.keyword_args:
 if (v := kwargs.pop(arg, inspect.Parameter.empty)) is not inspect.Parameter.empty:
 kw[arg] = v

 if self.var_keyword_arg is not None:
 kw[self.var_keyword_arg] = kwargs
 else:
 kw.update(kwargs)

 for arg in self.positional_args:
 if args:
 kw[arg], args = args[0], args[1:]
 else:
 break

 keyword_args: t.Iterable[str]
 if self.var_positional_arg is not None:
 kw[self.var_positional_arg] = args
 keyword_args = self.keyword_args
 else:
 keyword_args = self.keyword_args + self.positional_args
 for arg in keyword_args:
 if not self.cast and arg in self.params:
 kw[arg] = self.params[arg][1]

 if not args:
 break

 if arg not in self.dependencies:
 kw[arg], args = args[0], args[1:]

 solved_kw: t.Dict[str, t.Any]
 solved_kw = yield args, kw, call

 args_: t.Sequence[t.Any]
 if self.cast:
 assert self.model, "Cast should be used only with model"
 casted_model = self.model(**solved_kw)

 kwargs_ = {arg: getattr(casted_model, arg, solved_kw.get(arg)) for arg in keyword_args}
 if self.var_keyword_arg:
 kwargs_.update(getattr(casted_model, self.var_keyword_arg, {}))

 if self.var_positional_arg is not None:
 args_ = [getattr(casted_model, arg, solved_kw.get(arg)) for arg in self.positional_args]
 args_.extend(getattr(casted_model, self.var_positional_arg, ()))
 else:
 args_ = ()
 else:
 kwargs_ = {arg: solved_kw.get(arg) for arg in keyword_args}

 if self.var_positional_arg is not None:
 args_ = tuple(map(solved_kw.get, self.positional_args))
 else:
 args_ = ()

 response: T
 response = yield args_, kwargs_, call

 if self.cast and not self.is_generator:
 response = self._cast_response(response)

 if self.use_cache:
 cache_dependencies[call] = response

 return response

 def _cast_response(self, /, value: t.Any) -> t.Any:
 if self.response_model is not None:
 return self.response_model(response=value).response
 else:
 return value

 def solve(
 self,
 /,
 *args: t.Tuple[t.Any, ...],
 stack: ExitStack,
 cache_dependencies: t.Dict[
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 T,
 ],
 dependency_overrides: t.Optional[
 t.Dict[
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 ]
 ] = None,
 nested: bool = False,
 **kwargs: t.Dict[str, t.Any],
 ) -> T:
 cast_gen = self._solve(
 *args,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 **kwargs,
 )
 try:
 args, kwargs, _ = next(cast_gen)
 except StopIteration as e:
 cached_value: T = e.value
 return cached_value

 # Heat cache and solve extra dependencies
 for dep, _ in self.sorted_dependencies:
 dep.solve(
 *args,
 stack=stack,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 nested=True,
 **kwargs,
 )

 # Always get from cache
 for dep in self.extra_dependencies:
 dep.solve(
 *args,
 stack=stack,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 nested=True,
 **kwargs,
 )

 for dep_arg, dep in self.dependencies.items():
 kwargs[dep_arg] = dep.solve(
 stack=stack,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 nested=True,
 **kwargs,
 )

 for custom in self.custom_fields.values():
 if custom.field:
 custom.use_field(kwargs)
 else:
 kwargs = custom.use(**kwargs)

 final_args, final_kwargs, call = cast_gen.send(kwargs)

 if self.is_generator and nested:
 response = solve_generator_sync(
 *final_args,
 call=call,
 stack=stack,
 **final_kwargs,
 )
 else:
 response = call(*final_args, **final_kwargs)

 try:
 cast_gen.send(response)
 except StopIteration as e:
 value: T = e.value

 if not self.cast or nested or not self.is_generator:
 return value
 else:
 return map(self._cast_response, value)

 raise AssertionError("unreachable")

 async def asolve(
 self,
 /,
 *args: t.Tuple[t.Any, ...],
 stack: AsyncExitStack,
 cache_dependencies: t.Dict[
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 T,
 ],
 dependency_overrides: t.Optional[
 t.Dict[
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 ]
 ] = None,
 nested: bool = False,
 **kwargs: t.Dict[str, t.Any],
 ) -> T:
 cast_gen = self._solve(
 *args,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 **kwargs,
 )
 try:
 args, kwargs, _ = next(cast_gen)
 except StopIteration as e:
 cached_value: T = e.value
 return cached_value

 # Heat cache and solve extra dependencies
 dep_to_solve: t.List[t.Callable[..., t.Awaitable[t.Any]]] = []
 try:
 async with anyio.create_task_group() as tg:
 for dep, subdep in self.sorted_dependencies:
 solve = partial(
 dep.asolve,
 *args,
 stack=stack,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 nested=True,
 **kwargs,
 )
 if not subdep:
 tg.start_soon(solve)
 else:
 dep_to_solve.append(solve)
 except Exception as e:
 raise e

 for i in dep_to_solve:
 await i()

 # Always get from cache
 for dep in self.extra_dependencies:
 await dep.asolve(
 *args,
 stack=stack,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 nested=True,
 **kwargs,
 )

 for dep_arg, dep in self.dependencies.items():
 kwargs[dep_arg] = await dep.asolve(
 stack=stack,
 cache_dependencies=cache_dependencies,
 dependency_overrides=dependency_overrides,
 nested=True,
 **kwargs,
 )

 custom_to_solve: t.List[CustomField] = []

 try:
 async with anyio.create_task_group() as tg:
 for custom in self.custom_fields.values():
 if custom.field:
 tg.start_soon(run_async, custom.use_field, kwargs)
 else:
 custom_to_solve.append(custom)
 except Exception as e:
 raise e

 for j in custom_to_solve:
 kwargs = await run_async(j.use, **kwargs)

 final_args, final_kwargs, call = cast_gen.send(kwargs)

 if self.is_generator and nested:
 response = await solve_generator_async(
 *final_args,
 call=call,
 stack=stack,
 **final_kwargs,
 )
 else:
 response = await run_async(call, *final_args, **final_kwargs)

 try:
 cast_gen.send(response)
 except StopIteration as e:
 value: T = e.value

 if not self.cast or nested or not self.is_generator:
 return value
 else:
 return async_map(self._cast_response, value)

 raise AssertionError("unreachable")


def _sort_dep(
 collector: t.List["CallModel[..., t.Any]"],
 items: t.Tuple[
 "CallModel[..., t.Any]",
 t.Tuple[t.Callable[..., t.Any], ...],
 ],
 flat: t.Dict[
 t.Callable[..., t.Any],
 t.Tuple[
 "CallModel[..., t.Any]",
 t.Tuple[t.Callable[..., t.Any], ...],
 ],
 ],
) -> None:
 model, calls = items

 if model in collector:
 return

 if not calls:
 position = -1
 else:
 for i in calls:
 sub_model, _ = flat[i]
 if sub_model not in collector:
 _sort_dep(collector, flat[i], flat)

 position = max(collector.index(flat[i][0]) for i in calls)

 collector.insert(position + 1, model)


CUSTOM_ANNOTATIONS = (Depends, CustomField)


def build_call_model(
 call: t.Union[t.Callable[P, T], t.Callable[P, t.Awaitable[T]]],
 *,
 cast: bool = True,
 use_cache: bool = True,
 is_sync: t.Optional[bool] = None,
 extra_dependencies: t.Sequence[Depends] = (),
 pydantic_config: t.Optional[ConfigDict] = None,
) -> CallModel[P, T]:
 """Build a CallModel from a callable."""
 name = getattr(call, "__name__", type(call).__name__)

 is_call_async = is_coroutine_callable(call) or is_async_gen_callable(call)
 if is_sync is None:
 is_sync = not is_call_async
 else:
 assert not (is_sync and is_call_async), f"You cannot use async dependency `{name}` at sync main"

 typed_params, return_annotation = get_typed_signature(call)
 if (is_call_generator := is_gen_callable(call) or is_async_gen_callable(call)) and (
 return_args := get_args(return_annotation)
 ):
 return_annotation = return_args[0]

 class_fields: t.Dict[str, t.Tuple[t.Any, t.Any]] = {}
 dependencies: t.Dict[str, CallModel[..., t.Any]] = {}
 custom_fields: t.Dict[str, CustomField] = {}
 positional_args: t.List[str] = []
 keyword_args: t.List[str] = []
 var_positional_arg: t.Optional[str] = None
 var_keyword_arg: t.Optional[str] = None

 for param_name, param in typed_params.parameters.items():
 dep: t.Optional[Depends] = None
 custom: t.Optional[CustomField] = None

 if param.annotation is inspect.Parameter.empty:
 annotation = t.Any
 elif get_origin(param.annotation) is Annotated:
 annotated_args = get_args(param.annotation)
 type_annotation = annotated_args[0]

 custom_annotations = []
 regular_annotations = []
 for arg in annotated_args[1:]:
 if isinstance(arg, CUSTOM_ANNOTATIONS):
 custom_annotations.append(arg)
 else:
 regular_annotations.append(arg)

 assert len(custom_annotations) <= 1, (
 f"Cannot specify multiple `Annotated` Custom arguments for `{param_name}`!"
 )

 next_custom = next(iter(custom_annotations), None)
 if next_custom is not None:
 if isinstance(next_custom, Depends):
 dep = next_custom
 elif isinstance(next_custom, CustomField):
 custom = deepcopy(next_custom)
 else:
 raise AssertionError("unreachable")

 if regular_annotations:
 annotation = param.annotation
 else:
 annotation = type_annotation
 else:
 annotation = param.annotation
 else:
 annotation = param.annotation

 default: t.Any
 if param.kind == inspect.Parameter.VAR_POSITIONAL:
 default = ()
 var_positional_arg = param_name
 elif param.kind == inspect.Parameter.VAR_KEYWORD:
 default = {}
 var_keyword_arg = param_name
 elif param.default is inspect.Parameter.empty:
 default = Ellipsis
 else:
 default = param.default

 if isinstance(default, Depends):
 if dep:
 raise AssertionError(
 "You can not use both `Depends` with `Annotated` and a default",
 )
 dep, default = default, Ellipsis

 elif isinstance(default, CustomField):
 if custom:
 raise AssertionError(
 "You can not use both `CustomField` with `Annotated` and a default",
 )
 custom, default = default, Ellipsis

 else:
 class_fields[param_name] = (annotation, default)

 if dep:
 if not cast:
 dep.cast = False

 if isinstance(dep.dependency, solve_wrapper):
 dep.dependency = dep.dependency.call

 dependencies[param_name] = build_call_model(
 dep.dependency,
 cast=dep.cast,
 use_cache=dep.use_cache,
 is_sync=is_sync,
 pydantic_config=pydantic_config,
 )

 if dep.cast is True:
 class_fields[param_name] = (annotation, Ellipsis)

 keyword_args.append(param_name)

 elif custom:
 assert not (is_sync and is_coroutine_callable(custom.use)), (
 f"You cannot use async custom field `{type(custom).__name__}` at sync `{name}`"
 )

 custom.set_param_name(param_name)
 custom_fields[param_name] = custom

 if custom.cast is False:
 annotation = t.Any

 if custom.required:
 class_fields[param_name] = (annotation, default)
 else:
 class_fields[param_name] = class_fields.get(param_name, (t.Optional[annotation], None))

 keyword_args.append(param_name)

 else:
 if param.kind is param.KEYWORD_ONLY:
 keyword_args.append(param_name)
 elif param.kind not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD):
 positional_args.append(param_name)

 func_model = create_model(
 name,
 __config__=get_config_base(pydantic_config),
 **class_fields,
 )

 response_model: t.Optional[t.Type[ResponseModel[T]]] = None
 if cast and return_annotation and return_annotation is not inspect.Parameter.empty:
 response_model = create_model(
 "ResponseModel",
 __config__=get_config_base(pydantic_config),
 response=(return_annotation, Ellipsis),
 )

 return CallModel(
 call=call,
 model=func_model,
 response_model=response_model,
 params=class_fields,
 cast=cast,
 use_cache=use_cache,
 is_async=is_call_async,
 is_generator=is_call_generator,
 dependencies=dependencies,
 custom_fields=custom_fields,
 positional_args=positional_args,
 keyword_args=keyword_args,
 var_positional_arg=var_positional_arg,
 var_keyword_arg=var_keyword_arg,
 extra_dependencies=[
 build_call_model(
 d.dependency,
 cast=d.cast,
 use_cache=d.use_cache,
 is_sync=is_sync,
 pydantic_config=pydantic_config,
 )
 for d in extra_dependencies
 ],
 )


class _InjectWrapper(t.Protocol[P, T]):
 def __call__(
 self,
 func: t.Callable[P, T],
 model: t.Optional[CallModel[P, T]] = None,
 ) -> t.Callable[P, T]: ...


@t.overload
def inject(
 func: None,
 *,
 cast: bool = True,
 extra_dependencies: t.Sequence[Depends] = (),
 pydantic_config: t.Optional[ConfigDict] = None,
 dependency_overrides_provider: t.Optional[t.Any] = dependency_provider,
 wrap_model: t.Callable[[CallModel[P, T]], CallModel[P, T]] = lambda x: x,
) -> _InjectWrapper[P, T]: ...


@t.overload
def inject(
 func: t.Callable[P, T],
 *,
 cast: bool = True,
 extra_dependencies: t.Sequence[Depends] = (),
 pydantic_config: t.Optional[ConfigDict] = None,
 dependency_overrides_provider: t.Optional[t.Any] = dependency_provider,
 wrap_model: t.Callable[[CallModel[P, T]], CallModel[P, T]] = lambda x: x,
) -> t.Callable[P, T]: ...


def inject(
 func: t.Optional[t.Callable[P, T]] = None,
 *,
 cast: bool = True,
 extra_dependencies: t.Sequence[Depends] = (),
 pydantic_config: t.Optional[ConfigDict] = None,
 dependency_overrides_provider: t.Optional[t.Any] = dependency_provider,
 wrap_model: t.Callable[[CallModel[P, T]], CallModel[P, T]] = lambda x: x,
) -> t.Union[t.Callable[P, T], _InjectWrapper[P, T]]:
 """Decorator to inject dependencies into a function."""
 decorator = _wrap_inject(
 dependency_overrides_provider=dependency_overrides_provider,
 wrap_model=wrap_model,
 extra_dependencies=extra_dependencies,
 cast=cast,
 pydantic_config=pydantic_config,
 )

 if func is None:
 return decorator
 else:
 return decorator(func)


def _wrap_inject(
 dependency_overrides_provider: t.Optional[t.Any],
 wrap_model: t.Callable[[CallModel[P, T]], CallModel[P, T]],
 extra_dependencies: t.Sequence[Depends],
 cast: bool,
 pydantic_config: t.Optional[ConfigDict],
) -> _InjectWrapper[P, T]:
 if (
 dependency_overrides_provider
 and getattr(dependency_overrides_provider, "dependency_overrides", None) is not None
 ):
 overrides = dependency_overrides_provider.dependency_overrides
 else:
 overrides = None

 def func_wrapper(
 func: t.Callable[P, T],
 model: t.Optional[CallModel[P, T]] = None,
 ) -> t.Callable[P, T]:
 if model is None:
 real_model = wrap_model(
 build_call_model(
 call=func,
 extra_dependencies=extra_dependencies,
 cast=cast,
 pydantic_config=pydantic_config,
 )
 )
 else:
 real_model = model

 if real_model.is_async:
 injected_wrapper: t.Callable[P, T]

 if real_model.is_generator:
 injected_wrapper = solve_wrapper(solve_async_gen, real_model, overrides)
 else:

 @wraps(func)
 async def injected_wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
 async with AsyncExitStack() as stack:
 r = await real_model.asolve(
 *args,
 stack=stack,
 dependency_overrides=overrides,
 cache_dependencies={},
 nested=False,
 **kwargs,
 )
 return r

 raise AssertionError("unreachable")

 else:
 if real_model.is_generator:
 injected_wrapper = solve_wrapper(solve_gen, real_model, overrides)
 else:

 @wraps(func)
 def injected_wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
 with ExitStack() as stack:
 r = real_model.solve(
 *args,
 stack=stack,
 dependency_overrides=overrides,
 cache_dependencies={},
 nested=False,
 **kwargs,
 )
 return r

 raise AssertionError("unreachable")

 return injected_wrapper

 return func_wrapper


class solve_async_gen:
 _iter: t.Optional[t.AsyncIterator[t.Any]] = None

 def __init__(
 self,
 model: "CallModel[..., t.Any]",
 overrides: t.Optional[t.Any],
 *args: t.Any,
 **kwargs: t.Any,
 ):
 self.call = model
 self.args = args
 self.kwargs = kwargs
 self.overrides = overrides

 def __aiter__(self) -> "solve_async_gen":
 self.stack = AsyncExitStack()
 return self

 async def __anext__(self) -> t.Any:
 if self._iter is None:
 stack = self.stack = AsyncExitStack()
 await self.stack.__aenter__()
 self._iter = t.cast(
 t.AsyncIterator[t.Any],
 (
 await self.call.asolve(
 *self.args,
 stack=stack,
 dependency_overrides=self.overrides,
 cache_dependencies={},
 nested=False,
 **self.kwargs,
 )
 ).__aiter__(),
 )

 try:
 r = await self._iter.__anext__()
 except StopAsyncIteration as e:
 await self.stack.__aexit__(None, None, None)
 raise e
 else:
 return r


class solve_gen:
 _iter: t.Optional[t.Iterator[t.Any]] = None

 def __init__(
 self,
 model: "CallModel[..., t.Any]",
 overrides: t.Optional[t.Any],
 *args: t.Any,
 **kwargs: t.Any,
 ):
 self.call = model
 self.args = args
 self.kwargs = kwargs
 self.overrides = overrides

 def __iter__(self) -> "solve_gen":
 self.stack = ExitStack()
 return self

 def __next__(self) -> t.Any:
 if self._iter is None:
 stack = self.stack = ExitStack()
 self.stack.__enter__()
 self._iter = t.cast(
 t.Iterator[t.Any],
 iter(
 self.call.solve(
 *self.args,
 stack=stack,
 dependency_overrides=self.overrides,
 cache_dependencies={},
 nested=False,
 **self.kwargs,
 )
 ),
 )

 try:
 r = next(self._iter)
 except StopIteration as e:
 self.stack.__exit__(None, None, None)
 raise e
 else:
 return r
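
The inject function above works both as a bare decorator (@inject) and with keyword options (@inject(cast=False)) because it checks whether func is None before deciding what to return. Here is a minimal, self-contained sketch of that dual-use pattern. The tagged decorator and its label option are hypothetical stand-ins that replace the dependency resolution step with something trivial.

```python
import functools
import typing as t

F = t.TypeVar("F", bound=t.Callable[..., t.Any])


def tagged(func: t.Optional[F] = None, *, label: str = "default") -> t.Any:
    """Hypothetical decorator usable as @tagged or @tagged(label=...)."""

    def decorator(inner: F) -> F:
        @functools.wraps(inner)
        def wrapper(*args: t.Any, **kwargs: t.Any) -> t.Any:
            # Stand-in for dependency resolution: pair the label with the result.
            return (label, inner(*args, **kwargs))

        return t.cast(F, wrapper)

    # Bare use (@tagged) passes the function directly. Parameterized use
    # (@tagged(label="x")) passes None and expects the decorator back.
    if func is None:
        return decorator
    return decorator(func)


@tagged
def one() -> int:
    return 1


@tagged(label="custom")
def two() -> int:
    return 2
```

Making every option keyword-only (the bare * in the signature) is what keeps the two call styles unambiguous: the only positional argument the decorator can ever receive is the function itself.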

Autostore: File Storage Made Simple

Sat, 14 Jun 2025 17:08:21 -0400

AutoStore provides a dictionary-like interface for reading and writing files with caching and different storage backends.

AutoStore eliminates the cognitive overhead of managing different file formats, letting you focus on your data and analysis rather than the mechanics of file I/O. It automatically handles file format detection, type inference, upload/download operations, and provides a clean, intuitive API for data persistence across local and cloud storage.

Table of contents:

Why Use AutoStore?
Getting Started
- Basic Usage
- Cloud Storage (S3)
Supported Data Types
Configuration Options
- S3StorageConfig
Advanced Features
Multiple Storage Backends
Performance Considerations
- Large File Handling
When to Use AutoStore
Comparison with Alternatives

Why Use AutoStore?

Simplicity: Store and retrieve data with dictionary syntax. No need to remember APIs for different file formats.
Caching: Caching system with configurable expiration reduces redundant downloads, especially for cloud storage.
Multiple Storage Backends: Seamlessly work with local files, S3, and other cloud storage services.
Type Detection: Automatically infers the best file format based on the data type.
Multiple Data Types: Built-in support for Polars DataFrames, JSON, CSV, images, PyTorch models, NumPy arrays, and more.
Extensible Architecture: Pluggable handler system for new data types and storage backends.
Performance Optimized: Upload/download operations with efficient handling of large files.
Type-Safe Configuration: Dataclass-based configuration with IDE support and validation.

Getting Started

AutoStore requires Python 3.10+ and can be installed via pip.

pip install autostore

Basic Usage

from autostore import AutoStore

store = AutoStore("./data")

# Write data - automatically saves with appropriate extensions
store["my_dataframe"] = df # ./data/my_dataframe.parquet
store["config"] = {"key": "value"} # ./data/config.json
store["logs"] = [{"event": "start"}] # ./data/logs.jsonl

# Read data
df = store["my_dataframe"] # Returns a Polars DataFrame
config = store["config"] # Returns a dict
logs = store["logs"] # Returns a list of dicts

Cloud Storage (S3)

from autostore import AutoStore
from autostore.s3 import S3Backend, S3StorageConfig

# Register S3 backend
AutoStore.register_backend("s3", S3Backend)

# Configure S3 with caching
s3_config = S3StorageConfig(
 region_name="us-east-1",
 cache_enabled=True,
 cache_expiry_hours=12,
 multipart_threshold=64 * 1024 * 1024 # 64MB
)

# Use S3 storage
store = AutoStore("s3://my-bucket/data/", config=s3_config)
store["experiment/results"] = {"accuracy": 0.95, "epochs": 100}
results = store["experiment/results"] # Uses cache on subsequent loads

Supported Data Types

Data Type	File Extension	Description	Library Required
Polars DataFrame/LazyFrame	`.parquet`, `.csv`	High-performance DataFrames	polars
Python dict/list	`.json`	Standard JSON serialization	built-in
List of dicts	`.jsonl`	JSON Lines format	built-in
Pydantic models	`.pydantic.json`	Structured data models	pydantic
Python dataclasses	`.dataclass.json`	Dataclass serialization	built-in
String data	`.txt`, `.html`, `.md`	Plain text files	built-in
NumPy arrays	`.npy`, `.npz`	Numerical data	numpy
SciPy sparse matrices	`.sparse`	Sparse matrix data	scipy
PyTorch tensors/models	`.pt`, `.pth`	Deep learning models	torch
PIL/Pillow images	`.png`, `.jpg`, etc.	Image data	Pillow
YAML data	`.yaml`, `.yml`	Human-readable config files	PyYAML
Any Python object	`.pkl`	Pickle fallback	built-in
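
To make the type detection idea concrete, here is a rough sketch of how a value could be mapped to one of the extensions in the table. This is not AutoStore's actual code: infer_extension is a hypothetical helper, it only covers the built-in types, and the order of the checks is my assumption.

```python
import dataclasses
import typing as t


def infer_extension(data: t.Any) -> str:
    """Hypothetical mapping from a Python value to a file extension."""
    # Dataclass instances (not the classes themselves) get their own format.
    if dataclasses.is_dataclass(data) and not isinstance(data, type):
        return ".dataclass.json"
    if isinstance(data, str):
        return ".txt"
    # A non-empty list made up entirely of dicts is treated as JSON Lines.
    if isinstance(data, list) and data and all(isinstance(x, dict) for x in data):
        return ".jsonl"
    if isinstance(data, (dict, list)):
        return ".json"
    # Anything else falls back to pickle.
    return ".pkl"
```

The more specific checks come first so that a list of dicts is not swallowed by the generic dict/list branch.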

Configuration Options

S3StorageConfig

from autostore.s3 import S3StorageConfig

config = S3StorageConfig(
 aws_access_key_id="your-key",
 aws_secret_access_key="your-secret",
 region_name="us-east-1",
 cache_enabled=True,
 cache_expiry_hours=12,
 multipart_threshold=64 * 1024 * 1024, # Files larger than this use multipart upload
 multipart_chunksize=16 * 1024 * 1024, # Chunk size for multipart uploads
 max_concurrency=10 # Maximum concurrent uploads/downloads
)

Advanced Features

Caching System

AutoStore includes an intelligent caching system that:

Stores frequently accessed files locally
Uses ETags for cache validation
Automatically expires old cache entries
Significantly improves performance for cloud storage

# Cache management
store.cleanup_cache() # Remove expired cache entries

# Check cache status
metadata = store.get_metadata("large_file")
print(f"File size: {metadata.size} bytes")
print(f"ETag: {metadata.etag}")
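
The two rules in the list above (expiry and ETag validation) can be combined into a single freshness check. The sketch below is an illustration under my own assumptions, not AutoStore's internals: CacheEntry and is_cache_valid are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CacheEntry:
    """Hypothetical record kept for each locally cached file."""

    etag: str
    cached_at: datetime


def is_cache_valid(
    entry: CacheEntry, remote_etag: str, expiry_hours: float, now: datetime
) -> bool:
    """A cached copy is usable only if it is recent and the ETag still matches."""
    not_expired = now - entry.cached_at < timedelta(hours=expiry_hours)
    return not_expired and entry.etag == remote_etag
```

Passing now in explicitly rather than calling datetime.now() inside the function keeps the check easy to test.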

Custom Data Handlers

Add support for new data types by creating custom handlers:

import json
from pathlib import Path
from autostore.autostore import DataHandler

class CustomLogHandler(DataHandler):
 def can_handle_extension(self, extension: str) -> bool:
 return extension.lower() == ".log"

 def can_handle_data(self, data) -> bool:
 return isinstance(data, list) and all(
 isinstance(item, dict) and "timestamp" in item
 for item in data
 )

 def read_from_file(self, file_path: Path, file_extension: str):
 logs = []
 with open(file_path, 'r') as f:
 for line in f:
 if line.strip():
 logs.append(json.loads(line))
 return logs

 def write_to_file(self, data, file_path: Path, file_extension: str):
 file_path.parent.mkdir(parents=True, exist_ok=True)
 with open(file_path, 'w') as f:
 for entry in data:
 f.write(json.dumps(entry) + '\n')

 @property
 def extensions(self):
 return [".log"]

 @property
 def priority(self):
 return 15

# Register the handler
store.register_handler(CustomLogHandler())

File Operations

# Check existence
if "config" in store:
 print("Config file exists")

# List all files
for key in store.keys():
 print(f"File: {key}")

# Get file metadata
metadata = store.get_metadata("large_dataset")
print(f"Size: {metadata.size} bytes")
print(f"Modified: {metadata.modified_time}")

# Copy and move files
store.copy("original", "backup")
store.move("temp_file", "permanent_file")

# Delete files
del store["old_data"]

Context Management

# Automatic cleanup of temporary files and cache
with AutoStore("./data", config=config) as store:
 store["data"] = large_dataset
 results = store["data"]
# Temporary files are automatically cleaned up here

Multiple Storage Backends

AutoStore supports pluggable storage backends:

# Local storage
local_store = AutoStore("./data")

# S3 storage
s3_store = AutoStore("s3://bucket/prefix/")

Performance Considerations

Large File Handling

AutoStore automatically optimizes for large files:

Multipart uploads/downloads for files > 64MB
Configurable chunk sizes and concurrency
Streaming operations to minimize memory usage
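
As a worked example of the arithmetic behind multipart_threshold and multipart_chunksize, the sketch below counts how many parts an upload would be split into. plan_upload is a hypothetical helper, and the defaults mirror the 64MB and 16MB values shown in the S3StorageConfig example.

```python
MB = 1024 * 1024


def plan_upload(
    file_size: int,
    multipart_threshold: int = 64 * MB,
    multipart_chunksize: int = 16 * MB,
) -> int:
    """Return the number of parts; 1 means a single-request upload."""
    # Files at or below the threshold are sent in one request.
    if file_size <= multipart_threshold:
        return 1
    # Ceiling division without going through floats.
    return (file_size + multipart_chunksize - 1) // multipart_chunksize
```

With these defaults a 100MB file is split into 7 parts (six full 16MB chunks plus a 4MB remainder), while a 64MB file is still uploaded in one piece.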

When to Use AutoStore

Choose AutoStore when you need:

Mixed file types and cloud storage in data science projects
Data pipelines with heterogeneous data sources
Rapid prototyping where you don’t want to think about file formats
Consistent data access patterns across local and cloud environments
Performance optimization through intelligent caching
Easy extensibility for custom data types and storage backends
Type-safe configuration with dataclass-based settings

Don’t choose AutoStore when:

You need complex queries (use TinyDB or databases)
You only work with one data type consistently
You need zero dependencies (use Shelve)
You require advanced database features

Comparison with Alternatives

Feature	AutoStore	Shelve	DiskCache	TinyDB	PickleDB	SQLiteDict
Multi-format Support	✅ 12+ formats	❌ Pickle only	❌ Pickle only	❌ JSON only	❌ JSON only	❌ Pickle only
Auto Format Detection	✅ Smart inference	❌ Manual	❌ Manual	❌ Manual	❌ Manual	❌ Manual
Cloud Storage	✅ S3, extensible	❌ Local only	❌ Local only	❌ Local only	❌ Local only	❌ Local only
Intelligent Caching	✅ ETag-based	❌ None	✅ Advanced	❌ None	❌ None	❌ None
Type-Safe Config	✅ Dataclasses	❌ None	✅ Classes	❌ Dicts	❌ None	❌ None
Large File Handling	✅ Multipart	❌ Limited	✅ Good	❌ Limited	❌ Limited	❌ Limited
Extensibility	✅ Handler system	❌ Limited	❌ Limited	✅ Middleware	❌ Limited	❌ Limited
Performance	✅ Cached/Optimized	🔶 Medium	✅ Fast	🔶 Medium	🔶 Medium	🔶 Medium
Standard Library	❌ External	✅ Built-in	❌ External	❌ External	❌ External	❌ External

Pypertext: HTML the Pythonic way

Sat, 14 Jun 2025 00:00:00 +0000

Create HTML elements the Pythonic way with a chainable, expressive API.

from pypertext import ht

page = ht.div(
 ht.h1("Welcome to Pypertext"),
 ht.p("Build HTML with Python!", style={"color": "blue"}),
 classes=["container", "page"],
 id="main-content"
)
print(page)
# <div class="container page" id="main-content">
# <h1>Welcome to Pypertext</h1>
# <p style="color: blue;">Build HTML with Python!</p>
# </div>

Table of contents:

Install
Core Features
API Reference

Install

pip install pypertext

Core Features

🏗️ Element Creation with `ht`

Create any HTML element using the ht factory:

from pypertext import ht

# Basic elements
ht.div("Hello World")
ht.span("Text", id="my-span")
ht.button("Click me", type="submit", classes=["btn", "primary"])

# Self-closing elements
ht.img(src="image.jpg", alt="Description")
ht.input(type="text", placeholder="Enter name")
ht.br()

⛓️ Chainable Operations

Build complex HTML structures with method chaining and operators:

# Using + operator to add children
container = ht.div(classes=["container"])
container + ht.h1("Title") + ht.p("Content") + ht.button("Action")

# Using += for incremental building
form = ht.form(action="/submit", method="post")
form += ht.input(type="text", name="username", placeholder="Username")
form += ht.input(type="password", name="password", placeholder="Password")
form += ht.button("Login", type="submit")

# Using call syntax for chaining
nav = ht.nav()
nav(
 ht.a("Home", href="/"),
 ht.a("About", href="/about"),
 ht.a("Contact", href="/contact"),
 classes=["navigation"]
)

🎨 Dynamic Styling with dict2css

Convert Python dictionaries to CSS with support for nested selectors:

from pypertext import dict2css, ht

# Simple CSS
styles = {
 "body": {"margin": "0", "font-family": "Arial, sans-serif"},
 ".container": {"max-width": "1200px", "margin": "0 auto"}
}

# Nested selectors with pseudo-classes
advanced_styles = {
 ".card": {
 "padding": "20px",
 "border": "1px solid #ddd",
 ":hover": {"box-shadow": "0 4px 8px rgba(0,0,0,0.1)"},
 ".title": {"font-size": "1.5rem", "margin-bottom": "10px"},
 "&.featured": {"border-color": "gold"},
 "> .content": {"line-height": "1.6"}
 }
}

css = dict2css(advanced_styles)
ht.style(css) # <style>...generated CSS rules...</style>
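
To show what flattening nested selectors involves, here is a small standalone sketch. It is not pypertext's implementation: flatten_css is a hypothetical function, the selector rules are my reading of the example above, and at-rules such as @media are not handled.

```python
def flatten_css(rules: dict, parent: str = "") -> str:
    """Hypothetical flattener for nested style dicts (not pypertext's code)."""
    blocks = []
    for selector, body in rules.items():
        if selector.startswith("&"):
            full = parent + selector[1:]   # "&.featured" -> ".card.featured"
        elif selector.startswith(":"):
            full = parent + selector       # ":hover" -> ".card:hover"
        elif parent:
            full = f"{parent} {selector}"  # descendant or "> child" selector
        else:
            full = selector
        # Plain values are declarations; dict values are nested rules.
        declarations = {k: v for k, v in body.items() if not isinstance(v, dict)}
        nested = {k: v for k, v in body.items() if isinstance(v, dict)}
        if declarations:
            props = " ".join(f"{k}: {v};" for k, v in declarations.items())
            blocks.append(f"{full} {{ {props} }}")
        if nested:
            blocks.append(flatten_css(nested, full))
    return "\n".join(blocks)
```

Each nested dict becomes its own top-level rule whose selector is built from the parent's, which is the same idea Sass and similar preprocessors use.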

📄 Full Document Creation

Create complete HTML documents with the Document class:

from pypertext import Document, ht

# Basic document
doc = Document(page_title="My Website")
doc += ht.header(
 ht.nav(
 ht.a("Home", href="/"),
 ht.a("Blog", href="/blog"),
 classes=["main-nav"]
 )
)
doc += ht.main(
 ht.h1("Welcome"),
 ht.p("This is my website built with Pypertext!"),
 classes=["content"]
)
doc += ht.footer("© 2025 My Website")

print(doc)
# Outputs complete HTML document with DOCTYPE, head, and body

🔧 Flexible Content Types

Add almost any type of content as children:

# Strings and numbers
ht.div("Text content", 42, 3.14)

# Lists and iterables
ht.ul([ht.li(item) for item in ["Apple", "Banana", "Cherry"]])

# Functions for dynamic content
def get_current_time() -> str:
 from datetime import datetime
 return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

ht.div("Current time: ", get_current_time)

# Other elements
header = ht.header(ht.h1("Site Title"))
main = ht.main("Content")
page = ht.div(header, main, classes=["page-layout"])

🏷️ Modify elements

element = ht.div("Content")

# Add classes
element.add_classes("container", "primary")
element.add_classes(["responsive", "animated"])

# Check for classes
if element.has_classes("container", "primary"):
 print("Has required classes")

# Remove classes
element.remove_classes("animated")

# Merge with existing classes
element.merge_attrs(classes=["new-class"])

# Append children
element.append(ht.div("Child"))

# Extend children
element.extend(ht.div("Hello"), ht.div("World"))

# Insert children at index position
element.insert(0, ht.div("First"))

📝 Attribute Handling

Flexible attribute management:

# Set attributes
button = ht.button("Submit")
button.set_attrs(type="submit", disabled=True, data_action="form-submit")

# Merge attributes (combines values for duplicate keys)
button.merge_attrs(classes=["btn"], data_extra="value")

# Dictionary-style attribute assignment
form = ht.form() + {"method": "post", "action": "/submit"}

# Style dictionaries
styled_div = ht.div(
 "Styled content",
 style={"background": "blue", "color": "white", "padding": "10px"}
)

# Private attributes starting with an underscore hold state and are not rendered
el = ht.div("Private state", _metadata={"key": "value"})
el.attributes["_metadata"] # Access private attributes

🔄 Method Chaining and Pipes

Chain operations for readable code:

# Method chaining
result = (
 ht.div("Base content")
 .add_classes("container", "main")
 .set_attrs(id="content-area")
 .append(ht.p("Additional paragraph"))
)

# Pipe pattern for custom transformations
def add_bootstrap_classes(element):
 element.add_classes("d-flex", "justify-content-center")
 return element

def add_data_attributes(element, **data):
 for key, value in data.items():
 element.set_attrs(**{f"data_{key}": value})
 return element

card = (
 ht.div("Card content")
 .pipe(add_bootstrap_classes)
 .pipe(add_data_attributes, toggle="modal", target="#myModal")
)

🌐 ASGI Integration

Use Document as an ASGI application with Starlette/FastAPI:

from starlette.applications import Starlette
from starlette.routing import Route
from pypertext import Document, ht

async def homepage(request):
 doc = Document(page_title="My App")
 doc += ht.h1("Hello from Pypertext!")
 doc += ht.p("This page was generated with Python")
 return doc

async def user_profile(request):
 username = request.path_params['username']
 doc = Document(page_title=f"Profile - {username}")
 doc += ht.h1(f"Welcome, {username}!")
 doc += ht.p("Your profile page")
 return doc

app = Starlette(routes=[
 Route("/", homepage),
 Route("/user/{username}", user_profile),
])

🧩 Custom Components

Create reusable components:

def card(title, content, **attrs):
 return ht.div(
 ht.div(title, classes=["card-title"]),
 ht.div(content, classes=["card-content"]),
 classes=["card"],
 **attrs
 )

def alert(message, type="info"):
 return ht.div(
 message,
 classes=["alert", f"alert-{type}"],
 role="alert"
 )

# Usage
page = ht.div(
 card("Welcome", "This is a reusable card component"),
 alert("Success! Your data was saved.", type="success"),
 classes=["page"]
)

Any class with a get_element method can be used as an Element:

class Book:
 def __init__(self, title: str, author: str):
 self.title = title
 self.author = author

 def get_element(self):
 return ht.div(
 ht.h2(self.title, classes=["book-title"]),
 ht.p(f"by {self.author}", classes=["book-author"]),
 classes=["book"]
 )

book = Book("1984", "George Orwell")
page = ht.div(
 book,
 classes=["book-page"]
)

CSS-in-Python Styling

app_styles = {
 ":root": {
 "--primary-color": "#007bff",
 "--secondary-color": "#6c757d",
 "--font-family": "system-ui, sans-serif"
 },
 "body": {
 "font-family": "var(--font-family)",
 "line-height": "1.6",
 "margin": "0"
 },
 ".container": {
 "max-width": "1200px",
 "margin": "0 auto",
 "padding": "0 20px"
 },
 ".btn": {
 "display": "inline-block",
 "padding": "0.5rem 1rem",
 "border": "none",
 "border-radius": "0.25rem",
 "cursor": "pointer",
 "text-decoration": "none",
 "&:hover": {
 "opacity": "0.8"
 },
 "&.btn-primary": {
 "background-color": "var(--primary-color)",
 "color": "white"
 }
 },
 "@media (max-width: 768px)": {
 ".container": {
 "padding": "0 10px"
 },
 ".btn": {
 "width": "100%",
 "text-align": "center"
 }
 }
}

# Create complete styled page
doc = Document(page_title="Styled App")
doc.head += ht.style(dict2css(app_styles))
doc += ht.div(
 ht.h1("Styled with Pypertext"),
 ht.p("This page uses CSS-in-Python styling"),
 ht.button("Click me", classes=["btn", "btn-primary"]),
 classes=["container"]
)

API Reference

Core Classes

ht: Factory for creating HTML elements
Element: Base class for all HTML elements with chainable methods
Document: Specialized element for complete HTML documents with ASGI support
dict2css: Function to convert Python dictionaries to CSS strings

Element Methods

.add_classes(*classes): Add CSS classes
.remove_classes(*classes): Remove CSS classes
.has_classes(*classes): Check if element has classes
.set_attrs(**attrs): Set attributes (replaces existing)
.merge_attrs(**attrs): Merge attributes (combines existing)
.append(*children): Add children to element
.extend(*children): Extend children elements
.insert(index, *children): Insert children at specific position
.pipe(function, *args, **kwargs): Apply custom function to element

Document attributes

.head: <head> Element, contains <title>, <style>, etc.
.body: <body> Element, contains main children of the document
.title: <title> Element, holds the document title
.html: <html> Element, root of the document
.page_title: String, title of the document for <title> tag
.headers: A dictionary of response headers used in ASGI responses
.status_code: Integer, HTTP status code for ASGI responses

Presskit: Database-driven static site generator

Sat, 07 Jun 2025 00:00:00 +0000

Nearly all static site generators convert Markdown to HTML but aren’t good at generating multiple pages from database queries. Ideally we would store data in SQLite or Postgres databases and the static site generator would run queries against the database to create a page for each row in the result. Presskit was invented to do just that.

Presskit is a powerful static site generator that combines Markdown content with Jinja2 templating and database-driven page generation. Presskit lets you build dynamic static sites by connecting your content to SQLite databases and JSON data sources.

Table of contents:

Key Features
Installation
Quick Start
Basic Usage
Template Variables
Using Variables in Markdown
Data Sources and Queries
Generating Pages
Commands
- Build Commands
- Development
Advanced Configuration
- Full Configuration Example
- Custom Filters

Key Features

Jinja2 Templating: Use Jinja2 variables and logic in both Markdown content and HTML layouts
Database Integration: Load data from SQLite databases and JSON files
Dynamic Page Generation: Generate multiple pages automatically from SQLite query results
Structured Context: Access site metadata, build information, and data through a clean template context

Installation

pip install presskit

Or you can use Astral’s uv Python package manager to install Presskit as a self-contained tool so it can be run from the command line without needing to activate a virtual environment:

uv tool install presskit

Quick Start

Create a new site directory:

mkdir my-site
cd my-site

Create the basic structure:

my-site/
├── presskit.json # Configuration file
├── content/ # Markdown files
├── templates/ # HTML templates
└── public/ # Generated output (created automatically)

Build your site:

presskit build

Basic Usage

Writing Markdown Content

Create Markdown files in the content/ directory. Each file can include YAML front matter for metadata:

---
title: "Welcome to My Site"
description: "A brief introduction"
layout: page
---
# Welcome
This is my **awesome** site built with Presskit!
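
For readers curious how front matter is separated from the body, here is a minimal sketch using only the standard library. It is not Presskit's parser: split_front_matter is a hypothetical helper, and it only understands flat key: value pairs, so a real YAML parser would be needed for nested values such as queries.

```python
def split_front_matter(text: str) -> tuple[dict, str]:
    """Split '---' delimited front matter from the Markdown body."""
    lines = text.splitlines()
    # No opening delimiter means the whole file is body.
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body
```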

Creating HTML Templates

Templates go in the templates/ directory. Here’s a basic page.html template:

<!DOCTYPE html>
<html lang="{{ site.language }}">
<head>
 <meta charset="UTF-8">
 <title>{{ page.title or site.title }}</title>
 <meta name="description" content="{{ page.description or site.description }}">
</head>
<body>
 <header>
 <h1>{{ site.title }}</h1>
 </header>

 <main>
 {{ page.content }}
 </main>

 <footer>
 <p>&copy; {{ build.year }} {{ site.author }}</p>
 </footer>
</body>
</html>

Configuration

Create a presskit.json file to configure your site:

{
 "title": "My Awesome Site",
 "description": "A site built with Presskit",
 "author": "Your Name",
 "url": "https://mysite.com",
 "language": "en-US"
}

Template Variables

Presskit provides a structured context with the following variables available in all templates:

Site Variables (`site.*`)

site.title - Site title
site.description - Site description
site.author - Site author
site.url - Base site URL
site.version - Site version
site.language - Site language

Build Variables (`build.*`)

build.date - Build date (YYYY-MM-DD)
build.year - Build year
build.timestamp - Full build timestamp
build.iso_date - Build date in ISO format

Page Variables (`page.*`)

page.filename - Page filename without extension
page.filepath - Full file path
page.path - Clean URL path
page.layout - Template layout name
page.content - Processed HTML content (in templates)
page.title - Page title from front matter
page.description - Page description from front matter

Data Variables (`data.*`)

data.queries - Results from named queries
data.sources - JSON data sources
data.page_queries - Page-specific query results

Any custom variables from your front matter are also available at the top level.

Using Variables in Markdown

You can use Jinja2 templating directly in your Markdown content:

---
title: About
category: personal
---
# About {{ site.author }}
This site was built on {{ build.date }} and is currently version {{ site.version }}.
{% if category == "personal" %}
This is a personal page about {{ site.author }}.
{% endif %}

Data Sources and Queries

Presskit’s data integration connects your static site to data sources, so content can be generated from data at build time while the output keeps the performance benefits of plain static files.

This makes data-driven pages possible: statistics, reports, or any other structured data. It suits portfolios showcasing project metrics, business dashboards, and documentation sites that pull data from APIs.

It also encourages separation of concerns: content lives in databases, where it can be easily edited, queried, and managed, while the site structure remains in version control.

Configuring Data Sources

Add data sources to your presskit.json:

{
 "title": "My Blog",
 "sources": {
 "blog_db": {
 "type": "sqlite",
 "path": "data/blog.db"
 },
 "config": {
 "type": "json",
 "path": "data/site-config.json"
 }
 },
 "default_source": "blog_db"
}

Adding Queries

Define queries to load data from your sources:

{
 "sources": {
 "blog_db": {
 "type": "sqlite",
 "path": "data/blog.db"
 }
 },
 "queries": [
 {
 "name": "recent_posts",
 "source": "blog_db",
 "query": "SELECT title, slug, date, excerpt FROM posts ORDER BY date DESC LIMIT 5"
 },
 {
 "name": "categories",
 "source": "blog_db",
 "query": "SELECT categories.name, categories.slug, COUNT(*) AS post_count FROM categories JOIN posts ON categories.id = posts.category_id GROUP BY categories.id"
 }
 ]
}

Using Query Data in Templates

Access query results through the data.queries object:

<section class="recent-posts">
 <h2>Recent Posts</h2>
 {% for post in data.queries.recent_posts %}
 <article>
 <h3><a href="/posts/{{ post.slug }}">{{ post.title }}</a></h3>
 <time>{{ post.date | date_format('%B %d, %Y') }}</time>
 <p>{{ post.excerpt }}</p>
 </article>
 {% endfor %}
</section>

<aside class="categories">
 <h3>Categories</h3>
 <ul>
 {% for category in data.queries.categories %}
 <li><a href="/category/{{ category.slug }}">{{ category.name }} ({{ category.post_count }})</a></li>
 {% endfor %}
 </ul>
</aside>

Page-Level Queries

You can also define queries in individual Markdown files:

---
title: "Author Profile"
queries:
 author_posts:
 source: "blog_db"
 query: "SELECT title, slug, date FROM posts WHERE author_id = {{ author_id }} ORDER BY date DESC"
variables:
 author_id: 123
---

# {{ author.name }}

## Recent Posts by This Author

{% for post in data.page_queries.author_posts %}
- [{{ post.title }}](/posts/{{ post.slug }}) - {{ post.date | date_format('%Y-%m-%d') }}
{% endfor %}

The above example shows how to define a query that fetches posts by a specific author using the author_id variable.

Generating Pages

The most powerful feature of Presskit is generating multiple pages from database queries.

Generator Queries

Mark a query as a generator to create multiple pages:

{
 "queries": [
 {
 "name": "blog_posts",
 "source": "blog_db",
 "query": "SELECT title, slug, content, date, author FROM posts WHERE published = 1",
 "generator": true,
 "template": "post",
 "output_path": "posts/#{slug}"
 }
 ]
}

Generator Configuration

generator: true - Marks this as a page generator
template - Template to use for generated pages
output_path - Path pattern with placeholders like #{field_name}
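
The #{field_name} substitution can be pictured as a small regular-expression replace over each result row. The sketch below is an illustration only: render_output_path is a hypothetical name, and how Presskit handles missing fields is not covered here (this version raises KeyError).

```python
import re


def render_output_path(pattern: str, row: dict) -> str:
    """Replace each #{field} placeholder with the matching value from row."""

    def substitute(match: re.Match) -> str:
        # group(1) is the field name between the braces.
        return str(row[match.group(1)])

    return re.sub(r"#\{(\w+)\}", substitute, pattern)
```

For a row with slug "hello-world", the pattern posts/#{slug} becomes posts/hello-world, so each row of the generator query maps to its own output file.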

Creating Generator Templates

Create a template for your generated pages (templates/post.html):

<!DOCTYPE html>
<html>
<head>
 <title>{{ title }} | {{ site.title }}</title>
</head>
<body>
 <article>
 <h1>{{ title }}</h1>
 <time>{{ date | date_format('%B %d, %Y') }}</time>
 <div class="content">
 {{ content | safe }}
 </div>
 <p>By {{ author }}</p>
 </article>

 <nav>
 <a href="/">← Back to Home</a>
 </nav>
</body>
</html>

Nested Queries

You can create parent-child query relationships:

{
 "queries": [
 {
 "name": "authors",
 "source": "blog_db",
 "query": "SELECT id, name, bio, slug FROM authors"
 },
 {
 "name": "authors.posts",
 "source": "blog_db",
 "query": "SELECT title, slug, date FROM posts WHERE author_id = {{ id }} ORDER BY date DESC"
 }
 ]
}

Access nested data in templates:

{% for author in data.queries.authors %}
<div class="author">
 <h2>{{ author.name }}</h2>
 <p>{{ author.bio }}</p>

 <h3>Posts by {{ author.name }}</h3>
 {% for post in author.posts %}
 <p><a href="/posts/{{ post.slug }}">{{ post.title }}</a> - {{ post.date }}</p>
 {% endfor %}
</div>
{% endfor %}
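
Conceptually, a nested query runs the child query once per parent row and attaches the results under the child's name. Here is a self-contained sketch with an in-memory SQLite database. The table contents and the run_nested function are made up for illustration, and it binds the parent id as a SQL parameter instead of rendering it through Jinja2 as Presskit's query syntax does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (title TEXT, author_id INTEGER, date TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES
        ('First', 1, '2025-01-01'),
        ('Second', 1, '2025-02-01'),
        ('Hello', 2, '2025-03-01');
""")


def run_nested(conn: sqlite3.Connection) -> list:
    """Run the parent query, then the child query once per parent row."""
    authors = []
    parents = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for row in parents:
        author = dict(row)
        children = conn.execute(
            "SELECT title, date FROM posts WHERE author_id = ? ORDER BY date DESC",
            (author["id"],),
        ).fetchall()
        # Attach child rows under the key a template would use: author.posts
        author["posts"] = [dict(child) for child in children]
        authors.append(author)
    return authors


authors = run_nested(conn)
```

Each author dict ends up with a posts list, which is the shape the template loop over author.posts expects.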

Commands

Build Commands

# Build entire site
presskit build

# Build specific file
presskit build content/about.md

# Execute queries and cache results
presskit data

# Generate pages from generator queries 
presskit generate

# Check query cache status
presskit status

Development

# Start development server
presskit server

# Clean build artifacts
presskit clean

Advanced Configuration

Full Configuration Example

{
 "title": "My Blog",
 "description": "A blog about web development",
 "author": "Jane Developer",
 "url": "https://myblog.dev",
 "version": "2.1.0",
 "language": "en-US",

 "content_dir": "content",
 "templates_dir": "templates",
 "output_dir": "public",
 "cache_dir": ".cache",

 "default_template": "page",
 "markdown_extension": "md",
 "workers": 8,

 "server_host": "0.0.0.0",
 "server_port": 8000,

 "sources": {
 "blog_db": {
 "type": "sqlite",
 "path": "data/blog.sqlite3"
 },
 "config": {
 "type": "json",
 "path": "data/config.json"
 }
 },

 "default_source": "blog_db",

 "variables": {
 "environment": "production",
 "analytics_id": "GA-XXXXX"
 },

 "queries": [
 {
 "name": "posts",
 "source": "blog_db",
 "query": "SELECT * FROM posts WHERE status = 'published' ORDER BY date DESC",
 "generator": true,
 "template": "post",
 "output_path": "blog/#{slug}"
 },
 {
 "name": "recent_posts",
 "source": "blog_db",
 "query": "SELECT title, slug, excerpt, date FROM posts WHERE status = 'published' ORDER BY date DESC LIMIT 5"
 }
 ]
}

Custom Filters

Presskit includes useful Jinja2 filters:

date_format(format)
Format dates (e.g., {{ date | date_format('%B %d, %Y') }})
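
A filter like this is just a function that takes the piped value as its first argument. The sketch below shows one plausible shape for it. It is not Presskit's source, and accepting ISO-formatted strings as well as date objects is my assumption, made because SQLite commonly stores dates as text.

```python
from datetime import date, datetime


def date_format(value, fmt: str = "%Y-%m-%d") -> str:
    """Format a date, datetime, or ISO-8601 string with strftime codes."""
    if isinstance(value, str):
        # Assumption: text values are ISO formatted, e.g. "2025-06-07".
        value = datetime.fromisoformat(value)
    return value.strftime(fmt)
```

In a plain Jinja2 environment, a function like this would be registered with env.filters["date_format"] = date_format.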

Google Chrome On-Device Embedding Model

Tue, 13 May 2025 00:00:00 +0000

Google Chrome bundles a text embedding model used to cluster browsing history as part of the Topics API and for semantic search. They also ship a number of other models with Chrome.

First I had to track down these on-device models. I started at the usual place where apps store their application data, the ~/Library/Application Support/ folder on macOS. Running find ~/Library/Application\ Support/ -maxdepth 4 -name 'optimization*' returns the folder I was looking for: ~/Library/Application Support/Google/Chrome/optimization_guide_model_store. The Chromium source code says this about the optimization guide: “The optimization guide component contains code for processing hints and machine learning models received from the remote Chrome Optimization Guide Service”.

Models are stored as tflite files, a format that has since been rebranded as LiteRT:

LiteRT (short for Lite Runtime), formerly known as TensorFlow Lite, is Google’s high-performance runtime for on-device AI. LiteRT is also the format Google uses to ship their Gemma3N edge-device models.

$ cd ~/Library/Application\ Support/Google/Chrome/optimization_guide_model_store
$ tree -L 4 .
.
├── 13
│   └── E6DC4029A1E4B4C1
│   └── 205CA176C885321E
│   ├── model-info.pb
│   └── model.tflite
├── 15
│   └── E6DC4029A1E4B4C1
│   └── 255B83C178FA9DD9
│   ├── model-info.pb
│   ├── model.tflite
│   ├── override_list.pb.gz
│   └── VERSION.txt
├── 2
│   └── E6DC4029A1E4B4C1
│   └── 2DFDB6405E512759
│   ├── model-info.pb
│   └── model.tflite
├── 20
│   └── E6DC4029A1E4B4C1
│   └── 91EF641BEE15B40C
│   ├── model-info.pb
│   └── model.tflite
├── 24
│   └── E6DC4029A1E4B4C1
│   └── 1B8C0D25285420AB
│   ├── enus_denylist_encoded_241007.txt
│   ├── model-info.pb
│   ├── model.tflite
│   └── vocab_en-us.txt
├── 25
│   └── E6DC4029A1E4B4C1
│   └── C278361C6A5A6107
│   ├── model-info.pb
│   ├── model.tflite
│   └── visual_model_desktop.tflite
├── 26
│   └── E6DC4029A1E4B4C1
│   └── 141FCE0CF6807549
│   ├── model-info.pb
│   └── model.tflite
├── 43
│   └── E6DC4029A1E4B4C1
│   └── E234446CB5BACE99
│   ├── model-info.pb
│   ├── model.tflite
│   └── sentencepiece.model
├── 45
│   └── E6DC4029A1E4B4C1
│   └── 063B3FABDDE10CE8
│   ├── model-info.pb
│   └── model.tflite
└── 9
 └── E6DC4029A1E4B4C1
 └── B5ECF67C32B2BD47
 ├── model-info.pb
 └── model.tflite

31 directories, 26 files

One of the models (255B83C178FA9DD9) belongs to the “Browsing Topics Privacy Sandbox feature”, which maps recent browsing history to a set of interest-based categories to serve relevant ads. The Topics API is a replacement for the FLoC proposal. The VERSION.txt file refers to a taxonomy, which is presumably related to the interest-based categories.

The visual_model_desktop.tflite file appears to be part of a phishing classifier based on the Chromium source code.

There is a sentencepiece.model file which is a text-tokenizer. The accompanying model.tflite file is the largest at 107 MB.

$ find . -type f -name "*model.tflite" -print0 | xargs -0 du -h

64K ./20/E6DC4029A1E4B4C1/91EF641BEE15B40C/model.tflite
16K ./9/E6DC4029A1E4B4C1/B5ECF67C32B2BD47/model.tflite
132K ./45/E6DC4029A1E4B4C1/063B3FABDDE10CE8/model.tflite
8.0K ./26/E6DC4029A1E4B4C1/141FCE0CF6807549/model.tflite
107M ./43/E6DC4029A1E4B4C1/E234446CB5BACE99/model.tflite <---
4.4M ./24/E6DC4029A1E4B4C1/1B8C0D25285420AB/model.tflite
2.6M ./15/E6DC4029A1E4B4C1/255B83C178FA9DD9/model.tflite
384K ./2/E6DC4029A1E4B4C1/2DFDB6405E512759/model.tflite
1.2M ./13/E6DC4029A1E4B4C1/205CA176C885321E/model.tflite
180K ./25/E6DC4029A1E4B4C1/C278361C6A5A6107/model.tflite

Let’s load up the sentencepiece tokenizer and the model and get embeddings.

import numpy as np
import tensorflow as tf
import sentencepiece as spm

tokenizer = spm.SentencePieceProcessor(model_file='sentencepiece.model')
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def get_embedding(
 interpreter: tf.lite.Interpreter,
 tokenizer: spm.SentencePieceProcessor,
 text: str,
) -> np.ndarray:
 """
 Embedding vector for a given text. Max token length is 64.

 Args:
 interpreter: TFLite interpreter.
 tokenizer: SentencePiece tokenizer.
 text: Text to be tokenized and embedded.

 Returns:
 np.ndarray: Embedding vector of shape (768,).

 Example:
 >>> tokenizer = spm.SentencePieceProcessor(model_file="sentencepiece.model")
 >>> interpreter = tf.lite.Interpreter(model_path="model.tflite")
 >>> get_embedding(interpreter, tokenizer, "New York") # (768,)
 """
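 # Look up tensor metadata from the interpreter passed in instead of module globals
 input_details = interpreter.get_input_details()
 output_details = interpreter.get_output_details()
 input_shape = input_details[0]["shape"]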
 seq_len = input_shape[1] if len(input_shape) > 1 else 64
 tokens = tokenizer.encode(text, out_type=int)
 tokens = np.pad(
 tokens[:seq_len],
 (0, max(0, seq_len - len(tokens))),
 "constant"
 )[np.newaxis, :] # (1, 64)
 interpreter.set_tensor(input_details[0]["index"], tokens.astype(np.int32))
 interpreter.invoke()
 embedding = interpreter.get_tensor(output_details[0]["index"]) # (1, 768)
 return embedding[0] # (768,)

text = "New York"
embedding = get_embedding(interpreter, tokenizer, text)
# [-2.49071661e-02 2.41535041e-03 -1.51733104e-02 -1.12882648e-02...]

This model has a max sequence length of 64 input tokens and outputs a 768-dimensional vector. That is a small input size of around 200-220 characters.
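Embeddings like these are usually compared with cosine similarity. The following is a minimal sketch of that comparison. The helper name cosine_similarity is my own, and the random vectors only stand in for real outputs of get_embedding so the snippet runs without the model files.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(np.dot(a, b) / denom)

# Stand-in vectors with the same shape as the model output, (768,)
rng = np.random.default_rng(0)
u = rng.normal(size=768)
v = rng.normal(size=768)

sim_self = cosine_similarity(u, u)    # a vector compared with itself
sim_other = cosine_similarity(u, v)   # two unrelated vectors
```

With real embeddings you would pass the arrays returned by get_embedding for two pieces of text in place of u and v.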

Forecasting by frequency interpolation of time series

Sat, 18 Jan 2025 00:00:00 +0000

The FITS algorithm, proposed by Xu et al. in “FITS: Modeling Time Series with 10k Parameters”, applies a neat trick from signal processing to forecasting.

The principle it exploits is that increasing the resolution in the frequency domain also increases the signal length in the time domain. In other words, a longer time series provides a higher frequency resolution.

The algorithm extends the resolution of the power spectral density as follows:

De-mean the time series by subtracting the mean. This gets rid of the DC component (the dominant zero-frequency amplitude in the PSD).
Compute the real Fourier transform (rFFT) of the time series, which returns a complex-valued signal.
1. The rFFT condenses a time series of length N to N/2+1 complex numbers.
Learn the complex-valued signal, the amplitude and phase, using a linear layer with complex type.
1. The layer has input size F and output size F * length_ratio, where F is the length of the complex-valued signal and length_ratio is (sequence_length + prediction_length) / sequence_length. The length ratio is therefore > 1, and F * length_ratio is larger than the original frequency resolution F.
Compute the inverse rFFT to get the predicted time series, which is now longer than the original time series.
Add back the mean to the predicted signal.

The method is fast because the FFT is fast. It is also memory efficient, because working in the frequency domain with complex-valued parameters reduces the number of parameters to learn. A major benefit is that the model is supervised on both the forecasting horizon and backcasting over the look-back window. One concern is that the method may not capture local linear trends in the time series well, though it should do a good job of capturing periodicity.
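The data flow of the steps above can be sketched in NumPy. This is only an illustration under assumptions of my own: the function name frequency_interpolate is hypothetical, the output spectrum size is taken as (sequence_length + prediction_length) // 2 + 1, and the weight matrix is an untrained stand-in (a zero-padded identity) rather than a learned complex linear layer.

```python
import numpy as np

def frequency_interpolate(x: np.ndarray, weight: np.ndarray, pred_len: int) -> np.ndarray:
    """Extend a 1D series by pred_len steps via a complex linear map on its spectrum."""
    seq_len = x.shape[-1]
    out_len = seq_len + pred_len
    length_ratio = out_len / seq_len        # > 1

    mean = x.mean()
    spec = np.fft.rfft(x - mean)            # (seq_len // 2 + 1,) complex
    spec_out = spec @ weight                # (out_len // 2 + 1,) complex
    y = np.fft.irfft(spec_out, n=out_len)   # back to the time domain, now longer
    y = y * length_ratio                    # irfft divides by n, so rescale amplitudes
    return y + mean                         # restore the mean

seq_len, pred_len = 96, 48
t = np.arange(seq_len)
x = 3.0 + np.sin(2 * np.pi * t / 24)

f_in = seq_len // 2 + 1
f_out = (seq_len + pred_len) // 2 + 1
# Untrained stand-in for the learned layer: copy each input bin to the same output bin
weight = np.eye(f_in, f_out, dtype=np.complex128)

y = frequency_interpolate(x, weight, pred_len)
```

With this identity-like weight the output is roughly the input resampled onto the longer grid. In FITS the complex weights are learned, so that the extended part of the series becomes a forecast.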

Small language models (updated June 2025)

Mon, 30 Dec 2024 00:00:00 +0000

Last updated: 2025-06-15

Small language models are increasingly capable of performing a wide range of tasks locally on-device and in the web browser. This page lists some interesting small language models. I am classifying small models as those with fewer than 1 billion parameters. This page will be updated regularly as I evaluate new models.

Why? Small language models are ideal for structured problems where reasoning and “thinking” are not necessary. This actually covers a wide range of use cases like entity extraction, structured data extraction, summarization, classification, multi-turn conversations, text composition, text revision, and content-tagging. Small LMs are also ideal candidates for fine tuning to learn domain-specific knowledge.

Limitations: By virtue of being small and compressed, small language models have some important limitations. Complex reasoning tasks should be broken down into simpler steps. Small LMs should avoid math and code generation tasks. They also have limited world knowledge, unlike larger models that have “overfit” or memorized large amounts of factual information up to their training cutoff date. As such, small LMs are more likely to hallucinate (provide made-up information) when asked about facts or events. Although fine tuning is not necessary for many tasks and comes with its own challenges, like forgetting world knowledge that was learned during training, it can still be useful for getting the small LM to learn domain-specific knowledge.

SmolLM2 from HuggingFace (135M, 360M, 1.7B) - General-purpose. The smallest models can run efficiently in the web browser. Good for entity extraction, summarizing small text, and structured data extraction.

NuExtract-v1.5 from NuMind (Tiny, Base, Large) - Fine-tuned for structured entity extraction. This model takes a text input and an example JSON output and returns a JSON string that matches the example schema. This is different from structured output generation through token sampling. NuExtract generates JSON strings directly and is trained to do so with high accuracy. Token sampling on the other hand constrains the decoder logits to produce valid JSON that follows a predefined grammar or regex pattern. Token sampling is more flexible and works with any base model but has additional overhead in the generation process. NuExtract promises to be more efficient (by directly producing JSON strings) and accurate for structured extraction tasks.

Arctic-embed from Snowflake (22M-335M) - Embedding-only, excels at retrieval tasks. I’ve used this model for a few retrieval problems and it works as advertised. I’ve found these smaller embedding models are a good alternative to BERT for pure retrieval tasks.

Nomic-Embed-text from Nomic - Embedding-only, excels at retrieval tasks. I’ve found Nomic to be highly capable for its size. The model handles sequence lengths from 2048 to 8192 input tokens. nomic-embed-text-v1.5 was trained with Matryoshka Representation Learning, which means you can choose the output embedding dimension from 64 up to 768. The highest dimension, 768, is the most accurate, and accuracy is decent down to 256 dimensions, after which it drops off quickly.
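Choosing a smaller Matryoshka dimension amounts to keeping the leading components of the embedding and rescaling. The following sketch shows only that general truncate-then-renormalize idea. The helper name truncate_embedding is hypothetical and the random matrix stands in for real embeddings; check the model card for the exact recipe it recommends.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and rescale each vector to unit length."""
    if not 0 < dim <= embedding.shape[-1]:
        raise ValueError("dim must be between 1 and the embedding size")
    head = embedding[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.clip(norm, 1e-12, None)   # guard against division by zero

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))        # stand-in for four 768-d embeddings
small = truncate_embedding(full, 256)   # shape (4, 256), unit-norm rows
```

Storing 256 instead of 768 dimensions cuts index size by two thirds, which is the trade-off against the accuracy drop described above.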

Qwen 3 provides a 0.6B parameter model with a 32K context length and 1.5GB size and Apache 2.0 license. This model is small but has thinking and reasoning capability. Ollama supports Qwen 3 and the thinking can be enabled/disabled using the think parameter or by passing /no_think in the prompt. Qwen 3 docs provide some guidance and best practices:

For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.
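To make the sampling settings quoted above concrete, here is a rough sketch of how temperature, top-k, top-p, and min-p shape the choice of the next token. It is a simplified illustration, not the implementation used by Ollama or any particular runtime: the function name sample_token is mine, and real libraries may apply the filters in a different order.

```python
import numpy as np

def sample_token(logits, temperature=0.6, top_k=20, top_p=0.95, min_p=0.0, rng=None):
    """Sample one token id from logits after temperature, top-k, top-p and min-p filtering."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids, most likely first
    sorted_probs = probs[order]

    keep = np.ones(len(sorted_probs), dtype=bool)
    keep[top_k:] = False                       # top-k: keep the k most likely tokens
    cum = np.cumsum(sorted_probs)
    keep &= (cum - sorted_probs) < top_p       # top-p: smallest prefix reaching top_p
    keep &= sorted_probs >= min_p * sorted_probs[0]  # min-p: relative to the best token

    kept_ids = order[keep]
    kept_probs = sorted_probs[keep] / sorted_probs[keep].sum()
    return int(rng.choice(kept_ids, p=kept_probs))

example_logits = [5.0, 1.0, 0.5, -2.0]
token_id = sample_token(example_logits, rng=np.random.default_rng(0))
```

Greedy decoding corresponds to always taking the first entry of order, which is what the quoted guidance warns against for thinking mode.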

Convolutions as spectral filters

Tue, 24 Dec 2024 00:00:00 +0000

Convolution in the time domain is a sliding dot-product between a kernel and a signal. This operation requires that we align the kernel with each position of the signal and compute the dot-product at each position, making sure that the kernel does not extend beyond the signal boundaries by padding the signal and cutting the output to the original signal length.

Alternatively, a convolution in the time domain is equivalent to element-wise multiplication in the frequency domain. We can perform the convolution more efficiently using the Fast Fourier Transform (FFT): compute the complex spectra of both the signal and the kernel using the FFT, multiply them element-wise, and reconstruct the convolved signal using the inverse FFT. FFT-based convolution is faster than direct convolution, especially for long signals and kernels.
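The equivalence is easy to check numerically. Below is a small sketch comparing NumPy's direct convolution against the FFT route. The signal and kernel are random stand-ins, and both are zero-padded to N + M - 1 samples so the circular convolution computed by the FFT matches the linear one.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
kernel = rng.normal(size=31)

# Direct (time-domain) linear convolution, length N + M - 1
direct = np.convolve(signal, kernel, mode="full")

# Same result via the FFT: zero-pad, multiply the complex spectra, invert
n = len(signal) + len(kernel) - 1
spectrum = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)
via_fft = np.fft.irfft(spectrum, n)

max_abs_diff = np.max(np.abs(direct - via_fft))
```

The two results agree up to floating-point rounding, so max_abs_diff is tiny.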

This provides a different perspective on convolutions by thinking of convolutions as spectral filters. Consider a sine wave with frequency $f$ convolved with a Gaussian kernel. The power spectrum of a pure sine wave has a single bar at the frequency $f$. The power spectrum of a Gaussian kernel decays exponentially with the square of the frequency (it is itself a Gaussian centered at zero). The narrower the Gaussian in time, the more gentle the decay in the frequency domain. If we multiply the two power spectra element-wise, we get essentially zeros everywhere the two do not overlap. Only at the frequency $f$ do we get a non-zero value. Convolution in the time domain is equivalent to multiplication in the frequency domain.
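A quick numerical sketch of this picture: convolve unit-amplitude sine waves with Gaussian kernels and look at how much amplitude survives. The sampling rate, frequencies, kernel widths, and helper names are arbitrary choices of mine for illustration.

```python
import numpy as np

fs = 200                       # samples per second (arbitrary)
t = np.arange(400) / fs        # 2 seconds of samples

def gaussian_kernel(sigma_samples: float, half_width: int = 50) -> np.ndarray:
    """Gaussian kernel normalized to sum to 1 (unit gain at zero frequency)."""
    k = np.arange(-half_width, half_width + 1)
    g = np.exp(-0.5 * (k / sigma_samples) ** 2)
    return g / g.sum()

def output_amplitude(freq_hz: float, sigma_samples: float) -> float:
    """Peak amplitude of a unit sine after convolution with the Gaussian kernel."""
    x = np.sin(2 * np.pi * freq_hz * t)
    y = np.convolve(x, gaussian_kernel(sigma_samples), mode="same")
    core = y[100:-100]         # ignore edge effects from zero padding
    return float(np.max(np.abs(core)))

slow_wide = output_amplitude(2.0, sigma_samples=5.0)     # low frequency, wide kernel
fast_wide = output_amplitude(20.0, sigma_samples=5.0)    # high frequency, wide kernel
fast_narrow = output_amplitude(20.0, sigma_samples=1.0)  # high frequency, narrow kernel
```

The wide kernel passes the 2 Hz sine almost unchanged and nearly removes the 20 Hz sine, while the narrow kernel, whose spectrum decays more gently, lets most of the 20 Hz sine through.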

I think this is a very intuitive way to understand why convolutions work. A convolution filters out the frequencies that are not present in both the signal and the kernel. Only the features of the signal that share characteristics with the features in the kernel are amplitude modulated and preserved in the output.

We can do some other interesting things in the frequency domain, like filtering out noise and reconstructing the signal using the inverse Fourier transform. Let’s see how we can decompose a signal into its frequency components using the FFT and reconstruct the signal using the iFFT.

A signal that has been reconstructed from the top-5 dominant frequencies.

Given a time series of values value and timesteps ts, the rFFT of the signal is computed using the following code snippet. We first standardize the signal by subtracting the mean and dividing by the standard deviation (z-score). This removes the zero-Hz frequency (DC offset), which would otherwise dominate the power spectrum. We then compute the rFFT of the standardized signal to get the complex values (amplitude and phase). The amplitudes are the absolute values of the complex values. The frequencies are computed using the rfftfreq function.

# Examine the PSD using the rFFT
xmean = np.mean(value)
xvar = np.var(value)
zvalues = (value - xmean) / np.sqrt(xvar)
tsnorm = (ts - ts[0]) / (ts[-1] - ts[0])
rfft_values = np.fft.rfft(zvalues) # complex values
amplitudes = np.abs(rfft_values) # amplitudes
freqs = np.fft.rfftfreq(len(value), d=tsnorm[1] - tsnorm[0])

We plot the power spectral density (PSD) of the signal, which tells us the energy at each frequency.

Power spectrum of the original signal.

Since period is the inverse of frequency, by identifying the frequencies that carry most of the energy, we can also discover the most dominant periods. The signal has a few dominant frequencies. We can select the top-5 frequencies (10.96, 9.96, 20.92, 21.91, 4.98) and reconstruct the signal using the inverse Fourier transform. This is equivalent to filtering for frequencies that capture most of the signal energy and removing the rest. This allows us to denoise the signal by removing the high-frequency components. Notice the reconstructed signal is a smoothed version of the original signal.

# Extract the top 5 dominant frequencies
top5 = np.argsort(amplitudes)[::-1][:5]
top5_freqs = freqs[top5]
print("Top 5 frequencies:", top5_freqs.round(2))

# Reconstruct the signal using the top 5 frequencies
rfft_values_filtered = np.zeros_like(rfft_values)
rfft_values_filtered[top5] = rfft_values[top5]
# Pass n so the output length matches the input even when len(value) is odd
recon = np.fft.irfft(rfft_values_filtered, n=len(value))
recon = recon * np.sqrt(xvar) + xmean

You can also low-pass filter the signal by setting an upper-bound on the cut-off frequency and setting all amplitudes above the cut-off frequency to zero, then reconstruct the signal using the inverse Fourier transform.
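The low-pass variant can be sketched as follows. The function name lowpass_fft and the synthetic two-tone signal are assumptions made for illustration; the signal is built so that both tones fall exactly on FFT bins.

```python
import numpy as np

def lowpass_fft(x: np.ndarray, fs: float, cutoff_hz: float) -> np.ndarray:
    """Zero every rFFT bin above cutoff_hz and transform back to the time domain."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(x))

fs = 100.0
t = np.arange(400) / fs                     # 4 seconds of samples
slow = np.sin(2 * np.pi * 2 * t)            # 2 Hz component to keep
fast = 0.5 * np.sin(2 * np.pi * 30 * t)     # 30 Hz component to remove
filtered = lowpass_fft(slow + fast, fs, cutoff_hz=10.0)
```

Here filtered recovers the 2 Hz component. On real signals a hard cut-off like this can introduce ringing near sharp edges, so a smoother roll-off is often preferred.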

Finally, we can implement a 1D convolution using the FFT in PyTorch.

import torch
import torch.fft as fft
from scipy.fftpack import next_fast_len

def conv1d_fft(signal: torch.Tensor, kernel: torch.Tensor, dim: int=-1):
 """Convolve two 1D tensors using FFT.

 Args:
 signal (Tensor): Shape (batch_size, N) where N is the signal length
 kernel (Tensor): Shape (batch_size, M) where M is the kernel length
 dim (int, optional): Dimension along which to convolve. Default is -1.

 Returns:
 Tensor: Shape (batch_size, N) containing the convolved signal
 """
 N = signal.size(dim) # signal length
 M = kernel.size(dim) # kernel length

 fast_len = next_fast_len(N + M - 1)

 F_f = fft.rfft(signal, fast_len, dim=dim) # shape (batch_size, fast_len // 2 + 1)
 F_g = fft.rfft(kernel, fast_len, dim=dim) # shape (batch_size, fast_len // 2 + 1)

 F_fg = F_f * F_g.conj()
 out = fft.irfft(F_fg, fast_len, dim=dim)
 out = out.roll((-1,), dims=(dim,))
 idx = torch.as_tensor(range(fast_len - N, fast_len)).to(out.device)
 out = out.index_select(dim, idx)

 return out

Zero-build Vue JS apps

Sat, 14 Dec 2024 00:00:00 +0000

Here is a template for a Vue app that does not require a build step. This is useful for small projects where you want to quickly iterate on a UI without having to set up a build process using Node.js and the NPM package manager. I find this type of zero-build setup especially rewarding when I work on a project for a brief time, deploy it, and then have to come back to it after a few months to make a small change or fix a bug. A few of the benefits I’ve noticed:

All the code is in one place so I don’t have to look across a large code repository to re-learn the structure of the UI.
Vue helps organize the code in a standard way and has great documentation.
The deployment process is to copy the HTML file and any other assets to a server and launch caddy to serve the files.

My typical workflow is to start with the HTML template below and add data, methods, and computed properties as needed. Bootstrap CSS is well-known, documented, and easy to use so it’s a good place to start. Tailwind CSS is another option, but newer versions (v3+) of Tailwind require a build step to generate the CSS file and the last minified CDN version (v2.2.19) is a large 2.93 MB file compared to Bootstrap’s 233 KB file.

<!DOCTYPE html>
<html lang="en">
 <head>
 <meta charset="UTF-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
 <title>Document</title>
 <script src="https://unpkg.com/vue@3/dist/vue.global.prod.js"></script>
 <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet"/>
 <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/js/bootstrap.bundle.min.js"></script>
 </head>

 <body>
 <div id="app">
 <!-- Custom template -->
 <example-component name="Hello Vue!"></example-component>
 </div>

 <!-- Template tags should be defined outside the mounted app (#app) -->
 <template id="example-component">
 <div v-text="name"></div>
 </template>

 <script type="module">
 // Custom Vue component
 const exampleComponent = {
 // The template is defined in the <template> tag
 template: document.getElementById("example-component"),
 props: ["name"],
 data() {
 return {};
 },
 async mounted() {},
 methods: {},
 };

 const app = Vue.createApp({
 data() {
 return {};
 },
 computed: {},
 async mounted() {},
 methods: {},
 // Register the custom component
 components: {
 "example-component": exampleComponent,
 },
 });
 app.mount("#app");
 </script>
 </body>
</html>

Static site build script

Sat, 07 Dec 2024 00:00:00 +0000

This little shell script compiles a folder of markdown files into HTML files using Pandoc.

First it preprocesses markdown files as Mustache templates. This lets you use variables in your markdown files that are defined in a metadata.yaml file or the frontmatter. The script then uses Pandoc to convert the markdown files to standalone HTML files.
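For readers who prefer Python, the preprocessing idea can be sketched as below. This is a simplified stand-in and not a port of the script: it handles only flat key: value front matter and plain {{name}} tags, it renders only the body rather than the whole file, and all function names are hypothetical.

```python
import re

def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body); front_matter is '' if the file has none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # no closing delimiter: treat everything as body

def parse_simple_yaml(block: str) -> dict[str, str]:
    """Parse flat `key: value` lines only (not a real YAML parser)."""
    data = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            data[key.strip()] = value.strip()
    return data

def render_variables(template: str, context: dict[str, str]) -> str:
    """Replace {{name}} tags; unknown names render as empty strings."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: context.get(m.group(1), ""),
        template,
    )

source = "---\ntitle: Notes\n---\n# {{title}} by {{author}}\n"
site_metadata = {"author": "Jane"}            # stand-in for metadata.yaml
front, body = split_front_matter(source)
context = {**site_metadata, **parse_simple_yaml(front)}  # front matter wins on conflicts
rendered = render_variables(body, context)
```

The rendered markdown would then be handed to Pandoc, as the script does with the output of the mustache command.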

Usage

Save the source to a file named ./dev and make the script executable:

chmod +x ./dev

On macOS, run the setup command to install the required dependencies:

Pandoc
Mustache

./dev setup

To build the site run:

./dev build

To force a full rebuild run:

./dev build -F

To build a specific file run:

./dev build content/notes.md

If you have caddy and npx installed, you can run a local server and watch for changes:

./dev run

Source

#!/bin/bash
# Usage:
# chmod +x dev
# ./dev [COMMAND]

set -e

# Files modified in the last 30 minutes will be rebuilt
MMIN=30

ERROR='\033[0;31m'
SUCCESS='\033[0;32m'
CODE='\033[0;36m'
NC='\033[0m' # No Color

cmd_helps=()

defhelp() {
 local command="${1?}"
 local text="${2?}"
 local help_str
 help_str="$(printf ' %-24s %s' "$command" "$text")"
 cmd_helps+=("$help_str")
}

# Print out help information
cmd_help() {
 echo "Script for performing dev tasks."
 echo
 echo "Usage: ./dev [COMMAND]"
 echo "Replace [COMMAND] with a word from the list below."
 echo
 echo "COMMAND list:"
 for str in "${cmd_helps[@]}"; do
 echo -e "$str"
 done
}
defhelp help 'View all help.'

# ------------------------------------------------------------------------------
# Repo
# ------------------------------------------------------------------------------

cmd_clean() {
 echo "Cleaning up..."
 rm -f public/*.html
 rm -f public/*.xml
}
defhelp clean 'Clean up.'

cmd_setup() {
 echo "Setting up..."
 # check if jq is installed
 if ! command -v jq &> /dev/null; then
 echo "Installing jq..."
 brew install jq
 fi

 # check if pandoc is installed
 if ! command -v pandoc &> /dev/null; then
 echo "Installing pandoc..."
 brew install pandoc
 fi

 # check if mustache is installed
 if ! command -v mustache &> /dev/null; then
 echo "Installing mustache..."
 go install github.com/cbroglie/mustache/cmd/mustache@latest
 fi
}

# Build a file or all files, optionally force a full rebuild
# Build all files that have changed in the last 30 minutes
# ./dev build
# Build all files
# ./dev build -F
# Build a specific file
# ./dev build file.md
cmd_build() {
 echo "Building..."

 REBUILD=0

 # Check if given -F flag to force a full rebuild
 # ignore the flag if it is not given
 while getopts "F" opt; do
 case ${opt} in
 F)
 REBUILD=1
 ;;
 \?)
 # ignore unknown flags
 ;;
 esac
 done

 # check if any files in templates/ have changed in the last 30 minutes, if so, force a full rebuild
 if [ $REBUILD -eq 0 ]; then
 if [ $(find templates -type f -mmin -$MMIN | wc -l) -gt 0 ]; then
 echo "Templates have changed, forcing a full rebuild..."
 REBUILD=1
 fi
 fi

 # Markdown extension (e.g. md, markdown, mdown).
 MEXT="md"

 # if rebuild=0 and a file name is given, build that file
 if [ $REBUILD -eq 0 ] && [ $# -eq 1 ]; then
 FILES="$1"
 fi

 # Only check for files if FILES is not set
 if [ -z "$FILES" ]; then
 # get all markdown files that have changed in the last 30 minutes if not forcing a full rebuild
 # otherwise, get all markdown files
 if [ $REBUILD -eq 0 ]; then
 echo "Incremental build..."
 FILES=$(find content -type f -name "*.$MEXT" -mmin -$MMIN)
 else
 echo "Full build..."
 FILES=$(find content -type f -name "*.$MEXT")
 fi
 fi

 # if there are no files, exit
 if [ -z "$FILES" ]; then
 echo -e "${ERROR}No files to process!${NC}"
 exit 0
 fi

 # Location of the root directory with this script, templates/, content/, public/
 ROOT=$(pwd)

 echo "Files to process:"
 echo "---"
 echo "$FILES"
 echo "---"

 # build each file
 for file in $FILES; do
 echo "Building: $file"
 # get the file name without the extension
 FILENAME=$(basename -- "$file")
 FILENAME="${FILENAME%.*}"

 # gather yaml front matter from the file if it exists using sed and awk
 frontmatter=$(awk '
 # Start capturing when we find the opening --- line
 /^---$/ { if (capture) exit; capture=1; next }
 # Print lines only if capture is active
 capture { print }
 ' < "$file")

 # strip leading and trailing --- from the frontmatter and add it back
 # we do this as a sanity check in case the file does not have frontmatter properly formatted
 frontmatter=$(echo "$frontmatter" | sed 's/^---//' | sed 's/---$//')
 frontmatter=$(echo $'\n'"---"$'\n'"$frontmatter"$'\n'"---")

 # collect all context data in one place
 context=$(cat content/metadata.yaml <(echo "$frontmatter"))

 # preprocess the markdown file with mustache, use the frontmatter and metadata.yaml as context
 # cat the frontmatter and metadata.yaml first, then pipe them to mustache with the markdown file as the template
 inputtext=$(cat <(echo "$context") | mustache "$file")

 pandoc -r markdown+simple_tables+table_captions+yaml_metadata_block+auto_identifiers+header_attributes+fenced_code_blocks+fenced_code_attributes+tex_math_dollars \
 -w html \
 --tab-stop=2 \
 --toc \
 --mathjax \
 --metadata-file content/metadata.yaml \
 -V builddate="$(date +"%a, %d %b %Y %H:%M:%S %z")" \
 -V year="$(date +"%Y")" \
 --template=./templates/bear.html \
 -o ./public/"$FILENAME".html \
 <(echo "$inputtext")

 done

 # print a success message
 echo -e "${SUCCESS}Build complete!${NC}"

}
defhelp build 'Build the site.'

cmd_run() {
 echo "Starting server..."
 npx nodemon --watch 'content/**/*' -e md,html,yaml --exec './dev build' \
 & caddy file-server --listen :8000 --root ./public \
 & wait
}

# --------------------------------------------------------------------------
# Core script logic
# -----------------------------------------------------------------------------

silent() {
 "$@" > /dev/null 2>&1
}

# If no command given
if [ $# -eq 0 ]; then
 echo -e "${ERROR}ERROR: This script requires a command!${NC}"
 cmd_help
 exit 1
fi
cmd="$1"
shift
if silent type "cmd_$cmd"; then
 "cmd_$cmd" "$@"
 exit $?
else
 echo -e "${ERROR}ERROR: Unknown command!${NC}"
 echo "Type './dev help' for available commands."
 exit 1
fi

ECG beat detection algorithm

Wed, 13 Mar 2024 00:00:00 +0000

A basic component of processing electrocardiogram (ECG) signals is detecting the heart beat. Beat detection is used to calculate the heart rate, to derive measures of heart rate variability, to develop signal quality indicators, and to detect diseases. There are thousands of publications and strategies for detecting the R-peak of the QRS complex of a heart beat from an ECG signal, with varying degrees of accuracy (see the section on other beat detection algorithms for a survey). Methods range from threshold-based peak detectors, to wavelet-based signal processing, to probabilistically combining multiple methods.

We will use the Pan-Tompkins algorithm, one of the most widely implemented peak-detection algorithms, to detect the R-peak of the ECG signal in this chapter. Data for this exercise is from the 2017 PhysioNet Challenge, which aimed to classify atrial fibrillation from single-channel ECG signals. The data was sampled at 300 Hz and bandpass filtered. First, we start with a short introduction to ECG wave analysis.

ECG waves

ECG analysis starts with understanding the wave morphology and intervals.

Features derived from a single beat in an ECG. Picture from Philips DXL ECG Algorithm Physician Guide

The P-wave reflects atrial depolarization. The amplitude of the P-wave decreases in diseases like atrial fibrillation, which is a type of arrhythmia or abnormal heartbeat. Therefore, we typically want to quantify the amplitude and duration of the P-wave for AFib classification. The distance between the P-wave onset and the onset of the QRS complex is the PR interval, with a normal duration of 120-220 ms.

The QRS complex reflects depolarization of the left ventricle (since the electrical vector of the left ventricle is much larger than that of the right ventricle). A short QRS duration indicates that the ventricles are functioning properly, while a broad QRS duration indicates that ventricular activation is slow and that there could be a dysfunction in the electrical conduction system of the heart. The R-peak of the QRS complex is used to calculate the instantaneous heart rate from the interval between subsequent R-peaks (RR-interval). An RR-interval of 400 ms is equivalent to an instantaneous heart rate of 150 beats per minute ($60\ \text{s/min} \times 1000\ \text{ms/s} \div 400\ \text{ms} = 150$).
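The RR-interval arithmetic is short enough to write out. In this sketch the function names and the R-peak sample indices are made up for illustration; only the 300 Hz sampling rate comes from the dataset described earlier.

```python
def heart_rate_bpm(rr_interval_ms: float) -> float:
    """Instantaneous heart rate in beats per minute from one RR interval."""
    if rr_interval_ms <= 0:
        raise ValueError("RR interval must be positive")
    return 60_000.0 / rr_interval_ms

def rr_intervals_ms(r_peak_samples: list[int], fs: float) -> list[float]:
    """RR intervals in milliseconds from R-peak sample indices."""
    return [
        (b - a) * 1000.0 / fs
        for a, b in zip(r_peak_samples, r_peak_samples[1:])
    ]

peaks = [0, 120, 240, 375]                   # hypothetical R-peak indices at 300 Hz
intervals = rr_intervals_ms(peaks, fs=300)   # [400.0, 400.0, 450.0]
rates = [heart_rate_bpm(rr) for rr in intervals]
```

The first two intervals of 400 ms give 150 beats per minute, matching the worked example in the text, and the longer 450 ms interval gives a lower rate.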

The ST segment is another important morphological feature of the ECG wave, since ST elevation and depression are both associated with heart dysfunction like acute myocardial ischemia or ST-elevation myocardial infarction (STEMI). Elevation or depression is calculated as the difference (in millimeters) between the J point (where the ST segment starts) and the PR segment. Finally, the T-wave reflects repolarization of the contractile cells and is also associated with a range of heart conditions.

Pan-Tompkins algorithm

The Pan-Tompkins algorithm is widely used and is suitable for real-time continuous QRS detection. The algorithm is based on analysis of the slope, width, and amplitude of the ECG using a series of filters. An ECG signal first goes through a bandpass filter, then a differentiator, a squaring operation, a moving window integrator, and finally adaptive thresholding and search-back to find the R-peak.

Pan-Tompkins algorithm for QRS detection.

Raw ECG signals include muscle noise (from respiration), motion artifacts, the QRS complex, and P and T waves. The bandpass filter is designed to match the spectrum of the QRS complex while attenuating muscle noise, 60 Hz interference, baseline wander, and T-wave interference. A pass band of 5-15 Hz maximizes the QRS energy.

Filtering on waveforms can have negative effects. While low pass filters successfully reduce noise in ECG traces, they also reduce the QRS amplitude. High pass filters (e.g. low cutoff at 0.5 Hz) reduce baseline wander, but also introduce ST distortion. Using forward/backward filtering (highpass, reverse time, highpass, reverse time) removes most of the distortion introduced by high pass filters on the ST segment.
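The effect of forward/backward filtering on timing can be seen with a toy example. This sketch uses a simple one-pole low-pass smoother rather than the high-pass filter discussed above, because it is the shortest filter to write by hand; the principle of cancelling the phase shift by filtering in both directions is the same. The function names and the synthetic pulse are my own.

```python
import numpy as np

def one_pole_lowpass(x: np.ndarray, alpha: float) -> np.ndarray:
    """Causal smoother: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    y = np.empty(len(x), dtype=float)
    state = 0.0
    for n, sample in enumerate(x):
        state = alpha * sample + (1.0 - alpha) * state
        y[n] = state
    return y

def forward_backward(x: np.ndarray, alpha: float) -> np.ndarray:
    """Filter, reverse time, filter again, reverse time (zero phase overall)."""
    forward = one_pole_lowpass(x, alpha)
    return one_pole_lowpass(forward[::-1], alpha)[::-1]

n = np.arange(1001)
pulse = np.exp(-0.5 * ((n - 500) / 10.0) ** 2)   # symmetric bump centred on sample 500

peak_causal = int(np.argmax(one_pole_lowpass(pulse, alpha=0.1)))
peak_zero_phase = int(np.argmax(forward_backward(pulse, alpha=0.1)))
```

The single causal pass delays the peak past sample 500, while the forward/backward version keeps it at sample 500. This is the behavior scipy.signal.filtfilt provides for the filters used in the code further down.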

Here we use cascading filters combining a low pass filter and a high pass filter to mimic a bandpass filter. The filter attenuates the P and T waves (which peak at <5 Hz), which is a desired feature since the goal is to detect the QRS complex.

The filtered ECG is then differenced and squared to amplify the QRS complex. The derivative filter further suppresses low frequency components of P and T waves. Squaring makes the signal positive and enhances the derivatives by amplifying the high frequency QRS complex.

Next, the moving average filter over a 150ms window captures the duration of the QRS complex and gives us the integrated signal. This suppresses the smaller oscillations by smoothing out the residual high frequency components. Here we have to define an optimal window length for the moving average. Large windows merge the QRS and T waves together and small windows would produce several peaks at the QRS complex making it difficult to find the R-peak. In addition to detecting the QRS complex the moving average filter gives us the width of the QRS complex.

There are numerous heuristics for peak detection from the integrated signal (e.g. simple thresholding of the moving window integral). Pan and Tompkins propose adaptive thresholding and search-back to select the time ranges that correspond to QRS complexes, adapting to changes in the ECG by computing running estimates of signal and noise peaks. Here I instead smooth the integrated signal with a Gaussian filter to get the energy of the signal. The peaks then correspond to zero-crossings of the first difference, where $x[i+1] < x[i]$.

import numpy as np
import scipy.ndimage # needed for gaussian_filter1d below
import scipy.signal

ecg = np.loadtxt("ecg.txt") # load the ECG signal
fs = 300 # Hz
tvec = np.arange(len(ecg)) / fs # time vector

max_QRS_duration = 0.150 # sec
low_cutoff = 5
high_cutoff = 15
window_size = int(max_QRS_duration * fs)

# apply a bandpass filter to the ECG signal
lowpass = scipy.signal.butter(1, high_cutoff / (fs / 2.0), "low")
highpass = scipy.signal.butter(1, low_cutoff / (fs / 2.0), "high")
ecg_low = scipy.signal.filtfilt(*lowpass, x=ecg)
ecg_band = scipy.signal.filtfilt(*highpass, x=ecg_low)

diff = np.diff(ecg_band)
squared = np.square(diff)

# moving average filter
# apply padding on both sides of the signal and convolve to get the integrated signal
mwa = np.pad(squared, (window_size - 1, 0), "constant", constant_values=(0, 0))
mwa = np.convolve(mwa, np.ones(window_size), "valid")
for i in range(1, window_size):
 mwa[i - 1] = mwa[i - 1] / i
mwa[window_size - 1 :] = mwa[window_size - 1 :] / window_size
mwa[: int(max_QRS_duration * fs * 2)] = 0

# smooth the moving window integrated signal with a gaussian filter and take the derivative
energy = scipy.ndimage.gaussian_filter1d(mwa, fs / 8.0)
energy_diff = np.diff(energy)

# peaks are the points where the derivative crosses zero, adjust window size
zero_crossings = (energy_diff[:-1] > 0) & (energy_diff[1:] < 0)
zero_crossings = np.flatnonzero(zero_crossings)
zero_crossings -= int(window_size / 2)
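
The zero-crossing step can be checked in isolation on a synthetic trace. The sketch below is not part of the pipeline above: the helper name local_maxima and the Gaussian bumps are made up for illustration, and it adds 1 to the index so that it points at the local maximum itself rather than the sample before it.

```python
import numpy as np

def local_maxima(x):
    """Indices where the first difference changes sign from + to -."""
    d = np.diff(x)
    rising_then_falling = (d[:-1] > 0) & (d[1:] < 0)
    # d[i] > 0 and d[i + 1] < 0 means x[i + 1] is the local maximum
    return np.flatnonzero(rising_then_falling) + 1

# Synthetic "energy" trace: three Gaussian bumps centred at known samples
n = np.arange(900)
centres = [150, 450, 750]
energy = sum(np.exp(-0.5 * ((n - c) / 20.0) ** 2) for c in centres)

peaks = local_maxima(energy)
print(peaks)  # indices of the detected peaks
```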

Stages of the beat detection algorithm. The vertical red line indicates the peaks of the QRS complex.

The figure above visualizes each step of the beat detection algorithm. The vertical red lines are the zero-crossings (zero_crossings), which line up with the peaks of the ECG signal.

Other beat detection algorithms

Algorithm	Reference
Zeelenberg (1979)	Engelse, W.A.H., Zeelenberg, C (1979). A single scan algorithm for QRS detection and feature extraction, IEEE Comp. in Cardiology, vol. 6, pp. 37-42.
Pan (1985)	Pan, J., & Tompkins, W. J. (1985). A real-time QRS detection algorithm. IEEE transactions on biomedical engineering, (3), 230-236.
Hamilton (2002)	Hamilton, P. (2002, September). Open source ECG analysis. In Computers in cardiology (pp. 101-104). IEEE.
Zong (2003)	Zong, W., Heldt, T., Moody, G. B., & Mark, R. G. (2003). An open-source algorithm to detect onset of arterial blood pressure pulses. In Computers in Cardiology, 2003 (pp. 259-262). IEEE.
Christov (2004)	Ivaylo I. Christov, Real time electrocardiogram QRS detection using combined adaptive threshold, BioMedical Engineering OnLine 2004, vol. 3:28, 2004.
Elgendi (2010)	Elgendi, Mohamed & Jonkman, Mirjam & De Boer, Friso. (2010). Frequency Bands Effects on QRS Detection. The 3rd International Conference on Bio-inspired Systems and Signal Processing (BIOSIGNALS2010). 428-431.
Kalidas (2017)	Vignesh Kalidas and Lakshman Tamil (2017). Real-time QRS detector using Stationary Wavelet Transform for Automated ECG Analysis. In: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE). Uses the Pan and Tompkins thresholding.
Nabian (2018)	Nabian, M., Yin, Y., Wormwood, J., Quigley, K. S., Barrett, L. F., Ostadabbas, S. (2018). An Open-Source Feature Extraction Tool for the Analysis of Peripheral Physiological Data. IEEE Journal of Translational Engineering in Health and Medicine, 6, 1-11.
Rodrigues (2021)	Rodrigues, Tiago & Samoutphonh, Sirisack & Plácido da Silva, Hugo & Fred, Ana. (2021). A Low-Complexity R-peak Detection Algorithm with Adaptive Thresholding for Wearable Devices.


Heart Rate Variability and Atrial Fibrillation

Wed, 13 Mar 2024 00:00:00 +0000

Atrial fibrillation (AFib) is a sustained cardiac arrhythmia and is classified according to the temporal pattern of irregularly spaced heart beats. Patients with AFib have cardiac hemodynamic dysfunction, up to a 2-fold increase in risk of mortality, and a 6-fold increase in risk of stroke. The electrocardiographic presentation of AFib is continuous and rapid irregular electrical activity of the atria and absence of the P-wave because ventricular response is poorly coupled with atrial activity. These hallmark characteristics of AFib make ECG monitoring the most convenient tool to assist AFib diagnosis. Automated algorithms rely on one or more characteristics of the waveform including irregular rhythm, high-frequency chaotic atrial waveform, and absence of P waves. Measures of heart rate variability (HRV) and morphological analysis are the most common approaches.

R-R interval

We can use a beat detection algorithm to find the R-peaks of the QRS complexes, which are recorded as indices $R_{peaks}[n]$ for the $n$-th beat. We can convert the indices to an R-R interval (RRI) in units of seconds by taking the first difference and dividing by the sampling rate $f_s$:

$$RRI[n] = \frac{R_{peaks}[n] - R_{peaks}[n-1]}{f_s}$$
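
As a minimal sketch of the formula above in NumPy: the R-peak indices and the 300 Hz sampling rate are made-up values for illustration only.

```python
import numpy as np

fs = 300  # Hz, assumed sampling rate
r_peaks = np.array([120, 360, 615, 860, 1100])  # hypothetical R-peak sample indices

rri = np.diff(r_peaks) / fs   # seconds between consecutive beats
heart_rate = 60.0 / rri       # instantaneous heart rate in beats per minute
```

The first difference yields one fewer interval than there are beats, so `rri` has length `len(r_peaks) - 1`.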

Poincare plots

A Poincare plot is a diagram in which the RR-interval ($RRI[n]$) is plotted as a function of the previous RR-interval ($RRI[n-1]$) and is a visual representation of heart rate variability. Patients with AFib have irregular RRI, so the dispersion of points is wider. We can quantify this dispersion by fitting an ellipse and measuring the standard deviation along the major and minor axes. The area of the fitted ellipse is also larger in AFib patients.
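
One way to get these numbers without an explicit ellipse fit is the widely used SD1/SD2 descriptors, which measure the spread perpendicular to and along the identity line. Whether this matches the fit intended above is an assumption. The function name and the synthetic RRI series below are made up for illustration.

```python
import numpy as np

def poincare_descriptors(rri):
    """SD1, SD2 and ellipse area of the Poincare plot of an RRI series."""
    rri = np.asarray(rri, dtype=float)
    x, y = rri[:-1], rri[1:]                      # RRI[n-1] and RRI[n]
    sd1 = np.std((y - x) / np.sqrt(2), ddof=1)    # spread across the identity line
    sd2 = np.std((y + x) / np.sqrt(2), ddof=1)    # spread along the identity line
    area = np.pi * sd1 * sd2                      # area of the ellipse with semi-axes SD1, SD2
    return sd1, sd2, area

# Synthetic stand-ins: a regular rhythm and a much more irregular one
rng = np.random.default_rng(0)
regular = 0.8 + 0.01 * rng.standard_normal(500)
irregular = 0.8 + 0.15 * rng.standard_normal(500)

regular_stats = poincare_descriptors(regular)
irregular_stats = poincare_descriptors(irregular)
```

The irregular series produces a much larger ellipse area, which is the effect described above for AFib.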

Javascript custom events

Mon, 08 Jan 2024 00:00:00 +0000

Custom events are a way to decouple a Javascript application. Trigger an event with the onclick handler (onclick="triggerEvent(this,'my-event')") of an element, and listen for it elsewhere in the application.

function triggerEvent(el, name) {
  document.body.dispatchEvent(new CustomEvent('app:' + name, { detail: el }))
}
function main() {
  document.body.addEventListener('app:my-event', function (event) {
    console.log(event.detail) // element that triggered the event
  });
}
document.addEventListener('DOMContentLoaded', main)

Audio biosignal processing of phonocardiograms

Fri, 01 Jul 2022 00:00:00 +0000

A phonocardiogram (PCG) is a non-invasive assessment of the mechanical function of the heart. Cardiac auscultation and the analysis of the phonocardiogram can unveil fundamental clinical information regarding heart malfunctioning caused by congenital and acquired heart disease. This is achieved by detecting abnormal sound waves, or heart murmurs, in the PCG signal. This article uses data from the 2022 PhysioNet challenge to explore the spectral properties of PCG signals.

Below we see the time domain signal for a 5 sec window of PCG data. We can clearly see the S1 and S2 waves, which correspond to the beginning and end of the systolic phase of the heart beat, respectively.

The corresponding spectrogram shows the power at each frequency over time. The beats are clearly visible in both the time and frequency domains. There is some high frequency noise at about 1.5sec that we will next remove using a digital filter.

Spectrogram of the PCG signal

A Butterworth low-pass filter applied to this signal with 250 Hz cutoff frequency removes the high frequency noise at 1.5sec.

import scipy.signal

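def butter(sig, freq, order, cutoff_hz, btype="low"):
    # the first argument of scipy.signal.butter is the filter order,
    # the second is the cutoff normalized by the Nyquist frequency (freq / 2)
    b, a = scipy.signal.butter(order, cutoff_hz / (freq / 2.0), btype)
    # filtfilt runs the filter forward and backward for zero phase shift
    return scipy.signal.filtfilt(b, a, sig)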
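
As a rough check of this kind of filter on synthetic data, the sketch below mixes a 50 Hz tone with a 1000 Hz tone and low-pass filters at 250 Hz. The 4000 Hz sampling rate, the filter order of 4, and both tones are assumptions chosen for illustration, not values taken from the PCG recordings.

```python
import numpy as np
import scipy.signal

fs = 4000  # Hz, assumed sampling rate
t = np.arange(0, 2.0, 1.0 / fs)
heart_band = np.sin(2 * np.pi * 50 * t)       # stand-in for low-frequency heart sounds
noise = 0.5 * np.sin(2 * np.pi * 1000 * t)    # stand-in for high-frequency noise
sig = heart_band + noise

order, cutoff_hz = 4, 250  # assumed filter order; cutoff as in the text
b, a = scipy.signal.butter(order, cutoff_hz / (fs / 2.0), "low")
filtered = scipy.signal.filtfilt(b, a, sig)

# compare away from the edges, where filtfilt has start-up transients
core = slice(fs // 4, -fs // 4)
residual = np.sqrt(np.mean((filtered[core] - heart_band[core]) ** 2))
```

A small `residual` means the filtered signal is close to the 50 Hz component alone, i.e. the 1000 Hz tone has been removed without shifting the low-frequency waveform in time.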

Most of the signal of interest is below 500Hz so low pass filtering removes the high frequency components and leaves the cardiac cycles intact. The figure below shows both the low pass filtered signal in the time domain (left) and frequency domain (right) on top and the high pass filtered signal on the bottom.

Low and high passed signals

Zooming in shows the cardiac cycles in more detail.

Original, low, and high passed signals

Time series similarity with random convolutional features and locality-sensitive hashing

Fri, 01 Jul 2022 00:00:00 +0000

Given a time series, like temperature readings from a collection of sensors, we want to find sensors that have similar readings. This is a common problem in applications like sensor networks, IoT, and monitoring systems.

One approach is using random convolutional features to encode the signal and then use locality-sensitive hashing (LSH) to find similar signals. This approach is very fast and can be used to search through millions of time series signals. The time series neural hashing technique introduced here is a fast general-purpose search and retrieval algorithm.

Our specifications are:

We want to index a large number of time series signals and quickly retrieve similar time series in a way that scales computationally and has low storage requirements
Signals are of variable length with stretches of missing values. Imputation of missing values is not feasible
The representation should capture both the shape/structure and magnitude of the signal
Minimal feature engineering

Random convolutional neural hashing

Given a time series $X \in \mathbb{R}^{M \times N}$ of $M$ samples each of length $N$ containing missing values, we want to encode each sequence $x_m \in \mathbb{R}^{N}$ into a fixed length embedding vector that we can use for fast similarity search.

Normalize and clip $x_m$ in the range [0,1].
Fit a random convolutional encoder and save the parameters.
Embed $x_m$ to $e_m \in \mathbb{R}^{k}$ using the convolutional encoder.
Concatenate the sample stats $s_m=\{min, max, mean, P10, P25, P50, P90\}$ from the normalized sequence to $e_m$ such that $[e_m; s_m] \in \mathbb{R}^{k+7}$
Weight sample stats $s_m$ by $\alpha$ and convolutional encodings $e_m$ by $1-\alpha$. ($\alpha=0.75$)
Define a seed matrix of shape $D \times (k + 7)$ from a standard normal distribution ($D=256$).
Calculate the hash string for each sample (e.g. e9df77eb7c692f16)
Retrieval: Given a query hash, compute the hamming distance to find the nearest neighbor.
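
The hashing and retrieval steps above can be sketched as follows. The steps do not say how bits are derived from the seed matrix, so this sketch assumes sign random projections (one bit per row of the seed matrix), which is one common way to do locality-sensitive hashing. The function names, the random stand-in features, and $k=32$ are hypothetical.

```python
import numpy as np

def hash_embeddings(features, seed_matrix):
    """Sign random projection: one bit per row of the seed matrix."""
    bits = (features @ seed_matrix.T) > 0    # [M, D] boolean
    return np.packbits(bits, axis=1)         # [M, D // 8] uint8

def to_hex(packed_row):
    """Hash string for one packed code."""
    return packed_row.tobytes().hex()

def hamming_distance(query, index):
    """Number of differing bits between a packed query and each packed row."""
    xor = np.bitwise_xor(index, query)
    return np.unpackbits(xor, axis=1).sum(axis=1)

rng = np.random.default_rng(0)
k, D, M = 32, 256, 1000
alpha = 0.75
e = rng.standard_normal((M, k))   # stand-in for the convolutional encodings e_m
s = rng.random((M, 7))            # stand-in for the 7 sample statistics s_m
features = np.concatenate([(1 - alpha) * e, alpha * s], axis=1)  # [M, k + 7]
seed_matrix = rng.standard_normal((D, k + 7))

codes = hash_embeddings(features, seed_matrix)
query = codes[0]
dist = hamming_distance(query, codes)
nearest = np.argsort(dist)[:5]    # indices of the 5 closest hashes
```

Each sample is stored as $D/8$ bytes, and retrieval only needs XOR and bit counts, which is where the speed and storage savings come from.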

The random convolutional encoder is very fast to fit and evaluate. It’s generally good at capturing the structure of the signal; however, the magnitude may be lost because we use a max pooling layer that introduces scale and shift invariance. Our similarity search should consider magnitude, which is important for our specific use case. Therefore, we introduce simple statistics about the magnitude and distribution of sample observations.

Pros:

Captures both signal structure (e.g. periodicity, stretches of missing data) and magnitude.
Fast and scalable both computationally and storage-wise.
Works on time series with missing values and of variable length.

Cons:

Sacrifices some precision since we are using a random convolutional network where kernel weights, dilations, and strides are randomly selected.
Hashing-based retrieval is an approximate nearest neighbor approach which also has lower precision compared to exact nearest neighbor search or similarity search using tree-based methods with a distance metric.

Python project setup with Makefile and setup.py

Tue, 08 Mar 2022 00:00:00 +0000

All my Python projects are set up in the same way: as a Python package with a Makefile. This is a template for that setup.

$ tree .
.
├── Makefile
├── setup.py
├── myproject
│   ├── __init__.py

The setup.py file defines the requirements for the package, and the Makefile defines the commands to run.

The setup creates a virtual environment in .venv directory, installs the package in editable mode, and installs the development dependencies. Editable installs are useful when developing a package, as changes to the code are immediately available without having to reinstall the package. If you add a new dependency to the setup.py file, you will need to run make setup again to install it.

Here is an example setup.py file:

from setuptools import setup

VERSION = "0.1.0"

setup(
 # package name, which can be different from project, this is the name used
 # when installing the package with pip, e.g. pip install mypackage
 name="mypackage",
 version=VERSION,
 maintainer="",
 maintainer_email="",
 description="",
 license="",
 python_requires=">=3.10",
 # static files to include in the package
 package_data={
 "myproject": [
 "var/*",
 ]
 },
 # command line entry points
 entry_points={
 "console_scripts": [
 "myproject = myproject.__main__:main",
 ]
 },
 # packages to include in the distribution
 packages=["myproject"],
 install_requires=[
 "pip",
 "pyarrow==13.0.0",
 "polars==0.18.15",
 "awscli==1.29.38",
 "boto3==1.28.38",
 "botocore==1.31.38",
 "python-dotenv==1.0.0",
 "uvicorn==0.24.0.post1",
 "requests==2.31.0",
 ],
 extras_require={
 "dev": [
 "ruff",
 "ipykernel",
 "pytest",
 ]
 },
 include_package_data=True,
 zip_safe=False,
 classifiers=[
 "Intended Audience :: Science/Research",
 "Programming Language :: Python :: 3.10",
 "Operating System :: OS Independent",
 ],
 keywords="python",
)

The Makefile looks like this:

help: ## Show this help
 @echo "\nSpecify a command. The choices are:\n"
 @grep -E '^[0-9a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf " \033[0;36m%-12s\033[m %s\n", $$1, $$2}'
 @echo ""
.PHONY: help

clean: ## Clean
 rm -rf ./.venv
 rm -rf ./dist
 rm -rf ./mypackage.egg-info
 rm -rf ./mypackage/__pycache__
 rm -rf ./myproject/*.so
 rm -rf ./myproject/__pycache__/
.PHONY: clean

setup: ## Editable install
 test -d .venv || python3 -m venv .venv
 . .venv/bin/activate; \
 python -m pip install --upgrade -i https://pypi.example.com/simple/ -e .[dev]
.PHONY: setup

server: ## Start local server
 uvicorn myproject.app:app --reload --workers=1 --reload-include="./myproject*"
.PHONY: server

Running make or make help will show the available commands:

Specify a command. The choices are:

 help Show this help
 clean Clean
 setup Editable install
 server Start local server

Taxonomy of health data for machine learning

Wed, 23 Feb 2022 00:00:00 +0000

There is a wide variety of data types collected in the health system that can be utilized by machine learning models. These can include

patient-level information like demographics and socio-economic factors
hospital encounter-level information like admission source, ICU unit type, and discharge location
outcomes including diagnoses like billing codes and patient outcomes
interventions a patient received in the hospital like medications, invasive mechanical ventilation, oxygen support, pressors, fluids, blood transfusions, and ECMO
findings from radiological images, pathology images, and video recordings
laboratory measurements like blood gases, metabolic panels, liver panels, lipid panels, complete blood count, urinalysis, urine output, microbiology, and omics data
continuous waveforms like ECG, PPG, PCG, ABP, and etCO2 signals
nurse charted or automated vital sign collection including temperature, heart rate, blood pressure, and oxygen saturation
clinician and radiological notes

In-patient data collected in the hospital is linked to patients using a unique medical record number (MRN). Data collected in out-patient settings, including at home, a nursing home, or in ambulatory care, may not be linked to the patient's MRN. Even inside the hospital, linking waveforms (especially in time) with patient data in electronic health records is a significant challenge.

Cosine similarity 1D convolutions

Mon, 31 Jan 2022 00:00:00 +0000

The cosine similarity function below provides sign and scale independent 1D convolutions. It has a learnable parameter $p$ where large values of $p$ increase the sharpness of the cosine similarity. Here $u$ is the signal and $v$ is the kernel (e.g. $[1,2,3]$).

$$\text{CosineSim}(u,v)=\text{sign}(u \cdot v)\left|\frac{u \cdot v}{\|u\|_2 \, \|v\|_2}\right|^{p^2}$$

We consider the kernel $[1,2,3]$ and a 1D time series signal with 5 distinct motifs:

A. exact match $[1,2,3]$
B. negative sign, exact match $[-1,-2,-3]$
C. downscaled exact match $[0.2, 0.4, 0.6]$
D. median $[2,2,2]$
E. reversed $[3,2,1]$

We can compare a 1D convolution with kernel_size=3, dilation = 1, and padding = (kernel_size-1)*dilation = 2 against the cosine similarity distance.

1D convolution correctly gives the largest activation to both of the exact matches (A and B). However, the convolution also gives a large activation to parts of the signal where there should not be a match. The median sequence of values D $[2,2,2]$ and the reversed sequence E $[3,2,1]$ get a significantly high activation despite having no similarity to the filter. The downscaled exact match C is not selected by the convolution because of the scale of the filter. Standard convolutions on raw (unnormalized) data are not scale and sign independent. Common normalizations, like batch or layer norm, calculate normalizing terms over samples or channels but not point-wise. The resulting convolved signal requires max pooling to find the subsequences of greatest correlation with the filter.

In contrast, the cosine similarity gives a score of 1 or -1 only for exact matches. The feature is detected independent of sign or scale. The figure below shows that the cosine similarity distance correctly detects the motifs A, B, and C. If we set the sharpness parameter to an arbitrarily large value, e.g. $p=9$, then the only remaining peaks are the exact matches.

The output of CosineSim is clearly interpreted as the points of maximal correlation between the signal subsequence and the filter, where filters represent subsequence templates.

import torch
import torch.nn.functional as F

def cosine_similarity(signal, kernel, sharpness, padding):
 """
 Compute the cosine similarity distance between a signal and a kernel.
 Outputs a sequence of the same length as the signal.

 Parameters
 ----------
 signal : torch.Tensor
 input [batch_size, channels, length]
 kernel : torch.Tensor
 filter [kernel_size]
 sharpness : float
 sharpness parameter
 padding : int
 padding size, (kernel size-1) * dilation

 Returns
 -------
 torch.Tensor
 output [batch_size, channels, length]
 """
 kernel_size = kernel.size(-1)
 x = F.pad(signal, (padding,padding))
 x = x.unfold(2, kernel_size, 1)
 sim = F.cosine_similarity(x, kernel, dim=-1)
 # sign of the dot product between each window and the kernel (contract over the window axis j)
 sgn = torch.sign(torch.einsum('bdij,j->bdi', x, kernel))
 sim = sgn * torch.pow(torch.abs(sim), (sharpness**2))
 sim = sim[:, :, : -padding].contiguous()
 return sim
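
To see the formula at work on the five motifs without PyTorch, here is a small NumPy sketch that scores each motif directly against the kernel $[1,2,3]$. The helper name cosine_sim is made up for this example, and it compares whole vectors rather than sliding over a signal.

```python
import numpy as np

def cosine_sim(u, v, p):
    """sign(u.v) * |cos(u, v)| ** (p ** 2) for two 1D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    dot = np.dot(u, v)
    cos = dot / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sign(dot) * np.abs(cos) ** (p ** 2)

kernel = [1, 2, 3]
motifs = {
    "A": [1, 2, 3],        # exact match
    "B": [-1, -2, -3],     # negative sign, exact match
    "C": [0.2, 0.4, 0.6],  # downscaled exact match
    "D": [2, 2, 2],        # median
    "E": [3, 2, 1],        # reversed
}
scores = {name: cosine_sim(u, kernel, p=1) for name, u in motifs.items()}
sharp = {name: cosine_sim(u, kernel, p=9) for name, u in motifs.items()}
```

With $p=1$, motifs A and C score 1 and B scores -1, while D and E still score fairly high (roughly 0.93 and 0.71). With $p=9$ the exponent is 81, which pushes D and E close to zero and leaves only the exact matches.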

Group-by and count in Numpy

Wed, 26 Jan 2022 00:00:00 +0000

The crosstab function takes a list of array-like objects and returns a contingency table of counts. A pure numpy implementation of a pivot-table like this is useful in environments where we don’t want to import the pandas package.

from typing import Tuple, List
import numpy as np

def crosstab(*args) -> Tuple[Tuple[np.ndarray], np.ndarray]:
 """
 Contingency table of counts.

 Parameters
 ----------
 args : list of array-like
 Arrays of discrete categorical data.

 Returns
 -------
 actual_levels : Tuple[np.ndarray]
 The actual levels of the categorical variables.
 count : np.ndarray
 The counts of the categorical variables cross-tabulated.

 Examples
 --------
 >>> categorical = [1,3,2,3]
 >>> covariate = [5,3,3,4]
 >>> levels, count = crosstab(categorical, covariate)
 """
 levels, indices = zip(*[np.unique(a, return_inverse=True) for a in args])
 count = np.zeros(list(map(len, levels)), dtype=int)
 np.add.at(count, indices, 1)
 return levels, count
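
Running the docstring example shows how to read the result. The function body is repeated here only so the snippet runs on its own.

```python
import numpy as np

def crosstab(*args):
    # same logic as the function above, repeated so the example is self-contained
    levels, indices = zip(*[np.unique(a, return_inverse=True) for a in args])
    count = np.zeros(list(map(len, levels)), dtype=int)
    np.add.at(count, indices, 1)
    return levels, count

categorical = [1, 3, 2, 3]
covariate = [5, 3, 3, 4]
levels, count = crosstab(categorical, covariate)
# levels[0] -> [1 2 3] (row labels), levels[1] -> [3 4 5] (column labels)
# count ->
# [[0 0 1]
#  [1 0 0]
#  [1 1 0]]
```

Row $i$ and column $j$ of `count` hold the number of positions where the first array equals `levels[0][i]` and the second equals `levels[1][j]`.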

Separable temporal convolutions

Sat, 22 Jan 2022 00:00:00 +0000

Consider a multivariate time series $x \in \mathbb{R}^{B \times D \times T}$ with $D=3$ channels, $T=4$ timesteps and batch size $B=1$.

import torch
import torch.nn as nn

x = [
 [1,5,10,20],
 [100,150,200,250],
 [1000,1500,2000,2500],
]

# [batch_size=1, in_channels=3, timesteps=4]
xt = torch.FloatTensor(x)
xt = xt.unsqueeze(0)

The separable convolution learns a group of filters for each channel independently, without interactions across channels. Below, I define a 1D convolutional layer with kernel size 2 and learn 1 filter per channel (num_channels=1).

separable = True
in_channels = xt.shape[1]
num_channels = 1
kernel_size = 2
stride = 1
layer_i = 0
dilation_size = 2 ** layer_i
padding = (kernel_size - 1) * dilation_size
groups = in_channels if separable else 1
out_channels = in_channels * num_channels

For illustrative purposes, the weights are initialized to 1 and bias to 0.

conv1 = nn.Conv1d(
 in_channels,
 out_channels,
 kernel_size,
 stride=stride,
 padding=padding,
 dilation=dilation_size,
 groups=groups,
)
torch.nn.init.constant_(conv1.weight, 1)
torch.nn.init.constant_(conv1.bias, 0)

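A separable convolution with kernel_size=2 and num_channels=1 is simply a weighted sum along each channel. With padding=1 the layer returns $T+1=5$ timesteps; the outputs below keep the first $T=4$ and drop the trailing one.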

tensor([[[ 1, 6, 15, 30],
 [ 100, 250, 350, 450],
 [1000, 2500, 3500, 4500]]])

Increasing num_channels will learn a set of independent filters for each channel. For example, num_channels=3 gives a total number of output channels of num_channels*in_channels=9.

tensor([[[ 1, 6, 15, 30],
 [ 1, 6, 15, 30],
 [ 1, 6, 15, 30],
 [ 100, 250, 350, 450],
 [ 100, 250, 350, 450],
 [ 100, 250, 350, 450],
 [1000, 2500, 3500, 4500],
 [1000, 2500, 3500, 4500],
 [1000, 2500, 3500, 4500]]])

When separable=False and num_channels=1, you get mixing between the channels:

tensor([[[1101, 2756, 3865, 4980],
 [1101, 2756, 3865, 4980],
 [1101, 2756, 3865, 4980]]])
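
The numbers in this post can be reproduced without PyTorch. The NumPy sketch below assumes all-ones filters and left padding only (equivalent to keeping the first $T$ outputs of the padded convolution); the variable names are chosen for this example.

```python
import numpy as np

x = np.array([
    [1, 5, 10, 20],
    [100, 150, 200, 250],
    [1000, 1500, 2000, 2500],
])

kernel_size = 2
padding = kernel_size - 1  # dilation of 1

# left-pad each channel with zeros, then sum every window of length kernel_size
padded = np.pad(x, ((0, 0), (padding, 0)))
windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size, axis=1)
separable = windows.sum(axis=-1)   # one all-ones filter per channel, no mixing
mixed = separable.sum(axis=0)      # an all-ones filter that spans all channels
```

Here `separable` matches the separable output with num_channels=1, and `mixed` matches each row of the separable=False output, which makes the difference between the two modes concrete: the second one sums across channels as well as time.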

Challenges in Machine Learning for Health

Wed, 01 Dec 2021 00:00:00 +0000

The secondary analysis of health data is challenging due to confounding, bias, uncertainty, and missingness.

Health data is irregularly sampled in time

Some data types are more frequently acquired than others. Vitals and laboratory measurements are taken for most in-patients. Clinical notes are also usually available for most patients. Ventilator settings, ECG signals, invasive blood pressure measurements, imaging data, and genomics data are more infrequent. In the intensive care unit (ICU), for instance, continuous signals from ECG and ABP are typically sampled at 125Hz to 500Hz. The raw ECG, ABP, and PPG signals can be downsampled to “high-frequency numerics” at 1sec or 1min intervals. Some EHR databases will further apply a median filter so the highest sampling rate of vital signs like heart rate, blood pressure, temperature, and oxygen saturation is 5min. Ventilation settings and arterial blood gases can be charted every 6-12hours. Other metabolic and liver panels are often ordered every 12-24hours. These signals are acquired asynchronously, at irregular time intervals, and at different sampling rates.

Adherence to a standard nomenclature

EHR databases may not follow a standard nomenclature (like SNOMED, LOINC, MDIL, etc…), which introduces uncertainty in mapping measurements to standard concepts. OMOP is an example of an EHR database schema that attempts to heavily standardize concepts. However, the publicly accessible MIMIC and Philips eICU databases are not standardized and it is left up to the user to create concept mappings for medications, vital signs, laboratory measurements, and disease diagnoses. Many numeric fields like a laboratory measurement for creatinine can be entered as free text in the EHR user interface, which introduces further noise for the secondary analysis of EHR data.

Critical care data encodes physiology, clinical practice patterns, and clinician concern

Electronic health records encode more than a patient's physiology. The pattern of measurements, lab orders, and treatments capture the clinical decisions made at the bedside. The pattern of clinical decisions changes between institutions, between units in a hospital, and with the size and type of the institution (e.g. teaching vs community hospital). Disentangling physiology from clinical concern and care patterns is important to produce generalizable disease prediction models. However, this is not as important when using health data for operational research and to optimize hospital operations (e.g. forecasting bed occupancy).

Mapping patients across care settings

Connecting in-patient data collected in the hospital with out-patient data collected at home or in ambulatory care and emergency medical services is especially challenging. Even critical care data collected in the ICU can have periods of missing data where sensors like ECG, invasive ABP using an arterial line catheter, or PPG are missing for extended periods.

Data reliability

Nurse charted measurements can sometimes be unreliable. As an example, nurse charted respiratory rate at the bedside in the general ward is often rounded to a multiple of 5. Even continuously acquired signals can be unreliable. Timestamps between multiple sensors from different vendors (and sometimes the same vendor) can become desynchronized. The internal clock can drift. Some wearable devices may only save downsampled data instead of the raw signal.

Dataset and concept shift

Data drifts over time, institutions, hospital units, and countries. For example, ventilation settings and disease progression will change as more care givers adopt lung protective ventilation protocols; more patients have permissive hypertension and are on vasopressors in the neuro-ICU than in other units. Models that rely on interventions or physiological variables subject to dataset shift are susceptible to silent failure. Appropriate metrics to track dataset shift, model performance, and a pipeline to retrain and redeploy models are necessary.

Generalizability of machine learning models

Algorithms developed for the general ward (GW) setting may not translate to higher acuity settings, and vice versa. Most patients in the GW are not on continuous monitoring and instead vitals are aperiodically nurse charted. The severity of illness and treatment patterns also differ significantly. Models should be carefully deployed in settings where they were designed and tested.

Reproducibility

Reproducibility of the data extraction pipeline is important for documentation and verification. This becomes challenging for a large team working on different parts of the data extraction pipeline with tasks like concept mapping, data labeling, cohort selection, and data pre-processing divided among scientists. Start by coordinating on data storage, naming conventions (for code and datasets), and tooling. I have found that documenting the pipeline in a single function call helps with posterity and reproducibility.

Actionability

Algorithms that are actionable are tied to a protocol or clinical decision support system that is integrated into the clinical workflow. For example, predicting that a patient is at high risk of hemodynamic shock on its own is not useful unless it is tied to a protocol that triggers an alert and a clinical decision support system that recommends appropriate fluids or pressors.

Quantile binning with missing data

Sun, 03 Oct 2021 00:00:00 +0000

This uses Numpy and Numba for fast binning of numerical data to quantiles. It also supports missing data.

import numpy as np
from numba import njit

def _find_binning_thresholds(data, max_bins=256, subsample=int(2e5)):
    if not (2 <= max_bins <= 256):
        raise ValueError(
            "max_bins={} should be no smaller than 2 "
            "and no larger than 256.".format(max_bins)
        )

    # interior percentiles only: max_bins bins need max_bins - 1 thresholds
    percentiles = np.linspace(0, 100, num=max_bins + 1)
    percentiles = percentiles[1:-1]
    binning_thresholds = []
    for f_idx in range(data.shape[1]):
        col_data = np.ascontiguousarray(data[:, f_idx], dtype=np.float64)
        # drop NaN and infinite values before computing thresholds
        mask = np.isfinite(col_data)
        col_data = col_data[mask]
        distinct_values = np.unique(col_data)
        if len(distinct_values) <= max_bins:
            # few distinct values: split halfway between neighbours
            midpoints = distinct_values[:-1] + distinct_values[1:]
            midpoints *= 0.5
        else:
            # `method` replaces the deprecated `interpolation` keyword (NumPy >= 1.22)
            midpoints = np.percentile(
                col_data, percentiles, method="midpoint"
            ).astype(np.float64)
        binning_thresholds.append(np.unique(midpoints))
    return binning_thresholds


@njit()
def _map_num_col_to_bins(data, binning_thresholds, binned):
    # binary search for the first threshold that is >= data[i]
    for i in range(data.shape[0]):
        left, right = 0, binning_thresholds.shape[0]
        while left < right:
            middle = (right + left - 1) // 2
            if data[i] <= binning_thresholds[middle]:
                right = middle
            else:
                left = middle + 1
        binned[i] = left


def _map_to_bins(data, binning_thresholds, binned):
    """Bin numerical values to discrete integer-coded levels."""
    for feature_idx in range(data.shape[1]):
        _map_num_col_to_bins(
            data[:, feature_idx],
            binning_thresholds[feature_idx],
            binned[:, feature_idx],
        )


def _assign_nan_to_bin(binned, X, actual_n_bins, assign_nan_to_unique_bin=False):
    # NaNs either get their own bin (index n_bins) or stay NaN
    mask = np.isnan(X)
    for i in range(X.shape[1]):
        binned[mask[:, i], i] = actual_n_bins[i] if assign_nan_to_unique_bin else np.nan
    return binned


class QuantileBinning():
    def __init__(self):
        self.bin_thresholds = []
        self.n_bins = []

    def fit(self, X):
        self.bin_thresholds = _find_binning_thresholds(X)
        self.n_bins = np.array(
            [thresholds.shape[0] + 1 for thresholds in self.bin_thresholds], dtype=np.uint32
        )

    def transform(self, X, assign_nan_to_unique_bin=False):
        binned = np.zeros_like(X, dtype=np.float32, order="F")
        _map_to_bins(X, self.bin_thresholds, binned)
        binned = _assign_nan_to_bin(binned, X, self.n_bins, assign_nan_to_unique_bin)
        return binned
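
The same idea can be sketched for a single column in plain NumPy, which is handy for checking results by hand. This is a simplified stand-in, not the class above: it uses the default linear percentile interpolation, always places NaNs in their own bin, and the function name quantile_bin is made up. np.searchsorted with side="left" performs the same search as the hand-written binary search (first threshold greater than or equal to the value).

```python
import numpy as np

def quantile_bin(col, max_bins=4):
    """Bin one column to quantiles; NaNs get their own bin at the end."""
    col = np.asarray(col, dtype=np.float64)
    finite = col[np.isfinite(col)]
    # interior percentiles only: max_bins bins need max_bins - 1 thresholds
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    thresholds = np.unique(np.percentile(finite, percentiles))
    # index of the first threshold that is >= the value
    binned = np.searchsorted(thresholds, col, side="left")
    # value bins are 0 .. len(thresholds); the next index is reserved for NaN
    binned[np.isnan(col)] = len(thresholds) + 1
    return thresholds, binned

col = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan])
thresholds, binned = quantile_bin(col, max_bins=4)
```

With eight evenly spaced values and four bins, each bin receives two values and the NaN lands in the extra bin.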

Ensemble decision trees in Numba

Tue, 21 Sep 2021 00:00:00 +0000

Representing ensembles of decision trees using numpy arrays with fast numba operations.

import math
import numpy as np
from numba import njit, prange


@njit
def take(X, inds):
    """Multidimensional indexing for numba"""
    n = len(X)
    y = np.zeros(n, dtype=X.dtype)
    for i in range(n):
        # pick column inds[i] from row i
        y[i] = X[i, inds[i]]
    return y


@njit
def next_node(node_id, value, thr):
 """A vectorized operation to find the next node in a binary tree given a
 value and threshold.
 Args:
 node_id: int or array of the current node, root node_id = 0
 Return:
 new node id, left node when value <= thr, right node when value > thr
 """
 return (node_id << 1) + 1 + (1 * (thr < value))


@njit
def leaf(X, features, thresholds, reset_leaf_index=1):
    """Find the leaf node index along a decision path given a tree's feature
    indices, thresholds, and design matrix.
    Args:
        X: 2D design matrix of shape [nsamples, nfeatures]
        features: feature indices as an integer array of shape [internal_nodes]
        thresholds: split thresholds as dtype np.float64 of shape [internal_nodes]
        reset_leaf_index: when 1, leaf indices are renumbered to start from 0
    """
    nsamples = len(X)
    node_id = np.zeros(nsamples, dtype=np.int64)
    n = len(features)
    # a full binary tree with n internal nodes has depth log2(n + 1)
    depth = int(math.log(n+1)/math.log(2))
    internal_nodes = 2**(depth) - 1
    for i in range(depth):
        feature_ind = features[node_id]
        value = take(X, feature_ind)
        thr = thresholds[node_id]
        node_id = next_node(node_id, value, thr)
    if reset_leaf_index == 1:
        node_id = node_id - internal_nodes
    return node_id


@njit
def leaf_tokens(X, trees, nleaves_per_tree):
    """Tokenize a design matrix with leaf indices.
    Args:
        X: 2D design matrix of shape [nsamples, nfeatures]
        trees: 3D matrix defining an ensemble of decision trees of shape
            [ntrees, internal_nodes, 2]
        nleaves_per_tree: number of leaves in each decision tree, 2**depth
    """
    nsamples = X.shape[0]
    ntrees = trees.shape[0]
    leaves = np.zeros((nsamples,ntrees), dtype=np.int64)
    for i in prange(0,ntrees):
        features = trees[i,:,0].astype(np.int64)
        thresholds = trees[i,:,1]
        # offset by tree index so every tree gets its own range of token ids
        leaves[:,i] = leaf(X, features, thresholds) + nleaves_per_tree * i
    return leaves


@njit
def random_ensemble_decision_trees(ntrees, depth, nfeatures):
 """Generate a random ensemble of decision trees
 Args:
 ntrees: number of trees in ensemble
 depth: height of each tree
 nfeatures: number of features in design matrix
 Returns:
 trees: 3D matrix of shape [ntrees, internal_nodes, 2] where internal_nodes
 is the number of non-leaf nodes (including root node) calculated
 as 2^{depth} - 1 and the last dimension includes feature index and
 splitting thresholds
 """
 internal_nodes = 2**(depth) - 1
 total_internal_nodes = ntrees * internal_nodes
 trees = np.zeros((total_internal_nodes, 2)) # [features, thresholds]
 trees[:,0] = np.random.randint(0, nfeatures, total_internal_nodes)
 trees[:,1] = np.random.rand(total_internal_nodes)
 trees = trees.reshape(ntrees, internal_nodes, -1) # [ntrees, internal_nodes, 2]
 return trees


nsamples, nfeatures = 1000, 30
X = np.random.rand(nsamples,nfeatures)

ntrees = 1000
depth = 1
nleaves = 2**depth
total_nleaves = ntrees * nleaves
trees = random_ensemble_decision_trees(ntrees=ntrees, depth=depth, nfeatures=nfeatures)
leaves = leaf_tokens(X, trees, nleaves)
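
The array layout is easier to follow on a tiny example. The sketch below walks four samples through a single depth-2 tree in plain Python, using the same heap indexing as next_node (left child at 2i+1, right child at 2i+2). The features, thresholds, and samples are made up for illustration.

```python
import numpy as np

# A single depth-2 tree stored in heap order: node 0 is the root,
# nodes 1 and 2 are its children, and the leaves are nodes 3..6.
features = np.array([0, 1, 1])          # feature index tested at each internal node
thresholds = np.array([0.5, 0.3, 0.7])  # hypothetical split thresholds
depth = 2
internal_nodes = 2 ** depth - 1

def find_leaf(x):
    node = 0
    for _ in range(depth):
        go_right = x[features[node]] > thresholds[node]
        node = 2 * node + 1 + int(go_right)   # left child 2i+1, right child 2i+2
    return node - internal_nodes              # leaf index in 0 .. 2**depth - 1

samples = np.array([
    [0.2, 0.1],   # left, left   -> leaf 0
    [0.2, 0.9],   # left, right  -> leaf 1
    [0.8, 0.1],   # right, left  -> leaf 2
    [0.8, 0.9],   # right, right -> leaf 3
])
leaf_ids = [find_leaf(x) for x in samples]
```

The numba version above does the same walk for all samples at once by keeping an array of current node ids.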

Feature engineering for time series data using Numba

Tue, 07 Sep 2021 00:00:00 +0000

Feature engineering over a multivariate time series with missing data.

Given a sequence of measurements (values: np.ndarray) and their observation times (times: np.ndarray), we want to engineer features for a machine learning model that capture the temporal trends and statistics over different temporal windows. Our data is irregularly sampled and has missing values.

Features representing the magnitude, dispersion, direction of change, and temporal trends are derived. Every time point is represented with 4 categories of features:

Magnitude of the most recent observation within the last 6h for vital signs and 24h for laboratory measurements.
Dispersion measured as the range over a short and long-time window.
Direction of change (increasing, decreasing, no change) over a short and long-time window.
Exponential moving averages (EMA) with varying decay rates that specify how much impact each past observation has on the current mean. EMA features were calculated on the forward filled magnitudes using an EMA algorithm designed specifically for irregularly sampled time series.

The engineered features include:

dt: time elapsed since the measurement was made
val: most recent measurement, measurements are forward filled up to a maximum duration after
srng: range as $\frac{x_{max}-x_{min}}{x_{max}+x_{min}} \times 100$ over a short window
ssgn: sign of the change (-1 or +1) between the first and last measurement in a short window, windows without measurements are filled with 0
lrng: range as $\frac{x_{max}-x_{min}}{x_{max}+x_{min}} \times 100$ over a long window
lsgn: sign of the change (-1 or +1) between the first and last measurement in a long window, windows without measurements are filled with 0
sema: slow exponential moving average, calculated after forward filling
fema: fast exponential moving average, calculated after forward filling
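
As a rough illustration of the range and sign features, here is a minimal pure-Python sketch for a single lookback window. The helper name window_range_and_sign and the sample values are hypothetical, and the sketch skips the small constant that the Numba implementation adds to the denominator.

```python
import math


def window_range_and_sign(values):
    """Return (rng, sgn) for the non-missing values in one lookback window.

    rng is (max - min) / (max + min) * 100 and sgn is the sign of
    last - first. Both are 0 when the window has no observations.
    """
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return 0.0, 0.0
    high, low = max(observed), min(observed)
    total = high + low
    rng = (high - low) / total * 100 if total > 1e-6 else 0.0
    diff = observed[-1] - observed[0]
    sgn = float((diff > 0) - (diff < 0))  # -1, 0 or +1
    return rng, sgn


# hypothetical heart-rate window with one missing value
rng, sgn = window_range_and_sign([80.0, float("nan"), 90.0, 100.0])
```

Here the range is 20 / 180 * 100 (about 11.1) and the sign is +1 because the last observed value is larger than the first.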

We also need to define the following settings for every variable:

forward_fill_duration: duration in minutes to forward fill a missing variable
short_rolling_window_size: short rolling window size in minutes
long_rolling_window_size: long rolling window size in minutes
slow_ema_tau: decay rate for slow exponential moving average in minutes
fast_ema_tau: decay rate for fast exponential moving average in minutes
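
One way to keep these settings together is a small per-variable record. The sketch below is only an assumption about how that could look: the FeatureSettings class, the variable names, and the window and decay values are all hypothetical. Only the 6h and 24h forward fill durations come from the description above.

```python
from dataclasses import dataclass

MINUTES_PER_HOUR = 60.0


@dataclass(frozen=True)
class FeatureSettings:
    """Per-variable settings; every field is a duration in minutes."""
    forward_fill_duration: float
    short_rolling_window_size: float
    long_rolling_window_size: float
    slow_ema_tau: float
    fast_ema_tau: float


# Window sizes and decay rates are made-up placeholders, not tuned values.
SETTINGS = {
    "heart_rate": FeatureSettings(
        forward_fill_duration=6 * MINUTES_PER_HOUR,   # vital sign
        short_rolling_window_size=1 * MINUTES_PER_HOUR,
        long_rolling_window_size=6 * MINUTES_PER_HOUR,
        slow_ema_tau=6 * MINUTES_PER_HOUR,
        fast_ema_tau=1 * MINUTES_PER_HOUR,
    ),
    "lactate": FeatureSettings(
        forward_fill_duration=24 * MINUTES_PER_HOUR,  # laboratory measurement
        short_rolling_window_size=6 * MINUTES_PER_HOUR,
        long_rolling_window_size=24 * MINUTES_PER_HOUR,
        slow_ema_tau=24 * MINUTES_PER_HOUR,
        fast_ema_tau=6 * MINUTES_PER_HOUR,
    ),
}
```

Whatever unit is chosen, it has to match the unit of the times array passed to the functions below.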

Some derived features have no missing data, including dt, ssgn, srng, lsgn, and lrng. Other variables like val, sema, fema have missing values encoded as NaN. Missing data is forward filled up to forward_fill_duration and all missing values after the forward filling time are NaN.

import numpy as np
from numba import njit, prange

@njit()
def ema(times, values, tau):
 """Exponential moving average for irregularly sampled time series.
 Units for ``times`` and ``tau`` should be in hours.

 Args:
 times (np.ndarray): 1D array of measurement times in hours
 values (np.ndarray): 1D array of measurements
 tau (float): time decay in hours
 """
 n = len(values)
 ret = np.empty(n, dtype=np.float64) * np.nan
 ret[0] = values[0]
 last_i = 0
 for i in range(1, n):
 if np.isnan(times[i]) | np.isnan(values[i]):
 continue
 alpha = (times[i] - times[last_i]) / tau
 w = np.exp(-alpha)
 if alpha > 1e-6:
 w2 = (1 - w) / alpha
 else:
 # use Taylor expansion for numerical stability
 w2 = 1 - (alpha / 2) + (alpha * alpha / 6) - (alpha * alpha * alpha / 24)
 ret[i] = (ret[last_i] * w) + (values[i] * (1 - w2)) + (values[last_i] * (w2 - w))
 last_i = i
 return ret


@njit()
def ffill(times, values):
 """Forward fill an array of values and times.
 Times indicate the time of last observation.

 Args:
 times (np.ndarray): array of observation times
 values (np.ndarray): array of values with nan

 Returns:
 2D array with forward filled times (index 0)
 and values (index 1).
 """
 out_values = values.copy()
 out_times = times.copy()
 n = values.shape[0]
 out = np.zeros((n, 2), dtype=np.float64)
 lastval = np.nan
 lasttime = np.nan
 for row_idx in range(n):
 if np.isfinite(values[row_idx]):
 lastval = values[row_idx]
 lasttime = times[row_idx]
 out_values[row_idx] = lastval
 out_times[row_idx] = lasttime
 out[:, 0] = out_times
 out[:, 1] = out_values
 return out


@njit()
def sliding_windows(times, window_size):
 tdiff = times - times.reshape(-1, 1)
 twindows = (tdiff >= 0) & (tdiff <= window_size)
 return twindows


@njit()
def feature_engineering(
 times,
 values,
 forward_fill_duration,
 short_rolling_window_size,
 long_rolling_window_size,
 slow_ema_tau,
 fast_ema_tau,
):
 """Feature engineering.

 1. Age of the observation
 2. Most recent forward-filled measurement
 3. Dispersion as (max-min)/(max+min) over short window
 4. Sign of change between first and last value in short window
 5. Dispersion as (max-min)/(max+min) over long window
 6. Sign of change between first and last value in long window
 7. slow EMA
 8. fast EMA

 Forward filling replaces nan values up to a maximum duration given by
 `forward_fill_duration`. Any remaining missing values should be imputed
 with the median or by sampling from a standard reference range.

 Args:
 times (np.ndarray): 1D array of observation times
 values (np.ndarray): 1D array of observations
 forward_fill_duration (float): duration to forward fill missing values
 short_rolling_window_size (float): observation duration for rolling
 statistics with a short lookback window
 long_rolling_window_size (float): observation duration for rolling
 statistics with a long lookback window
 slow_ema_tau (float): slow decay rate weights more of the past
 fast_ema_tau (float): fast decay rate weights more of the present

 Returns:
 2D matrix with shape [n_samples, 8], where the features are:
 suffixes = ['dt', 'val', 'short_rng', 'short_sgn', 'long_rng', 'long_sgn', 'slow_ema', 'fast_ema']
 """
 DT_IX = 0
 VAL_IX = 1
 SRNG_IX = 2
 SSGN_IX = 3
 LRNG_IX = 4
 LSGN_IX = 5
 SEMA_IX = 6
 FEMA_IX = 7

 n = values.shape[0]
 derived = np.zeros((n, 8)) * np.nan

 missing = np.isnan(values)

 # no observations: leave dt, val and the EMAs as NaN, set range and sign to 0
 if np.all(missing):
 derived[:, SSGN_IX] = 0.
 derived[:, LSGN_IX] = 0.
 derived[:, SRNG_IX] = 0.
 derived[:, LRNG_IX] = 0.
 return derived

 # indicate samples within a fixed window for each sample
 short_windows = sliding_windows(times, short_rolling_window_size)
 short_windows = short_windows & ~missing.reshape(-1,1)
 long_windows = sliding_windows(times, long_rolling_window_size)
 long_windows = long_windows & ~missing.reshape(-1,1)

 # forward fill
 xfill = ffill(times, values)
 dt = times - xfill[:,0]

 # 0. Age of the measurement
 derived[:, DT_IX] = dt

 # 1. last measured value
 derived[:, VAL_IX] = xfill[:, 1]
 expired = ~np.isnan(dt)
 expired[expired] = dt[expired] >= forward_fill_duration
 derived[expired, VAL_IX] = np.nan

 # rolling statistics
 for t in range(n):
 # samples in this window
 short_inds = short_windows[:, t]
 long_inds = long_windows[:, t]
 short_vals = values[short_inds]
 long_vals = values[long_inds]

 if short_vals.size > 0:
 # 2. rolling dispersion
 high = np.max(short_vals)
 low = np.min(short_vals)
 total = high + low
 if total > 1e-6:
 derived[t, SRNG_IX] = (high - low) / (total + 1e-6) * 100
 else:
 derived[t, SRNG_IX] = 0.

 # 3. sign of change between last and first value
 if short_vals.size >= 2:
 derived[t, SSGN_IX] = np.sign(short_vals[-1] - short_vals[0])
 else:
 derived[t, SSGN_IX] = 0.
 else:
 derived[t,SRNG_IX] = 0.
 derived[t,SSGN_IX] = 0.

 if long_vals.size > 0:
 # 4. rolling dispersion
 high = np.max(long_vals)
 low = np.min(long_vals)
 total = high + low
 if total > 1e-6:
 derived[t, LRNG_IX] = (high - low) / (total + 1e-6) * 100
 else:
 derived[t, LRNG_IX] = 0.

 # 5. sign of change between last and first value
 if long_vals.size >= 2:
 derived[t, LSGN_IX] = np.sign(long_vals[-1] - long_vals[0])
 else:
 derived[t, LSGN_IX] = 0.
 else:
 derived[t, LRNG_IX] = 0.
 derived[t, LSGN_IX] = 0.

 # exponential moving average using the forward filled data
 ema_slow = np.empty(n, dtype=np.float64) * np.nan
 ema_fast = np.empty(n, dtype=np.float64) * np.nan
 m = np.isfinite(derived[:, 1])
 vals = derived[m, 1]
 if vals.size > 0:
 ema_slow = ema(times[m], derived[m, 1], slow_ema_tau)
 ema_fast = ema(times[m], derived[m, 1], fast_ema_tau)
 # 6. slow EMA
 derived[m, SEMA_IX] = ema_slow
 # 7. fast EMA
 derived[m, FEMA_IX] = ema_fast

 return derived

Select a random window of maximum duration in NumPy

Mon, 02 Aug 2021 00:00:00 +0000

Given a sequence of indices $x$ and times $t$, we want to select a window of data from $x$ with a maximum duration and a maximum number of measurements.

import random

import numpy as np


def select_random_window(
 x: np.ndarray, t: np.ndarray, max_window_size: int, dur: float
):
 """Select a random window of consecutive elements with a maximum duration.
 This function has extra logic that ensures the selected windows have length
 max_window_size whenever possible.

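 Args:
 x (np.ndarray): sequence of indices
 t (np.ndarray): observation time for each element of x
 max_window_size (int): maximum number of observations
 dur (float): maximum duration of the window, in the units of t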

 Returns:
 1D array of samples in the observation window.
 """
 L = len(x)
 if L <= max_window_size:
 return x
 if (t[-1] - t[0]) <= dur:
 return x
 L_tmp = L
 if L < max_window_size - 6:
 L_tmp = L - int(max_window_size / 2)
 else:
 L_tmp = L - 1
 start = random.randint(0, L_tmp)
 t_start = t[start]
 t_end = t_start + dur
 if t_end > t[-1]:
 return x[start:]
 end = np.argwhere((t - t_end) >= 0)[0][0]
 start = start if end <= L else max(0, start - (end - L))
 return x[start:end]

Causal inference learners

Mon, 01 Feb 2021 00:00:00 +0000

We are given covariates $X$, a treatment indicator $W$, and a binary outcome $Y \in \{0,1\}$.

Inverse probability of treatment weights

Fit a model on observations $X$ to predict the treatment $W$, $p_{w_i}(x_i)=P(W=w_i|X=x_i)$, and use the inverse of the probability of the treatment actually received as a sample weight when predicting the outcome $Y$.

$$IPTW_i=\frac{1}{p_{w_i}(x_i)}$$
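
A minimal sketch of the weight calculation, assuming the propensity model has already produced an estimate of $P(W=1|X=x_i)$ for every sample. The function name iptw_weights and the numbers are hypothetical.

```python
def iptw_weights(treatment, propensity):
    """Inverse probability of treatment weights.

    treatment:  0/1 treatment indicators w_i
    propensity: estimated P(W=1 | X=x_i) for each sample
    """
    weights = []
    for w, e in zip(treatment, propensity):
        p_received = e if w == 1 else 1.0 - e  # P(W = w_i | X = x_i)
        weights.append(1.0 / p_received)
    return weights


w = [1, 1, 0, 0]
e = [0.8, 0.5, 0.2, 0.5]  # hypothetical propensity scores
weights = iptw_weights(w, e)
```

Treated samples are weighted by 1/e and untreated samples by 1/(1-e), so samples that received an unlikely treatment count for more in the outcome model.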

S-learner

A single model is trained on observations and treatments.

$$\begin{aligned} \hat{\mu} = M(Y \sim (X,W))\\ \hat{\tau}(x)=\hat{\mu}(x,1) - \hat{\mu}(x,0) \end{aligned}$$

T-learner

Two models, one model per treatment, are trained on the observations only.

$$\begin{aligned} \hat{\mu}_0 = M_0(Y^0 \sim X^0)\\ \hat{\mu}_1 = M_1(Y^1 \sim X^1)\\ \hat{\tau}(x)=\hat{\mu}_1(x)-\hat{\mu}_0(x) \end{aligned}$$

X-learner

The X-learner first estimates the response functions ($\hat{\mu}_0$ and $\hat{\mu}_1$), then computes the imputed treatment effects ($\tilde{D}_{i}^{1}$ and $\tilde{D}_{i}^{0}$), then estimates the conditional average treatment effect for treated and controls ($\hat{\tau}_1$ and $\hat{\tau}_0$), and finally takes a weighted average of the two estimates.

$$\begin{aligned} \hat{\mu}_0 = M_1(Y^0 \sim X^0)\\ \hat{\mu}_1 = M_2(Y^1 \sim X^1)\\ \tilde{D}_{i}^{1}=Y_i^1 - \hat{\mu}_{0}(X_i^1)\\ \tilde{D}_{i}^{0}=\hat{\mu}_1(X_i^0) - Y_{i}^{0}\\ \hat{\tau}_1=M_3(\tilde{D}^{1} \sim X^{1})\\ \hat{\tau}_0=M_4(\tilde{D}^{0} \sim X^{0})\\ \hat{\tau}(x)=g(x)\hat{\tau}_0(x)+(1-g(x))\hat{\tau}_1(x) \end{aligned}$$

$g(x)\in[0,1]$ is a weighting function which is chosen to minimize the variance of $\hat{\tau}(x)$. $g(x)$ can be estimated by the propensity score or set to a constant equal to the ratio of treated to untreated samples.
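
The X-learner steps can be sketched in a few lines of Python. This is an illustration under strong simplifying assumptions: the base learner is a toy model that predicts the mean outcome within each value of a single discrete covariate, $g(x)$ is a constant 0.5, and the data are made up. In practice each $M_k$ would be a regression or machine learning model.

```python
from collections import defaultdict


def fit_stratum_means(x, y):
    """Toy base learner M: predicts the mean of y within each value of x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, yi in zip(x, y):
        sums[xi] += yi
        counts[xi] += 1
    means = {k: sums[k] / counts[k] for k in sums}
    return lambda xi: means[xi]


def x_learner(x, w, y, g=0.5):
    """Return tau_hat(x) for discrete covariates x, treatments w, outcomes y."""
    x0 = [xi for xi, wi in zip(x, w) if wi == 0]
    y0 = [yi for yi, wi in zip(y, w) if wi == 0]
    x1 = [xi for xi, wi in zip(x, w) if wi == 1]
    y1 = [yi for yi, wi in zip(y, w) if wi == 1]

    # response functions fitted separately on controls and treated
    mu0 = fit_stratum_means(x0, y0)
    mu1 = fit_stratum_means(x1, y1)

    # imputed treatment effects
    d1 = [yi - mu0(xi) for xi, yi in zip(x1, y1)]
    d0 = [mu1(xi) - yi for xi, yi in zip(x0, y0)]

    # treatment effect models for treated and controls
    tau1 = fit_stratum_means(x1, d1)
    tau0 = fit_stratum_means(x0, d0)

    # weighted average with a constant weighting function g
    return lambda xi: g * tau0(xi) + (1 - g) * tau1(xi)


# hypothetical data with one binary covariate
x = [0, 0, 0, 0, 1, 1, 1, 1]
w = [0, 0, 1, 1, 0, 0, 1, 1]
y = [0, 1, 1, 1, 0, 0, 1, 1]
tau_hat = x_learner(x, w, y)
```

With stratum means as the base learner the result coincides with the T-learner estimate; the two approaches differ once the base learners smooth across covariate values or the groups are unbalanced.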

G-Computation

Parameters we can derive from the G-Computation:

For a binary outcome, $Y\in\{0,1\}$, $P(Y=1)=E[Y]$.
$E[Y^1]-E[Y^0]$ is the causal risk difference due to treatment
$E[Y^1]/E[Y^0]$ is the causal relative risk
$\frac{E[Y^1]/(1-E[Y^1])}{E[Y^0]/(1-E[Y^0])}$ is the causal odds ratio due to treatment
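
Once $E[Y^1]$ and $E[Y^0]$ have been estimated (in G-computation, by averaging the outcome model's predictions with the treatment set to 1 and to 0 for every sample), the three contrasts are simple arithmetic. The function name causal_contrasts and the input values are hypothetical.

```python
def causal_contrasts(ey1, ey0):
    """Risk difference, relative risk and odds ratio from E[Y^1] and E[Y^0]."""
    risk_difference = ey1 - ey0
    relative_risk = ey1 / ey0
    odds_ratio = (ey1 / (1 - ey1)) / (ey0 / (1 - ey0))
    return risk_difference, relative_risk, odds_ratio


# hypothetical mean potential outcomes
rd, rr, odds = causal_contrasts(ey1=0.3, ey0=0.2)
```

For these inputs the risk difference is 0.1, the relative risk is 1.5, and the odds ratio is about 1.71.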

Cloudflare pages for static hosting

Thu, 21 Jan 2021 00:00:00 +0000

The asifr.com website has been hosted on a cheap DigitalOcean server for the last 8 years. I hadn’t touched the server for many years and it stagnated running Ubuntu 14 and didn’t have HTTPS. Today, I migrated the static site to Cloudflare Pages for free hosting in two steps:

Create a GitHub repository with the static files. The repo can be public or private.
Connect Cloudflare Pages to the GitHub repo and set the path to the static files in your repo (e.g. /public or /dist).

Offline, I use pandoc to compile a folder of Markdown files into static HTML files.

Cloudflare Pages automatically redeploys the site when I commit the changed files to the GitHub repo. Since the static files are generated offline, there is no cloud-based build step and Cloudflare Pages simply serves the public static files.

Cloudflare gives a nice domain name: asifr.pages.dev, but also provides custom DNS service so I can point asifr.com to asifr.pages.dev. It takes a few hours for the domain name propagation once the nameservers are transferred and the CNAME is created. Now asifr.com has secure HTTPS connections and is served from Cloudflare’s edge network.

The benefits are: I get a free hosting service since Cloudflare Pages provides unlimited bandwidth, requests, and sites (the limitation of the free-tier is I get 500 builds per month – which is plenty for a simple personal site that is updated infrequently). You also get security, HTTPS, analytics, and users get fast access to the site from anywhere in the world.

Interpretable univariate risk curves

Sun, 05 Jul 2020 00:00:00 +0000

High accuracy complex models, like neural networks and generalized additive models, come at the expense of interpretability. The contributions of individual features to the outcome are difficult to understand in complex models, which have received significant criticism, particularly in high-stakes settings (like medicine and criminal cases) where trust is critical. Intelligible models develop trust with the user and help us debug counter-intuitive or inaccurate relationships learned by the model. Before applying complex models, it’s usually a good idea to look at odds ratios and risk curves to visualize how each variable is related to the outcome. Odds ratios quantify the strength of the relationship and are a good first step in ranking features that have a strong association with the outcome, compared to features with weaker associations. Risk curves help us visualize the shape of the association (e.g. linear, non-linear, monotonic, increasing, decreasing).

Odds-ratios

Odds ratios are widely used to compare the relative odds of the occurrence of the outcome (e.g. disease), given exposure to a feature. The odds ratio can also be used to determine whether a particular feature is a risk factor for an outcome, and to compare the magnitude of various risk factors for that outcome.

OR = 1 Feature does not affect odds of outcome
OR > 1 Feature is associated with higher odds of outcome
OR < 1 Feature is associated with lower odds of outcome

import numpy as np
import statsmodels.api as sm
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

def univariatelr(X,Y,feature_name,binary=False):
 res = sm.Logit(Y, X, missing='drop').fit(disp=0, intercept=True)
 ci = np.round(np.exp(res.conf_int(alpha=0.05, cols=None)),2).squeeze()
 r = {'feature':feature_name,
 'nobs': np.sum(X==1) if binary else res.nobs,
 'coeff': round(res.params[0],2),
 'OR': round(np.exp(res.params[0]),2),
 'ci_low': ci[0],
 'ci_high': ci[1],
 'p-value': np.round(res.pvalues[0],2)}
 return r

The coefficients of a logistic regression model are log-odds ratios, and taking the exponential of a coefficient gives the odds-ratio. Odds-ratios are plotted along with the confidence interval. If the confidence interval overlaps with OR=1, then the feature has a weak or no association with the outcome. Below is an example of the odds-ratios from univariate logistic regressions where the outcome label is survived or expired in the ICU.
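
For a binary feature, the odds ratio and an approximate confidence interval can also be read directly off a 2x2 table of counts. The sketch below uses the standard error of the log odds ratio (Woolf's method) rather than the logistic regression above. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_2x2(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )


# hypothetical counts
or_, ci_low, ci_high = odds_ratio_2x2(a=30, b=70, c=10, d=90)
```

Here the odds ratio is 27/7 (about 3.86) and the interval lies entirely above 1, which would suggest that the feature is associated with higher odds of the outcome.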

Risk curves

Risk curves describe the relationship between a feature and an outcome. For example a risk curve can be used to understand the relationship between a physiological measure like heart rate and an outcome like the probability of developing a disease. These graphical interpretations help explain hidden relationships in data, are interpretable by a non-technical audience, and relatively easy to construct.

The risk curve for lactate shows that the risk of mortality monotonically increases with increasing lactate. We could also reasonably draw a cutoff around lactate > 2 mmol/L and group these high-lactate patients for further analysis.

Building such a risk curve is simple enough. Given a table of feature value and label:

Value	Label
0.800	0
4.767	1
0.700	0
1.100	0
1.575	0

First bucket each sample into a bin, for example: (0.1 , 0.6), (0.6, 1.2), …, (4.4, 5.). You can use quantiles to determine these bins or just equally divide your data into bins. We can use the histogram function to divide our data into equal sized bins with a specified range and count the number of samples that fall into each bin (h_control, h_treated).

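import numpy as np
import matplotlib.pyplot as plt

# df holds the Value/Label table; minval and maxval bound the range to plot
x = df.Value
y = df.Label
keep = (x >= minval) & (x <= maxval)
x, y = x[keep], y[keep]
xmin = x.min()
xmax = x.max()
h_control, b_control = np.histogram(x[y==0], range=(xmin, xmax), bins=10)
h_treated, b_treated = np.histogram(x[y==1], range=(xmin, xmax), bins=10)
# log of the ratio of per-class bin fractions; empty bins give inf or nan
risk = np.log((h_treated/h_treated.sum())/(h_control/h_control.sum()))
vals = b_control[:-1]  # left edge of each bin
plt.plot(vals, risk)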

The log-odds (risk) is simply the log of the fraction of treated samples that fall in the bin divided by the fraction of untreated samples that fall in the bin.

Transform Grouped Pandas DataFrame to Numpy Array

Mon, 22 Jun 2020 00:00:00 +0000

This snippet transforms a tall Pandas DataFrame with time-series data into a Numpy array while preserving the grouping. This is a common use case for me when preparing training data for recurrent neural networks, where each training sample belongs to a group (EventID below), feature values (FeatureValue) are ordered by time (DateTime), and I want to get the length of each sample (needed to train an RNN with variable length sequences).

EventID	DateTime	FeatureValue
1	0	80
1	5	90
2	0	75
2	10	80

event_col = 'EventID'
time_col = 'DateTime'
value_col = 'FeatureValue'
xt = df.loc[:,[time_col, value_col]].values
g = df.reset_index(drop=True).groupby(event_col)
xtg = [xt[i.values,:] for k,i in g.groups.items()]
SignalLengths = [len(i.values) for k,i in g.groups.items()]
# dtype=object is needed when groups have different lengths
X_signal = np.array(xtg, dtype=object)
EventIDs = list(g.groups.keys())

AWS Lambda Web Scraper

Sun, 21 Jun 2020 00:00:00 +0000

This is a small AWS Lambda function that scrapes websites using axios and stores the data in a MongoDB document. You can set up an API Gateway to the Lambda function and call it with POST requests that carry a JSON body.

Features

Randomly selects from a set of headers with each call.
Automatically sets the host and referer to the same domain.
Saves the response to MongoDB.
Optionally sets the header to json if you expect the output to be json format
Optionally sets the request to XMLHttpRequest

Install the required Node modules: npm install axios mongodb dotenv

const request = require('axios');
const MongoClient = require('mongodb').MongoClient;
const crypto = require('crypto');

// local .env files are loaded into process.env
require('dotenv').config({silent: false});

// load the MongoDB connection string from the .env file
const mongo_host = process.env.MONGO

// database and collection name
const databaseName = 'scraper'
const collectionName = 'rawdata'

// set of headers from which we will randomly select
let headers_list = [
 // Firefox 77 Mac
 {
 "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
 "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
 "Accept-Language": "en-US,en;q=0.5",
 "Referer": "https://www.google.com/",
 "DNT": "1",
 "Connection": "keep-alive",
 "Upgrade-Insecure-Requests": "1"
 },
 // Firefox 77 Windows
 {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
 "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
 "Accept-Language": "en-US,en;q=0.5",
 "Accept-Encoding": "gzip, deflate, br",
 "Referer": "https://www.google.com/",
 "DNT": "1",
 "Connection": "keep-alive",
 "Upgrade-Insecure-Requests": "1"
 },
 // Chrome 83 Mac
 {
 "Connection": "keep-alive",
 "DNT": "1",
 "Upgrade-Insecure-Requests": "1",
 "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
 "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
 "Sec-Fetch-Site": "none",
 "Sec-Fetch-Mode": "navigate",
 "Sec-Fetch-Dest": "document",
 "Referer": "https://www.google.com/",
 "Accept-Encoding": "gzip, deflate, br",
 "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
 },
 // Chrome 83 Windows
 {
 "Connection": "keep-alive",
 "Upgrade-Insecure-Requests": "1",
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
 "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
 "Sec-Fetch-Site": "same-origin",
 "Sec-Fetch-Mode": "navigate",
 "Sec-Fetch-User": "?1",
 "Sec-Fetch-Dest": "document",
 "Referer": "https://www.google.com/",
 "Accept-Encoding": "gzip, deflate, br",
 "Accept-Language": "en-US,en;q=0.9"
 }
]

function isValidURL(string) {
 var res = string.match(/(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/g);
 return (res !== null)
};

module.exports.scrape = async event => {
 // start by parsing the body assuming a POST statement with a JSON body
 let body = JSON.parse(event.body)

 // url is required
 if (!('url' in body)) {
 return {ok: 0, msg: 'Missing URL'}
 }

 // check the url is valid
 if (!isValidURL(body.url)) {
 return {ok: 0, msg: 'Invalid URL'}
 }

 let url = body.url
 let host = new URL(url)

 // randomly select a header and copy it so the shared list is not mutated
 let headers = Object.assign({}, headers_list[Math.floor(Math.random() * headers_list.length)])
 // the request should look like it is originating from the host
 headers['Host'] = host.host
 // referer is from the same domain, referers from google.com are often
 // redirected, which we want to avoid
 headers['Referer'] = host.origin

 // set json headers if we expect the response to be in json
 if ('json' in body) {
 headers['Accept'] = 'application/json, text/javascript, */*; q=0.01'
 }

 // set XMLHttpRequest header, which helps when calling private APIs that
 // would typically be loaded by AJAX calls
 if ('ajax' in body) {
 headers['X-Requested-With'] = 'XMLHttpRequest'
 }

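 // send a GET request with our headers; validateStatus stops axios from
 // throwing on non-2xx responses so the status check below can handle them
 const response = await request({
 'url': url,
 'method': 'get',
 'headers': headers,
 'validateStatus': () => true,
 });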

 if (response.status == 200) {
 // create a data object containing the response body and headers
 let date = new Date()
 let data = {
 'url': url,
 'url_hash': crypto.createHash('md5').update(url).digest("hex"),
 'host': host.host,
 'data': response.data,
 'processed': false,
 'scraped_at': date,
 'scraped_year': date.getFullYear(),
 'scraped_month': date.getMonth() + 1,
 'scraped_day': date.getDate(),
 'response_headers': response.headers,
 'request_headers': headers
 }

 // create a connection to the MongoDB
 const client = await MongoClient.connect(mongo_host, {useUnifiedTopology: true});
 // select the database
 const db = client.db(databaseName);
 // insert data into collection in database
 let r = await db.collection(collectionName).insertOne(data);
 // close the connection to MongoDB
 client.close();

 if (r.insertedCount == 1) {
 // return the newly created ObjectID if a new document was successfully inserted
 return {
 ok: 1,
 url: url,
 insertedId: r.insertedId,
 };
 }
 } else {
 return {
 ok: 0,
 url: url,
 status: response.status,
 msg: 'Bad response status'
 };
 }

 return {
 ok: 0,
 url: url
 };
};

Deploy static website with rsync

Sun, 21 Jun 2020 00:00:00 +0000

Create a Makefile and use rsync to deploy your ./public folder to a remote server. Replace <IP-ADDRESS> with the website location and set FOLDER to the remote location where you want to upload your local files. In practice it’s nice to have a staging folder to test out your website before deployment. Use make staging and make deploy to sync changes.

.PHONY: staging deploy

SERVER = <IP-ADDRESS>
FOLDER = /var/www/html
STAGING = /var/www/html/staging
USER = root

deploy:
 rsync -zarvh ./public/* --compress --recursive --checksum --delete --itemize-changes --exclude-from exclude.rsync $(USER)@$(SERVER):$(FOLDER)

staging:
 rsync -zarvh ./public/* --compress --recursive --checksum --delete --itemize-changes --exclude-from exclude.rsync $(USER)@$(SERVER):$(STAGING)

Svelte Webpack Boilerplate

Fri, 22 May 2020 00:00:00 +0000

This setup uses webpack to bundle code and Svelte to create the user interface. This looks for files under /src and saves the compiled and minified javascript code in /public/js.

Install the required Node modules: npm install svelte svelte-loader.

const path = require('path');

module.exports = {
 entry: {
 dashboard: './src/dashboard.js',
 account: './src/account.js',
 },
 output: {
 path: path.resolve(__dirname, 'public/js/'),
 filename: "[name].js"
 },
 mode: "production",
 module: {
 rules: [
 {
 test: /\.(html|svelte)$/,
 exclude: /node_modules/,
 use: {
 loader: 'svelte-loader',
 options: {}
 },
 }
 ]
 },
 plugins: [],
 resolve: {
 alias: {
 svelte: path.resolve('node_modules', 'svelte')
 },
 extensions: ['.mjs', '.js', '.svelte'],
 mainFields: ['svelte', 'browser', 'module', 'main']
 }
};

Arrhythmia classification with stationary first order Markov process

Mon, 24 Dec 2018 00:00:00 +0000

Time series sequences can be described by their statistical properties like the mean level, trend, periodicity, autocorrelation, variability, and entropy. Most sequence classification models exploit these and other statistical differences between time series signals. However, some features are computationally expensive to calculate and they may not have sufficient discriminative power. Here I will show how Markov chains can be used to derive a simple discriminative statistical measure to separate time series sequences by summarizing the transition probabilities between consecutive timesteps.

A classic paper on ECG time series classification: A new method for detecting atrial fibrillation using R-R intervals by George Moody and Roger Mark (1990), uses Markov chains to classify sequences of heart rate measurements as normal sinus rhythm (NS) and atrial fibrillation (AF). By using the sequence of heart rate values directly we don’t have to engineer new features and can work with the transition matrix derived from the observed sequence. This method is a good choice when signal variability is informative, as is the case in arrhythmia classification. The method is also computationally inexpensive and easy to deploy on low-power devices. The derived Markov score is also discriminative and can be used as a feature in downstream machine learning models.

The outline of the method is as follows: transitions between consecutive observations are summarized in a transition matrix assuming a stationary first-order Markov process (current value depends only on previous value). The log of the ratio of the two transition matrices for group A and group B yields the log-odds matrix $S$. The score is calculated as the sum of the log-odds for each new observed transition.

Transition matrix

In more detail:

Discretize heart rate into a number of bins. We’ll call these bins states.
Calculate a transition probability ($p_{i,j}$ = number of transitions from $state_i$ to $state_j$ / total number of transitions out of $state_i$) between each pair of states and store it in a transition matrix (above figure). We’ll call this matrix $T$. We want a transition matrix for each class. In our example we will be distinguishing a normal sinus rhythm ($T_\text{NS}$) from an arrhythmia like atrial fibrillation ($T_\text{AF}$).
Calculate the odds ratio of observing a state transition in a normal rhythm and an abnormal rhythm: $p_{i,j}^{\text{NS}} / p_{i,j}^{\text{AF}}$. Store the odds ratios for each state transition in a score matrix $S$. In practice we can get the score matrix $S$ simply by dividing the two transition matrices elementwise: $S = T_\text{NS} / T_\text{AF}$.
Take the log of the score matrix so $S = \log(T_\text{NS} / T_\text{AF})$. This gives us the log odds. We can interpret the score matrix as $S_{i,j}=0$ if there is an equal likelihood of observing a transition from $state_i$ to $state_j$ in both a normal sinus and AF rhythm. Note that taking $\log(A/B)$ is equivalent to subtracting $\log(A)-\log(B)$. It follows that positive values of $S$ ($S_{i,j}>0$) mean a greater chance that the state transition from $state_i$ to $state_j$ comes from a normal rhythm and negative values of $S$ ($S_{i,j}<0$) mean you are more likely to observe the state transition in an AF rhythm.
To get the final score we sum up the log-odds from the score matrix for each observed transition and choose an appropriate threshold at which we are confident the sequence comes from an AF rhythm.
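
The score matrix step can be sketched as follows, using plain nested lists. The two transition matrices are hypothetical, and the small constant eps is an assumption of this sketch: it keeps the log finite for transitions that were never observed in one of the classes.

```python
import math


def score_matrix(t_ns, t_af, eps=1e-6):
    """Log-odds matrix S = log(T_NS / T_AF), computed elementwise.

    eps is added to both probabilities so that unobserved transitions
    do not produce log(0) or a division by zero.
    """
    return [
        [math.log((p_ns + eps) / (p_af + eps)) for p_ns, p_af in zip(row_ns, row_af)]
        for row_ns, row_af in zip(t_ns, t_af)
    ]


# hypothetical two-state transition matrices (rows sum to 1)
T_NS = [[0.9, 0.1], [0.2, 0.8]]
T_AF = [[0.5, 0.5], [0.5, 0.5]]
S = score_matrix(T_NS, T_AF)
```

In this toy example staying in the same state is more likely under the normal rhythm, so the diagonal of S is positive and the off-diagonal entries are negative.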

I will use ECG data from the Physionet 2017 Challenge to illustrate how a Markov process is used to classify time series signals. Participants in the challenge had to classify ECG signals collected from mobile sensors as normal sinus rhythm, atrial fibrillation, noise or other. For our purposes we ignore the noise and other labels and simply classify normal and atrial fibrillation.

Since the data is provided as raw ECG signals we first extract heart rate by detecting the R-peaks of the ECG beat and calculating the R to R intervals (heart rate = 60 / RR interval in seconds). The distribution of RR intervals between the two groups already shows a good separation between Normal and AF rhythms.

R-R interval distribution

Next, calculate the transition matrix for all RR intervals in the training data.

def transition_matrix(transitions):
 """Takes an array of discrete values and returns a transition matrix
 Args:
 transitions: list of states
 Returns:
 transition matrix, rows must sum to 1
 """
 n = 1 + max(transitions) # number of states
 M = np.zeros((n,n))
 for (i,j) in zip(transitions,transitions[1:]):
 M[i][j] += 1
 # now convert to probabilities:
 for row in M:
 s = sum(row)
 if s > 0:
 row[:] = [f/s for f in row]
 return np.array(M)

The rows of the transition matrix sum to 1 and show the probability of observing a transition between RR intervals (in milliseconds) from time $t-1$ to $t$. The RR intervals were first discretized into bins of step size 50 ms. The coarseness of the discretization is a hyperparameter that can be tuned to find the level that gives the best class separation.
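
A minimal sketch of the discretization, assuming RR intervals in milliseconds and integer division into 50 ms bins. The function name discretize_rr and the interval values are hypothetical.

```python
def discretize_rr(rr_ms, step=50):
    """Map RR intervals in milliseconds to integer bin indices (states)."""
    return [int(rr // step) for rr in rr_ms]


rr_ms = [812, 798, 640, 1105, 530]  # hypothetical RR intervals
states = discretize_rr(rr_ms)
# consecutive (previous, current) state pairs
transitions = list(zip(states, states[1:]))
```

The resulting list of states has the form that the transition_matrix function above expects as input.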

Normal sinus rhythm (left), Atrial fibrillation (right) transition matrix

Dividing the two transition matrices and taking the log gives us the score matrix where each element is the log-odds ratio of observing a transition between two consecutive states. In the score matrix below, transitions in red are more likely to be observed in atrial fibrillation rhythms and transitions in green are more likely to be observed in normal sinus rhythm.

Score matrix

The final score is calculated by summing the log-odds over the transitions in a newly observed sequence of RR intervals, where $x_t$ is the discretized state at time $t$.

$$M = \sum_{t=2}^{N} S_{x_{t-1},x_t}$$

The score can be plotted over time to show how the score changes with new observations. Examples for normal sinus rhythms have an increasing score and examples with atrial fibrillation have a decreasing score. We can use a cutoff threshold of zero such that a score > 0 will classify the example as a normal sinus rhythm.
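
The running score and the zero threshold can be sketched like this. The two-state score matrix and the state sequence are hypothetical, and the function name markov_scores is an assumption of this sketch.

```python
from itertools import accumulate


def markov_scores(states, S):
    """Running Markov score: cumulative sum of S over observed transitions."""
    steps = [S[i][j] for i, j in zip(states, states[1:])]
    return list(accumulate(steps))


# hypothetical log-odds matrix; positive entries favour normal sinus rhythm
S = [[0.6, -1.6], [-0.9, 0.5]]
scores = markov_scores([0, 0, 0, 1, 1, 0], S)
label = "NS" if scores[-1] > 0 else "AF"
```

The last element of scores is the final score $M$; here it is negative, so the sequence would be labelled AF.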

Markov score

Some closing thoughts: Why is the Markov score discriminative? Because we calculated the log-odds ratio between the two classes. The score works well in binary classification where the variability between consecutive observations is an informative discriminator. It may not work well in cases where the signal is periodic and the classification accuracy depends on information captured in the periodicity of the signal. In these cases a simple differencing or STL decomposition may help remove the periodic signal. The score presented here is also limited by the fact that it is first-order, meaning we look at only consecutive differences between time $t-1$ and $t$. In practice it may be better to use the score as a feature in a machine learning classifier since it captures a discriminative attribute regarding the variability of the signal.

Nowcasting: Maintaining real time estimates of infrequently observed time series

Wed, 05 Dec 2018 00:00:00 +0000

Time series analysis appears in every discipline from physiology to retail pricing. A time series variable is typically measured sequentially at fixed intervals of time (often equispaced but not necessarily). Variables may be measured less frequently than theoretically possible for reasons of cost, effort, or convention. With local level linear trend models we can maintain real-time measures of infrequently measured values (see Predicting the Present with Bayesian Structural Time Series). The problem has been referred to as nowcasting because the goal is to maintain a current estimate of the value of a time series by forecasting the current value instead of the future value. The term itself is not very important as the task is essentially a standard forecasting problem.

Consider a measurement like US weekly initial claims for unemployment (ICNSA), which is a recession leading indicator. Can we learn this week’s number before it is released? To answer this question we would need a real time signal correlated with the outcome (ICNSA numbers). We can use Google Correlate to extract the top 100 search terms that are most correlated with the ICNSA signal. Google Correlate finds search terms that vary in a similar way to your own time series, the ICNSA signal in our case. The 100 search term time series signals are our explanatory (also called exogenous) variables that can be included as regressors to improve the ICNSA forecast performance. The idea is that contemporaneous signals (exogenous variables) are correlated in time with the unobserved signal (endogenous variable) we are trying to estimate, and by regressing on these features we can improve our forecast. The temporal structure in these observed signals can be exploited to infer the behaviour of an unobserved signal. Here we will explore using structural time series models that decompose a signal into additive components consisting of a linear trend and a mean level.

US weekly initial claims for unemployment (ICNSA)

Brief description of structural time series models

The general approach to time series analysis is to first remove or model the parts that change through time to get a stationary series (a time series is stationary if its statistical properties, like variance, don’t change through time). Next, we use a time series model to capture the correlation in the stationary series. A series can be decomposed into:

trend components (long-term change in the mean level)
seasonality component (variation in mean that is periodic in nature and you generally know the period beforehand)
cycles (variation that oscillates but not according to some known or fixed period)
exogenous variables that have some correlation with the endogenous variable
noise

The various components can be combined additively to model the endogenous variable $y$ at time $t$. Such additive models are desirable because we can interpret each term, progressively increase model complexity, and easily diagnose model performance. More concretely a typical model will be written as:

$$ y_t=\mu_t+\gamma_t+\beta^Tx_t+\epsilon_t $$

where $y_t$ is the endogenous variable we want to forecast, $\mu_t$ captures changes in the mean level over time, $\gamma_t$ models the periodic nature of the signal, $\beta^T\boldsymbol{x}_t$ is a regression term with exogenous variables and $\epsilon_t$ is the noise term.

The local linear trend model decomposes the time series into a local level component and a trend component.

$$ \mu_t = \mu_{t-1}+\delta_{t-1}+u_t $$

$$ \delta_t = \delta_{t-1}+v_t $$

The current level of the trend is $\mu_t$, the current “slope” of the trend is $\delta_t$, and the noise terms are $u_t$ and $v_t$.
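To make the two recursions concrete, here is a minimal simulation sketch. It assumes Gaussian noise terms and an observation equation $y_t=\mu_t+\epsilon_t$ with no seasonal or regression term; the function name, parameter names, and default values are illustrative and are not part of statsmodels.

```python
import random

def simulate_local_linear_trend(n, mu0=0.0, delta0=0.1,
                                sigma_u=0.5, sigma_v=0.05, sigma_eps=1.0, seed=0):
    """Simulate y_t = mu_t + eps_t with a local level mu_t and a slope delta_t."""
    rng = random.Random(seed)
    mu_prev, delta_prev = mu0, delta0
    y, mu, delta = [], [], []
    for _ in range(n):
        # mu_t = mu_{t-1} + delta_{t-1} + u_t
        mu_t = mu_prev + delta_prev + rng.gauss(0.0, sigma_u)
        # delta_t = delta_{t-1} + v_t
        delta_t = delta_prev + rng.gauss(0.0, sigma_v)
        # noisy observation of the current level (assumed observation equation)
        y.append(mu_t + rng.gauss(0.0, sigma_eps))
        mu.append(mu_t)
        delta.append(delta_t)
        mu_prev, delta_prev = mu_t, delta_t
    return y, mu, delta

y, mu, delta = simulate_local_linear_trend(n=100)
```

Setting all three standard deviations to zero collapses the simulation to a straight line with intercept `mu0` and slope `delta0`, which is a quick sanity check on the recursion.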

This kind of model is referred to as UnobservedComponents in statsmodels.

from statsmodels.tsa.statespace.structural import UnobservedComponents

# train on all time points before this and forecast time points after
interventionidx = 200

# df: dataframe with ICNSA and exogenous variables
# regression_columns: exogenous variables
intervention = df.index[interventionidx]
model = UnobservedComponents(
 df.loc[:intervention, 'ICNSA'].values,
 exog = df.loc[:intervention, regression_columns].values,
 level = 'local linear trend'
)
fit = model.fit(maxiter=1000)

We can compare a few models: without exogenous variables from Google Correlate, with the top 10 most correlated search terms, and with the bottom 10 least correlated search terms. The figures below show the ICNSA values in blue and the model predictions in red. The model is trained on observations until 2008 (vertical dashed line) and forecasts are made for the unobserved time after 2008. 95% confidence intervals are in grey.

ICNSA signal with model predictions

It’s clear that adding additional features to the model improves both the fit to the observed data and the forecast. But adding uncorrelated data can have undesired effects on your forecasts. The unobserved components model in statsmodels is unable to pick the best features since it does not have any kind of regularization. Ideally, we want to select only those correlated search terms that give the best model fit and forecast. The original paper on the Bayesian Structural Time Series model provides a methodology for feature selection.
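The "top 10" and "bottom 10" regressor sets compared above come from ranking search terms by their correlation with the target. A rough sketch of that ranking step follows; it is a plain correlation filter, not the feature selection method from the Bayesian Structural Time Series paper. The helper names and the toy series are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_correlation(target, candidates):
    """Return candidate names sorted from most to least correlated (by |r|)."""
    scores = {name: abs(pearson(target, series)) for name, series in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data standing in for the ICNSA series and three search-term series.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "term_a": [2.0, 4.1, 6.0, 7.9, 10.0],  # nearly proportional to the target
    "term_b": [3.0, 1.0, 4.0, 1.0, 5.0],   # weakly related
    "term_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
}
ranking = rank_by_correlation(target, candidates)
top_k = ranking[:2]      # regressors to keep
bottom_k = ranking[-1:]  # regressors expected to add mostly noise
```

Ranking by absolute correlation keeps strongly anti-correlated terms, which are just as informative to a regression as positively correlated ones.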

In addition to applications in forecasting, state space models like the one described above can be used to infer the effect of an intervention, like an ad campaign, for counterfactual inference (see Inferring Causal Impact from Bayesian Structural Time-Series Models by Kay Brodersen et al. (2015)).

Useful references:

Original paper: Predicting the Present with Bayesian Structural Time Series
Inferring Causal Impact from Bayesian Structural Time-Series Models
Causal Impact R package
Rob Hyndman's books and papers; the teaching slides are particularly accessible

State space models and the Kalman filter

Tue, 02 Feb 2016 00:00:00 +0000

Linear state-space models are used in time-series analysis for filtering, prediction, and smoothing problems. They assume that the observations are generated linearly from a latent linear dynamical system. Although many real world processes are non-linear, the linearity makes the model easy to analyze and efficient to estimate. In addition, many non-linear systems can be approximated using linear models, thus the linear state-space model is an important tool for time-series applications.

Consider the basic structural model with a local level term and a trend term:

$$ y_t=\mu_t+\lambda_t w_t+\epsilon_t $$

$$ \mu_{t+1}=\mu_t+v_t+W_{1t} $$

$$ v_{t+1}=v_t+W_{2t} $$

$$ \lambda_{t+1}=\lambda_t+W_{3t} $$

where $\epsilon_t \sim N(0,\sigma_{y}^{2})$, $W_{1t} \sim N(0,\sigma_{\mu}^{2})$, and $W_{2t} \sim N(0,\sigma_{v}^{2})$. Here we allow the local level (intercept) and the trend (slope) to vary in time. Note the term local here is in contrast to global, where the level $\mu$ is fixed ($\sigma_{\mu}^{2}=0$) and there is a constant level across time.

In this case we have added an intervention variable $\lambda$ and $w$, where $\lambda$ is a weighting term and $w$ is a function where the value is zero before the intervention and unity after the intervention.

Setting all the noise terms $\eta=(\epsilon_t, W_{1t}, W_{2t})$ to zero yields the simple equation of a line with constant intercept and slope. At $t=1$:

$$ y_1=\mu_1 $$

$$ \mu_1=\mu_0+v_0 $$

$$ v_1=v_0 $$

$$ y_1=\mu_0 + v_0 $$

At $t=2$:

$$ y_2=\mu_2 $$

$$ \mu_2=\mu_1+v_1=\mu_0+v_0+v_0 $$

$$ v_2=v_1=v_0 $$

$$ y_2=\mu_0+2v_0 $$

At $t=3$:

$$ y_3=\mu_3 $$

$$ \mu_3=\mu_2+v_2=\mu_0+v_0+v_0+v_0 $$

$$ v_3=v_2=v_1=v_0 $$

$$ y_3=\mu_0+3v_0 $$

Therefore, in this case the linear trend model simplifies to

$$ y_t=\mu_0+v_0g_t+\epsilon_t $$

where $g_t=t$ for $t=1,…,n$ is effectively time and $\mu_0$ and $v_0$ are the initial values of the level and the slope.
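The step-by-step expansion can be checked numerically. The sketch below iterates the noise-free recursion and compares it with the closed form $\mu_0+v_0 t$. It leaves out the intervention term $\lambda_t w_t$, as the derivation does, and the function name and starting values are made up for illustration.

```python
def noise_free_trend(mu0, v0, n):
    """Iterate mu_{t+1} = mu_t + v_t and v_{t+1} = v_t with all noise set to zero."""
    mu, v = mu0, v0
    ys = []
    for _ in range(n):
        mu, v = mu + v, v  # level grows by the (constant) slope each step
        ys.append(mu)      # y_t = mu_t when the observation noise is zero
    return ys

ys = noise_free_trend(mu0=1.5, v0=0.5, n=5)
closed_form = [1.5 + 0.5 * t for t in range(1, 6)]  # mu_0 + v_0 * t
```

Both lists contain the same values, so the recursion traces a straight line with intercept $\mu_0$ and slope $v_0$.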

The state space model above can be expressed algebraically in one unified formulation. Using matrix algebra, these models can be written in the following general format:

$$ y_t=Z_{t}^{T}\alpha_t+\epsilon_t $$

$$ \alpha_{t+1}=T_t \alpha_t + R_t \eta_t $$

The first equation is the observation or measurement equation because it links the observed data with the unobserved latent state $\alpha$. The second equation is the transition or state equation because it defines how the latent state evolves over time. $\alpha_t$ is the state vector, $Z_t$ is the observation or design vector, and $T_t$ is the transition matrix. $R_t$ is usually an identity matrix; when it is not, it is called the selection matrix. Finally, $\eta_t$ is the vector of state disturbances, with covariance matrix $Q_t$.

We can express the local linear trend model in state space form:

$$ \alpha_t=\begin{pmatrix}\mu_t\\ v_t\end{pmatrix}, \quad \eta_t=\begin{pmatrix}\psi_t\\ \zeta_t\end{pmatrix}, \quad T_t=\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix}, \quad Z_t=\begin{pmatrix}1\\ 0\end{pmatrix} $$

$$ Q_t=\begin{bmatrix}\sigma_{\mu}^2 & 0\\ 0 & \sigma_{v}^2\end{bmatrix}, \quad R_t=\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} $$

The primary tool for fitting state space models to data is the Kalman filter, which recursively computes the predictive distribution $p(\alpha_{t+1}\mid y_{1:t})$ by combining $p(\alpha_{t}\mid y_{1:t-1})$ with $y_t$ using a standard set of formulas that is logically equivalent to linear regression.
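As a rough illustration of that recursion, here is a small NumPy sketch of the one-step-ahead (prediction form) Kalman filter for a scalar observation, applied to the local linear trend matrices. It assumes $R_t$ is the identity, time-invariant $Z$, $T$, and $Q$, a known observation variance, and a large "diffuse" initial covariance; the function name and all numeric values are illustrative only.

```python
import numpy as np

def kalman_filter(y, Z, T, Q, sigma_eps2, a0, P0):
    """Prediction-form Kalman filter for y_t = Z'a_t + eps_t, a_{t+1} = T a_t + eta_t."""
    a = np.asarray(a0, dtype=float)  # mean of p(alpha_t | y_{1:t-1})
    P = np.asarray(P0, dtype=float)  # covariance of p(alpha_t | y_{1:t-1})
    predicted = []
    for obs in y:
        predicted.append(a.copy())
        innovation = obs - Z @ a     # one-step-ahead prediction error
        F = Z @ P @ Z + sigma_eps2   # variance of the prediction error
        K = T @ P @ Z / F            # Kalman gain
        a = T @ a + K * innovation   # mean of p(alpha_{t+1} | y_{1:t})
        P = T @ P @ T.T - F * np.outer(K, K) + Q
    return np.array(predicted), a, P

# Local linear trend: state = (level, slope).
T = np.array([[1.0, 1.0], [0.0, 1.0]])
Z = np.array([1.0, 0.0])
Q = np.zeros((2, 2))                   # no state noise: a deterministic trend
y = 1.0 + 2.0 * np.arange(1, 11)       # exact line y_t = 1 + 2t, t = 1..10
predicted, a_next, P_next = kalman_filter(
    y, Z, T, Q, sigma_eps2=1.0, a0=np.zeros(2), P0=1e6 * np.eye(2)
)
```

With noise-free linear data and a diffuse prior, `a_next` ends up close to a level of 23 (the value of the line at $t=11$) and a slope of 2, which is the sense in which the filter behaves like recursive linear regression.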

Intervention variables can be added to assess the influence of an external change or stimulus on the development of a time series. Three possible interventions are the level shift, the slope shift, and a pulse, where the value suddenly changes at the moment of the intervention and then immediately returns to the value before the intervention took place. Changes in the value of the level shift and slope shift are permanent after the intervention. A level shift can be expressed as follows:

$$ y_t=\mu_t+\lambda_t w_t+\epsilon_t $$

$$ \mu_{t+1}=\mu_t+W_{1t} $$

$$ \lambda_{t+1}=\lambda_t+W_{3t} $$

The dummy variable $w_t$ equals zero at all time points before the intervention and equals unity at the time points after the intervention.
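A small sketch of how the dummy variables could be built. It assumes the intervention time point itself already counts as "after" for the step dummy, which the text leaves open. The helper names, the constant level, and the weight are hypothetical.

```python
def step_dummy(n, intervention_index):
    """w_t = 0 before the intervention, 1 from the intervention onward (assumed convention)."""
    return [0 if t < intervention_index else 1 for t in range(n)]

def pulse_dummy(n, intervention_index):
    """w_t = 1 only at the intervention, 0 everywhere else."""
    return [1 if t == intervention_index else 0 for t in range(n)]

# Noise-free level shift: constant level mu plus lambda * w_t.
mu, lam = 10.0, 3.0
w = step_dummy(6, intervention_index=3)
y = [mu + lam * w_t for w_t in w]
pulse = pulse_dummy(6, intervention_index=3)
```

Here `y` stays at 10 for the first three points and moves permanently to 13 afterwards, while the pulse dummy would change only the single point at the intervention.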

The state space equations can also be cast as a probabilistic model such that for the measurement model we have $y_t\sim p(y_t \mid \alpha_t)$ and for the latent state model we have $\alpha_t\sim p(\alpha_t \mid \alpha_{t-1})$.

Resources:

An Introduction to State Space Time Series Analysis by Commandeur and Koopman

Pandoc static site generator

Wed, 20 May 2015 00:00:00 +0000

Pandoc is one of the most useful command-line document converters I’ve used. Entire websites can be generated from simple text files. I’m partial to the Markdown syntax, but really any plain-text file can be read into Pandoc and it will output HTML, PDF, and even Word Docs. This website for instance is built using a single Makefile that runs a Pandoc command.

pandoc -r markdown+simple_tables+table_captions+yaml_metadata_block+auto_identifiers+header_attributes+fenced_code_blocks+fenced_code_attributes+tex_math_dollars \
 -w html --toc --mathjax --include-before-body=$(PANDOC)/navigation.html \
 --template=$(PANDOC)/layout.html --css=$(PANDOC)/style.css -o $(ROOT)/$@ $<

Basically, for every Markdown .md file in the directory, Pandoc converts it to an HTML file styled using a given template. The Markdown file supports YAML headers and the variables are made available in the template file. Templates also support basic logic (if/else) and loops (for), which allows for some very smart template files that can change the HTML output based on the YAML headers in your Markdown file.

Pandoc also supports syntax highlighting for embedded code.

Data science at the command line

Thu, 25 Dec 2014 00:00:00 +0000

The UNIX command line is a powerful tool for diving into large data files and piping specialized utilities to create summary statistics. This is a compilation of some useful commands.

Parsing CSV

Read the first line of a CSV file, which contains the column names, and list each column name on a new line after splitting by comma and stripping surrounding quotes. Notice that awk can create an array and has operations like length, which returns the number of elements in an array.

head -n 1 WebExtract.txt | awk '{ split($0, a, ","); max = length(a) }
 END { for (x=1; x<=max;x++) {gsub(/"/, "", a[x]); print a[x]} }'

Take the average of a column of integers in a CSV file after filtering for a category.

cat iris.csv | grep 'Iris-setosa' | awk -F "," '{ sum += $1; n += 1; }
 END { printf "%0.5f\n", sum/n }'

Select a single line in a CSV file and output one column.

sed -n '2405p;2405q' WebExtract.txt | awk -F "\",\"" '{ print $3; }'

We can also select a subset of lines in the middle of the document and return a single column.

sed -n '1000,1010p' WebExtract.txt | awk -F "\",\"" '{ print $3; }'

Count the number of lines with wc.

cat trip_data_1.csv | wc -l

Parsing JSON

JSON files can be parsed using jq. The following command parses a 35MB file for the attribute city, sorts the cities, and returns the number of occurrences.

cat ./yelp_train_academic_dataset_business.json | jq '.city' | sort | uniq -c

To test a command on a large file we can select just the first few lines.

head -n 100 ./yelp_train_academic_dataset_business.json | jq '.city' | sort | uniq -c

To further filter the output for occurrences greater than 5 we can use awk.

head -n 100 ./yelp_train_academic_dataset_business.json | jq '.city' |
 sort | uniq -c | awk '{if ($1 > 5) {print $0}}'

The above filter also works reasonably well on the entire dataset.

cat ./yelp_train_academic_dataset_business.json | jq '.city' |
 sort | uniq -c | awk '{if ($1 > 100) {print $0}}'

To get the total number of entries with over 100 occurrences we can take the sum.

cat ./yelp_train_academic_dataset_business.json | jq '.city' |
 sort | uniq -c | awk '{ if ($1 > 100) {sum += $1;} } END { print sum }'
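For comparison, the same count, filter, and sum steps can be sketched in Python. This assumes the file holds one JSON object per line, which is what piping `head -n` into jq implies. The function name and the three sample records are made up for illustration.

```python
import json
from collections import Counter

def city_counts(lines):
    """Count the 'city' attribute across newline-delimited JSON records."""
    return Counter(json.loads(line)["city"] for line in lines if line.strip())

# Stand-in records; in practice pass an open file handle instead of a list.
sample = [
    '{"name": "A", "city": "Madison"}',
    '{"name": "B", "city": "De Forest"}',
    '{"name": "C", "city": "Madison"}',
]
counts = city_counts(sample)                                  # like: jq '.city' | sort | uniq -c
frequent = {city: n for city, n in counts.items() if n > 1}   # like: awk '$1 > threshold'
total = sum(frequent.values())                                # like: awk '{sum += $1} END {print sum}'
```

The shell pipeline is usually quicker to type for one-off exploration; a script like this becomes useful once the filtering logic grows.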

To filter by a key within jq we can pipe commands and return only the city names.

head -n 3 ./yelp_train_academic_dataset_business.json | jq 'select(.city |
 contains("De Forest")) | .city'

Punchcard visualization using D3.js

Tue, 12 Aug 2014 00:00:00 +0000

Consider using a punchcard visualization when you want to represent both counts and proportionality over a time period across categorical data.

The graphic is particularly useful in interactive documents with hover effects showing the underlying count.

Data Mining PubMed

Wed, 12 Mar 2014 00:00:00 +0000

The National Institutes of Health provides a full programming interface to search PubMed called E-Utilities. The PubMed database is conveniently accessed through simple HTTP requests, and the article metadata is returned as XML. Every article in PubMed has a title, author, abstract, journal, year, volume, issue, pages, and keywords, among other metadata. Getting the metadata from PubMed, however, involves two separate queries. Very simply, the first query returns a list of PubMed IDs (PMIDs) for articles matching the search criteria and the second query returns article data for a given PMID.

The workflow is divided into two parts:

Query E-Search with your search term; it returns a list of PMIDs that are then used to query E-Fetch for the article metadata.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=electrical+stimulation
&retmax=10&tool=pmquery&db=pubmed

E-Search returns a list of PMIDs:

 <eSearchResult>
 <Count>157380</Count>
 <RetMax>10</RetMax>
 <RetStart>0</RetStart>
 <IdList>
 <Id>23858010</Id>
 <Id>23856563</Id>
 <Id>23856146</Id>
 <Id>23855510</Id>
 <Id>23839460</Id>
 <Id>23839375</Id>
 <Id>23853340</Id>
 <Id>23853339</Id>
 <Id>23853324</Id>
 <Id>23853296</Id>
 </IdList>
 ...

Next, query E-Fetch for the article data. You can request multiple PMIDs at once and even choose the return type (XML, text, JSON). The API also supports pagination to iteratively get many thousands of results.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
?db=pubmed&id=23856563,23858010&retmode=xml
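The two-step workflow can be sketched with the Python standard library. The example below only builds the two URLs shown above and parses an E-Search response; the actual HTTP requests are left out so it runs offline. The helper names are hypothetical and this is not the pmquery.py script itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL as used in the examples in this post.
EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=10):
    """Step 1: URL that asks E-Search for PMIDs matching a search term."""
    params = {"term": term, "retmax": retmax, "tool": "pmquery", "db": "pubmed"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def parse_pmids(esearch_xml):
    """Pull the PMIDs out of an eSearchResult XML document."""
    root = ET.fromstring(esearch_xml)
    return [el.text for el in root.findall("./IdList/Id")]

def efetch_url(pmids):
    """Step 2: URL that asks E-Fetch for the metadata of the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params, safe=',')}"

# Shortened copy of the E-Search response shown above.
sample = (
    "<eSearchResult><Count>157380</Count><RetMax>10</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>23858010</Id><Id>23856563</Id></IdList></eSearchResult>"
)
search = esearch_url("electrical stimulation")
pmids = parse_pmids(sample)
fetch = efetch_url(pmids)
```

To run the queries for real, each URL would be opened with something like `urllib.request.urlopen` and the response body passed to the parser.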

I’ve written several interfaces to access the PubMed API, including in PHP, Python, and C#. For instance, the Python script was written specifically to data-mine PubMed. Given a search term, pmquery.py will query PubMed and save each article to a text file. For some search terms, like “transcranial magnetic stimulation”, this results in over 9000 articles returned by PubMed, so the process is iterative and can take some time (minutes). The PHP implementation provides a web-based search interface. For desktop-based applications, see the C# code and the Scholared app for a working example.
