Asif Rahman

AutoStore: File Storage Made Simple

AutoStore is a Python library that provides a dictionary-like interface for reading and writing files, with built-in caching and support for multiple storage backends.

AutoStore eliminates the cognitive overhead of managing different file formats, letting you focus on your data and analysis rather than the mechanics of file I/O. It automatically handles format detection, type inference, and upload/download operations, and provides a clean, intuitive API for data persistence across local and cloud storage.

Table of contents:

- Getting Started
- Basic Usage
- Cloud Storage (S3)
- Supported Data Types
- Configuration Options
- Advanced Features
- Performance Considerations
- When to Use AutoStore
- Comparison with Alternatives

Getting Started

AutoStore requires Python 3.10+ and can be installed via pip.

pip install autostore

Basic Usage

from autostore import AutoStore

store = AutoStore("./data")

# Write data - automatically saves with appropriate extensions
store["my_dataframe"] = df           # ./data/my_dataframe.parquet
store["config"] = {"key": "value"}   # ./data/config.json
store["logs"] = [{"event": "start"}] # ./data/logs.jsonl

# Read data
df = store["my_dataframe"]           # Returns a Polars DataFrame
config = store["config"]             # Returns a dict
logs = store["logs"]                 # Returns a list of dicts

Cloud Storage (S3)

from autostore import AutoStore
from autostore.s3 import S3Backend, S3StorageConfig

# Register S3 backend
AutoStore.register_backend("s3", S3Backend)

# Configure S3 with caching
s3_config = S3StorageConfig(
    region_name="us-east-1",
    cache_enabled=True,
    cache_expiry_hours=12,
    multipart_threshold=64 * 1024 * 1024  # 64MB
)

# Use S3 storage
store = AutoStore("s3://my-bucket/data/", config=s3_config)
store["experiment/results"] = {"accuracy": 0.95, "epochs": 100}
results = store["experiment/results"]  # Uses cache on subsequent loads

Supported Data Types

| Data Type | File Extension | Description | Library Required |
| --- | --- | --- | --- |
| Polars DataFrame/LazyFrame | .parquet, .csv | High-performance DataFrames | polars |
| Python dict/list | .json | Standard JSON serialization | built-in |
| List of dicts | .jsonl | JSON Lines format | built-in |
| Pydantic models | .pydantic.json | Structured data models | pydantic |
| Python dataclasses | .dataclass.json | Dataclass serialization | built-in |
| String data | .txt, .html, .md | Plain text files | built-in |
| NumPy arrays | .npy, .npz | Numerical data | numpy |
| SciPy sparse matrices | .sparse | Sparse matrix data | scipy |
| PyTorch tensors/models | .pt, .pth | Deep learning models | torch |
| PIL/Pillow images | .png, .jpg, etc. | Image data | Pillow |
| YAML data | .yaml, .yml | Human-readable config files | PyYAML |
| Any Python object | .pkl | Pickle fallback | built-in |
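
Because the writing handler is picked from the value's type, a single store can hold several of these formats side by side. A minimal sketch (assuming NumPy is installed and that plain strings default to .txt; the key names are illustrative):

import numpy as np
from autostore import AutoStore

store = AutoStore("./data")

store["embeddings"] = np.random.rand(128, 64)    # ./data/embeddings.npy
store["notes"] = "Experiment 42: baseline run"   # ./data/notes.txt

# Reading back returns the native Python type
embeddings = store["embeddings"]  # numpy.ndarray
notes = store["notes"]            # str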

Configuration Options

S3StorageConfig

from autostore.s3 import S3StorageConfig

config = S3StorageConfig(
    aws_access_key_id="your-key",
    aws_secret_access_key="your-secret",
    region_name="us-east-1",
    cache_enabled=True,
    cache_expiry_hours=12,
    multipart_threshold=64 * 1024 * 1024,  # Files larger than this use multipart upload
    multipart_chunksize=16 * 1024 * 1024,  # Chunk size for multipart uploads
    max_concurrency=10                     # Maximum concurrent uploads/downloads
)

Advanced Features

Caching System

AutoStore includes an intelligent caching system that:

- Caches downloaded files locally, so repeated reads skip the network round trip
- Validates cached entries against remote ETags and re-fetches when the remote copy changes
- Expires entries after a configurable window (cache_expiry_hours)
- Supports on-demand cleanup of expired entries

# Cache management
store.cleanup_cache()  # Remove expired cache entries

# Check cache status
metadata = store.get_metadata("large_file")
print(f"File size: {metadata.size} bytes")
print(f"ETag: {metadata.etag}")

Custom Data Handlers

Add support for new data types by creating custom handlers:

import json
from pathlib import Path

from autostore.autostore import DataHandler

class CustomLogHandler(DataHandler):
    def can_handle_extension(self, extension: str) -> bool:
        return extension.lower() == ".log"

    def can_handle_data(self, data) -> bool:
        return isinstance(data, list) and all(
            isinstance(item, dict) and "timestamp" in item
            for item in data
        )

    def read_from_file(self, file_path: Path, file_extension: str):
        logs = []
        with open(file_path, 'r') as f:
            for line in f:
                if line.strip():
                    logs.append(json.loads(line))
        return logs

    def write_to_file(self, data, file_path: Path, file_extension: str):
        file_path.parent.mkdir(parents=True, exist_ok=True)
        with open(file_path, 'w') as f:
            for entry in data:
                f.write(json.dumps(entry) + '\n')

    @property
    def extensions(self):
        return [".log"]

    @property
    def priority(self):
        return 15

# Register the handler
store.register_handler(CustomLogHandler())
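
With the handler registered, any list of dicts carrying a "timestamp" field is routed through CustomLogHandler on write. A hypothetical usage sketch (assuming its priority of 15 outranks the built-in JSON Lines handler for this data):

# Matches can_handle_data, so it is saved as ./data/app_events.log
store["app_events"] = [
    {"timestamp": "2024-01-01T00:00:00", "event": "start"},
    {"timestamp": "2024-01-01T00:05:00", "event": "stop"},
]

events = store["app_events"]  # read back through read_from_file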

File Operations

# Check existence
if "config" in store:
    print("Config file exists")

# List all files
for key in store.keys():
    print(f"File: {key}")

# Get file metadata
metadata = store.get_metadata("large_dataset")
print(f"Size: {metadata.size} bytes")
print(f"Modified: {metadata.modified_time}")

# Copy and move files
store.copy("original", "backup")
store.move("temp_file", "permanent_file")

# Delete files
del store["old_data"]

Context Management

# Automatic cleanup of temporary files and cache
with AutoStore("./data", config=config) as store:
    store["data"] = large_dataset
    results = store["data"]
# Temporary files are automatically cleaned up here

Multiple Storage Backends

AutoStore supports pluggable storage backends:

# Local storage
local_store = AutoStore("./data")

# S3 storage (requires S3Backend to be registered, as shown above)
s3_store = AutoStore("s3://bucket/prefix/")
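
Because every backend exposes the same mapping interface, data-handling code can stay backend-agnostic. A minimal sketch (backup is our own helper, not part of AutoStore):

def backup(src, dst):
    # Copy every key from one store to another, regardless of backend.
    # Values round-trip through their native Python types, so the
    # output format is re-inferred on write.
    for key in src.keys():
        dst[key] = src[key]

backup(local_store, s3_store)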

Performance Considerations

Large File Handling

AutoStore automatically optimizes for large files (see the sketch below):

- Uploads above multipart_threshold are split into multipart transfers
- Chunks of multipart_chunksize are transferred with up to max_concurrency concurrent operations
- Downloaded files are cached locally, so large files are only fetched again when they change
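
For example, the multipart settings on S3StorageConfig control how a large upload is split. A minimal sketch (the bucket name and tuning values are illustrative):

import numpy as np
from autostore import AutoStore
from autostore.s3 import S3Backend, S3StorageConfig

AutoStore.register_backend("s3", S3Backend)

# Illustrative tuning: files above 32 MB upload in 8 MB chunks,
# with at most 4 concurrent part transfers
config = S3StorageConfig(
    multipart_threshold=32 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=4,
)

store = AutoStore("s3://my-bucket/data/", config=config)
store["big_matrix"] = np.zeros((10_000, 2_000))  # ~160 MB .npy, uploaded in parts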

When to Use AutoStore

Choose AutoStore when you need:

- A dictionary-like API over many file formats without hand-written I/O code
- Automatic format detection and type inference on read and write
- Transparent local and S3 storage with ETag-based caching
- An extensible handler system for project-specific data types

Don’t choose AutoStore when:

- You need database features such as queries, indexes, or transactions
- You can’t take external dependencies (shelve ships with the standard library)
- You work with a single format and direct file I/O is already simple enough

Comparison with Alternatives

| Feature | AutoStore | Shelve | DiskCache | TinyDB | PickleDB | SQLiteDict |
| --- | --- | --- | --- | --- | --- | --- |
| Multi-format Support | ✅ 12+ formats | ❌ Pickle only | ❌ Pickle only | ❌ JSON only | ❌ JSON only | ❌ Pickle only |
| Auto Format Detection | ✅ Smart inference | ❌ Manual | ❌ Manual | ❌ Manual | ❌ Manual | ❌ Manual |
| Cloud Storage | ✅ S3, extensible | ❌ Local only | ❌ Local only | ❌ Local only | ❌ Local only | ❌ Local only |
| Intelligent Caching | ✅ ETag-based | ❌ None | ✅ Advanced | ❌ None | ❌ None | ❌ None |
| Type-Safe Config | ✅ Dataclasses | ❌ None | ✅ Classes | ❌ Dicts | ❌ None | ❌ None |
| Large File Handling | ✅ Multipart | ❌ Limited | ✅ Good | ❌ Limited | ❌ Limited | ❌ Limited |
| Extensibility | ✅ Handler system | ❌ Limited | ❌ Limited | ✅ Middleware | ❌ Limited | ❌ Limited |
| Performance | ✅ Cached/Optimized | 🔶 Medium | ✅ Fast | 🔶 Medium | 🔶 Medium | 🔶 Medium |
| Standard Library | ❌ External | ✅ Built-in | ❌ External | ❌ External | ❌ External | ❌ External |

#Python