silas/mkv

Silas Brack 689b85e6f2 Fixes

2026-03-07 17:14:54 +01:00

11 KiB

mkv

Distributed key-value store for blobs. Thin index server (Rust + SQLite) in front of nginx volume servers. Inspired by minikeyvalue.

Usage

# Start the index server (replicates to 2 of 3 volumes)
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080 -r 2 serve -p 3000

# Store a file
curl -X PUT -d "contents" http://localhost:3000/path/to/key

# Retrieve (returns 302 redirect to nginx)
curl -L http://localhost:3000/path/to/key

# Check existence and size
curl -I http://localhost:3000/path/to/key

# Delete
curl -X DELETE http://localhost:3000/path/to/key

# List keys (with optional prefix filter)
curl "http://localhost:3000/?prefix=path/to/"

Operations

# Rebuild index by scanning all volumes (stop the server first)
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080 -r 2 rebuild

# Rebalance after adding/removing volumes (preview with --dry-run)
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080,http://vol4:8080 -r 2 rebalance --dry-run
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080,http://vol4:8080 -r 2 rebalance

Volume servers

Any nginx with WebDAV enabled works:

server {
    listen 80;
    root /data;
    location / {
        dav_methods PUT DELETE;
        create_full_put_path on;
        autoindex on;
        autoindex_format json;
    }
}

What it does

HTTP API — PUT, GET (302 redirect), DELETE, HEAD, LIST with prefix filtering
Replication — fan-out writes to N volumes concurrently, all-or-nothing with rollback
Consistent hashing — stable volume assignment; adding/removing a volume only moves ~1/N of keys
Rebuild — reconstructs the SQLite index by scanning nginx autoindex on all volumes
Rebalance — migrates data to correct volumes after topology changes, with --dry-run preview
Key-as-path — blobs stored at /{key} on nginx, no content-addressing or sidecar files
Single binary — no config files, everything via CLI flags
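
The consistent-hashing bullet can be made concrete with a small sketch. It assumes the scheme is rendezvous-style hashing (score every volume against the key, keep the top N), which is what "per volume, sort by score" in the comparison table suggests. The signature of volumes_for_key is a guess, and DefaultHasher stands in for SHA-256 only so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Score one (key, volume) pair.
/// Assumption: stand-in for the SHA-256 score; DefaultHasher keeps the sketch std-only.
fn score(key: &str, volume: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    volume.hash(&mut hasher);
    hasher.finish()
}

/// Pick the `replicas` highest-scoring volumes for `key`.
/// Pure: the result depends only on the arguments, not on their order or on any state.
fn volumes_for_key(key: &str, volumes: &[String], replicas: usize) -> Vec<String> {
    let mut scored: Vec<(u64, &String)> =
        volumes.iter().map(|v| (score(key, v), v)).collect();
    // Highest score first; ties broken by volume URL so the ordering is total.
    scored.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.cmp(b.1)));
    scored.into_iter().take(replicas).map(|(_, v)| v.clone()).collect()
}
```

Adding a volume does not change how a key ranks the existing volumes, so a key only moves when the new volume scores into its top N. That is where the "~1/N of keys" figure comes from.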

What it doesn't do

Checksums — no integrity verification; bit rot goes undetected
Auth — no access control; anyone who can reach the server can read/write/delete
Encryption — blobs stored as plain files on nginx
Streaming / range requests — entire blob must fit in memory
Metadata — no EXIF, tags, or content types; key path is all you get
Versioning — PUT overwrites; no history
Compression — blobs stored as-is

Comparison to minikeyvalue

mkv is a ground-up rewrite of minikeyvalue in Rust.

	mkv	minikeyvalue
Language	Rust	Go
Index	SQLite (WAL mode)	LevelDB
Storage paths	key-as-path (`/{key}`)	content-addressed (md5 + base64)
GET behavior	Index lookup, 302 redirect	HEAD to volume first, then 302 redirect
PUT overwrite	Allowed	Forbidden (returns 403)
Hash function	SHA-256 per volume, sort by score	MD5 per volume, sort by score
MD5 of values	No	Yes (stored in index)
Health checker	No	No (checks per-request via HEAD)
Subvolumes	No	Yes (configurable fan-out directories)
Soft delete	No (hard delete)	Yes (UNLINK + DELETE two-phase)
S3 API	No	Partial (list, multipart upload)
App code	~600 lines	~1,000 lines
Tests	17 (unit + integration)	1

Performance (10k keys, 1 KB values, concurrency 100)

Tested on the same machine with shared nginx volumes:

Operation	mkv	minikeyvalue
PUT	10,000 req/s	10,500 req/s
GET (full round-trip)	7,000 req/s	6,500 req/s
GET (index only)	15,800 req/s	13,800 req/s
DELETE	13,300 req/s	13,600 req/s

Both are bottlenecked by nginx volume I/O. The index layer (SQLite) can sustain 378,000 writes/sec in isolation.

Error responses

Every error returns a plain-text body with a human-readable message.

Status	Error	When
`404 Not Found`	`not found`	GET, HEAD, DELETE for a key that doesn't exist
`500 Internal Server Error`	`corrupt record for key {key}: no volumes`	Key exists in index but has no volume locations (data integrity issue)
`500 Internal Server Error`	`database error: {detail}`	SQLite failure (disk full, corruption, locked)
`502 Bad Gateway`	`not all volume writes succeeded`	PUT where one or more volume writes failed; all volumes are rolled back
`503 Service Unavailable`	`need {n} volumes but only {m} available`	PUT when fewer volumes are configured than the replication factor requires
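
The table above is a mapping from error value to status and body, so it can be written as two pure methods. This is a sketch only: AppError is named later in this README, but the variant names, field names, and method names here are hypothetical.

```rust
/// Hypothetical error type mirroring the table above; variant names are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum AppError {
    NotFound,
    CorruptRecord { key: String },
    Database(String),
    VolumeWriteFailed,
    InsufficientVolumes { need: usize, have: usize },
}

impl AppError {
    /// HTTP status code for this error.
    fn status(&self) -> u16 {
        match self {
            AppError::NotFound => 404,
            AppError::CorruptRecord { .. } | AppError::Database(_) => 500,
            AppError::VolumeWriteFailed => 502,
            AppError::InsufficientVolumes { .. } => 503,
        }
    }

    /// Plain-text body returned to the client.
    fn message(&self) -> String {
        match self {
            AppError::NotFound => "not found".to_string(),
            AppError::CorruptRecord { key } => {
                format!("corrupt record for key {key}: no volumes")
            }
            AppError::Database(detail) => format!("database error: {detail}"),
            AppError::VolumeWriteFailed => "not all volume writes succeeded".to_string(),
            AppError::InsufficientVolumes { need, have } => {
                format!("need {need} volumes but only {have} available")
            }
        }
    }
}
```

Keeping both functions free of IO means the whole table can be checked with assert_eq! and no running server.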

Failure modes

PUT writes to all target volumes concurrently, then updates the index. If any volume write fails, all volumes are rolled back (best-effort) and the client gets 502. If volume writes succeed but the index update fails, volumes are rolled back and the client gets 500.

DELETE removes the key from the index and issues best-effort deletes to all volumes. Volume delete failures are logged but do not fail the request — the client always gets 204 if the key existed. This can leave orphaned blobs on volumes; use rebuild to reconcile.

GET looks up the key in the index and returns a 302 redirect to the first volume. If the volume is unreachable, the client sees the failure directly from nginx (the index server does not proxy the blob).
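
The GET path reduces to one small decision: given the volumes recorded for a key, which URL goes in the Location header. A minimal sketch follows, assuming a hypothetical redirect_location helper and a plain String error; it does no URL encoding, which a real handler may need.

```rust
/// Build the `Location` value for a GET, or explain why it cannot be built.
/// Hypothetical helper: pure, takes the record's volume list and the key.
fn redirect_location(volumes: &[String], key: &str) -> Result<String, String> {
    // A record with no volumes is the "corrupt record" case from the error table.
    let first = volumes
        .first()
        .ok_or_else(|| format!("corrupt record for key {key}: no volumes"))?;
    // Blobs live at /{key} on the volume, so the URL is base + "/" + key.
    Ok(format!(
        "{}/{}",
        first.trim_end_matches('/'),
        key.trim_start_matches('/')
    ))
}
```

The index server never contacts the volume here, which is why an unreachable volume only shows up when the client follows the redirect.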

Security

mkv assumes a trusted network. There is no built-in authentication, authorization, or encryption. This is the same security model as minikeyvalue — neither system is designed for direct exposure to the public internet.

Trust model

The index server and volume servers (nginx) are expected to live on the same private network. GET requests return a 302 redirect to a volume URL, so clients must be able to reach the volumes directly. Anyone who can reach the index server can read, write, and delete any key. Anyone who can reach a volume can read any blob.

Deploying with auth

Put a reverse proxy in front of the index server and handle authentication there:

Basic auth or API keys at the reverse proxy for simple setups
mTLS for machine-to-machine access
OAuth / JWT validation at the proxy for multi-user setups

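Because GET redirects clients to a volume URL, the volumes need protecting as well. Either keep them on a private network that clients cannot reach directly (the proxy then has to follow the 302 on the client's behalf), or expose them and use nginx's secure_link module to validate signed redirect URLs.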

What neither mkv nor minikeyvalue protects against

Unauthorized reads/writes (no auth)
Data in transit (no TLS unless the proxy adds it)
Data at rest (blobs are plain files on disk)
Malicious keys (no input sanitization beyond what nginx enforces on paths)
Index tampering (SQLite file has no integrity protection)

Development

Principles

Explicit over clever — no magic helpers, no macros that hide control flow, no trait gymnastics. Code reads top-to-bottom. A new reader should understand what a function does without chasing through layers of indirection.
Pure functions — isolate decision logic from IO. A function that takes data and returns data is testable, composable, and easy to reason about. Keep it that way. Don't sneak in network calls or logging.
Linear flow — avoid callbacks, deep nesting, and async gymnastics where possible. A handler should read like a sequence of steps: look up the record, pick a volume, build the response.
Minimize shared state — pass values explicitly. Don't hold locks across IO. Don't reach into globals.
Minimize indirection — don't hide logic behind abstractions that exist "in case we need to swap the implementation later." We won't. A three-line function inline is better than a trait with one implementor.

Applying the principles: separate decisions from execution

Every request handler does two things: decides what should happen, then executes IO to make it happen. These should be separate functions.

A decision is a pure function. It takes data in, returns a description of what to do. It doesn't call the network, doesn't touch the database, doesn't log. It can be tested with assert_eq! and nothing else.

Execution is the messy part — HTTP calls, SQLite writes, error recovery. It reads the decision and carries it out. It's tested with integration tests.

Where this applies today

Already pure

hasher.rs — the entire module is pure. volumes_for_key is a deterministic function of its inputs. No IO, no state mutation. This is the gold standard for the project.

rebalance.rs::plan_rebalance — takes a slice of records and returns a list of moves. Pure decision logic, tested with unit tests.
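
A planner of this shape might look like the sketch below. The Record and Move types and the exact signature are assumptions made for illustration; the desired closure stands in for volumes_for_key with the new volume list baked in.

```rust
use std::collections::BTreeSet;

/// Hypothetical index record: a key and the volumes that currently hold it.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    key: String,
    volumes: Vec<String>,
}

/// Hypothetical move: copy the blob to `add`, then delete it from `remove`.
#[derive(Debug, Clone, PartialEq)]
struct Move {
    key: String,
    add: Vec<String>,
    remove: Vec<String>,
}

/// Compare where each key is with where `desired` says it should be.
/// Pure: no network, no database, only data in and data out.
fn plan_rebalance(records: &[Record], desired: impl Fn(&str) -> Vec<String>) -> Vec<Move> {
    let mut moves = Vec::new();
    for record in records {
        let current: BTreeSet<String> = record.volumes.iter().cloned().collect();
        let target: BTreeSet<String> = desired(&record.key).into_iter().collect();
        if current == target {
            continue; // already placed correctly, order of volumes ignored
        }
        moves.push(Move {
            key: record.key.clone(),
            add: target.difference(&current).cloned().collect(),
            remove: current.difference(&target).cloned().collect(),
        });
    }
    moves
}
```

A --dry-run then amounts to printing the returned list instead of executing it.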

db.rs encode/parse — parse_volumes and encode_volumes are pure transformations between JSON strings and Vec<String>.

Mixed (decision + execution interleaved)

server.rs::put_key — this handler does four things in one function:

1. Decide which volumes to write to (pure — volumes_for_key)
2. Execute fan-out PUTs to nginx (IO)
3. Decide whether to roll back based on results (pure — check which succeeded)
4. Execute rollback DELETEs and/or index write (IO)

Steps 1 and 3 could be extracted as pure functions if they grow more complex.
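
If step 3 were extracted, it might look like the following sketch. WriteResult, PutDecision, and decide_put are hypothetical names. The sketch also assumes that only volumes whose write succeeded need a rollback DELETE; the real handler may issue deletes more broadly.

```rust
/// Outcome of one volume write, as collected by the IO layer (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct WriteResult {
    volume: String,
    ok: bool,
}

/// What the handler should do next (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum PutDecision {
    /// Every write succeeded: record these volumes in the index.
    Commit { volumes: Vec<String> },
    /// At least one write failed: delete from the volumes that did succeed.
    Rollback { delete_from: Vec<String> },
}

/// Pure decision: inspects the results and returns a description of the next step.
fn decide_put(results: &[WriteResult]) -> PutDecision {
    if results.iter().all(|r| r.ok) {
        PutDecision::Commit {
            volumes: results.iter().map(|r| r.volume.clone()).collect(),
        }
    } else {
        PutDecision::Rollback {
            // Assumption: failed writes left nothing behind that needs deleting.
            delete_from: results
                .iter()
                .filter(|r| r.ok)
                .map(|r| r.volume.clone())
                .collect(),
        }
    }
}
```

The handler then matches on the decision: Commit leads to the index write, Rollback leads to the best-effort DELETEs and a 502.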

Intentionally impure

rebuild.rs — walks nginx autoindex and bulk-inserts into SQLite. The IO is the whole point; there's no decision logic worth extracting.

db.rs — wraps SQLite behind Arc<Mutex<Connection>> with spawn_blocking to avoid blocking the tokio runtime. The mutex serializes all access; SQLITE_OPEN_NO_MUTEX disables SQLite's internal locking since the application mutex handles it.

Guidelines

If a function takes only data and returns only data, it's pure. Keep it that way. Don't sneak in logging, metrics, or "just one network call."
If a handler has an if or match that decides between outcomes, that decision can probably be a pure function. Extract it. Name it. Test it.
IO boundaries should be thin. Format URL, make request, check status, return bytes. No business logic.
Don't over-abstract. A three-line pure function inline in a handler is fine. Extract it when it gets complex enough to need its own tests, or when the same decision appears in multiple places (e.g., rebuild and rebalance both use volumes_for_key).
Errors are data. AppError is a value, not an exception. Functions return Result, handlers pattern-match on it. The IntoResponse impl is the only place where errors become HTTP responses — one place, one mapping.

Anti-patterns to avoid

God handler — a 100-line async fn that reads the DB, calls volumes, makes decisions, handles errors, and formats the response. Break it up.
Hidden state reads — if a function needs data, pass it in. Don't reach into a global or lock a mutex inside a "pure" function.
Testing IO to test logic — if you need a Docker container running to test whether volume selection works correctly, the logic isn't separated from the IO.
