# mkv

Distributed key-value store for blobs. Thin index server (Rust + SQLite) in front of nginx volume servers. Inspired by [minikeyvalue](https://github.com/geohot/minikeyvalue).

## Usage

```bash
# Start the index server (replicates to 2 of 3 volumes)
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080 -r 2 serve -p 3000

# Store a file
curl -X PUT -d "contents" http://localhost:3000/path/to/key

# Retrieve (returns 302 redirect to nginx)
curl -L http://localhost:3000/path/to/key

# Check existence and size
curl -I http://localhost:3000/path/to/key

# Delete
curl -X DELETE http://localhost:3000/path/to/key

# List keys (with optional prefix filter); quoted so the shell does not glob the "?"
curl "http://localhost:3000/?prefix=path/to/"
```

### Operations

```bash
# Rebuild index by scanning all volumes (disaster recovery)
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080 -r 2 rebuild

# Rebalance after adding/removing volumes (preview with --dry-run)
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080,http://vol4:8080 -r 2 rebalance --dry-run
mkv -d /tmp/index.db -v http://vol1:8080,http://vol2:8080,http://vol3:8080,http://vol4:8080 -r 2 rebalance
```

### Volume servers

Any nginx with WebDAV enabled works:

```nginx
server {
    listen 80;
    root /data;
    location / {
        dav_methods PUT DELETE;
        create_full_put_path on;
        autoindex on;
        autoindex_format json;
    }
}
```

## What it does

- **HTTP API**: PUT, GET (302 redirect), DELETE, HEAD, LIST with prefix filtering
- **Replication**: fan-out writes to N volumes concurrently, all-or-nothing with rollback
- **Consistent hashing**: stable volume assignment; adding/removing a volume only moves ~1/N of keys
- **Rebuild**: reconstructs the SQLite index by scanning nginx autoindex on all volumes
- **Rebalance**: migrates data to correct volumes after topology changes, with `--dry-run` preview
- **Key-as-path**: blobs stored at `/{key}` on nginx, no content-addressing or sidecar files
- **Single binary**: no config files, everything via CLI flags

## What it doesn't do

- **Checksums**: no integrity verification; bit rot goes undetected
- **Auth**: no access control; anyone who can reach the server can read/write/delete
- **Encryption**: blobs stored as plain files on nginx
- **Streaming / range requests**: entire blob must fit in memory
- **Metadata**: no EXIF, tags, or content types; key path is all you get
- **Versioning**: PUT overwrites; no history
- **Compression**: blobs stored as-is

## Comparison to minikeyvalue

mkv is a ground-up rewrite of [minikeyvalue](https://github.com/geohot/minikeyvalue) in Rust.

| | mkv | minikeyvalue |
|--|-----|--------------|
| Language | Rust | Go |
| Index | SQLite (WAL mode) | LevelDB |
| Storage paths | key-as-path (`/{key}`) | content-addressed (md5 + base64) |
| GET behavior | Index lookup, 302 redirect | HEAD to volume first, then 302 redirect |
| PUT overwrite | Allowed | Forbidden (returns 403) |
| Hash function | SHA-256 per volume, sort by score | MD5 per volume, sort by score |
| MD5 of values | No | Yes (stored in index) |
| Health checker | No | No (checks per-request via HEAD) |
| Subvolumes | No | Yes (configurable fan-out directories) |
| Soft delete | No (hard delete) | Yes (UNLINK + DELETE two-phase) |
| S3 API | No | Partial (list, multipart upload) |
| App code | ~600 lines | ~1,000 lines |
| Tests | 17 (unit + integration) | 1 |

### Performance (10k keys, 1KB values, 100 concurrency)

Tested on the same machine with shared nginx volumes:

| Operation | mkv | minikeyvalue |
|-----------|-----|--------------|
| PUT | 10,000 req/s | 10,500 req/s |
| GET (full round-trip) | 7,000 req/s | 6,500 req/s |
| GET (index only) | 15,800 req/s | 13,800 req/s |
| DELETE | 13,300 req/s | 13,600 req/s |

Both are bottlenecked by nginx volume I/O. The index layer (SQLite) can sustain 378,000 writes/sec in isolation.

## Error responses

Every error returns a plain-text body with a human-readable message.

| Status | Error | When |
|--------|-------|------|
| `404 Not Found` | `not found` | GET, HEAD, DELETE for a key that doesn't exist |
| `500 Internal Server Error` | `corrupt record for key {key}: no volumes` | Key exists in index but has no volume locations (data integrity issue) |
| `500 Internal Server Error` | `database error: {detail}` | SQLite failure (disk full, corruption, locked) |
| `502 Bad Gateway` | `not all volume writes succeeded` | PUT where one or more volume writes failed; all volumes are rolled back |
| `503 Service Unavailable` | `need {n} volumes but only {m} available` | PUT when fewer volumes are configured than the replication factor requires |

### Failure modes

**PUT** writes to all target volumes concurrently, then updates the index. If any volume write fails, all volumes are rolled back (best-effort) and the client gets 502. If volume writes succeed but the index update fails, volumes are rolled back and the client gets 500.

**DELETE** removes the key from the index and issues best-effort deletes to all volumes. Volume delete failures are logged but do not fail the request; the client always gets 204 if the key existed. This can leave orphaned blobs on volumes; use `rebuild` to reconcile.

**GET** looks up the key in the index and returns a 302 redirect to the first volume. If the volume is unreachable, the client sees the failure directly from nginx (the index server does not proxy the blob).

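The GET path described above is small enough to write as a pure decision. The following is a minimal sketch, not the project's actual handler: `GetOutcome` and `decide_get` are hypothetical names, and the index record is assumed to be the list of volume base URLs stored for the key.

```rust
/// What the GET handler should answer. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum GetOutcome {
    /// 302 with this value in the Location header.
    Redirect(String),
    /// 404: the key is not in the index.
    NotFound,
    /// 500: the key is indexed but lists no volumes.
    CorruptRecord,
}

/// Decide the GET response from the index lookup result alone.
/// `record` is `None` when the key is missing, otherwise the volume
/// base URLs stored for the key (assumed shape of the index row).
fn decide_get(key: &str, record: Option<&[String]>) -> GetOutcome {
    match record {
        None => GetOutcome::NotFound,
        Some([]) => GetOutcome::CorruptRecord,
        // Blobs live at `/{key}` on the volume, so the redirect target is
        // the first volume's base URL joined with the key.
        Some([first, ..]) => GetOutcome::Redirect(format!(
            "{}/{}",
            first.trim_end_matches('/'),
            key.trim_start_matches('/')
        )),
    }
}
```

Because the function performs no IO, the unreachable-volume case never appears here: the index server only chooses the redirect target, and the client learns about a dead volume from nginx.
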
## Security

mkv assumes a **trusted network**. There is no built-in authentication, authorization, or encryption. This is the same security model as minikeyvalue; neither system is designed for direct exposure to the public internet.

### Trust model

The index server and volume servers (nginx) are expected to live on the same private network. GET requests return a 302 redirect to a volume URL, so clients must be able to reach the volumes directly. Anyone who can reach the index server can read, write, and delete any key. Anyone who can reach a volume can read any blob.

### Deploying with auth

Put a reverse proxy in front of the index server and handle authentication there:

- **Basic auth or API keys** at the reverse proxy for simple setups
- **mTLS** for machine-to-machine access
- **OAuth / JWT** validation at the proxy for multi-user setups

Volume servers should be on a private network that clients cannot reach directly, or use nginx's `secure_link` module to validate signed redirect URLs.

### What neither mkv nor minikeyvalue protects against

- Unauthorized reads/writes (no auth)
- Data in transit (no TLS unless the proxy adds it)
- Data at rest (blobs are plain files on disk)
- Malicious keys (no input sanitization beyond what nginx enforces on paths)
- Index tampering (SQLite file has no integrity protection)

# Development

## Principles

1. **Explicit over clever**: no magic helpers, no macros that hide control
   flow, no trait gymnastics. Code reads top-to-bottom. A new reader should
   understand what a function does without chasing through layers of
   indirection.

2. **Pure functions**: isolate decision logic from IO. A function that takes
   data and returns data is testable, composable, and easy to reason about.
   Keep it that way. Don't sneak in network calls or logging.

3. **Linear flow**: avoid callbacks, deep nesting, and async gymnastics where
   possible. A handler should read like a sequence of steps: look up the
   record, pick a volume, build the response.

4. **Minimize shared state**: pass values explicitly. Don't hold locks across
   IO. Don't reach into globals.

5. **Minimize indirection**: don't hide logic behind abstractions that exist
   "in case we need to swap the implementation later." We won't. A three-line
   function inline is better than a trait with one implementor.

## Applying the principles: separate decisions from execution

Every request handler does two things: **decides** what should happen, then
**executes** IO to make it happen. These should be separate functions.

A decision is a pure function. It takes data in, returns a description of what
to do. It doesn't call the network, doesn't touch the database, doesn't log.
It can be tested with `assert_eq!` and nothing else.

Execution is the messy part: HTTP calls, SQLite writes, error recovery. It
reads the decision and carries it out. It's tested with integration tests.

## Where this applies today

### Already pure

**`hasher.rs`**: the entire module is pure. `volumes_for_key` is a
deterministic function of its inputs. No IO, no state mutation. This is the
gold standard for the project.

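The comparison table describes the hash as "SHA-256 per volume, sort by score". A minimal sketch of that style of selection follows. It rests on two assumptions: the parameter list of `volumes_for_key` is guessed (the source only names the function), and the standard library's `DefaultHasher` stands in for SHA-256 so the example compiles without external crates. `DefaultHasher` output is not guaranteed stable across Rust releases, so it would not suit a real on-disk layout.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Score one (key, volume) pair. Stand-in for the SHA-256 based score
/// mentioned in the README; the helper name is hypothetical.
fn score(key: &str, volume: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    volume.hash(&mut hasher);
    hasher.finish()
}

/// Pick `replicas` volumes for `key`: score every volume, sort by score
/// (highest first), keep the top `replicas`. No IO, no shared state.
fn volumes_for_key(key: &str, volumes: &[String], replicas: usize) -> Vec<String> {
    let mut scored: Vec<(u64, &String)> =
        volumes.iter().map(|v| (score(key, v), v)).collect();
    // Highest score first; ties fall back to the volume name so the
    // result does not depend on the order of the input slice.
    scored.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.cmp(b.1)));
    scored
        .into_iter()
        .take(replicas)
        .map(|(_, v)| v.clone())
        .collect()
}
```

Each score depends only on the key and that one volume, so adding a volume leaves existing scores unchanged. A key moves only when the new volume outranks one of its current holders, which is what keeps the reshuffle small after a topology change.
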
**`rebalance.rs::plan_rebalance`**: takes a slice of records and returns a
list of moves. Pure decision logic, tested with unit tests.

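To make "records in, moves out" concrete, here is a hedged sketch of what such a planner could look like. The `Record` and `Move` shapes and the `desired_for` closure parameter are assumptions made for illustration; the source states only that the function takes a slice of records and returns a list of moves.

```rust
/// One index row: a key and the volumes that currently hold it.
/// Hypothetical shape for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    key: String,
    volumes: Vec<String>,
}

/// A planned migration for one key. Hypothetical shape for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Move {
    key: String,
    /// Desired volumes that do not hold the blob yet.
    copy_to: Vec<String>,
    /// Current volumes that should no longer hold the blob.
    delete_from: Vec<String>,
}

/// Compare where each key is with where it should be and list the
/// differences. `desired_for` supplies the target placement for a key;
/// keys that are already placed correctly produce no move.
fn plan_rebalance(
    records: &[Record],
    desired_for: impl Fn(&str) -> Vec<String>,
) -> Vec<Move> {
    let mut moves = Vec::new();
    for record in records {
        let desired = desired_for(&record.key);
        let copy_to: Vec<String> = desired
            .iter()
            .filter(|v| !record.volumes.contains(v))
            .cloned()
            .collect();
        let delete_from: Vec<String> = record
            .volumes
            .iter()
            .filter(|v| !desired.contains(v))
            .cloned()
            .collect();
        if !copy_to.is_empty() || !delete_from.is_empty() {
            moves.push(Move { key: record.key.clone(), copy_to, delete_from });
        }
    }
    moves
}
```

A plan built this way is plain data, so a preview mode can print it and a real run can execute it without the planning code knowing which of the two is happening.
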
**`db.rs` encode/parse**: `parse_volumes` and `encode_volumes` are pure
transformations between JSON strings and `Vec<String>`.

### Mixed (decision + execution interleaved)

**`server.rs::put_key`**: this handler does four things in one function:

1. *Decide* which volumes to write to (pure: `volumes_for_key`)
2. *Execute* fan-out PUTs to nginx (IO)
3. *Decide* whether to rollback based on results (pure: check which succeeded)
4. *Execute* rollback DELETEs and/or index write (IO)

Steps 1 and 3 could be extracted as pure functions if they grow more complex.

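As a sketch of what extracting step 3 might look like, the function below turns the fan-out results into a decision value. `PutDecision` and `decide_after_writes` are hypothetical names, and representing each result as a `(volume, succeeded)` pair is an assumption. On any failure it lists every target volume for rollback, following the "all volumes are rolled back" wording in the failure-modes section.

```rust
/// What to do once every fan-out PUT has returned. Hypothetical type.
#[derive(Debug, PartialEq)]
enum PutDecision {
    /// Every volume accepted the blob: record these volumes in the index.
    Commit { volumes: Vec<String> },
    /// At least one write failed: issue best-effort DELETEs to these volumes.
    Rollback { delete_from: Vec<String> },
}

/// Pure step 3: inspect the results and choose between commit and rollback.
/// Each entry is (volume URL, whether the PUT succeeded).
fn decide_after_writes(results: &[(String, bool)]) -> PutDecision {
    let targets: Vec<String> = results.iter().map(|(volume, _)| volume.clone()).collect();
    // An empty result set never commits: no write happened at all.
    let all_ok = !results.is_empty() && results.iter().all(|(_, ok)| *ok);
    if all_ok {
        PutDecision::Commit { volumes: targets }
    } else {
        PutDecision::Rollback { delete_from: targets }
    }
}
```

The handler would then match on the returned value and perform the index write or the DELETEs, so the rollback rule can be checked with `assert_eq!` and no running nginx.
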
### Intentionally impure

**`rebuild.rs`**: walks nginx autoindex and bulk-inserts into SQLite. The IO
is the whole point; there's no decision logic worth extracting.

**`db.rs`**: wraps SQLite behind `Arc<Mutex<Connection>>` with
`spawn_blocking` to avoid blocking the tokio runtime. The mutex serializes all
access; `SQLITE_OPEN_NO_MUTEX` disables SQLite's internal locking since the
application mutex handles it.

## Guidelines

1. **If a function takes only data and returns only data, it's pure.** Keep it
   that way. Don't sneak in logging, metrics, or "just one network call."

2. **If a handler has an `if` or `match` that decides between outcomes, that
   decision can probably be a pure function.** Extract it. Name it. Test it.

3. **IO boundaries should be thin.** Format URL, make request, check status,
   return bytes. No business logic.

4. **Don't over-abstract.** A three-line pure function inline in a handler is
   fine. Extract it when it gets complex enough to need its own tests, or when
   the same decision appears in multiple places (e.g., rebuild and rebalance
   both use `volumes_for_key`).

5. **Errors are data.** `AppError` is a value, not an exception. Functions
   return `Result`, handlers pattern-match on it. The `IntoResponse` impl is
   the only place where errors become HTTP responses: one place, one mapping.

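Guideline 5 can be illustrated with the rows of the error-responses table. This is a std-only sketch: the variant names and the `status` method are hypothetical, and `status` plus `Display` stand in for the project's `IntoResponse` impl, which is not reproduced here. The status codes and message texts are taken from that table.

```rust
use std::fmt;

/// Errors as plain values. Variant names are assumptions for this sketch.
#[derive(Debug, PartialEq)]
enum AppError {
    NotFound,
    CorruptRecord { key: String },
    Database(String),
    VolumeWriteFailed,
    InsufficientVolumes { need: usize, available: usize },
}

impl AppError {
    /// The single mapping from error value to HTTP status code.
    fn status(&self) -> u16 {
        match self {
            AppError::NotFound => 404,
            AppError::CorruptRecord { .. } | AppError::Database(_) => 500,
            AppError::VolumeWriteFailed => 502,
            AppError::InsufficientVolumes { .. } => 503,
        }
    }
}

impl fmt::Display for AppError {
    /// The plain-text body returned with each error.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::NotFound => write!(f, "not found"),
            AppError::CorruptRecord { key } => {
                write!(f, "corrupt record for key {}: no volumes", key)
            }
            AppError::Database(detail) => write!(f, "database error: {}", detail),
            AppError::VolumeWriteFailed => write!(f, "not all volume writes succeeded"),
            AppError::InsufficientVolumes { need, available } => {
                write!(f, "need {} volumes but only {} available", need, available)
            }
        }
    }
}
```

Keeping the mapping in one `match` means adding a variant produces a compile error until its status and message are defined.
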
## Anti-patterns to avoid

- **God handler**: a 100-line async fn that reads the DB, calls volumes, makes
  decisions, handles errors, and formats the response. Break it up.

- **Hidden state reads**: if a function needs data, pass it in. Don't reach
  into a global or lock a mutex inside a "pure" function.

- **Testing IO to test logic**: if you need a Docker container running to test
  whether volume selection works correctly, the logic isn't separated from the
  IO.