nebari - noun - the surface roots that flare out from the base of a bonsai tree
Warning: This crate is early in development. The format of the file is not considered stable yet. Do not use in production.
This crate provides the
Roots type, which is the transactional storage layer
BonsaiDb. It is loosely inspired by
Inserting a key-value pair in an on-disk tree with full revision history:
For more examples, check out
Nebari exposes multiple levels of functionality. The lowest level functionality
TreeFile is a key-value store that uses an append-only file format for its
TreeFiles and a transaction log,
ACID-compliant, multi-tree transactions.
Each tree supports:
- Key-value storage: Keys can be any arbitrary byte sequence up to 65,535 bytes long. For efficiency, keys should be kept to smaller lengths. Values can be up to 4 gigabytes (2^32 bytes) in size.
- Flexible query options: Fetch records one key at a time, multiple keys at once, or ranges of keys.
Powerful multi-key operations: Internally, all functions that alter the
data in a tree use
TreeFile::modify()which allows operating on one or more keys and performing various operations.
Pluggable low-level modifications: The
Vaulttrait allows you to bring your own encryption, compression, or other functionality to this format. Each independently-addressible chunk of data that is written to the file passes through the vault.
Optional full revision history. If you don't want to lose old revisions of
data, you can use a
VersionedTreeRootto store information that allows scanning old revision information. Or, if you want to avoid the extra IO, use the
UnversionedTreeRootwhich only stores the information needed to retrieve the latest data in the file.
Atomicity: Every operation on a
TreeFileis done atomically.
Operation::CompareSwapcan be used to perform atomic operations that require evaluating the currently stored value.
Consistency: Atomic locking operations are used when publishing a new transaction state. This ensures that readers can never operate on a partially updated state.
Isolation: Currently, each tree can only be accessed exclusively within a transaction. This means that if two transactions request the same tree, one will execute and complete before the second is allowed access to the tree. This strategy could be modified in the future to allow for more flexibility.
Durability: The append-only file format is designed to only allow reading data that has been fully flushed to disk. Any writes that were interrupted will be truncated from the end of the file.
Transaction IDs are recorded in the tree headers. When restoring from disk, the transaction IDs are verified with the transaction log. Because of the append-only format, if we encounter a transaction that wasn't recorded, we can continue scanning the file to recover the previous state. We do this until we find a successfluly commited transaction.
This process is much simpler than most database implementations due to the simple guarantees that append-only formats provide.
Why use an append-only file format?
@ecton wasn't a database engineer before starting this project, and depending on your viewpoint may still not be considered a database engineer. Implementing ACID-compliance is not something that should be attempted lightly.
Creating ACID-compliance with append-only formats is much easier to achieve, however, as long as you can guarantee two things:
- When opening a previously existing file, can you identify where the last valid write occurred?
- When writing the file, do not report that a transaction has succeeded until the file is fully flushed to disk.
The B-Tree implementation in Nebari is designed to offer those exact guarantees.
The major downside of append-only formats is that deleted data isn't cleaned up until a maintenance process occurs: compaction. This process rewrites the files contents, skipping over entries that are no longer alive. This process can happen without blocking the file from being operated on, but it does introduce IO overhead during the operation.
Nebari provides APIs that perform compaction, but currently delegates scheduling and automation to consumers of this library.
This project, like all projects from Khonsu Labs, are open-source. This repository is available under the MIT License or the Apache License 2.0.