nebari

nebari - noun - the surface roots that flare out from the base of a bonsai tree

Warning: This crate is early in development. The format of the file is not considered stable yet. Do not use in production.

This crate provides the Roots type, which is the transactional storage layer for BonsaiDb. It is loosely inspired by Couchstore.

Examples

Inserting a key-value pair in an on-disk tree with full revision history:
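
A minimal sketch of that example follows. The setup calls here (Config::default_for and the Versioned alias for VersionedTreeRoot) are assumptions and may not match your version exactly; the files under nebari/examples/ are authoritative.

```rust
use nebari::{
    tree::{Root, Versioned},
    Config,
};

fn main() -> Result<(), nebari::Error> {
    // Open (or create) the database. `Config::default_for` is assumed here;
    // consult nebari/examples/ for the canonical setup for your version.
    let roots = Config::default_for("my-database.nebari").open()?;

    // `Versioned` is assumed to be the alias for VersionedTreeRoot, which keeps
    // full revision history for every key.
    let tree = roots.tree(Versioned::tree("a-tree"))?;

    // Keys and values are arbitrary byte sequences.
    tree.set("hello", b"world")?;
    let value = tree.get(b"hello")?;
    println!("hello = {:?}", value);

    Ok(())
}
```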

For more examples, check out nebari/examples/.

Features

Nebari exposes multiple levels of functionality. The lowest level is the TreeFile, a key-value store implemented atop an append-only file format.

Using TreeFiles and a transaction log, Roots enables ACID-compliant, multi-tree transactions.
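
A rough sketch of what a multi-tree transaction looks like. The transaction() entry point and the index-based tree() accessor are assumptions based on the changelog entries later in this document; the exact signatures may differ.

```rust
use nebari::{
    io::ManagedFile,
    tree::{Root, Versioned},
    Roots,
};

// Hypothetical helper: update two trees atomically in one transaction.
fn record_transfer<File: ManagedFile>(roots: &Roots<File>) -> Result<(), nebari::Error> {
    // Both trees are part of the same transaction; readers never observe a
    // state where only one of them has been updated.
    let transaction =
        roots.transaction(&[Versioned::tree("accounts"), Versioned::tree("audit-log")])?;

    {
        // Indexes correspond to the order the trees were passed in above.
        let mut accounts = transaction.tree::<Versioned>(0).unwrap();
        accounts.set("alice", b"100")?;
    }
    {
        let mut audit = transaction.tree::<Versioned>(1).unwrap();
        audit.set("credit", b"alice +100")?;
    }

    // Nothing becomes visible to readers until the commit succeeds.
    transaction.commit()?;
    Ok(())
}
```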

Each tree supports:

  • Key-value storage: Keys can be any arbitrary byte sequence up to 65,535 bytes long. For efficiency, keys should be kept to smaller lengths. Values can be up to 4 gigabytes (2^32 bytes) in size.
  • Flexible query options: Fetch records one key at a time, multiple keys at once, or ranges of keys.
  • Powerful multi-key operations: Internally, every function that alters data in a tree goes through TreeFile::modify(), which can operate on one or more keys in a single call.
  • Pluggable low-level modifications: The Vault trait allows you to bring your own encryption, compression, or other functionality to this format. Each independently-addressable chunk of data that is written to the file passes through the vault (a hypothetical sketch of such a hook follows this list).
  • Optional full revision history: If you don't want to lose old revisions of data, use a VersionedTreeRoot, which stores the information needed to scan old revisions. If you want to avoid the extra IO, use an UnversionedTreeRoot, which only stores the information needed to retrieve the latest data in the file.
  • ACID-compliance:
    • Atomicity: Every operation on a TreeFile is done atomically. Operation::CompareSwap can be used to perform atomic operations that require evaluating the currently stored value.

    • Consistency: Atomic locking operations are used when publishing a new transaction state. This ensures that readers can never operate on a partially updated state.

    • Isolation: Currently, each tree can only be accessed exclusively within a transaction. This means that if two transactions request the same tree, one will execute and complete before the second is allowed access to the tree. This strategy could be modified in the future to allow for more flexibility.

    • Durability: The append-only file format is designed to only allow reading data that has been fully flushed to disk. Any writes that were interrupted will be truncated from the end of the file.

      Transaction IDs are recorded in the tree headers. When restoring from disk, the transaction IDs are verified against the transaction log. Because of the append-only format, if we encounter a transaction that wasn't recorded, we can continue scanning the file to recover the previous state. We do this until we find a successfully committed transaction.

      This process is much simpler than most database implementations due to the simple guarantees that append-only formats provide.
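
To illustrate the Vault hook mentioned in the feature list above, here is a hypothetical sketch. It does not use the actual nebari::Vault trait definition (its real method names and error handling may differ); it only shows the shape of a hook that every chunk passes through on write and read.

```rust
/// Hypothetical chunk hook in the spirit of the Vault trait: every chunk
/// written to the file would pass through `encode`, and every chunk read back
/// would pass through `decode`.
#[derive(Debug)]
struct XorObfuscator {
    key: u8,
}

impl XorObfuscator {
    // Called with the plaintext chunk; returns the bytes to store on disk.
    fn encode(&self, chunk: &[u8]) -> Vec<u8> {
        chunk.iter().map(|byte| byte ^ self.key).collect()
    }

    // Called with the stored bytes; returns the original chunk (XOR is its own
    // inverse, so this simply mirrors `encode`).
    fn decode(&self, stored: &[u8]) -> Vec<u8> {
        self.encode(stored)
    }
}
```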

Why use an append-only file format?

@ecton wasn't a database engineer before starting this project, and depending on your viewpoint may still not be considered a database engineer. Implementing ACID-compliance is not something that should be attempted lightly.

Achieving ACID compliance with an append-only format is much easier, however, as long as you can guarantee two things:

  • When opening a previously existing file, you can identify where the last valid write occurred.
  • When writing the file, you do not report that a transaction has succeeded until the data is fully flushed to disk.

The B-Tree implementation in Nebari is designed to offer those exact guarantees.

The major downside of append-only formats is that deleted data isn't cleaned up until a maintenance process, compaction, occurs. This process rewrites the file's contents, skipping over entries that are no longer alive. It can happen without blocking the file from being operated on, but it does introduce IO overhead while it runs.

Nebari provides APIs that perform compaction, but currently delegates scheduling and automation to consumers of this library.

Open-source Licenses

This project, like all projects from Khonsu Labs, is open source. This repository is available under the MIT License or the Apache License 2.0.

Issues

Collection of the latest Issues

daxpedda (help wanted)

Nebari should expand its testing suite to include fuzz testing.

Personally I don't have any experience with it, but some pointers here:

ecton (bug)

From Discord, @rrichardson discovered that the MemoryFile implementation is dramatically slower than the real-file-based implementation, at least when consumed from BonsaiDb -- on the order of 10 minutes versus 6 seconds for a particular import operation.

ecton (enhancement)

Currently, TreeFiles store blobs/chunks in the same file that nodes are written to. When compacting a database, all of the blobs that are alive must be transferred to the new file.

Over time, this is a lot of wasted IOPS if your application is never deleting data. In this day and age, a common way to operate is to "store everything" and only delete once it becomes a problem.

The main idea of this issue is simple:

  • Change all of the tree file operations to use a new trait ChunkStorage to write non-node chunks. This may require adding a new parameter to each operation.
  • Allow specifying a ChunkStorage implementation when creating a TreeFile/Roots instance.
  • If no ChunkStorage is specified, chunks should be written in-line like they are today.
  • The ChunkStorage implementation can use 63 bits of information to note where the chunk is stored. The 64th bit will be used by Nebari to note that the chunk is stored externally.
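
A tiny, illustrative sketch of that bit split (these helpers are not existing Nebari code):

```rust
// Illustrative only: reserve the high bit of a u64 chunk position as the
// "stored externally" flag, leaving 63 bits for the ChunkStorage-defined
// location.
const EXTERNAL_CHUNK_FLAG: u64 = 1 << 63;

fn encode_external_position(position: u64) -> u64 {
    debug_assert!(position < EXTERNAL_CHUNK_FLAG, "position must fit in 63 bits");
    position | EXTERNAL_CHUNK_FLAG
}

fn decode_position(raw: u64) -> (bool, u64) {
    (raw & EXTERNAL_CHUNK_FLAG != 0, raw & !EXTERNAL_CHUNK_FLAG)
}
```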

The hard part will be compaction. Nebari doesn't keep track of chunks. The way compaction currently works is that data is copied when it's referenced; otherwise, it's skipped. To achieve the goal of not rewriting everything, the ChunkStorage implementation needs to receive enough information to determine on its own how to compact itself, or to opt not to. At this time, I'm not sure of a good way to solve this.

More intelligent compaction can be achieved by using TreeFile to implement ChunkStorage. While this causes extra overhead, the TreeFile could return unique "chunk IDs" that are stable, but the actual location on disk can be moved around. This is where the idea of "tiered" storage comes in, as this TreeFile could do many things including:

  • Embed statistics about read frequency of each key, allowing compaction to group frequently used data closer together, or moving infrequently accessed keys to slower storage.
  • Subdivide storage into segments that can be defragmented independently.

ecton (enhancement)

  • #35
  • #40
  • Consider allowing larger transaction payloads
  • Add atomic upgrade
    • TransactionLog should expose a new error when the log file uses the old format.
    • TransactionLog::repair should be added to rewrite the transaction log to a new file using the new format, then atomically swap it into place.
    • Add a unit test that upgrades an existing log file to the new format. Ensure it covers a log containing out-of-order entries.

ecton (bug)

While fixing some edge cases in 9e799b9fe3e1b5378c88d73bd51ca80861279454, I thought of another edge case, and I believe it's one that we should care about: if you roll a transaction back, the trees might still contain data referencing the rolled-back transaction. Next, close the database and reopen it. Push a transaction using different trees, and then open the other tree. Because the unused transaction ID wasn't persisted to disk, the new transaction is assigned the same ID as the abandoned transaction.

One fix would be to write all transactions to the log, but that seems wasteful. Another approach would be to add a counter that is incremented once per log-file open and checked alongside the transaction ID, to ensure the transaction ID came from the same session that successfully wrote it.

This would avoid writing data for all dropped transactions, but it might require updating the root headers depending on how it's implemented.
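
A hypothetical sketch of that second approach (none of these names exist in Nebari today):

```rust
// Illustrative only: pair each transaction ID with a session counter that is
// incremented every time the log file is opened. A tree header whose stamp
// references a different session than the one the log recorded can be treated
// as belonging to an abandoned (rolled back) transaction.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TransactionStamp {
    log_session: u64,
    transaction_id: u64,
}

impl TransactionStamp {
    // When restoring a tree, its header stamp must match a stamp the
    // transaction log actually recorded; an ID reused in a later session will
    // carry a different log_session and therefore won't match.
    fn matches(&self, logged: &TransactionStamp) -> bool {
        self == logged
    }
}
```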

ecton (enhancement)

I was just fixing a unit test that wasn't working correctly and realized that there's no real validation happening on the transaction log. We should consider adding a CRC to each entry to help detect bit rot and disk errors.

ecton (enhancement)

I was pondering khonsulabs/bonsaidb#184 and how it was a shame that floats weren't orderable at the bit level so that they could be used in keys. It occurred to me that we might be able to convert the current usage of [u8] as the key type into something like what BonsaiDb has in its Key trait -- perhaps the Key trait moves to Nebari.

Instead of requiring Key to be binary-sortable, it would just need to be any type that is Ord and can convert to bytes. Taking it a step further, this could mean that the Key trait could be replaced by a combination of transmog::BorrowedDeserializer and Ord.

I'm not sure if all of this is actually feasible, but it could open up the possibility of any Ord type being able to be used as a Key in Nebari and BonsaiDb.

(And yes, I'm aware that floats are only PartialOrd, but that can be worked around by a simple wrapper like ordered_float)

ecton (enhancement)

It occurred to me that one aspect of locking that many databases support is a way to lock multiple tables for read, ensuring that none of them change while your series of queries executes. Currently, BonsaiDb has no way to guarantee that two view queries or two collection Get calls both happen on the same transaction state, and while Nebari technically can support this, it doesn't currently provide a mechanism to succeed without loops.

This would be a new API very similar to the transaction API, except the trees returned would be normal trees. While the tree states are being acquired, each touched tree must keep its read handle locked to prevent any writer from publishing. Once all states have been acquired, the writers can be unblocked. All queries to the tree should use the cached state, not the tree's current state.

If a compaction occurs, the operation must be re-performed.

ecton

While we are trying to keep BonsaiDb lightweight, it might be interesting to explore creating a lightweight async wrapper. I'm not sure we'd use it within BonsaiDb, but if this wrapper also dealt with the lifetime issues transparently, it could be worth it to switch.

Lifetime issues: spawn_blocking requires 'static lifetimes for the closure being executed. Using something like cryo, we can take borrowed arguments and temporarily extend their lifetimes to 'static.

Refs #30

happydpc (enhancement)

Bonsaidb is a wonderful DB. Can this implement internal branch support for a struct with the versioned tree? Thanks,

ecton

Stumbled across dhat. It would be nice to put measurements behind how much memory Nebari uses compared to Sled. I know from experience trying to make the benchmarks run on VPSes that Nebari definitely needs less memory than Sled to operate.

ecton (enhancement)

When thinking about how to implement quorum clustering for BonsaiDb, there are a few approaches. One way to achieve higher throughput is to allow all nodes to act as leaders, not just one. The problem is that if two leaders execute transactions that update the same documents to two different values in different orders, chaos ensues.

The fix is either to make transactions always execute in the same order (thus essentially having a single leader), or to execute everything in parallel and place nodes that ended up out of the majority-agreed order into a repair state.

This repair mode will need to roll back the documents tree to a previous transaction ID. In Nebari, this needs to be supported at the Roots level, integrated with the TransactionManager to update the actual transaction log on-disk.

ecton (enhancement)

Originally I skipped building an iterator interface for scanning, but I realized there is no reason it can't be done. Ultimately the iterator is about the values produced, not about the other callbacks.

There are two possible ways to go about this: store the scanning state on a manually managed stack, or use a bounded channel interface.

The bounded channel interface may sound like a hack, but it has real merits: It keeps the file locked for read, preventing rename from unlinking the file contents during a compaction. This means that the iterator will be surprisingly resilient. The downside is that it will necessarily require a thread, meaning it would be a feature unavailable to no_std if we approached it that way.
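
A generic sketch of the bounded-channel idea (nothing here is existing Nebari API; it only shows the shape of the approach):

```rust
use std::sync::mpsc::{sync_channel, IntoIter};
use std::thread;

// Illustrative only: the scan runs on a worker thread that would keep the file
// handle (and therefore the read lock) alive, pushing results through a
// bounded channel. The caller consumes an ordinary Iterator; backpressure
// comes from the channel bound.
fn scan_as_iterator(keys: Vec<Vec<u8>>) -> IntoIter<Vec<u8>> {
    let (sender, receiver) = sync_channel(16);
    thread::spawn(move || {
        for key in keys {
            // In the real design this would walk the tree while holding the
            // file open for read; here we just forward pre-collected keys.
            if sender.send(key).is_err() {
                break; // The consumer dropped the iterator; stop scanning.
            }
        }
    });
    receiver.into_iter()
}
```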

The alternative requires opening the file for each step in the scan operation -- our FileManager layer will cache it, but that is still wasteful.

ecton (enhancement)

Many times when reading data from disk, we read a length and immediately ask the OS to ensure a Vec can hold that much data, potentially resizing the Vec.

If there is an issue with the data on disk, this could lead to attempting to allocate far too much memory. Even if the allocation succeeds, the subsequent read call will fail due to an EOF.

Before requesting a resize or reservation, we should check that the bytes requested are less than or equal to the number of bytes that remain in the reader.
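
A sketch of the kind of guard described (an illustrative helper, not existing Nebari code):

```rust
use std::io::{self, Read, Seek, SeekFrom};

// Illustrative only: before reserving `len` bytes, verify the reader actually
// has that many bytes left, so a corrupt length prefix can't trigger a huge
// allocation that is doomed to fail with an EOF anyway.
fn read_chunk<R: Read + Seek>(reader: &mut R, len: usize, buffer: &mut Vec<u8>) -> io::Result<()> {
    let position = reader.stream_position()?;
    let end = reader.seek(SeekFrom::End(0))?;
    reader.seek(SeekFrom::Start(position))?;

    let remaining = end.saturating_sub(position);
    if (len as u64) > remaining {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "length prefix exceeds remaining file size",
        ));
    }

    buffer.resize(len, 0);
    reader.read_exact(buffer.as_mut_slice())?;
    Ok(())
}
```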

ecton (enhancement)

On the Scaleway VM, Nebari's inserts land between Sled and SQLite in raw throughput, but locally on my machine, Sled and SQLite are able to saturate my NVMe storage -- 1.5GB/s -- while Nebari caps out at around 430MB/s.

It would be interesting to profile large blob insertions and see what the performance bottleneck is. It might also speed up other things.

ecton (enhancement)

One of the limitations of compacting files is that all of the version information is removed. Back when we had a few data accidents on CouchDB, this caused some of the mistakes to be unrecoverable -- i.e., the customer had to go fix their stuff rather than us saving the day.

I'd like to have a way to compact databases while retaining the most recent X versions of each document, or the most recent X writes, in addition to at least the latest revision of each key.

ecton (enhancement)

When Nebari was originally being developed within the BonsaiDb repository, it wasn't unreasonable to let the BonsaiDb integration provide some of the unit-test code coverage. Now that they're completely separate projects, Roots really should be unit tested independently.

ecton (enhancement)

We will want to be able to discard old entries in the transaction log. The safe way to do this will be to keep track of the last transactions that were committed for each tree, but that requires knowing all the trees that the transaction file contains -- something not currently tracked. The only way to gather the information would be to do a full scan of the existing transaction log.

ecton (enhancement)

After benchmarking scans, it's clear we're doing something that isn't optimal. Additionally, while reverse scanning isn't benchmarked, there is a TODO in the code to address optimizations that hadn't been written for the reverse iteration case.

Given the performance on smaller data sets (that mostly fit within cache), I suspect we're not skipping as many nodes as we could.

ecton (enhancement)

Compaction (#3) implements validation as part of that process, but we should also have an ad-hoc validation check that can be performed. For flexibility, these configurations should be supported:

  • Validate on read always
  • Do a full validation on database open, but never again during the process execution
  • Do a manual check at any time

ecton (enhancement)

The current implementation of read-file caching isn't great -- it never closes any files. This is fine for now, but we'll eventually run into open-file limits.

The read files (and maybe write files) should be switched to an LRU cache to automatically close file handles that haven't been needed recently as we open new files.

ecton (enhancement)

The current implementation of get_multiple and scan both take a key reader and key evaluator function. The reader is invoked for each key as it is found, and the result controls the iteration over the tree. If data is requested to be read, it will be requested after the scan is completed, ordered from the start of the file to the end of the file to try to optimize access.

For callers that need to examine the data's contents but might still want to abort the scan early, this forces the scan to iterate over the entire tree. The proposed change would invoke the key evaluator immediately after each read data request returns, allowing the evaluator or the next call to the reader to stop the scan.

Currently, no code that calls these functions in BonsaiDb needs this functionality, but as a general purpose library, this functionality could be very useful.

Versions

Find the latest versions by id

v0.5.3 - May 03, 2022

Fixed

  • File operations are now fully persisted to disk to the best ability provided by each operating system. @justinj discovered that no fsync() operations were happening, and reported the finding. Nebari's TreeFile was using File::flush() instead of File::sync_data()/sync_all(). This means that it would be possible for an OS-level buffer to not be flushed to disk before Nebari reported a successful write.

    Interestingly, this change has no noticeable impact on performance on Linux. However, on Mac OS, File::sync_data() performs a fcntl with F_FULLFSYNC, which has a significant impact on performance. This is the correct behavior, however, as without this level of guarantee, sudden power loss could result in data loss.

    Many people argue that using F_BARRIERFSYNC is sufficient for most people, but Apple's own documentation states this is necessary:

    Only use F_FULLFSYNC when your app requires a strong expectation of data persistence. Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.

    For now, the stance of Nebari's authors is that F_FULLFSYNC is the proper way to implement true ACID-compliance.

v0.5.2 - May 02, 2022

Fixed

  • Another edge case similar to the one found in v0.5.1 was discovered through newly implemented fuzzer-based testing. When a node is fully absorbed into the next node, in some cases the modification iterator would not back up to reconsider that node. When inserting a new key in this situation, if the new key was greater than the lowest key in the next node, the tree would become out of order.

    The exact circumstances of this bug are similarly rare to those described in v0.5.1's entry.

Added

  • Feature paranoid enables extra sanity checks. This feature flag was added for fuzzing purposes. It enables, in release builds, extra sanity checks that are always present in debug builds. These checks are useful for catching bugs, but when they trigger they indicate that the database would have been corrupted had the state been persisted to disk.

    These checks slow down modifications to the database significantly.

v0.5.1 - Apr 30, 2022

Fixed

  • modify() operations on larger trees (> 50k entries) that performed multiple modification operations could trigger a debug_assert in debug builds, or worse, yield incorrect databases in release builds.

    The offending situations occur with edge cases surrounding "absorbing" nodes to rebalance trees as entries are deleted. This particular edge case only arose when the absorb phase moved entries in both directions and performed subsequent operations before the next save to disk occurred.

    This bug should only have been able to be experienced if you were using large modify() operations that did many deletions as well as insertions, and even then, only in certain circumstances.

v0.5.0 - Mar 12, 2022

Breaking Changes

  • KeyEvaluation has been renamed to ScanEvaluation.

  • All scan() functions have been updated: the node_evaluator callback now returns a ScanEvaluation instead of a bool. To preserve existing behavior, return ScanEvaluation::ReadData instead of true and ScanEvaluation::Stop instead of false.

    The new functionality unlocked with this change is that scan operations can now be directed as to whether to skip navigating into an interior node. The new reduce() function uses this ability to skip scanning nodes when an already reduced value is available on a node.

Added

  • TreeFile::reduce(), Tree::reduce(), TransactionTree::reduce() have been added as a way to return aggregated information stored within the nodes. A practical use case is the ability to retrieve the number of alive/deleted keys over a given range, but this functionality extends to embedded indexes through the existing Reducer trait.

v0.4.0 - Mar 01, 2022

Breaking Changes

  • get_multiple has been changed to accept an Iterator over borrowed byte slices.

  • ExecutingTransaction::tree now returns a LockedTransactionTree, which holds a shared reference to the transaction. Previously, tree() required an exclusive reference to the transaction, preventing consumers of Nebari from using multiple threads to process more complicated transactions.

    This API is paired with a new addition: ExecutingTransaction::unlocked_tree. This API returns an UnlockedTransactionTree, which can be sent across thread boundaries safely. It offers a lock() function to return a LockedTransactionTree when the thread is ready to operate on the tree.

  • TransactionManager::push has been made private. This is a result of the previous breaking change. TransactionManager::new_transaction() is a new function that returns a ManagedTransaction. ManagedTransaction::commit() is the new way to commit a transaction in a transaction manager.

Fixed

  • TransactionManager now enforces that transaction log entries are written sequentially. The ACID compliance of Nebari was never broken when non-sequential log entries were written, but scanning the log file could fail to retrieve items, as the scanning algorithm expects the file to be ordered sequentially.

Added

  • ThreadPool::new(usize) allows creating a thread pool with a maximum number of threads set. ThreadPool::default() continues to use num_cpus::get to configure this value automatically.

v0.3.2 - Feb 23, 2022

Fixed

  • Fixed potential infinite loop when scanning for a transaction ID that does not exist.
  • Reading associated transaction log data now works when the data is larger than the page size. Previously, the data returned included the extra bytes that the transaction log inserts at page boundaries.

v0.3.1 - Feb 14, 2022

Changed

  • BorrowedRange now exposes its fields as public. Without this, there was no way to implement BorrowByteRange outside of this crate.
  • This crate now explicitly states its minimum supported Rust version (MSRV). The MSRV did not change as part of this update. It previously was not documented.

v0.3.0 - Feb 09, 2022

Breaking Changes

  • ManagedFile has had its metadata functions moved to a new trait, File, which ManagedFile must implement. This allows dyn File to be used internally. As a result, PagedWriter no longer takes a file type generic parameter.
  • ManagedFile's open_for_read and open_for_append functions have been moved to a new trait, ManagedFileOpener.
  • FileManager::replace_with now takes the replacement file itself instead of the file's Path.
  • compare_and_swap has had the old parameter loosened to &[u8], avoiding an extra allocation.
  • TreeFile::push() has been renamed to TreeFile::set() and now accepts any type that can convert to ArcBytes<'static>.

Added

  • AnyFileManager has been added to make it easy to select between memory or standard files at runtime.
  • Tree::first[_key](), TransactionTree::first[_key](), and TreeFile::first[_key]() have been added, pairing the functionality provided by last() and last_key().

v0.2.2 - Feb 01, 2022

Fixed

  • Fixed a hypothetical locking deadlock that could occur when the trees passed into State::new_transaction or Roots::new_transaction were not passed in a consistent order.

v0.2.1 - Jan 27, 2022

Fixed

  • Removing a key in a versioned tree would cause subsequent scan() operations to fail if the key evaluator requested reading data from a key that has no current data. A safeguard has been put in place to ensure that even if KeyEvaluation::ReadData is returned on an index that contains no position, the operation will be skipped rather than attempting to read data from the start of the file.

    Updating the crate should restore access to any "broken" files.

v0.2.0 - Jan 26, 2022

Breaking Changes

  • tree::State::read() now returns an Arc containing the state, rather than a read guard. This change has no noticeable impact on microbenchmarks, but yields fairer write performance under heavy-read conditions -- something the current microbenchmarks don't test, but which the in-development Commerce Benchmark for BonsaiDb unveiled.
  • Buffer has been renamed to ArcBytes. This type has been extracted into its own crate, allowing it to be used in bonsaidb::core. The new crate is available here.
  • Root::scan, Tree::scan, Tree::get_range, TransactionTree::scan, and TransactionTree::get_range now take types that implement RangeBounds<&[u8]>. BorrowByteRange is a trait that can be used to help borrow ranges of owned data.

Added

  • nebari::tree::U64Range has been exposed. This type makes it easier to work with ranges of u64s.

v0.1.1 - Jan 06, 2022

Added

  • Tree::replace has been added, which calls through to TransactionTree::replace.
  • Tree::modify and TransactionTree::modify have been added, which execute a lower-level modification on the underlying tree.

v0.1.0 - Jan 04, 2022

This release signifies that we believe Nebari is stable enough that other projects could use it. While it's had a fair amount of internal testing while developing BonsaiDb, it should still be considered alpha and used with caution.

Any breaking file format changes from this point forward will result in major version number increments.

Information - Updated May 21, 2022

Stars: 141
Forks: 6
Issues: 23

Repositories & Extras

  • influxdb provides an asynchronous Rust interface to an InfluxDB database
  • Rust DataBase Connectivity (RDBC)
  • Rdb is a relational database implemented in Rust; unlike databases such as PostgreSQL and SQLite, Rdb does not operate on a client-server model
  • Simple Database Implementation written in Rust
  • IndraDB: a graph database written in Rust
  • Jin is a small relational database engine written in Rust with the standard library
  • Shappy: a key-value database created in Rust