Tantivy is a full-text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense that it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our search engine built on top of Tantivy.

Benchmark

The following benchmark breaks down performance for different types of queries/collections.

Your mileage WILL vary depending on the nature of queries and their load.

Features

  • Full-text search
  • Configurable tokenizer (stemming available for 17 Latin languages), with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
  • Fast (check out the :racehorse: :sparkles: benchmark :sparkles: :racehorse:)
  • Tiny startup time (<10ms), perfect for command-line tools
  • BM25 scoring (the same as Lucene)
  • Natural query language (e.g. (michael AND jackson) OR "king of pop")
  • Phrase queries search (e.g. "michael jackson")
  • Incremental indexing
  • Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
  • Mmap directory
  • SIMD integer compression when the platform/CPU includes the SSE2 instruction set
  • Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
  • &[u8] fast fields
  • Text, i64, u64, f64, dates, and hierarchical facet fields
  • LZ4 compressed document store
  • Range queries
  • Faceted search
  • Configurable indexing (optional term frequency and position indexing)
  • JSON Field
  • Aggregation Collector: range buckets, average, and stats metrics
  • LogMergePolicy with deletes
  • Searcher Warmer API
  • Cheesy logo with a horse

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust (>= 1.27) and supports Linux, macOS, and Windows.

  • Tantivy's simple search example
  • tantivy-cli and its tutorial - tantivy-cli is an actual command-line interface that makes it easy for you to create a search engine, index documents, and search via the CLI or a small server with a REST API. It walks you through getting a Wikipedia search engine up and running in a few minutes.
  • Reference doc for the last released version
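For a quick taste of the API, here is a condensed sketch in the spirit of the simple search example above. It targets the 0.17-era API (in older releases, add_document returned an Opstamp rather than a Result):

    use tantivy::collector::TopDocs;
    use tantivy::query::QueryParser;
    use tantivy::schema::{Schema, STORED, TEXT};
    use tantivy::{doc, Index};

    fn main() -> tantivy::Result<()> {
        // Define a schema with a stored title and an indexed body.
        let mut schema_builder = Schema::builder();
        let title = schema_builder.add_text_field("title", TEXT | STORED);
        let body = schema_builder.add_text_field("body", TEXT);
        let schema = schema_builder.build();

        // Create an in-RAM index and add a single document.
        let index = Index::create_in_ram(schema.clone());
        let mut writer = index.writer(50_000_000)?;
        writer.add_document(doc!(
            title => "The Old Man and the Sea",
            body => "He was an old man who fished alone in a skiff in the Gulf Stream.",
        ))?;
        writer.commit()?;

        // Parse a query and collect the top 10 documents.
        let reader = index.reader()?;
        let searcher = reader.searcher();
        let query = QueryParser::for_index(&index, vec![title, body]).parse_query("sea")?;
        for (_score, doc_address) in searcher.search(&query, &TopDocs::with_limit(10))? {
            println!("{}", schema.to_json(&searcher.doc(doc_address)?));
        }
        Ok(())
    }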

How can I support this project?

There are many ways to support this project.

  • Use Tantivy and tell us about your experience on Discord or by email ([email protected])
  • Report bugs
  • Write a blog post
  • Help with documentation by asking questions or submitting PRs
  • Contribute code (you can join our Discord server)
  • Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

Clone and build locally

Tantivy compiles on stable Rust and requires Rust >= 1.27. To check out the code and run the test suite, you can simply run:
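Presumably something along these lines (the repository lives under the quickwit-oss organization on GitHub):

    git clone https://github.com/quickwit-oss/tantivy.git
    cd tantivy
    cargo test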

Run tests

Some tests will not run with just cargo test because of fail-rs. To run the tests exhaustively, run ./run-tests.sh.

Debug

You might find it useful to step through the program with a debugger.

A failing test

Make sure you haven't run cargo clean after the most recent cargo test or cargo build to guarantee that the target/ directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under rust-gdb:
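One way to write such a script (GNU find assumed; the test binaries live in target/debug/deps and are named tantivy-<hash>):

    # Open the most recently modified tantivy test binary in rust-gdb.
    rust-gdb "$(find target/debug/deps -maxdepth 1 -type f -executable -name 'tantivy-*' \
        -printf '%T@ %p\n' | sort -rn | head -n 1 | cut -d' ' -f2-)"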

Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to cargo test like this:
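For example (the breakpoint location and test name are illustrative; --test-threads and --nocapture are standard libtest flags):

    (gdb) break index_writer.rs:42
    (gdb) run my_failing_test --test-threads 1 --nocapture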

An example

By default, rustc compiles everything in the examples/ directory in debug mode. This makes it easy for you to make examples to reproduce bugs:
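For instance, to run one of the bundled examples in debug mode (basic_search is one of the examples shipped in examples/):

    cargo run --example basic_search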

Companies Using Tantivy

FAQ

Can I use Tantivy in other languages?

  • Python → tantivy-py
  • Ruby → tantiny

You can also find other bindings on GitHub but they may be less maintained.

What are some examples of Tantivy use?

  • seshat: A Matrix message database/indexer
  • tantiny: Tiny full-text search for Ruby
  • lnx: an adaptable, typo-tolerant search engine with a REST API
  • and more!

On average, how much faster is Tantivy compared to Lucene?

  • According to our search latency benchmark, Tantivy is approximately 2x faster than Lucene.

Issues

Collection of the latest Issues

infiniteregrets

PSeitz

Datasets

For the fast field codecs, we need good datasets to test them. Ideally, these are datasets that we would expect to be indexed in a search engine.

This PR adds some datasets.

Sources

A list of sources that contain useful datasets:

  • https://github.com/RoaringBitmap/real-roaring-datasets
  • Several columns with float data (temp etc.) #404

Integer datasets

  • Timestamps (Above PR: hdfs, webserver)
  • Ids (Above PR: wikipedia)
  • Sparse

Float datasets

  • All docs = 1.
  • Temperature (Above PR: nook)
  • Prices
  • Sparse
  • Latitude, Longitude
PSeitz

ChillFish8 (help wanted)

Expose a public API for interacting with JSON field types outside of the QueryParser

As of right now, there is no public way to build queries that interact with the new JSON field type outside of the QueryParser, because the JsonTermWriter is not publicly exposed (and is also possibly slightly overkill).

Ideally, a simple(ish) API would allow users to produce a term to query with by supplying the Field, the JSON path, and the value itself.
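A purely hypothetical sketch of what that could look like (the helper below does not exist in tantivy; TermQuery and IndexRecordOption do):

    use tantivy::query::TermQuery;
    use tantivy::schema::{Field, IndexRecordOption};
    use tantivy::Term;

    // Hypothetical helper: encode a Term for a JSON field from the field
    // handle, a JSON path, and the value to match, the way JsonTermWriter
    // does internally.
    fn json_term(_field: Field, _json_path: &str, _value: &str) -> Term {
        unimplemented!()
    }

    // Intended usage:
    // let query = TermQuery::new(json_term(attrs, "color", "red"), IndexRecordOption::Basic);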

fulmicoton

Right now the collector API makes it difficult to work in a distributed environment.

We want to merge the segment fruits together on the different nodes, ship the merged results to a central node, and merge those together.

The first merge is akin to a combiner in the Hadoop world. Its outcome needs to be mergeable too.

We can probably fix tantivy by applying the following change to the Collector trait.
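The concrete diff did not survive extraction; a hedged sketch of the shape of such a change (names are illustrative, not tantivy's actual trait):

    // Split merging into a combiner step whose output can itself be merged
    // again, so partial merges can happen on each node before the final one.
    pub trait DistributedCollector {
        type SegmentFruit;
        type PartialFruit; // mergeable intermediate, like a Hadoop combiner output
        type Fruit;        // final result, produced on the central node

        fn combine(&self, segment_fruits: Vec<Self::SegmentFruit>) -> Self::PartialFruit;
        fn merge(&self, partial_fruits: Vec<Self::PartialFruit>) -> Self::Fruit;
    }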

oersted

I will shortly be starting the implementation of a new query type on a fork. I'd like to hear the opinion of the community and the maintainers on it, so that it is consistent with the project's needs and standards from the beginning.

Phrase with Gaps

The project I'm working on requires running a large volume of queries that look something like this:
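The concrete example did not survive extraction; a pattern of this general shape (illustrative only) conveys the idea, with * standing for one or more arbitrary terms:

    "the old man" * "the sea"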

Generally, efficiently retrieving sentences that match patterns with gaps. Or from a different perspective, the intersection of multiple phrase queries, where the phrases must appear in a given order and have one or more arbitrary terms between them.

Implementation

I plan to follow the existing architecture for PhraseQuery closely with minor extensions.

For instance:
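A sketch of what that extension might look like (type and field names are mine, not from the tantivy codebase):

    use tantivy::schema::Field;
    use tantivy::Term;

    // A query over ordered phrases separated by gaps of one or more
    // arbitrary terms, mirroring PhraseQuery's structure.
    pub struct PhraseWithGapsQuery {
        field: Field,
        /// The phrases, in the order in which they must appear.
        phrases: Vec<Vec<Term>>,
    }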

With relevant modifications to PhraseScorer and PhraseWeight.

As described above, I believe it would be enough to simply reuse the current phrase query algorithm, execute a query for each phrase separated by a gap, and then verify their order and distancing. However, I'm open to suggestions of more efficient algorithms.

Generalizations

These extensions are in principle not in the scope of this PR, since I don't strictly need them for my use case, and I believe that the current specification of the feature is already generic enough to be useful to other users. I will be somewhat flexible though if I see that a more general version is in demand.

  • Bounded Gap Lengths: Gaps may have min_len and max_len. My plan would be equivalent to min_len: Some(1) and max_len: None. I can see that particularly min_len: None could be useful for optional terms in phrases.
  • Slop: It is a related concept, with the main difference being that slop: 1 on a two-term phrase allows having either an arbitrary term in the middle or reversing the order of the terms.
  • Spans: The Span API from Lucene/Elasticsearch is also closely related to what I'm proposing; I believe slop is implemented on top of it.
  • Regex Phrases: We could consider defining a whole regex-style API for specifying patterns of terms. It might look something like the sketch below.
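A purely invented notation, to make the last point concrete (nothing like this exists in tantivy):

    "open" gap{1,3} "search engine"   # 1 to 3 arbitrary terms between the phrases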

Questions

  • I'd like to know if such a feature would be useful to others, if similar requests have been made before, and if such a thing belongs within the scope of tantivy in the first place.
  • Any suggestions for how to implement this efficiently are most welcome.
  • What requirements should this fulfill for a pull request to be accepted? I plan to implement this regardless of whether it is merged, but it would be nice if it also helped others.
  • Perhaps I'm missing some existing functionality that makes this redundant. Or it might be more appropriate to implement this as a filter on top of results from tantivy, instead of inside tantivy. Let me know.
  • The implementation of PhraseQuery is already a good guide, but reiterating which parts of the codebase I would need to extend for this would be helpful.
  • Again, I'm somewhat flexible to suggestions for expanding the specification to support wider use-cases, as long as it doesn't increase the workload substantially.
fulmicoton

Right now, a segment version is identified by (segment_id, delete_opstamp).

Investigate whether there is an opportunity to replace the delete opstamp with the number of deleted docs. The number of deleted docs per segment can only grow; it starts at 0.

liamwarfield

Is your feature request related to a problem? Please describe. During discussion in issue #916, @fulmicoton brought up that the current snippet scoring only takes into account how many times search terms show up in a fragment. This could be improved.

Describe the solution you'd like More factors should be taken into account when scoring snippets. Some ways to adjust scores would be:

  1. Weigh the first time a term appears in a document higher. The first time a term is used is more likely to be a definition, or the start of a section that the user is interested in.
  2. If a snippet has multiple separate terms, it should score higher. For example, say that we have the search terms "flour", "water", and "sugar". The snippet "add the sugar and water to the flour" should have a higher score than "flour, flour, flour, flour everywhere!". Currently the second might have a higher score.
  3. Currently, terms are weighted as score = 1.0 / (1.0 + doc_freq as Score). This is OK, but I'd like to try weighting by score = -ln(1.0 / (1.0 + doc_freq as Score)). This weight would be closer to the IDF, and feels more "information theory"-y (E = -k * ln(p)).

[Optional] describe alternatives you've considered I'm open to other ideas here, and I don't have a clear idea of how much the factors above should affect the score. Currently I am going to try messing around with params in my own project until I find something that feels good.

fulmicoton

I love this abbreviation because it expresses something very common precisely and concisely. However, at least two people mentioned they thought it was a typo, so we should probably replace its usage with the verbose "if and only if".

evanxg852000 (good first issue)

Right now, the documentation page for https://docs.rs/tantivy/latest/tantivy/termdict/type.TermDictionary.html renders almost empty, without listing methods and other item info. This could be the case for other types as well.

This task is to check the documentation rendered at https://docs.rs/tantivy/latest/tantivy and fix pages that are not rendering documentation content.

fulmicoton

It is probably possible to remove the clippy::uninit_vec suppression in the LZ4 compression code by using MaybeUninit.

Right now clippy complains because we end up creating a slice containing uninitialized data.
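A minimal sketch of the usual MaybeUninit pattern (not tantivy's actual code):

    use std::mem::MaybeUninit;

    // Allocate a buffer of uninitialized bytes without calling set_len on a
    // Vec<u8>, which is what clippy::uninit_vec flags. MaybeUninit<u8> may
    // legally be left uninitialized, so this is sound; the decompressor must
    // still overwrite all `len` bytes before anyone reads them.
    fn uninit_buffer(len: usize) -> Vec<MaybeUninit<u8>> {
        let mut buf = Vec::with_capacity(len);
        // SAFETY: MaybeUninit<u8> does not require initialization.
        unsafe { buf.set_len(len) };
        buf
    }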

appaquet

I recently moved my workspace to a new Mac with an M1 processor, and we have a few integration tests that require creating Tantivy indices on disk. This ended up being very, very slow on macOS. I traced the issue down to the fact that MmapDirectory calls sync_all on each file flush, which is unreasonably slow on macOS. This seems to be a common issue on macOS, and is the reason some software exposes a flag to disable it altogether (e.g. Docker for Mac).

I made changes to expose a MmapDirectorySettings (see here), which improved the performance. You can see this gist for details, but creating an index with 100 documents is about 2x slower with sync_all on macOS. Every call to sync_all takes ~20-30ms on macOS, while taking ~5ms on a slower Linux machine.
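The shape of the proposed settings, as described in the issue (field names are illustrative, from the author's fork, not tantivy's API):

    // Hypothetical: opt out of the per-file fsync on flush.
    pub struct MmapDirectorySettings {
        /// When false, skip sync_all on each file flush (much faster on
        /// macOS, at the cost of durability guarantees on crash).
        pub sync_on_flush: bool,
    }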

How would you go about that? On our side, we only use macOS for our dev flow and don't mind the decrease in safety. But I think it would make sense for Tantivy to expose such settings for the MmapDirectory. Would you like me to open a PR with my changes?

Thanks!

fmassot (enhancement)

Currently, we serialize these values for each linear block:

  • data_start_offset (u64)
  • value_start_pos (u64)
  • positive_val_offset (u64)
  • slope (f32)
  • num_bits (u8)

That represents 29 bytes, but it's possible to shave off 16 of them.

Let's first try to understand what we really need, without bitpacking. We want to model some y values [y1, y2, ..., yn] as a linear function y = a * x + b.

But we want all values to be positive, so we find a b' that satisfies this requirement, and in the end we have: y = a * x + b'.

We know that our y values will not match the linear function exactly, so we write the difference between the expected values and the real ones to disk: y_written = y_real - (a * x + b')   (1)

To rebuild y_real, we only need to know y_written, b', and a. The x value is given by the caller, so we don't need to write it.

But we also want to bitpack the y_written values. For that we need to store one piece of additional information, the num_bits used for bitpacking. And that's all:

  • data_start_offset can be deduced by summing num_bits * block_size for each block before the block we want to read. We can do this when we open the fast field file.
  • value_start_pos is not needed, as we can see in (1).
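Concretely, the per-block metadata that would remain might look like this (field names illustrative; 13 bytes instead of 29):

    struct LinearBlockMeta {
        positive_val_offset: u64, // b' in formula (1)
        slope: f32,               // a in formula (1)
        num_bits: u8,             // bit width of each bitpacked residual
    }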

Is it worth saving 16 bytes?

Saving 16 bytes is not changing the world, nor is it a priority. But removing variables that we don't need is a good thing, because it makes the code more readable.

What can we do?

For backward compatibility, I think we can add a new codec in a future release that will be more optimized than the current one.

fmassot

I thought this was present in tantivy, but for now there is only an NgramTokenizer, which tokenizes words into character n-grams.

Lucene offers a ShingleFilter, which creates shingles, i.e. token n-grams: it builds combinations of tokens rather than letters (for example, the tokens [the, quick, fox] yield the 2-gram shingles [the quick, quick fox]).

For example, this dataset publishes token n-grams, and it would be interesting to index it with tantivy instead of relying on a SQL dump.

fmassot

Is your feature request related to a problem? Please describe. When using the scorer API (badly), and especially the seek method, I ended up in an infinite loop.

It comes from a little piece of code in the SkipReader.

Unfortunately for me, my function was doing a seek on a doc_id + 1 without checking that the value was TERMINATED. I understand that this is a low-level API and that the programmer should be careful. What I suggest is to add a debug_assert in the seek of the SkipReader so that we get feedback immediately.
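A sketch of the suggested guard (illustrative; TERMINATED is tantivy's real sentinel, but the surrounding code is not the actual SkipReader):

    pub fn seek(&mut self, target: DocId) {
        debug_assert!(
            target != TERMINATED,
            "seek(TERMINATED) would loop forever; check for TERMINATED first"
        );
        // ... existing skip logic ...
    }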

PSeitz

An issue occurred where merge after commit selected only empty segments.

Investigate whether checking for non-empty segments before triggering a merge is sufficient. There are also different possible scenarios in which we end up with an empty segment:

  • a fresh new segment ends up with no docs after processing the deletes
  • an old segment ends up with no docs after processing deletes
  • a merged segment ends up with no docs after processing the deletes

To Reproduce

Change test_functional_indexing_sorted to run with 15 threads:

let mut index_writer = index.writer_with_num_threads(15, 150_000_000)?;

Maybe run in a loop or multiple times.

Evian-Zhang (help wanted)

Is your feature request related to a problem? Please describe. Modern search engines have a functionality called autocompletion. For example, when the user types in "tan", the search engine suggests the word "tantivy"; when the user types in "tantivy", it suggests the phrase "tantivy search engine". On the frontend, this often appears as a drop-down list below the search area.

Lucene has a module, Lucene Suggest, which provides such functionality.

Describe the solution you'd like Implement a module which provides such functionality.

Evian-Zhang

For example, I want to search for the sentence "Alice AND Bob", which contains three words, rather than have it interpreted as a boolean query; and I want to search for the sentence "weight*height" rather than have it interpreted as a regex query.

Versions


0.17 - Mar 09, 2022

  • LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar @fulmicoton) #115
  • Adds a searcher Warmer API (@shikhar @fulmicoton)
  • Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211
  • Facets are necessarily indexed. Existing indexes with indexed facets should work out of the box. Indexes with facets marked index: false will be broken (but they were already broken in a sense). (@fulmicoton) #1195
  • Bugfix for an issue that could, in theory, impact durability on some filesystems #1224
  • Schema now offers the option of not indexing fieldnorms (@lpouget) #922
  • Reduce the number of fsync calls #1225
  • Fix opening bytes index with dynamic codec (@PSeitz) #1278
  • Added an aggregation collector compatible with Elasticsearch (@PSeitz)
  • Added a JSON schema type @fulmicoton #1251
  • Added support for slop in phrase queries @halvorboe #1068

0.16.1 - Sep 10, 2021

Major Bugfix on multivalued fastfield. #1151

0.15.3 - Jun 30, 2021

  • Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101

0.15.2 - Jun 16, 2021

  • Major bugfix. DocStore still panics when a deleted doc is at the beginning of a block. (@appaquet) #1088

0.15.1 - Jun 14, 2021

  • Major bugfix. DocStore panics when first block is deleted. (@appaquet) #1077

0.15 - Jun 07, 2021

  • API Changes. Using Range instead of (start, end) in the API and internals (FileSlice, OwnedBytes, Snippets, ...). This change is breaking, but migration is trivial.
  • Added an Histogram collector. (@fulmicoton) #994
  • Added support for Option. (@fulmicoton)
  • DocAddress is now a struct (@scampi) #987
  • Bugfix consistent tie break handling in facet's topk (@hardikpnsp) #357
  • Date field support for range queries (@rihardsk) #516
  • Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009
  • Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@fulmicoton)
  • Simplified positions index format (@fulmicoton) #1022
  • Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030
  • Added support for more-like-this query in tantivy (@evanxg852000) #1011
  • Added support for sorting an index, e.g. presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios by utilizing the sorted data (top-n optimizations). (@PSeitz) #1026
  • Add iterator over documents in doc store (@PSeitz). #1044
  • Fix log merge policy (@PSeitz). #1043
  • Add detection to avoid small doc store blocks on merge (@PSeitz). #1054
  • Make doc store compression dynamic (@PSeitz). #1060
  • Switch to json for footer version handling (@PSeitz). #1060
  • Updated TermMerger implementation to rely on the union feature of the FST (@scampi) #469
  • Add boolean marking whether position is required in the query_terms API call (@fulmicoton). #1070

0.14 - Feb 05, 2021

  • Removed dependency on atomicwrites #833. (Implemented by @fulmicoton upon suggestion and research from @asafigan.)
  • Migrated tantivy error from the now deprecated failure crate to thiserror #760. (@hirevo)
  • API Change. Accessing the typed value of a Schema::Value now returns an Option instead of panicking if the type does not match.
  • Large API Change in the Directory API. Tantivy used to assume that all files could somehow be memory mapped. After this change, Directory returns a FileSlice that can be reduced and eventually read into an OwnedBytes object. Long and blocking IO operations are still required, but they do not span the entire file.
  • Added support for Brotli compression in the DocStore. (@ppodolsky)
  • Added helper for building intersections and unions in BooleanQuery (@guilload)
  • Bugfix in Query::explain
  • Removed dependency on notify #924. Replaced with FileWatcher struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)
  • Added FilterCollector, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)
  • Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
  • FilterCollector now supports all Fast Field value types (@barrotsteindev)
  • Fast fields are not all loaded when opening the segment reader. (@fulmicoton)

This version breaks compatibility and requires users to reindex everything.

0.13.3 - Jan 13, 2021

Minor Bugfix. Avoid relying on serde's reexport of PhantomData. (#975)

0.13.2 - Oct 01, 2020

HotFix. Acquiring a facet reader on a segment that does not contain any doc with this facet returns None. (#896)

0.13.1 - Sep 19, 2020

Made Query and Collector Send + Sync. Updated misc dependency versions.

0.13 - Aug 19, 2020

Tantivy 0.13 introduces a change in the index format that will require you to reindex your index (BlockWAND information is added in the skiplist). The index size increase is minor, as this information is only added for full blocks. If you have a massive index for which reindexing is not an option, please contact me so that we can discuss possible solutions.

  • Bugfix in FuzzyTermQuery not matching terms by prefix when it should (@Peachball)
  • Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
  • MMapDirectory::open does not return a Result anymore.
  • Change in the DocSet and Scorer API. (@fulmicoton) A freshly created DocSet points directly at its first doc. A sentinel value called TERMINATED marks the end of a DocSet. .advance() returns the new DocId. Scorer::skip(target) has been replaced by Scorer::seek(target), which returns the resulting DocId. As a result, iterating through a DocSet now looks as follows:
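A sketch of the iteration pattern this bullet describes:

    // A freshly created DocSet already points at its first doc.
    let mut doc = docset.doc();
    while doc != TERMINATED {
        // ... process `doc` ...
        doc = docset.advance();
    }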

The change made it possible to greatly simplify a lot of the docset's code.

  • Misc internal optimization and introduction of the Scorer::for_each_pruning function. (@fulmicoton)
  • Added an offset option to the Top(.*)Collectors. (@robyoung)
  • Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks to the PISA team for answering all my questions!)

0.12 - Feb 19, 2020

  • Removing static dispatch in tokenizers for simplicity. (#762)
  • Added backward iteration for TermDictionary stream. (@halvorboe)
  • Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
  • Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
  • Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)
  • Added support for field boosting. (#547, @fulmicoton)

0.11.3 - Dec 20, 2019

  • Fixed DateTime as a fast field (#735)

0.11.1 - Dec 17, 2019

  • Bug fix #729

0.11 - Dec 15, 2019

  • Added f64 field. Internally reuses the u64 code the same way i64 does (@fdb-hiroshima)
  • Various bugfixes in the query parser.
    • Better handling of hyphens in query parser. (#609)
    • Better handling of whitespaces.
  • Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types eg. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)
  • API change around Box<BoxableTokenizer>. See detail in #629
  • Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)
  • Add footer with some metadata to index files. #605 (@fdb-hiroshima)
  • Add a method to check the compatibility of the footer in the index with the running version of tantivy (@petr-tik)
  • TopDocs collector: ensure stable sorting on equal score. #671 (@brainlock)
  • Added handling of pre-tokenized text fields (#642), which will enable users to load tokens created outside tantivy. See usage in examples/pre_tokenized_text. (@kkoziara)
  • Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

How to update?

  • The index format is changed. You are required to reindex your data to use tantivy 0.11.
  • Box<dyn BoxableTokenizer> has been replaced by a BoxedTokenizer struct.
  • Regex are now compiled when the RegexQuery instance is built. As a result, it can now return an error and handling the Result is required.
  • tantivy::version() now returns a Version object. This object implements ToString()

0.10.3 - Nov 10, 2019

  • Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

0.10.2 - Oct 01, 2019

Hotfix for #656

0.10.1 - Jul 30, 2019

  • Closes #544. A few users experienced problems with the directory watching system. Avoid watching the mmap directory until someone effectively creates a reader that uses this functionality.

0.10.0 - Jul 11, 2019

Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.

  • Added an API to easily tweak or entirely replace the default score. See TopDocs::tweak_score and TopDocs::custom_score (@pmasurel)
  • Added an ASCII folding filter (@drusellers)
  • Bugfix in query.count in presence of deletes (@pmasurel)
  • Added .explain(...) in Query and Weight (@pmasurel)
  • Added an efficient way to delete_all_documents in IndexWriter (@petr-tik). All segments are simply removed.

Minor

  • Switched to Rust 2018 (@uvd)
  • Small simplification of the code. Calling .freq() or .doc() when .advance() has never been called on segment postings should panic from now on.
  • Tokens exceeding u16::max_value() - 4 chars are discarded silently instead of panicking.
  • Fast fields are now preloaded when the SegmentReader is created.
  • IndexMeta is now public. (@hntd187)
  • IndexWriter add_document, delete_term. IndexWriter is Sync, making it possible to use it with an Arc<RwLock<IndexWriter>>. add_document and delete_term only require a read lock. (@pmasurel)
  • Introducing Opstamp as an expressive type alias for u64. (@petr-tik)
  • Stamper now relies on AtomicU64 on all platforms (@petr-tik)
  • Bugfix - Files get deleted slightly earlier
  • Compilation resources improved (@fdb-hiroshima)

How to update?

Your program should be usable as is.

Fast fields

Fast fields used to be accessed directly from the SegmentReader. The API changed: you are now required to acquire your fast field reader via segment_reader.fast_fields() and use one of the typed methods (see the sketch after this list):

  • .u64(), .i64() if your field is single-valued;
  • .u64s(), .i64s() if your field is multi-valued;
  • .bytes() if your field is a bytes fast field.
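A sketch of the resulting access pattern (0.10-era API; exact return types have varied across versions):

    let fast_fields = segment_reader.fast_fields();
    let popularity = fast_fields.u64(popularity_field).expect("not a u64 fast field");
    let value = popularity.get(doc_id);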

0.9.1 - Mar 28, 2019

Hotfix. All languages were using the English stemmer.

0.9 - Mar 20, 2019

0.9.0 index format is not compatible with the previous index format.

Bugfix

Some Mmap objects were being leaked, and would never get released. (@fulmicoton)

New Features

  • Added IndexReader. By default, index is reloaded automatically upon new commits (@fulmicoton)
  • Stemming in other languages possible (@pentlander)
  • Added grouped add and delete operations. They are guaranteed to happen together (i.e. they cannot be split by a commit). In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)
  • Added DateTime field (@barrotsteindev)

Misc improvements

  • Indexer memory footprint improved (VInt compression, inlining the first block). (@fulmicoton)
  • Removed most unsafe (@fulmicoton)
  • Segments with no docs are deleted earlier (@barrotsteindev)
  • Removed INT_STORED and INT_INDEXED. It is now possible to use STORED and INDEXED for int fields. (@fulmicoton)

0.8.2b - Feb 14, 2019

0.8.2 fixes build for non x86_64 platforms. See #496 for details.

0.8.1 - Jan 23, 2019

Hotfix of #476.

Merge was reflecting deletes before commit was passed. Thanks @barrotsteindev for reporting the bug.

0.8.0 - Dec 26, 2018

  • API Breaking change in the collector API. (@jwolfe, @fulmicoton)
  • Multithreaded search (@jwolfe, @fulmicoton)

0.7.2 - Dec 18, 2018

Bugfix #457 Removing faulty debug_assert!.

0.7.1 - Nov 02, 2018

  • Bugfix: NGramTokenizer panics on non-ASCII chars
  • Added a space usage API

0.7.0 - Sep 16, 2018

  • Skip data for doc ids and positions (@fulmicoton), greatly improving performance
  • Tantivy errors now rely on the failure crate (@drusellers)
  • Added support for AND, OR, NOT syntax in addition to the +,- syntax
  • Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)
  • Added a TopFieldCollector (@pentlander)

0.6.1 - Jul 10, 2018

  • Bugfix #324. GC was removing files that were still in use
  • Added support for parsing AllQuery and RangeQuery via QueryParser
    • AllQuery: *
    • RangeQuery:
      • Inclusive field:[startIncl to endIncl]
      • Exclusive field:{startExcl to endExcl}
      • Mixed field:[startIncl to endExcl} and vice versa
      • Unbounded field:[start to *], field:[* to end]

0.6.0 - Jun 22, 2018

Special thanks to @drusellers and @jason-wolfe for their contributions to this release!

From now on, Tantivy compiles on stable Rust.

  • Removed C code. Tantivy is now pure Rust. (@pmasurel)
  • BM25 (@pmasurel)
  • Approximate field norms encoded over 1 byte. (@pmasurel)
  • Compiles on stable rust (@pmasurel)
  • Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
    • Completely uncompressed
    • Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
  • Add NGram token support (@drusellers)
  • Add Stopword Filter support (@drusellers)
  • Add a FuzzyTermQuery (@drusellers)
  • Add a RegexQuery (@drusellers)
  • Various performance improvements (@pmasurel)

0.5.2 - May 06, 2018

Hotfix of 0.5.x for the following issues

  • bugfix #274
  • bugfix #280
  • bugfix #289

Information - Updated Apr 04, 2022

Stars: 6.1K
Forks: 358
Issues: 151

Similar projects

  • Toshi: A full-text search engine in Rust. Toshi will always target stable Rust and will try its best to never make any use of unsafe Rust.
  • A Rust bookmarking tool, built with Rust and Rocket, featuring a search bar.
  • A ternary search tree collection in Rust, with an API as similar to std::collections as possible. A ternary search tree is a type of trie (sometimes called a prefix tree) where nodes are arranged in a manner similar to a binary search tree.
  • Sonic-channel: a Rust client for the Sonic search backend, and a quick and easy way to get started with search in Rust.
  • amber: a code search and replace tool written in Rust.
  • txtai: an AI-powered search engine for Rust.
  • recon_metadata: a book details and metadata search library written in Rust using reqwest.
  • Roogle: a Rust API search engine which allows you to search functions by names and type signatures.
  • An unofficial Rust library to search Nyaa, which does not provide an API of its own.