tantivy-cli is the project hosting the command line interface for tantivy, a search engine project.
Tutorial: Indexing Wikipedia with Tantivy CLI
In this tutorial, we will create a brand new index containing the articles of the English Wikipedia.
Installing the tantivy CLI.
There are a couple of ways to install the tantivy CLI.
If you are a Rust programmer, you probably have cargo installed and you can just run
cargo install tantivy-cli
Creating the index:
Let's create a directory in which your index will be stored.
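A minimal sketch of this step, assuming a POSIX shell. The directory name wikipedia-index is only an illustrative choice; any empty directory works.

```shell
# Illustrative directory name (an assumption, pick any name you like).
INDEX_DIR="wikipedia-index"

# -p avoids an error if the directory already exists.
mkdir -p "$INDEX_DIR"
```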
We will now initialize the index and create its schema. The schema defines the list of your fields, and for each field:
- its name
- its type, currently
- how it should be indexed.
You can find more information about the latter on tantivy's schema documentation page.
In our case, our documents will contain
- a title
- a body
- a url
We want the title and the body to be tokenized and indexed. We also want to add the term frequency and term positions to our index.
tantivy new will start a wizard that will help you define the schema of the new index.
Like all the other commands of tantivy, you will have to pass it your index directory as a parameter, as follows:
Answer the questions as follows:
After the wizard has finished, a meta.json file should exist in your index directory.
It is fairly human-readable JSON, so you can check its content.
It contains two sections:
- segments (currently empty, but we will change that soon)
- schema
Indexing the document:
The index command offers a way to index a JSON file.
The file must contain one JSON object per line.
The structure of this JSON object must match that of our schema definition.
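As an illustration of this format, the sketch below writes a tiny two-document file by hand. It assumes the fields were named title, body and url in the wizard; the file name sample-articles.json and the example.org URLs are made up for this example.

```shell
# Each line is one complete JSON object; there are no commas or brackets
# between documents.
cat > sample-articles.json <<'EOF'
{"title": "Example article", "body": "Some body text for the first article.", "url": "https://example.org/wiki/Example_article"}
{"title": "Another article", "body": "Some body text for the second article.", "url": "https://example.org/wiki/Another_article"}
EOF

# Quick sanity check: the number of lines is the number of documents.
grep -c '"title"' sample-articles.json
# prints 2
```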
For this tutorial, you can download a corpus with the 5 million+ English Wikipedia articles in the right format here: wiki-articles.json (2.34 GB).
Make sure to decompress the file. Alternatively, if you have bzcat installed, you can skip this step and read the file while it is still compressed.
If you are in a rush you can download 100 articles in the right format here (11 MB).
The index command will index your documents.
By default it will use 3 threads, with a total buffer size of 1GB split across these threads.
You can change the number of threads with the -t parameter, and the total buffer size used by the threads' heaps with the -m parameter. Note that tantivy's memory usage is greater than just this buffer size parameter.
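To make the split concrete, here is a small arithmetic sketch. It assumes that -m is expressed in bytes and that the buffer is divided evenly between the threads; neither detail is stated above, and the values are only illustrative.

```shell
THREADS=8                      # value you would pass to -t (illustrative)
TOTAL_BUFFER_BYTES=2000000000  # value you would pass to -m (assumed to be bytes)

# Assumed even split of the total buffer across the indexing threads.
PER_THREAD_BYTES=$(( TOTAL_BUFFER_BYTES / THREADS ))

echo "per-thread buffer: $PER_THREAD_BYTES bytes"
# prints "per-thread buffer: 250000000 bytes"
```

Under these assumptions, raising the thread count without raising the total buffer gives each thread a smaller share.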
On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), on 8 threads, indexing wikipedia takes around 9 minutes.
While tantivy is indexing, you can peek at the index directory to check what is happening.
The main file is
You should also see a lot of files with a UUID as filename, and different extensions. Our index is in fact divided into segments. Each segment acts as an individual smaller index. Its name is simply a UUID.
If you decided to index the complete wikipedia, you may also see some of these files disappear. Having too many segments can hurt search performance, so tantivy actually automatically starts merging segments.
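One rough way to watch this happen is to count the distinct segment names in the index directory. The helper below, count_segments, is a hypothetical sketch and not part of tantivy. It assumes that every visible file other than meta.json is named after its segment's UUID followed by an extension.

```shell
# Hypothetical helper: count distinct segment names in an index directory.
# $1 is the path to the index directory.
count_segments() {
  ls "$1" \
    | grep -v '^meta\.json$' \
    | sed 's/\..*$//' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}
```

Calling count_segments on your index directory a few times during indexing should show the count rise and, once merging starts, drop again.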
Serve the search index:
Tantivy's CLI also embeds a search server. You can run it with the following command.
By default, it will serve on port
You can search for the top 20 most relevant documents for the query Barack Obama by accessing the following URL in your browser:
By default this query is treated as
barack OR obama.
You can also search for documents that contain both terms by adding a + sign before each term in your query.
A - sign makes it possible to exclude the documents containing a specific term.
Finally, tantivy handles phrase queries.
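When such queries are placed in a URL, characters like +, the space and the double quote usually need to be percent-encoded, since a literal + in a query string is commonly decoded as a space. The sketch below uses a hypothetical helper, encode_query, which is not part of tantivy and only handles the characters that appear in this tutorial's queries. It also assumes that phrase queries are written between double quotes.

```shell
# Hypothetical helper: percent-encode the few special characters used here.
# '%' is encoded first so that later replacements are not encoded twice.
encode_query() {
  printf '%s\n' "$1" \
    | sed -e 's/%/%25/g' -e 's/+/%2B/g' -e 's/"/%22/g' -e 's/ /%20/g'
}

encode_query 'barack obama'      # prints barack%20obama
encode_query '+barack +obama'    # prints %2Bbarack%20%2Bobama
encode_query '+barack -obama'    # prints %2Bbarack%20-obama
encode_query '"barack obama"'    # prints %22barack%20obama%22
```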
Search the index via the command line
You may also use the search command to stream all documents matching a specific query.
The documents are returned in an unspecified order.
Benchmark the index:
Tantivy's CLI provides a simple benchmark tool. You can run it with the following command.