tantivy-cli is the project hosting the command line interface for tantivy, a search engine project.

Tutorial: Indexing Wikipedia with Tantivy CLI

Introduction

In this tutorial, we will create a brand new index containing the articles of the English Wikipedia.

Installing the tantivy CLI

There are a couple of ways to install tantivy-cli.

If you are a Rust programmer, you probably have cargo installed and can simply run: cargo install tantivy-cli

Creating the index: new

Let's create a directory, wikipedia-index, in which your index will be stored.

We will now initialize the index and create its schema. The schema defines the list of your fields, and for each field:

  • its name
  • its type, currently u64, i64 or str
  • how it should be indexed.

You can find more information about the indexing options on tantivy's schema documentation page.

In our case, each document will contain:

  • a title
  • a body
  • a url

We want the title and the body to be tokenized and indexed. We also want to add the term frequency and term positions to our index.

Running tantivy new will start a wizard that will help you define the schema of the new index.

Like all the other tantivy commands, it takes your index directory via the -i or --index parameter: tantivy new -i wikipedia-index

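Answer the wizard's questions so that the resulting schema matches the description above: title and body are tokenized and indexed with term frequencies and positions, and url is declared as a third field.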

After the wizard has finished, a meta.json file should exist at wikipedia-index/meta.json. It is fairly human-readable JSON, so you can check its content.

It contains two sections:

  • segments (currently empty, but we will change that soon)
  • schema
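If you would rather check the file from a script than by eye, a minimal Python sketch could look like this. It relies only on what is stated above: the file lives at wikipedia-index/meta.json and has top-level segments and schema entries. The helper name summarize_meta is made up for this example, and treating segments as a JSON array is an assumption.

```python
import json
from pathlib import Path


def summarize_meta(index_dir):
    """Return (number of segments, schema section) read from <index_dir>/meta.json."""
    meta = json.loads((Path(index_dir) / "meta.json").read_text(encoding="utf-8"))
    # Only the two top-level sections described above are accessed.
    # Assumption: "segments" is a JSON array, so len() gives the segment count.
    return len(meta["segments"]), meta["schema"]


if __name__ == "__main__":
    index_dir = Path("wikipedia-index")
    if (index_dir / "meta.json").exists():
        segment_count, schema = summarize_meta(index_dir)
        print(f"{segment_count} segment(s)")
        print(json.dumps(schema, indent=2))
```

Right after the wizard has run, the segment count should be zero, since nothing has been indexed yet.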

Indexing the document: index

Tantivy's index command offers a way to index a JSON file. The file must contain one JSON object per line. The structure of each JSON object must match that of our schema definition.
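If your own data is not yet in this shape, a small conversion step is usually enough. The sketch below is one possible way to do it in Python, not part of tantivy-cli: the helper name to_json_line, the sample document and the choice to reject documents with missing fields are all assumptions made for illustration. It keeps only the three fields of this tutorial's schema and emits one JSON object per line.

```python
import json

# The three fields of this tutorial's schema.
SCHEMA_FIELDS = ("title", "body", "url")


def to_json_line(doc):
    """Serialize one document as a single-line JSON object limited to the schema fields."""
    missing = [name for name in SCHEMA_FIELDS if name not in doc]
    if missing:
        # Illustrative choice: refuse incomplete documents instead of guessing values.
        raise ValueError(f"document is missing fields: {missing}")
    # Without an indent argument, json.dumps emits no raw newlines,
    # so each document stays on exactly one line.
    return json.dumps({name: doc[name] for name in SCHEMA_FIELDS})


if __name__ == "__main__":
    sample = {  # made-up document; "extra" is dropped because it is not in the schema
        "title": "Example",
        "body": "First line.\nSecond line.",
        "url": "https://en.wikipedia.org/wiki/Example",
        "extra": 1,
    }
    print(to_json_line(sample))
```

Dropping fields that are not part of the schema before indexing avoids feeding the index command objects whose structure differs from the schema.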

For this tutorial, you can download a corpus of the 5 million+ English Wikipedia articles in the right format here: wiki-articles.json (2.34 GB). Make sure to decompress the file, or, if you have bzcat installed, read it in its compressed form and skip the decompression step.
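Reading the corpus without decompressing it first can also be done from a script, for instance to inspect a few documents before indexing. The following Python sketch is only an illustration: the function name iter_documents and the file name wiki-articles.json.bz2 are assumptions, so adjust the path to whatever you downloaded.

```python
import bz2
import json
from itertools import islice
from pathlib import Path


def iter_documents(path):
    """Yield one parsed JSON object per non-empty line of a bzip2-compressed file."""
    # Text mode decompresses on the fly, so the full file never sits in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    corpus = Path("wiki-articles.json.bz2")  # hypothetical file name
    if corpus.exists():
        # Look at the first three documents only.
        for doc in islice(iter_documents(corpus), 3):
            print(sorted(doc))
```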

If you are in a rush you can download 100 articles in the right format here (11 MB).

The index command will index your documents. By default, it uses 3 threads, which share a total buffer of 1 GB split across them.

You can change the number of threads with the -t parameter, and the total buffer size shared by the threads with the -m parameter. Note that tantivy's memory usage is greater than this buffer size alone.
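To get a feel for how the two settings interact, here is a small worked calculation in Python. It assumes the total buffer is divided evenly between the threads and reads "1 GB" as 10**9 bytes; both are assumptions, and per_thread_buffer is a made-up helper, not a tantivy function.

```python
def per_thread_buffer(total_bytes, num_threads):
    """Return each indexing thread's share of the total buffer (even split assumed)."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return total_bytes // num_threads


if __name__ == "__main__":
    one_gb = 1_000_000_000  # assumption: "1 GB" read as 10**9 bytes
    print(per_thread_buffer(one_gb, 3))  # default setup: 333333333 bytes per thread
    print(per_thread_buffer(one_gb, 8))  # same total on 8 threads: 125000000 bytes per thread
```

Under these assumptions, raising the thread count without raising the total buffer shrinks each thread's share.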

On my computer (8-core Xeon(R) CPU X3450 @ 2.67GHz), with 8 threads, indexing Wikipedia takes around 9 minutes.

While tantivy is indexing, you can peek at the index directory to check what is happening.

The main file is meta.json.

You should also see a lot of files with a UUID as their filename and various extensions. Our index is in fact divided into segments. Each segment acts as an individual, smaller index, and its name is simply a UUID.
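One way to watch the segments from a script is to group the files of the index directory by their name before the first dot. The Python sketch below does just that. The helper name files_by_segment is made up, and it assumes that every non-hidden file other than meta.json belongs to a segment.

```python
from collections import defaultdict
from pathlib import Path


def files_by_segment(index_dir):
    """Map each file stem (the segment's UUID) to the sorted extensions found for it."""
    segments = defaultdict(list)
    for path in Path(index_dir).iterdir():
        # Skip sub-directories, hidden files and meta.json, which is not a segment file.
        if not path.is_file() or path.name.startswith(".") or path.name == "meta.json":
            continue
        stem, _, extension = path.name.partition(".")
        segments[stem].append(extension)
    return {stem: sorted(extensions) for stem, extensions in segments.items()}


if __name__ == "__main__":
    index_dir = Path("wikipedia-index")
    if index_dir.is_dir():
        for segment, extensions in sorted(files_by_segment(index_dir).items()):
            print(segment, extensions)
```

Running it a few times during indexing gives a rough picture of how many segments exist at a given moment.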

If you decided to index the complete Wikipedia, you may also see some of these files disappear. Having too many segments can hurt search performance, so tantivy automatically starts merging segments.

Serve the search index: serve

Tantivy's CLI also embeds a search server. You can run it with the serve command: tantivy serve -i wikipedia-index

By default, it will serve on port 3000.

You can search for the top 20 most relevant documents for the query Barack Obama by opening the following URL in your browser:

http://localhost:3000/api/?q=barack+obama&nhits=20

By default, this query is treated as barack OR obama. You can also search for documents that contain both terms by adding a + sign before each term in your query:

http://localhost:3000/api/?q=%2Bbarack%20%2Bobama&nhits=20

Similarly, a - sign excludes the documents containing a specific term:

http://localhost:3000/api/?q=-barack%20%2Bobama&nhits=20

Finally, tantivy handles phrase queries:

http://localhost:3000/api/?q=%22barack%20obama%22&nhits=20
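The escaping in these URLs (%2B for +, %20 for a space, %22 for a double quote) is easy to get wrong by hand. The Python sketch below builds the same URLs from plain query strings. The helper name search_url is made up for this example, and the host, port, path and parameter names are simply taken from the URLs above.

```python
from urllib.parse import quote


def search_url(query, nhits=20, base="http://localhost:3000/api/"):
    """Build a search URL, percent-encoding every reserved character of the query."""
    # safe="" makes quote() escape "+", spaces and double quotes as well.
    return f"{base}?q={quote(query, safe='')}&nhits={nhits}"


if __name__ == "__main__":
    for query in ("barack obama", "+barack +obama", "-barack +obama", '"barack obama"'):
        print(search_url(query))
```

Note that the first query is printed with %20 rather than +. Both forms stand for a space in a query string, while a literal + sign always has to be written as %2B.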

Search the index via the command line

You may also use the search command to stream all documents matching a specific query. The documents are returned in an unspecified order.

Benchmark the index: bench

Tantivy's CLI provides a simple benchmark tool, available through the bench command.

Issues

Collection of the latest Issues

spinscale

1

When creating a new index with tantivy, one of the questions is

Should the field be fast (Y/N)?

I'd guess that no one would like to deny that question. Maybe add some more context to this one :-)

petr-tik

0

Currently, passing a json file with fields that aren't indexable leads to an error like this

I think it would be more user-friendly to ignore fields that we don't need to index and log them to stdout, instead of stopping the indexing. This will reduce friction for new users: they will not have to change or preprocess their JSON files, and can gradually add new fields and reindex.

petr-tik

1

Going through the quickstart guide, I noticed that the CLI offers 2 field options, int or text, while core tantivy offers text, u64, i64, DateTime, Facet and Bytes.

Most likely this will be the first port of call for new users who want to check tantivy out.

Going forward, if tantivy-cli wants to stay a real and relevant part of tantivy, we will need to invest in continuous feature parity.

It might be easier if we subsume tantivy-cli/src under bin/tantivy-cli in core tantivy. This will help us check whether new fields are included in the CLI application, as well as extend the API endpoint.

We can also use CI to cross-compile and deploy ready tantivy-cli binaries for people to download and play around with.

fulmicoton

enhancement
0

... Especially #23 made it clear that tantivy-cli waiting for documents from stdin when no input file is given can be confusing.

We may also want to use indicatif to show the progress, at least when indexing from a file.

vchugreev

1

Does not work:

D:\Projects\tantivy-cli\target\debug>tantivy new -i wikipedia-index
Creating new index
Let's define it's schema!
New field name ? title
Error: Field name must match the pattern [_a-zA-Z0-9]+

wuranbo

2

Like Elasticsearch 5.0 has done: https://www.elastic.co/guide/en/elasticsearch/reference/master/grok-processor.html. With regex support, tantivy-cli would be more practical, e.g. it could use an Nginx or Apache log directly as its input file.

@fulmicoton what do you think? This is what I want to do with https://github.com/BurntSushi/fst in my own project.

So I will take it.

But as a Rust newbie, I may take some time. If you think it is a bad idea, I will still do it in my fork to get familiar with the tantivy code base. ^_^

Versions

Find the latest versions by id

0.14.0 - Feb 18, 2021

0.6.1 - Aug 27, 2018

0.5.1 - Mar 11, 2018

0.5.0 - Feb 21, 2018

0.4.2 - Jul 19, 2017

  • No more AVX in the release binaries.
  • Bugfix for unindexed fields.

0.4.0 - Jul 16, 2017

Tantivy-cli for tantivy 0.4.0

0.3.0 - Apr 23, 2017

0.2.0 - Dec 11, 2016

Information - Updated May 31, 2022

Stars: 196
Forks: 49
Issues: 13
