toshi-search/toshi

Toshi

A Full-Text Search Engine in Rust

Please note that Toshi is far from production ready. It is also still under active development; I'm just slow.

Description

Toshi is meant to be a full-text search engine similar to Elasticsearch. Toshi strives to be to Elasticsearch what Tantivy is to Lucene.

Motivations

Toshi will always target stable Rust, and we will try our best to never make any use of unsafe Rust. While underlying libraries may make some use of unsafe, Toshi will make a concerted effort to vet them in an effort to be completely free of unsafe Rust usage. I chose this because I felt that for Toshi to actually become an attractive option for people to consider, it would have to be safe, stable, and consistent. That is why stable Rust was chosen: for the guarantees and safety it provides. I did not want to go down the rabbit hole of using nightly features only to have issues with their stability later on. Since Toshi is not meant to be a library, I'm perfectly fine with this requirement, because people who want to use it will more than likely take it off the shelf and not modify it. My motivation was to cater to that use case when building Toshi.

Build Requirements

At this time Toshi should build and work fine on Windows, macOS, and Linux. To build it you will need Rust 1.39.0 (or later) and Cargo installed. You can get both easily from rustup, as shown below.
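For example, on macOS or Linux (this is rustup's own installer one-liner, not anything Toshi-specific):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup update stable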

Configuration

There is a default configuration file in config/config.toml:

Host

host = "localhost"

The hostname Toshi will bind to upon start.

Port

port = 8080

The port Toshi will bind to upon start.

Path

path = "data/"

The data path where Toshi will store its data and indices.

Writer Memory

writer_memory = 200000000

The amount of memory (in bytes) Toshi should allocate to commits for new documents.

Log Level

log_level = "info"

The detail level to use for Toshi's logging.

JSON Parsing

json_parsing_threads = 4

When Toshi does a bulk ingest of documents, it will spin up a number of threads to parse the documents' JSON as it is received. This setting controls the number of threads spawned to handle that job.

Bulk Buffer

bulk_buffer_size = 10000

This controls the buffer size for parsing documents into an index. It bounds the amount of memory a bulk ingest will take up by blocking when the message buffer is filled. If you want to go totally off the rails, you can set this to 0 to make the buffer unbounded.

Auto Commit Duration

auto_commit_duration = 10

This controls how often, in seconds, an index will automatically commit documents if there are docs to be committed. Set this to 0 to disable the feature, but then you will have to issue commits yourself when you submit documents.

Merge Policy

Tantivy will merge index segments according to the configuration outlined here. There are two options. The first is "log", the default segment merge behavior; it takes three additional values, any of which can be omitted to fall back to Tantivy's default. The default values are shown below.

In addition there is the "nomerge" option, with which Tantivy will do no merging of segments.
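A sketch of both options in TOML (the field names and default values here follow Tantivy's LogMergePolicy and are an assumption, not confirmed Toshi configuration):

[merge_policy]
kind = "log"
# Assumed Tantivy LogMergePolicy defaults:
min_merge_size = 8      # smallest number of segments considered for a merge
min_layer_size = 10000  # segments below this size all share the bottom layer
level_log_size = 0.75   # log-scale size ratio between consecutive layers

To disable merging entirely:

[merge_policy]
kind = "nomerge"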

Experimental Settings

In general these settings aren't ready for use yet, as they are very unstable or flat-out broken. Right now the distribution of Toshi sits behind this flag, so if experimental is set to false then all of these settings are ignored.
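Putting the settings above together, a complete config/config.toml using the documented defaults might look like this (a sketch; the experimental key name is an assumption based on the description above):

host = "localhost"
port = 8080
path = "data/"
writer_memory = 200000000
log_level = "info"
json_parsing_threads = 4
bulk_buffer_size = 10000
auto_commit_duration = 10
experimental = false

[merge_policy]
kind = "log"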

Building and Running

Toshi can be built using cargo build --release. Once Toshi is built, you can run ./target/release/toshi from the top level directory to start Toshi according to the configuration in config/config.toml.

You should get a startup message once Toshi is listening. You can then verify Toshi is running with a plain HTTP request against the configured host and port, as shown below.
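A minimal check against the default configuration above (the exact endpoint and its response body are assumptions; requests.http in the repository is the authoritative reference):

curl -X GET http://localhost:8080/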

Once Toshi is running, it's best to check the requests.http file in the root of this project for more examples of usage.

Example Queries

  • Term Query
  • Fuzzy Term Query
  • Phrase Query
  • Range Query
  • Regex Query
  • Boolean Query

Usage

To try any of the above queries, you can use the example shown below.

Also note that limit is optional; 10 is the default value. It's only included here for completeness.
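As a concrete sketch, here is the term query body quoted in one of the issues below, posted with curl (the lyrics index name and the search endpoint path are assumptions, not documented routes; check requests.http for the exact usage):

curl -X POST -H "Content-Type: application/json" \
  -d '{ "query": { "term": { "lyrics": "cool" } }, "limit": 10 }' \
  http://localhost:8080/lyrics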

Running Tests

cargo test

What is a Toshi?

Toshi is a three-year-old Shiba Inu. He is a very good boy and is the official mascot of this project. Toshi personally reviews all code before it is committed to this repository and is dedicated to only accepting the highest-quality contributions from his human. He will, though, accept treats for easier code reviews.

Issues

Collection of the latest Issues

seekeramento (4 comments)

Is your feature request related to a problem? Please describe.
I am not able to find a way to re-index or even delete an existing index. Once the index is created, I have to assume that the schema will never change.

Does another search engine have this functionality? Can you describe its function?
N/A

Do you have a specific use case you are trying to solve?
Create an index for a given schema. Add some documents to your created index. Update the schema by introducing two additional fields. You should be able to add documents or edit existing ones with the two newly introduced fields.

Additional context
N/A

taras (0 comments)

We're looking for a lightweight alternative to Elasticsearch that we could use in automated tests. It's not 100% clear from the README whether the goal of this project is to create something API-compatible with Elasticsearch.

For it to work for our purposes, we need the REST API to be the same and for Toshi to accept queries in the same format. Does this align with how you see the goals of this project?

Thank you for working on it. It's a great idea.

prawnsalad (3 comments)

Describe the bug
curl -v -H "Content-Type: application/json" --data-binary @bulkbody.json http://localhost:8000/emails/_bulk

If bulkbody.json is over 8 KB in size, Toshi returns 400 Bad Request: {"message":"Error in Index: 'The provided string is not valid JSON'"}

To Reproduce
Steps to reproduce the behavior: POST a request over 8 KB in size.

Expected behavior
Toshi should be able to accept POST data over 8 KB, or the limit should be configurable.


godlockin (3 comments)

Hi folks,

I want to say, you're doing a great job creating a full-text search engine in Rust; maybe it will take a piece of the pie from ES.

I cloned the project, compiled it, and ran some tests based on the requests in requests.http and others. Toshi works well, and its operations are quite similar to ES, which means a lot to me.

env: Mac, rustc 1.55.0, branch: master

But, unfortunately, I hit some blockers in my further research:

  1. How do I configure and deploy multiple Toshi instances on a single cloud instance? I can't find guidance on configuration and deployment. Do I need to copy the toshi executable along with the config folder to another path? Do I need to keep the hierarchical structure of the folders?
  2. What are the differences among these config files: config.toml, config-bench.toml, and config-rpc.toml? Each contains similar content on the master branch. Which config file do I need in which scenario?
  3. Is there a way to perform operations in Toshi other than the REST API, e.g. a Rust (or other language) SDK?

prabhatsharma (1 comment)

Currently when searching we need to explicitly provide the fields we want to search. e.g.

{ "query": { "term": { "lyrics": "cool" } }, "limit": 10 }

Is it possible to specify a query that can search all the fields? Am I thinking about this the right way? I have been able to do so in Elasticsearch. Also, looking at the Tantivy Wikipedia example, it seems we should be able to do it.

prabhatsharma (5 comments)

I am looking to list the existing indexes that are available. Do we have an API endpoint for that?

Additional context
I am building a JavaScript front-end UI for Toshi. Once the UI is in a state to be released, we can have it as a companion to Toshi.

aguynamedben (3 comments)

Describe the bug
I'm trying to use the _bulk endpoint. I read the tests in the code and understand that it wants line-by-line JSON as the request body. I got it all working, and I'm trying to index Wikipedia articles.

I regularly (but not always) get 400s with no helpful response body. I don't see a panic or any logs in Toshi's stdout. From looking at the bulk_insert handler, it looks like the failure is probably in the index_documents call at the bottom. It seems that within index_document there is some sort of 100 ms timeout. I've seen the error happen after a slight hiccup in my script's output, so I'm wondering if a delay within Toshi is causing the 400s.

It seems that if I use a small batch size, i.e. 5 or 10 records, the timeout is less likely, but I'm trying to insert 5M documents, so I want to use a batch size of 100 or 1,000 and flush at the end.

Any ideas?

Thanks for sharing this project, it's really, really cool!

To Reproduce
Steps to reproduce the behavior:

  1. Bulk insert records quickly.
  2. Watch for the error.

Expected behavior
201 Created

Desktop (please complete the following information):

  • OS: macOS 11.3.1
  • Rust Version: 1.52.1
  • Toshi Version: master @ bbfa8e

sempervictus (3 comments)

Describe the bug
The command-line invocation /usr/bin/toshi --config /etc/toshi/config.toml, which works from the shell, fails when run as a systemd service.

To Reproduce Steps to reproduce the behavior:

  1. Build release binary
  2. Write config and place in path shown above
  3. Create the data path and set appropriate ownership
  4. Create systemd service as shown
  5. Run CLI invocation, then run service
  6. Observe failure and logs

Expected behavior
Same behavior in the systemd service context as in the shell

Desktop (please complete the following information):

  • OS: Arch
  • Rust Version: Stable (Arch Linux upstream)
  • Version: current master
mdaniel (2 comments)

good first issue

What happened

Accidentally omitting document content returns 500 Internal Server Error with a body of {"message":"Internal error","uri":"/new_index"}

What was expected

Emitting any kind of informative message would be helpful. Also, in my experience, when the client receives a 500 response there is usually something informative on the server side. But in this case the server emits the same message that the client receives, which isn't helpful.

This bug is actually just the worst offender of a whole class of bugs where, if something doesn't go Toshi's way, it just gives back a raspberry; but I'd say getting a 500 for an empty document is pretty far up the list for me.

How to reproduce

Assuming you create an index based on the cargo test schema, then send in an indexing request of the form

jlerche (1 comment)

I totally get that refactoring to be agnostic to discovery mechanisms would be a significant time investment. On that front, I'd be happy to contribute the kubernetes part if you decide to go that route.

With that said, it's fairly straightforward to use the Kubernetes API. An HTTP request is made to /api/v1/namespaces/<namespace>/endpoints?labelSelector=<name-defined-in-k8s-config>. The response is something like this, assuming serde for serialization:

Retrieving the ip addresses is as simple as

Per #19, if leader election is wanted, Kubernetes has a unique number tied to each API object called resourceVersion. Here, each Address has a TargetRef field which will have a resource_version field. The leader can be chosen via min/max of the resource version associated with it. Kubernetes can also expose the pod name to the container via an environment variable, so any Toshi node can know its Kubernetes identifier.
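A minimal Rust sketch of what this comment describes (the struct shapes are assumptions modeled on the Kubernetes Endpoints object, not Toshi code; the original code snippets from the comment were not preserved here):

use serde::Deserialize;

// Only the fields of the Endpoints response that the comment refers to.
#[derive(Deserialize)]
struct Endpoints { subsets: Vec<Subset> }

#[derive(Deserialize)]
struct Subset { addresses: Vec<Address> }

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct Address {
    ip: String,
    target_ref: Option<TargetRef>,
}

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct TargetRef {
    name: String,
    resource_version: String, // usable for min/max leader election per the comment
}

// Retrieving the IP addresses of all peer nodes:
fn ips(endpoints: &Endpoints) -> Vec<&str> {
    endpoints
        .subsets
        .iter()
        .flat_map(|subset| subset.addresses.iter())
        .map(|address| address.ip.as_str())
        .collect()
}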

LucioFranco (0 comments)

enhancement

This is just a tracking issue for additional config items that need to be added:

  • Consul client buffer size

Information - Updated Mar 30, 2022

Stars: 3.5K
Forks: 108
Issues: 24
