HTSlib bindings for Rust

This library provides HTSlib bindings and a high-level Rust API for reading and writing BAM files.

To clone this repository, issue

$ git clone --recursive https://github.com/rust-bio/rust-htslib.git

ensuring that the HTSlib submodule is fetched, too. If you only want to use the library, there is no need to clone the repository. Go on to the Usage section in this case.

Requirements

rust-htslib comes with pre-built bindings to htslib for Mac and Linux. You will need a C toolchain compatible with the cc crate. The build script for this crate will automatically build and link htslib.

MUSL build

To compile this crate for MUSL you need Docker and cross:

$ cargo install cross
$ cross build                                         # will build with GNU GCC or LLVM toolchains

If you want to run rust-htslib code on AWS lambda, you'll need to statically compile it with MUSL as follows:

$ cross build --target x86_64-unknown-linux-musl      # will build with MUSL toolchain

Alternatively, you can build it locally by installing the development headers of zlib, bzip2 and xz. For instance, on Debian systems you need the following dependencies:

$ sudo apt-get install zlib1g-dev libbz2-dev liblzma-dev clang pkg-config

We provide Dockerfile bases that provide these dependencies. Refer to the docker directory in this repository for the latest instructions, including LLVM installation.

On OSX:

$ brew install FiloSottile/musl-cross/musl-cross
$ brew install bzip2 zlib xz curl-openssl

Usage

Add this to your Cargo.toml:

[dependencies]
rust-htslib = "*"

By default rust-htslib links to bzip2-sys and lzma-sys for full CRAM support. If you do not need CRAM support, or you do not need to support CRAM files with these compression methods, you can deactivate these features to reduce your dependency count:

[dependencies]
rust-htslib = { version = "*", default-features = false }

rust-htslib has optional support for serde, to allow (de)serialization of bam::Record via any serde-supported format.

HTTP access to files is available with the curl feature.

Beta-level S3 and Google Cloud Storage support is available with the s3 and gcs features.

rust-htslib can optionally use bindgen to generate bindings to htslib. This can slow down the build substantially. Enabling the bindgen feature will cause hts-sys to create a binding file for your architecture. Pre-built bindings are supplied for Mac and Linux. The bindgen feature on Windows is untested; please file a bug if you need help.

[dependencies]
rust-htslib = { version = "*", features = ["serde_feature"] }

For more information, please see the docs.

Alternatives

There's noodles by Michael Macias which implements a large part of htslib's C functionality in pure Rust (still experimental though).

Authors

  • Johannes Köster
  • Christopher Schröder
  • Patrick Marks
  • David Lähnemann
  • Manuel Holtgrewe
  • Julian Gehring

For other contributors, see here.

License

Licensed under the MIT license https://opensource.org/licenses/MIT. This project may not be copied, modified, or distributed except according to those terms. Some test files are taken from https://github.com/samtools/htslib.

Issues

Collection of the latest Issues

ctsa

0

I'd like to use rust-htslib in an application where we can retain a straightforward build process for open-source users. One benchmark for this use case is to ask how hard it is to build the tool for a CentOS 7 cluster without root.

Prior to the 0.38.0 release, rust-htslib wasn't complicating build portability by the above criteria: CentOS 7 + stock gcc + rustup seemed to be sufficient. The new changes adding zlib-ng support now require cmake v3.12+, which I can't find a way to work around with feature selection, e.g. default-features = false. Is there a feature switch which would remove the cmake requirement or at least reduce it to cmake 2.8?

essut

0

Hi, I am trying to solve issue rust-bio/rust-bio-tools#52, where the user triggered a code path that should not be possible: a FORMAT tag having a flag tag type.

I was able to reproduce this by passing an INFO tag as a FORMAT tag (rbt vcf-to-txt --fmt T < tests/test.vcf), which is surprising to me since I expected rust-htslib to panic given that the tag does not exist as a FORMAT tag.

I then wrote my own code, using test.vcf as the input:

If "T" is interpreted as a INFO tag, all is well:

However, I am surprised that interpreting "T" as a FORMAT tag does not generate any errors:

Although in the header information, "T" is identified as a INFO tag:

I am not sure if this is an expected behaviour or not. If it is not, it would help fixing this here instead of relying on downstream tools to catch this error. I am also not sure if this had been discussed before, so apologies for duplicates.

Versions: rust-bio-tools 0.39.0 rust-htslib 0.38.2
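The check the reporter expected can be sketched independently of rust-htslib. The following is a minimal illustration, not the library's behaviour or API: it scans plain-text VCF header lines and reports whether an ID is declared under INFO, FORMAT, or both. The names TagKind and declared_kinds are hypothetical, and the sketch assumes ID is the first key inside the angle brackets.

```rust
/// Which header section declares a tag.
#[derive(Debug, PartialEq)]
enum TagKind {
    Info,
    Format,
}

/// Collect every section (INFO and/or FORMAT) in which `id` is declared.
/// `header` is the plain-text VCF header, one `##` line per declaration.
fn declared_kinds(header: &str, id: &str) -> Vec<TagKind> {
    let mut kinds = Vec::new();
    for line in header.lines() {
        let (kind, rest) = if let Some(rest) = line.strip_prefix("##INFO=<") {
            (TagKind::Info, rest)
        } else if let Some(rest) = line.strip_prefix("##FORMAT=<") {
            (TagKind::Format, rest)
        } else {
            continue;
        };
        // Assumption: ID is the first key; its value ends at ',' or '>'.
        if let Some(value) = rest.strip_prefix("ID=") {
            let end = value
                .find(|c: char| c == ',' || c == '>')
                .unwrap_or(value.len());
            if &value[..end] == id {
                kinds.push(kind);
            }
        }
    }
    kinds
}

fn main() {
    // Hypothetical header: "T" is declared as an INFO flag only.
    let header = "##fileformat=VCFv4.2\n\
                  ##INFO=<ID=T,Number=0,Type=Flag,Description=\"hypothetical flag\">\n\
                  ##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n";
    println!("T is declared as: {:?}", declared_kinds(header, "T"));
}
```

A downstream tool could use a check of this kind to reject a FORMAT lookup for a tag that the header only declares under INFO.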

jemma-nelson

2

Would it be possible to add support for writing BGZF files? I'm trying to accelerate a small program that both reads and writes Fastq, and it would be cool to support BGZF on both the input and output sides of things, ideally with multi-threading support.

Currently, it looks like only the Reader from htslib is exposed: https://github.com/rust-bio/rust-htslib/blob/master/src/bgzf/mod.rs

If this isn't feasible, totally understand. If it sounds feasible but you don't have the desire to do it, let me know and I can take a stab at putting a PR together.

wdesouza

2

Is it possible to merge VCF headers to combine records from different files? I'm trying to do something like this

xosxos

1

In the documentation it says under

Add/replace a float-typed INFO entry.

However when trying to add an INFO field from another VCF file I get the following error:

I haven't followed the source code beyond the bindings to C, however I was hoping there is some simple explanation to this specific error. Should the tag be defined in the header before trying to push? The same error applies to all functions when adding fields. Replacing seems to work fine.

ebioman

8

Hi I am currently trying to create uBAM (unmapped BAM) files and running into problems creating non-mapped entries from any kind of given sequence (e.g. FASTA, FASTQ,...). Essentially I can create a new BAM default record and BAM::Writer will write it but then I get either:

  1. for SAM output a segfault
  2. for BAM output no error at all, but the generated BAM is truncated

To understand this better, let's take a BAM input and generate a BAM output file

This compiles but then fails with a Segmentation fault (core dumped) for SAM. If I write the original record instead of the new empty one, everything works fine. I guess some combination of features, or rather a missing feature, breaks it, but I can't find which one.

ebioman

0

Hi, thanks for this great library! I am currently a bit confused about the fastest way to get a pileup at a known position. The only way I have found so far is via IndexedReader, where bam is the IndexedReader:

However, iterating and checking each position against my position of interest is pretty slow, and I am wondering whether I am doing something obviously wrong. I guess that fetching a specific 1 bp position can still return quite a large range depending on my read length, but I still feel I am missing something here...

kentwait

1

Is there a way to get the inserted or deleted sequence while doing a pileup?

For example the reference is the first line and there are 3 mapped reads. The first read has an insertion GGC and the third has a deletion of GGC.

Using the example code below will get the length of the indel (3) but I don't know how to get the inserted or deleted bases.

In samtools mpileup this would be +3GGC for insertion or -3GGC for deletion. Is there an existing method for this or do I need to manually process this by reading the CIGAR of each read?

Thank you for the help.
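The manual CIGAR route mentioned in the question can be sketched as follows. This is a self-contained illustration under stated assumptions, not an existing rust-htslib method: Op is a hypothetical stand-in for a CIGAR operation type, and insertion_tokens is a hypothetical helper. It only recovers inserted bases, since those are stored in the read itself; deleted bases are absent from the read and would have to be looked up in the reference sequence.

```rust
/// Minimal stand-in for CIGAR operations; not the rust-htslib `Cigar` type.
#[derive(Clone, Copy)]
enum Op {
    Match(usize),    // M, = or X: consumes read and reference
    Ins(usize),      // I: consumes read only
    Del(usize),      // D: consumes reference only
    SoftClip(usize), // S: consumes read only
}

/// Return (reference offset, mpileup-style token) for each insertion.
/// The offset is relative to the alignment start and marks the reference
/// base that follows the inserted sequence.
fn insertion_tokens(cigar: &[Op], seq: &[u8]) -> Vec<(usize, String)> {
    let (mut read_pos, mut ref_pos) = (0usize, 0usize);
    let mut out = Vec::new();
    for op in cigar {
        match *op {
            Op::Match(n) => {
                read_pos += n;
                ref_pos += n;
            }
            Op::SoftClip(n) => read_pos += n,
            Op::Del(n) => ref_pos += n,
            Op::Ins(n) => {
                // The inserted bases are the next `n` bases of the read.
                let bases = String::from_utf8_lossy(&seq[read_pos..read_pos + n]);
                out.push((ref_pos, format!("+{}{}", n, bases)));
                read_pos += n;
            }
        }
    }
    out
}

fn main() {
    // Hypothetical read: 2 matches, a 3-base insertion, 2 matches.
    let cigar = [Op::Match(2), Op::Ins(3), Op::Match(2)];
    for (offset, token) in insertion_tokens(&cigar, b"ACGGCTT") {
        println!("insertion before reference offset {}: {}", offset, token);
    }
    // Deletions and soft clips only move the cursors.
    let other = [Op::SoftClip(1), Op::Match(2), Op::Del(3), Op::Match(2)];
    println!("{} insertions", insertion_tokens(&other, b"NACTT").len());
}
```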

DonFreed

27

Thank you for the very nice library!

I'm working on a tool that passes reads between threads. Unfortunately, the tool is producing non-deterministic segmentation faults and other memory errors. I've traced some of these issues back to rust-htslib, which will crash somewhat randomly when bam::record::Records are passed between threads. I am not too familiar with this library, so I am wondering if this is the expected behavior?

Here is a simplified example that can reproduce the crash:

Compiling with RUSTFLAGS="-g" cargo build and then running with a SAM passed through stdin produces the following:

Re-running the same command will produce the crash in different parts of the program. I've attached a backtrace from one of the crashes: crash backtrace.txt

wdesouza

0

The bio-types crate has been updated, adding an is_empty method to the Record trait. Since the latest version of rust-htslib doesn't implement this method for Record, it broke the crate. I am using rust-htslib = "0.35.2" in my project and it is not compiling due to this error. Is it possible to set a fixed version of bio-types for future versions of rust-htslib?

cargo check on master rust-htslib:

ahcm

2

thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 5915', .cargo/registry/src/github.com-1ecc6299db9ec823/rust-htslib-0.35.2/src/bam/record.rs:825:6 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Sample code from website:

malthesr

0

I'm having trouble constructing Genotypes. The struct is public, but as I understand it custom_derive marks the inner field private and provides no alternative constructor? Consider for example:

This fails to compile with the error:

I'm finding it hard to understand the custom_derive source and docs, however, so perhaps I'm just missing something?

andrewpatto

1

Just in doing some work on printing example BAM headers - and thought we'd use to_hashmap() to get us the header data nicely split up. But it appears to be panicking on the unwrap() at line 88 of src/bam/header.rs

let cap = TAG_RE.captures(part).unwrap();

which would make sense as the TAG_RE expects to see a colon, but the comment records do not necessarily fit that pattern

For example, the public cram

"#404"

has some header records

Without investigating any further, my guess is that those are the records that are failing. I'm not enough of a bioinformatics expert to know if it is the headers themselves at fault, so I'll just leave it here as an issue.
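The failure mode described here, an unwrap() on a field that has no colon, can be avoided by treating the missing colon as an ordinary case. The sketch below is a self-contained illustration and not the code in src/bam/header.rs: parse_header_line is a hypothetical helper that uses str::split_once instead of a regex and returns None for @CO comment lines, which carry free text rather than TAG:VALUE fields.

```rust
use std::collections::HashMap;

/// Parse one SAM header line into (record type, tag -> value).
/// Returns None for lines that are not structured records, such as
/// @CO comments, instead of panicking.
fn parse_header_line(line: &str) -> Option<(String, HashMap<String, String>)> {
    let mut parts = line.split('\t');
    let record_type = parts.next()?.strip_prefix('@')?;
    if record_type == "CO" {
        return None; // free-text comment: no TAG:VALUE structure
    }
    let mut fields = HashMap::new();
    for part in parts {
        // split_once yields None when there is no colon; skip such fields.
        // It splits at the first colon, so values may contain colons.
        if let Some((tag, value)) = part.split_once(':') {
            fields.insert(tag.to_string(), value.to_string());
        }
    }
    Some((record_type.to_string(), fields))
}

fn main() {
    // Hypothetical header lines for illustration.
    for line in ["@SQ\tSN:chr1\tLN:1000", "@CO\tany free text here"] {
        match parse_header_line(line) {
            Some((kind, fields)) => println!("{} -> {:?}", kind, fields),
            None => println!("skipped: {}", line),
        }
    }
}
```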

kentwait

2

The following text file was bgzip compressed using bgzip 1.10.2

bgzip test.pileup.txt

Then tabix-indexed using tabix 1.10.2

tabix -s 1 -b 2 -e 2 -c \# test.pileup.txt.gz

When I try to read test.pileup.txt I get an InvalidIndex error for the following code let reader = tbx::Reader::from_path(path).expect("Failed to create TabixPileup from path");

Failed to create TabixPileup from path: InvalidIndex

Let me know if I'm missing something here. I'm stumped. I've tried both with a header and without.

test.pileup.txt test.noheader.pileup.txt

hjeremyli

documentation
9

As far as I can tell, there is currently no way to update genotypes in a bcf Record.

Related: https://github.com/rust-bio/rust-htslib/blob/ed217c2fe460794d249e58a85db0b0c8a2929e55/src/bcf/record.rs#L360

Since the GT field has a String format (as specified in the header) I tried to work around this by using Record.push_format_string() like so:

However, this yields the following error:

The above works if I replace the tag argument in push_format_string() with something other than "GT", which seems to suggest that attempting to edit the GT field using this method is treated differently from other String fields.

It seems I'm likely missing something here, but is there a straightforward way of updating genotypes for a given record as I attempted to do above with the current release? Something I imagine that would provide similar functionality to the bcf_update_genotypes macro in htslib.
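One likely reason GT behaves differently, offered as background rather than a diagnosis of this report: although the header declares GT as String, BCF stores genotypes as integer-encoded alleles, which htslib's vcf.h expresses through the bcf_gt_unphased and bcf_gt_phased macros. The sketch below reproduces that encoding in plain Rust; encode_allele and encode_genotype are hypothetical helper names, not rust-htslib functions.

```rust
/// Encode one allele the way htslib's `bcf_gt_unphased` / `bcf_gt_phased`
/// macros do: (index + 1) << 1, with the lowest bit set when phased.
/// `None` stands for a missing allele ('.'), whose base value is 0.
fn encode_allele(index: Option<i32>, phased: bool) -> i32 {
    let base = match index {
        Some(i) => (i + 1) << 1,
        None => 0,
    };
    base | phased as i32
}

/// Encode a whole genotype. Assumption: the phase bit is set on every
/// allele except the first, as is commonly done for genotypes like 0|1.
fn encode_genotype(alleles: &[Option<i32>], phased: bool) -> Vec<i32> {
    alleles
        .iter()
        .enumerate()
        .map(|(n, &allele)| encode_allele(allele, phased && n > 0))
        .collect()
}

fn main() {
    println!("0/1 -> {:?}", encode_genotype(&[Some(0), Some(1)], false));
    println!("0|1 -> {:?}", encode_genotype(&[Some(0), Some(1)], true));
    println!("./. -> {:?}", encode_genotype(&[None, None], false));
}
```

The changelog on this page lists the ability to push genotypes into a BCF record under v0.35.0, which is probably the better route than a string-typed update.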

Adaminius

2

Hello!

I am receiving errors like malloc: Incorrect checksum for freed object 0x7fdf6c4066b8: probably modified after being freed in my code, and I think it might be due to a bug in rust-htslib's use of bcf_get_format_values() in htslib for strings or genotypes. Here is the smallest example I could cut my code down to and still reproduce the issue:

I have the above code ready to run with an example VCF in this repo.

I have reproduced the issue on macOS 10.14 and Ubuntu 16.04.

I am new to Rust and relatively inexperienced with other low-level languages, so I am having difficulty debugging this myself, so I would really appreciate it if one of the authors could take a look into this.

Thank you!

Versions

Find the latest versions by id

rust-htslib-v0.39.5 - May 09, 2022

Bug Fixes

  • set path in release-please config (d8f7c6e)

rust-htslib-v0.39.4 - May 09, 2022

Bug Fixes

  • perform checkout before running release please (cbc6a0a)

rust-htslib-v0.39.3 - May 04, 2022

Bug Fixes

  • change the type to c_char so it can be compiled for aarch64 (#337) (a21aff2)

rust-htslib-v0.39.2 - Aug 23, 2021

Bug Fixes

  • Configuration when cross-compiling. Even when cross-compiling, build.rs runs on the build host. Hence within build.rs #[cfg(target_os)] always reflects the host, not the target. Use $CARGO_CFG_TARGET_OS instead to query target properties. (#329) (d5198e6)
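The distinction in this fix can be sketched as a small build script. This is an illustration only and not the actual hts-sys build.rs: settings_for and the strings it returns are hypothetical placeholders. CARGO_CFG_TARGET_OS is the environment variable Cargo sets for build scripts, and cargo:warning= is a standard build-script directive.

```rust
use std::env;

/// Pick target-specific settings from the value Cargo passes to build
/// scripts in CARGO_CFG_TARGET_OS. The returned strings are hypothetical
/// placeholders, not the real hts-sys configuration.
fn settings_for(target_os: &str) -> &'static str {
    match target_os {
        "macos" => "macos-specific flags",
        "linux" => "linux-specific flags",
        _ => "generic flags",
    }
}

fn main() {
    // Inside build.rs, cfg!(target_os = "...") describes the machine that
    // runs the script (the host). The environment variable describes the
    // platform being compiled for (the target).
    let target_os =
        env::var("CARGO_CFG_TARGET_OS").unwrap_or_else(|_| "unknown".to_string());
    println!(
        "cargo:warning=configuring for {}: {}",
        target_os,
        settings_for(&target_os)
    );
}
```

When cross-compiling from macOS to Linux, for example, a cfg!-based check would select the macOS branch, while the environment variable selects the Linux one.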

hts-sys-v2.0.2 - Aug 23, 2021

Bug Fixes

  • Configuration when cross-compiling. Even when cross-compiling, build.rs runs on the build host. Hence within build.rs #[cfg(target_os)] always reflects the host, not the target. Use $CARGO_CFG_TARGET_OS instead to query target properties. (#329) (d5198e6)

v0.38.2 - Jul 06, 2021

Bug Fixes

  • add ID to automatic release handling (1244393)

v0.38.1 - Jul 06, 2021

Bug Fixes

v0.38.0 - Jul 06, 2021

⚠ BREAKING CHANGES

  • Improve bcf Record filter interface and improve docs (#306)

Features

  • Improve bcf Record filter interface and improve docs (#306) (f45e91d)

Bug Fixes

rust-htslib-v0.39.1 - Jul 06, 2021

Bug Fixes

  • bump hts-sys version to 2.0.1 (336c8b8)

rust-htslib-v0.39.0 - Jul 06, 2021

⚠ BREAKING CHANGES

  • dummy major version bump to move away from previous versions that were following htslib versions.
  • bump to new major version (for technical reasons).
  • dummy breaking change to increase hts-sys major version.

Bug Fixes

  • bump to new major version (for technical reasons). (9c6db30)
  • dummy breaking change to increase hts-sys major version. (93415cb)
  • dummy changes (3af5ede)
  • dummy major version bump to move away from previous versions that were following htslib versions. (aaa70a8)
  • dummy release (74d1565)
  • dummy release (af2f84e)
  • dummy release (b97915f)
  • handle subcrate with release-please (0a4605f)
  • trigger dummy release (7c5a7de)
  • update changelog (deef08f)

rust-htslib-v0.38.3 - Jul 06, 2021

Bug Fixes

  • dummy fix for triggering release (e92e6b1)

hts-sys-v2.0.1 - Jul 06, 2021

This is a dummy release for technical reasons: we needed to decouple the hts-sys version from htslib, such that it becomes easier to apply custom fixes while keeping a semantic versioning scheme in the future.

hts-sys-v2.0.0 - Jul 06, 2021

⚠ BREAKING CHANGES

  • dummy major version bump to move away from previous versions that were following htslib versions.

Bug Fixes

  • dummy major version bump to move away from previous versions that were following htslib versions. (aaa70a8)

hts-sys-v1.0.0 - Jul 06, 2021

⚠ BREAKING CHANGES

  • bump to new major version (for technical reasons).
  • dummy breaking change to increase hts-sys major version.

Bug Fixes

  • bump to new major version (for technical reasons). (9c6db30)
  • dummy breaking change to increase hts-sys major version. (93415cb)
  • dummy changes (3af5ede)
  • update changelog (deef08f)

hts-sys-v0.1.0 - Jul 06, 2021

Bug Fixes

  • dummy release

hts-sys-1.11.1-fix2 - Jul 06, 2021

Dummy release for technical reasons.

v0.37.0 - Jul 05, 2021

Added

  • bcf::Record methods end, clear, and rlen (@mbhall88).

Changes

  • bcf::IndexedReader::fetch parameter end is now an Option<u64>. This is in line with htslib regions, which do not require an end position (@mbhall88)
  • Removed unused dependencies (@sreenathkrishnan).
  • Improved documentation (@mbhall88).
  • Improved error message when failing to load index files (@johanneskoester).
  • Improved API for accessing AUX fields in BAM records (@jch-13).
  • Fixed compiler warnings (@fxwiegand).
  • BAM header representation is now always kept in sync between textual and binary (@jch-13).

v0.36.0 - Jan 07, 2021

Changes

  • Improved genotype API in VCF/BCF records (@MaltheSR).
  • Read pair orientation inference for BAM records (@johanneskoester).

v0.35.2 - Nov 23, 2020

This release contains a fix for problems with semantic versioning in Cargo when specifying the hts-sys dependency.

v0.35.1 - Nov 23, 2020

  • Fixed wrongly defined missing value constants in bcf::record (@johanneskoester).
  • Bump hts-sys dependency to the latest version, containing build fixes for macOS (@johanneskoester).

v0.35.0 - Nov 19, 2020

Changes

  • BREAKING: info and format field access in BCF records now allocates a separate buffer each time. In addition, it is also possible to pass a buffer that has been created manually before (@johanneskoester)
  • Fixes for building on macOS (@brainstorm)

Added

  • ability to push genotypes into BCF record (@MaltheSR, @tedil).

v0.34.0 - Nov 13, 2020

Added

  • Ability to set minimum refetch distance in bam::RecordBuffer.

v0.33.0 - Nov 04, 2020

Changes

  • BREAKING: Rename feature 'serde' as 'serde_feature' (for technical reasons)
  • BREAKING: Consolidate module-wide errors into a crate-wide error module
  • Making bcf::IndexedReader always unpack records to reflect the behaviour of bcf::Reader.
  • Adding bcf::errors::Error::FileNotFound and using it.
  • Fixes for musl compilation (@brainstorm).
  • Improved BCF constants handling (@juliangehring)
  • Fixes for tabix reader (@felix-clark, @brainstorm).
  • Fixes for BCF handling (@holtgrewe, @tedil).
  • Documentation improvements (@vsoch, @brainstorm, @edmundlth).
  • BREAKING: Improved, more ergonomic BAM fetch API (@TyberiusPrime, @brainstorm, @tedil).
  • BREAKING: Let BamRecordExtensions return iterators instead of vectors (@TyberiusPrime).
  • Handle all errors via a unified single thiserror based enum (@landesfeind).
  • BREAKING: BAM read API now returns Option (@slazicoicr).

Added

  • Support for reading indexed FASTA files (@landesfeind, @pmarks, @brainstorm).
  • Support for shared threadpools when reading and writing BAM (@pmarks, @nlhepler).
  • Serde support for Cigar strings (@FelixMoelder, @pmarks, @johanneskoester).
  • Expose bgzf functionality (@landesfeind).
  • Iterator over BAM records using Rc-pointers (@TyberiusPrime, @tedil).
  • Ability to obtain pairs of read and genome intervals from BAM (aligned_block_pairs) (@TyberiusPrime, @brainstorm).

v0.32.0 - Nov 04, 2020

Changes

  • Method seq_len() of bam::Record is now public.
  • Speedup when parsing BAM headers (thanks to @juliangehring).
  • Compatibility fixes for older rust versions (thanks to @pmarks and @brainstorm).

v0.31.0 - Jun 23, 2020

  • Bam record buffer now returns reference counted (Rc) objects. This makes the API more ergonomic to use.
  • Switched to thiserror instead of snafu for error handling.
  • Various cleanups and little fixes.

This release breaks parts of the API due to the error handling changes and the difference in the bam record buffer.

v0.30.0 - Apr 03, 2020

  • Removed fn header_mut() from bam::Read trait.
  • Fixed a major performance regression when reading bam files (issue #195).

v0.29.0 - Mar 26, 2020

Also return u64 for target_len in BAM header.

v0.28.0 - Mar 26, 2020

  • Return u64 wherever htslib has migrated to using 64 bit.
  • Implement more bio-types (Interval, Locus, Strand).

v0.27.0 - Mar 17, 2020

This release adds various improvements. Moreover, it migrates to htslib 1.10. This is a breaking change, since various data types had to change.

v0.26.1 - Mar 17, 2020

This is a bugfix release. See CHANGELOG.md for details.

Information - Updated May 21, 2022

Stars: 176
Forks: 53
Issues: 17
