
readstat-rs

Command-line tool for working with SAS binary — sas7bdat — files.

Get metadata, preview data, or convert data to csv, feather (or the Arrow IPC format), ndjson, or parquet formats.

ReadStat

The command-line tool is developed in Rust and is only possible due to the excellent ReadStat library developed by Evan Miller.

The ReadStat repository is included as a git submodule within this repository. In order to build and link, first a readstat-sys crate is created. Then the readstat binary utilizes readstat-sys as a dependency.

Install

Download a Release

[Mostly] static binaries for Linux, macOS, and Windows may be found at the Releases page.

Build

Linux and macOS

Building is as straightforward as cargo build.
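For example, from the repository root:

```shell
# Debug build
cargo build

# Optimized release build
cargo build --release
```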

Windows

Building on Windows requires that LLVM 12 be downloaded and installed. In addition, the path to libclang must be set in the environment variable LIBCLANG_PATH. If LIBCLANG_PATH is not set, the readstat-sys build script assumes the needed path to be C:\Program Files\LLVM\lib.

For details see the following.

  • Check for LIBCLANG_PATH
  • Building in Github Actions

Run

After building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:

  • metadata → writes the following to standard out (or, optionally, as json)
    • row count
    • variable count
    • table name
    • table label
    • file encoding
    • format version
    • bitness
    • creation time
    • modified time
    • compression
    • byte order
    • variable names
    • variable type classes
    • variable types
    • variable labels
    • variable format classes
    • variable formats
    • arrow data types
  • preview → writes the first 10 rows (or a user-specified number of rows) of parsed data in csv format to standard out
  • data → writes parsed data in csv, feather, ndjson, or parquet format to a file

Metadata

To write metadata to standard out, invoke the following.

To write metadata to json, invoke the following. This is useful for reading the metadata programmatically.
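A sketch of both invocations follows. The file path is a placeholder, and the json flag name (`--as-json`) is an assumption, as the exact flag is not given in this document; run with `--help` to confirm.

```shell
# Write metadata to standard out
readstat metadata /some/dir/to/file.sas7bdat

# Write metadata as json (flag name assumed)
readstat metadata /some/dir/to/file.sas7bdat --as-json
```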

Preview Data

To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).

To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.
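The two invocations might look like the following; the `--rows` flag name is an assumption, not confirmed by this document.

```shell
# Preview the first 10 rows (the default) as csv on standard out
readstat preview /some/dir/to/file.sas7bdat

# Preview the first 100 rows (--rows flag name assumed)
readstat preview /some/dir/to/file.sas7bdat --rows 100
```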

Data

:memo: The data subcommand includes a parameter for --format, which is the file format that is to be written. Currently, the following formats have been implemented:

  • csv
  • feather
  • ndjson
  • parquet

csv

To write parsed data (as csv) to a file, invoke the following (default is to write all parsed data to the specified file).

The default --format is csv; thus the parameter is omitted from the examples below.

To write the first 100 rows of parsed data (as csv) to a file, invoke the following.
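A sketch of both csv invocations; the `--output` and `--rows` flag names are assumptions, not confirmed by this document.

```shell
# Write all parsed data as csv (csv is the default --format)
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.csv

# Write only the first 100 rows
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.csv --rows 100
```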

feather

To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).

To write the first 100 rows of parsed data (as feather) to a file, invoke the following.
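A sketch of both feather invocations; the `--output` and `--rows` flag names are assumptions, not confirmed by this document.

```shell
# Write all parsed data as feather
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.feather --format feather

# Write only the first 100 rows
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.feather --format feather --rows 100
```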

ndjson

To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).

To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.
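A sketch of both ndjson invocations; the `--output` and `--rows` flag names are assumptions, not confirmed by this document.

```shell
# Write all parsed data as ndjson
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.ndjson --format ndjson

# Write only the first 100 rows
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.ndjson --format ndjson --rows 100
```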

parquet

To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).

To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.
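A sketch of both parquet invocations; the `--output` and `--rows` flag names are assumptions, not confirmed by this document.

```shell
# Write all parsed data as parquet
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.parquet --format parquet

# Write only the first 100 rows
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.parquet --format parquet --rows 100
```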

Parallelism

The data subcommand includes a parameter for --parallel. If invoked with this parameter, the reading of a sas7bdat will occur in parallel: if the total number of rows to process is greater than stream-rows (which defaults to 50,000 rows if unset), then each chunk of rows is read in parallel. Note that all processors on the user's machine are used with the --parallel option. In the future, the tool may allow the user to throttle this number.

Note that although reading is in parallel, writing is still sequential. Thus one should anticipate only moderate speed-ups, as much of the time is spent writing.

:heavy_exclamation_mark: Utilizing the --parallel parameter will increase memory usage — there will be multiple threads simultaneously reading chunks from the sas7bdat. In addition, because all processors are utilized, CPU usage will be maxed out during reading.

:warning: Also, note that utilizing the --parallel parameter will write rows out of order from the original sas7bdat.
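A hypothetical parallel invocation, combining the parameters described above (the `--output` flag name is an assumption):

```shell
# Read chunks of 100,000 rows in parallel; note rows may be written out of order
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.parquet --format parquet --parallel --stream-rows 100000
```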

Reader

The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.

  • mem → Parse and read the entire sas7bdat into memory before writing to either standard out or a file
  • stream (default) → Parse and read at most stream-rows into memory before writing to disk
    • stream-rows may be set via the command line parameter --stream-rows or if elided will default to 50,000 rows

Why is this useful?

  • mem is useful for testing purposes
  • stream is useful for keeping memory usage low for large datasets (and hence is the default)
  • In general, users should not need to deviate from the default — stream — unless they have a specific need
  • In addition, by enabling these options as command line parameters hyperfine may be used to benchmark across an assortment of file sizes
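The two readers might be invoked as follows (the `--output` flag name is an assumption):

```shell
# Read the entire sas7bdat into memory before writing
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.csv --reader mem

# Stream at most 20,000 rows into memory at a time (stream is the default reader)
readstat data /some/dir/to/file.sas7bdat --output /some/dir/to/file.csv --reader stream --stream-rows 20000
```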

Debug

Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.

:warning: This is quite verbose! If using the preview or data subcommand, debug information is written for every single value!
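For example, to enable debug logging for a single invocation:

```shell
# Set RUST_LOG for just this call to readstat
RUST_LOG=debug readstat metadata /some/dir/to/file.sas7bdat
```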

Help

For full details run with --help.

Floating Point Truncation

:warning: Decimal values are truncated to contain only 14 decimal digits!

For example, the number 1.123456789012345 created within SAS would be returned as 1.12345678901234 within Rust.

Why does this happen? Is this an implementation error? No, truncation to only 14 decimal digits has been purposely implemented within the Rust code.

As a specific example, when testing with the cars.sas7bdat dataset (which was created originally on Windows), the numeric value 4.6 as observed within SAS was being returned as 4.6000000000000005 (16 digits) within Rust. Values created on Windows with an x64 processor are only accurate to 15 digits.

Only utilizing 14 decimal digits mirrors the approach of the ReadStat binary when writing to csv.

Finally, SAS represents all numeric values in floating-point representation which creates a challenge for all parsed numerics!

Sources

  • How SAS Stores Numeric Values
  • Accuracy on x64 Windows Processors
    • SAS on Windows with x64 processors can only represent 15 digits
  • Floating-point arithmetic may give inaccurate results in Excel

Date, Time, and Datetimes

Currently, values in any of the following SAS formats are parsed and read as dates, times, or datetimes.

  • Dates
    • DATEw.
    • DDMMYYw.
    • DDMMYYxw.
    • MMDDYYw.
    • MMDDYYxw.
    • YYMMDDw.
    • YYMMDDxw.
  • Times
    • TIMEw.d
  • Datetimes
    • DATETIMEw.d

:warning: If the format does not match one of the above SAS formats, or if the value does not have a format applied, then the value will be parsed and read as a numeric value!

Details

SAS stores dates, times, and datetimes internally as numeric values. To distinguish among dates, times, datetimes, or numeric values, a SAS format is read from the variable metadata. If the format matches one of the above SAS formats then the numeric value is converted and read into memory using one of the Arrow types:

  • Date32Type
  • Time32SecondType
  • TimestampSecondType

If values are read into memory as Arrow date, time, or datetime types, then when they are serialized (from an Arrow record batch to csv, feather, ndjson, or parquet) they are treated as dates, times, or datetimes and not as numeric values.

Finally, more work is planned to handle other SAS dates, times, and datetimes that have SAS formats other than those listed above.

Testing

To perform unit / integration tests, run the following within the readstat directory.
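Assuming a standard Cargo setup, the invocation is:

```shell
# Run unit / integration tests from within the readstat directory
cargo test
```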

Datasets

Formally tested (via integration tests) against the following datasets. See the README.md for data sources.

  • ahs2019n.sas7bdat → US Census data
  • all_types.sas7bdat → SAS dataset containing all SAS types
  • cars.sas7bdat → SAS cars dataset
  • hasmissing.sas7bdat → SAS dataset containing missing values
  • intel.sas7bdat
  • messydata.sas7bdat
  • rand_ds.sas7bdat → Created using create_rand_ds.sas
  • rand_ds_largepage_err.sas7bdat → Created using create_rand_ds.sas with BUFSIZE set to 2M
  • rand_ds_largepage_ok.sas7bdat → Created using create_rand_ds.sas with BUFSIZE set to 1M
  • scientific_notation.sas7bdat → Used to test float parsing
  • somedata.sas7bdat
  • somemiss.sas7bdat

Valgrind

To ensure no memory leaks, valgrind may be utilized. For example, to ensure no memory leaks for the test parse_file_metadata_test, run the following from within the readstat directory.
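One possible invocation, building the test binary first and then running it under valgrind; the hashed binary name (shown as `<hash>`) differs per build and must be looked up in target/debug/deps.

```shell
# Build test binaries without running them
cargo test --no-run

# Run the compiled test binary under valgrind to check for leaks
valgrind --leak-check=full target/debug/deps/parse_file_metadata_test-<hash>
```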

Platform Support

  • :heavy_check_mark: Linux → successfully builds and runs
    • glibc
    • musl (using the jemalloc allocator)
  • :heavy_check_mark: macOS → successfully builds and runs
  • :heavy_check_mark: Windows → successfully builds and runs
    • As of ReadStat 1.1.5, able to build using MSVC in lieu of setting up an msys2 environment
    • Requires libclang in order to build as libclang is required by bindgen

Benchmarking

Benchmarking performed with hyperfine.

This example compares the performance of the Rust binary with that of the C binary built from the ReadStat repository. In general, the hope is that performance is fairly close to that of the C binary.

To run, execute the following from within the readstat directory.
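A sketch of such a comparison; the binary paths and flags below are assumptions, not taken from this document.

```shell
# Compare the Rust binary against the C readstat binary on the same dataset
hyperfine \
    "target/release/readstat data ../data/cars.sas7bdat --output /tmp/cars_rs.csv" \
    "readstat ../data/cars.sas7bdat /tmp/cars_c.csv"
```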

:memo: First experiments on Windows are challenging to interpret due to file caching. Further research is needed into utilizing the --prepare option provided by hyperfine on Windows.

Additional benchmarking may be performed if/when channels and threads are developed.

Profiling

Profiling performed with cargo flamegraph.

To run, execute the following from within the readstat directory.
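A possible invocation, assuming cargo-flamegraph is installed; the subcommand arguments and paths are illustrative.

```shell
# Profile a data conversion and emit a flamegraph
cargo flamegraph --bin readstat -- data ../data/cars.sas7bdat --output /tmp/cars.csv
```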

Flamegraph is written to readstat/flamegraph.svg.

:memo: Flamegraphs have yet to be utilized to improve performance.

Github Actions

Below is the rough git tag dance to delete and/or add tags to trigger Github Actions.
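A rough sketch of that dance; the tag name is an example, not a real release.

```shell
# Delete a tag locally and on the remote
git tag -d v0.1.0
git push origin :refs/tags/v0.1.0

# Recreate the tag and push it to retrigger the workflow
git tag -a v0.1.0 -m "v0.1.0"
git push origin v0.1.0
```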

Goals

Short Term

Short term, developing the command-line tool was a helpful exercise in binding to a C library using bindgen and the Rust FFI. It definitely required a review of C pointers (for which I claim no expertise)!

Long Term

The long term goals of this repository are uncertain. Possibilities include:

  • :heavy_check_mark: Developing a command line tool that performs transformations from sas7bdat to other file types
    • text
      • csv
      • ndjson
    • binary
      • feather
      • parquet
  • :heavy_check_mark: Developing a command line tool that expands the functionality made available by the readstat command line tool
  • Completing and publishing the readstat-sys crate that binds to ReadStat
  • Developing and publishing a Rust library — readstat — that allows Rust programmers to work with sas7bdat files
    • Implementing a custom serde data format for sas7bdat files (implement serialize first and deserialize later (if possible))

Resources

The following have been incredibly helpful while developing!

  • How to not RiiR
  • Making a *-sys crate
  • Rust Closures in FFI
  • Rust FFI: Microsoft Flight Simulator SDK
    • Part 1
    • Part 2
  • Stack Overflow answers by Jake Goulding
  • ReadStat pull request to add MSVC/Windows support
  • jamovi-readstat appveyor.yml file to build ReadStat on Windows
  • Arrow documentation for utilizing ArrayBuilders

Issues

Collection of the latest issues.

bug

Create a SAS example of a datetime with milliseconds — using format datetime22.3 — and verify that the Arrow data type is arrow::datatypes::DataType::Timestamp(arrow::datatypes::TimeUnit::Millisecond, None).

As an example, work with the all_types.sas7bdat dataset.

documentation

Move development content to wiki instead of keeping in main README

enhancement

All numeric values in SAS are floats. However, is there a way to coerce floats into other, smaller types? For example, if a column in SAS only has 1/0 then it can be represented as an integer and the extra precision of a float is not needed.

One idea: scan all values as they are read, maintaining metadata for each column to determine whether storage as a float is actually needed. This would perhaps work when reading the whole file into memory but would be challenging to implement if a file is streamed.


Revert the following section in the Github Actions file — main.yml — once #139 is sorted out.

From

To

or alternatively...

enhancement

Limit columns parsed

Design considerations:

  • How to pass in list of columns to include
  • What to do if column is not in the list
  • Via the metadata subcommand, dump an appropriate data structure / file that lists all available columns
    • User can then cut down this full list to focus on just what is desired

enhancement

Expand metadata captured:

  • file encoding
  • row count
    • may need error handling (check if < 0)
  • column count
    • may need error handling (check if < 0)
  • column names
  • column types
  • column labels
  • column storage width
    • could query and use something like lazy_static to create an appropriately sized Vec at runtime?
  • column display width
  • file label
  • encoding
  • file format version
  • creation time
  • modified time
  • compression
  • endianness

documentation

Note how iconv and zlib are linked against for the various platforms.

