Psy-Fer/ruSTAR

ruSTAR

A Rust reimplementation of STAR (Spliced Transcripts Alignment to a Reference), the widely used RNA-seq aligner originally written in C++ by Alexander Dobin.

Overview

ruSTAR aims to be a faithful port of STAR, matching the original behavior as closely as possible. It uses the same genome index format, accepts the same --camelCase command-line parameters, and produces compatible SAM/BAM output.

Current status: end-to-end single-end and paired-end RNA-seq alignment works, including splice junction detection, two-pass mode, chimeric alignment detection, and multi-threaded parallel processing. 268 tests pass.

Quick Start

Build

cargo build --release

Generate genome index

target/release/ruSTAR --runMode genomeGenerate \
  --genomeDir /path/to/genome_index \
  --genomeFastaFiles /path/to/genome.fa

Align reads

target/release/ruSTAR \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads.fq \
  --outSAMtype SAM \
  --outSAMstrandField intronMotif \
  --outFileNamePrefix /path/to/output_

Paired-end alignment

target/release/ruSTAR \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads_1.fq reads_2.fq \
  --outSAMtype SAM \
  --outFileNamePrefix /path/to/output_

BAM output

target/release/ruSTAR \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads.fq \
  --outSAMtype BAM Unsorted \
  --outFileNamePrefix /path/to/output_

Two-pass mode

target/release/ruSTAR \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads.fq \
  --twopassMode Basic \
  --outFileNamePrefix /path/to/output_

Accuracy Comparison vs STAR

Benchmarked on 10,000 yeast RNA-seq reads (150 bp SE, ERR12389696), compared to STAR 2.7.x with identical parameters and genome index.

Single-End Alignment Rates

| Metric | ruSTAR | STAR |
|---|---|---|
| Unique mapped | 92.6% | 92.6% |
| Multi-mapped | 7.4% | 7.4% |
| Soft-clipped reads | 26.0% | 26.0% |
| Splice rate | 2.2% | 2.2% |
| Shared splice junctions | 67 / 72 STAR junctions | |
| Motif agreement (shared junctions) | 100% | |
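
The shared-junction and motif-agreement rows can be checked by intersecting the SJ.out.tab files written by the two tools. The sketch below assumes STAR's documented tab-separated layout (column 1 chromosome, columns 2 and 3 the 1-based intron start and end, column 5 the intron-motif code). The names `parse_sj_tab` and `compare_junctions` are illustrative and are not part of ruSTAR.

```rust
use std::collections::HashMap;

/// Key identifying an intron: (chromosome, 1-based start, 1-based end).
type JunctionKey = (String, u64, u64);

/// Parse SJ.out.tab text into a map from junction key to intron-motif code
/// (column 5). Lines with fewer than five columns or with unparsable
/// numbers are skipped.
fn parse_sj_tab(text: &str) -> HashMap<JunctionKey, u8> {
    let mut junctions = HashMap::new();
    for line in text.lines() {
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() < 5 {
            continue;
        }
        let (Ok(start), Ok(end), Ok(motif)) = (
            cols[1].parse::<u64>(),
            cols[2].parse::<u64>(),
            cols[4].parse::<u8>(),
        ) else {
            continue;
        };
        junctions.insert((cols[0].to_string(), start, end), motif);
    }
    junctions
}

/// Returns (junctions present in both sets, shared junctions whose motif
/// codes also agree). Iterates over the reference set, so the first number
/// reads as "N of the reference junctions were also found".
fn compare_junctions(
    ours: &HashMap<JunctionKey, u8>,
    reference: &HashMap<JunctionKey, u8>,
) -> (usize, usize) {
    let mut shared = 0;
    let mut same_motif = 0;
    for (key, ref_motif) in reference {
        if let Some(our_motif) = ours.get(key) {
            shared += 1;
            if our_motif == ref_motif {
                same_motif += 1;
            }
        }
    }
    (shared, same_motif)
}
```

Dividing the first returned number by the size of the reference map gives a ratio in the same form as the "67 / 72" row; dividing the second by the first gives the motif agreement.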

Strict Per-Read Comparison (SE)

A read counts as a match only if both tools align it to the same chromosome and the same start position with identical splice junctions (intron coordinates). A difference in any one of these is a mismatch.
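
A minimal sketch of this rule, assuming each alignment is reduced to its SAM chromosome, 1-based start position, and CIGAR string. Introns are taken from the CIGAR `N` operations. The `Alignment` struct, the `Verdict` enum, and the function names are hypothetical and are not taken from the ruSTAR comparison scripts.

```rust
/// Minimal view of one aligned read: only the fields the comparison needs.
struct Alignment<'a> {
    chrom: &'a str,
    pos: u64, // 1-based leftmost mapped position (SAM POS)
    cigar: &'a str,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    ExactMatch,  // chr + pos + CIGAR identical
    SpliceMatch, // chr + pos + introns identical, CIGAR differs
    Mismatch,
}

/// Walk the CIGAR and return the 1-based inclusive (start, end) of every
/// intron, i.e. every `N` operation.
fn introns(pos: u64, cigar: &str) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut ref_pos = pos;
    let mut len: u64 = 0;
    for ch in cigar.chars() {
        if let Some(d) = ch.to_digit(10) {
            len = len * 10 + u64::from(d);
            continue;
        }
        match ch {
            'N' => {
                out.push((ref_pos, ref_pos + len - 1));
                ref_pos += len;
            }
            // These operations consume reference bases.
            'M' | 'D' | '=' | 'X' => ref_pos += len,
            // I, S, H and P do not advance along the reference.
            _ => {}
        }
        len = 0;
    }
    out
}

fn classify(a: &Alignment<'_>, b: &Alignment<'_>) -> Verdict {
    if a.chrom != b.chrom || a.pos != b.pos {
        return Verdict::Mismatch;
    }
    if a.cigar == b.cigar {
        Verdict::ExactMatch
    } else if introns(a.pos, a.cigar) == introns(b.pos, b.cigar) {
        Verdict::SpliceMatch
    } else {
        Verdict::Mismatch
    }
}
```

Under this sketch, two alignments that differ only in trailing soft-clipping (for example `50M100N100M` versus `50M100N98M2S` at the same position) are a splice match, not an exact match.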

| Result | Count | % |
|---|---|---|
| Exact match (chr + pos + CIGAR identical) | 8799 | 98.57% |
| Splice match (chr + pos + introns match, CIGAR differs) | 1 | 0.01% |
| Total match | 8800 | 98.57% |
| Mismatch: unavoidable tie-breaking | 126 | 1.41% |
| Mismatch: fixable algorithm differences | 27 | 0.30% |
| Parity (excluding unavoidable ties) | 8800 / 8827 | 99.69% |
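
The parity figure sets the tie-breaking mismatches aside and divides the matched reads by the matched reads plus the fixable mismatches. A small sketch of that arithmetic, with a hypothetical function name:

```rust
/// Percentage of reads that match once tie-breaking mismatches are excluded
/// from the denominator.
fn parity_percent(matches: u32, fixable_mismatches: u32) -> f64 {
    100.0 * f64::from(matches) / f64::from(matches + fixable_mismatches)
}
```

With the table's values, `parity_percent(8800, 27)` evaluates 8800 / 8827 and formats to 99.69 at two decimal places.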

Mismatch Classification

| Category | Count | Fixable? |
|---|---|---|
| Diff chromosome, both multi-mapper (repeat copy tie-breaking) | 100 | No (same score, different copy chosen) |
| Same chr, identical CIGAR, different position (repeat copy tie-breaking) | ~19 | No (same score, different copy chosen) |
| Same chr + pos, different splice junctions | 4 | Partial |
| Same chr, STAR spliced / ruSTAR not (missed splice) | 1 | Yes |
| ruSTAR mapped, STAR unmapped (false splice) | 1 | Yes (adapter contamination → 279 kb intron) |
| MAPQ inflation / deflation | 7 | Partial |

Unavoidable ties (~119 reads): Both tools find the same set of equally-scored alignments but choose different ones as primary due to internal processing order. Neither alignment is more correct than the other.

STAR false splice (1 read): for ERR12389696.5825571, STAR creates a 607 kb intron from adapter-contaminated bases, which scores 2 points higher than the correct soft-clipped alignment. ruSTAR correctly soft-clips this read.

MAPQ Agreement (SE)

| Metric | Value |
|---|---|
| MAPQ agreement (position-matched reads) | 99.9% |
| MAPQ inflation (ruSTAR=255, STAR<255) | 5 reads |
| MAPQ deflation (ruSTAR<255, STAR=255) | 2 reads |
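
MAPQ inflation and deflation follow from how STAR derives MAPQ from the number of loci a read maps to. The STAR manual gives 255 for unique mappers, 3 for two loci, 1 for three or four loci, and 0 for five or more. The lookup below sketches that convention. The function name is hypothetical, and whether ruSTAR implements it as a lookup is an assumption.

```rust
/// MAPQ from the number of loci a read aligns to, following the default
/// values described in the STAR manual.
fn star_mapq(n_loci: u32) -> u8 {
    match n_loci {
        0 => 0, // no alignment; placeholder value for this sketch
        1 => 255,
        2 => 3,
        3 | 4 => 1,
        _ => 0,
    }
}
```

If one tool misses the second locus of a two-locus read, it reports `star_mapq(1)` where the other reports `star_mapq(2)`, which is the 255 versus below-255 pattern counted in the table.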

Paired-End (10k yeast read pairs, 150 bp)

| Metric | ruSTAR | STAR |
|---|---|---|
| Both mates mapped | 8383 (99.9%) | 8390 (100%) |
| Half-mapped pairs | 0 | 0 |
| Unmapped pairs | 0 | 0 |
| Per-mate position agreement | 98.3% | |
| Per-mate CIGAR agreement | 97.5% | |

7-pair gap vs STAR: ruSTAR uses STAR's combined-read PE path ([mate1_fwd][SPACER][RC(mate2)]), producing near-identical output. The remaining 7-pair difference stems from scoring edge cases.
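
The combined-read layout `[mate1_fwd][SPACER][RC(mate2)]` can be sketched as follows. The spacer byte and the function names are placeholders chosen for illustration. The actual spacer encoding in STAR and ruSTAR is not specified here, and the sketch handles uppercase ACGT only.

```rust
/// Hypothetical spacer marker separating the two mates in the combined read.
const SPACER: u8 = b'|';

/// Reverse-complement an uppercase DNA sequence given as ASCII bytes.
/// Any byte other than A, C, G or T becomes N.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Build mate 1 (forward), the spacer, then the reverse complement of
/// mate 2, so both mates lie on the same strand of one sequence.
fn combine_mates(mate1: &[u8], mate2: &[u8]) -> Vec<u8> {
    let mut combined = Vec::with_capacity(mate1.len() + 1 + mate2.len());
    combined.extend_from_slice(mate1);
    combined.push(SPACER);
    combined.extend(reverse_complement(mate2));
    combined
}
```

Aligning the pair as one sequence lets the same seed search and scoring code serve single-end and paired-end reads, with the spacer marking where the unsequenced insert lies.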

Supported Features

  • Single-end and paired-end alignment with mate rescue
  • SAM and unsorted BAM output (--outSAMtype SAM or BAM Unsorted)
  • Multi-threaded parallel alignment (--runThreadN)
  • GTF-based junction annotation with scoring bonus (--sjdbGTFfile)
  • Two-pass mode for novel junction discovery (--twopassMode Basic)
  • Chimeric alignment detection for single-end reads (--chimSegmentMin)
  • Post-alignment read filtering (--outFilterType BySJout)
  • Splice junction output (SJ.out.tab)
  • Gzip-compressed FASTQ input (--readFilesCommand zcat)
  • SAM optional tags: NH, HI, AS, NM, nM, XS, jM, jI, MD
  • --outSAMattributes control (Standard/All/None/explicit)
  • SECONDARY flag (0x100) on multi-mapper alignments
  • Configurable output limits (--outSAMmultNmax)
  • Bidirectional seed search (L-to-R and R-to-L)
  • Junction boundary optimization (jR scanning)
  • Deterministic output (identical SAM across runs)
  • Log.final.out statistics file (STAR-compatible, MultiQC-parseable)

Known Limitations

  • No coordinate-sorted BAM output (use samtools sort post-alignment)
  • No paired-end chimeric detection
  • No --quantMode GeneCounts
  • No --outReadsUnmapped Fastx
  • No --outStd SAM/BAM (stdout output)
  • Residual MAPQ inflation (5 reads in the 10k SE benchmark), caused by missed splice/indel secondary alignments
  • No STARsolo single-cell features

See ROADMAP.md for detailed implementation tracking.

Building from Source

Requires Rust 2024 edition (rustc 1.85+).

cargo build --release    # Release build
cargo test               # Run tests
cargo clippy             # Lint
cargo fmt                # Format

Development

The majority of ruSTAR's code was written by Claude Code (Anthropic's AI coding assistant), with technical direction, architecture decisions, and validation by the project maintainer.

License

MIT (matching the original STAR license)
