HLA Vignette

hextraza edited this page Mar 20, 2025 · 2 revisions

Aligning RNA-seq data to the IMGT HLA database using nimble

This vignette demonstrates how to create a nimble-compatible reference library from the IMGT HLA database and use it to align paired-end read data. We will:

  1. Download the IMGT HLA combined nucleotide data
  2. Generate a corresponding nimble-compatible CSV file containing locus metadata
  3. Create a nimble library using nimble generate
  4. Tighten the library's alignment parameters for HLA
  5. Align our .fastq.gz files

This vignette assumes that you are on macOS or Linux, but it should be adaptable to Windows as well.

Step 1: Download the IMGT HLA combined .fasta file

An easy way to get access to the reference sequence data we need is via the IMGTHLA GitHub repo. We're only interested in the combined HLA nucleotide sequence file:

wget https://raw.githubusercontent.com/ANHIG/IMGTHLA/refs/heads/Latest/fasta/hla_nuc.fasta

This is a sequence file representing every HLA allele in the IMGT database. At this point, we could run nimble generate to create the nimble library and align our data. However, the results would contain counts for each allele. While this is desirable in some cases, nimble can also bin counts based on provided metadata, which lets us output per-locus data instead. The remaining steps demonstrate this feature.
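
To make the idea of binning concrete, here is a minimal Python sketch of collapsing per-allele hits into per-locus counts. It illustrates the concept only and is not nimble's implementation. The function name, the toy identifiers, and the rule of skipping reads whose hits span more than one locus are all assumptions made for this example.

```python
from collections import Counter


def bin_reads_by_locus(read_hits, allele_to_locus):
    """Count reads per locus, keeping only reads whose hits share one locus.

    read_hits: iterable of sets of allele names matched by each read.
    allele_to_locus: dict mapping allele name -> locus (the metadata).
    Returns (Counter of reads per locus, number of reads spanning several loci).
    """
    counts = Counter()
    skipped = 0
    for hits in read_hits:
        if not hits:
            continue  # read matched nothing, so there is nothing to bin
        loci = {allele_to_locus[name] for name in hits}
        if len(loci) == 1:
            counts[next(iter(loci))] += 1
        else:
            skipped += 1  # hits span multiple loci (assumed to be dropped)
    return counts, skipped


# Toy data: these identifiers are made up for illustration.
allele_to_locus = {"allele_1": "A", "allele_2": "A", "allele_3": "B"}
read_hits = [
    {"allele_1", "allele_2"},  # ambiguous between alleles, unique at locus A
    {"allele_3"},              # unique to locus B
    {"allele_2", "allele_3"},  # spans loci A and B, so it is skipped
]
locus_counts, skipped = bin_reads_by_locus(read_hits, allele_to_locus)
print(dict(locus_counts), skipped)
```

The first read cannot be assigned to a single allele, but it can be assigned to a single locus, which is why per-locus output tends to retain more reads than per-allele output.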

Step 2: Create metadata CSV

We need a .csv containing locus metadata derived from the HLA .fasta, which we will generate with a script. A nimble library metadata .csv must contain a name column, which nimble uses to assign the metadata to each allele. We will also add a locus column containing our locus data. Here's an example Python script to do this (it requires Biopython for Bio.SeqIO):

import csv
from Bio import SeqIO

# Input and output file paths
fasta_file = "hla_nuc.fasta"
output_csv = "hla_metadata.csv"

# Prepare metadata rows
metadata_rows = [["name", "locus"]]

with open(fasta_file, "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        # Headers look like: HLA:HLA00001 A*01:01:01:01 1098 bp
        header_parts = record.description.split()
        seq_name = header_parts[0]  # Accession, e.g. HLA:HLA00001
        locus = header_parts[1].split("*")[0]  # Text before "*" in the allele name, e.g. A
        metadata_rows.append([seq_name, locus])

# Write metadata to CSV
with open(output_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(metadata_rows)

print(f"Metadata CSV saved as {output_csv}")
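
Before building the library, it can be worth sanity-checking the generated CSV. The sketch below is an optional helper, not part of nimble. It confirms that the name and locus columns exist, rejects rows with empty values, and tallies alleles per locus. The function name summarize_loci and the sample identifiers are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_loci(lines):
    """Return a Counter of alleles per locus from metadata CSV lines.

    lines: an open text file or any iterable of CSV lines.
    Raises ValueError if a required column is missing or a row has an
    empty name or locus.
    """
    reader = csv.DictReader(lines)
    missing = {"name", "locus"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts = Counter()
    for row_number, row in enumerate(reader, start=1):
        if not row["name"] or not row["locus"]:
            raise ValueError(f"empty name or locus in data row {row_number}")
        counts[row["locus"]] += 1
    return counts


# Demo on an in-memory sample with placeholder identifiers.
sample = io.StringIO("name,locus\nid_1,A\nid_2,A\nid_3,DRB1\n")
sample_counts = summarize_loci(sample)
print(sample_counts.most_common())
```

To check the real file, open hla_metadata.csv with open("hla_metadata.csv", newline="") and pass the file handle to summarize_loci.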

Step 3: Generate the nimble library

With hla_nuc.fasta and hla_metadata.csv ready, we can now generate a nimble library:

python -m nimble generate --file hla_nuc.fasta --opt-file hla_metadata.csv --output_path hla_library.json

Step 4: Configure alignment parameters for HLA

HLA is a complex region of the genome characterized by high sequence similarity. With this many similar features to differentiate during alignment, we have found that data quality improves when nimble is configured to be stringent about what it considers a passing alignment. The default parameters in the library created by nimble generate are lenient, so we'll need to modify that file before we continue. If you open it in a text editor, you'll see that it's a standard .json file in the nimble library format. You should see a block that looks similar to:

{
  "score_threshold": 60,
  "score_percent": 0.5,
  "num_mismatches": 0,
  "discard_multiple_matches": false,
  "intersect_level": 0,
  "group_on": "",
  "discard_multi_hits": 0,
  "require_valid_pair": false,
  "max_hits_to_report": 10,
  "trim_target_length": 50,
  "trim_strictness": 0.9
}

These are the aligner parameters for this library. The library is currently configured to accept any alignment with a score >= 50% of the read length (a score_percent of 0.5), which is probably insufficient for differentiating HLA alleles and would likely lead to a lot of ambiguity in the final output. To fix this, raise score_percent to 0.99. Also change group_on from an empty string to "locus". This parameter controls how nimble bins the count data. If we leave it empty, the output will report the number of reads that aligned to each HLA allele. If we set it to "locus", the output will instead report the number of reads that uniquely aligned to each HLA locus.
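
If you would rather script this change than edit the file by hand, the following Python sketch is one way to do it. It assumes only that the library is plain JSON and that the parameter block is the single object containing a score_percent key. The function names and the demo structure are hypothetical, and the layout of the rest of the file is not assumed.

```python
import json


def tighten_alignment_config(node, score_percent=0.99, group_on="locus"):
    """Recursively update every dict that has a "score_percent" key.

    Returns the number of parameter blocks changed.
    """
    changed = 0
    if isinstance(node, dict):
        if "score_percent" in node:
            node["score_percent"] = score_percent
            node["group_on"] = group_on
            changed += 1
        for value in node.values():
            changed += tighten_alignment_config(value, score_percent, group_on)
    elif isinstance(node, list):
        for item in node:
            changed += tighten_alignment_config(item, score_percent, group_on)
    return changed


def update_library_file(path, **kwargs):
    """Load a library file, tighten its parameters, and write it back."""
    with open(path) as f:
        library = json.load(f)
    changed = tighten_alignment_config(library, **kwargs)
    if changed != 1:
        # Refuse to write if the file does not look the way we assumed.
        raise ValueError(f"expected one parameter block, found {changed}")
    with open(path, "w") as f:
        json.dump(library, f)
    return changed


# Demo on a made-up structure; the real library layout may differ.
demo_library = [
    {"score_threshold": 60, "score_percent": 0.5, "group_on": ""},
    {"placeholder": []},
]
demo_changed = tighten_alignment_config(demo_library)
print(demo_changed, demo_library[0])
```

Calling update_library_file("hla_library.json") would apply the same change to the library from Step 3. Keep a copy of the original file first, since the function overwrites it in place.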

Step 5: Align your input data to the nimble library

We're now ready to align your input data. Depending on whether you're providing a single-read or paired-end .fastq.gz, or a .bam file, the parameters will vary slightly. See the usage guide and the CLI parameters for more information, but as an example, running paired-end .fastq.gz files looks like:

python -m nimble align -r hla_library.json -i input_R1.fastq.gz input_R2.fastq.gz -o results.tsv
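
When there are many samples, the same command can be assembled in a loop. The sketch below only builds and prints the commands, reusing the flags from the example above. The _R1.fastq.gz and _R2.fastq.gz suffix convention, the per-sample output naming, and the helper names are assumptions made for illustration.

```python
import shlex

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def pair_fastqs(filenames):
    """Return (sample, r1, r2) for every R1 file whose R2 mate is present."""
    available = set(filenames)
    pairs = []
    for r1 in sorted(available):
        if not r1.endswith(R1_SUFFIX):
            continue
        sample = r1[: -len(R1_SUFFIX)]
        r2 = sample + R2_SUFFIX
        if r2 in available:
            pairs.append((sample, r1, r2))
    return pairs


def build_align_command(library, r1, r2, output):
    """Build the paired-end align command as an argument list."""
    return ["python", "-m", "nimble", "align",
            "-r", library, "-i", r1, r2, "-o", output]


# Demo with placeholder file names; sampleB has no R2 mate and is skipped.
files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "sampleB_R1.fastq.gz"]
pairs = pair_fastqs(files)
commands = [
    build_align_command("hla_library.json", r1, r2, f"{sample}.tsv")
    for sample, r1, r2 in pairs
]
for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) to launch the alignment, so that a failed sample stops the batch instead of being silently skipped.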

Conclusion

You're done! We've downloaded a .fasta file, generated custom metadata so that nimble bins the counts in a way that fits our analysis, and configured nimble to align our data in a more biologically relevant fashion by modifying the reference library.
