Accelign - Fast GPU-accelerated sequence alignments

Accelign is a CUDA/C++ library providing GPU-accelerated sequence alignments to find the optimal alignment score between two sequences. We support:

  • Global alignments (score only)
  • Local alignments (score only, and score+traceback endpoints)
  • Semiglobal alignments (score only, and score+traceback endpoints)
  • Arbitrary alphabet sizes, especially 5 (DNA) and 25 (Protein)
  • One-to-one alignments using a substitution matrix
  • One-to-all alignments using a position specific scoring matrix (PSSM)

Publication

This work is presented in the following paper:

(preprint) Kallenborn, F., Dabbaghie, F., Steinegger, M., & Schmidt, B. (2025). Accelign: a GPU-based Library for Accelerating Pairwise Sequence Alignment. https://doi.org/10.64898/2025.12.17.694868

Hardware requirements

CUDA-capable GPU, preferably with Ampere architecture or newer. We have tested Accelign on sm_80 (A100), sm_86 (RTX 3090), sm_89 (RTX 4090, L40S), sm_90 (H100, GraceHopper), and sm_120 (RTX Pro 6000).

Software requirements

  • CUDA Toolkit >= 12.9. Older versions have not been tested. CUDA Toolkit 13 cannot be used to compile for Volta (sm_70) and older GPU architectures.
  • C++17 compiler supported by CUDA. In addition, certain CPU benchmarks may require a C++23 compiler.

Project structure

  • include/accelign: Contains our implementation and ready-to-use C++ alignment tuning configs derived from the tuning results
  • benchmarks: Contains our case-studies
  • examples: Contains fundamental usage examples of Accelign
  • tuning: Contains our performance tuning machinery
  • factory_generator: Contains scripts to generate a static library of aligners.

How to use

General remarks:

  • Accelign is header-only
  • Accelign classes use the accelign namespace.
  • We use the terms subject and query for the two sequences being aligned. The subject is the sequence along the vertical DP matrix axis; the query is oriented along the horizontal axis.
  • Scoring parameters are passed as signed values, not absolute values. For example, for a gap penalty of 5, pass -5 to Accelign.
  • When using affine gap penalties, a stretch of N consecutive gaps has a score of (gap_open_score + (N-1) * gap_extend_score). For example, with gap_open_score = -5 and gap_extend_score = -1, three consecutive gaps score -5 + 2 * (-1) = -7.
  • Sequence conversion and data transfers are the responsibility of the user.
  • We partition the DP matrix into one or more tiles based on the GPU thread configuration and query length. Processing alignments with more than one tile requires temporary storage.

Low-level API

The fundamental alignment API is implemented in global_alignment_api.cuh, local_alignment_api.cuh, and semiglobal_alignment_api.cuh. These contain template classes which can be configured for specific alignments.

template<int alphabetSize_, class ScoreType_, PenaltyType penaltyType_, class AlignCompConfig_>
class GlobalAlignment_scalarscore;

template<int alphabetSize_, class ScoreType_, PenaltyType penaltyType_, class AlignCompConfig_>
class GlobalAlignment_vec2score;

template<int alphabetSize_, class ScoreType_, PenaltyType penaltyType_, class AlignCompConfig_>
class LocalAlignment_scalarscore;

template<int alphabetSize_, class ScoreType_, PenaltyType penaltyType_, class AlignCompConfig_>
class LocalAlignment_vec2score;

template<int alphabetSize_, class ScoreType_, PenaltyType penaltyType_, class AlignCompConfig_>
class SemiglobalAlignment_scalarscore;

template<int alphabetSize_, class ScoreType_, PenaltyType penaltyType_, class AlignCompConfig_>
class SemiglobalAlignment_vec2score;

Alphabet size

Specifies the supported number of different input symbols, for example 5 for DNA/RNA sequences and 25 for proteins. Input sequences must be encoded in the range [0, alphabetSize-1], and it is the responsibility of the user to ensure that sequences are converted correctly. We provide a utility class GpuSequenceConverter to convert sequences residing in GPU memory.
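
As an illustration only (the concrete letter-to-value mapping below is our own choice, not one mandated by Accelign), host-side encoding of DNA sequences into the range [0, 4] could look like this:

#include <cstdint>
#include <string>
#include <vector>

// Map A,C,G,T to 0..3 and everything else (e.g. N) to 4.
// The mapping is arbitrary; it only has to be consistent with the
// substitution matrix / PSSM that is passed to the aligner.
std::vector<std::int8_t> encodeDna(const std::string& sequence){
    std::vector<std::int8_t> encoded(sequence.size());
    for(std::size_t i = 0; i < sequence.size(); i++){
        switch(sequence[i]){
            case 'A': case 'a': encoded[i] = 0; break;
            case 'C': case 'c': encoded[i] = 1; break;
            case 'G': case 'g': encoded[i] = 2; break;
            case 'T': case 't': encoded[i] = 3; break;
            default:            encoded[i] = 4; break;
        }
    }
    return encoded;
}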

PenaltyType

Enum value which can be "Linear" or "Affine" to select linear gap penalty or affine gap penalty, respectively.

ScoreType

Specifies the fundamental data type for alignment computation. For *_scalarscore classes, ScoreType must be float or int (32-bit computations). For *_vec2score classes, ScoreType must be half2 or short2 (16x2-bit computations). int and short2 make use of DPX instructions, which might be faster on supported hardware. Using half2 or short2 instead of float or int can be up to two times faster. However, keep in mind that these types have limited resolution, and overflow / underflow may occur. For example, the half2 score is only guaranteed to be exact if all values in the DP matrix are in the range [-2047, 2047]. Checking for overflow / underflow during computation would be slower than using the 32-bit variant. If you require exact scores for arbitrary inputs, please use float / int.

Note: For local alignments, it is possible to detect scores which may have overflowed in a post-processing step, and to correct them by recomputing the alignment in 32-bit precision. Let x be the final 16-bit alignment score, and b the largest entry in the substitution matrix. For all cases where x < 2048 (half2) or x+b <= 32767 (short2), x is the exact result.
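
A minimal sketch of such a post-processing check (the function names are ours; the thresholds follow the rule stated above):

// Returns true if the 16-bit local alignment score x is guaranteed to be exact.
// b is the largest entry of the substitution matrix.
bool isExactHalf2Score(int x){ return x < 2048; }
bool isExactShort2Score(int x, int b){ return x + b <= 32767; }

// Alignments for which these checks fail can be recomputed with the
// corresponding 32-bit (float / int) aligner to obtain exact scores.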

AlignCompConfig

AlignCompConfig is the most important parameter. It controls GPU thread arrangement, supported sequence length, and determines if a substitution matrix or position-specific scoring matrix should be used.

template<int blocksize_, int groupsize_, int numItems_, class SingleLayoutConfig>
struct AlignCompConfigSubstmat;

template<int blocksize_, int groupsize_, int numItems_, class SinglePssmScoreType>
struct AlignCompConfigPssm;

Layout

blocksize gives the number of threads per block. Each pairwise alignment is computed cooperatively by groupsize threads. The DP matrix tile size is T = groupsize * numItems. The tile size gives the maximum query sequence length which can be aligned without the use of temporary storage. For a fixed subject length, if the performance of processing tile size T is X, processing a query of length L < T will give at most (L/T) * X performance. For example, with T = 256 and an average query length of 128, at most half of the achievable performance is reached. Thus, for best performance without temporary storage, the tile size should be close to the average expected query length.

groupsize must be 4, 8, 16, or 32. Any numItems > 0 is fine. However, we recommend limiting numItems to 32 or 64 for alignments with affine gap penalty and linear gap penalty, respectively. Larger values could lead to excessive register spilling, limiting performance.

Substitution matrix

We expect a substitution matrix with alphabetsize rows and alphabetsize columns. substitutionmatrix[s][q] gives the substitution score between a letter s belonging to the subject and a letter q belonging to the query.

SingleLayoutConfig must be of type SubstMatConfig. It specifies internal substitution matrix layout in GPU memory and its element type. Allowed element types depend on the layout.

template<SubstMatLayout layout_, class ElementType_>
struct SubstMatConfig;

The most straightforward SubstMatConfigs are SubstMatConfig<SubstMatLayout::oneXone_scalar, float (or int)> for 32-bit alignments, SubstMatConfig<SubstMatLayout::twoXtwo_vec2, half2 (or short2)> for 16x2-bit one-to-one alignments, and SubstMatConfig<SubstMatLayout::twoXone_vec2, half2 (or short2)> for 16x2-bit one-to-all alignments.

The full internal substitution matrix must fit into shared memory. Different layout and element type combinations may have different trade-offs in terms of shared memory requirements and performance. Please take a look at our tuning program in tuning/tuning_substmat.cu to see the full list of supported combinations.

PSSM

A PSSM can only be used for one-to-all alignments. Specifically, a single query sequence given via PSSM is aligned to all subject sequences. We expect the PSSM to have alphabetsize rows and query length columns. The substitution score between a subject letter s and the query letter at position i is stored in PSSM[s][i].

Internally, the GPU PSSM uses the SinglePssmScoreType from AlignCompConfigPssm as element type. For 32-bit alignments, it should be a scalar datatype convertible to the score type, e.g. choose SinglePssmScoreType = float or half for 32-bit float alignments. The GPU PSSM for 32-bit alignments has O(alphabetsize) rows and O(groupsize * numItems) columns.

For 16x2-bit alignments, we expect either a vector datatype or a scalar type. For example, for half2 alignments, it would be valid to set SinglePssmScoreType = half2, __nv_fp8x2_e4m3, or half. With a vector type, the GPU PSSM has O(alphabetsize^2) rows and O(groupsize * numItems) columns. With a scalar type, only O(alphabetsize) rows are used, which reduces shared memory requirements. However, in this case the number of accesses to the PSSM increases, which decreases performance. Using a scalar PSSM type for 16x2-bit alignments is often necessary for larger alphabets like the amino acid alphabet, since the PSSM required for the current DP matrix tile must fit into shared memory.
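
For illustration, a 16x2-bit, PSSM-based local aligner for the protein alphabet with a scalar PSSM element type could be declared as follows (a sketch; the block size, group size, and numItems below are arbitrary, untuned values):

using GpuAlignerPssmHalf2 = accelign::LocalAlignment_vec2score<
    25,                              // alphabet size (protein)
    half2,                           // 16x2-bit score type
    accelign::PenaltyType::Affine,
    accelign::AlignCompConfigPssm<
        512,                         // blocksize
        8,                           // groupsize
        32,                          // numItems -> tile size 8 * 32 = 256
        half                         // scalar SinglePssmScoreType to reduce shared memory usage
    >
>;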


A concrete substitution-matrix-based aligner class could be of the following type. It computes global alignments with linear gap penalty, using 32-bit computations. The alphabet size is 5, and the DP matrix tile size is 16 * 16 = 256.

using GpuAlignerSubstmatFloat = accelign::GlobalAlignment_scalarscore<
    5,                      // alphabet size
    float,                  // 32-bit score type
    PenaltyType::Linear,
    accelign::AlignCompConfigSubstmat<
        512,                // blocksize
        16,                 // groupsize
        16,                 // numItems -> tile size 16 * 16 = 256
        accelign::SubstMatConfig<accelign::SubstMatLayout::oneXone_scalar, float>
    >
>;

Creating a GPU substitution matrix

// Storage for the converted GPU substitution matrix
accelign::ConvertedGpuSubstitutionMatrix<alphabetSize, float> gpuSubstitutionMatrixFloat;
// 2D view of the alphabetSize x alphabetSize integer substitution matrix
cuda::std::mdspan<const int, cuda::std::dextents<int, 2>> substitutionMatrixView(substitutionMatrix.data(), alphabetSize, alphabetSize);
// Convert the matrix into the aligner's internal GPU layout (uses the given CUDA stream)
gpuAlignerFloat.makeGpuSubstmat(
    gpuSubstitutionMatrixFloat,
    substitutionMatrixView,
    stream
);

Creating a GPU PSSM

// Storage for the converted GPU PSSM
accelign::ConvertedGpuPSSM<alphabetSize, float, penaltyType> gpuPssmFloat;
// 2D view of the PSSM with alphabetSize rows and queryLength columns
cuda::std::mdspan<int, cuda::std::dextents<int, 2>> pssmView(h_pssm.data(), alphabetSize, queryLength);
// Convert the PSSM into the aligner's internal GPU layout (uses the given CUDA stream)
gpuAlignerPssmFloat.makeGpuPssm(
    gpuPssmFloat,
    pssmView,
    stream
);

Launching the alignments

Accelign computes batches of alignments. Sequence input is given via structs with specific public members. We provide structs to specify the input data for the different alignment types. Custom input structs are possible as long as they provide the same public members.

!!! ATTENTION. All data pointers must point to GPU-accessible memory. Data transfers are the responsibility of the user !!!

!!! ATTENTION. This is the last chance to correctly encode the sequence data. Encoding is the responsibility of the user !!!

In our input structs, we use pointers + offsets to describe a collection of sequences. We require the base pointer P to a contiguous section of memory. The starting pointer of the i-th query or subject is P + offsets[i]. The sequence length is lengths[i].
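
For illustration, such a concatenated buffer with offsets and lengths could be built from a collection of already-encoded host sequences, here called encodedSequences, like this (a sketch; the element types of the offset and length arrays are our assumption):

#include <thrust/device_vector.h>
#include <cstdint>
#include <vector>

// Concatenate the encoded host sequences and record begin offsets and lengths,
// then copy everything to the GPU. The i-th sequence starts at base pointer + offsets[i].
std::vector<std::int8_t> h_flat;
std::vector<std::size_t> h_offsets;
std::vector<std::size_t> h_lengths;
for(const std::vector<std::int8_t>& seq : encodedSequences){
    h_offsets.push_back(h_flat.size());
    h_lengths.push_back(seq.size());
    h_flat.insert(h_flat.end(), seq.begin(), seq.end());
}
thrust::device_vector<std::int8_t> d_flat(h_flat.begin(), h_flat.end());
thrust::device_vector<std::size_t> d_offsets(h_offsets.begin(), h_offsets.end());
thrust::device_vector<std::size_t> d_lengths(h_lengths.begin(), h_lengths.end());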

The input struct for one-to-one alignments with substitution matrix is set up like this:

accelign::OneToOneInputDataSubstMat inputData;
inputData.d_subjects = (const std::int8_t*)d_subjects.data().get();
inputData.d_subjectOffsets = d_subjectBeginOffsets.data().get();
inputData.d_subjectLengths = d_subjectLengths.data().get();
inputData.d_queries = (const std::int8_t*)d_queries.data().get();
inputData.d_queryOffsets = d_queryBeginOffsets.data().get();
inputData.d_queryLengths = d_queryLengths.data().get();
inputData.numAlignments = numAlignments;
inputData.maximumSubjectLength = maximumSubjectLength;
inputData.maximumQueryLength = maximumQueryLength;

After all data is correctly set up, the alignment function corresponding to your intended use-case can be called. Keep in mind that you might have to allocate temporary storage.

 // The gap scores
accelign::GapScores gapInfo;
gapInfo.gapscore = -4; //for linear gap penalty
gapInfo.gapopenscore = -5; //for affine gap penalty
gapInfo.gapextendscore = -1; //for affine gap penalty

if(gpuAlignerFloat.isSingleTile(inputData.maximumQueryLength)){
    gpuAlignerFloat.substmat_scoreOnly_oneToOne_singleTile(
        d_scoreOutput.data().get(),
        gpuSubstitutionMatrixFloat,
        inputData,
        gapInfo,
        stream
    );
}else{
    //Query temporary storage size and allocate it. Allocating minimumSuggestedTempBytes ensures that at least 1 thread block per SM can be scheduled.

    const size_t minTempBytes = gpuAlignerFloat.getMinimumSuggestedTempBytes_multiTile(inputData.maximumSubjectLength);
    thrust::device_vector<char> d_temp(minTempBytes);

    gpuAlignerFloat.substmat_scoreOnly_oneToOne_multiTile(
        d_temp.data().get(),
        d_temp.size(),
        d_scoreOutput.data().get(),
        gpuSubstitutionMatrixFloat,
        inputData,
        gapInfo,
        stream
    );
}

In general, an aligner provides the following methods:

makeGpuPssm
makeGpuSubstmat
getMinimumSuggestedTempBytes_multiTile
pssm_scoreOnly_oneToAll_singleTile
pssm_scoreOnly_oneToAll_multiTile
substmat_scoreOnly_oneToAll_singleTile
substmat_scoreOnly_oneToAll_multiTile
substmat_endPosition_oneToAll_singleTile (not for 16x2-bit score types or global alignments)
substmat_endPosition_oneToAll_multiTile (not for 16x2-bit score types or global alignments)
substmat_startPosition_oneToAll_singleTile (not for 16x2-bit score types or global alignments)
substmat_startPosition_oneToAll_multiTile (not for 16x2-bit score types or global alignments)
substmat_scoreOnly_oneToOne_singleTile
substmat_scoreOnly_oneToOne_multiTile
substmat_endPosition_oneToOne_singleTile (not for 16x2-bit score types or global alignments)
substmat_endPosition_oneToOne_multiTile (not for 16x2-bit score types or global alignments)
substmat_startPosition_oneToOne_singleTile (not for 16x2-bit score types or global alignments)
substmat_startPosition_oneToOne_multiTile (not for 16x2-bit score types or global alignments)

Multi-config alignments

The struct GpuMultiConfigAligner can be used as a collection of aligner instances of the same Aligner template class but with different tile sizes.

template<
    template<int,class,PenaltyType,class> class Aligner, 
    int alphabetSize_, 
    class ScoreType_, 
    PenaltyType penaltyType_, 
    class ListOfTypesSingleTile,
    class ListOfTypesMultiTile
>
struct GpuMultiConfigAligner

ListOfTypesSingleTile and ListOfTypesMultiTile must be tuples containing the AlignCompConfigs to instantiate. Each tuple must be sorted in ascending order by tile size, and no tile size may appear multiple times in the same tuple. Substitution matrix configs and PSSM configs must not be mixed. One of the tuples is allowed to be empty.

Given a maximum query length, a linear search is performed to find the best-fitting single-tile tile size. If the query length is greater than the largest single-tile tile size, a heuristic is used to select one of the available multi-tile tile sizes.

For example, the following aligner class provides single-tile tile sizes 128 and 256. Longer queries are processed using a tile size of 256.

accelign::GpuMultiConfigAligner<
    accelign::GlobalAlignment_vec2score,
    5,
    half2,
    accelign::PenaltyType::Linear,
    std::tuple<
        AlignCompConfigPssm<512,4,32,half2>,
        AlignCompConfigPssm<512,8,32,half2>
    >,
    std::tuple<
        AlignCompConfigPssm<512,8,32,half2>
    >
>

GpuMultiConfigAligner provides the same methods and restrictions as the single config aligners.

Hardware considerations

The 16x2-bit score type half2 is supported in hardware for sm_80 and newer. By default, our kernels return immediately without computations for GPUs < sm_80. This can be prevented by defining the macro COMPILE_HALF2_FOR_ARCHS_WITHOUT_HALF2_MATH during compilation.

The 16x2-bit score type short2 is supported in hardware via DPX instructions for sm_90 and newer. By default, our kernels return immediately without computations for GPUs < sm_90. This can be prevented by defining the macro COMPILE_SHORT2_FOR_ARCHS_WITHOUT_DPX during compilation.

The 32-bit score type float is supported in hardware by all current and previous GPU architectures. The 32-bit score type int makes use of hardware DPX instructions for GPUs >= sm_90 and will fall back to software emulation on older GPUs.

Stream-ordered memory allocation

Accelign performs device memory allocations for the construction of the GPU substitution matrix / PSSM. Currently, a standard thrust::device_vector is used for such allocations by default. However, for more flexible, stream-ordered allocations, we also provide an optional implementation using RMM's rmm::device_uvector. It will use the rmm::mr::get_current_device_resource() memory resource for allocations. To enable the RMM implementation instead of the Thrust implementation, compile with -DUSE_RMM_FOR_GPU_SUBSTMAT -DUSE_RMM_FOR_GPU_PSSM and specify RMM's include path (and linker options, for recent versions of RMM). Note that an appropriate RMM memory resource should be set at runtime.
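
A possible runtime setup (a sketch using standard RMM types, not Accelign-specific ones; the pool size is arbitrary) could look like this:

#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// Build a pooled memory resource on top of plain cudaMalloc/cudaFree and make it the
// current device resource, so that Accelign's rmm::device_uvector allocations use it.
rmm::mr::cuda_memory_resource cuda_mr;
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr, 1ull << 28 /* 256 MiB initial pool */};
rmm::mr::set_current_device_resource(&pool_mr);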

High-level API

We provide the C++ interfaces GpuOneToOneAlignerI, GpuOneToOneStartEndPosAlignerI, GpuOneToAllAlignerI for score-only alignments using substitution matrix, score + traceback endpoints alignments using substitution matrix, and score-only alignments using a PSSM, respectively.

These are implemented by our classes

template<
    int cudaArch_,
    int alphabetSize_,
    class ScoreType_, 
    PenaltyType penaltyType_,
    AlignmentType alignmentType_
>
class GpuOneToOneAligner : public GpuOneToOneAlignerI

template<
    int cudaArch_,
    int alphabetSize_,
    class ScoreType_, 
    PenaltyType penaltyType_,
    AlignmentType alignmentType_
>
class GpuOneToOneStartEndPosAligner : public GpuOneToOneStartEndPosAlignerI

template<
    int cudaArch_,
    int alphabetSize_,
    class ScoreType_, 
    PenaltyType penaltyType_,
    AlignmentType alignmentType_
>
class GpuOneToAllAligner : public GpuOneToAllAlignerI

These classes are wrappers around a pre-configured GpuMultiConfigAligner. The configuration depends on the specified template parameters. cudaArch describes the SM architecture of the GPU which executes the alignments, and is computed as (smMajor * 100 + smMinor * 10). For example, for an Nvidia L40S GPU with sm_89 architecture, cudaArch=890 needs to be specified. If a configuration for the given set of template parameters can be found in include/accelign/tuningconfigs, it will be used. Otherwise, a fixed fallback configuration is used and a message is printed to stdout at runtime. For half2, no fallback configuration is provided and compilation will fail.
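
cudaArch_ is a compile-time template parameter, so the runtime value is typically used to select a matching pre-instantiated aligner. A minimal sketch (our own helper, not part of Accelign) of computing the value with the CUDA runtime API:

#include <cuda_runtime.h>

// Compute cudaArch = smMajor * 100 + smMinor * 10 for the given device.
// For an sm_89 GPU (e.g. an L40S) this returns 890.
int getCudaArch(int deviceId){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, deviceId);
    return prop.major * 100 + prop.minor * 10;
}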

We provide tuningconfigs for alphabet sizes 5 and 25 for the following Nvidia GPU architectures:

  • sm_80 (Ampere, float/half2, generated on NVIDIA A100)
  • sm_86 (Ampere, float/half2, generated on NVIDIA RTX 3090)
  • sm_89 (Ada, float/half2, generated on NVIDIA RTX 4090)
  • sm_90 (Hopper, float/half2/int/short2, generated on NVIDIA Grace Hopper)
  • sm_120 (Blackwell, float/half2/int/short2, generated on NVIDIA RTX Pro 6000 Blackwell)

Note: For our provided tuning configs, we varied numItems in steps of 4. Feel free to take a look at our tuning scripts and generate new configs if you require a denser sampling of tile sizes or if you want to generate configs for a GPU architecture for which no tuning configs are included.

Stream-safety

The interfaces provide functions to set a substitution matrix or PSSM, as well as functions to execute alignments. These functions take a CUDA stream, operate in stream-order, and may return before the associated GPU work is completed.

For example, GpuOneToAllAlignerI provides functions

    //Set PSSM for subsequent alignments. 
    //the substitution score between encoded subject letter s and query position p must be stored at h_pssm[s * queryLength + p]
    virtual void setQuery(const int* h_pssm, int queryLength, cudaStream_t stream) = 0;

    //Align sequences against the previously set PSSM. Uses the single-tile kernel without temporary storage
    virtual void executeAlignmentSingleTile(
        int* d_scoreOutput,
        const OneToAllPSSMInterfaceInputData& interfaceInputData,
        const InterfaceGapScores& interfaceGapScores,
        cudaStream_t stream
    ) = 0;
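
For illustration, the row-major PSSM expected by setQuery could be built from a substitution matrix and an encoded query like this (a sketch; buildPssm and the variable names are ours, the layouts are those described in this document):

#include <cstdint>
#include <vector>

// h_substitutionMatrix uses the h_substitutionMatrix[s * alphabetSize + q] layout
// described for setSubstitutionMatrix below.
std::vector<int> buildPssm(const std::vector<int>& h_substitutionMatrix,
                           const std::vector<std::int8_t>& encodedQuery,
                           int alphabetSize){
    const int queryLength = (int)encodedQuery.size();
    std::vector<int> h_pssm(alphabetSize * queryLength);
    for(int s = 0; s < alphabetSize; s++){
        for(int p = 0; p < queryLength; p++){
            // score between subject letter s and the query letter at position p
            h_pssm[s * queryLength + p] = h_substitutionMatrix[s * alphabetSize + encodedQuery[p]];
        }
    }
    return h_pssm;
}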

And GpuOneToOneAlignerI provides

    //Set substitution matrix for subsequent alignments. 
    //h_substitutionMatrix[s * alphabetSize + q] gives the substitution score
    //between a letter `s` belonging to the subject and a letter `q` belonging to the query
    virtual void setSubstitutionMatrix(const int* h_substitutionMatrix, cudaStream_t stream) = 0;

    //Align sequences using the previously set substitution matrix. Uses the single-tile kernel without temporary storage
    virtual void executeAlignmentSingleTile(
        int* d_scoreOutput,
        const OneToOneSubstmatInterfaceInputData& interfaceInputData,
        const InterfaceGapScores& interfaceGapScores,
        cudaStream_t stream
    ) = 0;

All of our derived classes are stateful. To ensure correctness, users must consider the following with respect to streams (a short usage sketch follows the list):

  • Alignments should be executed in the same stream that was used to set the substitution matrix / PSSM. Alternatively, synchronization or events must be employed manually.
  • Execution of alignments and / or setting substitution matrix / PSSM must not be performed in multiple streams simultaneously to avoid overwriting internal state while in use in a different stream. It is best to use multiple class instances to submit work to different streams.
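
A minimal usage sketch of the recommended single-stream pattern, assuming an already-constructed GpuOneToOneAlignerI instance aligner and already-populated input data (all names except the interface methods are ours):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Set the substitution matrix and run the alignments in the same stream,
// so the stream-ordered operations cannot overwrite internal state that is still in use.
aligner->setSubstitutionMatrix(h_substitutionMatrix, stream);
aligner->executeAlignmentSingleTile(
    d_scoreOutput,
    interfaceInputData,
    interfaceGapScores,
    stream
);

// Both calls may return before the GPU work has finished.
cudaStreamSynchronize(stream); // wait before using d_scoreOutput
cudaStreamDestroy(stream);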

Aligner factory

The high-level accelign library provides pre-configured aligners. However, compile times can be long, and since all classes are template classes, aligners might be recompiled frequently during project development.

Directory factory_generator provides a generator script to create source files for a static library of pre-configured aligners. See its Readme for more information.
