Edit on GitHub

pynw

Check Build PyPI License: MIT

Rust-accelerated Needleman-Wunsch global sequence alignment for precomputed similarity matrices. Python bindings are built with PyO3.

Align two ordered sequences, allowing for gaps (insertions/deletions), given any precomputed pairwise similarity matrix. Useful for strings, time series, token sequences, or any domain where element order matters.

Features

  • Fast: Alignment runs in $O(nm)$ time; a 1000×1000 matrix takes <10 ms on modern CPUs.
  • NumPy-first: Accepts NumPy arrays directly — no intermediate Python objects.
  • Domain-agnostic: Operates on a precomputed similarity matrix, so any scoring function (distance metrics, semantic similarity, etc.) works out of the box.
  • Asymmetric gaps: Optionally set separate insert and delete penalties.

Installation

Requires Python 3.10+ and NumPy 1.21+. Prebuilt wheels for Linux, macOS, and Windows are published on PyPI.

pip install pynw

On platforms without a prebuilt wheel, pip will build from the source distribution. This requires a Rust toolchain (1.85+).

Quick start

Align two DNA sequences using a simple match/mismatch scoring scheme:

import numpy as np
from pynw import needleman_wunsch, alignment_indices

seq1 = np.array(list("GATTACA"))
seq2 = np.array(list("GCATGCA"))

# Build an (n, m) similarity matrix: +1 for match, -1 for mismatch
similarity_matrix = np.where(seq1[:, None] == seq2[None, :], 1.0, -1.0)

score, ops = needleman_wunsch(similarity_matrix, gap_penalty=-1.0)
src_idx, tgt_idx = alignment_indices(ops)

# Reconstruct aligned sequences; masked positions are gaps
aligned1 = np.ma.array(seq1).take(src_idx).filled("-")
aligned2 = np.ma.array(seq2).take(tgt_idx).filled("-")

print(f"Score: {score}\n{''.join(aligned1)}\n{''.join(aligned2)}")
# Score: 2.0
# G-ATTACA
# GCA-TGCA

Details

Precomputed similarity matrix

pynw takes a precomputed (n, m) similarity matrix rather than a scoring function. This means the entire alignment runs in compiled Rust code with no Python callbacks, and you can build scores with vectorized NumPy operations rather than element-wise Python loops.

The trade-off is that you must allocate the full matrix up front, which uses $O(nm)$ memory even when the scoring rule could be expressed more compactly.

Scoring

The total alignment score is the sum of similarity-matrix entries for matched positions and gap penalties for insertions/deletions. Gap penalties are typically negative. By default a single gap_penalty applies to both directions; set insert_penalty and/or delete_penalty to penalise them independently.

When multiple alignments achieve the same optimal score, pynw breaks ties deterministically: Align > Delete > Insert.

Edit-distance parameterizations

Needleman-Wunsch can reproduce common metrics with the right similarity-matrix values and gap penalty:

Metric S[i,j] match S[i,j] mismatch gap_penalty NW score equals
Levenshtein distance 0 -1 -1 -distance
Indel distance 0 -2 -1 -distance
LCS length 1 0 0 lcs_length
Hamming distance 0 -1 -(n+1) -distance

For Hamming distance, strings must have equal length.

 1"""
 2.. include:: ../README.md
 3   :start-line: 1
 4   :end-before: ## API
 5"""
 6
 7from importlib.metadata import version
 8
 9from pynw._native import needleman_wunsch
10from pynw._ops import EditOp, alignment_indices
11
12__docformat__ = "numpy"
13__version__ = version("pynw")
14
15__all__ = [
16    "needleman_wunsch",
17    "EditOp",
18    "alignment_indices",
19]
def needleman_wunsch( similarity_matrix, *, gap_penalty=-1.0, insert_penalty=None, delete_penalty=None):

Align two ordered sequences given a precomputed similarity matrix.

The total alignment score is the sum of similarity-matrix entries for matched positions and gap penalties for insertions/deletions.

Parameters
  • similarity_matrix (array_like, shape (n, m)): similarity_matrix[i, j] is the similarity score for aligning element i of the source sequence with element j of the target sequence.
  • gap_penalty (float, default -1.0): Penalty applied when a gap is inserted in either sequence. Use insert_penalty or delete_penalty to set them independently.
  • insert_penalty (float, optional): Penalty for advancing the target sequence without consuming a source element (gap in source). Defaults to gap_penalty.
  • delete_penalty (float, optional): Penalty for advancing the source sequence without consuming a target element (gap in target). Defaults to gap_penalty.
Raises
  • ValueError: If similarity_matrix is not 2-dimensional, or if any value in similarity_matrix or the gap penalties is NaN or Inf.
Returns
  • score (float): The optimal alignment score.
  • ops (ndarray of uint8, shape (k,)): Sequence of edit operations describing the alignment. Each element is of type EditOp. Use alignment_indices to reconstruct source and target index arrays.
Examples

Align two DNA sequences using a simple match/mismatch scoring scheme:

>>> import numpy as np
>>> from pynw import alignment_indices
>>> seq1 = np.array(list("GATTACA"))
>>> seq2 = np.array(list("GCATGCA"))
>>> sm = np.where(seq1[:, None] == seq2[None, :], 1.0, -1.0)
>>> score, ops = needleman_wunsch(sm, gap_penalty=-1.0)
>>> score
2.0
>>> src_idx, tgt_idx = alignment_indices(ops)
>>> "".join(np.ma.array(seq1).take(src_idx).filled("-"))
'G-ATTACA'
>>> "".join(np.ma.array(seq2).take(tgt_idx).filled("-"))
'GCA-TGCA'
Notes

When multiple alignments achieve the same optimal score, ties are broken deterministically: Align > Delete > Insert. This prefers substitutions over gaps, producing compact alignments. Other tools may return different co-optimal alignments.

All values in similarity_matrix and the gap penalties must be finite.

class EditOp(enum.IntEnum):
18class EditOp(IntEnum):
19    """
20    Edit operation codes.
21
22    Not guaranteed to be compatible between pynw version.
23    """
24
25    Align = OP_ALIGN
26    Insert = OP_INSERT
27    Delete = OP_DELETE

Edit operation codes.

Not guaranteed to be compatible between pynw version.

Align = <EditOp.Align: 0>
Insert = <EditOp.Insert: 1>
Delete = <EditOp.Delete: 2>
def alignment_indices( ops: ArrayLike) -> tuple[numpy.ma.MaskedArray[tuple[int], numpy.dtype[numpy.int64]], numpy.ma.MaskedArray[tuple[int], numpy.dtype[numpy.int64]]]:
 56def alignment_indices(
 57    ops: npt.ArrayLike,
 58) -> tuple[MaskedIndexArray, MaskedIndexArray]:
 59    """Reconstruct source and target indices from an ops array.
 60
 61    Converts a sequence of edit operations into a pair of masked index arrays.
 62    Each array has one entry per alignment position.  Positions where the
 63    corresponding sequence has a gap are masked out.
 64
 65    Parameters
 66    ----------
 67    ops : array_like of uint8, shape (k,)
 68        Edit-operation sequence returned by ``needleman_wunsch``.
 69
 70    Returns
 71    -------
 72    source_idx : masked array of intp, shape (k,)
 73        Index into the source sequence at each alignment position.
 74        Masked (invalid) at insert positions (gap in source).
 75    target_idx : masked array of intp, shape (k,)
 76        Index into the target sequence at each alignment position.
 77        Masked (invalid) at delete positions (gap in target).
 78
 79    Raises
 80    ------
 81    ValueError
 82        If ``ops`` cannot be converted to a 1-D ``uint8`` array, if any
 83        element is out of the ``uint8`` range, or if any element is not a
 84        valid ``EditOp`` discriminant.
 85
 86    Examples
 87    --------
 88    >>> import numpy as np
 89    >>> from pynw import needleman_wunsch, alignment_indices
 90    >>> source_seq = list("GAT")
 91    >>> target_seq = list("GT")
 92    >>> sm = np.where(
 93    ...     np.array(source_seq)[:, None] == np.array(target_seq)[None, :],
 94    ...     1.0, -1.0,
 95    ... )
 96    >>> _, ops = needleman_wunsch(sm, gap_penalty=-1.0)
 97    >>> src, tgt = alignment_indices(ops)
 98    >>> src.tolist()
 99    [0, 1, 2]
100    >>> tgt.tolist()
101    [0, None, 1]
102    """
103    src_idx, src_mask, tgt_idx, tgt_mask = _alignment_indices(ops)
104    return np.ma.array(src_idx, mask=src_mask), np.ma.array(tgt_idx, mask=tgt_mask)

Reconstruct source and target indices from an ops array.

Converts a sequence of edit operations into a pair of masked index arrays. Each array has one entry per alignment position. Positions where the corresponding sequence has a gap are masked out.

Parameters
  • ops (array_like of uint8, shape (k,)): Edit-operation sequence returned by needleman_wunsch.
Returns
  • source_idx (masked array of intp, shape (k,)): Index into the source sequence at each alignment position. Masked (invalid) at insert positions (gap in source).
  • target_idx (masked array of intp, shape (k,)): Index into the target sequence at each alignment position. Masked (invalid) at delete positions (gap in target).
Raises
  • ValueError: If ops cannot be converted to a 1-D uint8 array, if any element is out of the uint8 range, or if any element is not a valid EditOp discriminant.
Examples
>>> import numpy as np
>>> from pynw import needleman_wunsch, alignment_indices
>>> source_seq = list("GAT")
>>> target_seq = list("GT")
>>> sm = np.where(
...     np.array(source_seq)[:, None] == np.array(target_seq)[None, :],
...     1.0, -1.0,
... )
>>> _, ops = needleman_wunsch(sm, gap_penalty=-1.0)
>>> src, tgt = alignment_indices(ops)
>>> src.tolist()
[0, 1, 2]
>>> tgt.tolist()
[0, None, 1]