Algorithm¶

This page describes how Amplirust performs in-silico PCR.

Overview¶

In-silico PCR simulates the polymerase chain reaction computationally. Given a pair of primers (forward and reverse) and a target sequence, the algorithm:

Searches for approximate matches of the forward primer in the sequence
For each forward match, searches for the reverse complement of the reverse primer downstream
Extracts the region between (and including) the primer binding sites as a PCR product
Filters products by length, identity, and N-base content

SIMD-Accelerated Approximate Matching¶

Amplirust uses the sassy library for approximate string matching. Sassy provides SIMD-accelerated (AVX2, SSE4, NEON) pattern matching with edit distance constraints, making it significantly faster than naive approaches.

The search finds all occurrences of a primer in the target sequence within a specified maximum edit distance (-k). Edit distance counts the minimum number of insertions, deletions, and substitutions needed to transform one string into another.

Each match returns:

Position -- start and end coordinates in the target sequence
Cost -- the edit distance (number of mismatches + indels)
CIGAR string -- alignment representation
Strand -- which strand the match was found on

IUPAC Ambiguity Handling¶

Sassy supports IUPAC ambiguity codes natively through its Iupac profile. Ambiguous bases in primer sequences are expanded during matching -- for example, an R in a primer matches both A and G in the target without counting as a mismatch.

This is handled at the SIMD level, so ambiguous primers do not incur a performance penalty compared to standard ACGT primers.

See Primer Formats for the full IUPAC code table.

Forward/Reverse Primer Pairing¶

The core PCR algorithm works as follows:

Search for forward primer -- find all approximate matches of the forward primer in the (possibly extended) sequence
Search for reverse primer downstream -- for each forward match, search for the reverse complement of the reverse primer in the region from the forward match end to forward_start + max_len
Validate product -- check that the product satisfies:
- Length between --min-len and --max-len
- N-base fraction below --max-n-fraction
- Identity above --min-identity for both primers
Extract sequence -- extract the product from the sequence, either including primers (default) or excluding them (--trim-primers)

Reverse primer orientation

The reverse primer is specified in 5'-to-3' direction (as you would order it). Amplirust internally converts it to its reverse complement to search for the binding site on the forward strand.

Pool Mode (All-vs-All)¶

When --pool is enabled, Amplirust searches a pool of individual primers rather than predefined forward/reverse pairs. The algorithm:

Search phase -- O(2N) searches for N primers:
- Search each primer in forward orientation against the sequence
- Search the reverse complement of each primer against the sequence
Pairing phase -- combinatorial matching on in-memory results:
- For each forward match of primer_i at position X and each RC match of primer_j at position Y (where Y > X):
  - Calculate product length and apply length/N-fraction filters
  - By default, i != j (different primers). Use --pool-self-match to allow i == j.
Output -- product names use primerA+primerB format (e.g., 27F+1492R)

Circular genome support (--circular) and reverse complement strand search (--search-rc) work the same as pair mode.

Circular Genome Handling¶

When --circular is enabled, Amplirust extends the sequence to detect products that wrap around the origin of replication.

The extension strategy appends the first min(max_product_len - 1, sequence_length) bases of the sequence to its end:

Original:  [====SEQUENCE====]
Extended:  [====SEQUENCE====][====]
                              ↑ first N bases repeated

This allows the algorithm to find forward primer matches near the end and reverse primer matches that fall in the repeated region. Products that span the junction are flagged with the _wrap suffix and is_circular_wrap = true in the TSV output.

Positions are mapped back to the original coordinate system using modulo arithmetic.

Reverse Complement Strand Search¶

When --search-rc is enabled, Amplirust searches both DNA strands:

Forward strand -- the sequence as provided
Reverse complement strand -- the full reverse complement of the sequence

The same primer pairing algorithm runs on both strands. Products found on the reverse complement strand are:

Output in their reverse complement orientation
Labeled with strand=- and the _rc suffix in the header
Position coordinates mapped back to the original forward strand

Edit Distance vs Identity Filtering¶

Amplirust provides two complementary filtering mechanisms:

Edit Distance (`-k, --max-errors`)¶

Controls the search itself. Only matches within this edit distance are returned by the SIMD search engine. This is the primary performance knob -- lower values search faster.

Default: 2

Identity Threshold (`--min-identity`)¶

A post-filter applied after matching. Identity is calculated as:

identity = (alignment_length - edit_distance) / alignment_length

For example, a 20 bp primer with 2 mismatches has identity = 18/20 = 0.90.

Default: 0.0 (disabled)

When to use each

Use -k to control computational cost (lower = faster). Use --min-identity for biological relevance when primer lengths vary -- a 2-mismatch hit in a 15 bp primer (87% identity) is less reliable than in a 25 bp primer (92% identity).