This function provides an interface to
combinatorial_demultiplex() and
filter_demultiplex_res() such that reads are streamed in chunks
instead having to load everything at once, hence reducing memory consumption.
It accepts two functions which are called once per chunk:
A data loader function for producing the sequences of the chunk and an
archiver writing the results to file.
Usage
streaming_demultiplex(
state_init,
loader,
archiver,
barcodes,
allowed_mismatches,
segments,
segment_lengths
)Arguments
- state_init
The initial state to pass into
loader.- loader
Function loading the reads. It has the signature
f(state), wherestateis a user-defined object which is initialized to bestate_initand for the subsequent iterations taken as thestatefield of the output ofarchiver. Its return value is a list with the following fields:state: The state to be passed intoarchiver.sequences: AXStringSetobject, the sequences to be demultiplexed in the current chunk.should_terminate: A scalar logical. IfTRUE, the demultiplexing process terminates and the final results are returned. Notice that this termination happens before the sequences of the final call toloaderare demultiplexed.
- archiver
Function taking care of archiving the demultiplexed results. Its arguments are:
state: The state of the process returned byloader.filtered_res: The output from runningcombinatorial_demultiplex()andfilter_demultiplex_res()on the data expect that the fieldsummary_resis missing.
Its output is a state object fed into the next call to
loader.- barcodes
A list of
XStringSetobjects in the same order they appear in thesequences, the barcodes to be used for demultiplexing. All of the barcodes in eachXStringSetmust have the same length as specified by thesegment_lengthsargument and be named. For computational reasons, the maximum possible length of an individual barcode is 127.- allowed_mismatches
Integer vector of length one or the same length as the number of barcode segments; the threshold Hamming distance. All reads having a number of mismatches above this number in any of the barcodes will be filtered away.
- segments
Character vector showing the segments of the sequences from 5' end to 3' end. The code applied is as follows:
'A': Adapter (often referred to as linker), is trimmed and ignored'B': Barcode, used for demultiplexing'P': Payload, sequence to be kept after trimming and demultiplexing (e.g. cDNA or UMI).
If this vector is named, this will determine the names of the payload sets. Names of the barcode sets will be determined by the names of the argument
barcodes(if any).- segment_lengths
Integer vector with the same length as
segments, lengths of the segments provided in the same order as insegments. Up to one of the non-barcode segments can have its length set toNAwhich means it is considered a variadic length segment.
Value
A list with three elements:
freq_table: The frequency table for all reads, akin to the output ofcreate_freq_table().summary_res: The summary result of match filtering of all reads percreate_summary_res().state_final: The final state object returned fromloader.
Details
The data loader decides the size of each chunk.
While this framework does not provide any restriction on the state
object, the loader and archiver must be written such that the state objects
they return are compatible.
Since the data loader alone decides when to terminate,
bad terminations crieria can cause a runaway loop.
Usually, it will be useful to have a progress tracker of how many reads
are demultiplexed. The framework itself does not implement this, so
it is typically implemented into the archiver or loader.
For technical reasons, it is not possible to do streaming when the number of possible barcode combinations exceeds \(2^{32}-1\approx 2.1\cdot 10^{9}\).
See also
filter_demultiplex_res(), combinatorial_demultiplex(),
create_freq_table(), and create_summary_res()
for the underlying processing.
Examples
library(purrr)
library(Biostrings)
input_fastq <- system.file(
"extdata", "PETRI-seq_forward_reads.fq.gz", package = "posDemux")
output_barcode_table <- tempfile(pattern = "barcode_table", fileext = ".txt")
callbacks <- streaming_callbacks(
input_file = input_fastq, output_table_file = output_barcode_table,
chunk_size = 10000, verbose = TRUE)
barcode_files <- system.file(
"extdata/PETRI-seq_barcodes",
c(bc1 = "bc1.fa", bc2 = "bc2.fa", bc3 = "bc3.fa"),
package = "posDemux"
)
names(barcode_files) <- paste0("bc", 1L:3L)
barcode_index <- map(barcode_files, readDNAStringSet)
barcodes <- barcode_index[c("bc3", "bc2", "bc1")]
sequence_annotation <- c(UMI = "P", "B", "A", "B", "A", "B", "A")
segment_lengths <- c(7L, 7L, 15L, 7L, 14L, 7L, NA_integer_)
streaming_summary_res <- streaming_demultiplex(
state_init = callbacks$state_init, loader = callbacks$loader,
archiver = callbacks$archiver, barcodes = barcodes, allowed_mismatches = 1L,
segments = sequence_annotation, segment_lengths = segment_lengths
)
#> Wed Jan 21 16:11:08 2026 => Initializing FASTQ stream and output table
#> Wed Jan 21 16:11:08 2026 => Streaming FASTQ input file /home/runner/work/_temp/Library/posDemux/extdata/PETRI-seq_forward_reads.fq.gz
#> Wed Jan 21 16:11:08 2026 => Processed 10000 reads, successfully demultiplexed 9117 reads so far...
#> Wed Jan 21 16:11:08 2026 => Processed 20000 reads, successfully demultiplexed 18259 reads so far...
#> Wed Jan 21 16:11:08 2026 => Processed 30000 reads, successfully demultiplexed 27362 reads so far...
#> Wed Jan 21 16:11:08 2026 => Processed 40000 reads, successfully demultiplexed 36487 reads so far...
#> Wed Jan 21 16:11:08 2026 => Processed 50000 reads, successfully demultiplexed 45629 reads so far...
#> Wed Jan 21 16:11:08 2026 => Processed 56895 reads, successfully demultiplexed 51906 reads so far...
#> Wed Jan 21 16:11:08 2026 => Done demultiplexing