create_summary_res() is a helper function in order
to create a summary of the demultiplexing
and following match filtering. It is not designed to be invoked directly, but
its results will be returned automatically from
filter_demultiplex_res(). This returned object has it own method
for printing the result in a user-friendly manner.
Usage
create_summary_res(
retained,
barcodes,
assigned_barcodes,
allowed_mismatches,
mismatches
)
# S3 method for class 'demultiplex_filter_summary'
print(x, ...)Arguments
- retained
Logical vector with the same length as the number of reads in the input to the demultiplexer.
TRUEif the corresponding read is retained. Corresponds to the fieldretainedof the output offilter_demultiplex_res().- barcodes
A list of
XStringSetobjects, the barcodes which were used for demultiplexing.- assigned_barcodes
Character matrix of the assigned barcodes only including the onces within the mismatch threshold. Corresponds to of the field
demultiplex_res$assigned_barcodesoffilter_demultiplex_res().- allowed_mismatches
Integer vector of length one or the same length as the number of barcode segments; the threshold Hamming distance. All reads having a number of mismatches above this number in any of the barcodes will be filtered away.
- mismatches
Integer matrix of the number of mismatches of each assigned barcode. Corresponds to the field
mismatchesofcombinatorial_demultiplex().- x
An object of class
demultiplex_filter_summaryfromcreate_summary_res().- ...
Ignored
Value
create_summary_res() returns a list of
S3 class demultiplex_filter_summary
providing diagnostics for the filtering process. It contains the
the following fields:
n_reads: The total number of reads in the dataset before filtering.n_removed: The number of reads removed because demultiplexing failed.n_barcode_sets: The number of barcode sets.n_barcode_combinations: The possible number of barcode combinations.n_unique_barcodes: The number of observed unique barcode combinations (i.e. features which may be cells) detected after filtering mismatches.n_estimated_features: The estimated number of features having a detected combination of barcodes. This number will always be greater or equal thann_unique_barcodesdue to barcode collisions.observed_collision_lambda: The ratio of observed barcode combinations divided by the total number of possible barcode combinations.corrected_collision_lambda: The ratio of estimated number of features to the total number of possible barcode combinations.expected_collisions: The statistically expected number of barcode collisions or more precicely the expected number of observed barcodes which correspond to two or more features.barcode_summary: A list containing a summary for each barcode set. Each element contains the following:width: The width (number of nucleotides) of the barcode set.n_barcodes: Number of query barcodes.n_allowed_mismatches: Number of allowed mismatches for the barcode set.n_removed: Number of reads having too many mismatches for this barcode set.mismatch_frame: Adata.framewith the two columns,n_mismatchesandfrequencyshowing the number of reads for each of the allowed number of mismatches for the given barcode set.
The print() method returns its output invisibly.
Details
Following a uniform distribution of barcodes, the expected number of barcode collisions (observed barcodes combinations being composed of two or more features) is given by $$N\left(1-e^{-\lambda}-\lambda e^{-\lambda}\right),$$ where \(N\) is the number of possible barcode combinations and \(\lambda\) is in this summary referred to as the collision lambda: $$\lambda=\frac{n}{N},$$ where \(n\) is the number of features. However, \(n\) is unknown as we cannot know how many features there were originally due to potential collisions. Utilizing the fact that the expected observed number of barcodes is given by $$N\left(1-e^{-\lambda}\right),$$ we can correct the estimate for \(\lambda\) from the known value of the observed barcode combinations, and thus estimate the number of features and barcode collisions.
While each unique feature can be conceptually thought of as single cell with its transcripts, realistic datasets have many features with relatively small numbers of reads which are artifacts and unlikely to correspond to true cells.
Examples
library(purrr)
library(Biostrings)
input_fastq <- system.file(
"extdata", "PETRI-seq_forward_reads.fq.gz", package = "posDemux")
reads <- readDNAStringSet(input_fastq, format = "fastq")
barcode_files <- system.file(
"extdata/PETRI-seq_barcodes",
c(bc1 = "bc1.fa", bc2 = "bc2.fa", bc3 = "bc3.fa"), package = "posDemux"
)
names(barcode_files) <- paste0("bc", 1L:3L)
barcode_index <- map(barcode_files, readDNAStringSet)
barcodes <- barcode_index[c("bc3", "bc2", "bc1")]
sequence_annotation <- c(UMI = "P", "B", "A", "B", "A", "B", "A")
segment_lengths <- c(7L, 7L, 15L, 7L, 14L, 7L, NA_integer_)
demultiplex_res <- posDemux::combinatorial_demultiplex(
reads, barcodes = barcodes, segments = sequence_annotation,
segment_lengths = segment_lengths
)
filtered_res <- filter_demultiplex_res(demultiplex_res, allowed_mismatches = 1L)
freq_table <- create_freq_table(filtered_res$demultiplex_res$assigned_barcodes)
print(filtered_res$summary_res)
#> Total number of reads: 56895
#> Number of reads failing to demultiplex: 4989 (8.77%)
#> Observed number of unique barcode combinations:978
#> Number of possible barcode combinations: 884736
#> Estimated number of features: 978.5
#> Observed feature to barcode ratio: 0.001105
#> Corrected feature to barcode ratio: 0.001106
#> Estimated number of observed barcode combinations
#> corresponding to more than one feature: 0.5 (0.06%)
#> Number of barcode sets: 3
#> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#> Barcode set: bc3
#> Barcode width: 7
#> Number of possible barcodes: 96
#> Number of allowed mismatches: 1
#> Number of reads with 0 mismatches: 51923 (91.26%)
#> Number of reads with 1 mismatches: 671 (1.18%)
#> Number of reads above mismatch threshold: 4301 (7.56%)
#> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#> Barcode set: bc2
#> Barcode width: 7
#> Number of possible barcodes: 96
#> Number of allowed mismatches: 1
#> Number of reads with 0 mismatches: 51918 (91.25%)
#> Number of reads with 1 mismatches: 567 (1%)
#> Number of reads above mismatch threshold: 4410 (7.75%)
#> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#> Barcode set: bc1
#> Barcode width: 7
#> Number of possible barcodes: 96
#> Number of allowed mismatches: 1
#> Number of reads with 0 mismatches: 51930 (91.27%)
#> Number of reads with 1 mismatches: 642 (1.13%)
#> Number of reads above mismatch threshold: 4323 (7.6%)
#> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# This also works, but is usually not necessary to call directly
alternative_summary_res <- create_summary_res(
retained = filtered_res$retained, barcodes = barcodes,
assigned_barcodes = filtered_res$demultiplex_res$assigned_barcodes,
allowed_mismatches = 1L, mismatches = demultiplex_res$mismatches
)