#RANDOM BARCODE SOFTWARE#
Single-cell sequencing experiments use short DNA barcode ‘tags’ to identify reads that originate from the same cell. To recover single-cell information from such experiments, reads must be grouped by their barcode tag, a crucial processing step that precedes other computations. However, this step can be difficult because of the high rates of mismatch and deletion errors that can afflict barcodes. Here we present an approach to identify and error-correct barcodes by traversing the de Bruijn graph of circularized barcode k-mers. Our approach is based on the observation that circularizing a barcode sequence can yield error-free k-mers even when k is large relative to the length of the barcode sequence, a regime that is typical of single-cell barcoding applications. This allows reads to be assigned to consensus fingerprints constructed from k-mers. We show that for single-cell RNA-Seq, circularization improves the recovery of accurate single-cell transcriptome estimates, especially when there is a high number of errors per read. The approach is robust to the type of error (mismatch, insertion, deletion) as well as to the relative abundances of the cells. Sircel, a software package that implements this approach, is described and publicly available.
Tagging sequencing reads with short DNA barcodes is a common experimental practice that enables a pooled sequencing library to be separated into biologically meaningful partitions. This technique is a cornerstone of many single-cell sequencing experiments, where reads originating from individual cells are tagged with cell-specific barcodes; as such, the first step in any single-cell sequencing experiment involves separating reads by barcode to recover single-cell profiles. For example, in the Drop-Seq protocol, a popular microfluidic-based single-cell experimental platform, DNA barcodes are synthesized on a solid bead support using split-and-pool DNA synthesis, and this approach has been applied to obtain single-cell transcriptome profiles from a number of model and non-model organisms. Similar split-and-pool barcoding strategies are used in other single-cell sequencing assays such as Seq-Well and Split-seq. One consequence of this synthetic technique is that deletion errors are extremely prevalent: by some estimates, 25% of all observed barcode sequences contain at least one deletion. Ignoring such errors can therefore dramatically lower the number of usable reads in a dataset, while incorrectly grouping reads together can confound single-cell analysis.
Current approaches to “barcode calling”, the process of grouping reads together by barcode, use simple heuristics to first identify barcodes that are likely to be uncorrupted, and then “error correct” the remaining barcodes to increase yields. However, the complex nature of the errors, which unlike sequencing-based errors also include deletions, can lead to a large number of discarded reads (reads that could not be assigned to a barcode). Additionally, some current approaches require that the approximate number of cells in the experiment be known beforehand, and in some experimental contexts such information is not easily obtained.
In such experiments, there are two major approaches toward generating barcodes. In the approach used by 10× Genomics, among others, barcodes are drawn from a known ‘whitelist’ of sequences, and this prior knowledge can be used to simplify error correction and read assignment. On the other hand, the barcodes generated through split-pool synthesis (including Drop-seq) are random, and no prior information can be used for either error correction or read assignment.
The problem of identifying true barcodes from among many sequences corrupted by mismatch and deletion errors seemingly requires a multiple sequence alignment, from which errors can be detected and corrected. However, unlike standard biological sequence alignment settings, the single-cell barcode identification problem requires analysis of millions, if not billions, of different sequences. On the other hand, the problem is constrained in that the sequences are short (barcodes are typically 10–16 bp long) and all barcodes have the same, known length. Here, we present a fast and error-robust k-mer based approach to detecting random barcodes from sequencing data. Our software circumvents the need for a complete (and intractable) multiple sequence alignment by circularizing the sequences that are to be error-corrected; rather than pursuing multiple sequence alignment, we borrow ideas from genome assembly.
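The circularization observation described above can be illustrated with a minimal Python sketch. It is not Sircel's implementation: the function names, the 12 bp example barcode, and the choice k = 8 are all assumptions made for illustration, and only a single mismatch error is considered. The sketch counts how many k-mers an error-containing barcode still shares with the true barcode, with and without circularization.

```python
def linear_kmers(seq, k):
    """All k-mers read along the linear sequence (len(seq) - k + 1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def circular_kmers(seq, k):
    """All k-mers read around the circularized sequence (len(seq) of them)."""
    if not 0 < k <= len(seq):
        raise ValueError("k must be between 1 and len(seq)")
    # Appending the first k - 1 bases lets windows wrap past the end.
    wrapped = seq + seq[:k - 1]
    return [wrapped[i:i + k] for i in range(len(seq))]


def shared_kmers(true_barcode, observed_barcode, k, circular):
    """Number of distinct k-mers common to the true and observed barcodes."""
    kmers = circular_kmers if circular else linear_kmers
    return len(set(kmers(true_barcode, k)) & set(kmers(observed_barcode, k)))


# Hypothetical 12 bp barcode and a copy with one mismatch at position 5 (G -> A).
true_barcode = "ACGTTGCAAGCT"
observed_barcode = "ACGTTACAAGCT"

print(shared_kmers(true_barcode, observed_barcode, k=8, circular=False))
print(shared_kmers(true_barcode, observed_barcode, k=8, circular=True))
```

In this example every linear 8-mer overlaps the mismatched position, so no error-free k-mer survives, whereas the circularized barcode still yields windows that wrap around the ends and avoid the error. Those surviving k-mers are what make a k-mer graph over circularized barcodes usable when k is large relative to the barcode length.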