Samtools get consensus sequences

10/5/2023

To get a consensus sequence from aligned reads with samtools and bcftools, run:

samtools mpileup -uf reference.fasta sorted_aligned_reads.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fastq

samtools mpileup: Generates a pileup of aligned reads at each position in the reference genome.
-f reference.fasta: Specifies the reference genome in FASTA format.
sorted_aligned_reads.bam: The sorted and indexed BAM file.
bcftools call: Calls the consensus genotype for each position based on the pileup.
vcfutils.pl vcf2fq: Converts the consensus genotype in VCF format to FASTQ format, representing the consensus sequence. Note that vcf2fq is a subcommand of the vcfutils.pl script that ships with bcftools, not a standalone program.
consensus.fastq: The output file containing the consensus sequence in FASTQ format.

After running this script, you should obtain the consensus sequence in the consensus.fastq file. Please make sure to replace reference.fasta with the filename of your reference genome and sorted_aligned_reads.bam with the name of your sorted and indexed BAM file. If you prefer FASTA instead of FASTQ, you can convert the file with tools like seqtk or fastq_to_fasta.

Long reads from nanopore sequencing platforms such as Oxford Nanopore Technologies (ONT) are widely used in the study of bacterial genomes. Compared with short reads from next-generation sequencing (NGS), long reads can span larger genomic repeats and complex genomic structures, thus facilitating downstream genome assembly and analysis. Many software tools and algorithms have been developed for bacterial genome assembly, such as Canu, FlyE, and Wtdbg2. They have relative advantages and disadvantages as well as varying performance and assembly outcomes, but in terms of overall performance, FlyE and Raven stand out as the best bacterial genome assemblers.

Nanopore sequencing data are characterized by indels, non-random systematic errors, and assembly errors spanning hundreds of bases, which may lead to inaccurate or incomplete assemblies. Hybrid assembly, which uses both short and long reads from next- and third-generation sequencing platforms, is gaining popularity. Alternatively, the assemblies can be corrected using nanopore sequencing data and then polished with NGS data. Both approaches can mitigate some of these problems and improve the accuracy of the assemblies, but assembly errors cannot be completely avoided. Therefore, assembly, especially of bacterial genomes, is far from perfect, and there are many details to consider and substantial room for improvement. Since genome assembly is often the first step of bioinformatics analysis in de novo sequencing of bacterial genomes, assembly errors may have critical implications for downstream analysis.

Therefore, we developed MAECI, a pipeline for assembling nanopore long-read sequencing data of bacterial genomes. It takes nanopore sequencing data as input, uses multiple assembly algorithms to generate a single consensus sequence, and then uses the nanopore sequencing data to perform self-error correction. MAECI takes advantage of the fact that different assembly algorithms produce different assembly errors for the same data, and corrects them by self-correction to produce a single consensus sequence that has fewer assembly errors and is more accurate than any of the inputs (Fig 1).

The assembly results from the various software tools and/or algorithms were first sorted by number of scaffolds in descending order. Then, the error correction algorithm (default: Racon, version v1.4.20) was used for "2+3" rounds of self-correction. For example, if the number of scaffolds from Canu, Wtdbg2, and FlyE was 3, 2, and 1, respectively, the sequencing data were aligned against the assembly from Canu in the first round of correction, with the assembly from Wtdbg2 used as the target genome to generate consensus sequence 1. In the second round of error correction, the sequencing data were aligned against consensus sequence 1, and the assembly from FlyE was used as the target genome to generate consensus sequence 2. Finally, the sequencing data and consensus sequence 2 were used for three rounds of error correction to obtain the final consensus sequence.

By default, BWA (version: 0.7.17-r1188) and Minimap2 (version: 2.21-r1071) were installed for alignment, while Sambamba (version: 0.8.0) and Samtools (version: 1.12) were installed for alignment processing. Additionally, if polishing with NGS data was required, Pilon (version: 1.24, the default) was used to polish the consensus sequence with default parameters. NanoPlot (version: 1.38.0) was used for quality control and QUAST (version: 5.0.2) was used for evaluation of the sequencing data and assemblies, with default parameters.
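The scaffold-count ordering described for MAECI can be sketched in a few lines of shell. This is only an illustration and not MAECI's actual code: the function names count_scaffolds, order_assemblies and plan_rounds are made up for this example, scaffolds are counted simply as FASTA header lines, and the plan assumes exactly three assemblies, as in the Canu/Wtdbg2/FlyE example.

```shell
# Sketch only: function names are hypothetical and not part of MAECI.

# Count scaffolds as the number of FASTA header lines (lines starting with '>').
count_scaffolds() {
    grep -c '^>' "$1" || true
}

# Print "count<TAB>file" for each assembly, highest scaffold count first.
# The sort is stable, so ties keep the order in which files were given.
order_assemblies() {
    for f in "$@"; do
        printf '%s\t%s\n' "$(count_scaffolds "$f")" "$f"
    done | sort -s -k1,1nr
}

# Print which assembly plays which role in each round.
# Assumes exactly three assemblies, as in the example in the text.
plan_rounds() {
    sorted=$(order_assemblies "$@" | cut -f2)
    first=$(printf '%s\n' "$sorted" | sed -n '1p')
    second=$(printf '%s\n' "$sorted" | sed -n '2p')
    third=$(printf '%s\n' "$sorted" | sed -n '3p')
    echo "round 1: align reads to $first, target $second -> consensus 1"
    echo "round 2: align reads to consensus 1, target $third -> consensus 2"
    echo "rounds 3-5: correct consensus 2 with the reads -> final consensus"
}
```

Calling plan_rounds canu.fasta wtdbg2.fasta flye.fasta (placeholder file names) only prints the plan. It does not run Racon or any aligner.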

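For the samtools consensus command in this post, it can help to wrap the pipeline in a small function that checks its inputs first. The sketch below is an assumption-laden example rather than an official recipe: make_consensus and the DRY_RUN variable are names invented here, and the pileup step uses bcftools mpileup instead of samtools mpileup -u, because recent samtools releases have deprecated BCF output from mpileup. It needs bcftools and vcfutils.pl on your PATH.

```shell
# Sketch only: make_consensus and DRY_RUN are hypothetical names for this example.
# Usage: make_consensus reference.fasta sorted_aligned_reads.bam consensus.fastq
make_consensus() {
    ref=$1
    bam=$2
    out=$3

    # Fail early if an input file is missing or unreadable.
    [ -r "$ref" ] || { echo "cannot read reference: $ref" >&2; return 1; }
    [ -r "$bam" ] || { echo "cannot read BAM file: $bam" >&2; return 1; }

    # With DRY_RUN=1, print the pipeline instead of running it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "bcftools mpileup -Ou -f $ref $bam | bcftools call -c | vcfutils.pl vcf2fq > $out"
        return 0
    fi

    # Pileup -> consensus genotype calls -> FASTQ consensus sequence.
    bcftools mpileup -Ou -f "$ref" "$bam" \
        | bcftools call -c \
        | vcfutils.pl vcf2fq > "$out"
}
```

Running DRY_RUN=1 make_consensus reference.fasta sorted_aligned_reads.bam consensus.fastq shows the command without executing it. If you want FASTA afterwards, the seqtk README gives seqtk seq -a in.fq > out.fa for the conversion.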