Category Archives: Machine Learning

SFASTA: Fast Index building

What is SFASTA? Genomic and bioinformatics-adjacent sequences (RNA, protein, peptides) are stored as FASTA files. Sequencing reads off a machine are stored as FASTQ files, which add a quality score for each nucleotide. Currently, these are human-readable plaintext files. As sequencing increases, we need to be able to process many more gigabytes and terabytes of files rapidly and…

Machine Learning for Variant calling with DeepVariant from Google Brain

Last December Google Brain released DeepVariant, a machine-learning-based variant caller that uses convolutional neural networks. While PacBio and Nanopore (long-read) sequencing are becoming more mainstream, massive amounts of 2nd-generation sequencing* data still exist for populations and remain useful. For the Medicago HapMap project, we have 262 accessions with varying depths of 2nd-generation sequencing.…

Machine Learning Crash Course from Google

Earlier this month Google made their internal Machine Learning Crash Course available. You can read more about it on their developer blog. I have a few machine learning projects going, mostly to learn, but also to create an alignment-free sequence origin-identification tool. The (unorganized, incomplete) code is available at my GitHub repository. I’m curious about methods to improve genome…