SFASTA: Fast Index building

What is SFASTA? Genomic and related biological sequences (RNA, proteins, peptides) are stored as FASTA files. Sequencing reads off a machine are stored as FASTQ files, which add a quality score for each nucleotide. Currently, these are human-readable plaintext files. As sequencing output increases, we need to be able to process many more gigabytes and terabytes of…

Species-wide genomics of kākāpō provides tools to accelerate recovery

The kākāpō is a critically endangered, intensively managed, long-lived nocturnal parrot endemic to Aotearoa New Zealand. We generated and analysed whole-genome sequence data for nearly all individuals living in early 2018 (169 individuals) to generate a high-quality species-wide genetic variant callset. We leverage extensive long-term metadata to quantify genome-wide diversity of the species over time…

Dissertation Defense Announcement

In partial fulfillment of my doctoral degree, I will be presenting my work on “Genomic complexities in the legume-rhizobial symbiosis.” The work is primarily computational and generalizable to many different systems. The defense is open to the public and takes place on Thursday, May 31 at noon in the BioSciences building, room 257…

Machine Learning for Variant calling with DeepVariant from Google Brain

Last December, Google Brain released DeepVariant, a machine-learning-based variant caller using convolutional neural networks. While PacBio and Nanopore (long-read) sequencing are becoming more mainstream, massive amounts of data from 2nd generation sequencing* still exist for populations and remain useful. For the Medicago HapMap project, we have 262 accessions with various depth of…

Partition implemented in Python

Coming from a functional programming mindset, I needed a partition function in Python. I discovered this on the internet and wanted to share it here with anyone else looking for similar functions. You can see what partition typically does at ClojureDocs to get an idea if you are curious.
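
As a rough illustration of what such a function does, here is a minimal sketch that assumes the Clojure-style semantics described at ClojureDocs: groups of `n` items taken `step` items apart, with an incomplete final group dropped unless padding is supplied. The signature and parameter names (`step`, `pad`) are assumptions made for this example, and this is not necessarily the implementation referred to above.

```python
from itertools import islice


def partition(n, coll, step=None, pad=None):
    """Split coll into tuples of n items whose start positions are step apart.

    Assumed to mirror Clojure's partition: an incomplete final group is
    dropped unless pad is given, in which case it is filled from pad and may
    still be shorter than n if pad runs out.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    step = n if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")

    items = list(coll)  # materialise so any iterable can be sliced
    result = []
    for start in range(0, len(items), step):
        chunk = items[start:start + n]
        if len(chunk) == n:
            result.append(tuple(chunk))
            continue
        # First incomplete group: pad it if requested, then stop either way.
        if pad is not None:
            chunk.extend(islice(pad, n - len(chunk)))
            result.append(tuple(chunk))
        break
    return result
```

For example, `partition(3, range(7))` gives `[(0, 1, 2), (3, 4, 5)]`, and `partition(2, range(5), step=1)` gives the sliding windows `[(0, 1), (1, 2), (2, 3), (3, 4)]`. The sketch returns a list for simplicity, whereas Clojure's version is lazy.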

Importing GloVe Embeddings into Tensorflow

GloVe is a useful tool for rapidly generating word embeddings. I am using this with DNA sequences now to experiment with machine learning techniques in genomics. Loading these embeddings into TensorFlow is essential for my experiments. Here is how to do it in Python.
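
One possible approach is sketched below. It assumes the embeddings are in GloVe's plain-text output format (one token per line followed by its space-separated vector components) and that a frozen Keras `Embedding` layer is the desired target. The helper names `parse_glove` and `build_embedding_layer`, and the k-mer tokens in the usage note, are hypothetical and chosen only for illustration.

```python
def parse_glove(lines):
    """Parse GloVe text-format lines into (vocab, vectors).

    vocab maps each token to its row index; vectors is a list of rows.
    """
    vocab = {}
    vectors = []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        if vectors and len(values) != len(vectors[0]):
            raise ValueError(f"inconsistent vector length for token {word!r}")
        vocab[word] = len(vectors)
        vectors.append(values)
    return vocab, vectors


def build_embedding_layer(vectors, trainable=False):
    """Wrap the parsed vectors in a Keras Embedding layer.

    NumPy and TensorFlow are imported here so that parsing alone does not
    require them.
    """
    import numpy as np
    import tensorflow as tf

    matrix = np.asarray(vectors, dtype="float32")
    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0],
        output_dim=matrix.shape[1],
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=trainable,
    )
```

Usage might look like `vocab, vectors = parse_glove(open("vectors.txt"))` followed by `layer = build_embedding_layer(vectors)`, where `vectors.txt` is a placeholder path. For DNA sequences the tokens would typically be k-mers such as `ACG`, and `vocab` converts each one to the integer index the layer expects.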

Select and resequence manuscript published

Select and resequence reveals relative fitness of bacteria in symbiotic and free-living environments. Abstract: Assays to accurately estimate relative fitness of bacteria growing in multistrain communities can advance our understanding of how selection shapes diversity within a lineage. Here, we present a variant of the “evolve and resequence” approach both to estimate relative fitness and…

Machine Learning Crash Course from Google

Earlier this month Google made their internal Machine Learning Crash Course available. You can read more about it on their developer blog. I have a few machine learning projects going, mostly to learn but also to create an alignment-free sequence origin-identification tool. The (unorganized, incomplete) code is available at my GitHub repository. I’m curious about methods…

Using ODG from the Neo4j Web Console

The ODG query interface should suffice for many operations, and the command-line interface supports only certain analyses. If you have more advanced queries to run, you can interact with ODG’s generated database from nearly any programming language, using a library or package, via the REST API, or through Neo4j’s Web Console. This tutorial will cover…