Category Archives: Bioinformatics

SFASTA: Fast Index building

What is SFASTA? Genomic and bioinformatic-adjacent sequences (RNA, Protein, Peptides) are stored as FASTA files. Sequencing reads off a machine are stored as FASTQ files, adding a quality score associated with each nucleotide. Currently, these are non-human-readable plaintext files. As sequencing increases, we need to be able to process many more gigabytes and terabytes of files rapidly and… Read More »

Machine Learning for Variant calling with DeepVariant from Google Brain

Last December Google Brain released DeepVariant, a machine-learning based variant caller using convolutional neural networks. While PacBio and Nanopore (long-read) sequencing become more mainstream, there exist massive amounts of data from 2nd generation sequencing* for populations which still have lots of use. For the Medicago HapMap project, we have 262 accessions with various depth of 2nd generation sequencing.… Read More »

Machine Learning Crash Course from Google

Earlier this month Google made their internal Machine Learning Crash Course available. You can read more about it on their developer blog. I have a few machine learning projects going, mostly to learn but also to create an alignment-free sequence origin-identification tool. The (unorganized, incomplete) code is available at my GitHub repository. I’m curious about methods to improve genome… Read More »

Using ODG from the Neo4j Web Console

The ODG query interface should suffice for many operations, and the command-line interface supports only certain analyses. If you have more advanced queries to run, you can interact with ODG’s generated database from nearly any programming language, using a library or package, via the REST API, or through Neo4j’s Web Console. This tutorial will cover accessing it via… Read More »

ODG, the Omics Database Generator, has been published

ODG: Omics Database Generator has been published in BMC Bioinformatics and is available online now. ODG is a tool that allows users to supply -omics data and ODG will integrate the data into a coherent database and generate a web-based user-interface. Advanced users can query the database directly, through a programming language or by using the CYPHER query language. ODG uses Neo4j’s… Read More »

Bio* Library for Clojure.

Biotools is my basic bioinformatics file parsing library. You can find it at GitHub. It can parse BLAST+ Tab output (-outfmt “6 std qlen slen”), ExPASY ENZYME.dat, FASTA, GFF3, FPKM Tracking files from Cufflinks, Interproscan tab delimited output, Gene Ontology/Plant Ontology OBO format (any Ontology in OBO format), PMN Pathways format, and a PSI-MITAB 2.5 format. These are all… Read More »

Reading genes.fpkm_tracking into Clojure/Incanter

I needed to analyze a large batch of samples (~300) of genes.fpkm_tracking files in Clojure and Incanter. This guide will show you how I read the files in, only looked at the FPKMs, and converted it into a single dataset. You need a project.clj file somewhere with the dependencies below (incanter, me.raynes.fs).

Graph Database example using Gene Ontology – Part 1

The Gene Ontology project is a useful tool for anyone doing genomics. It’s a highly relational and controlled vocabulary, making it ideal for use in a graph database. In this example I will show you what a graph database is, and throughout this series we will create a graph database of GO terms, properly linked, inside the Neo4j… Read More »