Last December Google Brain released DeepVariant, a machine-learning-based variant caller that uses convolutional neural networks. While PacBio and Nanopore (long-read) sequencing are becoming more mainstream, massive amounts of 2nd generation sequencing* (2GS) data already exist for populations, and that data is still very useful. For the Medicago HapMap project, we have 262 accessions sequenced to varying depths with 2nd generation sequencing. The SNP calls have helped many researchers perform association and population studies, and have contributed to the knowledge of legume-rhizobial symbiosis in other ways.
With 2GS, SNP calling can usually be simplified to:
Generate Reference Genome --> Population Sequencing --> Map reads to Reference Genome --> Variant Calling
While more sequencing is often cost-prohibitive, joining contigs and scaffolds together and adding non-core sequence to the reference genome are easy ways to improve SNP calls. The other method is to improve the variant calling itself, which is what DeepVariant purports to do. If you are working with non-human organisms (like me), then you'll have to generate a model yourself. The DeepVariant team has a guide for generating models. The biggest problem is finding variants you are confident about. This could be done by running GATK or Freebayes and then using only the highest-confidence variant calls as a truth set for DeepVariant, or even for a second run of GATK or Freebayes. The logic is circular, but it works well enough.
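To make the truth set idea concrete, here is a minimal sketch of pulling only the most confident calls out of a VCF produced by a first GATK or Freebayes run. The function names, the QUAL cutoff of 50, and the choice to drop heterozygous calls are my own assumptions for illustration, not anything from the DeepVariant guide, and it only looks at the first sample column. It reads plain VCF text lines, so a gzipped file would need to be opened with `gzip.open(path, "rt")` first.

```python
import re


def is_het(format_field, sample_field):
    """Return True if the sample's GT entry has two different called alleles."""
    keys = format_field.split(":")
    if "GT" not in keys:
        return False
    gt = sample_field.split(":")[keys.index("GT")]
    # Alleles are separated by "/" (unphased) or "|" (phased); "." is a no-call.
    called = [a for a in re.split(r"[/|]", gt) if a != "."]
    return len(set(called)) > 1


def keep_record(line, min_qual=50.0, drop_hets=True):
    """Decide whether one VCF data line is confident enough to keep.

    min_qual and drop_hets are hypothetical defaults; tune them for your data.
    """
    fields = line.rstrip("\n").split("\t")
    qual, filt = fields[5], fields[6]
    if qual == "." or float(qual) < min_qual:
        return False
    if filt not in ("PASS", "."):
        return False
    # Only the first sample column (index 9) is checked in this sketch.
    if drop_hets and len(fields) > 9 and is_het(fields[8], fields[9]):
        return False
    return True


def filter_vcf(lines, min_qual=50.0, drop_hets=True):
    """Yield header lines unchanged plus the data lines that pass keep_record."""
    for line in lines:
        if line.startswith("#") or keep_record(line, min_qual, drop_hets):
            yield line
```

In practice you would write the output of `filter_vcf(open("first_pass.vcf"))` to a new file (the filename is a placeholder) and treat that as the candidate truth set. Dropping hets only makes sense for a mostly inbred organism; for an outcrossing species you would set `drop_hets=False`.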
(Thanks to Derek Nedveck for pointing this software out to me)
Let’s talk semantics for a second.
* Next-generation sequencing (NGS) would now refer to fourth-generation sequencing (4GS), because that is what is next.
It’s worth just trying DeepVariant out without training a new model. It should work well for your use case, since Medicago is a diploid species as far as I can tell. Feel free to post any questions you have on our GitHub page.
Hmm, thanks. I’ll give it a try. Medicago is mostly inbred, so hets are usually a sign of error, but it’d be worth seeing if existing models worked with it or altered the calls compared to the prebuilt ones.