Graph Database example using Gene Ontology – Part 1

By | April 11, 2013

The Gene Ontology project is a useful tool for anyone doing genomics. It’s a highly relational and controlled vocabulary, making it ideal for use in a graph database. In this example I will show you what a graph database is, and throughout this series we will create a graph database of GO terms, properly linked, inside the Neo4j database system. To make it more useful than examining GO terms for the sake of examining GO terms, we will connect it to another database as well.

The purpose of this is to give an example of what a graph database is, rather than explain it. I find that my learning style thrives on concrete, as opposed to abstract, examples. Wikipedia, as usual, can give you an introduction to graph databases. I’m also happy to answer any questions in the comments.

Let’s take a look at a GO entry:


[Term]
id: GO:0000234
name: phosphoethanolamine N-methyltransferase activity
namespace: molecular_function
def: "Catalysis of the reaction: S-adenosyl-L-methionine + ethanolamine phosphate = S-adenosyl-L-homocysteine + N-methylethanolamine phosphate." [EC:2.1.1.103]
synonym: "phosphoethanolamine methyltransferase activity" EXACT [EC:2.1.1.103]
synonym: "S-adenosyl-L-methionine:ethanolamine-phosphate N-methyltransferase activity" EXACT [EC:2.1.1.103]
xref: EC:2.1.1.103
xref: KEGG:R02037
xref: MetaCyc:2.1.1.103-RXN
xref: RHEA:20366
xref: RHEA:20368
is_a: GO:0008170 ! N-methyltransferase activity
is_a: GO:0008757 ! S-adenosylmethionine-dependent methyltransferase activity

This is taken directly from their flat file; the format description and the download are both available from the Gene Ontology project.

Basically, their format is one item per line. A stanza starts with [Term] and has zero or more tags, each with its own requirements: you can have unlimited XREFs, for example, but no more than one def. This example application will ignore most of that; as long as our parser works, we will not worry about the individual technicalities. In fact, our focus is on the database, not the technicalities.
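To make the stanza layout concrete, here is a minimal sketch of a parser for it, written in Python. It is not the parser used later in this series; the function name `parse_obo_terms` and the shortened `SAMPLE` text are made up for illustration, and it deliberately ignores the per-tag rules mentioned above by storing every tag as a list of raw strings.

```python
from collections import defaultdict


def parse_obo_terms(text):
    """Parse [Term] stanzas into a list of dicts.

    Each dict maps a tag (e.g. "id", "xref") to a list of raw string values,
    because most tags may appear more than once in a stanza.
    """
    terms = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = defaultdict(list) if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            # Split on the first ": " only, since values may contain colons.
            tag, value = line.split(": ", 1)
            current[tag].append(value)
    return [dict(term) for term in terms]


# A shortened copy of the GO:0000234 stanza shown above.
SAMPLE = """\
[Term]
id: GO:0000234
name: phosphoethanolamine N-methyltransferase activity
namespace: molecular_function
xref: EC:2.1.1.103
xref: KEGG:R02037
is_a: GO:0008170 ! N-methyltransferase activity
is_a: GO:0008757 ! S-adenosylmethionine-dependent methyltransferase activity
"""

terms = parse_obo_terms(SAMPLE)
print(terms[0]["id"])
print(terms[0]["xref"])
```

Keeping every value as a list means a tag like def, which may appear at most once, simply ends up as a one-element list.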

Any protein annotated with this term, GO:0000234, has “phosphoethanolamine N-methyltransferase activity”, and the term links to the ENZYME, KEGG, MetaCyc, and RHEA databases. It is also a child of (IS_A) two other GO terms, which gives the Gene Ontology its hierarchy. A longer definition and two synonyms are also present. Other GO terms could also link to this same EC or MetaCyc reaction, or to this GO term. Looking at just the IS_A relationships, we have this data structure that will be present in our graph:

[Figure: is_a]
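The same structure can be written down as a list of (source, relationship, target) triples, which is roughly how a graph database thinks about it. This is only a sketch in Python; `is_a_edges` is a hypothetical helper, and the printed arrow notation is just for readability.

```python
# IS_A values copied from the GO:0000234 stanza above.
IS_A_LINES = [
    "GO:0008170 ! N-methyltransferase activity",
    "GO:0008757 ! S-adenosylmethionine-dependent methyltransferase activity",
]


def is_a_edges(child_id, is_a_values):
    """Return (child, "IS_A", parent) triples.

    The text after "!" in each value is a human-readable comment, so only
    the ID before it is kept.
    """
    edges = []
    for value in is_a_values:
        parent_id = value.split("!", 1)[0].strip()
        edges.append((child_id, "IS_A", parent_id))
    return edges


edges = is_a_edges("GO:0000234", IS_A_LINES)
for source, rel, target in edges:
    print(f"({source})-[:{rel}]->({target})")
```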

It’s very simple, but when expanded with the XREF relationships, it becomes a more complex web, as we will see later. For now, let’s add the XREFs:

[Figure: xrefs]

The power here is that you can find all GO terms related to any of the XREFs, even though the IS_A GO terms may not connect directly to those ECs. In fact, GO:0008170 connects to EC:2.1.1.-, which is a parent of EC:2.1.1.103. Adding just these, you find:

[Figure: xrefs-next-level]
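A rough Python sketch of the idea: index GO terms by the XREFs they cite, then step from an EC number up to its parent class and see which GO terms cite that. The helpers `normalize_ec` and `ec_parent` are made up for this example, and the data is limited to the non-Reactome XREFs of the two stanzas printed in this post. Note that the flat file writes the family as EC:2.1.1, which is padded here to EC:2.1.1.-.

```python
from collections import defaultdict

# XREF lists copied from the two stanzas shown in this post (Reactome omitted).
XREFS = {
    "GO:0000234": ["EC:2.1.1.103", "KEGG:R02037", "MetaCyc:2.1.1.103-RXN",
                   "RHEA:20366", "RHEA:20368"],
    "GO:0008168": ["EC:2.1.1"],
}


def normalize_ec(ec):
    """Pad an EC number to four levels, e.g. 'EC:2.1.1' -> 'EC:2.1.1.-'."""
    levels = ec.split(":", 1)[1].split(".")
    levels += ["-"] * (4 - len(levels))
    return "EC:" + ".".join(levels)


def ec_parent(ec):
    """Return the next broader EC class, or None at the top level."""
    levels = normalize_ec(ec).split(":", 1)[1].split(".")
    specified = [level for level in levels if level != "-"]
    if len(specified) <= 1:
        return None
    specified = specified[:-1]
    return "EC:" + ".".join(specified + ["-"] * (4 - len(specified)))


# Reverse index: XREF -> set of GO terms that cite it.
terms_by_xref = defaultdict(set)
for go_id, xrefs in XREFS.items():
    for xref in xrefs:
        key = normalize_ec(xref) if xref.startswith("EC:") else xref
        terms_by_xref[key].add(go_id)

parent = ec_parent("EC:2.1.1.103")
print(parent, sorted(terms_by_xref[parent]))
```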

It is important to know that I am selectively adding terms and relationships here; the full entry for GO:0008168 is:


[Term]
id: GO:0008168
name: methyltransferase activity
namespace: molecular_function
alt_id: GO:0004480
def: "Catalysis of the transfer of a methyl group to an acceptor molecule." [ISBN:0198506732]
subset: goslim_generic
subset: goslim_yeast
subset: gosubset_prok
synonym: "methylase" BROAD []
xref: EC:2.1.1
xref: Reactome:REACT_100745 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Bos taurus"
xref: Reactome:REACT_100863 "methylation of Dopamine to form 3-Methoxytyramine, Taeniopygia guttata"
xref: Reactome:REACT_100884 "methylation of Dopamine to form 3-Methoxytyramine, Xenopus tropicalis"
xref: Reactome:REACT_103958 "methylation of Dopamine to form 3-Methoxytyramine, Rattus norvegicus"
xref: Reactome:REACT_104757 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Danio rerio"
xref: Reactome:REACT_105478 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Bos taurus"
xref: Reactome:REACT_105909 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Xenopus tropicalis"
xref: Reactome:REACT_106465 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Sus scrofa"
xref: Reactome:REACT_107676 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Schizosaccharomyces pombe"
xref: Reactome:REACT_15531 "methylation of Dopamine to form 3-Methoxytyramine, Homo sapiens"
xref: Reactome:REACT_15553 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Homo sapiens"
xref: Reactome:REACT_2094 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Homo sapiens"
xref: Reactome:REACT_29020 "methylation of Dopamine to form 3-Methoxytyramine, Bos taurus"
xref: Reactome:REACT_29098 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Canis familiaris"
xref: Reactome:REACT_33644 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Mycobacterium tuberculosis"
xref: Reactome:REACT_33779 "methylation of Dopamine to form 3-Methoxytyramine, Mycobacterium tuberculosis"
xref: Reactome:REACT_73401 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Rattus norvegicus"
xref: Reactome:REACT_77286 "methylation of Dopamine to form 3-Methoxytyramine, Canis familiaris"
xref: Reactome:REACT_78774 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Mus musculus"
xref: Reactome:REACT_79799 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Mus musculus"
xref: Reactome:REACT_83214 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Gallus gallus"
xref: Reactome:REACT_84525 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Taeniopygia guttata"
xref: Reactome:REACT_84974 "methylation of Dopamine to form 3-Methoxytyramine, Sus scrofa"
xref: Reactome:REACT_86817 "methylation of Dopamine to form 3-Methoxytyramine, Schizosaccharomyces pombe"
xref: Reactome:REACT_86905 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Rattus norvegicus"
xref: Reactome:REACT_87169 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Taeniopygia guttata"
xref: Reactome:REACT_93379 "methylation of Dopamine to form 3-Methoxytyramine, Gallus gallus"
xref: Reactome:REACT_93809 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Xenopus tropicalis"
xref: Reactome:REACT_94274 "Methylation of 3,4-dihydroxypheylacetic acid to homovanillic acid, Canis familiaris"
xref: Reactome:REACT_96444 "methylation of Dopamine to form 3-Methoxytyramine, Mus musculus"
xref: Reactome:REACT_99316 "guanidinoacetate + S-adenosylmethionine => creatine + S-adenosylhomocysteine, Gallus gallus"
is_a: GO:0016741 ! transferase activity, transferring one-carbon groups
relationship: part_of GO:0032259 ! methylation

Not only do you recover the EC family 2.1.1.-, you also get links to REACTOME entries that were not present before, and a new kind of relationship to another GO term: PART_OF instead of IS_A. If we had all of the data in our graph, querying GO:0000234 could reveal the REACTOME entries of GO:0008168. We would need the correct query, since graph databases must be traversed correctly. These are all topics that will be covered later.
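To give a feel for what "traversed correctly" means, here is a small Python sketch of such a walk over an in-memory edge list. It is not Neo4j code, and `reachable_xrefs` is a hypothetical helper. One edge is an assumption: the post never shows the stanza for GO:0008170, so the GO:0008170 IS_A GO:0008168 link is included purely for illustration. Only two of the Reactome XREFs from the entry above are listed.

```python
from collections import deque

# Edges as (source, relationship, target). Most come from the two stanzas
# in this post; the GO:0008170 -> GO:0008168 edge is ASSUMED for illustration.
EDGES = [
    ("GO:0000234", "IS_A", "GO:0008170"),
    ("GO:0000234", "IS_A", "GO:0008757"),
    ("GO:0008170", "IS_A", "GO:0008168"),  # assumption, not shown in the post
    ("GO:0008168", "IS_A", "GO:0016741"),
    ("GO:0008168", "PART_OF", "GO:0032259"),
    ("GO:0008168", "XREF", "Reactome:REACT_15531"),
    ("GO:0008168", "XREF", "Reactome:REACT_2094"),
]


def reachable_xrefs(start, edges, follow=("IS_A",)):
    """Breadth-first walk along `follow` relationships, collecting XREF targets."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for source, rel, target in edges:
            if source != node:
                continue
            if rel == "XREF":
                found.append(target)
            elif rel in follow and target not in seen:
                seen.add(target)
                queue.append(target)
    return found


print(reachable_xrefs("GO:0000234", EDGES))
```

Which relationships to follow is the important choice: walking only IS_A stays inside the hierarchy, while also following PART_OF would pull in terms from GO:0032259 as well.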

This post exists to provide a basic introduction to the concepts of the graph database, and to briefly explore how GO Terms may be connected.

A note on style: In my databases I capitalize the relationships. In this blog I am attempting to capitalize and italicize them to make them stand out. I never pluralize them in the database.