Reading genes.fpkm_tracking into Clojure/Incanter

March 8, 2014

I needed to analyze a large batch (~300 samples) of Cufflinks genes.fpkm_tracking files in Clojure with Incanter. This guide shows how I read the files in, kept only the FPKM values, and combined them into a single dataset. You need a project.clj file that lists the dependencies used below (incanter and me.raynes.fs).
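For reference, a minimal project.clj could look something like the sketch below. The project name `fpkm-analysis` and all of the version numbers are assumptions for illustration, so check Clojars for the releases you actually want. Note that the `me.raynes.fs` namespace ships in the `me.raynes/fs` artifact.

```clojure
;; project.clj (sketch; the project name and version numbers are assumptions)
(defproject fpkm-analysis "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [incanter "1.5.4"]      ; Incanter core, io, stats, charts
                 [me.raynes/fs "1.4.6"]]) ; filesystem helpers (find-files etc.)
```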

(use 'incanter.core 'incanter.io 'incanter.charts 'incanter.stats 'incanter.svg) ; Load the Incanter namespaces (core and io are the ones used below)
(require '[me.raynes.fs :as fs]) ; Needed to query the filesystem and identify all of the files
(def files  ; Define a list of all of the files matching the name and path filters below
  (filter 
    #(re-find #"combined" %) ; I only want files where the directory has "combined" in it
    (map 
      #(.toString %) ; Convert from File type that Java returns to a string
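      (fs/find-files 
        (fs/expand-home "~/workspace/cufflinks_output")  ; The root directory to search; expand-home resolves "~", which Java does not do by itself
        #"genes\.fpkm_tracking")))) ; The file names I want to find (dot escaped so it only matches a literal ".")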
(defn get-accession [x] (second (re-find #"(XX\d\d\d)" x))) ; The directories have XX followed by 3 digits as an identifier, and I want to record this

(def accession-files (map (juxt get-accession identity) files)) ; Creates a sequence of vectors: the first element is the accession ID, the second is the full file path

(println "Files identified:" (count accession-files)) ; So I know it is running

(defn get-FPKMs [data accession] ; Helper: keep only :gene_id and :FPKM, and rename :FPKM to the accession ID
  (rename-cols {:FPKM (keyword accession)} (sel data :cols [:gene_id :FPKM])))

(def set1 [:XX001 :XX002 :XX003 :XX004 :XX005]) ; Separate set of samples, abbreviated here. Used later to do split-analysis

(println "Calculating FPKMs")

(def FPKMs 
  (reduce 
    (fn [x y] ($join [:gene_id :gene_id] y x)) ; Join each newly read table onto the accumulated one by :gene_id
    (pmap ; Pmap to speed up the reading of these files
      (fn 
        [[hm file]]
        (get-FPKMs (read-dataset file :delim \tab :header true) hm))
      accession-files)))

; The result is a single table with a :gene_id column plus one FPKM column per accession (:XX001, :XX002, :XX... etc.)

; Save the output, so we can work with it later without going through the above (time-consuming) steps
(save FPKMs "FPKMs.csv")
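
To pick the table back up in a later session, something along these lines should work. This is only a sketch: `load-fpkm-subset` is a hypothetical helper name, and it assumes the CSV was written with a header row, which `read-dataset` then turns back into keyword column names.

```clojure
(use 'incanter.core 'incanter.io)

(defn load-fpkm-subset
  "Hypothetical helper (not part of the script above): read a saved FPKM table
  and keep :gene_id plus the requested sample columns."
  [csv-file sample-cols]
  (sel (read-dataset csv-file :header true) ; header names come back as keywords by default
       :cols (into [:gene_id] sample-cols)))
```

With the definitions above, (load-fpkm-subset "FPKMs.csv" set1) would give just the set1 samples for the split analysis, without re-reading the ~300 original files.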