I needed to analyze a large batch (~300 samples) of genes.fpkm_tracking files in Clojure with Incanter. This guide shows how I read the files in, kept only the FPKM column from each one, and merged them into a single dataset. You need a project.clj that declares the dependencies used here (incanter, me.raynes.fs).
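; A minimal project.clj sketch. It belongs in its own file at the project root, so it is shown commented out here.
; The project name and every version number are assumptions, not necessarily the ones used for this analysis.
; (defproject fpkm-analysis "0.1.0-SNAPSHOT"
;   :dependencies [[org.clojure/clojure "1.5.1"]
;                  [incanter "1.5.5"]
;                  [me.raynes/fs "1.4.6"]])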
(use 'incanter.core 'incanter.io 'incanter.charts 'incanter.stats 'incanter.svg) ; Load the Incanter namespaces
(require '[me.raynes.fs :as fs]) ; Needed to query the filesystem and identify all of the files
(def files ; All matching files under the root directory
  (filter
    #(re-find #"combined" %) ; Keep only files whose path contains "combined"
    (map
      str ; Convert the java.io.File objects that find-files returns to path strings
      (fs/find-files
        (str (System/getProperty "user.home") "/workspace/cufflinks_output") ; The root directory to search; Java does not expand "~", so build the home path explicitly
        #"genes\.fpkm_tracking")))) ; The file name to match
(defn get-accession [x] (second (re-find #"(XX\d\d\d)" x))) ; The directories have XX followed by 3 digits as an identifier, and I want to record this
(def accession-files (map (juxt get-accession identity) files)) ; Creates a sequence of vectors: the first element is the accession ID, the second is the full file path
(println "Files identified:" (count accession-files)) ; So I know it is running
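; Worked example of the juxt pairing above, as a sketch only.
; The paths below are made up for illustration, and the accession regex is inlined so the snippet stands alone.
(def example-paths ; hypothetical paths, not real files
  ["/data/XX001_combined/genes.fpkm_tracking"
   "/data/XX002_combined/genes.fpkm_tracking"])
(def example-pairs ; juxt applies both functions to each path and returns the results as one vector
  (map (juxt #(second (re-find #"(XX\d\d\d)" %)) identity) example-paths))
; Each element of example-pairs is an [accession path] vector, which is the shape destructured as [[hm file]] further down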
(defn get-FPKMs [data accession] ; Helper: keep :gene_id and :FPKM, renaming :FPKM to the accession ID
  (rename-cols {:FPKM (keyword accession)} (sel data :cols [:gene_id :FPKM])))
(def set1 [:XX001 :XX002 :XX003 :XX004 :XX005]) ; Separate set of samples, abbreviated here. Used later for a split analysis
(println "Calculating FPKMs")
(def FPKMs
  (reduce
    (fn [x y] ($join [:gene_id :gene_id] y x))
    (pmap ; pmap reads the files in parallel to speed things up
      (fn [[hm file]]
        (get-FPKMs (read-dataset file :delim \tab :header true) hm))
      accession-files)))
; The result is a single dataset with a :gene_id column plus one column per accession (:XX001, :XX002, :XX... etc.) holding the FPKM for that sample
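; A minimal sketch of the reduce-and-join idea, using plain Clojure maps instead of Incanter datasets.
; clojure.set/join is a stand-in here (a natural inner join on the shared :gene_id key); it is not how $join is implemented,
; and the two may treat genes missing from a sample differently. Gene IDs and FPKM values are made up for illustration.
(require '[clojure.set :as cset])
(def example-samples ; one relation (a set of row maps) per hypothetical accession
  [#{{:gene_id "geneA" :XX001 1.5} {:gene_id "geneB" :XX001 0.0}}
   #{{:gene_id "geneA" :XX002 2.5} {:gene_id "geneB" :XX002 4.0}}
   #{{:gene_id "geneA" :XX003 0.5} {:gene_id "geneB" :XX003 3.0}}])
(def example-joined ; fold the per-sample relations into one wide relation, joining on :gene_id each time
  (reduce cset/join example-samples))
; Each gene now has a single row holding :gene_id plus one FPKM column per accession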
; Save the output, so we can work with it later without going through the above (time-consuming) steps
(save FPKMs "FPKMs.csv")