I needed to analyze a large batch (~300 samples) of genes.fpkm_tracking files in Clojure with Incanter. This guide shows how I read the files in, kept only the FPKM values, and combined them into a single dataset. You need a project.clj file that lists incanter and me.raynes.fs as dependencies.
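For reference, a minimal project.clj might look something like the sketch below. The project name and all version numbers are my assumptions for illustration, not necessarily the ones used for this analysis; adjust them to whatever is current for you.

```clojure
;; Hypothetical project.clj; the name and versions are assumptions
(defproject fpkm-analysis "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]       ; Incanter (core, io, charts, stats, svg)
                 [me.raynes/fs "1.4.6"]]) ; Filesystem utilities
```

With this in place, `lein repl` from the project directory should give you a REPL where the code in this guide can be pasted in.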
(use 'incanter.core 'incanter.io 'incanter.charts 'incanter.stats 'incanter.svg) ; Load up Incanter, including the stats
(require '[me.raynes.fs :as fs]) ; Needed to query the filesystem and identify all of the files

(def files ; Define a list of all of the matching files
  (filter #(re-find #"combined" %) ; I only want files where the directory has "combined" in it
          (map str ; Convert from the File type that Java returns to a string
               (fs/find-files (fs/expand-home "~/workspace/cufflinks_output") ; The root directory to search (expand-home resolves the ~, which Java does not do on its own)
                              #"genes\.fpkm_tracking")))) ; The file names I want to find

(defn get-accession [x]
  (second (re-find #"(XX\d\d\d)" x))) ; The directories have XX followed by 3 digits as an identifier, and I want to record this

(def accession-files
  (map (juxt get-accession identity) files)) ; Creates a list of vectors: the first element is the accession ID, the second is the filename and path

(println "Files identified:" (count accession-files)) ; So I know it is running

(defn get-FPKMs [data accession] ; Helper function: keep :gene_id and :FPKM, and rename :FPKM to the accession ID
  (rename-cols {:FPKM (keyword accession)}
               (sel data :cols [:gene_id :FPKM])))

(def set1 [:XX001 :XX002 :XX003 :XX004 :XX005]) ; Separate set of samples, abbreviated here. Used later to do split-analysis

(println "Calculating FPKMs")

(def FPKMs
  (reduce (fn [x y] ($join [:gene_id :gene_id] y x)) ; Join each sample onto the accumulated dataset by :gene_id
          (pmap ; Pmap to speed up the reading of these files
            (fn [[hm file]]
              (get-FPKMs (read-dataset file :delim \tab :header true) hm))
            accession-files)))
; This basically converts the datasets to a single table with a column :gene_id, and the rest are :XX001 :XX002 :XX... etc. for the FPKM of each accession

; Save the output, so we can work with it later without going through the above (time-consuming) steps
(save FPKMs "FPKMs.csv")
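If the reduce over $join is hard to picture, here is a small stand-alone sketch of the same shape in plain Clojure, without Incanter. It folds per-sample tables (gene_id plus one FPKM column) into one wide table using clojure.set/join. The gene IDs and FPKM values are made up, and clojure.set/join is an inner join on the shared :gene_id key, so it only illustrates the idea; Incanter's $join may handle genes missing from one sample differently.

```clojure
(require '[clojure.set :as set])

;; Made-up per-sample tables: each row has :gene_id plus one accession column
(def samples
  [[{:gene_id "g1" :XX001 1.5} {:gene_id "g2" :XX001 0.0}]
   [{:gene_id "g1" :XX002 2.5} {:gene_id "g2" :XX002 4.0}]])

;; Fold the tables together, joining on the shared :gene_id key
(def wide
  (sort-by :gene_id
           (reduce set/join (map set samples))))

(doseq [row wide]
  (println row))
```

Each row of `wide` ends up with :gene_id and one key per accession, which is the same layout as the FPKMs dataset.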