Remedial Clojure: Leiningen, Lazytest, and some code

October 17th, 2010

Update/Note 10/27/2011: Some of the code in this post makes use of monolithic clojure.contrib. This library is now deprecated, so please don’t use it in new Clojure code. For more info see “Where Did Clojure.Contrib Go?”


I’m starting to get friendly with Clojure. There’s no better way to learn a new language than to build something and write (that is, reflect) about it. Let’s try building a little program that covers some of the basics. I’m going to assume you have access to some Clojure reference material such as the Clojure wiki or “Programming Clojure”. Here we’ll be using Clojure, Leiningen for builds and Lazytest for continuous testing. This program, histowords, will read a text file, count word occurrences and produce a histogram of the results. Nothing fancy, but the journey of a thousand miles and all that…

The final source for this post is all on github.com.

Initial Project Setup

First, let’s create a new project with Leiningen. Leiningen is a Clojure dependency management and build tool similar to Maven, hopefully without the betrayal and heartache (it’s actually built on top of Maven). Download “lein” and put it on your path somewhere. Of course, you’ll need Java installed and on your path as well.

$ lein new histowords
$ cd histowords

(Note that the first invocation of lein might take a while as it downloads the internet.)

Now we’ve got a skeleton project that looks something like this:

histowords/
|-- README
|-- lib
|-- project.clj
|-- src
|   `-- histowords
|       `-- core.clj
|-- test
    `-- histowords
        `-- test
            `-- core.clj

project.clj is the Leiningen project file where we define our dependencies and everything. Let’s add our Clojure and Lazytest dependencies now:

(defproject histowords "1.0.0-SNAPSHOT"
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]]
  :dev-dependencies [[com.stuartsierra/lazytest "1.1.2"]]
  :repositories {"stuartsierra-releases" "http://stuartsierra.com/maven2"}
  :main histowords.core)

We can now ask Leiningen to download the dependent jars into the lib and lib/dev folders:

$ lein deps

Leiningen has some other useful targets as well:

  • clean – clean the project
  • jar – Build a jar
  • uberjar – Build a standalone jar with Clojure and everything bundled.
  • repl – Start a repl with the classpath all set up

Lazytest Setup

Lazytest is an RSpec-like BDD/testing framework created by Stuart Sierra. Before we write any code, let’s get Lazytest running in “watch” mode so every time we save a file, it’ll re-run our tests automatically. First, we’ll add some setup code to our test file, test/histowords/test/core.clj. Replace the default code generated by Leiningen with this:

(ns histowords.test.core
  (:use [lazytest.describe :only (describe it)])
  (:use histowords.core))

Now, in a new console, fire up Lazytest:

$ cd histowords
$ java -cp "src:test:lib/*:lib/dev/*" lazytest.watch src test

This tells Lazytest to watch for changes in the src/ and test/ directories. You’ll see output like this:

Namespaces (no cases run)

Ran 0 test cases.
0 failures.

Done.

Let’s Write Some Code

As mentioned above, we want to make a histogram by counting words in a file. So the input “Why Betty, why Betty, why?” will generate this:

why   ###
betty ##

I’m going to take a bottom-up approach to this problem. First, let’s break up the input into words, discarding whitespace and punctuation, and converting to lowercase. Here are the tests I came up with:

(describe gather-words
  (it "splits words on whitespace"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "   mary had a\tlittle\n   lamb    ")))
  (it "removes punctuation"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "., mary, had... a little; lamb!")))
  (it "converts words to lower case"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "., MaRy, hAd... A liTTle; lAmb!"))))

Add these to the test file and save it. Lazytest will immediately complain. Try implementing gather-words in src/histowords/core.clj. Here’s what I came up with:

(defn gather-words 
  "Given a string, return a list of lower-case words with whitespace and
   punctuation removed"
  [s]
  (map #(.toLowerCase %) (filter #(not (.isEmpty %)) (seq (.split #"[\s\W]+" s)))))

There are several things going on here. First we split on whitespace and non-word characters (yes, this isn’t perfect). We take that sequence and filter out any empty strings. Finally, we convert the results to lower case. I think the best way to go about this is to add each test one at a time and refine the implementation as you go. That worked well for me anyway.
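To see why the filter step is needed, here is a small REPL-style sketch. The names sample, raw and gather-words-alt are made up for illustration and are not part of the histowords source. It shows the empty string that the split leaves behind when the input starts with punctuation, plus one possible alternative that matches the words directly with re-seq instead of splitting on the gaps.

```clojure
;; Sketch only: sample, raw and gather-words-alt are illustrative names,
;; not part of the original histowords source.
(def sample "., MaRy, hAd... A liTTle; lAmb!")

;; Splitting leaves an empty string at the front, because the input
;; starts with delimiter characters. Trailing empty strings are dropped.
(def raw (seq (.split #"[\s\W]+" sample)))
;; raw => ("" "MaRy" "hAd" "A" "liTTle" "lAmb")

;; Alternative: match runs of word characters, so no empty strings appear
;; and no filter is needed.
(defn gather-words-alt [s]
  (map #(.toLowerCase %) (re-seq #"\w+" s)))

(gather-words-alt sample)
;; => ("mary" "had" "a" "little" "lamb")
```

Both versions treat digits and underscores as word characters, so they should agree on the test inputs above.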

Next, we want to count distinct words in the sequence returned by gather-words. The result will be a map from word to count. Here’s a test:

(describe count-words
  (it "counts words into a map"
    (= {"mary" 2 "why" 3 } (count-words ["why" "mary" "why" "mary" "why"]))))

Again, after you add the test, Lazytest will complain and you can start implementing the count-words function. In an imperative language like Java, you’d create an empty map and then iterate over the word list. If the word’s in the map, increment the count and put it back in the map, otherwise, add an entry to the map with an initial count of 1. Something like this:

public static Map<String, Integer> countWords(Collection<String> words) {
    final Map<String, Integer> result = new HashMap<String, Integer>();
    for(String word : words) {
        final Integer count = result.get(word);
        result.put(word, count != null ? count + 1 : 1);
    }
    return result;
}

In other words, a bureaucratic nightmare.

In Clojure, the idea’s the same, but we don’t need a for-loop. Instead, we can use the reduce function over the list of words. In other languages, this function is known as fold, foldl, inject, accumulate, and other names. Essentially, a callback function is called for each element in the sequence. It’s passed the element and the result of the previous call to the function. So we’re going to take in a word and a map, update the word count and return a new map. Here’s my implementation:

(defn count-words 
  "Take a seq of words and return a map from word to word count"
  [words]
  (reduce 
    (fn [m w] (assoc m w (+ 1 (m w 0)))) 
    {} 
    words))

Note the {} which is the initial value for our accumulator map. Our anonymous function looks up the current count for the word (default to 0 if missing) and increments by 1. Then we return a new map (the assoc function) with the updated count.
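If the accumulator is hard to picture, reductions returns every intermediate value that reduce would pass along. The sketch below pulls the anonymous function out under the made-up name step so it can be traced. It also shows clojure.core/frequencies, which has been in the core library since Clojure 1.2 and builds the same kind of map.

```clojure
;; step is an illustrative name for the anonymous function in count-words.
(def step (fn [m w] (assoc m w (+ 1 (m w 0)))))

;; reductions returns each intermediate accumulator, starting with {}.
(def trace (reductions step {} ["why" "mary" "why"]))
;; trace => ({} {"why" 1} {"why" 1, "mary" 1} {"why" 2, "mary" 1})

;; frequencies computes the same final map in one call.
(def counts (frequencies ["why" "mary" "why" "mary" "why"]))
;; counts => {"why" 3, "mary" 2}
```

Writing the reduce by hand is still a good exercise. frequencies is just the shortcut once the idea is clear.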

So now we have a map from words to counts. The next steps towards our histogram are to “flatten” the map into a list of word/count pairs and sort by count so our histogram looks nice. Conveniently, Clojure maps are already a “seq” of key/value pairs. That is, if we use a map in a context where a sequence is needed, the map will act like a sequence of pairs. This is easy to see in a repl:

$ lein repl
"REPL started; server listening on localhost:46775."
histowords.core=> (seq { "a" 1 "b" 2 "c" 3 })
(["a" 1] ["b" 2] ["c" 3])

So all we need to do is sort by the second field of each pair. Here’s a test:

(describe sort-counted-words
  (it "sorts and returns a list of word/count pairs"
    (= [["a" 1] ["b" 2] ["c" 3]] (sort-counted-words {"b" 2 "c" 3 "a" 1}))))

and here’s my implementation of sort-counted-words:

(defn sort-counted-words 
  "Given a sequence of word/count pairs, sort by count"
  [words]
  (sort-by #(% 1) words))

The sort-by function takes a key function and the sequence to sort. The key function is applied to each element and the results are compared. Here our key function just grabs the second field (index 1) of each pair. This function is so simple it's almost unnecessary, but it's nice when code is readable.

We now have a list of word/count pairs, sorted by count, and just need to turn it into a histogram. We're going to need a function to generate the histogram bars. Here's the test:
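As a quick illustration of sort-by, the sketch below sorts the map from the test using second as the key function, and then passes > as an optional comparator to get the largest counts first. The var names counts, ascending and descending are made up for this example.

```clojure
;; Illustrative names only; the map is the one from the test above.
(def counts {"b" 2 "c" 3 "a" 1})

;; Two-argument form: key function, then the collection.
(def ascending (sort-by second counts))
;; ascending => (["a" 1] ["b" 2] ["c" 3])

;; Three-argument form: key function, comparator, collection.
(def descending (sort-by second > counts))
;; descending => (["c" 3] ["b" 2] ["a" 1])
```

Descending order would put the most frequent words at the top of the histogram, which may or may not be what you want.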

(describe repeat-str
  (it "returns the empty string if count is zero"
    (= "" (repeat-str "*" 0)))
  (it "repeats the input string n times"
    (= "xxxxx" (repeat-str "x" 5))))

Can you implement it?
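One possible answer, offered as a sketch rather than the implementation used in the histowords source: build a sequence of n copies with repeat and concatenate them with str.

```clojure
;; Assumed implementation, not taken from the original source.
(defn repeat-str
  "Return the string s repeated n times"
  [s n]
  (apply str (repeat n s)))

(repeat-str "x" 5)
;; => "xxxxx"
(repeat-str "*" 0)
;; => ""
```

With a count of zero, repeat yields an empty sequence and str with no arguments returns the empty string, so the first test passes without a special case.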

Next let's try generating a single entry in the histogram. histogram-entry will take a word/count pair and the width of the name column as parameters and return a string. Here's the test:

(describe histogram-entry
  (it "can generate a single histogram entry"
    (= "betty   ######" (histogram-entry ["betty" 6] 7))))

Here's my implementation:

(defn histogram-entry 
  "Make a histogram entry for a word/count pair and maximum word width"
  [[w n] width]
  (let [r (- width (.length w))]
    (str w (repeat-str " " r) " " (repeat-str "#" n))))

Note the use of destructuring ([w n]) in the parameter list to bind the word and count from the pair to variables rather than extracting with indices. Otherwise this is pretty straightforward. Calculate the required padding and concatenate some strings.
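To make the destructuring point concrete, here are two hypothetical helpers that do the same thing. The first extracts the fields by index, the second binds them in the parameter list.

```clojure
;; Illustrative helpers, not part of histowords.

;; Without destructuring: index into the pair by hand.
(defn pair-summary-indexed [pair]
  (str (pair 0) ": " (pair 1)))

;; With destructuring: w and n are bound directly from the pair.
(defn pair-summary [[w n]]
  (str w ": " n))

(pair-summary ["betty" 6])
;; => "betty: 6"
```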

Finally, we're ready to pull it all together. The histogram function takes a sequence of word/count pairs and generates a full histogram for them with nice alignment and everything. Here's the test:

(describe histogram
  (it "can generate a histogram from word counts"
    (= "mary ##\nwhy  ###\n" (histogram [["mary" 2] ["why" 3]]))))

And here's the implementation:

(defn histogram 
  "Make a histogram for a seq of word/count pairs"
  [words]
  (let [width (apply max (map #(.length (%1 0)) words))]
    (reduce 
      (fn [acc pair] (str acc (histogram-entry pair width) "\n")) 
      "" 
      words)))

We use the max function to calculate the width of the widest word in the input sequence. Then we use reduce again to generate the output string. Previously we were accumulating a map. This time we're accumulating a string, thus the initial value is the empty string.
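The width calculation is easier to follow in isolation. The sketch below uses made-up var names and also shows one way to guard against an empty input, where apply max would otherwise fail because max needs at least one argument. Seeding max with 0 is an assumption that works here because lengths are never negative.

```clojure
;; Illustrative names only.
(def pairs [["mary" 2] ["why" 3]])

;; Length of each word (index 0 of every pair).
(def lengths (map #(.length (% 0)) pairs))
;; lengths => (4 3)

;; apply spreads the lengths out as arguments: (max 4 3)
(def width (apply max lengths))
;; width => 4

;; Possible guard for an empty seq: seed max with 0.
(def safe-width (apply max 0 (map #(.length (% 0)) [])))
;; safe-width => 0
```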

Pulling It All Together

Now we've got a bunch of passing tests. How do we turn this into a program we can run? Leiningen to the rescue. We'll add a main function to src/histowords/core.clj. Let's string all our functions together there. We'll assume that a file name is given as a command-line parameter, and use slurp to read it into a string:

(defn -main [& args]
  (println 
    (histogram 
      (sort-counted-words 
        (count-words 
          (gather-words 
            (slurp (first args))))))))
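The nested calls read inside-out. As an alternative sketch, the -> threading macro lets the same pipeline read top to bottom. To keep the block runnable on its own, it restates the functions from this post in condensed form, uses an assumed repeat-str, and wraps the pipeline in a hypothetical histowords-text function that takes the text directly instead of a file name.

```clojure
;; Condensed restatement of the post's functions so this block runs alone.
;; repeat-str is an assumed implementation; histowords-text is hypothetical.
(defn gather-words [s]
  (map #(.toLowerCase %)
       (filter #(not (.isEmpty %)) (seq (.split #"[\s\W]+" s)))))

(defn count-words [words]
  (reduce (fn [m w] (assoc m w (+ 1 (m w 0)))) {} words))

(defn sort-counted-words [words]
  (sort-by #(% 1) words))

(defn repeat-str [s n]
  (apply str (repeat n s)))

(defn histogram-entry [[w n] width]
  (str w (repeat-str " " (- width (.length w))) " " (repeat-str "#" n)))

(defn histogram [words]
  (let [width (apply max (map #(.length (%1 0)) words))]
    (reduce (fn [acc pair] (str acc (histogram-entry pair width) "\n"))
            ""
            words)))

;; Each step's result becomes the argument of the next form.
(defn histowords-text [text]
  (-> text
      gather-words
      count-words
      sort-counted-words
      histogram))

(histowords-text "Why Betty, why Betty, why?")
;; => "betty ##\nwhy   ###\n"
```

In -main, the same shape would start from (first args) and add slurp as the first step and println as the last.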

Additionally, we need to tell the Clojure compiler to generate a class for this file so it can be used as a main entry point. At the top of the file, add the :gen-class keyword:

(ns histowords.core
  (:use [clojure.contrib.str-utils :only (str-join)])
  (:gen-class))

With that in place, we can use Leiningen to generate a standalone uber-jar for us:

$ lein uberjar
$ java -jar histowords-1.0.0-SNAPSHOT-standalone.jar mary.txt
against    #
but        #
lingered   #
snow       #
near       #
go         #
still      #
fleece     #
does       ##
did        ##
a          ###
teacher    ###
was        ###
... snip ...
to         ####
turned     ####
day        ####
you        ####
school     #####
so         ######
it         #########
and        ##########
lamb       ############
mary       #############
the        ##############

Conclusions

I think this is pretty cool. I find the code nice and compact while remaining readable. Lazytest makes testing easy and Leiningen gives us the basic features of Maven without the XML nightmare. But this is just a toy app, right? In real life, it'll get a lot messier. The thing I'm most excited about is that, since I started playing with Clojure last week, I've read a bunch of "real-world" Clojure code on GitHub, and it's still just as compact and readable. I know I'm just scratching the surface too. Fun!

There are still some newb questions/issues I have though:

  • How do I get Leiningen to run my Lazytests as part of the build? A plugin maybe? What about Maven integration and JUnit-style reports so I can run all this stuff in Hudson?
  • Sometimes I have to stare at the Lazytest failure messages a bit to figure out what failed.

Anyway, I'm just getting started, so if you see any glaring mistakes or ways to improve the code, I'd love to hear about it.

Categories: clojure