Archive

Posts Tagged ‘lazytest’

Remedial Clojure: What Rhymes with Clojure? Part 1

November 11th, 2010 No comments

Previously we put together some Clojure code to parse a pronouncing dictionary from a file, URL, or other resource. Recall that the resulting structure was a map from a word to a vector of phonemes and stress info. For example:

{ "CLOSURE" [["K" nil] ["L" nil] ["OW" :p] ["ZH" nil] ["ER" :n]] }

Now we’re going to put the dictionary to use. We’ll build a datastructure and function that can find all the rhymes for a word and then, in part 2, we’ll use Compojure to make simple web front-end for it. The final result will look like this:

The final source for this post is all on github.com

Soft Rhyme Classes

Determining whether two words rhyme basically consists of comparing the phonemes from the end of the words. For example, we can see that “BETTY” and “SPAGHETTI” rhyme because they share a common three-phoneme suffix:

  • BETTY [ B – EH – T – IY ]
  • SPAGHETTI [ S – P – AH – G – EH – T – IY ]

This is known as a “hard” rhyme because the phonemes match exactly. To keep things interesting, we’ll be looking for “soft” rhymes, that is phonemes that mostly rhyme, by grouping phonemes into equivalence classes. For example, “BETTY” and “READY” aren’t hard rhymes because the “T” and “D” phonemes are different. Put them in the same soft rhyme class however, and we have a soft rhyme for those two words.

So, to get things started, let’s set up our soft rhyme classes. First we’ll define the classes themselves as sets of phonemes and then build a map from each individual phoneme to its soft rhyme class. This gives us a chance to Clojure’s set literal as well as a for list comprehension.

(def soft-rhyme-classes-seed
  ^{ :doc "Seed for soft-rhyme-classes below. These classes are based on a very
          old dead page accessible only with the wayback machine:
          http://web.archive.org/web/20021018075650/http://web.utk.edu/~blyon/RhymerMiner/FranklinRhymerHelp.htm
          " }
  [#{"AA" "AO" "AW" "OW"}
   #{"AE" "EH"}
   ... more classes ...
   #{"AY"}
   #{"Y"}])

(def soft-rhyme-classes 
  ^{ :doc "A map from phoneme to soft rhyme class" }
  (apply hash-map (flatten (for [klass soft-rhyme-classes-seed, phone klass] [phone klass]))))

Note that I derived these classes many years ago from a webpage that only survives in the Wayback Machine. Thanks Wayback Machine!

for allows us to iterate over the equivalence classes and then their members and generate a key/value pair for each combination. This gives us a list of phoneme/class pairs (each is a vector) which we’d like to turn into a map. Now, the hash-map function creates a map from its arguments, so we need to turn this list of pairs first into a flat list and then use that list as the arguments to hash-map. Clojure has us covered on both counts. To flatten the list, we use, of all things, flatten. Go figure:

user=> (flatten [["KEY1" "VALUE1"] ["KEY2" "VALUE2"] ["KEY3" "VALUE3"]])
("KEY1" "VALUE1" "KEY2" "VALUE2" "KEY3" "VALUE3")

To use any sequence as the arguments of a function, use apply:

user=< (apply hash-map (flatten [["KEY1" "VALUE1"] ["KEY2" "VALUE2"] ["KEY3" "VALUE3"]]))
{"KEY1" "VALUE1", "KEY2" "VALUE2", "KEY3" "VALUE3"}

I’ve found that apply is easy to forget about, but often useful. So don’t forget about it!

The Rhymer

Design

Now that we have our rhyme classes out of the way we can work on calculating all the rhymes for a particular word. This would be pretty simple to do by comparing the tail of the word with the tails of the rest of the words in the dictionary… but at 160K words, that might be slow. Instead we can create a datastructure to speed the process. The basic idea is to create a trie, or prefix tree. The twist is the keys at each level will be soft rhyme classes and we’ll uses suffixes rather than prefixes. So we’ll get something like this:

{
    #{IY ...} {
        #{T D ...} {
            #{EH ...} {
                :words [EDDIE]
                #{B ...} {
                    :words [BETTY]
                }
                #{R ...} {
                    :words [READY]
                }
            }
        }
    }
}

Note that in Clojure any of its built-in datatypes can be used as keys in a map. hashCode() and equals() are taken care of and because they’re immutable we don’t have to worry about keys changing and corrupting the hash table.

Each level of the tree contains an optional list of words, keyed by :words. It’s a list so we can correctly handle homophones such as “BEAR” and “BARE”. Each level also has keys for any further phonemes moving up the suffix of the words. So to find the rhymes for a word, we follow its phonemes in reverse down the suffix tree to a desired depth and then collect all the words in that sub-tree.

Taking “BETTY” with the rhyme tree above as an example, we first reverse its phonemes, [IY T EH B] and then do lookups for the soft rhyme classes of IY, T, EH. This will give us a sub-tree containing “BETTY”, “READY”, and any other words that end with an “-ETTY” sound.

Building the Suffix Tree

Building the rhyme suffix tree is, I think, kind of tricky, so it’s even more important to have some tests. I’m continuing to use lazytest and here we’ll meet the given macro to set up a common environment for all our tests. given works basically like let, except that its body must contain test examples rather than arbitrary expressions. I use given to load a test dictionary from the classpath and predefine some words I’ll use throughout the tests:

(describe "with a test dictionary"
  (given [resource (.. (Thread/currentThread) 
                        getContextClassLoader 
                        (getResource "rhymetime/test/test-dict.txt"))
          dict (parse-dictionary resource)
          lisp (dict "LISP")
          asp  (dict "ASP")
          betty (dict "BETTY")
          bear (dict "BEAR")]

    (testing make-rhyme-tree ...

Note the use of getResource to read the test dictionary from the classpath. This is better than using a plain filesystem path because it makes your tests immune to changes in the working directory. Inevitably, a raw filesystem path will work great right up until you start running your tests in a continuous integration environment like Hudson where the working directory isn’t necessarily the root of your project.

Let’s define a function called make-rhyme-tree to build the tree. Here’s a test:

(testing make-rhyme-tree
  (it "builds a rhyme tree from a pronouncing dictionary where words are
      indexed by reversed soft rhyme classes. The words at a given node
      are stored under the :words key"
    (let [tree (make-rhyme-tree dict)]
      (and
        (= ["BETTY"]       (:words (get-in tree (rhyme-tree-path-for betty))))
        (= ["LISP"]        (:words (get-in tree (rhyme-tree-path-for lisp))))
        (= ["BEAR" "BARE"] (:words (get-in tree (rhyme-tree-path-for bear))))
        (= ["ASP"]         (:words (get-in tree (rhyme-tree-path-for asp))))))))

It takes a few helper functions to implement. First we need a way to build the path (list of soft rhyme classes) from the phonemes of a word:

(defn rhyme-tree-path-for
  "Given a word pronunciation, return its path in a rhyme tree"
  [phonemes]
  (->>
    phonemes
    (map first)
    (map soft-rhyme-classes)
    reverse))

Here the ->> (thread) macro comes in handy to show the series of transforms we’re doing on the phoneme list: grab the phoneme from the phoneme/stress pair, look up the soft rhyme class, and finally reverse the whole thing since this is a suffix tree as described above. Once we have a path, we can use the get-in and assoc-in functions to work with the tree. This is much more straightforward than using just get and assoc:

(defn- add-word-to-rhyme-tree
  "Given a rhyme tree and a parsed word, add it to the tree and return
   a new tree"
  [tree [word phonemes]]
  (let [path (rhyme-tree-path-for phonemes)
        sub-tree (get-in tree path)
        words (:words sub-tree)]
    (assoc-in tree path (assoc sub-tree :words (cons word words)))))

Here we look up the current list of words (homophones) at a particular path in the tree and then cons the word into the list. We’re taking advantage of a few things here. First, assoc-in will create the entire path necessary and any missing maps along the path. Second, assoc is happy to take nil as its first parameter. It creates a new map. Similarly, cons treats nil as an empty seq:

user=> (assoc-in nil [:a :long :non-existant :path] "value")
{:a {:long {:non-existant {:path "value"}}}}

user=> (assoc nil "key" "value")
{"key" "value"}

user=> (cons "value" nil)
("value")

Little touches like this help make Clojure code much more compact than equivalent Java code which would be riddled with checks for null. This nil-handling also plays an important part in laziness.

Now, given a dictionary, we can pull this together to build the full rhyme tree. It’s a little anti-climactic:

(defn make-rhyme-tree
  "Make a rhyme tree from a pronouncing dictionary. This constructs the
   raw tree structure. See make-rhymer for the actual rhyme calculation"
  [dict]
  (reduce add-word-to-rhyme-tree {} dict))

Just apply add-word-to-rhyme-tree to the key/values in the dictionary, accumulating into an initially empty map. For the reasons given above, the initial value could also just be nil, but I think {} makes it clearer that we’re iteratively filling in a map.

Implementing Lookup

Now that we’ve got our suffix tree, we just need a function for looking up rhymes. We’ll define a function that, given a dictionary, returns a function we can use to look up rhymes. I think this kind of “encapsulation” is a neat feature of functional programming. Rather than having to keep track of the context (the dictionary) and always pass it to some function, like the make-rhyme-tree test above, we capture the context in a closure and return a function (or functions) to operate on that context.

Continuing from our tests above, here’s the test for our “rhymer” function:

(testing make-rhymer
  (it "returns a function that calculates rhymes for a word"
    (let [rhymer (make-rhymer dict)]
      (= #{"BETTY" "READY" "SPAGHETTI" "MACARONI"} (set (rhymes "MACARONI" 1)))
      (= #{"BETTY" "READY" "SPAGHETTI" }           (set (rhymes "BETTY" 2)))))
  (it "returns a function that returns an empty seq for unknown words"
    (let [rhymes (make-rhymer dict)]
      (empty? (rhymes "MACARONIX" 1)))))

We turn results into sets so we don’t have to worry about the ordering of the results.

Now, to implement this we need to look up a sub-tree for a word and then collect all the words in that sub-tree. Preferably, the returned sequence of words should be lazy because it’s good to be lazy. It took me quite a while to come up with the right approach for this, and I’m definitely not convinced this is the best approach:

(defn- collect-words-in-sub-tree
  "Returns a lazy seq of all the words in the :words keys of the given sub-tree."
  [coll]
  (lazy-seq 
    (when-let [s (seq coll)]
      (let [[k v] (first s) 
            r (rest s)]
        (concat (if (= :words k) 
                  v 
                  (collect-words-in-sub-tree v))
                (collect-words-in-sub-tree r))))))

As described in Programming Clojure, this seems to be the canonical pattern for the usage of the lazy-seq macro. lazy-seq returns a seq-able object that doesn’t actually invoke any code until the first time it’s used. The pattern is basically, do whatever you’d normally do in a recursive way, and then wrap it in lazy-seq. It’s common to have a when-let test to end recursion and then a let which grabs the first item which is processed, and rest of the collection which is used for the recursion.

Here I test whether I’m looking at the :words key. If I am, I use it, otherwise it’s another node so we recurse. Also of interest is the use of the destructuring [k v] in the let. This is one of my favorite features of Clojure so far.

As I mentioned, it took me a while to get to this point. I’m sure it will get easier with experience, but any pointers or guidance would be appreciated.

Now let’s finally make our rhymer function:

(defn make-rhymer
  "Returns a function that can lookup up rhymes for a word in the given dictionary"
  [dict]
  (let [tree (make-rhyme-tree dict)]
    (fn [word depth] 
      (when-let [phonemes (dict word)]
        (let [path     (take depth (rhyme-tree-path-for phonemes))
              sub-tree (get-in tree path)]
          (collect-words-in-sub-tree sub-tree))))))

This function has two parts. The outer let is where we build the rhyme tree. Then we return an anonymous function which takes a word and a depth and returns a list of rhymes. The rhyme tree is captured by the anonymous function giving the encapsulation mentioned above. Otherwise, the implementation is pretty mundane. Look up the word in the dictionary, use its phonemes to do a lookup in the rhyme tree, and return the list of rhyming words. Yay, we’re done!

Here’s an example usage at the repl. Note that I had to bump the max heap size of the JVM to 512MB to load the full CMU dictionary (-Xmx512M):

user=> (use 'rhymetime.pronounce)
nil

user=> (use 'rhymetime.rhyme)
nil

user=> (def dict (parse-dictionary "https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.7a"))
#'user/dict

user=> (def rhymer (make-rhymer dict))
#'user/rhymer

; helper to set the depth to (# of phonemes - 1)
user=> (defn h [w] (rhymer w (dec (count (dict w)))))
#'user/h

user=> (h "CLOSURE")
("FORECLOSURE" "ENCLOSURE" "INCLOSURE" "ENCLOSURE(1)" 
 "DISCLOSURE" "ZLOGAR" "CLOSURE" "CLOTURE" "CLOSER(1)" 
 "LOESER" "LOSURE" "LOGAR" "LOKER" "LOESCHER" "PLOEGER")

user=> (h "MONAD")
("NOMAD" "MONAD" "NANDONET" "AUTOMAP" "GLYCOMED(1)" 
 "GONAD" "BIOMET" "BIOMED")

user=> (h "JAVA")
("MOFFA" "NAVA" "GUAVA" "HOFFA" "SAVA" "SCOZZAFAVA" "FAVA" "RYAVA" 
 "LACAVA" "PENKAVA" "RUBALCAVA" "CAVA" "JAVA" "ACTAVA" "VOTAVA" 
 "BOFFA" "BAVA" "GUSTAVA" "MUSTAFA" "SRIVASTAVA" "SRIVASTAVA(1)" 
 "STAFFA" "SLAVA" "BRATTASLAVA" "LAVA")

Conclusion

That wasn’t so hard. The more you work with Clojure, the more you get a feel for how the various seemingly simple pieces can be composed to build powerful software. In part 2, we’ll take what we built here and build a web-based UI for it using Compojure. Until then, keep Clojuring.

Categories: clojure Tags: , ,

Remedial Clojure: duck-streams, let style, and pronunciation

October 22nd, 2010 3 comments

This is the “If it weren’t for my family and my stupid conscience I’d drive to Clojure Conj” edition of Remedial Clojure


UPDATE, 10/16/2011: Ok, it’s been a year. Now clojure.contrib is really, truly deprecated. Don’t use duck-streams or anything else from monolithic contrib. See Where Did Clojure.Contrib Go?, maintained tirelessly by Sean Cornfield, for more info.


EDIT, 10/26/2010: It’s come to my attention that duck-streams is technically deprecated. It’s still in clojure.contrib and usable, but it’s been superseded by functions in clojure.java.io. Everything I say about the usefulness of duck-streams still apply though. Wherever you see (read-lines "foo"), replace it with something like this:

(with-open [rdr (clojure.java.io/reader "foo")] ...)

Ok, here’s another entry in my Clojure journey. Previously, I covered getting started with Leiningen for builds and Lazytest for TDD. I’ll be doing the same here. There are plenty of places where I’m unsure about the most “Clojure-y” way to do things, so feel free to poke holes in the code.

In the next couple posts, I’ll build a super simple rhyming dictionary using the CMU Pronouncing Dictionary to determine whether words rhyme. Here’s the overall structure of the program:

  • Read the lines of the dictionary using the clojure.contrib.duck-streams library.
  • Parse each line
  • Build a map from words to pronunciations
  • Create a function for determining whether two words rhyme (part 2)
  • Building a fast datastructure for finding all rhymes for a word (part 2)

The final source for this post is all on github.com.

Initial Project Setup

The setup for this project is pretty vanilla. I’m using Leiningen again, following the same procedure as my last post to create a stub project. This time I’m going to create a new namespace, rhymetime.pronounce where we’ll keep all the code for parsing the dictionary. We’ll end up with these new files:

  • src/rhymetime/pronounce.clj
  • test/rhymetime/test/pronounce.clj

Let’s Code

Ok, let’s get down to it. First we want to read the dictionary from file… but that’s a little boring. What if we want to read it from the web, or a string, or a test resource? This is where the clojure.contrib.duck-streams library comes in. Given an input (a file name, url, reader, input stream, etc), it magically determines how to open it. Well, maybe not magically, but it does its best to do what you’d expect it to do. Here are some examples:

(use 'clojure.contrib.duck-streams)

; Lazy seq of lines from file named README
(read-lines "README")
; ... or ...
(read-lines (java.io.File. "README"))

; Lazy seq of lines from a URL (2009 Tornado data)
(read-lines "http://www.spc.noaa.gov/wcm/data/2009_torn.csv")
; ... or ...
(read-lines (java.net.URL. "http://www.spc.noaa.gov/wcm/data/2009_torn.csv"))

In these examples, read-lines can be replaced with reader which returns a buffered Java Reader as well as slurp* which returns the contents of the file or resource as a single string.

Anyway, the nice thing about duck-streams is that our parser can be agnostic about where input comes from without adding a bunch of configuration hassle. It also helps with testing!

“… It’s not that hard: Na-ghee-na-na-jar. Nagheenanajar. “

The CMU Pronouncing Dictionary has been around for a long, long time. It’s a flat-file database of about 130,000 words. Each word includes a pronunciation as a list of Arpabet (which is really fun to say) phonemes, each with an optional stress. Each entry has the following form:

word phoneme0 phoneme1 ...

and each phoneme consists of alphabetic characters followed by, if it’s a vowel sound, an optional stress identifier, “0”, “1”, or “2”. Here are some examples:

BETTY  B EH1 T IY0
READY  R EH1 D IY0
SPAGHETTI  S P AH0 G EH1 T IY0
MACARONI  M AE2 K ER0 OW1 N IY0

Creating small…

Let’s write some tests first. In test/rhymetime/test/pronounce.clj, starting bottom-up, we’ll try to parse a single phoneme into a phoneme/stress pair:

(describe parse-phoneme
  (it "can parse a phoneme with no stress"
    (= ["K" nil] (parse-phoneme "K")))
  (it "can parse a phoneme with null stress"
    (= ["AH" :n] (parse-phoneme "AH0")))
  (it "can parse a phoneme with primary stress"
    (= ["AH" :p] (parse-phoneme "AH1")))
  (it "can parse a phoneme with null stress"
    (= ["AH" :s] (parse-phoneme "AH2"))))

Here I’ve decided to represent these entries as a vector pair rather than, say, a map. I honestly don’t know if this is good style or not. The map might be clearer, and memory usage isn’t much of a concern since the set of all phonemes is small and could be easily cached. Thoughts?

This seems like a good place for a regex. In Clojure, regular expression matches are sequences and can participate in destructuring so you can very easily extract values from a match without a lot of fuss. Here’s what I came up with:

(defn parse-phoneme
  "Parse a phoneme into a phone/accent pair"
  [s]
  (let [[_ phone stress] (re-matches #"([A-Z]+)([012])?" s)]
    [phone (case stress "0" :n "1" :p "2" :s nil)]))

Here, the result of the regex match is unpacked into [_ phone stress]. When you don’t care about a particular field when destructuring, I guess it’s Clojure convention to use an underscore. Done. I use a case to map from the dictionary’s stress code to a symbolic code, :n for none, :p for primary, and :s for secondary.

... testable ...

Parsing a full entry is also pretty easy. Here's the test:

(describe parse-entry
  (it "can parse an entry with a name and phoneme list"
    (= { :word "CLOJURE" 
         :phonemes [["K" nil] ["L" nil]["OW" :p] ["ZH" nil] ["ER" :n]] } 
       (parse-entry "CLOJURE K L OW1 ZH ER0"))))

and the code:

(defn parse-entry
  "Parse a dictionary entry"
  [s]
  (let [parts (re-split #"\s+" s)]
    { :word     (first parts) 
      :phonemes (map parse-phoneme (rest parts)) }))

Just split the line on whitespace. Use the first result as the word, and apply parse-phoneme to the rest. Is returning a map here necessary or good? I think it makes the code more readable, but if that's the case why didn't I do the same for parse-phoneme. Lau B. Jensen has some interesting thoughts on this topic in his "Taking Uncle Bob to School" article. Using a map gives us somewhere to stick additional info in the future. YAGNI? If I'm wrong, at least I'm in good company :)

... functions ...

Finally, we want to parse an entire dictionary. We'll complicate things a bit because the we want support for comments (lines starting with a ;) and empty lines. But first, we need to set up some tests:

(describe parse-dictionary
  (given [line1 "CLOSURE K L OW1 ZH ER0"
          line2 "MACARONI  M AE2 K ER0 OW1 N IY0"
          expected {"CLOSURE"  (:phonemes (parse-entry line1))
                    "MACARONI" (:phonemes (parse-entry line2))}
          input (fn [& s] (java.io.StringReader. (apply str s)))]

  (it "can parse multiple entries"
    (= expected (parse-dictionary (input line1 "\n" line2))))

  (it "can parse a dictionary with comments"
    (= expected (parse-dictionary (input "; comment\n" line1 "\n;;; another\n" line2))))

  (it "can parse a dictionary with empty lines"
    (= expected (parse-dictionary (input line1 "\n\n   \n" line2))))))

Here I've used Lazytest's given macro to define some constants used across all test examples. given acts just like a let except its body consists of examples and nested test groups. I pre-define the expected result and then test it against various input combinations. It at least saves some typing.

The other interesting bit is the input helper function which converts its arguments into a StringReader suitable for input to duck-streams. Remember what I said about testability above? I think this helper function should really go somewhere else, but I'll let future development guide that ...

... is the Clojure way!

And here's the resulting code for parse-dictionary:

(defn- is-dictionary-entry?
  "Returns true if the given line is a dictionary entry"
  [line]
  (and (not (empty? line))
       (not (.startsWith line ";"))))

(defn parse-dictionary
  [reader]
  (let [lines    (read-lines reader)
        trimmed  (map #(.trim %1) lines)
        filtered (filter is-dictionary-entry? trimmed)
        parsed   (map parse-entry filtered)]
    (reduce #(assoc %1 (:word %2) (:phonemes %2)) {}  parsed)))

I've read that it's idiomatic in Clojure to use the Java libraries when available, but I don't yet know the Clojure core API well enough to choose between them. For example, I originally had (.isEmpty line) in the is-dictionary-entry? helper because I know the Java String API really well off the top of my head. Later, I found the empty? function which seemed more appropriate and, I think, reads better. So, are there better ways to express my usage of .startsWith and .trim? Time will tell ... and now I look at the API doc for empty? and see that I should be using (seq line) instead!

Note that I've broken down the parsing into a number of transformation steps in a let form. For me, this made the data flow much clearer since I can name the results of each steps. Of course, what do I know? So I asked and got some really great feedback both validating my approach and offering equally readable alternatives. In particular, the ->> macro calls multiple forms, passing the return value from each form as the last argument of the next form.

Here's the code above translated to use ->>:

(defn parse-dictionary
  [reader]
  (->> (read-lines reader)
       (map #(.trim %1))
       (filter is-dictionary-entry?)
       (map parse-entry)
       (reduce #(assoc %1 (:word %2) (:phonemes %2)) {})))

It's arguably more readable. It's also less typing because you don't have to pass the results along manually. But having the named results is a nice form of documentation, and if you need to use a result in multiple places downstream, ->> doesn't help much. I guess it's a tossup and depends on the situation.

In the same thread, there was also some interesting discussion of performance characteristics of these approaches and a nice demonstration of using macros to create a test harness for analyzing them.

Conclusion

If you've read this far, I hope you've found this post informative. I know that I've learned a lot in the course of writing and refining just this little bit of code. As I mentioned in the introduction, I'll be adding to it in future posts. Bear with me; I think this is all going somewhere.

What have we learned?

  • We can use given in Lazytest to set up constants used across multiple test examples
  • clojure.contrib.duck-streams is super handy. It makes it easy to read data from multiple sources and, by decoupling our code from specific input sources, makes it easier to test.
  • It's ok to bind a stream of operations in a let, but there are alternatives like the ->> macro.
  • Building up code by creating small, testable functions is preferable to creating a giant, kitchen sink function.
  • The Clojure community is really helpful and friendly, going above and beyond to answer a noob question.

Until next time ...

Remedial Clojure: Leiningen, Lazytest, and some code

October 17th, 2010 6 comments

Update/Note 10/27/2011: Some of the code in the this post makes use of monolithic clojure.contrib. This library is now deprecated so please don’t use it in new Clojure code. For more info see “Where Did Clojure.Contrib Go?


I’m starting to get friendly with Clojure. There’s no better way to learn a new language than to build something and write (that is, reflect) about it. Let’s try building a little program that covers some of the basics. I’m going to assume you have access to some Clojure reference material such as the Clojure wiki or “Programming Clojure”. Here we’ll be using Clojure, Leiningen for builds and Lazytest for continuous testing. This program, histowords, will read a text file, count word occurrences and produce a histogram of the results. Nothing fancy, but the journey of a thousand miles and all that…

The final source for this post is all on github.com.

Initial Project Setup

First, let’s create a new project with Leiningen. Leiningen is a Clojure dependency management and build tool similar to Maven, hopefully without the betrayal and heartache (it’s actually built on top of Maven). Download “lein” and put it on your path somewhere. Of course, you’ll need Java installed and on your path as well.

$ lein new histowords
$ cd histowords

(Note that the first invocation of lein might take awhile while it downloads the internet)

Now we’ve got a skeleton project that looks something like this:

histowords/
|-- README
|-- lib
|-- project.clj
|-- src
|   `-- histowords
|       `-- core.clj
|-- test
    `-- histowords
        `-- test
            `-- core.clj

project.clj is the Leiningen project file where we define our dependencies and everything. Let’s add our Clojure and Lazytest dependencies now:

(defproject histowords "1.0.0-SNAPSHOT"
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]]
  :dev-dependencies [[com.stuartsierra/lazytest "1.1.2"]]
  :repositories {"stuartsierra-releases" "http://stuartsierra.com/maven2"}
  :main histowords.core)

We can now ask Leiningen to download the dependent jars into the lib and lib/dev folders:

$ lein deps

Leiningen has some other useful targets as well:

  • clean – clean the project
  • jar – Build a jar
  • uberjar – Build a standalone jar with Clojure and everything bundled.
  • repl – Start a repl with the classpath all set up

Lazytest Setup

Lazytest is an RSpec-like BDD/testing framework created by Stuart Sierra. Before we write any code, let’s get Lazytest running in “watch” mode so every time we save a file, it’ll re-run our tests automatically. First, we’ll add some setup code to our test file, test/histowords/test/core.clj. Replace the default code generated by Leiningen with this:

(ns histowords.test.core
  (:use [lazytest.describe :only (describe it)])
  (:use histowords.core))

Now, in a new console, fire up Lazytest:

$ cd histowords
$ java -cp "src:test:lib/*:lib/dev/*" lazytest.watch src test

This tells Lazytest to watch for changes in the src/ and test/ directories. You’ll see output like this:

Namespaces (no cases run)

Ran 0 test cases.
0 failures.

Done.

Let’s Write Some Code

As mentioned above, we want to make a histogram by counting words in a file. So the input “Why Betty, why Betty, why?” will generate this:

why   ###
betty ##

I’m going to take a bottom up approach to this problem. First, let’s break up the input into words, discarding whitespace and punctuation, and converting to lowercase. Here’s the tests I came up with:

(describe gather-words
  (it "splits words on whitespace"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "   mary had a\tlittle\n   lamb    ")))
  (it "removes punctuation"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "., mary, had... a little; lamb!")))
  (it "converts words to lower case"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "., MaRy, hAd... A liTTle; lAmb!"))))

Add these to the test file and save it. Lazytest will immediately complain. Try implementing gather-words in src/historwords/core.clj. Here’s what I came up with:

(defn gather-words 
  "Given a string, return a list of lower-case words with whitespace and
   punctuation removed"
  [s]
  (map #(.toLowerCase %) (filter #(not (.isEmpty %)) (seq (.split #"[\s\W]+" s)))))

There are several things going on here. First we split on whitespace and non-word characters (yes, this isn’t perfect). We take that sequence and filter out any empty strings. Finally, we convert the results to lower case. I think the best way to go about this is to add each test one at a time and refine the implementation as you go. That worked well for me anyway.

Next, we want to count distinct words in the sequence returned by gather-words. The result will be a map from word to count. Here’s a test:

(describe count-words
  (it "counts words into a map"
    (= {"mary" 2 "why" 3 } (count-words ["why" "mary" "why" "mary" "why"]))))

Again, after you add the test, Lazytest will complain and you can start implementing the count-words function. In an imperative language like Java, you’d create an empty map and then iterate over the word list. If the word’s in the map, increment the count and put it back in the map, otherwise, add an entry to the map with an initial count of 1. Something like this:

public static Map<String, Integer> countWords(Collection<String> words) {
    final Map<String, Integer> result = new HashMap<String, Integer>();
    for(String word : words) {
        final Integer count = result.get(word);
        result.put(word, count != null ? count + 1 : 1));
    }
    return result;
}

In other words, a bureaucratic nightmare.

In Clojure, the idea’s the same, but we don’t need a for-loop. Instead, we can use the reduce function over the list of words. In other languages, this function is known as fold, foldl, inject, accumulate, and other names. Essentially, a callback function is called for each element in the sequence. It’s passed the element and the result of the previous call to the function. So we’re going to take in a word and a map, update the word count and return a new map. Here’s my implementation:

(defn count-words 
  "Take a seq of words and return a map from word to word count"
  [words]
  (reduce 
    (fn [m w] (assoc m w (+ 1 (m w 0)))) 
    {} 
    words))

Note the {} which is the initial value for our accumulator map. Our anonymous function looks up the current count for the word (default to 0 if missing) and increments by 1. Then we return a new map (the assoc function) with the updated count.

So now we have a map from words to counts. The next steps towards our histogram are to “flatten” the map into a list of word/count pairs and sort by count so our histogram looks nice. Conveniently, Clojure maps are already a “seq” of key/value pairs. That is, if we use a map in a context where a sequence is needed, the map will act like a sequence of pairs. This is easy to see in a repl:

$ lein repl
"REPL started; server listening on localhost:46775."
histowords.core=> (seq { "a" 1 "b" 2 "c" 3 })
(["a" 1] ["b" 2] ["c" 3])

So all we need to do is sort by the second field of each pair. Here’s a test:

(describe sort-counted-words
  (it "sorts and returns a list of word/count pairs"
    (= [["a" 1] ["b" 2] ["c" 3]] (sort-counted-words {"b" 2 "c" 3 "a" 1}))))

and here’s my implementation of sort-counted-words":

(defn sort-counted-words 
  "Given a sequence of word/count pairs, sort by count"
  [words]
  (sort-by #(% 1) words))

The sort-by function takes a comparator function and the sequence to sort. Here our comparator function just grabs the second field (index 1) of each pair. This function is so simple it's almost unnecessary, but it's nice when code is readable.

Now we have a list of word/count pairs, sorted by count. Now we just need to turn it into a histogram. We're going to need a function to generate the histogram bars. Here's the test:

(describe repeat-str
  (it "returns the empty string if count is zero"
    (= "" (repeat-str "*" 0)))
  (it "repeats the input string n times"
    (= "xxxxx" (repeat-str "x" 5))))

Can you implement it?

Next let's try generating a single entry in the histogram. histogram-entry will take a word/count pair and the width of the name column as parameters and return a string. Here's the test:

(describe histogram-entry
  (it "can generate a single histogram entry"
    (= "betty   ######" (histogram-entry ["betty" 6] 7))))

Here's my implementation:

(defn histogram-entry 
  "Make a histogram entry for a word/count pair and maximum word width"
  [[w n] width]
  (let [r (- width (.length w))]
    (str w (repeat-str " " r) " " (repeat-str "#" n))))

Note the use of destructuring ([w n]) in the parameter list to bind the word and count from the pair to variables rather than extracting with indices. Otherwise this is pretty straightforward. Calculate the required padding and concatenate some strings.

Finally, we're ready to pull it all together. The histogram function takes a sequence of word/count pairs and generates a full histogram for them with nice alignment and everything. Here's the test:

(describe histogram
  (it "can generate a histogram from word counts"
    (= "mary ##\nwhy  ###\n" (histogram [["mary" 2] ["why" 3]]))))

And here's the implementation:

(defn histogram 
  "Make a histogram for a seq of word/count pairs"
  [words]
  (let [width (apply max (map #(.length (%1 0)) words))]
    (reduce 
      (fn [acc pair] (str acc (histogram-entry pair width) "\n")) 
      "" 
      words)))

We use the max function to calculate the width of the widest word in the input sequence. The we use reduce again to generate the output string. Previously we were accumulating a map. This time we're accumulating a string, thus the initial value is the empty string.

Pulling It All Together

Now we've got a bunch of passing tests. How do we turn this into a program we can run? Leiningen to the rescue. We'll add a main function to src/histowords/core.clj. Let's string all our functions together there. We'll assume that a file name is given as a command-line parameter, and use slurp to read it into a string:

(defn -main [& args]
  (println 
    (histogram 
      (sort-counted-words 
        (count-words 
          (gather-words 
            (slurp (first args))))))))

Additionally, we need to tell the Clojure compiler to generate a class for this file so it can be used as a main entry point. Att the top of the file, add the :gen-class keyword:

(ns histowords.core
  (:use [clojure.contrib.str-utils :only (str-join)])
  (:gen-class))

With that in place, we can use Leiningen to generate a standalone uber-jar for us:

$ lein uberjar
$ java -jar histowords-1.0.0-SNAPSHOT-standalone.jar mary.txt
against    #
but        #
lingered   #
snow       #
near       #
go         #
still      #
fleece     #
does       ##
did        ##
a          ###
teacher    ###
was        ###
... snip ...
to         ####
turned     ####
day        ####
you        ####
school     #####
so         ######
it         #########
and        ##########
lamb       ############
mary       #############
the        ##############

Conclusions

I think this is pretty cool. I find the code nice and compact while remaining readable. Lazytest makes testing easy and Leiningen gives us the basic features of Maven without the XML nightmare. But, this is just a toy app, right? In real life, it'll get a lot messier. The thing I'm most excited about is that, since I started playing with Clojure last week, I've read a bunch of "realworld" Clojure code in github, and it's still just as compact and readable. I know I'm just scratching the surface too. Fun!

There are still some newb questions/issues I have though:

  • How do I get Leiningen to run my Lazytests as part of the build. A plugin maybe? What about Maven integration and JUnit-style reports so I can run all this stuff in Hudson?
  • Sometimes I have to stare at the Lazytest failure messages a bit to figure out what failed

Anyway, I'm just getting started, so if you see any glaring mistakes or ways to improve the code, I'd love to hear about it.

Categories: clojure Tags: , , ,