Archive

Posts Tagged ‘testing’

Remedial Clojure: duck-streams, let style, and pronunciation

October 22nd, 2010 3 comments

This is the “If it weren’t for my family and my stupid conscience I’d drive to Clojure Conj” edition of Remedial Clojure


UPDATE, 10/16/2011: Ok, it’s been a year. Now clojure.contrib is really, truly deprecated. Don’t use duck-streams or anything else from monolithic contrib. See Where Did Clojure.Contrib Go?, maintained tirelessly by Sean Cornfield, for more info.


EDIT, 10/26/2010: It’s come to my attention that duck-streams is technically deprecated. It’s still in clojure.contrib and usable, but it’s been superseded by functions in clojure.java.io. Everything I say about the usefulness of duck-streams still apply though. Wherever you see (read-lines "foo"), replace it with something like this:

(with-open [rdr (clojure.java.io/reader "foo")] ...)

Ok, here’s another entry in my Clojure journey. Previously, I covered getting started with Leiningen for builds and Lazytest for TDD. I’ll be doing the same here. There are plenty of places where I’m unsure about the most “Clojure-y” way to do things, so feel free to poke holes in the code.

In the next couple posts, I’ll build a super simple rhyming dictionary using the CMU Pronouncing Dictionary to determine whether words rhyme. Here’s the overall structure of the program:

  • Read the lines of the dictionary using the clojure.contrib.duck-streams library.
  • Parse each line
  • Build a map from words to pronunciations
  • Create a function for determining whether two words rhyme (part 2)
  • Building a fast datastructure for finding all rhymes for a word (part 2)

The final source for this post is all on github.com.

Initial Project Setup

The setup for this project is pretty vanilla. I’m using Leiningen again, following the same procedure as my last post to create a stub project. This time I’m going to create a new namespace, rhymetime.pronounce where we’ll keep all the code for parsing the dictionary. We’ll end up with these new files:

  • src/rhymetime/pronounce.clj
  • test/rhymetime/test/pronounce.clj

Let’s Code

Ok, let’s get down to it. First we want to read the dictionary from file… but that’s a little boring. What if we want to read it from the web, or a string, or a test resource? This is where the clojure.contrib.duck-streams library comes in. Given an input (a file name, url, reader, input stream, etc), it magically determines how to open it. Well, maybe not magically, but it does its best to do what you’d expect it to do. Here are some examples:

(use 'clojure.contrib.duck-streams)

; Lazy seq of lines from file named README
(read-lines "README")
; ... or ...
(read-lines (java.io.File. "README"))

; Lazy seq of lines from a URL (2009 Tornado data)
(read-lines "http://www.spc.noaa.gov/wcm/data/2009_torn.csv")
; ... or ...
(read-lines (java.net.URL. "http://www.spc.noaa.gov/wcm/data/2009_torn.csv"))

In these examples, read-lines can be replaced with reader which returns a buffered Java Reader as well as slurp* which returns the contents of the file or resource as a single string.

Anyway, the nice thing about duck-streams is that our parser can be agnostic about where input comes from without adding a bunch of configuration hassle. It also helps with testing!

“… It’s not that hard: Na-ghee-na-na-jar. Nagheenanajar. “

The CMU Pronouncing Dictionary has been around for a long, long time. It’s a flat-file database of about 130,000 words. Each word includes a pronunciation as a list of Arpabet (which is really fun to say) phonemes, each with an optional stress. Each entry has the following form:

word phoneme0 phoneme1 ...

and each phoneme consists of alphabetic characters followed by, if it’s a vowel sound, an optional stress identifier, “0”, “1”, or “2”. Here are some examples:

BETTY  B EH1 T IY0
READY  R EH1 D IY0
SPAGHETTI  S P AH0 G EH1 T IY0
MACARONI  M AE2 K ER0 OW1 N IY0

Creating small…

Let’s write some tests first. In test/rhymetime/test/pronounce.clj, starting bottom-up, we’ll try to parse a single phoneme into a phoneme/stress pair:

(describe parse-phoneme
  (it "can parse a phoneme with no stress"
    (= ["K" nil] (parse-phoneme "K")))
  (it "can parse a phoneme with null stress"
    (= ["AH" :n] (parse-phoneme "AH0")))
  (it "can parse a phoneme with primary stress"
    (= ["AH" :p] (parse-phoneme "AH1")))
  (it "can parse a phoneme with null stress"
    (= ["AH" :s] (parse-phoneme "AH2"))))

Here I’ve decided to represent these entries as a vector pair rather than, say, a map. I honestly don’t know if this is good style or not. The map might be clearer, and memory usage isn’t much of a concern since the set of all phonemes is small and could be easily cached. Thoughts?

This seems like a good place for a regex. In Clojure, regular expression matches are sequences and can participate in destructuring so you can very easily extract values from a match without a lot of fuss. Here’s what I came up with:

(defn parse-phoneme
  "Parse a phoneme into a phone/accent pair"
  [s]
  (let [[_ phone stress] (re-matches #"([A-Z]+)([012])?" s)]
    [phone (case stress "0" :n "1" :p "2" :s nil)]))

Here, the result of the regex match is unpacked into [_ phone stress]. When you don’t care about a particular field when destructuring, I guess it’s Clojure convention to use an underscore. Done. I use a case to map from the dictionary’s stress code to a symbolic code, :n for none, :p for primary, and :s for secondary.

... testable ...

Parsing a full entry is also pretty easy. Here's the test:

(describe parse-entry
  (it "can parse an entry with a name and phoneme list"
    (= { :word "CLOJURE" 
         :phonemes [["K" nil] ["L" nil]["OW" :p] ["ZH" nil] ["ER" :n]] } 
       (parse-entry "CLOJURE K L OW1 ZH ER0"))))

and the code:

(defn parse-entry
  "Parse a dictionary entry"
  [s]
  (let [parts (re-split #"\s+" s)]
    { :word     (first parts) 
      :phonemes (map parse-phoneme (rest parts)) }))

Just split the line on whitespace. Use the first result as the word, and apply parse-phoneme to the rest. Is returning a map here necessary or good? I think it makes the code more readable, but if that's the case why didn't I do the same for parse-phoneme. Lau B. Jensen has some interesting thoughts on this topic in his "Taking Uncle Bob to School" article. Using a map gives us somewhere to stick additional info in the future. YAGNI? If I'm wrong, at least I'm in good company :)

... functions ...

Finally, we want to parse an entire dictionary. We'll complicate things a bit because the we want support for comments (lines starting with a ;) and empty lines. But first, we need to set up some tests:

(describe parse-dictionary
  (given [line1 "CLOSURE K L OW1 ZH ER0"
          line2 "MACARONI  M AE2 K ER0 OW1 N IY0"
          expected {"CLOSURE"  (:phonemes (parse-entry line1))
                    "MACARONI" (:phonemes (parse-entry line2))}
          input (fn [& s] (java.io.StringReader. (apply str s)))]

  (it "can parse multiple entries"
    (= expected (parse-dictionary (input line1 "\n" line2))))

  (it "can parse a dictionary with comments"
    (= expected (parse-dictionary (input "; comment\n" line1 "\n;;; another\n" line2))))

  (it "can parse a dictionary with empty lines"
    (= expected (parse-dictionary (input line1 "\n\n   \n" line2))))))

Here I've used Lazytest's given macro to define some constants used across all test examples. given acts just like a let except its body consists of examples and nested test groups. I pre-define the expected result and then test it against various input combinations. It at least saves some typing.

The other interesting bit is the input helper function which converts its arguments into a StringReader suitable for input to duck-streams. Remember what I said about testability above? I think this helper function should really go somewhere else, but I'll let future development guide that ...

... is the Clojure way!

And here's the resulting code for parse-dictionary:

(defn- is-dictionary-entry?
  "Returns true if the given line is a dictionary entry"
  [line]
  (and (not (empty? line))
       (not (.startsWith line ";"))))

(defn parse-dictionary
  [reader]
  (let [lines    (read-lines reader)
        trimmed  (map #(.trim %1) lines)
        filtered (filter is-dictionary-entry? trimmed)
        parsed   (map parse-entry filtered)]
    (reduce #(assoc %1 (:word %2) (:phonemes %2)) {}  parsed)))

I've read that it's idiomatic in Clojure to use the Java libraries when available, but I don't yet know the Clojure core API well enough to choose between them. For example, I originally had (.isEmpty line) in the is-dictionary-entry? helper because I know the Java String API really well off the top of my head. Later, I found the empty? function which seemed more appropriate and, I think, reads better. So, are there better ways to express my usage of .startsWith and .trim? Time will tell ... and now I look at the API doc for empty? and see that I should be using (seq line) instead!

Note that I've broken down the parsing into a number of transformation steps in a let form. For me, this made the data flow much clearer since I can name the results of each steps. Of course, what do I know? So I asked and got some really great feedback both validating my approach and offering equally readable alternatives. In particular, the ->> macro calls multiple forms, passing the return value from each form as the last argument of the next form.

Here's the code above translated to use ->>:

(defn parse-dictionary
  [reader]
  (->> (read-lines reader)
       (map #(.trim %1))
       (filter is-dictionary-entry?)
       (map parse-entry)
       (reduce #(assoc %1 (:word %2) (:phonemes %2)) {})))

It's arguably more readable. It's also less typing because you don't have to pass the results along manually. But having the named results is a nice form of documentation, and if you need to use a result in multiple places downstream, ->> doesn't help much. I guess it's a tossup and depends on the situation.

In the same thread, there was also some interesting discussion of performance characteristics of these approaches and a nice demonstration of using macros to create a test harness for analyzing them.

Conclusion

If you've read this far, I hope you've found this post informative. I know that I've learned a lot in the course of writing and refining just this little bit of code. As I mentioned in the introduction, I'll be adding to it in future posts. Bear with me; I think this is all going somewhere.

What have we learned?

  • We can use given in Lazytest to set up constants used across multiple test examples
  • clojure.contrib.duck-streams is super handy. It makes it easy to read data from multiple sources and, by decoupling our code from specific input sources, makes it easier to test.
  • It's ok to bind a stream of operations in a let, but there are alternatives like the ->> macro.
  • Building up code by creating small, testable functions is preferable to creating a giant, kitchen sink function.
  • The Clojure community is really helpful and friendly, going above and beyond to answer a noob question.

Until next time ...

Remedial Clojure: Leiningen, Lazytest, and some code

October 17th, 2010 6 comments

Update/Note 10/27/2011: Some of the code in the this post makes use of monolithic clojure.contrib. This library is now deprecated so please don’t use it in new Clojure code. For more info see “Where Did Clojure.Contrib Go?


I’m starting to get friendly with Clojure. There’s no better way to learn a new language than to build something and write (that is, reflect) about it. Let’s try building a little program that covers some of the basics. I’m going to assume you have access to some Clojure reference material such as the Clojure wiki or “Programming Clojure”. Here we’ll be using Clojure, Leiningen for builds and Lazytest for continuous testing. This program, histowords, will read a text file, count word occurrences and produce a histogram of the results. Nothing fancy, but the journey of a thousand miles and all that…

The final source for this post is all on github.com.

Initial Project Setup

First, let’s create a new project with Leiningen. Leiningen is a Clojure dependency management and build tool similar to Maven, hopefully without the betrayal and heartache (it’s actually built on top of Maven). Download “lein” and put it on your path somewhere. Of course, you’ll need Java installed and on your path as well.

$ lein new histowords
$ cd histowords

(Note that the first invocation of lein might take awhile while it downloads the internet)

Now we’ve got a skeleton project that looks something like this:

histowords/
|-- README
|-- lib
|-- project.clj
|-- src
|   `-- histowords
|       `-- core.clj
|-- test
    `-- histowords
        `-- test
            `-- core.clj

project.clj is the Leiningen project file where we define our dependencies and everything. Let’s add our Clojure and Lazytest dependencies now:

(defproject histowords "1.0.0-SNAPSHOT"
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]]
  :dev-dependencies [[com.stuartsierra/lazytest "1.1.2"]]
  :repositories {"stuartsierra-releases" "http://stuartsierra.com/maven2"}
  :main histowords.core)

We can now ask Leiningen to download the dependent jars into the lib and lib/dev folders:

$ lein deps

Leiningen has some other useful targets as well:

  • clean – clean the project
  • jar – Build a jar
  • uberjar – Build a standalone jar with Clojure and everything bundled.
  • repl – Start a repl with the classpath all set up

Lazytest Setup

Lazytest is an RSpec-like BDD/testing framework created by Stuart Sierra. Before we write any code, let’s get Lazytest running in “watch” mode so every time we save a file, it’ll re-run our tests automatically. First, we’ll add some setup code to our test file, test/histowords/test/core.clj. Replace the default code generated by Leiningen with this:

(ns histowords.test.core
  (:use [lazytest.describe :only (describe it)])
  (:use histowords.core))

Now, in a new console, fire up Lazytest:

$ cd histowords
$ java -cp "src:test:lib/*:lib/dev/*" lazytest.watch src test

This tells Lazytest to watch for changes in the src/ and test/ directories. You’ll see output like this:

Namespaces (no cases run)

Ran 0 test cases.
0 failures.

Done.

Let’s Write Some Code

As mentioned above, we want to make a histogram by counting words in a file. So the input “Why Betty, why Betty, why?” will generate this:

why   ###
betty ##

I’m going to take a bottom up approach to this problem. First, let’s break up the input into words, discarding whitespace and punctuation, and converting to lowercase. Here’s the tests I came up with:

(describe gather-words
  (it "splits words on whitespace"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "   mary had a\tlittle\n   lamb    ")))
  (it "removes punctuation"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "., mary, had... a little; lamb!")))
  (it "converts words to lower case"
    (= ["mary" "had" "a" "little" "lamb"] (gather-words "., MaRy, hAd... A liTTle; lAmb!"))))

Add these to the test file and save it. Lazytest will immediately complain. Try implementing gather-words in src/historwords/core.clj. Here’s what I came up with:

(defn gather-words 
  "Given a string, return a list of lower-case words with whitespace and
   punctuation removed"
  [s]
  (map #(.toLowerCase %) (filter #(not (.isEmpty %)) (seq (.split #"[\s\W]+" s)))))

There are several things going on here. First we split on whitespace and non-word characters (yes, this isn’t perfect). We take that sequence and filter out any empty strings. Finally, we convert the results to lower case. I think the best way to go about this is to add each test one at a time and refine the implementation as you go. That worked well for me anyway.

Next, we want to count distinct words in the sequence returned by gather-words. The result will be a map from word to count. Here’s a test:

(describe count-words
  (it "counts words into a map"
    (= {"mary" 2 "why" 3 } (count-words ["why" "mary" "why" "mary" "why"]))))

Again, after you add the test, Lazytest will complain and you can start implementing the count-words function. In an imperative language like Java, you’d create an empty map and then iterate over the word list. If the word’s in the map, increment the count and put it back in the map, otherwise, add an entry to the map with an initial count of 1. Something like this:

public static Map<String, Integer> countWords(Collection<String> words) {
    final Map<String, Integer> result = new HashMap<String, Integer>();
    for(String word : words) {
        final Integer count = result.get(word);
        result.put(word, count != null ? count + 1 : 1));
    }
    return result;
}

In other words, a bureaucratic nightmare.

In Clojure, the idea’s the same, but we don’t need a for-loop. Instead, we can use the reduce function over the list of words. In other languages, this function is known as fold, foldl, inject, accumulate, and other names. Essentially, a callback function is called for each element in the sequence. It’s passed the element and the result of the previous call to the function. So we’re going to take in a word and a map, update the word count and return a new map. Here’s my implementation:

(defn count-words 
  "Take a seq of words and return a map from word to word count"
  [words]
  (reduce 
    (fn [m w] (assoc m w (+ 1 (m w 0)))) 
    {} 
    words))

Note the {} which is the initial value for our accumulator map. Our anonymous function looks up the current count for the word (default to 0 if missing) and increments by 1. Then we return a new map (the assoc function) with the updated count.

So now we have a map from words to counts. The next steps towards our histogram are to “flatten” the map into a list of word/count pairs and sort by count so our histogram looks nice. Conveniently, Clojure maps are already a “seq” of key/value pairs. That is, if we use a map in a context where a sequence is needed, the map will act like a sequence of pairs. This is easy to see in a repl:

$ lein repl
"REPL started; server listening on localhost:46775."
histowords.core=> (seq { "a" 1 "b" 2 "c" 3 })
(["a" 1] ["b" 2] ["c" 3])

So all we need to do is sort by the second field of each pair. Here’s a test:

(describe sort-counted-words
  (it "sorts and returns a list of word/count pairs"
    (= [["a" 1] ["b" 2] ["c" 3]] (sort-counted-words {"b" 2 "c" 3 "a" 1}))))

and here’s my implementation of sort-counted-words":

(defn sort-counted-words 
  "Given a sequence of word/count pairs, sort by count"
  [words]
  (sort-by #(% 1) words))

The sort-by function takes a comparator function and the sequence to sort. Here our comparator function just grabs the second field (index 1) of each pair. This function is so simple it's almost unnecessary, but it's nice when code is readable.

Now we have a list of word/count pairs, sorted by count. Now we just need to turn it into a histogram. We're going to need a function to generate the histogram bars. Here's the test:

(describe repeat-str
  (it "returns the empty string if count is zero"
    (= "" (repeat-str "*" 0)))
  (it "repeats the input string n times"
    (= "xxxxx" (repeat-str "x" 5))))

Can you implement it?

Next let's try generating a single entry in the histogram. histogram-entry will take a word/count pair and the width of the name column as parameters and return a string. Here's the test:

(describe histogram-entry
  (it "can generate a single histogram entry"
    (= "betty   ######" (histogram-entry ["betty" 6] 7))))

Here's my implementation:

(defn histogram-entry 
  "Make a histogram entry for a word/count pair and maximum word width"
  [[w n] width]
  (let [r (- width (.length w))]
    (str w (repeat-str " " r) " " (repeat-str "#" n))))

Note the use of destructuring ([w n]) in the parameter list to bind the word and count from the pair to variables rather than extracting with indices. Otherwise this is pretty straightforward. Calculate the required padding and concatenate some strings.

Finally, we're ready to pull it all together. The histogram function takes a sequence of word/count pairs and generates a full histogram for them with nice alignment and everything. Here's the test:

(describe histogram
  (it "can generate a histogram from word counts"
    (= "mary ##\nwhy  ###\n" (histogram [["mary" 2] ["why" 3]]))))

And here's the implementation:

(defn histogram 
  "Make a histogram for a seq of word/count pairs"
  [words]
  (let [width (apply max (map #(.length (%1 0)) words))]
    (reduce 
      (fn [acc pair] (str acc (histogram-entry pair width) "\n")) 
      "" 
      words)))

We use the max function to calculate the width of the widest word in the input sequence. The we use reduce again to generate the output string. Previously we were accumulating a map. This time we're accumulating a string, thus the initial value is the empty string.

Pulling It All Together

Now we've got a bunch of passing tests. How do we turn this into a program we can run? Leiningen to the rescue. We'll add a main function to src/histowords/core.clj. Let's string all our functions together there. We'll assume that a file name is given as a command-line parameter, and use slurp to read it into a string:

(defn -main [& args]
  (println 
    (histogram 
      (sort-counted-words 
        (count-words 
          (gather-words 
            (slurp (first args))))))))

Additionally, we need to tell the Clojure compiler to generate a class for this file so it can be used as a main entry point. Att the top of the file, add the :gen-class keyword:

(ns histowords.core
  (:use [clojure.contrib.str-utils :only (str-join)])
  (:gen-class))

With that in place, we can use Leiningen to generate a standalone uber-jar for us:

$ lein uberjar
$ java -jar histowords-1.0.0-SNAPSHOT-standalone.jar mary.txt
against    #
but        #
lingered   #
snow       #
near       #
go         #
still      #
fleece     #
does       ##
did        ##
a          ###
teacher    ###
was        ###
... snip ...
to         ####
turned     ####
day        ####
you        ####
school     #####
so         ######
it         #########
and        ##########
lamb       ############
mary       #############
the        ##############

Conclusions

I think this is pretty cool. I find the code nice and compact while remaining readable. Lazytest makes testing easy and Leiningen gives us the basic features of Maven without the XML nightmare. But, this is just a toy app, right? In real life, it'll get a lot messier. The thing I'm most excited about is that, since I started playing with Clojure last week, I've read a bunch of "realworld" Clojure code in github, and it's still just as compact and readable. I know I'm just scratching the surface too. Fun!

There are still some newb questions/issues I have though:

  • How do I get Leiningen to run my Lazytests as part of the build. A plugin maybe? What about Maven integration and JUnit-style reports so I can run all this stuff in Hudson?
  • Sometimes I have to stare at the Lazytest failure messages a bit to figure out what failed

Anyway, I'm just getting started, so if you see any glaring mistakes or ways to improve the code, I'd love to hear about it.

Categories: clojure Tags: , , ,