Archive

Posts Tagged ‘duck-streams’

Remedial Clojure: duck-streams, let style, and pronunciation

October 22nd, 2010 3 comments

This is the “If it weren’t for my family and my stupid conscience I’d drive to Clojure Conj” edition of Remedial Clojure


UPDATE, 10/16/2011: Ok, it’s been a year. Now clojure.contrib is really, truly deprecated. Don’t use duck-streams or anything else from monolithic contrib. See Where Did Clojure.Contrib Go?, maintained tirelessly by Sean Cornfield, for more info.


EDIT, 10/26/2010: It’s come to my attention that duck-streams is technically deprecated. It’s still in clojure.contrib and usable, but it’s been superseded by functions in clojure.java.io. Everything I say about the usefulness of duck-streams still apply though. Wherever you see (read-lines "foo"), replace it with something like this:

(with-open [rdr (clojure.java.io/reader "foo")] ...)

Ok, here’s another entry in my Clojure journey. Previously, I covered getting started with Leiningen for builds and Lazytest for TDD. I’ll be doing the same here. There are plenty of places where I’m unsure about the most “Clojure-y” way to do things, so feel free to poke holes in the code.

In the next couple posts, I’ll build a super simple rhyming dictionary using the CMU Pronouncing Dictionary to determine whether words rhyme. Here’s the overall structure of the program:

  • Read the lines of the dictionary using the clojure.contrib.duck-streams library.
  • Parse each line
  • Build a map from words to pronunciations
  • Create a function for determining whether two words rhyme (part 2)
  • Building a fast datastructure for finding all rhymes for a word (part 2)

The final source for this post is all on github.com.

Initial Project Setup

The setup for this project is pretty vanilla. I’m using Leiningen again, following the same procedure as my last post to create a stub project. This time I’m going to create a new namespace, rhymetime.pronounce where we’ll keep all the code for parsing the dictionary. We’ll end up with these new files:

  • src/rhymetime/pronounce.clj
  • test/rhymetime/test/pronounce.clj

Let’s Code

Ok, let’s get down to it. First we want to read the dictionary from file… but that’s a little boring. What if we want to read it from the web, or a string, or a test resource? This is where the clojure.contrib.duck-streams library comes in. Given an input (a file name, url, reader, input stream, etc), it magically determines how to open it. Well, maybe not magically, but it does its best to do what you’d expect it to do. Here are some examples:

(use 'clojure.contrib.duck-streams)

; Lazy seq of lines from file named README
(read-lines "README")
; ... or ...
(read-lines (java.io.File. "README"))

; Lazy seq of lines from a URL (2009 Tornado data)
(read-lines "http://www.spc.noaa.gov/wcm/data/2009_torn.csv")
; ... or ...
(read-lines (java.net.URL. "http://www.spc.noaa.gov/wcm/data/2009_torn.csv"))

In these examples, read-lines can be replaced with reader which returns a buffered Java Reader as well as slurp* which returns the contents of the file or resource as a single string.

Anyway, the nice thing about duck-streams is that our parser can be agnostic about where input comes from without adding a bunch of configuration hassle. It also helps with testing!

“… It’s not that hard: Na-ghee-na-na-jar. Nagheenanajar. “

The CMU Pronouncing Dictionary has been around for a long, long time. It’s a flat-file database of about 130,000 words. Each word includes a pronunciation as a list of Arpabet (which is really fun to say) phonemes, each with an optional stress. Each entry has the following form:

word phoneme0 phoneme1 ...

and each phoneme consists of alphabetic characters followed by, if it’s a vowel sound, an optional stress identifier, “0”, “1”, or “2”. Here are some examples:

BETTY  B EH1 T IY0
READY  R EH1 D IY0
SPAGHETTI  S P AH0 G EH1 T IY0
MACARONI  M AE2 K ER0 OW1 N IY0

Creating small…

Let’s write some tests first. In test/rhymetime/test/pronounce.clj, starting bottom-up, we’ll try to parse a single phoneme into a phoneme/stress pair:

(describe parse-phoneme
  (it "can parse a phoneme with no stress"
    (= ["K" nil] (parse-phoneme "K")))
  (it "can parse a phoneme with null stress"
    (= ["AH" :n] (parse-phoneme "AH0")))
  (it "can parse a phoneme with primary stress"
    (= ["AH" :p] (parse-phoneme "AH1")))
  (it "can parse a phoneme with null stress"
    (= ["AH" :s] (parse-phoneme "AH2"))))

Here I’ve decided to represent these entries as a vector pair rather than, say, a map. I honestly don’t know if this is good style or not. The map might be clearer, and memory usage isn’t much of a concern since the set of all phonemes is small and could be easily cached. Thoughts?

This seems like a good place for a regex. In Clojure, regular expression matches are sequences and can participate in destructuring so you can very easily extract values from a match without a lot of fuss. Here’s what I came up with:

(defn parse-phoneme
  "Parse a phoneme into a phone/accent pair"
  [s]
  (let [[_ phone stress] (re-matches #"([A-Z]+)([012])?" s)]
    [phone (case stress "0" :n "1" :p "2" :s nil)]))

Here, the result of the regex match is unpacked into [_ phone stress]. When you don’t care about a particular field when destructuring, I guess it’s Clojure convention to use an underscore. Done. I use a case to map from the dictionary’s stress code to a symbolic code, :n for none, :p for primary, and :s for secondary.

... testable ...

Parsing a full entry is also pretty easy. Here's the test:

(describe parse-entry
  (it "can parse an entry with a name and phoneme list"
    (= { :word "CLOJURE" 
         :phonemes [["K" nil] ["L" nil]["OW" :p] ["ZH" nil] ["ER" :n]] } 
       (parse-entry "CLOJURE K L OW1 ZH ER0"))))

and the code:

(defn parse-entry
  "Parse a dictionary entry"
  [s]
  (let [parts (re-split #"\s+" s)]
    { :word     (first parts) 
      :phonemes (map parse-phoneme (rest parts)) }))

Just split the line on whitespace. Use the first result as the word, and apply parse-phoneme to the rest. Is returning a map here necessary or good? I think it makes the code more readable, but if that's the case why didn't I do the same for parse-phoneme. Lau B. Jensen has some interesting thoughts on this topic in his "Taking Uncle Bob to School" article. Using a map gives us somewhere to stick additional info in the future. YAGNI? If I'm wrong, at least I'm in good company :)

... functions ...

Finally, we want to parse an entire dictionary. We'll complicate things a bit because the we want support for comments (lines starting with a ;) and empty lines. But first, we need to set up some tests:

(describe parse-dictionary
  (given [line1 "CLOSURE K L OW1 ZH ER0"
          line2 "MACARONI  M AE2 K ER0 OW1 N IY0"
          expected {"CLOSURE"  (:phonemes (parse-entry line1))
                    "MACARONI" (:phonemes (parse-entry line2))}
          input (fn [& s] (java.io.StringReader. (apply str s)))]

  (it "can parse multiple entries"
    (= expected (parse-dictionary (input line1 "\n" line2))))

  (it "can parse a dictionary with comments"
    (= expected (parse-dictionary (input "; comment\n" line1 "\n;;; another\n" line2))))

  (it "can parse a dictionary with empty lines"
    (= expected (parse-dictionary (input line1 "\n\n   \n" line2))))))

Here I've used Lazytest's given macro to define some constants used across all test examples. given acts just like a let except its body consists of examples and nested test groups. I pre-define the expected result and then test it against various input combinations. It at least saves some typing.

The other interesting bit is the input helper function which converts its arguments into a StringReader suitable for input to duck-streams. Remember what I said about testability above? I think this helper function should really go somewhere else, but I'll let future development guide that ...

... is the Clojure way!

And here's the resulting code for parse-dictionary:

(defn- is-dictionary-entry?
  "Returns true if the given line is a dictionary entry"
  [line]
  (and (not (empty? line))
       (not (.startsWith line ";"))))

(defn parse-dictionary
  [reader]
  (let [lines    (read-lines reader)
        trimmed  (map #(.trim %1) lines)
        filtered (filter is-dictionary-entry? trimmed)
        parsed   (map parse-entry filtered)]
    (reduce #(assoc %1 (:word %2) (:phonemes %2)) {}  parsed)))

I've read that it's idiomatic in Clojure to use the Java libraries when available, but I don't yet know the Clojure core API well enough to choose between them. For example, I originally had (.isEmpty line) in the is-dictionary-entry? helper because I know the Java String API really well off the top of my head. Later, I found the empty? function which seemed more appropriate and, I think, reads better. So, are there better ways to express my usage of .startsWith and .trim? Time will tell ... and now I look at the API doc for empty? and see that I should be using (seq line) instead!

Note that I've broken down the parsing into a number of transformation steps in a let form. For me, this made the data flow much clearer since I can name the results of each steps. Of course, what do I know? So I asked and got some really great feedback both validating my approach and offering equally readable alternatives. In particular, the ->> macro calls multiple forms, passing the return value from each form as the last argument of the next form.

Here's the code above translated to use ->>:

(defn parse-dictionary
  [reader]
  (->> (read-lines reader)
       (map #(.trim %1))
       (filter is-dictionary-entry?)
       (map parse-entry)
       (reduce #(assoc %1 (:word %2) (:phonemes %2)) {})))

It's arguably more readable. It's also less typing because you don't have to pass the results along manually. But having the named results is a nice form of documentation, and if you need to use a result in multiple places downstream, ->> doesn't help much. I guess it's a tossup and depends on the situation.

In the same thread, there was also some interesting discussion of performance characteristics of these approaches and a nice demonstration of using macros to create a test harness for analyzing them.

Conclusion

If you've read this far, I hope you've found this post informative. I know that I've learned a lot in the course of writing and refining just this little bit of code. As I mentioned in the introduction, I'll be adding to it in future posts. Bear with me; I think this is all going somewhere.

What have we learned?

  • We can use given in Lazytest to set up constants used across multiple test examples
  • clojure.contrib.duck-streams is super handy. It makes it easy to read data from multiple sources and, by decoupling our code from specific input sources, makes it easier to test.
  • It's ok to bind a stream of operations in a let, but there are alternatives like the ->> macro.
  • Building up code by creating small, testable functions is preferable to creating a giant, kitchen sink function.
  • The Clojure community is really helpful and friendly, going above and beyond to answer a noob question.

Until next time ...