Archive

Archive for November, 2010

Remedial Clojure: What Rhymes with Clojure? Part 1

November 11th, 2010 No comments

Previously we put together some Clojure code to parse a pronouncing dictionary from a file, URL, or other resource. Recall that the resulting structure was a map from a word to a vector of phonemes and stress info. For example:

{ "CLOSURE" [["K" nil] ["L" nil] ["OW" :p] ["ZH" nil] ["ER" :n]] }

Now we’re going to put the dictionary to use. We’ll build a datastructure and function that can find all the rhymes for a word and then, in part 2, we’ll use Compojure to make simple web front-end for it. The final result will look like this:

The final source for this post is all on github.com

Soft Rhyme Classes

Determining whether two words rhyme basically consists of comparing the phonemes from the end of the words. For example, we can see that “BETTY” and “SPAGHETTI” rhyme because they share a common three-phoneme suffix:

  • BETTY [ B – EH – T – IY ]
  • SPAGHETTI [ S – P – AH – G – EH – T – IY ]

This is known as a “hard” rhyme because the phonemes match exactly. To keep things interesting, we’ll be looking for “soft” rhymes, that is phonemes that mostly rhyme, by grouping phonemes into equivalence classes. For example, “BETTY” and “READY” aren’t hard rhymes because the “T” and “D” phonemes are different. Put them in the same soft rhyme class however, and we have a soft rhyme for those two words.

So, to get things started, let’s set up our soft rhyme classes. First we’ll define the classes themselves as sets of phonemes and then build a map from each individual phoneme to its soft rhyme class. This gives us a chance to Clojure’s set literal as well as a for list comprehension.

(def soft-rhyme-classes-seed
  ^{ :doc "Seed for soft-rhyme-classes below. These classes are based on a very
          old dead page accessible only with the wayback machine:
          http://web.archive.org/web/20021018075650/http://web.utk.edu/~blyon/RhymerMiner/FranklinRhymerHelp.htm
          " }
  [#{"AA" "AO" "AW" "OW"}
   #{"AE" "EH"}
   ... more classes ...
   #{"AY"}
   #{"Y"}])

(def soft-rhyme-classes 
  ^{ :doc "A map from phoneme to soft rhyme class" }
  (apply hash-map (flatten (for [klass soft-rhyme-classes-seed, phone klass] [phone klass]))))

Note that I derived these classes many years ago from a webpage that only survives in the Wayback Machine. Thanks Wayback Machine!

for allows us to iterate over the equivalence classes and then their members and generate a key/value pair for each combination. This gives us a list of phoneme/class pairs (each is a vector) which we’d like to turn into a map. Now, the hash-map function creates a map from its arguments, so we need to turn this list of pairs first into a flat list and then use that list as the arguments to hash-map. Clojure has us covered on both counts. To flatten the list, we use, of all things, flatten. Go figure:

user=> (flatten [["KEY1" "VALUE1"] ["KEY2" "VALUE2"] ["KEY3" "VALUE3"]])
("KEY1" "VALUE1" "KEY2" "VALUE2" "KEY3" "VALUE3")

To use any sequence as the arguments of a function, use apply:

user=< (apply hash-map (flatten [["KEY1" "VALUE1"] ["KEY2" "VALUE2"] ["KEY3" "VALUE3"]]))
{"KEY1" "VALUE1", "KEY2" "VALUE2", "KEY3" "VALUE3"}

I’ve found that apply is easy to forget about, but often useful. So don’t forget about it!

The Rhymer

Design

Now that we have our rhyme classes out of the way we can work on calculating all the rhymes for a particular word. This would be pretty simple to do by comparing the tail of the word with the tails of the rest of the words in the dictionary… but at 160K words, that might be slow. Instead we can create a datastructure to speed the process. The basic idea is to create a trie, or prefix tree. The twist is the keys at each level will be soft rhyme classes and we’ll uses suffixes rather than prefixes. So we’ll get something like this:

{
    #{IY ...} {
        #{T D ...} {
            #{EH ...} {
                :words [EDDIE]
                #{B ...} {
                    :words [BETTY]
                }
                #{R ...} {
                    :words [READY]
                }
            }
        }
    }
}

Note that in Clojure any of its built-in datatypes can be used as keys in a map. hashCode() and equals() are taken care of and because they’re immutable we don’t have to worry about keys changing and corrupting the hash table.

Each level of the tree contains an optional list of words, keyed by :words. It’s a list so we can correctly handle homophones such as “BEAR” and “BARE”. Each level also has keys for any further phonemes moving up the suffix of the words. So to find the rhymes for a word, we follow its phonemes in reverse down the suffix tree to a desired depth and then collect all the words in that sub-tree.

Taking “BETTY” with the rhyme tree above as an example, we first reverse its phonemes, [IY T EH B] and then do lookups for the soft rhyme classes of IY, T, EH. This will give us a sub-tree containing “BETTY”, “READY”, and any other words that end with an “-ETTY” sound.

Building the Suffix Tree

Building the rhyme suffix tree is, I think, kind of tricky, so it’s even more important to have some tests. I’m continuing to use lazytest and here we’ll meet the given macro to set up a common environment for all our tests. given works basically like let, except that its body must contain test examples rather than arbitrary expressions. I use given to load a test dictionary from the classpath and predefine some words I’ll use throughout the tests:

(describe "with a test dictionary"
  (given [resource (.. (Thread/currentThread) 
                        getContextClassLoader 
                        (getResource "rhymetime/test/test-dict.txt"))
          dict (parse-dictionary resource)
          lisp (dict "LISP")
          asp  (dict "ASP")
          betty (dict "BETTY")
          bear (dict "BEAR")]

    (testing make-rhyme-tree ...

Note the use of getResource to read the test dictionary from the classpath. This is better than using a plain filesystem path because it makes your tests immune to changes in the working directory. Inevitably, a raw filesystem path will work great right up until you start running your tests in a continuous integration environment like Hudson where the working directory isn’t necessarily the root of your project.

Let’s define a function called make-rhyme-tree to build the tree. Here’s a test:

(testing make-rhyme-tree
  (it "builds a rhyme tree from a pronouncing dictionary where words are
      indexed by reversed soft rhyme classes. The words at a given node
      are stored under the :words key"
    (let [tree (make-rhyme-tree dict)]
      (and
        (= ["BETTY"]       (:words (get-in tree (rhyme-tree-path-for betty))))
        (= ["LISP"]        (:words (get-in tree (rhyme-tree-path-for lisp))))
        (= ["BEAR" "BARE"] (:words (get-in tree (rhyme-tree-path-for bear))))
        (= ["ASP"]         (:words (get-in tree (rhyme-tree-path-for asp))))))))

It takes a few helper functions to implement. First we need a way to build the path (list of soft rhyme classes) from the phonemes of a word:

(defn rhyme-tree-path-for
  "Given a word pronunciation, return its path in a rhyme tree"
  [phonemes]
  (->>
    phonemes
    (map first)
    (map soft-rhyme-classes)
    reverse))

Here the ->> (thread) macro comes in handy to show the series of transforms we’re doing on the phoneme list: grab the phoneme from the phoneme/stress pair, look up the soft rhyme class, and finally reverse the whole thing since this is a suffix tree as described above. Once we have a path, we can use the get-in and assoc-in functions to work with the tree. This is much more straightforward than using just get and assoc:

(defn- add-word-to-rhyme-tree
  "Given a rhyme tree and a parsed word, add it to the tree and return
   a new tree"
  [tree [word phonemes]]
  (let [path (rhyme-tree-path-for phonemes)
        sub-tree (get-in tree path)
        words (:words sub-tree)]
    (assoc-in tree path (assoc sub-tree :words (cons word words)))))

Here we look up the current list of words (homophones) at a particular path in the tree and then cons the word into the list. We’re taking advantage of a few things here. First, assoc-in will create the entire path necessary and any missing maps along the path. Second, assoc is happy to take nil as its first parameter. It creates a new map. Similarly, cons treats nil as an empty seq:

user=> (assoc-in nil [:a :long :non-existant :path] "value")
{:a {:long {:non-existant {:path "value"}}}}

user=> (assoc nil "key" "value")
{"key" "value"}

user=> (cons "value" nil)
("value")

Little touches like this help make Clojure code much more compact than equivalent Java code which would be riddled with checks for null. This nil-handling also plays an important part in laziness.

Now, given a dictionary, we can pull this together to build the full rhyme tree. It’s a little anti-climactic:

(defn make-rhyme-tree
  "Make a rhyme tree from a pronouncing dictionary. This constructs the
   raw tree structure. See make-rhymer for the actual rhyme calculation"
  [dict]
  (reduce add-word-to-rhyme-tree {} dict))

Just apply add-word-to-rhyme-tree to the key/values in the dictionary, accumulating into an initially empty map. For the reasons given above, the initial value could also just be nil, but I think {} makes it clearer that we’re iteratively filling in a map.

Implementing Lookup

Now that we’ve got our suffix tree, we just need a function for looking up rhymes. We’ll define a function that, given a dictionary, returns a function we can use to look up rhymes. I think this kind of “encapsulation” is a neat feature of functional programming. Rather than having to keep track of the context (the dictionary) and always pass it to some function, like the make-rhyme-tree test above, we capture the context in a closure and return a function (or functions) to operate on that context.

Continuing from our tests above, here’s the test for our “rhymer” function:

(testing make-rhymer
  (it "returns a function that calculates rhymes for a word"
    (let [rhymer (make-rhymer dict)]
      (= #{"BETTY" "READY" "SPAGHETTI" "MACARONI"} (set (rhymes "MACARONI" 1)))
      (= #{"BETTY" "READY" "SPAGHETTI" }           (set (rhymes "BETTY" 2)))))
  (it "returns a function that returns an empty seq for unknown words"
    (let [rhymes (make-rhymer dict)]
      (empty? (rhymes "MACARONIX" 1)))))

We turn results into sets so we don’t have to worry about the ordering of the results.

Now, to implement this we need to look up a sub-tree for a word and then collect all the words in that sub-tree. Preferably, the returned sequence of words should be lazy because it’s good to be lazy. It took me quite a while to come up with the right approach for this, and I’m definitely not convinced this is the best approach:

(defn- collect-words-in-sub-tree
  "Returns a lazy seq of all the words in the :words keys of the given sub-tree."
  [coll]
  (lazy-seq 
    (when-let [s (seq coll)]
      (let [[k v] (first s) 
            r (rest s)]
        (concat (if (= :words k) 
                  v 
                  (collect-words-in-sub-tree v))
                (collect-words-in-sub-tree r))))))

As described in Programming Clojure, this seems to be the canonical pattern for the usage of the lazy-seq macro. lazy-seq returns a seq-able object that doesn’t actually invoke any code until the first time it’s used. The pattern is basically, do whatever you’d normally do in a recursive way, and then wrap it in lazy-seq. It’s common to have a when-let test to end recursion and then a let which grabs the first item which is processed, and rest of the collection which is used for the recursion.

Here I test whether I’m looking at the :words key. If I am, I use it, otherwise it’s another node so we recurse. Also of interest is the use of the destructuring [k v] in the let. This is one of my favorite features of Clojure so far.

As I mentioned, it took me a while to get to this point. I’m sure it will get easier with experience, but any pointers or guidance would be appreciated.

Now let’s finally make our rhymer function:

(defn make-rhymer
  "Returns a function that can lookup up rhymes for a word in the given dictionary"
  [dict]
  (let [tree (make-rhyme-tree dict)]
    (fn [word depth] 
      (when-let [phonemes (dict word)]
        (let [path     (take depth (rhyme-tree-path-for phonemes))
              sub-tree (get-in tree path)]
          (collect-words-in-sub-tree sub-tree))))))

This function has two parts. The outer let is where we build the rhyme tree. Then we return an anonymous function which takes a word and a depth and returns a list of rhymes. The rhyme tree is captured by the anonymous function giving the encapsulation mentioned above. Otherwise, the implementation is pretty mundane. Look up the word in the dictionary, use its phonemes to do a lookup in the rhyme tree, and return the list of rhyming words. Yay, we’re done!

Here’s an example usage at the repl. Note that I had to bump the max heap size of the JVM to 512MB to load the full CMU dictionary (-Xmx512M):

user=> (use 'rhymetime.pronounce)
nil

user=> (use 'rhymetime.rhyme)
nil

user=> (def dict (parse-dictionary "https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.7a"))
#'user/dict

user=> (def rhymer (make-rhymer dict))
#'user/rhymer

; helper to set the depth to (# of phonemes - 1)
user=> (defn h [w] (rhymer w (dec (count (dict w)))))
#'user/h

user=> (h "CLOSURE")
("FORECLOSURE" "ENCLOSURE" "INCLOSURE" "ENCLOSURE(1)" 
 "DISCLOSURE" "ZLOGAR" "CLOSURE" "CLOTURE" "CLOSER(1)" 
 "LOESER" "LOSURE" "LOGAR" "LOKER" "LOESCHER" "PLOEGER")

user=> (h "MONAD")
("NOMAD" "MONAD" "NANDONET" "AUTOMAP" "GLYCOMED(1)" 
 "GONAD" "BIOMET" "BIOMED")

user=> (h "JAVA")
("MOFFA" "NAVA" "GUAVA" "HOFFA" "SAVA" "SCOZZAFAVA" "FAVA" "RYAVA" 
 "LACAVA" "PENKAVA" "RUBALCAVA" "CAVA" "JAVA" "ACTAVA" "VOTAVA" 
 "BOFFA" "BAVA" "GUSTAVA" "MUSTAFA" "SRIVASTAVA" "SRIVASTAVA(1)" 
 "STAFFA" "SLAVA" "BRATTASLAVA" "LAVA")

Conclusion

That wasn’t so hard. The more you work with Clojure, the more you get a feel for how the various seemingly simple pieces can be composed to build powerful software. In part 2, we’ll take what we built here and build a web-based UI for it using Compojure. Until then, keep Clojuring.

Categories: clojure Tags: , ,

Compojure, the Repl, and Vars

November 5th, 2010 1 comment

For a future post, I’ve been working on a simple webapp using Compojure, a simple web framework for Clojure. Working with Compojure interactively in the repl gets mentioned a lot, but never with very satisfying answers. So tonight I worked through it and believe I have a reasonable approach that covers everything I want, namely:

  • The ability to start the server in the background of the repl
  • The ability to stop the server in the repl
  • The ability to reload code, hit refresh on the browser, and see my changes

I’ve seen several posts that have a separate namespace (i.e., .clj file) for running the server. This didn’t seem necessary in my tests. Assmuing we’re in web-app.core (src/web_app/code.clj), define your routes and a run function like this:


(defroutes all-routes
  ... route definitions ...)

(defn run
  [options]
  (let [options (merge {:port 8080 :join? true } options)]
    (run-jetty (var all-routes) options)))


There are a couple semi-interesting things going on here. First, we have some default options for the server. When (run) is called, its default behavior is to run the server on port 8080 and block (because of the join? flag, which is important once we’re in the repl).

Secondly, rather than passing all-routes directly to run-jetty, we pass in (var all-routes) (or the #'all-routes reader macro). This way, when we reload our code, the server is using the Var rather than having its own copy. When the root binding changes, the server picks it up without having to restart (see below for a simple example).

Finally, we can use this in the repl:


  ;;; Load the web-app
  user=> (use 'web-app.core :reload)
  nil

  ;;; Run the server with join? false so it doesn't block the repl
  ;;; Also save the result so we can stop the server later
  user=> (def server (run {:join? false}))
  #'user/server
  2010-11-05 20:46:32.797::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
  2010-11-05 20:46:32.800::INFO:  jetty-6.1.14
  2010-11-05 20:46:32.903::INFO:  Started SocketConnector@0.0.0.0:8080

  ;;; Now test the webapp, make some changes to the code, and ...
  user=> (use 'web-app.core :reload)
  nil

  ;;; Now make sure the changes took effect. Yay!
  user=> (use 'web-app.core :reload)
  nil

  ;;; If we need to, we can stop the server and restart it too!
  user=> (.stop server)
  nil
  user=> (def server (run {:join? false}))
  #'user/server
  2010-11-05 20:48:02.022::INFO:  jetty-6.1.14
  2010-11-05 20:48:02.046::INFO:  Started SocketConnector@0.0.0.0:8080
  nil
  user=> 


That’s about it. Way better than restarting the repl on every code change. I bet there’s much slicker way to do this in Compojure, but I haven’t found it yet.

Var?

As mentioned above, passing the var of the request handler is key to having changes take effect when the code is reloaded. Here’s a very simple example of this behavior outside of the webapp context:


  user=> (defn greet [] (println "Hi!"))
  #'user/greet
  user=> (greet)
  Hi!
  nil
  user=> (defn call [f] (f))
  #'user/call
  user=> (call greet)
  Hi!
  nil
  user=> (def wrap (partial call greet))
  #'user/wrap
  user=> (wrap)
  Hi!
  nil
  user=> (defn greet [] (println "HEY!"))
  #'user/greet
  user=> (wrap)
  Hi!
  nil

  ;;; Note that it still prints "Hi!".
  ;;; Let's try that again, but use var instead

  user=> (def wrap (partial call (var greet)))
  #'user/wrap
  user=> (defn greet [] (println "Hi!"))
  #'user/greet
  user=> (wrap)
  Hi!
  nil
  user=> (defn greet [] (println "Hey!"))
  #'user/greet
  user=> (wrap)
  Hey!
  nil

  ;;; This time the change to greet took affect!


Categories: clojure Tags: , ,

A Brief Note on Compojure Routes and Requests

November 3rd, 2010 3 comments

This post is mostly for my own reference, but hopefully someone else will find it helpful. I’ve been playing with Compojure and, maybe because it’s a young framework, have found up-to-date documentation and examples hard to come by. I think this is exacerbated by the fact that Compojure was broken up (I think) into Ring, Clout, Hiccup etc.

Note that I’m using Compojure 0.5.2. Hopefully this info applies to future versions.

Anyway, the problem I had, which seemed really simple, was getting query parameters from the request in route rules. In a lot of ways, my reptillian, Java-poisoned brain was overthinking the issue. Once I stepped back and thought about it, here are the simple rules that became clear to me:

  • The second argument to the GET macro (or POST, etc) is a binding form. If you give it just a symbol, you’ll get the entire request map (see below)
  • The request binding form can use destructuring to pull out the parts of the request map you need.
  • If an entry in defroutes table returns nil for any reason (say you’re just trying to print a query parameter, but you have a typo), Compojure moves on and tries the next entry.

That’s it. I know, I know … dumb, but coming from languages with syntax and lots of “special forms”, it’s easy to let Clojure trick you into thinking there’s more to something than there really is.

Finally, and the real reason for this post, is “What does the request map actually look like, so I know what I can destructure from it?!?”. Here’s a dump of the request map I made once the lightbulb came on:


{
  :body               #<Input org.mortbay.jetty.HttpParser$Input@3e1d25>
  :character-encoding nil, 
  :content-length     nil, 
  :content-type       nil, 
  :cookies            {}, 
  :form-params        {},
  :params             {"word" "bar"}, 
  :query-params       {"word" "bar"},
  :query-string       "word=bar",
  :remote-addr        0:0:0:0:0:0:0:1,
  :request-method     :get,
  :route-params       {},
  :scheme             :http,
  :server-name        "localhost", 
  :uri                "/foo", 
  :server-port        8080, 
  :headers {
    accept          text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, 
    accept-charset  ISO-8859-1,utf-8;q=0.7,*;q=0.7, 
    accept-encoding gzip,deflate, 
    accept-language en-us,en;q=0.5, 
    connection      keep-alive, 
    host            localhost:8080
    keep-alive      300, 
    user-agent      Mozilla/5.0 ...,
  }, 
}

One thing to note is that the keys of :params and :query-params are Strings, not keywords!

Destructure away.

Categories: clojure Tags: ,