Mohegan SkunkWorks

Sat, 26 Jun 2010 20:00:53 EST

finding things in a mongo database

The generic method db.find provides the basic query interface in cl-mongo. It's defined as:

(defgeneric db.find (collection kv &key))

db.find returns documents in the collection which satisfy the query pattern specified by the bson document kv. If you use the keyword :all, db.find will return all documents. If you're looking for all documents with specific (key, value) pairs, you'd specify those as the query pattern:

 (db.find "foo" (kv (kv "k" 3) (kv "l" 4)))
Continue reading "finding things in a mongo database "

Wed, 23 Jun 2010 16:20:59 EST

key-value pairs in cl-mongo

In a key-value database like mongodb the fundamental data element is the key-value pair. Key-value pairs in languages like javascript or python have a separate representation like:

 
{ key : value }

In lisp a natural enough equivalent is the association list, or alternatively the dotted list. Since the javascript expression also creates a key-value object, I wanted to mimic the same semantics in lisp. In order to stay as close as possible to the javascript syntax, and to also create a distinct type I could switch on when dispatching generic function calls, I created two types.
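For comparison, here is what the same key-value pairs look like in python, one of the languages mentioned above:

```python
# A key-value pair in python: the dict literal mirrors the
# javascript { key : value } syntax almost exactly.
pair = {"key": 3}

# Pairs compose into documents the same way a bson document does;
# the lisp association-list analog would be '(("k" . 3) ("l" . 4)).
doc = {"k": 3, "l": 4}

print(pair["key"])
print(doc["l"])
```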

Continue reading "key-value pairs in cl-mongo"

Sun, 20 Jun 2010 10:35:02 EST

literate programming intro to cl-mongo

After the jump you'll find the source code of the cl-mongo demo I gave at the mongonyc event. The code is in literate programming style and available as a gist.

Continue reading "literate programming intro to cl-mongo"

Fri, 18 Jun 2010 15:31:29 EST

mongonyc presentation.

On May 21st of this year the folks at 10gen hosted mongonyc. This conference was the first mongodb conference here in New York. I gave a brief lightning talk on cl-mongo.

The contents of the presentation can be found here.

It was great to see folks from various start-ups here in the City turn up for this. I was especially interested in how mongodb was used.

Continue reading "mongonyc presentation."

Sat, 13 Mar 2010 17:39:02 EST

cl-mongo now on multiple lisps.

cl-mongo now runs on the following lisp implementations: sbcl, clisp, allegro common lisp and clozure common lisp. I was not able to get it working with armed bear common lisp. I originally used sbcl to develop cl-mongo, but it is my goal to be able to run it on most lisps.

To get cl-mongo to run with clisp I needed to make only one fairly minor change in the networking code.

When you send an insert to mongodb, it does not respond. When you send a query request you obviously do get a response back. Since I'm using the same socket connection for writes and reads, I need some mechanism in place to wait for a response when one is expected. A blocking read on the socket is out of the question; that would destroy performance.
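One common way to wait without blocking is to poll the socket for readability with a timeout. Sketched here in python as an illustration of the general pattern, not the actual cl-mongo code; the local socket pair stands in for the mongodb connection:

```python
import select
import socket

def wait_for_reply(sock, timeout=0.1):
    """Poll the socket instead of blocking: return the reply bytes
    if data arrives within `timeout` seconds, else None."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if readable:
        return sock.recv(4096)
    return None

# A local socket pair standing in for the server connection.
client, server = socket.socketpair()

# An "insert" gets no reply: the poll times out and returns None.
assert wait_for_reply(client) is None

# A "query" gets a reply: the server writes, and the poll sees it.
server.sendall(b"reply")
assert wait_for_reply(client) == b"reply"
```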

Continue reading "cl-mongo now on multiple lisps."

Fri, 12 Mar 2010 15:03:57 EST

Adding clojure to an existing slime setup in emacs

The current recommended setup of emacs and slime with clojure is to have elpa handle all the dependencies. As an alternative, you can start a swank server using either the swank plug-in for the leiningen build tool, or the swank plug-in for the maven build tool.

All of this advice is good, but I've been using slime with sbcl and emacs for years and I don't want to start from scratch just to add clojure. In addition, rather than hand things off to a tool like elpa, I'd like to install things myself, so I get to understand how the various pieces work together.

I'm going to show how you too can use the current cvs head for slime, and the current git repos for clojure, clojure-contrib and swank-clojure, to run clojure with slime alongside lisp. I'll provide a few helpful links to get more information on slime and swank. As it turns out, there's currently a bit of incompatibility between clojure and the slime package, but it's minor and easy to work around.

Continue reading "Adding clojure to an existing slime setup in emacs"

Sun, 28 Feb 2010 19:18:52 EST

Connections in CL-MONGO

I revamped the way connections to a mongo database are handled in cl-mongo. In the new implementation each connection is referenced by a unique name. Each connection is stored in a connection registry. Database calls in cl-mongo default to using the :default connection. The connection parameters used by the default connection for host, port and db are accessible through the mongo-default-.. special variables.

defgeneric mongo (&key host port db name) gets the connection referred to by the :name keyword. The default for :name is :default. If no connection with that name exists, a new connection will be created.

defun mongo-show () will show all connections currently in the registry.

defgeneric mongo-swap (left right) will swap two connections. This is useful if you want to use a different default connection, or want to change the parameters on an existing named connection.

defgeneric mongo-close (name) is used to close a connection. The special keyword :all can be used to close all connections in the registry.
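The registry pattern described above is easy to sketch in python. This is an illustration of the general idea with hypothetical names, not the cl-mongo implementation:

```python
# A stand-in for a real database connection (hypothetical).
class Connection:
    def __init__(self, host="localhost", port=27017, db="test"):
        self.host, self.port, self.db = host, port, db

_registry = {}

def mongo(name="default", **params):
    """Return the named connection, creating it if it doesn't exist."""
    if name not in _registry:
        _registry[name] = Connection(**params)
    return _registry[name]

def mongo_show():
    """List all connections currently in the registry."""
    return sorted(_registry)

def mongo_swap(left, right):
    """Swap two named connections, e.g. to change the default."""
    _registry[left], _registry[right] = _registry[right], _registry[left]

def mongo_close(name):
    """Close one connection, or all of them when name is 'all'."""
    if name == "all":
        _registry.clear()
    else:
        _registry.pop(name, None)

mongo()                                   # creates the default connection
mongo("remote", host="db.example.com")    # a second, named connection
mongo_swap("default", "remote")           # the remote parameters become the default
assert mongo().host == "db.example.com"
mongo_close("all")
assert mongo_show() == []
```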

Continue reading "Connections in CL-MONGO"

Wed, 17 Feb 2010 19:04:03 EST

Generating Documentation for cl-mongo

I want to be able to keep the documentation of cl-mongo current when I am adding new features to the package.

The original documentation was generated 'by hand'. I went through the code and made sure I documented the exported classes, methods and functions. This is not really satisfactory. Ideally you want the documentation and the code tightly integrated. Javadoc style documentation, which can be generated from source code annotations, would be ideal.

There are various lisp packages available which will pull out the source code comments for the public interface defined in the packages file.
One such package is Edi Weitz's documentation-template. This package generates an HTML file with the API description based on embedded comments, which is exactly what I was looking for.
The package does have its quirks/features. It doesn't seem to like embedded HTML or markdown formatting in the lisp code comments, so the API descriptions appear somewhat 'flat'.
In addition I can't define an order on the way the API components are presented, and consequently things jump around a bit.
It also hard-codes licensing information and URLs which are not appropriate for me. The way I dealt with this was to take the generated HTML file and to search and replace the licensing information and URLs.

The next challenge was to integrate the HTML file generated by documentation-template with the README.md. This I accomplished by stripping out the HTML header and appending the resulting HTML file to a markdown formatted file.

The result doesn't look bad and more importantly gives me a way to easily keep my documentation up-to-date.

Continue reading "Generating Documentation for cl-mongo "

Sat, 13 Feb 2010 10:29:17 EST

submitting my blog to technorati.com

The purpose of this post is to claim my blog on technorati. For that to work I need to include the claim code they provided to me in this little email:

This is an automatically-generated email.

Thank you for submitting your blog claim on Technorati. Technorati will need to verify that you are an author of the site http://www.mohegan-skunkworks.com/ by looking for a unique code. We have just assigned the claim token 2624RGHKBMNQ to this claim. Please visit http://technorati.com/account/ for more details, including how to use the claim token.

Thank you.



Let's see if this works.

Continue reading "submitting my blog to technorati.com"

Tue, 02 Feb 2010 18:59:51 EST

cl-mongo

mongo is a scalable, high-performance, open source, schema-free, document-oriented database. I was introduced to mongo at the New York mysql meetup. Two things made mongo look attractive: inter-operability and document-centric storage.

I'm familiar with the elephant persistence framework in lisp. However elephant objects are not readable (as far as I know) in languages other than lisp. That makes inter-operating with other platforms difficult. A traditional rdbms requires some sort of schema if you want to use it effectively. Mongo on the other hand is optimized for the kind of free-form document storage I'm looking for.

Mongo comes with a set of drivers, but a lisp driver was missing. This looked like a good project to get better acquainted with both lisp and mongo.

So I set out to try to write one and the result is cl-mongo. At this stage it's close to having the capabilities I'm looking for, but cl-mongo is obviously a work in progress.

Continue reading "cl-mongo"

Sat, 07 Nov 2009 09:49:51 EST

Fourth NYSA Machine Learning Seminar

Friday I attended the 4th Machine Learning Symposium organized by the New York Academy of Sciences (NYSA).

The Symposium program consisted of four main talks given by local experts in the area of machine learning, interspersed with four graduate student talks, a poster session and a terrific lunch.

Since I'm not really hindered by any overwhelming expertise in this area, I'll confine myself to a few breezy impressions of the main talks.

The first one was given by Bob Bell, from AT&T Bell Labs and a member of the team which won the Netflix prize.

What made the contest challenging was not only the huge size of the data set but also the fact that 99% of the data was missing. In addition there were significant differences between training and test data. Regardless of whether a 10% improvement of a movie rating system should be worthy of a million dollar prize, it provided a great way to test classifiers against real world data.

One thing that stood out for me was that a relatively small number of users was responsible for 'most' of the ratings. He mentioned that they identified one user responsible for 5400 ratings on one particular day. This could be a data error on the Netflix side, where the time stamp was somehow misapplied. On the other hand, it sounds like someone was trying to deliberately affect a large swath of ratings.

The final classifier incorporated breakthroughs made by different teams in the earlier stages of this multi-year competition. One such breakthrough was to consider the genres of the movies someone has previously rated to determine future recommendations. That must seem rather obvious in retrospect. The other was a clever technique called collaborative filtering, which takes into account the time-dependency of people's movie preferences.

An ensemble of previously validated classifiers was used to construct the final classifier, and the calculation to get the final result submitted to Netflix took almost a month, primarily because a power failure forced a restart of the calculation engine. In fact the use of an ensemble of classifiers was mentioned as one of the main lessons learned from the contest. The other was the power of matrix factorization (i.e. treating users and preferences as independent parameters and using a matrix to link the two) as a computational tool.

Continue reading "Fourth NYSA Machine Learning Seminar"

Fri, 06 Nov 2009 06:32:19 EST

Embedding Equations in a Blog Post

I'm using a home-grown blogging engine which converts pages formatted in markdown to static html pages served from my github account. If I want to include mathematical equations in my blog post my options are to use inline html code or to use one of the online Latex equation editors.

My requirements are simple: I want to be able to use the usual cast of mathematical symbols inlined in my main text, as well as format large equation blocks. In this post I'll compare and contrast inline html with the online editors provided by SITMO, CodeCogs and Textify.

To save you the trouble of having to wade through miles of text, I'll start off with my

Conclusions

Use HTML for inlining symbols and equations. Although those will never look as good in html as in Latex, the overall format of your text will suffer less.

For equation blocks you have the choice between two online editors: SITMO and CodeCogs. When it comes to embedding latex code into your html both editors are comparable. When you're looking for more variety with regard to fonts or markup languages, CodeCogs is your only alternative.

If you want to use the url link to embed a large code block you probably would want to use a url shortener. Tiny url is a good option here. As an alternative, both editors can generate a png image for you to embed.

Textify has a nice clean interface, but I can't embed the links it generates. My only alternative would appear to be to run my own instance of this service, which is obviously not something I need or want to do. Furthermore, although it cleverly provides a shortened url, it uses bit.ly, which unfortunately doesn't handle complex latex urls well.

Read on to find out how I reached these conclusions.

Continue reading "Embedding Equations in a Blog Post"

Sun, 01 Nov 2009 10:30:37 EST

Toy Problem: Simple One Dimensional Least Squares Learner.

In chapter two of Hastie, Tibshirani and Friedman's 'The Elements of Statistical Learning' the authors discuss the use of least-squares regression to construct a data classifier for linearly separable data.

A set of training data together with the least-squares method is used to construct a hyper-plane in the data space. The classification of a data point depends on what side of the hyper-plane you end up on.

The example in Hastie uses two data classes in a two dimensional parameter space. I didn't grok the example immediately, and I thought it would be helpful to construct my own much simpler example by staying in one dimension and using a simple normal distribution. The rest of this post describes the details.
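A sketch of such a one-dimensional learner in python, under the assumption of two classes drawn from normal distributions centered at -2 and +2 (the post's actual parameters may differ): we regress the 0/1 class labels on position, and the fitted line's crossing of 0.5 plays the role of the separating hyper-plane, which in one dimension is just a point.

```python
import random

# Two classes of training data, labeled 0 and 1.
random.seed(42)
xs = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
     [random.gauss(+2.0, 1.0) for _ in range(200)]
ys = [0.0] * 200 + [1.0] * 200

# Ordinary least squares for a line y = b0 + b1 * x.
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

def classify(x):
    """Classify by which side of the 0.5 crossing the point falls on."""
    return 1 if b0 + b1 * x > 0.5 else 0

# Points well inside either cluster are classified correctly.
assert classify(-2.0) == 0
assert classify(+2.0) == 1
```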

Continue reading "Toy Problem: Simple One Dimensional Least Squares Learner."

Sun, 27 Sep 2009 14:52:50 EST

Processing the Sieve in Python

In a previous post I discussed four methods to multi-thread the Sieve of Eratosthenes in Python. I concluded that multi-threading didn't increase performance, and in fact could have a significant adverse effect. The global interpreter lock (GIL) prevents threads from running concurrently and thus limits the upside of threading. The use of locks or avoiding the use of shared data can then decrease performance quite a bit.

In this section I'll be using Python's multiprocessing module to 'multi-thread' the Sieve.

The multiprocessing module spawns a number of processes and distributes the calculation amongst them. There is no equivalent to the GIL, so I should be able to see some gain in performance as the number of processes increases. On the other hand, spawning processes means that there is startup overhead which may offset any performance gain due to the distribution of its execution across multiple processes. However, I should still be able to investigate how performance scales with the number of processes, and whether the multiprocessing module is able to take advantage of multiple cores. In this post I'll discuss four approaches to distributing the Sieve algorithm, basically following the approaches I discussed earlier when using the multi-threading package. The various approaches differ in the way the load is balanced and whether the state of the sieve is shared.
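A minimal sketch of one such approach in python, distributing segments of the sieve across a process pool. This illustrates the general technique and is not necessarily one of the four approaches in prime_share.py:

```python
from multiprocessing import Pool

def sieve_segment(args):
    """Return the primes in [lo, hi), given the base primes up to sqrt(n)."""
    lo, hi, base_primes = args
    flags = [True] * (hi - lo)
    for p in base_primes:
        # First multiple of p at or above lo (and at least p*p).
        start = max(p * p, (lo + p - 1) // p * p)
        for m in range(start, hi, p):
            flags[m - lo] = False
    return [lo + i for i, ok in enumerate(flags) if ok and lo + i > 1]

def parallel_sieve(n, workers=4):
    """Split [2, n] into one segment per worker and sieve them in parallel."""
    # Base primes up to sqrt(n), computed serially with a small sieve.
    limit = int(n ** 0.5) + 1
    base = [True] * (limit + 1)
    base_primes = []
    for p in range(2, limit + 1):
        if base[p]:
            base_primes.append(p)
            for m in range(p * p, limit + 1, p):
                base[m] = False
    step = (n + workers - 1) // workers
    chunks = [(lo, min(lo + step, n + 1), base_primes)
              for lo in range(2, n + 1, step)]
    with Pool(workers) as pool:
        return sorted(p for seg in pool.map(sieve_segment, chunks) for p in seg)

if __name__ == "__main__":
    primes = parallel_sieve(100)
    assert primes[:5] == [2, 3, 5, 7, 11]
    assert len(primes) == 25
```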

The source for the code discussed here and in the previous post can be found in prime_share.py in the blog-code package on github.

Continue reading "Processing the Sieve in Python"

Sat, 12 Sep 2009 17:03:23 EST

Threading the Sieve in Python

This is the first of two posts on threading and multiprocessing in Python. In this post I'll explore the thread module and in the second post I'll look at Python's multiprocessing module. My starting point is the multi-threaded implementation of the Sieve of Eratosthenes found in this tutorial on multi-threading in Python (pdf).

Threading a compute-bound algorithm like the Sieve consists of subdividing the main task into autonomous sub-tasks which share as little state as possible. Having no shared state eliminates the overhead that inevitably comes with locking. It turns out that Python is not very good at multi-threading compute-bound processes. This is not a surprise. CPython has a global interpreter lock (GIL) which prevents threads from running concurrently.

Regardless, there are other lessons I learned when multi-threading the Sieve algorithm. One is that sharing state between threads may be unavoidable to achieve reasonable performance. In fact, if you don't share state, performance can become predictably worse as the number of threads of execution increases.

The other is that locking can have a surprising impact on performance. It's not just the cost of locking per se, but the effect locking has on the distribution of work between the various threads.
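A minimal python sketch of the shared-state flavor described above: the flag array is shared between all threads, and each thread strikes the multiples of its share of the base primes. This illustrates the technique and is not the code from the tutorial or from prime_share.py:

```python
import threading

def threaded_sieve(n, workers=4):
    """Sieve with a shared flag array: thread i strikes the multiples of
    every workers-th base prime. Every write stores 0, so concurrent
    writes never disagree about a slot's final value (no lock needed)."""
    flags = bytearray([1] * (n + 1))
    flags[0] = flags[1] = 0

    # Base primes up to sqrt(n), found serially by trial division.
    base = [p for p in range(2, int(n ** 0.5) + 1)
            if all(p % q for q in range(2, p))]

    def strike(primes):
        for p in primes:
            for m in range(p * p, n + 1, p):
                flags[m] = 0

    # Round-robin partition of the base primes over the threads.
    threads = [threading.Thread(target=strike, args=(base[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i in range(n + 1) if flags[i]]

primes = threaded_sieve(100)
assert primes[:5] == [2, 3, 5, 7, 11]
```

Note that the round-robin partition balances the load only roughly: the smallest primes have the most multiples to strike, which is exactly the kind of work-distribution effect the post discusses.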

Continue reading "Threading the Sieve in Python"

Sat, 29 Aug 2009 19:15:49 EST

Simplified command line processing with dyn-options.py

Am I the only one in the world who feels that using python's getopt is a bit of a struggle? It involves a lot of boilerplate, and tedious refactoring is required each time you add or change an option. This is not specific to Python, as most languages have a similar facility to parse the command line, which is similarly annoying.

I decided to create an easier way to process command line options, by transforming the command line into an immutable (read-only) object. The result is dyn_options.

dyn_options considers every string on the command line which starts with either - or -- (i.e. a single or double dash) to be an option flag. The value of the option flag is a concatenation of everything that follows it, until the next flag is encountered. A simple option flag is one without explicit values; it is treated as a boolean flag, set to True. dyn_options creates a read-only object, with attributes and values set to the command line option flags and values respectively.

So, '--opt4 hello world' will be converted to an option flag called opt4, with a value of hello world. This makes dealing with spaces on the command line a lot easier.
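The described behavior can be sketched as follows. This is a simplified re-implementation of the idea, not the dyn_options source:

```python
class FrozenOptions:
    """A read-only bag of options: attribute reads work, writes raise."""
    def __init__(self, **kv):
        object.__setattr__(self, "_kv", dict(kv))
    def __getattr__(self, name):
        try:
            return self._kv[name]
        except KeyError:
            raise AttributeError(name)
    def __setattr__(self, name, value):
        raise AttributeError("options are read-only")

def parse(argv):
    """Every -/-- token starts a flag; the words that follow are joined
    into its value. A flag with no value becomes the boolean True."""
    opts, flag, words = {}, None, []
    def flush():
        if flag is not None:
            opts[flag] = " ".join(words) if words else True
    for token in argv:
        if token.startswith("-"):
            flush()
            flag, words = token.lstrip("-"), []
        else:
            words.append(token)
    flush()
    return FrozenOptions(**opts)

opts = parse("--opt4 hello world -v".split())
assert opts.opt4 == "hello world"
assert opts.v is True
try:
    opts.v = False          # the object is immutable
except AttributeError:
    pass
```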

Continue reading "Simplified command line processing with dyn-options.py"

Sun, 09 Aug 2009 10:27:45 EST

Factorials, Tail Recursion and CPS ... in C

Recursive algorithms are elegant. However, if the recursion is not a tail call, the growth of the stack leads to a stack overflow.

Tail-call recursion is a technique whereby the last call in a recursive function does not depend on the variables pushed on the stack. In other words, the function directly returns the value of its recursive call.

Functional languages like Haskell or Lisp are designed to support the use of tail recursive algorithms. The JVM - although now the target platform of a lisp like clojure and a hybrid functional language like scala - does not support tail recursion at all. In C/C++ the compiler can in fact replace tail recursive calls with a simple loop, thereby eliminating the allocation of additional stack frames altogether. In this post I'll consider various implementations of the humble factorial to illustrate some of these things.
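The progression can also be sketched in python (the post's own examples are in C): a naive recursion, an accumulator-passing tail form, a continuation-passing variant, and the loop that a tail-call-optimizing compiler would effectively produce. Note that python, like the JVM, does not eliminate tail calls; the tail forms only pay off in a language or compiler that does.

```python
def fact_naive(n):
    """Naive recursion: the multiply happens after the recursive call
    returns, so every call needs its own stack frame."""
    return 1 if n == 0 else n * fact_naive(n - 1)

def fact_tail(n, acc=1):
    """Tail form: the recursive call is the last operation and carries
    the running product in `acc`; nothing is needed from this frame
    after the call, so a TCO-capable compiler can reuse it."""
    return acc if n == 0 else fact_tail(n - 1, acc * n)

def fact_cps(n, k=lambda x: x):
    """Continuation-passing style: the rest of the computation travels
    along as the continuation k, making every call a tail call."""
    return k(1) if n == 0 else fact_cps(n - 1, lambda r: k(n * r))

def fact_loop(n):
    """The loop a tail-call-optimizing compiler effectively produces."""
    acc = 1
    while n > 0:
        acc, n = acc * n, n - 1
    return acc

assert fact_naive(10) == fact_tail(10) == fact_cps(10) == fact_loop(10)
```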

Continue reading "Factorials, Tail Recursion and CPS ... in C"

Sun, 19 Jul 2009 11:49:05 EST

CL-BLIKY : A simple lisp based blog engine

I'm writing this using a self-rolled blog engine called cl-bliky. I'm indebted to an excellent tutorial put together by Vetle Roeim. His goal was obviously to put together a compelling tutorial, and he succeeded. My goal was to use lisp in a small programming project, and developing a simple and easily customizable blog engine seemed like a good start.

Continue reading "CL-BLIKY : A simple lisp based blog engine"