Implement an Interpreter using Clojure Instaparse

This is a small example on how to implement an interpreter using Clojure and the Instaparse library.

Dependencies

First we create a deps.edn file to get Rich Hickey’s Clojure, Mark Engelberg’s Instaparse, the Midje test suite by Brian Marick, and my modified version of Max Miorim’s midje-runner:

{:deps {org.clojure/clojure {:mvn/version "1.11.3"}
        instaparse/instaparse {:mvn/version "1.5.0"}}
 :paths ["src"]
 :aliases
 {:test
  {:extra-deps
   {midje/midje {:mvn/version "1.10.9"}
    midje-runner/midje-runner
    {:git/url "https://github.com/wedesoft/midje-runner.git"
     :git/sha "8ceb29d5781b9fc43ad4116fc645ade797342fad"}}
   :extra-paths ["test"]
   :main-opts ["-m" "midje-runner.runner"]}}}

Initial setup

Next we create a test suite with an initial test in test/clj_calculator/t_core.clj:

(ns clj-calculator.t-core
  (:require [midje.sweet :refer :all]
            [instaparse.core :as insta]
            [clj-calculator.core :refer :all]))

(facts "Test parser"
       (calc-parser "-42") => [:START [:NUMBER "-42"]])

You can run the test suite as follows which should give an error.

clj -M:test

Next we create a module with the parser in src/clj_calculator/core.clj:

(ns clj-calculator.core
    (:require [instaparse.core :as insta]))

(def calc-parser
  (insta/parser
    (slurp "resources/clj_calculator/calculator.bnf")))

We also need to create an initial grammar in resources/clj_calculator/calculator.bnf defining a minimal grammar and a regular expression for parsing an integer:

START  = NUMBER
NUMBER = #'[-+]?[0-9]+'

At this point the first test should pass.

Ignoring whitespace

Next we add a test to ignore whitespace.

(facts "Test parser"
       ; ...
       (calc-parser " -42 ") => [:START [:NUMBER "-42"]])

The test should fail with unexpected input. The grammar needs to be modified to pass this test:

START      = <WHITESPACE?> NUMBER <WHITESPACE?>
NUMBER     = #'[-+]?[0-9]+'
WHITESPACE = #'[,\ \t]+'

Note the use of ‘<’ and ‘>’ to omit the parsed whitespace from the parse tree.

Parsing expressions

Next we can add tests for sum, difference, or product of two numbers:

(facts "Test parser"
       ; ...
       (calc-parser "1 + 2")
       => [:START [:SUM [:NUMBER "1"] [:NUMBER "2"]]]
       (calc-parser "5 - 4")
       => [:START [:DIFF [:NUMBER "5"] [:NUMBER "4"]]]
       (calc-parser "2 * 3")
       => [:START [:PROD [:NUMBER "2"] [:NUMBER "3"]]])

The grammar now becomes:

START      = <WHITESPACE?> (NUMBER | SUM | DIFF | PROD) <WHITESPACE?>
SUM        = NUMBER <WHITESPACE?> <'+'> <WHITESPACE?> NUMBER
DIFF       = NUMBER <WHITESPACE?> <'-'> <WHITESPACE?> NUMBER
PROD       = NUMBER <WHITESPACE?> <'*'> <WHITESPACE?> NUMBER
NUMBER     = #'[-+]?[0-9]+'
WHITESPACE = #'[,\ \t]+'

Transforming syntax trees

Instaparse comes with a useful transformation function for recursively transforming the abstract syntax tree we obtained from parsing. First we write and run a failing test for transforming a string to an integer:

(facts "Test calculator"
       (calculate "-42") => -42)

To pass the test we implement a calculator function which transforms the syntax tree. Initially it only needs to deal with the nonterminal symbols START and NUMBER:

(defn calculate
  [string]
  (instaparse.transform/transform
    {:START  identity
     :NUMBER #(Integer/parseInt %)}
    (calc-parser string)))

Performing calculations

Obviously we can use the transformation function to also perform the calculations. Here are the tests for the three possible operations of the parse tree.

(facts "Test calculator"
       (calculate "-42") => -42
       (calculate "1 + 2") => 3
       (calculate "5 - 4") => 1
       (calculate "2 * 3") => 6)

The implementation using the Instaparse transformation function is quite elegant:

(defn calculate
  [string]
  (instaparse.transform/transform
    {:START  identity
     :NUMBER #(Integer/parseInt %)
     :SUM    +
     :DIFF   -
     :PROD   *}
    (calc-parser string)))

Recursive Grammar

The next test is about implementing an expression with two operations.

(facts "Recursive grammar"
       (calculate "2 - 1 + 3") => 4)

A naive implementation using a blind EXPR nonterminal symbol passes the test:

START      = <WHITESPACE?> (EXPR | SUM | DIFF | PROD) <WHITESPACE?>
<EXPR>     = SUM | DIFF | PROD | NUMBER
SUM        = EXPR <WHITESPACE?> <'+'> <WHITESPACE?> EXPR
DIFF       = EXPR <WHITESPACE?> <'-'> <WHITESPACE?> EXPR
PROD       = EXPR <WHITESPACE?> <'*'> <WHITESPACE?> EXPR
NUMBER     = #'[-+]?[0-9]+'
WHITESPACE = #'[,\ \t]+'

However there is a problem with this grammar: It is ambiguous. The following failing test shows that the parser could generate two different parse trees:

(facts "Recursive grammar"
       ; ...
       (count (insta/parses calc-parser "2 - 1 + 3")) => 1)

When parsing small strings, this might not be a problem. However if you use an ambiguous grammar to parse a large file with a syntax error near the end, the resulting combinatorial explosion leads to a long processing time before the parser can return the syntax error. The good thing is, that Instaparse uses the GLL parsing algorithm, i.e. it can handle a left-recursive grammar to resolve the ambiguity:

START      = <WHITESPACE?> (EXPR | SUM | DIFF | PROD) <WHITESPACE?>
<EXPR>     = SUM | DIFF | PROD | NUMBER
SUM        = EXPR <WHITESPACE?> <'+'> <WHITESPACE?> NUMBER
DIFF       = EXPR <WHITESPACE?> <'-'> <WHITESPACE?> NUMBER
PROD       = EXPR <WHITESPACE?> <'*'> <WHITESPACE?> NUMBER
NUMBER     = #'[-+]?[0-9]+'
WHITESPACE = #'[,\ \t]+'

This grammar is not ambiguous any more and will pass above test.

Grouping using brackets

We might want to use brackets to group expressions and influence the order expressions are applied:

(facts "Recursive grammar"
       ; ...
       (calculate "2 - ( 1 + 3 )") => -2
       (calculate "( 2 - 1 ) + 3") => 4)

The following grammar implements this:

START      = <WHITESPACE?> (EXPR | SUM | DIFF | PROD) <WHITESPACE?>
<EXPR>     = SUM | DIFF | PROD | GROUP
<GROUP>    = NUMBER | <'('> <WHITESPACE?> EXPR <WHITESPACE?>  <')'>
SUM        = EXPR <WHITESPACE?> <'+'> <WHITESPACE?> GROUP
DIFF       = EXPR <WHITESPACE?> <'-'> <WHITESPACE?> GROUP
PROD       = EXPR <WHITESPACE?> <'*'> <WHITESPACE?> GROUP
NUMBER     = #'[-+]?[0-9]+'
WHITESPACE = #'[,\ \t]+'

A final consideration is operator precedence of multiplication over addition and subtraction. I leave this as an exercise for the interested reader ;)

Main function

Now we only need a main function to be able to use the calculator program.

(ns clj-calculator.core
    (:require [instaparse.core :as insta])
    (:gen-class))

; ...

(defn -main
  [& _args]
  (loop [line (read-line)]
        (when line
          (println (calculate line))
          (recur (read-line))))
  (System/exit 0))

Now one can run the program as follows:

clj -M -m clj-calculator.core

To exit the calculator, simply press CTRL+D.

See github.com/wedesoft/clj-calculator for source code.

Enjoy!