Saturday, December 26, 2015

A whirlwind tour of Haskell, day 8

Day 8, functionally: pattern-matching

Today's puzzle involves parsing: we are given a string literal containing escape sequences, and we must interpret those escape sequences into the characters they represent. Or at least into the correct number of characters, whatever.

In my puzzle solutions so far, I have always pattern-matched at most one level deep, meaning that the pattern if any was always of the form (Constructor arg1 arg2 ...). Instead of binding those arguments to variables, I can also choose to nest more constructors. When pattern-matching on the list of characters of a string, nesting multiple patterns allows me to match on several characters at once, making it easy to recognize each of the three escape sequences specified in the puzzle using one pattern each.

import Data.Char

-- |
-- >>> print (parse "\"(\x20\\\"\x20)\"")
-- "( \" )"
parse :: String -> String
parse ('"':xs0) = go xs0
    go []                  = error "unterminated string literal"
    go ('"'           :[]) = []
    go ('\\':'\\'     :xs) = '\\'      : go xs
    go ('\\':'"'      :xs) = '"'       : go xs
    go ('\\':'x':x1:x2:xs) = hex x1 x2 : go xs
    go (x             :xs) = x         : go xs
parse _ = error "not a string literal"

hex :: Char -> Char -> Char
hex c1 c2 = chr (16 * digitToInt c1 + digitToInt c2)

If we were writing an imperative implementation, we would probably have some kind of current index into the string which we would increment by 1, 2, or 4 characters depending on the pattern. It would be easy to accidentally increment that index by the wrong number of characters. By contrast, notice how the pattern-matching approach is less error-prone, as it naturally encourages us to give a name to the part of the string which isn't covered by each pattern.

Day 8.5, combinatorially: invertible-syntax

Once the second part is revealed, we can see that the complete puzzle consists of two related tasks: parsing a string literal into a string value, and pretty-printing a string value into a string literal. Whether imperative or functional, functions convert inputs to outputs, and so it would be natural to implement both directions as two separate functions. But why restrict ourselves to functions? While Haskell is best known for being a purely functional programming language, "functional" programming does not imply that everything is a function. In fact, we regularly construct other abstractions, new kinds of building blocks which can be composed into larger ones: the "combinator libraries" I briefly mentioned on Day 1. So let's put on our combinatorial glasses, step away from functions for a bit, and ask ourselves: What kind of building blocks would be useful for this problem? How can those building blocks be composed into larger building blocks?

The invertible-syntax library proposes the following approach: each building block shall know how to parse and to pretty print a value of a particular type, for example a character, and those building blocks can be combined into larger building blocks that know how to parse and pretty-print more complicated values, for example a string.

As usual, I'll write doctest comments above each definition to demonstrate what each building block does. Since I almost always want to demonstrate that a string s parses into a value x and that this same value x pretty-prints into the same string s, I'll use the following helper function to test both directions at once.

{-# LANGUAGE RankNTypes #-}
import Prelude hiding (print, pure, (<$>), (<*), (<*>), (*>))
import Text.Syntax
import Text.Syntax.Parser.Naive
import Text.Syntax.Printer.Naive

-- |
-- >>> test (pure 42) "" 42
-- True
-- >>> test (text "foo") "foo" ()
-- True
test :: Eq a => (forall f. Syntax f => f a) -> String -> a -> Bool
test block s x = (parsedValue == x) && (printedString == Just s)
    parsedValue = head (parse block s)
    printedString = print block x

The two primitive building blocks I have demonstrated in the doctest comment are dual to each other: pure consumes no characters when parsing and always produces the same value, while text always consumes the same characters and doesn't produce anything useful as its value. They can be combined to describe "atoms", that is, a particular value which is represented using a particular string. Useful atoms for this puzzle include the backslash and double quote characters, which are represented using their respective escape sequences.

atom :: (Eq a, Syntax f) => String -> a -> f a
atom s x = pure x <* text s

-- |
-- >>> test backslash "\\\\" '\\'
-- True
backslash :: Syntax f => f Char
backslash = atom "\\\\" '\\'

-- |
-- >>> test doubleQuote "\\\"" '\"'
-- True
doubleQuote :: Syntax f => f Char
doubleQuote = atom "\\\"" '\"'

The hexadecimal escape sequence is a bit more complicated. I'll build it out of smaller atoms, one for each hexadecimal digit:

type Word4 = Int
type Word8 = Int

-- |
-- >>> test (hexDigits !! 12) "c" 12
-- True
hexDigits :: Syntax f => [f Word4]
hexDigits = [atom [intToDigit i] i | i <- [0..15]]

oneOf :: Syntax f => [f a] -> f a
oneOf = foldr (<|>) empty

-- |
-- >>> test hexDigit "a" 10
-- True
hexDigit :: Syntax f => f Word4
hexDigit = oneOf hexDigits

I've used the binary choice combinator (<|>) from invertible-syntax to construct an n-ary choice operator oneOf acting on lists, and I used it to combine my atoms into a more useful building block hexDigit which can parse and pretty-print any hex digit.

Next, I'd like to combine two of those into a full byte, but with only the invertible-syntax combinators, I can only combine them into a pair of nibbles (a number between 0 and 15):

-- |
-- >>> test nibbles "6f" (6, 15)
-- True
nibbles :: Syntax f => f (Word4, Word4)
nibbles = hexDigit <*> hexDigit

Given a pair of nibbles, I can easily compute the value of the corresponding byte. However, since our building blocks are going in both directions, we will also need a formula for splitting a byte into a pair of nibbles. This way, the parsing process will extract two nibbles and merge them into a byte, and the pretty-printing process will split the byte into two nibbles and will pretty-print each one.

import Control.Isomorphism.Partial.Prim
import Control.Isomorphism.Partial.Unsafe

iso :: (a -> b) -> (b -> a) -> Iso a b
iso f g = Iso (Just . f) (Just . g)

mergeNibbles :: Iso (Word4, Word4) Word8
mergeNibbles = iso (\(x,y) -> x * 16 + y) (`divMod` 16)

-- |
-- >>> test byte "6f" 111
-- True
byte :: Syntax f => f Word8
byte = mergeNibbles <$> nibbles

-- |
-- >>> test hexChar "\\x6f" 'o'
-- True
hexChar :: Syntax f => f Char
hexChar = iso chr ord <$> text "\\x" *> byte

invertible-syntax has a companion library called partial-isomorphisms for constructing such pairs of inverse functions. It is a combinator library in its own right, with many primitives and combinators ensuring that the resulting functions really are inverses of each other, but in this case I prefer to use the unsafe portion of the library, which allows me to specify an arbitrary pair of functions and to tell the library to trust me that those are inverses of each other.

Now that we have covered all the escape sequences, the last kind of character which could occur in a string is a non-escaped character.

-- |
-- >>> test otherChar "o" 'o'
-- True
otherChar :: Syntax f => f Char
otherChar = subset (`notElem` ['\\','\"']) <$> token

subset builds a partial isomorphism by taking the identity isomorphism which doesn't change the value in either direction, and restricting its domain to the values satisfying a predicate. In this case, we don't want backslash and double quote characters to be included in the domain of non-escaped characters, because those characters do need to be escaped.

So far I've been talking about functions going in two directions as if the two functions were true inverses of each other, but in fact as the name implies, partial isomorphisms are only partial inverses. The reason we don't need the functions to be exact inverses is that a given value may have several different, equally valid textual representations. For example, the character 'o' can be represented either as "o" or as "\x6f", and both encodings are valid. But when we convert those two representations to the single value 'o' and then back to a string, the function which converts 'o' into a string can only choose one answer! Suppose it chooses "o": the other value, "\x6f", demonstrates that the functions are not inverses, because applying one function to "\x6f" yields 'o' but applying the other function to 'o' does not yield back "\x6f".

Now that we have building blocks for every kind of character, we can combine them into one more complicated building block which knows how to parse and pretty-print any character.

-- |
-- >>> test char "o" 'o'
-- True
-- >>> head (parse char "\x6f")
-- 'o'
-- >>> print char 'o'
-- Just "o"
char :: Syntax f => f Char
char = oneOf [backslash, doubleQuote, otherChar, hexChar]

While parsing, escaped backslashes and double quotes will be parsed into backslashes and double quotes, non-escaped characters will be parsed into themselves, and escaped hexadecimal codes will be parsed into the corresponding ascii character. When pretty-printing, however, no characters will be printed in hexadecimal, because every character is either a backslash, a double quote, or in the domain of otherChar. So the pretty printer will never fall back to hexChar, and 'o' will be printed as "o", not "\x6f". We could of course have chosen otherwise by ordering the list of building blocks differently.

Now that we can handle characters, we're finally ready to describe how to handle an entire string.

-- |
-- >>> test string "\"( \\\" )\"" "( \" )"
-- True
-- >>> head (parse string "\"(\x20\\\"\x20)\"")
-- "( \" )"
-- >>> print string "( \" )"
-- Just "\"( \\\" )\""
string :: Syntax f => f String
string = between (text "\"") (text "\"") (many char)

This approach ended up requiring much more code than simply writing separate functions for parsing and pretty-printing strings. It is, however, much less error-prone: if we end up changing the way in which strings can be represented, for example by supporting more kinds of escape sequences, we will only have to modify the code once. In contrast, with two separate functions, it would be easy to forget to update the other function, or to accidentally modify the functions in a way which causes them to no longer be partial inverses.


1 comment:

TecCrowd said...

I understand the code you shared. Very clear writing. Thanks for this tutorial.