Saturday, May 02, 2015

Strongly-typed ghci commands

The ghci documentation explains how to implement conditional breakpoints by conditionally generating ghci commands depending on the values of variables which are in scope at the breakpoint. The approach works, but is hard to implement correctly because ghci's commands are stringly-typed. In this post, I will present a strongly-typed DSL for writing such commands in a type safe way, and I will use it to reimplement conditional breakpoints and to implement data breakpoints.

Data breakpoint demo

For future reference, here is how to define data breakpoints in ghci.

:def databreak \checkData -> return (":set stop :cmd (" ++ checkData ++ ") >>= \\hasChanged -> if hasChanged then return \":set stop :list\\n:back\" else return \":step\"")

That was unreadable, but that's kind of the point: multi-stage ghci commands are complex, which is why we need a DSL to make sense of them. For now, here's how you can use :databreak to find the step at which foo writes 2 to the IORef.

initIORef :: IO (IORef Int)
initIORef = newIORef 0

hasWrittenTwo :: IORef Int -> IO Bool
hasWrittenTwo ref = do
    value <- readIORef ref
    return (value == 2)

foo :: IORef Int -> [Int] -> IO ()
foo ref = mapM_ $ \x -> do
    print x
    writeIORef ref x

> ref <- initIORef
> :databreak (hasWrittenTwo ref)
> :step (foo ref [5,4,3,2,1])
[lots of forward stepping...]
foo ref = mapM_ $ \x -> do
    print x
    writeIORef ref x
> x

By the end of the post, we'll be able to reimplement :databreak in a less succinct, but much more understandable style.

The status quo: stringly-typed commands

If you search for "conditional breakpoints" in the ghci documentation, you will see the following magic incantation (which I have split into multiple lines for readability):

:def cond \expr -> return (":cmd if (" ++ expr ++ ") \
                                \then return \"\" \
                                \else return \":continue\"")
:set stop 0 :cond (x < 3)

The meaning of the above is a multi-stage computation:

  1. When breakpoint 0 is encountered, execute the command :cond (x < 3).
  2. (x < 3) looks like an expression, but is actually the string "(x < 3)", which is passed to the definition of cond above.
  3. The body of cond is an IO String computation. It returns a string which represents a ghci command.
  4. That ghci command is: :cmd if ((x < 3)) then return "" else return ":continue".
  5. The argument to :cmd is an expression of type IO String involving the variable x, which is assumed to be in scope in the code where breakpoint 0 is.
  6. The IO computation returns one of two strings depending on the runtime value of x.

    If x is less than three,

    1. The condition for our conditional breakpoint has been triggered, so we should stop. The empty string is returned.
    2. This empty string is interpreted as a ghci command which does nothing. The multi-stage computation stops, as desired.

    If x is greater than or equal to three,

    1. The condition for our conditional breakpoint has not yet been triggered, so we should continue. The string ":continue" is returned.
    2. This string is interpreted as the ghci command :continue, which continues the execution until the next breakpoint is encountered. Go to 1.

There is a good reason why ghci is using strings to represent commands: as some of those commands are executed, the program being debugged progresses, which causes variables like x to enter the scope and others to leave it. Thus, at the moment when a command is defined, we don't yet have enough information to verify whether its definition is well-scoped.

Creating these kinds of multi-stage computations is a bit of a pain, because everything is a string and as a Haskell programmer, I need more precise types than that in order to reason about my program. It would be too easy to forget a :cmd or a return, producing a computation whose first stage type-checks, but which fails at runtime during the later stages.

> :{
> :def cond \expr -> return $ ":cmd if (" ++ expr ++ ") \
>                                  \then \"\" \
>                                  \else \":continue\""
> :}
> :set stop 0 :cond (x < 3)

The two commands above are accepted by ghci, because the expression given to :def has the required type String -> IO String, but we get a type error the first time the breakpoint is hit, much later than we'd like.

Making strings type-safe using phantom types

I have encountered this situation before while working on Hawk, an awk-like tool for manipulating text from the command-line via Haskell expressions. In the --map mode, Hawk applies the user expression to each line of stdin. This is implemented by embedding the user expression expr in the larger expression interact (unlines . map expr . lines), except with more complications due to other interfering features. To avoid making mistakes in this more complicated code, we came up with the following trick: instead of representing an expression as a raw string, we represent an expression of type a as a value of type S a. This is still a string under the hood, with an extra phantom type a documenting the purpose of the string's expression and allowing the type-checker to make sure we are using the string in the way we intended to.

data S a = S { runS :: String } deriving Show

To make use of S, I first need to define a few primitives corresponding to the many functions I might want to use inside my custom ghci commands. I need to be careful to specify the correct types here, as the type checker is not able to verify that the string I give corresponds to a function of the correct type. If I manage not to make any mistakes here, this will allow the type checker to prevent me from making mistakes later on, when combining those primitives into larger expressions.

sFmap :: Functor f => S ((a -> b) -> f a -> f b)
sFmap = S "fmap"

sReturn :: Monad m => S (a -> m a)
sReturn = S "return"

sIf :: S (a -> a -> Bool -> a)
sIf = S "\\t f b -> if b then t else f"

To combine those primitives, I can apply a string representing a function to a string representing an argument. The semantics for this operation is exactly the same as (<*>), but I cannot give an Applicative instance because my version of pure isn't polymorphic enough.

pureS :: Show a => a -> S a
pureS = S . show

(<.>) :: S (a -> b) -> S a -> S b
S f <.> S x = S ("(" ++ f ++ ") (" ++ x ++ ")")

Thanks to the type indices, Haskell can verify that I am constructing type-safe expressions, even though in the end I am still just concatenating strings.

-- Couldn't match expected type 'IO String' with actual type 'String'
sIncorrectExample :: S (Bool -> IO String)
sIncorrectExample = sIf <.> pureS ""
                        <.> pureS ":continue"

sCorrectedExample :: S (Bool -> IO String)
sCorrectedExample = sIf <.> (sReturn <.> pureS "")
                        <.> (sReturn <.> pureS ":continue")

> putStrLn $ runS sCorrectedExample
(\t f b -> if b then t else f) (return "") (return ":continue")

As the if..then..else example above demonstrates, computations expressed in terms of S primitives and (<.>) don't have access to the syntactic sugar we're accustomed to use in ordinary Haskell expressions. For this reason, it's often much more readable to define a new named function and to wrap it as a new primitive.

 namedExample :: Bool -> IO String
 namedExample True  = return ""
 namedExample False = return ":continue"
 sNamedExample :: S (Bool -> IO String)
 sNamedExample = S "namedExample"
Making ghci commands type-safe

There is another layer of stringly-typed imprecision in the above code: ":continue" is a plain old string, but not every string describes a valid ghci command. Let's create a newtype for those strings which do.

newtype C = C { runC :: String } deriving Show

No phantom type this time, since I'm not aware of any situation in which some commands would be allowed but not others.

continue :: C
continue = C ":continue"

back :: C
back = C ":back"

forward :: C
forward = C ":forward"

list :: C
list = C ":list"

step :: C
step = C ":step"

steplocal :: C
steplocal = C ":steplocal"

The :cmd and :def commands both expect a string argument, but since we know that this string is supposed to represent an IO computation, we should use S to specify that.

rawCmd :: S (IO String) -> C
rawCmd body = C $ ":cmd " ++ runS body

type Name = String
rawDef :: Name -> S (String -> IO String) -> C
rawDef name body = C $ ":def " ++ name ++ " " ++ runS body
A precisely-typed version of :cmd

We can do better than this. The IO computation given to :cmd doesn't return any string, it returns a string which represents a newline-separated sequence of commands.

import Data.List

-- compile from [C], the high-level representation of a list
-- of commands, to String, the low-level representation.
compileCmds :: [C] -> String
compileCmds = intercalate "\n" . fmap runC

sCompileCmds :: S ([C] -> String)
sCompileCmds = S "compileCmds"

We can thus make a safer version of rawCmd which asks for a computation returning a list of commands instead of a computation returning a string. We'll still be compiling to a string under the hood, but our users will have a much clearer picture of what they're doing.

cmd :: S (IO [C]) -> C
cmd body = rawCmd (sFmap <.> sCompileCmds <.> body)
A precisely-typed version of :def

For :def, it's a bit more difficult, because we would also like to give a precise type to the input string. It's a string given by the user, like the "(x < 3)" in our running example, which represents some expression. In the case of :cond, this is a boolean expression, and in order to keep things concrete, for now let's assume that this is always the case. Since the argument is a string representing a boolean expression, we can make its type more precise by using S Bool instead of String. So instead of S (String -> ...), we should use S (S Bool -> ...).

Yes, that's two nested S constructors. No, I'm not intentionally making this more complicated than it needs to be, I'm merely using the type system to make sense of the complexity which is already there. It means that :def's argument is a string representing a function which takes in a string which represents a boolean.

It's much easier to reason about this doubly-nested S if we first implement a function which compiles the precise representation S Bool -> IO [C] down to the imprecise representation String -> IO String, and then wrap that named function in an S constructor; the same trick we have used to improve readability.

-- using 'S a' instead of 'S Bool', to support any type of argument
compileDef :: (S a -> IO [C]) -> (String -> IO String)
compileDef body s = do
    cmds <- body (S s)
    return (compileCmds cmds)

sCompileDef :: S ((S a -> IO [C]) -> (String -> IO String))
sCompileDef = S "compileDef"

def :: String -> S (S a -> IO [C]) -> C
def name body = rawDef name (sCompileDef <.> body)
A strongly-typed version of :cond
> :{
> :def cond \expr -> return ":cmd if (" ++ expr ++ ")"
>     "then return \" \" "
>     "else return \":continue \""
> :}

Now that we have a type-safe version of :def, we can use it to rewrite the above definition of conditional breakpoints in a type-safe way.

defineCond :: C
defineCond = def "cond" sCond

cond :: S Bool -> IO [C]
cond expr = return [cmd (sCondHelper <.> expr)]

condHelper :: Bool -> IO [C]
condHelper True  = return []
condHelper False = return [continue]

sCond :: S (S Bool -> IO [C])
sCond = S "cond"

sCondHelper :: S (Bool -> IO [C])
sCondHelper = S "condHelper"

My version is more verbose because it is divided into several helper functions, which makes sense for a multi-stage computation.

  1. There is the stage in which :def is called, with a string representing the second stage as an argument.
  2. Then there is the cond stage, in which a custom command is constructed by concatenating two strings: one representing a computation expecting a boolean and one representing a boolean.
  3. That computation is the next stage, condHelper, which uses the boolean to determine which commands to run next.
  4. Those commands are the final stage.

Note that it would not have been correct to define those helpers in a where clause, because then the function names would no longer be in scope when ghci needs to interpret the code inside the strings. The trick we are using for readability is polluting the global namespace, which I think is a good tradeoff in the context of a blog post, but I wouldn't do that in a library.

Why not a monad?

To the eyes of a Haskeller, there is one more obvious API improvement which can be made: abstracting over the mechanism which separates the different stages and wrapping it up under a monadic interface, thereby allowing a multi-staged computation to be expressed as a single unified monadic computation. With such an API, instead of translating the definition of :cond to a bunch of helper functions like we did above, we would be able to express it via something like this:

defCond :: DSL ()
defCond = def "cond" $ \sBool -> do
    bool <- interpret sBool
    if bool
    then return ()
    else executeC continue

Unfortunately, it's not possible to implement such a monadic interface. Consider the type signatures for interpret and bind:

interpret :: S a -> DSL a
(>>=) :: DSL a -> (a -> DSL b) -> DSL b

Since S a is represented as a string, there is no way to obtain the underlying value of type a in order to pass it to bind's second argument. Instead, we would need to embed that argument, the monadic continuation, inside the string which we will eventually pass on to ghci. This leads to the following variant of bind:

bindDSL :: DSL a -> S (a -> DSL b) -> DSL b

With an appropriate representation for DSL, it is certainly possible to implement bindDSL. But then for readability purposes, we'd still be separating the implementation of the second argument into a separate function, so the end result would still be separated into approximately one function per stage. For this reason, I'll skip the details of this pseudo-monadic implementation.

Implementing data breakpoints.

A data breakpoint is a breakpoint which triggers not when the program reaches a particular line, but when a particular reference is assigned a new value. In a language in which each variable points to its own memory address, this can be implemented efficiently using a hardware breakpoint. Since GHC's garbage collector moves data around, changing their memory addresses, we'll have to use a much slower approach.

I plan to repeatedly use the :step command to advance the computation a tiny bit, checking between every step whether the reference I am interested is now pointing to a different value.

A strongly-typed version of :set stop

If we use :step when there is no computation is being debugged, there is nothing to step through, and ghci prints an error message. Unfortunately this message is visible to the user but not to our command, so we cannot use this error to determine when to stop stepping. So instead, I will use :set stop to specify that my command should be executed each time the execution is stopped, whether after stepping or after encountering a breakpoint.

setStop :: C -> C
setStop c = C (":set stop " ++ runC c)

Since there is only one stop hook, attaching my command will replace any other command which might have already been attached to it. I would like to restore that command once I'm done, but since there is no way to ask ghci which command is currently attached, I'll have no choice but to reset it to a convenient default instead. Most people use :set stop :list, which displays the source which is about to be executed.

resetStop :: C
resetStop = setStop list
A strongly-typed version of :databreak

Now that we have built all of the scaffolding, the definition of :databreak is straightforward. We hook up a stop command so that each time we stop, we check whether the data has changed. If it has, we detach the hook. Otherwise, we step forward and if the computation hasn't finished yet, our hook will trigger again. As a result, we keep stepping forward until either the data changes or the computation terminates.

defineDatabreak :: C
defineDatabreak = def "databreak" sSetDataBreakpoint

setDataBreakpoint :: S (IO Bool) -> IO [C]
setDataBreakpoint sCheckData = return [ setStop
                                      $ cmd
                                      $ sCheckAgain <.> sCheckData

checkAgain :: IO Bool -> IO [C]
checkAgain checkData = do
    hasChanged <- checkData
    if hasChanged
    then onDataChanged
    else return [step]

-- go back one step, since it's the previous step
-- which caused the data to change
onDataChanged :: IO [C]
onDataChanged = return [resetStop, back]

sSetDataBreakpoint :: S (S (IO Bool) -> IO [C])
sSetDataBreakpoint = S "setDataBreakpoint"

sCheckAgain :: S (IO Bool -> IO [C])
sCheckAgain = S "checkAgain"

sOnDataChanged :: S (IO [C])
sOnDataChanged = S "onDataChanged"

Let's try it out:

> putStrLn $ runC $ defineDatabreak
:def databreak (compileDef setDataBreakpoint)
> :def databreak (compileDef setDataBreakpoint)

> ref <- initIORef
> :databreak (hasWrittenTwo ref)
> :step (foo ref [5,4,3,2,1])
[lots of forward stepping...]
foo ref = mapM_ $ \x -> do
    print x
    writeIORef ref x
> x

For convenience, I have expanded out and simplified the definition of defineDatabreak, thereby obtaining the easy-to-paste version from the top of the post.