It's time for another anti-tutorial! Whereas a tutorial is an advanced user giving step-by-step instructions to help newbies, an anti-tutorial is a new user describing their path to enlightenment. My approach is usually to follow the types, so my anti-tutorials are also examples of how to do that.
Previously in the series:
Today, inspired by a question from Syncopat3d, I'll try to learn how to use Simon Marlow's Haxl library. I think Haxl is supposed to improve the performance of complicated queries which use multiple data sources, such as databases and web services, by somehow figuring out which parts of the query should be executed in parallel and which ones should be batched together in one request. Since Syncopat3d is looking for a way to schedule the execution of a large computation which involves running several external processes in parallel, caching the results which are used more than once, and batching together the processes which use the same input, Haxl seemed like a good fit!
To understand the basics of the library, I'd like to create a black triangle, that is, a trivial program which nevertheless goes through the whole pipeline. So as a first step, I need to figure out what the stages of Haxl's pipeline are.
Since I'm using a type-directed approach, I need some type signature from which to begin my exploration. Hunting around Haxl's hackage page for something important-looking, I find
GenHaxl, "the Haxl monad". Despite the recent complaints about the phrase "the <something> monad", finding that phrase here is quite reassuring, as it gives me a good idea of what to expect in this package: a bunch of commands which I can string together into a computation, and some function to run that computation.
Thus, to a first approximation, the Haxl pipeline has two stages: constructing a computation, and then running it.
A trivial computation
GenHaxl is a monad, I already know that
return 42 is a suitably trivial and valid computation, so all I need now is a function to run a
That function is typically right after the definition of the datatype, and indeed, that's where I find
runHaxl. I see that in addition to my trivial
GenHaxl computation, I'll need a value of type
Env u. How do I make one?
Clicking through to the definition of
Env, I see that
emptyEnv can make an
Env u out of a
u. Since there are no constraints on
u so far, I'll simply use
(). I fully expect to revisit that decision once I figure out what the type
u represents in the type
GenHaxl u a.
>>> myEnv <- emptyEnv () >>> runHaxl myEnv (return 42) 42
Good, we now have a base on which to build! Let's now make our computation slightly less trivial.
What's a data source?
There are a bunch of
GenHaxl commands listed after
runHaxl, but most of them seem to be concerned with auxiliary matters such as exceptions and caching. Except for one:
dataFetch :: (DataSource u r, Request r a) => r a -> GenHaxl u a
That seems to be our link to another stage of Haxl's pipeline: data sources. So the first stage is a data source, then we describe a computation which fetches from the data source, then finally, we run the computation.
So, I want an
r a satisfying
DataSource u r. Is there something simple I could use for
r? The documentation for
DataSource doesn't list any instances, so I guess I'll have to define one myself. Let's see, there is only one method to implement,
fetch, and it uses both
r. The way in which they're used should give me a hint as to what those type variables represent.
fetch :: State r -> Flags -> u -> [BlockedFetch r] -> PerformFetch
I find it surprising that neither
r seem to constrain the output type. In particular,
u is again completely unconstrained, so I'll keep using
(). The description of the
u parameter, "User environment", makes me think that indeed, I can probably get away with any concrete type of my choosing. As for
r, which seems to be the interesting part here, we'll have to look at the definitions for
BlockedFetch to figure out what it's about.
class Typeable1 r => StateKey r Source data State r data BlockedFetch r = forall a . BlockedFetch (r a) (ResultVar a)
State r is an associated type in an otherwise-empty typeclass, so I can again pick whatever I want.
BlockedFetch r is much more interesting: it has an existential type
a, which ties the
r a to its
ResultVar a. The documentation for
BlockedFetch explains this link very clearly:
r a is a request with result
a, whose result must be placed inside the
ResultVar a. This explains why
r wasn't constraining
fetch's output type: this
ResultVar is the Haskell equivalent of an output parameter. So instead of being a pure function returning something related to
fetch method must be an imperative computation which fills in its output parameters before returning to the caller. The type of
fetch's return type,
PerformFetch, is probably some monad which has commands for filling in
data PerformFetch = SyncFetch (IO ()) | ...
At least in the simple case,
PerformFetch is a simple wrapper around
IO (), so I guess
ResultVar must be a simple wrapper around
A trivial data source
Anyway, we now have a clear idea of what
r a is: a request whose result has type
a. Let's create a simple data source, Deep Thought, which only knows how to answer a single request.
data DeepThought a where AnswerToLifeTheUniverseAndEverything :: DeepThought Int
I'm using a GADT so that each request can specify the type of its answer. For example, I could easily add a request whose answer is a string instead of a number:
data DeepThought a where AnswerToLifeTheUniverseAndEverything :: DeepThought Int QuestionOfLifeTheUniverseAndEverything :: DeepThought String
But of course, Deep Thought isn't powerful enough to answer that request.
We also know that fullfilling a request isn't done by returning an answer, but by assigning the answer to a
runDeepThought :: DeepThought a -> ResultVar a -> IO () runDeepThought AnswerToLifeTheUniverseAndEverything var = putSuccess var 42
Alright, let's try to make
DeepThought an official data source by implementing the
instance DataSource () DeepThought where fetch _ _ _ reqs = SyncFetch $ forM_ reqs $ \(BlockedFetch req var) -> runDeepThought req var
There's also a bunch of other easy typeclasses to implement, see the next source link for details.
A trivial state
I now have everything I need for my
dataFetch to compile...
>>> runHaxl myEnv (dataFetch AnswerToLifeTheUniverseAndEverything) *** DataSourceError "data source not initialized: DeepThought"
...but the execution fails at runtime. Now that I think about it, it makes a lot of sense: even though I don't use it,
fetch receives a value of type
State DeepThought, but since this is a custom type and I haven't given any of its inhabitants to anything, there is no way for Haxl to conjure one up from thin air. There must be a way to initialize the state somehow.
I must say that I'm a bit disappointed by how imperative Haxl's API has been so far. Whether we're assigning values to result variables or initializing a state, correctness requires us to perform actions which aren't required by the types and thus can't be caught until runtime. This is unusual for a Haskell library, and if the rest of the API is like this, I'm afraid following the types won't be a very useful exploration technique.
Anyway, I couldn't find any function with "init" in the name, but by looking for occurences of
State in the types, I figured out how to perform the initialization: via the environment
u which I had left empty until now.
instance StateKey DeepThought where data State DeepThought = NoState initialState :: StateStore initialState = stateSet NoState stateEmpty >>> myEnv <- initEnv initialState () >>> runHaxl myEnv (dataFetch AnswerToLifeTheUniverseAndEverything) 42
It worked! We have a trivial data source, we have a trivial expression which queries it, we can run our expression, and we obtain the right answer. That's our black triangle!
Multiple data sources, multiple requests
Next, I'd like to try a slightly more complicated computation. Syncopat3d gives the following example:
F_0(x, y, z) = E(F_1(x, y), F_2(y, z))
Here we clearly have two different data sources,
F. Syncopat3d insists that
E is computed by an external program, which is certainly possible since our data sources can run any IO code, but I don't think this implementation detail is particularly relevant to our exploration of Haxl, so I'll create two more trivial data sources.
data E a where E :: String -> String -> E String deriving Typeable data F a where F_1 :: String -> String -> F String F_2 :: String -> String -> F String deriving Typeable runE :: E a -> ResultVar a -> IO () runE (E x y) var = putSuccess var (printf "E(%s,%s)" x y) runF :: F a -> ResultVar a -> IO () runF (F_1 x y) var = putSuccess var (printf "F_1(%s,%s)" x y) runF (F_2 x y) var = putSuccess var (printf "F_2(%s,%s)" x y)
GenHaxl is a monad, assembling those three requests should be quite straightforward...
>>> runHaxl myEnv $ do ... f1 <- dataFetch (F_1 "x" "y") ... f2 <- dataFetch (F_2 "y" "z") ... dataFetch (E f1 f2) "E(F_1(x,y),F_2(y,z))"
...but if I add a bit of tracing to my
DataSource instances, I see that this computation is performed in three phases:
>>> runHaxl myEnv ... Computing ["F_1(x,y)"] Computing ["F_2(y,z)"] Computing ["E(F_1(x,y),F_2(y,z))"] "E(F_1(x,y),F_2(y,z))"
This is not the trace I was hoping to see. Since
fetch is receiving a list of request/var pairs, I expected Haxl to send me multiple requests at once, in case my data source knows how to exploit commonalities in the requests. But it doesn't look like Haxl figured out that the
F_2 requests could be performed at the same time.
It turns out that this is a well-known problem with Haxl's monadic interface. I remember about it now, it was described in a presentation about Haxl (slide 45) when it came out. The solution is to use the
Applicative syntax to group the parts which are independent of each other:
>>> runHaxl myEnv $ do ... (f1,f2) <- liftA2 (,) (dataFetch (F_1 "x" "y")) ... (dataFetch (F_2 "y" "z")) ... dataFetch (E f1 f2) Computing ["F_2(y,z)","F_1(x,y)"] Computing ["E(F_1(x,y),F_2(y,z))"] "E(F_1(x,y),F_2(y,z))"
F_2 requests are now being performed together.
I don't like the way in which we have to write our computations. Consider a slightly more complicated example:
E( E(F_1(x,y), F_2(y,z)), E(F_1(x',y'), F_2(y',z')) )
Since the four
F_2 requests at the leaves are all independent, it would make sense for Haxl to batch them all together. But in order to obtain this behaviour, I have to list their four subcomputations together.
>>> runHaxl myEnv $ do ... (f1,f2,f1',f2') <- (,,,) <$> (dataFetch (F_1 "x" "y")) ... <*> (dataFetch (F_2 "y" "z")) ... <*> (dataFetch (F_1 "x'" "y'")) ... <*> (dataFetch (F_2 "y'" "z'")) ... (e1,e2) <- (,) <$> (dataFetch (E f1 f2)) ... <*> (dataFetch (E f1' f2')) ... dataFetch (E e1 e2) Computing ["F_2(y',z')","F_1(x',y')","F_2(y,z)","F_1(x,y)"] Computing ["E(F_1(x',y'),F_2(y',z'))","E(F_1(x,y),F_2(y,z))"] Computing ["E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"] "E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"
I feel like I'm doing the compiler's job, manually converting from the nested calls I want to write to the leaves-to-root, layered style I have to write if I want batching to work.
So I stopped working on my anti-tutorial and wrote a toy library which converts from one style to the other :)
...and when I came back here to show it off, I discovered that
GenHaxl already behaved exactly like my library did! You just have to know how to define your intermediate functions:
f_1 :: GenHaxl () String -> GenHaxl () String -> GenHaxl () String f_1 x y = join (dataFetch <$> (F_1 <$> x <*> y)) f_2 :: GenHaxl () String -> GenHaxl () String -> GenHaxl () String f_2 x y = join (dataFetch <$> (F_2 <$> x <*> y)) e :: GenHaxl () String -> GenHaxl () String -> GenHaxl () String e x y = join (dataFetch <$> (E <$> x <*> y))
And with those, we can now describe the computation as nested function calls, as desired.
>>> x = pure "x" >>> y = pure "y" >>> z = pure "z" >>> x' = pure "x'" >>> y' = pure "y'" >>> z' = pure "z'" >>> runHaxl myEnv $ e (e (f_1 x y) (f_2 y z)) ... (e (f_1 x' y') (f_2 y' z')) Computing ["F_2(y',z')","F_1(x',y')","F_2(y,z)","F_1(x,y)"] Computing ["E(F_1(x',y'),F_2(y',z'))","E(F_1(x,y),F_2(y,z))"] Computing ["E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"] "E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"
I now understand Haxl's purpose much better. With the appropriate intermediate functions, Haxl allows us to describe a computation very concisely, as nested function calls. Haxl executes this computation one layer at a time: all of the leaves, then all the requests which only depend on the leaves, and so on. Within a single layer, the requests are subdivided again, this time according to their respective data sources. Finally, for a given data source, it is
fetch's responsibility to find and exploit opportunities for reusing work across the different requests belonging to the same batch. There are also some features related to caching and parallelism which I didn't explore.
I also understand Haxl's implementation much better, having reimplemented part of it myself. In fact, I'd be interested in writing a follow-up post named "Homemade Haxl", in the same vein as my "Homemade FRP" series. What do you think? Are you more interested in watching me learn some new libraries, watching me reimplement libraries, or watching me implement new stuff? I'll be doing all three anyway, I just want to know which of those activities I should blog about :)
Really, your feedback would be greatly appreciated, as the only reason I started this anti-tutorial series in the first place is that my first write-up on understanding Pipes was so surprisingly popular. I've streamlined the format a lot since that first post, and I want to make sure I haven't lost any of the magic in the process!