It's time for another anti-tutorial! Whereas a tutorial is an advanced user giving step-by-step instructions to help newbies, an anti-tutorial is a new user describing their path to enlightenment. My approach is usually to follow the types, so my anti-tutorials are also examples of how to do that.
Today, inspired by a question from Syncopat3d, I'll try to learn how to use Simon Marlow's Haxl library. I think Haxl is supposed to improve the performance of complicated queries which use multiple data sources, such as databases and web services, by somehow figuring out which parts of the query should be executed in parallel and which ones should be batched together in one request. Since Syncopat3d is looking for a way to schedule the execution of a large computation which involves running several external processes in parallel, caching the results which are used more than once, and batching together the processes which use the same input, Haxl seemed like a good fit!
Black triangle
To understand the basics of the library, I'd like to create a black triangle, that is, a trivial program which nevertheless goes through the whole pipeline. So as a first step, I need to figure out what the stages of Haxl's pipeline are.
Since I'm using a type-directed approach, I need some type signature from which to begin my exploration. Hunting around Haxl's Hackage page for something important-looking, I find GenHaxl, "the Haxl monad". Despite the recent complaints about the phrase "the <something> monad", finding that phrase here is quite reassuring, as it gives me a good idea of what to expect in this package: a bunch of commands which I can string together into a computation, and some function to run that computation.
Thus, to a first approximation, the Haxl pipeline has two stages: constructing a computation, and then running it.
A trivial computation
Since GenHaxl is a monad, I already know that return 42 is a suitably trivial and valid computation, so all I need now is a function to run a GenHaxl computation.
That function is typically right after the definition of the datatype, and indeed, that's where I find runHaxl. I see that in addition to my trivial GenHaxl computation, I'll need a value of type Env u. How do I make one?
Clicking through to the definition of Env, I see that emptyEnv can make an Env u out of a u. Since there are no constraints on u so far, I'll simply use (). I fully expect to revisit that decision once I figure out what the type u represents in the type GenHaxl u a.
>>> myEnv <- emptyEnv ()
>>> runHaxl myEnv (return 42)
42
(source)
Good, we now have a base on which to build! Let's now make our computation slightly less trivial.
What's a data source?
There are a bunch of GenHaxl commands listed after runHaxl, but most of them seem to be concerned with auxiliary matters such as exceptions and caching. Except for one:
dataFetch :: (DataSource u r, Request r a) => r a -> GenHaxl u a
That seems to be our link to another stage of Haxl's pipeline: data sources. So the first stage is a data source, then we describe a computation which fetches from the data source, then finally, we run the computation.
So, I want an r a satisfying DataSource u r. Is there something simple I could use for r? The documentation for DataSource doesn't list any instances, so I guess I'll have to define one myself. Let's see, there is only one method to implement, fetch, and it uses both u and r. The way in which they're used should give me a hint as to what those type variables represent.
fetch :: State r
      -> Flags
      -> u
      -> [BlockedFetch r]
      -> PerformFetch
I find it surprising that neither u nor r seems to constrain the output type. In particular, u is again completely unconstrained, so I'll keep using (). The description of the u parameter, "User environment", makes me think that indeed, I can probably get away with any concrete type of my choosing. As for r, which seems to be the interesting part here, we'll have to look at the definitions for State and BlockedFetch to figure out what it's about.
class Typeable1 r => StateKey r
  data State r

data BlockedFetch r
  = forall a . BlockedFetch (r a) (ResultVar a)
Okay, so State r is an associated type in an otherwise-empty typeclass, so I can again pick whatever I want. BlockedFetch r is much more interesting: it has an existential type a, which ties the r a to its ResultVar a. The documentation for BlockedFetch explains this link very clearly: r a is a request with result a, whose result must be placed inside the ResultVar a. This explains why r wasn't constraining fetch's output type: this ResultVar is the Haskell equivalent of an output parameter. So instead of being a pure function returning something related to r, this fetch method must be an imperative computation which fills in its output parameters before returning to the caller. fetch's return type, PerformFetch, is probably some monad which has commands for filling in ResultVars.
data PerformFetch = SyncFetch (IO ()) | ...
At least in the simple case, PerformFetch is a simple wrapper around IO (), so I guess ResultVar must be a simple wrapper around MVar or IORef.
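To make the "output parameter" guess concrete, here is a minimal sketch of what such a wrapper could look like. The names ToyResultVar, newToyResultVar, putToySuccess and readToyResult are made up for this illustration; this is not Haxl's actual definition of ResultVar, only a model of the idea using an IORef.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical stand-in for Haxl's ResultVar: a mutable cell which starts
-- out empty and is filled in by whoever serves the request.
newtype ToyResultVar a = ToyResultVar (IORef (Maybe a))

newToyResultVar :: IO (ToyResultVar a)
newToyResultVar = ToyResultVar <$> newIORef Nothing

-- Plays the role of assigning to an output parameter.
putToySuccess :: ToyResultVar a -> a -> IO ()
putToySuccess (ToyResultVar ref) x = writeIORef ref (Just x)

readToyResult :: ToyResultVar a -> IO (Maybe a)
readToyResult (ToyResultVar ref) = readIORef ref

main :: IO ()
main = do
  var <- newToyResultVar
  before <- readToyResult var
  putToySuccess var (42 :: Int)
  after <- readToyResult var
  print (before, after) -- prints (Nothing,Just 42)
```

Nothing in the types forces the cell to be filled, which is why forgetting to do so could only show up at runtime.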
A trivial data source
Anyway, we now have a clear idea of what r a is: a request whose result has type a. Let's create a simple data source, Deep Thought, which only knows how to answer a single request.
data DeepThought a where
  AnswerToLifeTheUniverseAndEverything :: DeepThought Int
I'm using a GADT so that each request can specify the type of its answer. For example, I could easily add a request whose answer is a string instead of a number:
data DeepThought a where
  AnswerToLifeTheUniverseAndEverything :: DeepThought Int
  QuestionOfLifeTheUniverseAndEverything :: DeepThought String
But of course, Deep Thought isn't powerful enough to answer that request.
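To see how the GADT lets each request pick its own answer type, here is a small self-contained sketch. The function tryAnswer is hypothetical and exists only for this illustration; Haxl does not ask for such a function.

```haskell
{-# LANGUAGE GADTs #-}

data DeepThought a where
  AnswerToLifeTheUniverseAndEverything   :: DeepThought Int
  QuestionOfLifeTheUniverseAndEverything :: DeepThought String

-- Hypothetical pure handler, only here to show the type refinement.
-- In the first equation the compiler knows that a is Int, so the literal
-- 42 is accepted; in the second, a is String, and we decline to answer.
tryAnswer :: DeepThought a -> Maybe a
tryAnswer AnswerToLifeTheUniverseAndEverything   = Just 42
tryAnswer QuestionOfLifeTheUniverseAndEverything = Nothing

main :: IO ()
main = do
  print (tryAnswer AnswerToLifeTheUniverseAndEverything)   -- prints Just 42
  print (tryAnswer QuestionOfLifeTheUniverseAndEverything) -- prints Nothing
```

Pattern-matching on the constructor is what tells the type checker which a we are dealing with in each branch.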
We also know that fulfilling a request isn't done by returning an answer, but by assigning the answer to a ResultVar.
runDeepThought :: DeepThought a -> ResultVar a -> IO ()
runDeepThought AnswerToLifeTheUniverseAndEverything var
  = putSuccess var 42
Alright, let's try to make DeepThought an official data source by implementing the DataSource typeclass:
instance DataSource () DeepThought where
  fetch _ _ _ reqs = SyncFetch $
    forM_ reqs $ \(BlockedFetch req var) ->
      runDeepThought req var
There's also a bunch of other easy typeclasses to implement; see the next source link for details.
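The loop in that instance relies on the existential pairing of a request with its result cell. Here is a base-only sketch of the same pattern, under the assumption that a plain IORef can stand in for ResultVar; Req, Blocked, serve and serveAll are hypothetical names, not part of Haxl.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE GADTs #-}

import Data.Foldable (forM_)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical request type with two differently-typed answers.
data Req a where
  Twice :: Int -> Req Int
  Shout :: String -> Req String

-- Toy counterpart of BlockedFetch: the hidden type a ties the request to
-- the cell which will receive its answer.
data Blocked = forall a. Blocked (Req a) (IORef (Maybe a))

serve :: Req a -> IORef (Maybe a) -> IO ()
serve (Twice n) ref = writeIORef ref (Just (2 * n))
serve (Shout s) ref = writeIORef ref (Just (s ++ "!"))

-- A single loop serves a batch whose answers have different types.
serveAll :: [Blocked] -> IO ()
serveAll reqs = forM_ reqs $ \(Blocked req ref) -> serve req ref

main :: IO ()
main = do
  r1 <- newIORef Nothing
  r2 <- newIORef Nothing
  serveAll [Blocked (Twice 21) r1, Blocked (Shout "hi") r2]
  a1 <- readIORef r1
  a2 <- readIORef r2
  print (a1, a2) -- prints (Just 42,Just "hi!")
```

Because a is hidden inside each Blocked value, the list can mix requests of different result types while each individual pair stays well-typed.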
A trivial state
I now have everything I need for my dataFetch to compile...
>>> runHaxl myEnv (dataFetch AnswerToLifeTheUniverseAndEverything)
*** DataSourceError "data source not initialized: DeepThought"
...but the execution fails at runtime. Now that I think about it, it makes a lot of sense: even though I don't use it, fetch receives a value of type State DeepThought, but since this is a custom type and I haven't given any of its inhabitants to anything, there is no way for Haxl to conjure one up from thin air. There must be a way to initialize the state somehow.
I must say that I'm a bit disappointed by how imperative Haxl's API has been so far. Whether we're assigning values to result variables or initializing a state, correctness requires us to perform actions which aren't required by the types and thus can't be caught until runtime. This is unusual for a Haskell library, and if the rest of the API is like this, I'm afraid following the types won't be a very useful exploration technique.
Anyway, I couldn't find any function with "init" in the name, but by looking for occurrences of State in the types, I figured out how to perform the initialization: via the environment, whose state store I had left empty until now.
instance StateKey DeepThought where
  data State DeepThought = NoState

initialState :: StateStore
initialState = stateSet NoState stateEmpty
>>> myEnv <- initEnv initialState ()
>>> runHaxl myEnv (dataFetch AnswerToLifeTheUniverseAndEverything)
42
(source)
It worked! We have a trivial data source, we have a trivial expression which queries it, we can run our expression, and we obtain the right answer. That's our black triangle!
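The earlier runtime failure is what one would expect from a store which is looked up by type. Here is a toy model of that idea using Data.Dynamic; ToyStore, toyEmpty, toySet and toyGet are hypothetical names, and this is only a guess at the general mechanism, not Haxl's StateStore implementation.

```haskell
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Maybe (listToMaybe, mapMaybe)
import Data.Typeable (Typeable)

-- Toy store: a bag of dynamically-typed values, looked up by type.
newtype ToyStore = ToyStore [Dynamic]

toyEmpty :: ToyStore
toyEmpty = ToyStore []

toySet :: Typeable s => s -> ToyStore -> ToyStore
toySet s (ToyStore ds) = ToyStore (toDyn s : ds)

-- The lookup is driven purely by the requested result type, so a missing
-- entry can only be detected at runtime.
toyGet :: Typeable s => ToyStore -> Maybe s
toyGet (ToyStore ds) = listToMaybe (mapMaybe fromDynamic ds)

data DeepThoughtState = NoState deriving (Show)
data OtherState = OtherState deriving (Show)

main :: IO ()
main = do
  let store = toySet NoState toyEmpty
  print (toyGet store :: Maybe DeepThoughtState) -- prints Just NoState
  print (toyGet store :: Maybe OtherState)       -- prints Nothing
```

The second lookup type-checks just as well as the first, which mirrors how the uninitialized DeepThought state compiled fine and only failed when run.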
Multiple data sources, multiple requests
Next, I'd like to try a slightly more complicated computation. Syncopat3d gives the following example:
F_0(x, y, z) = E(F_1(x, y), F_2(y, z))
Here we clearly have two different data sources, E and F. Syncopat3d insists that E is computed by an external program, which is certainly possible since our data sources can run any IO code, but I don't think this implementation detail is particularly relevant to our exploration of Haxl, so I'll create two more trivial data sources.
data E a where
  E :: String -> String -> E String
  deriving Typeable

data F a where
  F_1 :: String -> String -> F String
  F_2 :: String -> String -> F String
  deriving Typeable
runE :: E a -> ResultVar a -> IO ()
runE (E x y) var = putSuccess var (printf "E(%s,%s)" x y)
runF :: F a -> ResultVar a -> IO ()
runF (F_1 x y) var = putSuccess var (printf "F_1(%s,%s)" x y)
runF (F_2 x y) var = putSuccess var (printf "F_2(%s,%s)" x y)
Since GenHaxl is a monad, assembling those three requests should be quite straightforward...
>>> runHaxl myEnv $ do
...   f1 <- dataFetch (F_1 "x" "y")
...   f2 <- dataFetch (F_2 "y" "z")
...   dataFetch (E f1 f2)
"E(F_1(x,y),F_2(y,z))"
(source)
Batching
...but if I add a bit of tracing to my DataSource instances, I see that this computation is performed in three phases: F_1, F_2, then E.
>>> runHaxl myEnv ...
Computing ["F_1(x,y)"]
Computing ["F_2(y,z)"]
Computing ["E(F_1(x,y),F_2(y,z))"]
"E(F_1(x,y),F_2(y,z))"
(source)
This is not the trace I was hoping to see. Since fetch is receiving a list of request/var pairs, I expected Haxl to send me multiple requests at once, in case my data source knows how to exploit commonalities in the requests. But it doesn't look like Haxl figured out that the F_1 and F_2 requests could be performed at the same time.
It turns out that this is a well-known problem with Haxl's monadic interface. I remember it now: it was described in a presentation about Haxl (slide 45) when it came out. The solution is to use the Applicative syntax to group the parts which are independent of each other:
>>> runHaxl myEnv $ do
...   (f1,f2) <- liftA2 (,) (dataFetch (F_1 "x" "y"))
...                         (dataFetch (F_2 "y" "z"))
...   dataFetch (E f1 f2)
Computing ["F_2(y,z)","F_1(x,y)"]
Computing ["E(F_1(x,y),F_2(y,z))"]
"E(F_1(x,y),F_2(y,z))"
(source)
Good, the F_1 and F_2 requests are now being performed together.
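To get an intuition for why the Applicative version batches while the monadic one cannot, here is a toy model, assuming nothing about Haxl's real internals. The Fetch type, fetchReq and rounds are hypothetical names, and each request's answer is simply its own name, as with the trivial data sources above.

```haskell
-- A computation is either finished, or blocked on a batch of request
-- names, with the rest of the computation to resume afterwards.
data Fetch a
  = Done a
  | Blocked [String] (Fetch a)

instance Functor Fetch where
  fmap f (Done a) = Done (f a)
  fmap f (Blocked rs k) = Blocked rs (fmap f k)

-- <*> can see both sides before running either, so two blocked sides are
-- merged into a single batch. (It deliberately differs from the <*> one
-- would derive from >>=, which is the whole point of the trick.)
instance Applicative Fetch where
  pure = Done
  Done f <*> x = fmap f x
  Blocked rs k <*> Done a = Blocked rs (k <*> Done a)
  Blocked rs1 k1 <*> Blocked rs2 k2 = Blocked (rs1 ++ rs2) (k1 <*> k2)

-- >>= cannot look inside the function on its right until the left side
-- has produced a value, so it has to wait for the left batch first.
instance Monad Fetch where
  Done a >>= f = f a
  Blocked rs k >>= f = Blocked rs (k >>= f)

-- A request whose answer is its own name.
fetchReq :: String -> Fetch String
fetchReq r = Blocked [r] (Done r)

-- Run a computation, recording which requests were issued in each round.
rounds :: Fetch a -> ([[String]], a)
rounds (Done a) = ([], a)
rounds (Blocked rs k) = let (rss, a) = rounds k in (rs : rss, a)

e2 :: String -> String -> String
e2 a b = "E(" ++ a ++ "," ++ b ++ ")"

monadic :: Fetch String
monadic =
  fetchReq "F_1(x,y)" >>= \f1 ->
    fetchReq "F_2(y,z)" >>= \f2 ->
      fetchReq (e2 f1 f2)

applicative :: Fetch String
applicative =
  ((,) <$> fetchReq "F_1(x,y)" <*> fetchReq "F_2(y,z)") >>= \(f1, f2) ->
    fetchReq (e2 f1 f2)

main :: IO ()
main = do
  print (fst (rounds monadic))
  -- prints [["F_1(x,y)"],["F_2(y,z)"],["E(F_1(x,y),F_2(y,z))"]]
  print (fst (rounds applicative))
  -- prints [["F_1(x,y)","F_2(y,z)"],["E(F_1(x,y),F_2(y,z))"]]
```

In this model the monadic version needs three rounds and the Applicative version two, matching the shape of the traces above (the order within a batch differs, which is an artifact of the toy).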
Style
I don't like the way in which we have to write our computations. Consider a slightly more complicated example:
E(
  E(F_1(x,y), F_2(y,z)),
  E(F_1(x',y'), F_2(y',z'))
)
Since the four F_1 and F_2 requests at the leaves are all independent, it would make sense for Haxl to batch them all together. But in order to obtain this behaviour, I have to list their four subcomputations together.
>>> runHaxl myEnv $ do
...   (f1,f2,f1',f2') <- (,,,) <$> (dataFetch (F_1 "x" "y"))
...                            <*> (dataFetch (F_2 "y" "z"))
...                            <*> (dataFetch (F_1 "x'" "y'"))
...                            <*> (dataFetch (F_2 "y'" "z'"))
...   (e1,e2) <- (,) <$> (dataFetch (E f1 f2))
...                  <*> (dataFetch (E f1' f2'))
...   dataFetch (E e1 e2)
Computing ["F_2(y',z')","F_1(x',y')","F_2(y,z)","F_1(x,y)"]
Computing ["E(F_1(x',y'),F_2(y',z'))","E(F_1(x,y),F_2(y,z))"]
Computing ["E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"]
"E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"
(source)
I feel like I'm doing the compiler's job, manually converting from the nested calls I want to write to the leaves-to-root, layered style I have to write if I want batching to work.
So I stopped working on my anti-tutorial and wrote a toy library which converts from one style to the other :)
...and when I came back here to show it off, I discovered that GenHaxl already behaved exactly like my library did! You just have to know how to define your intermediate functions:
f_1 :: GenHaxl () String -> GenHaxl () String -> GenHaxl () String
f_1 x y = join (dataFetch <$> (F_1 <$> x <*> y))
f_2 :: GenHaxl () String -> GenHaxl () String -> GenHaxl () String
f_2 x y = join (dataFetch <$> (F_2 <$> x <*> y))
e :: GenHaxl () String -> GenHaxl () String -> GenHaxl () String
e x y = join (dataFetch <$> (E <$> x <*> y))
And with those, we can now describe the computation as nested function calls, as desired.
>>> x = pure "x"
>>> y = pure "y"
>>> z = pure "z"
>>> x' = pure "x'"
>>> y' = pure "y'"
>>> z' = pure "z'"
>>> runHaxl myEnv $ e (e (f_1 x y) (f_2 y z))
...                   (e (f_1 x' y') (f_2 y' z'))
Computing ["F_2(y',z')","F_1(x',y')","F_2(y,z)","F_1(x,y)"]
Computing ["E(F_1(x',y'),F_2(y',z'))","E(F_1(x,y),F_2(y,z))"]
Computing ["E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"]
"E(E(F_1(x,y),F_2(y,z)),E(F_1(x',y'),F_2(y',z')))"
(source)
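The three intermediate functions share one shape: combine the effectful arguments with <*>, then join once. That shape can be factored out into a helper which works for any monad. The name nested2 is hypothetical, and the demonstration below uses Maybe rather than GenHaxl so that it runs with base alone; with Haxl, one would presumably write something like f_1 = nested2 (\a b -> dataFetch (F_1 a b)), though I have not compiled that.

```haskell
import Control.Monad (join)

-- Hypothetical helper: turns a function which builds an effectful step
-- out of plain arguments into one which accepts effectful arguments.
-- The arguments are combined with <*> before the single join.
nested2 :: Monad m => (a -> b -> m c) -> m a -> m b -> m c
nested2 f x y = join (f <$> x <*> y)

-- Stand-in for a request: it fails on empty input, to show that effects
-- from the arguments and from the step itself are all threaded through.
safeConcat :: String -> String -> Maybe String
safeConcat a b
  | null a || null b = Nothing
  | otherwise = Just (a ++ b)

main :: IO ()
main = do
  print (nested2 safeConcat (Just "x") (Just "y"))
  -- prints Just "xy"
  print (nested2 safeConcat (nested2 safeConcat (Just "x") (Just "y")) (Just "z"))
  -- prints Just "xyz"
  print (nested2 safeConcat (Just "x") Nothing)
  -- prints Nothing
```

Maybe has no notion of batching, so this only demonstrates the plumbing; the batching itself comes from how the underlying monad implements <*>.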
Conclusion
I now understand Haxl's purpose much better. With the appropriate intermediate functions, Haxl allows us to describe a computation very concisely, as nested function calls. Haxl executes this computation one layer at a time: all of the leaves, then all the requests which only depend on the leaves, and so on. Within a single layer, the requests are subdivided again, this time according to their respective data sources. Finally, for a given data source, it is fetch's responsibility to find and exploit opportunities for reusing work across the different requests belonging to the same batch. There are also some features related to caching and parallelism which I didn't explore.
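As a rough picture of the "subdivided according to their data sources" step, here is a sketch which splits one layer of pending requests into per-source batches. The flat (source name, request) representation and the name batchesBySource are assumptions made for this illustration; Haxl's real bookkeeping is typed and certainly looks different.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical flat representation of one layer: each pending request is
-- tagged with the name of its data source.
type Pending = (String, String)

-- Split a layer into one batch per data source. sortOn is stable, so the
-- requests within a batch keep their original relative order.
batchesBySource :: [Pending] -> [(String, [String])]
batchesBySource layer =
  [ (src, req : map snd rest)
  | ((src, req) : rest) <- groupBy ((==) `on` fst) (sortOn fst layer)
  ]

main :: IO ()
main =
  print (batchesBySource [("F", "F_1(x,y)"), ("E", "E(a,b)"), ("F", "F_2(y,z)")])
  -- prints [("E",["E(a,b)"]),("F",["F_1(x,y)","F_2(y,z)"])]
```

Each resulting batch is what a single call to a data source's fetch would receive in this simplified picture.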
I also understand Haxl's implementation much better, having reimplemented part of it myself. In fact, I'd be interested in writing a follow-up post named "Homemade Haxl", in the same vein as my "Homemade FRP" series. What do you think? Are you more interested in watching me learn some new libraries, watching me reimplement libraries, or watching me implement new stuff? I'll be doing all three anyway, I just want to know which of those activities I should blog about :)
Really, your feedback would be greatly appreciated, as the only reason I started this anti-tutorial series in the first place is that my first write-up on understanding Pipes was so surprisingly popular. I've streamlined the format a lot since that first post, and I want to make sure I haven't lost any of the magic in the process!
3 comments:
Very informative tutorial on Haxl. I hope you continue your series of anti-tutorials.
I also wanted to know a solution using Haxl to a problem I currently face.
We have several independent API calls to data sources which then mutate a data structure and give a final value.
How do I bring a mutable data structure into the picture with Haxl?
Thanks for the feedback!
I'm not sure I understand what you're trying to do; is it the data sources which mutate a data structure, or is it the code which fetches data from those data sources? Would it work for you to perform the pure part of the computation inside GenHaxl, and then to perform the mutations inside IO, once you have received the results from runHaxl?
Why not just add {-# LANGUAGE ApplicativeDo #-}? This way you let the compiler work out the data dependencies, instead of having to use Applicative by hand wherever there are no data dependencies.