Introduction
I have read about Generalized Algebraic Data Types (GADTs) before, at least as implemented in GHC. The standard type-safe expression evaluator example was interesting, but it never left much of an impression on me. Last week, I ran into them in real code for the first time while I was playing with hoopl, a library for representing control-flow graphs and performing dataflow analysis and graph rewriting. The use of GADTs in the hoopl code was enlightening, and now I think I have a reasonable feel for their power; hoopl uses them in concert with type families in ways I am still digesting.
Hoopl uses GADTs to indicate the shape of objects (graphs, blocks, and instructions) in a CFG: objects are either "open" (O) or "closed" (C) on entry and exit. Control flow can "fall into" nodes that are open on entry and "fall through" nodes that are open on exit. For example, a typical addition instruction in a compiler IR is open on both entry and exit, and so would have a type like Insn O O. On the other hand, block termination instructions, which cause control flow to jump to another block, are closed on exit because control flow does not just linearly fall through them, so a branch instruction would have type Insn O C. Hoopl treats instructions that are jumped to as being of type Insn C O; it seems conventional for (possibly virtual) label instructions to serve this role. Hoopl does not allow instructions to have type Insn C C. Note that basic blocks are also either open or closed on entry and exit.
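To make the shape tags concrete, here is a minimal sketch of what shape-indexed instruction types look like, using the O, C, and Label types that hoopl exports. The toy IR and its constructors are hypothetical, purely for illustration:
{-# LANGUAGE GADTs #-}
import Compiler.Hoopl (O, C, Label)

-- A made-up register type for the toy IR.
data Reg = Reg Int

-- Each constructor's result type records its shape.
data ToyInsn e x where
  Add    :: Reg -> Reg -> Reg -> ToyInsn O O  -- open/open: control falls in and through
  Branch :: Label -> ToyInsn O C              -- open/closed: control jumps away
  Lbl    :: Label -> ToyInsn C O              -- closed/open: a jump target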
A New CFG in Hoopl
I was building a control-flow graph for the LLVM IR, so I mirrored it as closely as I could. I mapped each BasicBlock name to a Label (type Insn C O) and treated that as the first instruction in each block. Every LLVM block ends in a terminator instruction (type Insn O C). All other instructions then had type Insn O O. Since the label is closed on entry and the terminator is closed on exit, the basic block that they form is of type Block C C: closed on both entry and exit. Combining all of the basic blocks in a function then produces a Graph C C. In an alternative design, the first block of a control-flow graph could be of type Block O C, making the graph Graph O C, meaning it has a unique entry point. More on that later.
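As a sketch of how those shapes compose, here is roughly how one such block could be assembled with hoopl's graph-building combinators. It assumes the Insn wrapper defined later in this post, and BasicBlock and Instruction are my LLVM IR types; the shape of each intermediate graph is noted in the comments:
import Compiler.Hoopl
import Prelude hiding ((<*>))  -- hoopl defines its own graph-splicing <*>

-- Assemble one LLVM-style basic block from its label, middle instructions,
-- and terminator (with its successor labels).
blockGraph :: BasicBlock -> Label -> [Instruction] -> Instruction -> [Label]
           -> Graph Insn C C
blockGraph bb lbl mids term succs =
  mkFirst (Lbl bb lbl)                   -- Graph Insn C O
    <*> mkMiddles (map Normal mids)      -- Graph Insn O O
    <*> mkLast (Terminator term succs)   -- Graph Insn O C

-- Whole functions are built by splicing closed/closed graphs:
--   funcGraph = block1 |*><*| block2 |*><*| ...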
Now, having fancy types on your control-flow graph is interesting, but is it useful? When treating the control-flow graph as an opaque type, I do not know yet. The answer might even be no. On the other hand, the first thing I did with this control-flow graph was to write my own dataflow analysis framework, admittedly based heavily on the one provided by hoopl. While hoopl provides its own framework, there were a few changes I wanted to make:
- I wanted monadic transfer functions (see the sketch after this list). While general side effects in a transfer function would be awkward and possibly dangerous, I like to provide Reader environments to some analyses to thread immutable data through transfer functions. I also like to provide a Writer context to log diagnostics purely, instead of just dumping them to stderr. With careful management, a cache in a State monad can sometimes be very useful, too (perhaps avoiding repeated calls to a theorem prover).
- I removed the graph rewriting. Even though half of the point of hoopl is integrated on-the-fly graph rewriting, I do not need that functionality (and the rest of my IR cannot yet support it). Removing it keeps things simple.
- I wanted a simpler API. Despite the elegance of GADTs, I did not really want to have to deal with them in the public dataflow analysis API. Nearly any use of a GADT requires client code to enable the extension, and doing so with catch-all pattern matches seems prone to triggering odd warnings. Further, the additional safety in the public API would have been minimal for the LLVM IR.
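As a rough illustration of the first point, a monadic transfer function could be typed something like the sketch below. The monad stack and names are hypothetical, and the shape-indexed Fact type described later is omitted for simplicity:
import Control.Monad.Reader (ReaderT)
import Control.Monad.Writer (Writer)

-- A hypothetical analysis monad: a Reader for immutable configuration
-- threaded through every transfer function, plus a Writer for logging
-- diagnostics purely.
type AnalysisM env = ReaderT env (Writer [String])

-- A monadic transfer function over some instruction type insn and fact
-- type f; hoopl's transfer functions are pure by comparison.
type TransferM env insn f = insn -> f -> AnalysisM env f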
Suddenly GADTs
Even though one of my motivations was to hide GADTs from the user, they proved very convenient in writing the dataflow analysis framework. The first step in using hoopl is to make your instruction type an instance of the NonLocal class, which describes the connectivity between blocks:
class NonLocal thing where
  entryLabel :: thing C x -> Label
  successors :: thing e C -> [Label]
I made a GADT wrapper around my LLVM instructions, Insn, which had the appropriate entry and exit type tags:
data Insn e x where
  Lbl :: BasicBlock -> Label -> Insn C O
  Terminator :: Instruction -> [Label] -> Insn O C
  Normal :: Instruction -> Insn O O
  UniqueExitLabel :: Label -> Insn C O
  UniqueExit :: Insn O C
While most instructions have normal fallthrough semantics (tagged O O), the virtual labels and terminator instructions have more interesting types. I use the last two constructors to provide a unified exit node for each function, since LLVM does not have a unique exit node. The power of GADTs started to show while making Insn an instance of NonLocal:
instance NonLocal Insn where
  entryLabel (Lbl _ lbl) = lbl
  entryLabel (UniqueExitLabel lbl) = lbl
  successors (Terminator _ lbls) = lbls
  successors UniqueExit = []
Note that entryLabel only needs to handle the block-entry instructions (those with entry type C), and successors only needs to handle the terminator instructions (those with exit type C). Even when compiling with -Wall, the compiler issues no warnings for this code because it can prove that the pattern matching is exhaustive. The type signatures for those two class methods restrict them to accepting only values that are closed on entry and exit, respectively, so only values with those type tags need to be handled. The compiler can infer this because the types of the Insn GADT constructors introduce constraints that make Lbl and UniqueExitLabel the only closed-on-entry instructions; likewise for the terminator instructions. Without GADTs, I would have had to either implement dummy cases for the other constructors in each method, or throw in a catch-all case that called error or something similar.
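For contrast, here is roughly what entryLabel would look like with an ordinary (non-GADT) instruction type. The compiler cannot rule out the other constructors, so the impossible cases need a catch-all; the constructors mirror the wrapper above but are only illustrative:
-- A plain ADT loses the shape information, so every constructor is a
-- possible argument and the non-entry cases must be dispatched at runtime.
data PlainInsn = PlainLbl BasicBlock Label
               | PlainTerminator Instruction [Label]
               | PlainNormal Instruction
               | PlainUniqueExitLabel Label
               | PlainUniqueExit

plainEntryLabel :: PlainInsn -> Label
plainEntryLabel (PlainLbl _ lbl)           = lbl
plainEntryLabel (PlainUniqueExitLabel lbl) = lbl
plainEntryLabel _ = error "entryLabel: not a block-entry instruction"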
While implementing the actual dataflow analysis code, GADTs made a few more interesting appearances. First, since my control-flow graphs are all closed on entry and on exit, I did not have to handle cases that would have been possible had I opted to allow control-flow graphs to be open at the beginning. I also did not have to write any cases handling blocks that are open on entry. A consequence of this decision, though, is that my dataflow analysis must specify the entry node into the control-flow graph; since no blocks are open on entry, one cannot automatically be found. On the other hand, hoopl handles graphs of all shapes, and the documentation for the forward dataflow analysis function has a note: "if the graph being analyzed is open at the entry, there must be no other entry point, or all goes horribly wrong...". Perhaps being explicit here is better overall. The note indicates to me that the type system cannot guarantee (as things stand) that there is only one open-on-entry block in the graph. Although my library code has to be explicit in providing an entry point, users of my library do not have to deal with this: each control-flow graph tracks its own entry point and automatically provides it to the dataflow framework.
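Concretely, the last point just means the graph value remembers its own entry label, so the public analysis driver can fish it out instead of asking callers for it; the type and field names below are hypothetical:
-- A sketch of a function-level CFG that carries its own entry point.
data FunctionCFG = FunctionCFG
  { cfgEntryLabel :: Label          -- label of the single entry block
  , cfgGraph      :: Graph Insn C C -- closed on both entry and exit
  }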
The dataflow analysis implementation in hoopl uses GADTs everywhere, but one other style of use struck me as particularly elegant, so I think that I will mention it here. It also took me quite a while to understand. I think it is described in the hoopl paper, but the detail was lost on me until I was implementing it myself. In hoopl, transfer functions in the dataflow analysis return values of type Fact x f. Here, x represents the shape of the instruction on exit and f is the type of the dataflow fact. The definition of Fact is:
type family Fact x f :: *
type instance Fact O f = f
type instance Fact C f = FactBase f
and the type of the transfer function is:
transfer :: forall e x . n e x -> f -> Fact x f
This means that instructions that are open on exit return a single dataflow fact, while instructions that are closed on exit (i.e., terminator instructions) return a FactBase. A FactBase maps labels to dataflow facts, so the transfer function applied to a terminator instruction can propagate different information along each outgoing edge. I thought this was extremely elegant. Unfortunately, exposing the GADT API to dataflow clients would have been inconvenient for me because my instructions do not have natural shape tags, so I would have had to expose my GADT wrappers. I decided that keeping the implementation hidden was more important than mirroring this API exactly, so I simply have a separate edge transfer function that clients can optionally provide. The real surprise to me, though, was the interplay between type families and GADTs: transfer functions can return different types depending on their inputs because the shape tag selects the type family instance being used. I think this might only work as well as it does because the entire dataflow analysis framework in hoopl is designed as a series of composed fact transformers; otherwise, the heterogeneous return types probably would not be very useful. I suspect this is a useful pattern to keep in mind.
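To make the shape-indexed return type concrete, here is a sketch of a transfer function for a trivial reachability analysis over the Insn wrapper from earlier, where the fact is just (). Instructions that are open on exit pass the fact through, while the terminator fans it out to a FactBase keyed by its successor labels; mapFromList and mapEmpty are hoopl's label-map builders, and the analysis itself is only illustrative:
{-# LANGUAGE GADTs #-}
import Compiler.Hoopl

-- Fact O () is (), and Fact C () is FactBase (), so the return type of each
-- equation is selected by the exit shape of the constructor it matches.
reachTransfer :: Insn e x -> () -> Fact x ()
reachTransfer (Lbl _ _)           f = f
reachTransfer (UniqueExitLabel _) f = f
reachTransfer (Normal _)          f = f
reachTransfer (Terminator _ lbls) f = mapFromList [ (l, f) | l <- lbls ]
reachTransfer UniqueExit          _ = mapEmpty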
Impressions
Overall, I am very happy with my new control-flow graph and hoopl. I also have a new-found respect for GADTs, now that I know what they are and have some idea of how to use them. The code for hoopl is like a giant type-based tapestry, and the 16-page Haskell '10 paper barely does it justice.
Aside
As an aside, a while ago I had read a paper about an Applicative control-flow graph that used zippers. I always thought it was a very cool idea, but I never implemented it myself because I have not needed that level of sophistication yet (in particular, I do not really need to be able to rewrite control-flow graphs). When I started reading the paper on hoopl, I thought the idea sounded familiar; I did not realize that the first two authors were the same on both papers until much later.