Relative to many other languages, the humble array is underrepresented
in Haskell. That may follow from the difficulty of fitting arrays
into pure code; an entire array to update a single element rarely
makes sense. There are some immutable data structures that
look like arrays,
but are not actually. Mutable arrays (in ST
or IO
) make more
sense, so I want to write up a few notes about them.
Haskell has two general purpose array libraries: array and vector. There are also some special-purpose libraries like repa, which enables efficient (and implicitly parallel) numeric computations. The vector package provides a flexible and efficient array abstraction, with support for fusing many operations. In contrast, the array package has a reputation as archaic and slightly odd; certainly, the API specified in the Haskell report is strange when compared with the API provided by vector, which more closely mirrors the APIs provided by containers and the other data structure libraries in the Haskell ecosystem. That is a fair criticism. However, if you do not need the features in vector, array is still worth a look.
Why Array
If you just need contiguous (and possibly unboxed) storage of values
with Int
(or Int
-like) keys with support for O(1) reads and
writes, array is still a very good choice. For this use case, array
even has a significant advantage over vector: the index into the array
can be any type implementing the Ix
typeclass (for “indexable”).
This means that you can define extra types (perhaps as a newtype
around Int
) as array indexes. I have found this to be immensely
useful, as it prevents me from accidentally indexing an array with the
wrong Int
that I had lying around.
Unboxed data
Safe indexing and O(1) updates are undeniably useful. However, one of the real strengths of arrays (and vectors) is that they support unboxed data. In a boxed array, elements are stored via pointer. In an unboxed array, the elements are stored directly in the allocated array. This has a few nice implications:
- it removes a layer of pointer indirection,
- the elements are all stored contiguously, and
- the garbage collector does not need to scan the elements.
In short, if you can store your data in an unboxed format, it is usually worth it. There is one restriction: you cannot store unevaluated thunks in an unboxed container. Since that extra layer of indirection has been removed, all of the elements are always fully evaluated. In many cases, this is actually also a good thing.
Both array and vector have a default set of instances allowing them to store most of the unboxable types in base. But what if we want to store our own types in unboxed arrays? The answer turns out to be a little complicated.
Unboxed mutable vectors
The vector package has a few classes that need to
be implemented for your type T
(in the case of mutable vectors):
Data.Vector.Generic.Mutable.MVector Data.Vector.Unboxed.Mutable.MVector T
Unbox T
It really is not obvious how to implement them by hand. You need to
root around in the implementation of vector to figure it out.
Luckily, these instances can be derived with
GeneralizedNewtypeDeriving
. Or, they could, until ghc 7.8 made
GeneralizedNewtypeDeriving
safe. If you are
willing to use Template Haskell, the
vector-th-unbox
package can generate the necessary instances.
Unboxed mutable arrays
The story for array is not much simpler. I could not find much
documentation on this issue, so hopefully this post will shed some
light on these darker corners of the library ecosystem. In order to
store a type T
in an unboxed mutable array, you only need one
instance (at least for the IO
variant):
MArray IOUArray T IO
This is a bit problematic. First, it does not really fit the form we
would expect for a (GeneralizedNewtypeDeriving
) deriving clause.
Normally, we would expect the T
to come last, but MArray
is just
not defined that way. That is okay – we can work around that with
StandaloneDeriving
:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE StandaloneDeriving #-}
import GHC.IO ( IO(..) )
import qualified Data.Array.IO as IOA
newtype T = T Int
deriving instance IOA.MArray IOA.IOUArray T IO
We have to import the constructor for IO
these days, or GHC
complains. It turns out that GHC still complains:
Can't make a derived instance of ‘IOA.MArray IOA.IOUArray T IO’
(even with cunning newtype deriving):
cannot eta-reduce the representation type enough
In the stand-alone deriving instance for
‘IOA.MArray IOA.IOUArray T IO’
I confess that I do not really know what that means, aside from it
just not working. Perhaps GHC does not know what parameter it is
supposed to be deriving for. In any case, that leaves us with one
option: we get to write the instance ourselves. Doing so is
unpleasant, but at least it is fairly obvious from the existing
instances defined in array. For our type T
, which is a newtype
around Int
, the instance looks like the following:
import qualified GHC.Base as B
import GHC.ST ( ST(..) )
import Control.Monad.ST ( stToIO )
import qualified Data.Array.IO as IOA
import qualified Data.Array.IO.Internals as IOA
import qualified Data.Array.MArray as MA
import qualified Data.Array.Base as BA
instance MA.MArray (BA.STUArray s) T (ST s) where
{-# INLINE getBounds #-}
getBounds (BA.STUArray l u _ _) = return (l, u)
{-# INLINE getNumElements #-}
getNumElements (BA.STUArray _ _ n _) = return n
{-# INLINE unsafeRead #-}
unsafeRead (BA.STUArray _ _ _ marr#) (B.I# i#) = ST $ \s1# ->
case B.readIntArray# marr# i# s1# of
(# s2#, e# #) -> (# s2#, T (B.I# e#) #)
{-# INLINE unsafeWrite #-}
unsafeWrite (BA.STUArray _ _ _ marr#) (B.I# i#) (T (B.I# e#)) = ST $ \s1# ->
case B.writeIntArray# marr# i# e# s1# of
s2# -> (# s2#, () #)
unsafeNewArray_ (l, u) = BA.unsafeNewArraySTUArray_ (l, u) BA.wORD_SCALE
instance MA.MArray IOA.IOUArray T IO where
{-# INLINE getBounds #-}
getBounds (IOA.IOUArray arr) = stToIO $ BA.getBounds arr
{-# INLINE getNumElements #-}
getNumElements (IOA.IOUArray arr) = stToIO $ BA.getNumElements arr
{-# INLINE unsafeRead #-}
unsafeRead (IOA.IOUArray arr) i = stToIO $ BA.unsafeRead arr i
{-# INLINE unsafeWrite #-}
unsafeWrite (IOA.IOUArray arr) i e = stToIO $ BA.unsafeWrite arr i e
unsafeNewArray_ bounds = stToIO (IOA.IOUArray <$> BA.unsafeNewArray_ bounds)
Entirely unpleasant, but worth the effort. There are a few important things to note about these instances:
- You need to replace calls to
readIntArray
/writeIntArray
and theI#
constructors with variants appropriate for your underlying type. If you have a type that is not anewtype
around some existing type, you will need to write your own variants. - The
wORD_SCALE
function is defined in Data.Array.Base, along with a few others. You also need to choose the appropriate scale (which is basically the stride in the array) for your type. Again, you may need to define your own for custom types. - While it may look like a function you would never call, you still
need to implement
unsafeNewArray_
. The default implementation ofnewArray
, which you probably do not want to implement, calls it. The default implementation ofunsafeNewArray_
fills the array elements with a call to error. This is fine for boxed arrays, but immediately bottoms out when it is evaluated to be put into an unboxed array. The call tounsafeNewArraySTUArray_
just allocates some memory and leaves its contents undefined in the traditional C sense.
The substantive definition is for ST
, with a trivial wrapper lifting
it into IO
. You probably want to keep all of the INLINE
pragmas,
as most of this code will get compiled away nicely.
Conclusion
Unboxed data can be a bit of a pain to work with in Haskell
(especially with the recent changes to GeneralizedNewtypeDeriving
),
but it is often worth it when performance is important. I have tried
to document the steps required to unbox custom types with the array
library. I also argue that array is still a useful library, despite
its quirky API.