Suppose we're trying to render truetype fonts to bitmaps without
calling malloc().

makeCharacterBitmap() performs rasterization.

makeCharacterBitmapMemoryNeeded() computes how much temp memory to pass in.

How do we implement the latter function? Well, we have many needs for memory,
but the very last one we'll need is for the edge list during rasterization.
This depends on the number of simultaneous edges on a single scanline. This
depends on the character. (99% of users will never see more than 100, or maybe
even 10 or 20, but a font COULD have anything in it.) Specifically, it depends
on the edge list, which means we'll need enough temp memory to build the edge
list already.

We won't have memory to store the active edge list, so computing how big the
active edge list *would be* probably requires heroic programming, or maybe
unavoidably takes a significant performance hit as we have to rescan the edge
list. (Possibly even with heroic programming it's impossible to avoid O(N^2)
performance if you can't have another data structure.)
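
To make the rescanning cost concrete, here is a minimal sketch of the brute-force count. It assumes a hypothetical `Edge` record holding only each edge's vertical extent (this is not stb_truetype's actual edge type), and it counts overlap at a single y value rather than across a whole scanline, so a real rasterizer would need to pad the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical edge record: only the vertical extent matters here. */
typedef struct {
    float y0; /* top of the edge (smaller y)   */
    float y1; /* bottom of the edge (larger y) */
} Edge;

/* Return the largest number of edges alive at any one y value.
   The maximum overlap is always reached at the top of some edge, so
   for each edge we count how many edges span its y0. No extra memory
   is used, at the cost of O(N^2) comparisons. */
static size_t maxSimultaneousEdges(const Edge *edges, size_t n)
{
    size_t best = 0;
    for (size_t i = 0; i < n; ++i) {
        float y = edges[i].y0;
        size_t count = 0;
        for (size_t j = 0; j < n; ++j) {
            assert(edges[j].y0 <= edges[j].y1);
            /* half-open interval [y0, y1): edge j is live at height y */
            if (edges[j].y0 <= y && y < edges[j].y1)
                ++count;
        }
        if (count > best)
            best = count;
    }
    return best;
}
```

Sorting the edges first would bring this down to O(N log N), but a sort that needs scratch space is exactly the "another data structure" we said we don't have.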

So the client will have to pass in to makeCharacterBitmapMemoryNeeded() enough
memory to have computed the edge list. So how much memory is that?

makeCharacterBitmapMemoryNeededMemoryNeeded() will compute the memory needed
for the above function.

How do we implement this? Well, we need to get the tesselated edge list. How
big is the tesselated edge list? How do we build the tesselated edge list?

The way this works in stb_truetype is first we build a list of all the curves
in the shape, and then we tesselate it. If we stick with that approach, then
this function still needs to build that list of curves. So we need memory to
store those curves, so we need

makeCharacterBitmapMemoryNeededMemoryNeededMemoryNeeded()

Of course you could directly compute the count of tesselated edges from each
curve without building up the full list of curves explicitly. This does not
require heroic programming, but it does cost you some performance. That's
because truetype doesn't store coordinates as (x,y) pairs. Instead, for a given
shape, it stores all of the x coordinates, and then all of the y coordinates,
each of varying length. So if you want to visit all the curves without storing
them, you have to fully parse all the x coordinates to find the start of the
y coordinates, and then start over and simultaneously parse out the x coordinates
and the y coordinates. Except it's more complicated than that; first there's an
array of N flags, then N x coords, then N y coords, where the flags control how
you decode the x & y coords.
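
As an illustration of that first "find the y coordinates" pass, here is a rough sketch of walking the flags just to add up sizes. The flag bit values are the ones the TrueType 'glyf' table uses for simple glyphs; the function name is made up for this example, and it does no bounds checking, which real code would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits for simple glyphs in the TrueType 'glyf' table. */
enum {
    FLAG_X_SHORT = 0x02, /* x delta is 1 byte instead of 2          */
    FLAG_REPEAT  = 0x08, /* next byte says how many extra repeats   */
    FLAG_X_SAME  = 0x10  /* with X_SHORT clear: x delta is omitted  */
};

/* Given the start of the flag data and the number of points, return
   the byte offset (from `flags`) at which the y coordinates begin.
   Nothing is stored; the flags are walked once just to sum sizes. */
static size_t findYCoordOffset(const uint8_t *flags, size_t numPoints)
{
    size_t pos = 0;     /* bytes of flag data consumed        */
    size_t xBytes = 0;  /* total size of the x coordinate run */
    size_t i = 0;       /* points accounted for so far        */

    while (i < numPoints) {
        uint8_t f = flags[pos++];
        size_t run = 1;
        size_t size;

        if (f & FLAG_REPEAT)
            run += flags[pos++];  /* this flag covers `run` points */

        if (f & FLAG_X_SHORT)     size = 1;
        else if (f & FLAG_X_SAME) size = 0;
        else                      size = 2;

        xBytes += run * size;
        i += run;
    }
    assert(i == numPoints); /* a repeat ran past the last point */
    return pos + xBytes;
}
```

Only after this pass do you know where to start the second cursor, and then you walk the flags again with one cursor in the x data and one in the y data.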

This isn't a *huge* performance suck, especially since each of the other functions
above is also going to redo this decoding, but it still means writing the
code in a much uglier way for doing this pass.

So, for the naivest approach (without changing anything in stb_truetype), we'd
require:

makeCharacterBitmapMemoryNeededMemoryNeededMemoryNeeded()
makeCharacterBitmapMemoryNeededMemoryNeeded()
makeCharacterBitmapMemoryNeeded()
makeCharacterBitmap()

With varying amounts of rewriting of the functions, we could reduce the number
of these that are needed. We can even avoid the active edge list mess by simply
requiring the client to pass in the max # of edges on a single scanline to
makeCharacterBitmapMemoryNeeded() and use that to size the memory (and the burden
is on the client to set that correctly).

But what is all that programming in service of?

You, the client, are either going to pass in some pre-allocated
memory buffer of fixed size (or, conceptually, a correctly-sized
portion of that), or you're going to call malloc and return that
to us.

And then internally, our library is going to take the temporary
memory you pass in and make an arena and suballocate from it.
Except wait, our library doesn't actually need all that memory
at the same time. We'll have freed up the curve list by the time
we have the active edge list, so those can come from the same
memory. So, if we want to *minimize* memory usage, we actually
need to use a dynamic allocator internally. So, whether you call
malloc or use a fixed-memory block, we're going to internally
do something equivalent to malloc.

(Actually, in this specific case, you might be able to just allocate
from the beginning and end of the block, growing towards the middle.)
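
A minimal sketch of that two-ended scheme, assuming a hypothetical `TwoEndedArena` type (nothing like this exists in stb_truetype): one lifetime allocates from the low end, the other from the high end, and either end can be released wholesale.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-ended arena over a caller-supplied block. */
typedef struct {
    unsigned char *base;
    size_t lo;    /* first free byte at the low end             */
    size_t hi;    /* one past the last free byte at the high end */
    size_t size;  /* usable size, rounded down to 16             */
} TwoEndedArena;

/* Round n up to a multiple of 16 so offsets stay aligned
   (assuming the block itself starts on a 16-byte boundary). */
static size_t roundUp16(size_t n) { return (n + 15) & ~(size_t)15; }

static void arenaInit(TwoEndedArena *a, void *mem, size_t size)
{
    assert(a != NULL && mem != NULL);
    a->base = (unsigned char *)mem;
    a->size = size & ~(size_t)15;
    a->lo = 0;
    a->hi = a->size;
}

static void *arenaAllocLow(TwoEndedArena *a, size_t n)
{
    size_t need = roundUp16(n);
    void *p;
    if (need < n || need > a->hi - a->lo)
        return NULL;              /* overflow, or the ends would meet */
    p = a->base + a->lo;
    a->lo += need;
    return p;
}

static void *arenaAllocHigh(TwoEndedArena *a, size_t n)
{
    size_t need = roundUp16(n);
    if (need < n || need > a->hi - a->lo)
        return NULL;
    a->hi -= need;
    return a->base + a->hi;
}

/* Release everything allocated from one end in a single step. */
static void arenaResetLow(TwoEndedArena *a)  { a->lo = 0; }
static void arenaResetHigh(TwoEndedArena *a) { a->hi = a->size; }
```

In the font case, the curve list could go at the low end and the edge list at the high end; once the edges are built, resetting the low end lets the active edge list reuse the curve list's space.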

So, in stb_truetype, rather than have to make N passes over things
to figure out those sizes in advance -- when you're either going
to pass in an *independently-sized fixed-size reserved block*, or
just going to call malloc -- we just say "hey, you can either let
us call malloc, or you can make your own little system to 'malloc'
out of your temporary block and pass that to us". That keeps our
performance *higher*, and *induces exactly the same amount of
fragmentation it would have* (i.e. none, because it's fragmenting
this temp memory that we don't care about). It just pushes the
complexity onto you.
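
For what it's worth, the "little system" can be very little. Below is a sketch of a bump allocator hooked up through stb_truetype's `STBTT_malloc(x,u)` / `STBTT_free(x,u)` macros, where `u` is the userdata pointer the library passes through. `TempArena`, `tempAlloc` and `tempFree` are names invented for this example, and this simplest version never reuses memory within a call, so it trades some peak memory for having almost no code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bump allocator over a fixed temporary block. */
typedef struct {
    unsigned char *base;
    size_t used;
    size_t size;
} TempArena;

static void *tempAlloc(size_t n, void *userdata)
{
    TempArena *a = (TempArena *)userdata;
    size_t need = (n + 15) & ~(size_t)15;  /* keep offsets 16-aligned */
    void *p;
    assert(a != NULL);
    if (need < n || need > a->size - a->used)
        return NULL;                       /* out of temp memory */
    p = a->base + a->used;
    a->used += need;
    return p;
}

static void tempFree(void *p, void *userdata)
{
    /* Individual frees do nothing; the client resets the whole
       arena (used = 0) after the library call returns. */
    (void)p;
    (void)userdata;
}

/* These would be defined before including stb_truetype.h in the
   file that defines STB_TRUETYPE_IMPLEMENTATION. */
#define STBTT_malloc(x,u)  tempAlloc((x), (u))
#define STBTT_free(x,u)    tempFree((x), (u))
```

The client then points the library's userdata at a `TempArena` and sets `used` back to 0 once the bitmap has been produced.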

[[
This is clearer with other types of libraries, which have to do
significant work to determine things. For example, physics systems
which need to keep a list of contacts between objects touching each
other, those have a variable number of such contacts, and determining
how many are needed requires *running the simulation to that point*.
A function to determine how many are needed would itself have to
run the simulation to the end, and that function needs enough memory
to run the simulation to just before the end. I.e. you'd end up with
a sort of "iterative deepening" physics that you would call once
per memory allocation. In fact, at this point (and you can see this
in the stb_truetype pattern as well), it would make more sense from
a performance standpoint to, instead of running N functions that
repeat all the same work and get a little further in determining
how much memory is needed, have each of the N functions *reuse*
the work from the N-1th function, i.e. each function is really
continuing where it left off. But what that means is that your
client code boils down to:

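    void *newmem = NULL;   /* nothing to hand over on the first call */
    size_t newsize = 0;
    for(;;) {
       /* resumes where the previous call stopped, using the new block */
       int code = apiFunctionPartial(..., newmem, &newsize);
       if (code != NEEDS_MORE_MEMORY)
          break;
       newmem = malloc(newsize);   /* size the library just asked for */
    }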

And at this point, you could get the same effect with a lot less
library complexity by simply passing in an allocator function.
This is what stb_truetype does, except the allocator function isn't
passed-in, it's a #define.
]]

Back to pushing the complexity onto the client: stb_truetype
could instead provide this private-allocator-from-temp-mem
itself so you don't have to. And so could every other library you
call that has similar behavior. But since you're the person
scared of dynamic allocations, I'm perfectly ok with pushing
that complexity onto you, rather than requiring every library
you might want to use to each independently handle that complexity.

And, to be honest, stb_truetype *does* already take a performance
hit in the name of simplifying memory management -- to avoid you
having to define a *realloc* (alt: to avoid potentially needing
2.9X the memory due to reallocing), stb_truetype actually tesselates
the curves twice -- the first time so it can find out how big
the final array is, and the second time to fill it out.
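
The count-then-fill pattern looks roughly like the sketch below. This is an illustration of the idea, not stb_truetype's actual code: the `Curve` and `Point` types, the function names, and the crude subdivision heuristic are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { float x, y; } Point;

/* Hypothetical quadratic curve: start, control, end. */
typedef struct { Point p0, p1, p2; } Curve;

/* Pick a segment count from how far the control point sits from the
   chord midpoint; each doubling cuts the squared deviation by 16. */
static int segmentsFor(const Curve *c, float tolerance2)
{
    float mx = (c->p0.x + c->p2.x) * 0.5f - c->p1.x;
    float my = (c->p0.y + c->p2.y) * 0.5f - c->p1.y;
    float d2 = mx*mx + my*my;
    int segments = 1;
    while (d2 > tolerance2 && segments < 64) {
        segments *= 2;
        d2 /= 16.0f;
    }
    return segments;
}

/* Flatten one curve. If out is NULL nothing is written; either way
   the number of points the curve produces is returned. */
static size_t flattenCurve(const Curve *c, float tolerance2, Point *out)
{
    int segments = segmentsFor(c, tolerance2);
    if (out) {
        for (int i = 1; i <= segments; ++i) {
            float t = (float)i / (float)segments;
            float u = 1.0f - t;
            out[i-1].x = u*u*c->p0.x + 2*u*t*c->p1.x + t*t*c->p2.x;
            out[i-1].y = u*u*c->p0.y + 2*u*t*c->p1.y + t*t*c->p2.y;
        }
    }
    return (size_t)segments;
}

/* Pass 0 only counts; pass 1 allocates exactly once and fills. */
static Point *flattenAll(const Curve *curves, size_t n,
                         float tolerance2, size_t *count)
{
    Point *points = NULL;
    size_t total = 0;
    assert(count != NULL);
    *count = 0;
    for (int pass = 0; pass < 2; ++pass) {
        if (pass == 1) {
            points = (Point *)malloc(total * sizeof(Point));
            if (!points)
                return NULL;
        }
        total = 0;
        for (size_t i = 0; i < n; ++i)
            total += flattenCurve(&curves[i], tolerance2,
                                  points ? points + total : NULL);
    }
    *count = total;
    return points;
}
```

The curve work is done twice, but there is exactly one allocation of exactly the right size, and no realloc is ever needed.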