~ ~~~~~~~~~~~~~~~~~~~~~
~ ~~ Execution model ~~
~ ~~~~~~~~~~~~~~~~~~~~~
~
~    We use Forth-style dual stacks, one for values and one for control. We
~  use rsp for values, just like C does. We use rbp for the control stack,
~  which is a special Forth-y stack: These are pointers into the bodies of
~  Forth words, not return addresses.
~
~    The choice of rsp and rbp for the stack pointers imitates Jonesforth;
~  I'm hopeful that it gives us convenient addressing modes, and will report
~  back about that when I feel that I understand the implications.
~
~    In Forth, everything is a "word", including mutable variables.
~  Conceptually, a word is a unit of execution, which may be implemented
~  either in machine code or as an array of pointer to other words.
~
~    This polymorphism is implemented by having each word's contents begin
~  with a "codeword", which is a pointer to machine code that "interprets"
~  the rest of the contents. In the case of words implemented in machine
~  code, the codeword points directly to that code, which is normally right
~  next to it.
~
~    Variables, to Forth, are simply one more thing that can be executed; the
~  effect of executing a variable is to push its address onto the value
~  stack.
~
~    We adopt this model of words, codewords, and variables-as-words. It's
~  really nice how it doesn't force anything else on us, not even a heap,
~  though we do end up using a heap.
~
~    We specifically implement a version of calling and returning that Forth
~  calls indirect threaded code: The control stack is a stack of pointers
~  into the middle of interpreted words. The interpreter snippet, called
~  docol, implements calling.  Each word is responsible for making sure
~  returning works properly. Interpreted words accomplish this by ending with
~  the word "exit", while machine-code words accomplish it by ending with a
~  verbatim snippet called "next".
~
~    Conceptually, "next" returns, but more specifically it accomplishes this
~  by doing the caller's next dispatch for it; thus control never actually
~  goes back to the caller's interpreter after initial setup. For performance
~  reasons, "next" is always inlined, so we define it as a macro.
~
~    The docol routine is just ordinary code, not a macro. It's defined later
~  in this file, as a label.
~
~    Notionally, we could consider not having a dictionary, and not giving
~  our words names. However, it feels silly to stop when we're so close to
~  being a full Forth, and using names for things solves a bootstrapping
~  problem related to heap management - see the write-up of _start about how
~  the heap is created, below. So, we add an additional header before the
~  codeword for this purpose.
~
~    The Forth dictionary is usually a linked list of every word that has
~  ever been defined, with the newest at the head; the names of words are
~  stored in string fields, often right next to the link pointer. We adopt
~  this model, with the field sizes and order shown in the quick reference
~  below. We break with Forth tradition in one way: Rather than having a
~  length field, we use a null-terminated string. Thus, there's no length
~  limit on names. This necessitates breaking out the flags (to be explained
~  later) into their own byte, rather than taking bits from the length field
~  for them.
~
~    There's an important performance consideration: Executable words
~  reference each other by pointers to their respective codewords. However,
~  dictionary entries reference each other by pointers to their respective
~  link fields. Traversing from the link field to the codeword is easy,
~  though it's a non-constant-time operation: Just walk the string. In order
~  to make Forth words easy to "decompile", it would be nice to also have a
~  way to traverse backwards. We solve this by making the name field be
~  null-terminated at both ends. Fun, yeah?
~
~
~
~
~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~   Quick Reference
~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~
~  The layout of an interpreted word:
~
~  (overall start)
~             0x00 - 0x07                     Link (to next-oldest word)
~             0x08 - 0x08  HM00000I           Flags
~                                                 H - hidden
~                                                 M - metadata
~                                                 I - immediate
~                                                 all other bits reserved
~  (name start)
~             0x09 - 0x09                     Null byte (terminates name)
~             0x0a - name-end - 1             Name, as UTF-8
~         name-end - name-end                 Null byte (terminates name)
~  (padding start)
~     name-end + 1 - codeword-start - 1       Zero-pad to 8-byte boundary
~     (it's possible this will be zero bytes long)
~  (codeword start)
~       ... + 0x00 - ... + 0x08               Codeword (ie. address of docol)
~           (8-byte chunks)                   Addresses of other words
~                  - ... (end)                Address of "exit" word
~
~  The layout of a machine-code word is different only from the codeword on:
~
~       ... + 0x00 - ... + 0x08               Addresss of next byte
~       ... + 0x08 - ????                     Arbitrary machine code
~                  - ... (end)                Inlined implementation of next
~
~  Also, words always start at 8-byte boundaries.
~
~
~  REGISTER usage conventions:
~
~  * rsi is the "instruction pointer" for the "interpreter".
~      That is, it points to some word-pointer inside an array of
~    word-pointers inside the content of the word they're part of. It always
~    points to the next word that should be executed, whose execution hasn't
~    begun yet.
~
~  * rbp points to the top of the control stack
~      These are former values of rsi, to eventually be returned to, from
~    successively older callers as you look further up the stack. The stack
~    grows downwards in memory. Since values are kept separately, the only
~    thing on the control stack is return addresses, one per layer of call.
~
~  * rsp points to the top of the value stack
~      The value stack has no specific format, but it grows downwards in
~    memory. In particular there's no concept of stack frames, because items
~    on the stack don't belong to any particular word; the value stack in
~    Forth is in part a mechanism for passing values between words.
~
~  Additionally, immediately after beginning execution of a word:
~
~  * rax points to the address of the codeword being executed
~      The value of rax is purely for the callee's benefit, and does not need
~    to be preserved.
~
~    Other registers are purely discretionary, and are not preserved across
~  calls.
~
~
~  FLAG usage:
~
~  * DF should be 0
~    We use lodsq extensively and that makes it increment rsi after using it.
~
~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~ Macro next
~ ~~~~~~~~~~
~
~   Include this inline at the end of a word implemented in machine-code.
~ Conceptually, it returns. What it actually does is do the next thing the
~ caller would do, which is call the next word from the caller's array of
~ word pointers.
~
~   This is a widespread technique in Forth implementation, referred to as
~ indirect threaded code. It's "threaded" in the sense that each word takes
~ responsibility for finishing up by following the notional thread through the
~ metaphorical labyrinth to figure out the next word that its caller wants to
~ run after it. In other words, control never directly returns to the parent,
~ it proceeds directly to the sibling.
~
~ Registers in:
~
~ * rsi points to the address of the word to execute
~
~ Registers out:
~
~ * rax points to the codeword in the contents of the word that was executed
~ * rsi points to the next word-address after this one
~
~ Flags
~ * DF = 0 is required
~
~ (base address -- new base address)
: pack-next
  ~ Copy the next word's address from *rsi into rax. Increment rsi (as per the
  ~ DF flag).
  lods64

  ~ Load the codeword from the word's contents, and jump to the interpreter it
  ~ points to.
  :rax jmp-abs-indirect-reg64 ;


~ Macro beforenext
~ ~~~~~~~~~~~~~~~~
~
~   Sometimes we want to transfer control from a word implemented in
~ machine-code to another word, without coming back after, as if we were
~ simply jumping to it. This is an innovation of ours; Jonesforth doesn't do
~ it. It is similar to the tail-call optimization that many Lisp dialects
~ have.
~
~   This implementation will work regardless of how the receiving word is
~ implemented. It impersonates the "next" snippet, setting up rax to point
~ to the codeword then jumping to the interpreter. Since it doesn't change
~ the control stack or rsi, when the receiving word eventually invokes
~ "next"; it will pick up in the same place as if this sending word had done
~ it.
~
~   Thus, notionally we are doing just this one transfer of control before
~ eventually getting around to inlining "next". Hence the name.
~
~ (target address, base address -- new base address)
: pack-beforenext
  ~ Do a permanent transfer of control by setting rax and invoking the
  ~ codeword. Of course, we could jump to docol ourselves but this will work
  ~ regardless of what the receiving codeword is.
  :rax mov-reg64-imm64
  :rax jmp-abs-indirect-reg64 ;


~ Macros pushcontrol
~        popcontrol
~ ~~~~~~~~~~~~~~~~~~
~
~   Include these inline to push an address onto the control stack, or pop
~ one off of it. You will recall the control stack is kept in rbp. The
~ parameter is given in a user-specified register.
~
~   Jonesforth's analogous macros are called PUSHRSP and POPRSP but I think
~ that's super confusing, since rsp is also the name of a register, but a
~ different one. I guess it was less confusing in 32-bit, since esp doesn't
~ start with an "r". Anyway, this has to be named something that
~ distinguishes it from Intel's PUSH and POP opcodes, so...
~
~   "Load effective address" is just a cute way to do arithmetic on a
~ register, here. To push or pop we decrement or increment rbp by 8. To
~ actually interact with the space in the stack, we indirect through rbp.
~
~ Registers in and out:
~
~ * rbp points to the top of the control stack.
~
~ (source register, base address -- new base address)
: pushcontrol
  swap :rbp -8 :rbp lea-reg64-disp8-reg64
  swap :rbp 0 mov-disp8-reg64-reg64 ;

~ (target register, base address -- new base address)
: popcontrol
  :rbp 0 3roll mov-reg64-disp8-reg64
  :rbp 8 :rbp lea-reg64-disp8-reg64 ;

~ Constants
~ ~~~~~~~~~
~
~ These are used in cold-start, just below.

: heap-requested-address 0x0000001000000000 ; ~ (very arbitrary)
: heap-size              0x0000000001000000 ; ~ 16 MiB
: control-stack-size                0x10000 ; ~ 64 KiB


~ Routine cold-start
~ ~~~~~~~~~~~~~~~~~~
~
~   This is the entry point of the whole program, the very first code we
~ actually execute. Linkers traditionally call this _start, but let's live a
~ little. Anyway, the ELF header points to it and exec() jumps to it; from the
~ kernel's perspective, there's no such thing as names. Also, though this code
~ could be anywhere in the code part of the output, in order to make the
~ hexdump pretty we put it at the start.
~
~   The kernel gives us most registers zeroed, and rsp pointing to the
~ command-line stuff (argc, argv, envp), which is at an ASLR'd address with
~ some stack space allocated for us, despite the fact we didn't request any.
~ It also gives us all the flags clear except IF, but we don't rely on that.
~ Lastly, of course, it loads our code segment and sets the instruction
~ pointer where we asked; we don't need to check what those addresses are,
~ because they're not randomized.
~
~   This routine is really only responsible for one-time initialization.
~
~ Registers in:
~
~ * rsp points to the logical top  of the value stack
~     The kernel sets this up for us, and we need to save it somewhere so
~   Forth can use it.
~
~ Registers out:
~
~ * rsi points within "quit"
~     Quit is the word that's Forth's closest equivalent to main().
~ * rsp points to the top of the value stack
~
~   Notably, rbp is still uninitialialized after _start.
~
~ Stack in:
~
~ * argc, argv, envp in the usual Unix way
~     We ignore them, though.
~
~ Stack out:
~
~ * The value of "heap", as a pointer
~     The meaning of this will be explained below.
~
~ Registers within:
~
~ * rdi points to the base the heap was allocated at, once it exists
~     This is the same value that "heap" will hold, once we reach a point
~   where we have variables. Of course, variables are stored on the heap,
~   hence this temporary measure.
~
~   We also take this opportunity to define soeme memory layout parameters
~ that this routine will be responsible for doing something with:
~
~ (output memory start, current output point
~  -- output memory start, current output point)
: cold-start
  current-offset L' cold-start set-label
  cld                                      ~ clear the DF flag

  ~   Prepare the heap.
  ~
  ~   We could ask for a data segment in the program header, but where's the
  ~ fun in that? Instead, we call mmap().
  ~
  ~   If we wanted the kernel to do ASLR for us, passing address zero would
  ~ cause it to pick somewhere at random, but instead we choose our own
  ~ location. It's still not guaranteed to be where we ask for, so we still
  ~ do the work to record where it wound up. We could pass the "fixed" flag
  ~ and the kernel would trust us, but this gives us more options for
  ~ interoperating with other runtimes.
  ~
  9 :rax mov-reg64-imm64                      ~ mmap()
  heap-requested-address :rdi mov-reg64-imm64 ~ address (very arbitrary)
  heap-size :rsi mov-reg64-imm64              ~ size (one meg)
  0x07 :rdx mov-reg64-imm64                   ~ protection (read+write+exec)
  0x22 :r10 mov-extrareg64-imm64              ~ flags (private+anonymous)
  0 :r8 mov-extrareg64-imm64                  ~ file descriptor (ignored)
  0 :r9 mov-extrareg64-imm64                  ~ offset (ignored)
  syscall

  ~   The return value of the system call is in rax, we'll use it in a sec.
  ~ We need to save this somewhere in case we ever want to munmap() it;
  ~ there's no widely-used name for it so we have to make one up. S0 and r0
  ~ are widely-used names for the physical tops (logical bottoms) of the
  ~ value and control stacks, respectively, and we will eventually set those
  ~ up as well, so we should keep those names in mind. The control stack
  ~ lives within the heap, while the value stack is its own segment. This
  ~ value, though, is the physical bottom of the segment, meaning that it
  ~ stays the same even as we allocate and deallocate things within it. This
  ~ is unlike the two stack pointers, so we give it a name that doesn't
  ~ suggest similarity: "heap".
  ~
  ~   Once Forth is fully set up, its internal variables will be accessed
  ~ through variable-words like any other Forth data, including "heap". To
  ~ get to that point, though, we need to be able to hold onto variable data
  ~ between now and then. In fact, if we don't have at least one of "heap"
  ~ and "here" (its counterpart which points to the logical top end), all
  ~ our efforts to hold onto anything seem a bit doomed.
  ~
  ~   So, we temporarily dedicate rdi to "heap" - only within this routine -
  ~ and store everything else in ways that let us find things by reference
  ~ to it. We choose rdi because it works with the indexing modes we care
  ~ about, and its name suggests its function.
  ~
  ~   The strategy Jonesforth uses is not applicable to us; Jonesforth
  ~ takes advantage of the linker to let its code segment refer to specific,
  ~ pre-allocated objects in the data segment. We are our own linker, and we
  ~ don't care to have a data segment. Hence, this approach.
  ~
  ~   Keying things off "heap" is the fundamental decision, but to make sure
  ~ our variables are accessible both during early bootstrapping, and later,
  ~ we also have to be thoughtful about data structures. More on that in a
  ~ moment.
  ~
  :rax :rdi mov-reg64-reg64

  ~   We also initialize rbp. We could hold off and let "quit" do this, but
  ~ doing it now is the easiest way to initialize the r0 variable, since
  ~ there's no instruction that moves a 64-bit immediate to memory.
  ~
  ~   This is the moment at which we decide where the control stack starts!
  ~ Fun, right? "Allocation" is just a fancy word for picking where we want
  ~ something, then being consistent about it - like placing furniture in
  ~ your home. See below for a little more thought about why here in
  ~ particular.
  ~
  :rdi control-stack-size :rbp lea-reg64-disp32-reg64

  ~   Now we save some stuff onto the heap. These are the locations that
  ~ will eventually be the backing stores of the Forth variables, but we
  ~ don't create the word headers yet, since there's no requirement that
  ~ they be next to the backing stores. We'll do that later, once we have
  ~ word-writing infrastructure in place. For now, we just use their offsets
  ~ relative to the physical bottom of the heap, which are fixed.
  ~
  ~   These will be the permanent homes of these values, though we have
  ~ copies of them elsewhere while we're still in this routine.
  ~
  :rdi control-stack-size 0x00 + :rdi mov-reg64-disp32-reg64  ~ heap
  :rsp control-stack-size 0x08 + :rdi mov-reg64-disp32-reg64  ~ s0
  :rbp control-stack-size 0x10 + :rdi mov-reg64-disp32-reg64  ~ r0
  L@' final-word-name :rax mov-reg64-imm64
  :rax control-stack-size 0x18 + :rdi mov-reg64-disp32-reg64  ~ latest
  :rdi control-stack-size 0x28 + :rax lea-reg64-disp32-reg64
  :rax control-stack-size 0x20 + :rdi mov-reg64-disp32-reg64  ~ here
  ~
  ~ * "heap" is the physical bottom of the heap
  ~     The heap grows upwards in memory, so this is also the logical
  ~   bottom. This comes from the address mmap() just returned to us.
  ~ * "s0" is the logical bottom of the value stack
  ~     The value stack grows downwards in memory, so this is the physical
  ~   top of it. This comes from the stack pointer the kernel initialized us
  ~   with.
  ~ * "r0" is the logical bottom of the control stack
  ~     The control stack also grows downwards, so this is its pysical top
  ~   as well. We allocate this dedicated space within the heap right here,
  ~   in this routine, through our choice of where to put things.
  ~ * "here" is the physical start of the unallocated space in the heap
  ~     We allocate heap space from bottom to top, by incrementing this
  ~   value. So, it would also be accurate to say that it points immediately
  ~   after the physical top of the allocated space. At any rate, the
  ~   address it points to is the first one that hasn't been used yet.
  ~ * "latest" is the address of the most-recently-defined word's header
  ~     Defining new words changes this value.
  ~
  ~   s0 and r0 are mostly used when we want to initialize or reinitialize
  ~ their respective stacks - that is, discard all their contents at once.
  ~
  ~   The value of r0 is the same address these variables start at, so
  ~ you'll want to do a close read of the implementation of pushcontrol
  ~ and convince yourself that it only ever writes things just below the rbp
  ~ address it receives, never right on top of it.
  ~
  ~   As an aside, by the way, please notice that strictly speaking, r0
  ~ could be a constant... but it isn't known until runtime, so we might as
  ~ well make it a variable. That will play nicely with any astonishing
  ~ memory shenanigans someone might wish to do in the future.
  ~
  ~   Notice also that "here" points immediately after itself. This is just
  ~ a convenience, making it the last one like that so that the concern is
  ~ dealt with in a single place and is easy to keep up-to-date with code
  ~ changes.
  ~
  ~   A little more detail about why we offset everything by
  ~ control_stack_size: We're carving out some space at the bottom of the
  ~ heap - which grows low-to-high - to be the control stack - which grows
  ~ high-to-low. So the control stack is allocated out of the heap as a
  ~ fixed-size, one-time thing, and then the variables come immediately
  ~ after that. We do need to use 32-bit displacement indexing to access
  ~ them this way, but that's no big deal.
  ~
  ~   This is perhaps questionable, they should maybe be separate segments
  ~ created with separate calls to mmap(), but for now we're not worried
  ~ about overflow so we use the same allocation for both.
  ~
  ~   We'll come back to these variables a bit later and generate the word
  ~ headers that point at them, but now we're almost ready to switch to
  ~ proper threaded-execution, so we finish that setup first...

  ~   Push the value of "heap" onto the value stack so that it can be the
  ~ breadcrumb the threaded code needs to find... the backing store of
  ~ "heap". Yes, self-reference can be weird like that sometimes. There's
  ~ nothing stopping "quit" from reading rdi, it just violates the
  ~ abstraction...
  :rdi push-reg64

  ~   We are about to set up rsi, we did rbp already, and rsp came to us
  ~ already set up. That's all that "next" needs, so take it away!
  L@' warm-start origin + :rsi mov-reg64-imm64
  pack-next ;

~ Routine warm-start
~ ~~~~~~~~~~~~~~~~~~
~
~   This routine runs as Forth code. That is, indirect threaded execution has
~ been initialized, and the structure of this routine is an array of codeword
~ pointers. It was directly jumped to from cold-start. There's nowhere to
~ return to, so it needs to never return.
: warm-start
  ~   While it's not actually a requirement that codeword pointers be
  ~ word-aligned, it's highly likely that it helps performance. (Whether it
  ~ does is up to Intel's microcode.)
  8 packalign
  current-offset L!' warm-start

  ~   Before handing off to us, cold-start pushed a single value onto the
  ~ stack, a pointer to the beginning of the heap. Now, we load our entire
  ~ Forth implementation onto that heap, beginning with the minimal set of
  ~ words needed to define more words. We do this because we need variables as
  ~ infrastructure so we can eventually have dynamic definitions.
  ~
  ~   There's something non-obvious here: words implemented statically as
  ~ part of the executable image can't contain things that vary at runtime.
  ~ That means that even if these words tried to implement some sort of
  ~ dynamic lookup, they would have no way to find the root of whatever
  ~ dynamic data structure they use. Dynamism needs to be bootstrapped.
  ~
  ~   In a more traditional C-style program, static code could look up
  ~ variables based on fixed addresses that are the same on every run.
  ~ Failing that, we could dedicate a register to it, though that's a
  ~ considerable expense. We chose not to do either of those things, because
  ~ we want the versatility that comes with not being picky about our
  ~ address space: It allows us to contemplate future improvements such as
  ~ ASLR, or embedding into other processes that impose their own addressing
  ~ constraints, or even coexisting with multiple versions of ourselves.
  ~ That choice does mean we have the hard version of this bootstrapping
  ~ problem, and copying ourselves to the heap is how we solve it.
  ~
  ~   We do have the heap address right now, though that won't last. In case
  ~ it's unclear why not: keeping it on the stack would require all future
  ~ references to walk the stack, and somehow know when they've reached the
  ~ bottom. The stack is a good place to keep things with clearly delimited
  ~ lifetimes and visibility, but when we want something to live for our
  ~ entire program and be easy to find from any code within it, we need to
  ~ do something else. Anyway, since we have the address, we can use it for
  ~ the next little bit of setup.
  ~
  ~   The first few words we define are our variables, which hardcode the
  ~ addresses they will return - but since we're doing this at runtime,
  ~ "hardcoding" can reflect where our heap is. This is the fundamental
  ~ trick that makes the heap usable.
  ~
  ~   One more thing to notice: We already allocated the backing stores of
  ~ these variables, and populated their initial values, in _start. The
  ~ words we're defining return those same addresses for the same backing
  ~ stores. So, we have continuity: Stuff defined in terms of the
  ~ variable-words we're defining now will interoperate with the stuff that
  ~ we define in the "early" way, which includes those very words. Both the
  ~ early code and the later code are dealing with the same data structures,
  ~ they're just using a different technique to find them.
  ~
  ~   This is the only hardcoding we need to do; by building on top of it,
  ~ we will soon reach a point where the rest of the system can be defined
  ~ within itself.
  ~ TODO These need to, like, exist first. Also they need to be referenced
  ~ as labels.
  ~ dq early_heap, litstring, "heap", early_variable
  ~ dq early_s0, litstring, "s0", early_variable
  ~ dq early_r0, litstring, "r0", early_variable
  ~ dq early_latest, litstring, "latest", early_variable
  ~ dq early_here, litstring, "here", early_variable
  ;


~ (previous entry address, output point, name string pointer
~  -- new entry address, output point)
: output-create
  3roll dup 4 roll swap pack64
  ~ (string pointer, new entry address, output point)
  0 pack8
  0 pack8
  roll3 packstring
  ~ (new entry address, output point)
  8 packalign
  ;


~ Routine docol
~ ~~~~~~~~~~~~~
~
~   Reference this via its label as the codeword of a word to make it an
~ "interpreted" word. Concretely, it saves rsi (the "instruction pointer")
~ to the control stack, takes the address of the codeword from rax and
~ increments it in-place to form the new instruction pointer, and copies
~ that to rsi.
~
~   Having then done this, we're now in the state that normal execution
~ expects, so docol ends by it using "next" to begin the callee's execution,
~ kicking off a nested call.
~
~   The name is said to be short for "do colon", because Forth high-level
~ code begins word definitions with a colon.
~
~ Registers in:
~
~ * rsi is the caller's instruction pointer
~ * rbp is the control stack pointer
~ * rax is the address of the callee's codeword
~
~ Registers out:
~
~ * rsi is the callee's instruction pointer
~ * rbp is the control stack pointer
~
~ (previous entry address, output point)
: output-docol
  s" docol" output-create

  ~ Evaluated as a word, docol is a constant which returns a pointer.
  L@' docol :rax mov-reg64-imm64
  :rax push-reg64
  pack-next
  8 packalign

  ~ Since docol is not a normal word, the label points to the value we care
  ~ about from the assembly side of things, which is the address we use as the
  ~ codeword.
  current-offset L!' docol
  :rsi pack-pushcontrol
  8 :rax add-reg64-imm8
  :rax :rsi mov-reg64-reg64
  pack-next
  8 packalign
  ;

~   This is the mechanism to "return" from a word interpreted by docol.
~ We pop the control stack, and then, since this is threaded execution, we
~ do the next thing the caller wants to do, by inlining "next".
~
~ (previous entry address, output point
~  -- new entry address, output point)
: output-exit
  s" exit" output-create
  L!' exit
  :rsi pack-popcontrol
  pack-next
  ;