~ ~~~~~~~~~~~~~~~~~~~~~~~~~
~ ~~ Core Forth features ~~
~ ~~~~~~~~~~~~~~~~~~~~~~~~~
~
~   This file provides extremely fundamental functionality which is a
~ necessary component of any Forth dialect, including Evocation. It is
~ included statically as part of any generated executable, and a second copy
~ of it is later copied into the log when that executable runs. Therefore, it
~ is written to obey the constraints of both the label transform, and the
~ log-load transform; see transform.e for more details on that.


~ Execution support
~ ~~~~~~~~~~~~~~~~~
~
~   There's two words, docol and exit, which are essential parts of the
~ indirect-threaded execution model. It would be tempting to put them in
~ execution.e, so they'd be closer to the explanation of what they do, but
~ we need two copies of them (just like we do of every other word in this
~ file), one statically compiled and one in the log. So, they're here, because
~ that's significantly simpler, even though it creates a little bit of extra
~ work for the label transform.


~   Docol is the "interpreter" that is responsible for the semantics of words
~ written as Forth high-level code. The name is said to be short for
~ "do colon", because word definitions begin with a colon.
~
~   Concretely, when interpreting, it saves rsi (the "instruction pointer") to
~ the control stack, takes the address of the codeword from rax and increments
~ it in-place to form the new instruction pointer, and copies that to rsi.
~
~   Having then done this, we're now in the state that normal execution
~ expects, so docol ends by it using "next" to begin the callee's execution,
~ kicking off a nested call.
~
~   At runtime, invoke docol directly and it will return the value that should
~ be used as a codeword. When compiling to a binary executable, the
~ transformation facility needs to reference it directly, and for that
~ purpose, the label "docol-codeword-value" points to the correct place.
~
~ Registers in:
~
~ * rsi is the caller's instruction pointer
~ * rbp is the control stack pointer
~ * rax is the address of the callee's codeword
~
~ Registers out:
~
~ * rsi is the callee's instruction pointer
~ * rbp is the control stack pointer

~   We need this one snippet of assembly code that's just on the log raw,
~ without a word header, because of its role in making words work. We define
~ a label pointing to it.
~
~   This is the only use of labels in core.e. There's no easy way to
~ remove it. We can't use immediate computation based on "here" because
~ the label transform uses the host address space, not the target address
~ space. The transformation facility's support for labels in transformed
~ code was added just for this.
~
~   It may seem as if we could have used some extra space in the middle of
~ the proper docol word, which is defined just below. That would work fine
~ with the label transform, which can do forward references, but the log-load
~ transform's label support is special-cased to ONLY do this, and it will only
~ work with a backward reference.
here @
dup L!' docol-codeword-value
:rsi pack-pushcontrol
8 :rax add-reg64-imm8
:rax :rsi mov-reg64-reg64
pack-next
8 packalign
here !

: docol
  [ here @
    ~ Evaluated as a word, docol is a constant which returns a pointer.
    L@' docol-codeword-value :rax mov-reg64-imm64
    :rax push-reg64
    here ! ] ;asm

~   This is the mechanism to "return" from a word interpreted by docol.
~ We pop the control stack, and then, since this is threaded execution, we
~ do the next thing the caller wants to do, by inlining "next".
: exit
  [ here @
    :rsi pack-popcontrol
    here ! ] ;asm


~ Stack manipulation routines
~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
~
~   We start with the three traditional stack operations, swap drop and
~ roll. Sorry to fans of the name "ROT"; we were an HP calculator kid. It'll
~ always be roll to us. Anyway, we do a couple other operations too. Since
~ our goal right now is just to bootstrap the heap, we keep this short and
~ sweet.
~
~   This code has been written to satisfy the assumptions needed by the label
~ transform, described in-depth in transform.e. This means that it is, in
~ effect, written in a restricted dialect of Evocation rather than in the full
~ language.
~
~   There is definitely plenty of optimization that could be done.

: swap
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rax push-reg64
    :rbx push-reg64
    here ! ] ;asm

: drop
  [ here @
    :rax pop-reg64
    here ! ] ;asm

: 2drop
  [ here @
    :rax pop-reg64
    :rax pop-reg64
    here ! ] ;asm

~   Rotates "up" (pops its parameter, n; nth item then becomes current item).
~
~   We implement this the high-performance way, with rep movsq, aka the
~ instruction that exists to optimize C's memcpy(). The details of setting
~ that up are complex; see below.
: roll
  [ here @

    ~   Pop our parameter. The rep instruction takes rcx as its count, so we
    ~ reduce copying by using it to hold our count, as well.
    :rcx pop-reg64

    ~   We have n - 1 items to slide, so decrement rcx. For the purpose of
    ~ counting how many repetitions will happen, it's one-based. This is because
    ~ the rep instruction performs a single movsq, then decrements rcx, then
    ~ stops if rcx is zero.
    :rcx dec-reg64

    ~   Retrieve the nth item, for later. For this purpose we're thinking in
    ~  zero-based terms, so we do this after already having decremented rcx.
    :rsp :rcx 8 :rbx mov-reg64-indexed-reg64

    ~   The source address for movsq is rsi and the destination is rdi; we can
    ~ use rdi as we wish, but rsi is our Forth "instruction pointer", so we must
    ~ save and restore it. Doing so alters rsp, so we have to adjust the address
    ~ calculations by eight bytes as compared to the expressions above, but
    ~ happily we can use the disp8 field to do that. We'd be using disp8 anyway
    ~ because it's helpful.
    :rsi push-reg64

    ~   Now we set up parameters for the memory-sliding operation. We have
    ~ n - 1 items to copy, moving the range rsp through rsp + (n-2)*8 onto the
    ~ range rsp + 8 through rsp + (n-1)*8. That's with the value of rsp as it
    ~ exists at this moment (it's going to change soon).
    ~
    ~   We're sliding them upwards in memory, so we start at the high end so
    ~ that we're always moving into a location that doesn't have anything
    ~ precious. We use lea as a convenient way to do the stack math.
    ~
    ~   When rcx is 1, we want rsp + 8.
    :rsp :rcx 8 :rsi lea-reg64-indexed-reg64
    ~   When rcx is 1, we want rsp + 16.
    :rsp :rcx 8 8 :rdi lea-reg64-disp8-indexed-reg64
    ~
    ~   Using rcx = 1 is the most convenient example to use for figuring out the
    ~ arithmetic. It's a linear relationship, so as long as we get the 8-byte
    ~ stride correct, we just need to pick a single point and verify that our
    ~ math is right for that point, and it'll be right for any value of rcx.

    ~   Another of our Forth conventions is that the DF flag should be kept at
    ~ zero, which directs string instruction to increment rsi. Here, however,
    ~ because our source and destination ranges overlap, we need to start at the
    ~ high end, which means we need it to decrement. So we set DF to one, and
    ~ we'll clear it after.
    std
    rep-movs64

    ~ Set everything back.
    cld
    :rsi pop-reg64

    ~ There is now an extra item at the low end of the stack (the top) that
    ~ needs to go away, and coincidentally we have a value in rbx that needs to
    ~ be in that spot. Rather than doing a drop and push, we overwrite it, to
    ~ save a little work.
    :rbx :rsp mov-indirect-reg64-reg64

    ~ All done, wow! What a mouthful.
    here ! ] ;asm

~   Rotates "down" (pops its parameter, n; current item then becomes nth
~ item).
~
~   Jonesforth calls this "-roll" and we could do that, but honestly the name
~ unroll sounds nicer and it's only a single character longer. You might say
~ it rolls off the tongue better.
~
~   We implement this the high-performance way, with rep movsq, aka the
~ instruction that exists to optimize C's memcpy(). The details of setting
~ that up are complex; see below.
: unroll
  [ here @

    ~   Pop our parameter. The rep instruction takes rcx as its count, so we
    ~ reduce copying by using it to hold our count, as well.
    :rcx pop-reg64

    ~ We have n - 1 items to slide, so decrement rcx. Also, save a copy of it in
    ~ rdx after doing that, for later.
    :rcx dec-reg64
    :rcx :rdx mov-reg64-reg64

    ~   Retrieve the 0th item, for later.
    :rsp :rbx mov-reg64-indirect-reg64

    ~   Now we set up parameters for the memory-sliding operation. We have
    ~ n - 1 items to copy, moving the range rsp + 8 through rsp + (n-1)*8 onto
    ~ the range rsp through rsp + (n-2)*8. That's with the value of rsp as it
    ~ exists at this moment (it's going to change soon).
    ~
    ~   We're sliding them downwards in memory, so we start at the low end so
    ~ that we're always moving into a location that doesn't have anything
    ~ precious. We use lea as a convenient way to do the stack math.
    ~
    ~   As with roll, we need to save rsi and adjust those rsp calculations
    ~ accordingly.
    :rsi push-reg64

    ~   Regardless of rcx, we want rsp + 16.
    :rsp 16 :rsi lea-reg64-disp8-reg64
    ~   Regardless of rcx, we want rsp + 8.
    :rsp 8 :rdi lea-reg64-disp8-reg64

    ~   With roll, we were starting at the high end. Here, we start at the low
    ~ end, which means we need rsi to increment after each repetition. That's
    ~ what it does when the DF flag is clear, and another of our Forth
    ~ conventions is to keep it clear normally. So, we don't have to touch DF!
    ~ Yay!
    rep-movs64

    ~ Restore our original rsi.
    :rsi pop-reg64

    ~ There is now an extra item in the middle of the stack, at the high end of
    ~ the sliding we did, that needs to be overwritten with our value in rbx.
    ~ Since we destructively updated our count in rcx, we saved a copy of the
    ~ count in rdx, and we use that to find the right address.
    ~
    ~ When the original count was n, we want rsp + (n-1)*8, so we saved rdx
    ~ after decrementing rcx, above.
    :rbx :rsp :rdx 8 mov-indexed-reg64-reg64

    ~ All done, wow! What a mouthful.
    here ! ] ;asm

~ Rotates "up" (third item becomes current item)
: 3roll
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rcx pop-reg64
    :rbx push-reg64
    :rax push-reg64
    :rcx push-reg64
    here ! ] ;asm

~ Rotates "down" (current item becomes third item)
: 3unroll
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rcx pop-reg64
    :rax push-reg64
    :rcx push-reg64
    :rbx push-reg64
    here ! ] ;asm

: dup
  [ here @
    :rax pop-reg64
    :rax push-reg64
    :rax push-reg64
    here ! ] ;asm

: 2dup
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx push-reg64
    :rax push-reg64
    :rbx push-reg64
    :rax push-reg64
    here ! ] ;asm


~ Arithmetic routines
~ ~~~~~~~~~~~~~~~~~~~
~
~   No surprises here. Again, since our goal is to bootstrap the heap, we
~ keep it short. Also again, this is nowhere near optimal.

: +
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rbx :rax add-reg64-reg64
    :rax push-reg64
    here ! ] ;asm

: -
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rbx :rax sub-reg64-reg64
    :rax push-reg64
    here ! ] ;asm

: *
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx mul-reg64
    :rax push-reg64
    here ! ] ;asm

~   Jonesforth calls this "/mod" but % is widely recognized and has no special
~ Forth significance.
: /%
  [ here @
    :rdx :rdx xor-reg64-reg64    ~ rdx is the high bits of the input; zero it
    :rbx pop-reg64
    :rax pop-reg64
    :rbx divmod-reg64
    :rdx push-reg64              ~ remainder
    :rax push-reg64              ~ quotient
    here ! ] ;asm


~ Comparison routines
~ ~~~~~~~~~~~~~~~~~~~
~
~   So. This is subtle. These comparison routines all have the same structure.
~ Notice that = and != are commutative, whereas the others are not, so
~ consider > as the archetypical one when reasoning about correctness. To
~ test if A > B, we do cmp A, B, which sets the flags the same way as
~ subtracting B from A does. The mnemonic names of the condition codes are
~ based on the assumption you do it in this order.
~
~   We want to treat the top of the stack as the first operand, so we
~ carefully pop in the appropriate order.
~
~   There are both signed and unsigned variants of the condition codes. We
~ provide both; the unmarked comparisons such as > and <= are signed, while
~ the marked ones such as >unsigned and <=unsigned are of course unsigned.

~   Jonesforth calls this "="; most languages would call it "==".
: =
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-equal :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

~   Jonesforth calls this "<>", but even most modern SQL dialects recognize
~ C's legacy and allow "!=" these days. As someone who learned C in childhood,
~ this is not actually a hard call for us.
: !=
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-not-equal :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: >
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-greater :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

~ Is the top of the stack less than the second item in the stack?
: <
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-less :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: >=
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-greater-equal :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: <=
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-less-equal :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: >unsigned
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-above :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: <unsigned
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-below :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: >=unsigned
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-above-equal :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

: <=unsigned
  [ here @
    :rax pop-reg64
    :rbx pop-reg64
    :rbx :rax cmp-reg64-reg64
    :cc-below-equal :al set-reg8-cc
    0x01 :rax and-reg64-imm8
    :rax push-reg64
    here ! ] ;asm

~ Bitwise routines
~ ~~~~~~~~~~~~~~~~
~

: &
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rbx :rax and-reg64-reg64
    :rax push-reg64
    here ! ] ;asm

: |
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rbx :rax or-reg64-reg64
    :rax push-reg64
    here ! ] ;asm

: xor
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rbx :rax xor-reg64-reg64
    :rax push-reg64
    here ! ] ;asm

~   The HP overloads the name "not", so we follow the Forth convention.
: invert
  [ here @
    :rax pop-reg64
    :rax not-reg64
    :rax push-reg64
    here ! ] ;asm


~ Literal routines
~ ~~~~~~~~~~~~~~~~
~
~   These words are rarely used directly from user code, but are emitted by
~ the interpreter, in compile mode. They allow literal values of various kinds
~ to exist inline as part of compiled code, which would otherwise be only an
~ array of codeword pointers. They do this, in their various ways, by pushing
~ any appropriate value on the stack and making sure execution skips over the
~ bytes used to represent it.

: lit
  [ here @
    lods64
    :rax push-reg64
    here ! ] ;asm

: litstring
  [ here @
    ~   The string immediately follows the codeword in memory, so rsi is
    ~ already pointing to it. That address will be our returned result, so we
    ~ push it to the stack.
    :rsi push-reg64

    ~   Now we need to skip over the string, so that rsi will be valid for the
    ~ next Forth word. To do that, we're going to do a string operation with
    ~ scas8, which takes rdi as the address to look at. This means scas8 is
    ~ treating its operand as analogous to the destination operand of movs*.
    :rsi :rdi mov-reg64-reg64
    ~   We want to compare for equality with zero; scas8 looks in al for the
    ~ other half of the comparison, so we clear rax.
    :rax :rax xor-reg64-reg64
    ~   Counterintuitively, we do need to pass a count. We pass -1, which will
    ~ always work.
    :rcx :rcx xor-reg64-reg64
    :rcx not-reg64
    ~   The DF flag is zero per our Forth execution-model convention, which
    ~ means scas8 increments rdi after each iteration. This is what we want,
    ~ so there's no need to mess with it.
    ~
    ~   We also need to worry about ZF, but fortunately that "not" will make
    ~ sure it's clear.
    ~
    ~   So, we're ready; do it!
    repnz-scas8
    ~   The scas8 completes, incrementing rdi, before the repnz checks its
    ~ condition. So, rdi is now pointing immediately after the terminating
    ~ null byte. Of course, we want this in rsi for Forth's purposes.
    :rdi :rsi mov-reg64-reg64

    ~   Finally, we need to align rsi to the next word boundary.
    7 :rsi add-reg64-imm8
    7 invert :rsi and-reg64-imm8

    here ! ] ;asm


~ Memory access routines
~ ~~~~~~~~~~~~~~~~~~~~~~
~
~   We go with Forth names for this stuff. The HP's names for memory and
~ storage operations heavily leverage the fact they have an object system
~ with type tags and so on; we want to stay close to the bytes.

~ Address on the top of the stack, value in the second position
: !
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rax :rbx mov-indirect-reg64-reg64
    here ! ] ;asm

: @
  [ here @
    :rax pop-reg64
    :rax :rax mov-reg64-indirect-reg64
    :rax push-reg64
    here ! ] ;asm

~   I might have put the parameters the other way round, but this is what
~ Jonesforth does and it seems reasonable enough.
~
~ (value, address --)
: +!
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rax :rbx add-indirect-reg64-reg64
    here ! ] ;asm

~ (value, address --)
: -!
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :rax :rbx sub-indirect-reg64-reg64
    here ! ] ;asm

~   Jonesforth calls this "c!" but we categorically reject the use of letters
~ that are meant to indicate byte sizes. It's unfriendly to newcomers. There's
~ some risk that this name will be confused for meaning "store a value of 8",
~ but that would not be a useful task, so hopefully it'll be okay.
: 8!
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :al :rbx mov-indirect-reg64-reg8
    here ! ] ;asm

: 8@
  [ here @
    :rbx pop-reg64
    :rax :rax xor-reg64-reg64
    :rbx :al mov-reg8-indirect-reg64
    :rax push-reg64
    here ! ] ;asm

: 16!
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :ax :rbx mov-indirect-reg64-reg16
    here ! ] ;asm

: 16@
  [ here @
    :rbx pop-reg64
    :rax :rax xor-reg64-reg64
    :rbx :ax mov-reg16-indirect-reg64
    :rax push-reg64
    here ! ] ;asm

: 32!
  [ here @
    :rbx pop-reg64
    :rax pop-reg64
    :eax :rbx mov-indirect-reg64-reg32
    here ! ] ;asm

: 32@
  [ here @
    :rbx pop-reg64
    :rax :rax xor-reg64-reg64
    :rbx :eax mov-reg32-indirect-reg64
    :rax push-reg64
    here ! ] ;asm

~   Before we get too deep into it, we also define a few reflection routines
~ that retrieve, or set, the address of either of the two stacks.
~
~   The result, or the new value, is on the top of the data stack. That's the
~ same as how it works for any other word, but given the metacircular nature
~ of these ones it's easy to get confused about that...
~
~   Jonesforth calls this RSP!, which looks as if it's meant to be an Intel
~ register name but is actually short for return stack pointer. There is no
~ register by that name, it's a Forth-provided abstraction. That's super
~ confusing, plus as discussed above we call it the control stack not the
~ return stack, so we call the words...
: control!
  [ here @
    :rbp pop-reg64
    here ! ] ;asm
: control@
  [ here @
    :rbp push-reg64
    here ! ] ;asm

~   Jonesforth calls this DSP!, for data stack pointer. Again, there's no
~ Intel register by that name, and we call it the value stack, so...
: value!
  [ here @
    ~   Per Intel's description of POP this reads from the old location, and
    ~ there is no increment applied to the resulting value. See, the
    ~ description says it increments the register then overwrites it, in that
    ~ order.
    :rsp pop-reg64
    here ! ] ;asm
: value@
  [ here @
    ~ Per Intel's description of PUSH this pushes the old value.
    :rsp push-reg64
    here ! ] ;asm

~   Copy one non-overlapping block of memory over another. For the overlapping
~ logic, see roll and unroll. This always copies at byte granularity, for ease
~ of implementation.
~
~   Jonesforth calls this CMOVE and has a CCOPY which is a single byte. POSIX,
~ however, has memcpy() for non-overlapping blocks and memmove() for
~ overlapping blocks. We follow the latter convention, because it feels like
~ the more important distinction.
~
~   Also, the "c" was meant to indicate that it works at one-byte granularity,
~ but that isn't, uh... actually an important property here, and as a blanket
~ call we're not using letters to denote data sizes. So we call it "memcopy".
~ Apologies to the C programming tradition but vowels are good, actually.
~
~   Jonesforth also offers C@C! as another name for its CCOPY, but neither
~ "@!" nor "mem@mem!" seems particulaly nice.
~
~ (destination, source, length --)
: memcopy
  [ here @
    ~   We need to save and restore rsi; the other registers we can trample.
    :rsi :rdx mov-reg64-reg64
    :rcx pop-reg64
    :rsi pop-reg64
    :rdi pop-reg64
    ~   We start from the low end, since that's easier arithmetic. So, we get
    ~ to leave the DF flag alone.
    rep-movs8
    :rdx :rsi mov-reg64-reg64
    here ! ] ;asm

~   This does the same thing as memcopy, but correctly handles situations
~ where the destination overlaps with the source. It achieves this by being
~ careful about which end the transfer starts from. This "move" vs. "copy"
~ distinction mirrors C terminology.
~
~ (destination, source, length --)
: memmove
  [ here @
    ~ We need to save and restore rsi; the other registers we can trample.
    :rsi :rdx mov-reg64-reg64
    :rcx pop-reg64
    :rsi pop-reg64
    :rdi pop-reg64

    ~ We need to check source < destination to decide which end to start from.
    :rsi :rax mov-reg64-reg64
    :rdi :rax cmp-reg64-reg64
    ~ Relative offsets are from the start of the instruction after the jmp.
    4 :cc-below jmp-cc-rel-imm8

    ~ If source is greater, we are sliding downwards so we start from the low
    ~ end. So, we get to leave the DF flag alone.
    rep-movs8                         ~ 2 bytes
    16 jmp-rel-imm8                   ~ 2 bytes

    ~ If destination is greater, we are sliding upwards so we start from the
    ~ high end. So, we have to save and restore DF. Also, we have to adjust the
    ~ pointers.
    :rcx :rsi add-reg64-reg64         ~ 3 bytes
    :rsi dec-reg64                    ~ 3 bytes
    :rcx :rdi add-reg64-reg64         ~ 3 bytes
    :rdi dec-reg64                    ~ 3 bytes
    std                               ~ 1 byte
    rep-movs8                         ~ 2 bytes
    cld                               ~ 1 byte

    :rdx :rsi mov-reg64-reg64
    here ! ] ;asm


~ (string pointer -- string length not including null byte)
: stringlen
  [ here @
    :rdi pop-reg64
    :rdi :rbx mov-reg64-reg64
    :rax :rax xor-reg64-reg64
    :rcx :rcx xor-reg64-reg64
    :rcx not-reg64
    repnz-scas8
    :rbx :rdi sub-reg64-reg64
    1 :rdi sub-reg64-imm8
    :rdi push-reg64
    here ! ] ;asm

~   This word is for working with strings that are null-terminated at both
~ ends. It counts backwards from the end to find the string length. It accepts
~ an input pointer to the final non-null byte at the end of the string, and
~ returns the length not including either null byte.
~
~ (string end pointer -- string length not including null)
: reverse-stringlen
  [ here @
    :rdi pop-reg64

    :rdi :rbx mov-reg64-reg64
    :rax :rax xor-reg64-reg64
    :rcx :rcx xor-reg64-reg64
    :rcx not-reg64

    std
    repnz-scas8
    cld

    :rdi :rbx sub-reg64-reg64
    1 :rbx sub-reg64-imm8

    :rbx push-reg64
    here ! ] ;asm


~   If you have a variable-length string followed by alignment padding, and
~ you want to traverse it in reverse, you also need to skip the alignment
~ padding...
~
~   The alignment padding can have a total length of no more than 8 bytes. The
~ returned value counts from the end, backwards, to the nearest non-null byte,
~ or at most a distance of eight.
~
~ (address of final null byte in padding -- count of null bytes in padding)
: reverse-padding-len
  [ here @
    :rdi pop-reg64

    :rdi :rbx mov-reg64-reg64
    :rax :rax xor-reg64-reg64
    9 :rcx mov-reg64-imm32

    std
    repz-scas8
    cld

    :rdi :rbx sub-reg64-reg64
    1 :rbx sub-reg64-imm8

    :rbx push-reg64
    here ! ] ;asm


~   We make this work using exactly two jump instructions, which is likely the
~ minimum possible. To avoid relying on labels, we hand-compute the byte
~ offsets, so every instruction within their ranges is annotated with its
~ length in bytes. Good fun.
~
~   The returned comparsion value is 0 for equal, 1 to indicate the left
~ string (lower on the stack) is greater, and -1 to indicate the right string
~ (top of the stack) is greater.
~
~ (left string pointer, right string pointer -- comparison value)
: stringcmp
  [ here @
    ~   Save the rsi register.
    ~
    ~   Happily, we don't need a lot of registers for this code, so we can
    ~ dedicate rdx to this and not have to deal with stack juggling..
    :rsi :rdx mov-reg64-reg64

    ~   Get the parameters off the stack.
    ~
    ~   For reasons that will be explained below, the left and right strings
    ~ from our caller's perspective are swapped from the perspective of the
    ~ comparisons we'll be doing. The rdi register points to the right-hand
    ~ operand of the "cmps" instruction, while rsi points to the left-hand
    ~ operand; [Intel] volume 2A, chapter 3, section 4-3.3, "CMPS"; note that
    ~ the PDF mis-numbers the sections in the second half of chapter 3.
    :rsi pop-reg64
    :rdi pop-reg64

    ~   For the comparison-unequal loop-exit code, we'll need rbx set to zero.
    ~ The easy way to do that is by xor'ing it with itself, but doing it at
    ~ the end of the loop would overwrite the comparison flags, which we need.
    ~ So we do it in advance, and carefully don't touch it anywhere else.
    ~
    ~   See the list of which instructions affect the flags in [Intel]
    ~ volume 1, appendix A, section A-1, table A-2.
    :rbx :rbx xor-reg64-reg64

    ~   We also clear rax; we'll be using al later, and it will be convenient
    ~ to know the upper bits are always zero.
    :rax :rax xor-reg64-reg64

    ~ Now we've initialized everything we need; everything after this point is
    ~ part of the loop.

    ~   At the start of each iteration, save a copy of the byte we're at now.
    ~ We need to do this before cmps because the pointers will increment;
    ~ we'll eventually use it to test whether we've reached the end delimiter.
    ~
    ~   We do a 64-bit load into rcx, which is otherwise unused, then copy the
    ~ low byte from cl to al. This avoids having to think about addressing
    ~ modes that combine 64-bit addresses with smaller data sizes; those are
    ~ very subtle.
    :rsi :rcx mov-reg64-indirect-reg64  ~ 3 bytes
    :cl :al mov-reg8-reg8               ~ 2 bytes

    ~   Now do the cmps, which is the heart of the loop.
    ~
    ~   Note that, in addition to relying on this comparison to test content
    ~ bytes against each other, in the event that one string is a prefix of
    ~ the other, this will also test content bytes against null delimiters.
    ~ The longer string will compare as "greater", because zero is less than
    ~ all other possible byte values. This is what we want.
    ~
    ~   Since strings of different lengths are necessarily unequal, letting
    ~ this test do the work of detecting that also means we don't have to deal
    ~ with scenarios where we're past the end of one string but not the other.
    cmps8                               ~ 1 byte

    ~   The flags are now set based on the simulated subtraction of the next
    ~ bytes from the two strings. If they were unequal, the loop will end. If
    ~ they were equal, we have another test to do before we're finished with
    ~ this iteration. So we put the loop-end code next, and conditionally jump
    ~ forward past it.
    ~
    ~   Recall that relative offsets are from the start of the instruction after
    ~ the jmp.
    15 :cc-equal jmp-cc-rel-imm8        ~ 2 bytes

    ~   If we got here, the strings are unequal, so we need to turn the flags
    ~ into a comparison value. We cleared rbx earlier; now we set its low bit
    ~ to the "above" flag, then use sbb to subtract the "carry" flag (which it
    ~ thinks of as the "borrow" flag).
    ~
    ~   Recall that "carry" is based on whether the comparison would cause a
    ~ change to the next bit "outside" the bits being compared; [Intel]
    ~ volume 1, chapter 3, section 3-4.3.1. Since the comparison is a
    ~ subtraction, "carry" is true when the right-hand byte is strictly
    ~ greater than the left-hand byte.
    ~
    ~   Also recall that "zero", also called "equal", is based on whether the
    ~ comparison's result produces output that's all zero bits. Since the
    ~ comparison is a subtraction, "zero" is true when the two bytes are
    ~ identical.
    ~
    ~   The "above" condition is not a flag, but a composite of flags. It's
    ~ true when both "carry" and "zero" are false; [Intel] volume 1, chapter
    ~ 7, section 7-3.1.1, table 7-2. For us, this is equivalent to saying the
    ~ left-hand byte is strictly greater than the right-hand.
    ~
    ~   So, after the "set" instruction, rbx is 1 if left > right, and 0 if
    ~ left = right or left < right. The "sbb" instruction with an immediate
    ~ value of zero subtracts the value of the carry flag from rbx; the flag
    ~ is true in the case where left < right, and false otherwise. So, after
    ~ the "sbb", rbx is 1 if left > right, 0 if left = right, and -1 if
    ~ left < right.
    ~
    ~   This sounds like the opposite of what we want, but recall that we
    ~ exchanged the operands, so these are the values our caller is expecting.
    ~ While it might be tempting to look for a way to not need that
    ~ transposition, and indeed there exists an approach using "not above",
    ~ it's fiddly in a way that's even harder to explain.
    ~
    ~   Note that it's important that the "jmp" and "set" instructions don't
    ~ change the flags (see table A-2 again), so the "cmps" is the most recent
    ~ thing that did. The "sbb" instruction does change them, but we don't
    ~ need them again after that point so it doesn't matter.
    ~
    ~   Finally, we need to restore rsi before we return to Forth.
    ~
    ~   The two-instruction set-sbb sequence is one of those classic assembly
    ~ programming tricks that often goes unexplained, because most of its
    ~ appeal is its brevity and a proper explanation is quite lengthy. Please
    ~ enjoy this gift of knowledge, and thanks to our friends who showed it to
    ~ us.
    :cc-above :bl set-reg8-cc           ~ 3 bytes
    0 :rbx sbb-reg64-imm8               ~ 4 bytes
    :rbx push-reg64                     ~ 1 byte
    :rdx :rsi mov-reg64-reg64           ~ 3 bytes
    pack-next                           ~ 4 bytes

    ~   If we got here, the current bytes are equal to each other. We still
    ~ need to test if they're null terminators; if so, we exit, and if not, we
    ~ loop.
    ~
    ~   At the beginning of this iteration, we saved a copy of the byte being
    ~ inspected in rax. The "test" instruction simulates a xor and sets the
    ~ flags accordingly. We check the "not equal" condition, also known as
    ~ "not zero", and jump backwards to the start of the loop when that's the
    ~ case.
    ~
    ~   Recall that relative offsets are from the start of the instruction
    ~ after the jmp.
    :rax :rax test-reg64-reg64          ~ 3 bytes
    -28 :cc-not-equal jmp-cc-rel-imm8   ~ 2 bytes

    ~   If we got here, we got all the way to the end of both strings and
    ~ found a null byte, so we return 0 to indicate they're equal.
    ~
    ~   We can use push-imm32-extended64 even though our stack holds 64-bit
    ~ values, because it gets sign-extended.
    ~
    ~   We need to restore rsi before we return to Forth.
    0 push-imm32-extended64
    :rdx :rsi mov-reg64-reg64
    here ! ] ;asm


~ Branching
~ ~~~~~~~~~

~   Okay, so there's a weird thing about how branch and 0branch are defined.
~ 0branch jumps into branch, so they need to be next to each other in the log.
~ To compute the offset to jump by, we pass an address on the stack. Notice
~ that assembly words are always defined in immediate mode; it's just they
~ usually only do trivial logic.

~   This takes a number of bytes, not machine words. That allows it to be used
~ for putting weird things embedded in the code.
~
~   The offset is relative to the start of the word the number of bytes is in,
~ so, make sure to have it skip itself.
: branch
  [ here @
    dup swap ~ Save a copy of this address for later.
    :rsi :rsi add-reg64-indirect-reg64
    here ! ] ;asm
~ (jump destination)

: 0branch
  [ here @
    ~ (jump destination, output point)

    :rax pop-reg64
    :rax :rax test-reg64-reg64

    ~   Please notice the 8-bit branch to the nearby word. This is the offset
    ~ from the end of the jmp instruction, to the start of the target
    ~ instruction. To avoid needing the label system for this, we compute it
    ~ based on the address before the jmp; the jmp itself is two bytes.
    ~
    ~   Remember, with jump arithmetic, subtract the start from the
    ~ destination.
    ~
    ~   What we're testing with :cc-equal is that the input value is zero.
    ~ The name is slightly counterintuitive; it's a result of the condition
    ~ names favoring cmp over test. While cmp simulates subtraction, test
    ~ simulates bitwise AND.
    dup 2 + 3roll swap - :cc-equal jmp-cc-rel-imm8
    ~ (output point)

    ~   In the event we didn't jump, we still need to skip over the literal
    ~ value.  Using lods here is just a convenient way to skip rsi forward.
    lods64

    here ! ] ;asm
~ the stack is empty once more


~   This is like next, but instead of using rsi as the "instruction pointer",
~ it takes a codeword address from the value stack.
~
~   In the event that the codeword is docol, docol will handle any
~ manipulation of the control stack that needs to happen. Yes, it really is
~ that simple.
~
~ (execution token --)
: execute
  [ here @
    :rax pop-reg64
    :rax jmp-abs-indirect-reg64
    here ! ] ;asm


~ Dictionary entries
~ ~~~~~~~~~~~~~~~~~~
~
~   Now, we have a bunch of words that are used for traversing the Forth
~ core data structures that describe words. First, we have a couple that
~ relate to individual words and their pieces...
~
~   The log-load transform produces code that requires
~ entry-to-execution-token, which means it's needed statically. So this stuff
~ to deal with word entry headers might as well go in core, since it has no
~ dependencies to speak of.
~
~   These are the first words in core that are implemented in Forth rather
~ than assembler. That's not as big a deal as it may seem; the Forth execution
~ model has been ready-to-go ever since we implemented docol and exit, and
~ at this point we have enough basics to do useful things with it.

~   Jonesforth calls this "TFCA" and ">CFA"; its author speculates that the
~ original meaning is "code field address".
~
~ (entry pointer -- execution token)
: entry-to-execution-token
  ~ Skip next-entry pointer, flag byte, and start terminator.
  10 +
  ~ Skip string contents.
  dup stringlen +
  ~ Skip one for the null terminator, seven more for alignment.
  8 +
  ~ Zero the low bits and now it's aligned.
  7 invert & ;

~   Jonesforth calls this "CFA>". Jonesforth's implementation searches the
~ entire dictionary, since its word header format isn't designed to be
~ traversed in reverse, but ours is, so it should be fast.
~
~ (execution token -- entry pointer)
: execution-token-to-entry
  1 -
  dup reverse-padding-len -
  dup reverse-stringlen -
  9 - ;

~ (entry pointer -- flags byte)
: entry-flags@
  8 + @ 0xFF & ;

~ TODO these parameters are in a counterintuitive order, swap them
~ (entry pointer, new flags byte --)
: entry-flags!
  swap
  8 +
  dup @ 3roll
  0xFF &
  swap 0xFFFFFFFFFFFFFF00 & |
  swap !
  ;

~ (entry pointer -- name string pointer)
: entry-to-name 10 + ;


~ Binary packing
~ ~~~~~~~~~~~~~~
~
~   These routines are for building up data structures in-memory. Sometimes
~ they're used for structures that are meant to stay in memory; other times
~ it's a buffer that will become output.
~
~   The general pattern is that each routine takes an output address and
~ some specific datum, and returns the output address adjusted to point
~ after the new datum. That makes them easy to chain together. We call this
~ address the "output point", to capture the idea that it's a running total
~ which gets updated by each new datum as it's packed.

~ (output point, value -- output point)
: pack64 swap dup 3unroll ! 8 + ;
: pack32 swap dup 3unroll 32! 4 + ;
: pack16 swap dup 3unroll 16! 2 + ;
: pack8 swap dup 3unroll 8! 1 + ;

~   This works on C-style strings, which are characters followed by a null
~ terminator. The packed data includes the null terminator.
~
~ (output point, string pointer -- output point)
: packstring
  dup stringlen 1 + dup
  ~ (output point, source, length, length)
  4 roll dup 5 unroll
  ~ (destination, source, length, length, output point)
  + 4 unroll
  ~ (output point, destination, source, length)
  memcopy ;

~ (output point, alignment byte count -- output point)
: packalign
  2dup /% drop 0branch [ 8 8 * , ]
  swap 0 pack8 swap branch [ -11 8 * , ]
  drop ;
  ~ TODO this is the implementation we could use if we had more flow-control
  ~ { 2dup /% drop { drop exit } unless
  ~  swap 0 pack8 swap } forever ;


~ Binary unpacking
~ ~~~~~~~~~~~~~~~~
~
~   These routines are for examining data structures in-memory.
~
~   Similarly to the output routines, each routine takes an input address,
~ which it updates to point after the data item being read. We call this the
~ "input point". Since this is input, the routines return data items rather
~ than accepting them.

~ (input point -- input point, value)
: unpack64 dup @ swap 8 + swap ;
: unpack32 dup 32@ swap 4 + swap ;
: unpack16 dup 16@ swap 2 + swap ;
: unpack8 dup 8@ swap 1 + swap ;

~ TODO does this need to have a separate name?
~ (proposed size, alignment byte count -- adjusted size)
: align-size
  dup 3unroll dup 3unroll
  ~ (alignment, alignment, proposed size, alignment)
  1 - + swap /% swap drop * ;

~   You might think this would be identical to packalign, but packalign has
~ side effects.
~
~ (input point, alignment byte count -- input point)
: unpackalign align-size ;


~ Development utilities
~ ~~~~~~~~~~~~~~~~~~~~~

~   This peforms the "hlt" instruction (Intel's mnemomic, short for "halt"),
~ which will cause the program to exit with a segmentation fault. If you're
~ running under a debugger, this is a convenient way to get execution to stop
~ at a certain point.
~
~   It's called "crash" rather than "hlt" to distinguish it from the word
~ which outputs the instruction as machine code.

: crash
  [ here @
    hlt
    here ! ] ;asm