~ ~~~~~~~~~~~~~~~~~
~ ~~ Interpreter ~~
~ ~~~~~~~~~~~~~~~~~
~
~   The code in this file defines the basic syntax and semantics of Forth as
~ a text-based language. It's written in terms of the underlying executor,
~ which is implemented and explained in execution.e. The execution model gives
~ us the concept of "words"; the control and value stacks; and the ability to
~ call things. It has nothing to say about text, only about the binary form of
~ the language.
~
~   It's traditional in Forth to refer to an act of "compiling" code, which
~ in this context means turning it from text into its binary representation.
~ That binary representation most commonly takes the form of a word entry
~ header followed by an array of codeword pointers.
~
~   It would be legitimate to critique the terminology by saying that codeword
~ pointers are still, in some sense, interpreted: They are not machine code to
~ be directly executed by the CPU; they rely on "docol" and "next" at runtime.
~ However, in language design circles, the term "compilation" takes on a
~ broader meaning, referring to any process which requires some or all of the
~ types of infrastructure we regard as being compiler internals: A successive
~ translation of code from one form into another, discarding some types of
~ information while computing others, in a careful order that results in
~ logically consistent output which in some sense has the same meaning as the
~ input. Sometimes this output may be machine code, but often it is another
~ language meant for human consumption, or an intermediate layer meant to be
~ fed into another process.
~
~   Forth compilation is compilation in this sense, so there is no conflict
~ and we run with the established terminology. In addition, it must be noted
~ that Evocation, like many Forths, makes extensive use of words which are
~ implemented directly in machine code; the Forth execution model allows these
~ words to co-exist with words that are interpreted by "docol".
~
~   At any rate, the code in this file is responsible for that compilation.
~
~   It is primarily concerned with managing the contents of an area of memory
~ we call the "log"; see log-load.e for more detail on terminology.

~   As a convenience for "word", we have some facilities for working with
~ stack-allocated strings. Yeah, trippy concept. Also, it would be a
~ buffer overrun hazard if we were worried about that, which is why this
~ is no longer common practice in C.
~
~   The most important of these is accumulate-string, but we need some
~ smaller pieces first.

~ (stack string..., items to be left alone..., item to be unrolled,
~  number of items above string that participate in the unroll
~  -- item that was unrolled, stack string..., items left alone)
: unroll-past-string
  8 *
  ~ (string, other items, top item, byte offset to string start)
  dup value@ +
  ~ We have two copies of the offset present, in addition to the stuff we want
  ~ to rotate. So, the actual string starts two words on... We could have
  ~ adjusted the offset instead, but we'll want the unmodified offset again
  ~ later.
  16 +
  ~ (string, other items, top item, offset to start, string pointer)
  stringlen
  ~ Same reasoning as in accumulate-string (see below).
  1 + 8 align-size
  ~ (string, other items, top item, offset to start, string length w/ padding)
  + 8 /% swap drop
  ~ (string, other items, top item, number of words to unroll)
  unroll ;


~ (stack string..., item to be swapped
~  -- item that was swapped, stack string...)
: swap-past-string 1 unroll-past-string ;


~ (stack string... --)
: dropstring
  value@ stringlen
  ~ Same reasoning as in accumulate-string (see below).
  1 + 8 align-size value@ +
  ~   At the time we fetched the stack pointer, there was an extra value atop
  ~ it, so we have to add one more word.
  8 + value! ;


~ (stack string..., item to be kept
~  -- item that was kept)
: dropstring-with-result swap-past-string dropstring ;


~ (stack string-so-far..., new character byte
~  -- updated string-so-far)
: accumulate-string
  ~ Compute the address of the final word of the string.
  ~
  ~   It's a little bit difficult to get the start pointer right, since all
  ~ our intermediate products affect what we get from value@, so we compute
  ~ that just once, here at the beginning.
  value@ 8 +
  ~ (string so far, new character byte, pointer to start of string)
  dup stringlen
  ~ There are two concerns here that overlap: First, we always want at least
  ~ one word. Recall that a length of zero bytes won't receive any alignment
  ~ padding because it's already divisible by 8. Second, the result of
  ~ stringlen doesn't include the null byte, which might be in a word by
  ~ itself that needs to be counted. We can address both of them by
  ~ unconditionally adding 1 to the length before applying alignment.
  1 +
  ~ Pad the length for alignment.
  8 align-size
  ~ We want an offset from the first word of the string to the last word of
  ~ the string, so we subtract one word from the length.
  8 - +
  ~ (string so far, new character byte, address of final word)

  ~ Examine the final word of the string, leaving other stuff undisturbed.
  ~ Work low-to-high to figure out where to store the new byte, taking the
  ~ first one that's available.
  ~ (string so far, new character byte, address of final word)
  dup @ dup 0x00000000000000FF & { 3roll | swap ! exit } unless
  ~ (string so far, new character byte, address of final word, old value)

  ~ This next part is repeated several times, changing only the offsets, for
  ~ bytes 1 through 6; bytes 0 (above) and 7 (way below) are different.
  ~ (string so far, new character byte, address of final word, old value)
  dup 0x000000000000FF00 & { 3roll 0x0000000000000100 * | swap ! exit } unless
  dup 0x0000000000FF0000 & { 3roll 0x0000000000010000 * | swap ! exit } unless
  dup 0x00000000FF000000 & { 3roll 0x0000000001000000 * | swap ! exit } unless
  dup 0x000000FF00000000 & { 3roll 0x0000000100000000 * | swap ! exit } unless
  dup 0x0000FF0000000000 & { 3roll 0x0000010000000000 * | swap ! exit } unless
  dup 0x00FF000000000000 & { 3roll 0x0001000000000000 * | swap ! exit } unless

  ~   The top byte of the final word is always zero (or else stringlen
  ~ wouldn't have called it the final word), so we don't need to check it, we
  ~ can just use it.
  ~
  ~   We need to put the new value in the top byte, which will mean we have no
  ~ null terminator, so we also need to start a new word.
  ~
  ~   There is a fiddly order-dependency here: unroll-past-string relies on
  ~ being able to find the null terminator, which won't work if we've gotten
  ~ rid of it. Also, calling it will move all the earlier words, including
  ~ the one we intend to write to, which will invalidate any pointer we're
  ~ keeping at that point. There's a few ways to resolve this; what we do is
  ~ put the new terminator in place first, manually nudge the pointer, and
  ~ then write the new value.
  ~ (string so far, new character byte, address of final word, old value)
  3roll 0x0100000000000000 * | swap
  ~ (string so far, new value, address of final word)
  0 3 unroll-past-string
  ~ (new null terminator, string so far, new value, invalid address)
  8 -
  ~ (new null terminator, string so far, new value, updated address)
  ! ;


~ (byte -- boolean)
: is-space
  dup 0x20 = { drop 1 exit } if
  dup 0x09 = { drop 1 exit } if
  dup 0x0a = { drop 1 exit } if
  dup 0x0b = { drop 1 exit } if
  dup 0x0c = { drop 1 exit } if
  dup 0x0d = { drop 1 exit } if
  drop 0 ;


~ (character -- 1 for true or 0 for false)
: is-alphanumeric
  ~ We don't have a character-literal syntax; the hex constants here are
  ~ ASCII codes.
  dup 0x30 > { drop 0 exit } if ~ Less than "0".
  dup 0x39 >= { drop 1 exit } if ~ Less than or equal to "9".
  dup 0x41 > { drop 0 exit } if ~ Less than "A".
  dup 0x5a >= { drop 1 exit } if ~ Less than or equal to "Z".
  dup 0x61 > { drop 0 exit } if ~ Less than "a".
  dup 0x7a >= { drop 1 exit } if ~ Less than or equal to "z".
  drop 0 ; ~ Greater than "z".


~ Okay, this is the big one, the lexer! Wow.
~
~ (-- stack string occupying multiple words)
: word
  ~ We allocate an empty string first, so that the result of "key" will
  ~ conveniently be on the easy-to-find end of it.
  0

  ~ Skip whitespace.
  { key dup is-space } { drop } while

  ~ Early exit if it's a zero byte.
  dup 0 = { drop dropstring 0 exit } if

  accumulate-string

  {
    peek dup is-space { drop exit } if
    dup { drop exit } unless
    consume accumulate-string
  } forever ;


~ (character -- value)
: generalized-digit-value
  ~ We don't have a character-literal syntax; the hex constants here are
  ~ ASCII codes.
  dup 0x61 <= { 0x61 - 10 + exit } if ~ lowercase "a"
  dup 0x41 <= { 0x41 - 10 + exit } if ~ uppercase "a"
  0x30 - ; ~ digit "0"


~ (character, base
~  -- value (if successful),
~     error indicator (zero equals success))
: decode-generalized-digit
  swap dup is-alphanumeric {
    ~ It's alphanumeric.
    ~ (base, character)
    generalized-digit-value
    ~ (base, value)
    dup 3roll
    ~ (value, value, base)
    > {
      ~ It's in range.
      ~ (value)
      0 exit } if
    ~ It's out of range.
    ~ (value)
    drop 1 exit } if
  ~ It's not alphanumeric.
  drop drop 1 ;


~ (string pointer, base
~  -- result (if successful),
~     error indicator (zero equals success))
: read-base-unsigned
  swap

  ~ If the first byte is null, this is an error
  unpack8
  ~ (numeric base, current point in string, character)
  dup 0 = { drop drop drop 1 exit } if

  ~ Decode the first byte as a generalized digit in the base.
  ~ (numeric base, current point in string, character)
  ~ If the first byte is less than "0", this is an error.
  3roll dup 4 unroll
  ~ (numeric base, current point in string, character, numeric base)
  decode-generalized-digit {
    ~ (numeric base, current point in string)
    drop drop 1 exit } if

  ~ The first byte is a valid generalized digit in the appropriate base, so
  ~ let's get started.
  ~ (numeric base, current point in string, initial value)
  swap

  {
    ~ (numeric base, result so far, current point in string)
    unpack8 dup 0 = {
      ~ A null after the first character is valid, and indicates we're done.
      drop drop swap drop 0 exit } if

    ~ Decode the latest byte as a generalized digit in the base.
    ~ (numeric base, result so far, current point in string, latest byte)
    4 roll dup 5 unroll
    ~ (numeric base, result so far, current point in string, character
    ~  numeric base)
    decode-generalized-digit {
      ~ If the latest character is not a valid digit, that's an error.
      ~ (numeric base, result so far, current point in string)
      drop drop drop 1 exit } if

    ~ The latest character is valid, so incorporate it and loop.
    ~ (numeric base, result so far, current point in string, latest value)
    3roll 4 roll dup 5 unroll * + swap
  } forever ;


~ (string pointer
~  -- result (if successful),
~     error indicator (zero equals success))
: read-integer-unsigned
  ~ We don't have a character-literal syntax; the hex constants here are
  ~ ASCII codes.
  dup unpack8 0x30 != { ~ digit "0"
    ~ This is the case where the leading digit is not a zero.
    ~ (original string pointer, advanced string pointer)
    drop 10 read-base-unsigned exit } if

  ~ This is the case where the leading digit is a zero.
  ~ (original string pointer, advanced string pointer)
  unpack8 dup 0x78 = { ~ lowercase "x"
    ~ (original string pointer, doubly advanced string pointer, character)
    drop swap drop 16 read-base-unsigned exit } if

  dup 0x6f = { ~ lowercase "o"
    ~ (original string pointer, doubly advanced string pointer, character)
    drop swap drop 8 read-base-unsigned exit } if

  dup 0x62 = { ~ lowercase "b"
    ~ (original string pointer, doubly advanced string pointer, character)
    swap drop swap 2 read-base-unsigned exit } if

  ~ This is the case where the second character is something else.
  ~ (original string pointer, doubly advanced string pointer, character)
  drop drop 10 read-base-unsigned ;


~ (string pointer
~  -- result (if successful),
~     error indicator (zero equals success))
: read-integer
  ~ We don't have a character-literal syntax; this is ASCII for a hyphen.
  dup unpack8 0x2d != {
    ~ This is the case where it's non-negative.
    ~ (original string pointer, advanced string pointer)
    drop read-integer-unsigned exit
  } if

  ~ This is the case where it's negative.
  ~ (original string pointer, advanced string pointer)
  swap drop read-integer-unsigned
  ~ (result maybe, exit code)
  dup {
    ~ Failure
    ~ (non-zero exit code)
    exit
  } if

  ~ Success
  ~ (result, zero exit code)
  swap -1 * swap ;


~ (string pointer
~  -- result (if successful),
~     error indicator (zero equals success))
: read-decimal
  ~ We don't have a character-literal syntax; this is ASCII for a hyphen.
  dup unpack8 0x2d != {
    ~ This is the case where it's non-negative.
    ~ (original string pointer, advanced string pointer)
    drop 10 read-base-unsigned exit
  } if

  ~ This is the case where it's negative.
  ~ (original string pointer, advanced string pointer)
  swap drop 10 read-base-unsigned
  ~ (result maybe, exit code)
  dup {
    ~ Failure
    ~ (non-zero exit code)
    exit
  } if

  ~ Success
  ~ (result, zero exit code)
  swap -1 * swap ;

~   Here, we allocate a single machine word's worth of space to use as the
~ backing store of a mutable variable, initialized to zero. Then we define the
~ variable which points to that address.
~
~   We don't actually need a word header for interpreter-flags-storage, we
~ could just append a zero and point to it directly, but that would make life
~ harder for words that attempt to work with the contents of other words. So
~ we give it a name.

s" interpreter-flags-storage" create make-hidden
here @ 0 , s" interpreter-flags" variable


~   There's an important bootstrapping concern: If you're loading this
~ interpreter into a running Evocation, it's important to not use the wrong
~ interpreter state value. The way we actually load this is under the log-load
~ transform, which will be running its alternate versions of these words and
~ therefore won't have trouble, but for the sake of making it easy to
~ experiment with modified versions of this file, we do the whole hide-show
~ dance.

: [ interpreter-flags @ 0x01 invert & interpreter-flags !
  ; make-immediate make-hidden
latest @
~ (pointer to [ entry)

: ] interpreter-flags @ 0x01 | interpreter-flags !
  ; make-hidden
latest @
~ (pointer to [ entry, pointer to ] entry)

~   It may seem nonsensical to use : to define :, but the log-load transform
~ provides an alternate for it, so it works. The same, of course, goes for all
~ these other word-defining words. The syntax-centric ones such as : are here
~ in interpret.e because they need to interact with interpreter-flags, which
~ isn't defined before this point.
~
~   If the ] at the end feels backwards, imagine to yourself that everything
~ that ISN'T defining a word body is part of an implicit [ ... ] sequence.
~ Doing so doesn't really change anything, but may make you happier.
: : word value@ create dropstring docol , make-hidden
  [ dup entry-to-execution-token , ] ;

~   The counterpart of : is ;.
: ;
  ~ See commentary on "literal", in dynamic.e, regarding "lit exit".
  lit exit ,
  make-visible
  ~   See above regarding [. Since it's an immediate word, we have to go to
  ~ extra trouble to compile it as part of ;. Since it's also hidden, we have
  ~ to go behind the interpreter's back to even get its entry pointer.
  ~
  ~   Adding insult to injury, the log transform breaks , for unclear reasons,
  ~ so we do that by hand as well.
  [ over entry-to-execution-token , ]
  ; make-immediate make-hidden
latest @
~ (pointer to [ entry, pointer to ] entry, pointer to ; entry)


: ;asm
  here @ pack-next 8 packalign here !
  make-visible
  latest @ entry-to-execution-token dup 8 + swap !
  ~ See above.
  [ 2 pick entry-to-execution-token , ]
  ; make-immediate make-hidden
latest @
~ (pointer to [ entry, pointer to ], pointer to ;, pointer to ;asm)


~   The word "'", often pronounced "tick", quotes the following word, looking
~ it up and treating it as a constant. In immediate mode, the constant winds
~ up on the stack; in compile mode it gets compiled.
~
~   There are a few possible implementation strategies here. Running as an
~ immediate word means there's a clear and unambiguous concept of "the
~ following word", so that's what we do; otherwise we'd have to get clever
~ about somehow finding out where we were called from. That means we take on
~ what would otherwise be the interpreter's responsibility, of checking what
~ mode we're in. Happily, that's easy to do.
~
~   There's a cyclic dependency where "if" relies on "'", and "'" relies on
~ "if". Fortunately flow-control is done with alternates by the log-load
~ transform, so we don't have to worry about it.
: ' word value@ find dropstring-with-result
  interpreter-flags @ 0x01 & { literal } if
  ; make-immediate


~ Let's have strings now.
: s"
  consume
  interpreter-flags @ 0x01 & {
    [ s" litstring" find literal ]
    entry-to-execution-token ,
  } if
  here @ key { dup dup 0x22 != && } { pack8 key } while
  drop 0 pack8
  interpreter-flags @ 0x01 &
    { 8 packalign here ! }
    { drop here @ } if-else
  ; make-immediate


~ Finicky semantics, but also important.
: ."
  ' s" entry-to-execution-token execute
  interpreter-flags @ 1 &
    { [ s" emitstring" find literal ]
      entry-to-execution-token , }
    { emitstring } if-else
  ; make-immediate


~ While we're thinking about input, let's also have comments.
: ~
  key { dup dup 10 != && } { drop key } while drop
  ; make-immediate


~ Now the single most important word...
: interpret
  word

  ~ If no word was returned, exit.
  dup 0 = { drop exit } if

  ~ The string is on the top of the stack, so to get a pointer to it we get
  ~ the stack address.
  ~ (string)
  value@ find

  ~ Check whether the word was found in the dictionary.
  dup {
    ~ If the word is in the dictionary, check what mode we're in, then...
    dropstring-with-result
    ~ (entry pointer)
    interpreter-flags @ 0x01 & {
      ~ ... if we're in compile mode, there's still a chance it's an immediate
      ~ word, in which case we fall through to interpret mode...
      dup entry-flags@ 1 & 0 =

      ~ ... but it's a regular word, so append it to the heap.
      { entry-to-execution-token , exit } if
    } if

    ~ ... if we're in interpret mode, or the word is immediate, run it.
    entry-to-execution-token execute exit
  } if

  ~ If it's not in the dictionary, check whether it's an integer literal.
  drop
  ~ As before, we get the stack address and use it as a string pointer.
  ~ (string)
  value@ read-integer 0 = {
    ~ It's a number.
    interpreter-flags @ 0x01 & {
      ~ We're in compile mode; append first "lit", then the number, to the
      ~ heap. The version of "lit" we use is the one that's current when we
      ~ ourselves are compiled, hardcoded; doing a dynamic lookup would
      ~ require dealing with what happens if it's not found.
      dropstring-with-result
      ' lit entry-to-execution-token
      , ,
      exit
    } if

    ~ We're in interpret mode; push the number to the stack. Or at least, that's
    ~ what the code we're interpreting will see. Really it's already on the
    ~ stack, just clean everything else up and leave it there.
    dropstring-with-result exit
  } if

  ~ If it's neither in the dictionary nor a number, just print an error.
  ." No such word: " value@ emitstring newline dropstring ;

: quit r0 @ control! { interpret } forever ;

~   Now we enable the three words we'd been keeping hidden. It would be a
~ good idea to call "quit" as soon as possible after this, but we leave it to
~ the user.
dup entry-flags@ 0x80 invert & entry-flags!
dup entry-flags@ 0x80 invert & entry-flags!
dup entry-flags@ 0x80 invert & entry-flags!
dup entry-flags@ 0x80 invert & entry-flags!

~   Though we only do it once, this is a bit involved so we provide it as a
~ word. Notionally there are situations in which it could come up again, so
~ it seems worth having around.
~
~   It actually needs to run as a compiled word no matter what; if it were
~ run in interpret mode it would cut itself off from the rest of itself.
~ However, if we didn't want to keep it around we could have it forget
~ itself...
~
~   It depends on ', so it's here in interpret.e.
: relink-main-input-buffer-to-stdin
  1024 allocate dup main-input-buffer buffer-physical-start !
  main-input-buffer buffer-logical-start !
  1024 main-input-buffer buffer-physical-length !
  0 main-input-buffer buffer-logical-length !
  [ s" refill-input-buffer-from-stdin"
    find entry-to-execution-token literal ]
  main-input-buffer input-buffer-refill !
  ;

~   Of course it is perfectly reasonable to change this, but for now it's
~ hardcoded.
relink-main-input-buffer-to-stdin