~ ~~~~~~~~~~~~~~~~~ ~ ~~ Interpreter ~~ ~ ~~~~~~~~~~~~~~~~~ ~ ~ The code in this file defines the basic syntax and semantics of Forth as ~ a text-based language. It's written in terms of the underlying executor, ~ which is implemented and explained in execution.e. The execution model gives ~ us the concept of "words"; the control and value stacks; and the ability to ~ call things. It has nothing to say about text, only about the binary form of ~ the language. ~ ~ It's traditional in Forth to refer to an act of "compiling" code, which ~ in this context means turning it from text into its binary representation. ~ That binary representation most commonly takes the form of a word entry ~ header followed by an array of codeword pointers. ~ ~ It would be legitimate to critique the terminology by saying that codeword ~ pointers are still, in some sense, interpreted: They are not machine code to ~ be directly executed by the CPU; they rely on "docol" and "next" at runtime. ~ However, in language design circles, the term "compilation" takes on a ~ broader meaning, referring to any process which requires some or all of the ~ types of infrastructure we regard as being compiler internals: A successive ~ translation of code from one form into another, discarding some types of ~ information while computing others, in a careful order that results in ~ logically consistent output which in some sense has the same meaning as the ~ input. Sometimes this output may be machine code, but often it is another ~ language meant for human consumption, or an intermediate layer meant to be ~ fed into another process. ~ ~ Forth compilation is compilation in this sense, so there is no conflict ~ and we run with the established terminology. In addition, it must be noted ~ that Evocation, like many Forths, makes extensive use of words which are ~ implemented directly in machine code; the Forth execution model allows these ~ words to co-exist with words that are interpreted by "docol". ~ ~ At any rate, the code in this file is responsible for that compilation. ~ ~ It is primarily concerned with managing the contents of an area of memory ~ we call the "log"; see log-load.e for more detail on terminology. ~ As a convenience for "word", we have some facilities for working with ~ stack-allocated strings. Yeah, trippy concept. Also, it would be a ~ buffer overrun hazard if we were worried about that, which is why this ~ is no longer common practice in C. ~ ~ The most important of these is accumulate-string, but we need some ~ smaller pieces first. ~ (stack string..., items to be left alone..., item to be unrolled, ~ number of items above string that participate in the unroll ~ -- item that was unrolled, stack string..., items left alone) : unroll-past-string 8 * ~ (string, other items, top item, byte offset to string start) dup value@ + ~ We have two copies of the offset present, in addition to the stuff we want ~ to rotate. So, the actual string starts two words on... We could have ~ adjusted the offset instead, but we'll want the unmodified offset again ~ later. 16 + ~ (string, other items, top item, offset to start, string pointer) stringlen ~ Same reasoning as in accumulate-string (see below). 1 + 8 align-size ~ (string, other items, top item, offset to start, string length w/ padding) + 8 /% swap drop ~ (string, other items, top item, number of words to unroll) unroll ; ~ (stack string..., item to be swapped ~ -- item that was swapped, stack string...) : swap-past-string 1 unroll-past-string ; ~ (stack string... --) : dropstring value@ stringlen ~ Same reasoning as in accumulate-string (see below). 1 + 8 align-size value@ + ~ At the time we fetched the stack pointer, there was an extra value atop ~ it, so we have to add one more word. 8 + value! ; ~ (stack string..., item to be kept ~ -- item that was kept) : dropstring-with-result swap-past-string dropstring ; ~ (stack string-so-far..., new character byte ~ -- updated string-so-far) : accumulate-string ~ Compute the address of the final word of the string. ~ ~ It's a little bit difficult to get the start pointer right, since all ~ our intermediate products affect what we get from value@, so we compute ~ that just once, here at the beginning. value@ 8 + ~ (string so far, new character byte, pointer to start of string) dup stringlen ~ There are two concerns here that overlap: First, we always want at least ~ one word. Recall that a length of zero bytes won't receive any alignment ~ padding because it's already divisible by 8. Second, the result of ~ stringlen doesn't include the null byte, which might be in a word by ~ itself that needs to be counted. We can address both of them by ~ unconditionally adding 1 to the length before applying alignment. 1 + ~ Pad the length for alignment. 8 align-size ~ We want an offset from the first word of the string to the last word of ~ the string, so we subtract one word from the length. 8 - + ~ (string so far, new character byte, address of final word) ~ Examine the final word of the string, leaving other stuff undisturbed. ~ Work low-to-high to figure out where to store the new byte, taking the ~ first one that's available. ~ (string so far, new character byte, address of final word) dup @ dup 0x00000000000000FF & { 3roll | swap ! exit } unless ~ (string so far, new character byte, address of final word, old value) ~ This next part is repeated several times, changing only the offsets, for ~ bytes 1 through 6; bytes 0 (above) and 7 (way below) are different. ~ (string so far, new character byte, address of final word, old value) dup 0x000000000000FF00 & { 3roll 0x0000000000000100 * | swap ! exit } unless dup 0x0000000000FF0000 & { 3roll 0x0000000000010000 * | swap ! exit } unless dup 0x00000000FF000000 & { 3roll 0x0000000001000000 * | swap ! exit } unless dup 0x000000FF00000000 & { 3roll 0x0000000100000000 * | swap ! exit } unless dup 0x0000FF0000000000 & { 3roll 0x0000010000000000 * | swap ! exit } unless dup 0x00FF000000000000 & { 3roll 0x0001000000000000 * | swap ! exit } unless ~ The top byte of the final word is always zero (or else stringlen ~ wouldn't have called it the final word), so we don't need to check it, we ~ can just use it. ~ ~ We need to put the new value in the top byte, which will mean we have no ~ null terminator, so we also need to start a new word. ~ ~ There is a fiddly order-dependency here: unroll-past-string relies on ~ being able to find the null terminator, which won't work if we've gotten ~ rid of it. Also, calling it will move all the earlier words, including ~ the one we intend to write to, which will invalidate any pointer we're ~ keeping at that point. There's a few ways to resolve this; what we do is ~ put the new terminator in place first, manually nudge the pointer, and ~ then write the new value. ~ (string so far, new character byte, address of final word, old value) 3roll 0x0100000000000000 * | swap ~ (string so far, new value, address of final word) 0 3 unroll-past-string ~ (new null terminator, string so far, new value, invalid address) 8 - ~ (new null terminator, string so far, new value, updated address) ! ; ~ (byte -- boolean) : is-space dup 0x20 = { drop 1 exit } if dup 0x09 = { drop 1 exit } if dup 0x0a = { drop 1 exit } if dup 0x0b = { drop 1 exit } if dup 0x0c = { drop 1 exit } if dup 0x0d = { drop 1 exit } if drop 0 ; ~ (character -- 1 for true or 0 for false) : is-alphanumeric ~ We don't have a character-literal syntax; the hex constants here are ~ ASCII codes. dup 0x30 > { drop 0 exit } if ~ Less than "0". dup 0x39 >= { drop 1 exit } if ~ Less than or equal to "9". dup 0x41 > { drop 0 exit } if ~ Less than "A". dup 0x5a >= { drop 1 exit } if ~ Less than or equal to "Z". dup 0x61 > { drop 0 exit } if ~ Less than "a". dup 0x7a >= { drop 1 exit } if ~ Less than or equal to "z". drop 0 ; ~ Greater than "z". ~ Okay, this is the big one, the lexer! Wow. ~ ~ (-- stack string occupying multiple words) : word ~ We allocate an empty string first, so that the result of "key" will ~ conveniently be on the easy-to-find end of it. 0 ~ Skip whitespace. { key dup is-space } { drop } while ~ Early exit if it's a zero byte. dup 0 = { drop dropstring 0 exit } if accumulate-string { peek dup is-space { drop exit } if dup { drop exit } unless consume accumulate-string } forever ; ~ (character -- value) : generalized-digit-value ~ We don't have a character-literal syntax; the hex constants here are ~ ASCII codes. dup 0x61 <= { 0x61 - 10 + exit } if ~ lowercase "a" dup 0x41 <= { 0x41 - 10 + exit } if ~ uppercase "a" 0x30 - ; ~ digit "0" ~ (character, base ~ -- value (if successful), ~ error indicator (zero equals success)) : decode-generalized-digit swap dup is-alphanumeric { ~ It's alphanumeric. ~ (base, character) generalized-digit-value ~ (base, value) dup 3roll ~ (value, value, base) > { ~ It's in range. ~ (value) 0 exit } if ~ It's out of range. ~ (value) drop 1 exit } if ~ It's not alphanumeric. drop drop 1 ; ~ (string pointer, base ~ -- result (if successful), ~ error indicator (zero equals success)) : read-base-unsigned swap ~ If the first byte is null, this is an error unpack8 ~ (numeric base, current point in string, character) dup 0 = { drop drop drop 1 exit } if ~ Decode the first byte as a generalized digit in the base. ~ (numeric base, current point in string, character) ~ If the first byte is less than "0", this is an error. 3roll dup 4 unroll ~ (numeric base, current point in string, character, numeric base) decode-generalized-digit { ~ (numeric base, current point in string) drop drop 1 exit } if ~ The first byte is a valid generalized digit in the appropriate base, so ~ let's get started. ~ (numeric base, current point in string, initial value) swap { ~ (numeric base, result so far, current point in string) unpack8 dup 0 = { ~ A null after the first character is valid, and indicates we're done. drop drop swap drop 0 exit } if ~ Decode the latest byte as a generalized digit in the base. ~ (numeric base, result so far, current point in string, latest byte) 4 roll dup 5 unroll ~ (numeric base, result so far, current point in string, character ~ numeric base) decode-generalized-digit { ~ If the latest character is not a valid digit, that's an error. ~ (numeric base, result so far, current point in string) drop drop drop 1 exit } if ~ The latest character is valid, so incorporate it and loop. ~ (numeric base, result so far, current point in string, latest value) 3roll 4 roll dup 5 unroll * + swap } forever ; ~ (string pointer ~ -- result (if successful), ~ error indicator (zero equals success)) : read-integer-unsigned ~ We don't have a character-literal syntax; the hex constants here are ~ ASCII codes. dup unpack8 0x30 != { ~ digit "0" ~ This is the case where the leading digit is not a zero. ~ (original string pointer, advanced string pointer) drop 10 read-base-unsigned exit } if ~ This is the case where the leading digit is a zero. ~ (original string pointer, advanced string pointer) unpack8 dup 0x78 = { ~ lowercase "x" ~ (original string pointer, doubly advanced string pointer, character) drop swap drop 16 read-base-unsigned exit } if dup 0x6f = { ~ lowercase "o" ~ (original string pointer, doubly advanced string pointer, character) drop swap drop 8 read-base-unsigned exit } if dup 0x62 = { ~ lowercase "b" ~ (original string pointer, doubly advanced string pointer, character) swap drop swap 2 read-base-unsigned exit } if ~ This is the case where the second character is something else. ~ (original string pointer, doubly advanced string pointer, character) drop drop 10 read-base-unsigned ; ~ (string pointer ~ -- result (if successful), ~ error indicator (zero equals success)) : read-integer ~ We don't have a character-literal syntax; this is ASCII for a hyphen. dup unpack8 0x2d != { ~ This is the case where it's non-negative. ~ (original string pointer, advanced string pointer) drop read-integer-unsigned exit } if ~ This is the case where it's negative. ~ (original string pointer, advanced string pointer) swap drop read-integer-unsigned ~ (result maybe, exit code) dup { ~ Failure ~ (non-zero exit code) exit } if ~ Success ~ (result, zero exit code) swap -1 * swap ; ~ (string pointer ~ -- result (if successful), ~ error indicator (zero equals success)) : read-decimal ~ We don't have a character-literal syntax; this is ASCII for a hyphen. dup unpack8 0x2d != { ~ This is the case where it's non-negative. ~ (original string pointer, advanced string pointer) drop 10 read-base-unsigned exit } if ~ This is the case where it's negative. ~ (original string pointer, advanced string pointer) swap drop 10 read-base-unsigned ~ (result maybe, exit code) dup { ~ Failure ~ (non-zero exit code) exit } if ~ Success ~ (result, zero exit code) swap -1 * swap ; ~ Here, we allocate a single machine word's worth of space to use as the ~ backing store of a mutable variable, initialized to zero. Then we define the ~ variable which points to that address. ~ ~ We don't actually need a word header for interpreter-flags-storage, we ~ could just append a zero and point to it directly, but that would make life ~ harder for words that attempt to work with the contents of other words. So ~ we give it a name. s" interpreter-flags-storage" create make-hidden here @ 0 , s" interpreter-flags" variable ~ There's an important bootstrapping concern: If you're loading this ~ interpreter into a running Evocation, it's important to not use the wrong ~ interpreter state value. The way we actually load this is under the log-load ~ transform, which will be running its alternate versions of these words and ~ therefore won't have trouble, but for the sake of making it easy to ~ experiment with modified versions of this file, we do the whole hide-show ~ dance. : [ interpreter-flags @ 0x01 invert & interpreter-flags ! ; make-immediate make-hidden latest @ ~ (pointer to [ entry) : ] interpreter-flags @ 0x01 | interpreter-flags ! ; make-hidden latest @ ~ (pointer to [ entry, pointer to ] entry) ~ It may seem nonsensical to use : to define :, but the log-load transform ~ provides an alternate for it, so it works. The same, of course, goes for all ~ these other word-defining words. The syntax-centric ones such as : are here ~ in interpret.e because they need to interact with interpreter-flags, which ~ isn't defined before this point. ~ ~ If the ] at the end feels backwards, imagine to yourself that everything ~ that ISN'T defining a word body is part of an implicit [ ... ] sequence. ~ Doing so doesn't really change anything, but may make you happier. : : word value@ create dropstring docol , make-hidden [ dup entry-to-execution-token , ] ; ~ The counterpart of : is ;. : ; ~ See commentary on "literal", in dynamic.e, regarding "lit exit". lit exit , make-visible ~ See above regarding [. Since it's an immediate word, we have to go to ~ extra trouble to compile it as part of ;. Since it's also hidden, we have ~ to go behind the interpreter's back to even get its entry pointer. ~ ~ Adding insult to injury, the log transform breaks , for unclear reasons, ~ so we do that by hand as well. [ over entry-to-execution-token , ] ; make-immediate make-hidden latest @ ~ (pointer to [ entry, pointer to ] entry, pointer to ; entry) : ;asm here @ pack-next 8 packalign here ! make-visible latest @ entry-to-execution-token dup 8 + swap ! ~ See above. [ 2 pick entry-to-execution-token , ] ; make-immediate make-hidden latest @ ~ (pointer to [ entry, pointer to ], pointer to ;, pointer to ;asm) ~ The word "'", often pronounced "tick", quotes the following word, looking ~ it up and treating it as a constant. In immediate mode, the constant winds ~ up on the stack; in compile mode it gets compiled. ~ ~ There are a few possible implementation strategies here. Running as an ~ immediate word means there's a clear and unambiguous concept of "the ~ following word", so that's what we do; otherwise we'd have to get clever ~ about somehow finding out where we were called from. That means we take on ~ what would otherwise be the interpreter's responsibility, of checking what ~ mode we're in. Happily, that's easy to do. ~ ~ There's a cyclic dependency where "if" relies on "'", and "'" relies on ~ "if". Fortunately flow-control is done with alternates by the log-load ~ transform, so we don't have to worry about it. : ' word value@ find dropstring-with-result interpreter-flags @ 0x01 & { literal } if ; make-immediate ~ Let's have strings now. : s" consume interpreter-flags @ 0x01 & { [ s" litstring" find literal ] entry-to-execution-token , } if here @ key { dup dup 0x22 != && } { pack8 key } while drop 0 pack8 interpreter-flags @ 0x01 & { 8 packalign here ! } { drop here @ } if-else ; make-immediate ~ Finicky semantics, but also important. : ." ' s" entry-to-execution-token execute interpreter-flags @ 1 & { [ s" emitstring" find literal ] entry-to-execution-token , } { emitstring } if-else ; make-immediate ~ While we're thinking about input, let's also have comments. : ~ key { dup dup 10 != && } { drop key } while drop ; make-immediate ~ Now the single most important word... : interpret word ~ If no word was returned, exit. dup 0 = { drop exit } if ~ The string is on the top of the stack, so to get a pointer to it we get ~ the stack address. ~ (string) value@ find ~ Check whether the word was found in the dictionary. dup { ~ If the word is in the dictionary, check what mode we're in, then... dropstring-with-result ~ (entry pointer) interpreter-flags @ 0x01 & { ~ ... if we're in compile mode, there's still a chance it's an immediate ~ word, in which case we fall through to interpret mode... dup entry-flags@ 1 & 0 = ~ ... but it's a regular word, so append it to the heap. { entry-to-execution-token , exit } if } if ~ ... if we're in interpret mode, or the word is immediate, run it. entry-to-execution-token execute exit } if ~ If it's not in the dictionary, check whether it's an integer literal. drop ~ As before, we get the stack address and use it as a string pointer. ~ (string) value@ read-integer 0 = { ~ It's a number. interpreter-flags @ 0x01 & { ~ We're in compile mode; append first "lit", then the number, to the ~ heap. The version of "lit" we use is the one that's current when we ~ ourselves are compiled, hardcoded; doing a dynamic lookup would ~ require dealing with what happens if it's not found. dropstring-with-result ' lit entry-to-execution-token , , exit } if ~ We're in interpret mode; push the number to the stack. Or at least, that's ~ what the code we're interpreting will see. Really it's already on the ~ stack, just clean everything else up and leave it there. dropstring-with-result exit } if ~ If it's neither in the dictionary nor a number, just print an error. ." No such word: " value@ emitstring newline dropstring ; : quit r0 @ control! { interpret } forever ; ~ Now we enable the three words we'd been keeping hidden. It would be a ~ good idea to call "quit" as soon as possible after this, but we leave it to ~ the user. dup entry-flags@ 0x80 invert & entry-flags! dup entry-flags@ 0x80 invert & entry-flags! dup entry-flags@ 0x80 invert & entry-flags! dup entry-flags@ 0x80 invert & entry-flags! ~ Though we only do it once, this is a bit involved so we provide it as a ~ word. Notionally there are situations in which it could come up again, so ~ it seems worth having around. ~ ~ It actually needs to run as a compiled word no matter what; if it were ~ run in interpret mode it would cut itself off from the rest of itself. ~ However, if we didn't want to keep it around we could have it forget ~ itself... ~ ~ It depends on ', so it's here in interpret.e. : relink-main-input-buffer-to-stdin 1024 allocate dup main-input-buffer buffer-physical-start ! main-input-buffer buffer-logical-start ! 1024 main-input-buffer buffer-physical-length ! 0 main-input-buffer buffer-logical-length ! [ s" refill-input-buffer-from-stdin" find entry-to-execution-token literal ] main-input-buffer input-buffer-refill ! ; ~ Of course it is perfectly reasonable to change this, but for now it's ~ hardcoded. relink-main-input-buffer-to-stdin