~ ~~~~~~~~~~~~~~~~~~~~~ ~ ~~ Execution model ~~ ~ ~~~~~~~~~~~~~~~~~~~~~ ~ ~ We use Forth-style dual stacks, one for values and one for control. We ~ use rsp for values, just like C does. We use rbp for the control stack, ~ which is a special Forth-y stack: These are pointers into the bodies of ~ Forth words, not return addresses. ~ ~ The choice of rsp and rbp for the stack pointers imitates Jonesforth; ~ I'm hopeful that it gives us convenient addressing modes, and will report ~ back about that when I feel that I understand the implications. ~ ~ In Forth, everything is a "word", including mutable variables. ~ Conceptually, a word is a unit of execution, which may be implemented ~ either in machine code or as an array of pointer to other words. ~ ~ This polymorphism is implemented by having each word's contents begin ~ with a "codeword", which is a pointer to machine code that "interprets" ~ the rest of the contents. In the case of words implemented in machine ~ code, the codeword points directly to that code, which is normally right ~ next to it. ~ ~ Variables, to Forth, are simply one more thing that can be executed; the ~ effect of executing a variable is to push its address onto the value ~ stack. ~ ~ We adopt this model of words, codewords, and variables-as-words. It's ~ really nice how it doesn't force anything else on us, not even a heap, ~ though we do end up using a heap. ~ ~ We specifically implement a version of calling and returning that Forth ~ calls indirect threaded code: The control stack is a stack of pointers ~ into the middle of interpreted words. The interpreter snippet, called ~ docol, implements calling. Each word is responsible for making sure ~ returning works properly. Interpreted words accomplish this by ending with ~ the word "exit", while machine-code words accomplish it by ending with a ~ verbatim snippet called "next". ~ ~ Conceptually, "next" returns, but more specifically it accomplishes this ~ by doing the caller's next dispatch for it; thus control never actually ~ goes back to the caller's interpreter after initial setup. For performance ~ reasons, "next" is always inlined, so we define it as a macro. ~ ~ The docol routine is just ordinary code, not a macro. It's defined later ~ in this file, as a label. ~ ~ Notionally, we could consider not having a dictionary, and not giving ~ our words names. However, it feels silly to stop when we're so close to ~ being a full Forth, and using names for things solves a bootstrapping ~ problem related to heap management - see the write-up of _start about how ~ the heap is created, below. So, we add an additional header before the ~ codeword for this purpose. ~ ~ The Forth dictionary is usually a linked list of every word that has ~ ever been defined, with the newest at the head; the names of words are ~ stored in string fields, often right next to the link pointer. We adopt ~ this model, with the field sizes and order shown in the quick reference ~ below. We break with Forth tradition in one way: Rather than having a ~ length field, we use a null-terminated string. Thus, there's no length ~ limit on names. This necessitates breaking out the flags (to be explained ~ later) into their own byte, rather than taking bits from the length field ~ for them. ~ ~ There's an important performance consideration: Executable words ~ reference each other by pointers to their respective codewords. However, ~ dictionary entries reference each other by pointers to their respective ~ link fields. Traversing from the link field to the codeword is easy, ~ though it's a non-constant-time operation: Just walk the string. In order ~ to make Forth words easy to "decompile", it would be nice to also have a ~ way to traverse backwards. We solve this by making the name field be ~ null-terminated at both ends. Fun, yeah? ~ ~ ~ ~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ Quick Reference ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~ The layout of an interpreted word: ~ ~ (overall start) ~ 0x00 - 0x07 Link (to next-oldest word) ~ 0x08 - 0x08 HM00000I Flags ~ H - hidden ~ M - metadata ~ I - immediate ~ all other bits reserved ~ (name start) ~ 0x09 - 0x09 Null byte (terminates name) ~ 0x0a - name-end - 1 Name, as UTF-8 ~ name-end - name-end Null byte (terminates name) ~ (padding start) ~ name-end + 1 - codeword-start - 1 Zero-pad to 8-byte boundary ~ (it's possible this will be zero bytes long) ~ (codeword start) ~ ... + 0x00 - ... + 0x08 Codeword (ie. address of docol) ~ (8-byte chunks) Addresses of other words ~ - ... (end) Address of "exit" word ~ ~ The layout of a machine-code word is different only from the codeword on: ~ ~ ... + 0x00 - ... + 0x08 Addresss of next byte ~ ... + 0x08 - ???? Arbitrary machine code ~ - ... (end) Inlined implementation of next ~ ~ Also, words always start at 8-byte boundaries. ~ ~ ~ REGISTER usage conventions: ~ ~ * rsi is the "instruction pointer" for the "interpreter". ~ That is, it points to some word-pointer inside an array of ~ word-pointers inside the content of the word they're part of. It always ~ points to the next word that should be executed, whose execution hasn't ~ begun yet. ~ ~ * rbp points to the top of the control stack ~ These are former values of rsi, to eventually be returned to, from ~ successively older callers as you look further up the stack. The stack ~ grows downwards in memory. Since values are kept separately, the only ~ thing on the control stack is return addresses, one per layer of call. ~ ~ * rsp points to the top of the value stack ~ The value stack has no specific format, but it grows downwards in ~ memory. In particular there's no concept of stack frames, because items ~ on the stack don't belong to any particular word; the value stack in ~ Forth is in part a mechanism for passing values between words. ~ ~ Additionally, immediately after beginning execution of a word: ~ ~ * rax points to the address of the codeword being executed ~ The value of rax is purely for the callee's benefit, and does not need ~ to be preserved. ~ ~ Other registers are purely discretionary, and are not preserved across ~ calls. ~ ~ ~ FLAG usage: ~ ~ * DF should be 0 ~ We use lodsq extensively and that makes it increment rsi after using it. ~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ Macro next ~ ~~~~~~~~~~ ~ ~ Include this inline at the end of a word implemented in machine-code. ~ Conceptually, it returns. What it actually does is do the next thing the ~ caller would do, which is call the next word from the caller's array of ~ word pointers. ~ ~ This is a widespread technique in Forth implementation, referred to as ~ indirect threaded code. It's "threaded" in the sense that each word takes ~ responsibility for finishing up by following the notional thread through the ~ metaphorical labyrinth to figure out the next word that its caller wants to ~ run after it. In other words, control never directly returns to the parent, ~ it proceeds directly to the sibling. ~ ~ Registers in: ~ ~ * rsi points to the address of the word to execute ~ ~ Registers out: ~ ~ * rax points to the codeword in the contents of the word that was executed ~ * rsi points to the next word-address after this one ~ ~ Flags ~ * DF = 0 is required ~ ~ (base address -- new base address) : pack-next ~ Copy the next word's address from *rsi into rax. Increment rsi (as per the ~ DF flag). lods64 ~ Load the codeword from the word's contents, and jump to the interpreter it ~ points to. :rax jmp-abs-indirect-reg64 ; ~ Macro beforenext ~ ~~~~~~~~~~~~~~~~ ~ ~ Sometimes we want to transfer control from a word implemented in ~ machine-code to another word, without coming back after, as if we were ~ simply jumping to it. This is an innovation of ours; Jonesforth doesn't do ~ it. It is similar to the tail-call optimization that many Lisp dialects ~ have. ~ ~ This implementation will work regardless of how the receiving word is ~ implemented. It impersonates the "next" snippet, setting up rax to point ~ to the codeword then jumping to the interpreter. Since it doesn't change ~ the control stack or rsi, when the receiving word eventually invokes ~ "next"; it will pick up in the same place as if this sending word had done ~ it. ~ ~ Thus, notionally we are doing just this one transfer of control before ~ eventually getting around to inlining "next". Hence the name. ~ ~ (target address, base address -- new base address) : pack-beforenext ~ Do a permanent transfer of control by setting rax and invoking the ~ codeword. Of course, we could jump to docol ourselves but this will work ~ regardless of what the receiving codeword is. :rax mov-reg64-imm64 :rax jmp-abs-indirect-reg64 ; ~ Macros pushcontrol ~ popcontrol ~ ~~~~~~~~~~~~~~~~~~ ~ ~ Include these inline to push an address onto the control stack, or pop ~ one off of it. You will recall the control stack is kept in rbp. The ~ parameter is given in a user-specified register. ~ ~ Jonesforth's analogous macros are called PUSHRSP and POPRSP but I think ~ that's super confusing, since rsp is also the name of a register, but a ~ different one. I guess it was less confusing in 32-bit, since esp doesn't ~ start with an "r". Anyway, this has to be named something that ~ distinguishes it from Intel's PUSH and POP opcodes, so... ~ ~ "Load effective address" is just a cute way to do arithmetic on a ~ register, here. To push or pop we decrement or increment rbp by 8. To ~ actually interact with the space in the stack, we indirect through rbp. ~ ~ Registers in and out: ~ ~ * rbp points to the top of the control stack. ~ ~ (source register, base address -- new base address) : pushcontrol swap :rbp -8 :rbp lea-reg64-disp8-reg64 swap :rbp 0 mov-disp8-reg64-reg64 ; ~ (target register, base address -- new base address) : popcontrol :rbp 0 3roll mov-reg64-disp8-reg64 :rbp 8 :rbp lea-reg64-disp8-reg64 ; ~ Constants ~ ~~~~~~~~~ ~ ~ These are used in cold-start, just below. : heap-requested-address 0x0000001000000000 ; ~ (very arbitrary) : heap-size 0x0000000001000000 ; ~ 16 MiB : control-stack-size 0x10000 ; ~ 64 KiB ~ Routine cold-start ~ ~~~~~~~~~~~~~~~~~~ ~ ~ This is the entry point of the whole program, the very first code we ~ actually execute. Linkers traditionally call this _start, but let's live a ~ little. Anyway, the ELF header points to it and exec() jumps to it; from the ~ kernel's perspective, there's no such thing as names. Also, though this code ~ could be anywhere in the code part of the output, in order to make the ~ hexdump pretty we put it at the start. ~ ~ The kernel gives us most registers zeroed, and rsp pointing to the ~ command-line stuff (argc, argv, envp), which is at an ASLR'd address with ~ some stack space allocated for us, despite the fact we didn't request any. ~ It also gives us all the flags clear except IF, but we don't rely on that. ~ Lastly, of course, it loads our code segment and sets the instruction ~ pointer where we asked; we don't need to check what those addresses are, ~ because they're not randomized. ~ ~ This routine is really only responsible for one-time initialization. ~ ~ Registers in: ~ ~ * rsp points to the logical top of the value stack ~ The kernel sets this up for us, and we need to save it somewhere so ~ Forth can use it. ~ ~ Registers out: ~ ~ * rsi points within "quit" ~ Quit is the word that's Forth's closest equivalent to main(). ~ * rsp points to the top of the value stack ~ ~ Notably, rbp is still uninitialialized after _start. ~ ~ Stack in: ~ ~ * argc, argv, envp in the usual Unix way ~ We ignore them, though. ~ ~ Stack out: ~ ~ * The value of "heap", as a pointer ~ The meaning of this will be explained below. ~ ~ Registers within: ~ ~ * rdi points to the base the heap was allocated at, once it exists ~ This is the same value that "heap" will hold, once we reach a point ~ where we have variables. Of course, variables are stored on the heap, ~ hence this temporary measure. ~ ~ We also take this opportunity to define soeme memory layout parameters ~ that this routine will be responsible for doing something with: ~ ~ (output memory start, current output point ~ -- output memory start, current output point) : cold-start current-offset L' cold-start set-label cld ~ clear the DF flag ~ Prepare the heap. ~ ~ We could ask for a data segment in the program header, but where's the ~ fun in that? Instead, we call mmap(). ~ ~ If we wanted the kernel to do ASLR for us, passing address zero would ~ cause it to pick somewhere at random, but instead we choose our own ~ location. It's still not guaranteed to be where we ask for, so we still ~ do the work to record where it wound up. We could pass the "fixed" flag ~ and the kernel would trust us, but this gives us more options for ~ interoperating with other runtimes. ~ 9 :rax mov-reg64-imm64 ~ mmap() heap-requested-address :rdi mov-reg64-imm64 ~ address (very arbitrary) heap-size :rsi mov-reg64-imm64 ~ size (one meg) 0x07 :rdx mov-reg64-imm64 ~ protection (read+write+exec) 0x22 :r10 mov-extrareg64-imm64 ~ flags (private+anonymous) 0 :r8 mov-extrareg64-imm64 ~ file descriptor (ignored) 0 :r9 mov-extrareg64-imm64 ~ offset (ignored) syscall ~ The return value of the system call is in rax, we'll use it in a sec. ~ We need to save this somewhere in case we ever want to munmap() it; ~ there's no widely-used name for it so we have to make one up. S0 and r0 ~ are widely-used names for the physical tops (logical bottoms) of the ~ value and control stacks, respectively, and we will eventually set those ~ up as well, so we should keep those names in mind. The control stack ~ lives within the heap, while the value stack is its own segment. This ~ value, though, is the physical bottom of the segment, meaning that it ~ stays the same even as we allocate and deallocate things within it. This ~ is unlike the two stack pointers, so we give it a name that doesn't ~ suggest similarity: "heap". ~ ~ Once Forth is fully set up, its internal variables will be accessed ~ through variable-words like any other Forth data, including "heap". To ~ get to that point, though, we need to be able to hold onto variable data ~ between now and then. In fact, if we don't have at least one of "heap" ~ and "here" (its counterpart which points to the logical top end), all ~ our efforts to hold onto anything seem a bit doomed. ~ ~ So, we temporarily dedicate rdi to "heap" - only within this routine - ~ and store everything else in ways that let us find things by reference ~ to it. We choose rdi because it works with the indexing modes we care ~ about, and its name suggests its function. ~ ~ The strategy Jonesforth uses is not applicable to us; Jonesforth ~ takes advantage of the linker to let its code segment refer to specific, ~ pre-allocated objects in the data segment. We are our own linker, and we ~ don't care to have a data segment. Hence, this approach. ~ ~ Keying things off "heap" is the fundamental decision, but to make sure ~ our variables are accessible both during early bootstrapping, and later, ~ we also have to be thoughtful about data structures. More on that in a ~ moment. ~ :rax :rdi mov-reg64-reg64 ~ We also initialize rbp. We could hold off and let "quit" do this, but ~ doing it now is the easiest way to initialize the r0 variable, since ~ there's no instruction that moves a 64-bit immediate to memory. ~ ~ This is the moment at which we decide where the control stack starts! ~ Fun, right? "Allocation" is just a fancy word for picking where we want ~ something, then being consistent about it - like placing furniture in ~ your home. See below for a little more thought about why here in ~ particular. ~ :rdi control-stack-size :rbp lea-reg64-disp32-reg64 ~ Now we save some stuff onto the heap. These are the locations that ~ will eventually be the backing stores of the Forth variables, but we ~ don't create the word headers yet, since there's no requirement that ~ they be next to the backing stores. We'll do that later, once we have ~ word-writing infrastructure in place. For now, we just use their offsets ~ relative to the physical bottom of the heap, which are fixed. ~ ~ These will be the permanent homes of these values, though we have ~ copies of them elsewhere while we're still in this routine. ~ :rdi control-stack-size 0x00 + :rdi mov-reg64-disp32-reg64 ~ heap :rsp control-stack-size 0x08 + :rdi mov-reg64-disp32-reg64 ~ s0 :rbp control-stack-size 0x10 + :rdi mov-reg64-disp32-reg64 ~ r0 L@' final-word-name :rax mov-reg64-imm64 :rax control-stack-size 0x18 + :rdi mov-reg64-disp32-reg64 ~ latest :rdi control-stack-size 0x28 + :rax lea-reg64-disp32-reg64 :rax control-stack-size 0x20 + :rdi mov-reg64-disp32-reg64 ~ here ~ ~ * "heap" is the physical bottom of the heap ~ The heap grows upwards in memory, so this is also the logical ~ bottom. This comes from the address mmap() just returned to us. ~ * "s0" is the logical bottom of the value stack ~ The value stack grows downwards in memory, so this is the physical ~ top of it. This comes from the stack pointer the kernel initialized us ~ with. ~ * "r0" is the logical bottom of the control stack ~ The control stack also grows downwards, so this is its pysical top ~ as well. We allocate this dedicated space within the heap right here, ~ in this routine, through our choice of where to put things. ~ * "here" is the physical start of the unallocated space in the heap ~ We allocate heap space from bottom to top, by incrementing this ~ value. So, it would also be accurate to say that it points immediately ~ after the physical top of the allocated space. At any rate, the ~ address it points to is the first one that hasn't been used yet. ~ * "latest" is the address of the most-recently-defined word's header ~ Defining new words changes this value. ~ ~ s0 and r0 are mostly used when we want to initialize or reinitialize ~ their respective stacks - that is, discard all their contents at once. ~ ~ The value of r0 is the same address these variables start at, so ~ you'll want to do a close read of the implementation of pushcontrol ~ and convince yourself that it only ever writes things just below the rbp ~ address it receives, never right on top of it. ~ ~ As an aside, by the way, please notice that strictly speaking, r0 ~ could be a constant... but it isn't known until runtime, so we might as ~ well make it a variable. That will play nicely with any astonishing ~ memory shenanigans someone might wish to do in the future. ~ ~ Notice also that "here" points immediately after itself. This is just ~ a convenience, making it the last one like that so that the concern is ~ dealt with in a single place and is easy to keep up-to-date with code ~ changes. ~ ~ A little more detail about why we offset everything by ~ control_stack_size: We're carving out some space at the bottom of the ~ heap - which grows low-to-high - to be the control stack - which grows ~ high-to-low. So the control stack is allocated out of the heap as a ~ fixed-size, one-time thing, and then the variables come immediately ~ after that. We do need to use 32-bit displacement indexing to access ~ them this way, but that's no big deal. ~ ~ This is perhaps questionable, they should maybe be separate segments ~ created with separate calls to mmap(), but for now we're not worried ~ about overflow so we use the same allocation for both. ~ ~ We'll come back to these variables a bit later and generate the word ~ headers that point at them, but now we're almost ready to switch to ~ proper threaded-execution, so we finish that setup first... ~ Push the value of "heap" onto the value stack so that it can be the ~ breadcrumb the threaded code needs to find... the backing store of ~ "heap". Yes, self-reference can be weird like that sometimes. There's ~ nothing stopping "quit" from reading rdi, it just violates the ~ abstraction... :rdi push-reg64 ~ We are about to set up rsi, we did rbp already, and rsp came to us ~ already set up. That's all that "next" needs, so take it away! L@' warm-start origin + :rsi mov-reg64-imm64 pack-next ; ~ Routine warm-start ~ ~~~~~~~~~~~~~~~~~~ ~ ~ This routine runs as Forth code. That is, indirect threaded execution has ~ been initialized, and the structure of this routine is an array of codeword ~ pointers. It was directly jumped to from cold-start. There's nowhere to ~ return to, so it needs to never return. : warm-start ~ While it's not actually a requirement that codeword pointers be ~ word-aligned, it's highly likely that it helps performance. (Whether it ~ does is up to Intel's microcode.) 8 packalign current-offset L!' warm-start ~ Before handing off to us, cold-start pushed a single value onto the ~ stack, a pointer to the beginning of the heap. Now, we load our entire ~ Forth implementation onto that heap, beginning with the minimal set of ~ words needed to define more words. We do this because we need variables as ~ infrastructure so we can eventually have dynamic definitions. ~ ~ There's something non-obvious here: words implemented statically as ~ part of the executable image can't contain things that vary at runtime. ~ That means that even if these words tried to implement some sort of ~ dynamic lookup, they would have no way to find the root of whatever ~ dynamic data structure they use. Dynamism needs to be bootstrapped. ~ ~ In a more traditional C-style program, static code could look up ~ variables based on fixed addresses that are the same on every run. ~ Failing that, we could dedicate a register to it, though that's a ~ considerable expense. We chose not to do either of those things, because ~ we want the versatility that comes with not being picky about our ~ address space: It allows us to contemplate future improvements such as ~ ASLR, or embedding into other processes that impose their own addressing ~ constraints, or even coexisting with multiple versions of ourselves. ~ That choice does mean we have the hard version of this bootstrapping ~ problem, and copying ourselves to the heap is how we solve it. ~ ~ We do have the heap address right now, though that won't last. In case ~ it's unclear why not: keeping it on the stack would require all future ~ references to walk the stack, and somehow know when they've reached the ~ bottom. The stack is a good place to keep things with clearly delimited ~ lifetimes and visibility, but when we want something to live for our ~ entire program and be easy to find from any code within it, we need to ~ do something else. Anyway, since we have the address, we can use it for ~ the next little bit of setup. ~ ~ The first few words we define are our variables, which hardcode the ~ addresses they will return - but since we're doing this at runtime, ~ "hardcoding" can reflect where our heap is. This is the fundamental ~ trick that makes the heap usable. ~ ~ One more thing to notice: We already allocated the backing stores of ~ these variables, and populated their initial values, in _start. The ~ words we're defining return those same addresses for the same backing ~ stores. So, we have continuity: Stuff defined in terms of the ~ variable-words we're defining now will interoperate with the stuff that ~ we define in the "early" way, which includes those very words. Both the ~ early code and the later code are dealing with the same data structures, ~ they're just using a different technique to find them. ~ ~ This is the only hardcoding we need to do; by building on top of it, ~ we will soon reach a point where the rest of the system can be defined ~ within itself. ~ TODO These need to, like, exist first. Also they need to be referenced ~ as labels. ~ dq early_heap, litstring, "heap", early_variable ~ dq early_s0, litstring, "s0", early_variable ~ dq early_r0, litstring, "r0", early_variable ~ dq early_latest, litstring, "latest", early_variable ~ dq early_here, litstring, "here", early_variable ; ~ (previous entry address, output point, name string pointer ~ -- new entry address, output point) : output-create 3roll dup 4 roll swap pack64 ~ (string pointer, new entry address, output point) 0 pack8 0 pack8 roll3 packstring ~ (new entry address, output point) 8 packalign ; ~ Routine docol ~ ~~~~~~~~~~~~~ ~ ~ Reference this via its label as the codeword of a word to make it an ~ "interpreted" word. Concretely, it saves rsi (the "instruction pointer") ~ to the control stack, takes the address of the codeword from rax and ~ increments it in-place to form the new instruction pointer, and copies ~ that to rsi. ~ ~ Having then done this, we're now in the state that normal execution ~ expects, so docol ends by it using "next" to begin the callee's execution, ~ kicking off a nested call. ~ ~ The name is said to be short for "do colon", because Forth high-level ~ code begins word definitions with a colon. ~ ~ Registers in: ~ ~ * rsi is the caller's instruction pointer ~ * rbp is the control stack pointer ~ * rax is the address of the callee's codeword ~ ~ Registers out: ~ ~ * rsi is the callee's instruction pointer ~ * rbp is the control stack pointer ~ ~ (previous entry address, output point) : output-docol s" docol" output-create ~ Evaluated as a word, docol is a constant which returns a pointer. L@' docol :rax mov-reg64-imm64 :rax push-reg64 pack-next 8 packalign ~ Since docol is not a normal word, the label points to the value we care ~ about from the assembly side of things, which is the address we use as the ~ codeword. current-offset L!' docol :rsi pack-pushcontrol 8 :rax add-reg64-imm8 :rax :rsi mov-reg64-reg64 pack-next 8 packalign ; ~ This is the mechanism to "return" from a word interpreted by docol. ~ We pop the control stack, and then, since this is threaded execution, we ~ do the next thing the caller wants to do, by inlining "next". ~ ~ (previous entry address, output point ~ -- new entry address, output point) : output-exit s" exit" output-create L!' exit :rsi pack-popcontrol pack-next ;