From 1fdeeb54b127fcd600d16c48c3b1b90e91f2ca28 Mon Sep 17 00:00:00 2001 From: Irene Knapp Date: Thu, 7 May 2026 18:56:23 -0700 Subject: document labels.e; also clean up elf.e the documentation in labels.e is entirely new, synthesized from informal private discussions. this is also intended as a final pass to make sure all the comments and nuances in the ELF code from quine.asm are incorporated in elf.e. also this uses the new `L@'` and `L!'` facilities for terseness Force-Push: yes Change-Id: Ieabb2bb26f4b83260f0072dcdcd0950f9aa9fab2 --- elf.e | 159 +++++++++++++++++++++++++++++++++++++++++++++------------------ hello.e | 12 ++--- labels.e | 128 +++++++++++++++++++++++++++++++++++++++++++++++--- 3 files changed, 242 insertions(+), 57 deletions(-) diff --git a/elf.e b/elf.e index 4224b50..c38f740 100644 --- a/elf.e +++ b/elf.e @@ -1,57 +1,128 @@ -~ ~~ -~ ~~ ELF header -~ ~~ -~ ~~ This is the top-level ELF header, for the entire file. An ELF always -~ ~~ has exactly one of this header, which is always at the start of the file. -~ ~~ +~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~ ~~ Executable file format ~~ +~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~ +~ Before we do anything specific to the actual program we're building, we +~ do a lot of ELF-specific stuff to ensure that our output is in a format +~ Linux knows how to run. +~ +~ This relies on the label facility defined in labels.e. Make sure to load +~ that first. + +~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~ ~~ Runtime memory origin ~~ +~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~ +~ First, we pick an origin to load at. This is arbitrary, but it can't be +~ zero. We define a constant word for it so the body of the program can use +~ it in label calculations in whatever ways it needs to. + +: origin 0x08000000 ; + + +~ ~~~~~~~~~~~~~~~~~~~~~ +~ ~~ ELF file header ~~ +~ ~~~~~~~~~~~~~~~~~~~~~ +~ +~ Second, we output ELF's top-level file header. This header describes the +~ entire file. 
An ELF always has exactly one of this header, which is always +~ at the start of the file. +~ +~ The program we're building should call this word as the first output it +~ generates. +~ +~ The only interesting thing here is the entry pointer. + : elf-file-header - 0x7f pack8 s" ELF" pack-raw-string ~ magic number - 2 pack8 ~ 64-bit - 1 pack8 ~ little-endian - 1 pack8 ~ ELF header format v1 - 0 pack8 ~ System-V ABI - 0 pack64 ~ (padding) - - 2 pack16 ~ executable - 0x3e pack16 ~ Intel x86-64 - 1 pack32 ~ ELF format version - - L' start use-label origin + pack64 ~ entry point + ~ * denotes mandatory fields according to breadbox + current-offset 3unroll + + 0x7f pack8 s" ELF" pack-raw-string ~ *magic number + 2 pack8 ~ 64-bit + 1 pack8 ~ little-endian + 1 pack8 ~ ELF header format v1 + 0 pack8 ~ System-V ABI + 0 pack64 ~ (padding) + + 2 pack16 ~ *executable + 0x3e pack16 ~ *Intel x86-64 + 1 pack32 ~ ELF format version + + L@' cold-start origin + pack64 ~ *entry point ~ This includes the origin, intentionally. - L' program-header use-label pack64 ~ program header offset + L@' elf-program-header pack64 ~ *program header offset ~ We place the program header immediately after the ELF header. This ~ offset is from the start of the file. 
- 0 pack64 ~ section header offset - 0 pack32 ~ processor flags - 64 pack16 ~ ELF header size - 56 pack16 ~ program header entry size - 1 pack16 ~ number of program header entries - 0 pack16 ~ section header entry size - 0 pack16 ~ number of section header entries - 0 pack16 ~ section name string table index - ; + 0 pack64 ~ section header offset + 0 pack32 ~ processor flags + + L@' elf-header-size pack16 ~ ELF header size + L@' elf-program-header-size pack16 ~ *program header entry size + 1 pack16 ~ *number of program header entries + 0 pack16 ~ section header entry size + 0 pack16 ~ number of section header entries + 0 pack16 ~ section name string table index + + ~ Though hardcoding the size of this header would work fine, it's easier + ~ to use the label system to keep track of its size. The only place this is + ~ actually referenced is right here in the header. + current-offset 4 roll - L!' elf-header-size ; + + +~ ~~~~~~~~~~~~~~~~~~~~~~~~ +~ ~~ ELF program header ~~ +~ ~~~~~~~~~~~~~~~~~~~~~~~~ +~ +~ Third, we output ELF's program header, which lists the memory regions +~ ("segments") we want to have and where we want them to come from. There may +~ be any number of these entries, one per segment, and they may be anywhere +~ in the file as long as they're consecutive. +~ +~ We list just a single region, which is the entire contents of the ELF file +~ from disk, and we put the program header immediately after the file header. +~ The program we're building should call this word as the second output it +~ generates. +~ +~ It would be more typical to use this header to ask the loader to give us +~ separate code and data segments, and perhaps a stack or heap, but this keeps +~ things simple, and we can create those things for ourselves later. +~ +~ We do have a little stack space available, though we don't explicitly +~ request any; the kernel allocates it for us as part of exec() so that it can +~ pass us argc and argv (which we ignore). 
That stack space will be at a +~ random address, different every time, because of ASLR; that's a neat +~ security feature, so we leave it as-is. Note that ASLR doesn't happen when +~ you run under gdb, so if you aren't seeing it, that's probably why. -~ ~~ -~ ~~ Program header -~ ~~ -~ ~~ An ELF program header consists of any number of these entries; they are -~ ~~ always consecutive, but may be anywhere in the file. We always have -~ ~~ exactly one, and it's always right after the ELF file header. ~ ~~ : elf-program-header - current-offset L' program-header set-label - 1 pack32 ~ "loadable" segment type - 0x05 pack32 ~ read+execute permission - 0 pack64 ~ offset in file - origin pack64 ~ virtual address + ~ * denotes mandatory fields according to breadbox + current-offset L!' elf-program-header + current-offset 3unroll + + 1 pack32 ~ *"loadable" segment type + 0x05 pack32 ~ *read+execute permission + 0 pack64 ~ *offset in file + origin pack64 ~ *virtual address ~ required, but can be anything, subject to alignment - 0 pack64 ~ physical address (ignored) + 0 pack64 ~ physical address (ignored) - L' total-size use-label pack64 ~ size in file - L' total-size use-label pack64 ~ size in memory + L@' total-size pack64 ~ *size in file + L@' total-size pack64 ~ *size in memory - 0 pack64 ~ segment alignment + 0 pack64 ~ segment alignment ~ for relocation, but this doesn't apply to us - ; + + ~ As with the file header, we use the label system to keep track of the + ~ program header's size. + current-offset 4 roll - L!' elf-program-header-size ; + +~ ~~~~~~~~~~~~~~~~ +~ ~~ That's it! ~~ +~ ~~~~~~~~~~~~~~~~ +~ +~ ELF is a simple format, really. Now you can output your own machine code +~ that you generate however you want; make sure to define the label +~ cold-start, which will be the first thing that runs. 
diff --git a/hello.e b/hello.e index 63f19c4..fcafa25 100644 --- a/hello.e +++ b/hello.e @@ -1,11 +1,11 @@ ~ cat labels.e elf.e hello.e | ./quine > hello && chmod 755 hello && ./hello : output-start-routine - current-offset L' start set-label + current-offset L!' cold-start 1 :rax mov-reg64-imm32 1 :rdi mov-reg64-imm64 - origin L' greeting use-label + :rsi mov-reg64-imm64 - L' greeting-size use-label :rdx mov-reg64-imm64 + origin L@' greeting + :rsi mov-reg64-imm64 + L@' greeting-size :rdx mov-reg64-imm64 syscall 60 :rax mov-reg64-imm32 0 :rdi mov-reg64-imm32 @@ -13,9 +13,9 @@ ; : output-greeting - current-offset dup L' greeting set-label 3unroll + current-offset dup L!' greeting 3unroll s" Hello, Irenes!" packstring - current-offset 4 roll - L' greeting-size set-label ; + current-offset 4 roll - L!' greeting-size ; ~ (output memory start, current output point ~ -- output memory start, current output point) @@ -27,7 +27,7 @@ elf-program-header output-start-routine output-greeting - current-offset L' total-size set-label + current-offset L!' total-size ; ' all-contents entry-to-execution-token label-loop diff --git a/labels.e b/labels.e index 6ece87e..be601a3 100644 --- a/labels.e +++ b/labels.e @@ -1,15 +1,63 @@ -~ current output point, string pointer +~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~ ~~ Machine label facility ~~ +~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +~ Compilers and assemblers always have a need to resolve symbolic names for +~ parts of their code to numeric values. Usually, these symbolic names are +~ called labels. +~ +~ It's a surprisingly deep problem. The output itself can end up depending +~ on the actual values, for a variety of reasons. For example, the size of +~ a relative-jump instruction may vary depending on how far away its target +~ is, which changes the position of everything after. 
Or, there may be a size +~ limit on some particular segment and code may have to be entirely +~ reorganized to make things fit, splitting it into pieces or adding +~ trampolines. +~ +~ Often, this winds up being a hidden layer with immense complexity, +~ implemented as part of a linker, that even compiler maintainers find obscure +~ and confusing. +~ +~ It's Irenes' position that we should engage with complexity rather than +~ encysting it. So, in Evocation we take a different approach, heavily +~ inspired by the semantics of a tool called flatassembler. We provide +~ downstream code with the basic operations set-label and use-label which it +~ can combine in whatever order it wants. We also provide a harness, +~ label-loop, which takes an execution token for the downstream code +~ generator and runs it over and over, computing the mathematical fixed point +~ of the label assignments. +~ +~ On the first pass, label-loop guesses a value of zero for any label that's +~ used before it's set. On subsequent passes, each label starts with the value +~ it had on the previous pass. When we have a pass where labels end with the +~ same value they started with, we say they've converged, and we announce +~ success. If a hundred passes go by without convergence, we fail instead. +~ +~ Most of the time, the shorthand words L@' and L!' will be all you need to +~ use from your own code. + +~ TODO this should go somewhere else :) +~ (current output point, string pointer --) : pack-raw-string { unpack8 dup } { 3roll swap pack8 swap } while drop drop ; -: origin 0x08000000 ; - -~ Status is a bit field: +~ The labels data structure is a linked list of dictionary entries, with +~ the same header format as the main Evocation dictionary, but instead of +~ being executable, each holds two words of data: status bits, and a value, +~ in that order. There's no "docol" pointer or anything of the sort. 
+~ +~ Status is a bit field: ~ bit zero is whether it's been used ~ bit one is whether it's been defined ~ bit two is whether it was used before being defined ~ bit three is whether the guessed value wound up equaling the actual value +~ +~ Just as the Evocation dictionary uses the global variable "latest" as a +~ handle (a pointer to a pointer) beginning a linked list of entries, so the +~ label dictionary also uses a handle variable, named "labels". +~ TODO we should just do this in immediate mode, but right now the word +~ "variable" steps on the same scratch space that s" uses, so we can't. : init-labels 8 allocate s" labels" variable 0 s" labels" find entry-to-execution-token execute ! @@ -17,6 +65,7 @@ ~ This needs to happen now because otherwise the word "labels" won't exist. init-labels +~ This is analogous to word-heading, but prints label information. ~ (entry pointer --) : label-heading dup entry-to-name dup emitstring space @@ -24,6 +73,7 @@ init-labels entry-to-execution-token dup 8 + @ .hex64 space @ .hex8 newline ; +~ TODO this should go elsewhere ~ (dictionary handle) : oldest-entry-in dup @@ -31,6 +81,7 @@ init-labels dup 3roll = { drop 0 } if ; +~ TODO this should go elsewhere ~ (entry pointer, dictionary handle) : next-newer-entry-in @ @@ -38,21 +89,42 @@ init-labels { dup { 2dup @ != } if } { @ } while swap drop ; +~ This prints the headings for all the labels that have been created. Note +~ that labels that have been created once stay in the dictionary forever, even +~ if subsequent passes neither use nor define them. That's because the control +~ loop has no way to know if they're still important; it's up to downstream +~ code to decide that. Everything has been carefully set up so that disused +~ labels don't hurt anything. 
: list-labels labels oldest-entry-in { dup } { dup label-heading labels next-newer-entry-in } while drop ; +~ This creates a new label given a name for it, initializing its value and +~ status to zero and adding it to the dictionary. This is responsible for the +~ initial guess of zero on the first pass. +~ ~ (name string pointer -- ) : new-label labels create-in 0 , 0 , ; +~ These helpers take a label entry pointer and return pointers to the status +~ and value fields. : label-status entry-to-execution-token ; : label-value entry-to-execution-token 8 + ; +~ This looks up a label by name if it exists, or creates it if it doesn't. +~ Either way, it returns an entry pointer. It's named after the function +~ "intern" that many Lisp dialects have. +~ +~ (name string pointer -- entry pointer) : intern-label dup labels swap find-in dup { swap drop } { drop dup new-label labels swap find-in } if-else ; +~ This returns the value of a label, also doing all necessary status checks +~ and updates to keep track of the circumstances under which it was used. The +~ label loop relies on all read accesses going through this word. +~ ~ (label entry pointer -- label value) : use-label ~ If it hasn't been defined yet, mark it used-before-set. @@ -65,6 +137,10 @@ init-labels label-value @ ; ; +~ This overwrites the value of a label, also doing all necessary status +~ checks and updates to keep track of the circumstances under which it was +~ set. The label loop relies on all write accesses going through this word. +~ ~ (new label value, label entry pointer --) : set-label ~ We always set the defined bit to true. We leave the other status bits @@ -84,10 +160,15 @@ init-labels label-value ! label-status ! ; +~ This is a convenience helper which downstream code can use to check how +~ many bytes it has output thus far. 
~ (output memory start, current output point ~ -- output memory start, current output point, offset) : current-offset 2dup swap - ; +~ This is a concise syntax for referencing the entry pointer of a label, +~ when you know its name statically. It reads a word of text input and calls +~ intern-label on it. : L' word value@ interpreter-flags @ 0x01 & @@ -98,6 +179,8 @@ init-labels { intern-label dropstring-with-result } if-else ; make-immediate +~ This is a concise syntax for calling use-label with a label whose name +~ you know statically. It performs L', then calls use-label. : L@' ' L' entry-to-execution-token execute interpreter-flags @ 1 & @@ -105,6 +188,8 @@ init-labels { use-label } if-else ; make-immediate +~ This is a concise syntax for calling set-label with a label whose name +~ you know statically. It performs L', then calls set-label. : L!' ' L' entry-to-execution-token execute interpreter-flags @ 1 & @@ -112,15 +197,20 @@ init-labels { set-label } if-else ; make-immediate - -~ For a label to have "converged", at least one of the following must be true: +~ This is an internal helper that label-loop uses to check if a specific +~ label entry has "converged", based on its status bits. Downstream code won't +~ need to call this directly. +~ +~ For a label to have "converged", at least one of the following must be +~ true: ~ ~ 1. The label must never have been used (bit zero clear); ~ 2. The label was both used and defined, but not used before it was defined ~ (bits zero and one set; bit two clear); ~ 3. The label was both used and defined, and the guessed value equalled the ~ actual value (bits zero, one, and three set). -~ (label entry pointer) +~ +~ (label entry pointer -- boolean) : check-label-converged label-status @ dup 0x01 & not swap @@ -128,17 +218,41 @@ init-labels 0x0b & 0x0b = || || ; +~ This is an internal helper that label-loop uses to check if the overall +~ label assignments have converged, based on their status bits. 
Downstream +~ code won't need to call this directly. +~ +~ This returns true if and only if check-label-converged returns true for +~ all the labels that have been created. +~ +~ (-- boolean) : check-labels-converged 1 labels @ { dup } { dup check-label-converged 3roll && swap @ } while drop ; +~ This is an internal helper that label-loop calls between passes, to update +~ the status bits and get ready for the next one. Downstream code won't need +~ to call this directly. +~ : reset-labels labels @ { dup } { dup label-status 0 swap ! @ } while drop ; +~ This is the top-level word that invokes the entire label system. All the +~ code generation happens inside it. +~ +~ The execution token it's passed should have the interface: +~ +~ (output start, current output pointer +~ -- output start, current output pointer) +~ +~ In general, this is the same interface that code generation words should +~ use to communicate with each other. For example, all the words in elf.e +~ use this interface. +~ ~ (execution token -- output start, output length) : label-loop 0 swap -- cgit 1.4.1
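
The fixed-point iteration that the labels.e comments describe (guess zero on the first pass, carry values forward, stop when every label has converged, give up after a hundred passes) can be sketched outside Evocation. The Python below is only an illustrative model, not the code in this patch: the `Labels` class, the `label_loop` and `generate` functions, and the exact rule for setting bit three (set when a newly assigned value equals the stored one) are assumptions made for the sketch. The status bit meanings and the three convergence conditions are taken from the comments above.

```python
# Illustrative model of the label fixed-point loop; names are hypothetical.
USED, DEFINED, USED_EARLY, GUESS_OK = 0x01, 0x02, 0x04, 0x08
MAX_PASSES = 100  # the comments say a hundred passes is the limit


class Labels:
    def __init__(self):
        self.value = {}   # name -> value, carried from pass to pass
        self.status = {}  # name -> status bits for the current pass

    def _intern(self, name):
        # A label seen for the first time gets the initial guess of zero.
        self.value.setdefault(name, 0)
        self.status.setdefault(name, 0)

    def use(self, name):
        self._intern(name)
        if not self.status[name] & DEFINED:
            self.status[name] |= USED_EARLY  # used before being defined
        self.status[name] |= USED
        return self.value[name]

    def set(self, name, new_value):
        self._intern(name)
        # Assumption: bit three records whether the stored guess was right.
        if self.value[name] == new_value:
            self.status[name] |= GUESS_OK
        else:
            self.status[name] &= ~GUESS_OK
        self.status[name] |= DEFINED
        self.value[name] = new_value

    def converged(self):
        def ok(s):
            never_used = not s & USED
            used_after_defined = (s & 0x07) == 0x03
            guess_matched = (s & 0x0B) == 0x0B
            return never_used or used_after_defined or guess_matched
        return all(ok(s) for s in self.status.values())

    def reset(self):
        # Clear status bits between passes; values are kept.
        for name in self.status:
            self.status[name] = 0


def label_loop(generate):
    labels = Labels()
    for pass_number in range(1, MAX_PASSES + 1):
        labels.reset()
        output = generate(labels)
        if labels.converged():
            return output, pass_number
    raise RuntimeError("labels did not converge")


def generate(labels):
    out = bytearray()
    # Forward reference: total-size is used here but set at the end.
    out += labels.use("total-size").to_bytes(8, "little")
    labels.set("greeting", len(out))
    out += b"Hello"
    labels.set("total-size", len(out))
    return bytes(out)


image, passes = label_loop(generate)
```

In this model the forward reference to total-size is wrong on the first pass and correct on the second, which mirrors how elf-program-header can use total-size before all-contents in hello.e sets it.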