-rw-r--r--  elf.e     159
-rw-r--r--  hello.e    12
-rw-r--r--  labels.e  128
3 files changed, 242 insertions, 57 deletions
diff --git a/elf.e b/elf.e
index 4224b50..c38f740 100644
--- a/elf.e
+++ b/elf.e
@@ -1,57 +1,128 @@
-~ ~~
-~ ~~ ELF header
-~ ~~
-~ ~~   This is the top-level ELF header, for the entire file. An ELF always
-~ ~~ has exactly one of this header, which is always at the start of the file.
-~ ~~
+~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~ ~~ Executable file format ~~
+~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~
+~   Before we do anything specific to the actual program we're building, we
+~ do a lot of ELF-specific stuff to ensure that our output is in a format
+~ Linux knows how to run.
+~
+~   This relies on the label facility defined in labels.e. Make sure to load
+~ that first.
+
+~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~ ~~ Runtime memory origin ~~
+~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~
+~   First, we pick an origin to load at. This is arbitrary, but it can't be
+~ zero. We define a constant word for it so the body of the program can use
+~ it in label calculations in whatever ways it needs to.
+
+: origin 0x08000000 ;
+
+
+~ ~~~~~~~~~~~~~~~~~~~~~
+~ ~~ ELF file header ~~
+~ ~~~~~~~~~~~~~~~~~~~~~
+~
+~   Second, we output ELF's top-level file header. This header describes the
+~ entire file. An ELF file always has exactly one such header, which is always
+~ at the start of the file.
+~
+~   The program we're building should call this word as the first output it
+~ generates.
+~
+~   The only interesting thing here is the entry pointer.
+
 : elf-file-header
-  0x7f pack8 s" ELF" pack-raw-string        ~ magic number
-  2 pack8                                   ~ 64-bit
-  1 pack8                                   ~ little-endian
-  1 pack8                                   ~ ELF header format v1
-  0 pack8                                   ~ System-V ABI
-  0 pack64                                  ~ (padding)
-
-  2 pack16                                  ~ executable
-  0x3e pack16                               ~ Intel x86-64
-  1 pack32                                  ~ ELF format version
-
-  L' start use-label origin + pack64        ~ entry point
+  ~ * denotes mandatory fields according to breadbox
+  current-offset 3unroll
+
+  0x7f pack8 s" ELF" pack-raw-string    ~ *magic number
+  2 pack8                               ~ 64-bit
+  1 pack8                               ~ little-endian
+  1 pack8                               ~ ELF header format v1
+  0 pack8                               ~ System-V ABI
+  0 pack64                              ~ (padding)
+
+  2 pack16                              ~ *executable
+  0x3e pack16                           ~ *Intel x86-64
+  1 pack32                              ~ ELF format version
+
+  L@' cold-start origin + pack64        ~ *entry point
     ~ This includes the origin, intentionally.
 
-  L' program-header use-label pack64        ~ program header offset
+  L@' elf-program-header pack64         ~ *program header offset
     ~ We place the program header immediately after the ELF header. This
     ~ offset is from the start of the file.
-  0 pack64                                  ~ section header offset
-  0 pack32                                  ~ processor flags
-  64 pack16                                 ~ ELF header size
-  56 pack16                                 ~ program header entry size
-  1 pack16                                  ~ number of program header entries
-  0 pack16                                  ~ section header entry size
-  0 pack16                                  ~ number of section header entries
-  0 pack16                                  ~ section name string table index
-  ;
+  0 pack64                              ~ section header offset
+  0 pack32                              ~ processor flags
+
+  L@' elf-header-size pack16            ~ ELF header size
+  L@' elf-program-header-size pack16     ~ *program header entry size
+  1 pack16                              ~ *number of program header entries
+  0 pack16                              ~ section header entry size
+  0 pack16                              ~ number of section header entries
+  0 pack16                              ~ section name string table index
+
+  ~   Though hardcoding the size of this header would work fine, it's easier
+  ~ to use the label system to keep track of its size. The only place this is
+  ~ actually referenced is right here in the header.
+  current-offset 4 roll - L!' elf-header-size ;
+
+
+~ ~~~~~~~~~~~~~~~~~~~~~~~~
+~ ~~ ELF program header ~~
+~ ~~~~~~~~~~~~~~~~~~~~~~~~
+~
+~   Third, we output ELF's program header, which lists the memory regions
+~ ("segments") we want to have and where we want them to come from. There may
+~ be any number of these entries, one per segment, and they may be anywhere
+~ in the file as long as they're consecutive.
+~
+~   We list just a single region, which is the entire contents of the ELF file
+~ from disk, and we put the program header immediately after the file header.
+~ The program we're building should call this word as the second output it
+~ generates.
+~
+~   It would be more typical to use this header to ask the loader to give us
+~ separate code and data segments, and perhaps a stack or heap, but this keeps
+~ things simple, and we can create those things for ourselves later.
+~
+~   We do have a little stack space available, though we don't explicitly
+~ request any; the kernel allocates it for us as part of exec() so that it can
+~ pass us argc and argv (which we ignore). That stack space will be at a
+~ random address, different every time, because of ASLR; that's a neat
+~ security feature, so we leave it as-is. Note that ASLR doesn't happen when
+~ you run under gdb, so if you aren't seeing it, that's probably why.
 
-~ ~~
-~ ~~ Program header
-~ ~~
-~ ~~   An ELF program header consists of any number of these entries; they are
-~ ~~ always consecutive, but may be anywhere in the file. We always have
-~ ~~ exactly one, and it's always right after the ELF file header.
 ~ ~~
 : elf-program-header
-  current-offset L' program-header set-label
-  1 pack32                                  ~ "loadable" segment type
-  0x05 pack32                               ~ read+execute permission
-  0 pack64                                  ~ offset in file
-  origin pack64                             ~ virtual address
+  ~ * denotes mandatory fields according to breadbox
+  current-offset L!' elf-program-header
+  current-offset 3unroll
+
+  1 pack32                              ~ *"loadable" segment type
+  0x05 pack32                           ~ *read+execute permission
+  0 pack64                              ~ *offset in file
+  origin pack64                         ~ *virtual address
     ~ required, but can be anything, subject to alignment
-  0 pack64                                  ~ physical address (ignored)
+  0 pack64                              ~ physical address (ignored)
 
-  L' total-size use-label pack64            ~ size in file
-  L' total-size use-label pack64            ~ size in memory
+  L@' total-size pack64                 ~ *size in file
+  L@' total-size pack64                 ~ *size in memory
 
-  0 pack64                                  ~ segment alignment
+  0 pack64                              ~ segment alignment
     ~ for relocation, but this doesn't apply to us
-  ;
+
+  ~   As with the file header, we use the label system to keep track of the
+  ~ program header's size.
+  current-offset 4 roll - L!' elf-program-header-size ;
+
+~ ~~~~~~~~~~~~~~~~
+~ ~~ That's it! ~~
+~ ~~~~~~~~~~~~~~~~
+~
+~   ELF is a simple format, really.  Now you can output your own machine code
+~ that you generate however you want; make sure to define the label
+~ cold-start, which will be the first thing that runs.
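
For readers who don't know Evocation, the byte layout that elf-file-header and elf-program-header emit can be sketched in Python. This is an illustration only, not part of the commit: the function names and the `entry_offset` and `total_size` parameters are hypothetical stand-ins for the cold-start and total-size labels, and the sizes 64 and 56 are the constants the previous revision hardcoded.

```python
import struct

ORIGIN = 0x08000000  # same value as the `origin` word in elf.e


def elf_file_header(entry_offset, phoff=64, ehsize=64, phentsize=56):
    """Pack a 64-bit little-endian ELF file header (hypothetical helper).

    entry_offset stands in for the cold-start label: it is a file offset,
    and the origin is added to turn it into a runtime address.
    """
    ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)  # magic, class, data, version, ABI, padding
    rest = struct.pack(
        "<HHIQQQIHHHHHH",
        2,                      # executable
        0x3E,                   # Intel x86-64
        1,                      # ELF format version
        ORIGIN + entry_offset,  # entry point, origin included
        phoff,                  # program header offset, from start of file
        0,                      # section header offset
        0,                      # processor flags
        ehsize,                 # ELF header size
        phentsize,              # program header entry size
        1,                      # number of program header entries
        0, 0, 0,                # section header fields, all unused
    )
    return ident + rest


def elf_program_header(total_size):
    """Pack the single loadable-segment entry (hypothetical helper)."""
    return struct.pack(
        "<IIQQQQQQ",
        1,           # "loadable" segment type
        0x05,        # read+execute permission
        0,           # offset in file
        ORIGIN,      # virtual address
        0,           # physical address (ignored)
        total_size,  # size in file
        total_size,  # size in memory
        0,           # segment alignment
    )
```

The two packed sizes come out to 64 and 56 bytes, which is what the elf-header-size and elf-program-header-size labels should settle on once the label loop converges.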
 
diff --git a/hello.e b/hello.e
index 63f19c4..fcafa25 100644
--- a/hello.e
+++ b/hello.e
@@ -1,11 +1,11 @@
 ~ cat labels.e elf.e hello.e | ./quine > hello && chmod 755 hello && ./hello
 
 : output-start-routine
-  current-offset L' start set-label
+  current-offset L!' cold-start
   1 :rax mov-reg64-imm32
   1 :rdi mov-reg64-imm64
-  origin L' greeting use-label + :rsi mov-reg64-imm64
-  L' greeting-size use-label :rdx mov-reg64-imm64
+  origin L@' greeting + :rsi mov-reg64-imm64
+  L@' greeting-size :rdx mov-reg64-imm64
   syscall
   60 :rax mov-reg64-imm32
   0 :rdi mov-reg64-imm32
@@ -13,9 +13,9 @@
   ;
 
 : output-greeting
-  current-offset dup L' greeting set-label 3unroll
+  current-offset dup L!' greeting 3unroll
   s" Hello, Irenes!" packstring
-  current-offset 4 roll - L' greeting-size set-label ;
+  current-offset 4 roll - L!' greeting-size ;
 
 ~ (output memory start, current output point
 ~  -- output memory start, current output point)
@@ -27,7 +27,7 @@
   elf-program-header
   output-start-routine
   output-greeting
-  current-offset L' total-size set-label
+  current-offset L!' total-size
   ;
 
 ' all-contents entry-to-execution-token label-loop
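
hello.e reads greeting and greeting-size with L@' before it sets them with L!', which only works because label-loop reruns the generator until the label values stop changing. A minimal Python sketch of that fixed-point idea follows. It is an assumption-laden model, not the Evocation implementation: the `Labels` class, `label_loop`, and `hello` are hypothetical names, convergence is checked by comparing values rather than status bits, and the greeting is packed as raw bytes for simplicity.

```python
class Labels:
    """Hypothetical stand-in for the label dictionary."""

    def __init__(self):
        self.values = {}

    def use(self, name):
        # Like L@': a label that has never been seen gets a guess of zero.
        return self.values.setdefault(name, 0)

    def set(self, name, value):
        # Like L!': overwrite the label's value.
        self.values[name] = value


def label_loop(generate, max_passes=100):
    """Rerun generate until a pass leaves every label unchanged."""
    labels = Labels()
    for pass_number in range(1, max_passes + 1):
        before = dict(labels.values)
        output = generate(labels)
        if labels.values == before:
            return output, pass_number
    raise RuntimeError("labels did not converge")


def hello(labels):
    """Toy generator shaped like hello.e: uses labels before setting them."""
    out = bytearray()
    out += labels.use("greeting").to_bytes(8, "little")
    out += labels.use("greeting-size").to_bytes(8, "little")
    start = len(out)
    labels.set("greeting", start)
    out += b"Hello, Irenes!"
    labels.set("greeting-size", len(out) - start)
    labels.set("total-size", len(out))
    return bytes(out)


output, passes = label_loop(hello)
```

In this toy the first pass emits zeros for both forward references and the second pass emits the real values without changing any label, so the loop stops after two passes.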
diff --git a/labels.e b/labels.e
index 6ece87e..be601a3 100644
--- a/labels.e
+++ b/labels.e
@@ -1,15 +1,63 @@
-~ current output point, string pointer
+~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~ ~~ Machine label facility ~~
+~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+~   Compilers and assemblers always have a need to resolve symbolic names for
+~ parts of their code to numeric values. Usually, these symbolic names are
+~ called labels.
+~
+~   It's a surprisingly deep problem. The output itself can end up depending
+~ on the actual values, for a variety of reasons. For example, the size of
+~ a relative-jump instruction may vary depending on how far away its target
+~ is, which changes the position of everything after. Or, there may be a size
+~ limit on some particular segment and code may have to be entirely
+~ reorganized to make things fit, splitting it into pieces or adding
+~ trampolines.
+~
+~   Often, this winds up being a hidden layer with immense complexity,
+~ implemented as part of a linker, that even compiler maintainers find obscure
+~ and confusing.
+~
+~   It's Irenes' position that we should engage with complexity rather than
+~ encysting it. So, in Evocation we take a different approach, heavily
+~ inspired by the semantics of a tool called flatassembler. We provide
+~ downstream code with the basic operations set-label and use-label which it
+~ can combine in whatever order it wants. We also provide a harness,
+~ label-loop, which takes an execution token for the downstream code
+~ generator and runs it over and over, computing the mathematical fixed point
+~ of the label assignments.
+~
+~   On the first pass, label-loop guesses a value of zero for any label that's
+~ used before it's set. On subsequent passes, each label starts with the value
+~ it had on the previous pass. When we have a pass where labels end with the
+~ same value they started with, we say they've converged, and we announce
+~ success. If a hundred passes go by without convergence, we fail instead.
+~
+~   Most of the time, the shorthand words L@' and L!' will be all you need to
+~ use from your own code.
+
+~ TODO this should go somewhere else :)
+~ (current output point, string pointer --)
 : pack-raw-string
   { unpack8 dup } { 3roll swap pack8 swap } while drop drop ;
 
-: origin 0x08000000 ;
-
-~ Status is a bit field:
+~   The labels data structure is a linked list of dictionary entries, with
+~ the same header format as the main Evocation dictionary, but instead of
+~ being executable, each holds two words of data: status bits, and a value,
+~ in that order. There's no "docol" pointer or anything of the sort.
+~
+~   Status is a bit field:
 ~    bit zero is whether it's been used
 ~    bit one is whether it's been defined
 ~    bit two is whether it was used before being defined
 ~    bit three is whether the guessed value wound up equaling the actual value
+~
+~   Just as the Evocation dictionary uses the global variable "latest" as a
+~ handle (a pointer to a pointer) beginning a linked list of entries, so the
+~ label dictionary also uses a handle variable, named "labels".
 
+~ TODO we should just do this in immediate mode, but right now the word
+~ "variable" steps on the same scratch space that s" uses, so we can't.
 : init-labels
   8 allocate s" labels" variable
   0 s" labels" find entry-to-execution-token execute !
@@ -17,6 +65,7 @@
 ~ This needs to happen now because otherwise the word "labels" won't exist.
 init-labels
 
+~   This is analogous to word-heading, but prints label information.
 ~ (entry pointer --)
 : label-heading
   dup entry-to-name dup emitstring space
@@ -24,6 +73,7 @@ init-labels
   entry-to-execution-token dup 8 + @ .hex64 space @ .hex8
   newline ;
 
+~ TODO this should go elsewhere
 ~ (dictionary handle)
 : oldest-entry-in
   dup
@@ -31,6 +81,7 @@ init-labels
   dup 3roll = { drop 0 } if
   ;
 
+~ TODO this should go elsewhere
 ~ (entry pointer, dictionary handle)
 : next-newer-entry-in
   @
@@ -38,21 +89,42 @@ init-labels
   { dup { 2dup @ != } if }
   { @ } while swap drop ;
 
+~   This prints the headings for all the labels that have been created. Note
+~ that labels that have been created once stay in the dictionary forever, even
+~ if subsequent passes neither use nor define them. That's because the control
+~ loop has no way to know if they're still important; it's up to downstream
+~ code to decide that. Everything has been carefully set up so that disused
+~ labels don't hurt anything.
 : list-labels
   labels oldest-entry-in { dup }
   { dup label-heading labels next-newer-entry-in } while drop ;
 
+~   This creates a new label given a name for it, initializing its value and
+~ status to zero and adding it to the dictionary. This is responsible for the
+~ initial guess of zero on the first pass.
+~
 ~ (name string pointer -- )
 : new-label labels create-in 0 , 0 , ;
 
+~   These helpers take a label entry pointer and return pointers to the status
+~ and value fields.
 : label-status entry-to-execution-token ;
 : label-value entry-to-execution-token 8 + ;
 
+~   This looks up a label by name if it exists, or creates it if it doesn't.
+~ Either way, it returns an entry pointer. It's named after the function
+~ "intern" that many Lisp dialects have.
+~
+~ (name string pointer -- entry pointer)
 : intern-label
   dup labels swap find-in
   dup { swap drop }
       { drop dup new-label labels swap find-in } if-else ;
 
+~   This returns the value of a label, also doing all necessary status checks
+~ and updates to keep track of the circumstances under which it was used. The
+~ label loop relies on all read accesses going through this word.
+~
 ~ (label entry pointer -- label value)
 : use-label
   ~ If it hasn't been defined yet, mark it used-before-set.
@@ -65,6 +137,10 @@ init-labels
   label-value @ ;
   ;
 
+~   This overwrites the value of a label, also doing all necessary status
+~ checks and updates to keep track of the circumstances under which it was
+~ set. The label loop relies on all write accesses going through this word.
+~
 ~ (new label value, label entry pointer --)
 : set-label
   ~ We always set the defined bit to true. We leave the other status bits
@@ -84,10 +160,15 @@ init-labels
   label-value ! label-status !
   ;
 
+~   This is a convenience helper which downstream code can use to check how
+~ many bytes it has output thus far.
 ~ (output memory start, current output point
 ~  -- output memory start, current output point, offset)
 : current-offset 2dup swap - ;
 
+~   This is a concise syntax for referencing the entry pointer of a label,
+~ when you know its name statically. It reads a word of text input and calls
+~ intern-label on it.
 : L'
   word value@
   interpreter-flags @ 0x01 &
@@ -98,6 +179,8 @@ init-labels
   { intern-label dropstring-with-result } if-else
   ; make-immediate
 
+~   This is a concise syntax for calling use-label with a label whose name
+~ you know statically. It performs L', then calls use-label.
 : L@'
   ' L' entry-to-execution-token execute
   interpreter-flags @ 1 &
@@ -105,6 +188,8 @@ init-labels
     { use-label } if-else
   ; make-immediate
 
+~   This is a concise syntax for calling set-label with a label whose name
+~ you know statically. It performs L', then calls set-label.
 : L!'
   ' L' entry-to-execution-token execute
   interpreter-flags @ 1 &
@@ -112,15 +197,20 @@ init-labels
     { set-label } if-else
   ; make-immediate
 
-
-~ For a label to have "converged", at least one of the following must be true:
+~   This is an internal helper that label-loop uses to check if a specific
+~ label entry has "converged", based on its status bits. Downstream code won't
+~ need to call this directly.
+~
+~   For a label to have "converged", at least one of the following must be
+~ true:
 ~
 ~ 1. The label must never have been used (bit zero clear);
 ~ 2. The label was both used and defined, but not used before it was defined
 ~    (bits zero and one set; bit two clear);
 ~ 3. The label was both used and defined, and the guessed value equalled the
 ~    actual value (bits zero, one, and three set).
-~ (label entry pointer)
+~
+~ (label entry pointer -- boolean)
 : check-label-converged
   label-status @
   dup 0x01 & not swap
@@ -128,17 +218,41 @@ init-labels
   0x0b & 0x0b =
   || || ;
 
+~   This is an internal helper that label-loop uses to check if the overall
+~ label assignments have converged, based on their status bits. Downstream
+~ code won't need to call this directly.
+~
+~   This returns true if and only if check-label-converged returns true for
+~ all the labels that have been created.
+~
+~ (-- boolean)
 : check-labels-converged
   1
   labels @ { dup }
   { dup check-label-converged 3roll && swap
     @ } while drop ;
 
+~   This is an internal helper that label-loop calls between passes, to update
+~ the status bits and get ready for the next one. Downstream code won't need
+~ to call this directly.
+~
 : reset-labels
   labels @ { dup }
   { dup label-status 0 swap !
     @ } while drop ;
 
+~   This is the top-level word that invokes the entire label system. All the
+~ code generation happens inside it.
+~
+~   The execution token it's passed should have the interface:
+~
+~     (output start, current output pointer
+~      -- output start, current output pointer)
+~
+~   In general, this is the same interface that code generation words should
+~ use to communicate with each other. For example, all the words in elf.e
+~ use this interface.
+~
 ~ (execution token -- output start, output length)
 : label-loop
   0 swap