~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ ~~ Assembly language for the AMD64 architecture ~~
~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~
~   This is also often called the x86-64 architecture, but Intel didn't
~ invent it (they had their chance) and there's no reason to name it after
~ their product line. We have a bunch of assembler words that, taken as a
~ whole, form a sort of assembly language inside of the Forth-style language.
~
~   It's all backwards and stuff.
~
~   Okay, but seriously, the convention is: target on the top of the stack,
~ source behind it. This is similar to how the Forth "!" and "@" words work.
~
~   These routines use the binary packing routines such as pack64, defined in
~ core.e. They're called in the same way: an output address which we call the
~ "output point", followed by data items specific to what's being output. They
~ also chain together in the same way, returning the updated output point.
~
~   TODO cite the Intel reference manual here and explain the notation used
~ for the section citations below
~
~   TODO define instructions, assembly code, machine code, opcodes. if we ever
~ also want to recommend a childrens' introduction to binary, this might be
~ the place to do it.


~ Keywords
~ ~~~~~~~~
~
~   We define a bunch of keywords, which evaluate to their own codeword
~ addresses. We use these to refer to registers and condition codes by name.
~
~
~ On registers
~ ~~~~~~~~~~~~
~
~   The x86 architecture has been around a while, it has been through
~ several transitions from smaller word sizes to larger ones. Therefore it
~ has different names for the "same" registers, depending on how much of
~ them you're using.
~
~ TODO there's more to write here

~   The names of the 64-bit registers. The second half of these are considered
~ "extended" registers because they don't correspond to 32-bit registers in
~ the way the first eight do.
s" :rax" keyword
s" :rcx" keyword
s" :rdx" keyword
s" :rbx" keyword
s" :rsp" keyword
s" :rbp" keyword
s" :rsi" keyword
s" :rdi" keyword
s" :r8" keyword
s" :r9" keyword
s" :r10" keyword
s" :r11" keyword
s" :r12" keyword
s" :r13" keyword
s" :r14" keyword
s" :r15" keyword

~   The names of the 32-bit registers. The processor treats these as being
~ alternate names for the low halves of the 64-bit registers. There is a
~ very finicky distinction about what that means in different settings: Some
~ instructions operate on a 32-bit source or target, while others merely
~ accept a 32-bit value that gets sign-extended to 64 bits. We've taken pains
~ to clarify these cases in the instruction-specific notes, as they come up.
s" :eax" keyword
s" :ecx" keyword
s" :edx" keyword
s" :ebx" keyword
s" :esp" keyword
s" :ebp" keyword
s" :esi" keyword
s" :edi" keyword

~   The names of the 16-bit registers. Similarly, the processor treats these
~ as being alternate names for the low halves of the 32-bit registers.
s" :ax" keyword
s" :cx" keyword
s" :dx" keyword
s" :bx" keyword
s" :sp" keyword
s" :bp" keyword
s" :si" keyword
s" :di" keyword

~   The names of the 8-bit registers. The pattern here is a little bit
~ different; these come in "low" and "high" pairs, where for example :al is
~ the low half of :ax and :ah is the high half. Yes, this architecture grows
~ like a tree, with all the old things being still present, surrounded in the
~ new ones.
s" :al" keyword
s" :cl" keyword
s" :dl" keyword
s" :bl" keyword
s" :ah" keyword
s" :ch" keyword
s" :dh" keyword
s" :bh" keyword

~   The condition codes. Yes, there sure is a lot of duplication in these
~ names. The names are based on Intel's documented mnemonics...
~
~   "Above" and "below" are for unsigned comparisons. "Greater" and "less"  are
~ for signed comparisons.
~
~   This is documented on the individual opcode pages, and also in B.1.4.7.
s" :cc-overflow" keyword
s" :cc-no-overflow" keyword
s" :cc-below" keyword
s" :cc-above-equal" keyword
s" :cc-equal" keyword
s" :cc-not-equal" keyword
s" :cc-below-equal" keyword
s" :cc-above" keyword
s" :cc-sign" keyword
s" :cc-not-sign" keyword
s" :cc-even" keyword
s" :cc-odd" keyword
s" :cc-less" keyword
s" :cc-greater-equal" keyword
s" :cc-less-equal" keyword
s" :cc-greater" keyword


~ Bits and pieces
~ ~~~~~~~~~~~~~~~
~
~   Here, we have a bunch of helpers which generate specific encoded
~ representations that are part of many instructions. We start with the
~ trivial ones that handle individual fields, then work up to combinations of
~ fields.
~
~   When we say that a word accepts a register as a parameter, what we mean
~ is it accepts the name keyword for that register. When we say that a word
~ accepts a scale factor, what we mean is that it accepts a byte count for
~ that scale factor. In the cases where we mean the encoded form, we'll say
~ "encoded value" or "value".
~
~ TODO surely we can find a way to have real flow-control words

~ (register -- 3-bit encoded value for register)
: reg64
  ~   In counting the words for the branches, notice that each integer literal
  ~ is two words.
  dup :rax = 0branch [ 5 8 * , ] drop 0 exit
  dup :rcx = 0branch [ 5 8 * , ] drop 1 exit
  dup :rdx = 0branch [ 5 8 * , ] drop 2 exit
  dup :rbx = 0branch [ 5 8 * , ] drop 3 exit
  dup :rsp = 0branch [ 5 8 * , ] drop 4 exit
  dup :rbp = 0branch [ 5 8 * , ] drop 5 exit
  dup :rsi = 0branch [ 5 8 * , ] drop 6 exit
  dup :rdi = 0branch [ 5 8 * , ] drop 7 exit
  ." Parameter to reg64 is not a reg64." 1 sys-exit ;

~ (register -- 3-bit encoded value for register)
: extrareg64
  dup :r8 = 0branch [ 5 8 * , ] drop 0 exit
  dup :r9 = 0branch [ 5 8 * , ] drop 1 exit
  dup :r10 = 0branch [ 5 8 * , ] drop 2 exit
  dup :r11 = 0branch [ 5 8 * , ] drop 3 exit
  dup :r12 = 0branch [ 5 8 * , ] drop 4 exit
  dup :r13 = 0branch [ 5 8 * , ] drop 5 exit
  dup :r14 = 0branch [ 5 8 * , ] drop 6 exit
  dup :r15 = 0branch [ 5 8 * , ] drop 7 exit
  ." Parameter to extrareg64 is not an extrareg64." 1 sys-exit ;

~ (register -- 3-bit encoded value for register)
: reg32
  dup :eax = 0branch [ 5 8 * , ] drop 0 exit
  dup :ecx = 0branch [ 5 8 * , ] drop 0 exit
  dup :edx = 0branch [ 5 8 * , ] drop 0 exit
  dup :ebx = 0branch [ 5 8 * , ] drop 0 exit
  dup :esp = 0branch [ 5 8 * , ] drop 0 exit
  dup :ebp = 0branch [ 5 8 * , ] drop 0 exit
  dup :esi = 0branch [ 5 8 * , ] drop 0 exit
  dup :edi = 0branch [ 5 8 * , ] drop 0 exit
  ." Parameter to reg32 is not a reg32." 1 sys-exit ;

~ (register -- 3-bit encoded value for register)
: reg16
  dup :ax = 0branch [ 5 8 * , ] drop 0 exit
  dup :cx = 0branch [ 5 8 * , ] drop 1 exit
  dup :dx = 0branch [ 5 8 * , ] drop 2 exit
  dup :bx = 0branch [ 5 8 * , ] drop 3 exit
  dup :sp = 0branch [ 5 8 * , ] drop 4 exit
  dup :bp = 0branch [ 5 8 * , ] drop 5 exit
  dup :si = 0branch [ 5 8 * , ] drop 6 exit
  dup :di = 0branch [ 5 8 * , ] drop 7 exit
  ." Parameter to reg16 is not a reg16." 1 sys-exit ;

~ (register -- 3-bit encoded value for register)
: reg8
  dup :al = 0branch [ 5 8 * , ] drop 0 exit
  dup :cl = 0branch [ 5 8 * , ] drop 1 exit
  dup :dl = 0branch [ 5 8 * , ] drop 2 exit
  dup :bl = 0branch [ 5 8 * , ] drop 3 exit
  dup :ah = 0branch [ 5 8 * , ] drop 4 exit
  dup :ch = 0branch [ 5 8 * , ] drop 5 exit
  dup :dh = 0branch [ 5 8 * , ] drop 6 exit
  dup :bh = 0branch [ 5 8 * , ] drop 7 exit
  ." Parameter to reg8 is not a reg8." 1 sys-exit ;


~   There's a packed format called the SIB byte, which we'll get to in a
~ second. One of its bitfields is called the scale field. This word produces
~ an encoded value for that field.
~
~   The input value is a byte count; the output value is suitable for use in
~ the SIB byte.
~
~ (scale factor -- 2-bit encoded value)
: scalefield
  dup 1 = 0branch [ 5 8 * , ] drop 0 exit
  dup 2 = 0branch [ 5 8 * , ] drop 1 exit
  dup 5 = 0branch [ 5 8 * , ] drop 2 exit
  dup 8 = 0branch [ 5 8 * , ] drop 3 exit
  ." Parameter to scalefield is not 1, 2, 4, or 8." 1 sys-exit ;


~ [Intel] volume 2D, appendix B, section B-1.4.7, table B-10. Also see the
~ individual opcode pages.
~
~   Every instruction has an "opcode", a specific byte or sequence of bytes
~ which uniquely identifies the combination of operation, addressing mode,
~ and certain miscellaneous characteristics. This is not just another way of
~ referring to the entire sequence of bytes corresponding to the instruction;
~ the opcode is a specific part within that, as distinct from ie. the rex
~ byte, the SIB byte, the Mod/RM byte, and various immediate values and other
~ rare tidbits.
~
~   Some of these opcodes have bitfields within them, to specify condition
~ codes. This word produces an encoded value for that condition-code field.
~
~ (condition -- 4-bit encoded value)
: condition-code
  dup :cc-overflow = 0branch [ 5 8 * , ] drop 0 exit
  dup :cc-no-overflow = 0branch [ 5 8 * , ] drop 1 exit
  dup :cc-below = 0branch [ 5 8 * , ] drop 2 exit
  dup :cc-above-equal = 0branch [ 5 8 * , ] drop 3 exit
  dup :cc-equal = 0branch [ 5 8 * , ] drop 4 exit
  dup :cc-not-equal = 0branch [ 5 8 * , ] drop 5 exit
  dup :cc-below-equal = 0branch [ 5 8 * , ] drop 6 exit
  dup :cc-above = 0branch [ 5 8 * , ] drop 7 exit
  dup :cc-sign = 0branch [ 5 8 * , ] drop 8 exit
  dup :cc-not-sign = 0branch [ 5 8 * , ] drop 9 exit
  dup :cc-even = 0branch [ 5 8 * , ] drop 10 exit
  dup :cc-odd = 0branch [ 5 8 * , ] drop 11 exit
  dup :cc-less = 0branch [ 5 8 * , ] drop 12 exit
  dup :cc-greater-equal = 0branch [ 5 8 * , ] drop 13 exit
  dup :cc-less-equal = 0branch [ 5 8 * , ] drop 14 exit
  dup :cc-greater = 0branch [ 5 8 * , ] drop 15 exit
  ." Parameter to condition-code is not a condition code." 1 sys-exit ;


~   The "rex" byte appears before an opcode to modify its behavior in various
~ ways. It has four distinct bits within it, leading to sixteen variations,
~ as you can see.
~
~   The way these are all spelled out like this is slightly ridiculous, there
~ must be a better way. We only ever use rex-w and rex-wb, so it's tempting to
~ get rid of the rest, but they're worth having so that our future selves
~ don't have to revisit this topic.
~
~ (output point -- output point)
: rex-0 0x40 pack8 ;
: rex-w 0x48 pack8 ;
: rex-r 0x44 pack8 ;
: rex-x 0x42 pack8 ;
: rex-b 0x41 pack8 ;
: rex-wr 0x4C pack8 ;
: rex-wx 0x4A pack8 ;
: rex-wb 0x49 pack8 ;
: rex-rx 0x46 pack8 ;
: rex-rb 0x45 pack8 ;
: rex-xb 0x43 pack8 ;
: rex-wrx 0x4E pack8 ;
: rex-wrb 0x4D pack8 ;
: rex-wxb 0x4B pack8 ;
: rex-rxb 0x47 pack8 ;
: rex-wrxb 0x4F pack8 ;


~   Some opcodes use their low three bits as a field to give a register name.
~ This is usually in addition to a register name given in a Mod/RM byte,
~ serving a different role for the instruction.
~
~   This word accepts an opcode byte with those three bits clear, and combines
~ it with a register value, then outputs the resulting byte. Each opcode
~ accepts some specific kind of register; to allow different kinds, here we
~ expect the step of converting the register name to the encoded bits to have
~ already been done.
~
~ (output point, 3-bit encoded value for register, opcode byte
~  -- output point)
: opcodereg | pack8 ;


~   Some opcodes use their low four bits as a field to give a condition code.
~ This word accepts an opcode byte with those four bits clear, and combines it
~ with a condition code value, then outputs the resulting byte. For
~ consistency with opcodereg, we expect the step of converting the condition
~ code name to the encoded bits to have already been done.
~
~ (output point, 4-bit encoded value for condition code, opcode byte
~  -- output point)
: opcodecc | pack8 ;


~   A Mod/RM byte ("mode / register-or-memory") is part of the encoding of
~ many instructions. It's divided into three fields: "mod" (mode),
~ register/opcode, and register/memory ("RM").
~
~   This word outputs a Mod/RM byte given fully-processed, numeric values for
~ its fields. Most code will want to call one of the higher-level
~ addressing-* words, instead.
~
~ (output point, mod field, register/opcode field, register/memory field
~  -- output point)
: modrm swap 8 * | swap 64 * | pack8 ;

~   An SIB byte ("scale, index, base") is part of the encoding of many
~ instructions. It's divided into three fields, with the names you've already
~ guessed.
~
~   This word outputs an SIB byte given fully-processed, numeric values for
~ its fields.
~
~ (output point, scale field, index field, base field -- output point)
: sib swap 8 * | swap 64 * | pack8 ;


~ Addressing modes
~ ~~~~~~~~~~~~~~~~
~
~   These are higher-level words meant to be easier to use than the bits and
~ pieces above. Each corresponds to some specific addressing mode. When
~ applicable, they accept keywords rather than pre-encoded values.
~
~   That's not all the time, because there are cases, such as the reg/op
~ field, where the meaning is up to the individual instruction. In those
~ cases, these words do accept fully-processed, numeric values.
~
~   The general rule is that the responsibility of these addressing-mode words
~ is for the parts that are common to all instructions using that addressing
~ mode.


~   The simplest of the addressing modes: Direct register addressing. There
~ are no special cases to check.
~
~   It's important to notice that the R/M field may describe either a source,
~ or a target, depending on what the instruction is. So, this helper doesn't
~ get to know that. It also doesn't get to know whether the value in the
~ reg/op field describes a register, or if instead it's an extension of the
~ opcode. The caller is responsible for figuring that all out.
~
~ (output point, reg/op field value, reg/mem field register
~  -- output point)
: addressing-reg64 reg64 3 3unroll modrm ;
: addressing-reg8 reg8 3 3unroll modrm ;


~   This is a helper for assembly instructions that want to do a form of
~ addressing that requires a value of 1 in the modrm byte's mode field, and
~ do not want to do any indexing. That's the indirect mode, which takes a
~ 64-bit register, treats it as an address, and looks up the 64-bit value it
~ points to.
~
~   The helper's main responsibility is to deal with the scenario that
~ requires an SIB byte, which happens when the R/M field has a value of 4,
~ which would otherwise refer to the register rsp. In that situation, it also
~ generates an SIB byte which indicates a scale of 1, no indexing, and rsp as
~ the base register.
~
~   When the register is :rbp, the only modes available also have
~ displacement; we disallow that. For that case, use an instruction that
~ uses a disp8 mode, and set a displacement of 0.
~
~   In understanding this, pay close attention to the Op/En column in the
~ opcode table. The "RM" variant means the ModRM byte's R/M field (the third
~ one) is the source, while its reg field (the middle one) is the target. This
~ is what we want, because the R/M field is the one that gets indirection
~ applied to it. Opcode 0x8B with an REX.W prefix is the all-64-bit RM
~ variant. [Intel] volume 2B, chapter 4, section 4-3, "MOV".
~
~   For the indirection modes, don't be confused by the many similar tables.
~ 64-bit mode is encoded the same as 32-bit mode except for adding a REX.W
~ prefix, as per 2.2.1.1, so you want table 2-2 to understand the ModRM byte.
~ The presence or absence of an SIB byte is determined by where in that table
~ we fall, and we aren't using a mode that has one. [Intel] volume 2A,
~ chapter 2, section 2-1.5, table 2-2.
~
~ (output point, reg/op field value, reg/mem field register
~  -- output point)
: addressing-indirect-reg64
  ~ Exit with an error if the R/M register is :rbp.
  dup :rbp != 0branch [ 23 8 * , ]
  ~ Check whether the R/M register is :rsp. Save the test result for later.
  dup :rsp = 4 unroll
  ~ (equality result, output point, reg/op value, reg/mem name)
  reg64 0 3unroll modrm
  ~ (equality result, output point)
  ~ If the R/M register was rsp, we need an SIB byte; otherwise, skip it.
  swap 0branch [ 8 8 * , ] 0 4 :rsp reg64 sib
  exit
  ." R/M parameter to addressing-indirect-reg64 is :rbp." 1 sys-exit ;

~ (output point, reg/op field value, reg/mem field register,
~  displacement value -- output point)
: addressing-disp8-reg64
  ~ This mode can do :rbp fine, so no need to check for that.
  ~ Check whether the R/M register is :rsp. Save the test result for later.
  swap dup :rsp = 5 unroll swap
  ~ Stash the displacement value out of the way, too.
  4 unroll
  reg64 1 3unroll modrm
  ~ If the R/M register was rsp, we need an SIB byte; otherwise, skip it.
  3roll 0branch [ 8 8 * , ] 0 4 :rsp reg64 sib
  ~ The displacement byte.
  swap pack8 ;

~ (output point, reg/op field value, reg/mem field register,
~  displacement value -- output point)
: addressing-disp32-reg64
  ~ This mode can do :rbp fine, so no need to check for that.
  ~ Check whether the R/M register is :rsp. Save the test result for later.
  swap dup :rsp = 5 unroll swap
  ~ Stash the displacement value out of the way, too.
  4 unroll
  reg64 2 3unroll modrm
  ~ If the R/M register was rsp, we need an SIB byte; otherwise, skip it.
  3roll 0branch [ 8 8 * , ] 0 4 :rsp reg64 sib
  ~ The displacement value.
  swap pack32 ;

~ (output point, reg/op field value,
~  scale factor, index register, base field register
~  -- output point)
: addressing-indexed-reg64
  ~ Exit with an error if the base register is :rbp.
  dup :rbp != 0branch [ 23 8 * , ]
  ~ Reg/mem value 4 means to use an SIB byte (at least, with this mode).
  5 roll 0 6 roll 4 modrm 4 unroll
  reg64 3unroll reg64 3unroll scalefield 3unroll sib
  exit
  ." Base parameter to addressing-indexed-reg64 is :rbp." 1 sys-exit ;

~ (output point, reg/op field value,
~  scale factor, index register, base field register,
~  displacement value -- output point)
: addressing-disp8-indexed-reg64
  ~ This mode can do :rbp fine, so no need to check for that.
  ~ Reg/mem value 4 means to use an SIB byte (at least, with this mode).
  6 roll 1 7 roll 4 modrm 5 unroll
  5 unroll reg64 3unroll reg64 3unroll scalefield 3unroll sib
  swap pack8 ;


~ Easy instructions
~ ~~~~~~~~~~~~~~~~~
~
~   It's not worth pretending there's a coherent category behind this
~ grouping. These are the ones that were easy to deal with.

~ (output point -- output point)
: cld 0xFC pack8 ;
: std 0xFD pack8 ;
: syscall 0x0F pack8 0x05 pack8 ;
: hlt 0xF4 pack8 ;

~ (output point, source register -- output point)
: push-reg64 reg64 0x50 opcodereg ;

~ (output point, target register -- output point)
: pop-reg64 reg64 0x58 opcodereg ;

~ (output point, immediate value -- output point)
: push-imm32-extended64 swap 0x68 pack8 swap pack32 ;

~ (output point, source register, source displacement value, target register
~  -- output point)
: lea-reg64-disp8-reg64
  4 roll rex-w 0x8D pack8 4 unroll
  reg64 3unroll addressing-disp8-reg64 ;

~ (output point, source register, source displacement value, target register
~  -- output point)
: lea-reg64-disp32-reg64
  4 roll rex-w 0x8D pack8 4 unroll
  reg64 3unroll addressing-disp32-reg64 ;

~ (output point,
~  source base register, source index register, source index scale factor,
~  target register -- output point)
: lea-reg64-indexed-reg64
  5 roll rex-w 0x8D pack8 5 unroll
  reg64 4 unroll 3unroll swap addressing-indexed-reg64 ;

~ (output point,
~  source base register, source index register, source index scale factor,
~  source displacement value,
~  target register -- output point)
: lea-reg64-disp8-indexed-reg64
  6 roll rex-w 0x8D pack8 6 unroll
  reg64 5 unroll 3 roll 4 roll 3 roll addressing-disp8-indexed-reg64 ;


~ Move instructions
~ ~~~~~~~~~~~~~~~~~
~
~   These are, like, MOST of what we care about, so they get their own
~ section. Although it's very much the case that almost every two-operand
~ instruction offers this many distinct modes, we don't care about most of
~ those and don't yet implement them. We do care about all the modes for move
~ instructions.
~
~   Someday perhaps we'll have extra-high-level features which generate all
~ the distinct versions of each instruction in a concise way, but that is not
~ this day.

~ (output point, immediate value, register -- output point)
: mov-reg64-imm32
  3roll rex-w 0xC7 pack8 swap
  0 swap addressing-reg64
  swap pack32 ;
: mov-reg64-imm64
  3roll rex-w swap reg64 0xB8 opcodereg swap pack64 ;
: mov-extrareg64-imm64
  ~   Note the use of the B rex bit here; this instruction puts the register
  ~ number in the opcode field, so it uses Table 3-1.
  3roll rex-wb swap extrareg64 0xB8 opcodereg swap pack64 ;

~ (output point, source register, target register -- output point)
: mov-reg64-reg64
  3roll rex-w 0x89 pack8 3unroll
  swap reg64 swap addressing-reg64 ;
: mov-indirect-reg64-reg64
  3roll rex-w 0x89 pack8 3unroll
  swap reg64 swap addressing-indirect-reg64 ;

~ (output point, source register, target register, target displacement value
~  -- output point)
: mov-disp8-reg64-reg64
  4 roll rex-w 0x89 pack8 4 unroll
  3roll reg64 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-reg64-indirect-reg64
  3roll rex-w 0x8B pack8 3unroll
  reg64 swap addressing-indirect-reg64 ;

~ (output point, source register, source displacement value, target register
~  -- output point)
: mov-reg64-disp8-reg64
  4 roll rex-w 0x8B pack8 4 unroll
  reg64 3unroll addressing-disp8-reg64 ;
: mov-reg64-disp32-reg64
  4 roll rex-w 0x89 pack8 4 unroll
  3roll reg64 swap 3roll addressing-disp32-reg64 ;

~ (output point,
~  source base register, source index register, source index scale factor,
~  target register -- output point)
: mov-reg64-indexed-reg64
  5 roll rex-w 0x8B pack8 5 unroll
  reg64 4 unroll 3unroll swap addressing-indexed-reg64 ;

~ (output point, source register,
~  target base register, target index register, target index scale factor
~  -- output point)
: mov-indexed-reg64-reg64
  5 roll rex-w 0x89 pack8 5 unroll
  4 roll reg64 4 unroll
  3unroll swap addressing-indexed-reg64 ;

~ (output point, source register, target register -- output point)
: mov-indirect-reg64-reg32
  3roll 0x89 pack8 3unroll
  swap reg32 swap addressing-indirect-reg64 ;

~ (output point, source regisgter, target register, target displacement value
~  -- output point)
: mov-disp8-reg64-reg32
  4 roll 0x89 pack8 4 unroll
  3roll reg32 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-reg32-indirect-reg64
  3roll 0x8B pack8 3unroll
  reg32 swap addressing-indirect-reg64 ;

~ (output point, source register, source displacement value, target register
~ -- output point)
: mov-reg32-disp8-reg64
  4 roll 0x8B pack8 4 unroll
  reg32 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-indirect-reg64-reg16
  3roll 0x66 pack8 0x89 pack8 3unroll
  swap reg16 swap addressing-indirect-reg64 ;

~ (output point, source register, target register, target displacement value
~  -- output point)
: mov-disp8-reg64-reg16
  4 roll 0x66 pack8 0x89 pack8 4 unroll
  3roll reg16 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-reg16-indirect-reg64
  3roll 0x66 pack8 0x8B pack8 3unroll
  reg16 swap addressing-indirect-reg64 ;

~ (output point, source register, target displacement value, target register
~  -- output point)
: mov-reg16-disp8-reg64
  4 roll 0x66 pack8 0x8B pack8 4 unroll
  reg16 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-indirect-reg64-reg8
  3roll 0x88 pack8 3unroll
  swap reg8 swap addressing-indirect-reg64 ;

~ (output point, source register, target register, target displacement value
~  -- output point)
: mov-disp8-reg64-reg8
  4 roll 0x88 pack8 4 unroll
  3roll reg8 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-reg8-indirect-reg64
  3roll 0x8A pack8 3unroll
  reg8 swap addressing-indirect-reg64 ;

~ (output point, source register, source displacement value, target register
~  -- output point)
: mov-reg8-disp8-reg64
  4 roll pack8 0x8A pack8 4 unroll
  reg8 3unroll addressing-disp8-reg64 ;

~ (output point, source register, target register -- output point)
: mov-reg8-reg8
  3roll 0x88 pack8 3unroll
  swap reg8 swap addressing-reg8 ;


~ String instructions
~ ~~~~~~~~~~~~~~~~~~~
~
~   These are in their own section because there's an awful lot of
~ combinations, and fortunately they are very uniform in structure.
~
~   What makes these useful is that they take their parameters from certain
~ fixed registers, which are chosen such that the operations chain into each
~ other well. Thus you can use them to build various block-memory and string
~ operations, and even if you need unusual forms of loop unrolling or
~ alignment tweaking, the code will end up uniform in structure. On modern
~ processors, this is even the high-performance approach, due to highly
~ optimized microcode, though these operations were inefficient when they
~ were first invented.
~
~   We break with the Intel mnemonics, which follow the pattern
~ movsb/movsw/movsd/movsq, because this would otherwise be the only place we
~ use the b/w/d/q thing instead of 8/16/32/64. Tradition and pronounceability
~ are both nice things, but approachability to newcomers is important, too.
~
~   Some of these are repeatable; whether you view the repeatable variants
~ as different instructions is up to you. At any rate the machine code
~ representation of the repeatable variants is the same as for the regular
~ variants with an extra prefix, so we define them together.
~
~   This is a proper superset of the flatassembler implementations of string
~ instructions. The wisdom of that is questionable, but at least it's noted
~ here...

~ (output point -- output point)
: movs8 0xA4 pack8 ;
: movs16 0x66 pack8 0xA5 pack8 ;
: movs32 0xA5 pack8 ;
: movs64 rex-w 0xA5 pack8 ;
: rep-movs8 0xF3 pack8 0xA4 pack8 ;
: rep-movs16 0xF3 pack8 0x66 pack8 0xA5 pack8 ;
: rep-movs32 0xF3 pack8 0xA5 pack8 ;
: rep-movs64 0xF3 pack8 rex-w 0xA5 pack8 ;

~ (output point -- output point)
: lods8 0xAC pack8 ;
: lods16 0x66 pack8 0xAd pack8 ;
: lods32 0xAD pack8 ;
: lods64 rex-w 0xAD pack8 ;
: rep-lods8 0xF3 pack8 0xAC pack8 ;
: rep-lods16 0xF3 pack8 0x66 pack8 0xAD pack8 ;
: rep-lods32 0xF3 pack8 0xAD pack8 ;
: rep-lods64 0xF3 pack8 rex-w 0xAD pack8 ;

~ (output point -- output point)
: stos8 0xAA pack8 ;
: stos16 0x66 pack8 0xAB pack8 ;
: stos32 0xAB pack8 ;
: stos64 rex-w 0xAB pack8 ;
: rep-stos8 0xF3 pack8 0xAA pack8 ;
: rep-stos16 0xF3 pack8 0x66 pack8 0xAB pack8 ;
: rep-stos32 0xF3 pack8 0xAB pack8 ;
: rep-stos64 0xF3 pack8 rex-w 0xAB pack8 ;

~ (output point -- output point)
: cmps8 0xA6 pack8 ;
: cmps16 0x66 pack8 0xA7 pack8 ;
: cmps32 0xA7 pack8 ;
: cmps64 rex-w 0xA7 pack8 ;
: repz-cmps8 0xF3 pack8 0xA6 pack8 ;
: repz-cmps16 0xF3 pack8 0x66 pack8 0xA7 pack8 ;
: repz-cmps32 0xF3 pack8 0xA7 pack8 ;
: repz-cmps64 0xF3 pack8 rex-w 0xA7 pack8 ;
: repnz-cmps8 0xF2 pack8 0xA6 pack8 ;
: repnz-cmps16 0xF2 pack8 0x66 pack8 0xA7 pack8 ;
: repnz-cmps32 0xF2 pack8 0xA7 pack8 ;
: repnz-cmps64 0xF2 pack8 rex-w 0xA7 pack8 ;

~ (output point -- output point)
: scas8 0xA8 pack8 ;
: scas16 0x66 pack8 0xAF pack8 ;
: scas32 0xAF pack8 ;
: scas64 rex-w 0xAF pack8 ;
: repz-scas8 0xF3 pack8 0xAE pack8 ;
: repz-scas16 0xF3 pack8 0x66 pack8 0xAF pack8 ;
: repz-scas32 0xF3 pack8 0xAF pack8 ;
: repz-scas64 0xF3 pack8 rex-w 0xAF pack8 ;
: repnz-scas8 0xF2 pack8 0xAE pack8 ;
: repnz-scas16 0xF2 pack8 0x66 pack8 0xAF pack8 ;
: repnz-scas32 0xF2 pack8 0xAF pack8 ;
: repnz-scas64 0xF2 pack8 rex-w 0xAF pack8 ;


~ Arithmetic instructions
~ ~~~~~~~~~~~~~~~~~~~~~~~

~ (output point, source register, target register -- output point)
: add-reg64-reg64
  3roll rex-w 0x01 pack8 3unroll
  swap reg64 swap addressing-reg64 ;

~ (output point, source register, target register -- output point)
: add-indirect-reg64-reg64
  3roll rex-w 0x01 pack8 3unroll
  swap reg64 swap addressing-indirect-reg64 ;

~ (output point, source register, target register -- output point)
: add-reg64-indirect-reg64
  3roll rex-w 0x03 pack8 3unroll
  reg64 swap addressing-indirect-reg64 ;

~ (output point, source register, target register -- output point)
: add-reg64-imm8
  3roll rex-w 0x83 pack8 swap 0 swap addressing-reg64
  swap pack8 ;

~ (output point, source register, target register -- output point)
: sub-reg64-reg64
  3roll rex-w 0x2B pack8 3unroll
  reg64 swap addressing-reg64 ;

~ (output point, source register, target register -- output point)
: sub-indirect-reg64-reg64
  3roll rex-w 0x2B pack8 3unroll
  swap reg64 swap addressing-indirect-reg64 ;

~ (output point, source register, target register -- output point)
: sub-reg64-imm8
  3roll rex-w 0x83 pack8 swap 5 swap addressing-reg64
  swap pack8 ;

~ (output point, source register, target register -- output point)
: sbb-reg64-imm8
  3roll rex-w 0x83 pack8 swap 3 swap addressing-reg64
  swap pack8 ;

~  The target register is always rax.
~
~ (output point, source register -- output point)
: mul-reg64
  swap rex-w 0xF7 pack8 swap
  4 swap addressing-reg64 ;

~   The dividend is 128 bits, and is formed from rdx as the high half and rax
~ as the low half. The divisor is a specified register. The quotient is
~ returned in rax, truncated towards zero. The remainder is in rdx. This
~ entire process is unsigned.
~
~   The official mnemonic for this is "div", but divmod is what it does.
~
~ (output point, divisor register -- output point)
: divmod-reg64
  swap rex-w 0xF7 pack8 swap
  6 swap addressing-reg64 ;

~ Same as divmod, but signed.
~
~ (output point, divisor register -- output point)
: idivmod-reg64
  swap rex-w 0xF7 pack8 swap
  7 swap addressing-reg64 ;

~ (output point, target register -- output point)
: inc-reg64
  swap rex-w 0xFF pack8 swap 0 swap addressing-reg64 ;

~ (output point, target register -- output point)
: dec-reg64
  swap rex-w 0xFF pack8 swap 1 swap addressing-reg64 ;

~ (output point, source register, target register -- output point)
: and-reg64-reg64
  3roll rex-w 0x23 pack8 3unroll
  reg64 swap addressing-reg64 ;

~ (output point, source value, target register -- output point)
: and-reg68-imm8
  3roll rex-w 0x83 pack8 swap
  4 swap addressing-reg64
  swap pack8 ;

~ (output point, source register, target register -- output point)
: or-reg64-reg64
  3roll rex-w 0x0B pack8 3unroll
  reg64 swap addressing-reg64 ;

~ (output point, source value, target register -- output point)
: or-reg64-imm8
  3roll rex-w 0x83 pack8 swap
  1 swap addressing-reg64
  swap pack8 ;

~ (output point, source register, target register -- output point)
: xor-reg64-reg64
  3roll rex-w 0x33 pack8 3unroll
  reg64 swap addressing-reg64 ;

~ (output point, target register -- output point)
: not-reg64
  swap rex-w 0xF7 pack8
  swap 2 swap addressing-reg64 ;


~ Control flow instructions
~ ~~~~~~~~~~~~~~~~~~~~~~~~~

~   Pretend to subtract right from left, and set the flags the same way as if
~ we actually had.
~
~ (output point, left register, right register -- output point)
: cmp-reg64-reg64
  3roll rex-w 0x3B pack8 3unroll
  reg64 swap addressing-reg64 ;

~   Pretend to xor left with right, and set the flags the same way as if we
~ actually had.
~
~   The names of the condition codes can be a little confusing when using them
~ after "test", because they're really premised on the idea that you did
~ "cmp".
~
~ (output point, left register, right register -- output point)
: test-reg64-reg64
  3roll rex-w 0x85 pack8 3unroll
  swap reg64 swap addressing-reg64 ;

~ (output point, condition code, target register -- output point)
: set-reg8-cc
  3roll 0x0F pack8
  3roll condition-code 0x90 opcodecc
  swap reg8 3 0 3roll modrm ;

~ (output point, address offset value, condition code -- output point)
: jmp-cc-rel-imm8
  3roll swap condition-code 0x70 opcodecc
  swap pack8 ;

~ (output point, address offset value, condition code -- output point)
: jmp-cc-rel-imm32
  3unroll 0x0F pack8
  swap condition-code 0x70 opcodecc
  swap pack32 ;

~ (output point, register -- output point)
: jmp-abs-indirect-reg64
  swap 0xFF pack8 swap
  4 swap addressing-indirect-reg64 ;

~ (output point, address offset value -- output point)
: jmp-rel-imm8
  swap 0xEB pack8
  swap pack8 ;

~ (output point, address offset value -- output point)
: jmp-rel-imm32
  swap 0xE9 pack8
  swap pack32 ;