[Devember 2021] CPU Simulating Engine (a rewrite from last Devember)

Medic5700 · October 18, 2021, 11:57pm

I don’t know if I can use my Devember project from last year but with realistic modified goals, but I’m gonna try.
Long story short, I wanted to make a CPU simulator that you could (allegedly) “easily” reprogram to simulate different CPUs (different ISAs, branch predictors, etc), that was WAY too ambitious and crashed and burned into an (incomplete) janky (working) prototype (with some serious structural flaws).

This year I want to remake/recode/refactor/reeeeeeeeeeee the engine with more realistic goals.
Goals:

Documentation: last year I didn’t do documentation for just jumping in and getting started. You got a big old blob of source code with no clear way to figure out how to run the dang thing, or what it’s supposed to do in a clear way. I want to make some clear ‘getting started’ documentation you can jump right into and tinker with.
Structural changes: currently, the old prototype has a number of structural choices that while good enough for a prototype (EX: memory/registers are stored as a python dictionary), is not good enough for project goals (EX: tracking changes to memory/registers). These changes, while not seeming like a significant change, would require a moderate rewriting of almost the ENTIRE project (at the least).
Memory: currently, memory is stored as a big python dictionary. What I want is memory that can track access, fork and merge different states, has multiple reconfigurable cache levels, track energy usage, etc. That is too ambitious, so I’m aiming to make a dummy Modular Meta Memory Management Unit (MMMU) that has an API that’s compatible with those goals, but simple enough to actually implement in time. Also sanity, I would like to retain some of my sanity.
The Compiler: currently the compiler is duct taped together. It can ‘compile’ instructions (IE: copy an execution tree from one array into another more special array). It doesn’t support compiler directives (EX: initialize memory location with int), translating an instruction to machine code, support for variable length instructions, etc. I want at least some support for those features structurally.
The Execution Engine: (IE the main loop that runs the assembly/machine code in the simulator) If source code could actively be on fire and burning down, this would be it. When I was coding it I didn’t really know what exactly I wanted it to do, nor how exactly to do it, and it shows…
Statistics: One of the main features of this engine was to track statistics on how many times memory was used, how many times an instruction was used, how much energy a particular instruction used, etc. I want to implement something from that, if only at the engine level
System Calls: Want something more extensible then the current ‘if “halt” then break’, still figuring it out

Currently, the CPU simulator can handle some simple and moderate assembly programs, but there are a number of things done in a ‘hacky’ way

For example with a simple “10 + 0 = 10 in r[0]” program:

There is no assembler directive to initialize memory with data (which is why a value needs to be injected directly into a register)
the ‘halt’ instruction under the hood is done in a hacky and non-extensible way, because there is no mechanism for system calls
registers are highlighted when they change, except it’s actually highlighting when the value is different between the old state and new state. A subtle difference, but important, because for example, the Program Counter changes every instruction, but you only want to show when the Program Counter jumps with a jump instruction. The current engine just is not capable of showing that subtle difference without a project rewrite
The after execution report only says ‘CPU Halted’ because there way no stat tracking done to display. Further, the robust way to do stat tracking would also require a project rewrite

Github link: Book-of-Algorithms/CPUsimulator.py at master · Medic5700/Book-of-Algorithms · GitHub
Devember2020 link: A CPU simulator where you program the instruction set

TLDR: Last year my project was aim for the stars in difficulty (make CPU simulator), result was exploded in Low Earth Orbit (janky incomplete prototype). This year my project is taking last years project and aiming for Geostationary Orbit difficulty.

Medic5700 · October 26, 2021, 4:07am

Random Design Note [2021-10-2X] (TLDR memory subsystems are a maze)

Architecting a meta memory subsystem:

Designing a memory management system is difficult as this CPU Simulator is specifically designed for the weird not easy cases. Because of the complexity, I think it’s better to start out with a guide down the rabbit hole instead of a traditional ‘this hyper complicated thing is what I’m doing because reasons’

Lets start with a simple program, add two numbers together
“add r0, r1, r2” #takes value in registers r1, r2 → store value in register r0

well, you have a couple registers holding numbers, so use an array of ints. But you also need to store the size in bits of each register, in case of an overflow. This means you would need a data structure something like this

memory = [0 for i in range(8)] #a list of ints
bitlength = [64 for i in range(8)]
def add(destination : int, source1 : int, source2 : int):
	memory[destination] = memory[source1] + memory[source2]
	memory[destination] = memory[destination] & (2**bitlength[destination] - 1) #trunks the number to the correct bitlength using a bitwise AND

but you also want to keep track of the changes, so you can display and keep track of the changes every tick

memoryOld = [0 for i in range(8)]
memoryNew = [0 for i in range(8)]
bitLength = [64 for i in range(8)]
def add(destination : int, source1 : int, source2 : int):
	temp = memoryOld[source1] + memoryOld[source2]
	memoryNew[destination] = temp & (2**bitlength[destination] - 1) #trunks the number to the correct bitlength using a bitwise AND

Welp, not all memory is registers, so we also need a way to represent a bank of memory that is separate from registers.

memoryOld = {[0 for j in range(8)] for i in ["r", "m"]} #a two dimensional directionary
memoryNew = {[0 for j in range(8)] for i in ["r", "m"]}
bitLength = {[64 for j in range(8)] for i in ["r", "m"]}
def add(destination : [str, int], source1 : [str, int], source2 : [str, int]): #Note the memory is referred to by a key/index pair
	a = memoryOld[source1[0]][source1[1]]
	b = memoryOld[source2[0]][source2[1]]
	bitlength = bitLength[destination[0]][destination[1]]
	result = a + b
	result = result & (2**bitLength - 1)
	memoryNew[destination[0]][destination[1]] = result

So far, memory is referred to with a [key : str or int, index : int] pair IE: [memory bank, memory index]

Alright, the above example was mainly to get you into the headspace of the usual way to implement something like this. Mainly how the data is stored and accessed.
Now it’s time to take this rollercoaster off a cliff.
Now we get to go through all the edge cases I want to deal with, and how that will shape the memory management system. (at the very least, it will rule out a couple implementations that can’t cope with the edge cases)

The Raspberry Pi Pico, it’s memory is not a bank of orderly ints neatly in a row. It’s got some physical memory addresses missing, IE: 0x00000F might be accessible, but 0x0F0000 could not exist, BUT 0xF00000 does exist and is actually accessing the flash storage
Therefor memory needs to be stored, entirely, as a dictionary.

CPU flags, for convenience, we could access them as memory[‘flags’][‘a specific flag’], therefore memory needs to be referred to with [key : str or int, index : str or int]
This is possible because we are storing the memory in a dictionary

The Commodore 64, has 64KB of memory. The CPU of the Commodore 64 was less complex then other CPUs of the time, yet performed significantly faster. Turns out, the Commodore 64’s CPU had a small trick where, if memory was being accessed from the lower 256 bytes of memory, the CPU could take a short cut and load that memory significantly faster. In essence, the first 256 bytes of RAM could be used as REGISTERS.

Therefor there needs to be some way to keep track of how long it takes to access each individual memory element during execution, as each memory element could have different characteristics. (this is assuming the engine has sophisticated enough to meaningfully track those stats)

Hyperthreading. Two different programs, same core, same execution units. Each hyperthread would need to have access to the same bank of memory, but need different registers that can be accessed the same from the perspective of each hyperthread. IE: Each hyperthread would need it’s own set of registers for program counter, general purpose registers, etc. BUT the bank of main memory would have to be shared and accessable and shared between hyperthreads.
Therefore each memory element needs to be referred to with [hyperthread context, memory bank, memory index] AND you need to store info on which memory elements can be shared.

Speculative execution. A program branches, but the CPU sees it ahead of time, but it doesn’t know which branch to take until after the branch instruction is fully processed and pipelined. Why not take both branches? Modern CPUs essentially take a guess as to which branch to speculatively execute, and discard the results if it’s wrong. This means if you want support for speculative execution, you also need memory that can track ALL memory accesses, AND revert to a previous state if you need to discard the results.

Out of Order Execution. Modern CPUs also reorder the instructions they are executing to allow better scheduling between the multiple execution units, while at the same time picking which instructions to execute in a specific order so there is no memory conflicts. IE: (a+b) + (ab) could be executed as [(ab)=c, (a+b)=d; c+d=result] where c and d are computed at the same time. The challenge is picking instructions that can be scheduled to execute at the same time, and picking instructions that don’t interfere with each other.

Therefor, with speculative execution and out or order execution, you could have a memory version system that’s a node tree, where every node represents a state that can be reverted to. With speculative execution, both branches could be executed safely, and the incorrect branch discarded, and the correct branch get carried forward. With out of order execution, instead of guessing which instruction to queue up next, you could technically run every instruction in advance, and choose whichever instruction doesn’t conflict, or is most optimal.
Now every memory element has a key with 4 elements [memory version node, hyperthread context, memory bank, memory index]

Caching, not all memory behaves the same. Registers get data from L1 cache which gets data from L2 cache which gets data from etc.
Therefore you need a way to create memory pools with different memory characteristics. You also need a way to tell how each memory pool relates to each other. IE: L1 gets data directly from L2, but is not directly connected to DRAM.

You could implement the caches as a linear chain of connections, but that doesn’t work as in many CPUs, L1 cache is actually split into L1_instruction and L1_data. Loading instructions to execute goes through L1_instruction before getting data from L2, and data loads goes through L1_data before getting from L2.
Therefore memory needs to be implemented as connected network of memory pools (a directed graph). This also means that every memory access also needs an entry point, IE: which memory pool to start from. Now every memory access uses 5 elements [memory version node, memory pool entry point, hyperthread context, memory bank, memory index].

Cache algorithms. Not every cache uses the same algorithm to decide what data gets stored and discarded. L3 cache could store only data that was previously accessed. L3 cache could store data that was accessed AND all adjacent data. L3 cache might only store data that was evicted from L2 cache.
Therefore every memory pool needs to be able to have it’s own caching algorithm, or at least the capability to swap out caching algorithms.

What about ECC Memory (Error Correction Code Memory)? This requires logic which should able to be included into individual memory pools. (and the CPU load instruction)
What about Row Hammer? Successive writes to one cell can alter adjacent cells. But implementing this logic would have to be separate from the logic that would drive ECC or Caching algorithms.
Therefore each memory pool would need to actually be a pair of connected objects, a ‘controller’ and a ‘data pool’. And since they are separate objects, why not allow one ‘controller’ to connect to multiple ‘data pools’ at the same time. IE: having different channels of DRAM handled by the same memory controller.

There is more ‘edge cases’ (Specter Vulnerabilities, Table Lookaside Buffer, Duel Threading Memory Controller choking, etc) that are out there, but I think I rambled on for long enough. Modern memory systems are far more complicated then a simple ‘array of ints’. Implementing something that can be expanded into all the above edge cases WHILE BEING EASY TO USE will be an incredible challenge to properly design and code.

Why is this important? Moving data around costs time and energy.

[source: Quantifying the Energy Cost of Data Movement for Emerging Smart Phone Workloads on Mobile Platforms - Dhinakaran Pandiyan and Carole-Jean Wu]

A lot of time and energy. Having a tool that glosses over these details can make it easy to abstract away the problem and ignore it. Having a tool that can directly show how a minor difference in programming can drastically change performance is good, if only as a demonstration.

also, including this video for those interested in CPU architecture
The Apple M1 isn’t magic, it’s good design - M1 Deep Dive pt 2 | Upscaled - YouTube

TLDR: Learned about memory management architecture, learned the edge cases, glimpsed cthulhu, take 1d6 damage to sanity =P

Medic5700 · November 7, 2021, 7:57am

Random Design Note [2021-11-0X]

Architecting an Instruction Set Architecture subsystem:

I didn’t go into detail last year about the ISA subsystem, or why I made some… less then straight forward decisions about how to architect it. Today was that day, until I wrote it out and realized I was half way through what I wanted to say and it’s already too long.

Welp, to start off, lets start with a simple ‘add’ instruction, and assume a spherical cow.

#very psudocode
program = "add(r[0], 1, 2) \n halt" #The program, a simple two instruction program

executionTree = magicParser(program) #This turns the 'program' into an exectuion tree

ISA = {"add" : add, "halt" : magicHaltFunction} #A lookup table for what to do when encountering an instruction

def add(c, a, b): #The instruction definition
    c = a + b

magicEverythingRuns(ISA, executionTree)

The main idea behind this kind of implimentation is to allow for tinkering with the ISA. You can adjust and extend that ‘ISA’ lookup table relativly easily depending on your needs. Lets dive into that idea with a little bit of code from the project. (albeit, mockup code)

#trimmed down code from the project
self.instructionSet = {
    ("nop",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,                               : self.opNop(fRead, fWrite, fConfig, EFunc, EStatus)), 
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b)), 
    ("mult",)       : self.opMultiply,
    ("twos",)       : self.opTwosCompliment,
    ("copy",)       : self.opCopyElement,
    ("and",)        : self.opAND,
    ("or",)         : self.opOR,
    ("xor",)        : self.opXOR,
    ("not",)        : self.opNOT,
    ("jumpeq",)     : (lambda fRead, fWrite, fConfig, EFunc, EStatus,       pointer, a, b           : self.opJump(fRead, fWrite, fConfig, EFunc, EStatus,           "==", pointer, a, b)),
    ("jumpne",)     : (lambda fRead, fWrite, fConfig, EFunc, EStatus,       pointer, a, b           : self.opJump(fRead, fWrite, fConfig, EFunc, EStatus,           "!=", pointer, a, b)),
    ("jump",)       : (lambda fRead, fWrite, fConfig, EFunc, EStatus,       pointer                 : self.opJump(fRead, fWrite, fConfig, EFunc, EStatus,           "goto", pointer)),
    ("shiftl",)     : (lambda fRead, fWrite, fConfig, EFunc, EStatus,       des, a                  : self.opShiftL(fRead, fWrite, fConfig, EFunc, EStatus,         des, a, fWrite("imm", 0, 1))), 
    ("shiftr",)     : (lambda fRead, fWrite, fConfig, EFunc, EStatus,       des, a                  : self.opShiftR(fRead, fWrite, fConfig, EFunc, EStatus,         des, a, fWrite("imm", 0, 1), False)), 
    ("halt",)       : (lambda fRead, fWrite, fConfig, EFunc, EStatus,                               : self.sysCallSimple(fRead, fWrite, fConfig, EFunc, EStatus,    "halt"))
}

def opAdd(self, 
    funcRead : Callable[[int or str, int or str], int], funcWrite : Callable[[int or str, int or str], None], funcGetConfig : Callable[[int or str, int or str], dict], engineFunc : Dict[str, Callable[[Any], Any]], engineStatus : dict, 
    registerDestination : Tuple[int or str, int or str], registerA : Tuple[int or str, int or str], registerB : Tuple[int or str, int or str]):
    """Adds registerA and registerB, stores result in registerDestination"""
            
    a : int = funcRead(registerA)
    b : int = funcRead(registerB)

    bitLength : int = funcGetConfig(registerDestination)["bitLength"]

    c : int = a + b
    c = c & (2**bitLength - 1)

    funcWrite(registerDestination, c)

This looks like a mess, but there is a method to the madness. So lets focus in on the ‘add’ instruction and break it down (and thus learn what’s behind the assumption of a spherical cow)

#more trimmed down
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b))
}

def opAdd(self,
    funcRead, funcWrite, funcGetConfig, engineFunc, engineStatus,
    registerDestination, registerA, registerB):

    a : int = funcRead(registerA) #reading value from memory/registers
    b : int = funcRead(registerB) #reading value from memory/registers

    bitLength : int = funcGetConfig(registerDestination)["bitLength"] #getting info on how large the destination register is

    c : int = a + b
    c = c & (2**bitLength - 1) #this cuts down the value to a size that can fit in the register

    funcWrite(registerDestination, c) #writing value to memory/registers

self.instructionSet is a dictionary (the lookup table), the odd part is that the keys are tuples (IE: an array of values), that just happen to be of length(1) in this example. This is mainly done for convinence, as some instruction sets can have similar instructions with different addressing modes. Like for example the Commodore 64, same instruction (Add with Carry), but several different addressing modes.

[Source [2021-11-06] 6502 / 6510 Instruction Set | C64 OS ]

#what implementing the Commodore 64 instruction set could look like
self.instructionSet = {
    ("adc", "#$") : somefunction immediate value
    ("adc", "$")  : somefunction zero page memory location
    ("adc", "(")  : somefunction indirect memory location
    ...
}

opAdd(), the function that is called to carry out the operation. First of note is the funcRead, funcWrite, etc… arguments, these are to allow the function access to the nessccary functions to access memory, the engine, etc. Essentially, those are the communication channels for this fuction. This allows for a lot of flexibility with instructions… arguably too much flexability; you will be able to do things you should not want to do. The final variables are registers. These can be thought of as dictionary keys of length(2), or a memory/register bank/index pair. For an example, lets make a vector add instruction, such that 8 registers get added with a value stored in another register.

def opVectorAdd(self,
    funcRead, funcWrite, funcGetConfig, engineFunc, engineStatus,
    registerVectorPointer, registerAmount):

    amount = funcRead(registerAmount) #read a register
    vector_bank, vector_index = registerVectorPointer

    for i in range(8):
        bitLength = funcGetConfig([vector_bank, vector_index + i])["bitlength"]

        result = funcRead([vector_bank, vector_index + i])
        result += amount

        result = result & (2**bitLength - 1)
        funcWrite([vector_bank, vector_index + i], result) #write to a register

    #Note while this is possible, it's also flawed, this SHOULD read the value in registerVectorPointer, and then use that to access that index from memory bank 'm'.
    #What this actually does is take the register accessed (registerVectorPointer), and uses that register's 'key' as a pointer instead of the value IN the register
    #It was a mistake, but I left it in to show you can do the wrong thing if you really wanted to.

#The right way
def opVectorAdd(self,
    funcRead, funcWrite, funcGetConfig, engineFunc, engineStatus,
    registerVectorPointer, registerAmount):
    """adds an int (from registerAmount) to each of 8 memory bytes pointed to by registerVectorPointer"""

    amount = funcRead(registerAmount) #read a register/memory
    vector_index = funcRead(registerVectorPointer) #read a register/memory

    for i in range(8):
        bitLength = funcGetConfig(["m", vector_index + i])["bitlength"]

        result = funcRead(["m", vector_index + i])
        result += amount

        result = result & (2**bitLength - 1)
        funcWrite(["m", vector_index + i], result) #write to register/memory

Now lets get back to the self.instructionSet, and all those lambdas.
Note: lambda, for those that aren’t familiar, it’s an ‘unnamed function decliration’. (EX: add a and b; (lambda a, b : a + b))(EX: return input value incremented by one; (lambda a : a + 1)).
By using lambdas in this way, (plus a couple helper functions) we can ‘remap’ the function definitions fairly quickly

#The default, the assembly call looks like "add r[0], r[1], r[2]"
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b))
}

#lets say, rearrange it so the assembly call has the destination as the third argument, instead of the first
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      a, b, des                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b))
}

#what about an immediate value, "add r[0], r[1], 5"
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b))
}
#That's actually the same as default, as 'immediate values' automatically get loaded into an 'immediate register', so as far as the underlying opAdd function is concerned, it's just another register.

#what about if we wanted to make sure the value being added together was between a register and an immediate value? "add r[0], r[1], 5"
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b)),
    ("addi",)       : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, enforceAccess(b, "imm"))),
}
def enforceAccess(register, bank) -> returns a register iff the register is the correct bank, otherwise raises an error.

#a dedicated instruction that adds 5 to a register? "add5 r[0]"
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b)),
    ("add5",)       : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des                      : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, des, makeImmediate(fWrite, 5)))
}
def makeImmediate(fWrite, int) -> uses fWrite to create an immediate value and immediate register which then gets returned (and passed along to opAdd)

#an accumulate instruction that takes a register and adds it to register[0]? "acc r[1]"
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b)),
    ("acc",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      a                        : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ['r', 0], ['r', 0], a))
}

#how about a vector add function, that adds an amount (3) to the memory index pointed (stored in r[0])? "addVector r[0], 3"
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b)),
    ("addVector",)  : [
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 0], ["m", fRead(vectorPointer) + 0], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 1], ["m", fRead(vectorPointer) + 1], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 2], ["m", fRead(vectorPointer) + 2], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 3], ["m", fRead(vectorPointer) + 3], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 4], ["m", fRead(vectorPointer) + 4], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 5], ["m", fRead(vectorPointer) + 5], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 6], ["m", fRead(vectorPointer) + 6], amount)),
                          (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            ["m", fRead(vectorPointer) + 7], ["m", fRead(vectorPointer) + 7], amount)),
                      ]
}
#As far as the engine is concerned, this instruction simply has a list of functions of length(8) to execute, instead of the usual list of functions of length(1). As long as the arguments match properly, the engine is happy.

In essance, this allows for quickly rewiring/rewriting an instruction set without having to rewrite every instruction, because you can reuse functions by combining them with helper functions. It’s almost like microcode (except it isn’t). Actual microcode would allow you to have loops, branches, etc. This doesn’t actually allow that. To fully impliment that, there needs to be a system to jump the program counter to a separate memory bank full of actual microcode.

#IE: something like this, assuming instead of addVector working on any set of memory, it only works on the lower 8 registers
program = "addVector r[0], 3"
microcode = """
    0xF0:
           add r[0], r[0], arg[1]
           add r[1], r[1], arg[1]
           add r[2], r[2], arg[1]
           add r[3], r[3], arg[1]
           add r[4], r[4], arg[1]
           add r[5], r[5], arg[1]
           add r[6], r[6], arg[1]
           add r[7], r[7], arg[1]
           exitMicro
    0xFF:
           #Some other microCode function
"""
self.instructionSet = {
    ("add",)        : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      des, a, b                : self.opAdd(fRead, fWrite, fConfig, EFunc, EStatus,            des, a, b)),
    ("addVector",)  : (lambda fRead, fWrite, fConfig, EFunc, EStatus,      vectorPointer, amount    : self.microCode(0xF0, vectorPointer, amount)
}
def microCode() -> a kind of system call that functions more like a specialized jump to a memory bank storing microcode, and then back

Now for a quirk of the execution engine. It’s not evaluating an op-code, it’s evaluating a node tree. Which theoretically allows for greater flexibility that I haven’t quite fully explored yet.
BUT
It does allow for Very Long Instruction Word support (VLIW)(which is intended), since the execution engine will (more or less) execute instructions ‘line by line’

program = """
add(r[0], r[1], r[2])
halt
"""
#Gets turned into the following by the parser
Root
    line
        add
            r
                [
                    0
            r
                [
                    1
            r
                [
                    2
    line
        halt

#Then each 'line' is loaded into individual memory elements
memory index 0
    line
        add
            r
                [
                    0
            r
                [
                    1
            r
                [
                    2
memory index 1
    line
        halt
memory index 2
    None
...

So in effect, you can relatively easily support Very Long Instruction Word (VLIW), without too much effort as it’s done by the engine by default

program = """
add(r[0], r[4], r[5]), add(r[1], r[6], r[7]), add(r[2], r[4], r[5])
halt
"""
#Gets turned into the following by the parser
Root
    line
        add
            r
                [
                    0
            r
                [
                    4
            r
                [
                    5
        add
            r
                [
                    1
            r
                [
                    6
            r
                [
                    7
        add
            r
                [
                    2
            r
                [
                    4
            r
                [
                    5
    line
        halt

Now for the part no one was expecting… At no point during talk about the instruction set, modifying instructions, adding new instructions, parsing programs, etc… did I actually mention op-codes, or binary.
All of the above is done without ever actually converting to binary or back.
*The instructions aren’t converted into binary machine code, and machine code isn’t decoded into an execution tree.
*The instruction definitions (EX: add instruction) are done while using regular python ints.
*The instruction definitions (EX: add instruction) are implemented without a set register length, they are operating on registers of known, but arbitrary bitlength.
*Shown above is how to make vector instructions AND Very Long Instruction Word instructions without ever having to worry about how it’s encoded into machine code.
*MicroCode is just another program, that got compiled and stored without converting to machine code. (though implementing this functionality is a far off)

*You CAN have instructions that convert their inputs into binary, do some operations, and back to ints #That’s how the two’s compliment instruction is implemented
*You CAN have a compiler that turns the instructions into machine code
*You CAN have a decoder that turns machine code into an execution tree

But you don’t need to, to tinker with the instruction set, it’s optional

I wanted to go into details about other CPU ISAs, like Intel’s x86, and Intel’s EPIC [ Explicitly parallel instruction computing - Wikipedia ], and the TIS-100, etc… But I already wrote more then I intended.

TLDR: wanted to talk about fancy computing, ended up talking about how weird my project is instead. Take 1d4 sanity damage now, 1d4 sanity damage next turn

Medic5700 · December 19, 2021, 5:03am

Random Design Note [2021-12-1X]
Fell off the coding wagon, attempting to get back on the coding wagon.

Often you get to a point in a project where it seems like you are attempting to climb Mt. Everest without a map, and google has forsaken you. You lose all enthusiasm and motivation as you spend more time figuring out which direction is the right direction instead of actually solving a problem, with every decision being a hard decision (Analysis Paralysis).

Burnout is real and you are not alone.
(A good read about burnout [The first half is about burnout, the second half is about blueprints])
Friday Facts #356 - Blueprint library for real | Factorio

Often times, it’s good to take a break, shift gears, etc. The hard part is getting back into the groove of working on a project. The hardest part of the hard part is finding joy in what you are doing.

Blah blah blah, in my case, I would periodically try something small, like writing a test case, or updating some documentation. Eventually I retried something small and wound up coding till 6AM without realizing it.
THAT is when I knew I was back on the coding wagon. (But also needed to pump the breaks a little)

TLDR

Medic5700 · December 28, 2021, 5:36pm

Random Update [2021-12-2X] - Making something literally no one asked for

The Parser, turning a string of source code into a parse tree (not an execution tree, that requires compiling). The CPUSimulator I’m making had some unique requirements that required a… custom solution.

Usually, when making a parser, you have a programming language that you know ahead of time. Then you would make a custom parser from scratch to handle that could parse that language. This was not possible in this case, as this was an engine which should accept multiple languages that should be easily modifiable. IE: when using the CPUSimulator, you are also likely to be parsing an already established assembly language, which you will have to build a custom parser for

Requirements (unordered):
Easy to modify
Can be modified for multiple assembly languages
Parses (low level) assembly source code

I tried to find some ready made solutions, but no module I found was particularly easy to use, or could format the output in a way I needed it to be formatted (it’s been a while since I came up with it, so my memory is a little rusty).

Fortunately, since assembly source code is low level, it is ‘usually’ simple (IE: each line corresponds to a single instruction). Thus, I wouldn’t have to do complicated parsing. Thus for simple parsing of uncomplicated assembly source code, it was feasible to make something that was easy to modify.

Blah blah blah, that’s enough brass tax, let’s get to some code.

# Multiplies two numbers together
# Inputs: r[0], t[0]
# Output: t[1]
loop:   jumpEQ  (end, r[0], 0)
            and     (r[1], r[0], 1)
            jumpNE  (zero, r[1], 1)
                add     (t[1], t[0], t[1])
zero:       shiftL  (t[0], t[0])
            shiftR  (r[0], r[0])
            jump    (loop)
end:    halt

We want to attempt to convert the above into a parse tree. Below is a shorter example that shows what a parse tree could look like.

add (r[0], r[1], r[2])
halt

-> # assembly source code converted to a parse tree

Root
    Line
        add
            r
                0
            r
                1
            r
                2
    Line
        halt

First step, tokenize the source code. (We want to retain the line number and column number to). We then shove the entire list of tokens into a series of nodes. The important part is no character is lost at this stage.

Root
    "add"
    " "
    "("
    "r"
    "["
    "0"
    ","
    " "
    "r"
    ... # It's a long list of nodes, but you get the idea

Next, we want to apply a ‘rule’ to the code, say, filter out a line comment. We make a magic function ‘ruleFilterLineComments’ that takes in a node, and removes all nodes between the “#” and a new line. Returns a new parse tree.
Next, say, cast all the ints from strings to actual ints.
Next, split lines.
Next, filter out leading white spaces.
etc.

Here is the interesting part. Since all these ‘magic rule functions’ take in a parse tree, and outputs a parse tree, we can effectively daisy chain them together like LEGO.

def parse(sourceCode : str):
    tokens = tokeniz(sourceCode)
    root = Node("root")
    for i in tokens:
        root.append(Node(i)) # adds a child to root
    
    root = ruleFilterLineComments(root, "#")
    root = ruleCastInts(root)
    root = ruleFilterLeadingWhiteSpace(root)
    root = ruleSplitLines(root, "\n")
    root = ruleNestContainers(root, {"(" : ")", "[" : "]"})
    
    # etc... you get the idea
    
    return root

Below is the actual code that parses the first example. Note the detailed documentation of what goes in, what comes out. Every function (‘rule’) is documented similar to this, to make it easier to understand what each rule does exactly and precisely. (Plus it helps to have good documentation when debugging)

        def parseCode(self, sourceCode : str) -> Tuple[ParseNode, Dict[str, ParseNode]]:
            """Takes a string of source code, returns a parsed instruction tree and a dictionary representing labels/pointers
            
            Takes source code of the form:
                #This is a comment, non-functional example code
                label1:     add(r[0],r[0],r[0])
                            and(r[1],r[2],r[0]) #Another comment
                label2:     jump(label1)
            Returns:
                None                                        :Root           1       lineNum=None    charNum=None
                    None                                    :Line           2       lineNum=2       charNum=31
                        'add'                               :Namespace      3       lineNum=2       charNum=31
                            '('                             :Container      4       lineNum=2       charNum=31
                                None                        :Argument       5       lineNum=2       charNum=33
                                    'r'                     :Namespace      6       lineNum=2       charNum=33
                                        '['                 :Container      7       lineNum=2       charNum=33
                                            0               :Int            8       lineNum=2       charNum=35
                                None                        :Argument       5       lineNum=2       charNum=38
                                    'r'                     :Namespace      6       lineNum=2       charNum=38
                                        '['                 :Container      7       lineNum=2       charNum=38
                                            0               :Int            8       lineNum=2       charNum=40
                                None                        :Argument       5       lineNum=2       charNum=43
                                    'r'                     :Namespace      6       lineNum=2       charNum=43
                                        '['                 :Container      7       lineNum=2       charNum=43
                                            0               :Int            8       lineNum=2       charNum=45
                    None                                    :Line           2       lineNum=3       charNum=31
                        'and'                               :Namespace      3       lineNum=3       charNum=31
                            '('                             :Container      4       lineNum=3       charNum=31
                                None                        :Argument       5       lineNum=3       charNum=33
                                    'r'                     :Namespace      6       lineNum=3       charNum=33
                                        '['                 :Container      7       lineNum=3       charNum=33
                                            1               :Int            8       lineNum=3       charNum=35
                                None                        :Argument       5       lineNum=3       charNum=38
                                    'r'                     :Namespace      6       lineNum=3       charNum=38
                                        '['                 :Container      7       lineNum=3       charNum=38
                                            2               :Int            8       lineNum=3       charNum=40
                                None                        :Argument       5       lineNum=3       charNum=43
                                    'r'                     :Namespace      6       lineNum=3       charNum=43
                                        '['                 :Container      7       lineNum=3       charNum=43
                                            0               :Int            8       lineNum=3       charNum=45
                    None                                    :Line           2       lineNum=4       charNum=32
                        'jump'                              :Namespace      3       lineNum=4       charNum=32
                            '('                             :Container      4       lineNum=4       charNum=32
                                'label1'                    :Token          5       lineNum=4       charNum=39
                
                Dict
                    "label1"    :   0
                    "label2"    :   2
            """
            assert type(sourceCode) is str

            labels : dict = None
            
            #tokenizes sourceCode, and turns it into a Node Tree
            root : ParseNode = self.Node("root")
            for i in self._tokenize(sourceCode):
                root.append(self.Node("token", i[0], i[1], i[2]))

            logging.debug(debugHelper(inspect.currentframe()) + "this is the original code: " + "\n" + repr(sourceCode))
            logging.debug(debugHelper(inspect.currentframe()) + "tokenized code: " + "\n" + str(root))

            #Note: at this point, rules do operations on the Node Tree, but the depth of the Node Tree remains 2

            root = self.ruleFilterLineComments(root, "#")
            logging.debug(debugHelper(inspect.currentframe()) + "ruleFilterLineComments: " + "\n" + str(root))

            root = self.ruleStringSimple(root)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleStringSimple: " + "\n" + str(root))

            root = self.ruleLowerCase(root)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleLowerCase: " + "\n" + str(root))            

            root = self.ruleRemoveLeadingWhitespace(root, [" ", "\t"])
            logging.debug(debugHelper(inspect.currentframe()) + "ruleRemoveLeadingWhitespace: " + "\n" + str(root))

            root = self.ruleRemoveEmptyLines(root)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleRemoveEmptyLines: " + "\n" + str(root))

            root, labels = self.ruleFindLabels(root)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleFindLabels: " + "\n" + str(root) + "\nlabels: " + str(labels))
            i : int = 0
            while i < len(root.child): #removes the label nodes, as they don't need to be executed
                if root.child[i].type == "label":
                    root.remove(root.child[i])
                else:
                    i += 1
            
            root = self.ruleLabelNamespace(root, self.nameSpace)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleLabelNamespace: " + "\n" + str(root))

            root = self.ruleRemoveToken(root, " ", False)
            root = self.ruleRemoveToken(root, "\t", False)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleRemoveToken: " + "\n" + str(root))

            root = self.ruleCastInts(root)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleCastInts: " + "\n" + str(root))

            root = self.ruleCastHex(root)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleCastHex: " + "\n" + str(root))

            # This is where the Node Tree is allowed to go to depth > 2
            root = self.ruleContainer(root, {"(":")", "[":"]"})
            logging.debug(debugHelper(inspect.currentframe()) + "ruleContainer: " + "\n" + str(root))

            # This will roll containers trailing a namespace object into a child of namespace object
            filteredNameSpace : Dict[NameSpaceObject] = {}
            for i, j in self.nameSpace.items():
                if j.type in ["instruction", "directive", "registerBank", "registerAlias"]:
                    filteredNameSpace[i] = j
            root = self.ruleNestContainersIntoInstructions(root, filteredNameSpace, True)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleNestContainersIntoInstructions: " + "\n" + str(root))

            temp : List[self.Node] = self.ruleSplitLines(root, "line", "\n")
            root = self.Node("root")
            for i in temp:
                root.append(i)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleSplitLines: " + "\n" + str(root))

            #removes empty lines/empty line nodes
            i : int = 0
            while i < len(root.child):
                if len(root.child[i].child) == 0:
                    root.remove(root.child[i])
                else:
                    i += 1
            logging.debug(debugHelper(inspect.currentframe()) + "remove empty line nodes: " + "\n" + str(root))

            root = self.ruleSplitTokens(root, "argument", ',', True)
            logging.debug(debugHelper(inspect.currentframe()) + "ruleSplitTokens: " + "\n" + str(root))

            root = self.ruleFilterContainerKeepChildren(root, ["[", "("])
            logging.debug(debugHelper(inspect.currentframe()) + "ruleFilterContainerKeepChildren: " + "\n" + str(root))

            return root, labels

Note: there is a lot of debug logging in there, has proven useful.
Note: While the actual code does the job, there is still a lot of room for improvement, such as some ‘rules’ having more configuration options.

The benefits of this approach of daisy chaining smaller functions together is that you can parse multiple simple assembly programs using the same building blocks (you don’t have to start from scratch with every assembly language). Further, since each block does one thing, it’s ‘relatively’ easy to understand and use. “Relatively”. Information is retained about line number and column number (important later, as the CPUSimluator retains this node tree depending … on things… while running the program, you can effectively point to the exact source code line, or even token, that is being executed)[very situational though].
There are some significant downsides. This is only able to parse simple assembly programing languages. There is some z-fighting (IE: .ascii(“text # Not a Comment”) # A comment; If you filter out line comments first, you’ll cut half the string, causing a mismatch quotation error). This algorithm is very slow, as every ‘rule’ is effectively deep copying and manipulating a large Tree of Nodes, sometimes multiple times per ‘rule’.

TLDR: Welp, I think I made a neat thing. Downside, I don’t think the larger project will be functional by the end of Devember.

[Image Source: https://www.reddit.com/r/ProgrammerHumor/comments/rnhwow/day_before_yesterday/ ]

paulwratt · December 30, 2021, 10:48am

I know a lot of people think GH is development only, and that you dont use issue trackers as blog posts, but Geoff Geerling’s RPi PCIe & CM4 device discussions proove that is BS top to bottom, and totally blown that out of the water, plus they are there for posterity.

I do some posting like this, and the forum makes it hard on other people to comment, because the threads and OP topics can get lost quite easily (eg just look at the advanced administration thread). It can be hard to keep these together with the projects code.

Another way I tried to tackle it was including a blog in the docs folder of the GH project, but they also dont lend themselves to replys, but at least they are in the same code repository.

Anyway, thanks for that extra exposure of your developement process, and actually you have one more day left (here its the 31st in ten miniutes so maybe you have 2 days left)

Cheers

Paul

Medic5700 · January 1, 2022, 12:09am

Random Update [2021-12-3X] - Irony

Use Cases

To start off, the original use case for this project, a CPU Simulator, at it’s core, was a visualization aid. Literally just some ascii text to show which numbers were in which registers, what was written to recently, etc. It was just supposed to be a dummy class to make animations of low level algorithms easier. Then I thought… hm… It would probably be easier to make something that can automatically parse an assembly program, and display it automatically.
…The idea snowballed…

The General Idea

A CPU Simulator that you can reconfigure (IE: 8bit, 16bit, 64bit, different instruction sets, different languages, etc) and run actual assembly programs on. Mainly meant for academic and educational use.

Use Case 1: Highschool

Here's a file to include in your project
Use in your assignment:
    from CPUsim import CPUsim
    yourProgram = "My Assignment"
    CPU = CPUsim()
    CPU.loadProgram(yourProgram)
    CPU.run()

Make an assembly program that can multiply two numbers together

Something simple enough for a highschool teacher to include as part of an assignment to a student. Mainly geared towards just showing the basics (IE: what a register is, how and where numbers are stored, how assembly instructions are turned into machine code) with a live example (EX: Here’s a program, watch it run, you can step through each instruction as it runs). Ergo, this is what I aim to make the default. The nothing configured and just ‘works out of the box’ for simple use cases like this. (This also means I explicitly avoid using any non-default python modules [numpy, Rich, tkinter because not included by default in Ubuntu]. So far, I haven’t used anything non-default yet)

I would include a screen shot of the random custom built emulator we used in highschool, but I didn’t grab a copy of the program. So Y’all’ll have to settle from some random reference sheets from my old notes we used during tests to convert machine code into an assembly program by hand (with pencil and paper) to figure out what that block of one’s and zero’s did. Especially one test. IT WAS FUN when no one could figure out what that program did… BEFORE WE WERE TOUGHT WHAT A FACTORIAL WAS…

Use Case 2: University

More advanced features are needed here, like showing cache, and interrupts, and system calls, and blah blah blah. To adequately show all these options, especially if you want to highlight particular stuff, you need to make it configurable. But also easy to understand so you can walk through a class on what is actually happening (IE: this instruction multiplies two numbers together, but also can trigger an overflow flag, and can raise an exception. And it’s explainable WITHOUT jumping between several different points in a program). But also easy to use enough that you can still give it to a student as a single file for an assignment.

SPIM from a decade ago

Use Case 3: Emulating obscure architectures

This is more for ‘probing the possibility space’ of what’s possible and what is out of scope.
IE: A Commodore 64 6502 processor that can access the first 512 bytes of RAM twice as fast as the rest of RAM. Thus each RAM element can have different properties.
IE: The TIS-100, a made up computer from a video game of the same name. So how would linking together several ‘CPUs’ work?
IE: Emulating ECC memory, implying extra memory per address. Ergo, 8bits per memory element can’t be assumed.
IE: Brainfuck the programming language. A little out there, but how could you make a CPU that runs that code natively, without resorting to making an ‘OS’ that runs the code on top of in the simulator?
IE: The AMD Bulldozer architecture. With the unusual arrangement of two cores sharing a single floating point unit. Is that possible to simulate?
IE: Some Very Long Instruction Word architectures. Good to include support for VLIW. Excluded are the VLIW architectures that allowed you to intercept the un-stabilized output of multi-cycle instructions via very careful timings. (Think intercepting the output of a two cycle multiplication to get only the lower half of bits)
IE: Register File Windows… No. Very No. Just because I’m in Wonderland doesn’t mean I’ll willingly offer my neck to the Queen.

In short, the truly hard part of this project isn’t building it. It’s coming up with a hundred ways to not build it, and navigating all future ideas/features through that minefield. (See a previous post about memory [Devember 2021] CPU Simulating Engine (a rewrite from last Devember) - #2 by Medic5700 )

Use Case 4: Academic Research

Lets say you want to test how various cache sizes and controllers could affect performance on an algorithm.
Alternate, Lets say you want to simulate a Row Hammer Attack on DRAM.

The first thing to consider is repeatability. When you run a test, it should have consistent results, ergo this simulator needs to be deterministic. Deterministic is great until simulating DRAM for a Row Hammer Attack, which requires randomness. To make the CPU simulator deterministic AND be able to simulate DRAM for a Row Hammer attack, the CPU simulator needs functions for a psudo-random number generator at every level. In an instruction, in memory, in… other parts I didn’t think of yet…
The next thing to consider is running a full array of tests with minor variations. IE: running once with fancy animations and stuff is great. Running it silently a hundred times via the command line with logging is better. That is why it’s designed as a module, with functions that make it easier for running as part of a larger project. The downside is it doesn’t have fancy UI interfaces, or the ‘just double click the executable to start’ that a custom application or an emulator like SPIM does.

Alternatively, research would also likely require some very deep low level customization of the CPU Simulator to be usable. Hence the modularity. You can modify the ‘fancy UI’ to suit your needs without worrying about affecting other stuff IE: compartmentalization. Ditto with the memory controllers (once I implement it), instruction sets, etc. To be clear, there is still a level of communication between modules that allows you to modify one section and the changes to be automatically picked up by other modules… but I aim to make those interactions simple and easy to understand. IE: I added an instruction to the instruction Set, the Compiler will get an update that this instruction is added, and it will autofill it with a placeholder. Stuff like that.

In short, all of these use cases requires a separate and custom solution. I decided to attempt to bridge all those use cases with a CPU engine that could adequately handle all those use cases. And since each use case has specific considerations, it affects all other use cases. IE: making something detailed enough for research purposes while also being simple enough to include in a highschool assignment… is difficult.

TLDR

I didn’t get something usable by the end of Devember, but also I think I wrote the equivalent of a small book with this project. Small Yay XD!

A Happy New Year to Y’all and to Y’all a good 2021 Yeet!

Justin_White · February 7, 2022, 4:32am

Nice work, there is a lot here. Wendell mentioned the book Code by Charles Petzold in one of the Level1 News shows, I read it and it was good. It is a good bridge for people just getting exposed to this stuff. Who knows, in few years I may be reading your book with all the information you have here. Nicely done.

Medic5700 · July 8, 2022, 1:57am

Random Design Note [2022-07-07] (TLDR fixing the unittest that crashed my desktop)

A quick anecdote of the importance of crash testing your code

Unit Testing is really good to find errors (if a little boring to write), AND helps with development of functions you haven’t quite nailed down all use cases yet (IE: test driven development). This is a quick story about how a unittest nearly crashed my computer, and how unit testing helped fix the issue.

Setup

# The function being tested, source code simplified for now
def opTwosCompliment(self, 
    funcRead : Callable[[tuple[int | str, int | str]], int], 
    funcWrite : Callable[[tuple[int | str, int | str], int], None], 
    funcGetConfig : Callable[[tuple[int | str, int | str]], dict[str, Any]],

    registerDestination : tuple[int | str, int | str], 
    registerA : tuple[int | str, int | str]
) -> None:
    a : int = funcRead(registerA)
    bitLength : int = funcGetConfig(registerDestination)["bitLength"]

    inputNumber : int = a & (2**bitLength - 1) # Cuts down number to correct bitLength BEFORE operating on it
    bitArray : list[Literal[0, 1]] = [inputNumber >> i & 1 for i in range(bitLength)] # Converts to bit array, index 0 is least significant bit
    bitArray = [not i for i in bitArray] # performs the bitwise NOT operation
    result : int = sum([bit << i for i, bit in enumerate(bitArray)]) # converts bit array to int
    result += 1

    result = result & (2**bitLength - 1) # trunk result

    funcWrite(registerDestination, result)

# The unit test, part of a much larger testing suite
def test_opTwosCompliment_largeRegisterSize04A(self):
    """tests opTwosCompliment on 'Twos r0 = r1' with bitLength '2**20, 2**20' -> 'Twos 0xffff... = 1'
    # the test source code omitted for now

Twos Compliment: A way of converting a signed positive number to a signed negative number, and vice versa (IE: 1 → -1; 0B0001 → 0B1111). The basic algorithm is to perform a logical not on a number, add one (IE: 1 → -1; 0B0001 → NOT → 0B1110 → +1 → 0B1111).
registerDestination, registerA: think of it as the index number of a particular register.
funcRead, funcWrite, funcGetConfig: the API functions that allow interaction with registers
The Registers store positive integers, along with metadata like ‘bitlength’

The important part to note is that, unlike most CPU simulators, you don’t know the bitlength of the registers ahead of time; AND you are explicitly expecting weird use cases (IE: bitlengths to mismatch, large bitlengths, absurdly large bitlengths)

The testing motivation

Having a register with a bitlength of 2**20 (IE: a 256 KiloByte large number) seems absurd, but the point of this CPU simulator is to play with absurd scenarios. Also, since this is a ‘building block’ that is meant to be put together with other ‘building blocks’, it needs to be extra robust. You could place asserts to arbitrarily limit the function, but I wanted to specifically minimize the number of ‘small edge cases’ the user would have to know about when using this program. IE: if there was going to be a limitation of the function, I want it to be a limitation of python, not a limitation of the source code I wrote (in this section at least).

Red Alert Shields Up

The first indication of a problem was when my computer locked up running unittests. I was able to wrestle enough control of my desktop to kill the python process AND notice python taking up 50 GB of memory (commit size)[my desktop has 16 GB of RAM]. Usually when running unittests, you expect at most a ‘FAIL’ or an ‘ERROR’. What you don’t expect is an individual test to be able to crash your desktop.

After some poking around, I narrowed down the problem to the following line

result : int = sum([bit << i for i, bit in enumerate(bitArray)]) # converts bit array to int

While efficient and concise on small bitLengths, it does not scale at larger bitlengths. It is essentially creating a 1 million element array storing 256 KB large numbers, resulting in about 256 GB of memory used BEFORE summing them together. Clearly a better solution is needed.

First I tried replacing the line with something more linear

result : int = sum([bit << i for i, bit in enumerate(bitArray)]) # converts bit array to int
result += 1

# A                                 
# | above Gets replaced with below |
#                                  V

result : int = 1
i : int
bit : Literal[0, 1]
for i, bit in enumerate(bitArray):
    result += bit << i

By having a large testing suite already in place, I can swap out a line with a new algorithm, test it, and be sure I didn’t break something else/another use case.
It works, but is slow. Faster than a desktop crash, but still slow. I decided to tweak the algorithm to see if I could squeeze out some extra performance. Having unittests allowed me to test each change to be sure I didn’t break anything, but also the specified unittest served as a de-facto quick and dirty performance speed test.

# the unittest (test_opTwosCompliment_largeRegisterSize04A) is unable to complete due to crashing my desktop
result : int = sum([bit << i for i, bit in enumerate(bitArray)]) # converts bit array to int
result += 1

# the unittest (test_opTwosCompliment_largeRegisterSize04A) takes about ~65 seconds to complete with below code variation
result : int = 1
i : int
bit : Literal[0, 1]
for i, bit in enumerate(bitArray):
    result += bit << i

# the unittest (test_opTwosCompliment_largeRegisterSize04A) takes about ~65 seconds to complete with below code variation
result : int = 0
i : int
bit : Literal[0, 1]
for i, bit in enumerate(bitArray):
    result |= bit << i
result += 1

My attempts at optimization did not result in any significant performance gain (and possible performance losses), proven by the runtime of unittest test_opTwosCompliment_largeRegisterSize04A. Out of ideas, I turned to Google, and after searching through a number of results, I stumbled across a single offhand one line comment on stackoverflow that set me on the right direction (something along the lines of ‘you can use xor for bit inversion’).

The new algorithm

Instead of trying to optimize the conversion of an array of bits to an int, I can try doing the relevant operations mathematically. IE: instead of converting an int to a bit array, manipulating the bit array, converting back, etc; I could do the operation purely with numbers and let the Python Engine do the heavy lifting to optimize the execution of manipulating a couple spectacularly large numbers.

result : int = ((2**bitLength - 1) ^ inputNumber) + 1 # Performs Twos compliment <= (Not inputnumber) + 1 <= (bitmask xor inputnumber) + 1
result = result & (2**bitLength - 1) # trunk result

The use of the above results in unittest test_opTwosCompliment_largeRegisterSize04A taking about ~25 seconds to complete (A large improvement over the ~65 seconds of the previous attempts). There is also potential that performance will improve as the underlying python execution engine improves over time. Python 3.11 performance improvements look promising ( What’s New In Python 3.11 — Python 3.11.0b3 documentation )( What’s New In Python 3.11 — Python 3.11.0b3 documentation )

The Result

def opTwosCompliment(self, 
    funcRead : Callable[[tuple[int | str, int | str]], int], 
    funcWrite : Callable[[tuple[int | str, int | str], int], None], 
    funcGetConfig : Callable[[tuple[int | str, int | str]], dict[str, Any]],

    registerDestination : tuple[int | str, int | str], 
    registerA : tuple[int | str, int | str]
) -> None:
    """Takes value from registerA, performs twos compliment, stores result in registerDestination
            
    #TODO if destination register is larger, should extend most significant bit to larger register size
    """
    assert callable(funcRead)

    assert callable(funcWrite)

    assert callable(funcGetConfig)

    assert type(registerDestination) is tuple or type(registerDestination) is list
    assert len(registerDestination) == 2
    assert all([type(i) is str or type(i) is int for i in registerDestination])
    assert all([i >= 0 for i in registerDestination if type(i) is int])
    assert self.isISARegisterVector(registerDestination)

    assert type(registerA) is tuple or type(registerA) is list
    assert len(registerA) == 2
    assert all([type(i) is str or type(i) is int for i in registerA])
    assert all([i >= 0 for i in registerA if type(i) is int])
    assert self.isISARegisterVector(registerA)

    a : int = funcRead(registerA)
    bitLength : int = funcGetConfig(registerDestination)["bitLength"]

    # the input number would usually be cut down to the correct size before operating on it
    # but since this is a logical xor operation, all relavent bits stay within the 'trunk result' later on
    # inputNumber : int = a & (2**bitLength - 1) # Cuts down number to correct bitLength BEFORE operating on it
    result : int = ((2**bitLength - 1) ^ inputNumber) + 1 # Performs Twos compliment <= (Not inputnumber) + 1 <= (bitmask xor inputnumber) + 1

    result = result & (2**bitLength - 1) # trunk result

    funcWrite(registerDestination, result)


def test_opTwosCompliment_largeRegisterSize04A(self):
    """tests opTwosCompliment on 'Twos r0 = r1' with bitLength '2**20, 2**20' -> 'Twos 0xffff... = 1'
        
    'Twos r0 = r1' with bitLength '2**20, 2**20' -> 'Twos 0xffff... = 1'
    -> # create registers
    r0 = [0, 0];                    value = 2**2**20 - 1; bitLength = 2**20
    r1 = [0, 1];                    value = 0; bitLength = 2**20
    ->
    opTwosCompliment(
        funcRead                    = funcDummyRead
        funcWrite                   = funcDummyWrite
        funcGetConfig               = funcDummyConfig
        registerDestination         = [0, 1]
        registerA                   = [0, 0]
    )
    -> # input
    registerA                       = 2**2**20 - 1
    ->
    'Twos 0xffff... = 1'
    -> # output
    registerDestination             = 1
    -> # written
    r1 = [0, 1];                    value = 1; bitLength = 2**20
    """
        
    r0 : int = 2**2**20 - 1
    r1_out : int = 1

    MMMU : self.dummyMMMU = self.dummyMMMU()
    MMMU.createRegister(            0, 0,                                   value = r0, bitLength = 2**20)
    MMMU.createRegister(            0, 1,                                   value = 0, bitLength = 2**20)

    returnValue : None = self.ISA.opTwosCompliment(
        funcRead                                                            = MMMU.dummyReadWrapper,
        funcWrite                                                           = MMMU.dummyWriteWrapper,
        funcGetConfig                                                       = MMMU.dummyGetConfigWrapper,
        registerDestination                                                 = [0, 1],
        registerA                                                           = [0, 0]
    )

    expectedActivity : list[tuple[str, str | int, str | int, int]] = [      # order matters
        ('read',                    0, 0,                                   r0),
        ('getConfig',               0, 1,                                   2**20),
        ('write',                   0, 1,                                   r1_out)
    ]

    resultActivity : list[tuple[str, str | int, str | int, int]] = MMMU.getActivity()

    expectedRegisters : list[tuple[str | int, str | int, int]] = [          # order matters
        (0, 0,                      r0),                                    # input
        (0, 1,                      r1_out)                                 # output
    ]

    resultRegisters : list[tuple[str | int, str | int, int]] = [(i, j, MMMU.readWrittenRegister(i, j)) for i, j, _ in expectedRegisters]

    self.assertEqual(returnValue, None,
        f'\nAssert function return value is None:\nExpected None\nResult {returnValue}')
    self.assertTrue(all([i in resultActivity for i in expectedActivity]),
        f'\nAssert Activities Done:\nExpected activity:\n\t{expectedActivity}\nResult activity:\n\t{resultActivity}')
    self.assertTrue(all([i == j for i, j in zip(expectedActivity, resultActivity)]),
        f'\nAssert Activities Done In Order:\nExpected activity:\n\t{expectedActivity}\nResult activity:\n\t{resultActivity}')
    self.assertTrue(all([i == j for i, j in zip(expectedRegisters, resultRegisters)]),
        f'\nAssert Registers Correct Value:\nExpected registers:\n\t{expectedRegisters}\nResult registers:\n\t{resultRegisters}')

Above is the current complete code (incomplete, work in progress, and a little out of context)

By having unit tests set up allowed me to quickly diagnose a problem, and fix it with confidence that I wasn’t accidentally creating more unseen problems. It also helps with development when you’re not sure how to implement a function, but you can instead focus on how the function should behave by writing unit tests BEFORE implementing that functionality. Really helps nail down what you are doing when having minor analysis paralysis, or lack of confidence in a change you are about to make.

TLDR

python unittest meme
I have more work to do =P