
[Devember 2021] Architecting a TTA Processor Core

Hardware engineer here joining in, longtime lurker but you might recognize ‘UnexpectedSpud’ from the Twitch streams. Probably not. This one’s not really a software project, but hopefully someone finds it interesting… it’s been a pet project for a long time, but I’ve never mustered the motivation to do anything beyond some back-of-the-napkin theorycrafting until now.

The ultimate goal is to design and implement a single-cycle (not pipelined) transport-triggered architecture (TTA) processor core. Put simply, this means designing a CPU with one instruction: MOVE. The whole thing, from the ground up.

The design will be implemented in VHDL and targeted at an undecided FPGA, for which I intend to spin one board for the processor module itself (core, memory, system bus interface) and another to act as your traditional motherboard for the platform and provide enough I/O for demonstration, probably ATX form factor. Sounds like a lot of work but with my own pre-existing board component libraries and the wonders of modern FPGA hardware, I think it’s realistic in two months. Only issue might be the turnaround time on JLCPCB’s $4+shipping board runs, but we’ll cross that bridge when we reach it.

Here’s the executive summary:

      - Processor Module CCA
      - ATX I/O Baseboard CCA
      - Processor Core Bitstream/VHDL Package
      - Machine Code First Stage Bootloader
      - Machine Code I/O Demonstration Payload
      - Theory of Operation
      - ISA Datasheet (Instruction set, etc)
      - Block Diagrams


  • Half of the work here is just getting a modern FPGA to configure itself. Between power requirements, timing requirements, configuration protocols/etc, it will be nice to finally pile together a generic reference design for my own use in future projects that need an FPGA.

  • I’ve never really version controlled any of my personal projects. I intend to take that a bit further and learn how to actually set up continuous integration (probably Jenkins) and version control (probably SVN) for myself, since my only other experience with the matter is from Git memes and one employer’s half-assed TeamCity deployment.

And then maybe for next year’s Devember I can do an assembler and/or compiler in C, but that’s way beyond what I can handle in a little over two months.

Godspeed, gentlemen. More to follow.



Development Log can be found at

I’ll bump the thread whenever major updates go up, but otherwise all details will be over there!


You’ve got my attention =P

How will you implement an ‘exchange’ instruction? That is, exchange a value in a register with a value in memory. These are needed to implement semaphores, and they have to execute in one cycle. If you don’t have one, then you cannot have non-maskable interrupts.

Yeah, there’s going to be a massive information dump in the next couple days… I’m putting together a quick HTML site now with all the info/dev log that I’ll just host on an old parked web domain + link to instead of blowing up this thread.

Keep in mind, this is going to be a pretty basic processor core - think like the old MOS 6502. Direct linear memory addressing, no cache, no hardware multitasking or permission levels. I have some ideas for adding things like that in the future, but for now I’ll be happy to have something working at the end of December. Start small.

An ‘exchange’ will be a load+store+move, not atomic. The way I have it laid out on paper: there’s a single interrupt vector with a queue of identifier tags which indicate the interrupt source. There is no non-maskable interrupt vector… this core itself doesn’t generate interrupts; there isn’t a counterpart to the x86 exception/trap.
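A quick Python sketch of why the non-atomicity matters (the address, register names, and values here are all made up for illustration): because the exchange is three separate moves, anything that slips in between them, like an interrupt handler, can observe the half-finished swap.

```python
# Model of a non-atomic exchange built from load + store + move.
# The address, register names, and values below are illustrative only.

mem = {0x1000: 0xAAAA}              # system memory
reg = {"R0": 0x5555, "TMP": 0}      # register file

def exchange_r0(addr, interleave=None):
    reg["TMP"] = mem[addr]          # load:  memory -> temp register
    mem[addr] = reg["R0"]           # store: register -> memory
    if interleave:
        interleave()                # e.g. an interrupt firing mid-exchange
    reg["R0"] = reg["TMP"]          # move:  temp -> register

observed = []
exchange_r0(0x1000, interleave=lambda: observed.append((reg["R0"], mem[0x1000])))
# Mid-exchange, R0 still held its old value while memory already had a copy
# of it too - exactly the window a semaphore implementation has to worry about.
```

That window is what the question above was getting at; software on this core will have to mask interrupts around any exchange that guards a semaphore.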

The nice part of building something ground-up from the gate level is that I can disregard programming paradigms that exist right now and make my own. In this case, until I add a lot more hardware complexity, the software/kernel is going to be responsible for basically everything. We can deal with that next Devember haha


Alright, the Dev Log’s up - nothing there yet, but I’ve updated that reply I reserved up at the top with a link.

Have had a domain name idling for a while now, so I decided to spin up a Linode and unpark it. Figure it’ll work out better to info dump over there instead of blowing up this thread every day.


First couple pieces of documentation are up.

I’ve just learned a valuable lesson on cache busting after wondering why nothing was actually updating on the Dev Log page. Spent a solid two hours trying to get Apache to acknowledge the .htaccess file, which was fun. Everything’s in good working order again.


No updates for the last couple weeks, since I realized I was spending far more time redrawing documentation after every conceptual improvement than actually accomplishing anything.

The processor core itself is largely done, and I’m going to get everything uploaded with commentary over the four-day weekend here.

^ That simulation isn’t really going to mean anything to anyone yet, but you’re mostly looking at the top-level system memory bus (which connects the core to the main memory controller, bootloader, local peripherals, and remote I/O bus… shown is the bus port into the execution core) and I/O buffers. Just know that it starts up and reaches the first stage bootloader, which is currently nonexistent.

No activity at the top level pins, because the first stage bootloader is a chunk of 256x16b entirely in FPGA fabric; its job will be to load the second stage bootloader (equivalent to the x86 BIOS) from an I2C EEPROM, but that machine code isn’t written yet. It has a 32kb (1k x 32) write-through cache, but there aren’t any reads/writes to system memory to show yet, so it’s just idling.
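For anyone unfamiliar with the term: write-through just means every store is forwarded to backing memory immediately, so memory never goes stale. A minimal Python model of the idea (the direct-mapped policy and line count here are my own simplifying assumptions, not the actual design):

```python
# Minimal direct-mapped write-through cache model. The real design is
# 1k x 32; the mapping policy below is an assumption for illustration.

class WriteThroughCache:
    def __init__(self, backing, lines=1024):
        self.backing = backing   # dict standing in for system memory
        self.lines = lines
        self.store = {}          # line index -> (tag, data)

    def read(self, addr):
        idx = addr % self.lines
        tag, data = self.store.get(idx, (None, None))
        if tag != addr:                        # miss: fill from memory
            data = self.backing.get(addr, 0)
            self.store[idx] = (addr, data)
        return data

    def write(self, addr, value):
        self.backing[addr] = value             # write-through: memory first
        self.store[addr % self.lines] = (addr, value)  # keep line coherent

sysmem = {}
cache = WriteThroughCache(sysmem)
cache.write(0x40, 99)   # lands in both the cache line and sysmem at once
```

The upside for a simple core like this one is that there’s never a dirty line to flush; the downside is that every store pays the memory-bus latency.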

I’ll probably get the PCB put together first so I can write that bootloader while it’s in flight, but first I’ve got three weeks of progress to get documented. :kissing:


Here’s a quick overview of the instruction set:

Words are 16 bits, so each instruction word is in the form 0xSSTT, where SS is the source address and TT is the target address. Let’s say you want to add the contents of general purpose registers 0 and 1 ($R.0, $R.1), and then branch to the address in general purpose register 7 ($R.7) if the result is zero:

R.0   -> ADD.A    // Move R.0 to Add Operand A
R.1   -> ADD.B    // Move R.1 to Add Operand B
R.7   -> GA.0     // Move R.7 to Global Address [15:0]
ADD.Y -> BR.0     // Branch if ADD.Y == 0

…which translated to machine code looks something like:
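The actual address map isn’t published yet, so here’s only a hedged Python sketch of the 0xSSTT packing, with made-up source/target addresses standing in for the real ones:

```python
# Illustrative encoder for the 0xSSTT move word. Every numeric address
# below is a placeholder; the real address map is still in the dev log.

ADDR = {"R.0": 0x00, "R.1": 0x01, "R.7": 0x07,
        "ADD.A": 0x10, "ADD.B": 0x11, "ADD.Y": 0x12,
        "GA.0": 0x20, "BR.0": 0x21}

def encode(src, tgt):
    """Pack source/target addresses into one 16-bit instruction word."""
    return (ADDR[src] << 8) | ADDR[tgt]

program = [encode("R.0", "ADD.A"),    # R.0   -> ADD.A
           encode("R.1", "ADD.B"),    # R.1   -> ADD.B
           encode("R.7", "GA.0"),     # R.7   -> GA.0
           encode("ADD.Y", "BR.0")]   # ADD.Y -> BR.0
```

Note there’s no opcode field at all: the "which operation" information lives entirely in where you move data, which is the whole TTA trick.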


Thanks for the post, I understood your example. So the “address” is not purely an address; it acts as data/instruction depending on its position and its value.

Does the compute happen while reading (the source), or do there exist two sets of registers (one each for source and target, with the same address) with compute happening on every clock?

Yep, that’s correct - ‘address’ might not be a good term for it, I’ll see if I can think of a better way to describe that.

Here’s a simplified snippet of the core’s datapath:

The way it’s written, all calculations execute in parallel on every clock, but only one register is written on any given clock. A better way to put it: one target is updated and all the others hold their value on every clock.

Each target is usually a register (‘control’ targets like branch/jump don’t actually hold anything)… the enable signal for each of these registers is decoded from the instruction target field. The registers are all hardwired to their respective functional blocks. The results from those functional blocks are returned through a single massive writeback bus, a few hundred bits wide, into a multiplexer that selects the data being moved according to the instruction source field.

The really nice part about this architecture is that it’s extremely easy to add new instructions/operations - just put another set of target registers on the bus, give them an ‘address’, and add their output into the source decode mux. Because adding new operations only adds width to the datapath and doesn’t make it longer/slower, the only performance hit comes from making that big source decode multiplexer larger.
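A behavioral Python sketch of that datapath, keeping the one-write-per-clock rule and the source mux as described above (the register names, addresses, and 16-bit adder width are my own placeholders):

```python
# Behavioral model of the datapath: target registers with decoded enables,
# functional blocks hardwired to them, and one writeback/source mux.

class Datapath:
    def __init__(self):
        # Target registers (the GP registers double as sources here).
        self.targets = {"ADD.A": 0, "ADD.B": 0, "R.0": 0, "R.1": 0}
        # Source mux inputs: every output is recomputed "combinationally".
        self.sources = {
            "ADD.Y": lambda: (self.targets["ADD.A"] + self.targets["ADD.B"]) & 0xFFFF,
            "R.0":   lambda: self.targets["R.0"],
            "R.1":   lambda: self.targets["R.1"],
        }

    def clock(self, src, tgt):
        # One MOVE per clock: the source field picks a mux input, the
        # target field enables exactly one register; all others hold.
        self.targets[tgt] = self.sources[src]()

    def add_operation(self, out_name, out_fn, operand_regs):
        # "Adding an instruction" = more target registers + one mux input.
        for r in operand_regs:
            self.targets[r] = 0
        self.sources[out_name] = out_fn

dp = Datapath()
dp.targets["R.0"], dp.targets["R.1"] = 2, 3
dp.clock("R.0", "ADD.A")
dp.clock("R.1", "ADD.B")
dp.clock("ADD.Y", "R.0")   # R.0 now holds 5
```

The add_operation hook mirrors the "just put another set of target registers on the bus" point: nothing in clock() has to change when a new functional block is bolted on.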