Base CS from zero CS · 03 · 05

A toy CPU

Put it all together: a toy CPU with two registers and a four-instruction set. Walk a small program — load two numbers, add them, store the result — through the fetch–decode–execute loop step by step.

CS ◷ 25 min

You have learned about instructions, the fetch–decode–execute cycle, registers, and machine code as individual ideas. Now it is time to put them all together in one place and watch them work as a unified system.

In this lesson you will build a mental model of a complete — if tiny — CPU. It has two registers, a four-instruction set, and a small memory. Then you will trace a real program through it: load two numbers from memory, add them, store the result. Every single fetch, decode, and execute step is shown explicitly. By the end, the CPU will no longer be a black box. It will be a machine you can run in your head.

Goal

After this lesson you can describe the components of a minimal CPU (registers, PC, memory, ALU), read and write programs in the toy instruction set, and trace the fetch–decode–execute cycle step by step for a complete small program.

The toy CPU’s components. Our toy CPU has these parts:

Two general-purpose registers: R0 and R1. Each holds one 8-bit value (0–255).
A program counter (PC): holds the memory address of the next instruction to fetch. Also 8-bit, so it can address memory cells 0–255.
Memory: 256 cells, each holding one 8-bit value. Addresses 0–7 will hold our program (the instructions). Addresses 200, 201, and 202 will hold our data (the input numbers and the result).
An Arithmetic Logic Unit (ALU): a circuit that can add two 8-bit values and store the result in a register.

The toy CPU cannot do very much — it is designed to demonstrate the essentials, not to be practical. Real CPUs have 16–31 general-purpose registers, can address gigabytes of memory, and have instruction sets with hundreds of opcodes. But the principles are the same: you will see the same four-part anatomy — registers, PC, memory, ALU — inside every real processor you ever work with.

The toy instruction set. Our CPU understands exactly four instructions. Each instruction is exactly two bytes long: an instruction byte followed by an operand byte. The PC advances by 2 after each non-JUMP instruction.

Instruction byte layout (byte 0):

Bits 7–6	Bit 5	Bits 4–0	Meaning
opcode	reg	0 (unused)	opcode selects the operation; bit 5 selects the register (0 = R0, 1 = R1)

Operand byte (byte 1): an 8-bit value (0–255) — a memory address for LOAD/STORE/JUMP, or 0x00 (unused) for ADD.

The four opcodes:

Opcode (bits 7–6)	Bit 5	Byte 1	Name	Meaning
`00`	reg	address (0–255)	LOAD Ra, addr	Load memory[addr] into Ra
`01`	reg	address (0–255)	STORE addr, Ra	Store Ra into memory[addr]
`10`	0	0x00 (unused)	ADD R0, R1	R0 ← R0 + R1. Bit 5 and byte 1 are ignored.
`11`	0	address (0–255)	JUMP addr	Set PC to addr. Bit 5 ignored.

Full bit-level example for our four program instructions:

LOAD R0, 200:
  Byte 0 = 00_0_00000 = 0x00   (opcode 00 = LOAD, bit 5 = 0 → R0)
  Byte 1 = 11001000  = 0xC8    (decimal 200 — the memory address)

LOAD R1, 201:
  Byte 0 = 00_1_00000 = 0x20   (opcode 00 = LOAD, bit 5 = 1 → R1)
  Byte 1 = 11001001  = 0xC9    (decimal 201)

ADD R0, R1:
  Byte 0 = 10_0_00000 = 0x80   (opcode 10 = ADD; bit 5 and byte 1 unused)
  Byte 1 = 00000000  = 0x00    (unused, always 0)

STORE 202, R0:
  Byte 0 = 01_0_00000 = 0x40   (opcode 01 = STORE, bit 5 = 0 → R0)
  Byte 1 = 11001010  = 0xCA    (decimal 202 — the destination address)

Because byte 1 is a full 8-bit field, it can address any of the 256 memory cells — including cells 200, 201, and 202. Our program fits entirely within this range.

▸Why this works

Why only four instructions? Real instruction sets have hundreds of opcodes. But you only need a handful to demonstrate everything that matters: data movement (LOAD/STORE), computation (ADD), and control flow (JUMP). Every more complex instruction is a combination or extension of these ideas. nand2tetris uses a similarly minimal instruction set (called “Hack”) to build a complete working computer from first principles; the entire language fits in 16 bits.

The initial state of memory. Before the program runs, the OS loads the program’s instruction bytes into memory starting at address 0, and places the input data at addresses 200 and 201. The PC is set to 0.

Each instruction is 2 bytes, so our four instructions occupy addresses 0–7:

Address  Content    Meaning
0        0x00       LOAD R0, 200 — byte 0 (instruction byte)
1        0xC8         ← byte 1 (operand: address 200)
2        0x20       LOAD R1, 201 — byte 0
3        0xC9         ← byte 1 (operand: address 201)
4        0x80       ADD R0, R1 — byte 0
5        0x00         ← byte 1 (unused)
6        0x40       STORE 202, R0 — byte 0
7        0xCA         ← byte 1 (operand: address 202)
...
200      25             ← first input number
201      17             ← second input number
202      0              ← result slot (empty)

Registers before execution: R0 = 0, R1 = 0, PC = 0.

For readability, the trace below labels each instruction by its mnemonic rather than its raw byte pair.

The program counter drives everything. Before tracing each cycle in detail, notice the key role of the PC:

The PC starts at 0 — so the first instruction fetched begins at address 0.
After each non-JUMP instruction, the PC increases by 2 (because each instruction is 2 bytes in our toy CPU).
If the program contained a JUMP, the PC would be set to the jump target address instead.

The PC is the cursor that sweeps through the instruction sequence. It is the only reason the CPU knows which instruction is “next.”

LOAD R0,200

0–1

LOAD R1,201

2–3

ADD R0,R1

4–5

STORE 202,R0

6–7

···

200

201

202

Memory before execution. Each instruction occupies 2 bytes. Instructions occupy addresses 0–7. Data inputs sit at 200 and 201; the result slot at 202 is 0. PC = 0 (highlighted instruction).

Worked example

Full trace of the four-instruction program.

We track: PC (program counter), R0, R1, and the content of memory address 202 after each cycle.

Cycle 1

State before: PC = 0, R0 = 0, R1 = 0, mem[202] = 0.

Fetch: Read 2 bytes starting at PC = 0. Bytes = 0x00 0xC8. Instruction = LOAD R0, 200.
Decode: Opcode bits 7–6 = 00 → LOAD. Bit 5 = 0 → destination register R0. Byte 1 = 0xC8 = 200 → address field = 200.
Execute: Read memory[200]. Result = 25. Write 25 into R0.
PC advance: PC = 0 + 2 = 2.

State after: PC = 2, R0 = 25, R1 = 0, mem[202] = 0.

Cycle 2

State before: PC = 2, R0 = 25, R1 = 0, mem[202] = 0.

Fetch: Read 2 bytes starting at PC = 2. Bytes = 0x20 0xC9. Instruction = LOAD R1, 201.
Decode: Opcode bits 7–6 = 00 → LOAD. Bit 5 = 1 → destination register R1. Byte 1 = 0xC9 = 201 → address field = 201.
Execute: Read memory[201]. Result = 17. Write 17 into R1.
PC advance: PC = 2 + 2 = 4.

State after: PC = 4, R0 = 25, R1 = 17, mem[202] = 0.

Cycle 3

State before: PC = 4, R0 = 25, R1 = 17, mem[202] = 0.

Fetch: Read 2 bytes starting at PC = 4. Bytes = 0x80 0x00. Instruction = ADD R0, R1.
Decode: Opcode bits 7–6 = 10 → ADD. No address operand. Sources = R0, R1. Destination = R0.
Execute: ALU computes R0 + R1 = 25 + 17 = 42. Write 42 into R0.
PC advance: PC = 4 + 2 = 6.

State after: PC = 6, R0 = 42, R1 = 17, mem[202] = 0.

Cycle 4

State before: PC = 6, R0 = 42, R1 = 17, mem[202] = 0.

Fetch: Read 2 bytes starting at PC = 6. Bytes = 0x40 0xCA. Instruction = STORE 202, R0.
Decode: Opcode bits 7–6 = 01 → STORE. Bit 5 = 0 → source register R0. Byte 1 = 0xCA = 202 → destination address = 202.
Execute: Write R0’s value (42) into memory[202].
PC advance: PC = 6 + 2 = 8.

State after: PC = 8, R0 = 42, R1 = 17, mem[202] = 42.

The program is done. Memory address 202 now holds 42, the sum of the two input numbers (25 + 17). The CPU will attempt to fetch the instruction at address 8 — but there is no fifth instruction in our program, so in a real system the OS would have detected the end of the program and stopped the process before reaching address 8.

The entire computation required exactly four fetch–decode–execute cycles. Each cycle was the same three-step structure. The CPU never needed to “understand” what it was computing — it just ran its loop.

▸Edge cases

What happens when the PC runs off the end? In our toy example the PC would reach address 8 and try to fetch whatever bytes are there — garbage, or perhaps another program’s code. Real operating systems prevent this: the program binary contains metadata telling the OS where the program ends (or there is an explicit HALT instruction or a system call to exit). The OS sets up memory protection so the program cannot execute beyond its own region. In practice, the last instruction of almost every program is a call to the OS “exit” system call, which terminates the process cleanly.

Practice 0 / 5

In our toy CPU, each instruction is 2 bytes. After executing the instruction that starts at PC = 4 (not a JUMP), what is the new value of PC?

Before cycle 3 in our trace, R0 = 25 and R1 = 17. After ADD R0, R1 executes, what value does R0 hold?

After cycle 4 (STORE 202, R0) in our trace, what value does memory address 202 hold?

How many fetch–decode–execute cycles did our 4-instruction program require to compute 25 + 17 and store the result?

At the start of cycle 1, the PC = 0. After all 4 cycles complete, PC = ?

Check yourself

Quiz

In the toy CPU trace, what is the role of the ADD instruction compared to the LOAD instructions?

Recap

A minimal CPU needs: registers (fast working storage — R0, R1), a program counter (a register holding the next instruction’s address), memory (byte-addressed cells for instructions and data), and an ALU (the circuit that computes arithmetic). The CPU’s behaviour is entirely captured by the fetch–decode–execute cycle: fetch the instruction bytes at PC, decode the opcode and operands, execute the corresponding action, advance PC. In our toy ISA each instruction is 2 bytes — an instruction byte (opcode + register select) and an operand byte (8-bit address or value) — so PC advances by 2 per cycle and all 256 memory addresses are reachable. A four-instruction program (LOAD, LOAD, ADD, STORE) runs in four cycles and computes a sum without the CPU ever “knowing” it was adding — it only ran its loop. Every larger program — no matter how complex — is the same loop running longer, on more instructions, with a larger register file and more memory. The mechanism never changes. Now when you step through code in a debugger and watch the program counter advance instruction by instruction, you are watching this exact loop — and you can predict what the CPU will do next before the debugger shows you.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Tiny Stack VMBuild the machine that runs the machine. You define a small bytecode instruction set, write an assembler that turns readable mnemonics into bytes, and write the interpreter loop that fetches, decodes, and executes them one at a time. By the end you will have demystified the whole stack between a line of source and a running program — because you wrote every layer of it yourself.