Syntax checking complete-ish

16 Sep, 2025

Progress update! Nachos—the Chip‑8 IDE I’m building—now has a feature‑complete tokenizer and syntax checker for the Octo language. It successfully parses every Octo example I’ve thrown at it except one. I’ll share that tiny exception later, but first, a quick tour of how the tokenizer and parser work, what they catch, and what I learned along the way.

Tokenization: turning text into tokens

Tokenization is the first step in compilation. The source text becomes a stream of tokens—objects that carry a type (Identifier, Number, Plus), position (line, column), and sometimes a value (identifier name, numeric literal, string contents).

For example, this Octo snippet:


# Our first do-nothing function
: noop
return

becomes this list of token objects:

Colon(line=0, column=0)
Identifier(name="noop", line=0, column=2)
Return(line=1, column=0)

The tokenizer:

strips comments and whitespace,
recognizes identifiers, numbers, and strings,
identifies symbols/operators like +, -, :, <<, :=,
tags keywords like Return, Scroll-Left, Exit, etc.

Tokenization does not validate logic or structure. It doesn’t know if you forgot a parameter or mismatched parentheses—it just recognizes the pieces.

Tokenizer shape (pseudocode)

A tokenizer typically reads characters, decides what token kind could start at that position, consumes the rest, and emits a token:

// Pseudocode
while (canContinue()) {
    val ch = peekChar()
    when {
        ch == '#' -> consumeComment()
        ch.isWhitespace() -> consumeWhitespace()
        ch.isLetter() || ch == '_' || ch == ':' -> consumeIdentifierOrLabelOrDirective()
        ch.isDigit() -> consumeNumber() // e.g., 0x.., 0b.., decimal
        ch == '"' -> consumeStringOrError()
        else -> consumeSymbolOrError() // operators, punctuation, or unknown
    }
}
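To make that loop concrete, here is a minimal runnable sketch in Kotlin. It is not the Nachos tokenizer: the `Token` class, the `tokenize` and `classify` functions, and the tiny keyword set are all hypothetical, and it leans on the simplifying assumption that everything other than comments and strings is a whitespace-delimited word.

```kotlin
// Hypothetical token model; the real Nachos classes may differ.
data class Token(val type: String, val text: String, val line: Int, val column: Int)

fun tokenize(source: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var pos = 0
    var line = 0
    var lineStart = 0 // index of the first character of the current line

    while (pos < source.length) {
        val ch = source[pos]
        val column = pos - lineStart
        when {
            ch == '\n' -> { pos++; line++; lineStart = pos }
            ch.isWhitespace() -> pos++
            ch == '#' -> {
                // Comment: skip to end of line (the newline itself is handled above).
                while (pos < source.length && source[pos] != '\n') pos++
            }
            ch == '"' -> {
                // String: scan for a closing quote on the same line.
                var end = pos + 1
                while (end < source.length && source[end] != '"' && source[end] != '\n') end++
                if (end < source.length && source[end] == '"') {
                    tokens += Token("String", source.substring(pos + 1, end), line, column)
                    pos = end + 1
                } else {
                    tokens += Token("Error", "Unterminated string literal", line, column)
                    pos = end
                }
            }
            else -> {
                // Assumption: everything else is a whitespace-delimited word.
                var end = pos
                while (end < source.length && !source[end].isWhitespace()) end++
                val word = source.substring(pos, end)
                tokens += Token(classify(word), word, line, column)
                pos = end
            }
        }
    }
    return tokens
}

// Illustrative subset of token kinds only.
fun classify(word: String): String = when {
    word == ":" -> "Colon"
    word == ":=" -> "Assignment"
    word == "return" -> "Return"
    word.toIntOrNull() != null -> "Number"
    word.startsWith("0x") && word.drop(2).toIntOrNull(16) != null -> "Number"
    word.startsWith("0b") && word.drop(2).toIntOrNull(2) != null -> "Number"
    else -> "Identifier"
}

fun main() {
    tokenize(": noop\nreturn\nv0 := \"oops").forEach(::println)
}
```

The tokenizer never throws: a bad string becomes an Error token with a position, and scanning carries on from the end of the line.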

Some errors can be detected here. For instance, Nachos emits an Error token if it sees an opening quote without a matching closing quote:

// Example outcome
String(line=10, column=15, value="incomplete
Error(line=10, column=15, message="Unterminated string literal")

Parsing: building structure and catching mistakes

Once we have tokens, the parser turns them into higher‑level constructs and performs syntax checks. Octo is a small assembly-like language:

arithmetic groups with parentheses and evaluates left‑to‑right,
types are limited (numbers, identifiers, strings),
macros are simple,
control flow is jumps/ifs/loops,
registers are manual.

This keeps the parse tree fairly linear and the parser straightforward. Unlike tokenization, parsing enforces rules and evaluates context:

expands and validates macros,
verifies identifiers are defined,
matches braces/parentheses,
checks expected types and arities,
annotates nodes with metadata for the assembler.

The result is a list of ParseTokens. When something goes wrong, the parser emits Error nodes with clear, positional messages—but keeps going to surface as many issues as possible in one pass.
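The "emit an error and keep going" idea can be sketched for a single statement shape, `register := value`. This is a simplified illustration, not the actual Nachos parser: `Tok`, `Node`, `Assign`, `ParseError`, and `parse` are hypothetical names, and real Octo has many more statement forms.

```kotlin
// Hypothetical, simplified types; not the actual Nachos ParseToken classes.
data class Tok(val type: String, val text: String, val line: Int)

sealed interface Node
data class Assign(val register: String, val value: Tok) : Node
data class ParseError(val message: String, val line: Int) : Node

fun parse(tokens: List<Tok>): List<Node> {
    val nodes = mutableListOf<Node>()
    var i = 0
    while (i < tokens.size) {
        val target = tokens[i]
        if (target.type != "Register") {
            nodes += ParseError("Expected a register, found '${target.text}'", target.line)
            i++ // skip the offending token and keep parsing
            continue
        }
        val op = tokens.getOrNull(i + 1)
        if (op == null || op.type != "Assignment") {
            nodes += ParseError("Expected ':=' after ${target.text}", target.line)
            i++
            continue
        }
        val value = tokens.getOrNull(i + 2)
        if (value == null) {
            nodes += ParseError("Unexpected end of program", op.line)
            i += 2
            continue
        }
        // Record either a valid assignment or an error, then move past the statement.
        nodes += if (value.type == "Number" || value.type == "Register") {
            Assign(target.text, value)
        } else {
            ParseError("Expected Register or Number", value.line)
        }
        i += 3
    }
    return nodes
}
```

Because every branch advances the index, one bad statement cannot stall the loop or hide the errors that come after it.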

Example: macro expansion

Input:

:macro foo SIZE {
v0 := SIZE
i := CALLS # CALLS is the number of times this macro has been expanded.
}
foo 0x42
foo 0x42

Parsed summary:

Macro, tokens size: 11, lines [1, 2, 3, 4]
MacroExpand, tokens size: 2, lines [5]
Assignment, tokens size: 3, lines [2, 5]
IAssign, tokens size: 3, lines [3, 0]
MacroExpand, tokens size: 2, lines [6]
Assignment, tokens size: 3, lines [2, 6]
IAssign, tokens size: 3, lines [3, 0]

Zooming into one expansion to show parameters and substitutions:

MacroExpand, tokens size: 2, lines [5]
Identifier(name=foo, line=5, column=12) Number "66" 5:16
Assignment, tokens size: 3, lines [2, 5]
Register(register=v0, line=2, column=16) Assignment(line=2, column=19) Number "66" 5:16
IAssign, tokens size: 3, lines [3, 0]
Register(register=i, line=3, column=16) Assignment(line=3, column=18) Number "1" 0:0
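The substitution step itself is small. Here is a hedged sketch that works on plain words rather than real tokens. The `Macro` class and `expand` function are hypothetical, and the counter follows the output above, where the first expansion sees CALLS as 1.

```kotlin
// Hypothetical macro model; Nachos's real representation may differ.
data class Macro(val name: String, val params: List<String>, val body: List<String>) {
    var calls = 0 // how many times this macro has been expanded so far
}

fun expand(macro: Macro, args: List<String>): List<String> {
    require(args.size == macro.params.size) {
        "Macro ${macro.name} expects ${macro.params.size} argument(s), got ${args.size}"
    }
    macro.calls++
    // Bind each parameter to its argument, plus the built-in CALLS counter.
    val bindings = macro.params.zip(args).toMap() + ("CALLS" to macro.calls.toString())
    // Replace bound words; everything else passes through unchanged.
    return macro.body.map { word -> bindings[word] ?: word }
}

fun main() {
    val foo = Macro("foo", listOf("SIZE"), listOf("v0", ":=", "SIZE", "i", ":=", "CALLS"))
    println(expand(foo, listOf("0x42")))
    println(expand(foo, listOf("0x42")))
}
```

A real implementation would substitute token objects rather than strings, so each expanded token can keep both its definition site and its call site, as the `lines [2, 5]` entries above suggest.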

Example: catching errors (and continuing)

If we pass the wrong type and then forget an argument:

:macro foo SIZE {
    v0 := SIZE
    i := CALLS
}
foo "hello"
foo

The parser emits errors but continues analysis so the user gets multiple helpful messages in one run:

1MacroExpand, tokens size: 2, lines [5]
2Identifier(name=foo, line=5, column=0) String "hello" 5:4
3Error, tokens size: 4, lines [2, 5]
4Register(register=v0, line=2, column=4) Assignment(line=2, column=7) String "hello" 5:4 Error "Expected Register, Identifier, Number or Key" 5:4
5IAssign, tokens size: 3, lines [3, 0]
6Register(register=i, line=3, column=4) Assignment(line=3, column=6) Number "1" 0:0
7Error, tokens size: 3, lines [6]
8Identifier(name=foo, line=6, column=0) Error "Unexpected end of program" 6:0 Error "Error parsing macro foo" 6:0

The first macro expansion fails because a String can’t be assigned to a register. The second fails because the invocation is incomplete.

The one exception

One example file didn’t parse on the first try: caveexplorer.8o, line 222. A label happens to use the name exit, which is a SuperChip‑8 keyword. It’s a fun edge case, and a nice sign that the project is far enough along to surface real, actionable issues from real programs.
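A collision like this is easy to detect up front. The sketch below only illustrates the check: the `reserved` set is a small hand-picked subset, `labelCollisions` is a hypothetical helper, and it says nothing about how Nachos will end up treating such labels.

```kotlin
// Illustrative subset of reserved words; not the full Octo keyword list.
val reserved = setOf("return", "exit", "clear", "loop", "again")

// Report every label definition (": name") whose name is a reserved word.
fun labelCollisions(words: List<String>): List<String> =
    words.zipWithNext()
        .filter { (prev, name) -> prev == ":" && name in reserved }
        .map { (_, name) -> "Label '$name' collides with a reserved word" }

fun main() {
    labelCollisions(listOf(":", "exit", "return", ":", "main", "exit")).forEach(::println)
}
```

Only the word directly after a colon is checked, so ordinary uses of `exit` as an instruction are left alone.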

Lessons learned

Fail fast at the character level, fail friendly at the syntax level.
Keep tokens rich: line/column and literal values make downstream errors precise.
Make macros first‑class in the parser: preserve context for better diagnostics and assembly.
Small language, simple parser: lean into Octo’s linearity and constraints.
Real code > synthetic tests: example suites flush out keyword/name collisions and corner cases.

What’s next

Wire the tokenizer and parser into the editor for live diagnostics.
Build the assembler so Octo programs can run end‑to‑end.
Add debugging tools in the IDE: state inspection, breakpoints, step/run, and register/memory views.

Still plenty to do, but Nachos is now producing real value—and that’s delicious.

#Emulation #Kotlin #Compose #Programming