Time travel, raw
2024-07-06
Raw log from my notes re: Time travel follows.
Sae
RV32I with some RV32C/refactoring WIP from long ago. The WIP probably feels way too magic for me now, but we should take a look at it. Now uses Niar.
TODOs
Decombing
First priority is decombing the design to try to get the build time down. It's currently redonkulous:
[2024-07-03 13:05:24,917] niar: INFO: building sae for icebreaker
[2024-07-03 13:05:24,917] niar: DEBUG: starting elaboration
[2024-07-03 13:05:25,148] niar: DEBUG: elaboration finished in 0:00:00.230441
[2024-07-03 13:05:25,148] niar: DEBUG: 'sae.il': 425,987 bytes
[2024-07-03 13:05:25,148] niar: DEBUG: starting synthesis/pnr
[2024-07-03 13:05:25,148] niar: INFO: [run] execute_build
[2024-07-03 13:08:12,179] niar: DEBUG: synthesis/pnr finished in 0:02:47.031564
[2024-07-03 13:08:12,207] niar: INFO:
[2024-07-03 13:08:12,207] niar: INFO: === sae ===
[2024-07-03 13:08:12,207] niar: INFO:
[2024-07-03 13:08:12,207] niar: INFO: Number of wires: 2859
[2024-07-03 13:08:12,207] niar: INFO: Number of wire bits: 9313
[2024-07-03 13:08:12,207] niar: INFO: Number of public wires: 2859
[2024-07-03 13:08:12,208] niar: INFO: Number of public wire bits: 9313
[2024-07-03 13:08:12,208] niar: INFO: Number of ports: 4
[2024-07-03 13:08:12,208] niar: INFO: Number of port bits: 4
[2024-07-03 13:08:12,208] niar: INFO: Number of memories: 0
[2024-07-03 13:08:12,208] niar: INFO: Number of memory bits: 0
[2024-07-03 13:08:12,208] niar: INFO: Number of processes: 0
[2024-07-03 13:08:12,208] niar: INFO: Number of cells: 5732
[2024-07-03 13:08:12,208] niar: INFO: $scopeinfo 19
[2024-07-03 13:08:12,208] niar: INFO: SB_CARRY 452
[2024-07-03 13:08:12,208] niar: INFO: SB_DFF 79
[2024-07-03 13:08:12,208] niar: INFO: SB_DFFE 35
[2024-07-03 13:08:12,208] niar: INFO: SB_DFFESR 1380
[2024-07-03 13:08:12,208] niar: INFO: SB_DFFSR 8
[2024-07-03 13:08:12,208] niar: INFO: SB_GB_IO 1
[2024-07-03 13:08:12,208] niar: INFO: SB_IO 3
[2024-07-03 13:08:12,208] niar: INFO: SB_LUT4 3737
[2024-07-03 13:08:12,208] niar: INFO: SB_RAM40_4K 18
[2024-07-03 13:08:12,208] niar: INFO:
[2024-07-03 13:08:12,208] niar: INFO: Device utilisation:
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_LC: 5033/ 5280 95%
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_RAM: 18/ 30 60%
[2024-07-03 13:08:12,208] niar: INFO: SB_IO: 4/ 96 4%
[2024-07-03 13:08:12,208] niar: INFO: SB_GB: 5/ 8 62%
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_PLL: 0/ 1 0%
[2024-07-03 13:08:12,208] niar: INFO: SB_WARMBOOT: 0/ 1 0%
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_DSP: 0/ 8 0%
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_HFOSC: 0/ 1 0%
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_LFOSC: 0/ 1 0%
[2024-07-03 13:08:12,208] niar: INFO: SB_I2C: 0/ 2 0%
[2024-07-03 13:08:12,208] niar: INFO: SB_SPI: 0/ 2 0%
[2024-07-03 13:08:12,208] niar: INFO: IO_I3C: 0/ 2 0%
[2024-07-03 13:08:12,208] niar: INFO: SB_LEDDA_IP: 0/ 1 0%
[2024-07-03 13:08:12,208] niar: INFO: SB_RGBA_DRV: 0/ 1 0%
[2024-07-03 13:08:12,208] niar: INFO: ICESTORM_SPRAM: 0/ 4 0%
[2024-07-03 13:08:12,208] niar: INFO:
After moving the fault check out of `fetch.resolve`: 1:47, 4825 LCs.
After using `.all()`: 1:44, 4802 LCs.
After fixing our IL digest behaviour: priceless.
After splitting out just OP_IMM: 404k IL, 2:36, 5038 LCs. O_o
I guess I need to split out the decode a little more? Or maybe it's just a matter of decomposing more.
After replacing multiple `m.d.sync += self.write_xreg(v_i.rd, ...)` with one of those and a comb wire out for the value: 404k IL, 1:35, 4851 LCs.
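For future me, the shape of that change is roughly the following. This is a sketch only: `wb_val`, `wb_en`, `Funct3`, `rs1_val` and friends are invented names, not Sae's, and it assumes the usual Amaranth names in scope inside `elaborate()`.

```python
# Sketch of the pattern, not Sae's actual code: every case comb-drives a shared
# value/enable pair, and there is exactly one registered write at the end.
wb_val = Signal(32)
wb_en = Signal()

with m.Switch(funct3):
    with m.Case(Funct3.ADDI):
        m.d.comb += [wb_val.eq(rs1_val + imm), wb_en.eq(1)]
    with m.Case(Funct3.XORI):
        m.d.comb += [wb_val.eq(rs1_val ^ imm), wb_en.eq(1)]

with m.If(wb_en):
    m.d.sync += self.write_xreg(v_i.rd, wb_val)
```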
We'll split it out as much as possible at first, and then slowly reintegrate. We already do the register save in `fetch.init`, and now with some care after splitting out OP_IMM it's a bit better again.
Need to remember that the toolchain does much less deduplication than we assume. Keep going on that, esp with insn decode.
Using `~insn[:16].bool()` instead of `== 0`: 404k IL, 1:52, 4800 LCs.
Using `wb_reg.any()` instead of `!= 0`: no change.
After splitting out LOAD: 404k IL, 1:59, 4945 LCs. Uhm.
After factoring the xreg fetch into common: 402k IL, 1:46, 4972 LCs. Hmmmmmm.
After adding the read register: ran out of BELs. Welp. (6515 cells.)
After changing the read register comb->sync: 6394 cells. Improved slightly.
After splitting out OP: 371k IL, 7166 cells. …
After refactoring OP with `out`: 7120 cells.
After splitting out STORE: 368k IL, 7134 cells.
After splitting out BRANCH: 343k IL, 6386 cells.
After splitting out JALR: 339k IL, 5599 cells and PNR is working again.
After changing `jump(m, pc)`'s context manager return to `~_.bool()`: 339k IL, 5602 cells. Uh, ok. Reverting that for now just 'cause maybe there's a cross-over point (size of bv).
Using `any()` instead of `bool()` causes cell reduction? At 5587. 339k IL, 1:04, 4774 LCs.
OK, switching sync->comb on read regs bumps back up to 5664 (+77), and increases PNR time significantly (maybe because we're close to cell count?). 338k IL, 2:41, 4984 LCs (94%).
Next step is to do the instruction decode in one place and then pass info to following stages.
Added `imm` and `funct3` to LOAD: 342k IL, 1:06, 4905 LCs (92%).
Did the same to OP_IMM: 342k IL, 0:58, 4886 LCs (92%).
Removed `v_sxi` wire and just used `imm[:12].as_signed()` in place: 342k IL, 1:01, 4907 LCs (92%).
Did same deal to OP: 343k IL, 1:01, 4806 LCs.
Did same deal to STORE: 344k IL, 1:16, 4816 LCs. !!!
I forgot to only use the bottom 12 bits of `imm`. Fixed: 344k IL, 1:09, 4910 LCs. What?
Try doing the sign-extension in resolve: 344k IL, 1:13, 4845 LCs.
Do the thing for BRANCH: 337k IL, 1:20, 4859 LCs.
JALR too, that's everything: 337k IL, 1:13, 4812 LCs.
Drop `v_sxi` and do the sign extension in resolve: 336k IL, 1:15, 4825 LCs.
Same for LOAD: 336k IL, 1:17, 4774 LCs (90%). Huh.
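The decode-once idea, sketched for the record (invented signal names; the slices are just the standard RV32I positions, `funct3` in bits 14:12 and the I-type immediate in bits 31:20):

```python
# Sketch only (invented names). Extract once, sign-extend once, and let
# OP_IMM/LOAD/JALR all receive the same ready-to-use 32-bit value.
funct3 = Signal(3)
imm = Signal(signed(32))

m.d.sync += [
    funct3.eq(insn[12:15]),            # insn[12:15] == bits 14:12
    imm.eq(insn[20:32].as_signed()),   # insn[20:32] == bits 31:20, sign-extended
]
```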
op.op_imm and op.op can be refactored.
Hackily done: 333k IL, 0:59, 4595 LCs. OK yeah, that helps!
Done so that it actually works (still hack): 333k IL, 1:05, 4633 LCs (87%).
Put register file in memory now that it's all separated out.
(How much is it using, really? Half XCOUNT: 288k IL, 0:19, 3340 LCs (63%). OK, quite a bit.)
Pico fits in 750–1000. SERV fits in 198??????
Dumped it in a register file. Gave it two read ports so no existing code has to change, I think it's just duplicated the memories but they're so small it doesn't matter. 265k IL, 0:13, 2367 LCs (44%).
Cleaned up our reg read and write logic: 264k IL, 0:14, 2300 LCs (43%).
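Roughly what the memory-backed register file amounts to, as a sketch. This assumes Amaranth 0.5's `amaranth.lib.memory`; the port and signal names (`rs1`, `wr_reg`, etc.) are illustrative, not Sae's.

```python
from amaranth import unsigned
from amaranth.lib.memory import Memory   # assuming the Amaranth 0.5 memory API

# 32 x 32-bit registers; two read ports so existing readers stay untouched.
# The toolchain duplicates the tiny BRAM behind the second port, which is fine.
m.submodules.xmem = xmem = Memory(shape=unsigned(32), depth=32, init=[])
xrd1 = xmem.read_port()    # synchronous reads: data arrives the next cycle
xrd2 = xmem.read_port()
xwr = xmem.write_port()

m.d.comb += [
    xrd1.addr.eq(rs1),
    xrd2.addr.eq(rs2),
]
with m.If(wr_en & (wr_reg != 0)):        # never write x0, so it stays zero
    m.d.comb += [
        xwr.en.eq(1),
        xwr.addr.eq(wr_reg),
        xwr.data.eq(wr_val),
    ]
```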
TODOs remaining:
- Read all the accepted Amaranth RFCs.
OK cool.
- Do a once-over and generally clean up the Hart.
2291 LCs.
2162 LCs after combing the MMU interface.
2269 when I do it to the MMU write port. No point since we have to register a lot anyway. Back to 2162. Similarly it grows when I use comb to set `req_width`, maybe because everything else in the FSM (back at this point) is sync, so they all switch together. Hrm.
Let's try changing the `write_xreg` to comb. Basis: 8e3c38ca, 2162 LCs. Hm, nah: this can't work. We assemble the components over multiple cycles fairly often (use `xwr_reg` to store `v_X.rd` etc.). What about `read_xreg`? The result is we must read the result when expected, since we're no longer registering the address. I anticipate a growth in LCs (but faster reads).
2266 LCs, but haven't removed the extra states yet so tests all fail.
2303 LCs, necessitated an ALU refactor. So I don't think there's a benefit to this other than speed? Let's see how many cycles CXXRTL gains.
57508 after change, 57505 before. Well! The ALU split is something though, if I keep that it's gonna be uglier anyway. How can I fix up?
Maybe we can tell the ALU where to read its inputs.
Adding a top-level "delay" that gates the whole FSM adds 100 LCs. It'd be nice to have one wait state. Ah well.
2230 LCs after centralising the ALU. Splitting it into two stages gets us to … 2246? Oh. OK. Really didn't expect that. Guess I won't do that.
If we don't have reg read always on, we can actually shuffle bits like `imm` into `xrd2_val`.
Up to 2303 on adding `xrd[12]_en`. … and hold the phone, `xrd2_val` is comb-driven. Nvm.
It barely helps us anyway since we need to post-process for the ALU. Cancel that.
2202 after dropping the `m.If(funct7[5])` out of the `Else()` block; we know they're mutually exclusive, whereas it doesn't.
On the contrary, 2189 after changing the `m.If(funct3 == …ADDSUB): …` / `m.If(funct3 == ….SR)` to if/elif: easier mux? It resembles a switch (which are probably optimised …). (There's a sketch of the two forms after this TODO list.)
2202 after collapsing `l.wait` states. Cleaner. 2194 after fixing the default bug; that's 41%.
I would love to get a better idea where all these cells are being spent, but it's pretty hard to say after optimisations.
- Consider the same for the MMU.
  - At minimum, use `amaranth.lib.stream` for its interface.
- Stop embedding the "address bus" in the MMU along with the UART hook.
- Clean up UART.
- Move onto the big task.
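For reference, the two forms from the funct3 note above. Names like `Funct3`, `alu_out`, `add_out` are placeholders, not Sae's; this is only a sketch of the shape of the change.

```python
# Two independent m.If blocks: nothing tells the tools the conditions are
# mutually exclusive, so each drives alu_out through its own priority logic.
with m.If(funct3 == Funct3.ADDSUB):
    m.d.comb += alu_out.eq(add_out)
with m.If(funct3 == Funct3.SR):
    m.d.comb += alu_out.eq(shift_out)

# One m.If/m.Elif chain: the branches are explicitly exclusive and the whole
# thing reads (and muxes) like a switch.
with m.If(funct3 == Funct3.ADDSUB):
    m.d.comb += alu_out.eq(add_out)
with m.Elif(funct3 == Funct3.SR):
    m.d.comb += alu_out.eq(shift_out)
```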
Aside: ABTest
I feel like I want to make a little sandbox or something that makes evaluating the RTLIL diff between Amaranth expressions easier (optionally running it through `opt` or all the way through synthesis). Ideally it could even run in-situ, i.e.:
with ABTest.A():
    m.d.comb += blah.eq(x[:16] == 0)
with ABTest.B():
    m.d.comb += blah.eq(~x[:16].bool())
Or even:
m.d.comb += blah.eq(ABTest(x[:16] == 0, ~x[:16].bool()))
All such sites would be toggled individually (with others defaulting to "A", no cartesian product) and then outputs presented for comparison.
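Purely as a sketch (nothing like this exists yet; the helper name and the environment-variable toggle are invented), the expression form could be as small as:

```python
import itertools
import os

_ab_sites = itertools.count(1)

def ABTest(a, b):
    """Hypothetical helper: each call site gets an index, and a driver script
    sets ABTEST_SITE_<n>=B for one site at a time, rebuilds, and diffs the
    resulting RTLIL (or the post-opt/synthesis output)."""
    site = next(_ab_sites)
    return b if os.environ.get(f"ABTEST_SITE_{site}") == "B" else a
```

The `with`-block form is less obvious: `m.d.* +=` takes effect regardless of any surrounding context manager, so that one would need something cleverer.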
Sae's a bit slow to try this with right now.
RV32C
This will take some re-understanding. We know the shape of the ISA(s) better now so we might be able to design something less Heppin Magic.
The cherry-picking went fairly straightforwardly, lots of conflicts but all easily resolved. Glad we did it in this order!
There's so much magic in `isa.py` that I'm resolved to redesign this in a much more straightforward way.
Design
Users of an ISA defined with this tool:
- assembler/disassembler
  - Opcodes and registers accessible via reflection, including support for defining shorthands (`J`, `LI`, etc.).
    - Need to be able to go the other way, too.
- gateware
  - Clear and easy access to layouts and op constants.
  - Exposes metadata like ILEN/XLEN/XCOUNT for gateware to use.
- subclassing ISAs
  - Can be added to, removed from (e.g. RV32E reducing XCOUNT).

Goals:

- Much less magic.
- Avoid metaclasses, avoid `__call__`.
  - Inspecting signatures is OK.
- Just enough flexibility to express RV; other ISAs are currently a non-goal.
- I think the current design encapsulates most of what we'll need here.
Notes on the existing design:
- Our current design doesn't include an intermediate representation: `IThunk.__call__` winds up calling `shape.const().as_value().value`, building up args to pass to `const()`; there's no real point of "calling it done" except for the first time it's called.
- Many layouts define immediates in groups of `imm0_4`, `imm5`, `imm6_11` kinds of things. Sometimes they also omit e.g. `imm0` (implied 0).
- An `ISA` has a notion of a `Register`, which is a class defined in the return value of `ISA.RegisterSpecifier()` (!).
  - This uses `locals().update(members)` to define the members of an `IntEnum`, where registers is built up from a list of (name, alias0, …, aliasn) tuples and a target size.
  - I think we'll still need something like this; it's actually one of the least magic parts of this.
- All `ISA` members can define `_needs_named` and `_needs_finalised` attributes, processed in `ISAMeta.__new__`.
  - `_needs_named` causes the assignment of `__name__` and `__fullname__` attributes, according to the name being assigned to.
  - `_needs_finalised` calls `finalise` on the object with a reference to the `ISA`.
  - This lets members finish initialising themselves with an awareness of everything else defined in the `ISA`, including things defined (lexically) after them.
- `ILayout` is an empty baseclass with an `ILayoutMeta` metaclass.
- `ILayoutMeta.__new__` takes an optional argument `len` and assigns it to `cls.len` (where `cls` is the newly-created class).
  - If `layout` is specified, it marks the class as needing finalisation and checks that `cls.len` is in fact defined (either now, or in a superclass). Otherwise, it's considered a layout base class.
- `ILayoutMeta.finalise`:
  - assembles the defining context dictionary by iterating the `ISA`'s MRO backwards for their `dir()`s, discounting names starting with underscore (`_`);
    - In other words, items in the `ISA` class and superclasses define the context for type-shape lookups.
  - assembles the full type-shape dictionary by iterating the `ILayout` instance's MRO backwards for annotations, starting from after `ISA.ILayout` itself;
    - In other words, annotations in the class and its `ISA.ILayout` superclasses define the set of type-shapes available to `layout` items.
    - The context dictionary is used as `locals()` here.
  - iterates over the `layout` tuple given by the subclass, constructing `cls._fields` by matching names to `ShapeCastable`s:
    - Members can be strings, in which case they refer to an annotation with a matching name.
      - If the exact match lookup is unsuccessful, the class's `resolve()` function is called with some context (the remaining items in the layout, the length of instruction remaining to be allocated to a field), which must succeed.
    - Members can be `(name, shapecastable)`.
  - initialises `cls.shape = StructLayout(cls._fields)`.
  - initialises `cls.values` and `cls.defaults` by calling `resolve_values` on the class's existing (set by subclass definition) `values` and `defaults` members, if any.
    - These may not overlap.
    - `int`s are `int`s; strings are treated as keys for the `ShapeCastable` for the corresponding field.
      - If item lookup fails, the `ShapeCastable`'s `__call__` is tried.
- `ILayoutMeta.resolve` just raises an error. Is this really exposed on subclass instances? Surprising.
- `ILayoutMeta.xfrm` constructs the class and calls `xfrm` on it.
  - If `I` is an `ILayout` subclass, this just means `I.xfrm(…)` is the same as `I().xfrm()`, i.e. get an unrefined thunk and then transform it.
    - Digression: for whatever reason we really like being able to use classes in these positions. It "must" be a class because it's the result of defining something with `class Blah:`, which itself is needed because we often want to supply code, nested classes, etc. But why the insistence on calling the class itself? We don't ever have class instances, and doesn't that seem a bit strange?
    - Thinking forward, the class instances should be the intermediate representation, not a separate thunk class. You call `I()` or `I(a=b)`, you get a `<myisa.MyISA.I object>`, with the args hitting `I.__init__` like a regular human being. (There's a rough sketch of this after these notes.)
    - This prevents our delightful (…) hack with `I(s)`. We can actually just call it `I.shape(s)`, which already exists because that's what it does lol!!
    - I have some lingering concerns here around repeated work that currently happens in `finalise` etc., but let's deoptimise now, and reoptimise after the design is sane.
- `ILayoutMeta.__call__` allows zero or one positional arguments, plus kwargs.
  - In the above example, this is `I(…)`.
  - Zero positionals asserts a layout is defined, and returns a new `IThunk(cls, kwargs)`.
    - In other words, `I(a=b)`. This denotes a partially refined instruction based on `I`.
    - Note that even `I()` is valid syntax, to get the same kind of thunk but not refining any part of it.
  - One positional asserts a `Signal` argument is given and wraps it in the subclass's `shape` (`cls.shape(s)`), so you can call `I(s)` to decode `s`.
- The `IThunk` is as close as we get to an "intermediate representation" here.
  - Sets `_needs_named`, as it's probably going to be assigned in an expression like `ADDI = I(funct3=I.IFunct.ADDI)`.
  - Stores the class it was constructed from and the `kwargs` we got.
  - `xfrms` initialised to empty.
  - `asm_args` is defined from `list(self.layout)`: it's the list of arguments an assembly call needs to provide. If your layout is `("opcode", "rd", "imm")`, we need an opcode, dest register and immediate.
    - `opcode` is defined as type `Opcode` and `rd` as `Reg` in the defining context, and `imm` is handled in `IL.resolve` when it's in the final position.
    - The `opcode` is refined by being specified in `kwargs`, leaving just `rd` and `imm` for the "asm args". So how does that happen?
    - We iterate over all `values` and `defaults` in the IL class, and names in `kwargs`, removing from `asm_args` any specified there.
    - Next we iterate names in `kwargs`, asserting all specified are a part of the `layout`, and none are part of the IL class's `values` (the distinction between `values` and `defaults` being whether they can be overridden in a thunk ctor or not).
  - It has `clone()` and `partial(**kwargs)`; the former returns a new `IThunk` with copies of all settings (for declaration, immutable definition), the latter clones and updates `clone.kwargs` with given kwargs, removing those from `clone.asm_args` (further refinement of an `IThunk`).
  - It also has `xfrm(xfn, **kwarg_defaults)`, which appends a new transform to `clone.xfrms`, with some optional default kwargs.
    - Transforms are a function which are handed a set of kwargs, and return a dict to update the kwargs given to the next one (or to the `ilcls.shape.const(…)` call at the end).
    - The kwargs start out as the thunk's own `kwargs` mixed with any given to the `IThunk.__call__`, the latter superseding the former.
    - The transform function's signature is analysed: if you take a parameter `x`, the kwarg `x` is filled in (mandatory). If you specify `x=default`, then `kwarg_defaults` and finally `default` are used as fallbacks.
      - I wonder why `kwarg_defaults` is only allowed when no default is given. I guess they're either really mandatory to specify, or possibly optional.
      - An example here is `shamt_xfrm(shamt, *, imm11_5=0)`. `SRAI` overrides this with `SRAI = I(funct3=I.IFunct.SRI).xfrm(I.shamt_xfrm, imm11_5=0b0100000)`; the others don't override it at all.
      - In other words, `kwarg_defaults` is more like "default overrides". In either case I don't imagine a user is actually setting one in a thunk, so maybe they should be treated that way.
    - What's unspecified here is a way for transforms to also transform `asm_args`, and that's where I got up to with `# clone.asm_args. ## RESUME XXX GOOD LUCK`.
- When an `IThunk` is called, we resolve the `args_for` the given kwargs.
  - We call the transform pipe with `self.kwargs | args`, i.e. those given while constructing the thunk mixed with those given while calling it.
  - The result of the pipe is asserted to match the layout and not override anything it's not allowed to override.
  - The `ilcls.values`, `ilcls.defaults` (both already "resolved") and the result of resolving the pipe's output are all combined and become the args passed to `shape.const`.
- Note that transforms are called in the order given, so we must transform `asm_args` back-to-front, as inputs used by earlier transforms may be provided by later ones.
  - Actually this is just backwards unless we do yet-more-thunking/accumulating. Let's reverse the order of how it should be called, so we can apply `asm_args` changes as `xfrm()` is called repeatedly. Actually call the transforms in reverse order.
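To make the "instances are the IR" digression above concrete, a throwaway sketch. Everything here is invented plain Python for illustration, not isa.py code.

```python
class Insn:
    """Sketch: a plain instance carries the partially-refined fields; no thunk class."""
    layout = ("opcode", "rd", "imm")        # per-format, as today
    values = {"opcode": 0b0010011}          # fields fixed by the instruction itself

    def __init__(self, **kwargs):
        assert set(kwargs) <= set(self.layout), "kwargs must name layout fields"
        self.kwargs = {**self.values, **kwargs}

    def partial(self, **kwargs):
        # Refinement just builds another instance with more fields filled in.
        return type(self)(**{**self.kwargs, **kwargs})

    def const(self):
        missing = set(self.layout) - set(self.kwargs)
        assert not missing, f"unassembled fields: {missing}"
        # The real thing would hand this dict to the StructLayout: cls.shape.const(...)
        return self.kwargs


# "ADDI = I(funct3=...)" style declarations become ordinary instances and __init__:
insn = Insn(rd=1).partial(imm=42)
assert insn.const() == {"opcode": 0b0010011, "rd": 1, "imm": 42}
```

Refining and assembling are ordinary method calls on an ordinary object, and decoding a `Signal` stays `I.shape(s)`, which already exists.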