(October 2019)

Compiling a CPU, in a cheap FPGA board

For the TL;DR crowd:

I am developing a strange habit - I keep "twisting" products from failed companies... into nice toys.

For science! :-)

After completing my AtomicPI saga, something else caught my attention - a very cheap FPGA board, that came out of yet another failed company - the Pano Logic G2.

Why not compile an open-source CPU inside this FPGA?
And in fact, since this FPGA is quite big, why not make it a multi-core one?

And then compile and run programs inside it - with an open-source cross-compiler, that uses an open-source real-time OS? The same OS that most European satellites and their instruments are using?

Why not, indeed!

(he said, a month ago - and dove into the abyss).

How fast could it go? And since the Pano Logic was meant to be a thin client, and comes with VGA, USB, Ethernet... it's packing all the pieces necessary to create a standalone computer! Could it be that this can be made into a fully open-source computer?

Keep reading - I believe you'll learn a thing or two.

P.S. The material is heavily technical and long, so I'll try to lighten it up here and there, with the occasional rant / funny picture. Also, please remember that I am a software developer, not a HW one; I simply enjoy fooling around with technology like this, so take everything said in this blog post - and in the referenced repos - with a grain of salt.

First step: the hardware

This adventure began a few months ago, when I read a magnificent article by Tom Verbeure - a principal hardware engineer at NVIDIA. Tom built a real-time ray tracer on a dirt-cheap FPGA board; and "dirt-cheap" is not an exaggeration, since even now, you find ads like this on eBay:

Lot of 25 Pano Logic Thin / Zero Desktop Client Black w/ Power Supply Buy now: US $170.00

I'll just quote Tom here, so you can understand the "why" and "how" behind this:

Pano Logic was a Bay Area startup that wanted to get rid of PCs in large organizations by replacing them with tiny, CPU-less thin clients; connected to a central server. Think of them as VNC replacements. No CPU? No software upgrades! No viruses!

...The thin clients had a wired Ethernet interface, a couple of USB ports, an audio port and a video port. And all this was glued together with an FPGA

...The company has been defunct since 2013 and the clients are not supported by anything. But they are amazing for hobby purposes and can be bought dirt cheap on eBay.

So I got my hands on a Pano Logic - in particular, a G2 model; with the Spartan6 LX100 FPGA inside it. This is a rather large FPGA, promising far more power than any hobbyist has a right to - but since Pano Logic (the company) failed, the product itself is of no use to anyone but hackers and tinkerers; and it's therefore sold at amazing bargain prices.

I followed Tom's instructions - first dismantling the box, and then soldering wires to the JTAG connector:

Soldering JTAG wires

On the other end, I soldered 6 pins from pin header strips - and used a small piece of perfboard, to create an "adapter" of sorts. This allowed me to "plug" the 6 cables into the JTAG connector of a Xilinx programmer. Note that these programmers can be found for cheap on eBay (see Tom's article linked above for details).

When soldering, Blu-Tack is sometimes better than helping hands

The end result - FPGA visible in IMPACT

The last picture you see above, shows the IMPACT tool - made by Xilinx, the company that created these FPGAs - being able to see the chip.

Intermission - on open source, and the abyss

Just like many other engineers, I learned over the years to hate non-determinism; in all its forms, and all its manifestations. This means that I gravitate towards open-source operating systems; where I can use my engineering skills to fully trace what happened, and why; and fully control the OS's behavior.

I don't want my computer to decide to upgrade while I am giving a presentation. I don't want some fancy antivirus to decide that it must "scan" every .c and .cpp file read by my compiler during a build, because it performs "on-access scans".

I want myself - not some mega-corporation - to be in control of my own hardware. And to automate all the workflows and processes that I need; like installing my development environments on new machines by running a few simple one-liners...

bash$ sudo apt install gcc-8 vim git make cscope exuberant-ctags tmux
bash$ git clone https://github.com/ttsiodras/dotfiles
bash$ git clone https://github.com/ttsiodras/dotvim .vim
...

...and seeing all the myriads of complex dependencies being perfectly resolved under my Debian (or in a similar way, under my Arch)...

...or orchestrating the creation of a complex open-source cross-compiler; allowing me to deterministically build applications with a real-time, freely-accessible OS (that happens to fly on many European satellites)...

...or installing a company's HW synthesis tools - via...

bash$ sudo apt install spartan6-xilinx-synthesis

Actually....
That last one was a lie.

NEVER GONNA HAPPEN. EVER. Abandon all hope on that front, kids. Think about it - the way commercial companies operate, it makes ZERO financial sense for them to break down 15+ GB installs ~~of-monstrous hodge-podges of kitchen-sinks~~ into a proper dependency graph of packages using each other - and allow you to `apt install` only the parts you need.

HW is cheap - throw money at the problem! Yes?

But what does that mean for our endeavour with the Pano Logic G2?

Well, the last version of the Xilinx synthesis tools that supported the Spartan family under Linux was the freely available ISE 14.7 WebPACK. I have installed this on my machine, and it does - thankfully - allow me to synthesize for an older Spartan3 board I have.

It's also minuscule. So tiny!

bash$ du -s -h Xilinx/14.7/
15G     Xilinx/14.7/

But I digress - and forgot that... the final Linux version of WebPACK doesn't support Spartan6 LX100 chips.

Let me repeat that - in case you didn't catch it - in a way that will make it clear:

bash$ sudo apt install gcc-8

We are sorry, but we detected a 9 year old CPU that is not supported
by the freely available version of our compiler.

Please buy our BRAND NEW CPU - WITH BONUS NSA MANAGEMENT EXTENSIONS!
Or sell your left kidney and buy our BRAND NEW COMPILER TOOLCHAIN
that supports everything!

(sigh)

Searching the Xilinx site some more, we see that there is a free version of the ISE WebPACK that targets Spartan6 devices - but only for Windows.

After downloading and unzipping this package... what do you know! That setup actually installs a Virtual Machine, containing...

...a Linux distribution!

Now that would be a nice example of something actually "ironic" - if one were inclined to, erm, tell Alanis Morissette about the true meaning of the word.

But let's continue our investigation - and have a look at this .ova file:

$ cd Xilinx_ISE_S6_Win10_14.7_ISE/ova
$ tar tvf ISE_S6_VM.ova
-rw-r----- vboxovf10/vbox_v5.2.VBOX_VERSION_PATCHr11 12425 2018-02-03 00:39 ISE_S6_VM.ovf
-rw-rw---- vboxovf10/vbox_v5.2.VBOX_VERSION_PATCHr11 7253232128 2018-02-03 00:39 ISE_S6_VM-disk001.vmdk

The .vmdk file contained inside is a virtual drive. After extracting it from the .ova with tar, we discover that this is a dynamic volume; so it can't be mounted as-is with qemu-nbd.

It must first be converted to a "normal" VMDK - and then, we can mount it:

$ qemu-img convert ISE_S6_VM-disk001.vmdk -O vmdk plain.vmdk
$ qemu-nbd -r -c /dev/nbd0 plain.vmdk
$ mount /dev/nbd0p1 /iso4/
$ ls -l /iso4/opt/Xilinx
drwxrwxr-x. 3 500 500 4096 Dec  8  2016 14.7

...and of course, there the Xilinx toolchain is - right where we'd expect it to be... In the same folder as the "Spartan3-supporting" version!

Maybe we don't have to boot this thing at all - we'll just copy the entire tree of ISE_DS, to create two folders - one with the normal (2013-era) WebPACK ISE that we used for ZestSC1/Spartan3 Mandelbrot experiments, and this new one (2016-era) for the upcoming Pano Logic ones.

A symlink will point to one or the other:

$ ls -l
drwxr-xr-x  4 ttsiod users 4096 Oct 12 21:14 ./
drwxr-xr-x 11 ttsiod users 4096 Oct 12 20:59 ../
lrwxrwxrwx  1 root   root    15 Oct 12 20:46 ISE_DS -> ISE_DS.Spartan6/
drwxr-xr-x  7 ttsiod users 4096 Mar  4  2018 ISE_DS.Spartan3/
drwxrwxr-x  6 ttsiod users 4096 Dec  8  2016 ISE_DS.Spartan6/

...and since Xilinx tools depend on license files, a script will switch everything from one form to the other - depending upon what we want to do:

#!/bin/bash

check_symlink()
{
    if [ ! -h "$1" ] ; then
        echo "$1 was not a symlink! Aborting..."
        exit 1
    fi
}

if [ $# -eq 0 ] ; then
    cd || exit 1
    ls -l .Xilinx/Xilinx.lic Xilinx/Xilinx.lic Xilinx/14.7/ISE_DS
    echo 
    echo Use xilinx.sh 3 or xilinx.sh 6
    echo
else
    # Compare as strings, so that a non-numeric argument is rejected too
    if [ "$1" != 3 ] && [ "$1" != 6 ] ; then
        echo Use xilinx.sh 3 or xilinx.sh 6
        exit 1
    fi
    XIL="$1"
    cd || exit 1
    cd .Xilinx || exit 1
    check_symlink Xilinx.lic
    rm Xilinx.lic || exit 1
    ln -s Xilinx.lic.spartan${XIL} Xilinx.lic || exit 1
    cd ../Xilinx/14.7/ || exit 1
    check_symlink ISE_DS
    rm ISE_DS || exit 1
    ln  -s ISE_DS.Spartan${XIL} ISE_DS || exit 1
    cd .. || exit 1
    check_symlink Xilinx.lic
    rm Xilinx.lic || exit 1
    ln -s Xilinx.lic.spartan${XIL} Xilinx.lic || exit 1
    cd || exit 1
    ls -l .Xilinx/Xilinx.lic Xilinx/Xilinx.lic Xilinx/14.7/ISE_DS
    echo
    echo Now go run this:
    echo "    cd ~/Xilinx/14.7/ISE_DS"
    echo "    . settings64.sh"
fi

$ xilinx.sh 6
lrwxrwxrwx 1 ttsiod users 15 Oct 19 16:10 Xilinx/14.7/ISE_DS -> ISE_DS.Spartan6
lrwxrwxrwx 1 ttsiod users 19 Oct 19 16:10 .Xilinx/Xilinx.lic -> Xilinx.lic.spartan6
lrwxrwxrwx 1 ttsiod users 19 Oct 19 16:10 Xilinx/Xilinx.lic -> Xilinx.lic.spartan6

Now go run this:
    cd ~/Xilinx/14.7/ISE_DS
    . settings64.sh

$
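
As an aside, the rm-then-ln sequence in the script leaves a brief window in which the symlink doesn't exist. A minimal Python sketch of the same switch, done via a temporary link followed by a rename, might look like this; the function name and the ".tmp" suffix are illustrative placeholders, not part of the original script:

```python
import os

def switch_symlink(link_path, new_target):
    """Point link_path at new_target, refusing to touch anything that is not a symlink."""
    if os.path.lexists(link_path) and not os.path.islink(link_path):
        raise RuntimeError(f"{link_path} was not a symlink! Aborting...")
    tmp_path = link_path + ".tmp"      # hypothetical scratch name
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(new_target, tmp_path)   # create the new link under a scratch name
    os.replace(tmp_path, link_path)    # rename over the old link in one step
```

Calling switch_symlink("ISE_DS", "ISE_DS.Spartan6") from inside Xilinx/14.7 would then play the role of the rm/ln pair above.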

Additionally, to avoid wasting a metric ton of hard drive storage, we use rdfind; to identify the files that are identical between these two subtrees - and form hard links so they only occupy space once:

$ rdfind -makehardlinks true ISE_DS.Spartan{3,6}/

After this finished, it became clear that the two trees shared almost EVERYTHING. In fact, the total storage cost went BELOW the original storage cost used for just the single ISE for the Spartan3!...

In the absence of miracles, this can only mean one thing: that apparently there are plenty of copies of files spread all over - even within the same folder subtree.
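
To see which files are duplicated before letting a tool hard-link them, something like the following Python sketch could be used. It is not how rdfind itself works internally; it simply groups files by the SHA-256 of their contents, and the function name is made up for illustration:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(*roots):
    """Return lists of paths (under the given roots) whose contents are identical."""
    by_digest = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    # Hash in 1 MB chunks, so huge files don't need to fit in RAM
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                by_digest[digest.hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Running find_duplicates("ISE_DS.Spartan3") on a single tree would list the copies that live inside that one subtree, which is the effect described above.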

So... can we now, finally, launch the thing?

Err... no.

The *free* WebPACK version, put inside a Linux Virtual machine, and made by its makers to specifically target Spartan6 devices... will first check that the MAC address of the `eth0` Ethernet adapter *has a specific value*.

I don't know what else to say. I believe the situation speaks for itself, very eloquently, about the merits of closed-source software.

Let's check the .ovf file in the original package:

$ grep MAC  ISE_S6_VM.ovf | head -1
<Adapter slot="0" enabled="true" MACAddress="08002768C935"...

We see here that the Virtual Machine is equipped with an "eth0" Ethernet adapter, with a specific MAC address. Since my laptop only has a "wlan0" interface, I added a dummy one - making it the way Xilinx apparently expects it:

$ cd /etc/systemd/network
$ cat 25-dummy.netdev
[Match]

[NetDev]
Name=eth0
Kind=dummy
MACAddress=08:00:27:68:C9:35

$ sudo systemctl restart systemd-networkd
$ sudo ifconfig eth0 up
$ sudo ifconfig eth0 | grep ether
    ether 08:00:27:68:c9:35  txqueuelen 1000  (Ethernet)
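
Note that the .ovf stores the MAC as twelve bare hex digits, while the .netdev file and ifconfig use the colon-separated form. A small Python sketch can confirm that the two spellings match, and list what each local interface reports. It assumes a Linux machine exposing /sys/class/net, and the function names are illustrative only:

```python
import os

def normalize_mac(text):
    """Turn '08002768C935' or '08:00:27:68:C9:35' into '08:00:27:68:c9:35'."""
    digits = "".join(ch for ch in text.strip() if ch not in ":-.").lower()
    if len(digits) != 12 or not all(ch in "0123456789abcdef" for ch in digits):
        raise ValueError(f"not a MAC address: {text!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

def interface_macs(sysfs="/sys/class/net"):
    """Map interface name -> MAC address, as reported by the kernel via sysfs."""
    macs = {}
    if not os.path.isdir(sysfs):
        return macs                      # not Linux, or sysfs not mounted
    for name in sorted(os.listdir(sysfs)):
        try:
            with open(os.path.join(sysfs, name, "address")) as f:
                macs[name] = f.read().strip()
        except OSError:
            continue                     # entries without an address file
    return macs

wanted = normalize_mac("08002768C935")   # the value taken from the .ovf
for name, mac in interface_macs().items():
    print(name, mac, "<-- matches the .ovf" if mac == wanted else "")
```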

So does it work now?

Nope.

First, you have to disable your wlan0 adapter (!) - otherwise the lmhostid detected by the Xilinx tools is the MAC address of the wlan0 adapter!

Clearly, Xilinx doesn't check whether there's an eth0 with the MAC they want... No, they look up which network adapter internet traffic goes through - and check that adapter's MAC.

Maybe.

Or maybe they stop at the first network adapter they find during enumeration.

Or maybe they draw lottery tickets from /dev/urandom - and perform an rm -rf /usr Russian roulette once in a blue moon.

Remember, we are talking about the free version of WebPACK here - that is officially distributed for people who somehow paid the company to get Xilinx Spartan6 chips, and want to program them.

And, yet, the free version distributed, has to perform checks like these - because, erm, it has to... IT JUST HAS TO.

(facepalm)

The next time you wonder about the impact of Linus Torvalds, and Richard Stallman, and Fabrice Bellard, and all the other magnificent fellows of the open-source SW world... JUST PAY A VISIT TO ANY OF THE GRAVEYARDS OF CLOSED SOURCE "HEAVENS".

And you'll then remember how the world was... before these giants decided to rescue us.

Back to "compiling" our CPU

Now that we ~~have sacrificed our firstborns and~~ are able to run the synthesis toolchain for our target - and see our FPGA being detected in IMPACT - we can finally move to "compiling" our CPU.

Over the last 4 years, I've been working as a real-time embedded SW engineer in the European Space Agency. In a very large percentage of our missions, our SW runs on one form or another of an open-source CPU design - specifically, on a SPARC derivative called LEON.

So when I began fooling around with CPU synthesis and FPGAs, I forked this repository; that contains a mirror of the open-source version of GRLIB, the home of LEONs. My own copy is here; please remember that I am a software developer, not a HW one; I just enjoy playing around with technology. What you are reading is just one of my hobbies - don't go and bet the family farm on my repository's code quality :-)

Also, don't expect this post to start from a 'hello, VHDL world' and end with a working LEON3. That would require a book, not a blog post. Instead, we will follow along the traditional paths of engineering; we will base our efforts on pre-existing designs, and tweak them to match our own target. This is in fact one of the roles served by the designs folder in the original repository.

Programming languages - living in the HW and SW worlds

As one might expect, writing code for programmable HW shares similarities with the SW development workflows. You have stages of processing your inputs in both worlds - instead of compiling compilation units into object files and then linking them, the FPGA tools perform synthesis, followed by placement-and-routing. You run your unit and integration tests prior to deploying your SW in production - just as you run your VHDL testbenches in your simulator prior to deploying your circuit to your FPGA.

And you edit your VHDL or Verilog code with your Vim, or perhaps your Emacs - NOTHING ELSE, INFIDEL - just like you would for your traditional SW programming languages.

I am utterly lying here, of course; the truth is that most of the HW designers I know are editing inside their Vendor-provided IDEs. Remember, these are walled gardens - with the designers pretty much "trapped" inside them.

Some of my friends have literally invested their lives in learning the peculiarities of specific toolchains - heck, even of specific versions of the toolchains!

Think about it - what else can you do, when you don't have the source code of a tool? All you can do, is "learn", over decades, the things to avoid... So that the black-box you build your designs with, doesn't go... bananas.

But there are significant differences, too.

For example, HW tools have far more issues with re-using previous work. If you touch a single .c file in a codebase containing thousands of source files, only that one will be recompiled when you make - you'll pay the small price for a quick compilation of a single file, and a re-link. Fast build-compile-test cycles.
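
The rule that make applies here is simple enough to sketch in a few lines of Python: a target is rebuilt when it is missing, or when any of its prerequisites has a newer modification time. This is only an illustration of the timestamp rule (the function name is invented), not a replacement for make:

```python
import os

def needs_rebuild(target, sources):
    """True if target is missing or older than at least one of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

With thousands of .c files, this check is true only for the object file of the one source you touched, so that is the only compilation you pay for.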

But in the HW world, that doesn't seem to be the case. There is no edit-compile-run cycle; there's an edit-compile-GoForAtripToTheAlpsAndStayForAweek-then-run cycle.

Another crazy difference I experienced was that builds are NOT deterministic; in the sense that in a design that utilises almost all resources of your FPGA, you may try rebuilding your code after just adding a comment - only to see it fail to satisfy the timing constraints it did in the previous build!

I am NOT joking. The placement and routing stages, in particular, are apparently very "tough" (algorithmically speaking). Heuristics are applied, in the cost functions that are used to estimate routing and timing performance... These in turn "feed" the gradient descents and simulated annealings that try to find the best location in the search space. In the end, this translates to, potentially, your "compilation" ending up trapped in a different, worse, "local minimum" than the one it found in your previous build.

Which is why you see HW designers COMMITTING the bitfiles they generated, after they see them actually work on the chip.

Put simply:

A SW developer committing an executable to his source repository, is an idiot.
A HW developer doing the same, is a wise man.

Apparently.

I am told this has improved in newer versions of HW toolchains; that they now allow you to "seed" the random processes driving the search space, so that they at least behave deterministically.

Which is nice.
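
To make the "seed" idea concrete, here is a toy Python sketch of simulated annealing on a made-up placement problem: cells sit in a single row, and the cost is the total length of the wires between connected cells. Real placers are vastly more complex; the netlist, the cooling schedule and every name below are assumptions chosen purely for illustration:

```python
import math
import random

def wirelength(order, nets):
    """Total 1-D wire length: sum of distances between connected cells."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def anneal(cells, nets, seed, steps=2000, t0=5.0):
    """Simulated annealing over cell orderings; returns (final cost, final order)."""
    rng = random.Random(seed)          # all randomness flows from this seed
    order = list(cells)
    rng.shuffle(order)
    cost = wirelength(order, nets)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling schedule
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]   # propose swapping two cells
        new_cost = wirelength(order, nets)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
        else:
            order[i], order[j] = order[j], order[i]   # undo the swap
    return cost, order

# A made-up netlist: 12 cells, each wired to its neighbour and to one far cell
cells = list(range(12))
nets = [(i, (i + 1) % 12) for i in range(12)] + [(i, (i * 5 + 3) % 12) for i in range(12)]

for seed in (1, 2, 3):
    print("seed", seed, "-> final wire length", anneal(cells, nets, seed)[0])
```

Different seeds may settle in different local minima, while re-running with the same seed reproduces the same result. That reproducibility is what a seedable toolchain buys you.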

Executive Summary: HW design is a strange land. It is, after all, a land full of clocks!

Tweaking, and then simulating with GHDL

As we said above, we will now base our efforts on pre-existing designs, and tweak them to match our own target.

After cloning my repository, navigate to designs; and copy the folder of my previous (unexpectedly successful!) attempt to bootstrap a LEON3 inside my Spartan3 board:

bash$ cp -a leon3-zestsc1-xc3s1000 lets-make-a-cpu

First of all, the master configuration file - config.vhd - defines a number of things that are FPGA specific. We are targeting a Spartan6 now, not a Spartan3 - so...

--- ../leon3-zestsc1-xc3s1000/config.vhd	2019-03-17 09:37:11.151623486 +0100
+++ config.vhd	2019-10-19 09:21:21.950877962 +0200
@@ -1,7 +1,7 @@
 
 
 -----------------------------------------------------------------------------
--- My customizations for my ZestSC1 board - based on the original design
+-- My customizations for my PanoLogic G2 board - based on the original design
 -- for the leon3-digilent-xc3s1000.
 --
 -- Original Copyright:
@@ -15,22 +15,22 @@
 
 package config is
 -- Technology and synthesis options
-  constant CFG_FABTECH : integer := spartan3;
-  constant CFG_MEMTECH : integer := spartan3;
-  constant CFG_PADTECH : integer := spartan3;
+  constant CFG_FABTECH : integer := spartan6;
+  constant CFG_MEMTECH : integer := spartan6;
+  constant CFG_PADTECH : integer := spartan6;
   constant CFG_TRANSTECH : integer := TT_XGTP0;
   constant CFG_NOASYNC : integer := 0;
   constant CFG_SCAN : integer := 0;
 
 -- Clock generator
-  constant CFG_CLKTECH : integer := spartan3;
+  constant CFG_CLKTECH : integer := spartan6;

We change all references of spartan3 to spartan6.
We set CFG_CLKMUL and CFG_CLKDIV to the same value (e.g. 5), so that for now the LEON runs at the same speed as the board's clock (25MHz). After we've done our first successful synthesis/placement/routing, we'll see the maximum frequency our circuit can run at - and we will bump up the clock accordingly.
In the Makefile.inc, we change to using the proper HW parts:

--- ../leon3-zestsc1-xc3s1000/Makefile.inc	2019-02-28 20:55:35.143510266 +0100
+++ Makefile.inc	2019-10-19 08:55:49.590853311 +0200

@@ -1,12 +1,12 @@
-TECHNOLOGY=Spartan3
-PART=xc3s1000
-PACKAGE=ft256
-SPEED=-5
+TECHNOLOGY=Spartan6
+PART=xc6slx100
+PACKAGE=fgg484
+SPEED=-2
 SYNFREQ=48
 
 # PROMGENPAR=-x xcf04s -u 0 $(TOP).bit -p mcs -w -o digilent-xc3s1000
 MANUFACTURER=Xilinx
-MGCPART=3s1000$(PACKAGE)
+MGCPART=6slx100$(PACKAGE)
 MGCTECHNOLOGY=$(TECHNOLOGY)
 MGCPACKAGE=$(PACKAGE)

In the ZestSC1 experiments, we used a USB/TTL dongle, that we connected to a couple of GPIO pins - and through that, we obtained access to the LEON3 Debug Support Unit. But we are using a (much faster!) JTAG interface now - so we adapt the configuration to disable the former and enable the latter:

   constant CFG_AHB_MONWAR : integer := 0;
   constant CFG_AHB_DTRACE : integer := 0;
 -- DSU UART
-  constant CFG_AHB_UART : integer := 1;
+  constant CFG_AHB_UART : integer := 0;
 -- JTAG based DSU interface
-  constant CFG_AHB_JTAG : integer := 0;
+  constant CFG_AHB_JTAG : integer := 1;

Finally, since this FPGA is a monster compared to the Spartan3 XC3S1000, we can bump up the amount of BlockRAM (used to create the "memory" of the LEON cores) by 16 times! [2]

 -- LEON2 memory controller
   constant CFG_MCTRL_LEON2 : integer := 1;
   constant CFG_MCTRL_RAM8BIT : integer := 0;
@@ -132,7 +133,7 @@
   constant CFG_ROMMASK : integer := 16#E00# + 16#100#;
 -- AHB RAM
   constant CFG_AHBRAMEN : integer := 1;
-  constant CFG_AHBRSZ : integer := 16;
+  constant CFG_AHBRSZ : integer := 256;
   constant CFG_AHBRADDR : integer := 16#400#;
   constant CFG_AHBRPIPE : integer := 0;
 -- UART 1
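
As a quick sanity check of that "16 times", here is the arithmetic in Python. CFG_AHBRSZ is in kilobytes (the simulation log reports the AHB RAM in kbytes). The block RAM figures are assumptions taken from memory of the Spartan-6 datasheet (268 blocks of 18 Kb on the LX100, each holding 2 KB of data when parity bits are unused), so verify them before relying on the result; how GRLIB actually maps the RAM onto blocks may also differ:

```python
KB_PER_BLOCK = 2      # assumption: an 18 Kb block used as 16 Kb (2 KB) of data
LX100_BLOCKS = 268    # assumption: datasheet figure for the XC6SLX100

def blocks_needed(size_kb, kb_per_block=KB_PER_BLOCK):
    """Number of block RAMs needed for size_kb kilobytes (rounded up)."""
    return -(-size_kb // kb_per_block)

old_kb, new_kb = 16, 256                      # CFG_AHBRSZ before and after
print("growth factor:", new_kb // old_kb)     # 16
print("blocks for AHB RAM:", blocks_needed(new_kb), "of", LX100_BLOCKS)  # 128 of 268
```

Under those assumptions the AHB RAM alone takes roughly half of the block RAMs, leaving the rest for caches, register files and trace buffers.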

And for now, that's it - LEON configuration wise.

Now, there are many ways to use LEONs in one's design. To make things easier, for this 1st test, I will be using the freely available evaluation version of GRMON. GRMON is a debugging monitor/control tool specifically made to assist development with LEONs. For later stages in particular, where we will be loading the software we compiled inside our CPU, GRMON offers a GDB server; allowing us to debug things over good old GDB. Very convenient.

GRMON is not open-source, sadly - but at least the developers behind it know what they are doing. You don't download 15GB of kitchen sinks, you download 5 MB of a properly made, platform-specific command-line tool; that does one thing, and does it well.

One might even call this a philosophy.

Speaking of small, nice tools, you'd better download xc3sprog as well. It can be compiled from source - it's fully open; and then, instead of launching IMPACT to program our XC6SLX100, we will be able to spawn a tiny 300KB executable - and do all the work via a simple incantation in our Makefile:

xc3sprog -c xpc -v YourBitfileGoesHere

But enough about tooling, let's get back to the code.

What about leon3mp.vhd - the VHDL file that describes our LEON3 core?

--- ../leon3-zestsc1-xc3s1000/leon3mp.vhd	2019-03-17 09:36:24.901622577 +0100
+++ leon3mp.vhd	2019-10-19 08:45:55.520843754 +0200
@@ -60,15 +60,14 @@
     use_ahbram_sim          : integer := 0
   );
   port (
-    resetn   : in  std_ulogic;
-    clk	     : in  std_ulogic;
-    iu_error : out std_ulogic;
-    dsuact   : out std_ulogic;
-    dsu_rx   : out std_ulogic;
-    dsu_tx   : in  std_ulogic;
-    rx       : out std_ulogic;
-    tx       : in  std_ulogic;
-    IO : inout std_logic_vector(46 downto 0)
+    resetn        : in  std_ulogic;
+    clk           : in  std_ulogic;
+    iu_error      : out std_ulogic;
+    dsuact        : out std_ulogic;
+    rx            : out std_ulogic;
+    tx            : in  std_ulogic
   );
 end;
 
@@ -76,7 +75,7 @@
 
    constant blength : integer := 12;
    constant fifodepth : integer := 8;
-   constant maxahbm : integer := CFG_NCPU+CFG_AHB_UART; -- A truly "Spartan" set of AHB masters :-)

+   constant maxahbm : integer := CFG_NCPU+CFG_AHB_JTAG; -- A truly "Spartan" set of AHB masters :-)

Compared to the previous ZestSC1/Spartan3 design, GRMON won't be controlling the LEON's Debug Support Unit (DSU) via special serial data; we will be using JTAG instead (spawning grmon -u -xilusb - or, if you are using a Digilent HS2-compatible device, grmon -u -digilent). We therefore need to drop these DSU-serial signals (dsu_rx, dsu_tx).
The Pano UCF file also has no IO. It contains many signals towards other parts that look like a lot of fun, though - VGA signals, for instance... :-) Looking forward to hooking my HW Mandelbrot, directly on a monitor :-)

    -- my ZestSC1 board's frequency in KHz
-   constant BOARD_FREQ : integer := 48000;
-   -- cpu frequency in KHz will be 34000 - as per my S/P/R results,
+   constant BOARD_FREQ : integer := 25000;
+   -- cpu frequency in KHz will be 25000 - as per my S/P/R results,
    -- my design can easily reach this speed.
    constant CPU_FREQ : integer := BOARD_FREQ * CFG_CLKMUL / CFG_CLKDIV;
    constant IOAEN : integer := 0;
@@ -126,13 +123,9 @@
    attribute syn_keep : boolean;
    attribute syn_preserve : boolean;
    
-  -- RS232 APB Uart
-  signal rxd1 : std_logic;
-  signal txd1 : std_logic;
-
   -- A "heartbeat" LED for the DSU - I used it to make sure the
-  -- locally instantiated clock here beats indeed at 34MHz
-  -- (search below for 34000000 to see the logic)
+  -- locally instantiated clock here beats indeed at 25MHz

The clock in the Pano runs at 25MHz, not 48MHz.
We also need to instantiate the JTAG controller - and remove the DSU-controlling UART:

@@ -199,35 +192,28 @@
     dsuo.tstop <= '0'; dsuo.active <= '0';
   end generate;
 
+  ahbjtaggen0 :if CFG_AHB_JTAG = 1 generate
+    ahbjtag0 : ahbjtag generic map(tech => fabtech, hindex => CFG_NCPU)
+      port map(rstn, clkm, tck, tms, tdi, tdo, ahbmi, ahbmo(CFG_NCPU),
+               open, open, open, open, open, open, open, gnd(0));
+  end generate;
+
   -- To verify that the clock shenanigans actually work on my board,
   -- I hooked this up to LED6 (i.e. the 2nd from the right) and
   -- confirmed that the clock driving the LEON3 and the DSU and all
-  -- the rest is indeed a 34MHz clock.
+  -- the rest is indeed a 25MHz clock.
   process(clkm)
   begin
       if rising_edge(clkm) then
         counter_dsu <= counter_dsu + 1;
-        if counter_dsu = 34000000 then
+        if counter_dsu = 25000000 then
             counter_dsu <= 0;
             heartbeat_led_dsu <= not heartbeat_led_dsu;
         end if;
       end if;
   end process;
 
-  -- Debug UART
-  dcomgen : if CFG_AHB_UART = 1 generate
-    dcom0 : ahbuart
-      generic map (hindex => CFG_NCPU, pindex => 4, paddr => 7)
-      port map (rstn, clkm, dui, duo, apbi, apbo(4), ahbmi, ahbmo(CFG_NCPU));
-    dui.rxd <= rxd1;
-  end generate;
-  nouah : if CFG_AHB_UART = 0 generate apbo(4) <= apb_none; end generate;
-
-  urx_pad : inpad generic map (tech  => padtech) port map (dsu_tx, rxd1);
-  utx_pad : outpad generic map (tech => padtech) port map (dsu_rx, txd1);
-  txd1 <= duo.txd;
-  
 ----------------------------------------------------------------------
 ---  APB Bridge and various periherals -------------------------------
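
The clock arithmetic in these diffs is easy to double-check with a few lines of Python. The sketch mirrors the CPU_FREQ expression from leon3mp.vhd and the heartbeat counter; the function names are invented, and the 8/5 ratio at the end is a purely hypothetical example, not a value from the design:

```python
def cpu_freq_khz(board_freq_khz, clkmul, clkdiv):
    """CPU_FREQ := BOARD_FREQ * CFG_CLKMUL / CFG_CLKDIV, in integer arithmetic."""
    return board_freq_khz * clkmul // clkdiv

def heartbeat_toggle_period_s(count_limit, clk_hz):
    """Seconds between LED toggles: the counter steps through 0..count_limit
    inclusive before wrapping, i.e. count_limit + 1 clock cycles."""
    return (count_limit + 1) / clk_hz

print(cpu_freq_khz(25000, 5, 5))                          # 25000 KHz, i.e. the board clock
print(heartbeat_toggle_period_s(25_000_000, 25_000_000))  # about one second per toggle
print(cpu_freq_khz(25000, 8, 5))                          # hypothetical 8/5 ratio: 40000 KHz
```

So if the LED changes state about once per second, the clock driving the LEON3 is indeed running at 25MHz.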

All of these components that we are using - the AHBUART we just removed, the AHBJTAG we just added - they are coming from the open-source contents of the GRLIB. And this relates to an important concern about the HW world vs the SW one: the ecosystem of pre-made "library IP blocks" that one needs to make a system operational.

Now, I am not the only one stating that - when compared to their SW counterparts - the HW synthesis toolchains are in an abysmal state. There is a movement underway to implement open-source alternatives (e.g. see Yosys, arachne-pnr, etc). But for these efforts to succeed, an ecosystem of open library IPs needs to be developed around them.

I know the current "DNA" of HW engineers is very much of a proprietary nature - but IMHO, the HW design community needs to evolve beyond this. Become open-source mutants, like us SW people!

I am pretty sure some truly spectacular super-powers would come out of such a mutation.

Finally, let's update our testbench to comply with all the changes we did to our LEON3 design:

Adapt to new interfaces (remove DSU TX/RX, etc)
Remove the big test sending serial data over TX to control the DSU
And just reset the LEON for a little while.

--- ../leon3-zestsc1-xc3s1000/testbench.vhd	2019-03-08 17:41:44.770716858 +0100
+++ testbench.vhd	2019-10-20 08:58:26.754650548 +0200
@@ -31,6 +31,7 @@
 library gaisler;
 use gaisler.libdcom.all;
 use gaisler.sim.all;
 library techmap;
 use techmap.gencomp.all;
 use std.textio.all;
@@ -56,8 +57,9 @@
   signal rstn : std_ulogic := '1';
   signal iu_error : std_ulogic;
   signal dsuact : std_ulogic;
-  signal dsu_tx : std_logic;
-  signal dsu_rx : std_logic;
 
   component leon3mp
     port (
@@ -65,8 +67,8 @@
       resetn : in  std_ulogic;
       iu_error : out std_ulogic;
       dsuact : out std_ulogic;
-      dsu_rx : out std_ulogic; -- UART1 tx data
-      dsu_tx : in  std_ulogic  -- UART1 rx data
   );
   end component;
 
@@ -75,12 +77,12 @@
 begin
   d3 : leon3mp
     port map (
-        clk => CLK,
         resetn => rstn,
+        clk => CLK,
         iu_error => iu_error,
         dsuact => dsuact,
-        dsu_rx => dsu_rx,
-        dsu_tx => dsu_tx
     );
 
   clk <= not clk after CLK_PERIOD/2;
@@ -94,79 +96,21 @@
       severity failure;  
   end process;
 
-  dsucom : process
-    procedure dsucfg(signal dsutx : out std_ulogic; signal dsurx : in std_ulogic) is
-      variable w32 : std_logic_vector(31 downto 0);
-      variable c8  : std_logic_vector(7 downto 0);
-      constant txp : time := 320 * 1 ns;
-      variable l : line;
-    begin
-      dsutx <= '1';
-      write(l, String'("Resetting for 40 cycles"));
-      writeline(output, l);
-      rstn <= '1';
-      wait for 40*CLK_PERIOD;
-      rstn <= '0';
-      wait for 10*CLK_PERIOD;
-
-      wait for 5000 ns;
-
-      -- Send exactly what grmon3 sends.
-      txc(dsutx, 16#55#, txp);
-      txc(dsutx, 16#55#, txp);
-      txc(dsutx, 16#55#, txp);
-      txc(dsutx, 16#55#, txp);
-      txc(dsutx, 16#80#, txp);
-      txc(dsutx, 16#ff#, txp);
-      txc(dsutx, 16#ff#, txp);
-      txc(dsutx, 16#ff#, txp);
-      txc(dsutx, 16#f0#, txp);
-      txc(dsutx, 16#80#, txp);
-      txc(dsutx, 16#ff#, txp);
-      txc(dsutx, 16#ff#, txp);
-      txc(dsutx, 16#ff#, txp);
-      txc(dsutx, 16#f0#, txp);
-      txc(dsutx, 16#ff#, txp);
-
-      -- and look at the magnificent output from our design;
-      -- the DSU replies with 00 00 10 70 ; the proper response!
-
-      -- This test can also be used - it is the original
-      -- scenario taken from digilent-xc3s1000.
-
-      -- txc(dsutx, 16#55#, txp);		-- sync uart
-
-      -- txc(dsutx, 16#c0#, txp);
-      -- txa(dsutx, 16#90#, 16#00#, 16#00#, 16#00#, txp);
-      -- txa(dsutx, 16#00#, 16#00#, 16#20#, 16#2e#, txp);
-
-      -- wait for 25000 ns;
-      -- txc(dsutx, 16#c0#, txp);
-      -- txa(dsutx, 16#90#, 16#00#, 16#00#, 16#20#, txp);
-      -- txa(dsutx, 16#00#, 16#00#, 16#00#, 16#01#, txp);
-
-      -- txc(dsutx, 16#c0#, txp);
-      -- txa(dsutx, 16#90#, 16#40#, 16#00#, 16#24#, txp);
-      -- txa(dsutx, 16#00#, 16#00#, 16#00#, 16#0D#, txp);
-
-      -- txc(dsutx, 16#c0#, txp);
-      -- txa(dsutx, 16#90#, 16#70#, 16#11#, 16#78#, txp);
-      -- txa(dsutx, 16#91#, 16#00#, 16#00#, 16#0D#, txp);
-
-      -- txa(dsutx, 16#90#, 16#40#, 16#00#, 16#44#, txp);
-      -- txa(dsutx, 16#00#, 16#00#, 16#20#, 16#00#, txp);
-
-      -- txc(dsutx, 16#80#, txp);
-      -- txa(dsutx, 16#90#, 16#40#, 16#00#, 16#44#, txp);
-
-      -- Look! The DSUACT signal goes high! All good.
-      wait for 50000 ns;
-
-      write(l, String'("Test completed."));
-      writeline(output, l);
-    end procedure;
+  jtagproc : process
+    variable l : line;
   begin
-    dsucfg(dsu_tx, dsu_rx);
-    wait;
-  end process;
+    write(l, String'("Resetting for 40 cycles"));
+    writeline(output, l);
+    rstn <= '1';
+    wait for 40*CLK_PERIOD;
+    rstn <= '0';
+    wait for 10*CLK_PERIOD;
+
+    wait for 5000 ns;
+
+    write(l, String'("Looks like we are booting."));
+    writeline(output, l);
+    assert false report "Reached end of test" severity failure;
+   end process;
+
 end;

Time to launch GHDL to simulate this circuit - GHDL being a magnificent open-source simulator that you can compile from source (or install via your Linux distribution's repositories):

bash$ make simulation-setup
...

bash$ make simulation
...
Resetting for 40 cycles
Panologic G2 LX100 Demonstration design
GRLIB Version 2017.3.0, build 4208
Target technology: spartan6  , memory library: spartan6  
ahbctrl: AHB arbiter/multiplexer rev 1
ahbctrl: Common I/O area disabled
ahbctrl: AHB masters: 2, AHB slaves: 8
ahbctrl: Configuration area at 0xfffff000, 4 kbyte
ahbctrl: mst0: Cobham Gaisler          LEON3 SPARC V8 Processor       
ahbctrl: mst1: Cobham Gaisler          JTAG Debug Link                
ahbctrl: slv1: Cobham Gaisler          AHB/APB Bridge                 
ahbctrl:       memory at 0x80000000, size 1 Mbyte
ahbctrl: slv2: Cobham Gaisler          LEON3 Debug Support Unit       
ahbctrl:       memory at 0x90000000, size 256 Mbyte
ahbctrl: slv3: Cobham Gaisler          Single-port AHB SRAM module    
ahbctrl:       memory at 0x40000000, size 1 Mbyte, cacheable, prefetch
ahbctrl: slv4: Cobham Gaisler          Test report module             
ahbctrl:       memory at 0x20000000, size 1 Mbyte
apbctrl: APB Bridge at 0x80000000 rev 1
apbctrl: slv1: Cobham Gaisler          Generic UART                   
apbctrl:       I/O ports at 0x80000100, size 256 byte 
apbctrl: slv2: Cobham Gaisler          Multi-processor Interrupt Ctrl.
apbctrl:       I/O ports at 0x80000200, size 256 byte 
apbctrl: slv3: Cobham Gaisler          Modular Timer Unit             
apbctrl:       I/O ports at 0x80000300, size 256 byte 
testmod4: Test report module
ahbram3: AHB SRAM Module rev 1, 256 kbytes
gptimer3: Timer Unit rev 1, 8-bit scaler, 2 32-bit timers, irq 8
irqmp: Multi-processor Interrupt Controller rev 4, #cpu 1, eirq 0
apbuart1: Generic UART rev 1, fifo 4, irq 2, scaler bits 12
ahbjtag AHB Debug JTAG rev 2
dsu3_2: LEON3 Debug support unit + AHB Trace Buffer, 2 kbytes
leon3_0: LEON3 SPARC V8 processor rev 3: iuft: 0, fpft: 0, cacheft: 0
leon3_0: icache 1*8 kbyte, dcache 1*8 kbyte
clkgen_spartan3e: spartan3/e sdram/pci clock generator, version 1
clkgen_spartan3e: Frequency 25000 KHz, DCM divisor 5/5
        1750 ns : cpu0: 0x00000000    unimp  (trapped)
Looks like we are booting.
testbench.vhd:113:5:@6us:(assertion failure): Reached end of test
ghdl:error: assertion failed
  from: process work.testbench(behav).jtagproc at testbench.vhd:113
ghdl:error: simulation failed
make: *** [Makefile:38: simulation] Error 1

All good! The LEON3 traps after 1.75 microseconds, since it fetches a nice 32-bit zero from our "ram" at address 0 - and an all-zero word is not valid code for a SPARC (it decodes as unimp, exactly as the trace line shows).
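
To see why a zero word ends up as "unimp", here is a minimal sketch of the top-level SPARC V8 instruction decoding. It is not taken from GRLIB; the function name sparc_v8_classify and the returned labels are made up for illustration. Only the field positions (op in bits 31:30, op2 in bits 24:22) come from the SPARC V8 instruction formats.

```c
#include <stdint.h>

/* Classify a 32-bit SPARC V8 instruction word by its top-level fields.
 * op  = bits 31:30 selects the instruction format.
 * op2 = bits 24:22 only matters when op == 0 (format 2).
 * The function name and labels are illustrative, not from any library. */
static const char *sparc_v8_classify(uint32_t insn)
{
    uint32_t op  = insn >> 30;
    uint32_t op2 = (insn >> 22) & 0x7u;

    switch (op) {
    case 1: return "call";                      /* format 1 */
    case 2: return "arithmetic/logic/control";  /* format 3 */
    case 3: return "load/store";                /* format 3 */
    default: break;                             /* op == 0: format 2 */
    }

    switch (op2) {
    case 0: return "unimp";   /* what an all-zero word decodes to */
    case 2: return "bicc";    /* integer conditional branch */
    case 4: return "sethi";   /* includes nop = sethi 0, %g0 */
    case 6: return "fbfcc";   /* floating-point conditional branch */
    case 7: return "cbccc";   /* coprocessor conditional branch */
    default: return "reserved";
    }
}
```

Feeding it 0x00000000 lands in the op == 0, op2 == 0 case, which is the unimp that the simulator reports at address 0x00000000.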

The "assertion failure" is normal, since that's how the testbench ends:

assert false report "Reached end of test" severity failure;

Run Forrest, Run

Now, there are plenty more things we could do here - like preloading the simulated RAM with a binary we compile ourselves.

But we are insane SW people here, playing with forces we don't comprehend.

Let's launch the thing in the real HW!

$ make ise
...
... laptop fans wake up - sounds like an airplane here...
... 5 minutes pass... 
... there's no edit-compile-run cycle... there's...
... edit-compile-GoForAtripToTheAlpsAndStayForAweek-then-maybe-run cycle...
...
FLEXnet Licensing error:-5,357
For further information, refer to the FLEXnet Licensing documentation,
available at "www.flexerasoftware.com".
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:Map:258 - A problem was encountered attempting to get the license for this
   architecture.

Ah yes, I forgot!

$ # No wireless lan interface tolerated by Xilinx ;
$ # temporarily remove the driver for wlan0 from the kernel
$ sudo rmmod wl

$ # Must also have 'eth0' - with the magic MAC address set,
$ # so process my /etc/systemd/network/25-dummy.netdev
$ sudo systemctl restart systemd-networkd
$ sudo ifconfig eth0 up

I refuse to memorize idiocy - so I just add these commands to the Makefile; the network is automatically set up the way Xilinx wants it every time the build takes place - and is then automatically set back to normal (modprobe wl ; dhclient wlan0).

At least in the UNIX way of doing things, you can easily cope with - and automatically handle - even insane requirements.

Come to think of it, perhaps I should investigate making synthesis happen inside a Docker container, and set up this insane network inside the container. Hmm.

Oh well, postponed for later investigation.

For now, take 2:

$ make ise
...
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
INFO:Security:56 - Part 'xc6slx100' is not a WebPack part.
WARNING:Security:42 - Your software subscription period has lapsed. Your current
version of Xilinx tools will continue to function, but you no longer qualify for
Xilinx software updates or new releases.
----------------------------------------------------------------------
...
(panic attack at first - but thankfully, synthesis continues fine regardless)

(shakes head)

You poor, poor HW people...

...
All constraints were met.
...   
Generating Pad Report.

All signals are completely routed.

Design statistics:
   Minimum period:  ........ (Maximum frequency:  83.081MHz)

...
Creating bit map...
Saving bit stream in "TheBigLeonski.bit".
Creating bit mask...
Saving mask bit stream in "TheBigLeonski.msk".
Bitstream generation is complete.

Woohoo! We're good. Way beyond good, in fact - we can bump up our LEON's clock way above our current setting of 25MHz.

But before we do that, let's bump the number of cores - in fact, this FPGA is such a monster that it can easily accommodate 2, even 4 LEONs. The utilization report - showing the percentage of utilised resources - is far from maxed out, in everything (except BlockRAMs [2]).

So we bump up the number of cores:

-- 2 LEON cores, please!
constant CFG_NCPU : integer := (2);

...and we bump up the clock, to a very safe 50MHz:

-- 10/5 = 2, so 2x25MHz = 50MHz
constant CFG_CLKMUL : integer := (10);
constant CFG_CLKDIV : integer := (5);
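
The arithmetic behind those two constants can be sketched in a few lines of C. This is only an illustration: the function names are hypothetical, the 25000 kHz base is the frequency reported by the clock generator in the simulation log, and the 83081 kHz limit is the maximum frequency that the timing report printed for the single-core build (a rebuild with more cores will report its own figure).

```c
#include <stdint.h>

/* Output frequency of a multiply/divide clock generator, in kHz.
 * Hypothetical helper: mirrors base * CFG_CLKMUL / CFG_CLKDIV. */
static uint32_t synth_clock_khz(uint32_t base_khz, uint32_t mul, uint32_t div)
{
    /* 64-bit intermediate so base * mul cannot overflow */
    return (uint32_t)(((uint64_t)base_khz * mul) / div);
}

/* Returns 1 when the requested clock does not exceed the maximum
 * frequency reported by the timing analysis, 0 otherwise. */
static int meets_timing(uint32_t clock_khz, uint32_t fmax_khz)
{
    return clock_khz <= fmax_khz;
}
```

With a 25000 kHz base, synth_clock_khz(25000, 10, 5) gives 50000 kHz, which meets_timing() accepts against 83081 kHz; a 20/5 setting (100000 kHz) would not pass.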

This is the part that should make you sit up and take notice - here we are, casually specifying, in code, that we want 2 cores in our CPU. FPGAs are amazing.

We run our synthesis again, and a few minutes later...

That's it - GRMON sees both our cores, running at 50MHz:

JTAG chain (1): xc6slx100 
GRLIB build version: 4208
Detected frequency:  50.0 MHz

Component                            Vendor
LEON3 SPARC V8 Processor             Cobham Gaisler
LEON3 SPARC V8 Processor             Cobham Gaisler
JTAG Debug Link                      Cobham Gaisler
AHB/APB Bridge                       Cobham Gaisler
LEON3 Debug Support Unit             Cobham Gaisler
Single-port AHB SRAM module          Cobham Gaisler
Generic UART                         Cobham Gaisler
Multi-processor Interrupt Ctrl.      Cobham Gaisler
Modular Timer Unit                   Cobham Gaisler

Time to compile some SW and run it inside this!

The cross-compiler

One can compile a cross-compiler for this target from source (and in fact I frequently do, as part of my duties in the Agency). But to avoid making this gigantic blog post even heavier, let's just use the precompiled open-source toolchain of BCC2 - from here. We un-tar it under /opt, and build our hello world:

$ cat hello.c
#include <stdio.h>
int main() { puts("Hello, Big Leonski!"); }

$ /opt/bcc-2.0.8-gcc/bin/sparc-gaisler-elf-gcc -mcpu=leon3 \
    -o hello hello.c

$ /opt/grmon-eval-3.1.0/linux/bin64/grmon -u -xilusb
...
grmon3> load hello                      
40000000 .text       25.2kB /  25.2kB   [===============>] 100%
40006500 .rodata      128B              [===============>] 100%
40006580 .data        1.2kB /   1.2kB   [===============>] 100%
Total size: 26.53kB (1.18Mbit/s)
Entry point 0x40000000
Image /var/tmp/hello loaded

grmon3> run
Hello, Big Leonski!

  CPU 0:  Program exited normally.
  CPU 1:  Power down mode
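
The load addresses above all sit inside the on-chip AHB SRAM. As a rough sanity check, here is a small sketch that tests whether a section fits in that window. It assumes the synthesized design keeps the 256-kbyte RAM at 0x40000000 that the simulation log reported; the macro and function names are made up for illustration, and the check ignores .bss, heap and stack.

```c
#include <stdint.h>

/* Assumed RAM window: values taken from the simulation log
 * ("ahbram3: AHB SRAM Module rev 1, 256 kbytes", mapped at 0x40000000). */
#define RAM_BASE 0x40000000u
#define RAM_SIZE (256u * 1024u)

/* Returns 1 if the byte range [addr, addr + size) lies entirely inside
 * the RAM window, 0 otherwise. Written to avoid 32-bit overflow. */
static int fits_in_ram(uint32_t addr, uint32_t size)
{
    uint32_t offset;

    if (addr < RAM_BASE)
        return 0;
    offset = addr - RAM_BASE;
    return offset <= RAM_SIZE && size <= RAM_SIZE - offset;
}
```

For example, .text starts at 0x40000000 and .rodata at 0x40006500, so .text spans 0x6500 bytes; fits_in_ram(0x40000000, 0x6500) and fits_in_ram(0x40006500, 128) both hold, with plenty of room to spare.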

And that's it - we have ourselves a multi-core CPU, built from our own source code, running binaries built from our own source code, with a cross-compiler that can also be built from openly accessible source code.

What's next?

Ideally, one would want to support the remaining pieces of this board: it has two USB slots, Ethernet, and most importantly 128MB of DDR2 SDRAM. These last two pieces in particular would elevate it to something like the first "serious" machine I worked with, back when I was a student: a SPARCStation. I'd love that; and if the HW controllers involved are supported by Linux, bootstrapping the undisputed king of OSes inside this would be a breeze.

Alas, I am told by my friends that DDR controllers are no joke; they are not the playground of bored SW engineers.

Sigh :-)

Still, I hope you found this (very long) read an interesting one.
Cheers!

Discussion in Slashdot

Discussion in Hacker News

Discussion in Hackaday

Discussion in Reddit/Linux

Discussion in Reddit/FPGA

Notes

[1] To HW designers reading this - please remember who the intended audience of this blog post is. Come to think of it, remember this is written by a SW developer; cue appropriate meme.
[2] Until an actual HW wizard makes the Pano Logic DDR2 SDRAM work! Which will give us an insane 128MB of space... At that point, I will boot Linux on this thing.


Updated: Tue Jun 13 21:45:26 2023
