Devember devlog: Neural network Go package

Pretty excited for this one. This project will be a nice warmup for next year, as I’m switching from being a DevOps Engineer to a Go Developer (woohoo). The goal is therefore to explore more of the advanced Go topics: proper concurrency patterns, the Plan 9 flavor of assembly, and profiling. Profiling is rather easy by now, I’ve explored it before, but I’ve never programmed in this specific assembly language; I’ve only done Intel 8088 in the past. But oh, how I want to optimize, to use those oh-so-nice Go features. This, of course, means that I’m aiming to use all of the cores available on a given machine. Go channels and all that.

Goal: solve the MNIST database. I’m aiming less at getting the error rate as low as I can, since that would mean more math than I want to get into, and tougher network types. A simple Linear Regression network will do.

Also, I’m a bit of a cheat, since I’m not starting from nothing. I actually started this a month ago, as something to show off my work for hiring. But the scope was a bit much for too little time, so I’m recycling it for here. There’s still a lot of work to do, so I’m not even ashamed.

What I’ve got now: a little internal package for reading and writing IDX files, and a somewhat ugly implementation of the maths with still-failing tests that I’m not sure are accurate at all yet. I attempted to preempt a bit and vectorize from the get-go, so I’m using a helper package for a matrix representation.

Here she is:

Development stats NOW!


Oh damn, congrats!

Very nice. Seems interesting to say the least. Looking forward to following along to see how Go helps you solve these problems :smiley:

1 Like

Best of luck to you.
I will probably understand none of this Devember :stuck_out_tongue:


Thanks, both of you!

I’ve used the neural network series from the YT channel 3Blue1Brown to gain an understanding of the math, it’s pretty good! I also got a hint about what to do with biases from StackExchange, I think, and found some awesome properties while researching this. I’ll def describe everything here.


Day 1

Overhauled the nums executable’s command line option parsing with help from the awesome go-flags package. Now I’ve got a friendly output for how to use it. Also added and implemented all of the profiling options for use later.

bin ➤ ./nums -h                                                                                                                                   git:master
  nums [OPTIONS] <new | run | train>

Profiling and tracing options:
      --cpuprof= Turn on CPU profiling
      --trace=   Turn on runtime tracing
      --memprof= Turn on memory profiling

Help Options:
  -h, --help     Show this help message

Available commands:
  new    Create an empty network
  run    Run the network on test data
  train  Train on a given network.

More importantly I want to describe what I went through and how I’m planning to implement the network, to both share my thoughts/findings and review:

A neural network is basically a graph with input and output layers, plus any number of layers in between. Every node is connected to all of the following layer’s nodes (this is important, I’ll explain later). My first instinct, of course, was to implement it as a graph.

type Node struct {
  Activation float32
  Nodes      []*Node
}

It became obvious early on that this would never perform well:

  1. You cannot vectorize this
  2. To calculate a single node’s activation, you’d need to dereference every connection, so you’d probably cache miss every single time, at least +100 ns for every single weight. And neural networks are big, this would add up fast

So I reworked Node into holding an array of connection weights: since a node is connected to every single node in the next layer, you can just keep an array.

type Node struct {
  Activation float32
  Weights    []float32
}

This is heaps better than the last option, so I went with it. I implemented the network traversal and the calculation of activations, and all was well and good, I got the vectorization I wanted. Until it was time to implement backpropagation. Essentially, backpropagation is done after you calculate the activations for a training input: you then move backwards through the graph, taking every activation/weight/bias and computing a ratio for each while working towards the front of the network (the input). These ratios tell you how much you need to adjust the value of every single weight and bias.

So what you want, optimally, is for every node to also have every weight connected *to* it in a vectorizable state, to allow for acceleration. And those weights live one per node across the whole previous layer, so you have to load every single one, from every single node of the previous layer. The code got convoluted fast. Not good for speed either.

So after some hours of trying to figure this out, trying to learn the math, and seeing matrices used in some of the explanations, I figured I’d try implementing it with matrices, maybe that would be easier. Computers are good at matrices, right?

So I organized my matrices like so:

type LinearRegression struct {
  // ...
  matrices     []*mat.Dense
}

One matrix represents a layer of connections, not nodes.

Due to how the underlying data is laid out, every row of the matrix is a single node’s weights to the next layer’s nodes, and you can get that as a raw vector (yay). You may notice that gray circle. That’s how you add biases to the matrix: you just add an activation that is always 1 in the right places, and the corresponding weight essentially becomes the bias. Simplified, the function is a2 = a1 * w + b, a1 being the previous layer’s activation, so you cheat by essentially doing a2 = a1 * w + 1 * b. This way the bias is supplied together with the weights. This is the little trick I found on StackExchange. F-kn genius. Bias needs no special treatment.

This does not fix the layout of the data, though. For backpropagation, I still don’t have a vectorizable layout; the right weights are still scattered. After an hour or so of staring at it furiously, it dawned on me. You simply transpose the matrix. That’s it.
This is the transposed variant of the last one. Suddenly, each row becomes a vector of those much-needed connections from every single previous node. Now I can get them. As a vector. Well, almost. The underlying layout is still wrong with the package I’m using; I only get an interface that lets me address the members with At(x, y int) float64. Not good. But I think I can get away with copying the matrix into the right layout. So I basically end up with a transposed copy for use in that moment. Good enough.

rows, cols := n.matrices[idx].Dims()
transposed := mat.NewDense(cols, rows, nil)
transposed.Copy(n.matrices[idx].T()) // materialize the transpose into contiguous rows

Voilà. transposed now has the right layout, and I only need to do this once per layer calculation; mega savings, I think. I’m yet to see how this will turn out. There’s another cool idea I have to save allocations for later. The copy won’t go anywhere though, I think.

With this, you can only have networks whose nodes connect to every node in the next layer; you can’t express anything else. So this won’t work very well, without hacks, for networks that use mutations and add nodes spontaneously.

So, the entire network is expressed as an array of matrices. One layer of connections = one matrix, so the number of matrices = the number of layers − 1. The matrices can have different sizes (for different numbers of nodes per layer) and still work properly. This is basically where I’m at right now. Tests still fail, probably because I don’t add the cheat bias activation correctly yet. That’s what I’ll start working on tomorrow. I’m done for today!

EDIT: Crappy grammar.


Day 2

Nothing landed in code; I had very little time to work on it. Spent it debugging, trying to figure out what caused the issue.

Day 3

I’ve rewritten the backpropagation code in a nicer way, but something somewhere still doesn’t add up. So I’ve spent most of today figuring out the structure on paper and cross-checking with the code to see what’s wrong. So far it seems that the way I structure my matrices isn’t exactly correct, so I’ll probably shuffle things around. That means the layouts posted in Day 1 are probably not correct either. I’ll finish working everything out on paper, then transfer all of that into code. I might have some success today, but that will be a write-up for tomorrow.

Today’s resulting code:
func (n *LinearRegression) Backpropogate(activations []float64, desiredOutput []float64) (Delta, error) {
	if len(activations) != n.layers[0] || len(desiredOutput) != n.layers[len(n.layers)-1] {
		return Delta{}, ErrIncorrectArgument
	}

	allActivations, err := n.traverse(activations)
	if err != nil {
		return Delta{}, errors.Wrap(ErrInternalError, err.Error())
	}

	// Initialize ratios set
	ratios := make([][]float64, len(allActivations))
	for idx := 0; idx < len(ratios); idx++ {
		ratios[idx] = make([]float64, len(allActivations[idx]))
	}

	f64.CostActivationRatioOutputsNaive(allActivations[len(allActivations)-1], desiredOutput, ratios[len(ratios)-1])

	derrivSigmaPastLayer := append(make([]float64, 0, len(allActivations[len(allActivations)-1])), allActivations[len(allActivations)-1]...)

	normalizedPastLayer := append(make([]float64, 0, len(allActivations[len(allActivations)-1])), allActivations[len(allActivations)-1]...)

	for idx := len(n.matrices) - 1; idx >= 0; idx-- {
		derrivSigma := append(make([]float64, 0, len(allActivations[idx])), allActivations[idx]...)

		rows, cols := n.matrices[idx].Dims()
		transposed := mat.NewDense(cols, rows, nil)
		transposed.Copy(n.matrices[idx].T()) // copy into the transposed layout

		for cIdx := 0; cIdx < len(ratios[idx]); cIdx++ {
			ratios[idx][cIdx] = f64.CostActivationRatioNaive(transposed.RawRowView(cIdx), derrivSigmaPastLayer, ratios[idx+1])
			f64.CostWeightRatioNaive(normalizedPastLayer, derrivSigma, &ratios[idx][cIdx])
		}

		derrivSigmaPastLayer = derrivSigma
		normalizedPastLayer = append(normalizedPastLayer[0:0], allActivations[idx]...)
	}

	return Delta{
		deltas: 1,
		values: ratios,
	}, nil
}
I rewrote it to save some allocations by reusing calculations. I also changed how I copy data to save some more. Later on, I will use sync.Pool to reuse slices and cut allocations down significantly, but I’ll only do that once I can properly measure everything.
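For reference, the sync.Pool idea would look roughly like this. A toy sketch, the names (scratch, withScratch) are made up, not code from the repo:

```go
package main

import (
	"fmt"
	"sync"
)

// scratch hands out reusable float64 slices instead of allocating
// fresh ones on every Backpropogate-style call.
var scratch = sync.Pool{
	New: func() interface{} { return make([]float64, 0, 1024) },
}

// withScratch runs f with a pooled buffer of length n, then returns
// the buffer to the pool. Contents are stale on reuse, so callers
// must fully overwrite the slice.
func withScratch(n int, f func(buf []float64)) {
	buf := scratch.Get().([]float64)[:0]
	if cap(buf) < n {
		buf = make([]float64, 0, n)
	}
	f(buf[:n])
	scratch.Put(buf)
}

func main() {
	withScratch(4, func(buf []float64) {
		for i := range buf {
			buf[i] = float64(i) * 2
		}
		fmt.Println(buf) // [0 2 4 6]
	})
}
```

The pool only pays off once profiling shows the per-call allocations actually matter, which is exactly why it’s deferred here.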


Can you show your paper work? I didn’t know that was a way to work towards better code. Perhaps if I saw what you’re doing I would understand a bit more.
PS I know nothing about coding.


I will create a visual like I did with those matrices and the graph above, at least for the graphs and matrices; writing formulas in an app (I use LibreOffice Draw) is a bit time consuming, plus I’d just end up writing the same ones that are in the 3Blue1Brown series.

My paperwork is usually a stream of consciousness, not really concise; it’s just there for me to understand things as I work. It’s nothing spectacular, definitely not mathematical eye candy, and probably syntactically inaccurate anyway. I’m mostly useless when it comes to complicated math, even though I have a fair understanding of the relevant branches thanks to my unfinished Computer Science degree.

I just try to create a visual image of how everything should work really. I’ve probably redrawn and rewritten everything about 3-4 times now :laughing:

Here’s the link to the series BTW, if you’re interested:

It does not assume that you know coding, so you should be able to understand it well. Even if you don’t really know the mathematical concepts, the visuals in there will give you a good idea.

Thanks for the link. It helps. I subscribed to it as well.

1 Like

Writing out algorithms on paper, and drawing diagrams of structures is usually very helpful. It makes you organize your thoughts.


Day 4

After an hour or two I managed to find where the issue was. I was, indeed, structuring the matrices wrong in the code. I’m now about halfway through going over the whole LinearRegression implementation to make sure everything’s right. Only a few spots are wrong, but that means some changes to the tests, and I’ve got to make sure those are right too. Will probably finish that up tomorrow.


Day 5

Did I fix it? Maybe. You know that surreal moment when something seems to work, and you don’t know why? Yeah, that.

The resulting commit is here. I’d feel victorious, but I don’t know if it’s actually fine. It looks ugly in places, but it will be revisited for optimization anyway. I’m kind of tired of the math, so I’ll switch gears and work on the supporting areas for a while.

What I need to do now:

  • Saving/Loading the network to/from file - will probably just use encoding/binary, as simple as possible
  • Write benchmarks for Activation and Backpropogation - the very important part, and the very easy part
  • Finish Delta implementation w/ tests - that’s what Backpropogation spits out
  • Implement a form of batching (for network training), Go will shine here for sure
  • Wire up all of the things to the executable
  • Make sure everything works
  • Start optimizing

An observation on my own code: I’m slicing the slices (heh) quite a lot. That is not free. I’m very much afraid of excessive bounds checking hidden in there, but I won’t know if it’s there until I measure.

EDIT: By the way, my weekend will be spent away from home. So I get to break my promise haha. I might sneak something in, but it’s unlikely.


Day 6

Opted to go for the quick task of writing a benchmark today. Results are as follows:

goos: linux
goarch: amd64
BenchmarkLinearRegressionActivate/Small-4         732676              1646 ns/op             704 B/op          5 allocs/op
BenchmarkLinearRegressionActivate/Medium-4         38347             30459 ns/op            8339 B/op          6 allocs/op
BenchmarkLinearRegressionActivate/Large-4            235           4734385 ns/op          139654 B/op          6 allocs/op
BenchmarkLinearRegressionActivate/XLarge-4           241           4699473 ns/op          192203 B/op          6 allocs/op
BenchmarkLinearRegressionBackpropogate/Small-4            109921             10567 ns/op            4128 B/op         21 allocs/op
BenchmarkLinearRegressionBackpropogate/Medium-4             6558            179501 ns/op          151687 B/op         27 allocs/op
BenchmarkLinearRegressionBackpropogate/Large-4                33          33896203 ns/op        21602672 B/op         27 allocs/op
BenchmarkLinearRegressionBackpropogate/XLarge-4               31          32851399 ns/op        21899277 B/op         27 allocs/op
ok  15.147s

Top 10 functions:

Showing nodes accounting for 13690ms, 82.52% of 16590ms total
Dropped 147 nodes (cum <= 82.95ms)
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
    7150ms 43.10% 43.10%     7150ms 43.10%
    1880ms 11.33% 54.43%     1880ms 11.33%
    1180ms  7.11% 61.54%     1180ms  7.11%  runtime.memclrNoHeapPointers
    1000ms  6.03% 67.57%     1000ms  6.03%
     940ms  5.67% 73.24%      940ms  5.67%
     370ms  2.23% 75.47%      370ms  2.23%  sync.(*Mutex).Lock
     310ms  1.87% 77.34%      310ms  1.87%  math.Exp
     310ms  1.87% 79.20%     1240ms  7.47%  math.pow
     300ms  1.81% 81.01%      300ms  1.81%  math.Log
     250ms  1.51% 82.52%      250ms  1.51%  sync.(*Mutex).Unlock

So, three findings:

  1. Bounds checks are not there (phew)
  2. If I just optimize one function with AVX, I’ll get mega performance gains. The rest will follow.
  3. I don’t have that many allocations. But they’re probably big ones.

There’s also a PDF with a graph in the repository. But I’m pretty happy about what I’m seeing. Next tasks ahead!


Days 7, 8 and 9

Had a fantastic weekend and had to help out a buddy by writing a small tool to convert between file formats. He works in a bio lab and was given an implementation in some GUI coding tool (you code with PCB-schematic-like components) that was stupid beyond belief. It loads the entire binary file into memory (as a matrix) and does a transpose to get the correct output, which limits the size of files (my buddy was hitting that limit) and takes five years (approx. 10 minutes) to complete. I got to the bottom of the problem and rewrote the thing in 54 lines of Go code. It does the same file (~400MB) in ~35 seconds, has no memory restrictions, and gzips immediately to save space. Scientists, man.

Day 10

Implemented network marshaling and unmarshaling, so now I can persist networks in files. Fantastic. Removed some unused code, thus bumping my test coverage up to 77.7%. No performance improvements will be worked on until I finish the remaining tasks. Only a couple of things remain.

I did self-serialization by implementing io.WriterTo. So you just complete your work with the network and call LinearRegression.WriteTo(writer) to write it out wherever. I give an io.Reader to another function, and it figures out the type of network, unmarshals it, puts it into the correct struct (just one for now lol), and returns a Network. Cool stuff; I love working with file formats in Go.

Self-serialization implemented here:

Creating a network from an io.Reader implemented here:


Day 11

Completed the implementation and tests for the Delta struct. It now has a naive implementation for applying another Delta to itself and later calculating averages from all the applied components. Basically, it sums all the data while keeping track of how many Deltas were summed, then divides all the components by that count to get the average.
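The summing-and-counting idea boils down to something like this. A flattened toy sketch: the field names mirror the Delta literal from Day 3’s code, but the real struct holds per-layer slices, not one flat one:

```go
package main

import "fmt"

// Delta accumulates backpropagation ratios: values holds the running
// sums, deltas counts how many Deltas have been folded in.
type Delta struct {
	deltas int
	values []float64
}

// Apply folds another Delta into this one by summing components.
func (d *Delta) Apply(other Delta) {
	for i, v := range other.values {
		d.values[i] += v
	}
	d.deltas += other.deltas
}

// Average divides the sums by the count of applied Deltas.
func (d *Delta) Average() []float64 {
	avg := make([]float64, len(d.values))
	for i, v := range d.values {
		avg[i] = v / float64(d.deltas)
	}
	return avg
}

func main() {
	a := Delta{deltas: 1, values: []float64{1, 2}}
	b := Delta{deltas: 1, values: []float64{3, 6}}
	a.Apply(b)
	fmt.Println(a.Average()) // [2 4]
}
```

Keeping the count inside the struct means a batch of any size averages correctly without the caller tracking anything.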

Also started working on the training mechanism. Decided I will implement it simply as a function with a signature of:

func Train(network Network, batchSize int, stepModifier float64) chan<- TrainingCase

It will set everything up according to the settings, and then I can just feed the channel the training set and it will do its work on all cores. Close the channel when complete, and done. It’s interesting TDD’ing this thing; I’ll probably finish it tomorrow.


An intelligent software engineer? In the wild? Good God, man, make haste!

1 Like

It’s weird being called intelligent for things that all software engineers should be doing anyway. Guess it’s not the norm yet :laughing:

1 Like

Day 12

I haven’t finished the training mechanism yet, mostly because I decided to change how I’m doing it. Instead of just having a function, I decided to make a Trainer interface to keep it extendable:

// Trainer describes a training mechanism for Network types.
type Trainer interface {

	// Train launches the training routine for a network.
	// A channel is returned which accepts TrainingCases from
	// the training set. Closing the channel completes the training.
	// After closing the channel, the user must wait on the `done`
	// channel before performing any actions on the trained Network
	// to prevent races.
	// Train will attempt to use all cores of the machine by spawning
	// runtime.NumCPU() goroutines.
	Train(Network) (trainingSet chan<- TrainingCase, done <-chan struct{})
}

Train now just receives a Network on invocation; I figured it would be much simpler that way, leaving the details to the struct that implements it. Train also returns two channels now: one that gets fed the training set, and one that is signaled when all the goroutines complete their work, because at that point all mutations to the Network are done. Man, I love Go for concurrency, so easy and beautiful to write.


Ya, that’s always been the issue with more scientific workloads in Go: multi-dimensional arrays aren’t super efficient. You can’t, for example, just make an array of size [32][16][2]float32 in a make statement. So either everything stays dynamically sized, or you allocate the outer slice and iterate through it, setting each item to the size of the inner.

If you wanted to get crazy, you could just deal with a raw byte array, serialize the floats, and have your matrix struct handle what was the slice management when you need them. Sounds crazy, but I do it a lot with the multi-d geographic data I work with.


Yep, that’s mostly what I took away from programming in C ages ago: it’s way faster to calculate positions in one large array. You have only one array, so you don’t cache miss and dereference nearly as much, which is great.

The gonum package that I’m using for matrices does this underneath, so I’m mostly fine with the weights. I’m not doing that with the Delta struct yet (the ratios calculated by backprop), which is less than efficient. I have an idea for how to implement it with gonum matrices too; I might do it later as an optimization path!

1 Like