Brains not Beer



  • Total Good


BC: Before PC   AD: After Digital
Turing   Layers
Logic Gate   Biology
Neuron   Images
Learning   Time Series
XOR Fail  

Advanced Topics

  • Feedback/Recursion
  • Optimization
  • Reinforcement
  • Dropout


  • NN Applications (by Zeke)
  • NN Tuning (by Thunder)


  • 1943 MCP neuron (binary logic)
  • 1948 Hebbian Learning
  • 1958 Rosenblatt Perceptron
  • 1964 Kernel Perceptron
  • 1969 XOR fail
  • 1972 ELIZA vs. Parry

AI Dies

It's Alive!

  • 1970 Conway: Game of Life
  • 1972 ELIZA
  • 1980s Multilayer FFNN
  • 1990 robotics and games
  • 2000 sports (basketball)
  • 2010 Deep learning
  • 2012 convolutional
  • 2013 hyperopt

Turing's Universal Computer

After winning WWII by cracking the Enigma code, Turing designed the first Programmable Computer
  • Logic Gates (Relays)
  • Memory (Tape)

Biological Brains

Neuroscientists simulated a whole brain

... of a nematode worm (C. elegans)

~300 neurons

~200 in central nervous system

C. elegans: most photographed organism of all time?

Pretend Brains

Artificial brains aren't at all like human brains,

. . .

or even a worm brain

Neuron simulations are broad abstractions:

  • pulses in time not modeled
  • chemistry not modeled
  • neuron internal feedback loops not modeled

Brain? Really?

No, not really. Just pretend. In a computer.


  • For centuries: stars modeled as riding on crystal spheres
  • Before Copernicus: Earth at the center
C. elegans' brain is shaped like a donut around its pharynx


Typical animal neuron

"Pretend" Neuron

Math mirrors life


  • You don't need to know Linear Algebra, just...
    • multiply
    • add
    • check thresholds
  • Equation/code is short (in Python; see the sketch below)
  • 100s of neurons
    • Not billions
  • Train for minutes
    • Not decades
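
A minimal sketch of that recipe for a single "pretend" neuron; the weights, inputs, and threshold below are made-up numbers, just for illustration:

{% highlight python %}
import numpy as np

def neuron(inputs, weights, threshold=0.0):
    """Multiply, add, and check a threshold -- that's the whole neuron."""
    weighted_sum = np.dot(weights, inputs)       # multiply and add
    return 1 if weighted_sum > threshold else 0  # check the threshold

# made-up example values
print(neuron(inputs=[1, 0, 1], weights=[0.5, -0.2, 0.3]))  # -> 1
{% endhighlight %}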


McCulloch Pitts Neuron (MCP)

Modeled after biological neurons. Can be combined to perform any logical or mathematical operation.

  • Binary output: 0 or +1
  • Any number of binary inputs
  • Inhibitory input with "veto" power
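
A sketch of an MCP neuron along those lines; the threshold of 2 and the AND-gate wiring are just one illustration, not the only combination:

{% highlight python %}
def mcp_neuron(excitatory, inhibitory, threshold=2):
    """McCulloch-Pitts neuron: binary inputs in, 0 or 1 out.

    Any active inhibitory input vetoes the output; otherwise the
    neuron fires when enough excitatory inputs are on.
    """
    if any(inhibitory):                      # "veto" power
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# wired as an AND gate: both inputs must be on to reach the threshold
print(mcp_neuron([1, 1], inhibitory=[]))     # -> 1
print(mcp_neuron([1, 0], inhibitory=[]))     # -> 0
print(mcp_neuron([1, 1], inhibitory=[1]))    # -> 0 (vetoed)
{% endhighlight %}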

Let's Simulate

  • raise your hand if:
    • either of your neighbors raises her hand
  • put your hand down if:
    • both of your neighbors are in the same "state"


Cellular Automata, Wolfram Rule 82 = XOR


Cellular Automata, Wolfram Rule 110 = Complex

Game of Life

Game of Life gliders

Rosenblatt's Perceptron

Designed to be "trainable": Rosenblatt provided a training algorithm

  • Binary output: -1 or +1
  • Any number of real inputs
  • Threshold = 0
  • Weights and inputs can be real-valued

Let's Build One

iPython Notebook
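
The notebook itself isn't reproduced here, but a minimal sketch of Rosenblatt's training rule, using an OR gate as made-up training data, looks roughly like this:

{% highlight python %}
import numpy as np

def train_perceptron(examples, targets, learning_rate=0.1, epochs=10):
    """Rosenblatt's rule: nudge the weights whenever the prediction is wrong."""
    examples = np.asarray(examples, dtype=float)
    weights = np.zeros(examples.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(examples, targets):
            predicted = 1 if np.dot(weights, x) + bias > 0 else -1
            error = target - predicted                 # 0 when the prediction is right
            weights += learning_rate * error * x       # nudge toward the target
            bias += learning_rate * error
    return weights, bias

# example: learn an OR gate with -1/+1 targets
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, 1, 1, 1]
w, b = train_perceptron(X, y)
print([1 if np.dot(w, x) + b > 0 else -1 for x in X])  # -> [-1, 1, 1, 1]
{% endhighlight %}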

Learned Logic

Modern Neurons

Activation functions

  • sigmoid
  • saturation
  • threshold
  • linear
  • sinc
  • tanh
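
A few of these, sketched as one-liners (tanh and sinc already ship with numpy):

{% highlight python %}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # smooth S-curve, squashes to 0..1

def threshold(x):
    return np.where(x > 0, 1.0, 0.0)     # hard step

def saturating_linear(x):
    return np.clip(x, -1.0, 1.0)         # linear in the middle, saturates at the edges

def linear(x):
    return x                             # identity (no squashing at all)

# tanh and sinc: np.tanh(x), np.sinc(x)
x = np.linspace(-5, 5, 11)
print(sigmoid(x).round(2))
print(np.tanh(x).round(2))
{% endhighlight %}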

Priorities Matter

  • Neanderthals' big eyes likely drove them to extinction
  • Too much of a good thing
    • Less room for imagination
    • Fewer neurons for social interaction
Neanderthals' large eyes may have hurt their chances

Lesson: Depth Wins

A deeper brain may be better


High resolution sensing


  • Many layers (6+ for "deep" learning)
  • Many neurons/layer
  • Sophisticated Connection Architectures
    • fully-connected
    • convolutional
    • recursive
    • sparse
    • random
    • scale-free

Neural Nets were "made" for ...

  • Images (object recognition)
  • Sound (speech recognition)
  • Time series (weather, finance, election prediction)

Pattern Recognition

  • Prediction
  • Segmentation (sound, image)
  • Feature detection
  • Fraud detection
  • Intrusion detection
  • Game cheating detection . . .

But often they can produce useful features that seemingly don't make sense

. . .

except for images

Neural Nets help when ...

You don't know what to look for
(feature engineering)

  • FFT
  • DCT
  • Wavelets
  • RF
  • Statistics (mean, std, diff, polynomial)
  • Resampling/Interpolation/Extrapolation
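
For example, a hand-built feature vector for a 1-D signal might mix a few items from that menu; the particular choices below are arbitrary:

{% highlight python %}
import numpy as np

def hand_built_features(signal):
    """Feature engineering by hand: a few statistics plus low-frequency FFT magnitudes."""
    signal = np.asarray(signal, dtype=float)
    stats = [signal.mean(), signal.std(), np.diff(signal).std()]   # mean, std, diff
    spectrum = np.abs(np.fft.rfft(signal))[:3]                     # first few FFT magnitudes
    return np.concatenate([stats, spectrum])

print(hand_built_features(np.sin(np.linspace(0, 4 * np.pi, 32))))
{% endhighlight %}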

And when ...

Conventional control laws fail

  • shooting a basketball
  • kicking a soccer ball
  • stabilizing an inverted pendulum
  • helicopter stunts

Neural Nets can help invert "Physics" models

  • Infer reflectance despite shadow/glare/haze
  • 2-D image -> 3-D object
  • When direct measurement of 3-D not possible
    • stereoscopic vision
    • structured light
    • lidar
    • radar
    • sonar
    • Kinect or RealSense

NNs "see" through structured noise

Both images and sound often suffer from

  • occlusion
  • obscuration/haze/fog/fade
  • rotation/translation/warping

NNs need data and power

  • Lots of examples to learn from
  • CPU/GPU cycles to burn
    • Google speech recognition doesn't run on your phone...yet



  • The most basic ML task is classification
  • Predict "rain" (1) "no rain" (0) for PDX tomorrow

Supervised Learning

We have historical "examples" of rain and shine

Weather Underground

Since we know the classification (training set)...

Supervised classification (association)

Rain, Shine, Partly-Cloudy ?

Wunderground lists several possible "conditions" or classes

If we wanted to predict them all

We would just make a binary classifier for each one

All classification problems can be reduced to binary classification


Sounds mysterious, like a "flux capacitor" or something...

It's just a multiply and threshold check:

{% highlight python %}
# multiply each weight by its input, add them up, and check the threshold
if sum(w * x for w, x in zip(weights, inputs)) > 0:
    output = 1
else:
    output = 0
{% endhighlight %}




Time Series

Need something a little better

Works fine for "using" (activating) your NN

But for learning (backpropagation) you need it to be predictable (differentiable)...


Again, sounds mysterious... like a transcendental function

It is a transcendental function, but the word just means

Curved, smooth like the letter "C"

What Greek letter do you think of when I say "Sigma"?


What Roman (English) character?

  • "E"?
  • "S"?
  • "C"?


You didn't know this was a Latin/Greek class, did you...

  • Σ (uppercase)
  • σ (lowercase)
  • ς (last letter in word)
  • c (alternatively)

Most English speakers think of an "S"

when they hear "Sigma".

So the meaning has evolved to mean S-shaped.

Shaped like an "S"

The trainer (the backpropagator) can predict the change in weights required. It wants to nudge the output closer to the target.

  • target: known classification for training examples
  • output: predicted classification your network spits out

But just a nudge.

Don't get greedy and push all the way to the answer, because your linear slope predictions are wrong, and there may be nonlinear interactions between the weights (across multiple layers).

So set the learning rate to something less than 1: the portion of the predicted nudge you want to "dial back" to.
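
Putting the last few slides together, a minimal sketch of the sigmoid and the "just a nudge" weight update for a single neuron; the inputs, target, and 0.1 learning rate are arbitrary:

{% highlight python %}
import numpy as np

def sigmoid(x):
    """Smooth, S-shaped squashing function (differentiable, unlike a hard threshold)."""
    return 1.0 / (1.0 + np.exp(-x))

def nudge(weights, inputs, target, learning_rate=0.1):
    """Move the weights a small step toward the target -- but just a nudge."""
    output = sigmoid(np.dot(weights, inputs))
    error = target - output                            # how far off we are
    step = error * output * (1.0 - output) * inputs    # sigmoid slope times the error
    return weights + learning_rate * step              # dial the nudge back by the learning rate

weights = np.zeros(3)
inputs = np.array([1.0, 0.5, -1.0])
for _ in range(100):
    weights = nudge(weights, inputs, target=1.0)
print(sigmoid(np.dot(weights, inputs)))                # creeps toward 1.0
{% endhighlight %}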


Example: Predict Rain in Portland

Visualizing a Brain

  • watch the weights evolve
  • activate with examples and watch intermediate layers
Output column heatmap
Input row heatmap

Get historical weather for Portland then ...

  1. Backpropagate: train a perceptron
  2. Activate: predict the weather for tomorrow!
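
A minimal sketch of those two steps, assuming you've already saved the Wunderground history to a CSV; the file name and column names (temp, humidity, pressure, rain) are made up for illustration:

{% highlight python %}
import numpy as np
import pandas as pd

# hypothetical CSV of daily Portland observations pulled from Weather Underground
history = pd.read_csv('pdx_weather.csv')    # assumed columns: temp, humidity, pressure, rain
X = history[['temp', 'humidity', 'pressure']].values[:-1]   # today's measurements ...
y = history['rain'].values[1:]                              # ... paired with tomorrow's rain (0/1)

# 1. Backpropagate: train a perceptron
weights, bias = np.zeros(X.shape[1]), 0.0
for _ in range(20):
    for x, target in zip(X, y):
        predicted = 1 if np.dot(weights, x) + bias > 0 else 0
        weights += 0.01 * (target - predicted) * x
        bias += 0.01 * (target - predicted)

# 2. Activate: predict the weather for tomorrow!
today = history[['temp', 'humidity', 'pressure']].values[-1]
print('rain' if np.dot(weights, today) + bias > 0 else 'no rain')
{% endhighlight %}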

NN Advantages

  • Easy
    • No math!
    • No tuning!
    • Just plug and chug.
  • General
    • One model can apply to many problems
  • Advanced
    • They often beat all other "tuned" approaches

Disadvantage #1: Slow to Learn

  • cubic to learn
    • quadratic to activate


  • 24+ hr for complex Kaggle example on laptop
  • 90x30x20x10 ~= 1M DOF
    • 90 input dimensions (regressors)
    • 30 nodes for hidden layer 1
    • 20 nodes for hidden layer 2
    • 10 output dimensions (predicted values)

Disadvantage #2: They don't often scale (difficult to parallelize)

  • Fully-connected NNs can't be easily hyper-parallelized (GPU)
    • Large matrix multiplications
    • Layers depend on all elements of previous layers

Scaling Workaround

At the Kaggle workshop we discussed parallelizing linear algebra

  • Split matrices up and work on "tiles"
  • Theano and Keras for Python

Scaling Workaround Limitations

But tiles must be shared/consolidated, and there's redundancy

Disadvantage #3: They overfit

  • Too many nodes = overfitting

What is the big O?

  • Degrees of freedom grow with number of nodes & layers
  • Each layer's nodes connected to each previous layer's
  • That's a lot of wasted "freedom"
  • Many weights are randomly zeroed/ignored (Random Dropout)

O(N^2) to activate

O(N^3) to learn

Not so fast, big O...

{% highlight python %}
>>> import numpy as np
>>> np.prod([30, 20, 10])
6000
>>> np.sum([30, 20, 10])**2
3600
{% endhighlight %}

Rule of thumb

NOT N**2

But M * N**2

N: number of nodes
M: number of layers

Automated Architecture Limits

assert(M * N**2 < len(training_set) / 10.)

I'm serious... put this into your code. I wasted a lot of time training models for Kaggle that were overfit.
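
One way to wire that rule of thumb into your code, reading N as the size of the largest layer (that interpretation, and the 500,000-example training set, are assumptions for illustration):

{% highlight python %}
def check_architecture(layer_sizes, training_set):
    """Rule-of-thumb guard: keep M * N**2 well below the number of training examples."""
    M, N = len(layer_sizes), max(layer_sizes)
    assert M * N**2 < len(training_set) / 10., "network too big for this training set"

# the 90x30x20x10 example from earlier, with a made-up 500,000-example training set
check_architecture([90, 30, 20, 10], training_set=range(500000))
{% endhighlight %}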

Augment with your Brain

  • Imprint your net with the structure of the problem
    • Feature engineering
    • Choose activation function
    • Partition your NN
  • Prune and evolve your NN
    • Genetic algorithms

This is a virtuous cycle!

  • More structure (no longer fully connected)
    • Each independent path (segment) is parallelizable!
  • Automatic tuning, pruning, evolving is all parallelizable!
    • Just train each NN separately
    • Check back in with Prefrontal to "compete"


  • limit connections

jargon: receptive fields

  • limit weights

jargon: weight sharing

All the rage: convolutional networks


  • limit weight ranges (e.g. -1 to 1, 0 to 1, etc)
  • weight "snap to grid" (snap learning)
  • dream up your own activation function
  • improve the back-propagation function

