The most basic ML task is classification
In NN lingo, this is called “association”
So let's predict "rain" (1) vs. "no rain" (0) for PDX tomorrow
We have historical “examples” of rain and shine
Since we know the classification (training set)…
Supervised classification (association)
Wunderground lists several possible “conditions” or classes
If we wanted to predict them all
We would just make a binary classifier for each one
All classification problems can be reduced to binary classification
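A minimal sketch of that reduction (one-vs-rest): train one binary classifier per class, then pick the most confident. The classes and scoring lambdas here are hypothetical stand-ins, not real Wunderground conditions:

```python
def predict_condition(x, binary_classifiers):
    """Run one binary classifier per class; pick the most confident."""
    scores = {label: clf(x) for label, clf in binary_classifiers.items()}
    return max(scores, key=scores.get)

# Toy example: each "classifier" is just a hard-coded score function.
classifiers = {
    "rain": lambda x: x["humidity"] * 0.9,
    "clear": lambda x: (100 - x["humidity"]) * 0.8,
}
print(predict_condition({"humidity": 85}, classifiers))  # prints: rain
```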
Sounds mysterious, like a “flux capacitor” or something…
It’s just a multiply and threshold check:
if (weights * inputs) > 0:
    output = 1
else:
    output = 0
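Here's that multiply-and-threshold as runnable numpy, with made-up toy weights (humidity pushes toward "rain", pressure pushes away):

```python
import numpy as np

def perceptron(inputs, weights):
    """The whole trick: multiply, sum, threshold."""
    return 1 if np.dot(weights, inputs) > 0 else 0

weights = np.array([0.6, -0.4])
inputs = np.array([0.9, 0.2])   # normalized humidity, pressure
print(perceptron(inputs, weights))  # prints: 1  ("rain")
```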
(Diagram of a perceptron)
Works fine for “using” (activating) your NN
But for learning (backpropagation) you need it to be predictable…
Again, sounds mysterious… like a transcendental function
It is a transcendental function, but the word just means
Curved, smooth like the letter “C”
What Roman (English) character?
You didn’t know this was a Latin/Greek class, did you…
Σ (uppercase) σ (lowercase) ς (last letter in word) c (alternatively)
Most English speakers think of an "S" when they hear "Sigma," so the meaning has evolved to mean S-shaped.
That’s what we want, something smooth, shaped like an “S”
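The usual choice for that smooth "S" is the logistic sigmoid, a quick sketch:

```python
import numpy as np

def sigmoid(x):
    """S-shaped squashing function: smooth and differentiable,
    unlike a hard threshold, so backpropagation can use its slope."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))     # 0.5, the midpoint of the "S"
print(sigmoid(10))    # approaches 1.0
print(sigmoid(-10))   # approaches 0.0
```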
The trainer ([backpropagator](https://en.wikipedia.org/wiki/Backpropagation)) can predict the change in weights required
Wants to nudge the output
closer to the target
target: known classification for training examples
output: predicted classification your network spits out
Don't get greedy and push all the way to the answer
Because your linear slope predictions are wrong
And there may be nonlinear interactions between the weights (multiplied across layers)
So set the learning rate (\alpha) to something less than 1: the portion of the predicted nudge you want to "dial back" to
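A sketch of one dialed-back nudge, using the simple delta-rule gradient for a sigmoid output (the function name `nudge` and the toy numbers are mine, not from the slides):

```python
import numpy as np

def nudge(weights, inputs, output, target, alpha=0.1):
    """Move weights only a fraction (alpha) of the predicted correction."""
    error = target - output
    gradient = error * output * (1 - output)  # sigmoid slope term
    return weights + alpha * gradient * inputs

weights = np.array([0.6, -0.4])
inputs = np.array([0.9, 0.2])
# Output was 0.3 but the training target says 1.0, so nudge upward.
updated = nudge(weights, inputs, output=0.3, target=1.0, alpha=0.1)
```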
Get historical weather for Portland then …
NN Advantages
Disadvantage #1: Slow training
Disadvantage #2: They don’t scale (unparallelizable)
Scaling Workaround
At the Kaggle workshop we discussed parallelizing linear algebra
Scaling Workaround Limitations
But tiles must be shared/consolidated and there's redundancy
Disadvantage #3: They overfit
What is the big O?
Not so fast, big O…
>>> np.prod([30, 20, 10])
6000
>>> np.sum([30, 20, 10])**2
3600
Rule of thumb
NOT N**2
But M * N**2
N: number of nodes M: number of layers
assert M * N**2 < len(training_set) / 10.
I’m serious… put this into your code. I wasted a lot of time training models for Kaggle that overfitted.
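The rule of thumb can be wrapped in a quick sanity check. Treating N as the widest layer is my assumption here; the slide leaves N loosely defined:

```python
def check_capacity(layer_sizes, n_training_examples):
    """Rule of thumb: total weights (~M * N**2) should stay under a
    tenth of the training-set size, or expect overfitting."""
    m = len(layer_sizes)   # M: number of layers
    n = max(layer_sizes)   # N: (widest) layer size -- my assumption
    weights_estimate = m * n**2
    assert weights_estimate < n_training_examples / 10., (
        f"{weights_estimate} weights vs {n_training_examples} examples: "
        "expect overfitting")

# [30, 20, 10] gives 3 * 30**2 = 2700 weights; fine with 50k examples.
check_capacity([30, 20, 10], n_training_examples=50000)
```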
You do need to know math!
This is a virtuous cycle!
Structure you can play with (textbook)
jargon: receptive fields
jargon: weight sharing
All the rage: convolutional networks
Unconventional structure to play with
New ideas, no jargon yet, just crackpot names
Joke: “What’s the difference between a scientist and a crackpot?”
Ans: “P-value”
I’m a crackpot!
Resources