The Artificial Neuron: One Math Equation for Intelligence
What actually computed those 384-dimensional embeddings? A staggeringly simple unit of computation called the Artificial Neuron. Scale it up a trillion times, and you get ChatGPT.
In our previous articles we saw how AI converts words to tokens, tokens to embeddings, and embeddings to searchable vectors. But here's the question nobody asks first: what actually computed those 384-dimensional embeddings in the first place?
The answer is a staggeringly simple unit of computation called the Artificial Neuron — a mathematical imitation of the biological brain cell, refined since 1943. Every embedding model, every language model, every image generator is built from billions of these stacked in layers. Understand one, and you understand the fundamental building block of all modern AI.
In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts asked: 'What if we could model the biological brain cell in mathematics?' Their answer laid the groundwork for the neural networks behind today's AI industry.
From Biology to Math
Your brain contains ~86 billion neurons. Each one is a tiny machine that receives signals from other neurons, weighs them (some inputs matter more), decides whether to fire and how strongly, and passes the result forward. The artificial neuron follows the same four steps in a heavily simplified form, with numbers instead of electrical impulses.
Same Logic, Different Medium
Biological neuron:
- Receive: Electrical signals arrive via the dendrites.
- Weigh: Some inputs matter more.
- Decide: Threshold check in cell body.
- Pass: Fire down the axon.

Artificial neuron:
- Receive: Input numbers (x).
- Weigh: Multiply by weights (w).
- Decide: Sum + Bias → Activation.
- Pass: Output number forward.
The entire neuron collapses into one compact equation:
Output = Activation( Σ(x × w) + b )
The Equation Components
- Inputs (x): Raw data coming in, such as pixels, embeddings, or sensor values.
- Weights (w): How important each input is. The AI learns these during training; they start random and get refined.
- Bias (b): A starting offset. Without it, a neuron forced to start from zero can't represent some patterns.
- Activation: The decision gate that transforms the sum into an output and adds non-linearity.
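Written with explicit indices, the same equation for a neuron with n inputs looks like this. The symbols y (output), f (activation function), and n (number of inputs) are notation introduced here for illustration; x, w, and b are the inputs, weights, and bias described above.

```latex
y = f\left( \sum_{i=1}^{n} x_i w_i + b \right)
```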
The 5-Step Decision Pipeline
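- Step 1 (Receive): Raw numbers arrive: x₁, x₂, x₃…
- Step 2 (Weigh): Each input is multiplied by its learned importance: xᵢ × wᵢ.
- Step 3 (Sum): Add them all up and add the bias: Σ(xᵢwᵢ) + b.
- Step 4 (Activate): Pass the sum through an activation function (ReLU, Sigmoid, or, across a whole layer, Softmax).
- Step 5 (Output): One number flows forward. This runs for every neuron, millions of times per forward pass.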
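To see the five steps as separate operations, here is a minimal sketch that records the intermediate value at each stage. The function name `neuron_trace` and the input numbers are made up for illustration, and ReLU is assumed as the activation.

```python
def neuron_trace(inputs, weights, bias):
    """Run one neuron and return the intermediate value of every step."""
    # Step 1: receive the raw inputs
    x = list(inputs)
    # Step 2: weigh each input by its importance
    weighted = [xi * wi for xi, wi in zip(x, weights)]
    # Step 3: sum the weighted inputs and add the bias
    total = sum(weighted) + bias
    # Step 4: activate (ReLU is assumed here: negative sums become 0)
    activated = max(0.0, total)
    # Step 5: the single output number that flows forward
    return {"weighted": weighted, "sum": total, "output": activated}


# Hypothetical numbers, chosen only for illustration
trace = neuron_trace(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 0.25], bias=0.5)
print(trace)
```

In this run the weighted sum comes out negative (-0.25), so ReLU blocks it and the neuron passes 0.0 forward.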
A Concrete Example: The Electronics Store Evaluator
Abstract math is hard to feel, so let's make it real. You work in an electronics store and a new device arrives: should we stock it? You evaluate three things — and weight them by how much your customers care:
- Scenario: Should we stock this gadget?
- Inputs (scores): Weight 0.9, Camera 0.7, Battery 0.3.
- Weights (importance): Weight ×0.5, Camera ×0.3, Battery ×0.8.
- Math: (0.9×0.5) + (0.7×0.3) + (0.3×0.8) = 0.45 + 0.21 + 0.24 = 0.90.
- Verdict: 0.90 > 0.5 threshold → Stock it! ✅
Notice that battery has the highest importance weight (0.8) but a low score (0.3), so it contributes less (0.24) than the device's weight does (0.45). A large weight only matters when the input it multiplies is also large. That is exactly what an artificial neuron does: a weighted vote of inputs, followed by a threshold decision. The AI doesn't guess those weights; it learns them automatically from thousands of examples. The same basic operation happens inside an embedding model, millions of times per forward pass, to produce those 384 output numbers.
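The interaction between scores and importance weights in the store example can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the dictionary names are chosen for readability.

```python
# Scores and importance weights from the store example
scores = {"weight": 0.9, "camera": 0.7, "battery": 0.3}
importance = {"weight": 0.5, "camera": 0.3, "battery": 0.8}

# Contribution of each feature = score x importance
contributions = {name: scores[name] * importance[name] for name in scores}
total = sum(contributions.values())

# Print the features from largest to smallest contribution
for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} contributes {value:.2f}")
print(f"total = {total:.2f}")
```

Battery carries the largest importance weight, yet it lands in the middle of the ranking because its score is low.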
Why You Need a Bias
What if a device had no data and all inputs are zero? Without bias: (0×0.5)+(0×0.3)+(0×0.8) = 0. Whatever weights the neuron learns, the weighted sum stays at 0 for this input, so the neuron is stuck. With a bias of +0.3, the same zero inputs produce 0 + 0.3 = 0.3. The neuron can start from a non-zero position and learn to be naturally optimistic or pessimistic, like an expert with priors before seeing any data.
The Effect of Bias
- Without bias: Σ(0 × w) = 0, so the neuron always starts from zero. Limited representational power; the sum must travel 0.50 to reach the threshold.
- With bias: Σ(0 × w) + 0.3 = 0.3, a flexible head start. Only 0.20 away from firing. The AI learns the right bias during training. ✅
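Here is a small sketch of the zero-input case, reusing the 0.5 threshold and the weights from the store example. The helper name `weighted_sum` is an illustrative choice.

```python
THRESHOLD = 0.5  # firing threshold used in the store example


def weighted_sum(inputs, weights, bias=0.0):
    """Sum of input x weight, plus an optional bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


no_data = [0.0, 0.0, 0.0]   # a device with no information at all
weights = [0.5, 0.3, 0.8]

without_bias = weighted_sum(no_data, weights)
with_bias = weighted_sum(no_data, weights, bias=0.3)

print(f"without bias: {without_bias:.2f}, gap to threshold: {THRESHOLD - without_bias:.2f}")
print(f"with bias:    {with_bias:.2f}, gap to threshold: {THRESHOLD - with_bias:.2f}")
```

Changing the weights would not help the first case: zero times anything is still zero, so only the bias can move the starting point.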
The Activation Function: The Decision Gate
After computing the weighted sum, the neuron passes it through an activation function, which does two critical things: it makes the decision (how strongly to fire) and it adds non-linearity. Without non-linearity, any stack of layers, however many neurons it holds, collapses into a single linear equation, which is useless for complex patterns. The real world isn't linear: "more camera quality is always better" stops being true at 200MP. Activation functions let the network learn these curves. Three you need to know:
The Big Three Functions
- ReLU: Rule: max(0, x). If negative, output 0; if positive, pass through. The default choice for hidden layers in many networks; Transformers often use smooth relatives such as GELU.
- Sigmoid: Rule: squash to the range (0, 1). Outputs a probability (e.g. 88% chance it's spam). Best for Yes/No decisions.
- Softmax: Rule: a distribution that sums to 100%. Distributes probability across many options (Cat vs Dog vs Bird).
```python
def relu(x):
    return max(0.0, x)

# relu(-0.3) → 0.0 (blocked) ; relu(0.76) → 0.76 (passed through)
```

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# sigmoid(-5) → 0.007 ; sigmoid(0) → 0.5 ; sigmoid(5) → 0.993
```

```python
import numpy as np

def softmax(scores):
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())  # subtracting the max avoids overflow
    return e / e.sum()

softmax([2.0, 1.0, 0.5])  # → approx. [0.63, 0.23, 0.14]: "63% glasses, 23% ring, 14% earbuds"
```

The golden rule for which function goes where:
| Where | Use | Used in Embeddings? |
|---|---|---|
| Hidden layers (between input & output) | ReLU or a close relative such as GELU, almost always | ✅ Yes, inside the feed-forward part of each Transformer layer |
| Output layer, binary decision | Sigmoid: yes/no probability | ⚠️ Rarely, only in binary classification heads |
| Output layer, multi-class | Softmax: probability distribution | ✅ Yes, though inside the model: it normalizes attention scores |
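The earlier claim that layers without an activation function collapse into one linear equation can be checked numerically. The sketch below uses tiny hand-picked matrices (illustrative values, not from any real model): two stacked linear layers give the same result as a single merged layer, while putting ReLU between them produces something the merged layer cannot.

```python
import numpy as np

# Two tiny layers with hand-picked, illustrative weights
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)

x = np.array([2.0, 1.0])

# Stack the layers with NO activation in between
linear_stack = W2 @ (W1 @ x + b1) + b2

# The same mapping written as ONE linear layer
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b

# Now put ReLU between the two layers
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(linear_stack, collapsed, with_relu)
```

Here the merged weights are all zero, so the linear stack outputs 0 for every input. With ReLU in between, the same weights compute the absolute difference |x₁ − x₂|, a curve no single linear layer can produce.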
Before vs. After Training
The weights start completely random. Training adjusts them until the network makes correct predictions.
The Power of Training
- Before training: weights = [0.23, -0.71, 0.15], bias = 0.88. Accuracy on test set: 47% ≈ random guessing.
- After training: weights = [0.50, 0.30, 0.80], bias = 0.05. Accuracy on test set: 90% after 1,000 training steps.
The next article explains exactly how those weights go from random → 90% accurate.
Build a Neuron from Scratch in Python
```python
import numpy as np


def neuron(inputs, weights, bias, activation='relu'):
    """A single artificial neuron."""
    weighted_sum = np.dot(inputs, weights) + bias  # Step 1: weighted sum plus bias
    if activation == 'relu':                       # Step 2: activation
        return max(0.0, weighted_sum)
    elif activation == 'sigmoid':
        return 1 / (1 + np.exp(-weighted_sum))
    return weighted_sum                            # linear (no activation)


inputs = [0.9, 0.7, 0.3]   # weight, camera, battery scores
weights = [0.5, 0.3, 0.8]  # importance of each feature
result = neuron(inputs, weights, bias=0.0, activation='relu')
print(f"Raw Score: {result:.2f}")                   # → 0.90
print("Stock it ✅" if result > 0.5 else "Pass ❌")  # → Stock it ✅
```

From One Neuron to a Network
A single neuron answers one yes/no question. Understanding language, recognizing faces, generating code — those require layers of neurons, each passing its output as input to the next, learning increasingly abstract patterns:
The Hierarchy of Abstraction
- Input layer: Receives raw features (tokens, pixels).
- Early hidden layers: Detect low-level patterns, such as edges in images or "words related to weight" in text.
- Deeper hidden layers: Build abstract concepts, such as "lightweight + translation = translation device category."
- Output layer: Produces the final prediction or the next token in a chat.
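In practice, a layer is rarely computed one neuron at a time. Stacking each neuron's weights as a row of a matrix lets the whole layer run as a single matrix-vector product. The sketch below, with illustrative values and a hypothetical `dense_layer` helper, shows that both routes give the same numbers.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def dense_layer(x, W, b):
    """All neurons of one layer at once: row i of W holds neuron i's weights."""
    return relu(W @ x + b)


# Illustrative values: 3 inputs feeding a hidden layer of 2 neurons
x = np.array([0.9, 0.7, 0.3])
W = np.array([[0.5, 0.3, 0.8],
              [-1.0, 0.2, 0.1]])
b = np.array([0.0, 0.1])

# Route 1: loop over the neurons one by one
one_by_one = np.array([max(0.0, np.dot(W[i], x) + b[i]) for i in range(len(b))])

# Route 2: the whole layer as one matrix-vector product
all_at_once = dense_layer(x, W, b)
print(all_at_once)
```

The output of one layer is simply the input vector of the next, which is how layers chain into a network.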
The AI doesn't design these layers explicitly — it figures out what patterns to detect on its own, through training. And it's the same formula scaled across 80 years:
- 1943: 1 Neuron (the concept).
- 1980s: ~1K Neurons (early nets).
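- 2012: ~60M parameters across roughly 650,000 neurons (AlexNet, vision).
- 2019: 1.5B parameters (GPT-2).
- 2026: reportedly trillions of parameters in the largest models (the sizes of GPT-4 and Claude are not officially published).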
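The timeline mixes two units, neurons and parameters, and they are not the same thing: a parameter is a single weight or bias, and one neuron owns many of them. A rough sketch of the relationship, assuming plain fully connected layers and hypothetical layer widths:

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-to-neuron connection
        total += n_out         # one bias per neuron
    return total


def count_neurons(layer_sizes):
    """Neurons are the computing units in every layer after the input."""
    return sum(layer_sizes[1:])


sizes = [384, 1536, 384]  # hypothetical layer widths, for illustration only
print(count_neurons(sizes), "neurons,", count_parameters(sizes), "parameters")
```

Fewer than two thousand neurons already need over a million parameters, which is why model sizes are usually quoted in parameters.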
How This Creates the 384-Dimensional Embeddings
In Part 2 we used a model that converted sentences into 384-dimensional vectors. Now you know where they come from: the 384 dimensions aren't magic. To a first approximation, they are the output values of 384 neurons in the model's final layer (many sentence-embedding models also average these outputs across all tokens to get one vector per sentence). Each neuron learned, through training on billions of sentences, to capture a different aspect of meaning: topic, sentiment, language, formality, and hundreds of subtle dimensions we don't have names for.
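As a rough sketch of the idea, a final layer with 384 neurons turns whatever it receives into a vector of 384 numbers. The incoming width of 768 and the random weights below are assumptions for illustration only; a real embedding model learns its weights and has many layers before this one.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

HIDDEN_SIZE = 768      # hypothetical width of the layer feeding the final one
EMBEDDING_SIZE = 384   # one output neuron per embedding dimension

# Untrained stand-in weights; a real model learns these during training
W = rng.normal(scale=0.02, size=(EMBEDDING_SIZE, HIDDEN_SIZE))
b = np.zeros(EMBEDDING_SIZE)

hidden_state = rng.normal(size=HIDDEN_SIZE)  # stand-in for the previous layer's output
embedding = W @ hidden_state + b             # row i of W is neuron i

print(embedding.shape)
```

Each entry of `embedding` is the weighted sum of one neuron, so dimension 0 of the vector is neuron 0's output, dimension 1 is neuron 1's, and so on.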
The Core Insight
Understanding this single unit explains a lot:
- Why does training take so long? The AI is adjusting billions of weights simultaneously.
- Why does more data help? More examples = more chances to refine those weights.
- Why are bigger models usually better? More neurons = more capacity for subtle patterns.
- What are those 384 embedding dimensions? The output of 384 neurons in the final layer.

Intelligence isn't a complex formula; it's one simple formula (Σwx + b) repeated billions of times in parallel.
Try It Yourself
Extend the neuron above into a 2-layer network:
```python
def two_layer_network(inputs, w1, b1, w2, b2):
    """A minimal neural network with one hidden layer."""
    # Hidden layer: one neuron per row of w1
    hidden = [neuron(inputs, w1[i], b1[i], 'relu') for i in range(len(w1))]
    return neuron(hidden, w2, b2, 'sigmoid')  # output layer


w1 = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.4], [0.6, 0.2, 0.7]]
b1 = [0.1, 0.0, -0.1]
w2, b2 = [0.4, 0.6, 0.3], 0.2
result = two_layer_network([0.9, 0.7, 0.3], w1, b1, w2, b2)
print(f"Network output: {result:.3f}")  # → 0.773
print("Stock it ✅" if result > 0.5 else "Pass ❌")
```

Experiments to try:
- Bias experiment: change `b2 = 0.2` to `b2 = -2.0`. The output drops dramatically, below the 0.5 threshold, because bias shifts the decision boundary.
- Dead neuron: set `b1 = [-5.0, -5.0, -5.0]`. With ReLU, all hidden neurons output 0 and the network goes blind: its output no longer depends on the inputs. This is the "dead ReLU" problem.
- Activation swap: change `'relu'` to `'sigmoid'` in the hidden layer and compare outputs.
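If you'd like to run all three experiments side by side, here is one possible self-contained harness. It re-implements the neuron in plain Python so it runs on its own, and the `hidden_activation` parameter is an addition made here for the activation swap, not part of the network function above.

```python
import math


def neuron(inputs, weights, bias, activation="relu"):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    if activation == "relu":
        return max(0.0, z)
    if activation == "sigmoid":
        return 1 / (1 + math.exp(-z))
    return z


def network(inputs, w1, b1, w2, b2, hidden_activation="relu"):
    # hidden_activation is a hypothetical extra knob for the swap experiment
    hidden = [neuron(inputs, w, b, hidden_activation) for w, b in zip(w1, b1)]
    return neuron(hidden, w2, b2, "sigmoid")


x = [0.9, 0.7, 0.3]
w1 = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.4], [0.6, 0.2, 0.7]]
b1 = [0.1, 0.0, -0.1]
w2, b2 = [0.4, 0.6, 0.3], 0.2

results = {
    "baseline": network(x, w1, b1, w2, b2),
    "bias b2 = -2.0": network(x, w1, b1, w2, -2.0),
    "dead hidden layer": network(x, w1, [-5.0] * 3, w2, b2),
    "sigmoid hidden layer": network(x, w1, b1, w2, b2, hidden_activation="sigmoid"),
}
for name, value in results.items():
    print(f"{name:22s} {value:.3f}")
```

For the dead hidden layer, try a few different inputs: the output stays the same every time, because only the output neuron's bias is left to decide.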
The key question — how does the network learn the right values for those weights and biases? — is the subject of the next article.
Key Takeaways
Intelligence isn't a complex formula. It's one simple formula (Σwx + b) repeated billions of times in parallel.
When you download an AI model, you are literally downloading a list of weights. Those numbers are the model's 'experience'.
Without bias, a neuron can't learn patterns that don't start at zero. It's the mathematical equivalent of having 'priors' or intuition.
Up Next in the Series
We have a neuron that makes decisions, but its weights started random. Next, we'll see exactly how those weights go from random guessing to 90% accuracy: the training loop of Forward Pass → Loss → Backpropagation → Gradient Descent.