In our previous articles we saw how AI converts words to tokens, tokens to embeddings, and embeddings to searchable vectors. But here's the question nobody asks first:
What actually computed those 384-dimensional embeddings in the first place?
The answer is a staggeringly simple unit of computation called the Artificial Neuron — a mathematical imitation of the biological brain cell that scientists have been refining since 1943.
Every embedding model, every language model, every image generator — all of them are built from billions of these neurons stacked in layers. Once you understand one, you understand the fundamental building block of all modern AI.
The Biological Original
Your brain contains approximately 86 billion neurons. Large language models like ChatGPT have hundreds of billions of parameters (the weights inside their neurons) — not neurons per se, but the same mathematical principle scaled up enormously. Each biological neuron is a tiny machine that:
- Receives electrical signals from other neurons via dendrites
- Weighs those signals (some inputs matter more than others)
- Decides whether to fire — and if so, how strongly
- Passes the result forward to the next layer of neurons
This is how you recognize a face, understand a sentence, feel pain, and solve a math problem. It's all neurons receiving, weighing, deciding, and passing.
In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts asked: What if we could model this in mathematics?
Their answer became the foundation of every AI system in existence today.
Same 4-Step Logic — Different Medium
🧬 Biological Neuron
⚙️ Artificial Neuron (Pure Math)
The Artificial Neuron: Same Idea, Pure Math
The artificial neuron does exactly what the biological one does — but with numbers instead of electrical impulses.
Output = Activation( Σ(x × w) + b )
Don't worry if the formula looks scary — we'll make it completely concrete with a real example in the next section.
- Inputs: the raw data coming in — numbers representing pixels, word embeddings, sensor values, or anything else
- Weights: how important is each input? The AI learns these numbers during training. They start random and get refined.
- Bias: a starting offset. Without it, a neuron forced to start from zero can't represent some patterns. Bias gives it flexibility.
- Activation Function: the decision gate. It transforms the weighted sum into the final output — and crucially adds non-linearity.
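The formula maps line-for-line onto code. Here is a minimal sketch (the inputs, weights, and bias are made up for illustration, and sigmoid stands in for the activation):

```python
import math

def artificial_neuron(inputs, weights, bias):
    # Σ(x × w) + b: the weighted sum of inputs, plus the bias offset
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation(...): sigmoid used here as one possible decision gate
    return 1 / (1 + math.exp(-z))

out = artificial_neuron([0.5, 0.8], [0.4, 0.6], 0.1)
```

Every term in `Output = Activation(Σ(x × w) + b)` appears exactly once here: the loop is the Σ, and the final line is the gate.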
Data Pipeline Through a Single Neuron
This 5-step pipeline runs for every single neuron, millions of times per forward pass through the model
A Concrete Example: The Electronics Store Evaluator
Abstract math is hard to feel. Let's make it real.
Imagine you work in an electronics store. A new device arrives and you have to decide: should we stock it? You evaluate three things:
- Weight (lighter = better for customers): score 0.9 out of 1
- Camera quality: score 0.7 out of 1
- Battery life: score 0.3 out of 1
But not all three matter equally. Based on your store's customer research:
- Weight matters a lot: weight 0.5
- Camera is secondary: weight 0.3
- Battery matters most: weight 0.8
How Each Input's Weight Shapes Its Contribution
Notice: battery has a HIGH importance weight (0.8) but a LOW score (0.3), so it contributes less (0.24) than the weight feature does (0.45) — a key insight about how weights and inputs interact
Weighted Sum = (0.9×0.5) + (0.7×0.3) + (0.3×0.8)
= 0.45 + 0.21 + 0.24
= 0.90
0.90 > 0.5 threshold → ✅ Stock this device!
That is exactly what an artificial neuron does. It's a weighted vote of inputs, followed by a threshold decision. The AI doesn't guess those weights — it learns them automatically by looking at thousands of examples of good and bad devices.
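The whole evaluation fits in a few lines of Python (the 0.5 threshold is this example's store policy, not a universal constant):

```python
scores  = [0.9, 0.7, 0.3]   # device weight, camera, battery scores
weights = [0.5, 0.3, 0.8]   # learned importance of each feature

# The weighted vote: multiply each score by its importance, then sum
weighted_sum = sum(s * w for s, w in zip(scores, weights))
decision = "stock" if weighted_sum > 0.5 else "pass"
print(round(weighted_sum, 2), decision)   # 0.9 stock
```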
Why You Need a Bias
What if the device had no data — all inputs are zero?
Without bias: (0×0.5) + (0×0.3) + (0×0.8) = 0. The neuron always outputs 0. It's completely stuck.
With a bias of +0.3: the same zero inputs produce 0 + 0.3 = 0.3. The neuron can start from a non-zero position. It can learn to be naturally optimistic or pessimistic about certain categories — just like an expert who has priors before seeing any data.
- No Bias: Σ(0 × w) = 0 — always starts from zero. Limited representational power.
- Bias = +0.3: Σ(0 × w) + 0.3 = 0.3 — flexible starting point. The AI learns the right bias during training. ✅
How Bias Shifts the Decision Starting Point
Bias = a head start. The neuron needs less incoming signal to fire — giving the network more expressive flexibility.
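The zero-input case above, sketched directly:

```python
def pre_activation(inputs, weights, bias):
    # The weighted sum before any activation is applied
    return sum(x * w for x, w in zip(inputs, weights)) + bias

zero_inputs = [0.0, 0.0, 0.0]
weights = [0.5, 0.3, 0.8]

stuck    = pre_activation(zero_inputs, weights, bias=0.0)  # 0.0: no signal, ever
flexible = pre_activation(zero_inputs, weights, bias=0.3)  # 0.3: a head start
```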
The Activation Function: The Decision Gate
After computing the weighted sum, the neuron passes it through an activation function. This does two critical things:
- Makes the decision (how strongly does this neuron fire?)
- Adds non-linearity (without this, a million neurons collapse into one linear equation — useless for complex patterns)
Think about it: the real world isn't linear. "More camera quality is always better" stops being true at some point. A phone camera at 200MP is overkill. Activation functions let the network learn these thresholds and curves.
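The collapse claim is easy to verify: with no activation between them, two layers of weights are just one matrix in disguise. A quick NumPy sketch (random values, bias omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.random(4)           # 4 input features
W1 = rng.random((3, 4))      # layer 1 weights: 4 inputs → 3 neurons
W2 = rng.random((2, 3))      # layer 2 weights: 3 neurons → 2 neurons

# With no activation between layers...
stacked   = W2 @ (W1 @ x)
# ...the two layers are equivalent to one pre-multiplied matrix
collapsed = (W2 @ W1) @ x

print(np.allclose(stacked, collapsed))   # True
```

A non-linear activation between the layers is exactly what prevents this pre-multiplication trick, which is why depth only helps when activations are present.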
There are three you need to know:
ReLU — The Most Common (Hidden Layers)
Rule: If the input is negative, output 0. If positive, pass it through unchanged.
```python
def relu(x):
    return max(0.0, x)

relu(-5.0)   # → 0.0 (blocked)
relu(-0.3)   # → 0.0 (blocked)
relu(0.76)   # → 0.76 (passed through)
relu(2.5)    # → 2.5 (passed through)
```
Why ReLU? It's dead simple to compute (no exponentials), works well in practice, and doesn't cause the "vanishing gradient" problem that plagued earlier functions.
Where: Used in almost every hidden layer of every modern neural network.
Sigmoid — The Probability Maker (Binary Output)
Rule: Squish any number into the range [0, 1]. This makes the output interpretable as a probability.
```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

sigmoid(-5)   # → 0.007 (nearly 0%)
sigmoid( 0)   # → 0.500 (50/50)
sigmoid( 5)   # → 0.993 (nearly 100%)
```
Where: The output layer of a binary classifier (spam vs. not spam, cat vs. dog).
Interpret: "There's a 73% chance this email is spam."
Softmax — The Multi-Class Chooser (Final Layer)
When you have more than two categories, Softmax converts raw scores into a probability distribution that sums to exactly 100%.
```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability (doesn't change the result)
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

raw_scores = [2.0, 1.0, 0.5]
probabilities = softmax(raw_scores)
# → [0.63, 0.23, 0.14]: "63% glasses, 23% ring, 14% earbuds" — sums to 100%
```
Three Activation Functions at a Glance
| Function | Behavior | Typical placement |
|----------|----------|-------------------|
| ReLU | 0 or pass through | Hidden layers ✅ |
| Sigmoid | Squash to 0 → 1 | Binary output ✅ |
| Softmax | Sums to 100% | Multi-class ✅ |
The golden rule for activation functions: ReLU in the hidden layers, Sigmoid for a binary output, Softmax for a multi-class output.
Before vs After Training
The weights start completely random. Training adjusts them until the network makes correct predictions.
Before Training
After Training
The next article explains exactly how those weights go from random → 90% accurate.
Build a Neuron from Scratch in Python
```python
import numpy as np

def neuron(inputs, weights, bias, activation='relu'):
    """
    A single artificial neuron.

    inputs:     list of input values
    weights:    list of importance multipliers (same length as inputs)
    bias:       starting offset
    activation: 'relu', 'sigmoid', or 'linear'
    """
    # Step 1: Weighted sum
    weighted_sum = np.dot(inputs, weights) + bias

    # Step 2: Activation
    if activation == 'relu':
        return max(0.0, weighted_sum)
    elif activation == 'sigmoid':
        return 1 / (1 + np.exp(-weighted_sum))
    return weighted_sum  # linear (no activation)
```
Now run it on our electronics store example:
```python
inputs  = [0.9, 0.7, 0.3]   # weight score, camera score, battery score
weights = [0.5, 0.3, 0.8]   # importance of each feature
bias    = 0.0

result = neuron(inputs, weights, bias, activation='relu')
print(f"Raw Score: {result:.2f}")   # → 0.90 (positive, so ReLU passes it through)

threshold = 0.5
if result > threshold:
    print("DECISION: Stock this device ✅")
else:
    print("DECISION: Pass ❌")
```
Output:
Raw Score: 0.90
DECISION: Stock this device ✅
From One Neuron to a Network
A single neuron can answer one yes/no question. But understanding language? Recognizing faces? Generating code? Those require layers of neurons, each one passing its output as input to the next.
The Same Formula — Scaled Across 80 Years
1943 → 1980s → 2012 → 2019 → 2020–now
Every dot runs the same formula: Σ(x × w) + b → activation() — just repeated billions of times in parallel
Every line = a weight. Every node = a neuron. The AI learns all weights simultaneously through training.
Each hidden layer learns increasingly abstract patterns:
- Layer 1 might detect: "this device description contains words related to weight"
- Layer 2 might detect: "lightweight + translation = translation device category"
- Layer 3 might detect: "translation device + low price = high match for this query"
The AI doesn't design these layers explicitly. It figures out what patterns to detect on its own, through training.
Abstraction Builds Across Layers
How This Neuron Creates the 384-Dimensional Embeddings We Saw Earlier
In the previous article, we used a model that converted sentences into 384-dimensional vectors. Now you know exactly how that happens:
The 384 dimensions aren't magic — they're the output values of 384 neurons in the model's final layer. Each one learned (through training on billions of sentences) to capture a different aspect of meaning: topic, sentiment, language, formality, and hundreds of other subtle dimensions we don't have names for.
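In matrix form, that final layer is simply 384 copies of the neuron formula run in parallel: each row of the weight matrix holds one neuron's weights. A sketch with made-up shapes and random values (real embedding models add pooling and normalization on top of this):

```python
import numpy as np

rng = np.random.default_rng(42)
hidden = rng.random(512)              # previous layer's output (size assumed)
W = rng.standard_normal((384, 512))   # one row of weights per output neuron
b = np.zeros(384)

embedding = W @ hidden + b            # 384 weighted sums computed at once
print(embedding.shape)                # (384,)
```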
The Four Network Types You Need to Know
The embedding models we used in earlier articles are Transformers — networks of neurons that learned to compress meaning into 384-dimensional vectors by reading billions of sentences.
Real-World Impact
The Core Insight
A neuron is just a weighted vote.
Stack enough of them and they can learn anything.
The magic isn't in any single neuron. It's in what emerges when billions of them are trained together on enough data.
Understanding this single unit explains why AI works the way it does:
- Why does training take so long? The AI is adjusting billions of weights simultaneously
- Why does more data help? More examples = more opportunities to refine those weights
- Why do bigger models perform better? More neurons = more capacity to learn subtle patterns
- What are those 384 embedding dimensions? The output of 384 neurons in the final layer of the model
Pro Tips for Builders
- Start with ReLU in hidden layers — it trains faster and avoids the vanishing gradient problem that crippled older activations like tanh
- Always include bias — a neuron without bias can only produce a hyperplane through the origin, severely limiting what it can learn
- Weights matter at init — random weights that are too large cause exploding gradients; use Xavier or He initialization in frameworks
- More neurons ≠ better — overfitting is real. Start small, measure validation loss, then scale
- The 384 dimensions in our embedding model are literally 384 output neurons in the final layer — each one learned to capture a different semantic dimension of meaning
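For the initialization tip, here is a minimal sketch of He initialization — the variant suited to ReLU layers (Xavier scales by 1/n_in instead of 2/n_in):

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    # He initialization: scale by sqrt(2 / n_in) so the variance of
    # activations stays roughly constant through ReLU layers as depth grows
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

W = he_init(n_in=384, n_out=256)
print(W.shape)   # (256, 384)
```

In practice you rarely write this by hand — frameworks expose it directly (e.g. PyTorch's `torch.nn.init.kaiming_normal_`) — but the idea is just a scaled random draw.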
Try It Yourself
Extend the neuron function above into a 2-layer network:
```python
def two_layer_network(inputs, w1, b1, w2, b2):
    """A minimal neural network with one hidden layer."""
    # Layer 1: hidden layer with ReLU
    h1 = neuron(inputs, w1[0], b1[0], 'relu')
    h2 = neuron(inputs, w1[1], b1[1], 'relu')
    h3 = neuron(inputs, w1[2], b1[2], 'relu')
    hidden = [h1, h2, h3]

    # Layer 2: output layer with sigmoid (binary decision)
    output = neuron(hidden, w2, b2, 'sigmoid')
    return output

# Random weights (in real training, these are learned)
w1 = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.4], [0.6, 0.2, 0.7]]
b1 = [0.1, 0.0, -0.1]
w2 = [0.4, 0.6, 0.3]
b2 = 0.2

inputs = [0.9, 0.7, 0.3]
result = two_layer_network(inputs, w1, b1, w2, b2)
print(f"Network output: {result:.3f}")
print(f"Decision: {'Stock it ✅' if result > 0.5 else 'Pass ❌'}")
```
Network output: 0.773
Decision: Stock it ✅
Experiments to try:
- Bias experiment — change `b2 = 0.2` to `b2 = -2.0` and re-run. The output should drop dramatically, showing how bias shifts the decision boundary
- Dead neuron — set `b1 = [-5.0, -5.0, -5.0]`. With ReLU, all hidden neurons will output 0 and the network becomes blind to its inputs — this is the "dead ReLU" problem
- Activation swap — change `'relu'` to `'sigmoid'` in Layer 1 and compare outputs. Sigmoid will produce different values but the same decision logic
The key question: how does the network learn the right values for w1, b1, w2, b2? That's the subject of the next article.
Next in AI Fundamentals
How AI Learns: The Training Loop
Forward Pass → Loss → Backpropagation → Gradient Descent. The four-step cycle that turns random weights into intelligence — and why it takes millions of dollars of compute to do it at scale.