Part 4 — The Artificial Neuron: The Tiny Decision Machine Behind Every AI

The Artificial Neuron: One Math Equation for Intelligence

What actually computed those 384-dimensional embeddings? A staggeringly simple unit of computation called the Artificial Neuron. Scale it up a trillion times, and you get ChatGPT.

Primary Objective

Weights & Biases | Activation Functions | ReLU | Sigmoid

In our previous articles we saw how AI converts words to tokens, tokens to embeddings, and embeddings to searchable vectors. But here's the question nobody asks first: what actually computed those 384-dimensional embeddings in the first place?

The answer is a staggeringly simple unit of computation called the Artificial Neuron — a mathematical imitation of the biological brain cell, refined since 1943. Every embedding model, every language model, every image generator is built from billions of these stacked in layers. Understand one, and you understand the fundamental building block of all modern AI.

💡

The 1943 Origin

In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts asked: 'What if we could model the biological brain cell in mathematics?' Their answer laid the groundwork for artificial neural networks.

From Biology to Math

Your brain contains ~86 billion neurons. Each one is a tiny machine that receives signals from other neurons, weighs them (some inputs matter more), decides whether to fire, and passes the result forward. The artificial neuron is a simplified imitation of that process, using numbers instead of electrical impulses.

Same Logic, Different Medium

🧬BIOLOGICAL NEURON

Receive: Electrical signals arrive through dendrites.
Weigh: Some inputs matter more.
Decide: Threshold check in cell body.
Pass: Fire down the axon.

⚙️ARTIFICIAL NEURON

Receive: Input numbers (x).
Weigh: Multiply by weights (w).
Decide: Sum + Bias → Activation.
Pass: Output number forward.

The entire neuron collapses into one compact equation:

text

Output = Activation( Σ(xᵢ × wᵢ) + b )

The Equation Components

📊INPUTS (x)

Raw data coming in — pixels, embeddings, sensor values.

⚖️WEIGHTS (w)

How important each input is. The AI learns these during training; they start random and get refined.

➕BIAS (b)

A starting offset added to the weighted sum. Without it, zero inputs always give a sum of zero, so the neuron can't represent patterns that need a non-zero baseline.

🚦ACTIVATION (f)

The decision gate that transforms the sum into an output — and adds non-linearity.

The 5-Step Decision Pipeline

📊

INPUTS

Raw numbers arrive: x₁, x₂, x₃…

⚖️

× WEIGHTS

Each input is multiplied by its learned importance: xᵢ × wᵢ.

➕

SUM + BIAS

Add them all up and add the bias: Σ(xᵢwᵢ) + b.

🚦

ACTIVATION

Pass the sum through an activation function such as ReLU or Sigmoid (Softmax works on a whole layer of sums at once).

✅

OUTPUT

One number flows forward. This runs for every neuron, millions of times per forward pass.
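To make the five steps tangible, here is a minimal sketch that runs one neuron and keeps every intermediate value. The function name `neuron_trace` and the input numbers are illustrative assumptions, not values from a trained model.

```python
def relu(z):
    """ReLU activation: negative sums become 0, positive sums pass through."""
    return max(0.0, z)

def neuron_trace(inputs, weights, bias):
    """Run one neuron and return each intermediate value for inspection."""
    products = [x * w for x, w in zip(inputs, weights)]   # steps 1-2: inputs times weights
    weighted_sum = sum(products) + bias                    # step 3: sum + bias
    output = relu(weighted_sum)                            # step 4: activation
    return products, weighted_sum, output                  # step 5: output flows forward

# Illustrative numbers only (hypothetical, not from a trained model).
products, weighted_sum, output = neuron_trace([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], bias=0.5)
print(products)      # [0.5, -2.0, 0.75]
print(weighted_sum)  # -0.25
print(output)        # 0.0  (ReLU blocks the negative sum)
```

Here the second input has a negative weight, so it pulls the sum below zero and ReLU silences the neuron.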

A Concrete Example: The Electronics Store Evaluator

Abstract math is hard to feel, so let's make it real. You work in an electronics store and a new device arrives: should we stock it? You evaluate three things — and weight them by how much your customers care:

The Electronics Store Evaluator

Scenario: Should we stock this gadget?
Inputs (scores): Weight 0.9, Camera 0.7, Battery 0.3.
Weights (importance): Weight ×0.5, Camera ×0.3, Battery ×0.8.
Math: (0.9×0.5) + (0.7×0.3) + (0.3×0.8) = 0.45 + 0.21 + 0.24 = 0.90.
Verdict: 0.90 > 0.5 threshold → Stock it! ✅

How Each Input's Weight Shapes Its Contribution

⚖️ Weight (0.9 × 0.5) = 0.45

📸 Camera (0.7 × 0.3) = 0.21

🔋 Battery (0.3 × 0.8) = 0.24

Notice that battery has the highest importance weight (0.8) but the lowest score (0.3), so it contributes 0.24, roughly half of what the device's weight contributes (0.45). An input and its weight only matter as a product. That is what an artificial neuron does: a weighted vote of inputs, followed by a threshold decision. The AI doesn't guess those weights; it learns them automatically from thousands of examples. The same computation runs inside an embedding model, millions of times per forward pass, to produce those 384 output numbers.
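The interaction between scores and weights can be checked with a few lines of Python. This is a small sketch using the numbers from the store example; the dictionary layout is just one convenient way to organize them.

```python
# Scores and importance weights from the store example.
scores  = {"weight": 0.9, "camera": 0.7, "battery": 0.3}
weights = {"weight": 0.5, "camera": 0.3, "battery": 0.8}

# Each feature's contribution is its score times its importance weight.
contributions = {name: scores[name] * weights[name] for name in scores}
total = sum(contributions.values())

# Print the features from largest to smallest contribution.
for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {value:.2f}  ({value / total:.0%} of the total)")
print(f"total    {total:.2f}")
# weight   0.45  (50% of the total)
# battery  0.24  (27% of the total)
# camera   0.21  (23% of the total)
# total    0.90
```

Even with the largest importance weight, battery ends up in second place because its score is low.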

Why You Need a Bias

What if a device had no data, so all inputs are zero? Without bias: (0×0.5)+(0×0.3)+(0×0.8) = 0. Whatever weights the neuron learns, zero inputs always produce a sum of 0, so it can never express a default leaning. With a bias of +0.3, the same zero inputs produce 0 + 0.3 = 0.3. The neuron can start from a non-zero position and learn to be naturally optimistic or pessimistic, like an expert with priors before seeing any data.

The Effect of Bias

No Bias

Σ(0 × w) = 0 — always starts from zero. Limited representational power; must travel 0.50 to reach the threshold.

Bias = +0.3

Σ(0 × w) + 0.3 = 0.3 — a flexible head start. Only 0.20 away from firing. The AI learns the right bias during training. ✅
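The comparison above can be reproduced directly. This is a minimal sketch; the helper name `weighted_sum` is an assumption, and the 0.5 threshold is the one used throughout this article.

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus the bias (before any activation)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

weights = [0.5, 0.3, 0.8]   # importance weights from the store example
no_data = [0.0, 0.0, 0.0]   # a device with no scores at all
threshold = 0.5             # firing threshold used in this article

without_bias = weighted_sum(no_data, weights)            # 0.0
with_bias = weighted_sum(no_data, weights, bias=0.3)     # 0.3

print(f"gap to threshold without bias: {threshold - without_bias:.2f}")  # 0.50
print(f"gap to threshold with bias:    {threshold - with_bias:.2f}")     # 0.20
```

A negative bias would work the other way, making the neuron harder to fire.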

The Activation Function: The Decision Gate

After computing the weighted sum, the neuron passes it through an activation function, which does two critical things: it makes the decision (how strongly to fire) and adds non-linearity (without it, a million neurons collapse into one linear equation — useless for complex patterns). The real world isn't linear: "more camera quality is always better" stops being true at 200MP. Activation functions let the network learn these curves. Three you need to know:

The Big Three Functions

⚡ReLU (Hidden Layers)

Rule: max(0, x). If negative, output 0; if positive, pass through. The default choice for hidden layers in many networks; Transformer models often use smooth variants such as GELU.

🌗SIGMOID (Binary)

Rule: Squash to the range (0, 1). The output can be read as a probability (e.g. 88% chance it's spam). Best for Yes/No decisions.

📊SOFTMAX (Multi-Class)

Rule: A distribution that sums to 100%. Distributes probability across many options (Cat vs Dog vs Bird).

python

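import math

import numpy as np

def relu(x):
    """Block negative values, pass positive values through unchanged."""
    return max(0.0, x)
# relu(-0.3) → 0.0 (blocked) ; relu(0.76) → 0.76 (passed through)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))
# sigmoid(-5) → 0.007 ; sigmoid(0) → 0.5 ; sigmoid(5) → 0.993

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    e = np.exp(np.asarray(scores) - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

softmax([2.0, 1.0, 0.5])  # → [0.63, 0.23, 0.14]  "63% glasses, 23% ring, 14% earbuds"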

The golden rule for which function goes where:

Where	Use	Used in Embeddings?
Hidden layers (between input & output)	ReLU or a close variant such as GELU	✅ Yes, in the feed-forward sublayers of Transformer blocks
Output layer, binary decision	Sigmoid (yes/no probability)	⚠️ Rarely, only in binary classification heads
Output layer, multi-class	Softmax (probability distribution)	✅ Yes, though inside the model: it normalizes attention scores
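The claim that, without an activation function, stacked neurons collapse into one linear equation can be checked numerically. The sketch below uses hand-picked weights (an assumption for illustration, not a trained model): without ReLU the two layers fold into a single layer that outputs 0 for every input, while with ReLU the same weights compute the absolute difference of the two inputs.

```python
import numpy as np

# Hand-picked illustrative weights (not from a trained model).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])    # hidden layer: 2 inputs -> 2 neurons
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])     # output layer: 2 inputs -> 1 neuron
b2 = np.zeros(1)

def linear_net(x):
    """Two layers with no activation in between."""
    return W2 @ (W1 @ x + b1) + b2

def relu_net(x):
    """The same two layers with ReLU applied to the hidden layer."""
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

# Without an activation, the two layers fold into one linear layer.
W_folded = W2 @ W1              # [[0., 0.]]
b_folded = W2 @ b1 + b2         # [0.]

x = np.array([2.0, 1.0])
print(linear_net(x), W_folded @ x + b_folded)   # [0.] [0.]
print(relu_net(x))                              # [1.]
```

No single linear layer can compute an absolute difference, which is the kind of curve the activation function makes possible.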

Before vs. After Training

The weights start as random values. Training adjusts them until the network makes mostly correct predictions. The numbers below are illustrative.

The Power of Training

Before Training

weights = [0.23, -0.71, 0.15], bias = 0.88
Accuracy on test set: 47% ≈ random guessing.

After Training

weights = [0.50, 0.30, 0.80], bias = 0.05
Accuracy on test set: 90% after 1,000 training steps.

The next article explains exactly how those weights go from random → 90% accurate.

Build a Neuron from Scratch in Python

python

import numpy as np

def neuron(inputs, weights, bias, activation='relu'):
    """A single artificial neuron."""
    weighted_sum = np.dot(inputs, weights) + bias        # Step 1: weighted sum
    if activation == 'relu':                              # Step 2: activation
        return max(0.0, weighted_sum)
    elif activation == 'sigmoid':
        return 1 / (1 + np.exp(-weighted_sum))
    return weighted_sum  # linear

inputs  = [0.9, 0.7, 0.3]   # weight, camera, battery scores
weights = [0.5, 0.3, 0.8]   # importance of each feature
result  = neuron(inputs, weights, bias=0.0, activation='relu')
print(f"Raw Score: {result:.2f}")                         # → 0.90
print("Stock it ✅" if result > 0.5 else "Pass ❌")        # → Stock it ✅

From One Neuron to a Network

A single neuron answers one yes/no question. Understanding language, recognizing faces, generating code — those require layers of neurons, each passing its output as input to the next, learning increasingly abstract patterns:

The Hierarchy of Abstraction

🏁

INPUT LAYER

Receives raw features (tokens, pixels).

🔍

HIDDEN LAYER 1

Detects low-level patterns — edges in images, "words related to weight" in text.

🧠

HIDDEN LAYER N

Builds abstract concepts — "lightweight + translation = translation device category."

✅

OUTPUT LAYER

Produces the final prediction or the next token in a chat.
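In practice a whole layer is usually computed at once as a matrix-vector product rather than neuron by neuron. The following is a minimal sketch of that idea; the name `dense_layer` and the weight values are illustrative assumptions.

```python
import numpy as np

def dense_layer(inputs, weight_matrix, biases):
    """Compute every neuron in a layer at once: ReLU(W @ x + b)."""
    return np.maximum(0.0, weight_matrix @ inputs + biases)

# Illustrative weights: each ROW holds the weights of one neuron.
W = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.4],
              [0.6, 0.2, 0.7]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([0.9, 0.7, 0.3])

hidden = dense_layer(x, W, b)
print(hidden.round(2))   # [0.82 0.77 0.79]
```

Stacking layers then means feeding `hidden` into another `dense_layer` call with its own weights and biases.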

The AI doesn't design these layers explicitly — it figures out what patterns to detect on its own, through training. And it's the same formula scaled across 80 years:

The Scale of Intelligence

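1943: 1 Neuron (the concept).
1980s: ~1K Neurons (early nets).
2012: ~650K Neurons, 60M Parameters (AlexNet, vision).
2019: 1.5B Parameters (GPT-2).
2026: Reportedly hundreds of billions to trillions of Parameters (frontier models such as GPT-4 and Claude; exact counts are not published).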
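Neurons and parameters are not the same count: every neuron carries one weight per input plus one bias. Here is a small sketch for fully connected networks (the function name `count_parameters` is hypothetical; real architectures such as Transformers have additional parameter types):

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # one weight per input, per neuron
        total += n_out          # one bias per neuron
    return total

print(count_parameters([3, 3, 1]))        # 16
print(count_parameters([784, 128, 10]))   # 101770
```

The second example has only 138 neurons after the input layer but over 100,000 parameters, which is why parameter counts grow so much faster than neuron counts.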

How This Creates the 384-Dimensional Embeddings

In Part 2 we used a model that converted sentences into 384-dimensional vectors. Now you know roughly how: the 384 dimensions aren't magic. To a first approximation, they're the output values of 384 neurons in the model's final layer (many embedding models also average these outputs across tokens to get one vector per sentence). Through training on billions of sentences, those neurons jointly learned to capture aspects of meaning such as topic, sentiment, language, and formality, though individual dimensions rarely map to a single concept we have a name for.
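As a rough sketch of what "384 neurons in the final layer" means in code: each output dimension is one weighted sum over the previous layer's values. The input width of 384 and the random weights below are assumptions standing in for a real model's trained values, so the resulting vector carries no meaning.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

hidden_size = 384      # assumed width of the previous layer (hypothetical)
embedding_dim = 384    # one output neuron per embedding dimension

# Random stand-ins for learned weights; a real model loads trained values.
W = rng.normal(scale=0.05, size=(embedding_dim, hidden_size))
b = np.zeros(embedding_dim)

features = rng.normal(size=hidden_size)   # stand-in for the previous layer's output
embedding = W @ features + b              # 384 weighted sums, one per neuron

print(embedding.shape)   # (384,)
```

Row `i` of `W` holds the weights of neuron `i`, so `embedding[i]` is that neuron's weighted sum.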

The Core Insight

💡

Why AI Works the Way It Does

Understanding this single unit explains a lot:
Why does training take so long? The AI is adjusting billions of weights, over and over.
Why does more data help? More examples = more chances to refine those weights.
Why do bigger models often do better? More neurons = more capacity for subtle patterns.
What are those 384 embedding dimensions? Roughly, the outputs of 384 neurons in the final layer.
Intelligence isn't a complex formula. It's one simple formula (Σwx + b) repeated billions of times in parallel.

Try It Yourself

Extend the neuron above into a 2-layer network:

python

def two_layer_network(inputs, w1, b1, w2, b2):
    """A minimal neural network with one hidden layer (reuses neuron() from above)."""
    # Hidden layer: one neuron per row of w1, each with its own bias.
    hidden = [neuron(inputs, w, b, 'relu') for w, b in zip(w1, b1)]
    return neuron(hidden, w2, b2, 'sigmoid')                            # output layer

w1 = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.4], [0.6, 0.2, 0.7]]
b1 = [0.1, 0.0, -0.1]
w2, b2 = [0.4, 0.6, 0.3], 0.2
result = two_layer_network([0.9, 0.7, 0.3], w1, b1, w2, b2)
print(f"Network output: {result:.3f}")   # → 0.773
print("Stock it ✅" if result > 0.5 else "Pass ❌")   # → Stock it ✅

Experiments to try:

Bias experiment: change b2 = 0.2 to b2 = -2.0. The output drops below 0.5 and the verdict flips to "Pass": bias shifts the decision boundary.
Dead neuron: set b1 = [-5.0, -5.0, -5.0]. With ReLU, all hidden neurons output 0, so the network ignores its inputs and returns the same value every time. Neurons stuck at 0 like this are the "dead ReLU" problem.
Activation swap: change 'relu' to 'sigmoid' in the hidden layer and compare outputs.
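If you want to run all three experiments side by side, here is a self-contained sketch. It re-implements the neuron in plain Python so it runs on its own; the `hidden_activation` parameter and the function name `network` are additions for this illustration.

```python
import math

def neuron(inputs, weights, bias, activation="relu"):
    """Single neuron in plain Python: weighted sum, then activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    if activation == "relu":
        return max(0.0, z)
    if activation == "sigmoid":
        return 1 / (1 + math.exp(-z))
    return z  # linear

def network(inputs, w1, b1, w2, b2, hidden_activation="relu"):
    """One hidden layer followed by a sigmoid output neuron."""
    hidden = [neuron(inputs, w, b, hidden_activation) for w, b in zip(w1, b1)]
    return neuron(hidden, w2, b2, "sigmoid")

x = [0.9, 0.7, 0.3]
w1 = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.4], [0.6, 0.2, 0.7]]
b1 = [0.1, 0.0, -0.1]
w2, b2 = [0.4, 0.6, 0.3], 0.2

baseline = network(x, w1, b1, w2, b2)
low_bias = network(x, w1, b1, w2, -2.0)               # bias experiment
dead = network(x, w1, [-5.0] * 3, w2, b2)             # dead-neuron experiment
sig_hidden = network(x, w1, b1, w2, b2, "sigmoid")    # activation swap

print(f"baseline:       {baseline:.3f}")   # 0.773
print(f"b2 = -2.0:      {low_bias:.3f}")   # 0.274
print(f"dead hidden:    {dead:.3f}")       # 0.550
print(f"sigmoid hidden: {sig_hidden:.3f}")
```

In the dead-neuron case the result is just sigmoid(b2), because every hidden output is 0. Changing `x` will not move it.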

The key question — how does the network learn the right values for those weights and biases? — is the subject of the next article.

Key Takeaways

Simple Equation, Infinite Scale

Intelligence isn't a complex formula. It's one simple formula (Σwx + b) repeated billions of times in parallel.

Weights are Memory

When you download an AI model, you are literally downloading a list of weights. Those numbers are the model's 'experience'.

Bias is Flexibility

Without bias, a neuron can't learn patterns that don't start at zero. It's the mathematical equivalent of having 'priors' or intuition.

Up Next in the Series

💡

Next: How a Network Learns

We have a neuron that makes decisions, but its weights started random. Next, we'll see exactly how those weights go from random guessing to 90% accuracy: the training loop of Forward Pass → Loss → Backpropagation → Gradient Descent.