
Part 4 — The Artificial Neuron: The Tiny Decision Machine Behind Every AI

Your brain has 86 billion neurons. ChatGPT has trillions of parameters. Today we isolate just one digital brain cell and watch it make a decision — in Python.

March 12, 2026
11 min read
Tags: AI · Neural Networks · Neuron · Deep Learning · Activation Functions · Python

In our previous articles we saw how AI converts words to tokens, tokens to embeddings, and embeddings to searchable vectors. But here's the question nobody asks first:

What actually computed those 384-dimensional embeddings in the first place?

The answer is a staggeringly simple unit of computation called the Artificial Neuron — a mathematical imitation of the biological brain cell that scientists have been refining since 1943.

Every embedding model, every language model, every image generator — all of them are built from billions of these neurons stacked in layers. Once you understand one, you understand the fundamental building block of all modern AI.


The Biological Original

Your brain contains approximately 86 billion neurons. ChatGPT has trillions of parameters (weights inside its neurons) — not neurons per se, but the same mathematical principle scaled up enormously. Each biological neuron is a tiny machine that:

  1. Receives electrical signals from other neurons via dendrites
  2. Weighs those signals (some inputs matter more than others)
  3. Decides whether to fire — and if so, how strongly
  4. Passes the result forward to the next layer of neurons

This is how you recognize a face, understand a sentence, feel pain, and solve a math problem. It's all neurons receiving, weighing, deciding, and passing.

In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts asked: What if we could model this in mathematics?

Their answer became the foundation of every AI system in existence today.

Same 4-Step Logic — Different Medium

🧬 Biological neuron: cell body ① receives signals → ② weighs them → ③ decides whether to fire → ④ passes the result on.

⚙️ Artificial neuron (pure math): ① inputs x₁, x₂, x₃ → ② weights ×w₁, ×w₂, ×w₃ → ③ Σ(x·w) + bias → ④ activate f(Σ) → ⑤ output.


The Artificial Neuron: Same Idea, Pure Math

The artificial neuron does exactly what the biological one does — but with numbers instead of electrical impulses.

Output = Activation( Σ(x × w) + b )

Don't worry if the formula looks scary — we'll make it completely concrete with a real example in the next section.

(x) Inputs — the raw data coming in: numbers representing pixels, word embeddings, sensor values, or anything else.

(w) Weights — how important is each input? The AI learns these numbers during training. They start random and get refined.

(b) Bias — a starting offset. Without it, a neuron forced to start from zero can't represent some patterns. Bias gives it flexibility.

(f) Activation Function — the decision gate. It transforms the weighted sum into the final output — and crucially adds non-linearity.

Data Pipeline Through a Single Neuron

📊 INPUTS (x₁, x₂, x₃…) → ⚖️ × WEIGHTS (each xᵢ × wᵢ) → SUM + BIAS (Σ(xᵢwᵢ) + b) → 🚦 ACTIVATION (ReLU / Sigmoid) → OUTPUT (one number)

This 5-step pipeline runs for every single neuron, millions of times per forward pass through the model.


A Concrete Example: The Electronics Store Evaluator

Abstract math is hard to feel. Let's make it real.

Imagine you work in an electronics store. A new device arrives and you have to decide: should we stock it? You evaluate three things:

  • Weight (lighter = better for customers): score 0.9 out of 1
  • Camera quality: score 0.7 out of 1
  • Battery life: score 0.3 out of 1

But not all three matter equally. Based on your store's customer research:

  • Weight matters a lot: importance 0.5
  • Camera is secondary: importance 0.3
  • Battery matters most: importance 0.8

How Each Input's Weight Shapes Its Contribution

⚖️ Weight  score 0.9 × importance 0.5 = 0.45
📸 Camera  score 0.7 × importance 0.3 = 0.21
🔋 Battery  score 0.3 × importance 0.8 = 0.24
Total: 0.45 + 0.21 + 0.24 = 0.90 ✅

Notice: battery has a HIGH importance (0.8) but a LOW score (0.3), so it ends up contributing less (0.24) than the device's weight does (0.45) — inputs and weights always act together.


0.90 > 0.5 threshold → ✅ Stock this device!

That is exactly what an artificial neuron does. It's a weighted vote of inputs, followed by a threshold decision. The AI doesn't guess those weights — it learns them automatically by looking at thousands of examples of good and bad devices.

This is exactly what happens inside every neuron in an embedding model — weighted inputs scored, summed, and decided millions of times per forward pass to produce those 384 output numbers.
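The whole walkthrough fits in a few lines of Python (same numbers as above; the 0.5 threshold is the store's own cutoff, not something the neuron computes):

```python
# The store decision as a weighted vote (numbers from the example above)
scores  = [0.9, 0.7, 0.3]   # device weight, camera, battery
weights = [0.5, 0.3, 0.8]   # learned importance of each feature
bias    = 0.0

total = sum(s * w for s, w in zip(scores, weights)) + bias
print(round(total, 2))                          # 0.9
print("Stock it" if total > 0.5 else "Pass")    # Stock it
```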

Why You Need a Bias

What if the device had no data — all inputs are zero?

Without bias: (0×0.5) + (0×0.3) + (0×0.8) = 0. The neuron always outputs 0. It's completely stuck.

With a bias of +0.3: the same zero inputs produce 0 + 0.3 = 0.3. The neuron can start from a non-zero position. It can learn to be naturally optimistic or pessimistic about certain categories — just like an expert who has priors before seeing any data.

No bias: Σ(0 × w) = 0 — always starts from zero. Limited representational power.

Bias = +0.3: Σ(0 × w) + 0.3 = 0.3 — a flexible starting point. The AI learns the right bias during training. ✅

How Bias Shifts the Decision Starting Point

Without bias the neuron starts at 0 and must travel the full 0.50 to reach the threshold; with bias +0.3 it starts at 0.3, only 0.20 away. ✅

Bias = a head start. The neuron needs less incoming signal to fire — giving the network more expressive flexibility.
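A minimal sketch of the difference, reusing the importance weights from the store example (any values would show the same effect):

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Sum of input*weight products, plus the bias offset."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

no_data = [0.0, 0.0, 0.0]            # a device with no information at all
ws      = [0.5, 0.3, 0.8]

print(weighted_sum(no_data, ws))        # 0.0 - stuck at zero, no matter the weights
print(weighted_sum(no_data, ws, 0.3))   # 0.3 - the bias gives a head start
```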


The Activation Function: The Decision Gate

After computing the weighted sum, the neuron passes it through an activation function. This does two critical things:

  1. Makes the decision (how strongly does this neuron fire?)
  2. Adds non-linearity (without this, a million neurons collapse into one linear equation — useless for complex patterns)

Think about it: the real world isn't linear. "More camera quality is always better" stops being true at some point. A phone camera at 200MP is overkill. Activation functions let the network learn these thresholds and curves.
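You can verify the collapse yourself. With no activation, two stacked layers are algebraically one layer; insert ReLU between them and the equivalence breaks. The matrices below are arbitrary toy values:

```python
import numpy as np

W1 = np.array([[1.0, 2.0], [0.5, -1.0]])   # layer-1 weights (toy values)
W2 = np.array([[0.3, -0.7], [1.2, 0.4]])   # layer-2 weights (toy values)
x  = np.array([0.9, 0.7])                  # an input vector

# No activation: two layers collapse into one equivalent weight matrix
two_layers = W2 @ (W1 @ x)
one_layer  = (W2 @ W1) @ x
assert np.allclose(two_layers, one_layer)   # identical - no extra power gained

# With ReLU in between, the collapse no longer holds
relu = lambda v: np.maximum(0.0, v)
with_relu = W2 @ relu(W1 @ x)
assert not np.allclose(with_relu, two_layers)   # non-linearity changed the result
```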

There are three you need to know:

ReLU — The Most Common (Hidden Layers)

Rule: If the input is negative, output 0. If positive, pass it through unchanged.

def relu(x):
    return max(0, x)

relu(-5.0)  # → 0.0  (blocked)
relu(-0.3)  # → 0.0  (blocked)
relu(0.76)  # → 0.76 (passed through)
relu(2.5)   # → 2.5  (passed through)

Why ReLU? It's dead simple to compute (no exponentials), works well in practice, and doesn't cause the "vanishing gradient" problem that plagued earlier functions.

Where: Used in almost every hidden layer of every modern neural network.
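A quick sketch of that vanishing-gradient point: sigmoid's slope, s·(1−s), shrinks toward zero for large inputs, while ReLU's slope is exactly 1 for any positive input:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)              # derivative of the sigmoid

print(sigmoid_grad(0))              # 0.25, the steepest the sigmoid ever gets
print(round(sigmoid_grad(5), 4))    # 0.0066, the gradient has nearly vanished
# ReLU's derivative is exactly 1 for any x > 0, so gradients pass through undamped
```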

Sigmoid — The Probability Maker (Binary Output)

Rule: Squish any number into the range [0, 1]. This makes the output interpretable as a probability.

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

sigmoid(-5)  # → 0.007  (nearly 0%)
sigmoid( 0)  # → 0.500  (50/50)
sigmoid( 5)  # → 0.993  (nearly 100%)

Where: The output layer of a binary classifier (spam vs. not spam, cat vs. dog).

Interpret: "There's a 73% chance this email is spam."

Softmax — The Multi-Class Chooser (Final Layer)

When you have more than two categories, Softmax converts raw scores into a probability distribution that sums to exactly 100%.

import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

raw_scores = [2.0, 1.0, 0.5]
probabilities = softmax(raw_scores)

# "63% glasses, 23% ring, 14% earbuds" — and they sum to 100%
👓 Glasses 63% · 💍 Ring 23% · 🎧 Earbuds 14%
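One practical caveat: np.exp overflows for large scores (try a raw score of 1000). The standard fix, used by real frameworks, is to subtract the maximum score first; the result is mathematically identical but numerically safe:

```python
import numpy as np

def softmax_stable(scores):
    z = np.asarray(scores, dtype=float)
    z = z - z.max()                 # shift so the largest score becomes 0
    e = np.exp(z)                   # exp() can no longer overflow
    return e / e.sum()

print(np.round(softmax_stable([2.0, 1.0, 0.5]), 2))   # [0.63 0.23 0.14]
print(np.round(softmax_stable([1000.0, 999.0]), 2))   # [0.73 0.27] - naive exp() overflows here
```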

Three Activation Functions at a Glance

  • ReLU — outputs 0 or passes the value through → hidden layers ✅
  • Sigmoid — squashes any number to 0 → 1 → binary output ✅
  • Softmax — a probability distribution that sums to 100% → multi-class ✅

The golden rule for activation functions:

  • Hidden layers (between input and output): ReLU — almost always. In embedding models: ✅ yes, used between Transformer layers.
  • Output layer, binary decision: Sigmoid — yes/no probability. In embedding models: ⚠️ rarely — only for binary classification heads.
  • Output layer, multi-class decision: Softmax — probability distribution. In embedding models: ✅ yes — used in the attention mechanism to normalize scores.

Before vs After Training

The weights start completely random. Training adjusts them until the network makes correct predictions.

Before Training

weights = [0.23, -0.71, 0.15], bias = 0.88
Accuracy on test set: 47% — ≈ random guessing

After Training

weights = [0.50, 0.30, 0.80], bias = 0.05
Accuracy on test set: 90% — after 1,000 training steps

The next article explains exactly how those weights go from random → 90% accurate.


Build a Neuron from Scratch in Python

import numpy as np

def neuron(inputs, weights, bias, activation='relu'):
    """
    A single artificial neuron.
    inputs:  list of input values
    weights: list of importance multipliers (same length as inputs)
    bias:    starting offset
    activation: 'relu', 'sigmoid', or 'linear'
    """
    # Step 1: Weighted sum
    weighted_sum = np.dot(inputs, weights) + bias

    # Step 2: Activation
    if activation == 'relu':
        return max(0.0, weighted_sum)
    elif activation == 'sigmoid':
        return 1 / (1 + np.exp(-weighted_sum))
    return weighted_sum  # linear (no activation)

Now run it on our electronics store example:

inputs  = [0.9, 0.7, 0.3]   # weight score, camera score, battery score
weights = [0.5, 0.3, 0.8]   # importance of each feature
bias    = 0.0

result = neuron(inputs, weights, bias, activation='relu')
print(f"Raw Score: {result:.2f}")   # → 0.90 (positive, so ReLU passes it through unchanged)

threshold = 0.5
if result > threshold:
    print("DECISION: Stock this device ✅")
else:
    print("DECISION: Pass ❌")

Output:

Raw Score: 0.90
DECISION: Stock this device ✅

From One Neuron to a Network

A single neuron can answer one yes/no question. But understanding language? Recognizing faces? Generating code? Those require layers of neurons, each one passing its output as input to the next.

The Same Formula — Scaled Across 80 Years

1943: 1 neuron → 1980s: ~1K parameters (early nets) → 2012: 60M (AlexNet) → 2019: 1.5B (GPT-2) → 2020–now: 175B+ (GPT-3/4)

Every unit on that timeline runs the same formula: Σ(x × w) + b → activation() — just repeated billions of times in parallel.

[Network diagram: Input Layer (IN₁, IN₂, IN₃) → Hidden Layer (H₁, H₂, H₃, ReLU neurons) → Output Layer (OUT, Sigmoid / Softmax)]

Every line = a weight. Every node = a neuron. The AI learns all weights simultaneously through training.

Each hidden layer learns increasingly abstract patterns:

  • Layer 1 might detect: "this device description contains words related to weight"
  • Layer 2 might detect: "lightweight + translation = translation device category"
  • Layer 3 might detect: "translation device + low price = high match for this query"

The AI doesn't design these layers explicitly. It figures out what patterns to detect on its own, through training.

Abstraction Builds Across Layers

Layer 1 (raw): "words about weight" → Layer 2 (abstract): "device category" → Layer 3 (semantic): "match score for query" → OUT: decision. Raw features become abstract concepts.

How This Neuron Creates the 384-Dimensional Embeddings We Saw Earlier

In the previous article, we used a model that converted sentences into 384-dimensional vectors. Now you know exactly how that happens:

📝 Input sentence: "I need coffee" → tokenized into ~4 token embeddings

🧠 12 Transformer layers: ~117M parameters, each neuron computing Σ(x × w) + b → ReLU/Softmax

Final layer output: 384 neurons, each firing with a different value: [0.23, -0.71, 0.15, 0.88, … × 380 more]

That array of 384 numbers IS the embedding.

The 384 dimensions aren't magic — they're the output values of 384 neurons in the model's final layer. Each one learned (through training on billions of sentences) to capture a different aspect of meaning: topic, sentiment, language, formality, and hundreds of other subtle dimensions we don't have names for.
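To make that concrete, here is a toy stand-in for such a final layer. The weights are random rather than learned, and tanh stands in for whatever activation the real model uses, so the output carries no meaning; only the shape of the computation matches:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=384)                # output of the previous layer (toy values)
W = rng.normal(0.0, 0.05, size=(384, 384))   # final-layer weights: random here, learned in a real model
b = np.zeros(384)                            # final-layer biases

embedding = np.tanh(W @ hidden + b)          # 384 neurons fire -> 384 numbers
print(embedding.shape)                       # (384,)
```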


The Four Network Types You Need to Know

  • Feed-Forward (FNN) — data flows one way only → simple classification
  • Convolutional (CNN) — scans local regions (patches) → images, video
  • Recurrent (RNN/LSTM) — has memory of previous inputs → time series, older text models
  • Transformer — attends to all inputs simultaneously → all modern LLMs

The embedding models we used in earlier articles are Transformers — networks of neurons that learned to compress meaning into 384-dimensional vectors by reading billions of sentences.


Real-World Impact

🔬 Medical Diagnosis — 95% accuracy detecting melanoma. Stanford, 2017: matching board-certified dermatologists on 100,000+ labeled skin images — zero hand-written rules, pure neuron learning.

🎤 Speech Recognition — every word you say to Siri or Google. Audio waveform → phoneme probabilities → word probabilities: a chain of neurons firing in sequence. No rules ever written.

🌍 Your Embedding Model — ~117M parameters in MiniLM-L12-v2. "I need coffee" and "محتاج قهوة" produce nearly identical 384-dim vectors. 50+ languages understood — no translation rules, pure learning.

The Core Insight

A neuron is just a weighted vote.

Stack enough of them and they can learn anything.

The magic isn't in any single neuron. It's in what emerges when billions of them are trained together on enough data.

Understanding this single unit explains why AI works the way it does:

  • Why does training take so long? The AI is adjusting billions of weights simultaneously
  • Why does more data help? More examples = more opportunities to refine those weights
  • Why do bigger models perform better? More neurons = more capacity to learn subtle patterns
  • What are those 384 embedding dimensions? The output of 384 neurons in the final layer of the model

Pro Tips for Builders

  • Start with ReLU in hidden layers — it trains faster and avoids the vanishing gradient problem that crippled older activations like tanh
  • Always include bias — a neuron without bias can only produce a hyperplane through the origin, severely limiting what it can learn
  • Weights matter at init — random weights that are too large cause exploding gradients; use Xavier or He initialization in frameworks
  • More neurons ≠ better — overfitting is real. Start small, measure validation loss, then scale
  • The 384 dimensions in our embedding model are literally 384 output neurons in the final layer — each one learned to capture a different semantic dimension of meaning
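The initialization tip in code: He initialization draws weights with standard deviation √(2 / fan_in), which keeps ReLU activations from shrinking or exploding as they pass through layers:

```python
import numpy as np

fan_in, fan_out = 384, 128                   # sizes of the layer being initialized
std = np.sqrt(2.0 / fan_in)                  # He initialization target std (~0.072 here)
W = np.random.default_rng(0).normal(0.0, std, size=(fan_out, fan_in))
print(abs(W.std() - std) < 0.005)            # True: the sample std lands on target
```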

Try It Yourself

Extend the neuron function above into a 2-layer network:

def two_layer_network(inputs, w1, b1, w2, b2):
    """A minimal neural network with one hidden layer."""
    # Layer 1: hidden layer with ReLU
    h1 = neuron(inputs, w1[0], b1[0], 'relu')
    h2 = neuron(inputs, w1[1], b1[1], 'relu')
    h3 = neuron(inputs, w1[2], b1[2], 'relu')
    hidden = [h1, h2, h3]

    # Layer 2: output layer with sigmoid (binary decision)
    output = neuron(hidden, w2, b2, 'sigmoid')
    return output

# Random weights (in real training, these are learned)
w1 = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.4], [0.6, 0.2, 0.7]]
b1 = [0.1, 0.0, -0.1]
w2 = [0.4, 0.6, 0.3]
b2 = 0.2

inputs = [0.9, 0.7, 0.3]
result = two_layer_network(inputs, w1, b1, w2, b2)
print(f"Network output: {result:.3f}")
print(f"Decision: {'Stock it ✅' if result > 0.5 else 'Pass ❌'}")
Output:

Network output: 0.773
Decision: Stock it ✅

Experiments to try:

  1. Bias experiment — change b2 = 0.2 to b2 = -2.0 and re-run. The output should drop dramatically, showing how bias shifts the decision boundary
  2. Dead neuron — set b1 = [-5.0, -5.0, -5.0]. With ReLU, all hidden neurons output 0 and the network goes blind to its inputs: the output sticks at sigmoid(b2) no matter what comes in — this is the "dead ReLU" problem
  3. Activation swap — change 'relu' to 'sigmoid' in Layer 1 and compare outputs. Sigmoid will produce different values but the same decision logic

The key question: how does the network learn the right values for w1, b1, w2, b2? That's the subject of the next article.


Next in AI Fundamentals

How AI Learns: The Training Loop

Forward Pass → Loss → Backpropagation → Gradient Descent. The four-step cycle that turns random weights into intelligence — and why it takes millions of dollars of compute to do it at scale.


Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →