Neural networks (NNs) learn only continuous piecewise linear (CPWL) functions if their activation functions are themselves CPWL. If a network uses smooth activation functions such as Sigmoid, Tanh, or Swish/GELU, it represents a smooth, non-linear function instead. However, a large share of models in practice rely on ReLU or its piecewise linear variants such as Leaky ReLU, although many recent LLMs and VLMs use GELU- or Swish-based activations instead.
The Mathematics of Composition
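A standard feedforward neural network is essentially a series of alternating affine transformations (matrix multiplications and bias additions) and non-linear activation functions. Mathematically, layer $l$ computes:
$$h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$$
where $h^{(l-1)}$ is the output of the previous layer (with $h^{(0)}$ the network input), $W^{(l)}$ and $b^{(l)}$ are the layer's weight matrix and bias, and $\sigma$ is the activation function applied elementwise.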
A basic fact from real analysis: the composition of continuous piecewise linear functions is itself a continuous piecewise linear function. Affine maps and ReLU are both CPWL, so stacking them never creates true curves; it only adds more linear "hinges."
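The composition property can be checked numerically on a toy example. The sketch below assumes a hypothetical 1 -> 3 -> 2 -> 1 ReLU network with hand-picked weights (none of these values come from a real model) and evaluates it on a fine grid. If the function is piecewise linear, its second finite difference should vanish everywhere except at a handful of hinge points.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical weights for a 1 -> 3 -> 2 -> 1 network (illustrative only).
W1 = np.array([[1.0], [-1.0], [2.0]])
b1 = np.array([0.0, 0.5, -1.0])
W2 = np.array([[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]])
b2 = np.array([-0.2, 0.1])
W3 = np.array([[1.0, -1.0]])
b3 = np.array([0.0])

def network(x):
    """Alternate affine maps and ReLU, as in a standard feedforward network."""
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

# Evaluate the scalar function on a fine 1D grid.
xs = np.linspace(-2.0, 2.0, 4001)
ys = np.array([network(np.array([x]))[0] for x in xs])

# On a straight segment the second finite difference is zero (up to rounding);
# it is non-zero only where the slope changes, i.e. at a hinge.
second_diff = ys[:-2] - 2.0 * ys[1:-1] + ys[2:]
kink_mask = np.abs(second_diff) > 1e-9
kink_locations = xs[1:-1][kink_mask]

print("grid points with curvature:", int(kink_mask.sum()), "of", len(second_diff))
print("approximate hinge locations:", np.round(kink_locations, 3))
```

Only a few grid points should show any curvature, and they mark where some neuron switches between active and inactive. Swapping `relu` for a smooth activation such as `np.tanh` would be expected to make the second difference non-zero almost everywhere.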
The Geometry of Convex Polytopes
Each neuron in a hidden layer defines a hyperplane in the input space of that layer (for the first hidden layer, this is the network's input space). This hyperplane acts as a "fold."
- On one side of the hyperplane, the neuron's pre-activation is negative, the ReLU outputs zero, and the neuron is inactive (its gradient is zero there).
- On the other side, the neuron is active and passes the linear signal forward.
When a network has thousands or millions of neurons, these hyperplanes intersect and fracture the high-dimensional input space into a vast number of distinct, non-overlapping regions, each a convex polytope. Inside any single polytope, the activation state of every neuron in the network is fixed (either ON or OFF). Because the non-linear "decisions" are locked in, the entire network collapses into a single affine map (one effective matrix multiplication plus a bias) for any input that falls within that region.
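To make the partition concrete, the following sketch counts how many distinct ON/OFF activation patterns a tiny network produces over a sampled 2D input square. The 2 -> 3 -> 2 architecture, the weights, and the grid bounds are all assumptions chosen for illustration. Because the count comes from grid sampling, it is only a lower bound on the true number of regions.

```python
import numpy as np

# Hypothetical 2 -> 3 -> 2 ReLU network; weights chosen by hand for illustration.
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, 0.0, -1.0])
W2 = np.array([[1.0, -1.0, 0.5], [-1.0, 1.0, 1.0]])
b2 = np.array([-0.5, -0.25])

# Sample the input square [-2, 2] x [-2, 2] on a regular grid.
axis = np.linspace(-2.0, 2.0, 201)
X = np.array([[u, v] for u in axis for v in axis])

# Pre-activations of both hidden layers for every sampled input.
Z1 = X @ W1.T + b1
H1 = np.maximum(Z1, 0.0)
Z2 = H1 @ W2.T + b2

# One row per input: the ON/OFF state of all five hidden neurons.
patterns = np.hstack([Z1 > 0, Z2 > 0]).astype(np.uint8)
unique_patterns = np.unique(patterns, axis=0)

n_regions = len(unique_patterns)
n_neurons = patterns.shape[1]
print(f"{n_regions} activation patterns found among {2 ** n_neurons} conceivable ones")
```

Each distinct row of `unique_patterns` corresponds to one region in which the set of active neurons is fixed. Not every conceivable pattern is realized, since some combinations of half-spaces have an empty intersection.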
For an input $x$ inside a specific polytope $P$, the network's output is exactly:
$$f(x) = W_P x + b_P$$
where $W_P$ and $b_P$ are the effective weight matrix and bias for that specific region. The network changes its slope only when the input crosses a boundary into a neighboring polytope.
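The effective weight matrix and bias of a region can be computed directly from the activation pattern. The sketch below is a minimal illustration that assumes a hypothetical 3 -> 5 -> 4 -> 2 ReLU network with random weights; the names `W_P`, `b_P`, `forward`, and `effective_affine` are chosen for this example. Each ReLU is replaced by a diagonal 0/1 mask, and the resulting product of matrices is a single affine map that reproduces the network output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3 -> 5 -> 4 -> 2 ReLU network with random weights.
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)

def forward(x):
    """Return the network output and the ON/OFF pattern of each hidden layer."""
    z1 = W1 @ x + b1
    h1 = np.maximum(z1, 0.0)
    z2 = W2 @ h1 + b2
    h2 = np.maximum(z2, 0.0)
    return W3 @ h2 + b3, (z1 > 0, z2 > 0)

def effective_affine(pattern):
    """Collapse the network into one affine map for a fixed activation pattern."""
    D1 = np.diag(pattern[0].astype(float))  # mask of active layer-1 neurons
    D2 = np.diag(pattern[1].astype(float))  # mask of active layer-2 neurons
    W_P = W3 @ D2 @ W2 @ D1 @ W1
    b_P = W3 @ D2 @ (W2 @ D1 @ b1 + b2) + b3
    return W_P, b_P

x = rng.normal(size=3)
y, pattern = forward(x)
W_P, b_P = effective_affine(pattern)
print("matches at x:", np.allclose(y, W_P @ x + b_P))

# A tiny step usually stays inside the same region, where the same map applies.
x_near = x + 1e-6 * rng.normal(size=3)
y_near, pattern_near = forward(x_near)
same_region = all(np.array_equal(p, q) for p, q in zip(pattern, pattern_near))
if same_region:
    print("matches nearby:", np.allclose(y_near, W_P @ x_near + b_P))
```

The diagonal masks are a convenience for readability; for large layers, multiplying rows by the mask elementwise would avoid building dense diagonal matrices.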
Implications for AI and Robustness
- Universal approximation: Even though these networks are composed of straight-line pieces, they can approximate any continuous function on a bounded domain to arbitrary accuracy given enough hinges (or neurons), just as you can approximate a smooth curve by drawing enough small straight line segments.
- Adversarial vulnerabilities: The CPWL nature of networks is often cited as a primary reason why adversarial examples are so effective. Because the network is locally linear, methods like the Fast Gradient Sign Method (FGSM) can exploit the constant slope within a polytope to push an input across a decision boundary using tiny, calculated perturbations.
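The local linearity behind the adversarial point can be illustrated with a simplified FGSM-style step. The sketch assumes a hypothetical 2 -> 3 -> 1 ReLU classifier with hand-picked weights, where the sign of a scalar score is the predicted class. As a simplification, it steps against the gradient of the score itself rather than along the gradient of a training loss, which is equivalent in direction when the true label is the positive class. As long as the perturbed input stays in the same polytope, the change in score is exactly `eps` times the L1 norm of the gradient.

```python
import numpy as np

# Hypothetical 2 -> 3 -> 1 ReLU classifier; sign of the score is the predicted class.
W1 = np.array([[1.0, -1.0], [0.25, 2.0], [-1.0, 1.0]])
b1 = np.array([0.0, -0.5, 0.25])
w2 = np.array([1.0, -2.0, 0.5])
b2 = 1.05

def score_and_grad(x):
    """Return the score, its gradient w.r.t. x, and the hidden activation mask."""
    z1 = W1 @ x + b1
    active = (z1 > 0).astype(float)
    score = w2 @ np.maximum(z1, 0.0) + b2
    # Inside a polytope the gradient is the effective weight vector of that region.
    grad = (w2 * active) @ W1
    return score, grad, active

x = np.array([1.0, 0.5])
eps = 0.02  # maximum change allowed per input coordinate

score, grad, active = score_and_grad(x)

# FGSM-style step: move every coordinate by eps in the direction that lowers the score.
x_adv = x - eps * np.sign(grad)
score_adv, _, active_adv = score_and_grad(x_adv)

# Linear prediction of the new score, exact while the activation mask is unchanged.
predicted = score - eps * np.abs(grad).sum()

print("original score:", score)
print("adversarial score:", score_adv)
print("linear prediction:", predicted)
print("same polytope:", np.array_equal(active, active_adv))
```

In this toy setup a perturbation of at most 0.02 per coordinate is enough to flip the sign of the score, because the input starts close to the decision boundary and the effect of the step grows with the L1 norm of the gradient.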