Piecewise Linear Functions in Neural Networks

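A feedforward neural network represents a continuous piecewise linear (CPWL) function when its activation functions are themselves CPWL. If a network uses smooth activation functions such as Sigmoid, Tanh, Swish, or GELU, it represents a smooth, non-linear function instead. ReLU and its piecewise linear variants (for example Leaky ReLU) remain widely used, so the CPWL view covers a large class of models. Many recent LLMs and VLMs, however, use smooth activations such as GELU or SiLU-based gates, and for those the CPWL picture is at best an approximation.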

The Mathematics of Composition

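A standard feedforward neural network is a sequence of alternating affine transformations (matrix multiplications and bias additions) and non-linear activation functions. With weight matrix $W_i$, bias $b_i$, activation $\sigma$, and previous layer output $h_{i-1}$ (where $h_0 = x$ is the input), layer $i$ computes: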

$$h_i = \sigma(W_i h_{i-1} + b_i)$$
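The layer rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer widths, the random parameters, and the choice to leave the final layer without an activation are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU, a CPWL activation."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Apply h_i = relu(W_i h_{i-1} + b_i) for each hidden layer,
    then a final affine map with no activation (an assumption here)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
sizes = [3, 8, 8, 2]  # hypothetical layer widths: input, two hidden, output
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(3)
y = forward(x, weights, biases)
print(y.shape)  # (2,)
```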

From real analysis: the composition of continuous piecewise linear functions is itself continuous and piecewise linear. Affine maps are trivially CPWL, so stacking affine layers with CPWL activations keeps the whole network CPWL. Combining piecewise linear segments never produces true curves, only more linear pieces joined at 'hinges'.
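One way to see this numerically is to evaluate a small ReLU network with a scalar input on a fine grid and look at second differences: they vanish wherever the function is linear and are non-zero only next to a hinge. The sketch below assumes a hypothetical 1 → 4 → 4 → 1 network with random parameters. For that shape the network can have at most 24 hinges, and each hinge can disturb at most two neighbouring second differences.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 1 -> 4 -> 4 -> 1 ReLU network with random parameters.
W1, b1 = rng.standard_normal((4, 1)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal(4)
w3, b3 = rng.standard_normal(4), rng.standard_normal()

def f(x):
    """Network output for a scalar input x."""
    h1 = np.maximum(W1[:, 0] * x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return w3 @ h2 + b3

xs = np.linspace(-5.0, 5.0, 10001)
ys = np.array([f(x) for x in xs])

# The second difference is zero on a straight segment, non-zero across a hinge.
second_diff = ys[:-2] - 2.0 * ys[1:-1] + ys[2:]
kinked = np.abs(second_diff) > 1e-9

print(f"{kinked.sum()} of {kinked.size} grid points sit next to a hinge")
```

Almost every grid point reports zero curvature, which is what "more hinges, no true curves" means in practice.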

The Geometry of Convex Polytopes

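Each neuron in the first hidden layer defines a hyperplane in the input space: the set of inputs where its pre-activation is zero. This hyperplane acts as a "fold." Neurons in deeper layers behave the same way, except that their boundaries, seen from the input space, are bent at the folds introduced by earlier layers.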

On one side of the hyperplane, the neuron's pre-activation is negative, the ReLU outputs zero, and the neuron is inactive: both its output and its gradient are zero for that input.
On the other side, the neuron is active and passes the linear signal forward.

When the network consists of thousands or millions of neurons, these boundaries intersect, partitioning the high-dimensional input space into a huge number of distinct, non-overlapping regions, each a convex polytope. Inside any single polytope, the activation state of every neuron in the network is fixed (either ON or OFF). Because the non-linear "decisions" are locked in, the entire network collapses into a single affine map (one effective matrix multiplication plus a bias) for any input that falls within that region.
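To make the partition concrete, the sketch below samples points in a 2-D input space and counts how many distinct ON/OFF patterns a single hidden layer of 5 ReLU neurons produces. The layer size, the sampling box, and the random parameters are assumptions chosen for illustration. Five lines in the plane cut it into at most 1 + 5 + 10 = 16 regions, so the count can never exceed 16.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical single hidden layer: 5 ReLU neurons on a 2-D input.
W, b = rng.standard_normal((5, 2)), rng.standard_normal(5)

# Sample the input space and record each point's ON/OFF activation pattern.
points = rng.uniform(-10.0, 10.0, size=(200_000, 2))
patterns = ((points @ W.T + b) > 0).astype(np.uint8)  # shape (200000, 5)

# Each distinct row corresponds to one region (polytope) hit by the samples.
n_regions = len(np.unique(patterns, axis=0))
print(n_regions)
```

Sampling only finds regions that intersect the box and are large enough to be hit, so the printed count is a lower bound on the true number of regions.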

For an input $x$ inside a specific polytope $P$, the network's output is exactly:

$$y = W_p x + b_p$$

where $W_p$ and $b_p$ are the effective weight matrix and bias for that specific region. The network changes its slope only when the input crosses a boundary into a neighboring polytope.
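The effective $W_p$ and $b_p$ can be computed explicitly by replacing each ReLU with the fixed 0/1 mask it applies at a given input and multiplying the layers out. The following is a sketch under assumed conditions: a hypothetical 3 → 6 → 5 → 2 network with random parameters, and a helper name `local_affine` invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3 -> 6 -> 5 -> 2 ReLU network with random parameters.
W1, b1 = rng.standard_normal((6, 3)), rng.standard_normal(6)
W2, b2 = rng.standard_normal((5, 6)), rng.standard_normal(5)
W3, b3 = rng.standard_normal((2, 5)), rng.standard_normal(2)

def forward(x):
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return W3 @ h2 + b3

def local_affine(x):
    """Return (W_p, b_p, pattern) for the activation region containing x."""
    d1 = (W1 @ x + b1 > 0).astype(float)      # ON/OFF mask, layer 1
    A1, c1 = d1[:, None] * W1, d1 * b1        # h1 = A1 x + c1 inside this region
    d2 = (W2 @ (A1 @ x + c1) + b2 > 0).astype(float)  # ON/OFF mask, layer 2
    A2 = d2[:, None] * (W2 @ A1)              # h2 = A2 x + c2 inside this region
    c2 = d2 * (W2 @ c1 + b2)
    return W3 @ A2, W3 @ c2 + b3, np.concatenate([d1, d2])

x = rng.standard_normal(3)
W_p, b_p, pattern = local_affine(x)
print(np.allclose(forward(x), W_p @ x + b_p))

# A nearby point shares W_p and b_p only if its ON/OFF pattern is unchanged.
x_near = x + 1e-6 * rng.standard_normal(3)
_, _, pattern_near = local_affine(x_near)
if np.array_equal(pattern, pattern_near):
    print(np.allclose(forward(x_near), W_p @ x_near + b_p))
```

The same pair $(W_p, b_p)$ reproduces the network for every input sharing the activation pattern, and a different pair takes over once the pattern changes.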

Implications for AI and Robustness

Universal approximation: Even though such networks are built from straight-line pieces, they can approximate any continuous function on a closed, bounded domain to arbitrary accuracy given enough hinges (neurons), just as a smooth curve can be approximated by enough small straight line segments.
Adversarial vulnerabilities: The CPWL nature of networks is often cited as a major reason why adversarial examples are so effective. Because the network is locally linear, methods like the Fast Gradient Sign Method (FGSM) can exploit the linear slope within a polytope: a tiny perturbation applied to every input coordinate in the direction of the gradient's sign adds up, in high dimensions, to a large change in the output, which can be enough to push the input across a decision boundary.
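The adversarial point can be illustrated with an FGSM-style step on a toy binary classifier. This is a sketch under stated assumptions: a hypothetical one-hidden-layer ReLU network with random parameters, a scalar logit whose sign is the predicted class, and the current prediction used as the label. Inside a single polytope the first-order prediction for the logit change, `eps * ||g||_1`, is exact rather than approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical binary classifier: 10 inputs -> 16 ReLU units -> 1 logit.
W1, b1 = rng.standard_normal((16, 10)), rng.standard_normal(16)
w2, b2 = rng.standard_normal(16), rng.standard_normal()

def pattern(x):
    """ON/OFF state of each hidden neuron at x."""
    return (W1 @ x + b1) > 0

def logit(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def input_gradient(x):
    """Gradient of the logit w.r.t. x, valid inside the current polytope."""
    return (w2 * pattern(x)) @ W1

x = rng.standard_normal(10)
g = input_gradient(x)
eps = 1e-3

# FGSM-style step: move every coordinate by eps against the current prediction.
direction = -np.sign(logit(x))
x_adv = x + eps * direction * np.sign(g)

predicted = direction * eps * np.abs(g).sum()  # first-order change in the logit
actual = logit(x_adv) - logit(x)
same_region = np.array_equal(pattern(x), pattern(x_adv))
print(same_region, predicted, actual)
```

The per-coordinate change is bounded by `eps`, yet the logit moves by `eps` times the L1 norm of the gradient, which grows with the input dimension. When `same_region` is true, `predicted` and `actual` agree up to floating-point error.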