Problem definition:
Given two Hilbert spaces \((X, \bra{\cdot} \ket{\cdot}_{X})\) and \((Y, \bra{\cdot} \ket{\cdot}_{Y})\), and a function \(F: X \to Y\) that is differentiable at \(x_0 \in X\), how should we define the gradient of \(F\)?
In a Hilbert space, the derivative and the gradient have different definitions. A quick way to see the difference is through their types (or type signatures, as in Haskell):
Suppose we have a function \(F: X \to \mathbb{R}\), where \(X\) is a Hilbert space, i.e. a space equipped with an inner product: \((X, \bra{\cdot} \ket{\cdot}_{X})\).
The derivative of \(F\), call it \(F^{\prime}\), has type \(X \to X^*\), where \(X^*\) is the dual of the Hilbert space \(X\). More specifically, we can write \(X \to \mathcal{L}(X;\mathbb{R})\).
To interpret this, suppose we have implemented a derivative function \(F^{\prime}\) on a computer (in Haskell, OCaml, Julia, Python, or any language you like). Given a point \(x_0 \in X\), it outputs \(F^{\prime}(x_0) \in \mathcal{L}(X;\mathbb{R})\), where \(\mathcal{L}(X;\mathbb{R})\) denotes the space of continuous linear maps from \(X\) to \(\mathbb{R}\). In other words, the output \(F^{\prime}(x_0)\) is itself a function: its input is a small change in \(X\), and its output is the corresponding change in \(\mathbb{R}\). That is, the derivative function \(F^{\prime}\) is a higher-order function. This is exactly what we observe in any valid implementation of derivatives.
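As a concrete illustration, here is a minimal sketch in Haskell, assuming \(X\) is modelled as \(\mathbb{R}^n\) (a list of `Double`s) and the derivative is approximated by a forward finite difference; the names `deriv` and `X` are illustrative, chosen only to mirror the type \(X \to \mathcal{L}(X;\mathbb{R})\):

```haskell
-- A minimal sketch: X is modelled as R^n (a list of Doubles), and the
-- derivative is approximated by a forward finite difference.
type X = [Double]

-- deriv has type (X -> R) -> (X -> (X -> R)):
-- given F and a point x0, it returns the linear map h |-> F'(x0)(h).
deriv :: (X -> Double) -> X -> (X -> Double)
deriv f x0 = \h -> (f (zipWith (+) x0 (map (* eps) h)) - f x0) / eps
  where
    eps = 1e-6

-- Example: F(x) = sum of squares. Note that deriv f [1,2] is itself a function.
main :: IO ()
main = print (deriv (sum . map (^ 2)) [1, 2] [1, 0])  -- ~ 2 * x_1 = 2
```

The only point that matters here is the type: `deriv f x0` is itself a function, mapping a direction in \(X\) to a number in \(\mathbb{R}\).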
On the other hand, the gradient of \(F\), call it \(F^*\), has type \(X \to X\). To define it, let us first define a dual function:
\(\Phi: X \to X^*, x \mapsto \bra{x}\). This function simply maps a point \(x \in X\) to its dual space \(X^*\), where \(\bra{x}\) is a function that takes any point \(y \in X\) and returns a value in \(\mathbb{R}\). More specifically, we can define \(\bra{x}\) as:
\(\bra{x}: X \to \mathbb{R}, y \mapsto \bra{x}\ket{y}\). We can see that \(\bra{x}\) is a closure that wraps \(x\) inside. Note that \(\bra{x}\) treats its defining point \(x\) specially: \(\bra{x}\ket{x} = \lVert x \rVert_X^2\), so distinct points give distinct functionals, and (by the Riesz representation theorem, see below) every continuous linear functional on \(X\) arises this way. This means \(\Phi\) has an inverse function:
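In code, \(\Phi\) is nothing more than partial application of the inner product. A minimal sketch in Haskell, again assuming \(X = \mathbb{R}^n\) with the standard dot product (the names `inner` and `bra` are illustrative):

```haskell
-- A minimal sketch: X = R^n with the standard dot product.
type X = [Double]

inner :: X -> X -> Double
inner x y = sum (zipWith (*) x y)

-- Phi : X -> X*, represented here as X -> (X -> Double).
-- 'bra x' is a closure that captures x and waits for y.
bra :: X -> (X -> Double)
bra x = \y -> inner x y   -- equivalently: bra = inner (partial application)

main :: IO ()
main = print (bra [1, 2, 3] [0, 1, 0])  -- <x|y> = 2.0
```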
\(\Phi^{-1}: X^* \to X, \bra{x} \mapsto x\). On a computer, the inverse function \(\Phi^{-1}\) can be interpreted as follows: given a functional \(f = \bra{x}\), find the point \(x\) such that \(\bra{x}\ket{y} = f(y)\) for every \(y \in X\). This is essentially a root-finding task (it can also be interpreted as an optimization task using the variational rule, as explained below).
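In finite dimensions there is an even more direct realisation, sketched below: with respect to an orthonormal basis \(e_1, \dots, e_n\), the \(i\)-th coordinate of \(x\) is \(\bra{x}\ket{e_i}\), i.e. the value of the given functional at the basis vectors. This replaces the root-finding/optimization view with a closed-form recovery; the names `phiInv` and `basis` are illustrative:

```haskell
-- A minimal sketch of Phi^{-1}, assuming X = R^n with the standard
-- orthonormal basis: the i-th coordinate of x is <x|e_i>, i.e. the
-- value of the given functional at the i-th basis vector.
type X = [Double]

basis :: Int -> [X]
basis n = [ [ if i == j then 1 else 0 | j <- [1 .. n] ] | i <- [1 .. n] ]

phiInv :: Int -> (X -> Double) -> X
phiInv n f = map f (basis n)

main :: IO ()
main = print (phiInv 3 (\y -> sum (zipWith (*) [1, 2, 3] y)))  -- recovers [1.0,2.0,3.0]
```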
Having defined the dual function \(\Phi\), we can now define the gradient function \(F^*\):
\(F^*: X \to X, \quad F^* = \Phi^{-1} \circ F^{\prime}\). To interpret: given a point \(x_0 \in X\), the gradient function first feeds it into the derivative function \(F^{\prime}\) and obtains an element of the dual space, \(F^{\prime}(x_0) = \bra{z} \in \mathcal{L}(X;\mathbb{R})\); it then uses the inverse of the dual function \(\Phi\) to map \(\bra{z}\) back to \(z \in X\).
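Putting the pieces together, here is a minimal sketch of \(F^* = \Phi^{-1} \circ F^{\prime}\) in Haskell, under the same assumptions as above (\(X = \mathbb{R}^n\), a finite-difference derivative, and the basis-evaluation \(\Phi^{-1}\)); for \(F(x) = \sum_i x_i^2\) at \(x_0 = (1, 2)\) it returns approximately \((2, 4)\):

```haskell
-- A minimal sketch of the gradient as (Phi^{-1} . F'), assuming X = R^n,
-- a finite-difference derivative, and basis-evaluation for Phi^{-1}.
type X = [Double]

deriv :: (X -> Double) -> X -> (X -> Double)
deriv f x0 h = (f (zipWith (+) x0 (map (* eps) h)) - f x0) / eps
  where
    eps = 1e-6

phiInv :: Int -> (X -> Double) -> X
phiInv n f = [ f [ if i == j then 1 else 0 | j <- [1 .. n] ] | i <- [1 .. n] ]

-- gradient = Phi^{-1} composed with the derivative F'.
gradient :: Int -> (X -> Double) -> X -> X
gradient n f = phiInv n . deriv f

main :: IO ()
main = print (gradient 2 (sum . map (^ 2)) [1, 2])  -- ~ [2.0, 4.0]
```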
How should we interpret the functionality of \(\Phi^{-1}\)? Again, we can use the variational rule. Suppose \(y \in X\) is any vector with unit norm, \(\bra{y}\ket{y} = 1\), and suppose \(\bra{z}\) also has unit norm in the dual space (equivalently, \(\lVert z \rVert_X = 1\)). Then the maximum value the function \(\bra{z}\) can attain on such \(y\) is \(1\), and by the Cauchy-Schwarz inequality the maximizer is unique: it is exactly \(y = z\). In other words, the gradient points along the unit-norm direction in \(X\) in which the value of \(F\) increases the fastest.
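A small numerical check of this variational picture, again assuming \(X = \mathbb{R}^2\): scanning the unit vectors \(y = (\cos t, \sin t)\), the value \(\bra{z}\ket{y}\) is maximized when \(y\) points along \(z\), and the maximum equals \(\lVert z \rVert\):

```haskell
-- A small numerical check of the variational rule in X = R^2:
-- among unit vectors y, <z|y> is largest when y points along z
-- (Cauchy-Schwarz), and the maximum value is the norm of z.
type X = [Double]

inner :: X -> X -> Double
inner x y = sum (zipWith (*) x y)

unitVectors :: [X]
unitVectors = [ [cos t, sin t] | t <- [0, 0.001 .. 2 * pi] ]

main :: IO ()
main = do
  let z = [3, 4]  -- norm 5, direction (0.6, 0.8)
      best = maximum [ (inner z y, y) | y <- unitVectors ]
  print best      -- ~ (5.0, [0.6, 0.8])
```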
Why does \(F^{\prime}(x_0)\) always correspond to a point \(\bra{z}\) in the dual space? That is exactly the content of the Riesz representation theorem [1]!
[1] http://www.individual.utoronto.ca/jordanbell/notes/gradienthilbert.pdf