RBF-NNs are typically used for either function approximation or data classification. My interest is in function approximation, so we'll start there. This tutorial approaches the RBF-NN from the perspective of an RBF researcher/practitioner and applied mathematician. I will mostly avoid NN terminology (except where it is helpful).

Alright, let's begin. An RBF-NN for function approximation is essentially a *least-squares approximant*. The goal is not to *interpolate* the data, but to approximate it in a way that filters out noise (if any). Let's set up the problem. First, let's say that we have a collection of *data sites* \(X = \{{\bf x}_i \}_{i=1}^N\) at which data has been sampled. Let's also define a second collection of points called the *centers*, \(X_c = \{{\bf x}^c_k \}_{k=1}^n\); these could be a subset of the data sites, or could be distinct from them. Further, let's say we have samples of some scalar-valued function \(f({\bf x})\) at the data sites. Our goal is to approximate \(\left.f\right|_X\) using an RBF-NN. The RBF-NN approximant can be written as:
$$
s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right),
$$
where \(\phi\left(\epsilon,r\right)\) is an RBF, the \(c_k\) values are the *unknown* approximation coefficients (NN *weights*), and the \(\epsilon_k\) values are called *shape parameters*. In the theory for RBF interpolation, using a single \(\epsilon\) with distinct data sites (and centers being the same as data sites) guarantees the existence and uniqueness of the RBF interpolant for a positive-definite RBF. In the RBF-NN context, we like the flexibility of having one or more shape parameters. A common choice for \(\phi\) is the *Gaussian RBF*
$$
\phi(\epsilon,r) = e^{-(\epsilon r)^2}.
$$
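As a quick sketch of how the approximant is evaluated, here is a minimal NumPy implementation of \(s({\bf x})\) with the Gaussian RBF. The function and variable names (`rbf_nn_eval`, `centers`, etc.) are my own, not from any library:

```python
import numpy as np

def gaussian_rbf(eps, r):
    """Gaussian RBF: phi(eps, r) = exp(-(eps * r)^2)."""
    return np.exp(-(eps * r) ** 2)

def rbf_nn_eval(x, centers, eps, c):
    """Evaluate s(x) = sum_k c_k * phi(eps_k, ||x - x_k^c||).

    x:       (m, d) evaluation points
    centers: (n, d) RBF centers x_k^c
    eps:     (n,)   per-center shape parameters eps_k
    c:       (n,)   approximation coefficients (NN weights)
    """
    # Pairwise distances ||x_i - x_k^c||, shape (m, n)
    r = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
    return gaussian_rbf(eps[None, :], r) @ c
```

Note that each column of the distance matrix gets its own shape parameter, reflecting the per-center \(\epsilon_k\) in the sum above.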

While the above approach is somewhat standard in the RBF-NN literature, it is a bit outdated when compared to the RBF interpolation literature. It is now common to augment RBFs with polynomials to improve the approximation. Let \(\ell\) be the degree of a total-degree polynomial in \(d\) dimensions. Then, our new RBF-NN approximant takes the form
$$
s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}),
$$
where \(p_j({\bf x})\) are the monomial basis functions in \(d\) dimensions; for instance, if \(d=2\), \(p_1({\bf x}) = 1\), \(p_2({\bf x}) = x\), \(p_3({\bf x}) = y\), \(p_4({\bf x}) = x^2\), \(p_5({\bf x}) = xy\), \(p_6({\bf x}) = y^2\), and so on. The \(\lambda_j\) values are *not* interpolation coefficients in the traditional sense. Instead, they help us enforce additional constraints on the RBF-NN coefficients \(c_k\). Now, we must enforce conditions on the RBF-NN approximant to obtain the \(c_k\) and \(\lambda_j\) values. The conditions we will impose are:
$$
\left.s({\bf x})\right|_X = \left.f({\bf x})\right|_X, \\\\
\sum\limits_{k=1}^n c_k p_j({\bf x}^c_k) = 0, \quad j=1,\ldots,{\ell+d \choose d}.
$$
To find the coefficients \(c_k\) and \(\lambda_j\), we use the above conditions to solve the following block linear system:
$$
\underbrace{\begin{bmatrix}
\Phi & P \\
\hat{P}^T & O
\end{bmatrix}}_{\tilde{A}}
\underbrace{\begin{bmatrix}
{\bf c} \\
{\bf \lambda}
\end{bmatrix}}_{\tilde{c}}
=
\underbrace{\begin{bmatrix}
{\bf f} \\
0
\end{bmatrix}}_{\tilde{f}},
$$
where
$$
\Phi_{ij} = \phi\left(\epsilon_j, \|{\bf x}_i - {\bf x}^c_j\|\right),\quad i=1,\ldots,N,\ j=1,\ldots,n, \\\\
P_{ij} = p_j\left({\bf x}_i \right),i=1,\ldots,N, j=1,\ldots,{\ell+d \choose d},\\\\
\hat{P}_{ij} = p_j\left({\bf x}^c_i \right), i=1,\ldots,n, j=1,\ldots,{\ell+d \choose d},\\\\
O_{ij} = 0, i,j = 1,\ldots, {\ell+d \choose d}.
$$
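Assembling this block system is mostly bookkeeping. Below is a hedged sketch in NumPy; the helper names (`monomial_basis`, `assemble_system`) are mine, and the monomial basis is hardcoded for \(d=2\) in graded order \(1, x, y, x^2, xy, y^2, \ldots\):

```python
import numpy as np

def monomial_basis(X, ell):
    """Total-degree-ell monomial basis values P_ij = p_j(x_i), for d = 2.

    X: (N, 2) points. Returns (N, M) with M = (ell + 2 choose 2).
    """
    cols = []
    for deg in range(ell + 1):
        for i in range(deg + 1):
            # Monomial x^(deg - i) * y^i
            cols.append(X[:, 0] ** (deg - i) * X[:, 1] ** i)
    return np.column_stack(cols)

def assemble_system(X, Xc, eps, f, ell):
    """Build the block matrix A_tilde and right-hand side f_tilde."""
    # Phi_ij = phi(eps_j, ||x_i - x_j^c||) with the Gaussian RBF
    r = np.linalg.norm(X[:, None, :] - Xc[None, :, :], axis=-1)
    Phi = np.exp(-(eps[None, :] * r) ** 2)        # (N, n)
    P = monomial_basis(X, ell)                     # (N, M), at data sites
    Phat = monomial_basis(Xc, ell)                 # (n, M), at centers
    M = P.shape[1]
    A_tilde = np.block([[Phi, P], [Phat.T, np.zeros((M, M))]])
    f_tilde = np.concatenate([f, np.zeros(M)])
    return A_tilde, f_tilde
```

With \(N\) data sites, \(n\) centers, and \(M = {\ell+d \choose d}\) polynomial terms, the assembled \(\tilde{A}\) has shape \((N+M) \times (n+M)\).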
From the above system, it is clear that \({\bf \lambda}\) is the vector of *Lagrange multipliers* that enforces the constraint \(\hat{P}^T {\bf c}=0\). This latter constraint is called *polynomial reproduction*, and it ensures that the RBF-NN can reproduce polynomials up to degree \(\ell\) at the centers. It is important to note that \(n\) and \(\ell\) are hyperparameters for the RBF-NN. They can in principle be trained, but our focus here is on training the parameters \(\epsilon_k\), \(c_k\), and \({\bf x}^c_k\). As described above, \(n \ll N\), so the above linear system is overdetermined (more equations than unknowns). We are interested in the *least-squares* solution to this linear system. In other words, we wish to find \(\tilde{c}\) so that
$$
\|\tilde{A}\tilde{c} - \tilde{f}\|_2^2
$$
is minimized. Of course, this is a well-understood problem from numerical linear algebra, and there are several possible techniques to solve this linear system. One way to do this is to use the reduced (or economy) \(QR\) factorization of \(\tilde{A}\), since \(\tilde{A}\) is a tall and skinny rectangular matrix:
$$
\begin{align}
\tilde{A} &= QR \\\\
&= Q\begin{bmatrix} R_1 \\\\ 0\end{bmatrix} \\\\
&= \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R_1 \\\\ 0\end{bmatrix} \\\\
&= Q_1 R_1,
\end{align}
$$
where \(Q_1\) is a matrix with orthonormal columns (\(Q_1^T Q_1 = I\)), and \(R_1\) is an upper-triangular matrix (invertible as long as \(\tilde{A}\) has full column rank). This can then be used to solve the linear system in three steps:
$$
\begin{align}
Q_1 R_1 \tilde{c} &= \tilde{f}, \\\\
\implies R_1 \tilde{c} &= Q_1^T \tilde{f}, \\\\
\implies \tilde{c} &= R_1^{-1} Q_1^T \tilde{f}.
\end{align}
$$
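These three steps map almost directly onto NumPy. A minimal sketch (the function name `solve_ls_qr` is mine; in practice one would exploit the triangular structure of \(R_1\) with a dedicated back-substitution routine rather than a general solver):

```python
import numpy as np

def solve_ls_qr(A, f):
    """Least-squares solve of A c = f via the reduced QR factorization.

    A = Q1 R1, then R1 c = Q1^T f, solved for c.
    """
    Q1, R1 = np.linalg.qr(A, mode='reduced')  # Q1: (m, k), R1: (k, k)
    # R1 is upper triangular and invertible when A has full column rank
    return np.linalg.solve(R1, Q1.T @ f)
```

This produces the same minimizer as a generic least-squares routine such as `np.linalg.lstsq`, assuming the matrix has full column rank.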

In the next blog post, I will describe how to train the parameters \(c_k\), \(\epsilon_k\), and \({\bf x}^c_k\).