RBF-NN for Function Approximation: Part 1 (Setup)

RBF-NNs are typically used for either function approximation or data classification. My interest is in function approximation, so we'll start there. This tutorial approaches the RBF-NN from the perspective of an RBF researcher/practitioner and applied mathematician, so I will mostly avoid NN terminology (except where it is helpful).

Alright, let's begin. An RBF-NN for function approximation is essentially a least-squares approximant. The goal is not to interpolate data, but to approximate it in a way that filters out noise (if any). Let's set up the problem. First, let's say that we have a collection of data sites $$X = \{{\bf x}_i \}_{i=1}^N$$ at which we have sampled data. Let's also define a second collection of points, called the centers, to be the set $$X_c = \{{\bf x}^c_k \}_{k=1}^n$$; these could be a subset of the data sites, or could be distinct from them. Further, let's say we have samples of some scalar-valued function $$f({\bf x})$$ at the data sites. Our goal is to approximate $$\left.f\right|_X$$ using an RBF-NN.

The RBF-NN approximant can be written as $$s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right),$$ where $$\phi\left(\epsilon,r\right)$$ is an RBF taking the distance $$r = \|{\bf x} - {\bf x}^c_k\|$$ as its argument, the $$c_k$$ values are the unknown approximation coefficients (NN weights), and the $$\epsilon_k$$ values are called shape parameters. In the theory of RBF interpolation, using a single shape parameter $$\epsilon$$ with distinct data sites (and centers identical to the data sites) guarantees the existence and uniqueness of the RBF interpolant. In the RBF-NN context, we like the flexibility of having one or more shape parameters. A common choice for $$\phi$$ is the Gaussian RBF $$\phi(\epsilon,r) = e^{-(\epsilon r)^2}.$$
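To make this concrete, here is a minimal NumPy sketch of evaluating $$s({\bf x})$$ with the Gaussian RBF and one shape parameter per center. The variable names (`X`, `Xc`, `eps`, `c`) mirror the notation above; the random data, shape parameter value, and stand-in coefficients are purely illustrative.

```python
import numpy as np

def gaussian_rbf(eps, r):
    """Gaussian RBF: phi(eps, r) = exp(-(eps * r)^2)."""
    return np.exp(-(eps * r) ** 2)

def rbf_nn_eval(x_eval, centers, eps, c):
    """Evaluate s(x) = sum_k c_k * phi(eps_k, ||x - x_k^c||) at each row of x_eval."""
    # r[i, k] = ||x_eval_i - centers_k||
    r = np.linalg.norm(x_eval[:, None, :] - centers[None, :, :], axis=-1)
    return gaussian_rbf(eps[None, :], r) @ c

rng = np.random.default_rng(42)
N, n, d = 200, 20, 2
X = rng.uniform(-1, 1, (N, d))            # data sites
Xc = X[rng.choice(N, n, replace=False)]   # centers chosen as a subset of the data sites
eps = np.full(n, 3.0)                     # one shape parameter per center (all equal here)
c = rng.standard_normal(n)                # stand-in coefficients; computed properly below
s_at_X = rbf_nn_eval(X, Xc, eps, c)       # s evaluated at all the data sites
```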

While the above approach is somewhat standard in the RBF-NN literature, it is a bit outdated when compared to the RBF interpolation literature. It is now common to augment RBFs with polynomials to improve the approximation. Let $$\ell$$ be the degree of a total-degree polynomial in $$d$$ dimensions. Then, our new RBF-NN approximant takes the form $$s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}),$$ where the $$p_j({\bf x})$$ are the monomial basis functions in $$d$$ dimensions; for instance, if $$d=2$$, $$p_1({\bf x}) = 1$$, $$p_2({\bf x}) = x$$, $$p_3({\bf x}) = y$$, $$p_4({\bf x}) = x^2$$, $$p_5({\bf x}) = xy$$, and so on. The $$\lambda_j$$ values are not interpolation coefficients in the traditional sense. Instead, they help us enforce additional constraints on the RBF-NN coefficients $$c_k$$.

Now, we must enforce conditions on the RBF-NN approximant to obtain the $$c_k$$ and $$\lambda_j$$ values. The conditions we will impose are: $$\left.s({\bf x})\right|_X = \left.f({\bf x})\right|_X, \\\\ \sum\limits_{k=1}^n c_k p_j({\bf x}^c_k) = 0,\quad j=1,\ldots,{\ell+d \choose d}.$$ To find the coefficients $$c_k$$ and $$\lambda_j$$, we use the above conditions to form the following block linear system: $$\underbrace{\begin{bmatrix} \Phi & P \\ \hat{P}^T & O \end{bmatrix}}_{\tilde{A}} \underbrace{\begin{bmatrix} {\bf c} \\ {\bf \lambda} \end{bmatrix}}_{\tilde{c}} = \underbrace{\begin{bmatrix} {\bf f} \\ 0 \end{bmatrix}}_{\tilde{f}},$$ where $$\Phi_{ij} = \phi\left(\epsilon_j, \|{\bf x}_i - {\bf x}^c_j\|\right),\ i=1,\ldots,N,\ j=1,\ldots,n, \\\\ P_{ij} = p_j\left({\bf x}_i \right),\ i=1,\ldots,N,\ j=1,\ldots,{\ell+d \choose d},\\\\ \hat{P}_{ij} = p_j\left({\bf x}^c_i \right),\ i=1,\ldots,n,\ j=1,\ldots,{\ell+d \choose d},\\\\ O_{ij} = 0,\ i,j = 1,\ldots, {\ell+d \choose d}.$$ From the above system, it is clear that $${\bf \lambda}$$ is the vector of Lagrange multipliers that enforces the constraint $$\hat{P}^T {\bf c}=0$$. This latter constraint is called polynomial reproduction, and it ensures that the RBF-NN can reproduce polynomials up to degree $$\ell$$ at the centers.

It is important to note that $$n$$ and $$\ell$$ are hyperparameters for the RBF-NN. They can in principle be trained, but our focus here is on training the parameters $$\epsilon_k$$, $$c_k$$, and $${\bf x}^c_k$$. As described above, $$n \ll N$$, so the above linear system is overdetermined (more equations than unknowns). We are interested in its least-squares solution; in other words, we wish to find $$\tilde{c}$$ so that $$\|\tilde{A}\tilde{c} - \tilde{f}\|_2^2$$ is minimized. Of course, this is a well-understood problem in numerical linear algebra, and there are several possible techniques for solving it. One way is to use the reduced (or economy) $$QR$$ factorization of $$\tilde{A}$$, since $$\tilde{A}$$ is a tall and skinny rectangular matrix: \begin{align} \tilde{A} &= QR \\\\ &= Q\begin{bmatrix} R_1 \\\\ 0\end{bmatrix} \\\\ &= \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R_1 \\\\ 0\end{bmatrix} \\\\ &= Q_1 R_1, \end{align} where $$Q_1$$ has orthonormal columns and $$R_1$$ is upper triangular. This factorization can then be used to solve the least-squares problem: \begin{align} Q_1 R_1 \tilde{c} &= \tilde{f}, \\\\ \implies R_1 \tilde{c} &= Q_1^T \tilde{f}, \\\\ \implies \tilde{c} &= R_1^{-1} Q_1^T \tilde{f}, \end{align} where the last step is carried out by back substitution rather than by explicitly forming $$R_1^{-1}$$.
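To tie this together, here is a minimal NumPy/SciPy sketch that assembles the block system and solves the least-squares problem via the reduced QR factorization. It assumes $$d = 2$$ and $$\ell = 1$$ (so the monomial basis is $$\{1, x, y\}$$), uses a synthetic target function for illustration, and the helper `poly_basis` is my own, not a library routine.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(42)
N, n, d, ell = 200, 20, 2, 1              # hyperparameters: n centers, degree-ell polynomials
X = rng.uniform(-1, 1, (N, d))            # data sites
Xc = X[rng.choice(N, n, replace=False)]   # centers
eps = np.full(n, 3.0)                     # shape parameters
f = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])  # illustrative samples of f at X

def poly_basis(pts):
    """Monomial basis up to total degree ell = 1 in d = 2: [1, x, y]."""
    return np.column_stack([np.ones(len(pts)), pts[:, 0], pts[:, 1]])

# Assemble the blocks of A-tilde
r = np.linalg.norm(X[:, None, :] - Xc[None, :, :], axis=-1)  # r[i, j] = ||x_i - x_j^c||
Phi = np.exp(-(eps[None, :] * r) ** 2)    # Phi_{ij} = phi(eps_j, ||x_i - x_j^c||)
P = poly_basis(X)                         # P_{ij} = p_j(x_i)
P_hat = poly_basis(Xc)                    # P_hat_{ij} = p_j(x_i^c)
M = P.shape[1]                            # number of monomials, (ell + d choose d) = 3

A_tilde = np.block([[Phi, P],
                    [P_hat.T, np.zeros((M, M))]])
f_tilde = np.concatenate([f, np.zeros(M)])

# Least-squares solution via the reduced QR factorization:
# A_tilde = Q1 R1, so R1 c_tilde = Q1^T f_tilde, solved by back substitution.
Q1, R1 = np.linalg.qr(A_tilde, mode='reduced')
c_tilde = solve_triangular(R1, Q1.T @ f_tilde)
c, lam = c_tilde[:n], c_tilde[n:]         # RBF coefficients and Lagrange multipliers
```

Note that `solve_triangular` performs the back substitution directly; $$R_1^{-1}$$ is never formed explicitly.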

In the next blog post, I will describe how to train the parameters $$c_k$$, $$\epsilon_k$$, and $${\bf x}^c_k$$.