RBF-NN for Function Approximation: Part 3 (Linear solves vs Gradient Descent)

In part 1 of this series on RBF-NN for function approximation, we discussed how to set up a basic RBF neural network, and find the corresponding neural network weights. See here: RBF-NN Part 1. In Part 2, we discussed how to train the different RBF-NN parameters using gradient descent. As a part of that discussion, we also presented an update formula for the weights: RBF-NN Part 2. It is reasonable to wonder: which approach is better for computing the RBF-NN weights? Do they give the same answer? When should you prefer the linear solve to gradient descent? This post tackles these questions.

Equivalence of least-squares and mean-squared error minimization

Recall from Part 1 that if \(\tilde{A}\) is the rectangular matrix corresponding to the RBF-NN (containing RBFs and polynomials), \(\tilde{c}\) is the vector of RBF-NN weights/coefficients, and \(\tilde{f}\) is the vector of function samples (padded with zeroes), the weights can be calculated by the following linear solve: $$ \tilde{c} = R_1^{-1} Q_1^{T} \tilde{f}, $$ where \(Q_1\) and \(R_1\) are obtained from the economy-sized QR decomposition of \(\tilde{A}\). This, we said, was the least-squares solution to the linear system \(\tilde{A}\tilde{c} = \tilde{f}\). On the other hand, we had the gradient descent rule for the weights that minimize the mean-squared error. We derived this to be: $$ c_k^{new} = c_k^{old} + \frac{\eta_{c_k}}{N}\sum\limits_{i=1}^N 2e_i\phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), k=1,\ldots,n. $$ Are these two approaches the same if you take enough iterations of gradient descent? To answer that, consider the least-squares problem of finding \(\tilde{c}\) that minimizes \(\|\tilde{A}\tilde{c} - \tilde{f}\|^2_2\). Let's first show why the \(QR\) approach actually works for solving this system. First, the (full) \(QR\) decomposition of \(\tilde{A}\) is guaranteed to produce a square matrix \(Q\) that preserves norms: $$ \|Q x\|_2 = \|x\|_2, $$ for any vector \(x\). We also have the important property that \(Q^T = Q^{-1}\). We can therefore transform the least-squares problem by multiplying through by \(Q^T\): $$ \|\tilde{A}\tilde{c} - \tilde{f}\|^2_2 = \|Q^T\tilde{A}\tilde{c} - Q^T\tilde{f}\|^2_2, $$ since multiplying by \(Q^T\) does not change the 2-norm. Great! But now, recall that \(\tilde{A} = QR\). This implies that \(R = Q^T \tilde{A}\). Thus, we can rewrite our least-squares problem as: $$ \|\tilde{A}\tilde{c} - \tilde{f}\|^2_2 = \|R\tilde{c} - Q^T\tilde{f}\|^2_2. $$ In our case, further simplifications can be made using the fact that \(\tilde{A}\) is a tall and skinny rectangular matrix. 
Regardless, it is clear that we can indeed use the QR-decomposition to find a solution to the least-squares problem.
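As a sanity check, the QR route can be compared against a library least-squares solver. The snippet below is my own illustration (with a random tall matrix standing in for \(\tilde{A}\)), not code from this series:

```python
import numpy as np

# Check that the economy-QR formula c = R1^{-1} Q1^T f matches NumPy's
# least-squares solver on a tall, skinny system.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # stand-in for A-tilde
f = rng.standard_normal(20)        # stand-in for f-tilde

Q1, R1 = np.linalg.qr(A, mode='reduced')   # economy-sized QR: A = Q1 R1
c_qr = np.linalg.solve(R1, Q1.T @ f)       # back-substitution, no explicit inverse
c_ls, *_ = np.linalg.lstsq(A, f, rcond=None)

print(np.allclose(c_qr, c_ls))   # the two least-squares solutions agree
```

Note that `np.linalg.solve(R1, ...)` is preferred over forming \(R_1^{-1}\) explicitly; the formula \(\tilde{c} = R_1^{-1}Q_1^T\tilde{f}\) is shorthand for a triangular solve.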

Okay, now the question is if the solution from the QR problem is the same as the solution after a large number of steps of gradient descent. For this, we need to show that solving the least-squares problem is the same as minimizing the mean-squared error. Recall that the mean-squared error is given by: $$ E = \frac{1}{N}\sum\limits_{i=1}^N \left(f\left({\bf x}_i\right) - s\left({\bf x}_i\right) \right)^2. $$ Recall from Part 1 that the vector \({\bf f}\) is defined as: $$ {\bf f} = \begin{bmatrix} f({\bf x}_1)&\ldots& f({\bf x}_N)\end{bmatrix}^T. $$ Similarly, define the vector \({\bf s}\) to be: $$ {\bf s} = \begin{bmatrix} s({\bf x}_1)&\ldots& s({\bf x}_N)\end{bmatrix}^T. $$ This means that the mean-squared error \(E\) can be rewritten as: $$ E = \frac{1}{N} \|{\bf f} - {\bf s}\|_2^2. $$ Let's manipulate the least-squares expression a bit by using the definitions of \(\tilde{A}\), \(\tilde{c}\), and \(\tilde{f}\) (and using the constraint \(\hat{P}^T {\bf c} = {\bf 0}\) to zero out the bottom block of \(\tilde{A}\tilde{c}\)). $$ \begin{align} \|\tilde{A}\tilde{c} - \tilde{f}\|_2^2 &= \left\Vert \begin{bmatrix} \Phi {\bf c} + P {\bf \lambda}\\ {\bf 0} \end{bmatrix} - \begin{bmatrix} {\bf f}\\ {\bf 0} \end{bmatrix} \right\Vert^2_2 \\\\ &= \left\Vert \left(\Phi {\bf c} + P {\bf \lambda}\right) - {\bf f} \right\Vert^2_2, \\\\ &= \| {\bf s} - {\bf f} \|^2_2, \\\\ &= N E. \end{align} $$ This means that \(\|\tilde{A}\tilde{c}-\tilde{f}\|_2^2\) is simply a scaled version of \(E\), the mean-squared error. The \(\frac{1}{N}\) factor does not change where the minimum occurs, so minimizing \(E\) yields a solution vector \(\tilde{c}\) that is identical to the least-squares solution vector. Of course, since gradient descent is stopped after finitely many iterations, it only approximately solves the least-squares problem. Thus, we take an accuracy hit by applying gradient descent to the weights. What do we gain, if anything?
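To make this concrete, here is a small numerical check (my own sketch, with the polynomial block dropped for simplicity) that gradient descent on the mean-squared error drives the weights to the least-squares solution of \(\Phi {\bf c} = {\bf f}\):

```python
import numpy as np

# Gradient descent on the MSE converges to the least-squares weights.
N, n = 50, 3
x = np.linspace(0, 1, N)                  # data sites
xc = np.array([0.2, 0.5, 0.8])            # centers
eps = 10.0                                # a single shape parameter
Phi = np.exp(-(eps * (x[:, None] - xc[None, :])) ** 2)   # Gaussian RBF matrix
f = np.sin(2 * np.pi * x)                 # target samples

c_ls, *_ = np.linalg.lstsq(Phi, f, rcond=None)   # direct least-squares solution

c = np.zeros(n)
eta = 1.0
for _ in range(5000):
    e = f - Phi @ c                      # e_i = f(x_i) - s(x_i)
    c += eta * (2.0 / N) * (Phi.T @ e)   # the weight update rule from Part 2

print(np.allclose(c, c_ls, atol=1e-8))
```

The centers here are well separated relative to the RBF widths, so the problem is well conditioned and the iteration converges quickly; for harder problems the learning rate would need tuning.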

Comparing costs: gradient descent vs least-squares

If we use the QR decomposition to solve the least-squares problem once, the cost incurred is \(O(Nn^2)\) (ignoring the polynomial terms). The resulting coefficients are accurate to machine precision (or close to it). If we never expect the weights to change, this approach is the best one to take.

On the other hand, imagine that we perform gradient descent on the centers \( {\bf x}^c_1,\ldots, {\bf x}^c_n\). This changes \(\tilde{A}\) at each iteration of gradient descent, causing \(\tilde{c}\) to change at each iteration as well. In this scenario, if gradient descent takes \(T\) iterations to converge, the total dominant cost of solving the full least-squares problem is \(O(TNn^2)\). If \(T = O(N)\), then this cost is \(O(N^2n^2)\), which is prohibitively expensive. In contrast, if we only update the weights \(c_1,\ldots,c_n\) at each of the \(T\) iterations, the dominant cost is \(O(TNn)\). If \(T = O(N)\), then this cost is \(O(N^2n)\), which is significantly cheaper than performing a linear solve at every step. Another reason to avoid solving the least-squares problem exactly is the issue of generalization past the training data. We will discuss this when we cover alternative algorithms to gradient descent, starting with the next post.
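The cheap inner update can be sketched as follows (function names are my own, not from this series): with the centers, and hence \(\Phi\), held fixed during an iteration, updating only the weights costs two matrix-vector products, i.e. \(O(Nn)\), versus \(O(Nn^2)\) for a fresh QR solve.

```python
import numpy as np

def weights_step(Phi, f, c, eta):
    """One gradient step on the weights alone: two O(Nn) matrix-vector products."""
    N = len(f)
    e = f - Phi @ c                           # residuals e_i, O(Nn)
    return c + eta * (2.0 / N) * (Phi.T @ e)  # gradient step, O(Nn)

def mse(Phi, f, c):
    """Mean-squared error E for the current weights."""
    return np.mean((f - Phi @ c) ** 2)
```

For a small enough learning rate, each call to `weights_step` decreases the mean-squared error.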

RBF-NN for Function Approximation: Part 2 (Training)

In the previous post (RBF-NN: Part 1), I discussed the use of RBF Neural Networks (RBF-NN) for function approximation. In that post, we set up an RBF-NN of the form: $$ s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}), $$ where \(\phi(\epsilon,r)\) is a Radial Basis Function (RBF), \(p_j({\bf x})\) are the monomials in \(d\) dimensions (up to degree \(\ell\)), \(c_k\) are the NN weights, and \(\epsilon_k\) are shape parameters. In the last post, we saw how to solve for the \(c_k\) values (together with the \(\lambda_j\) values). Today, we'll focus on training the different parameters in the RBF-NN. More specifically, we will discuss how to train the parameters \(c_k\), \(\epsilon_k\), and \({\bf x}^c_k\). The \(\lambda_j\) parameters do not need to be trained separately, and can be computed from the other parameters. For the following discussion, remember that we have \(N\) data sites \(X = \{{\bf x}_i\}_{i=1}^N\) (which are the locations of our training data), and \(n\) centers \(X_c = \{{\bf x}^c_k\}_{k=1}^n\) (which are the locations of each ``neuron''). We assume for now that we are given \(N\), \(n\), and \(\ell\) (polynomial degree).

Training via Gradient Descent (Theory)

Before we begin, we must ask ourselves what ``training'' is. The idea is to use the sampled function values \(f({\bf x}_k)\) at the given data sites \(X\) to teach the neural network \(s({\bf x})\) about \(f\). Formally speaking, we want the error on the training set \(X\) to be minimized. We also have a more important goal that we will discuss in a subsequent post: how to generalize the RBF-NN so that the error on a test set (distinct from the training set) is also kept low.

We said we want to minimize the training error. Before we talk about how to do that, we need to define this error. There are a few different ways to do so. We will use a simple measure of error called mean-squared error \(E\). We can define this error at the data sites as: $$ E = \frac{1}{N} \sum\limits_{i=1}^N \left( f\left({\bf x}_i\right) - s\left({\bf x}_i\right) \right)^2. $$ For the purposes of training, it is useful to think of \(E\) and \(s\) as explicit functions of the different training parameters. Thus, we write \(E\) as: $$ E\left(c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) = \frac{1}{N} \sum\limits_{i=1}^N \left( f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) \right)^2. $$ Phew! That's tedious, but it tells us exactly what RBF-NN parameters \(E\) and \(s\) depend on. It also reminds us that the true function \(f\) most certainly does not depend on the RBF-NN parameters. Okay, now that we've written down the error, we have to discuss how to minimize it. Clearly, \(E\) is a function of many RBF-NN parameters, and we wish to minimize \(E\) with respect to all of them.

From calculus, recall that to find the minimum of a function, you take its derivative and set it to zero. Let's see how to minimize \(E\) with respect to just \(c_1\). First, remember that \(E\) is a function of many variables. Thus, when differentiating \(E\) with respect to \(c_1\), you need its partial derivative with respect to \(c_1\). Setting this derivative to zero, we get $$ \begin{align} \frac{1}{N} \frac{\partial}{\partial c_1} \sum\limits_{i=1}^N \left(f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) \right)^2 &=0, \\\\ \implies \frac{1}{N} \sum\limits_{i=1}^N \frac{\partial}{\partial c_1}\left(f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right)\right)^2 &=0. \end{align} $$ For convenience, let $$ e_i = f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right), $$ so that the long expression above becomes: $$ \frac{1}{N} \sum\limits_{i=1}^N \frac{\partial}{\partial c_1}e_i^2 =0. $$ Differentiating the above expression with the chain rule, we get $$ \frac{1}{N} \sum\limits_{i=1}^N 2e_i \frac{\partial e_i}{\partial c_1} =0. $$ It is clear that the above procedure works for all the parameters \(c_1,\ldots,c_n,\ldots\). In general, all we have to do is compute the partial derivative of \(e_i\) with respect to a parameter, and we can find the minimum (in principle). Let's continue doing this for \(c_1\). Focusing on that partial derivative, we have $$ \frac{\partial e_i}{\partial c_1} = \frac{\partial}{\partial c_1} \left(f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) \right). $$ Again, writing out all the parameters came in handy. We know that \(f\) is not a function of any RBF-NN parameters. 
Therefore, we have $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right). $$ Great! This in principle can be computed for the RBF-NN for each parameter. All we have to do is set the derivative to zero, solve for the parameters \(c_1,\ldots,c_n,\ldots\), and we're good to go! Right? In theory, yes. In practice, it is incredibly difficult to actually solve for the parameters by minimizing \(E\) with respect to those parameters. This is where the idea of gradient descent comes in.

Solving the equations that result from setting the partial derivatives of \(E\) to zero is, in practice, infeasible. Gradient descent offers a compromise: instead of jumping to \(\frac{\partial E}{\partial c_1} = 0\) in one shot as detailed above, how about we descend there gradually? Gradient descent for the parameter \(c_1\) therefore amounts to an update rule of the form: $$ c_1^{new} = c_1^{old} - \eta_{c_1} \frac{\partial E}{\partial c_1}, $$ where \(0 < \eta_{c_1} < 1\) is called the "learning rate". It controls the "speed" at which \(c_1^{new}\) approaches the value \(c_1^*\), where \(c_1^*\) is the value of \(c_1\) that you would've gotten if you had been able to minimize \(E\) directly. For a suitably small learning rate (and a well-behaved error surface), running gradient descent for enough iterations brings you arbitrarily close to a minimum. In practice, you run it until \(E\) hits a predefined threshold/tolerance or for a certain number of iterations, stop the gradient descent, and live with what you get.
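Before specializing to the RBF-NN, the update rule can be seen on a toy problem. This one-parameter example is my own, not from the post: minimizing \(E(c) = (c-3)^2\), whose minimizer is \(c^* = 3\).

```python
# Gradient descent on E(c) = (c - 3)^2.
c, eta = 0.0, 0.1
for _ in range(200):
    grad = 2.0 * (c - 3.0)   # dE/dc
    c = c - eta * grad       # c_new = c_old - eta * dE/dc
print(c)   # approaches 3.0
```

Each step shrinks the distance to \(c^*\) by a factor of \(|1 - 2\eta|\), which is why a learning rate that is too large can overshoot or even diverge.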

Training \(c_1,\ldots,c_n\) via Gradient Descent

We will now derive the formulas for training each of the RBF-NN parameters using gradient descent. Recall that there are three types of parameters: the \(c\) parameters, the \(\epsilon\) parameters, and the \({\bf x}^c\) parameters. We will need to compute partial derivatives for each of these cases. We'd sort of gotten started with \(c_1\), so let's keep going with that. Gradient descent gives us: $$ c_1^{new} = c_1^{old} - \eta_{c_1} \frac{\partial E}{\partial c_1}. $$ From our previous derivation above, we know this is the same as $$ c_1^{new} = c_1^{old} - \eta_{c_1} \frac{1}{N}\sum\limits_{i=1}^N2e_i\frac{\partial e_i}{\partial c_1}, $$ where $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right). $$ Plugging in the definition of \(s\), we have $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1} \left(\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}_i) \right). $$ Plunging forward bravely, this gives us $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right) - \frac{\partial}{\partial c_1}\sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}_i). $$ The second term vanishes since it doesn't depend on \(c_1\). This leaves the first term: $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right). $$ To help us do this derivative, we will expand the summand: $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}\left(c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right) + c_2\phi\left(\epsilon_2,\|{\bf x}_i - {\bf x}^c_2\|\right)+ \ldots + c_n\phi\left(\epsilon_n,\|{\bf x}_i - {\bf x}^c_n\|\right) \right). 
$$ Wait a minute-- clearly, only the first term in the summand is a function of \(c_1\). Then, the derivatives of the other terms vanish, leaving us with: $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right). $$ This is a very straightforward derivative to do, giving us $$ \frac{\partial e_i}{\partial c_1} = -\phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right). $$ In general, for any \(c_k\), by analogy, we therefore have $$ \frac{\partial e_i}{\partial c_k} = -\phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right). $$ The gradient descent rule for the \(n\) \(c\) values then looks like: $$ c_k^{new} = c_k^{old} + \eta_{c_k} \frac{1}{N}\sum\limits_{i=1}^N 2e_i \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), k=1,\ldots,n. $$
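For concreteness, here is a 1-D sketch of this update for the Gaussian RBF. The function and variable names are my own, and a single learning rate is shared by all \(c_k\) for simplicity:

```python
import numpy as np

def update_weights(x, xc, eps, c, lam, P, f, eta):
    """One gradient-descent step on all the weights c_k (1-D data, Gaussian RBF)."""
    r = np.abs(x[:, None] - xc[None, :])       # r_{ik} = |x_i - xc_k|
    Phi = np.exp(-(eps[None, :] * r) ** 2)     # phi(eps_k, r_{ik})
    e = f - (Phi @ c + P @ lam)                # e_i = f(x_i) - s(x_i)
    return c + eta * (2.0 / len(x)) * (Phi.T @ e)
```

Here `P @ lam` supplies the (fixed) polynomial part of \(s\); the columns of `Phi.T @ e` carry out the sum over \(i\) for every \(k\) at once.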

Training \(\epsilon_1,\ldots,\epsilon_n\) via Gradient Descent

Having seen things in detail for the \(c_k\) values, we can skip a few steps to find the gradient descent formula for the \(\epsilon_k\). We know that we need $$ \frac{\partial e_i}{\partial \epsilon_1} = -\frac{\partial}{\partial \epsilon_1}\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right) - \frac{\partial}{\partial \epsilon_1}\sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}_i). $$ Once again, the second term drops out, leaving us with $$ \frac{\partial e_i}{\partial \epsilon_1} = -\frac{\partial}{\partial \epsilon_1}\left(c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right) + c_2\phi\left(\epsilon_2,\|{\bf x}_i - {\bf x}^c_2\|\right)+ \ldots + c_n\phi\left(\epsilon_n,\|{\bf x}_i - {\bf x}^c_n\|\right) \right). $$ Only the first term in the summand is a function of \(\epsilon_1\). This gives us $$ \frac{\partial e_i}{\partial \epsilon_1} = -\frac{\partial}{\partial \epsilon_1} c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right) = - c_1\frac{\partial}{\partial \epsilon_1} \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right). $$ Here, we've made an assumption that the weights \(c_k\) are not a function of the values \(\epsilon_k\). This assumption is interesting, and a topic of ongoing research. Let's leave that alone for now and proceed with the update rule above. In general, we have $$ \frac{\partial e_i}{\partial \epsilon_k} = - c_k\frac{\partial}{\partial \epsilon_k} \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), $$ with the corresponding gradient descent rule: $$ \epsilon_k^{new} = \epsilon_k^{old} + \eta_{\epsilon_k} \frac{1}{N}\sum\limits_{i=1}^N 2e_i c_k\frac{\partial}{\partial \epsilon_k} \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), k=1,\ldots,n. $$ We've left that last partial derivative fairly generic, so that the process is applicable to any RBF.
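As an example, for the Gaussian \(\phi(\epsilon, r) = e^{-(\epsilon r)^2}\), the needed derivative is \(\frac{\partial \phi}{\partial \epsilon} = -2\epsilon r^2 e^{-(\epsilon r)^2}\). A 1-D sketch of the resulting rule (names are my own):

```python
import numpy as np

def update_shapes(x, xc, eps, c, lam, P, f, eta):
    """One gradient-descent step on the shape parameters eps_k (Gaussian RBF)."""
    r = np.abs(x[:, None] - xc[None, :])
    Phi = np.exp(-(eps[None, :] * r) ** 2)
    e = f - (Phi @ c + P @ lam)
    dphi_deps = -2.0 * eps[None, :] * r ** 2 * Phi   # d(phi)/d(eps_k) at each x_i
    return eps + eta * np.mean(2.0 * e[:, None] * c[None, :] * dphi_deps, axis=0)
```

The `np.mean(..., axis=0)` implements the \(\frac{1}{N}\sum_i\) in the update rule for all \(k\) simultaneously.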

Training \({\bf x}^c_1,\ldots,{\bf x}^c_n\) via Gradient Descent

Now, we come to the final part of this post: training the RBF-NN centers using gradient descent. This is fun, because here we're actually coming up with an update rule to move points around in space! As usual, let's do everything for \({\bf x}^c_1\). We require the quantity $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -\frac{\partial }{\partial {\bf x}^c_1} s\left({\bf x}_i,{\bf x}^c_1,\ldots,{\bf x}^c_n \right), $$ where we've suppressed other arguments to \(s\) for clarity. Noting immediately that the polynomial part of \(s\) does not depend on \({\bf x}^c\), we get $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -\frac{\partial }{\partial {\bf x}^c_1} \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k, \|{\bf x}_i - {\bf x}^c_k\|\right). $$ Just as in the previous derivations, this is immediately simplified. Only the first RBF depends on \({\bf x}^c_1\), making the others vanish on differentiation. This yields $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -\frac{\partial }{\partial {\bf x}^c_1} c_1 \phi\left(\epsilon_1, \|{\bf x}_i - {\bf x}^c_1\|\right) = -c_1 \frac{\partial }{\partial {\bf x}^c_1}\phi\left(\epsilon_1, \|{\bf x}_i - {\bf x}^c_1\|\right). $$ We can further simplify the last term using the chain rule. \(\phi(r)\) is a radial function, and \(r = \|{\bf x} - {\bf x}^c\|\). The chain rule gives: $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -c_1 \left.\left(\frac{\partial \phi }{\partial r} \frac{\partial r}{\partial {\bf x}^c}\right)\right|_{{\bf x}^c = {\bf x}^c_1}. $$ Since \(\frac{\partial r}{\partial {\bf x}^c} = -\frac{{\bf x} - {\bf x}^c}{r}\), this simplifies to $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = c_1 \left.\frac{\partial \phi}{\partial r}\right|_{r = r_1} \frac{{\bf x}_i - {\bf x}^c_1}{r_1}, $$ where \(r_1 = \|{\bf x}_i - {\bf x}^c_1\|\). To reduce the number of operations, it's useful to rewrite this as: $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = c_1 \left.\left(\frac{1}{r}\frac{\partial \phi}{\partial r}\right)\right|_{r = r_1} \left({\bf x}_i - {\bf x}^c_1\right). 
$$ The quantity in parentheses can be computed analytically, then evaluated at \(r = r_1\) for any RBF. The general expression for any center \({\bf x}^c_k\) is: $$ \frac{\partial e_i}{\partial {\bf x}^c_k} = c_k \left.\left(\frac{1}{r}\frac{\partial \phi}{\partial r}\right)\right|_{r = r_k} \left({\bf x}_i - {\bf x}^c_k\right). $$ Finally, the gradient descent rule for the centers is: $$ \left({\bf x}^c_k\right)^{new} = \left({\bf x}^c_k\right)^{old} - \eta_{{\bf x}^c_k} \frac{1}{N}\sum\limits_{i=1}^N 2 e_i c_k \left.\left(\frac{1}{r}\frac{\partial \phi}{\partial r}\right)\right|_{r = r_k} \left({\bf x}_i - {\bf x}^c_k\right), k=1,\ldots, n. $$
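For the Gaussian, \(\frac{1}{r}\frac{\partial \phi}{\partial r} = -2\epsilon^2 e^{-(\epsilon r)^2}\), so the \(\frac{1}{r}\) factor never blows up, even when a data site coincides with a center. A 1-D sketch of the rule (names are my own; `grad` is \(\partial E/\partial {\bf x}^c_k\)):

```python
import numpy as np

def update_centers(x, xc, eps, c, lam, P, f, eta):
    """One gradient-descent step on the centers xc_k (1-D data, Gaussian RBF)."""
    N = len(x)
    diff = x[:, None] - xc[None, :]                # x_i - xc_k
    Phi = np.exp(-(eps[None, :] * diff) ** 2)      # diff is squared, so sign-safe
    e = f - (Phi @ c + P @ lam)
    inv_r_dphi = -2.0 * eps[None, :] ** 2 * Phi    # (1/r) d(phi)/dr at r_{ik}
    grad = (2.0 / N) * np.sum(e[:, None] * c[None, :] * inv_r_dphi * diff, axis=0)
    return xc - eta * grad
```

A useful sanity check on a rule like this is to compare `grad` against a finite-difference approximation of \(\partial E/\partial {\bf x}^c_k\); the two should match to several digits.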

Wrapping up

We came up with gradient descent update formulas for all the RBF-NN parameters, given a fixed set of hyperparameters. In the next blog post, we'll discuss Stochastic Gradient Descent (SGD) for training.

RBF-NN for Function Approximation: Part 1 (Setup)

RBF-NNs are typically used for either function approximation or data classification. My interest is in function approximation, so we'll start there. This tutorial approaches the RBF-NN from the perspective of an RBF researcher/practitioner and applied mathematician. I will likely avoid using NN terminology (except where helpful).

Alright, let's begin. An RBF-NN for function approximation is essentially a least-squares approximant. The goal is not to interpolate data, but to approximate it in a way that filters out noise (if any). Let's set up the problem. First, let's say that we have a collection of data sites \(X = \{{\bf x}_i \}_{i=1}^N\) at which we have sampled data. Let's also define a second collection of points called the centers, \(X_c = \{{\bf x}^c_k \}_{k=1}^n\); these could be a subset of the data sites, or could be distinct from the data sites. Further, let's say we have samples of some scalar-valued function \(f({\bf x})\) at the data sites. Our goal is to approximate \(\left.f\right|_X\) using an RBF-NN. The RBF-NN approximant can be written as: $$ s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right), $$ where \(\phi\left(\epsilon,r\right)\) is an RBF, the \(c_k\) values are the unknown approximation coefficients (NN weights), and the \(\epsilon_k\) values are called shape parameters. In the theory for RBF interpolation, using a single \(\epsilon\) with distinct data sites (and centers being the same as data sites) guarantees the existence and uniqueness of the RBF interpolant. In the RBF-NN context, we like the flexibility of having one or more shape parameters. A common choice for \(\phi\) is the Gaussian RBF $$ \phi(\epsilon,r) = e^{-(\epsilon r)^2}. $$
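In code, evaluating this (polynomial-free) approximant at arbitrary points is a few lines; this 1-D sketch with the Gaussian RBF uses names of my own choosing:

```python
import numpy as np

def rbf_nn(x, xc, eps, c):
    """Evaluate s(x) = sum_k c_k * exp(-(eps_k * |x - xc_k|)^2) at 1-D points x."""
    r = np.abs(np.asarray(x)[:, None] - xc[None, :])   # ||x - xc_k||
    return np.exp(-(eps[None, :] * r) ** 2) @ c        # sum over the n centers
```

With a single center at the origin, \(\epsilon_1 = 1\), and \(c_1 = 1\), this reduces to the Gaussian itself: \(s(0) = 1\) and \(s(1) = e^{-1}\).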

While the above approach is somewhat standard in the RBF-NN literature, it is a bit outdated when compared to the RBF interpolation literature. It is now common to augment RBFs with polynomials to improve the approximation. Let \(\ell\) be the degree of a total-degree polynomial in \(d\) dimensions. Then, our new RBF-NN approximant takes the form $$ s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}), $$ where \(p_j({\bf x})\) are the monomial basis functions in \(d\) dimensions; for instance, if \(d=2\), \(p_1({\bf x}) = 1\), \(p_2({\bf x}) = x\), \(p_3({\bf x}) = y\), \(p_4({\bf x}) = xy\), and so on. The \(\lambda_j\) values are not interpolation coefficients in the traditional sense. Instead, they help us enforce additional constraints on the RBF-NN coefficients \(c_k\). Now, we must enforce conditions on the RBF-NN approximant to obtain the \(c_k\) and \(\lambda_j\) values. The conditions we will impose are: $$ \left.s({\bf x})\right|_X = \left.f({\bf x})\right|_X, \\\\ \sum\limits_{k=1}^n c_k p_j({\bf x}^c_k) = 0, j=1,\ldots,{\ell+d \choose d}. $$ To find the coefficients \(c_k\) and \(\lambda_j\), we use the above conditions to solve the following block linear system: $$ \underbrace{\begin{bmatrix} \Phi & P \\ \hat{P}^T & O \end{bmatrix}}_{\tilde{A}} \underbrace{\begin{bmatrix} {\bf c} \\ {\bf \lambda} \end{bmatrix}}_{\tilde{c}} = \underbrace{\begin{bmatrix} {\bf f} \\ 0 \end{bmatrix}}_{\tilde{f}}, $$ where $$ \Phi_{ij} = \phi\left(\epsilon_j, \|{\bf x}_i - {\bf x}^c_j\|\right),i=1,\ldots,N, j=1,\ldots,n, \\\\ P_{ij} = p_j\left({\bf x}_i \right),i=1,\ldots,N, j=1,\ldots,{\ell+d \choose d},\\\\ \hat{P}_{ij} = p_j\left({\bf x}^c_i \right), i=1,\ldots,n, j=1,\ldots,{\ell+d \choose d},\\\\ O_{ij} = 0, i,j = 1,\ldots, {\ell+d \choose d}. 
$$ From the above system, it is clear that \({\bf \lambda}\) is the vector of Lagrange multipliers that enforce the constraint \(\hat{P}^T {\bf c}=0\). This latter constraint is called polynomial reproduction, and it ensures that the RBF-NN can reproduce polynomials up to degree \(\ell\) at the centers. It is important to note that \(n\) and \(\ell\) are hyperparameters for the RBF-NN. They can in principle be trained, but our focus here is on training the parameters \(\epsilon_k\), \(c_k\), and \({\bf x}^c_k\). Since \(n \ll N\) (far fewer centers than data sites), the above linear system is overdetermined (more equations than unknowns). We are interested in the least-squares solution to this system. In other words, we wish to find \(\tilde{c}\) so that $$ \|\tilde{A}\tilde{c} - \tilde{f}\|_2^2 $$ is minimized. Of course, this is a well-understood problem from numerical linear algebra, and there are several possible techniques for solving it. One way is to use the reduced (or economy) \(QR\) factorization of \(\tilde{A}\), since \(\tilde{A}\) is a tall and skinny rectangular matrix: $$ \begin{align} \tilde{A} &= QR \\\\ &= Q\begin{bmatrix} R_1 \\\\ 0\end{bmatrix} \\\\ &= \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R_1 \\\\ 0\end{bmatrix} \\\\ &= Q_1 R_1, \end{align} $$ where \(Q_1\) has orthonormal columns, and \(R_1\) is an upper-triangular matrix. This can then be used to solve the linear system in three steps: $$ \begin{align} Q_1 R_1 \tilde{c} &= \tilde{f}, \\\\ \implies R_1 \tilde{c} &= Q_1^T \tilde{f}, \\\\ \implies \tilde{c} &= R_1^{-1} Q_1^T \tilde{f}. \end{align} $$
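The whole setup fits in a short script. This is my own 1-D sketch with degree \(\ell = 1\) (so the monomials are \(1\) and \(x\)); the data and sizes are illustrative:

```python
import numpy as np

# Assemble the block system A-tilde c-tilde = f-tilde and solve it in the
# least-squares sense via economy QR.
rng = np.random.default_rng(2)
N, n, eps = 40, 6, 4.0
x = np.sort(rng.uniform(0, 1, N))     # data sites
xc = np.linspace(0, 1, n)             # centers
f = np.exp(np.sin(2 * x))             # samples of the target function

Phi = np.exp(-(eps * (x[:, None] - xc[None, :])) ** 2)  # N x n RBF block
P = np.column_stack([np.ones(N), x])                    # monomials 1, x at data sites
Phat = np.column_stack([np.ones(n), xc])                # monomials at the centers
m = P.shape[1]

A = np.block([[Phi, P], [Phat.T, np.zeros((m, m))]])    # the matrix A-tilde
rhs = np.concatenate([f, np.zeros(m)])                  # f-tilde: f padded with zeros

Q1, R1 = np.linalg.qr(A, mode='reduced')                # economy QR
ctilde = np.linalg.solve(R1, Q1.T @ rhs)                # least-squares solution
c, lam = ctilde[:n], ctilde[n:]                         # RBF weights and multipliers
```

The last two rows of `A` are the polynomial-reproduction constraint \(\hat{P}^T{\bf c} = {\bf 0}\), enforced in the least-squares sense along with the fit to \(f\).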

In the next blog post, I will describe how to train the parameters \(c_k\), \(\epsilon_k\), and \({\bf x}^c_k\).

Learning machine learning

I've recently embarked on a project to teach myself some machine learning basics. Since my background is in Radial Basis Function (RBF) interpolation and numerical methods for PDEs, it seems natural to start with RBF neural networks (RBF-NN) and work outward from there.

There are many great tutorials online, and mine aren't going to be very special. However, they will contain the details that I needed to get things up and running, and will eventually include code also. I will also be applying some of the state-of-the-art technology from the RBF interpolation world to the RBF-NN world.

At some point, I will move on from RBF-NN into SVM, deep learning, and so forth. Let's go.

From ICERM: challenges faced by RBF methods

I recently gave a couple of talks at the ICERM workshop on localized kernel methods for PDEs. The first talk was about my research, and the second talk was about a software package I'm developing called KernelPack. There will be more about KernelPack in this space as I continue developing the software. I plan to document my efforts (and difficulties) as I develop it.

Towards the end of the workshop, we had a panel discussion on the challenges faced by Radial Basis Function (RBF) methods: both the research challenges our community faces, and the broader challenge of getting the word out about the strengths of these methods. As a result of this fruitful and rather friendly discussion, several ideas emerged:

1. The need for theoretical results to help relate eigenvalues of global differentiation matrices to the RBF-finite difference weights generated by the local methods. This plays more generally into the theme of robustness: RBF-FD methods have become very robust and efficient over the past 3 years, but a lot more work needs to be done along these lines for a variety of problems.

2. The need to find interesting problems that specifically showcase the strength of RBF-FD methods: they are meshfree, high-order, extremely easy to generate and compute, and require little to no tuning.

3. The need to tackle missing features of RBF-FD methods: meshfree conservation, monotonicity, the ability to handle shocks, and the development of software packages that both budding and seasoned researchers can use.

4. The relative isolation of the RBF community from the wider world of meshfree methods.

Much like writing a grant proposal, this panel discussion has solidified in my mind the types of challenges I'd like to tackle in my research. This is why workshops and conferences can be awesome!

NA from the Bottom-Up: Results

This is a semi-continuation of my previous blog post. My goal for Spring 2016 was to design and teach a numerical analysis course from the ground up. Lagrange interpolation with polynomials was to be the basis of this class.

After a semester, I would say this approach was a success. Polynomials proved to be a very simple way of deriving ODE and PDE methods. Because the class understood polynomial interpolation, differentiation, and integration well, it was a breeze to use the Method of Weighted Residuals framework to teach finite differences (FD), Chebyshev and Fourier spectral methods, radial basis function (RBF) spectral methods, and RBF-generated FD methods for PDEs.

At any point, to generate a numerical method, I could follow a consistent framework:

1. Pick a basis (polynomials, RBFs, Fourier).

2. Pick a set of collocation points (evenly-spaced, Chebyshev, scattered nodes).

3. Pick a support (global, compact) for that basis.

4. Pick a smoothness (piecewise-smooth, infinitely-smooth) for that basis.

5. Pick weight functions: delta functions for collocation methods, or the basis function itself for a Galerkin method.

Students reported enjoying the class and understanding the material well. Their HW performance reflected this as well.

The course notes for the class are located at http://www.math.utah.edu/~vshankar/5620.html.

Having taught the course for a semester, I now also have a basis (no pun intended) for the next time I teach it. I can think about adding more material on finite element and finite volume methods; I spent only one lecture on the former and no lectures on the latter this semester.

Numerical Analysis from the Bottom-Up

I have attended and sat in on many numerical analysis and numerical methods courses. In fact, a pair of beautiful courses taught by Professor Krzysztof Sikorski (https://www.cs.utah.edu/~sikorski/, http://www.coe.utah.edu/2012/07/27/in-memoriam-professor-kris-sikorski/) was responsible for my own research into numerical methods for PDEs.

That said, second courses in numerical methods (numerics for ODEs and PDEs) always seem to be taught in a scattered fashion: there are a large number of topics to cover in numerical ODEs and PDEs, and these courses valiantly attempt to go over them all. However, in my experience, these courses have only ever commented in passing on the unifying principles underlying all these numerical methods. As a result, our numerics education is a scattered experience, and we are forced to put together the pieces ourselves afterwards when doing research.

My goal this semester (Spring 2016) was to do this differently. While teaching Numerical Analysis II, I decided to build the course bottom-up. The goal was to identify a unifying principle (or principles) to all the numerical methods for ODEs and PDEs. I realized, partly due to my own research experience, that the most sensible explicit unifying principle is polynomial interpolation.

Starting with polynomial interpolation, it is possible to derive finite differences and quadrature rules. Using those as building blocks, one can derive all the multistep and multistage methods for ODEs. It is also quite easy to derive error estimates for these methods using the polynomial interpolation error formula, an approach that seems surprisingly rare (though some sources do take this approach).
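As a small example of the principle (my own, not from the course notes): differentiating the quadratic interpolant through \(x = -h, 0, h\) at \(x = 0\) recovers the standard centered finite-difference weights \([-1/(2h),\ 0,\ 1/(2h)]\). One way to compute them is to require exactness on the monomials \(1, x, x^2\):

```python
import numpy as np

# Finite-difference weights from polynomial exactness: require
# sum_j w_j * p(x_j) = p'(0) for p = 1, x, x^2.
h = 0.1
nodes = np.array([-h, 0.0, h])
V = np.vander(nodes, 3, increasing=True).T   # rows: 1, x, x^2 evaluated at the nodes
rhs = np.array([0.0, 1.0, 0.0])              # derivatives of 1, x, x^2 at x = 0
w = np.linalg.solve(V, rhs)                  # w = [-1/(2h), 0, 1/(2h)] = [-5, 0, 5]
```

Applying `w` to samples of a smooth function then approximates its derivative at \(0\) to \(O(h^2)\); for example, `np.dot(w, np.sin(nodes))` is close to \(\cos(0) = 1\).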

For numerical PDEs, it is straightforward to start with the Method of Weighted Residuals, and derive all collocation and Galerkin methods from there. Further, by picking a polynomial/piecewise polynomial basis (regardless of the collocation/Galerkin framework), it becomes straightforward to derive spectral methods, finite difference methods and finite element methods.

I make no claims that this is the best way to teach numerical analysis. However, I think it is worth experimenting with this bottom-up approach. To that end, I've been writing my own texts in LaTeX and posting them on a class webpage. Hopefully, over time, these documents will evolve into something more polished and rigorous.