RBF-NN for Function Approximation: Part 2 (Training)

In the previous post (RBF-NN: Part 1), I discussed the use of RBF Neural Networks (RBF-NN) for function approximation. In that post, we set up an RBF-NN of the form: $$ s({\bf x}) = \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x} - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}), $$ where \(\phi(\epsilon,r)\) is a Radial Basis Function (RBF), \(p_j({\bf x})\) are the monomials in \(d\) dimensions (up to degree \(\ell\)), \(c_k\) are the NN weights, and \(\epsilon_k\) are shape parameters. In the last post, we saw how to solve for the \(c_k\) values (together with the \(\lambda_j\) values). Today, we'll focus on training the different parameters in the RBF-NN. More specifically, we will discuss how to train the parameters \(c_k\), \(\epsilon_k\), and \({\bf x}^c_k\). The \(\lambda_j\) parameters do not need to be trained separately, and can be computed from the other parameters. For the following discussion, remember that we have \(N\) data sites \(X = \{{\bf x}_i\}_{i=1}^N\) (which are the locations of our training data), and \(n\) centers \(X_c = \{{\bf x}^c_k\}_{k=1}^n\) (which are the locations of each ``neuron''). We assume for now that we are given \(N\), \(n\), and \(\ell\) (polynomial degree).

Training via Gradient Descent (Theory)

Before we begin, we must ask ourselves what ``training'' is. The idea is to use the sampled function values \(f({\bf x}_k)\) at the given data sites \(X\) to teach the neural network \(s({\bf x})\) about \(f\). Formally speaking, we want the error on the training set \(X\) to be minimized. We also have a more important goal that we will discuss in a subsequent post: how to generalize the RBF-NN so that the error on a test set (distinct from the training set) is also kept low.

We said we want to minimize the training error. Before we talk about how to do that, we need to define this error. There are a few different ways to do so. We will use a simple measure of error called mean-squared error \(E\). We can define this error at the data sites as: $$ E = \frac{1}{N} \sum\limits_{i=1}^N \left( f\left({\bf x}_i\right) - s\left({\bf x}_i\right) \right)^2. $$ For the purposes of training, it is useful to think of \(E\) and \(s\) as explicit functions of the different training parameters. Thus, we write \(E\) as: $$ E\left(c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) = \frac{1}{N} \sum\limits_{i=1}^N \left( f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) \right)^2. $$ Phew! That's tedious, but it tells us exactly what RBF-NN parameters \(E\) and \(s\) depend on. It also reminds us that the true function \(f\) most certainly does not depend on the RBF-NN parameters. Okay, now that we've written down the error, we have to discuss how to minimize it. Clearly, \(E\) is a function of many RBF-NN parameters, and we wish to minimize \(E\) with respect to all of them.
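To make \(E\) concrete, here is a minimal NumPy sketch that evaluates the RBF part of \(s\) at the data sites and computes the mean-squared error. The Gaussian RBF \(\phi(\epsilon, r) = e^{-(\epsilon r)^2}\) is just one choice I've assumed for illustration, and the polynomial term is omitted for brevity; the function names are mine, not from the post:

```python
import numpy as np

def gaussian_rbf(eps, r):
    """Gaussian RBF phi(eps, r) = exp(-(eps*r)**2) -- one common choice."""
    return np.exp(-(eps * r) ** 2)

def rbf_predict(X, Xc, c, eps):
    """Evaluate the RBF part of s(x) at the data sites X.

    X   : (N, d) data sites
    Xc  : (n, d) centers
    c   : (n,)   weights
    eps : (n,)   shape parameters
    (The polynomial term is omitted here for brevity.)
    """
    # r[i, k] = ||x_i - x^c_k||
    r = np.linalg.norm(X[:, None, :] - Xc[None, :, :], axis=2)
    return gaussian_rbf(eps[None, :], r) @ c

def mse(f_vals, X, Xc, c, eps):
    """Mean-squared training error E at the data sites."""
    e = f_vals - rbf_predict(X, Xc, c, eps)
    return np.mean(e ** 2)
```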

From calculus, recall that to find the minimum of a function, you take its derivative and set it to zero. Let's see how to minimize \(E\) with respect to just \(c_1\). First, remember that \(E\) is a function of many variables. Thus, when differentiating \(E\) with respect to \(c_1\), you need its partial derivative with respect to \(c_1\). Setting this derivative to zero, we get $$ \begin{align} \frac{1}{N} \frac{\partial}{\partial c_1} \sum\limits_{i=1}^N \left(f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) \right)^2 &=0, \\\\ \implies \frac{1}{N} \sum\limits_{i=1}^N \frac{\partial}{\partial c_1}\left(f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right)\right)^2 &=0. \end{align} $$ For convenience, let $$ e_i = f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right), $$ so that the long expression above becomes: $$ \frac{1}{N} \sum\limits_{i=1}^N \frac{\partial}{\partial c_1}e_i^2 =0. $$ Differentiating the above expression with the chain rule, we get $$ \frac{1}{N} \sum\limits_{i=1}^N 2e_i \frac{\partial e_i}{\partial c_1} =0. $$ It is clear that the above procedure works for all the parameters \(c_1,\ldots,c_n,\ldots\). In general, all we have to do is compute the partial derivative of \(e_i\) with respect to a parameter, and we can find the minimum (in principle). Let's continue doing this for \(c_1\). Focusing on that partial derivative, we have $$ \frac{\partial e_i}{\partial c_1} = \frac{\partial}{\partial c_1} \left(f\left({\bf x}_i\right) - s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right) \right). $$ Again, writing out all the parameters came in handy. We know that \(f\) is not a function of any RBF-NN parameters.
Therefore, we have $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right). $$ Great! This in principle can be computed for the RBF-NN for each parameter. All we have to do is set the derivative to zero, solve for the parameters \(c_1,\ldots,c_n,\ldots\), and we're good to go! Right? In theory, yes. In practice, it is incredibly difficult to actually solve for the parameters by minimizing \(E\) with respect to those parameters. This is where the idea of gradient descent comes in.

The procedure of setting the partial derivative of \(E\) to zero is too tedious to be feasible. Gradient descent offers a compromise: instead of jumping to \(\frac{\partial E}{\partial c_1} = 0\) in one shot as detailed above, how about we descend there gradually? Gradient descent for the parameter \(c_1\) therefore amounts to an update rule of the form: $$ c_1^{new} = c_1^{old} - \eta_{c_1} \frac{\partial E}{\partial c_1}, $$ where \(0 < \eta_{c_1} < 1\) is called the "learning rate". It controls the "speed" at which \(c_1^{new}\) approaches the value \(c_1^*\), where \(c_1^*\) is the value of \(c_1\) that you would've gotten if you had been able to minimize \(E\) directly. For a sufficiently small learning rate, running gradient descent long enough will drive \(E\) toward a minimum; since \(E\) is generally not convex in all the RBF-NN parameters, that minimum may be a local one rather than the global one. In practice, you run it until \(E\) hits a predefined threshold/tolerance or for a certain number of iterations, stop the gradient descent, and live with what you get.
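The update rule and both stopping criteria can be sketched for a single scalar parameter as below; this is generic gradient descent, not specific to RBF-NNs, and the function name, default learning rate, and tolerance are all illustrative choices of mine:

```python
def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a function of one variable via x <- x - eta * grad(x).

    Stops when the gradient magnitude falls below `tol` (a predefined
    tolerance) or after `max_iters` iterations, whichever comes first.
    """
    x = x0
    for _ in range(max_iters):
        g = grad(x)
        if abs(g) < tol:
            break
        x = x - eta * g
    return x
```

For example, minimizing \((x-3)^2\) (whose derivative is \(2(x-3)\)) starting from \(x_0 = 0\) walks \(x\) toward the minimizer \(x^* = 3\), each step shrinking the distance by a constant factor.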

Training \(c_1,\ldots,c_n\) via Gradient Descent

We will now derive the formulas for training each of the RBF-NN parameters using gradient descent. Recall that there are three types of parameters: the \(c\) parameters, the \(\epsilon\) parameters, and the \({\bf x}^c\) parameters. We will need to compute partial derivatives for each of these cases. We'd sort of gotten started with \(c_1\), so let's keep going with that. Gradient descent gives us: $$ c_1^{new} = c_1^{old} - \eta_{c_1} \frac{\partial E}{\partial c_1}. $$ From our previous derivation above, we know this is the same as $$ c_1^{new} = c_1^{old} - \eta_{c_1} \frac{1}{N}\sum\limits_{i=1}^N2e_i\frac{\partial e_i}{\partial c_1}, $$ where $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}s\left({\bf x}_i,c_1,\ldots,c_n,\epsilon_1,\ldots,\epsilon_n,{\bf x}^c_1,\ldots,{\bf x}^c_n\right). $$ Plugging in the definition of \(s\), we have $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1} \left(\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right) + \sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}_i) \right). $$ Plunging forward bravely, this gives us $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right) - \frac{\partial}{\partial c_1}\sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}_i). $$ The second term vanishes since it doesn't depend on \(c_1\). This leaves the first term: $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right). $$ To help us do this derivative, we will expand the summand: $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}\left(c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right) + c_2\phi\left(\epsilon_2,\|{\bf x}_i - {\bf x}^c_2\|\right)+ \ldots + c_n\phi\left(\epsilon_n,\|{\bf x}_i - {\bf x}^c_n\|\right) \right). 
$$ Wait a minute-- clearly, only the first term in the summand is a function of \(c_1\). Then, the derivatives of the other terms vanish, leaving us with: $$ \frac{\partial e_i}{\partial c_1} = -\frac{\partial}{\partial c_1}c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right). $$ This is a very straightforward derivative to do, giving us $$ \frac{\partial e_i}{\partial c_1} = -\phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right). $$ In general, for any \(c_k\), by analogy, we therefore have $$ \frac{\partial e_i}{\partial c_k} = -\phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right). $$ The gradient descent rule for the \(n\) \(c\) values then looks like: $$ c_k^{new} = c_k^{old} + \eta_{c_k} \frac{1}{N}\sum\limits_{i=1}^N 2e_i \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), k=1,\ldots,n. $$
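A single step of this update rule for all \(n\) weights at once might look like the NumPy sketch below. As before, the Gaussian RBF is my assumed choice, the polynomial term is omitted, and the function name is mine:

```python
import numpy as np

def update_weights(f_vals, X, Xc, c, eps, eta):
    """One gradient-descent step for all weights c_k (Gaussian RBF assumed).

    Implements: c_k <- c_k + eta * (1/N) * sum_i 2 e_i phi(eps_k, ||x_i - x^c_k||)
    """
    r = np.linalg.norm(X[:, None, :] - Xc[None, :, :], axis=2)  # (N, n)
    Phi = np.exp(-(eps[None, :] * r) ** 2)                      # phi(eps_k, r_ik)
    e = f_vals - Phi @ c                                        # residuals e_i
    return c + eta * (2.0 / len(X)) * (Phi.T @ e)
```

For a small enough learning rate, one such step should decrease the training error \(E\) whenever the gradient is nonzero.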

Training \(\epsilon_1,\ldots,\epsilon_n\) via Gradient Descent

Having seen things in detail for the \(c_k\) values, we can skip a few steps to find the gradient descent formula for the \(\epsilon_k\). We know that we need $$ \frac{\partial e_i}{\partial \epsilon_1} = -\frac{\partial}{\partial \epsilon_1}\sum\limits_{k=1}^n c_k \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right) - \frac{\partial}{\partial \epsilon_1}\sum\limits_{j=1}^{{\ell+d \choose d}} \lambda_j p_j({\bf x}_i). $$ Once again, the second term drops out, leaving us with $$ \frac{\partial e_i}{\partial \epsilon_1} = -\frac{\partial}{\partial \epsilon_1}\left(c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right) + c_2\phi\left(\epsilon_2,\|{\bf x}_i - {\bf x}^c_2\|\right)+ \ldots + c_n\phi\left(\epsilon_n,\|{\bf x}_i - {\bf x}^c_n\|\right) \right). $$ Only the first term in the summand is a function of \(\epsilon_1\). This gives us $$ \frac{\partial e_i}{\partial \epsilon_1} = -\frac{\partial}{\partial \epsilon_1} c_1 \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right) = - c_1\frac{\partial}{\partial \epsilon_1} \phi\left(\epsilon_1,\|{\bf x}_i - {\bf x}^c_1\|\right). $$ Here, we've made an assumption that the weights \(c_k\) are not a function of the values \(\epsilon_k\). This assumption is interesting, and a topic of ongoing research. Let's leave that alone for now and proceed with the update rule above. In general, we have $$ \frac{\partial e_i}{\partial \epsilon_k} = - c_k\frac{\partial}{\partial \epsilon_k} \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), $$ with the corresponding gradient descent rule: $$ \epsilon_k^{new} = \epsilon_k^{old} + \eta_{\epsilon_k} \frac{1}{N}\sum\limits_{i=1}^N 2e_i c_k\frac{\partial}{\partial \epsilon_k} \phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), k=1,\ldots,n. $$ We've left that last partial derivative fairly generic, so that the process is applicable to any RBF.
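To make that generic partial derivative concrete, here is a sketch of one \(\epsilon_k\) update step. I've again assumed the Gaussian \(\phi(\epsilon, r) = e^{-(\epsilon r)^2}\), for which \(\frac{\partial \phi}{\partial \epsilon} = -2\epsilon r^2 \phi\); the function name is mine:

```python
import numpy as np

def update_shapes(f_vals, X, Xc, c, eps, eta):
    """One gradient-descent step for all shape parameters eps_k.

    For the Gaussian phi(eps, r) = exp(-(eps*r)**2),
    d(phi)/d(eps) = -2 * eps * r**2 * phi.
    """
    r = np.linalg.norm(X[:, None, :] - Xc[None, :, :], axis=2)  # (N, n)
    Phi = np.exp(-(eps[None, :] * r) ** 2)
    e = f_vals - Phi @ c                                        # residuals e_i
    dPhi_deps = -2.0 * eps[None, :] * r ** 2 * Phi              # dphi/deps_k at r_ik
    # eps_k <- eps_k + eta * (1/N) * sum_i 2 e_i c_k dphi/deps_k
    return eps + eta * (2.0 / len(X)) * c * (dPhi_deps.T @ e)
```

A quick sanity check on such a step is to compare the implied gradient against a finite-difference approximation of \(\partial E / \partial \epsilon_k\).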

Training \({\bf x}^c_1,\ldots,{\bf x}^c_n\) via Gradient Descent

Now, we come to the final part of this post: training the RBF-NN centers using gradient descent. This is fun, because here we're actually coming up with an update rule to move points around in space! As usual, let's do everything for \({\bf x}^c_1\). We require the quantity $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -\frac{\partial }{\partial {\bf x}^c_1} s\left({\bf x}_i,{\bf x}^c_1,\ldots,{\bf x}^c_n \right), $$ where we've suppressed other arguments to \(s\) for clarity. Noting immediately that the polynomial part of \(s\) does not depend on \({\bf x}^c\), we get $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -\frac{\partial }{\partial {\bf x}^c_1} \sum\limits_{k=1}^n c_k \phi\left(\epsilon_k, \|{\bf x}_i - {\bf x}^c_k\|\right). $$ Just as in the previous derivations, this is immediately simplified. Only the first RBF depends on \({\bf x}^c_1\), making the others vanish on differentiation. This yields $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -\frac{\partial }{\partial {\bf x}^c_1} c_1 \phi\left(\epsilon_1, \|{\bf x}_i - {\bf x}^c_1\|\right) = -c_1 \frac{\partial }{\partial {\bf x}^c_1}\phi\left(\epsilon_1, \|{\bf x}_i - {\bf x}^c_1\|\right). $$ We can further simplify the last term using the chain rule. \(\phi(r)\) is a radial function, and \(r = \|{\bf x} - {\bf x}^c\|\). The chain rule gives: $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = -c_1 \left.\left(\frac{\partial \phi }{\partial r} \frac{\partial r}{\partial {\bf x}^c}\right)\right|_{{\bf x}^c = {\bf x}^c_1}. $$ Since \(\frac{\partial r}{\partial {\bf x}^c} = -\frac{{\bf x} - {\bf x}^c}{r}\), this simplifies to $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = c_1 \left.\frac{\partial \phi}{\partial r}\right|_{r = r_1} \frac{{\bf x}_i - {\bf x}^c_1}{r_1}, $$ where \(r_1 = \|{\bf x}_i - {\bf x}^c_1\|\). To reduce the number of operations, it's useful to rewrite this as: $$ \frac{\partial e_i}{\partial {\bf x}^c_1} = c_1 \left.\left(\frac{1}{r}\frac{\partial \phi}{\partial r}\right)\right|_{r = r_1} \left({\bf x}_i - {\bf x}^c_1\right). $$ The quantity in parentheses can be computed analytically, then evaluated at \(r = r_1\) for any RBF. The general expression for any center \({\bf x}^c_k\) is: $$ \frac{\partial e_i}{\partial {\bf x}^c_k} = c_k \left.\left(\frac{1}{r}\frac{\partial \phi}{\partial r}\right)\right|_{r = r_k} \left({\bf x}_i - {\bf x}^c_k\right), $$ where \(r_k = \|{\bf x}_i - {\bf x}^c_k\|\). Finally, the gradient descent rule for the centers is: $$ \left({\bf x}^c_k\right)^{new} = \left({\bf x}^c_k\right)^{old} - \eta_{{\bf x}^c_k} \frac{1}{N}\sum\limits_{i=1}^N 2 e_i c_k \left.\left(\frac{1}{r}\frac{\partial \phi}{\partial r}\right)\right|_{r = r_k} \left({\bf x}_i - {\bf x}^c_k\right), k=1,\ldots, n. $$
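One step of this center update might look like the sketch below. For the Gaussian RBF I've assumed throughout, \(\frac{1}{r}\frac{\partial \phi}{\partial r} = -2\epsilon^2 \phi\), so the apparent \(1/r\) singularity at \(r = 0\) cancels analytically; the function name is mine:

```python
import numpy as np

def update_centers(f_vals, X, Xc, c, eps, eta):
    """One gradient-descent step for all centers x^c_k (Gaussian RBF assumed).

    For phi(eps, r) = exp(-(eps*r)**2), (1/r) dphi/dr = -2 eps^2 phi,
    so no division by r is needed.
    """
    diff = X[:, None, :] - Xc[None, :, :]            # (N, n, d): x_i - x^c_k
    r = np.linalg.norm(diff, axis=2)                 # (N, n)
    Phi = np.exp(-(eps[None, :] * r) ** 2)
    e = f_vals - Phi @ c                             # residuals e_i
    phi_over_r = -2.0 * eps[None, :] ** 2 * Phi      # (1/r) dphi/dr at r_ik
    # grad_k = (1/N) sum_i 2 e_i c_k (1/r dphi/dr) (x_i - x^c_k)
    grad = (2.0 / len(X)) * np.einsum('i,k,ik,ikd->kd', e, c, phi_over_r, diff)
    return Xc - eta * grad
```

As with the shape parameters, a finite-difference check of \(\partial E / \partial {\bf x}^c_k\) is a useful way to verify the implementation.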

Wrapping up

We came up with gradient descent update formulas for all the RBF-NN parameters, given a fixed set of hyperparameters. In the next blog post, we'll discuss Stochastic Gradient Descent (SGD) for training.