In Part 1 of this series on RBF-NNs for function approximation, we discussed how to set up a basic RBF neural network and how to find the corresponding neural network weights. See here: RBF-NN Part 1. In Part 2, we discussed how to train the different RBF-NN parameters using gradient descent; as part of that discussion, we also presented an update formula for the weights: RBF-NN Part 2. It is reasonable to wonder: which approach is better for computing the RBF-NN weights? Do they give the same answer? When should you prefer the linear solve to gradient descent? This post tackles these questions.
Equivalence of least-squares and mean-squared error minimization
Recall from Part 1 that if \(\tilde{A}\) is the rectangular matrix corresponding to the RBF-NN (containing RBFs and polynomials), \(\tilde{c}\) is the vector of RBF-NN weights/coefficients, and \(\tilde{f}\) is the vector of function samples (padded with zeros), the weights can be calculated by the following linear solve: $$ \tilde{c} = R_1^{-1} Q_1^{T} \tilde{f}, $$ where \(Q_1\) and \(R_1\) are obtained from the economy-sized QR decomposition of \(\tilde{A}\). This, we said, was the least-squares solution to the linear system \(\tilde{A}\tilde{c} = \tilde{f}\). On the other hand, we had the gradient descent rule for the weights that minimizes the mean-squared error. We derived this to be: $$ c_k^{new} = c_k^{old} + \frac{\eta_{c_k}}{N}\sum\limits_{i=1}^N 2e_i\phi\left(\epsilon_k,\|{\bf x}_i - {\bf x}^c_k\|\right), \quad k=1,\ldots,n. $$ Are these two approaches the same if you take enough iterations of gradient descent? To show that, consider the least-squares problem of finding \(\tilde{c}\) that minimizes \(\|\tilde{A}\tilde{c} - \tilde{f}\|^2_2\). Let's first show why the \(QR\) approach actually works for solving this problem. The (full) \(QR\) decomposition of \(\tilde{A}\) is guaranteed to produce an orthogonal matrix \(Q\) so that: $$ \|Q x\|_2 = \|x\|_2, $$ for any vector \(x\). We also have the important property that \(Q^T = Q^{-1}\). We can therefore transform the least-squares problem by multiplying through by \(Q^T\): $$ \|\tilde{A}\tilde{c} - \tilde{f}\|^2_2 = \|Q^T\tilde{A}\tilde{c} - Q^T\tilde{f}\|^2_2, $$ as this does not change the magnitude of the 2-norm. Great! But now, recall that \(\tilde{A} = QR\). This implies that \(R = Q^T \tilde{A}\). Thus, we can rewrite our least-squares problem as: $$ \|\tilde{A}\tilde{c} - \tilde{f}\|^2_2 = \|R\tilde{c} - Q^T\tilde{f}\|^2_2. $$ In our case, a further simplification follows from the fact that \(\tilde{A}\) is a tall and skinny rectangular matrix: \(R\) then has the block form \(R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}\), and writing \(Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\), the quantity being minimized splits as \(\|R_1\tilde{c} - Q_1^T\tilde{f}\|^2_2 + \|Q_2^T\tilde{f}\|^2_2\). Only the first term depends on \(\tilde{c}\), and it is driven to zero precisely by the solution \(\tilde{c} = R_1^{-1}Q_1^T\tilde{f}\) stated above.
Regardless, it is clear that we can indeed use the QR-decomposition to find a solution to the least-squares problem.
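As a concrete sketch of this procedure, here is the QR-based least-squares solve on a small synthetic system (the matrix `A` below is just a random stand-in for the RBF-NN matrix \(\tilde{A}\), not the actual RBF construction from Part 1):

```python
import numpy as np

# Minimal sketch: solving min_c ||A c - f||_2 via the economy-sized QR
# decomposition. A is a random tall-and-skinny stand-in for the RBF-NN
# matrix A-tilde; f stands in for the padded sample vector f-tilde.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))   # N = 50 rows, n = 5 columns
f = rng.standard_normal(50)

Q1, R1 = np.linalg.qr(A, mode="reduced")   # economy-sized QR: Q1 is 50x5
c = np.linalg.solve(R1, Q1.T @ f)          # c = R1^{-1} Q1^T f

# Agrees with NumPy's built-in least-squares solver
c_ref, *_ = np.linalg.lstsq(A, f, rcond=None)
print(np.allclose(c, c_ref))   # True
```

Since \(R_1\) is upper triangular, the final solve is a cheap back-substitution; the QR factorization itself dominates the cost.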
Okay, now the question is if the solution from the QR problem is the same as the solution after a large number of steps of gradient descent. For this, we need to show that solving the least-squares problem is the same as minimizing the mean-squared error. Recall that the mean-squared error is given by: $$ E = \frac{1}{N}\sum\limits_{i=1}^N \left(f\left({\bf x}_i\right) - s\left({\bf x}_i\right) \right)^2. $$ Recall from Part 1 that the vector \({\bf f}\) is defined as: $$ {\bf f} = \begin{bmatrix} f({\bf x}_1)&\ldots& f({\bf x}_N)\end{bmatrix}^T. $$ Similarly, define the vector \({\bf s}\) to be: $$ {\bf s} = \begin{bmatrix} s({\bf x}_1)&\ldots& s({\bf x}_N)\end{bmatrix}^T. $$ This means that the mean-squared error \(E\) can be rewritten as: $$ E = \frac{1}{N} \|{\bf f} - {\bf s}\|_2^2. $$ Let's manipulate the least-squares expression a bit by using the definitions of \(\tilde{A}\), \(\tilde{c}\), and \(\tilde{f}\). $$ \begin{align} \|\tilde{A}\tilde{c} - \tilde{f}\|_2^2 &= \left\Vert \begin{bmatrix} \Phi {\bf c} + P {\bf \lambda}\\ {\bf 0} \end{bmatrix} - \begin{bmatrix} {\bf f}\\ {\bf 0} \end{bmatrix} \right\Vert^2_2 \\\\ &= \left\Vert \left(\Phi {\bf c} + P {\bf \lambda}\right) - {\bf f} \right\Vert^2_2, \\\\ &= \| {\bf s} - {\bf f} \|^2_2, \\\\ &= N E. \end{align} $$ This means that \(\|\tilde{A}\tilde{c}-\tilde{f}\|_2^2\) is simply a scaled version of \(E\), the mean-squared error. When you set the gradient of \(E\) to zero, the \(\frac{1}{N}\) factor drops out, giving a solution vector \(\tilde{c}\) that is identical to the least-squares solution vector. Of course, gradient descent only reaches that minimizer in the limit: after any finite number of iterations it solves the least-squares problem approximately. Thus, we take an accuracy hit by applying gradient descent to the weights. What do we gain, if anything?
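We can check this equivalence numerically. The sketch below uses a plain Gaussian-RBF matrix \(\Phi\) with no polynomial block to stay self-contained; the sample points, centers, shape parameter, and target function are all illustrative choices, not the setup from Parts 1 and 2:

```python
import numpy as np

# Illustrative check: gradient descent on the mean-squared error drives
# the weights to the least-squares solution. Gaussian RBFs, no polynomial
# block; all parameters below are made up for the demo.
x = np.linspace(0.0, 1.0, 40)    # N = 40 sample points
xc = np.linspace(0.0, 1.0, 5)    # n = 5 centers
eps = 10.0                       # shape parameter
Phi = np.exp(-(eps * (x[:, None] - xc[None, :])) ** 2)
f = np.sin(2.0 * np.pi * x)

# Least-squares weights via economy-sized QR
Q1, R1 = np.linalg.qr(Phi, mode="reduced")
c_ls = np.linalg.solve(R1, Q1.T @ f)

# Gradient descent on E = (1/N) ||f - Phi c||_2^2. The vectorized update
# Phi.T @ e is exactly the sum over i of e_i * phi(eps, |x_i - xc_k|).
N = x.size
lam_max = np.linalg.eigvalsh(Phi.T @ Phi).max()
eta = N / (2.0 * lam_max)        # step size safely inside the stable range
c = np.zeros(xc.size)
for _ in range(500):
    e = f - Phi @ c
    c = c + (eta / N) * 2.0 * (Phi.T @ e)

print(np.allclose(c, c_ls))      # True: same weights in the limit
```

Note that a single common learning rate is used here in place of the per-weight rates \(\eta_{c_k}\), and the step size is tied to the largest eigenvalue of \(\Phi^T\Phi\) so the iteration is guaranteed to converge.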
Comparing costs: gradient descent vs least-squares
If we use the QR decomposition to solve the least-squares problem once, the cost incurred is \(O(Nn^2)\) (ignoring the polynomial terms). The resulting coefficients are accurate to machine precision (or close). If we never expect the weights to change, this approach is the best one to take.
On the other hand, imagine that we perform gradient descent on the centers \( {\bf x}^c_1,\ldots, {\bf x}^c_n\). This changes \(\tilde{A}\) at each iteration of gradient descent, causing \(\tilde{c}\) to change at each iteration as well. In this scenario, if gradient descent takes \(T\) iterations to converge, the total dominant cost of solving the full least-squares problem every iteration is \(O(TNn^2)\). If \(T = O(N)\), then this cost is \(O(N^2n^2)\), which is prohibitively expensive. In contrast, if we only update the weights \(c_1,\ldots,c_n\) at each of the \(T\) iterations, the dominant cost is \(O(TNn)\). If \(T = O(N)\), then this cost is \(O(N^2n)\), which is significantly cheaper than the cost of a linear solve every step. Another reason to avoid solving the least-squares problem exactly is the issue of generalization past the training data. We will discuss this when we discuss alternative algorithms to gradient descent, starting with the next post.
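The two per-iteration strategies can be sketched side by side. The helper names and shapes below are illustrative (Gaussian RBFs, no polynomial block), but the dominant costs are as discussed above:

```python
import numpy as np

# Sketch of the two per-iteration strategies when the centers move each
# step. Re-solving costs O(N n^2) per iteration; a single gradient step
# on the weights costs O(N n). Names and parameters are illustrative.
def build_phi(x, xc, eps):
    """Gaussian-RBF matrix; rebuilt whenever the centers xc change."""
    return np.exp(-(eps * (x[:, None] - xc[None, :])) ** 2)

def weights_by_solve(Phi, f):
    # Full least-squares solve: the QR factorization dominates, O(N n^2).
    Q1, R1 = np.linalg.qr(Phi, mode="reduced")
    return np.linalg.solve(R1, Q1.T @ f)

def weights_by_gradient_step(Phi, f, c, eta):
    # One gradient-descent step: two matrix-vector products, O(N n).
    e = f - Phi @ c
    return c + (eta / Phi.shape[0]) * 2.0 * (Phi.T @ e)
```

Over \(T\) iterations of center training, calling `weights_by_solve` every step gives the \(O(TNn^2)\) total, while one `weights_by_gradient_step` per step gives \(O(TNn)\), at the price of weights that only approximate the exact least-squares solution at each step.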