Least Squares and QR Decomposition

This small article describes how to solve the linear least squares problem using QR decomposition and why you should use QR decomposition as opposed to the normal equations.

What is the linear least squares problem?

The linear least squares problem is to find a vector \(x \in \mathbb{R}^n\) that minimizes \(||Ax-b||_2^2\), where \(b \in \mathbb{R}^m\) is a given vector and \(A \in \mathbb{R}^{m \times n}\) is a given matrix of full rank with \(m > n\).

Some notes:

Definition: \(x \in \mathbb{R}^n\) means \(x\) is a vector of size \(n\) with real number entries.
Definition: \(A \in \mathbb{R}^{m \times n}\) means \(A\) is a matrix of size \(m \times n\) with real number entries.
Definition: \(|| \cdot ||_2\) is called the 2-norm and is defined as: \(||v||_2^2 = v_1^2 + v_2^2 + \ldots + v_n^2\) for a vector \(v\).
\(m > n\) implies \(A\) is a tall skinny matrix, \(x\) is a short vector, and \(b\) is a tall vector.
There are too few unknowns in \(x\) to solve \(Ax = b\), so we have to settle for getting as close as possible. Hence the minimization problem.
Definition: \(A\) is of full rank means all the columns of \(A\) are linearly independent. The purpose of this is to have only one solution to the problem.

The least squares problem and all the following discussion work with complex numbers as well with a few tweaks.

Example

An example of a linear least squares problem is a polynomial fit (regression) problem. Suppose you have 100 (x,y) coordinates that are suppose to fit closely to a quadratic. Then you want to find a quadratic \(y = a_0 + a_1 x + a_2 x^2\) that closely fits the coordinates. So you are trying to find coefficients \(a_0, a_1, a_2\) such that \begin{align} y_1 &\approx a_0 + a_1 x_1 + a_2 x_1^2 \\ y_2 &\approx a_0 + a_1 x_2 + a_2 x_2^2 \\ \vdots \\ y_{100} &\approx a_0 + a_1 x_{100} + a_2 x_{100}^2. \end{align} The linear least squares problem is to find the coefficients \(a_0, a_1, a_2\) that minimize \begin{equation} (a_0 + a_1 x_1 + a_2 x_1^2 - y_1)^2 + \ldots + (a_0 + a_1 x_{100} + a_2 x_{100}^2 - y_{100})^2. \end{equation} If you form the matrix and vectors \begin{equation} A = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{100} & x_{100}^2 \end{pmatrix}, \qquad x = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix}, \qquad b = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{100} \end{pmatrix} \end{equation} you get the equivalent problem \begin{equation} \min_{x} ||Ax - b||_2^2. \end{equation}

The normal equations

One multivariable calculus technique to solve the minimization is to take the partial derivative with respect to \(x_1, x_2, \ldots, x_n\), set all the equations to zero, and solve for \(x_1, x_2, \ldots, x_n\). Writing the result in matrix form leads to the what is called the normal equations \begin{equation} A^T A x = A^T b. \end{equation} The nice thing about this system is that it's a small \(n \times n\) matrix system that can be solved. The problem with this formulation is that it squares the condition number of the problem.

Briefly, the condition number is the largest singular value divided by the smallest singular value. If you don't know what that means, go read about singular values because singular values are cool. Anyhow, a big condition number means the problem is difficult to solve numerically, i.e. on a computer with finite precision. Squaring a condition number can make a problem impossible to solve in some cases.

One way to prove the condition number is squared is to take the singular value decomposition: \(A = U \Sigma V^T\). Then \begin{equation} A^T A = V \Sigma^T \Sigma V^T, \end{equation} which shows the singular values have been squared and hence the condition number is squared.

For some problems, it doesn't matter if the condition number is squared. For the example above of a quadratic regression, if the data is highly scattered and not at all quadratic, the condition number will be high and the QR decomposition may be required. But if the data is clearly quadratic in nature, the condition number of \(A\) will be small and you can use the normal equations. However, as you go to higher polynomial degrees, the condition number will increase.

The QR decomposition

The QR decomposition of a matrix \(A \in \mathbb{R}^{m \times n}\) is \(A = QR\) where \(Q \in \mathbb{R}^{m \times m}\) is an orthogonal matrix and \(R \in \mathbb{R}^{m \times n}\) is an upper triangular matrix. For the case we care about, \(m > n\), \(R\) has the form \begin{equation} \begin{pmatrix} R_1 \\ R_2 \end{pmatrix} = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} \end{equation} where \(R_1 \in \mathbb{R}^{n \times n}\) is an upper triangular matrix and \(R_2\) is the \((m-n) \times n\) zero matrix.

Definition: Q orthogonal means \(Q^T Q = Q Q^T = I\).

The nice thing about an orthogonal matrix is it can move in and out of the 2 norm. Specifically, for any vector \(v\), \begin{equation} ||Qv||_2 = ||Q||_2 \; ||v||_2 = ||v||_2. \end{equation}

Proof

\begin{equation} ||Qv||_2 \leq ||Q||_2 \; ||v||_2 = ||v||_2. \end{equation} \begin{equation} ||v||_2 = ||Q^T Q v||_2 \leq ||Q^T||_2 \; ||Qv||_2 = ||Qv||_2. \end{equation} Note: I used the fact the 2-norm of an orthogonal matrix is 1.

Solving the least squares problem with QR decomposition

Now for the method of solving the least squares problem. \begin{align}||Ax - b||_2^2 &= ||QRx - b||_2^2 \\ &= ||Q^T Q Rx - Q^Tb||_2^2 \\ &= ||Rx - Q^Tb||_2^2 \end{align} To minimize the last expression, write \(\tilde{b} = Q^T b\) and minimize \begin{equation} ||Rx - \tilde{b}||_2^2. \end{equation} Using the splitting for \(R\) given earlier, we have \begin{equation} Rx - \tilde b = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} x - \begin{pmatrix} \tilde{b}_1 \\ \tilde{b}_2 \end{pmatrix} \end{equation} which means we want to minimize \begin{equation} ||R_1 x - \tilde{b}_1||_2^2 + ||\tilde{b}_2||_2^2. \end{equation} The solution to the least squares problem is given by solving \begin{equation} R_1 x = \tilde{b}_1. \end{equation} This makes the first norm zero, which is the best we can do since the second norm is not dependent on \(x\).

Implementation

One implementation detail is that for a tall skinny matrix, one can perform a skinny QR decomposition. This is given by \(A = Q_1 R_1\) where \(Q_1 \in \mathbb{R}^{m \times n}\) is a tall, skinny matrix and \(R_1 \in \mathbb{R}^{n \times n}\) is a small square matrix. Contrast this with the original QR decomposition and we find that: (i) \(Q_1\) is the first \(n\) columns of \(Q\), and (ii) \(R_1\) is the first n rows of \(R\) which is the same as the definition of \(R_1\) above. Further \(\tilde b_1 = Q_1^T b\), so \(x\) is found by solving \begin{equation} R_1 x = Q_1^T b. \end{equation} The reason for using the skinny QR decomposition, is that it can be much faster to compute.