where $A \in \mathbb{R}^{n \times n}, B \in \mathbb{R}^{n \times m}, C \in \mathbb{R}^{p \times n}$, $x_k, u_k, y_k$ are system state, input and output vectors and $\omega_k$ is the i.i.d. Gaussian measurement noise. As an observer, we want to estimate the true state $x_k$ from output $y_k$.
To analyze this state estimation problem, let’s write the system dynamics for $N$ steps in a single equation:
\[\underbrace{\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{N-1} \end{pmatrix}}_{\mathbf{y}} = \underbrace{\begin{pmatrix} 0 &0&\cdots&0&0 \\ CB & 0&\cdots&0 & 0 \\ \vdots & & \ddots & &\vdots \\ CA^{N-2}B & CA^{N-3}B & \cdots & CB & 0 \end{pmatrix}}_{\mathcal{L}(A,B,C)} \underbrace{\begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N-1} \end{pmatrix}}_{\mathbf{u}} + \underbrace{\begin{pmatrix} C \\ CA \\ \vdots \\ CA^{N-1} \end{pmatrix}}_{\mathcal{O}(C,A)} x_0 + \underbrace{\begin{pmatrix} \omega_0 \\ \omega_1 \\ \vdots \\ \omega_{N-1} \end{pmatrix}}_{\mathbf{\omega}},\]which is
\[\begin{equation} \mathbf{y} = \mathcal{L}(A,B,C) \mathbf{u} + \mathcal{O}(C,A) x_0 + \mathbf{\omega}. \end{equation}\]From this equation, we find that if both $\mathbf{y}$ and $\mathbf{u}$ are available, then the initial state $x_0$ can be optimally estimated by
\[\hat{x}_0 = \mathcal{O}(C,A)^{\dagger} (\mathbf{y} - \mathcal{L}(A,B,C) \mathbf{u})\]as the ordinary least squares (OLS) estimator.
The state estimate $\hat{x}_0$ (and hence $\hat{x}_{0:N}$) is unique if $\mathcal{O}(C,A)$ has full column rank.
Note that “$\mathcal{O}(C,A)$ has full column rank” means $\mathrm{rank}(\mathcal{O}(C,A)) = n$, which is the classic observability condition. In other words, observability guarantees a unique $x_0$ given $\mathbf{y}$ and $\mathbf{u}$.
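As a hedged sketch (not from the original post; the system matrices and horizon below are made up for illustration), the OLS recovery of $x_0$ from $\mathbf{y}$ and $\mathbf{u}$ can be written in a few lines:

```python
# Illustrative sketch: build O(C,A) and L(A,B,C), then solve for x0 by least squares.
import numpy as np

def estimate_x0(A, B, C, ys, us):
    """ys = [y_0, ..., y_{N-1}], us = [u_0, ..., u_{N-1}];
    solves y - L u = O x0 in the least-squares sense."""
    N = len(ys)
    # Observability matrix O = [C; CA; ...; CA^{N-1}]
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(N)])
    # Strictly lower block-triangular input matrix L: block (i, j) = C A^{i-j-1} B for j < i
    p, m = C.shape[0], B.shape[1]
    L = np.zeros((N * p, N * m))
    for i in range(N):
        for j in range(i):
            L[i*p:(i+1)*p, j*m:(j+1)*m] = C @ np.linalg.matrix_power(A, i - j - 1) @ B
    y = np.concatenate(ys)
    u = np.concatenate(us)
    x0_hat, *_ = np.linalg.lstsq(O, y - L @ u, rcond=None)
    return x0_hat
```

In the noise-free observable case this recovers $x_0$ exactly; with measurement noise it returns the OLS estimate.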
However, sometimes we cannot access the input $\mathbf{u}$. If we only have the observation $\mathbf{y}$, then stricter assumptions on the matrix $C$ are required.
The state at time $k$ can be estimated directly by $\hat{x}_k = C^{\dagger} y_k$ if $C$ has full column rank ($\mathrm{rank}(C) = n$).
In this case, $C$ cannot be a “fat matrix” ($p < n$), which is unfortunately common in realistic applications: the output vector has fewer variables than the state.
Another choice for state estimation without inputs is to use the initial state $x_0$. In some scenarios the initial state can be accurately observed. Then we have another dynamic equation from $k=1$ to $k=N$ as
\[\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}}_{\tilde{\mathbf{y}}} = \underbrace{\begin{pmatrix} CB \\ CAB & CB \\ \vdots & & \ddots \\ CA^{N-1}B & \cdots & \cdots & CB \end{pmatrix}}_{\tilde{\mathcal{L}}(A,B,C)} \underbrace{\begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N-1} \end{pmatrix}}_{\mathbf{u}} + \underbrace{\begin{pmatrix} CA \\ CA^2 \\ \vdots \\ CA^{N} \end{pmatrix}}_{\mathcal{O}(CA,A)} x_0 + \underbrace{\begin{pmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_N \end{pmatrix}}_{\tilde{\mathbf{\omega}}},\]which is
\[\begin{equation} \tilde{\mathbf{y}} = \tilde{\mathcal{L}}(A,B,C) \mathbf{u} + \mathcal{O}(CA,A) x_0 + \tilde{\mathbf{\omega}}. \end{equation}\]Now we first estimate the input sequence $\hat{\mathbf{u}}$, then compute the state estimates from $\hat{\mathbf{u}}$ and the system dynamics. Similar to the previous case, given $\tilde{\mathbf{y}}$ and $x_0$, the optimal estimate of the inputs is
\[\hat{\mathbf{u}} = \tilde{\mathcal{L}}(A,B,C)^{\dagger} (\tilde{\mathbf{y}}-\mathcal{O}(CA,A) x_0).\]And the state estimation is
\[{\begin{pmatrix} \hat{x}_1 \\ \hat{x}_2 \\ \vdots \\ \hat{x}_N \end{pmatrix}} = {\begin{pmatrix} B \\ AB & B \\ \vdots & & \ddots \\ A^{N-1}B & \cdots & \cdots & B \end{pmatrix}} \hat{\mathbf{u}} + {\begin{pmatrix} A \\ A^2 \\ \vdots \\ A^{N} \end{pmatrix}} x_0.\]The state estimates are unique if the matrix $CB$ has full column rank ($\mathrm{rank}(CB) = m$).
If $CB$ has full column rank, then $\tilde{\mathcal{L}}(A,B,C)$ also has full column rank and its left pseudo-inverse is unique.
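A minimal sketch of this input-then-state recovery, under the same assumptions (known $x_0$, $CB$ full column rank); the matrices and values below are illustrative, not from the post:

```python
# Illustrative sketch: recover u from y_{1:N} and x_0, then roll the states forward.
import numpy as np

def estimate_inputs_and_states(A, B, C, ys, x0):
    """ys = [y_1, ..., y_N]; returns (u_hat of shape (N, m), [x_hat_1, ..., x_hat_N])."""
    N = len(ys)
    p, m = C.shape[0], B.shape[1]
    # Block lower-triangular L~ with CB on the diagonal: block (i, j) = C A^{i-j} B for j <= i
    L = np.zeros((N * p, N * m))
    for i in range(N):
        for j in range(i + 1):
            L[i*p:(i+1)*p, j*m:(j+1)*m] = C @ np.linalg.matrix_power(A, i - j) @ B
    # O(CA, A) x0 term: rows CA, CA^2, ..., CA^N
    O = np.vstack([C @ np.linalg.matrix_power(A, k + 1) for k in range(N)])
    y = np.concatenate(ys)
    u_hat, *_ = np.linalg.lstsq(L, y - O @ x0, rcond=None)
    u_hat = u_hat.reshape(N, m)
    # Forward-simulate the dynamics with the estimated inputs
    xs, x = [], x0
    for k in range(N):
        x = A @ x + B @ u_hat[k]
        xs.append(x)
    return u_hat, xs
```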
The Kalman filter is a widely used method for state estimation, and we will not discuss it in detail here (there are plenty of introductions to the KF on the web). We point out that the Kalman filter does not fall into any of the cases listed above, as it can handle systems with process noise:
\[x_{k+1} = A x_k +Bu_k + \nu_k,~ y_k = Cx_k + \omega_k.\]For this system, the $N$ steps dynamic equation is written as
\[\tilde{\mathbf{y}} = \tilde{\mathcal{L}}(A,B,C) \mathbf{u} + \mathcal{O}(CA,A) x_0 + \begin{pmatrix} C \\ CA & C \\ \vdots & & \ddots \\ CA^{N-1} & \cdots & \cdots & C \end{pmatrix} \begin{pmatrix} \nu_0 \\ \nu_1 \\\vdots \\ \nu_{N-1} \end{pmatrix}+ \tilde{\mathbf{\omega}}.\]Clearly, the previous estimators (estimating $x_0$ from $\mathbf{y}$ and $\mathbf{u}$, or estimating $\mathbf{u}$ from $\mathbf{y}$ and $x_0$) fail due to the process noise $\nu_k$. The Kalman filter requires all of $\mathbf{y},\mathbf{u},x_0$.
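For completeness, here is a minimal predict/update Kalman filter sketch in standard textbook form (not taken from this post; the covariance matrices `Qv` and `Rw` are assumed known):

```python
# Textbook Kalman filter sketch for x_{k+1} = A x_k + B u_k + v_k, y_k = C x_k + w_k.
import numpy as np

def kalman_filter(A, B, C, Qv, Rw, x0, P0, us, ys):
    """One predict/update pass; us = [u_0, ..., u_{N-1}], ys = [y_1, ..., y_N].
    Returns the filtered state estimates [x_hat_1, ..., x_hat_N]."""
    x, P, estimates = x0, P0, []
    for u, y in zip(us, ys):
        # Predict through the dynamics
        x = A @ x + B @ u
        P = A @ P @ A.T + Qv
        # Update with the new measurement
        S = C @ P @ C.T + Rw                  # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)        # Kalman gain
        x = x + K @ (y - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        estimates.append(x)
    return estimates
```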
I hope this blog helps you build a clearer and more comprehensive understanding of state estimation from the output $y_k$. Please contact me without hesitation if you find any mistakes in the blog.
Structural risk minimization (SRM) is an important method for model selection in statistical learning. The choice of the hypothesis function class is nontrivial since there is a trade-off concerning model complexity. Considering a classification task, the ideal classifier is unlikely to be contained in a simplistic model, while learning a complex model consumes substantial computation and risks overfitting.
SRM balances the estimation error and approximation error of learning by minimizing the sum of the empirical risk and a penalty term for model complexity. As the model becomes richer and more complex, the empirical risk decreases and eventually changes little, while the penalty term grows.
Empirical risk minimization is to minimize the estimation error on the given training data $\mathcal{T}$.
\[{f^{ERM}_{\mathcal{T}}} = \mathop{\arg\min}\limits_{f \in \mathcal{F}} \hat{\epsilon}_{\mathcal{T}} (f), ~~ \hat{\epsilon}_{\mathcal{T}} (f) = \frac{1}{m} \sum_{i=1}^m l(f(x_i),y_i)\]Rademacher complexity measures the capacity of a hypothesis class of real-valued functions. Denote $\mathcal{X}, \mathcal{Y}$ as the input and output spaces in a regression problem. $\mathcal{F}$ is a hypothesis function class, where $f:\mathcal{X} \to \mathcal{Y}, f\in \mathcal{F}$. For an arbitrary loss function class $\mathcal{L}$ associated with $\mathcal{F}$ mapping from $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, we have
\[\mathcal{L}=\{l(z): (x,y) \mapsto l(f(x),y), z:=(x,y), f\in \mathcal{F}\}.\]We then provide the following definition.
Definition 1 Given $\mathcal{L}$ as a function class mapping from $\mathcal{Z}$ to $\mathbb{R}$, and $S=\{z_i\}_{i=1}^m$ as a sequence of $m$ samples from $\mathcal{Z}$, the empirical Rademacher complexity of $\mathcal{L}$ with respect to $S$ is defined as
\[\hat{\mathfrak{R}}_S(\mathcal{L}) = \mathbb{E}_{\sigma} \left[ \sup_{l\in \mathcal{L}} \frac{1}{m} \sum_{i=1}^m \sigma_i l(z_i) \right],\]where $\sigma = (\sigma_i)_{i=1}^m$ are independent random variables distributed uniformly on $\{-1,1\}$. Each $\sigma_i$ is called a Rademacher variable. The Rademacher complexity of $\mathcal{L}$ is $\mathfrak{R}_m(\mathcal{L}) = \mathbb{E}_{S} \hat{\mathfrak{R}}_S(\mathcal{L})$.
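For a finite function class, the expectation over $\sigma$ can be approximated by Monte Carlo sampling. A small sketch (the class, losses, and draw count are my own illustrative choices, not from the post):

```python
# Monte Carlo estimate of the empirical Rademacher complexity of a finite class.
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=2000, seed=0):
    """loss_matrix[j, i] = l_j(z_i) for function j and sample i.
    Approximates E_sigma[ sup_l (1/m) sum_i sigma_i l(z_i) ]."""
    rng = np.random.default_rng(seed)
    _, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(loss_matrix @ sigma) / m  # sup over the finite class
    return total / n_draws
```

For instance, a two-function class $\{+1, -1\}$ on $m$ points yields $\mathbb{E}|{\textstyle\sum_i \sigma_i}|/m$, which shrinks at rate $O(1/\sqrt{m})$.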
The Rademacher complexity provides an upper bound on the difference between the empirical risk and the expected risk. The generalization bound is presented as follows.
Theorem 1 (One-sided bound) Let $\mathcal{L} $ be a function class mapping from $\mathcal{Z}$ to $[a,b]$ and $S=\{z_i\}_{i=1}^m$ be a sequence of $m$ i.i.d. samples from $\mathcal{Z}$. Then for any $\delta>0$, with at least $1-\delta$ probability
\[\mathbb{E}[l (z)] -\frac{1}{m} \sum_{i=1}^m l (z_i) \leq 2 \hat{\mathfrak{R}}_S(\mathcal{L} ) + 3 (b-a) \sqrt{\frac{\log \frac{2}{\delta}}{2m}}\]holds for all $l \in \mathcal{L}$.
(Two-sided bound) For any $\delta>0$, with at least $1-\delta$ probability
\[\sup_{l \in \mathcal{L} } \left|\mathbb{E}[l (z)] - \frac{1}{m} \sum_{i=1}^m l (z_i) \right| \leqslant 2 \hat{\mathfrak{R}}_S(\mathcal{L} ) + 3 (b-a) \sqrt{\frac{\log \frac{4}{\delta}}{2m}}\]
The Rademacher complexity possesses some important properties, which are helpful in calculation.
SRM consists of selecting an optimal function class index $1 \leq k^* \leq C$ and the ERM hypothesis $f^* \in \mathcal{F}_{k^*}$, which together minimize both the estimation error and the model complexity penalty.
\[f^{SRM}_{\mathcal{T}} = \mathop{\arg\min}\limits_{1\leqslant k \leqslant C, f\in \mathcal{F}_k} \hat{\epsilon}_{\mathcal{T}} (f) +2\mathfrak{R}_m(\mathcal{F}_k).\]Since we cannot obtain the true value of $\mathfrak{R}_m(\mathcal{F}_k)$, we substitute the empirical Rademacher complexity $\hat{\mathfrak{R}}_m(\mathcal{F}_k)$. As for why $\mathfrak{R}_m(\mathcal{F}_k)$ is multiplied by $2$ here, one reason is the subsequent theorem derivation (in order to apply Theorem 1). It would be natural to use a weighted sum of the empirical risk and the model penalty, but I have not figured out how to guarantee the learning bound in that case.
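For finite classes, the SRM selection rule above can be sketched directly; the penalty function and toy classes in the usage below are hypothetical illustrations, not from the post:

```python
# Sketch of SRM over a list of finite hypothesis classes:
# minimize empirical risk + 2 * complexity penalty of the class.
import numpy as np

def srm_select(classes, risk_fn, penalty_fn):
    """classes: list of hypothesis lists F_1, ..., F_C;
    risk_fn(f): empirical risk; penalty_fn(F_k): complexity of class F_k.
    Returns (k*, f*) minimizing the SRM objective."""
    best_k, best_f, best_obj = None, None, np.inf
    for k, F_k in enumerate(classes):
        pen = 2.0 * penalty_fn(F_k)       # class-level penalty, paid once per class
        for f in F_k:
            obj = risk_fn(f) + pen        # empirical risk + penalty
            if obj < best_obj:
                best_k, best_f, best_obj = k, f, obj
    return best_k, best_f
```

Here `penalty_fn` could be the Monte Carlo empirical Rademacher complexity, or any surrogate that grows with the class size.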
Theorem 2 (SRM learning guarantee) For the SRM solution $f^{SRM}_{\mathcal{T}}$, with probability at least $1-\delta$,
\[\epsilon_{\mathcal{D}}(f_{\mathcal{T}}^{SRM}) \leqslant \inf_{f \in \mathcal{F}} \left( \epsilon_{\mathcal{D}}(f) + 4 \mathfrak{R}_m({\mathcal{F}}_{k(f)}) \right)+ 3 (b-a) \sqrt{\frac{\log \frac{4(C+1)}{\delta}}{2m}}\]holds.
The solution is the classic linear feedback controller $u_k = K_k x_k$, and the feedback gain matrix $K_k$ (time-varying in the finite horizon) is computed by the DARE:
\[\begin{align} & P_{k-1} = A^TP_{k}A - A^T P_{k} B(R+ B^T P_{k}B)^{-1} B^TP_{k}A+Q \\ & K_k = (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A, \end{align}\]where $P_N = H$, and $H,Q,R$ are all positive definite matrices. Riccati equations are applied in various fields such as the LQR and the Kalman filter. The DARE iteration has some interesting properties: the sequence $P_{N:0} = P_N, P_{N-1}, \dots, P_0$ generated by (1) is provably monotone and converges to its fixed point (also the solution of the algebraic Riccati equation) as $N$ goes to infinity. Now we present the I-DARE problem.
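The backward recursion (1) is straightforward to implement. The following sketch uses toy matrices of my own choosing; it also lets one check numerically that scaling $(H,Q,R)$ by a positive scalar leaves the gain sequence unchanged:

```python
# Backward DARE iteration from P_N = H, collecting the gains K_0, ..., K_{N-1}.
import numpy as np

def dare_gains(A, B, H, Q, R, N):
    """Run P_{k-1} = A'PA - A'PB(R + B'PB)^{-1}B'PA + Q backward from P_N = H,
    with K_k = (R + B'P_{k+1}B)^{-1} B'P_{k+1} A at each step."""
    P = H.copy()
    Ks = []
    for _ in range(N):
        # P currently plays the role of P_{k+1}
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        Ks.append(K)
        P = A.T @ P @ A - A.T @ P @ B @ K + Q
    return list(reversed(Ks))  # chronological order K_0, ..., K_{N-1}
```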
I-DARE Problem: Suppose the system matrices $A,B$ are known. Given the exact sequence $K_{0:N-1}$ as the solution of DARE, can we infer the parameter matrices $H,Q,R$?
\[\hat{H},\hat{Q},\hat{R} = \textit{I-DARE}(K_{0:N-1})\]
There are two points we need to think over.
Uniqueness (scalar ambiguity) Notice that for a DARE iteration, the parameter sets $(H,Q,R)$ and $(\alpha H, \alpha Q, \alpha R)$ with any scalar $\alpha \in \mathbb{R}^+$ generate the same solution sequence $K_{0:N-1}$. Does the inverse problem inherit this ambiguity? Yes.
Identifiability Are $(H,Q,R)$ in the I-DARE problem always identifiable? No; some conditions are needed.
Suppose two gain matrix sequences $K_{0:N-1}, K'_{0:N-1}$ are generated from two parameter sets $(H,Q,R)$ and $(H',Q',R')$ respectively through the DARE. If $K_k=K_k'$ for all $k$, must $H' = \alpha H, Q' = \alpha Q, R'=\alpha R$ hold for some $\alpha \in \mathbb{R}^+$?
The derivation sketch consists of three main steps. (i) Subtract the Riccati equation for $k=i+1$ from the equation for $k=i$ and rewrite the difference in matrix form:
\[R^{-1} \widetilde{\mathcal{BA}}_i \widetilde{QH}_i \widetilde{\mathcal{A}^c}_i = R'^{-1} \widetilde{\mathcal{BA}}_i \widetilde{Q'H'}_i \widetilde{\mathcal{A}^c}_i.\](ii) Take the trace of both sides and we obtain
\[\mathrm{tr}(R^{-1} \widetilde{\mathcal{BA}}_i \widetilde{QH}_i \widetilde{\mathcal{A}^c}_i) = vec(R^{-1})^T \underbrace{(\widetilde{\mathcal{A}^c}_i^T \otimes \widetilde{\mathcal{BA}}_i)}_{\mathcal{E}_i} vec(\widetilde{QH}_i)\]and
\[(\begin{bmatrix} vec(\overline{Q})^T, vec(\overline{H})^T \end{bmatrix} \otimes vec(\overline{R^{-1}})^T) vec(\mathcal{P}_i(\mathcal{E}_i)) = (\begin{bmatrix} vec(\overline{Q'})^T, vec(\overline{H'})^T \end{bmatrix} \otimes vec(\overline{R'^{-1}})^T) vec(\mathcal{P}_i(\mathcal{E}_i)),\]where $\mathcal{P}_i(\mathcal{E}_i)$ is a $\frac{m(m+1)}{2} \times n(n+1)$ matrix.
(iii) If there exist at least $\frac{mn(n+1)(m+1)}{2}$ linearly independent vectors $vec(\mathcal{P}_i(\mathcal{E}_i))$ in the horizon $k=0,\dots, N-1$, then we can combine them as a full row rank matrix and obtain
\[vec(\overline{Q'})=\alpha \cdot vec(\overline{Q}), vec(\overline{H'}) =\alpha \cdot vec(\overline{H}), vec(\overline{R'^{-1}}) =\frac{1}{\alpha} \cdot vec(\overline{R^{-1}}).\]Here we provide the following theorem, where $m,n$ are the dimensions of the input and state vectors.
Theorem 1 If the control horizon satisfies $N < \frac{mn(n+1)(m+1)}{2}$, the true weight parameters $H,Q,R$ of the control objective can never be identified exactly, which can be exploited to preserve the system's intent.
If you are interested in IOC-LQR and this I-DARE problem, I also suggest reading H. Zhang (inferring $R$) and C. Yu (inferring $Q,R$).