Math Equations & Insights
Vector Calculus
The Taylor polynomial is an approximation of a function \(f(x)\) around a point \(x_0\), using a polynomial constructed from the derivatives of \(f(x)\) at that point.
Def: The Taylor Polynomial of degree n of \(f: \mathbb{R} \rightarrow \mathbb{R}\) at \(x_0\) is defined as
\(T_n(x) = \sum_{k=0}^n \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k\)
where \(f^{(k)}(x_0)\) is the \(k^{\text{th}}\) derivative of f at \(x_0\) and \(\frac{f^{(k)}(x_0)}{k!}\) are the coefficients of the polynomial.
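As a quick worked example (a standard expansion, not part of the definition above): every derivative of \(f(x) = e^x\) equals \(e^x\), so with \(x_0 = 0\) all coefficients are \(\frac{1}{k!}\) and the degree-3 Taylor polynomial is
\[T_3(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}\]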
Functions
Product rule: \((f(x) g(x))' = f'(x) g(x) + f(x) g'(x)\)
Quotient rule: \((f(x) / g(x))' = (f'(x) g(x) - f(x) g'(x)) / (g(x))^2\)
Sum rule: \((f(x) + g(x))' = f'(x) + g'(x)\)
Chain rule: \((g(f(x)))' = (g \circ f)'(x) = g'(f(x))\,f'(x)\)
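A small worked example combining the rules above (illustrative, not from the original notes): for \(h(x) = g(f(x))\) with \(f(x) = x^2\) and \(g(u) = e^u\), the chain rule gives
\[h'(x) = g'(f(x))\, f'(x) = e^{x^2} \cdot 2x\]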
The generalization of the derivative to functions of several variables is the gradient. We find the gradient of the function \(f\) with respect to x by varying one variable at a time and keeping the others constant.
Def: For a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}, x \rightarrow f(x), x \in \mathbb{R}^n\) of n variables \(x_1, ..., x_n\) we define the partial derivatives as
\(\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, ..., x_n) - f(x)}{h}\)
…
\(\frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, x_2, ..., x_n + h) - f(x)}{h}\)
When we collect the partial derivatives in a row vector, this vector is called the gradient of \(f\); for a scalar-valued function it coincides with the Jacobian.
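For instance (an illustrative example): for \(f(x_1, x_2) = x_1^2 x_2 + x_2^3\), the partial derivatives are \(\frac{\partial f}{\partial x_1} = 2 x_1 x_2\) and \(\frac{\partial f}{\partial x_2} = x_1^2 + 3 x_2^2\), so the gradient is the row vector
\[\nabla f = \begin{bmatrix} 2 x_1 x_2 & x_1^2 + 3 x_2^2 \end{bmatrix}\]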
Sometimes we are interested in derivatives of higher order. Consider a function \(f: \mathbb{R}^2 \rightarrow \mathbb{R}\) of two variables x, y. We use the following notation for higher-order partial derivatives (and for gradients):
\(\frac{\partial^2 f}{\partial x^2}\) is the second partial derivative of f with respect to x
\(\frac{\partial^n f}{\partial x^n}\) is the \(n^{\text{th}}\) partial derivative of f with respect to x
The Hessian is the collection of all second-order partial derivatives. It generalizes the second derivative to multiple dimensions and is used to study the curvature of a function. It is a square matrix.
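Continuing the illustrative example above, the Hessian of \(f(x_1, x_2) = x_1^2 x_2 + x_2^3\) is the square matrix of all second-order partial derivatives:
\[H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & 6 x_2 \end{bmatrix}\]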
Multivariate Taylor Series
We consider a function \(f: \mathbb{R}^D \rightarrow \mathbb{R}\), \(x \rightarrow f(x)\), \(x \in \mathbb{R}^D\),
that is smooth at \(x_0\). The multivariate Taylor series of \(f\) at \(x_0\) is defined as
\(f(x) = \sum_{k=0}^\infty \frac{D_x^{k}f(x_0)}{k!} (x-x_0)^k\)
where \({D_x^{k}f(x_0)}\) is the \(k^{\text{th}}\) total derivative of \(f\) with respect to x, evaluated at \(x_0\).
The Taylor series is an infinite series that represents a function as a sum of terms based on its derivatives at a point. The Taylor series provides a precise representation of the function if it converges to the function at all points near the expansion point.
The Taylor polynomial is a finite approximation of the Taylor series, using only the terms up to degree \(n\). A Taylor polynomial of degree \(n\) includes terms up to the \(n^{\text{th}}\) (partial) derivatives of the function. It provides a local approximation near a given point but becomes less accurate as you move further from that point.
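A minimal numerical sketch of this behaviour (assuming NumPy is available; the function \(e^x\), the degree, and the sample points are chosen purely for illustration):

```python
import numpy as np

# Degree-3 Taylor polynomial of exp(x) around x0 = 0: 1 + x + x^2/2! + x^3/3!
def taylor_exp_deg3(x):
    return 1 + x + x**2 / 2 + x**3 / 6

for x in [0.1, 1.0, 3.0]:
    exact = np.exp(x)
    approx = taylor_exp_deg3(x)
    # The error grows as we move further from the expansion point x0 = 0.
    print(f"x={x}: exp(x)={exact:.4f}, T3(x)={approx:.4f}, error={abs(exact - approx):.4f}")
```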
Probability and Distributions
The sample space Ω is the set of all possible outcomes of the experiment.
An event \(A\) is a subset of the sample space that represents a particular outcome or group of outcomes we're interested in; the event space is the collection of all such events.
The probability \(P\): with each event \(A\), we associate a number \(P(A)\) that measures the probability that the event will occur.
The marginal probability that \(X\) takes the value \(x\) irrespective of the value of the random variable \(Y\) is written as \(p(x)\).
The fraction of instances (the conditional probability) for which \(Y = y\) given that \(X = x\) is written as \(p(y \mid x)\).
Def: The product rule relates the joint distribution to the conditional distribution via \(p(x, y) = p(y \mid x)\, p(x)\)
Bayes’ theorem
\(p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}\)
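A small worked example with hypothetical numbers (the values are made up purely for illustration): suppose \(p(y \mid x) = 0.9\), \(p(x) = 0.01\), and \(p(y) = 0.05\). Then
\[p(x \mid y) = \frac{0.9 \times 0.01}{0.05} = 0.18\]
so even a reliable observation \(y\) can leave the posterior probability of \(x\) well below 1 when the prior \(p(x)\) is small.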
Expectation
\(\mathbb{E}[X] = \sum_{i} x_i P(X=x_i)\) (for discrete variables)
\(\mathbb{E}[X] = \int_{-\infty}^\infty x\, f_X(x)\, dx\) (for continuous variables)
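As a worked example (a standard one): for a fair six-sided die with \(P(X = x_i) = \frac{1}{6}\) for \(x_i \in \{1, \dots, 6\}\),
\[\mathbb{E}[X] = \sum_{i=1}^{6} x_i \cdot \frac{1}{6} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5\]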
The expected value of a function g of a discrete random variable x is given by
\[\mathbb{E}[g(x)] = \sum_{x \in X} g(x) p(x)\]
The expected value of a function \(g: \mathbb{R} \rightarrow \mathbb{R}\) of a univariate continuous random variable x is given by
\[\mathbb{E}[g(x)] = \int_X g(x) p(x)\, dx\]
Sums and Transformations of Random Variables
\[\mathbb{E}[x+y] = \mathbb{E}[x] + \mathbb{E}[y]\]
\[\mathbb{E}[x-y] = \mathbb{E}[x] - \mathbb{E}[y]\]
Covariance
The covariance intuitively represents the notion of how dependent random variables are on one another.
The covariance between two univariate random variables X, Y \(\in \mathbb{R}\) is given by the expected product of their deviations from their respective means,
\[\text{Cov}_{x,y}[x, y] = \mathbb{E}_{x,y}[(x-\mathbb{E}_x[x])(y-\mathbb{E}_y[y])]\]
The covariance of a variable with itself is called the variance and is denoted by \(V_x[x]\). The square root of the variance is called the standard deviation \(\sigma(x)\).
If we consider two multivariate random variables X and Y with states \(x \in \mathbb{R}^D\) and \(y \in \mathbb{R}^E\) respectively, the covariance between X and Y is defined as
\(\text{Cov}[x, y] = \mathbb{E}[xy^T] - \mathbb{E}[x]\mathbb{E}[y]^T = \text{Cov}[y, x]^T \in \mathbb{R}^{D \times E}\)
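A minimal NumPy sketch of the empirical counterpart of this formula (the data, dimensions, and random seed are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))   # samples of a 2-dimensional X (D = 2)
y = x @ np.array([[1.0], [0.5]]) + 0.1 * rng.normal(size=(1000, 1))  # 1-dimensional Y (E = 1)

# Empirical estimate of Cov[x, y] = E[x y^T] - E[x] E[y]^T, a D x E matrix.
cov_xy = (x.T @ y) / len(x) - np.outer(x.mean(axis=0), y.mean(axis=0))
print(cov_xy.shape)  # (2, 1)
print(cov_xy)
```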
Correlation
The correlation between two random variables X, Y is given by
\[\text{Corr}[x, y] = \frac{\text{Cov}[x,y]}{\sqrt{V[x] V[y]}} \in [-1, 1]\]
The covariance (and correlation) indicate how two random variables are related. Positive correlation means that when \(x\) grows, \(y\) is also expected to grow; negative correlation means that as \(x\) increases, \(y\) is expected to decrease.
Statistical Independence
Two random variables X, Y are statistically independent if and only if \(p(x, y) = p(x)\, p(y)\)
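For example (a standard illustration): for two independent fair coin flips, each marginal is \(p(x) = p(y) = \frac{1}{2}\) and every pair of outcomes has joint probability
\[p(x, y) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} = p(x)\, p(y)\]
so the factorization holds.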
Gaussian Distribution
The Gaussian distribution is one of the most well-studied probability distributions for continuous-valued random variables. It’s also known as the normal distribution. For a univariate random variable, the Gaussian distribution has a density that is given by:
\[p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\]
Continuous Optimization
Gradient Descent
If we want to find a local optimum of a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), \(x \rightarrow f(x)\), we start with an initial guess \(x_0\) of the parameters we wish to optimize and then iterate according to
\[x_{i+1} = x_i - \gamma_i ((\nabla{f}) (x_i))^T\]
For a suitable step-size \(\gamma_i\), the sequence \(f(x_0) \geq f(x_1) \geq \dots\) converges to a local minimum.
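A minimal sketch of this update rule in NumPy (the quadratic objective, constant step-size, and iteration count are illustrative assumptions):

```python
import numpy as np

def f(x):
    # A simple convex quadratic objective, chosen only for illustration.
    return 0.5 * x @ x

def grad_f(x):
    # Gradient of the quadratic above.
    return x

x = np.array([3.0, -2.0])   # initial guess x_0
gamma = 0.1                 # step-size gamma_i (kept constant here for simplicity)

for i in range(50):
    x = x - gamma * grad_f(x)   # x_{i+1} = x_i - gamma_i * (grad f)(x_i)

print(x, f(x))  # x approaches the minimizer [0, 0]
```

If the constant step-size is chosen too large, the iterates can overshoot and the function values need not decrease, which is why the step-size above is described as "suitable".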
Convex Optimization
A convex function is a type of function with a specific shape and property that is crucial in optimization. Intuitively, a convex function has a bowl-shaped or U-shaped curve, which makes it easier to find the minimum of the function because any local minimum is also a global minimum.
A set C is a convex set if for any \(x, y \in C\) and for any scalar \(\theta\) with \(0 \leq \theta \leq 1\), we have
\[\theta x + (1 - \theta) y \in C\]
Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set.
Convex functions are functions such that a straight line connecting any two points on the graph of the function lies above (or on) the graph.
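A quick worked check (a standard example, not from the notes): \(f(x) = x^2\) is convex, since for any \(x, y\) and \(0 \leq \theta \leq 1\)
\[\theta f(x) + (1-\theta) f(y) - f(\theta x + (1-\theta) y) = \theta x^2 + (1-\theta) y^2 - (\theta x + (1-\theta) y)^2 = \theta (1-\theta) (x - y)^2 \geq 0,\]
so the chord between any two points on the graph lies above it.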