Math Equations & Insights
Vector Calculus
The Taylor polynomial is an approximation of a function \(f(x)\) around a point \(x_0\), using a polynomial constructed from the derivatives of \(f(x)\) at that point.
Def: The Taylor Polynomial of degree n of \(f: \mathbb{R} \rightarrow \mathbb{R}\) at \(x_0\) is defined as
\(T_n(x) = \sum_{k=0}^n \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k\)
where \(f^{(k)}(x_0)\) is the \(k^{\text{th}}\) derivative of f at \(x_0\) and \(\frac{f^{(k)}(x_0)}{k!}\) are the coefficients of the polynomial.
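As a quick worked example (a standard expansion, not part of the definition above): every derivative of \(f(x) = e^x\) equals \(e^x\), so with \(x_0 = 0\) all coefficients are \(\frac{1}{k!}\) and the degree-3 Taylor polynomial is
\[T_3(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}\]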
Functions
Product rule: \((f(x) g(x))' = f'(x) g(x) + f(x) g'(x)\)
Quotient rule: \((f(x) / g(x))' = (f'(x) g(x) - f(x) g'(x)) / (g(x))^2\)
Sum rule: \((f(x) + g(x))' = f'(x) + g'(x)\)
Chain rule: \((g(f(x)))' = (g \circ f)'(x) = g'(f(x))\,f'(x)\)
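A small worked example combining the rules above (illustrative, not from the original notes): for \(h(x) = g(f(x))\) with \(f(x) = x^2\) and \(g(u) = e^u\), the chain rule gives
\[h'(x) = g'(f(x))\, f'(x) = e^{x^2} \cdot 2x\]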
The generalization of the derivative to functions of several variables is the gradient. We find the gradient of the function \(f\) with respect to x by varying one variable at a time and keeping the others constant.
Def: For a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}, x \rightarrow f(x), x \in \mathbb{R}^n\) of n variables \(x_1, ..., x_n\) we define the partial derivatives as
\(\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, ..., x_n) - f(x)}{h}\)
…
\(\frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, x_2, ..., x_n + h) - f(x)}{h}\)
When we collect the partial derivatives in a row vector, this vector is called the gradient of \(f\); for a scalar-valued function it coincides with the Jacobian.
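For instance (an illustrative example): for \(f(x_1, x_2) = x_1^2 x_2 + x_2^3\), the partial derivatives are \(\frac{\partial f}{\partial x_1} = 2 x_1 x_2\) and \(\frac{\partial f}{\partial x_2} = x_1^2 + 3 x_2^2\), so the gradient is the row vector
\[\nabla f = \begin{bmatrix} 2 x_1 x_2 & x_1^2 + 3 x_2^2 \end{bmatrix}\]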
Sometimes we are interested in derivatives of higher order. Consider a function \(f: \mathbb{R}^2 \rightarrow \mathbb{R}\) of two variables x, y. We use the following notation for higher-order partial derivatives (and for gradients):
\(\frac{\partial^2 f}{\partial x^2}\) is the second partial derivative of f with respect to x
\(\frac{\partial^n f}{\partial x^n}\) is the \(n^{\text{th}}\) partial derivative of f with respect to x
The Hessian is the collection of all second-order partial derivatives. It generalizes the second derivative to multiple dimensions and is used to study the curvature of a function. It is a square matrix.
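Continuing the illustrative example above, the Hessian of \(f(x_1, x_2) = x_1^2 x_2 + x_2^3\) is the square matrix of all second-order partial derivatives:
\[H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & 6 x_2 \end{bmatrix}\]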
Multivariate Taylor Series
We consider a function \(f: \mathbb{R}^D \rightarrow \mathbb{R}\), \(x \rightarrow f(x)\), \(x \in \mathbb{R}^D\),
that is smooth at \(x_0\). The multivariate Taylor series of \(f\) at \(x_0\) is defined as
\(f(x) = \sum_{k=0}^\infty \frac{D_x^{k}f(x_0)}{k!} (x-x_0)^k\)
where \({D_x^{k}f(x_0)}\) is the \(k^{\text{th}}\) total derivative of \(f\) with respect to x, evaluated at \(x_0\).
The Taylor series is an infinite series that represents a function as a sum of terms based on its derivatives at a point. The Taylor series provides a precise representation of the function if it converges to the function at all points near the expansion point.
The Taylor polynomial is a finite approximation of the Taylor series, using only the terms up to degree \(n\). A Taylor polynomial of degree \(n\) includes terms up to the \(n^{\text{th}}\) (partial) derivatives of the function. It provides a local approximation near a given point but becomes less accurate as you move further from that point.
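A minimal numerical sketch of this behaviour (assuming NumPy is available; the function \(e^x\), the degree, and the sample points are chosen purely for illustration):

```python
import numpy as np

# Degree-3 Taylor polynomial of exp(x) around x0 = 0: 1 + x + x^2/2! + x^3/3!
def taylor_exp_deg3(x):
    return 1 + x + x**2 / 2 + x**3 / 6

for x in [0.1, 1.0, 3.0]:
    exact = np.exp(x)
    approx = taylor_exp_deg3(x)
    # The error grows as we move further from the expansion point x0 = 0.
    print(f"x={x}: exp(x)={exact:.4f}, T3(x)={approx:.4f}, error={abs(exact - approx):.4f}")
```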
Probability and Distributions
The sample space Ω is the set of all possible outcomes of the experiment.
An event \(A\) is a subset of the sample space that represents a particular outcome or group of outcomes we're interested in; the event space is the collection of all such events.
The probability \(P\): with each event \(A\), we associate a number \(P(A)\) that measures the probability that the event will occur.
The marginal probability that \(X\) takes the value \(x\) irrespective of the value of the random variable \(Y\) is written as \(p(x)\).
The fraction of instances (the conditional probability) for which \(Y = y\) given that \(X = x\) is written as \(p(y \mid x)\).
Def: The product rule relates the joint distribution to the conditional distribution via \(p(x, y) = p(y \mid x)\, p(x)\)
Bayes’ theorem
\(p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}\)
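A small worked example with hypothetical numbers (the values are made up purely for illustration): suppose \(p(y \mid x) = 0.9\), \(p(x) = 0.01\), and \(p(y) = 0.05\). Then
\[p(x \mid y) = \frac{0.9 \times 0.01}{0.05} = 0.18\]
so even a reliable observation \(y\) can leave the posterior probability of \(x\) well below 1 when the prior \(p(x)\) is small.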
Expectation
\(\mathbb{E}[X] = \sum_{i} x_i P(X=x_i)\) (for discrete variables)
\(\mathbb{E}[X] = \int_{-\infty}^\infty x\, f_X(x)\, dx\) (for continuous variables)
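As a worked example (a standard one): for a fair six-sided die with \(P(X = x_i) = \frac{1}{6}\) for \(x_i \in \{1, \dots, 6\}\),
\[\mathbb{E}[X] = \sum_{i=1}^{6} x_i \cdot \frac{1}{6} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5\]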
The expected value of a function g of a discrete random variable x is given by
\[\mathbb{E}[g(x)] = \sum_{x \in X} g(x) p(x)\]
The expected value of a function \(g: \mathbb{R} \rightarrow \mathbb{R}\) of a univariate continuous random variable x is given by
\[\mathbb{E}[g(x)] = \int_X g(x) p(x)\, dx\]
Sums and Transformations of Random Variables
\[\mathbb{E}[x+y] = \mathbb{E}[x] + \mathbb{E}[y]\]
\[\mathbb{E}[x-y] = \mathbb{E}[x] - \mathbb{E}[y]\]
Covariance
The covariance intuitively represents the notion of how dependent random variables are on one another.
The covariance between two univariate random variables X, Y \(\in \mathbb{R}\) is given by the expected product of their deviations from their respective means,
\[\text{Cov}_{x,y}[x, y] = \mathbb{E}_{x,y}[(x-\mathbb{E}_x[x])(y-\mathbb{E}_y[y])]\]
The covariance of a variable with itself is called the variance and is denoted by \(V_x[x]\). The square root of the variance is called the standard deviation \(\sigma(x)\).
If we consider two multivariate random variables X and Y with states \(x \in \mathbb{R}^D\) and \(y \in \mathbb{R}^E\) respectively, the covariance between X and Y is defined as
\(\text{Cov}[x, y] = \mathbb{E}[xy^T] - \mathbb{E}[x]\mathbb{E}[y]^T = \text{Cov}[y, x]^T \in \mathbb{R}^{D \times E}\)
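A minimal NumPy sketch of the empirical counterpart of this formula (the data, dimensions, and random seed are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))   # samples of a 2-dimensional X (D = 2)
y = x @ np.array([[1.0], [0.5]]) + 0.1 * rng.normal(size=(1000, 1))  # 1-dimensional Y (E = 1)

# Empirical estimate of Cov[x, y] = E[x y^T] - E[x] E[y]^T, a D x E matrix.
cov_xy = (x.T @ y) / len(x) - np.outer(x.mean(axis=0), y.mean(axis=0))
print(cov_xy.shape)  # (2, 1)
print(cov_xy)
```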
Correlation
The correlation between two random variables X, Y is given by
\[\text{Corr}[x, y] = \frac{\text{Cov}[x,y]}{\sqrt{V[x] V[y]}} \in [-1, 1]\]
The covariance (and correlation) indicate how two random variables are related. Positive correlation means that when \(x\) grows, \(y\) is also expected to grow; negative correlation means that as \(x\) increases, \(y\) is expected to decrease.
Statistical Independence
Two random variables X, Y are statistically independent if and only if \(p(x, y) = p(x)\, p(y)\)
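For example (a standard illustration): for two independent fair coin flips, each marginal is \(p(x) = p(y) = \frac{1}{2}\) and every pair of outcomes has joint probability
\[p(x, y) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} = p(x)\, p(y)\]
so the factorization holds.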
Gaussian Distribution
The Gaussian distribution is one of the most well-studied probability distributions for continuous-valued random variables. It’s also known as the normal distribution. For a univariate random variable, the Gaussian distribution has a density that is given by:
\[p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\]
Continuous Optimization
Gradient Descent
If we want to find a local optimum of a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), \(x \rightarrow f(x)\), we start with an initial guess \(x_0\) of the parameters we wish to optimize and then iterate according to
\[x_{i+1} = x_i - \gamma_i ((\nabla{f}) (x_i))^T\]
For a suitable step-size \(\gamma_i\), the sequence \(f(x_0) \geq f(x_1) \geq \dots\) converges to a local minimum.
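A minimal sketch of this update rule in NumPy (the quadratic objective, constant step-size, and iteration count are illustrative assumptions):

```python
import numpy as np

def f(x):
    # A simple convex quadratic objective, chosen only for illustration.
    return 0.5 * x @ x

def grad_f(x):
    # Gradient of the quadratic above.
    return x

x = np.array([3.0, -2.0])   # initial guess x_0
gamma = 0.1                 # step-size gamma_i (kept constant here for simplicity)

for i in range(50):
    x = x - gamma * grad_f(x)   # x_{i+1} = x_i - gamma_i * (grad f)(x_i)

print(x, f(x))  # x approaches the minimizer [0, 0]
```

If the constant step-size is chosen too large, the iterates can overshoot and the function values need not decrease, which is why the step-size above is described as "suitable".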
Convex Optimization
A convex function is a type of function with a specific shape and property that is crucial in optimization. Intuitively, a convex function has a bowl-shaped or U-shaped curve, which makes it easier to find the minimum of the function because any local minimum is also a global minimum.
A set C is a convex set if for any \(x, y \in C\) and for any scalar \(\theta\) with \(0 \leq \theta \leq 1\), we have
\[\theta x + (1 - \theta) y \in C\]
Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set.
Convex functions are functions such that a straight line connecting any two points on the graph of the function lies above (or on) the graph.
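A quick worked check (a standard example, not from the notes): \(f(x) = x^2\) is convex, since for any \(x, y\) and \(0 \leq \theta \leq 1\)
\[\theta f(x) + (1-\theta) f(y) - f(\theta x + (1-\theta) y) = \theta x^2 + (1-\theta) y^2 - (\theta x + (1-\theta) y)^2 = \theta (1-\theta) (x - y)^2 \geq 0,\]
so the chord between any two points on the graph lies above it.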