News

My basil finally developed roots! I'm currently reading about quantum mechanics, ferrofluids, and language models.

Blog

[08/09/2024:] Motivating Ladder Operators II
[08/08/2024:] A Cool Way to Garden
[08/08/2024:] Life and Basil Limeade
[08/05/2024:] Motivating Ladder Operators
[08/04/2024:] Migrated to AstroJS
[07/31/2024:] The Classical in the Quantum

Notes

Working on notes on quantum mechanics and derivatives, and on uploading my previous course notes to this blog!

Projects

Finally started a projects page! I've recently made some nice upgrades to my post component, so it looks pretty clean! ;)

🌊

I'm considering whether or not to continue this project using WebGL or Three.js.

I'm also researching methods for generating the 3D scenes I want for this project automatically.

In the meantime, I've decided to proceed with some preliminary prototypes of the other interactive parts of this project.

Orange Juice

I like orange juice. :)

Mlog


EE A227BT: Convex Optimization

Fall 2024
By Aathreya Kadambi
Expanding on Lectures by Professor Benjamin Recht

This fall I’m thinking of auditing this convex optimization class. According to Professor Recht, these are the problems you can solve, and your computer can solve (assuming some stuff). This class is about models that you know you are going to be able to solve; then you can work on the ones you don’t know how to solve, and that’s called research.

Story 1: A Survey

Relevant Lecture: Lecture 1

$\min f_0(x)$ subject to $x \in \mathcal{C}$, where $x \in \R^n$, $\mathcal{C} \subseteq \R^n$, and $f_0 : \R^n \rightarrow \R$.

$\max r(x)$ subject to $x \in \mathcal{C}$ is the same thing. But while these two are the same mathematically, there is a difference in mindset. The first is like minimizing a cost, whereas the latter is maximizing a reward. Some people are minimizers and others are maximizers.

Sometimes we just have a constraint like $f_0(x) = 0$, and we want to find a solution. Feasibility.

Optimization problems:

  • Circuits
  • Portfolio
  • Lecture hall assignment (minimize the distance the professors have to walk)
  • Scheduling classes
  • Traffic routes

He then introduced convex sets and convex functions. A convex set is one where every line segment between two points in the set lies in the set. A convex function is one such that every line segment between two points on its graph lies above the graph. We can get a convex set from a convex function: the set lying above the graph (the epigraph). Notice that the intersection of convex sets is convex. Let $\mathcal{E}_{f_0}$ be the epigraph of a convex function $f_0$. We can make a minimization problem: $\min t$ subject to $x \in \mathcal{C}$ and $(x,t) \in \mathcal{E}_{f_0}$.

I think if you intersect all half-planes containing a closed convex set, you get the set back, and the union of the complements of those half-planes gives you the complement, ish, of the set. For a finite collection of half-planes, we get a polyhedron $Ax \le b$, and we write the linear program: $\min c^\top x$ subject to $Ax \le b$. We will talk about this, but we will take a different view than the one taken in a linear programming class.
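As a concrete toy instance of $\min c^\top x$ subject to $Ax \le b$ (my own made-up numbers, not from lecture), SciPy's `linprog` can solve tiny LPs directly:

```python
# Hypothetical example: a tiny LP  min c^T x  s.t.  A x <= b.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 1.0])
A = np.array([[-1.0, 0.0],    # -x1 <= -1  (i.e. x1 >= 1)
              [0.0, -1.0]])   # -x2 <= -2  (i.e. x2 >= 2)
b = np.array([-1.0, -2.0])

# bounds=(None, None) removes linprog's default x >= 0 restriction
res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
print(res.x, res.fun)  # optimum at (1, 2) with value 3
```

The optimum lands on a corner of the polyhedron, which is the geometric picture a linear programming class would emphasize.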

“People will tell you, most problems are not convex, but it’s all a mindset, bro.” - Benjamin Recht.

“I can turn any problem into a convex problem, by like rewriting it in a funny way. But I can’t always solve it.” - Benjamin Recht.

Let’s say we have a nonconvex function $g$ that we want to minimize. We can instead minimize, over all probability distributions $P$, the expected value $\mathbb{E}_P[g(x)]$. This problem is linear in $P$, and probability distributions form a convex set. But it isn’t easy to solve. “This is one trick, there are a million tricks. But it is not always possible to solve it.”
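A tiny sketch of why this trick is linear (my own discretized example, not from lecture): on a finite grid, $\mathbb{E}_p[g] = p \cdot g$ is a linear function of $p$, and a linear function over the simplex is minimized at a vertex, i.e. a point mass on the minimizer of $g$.

```python
# Discretized version of the "minimize over distributions" trick:
# E_p[g] = p . g is linear in p; the minimum over the probability
# simplex is a point mass on argmin g.
import numpy as np

xs = np.linspace(-2.0, 2.0, 401)
g = np.cos(3 * xs) + 0.5 * xs**2     # a nonconvex function

p = np.zeros_like(g)
p[np.argmin(g)] = 1.0                # point mass on the grid minimizer

# the linear problem's optimal value matches the nonconvex minimum
assert np.isclose(p @ g, g.min())
```

So convexifying didn’t make the problem easier here: finding the optimal $p$ is exactly as hard as finding the minimizer of $g$.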

“I don’t like the name of the class. It should be ‘officially solvable convex problems’.”

It’s just a notation thing, but this notation gives you a convex problem for the nonconvex problem.

Remark. Unfortunately I had to miss two stories because of some prior commitments.

Story 4: Tangent Planes and Optimality

We wish to minimize $x_0$ subject to $x \in \mathcal{C} \subseteq \R^d$.

Tangent cone, feasible cone. Some weird letter thingy is the tangent cone:

$\mathscr{T}_{x_*} = \{d : \exists \alpha > 0 \text{ with } x_* + \alpha d \in \mathcal{C}\}$

Remark. One of the hard parts about convex geometry is that in 2D, things are misleading. For example, in 2D, the convex hull of $\{x_i = \pm 1\}$ is the square. In higher dimensions it is the hypercube, which has $2^n$ corners and $2n$ supporting hyperplanes. Now the convex hull of $\{\pm e_i\}$ has $2n$ corners but $2^n$ supporting hyperplanes, despite the fact that the two look exactly the same in 2D. Basically we’re comparing a cube and a diamond thingy.

What is the set of directions that decrease the cost function?

$\mathscr{D} = \{ d : d_0 < 0 \}$,
since any direction with $d_0 < 0$ decreases the cost (we are minimizing $x_0$). So $x_*$ is optimal if $\mathscr{D} \cap \mathscr{T}_{x_*} = \varnothing$. A way to prove this is to show there is a function $h : \R^d \rightarrow \R$ with $h(x) > 0$ on $\mathscr{D}$ and $h(x) \le 0$ on $\mathscr{T}_{x_*}$.

We study convex optimization because it has a dumb-simple reason that local optimality equals global optimality. But if you’re a math nerd, it’s interesting because there are some nice algebraic and geometric ideas here.

By the global/local stuff here, does he mean the local part is that the tangent cone at the point doesn’t intersect the set of cost-decreasing directions, and the global part is the global optimality result?

Story 5: Separating hyperplanes

Theorem. If $\mathcal{C}_1$ and $\mathcal{C}_2$ are nonempty, disjoint, and convex, then $\exists a \in \R^n$ and $b \in \R$ such that $a^\top x \le b$ for all $x \in \mathcal{C}_1$ and $a^\top z \ge b$ for all $z \in \mathcal{C}_2$.

Notice that before, we were trying to find a separating hyperplane between $\mathscr{D}$ and $\mathscr{T}$.

Proof Sketch.

We assume that there are closest points $u$ and $v$ between the two sets, with $u \in \mathcal{C}_1$ and $v \in \mathcal{C}_2$. Then take the segment between them and the hyperplane orthogonal to that segment.
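The proof sketch can be played out numerically on a toy case (two disjoint balls, my own example) where the closest points $u, v$ are known in closed form: the hyperplane through the midpoint of the segment, with normal $v - u$, separates the sets.

```python
# Separating-hyperplane sketch for two disjoint balls: build the
# hyperplane from the closest points u, v, then check separation.
import numpy as np

rng = np.random.default_rng(0)
c1, r1 = np.array([0.0, 0.0]), 1.0
c2, r2 = np.array([5.0, 1.0]), 1.5

d = c2 - c1
dist = np.linalg.norm(d)
u = c1 + r1 * d / dist      # closest point of ball 1 to ball 2
v = c2 - r2 * d / dist      # closest point of ball 2 to ball 1

a = v - u                   # normal along the segment u -> v
b = a @ (u + v) / 2         # hyperplane through the midpoint

# sample points from each ball and check a.x <= b vs a.z >= b
dirs = rng.normal(size=(500, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
radii = rng.uniform(0, 1, size=(500, 1))
ball1 = c1 + r1 * radii * dirs
ball2 = c2 + r2 * radii * dirs

assert (ball1 @ a <= b + 1e-9).all()
assert (ball2 @ a >= b - 1e-9).all()
```

The existence of closest points is where closedness matters; for open or unbounded sets the infimum may not be attained, which is why the theorem statement only gives non-strict separation.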

Theorem. If $\mathcal{C}$ is closed and convex, let $S$ be the intersection of all half-spaces containing $\mathcal{C}$. Then $\mathcal{C} = S$.

Story 7: Convex Functions

Definitions for convexity:

  • For all $x, z \in \text{dom }f$ and $\theta \in [0,1]$, $f((1-\theta)x + \theta z) \le (1-\theta)f(x) + \theta f(z)$
  • For all $x, z \in \text{dom }f$, $f(z) \ge f(x) + \nabla f(x)^\top (z-x)$
  • For all $x \in \text{dom }f$, $\nabla^2 f(x) \succeq 0$

We also have some operations that preserve convexity:

  • Nonnegative linear combinations
  • Affine composition $f(Ax + b)$
  • Maximization $\max_{i \in I} f_i(x)$
  • Partial minimization $\min_z f(x,z)$
  • Composition with a scalar, convex, nondecreasing function

Who has tried the eigenvalue solver? You can put in a matrix that’s positive definite and get a negative number out. It happens especially as you get rank deficient. It is a real problem, and when you’re writing code you have to put in safety checks: if you have a negative eigenvalue that’s a factor of $10^{12}$ smaller than the largest eigenvalue, you have to treat it as zero and put in a safety check to deal with that.

Remark. Dang that’s some really good advice/wisdom ngl.

Notice that least squares, $\frac{1}{2n} \sum_{i=1}^n (a_i^\top x - b_i)^2$, is a convex function.

It should be our goal to reduce all problems to finding the least squares solution.

If $A$ is a symmetric matrix, we can define a quadratic function $f(x) = \frac{1}{2}x^\top Ax$.

Remark. Always put a $\frac{1}{2}$ in front of your quadratics. It makes your derivatives nicer.

You should have the gradient of the least squares loss and the gradient of $q(x) = \frac{1}{2}x^\top A x + b^\top x + c$ memorized.
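For reference, those gradients are $\nabla f = \frac{1}{n}A^\top(Ax-b)$ for $f(x)=\frac{1}{2n}\|Ax-b\|^2$ and $\nabla q = Ax + b$ for symmetric $A$. A quick finite-difference check (my own sanity-check script, not from lecture):

```python
# Finite-difference check of the two gradients worth memorizing:
# f(x) = (1/2n)||Ax - b||^2  has grad  (1/n) A^T (Ax - b),
# q(x) = (1/2) x^T S x + p^T x + c  (S symmetric) has grad  S x + p.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
x = rng.normal(size=d)

f = lambda y: 0.5 / n * np.sum((A @ y - b) ** 2)
grad_f = A.T @ (A @ x - b) / n

S = rng.normal(size=(d, d)); S = (S + S.T) / 2    # symmetric
p, c = rng.normal(size=d), 0.7
q = lambda y: 0.5 * y @ S @ y + p @ y + c
grad_q = S @ x + p

eps = 1e-6
num_f = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(d)])
num_q = np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps) for e in np.eye(d)])

assert np.allclose(num_f, grad_f, atol=1e-5)
assert np.allclose(num_q, grad_q, atol=1e-5)
```

Note how the $\frac{1}{2}$ from the previous remark is exactly what makes the factor of 2 from the chain rule disappear.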

Why do we care about the log of a determinant? For a positive definite matrix.

  • The log of the determinant of a covariance is apparently the entropy of a Gaussian or something? There’s some paper on maximizing the log of the determinant. Why do we want to maximize and not minimize? It is concave.
  • I think maybe it also tells us whether or not $X$ expands or contracts the space overall or something? Not sure tbh.

Take $X = Z + tV$, $t \in \R$:

$\log \det (Z + tV) = \log \det (Z^{1/2} (I + tZ^{-1/2} VZ^{-1/2})Z^{1/2}) = \log \det Z + \log \det (I + tZ^{-1/2}VZ^{-1/2})$
using the property that $\det (XW) = \det(X) \det (W)$. Now if we let $\mu_i$ be the eigenvalues of $Z^{-1/2}VZ^{-1/2}$, the eigenvalues of $I + tZ^{-1/2}VZ^{-1/2}$ are $1 + t\mu_i$.

Remark. It turns out that $\nabla \log \det X = X^{-1}$. You can find the Hessian of this and do it another way.
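That gradient formula is easy to check numerically: the directional derivative of $\log \det$ at $X$ along $V$ should equal $\langle X^{-1}, V\rangle = \text{Tr}(X^{-1}V)$. A small check (my own, using a random positive definite $X$):

```python
# Numerical check that grad log det X = X^{-1}: the directional
# derivative along a symmetric V should equal Tr(X^{-1} V).
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
X = M @ M.T + 4 * np.eye(4)                  # positive definite
V = rng.normal(size=(4, 4)); V = (V + V.T) / 2

t = 1e-6
_, ld_plus = np.linalg.slogdet(X + t * V)    # slogdet avoids overflow
_, ld_minus = np.linalg.slogdet(X - t * V)
numeric = (ld_plus - ld_minus) / (2 * t)
exact = np.trace(np.linalg.solve(X, V))      # Tr(X^{-1} V)

assert abs(numeric - exact) < 1e-5
```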

Remark. Why were we allowed to take $X = Z + tV$ again? Is it just because convexity can be checked by restricting the function to arbitrary lines, or something?

Let $A$ be symmetric and $\lambda(A)$ the largest eigenvalue of $A$. We expect this to be convex because it is a maximization of something.

In this example, $\lambda(A) = \max_{\|x\| = 1} x^\top A x$, and $x^\top A x$ is a linear function of $A$, so this must be convex. This is a particularly interesting example because it is not differentiable.

Functions where all sublevel sets are intervals can be minimized with a bracket search, but you have to be able to find points in the sublevel sets. The thing about this is that it is simpler in 1D than it is in higher dimensions. I think he called this quasiconvex?
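A minimal 1D bracket search, as I understand the idea (my own sketch, assuming a strictly quasiconvex function on the bracket): compare two interior probes and discard the third of the bracket that cannot contain the minimizer, since every sublevel set is an interval.

```python
# Bracket (ternary) search for a 1D quasiconvex function.
def bracket_search(f, lo, hi, tol=1e-8):
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2          # minimizer cannot lie in (m2, hi]
        else:
            lo = m1          # minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2

# quasiconvex but not convex (sqrt of a distance):
x_star = bracket_search(lambda x: abs(x - 1.5) ** 0.5, -10, 10)
print(x_star)  # ~1.5
```

This only uses function comparisons, which is why it works without convexity proper; the higher-dimensional analogue needs a way to find points in sublevel sets, which is the hard part he mentions.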

The ceiling function is quasiconvex. There’s a function $\text{card}(x)$ with $x \in \R^n$, also known as the “$L^0$ norm,” which tells you the number of nonzero components of $x$.

A function is log concave if its logarithm is concave. The function $x$ is log concave, and so is $e^x$, but one of the most important examples is the Gaussian probability density function.

If you have a constraint $f(x) \ge \alpha$ and $f$ is log concave, you can rewrite it as $\log f(x) \ge \log \alpha$.

Cones

He wants to talk about convex cones. This,

$\min c^\top x$ s.t. $A_0 + \sum_{i=1}^n A_i x_i \succeq 0$

these problems are called semidefinite programming problems (SDP problems). Today we’ll see how many problems can be written like this.

Let $K$ be a convex cone: for all $x, z \in K$ and $\alpha, \beta \in \R_+$, we have $\alpha x + \beta z \in K$. $\R_+^n$ is called the orthant or nonnegative orthant.

The orthant is self-dual, which follows simply. The second-order cone is also self-dual; that follows from Cauchy-Schwarz. The positive semidefinite cone is also self-dual, or something.

$X, Z \succeq 0$, $X = \sum_{i=1}^n \lambda_i u_i u_i^\top$, $Z = \sum_{i=1}^n \mu_i v_i v_i^\top$

You should show that $\sum_{i,j} X_{ij} Z_{ij} = \text{Tr}(XZ)$ for yourself. If you expand using the eigendecompositions, you get:

$= \sum_{i=1}^n \sum_{j=1}^n \lambda_i \mu_j \text{Tr}(u_iu_i^\top v_j v_j^\top)$

Take two PSD matrices and dot product them together and you always get a nonnegative number; that’s the cool thing.
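This is quick to check empirically (my own check): each term above is $\lambda_i \mu_j (u_i^\top v_j)^2 \ge 0$, so the trace inner product of two PSD matrices is nonnegative.

```python
# Check: for PSD X and Z, <X, Z> = Tr(XZ) >= 0, since it equals
# sum_ij lam_i mu_j (u_i . v_j)^2 with all factors nonnegative.
import numpy as np

rng = np.random.default_rng(3)
for _ in range(100):
    A = rng.normal(size=(5, 5)); X = A @ A.T   # PSD by construction
    B = rng.normal(size=(5, 5)); Z = B @ B.T   # PSD by construction
    assert np.trace(X @ Z) >= -1e-10           # nonnegative up to roundoff
```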

This is nice because if you can solve one program, you can solve another. This will end up leading to the primal dual ideas. The second order cone actually encodes quadratic programming, and the positive semidefinite cone encodes everything.

Remark. A piece of motivation is that we are heading towards the optimality condition $\langle \nabla f, z - x \rangle \ge 0$. As I was typing this he decided to erase it, and someone loudly whispered “no!”. It was really funny.

Convex Q.P.

minimize $\frac{1}{2}x^\top Q x + p^\top x$ subject to $x \in \Omega$.

minimize $t^2$ … subject to $\frac{1}{2}x^\top Q x + p^\top x \le t^2$, $x \in \Omega$

$\frac{1}{2}(x+Q^{-1}p)^\top Q (x+ Q^{-1}p) \le t^2$, i.e. $\|Q^{1/2} x + Q^{-1/2}p\| \le t$
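The completing-the-square step can be verified numerically (my own check, with random positive definite $Q$): the shifted quadratic form equals half the squared norm of $Q^{1/2}x + Q^{-1/2}p$.

```python
# Check the completing-the-square identity behind the SOC rewrite:
# (1/2)(x + Q^{-1}p)^T Q (x + Q^{-1}p) = (1/2)||Q^{1/2} x + Q^{-1/2} p||^2.
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4))
Q = M @ M.T + np.eye(4)                       # positive definite
p, x = rng.normal(size=4), rng.normal(size=4)

# matrix square roots via the eigendecomposition of Q
w, U = np.linalg.eigh(Q)
Q_half = U @ np.diag(np.sqrt(w)) @ U.T
Q_neg_half = U @ np.diag(1 / np.sqrt(w)) @ U.T

shift = x + np.linalg.solve(Q, p)             # x + Q^{-1} p
lhs = 0.5 * shift @ Q @ shift
rhs = 0.5 * np.linalg.norm(Q_half @ x + Q_neg_half @ p) ** 2

assert abs(lhs - rhs) < 1e-8
```

(The constant $-\frac{1}{2}p^\top Q^{-1}p$ that completing the square produces doesn’t affect the minimizer, which is presumably why the notes drop it.)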

Don’t solve least squares using gradient descent, but we can solve least squares via a second order cone problem.

minimize $t^2$ subject to $\|Ax - b\|^2 \le t^2$. By monotonicity, this is the same as just saying $\min t$ such that $\|Ax - b\| \le t$.

Stochastic Optimization

Minimize $c^\top x$ such that $\text{Pr}(a^\top x \le b) \ge \eta$ (in other words, we’re picking $a$ out of a hat, and we want the probability of the constraint being met to be decently high). A version of Markowitz forecasting or something. Finance people like this.

Suppose $a \sim \mathcal{N}(\overline{a}, \Sigma)$. Then $a^\top x \sim \mathcal{N}(\overline{a}^\top x, x^\top \Sigma x)$.

$\text{Pr}(a^\top x \le b) = \Phi\!\left(\frac{b - \overline{a}^\top x}{\|\Sigma^{1/2} x\|_2}\right)$ or something, where $\Phi$ is the Gaussian CDF. We can then minimize $c^\top x$ such that $\overline{a}^\top x + \Phi^{-1}(\eta)\|\Sigma^{1/2}x\|_2 \le b$. This is a very frequentist problem because we’re setting a sort of confidence level.
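A Monte Carlo sanity check of that rewrite (my own toy numbers): fix $x$, set $b$ at the deterministic boundary $\overline{a}^\top x + \Phi^{-1}(\eta)\|\Sigma^{1/2}x\|_2$, and verify the chance constraint holds with probability about $\eta$.

```python
# Monte Carlo check of the chance-constraint reformulation for
# a ~ N(abar, Sigma): if  abar^T x + Phi^{-1}(eta) sqrt(x^T Sigma x) <= b,
# then Pr(a^T x <= b) >= eta.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
abar = np.array([1.0, 2.0])
L = np.array([[0.5, 0.0], [0.2, 0.3]])
Sigma = L @ L.T
x = np.array([1.0, 0.5])
eta = 0.9

sd = np.sqrt(x @ Sigma @ x)                  # std dev of a^T x
b = abar @ x + norm.ppf(eta) * sd            # tightest feasible b

a = rng.multivariate_normal(abar, Sigma, size=200_000)
p_hat = np.mean(a @ x <= b)                  # empirical probability
assert p_hat >= eta - 0.01
```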

Robust optimization

Minimize $c^\top x$ subject to $a^\top x \le b$ for all $a \in \mathcal{E}$, where $\mathcal{E} = \{\overline{a} + Pu : \|u\| \le 1\}$ and $P \succeq 0$. What is the relationship between robust and stochastic problems? It is actually a very contentious point in the optimization community, because the robust people say they can solve all the stochastic people’s problems. Recht low key sides with the robusters (get it, robust, plus they bust people’s problems, so they’re busters).

The tricks on the robust side still work even if we change up $\mathcal{E}$, and we get different formulations that still solve the stochastic people’s problems, or something? You just have to worry about which sets you can describe well using convex sets, rather than dealing with stochasticity.

Duality: Concave Maximization and Convex Optimization

Fact 1: The optimal value of the primal (a minimization) is always at least the optimal value of the dual.

Branch and bound is a solution method that makes use of duality.

He thinks algebra is underappreciated. 🤨 Idk about that one. Maybe in CS.

Dualizing is somehow a projection into the space of convex problems? Maybe? Because the dual gives you a convex problem, and the dual of the dual gives you a convexification of your original problem.

Let’s look at this: $P$ is symmetric; maximize $x^\top P x$ subject to $\|x\|_2 = 1$.

Or maybe, looking at a dual: minimize $\lambda$ subject to $\lambda I - P \succeq 0$. We know how to do this; that’s just $\lambda_{\max}$ of $P$. But then if we put the corresponding eigenvector $x$ into the original problem, we immediately attain $\lambda_{\max}$, so we have strong duality and we’ve solved that nonconvex problem!
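The strong-duality claim is easy to see in code (my own check): the dual optimum is $\lambda_{\max}(P)$, and the top eigenvector is a feasible primal point achieving exactly that value, so the duality gap is zero.

```python
# Check strong duality for  max_{||x||=1} x^T P x:
# dual optimum = min{lambda : lambda I - P PSD} = lambda_max(P),
# and the top eigenvector attains it in the primal.
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(6, 6))
P = (M + M.T) / 2                 # symmetric, possibly indefinite

w, V = np.linalg.eigh(P)          # eigenvalues in ascending order
lam_max, x = w[-1], V[:, -1]      # dual optimum and top eigenvector

assert abs(np.linalg.norm(x) - 1) < 1e-10   # primal feasible
assert abs(x @ P @ x - lam_max) < 1e-8      # primal value = dual value
```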



As a fun fact, it might seem like this website is flat because you're viewing it on a flat screen, but the curvature of this website actually isn't zero. ;-)

Copyright © 2024, Aathreya Kadambi

Made with Astrojs, React, and Tailwind.