Nam Le

“There are some things which cannot be learned quickly, and time, which is all we have, must be paid heavily for their acquiring. They are the very simplest things, and because it takes a man’s life to know them the little new that each man gets from life is very costly and the only heritage he has to leave.” - Ernest Hemingway

News

I will be posting all kinds of news here, both good and bad.

Optimization Research Papers in JMLR Volume 25

This document lists papers from JMLR Volume 25 (2024) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.

Convex Optimization
Papers addressing convex optimization problems, including sparse NMF, differential privacy, and sparse regression.

1. Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction
Authors: Yuze Han, Guangzeng Xie, Zhihua Zhang
Description: Investigates lower complexity bounds for finite-sum optimization problems in convex settings.

Ebooks & related papers on Convex Optimization

Ebooks

Boris Mordukhovich, Nguyen Mau Nam. An Easy Path to Convex Analysis and Applications. 2023
Yurii Nesterov. Lectures on Convex Optimization. 2018
Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. 2015
Dimitri Bertsekas. Nonlinear Programming. 2016
Boris Teodorovich Polyak. Introduction to Optimization. 1987
R. T. Rockafellar. Convex Analysis. 1970
H. H. Bauschke & P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2011
Lieven Vandenberghe and Stephen P. Boyd. Convex Optimization. 2004

Papers

Yu. E. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. 1983

Pre-print articles on Adagrad-variant methods

1. Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

Authors: Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

Abstract: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease in the average loss, as most samples come from infrequent words. On the other hand, Adam and sign-based methods are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. On a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.
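To make the mechanism above concrete, here is a small illustrative sketch (not the authors' code): it trains a linear softmax classifier on synthetic data with Zipf-like class frequencies, once with full-batch gradient descent and once with sign descent, and compares the final loss on frequent versus rare classes. The data dimensions, step sizes, and frequency profile are assumptions chosen for illustration.

```python
# Illustrative sketch only: a linear softmax model on heavy-tailed synthetic
# classes, trained with gradient descent vs. sign descent. All shapes, step
# sizes, and the class-frequency profile are assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_samples = 50, 20, 5000

# Heavy-tailed class frequencies: p(k) proportional to 1 / (k + 1).
probs = 1.0 / np.arange(1, n_classes + 1)
probs /= probs.sum()
y = rng.choice(n_classes, size=n_samples, p=probs)
# Class-dependent means so the problem is learnable by a linear model.
means = rng.normal(size=(n_classes, dim))
X = means[y] + 0.5 * rng.normal(size=(n_samples, dim))

def softmax_loss_grad(W, X, y):
    """Per-sample cross-entropy losses and the average gradient for weights W."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    n = X.shape[0]
    loss_per_sample = -np.log(p[np.arange(n), y] + 1e-12)
    p[np.arange(n), y] -= 1.0          # dL/dlogits for each sample
    grad = X.T @ p / n
    return loss_per_sample, grad

def train(update, lr, steps=300):
    W = np.zeros((dim, n_classes))
    for _ in range(steps):
        _, g = softmax_loss_grad(W, X, y)
        W -= lr * update(g)
    loss, _ = softmax_loss_grad(W, X, y)
    # Average loss on the most frequent classes vs. the rare tail.
    frequent = loss[y < 5].mean()
    rare = loss[y >= n_classes - 25].mean()
    return frequent, rare

print("gradient descent (frequent, rare):", train(lambda g: g, lr=0.5))
print("sign descent     (frequent, rare):", train(np.sign, lr=0.01))
```

Under this setup, gradient descent tends to drive down the loss on frequent classes first while the rare-class loss lags, whereas sign descent treats the two groups more evenly, mirroring the effect described in the abstract.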

Pre-print articles on Adaptive Optimization

1. A simple uniformly optimal method without line search for convex optimization

Authors: Tianjiao Li, Guanghui Lan

Abstract: Line search (or backtracking) procedures have been widely employed in first-order methods for solving convex optimization problems, especially those with unknown problem parameters (e.g., Lipschitz constant). In this paper, we show that line search is superfluous in attaining the optimal rate of convergence for solving a convex optimization problem whose parameters are not given a priori. In particular, we present a novel accelerated gradient descent type algorithm called the auto-conditioned fast gradient method (AC-FGM) that can achieve an optimal $\mathcal{O}(1/k^2)$ rate of convergence for smooth convex optimization without requiring the estimate of a global Lipschitz constant or the employment of line search procedures. We then extend AC-FGM to solve convex optimization problems with Hölder continuous gradients and show that it automatically achieves the optimal rates of convergence uniformly for all problem classes with the desired accuracy of the solution as the only input. Finally, we report some encouraging numerical results that demonstrate the advantages of AC-FGM over the previously developed parameter-free methods for convex optimization.
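The AC-FGM update itself is not reproduced here; as background for the claim that line search is superfluous, the sketch below shows a standard Armijo backtracking line search wrapped around gradient descent, i.e., the kind of procedure such parameter-free methods avoid. The objective, backtracking constants, and step counts are illustrative assumptions.

```python
# A minimal sketch of backtracking (Armijo) line search around gradient descent,
# shown only as context for the line-search-free methods discussed above.
# The quadratic objective and all constants are illustrative assumptions.
import numpy as np

def backtracking_gd(f, grad_f, x0, t0=1.0, beta=0.5, c=1e-4, steps=100):
    """Gradient descent where each step size is found by Armijo backtracking,
    so no global Lipschitz constant needs to be known in advance."""
    x = x0.astype(float)
    for _ in range(steps):
        g = grad_f(x)
        t = t0
        # Shrink t until the sufficient-decrease (Armijo) condition holds.
        while f(x - t * g) > f(x) - c * t * np.dot(g, g):
            t *= beta
        x = x - t * g
    return x

# Example: minimize an ill-conditioned quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
print(backtracking_gd(f, grad_f, np.array([1.0, 1.0])))
```

Each backtracking loop costs extra function evaluations per iteration; removing that loop while keeping the optimal rate is exactly the advantage the paper claims for AC-FGM.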

Pre-print articles on gradient-clipping methods

1. Why gradient clipping accelerates training: A theoretical justification for adaptivity

Authors: Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.
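As a concrete illustration of the two update rules the abstract names, the sketch below implements clipped and normalized gradient steps and runs them on a toy one-dimensional objective whose local smoothness grows with the gradient norm. The test function, clip threshold, and step sizes are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch of clipped vs. normalized gradient updates on a toy
# objective; the function and constants are assumptions for demonstration only.
import numpy as np

def clipped_gd_step(x, grad, lr=0.1, clip=1.0):
    """Scale the gradient down only when its norm exceeds the clip threshold."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, clip / (norm + 1e-12))
    return x - lr * scale * grad

def normalized_gd_step(x, grad, lr=0.1):
    """Always move a fixed distance lr in the gradient direction."""
    return x - lr * grad / (np.linalg.norm(grad) + 1e-12)

# Toy objective f(x) = x^4: its curvature (12 x^2) grows with the gradient
# norm (4 |x|^3), the regime in which adaptively scaled steps help while
# fixed-stepsize gradient descent must use a tiny, conservative step.
f = lambda x: x**4
grad = lambda x: 4.0 * x**3

x_clip, x_norm = 5.0, 5.0
for _ in range(50):
    x_clip = clipped_gd_step(x_clip, grad(x_clip))
    x_norm = normalized_gd_step(x_norm, grad(x_norm))
print("clipped GD:    x =", x_clip, " f(x) =", f(x_clip))
print("normalized GD: x =", x_norm, " f(x) =", f(x_norm))
```

Starting from x = 5, plain gradient descent with the same fixed stepsize would diverge because the initial gradient is huge; clipping or normalizing the step keeps the iterates stable, which is the behavior the paper's relaxed smoothness condition is designed to explain.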