“There are some things which cannot be learned quickly, and time, which is all we have, must be paid heavily for their acquiring. They are the very simplest things, and because it takes a man’s life to know them the little new that each man gets from life is very costly and the only heritage he has to leave.” - Ernest Hemingway
News #
I will be posting news of all kinds here, good and bad.
Posts #
Pre-print articles on Adagrad-variant methods
1. Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models #

Authors: Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

Abstract: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease in the average loss, as most samples come from infrequent words. On the other hand, Adam and sign-based methods are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. On a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.
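The abstract's core mechanism can be seen in a toy model. The sketch below is my own illustration, not code from the paper: a linear softmax model with one-hot inputs and Zipf-distributed class frequencies, where the full-batch gradient for a class's parameters scales with that class's frequency, so gradient descent moves rare-class parameters slowly while sign descent updates every coordinate at the same magnitude. All sizes and stepsizes are illustrative choices.

```python
import numpy as np

n_classes = 20
# Heavy-tailed (Zipf-like) class frequencies, mimicking word frequencies.
p = 1.0 / np.arange(1, n_classes + 1)
p /= p.sum()

# Toy linear softmax model with one-hot inputs x = e_c and label y = c.
# Logits for class-c samples are column c of W, so the full-batch gradient
# for column c is p[c] * (softmax(W[:, c]) - e_c): rare columns get tiny
# gradient-descent updates but full-magnitude sign-descent updates.
def per_class_loss(W):
    Z = W - W.max(axis=0)                 # stabilized logits, column c = class c
    logp = Z - np.log(np.exp(Z).sum(axis=0))
    return -np.diag(logp)                 # cross-entropy on each class's own samples

def grad(W):
    S = np.exp(W - W.max(axis=0))
    S /= S.sum(axis=0)
    return (S - np.eye(n_classes)) * p    # frequency-weighted full-batch gradient

W_gd = np.zeros((n_classes, n_classes))
W_sign = np.zeros((n_classes, n_classes))
for _ in range(200):
    W_gd -= 0.5 * grad(W_gd)
    W_sign -= 0.05 * np.sign(grad(W_sign))

rare = n_classes - 1                      # least frequent class
print(per_class_loss(W_gd)[rare], per_class_loss(W_sign)[rare])
```

Under gradient descent the rare class's loss decreases far more slowly than the frequent classes', while sign descent drives all per-class losses down at the same rate, echoing the paper's continuous-time result.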
Pre-print articles on Adaptive Optimization
1. A simple uniformly optimal method without line search for convex optimization #

Authors: Tianjiao Li, Guanghui Lan

Abstract: Line search (or backtracking) procedures have been widely employed in first-order methods for solving convex optimization problems, especially those with unknown problem parameters (e.g., the Lipschitz constant). In this paper, we show that line search is superfluous in attaining the optimal rate of convergence for solving a convex optimization problem whose parameters are not given a priori. In particular, we present a novel accelerated gradient descent type algorithm called the auto-conditioned fast gradient method (AC-FGM) that can achieve an optimal $\mathcal{O}(1/k^2)$ rate of convergence for smooth convex optimization without requiring an estimate of the global Lipschitz constant or the employment of line search procedures. We then extend AC-FGM to solve convex optimization problems with Hölder continuous gradients and show that it automatically achieves the optimal rates of convergence uniformly for all problem classes, with the desired accuracy of the solution as the only input. Finally, we report some encouraging numerical results that demonstrate the advantages of AC-FGM over previously developed parameter-free methods for convex optimization.
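For context, the optimal $\mathcal{O}(1/k^2)$ rate the abstract refers to is the one attained by Nesterov's classical accelerated gradient method, sketched below on a toy quadratic. Note that this is the standard textbook method, not AC-FGM, and unlike AC-FGM it requires the global Lipschitz constant $L$ as input; the problem and constants here are illustrative.

```python
import numpy as np

# Classical Nesterov acceleration on a smooth convex quadratic: it attains
# the optimal O(1/k^2) rate, but needs the Lipschitz constant L up front,
# which is exactly the requirement AC-FGM removes.
A = np.diag([1.0, 10.0, 100.0])   # f(x) = 0.5 * x^T A x, so L = 100
L = 100.0

def f(x):
    return 0.5 * x @ A @ x

def df(x):
    return A @ x

x = y = np.array([1.0, 1.0, 1.0])
t = 1.0
for _ in range(100):
    x_new = y - df(y) / L                         # gradient step, stepsize 1/L
    t_new = 0.5 * (1 + np.sqrt(1 + 4 * t * t))    # momentum schedule
    y = x_new + ((t - 1) / t_new) * (x_new - x)   # extrapolation step
    x, t = x_new, t_new

print(f(x))
```

By the standard guarantee, $f(x_k) - f^\star \le 2L\|x_0 - x^\star\|^2/(k+1)^2 \approx 0.059$ after 100 iterations here; in practice the iterate is far more accurate.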
Pre-print articles on gradient-clipping methods
1. Why gradient clipping accelerates training: A theoretical justification for adaptivity #

Authors: Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, \emph{gradient clipping} and \emph{normalized gradient}, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.
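A minimal sketch of the two update rules the abstract names, clipped gradient descent and normalized gradient descent. The stepsizes, clipping threshold, and test function $f(x) = e^x$ (whose smoothness grows with the gradient norm, the regime the paper studies) are my illustrative choices, not values from the paper.

```python
import numpy as np

def clipped_gd_step(x, grad, lr=0.1, clip=1.0):
    """One step of gradient descent with norm clipping."""
    g = grad(x)
    norm = np.linalg.norm(g)
    if norm > clip:                     # rescale so the step length is bounded
        g = g * (clip / norm)
    return x - lr * g

def normalized_gd_step(x, grad, lr=0.1, eps=1e-12):
    """One step of normalized gradient descent (unit-norm direction)."""
    g = grad(x)
    return x - lr * g / (np.linalg.norm(g) + eps)

# f(x) = e^x has f''(x) = e^x = |f'(x)|: smoothness grows with the gradient.
grad = lambda x: np.exp(x)
x = np.array([5.0])
for _ in range(200):
    x = clipped_gd_step(x, grad)
print(x)
```

At the start, the local smoothness is $e^5 \approx 148$, so a classical fixed $1/L$ stepsize would have to be tiny; clipping lets a fixed stepsize of 0.1 make steady progress, which is the adaptivity the paper formalizes.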
Free Books on Dynamical Systems
Arxiv/ Free Books #

1. Lectures on Neural Dynamics - Francesco Bullo #

Chapter 1: Neural circuit models based on firing rates and Hopfield networks: their dynamics, interconnections, and local Hebbian adaptation rules.
Chapter 2: Stability in dynamic neural networks using Lyapunov methods, multistability, and energy functions.
Chapter 3: Optimization in neural networks through biologically inspired gradient dynamics and sparse representations.
Chapter 4: Unsupervised learning via neural dynamics, linking Hebbian rules to tasks like PCA, clustering, and similarity-based representation learning.

2. Linear Geometry and Algebra - Taras Banakh #

Abstract: Linear Geometry studies geometric properties which can be expressed via the notion of a line. All information about lines is encoded in a ternary relation called a line relation. A set endowed with a line relation is called a liner. So, Linear Geometry studies liners. Imposing some additional axioms on a liner, we obtain some special classes of liners: regular, projective, affine, proaffine, etc. Linear Geometry includes Affine and Projective Geometries and is a part of Incidence Geometry. The aim of this book is to present a self-contained logical development of Linear Geometry, starting with some intuitively acceptable geometric axioms and ending with the algebraic structures that necessarily arise from studying the structure of geometric objects satisfying those simple and intuitive geometric axioms. We shall meet many quite exotic algebraic structures that arise this way: magmas, loops, ternary rings, quasi-fields, alternative rings, procorps, profields, etc. We strongly prefer (synthetic) geometric proofs and use tools of analytic geometry only when no purely geometric proof is available.
Linear Geometry has been developed by many great mathematicians since the times of Antiquity (Thales, Euclid, Proclus, Pappus), through the Renaissance (Descartes, Desargues), Early Modernity (Playfair, Gauss, Lobachevsky, Bolyai, Poncelet, Steiner, Möbius), and Late Modernity (Steinitz, Klein, Hilbert, Moufang, Hessenberg, Jordan, Beltrami, Fano, Gallucci, Veblen, Wedderburn, Lenz, Barlotti), down to our contemporaries (Hartshorne, Hall, Buekenhout, Gleason, Kantor, Doyen, Hubault, Dembowski, Klingenberg, Grundhöfer).
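As a small taste of the Hopfield-network and Lyapunov material in Chapters 1-2 of the Bullo lectures, here is a minimal sketch of a binary Hopfield network with Hebbian weights, whose energy $E(s) = -\frac{1}{2} s^\top W s$ is non-increasing under asynchronous sign updates. The network size, number of patterns, and corruption level are my illustrative choices, not examples from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 64))    # 3 stored +/-1 patterns
W = patterns.T @ patterns / patterns.shape[1]   # Hebbian outer-product rule
np.fill_diagonal(W, 0)                          # zero self-coupling

def energy(s):
    return -0.5 * s @ W @ s                     # Lyapunov (energy) function

s = patterns[0].copy()
flip = rng.choice(64, size=10, replace=False)
s[flip] *= -1                                   # corrupt 10 of 64 bits

energies = [energy(s)]
for _ in range(5):                              # asynchronous update sweeps
    for i in rng.permutation(64):
        s[i] = 1 if W[i] @ s >= 0 else -1       # each flip cannot raise energy
    energies.append(energy(s))

print(int((s == patterns[0]).sum()))            # overlap with the stored pattern
```

Because W is symmetric with zero diagonal, every asynchronous update weakly decreases the energy, so the dynamics settle into a fixed point, typically recovering the stored pattern from the corrupted start.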