Foundation of Data Science – Virtual Talk Series

Monday, Apr 1st, 2024 — Alex Dimakis from UT Austin

The next Foundations of Data Science virtual talk series on the “Theory of Large ML Models” will take place on Monday, Apr 1st at 2:00 PM Pacific Time (17:00 Eastern Time, 23:00 Central European Time, 21:00 UTC). Alex Dimakis from UT Austin will talk about “Deep Generative models and Inverse problems for Signal Reconstruction.”.

Details of the talk (Zoom link) are available here.

Abstract: Sparsity-based methods like compressed sensing have led to significant technological breakthroughs in signal processing, compression and medical imaging. Deep generative models like GANs, VAEs and Diffusions are data-driven signal models that are showing impressive performance. We will survey our framework of how pre-trained generative models can be used as priors to solve inverse problems like denoising, filling missing data, and recovery from linear projections in an unsupervised way. We generalize compressed sensing theory beyond sparsity, extending Restricted Isometries to sets created by deep generative models. We will also discuss applications to accelerating MRI, fairness in imaging and numerous open problems.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Friday, Feb 9th, 2024 — Kangwook Lee from UW Madison

The next Foundations of Data Science virtual talk series on the “Theory of Large ML Models” will take place on Friday, Feb 9th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 21:00 UTC). Kangwook Lee from UW Madison will talk about “Theoretical Exploration of Foundation Model Adaptation Methods”.

Details of the talk (Zoom link) are available here.

Abstract: Due to the enormous size of foundation models, various new methods for efficient model adaptation have been developed. Parameter-efficient fine-tuning (PEFT) is an adaptation method that updates only a tiny fraction of the model parameters, leaving the remainder unchanged. In-context Learning (ICL) is a test-time adaptation method, which repurposes foundation models by providing them with labeled samples as part of the input context. Given the growing importance of this emerging paradigm, developing theoretical foundations for the new paradigm is of utmost importance.

In this talk, I will introduce two preliminary results toward this goal. In the first part, I will present a theoretical analysis of Low-Rank Adaptation (also known as LoRA), one of the most popular PEFT methods today. Our analysis of the expressive power of LoRA not only helps us better understand the high adaptivity of LoRA observed in practice but also provides insights to practitioners. In the second part, I will introduce our probabilistic framework for a better understanding of ICL. With our framework, one can analyze the transition between two distinct modes of ICL: task retrieval and learning. We also discuss how our framework can help explain and predict various phenomena, which can be observed with large language models in practice yet not fully explained.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Monday, Feb 5th, 2024 — Sasha Rush from Cornell

The next Foundations of Data Science virtual talk series on the “Theory of Large ML Models” will take place on Monday, Feb 5th at 2:00 PM Pacific Time (17:00 Eastern Time, 23:00 Central European Time, 22:00 UTC). Sasha Rush from Cornell University will talk about “Scaling Data-Constrained Language Model”.

Details of the talk (Zoom link) are available here.

Extrapolating scaling trends suggest that training dataset size for LLMs may soon be limited by the amount of text data available on the internet. In this talk we investigate scaling language models in data-constrained regimes. Specifically, we run a set of empirical experiments varying the extent of data repetition and compute budget. From these experiments we propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we discuss and experiment with approaches for mitigating data scarcity.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Monday, Oct 30th, 2023 — Boaz Barak from Harvard

The next Foundations of Data Science virtual talk series on the “Theory of Large ML Models” will take place on Monday, Oct 30th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Boaz Barak from Harvard University will talk about “The Uneasy Relation Between Deep Learning and Statistics”.

Details of the talk (Zoom link) are available here.

Deep learning uses the language and tools of statistics and classical machine learning, including empirical and population losses and optimizing a hypothesis on a training set. But it uses these tools in regimes where they should not be applicable: the optimization task is non-convex, models are often large enough to overfit, and the training and deployment tasks can radically differ.

This talk will survey the relation between deep learning and statistics. In particular we will discuss recent works supporting the emerging intuition that deep learning is closer in some aspects to human learning than to classical statistics. Rather than estimating quantities from samples, deep neural nets develop broadly applicable representations and skills through their training.

The talk will not assume background knowledge in artificial intelligence or deep learning.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday, Mar 16th, 2023 — Vladimir Braverman from Rice University

The next Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Thursday, March 16th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 21:00 UTC). Vladimir Braverman from Rice University will talk about “Adversarial Robustness of Streaming Algorithms through Importance Sampling”.

Details of the talk (Zoom link) are available here.

Abstract: Robustness against adversarial attacks has recently been at the forefront of algorithmic design for machine learning tasks. In the adversarial streaming model, an adversary gives an algorithm a sequence of adaptively chosen updates u1, … ,un as a data stream. The goal of the algorithm is to compute or approximate some predetermined function for every prefix of the adversarial stream, but the adversary may generate future updates based on previous outputs of the algorithm. In particular, the adversary may gradually learn the random bits internally used by an algorithm to manipulate dependencies in the input. This is especially problematic as many important problems in the streaming model require randomized algorithms, as they are known to not admit any deterministic algorithms that use sublinear space. In this paper, we introduce adversarially robust streaming algorithms for central machine learning and algorithmic tasks, such as regression and clustering, as well as their more general counterparts, subspace embedding, low-rank approximation, and coreset construction. For regression and other numerical linear algebra related tasks, we consider the row arrival streaming model. Our results are based on a simple, but powerful, observation that many importance sampling-based algorithms give rise to adversarial robustness which is in contrast to sketching based algorithms, which are very prevalent in the streaming literature but suffer from adversarial attacks. In addition, we show that the well-known merge and reduce paradigm in streaming is adversarially robust. Since the merge and reduce paradigm allows coreset constructions in the streaming setting, we thus obtain robust algorithms for k-means, k-median, k-center, Bregman clustering, projective clustering, principal component analysis (PCA) and non-negative matrix factorization. To the best of our knowledge, these are the first adversarially robust results for these problems yet require no new algorithmic implementations. Finally, we empirically confirm the robustness of our algorithms on various adversarial attacks and demonstrate that by contrast, some common existing algorithms are not robust.

This is a joint work with Avinatan Hassidim, Yossi Matias, Mariano Schain, Sandeep Silwal, Samson Zhou. This result has appeared in NeurIPS 2021.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday, Feb 9th, 2023 — Amit Chakrabarti from Dartmouth College

The next Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Thursday, February 9th at 2:00 PM Pacific Time (17:00 Eastern Time, 23:59 Central European Time, 22:00 UTC). Amit Chakrabarti from Dartmouth College will talk about “How to color your adversary’s graph”

Details of the talk (Zoom link) are available here.

An n-vertex graph with maximum degree D is (D+1)-colorable: an almost trivial combinatorial result, with an equally simple greedy algorithm to produce a (D+1)-coloring. However, given a stream of edges of such a graph, can we maintain a valid (D+1)-coloring as the edges arrive, while using not much more than O(n) space? What if the edges are chosen by an adversary who can look at our current coloring and add additional edges to try to confound us? This is the newly-popular setting of adversarially robust streaming algorithms and this talk is about the coloring problem in this setting.

We obtain upper and lower bound results for this problem. In O(n polylog n) space, we can maintain an O(D^(5/2))-coloring of such an adversarial graph. On the other hand, every adversarially robust coloring algorithm under the same space limitation must spend Omega(D^2) colors. We in fact prove more general. results that trade off the space usage against the color budget. One interesting by-product of our work is that in combination with the celebrated Assadi-Chen-Khanna algorithm (SODA 2019), it provides the first separation between randomized and deterministic algorithms for the (ordinary, non-robust) streaming graph coloring problem.

Based on joint works [C.-Ghosh-Stoeckl] and [Assadi-C.-Ghosh-Stoeckl].

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday, Dec 14th, 2022 — Omri Ben-Eliezer from MIT

The next Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Wednesday, December 14th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Omri Ben-Eliezer from MIT will talk about “Robust sampling and online learning”

Details of the talk (Zoom link) are available here.

Abstract: Random sampling is a fundamental primitive in modern algorithms, statistics and machine learning, used as a generic method to obtain a small yet “representative” subset of the data. In this talk we shall explore to what extent random sampling is robust to adaptive inputs in a streaming setting, proposing and studying a model for sampling against a white-box adaptive adversary (where future inputs generated by the adversary are allowed to depend on the current internal state of the sampler). I will demonstrate scenarios where the adaptive sample complexity can be much larger than in the static world, but argue that these scenarios are not entirely realistic. I will then reveal a deep connection between our setting and online learning. As it turns out, the sample complexity in our sampling setting is captured by a sequential version of Rademacher complexity, a notion that is also known to capture the regret in online classification problems. Leveraging this connection, one can bound the sample complexity using the Littlestone dimension of the relevant concept class. The obtained bounds are tight in many regimes, and lead us to resolve a classical open problem on optimal regret bounds in online learning.

Based on joint works with Noga Alon, Yuval Dagan, Shay Moran, Moni Naor, and Eylon Yogev.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Tuesday, Nov 8th, 2022 — Edith Cohen from Google Research (& Tel-Aviv University)

The next Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Tuesday, November 8th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Edith Cohen from Google Research will talk about “On Robustness to Adaptive Inputs: A Case Study of CountSketch.”

Details of the talk (Zoom link) are available here.

Abstract: The performance of randomized algorithms can be considered both in adaptive settings, where inputs may depend on prior outputs of the algorithm, and in non-adaptive settings. Adaptive settings arise when the algorithm is a component of a system with feedback or when an adversary attempts to construct an input on which the algorithm fails. The last decade saw impressive progress that included the development of black-box methods of obtaining algorithms that are robust to adaptive inputs from non-robust counterparts, most notably using differential privacy, and established lower bounds through “attacks.” But intriguing questions remain on the practical implications of adaptivity and the potential to develop tailored algorithms for specific problems and “nicer” inputs. In my talk, I will review key ideas from the literature and then present our recent work that explored these questions for CountSketch, a popular dimensionality reduction method (joint with Xin Lyu, Jelani Nelson, Tamas Sarlos, Moshe Shechner, and Uri Stemmer).

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday Oct 5th 2022 — David Woodruff from CMU

The first Foundations of Data Science virtual talk of the season, and first of a series on recent advances in adversarially robust streaming, will take place on Wednesday, October 5th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). David Woodruff from CMU will give us a survey about “Adversarially Robust Streaming Algorithms.”

Details of the talk (Zoom link) available here.

Abstract: A streaming algorithm is given a sequence of items and seeks to compute or approximate some function of this sequence using a small amount of memory. A body of work has been developed over the last two decades, resulting in optimal streaming algorithms for a number of problems. I will start by surveying some of these problems. I will then investigate the adversarial robustness of streaming algorithms. An algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems do not admit sublinear-space deterministic algorithms; on the other hand, space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. I will survey work showing for a number of streaming problems, one can obtain algorithms that are adversarially robust with a small overhead in their memory requirements.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday May 11th 2022 — Sébastien Bubeck from Microsoft Research

The next Foundations of Data Science virtual talk of this year will take place on Wednesday, March 11th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Sébastien Bubeck from Microsoft Research will speak about “Set Chasing, with an application to online shortest path.”

Please register here to join the virtual talk.

Abstract: Since the late 19th century, mathematicians have realized the importance and generality of selection problems: given a collection of sets, select an element in each set, possibly in a “nice” way. Of particular importance in computer science is the scenario where the ground set is a metric space, in which case it is natural to ask for Lipschitz selection. In this talk I will describe a far-reaching extension of this classical Lipschitz selection problem to an *online* setting, where sets are streaming to the selector. I will show how Riemannian gradient descent (aka mirror descent) can be used to approach this type of problems. I will illustrate the power of the framework by solving a long-standing problem in online shortest path known as layered graph traversal (introduced by Papadimitriou and Yannakakis in 1989).

The series is supported by the NSF HDR TRIPODS Grant 1934846.