## Thursday Jan 13th 2022 — Piotr Indyk from MIT

The first Foundations of Data Science virtual talk of this year will take place on Thursday, Jan 13th at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 18:00 UTC). Piotr Indyk from MIT will speak about “Learning-Based Sampling and Streaming”.

Please register here to join the virtual talk.

Abstract: Classical algorithms typically provide “one size fits all” performance, and do not leverage properties or patterns in their inputs. A recent line of work aims to address this issue by developing algorithms that use machine learning predictions to improve their performance. In this talk I will present two examples of this type, in the context of streaming and sampling algorithms. In particular, I will show how to use machine learning predictions to improve the performance of (a) low-memory streaming algorithms for frequency estimation (ICLR’19), and (b) sampling algorithms for estimating the support size of a distribution (ICLR’21).   Both algorithms use an ML-based predictor that, given a data item, estimates the number of times the item occurs in the input data set.

The talk will cover material from papers co-authored with  T Eden, CY Hsu, D Katabi, S Narayanan, R Rubinfeld, S Silwal, T Wagner and A Vakilian.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday Nov 18th — Nicole Immorlica from Microsoft Research

The next Foundations of Data Science virtual talk will take place on Thursday, Nov 18th at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 18:00 UTC). Nicole Immorlica from Microsoft Research will speak about “Communicating with Anecdotes”.

Please register here to join the virtual talk.

Abstract: Classic models of communication in economics typically assume agents can communicate any message.  However, many important communications, such as those in newspapers or politicians’ speeches, use data to convey information.  In this talk, we explore how the reliance on data impacts communication.  In our model, there are two Bayesian agents (a sender and a receiver) who wish to communicate. The receiver must take an action whose payoff depends on their personal preferences and an unknown state of the world. The sender has access to a collection of data points correlated with the state of the world and can send exactly one of these to the receiver in order to influence her choice of action. Importantly, the sender’s personal preferences may differ from the receiver’s, which affects the sender’s strategic choice of what to send. We show that in a Nash equilibrium even a small difference in preferences can lead to a significant bias in the communicated datum. This can significantly reduce informativeness of the communication, leading to substantial utility loss for both sides. One implication is informational homophily: a receiver can rationally prefer to obtain data from a poorly-informed sender with aligned preferences, rather than a knowledgeable expert whose preferences may differ from her own.

Joint work with Nika Haghtalab, Brendan Lucier, Markus Mobius and Divya Mohan.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday Oct 21st — Maxim Raginsky from UIUC

The next Foundations of Data Science virtual talk will take place on Thursday, Oct 21st at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC).  Maxim Raginsky from University of Illinois, Urbana-Champaign will speak about “Neural SDEs: Deep Generative Models in the Diffusion Limit”

Please register here to join the virtual talk.

Abstract: In deep generative models, the latent variable is generated by a time-inhomogeneous Markov chain, where at each time step we pass the current state through a parametric nonlinear map, such as a feedforward neural net, and add a small independent Gaussian perturbation. In this talk, based on joint work with Belinda Tzen, I will discuss the diffusion limit of such models, where we increase the number of layers while sending the step size and the noise variance to zero. I will first provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control. Then I will show how we can quantify the expressiveness of diffusion-based generative models. Specifically, I will prove that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback-Leibler divergence to the target distribution. Finally, I will briefly discuss a scheme for unbiased, finite-variance simulation in such models. This scheme can be implemented as a deep generative model with a random number of layers.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday Sept 23rd — Christos Papadimitriou from Columbia University

The next Foundations of Data Science virtual talk will take place on Thursday, Sept 23rd at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC).  Christos Papadimitriou from Columbia Univeristy will speak about “How does the brain beget the mind?”

Please register here to join the virtual talk.

Abstract: How do molecules, cells and synapses effect reasoning, intelligence, planning, language? Despite dazzling progress in experimental neuroscience, as well as in cognitive science at the other extreme of scale, we do not seem to be making progress in the overarching question: the gap is huge and a completely new approach seems to be required. As Richard Axel recently put it: “We don’t have a logic for the transformation of neural activity into thought […].”

What kind of formal system would qualify as this “logic”?

I will introduce the Assembly Calculus (AC), a computational system which appears to be a promising bridge between neurons and cognition. Through this programming framework, a Parser was recently implemented which (a) can handle reasonably complex sentences of English and other languages, and (b) works exclusively through the spiking of neurons.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday May 6th — Hamed Hassani from University of Pennsylvania

The next Foundations of Data Science virtual talk will take place on Thursday, May 6th at 11:00 AM Pacific Time (14:00 Eastern Time, 19:00 Central European Time, 18:00 UTC).  Hamed Hassani from Univeristy of Pennsylvania will speak about “Learning Robust Models: How does the Geometry of Perturbations Play a Role?”

Please register here to join the virtual talk.

Abstract: In this talk, we will focus on the emerging field of (adversarially) robust machine learning. The talk will be self-contained and no particular background on robust learning will be needed. Recent progress in this field has been accelerated by the observation that despite unprecedented performance on clean data, modern learning models remain fragile to seemingly innocuous changes such as small, norm-bounded additive perturbations.  Moreover, recent work in this field has looked beyond norm-bounded perturbations and has revealed that various other types of distributional shifts in the data can significantly degrade performance.  However, in general our understanding of such shifts is in its infancy and several key questions remain unaddressed.

The goal of this talk is to explain why robust learning paradigms have to be designed — and sometimes rethought — based on the geometry of the input perturbations.  We will cover a wide range of perturbation geometries from simple norm-bounded perturbations, to sparse, natural, and more general distribution shifts.  As we will show, the geometry of the perturbations necessitates fundamental modifications to the learning procedure as well as the architecture in order to ensure robustness. In the first part of the talk, we will discuss our recent theoretical results on robust learning with respect to various geometries, along with fundamental tradeoffs between robustness and accuracy, phase transitions, etc.  The remaining portion of the talk will be about developing practical robust training algorithms and evaluating the resulting (robust) deep networks against state-of-the-art methods on naturally-varying, real-world datasets.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday April 1st — Ingrid Daubechies from Duke University

The next Foundations of Data Science virtual talk will take place on Thursday, April 1st at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 17:00 UTC).  Ingrid Daubechies from Duke Univeristy will speak about “Discovering low-dimensional manifolds in high-dimensional data sets.”

Please register here to join the virtual talk.

Abstract: This talk reviews diffusion methods to identify low-dimensional manifolds underlying high-dimensional datasets, and illustrates that by pinpointing additional mathematical structure, improved results can be obtained. Much of the talk draws on a case study from a collaboration with biological morphologists, who compare different phenotypical structures to study relationships of living or extinct animals with their surroundings and each other. This is typically done from carefully defined anatomical correspondence points (landmarks) on e.g. bones; such landmarking draws on highly specialized knowledge. To make possible more extensive use of large (and growing) databases, algorithms are required for automatic morphological correspondence maps, without any preliminary marking of special features or landmarks by the user.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday March 18 — Tim Roughgarden from Columbia University

The next Foundations of Data Science virtual talk will take place on Thursday, March 18th at 11:00 AM Pacific Time (14:00 Eastern Time, 20:00 Central European Time, 19:00 UTC).  Tim Roughgarden from Columbia Univeristy will speak about “Data-Driven Algorithm Design.”

Please register here to join the virtual talk.

Abstract: The best algorithm for a computational problem generally depends on the “relevant inputs”, a concept that depends on the application domain and often defies formal articulation. While there is a large literature on empirical approaches to selecting the best algorithm for a given application domain, there has been surprisingly little theoretical analysis of the problem.

We adapt concepts from statistical and online learning theory to reason about application-specific algorithm selection. Our models are straightforward to understand, but also expressive enough to capture several existing approaches in the theoretical computer science and AI communities, ranging from self-improving algorithms to empirical performance models. We present one framework that models algorithm selection as a statistical learning problem, and our work here shows that dimension notions from statistical learning theory, historically used to measure the complexity of classes of binary- and real-valued functions, are relevant in a much broader algorithmic context. We also study the online version of the algorithm selection problem, and give possibility and impossibility results for the existence of no-regret learning algorithms.

Joint work with Rishi Gupta.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday Feb 18 — Costis Daskalakis from MIT

Welcome to the Spring 2021 edition of Foundations of Data Science Virtual Talks.

Our first talk for the season will take place on Thursday, Feb 18th at 11:00 AM Pacific Time (14:00 Eastern Time, 20:00 Central European Time, 19:00 UTC).  Costis Daskalakis from MIT will speak about “Equilibrium Computation and the Foundations of Deep Learning.”

Please register here to join the virtual talk.

Abstract: Deep Learning has recently yielded important advances in single-agent learning challenges, much of that progress being fueled by the empirical success of gradient descent and its variants in computing local optima of non-convex optimization problems. In multi-agent learning applications, the role of single-objective optimization is played by equilibrium computation, yet our understanding of its complexity in settings that are relevant for Deep Learning remains sparse. In this talk we focus on min-max optimization of nonconvex-nonconcave objectives, which has found applications in GANs, and other adversarial learning problems. Here, not only are there no known gradient-descent based methods converging to even local and approximate min-max equilibria, but the computational complexity of identifying them remains poorly understood. We show that finding approximate local min-max equilibria of Lipschitz and smooth objectives requires a number of queries to the function and its gradient that is exponential in the relevant parameters, in sharp contrast to the polynomial number of queries required to find approximate local minima of non convex objectives. Our oracle lower bound is a byproduct of a complexity-theoretic result showing that finding approximate local min-max equilibria is computationally equivalent to finding Brouwer fixed points, and Nash equilibria in non zero-sum games, and thus PPAD-complete.

Minimal complexity theory knowledge will be assumed in the talk. Joint work with Stratis Skoulakis and Manolis Zampetakis.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday Dec 04 — Adam Smith from Boston University

The next Foundations of Data Science virtual talk will take place on Friday, Dec 04th at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC, 23:30 Indian Time).  Adam Smith from Boston University will speak about “When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?”.

Abstract: Modern machine learning models are complex, and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning.

Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of image- and text-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.

Joint work with Gavin Brown, Mark Bun, Vitaly Feldman, and Kunal Talwar.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday Nov 20 — Himanshu Tyagi from the Indian Institute of Science (IISc)

The next Foundations of Data Science virtual talk will take place on Friday, Nov 20th at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC, 23:30 Indian Time).  Himanshu Tyagi from IISc will speak about “General lower bounds for estimation under information constraints”.

Abstract:  We present very general lower bounds for parametric estimation when only limited information per sample is allowed. These limitations can arise, for example, in form of communication constraints, privacy constraints, or linear measurements. Our lower bounds hold for discrete distributions with large alphabet as well as continuous distributions with high-dimensional parameters, apply for any information constraint, and are valid for any $\ell_p$ loss function. Our bounds recover both strong data processing inequality based bounds and Cramér-Rao based bound as special cases.

This talk is based on joint work with Jayadev Acharya and Clément Canonne.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.