## Thursday April 1st — Ingrid Daubechies from Duke University

The next Foundations of Data Science virtual talk will take place on Thursday, April 1st at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 17:00 UTC).  Ingrid Daubechies from Duke Univeristy will speak about “Discovering low-dimensional manifolds in high-dimensional data sets.”

Please register here to join the virtual talk.

Abstract: This talk reviews diffusion methods to identify low-dimensional manifolds underlying high-dimensional datasets, and illustrates that by pinpointing additional mathematical structure, improved results can be obtained. Much of the talk draws on a case study from a collaboration with biological morphologists, who compare different phenotypical structures to study relationships of living or extinct animals with their surroundings and each other. This is typically done from carefully defined anatomical correspondence points (landmarks) on e.g. bones; such landmarking draws on highly specialized knowledge. To make possible more extensive use of large (and growing) databases, algorithms are required for automatic morphological correspondence maps, without any preliminary marking of special features or landmarks by the user.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday March 18 — Tim Roughgarden from Columbia University

The next Foundations of Data Science virtual talk will take place on Thursday, March 18th at 11:00 AM Pacific Time (14:00 Eastern Time, 20:00 Central European Time, 19:00 UTC).  Tim Roughgarden from Columbia Univeristy will speak about “Data-Driven Algorithm Design.”

Please register here to join the virtual talk.

Abstract: The best algorithm for a computational problem generally depends on the “relevant inputs”, a concept that depends on the application domain and often defies formal articulation. While there is a large literature on empirical approaches to selecting the best algorithm for a given application domain, there has been surprisingly little theoretical analysis of the problem.

We adapt concepts from statistical and online learning theory to reason about application-specific algorithm selection. Our models are straightforward to understand, but also expressive enough to capture several existing approaches in the theoretical computer science and AI communities, ranging from self-improving algorithms to empirical performance models. We present one framework that models algorithm selection as a statistical learning problem, and our work here shows that dimension notions from statistical learning theory, historically used to measure the complexity of classes of binary- and real-valued functions, are relevant in a much broader algorithmic context. We also study the online version of the algorithm selection problem, and give possibility and impossibility results for the existence of no-regret learning algorithms.

Joint work with Rishi Gupta.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Thursday Feb 18 — Costis Daskalakis from MIT

Welcome to the Spring 2021 edition of Foundations of Data Science Virtual Talks.

Our first talk for the season will take place on Thursday, Feb 18th at 11:00 AM Pacific Time (14:00 Eastern Time, 20:00 Central European Time, 19:00 UTC).  Costis Daskalakis from MIT will speak about “Equilibrium Computation and the Foundations of Deep Learning.”

Please register here to join the virtual talk.

Abstract: Deep Learning has recently yielded important advances in single-agent learning challenges, much of that progress being fueled by the empirical success of gradient descent and its variants in computing local optima of non-convex optimization problems. In multi-agent learning applications, the role of single-objective optimization is played by equilibrium computation, yet our understanding of its complexity in settings that are relevant for Deep Learning remains sparse. In this talk we focus on min-max optimization of nonconvex-nonconcave objectives, which has found applications in GANs, and other adversarial learning problems. Here, not only are there no known gradient-descent based methods converging to even local and approximate min-max equilibria, but the computational complexity of identifying them remains poorly understood. We show that finding approximate local min-max equilibria of Lipschitz and smooth objectives requires a number of queries to the function and its gradient that is exponential in the relevant parameters, in sharp contrast to the polynomial number of queries required to find approximate local minima of non convex objectives. Our oracle lower bound is a byproduct of a complexity-theoretic result showing that finding approximate local min-max equilibria is computationally equivalent to finding Brouwer fixed points, and Nash equilibria in non zero-sum games, and thus PPAD-complete.

Minimal complexity theory knowledge will be assumed in the talk. Joint work with Stratis Skoulakis and Manolis Zampetakis.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday Dec 04 — Adam Smith from Boston University

The next Foundations of Data Science virtual talk will take place on Friday, Dec 04th at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC, 23:30 Indian Time).  Adam Smith from Boston University will speak about “When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?”.

Abstract: Modern machine learning models are complex, and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning.

Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of image- and text-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.

Joint work with Gavin Brown, Mark Bun, Vitaly Feldman, and Kunal Talwar.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday Nov 20 — Himanshu Tyagi from the Indian Institute of Science (IISc)

The next Foundations of Data Science virtual talk will take place on Friday, Nov 20th at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC, 23:30 Indian Time).  Himanshu Tyagi from IISc will speak about “General lower bounds for estimation under information constraints”.

Abstract:  We present very general lower bounds for parametric estimation when only limited information per sample is allowed. These limitations can arise, for example, in form of communication constraints, privacy constraints, or linear measurements. Our lower bounds hold for discrete distributions with large alphabet as well as continuous distributions with high-dimensional parameters, apply for any information constraint, and are valid for any $\ell_p$ loss function. Our bounds recover both strong data processing inequality based bounds and Cramér-Rao based bound as special cases.

This talk is based on joint work with Jayadev Acharya and Clément Canonne.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Monday, Nov 09 — Tal Rabin from University of Pennsylvania

The next Foundations of Data Science virtual talk will take place on Monday, Nov 09th at 10:00 AM Pacific Time (1:00 pm Eastern Time, 18:00 Central European Time, 17:00 UTC).  Tal Rabin from UPenn will speak about “You Only Speak Once — Secure MPC with Stateless Ephemeral Roles”.

Abstract: The inherent difficulty of maintaining stateful environments over long periods of time gave rise to the paradigm of serverless computing, where mostly-stateless components are deployed on demand to handle computation tasks, and are teared down once their task is complete. Serverless architecture could offer the added benefit of improved resistance to targeted denial-of-service attacks. Realizing such protection,
requires that the protocol only uses stateless parties. Perhaps the most famous example of this style of protocols is the Nakamoto consensus protocol used in Bitcoin. We refer to this stateless property as the You-Only-Speak-Once (YOSO) property, and initiate the formal study of it within a new YOSO model. Our model is centered around the notion of roles, which are stateless parties that can only send a single message. Furthermore, we describe several techniques for achieving YOSO MPC; both computational and information theoretic.

The talk will be self contained.

Based on joint works with: Fabrice Benhamouda, Craig Gentry, Sergey Gorbunov, Shai Halevi, Hugo Krawczyk, Chengyu Lin, Bernardo Magri, Jesper Nielsen, Leo Reyzin, Sophia Yakoubov.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday, Oct 09 — Alexandr Andoni from Columbia University

The next Foundations of Data Science virtual talk will take place on Friday, Oct 09th at 10:00 AM Pacific Time (1:00 pm Eastern Time, 18:00 Central European Time, 17:00 UTC).  Alexandr Andoni from Columbia University will speak about “Approximating Edit Distance in Near-Linear Time”.

Abstract: Edit distance is a classic measure of similarity between strings, with applications ranging from computational biology to coding. Computing edit distance is also a classic dynamic programming problem, with a quadratic run-time solution, often taught in the “Intro to Algorithms” classes. Improving this runtime has been a decades-old challenge, now ruled likely-impossible using tools from the modern area of fine-grained complexity. We show how to approximate the edit distance between two strings in near-linear time, up to a constant factor. Our result completes a research direction set forth in the breakthrough paper of [Chakraborty, Das, Goldenberg, Koucky, Saks; FOCS’18], which showed the first constant-factor approximation algorithm with a (strongly) sub-quadratic running time.

Joint work with Negev Shekel Nosatzki, available at https://arxiv.org/abs/2005.07678.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday, Sept 11 — Bin Yu from UC Berkeley

The next Foundations of Data Science virtual talk will take place on Friday, Sept 11th at 10:00 AM Pacific Time (1:00 pm Eastern Time, 18:00 Central European Time, 17:00 UTC).  Bin Yu from UC Berkeley will speak about “Veridical Data Science”.

Abstract: Building and expanding on principles of statistics, machine learning, and the sciences, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework is comprised of both a workflow and documentation and aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. We develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations.

Moreover, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis.

The PCS framework will be illustrated through our DeepTune approach to model and characterize neurons in the difficult visual cortex area V4.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday, May 15 — Amin Karbasi from Yale University

The fourth Foundations of Data Science virtual talk will take place on Friday, May 15th at 11:00 AM Pacific Time (2:00 pm Eastern Time, 20:00 Central European Time, 19:00 UTC).  Amin Karbasi from Yale University will speak about “User-Friendly Submodular Maximization”.

Abstract: Submodular functions model the intuitive notion of diminishing returns. Due to their far-reaching applications, they have been rediscovered in many fields such as information theory, operations research, statistical physics, economics, and machine learning. They also enjoy computational tractability as they can be minimized exactly or maximized approximately.

The goal of this talk is simple. We see how a little bit of randomness, a little bit of greediness, and the right combination can lead to pretty good methods for offline, streaming, and distributed solutions. I do not assume any background on submodularity and try to explain all the required details during the talk.

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

## Friday, April 17 — Shachar Lovett from UC San Diego

The third Foundations of Data Science virtual talk will take place next Friday, April 17th at 11:00 AM Pacific Time (2:00 pm Eastern Time, 20:00 Central European Time, 19:00 UTC).  Shachar Lovett from UC San Diego will speak about “The power of asking more informative questions about the data”.

Abstract: Many supervised learning algorithms (such as deep learning) need a large collection of labelled data points in order to perform well. However, what is easy to get are large amounts of unlabelled data. Labeling data is an expensive procedure, as it usually needs to be done manually, often by a domain expert. Active learning provides a mechanism to bridge this gap. Active learning algorithms are given a large collection of unlabelled data points. They need to smartly choose a few data points to query their label. The goal is then to automatically infer the labels of many other data points.

In this talk, we will explore the option of giving active learning algorithms additional power, by allowing them to have richer interaction with the data. We will see how allowing for even simple types of queries, such as comparing two data points, can exponentially improve the number of queries needed in various settings. Along the way, we will see interesting connections to both geometry and combinatorics, and a surprising application to fine grained complexity.

Based on joint works with Daniel Kane, Shay Moran and Jiapeng Zhang.

Link to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.