Tuesday, Nov 8th, 2022 — Edith Cohen from Google Research (& Tel-Aviv University)

The next Foundations of Data Science virtual talk, in the series on recent advances in adversarially robust streaming, will take place on Tuesday, November 8th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 21:00 UTC). Edith Cohen from Google Research will talk about “On Robustness to Adaptive Inputs: A Case Study of CountSketch.”

Details of the talk (Zoom link) are available here.

Abstract: The performance of randomized algorithms can be considered both in adaptive settings, where inputs may depend on prior outputs of the algorithm, and in non-adaptive settings. Adaptive settings arise when the algorithm is a component of a system with feedback or when an adversary attempts to construct an input on which the algorithm fails. The last decade saw impressive progress that included the development of black-box methods of obtaining algorithms that are robust to adaptive inputs from non-robust counterparts, most notably using differential privacy, and established lower bounds through “attacks.” But intriguing questions remain on the practical implications of adaptivity and the potential to develop tailored algorithms for specific problems and “nicer” inputs. In my talk, I will review key ideas from the literature and then present our recent work that explored these questions for CountSketch, a popular dimensionality reduction method (joint with Xin Lyu, Jelani Nelson, Tamas Sarlos, Moshe Shechner, and Uri Stemmer).
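
For readers unfamiliar with the data structure at the center of the case study, here is a minimal, non-adaptive CountSketch for frequency estimation. It is an illustration only; the class and parameter names are mine, and the robust variants analyzed in the paper are more involved.

```python
# A minimal CountSketch sketch (illustration only). Maintains a depth x width
# table; each row assigns every item a bucket and a random sign, and frequency
# estimates are medians across rows.
import random
import statistics

class CountSketch:
    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]
        # Per-row seeds standing in for pairwise-independent hash functions.
        self.seeds = [rng.randrange(2**31) for _ in range(depth)]

    def _bucket_and_sign(self, row, item):
        h = hash((self.seeds[row], item))
        return h % self.width, 1 if (h >> 31) & 1 else -1

    def update(self, item, count=1):
        for r in range(self.depth):
            b, s = self._bucket_and_sign(r, item)
            self.table[r][b] += s * count

    def estimate(self, item):
        vals = []
        for r in range(self.depth):
            b, s = self._bucket_and_sign(r, item)
            vals.append(s * self.table[r][b])
        return statistics.median(vals)

sketch = CountSketch()
for x in ["a"] * 100 + ["b"] * 10 + list("cdefgh") * 3:
    sketch.update(x)
print(sketch.estimate("a"), sketch.estimate("b"))
```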

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday Oct 5th 2022 — David Woodruff from CMU

The first Foundations of Data Science virtual talk of the season, and the first of a series on recent advances in adversarially robust streaming, will take place on Wednesday, October 5th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). David Woodruff from CMU will give a survey talk on “Adversarially Robust Streaming Algorithms.”

Details of the talk (Zoom link) are available here.

Abstract: A streaming algorithm is given a sequence of items and seeks to compute or approximate some function of this sequence using a small amount of memory. A body of work has been developed over the last two decades, resulting in optimal streaming algorithms for a number of problems. I will start by surveying some of these problems. I will then investigate the adversarial robustness of streaming algorithms. An algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems do not admit sublinear-space deterministic algorithms; on the other hand, space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. I will survey work showing for a number of streaming problems, one can obtain algorithms that are adversarially robust with a small overhead in their memory requirements.
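
As a concrete (if toy) example of the small-space randomized algorithms the survey is about, here is a Morris-style approximate counter. Like most randomized sketches, its guarantee is stated for a stream fixed in advance, which is exactly the assumption the adversarial-robustness question revisits; this is an illustration of the general setting, not an algorithm from the talk.

```python
# Morris approximate counter: ~log log n bits of state, yet its estimate of the
# stream length is unbiased. (Toy illustration of a randomized streaming sketch.)
import random

def morris_count(stream):
    c = 0  # stored exponent; the counter's estimate is 2**c - 1
    for _ in stream:
        if random.random() < 2.0 ** (-c):
            c += 1
    return 2 ** c - 1

estimates = [morris_count(range(10000)) for _ in range(20)]
print(sum(estimates) / len(estimates))  # close to 10000 on average
```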

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday May 11th 2022 — Sébastien Bubeck from Microsoft Research

The next Foundations of Data Science virtual talk of this year will take place on Wednesday, May 11th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Sébastien Bubeck from Microsoft Research will speak about “Set Chasing, with an application to online shortest path.”

Please register here to join the virtual talk.

Abstract: Since the late 19th century, mathematicians have realized the importance and generality of selection problems: given a collection of sets, select an element in each set, possibly in a “nice” way. Of particular importance in computer science is the scenario where the ground set is a metric space, in which case it is natural to ask for Lipschitz selection. In this talk I will describe a far-reaching extension of this classical Lipschitz selection problem to an *online* setting, where sets are streamed to the selector. I will show how Riemannian gradient descent (aka mirror descent) can be used to approach this type of problem. I will illustrate the power of the framework by solving a long-standing problem in online shortest path known as layered graph traversal (introduced by Papadimitriou and Yannakakis in 1989).
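
As a point of reference for the mirror-descent machinery mentioned above, here is a minimal sketch of mirror descent with the entropic mirror map on the probability simplex (i.e., multiplicative weights). The loss vectors and step size are placeholders of mine; the actual set-chasing and layered-graph-traversal algorithms in the talk are considerably more involved.

```python
# Entropic mirror descent on the probability simplex: multiplicative update
# followed by renormalization (the Bregman projection for the entropy map).
import math

def mirror_descent_simplex(losses, eta=0.1):
    n = len(losses[0])
    x = [1.0 / n] * n                      # start at the uniform distribution
    for g in losses:                       # g: linear loss vector revealed online
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]           # project back onto the simplex
    return x

print(mirror_descent_simplex([[1, 0, 2], [1, 0, 2], [0.5, 0.2, 1.0]]))
```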

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday April 20th 2022 — Gabriel Peyré from CNRS and ENS

The next Foundations of Data Science virtual talk of this year will take place on Wednesday, April 20th at 12:00 PM Pacific Time (15:00 Eastern Time, 21:00 Central European Time, 19:00 UTC). Gabriel Peyré from CNRS and Ecole Normale Supérieure will speak about “Scaling Optimal Transport for High dimensional Learning.”

Please register here to join the virtual talk.

Abstract: Optimal transport (OT) has recently gained a lot of interest in machine learning. It is a natural tool for comparing probability distributions in a geometrically faithful way. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is, however, plagued by the curse of dimensionality, since it might require a number of samples which grows exponentially with the dimension. In this talk, I will explain how to leverage entropic regularization methods to define computationally efficient loss functions, approximating OT with a better sample complexity. More information and references can be found on the website of our book “Computational Optimal Transport.”
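
As a concrete instance of the entropic-regularization approach mentioned in the abstract, here is a minimal Sinkhorn iteration for regularized OT between two discrete distributions. The regularization strength, iteration count, and toy cost matrix are my choices; libraries such as POT provide robust implementations.

```python
# Sinkhorn iterations for entropically regularized optimal transport.
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iters=200):
    K = np.exp(-cost / reg)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = np.diag(u) @ K @ np.diag(v)    # approximate transport plan
    return plan, np.sum(plan * cost)      # plan and its (unregularized) cost

n = 5
a = np.ones(n) / n
b = np.ones(n) / n
cost = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
plan, value = sinkhorn(a, b, cost)
print(value)
```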

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday April 6th 2022 — Shuchi Chawla from UT Austin

The third Foundations of Data Science virtual talk of this year will take place on Wednesday, April 6th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Shuchi Chawla from UT Austin will speak about “Pandora’s Box with Correlations: Learning and Approximation.”

Please register here to join the virtual talk.

Abstract: In the Pandora’s Box problem, the algorithm is provided with a number of boxes with unknown (stochastic) rewards contained inside them. The algorithm can open any box at some cost, discover the reward inside, and based on these observations can choose one box and keep the reward contained in it. Given the distributions from which the rewards are drawn, the algorithm must determine an order in which to open the boxes as well as when to stop and accept the best reward found so far. In general, an optimal algorithm may make both decisions adaptively based on instantiations observed previously. The Pandora’s Box problem and its extensions capture many kinds of optimization problems with stochastic input where the algorithm can obtain instantiations of input random variables at some cost. Previous work on these problems assumes that different random variables in the input are distributed independently. As such it does not capture many real-world settings. In this work, we provide the first algorithms for Pandora’s Box-type problems with correlations. In the independent setting, optimal algorithms are non-adaptive and based on the notion of the Gittins index. These techniques fail to extend to the correlated case. We assume that the algorithm has access to samples drawn from the joint distribution on input and provide solutions that require few samples; are computationally efficient; and guarantee approximate optimality.
This is joint work with Evangelia Gergatsouli, Yifeng Teng, Christos Tzamos, and Ruimin Zhang and appeared in FOCS’20.
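
To make the remark about the independent case concrete, here is a sketch of the classical reservation-value (Weitzman-style) index policy the abstract alludes to: each box gets an index sigma solving E[(v − sigma)+] = cost, boxes are opened in decreasing index order, and the search stops once the best reward in hand beats every remaining index. The toy discrete distributions and bisection tolerances are mine; the correlated-case algorithms from the paper are not shown.

```python
# Index policy for independent Pandora's Box (a sketch under the assumptions above).
import random

def reservation_value(values, probs, cost, lo=0.0, hi=1e6, iters=60):
    # Bisection on sigma for E[(v - sigma)^+] = cost (the gain is decreasing in sigma).
    for _ in range(iters):
        mid = (lo + hi) / 2
        gain = sum(p * max(v - mid, 0.0) for v, p in zip(values, probs))
        lo, hi = (mid, hi) if gain > cost else (lo, mid)
    return (lo + hi) / 2

def run_policy(boxes):
    # boxes: list of (values, probs, cost); open in decreasing index order.
    indexed = sorted(boxes, key=lambda b: -reservation_value(*b))
    best, total_cost = 0.0, 0.0
    for values, probs, cost in indexed:
        if best >= reservation_value(values, probs, cost):
            break                      # best reward in hand beats every remaining index
        total_cost += cost
        best = max(best, random.choices(values, probs)[0])
    return best - total_cost

boxes = [([0, 10], [0.5, 0.5], 1.0), ([0, 100], [0.95, 0.05], 2.0)]
print(sum(run_policy(boxes) for _ in range(10000)) / 10000)
```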

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday March 16th 2022 — Eric Tchetgen Tchetgen from Penn

The second Foundations of Data Science virtual talk of this year will take place on Wednesday, March 16th at 1:00 PM Pacific Time (16:00 Eastern Time, 21:00 Central European Time, 20:00 UTC). Eric Tchetgen Tchetgen from The Wharton School, University of Pennsylvania will speak about “An Introduction to Proximal Causal Learning.”

Please register here to join the virtual talk.

Abstract: A standard assumption for causal inference from observational data is that one has measured a sufficiently rich set of covariates to ensure that, within covariate strata, subjects are exchangeable across observed treatment values. Skepticism about the exchangeability assumption in observational studies is often warranted because it hinges on one’s ability to accurately measure covariates capturing all potential sources of confounding. Realistically, confounding mechanisms can rarely, if ever, be learned with certainty from measured covariates. One can therefore only ever hope that covariate measurements are at best proxies of the true underlying confounding mechanisms operating in an observational study, thus invalidating causal claims made on the basis of standard exchangeability conditions.

Causal learning from proxies is a challenging inverse problem which has to date remained unresolved. In this paper, we introduce a formal potential outcome framework for proximal causal learning, which, while explicitly acknowledging covariate measurements as imperfect proxies of confounding mechanisms, offers an opportunity to learn about causal effects in settings where exchangeability based on measured covariates fails. Sufficient conditions for nonparametric identification are given, leading to the proximal g-formula and corresponding proximal g-computation algorithm for estimation, both generalizations of Robins’ foundational g-formula and g-computation algorithm, which account explicitly for bias due to unmeasured confounding. Both point treatment and time-varying treatment settings are considered, and an application of proximal g-computation of causal effects is given for illustration.
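
For readers who want the shape of the identification result, here is one common way to state the point-treatment proximal g-formula, in notation chosen here rather than taken from the abstract: with treatment A, outcome Y, measured covariates X, and two confounding proxies Z and W, suppose an outcome confounding bridge function h satisfies E[Y | Z, A, X] = E[h(W, A, X) | Z, A, X] almost surely. Then, under the completeness and related conditions spelled out in the paper, the counterfactual mean is identified by

```latex
% Point-treatment proximal g-formula (a sketch in the notation of the lead-in,
% not a verbatim statement from the talk):
E\{Y(a)\} \;=\; E\{h(W, a, X)\} \;=\; \int\!\!\int h(w, a, x)\, dF(w, x).
```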

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday Jan 13th 2022 — Piotr Indyk from MIT

The first Foundations of Data Science virtual talk of this year will take place on Thursday, Jan 13th at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 18:00 UTC). Piotr Indyk from MIT will speak about “Learning-Based Sampling and Streaming”.

Please register here to join the virtual talk.

Abstract: Classical algorithms typically provide “one size fits all” performance, and do not leverage properties or patterns in their inputs. A recent line of work aims to address this issue by developing algorithms that use machine learning predictions to improve their performance. In this talk I will present two examples of this type, in the context of streaming and sampling algorithms. In particular, I will show how to use machine learning predictions to improve the performance of (a) low-memory streaming algorithms for frequency estimation (ICLR’19), and (b) sampling algorithms for estimating the support size of a distribution (ICLR’21).   Both algorithms use an ML-based predictor that, given a data item, estimates the number of times the item occurs in the input data set.  

The talk will cover material from papers co-authored with T. Eden, C.-Y. Hsu, D. Katabi, S. Narayanan, R. Rubinfeld, S. Silwal, T. Wagner, and A. Vakilian.
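
A minimal sketch of the learning-augmented idea in item (a), as described in the abstract: a predictor flags items that are likely heavy hitters, which receive exact dedicated counters, while everything else is routed to a small Count-Min sketch. The toy predictor, table sizes, and data below are stand-ins of mine, not the construction or evaluation from the papers.

```python
import random
from collections import defaultdict

class CountMin:
    """A tiny Count-Min sketch: hashed counters per row, estimate = min over rows."""
    def __init__(self, depth=4, width=64, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.randrange(2**31) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
        self.width = width

    def update(self, item, count=1):
        for row, s in zip(self.table, self.seeds):
            row[hash((s, item)) % self.width] += count

    def estimate(self, item):
        return min(row[hash((s, item)) % self.width]
                   for row, s in zip(self.table, self.seeds))

def looks_heavy(item):
    # Stand-in for a learned oracle that predicts whether an item is frequent.
    return item == "popular"

exact = defaultdict(int)   # dedicated exact counters for predicted heavy items
sketch = CountMin()
stream = ["popular"] * 1000 + [f"rare{i}" for i in range(500)]
for x in stream:
    if looks_heavy(x):
        exact[x] += 1
    else:
        sketch.update(x)

def estimate(item):
    return exact[item] if looks_heavy(item) else sketch.estimate(item)

print(estimate("popular"), estimate("rare7"))
```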

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday Nov 18th — Nicole Immorlica from Microsoft Research

The next Foundations of Data Science virtual talk will take place on Thursday, Nov 18th at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 18:00 UTC). Nicole Immorlica from Microsoft Research will speak about “Communicating with Anecdotes”.

Please register here to join the virtual talk.

Abstract: Classic models of communication in economics typically assume agents can communicate any message. However, many important communications, such as those in newspapers or politicians’ speeches, use data to convey information. In this talk, we explore how the reliance on data impacts communication. In our model, there are two Bayesian agents (a sender and a receiver) who wish to communicate. The receiver must take an action whose payoff depends on their personal preferences and an unknown state of the world. The sender has access to a collection of data points correlated with the state of the world and can send exactly one of these to the receiver in order to influence her choice of action. Importantly, the sender’s personal preferences may differ from the receiver’s, which affects the sender’s strategic choice of what to send. We show that in a Nash equilibrium even a small difference in preferences can lead to a significant bias in the communicated datum. This can significantly reduce the informativeness of the communication, leading to substantial utility loss for both sides. One implication is informational homophily: a receiver can rationally prefer to obtain data from a poorly-informed sender with aligned preferences, rather than a knowledgeable expert whose preferences may differ from her own.

Joint work with Nika Haghtalab, Brendan Lucier, Markus Mobius and Divya Mohan.
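
The following toy simulation, which is my simplification and not the paper’s equilibrium analysis, illustrates the mechanism in the abstract: a sender holding several noisy observations of the state forwards the one closest to its own preferred action, and even a modest preference gap skews what the receiver learns.

```python
# Toy (non-equilibrium) illustration of strategic anecdote selection.
import random
import statistics

def one_round(bias, n=10):
    theta = random.gauss(0, 1)                         # unknown state of the world
    data = [theta + random.gauss(0, 1) for _ in range(n)]
    target = statistics.mean(data) + bias              # sender's preferred action
    sent = min(data, key=lambda x: abs(x - target))    # strategic choice of datum
    return sent - theta                                # receiver's error if she acts on it

for bias in (0.0, 0.25, 1.0):
    errors = [one_round(bias) for _ in range(20000)]
    print(bias, round(statistics.mean(errors), 3))     # average error grows with bias
```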

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday Oct 21st — Maxim Raginsky from UIUC

The next Foundations of Data Science virtual talk will take place on Thursday, Oct 21st at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC). Maxim Raginsky from the University of Illinois, Urbana-Champaign will speak about “Neural SDEs: Deep Generative Models in the Diffusion Limit”.

Please register here to join the virtual talk.

Abstract: In deep generative models, the latent variable is generated by a time-inhomogeneous Markov chain, where at each time step we pass the current state through a parametric nonlinear map, such as a feedforward neural net, and add a small independent Gaussian perturbation. In this talk, based on joint work with Belinda Tzen, I will discuss the diffusion limit of such models, where we increase the number of layers while sending the step size and the noise variance to zero. I will first provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control. Then I will show how we can quantify the expressiveness of diffusion-based generative models. Specifically, I will prove that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback-Leibler divergence to the target distribution. Finally, I will briefly discuss a scheme for unbiased, finite-variance simulation in such models. This scheme can be implemented as a deep generative model with a random number of layers.
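
A sketch of the construction in the abstract’s first sentence: the latent variable follows a time-inhomogeneous Markov chain in which each step applies a small feedforward map and adds a small Gaussian perturbation, approaching a (neural) SDE as the number of layers grows. The network, its random weights, the noise scale, and the number of layers below are placeholder choices for illustration.

```python
# Euler-Maruyama-style layered generative chain:
#   x_{k+1} = x_k + f_theta(x_k, k/L) * dt + sigma * sqrt(dt) * eps_k
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, layers, sigma = 2, 32, 100, 0.5   # placeholder sizes and noise scale
dt = 1.0 / layers
W1 = rng.normal(0.0, 1.0, (hidden, dim + 1))   # extra input for the time coordinate
W2 = rng.normal(0.0, 1.0, (dim, hidden)) / np.sqrt(hidden)

def drift(x, t):
    # A small feedforward map standing in for the per-layer parametric nonlinearity.
    return W2 @ np.tanh(W1 @ np.concatenate([x, [t]]))

def sample(x0):
    x = np.array(x0, dtype=float)
    for k in range(layers):
        eps = rng.normal(0.0, 1.0, dim)
        # One layer of the chain: drift step plus a small Gaussian perturbation.
        x = x + drift(x, k * dt) * dt + sigma * np.sqrt(dt) * eps
    return x   # as layers grows, the terminal law approaches that of a neural SDE

print(np.stack([sample(np.zeros(dim)) for _ in range(5)]))
```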

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday Sept 23rd — Christos Papadimitriou from Columbia University

The next Foundations of Data Science virtual talk will take place on Thursday, Sept 23rd at 10:00 AM Pacific Time (13:00 Eastern Time, 18:00 Central European Time, 17:00 UTC). Christos Papadimitriou from Columbia University will speak about “How does the brain beget the mind?”

Please register here to join the virtual talk.

Abstract: How do molecules, cells and synapses effect reasoning, intelligence, planning, language? Despite dazzling progress in experimental neuroscience, as well as in cognitive science at the other extreme of scale, we do not seem to be making progress in the overarching question: the gap is huge and a completely new approach seems to be required. As Richard Axel recently put it: “We don’t have a logic for the transformation of neural activity into thought […].”

What kind of formal system would qualify as this “logic”?

I will introduce the Assembly Calculus (AC), a computational system which appears to be a promising bridge between neurons and cognition. Through this programming framework, a Parser was recently implemented which (a) can handle reasonably complex sentences of English and other languages, and (b) works exclusively through the spiking of neurons.
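
For intuition about the kind of primitive the Assembly Calculus builds on, here is a toy simulation, based on my reading of the published AC papers rather than on the talk itself: a stimulus is repeatedly projected into an area of excitatory neurons with random connectivity, a k-winners-take-all “cap”, and simple Hebbian weight scaling, so that a stable assembly of k neurons emerges. All sizes and the plasticity parameter are placeholders.

```python
# Toy "project" operation with a k-cap and Hebbian plasticity.
import numpy as np

rng = np.random.default_rng(0)
n, k, p, beta, rounds = 1000, 50, 0.05, 0.1, 20
stim_w = rng.binomial(1, p, n).astype(float)         # stimulus -> area weights
rec_w = rng.binomial(1, p, (n, n)).astype(float)     # recurrent area weights (post, pre)

active = np.zeros(n, dtype=bool)
for _ in range(rounds):
    inputs = stim_w + rec_w[:, active].sum(axis=1)   # total synaptic input per neuron
    winners = np.argsort(inputs)[-k:]                # cap: only the top-k neurons fire
    new_active = np.zeros(n, dtype=bool)
    new_active[winners] = True
    # Hebbian plasticity: strengthen synapses into neurons that just fired.
    stim_w[new_active] *= 1 + beta
    rec_w[np.ix_(new_active, active)] *= 1 + beta
    active = new_active

print(active.sum(), "neurons in the resulting assembly")
```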

The series is supported by the NSF HDR TRIPODS Grant 1934846.