
Thursday, Mar 16th, 2023 — Vladimir Braverman from Rice University

The next talk in the Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Thursday, March 16th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 21:00 UTC). Vladimir Braverman from Rice University will talk about “Adversarial Robustness of Streaming Algorithms through Importance Sampling”.

Details of the talk (Zoom link) are available here.

Abstract: Robustness against adversarial attacks has recently been at the forefront of algorithmic design for machine learning tasks. In the adversarial streaming model, an adversary gives an algorithm a sequence of adaptively chosen updates u1, … ,un as a data stream. The goal of the algorithm is to compute or approximate some predetermined function for every prefix of the adversarial stream, but the adversary may generate future updates based on previous outputs of the algorithm. In particular, the adversary may gradually learn the random bits internally used by an algorithm to manipulate dependencies in the input. This is especially problematic as many important problems in the streaming model require randomized algorithms, as they are known to not admit any deterministic algorithms that use sublinear space. In this paper, we introduce adversarially robust streaming algorithms for central machine learning and algorithmic tasks, such as regression and clustering, as well as their more general counterparts, subspace embedding, low-rank approximation, and coreset construction. For regression and other numerical linear algebra related tasks, we consider the row arrival streaming model. Our results are based on a simple, but powerful, observation that many importance sampling-based algorithms give rise to adversarial robustness which is in contrast to sketching based algorithms, which are very prevalent in the streaming literature but suffer from adversarial attacks. In addition, we show that the well-known merge and reduce paradigm in streaming is adversarially robust. Since the merge and reduce paradigm allows coreset constructions in the streaming setting, we thus obtain robust algorithms for k-means, k-median, k-center, Bregman clustering, projective clustering, principal component analysis (PCA) and non-negative matrix factorization. To the best of our knowledge, these are the first adversarially robust results for these problems yet require no new algorithmic implementations. Finally, we empirically confirm the robustness of our algorithms on various adversarial attacks and demonstrate that by contrast, some common existing algorithms are not robust.
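
To give a concrete feel for the merge-and-reduce paradigm mentioned in the abstract, here is a minimal Python sketch. The `reduce_fn` argument stands in for a real summary construction (for instance, a k-means coreset); the uniform `subsample` placeholder below is purely illustrative and is not the paper's method or its robustness argument.

```python
import random

def merge_and_reduce(stream, block_size, reduce_fn):
    """Maintain per-level summaries; whenever two summaries share a level,
    merge them and re-reduce, as in the standard merge-and-reduce paradigm."""
    levels = {}      # level -> summary (a list of points)
    buffer = []
    for x in stream:
        buffer.append(x)
        if len(buffer) == block_size:
            summary, level = reduce_fn(buffer), 0
            buffer = []
            while level in levels:                      # carry up the levels
                summary = reduce_fn(levels.pop(level) + summary)
                level += 1
            levels[level] = summary
    parts = ([reduce_fn(buffer)] if buffer else []) + list(levels.values())
    return [p for part in parts for p in part]          # final merged summary

def subsample(points, size=64):
    """Placeholder reduce step: uniform subsample (a real coreset would reweight)."""
    return list(points) if len(points) <= size else random.sample(points, size)

print(len(merge_and_reduce(range(10_000), block_size=128, reduce_fn=subsample)))
```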

This is joint work with Avinatan Hassidim, Yossi Matias, Mariano Schain, Sandeep Silwal, and Samson Zhou. The result appeared in NeurIPS 2021.

 The series is supported by the NSF HDR TRIPODS Grant 1934846.


Thursday, Feb 9th, 2023 — Amit Chakrabarti from Dartmouth College

The next talk in the Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Thursday, February 9th at 2:00 PM Pacific Time (17:00 Eastern Time, 23:00 Central European Time, 22:00 UTC). Amit Chakrabarti from Dartmouth College will talk about “How to color your adversary’s graph”.

Details of the talk (Zoom link) are available here.

Abstract: An n-vertex graph with maximum degree D is (D+1)-colorable: an almost trivial combinatorial result, with an equally simple greedy algorithm to produce a (D+1)-coloring. However, given a stream of edges of such a graph, can we maintain a valid (D+1)-coloring as the edges arrive, while using not much more than O(n) space? What if the edges are chosen by an adversary who can look at our current coloring and add additional edges to try to confound us? This is the newly popular setting of adversarially robust streaming algorithms, and this talk is about the coloring problem in this setting.
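
For reference, here is the simple offline greedy (D+1)-coloring alluded to above, as a short Python sketch; the streaming and adversarially robust algorithms discussed in the talk are far more involved.

```python
def greedy_coloring(n, edges):
    """Offline greedy coloring: every vertex sees at most D blocked colors,
    so a color in {0, ..., D} is always available."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for v in range(n):
        used = {color[u] for u in adj[v] if color[u] is not None}
        c = 0
        while c in used:          # at most deg(v) <= D colors can be blocked
            c += 1
        color[v] = c
    return color

# toy usage: a 4-cycle is 2-colorable, and greedy finds a valid coloring
print(greedy_coloring(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```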

We obtain upper and lower bound results for this problem. In O(n polylog n) space, we can maintain an O(D^(5/2))-coloring of such an adversarial graph. On the other hand, every adversarially robust coloring algorithm under the same space limitation must spend Omega(D^2) colors. We in fact prove more general results that trade off the space usage against the color budget. One interesting by-product of our work is that, in combination with the celebrated Assadi-Chen-Khanna algorithm (SODA 2019), it provides the first separation between randomized and deterministic algorithms for the (ordinary, non-robust) streaming graph coloring problem.

Based on joint works [C.-Ghosh-Stoeckl] and [Assadi-C.-Ghosh-Stoeckl].

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday, Dec 14th, 2022 — Omri Ben-Eliezer from MIT

The next talk in the Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Wednesday, December 14th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 21:00 UTC). Omri Ben-Eliezer from MIT will talk about “Robust sampling and online learning”.

Details of the talk (Zoom link) are available here.

Abstract: Random sampling is a fundamental primitive in modern algorithms, statistics and machine learning, used as a generic method to obtain a small yet “representative” subset of the data. In this talk we shall explore to what extent random sampling is robust to adaptive inputs in a streaming setting, proposing and studying a model for sampling against a white-box adaptive adversary (where future inputs generated by the adversary are allowed to depend on the current internal state of the sampler). I will demonstrate scenarios where the adaptive sample complexity can be much larger than in the static world, but argue that these scenarios are not entirely realistic. I will then reveal a deep connection between our setting and online learning. As it turns out, the sample complexity in our sampling setting is captured by a sequential version of Rademacher complexity, a notion that is also known to capture the regret in online classification problems. Leveraging this connection, one can bound the sample complexity using the Littlestone dimension of the relevant concept class. The obtained bounds are tight in many regimes, and lead us to resolve a classical open problem on optimal regret bounds in online learning.
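
For context, the kind of streaming sampling primitive under study can be illustrated by a classic reservoir sampler. In the white-box model of the talk, the adversary would see the `reservoir` contents and the sampler's randomness before choosing the next stream item; the sketch below is the generic textbook sampler, not an algorithm from the talk.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Classic 'Algorithm R': keeps a uniform random sample of size k
    from a stream of unknown length, one pass, O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)     # item i replaces a slot w.p. k/(i+1)
            if j < k:
                reservoir[j] = x
    return reservoir

print(reservoir_sample(range(1_000), k=5))
```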

Based on joint works with Noga Alon, Yuval Dagan, Shay Moran, Moni Naor, and Eylon Yogev.

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Tuesday, Nov 8th, 2022 — Edith Cohen from Google Research (& Tel-Aviv University)

The next talk in the Foundations of Data Science virtual talk series on recent advances in adversarially robust streaming will take place on Tuesday, November 8th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 21:00 UTC). Edith Cohen from Google Research will talk about “On Robustness to Adaptive Inputs: A Case Study of CountSketch”.

Details of the talk (Zoom link) are available here.

Abstract: The performance of randomized algorithms can be considered both in adaptive settings, where inputs may depend on prior outputs of the algorithm, and in non-adaptive settings. Adaptive settings arise when the algorithm is a component of a system with feedback or when an adversary attempts to construct an input on which the algorithm fails. The last decade saw impressive progress that included the development of black-box methods of obtaining algorithms that are robust to adaptive inputs from non-robust counterparts, most notably using differential privacy, and established lower bounds through “attacks.” But intriguing questions remain on the practical implications of adaptivity and the potential to develop tailored algorithms for specific problems and “nicer” inputs. In my talk, I will review key ideas from the literature and then present our recent work that explored these questions for CountSketch, a popular dimensionality reduction method (joint with Xin Lyu, Jelani Nelson, Tamas Sarlos, Moshe Shechner, and Uri Stemmer).
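
For readers unfamiliar with the data structure, here is a minimal, standard CountSketch in Python: d rows of w signed counters, with the median over rows as the estimate. The per-row salting of Python's built-in hash is only a stand-in for the pairwise-independent hash functions the analysis assumes, and this is the plain textbook construction, not the robust variant studied in the paper.

```python
import random

class CountSketch:
    """Standard CountSketch: each row hashes an item to a bucket with a random
    sign; frequency estimates are medians over the d row estimates."""

    def __init__(self, d=5, w=256, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]
        # per-row salts used with hash() as a stand-in for real hash families
        self.salts = [(rng.random(), rng.random()) for _ in range(d)]

    def _bucket_sign(self, row, key):
        bucket_salt, sign_salt = self.salts[row]
        bucket = hash((bucket_salt, key)) % self.w
        sign = 1 if hash((sign_salt, key)) % 2 == 0 else -1
        return bucket, sign

    def update(self, key, delta=1):
        for r in range(self.d):
            b, s = self._bucket_sign(r, key)
            self.table[r][b] += s * delta

    def estimate(self, key):
        ests = sorted(s * self.table[r][b]
                      for r in range(self.d)
                      for b, s in [self._bucket_sign(r, key)])
        return ests[len(ests) // 2]       # median over the d rows

cs = CountSketch()
for i in range(10_000):
    cs.update(i % 100)                    # every key 0..99 appears 100 times
print(cs.estimate(7))                     # should be close to 100
```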

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday Oct 5th 2022 — David Woodruff from CMU

The first Foundations of Data Science virtual talk of the season, and the first of a series on recent advances in adversarially robust streaming, will take place on Wednesday, October 5th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). David Woodruff from CMU will give a survey talk on “Adversarially Robust Streaming Algorithms”.

Details of the talk (Zoom link) are available here.

Abstract: A streaming algorithm is given a sequence of items and seeks to compute or approximate some function of this sequence using a small amount of memory. A body of work has been developed over the last two decades, resulting in optimal streaming algorithms for a number of problems. I will start by surveying some of these problems. I will then investigate the adversarial robustness of streaming algorithms. An algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems do not admit sublinear-space deterministic algorithms; on the other hand, space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. I will survey work showing for a number of streaming problems, one can obtain algorithms that are adversarially robust with a small overhead in their memory requirements.

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday May 11th 2022 — Sébastien Bubeck from Microsoft Research

The next Foundations of Data Science virtual talk of this year will take place on Wednesday, May 11th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Sébastien Bubeck from Microsoft Research will speak about “Set Chasing, with an application to online shortest path”.

Please register here to join the virtual talk.

Abstract: Since the late 19th century, mathematicians have realized the importance and generality of selection problems: given a collection of sets, select an element in each set, possibly in a “nice” way. Of particular importance in computer science is the scenario where the ground set is a metric space, in which case it is natural to ask for Lipschitz selection. In this talk, I will describe a far-reaching extension of this classical Lipschitz selection problem to an *online* setting, where sets are streaming to the selector. I will show how Riemannian gradient descent (aka mirror descent) can be used to approach this type of problem. I will illustrate the power of the framework by solving a long-standing problem in online shortest path known as layered graph traversal (introduced by Papadimitriou and Yannakakis in 1989).
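
As a small illustration of the mirror-descent machinery mentioned above, here is entropic mirror descent (multiplicative weights) on the probability simplex; the step size, toy objective, and setup are illustrative assumptions, and the talk's actual set-chasing and layered-graph-traversal algorithms are substantially more involved.

```python
import numpy as np

def mirror_descent_simplex(grad_fn, dim, steps=500, eta=0.5):
    """Entropic mirror descent on the simplex: multiplicative update, then
    renormalize; equivalent to gradient descent in the KL geometry."""
    x = np.full(dim, 1.0 / dim)        # start at the uniform distribution
    for _ in range(steps):
        g = grad_fn(x)
        x = x * np.exp(-eta * g)       # multiplicative (entropic) update
        x /= x.sum()                   # project back onto the simplex
    return x

# toy usage: minimize <c, x> over the simplex; mass concentrates on argmin(c)
c = np.array([3.0, 1.0, 2.0])
print(mirror_descent_simplex(lambda x: c, dim=3))
```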

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday April 20th 2022 — Gabriel Peyré from CNRS and ENS

The next Foundations of Data Science virtual talk of this year will take place on Wednesday, April 20th at 12:00 PM Pacific Time (15:00 Eastern Time, 21:00 Central European Time, 19:00 UTC). Gabriel Peyré from CNRS and Ecole Normale Supérieure will speak about “Scaling Optimal Transport for High dimensional Learning”.

Please register here to join the virtual talk.

Abstract: Optimal transport (OT) has recently gained a lot of interest in machine learning. It is a natural tool for comparing probability distributions in a geometrically faithful way. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is, however, plagued by the curse of dimensionality, since it might require a number of samples that grows exponentially with the dimension. In this talk, I will explain how to leverage entropic regularization methods to define computationally efficient loss functions, approximating OT with a better sample complexity. More information and references can be found on the website of our book “Computational Optimal Transport.”
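
For a concrete sense of the entropic approach, here is a minimal Sinkhorn iteration for entropically regularized OT between two discrete distributions; the regularization strength `eps`, the iteration count, and the toy cost matrix are illustrative choices, not a recommendation from the talk.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropic OT between histograms a and b with cost matrix C:
    alternate scalings of the Gibbs kernel until the marginals match."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                 # match the column marginal
        u = a / (K @ v)                   # match the row marginal
    P = u[:, None] * K * v[None, :]       # approximate transport plan
    return P, np.sum(P * C)               # plan and its transport cost

# toy usage: two uniform histograms on 5 points with squared-distance cost
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2
a = b = np.full(5, 0.2)
P, cost = sinkhorn(a, b, C)
print(cost)
```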

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday April 6th 2022 — Shuchi Chawla from UT Austin

The third Foundations of Data Science virtual talk of this year will take place on Wednesday, April 6th at 1:00 PM Pacific Time (16:00 Eastern Time, 22:00 Central European Time, 20:00 UTC). Shuchi Chawla from UT Austin will speak about “Pandora’s Box with Correlations: Learning and Approximation”.

Please register here to join the virtual talk.

Abstract: In the Pandora’s Box problem, the algorithm is provided with a number of boxes with unknown (stochastic) rewards contained inside them. The algorithm can open any box at some cost, discover the reward inside, and based on these observations can choose one box and keep the reward contained in it. Given the distributions from which the rewards are drawn, the algorithm must determine an order in which to open the boxes as well as when to stop and accept the best reward found so far. In general, an optimal algorithm may make both decisions adaptively based on instantiations observed previously. The Pandora’s Box problem and its extensions capture many kinds of optimization problems with stochastic input where the algorithm can obtain instantiations of input random variables at some cost. Previous work on these problems assumes that different random variables in the input are distributed independently. As such it does not capture many real-world settings. In this work, we provide the first algorithms for Pandora’s Box-type problems with correlations. In the independent setting, optimal algorithms are non-adaptive and based on the notion of the Gittins index. These techniques fail to extend to the correlated case. We assume that the algorithm has access to samples drawn from the joint distribution on input and provide solutions that require few samples; are computationally efficient; and guarantee approximate optimality.
This is joint work with Evangelia Gergatsouli, Yifeng Teng, Christos Tzamos, and Ruimin Zhang, and appeared in FOCS’20.
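
For background, the independent-boxes case admits Weitzman's classical index rule, referenced above via the Gittins index: compute a reservation value per box and open boxes greedily in decreasing order of that index, stopping once the best observed reward beats every remaining index. The sketch below is that classical rule for discrete reward distributions; it is not the correlated-case algorithms from the talk.

```python
import random

def reservation_value(values, probs, cost, lo=0.0, hi=1e6, iters=60):
    """Index of a box: the sigma solving E[(X - sigma)^+] = cost (by bisection)."""
    def surplus(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if surplus(mid) > cost else (lo, mid)
    return (lo + hi) / 2

def weitzman(boxes, seed=0):
    """boxes: list of (values, probs, cost). Open in decreasing reservation
    value; stop when the best reward found exceeds every remaining index."""
    rng = random.Random(seed)
    order = sorted(boxes, key=lambda b: -reservation_value(*b))
    best, total_cost = 0.0, 0.0
    for values, probs, cost in order:
        if best >= reservation_value(values, probs, cost):
            break
        total_cost += cost
        best = max(best, rng.choices(values, probs)[0])
    return best - total_cost              # realized net payoff

# toy usage: a risky box and a cheap safe box
boxes = [([0.0, 10.0], [0.5, 0.5], 1.0), ([4.0], [1.0], 0.1)]
print(weitzman(boxes))
```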

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Wednesday March 16th 2022 — Eric Tchetgen Tchetgen from Penn

The second Foundations of Data Science virtual talk of this year will take place on Wednesday, March 16th at 1:00 PM Pacific Time (16:00 Eastern Time, 21:00 Central European Time, 20:00 UTC). Eric Tchetgen Tchetgen from The Wharton School, University of Pennsylvania will speak about “An Introduction to Proximal Causal Learning”.

Please register here to join the virtual talk.

Abstract: A standard assumption for causal inference from observational data is that one has measured a sufficiently rich set of covariates to ensure that, within covariate strata, subjects are exchangeable across observed treatment values. Skepticism about the exchangeability assumption in observational studies is often warranted because it hinges on one’s ability to accurately measure covariates capturing all potential sources of confounding. Realistically, confounding mechanisms can rarely, if ever, be learned with certainty from measured covariates. One can therefore only ever hope that covariate measurements are at best proxies of the true underlying confounding mechanisms operating in an observational study, thus invalidating causal claims made on the basis of standard exchangeability conditions.

Causal learning from proxies is a challenging inverse problem which has to date remained unresolved. In this paper, we introduce a formal potential outcome framework for proximal causal learning, which, while explicitly acknowledging covariate measurements as imperfect proxies of confounding mechanisms, offers an opportunity to learn about causal effects in settings where exchangeability based on measured covariates fails. Sufficient conditions for nonparametric identification are given, leading to the proximal g-formula and corresponding proximal g-computation algorithm for estimation, both generalizations of Robins’ foundational g-formula and g-computation algorithm, which account explicitly for bias due to unmeasured confounding. Both point treatment and time-varying treatment settings are considered, and an application of proximal g-computation of causal effects is given for illustration.
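
One simple way to get a feel for proximal estimation is the linear special case, often referred to as proximal two-stage least squares: a treatment-inducing proxy Z is used to predict an outcome-inducing proxy W, and that prediction replaces W in the outcome regression. The numpy sketch below simulates such a setting; the variable names, the linear models, and the toy data-generating process are illustrative assumptions, not the paper's general proximal g-computation algorithm.

```python
import numpy as np

def p2sls(Y, A, Z, W, X):
    """Linear proximal 2SLS sketch: stage 1 predicts the outcome-inducing proxy
    W from (A, Z, X); stage 2 regresses Y on (A, predicted W, X)."""
    n = len(Y)
    ones = np.ones((n, 1))
    D1 = np.hstack([ones, A, Z, X])
    W_hat = D1 @ np.linalg.lstsq(D1, W, rcond=None)[0]     # stage 1
    D2 = np.hstack([ones, A, W_hat, X])
    beta = np.linalg.lstsq(D2, Y, rcond=None)[0]           # stage 2
    return float(beta[1, 0])                               # coefficient on A

# toy simulation: an unmeasured confounder U drives A, Y and both proxies
rng = np.random.default_rng(0)
n = 20_000
U = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 1))
Z = U + rng.normal(size=(n, 1))              # treatment-inducing proxy
W = U + rng.normal(size=(n, 1))              # outcome-inducing proxy
A = U + 0.5 * X + rng.normal(size=(n, 1))
Y = 2.0 * A + U + X + rng.normal(size=(n, 1))
print(p2sls(Y, A, Z, W, X))                  # should be close to the true effect 2.0
```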

 The series is supported by the NSF HDR TRIPODS Grant 1934846.

Thursday Jan 13th 2022 — Piotr Indyk from MIT

The first Foundations of Data Science virtual talk of this year will take place on Thursday, Jan 13th at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 18:00 UTC). Piotr Indyk from MIT will speak about “Learning-Based Sampling and Streaming”.

Please register here to join the virtual talk.

Abstract: Classical algorithms typically provide “one size fits all” performance, and do not leverage properties or patterns in their inputs. A recent line of work aims to address this issue by developing algorithms that use machine learning predictions to improve their performance. In this talk I will present two examples of this type, in the context of streaming and sampling algorithms. In particular, I will show how to use machine learning predictions to improve the performance of (a) low-memory streaming algorithms for frequency estimation (ICLR’19), and (b) sampling algorithms for estimating the support size of a distribution (ICLR’21).   Both algorithms use an ML-based predictor that, given a data item, estimates the number of times the item occurs in the input data set.  
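
To make the architecture concrete, here is a hedged sketch of the learned frequency-estimation idea: a predictor routes items it believes are heavy to exact counters, while everything else goes into a small sketch (a Count-Min sketch here). The threshold `is_heavy` oracle and the Pareto-distributed toy stream are stand-ins for the learned model and real data, not the paper's trained predictor.

```python
import random
from collections import Counter

class CountMin:
    """Plain Count-Min sketch used for the (predicted) non-heavy items."""
    def __init__(self, d=4, w=512, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.random() for _ in range(d)]
        self.w = w
        self.table = [[0] * w for _ in range(d)]

    def update(self, key, delta=1):
        for salt, row in zip(self.salts, self.table):
            row[hash((salt, key)) % self.w] += delta

    def estimate(self, key):
        return min(row[hash((salt, key)) % self.w]
                   for salt, row in zip(self.salts, self.table))

class LearnedFrequencyEstimator:
    """Route predicted-heavy items to exact counters, the rest to the sketch."""
    def __init__(self, is_heavy):
        self.is_heavy = is_heavy              # stand-in for the learned predictor
        self.exact = Counter()
        self.cm = CountMin()

    def update(self, key):
        if self.is_heavy(key):
            self.exact[key] += 1
        else:
            self.cm.update(key)

    def estimate(self, key):
        return self.exact[key] if self.is_heavy(key) else self.cm.estimate(key)

# toy usage: a skewed stream where small ids are the heavy hitters
rng = random.Random(1)
stream = [int(rng.paretovariate(1.2)) for _ in range(50_000)]
est = LearnedFrequencyEstimator(is_heavy=lambda key: key < 10)
for item in stream:
    est.update(item)
print(est.estimate(1), stream.count(1))       # heavy item: counted exactly
print(est.estimate(500), stream.count(500))   # tail item: sketched estimate
```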

The talk will cover material from papers co-authored with T. Eden, C.-Y. Hsu, D. Katabi, S. Narayanan, R. Rubinfeld, S. Silwal, T. Wagner, and A. Vakilian.

The series is supported by the NSF HDR TRIPODS Grant 1934846.