Friday, Oct 9 — Alexandr Andoni from Columbia University

The next Foundations of Data Science virtual talk will take place on Friday, October 9th at 10:00 AM Pacific Time (1:00 pm Eastern Time, 18:00 Central European Time, 17:00 UTC). Alexandr Andoni from Columbia University will speak about “Approximating Edit Distance in Near-Linear Time”.

Abstract: Edit distance is a classic measure of similarity between strings, with applications ranging from computational biology to coding. Computing edit distance is also a classic dynamic programming problem, with a quadratic-time solution often taught in “Intro to Algorithms” classes. Improving this runtime has been a decades-old challenge, one now ruled likely impossible using tools from the modern area of fine-grained complexity. We show how to approximate the edit distance between two strings in near-linear time, up to a constant factor. Our result completes a research direction set forth in the breakthrough paper of [Chakraborty, Das, Goldenberg, Koucky, Saks; FOCS’18], which gave the first constant-factor approximation algorithm with a (strongly) subquadratic running time.

Joint work with Negev Shekel Nosatzki, available at https://arxiv.org/abs/2005.07678.
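
For readers who want a refresher on the baseline the abstract refers to, below is a minimal Python sketch (ours, not from the paper) of the classic quadratic dynamic program for exact edit distance; the talk is about approximating this quantity in near-linear time rather than computing it exactly.

```python
# Classic O(n*m) dynamic program for edit distance (Levenshtein distance),
# kept to one rolling row of the DP table. This is the quadratic baseline the
# abstract mentions, not the near-linear-time approximation from the talk.
def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    prev = list(range(m + 1))          # distances between a[:0] and b[:j]
    for i in range(1, n + 1):
        curr = [i] + [0] * m           # distances between a[:i] and b[:j]
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete a[i-1]
                          curr[j - 1] + 1,      # insert b[j-1]
                          prev[j - 1] + sub)    # substitute (or match)
        prev = curr
    return prev[m]

print(edit_distance("kitten", "sitting"))  # prints 3
```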

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Friday, Sept 11 — Bin Yu from UC Berkeley

The next Foundations of Data Science virtual talk will take place on Friday, Sept 11th at 10:00 AM Pacific Time (1:00 pm Eastern Time, 18:00 Central European Time, 17:00 UTC).  Bin Yu from UC Berkeley will speak about “Veridical Data Science”.

Abstract: Building and expanding on principles of statistics, machine learning, and the sciences, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework comprises both a workflow and documentation, and aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. We develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations.

Moreover, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible code and narratives to back up the human choices made throughout an analysis.

The PCS framework will be illustrated through our DeepTune approach to modeling and characterizing neurons in V4, a notoriously difficult-to-model area of the visual cortex.
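
As a very rough illustration of the stability idea (this is not the PCS perturbation-interval procedure from the talk; the data, models, and interval construction below are placeholders of our own), one can refit an estimate under both data perturbations and model/algorithm perturbations and report the spread of the results:

```python
# Hypothetical illustration of data and model/algorithm perturbations: refit a
# coefficient under bootstrap resampling and under two different estimators,
# then report the range of estimates as a crude "stability interval".
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

estimates = []
for model in (LinearRegression(), Ridge(alpha=1.0)):    # model/algorithm perturbation
    for _ in range(50):                                  # data perturbation (bootstrap)
        idx = rng.integers(0, len(y), size=len(y))
        estimates.append(model.fit(X[idx], y[idx]).coef_[0])

lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"stability interval for the first coefficient: [{lo:.2f}, {hi:.2f}]")
```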

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Friday, May 15 — Amin Karbasi from Yale University

The fourth Foundations of Data Science virtual talk will take place on Friday, May 15th at 11:00 AM Pacific Time (2:00 pm Eastern Time, 20:00 Central European Time, 19:00 UTC).  Amin Karbasi from Yale University will speak about “User-Friendly Submodular Maximization”.

Abstract: Submodular functions model the intuitive notion of diminishing returns. Due to their far-reaching applications, they have been rediscovered in many fields such as information theory, operations research, statistical physics, economics, and machine learning. They also enjoy computational tractability as they can be minimized exactly or maximized approximately.

The goal of this talk is simple. We will see how a little bit of randomness, a little bit of greediness, and the right combination of the two can lead to pretty good methods for offline, streaming, and distributed settings. I will not assume any background in submodularity and will explain all the required details during the talk.
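
One well-known example of combining randomness with greediness (offered here only as background; it is not necessarily among the algorithms covered in the talk) is the “stochastic greedy” method for maximizing a monotone submodular function under a cardinality constraint: each round, sample a small random pool of candidates and greedily add the one with the largest marginal gain.

```python
# Sketch of stochastic greedy for monotone submodular maximization under a
# cardinality constraint k: random pool + greedy pick of the best pool element.
import math
import random

def stochastic_greedy(f, ground_set, k, eps=0.1):
    selected = []
    pool_size = max(1, int(len(ground_set) / k * math.log(1 / eps)))
    for _ in range(k):
        remaining = [e for e in ground_set if e not in selected]
        pool = random.sample(remaining, min(pool_size, len(remaining)))
        # Greedy step: the pool element with the largest marginal gain.
        best = max(pool, key=lambda e: f(selected + [e]) - f(selected))
        selected.append(best)
    return selected

# Example: set coverage is monotone submodular.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}
coverage = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0
print(stochastic_greedy(coverage, list(sets), k=2))
```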

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Friday, April 17 — Shachar Lovett from UC San Diego

The third Foundations of Data Science virtual talk will take place next Friday, April 17th at 11:00 AM Pacific Time (2:00 pm Eastern Time, 20:00 Central European Time, 19:00 UTC).  Shachar Lovett from UC San Diego will speak about “The power of asking more informative questions about the data”.

Abstract: Many supervised learning algorithms (such as deep learning) need a large collection of labeled data points in order to perform well. However, what is easy to obtain is large amounts of unlabeled data. Labeling data is an expensive procedure, as it usually needs to be done manually, often by a domain expert. Active learning provides a mechanism to bridge this gap. Active learning algorithms are given a large collection of unlabeled data points and need to smartly choose a few of them whose labels to query. The goal is then to automatically infer the labels of many other data points.

In this talk, we will explore the option of giving active learning algorithms additional power by allowing them to have richer interaction with the data. We will see how allowing even simple types of queries, such as comparing two data points, can exponentially reduce the number of queries needed in various settings. Along the way, we will see interesting connections to both geometry and combinatorics, and a surprising application to fine-grained complexity.

Based on joint work with Daniel Kane, Shay Moran, and Jiapeng Zhang.
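
For readers new to active learning, a textbook illustration of the query savings at stake (the simplest possible setting, not the comparison-query results of the talk): to learn a 1-D threshold classifier, binary search over the sorted unlabeled points needs only about log n label queries instead of labeling all n points.

```python
# Active learning of a 1-D threshold: binary search locates the decision
# boundary with O(log n) label queries, versus n labels for passive learning.
def learn_threshold(points, query_label):
    """points: sorted 1-D data; query_label(x) -> 0/1, monotone in x."""
    lo, hi = 0, len(points)            # boundary lies within points[lo:hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(points[mid]) == 1:
            hi = mid                   # boundary is at or before mid
        else:
            lo = mid + 1               # boundary is after mid
    return lo, queries                 # first positive index, labels used

points = list(range(1000))
idx, used = learn_threshold(points, lambda x: int(x >= 637))
print(idx, used)                       # 637, about 10 queries instead of 1000
```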

Link to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Friday, March 27 — Sujay Sanghavi from UT Austin

The second Foundations of Data Science virtual talk will take place next Friday, March 27th at 11:00 AM Pacific Time (2:00 pm Eastern Time, 20:00 Central European Time, 19:00 UTC).  Sujay Sanghavi from University of Texas at Austin will speak about “Towards Model Agnostic Robustness”.

Abstract: It is now common practice to try and solve machine learning problems by starting with a complex existing model or architecture, and fine-tuning/adapting it to the task at hand. However, outliers, errors or even just sloppiness in training data often lead to drastic drops in performance.

We investigate a simple, generic approach to correct for this, motivated by a classic statistical idea: trimmed loss. This advocates jointly (a) selecting which training samples to ignore and (b) fitting a model on the remaining samples. Stated this way, the problem is computationally infeasible even for linear regression. We propose and study the natural iterative variant that alternates between the two steps (a) and (b), each of which individually can be accomplished easily in pretty much any statistical setting. We also study the batch-SGD variant of this idea. We demonstrate both theoretically (for generalized linear models) and empirically (for vision and NLP neural network models) that this effectively recovers accuracy in the presence of bad training data.

This is joint work with Yanyao Shen and Vatsal Shah, and appears in NeurIPS 2019, ICML 2019, and AISTATS 2020.
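
A minimal sketch of the alternating scheme the abstract describes, instantiated here for ordinary least squares (the toy data and trimming fraction are ours; the papers analyze this idea, and its batch-SGD variant, far more generally):

```python
# Iteratively alternate between (b) fitting on the currently kept samples and
# (a) keeping the samples with the smallest loss under the current fit.
import numpy as np

def iterative_trimmed_least_squares(X, y, keep_frac=0.8, iters=10):
    n = len(y)
    keep = np.arange(n)                                         # start with all samples
    for _ in range(iters):
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)   # step (b): fit
        losses = (X @ w - y) ** 2                               # per-sample loss
        keep = np.argsort(losses)[: int(keep_frac * n)]         # step (a): trim
    return w

# Toy data: a clean linear model with 10% of labels badly corrupted.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=500)
y[:50] += 10.0
print(np.linalg.norm(iterative_trimmed_least_squares(X, y) - w_true))  # small
```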

Link to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

Friday, February 28 — Jon Kleinberg from Cornell University

The first Foundations of Data Science virtual talk will take place this coming Friday, February 28th at 11:00 AM Pacific Time (2:00 pm Eastern Time, 20:00 Central European Time, 19:00 UTC). Jon Kleinberg from Cornell University will speak about “Fairness and Bias in Algorithmic Decision-Making”.

Abstract: As data science has broadened its scope in recent years, a number of domains have applied computational methods for classification and prediction to evaluate individuals in high-stakes settings. These developments have led to an active line of recent discussion in the public sphere about the consequences of algorithmic prediction for notions of fairness and equity. In part, this discussion has involved a basic tension between competing notions of what it means for such classifications to be fair to different groups. We consider several of the key fairness conditions that lie at the heart of these debates, and in particular how these properties operate when the goal is to rank-order a set of applicants by some criterion of interest, and then to select the top-ranking applicants. The talk will be based on joint work with Sendhil Mullainathan and Manish Raghavan.
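
As a toy illustration of the kind of tension the abstract alludes to (a simulation of our own, not the talk's formal results): even with perfectly calibrated risk scores, applying a single decision threshold to two groups whose score distributions differ will generally produce different false-positive rates.

```python
# Scores are calibrated by construction (P(y=1 | score s) = s), yet a common
# threshold yields different false-positive rates for groups with different
# score distributions (here, Beta(2,2) vs Beta(1,4)).
import numpy as np

rng = np.random.default_rng(0)

def group_fpr(alpha, beta, threshold=0.5, n=200_000):
    scores = rng.beta(alpha, beta, size=n)     # risk scores for this group
    y = rng.random(n) < scores                 # labels drawn from the scores
    pred = scores >= threshold
    return (pred & ~y).sum() / (~y).sum()      # false-positive rate

print("group A FPR:", round(group_fpr(2, 2), 3))   # roughly 0.31
print("group B FPR:", round(group_fpr(1, 4), 3))   # roughly 0.03
```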

Link to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.
