Alex, Daniel, Kaitlyn
Data science on big data \( \neq \) throwing algorithms at things
Instead, it calls for an educational focus on well-applied statistical methods in the right situations
“We wish to emphasize the value of a solid understanding of classical statistical ideas as they apply to modern problems in preparing tomorrow’s students for large-scale data science challenges.” (p. 283)
The situation:
Say we want to calculate the mean of a click-level metric by country; that's easy. Let \( Y_{iuc} \) denote the \( i \)th click metric for the \( u \)th user in country \( c \), and \( N_c \) be the total clicks in country \( c \): \[ \hat{\theta}_c=\frac{1}{N_c}\sum_{i,u}Y_{iuc} \]
The problem: to calculate the \( Var \) of this estimate we need to account for correlation between successive clicks from the same user, which would require a second, more expensive stage of aggregation (multiple queries). Letting \( Z_{uc}=\sum_i Y_{iuc} \) be the sum of clicks from user \( u \) in country \( c \), and \( M_c \) be the unique users in country \( c \): \[ V_c^2=\frac{1}{M_c}\left[\left(\frac{M_c}{N_c}\right)^2\frac{1}{M_c}\sum_u Z_{uc}^2 - \hat{\theta}_c^2\right] \]
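As a toy illustration (all data simulated here, not from the paper), the two stages look like this: one pass over the records for \( \hat{\theta}_c \), plus a per-user grouping to get the \( Z_{uc} \) needed for \( V_c^2 \):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated toy data (hypothetical): a click-level metric Y_iuc for a
# single country c, with clicks clustered within users.
M_c = 200                                   # unique users in country c
clicks_per_user = rng.poisson(4, M_c) + 1   # at least one click each
user_mean = rng.normal(0.5, 0.1, M_c)       # induces within-user correlation
y = np.concatenate([rng.normal(m, 0.1, k)
                    for m, k in zip(user_mean, clicks_per_user)])
user = np.repeat(np.arange(M_c), clicks_per_user)

N_c = y.size
theta_hat = y.sum() / N_c                   # first stage: simple mean

# Second stage: per-user sums Z_uc = sum_i Y_iuc, then the clustered
# variance estimate V_c^2 -- this grouping is the expensive step at scale.
Z = np.bincount(user, weights=y)
V2 = ((M_c / N_c) ** 2 * np.mean(Z ** 2) - theta_hat ** 2) / M_c
print(theta_hat, V2)
```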
Poisson Bootstrap:
Instead of querying multiple times, query once and construct \( B \) replicates from the output… i.e. attach \( B \) independent Poisson(1) weights to each record in the data and recompute the statistic under each set of weights
If we compare the result of this to a naive variance estimate that assumes the clicks are iid, the naive estimate understates the variance, since it ignores the correlation between clicks from the same user
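A minimal numpy sketch of that comparison (simulated data, not from the paper; here I draw one Poisson(1) weight per user, shared by all of that user's records, so the replicates respect the within-user correlation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated toy data (hypothetical): clicks nested within users.
n_users = 500
clicks_per_user = rng.poisson(5, n_users) + 1
user_mean = rng.normal(0.5, 0.2, n_users)     # within-user correlation
y = np.concatenate([rng.normal(m, 0.1, k)
                    for m, k in zip(user_mean, clicks_per_user)])
user = np.repeat(np.arange(n_users), clicks_per_user)

B = 1000
reps = np.empty(B)
for b in range(B):
    # One Poisson(1) weight per user, applied to all of that user's
    # records: a single pass over the data per replicate, no re-querying.
    w = rng.poisson(1, n_users)[user]
    reps[b] = np.sum(w * y) / np.sum(w)

boot_var = reps.var()
naive_var = y.var() / y.size   # pretends clicks are iid
print(boot_var, naive_var)     # bootstrap variance is noticeably larger
```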
Consider a classifier that determines commercial intent for a given search query. Another team uses these predictions to adjust auctions and average click costs, which in turn changes the data the classifier sees; the classifier hasn't been corrected for this feedback, so it increasingly predicts commercial intent
Let \( \hat{Y}_i^t[\emptyset] \) denote the prediction made for the \( i \)th item at time \( t \) given no feedback, and
\( \hat{Y}_i^{t+1}[\hat{Y}_i^t] \) denote the time-\( (t+1) \) prediction in a system with feedback, given the prediction for item \( i \) produced at time \( t \).
Then the feedback at time \( t \) is the difference:
\[ \text{feedback}_i^t=\hat{Y}_i^{t+1}[\hat{Y}_i^t]-\hat{Y}_i^t[\emptyset] \]
We assume two things: (1) that \( \text{feedback}_i^t \) depends only on the prediction at time \( t \) (a Markov property), and (2) that it also does not depend on \( \hat{Y}_i^{t+1}[\emptyset] \).
Thus a feedback function is defined: \[ f(y)=\mathbb{E}\left[\hat{Y}_i^{t+1}[\hat{Y}_i^t]-\hat{Y}_i^{t+1}[\emptyset]\,\middle|\,\hat{Y}_i^t=y,\hat{Y}_i^t[\emptyset],\hat{Y}_i^1,\dots,\hat{Y}_i^{t-1}\right] \]
Challenge: it's not possible to simultaneously observe both \( \hat{Y}_i^{t+1}[\hat{Y}_i^t] \) and \( \hat{Y}_i^{t+1}[\emptyset] \) for the same item
So instead we randomly inject noise \( v_i^t \) into the predictions at time \( t \), which creates a sort of randomized experiment.
If we consider feedback that enters the model linearly:
\[ \hat{Y}_i^{t+1}[\hat{Y}_i^t]=\hat{Y}_i^{t+1}[\emptyset]+\theta\hat{Y}_i^t \text{, and so } f(y)=\theta y \]
We then give only the noisy predictions \( \hat{Y}_i^{t+1}[\hat{Y}_i^t+v_i^t] \) to the other teams, and the new relationship is:
\[ \begin{aligned} \hat{Y}_i^{t+1}[\hat{Y}_i^t+v_i^t]&=\hat{Y}_i^{t+1}[\emptyset]+f(\hat{Y}_i^t+v_i^t) \\ &=\hat{Y}_i^{t+1}[\emptyset]+\theta\hat{Y}_i^t+\theta v_i^t \end{aligned} \]
And we can regress \( \hat{Y}_i^{t+1}[\hat{Y}_i^t+v_i^t] \) on \( v_i^t \); since the injected noise is independent of everything else, the slope recovers \( \theta \)
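A quick simulation of the linear-feedback case (all numbers hypothetical): inject noise, regress the next prediction on it, and recover \( \theta \):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta = 0.3                        # true feedback coefficient (made up)

y_t = rng.normal(0.5, 0.2, n)      # time-t predictions
v = rng.normal(0.0, 0.05, n)       # injected noise v_i^t
y_base = rng.normal(0.5, 0.2, n)   # \hat{Y}_i^{t+1}[\emptyset], unobservable

# Observed time-(t+1) predictions under linear feedback on the noisy input:
y_next = y_base + theta * (y_t + v)

# Since v is independent of y_t and y_base, the regression slope of
# y_next on v recovers theta.
theta_hat = np.cov(y_next, v)[0, 1] / np.var(v)
print(theta_hat)   # close to 0.3
```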
“It remains as critical as ever that we continue to equip students with classical techniques, and that we teach each and every one of them to think like a statistician.” (p. 290)