Alex, Daniel, Kaitlyn
Data science on big data \( \neq \) throwing algorithms at things
Instead, it calls for an educational focus on well-applied statistical methods in the right situations
“We wish to emphasize the value of a solid understanding of classical statistical ideas as they apply to modern problems in preparing tomorrow’s students for large-scale data science challenges.” (p. 283)
The situation:
Say we want to calculate the mean of a click-level metric by country; that's easy. Let \( Y_{iuc} \) denote the \( i \)th click metric for the \( u \)th user in country \( c \), and \( N_c \) be the total clicks in country \( c \): \[ \hat{\theta}_c=\frac{1}{N_c}\sum_{i,u}Y_{iuc} \]
The problem: to calculate the \( Var \) of this estimate we need to account for correlation between successive clicks from the same user, which would require a second, more expensive stage of aggregation (multiple queries). Letting \( Z_{uc}=\sum_i Y_{iuc} \) be the sum of clicks from user \( u \) in country \( c \), and \( M_c \) be the unique users in country \( c \): \[ V_c^2=\frac{1}{M_c}\left[\left(\frac{M_c}{N_c}\right)^2\frac{1}{M_c}\sum_u Z_{uc}^2 - \hat{\theta}_c^2\right] \]
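As a toy illustration (all data simulated here, not from the paper), the two stages look like this: one pass over the records for \( \hat{\theta}_c \), plus a per-user grouping to get the \( Z_{uc} \) needed for \( V_c^2 \):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated toy data (hypothetical): a click-level metric Y_iuc for a
# single country c, with clicks clustered within users.
M_c = 200                                   # unique users in country c
clicks_per_user = rng.poisson(4, M_c) + 1   # at least one click each
user_mean = rng.normal(0.5, 0.1, M_c)       # induces within-user correlation
y = np.concatenate([rng.normal(m, 0.1, k)
                    for m, k in zip(user_mean, clicks_per_user)])
user = np.repeat(np.arange(M_c), clicks_per_user)

N_c = y.size
theta_hat = y.sum() / N_c                   # first stage: simple mean

# Second stage: per-user sums Z_uc = sum_i Y_iuc, then the clustered
# variance estimate V_c^2 -- this grouping is the expensive step at scale.
Z = np.bincount(user, weights=y)
V2 = ((M_c / N_c) ** 2 * np.mean(Z ** 2) - theta_hat ** 2) / M_c
print(theta_hat, V2)
```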
Poisson Bootstrap:
Instead of querying multiple times, query once and construct \( B \) replicates from the output… i.e. attach \( B \) independent Poisson(1) weights to each record in the data and recompute the statistic under each set of weights
If we compare the result of this to a naive variance estimate that assumes the clicks are iid, the naive estimate understates the variance, since it ignores the correlation between clicks from the same user
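A minimal numpy sketch of that comparison (simulated data, not from the paper; here I draw one Poisson(1) weight per user, shared by all of that user's records, so the replicates respect the within-user correlation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated toy data (hypothetical): clicks nested within users.
n_users = 500
clicks_per_user = rng.poisson(5, n_users) + 1
user_mean = rng.normal(0.5, 0.2, n_users)     # within-user correlation
y = np.concatenate([rng.normal(m, 0.1, k)
                    for m, k in zip(user_mean, clicks_per_user)])
user = np.repeat(np.arange(n_users), clicks_per_user)

B = 1000
reps = np.empty(B)
for b in range(B):
    # One Poisson(1) weight per user, applied to all of that user's
    # records: a single pass over the data per replicate, no re-querying.
    w = rng.poisson(1, n_users)[user]
    reps[b] = np.sum(w * y) / np.sum(w)

boot_var = reps.var()
naive_var = y.var() / y.size   # pretends clicks are iid
print(boot_var, naive_var)     # bootstrap variance is noticeably larger
```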
Consider a classifier that determines commercial intent for a given search query. Another team uses these predictions to adjust auctions and average click costs, which in turn changes the data the classifier sees; the classifier hasn't been corrected for this feedback, so it increasingly predicts commercial intent
Let \( \hat{Y}_i^t[\emptyset] \) denote the prediction made for the \( i \)th item at time \( t \) given no feedback, and
\( \hat{Y}_i^{t+1}[\hat{Y}_i^t] \) denote the time-\( (t+1) \) prediction in a system with feedback, given the prediction for item \( i \) produced at time \( t \).
Then the feedback at time \( t \) is the difference:
\[ \text{feedback}_i^t=\hat{Y}_i^{t+1}[\hat{Y}_i^t]-\hat{Y}_i^t[\emptyset] \]
We assume two things: (1) that \( \text{feedback}_i^t \) depends only on the prediction at time \( t \) (a Markov property), and (2) that it also does not depend on \( \hat{Y}_i^{t+1}[\emptyset] \).
Thus a feedback function is defined: \[ f(y)=\mathbb{E}\left[\hat{Y}_i^{t+1}[\hat{Y}_i^t]-\hat{Y}_i^{t+1}[\emptyset]\,\middle|\,\hat{Y}_i^t=y,\hat{Y}_i^t[\emptyset],\hat{Y}_i^1,\dots,\hat{Y}_i^{t-1}\right] \]
Challenge: it's not possible to simultaneously observe both \( \hat{Y}_i^{t+1}[\hat{Y}_i^t] \) and \( \hat{Y}_i^{t+1}[\emptyset] \) for the same item
So instead we randomly inject noise \( v_i^t \) into the predictions at time \( t \), which creates a sort of randomized experiment.
If we consider feedback that enters the model linearly:
\[ \hat{Y}_i^{t+1}[\hat{Y}_i^t]=\hat{Y}_i^{t+1}[\emptyset]+\theta\hat{Y}_i^t \text{, and so } f(y)=\theta y \]
We then give only the noisy predictions \( \hat{Y}_i^{t+1}[\hat{Y}_i^t+v_i^t] \) to the other teams, and the new relationship is:
\[ \begin{aligned} \hat{Y}_i^{t+1}[\hat{Y}_i^t+v_i^t]&=\hat{Y}_i^{t+1}[\emptyset]+f(\hat{Y}_i^t+v_i^t) \\ &=\hat{Y}_i^{t+1}[\emptyset]+\theta\hat{Y}_i^t+\theta v_i^t \end{aligned} \]
And we can regress \( \hat{Y}_i^{t+1}[\hat{Y}_i^t+v_i^t] \) on \( v_i^t \); since the injected noise is independent of everything else, the slope recovers \( \theta \)
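A quick simulation of the linear-feedback case (all numbers hypothetical): inject noise, regress the next prediction on it, and recover \( \theta \):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta = 0.3                        # true feedback coefficient (made up)

y_t = rng.normal(0.5, 0.2, n)      # time-t predictions
v = rng.normal(0.0, 0.05, n)       # injected noise v_i^t
y_base = rng.normal(0.5, 0.2, n)   # \hat{Y}_i^{t+1}[\emptyset], unobservable

# Observed time-(t+1) predictions under linear feedback on the noisy input:
y_next = y_base + theta * (y_t + v)

# Since v is independent of y_t and y_base, the regression slope of
# y_next on v recovers theta.
theta_hat = np.cov(y_next, v)[0, 1] / np.var(v)
print(theta_hat)   # close to 0.3
```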
“It remains as critical as ever that we continue to equip students with classical techniques, and that we teach each and every one of them to think like a statistician.” (p. 290)