2015-01-07

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)



For the last two weeks of our class on statistical communication, I gave my students the following assignment:

Every day, you will write an entry in your statistics diary. Just set up a text or Word file and add to it each day. The diary entries can be anything. They can be short slice-of-life observations (“Today Jakey asked me what he would be the best at in ten years: piano, soccer, judo, chess, or fencing. This made me wonder: how predictable are such things? How would we gather data to measure this?”), quick research notes (“Someone emailed me a graph of birthday data from Brazil! I sent it to Aki along with a link, maybe he or a student can fit our model. Apparently Brazil has different holidays than we do in the U.S.”), or things you’re working on, difficult statistics problems that you might be stuck on, or have an insight about. You can write as little or as much as you want each day. The only requirement is that you write something new in it, every day. You’re not allowed to go back a week later and fill in 7 entries at once. That would be cheating. One entry each day. Just type it in to the file.

Here are some of their diaries, which I’m posting here for three reasons: First, I had a lot of fun reading these, and I think you will too; second, for you teachers out there, maybe you’ll now be inspired to try this “statistics diary” thing out in your own classes; third, I’m teaching statistical communication this spring, and now that I’ve posted this, I can point students to this link so they can get a sense of what their diaries can look like.

I was thinking of posting just my favorite entries, but I decided it would be best to present each student’s diary in its entirety, in order to give a clearer sense of what they were doing.

Student #1:

**November 25, 2014**

Why are paradoxes of the statistical sort so “paradoxical”? How much of it is linguistic? As in, how much of our confusion is a fault in the ambiguities contained in how the paradox is stated? I’m thinking in particular of, for instance, things like the paradox of the heap, or of Monty Hall, where it’s unclear what the host’s actual intention is. I think I’ll write a post about Monty Hall later, although not so much from a linguistic point of view.

**November 27, 2014**

Today I stumbled across this “journal” (http://www.jasnh.com/) which purports to champion the null hypothesis. Even if I like the spirit behind it, I’m not sure it’s a great idea.

I just think you’d have to back up each study as being sufficiently plausible that the null is false; otherwise, it’d be too easy to publish whatever you want. Also, it doesn’t seem to be the most well-organized. I always think better when there are categories involved, and the website is essentially an uncategorized listing of papers.

**November 28, 2014**

John List gave a talk at the Society of Judgment and Decision Making this past week about “making life like a lab experiment”. I liked a lot about what he said, although a lot of it seemed like it pertained to the economics field more than anything else. The “field experiments” movement resonates with me on several levels (spiritually, ecumenically, grammatically…).

On the other hand, I do wonder whether List’s position (that we should stop trying to analyze “mounds and mounds of big data” and instead conduct field experiments) might be almost too broad. It seems like field experiments are great under certain contexts where interventions are clearly feasible. But, there are so many problems that are worth studying that are very difficult to study via the field experimental approach, be it due to ethical barriers, legal constraints, historical issues, rare events, etc. It seems like he may be overlooking the usefulness of sophisticated causal inference techniques in these scenarios.

**November 29, 2014**

Another cool, if strange, effect on the “power” of uncertainty I was reading about today. It turns out that under certain scenarios, people might prefer uncertainty to certainty (Shen et al., working) which is interesting considering that we generally are pretty averse to things we can’t predict. What these researchers find is that people are willing to work harder and exert more effort for rewards which are probabilistic (e.g., “50% chance of winning $1, 50% chance of winning $2”) compared to rewards which are certain (e.g., “100% chance of winning $2”). Note that here, the probabilistic reward is dominated by the certain reward!
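The dominance claim in this entry can be checked with a little arithmetic. The sketch below is only an illustration, not the analysis from the working paper the student mentions; the function names are hypothetical, and each reward is written as a list of (probability, payoff) pairs using the dollar amounts quoted above.

```python
def expected_value(lottery):
    """Expected payoff of a lottery given as (probability, payoff) pairs."""
    return sum(p * x for p, x in lottery)


def prob_at_least(lottery, threshold):
    """Probability that the lottery pays at least `threshold`."""
    return sum(p for p, x in lottery if x >= threshold)


def dominates(a, b):
    """True if lottery `a` first-order stochastically dominates lottery `b`.

    `a` must be at least as likely as `b` to reach every payoff level,
    and strictly more likely to reach at least one of them.
    """
    levels = sorted({x for _, x in a} | {x for _, x in b})
    never_worse = all(prob_at_least(a, t) >= prob_at_least(b, t) for t in levels)
    sometimes_better = any(prob_at_least(a, t) > prob_at_least(b, t) for t in levels)
    return never_worse and sometimes_better


uncertain = [(0.5, 1.0), (0.5, 2.0)]  # 50% chance of $1, 50% chance of $2
certain = [(1.0, 2.0)]                # 100% chance of $2

print(expected_value(uncertain))      # 1.5
print(expected_value(certain))        # 2.0
print(dominates(certain, uncertain))  # True
```

The certain reward is never worse and sometimes better, which is what makes the extra effort for the uncertain one surprising.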

**December 1, 2014**

It seems weird to me that people aren’t as persuaded by “hard statistics” as they should be. It’s not that surprising, I suppose, given how much pull things like status quo bias have on people, even when statistics might seem to refute habit. The thought I had related to this concerned the proposed change in NBA basketballs from a while ago. In the end, even though the numbers were in favor of adopting the new standard, the proposal was rejected due to a few vocal nay-sayers who had already gotten used to the old basketballs. Seems like numbers have a hard time winning over plain inertia.

**December 2, 2014**

Since I started being friends with R Shiny, I’ve been thinking some about when dynamic graphics are better. I feel like there aren’t many systematic principles out there to use as rules of thumb. It’s tempting to almost overdo it, especially in my current state of mind, where I keep wanting to turn every graph I make into a Shiny app. But are there times where dynamic graphs are overkill? When do they help, and when do they just distract? I think it would be useful, as a class, maybe via the blog, to draft a guideline addressing this issue.

**December 3, 2014**

Not my story, but a story I heard second-hand from a friend. This friend’s professor of political science was recounting a time when a student he thought highly of received a negative review from the department. As a result, the professor said to my friend that he revised his opinions of the department downward, his opinions of the candidate downward, AND his own opinion downward. Something about this isn’t right!

Student #2:

November 23

First diary entry – I suppose I’ll make a few comments about my final project and goals for expanding it. We’re building a Shiny app to plot data from a linear regression – the app will allow you to plot different explanatory variables against your dependent variable and to view the impact on regression output of incorporating different variables in the model. We’re planning to present the interactive app in class, but ultimately we’d like to turn it into an R package so that people can run it on their own – I think this will give them a deeper understanding of how Shiny works.

November 24

My local post office is terrible – they routinely fail to deliver packages but pretend that they tried to, marking things “Business closed – notice left” when no one put any notification on our door. The USPS has definitely gone through tough times in recent years, but I wonder how my post office compares to others. You could use a performance measure such as the average length of time between arrival of a package to the facility and actual delivery.

November 25

A lot of people have been sharing a 538 article relevant to the Darren Wilson lack of indictment without seeming to have read it. The headline (“It’s Incredibly Rare For A Grand Jury To Do What Ferguson’s Just Did”) is grabby but misleading, given the article content. The article goes on to explain that while prosecutors convince grand juries to indict in nearly all cases, this is not true when the prospective defendant is a police officer. So it is incredibly rare to fail to get an indictment, but it’s pretty ordinary when the case is related to a police shooting.

November 26

Our Shiny app for the final project is going well – today I linked the user’s CSV input to the rest of the functionality, so it’s fully reactive to the dataset input. The next step is for Tara to add in her leverage function and added variable plots.

November 27

Happy Thanksgiving! How consistent are sales of traditional Thanksgiving food items from year to year, controlling for population alone (such as number of turkeys or pounds of cranberries sold per capita)? Are there any overarching trends over time? I’d guess we eat more of everything…

November 28

A Shiny app from Hadley Wickham that allows you to scale an eggnog recipe! This inspired me to ponder the many possible non-statistical uses of Shiny…you can imagine a whole cookbook published this way. The distribution platforms have room for improvement, though – it would be nice to be able to browse shinyapps.io rather than navigating directly to the app’s URL.

November 29

With all of the holiday travel going on, I’ve been wondering about some of the paradoxical human perceptions of statistical risk. Why are people more afraid of flying than driving, when fatal traffic accidents are so much more common than plane crashes? Some argue that it’s related to the lack of control, since you aren’t the one flying the plane – but you’re also often a passenger in a car, you never drive a bus or train, etc. It’s curious how humans tend to fixate on extreme scenarios in interpreting risk.

November 30

With all of the Black Friday sales going on, I’ve been wondering about how companies use all of the additional information available to them via online sales. Prices on Amazon change with significant frequency, and the retailer can see the goods we place in our “shopping carts” – do they change the price of the good based on the indicated interest? This seems like it could significantly alter the negotiating power in the existing relationship between retailer and consumer.

December 1

On R-Bloggers today, there was an article listing gift ideas for statisticians. I quite like the ‘distribution plushies’ – stuffed toys in the shapes of statistical distributions.

December 2

Inspired by this article from 538, I was thinking about how people lie about their height. There’s a plot in the article that indicates the distribution of men’s height on OKCupid is centered roughly 2 inches above the actual distribution of men’s height in the US. But is everyone reporting 2 inches higher? Or is there a tendency to report more added height when you’re shorter (if I’m 6’0″, I’ll say I’m 6’1″, but if I’m 5’7″, I’ll say I’m 5’10″)? And do people prefer to report round numbers, like 6’0″?

December 3

I wondered today about medical statistics with respect to disease prognoses. I’m sure there are conventions in reporting terminology, but for the uninformed layperson, the phrasing makes it hard to tell quite what things mean. For example, for a disease with a 5-year survival rate of 70% and a relapse rate of 35%, does the survival rate incorporate the probability of relapse?

December 4

Apparently 2014 is on track to be the hottest year on record, with global average air temperature 0.57 C above average. Which raises the question – what is the best way to make people care about something that isn’t individually observable? This past winter was very cold and snowy in the US, and my grandmother refuses to believe global warming exists.

Student #3:

Mon 24 Nov

I’m wondering about the accuracy of weather forecasts. When snowfall is predicted and an estimated number of inches is proposed, how many times do the actual measurements match up? Has anyone reported on weather forecast accuracy recently with actual data?

Tues 25 Nov

Apparently I just crossed paths with the Ferguson protests in midtown this evening. I wonder what the ratio of police to protesters is here in the city? When the news coverage recalls how many people were at such events, how do they arrive at their estimate? Is it the “density” of people in one block multiplied by the number of blocks occupied? I guess I wonder how these things are estimated for a dynamic group (people marching through the various avenues/streets in Manhattan) vs a static group (assembled in a park for example). The latter seems easier…
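The “density times number of blocks” idea in this entry, along with one possible counterpart for a moving crowd, can be written down in a few lines. This is a rough sketch under stated assumptions, not how any news outlet actually produces its figures: the function names are hypothetical and every number below is invented for illustration.

```python
from statistics import mean


def static_crowd_estimate(block_areas_m2, sampled_densities):
    """Density method: total occupied area times the average density
    (people per square meter) measured in a few sample patches."""
    return sum(block_areas_m2) * mean(sampled_densities)


def marching_crowd_estimate(counts_per_minute, total_minutes):
    """Flow method: average number of people passing a fixed point per
    sampled minute, times how long the march takes to pass that point."""
    return mean(counts_per_minute) * total_minutes


# Invented inputs: three occupied blocks of 1200 square meters each,
# with densities sampled from three patches of the crowd.
print(static_crowd_estimate([1200, 1200, 1200], [1.0, 2.0, 1.5]))

# Invented inputs: people counted passing one corner during three
# sampled minutes of a march that took 45 minutes to go by.
print(marching_crowd_estimate([100, 140, 120], 45))
```

The static case needs only a snapshot, while the moving case needs counts over time, which is one reason the latter seems harder.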

Wed 26 Nov

I bought an actual paper version of the New York Times for my train ride today and saw an infographic on the “most googled recipes in each state”. What is frog eye salad and why do people in Colorado seek this out so much? I didn’t save the print version, but found the online version here: http://www.nytimes.com/interactive/2014/11/25/upshot/thanksgiving-recipes-googled-in-every-state.html?_r=0&abt=0002&abg=0

It’s interesting because the online version seemed to have missed a good opportunity to make the map an interactive one (click on a state and that takes you to the more detailed state section, vs the pulldown menu option). I almost like the print version better for the clarity and ease of seeing all the states with their top 3 recipes in an array.

Thurs 27 Nov

I wish there was a better way to peruse PubMed. I don’t love the way the search and archive system is designed. It seems like more filters or ways to organize a list of results could improve the system. For example if one searches “microfluidic whole blood assays for point of care settings”, they should be able to re-organize the results by “most cited”, “newest to oldest” etc. It just seems like the sheer volume of articles contained in the system is regurgitated rather than refined with these exploratory lit searches. I wish the database could be reorganized or re-tagged for better searches!

Fri 28 Nov

It seems like there are more and more reports of how Black Friday “sales” are not actually much of a bargain.

http://money.cnn.com/2014/11/25/news/economy/black-friday-deals/

I wonder how this year will be analyzed by the news outlets.

Sat 29 Nov

More Thanksgiving analysis. A friend sent me this link: http://jezebel.com/this-is-what-america-was-most-thankful-for-this-year-s-1664583267

It’s a visualization of some Facebook “data” of what people were most thankful for in each state. I think it’s relative to people in other states, so I doubt Californians’ first thought is Netflix, but perhaps that shows up more in their posts than people from Ohio. It’s probably also skewed by the kind of people who post a status talking about something like this.

The original Facebook data analysis post is here: https://m.facebook.com/notes/facebook-data-science/what-are-we-most-thankful-for/10152679841318859/

Sun 30 Nov

I emailed our research collaborator after seeing some discrepancies in our data analyses. I explained our methodology, which I assumed was a pretty standard one in the diagnostics space. Apparently, the collaborator disagrees and finds this “too stringent” of an analysis. I think this requires more followup.

Mon 1 Dec

We had a final project deadline at 5pm today. I think it would be really interesting (and entertaining for the instructors) to plot the distribution of submission times for this. Maybe even compare this to all the previous homework assignment submissions.

Tues 2 Dec

We had a call with the research collaborator this afternoon about data we had collected last summer and her methodology in analyzing it compared to ours. To summarize the gist of our “issue”, we were discussing diagnostic performance specs, such as sensitivity, specificity, and signal to noise cutoffs. I think the root cause of the confusion was that the “gold standard” reference test data she had provided (to which we had to compare our prototype test) was not actually the right one, and therefore we were getting mismatched counts of “false positives” or “false negatives” (leading to discrepancies in overall diagnostic performance specs that were calculated).
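To see how the choice of reference data drives these numbers, here is a minimal sketch that scores the same prototype results against two different reference columns. The data are invented and the function name is hypothetical; it is not the analysis the student or the collaborator ran.

```python
def diagnostic_performance(test_results, reference_results):
    """Tally the 2x2 table of test vs. reference and return
    sensitivity and specificity along with the error counts."""
    tp = fp = tn = fn = 0
    for test, ref in zip(test_results, reference_results):
        if test and ref:
            tp += 1      # test positive, reference positive
        elif test and not ref:
            fp += 1      # test positive, reference negative
        elif not test and ref:
            fn += 1      # test negative, reference positive
        else:
            tn += 1      # test negative, reference negative
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positives": fp,
        "false_negatives": fn,
    }


# Invented results for eight samples (1 = positive, 0 = negative).
prototype = [1, 1, 1, 0, 0, 0, 0, 1]
intended_reference = [1, 1, 1, 0, 0, 0, 0, 0]
mismatched_reference = [1, 0, 1, 1, 0, 0, 0, 0]

print(diagnostic_performance(prototype, intended_reference))
print(diagnostic_performance(prototype, mismatched_reference))
```

The prototype results are identical in both calls; only the reference changes, and the false positive and false negative counts change with it.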

Wed 3 Dec

http://www.nytimes.com/2014/12/04/technology/personaltech/smart-nurseries-track-a-babys-sleep-or-lack-thereof.html?hp&action=click&pgtype=Homepage&module=mini-moth&region=top-stories-below&WT.nav=top-stories-below

“it’s hard to figure out what parents can do with this data or how it could actually help babies sleep better.”

Seems like another “Big Data” type issue: finding ways to deconstruct data or provide actionable steps?

Student #4:

Nov. 25

Andrew is sending out free books to students again. Consider the number of free books a student got from the class: should that be underdispersed (because when one gets a book, he will tend to give the chance to others) or overdispersed (because there are guys who are really active and are more likely to get the book)?
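One simple way to look at this question is the variance-to-mean ratio of the book counts, which is about 1 for Poisson counts, below 1 for underdispersed counts, and above 1 for overdispersed ones. The sketch below assumes a Poisson baseline, which is my assumption rather than anything stated in the entry, and both sets of counts are invented.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean.

    Roughly 1 for Poisson counts, below 1 for underdispersed counts,
    above 1 for overdispersed counts.
    """
    return variance(counts) / mean(counts)


# Invented class of eight students where winners step back for others,
# so almost everyone ends up with exactly one book.
polite_class = [1, 1, 1, 1, 0, 1, 1, 1]

# Invented class of eight students where two very active students
# collect nearly all of the books.
eager_class = [0, 0, 0, 5, 0, 0, 3, 0]

print(dispersion_index(polite_class))  # well below 1
print(dispersion_index(eager_class))   # well above 1
```

With real data, both mechanisms the entry describes could operate at once, so the ratio alone would not separate them.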

Nov. 26

Attended a talk about contextual bandit problems. Striking a balance between exploration and exploitation really is one of the main themes of human life.

Nov. 27

Had a great Thanksgiving meal in Wei’s home. The time series for consumption of turkey must be really spiky around Thanksgiving. How do turkey sellers manage to accommodate the bursty need during Thanksgiving?

Nov. 28

Went out shopping in department stores in Midtown. Analyzing the discounts in Thanksgiving and maybe providing optimal strategy for boosting sales seems like a cool stat project.

Nov. 31

Spent too much time playing candy crush, which is really full of uncertainty. Not sure if the developers took statistics into account. How do they control the difficulty of the game?

Dec. 1

Spent most of my time coding a dynamical system model for neuron spike train data (a fancy way of doing PCA on count data). Just amazed by how people developed all these delicate and complicated models to summarize and understand the data.

Dec. 2

Still in the middle of coding and felt that it can be true that “50 percent of the results from statistics papers are just coding errors”. I mean, when things got complicated, coding and debugging is really really hard.

Dec. 3

Heard a talk about a variant of PCA that involves both the covariates and the latent variables.

Student #5:

Monday, November 24

It’s my dad’s birthday, hooray! He actually got the birthday present I sent on time but my mum, whose birthday was 2 weeks ago, did not. Instead, she got a letter from customs, asking her to come in person and present all the original receipts for the items in the package that I sent. I am pretty sure this is the result of some random sampling they do because pretty clearly, if you look at the stuff in the package, you can see that the value is less than the legal limit. I still wonder what the procedural rules are: how many packages are selected at random? And then if they open it and see it’s of no commercial value, at what point do they just send it on? Also curious what kind of data is collected but right now just a little angry my mum’s present is 2 weeks late.

Tuesday, November 25

I was going to run a pre-test of a survey on MTurk. The big concern with MTurk is always that it’s not representative of your target population – too many men, too young, too educated to represent an average citizen. But since the pre-test was supposed to be just about making sure the questions were understood, produced variance and somewhat sensible answers, this approach seemed alright. Then it turned out that MTurk just pretty much did not have any workers at all in our target population – so we needed to come up with Plan B really quickly and ended up sending the survey out to friends and family…. Which is an even less representative sample than MTurk – looks like almost everyone is in their late 20s or early 30s and highly educated. At what point does it become just too unrepresentative?

Wednesday, November 26

Big travel day before Thanksgiving. The weather report anticipated a snow storm and according to what I am hearing from friends, there is a lot of anxiety at the airport. But what has come down by the afternoon was mostly rain, no big deal. I wonder if the weather people weight downside and upside risk differently – e.g. if there is a 40% chance of a snowstorm, would you rather tell people there will be a snowstorm and be wrong, since then at least everyone was prepared, than tell them the likelihood is greater that everything will be just fine, even though there is a substantial chance of snow? Is the reasoning the same for different kinds of weather events? Snow, wind, rain?

Thursday, November 27

Took a long run in Central Park in anticipation of massive amounts of food later. There were a few people walking their dogs, but overall, it was pretty empty but I noticed a lot of cops in the Park. Wondering how they get assigned – is there extra pay for holidays, so they all volunteer? Is it the same number as usual, just that they are more noticeable because there are fewer people and it seems more disproportionate? Are there extra cops because a relatively empty park might attract crime? I guess I am really interested in forecasting in general

Friday, November 28

Another day, another Central Park run. Even more cops today! Maybe they’re planning their Christmas party?

Saturday, November 29

I just discovered this stats blog: http://iquantny.tumblr.com/ – totally awesome! The author takes public New York City data and turns it into really fun analyses!

Sunday, November 30

So, I am working on a paper that deals with the spatial links of racial attitudes and attitudes towards welfare and I am struggling to find good data – it seems that most surveys collect spatial information on the state level, at best, when really I would be interested in county-level, zip-code or even census-tract level data. I wonder why – I also recently discovered by accident that when you do a survey using Qualtrics, it will automatically record latitude and longitude of the respondents (based on their IP address, I presume) so that even gives you point data – awesome! Why isn’t this data (or some aggregated level if there are privacy concerns) published for all online surveys? And in-person surveys? Good spatial data should also be really easy to collect there; it’s hardest for phone interviews.

Monday, December 1

Final prep meeting before we’re presenting our project “Plotting for dummies” tomorrow. I really learned a lot about Shiny and I think also coding in general, great experience.

Tuesday, December 2

Still struggling to find appropriate data for my spatial project. I spent countless hours trying to find out what an obscure variable called “geolink” was, according to the codebook a variable to “link the data to census data”. Turns out it’s just made up numbers because the PIs wanted to restrict access to the spatial link…. Gaaaaa!

Wednesday, December 3

Asked the ICPSR what it would take to get access to the data…. They are telling me that they don’t have the data either. I reached out to the Principal Investigators, let’s see.

Thursday, December 4

Heard back from the Principal Investigator – they write “Dear Ms. Baltes, as I am sure you realize, this data was collected 20 years ago” – and continue to say that they don’t have the data either. Unfortunately, in terms of spatial data and including the kind of questions I am interested in, this survey is still the best there is… so I am not sure what to do. I might try and find papers that use the data and see if the authors kept the data.

Student #6:

Nov 24

This afternoon, one friend who works in pharmacy asked me about negative binomial regression. He told me why bio-researchers often ignore statistical assumptions in those statistical methods. It confused me too. I thought we should be more concerned with those assumptions, or the results are not reliable.

Nov 25

My friend Rijia and I presented in Dr. Gelman’s class. I was wondering If I should use a statistical way to describe how nervous I was. I might count my heart beats times in one minute with the time passed, or in which the slide I told, or at which moment that the comment classmates said, or what kind of face that Professor Gelman showed.

Nov 26

it was snow today. Honestly, Americans like discussing the weather.

Because of the statistical diary, I ask myself how fast and how large snowing would cause the road have snow cover. The factors should include the humidity of the time and the snow speed per hour… etc.

Indeed, I never thought those questions before. Thanks for the statistical diary assignment. Should I google the answer and write it on the statistical diary? lol

Nov 27

Today is Thanks giving, I visited my friend Jeff’s apartment, having a thanks giving dinner with him and his friends. Joyce is Jeff’s friend I never meet her before. She is a business data manager in MTV. She introduced her job position. She told us that she must designed many models for predicting the internet users’ video and movie tastes. In her department, she and her colleagues should capture surfing webs information from each internet user. After collecting bunch of data, Joyce and her colleagues could realize what kind of movie or video that users may prefer, and the MTV company would send the advertisement for you.

I thought the statistical methods she used were applied in many fields such as shopping webs, elections campaigns… and so on. It is really interesting.

After Joyce explained what she did, everyone all called her ‘evil’, because they couldn’t believe there is no privacy in the internet world. If you want to keep your privacy, clean the Webs Caches and manage your computer routinely.

Nov 28

It is Black Friday. I spend whole morning looking for valuable back bag. I love the Amazon system. When I looked for these products I considered to buy, the Amazon system always gave me other related products simultaneously. This kind of algorithm save me many time. I observed the criteria the system uses, including the popularity (review number of consumers), the products release date, best discount of products, and the best sell.

In statistical thinking, I thought when seller categorize each product they must be very careful, because the key words are very important. When people typed the keywords, how precise the shopping filter would cause whether the product become popular or broaden the information to the potential customers.

Nov 29

There is a mid-term election for local governors today in Taiwan. As a Taiwanese, I woke up at 5:00 am for following the election results. It is really excited because the candidate of Taipei City who I support won the campaign. The KMT party, which is like American Republican lost many counties where they governed before.

Many political scientists felt surprised that the KMT got a serious lost.

They all thought the main reason is that the election poll is different with the election result. Many statistician and political scientists discussed the possible factors, which include the missing sample of teenagers who study out of their home town. The phone interview lost plenty of samples who only use cell phone. Those discussions made me think about the course of sample survey I learned from Dr. Michael Sobel in 2012 fall semester.

With the technology progressive, if scientists want to capture precise information, they should consider tech using and think a new approach for design the sample survey methods.

Nov 30

I have an idea of Entrepreneurship. I want to be a data story teller, which is sharing more Asian social issue with data analysis to the world. I am wondering if it works or not. Therefore, I shared the idea with my friend who is a sociology PhD student in NYU and a politician in Taiwan. I told them that there are many data journal teams such as FiveThirtyEight, NYTimes, Washington Post in the United States. Data Journalists analyze bunch of data and write many meaningful articles. I am wondering if it works in Asia. We discuss a lot difficulties we may suffer later. For example, there are less datasets in Asia. In addition, If we would like to write any discourse, should we write the article in Chinese or English or both. (I know my English writing is suck…) Anyway, no matter how difficult it is, we had a conclusion that we should start it as soon as possible.

Dec 1

It is bored today. I am struggling in assignment of Statistical Machine learning. I deeply believe that the statistical science is really broad and deep. As an amateur in quantitative research, the learning process is like a boat sailing against the current, we must forge ahead or be swept downstream

Dec 2

I am wondering what kind of student eager to write statistical diary everyday. I thought the main reason for me is statistical thinking and English writing.

Since began the statistical diary writing, I write it every morning. I remember the interesting event before the day. It stimulate my statistical ideas.

In addition, as a Non-Native English speaker, I love using this chance for improving my English writing….. oh… I need to increase the vocabulary in my diary.

Dec 3

More statistical methods I learned, more interesting research questions I could ask. I am not sure if this argument would be always true or not.

Student #7:

11/24 Someone should make a dictionary that shows how the meanings of statistical terms vary between the various disciplines that use statistics. For example, I’m sick of trying to figure out what authors mean by fixed effects and random effects. These terms get tossed around casually as if everyone with statistical knowledge knows what they mean, except they seem to mean many different things to different people. Very frustrating.

11/25 How many undiscovered computer coding errors/typos are there that, if fixed, would change the results in published work? Probably a lot. There are a few notable examples (e.g. Harvard economists and their Excel documents), but my hunch is that there must be many many many undiscovered cases (mostly innocuous but some are probably consequential). Is there any way to estimate this?

11/26 I saw presentations by two different Columbia graduate students today and in both cases the student made the same really basic statistical mistake: to choose the independent variables to include in a regression model they first did a bunch of bivariate regressions and then used all of the “significant” predictors in the bivariate regressions as the predictors in the multiple regression. No no no no no!!! Are students being taught to do this? Are they taught not to do it but then forget or get lazy? Are they taught not to do this but then aren’t shown any alternatives?

11/27 I’ve had a number of conversations recently with people who are convinced that random means that all possibilities are equally likely, even when confronted with simple examples to the contrary. For example, I tried using the example of weighted dice and got a response like “oh, well if they’re not equally likely then it’s not really random.” I wonder how common this perception actually is and why it exists at all. Maybe the most common colloquial use of “random” is as a substitute for “uniformly at random”, but then what’s the colloquial term for all other kinds of randomness?
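The weighted-dice example is easy to simulate. In the sketch below the weights are an arbitrary choice of mine (a six that is five times as likely as any other face), and the seed is fixed only so the run is repeatable; no single roll can be predicted in advance, yet the faces are clearly not equally likely.

```python
import random
from collections import Counter


def roll_weighted_die(n_rolls, weights, seed=0):
    """Roll a six-sided die `n_rolls` times, where face i comes up with
    probability proportional to weights[i - 1]."""
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    return rng.choices(faces, weights=weights, k=n_rolls)


# Assumed weights: a six is five times as likely as each other face.
rolls = roll_weighted_die(10_000, weights=[1, 1, 1, 1, 1, 5])
counts = Counter(rolls)

for face in range(1, 7):
    print(face, counts[face] / len(rolls))
```

The printed proportions should sit near 0.1 for faces one through five and near 0.5 for six: random, but not uniformly so.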

11/28 I sent an email to the author of a political science paper asking if he had data and code that I could look at. He responded promptly and sent me what he called his “replication files.” But then I discovered that he only had processed data (after recoding, cleaning, etc.) and the code he sent was just for running the analysis on this post-processed data. I asked if he had the raw, dirty data and the code he used to prepare it for the analyses and he said he didn’t think so but would get back to me. I haven’t heard anything. I’m wondering if this is typical or atypical? I don’t have enough experience to know how prevalent this is. If “replication” is just running code on clean data then we’ve got problems (see my entry for 11/25).

11/29 I wonder if the ‘gifts’ that public radio stations offer to listeners who make donations (e.g. tote bags, t-shirts, DVDs of Ken Burns documentaries, etc.) actually lead to an increase in the number of contributors, the amount donated per contributor, etc.?

11/30 One of my pet peeves is when people say that something was caused by chance. Usually when we say something was caused by chance we’re really just expressing our ignorance about why that thing happened. But over Thanksgiving I caught myself saying it, which was rather disappointing. It’s just such a natural thing to say, despite not making much sense. Am I making something out of nothing by being annoyed by this common saying? I found a blog post on this topic – written by a statistician – in which the author recommended replacing “caused by chance” with “I don’t know what the cause is, but our uncertainty in X is . . . ”, but that just seems like a ridiculous thing to say in informal contexts.

12/1 In response to the recent events in the news, a friend of mine asked me if there’s anything that could be said statistically about what an “acceptable” number of unarmed black teenagers getting shot by police officers should be? Ideally zero (obviously) but accounting for human error, the complexity of the circumstances, etc., it gets pretty messy. So, is there a way to roughly determine what an abnormally high number of these cases would be so we can then compare what we observe to some idea of what we should reasonably expect?

12/2 What are good choices of prior distributions for time-varying autoregressive parameters? For example in

$$y_t = \rho_t \, y_{t-1} + \text{other stuff}$$

what are some reasonable priors for the $\rho_t$’s? There’s a ton written about cases where there is just a single $\rho$, but what about when we want to let $\rho$ vary with time?
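One choice that shows up in time-varying-parameter models is a Gaussian random walk on a transformed coefficient. The sketch below is only an illustration of that idea, not an answer to the question: it assumes the walk lives on $z_t = \operatorname{atanh}(\rho_t)$ so that every $\rho_t$ stays inside $(-1, 1)$, and it reduces the “other stuff” to white noise. The function name, the step size `sigma_rho`, and the starting value are all hypothetical.

```python
import math
import random


def simulate_tv_ar1(n, sigma_rho=0.05, sigma_y=1.0, seed=0):
    """Draw (rho_t, y_t) for t = 1..n from an assumed random-walk prior.

    z_t = z_{t-1} + Normal(0, sigma_rho),  rho_t = tanh(z_t)
    y_t = rho_t * y_{t-1} + Normal(0, sigma_y)
    """
    rng = random.Random(seed)
    z = 0.0        # assumption: start the walk at rho = 0
    y_prev = 0.0   # assumption: start the series at 0
    rhos, ys = [], []
    for _ in range(n):
        z += rng.gauss(0.0, sigma_rho)   # random-walk step on the unconstrained scale
        rho = math.tanh(z)               # map back to (-1, 1)
        y = rho * y_prev + rng.gauss(0.0, sigma_y)
        rhos.append(rho)
        ys.append(y)
        y_prev = y
    return rhos, ys


rhos, ys = simulate_tv_ar1(n=200)
print(min(rhos), max(rhos))
```

A small `sigma_rho` encodes the belief that the coefficient drifts slowly; a larger value lets it move quickly. Simulating from the prior like this is one way to see whether the implied paths look reasonable before fitting anything.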

12/3 Introductory probability textbooks often (always?) use the number of typos on a printed page as an example when discussing the Poisson distribution. Has anyone done a large-scale empirical study to actually see if this is indeed a good approximation (and if it’s better with certain types of texts)? In other words, are the assumptions we have to make so that the typos follow a Poisson distribution actually reasonable? They’re obviously wrong, but how wrong? For example, we need to assume that the typos are independent, but this is violated all the time (e.g. if instead of “weird” I type “wierd” then the “i” and “e” are both wrong but the two errors are not independent. Or does this count as just one typo?). I would assume someone has looked into this, but a google search is pretty useless (you just get page after page of the toy practice problems). Also, how does the distribution change when automatic spell-checkers are used (which catch some types of errors but not others)?

Student #8:

Mon 24 Nov

I was working on my project on electricity sector reform. Specifically, I want to test the hypothesis that if a country introduces an independent regulatory agency, it is more likely to adopt unbundling rather than privatization because of its relatively lower political cost (the opposite of what the ‘textbook model’ of reform predicts). I have binary indicators for regulatory agency, unbundling, and privatization, respectively. What would be the best model to test this? Would a probit be sufficient, using the t-1 lag of regulatory agency, which is the main IV?

Tues 25 Nov

I got comments saying that an endogeneity issue is highly suspected in my model, so I’m considering 2SLS. But a bigger concern remains: can I use 2SLS when the instrumental variable as well as the dependent variable is binary? Is there any way to deal with a situation where the IV, the instrumental variable, and the DV are all binary and the endogeneity problem has to be addressed? I found that some scholars suggest biprobit, but I’m not sure if this is the best option.

Wed 26 Nov

To celebrate Thanksgiving, I came to my aunt’s house in NJ. I don’t even remember how long I have been stuck in the Columbia neighborhood. Anyway, it was a great feeling to enjoy the atmosphere ‘outside’ the city. But one question came up: compared to tolls in South Korea (even accounting for the difference in living costs), the toll that drivers have to pay to cross the George Washington Bridge is extremely expensive. What makes it that expensive? How much money do they collect per hour? And how much do they spend on maintenance of the bridge?

Thur 27 Nov

I will spend the whole Thanksgiving holiday at my aunt’s house. My aunt asked me to help her son, who is in high school, with his studies. I found that he has basically been struggling with courses requiring some ‘quantitative’ skills, such as mathematics and statistics. In particular, he told me that he had no idea about probability. At his vigorous request, I tried to help him get a grasp of the basic concepts in probability. What I recognized is that he doesn’t understand ‘counting’ ideas such as ‘combination’ and ‘permutation.’ What would be the best way to understand the difference between combination and permutation in usage and logic? I often use the term ‘distinctiveness’ to differentiate them; for example, whether you want to choose 2 people to clean up the classroom or choose a president and a vice president, which are distinct positions. Is there any other example that would help a student see the difference?
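The classroom example can be checked by counting directly. This is a minimal sketch using Python’s standard `math.comb` and `math.perm`; the class size of 30 is a made-up number for illustration.

```python
import math

students = 30  # hypothetical class size

# Two people to clean the classroom: the pair {A, B} is the same as {B, A},
# so order is ignored and we count combinations.
cleaning_pairs = math.comb(students, 2)

# A president and a vice president: (A, B) is a different outcome from (B, A),
# so order matters and we count permutations.
officer_slates = math.perm(students, 2)

print(cleaning_pairs, officer_slates)
```

Each unordered pair corresponds to exactly two ordered slates, so the permutation count is twice the combination count when choosing 2 people.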

Friday 28 Nov

Finally, Black Friday! My relatives and I went to the Garden State Mall, which I heard is one of the biggest malls in NJ. As expected, we had to spend half an hour finding an available parking spot. The situation inside the mall was hellish. It was so crowded that we even had to stand in line for another half an hour to get into a shop. I was just curious: what does the graph of daily sales in November look like? I strongly suspect that sales would stay low before Black Friday and jump up on Black Friday. Then, compared with December and October, is this trend significantly different, and are average sales in November also quite different from those in October or December?

Saturday 29 Nov

My second private lecture on high-school-level statistics for my cousin. Today, I taught the logic of ordinary least squares regression. I started by drawing a scatter plot of height and weight, then explained why we want to draw a best-fit line. It went very smoothly until I wrote an equation… To help him understand the basic mathematics behind it (obviously I didn’t even attempt to teach how to calculate the ‘beta’ using maximization and so on), I reminded him of the simple linear equation and explained what a and b each mean. It was OK. But when I put y-hat instead of y, things changed. He was confused about why we have to use y-hat and so on. I patiently explained that the ‘hat’ here represents a ‘predicted value’ and explained the error term and so on. He asked me “if there are errors, why do we need to calculate this line and want to use it?” Then I was just lost… I didn’t know where to start.

Sunday 30 Nov

Now coming back home. I got very good news from my family in South Korea: my older brother is about to marry! I could not help calling my brother. He gave me a very detailed description on how they met and what kind of person his wife is. But he suddenly asked me about my plan for marriage. He added “You know, there are only few Korean girls who want to live in the U.S. while abandoning her job in South Korea and losing a close and frequent relationship with her family and friends.” I just wonder if his statement is statistically right, especially considering the growing number of Korean girls who are dreaming of ‘studying’ in the U.S. these days.

Monday 1 Dec

Today, I planned to come up with an identification strategy to test the hypothesis that signing the TRIPS (Trade-Related Aspects of Intellectual Property Rights) agreement has a positive impact on innovation. Since I’m considering the number of patents as a proxy for innovation, following the previous literature on innovation, I studied some models that are designed to deal with such a count variable as the dependent variable. I found that there are various methods: Poisson, negative binomial, zero-inflated models, and so on. As a trial approach, I tried to estimate negative binomial models, but Stata iterated and iterated and finally gave me an error message. What caused this ‘no results’ outcome? I think I have to delve into the data structure and re-examine the assumptions required for the negative binomial model.

Tuesday 2 Dec

We presented our project, “Plotting data for dummies” (https://joonyang.shinyapps.io/Final/), for the statistical communication class. The presentation went well, I guess. Very critical questions were asked from the floor, but I believe that many students understood why we took on this project: to let dummies, who do not know any statistical package, plot graphs easily. I personally will keep working on this project to reflect the comments and improve it further. As a complete novice to R and Shiny, I had some difficulties pursuing a project that requires some background in that software. Fortunately, with my teammates’ sincere help, I could learn a lot and eventually contribute to the final project outcome. At the same time, this experience allowed me to realize how useful technical skills would be, and how much this kind of knowledge would help me pursue my research project. Now I’m motivated enough to delve into R and other useful packages in R this winter.

Wednesday 3 Dec

I returned the results of the second midterm exam to the students taking the ‘International Politics’ class that I’m TAing for. The professor asked me to ‘curve’ the grades so that the mean equals 86, which is a low B+ in our grading scheme, so I adjusted the scores little by little to meet this request. Basically, my strategy was to bump all the grades up by one point, but it turned out that this strategy would violate another rule: 30% A / 50% B / 20% C. What would be the best or the least discriminatory way to do this?
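The two rules pull on different things: the mean target constrains the numeric scores, while the 30/50/20 split constrains the ranking. One way to see this, sketched below under stated assumptions, is to apply a constant shift for the mean (which leaves everyone’s rank unchanged) and assign letters purely by rank. The scores are invented, both function names are hypothetical, and ties are ignored.

```python
def shift_to_mean(scores, target_mean=86.0):
    """Add the same constant to every score so the mean hits the target."""
    shift = target_mean - sum(scores) / len(scores)
    return [s + shift for s in scores]


def letters_by_rank(scores, shares=(("A", 0.30), ("B", 0.50), ("C", 0.20))):
    """Assign letters by rank so each letter gets roughly its share of students."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    letters = [None] * n
    start, cumulative = 0, 0.0
    for k, (letter, share) in enumerate(shares):
        cumulative += share
        # The last letter takes everyone who is left, to avoid rounding gaps.
        end = n if k == len(shares) - 1 else round(cumulative * n)
        for i in order[start:end]:
            letters[i] = letter
        start = end
    return letters


raw = [70, 75, 80, 82, 84, 85, 88, 90, 93, 95]  # made-up exam scores
curved = shift_to_mean(raw)
grades = letters_by_rank(curved)
print(list(zip(curved, grades)))
```

Because a constant shift preserves order, the letter grades here come out the same whether they are assigned before or after the shift; any conflict between the two rules only arises if letters are tied to fixed numeric cutoffs.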

Student #9:

Mon 24 Nov

For how many laps do F1 tyres last?

What was the temperature variation on the ground of the Yas Marina at Abu Dhabi Grand Prix and how did it affect the tyre performance? What is the variation in these numbers?

What percentage of handicaps think they’re entitled to acting like jerks and not apologise to people whose foot they have stepped on?

Tue 25 Nov

On average how many drops of water continue to be dispensed from a faucet after switching off?

Are the protesters at the Ferguson Riot really protesting for the cause? Or are they just fuelled by their inner needs to be destructive? What percentage of the people is protesting for a constructive cause and what percentage for a destructive cause?

Wed 26 Nov

How long do I spend waiting for an elevator at different times of a day? What is the distribution of waiting time?

How accurately can one predict climate change, and the associated changes in sea levels? Can climate change prediction accurately predict arctic animal population (polar bears, seals, penguins) changes?

Thu 27 Nov

How many turkeys were killed in the name of Thanksgiving? What percentage of the total number of turkeys killed was in excess?

Ideas are just ideas. What’s with the need to patent to claim ownership over an idea? What percentage of the patents filed actually provide exclusive right to commercialise, and what percentage is just filed without a useful purpose?

Fri 28 Nov

What percentage of high school friends does one keep in touch with in grad school?

What percentage of college friends does one keep in touch with in grad school?

Sat 29 Nov

What percentage of engineering/science/mathematics students are introverts?

What percentage of business school/MBA students are extroverts?

Is there a general correlation between academic disciplines and introversion/extroversion?

Sun 30 Nov

How many fire alarms go off between 2 and 4 am in a month (and in the process wake up everyone else who has to bear the sound of the firetrucks blaring through the city)? Are they more frequent on weekday or weekend nights?

Mon 1 Dec

How often does one look at their cell phone in a day? Is there a correlation with different activities, e.g. does the frequency increase in classroom, or when one is dining alone?

Tue 2 Dec

What’s the chance of a subway train not being on time?

How many delays occur in a day? How many minutes does an average employed person with working hours from 9am-5pm waste due to commute delays?

Life is fragile, everyday people die unexpectedly. But suicide is not easy either; people have tried to kill themselves by over injection of insulin, or hanging themselves, or wrist slashing. Attempts are not always successful. These attempts, which are cowardice, seem to have higher failure rate than bold ones such. What’s the success rate of suicides, based on boldness of the attempts? How many of those who failed ended up disabled and unable to attempt again?

Wed 3 Dec

What’s the chance of a non-smoker developing lung cancer in New York City vs. in a less crowded city with cleaner air?

How many stars can one observe on a night of clear sky in Manhattan in different months of the year?

Student #10:

11/25/2014

I was waiting for somebody’s call at a planned time today. I think the actual time he calls will follow something like a normal distribution with mean equal to the planned time.

11/26/2014

When I check the weather on the phone today, it says the chance of snow currently is 73%. I guess this probability means nothing since it is snowing right now.

11/27/2014

Happy Thanksgiving! I’m wondering what is the distribution of dates of Thanksgiving.

11/28/2014

I see fewer people on the street today. I guess this is not the situation in commercial areas. The distribution of people changed because of Black Friday.

11/29/2014

I read a set of interesting graphs today comparing living expenses in China and U.S. First it compares eating, clothes, apartment renting and traveling expense. And in general you spend less in China. But salary in U.S is much higher. So it then compares how much food, clothes you can buy with the salary and how many months of housing rent you can pay with the salary. According to this, life in China is much more expensive.

So it’s true statistics and statistical graph can really be misleading sometimes.

11/30/2014

When picking a seat at the theater today, I suddenly realized the decision depends on two variables: one is personal preference and the other is the selection of other people, so the decisions of different people aren’t independent.

12/1/2014

I can’t think of any interesting statistical story today. What are the odds of this?

12/2/2014

The light bulb broke today. It reminds me of the classic example of the exponential distribution: the life of a light bulb follows an exponential distribution.

12/3/2014

When someone asked me about the different directions within statistics, I suddenly found that there are a lot of areas that I don’t even know, like spatial statistics, statistical science. Need to learn more sometime.

Student #11:

Sunday, November 23, 2014

I was thinking about our project “Opening Week Box Office Performance Prediction”. The problem is there are too many categorical variables and too few numeric ones. How about adding the average previous opening-week box office income for each leading actor and actress? I brought it up but no one responded. Sigh. I think for a movie I haven’t watched, it’s hard to know exactly who the leading actor or actress is. And as for the average previous opening-week box office income for each leading actor and actress, we need to find out how many movies he or she was in and their incomes, and also whether she or he was in a leading role in those previous movies. Real projects and real life are really complicated.

Monday, November 24, 2014

I went to work today. In the crowded subway, I saw people rushing from NYC to NJ and NJ people rushing to NJ. What’s the average commuting time for people who live in NYC? For me, it’s between 4 and 5 hours per day on Monday, Wednesday and Friday when I go to work. I wonder what life is like for others.

Tuesday, November 25, 2014

Sometimes I want to call friends back in China, but it’s midnight there or some other bad time for calling. So I don’t call. There are always time lags between different parts of the earth. Would there be significantly more calls or communication if there were no time lag at all worldwide?

Wednesday, November 26, 2014

It rained and snowed today. The library has fewer people. Some people will work from home today. What percentage of people’s plans to go somewhere are influenced by weather?

Thursday, November 27, 2014

I had a wonderful Thanksgiving meal with 5 friends today. The six of us didn’t finish a nearly 20-pound turkey. How many bites does it take to eat a 20-pound turkey?

Friday, November 28, 2014

Yesterday I went to the movie, Hunger Games: Mockingjay. It’s bad. I went to see it because my friend wanted to see it. My friend wanted to see it because she has followed it since the first movie. Since it’s a series, she definitely wants to follow it. So I am thinking about our project: actually, there is no very rational reason behind why people go to a movie and make its box office performance good. As for me, I go to a movie when I have time and a friend to go with. Or when I like the actor or actress very much.

Saturday, November 29, 2014

There is a theory that people will become an expert in an area if they have spent 10000 hours on it. This holiday, while I slept through most of it, I wondered why there is no certificate for us to be experts in sleep, for we have definitely slept for more than 10000 hours since we were born.

Sunday, November 30, 2014

I was collecting data for the project “Opening Week Box Office Performance Prediction” until 5 o’clock in the morning. For some variables we have to look them up manually on the website. For example, we are counting how many times each actor, actress, writer, director etc. has been nominated and has won. I have never done that before. It feels like I have been a chef for a long time who buys ingredients from markets to cook, but today I have to go to the farm, kill an animal, and drag it to the kitchen. Salute to all the people who have been collecting data for us all along!

Monday, December 1, 2014

I met several strangers today, who helped me. I had meals with friends, who I have known for more than a year. Is there any association between the feeling of happiness and the number of people we meet everyday?

Tuesday, December 2, 2014

I have a Latin quiz today. I have been told long time ago that people can really remember a new word by seeing it in different context 72 times. I hope I can find it out someday.

Wednesday, December 3, 2014

Wherever I go, waiting in a line or a medical room, or on the subway, or during the break in a Broadway show, people are looking at their own cellphones. I kind of want to know what the longest gap between phone checks is for different people. How many of them would go crazy if you took the phone from them for more than one day? (I cannot help laughing while I am writing this.) And how much time do people spend on their phones per day, doing what?

Thursday, December 4, 2014

The tree lighting ceremony is at 6pm today. I saw so many beautiful lights on College Walk while walking after class. I cannot help wondering how many bulbs there are in total. I once saw workers putting those lights up, and they actually twine wires full of bulbs onto the trees. I think the problem can be solved this way: we find how many bulbs there are per meter of wire on average, estimate how many meters are needed for one tree on average, then count how many trees are lit. Finally, we multiply these three numbers.

Student #12:

1 Monday, November 24

Tonight my comps study group met to go over some of the literature on ethnicity in preparation for comps in January. It is interesting at the very least to see how the use of quantitative (statistical) methods has evolved in political science. There is still so much unevenness in the quality of empirical analysis. I wonder how the training process (i.e. grad school as a “critical period” for methods training) and the publication review process (i.e. review of papers by people whose understanding of methods is less than it could be) contribute to this.

2 Tuesday, November 25

I’m currently sitting in the airport awaiting my flight. Everything seems to be delayed and passengers are unhappy. Rebooking after delayed/canceled flights seems to be a tedious and very costly endeavor for airlines. I wonder how airlines make their flight schedules in order to minimize the amount of missed connecting flights etc. If the online flight tracking sites are correct, certain routes are much more likely to be canceled/delayed than others. How do airlines account for this? How much variation is there by season? Do flight schedules change seasonally to account for these differences?

3 Wednesday, November 26

Today we drove from my parents’ home to my grandmother’s home in preparation for Thanksgiving. We passed the spot where I received my one and only speeding ticket. I still look around to figure out where the cop was lying “in wait” when I was pulled over. How accurate are their speed radars? To what extent is the uncertainty surrounding the estimate grounds for challenging the ticket in court? A quick Google search turns up a bunch of websites giving advice on how to challenge tickets in court which are not particularly interesting to me at this point. (In the interest of full disclosure, I was speeding when I got the ticket…)

4 Thursday, November 27

Today, my mom was talking about her genealogy research (a new hobby). She sometimes struggles to make sure the “G.E. Miller” that she is finding records for is the right “G.E. Miller.” This piqued my curiosity about how researchers that use historical microdata (like the 19th century US Census microdata now hosted by the National Bureau of Economic Research) match large samples of individuals across censuses. Beyond names, what characteristics are the easiest to use to ensure that you are tracking the same individual? Birth place? Birth year? Has anyone investigated the best method for doing this?

5 Friday, November 28

Today Gram was looking at the Black Friday advertisements in the local newspaper. Retailers seem to prey upon people that don’t really understand percentages. It seems that retailers use multiple markdowns (i.e. “an additional 20% off prices already marked down by 20%”) to make their offers sound better than they actually are. Perhaps there is a better explanation to the complicated sales schemes offered in these coupons but my cynical side is going towards the former explanation.
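The arithmetic behind the stacked markdown is easy to check: successive percentage discounts multiply, they do not add. A minimal sketch (the function name is just for illustration):

```python
def combined_discount(*discounts):
    """Total fraction taken off when discounts are applied one after another."""
    remaining = 1.0
    for d in discounts:
        remaining *= 1.0 - d  # each discount applies to the already reduced price
    return 1.0 - remaining


total = combined_discount(0.20, 0.20)
print(f"{total:.0%} off in total")
```

An additional 20% off a price already marked down by 20% leaves 0.8 × 0.8 = 0.64 of the original price, which is 36% off rather than the 40% a shopper might assume.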

6 Saturday, November 29

I arrived back in New York (well technically Newark) early this morning. In Penn Station, I helped some lost tourists from somewhere in Latin America (maybe Colombia based on the accent?). How many people get lost in the New York subway system every year? How many people are lost at any given time?

7 Sunday, November 30

We finally have our Shiny app working, but today I tried to import data with missingness to Shiny which “broke” the leverage and AV plot functions. This was a pain to fix. Visualizing the scatter plot with and without missing data is revealing and an important exercise in a field (political science) with so much missing data.

8 Monday, December 1

Today I read some pieces on culture in preparation for comps. One particularly problematic piece from a theoretical and empirical standpoint actually had a ton of graphs. I don’t think I’ve ever seen APSR print so many graphs in one article before. The article was published in 1988–I wonder if they would still publish as many graphs today. Do people just not submit articles with lots of graphs or are there limits on space in publication? Either way, it’s a shame.

9 Tuesday, December 2

Today a group of students met with the external review committee that is doing the department’s review this year. I helped pick the group of 10 grad students with another grad student. While we picked on the usual diversity of subfields, year in the program, and demographic profiles, the people that came up on our list are generally those who are pretty active (comparatively) in the department. How accurately does such a group represent the views of the “median” grad student in the department? Does it matter?

10 Wednesday, December 3

Today was the Comparative Politics Seminar which I co-coordinate. Coordinating basically entails lots of logistical work and ordering food. It is never clear how many people will attend the seminars and consequently how much food to order. I wish I had a better sense of how to predict the number of attendees based on the speaker/topic, time of the semester, and attendees’ appetites to order the correct amount of food. (Of course, if they gave us a bigger budget, this would be less of a concern.)

11 Thursday, December 4

The students in the Math Methods class that I TA are worried about their final on Monday. They seem preoccupied with whether the class will be “curved.” Maybe the answer is to study, not worry as much about the “curve.” Furthermore, it’s grad school… In any case, the idea of a “curve” seems to be fundamentally misunderstood, or at least misconstrued pretty regularly.

Student #13:

Nov.24

Leaves are falling every day. Given the number of trees on a street and the time you need to walk past all of them, what is the probability that no leaf falls while you are passing through the street?

Nov.25

I am making my afternoon tea today and I am wondering if I should put my tealeaves first or after I filled my cup with hot water, which way will enhance the “connection” between hot water and tea leaves and gives me a stronger tea?

Nov.26

There’s always traffic on Broadway. Every night when I walk home there are tons of cars passing by me. So, given a specific time, let’s say 10pm, how many cars will pass by me on my way home? And how many people will I encounter?

Nov.27

I am snacking again, cookies and chocolate!

What makes me want to eat at night? Does weather have anything to do with it? How can people predict the consumption of snacks in a given season and what other things will the consumption of snacks affect?

Nov.28

My roommate and I got the same letter from “Capital One” encouraging us to open credit cards there. How many of those letters are they sending out each week and how can we estimate how many people on our block received the same letter this week?

Nov.29

This is the Black Friday weekend, and because of the sales, every shop in midtown Manhattan has more customers than usual. How big should the promotion be in order to maximize profit on this day? For example, shops that offer only 20% off have far fewer customers than those that offer 40% off; however, a smaller discount means more profit per sale.

Nov.30

I always wake up late on Sundays and wonder if I should have breakfast or brunch; all the restaurants serving brunch are filled with people. Can I predict the number of people there on weekdays based on the number of people there on Sundays?

Dec.1

My roommate is waiting for the packages she ordered on Black Friday. Will there be a package delivery delay because of the high volume of packages? And does this delay vary by area?

Dec.2

New Yorkers don’t like using umbrellas when it is raining but not raining heavily. On my way home, only around 20% of the people I passed were using an umbrella. What affects people’s decision to use an umbrella?

Dec.3

A lot of my friends are leaving after this semester, it is pretty sad. How many friends can I make each month and how long will it take to have the same amount of friends as I used to ☹

Student #14:

Nov. 23 Sunday

——————–

I encountered an electric outage at my apartment. I wonder how often it happens. I know Con Edison has ads up in the train station advertising… the data must be very interesting to look at.

Nov. 24 Monday

———————

Again on the train I saw the Fizergald’s Law Firm ad. It is a busy and rather disorganized table listing numbers in different formats, such as “1.6 billion — XXX court case”, “1,000,000,000 — yyy court case”, and “1000 thousand — zzz court case”. I am really curious about the conspiracy behind this formatting. I did realize that for some people a certain format stands out more. My friend first thought that 1.6 billion was the largest. As for me, I first jumped in to convert the 1,000,000,000 to a comparable scale. Perhaps, in statcomm class, we would take log10 of them and make a graphical display, but the graph would be unreadable on a train ad.
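Putting the three quoted amounts on a common log10 scale takes only a few lines. This sketch uses just the figures as quoted in the entry; the dictionary labels are the ad’s own formats.

```python
import math

# The three amounts as quoted from the ad, converted to plain numbers.
amounts = {
    "1.6 billion": 1.6e9,
    "1,000,000,000": 1_000_000_000,
    "1000 thousand": 1000 * 1_000,
}

# log10 puts every amount on the same "number of digits" scale.
log_amounts = {label: math.log10(value) for label, value in amounts.items()}

# List from largest to smallest.
for label, log_value in sorted(log_amounts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:>15}  log10 = {log_value:.2f}")
```

On this scale the first two amounts are close together (both around 10^9) while “1000 thousand” is three orders of magnitude smaller, which is hard to see from the mixed formats in the ad.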

Nov. 25 Tuesday

———————-

While walking to class, I saw a hotdog stand. Actually two. One was popular with the tourists, the other with the construction workers. One might think that the taste and pricing of the hotdogs are probably different, for reasons like tourists choosing a stand for convenience and workers for price and/or taste. One can try slapping some ANOVA on the price and taste by customer group.

Nov. 26 Wednesday

————————–

We tend to think that when we are extremely busy, more things happen. Psychologically, we might just tend to take note of chaotic things. But is it real? Can we design an experiment testing the hypothesis?

Nov. 27 Thursday

———————–

Yesterday’s weather forecast on my phone was accurate down to the hour, so much so that it almost won back my trust after many failures over the year. Though I was wondering what the probability is that it was right merely by chance. With that thought in mind, I decided to bring my umbrella along with me just in case…

Nov. 28 Friday

——————–

As I was filling out an extremely demanding application for grad school, I started to wonder how many possible mistakes an applicant can make, and to what extent those mistakes would affect an application. Also, some tasks were really redundant, such as “List all your work experiences” — “Submit your CV/Resume” and “Send in of
