2014-05-21

I've just registered for this forum, and can answer various questions about my model. A few things in recent pages:

1. The 20-year estimate for a 1M-to-1 false positive (using 4.75 sigma as the criterion) is based on about 200,000 tested games per year, not 50,000, because the unit of testing is a player-performance-in-one-event, not an individual game. One can figure 8 games per event on average, but as there are 2 sides to a game it's a multiplier of 4. The way I've described it before is based on 1,000 player-performances per week of TWIC, which amounts to a million per 20 years. Since many events are split across 2 weeks of TWIC, this number leaves some growing room for getting all games and from more events. Another commenter is right to note the biased-selection factor of getting only the top-board games from many events, but I allow for it, and getting all the games from events goes into that allowance. In any event, the section is written to allow principals to prefer the 5.00, "3M-to-one", 60-year standard as safer; one reason for proposing 4.75 is that a couple of 4.8's have arisen in practice.
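As a quick sanity check on the arithmetic, here is a minimal back-of-the-envelope sketch in Python; the 1,000-performances-per-week and 8-games-per-event figures are the ones above, and the rest is just the standard normal tail:

Code:

```python
from math import erfc, sqrt

def tail(z):
    """One-sided standard-normal tail probability P(Z > z)."""
    return 0.5 * erfc(z / sqrt(2))

perf_per_week = 1000                      # player-performances per week of TWIC
perf_per_year = 52 * perf_per_week        # ~50,000 performances tested per year
games_per_year = perf_per_year * 8 // 2   # 8 games per performance, 2 sides per game -> ~200,000 games

for z in (4.75, 5.00):
    p = tail(z)
    print(f"z = {z:.2f}: P = {p:.2e}, about 1 in {1/p:,.0f}, "
          f"or one expected false positive per {1/p/perf_per_year:.0f} years")
```

The 4.75 line comes out near 1-in-a-million and roughly 20 years at that testing rate; the 5.00 line is about 1 in 3.5 million, i.e. the "3M-to-one" figure, and 3M performances at ~50,000 per year gives the 60-year horizon.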

2. Regarding
Quote:

"Check your assumptions, the paper recommends the statistical analysis (with the given risk of false positive) to be used as sole evidence for a conviction. In other words, in Little Britain style, you are banned for 5 years because "computer says so"

and
Quote:

It does say that, but I don't believe that is really Ken Regan's intention, and it's his baby. The wording needs tightening up to confirm that the test will only be used to convict when there is independent supporting evidence, but in reality I don't think this is a frightening part of the paper.

---and regarding "turnaround" and "changed his mind"---

That is correct. At the time of a March 2012 NY Times story, I wrote a page "Parable of the Golfers" with the statement, "There just aren't enough moves and games and players to get even a sniff of five-sigma, but 3.5 sigma, that happens..." Prior to then I'd had no z-score above 4.00, not even in the Feller case, though one from August 2011 would qualify now that I have a better handle on ratings below 2200.

Part of what I meant by "not enough games and players" is: how would you empirically test my 1-in-a-million projection? For a true field test you need 1M player-performances, but per above that needs about 4M games---a large fraction of all the games ever recorded. And at 4-6 CPU-hours per game for my full test, it would take a long time. What I've done instead is run simulations on multiple 10K-size sets drawn randomly from my training data, which falls under accepted "bootstrapping" methods and warrants the projections out to the 3.50--4.00 range. Above 4.00 it's extrapolation, and a good defense lawyer could try to knock that down. Once I finish converting my model to the Houdini-Komodo-Stockfish troika, getting it up to multiple 100K-size bootstrap trials will be possible.
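In case the bootstrapping step sounds mysterious, here is a minimal sketch of the mechanics; the pool of "clean" z-scores below is a made-up normal stand-in (my real pool comes from the full test on the training data), and 10,000 matches the 10K-size sets mentioned above:

Code:

```python
import numpy as np

rng = np.random.default_rng(2014)

# Stand-in pool of z-scores for presumed-clean player-performances.
# (Drawn as standard normal here purely for illustration; the real pool
#  comes from the full move-by-move test on the training data.)
pool = rng.standard_normal(200_000)

def bootstrap_tail_rates(pool, trials=100, sample_size=10_000,
                         thresholds=(3.0, 3.5, 4.0, 4.75)):
    """Resample performances with replacement and tally how often each z-threshold is exceeded."""
    counts = dict.fromkeys(thresholds, 0)
    for _ in range(trials):
        sample = rng.choice(pool, size=sample_size, replace=True)
        for t in thresholds:
            counts[t] += int((sample > t).sum())
    total = trials * sample_size          # a million resampled performances in all
    return {t: counts[t] / total for t in thresholds}

print(bootstrap_tail_rates(pool))
```

With a million resampled performances, the empirical rates at 3.0--4.0 come back with usable counts, while at 4.75 you typically see zero hits even in a million draws, which is exactly why the projections beyond 4.00 rest on extrapolation rather than directly observed frequency.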

I was shocked when numbers over 5.00 (even over 6.00 upon excluding round 8 and moves past 70) tumbled out of Zadar. I'd thought I'd only get them in cases of "consecutive events", either by combining the games or using Stouffer's Rule to combine the z-scores. The primary question I see is: what near-term actions can be justified from so high a readout alone? Banning is far-term, but IMHO the potential to ban is necessary in principle to open a process near-term, and for me a fair central process would have been much preferable to what we saw extend through last December.
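For reference, Stouffer's Rule is just this; the numbers in the example are illustrative, not from any actual case:

Code:

```python
from math import sqrt

def stouffer(z_scores):
    """Stouffer's Rule: combine independent z-scores into a single overall z."""
    return sum(z_scores) / sqrt(len(z_scores))

# Two consecutive events at 3.6 sigma each combine to just over 5 sigma:
print(round(stouffer([3.6, 3.6]), 2))   # 5.09
```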

Besides publications on my website, I have also written articles for the Maths/CS blog I co-manage, whose standing in my field I analogize to Sergey Shipov's Crestbook; the article titles suffice for a Google search: "Thirteen Sigma" (about the 2010 Azov Don Cup) and "Littlewood's Law" explain perceptions about statistical results, while "The Crown Game Affair" laid out the argument on Ivanov (with campiness intended to mitigate the legal risk).

3. Among the points earlier in this thread, let me say that the "recipe" is in materials on my website and FIDE has no intent to keep it secret. I do regard full-test data as private, while screening tests (which have no z-score judgment value) are already essentially being done publicly by chess-db.com and some other sites. The most accessible version of the recipe may be the part of my Tallinn talk http://www.cse.buffalo.edu/~regan/Talks/FIDE84CongressTalk.pdf from slide-overlay 47 onward. What happened is that I realized, from the preliminary ACC meeting in Paris and from talking with people on the night of my arrival in Tallinn and at breakfast, that what people most needed to hear were the fundamentals of evidentiary statistics stated in the chess context. So I hurriedly wrote overlays 1--46 that day, finishing just before the 5pm meeting, and wound up stopping my talk at slide 48, where I'd intended to begin.

The point of saying "Analytics" in the title is that it's not just statistics; two other private reports of mine last year went into game-move detail. In Buffalo my professional colleagues duly tempered expectations about statistics, as did I in revising the document---but one must allow that in a dozen positive cases I count, the statistics were or would have been effective, plus there are several dozen cases (4 so far this year) where the model turns aside accusations or loud whispers that I think are unfounded. I don't really disagree with points raised here about "smarter cheating"; what I hope is that "smarter science" together with smarter prevention will combine to make the risk/reward and complexity/detectability curves more adverse for potential cheaters.

The math in the "recipe" is not deep, and unlike online systems it uses no sub-surface information such as move-timing or player-profiling. Once the main principle and a few "big-data-style" regularities are digested, there are only a couple of choices beyond doing the simplest thing. I analogize the latter to realizing in the Marshall Ruy that 11...c6 is better than the original 11...Nf6 idea, and realizing that Black need not fear certain endgames. The current ramifications of updating the model are like all the "book" that has developed since, including taking d2-d3 and/or Re5-e2 seriously, but the basic ideas and effectiveness points of the Marshall have been pretty consistent.

Statistics: Posted by KWRegan — Wed May 21, 2014 6:20 am
