How to read a poll

Fivethirtyeight.com gives you a daily reminder of what a terrible presidency looks like.
Not to pile on about "the media," but generally speaking, they've done a lousy job of explaining the significance of polling results, either during elections or regarding issues in the news. And (though this should never have to be said) don't read the comment sections. They're not only unlikely to help; they may actually make you dumber.  Some evergreens include:
  • "Nobody polled me."
  • "Yeah, those polls were way off in 2016 – dummies."
  • "Fake polls."
So as a public service, I'm here to talk about how to read polls – both in general, and in our political moment in particular.

Rule #1: It matters how a poll is worded. Read the questions.

The idea of a poll is pretty simple: you want to know something, so you ask a bunch of people what they think and record what fraction of them give each answer.  These days, most polling falls into one of three categories:
  1. Approval. 
    More or less self-explanatory – and in a real sense, almost meaningless, since political approval isn't really measurable.
  2. Horse race polls.
    This includes the famous "Generic Congressional Ballot," which is testable by a much more concrete measure: who will win and by how much.
  3. Issue polls.
    Here's where pollsters can really screw with the results (with the exception of places like Pew – who revisit key issues periodically, using the same language each time). Think of it this way. There's a big difference between asking:
    • Who do you trust more, Trump or the media? versus
    • Who do you trust more, Trump or the NY Times?
If you find a surprising poll result (or, if you're being intellectually honest, even if it's not surprising), follow the clicks through and check out the original wording of the questions.  The reporting can often distort the situation, as when, during the January shutdown, polls asked who was to blame, and the split was something like 30% apiece for Trump and the Democrats, and 20% for congressional Republicans.  The press largely reported this as suggesting most blamed Trump and the Democrats.  But that's a bad comparison.  Lumping Trump together with his own party, half of respondents blamed the Republican side against 30% for the Democrats – the clear plurality – and that would have been the better way to report it.

We're seeing the same thing now when people are asked whether teachers should be armed.  There's been a lot of press about a CBS poll which purports to show (according to CBS's own headline) that 44% of people support arming teachers, while 50% are opposed.  But the wording of the poll reads, "Do you favor or oppose allowing more teachers and school officials to carry guns in schools?"  Because the question says "allow," many may have interpreted this not as a mandate but simply as a possibility, and "school officials" may have been taken to include security personnel.  Hard to say.


Rule #2: It matters who is doing the asking.

Not all of the bad wording in polls is simply journalistic malpractice.  Sometimes, it's malice.

538 has computed a list of pollster ratings (which, while not perfect, is a good first pass), and there are a bunch which are consistently right-leaning.  Many you've likely never heard of, but some, including Emerson, Harris, Gravis, and Rasmussen (which will get a lot more disdain below, and which I'll call "Raz" from here on), get a lot of press.

One of the ways that right-wing hack pollsters become right-wing hack pollsters is by intentionally stacking the questions, even on non-controversial topics. For example, Rasmussen's current polling intentionally deflects from the results of Mueller's investigation by also asking respondents whether the US interferes in other countries' elections.  You can imagine the wording in other polls – "Given the intransigence of the Democrats in Congress, how do you feel about...?" – or polls that are based on fundamentally flawed premises.  This misleading wording is the cousin of the notorious "push poll," which is simply propaganda pretending to be a poll.

Pollsters have a track record, and 538 takes those track records into account, but the result can be skewed because they tend to look only at the final poll before an election.  For instance, in August 2008, Raz had McCain beating Obama by 12 points and then, in October, suddenly showed them neck and neck (spoiler: Obama won). In October 2012, it had Romney up by 4 points right up until Election Day, when Obama was reelected by nearly 4 points.

But the other big thumb on the scale is that it's actually really tough to do a poll.  You have to call a lot of people, and fewer than 1 in 10 people answer phone polls at all. In order to correctly estimate what would have happened if you'd talked to a representative sample, you have to adjust your sample to get the right fractions of men and women, weighted by race, location (urban vs. rural, region, etc.), and age. But every pollster has to make choices about how to re-weight their samples.
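If you're curious what that re-weighting looks like in practice, here's a minimal sketch of the simplest version (sometimes called post-stratification): weight each respondent by how under- or over-represented their group is in the raw sample. The age groups, approval rates, and population shares below are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sample that skews old (as phone samples tend to).
respondents = pd.DataFrame({
    "age_group": ["18-39"] * 200 + ["40-64"] * 400 + ["65+"] * 400,
    "approve": np.concatenate([np.random.rand(200) < 0.55,
                               np.random.rand(400) < 0.45,
                               np.random.rand(400) < 0.35]),
})

# Made-up census-style targets the pollster wants the sample to match.
population_share = {"18-39": 0.38, "40-64": 0.42, "65+": 0.20}

# Weight each respondent by (population share) / (share in the raw sample).
sample_share = respondents["age_group"].value_counts(normalize=True)
respondents["weight"] = respondents["age_group"].map(
    lambda g: population_share[g] / sample_share[g])

raw = respondents["approve"].mean()
weighted = np.average(respondents["approve"], weights=respondents["weight"])
print(f"raw approval: {raw:.1%}   re-weighted approval: {weighted:.1%}")
```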

The Times did a piece during the 2016 campaign wherein they gave the same data (the raw survey results) to 4 different high-quality pollsters and found a 5 point range in results.  That's pretty bad.  A big part of that is that some pollsters insist on trying to pick "Likely Voters" rather than "Adults."  The problem is that we know how many men and women, people under 40, African Americans, and so forth there are in a population, but we don't know how many of them are actually going to vote, and past history is not a great indicator, especially now. For actual head-to-head polls, LV models are a necessary evil, but for everything else, they're just another way to put your thumb on the scale.

Finally, it really matters how you poll people.  The best polls use both landlines and cell phones, which is expensive and time-consuming.  Landline-only polling tends to over-sample older (and thus more conservative) voters. Interactive Voice Response (IVR) callers tend to get abysmal response rates.  Internet-only polls are even worse, for obvious reasons.




Rule #3: All polls have errors.

Do not believe any pollster who says something like, "Trump's approval went up by 1% since last month," and not just because any rise is unbelievable.  All polls have noise, even if you can correctly weight your sample.  I've got some more details in a set of probability notes that I wrote for students, but for now, a couple of examples will suffice.

Here's a simple example, in which I ask (or have a computer simulate asking) 1000 people a yes-no question like "Do you approve of Trump's performance?" where each one has a 47% chance of saying yes.  Then I run the experiment again and again, and each time, I get a slightly different outcome:
 
A bunch of polls measuring the approval of a candidate who really has 47% approval (the kind of support that Trump could never dream of). Note that occasionally by random chance (~7% of the time) the politician polls with 50% or more, despite only having a minority of support.
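If you want to reproduce a figure like this yourself, here's a minimal simulation sketch (the seed and the number of simulated polls are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n_people, n_polls = 0.47, 1000, 100_000

# Each simulated poll asks 1000 people; each answers "yes" with probability 0.47.
yes_counts = rng.binomial(n_people, p_true, size=n_polls)
approval = yes_counts / n_people

print(f"average measured approval: {approval.mean():.1%}")
print(f"scatter (standard deviation): {approval.std():.2%}")  # ~sqrt(0.47*0.53/1000) ~ 1.6%
# How often does the reported topline round up to 50% or more?
print(f"polls reading 50% or better: {(np.round(100 * approval) >= 50).mean():.1%}")
```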

This is a "bell curve" or, fancier, a "normal distribution," or even fancier than that, a "Gaussian."  Most of the time, I very nearly get 47% of my respondents saying yes, But there's a big scatter.  Here's a formula:

$$\sigma=\sqrt{\frac{p(1-p)}{N}}\simeq \frac{0.5}{\sqrt{N}}$$

$p$ is the probability of getting a "yes" (if you were able to ask literally everyone), and $N$ is the number of people you actually talk to.  The $\sigma$ tells you about the range of possibilities.
  • 68% of the time, you'll poll within 1$\sigma$ of $p$.
  • 95% of the time, the result will be within 2$\sigma$ of $p$.
For 1000 people, $\sigma=1.6\%$, so you'd expect to poll between 44% and 50%, roughly 95% of the time.
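As a sanity check, here's the same calculation done directly from the formula (a sketch; `poll_sigma` is just a name I made up):

```python
import math

def poll_sigma(n, p=0.5):
    """Standard error of a yes/no poll of n people; p=0.5 is the worst case."""
    return math.sqrt(p * (1 - p) / n)

p, n = 0.47, 1000
sigma = poll_sigma(n, p)
print(f"sigma = {sigma:.1%}")                                            # ~1.6%
print(f"68% of polls land between {p - sigma:.1%} and {p + sigma:.1%}")
print(f"95% of polls land between {p - 2 * sigma:.1%} and {p + 2 * sigma:.1%}")
```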

A worked example: Today's CNN poll
Hot off the presses, there's a new poll out from CNN saying that Trump's approval went from 40% last month to 35% now. Does it mean anything?

Well, it's a high quality (cell+landline) pollster, talking to 1,016 people, so:
$$\sigma=\frac{0.5}{\sqrt{1016}}= 1.6\%$$

But in order to tell whether two results are different, ask:  
Are they more than $\sqrt{2}\sigma$ apart?  If yes, then the difference is probably real. If no, they're basically the same.  (The $\sqrt{2}$ is there because the uncertainties of the two independent polls add in quadrature.)
In this case, $\sqrt{2}\sigma=2.2\%$.

The new result differs from the old by about 5 points, so probably!

There's another test:
Are they more than $2\sqrt{2}\sigma$ different?  If yes, then almost certainly.
In this case, $2\sqrt{2}\sigma=4.4\%$.

This 5 point shift in the CNN poll almost certainly means a real shift in opinion. 
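Putting the whole test into code (a sketch, using the same $0.5/\sqrt{N}$ approximation as above; `shift_test` is a name I made up):

```python
import math

def shift_test(p_old, p_new, n):
    """Compare two poll results against the sqrt(2)*sigma and 2*sqrt(2)*sigma thresholds."""
    sigma = 0.5 / math.sqrt(n)
    return abs(p_new - p_old), math.sqrt(2) * sigma, 2 * math.sqrt(2) * sigma

gap, probably_real, almost_certainly_real = shift_test(0.40, 0.35, 1016)
print(f"observed shift:               {gap:.1%}")                     # 5.0%
print(f"'probably real' threshold:    {probably_real:.1%}")           # ~2.2%
print(f"'almost certainly' threshold: {almost_certainly_real:.1%}")   # ~4.4%
```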


Why not be more accurate and just poll even more people?  The thing is, it's expensive to conduct a poll.  The best ones tend to contact people directly on the phone, using a combination of landlines and cell phones.  But to halve the uncertainty, you need to quadruple the number of people you speak with (meaning that your expenses go up by a factor of 4 as well).  This is why polls of around 1000 tend to be pretty popular.  The uncertainty is small enough to give a good estimate, but the cost isn't so high as to be impractical.
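In terms of the formula from above, quadrupling $N$ only halves the error bar:

$$\sigma(4N)=\frac{0.5}{\sqrt{4N}}=\frac{1}{2}\cdot\frac{0.5}{\sqrt{N}}=\frac{\sigma(N)}{2}$$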

And big $N$ doesn't solve all of your problems.

Different pollsters have what are called different "systematic errors," which no amount of averaging is going to get rid of.  It's possible, albeit unlikely, that Raz's LV model is the most accurate way to weight the data.

Here's your standard "random (Upper Left) vs Systematic (Upper Right) error" figure.
Because of the noise, trying to look for a trend in a daily tracking poll (like, you guessed it, Raz) is a fool's errand.  First, note that if a poll has a 1.6% uncertainty in "approval," then the "net approval," which is Approval minus Disapproval, has twice as large an uncertainty (3.2%) – since approval and disapproval are almost perfectly anti-correlated (nearly everyone who doesn't approve disapproves).
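To see where the factor of 2 comes from, assume essentially everyone picks a side, so Disapproval is roughly 100% minus Approval:

$$\text{Net}=A-D\approx A-(1-A)=2A-1\quad\Rightarrow\quad \sigma_{\text{Net}}\approx 2\,\sigma_A$$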

But tracking polls only redo a small fraction of their data every day.  Raz, for instance, polls 500 people a day and then reports a 3-day rolling average.  This produces a big mess.  Here's a simulation of a poll like that, but for which $p=0.5$.  The result certainly doesn't look 50-50, and it doesn't (to the untrained eye) look random.  There's also a 10 point swing (!) in net approval over the course of a month.  Remember, this is based on a model where the underlying popularity of the president is constant.
A simulated tracking poll with a true approval of 50% and a sample of 500 per day, averaged over 3-day intervals (same as Raz).  It looks like there are day-to-day trends, even though nothing is really changing.
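If you want to play with this yourself, here's a minimal sketch of that kind of simulation (it assumes no undecideds, so net approval is just $2p-1$, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, per_day, n_days = 0.50, 500, 30

# Each day, poll 500 fresh people whose true approval never changes.
daily_approval = rng.binomial(per_day, p_true, size=n_days) / per_day

# Report a 3-day rolling average, Rasmussen-style.
tracker = np.convolve(daily_approval, np.ones(3) / 3, mode="valid")

# With no undecideds, net approval is approve minus disapprove = 2p - 1.
net_approval = 100 * (2 * tracker - 1)
print(f"reported net approval wanders from {net_approval.min():+.1f} "
      f"to {net_approval.max():+.1f} points over the month")
```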

Don't believe the hype when someone claims to see a signal after 1 day.



Rule #4: Averages can tell you a lot.


The larger the total $N$, the smaller the uncertainty. This is why places like 538 and RealClearPolitics average lots of pollsters, though I don't agree entirely with their methodology.  For instance, high-quality pollsters (who do more live cell-phone calling) cost more, so they poll less frequently, and when they do, their results tend to produce sharp steps in the averages.  Lower-quality pollsters use IVR and internet panels and can afford to poll weekly; at least 6 polling organizations do so.

So now for some real data! I've collected the last 6 months of net-approval polling (mostly, but not exclusively, the cheaper variety; Gallup is the exception), and the whole thing looks like a big mess.
6 months of weekly polling from 6 different pollsters.  There's a lot of noise in there, right?

You can see the different systematics by eye. Raz and Morning Consult consistently produce better net approval for Trump than anyone else.  Raz is actually even worse than it looks, because they force an "approve or disapprove" choice, with only 1-2% not answering. Thus, a 0 net approval from Raz means 50% approve; for Morning Consult, a 45-45 split would give the same net. Most of the polls talk to around 1500 people, though internet-only pollsters like Survey Monkey typically poll more than 10,000 – but as we'll see, this doesn't actually help their error bars.

We have a priori reasons to suppose that Raz has bad systematics, but we can ignore that (for now) if we're only interested in the trendlines. We can look at the variations of all 6 polls from their own averages – to put them on an equal footing.   The pollsters' average net approvals over the last 6 months:
  • Gallup: -20
  • Raz: -11
  • YouGov: -13
  • Morning Consult: -8
  • Survey Monkey: -16
  • Ipsos: -19
There is a 12 point systematic difference between the most Trump-friendly pollster (Morning Consult) and the least (Gallup – the highest quality in terms of methodology).  Subtracting out the averages, the results look marginally more coherent (but not much):
As above, but with all polls averaged to zero.  In all 6, Trump is in better shape now than he was 6 months ago.  Sorry about that.
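The bookkeeping behind that figure is simple. Here's a sketch, with made-up weekly numbers standing in for the real series (only three pollsters shown):

```python
import numpy as np

# Hypothetical weekly net-approval numbers for three of the pollsters (made up).
polls = {
    "Gallup":          np.array([-22, -19, -21, -20, -18, -20]),
    "Raz":             np.array([-13, -10, -12, -11,  -9, -11]),
    "Morning Consult": np.array([-10,  -7,  -9,  -8,  -6,  -8]),
}

# Subtract each pollster's own average to remove its systematic offset...
trends = {name: series - series.mean() for name, series in polls.items()}

# ...then average the de-biased series week by week to get a consensus trendline.
consensus = np.mean(list(trends.values()), axis=0)
print(np.round(consensus, 2))
```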
Or, to reduce the (random) noise, we can average all of them together.  This is, in essence, what 538 and RealClearPolitics do. The difference here is that we're not getting skewed by a monthly interloper in the form of a high-quality pollster.  Every week is averaged on an equal basis, as it were:

Here's the bad news (at least if you're a fan of sanity).  In the last 6 months, Trump's net approval seems to have improved by about 8 points (or about +4 to his "approval").  Moreover, if you look at the timing of the recent uptick, it's right around the State of the Union and the beginning of February, when people got their first paychecks after the new tax bill.  That said, the last two weeks have seen a downtick of about a point, so maybe his odious response to the Florida shootings is the first step in reversion to the mean.

I have one more test.  Let's look at each individual (mean subtracted) poll and compare it to the averages of the 5 others.  Here's what Raz looks like:
It's a little complicated.

For the last 3 weeks, the net Raz polls have been 2-6 points high compared to the other 5, even after you subtract out their overall bias.

The pollsters generally correlate with one another. Each poll correlates with the other 5 averaged with a coefficient of about 0.7.  (1 means they trace each other exactly, 0 means totally random, and -1 means when one goes up the others go down and vice versa).
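For the record, the correlation in question is just each pollster's mean-subtracted series against the average of the others, which you can compute like so (a sketch, with made-up, already mean-subtracted series):

```python
import numpy as np

# Made-up mean-subtracted weekly trends for three pollsters (illustration only).
trends = {
    "Gallup": np.array([-1.5,  1.0, -0.5,  0.5,  2.0, -1.5]),
    "Raz":    np.array([-2.0,  1.5,  0.5, -0.5,  3.0, -2.5]),
    "YouGov": np.array([-1.0,  0.5, -1.0,  1.0,  1.5, -1.0]),
}

def versus_the_rest(name):
    """Correlate one pollster's trend with the average of all the others."""
    others = np.mean([v for k, v in trends.items() if k != name], axis=0)
    return np.corrcoef(trends[name], others)[0, 1]

for name in trends:
    print(f"{name}: r = {versus_the_rest(name):+.2f}")
```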

But we can also see how much the trendline of each pollster misses the general consensus, even after you subtract out the systematics. The standard deviation (the average miss, in points of net approval) is:


  • Gallup: 2.6
  • Raz: 3.0
  • YouGov: 2.7
  • Morning Consult: 2.9
  • Survey Monkey: 2.5
  • Ipsos: 3.0
Survey Monkey produces the nominally most consistent results, but nowhere near what you'd expect from asking 10,000 people (which should give a random error of only about 1%).  Raz produces the worst, tied with Ipsos.
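For comparison, here's the purely statistical error bar you'd expect for a Survey Monkey-sized sample, using the formula from Rule #3 (and the no-undecideds factor of 2 for net approval):

```python
import math

n = 10_000
sigma_approval = 0.5 / math.sqrt(n)   # ~0.5% uncertainty on "approve" alone
sigma_net = 2 * sigma_approval        # ~1.0% on net approval (approve minus disapprove)
print(f"expected random scatter in net approval: {sigma_net:.1%}")
```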


Short version: when people criticize Raz and others for being terrible pollsters, there are good reasons.
