It’s 2015 — You’d Think We’d Have Figured Out How To Measure Web Traffic By Now

In May, a Vanity Fair article about Bill Simmons’s departure from ESPN said that Grantland had 6 million unique visitors in March but that “ESPN’s internal numbers … had the site reaching 10 million uniques in April.”

Late last year, The Wall Street Journal noted that Buzzfeed had 74.6 million monthly uniques, but that its “internal traffic numbers are far higher than the comScore figures … in striking distance of passing 200 million unique viewers per month.”

Last fall, Arianna Huffington wrote “100 Million Thank-Yous” to celebrate Huffington Post’s 115 million unique visitors in August but noted that their “internal numbers, at 368 million UVs, are much higher, of course.”

Not even as numerate an institution as FiveThirtyEight is immune.

Uniques are what most people mean when they talk about a website’s traffic. Show up once and you count as one unique visitor — show up again in the same month, or even visit the site every day in that month, and you still count as one unique visitor (or at least that’s the idea). Uniques are the big-picture number — the Nielsen rating, the Blue Book value, the GDP — that’s supposed to show how well a website is doing. People used to talk about pageviews, a simple count of how many pages were loaded over a certain amount of time. But uniques have taken over, because uniques measure people, not pages. Advertisers care about the former when they’re planning an ad buy.

If uniques are people, how do 4 million, or 125 million, or 253 million people go missing? In an age when we assume our phones and laptops are tracking our every move, taking an actual head count of how many people go to a website is still almost impossible. There’s a blind spot at the center of the panopticon, and it’s roughly the size and shape of a cookie.

Lou Montulli invented “Web cookies” to give the Web a memory. On his blog, The Irregular Musings of Lou Montulli, he described surfing the pre-cookie Internet as “a bit like talking to someone with Alzheimer[’s] disease,” where “each interaction would result in having to introduce yourself again, and again, and again.”

Practically, this meant that every time you wanted to check your email, you had to re-enter your username and password. Shopping online was even harder: Getting all the way through the checkout process depended on clicking directly from page to page — if you happened to hit “back” or just closed your Outpost.com¹ window by mistake, you’d have to start over from the beginning.

In 1994 Montulli noted all this while he was a programmer at Netscape, and he decided to fix it — he decided to make cookies to serve as little memory files for our online lives.² After that, when you went to Outpost.com, your browser would download a cookie file to a folder on your hard drive. The next time you visited, the site would ask your browser to check whether you had an old Outpost.com cookie sitting around. If so, it would remember who you were, or that you had a left-handed Apple mouse in your virtual shopping cart, and you wouldn’t have to start from scratch.

The simplest solution to the problem of a Web with no memory would have been to give every Web browser, or even every Web user, a unique ID code, a driver’s license for the information superhighway. But Montulli made sure that didn’t happen.

“I was very much against this concept,” Montulli writes, “because the unique identifier could be used to track a user at every website.” Cookies, in other words, were designed to thwart surveillance and the kind of broad-spectrum tracking that advertisers crave. Far from a driver’s license, cookies were just online loyalty cards, stamped by a website every time you stopped by.

Marketers soon realized that cookie technology, with a slight twist, could work for them in some ways. In addition to a website’s own “first-party” cookies, marketers started asking websites to serve up the marketer’s own “third-party” cookies, too. Then, when you visited two websites that had agreed to serve up the same marketer’s third-party cookie, the marketer’s server would register a match and know that you’d been on both sites — spread those matches far enough, and the marketer now has a good picture of your overall behavior. No need for a driver’s license if marketers can just slap a sign on your back when you aren’t looking.

This allowed marketers to build up usage profiles and then, more importantly, start serving up ads across a person’s Web experience — if you went to pools.com, they could see that you had visited mesotheliomalawyers.com earlier that week and serve up an ad asking if you need legal help with an asbestos-related disease. But thanks to the way cookies work, third-party cookies still couldn’t tell marketers how many real people went to a website. Because cookies, whether first-party loyalty cards or third-party secret trackers, aren’t attached to people at all, but individual browsers on particular computers.

If you use both Chrome and Safari in a day, week or month, then you, the person, are now represented by two separate cookies. If you use Chrome and Safari on both your work and home computers, then two cookies becomes four. If you also use a phone and a tablet, and use multiple browsers on those, four becomes eight. And if, at some point during the month in which these cookies are being tracked, you or your antivirus programs delete your cookie cache, then fresh cookies get served, and the numbers climb even higher.

Those huge, parenthetical, internal traffic numbers are the raw cookie counts — the number of humans who visited a site, multiplied by all the browsers, machines and accidental deletions.

The lower numbers are just the cookies, crunched. ComScore, Quantcast, Nielsen and other measurement companies use proprietary models to estimate how many actual people went to a website over a given amount of time. There really is no way to directly measure uniques, but the companies’ estimates are much more accurate reflections of traffic reality.

ComScore was one of the first companies to get into the measurement game for the Web. I asked their chief research officer, Josh Chasin, how they come up with their numbers every month. Some background was required before he could answer.

“When comScore started out, we said we measured the Internet, but what we really measured was computer access to the Internet,” Chasin said. “At the time, those two were synonymous. But now measuring the Internet means measuring across multiple devices, notably smartphones and tablets, but also gaming consoles, Roku, Apple TV, and it’s probably also going to mean measuring watches.”

ComScore was one of the first businesses to take the approach Nielsen uses for TV and apply it to the Web. Nielsen comes up with TV ratings by tracking the viewing habits of its panel — those Nielsen families — and taking them as stand-ins for the population at large. Sometimes they track people with boxes that report what people watch; sometimes they mail them TV-watching diaries to fill out.³ ComScore gets people to install the comScore tracker onto their computers and then does the same thing.

Nielsen gets by with a panel of about 50,000 people as stand-ins for the entire American TV market. ComScore uses a panel of about 225,000 people⁴ to create their monthly Media Metrix numbers, Chasin said — the numbers have to be much higher because Internet usage is so much more particular to each user. The results are just estimates, but at least comScore knows basic demographic data about the people on its panel, and, crucial in the cookie economy, knows that they are actually people.⁵

As Chasin noted, though, the game has changed. Mobile users are more difficult to wrangle into statistically significant panels for a basic technical reason: Mobile apps don’t continue running at full capacity in the background when not in use, so comScore can’t collect the constant usage data that it relies on for its PC panel. So when more and more users started going mobile, comScore decided to mix things up.

“Before 2009, we were pretty staunchly in the panel camp, but then we realized they weren’t enough,” Chasin said. “We’re pretty clear on this now: good measurement requires the integration of panel measurement and site-centric measurement from tagging.”

Tagging works basically like third-party cookies. Websites that hire comScore or Quantcast or Nielsen to measure their sites embed little one-pixel “beacons” in each of their pages, which ping back to the measurement company’s servers each time they’re loaded, recording data such as users’ IP addresses, what time they loaded the page and what cookies they already have saved. The companies then combine the panel and tagging data, compare that to the raw internal cookies, and out pop the uniques.

ComScore produces the most widely referenced online audience measurement numbers in the business, but that doesn’t mean its numbers are the most accurate. “It’s probably fair to say right now that our mobile panel could be larger,” Chasin said. Using the server-side tagging system helps close that gap to some degree, but as the majority of Web traffic migrates to mobile, that leaves a huge potential hole in comScore’s numbers. On a less technical note, too, there’s the fundamental problem that all this modeling takes time. ComScore and its competitors come out with their top-level traffic rankings weeks or months after the period they’re measuring, leaving publishers and ad buyers to work with old data in an industry built on the premise of instantaneous communication.

Each measurement company comes up with different numbers each month, because they all have different proprietary models, and the data gets more tenuous when they start to break it out into age brackets or household income or spending habits, almost all of which is user-reported. (And I can’t be the only person who intentionally lies, extravagantly, on every online survey that I come across.)

In the end, though, just having a number that everyone can point to as an acceptable proxy of reality is more important than how accurate that number may be. The Nielsen TV rating is notoriously fuzzy, but companies bought $78 billion of TV ads in 2013 based on their faith that those ratings were good enough. ComScore could theoretically measure mobile better, and come out with real-time reporting, but money is as much a limiting factor as technology. Metrics are only ever as good as it is financially viable for them to be, and advertisers, publishers and agencies will pay for only as much accuracy as their own business will support. Right now, comScore leads the industry when it comes to online audience measurement, and comScore has to be only accurate enough to keep that lead.

So, unless you have a serious paywall, and therefore have users who are logged in 100 percent of the time (like the Financial Times), there is just no way to know for sure how many individual real-live people visit your site in a month, week or day.

And that’s assuming that real people are even visiting your site in the first place. A study published this year by a Web security company found that bots make up 56 percent of all traffic for larger websites, and up to 80 percent of all traffic for the mom-and-pop blogs out there. More than half of those bots are “good” bots, like the crawlers that Google uses to generate its search rankings, and are discounted from traffic number reports. But the rest are “bad” bots, many of which are designed to register as human users — that same report found that 22 percent of Web traffic was made up of these “impersonator” bots.

Given the size of this bot horde, an industry-funded regulatory agency called the Media Ratings Council is moving to require all measurement services to include bot-detection and exclusion methods in their products in order to get their official stamp of approval. But even if all the bot traffic can be weeded out, that’s one more estimation that has to be folded into the estimates, all using another layer of proprietary methods, further widening the divide between what can be directly measured and what can be considered reality.

I asked David Coletti, ESPN’s VP of digital media research and analytics, how big the difference between the internal and external numbers tends to be across the sites (like this one) that he oversees for the company.

“We always see a delta of at least a couple million,” Coletti said, for the smaller sites under his aegis (again, like this one). But in his experience, “the more the site is visited, the bigger the discrepancy gets.”

At ESPN.com, the mothership of ESPN Web properties, Coletti says he’ll often see the internal numbers for monthly unique visitors running at three times the comScore numbers.

“If I were to go out and make the argument that the internal number is correct,” Coletti said, “I would be suggesting that every American visited ESPN in the past month, which would be wonderful, but unlikely.”

Traffic, as represented by unique visitors, will always be estimated under the current technological regime, and those parenthetical “internal numbers” that reporters drop in media stories bear little relation to how many actual people go to a given website. Or as Coletti puts it: “Neither numbers are right or wrong — they’re just counting in different ways, and it’s unsatisfying.”

Facebook is trying to change that.

The social media giant announced in May that it would begin hosting articles directly on its own servers, with no link out to the websites that created them. The content-creating websites (in the pilot program, that means outlets including The New York Times and Buzzfeed, but more are sure to come) justified this move as necessary to bring in high Web traffic. Hosting the articles on Facebook allows for flashier “read this” buttons and shorter loading times, which in turn, theoretically, makes more people read the articles, boosting traffic.

But for Facebook, and advertisers and the media companies themselves, this move also solves the cookie problem. Facebook doesn’t need cookies — it has faces, faces of real people, or at least accounts that correspond to real people, which means that it knows how many real people look at an article hosted on Facebook. And more than that, even, it knows their names, and their ages, and what they “like,” and probably where they live.

Apple and Google are in a position to break the cookie regime, too, with the possibility of persistent logins across browsers, devices, days and years, but Facebook is out front. In the current version of the future, knowing how many real people went to a given site will likely also mean knowing which real people went to a given site. No proxy, no guessing, just you.

The Internet has become the first fully paranoid mass medium. If we read, if we click, if we watch, we do so with the knowledge that we are being watched in turn. When ads adjust to what we type and feeds adjust to what we like, we have visual proof that the network is looking at us. When the watchers seem to get it wrong, and show us an ad for orthopedic surgery after we search for elbow macaroni, we get to experience the grim glee, once reserved for prisoners and test subjects, of hearing loud snores through the one-way mirror.

This wasn’t the purpose of the Internet when it first got going, but it quickly became its selling point. Advertisers dreamed of reaching “one to one,” a state of omniscience in which they could precisely target not only specific demographics but individual consumers with a particular ad. The Internet promised to make that dream come true.

Twenty years later, we take it as a given that we’re living in that dream. We are tracked, through our phones and our laptops, by a long list of companies, and assume that they probably know everything we do.

But the assumption has preceded the reality.

The cookie conundrum, the direct uncountability of how many people actually go to a given website, isn’t even considered a major issue in the online ad world — they have much bigger problems. Studies over the past couple of years have suggested that more than half of the ads on the Internet never even make it to the visible rectangle of someone’s screen. For 20 years, people have been paying for ads that, far from being shown to the one person most susceptible to their charms, have been shown to literally no one. Video ads, until very recently, qualified as “seen” even if they played in a hidden tab, with the sound off, or below the fold. The industry that we assume is watching us all the time has only just come up with a working definition for when an ad is “viewed.”

Right now, though, the industry is finally starting to catch up to its omniscient image. An association of online advertisers has declared 2015 the official “Year of Transition” as publishers and marketers try to figure this all out, but they will figure it out soon. The technology is in place to watch users as we assumed we’d been being watched all along. Chartbeat, for instance, runs lightweight JavaScript programs on its clients’ websites to record, every 15 seconds, where our cursor is on the screen, how often we scroll down the page and a host of other “engagement” metrics. The industry as a whole — publishers, marketers, advertisers and measurement companies — will presumably agree on the best way to use that tracking technology in the next couple of years, and start buying and selling ads based on the metrics it can measure. If that happens, the dream will get a lot closer to coming true.

The days of the cookie and its intentional privacy features (or tracking flaws) may be numbered, too. Right now, its fallibility makes us harder to count and harder to track, but it might become obsolete as browsers stop accepting third-party cookies, more and more users switch to the mobile Web, and persistent logins (like Facebook’s) become more widespread. Mobile devices have persistent identities — the generic MAC Address, Android_ID for Android devices, and the unsubtly named Identifier for Advertisers on Apple devices — which let marketers tie a single device to a single user, and Verizon Wireless has even been quietly inserting a “Unique Identifier Header” (essentially that online driver’s license) into the Web traffic of its subscribers for at least two years.

But for now, at least, Lou Montulli’s cookie is still doing its job, serving as a kind of passive privacy shield. Its virtue is its impermanence, giving us a small escape hatch out of the economy that it helped create. Third-party cookies have the air of the nefarious, but they’re grainy black-and-white security cameras stuck in a corner. We’re on the cusp of the HD era, about to enter the sci-fi surveillance world we thought we’d been living in all along.

Or at least, to ratchet down the paranoia, a world where we can say, for sure, how many people visited this page.

Footnotes

(NASDAQ: COOL). This is true.
Montulli also came up with the idea for the <blink> HTML tag, is one of only six inductees of the World-Wide Web Hall of Fame, and created The Amazing Fishcam, a webcam feed of a fish tank in the Netscape offices that, according to its website, “was the second live camera on the Web and is the oldest camera site still in existence.”
Sweeps season gets its name from this old-school arrangement — to keep the logistics load down, Nielsen used to mail these viewing diaries out to smaller markets just a few weeks out of the year, starting with the East Coast and then “sweeping” west across the country.
That’s out of the million or so people in the U.S. (and 2 million worldwide) that comScore has recruited for its panel (though “recruit” might be a strong word — they either ask people to opt in for some small reward using a banner ad or ask people to download the meter in exchange for free software through partner sites). Each month, Chasin said, about 225,000 of those million empaneled households have provided data good enough to go into the Media Metrix calculations.
And they claim to be able to tell multiple users on one home computer apart based on accessing particular private accounts (Jim only logs on to Jim’s email, etc.) or, barring that, viewing habits — in the past, right after someone logged on to Jim’s email, that person often went to Etsy, so the Etsy-goer is probably Jim. It’s not perfect.

Footnotes

Comments