What is sampling?
Sampling is when Google Analytics builds a report from a subset of your traffic and estimates the rest, so that it can deliver larger or more complex reports in less time. For example, let’s say you ask Google Analytics for a report that looks at keywords by city for the past three months. You might be asking Google Analytics to pull data on quite a lot of visits (“sessions”), and in a way that isn’t already preconfigured by Google. This means that the report doesn’t already exist by default; you would have to use a secondary dimension dropdown or create a custom report in order to see it.
Normally this report could take a very long time to load because of all the work that Google’s processors have to do to pull and organize the data correctly. Some web analytics tools like Omniture SiteCatalyst will have this delay – you might be staring at your screen for ten minutes waiting for a report to finish loading. Google doesn’t want users to have this kind of experience, so Google samples the data instead.
When I think of sampling, I think of biological studies. No, really! Think of an ornithologist studying a flock of geese. He wouldn’t try to capture the entire flock to study them. Instead, he captures a sample of the flock, tags them, and then next season when they return he makes inferences about the entire flock based on what his sample has done. Sampling in Google Analytics is much the same – Google is taking a percentage of your visitors and extrapolating on their data to make patterns that represent your total visitors.
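To make the geese analogy concrete, here is a minimal sketch of the general idea: look at a random subset of sessions, count what you care about, and scale the count back up. This is only an illustration. The data is made up, the function name `estimate_total_from_sample` is my own, and Google's actual sampling algorithm is not something I can reproduce here.

```python
import random


def estimate_total_from_sample(sessions, sample_rate, rng):
    """Estimate total conversions by counting them in a random sample
    of sessions and scaling the count up to the full population."""
    sample_size = max(1, round(len(sessions) * sample_rate))
    sample = rng.sample(sessions, sample_size)
    conversions_in_sample = sum(sample)
    # Scale the sample count up to the size of the full population.
    return conversions_in_sample * len(sessions) / sample_size


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: 100,000 sessions, roughly 3% of which converted
# (1 = converted, 0 = did not convert).
sessions = [1 if rng.random() < 0.03 else 0 for _ in range(100_000)]

actual = sum(sessions)
estimate = estimate_total_from_sample(sessions, 0.25, rng)
print(f"actual={actual}, estimated from a 25% sample={estimate:.0f}")
```

The estimate will usually land close to the real number but rarely exactly on it, which is the trade-off sampling makes for speed.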
So, is sampling bad?
Not necessarily. Keep in mind that Google Analytics is already an imperfect measuring solution – it doesn’t track users who have cookies disabled or who block its tracking script with ad-block plugins, for example. So expecting to get absolutely 100% accurate data from Google Analytics (or any web analytics tool, in truth, although some come closer than others) is an unrealistic standard to have.
That being said, sampling can be harmful if Google Analytics is sampling too much. Think of the geese example – if you are guessing about what the entire flock does based on what 80% of the geese do, then it’s probably a pretty safe assumption. However, if you are guessing what the entire flock does based on 30% of the geese, your guess might not be a good one.
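If you want a rough feel for how much the sample percentage matters, the textbook margin of error for a proportion (with a finite population correction) is one way to look at it. This is a sketch under a big assumption: it treats the sample as a simple random sample of sessions, which may not match how Google Analytics actually samples. The function name and the example numbers are mine.

```python
import math


def margin_of_error(proportion, total_sessions, sample_rate, z=1.96):
    """Approximate 95% margin of error for an observed proportion
    (e.g. a conversion rate) measured on a simple random sample."""
    if total_sessions <= 1 or not 0 < sample_rate <= 1:
        raise ValueError("need total_sessions > 1 and 0 < sample_rate <= 1")
    n = total_sessions * sample_rate
    standard_error = math.sqrt(proportion * (1 - proportion) / n)
    # Finite population correction: the error shrinks to zero as the
    # sample approaches the whole population.
    fpc = math.sqrt((total_sessions - n) / (total_sessions - 1))
    return z * standard_error * fpc


# Hypothetical segment: 2,000 sessions with a 5% conversion rate.
for rate in (0.30, 0.80, 0.95):
    moe = margin_of_error(0.05, 2_000, rate)
    print(f"sampled at {rate:.0%}: 5% conversion rate, +/- {moe:.2%}")
```

The smaller the segment you are looking at, the more a low sample percentage hurts, which is why heavily filtered reports are the ones to watch.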
With this knowledge, be very careful when you are pulling automated reports from API tools like Excellent Analytics across many months at a time. This is a very easy way to pull highly sampled data, and the free version of Excellent Analytics will not tell you when your data is sampled. Excellent Analytics Pro, the paid version, does have a great feature (in addition to a host of other nice features) that tells you whether your data is sampled or not.
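If you pull data through the API yourself rather than through a tool, you may be able to check for sampling directly. The sketch below assumes the response has already been parsed into a Python dict and that it carries the fields `containsSampledData`, `sampleSize`, and `sampleSpace`, which is how I recall the Core Reporting API (v3) response being documented. Treat those field names as an assumption and confirm them against Google's API reference; the helper `describe_sampling` and the sample values are hypothetical.

```python
def describe_sampling(response):
    """Summarize whether a parsed API response was sampled.

    Assumes (unverified here) that the response dict has the keys
    'containsSampledData', 'sampleSize' and 'sampleSpace'.
    """
    if not response.get("containsSampledData", False):
        return "unsampled"
    sample_size = int(response["sampleSize"])    # sessions actually used
    sample_space = int(response["sampleSpace"])  # sessions available
    percent = 100.0 * sample_size / sample_space
    return f"sampled: based on {percent:.2f}% of sessions"


# Hypothetical response fragment, not real data.
example_response = {
    "containsSampledData": True,
    "sampleSize": "2227",
    "sampleSpace": "10000",
}
print(describe_sampling(example_response))
```

Logging this alongside every automated pull makes it much harder for heavily sampled numbers to slip into a report unnoticed.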
How will I know when Google Analytics is sampling, and by how much?
When Google Analytics samples, you’ll see a yellow “sticky note” at the top of your reports that looks like this:
In the example above, you can see that the sampling is based on 22.27% of my visits. I wouldn’t consider this to be very reliable data. This data should be OK for looking at really big increases or decreases, but I wouldn’t make any life or death decisions based on it. I prefer sampling to be based on 80% of visits or higher, but even then keep an eye out for anything that looks suspect. 95% or higher is ideal.
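My rules of thumb above can be written down as a small helper, which is handy if you are checking many reports at once. The thresholds (80% and 95%) are my personal preference from this post, not anything Google publishes, and the function name is just illustrative.

```python
def sampling_confidence(percent_of_sessions):
    """Rate a report by the percentage of sessions it is based on,
    using the rule-of-thumb thresholds from this post."""
    if not 0 <= percent_of_sessions <= 100:
        raise ValueError("percent_of_sessions must be between 0 and 100")
    if percent_of_sessions >= 95:
        return "ideal"
    if percent_of_sessions >= 80:
        return "acceptable, but watch for anything suspect"
    return "directional only: fine for big swings, not for major decisions"


for pct in (22.27, 85, 97.5):
    print(f"{pct}% of sessions -> {sampling_confidence(pct)}")
```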
The weird icon with the black and white squares is an option that also appears, enabling you to control the sampling to an extent. If you click on this icon, you’ll be presented with a slider – if you slide the button all the way to the right, you will get more precise reports with less sampling! In some cases this slider can make the difference in allowing you to feel confident in your data.
Useful links & Info:
- The best article from Google that I have found on sampling is here.
- Google Analytics Premium offers sample-free data, but only for downloaded reports; it is not currently available in the Google Analytics user interface (what you see when you log in at google.com/analytics). Clients with a lot of visitors (over 500,000 a month) are more likely to run into sampling, so this might be something for them to consider.