Exposure Assessment: Using Stats to Make Sense of What a ‘High’ Concentration Is
The most common question I’ve encountered so far in my work is, “Is the concentration in this sample normal, or is it above what is expected?” My usual reply is that it depends on the study you’re looking at and whether its methodology was designed to answer that question accurately. The follow-up is often, “Why can’t you just take the average of a population and use that?” Worse, someone may bring you a printout of a regulatory guideline and insist on using that as the baseline for comparison. No one wants to know Mike’s opinion on the latter situation… Whether the subject is epidemiology or environmental contaminants, a proper assessment starts with appropriate controls and a meaningful comparison. That requires a control population or location and splitting the “normal” data into equal parts, which we’ll dive into below.
Biomonitoring and NHANES Case Study: Making Sense of Levels
Let’s start with biomonitoring as an example to answer these questions. Biomonitoring means testing people’s blood, urine, or other biological samples for chemicals like PCBs, those infamous polychlorinated biphenyls that were once used everywhere and still linger in the environment. The goal here? Biomonitoring helps us figure out what “normal” chemical levels are in the general population and, more critically, when levels become high enough to be a red flag. That’s where NHANES, or the National Health and Nutrition Examination Survey, comes in. NHANES, run by the CDC, is an ongoing, massive survey gathering health data and samples from thousands of Americans. Because it gives us this detailed look into what’s in people’s bodies, NHANES lets us compare an individual’s results with averages or percentiles across the country. So, if we find high PCB levels in a sample, NHANES data help us gauge where that result sits compared to the broader U.S. population.
What are quantiles/quartiles/percentiles?
Quantiles split the data into bins that each contain the same number of observations. For PCBs measured in NHANES, that means ordering individuals by concentration and placing an equal number of individuals in each bin. The term quartile is used when the data are split into four equal bins, generally referred to as the 1st, 2nd, 3rd, and 4th quartiles. Percentiles divide the data into 100 parts, so a given percentile is the value below which that percentage of the data lies. Figure 1 illustrates the percentile cut-offs for a sample set of 100 PCB measurements. At the 25th percentile in Figure 1, 25 sample measurements fall below the line, meaning 25 percent of all the data lie below that value. At the 50th percentile, 50 sample measurements fall below the line, meaning half of all the data lie below that value. This is also referred to as the median. At the 95th percentile, 95 sample measurements fall below the line and only 5 samples (5%) exceed it.
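To make the binning concrete, here is a minimal Python sketch (not the R code behind Figure 1). It assumes a made-up set of 100 log-normally distributed values standing in for PCB measurements. These are not NHANES data, and the variable names, the random seed, and the choice of the "inclusive" interpolation method are my own assumptions for illustration.

```python
import random
import statistics

# Hypothetical data: 100 synthetic "PCB concentrations" drawn from a
# log-normal distribution. These are NOT real NHANES measurements.
random.seed(42)
pcb = [random.lognormvariate(0.0, 0.75) for _ in range(100)]

# statistics.quantiles with n=100 returns the 99 cut points that divide
# the data into 100 equal groups; index k-1 holds the kth percentile.
# method="inclusive" treats the data as the whole population of interest.
cuts = statistics.quantiles(pcb, n=100, method="inclusive")
p25, p50, p95 = cuts[24], cuts[49], cuts[94]


def count_below(values, threshold):
    """Count how many measurements fall strictly below a threshold."""
    return sum(1 for v in values if v < threshold)


for label, cut in (("25th", p25), ("50th", p50), ("95th", p95)):
    n_below = count_below(pcb, cut)
    print(f"{label} percentile = {cut:.3f}: {n_below} of {len(pcb)} samples fall below it")
```

With 100 distinct values and this interpolation method, the counts below the 25th, 50th, and 95th percentile cut-offs come out to 25, 50, and 95, which mirrors the lines drawn in Figure 1. Other interpolation methods can shift the cut-offs slightly, so it is worth stating which one you used.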
Which percentile you use depends on what data you’re looking at. In epidemiological studies, the normal levels for the general population fall below the higher percentiles, which describe the upper end of the range of levels in the unexposed population. The 95th percentile is helpful for determining whether levels observed in separate public health investigations or other studies are above what is generally observed in a normal population. In environmental chemical studies, such as monitoring a contaminated site, background samples are often used as the reference. As in epi studies, if a sample’s concentration is above an upper quantile of the background samples, it can be considered impacted.
Figure 1 – PCB measurements conducted on a data set with 100 samples. At the 25th percentile in Figure 1, 25 sample measurements fall below the line, meaning 25 percent of all the data lie below that value. At the 50th percentile, 50 sample measurements fall below the line, meaning half of all the data lie below that value. This is also referred to as the median. At the 95th percentile, 95 sample measurements fall below the line and only 5 samples exceed it. Because I always care, I attached the R code for this figure as a hover label within this post.
Important Note!
The example above looks at individuals in the general population, and when you pose your question, it’s all about context. If we’re trying to see whether a sample is above ‘normal’, we must have a reference group that defines normal. In the context of the above, if the sample we are comparing is above the 95th percentile of the normal data, it can be deemed ‘exposed’. Remember: when comparing a sample to a study, context is key!
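The comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is_above_reference, the evenly spaced reference values, and the two test concentrations are all hypothetical, and the 95th percentile is used as the cut-off simply because it is the one discussed in this post.

```python
import statistics


def is_above_reference(sample_value, reference_values, percentile=95):
    """Return (flag, threshold): whether a sample exceeds the chosen
    percentile of a reference ("normal") group, and the cut-off used.

    Hypothetical helper for illustration; not taken from any guideline.
    """
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be an integer from 1 to 99")
    cuts = statistics.quantiles(reference_values, n=100, method="inclusive")
    threshold = cuts[percentile - 1]
    return sample_value > threshold, threshold


# Hypothetical reference group: 100 evenly spaced concentrations
# from 0.1 to 10.0 (arbitrary units). Real reference data would come
# from a control population or from background samples.
reference = [0.1 * i for i in range(1, 101)]

high_flag, threshold = is_above_reference(9.9, reference)
typical_flag, _ = is_above_reference(5.0, reference)

print(f"95th percentile of reference group: {threshold:.3f}")
print(f"Sample at 9.9 above reference? {high_flag}")
print(f"Sample at 5.0 above reference? {typical_flag}")
```

The flag only has meaning relative to the reference group you pass in. Swap in a different control population or a different set of background samples and the same concentration may land on the other side of the line, which is the whole point about context.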
Bringing It All Together: Your Data, Your Questions
So, when it comes to interpreting chemical data, there’s a lot more to it than picking a single number and calling it a day. Each dataset tells its own story, and the percentiles and quantiles we choose help frame that story accurately. Whether we’re monitoring background levels in the environment or assessing exposure risks in people, the way we slice the data can reveal critical insights—or lead us down the wrong path if we’re not careful.
Curious about how to apply these insights to your own data? Or maybe you have a data conundrum you just can’t crack? I’m all ears. Feel free to reach out—I'm always up for a deep dive into the data and would be happy to help you make sense of it all!