5 Million Records to Analyze?
Machine Learning to the Rescue!

Making sense of 50 years of BBS Data

First posted by Mike Fuller on October 5, 2017
Updated January 22, 2018

Machine Learning in Long-Term Research

Long-term research in ecology has deepened our understanding of the processes that govern natural systems (and arguably changed the lives of researchers -- see my book review in BioScience). It's no surprise that long-term projects can generate large amounts of data. This is both a boon and a curse: while a long record can reveal trends missed by short-term studies, piles of data also present challenges for analysis. Here, I provide an example of how machine learning methods can supplement traditional approaches to the analysis of long-term data.

The ultimate goal of ecology is to understand the underlying causes of species abundance and distribution. But to do that, we must first separate random noise from process-driven patterns. Therein lies the potential of Machine Learning: a powerful tool for uncovering predictive patterns in data.

Ecological systems are complex, constantly shifting entities, where patterns arise from the interplay of thousands of variables. To make sense of this complexity, ecologists have adopted the divide-and-conquer approach of hypothesis testing. The world is divided into experimental groups, defined by the question at hand. Want to know if climate-change-induced drought is changing vegetation patterns? Set up plots in a natural landscape, and add water to some plots but not others. Are species ranges shifting in response to climate change? Track abundance over time, effectively creating a moving window of "before" and "after" subgroups. To extract meaningful information, we impose artificial order on otherwise intractable chaos.

For this blog post, I consider the North American Breeding Bird Survey (BBS) program, which annually tracks the distribution and abundance of over 400 species of birds in the US and Canada. Inaugurated in 1966, the BBS has amassed over 5 million records on avian populations -- a trove of data that has been used to inform conservation practices and environmental policy. It is a famous example of citizen science, in which members of the public contribute to large scientific projects. Anyone can join in to help identify and count birds for the BBS.

The Moving Frontier of Data Analysis

As is the tradition in ecology, analysis of BBS data has relied heavily on statistical models. Approaches have mirrored larger trends in ecology, beginning with relatively simple frequentist approaches in the 80s and 90s, and moving to more sophisticated Bayesian approaches in the 2000s and beyond. These methods have proven effective for identifying species declines and landscape-scale trends in avian community structure.

Machine Learning is an alternative approach to analysis that has become the de facto method in the world of Big Data. How does Machine Learning differ from standard statistical approaches? Is it suitable for ecological analysis? What can it tell us that traditional methods can't?

A Tool for Long-Term Research

The divide and conquer approach works when there are clearly defined subgroups, which is why most research builds on the results of past studies; having established predictable patterns, we can construct narrowly defined questions that are amenable to hypothesis testing.

By contrast, the questions posed by long-term research are often broadly defined. This is intentional, as we expect surprise when following a complex system through time. Long-term research is by nature exploratory, not confirmatory. What's more, compared to short-term studies, we expect long-term data to be messy (more variable); given enough time, species abundances will fluctuate wildly. It's precisely this greater complexity, and the desire to uncover the drivers of both short and long term patterns, that motivates long-term research.

Where Machine Learning can contribute to this endeavor is in its ability to:

  • Discover previously unknown patterns of association
  • Reveal or verify temporal or spatial trends

The key strength of Machine Learning is its ability to identify hidden structure in large, messy data sets. That can mean finding previously unknown relationships among entities, or assigning entities to subgroups based on a large number of variables, where the sheer number of variables poses a barrier to standard analytical methods.

Citizen Science = More Data!

Quoting the Oxford Dictionary, the Citizen Science Association defines CitSci as:

“scientific work undertaken by members of the general public.”

Work in this context most often involves helping with data collection. By inviting the public to assist researchers in the field, CitSci accelerates the rate at which data can be gathered: we get more data in less time. It also transforms the culture of science by diversifying what it means to be a scientist. After all, data collection is fundamental to scientific research, and anyone who contributes to the process becomes an integral member of the team, no?

CitSci transforms research in other ways, too, not all of which are well understood. For one thing, we can no longer rely on the expectation that field workers have been trained by years of formal preparation. Now, it's true that you don't need a degree to be an expert -- years of experience can be a valid form of education. But what separates a degree from self-gained knowledge is that with the former, skills and knowledge are formally tested and validated. A degree is an objective measure of experience that is arguably a more reliable metric than the statement "I've been birding all my life".

Citizen Science = Good Data?

Does the quality of data from CitSci projects differ from that of conventional research? This can be difficult to assess. For the BBS, having decades' worth of data yielded the statistical power required to distinguish annual variation from true population declines. But how reliable are those data? Recent studies indicate there may be problems. For example, studies have revealed declines in observers' hearing ability over time, which affect the accuracy and completeness of species lists[1,2]. Understanding how differences among observers in birding skill influence estimates of bird abundance is crucial for drawing accurate conclusions about observed trends in bird populations.

Detecting Observer Bias

Before we can quantify differences in skill among observers, we need to separate observer effects from route effects. BBS counts are recorded for a combination of route + observer + year. We want to disentangle the route and year components of bias from the observer component. Which means we need data from multiple observers for a particular route.

It would be helpful if count data from two or more observers were recorded at the same time. Unfortunately, although simultaneous observers are permitted, the BBS database doesn't track their individual contributions -- only the sum total is recorded. As an alternative, we can quantify observer differences using single-observer records for a given route, looking for a bias between different observers that is consistent over time (i.e., a repeated-measures approach). That means we need to find routes surveyed by different people over time. Where might that be?

Lonesome Cowbirdboys?
Continental Trends in Birder Density

We are more likely to find multi-observer routes in regions that support many BBS observers. From 1966 to 2015, over 7,000 people participated in at least one BBS survey in the US. The map below shows the distribution of these observers. As might be expected, participation in the BBS program tends to follow human population density, with a distinct decreasing trend in BBS participation away from the coasts, and from east to west.

map of US observer density

But human density is not the whole story. Evidently, birding is most popular in the northeast, and least popular in Nevada! [Update: actually, now I think pop density explains the pattern pretty well! Ha!] The next step is to examine multi-observer routes for evidence of observer bias.

Can We Quantify Birding Skills?

Having found where to look for routes with multiple observers, we next require a measure of birding ability. Now, it's one thing to compare species counts of a given person over time[1,2]. But how do you quantify differences between people?

One approach is to estimate observer error rates for bird identification. For example, one study[3] asked birders to listen to recordings of bird calls and songs, recorded from BBS survey routes, and report all the species they heard. The study revealed an average error rate of 14 percent, with some participants missing nearly 40 percent of recorded calls and songs.

These results suggest that, given suitable data (recorded bird calls, in this case), we can indeed measure differences between birders. Unfortunately, audio recordings are not available for more than a handful of BBS routes. To understand the influence of observer bias more generally, we must find a different approach -- one that can exploit the vast base of records established by the BBS.

Shannon Entropy as a Metric
for Birding Ability

Here, I argue that Shannon Entropy provides an objective measure of birder ability. Most ecologists are familiar with Shannon Entropy (H) as an index of species diversity. H is calculated from the proportions of different species in a sample. It is widely used to compare different communities, or to assess changes in community structure over time. But Shannon's original purpose for H was as a relative measure of the information content of a signal, and that makes it a great metric for comparing birder skills.
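
To make this concrete, here is a minimal sketch of computing H from a list of species counts (the species and counts are made up for illustration):

    import numpy as np

    def shannon_entropy(counts):
        """Shannon entropy H = -sum(p * ln p) over species proportions."""
        counts = np.asarray(counts, dtype=float)
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log(p)).sum())

    # hypothetical counts for each species recorded on one route in one year
    print(shannon_entropy([12, 3, 7, 1, 1]))  # more even counts -> higher H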

Think of the calls, songs, and sightings of the birds at a site as a signal, which encodes information about which birds are present. Now consider an observer as an imperfect signal recorder, whose error rate varies randomly from person to person (here, age could be considered a non-random covariate; we will get to that later). Just as with the audio-recordings study, two observers may generate different species lists based on the same signal.

Differences in error rates could be due to differences in hearing ability, knowledge of local species, or general identification skill. Weather conditions on a given day, such as noise from high winds or low light due to dark clouds, could also influence an observer's error rate on that day. But on average, if error rates differ among observers, their species lists will show consistent differences, too. Assuming each person's error rate (and how it changes with age or experience) is relatively constant, then on average, differences in H between observers who are working the same sites provide a good indication of relative birding skill.

Refining our Comparisons with
Dynamic Time Warping

poster from Back to the Future

Dynamic Time Warping sounds so cool, right? So what the heck is it? DTW is a machine learning method used to compare two or more time series. In a nutshell, it calculates the degree to which one time series must be altered ("warped") to make it align with a second time series. It provides a measure (the "distance") of how different two time series are from each other. For example, if you used DTW to compare a time series to itself, the distance would be zero; it's identical to itself. The greater the distance between two time series, the more "different" they are. And DTW is not limited to time series; you can apply it to any type of numerical series. For example, biologists are using it to compare DNA sequences. Pretty cool, eh?
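
To make the idea concrete, here is a minimal sketch of the classic dynamic-programming form of DTW (a real analysis would more likely use an optimized library implementation, but the logic is the same; the example series are invented):

    import numpy as np

    def dtw_distance(a, b):
        """Classic dynamic-programming DTW distance between two 1-D series."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])              # local difference
                cost[i, j] = d + min(cost[i - 1, j],      # stretch series a
                                     cost[i, j - 1],      # stretch series b
                                     cost[i - 1, j - 1])  # match
        return float(cost[n, m])

    # two made-up yearly entropy (H) series of different lengths
    h_obs  = [2.1, 2.3, 2.2, 2.5, 2.4]
    h_meta = [2.2, 2.4, 2.3, 2.6]
    print(dtw_distance(h_obs, h_obs))   # 0.0 -- identical series
    print(dtw_distance(h_obs, h_meta))  # > 0 -- some warping needed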

Earlier, we decided to use Shannon entropy to compare observers. But here's the thing: most BBS observers have surveyed a particular route over a period of years. In other words, for many observers, we have multiple values of H to work with. This is good. More data is always good! A time series of H arguably permits a more accurate measure of performance than a single value can.

Entangled Data:
Observers vs Routes

As mentioned above, our biggest challenge in comparing BBS observers is to separate the observer from the route. That is, each observer's time series on a route reflects not only the observer's skill, but also the route's characteristics. How can we separate the two? If only we had an independent measure of the expected entropy of each route ...

Actually, we do! We have data from nearby routes. The combined species list of all routes in a local area represents an unbiased sample from the bird metacommunity. The metacommunity is the source of species and individuals found on any given route. As such, it can serve as the expected bird community for each route in the area. Here, we assume that individual birds are free to move around within the metacommunity. If true, the species relative proportions found on a route at any given moment reflect their proportions in the metacommunity. As you might guess, this is not a new idea[4].

Admittedly, the metacommunity is not a perfect predictor of a given route's diversity. Each local community can be expected to vary to some degree from the metacommunity. Still, the metacommunity is a logical choice, which is why its use as a predictor of local diversity is broadly accepted by community ecologists.
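
As a rough sketch of the idea (the route names and counts below are invented), pooling the counts from all routes in a local area gives the expected species proportions -- and hence the expected H -- for any one route:

    from collections import Counter

    # hypothetical single-year counts from two routes in the same local area
    route_counts = {
        "route_A": {"Song Sparrow": 12, "American Robin": 7, "Wood Thrush": 2},
        "route_B": {"Song Sparrow": 9, "American Robin": 11, "Ovenbird": 4},
    }

    # pool all routes to approximate the metacommunity
    meta = Counter()
    for counts in route_counts.values():
        meta.update(counts)

    total = sum(meta.values())
    meta_proportions = {sp: n / total for sp, n in meta.items()}
    # meta_proportions gives the expected relative abundances against which
    # each individual route (and observer) can be compared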

Measuring Relative Performance

We now have a reasonable method for assessing an observer's skill, relative to other observers. If an observer's repeated survey results are consistently higher or lower than the metacommunity average (representing the survey average of her peers), this would suggest she is biased: consistently over-counting or under-counting local species, relative to observers on neighboring routes.

To recap: our approach is to compare the time series for a given observer to that of the metacommunity, using DTW distances as a relative metric of how well the observer's survey results agree with the expected results.

Choosing a Machine Learning Approach

With our comparison metric in hand, it's time to choose a specific method for analysis. Machine Learning is not a single approach -- as a mature discipline, it encompasses many different methods. And as with conventional statistics, one often has a choice of several possible methods for a given data type and problem.

Supervised vs Unsupervised Learning

Our goal is to uncover any differences in skill among BBS observers that might influence conclusions about the status of bird populations. In Machine Learning terms, this could be viewed as either a Classification problem or a Clustering problem. Classification generally uses a supervised learning approach, while clustering often takes an unsupervised learning approach.

In either case, we generally have a set of variables (features) associated with each record, which we use to assign the observations to discrete groups. For example, data on individual birds could include a set of features relevant to their activity, such as habitat type, time of day, nest height, etc. We could use machine learning to identify groups of species based on their activity patterns, which might tell us something about their ecological niches.

So what's the difference? Briefly, in supervised learning, observations or records (i.e. our data) have been previously labeled with an identity; a type or value. We use the labels to "supervise" how our data are grouped into classes (e.g. yes/no, positive/negative, state of residence, etc). For example, data on individual birds could have species names as labels. With unsupervised learning, our data are unlabeled; we're not yet sure how to classify them. In that case, we could apply a machine learning method to identify natural groups, based on similarities in their features. We can then use this information to label each record, and then apply a supervised learning technique for further analysis.

K-Means Clustering

There are many possible approaches to follow here, but for this example, we'll use something called k-means clustering, or k-means for short. K-means is an example of unsupervised learning. We'll use a set of features that are potentially relevant to observer skill, and see if the algorithm can find natural groups, or clusters.

Another similar-sounding technique, k-nearest neighbors (k-NN), is a supervised learning approach often used with dynamic time warping. Here, we could use either; k-means is a tool for exploring the "pattern space" of data (which is our first goal in this case), whereas k-NN is more often used to classify a new data point, based on its feature space. Also, k-means has certain heuristic properties that make it better suited to extracting a process model. Later, we'll use those properties to construct a linear model, based on the outcome of k-means.

Features: Something to hang your hat on

We are using k-means to group observers into discrete categories of birding skill, based on the time series of their bird diversity estimates. But we don't just want to know if observers form natural groups; ultimately we want insight into the factors that differentiate observers. The more relevant variables we include as input data, the more informative the output will be. In Machine Learning parlance, variables are called features.

We may already have a pretty good idea of what influences an observer's skill level. For example, we might suspect that older, more experienced observers would be better birders. Of course, we have to draw our potential features from the data we have. In the case of the BBS dataset, there are a number of possible features that might influence birding skill. Based on an earlier exploration of the data, here are four features that are potentially important:

  1. dtw.mean = Average DTW distance
  2. RteProc = Number of unique BBS routes processed
  3. TSrange = Number of years of experience doing BBS surveys
  4. meanDuration = Average time spent processing a BBS route

This then is the feature space for our model.

Setting Up the Model

There is only one parameter to set for the k-means model: k, the number of clusters in the output. Choosing a value for k is more art than science, but there are some rules of thumb. For example, if you believed bird behavior was different for males and females, you might start with just two categories for a multi-feature data set. Generally, modelers choose a value for k that is proportional to the size of the feature space.

Since we chose four features for our model, we'll start with four clusters (k = 4), and then alter k depending on the patterns revealed by the k-means procedure.
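
Here is a sketch of how that setup might look using scikit-learn; the feature matrix below is random placeholder data standing in for one row per observer with the four features listed above, standardized so every feature is on a comparable scale:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # placeholder data: one row per observer, columns = the four features
    rng = np.random.default_rng(42)
    observer_features = rng.normal(size=(500, 4))

    X = StandardScaler().fit_transform(observer_features)  # z-score each feature

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)      # centroid of each cluster (standardized units)
    print(np.bincount(km.labels_))  # number of observers in each cluster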

Results of K-means with 4 Clusters

Here are the results of the four-cluster model. The centroids of the clusters provide a measure of the influence of each feature on the four clusters of observers:

             dtw.mean RteProc TSrange meanDuration
          1  -0.3120 -0.0744 -0.3437      -0.8119
          2  -0.1998 -0.3051 -0.3827       1.9478
          3  -0.2733 -0.2796 -0.3744       0.3577
          4   1.0437  0.7201  1.3509      -0.2650

We see that dtw.mean, our main measure of performance, is much higher in the fourth cluster, relative to the others. This suggests that observers do differ by that metric. By the same metric, clusters 1, 2, and 3 are relatively similar to each other. Other features, such as TSrange, are also similar across clusters 1, 2, and 3. This shows we have not had much success differentiating the observers by anything other than dtw.mean, which implies we have too many clusters.
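
One way to check is to re-fit the model over a range of k values and compare a separation metric such as the silhouette score; here is a sketch, again using placeholder data in place of the real feature matrix:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # X: standardized observer-by-feature matrix (placeholder, as above)
    rng = np.random.default_rng(42)
    X = StandardScaler().fit_transform(rng.normal(size=(500, 4)))

    for k in (2, 3, 4, 5):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(f"k={k}: inertia={km.inertia_:.1f}, "
              f"silhouette={silhouette_score(X, km.labels_):.3f}")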

After exploring cluster sizes of 3 (not shown), it eventually became clear that a k of 2 yielded the most separation in feature space:

              dtw.mean RteProc  TSrange  meanDuration
          1  -0.3134   -0.2217  -0.4364   0.1185
          2   0.8602    0.6086   1.1978  -0.3253

Here we can see a strong separation across the feature space, especially for dtw.mean and TSrange. Based on these results, these two features yielded the most informative measures of observer differences. Let's plot all four to get a visual assessment:

box plots of four patterns

The figure confirms our assessment: dtw.mean and TSrange show the greatest separation between clusters (Years in Service refers to length of time performing surveys). The pattern shows that the time series of older observers (those with more years doing BBS surveys) deviate more from the regional (metacommunity) average. In other words, birding performance seems to decline with age.

By contrast, the average amount of time spent conducting a survey (Mean Duration) does not differ much between the clusters. There is greater variation among older observers in the number of routes processed; not surprising as the longer one participates in the BBS program, the more routes one can potentially process.

We can verify the age effect by partitioning TSrange into discrete time periods, and plotting dtw.mean by time period:

box plots of dtw distance over time

Here, the time periods on the x-axis were chosen to yield 10-year intervals; these are the empirical periods that fall within those intervals. We see a lot of residual variation in the difference from mean time series (y-axis). Nevertheless, the data show a definite trend of declining performance (i.e. increasing dtw distance) with observer age.
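
For reference, here is a sketch of how the binning and box plot could be produced; the column names mirror the features above, and the data are placeholders rather than the actual BBS-derived values:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # placeholder: one row per observer, years of service and mean DTW distance
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "TSrange": rng.integers(1, 50, size=500),
        "dtw_mean": rng.normal(size=500),
    })

    # partition years of service into roughly 10-year periods
    df["period"] = pd.cut(df["TSrange"], bins=[0, 10, 20, 30, 40, 50])

    df.boxplot(column="dtw_mean", by="period")
    plt.xlabel("Years of BBS service")
    plt.ylabel("Mean DTW distance from metacommunity")
    plt.show()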

More to follow ...


  1. Farmer et al. 2014. Observer aging and long-term avian survey data quality. Ecology and Evolution 4: 2563-2576.
  2. Kelling et al. 2015. Can Observation Skills of Citizen Scientists Be Estimated Using Species Accumulation Curves? PLoS One 10: e0139600.
  3. Campbell, M. and C.M. Francis. 2011. Using Stereo-Microphones To Evaluate Observer Variation In North American Breeding Bird Survey Point Counts. The Auk 128: 303-312.
  4. Hubbell, S. 2001. The Unified Neutral Theory of Biodiversity and Biogeography. Princeton Univ. Press.

image of grassland #2

"Trips to fairly unknown regions should be made twice:
once to make mistakes and once to correct them."
- John Steinbeck