Sunday, January 28, 2018

Ramblings on tree data and when a lot is too much

I've been thinking a lot about tree canopy spread lately.

My tree growth model, Vida, does an OK job at predicting tree canopy spread, but it could be better. The main impediment is that, when I initially made the model, the Abies alba data I was using for validation had trunk diameter (DBH), above-ground woody mass, tree height, and photosynthetic tissue mass, but no canopy width. I had to make some simplifying assumptions in order to derive canopy width.

Yesterday I read a paper by Pretzsch et al. called "Crown size and growing space requirement of common tree species in urban centres, parks, and forests." It used several thousand measurements of DBH to predict canopy spread for 22 different species, one of which was Abies alba. Unfortunately, the data set wasn't a part of the supplemental data, but one email and a few hours later I had access to 37,882 paired measurements of DBH and canopy spread. In case you are wondering what that looks like in log-log space:



The equation for that best-fit line, without any corrections, is

Canopy radius in meters = 0.7120 * DBH^0.5447

That exponent, 0.54467 (rounded to 0.5447 above), is the slope of the best-fit line in log-log space. In general, people studying allometric relationships have found that the slope for a given relation (canopy radius vs. DBH, for example) tends to remain constant across species. I could give you the 95% confidence intervals and the r² (which was not great), but this isn't a scientific paper.
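For anyone who wants to see how a power-law exponent falls out of a straight-line fit in log-log space, here is a minimal sketch in plain Python. It is not the code I used for the fit. The function name `fit_power_law` and the DBH values are made up for illustration, and the radii are generated from the rounded equation rather than taken from the real data set.

```python
import math

def fit_power_law(x, y):
    """Fit y = a * x**b by ordinary least squares on (ln x, ln y).

    Returns (a, b). The slope of the line in log-log space is the exponent b,
    and exp(intercept) is the coefficient a.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Hypothetical DBH values (not real measurements), with noise-free radii
# generated from the rounded best-fit equation in the text.
dbh = [5.0, 10.0, 20.0, 40.0, 80.0]
radius = [0.7120 * d ** 0.5447 for d in dbh]

a, b = fit_power_law(dbh, radius)
print(f"coefficient a = {a:.4f}, exponent b = {b:.4f}")
```

Because the synthetic radii contain no noise, the fit simply recovers the coefficient and exponent that generated them. Real data would scatter around the line, which is where the confidence intervals and r² come in.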

A lot of what I do involves examining large data sets for simple (usually allometric) patterns. In one way, I am a "macroecologist" in that I look at large scales and at top-down relationships of whole systems. At the same time, I try to figure out what the underlying rules are that result in those patterns in individual-based, bottom-up ways.

That blob of data is, I think, illustrative of a potential pitfall in analysis of large data sets. To illustrate my point, I can plot just the Abies alba data.


Sort of a similar blob, with a slope of 0.4200. Now I could just go with this relationship, but I know there is a problem here. Specifically, the volume of data is hiding important information.

Imagine this situation: You have two trees of the same species. One is grown on the rain shadow side of a mountain, and the other is grown on a well-watered plain. The environment each tree experiences is very different. If you take measurements of both trees for, say, 100 years, and combine them, you lose details about the relationship in the noise generated.
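That thought experiment can be roughed out with a toy example. Every number below is invented: I assume both trees follow the same exponent (0.6) but the dry-site tree has a smaller coefficient and never gets as thick as the well-watered one. The helper `loglog_slope` is just an illustrative name.

```python
import math

def loglog_slope(x, y):
    """Least-squares slope of ln(y) against ln(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    return sxy / sxx

# Invented measurements: same exponent at both sites, different coefficients,
# and the well-watered tree reaches a larger DBH over the same period.
dry_dbh = [5.0, 10.0, 20.0, 40.0]
wet_dbh = [20.0, 40.0, 80.0, 160.0]
dry_radius = [0.5 * d ** 0.6 for d in dry_dbh]
wet_radius = [1.0 * d ** 0.6 for d in wet_dbh]

dry_slope = loglog_slope(dry_dbh, dry_radius)
wet_slope = loglog_slope(wet_dbh, wet_radius)
pooled_slope = loglog_slope(dry_dbh + wet_dbh, dry_radius + wet_radius)

print(f"dry site slope:  {dry_slope:.3f}")
print(f"wet site slope:  {wet_slope:.3f}")
print(f"pooled slope:    {pooled_slope:.3f}")
```

Each tree on its own gives back the slope of 0.6, but the pooled fit comes out noticeably steeper, because the line has to bridge two offset clouds of points. Neither tree actually grows along the pooled line.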

 

On the left, we see every entry from the Cannell dataset ("World forest biomass and primary production data", 1982). On the right we see data from a specific plantation of Abies alba monitored for 95 years from within that dataset. With the large set of data, it's difficult to tell whether there are changes in growth, or whether we're just seeing noise. On the right, it's clear that a simple line fit in log-log space doesn't fit the data. There are actually two relationships.


There is a log-log relationship between height and DBH (slope close to unity) for young trees, but after they transition to sexual maturity, the relationship between height and DBH becomes lin-log. This is something that would be hard to catch in a large mass of data, especially since different species start producing propagules at different ages, and there are variations within the same species depending on the environmental conditions.
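One way to hunt for that kind of switch in a single-site series is to try every possible split point, fit a log-log line below it and a lin-log line above it, and keep the split with the smallest total squared error in height. The sketch below is only one possible approach, not how I found the transition. I am reading "lin-log" as height being linear in ln(DBH), which is an assumption, and the data are synthetic: slope 1 up to a DBH of 18, then height = 20 + 10 * ln(DBH / 20). The names `ols` and `fit_two_regimes` are hypothetical.

```python
import math

def ols(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x

def fit_two_regimes(dbh, height, min_points=3):
    """Scan split points; fit log-log below the split and lin-log above it."""
    pairs = sorted(zip(dbh, height))
    best = None
    for k in range(min_points, len(pairs) - min_points + 1):
        young, mature = pairs[:k], pairs[k:]
        # Young trees: ln(height) = b * ln(DBH) + ln(a)
        b, ln_a = ols([math.log(d) for d, _ in young],
                      [math.log(h) for _, h in young])
        # Mature trees (assumed form): height = m * ln(DBH) + c
        m, c = ols([math.log(d) for d, _ in mature],
                   [h for _, h in mature])
        # Compare both segments in the same units (height) so errors add up.
        sse = sum((h - math.exp(ln_a) * d ** b) ** 2 for d, h in young)
        sse += sum((h - (m * math.log(d) + c)) ** 2 for d, h in mature)
        if best is None or sse < best["sse"]:
            best = {"last_young_dbh": young[-1][0], "loglog_slope": b,
                    "linlog_slope": m, "linlog_intercept": c, "sse": sse}
    return best

# Synthetic, noise-free series (invented numbers, not the Cannell data).
young_dbh = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
mature_dbh = [25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
dbh = young_dbh + mature_dbh
height = [d for d in young_dbh] + [20.0 + 10.0 * math.log(d / 20.0) for d in mature_dbh]

result = fit_two_regimes(dbh, height)
print(result)
```

On clean synthetic data the scan lands on the true split. With real measurements the error curve will be much flatter around the transition, so I would treat the chosen split as a rough estimate rather than the age of maturity.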

Back to the canopy spread dataset. Looking at the Abies alba canopy spread data, I think I can see something hidden. What if I break the data up by collection site? There are 18 different collection sites, but I'll just show two.