Thursday, May 19, 2016

Thoughts on allometry and datasets

Imagine you know nothing about humans, but you want to predict the adult height of human males living in the United States of America in 2014, using their mass as the independent variable.

Luckily for you, you happen to have a really great dataset of primate height and mass. You plot height vs mass, get a best fit line equation, and you can finish your dissertation, right?

Oh, but wait. The dataset is for primates. It includes data from Homo erectus, Homo neanderthalensis, chimpanzees, gorillas, anatomically modern humans, etc. No problem. You remove everything but anatomically modern humans from your analysis, fit your line, plenty of time to finish and grab a beer before—

Oops. That dataset includes females. You remember reading somewhere that anatomically modern humans are sexually dimorphic. Remove all the female data from the analysis, fit a line and—

Hold on. This data is worldwide, not just for the United States of America. You're pretty sure you read that average height differs between countries. You strip out all the data not related to the United States of America, fit a line and...

------------

You get the idea, so I'm not going to belabour the point. Even in this example you could question whether the data you have is really for human males in the US in 2014, or whether it's from 1820 (which would matter, since people have gotten taller as nutrition has improved over time), or whether there are racial differences you don't know about that are not indicated in the data.

Each of those best-fit lines was accurate for the data it was fitted to, but which one you want depends entirely on the scale you are interested in.
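To make that concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration (they are not real height or mass measurements), and the labels `group_a` and `group_b` are hypothetical stand-ins for any two subsets you might or might not think to separate. Each subset follows the same slope on its own, but fitting one line through everything gives a different answer.

```python
# Illustrative only: these numbers are made up, not real height/mass data.
# Each row is (hypothetical group label, mass in kg, height in cm).
rows = [
    ("group_a", 40.0, 150.0),
    ("group_a", 50.0, 155.0),
    ("group_a", 60.0, 160.0),
    ("group_b", 60.0, 170.0),
    ("group_b", 70.0, 175.0),
    ("group_b", 80.0, 180.0),
]


def fit_line(points):
    """Ordinary least-squares fit of height on mass.

    Takes a list of (mass, height) pairs and returns (slope, intercept).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def subset(data, group=None):
    """Return (mass, height) pairs, optionally restricted to one group."""
    return [(m, h) for g, m, h in data if group is None or g == group]


# Fit the same model at three different "scales" of the same dataset.
pooled = fit_line(subset(rows))
group_a = fit_line(subset(rows, "group_a"))
group_b = fit_line(subset(rows, "group_b"))

for name, (slope, intercept) in [
    ("everything", pooled),
    ("group_a only", group_a),
    ("group_b only", group_b),
]:
    print(f"{name}: height = {slope:.2f} * mass + {intercept:.1f}")
```

With these made-up numbers, each group on its own gives a slope of 0.5 cm per kg, while the pooled fit is steeper at 0.8, because the pooled line is also absorbing the offset between the two groups. None of the three lines is wrong. They just answer different questions.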

This is something I think about a lot. I engage in things that could be called "macroecology", which tends to look at large-scale, whole systems and search for top-down relationships. At the same time, I always try to understand the underlying bottom-up causes of the large-scale patterns. How does the individual tree result in the structure of a forest? How does human metabolism result in trade networks? It can be hard to parse datasets to get at what is needed, particularly if the data was gathered in a way that ignores attributes that might be very important to your question but were not important to the collector. To use the above example: what year does the human data come from?

That is all.