Sunday, January 28, 2018

Ramblings on tree data and when a lot is too much

I've been thinking a lot about tree canopy spread lately.

My tree growth model, Vida, does an OK job at predicting tree canopy spread, but it could be better. The main impediment is that, when I initially built the model, the Abies alba data I was using for validation had trunk diameter (DBH), above-ground woody mass, tree height, and photosynthetic tissue mass. I had to make some simplifying assumptions in order to derive canopy width.

Yesterday I read a paper by Pretzsch et al called "Crown size and growing space requirement of common tree species in urban centres, parks, and forests." It used several thousand measurements of DBH to predict canopy spread for 22 different species, one of which was Abies alba. Unfortunately, the data set wasn't part of the supplemental material, but an email and a few hours later, I had access to 37,882 paired DBH and canopy spread measurements. In case you are wondering what that looks like in log-log space:



The equation for that best fit line, without any corrections is

Canopy radius in meters = 0.7120 * (DBH^0.5447)

That exponent, 0.5447, is the slope of the best fit line in log-log space. In general, what people looking at allometric relationships have found is that the slope of the line for a given relation (canopy radius vs DBH, for example) tends to remain constant across species. I could give you the 95% confidence intervals and the r² (which was not great), but this isn't a scientific paper.
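For anyone curious about the mechanics, fitting that kind of power law really is just a straight-line fit in log-log space. Here's a minimal sketch with made-up stand-in data (I'm not posting the Pretzsch data, and the variable names are mine):

import numpy as np

# Stand-in data shaped like the real thing: a power law plus lognormal noise.
rng = np.random.default_rng(0)
dbh_cm = rng.uniform(5, 120, 1000)
crown_radius_m = 0.71 * dbh_cm**0.54 * rng.lognormal(0.0, 0.2, 1000)

# A straight line in log-log space: log(r) = log(a) + b*log(DBH)
b, log_a = np.polyfit(np.log(dbh_cm), np.log(crown_radius_m), 1)
print(f"crown radius ~ {np.exp(log_a):.4f} * DBH^{b:.4f}")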

A lot of what I do deals with looking at large data sets and looking for simple (usually allometric) patterns. In one way, I am a "macroecologist" in that I look at large scales and at top-down relationships of whole systems. At the same time, I try to figure out what the underlying rules are that result in those patterns in individual-based, bottom-up ways.

That blob of data is, I think, illustrative of a potential pitfall in the analysis of large data sets. To illustrate my point, I can plot just the Abies alba data.


Sort of a similar blob, with a slope of 0.4200. Now I could just go with this relationship, but I know there is a problem here. Specifically, the volume of data is hiding important information.

Imagine this situation: you have two trees of the same species. One is grown on the rain-shadow side of a mountain, and the other is grown on a well-watered plain. The environment each tree experiences is very different. If you take measurements of both trees for, say, 100 years, and combine them, you lose details about the relationship in the noise that gets generated.

 

On the left, we see every entry from the Cannell dataset ("World forest biomass and primary production data", 1982). On the right we see data from a specific plantation of Abies alba monitored for 95 years from within that dataset. With the large set of data, it's difficult to tell whether there are changes in growth, or whether we're just seeing noise. On the right, it's clear that a simple line fit in log-log space doesn't fit the data. There are actually two relationships.


There is a log-log relationship between height and DBH (slope close to unity) for young trees, but after they transition to sexual maturity, the relationship between height and DBH becomes lin-log. This is something that would be hard to catch in a large mass of data, especially since different species start producing propagules at different times, and there are variations within the same species depending on environmental conditions.
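If you wanted to capture both regimes, one crude way to do it (this is just a sketch; the break point and the function name are mine) is to split the data at a transition DBH and fit each piece with its own functional form:

import numpy as np

def fit_two_regimes(dbh, height, dbh_break):
    """Fit young trees as a power law (log-log) and mature trees as lin-log."""
    dbh, height = np.asarray(dbh, float), np.asarray(height, float)
    young = dbh < dbh_break
    # Young: log(height) = log(a) + b*log(DBH)
    b_young, log_a_young = np.polyfit(np.log(dbh[young]), np.log(height[young]), 1)
    # Mature: height = c + d*log(DBH)
    d_mature, c_mature = np.polyfit(np.log(dbh[~young]), height[~young], 1)
    return (np.exp(log_a_young), b_young), (c_mature, d_mature)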

Back to the canopy spread dataset. Looking at the Abies alba canopy spread data, I think I can see something hidden. What if I break the data up by collection site? There are 18 different collection sites, but I'll just show two.
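For what it's worth, the split itself is nothing fancy in code; here's a sketch with a hypothetical file name and made-up column names:

import numpy as np
import pandas as pd

df = pd.read_csv("abies_alba_crowns.csv")   # hypothetical export of the dataset
for site, grp in df.groupby("site"):
    b, log_a = np.polyfit(np.log(grp["dbh_cm"]), np.log(grp["crown_radius_m"]), 1)
    print(f"{site}: crown radius ~ {np.exp(log_a):.3f} * DBH^{b:.3f}")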

Thursday, May 19, 2016

Thoughts on allometry and datasets

Imagine you know nothing about humans, but you want to predict the adult height of human males living in the United States of America in 2014, using their mass as the independent variable.

Luckily for you, you happen to have a really great dataset of primate height and mass. You plot height vs mass, get a best fit line equation, and you can finish your dissertation, right?

Oh, but wait. The dataset is for primates. It includes data from Homo erectus, Homo neanderthalensis, chimpanzees, gorillas, anatomically modern humans, etc. No problem. You remove everything but anatomically modern humans from your analysis, fit your line, plenty of time to finish and grab a beer before—

Oops. That dataset includes females. You remember reading somewhere that anatomically modern humans are sexually dimorphic. Remove all the female data from the analysis, fit a line and—

Hold on. This data is worldwide and not just for the United States of America. You're pretty sure you read that there are differences in height between different countries. You strip out all the data not related to the United States of America, fit a line and....

------------

You get the idea. I'm not going to belabour the point I'm trying to make. In this example you could question whether the data you have is for human males in the US in 2014, or maybe it's data from 1820 (which would matter, since people have gotten taller as nutrition has improved over time), or maybe there are racial differences you don't know about that are not indicated in the data.

Each of those best fit lines was accurate, but totally dependent on the scale you are interested in.

This is something I think about a lot. I engage in things that could be called "macroecology", which tends to look at large-scale, whole systems and look for top-down relationships. At the same time, I always try to understand what the underlying bottom-up causes of the large-scale patterns are. How does the individual tree result in the structure of a forest? How does human metabolism result in trade networks? It can be hard to parse datasets to get at what is needed, particularly if the data was gathered in such a way that it ignores attributes that might be very important to your question, but were not important to the collector. To use the above example: what year does the human data come from?

That is all.

Wednesday, March 16, 2016

Decimal marks

I have been terribly sick recently. When not dealing with a fever, having a hard time breathing, or just passing out from tiredness, I've been working on a review paper looking at the history of allometry. One of the figures I want to include is from Louis Lapicque’s 1907 paper Tableau général des poids somatiques et encéphaliques dans les espèces animales ("General picture of body and brain weight in animal species"). The figure is interesting in a number of ways. First, it's a figure. The creation and publication of figures in 1907 was a pretty unusual move. Most of the time, data was presented in tables because it was simply too time-consuming to create and then reproduce figures in publications. Second, the figure might be the first one to graphically show how (what we now call) allometric relationships appear linear when plotted in log-log space.

The copyright has expired on the original 1907 paper and figure, so there wouldn't be an impediment to reprinting the original figure, but I was curious whether I could recreate it from the data provided in the appendix of the paper. In the process, I discovered a small error related to commas and decimal points. In the original 1907 figure, there is a datapoint for the European conger, listed in the figure using the French name "Congre" (lower right *). But when the data from the appendix was replotted, the conger datapoint was missing. A new point appeared to the left, however (Figure 1b, red *).

Figure 1a: Showing the 1907 plotted data for Conger vulgaris
Figure 1b: Showing the re-plotted data for Conger vulgaris

After confirming that I had transcribed the numbers correctly, I started looking closer at the figure and the raw data. So what's going on? Most likely, confusion about whether a "," is a thousands separator or a decimal mark.

The original data table that has the data in question looks like



A large part of the world uses a "," as a decimal mark. In the United States, if we want to numerically write out ten-thousand, we would say "10,000". To write the same thing in France, it would appear as "10.000". To be clear, for the rest of this entry, I will use "." as a decimal mark, so when I say "99.99" I mean 99 and 99/100.

Here, we see that the data for the conger is written as [10.000 1.05]. The paper is in French, and other examples in the table show that the decimal mark is a comma, which would mean the values are 10,000 and 1,050.

But wait. "1.05" doesn't make sense as 1,050. It would have been written as "1.050" if that was what was meant. I think what we are seeing here is a mixed decimal mark error. The point illustrated with the red * in Figure 1b matches [10, 1.05]. That would mean the eels have a body mass of 10 g and a brain mass of 1.05 g. A pretty remarkable eel, and also clearly wrong. Similarly, if we decide that the actual data should be [10,000 1,050], we have an eel with a body mass of 10,000 g and a brain mass of 1 kg...putting the brain mass close to that of a human. Again, extraordinary.

The actual plotted data corresponds to [10,000 1.05], which passes the sanity test of an eel weighing 10,000 g and having a brain weighing 1.05 g. Thus, a 109-year-old typo is resolved.
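If you want to see the ambiguity in action, here's a little Python sketch (the helper functions are mine; all they do is normalize each convention before handing the string to float()):

def read_continental(s):
    """Treat "." as a thousands separator and "," as the decimal mark."""
    return float(s.replace(".", "").replace(",", "."))

def read_us(s):
    """Treat "," as a thousands separator and "." as the decimal mark."""
    return float(s.replace(",", ""))

for raw in ("10.000", "1.05"):
    print(raw, "-> continental:", read_continental(raw), "| US:", read_us(raw))

# 10.000 -> continental: 10000.0 | US: 10.0
# 1.05   -> continental: 105.0   | US: 1.05

Neither convention alone makes sense of that row of the table; only the mixed reading does.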

So kids, double check your decimal marks, lest you crash things into Mars.

The reworking of the old data has been illuminating, and has given me a better appreciation of the tools I use on a regular basis. In redoing the figure, I have identified minor errors in the locations of points in the original figure, and can see that some of Lapicque’s lines are off. For example, the line fit to the blue whale in the upper right corner just peters out near the datapoint for "antelope", but the recreated figure shows it falling on a line he had drawn for lions, pumas, and house cats. Weird grouping, but whatever. A type of robin (10) also falls on the same line as swans (6), mallards (7), and the garganey (a duck, 8)...which makes much more sense.

Time to have a coughing fit and lay down.

An afterword

Completely coincidentally, there is an odd point on the original plot that is vaguely in the same area as the incorrectly plotted [10, 1.05] data. I had initially thought Lapicque plotted something in exactly the same way my recreation had, with a point at [10, 1.05].
Figure 3: What is this?
Now I'm left wondering: is this an artifact of how the figure was created or copied, or a data point that was removed from the analysis? If it's a datapoint, it looks like it would fall near the lines for either monkeys and tamarins, or gibbons and orangutans. Maybe a pygmy tarsier?


Saturday, February 27, 2016

Civilization V and bottom-up vs top-down modeling

I've recently begun looking at Civilization V again. Specifically, I have been looking at Civ V: Brave New World because (for reasons I won't go into here) I feel vanilla Civ V and Civ V: Gods and Kings are simply unplayable. I haven't been playing Civ V, really. I've been watching AI-only games and thinking about the game and simulations in general.

In my everyday job, I think about simulations a lot. Specifically, I think about how complex patterns arise from simple rules and the interactions of individual actors. There are general rules in how seemingly complex systems reach stability, and many of these rules are due to bottom-up interactions. Previous iterations of the Civilization game franchise have felt like bottom-up simulations to me: the size of your empire was partly a function of how rapidly you could move your troops (which could be altered by building roads or railroads, upgrading units, etc.). If you couldn't defend your borders, some AI would take cities from you until you reached a point where you could maintain your borders.[1] In more recent versions of the Civilization games, the concept of "city happiness" played a role in that an unhappy city could choose to "flip" and join a different civilization.

As I have been watching Civ V AI-only "games"[2], I've been wondering why I'm just not able to feel as immersed in them. I still play Civ IV: Beyond the Sword, so I haven't lost my love of the genre. Maybe part of why I am not taken with Civ V is that some of the game elements are top-down limitations rather than bottom-up emergent properties.

Civ V introduced the concept of civilization-wide happiness. If you have negative happiness in your civilization, that slows (or can stop) population growth in cities, slows production, and results in combat penalties for your troops.


The red text and the little angry face show sources of unhappiness in this civilization. Note that there is a total of nine unhappiness for cities, and 36 unhappiness for population. Just to be clear, the entire civilization is unhappy that there are too many cities. Quoting from a guide:

Each City you found will produce 3 Unhappiness, and +1 per unit of Population. So, founding a City will immediately produce 4 Unhappiness. As it grows, it will produce more and more Unhappiness, +1 per new Citizen.
This is a great example of top-down regulation of a simulation's behaviour. This game mechanic effectively means that the maximum size of a civilization is limited unless the player or AI works very hard at building things that promote happiness, having good relations with city-states (a special type of AI in the game), getting certain luxuries, etc.
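Just to make the top-down cap concrete, here's a toy version of that base rule, using only the numbers from the guide quoted above (the real game layers luxuries, buildings, policies, and difficulty modifiers on top of this):

def base_unhappiness(city_populations):
    """3 unhappiness per city founded, plus 1 per citizen."""
    return 3 * len(city_populations) + sum(city_populations)

# Three cities of size 12, 13, and 11: 9 from cities + 36 from population
print(base_unhappiness([12, 13, 11]))   # 45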

To be clear, I understand this is just a game. As a game mechanic, this sort of thing works well and fixes a problem that older versions of Civilization had. Specifically, the emergence of a Superpower in the game. In older versions of the game, the more cities a player/AI had, the more troops it could produce, and the more cities it could conquer, so it could produce more troops....[3]

The top-down control on civilization size has a drawback: it removes the organic feel of a spreading group of humans. Humans have spread to every corner of the globe that can support life. We didn't stop because we had too many tribes. If you set up an AI-only game with a single civilization and no winning conditions, that AI will expand to a certain size and then just stop, leaving the rest of the game-world empty.

What would make Civ V more of a bottom-up simulation? In Civ 2, there was a game mechanic where, if you managed to capture a more powerful civilization's capital and it was unable to instantly build a capital in another city, civil war would happen and the AI you were fighting would become two different AI controlled civilizations. As far as I know there was nothing similar in Civ III or Civ IV, but there was a mod made for Civ IV that allowed unhappy cities to revolt and form a brand new civilization.

Thanks to the modding community, there are mods for Civ V that accomplish similar things, though the rebellious cities form City States, which are special AI civilizations called Minor Civilizations that do not make settlers, so they do not expand their territory. From my perspective, it would be more interesting if the cities become new Major Civilizations, but at least there is a mod that allows Minor Civilizations to build and use settlers so they can expand.

It has me wondering whether a combination of

  1. Removing happiness penalties for founding/having cities (mod)
  2. Using the Revolutions mod to allow individual cities that are unhappy or influenced by other civilizations to rebel and either start their own civilization or join another civilization
  3. The City State Settlers mod to allow those newly generated Minor Civilizations to make settlers and expand
would produce an AI-only game where a single starting AI ends with the game-world completely settled by a diverse set of civilizations. Many from one. Ex uno multa? My very own Tower of Babel simulation.


[1] This is not entirely accurate, since what actually happens is a sort of armed border stand-off or a crushing defeat where a massive stack of units overwhelms the enemy positions.
[2] Let's be honest and just call them "simulations".
[3] In Civ I, this was slightly offset by the need to feed your troops. Food for your troops came from the city each unit was produced in. This was a fun mechanic that I miss. You could get rid of troops by "interrupting their supply line": attacking the city that was feeding them, destroying its ability to harvest food, or capturing the city. Later versions required that you pay money to support troops, but a successful economy let you field pretty big armies...sort of like in real life.

Tuesday, January 12, 2016

Playing with colour conversions

I'm posting this in case other people experience problems with using python's colorsys and can't figure out how to get values they expect.

In my tree simulation software, the graphical options allow for 2d and 3d graphical formats. When I wrote the code, I had been tinkering with CFDG a lot, so I tied into it and used it to generate the graphics. At some point I need to revisit that and use something else to generate the graphics, but here's an example of what the 2d stuff looks like.


In this case the top panel is a top down view of the simulated forest, and the bottom panel is a side view to give you an idea of what the heights look like. The relative darkness of the green colour of each canopy represents how shaded that particular canopy is. Bright green means that canopy is receiving full sunlight, and a completely black canopy would be receiving no light at all. The latter never happens since the trees die before that point.

An example 3d view looks like this.
The spherical "warts" are visually oversized propagules. This particular scene is set in Ithaca, NY at around 06:30 and the summer solstice.

CFDG uses HSV for defining colours, but the 3d output is a dxf AutoCAD file, meaning the colours are coded with an AutoCAD colour index (ACI). Right now the colours of propagules, canopies, and trunks are hard-coded in the code that generates the dxf file, but I was working on the code so I could generate prolate spheroids for canopies, and thought I could change that so the dxf would use the same colour the species was defined to have.

The plan was to convert HSV into RGB, and then do some colour distance maths to pick the closest ACI value. Python has a colorsys module that lets one convert RGB to/from HSV, but as I used it, it did not behave in ways I expected.

I define canopies as having an HSV colour of 117°, 100%, 100%. An online colour picker I looked at said the RGB would be 13, 255, 0. Here is where the wat with colorsys began.

The colorsys module wants input values between 0 and 1, and gives back values in the same range. Since "H" in HSV is in degrees, you need to take the H value and divide it by 360. Your % values get divided by 100, so you end up using 0.325, 1.0, 1.0 like:

colorsys.hsv_to_rgb(0.325, 1.0, 1.0)

which gives you 0.04999, 1.0, 0.0. Now you need to multiply each value of that tuple by 255 to get 12.75, 255, 0...which is close enough to 13, 255, 0.
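To save myself from doing that bookkeeping by hand every time, I wrapped it in a little helper (the function name and the rounding choice are mine):

import colorsys

def hsv_deg_to_rgb255(h_deg, s_pct, v_pct):
    """Take H in degrees and S/V in percent, return 0-255 RGB integers."""
    r, g, b = colorsys.hsv_to_rgb(h_deg / 360.0, s_pct / 100.0, v_pct / 100.0)
    return tuple(round(c * 255) for c in (r, g, b))

print(hsv_deg_to_rgb255(117, 100, 100))   # (13, 255, 0)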

Now I need to make an RGB-to-ACI lookup table that measures colour distance and picks the ACI closest to the RGB given.
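The plan looks something like this: given a table mapping ACI codes to RGB triples (the three entries below are placeholders, not the full 255-colour table), pick the code with the smallest squared distance.

ACI_RGB = {1: (255, 0, 0), 3: (0, 255, 0), 5: (0, 0, 255)}   # placeholder table

def closest_aci(rgb):
    """Return the ACI code whose RGB is closest (Euclidean) to the given colour."""
    r, g, b = rgb
    return min(ACI_RGB, key=lambda i: (ACI_RGB[i][0] - r) ** 2
                                      + (ACI_RGB[i][1] - g) ** 2
                                      + (ACI_RGB[i][2] - b) ** 2)

print(closest_aci((13, 255, 0)))   # 3, with this placeholder table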

Monday, November 9, 2015

What the what? The Unity sculpture and GDT

I'll start with the punchline: HOLY SHIT THE "UNITY" SCULPTURE LOOKS LIKE THE GDT LOGO!






Way back in 1994, two close friends of mine discovered that, when we stood side-by-side, we made an interesting symbol:


Later on, we would use this symbol as the logo for a weekly publication called Gracies Dinnertime Theatre (GDT).[1][2] Over the years, one thing we tried to do within the publication was intentionally change the layout and presentation of material, including aspects of the logo. For the founders, the logo meant both "co-operation" and "change", the idea being that the simplest stable object is a tripod, but the lack of a level surface at the top of the symbol meant a lack of permanence. From 1994 until it stopped publishing in 2005, the GDT logo and how it was used had various incarnations, but always involved the three upright columns and an angled top piece.

April 1995 to November 1995
December 1995 to January 1996
February 1996 to March 1996, September 1999 to December 1999
December 1996 to May 1997
April 1996 to November 1996, September 1997 to May 1999
December 1999 to March 2000
March 2000 to May 2005

Despite having children and a respectable job, I look back on my time with GDT (1994 to 1998 as founder and editor, and as contributor into the 2000s) with fondness. For an upstart publication, we ended up doing things that still surprise and please me--like making up our own calendar that started on 16 July 1945 (the date of the first nuclear detonation). It seemed silly at the time but has gained support from fellow scientists.[3]

So when I first saw the "Unity" sculpture on the Rochester Institute of Technology campus, I was very surprised. The relationship between GDT and the administration of RIT was always poor: GDT was never an officially sanctioned publication and would often publish unflattering material about the college president, board members, and the school in general. To the administration, GDT was an embarrassment.

On the ground, however, GDT received a great deal of support from professors, particularly those in the College of Imaging Arts and Sciences (CIAS). GDT's first "advisor" when it applied for and received a Creative Arts grant was Bruce Sodervick, a sculpture professor, and one of GDT's founding members was in the ceramics department. GDT would also regularly hold pizza and coffee sales in the CIAS building to promote the publication and get a chance to talk with readers. To say that GDT had close ties to CIAS would be a bit of an understatement.

It should go without saying that the GDT logo was closely associated with the publication. We made shirts, stickers, even books that had the logo.

What am I to think when a GIANT GDT LOGO sculpture, created by two professors in the sculpture department, is erected on the RIT campus? I think it's pretty great! Thanks, guys!

Friday, January 23, 2015

Is there something to Briggs-Myers personality types?

I've been thinking a lot about personality recently. I'm not exactly sure what got me thinking about it, but it's been going on for a few months now. Since October, I think.

Regardless, part of my meditations have revolved around whether there is some truth behind Briggs-Myers (BM) (or any other) categorization. Put another way, to paraphrase a much more rigorously minded friend: are personality categorizations convincing because they are vague enough that someone can read a description and say "ZOMG that is totally me"?

I'm not going to answer that. I _will_ say that it was sort of creepy how dead-on the description of me at 16 Personalities was. I even had my wife read it, as an attempt at an objective outside opinion, and she laughed while reading it, saying again and again, "This is totally you."

So I can't answer whether there is just the magic of vague categories going on, but that's ok. What I was interested in was whether there are other ways to converge on a BM personality type.

One thing that occurred to me was to use software like what is at I Write Like (the github repository is at https://github.com/coding-robots/iwl). I threw a few writing samples at it--a long email, a previous blog entry, and some text from chatting online with a friend--to see what it would spit out. I specifically left out a sample of my scientific writing because that does not feel spontaneous or authentically me. The results looked like this:


  • Blog entry: I write like H. P. Lovecraft
  • Email: I write like David Foster Wallace
  • Online chat: I write like Stephen King


From there I did a quick search to see what people online thought these authors' BM types might be:


  • H. P. Lovecraft: INFP
  • David Foster Wallace: INTP or INFP
  • Stephen King: INTP


And here is the interesting bit for me: when I take the BM tests, I score as an IN(T/F)P. It is T/F because the T and the F have exactly the same score. It's neat to see that the writing assessment and the (admittedly non-rigorous) BM assignments of the matched writers lined up with my own BM results so well.

The implication is that, depending on the medium I choose to communicate in, I could come across as a different type. Email? INTP or INFP. Online chats? INTP. On a blog where I just spill my ideas? INFP.

Neat. Suggestive. Not at all something I am ready to say is Meaningful.