Thursday, May 19, 2016

Thoughts on allometry and datasets

Imagine you know nothing about humans, but you want to predict the adult height of human males living in the United States of America in 2014, using their mass as the independent variable.

Luckily for you, you happen to have a really great dataset of primate height and mass. You plot height vs mass, get a best fit line equation, and you can finish your dissertation, right?

Oh, but wait. The dataset is for primates. It includes data from Homo erectus, Homo neanderthalensis, chimpanzees, gorillas, anatomically modern humans, etc. No problem. You remove everything but anatomically modern humans from your analysis, fit your line, plenty of time to finish and grab a beer before—

Oops. That dataset includes females. You remember reading somewhere that anatomically  modern humans are sexually dimorphic. Remove all the female data from the analysis, fit a line and—

Hold on. This data is worldwide and not just for the United States of America. You're pretty sure you read that there are differences in height between different countries. You strip out all the data not related to the United States of America, fit a line and....

------------

You get the idea. I'm not going to belabour the point I'm trying to make. In this example you could question whether the data you have is for human males in the US in 2014, or maybe it's data from 1820 (which would matter since people have gotten taller as nutrition has improved over time), or maybe there are racial differences you don't know about that are not indicated in the data.

Each of those best fit lines was accurate, but totally dependent on the scale you are interested in.
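
A minimal sketch of the thought experiment, assuming a tiny table of made-up placeholder records (the numbers are invented for illustration and are not real measurements). The field names and the helpers `fit_line` and `fit_subset` are hypothetical. The same least-squares fit is run on narrower and narrower subsets, and each one gives its own line:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder records: invented values, NOT real data.
records = [
    {"species": "gorilla", "sex": "M", "country": "n/a", "mass_kg": 160.0, "height_cm": 170.0},
    {"species": "chimpanzee", "sex": "F", "country": "n/a", "mass_kg": 40.0, "height_cm": 120.0},
    {"species": "human", "sex": "F", "country": "USA", "mass_kg": 60.0, "height_cm": 162.0},
    {"species": "human", "sex": "F", "country": "Other", "mass_kg": 55.0, "height_cm": 158.0},
    {"species": "human", "sex": "M", "country": "Other", "mass_kg": 65.0, "height_cm": 168.0},
    {"species": "human", "sex": "M", "country": "Other", "mass_kg": 75.0, "height_cm": 174.0},
    {"species": "human", "sex": "M", "country": "USA", "mass_kg": 80.0, "height_cm": 175.0},
    {"species": "human", "sex": "M", "country": "USA", "mass_kg": 90.0, "height_cm": 179.0},
]


def fit_subset(records, **criteria):
    """Fit height on mass using only the records that match every criterion."""
    rows = [r for r in records if all(r[k] == v for k, v in criteria.items())]
    slope, intercept = fit_line([r["mass_kg"] for r in rows],
                                [r["height_cm"] for r in rows])
    return slope, intercept, len(rows)


# Each successive filter narrows the scale of the question being asked.
for criteria in ({},
                 {"species": "human"},
                 {"species": "human", "sex": "M"},
                 {"species": "human", "sex": "M", "country": "USA"}):
    slope, intercept, n = fit_subset(records, **criteria)
    print(criteria, f"n={n} slope={slope:.3f} intercept={intercept:.1f}")
```

Attributes the collector never recorded, such as the year, cannot be filtered on at all, which is the hard part.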

This is something I think about a lot. I engage in things that could be called "macroecology", which tends to look at large-scale, whole systems and look for top-down relationships. At the same time, I always try to understand what the underlying bottom-up causes of the large-scale patterns are. How does the individual tree result in the structure of a forest? How does human metabolism result in trade networks? It can be hard to parse datasets to get at what is needed, particularly if the data was gathered in such a way that it ignores attributes that might be very important to your question, but were not important to the collector. To use the above example: what year does the human data come from?

That is all.

Wednesday, March 16, 2016

Decimal marks

I have been terribly sick recently. When not dealing with a fever, having a hard time breathing, or just passing out from tiredness, I've been working on a review paper looking at the history of allometry. One of the figures I want to include is one from Louis Lapicque’s 1907 paper Tableau général des poids somatiques et encéphaliques dans les espèces animales ("General picture of body and brain weight in animal species"). The figure is interesting in a number of ways. First, it's a figure. The creation and publication of figures in 1907 was a pretty unusual move. Most of the time, data was presented in tables because it was simply too time-consuming to create and then reproduce figures in publications. Second, the figure might be the first one to graphically show how (what we now call) allometric relationships appear linear when plotted in log-log space.
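
For readers who have not met the idea, here is the standard power-law form behind that last remark. The symbols are generic choices for illustration (brain mass $E$, body mass $M$, constants $a$ and $b$) and are not taken from Lapicque's notation:

```latex
E = a\,M^{b}
\quad\Longrightarrow\quad
\log E = \log a + b \log M
```

Taking logarithms of both sides turns the power law into a straight line with slope $b$ and intercept $\log a$, which is why such data look linear on log-log axes.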

The copyright has expired on the original 1907 paper and figure, so there wouldn't be an impediment to reprinting the original figure, but I was curious whether I could recreate it from the data provided in the appendix of the paper. In the process, I discovered a small error related to commas and decimal points. In the original 1907 figure, there is a datapoint for European conger, listed in the figure using the French name "Congre" (lower right *). But when the data from the appendix was replotted, the conger datapoint was missing. A new point appeared to the left, however (Figure 1b, red *).

Figure 1a: Showing the 1907 plotted data for Conger vulgaris
Figure 1b: Showing the re-plotted data for Conger vulgaris

After confirming that I had transcribed the numbers correctly, I started looking closer at the figure and the raw data. So what's going on? What might be going on is confusion about whether a "," is a thousands separator or a decimal mark.

The original data table that has the data in question looks like



A large part of the world uses a "," as a decimal mark. In the United States, if we want to numerically write out ten thousand, we would write "10,000". To write the same thing in France, it would appear as "10.000". To be clear, for the rest of this entry, I will use "." as a decimal mark, so when I say "99.99" I mean 99 and 99/100.

Here, we see that the data for the conger is written as [10.000 1.05]. The paper is in French, and other examples in the table show the decimal mark is a comma, which means the values are 10,000 and 1,050.

But wait. "1.05" doesn't make sense as 1,050. It would have been written as "1.050" if that was what was meant. I think what we are seeing here is a mixed decimal mark error. The point illustrated with the red arrow in Figure 1b matches [10, 1.05]. That would mean the eels have a body mass of 10g and a brain mass of 1.05g. A pretty remarkable eel, and also clearly wrong. Similarly, if we decide that the actual data should be [10,000 1,050], we have an eel with a body mass of 10,000g and a brain mass of about 1kg...putting the brain mass close to that of humans. Again, extraordinary.

The actual plotted data corresponds to [10,000 1.05], which passes the sanity test of an eel weighing 10,000g and having a brain weighing 1.05g. Thus, a 109-year-old typo is resolved.
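
The ambiguity can be shown with a small strict parser. This is a sketch only: `parse_number` is a hypothetical helper written for this post, not part of any library. It assumes that digit groups after a thousands separator must contain exactly three digits. Under that rule, "1.05" is simply not a valid number when "," is the decimal mark, which is the clue that the table mixes conventions.

```python
import re


def parse_number(text, decimal_mark):
    """Parse a number string, given which character is the decimal mark.

    The other character ("." or ",") is treated as the thousands separator,
    and every separated group must contain exactly three digits.
    """
    if decimal_mark not in (",", "."):
        raise ValueError("decimal_mark must be ',' or '.'")
    group_mark = "." if decimal_mark == "," else ","
    grouped = r"\d{1,3}(?:" + re.escape(group_mark) + r"\d{3})+"
    pattern = r"(?:" + grouped + r"|\d+)(?:" + re.escape(decimal_mark) + r"\d+)?"
    if not re.fullmatch(pattern, text):
        raise ValueError(f"{text!r} is not valid with {decimal_mark!r} as the decimal mark")
    return float(text.replace(group_mark, "").replace(decimal_mark, "."))


body, brain = "10.000", "1.05"  # the conger entry as printed

# Reading everything with "." as the decimal mark gives the mis-plotted point.
print(parse_number(body, "."), parse_number(brain, "."))

# Reading the body mass the French way gives the point Lapicque plotted.
print(parse_number(body, ","), parse_number(brain, "."))

# Reading the brain mass the French way fails outright.
try:
    parse_number(brain, ",")
except ValueError as err:
    print("rejected:", err)
```

A stricter check like this would flag the entry rather than silently turning it into a 10g eel.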

So kids, double check your decimal marks, lest you crash things into Mars.

The reworking of the old data has been illuminating, and has given me a better appreciation of the tools I use on a regular basis. In redoing the figure, I have identified minor errors in the locations of points in the original figure, and can see that some of Lapicque’s lines are off. For example, the line fit to the Blue Whale in the upper right corner just peters out near the datapoint for "antelope", but the recreated figure shows it to fit a line he had drawn for Lions, Pumas, and house cats. Weird grouping, but whatever. A type of robin (10) also fits on the same line as swans (6), mallards (7), and the garganey (a duck, 8)...which makes much more sense.

Time to have a coughing fit and lie down.

An afterword

Completely coincidentally, there is an odd point on the original plot that is vaguely in the same area as the incorrectly plotted [10, 1.05] data. I had initially thought Lapicque plotted something in exactly the same way my recreation had, with a point at [10, 1.05].
Figure 3: What is this?
Now I'm left wondering: is this an artifact of how the figure was created or copied, or a data point that was removed from the analysis? If it's a datapoint, it looks like it would fall near lines for either monkeys and tamarins, or gibbons and orangutans. Maybe a pygmy tarsier?


Saturday, February 27, 2016

Civilization V and bottom-up vs top-down modeling

I've recently begun looking at Civilization V again. Specifically, I have been looking at Civ V: Brave New World because (for reasons I won't go into here) I feel vanilla Civ V and Civ V: Gods and Kings are simply unplayable. I haven't been playing Civ V, really. I've been watching AI-only games and thinking about the game and simulations in general.

In my everyday job, I think about simulations a lot. Specifically, I think about how complex patterns arise from simple rules and the interactions of individual actors. There are general rules in how seemingly complex systems reach stability, and many of these rules are due to bottom-up interactions. Previous iterations of the Civilization game franchise have felt like bottom-up simulations to me: the size of your empire was partly a function of how rapidly you could move your troops (which could be altered by building roads or railroads, upgrading units, etc). If you couldn't defend your borders, some AI would take cities from you until you reached a point that you could maintain your borders.[1] In more recent versions of the Civilization games, concepts of "city happiness" played a role in that an unhappy city could choose to "flip" and join a different civilization.

As I have been watching Civ V AI-only "games"[2], I've been wondering why I just wasn't able to feel as immersed in the games. I still play Civ IV: Beyond the Sword, so I haven't lost my love of the genre. Maybe a part of why I am not taken with Civ V is that some of the game elements are top-down limitations rather than bottom-up emergent properties.

Civ V introduced the concept of civilization-wide happiness. If you have negative happiness in your civilization, that slows (or can stop) population growth in cities, slows production, and results in combat penalties for your troops.


The red text and the little angry face show sources of unhappiness in this civilization. Note that there is a total of nine unhappiness for cities, and 36 unhappiness for population. Just to be clear, the entire civilization is unhappy that there are too many cities. Quoting from a guide:

Each City you found will produce 3 Unhappiness, and +1 per unit of Population. So, founding a City will immediately produce 4 Unhappiness. As it grows, it will produce more and more Unhappiness, +1 per new Citizen.
This is a great example of top-down regulation of a simulation's behaviour. This game mechanic effectively means that the maximum size of a civilization is limited unless the player or AI works very hard at building things that promote happiness, maintaining good relations with city states (a special type of AI in the game), acquiring certain luxuries, etc.
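
The rule quoted from the guide is simple enough to write down. This is a sketch of that rule only: the function name and constants are made up here, and it ignores any other modifiers the game may apply. The three-city split of 36 population is a hypothetical breakdown that reproduces the 9 and 36 described above.

```python
UNHAPPINESS_PER_CITY = 3      # from the quoted guide
UNHAPPINESS_PER_CITIZEN = 1   # from the quoted guide


def base_unhappiness(city_populations):
    """Return (unhappiness from cities, unhappiness from population)."""
    from_cities = UNHAPPINESS_PER_CITY * len(city_populations)
    from_population = UNHAPPINESS_PER_CITIZEN * sum(city_populations)
    return from_cities, from_population


# A freshly founded city with one citizen costs 3 + 1 = 4.
print(sum(base_unhappiness([1])))

# Hypothetical split: three cities of 12 citizens each.
print(base_unhappiness([12, 12, 12]))
```

Nothing in the calculation depends on what is happening inside any one city. The penalty is a civilization-wide tally, which is what makes it a top-down limit.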

To be clear, I understand this is just a game. As a game mechanic, this sort of thing works well and fixes a problem that older versions of Civilization had. Specifically, the emergence of a Superpower in the game. In older versions of the game, the more cities a player/AI had, the more troops it could produce, and the more cities it could conquer, so it could produce more troops....[3]

The top-down control on civilization size has a drawback: it removes the organic feel of a spreading group of humans. Humans have spread to every corner of the globe that can support life. We didn't stop because we had too many tribes. If you set up an AI-only game with a single civilization and no winning conditions, that AI will expand to a certain size and then just stop, leaving the rest of the game-world empty.

What would make Civ V more of a bottom-up simulation? In Civ 2, there was a game mechanic where, if you managed to capture a more powerful civilization's capital and it was unable to instantly build a capital in another city, civil war would happen and the AI you were fighting would become two different AI-controlled civilizations. As far as I know there was nothing similar in Civ III or Civ IV, but there was a mod made for Civ IV that allowed unhappy cities to revolt and form a brand new civilization.

Thanks to the modding community, there are mods for Civ V that accomplish similar things, though the rebellious cities form City States, which are special AI civilizations called Minor Civilizations that do not make settlers, so they do not expand their territory. From my perspective, it would be more interesting if the cities became new Major Civilizations, but at least there is a mod that allows Minor Civilizations to build and use settlers so they can expand.

It has me wondering whether a combination of

  1. Removing happiness penalties for founding/having cities (mod)
  2. Using the Revolutions mod to allow individual cities that are unhappy or influenced by other civilizations to rebel and either start their own civilization or join another civilization
  3. The City State Settlers mod to allow those newly generated Minor Civilizations to make settlers and expand
would produce an AI-only game where a single starting AI leads to the game-world being completely settled by a diverse number of civilizations. Many from one. Ex uno multa? My very own Tower of Babel simulation.


[1] This is not entirely accurate, since what actually happens is a sort of armed border stand-off or crushing defeat where a massive stack of units overwhelms the enemy positions.
[2] Let's be honest and just call them "simulations".
[3] In Civ I, this was slightly offset by the need to feed your troops. Food for your troops came from the city they were produced in. This was a fun mechanic that I miss. You could get rid of troops by "interrupting their supply line" by attacking the city that was feeding them, destroying its ability to harvest food, or capturing the city. Later versions required that you pay money to support troops, but a successful economy let you field pretty big armies...sort of like in real life.

Tuesday, January 12, 2016

Playing with colour conversions

I'm posting this in case other people experience problems with using Python's colorsys and can't figure out how to get the values they expect.

In my tree simulation software, the graphical options allow for 2d and 3d graphical formats. When I wrote the code, I had been tinkering with CFDG a lot, so I tied into it and used it to generate the graphics. At some point I need to revisit that and use something else to generate the graphics, but here's an example of what the 2d stuff looks like.


In this case the top panel is a top-down view of the simulated forest, and the bottom panel is a side view to give you an idea of what the heights look like. The relative darkness of the green colour of each canopy represents how shaded that particular canopy is. Bright green means that canopy is receiving full sunlight, and a completely black canopy would be receiving no light at all. The latter never happens since the trees die before that point.

An example 3d view looks like this.
The spherical "warts" are visually oversized propagules. This particular scene is set in Ithaca, NY at around 06:30 on the summer solstice.

CFDG uses HSV for defining colours, but the 3d output is a dxf AutoCAD file, meaning the colour code uses an AutoCAD colour index (ACI). Right now the colours of propagules, canopies, and trunks are hard coded in the code that generates the dxf file, but I was working on the code so I could generate prolate spheroids for canopies, and thought I could change that so the dxf would use the same colour as the species was defined to have.

The plan was to convert HSV into RGB, and then do some colour distance maths to pick the closest ACI value. Python has a colorsys module that lets one convert RGB to/from HSV, but as I used it, it did not behave in ways I expected.

I define canopies as having a HSV colour of 117°, 100%, 100%. An online colour picker I looked at said the RGB would be 13, 255,0. Here is where the wat with colorsys began.

The colorsys module wants input values to be between 0 and 1, and will give you corresponding values. Since "H" in HSV is in degrees, you need to take the H value and divide it by 360. Your % values get divided by 100, so you end up using 0.325, 1.0, 1.0 like:

colorsys.hsv_to_rgb(0.325, 1.0, 1.0)

which gives you 0.04999, 1.0, 0.0. Now you need to multiply each value of that tuple by 255 to get 12.75, 255, 0...which rounds to 13, 255, 0.
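
Wrapped up as a helper, the scaling on the way in and out looks something like this. The name `hsv_to_rgb255` is made up for this sketch; only `colorsys.hsv_to_rgb` comes from the standard library.

```python
import colorsys


def hsv_to_rgb255(h_deg, s_pct, v_pct):
    """Convert HSV given as (degrees, %, %) to integer RGB in the 0-255 range."""
    # colorsys expects every input in the range 0..1.
    r, g, b = colorsys.hsv_to_rgb(h_deg / 360.0, s_pct / 100.0, v_pct / 100.0)
    # colorsys returns floats in 0..1, so scale and round to whole channel values.
    return tuple(round(c * 255) for c in (r, g, b))


print(hsv_to_rgb255(117, 100, 100))  # (13, 255, 0)
```

Rounding rather than truncating is a choice made here so that the result matches the online colour picker; truncating would give 12 for the red channel.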

Now I need to make an RGB to ACI lookup table that measures colour distance and picks an ACI close to the RGB given.
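
One possible shape for that lookup, as a rough sketch rather than the finished thing. It covers only the seven basic ACI colours (1 to 7) and not the full 255-entry palette. It assumes plain squared Euclidean distance in RGB, which is just one of several ways to measure colour distance. The names `BASIC_ACI` and `nearest_aci` are hypothetical.

```python
# The seven basic AutoCAD colour indices and their RGB values.
BASIC_ACI = {
    1: (255, 0, 0),      # red
    2: (255, 255, 0),    # yellow
    3: (0, 255, 0),      # green
    4: (0, 255, 255),    # cyan
    5: (0, 0, 255),      # blue
    6: (255, 0, 255),    # magenta
    7: (255, 255, 255),  # white
}


def nearest_aci(rgb, table=BASIC_ACI):
    """Return the ACI whose RGB value is closest to rgb (squared Euclidean distance)."""
    def distance_squared(index):
        return sum((a - b) ** 2 for a, b in zip(rgb, table[index]))
    return min(table, key=distance_squared)


print(nearest_aci((13, 255, 0)))  # 3, i.e. green
```

Passing a fuller table as the `table` argument would extend the same search to the rest of the palette.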