Fanout versus Confusion: two complexity metrics

Posted on July 12, 2019 by Rick Jelliffe

In electrical engineering, fan-out means the number of logic-gate inputs that a given logic gate (or technology) can drive. Early computers used diode logic, which allowed a fan-out of 0: you could only have fan-in (the number of inputs a gate has). Modern technologies allow fan-outs potentially in the thousands.

This idea was taken up as a software complexity metric: in functional and procedural systems it is the number of other functions or procedures that a given function or procedure calls; in class-based object systems it is the number of classes referenced by a given class. At one stage an attempt was made to count the number of decision points as a kind of fan-out, which came a cropper because sometimes high fan-out is a sign of high organization (such as a large case statement that matches the problem well).

Can we use fan-out as a metric for a structured document, or a document corpus? Yes, and below are some possibilities. But the more interesting question is: is this useful? I think that requires more thought.

Fanout Distribution Graph

Let's imagine that instead of a single number, our metric is a graph, with fan-out on the x-axis (say from 0 to 100) and on the y-axis a count of all the elements in the document or corpus that have that fan-out, where fan-out is the count of unique child elements. (There are other possible counts too.) This gives us a snapshot of something: for example, we can imagine that many collections of HTML pages would look like this, with not many different kinds of subelements at any stage. (LibreOffice's axis labels are offset, so what looks like 1 is actually 0: grrr.)
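To make the idea concrete, here is a minimal sketch (mine, not from any existing tool) of how that fanout distribution could be computed over an XML document with Python's standard library, where an element's fan-out is the number of unique child element names it has:

```python
# Sketch: fanout distribution for an XML document.
# Fanout of an element instance = number of *unique* child element names.
from collections import Counter
import xml.etree.ElementTree as ET

def fanout_distribution(root):
    """Map each fanout value to how many element instances have that fanout."""
    dist = Counter()
    for elem in root.iter():
        fanout = len({child.tag for child in elem})
        dist[fanout] += 1
    return dist

# Tiny illustrative document, not from the article
doc = ET.fromstring(
    "<book><chapter><p>one</p><p>two</p><note>n</note></chapter>"
    "<chapter><p>three</p></chapter></book>"
)
print(dict(fanout_distribution(doc)))
# leaf elements have fanout 0; both chapters count once under their fanouts
```

Plotting `dist` as a bar chart, fanout on x and element count on y, gives exactly the kind of snapshot described above.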

At another extreme, we could imagine that a data-oriented fixed format, where there are a lot of records that are just required sequences of the same elements, would look like this:

We might see that a complex document set that has highly specific markup for a particular domain and very varied documents might look like this (with several thousand at 0):

And from that we can tell that this is richly marked up, and there are many elements with quite a few different kinds of markup.

Regularity Graph

What this cries out for is to know whether these are all the same bunch of elements or some variety. To find out, we could add a count to each bar: the total number of unique elements for that fan-out. If the count of unique elements is the same as the fan-out count, we have a highly regular document (such as the second chart); the more the count exceeds the bar label (it cannot be less, by definition, since we are counting unique elements), the more variation there is.

So what do we make of these numbers? We can see that the elements with a low fan-out did not vary in the elements they contained. We can see that elements with a fan-out of 4 (it looks like 5 due to the chart labelling problem) had a count of 50 different elements: there was a very high degree of variability there. And the elements with a fan-out of 7, 8 or 9 all had 15 or 16: maybe this indicates that half had one set and half had another. But there is a stability there.
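This regularity count can be sketched the same way: for each fanout value, take the union of child-element names over every element with that fanout. If the union's size equals the fanout, all such elements share the same child set. (This is my reading of the description above; the example document is invented.)

```python
# Sketch: per-fanout regularity count.
# buckets maps fanout -> union of child-name sets over all elements
# with that fanout; len(union) == fanout means perfect regularity.
from collections import defaultdict
import xml.etree.ElementTree as ET

def regularity(root):
    buckets = defaultdict(set)
    for elem in root.iter():
        names = {child.tag for child in elem}
        buckets[len(names)] |= names
    return {fanout: len(union) for fanout, union in buckets.items()}

doc = ET.fromstring("<r><a><x/><y/></a><a><x/><y/></a><b><x/><z/></b></r>")
print(regularity(doc))
# fanout 2 has a union of 5 distinct names (a, b, x, y, z): high variation
```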

Potential Fanout Distribution Chart

While we are thinking of graphs, we can make one that counts each element found in our corpus but, instead of adding the actual count of unique elements found, counts the unique elements possible according to the schema or DTD. This will (in a closed schema) always be greater than or equal to the fan-out count.

This is the same data as last time, but the yellow shows what, for the data in this document or corpus, we might have had to cope with according to the schema. This chart does not show anything on a per-element basis, but it avoids the problem of showing/graphing/explaining combinatorial explosion, which people just don't get: this is a more linear number that perhaps reflects work a bit better. What it shows us is how big a dialect of the schema our actual document or corpus uses. It has the virtue of being able to be graphed on the same axes as an actual fan-out graph, for comparison. In most cases for schemas, we can expect some large count to have a large occurrence: think of rich text with semantic elements added: there might easily be (say) 17 unique elements which can potentially appear in any situation. I know of kitchen-sink schemas that have literally scores of elements.
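A sketch of the actual-versus-potential comparison, assuming we have already extracted a content model from the schema or DTD by hand or by tooling (the `ALLOWED` table and element names below are hypothetical, not from any real schema):

```python
# Sketch: compare actual fanout against potential fanout per element name.
# ALLOWED is a hypothetical hand-extracted content model: element name ->
# child elements the schema permits.
import xml.etree.ElementTree as ET

ALLOWED = {
    "section": {"title", "p", "note", "list", "table"},
    "title": set(),
    "p": set(),
}

def actual_vs_potential(root):
    """Map element name -> (children actually used, children the schema allows)."""
    used = {}
    for elem in root.iter():
        used.setdefault(elem.tag, set()).update(child.tag for child in elem)
    return {tag: (len(children), len(ALLOWED.get(tag, set())))
            for tag, children in used.items()}

doc = ET.fromstring("<section><title>t</title><p>a</p><p>b</p></section>")
print(actual_vs_potential(doc))
# section uses 2 of its 5 allowed children: a small dialect of the schema
```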

From a work perspective, the greater the divergence, the more the schema may be unnecessarily complex, ill-suited or confusing: prone to allowing the wrong markup. A chart showing such a strong divergence would be useful information when customizing a structured editor: you know what you need to trim (or you can decide whether the elements that are never used in context would actually be useful: maybe the schema gets it right and the conversion or markup team got it wrong!)

Is Fanout Useful?

So now we have these three graphs, we come to the question: are they useful? Of course, you can say any metric is only useful if it tells you something you did not expect. So let's rephrase our question as: will these charts ever show us something we are not expecting?

It can allow document sets to be characterized: simple like the HTML; regular like the records; highly marked up with a large number of elements like a document; with a lot of internal consistency between elements; or using the schema maximally or minimally.

I think they can, and they can give a general confirmation of the state of the system. This joins things like counts of XPaths, maximum path lengths, optionality counts and so on as information that provides boundaries, validates assumptions (or the wild claims of data owners!) and may raise alarm bells. You could conceivably add fan-out as a factor in development effort, costings, tasking or risk for a project.

Is that good enough? I mean to say, if we need a tool to tell us we are dealing with a giant schema that is not actually used much in the document set, or that we are translating fixed records, we haven't done our cursory homework on understanding the dataset. Fan-out metrics can give an idea of the scale of an issue, but not so much whether the issue exists. So I am not convinced that fan-out is compelling or reliable, even though it is objective, can be presented nicely, and can be used as a tool in document analysis. Someone may change my mind, but it does not seem to be a magic bullet.

An Alternative Complexity Metric to Fanout: Confusion!

One of the key problems I see in the use of fan-out is that the work in a project (especially with XPath-based technologies, and indeed with the old OmniMark language) is often determined by how many element rules can be defaulted (so that every time an element occurs it can be treated by the same template), or by how much context-specific processing there is that is not merely the same general structure as some other element.

Your schemas and your documents may have all sorts of crazy fan-outs, but that does not correspond to workload if they are all handled the same way. At an extreme, if there is a one-to-one mapping from each element in one schema to an element in another, you can just make a single to-from table and process the document with that.
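The to-from table case really is that small, whatever the fan-outs involved. A sketch (with invented element names) of a whole "transformation" driven by one rename table:

```python
# Sketch: a one-to-one element-name mapping table is the entire transform
# when no context-specific processing is needed. Names are illustrative.
import xml.etree.ElementTree as ET

TO_FROM = {"doc": "html", "para": "p", "emph": "em"}  # source -> target names

def rename(elem):
    """Recursively rewrite element names via the to-from table, in place."""
    elem.tag = TO_FROM.get(elem.tag, elem.tag)
    for child in elem:
        rename(child)
    return elem

src = ET.fromstring("<doc><para>Hello <emph>world</emph></para></doc>")
result = ET.tostring(rename(src), encoding="unicode")
print(result)  # <html><p>Hello <em>world</em></p></html>
```

However wide or deep the document, the development effort here is the size of the table, not the fan-out.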

So finally this brings us to an alternative metric: what if the development work of a project is this:

     work to do = boilerplate + simply processed elements + complex processed elements

Simply processed elements need only trivial coding, such as where one element produces another element using only material in the immediate context, such as transferring attribute values in well-known ways.

In this case, we can estimate the complexity of a project if we can obtain input and output samples, then remove the elements for which we can easily write code to detect a simple relationship between the input and output documents (at the simplest, element X is always deleted, or element Y is always simply turned into element Z). Then the complexity is the count of elements we cannot find a simple relationship for. This is a nice metric because it does not require enormous smarts to calculate: you only have two buckets. I call this metric Confusion because it is the count of the elements where your system has no idea what is going on: some special processing is required. Do we need to count the dunnos, and make them a positive thing rather than a roadblock for project estimation?
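One crude detector for the rename-style relationship, sketched below on invented sample data: an input element is "simple" if some single output element's occurrence count tracks it exactly across every sample pair; everything else lands in the Confusion bucket. (This is only an illustration of the idea; detecting deletions and richer relationships would need more machinery.)

```python
# Sketch: a two-bucket Confusion counter over paired input/output samples.
# Each sample pair is (Counter of input tag counts, Counter of output tag counts).
from collections import Counter

def confusion_count(sample_pairs):
    """Return (number of confused input tags, the confused tags themselves)."""
    in_tags = set().union(*(set(inp) for inp, _ in sample_pairs))
    out_tags = set().union(*(set(out) for _, out in sample_pairs))
    simple = set()
    for tag in in_tags:
        # simple if some output tag's count matches this tag's in every sample
        if any(all(inp[tag] == out[cand] for inp, out in sample_pairs)
               for cand in out_tags):
            simple.add(tag)
    confused = sorted(in_tags - simple)
    return len(confused), confused

pairs = [
    (Counter(para=3, emph=1, fig=2), Counter(p=3, em=1, img=1)),
    (Counter(para=5, fig=1), Counter(p=5, img=2)),
]
print(confusion_count(pairs))  # fig has no consistent counterpart: confused
```

Here `para` and `emph` track `p` and `em` exactly, so only `fig` needs special attention: a Confusion of 1.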

Now, you could validate it by extracting the Confusion/Simplicity metric for the inputs and outputs of some past projects, then examining the XSLTs to categorize the templates (less boilerplate) as simple or complex: your Confusion metric tool is usable when it generates numbers in the same ballpark as the XSLTs' categorized templates. You can then use the metric to estimate work.

And if you can now apply this metric to the cost and time of developing that XSLT (where they are available), you can figure out the velocity of your team for simple and complex transforms. If these estimates can be fairly simply correlated with the Confusion/Simplicity metric (the assumption that unusual things are where project effort is needed seems plausible), then it becomes something plausible for project estimation too. The distinction between work and velocity is essential to estimation: this is the fundamental intuition of the Scrum methodology. Knowing the number of story points, function points, entities, etc. is useless for estimation unless you also know the velocity of the team in fully implementing one.