The scenario: you have a large corpus of traditional SGML/XML-type documents (semi-structured text, not database dumps), with schemas or DTDs that cope with the great variety of structures found in real life: the position independence, repetition and recursion that make enumerating all possible documents impractical due to combinatorial explosion.

You need to check that your transformations are working with some sample of documents. How many should you pick?  In this article, I suggest you need to be thinking in terms of thousands of documents, the more the merrier.

Your first fight will be to get past the naive boss or team who say “Oh, you just find some typical documents”. That is suitable for specific unit testing, but it gives no confidence that the transformation works properly.

Your next fight will be the desire to limit the number of test cases. If you say you want too many, your naive boss or team will say (probably for apparently good business reasons!) “Oh, don’t go overboard, the chances of each incremental test finding something new become vanishingly small, don’t they?” What that person may actually be concerned about is time: will it take weeks to run the tests over a corpus of millions of documents? Will it add substantially to the compilation time if the tests are part of the unit tests? Indeed, I have worked on projects where testing documents took longer than the sprint itself! There is a great answer to this: run these tests independently of the build (not as unit tests) on some cloud platform. Using the cloud is something that bosses consider good, and if you want you can scale up to run tests faster (though the IO issue of getting all your precious documents out into the cloud becomes a major problem, both for bandwidth and security).

So that leaves the question,

How many documents of a large varying corpus do I need to test to make sure everything gets converted correctly?  

We need to look at statistics for a cue.  And I say a cue, because the thing is that we have to start with some assumption or hypothesis about the nature and distribution of elements and structures in our corpus.  We don’t have one, undoubtedly.  (If you do, let me know, and why it would apply to other people’s corpuses too.)

So the best we can do is say “What kinds of numbers do we get using typical kinds of assumptions?”, and then make the imaginative leap that, from prudence, the onus needs to be on the detractors to prove that our document corpus is not akin to these assumptions. A rational feel is better than an irrational certainty, if you know what I mean.

Sometimes it is hard to understand the calculators and statistics. There are two different questions the statistics tend to answer:

  • How many documents do I need to sample to probably find at least one example of a particular structure?  This is the analysis problem.
  • How many documents do I need to sample so that my sample set will have the same kind of proportion of this feature as the corpus?  This is probably not an interesting number.

So there are two statistical methods I think are applicable.

The first uses the Binomial Probability Theorem (BPT). There is an online calculator here, and some relevant background here. In this case, we estimate how likely some structure is to occur in one of our documents, and we choose an acceptance percentage: the likelihood that our sample set will contain at least one document with that structure.

  • So let's say that we think half our documents have appendixes, and we want to make sure we have an average 99% chance of finding them.  According to the BPT, we need to test 6 or 7 documents (6 gives about a 98.4% chance, 7 gives about 99.2%).
  • Let's say we think our documents have a 1/1000 chance of having a table within a table.  According to the BPT, we need to test 4603 documents.

Note that this is only an average number, not a guaranteed number!  The more certainty you want, the greater the sample size.
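The arithmetic behind these numbers can be sketched in a few lines of Python. This is a minimal sketch, assuming each document independently contains the structure with the estimated incidence; the function names `docs_needed` and `chance_of_hit` are illustrative only, not taken from any calculator. The chance of at least one hit in n random documents is 1 - (1 - p)^n, so we solve for the smallest n that reaches the acceptance percentage. For the appendix example the exact answer falls between 6 and 7 documents, so rounding up gives 7; the table-in-table example comes out at 4603.

```python
import math


def chance_of_hit(incidence: float, n: int) -> float:
    """Chance that n randomly chosen documents contain at least one hit.

    Assumes each document independently has the structure with
    probability `incidence` (an assumption, not a known fact about
    any real corpus).
    """
    return 1 - (1 - incidence) ** n


def docs_needed(incidence: float, chance: float) -> int:
    """Smallest n for which chance_of_hit(incidence, n) >= chance."""
    if not (0 < incidence < 1 and 0 < chance < 1):
        raise ValueError("incidence and chance must be strictly between 0 and 1")
    return math.ceil(math.log(1 - chance) / math.log(1 - incidence))


print(docs_needed(0.5, 0.99))    # 7
print(docs_needed(0.001, 0.99))  # 4603
print(chance_of_hit(0.5, 6))     # 0.984375
```

Raising the acceptance percentage to 99.9% for the 1/1000 structure pushes the answer to roughly 6,900 documents, which is why more certainty always costs more sample.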

What is salutary is that I have worked on numerous projects where small sets of documents (like the six) were considered a solid basis for testing: they were picked by a human for their supposed completeness rather than randomly selected. That has its strengths and weaknesses, but I think the first bullet point which gave 6 above suggests that a small number is barely adequate even for common structures, without special knowledge.

When we look at 1/1000 probability, that may seem rare, but if we have a corpus with a million documents it is still a thousand occurrences: 4,603 is the number of documents you need to test to have a 99% chance of finding at least one of those structures in your test.  But 99% probability is not good enough, in our post-six-sigma age.

The second approach we can take for getting a feel is Random Probability Sampling theory. I have posted on various forums about this before. There is a good calculator here. We provide our corpus size, the probability that we have found the occurrences, and a degree of confidence. Let's run our examples again, assuming a corpus of 100,000 documents.  We don’t worry about our expected likelihood that the thing may occur (do we know it?).

  • For both the appendixes and the tables in tables at 99% confidence, the sample sizes are the same: 14,228.   (The incidence is 0.5.)
  • (If you are willing to reduce your confidence to only 50%, you get 7 documents, which is the same as the BPT example above. With only 6 or 7 documents, it is still only a coin toss whether your random selection contains a document with that feature.)

So again you need to be looking at testing thousands of documents.  If you increase the confidence level, you need more documents in your sample set.
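For readers who want to see where a figure like 14,228 can come from, here is a minimal sketch using the standard sample-size formula for a proportion with a finite population correction. The settings of the online calculator are not stated above, so the 1% margin of error here is an assumption, chosen because it reproduces the 14,228 figure for a corpus of 100,000 at 99% confidence with an incidence of 0.5. The function name `sample_size` is illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float,
                margin: float = 0.01, incidence: float = 0.5) -> int:
    """Sample size for estimating a proportion in a finite population.

    margin    -- margin of error (ASSUMED to be 1%; not given in the text)
    incidence -- assumed proportion; 0.5 is the most conservative choice
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population.
    n0 = z * z * incidence * (1 - incidence) / margin ** 2
    # Finite population correction for a corpus of known size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(100_000, 0.99))  # 14228
```

Because the incidence is fixed at the conservative 0.5, the result does not depend on which structure you are hunting for, which is why both examples give the same sample size.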

You might take the approach that the BPT and the Random Sample Size numbers give a lower and upper bound: if you know the likelihood of an element or structure, you tend towards the BPT, but otherwise you need to use the Random Sample Size number.

My experience is that testing everything is useful. Schematron tests can be written upfront to implement SEV1 and SEV2 error checking (for example) on transformations.  A sanity check of a small number of documents is prudent as an initial step. You can then report the number of documents that fail with a problem, and get the most severe (the errors affecting the most documents) fixed first, down to whatever threshold of error tolerance is required. (For example, if only one document in a million has a particular SEV2 error, for business reasons it may be appropriate to downgrade the error rating to SEV3.)
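The reporting step can be sketched as follows. This is only an illustration: it assumes the validation results have already been flattened into (document, error id, severity) tuples, which is a hypothetical format and not the output of any particular Schematron processor. It counts how many distinct documents each error affects and lists the widest-reaching errors first.

```python
from collections import defaultdict


def rank_errors(failures):
    """Rank errors by the number of distinct documents they affect.

    failures -- iterable of (document_id, error_id, severity) tuples
                (a hypothetical, already-flattened report format).
    Returns a list of (severity, error_id, document_count), with the
    errors affecting the most documents first.
    """
    docs_by_error = defaultdict(set)
    for doc_id, error_id, severity in failures:
        # A set ensures a document is counted once per error.
        docs_by_error[(severity, error_id)].add(doc_id)
    rows = [(sev, err, len(docs)) for (sev, err), docs in docs_by_error.items()]
    # Most documents first; ties broken by severity label, then error id.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


report = rank_errors([
    ("a.xml", "missing-title", "SEV1"),
    ("b.xml", "missing-title", "SEV1"),
    ("a.xml", "empty-table-cell", "SEV2"),
])
for severity, error_id, count in report:
    print(severity, error_id, count)
```

Dividing each count by the number of documents tested gives a failure rate that can be compared against whatever error tolerance threshold the project has agreed.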


  • Adjust your expectations to expect to test thousands of documents.
  • For SCRUM, get a project agreement as part of the definition of done about what the document sample size needs to be, using Binomial Probability or the Random Probability as a proxy.  Build a test system for this.
  • When your tests find an error, use the same document selection to check that the error has been fixed (and take an offending document as a unit test, as well, perhaps.)
  • For each new test run (not being for checking fixes), make another random selection, if possible.
  • Have some small (< 10) collection of documents that are supposed to be correct, and test them first before doing any large-scale tests, each time. These are a sanity check against gross errors.
  • Test about 15,000 documents selected randomly as part of normal testing for each sprint, assuming SCRUM for argument.  Provision yourself to allow these tests to run overnight so that the results are ready for the Sprint Review or Retrospective, or over a weekend so that the results are ready for the Sprint Planning meeting.
  • At some testing milestone or at some different cadence to the sprint (the “epic”?), test all documents in the corpus.
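One way to get both a repeatable selection (for checking fixes) and a fresh selection (for each new test run) is to drive the random choice from a recorded seed. The following is a minimal sketch; the function name `select_documents` and the idea of using a run label as the seed are illustrative assumptions, not part of any particular test framework.

```python
import random


def select_documents(paths, k, seed):
    """Pick k documents at random, reproducibly for a given seed.

    paths -- iterable of document identifiers (e.g. file paths)
    k     -- desired sample size; capped at the corpus size
    seed  -- record this with the test results so the same selection
             can be regenerated when checking that an error is fixed
    """
    # Sort first so the result does not depend on directory listing order.
    ordered = sorted(paths)
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))


corpus = [f"doc{i:06d}.xml" for i in range(100_000)]
first_run = select_documents(corpus, 15_000, seed="sprint-12")
recheck = select_documents(corpus, 15_000, seed="sprint-12")   # same documents
next_run = select_documents(corpus, 15_000, seed="sprint-13")  # new selection
print(first_run == recheck, first_run == next_run)
```

Keeping the seed with the test report means the offending selection never has to be stored as a list of 15,000 file names.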