Patterns in time: validating dynamic behaviour of Java with Schematron and log4j

Posted on March 16, 2018 by Rick Jelliffe

I was thinking more about my recent post on Alan Turing’s method for checking routines in Alan Turing and Tony Hoare in Assertions. How would we do that today, and could Schematron be useful? (This blog post is just a snapshot of a design to air the technique, not debugged code.)

I have previously also posted on using JAXB to serialize out a Java program as a snapshot, and validate that, in Can we assert patterns in Java Objects with Schematron. I expect the same method is available in some other languages, too.

Turing’s approach requires two things, first that we can capture the value of interesting fields at each statement, and second that we can compare them against an expected sequence. The first can be tested by static invariant tests: for example that i is always greater than 10. But normal typing and subtyping can do this if pursued. More interestingly, it allows static co-contraint invariants such as that if the month is November the day cannot be 31. The second is in the realm of what you might be tempted to call dynamic invariant tests if your tests were fuzzy (such as that if the input document has 100 paragraphs then the output document should have at least one and less than 200) or just value tracing otherwise (such as that if the function argument x is value 5, at line 3 the variable y should be 6, and at line 10 the variable y should be 9.)

So can we do with Schematron and off-the-shelf tools? It is tempting to use Java annotations, but it is possible just to use any Java logger that can produce XML. (If you don’t produce XML, you still can use Schematron, but you need to put some step to convert your log format to XML, which is what XSLT3 is supposed to be good for.)

The Log4j: So lets use Apache log4j 2.

The DTDs and schemas for log4j are a complete mess, and inconstent, and it is perhaps prudent of Apache that they have made them so difficult to find. No wonder people find XML complicated, if the libraries are so intent on complexifying things. The XML layout produced by log4j has changed substantially between log4j 1 and log4j 2. I located the Log4j 2 Events DTD on the GitHub and here it is with the namespace prefixes removed and declarations matching what MapMessage.asXML() generates. But you should expect to customise this, depending on whether the log4j XML Layout generates prefixes or not.

<!--element and attribute declarations-->
 <!--Log4J 2.0 XML Schema-->
 <!ELEMENT Events ((Event)*)>
 <!ATTLIST Events
         xmlns CDATA 'http://logging.apache.org/log4j/2.0/events'
 >
 <!ELEMENT Event (Message, (NDC)?, (Marker)?, (Throwable)?, 
           (LocationInfo)?, (Properties)?)>
 <!ATTLIST Event
         logger CDATA #REQUIRED
         timestamp NMTOKEN #REQUIRED
         level (OFF | FATAL | ERROR | WARN | INFO | DEBUG | TRACE | ALL)
              #REQUIRED
     threadId CDATA #REQUIRED
     thread CDATA #REQUIRED
     threadPriority CDATA #REQUIRED
 >
 <!ELEMENT Message ANY>
 <!ELEMENT NDC ANY>
 <!ATTLIST Marker
     parent CDATA #IMPLIED
 > 
 <!ELEMENT Throwable ANY>
 <!ELEMENT LocationInfo EMPTY>
 <!ATTLIST LocationInfo
         class CDATA #REQUIRED
         method CDATA #REQUIRED
         file CDATA #REQUIRED
         line NMTOKEN #REQUIRED
 >
 <!ELEMENT Properties ((Data)+)>
 <!ELEMENT Data EMPTY>
 <!ATTLIST Data
         name CDATA #REQUIRED
         value CDATA #REQUIRED
 >
<!ELEMENT Map (Entry*)>
<!ELEMENT Entry CDATA>
<!ATTLIST Entry 
         key CDATA #REQUIRED>

You can see examples of how to create XML logfiles in this post. You can see how to set up multiple loggers so the tracing does not interfere with normal logging in this post.

Cutting to the chase, here is the idea for the Java instrumentation. (This is just a sketch: not tested. You get the idea.) It is screaming out to be a function or annotation to tidy it up, I know. (The names i1, i2 etc are pretty much what Turing called inductive variables.)

public int doStuff(int n) {
  int i = n;   
  {MapMessage m = new MapMessage(); m.put("i1", Integer.toString(i)); logger.trace( m);}
  i = i++;
  {MapMessage m = new MapMessage(); m.put("i2", Integer.toString(i)); logger.trace( m);}

i = i++;
  {MapMessage m = new MapMessage(); m.put("i3", Integer.toString(i)); logger.trace( m);}

... 
   return i;
}

and our XML output should look something like this if we call doStuff(0):

<?xml version="1.0" encoding="UTF-8"?>
<Events xmlns="http://logging.apache.org/log4j/2.0/events">
  <Event logger="com.schematron.com.eg" timestamp="1373436580419" level="TRACE" threadId="1" thread="main" threadPriority="1">
    <Message><Map><Entry key="i1">0</entry></Message> 
    ... 
  </Event>
  <Event logger="com.schematron.com.eg" timestamp="1373436580420" level="TRACE" threadId="1" thread="main" threadPriority="1">   
    <Message><Map><Entry key="i2">1</entry></Message> 
    ...
  </Event>

<Event logger="com.schematron.com.eg" timestamp="1373436580421" level="TRACE" threadId="1" thread="main" threadPriority="1">
    <Message><Map><Entry key="i3">2</entry></Message> 
    ...
  </Event>

</Events>

The Schematron: And this allows us to write static invariant schematron code, such as the following weak test:

<sch:schema ...>
   <sch:ns prefix="log4j" uri="http://logging.apache.org/log4j/2.0/events"/>
   <sch:pattern ...>
     <sch:rule context="log4j:Entry[starts-with(@key,'i')]"   ...>
       <sch:let name="before" value="preceding::log4j:Entry[starts-with(@key,'i')]" />
        <sch:assert test="not($before) or (number(.) > number($before))">
         The values of i should increment. </sch:assert>
     </sch:rule>
   </sch:pattern>
</sch:schema>

To get the Turing-style test cases, we can write a schema that specifies a Schematron phases for each test case, like this:

<sch:schema ...>
   <sch:ns prefix="log4j" uri="http://logging.apache.org/log4j/2.0/events"/>
   <sch:phase id="doStuff-0">
     <sch:active pattern="p1"/>
   </sch:phase>
   <sch:phase id="doStuff-1">
     <sch:active pattern="p2"/>
   </sch:phase>

 <sch:pattern id="p1">
 <!-- This pattern is used when doStuff(0) was called -->
 <sch:rule context="log4j:Entry/[@key='i1']" ...> 
     <sch:assert test="number(.) = 0">
      For doStuff(0): At i1 the value of i should be zero</sch:assert>
  </sch:rule>
 <sch:rule context="log4j:Entry/[@key='i2']" ...> 
     <sch:assert test="number(.) = 1">
      For doStuff(0): At i2 the value of i should be one</sch:assert>
  </sch:rule>
 <sch:rule context="log4j:Entry/[@key='i3']" ...> 
     <sch:assert test="number(.) = 2">
       For doStuff(0): At i3 the value of i should be two</sch:assert>
  </sch:rule>
 </sch:pattern>
<sch:pattern id="p2">
 <!-- This pattern is used when doStuff(1) was called -->
 <sch:rule context="log4j:Entry/[@key='i1']" ...> 
     <sch:assert test="number(.) = 1">
       For doStuff(1): At i1 the value of i should be one</sch:assert>
  </sch:rule>
 <sch:rule context="log4j:Entry/[@key='i2']" ...> 
     <sch:assert test="number(.) = 2">
       For doStuff(1): At i2 the value of i should be two</sch:assert>
  </sch:rule>
 <sch:rule context="log4j:Entry/[@key='i3']" ...> 
     <sch:assert test="number(.) = 3">
       For doStuff(1): At i3 the value of i should be three</sch:assert>
  </sch:rule>
 </sch:pattern>  
</sch:schema>

Summary: We can use bog-standard COTS logging to generate a dynamic trace, as an XML document, of the values we are interested in during program execution; this then allows us to run Schematron schemas on that XML. With this, we can then make use of the power of XPaths to tie together information from multiple times and locations in the program’s execution. Of the ways to organize this is to use Turing’s approach of inductive variables.

Improvements: Of course, we can generate much better messages, because the “XML Layout” of log4j also will give the package/class/line in the program where the message was generated. We can pass back multiple messages in the MapMessage object, or even pass back exceptions and (with extension classes) XML fragments. As mentioned, the syntax above is verbose, and annotations/injections/aspects would allow something minimal.

Applicability: As with all testing, this kind of technique is only of value if you are willing to spend as much (money or time or personnel) for testing as for coding. The example is a little misleading, because I wouldn’t expect instrumenting a single function would reveal much, and we have debuggers for that perhaps. So the technique is more useful in situations where the dynamic operation of the code was complex, perhaps where code was poorly structured as far as cohesion and coupling, or if co-routines or some asynch processes were being used.

Disability: In a post last week, Analysis versus Synthesis: Are we atuned to each way of thinking? I suggested that the debugging approaches suitable for people with large working memory (who can skip back and forward in code easily) and good short-term-memory-consolidation (who can come back tomorrow or next week and still remember what is going on easily) might be unwise or ineffective for people with smaller working memories and poorer short-term-memory-consolidation: frequent characteristics of ADHD, dyslexia and autism in particular.

That is what set me on this train of thought: is Turing’s kind of stepwise, systematic, forward-going instrumentation a more viable approach for people with that kind of mind, rather than post-failure analysis debugging (using binary chop, ad hoc unit tests, or whatever)?

Indeed, can the same thing be said about TDD in general: if you have workers with some of these syndromes (or you have it yourself) is that another reason to adopt TDD? (And, negatively, if, say, a project has a low-level development regime that fails to allow for TDD approaches, forward instrumentation, or even static analysis like FindBugs, and instead relies on analysis capacity, isn’t that exactly a form of instutional discriminisation: it creates a disability by rejecting approaches that bypass the lack of capability?)

Anyway, that is the idea, I hope you found it interesting.