Weka 3.6 Outputs Models as Java Code

Weka, for those of you who might not know it, is an excellent exploratory and prototyping tool for machine learning and data mining. If you ever find yourself handed some data and asked to find a pattern in it, or to figure out whether it can be used to make better decisions, Weka should be among your first stops.

The Joys of Data Mining

I was fortunate that the software already existed before I started my graduate studies. If it hadn't, I am convinced that my research and thesis would have progressed a lot more slowly (i.e. ground to a complete standstill).

My current day-to-day job doesn't really give me the opportunity to play around with this software at all, but I religiously keep it installed on my home machines (along with R) out of a twisted sense of duty :)

After recently updating Weka to the latest stable version, 3.6, I noticed a long-coveted option finally available under the options for the classifier algorithm:

Classify tab > More Options (in Test Options)

My geeky nerd heart jumped a bit as I remembered my old hack for converting C4.5 decision trees into C++ code: running a custom-built Python script over the tree output generated by Weka. It was a huge pain and left plenty of room for human error. After digging up some demo .arff files (the Weka data file format) to test this new magic, and readying my old-timer "Things weren't always this easy" speech in my head, you can only imagine my crushing disappointment at seeing the hack job that is the resulting Java code output.

The Nerd Rage

I used a relatively simple test set, on whether or not to go outside and play based on the weather conditions, which produced this decision tree:

       J48 pruned tree
       ------------------
       outlook = sunny
       |   humidity = high: no (3.0)
       |   humidity = normal: yes (2.0)
       outlook = overcast: yes (4.0)
       outlook = rainy
       |   windy = TRUE: no (2.0)
       |   windy = FALSE: yes (3.0)
       Number of Leaves  :  5
       Size of the tree :  8

The code was, to say the least, crushingly disappointing. Below is the code generated for the tree above. I am at a loss as to where to start...

  1. The function names
    The application already has access to all the attribute names. Why aren't they included in the function names, rather than relying solely on cryptic auto-generated names that have no meaning? This alone ensures that the code can never be used as-is in any project, as it is completely unmanageable. Just imagine if this were a tree of non-trivial size.
  2. Variable/Parameter types
    Or should I rather say the complete lack thereof. Again, all this information is already present in the supplied .arff file and is used by the classifier algorithm. I can't see how it is better to pass around a flattened array of all the values. And an Object array to boot!
  3. What on God's green earth is happening in the classifyInstance function
    This is the biggest WTF of them all. The classifyInstance function, which I am guessing will be called through the weka.classifiers.Classifier API, seems to deliberately deconstruct an organised data class into the aforementioned god-forsaken Object array. I have no words.
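
To make points 1 and 2 concrete, here is a rough hand-written sketch of the same tree using attribute names in the function names and typed parameters instead of an Object array. Every name in it (WeatherTreeSketch, Outlook, Humidity, Play, classifySunny, classifyRainy) is made up for illustration and is not something Weka produces. It also ignores missing values, which the generated code does handle.

```java
// Hypothetical hand-written equivalent of the J48 tree above.
// All type and method names are invented for illustration; this is
// not what Weka emits, and missing values are not handled.
class WeatherTreeSketch {

    enum Outlook { SUNNY, OVERCAST, RAINY }
    enum Humidity { HIGH, NORMAL }
    enum Play { YES, NO }

    // Root node: split on outlook.
    static Play classify(Outlook outlook, Humidity humidity, boolean windy) {
        switch (outlook) {
            case SUNNY:    return classifySunny(humidity);
            case OVERCAST: return Play.YES;
            case RAINY:    return classifyRainy(windy);
            default:       throw new IllegalArgumentException("Unhandled outlook: " + outlook);
        }
    }

    // outlook = sunny: split on humidity.
    static Play classifySunny(Humidity humidity) {
        return humidity == Humidity.HIGH ? Play.NO : Play.YES;
    }

    // outlook = rainy: split on windy.
    static Play classifyRainy(boolean windy) {
        return windy ? Play.NO : Play.YES;
    }

    public static void main(String[] args) {
        System.out.println(classify(Outlook.SUNNY, Humidity.HIGH, false));    // NO
        System.out.println(classify(Outlook.OVERCAST, Humidity.HIGH, true));  // YES
        System.out.println(classify(Outlook.RAINY, Humidity.NORMAL, true));   // NO
    }
}
```

With enums the compiler rejects a misspelled attribute value, and each function name says which branch of the tree it covers.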

The generated code

class WekaClassifier {

  public static double classify(Object[] i)
    throws Exception {

    double p = Double.NaN;
    p = WekaClassifier.N21c17f5a2(i);
    return p;
  }
  static double N21c17f5a2(Object []i) {
    double p = Double.NaN;
    if (i[0] == null) {
      p = 1;
    } else if (i[0].equals("sunny")) {
    p = WekaClassifier.N268fff063(i);
    } else if (i[0].equals("overcast")) {
      p = 0;
    } else if (i[0].equals("rainy")) {
    p = WekaClassifier.N37aff6b14(i);
    } 
    return p;
  }
  static double N268fff063(Object []i) {
    double p = Double.NaN;
    if (i[2] == null) {
      p = 1;
    } else if (i[2].equals("high")) {
      p = 1;
    } else if (i[2].equals("normal")) {
      p = 0;
    } 
    return p;
  }
  static double N37aff6b14(Object []i) {
    double p = Double.NaN;
    if (i[3] == null) {
      p = 1;
    } else if (i[3].equals("TRUE")) {
      p = 1;
    } else if (i[3].equals("FALSE")) {
      p = 0;
    } 
    return p;
  }
}

... snip ...

public double classifyInstance(Instance i) throws Exception {
    Object[] s = new Object[i.numAttributes()];

    for (int j = 0; j < s.length; j++) {
      if (!i.isMissing(j)) {
        if (i.attribute(j).isNominal())
          s[j] = new String(i.stringValue(j));
        else if (i.attribute(j).isNumeric())
          s[j] = new Double(i.value(j));
      }
    }

    // set class value to missing
    s[i.classIndex()] = null;

    return WekaClassifier.classify(s);
  }
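
To show what the Object array means for calling code, here is a small sketch of a caller's view. It assumes the attribute order outlook, temperature, humidity, windy, play, and the class labels yes = 0, no = 1. Indices 0, 2 and 3 and the label order follow from the listing and the tree above, but the temperature slot at index 1 and the class slot at index 4 are my guess. The classify method here is a condensed transcription of the generated logic so that the example compiles on its own; GeneratedClassifierUsage, PLAY_LABELS and label are hypothetical names.

```java
// Caller-side sketch for the generated classifier.
// Assumed attribute order: outlook, temperature, humidity, windy, play.
class GeneratedClassifierUsage {

    // Assumed class label order: index 0 is "yes", index 1 is "no".
    static final String[] PLAY_LABELS = {"yes", "no"};

    // Condensed transcription of the generated WekaClassifier logic.
    static double classify(Object[] i) {
        Object outlook = i[0], humidity = i[2], windy = i[3];
        if (outlook == null) return 1;
        if (outlook.equals("overcast")) return 0;
        if (outlook.equals("sunny")) {
            if (humidity == null || humidity.equals("high")) return 1;
            if (humidity.equals("normal")) return 0;
        }
        if (outlook.equals("rainy")) {
            if (windy == null || windy.equals("TRUE")) return 1;
            if (windy.equals("FALSE")) return 0;
        }
        return Double.NaN; // no branch matched the given value
    }

    // Map the returned class index back to its label.
    static String label(double p) {
        return Double.isNaN(p) ? "unknown" : PLAY_LABELS[(int) p];
    }

    public static void main(String[] args) {
        // The caller has to know every position and spelling by heart.
        Object[] row = {"sunny", "hot", "high", "FALSE", null};
        System.out.println(label(classify(row)));   // no

        // A capitalisation slip is not an error, it is just NaN.
        Object[] typo = {"Sunny", "hot", "high", "FALSE", null};
        System.out.println(label(classify(typo)));  // unknown
    }
}
```

Nothing in the signature stops a caller from swapping two positions or misspelling a value; the only symptom is a wrong or NaN result at run time.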

I feel an uncontrollable, nerd-rage-fuelled urge to download the source and fix this cluster-fuck of a feature, and I will!

Stay tuned...


