In this post I want to share some experiences from the field of “Machine Learning” that my current project recently pointed me to. I will focus on “Data Classification” with the tool RapidMiner and give an overview of the topic. In particular, I would like to show how you can use this “stuff” from your Java application.
If you have a background in architecting and developing enterprise software like I have, chances are high that you spend most of your time thinking about the structure of your software system: How can I arrange the code for the different features of my system so that all the different architectural *abilities (Scalability, Maintainability, …) are met? To be honest, most often the features themselves are relatively simple: get some data from the GUI, validate the data by mostly simple rules, store the data in a database and retrieve it later to present it on yet another GUI. Quite often the sheer mass of the requested features is the challenge, not any single feature by itself.
Lately I was pointed to a different kind of beast. Without going into the full details here, my team got the request to somehow “calculate” the “next best action” for a user of a customer-care system who has a customer on the phone and the customer’s data on screen. What to do next with the customer? No clear set of rules was available at first; at best, some data about what works with different customers and what doesn’t could be gathered.
That constellation led me to the thrilling area of “Machine Learning” and to some interesting experience with a tool called “RapidMiner” that I would like to share.
If your system can “learn” from data and – after learning – use the new “knowledge” to act “better”, then you have some kind of “Machine Learning” component in your system. There are many different dimensions along which the Machine Learning field can be split. Often you find a split into three different areas:
- Classification
- Collaborative Filtering / Recommendation Engines
- Clustering
For this post I will concentrate on the first area: Classification. I will highlight the differences to the other areas at the end of the post (and perhaps there will be some time to dive deeper into those areas in later postings).
This post is rather long, so let me provide a table of contents to you:
1. Introduction to Classification
2. First example – Getting your feet wet with RapidMiner Classification
3. Second Example – Text Classification
4. Using in Java
5. A note on scalability
6. Some other areas of machine learning
7. Conclusion
So, let’s start with “1.”:
1. Introduction to Classification
So “Classification” – what’s that? Let me give you an example of an application first. Imagine you have your data in a database. Maybe you have a table with all your customers, one row per customer (sounds familiar?). You may have many fields in that table with very diverse information about your customers, e.g. address, job, age, last year’s spending for different product groups, marital status and many more. Now you would like to start the new year with a marketing campaign in which you address your customers according to their living conditions and buying habits. So you have to classify your customers into groups (e.g. technical geek, luxury-addicted and budget-oriented). That’s classification – you give your customers a “label” so you can act accordingly. If you can formulate a sound set of rules to do that, it’s simple. But if you have complex datasets and only some examples of successful classification, Machine Learning comes into play.
To get an impression of how classification works, please take a look at figure 1. It shows the division between the phases “model building”, “model testing” and “production”.
Figure 1: Schematics of Classification
First, during “model building”, you feed data rows (or “examples” in machine learning lingo) with known labels into the machine learning algorithm. The algorithm tries to “learn” which data constellations in the fields lead to which labels. The learned information constitutes a “model” in the terms of the algorithm.
During model building you don’t give the algorithm all of your labeled data. You hold back some smaller part of the rows. Now, in the model testing phase, you use those rows to test the model that the algorithm has built. As later in the production phase, you apply the model to the rows to let the algorithm predict labels. But unlike in production, you now have both a predicted label and a label known to be correct for each row. You can compare these two labels and gain some insights about the quality of your model. If it doesn’t satisfy you, you can tweak some parameters of the learning algorithm and go back to the model building phase.
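In code, that comparison boils down to counting how many of the held-back rows received the correct label. A minimal, self-contained sketch (the label lists are hypothetical stand-ins for the output of any classifier):

import java.util.List;

/** Minimal sketch of the model-testing idea: compare the predicted labels
 *  against the known labels of the held-back rows. */
public class AccuracyCheck {

    static double accuracy(List<String> knownLabels, List<String> predictedLabels) {
        int hits = 0;
        for (int i = 0; i < knownLabels.size(); i++) {
            if (knownLabels.get(i).equals(predictedLabels.get(i))) {
                hits++;
            }
        }
        return (double) hits / knownLabels.size();
    }

    public static void main(String[] args) {
        // hypothetical labels for four held-back rows
        List<String> known = List.of("geek", "luxury", "budget", "geek");
        List<String> predicted = List.of("geek", "budget", "budget", "geek");
        System.out.println("Accuracy: " + accuracy(known, predicted)); // 0.75
    }
}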
Later, in the production phase, you use the built model to predict labels for new rows and let your system react accordingly. From a software technology view you have to let your application interact with the Machine Learning component. We will take a look at this interaction later.
Side note: I am simplifying a little bit here. E.g. it is often the case that you can’t simply use your existing data tables. If you have a complex data model with different 1:n relationships, you have to flatten it into a view with one big fat row for each “thing” you want to label. Additionally, you have to take care of rows with missing data and improper data types. In the end you get a pipeline or process through which you let your data flow to the machine learning engine.
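To make that flattening step a bit more concrete, here is a small sketch; Customer and Order are hypothetical domain classes, and which aggregates you compute depends entirely on your domain:

import java.util.List;

/** Sketch of “flattening”: turn a customer with a 1:n order relation into
 *  one wide row of numeric features for the machine learning engine. */
public class Flattener {

    record Order(String productGroup, double amount) {}
    record Customer(int age, List<Order> orders) {}

    /** One big fat row per customer: age plus some aggregated order data. */
    static double[] toFeatureRow(Customer c) {
        double techSpending = c.orders().stream()
                .filter(o -> o.productGroup().equals("tech"))
                .mapToDouble(Order::amount).sum();
        double totalSpending = c.orders().stream()
                .mapToDouble(Order::amount).sum();
        return new double[] { c.age(), c.orders().size(), techSpending, totalSpending };
    }

    public static void main(String[] args) {
        Customer c = new Customer(35,
                List.of(new Order("tech", 499.0), new Order("books", 25.0)));
        System.out.println(java.util.Arrays.toString(toFeatureRow(c)));
        // -> [35.0, 2.0, 499.0, 524.0]
    }
}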
2. First example – Getting your feet wet with RapidMiner Classification
You don’t have to implement the Machine Learning algorithms yourself. There are a lot of tools you can use. One which I find very useful is RapidMiner. It’s an Open Source tool (AGPL3) that originated at the Technical University of Dortmund and is now put forward by Rapid-I GmbH, which sells commercial licenses as well. You can use it for very different data-related tasks, not only for classification. You can build your learning processes in a strong, Eclipse-based RCP GUI and use the derived models via a Java API in your own application.
Figure 2 shows a learning process in RapidMiner as an example.
Figure 2: RapidMiner Iris-Classification-Process
The figure shows a classic example in Machine Learning: classification of iris flowers into three different subtypes (Iris Setosa, Iris Versicolour and Iris Virginica) by different sepal and petal measurements. It is based on a dataset published by R.A. Fisher back in 1936.
Figure 3 shows some data rows from the dataset:
Figure 3: Some Example rows from the Iris-Dataset
To get this rolling yourself you can clone my work from GitHub: https://github.com/frank-engelen/machine_learning.git. To keep things easy, I would suggest cloning this git repository to the root directory of your computer; otherwise you will need to adjust some paths.
C:\>git clone https://github.com/frank-engelen/machine_learning.git machine_learning
RapidMiner works with the term “Repository”, too. After you have cloned from GitHub you will find a subdirectory called “rapidminer_repo” in “/machine_learning”. Install and fire up RapidMiner (see the Readme.md in the GitHub repository for additional remarks on installing and starting RapidMiner) and import that repo into your RapidMiner workspace. To do that, press the “Add Repository” icon in the Repositories view (see figure 4) and enter the data shown in the figure.
Figure 4: Import a RapidMiner Repository
To rebuild the process of the initial example (figure 2) you have to open the process “01-iris-process” in the repository view via double click.
On the leftmost side you see a node “Read CSV” which reads the Iris dataset into the system (if you need to adjust paths, here is one place to do it). It additionally selects attribute no. 5 of the dataset as the “label” for the classification. The second node splits the dataset into two partitions: 90% for training, 10% for testing. The training partition goes to the “Naïve Bayes” node which performs the building of the model (“Naïve Bayes” is one possible algorithm for Machine Learning; there are many more available for your Machine Learning needs). The “Apply Model” node applies the learned model to the test data. The test data enriched with the predicted labels is then forwarded to a performance evaluation.
You can start the process with the big blue “Play” button in the toolbar. With that you switch to the “Results” perspective (see figure 5). In one tab (“ExampleSet”) you see the test dataset with all the attributes and the calculated prediction. In the other tab (“PerformanceVector”) you see some statistics about the prediction. In our simple case the accuracy of the prediction was 100%. The so-called “confusion matrix”, which shows the cases where your model failed, is therefore relatively boring. Don’t expect such good results in real-world cases – 80%-95% is more realistic. We will see an example of that now.
Figure 5: Perfect Iris Classification by the Process
3. Second Example – Text Classification
Another common application for Classification is the classification of text. If you have a big mass of documents and want to split them into different groups, Text Classification can help. The second example process in my GitHub repository takes a dataset with approx. 20,000 postings to 20 selected topic newsgroups of the Usenet. The dataset was provided by Tom Mitchell from Carnegie Mellon University. Details can be found here. Figure 6 shows one of the postings as an example and the list of the 20 different topic groups.
Figure 6: Example Posting and list of topic groups
If you open “02-text-learning” in RapidMiner from the repository view, you see a learning and testing process for the twenty newsgroups problem (see figure 7). As in the first process, there is a split of the example data between learning and testing (90%/10% again), a kind of “Naïve Bayes” learning algorithm and some nodes for model application and performance evaluation. Additionally we see two “Store” nodes which form the basis for using the learned model from Java (see next section). The other nodes “ProcDocs”, “Select Attributes” and “Set Role” are new. We will discuss them later.
Figure 7: Text-Learning-Process
If you start the process you will need some patience. On my notebook the learning and testing phases together take approximately 6 minutes. After that, a confusion matrix shows up (see figure 8).
Figure 8: Text-Learning-Confusion Matrix
Over 86% of the test postings are put into the right newsgroups! Impressive! Additionally, if you dive deeper into the confusion matrix, you see that there is some confusion in splitting postings between “talk.religion.misc”, “alt.atheism” and “soc.religion.christian”. I bet that even for a human it would be difficult to separate these topics.
So how does it work? It may disappoint you, but there is no text understanding and very little semantic analysis in place. It’s all about statistics. The basic trick in text classification: the number of occurrences of different kinds of words differs for different topics. Simply put: in the group “talk.religion.misc” there will be more occurrences of the word “church” than in “comp.sys.ibm.pc.hardware”. So, if you find the word “church” in a posting, the likelihood that the posting belongs to “talk.religion.misc” increases and the likelihood for “comp.sys.ibm.pc.hardware” decreases. The Naïve Bayes operator does sophisticated calculations based on that initial thought.
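To illustrate just the idea (RapidMiner’s operator is far more refined), here is a toy scorer with invented word counts and the simplest possible “add one” smoothing:

import java.util.Map;

/** Toy illustration of the Naive Bayes idea: per-topic word frequencies
 *  turn into per-topic (log-)likelihoods, the topic with the highest
 *  score wins. The counts are made up for this example. */
public class ToyNaiveBayes {

    // invented occurrences of two words per topic
    static final Map<String, Map<String, Integer>> WORD_COUNTS = Map.of(
            "talk.religion.misc", Map.of("church", 50, "cpu", 1),
            "comp.sys.ibm.pc.hardware", Map.of("church", 1, "cpu", 40));

    static String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> topic : WORD_COUNTS.entrySet()) {
            double score = 0.0;
            for (String w : words) {
                int count = topic.getValue().getOrDefault(w, 0);
                score += Math.log(count + 1); // "add one" smoothing, log space
            }
            if (score > bestScore) {
                bestScore = score;
                best = topic.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify(new String[] { "church", "church" }));
        // -> talk.religion.misc
    }
}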
Because both examples use a form of the Naïve Bayes operator, the “how” of the classification stays nearly the same. What differs from the first example is that there the dataset was in tabular form right from the start. In text classification we have documents, and we are responsible for bringing that data into tabular form, too – each document gets its own row.
The “ProcDocs” node is responsible for building the data table. It does that by calculating the number of occurrences of words in the different documents. ProcDocs looks at a file-system directory structure, reads the document files in that structure and produces one example row for each document found (so “ProcDocs” is the second place to adapt if you used a different file path for the git repository). The fields of those rows consist of some metadata (file path, file name, file date, document length, label for learning/testing) and one field for nearly each word that was found during processing in one of the documents (you can take a look at the rows on the “ExampleSet” tab in the “Result” perspective). Why “nearly each word”? Well, that’s what makes the “ProcDocs” node complex. It even has an inner subprocess to deal with that complexity. Double click on the “ProcDocs” node to get a view of the subprocess (see figure 9).
Figure 9: Sub process and Properties of “ProcDocs”
This subprocess is executed for each of the approx. 20,000 postings. Let me summarize the tasks of each inner node (a rough Java imitation of the first three steps follows after the list):
Tokenize: Takes the text of the document and splits it into a stream of tokens (aka words). A new word begins at each non-letter character.
Stem: Does some “stemming” on each word. That normalizes groups of semantically similar words to a common word. An example from Wikipedia: “fishing”, “fished”, “fish” and “fisher” all become “fish”.
Filter Stopwords: Words from a list of “stopwords” are filtered out here. Stopwords are words which are so common that they don’t help in classifying and would only bloat the example rows. Examples are “and” or “the”. The operator uses a predefined list of English stopwords.
Extract Length: Adds the length of each document as a new field to each example row.
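Here is that rough Java imitation of the first three steps – emphatically not RapidMiner’s implementation; the “stemmer” is a crude suffix chopper, while real stemmers (e.g. the Porter stemmer) are considerably smarter:

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Rough imitation of “Tokenize”, “Stem” and “Filter Stopwords”. */
public class MiniTextPipeline {

    static final Set<String> STOPWORDS = Set.of("and", "the", "of", "a", "is");

    static List<String> tokenize(String text) {
        // a new token begins at every non-letter character
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    static String stem(String token) {
        // toy stemming: chop a few common English suffixes
        for (String suffix : new String[] { "ing", "ed", "er", "s" }) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The fisher fished and is fishing.").stream()
                .map(MiniTextPipeline::stem)
                .filter(t -> !STOPWORDS.contains(t))
                .collect(Collectors.toList());
        System.out.println(tokens); // -> [fish, fish, fish]
    }
}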
Based on the tokens built by the subprocess, the “ProcDocs” node itself calculates the number of occurrences of each token in each document and builds the example rows with the fields for the tokens (well, more special lingo here: a “row” is also called a “vector”). In simple cases the number of occurrences is stored directly in the row fields. But to achieve good classification performance some more maths is necessary. Instead of the “number of occurrences”, the “term frequency – inverse document frequency” (TF-IDF) is stored for each token. This number correlates the frequency of each token in the current document with its frequency in all the documents: if a token is present in only a few of the documents but is very frequent in those, then that’s more interesting than a token that is very common in all of the documents.
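A common TF-IDF variant (the exact formula and normalization RapidMiner uses may differ) multiplies the term frequency with the logarithm of the inverse document frequency:

/** Sketch of one common TF-IDF variant. tf = occurrences of the token in
 *  the current document, df = number of documents containing the token,
 *  n = total number of documents. */
public class TfIdf {

    static double tfIdf(int tf, int df, int n) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // "church": 5 occurrences, found in 50 of 20000 documents -> high weight
        System.out.println(tfIdf(5, 50, 20000));    // ~29.96
        // "write": 5 occurrences, but found in 15000 of 20000 docs -> low weight
        System.out.println(tfIdf(5, 15000, 20000)); // ~1.44
    }
}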
To further prevent the example rows from being bloated, some “pruning” is applied: tokens which are used very rarely or very, very often are filtered out. Especially the former prevents creative wordings like “arghoohi” from bloating the rows.
Well, that’s nearly all for the complex “ProcDocs” node. Only one further thing to mention: the classification label for learning and testing is derived from the file directory of the document. Click on “text directories” / “Edit list (20)…” to see that. Luckily that matches the structure of the 20-newsgroups dataset.
Compared to that, the remaining nodes “Select Attributes” and “Set Role” aren’t so complex: “Select Attributes” filters out some unused or disturbing metadata fields in each row. The “Set Role” node indicates that the field “metadata_path” should be treated as the primary id of each document and should therefore not be considered in learning.
4. Using in Java
Phew! Heavy stuff, but in the end a very impressive result, I would say: 86+% classification hits without any domain-specific programming! (BTW: for a look at advanced document classification in a demanding and complex environment you should have a look at Jürgen’s post.)
But how can we use all that from our Java applications? Thankfully it is quite simple – I’ve put an example in the git repo. Here is the “main” method of MainClassifier:
public static void main(String[] args) throws Exception {
    // Path to process-definition
    final String processPath =
        "/machine_learning/rapidminer_repo/03-text-classification-in-Java.rmp";

    // Init RapidMiner
    RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
    RapidMiner.init();

    // Load process
    final com.rapidminer.Process process =
        new com.rapidminer.Process(new File(processPath));

    // Load learned model
    final RepositoryLocation locModel = new RepositoryLocation(
        "//My Machine Learning Repo/02-text-processdata/20-newsgroups.model");
    final IOObject model = ((IOObjectEntry)
        locModel.locateEntry()).retrieveData(null);

    // Load word list
    final RepositoryLocation locWordList = new RepositoryLocation(
        "//My Machine Learning Repo/02-text-processdata/20-newsgroups.wordlist");
    final IOObject wordlist = ((IOObjectEntry)
        locWordList.locateEntry()).retrieveData(null);

    // Execute classification process with the learned model and word list
    // as input. Additionally expects files in
    // /machine_learning/data/03-20_newsgroup_java_in
    final IOContainer ioInput = new IOContainer(new IOObject[] { model, wordlist });
    // two warm-up runs so the timing below is not dominated by class loading
    process.run(ioInput);
    process.run(ioInput);
    final long start = System.currentTimeMillis();
    final IOContainer ioResult = process.run(ioInput);
    final long end = System.currentTimeMillis();
    System.out.println("T:" + (end - start));

    // Print some results
    final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
    for (int i = 0; i < Math.min(5, ses.size()); i++) {
        final Example example = ses.getExample(i);
        final Attributes attributes = example.getAttributes();

        final String id = example.getValueAsString(attributes.getId());
        final String prediction = example.getValueAsString(
            attributes.getPredictedLabel());

        System.out.println("Path: " + id + ":\tPrediction:" + prediction);
    }
}
The method initializes RapidMiner and loads a classification process which was defined via the RapidMiner GUI (you can find ‘03-text-classification-in-Java’ in the imported RapidMiner repo).
The process takes the list of all words/tokens and the model as input. Both were created during the learning phase. It’s also possible to read these two things in the process via “Retrieve” nodes. But you get better performance, especially if you execute the process several times, if you read them separately and put them into the process as input.
A “ProcDocs” node in the process, equivalent to the learning “ProcDocs” node, looks for all files in “/machine_learning/data/03-20_newsgroup_java_in” and processes them.
At the end of the Java program you can see how the process result is retrieved and printed (see figure 10):
Figure 10: Classification in Java/Eclipse
5. A note on scalability
The runtime of the classification process in the Java program is around 700 ms (timed around the process.run(…) call). This time is influenced by the initialization and class-loading time; further runs can be faster, around 200 ms. This stands in sharp contrast to the six-minute runtime of the learning process. This pattern is typical for classification: the learning time is much, much longer than the actual classification time. That means you can build online systems which use classification even if your learning time goes into time scales of hours and more. This is especially true because you can use multiple RapidMiner instances to do classification simultaneously.
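A minimal sketch of that idea with plain Java concurrency around the API calls shown above – whether a single Process instance may safely be shared across threads is something I have not verified, so each worker loads its own copy of the process definition:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: several classification workers in parallel, each with its own
 *  copy of the RapidMiner process. Assumes RapidMiner.init() has already
 *  been called as in MainClassifier. */
public class ParallelClassifier {

    public static void main(String[] args) throws Exception {
        final String processPath =
                "/machine_learning/rapidminer_repo/03-text-classification-in-Java.rmp";

        final ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                // each worker gets a private process instance
                final com.rapidminer.Process p =
                        new com.rapidminer.Process(new File(processPath));
                // supply model and word list as IOContainer input, as shown above
                p.run();
                return null;
            });
        }
        pool.shutdown();
    }
}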
But with very, very, very large datasets you won’t be able to handle the learning on a single machine (advice: don’t give up too early; look e.g. at the Amazon AWS “High-Memory Quadruple Extra Large Instance”). So at some point you may need to use a cluster. Unfortunately, RapidMiner has no direct support for distributed learning. There is a commercial extension “Radoop” which lets RapidMiner work with Apache Hadoop Clustering. Alternatively, after some GUI-guided first steps in Machine Learning you may like to switch to Apache Mahout. But you will see that Mahout is a combination of some diverse Open Source projects, which makes it heterogeneous and somewhat harder to use. Additionally, some important classification algorithms (like “Support Vector Machines”) are not implemented in Mahout. My advice would be: “start small, but start”. Don’t let the fear that you can’t handle Facebook-like request loads stop you from getting some experience with classification.
6. Some other areas of machine learning
So, that’s nearly it for now. I hope my posting gave you some first insights into the “magic” of “classification”. I would like to briefly address the differences to the other areas mentioned above:
Collaborative Filtering / Recommendation Engines
The best example of Collaborative Filtering in action is surely Amazon.com with its “Customers Who Bought This Item Also Bought”. I don’t know exactly how Amazon implemented it, but in the traditional flavor you do not work with one table of example rows as in classification. Instead you work with two tables (e.g. items and customers) and the n:m relation between them (e.g. “bought” or “rates”). Traditionally you don’t look into the rows but only at the relationships. For more information you can check http://en.wikipedia.org/wiki/Collaborative_filtering.
Clustering
Clustering tries to find groups of data in a given dataset such that rows in the same group are more “similar” to each other than rows of different groups. Traditionally you provide a form of “similarity measure” to the algorithm. For more information you can check http://en.wikipedia.org/wiki/Cluster_analysis.
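The classic example of such a measure is the Euclidean distance between two rows of numeric fields (smaller distance = more similar); a minimal sketch:

/** Euclidean distance between two rows of numeric fields. Clustering
 *  algorithms like k-Means group rows so that this value stays small
 *  within a group. */
public class Similarity {

    static double euclideanDistance(double[] rowA, double[] rowB) {
        double sum = 0.0;
        for (int i = 0; i < rowA.length; i++) {
            final double diff = rowA[i] - rowB[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // e.g. two customers described by (age, last year's spending)
        System.out.println(euclideanDistance(
                new double[] { 35, 500 }, new double[] { 37, 520 })); // ~20.1
    }
}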
7. Conclusion
In my opinion it becomes more and more important to make some sense of all the billions, trillions and quadrillions of bits and bytes stored in modern systems. Even in “Big Data” systems it is not the data by itself that is important, but rather the information inherent in that data which can be used to optimize business decisions. Machine Learning can extend your toolset to move from “data” to “information”. As I’ve mentioned above, classification can be used for a diverse set of problems, from splitting your customer base to pre-splitting the data entering your system. Its application can reach from very local (e.g. providing an “intelligent” pre-selection for a drop-down list on a GUI based on the current data situation) to global, where it may be the determining factor for the architecture of the system (e.g. a social media sentiment analysis system).
So perhaps you’ll get your feet wet now – and have some interesting experiences in that area. And the next time you are asked about the parts of your software system, perhaps you’ll answer: “Well, the usual parts: Views, Controllers, Domain Objects, Services … and some AI/Machine-Learning stuff”. Some interesting talks may start…
BTW: if you want to dive deeper into Machine Learning and RapidMiner, I strongly suggest giving “Data Mining for the Masses” by Dr. Matt North a try.