Exploring KNIME - L1-DS

KNIME Analytics Platform for Data Scientists: Basics

Mar 11, 2023

KNIME is a graphical data preparation and machine learning tool that lets users build transparent, easy-to-communicate workflows.

L1-DS is the first course (Level 1) of the Data Science (DS) training. KNIME provides exercises with data, along with solutions, to give a better understanding of best practice in workflow modeling. Training slides can be found here. The course consists of 5 exercises.

In this blog post I want to explore workflow 4: Machine Learning.

Decision Tree

The first workflow uses the Decision Tree algorithm. Decision trees are supervised ML algorithms, which means they need labeled example data to learn from.

You can find documentation and a YouTube video explaining how the Decision Tree nodes work.

The first node of the workflow is the metanode “Fully Joined Data” which wraps the data preparation process:

Different datasets are read, joined and transformed to provide the right data for the machine learning steps that follow. The join is configured through a graphical representation, and different parameters define the output, the key handling and the performance:
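To make the join step more concrete, here is a minimal plain-Python sketch of what an inner join and a left outer join do. It is only an illustration of the idea, not how the KNIME Joiner node is implemented; the function `join_rows` and the tiny `customers` and `contracts` tables are hypothetical names and made-up data.

```python
def join_rows(left, right, key, how="inner"):
    """Join two lists of dict rows on a shared key column.

    how="inner" keeps only rows with a match in both tables;
    how="left" also keeps unmatched left rows and fills the
    missing right-hand columns with None.
    """
    # Index the right table by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_cols = [c for c in (right[0] if right else {}) if c != key]

    joined = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        elif how == "left":
            joined.append({**row, **{c: None for c in right_cols}})
    return joined


# Hypothetical example tables (not the course data).
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
contracts = [{"id": 1, "product": "A"}]

inner = join_rows(customers, contracts, "id")
left_outer = join_rows(customers, contracts, "id", how="left")
```

The choice between the two join types decides whether records without a partner in the other table survive, which directly changes how many rows reach the learner.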

For machine learning, the data is split into training and test data, by default with a 50:50 ratio and using stratified sampling on the target variable.

Stratified sampling means that the distribution of values in the selected column is (approximately) retained in the output tables. You may optionally specify a fixed seed.
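The idea of a stratified 50:50 split with a fixed seed can be sketched in a few lines of plain Python. This is an illustration under my own assumptions, not the KNIME partitioning logic: `stratified_split` is a hypothetical helper, and the 100 rows with 20% "yes" labels are invented.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, train_fraction=0.5, seed=42):
    """Split rows into train/test while keeping the class mix of `target`."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the input is not reordered
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])   # same fraction taken from every class
        test.extend(members[cut:])
    return train, test


# Invented data: 20 "yes" rows and 80 "no" rows.
rows = [{"x": i, "Target": "yes" if i % 5 == 0 else "no"} for i in range(100)]
train, test = stratified_split(rows, "Target")
```

Because the split is done per class, both halves keep the 20:80 ratio of the full table, and calling the function again with the same seed returns the same partition.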

Several parameters are available for training the decision tree on the training data, but the most defining one is the quality measure, which is either "Gini Index" or "Gain Ratio". An explanation can be found here.
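To get a feeling for the two quality measures, the following sketch computes both for one candidate split. It uses the textbook definitions (Gini impurity, and information gain divided by split information); I am assuming these match what the node computes in spirit, and the function names and the toy labels are my own.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini_gain(parent, children):
    """Reduction in Gini impurity achieved by splitting parent into children."""
    n = len(parent)
    weighted = sum(len(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted


def gain_ratio(parent, children):
    """Information gain normalised by the entropy of the split sizes."""
    n = len(parent)
    info_gain = entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
    split_info = -sum(len(ch) / n * log2(len(ch) / n) for ch in children if ch)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy split: 10 records divided into two branches of 5.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

split_gini_gain = gini_gain(parent, children)
split_gain_ratio = gain_ratio(parent, children)
```

In both cases the learner prefers the split with the higher value; the gain ratio additionally penalises splits that fan out into many small branches.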

Pruning, on the other hand, is very important for preventing overfitting. The results from the Decision Tree Predictor show that there is typically no 100% fit for the “Target” prediction:

Our whole dataset has 11551 records and 15 columns from which to learn and to test the prediction of a value for new records.

One way to understand how well new data would be predicted is the confusion matrix:
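As a small illustration of what the confusion matrix counts, here is a plain-Python sketch. Rows are the actual classes and columns the predicted classes; the helper `confusion_matrix` and the eight example labels are made up for this post and are not the course results.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a matrix where cell [i][j] counts actual labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Invented labels for illustration only.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "no", "yes", "no", "yes"]

matrix = confusion_matrix(actual, predicted, ["yes", "no"])
# The diagonal holds the correct predictions.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(actual)
```

Here `matrix` is `[[3, 1], [1, 3]]` and the accuracy is 0.75: three of four "yes" records and three of four "no" records were predicted correctly.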

Another, graphical approach is the ROC (receiver operating characteristic) curve:
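How the ROC curve comes about can be sketched as follows: sort the records by predicted probability of the positive class, lower the threshold step by step, and record the false positive rate and true positive rate at each step. The area under the curve (AUC) summarises it in one number. The functions `roc_points` and `auc` and the six scores are my own illustration, not KNIME output.

```python
def roc_points(actual, scores, positive="yes"):
    """Return the (false positive rate, true positive rate) points of the ROC curve."""
    pos = sum(1 for a in actual if a == positive)
    neg = len(actual) - pos
    # Highest scores first: each step lowers the decision threshold.
    ranked = sorted(zip(scores, actual), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all records sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Invented predictions: probability of class "yes" per record.
actual = ["yes", "yes", "no", "yes", "no", "no"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(actual, scores)
area = auc(curve)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking; the toy data above gives 8/9, because one "no" record is scored higher than one "yes" record.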

Linear Regression

This algorithm is very popular for numeric prediction. Even though data does not always show a linear relation, for these kinds of problems it is often the reference against which you can check whether another solution or algorithm does better.
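For a single input column, the least-squares fit behind linear regression is short enough to write out by hand. The sketch below is a minimal illustration with invented numbers; `fit_line` and `r_squared` are hypothetical helper names, and the course workflow of course uses the KNIME learner node instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented, roughly linear data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
score = r_squared(xs, ys, slope, intercept)
```

The R² value is what makes the model usable as a reference: any alternative algorithm has to explain more of the variance on the test data to be worth the extra complexity.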

K-Means (Clustering)

The third algorithm is k-means for clustering data. In contrast to decision trees and linear regression, k-means is an unsupervised machine learning method. This means that you don't need labeled data for the learning process; instead, it tries to find patterns (clusters) directly in the available data.
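The core loop of k-means fits in a few lines: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The following is a minimal sketch of that standard procedure (Lloyd's algorithm) with random initial centroids; the function `kmeans` and the six 2D points are invented for illustration and say nothing about how the KNIME node is implemented.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster tuples of numbers into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]   # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:     # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Invented data with two obvious groups and no labels.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

Note that no target column appears anywhere: the two groups are found purely from the distances between the points, and the number of clusters k has to be chosen up front.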
