Working with text data in Magento 2: a machine learning tutorial

The story

One day I was checking out Magento 2 and I thought to myself: “Hey, you know what would be nice? Integrating machine learning with Magento… OK, but how, and what should I use?” At that time I was learning how to work with text data, so I wanted to test the accuracy of a search using machine learning.

At first I created a shell script and executed it, but that wasn’t a good approach: every time I wanted to add more stuff I had to modify the script file, and I ended up with 500+ lines of code in a single file. After that, I decided to split the script into classes and create a PHP class that calls the machine learning methods separately, but it felt very messy, and every time I wanted to use a new function I had to create a new PHP method. Finally, I said to myself: “Why not an API?” It’s flexible and it allows you to use data more quickly and efficiently. That was the answer, but there was still a problem: I wanted to get the product data from somewhere other than Magento. The solution was another app that holds the product information, an app that Magento could use to import products and that Python could use to create training data.

In the end, I tested the search using 100 products, and the results were pretty accurate.

It was fun studying Magento 2 and machine learning and I thought that maybe some of you guys would like to know how it works.

About the article

In this article we will focus on the basics of machine learning by creating a custom search module in Magento 2. The module connects to a Python API that returns the best product for our search query using an approach called classification (we will find out what this is later in the article), plus a short algorithm that finds related products based on the query words.

I attached a zip file containing the Magento 2 module and the machine learning API. I must tell you from the start that the zip contains only the machine learning parts of the module; some parts (for example the deserialization/serialization logic, the product information app, etc.) have been removed, since this is a machine learning tutorial.

Technologies that I used:

  • Django framework (Python)
  • Scikit-learn (Python machine learning library)
  • Magento 2 (PHP7 example)
  • Miniconda (package management for multiple programming languages)
  • Coffee (a programmer’s best friend)


Why is machine learning important?

Machine learning can automatically produce models that analyze complex data, and the more data we have, the more accurate the results tend to be. This can help businesses understand what customers want, decide which solutions are best to follow, and find new opportunities to sell products.

For example, let’s say you have a Magento store that sells music. By using machine learning you can analyze what types of songs a customer is buying and recommend products based on the data you get. This increases the chance of the customer buying those products.

What is Machine Learning?

Wikipedia describes it best: “Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed”.

How can we do this?

The two most common types of machine learning tasks are supervised and unsupervised learning.

Supervised Learning

Example: Classification

In this type of learning you present the computer with some inputs and their desired outputs, so you can ‘teach’ it what is what.

Human explanation:

You are a little kid walking with your dad at the zoo. At one point, your dad points to an animal and says “Look son, that is a turtle”. So what do you do? You analyze the turtle and get the features that describe it ([“has a shell”, “is green”, “has a flat shape”, etc.]), and you put a label on them ([“has a shell”, “is green”, “has a flat shape”] => turtle). The next time you see another type of turtle (for example, Trionychidae), you still know what it is because of the features the two have in common.

The above example is known as Classification.
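The turtle story can be sketched as a toy classifier. This is only an illustration of the idea, not a real machine learning model: the animals, the features and the "most shared features wins" rule are all invented for this example.

```python
# Labeled examples: a set of observed features mapped to a label.
KNOWN_ANIMALS = [
    ({"has a shell", "is green", "has a flat shape"}, "turtle"),
    ({"has fur", "has whiskers", "meows"}, "cat"),
]

def classify(features, examples):
    """Return the label of the example that shares the most features."""
    best_label, best_overlap = None, 0
    for known_features, label in examples:
        overlap = len(features & known_features)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

# A different kind of turtle still shares enough features to be recognized.
new_animal = {"has a shell", "has a flat shape", "is brown"}
print(classify(new_animal, KNOWN_ANIMALS))  # turtle
```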

Unsupervised Learning

Example: Clustering

In this type of learning, you don’t assign a label to the input, you let the computer find features on its own.

Human explanation:

You are at home and you find a box of old newspapers. Nothing says what type of newspapers they are, so you start reading them. After a while you start to find similar features between newspapers: some talk about politics, some about sports, some don’t make any sense at all. Now, if someone came in and said “Hey, can you please sort those newspapers?”, it would be easy because you have already found the similarities.

The above example is known as Clustering.
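The newspaper story can be sketched in the same spirit. The snippet below is a toy grouping by shared words, written only for illustration; it is not a real clustering algorithm such as k-means, and the "newspapers" are made-up keyword strings.

```python
def group_by_shared_words(documents):
    """Greedily put each document in the first group it shares a word with."""
    groups = []  # each group: (set of words seen so far, list of documents)
    for document in documents:
        words = set(document.lower().split())
        for group_words, members in groups:
            if words & group_words:
                group_words |= words  # in-place update of the group's word set
                members.append(document)
                break
        else:
            # No existing group shares a word, so start a new one.
            groups.append((words, [document]))
    return [members for _, members in groups]

newspapers = [
    "election vote parliament",
    "football match goal",
    "parliament debate election",
    "goal striker football",
]
# No labels were given, yet politics and sports end up in separate groups.
print(group_by_shared_words(newspapers))
```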

There are other approaches of Supervised and Unsupervised learning that you can check here.

Before we get started, I suggest you download the zip that I attached, which contains the module and the API, so it would be easier to follow the steps below.


1) Machine learning algorithm

a) Training data

Let’s say we have three products (books, to be more precise: A Game of Thrones, Winter, Speaker for the Dead) and, when somebody searches for something, we want to return the product that best matches the search query.

One way to achieve this in machine learning is to have some training data and labels. A training sample is a piece of information (for example the string “winter is coming”) with a label attached to it (for example game_of_thrones). We pass both to our algorithm to show it which category the sample belongs to.
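As a rough sketch, training data can be kept as pairs of text and label. The sample sentences below are invented for illustration, and the two helper functions only mirror the names get_training_data() and get_labels_for_training_data() used later in the article; how the real API stores its data is not shown here.

```python
# Hypothetical training samples: (text, label) pairs invented for this example.
TRAINING_SAMPLES = [
    ("winter is coming", "game_of_thrones"),
    ("the iron throne", "game_of_thrones"),
    ("snow and ice in the cold season", "winter"),
    ("ender speaks for the dead", "speaker_for_the_dead"),
]

def get_training_data(samples):
    """Return only the text part of each (text, label) pair."""
    return [text for text, _ in samples]

def get_labels_for_training_data(samples):
    """Return only the labels, in the same order as the texts."""
    return [label for _, label in samples]

texts = get_training_data(TRAINING_SAMPLES)
labels = get_labels_for_training_data(TRAINING_SAMPLES)
print(texts[0], "=>", labels[0])  # winter is coming => game_of_thrones
```

The important part is that texts and labels stay aligned: the label at position i describes the text at position i.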

b) Tokenization

Tokenization is the process of transforming sentences into words.

Example: The string “winter is coming” will be transformed into the array [“winter“, “is“, “coming“].
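A minimal tokenizer can be written with a regular expression. This is a simplified stand-in for illustration; the tokenizer inside a library such as scikit-learn has more options and slightly different rules.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())

print(tokenize("winter is coming"))  # ['winter', 'is', 'coming']
```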

In the API, I created the method get_tokenized_counts() for this; it returns the token counts of our training data:

def get_tokenized_counts(self):
    # Learn the vocabulary from the training texts and return their token counts
    return self.vectorizer.fit_transform(self.train_data.get_training_data())

c) Term frequency inverse document frequency

Term frequency counts how many times a term appears in a text. For example, take these two strings:


1. “John knows nothing”

2. “John likes fighting”

Together, the two strings produce the vocabulary: [“John”, “knows”, “nothing”, “likes”, “fighting”]

The frequency of each vocabulary word in the two strings is represented by the arrays:

1. [1, 1, 1, 0, 0]

2. [1, 0, 0, 1, 1]
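The two arrays can be reproduced with a few lines of plain Python. This is only a sketch: it keeps the vocabulary in order of first appearance so that the output matches the arrays above, whereas scikit-learn's CountVectorizer sorts its vocabulary, so its column order would differ.

```python
import re

def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())

def build_vocabulary(sentences):
    """Collect unique tokens in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = tokenize(sentence)
    return [tokens.count(word) for word in vocabulary]

sentences = ["John knows nothing", "John likes fighting"]
vocabulary = build_vocabulary(sentences)
vectors = [count_vector(s, vocabulary) for s in sentences]
print(vocabulary)  # ['john', 'knows', 'nothing', 'likes', 'fighting']
print(vectors)     # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```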

The biggest problem is that some words (“a”, “an”, “the”) appear far more often than others without telling us much about the text. To fix this, inverse document frequency is used: it checks whether a word is common or rare across all the texts and gives rare words a higher weight.
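Here is a small sketch of how such a weight can be computed. It uses the smoothed formula ln((1 + n) / (1 + df)) + 1, which to my knowledge is what scikit-learn's TfidfTransformer uses with its default settings; the three tiny "documents" are invented for the example.

```python
import math

def inverse_document_frequency(documents):
    """Return an IDF weight per word for a list of tokenized documents."""
    n_documents = len(documents)
    document_frequency = {}
    for tokens in documents:
        for word in set(tokens):  # count each word once per document
            document_frequency[word] = document_frequency.get(word, 0) + 1
    # Smoothed IDF: words found in every document get the lowest weight (1.0).
    return {
        word: math.log((1 + n_documents) / (1 + df)) + 1
        for word, df in document_frequency.items()
    }

documents = [
    ["the", "winter", "is", "coming"],
    ["the", "speaker", "for", "the", "dead"],
    ["the", "game", "of", "thrones"],
]
idf = inverse_document_frequency(documents)
print(idf["the"], idf["winter"])
```

A word like “the”, which shows up in every document, ends up with a lower weight than “winter”, which shows up in only one.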

In the API, the get_Tf_Idf() method returns the TF-IDF representation of the token counts:

def get_Tf_Idf(self):
    # Re-weight the raw token counts by inverse document frequency
    return self.tfidf_transformer.fit_transform(self.get_tokenized_counts())

d) Naive Bayes

Naive Bayes classifiers are a family of probabilistic classifiers that apply Bayes’ theorem with a naive assumption of independence between features (two events are independent if the occurrence of one does not affect the probability of the other).

Example of Bayes theorem:

1. Let’s say we have two known types of pill (dangerous and safe) and an unidentified pill, and we need to check whether it’s dangerous or safe.

2. Let’s also assume that all three pills have these features (Long and Short). For each pill we have the following data:

Next we calculate the probability of the unidentified pill. To do this we need to first calculate the base rate, evidence and likelihood probabilities of our pills.

Base rate:

P (Dangerous) = 600 / 1500 = 0.4

P (Safe) = 400 / 1500 = 0.267

P (Unidentified) = 500 / 1500 = 0.333

Evidence:

P (Long) = 900 / 1500 = 0.6

Likelihood (the long pills of a type divided by all the pills of that type):

P (Long | Dangerous) = 500 / 600 = 0.833

P (Long | Safe) = 100 / 400 = 0.25

P (Long | Unidentified) = 300 / 500 = 0.6

Pill probability, given that the pill is long:

Dangerous = P (Long | Dangerous) * P (Dangerous) / P (Long) = 0.833 * 0.4 / 0.6 ≈ 0.56

Safe = P (Long | Safe) * P (Safe) / P (Long) = 0.25 * 0.267 / 0.6 ≈ 0.11

Unidentified = P (Long | Unidentified) * P (Unidentified) / P (Long) = 0.6 * 0.333 / 0.6 ≈ 0.33

The three probabilities add up to 1, and a long pill is most likely a dangerous one.
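The same calculation can be sketched in Python from raw counts. The counts below are an assumption read off the fractions in the text (600 dangerous pills of which 500 are long, 400 safe of which 100 are long, 500 unidentified of which 300 are long). Here the likelihood P(Long | type) is the share of long pills within each type, so the resulting probabilities add up to 1.

```python
# Assumed counts per pill type: (total pills, long pills).
PILL_COUNTS = {
    "dangerous": (600, 500),
    "safe": (400, 100),
    "unidentified": (500, 300),
}

def posterior_given_long(counts):
    """Apply Bayes' theorem to get P(type | Long) for every pill type."""
    total = sum(n for n, _ in counts.values())
    total_long = sum(n_long for _, n_long in counts.values())
    evidence = total_long / total  # P(Long)
    result = {}
    for label, (n, n_long) in counts.items():
        prior = n / total        # P(type), the base rate
        likelihood = n_long / n  # P(Long | type)
        result[label] = likelihood * prior / evidence
    return result

posteriors = posterior_given_long(PILL_COUNTS)
for label, probability in posteriors.items():
    print(f"P({label} | Long) = {probability:.2f}")
```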

The example above is why Naive Bayes is so popular, because at the end of the day, it’s just simple math.

In the method get_prediction() I used scikit-learn’s MultinomialNB class to predict the outcome:

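# Requires: from sklearn.naive_bayes import MultinomialNB
def get_prediction(self, query_string):
    # Train the classifier on the TF-IDF features and their labels
    clf = MultinomialNB().fit(self.get_Tf_Idf(), self.train_data.get_labels_for_training_data())

    # Transform the query with the same, already fitted, vectorizer and transformer
    tokenized_data = self.vectorizer.transform([query_string])
    tfidf = self.tfidf_transformer.transform(tokenized_data)

    # predict() returns one label per input row, so take the first one
    prediction = clf.predict(tfidf)[0]

    return prediction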
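To show what happens inside such a classifier, here is a from-scratch sketch of a multinomial Naive Bayes text classifier in plain Python. It is not the code from the API and not scikit-learn's implementation: it works on raw word counts instead of TF-IDF values, uses add-one (Laplace) smoothing, and the training sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def train(samples):
    """Count words per label and how many samples each label has."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def predict(query, word_counts, label_counts):
    """Return the label with the highest log-probability for the query."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_samples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_samples)  # log P(label)
        total_words = sum(word_counts[label].values())
        for word in tokenize(query):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [
    ("winter is coming", "game_of_thrones"),
    ("the iron throne", "game_of_thrones"),
    ("speaker for the dead", "speaker_for_the_dead"),
    ("ender and the piggies", "speaker_for_the_dead"),
]
model = train(samples)
print(predict("winter", *model))  # game_of_thrones
```

The "naive" part is the inner loop: every word contributes its own probability independently of the other words in the query.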

As I mentioned at the beginning of the article, I also added a method, SimilarProducts().get(), that gets related products based on the query words. So if you have the “hello world” query string, every product that contains one of those words would be returned.
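The idea behind the related products lookup can be sketched as a simple word overlap check. The implementation of SimilarProducts().get() is not shown in this article, so the function and the product catalog below are hypothetical and only illustrate the principle.

```python
import re

# Hypothetical product catalog: product id -> product name.
PRODUCTS = {
    1: "A Game of Thrones",
    2: "Winter",
    3: "Speaker for the Dead",
}

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def get_similar_products(query, products):
    """Return ids of products whose name shares at least one word with the query."""
    query_words = tokenize(query)
    return [
        product_id
        for product_id, name in products.items()
        if query_words & tokenize(name)
    ]

print(get_similar_products("winter thrones", PRODUCTS))  # [1, 2]
```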

2) Magento module

a) Create the module structure

The first thing I did was to create a new module (Evozon_Search) with the following structure:

│   ├── Block/
│   │   ├── Search.php
│   ├── Controller/
│   │   ├── Index/
│   │   │   ├── Index.php
│   ├── etc/
│   │   ├── frontend/
│   │   │   ├── routes.xml
│   │   ├── module.xml
│   ├── Helper/
│   │   ├── Data.php
│   ├── view/
│   │   ├── frontend/
│   │   │   ├── layout/
│   │   │    │   ├── default.xml
│   │   │    │   ├── search_index_index.xml
│   │   │   ├── templates/
│   │   │    │   ├──
│   │   │    │   ├── search.phtml
│   ├── registration.php

b) Replace the search template

In short, what I did was to overwrite the basic search template in module.xml with a new one that points to the custom search controller:


c) Contact the Machine learning API

This was all done in the controller, which is a simple one. It has a constant that points to the API:

const API_PRODUCTS_URL =  "";

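And a method that gets the JSON response and transforms it into an object:

// Assumes $this->curl is Magento's HTTP client (Magento\Framework\HTTP\Client\Curl)
public function getSearchProducts() : stdClass
{
    // URL-encode the query so it is safe to append to the API URL
    $getQueryString = urlencode($this->query);
    $urlGetPath = sprintf("%s%s", self::API_PRODUCTS_URL, $getQueryString);

    // Send the GET request, then read the response body
    $this->curl->get($urlGetPath);
    $response = $this->curl->getBody();

    return json_decode($response);
}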

(The above code is where the serialization/deserialization was removed).
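If you want to try the API from Python instead of Magento, the same two steps (build the URL, decode the JSON) can be sketched as below. The endpoint and the response fields are hypothetical, since the real API URL and response format are not given in this article, and the snippet does not send any request.

```python
import json
from urllib.parse import quote_plus

# Hypothetical endpoint; replace it with the URL of your own API.
API_PRODUCTS_URL = "http://localhost:8000/api/products/?q="

def build_search_url(query, base_url=API_PRODUCTS_URL):
    """URL-encode the query and append it to the API base URL."""
    return f"{base_url}{quote_plus(query)}"

def parse_products(response_body):
    """Decode the JSON body returned by the API into Python objects."""
    return json.loads(response_body)

print(build_search_url("winter is coming"))
# http://localhost:8000/api/products/?q=winter+is+coming

# Hypothetical response shape, for illustration only.
sample_body = '{"best_product": "game_of_thrones", "related": ["winter"]}'
print(parse_products(sample_body)["best_product"])  # game_of_thrones
```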

Now if we enter the string “winter” in the search box, for example:

we get the result:



Using classification, you can get pretty fast and accurate results, and if you want the price to be a factor in the search, you can always just add another feature.

I would use machine learning in combination with a custom search algorithm in applications where I have a lot of text data to process. For the rest of the applications, I think I’m going to stick to the regular search.

