Revisiting an old friend. Hi CloudSearch, how have you been?

In our previous post we introduced you to CloudSearch, the good old free-text search service offered by AWS. We promised back then that we would follow up with a more hands-on post about the search service towards which we’ve developed a more-love-than-hate relationship during the 2+ years we’ve spent together. And here it is, the second post in the CloudSearch series, aimed at helping you set up a search domain on your own, feed it data and finally perform some basic searches. Get ready to roll up your sleeves, hack some configuration and code, advance through the steps described below and, at the end, enjoy that sweet feeling of having mastered yet another beast.

Creating the search domain

In order to start using CloudSearch you obviously need an AWS account and then you have to create a CloudSearch domain.

In AWS parlance, a CloudSearch domain is a Lucene index accessible through two endpoints: one for document upload and one for performing searches. The domain is hosted on an AWS instance, which comes in various hardware configurations and thus various running costs. The domain itself must be given a unique name among all domains inside an AWS account, which is then used as the base for the domain's unique HTTP endpoints.

The search domain we've created as a sandbox for this article will store movie data that we want to make searchable. We’ll therefore name our domain “movies” and, once AWS sets it up and hands us the two endpoints, we’ll import data into it and then search for movies by various criteria.
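To make the endpoint naming concrete, here is a small PHP sketch of how the two endpoint host names are composed (the hash segment is generated by AWS per domain, so the value below is only a placeholder):

```php
<?php
// Illustrative only: AWS generates the hash segment when the domain is
// created; you read the real endpoints from the console or the API.
function buildEndpoint(string $type, string $domainName, string $hash, string $region): string
{
    // $type is 'doc' for the document upload endpoint,
    // 'search' for the search endpoint
    return sprintf('%s-%s-%s.%s.cloudsearch.amazonaws.com', $type, $domainName, $hash, $region);
}

echo buildEndpoint('search', 'movies', 'somehashhere', 'eu-central-1');
// search-movies-somehashhere.eu-central-1.cloudsearch.amazonaws.com
```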

When it comes to creating and setting up a new search domain, you can choose one of the following options:

  • a web interface;
  • the AWS CLI tool to automate the process;
  • your preferred programming language, using the AWS SDK (provided in a variety of languages, including PHP) to build a setup tool yourself.

Setting up a CloudSearch domain using the web console

While the most beginner-friendly way to set up a CloudSearch domain, the web interface quickly shows its shortcomings. One of them is that it becomes rather unstable as the number of domains in your AWS account rises. In our experience we felt quite frustrated at times by random error messages telling us we had no rights to perform some actions; then, after a page refresh, we would magically regain our privileges.

Another aspect is simplicity. Simplicity is always good, except when you have to quickly check configuration options and change fine details. While the previous claims can be considered subjective, or perhaps an isolated personal experience, I think we can all agree, as developers, that setting up several domains with mouse clicks is never something we enjoy doing. Therefore, we are going to guide you through the setup process using the other options.

Setting up a CloudSearch domain using the AWS CLI

The Amazon web console helps you create and fully configure your CloudSearch domains in a few easy steps. But you’ll find yourself lost in a sea of clicks and repetitive tasks if you need to configure tens or even hundreds of similar domains. To back up that statement: for one of our projects we had to maintain over one hundred domains, performing bulk configuration changes and suggester builds with each release that required changes at the domain level. Several teams depended on our work, as well as the preprod environment and the live site. Without a tool to automate these tasks, we would still be clicking buttons even now. The AWS CLI came to the rescue.

Installing the AWS CLI

Amazon provides a unified CLI tool to manage all their services, named the AWS CLI. More details on what it is and how to install it on various operating systems can be found here[http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html].

We are going to use the AWS CLI tool to set up our search domain from the command line. Although there is the choice of using the more appealing web console, the CLI allows you to automate a lot of maintenance tasks by using setup scripts.

The first thing you need to know about the AWS CLI is that it is written in Python, so it needs the interpreter to run; moreover, it is installed as a Python package.

We installed it using pip (the Python package management system), so first of all you need to have pip installed.

To check whether pip is installed:

> pip --version

If you get a command not found message, then you have to install it. This is something pretty easy to do:

> curl -O https://bootstrap.pypa.io/get-pip.py
> python get-pip.py --user

The last thing to do is verify that the path where the script installed pip (printed on screen after the install) has been added to your PATH environment variable. And that's it: now we can install the AWS CLI.

In order to install it with pip, run the following:

> pip install awscli --upgrade --user

After this command runs you will have the aws command-line tool installed in the user directory for Python packages. You can run the same command later to update an older version of the package.

The command above installs the aws script in the ~/.local/bin directory (note the --user option). This means each user can have their own version of the aws script. One last thing: make sure that path is also in your PATH environment variable.

Check the successful install with:

> aws --version
aws-cli/1.11.160 Python/2.7.12 Linux/4.10.0-35-generic botocore/1.7.18

There is one more thing we have to do before using the AWS CLI: configure it with the access key, the secret key and the default region. Documentation for that can be read here.

Let’s do a quick configuration by running:

> aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: eu-central-1
Default output format [None]: json

The command asks you for the keys, the default region for the domains and the command output format. Details about obtaining the keys can be found here. Now you are good to go.

Running CLI commands to set up the movie search domain

The documentation for the part of AWS CLI we are interested in can be found here. Here’s the list of things we have to do in order to have a fully functional CloudSearch domain:

  • 1. create a domain;
  • 2. configure access policies;
  • 3. create an analysis scheme;
  • 4. configure index fields;
  • 5. trigger domain reindex (required after any changes to the schema).

1. Create a domain

First thing is to create the new domain. In order to do that, we’ll execute the create-domain command.

> aws cloudsearch create-domain --domain-name movies

The domain is created right away, but it takes a few minutes to become active.

2. Configure access policies

Next, configure access policies. Documentation for doing that can be found here.

We want to allow anonymous users to access the search and suggestions endpoints, but prevent them from indexing documents. The IAM access policy we use for that is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["*"]
      },
      "Action": [
        "cloudsearch:search",
        "cloudsearch:suggest"
      ]
    }
  ]
}

We put this JSON in a file named access-policies.json, then run the AWS CLI command to apply it to the domain we’ve just created:

> aws cloudsearch update-service-access-policies --domain-name movies --access-policies "`cat access-policies.json`"

If you are not very experienced with access policies in AWS don’t panic, neither are we. The documentation here will give you a good starting point.

3. Create an analysis scheme

The next step is to create an analysis scheme for the text fields in our domain. The analysis scheme defines the expected language of the text, the type of stemming applied, and the stopwords and synonyms to use. It controls how the contents of text and text-array fields are handled during indexing and searching. We could use one of the predefined analysis schemes provided for all supported languages, but for the sake of example we will create our own scheme with a custom configuration.

We plan to have two types of analysis available for each language on the movies domain: one with stemming and one without. Here is the AWS CLI command to create an analysis scheme for Romanian with the name “fuzzy_romanian” (the name must follow the regexp [a-z][a-z0-9_]*), with full algorithmic stemming, some synonyms and some stopwords defined:

> aws cloudsearch define-analysis-scheme \
    --domain-name movies \
    --analysis-scheme 'AnalysisSchemeName="fuzzy_romanian",AnalysisSchemeLanguage="ro",AnalysisOptions={Synonyms="{\"groups\":[[\"laptop\",\"notebook\",\"calculator portabil\"]],\"aliases\":{\"telefon\":[\"smartphone\",\"phablet\"]}}",Stopwords="[\"un\",\"o\",\"pe\",\"la\"]",AlgorithmicStemming="full"}'
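The escaping inside the Synonyms and Stopwords values above is easy to get wrong by hand. If you script your domain setup, one option is to build those embedded JSON strings in PHP and let json_encode() handle the quoting; a small sketch:

```php
<?php
// Build the Synonyms and Stopwords option values (JSON documents embedded
// as strings) instead of hand-escaping them on the command line.
$synonyms = json_encode([
    'groups'  => [['laptop', 'notebook', 'calculator portabil']],
    'aliases' => ['telefon' => ['smartphone', 'phablet']],
]);
$stopwords = json_encode(['un', 'o', 'pe', 'la']);

echo $synonyms;  // {"groups":[["laptop","notebook","calculator portabil"]],"aliases":{"telefon":["smartphone","phablet"]}}
echo $stopwords; // ["un","o","pe","la"]
```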

While in the previous example we’ve used the shorthand syntax, we’ll showcase the JSON syntax in the next one. It creates an analysis scheme for Romanian but this time without algorithmic stemming, stopwords or synonyms, therefore named “romanian”:

{
  "AnalysisSchemeName":"romanian",
  "AnalysisSchemeLanguage":"ro",
  "AnalysisOptions":{
    "Stopwords": "[]",
    "Synonyms": "[]",
    "AlgorithmicStemming":"none"
  }
}

File: romanian.json

And the command to deploy the definition above:

> aws cloudsearch define-analysis-scheme \
      --domain-name movies \
      --analysis-scheme "`cat romanian.json`"

Both commands output a JSON result after a successful run, with details about the analysis scheme that was created. More details on the options available when defining analysis schemes can be found here.

4. Configure index fields

We are now going to define a number of index fields for our domain, so that each time we index documents, CloudSearch knows how to handle each of the fields in our documents. We won't show the commands for the entire movies schema, just enough examples to make it clear how it's done:

  • Adding a single-valued text field that will handle text in Romanian with full stemming and will allow highlighting and returning of the stored content. We are using a dynamic field to catch all document fields ending in _fuzzy_ro at index time:
> aws cloudsearch define-index-field \
    --domain-name movies \
    --name '*_fuzzy_ro' \
    --type text \
    --return-enabled true \
    --highlight-enabled true \
    --analysis-scheme fuzzy_romanian
  • Adding a multi-valued text field to handle multiple text contents in Romanian, but without stemming applied:
> aws cloudsearch define-index-field \
   --domain-name movies \
   --name '*_multi_ro' \
   --type text-array \
   --return-enabled true \
   --highlight-enabled true \
   --analysis-scheme romanian
  • Adding two numeric fields for votes: count and average:
> aws cloudsearch define-index-field \
   --domain-name movies \
   --name vote_count \
   --type int \
   --return-enabled true
> aws cloudsearch define-index-field \
   --domain-name movies \
   --name vote_average \
   --type double \
   --return-enabled true
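The dynamic field names above (*_fuzzy_ro, *_multi_ro) are matched against the incoming document field names at index time. CloudSearch does the matching server-side; the rule is a simple wildcard test that can be sketched like this:

```php
<?php
// A dynamic field pattern contains a single leading or trailing wildcard;
// fnmatch() here just illustrates which document fields it would catch.
function matchesDynamicField(string $pattern, string $fieldName): bool
{
    return fnmatch($pattern, $fieldName);
}

var_dump(matchesDynamicField('*_fuzzy_ro', 'title_fuzzy_ro')); // bool(true)
var_dump(matchesDynamicField('*_fuzzy_ro', 'vote_count'));     // bool(false)
```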

5. Trigger domain reindex

To apply the schema changes, we have to tell the search domain to start indexing its documents using the latest indexing options. You can do that either from the dashboard or through the AWS CLI:

> aws cloudsearch index-documents --domain-name movies

A complete domain setup using the AWS CLI only can be found here.

Setting up a CloudSearch domain using the AWS SDK

The SDKs provided by Amazon simplify the use of AWS services in your applications, with implementations in all the major programming languages. As PHP is one of them, in this part of the article we'll focus on the basic configuration required to use the SDK for PHP, as well as on performing some simple operations. The version of the SDK used is 3.82.1 and it can be installed by following the steps presented here. The recommended way to create clients for the AWS services is through the class \Aws\Sdk. The class we are interested in is CloudSearchClient, which is used for domain creation and configuration.

Create and configure a domain using SDK

We start by setting the AWS credentials, and then we are ready to work with the SDK. The steps required to get a functional domain are:

  • 1. create SDK client;
  • 2. create domain;
  • 3. configure access policies;
  • 4. configure analysis scheme;
  • 5. configure fields;
  • 6. start indexing domain.

So let’s start creating our new movies domain.

1. Create SDK client

Creating the client mentioned before is pretty straightforward:

$sdk = new \Aws\Sdk([
    'region'      => 'eu-central-1',
    'version'     => '2013-01-01',
    'credentials' => [
        'key'    => $credentials['accessKey'],
        'secret' => $credentials['secretKey'],
    ]
]);

$cloudSearchClient = $sdk->createCloudSearch();

After the SDK client is created, we are ready to proceed to the creation of the domain.

2. Create a domain

To create a new domain you only have to provide the domain name. After the domain has been successfully created, we can begin to configure it.

$cloudSearchClient->createDomain([
    'DomainName' => $domainName,
]);

3. Configure access policies

This configuration step is important because here we set the rules controlling access to our domain. These rules, which have to be in JSON format, determine which actions different users or processes may perform. If you want anonymous users to be able to search, get suggestions and index documents, take a look at the example below:

$cloudSearchClient->updateServiceAccessPolicies([
    'DomainName' => $domainName,
    'AccessPolicies' => '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "*"
            ]
          },
          "Action": [
            "cloudsearch:*"
          ]
        }
      ]
    }',
]);

If you want to find out more about access policies, just click here.

4. Configure analysis scheme

An analysis scheme helps you define language-specific processing options for each text and text-array field. When you create a new scheme, you have several options, such as:

  • stemming dictionary:
    • maps words to a common root word;
  • synonyms:
    • words that have the same meaning or refer to the same concept;
    • two types of synonyms: groups and aliases;
  • stopwords:
    • words that should be ignored during search or indexing;
    • for example: “the”, “and”, “or”, “a”, “an”.

In the following examples we create schemes for two languages: Romanian and English. The English one uses stemming, stopwords and synonyms; the Romanian one is simpler.

$cloudSearchClient->defineAnalysisScheme([
    'DomainName' => $domainName,
    'AnalysisScheme' => [
        'AnalysisSchemeName' => 'romanian',
        'AnalysisSchemeLanguage' => 'ro',
        'AnalysisOptions' => [
            "AlgorithmicStemming" => "none",
        ]
    ]
]);
$cloudSearchClient->defineAnalysisScheme([
    'DomainName' => $domainName,
    'AnalysisScheme' => [
        'AnalysisSchemeName' => 'fuzzy_english',
        'AnalysisSchemeLanguage' => 'en',
        'AnalysisOptions' => [
            'AlgorithmicStemming' => 'full',
            'Synonyms' => "{\"groups\": [[\"world war ii\",\"wwii\",\"second world war\"]]
, \"aliases\": {\"star wars\": [\"jedi\"],\"captain america\": [\"team thor\",\"avengers\"]}}",
            'Stopwords' => "[\"a\", \"an\", \"and\", \"are\", \"as\", \"at\", \"be\", \"but\"]"
        ]
    ]
]);

5. Configure fields

To index data for our application, we need to create and configure fields for our AWS CloudSearch domain. Each document will have fields corresponding to the movie data. Let’s find out how to configure the different types of fields:

  • Numeric fields: vote_count and vote_average
$cloudSearchClient->defineIndexField([
    'DomainName' => $domainName,
    'IndexField'      => [
        'IndexFieldName' => 'vote_count',
        'IndexFieldType'   => 'int',
    ],
]);
$cloudSearchClient->defineIndexField([
    'DomainName' => $domainName,
    'IndexField'      => [
        'IndexFieldName' => 'vote_average',
        'IndexFieldType'   => 'double',
    ],
]);
  • Date field: release_date
$cloudSearchClient->defineIndexField([
    'DomainName' => $domainName,
    'IndexField'      => [
        'IndexFieldName' => 'release_date',
        'IndexFieldType'   => 'date',
    ],
]);
  • Multiple text value (dynamic field)
$cloudSearchClient->defineIndexField([
    'DomainName' => $domainName,
    'IndexField'      => [
        'IndexFieldName' => '*_multi_en',
        'IndexFieldType'   => 'text-array',
        'TextArrayOptions' => [
            'AnalysisScheme' => 'english',
        ],
    ],
]);
  • Single text value (dynamic field)
$cloudSearchClient->defineIndexField([
    'DomainName' => $domainName,
    'IndexField'      => [
        'IndexFieldName' => '*_fuzzy_en',
        'IndexFieldType'   => 'text',
        'TextOptions' => [
            'AnalysisScheme' => 'fuzzy_english',
        ],
    ],
]);

6. Reindex domain

After all the configurations are done, we need to trigger reindexing in order to apply all changes to our new domain.

$cloudSearchClient->indexDocuments([
    'DomainName' => $domainName,
]);

To see a full setup of a new domain using the SDK, we’ve created a simple Symfony CLI application containing all the required SDK functionality. First you need to set the AWS credentials by creating a configuration file from the aws-credentials.txt.dist template; after the commands are executed, you have a ready-to-use domain with data about movies.

All existing operations to setup a domain using SDK can be consulted here.

As you can see, the SDK is a very convenient way to interact with CloudSearch, because you don’t have to worry about low-level API calls. Another out-of-the-box benefit is that all clients provided by the SDK return promises when you invoke any of the methods suffixed with Async. Detailed documentation about this feature can be found here.

Indexing data into our domain

When it comes to indexing data into the search domain, as a PHP web developer you will most likely want to be able to write a PHP script that takes data from some source, be it a relational database, plain text files, a web service or others, and then index a flattened version of your data into the search domain.

The key takeaways here are the flat structure of the documents you feed to CloudSearch and the details of the fields you choose to index: their type, naming and relevance. One piece of advice about selecting the fields you index into the search engine: select only what you really need, nothing more. Remember it's a paid service, and the cost increases with content size and the number of instances you have to use. Therefore, to keep your search cost-effective, make sure you are not using the search domain as a primary flat database for your application.

We are going to continue our search adventure by creating a PHP script that takes movie data from a very nice source of free information about movies, The Movie Database (TMDB), and uploads it as documents to the domain we’ve previously prepared.

Of course there are other ways to get data into a domain, such as using the cURL command or the more user-friendly Postman application to prepare and send requests to the documents endpoint of the domain. Another way of sending documents to a domain is the AWS CLI, which provides a command that takes the contents of a JSON file and uploads them to the specified domain.

Although you can easily use cURL directly in a PHP script to send document prepared JSON to CloudSearch, the more elegant approach is to make use of the AWS PHP SDK. And this is exactly what we’ll do next. We have prepared a small Symfony console application that allows import from TMDB and then upload to the previously created domain. Let’s discover the key parts of this small application next.

The CLI application we are going to explore is publicly available on GitHub. It makes use of the TMDB SDK library php-tmdb/api to access the TMDB API and of the AWS PHP SDK aws/aws-sdk-php, and is built on the latest versions of the Symfony components.

We will now focus on some key parts of the code to explain the command that imports data from TMDB into CloudSearch in one run, the command we have named “full:moviesImportAndUpload”. Here is a usage example of the command:

php console.php full:moviesImportAndUpload en,ro 5 2017,2018 
movies-some_hash_here.eu-central-1.cloudsearch.amazonaws.com

The arguments to the command are as follows:

  • A list of two-letter language codes specifying the languages in which to import movie data. In our domain-creation examples we made sure to support English and Romanian for the text fields, so in the usage above we request data for both languages.
  • The number of pages to request from the TMDB API (20 movies per page is default). The number of pages will be requested per year and language, so make sure you put a reasonable number there.
  • A list of years to get movies for, comma separated.
  • The full domain name without the “doc-” prefix.

We’ve structured the import tool so that command classes are decoupled from the logic that actually does the import and upload, and we’ve made use of Dependency Injection in a very simple form. The wiring of services is done directly in the entry point script, console.php:

$container = new ContainerBuilder();
// The API key for TMDB and the keys for AWS are stored in a config 
// YAML file as parameters
$fileLocator = new FileLocator([__DIR__.'/config']);
$loader = new YamlFileLoader($container, $fileLocator);
$loader->load('config.yaml');

// We use a document batch factory to create batches of imported documents
$container->setParameter('data_dir', __DIR__.'/data');
$container->register('document_batch_factory', MovieDocumentBatchFactory::class)
    ->setArguments([$container->getParameter('data_dir')]);

// All the logic for importing from TMDB is encapsulated in an importer service
$apiKey = $container->getParameter('tmdb')['api_key'];
$container->register('tmdb_movie_importer', TmdbMovieImporter::class)
    ->setArguments([$apiKey, new Reference('document_batch_factory')]);

// The CloudSearch API version used is the latest available
const AWS_VERSION = '2013-01-01';
// The region we want to create the search domain in is in Central Europe
const AWS_REGION = 'eu-central-1';
// We configure the DI container to offer a fully configured instance of
// the AWS SDK
$accessKey = $container->getParameter('cloudsearch')['accessKey'];
$secretKey = $container->getParameter('cloudsearch')['secretKey'];
$container->register('aws_sdk', Sdk::class)
    ->addArgument(
        [
            'credentials' => array(
                'key'    => $accessKey,
                'secret' => $secretKey,
            ),
            'version'     => AWS_VERSION,
            'region'      => AWS_REGION,
        ]
    );
// All the logic for uploading a document batch to CS is encapsulated in an 
// updater service which gets the SDK injected through constructor
$container->register('cloudsearch_updater', CloudsearchUploader::class)
    ->addArgument(new Reference('aws_sdk'));

Let’s now go to the command class and see how the two main steps of importing and uploading data are implemented:

protected function execute(InputInterface $input, OutputInterface $output)
    {
        // Obtaining the values passed for the 4 required arguments
        $languages = explode(',', $input->getArgument('languages'));
        $years = explode(',', $input->getArgument('years'));
        $pages = (int)$input->getArgument('pages');
        $domain = $input->getArgument('domain');

        $output->writeln('Importing movies data from TMDB...');
        $tmdbImporter = $this->getApplication()
            ->getContainer()
            ->get('tmdb_movie_importer');
        $movieBatch = $tmdbImporter
            ->importMovieBatch($years, $languages, $pages);

        $output->writeln('Uploading movies data to Cloudsearch...');
        $cloudSearchUploader = $this->getApplication()
            ->getContainer()
            ->get('cloudsearch_updater');
        $cloudSearchUploader->uploadMovieDocumentBatch($movieBatch, $domain);
    }

The code snippet above speaks for itself, so we won’t go into details about it; instead we’ll follow the natural flow of the code and present the snippet in the movie importer that does all the magic:

public function importMovieBatch(array $years, array $languages, int $pages): MovieDocumentBatch
    {
        $token = new \Tmdb\ApiToken($this->apiKey);
        $this->client = new \Tmdb\Client($token);

        $movieBatch = new MovieDocumentBatch();
        foreach ($years as $year)
        {
            $totalPages = 1;
            for ($page = 1; $page <= $pages && $page <= $totalPages; $page++)
            {
                foreach ($languages as $language)
                {
                    $selected = $this->client->getDiscoverApi()->discoverMovies(
                        [
                            'language'             => $language,
                            'page'                 => $page,
                            'primary_release_year' => $year,
                        ]
                    );

                    $totalPages = $selected['total_pages'];
                    $movies = $selected['results'];
                    $this->preProcessMovies($movies, $language);
                    $movieBatch->addMovies($movies, $language);
                }
            }
        }

        return $movieBatch;
    }

In the snippet above we start using the API that php-tmdb/api provides. We first create an ApiToken instance based on the API key we received from TMDB, then we create a new Client using that token. The Client class's public interface offers access to all the APIs in a very nice and clean way, but we’ll only make use of the “discover” API for now. The call to discoverMovies() accepts a series of filters, of which we use the language, the page and the primary release year, so that the preferences provided at the command line are applied.

Every page of imported movies undergoes a preprocessing stage in which the importer translates genre IDs into genre names by making an extra request to the genres API on TMDB. Please check the full class code on GitHub for all the details.

Finally, the movies are added to the document batch, which processes them by selecting only the fields we want in our CloudSearch domain. Please see the MovieDocumentBatch implementation on GitHub for the full logic of mapping the data to the language-agnostic and language-dependent fields we have prepared in the CloudSearch domain.
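To give an idea of what that mapping produces, here is a simplified, hypothetical sketch of the flattening (the field names follow the index fields defined earlier; the real logic lives in MovieDocumentBatch on GitHub):

```php
<?php
// Language-dependent fields get a language suffix so they are caught by the
// dynamic fields defined on the domain; 'genres' is assumed to already hold
// genre names (the importer's preprocessing stage resolves the IDs).
function flattenMovie(array $movie, string $language): array
{
    return [
        'id'                         => $movie['id'],
        'original_title'             => $movie['original_title'],
        'vote_count'                 => $movie['vote_count'],
        'vote_average'               => $movie['vote_average'],
        "title_{$language}"          => $movie['title'],
        "overview_fuzzy_{$language}" => $movie['overview'],
        "genres_multi_{$language}"   => $movie['genres'],
    ];
}

$doc = flattenMovie([
    'id'             => 438799,
    'original_title' => 'Overlord',
    'title'          => 'Overlord',
    'overview'       => 'On the eve of D-Day...',
    'genres'         => ['Horror', 'War'],
    'vote_count'     => 2000,
    'vote_average'   => 6.9,
], 'en');
// $doc now contains title_en, overview_fuzzy_en and genres_multi_en
```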

The second step of our command is the upload to CloudSearch. The main logic for that happens in the CloudsearchUploader::uploadMovieDocumentBatch() method:

public function uploadMovieDocumentBatch(MovieDocumentBatch $batch, string $domain): int
    {
        $client = $this->sdk->createCloudSearchDomain(
            ['endpoint' => $this->getEndpoint($domain)]
        );
        $documentsChunks = array_chunk($batch->getDocuments(), self::BATCH_SIZE);
        $uploaded = 0;
        foreach ($documentsChunks as $chunk)
        {
            $sdfChunk = $this->prepareSdfChunk($chunk);
            try
            {
                $client->uploadDocuments(
                    [
                        'documents'   => json_encode($sdfChunk),
                        'contentType' => 'application/json',
                    ]
                );
                $uploaded += count($sdfChunk);
            } catch (\Exception $ex)
            {
                // best-effort upload: we don't stop because a chunk failed
                print $ex->getMessage();
            }
        }
        return $uploaded;
    }

Uploading documents to a CloudSearch domain requires a domain client instance, easily obtainable with the createCloudSearchDomain() method of the AWS SDK object. The method takes an array of options as argument, but only the “endpoint” is required for a working client. To stay within the batch limitations imposed by CloudSearch, we split the documents into fixed-size chunks.
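The chunking itself is plain array_chunk(); for example, with a batch size of 2:

```php
<?php
// array_chunk() splits the flat document list into fixed-size batches;
// BATCH_SIZE is 2 here purely for illustration.
const BATCH_SIZE = 2;

$documents = [['id' => 1], ['id' => 2], ['id' => 3]];
$chunks = array_chunk($documents, BATCH_SIZE);
// $chunks[0] holds the first two documents, $chunks[1] the remaining one
```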

A note here: CloudSearch requires a special JSON format for the upload payload, called SDF (Search Document Format) in the documentation. Our documents are flat, so we need to convert them to SDF, hence the call to prepareSdfChunk(). The function's code is shown below so that you have the full picture:

private function prepareSdfChunk($chunk)
    {
        return array_map(
            function ($document) {
                $id = $document['id'];
                unset($document['id']);
                return [
                    'type'   => 'add',
                    'id'     => $id,
                    'fields' => $document,
                ];
            },
            $chunk
        );
    }
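To make the SDF shape concrete, here is the same transformation run standalone on a one-document chunk:

```php
<?php
// Each flat document becomes an SDF "add" operation, with the id pulled
// out of the field list (same logic as prepareSdfChunk() above).
$chunk = [
    ['id' => 438799, 'title_en' => 'Overlord', 'vote_average' => '6.9'],
];

$sdfChunk = array_map(
    function ($document) {
        $id = $document['id'];
        unset($document['id']);
        return ['type' => 'add', 'id' => $id, 'fields' => $document];
    },
    $chunk
);

echo json_encode($sdfChunk);
// [{"type":"add","id":438799,"fields":{"title_en":"Overlord","vote_average":"6.9"}}]
```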

And we’re done! We now have a rather simple implementation of an extract-transform-load process that will allow us to put some data into our movies domain.

The fun part: searching for movies

The structure of the movie documents

Before we start throwing queries at the search domain, it is a good idea to look at one of the documents we have uploaded and see what fields and data types we have at our disposal.

In order to fill up the domain with movies, we make use of the command-line tool previously built and run something similar to the following:

php console.php full:moviesImportAndUpload en,ro 20 2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018 movies-somehashhere.eu-central-1.cloudsearch.amazonaws.com

So here’s a random movie document from the domain we are going to search on:

A simple warm-up search

Requirement

We want to get all horror movies about the Second World War, ordered descending by their average vote.

Solution

We’ll use the simple parser for this purpose (see the q.parser parameter). The simple parser offers only a few configuration options for the query: we can negate terms (such as -civil in this case), and we can perform phrase searches or searches with a slop. We specify the fields we want to search in with the q.options parameter and set a boost for each field.

In order to filter only horror movies, we make use of the fq parameter and pass it a structured query on the genres_multi_en field (the text array field that does English analysis to the terms).

In short, the query parameters are the following:

q.parser:simple
q.options:{fields: ['title_en^2','overview_fuzzy_en^1']}
return:original_title,title_en,overview_fuzzy_en,vote_average
fq:genres_multi_en:'Horror'
sort:vote_average desc
q:world -civil war II

The cURL command for our needs might look like the following:

curl -X GET \
'http://search-movies-somehashhere.eu-central-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=simple&q.options={fields:%20[%27title_en^2%27,%27overview_fuzzy_en^1%27]}&return=original_title,title_en,overview_fuzzy_en,vote_average&fq=genres_multi_en:%27Horror%27&sort=vote_average%20desc&q=world%20-civil%20war%20II'
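If you are scripting searches from PHP, the same URL can be assembled with http_build_query() instead of hand-encoding it (the endpoint host below is a placeholder):

```php
<?php
// Build the search URL from the parameter list above; http_build_query()
// takes care of URL-encoding the values.
$endpoint = 'search-movies-somehashhere.eu-central-1.cloudsearch.amazonaws.com';

$params = [
    'q.parser'  => 'simple',
    'q.options' => "{fields: ['title_en^2','overview_fuzzy_en^1']}",
    'return'    => 'original_title,title_en,overview_fuzzy_en,vote_average',
    'fq'        => "genres_multi_en:'Horror'",
    'sort'      => 'vote_average desc',
    'q'         => 'world -civil war II',
];

$url = sprintf('http://%s/2013-01-01/search?%s', $endpoint, http_build_query($params));
```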

Here is an example of a full output:

{
    "status": {
        "rid": "9JHNjoMt8wesHwJ2",
        "time-ms": 73
    },
    "hits": {
        "found": 2,
        "start": 0,
        "hit": [
            {
                "id": "438799",
                "fields": {
                    "overview_fuzzy_en": "On the eve of D-Day during World War II, American paratroopers are caught behind enemy lines after their plane crashes on a mission to destroy a German Radio Tower in a small town outside of Normandy. After reaching their target, the paratroopers come to realize that besides fighting off Nazi soldiers, they also must fight against horrifying, bloody, and violent creatures that are a result of a secret Nazi experiment.",
                    "original_title": "Overlord",
                    "vote_average": "6.9",
                    "title_en": "Overlord"
                }
            },
            {
                "id": "153738",
                "fields": {
                    "overview_fuzzy_en": "Toward the end of World War II, Russian soldiers pushing into eastern Germany stumble across a secret Nazi lab, one that has unearthed and begun experimenting with the journal of one Dr. Victor Frankenstein. The scientists have used the legendary Frankenstein's work to assemble an army of super-soldiers stitched together from the body parts of their fallen comrades -- a desperate Hitler's last ghastly ploy to escape defeat",
                    "original_title": "Frankenstein's Army",
                    "vote_average": "5.7",
                    "title_en": "Frankenstein's Army"
                }
            }
        ]
    }
}

Some more searches

Requirement

Given we have configured the following alias synonyms:

{
  "star wars": [
    "jedi"
  ],
  "captain america": [
    "team thor",
    "avengers"
  ]
}

and we have the following group synonyms:

[
  [
    "world war ii",
    "wwii",
    "second world war"
  ]
]

we want to obtain all the movies that mention “avengers” in their overview; by “avengers” we mainly target the Captain America movies. We also want to highlight the words that triggered the matches.

Solution

We’ll use the dismax parser, which can take raw, unvalidated user input as the query text and generate match queries based on the provided list of fields and boosts to ensure the best results. We are going to use some dismax-specific parameters that you are invited to look up in the Lucene documentation.

In short, the query parameters are the following:

q.parser:dismax
q.options:{fields: ['overview_fuzzy_en^1'], phraseFields:['overview_fuzzy_en'], phraseSlop:2, tieBreaker: 0.1}
return:title_en,overview_fuzzy_en
sort:_score desc
q:avengers
highlight.overview_fuzzy_en:{format:'text'}

The cURL command for our needs might look like the following:

curl -X GET \
'http://search-movies-somehashhere.eu-central-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=dismax&q.options={fields:%20[%27overview_fuzzy_en^1%27],phraseFields:[%27overview_fuzzy_en%27],phraseSlop:2,%20tieBreaker:%200.1}&return=title_en,overview_fuzzy_en&sort=_score%20desc&q=avengers&highlight.overview_fuzzy_en={format:%27text%27}'
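A side note on tieBreaker: it only matters when several fields are searched. Dismax takes the best per-field score and adds tieBreaker times the remaining field scores, so a document matching strongly in one field still beats a document with several weak matches. A toy Python illustration of this combination (not CloudSearch’s actual scorer):

```python
def dismax_score(field_scores, tie_breaker=0.1):
    """Disjunction-max combination: best field score plus a fraction of the rest."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# One strong field match vs. two mediocre ones
print(dismax_score([2.0, 0.0]))  # the strong single-field match still wins
print(dismax_score([1.2, 1.1]))
```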

An example result from CloudSearch:

…
{
                "id": "211387",
                "fields": {
                    "overview_fuzzy_en": "The film takes place one year after the events of Captain America: The First Avenger, in which Agent Carter, a member of the Strategic Scientific Reserve, is in search of the mysterious Zodiac.",
                    "title_en": "Marvel One-Shot: Agent Carter"
                },
                "highlights": {
                    "overview_fuzzy_en": "The film takes place one year after the events of *Captain America*: The First *Avenger*, in which Agent Carter, a member of the Strategic Scientific Reserve, is in search of the mysterious Zodiac."
                }
            },
            {
                "id": "413279",
                "fields": {
                    "overview_fuzzy_en": "Discover what Thor was up to during the events of Captain America: Civil War.",
                    "title_en": "Team Thor"
                },
                "highlights": {
                    "overview_fuzzy_en": "Discover what Thor was up to during the events of *Captain America*: Civil War."
                }
            }
...
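Notice that “Team Thor” matched even though its overview never mentions the word “avengers”: that is the alias synonyms at work. For reference, CloudSearch stores both alias and group synonyms in a single Synonyms option of the domain’s analysis scheme, serialized as one JSON object with aliases and groups keys; a minimal Python sketch of composing that value from the definitions above:

```python
import json

# The same synonym definitions as in the requirement above
aliases = {
    "star wars": ["jedi"],
    "captain america": ["team thor", "avengers"],
}
groups = [["world war ii", "wwii", "second world war"]]

# The analysis scheme expects the Synonyms option as a single JSON string
synonyms_option = json.dumps({"aliases": aliases, "groups": groups})
print(synonyms_option)
```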
Requirement

We want to obtain all action movies that refer to spies. Let’s imagine the current language of the site is Romanian, so the search is performed on text fields with Romanian content. We also want to see only movies released between 2010 and 2018, so we put a range filter on the release date.

Solution

We’ll use a structured query and sort the results based on an expression that uses all the info we have about user ratings.

In short, the query parameters are the following:

q.parser:structured
return:title_ro,overview_fuzzy_ro,review_expr
sort:review_expr desc,_score desc
q:(and overview_fuzzy_ro:'spion' genres_multi_ro:'Acțiune')
fq:(range field='release_date' ['2010-01-01T00:00:00Z','2018-12-31T23:59:59Z'])
expr.review_expr:vote_average*vote_count

The cURL command for our needs might look like the following:

curl -X GET \
'http://search-movies-somehashhere.eu-central-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&return=title_ro,overview_fuzzy_ro,review_expr&sort=review_expr%20desc,_score%20desc&q=%28and+overview_fuzzy_ro:%27spion%27%20genres_multi_ro:%27Ac%C8%9Biune%27%29&fq=%28range+field%3D%27release_date%27%20[%272010-01-01T00:00:00Z%27,%272018-12-31T23:59:59Z%27]%29&expr.review_expr=vote_average%2Avote_count'

An example result from CloudSearch:

…
{
                "id": "192102",
                "fields": {
                    "title_ro": "Condamnat să ucidă",
                    "overview_fuzzy_ro": "Kevin Costner este Ethan Runner, un spion internațional implicat în dejucarea planurilor celor mai periculoși răufăcători..."
                },
                "exprs": {
                    "review_expr": "7246.799999999999"
                }
            },
            {
                "id": "23172",
                "fields": {
                    "title_ro": "Spionul din vecini",
                    "overview_fuzzy_ro": "Fostul spion al CIA, Bob Ho, își asumă cea mai dificilă atribuție până în prezent..."
                },
                "exprs": {
                    "review_expr": "3266.1"
                }
            }
...
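As a sanity check, review_expr in the output above is nothing more than vote_average * vote_count evaluated per document at query time, which is why the values look like raw float products. The same ranking can be reproduced client-side; a sketch with hypothetical vote data (the real values come back from the search domain):

```python
# Hypothetical hits; the real values come back from CloudSearch
hits = [
    {"title_ro": "Film A", "vote_average": 6.5, "vote_count": 1100},
    {"title_ro": "Film B", "vote_average": 8.1, "vote_count": 300},
]

# Mirrors expr.review_expr=vote_average*vote_count and sort=review_expr desc
for hit in hits:
    hit["review_expr"] = hit["vote_average"] * hit["vote_count"]
hits.sort(key=lambda h: h["review_expr"], reverse=True)

print([h["title_ro"] for h in hits])
```

Note that a widely rated average film can outrank a highly rated niche one; that is a deliberate property of this particular expression, not of CloudSearch itself.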

Oh CloudSearch, what wonderful memories we have together!

We’ve managed to get through all steps needed to start using CloudSearch, meaning we set up a search domain, then we imported data into it and finally we performed some basic searches to see it at work. Now, the time has come to say goodbye to CloudSearch with promises of future visits and a forever friendship...

Thank you for your patience and the time invested in following this journey of a beautiful friendship.

