Meet CloudSearch… not the coolest kid in class, but definitely someone you want as a friend

What is CloudSearch?

Amazon CloudSearch is a cloud-based search service provided by Amazon Web Services. It incorporates the Apache Lucene Core search engine library and provides search features similar to Apache Solr.

As a managed cloud service, Amazon CloudSearch offers simple domain configuration through the AWS Console, AWS CLI and AWS SDKs; automatic scaling for data; automatic monitoring and recovery for search domains; high availability; and cost-effective options.

When we think of CloudSearch, all the following features are worth mentioning: full text search, faceting, geospatial search, search suggestions, match highlighting, customizable relevance ranking, field weighting, synonyms and stopwords handling.

When communicating with CloudSearch, we can choose between the XML and JSON formats, which makes it pretty versatile.

Amazon CloudSearch can be regarded as an easy-to-integrate, customizable search feature for any kind of application that needs search. It can be added to a website without having to deal with provisioning, index management, monitoring and data partitioning.

Should I consider it as a search solution for my new project?

CloudSearch is surely a service worth taking into consideration. One thing to keep in mind is that there is no such thing as a “one size fits all” search solution. Any service you choose will eventually require a bit of fine-tuning to match your use case.

Since it’s a managed service in the AWS Cloud, it is very easy to set up a domain and to add searchable data to it. There is even an option to analyze the fields of a given document, so all you have to do is provide a JSON, XML or text file that contains your documents, and the configuration wizard will do all the grunt work.

Depending on your use case, managing relations between documents can become cumbersome at some point, because CloudSearch does not have built-in support for this feature. You can find a workaround for this shortcoming, like we did (more details on this topic later in the article).

To sum it up, AWS CloudSearch is a good solution if you don’t want the headache that comes with managing and provisioning server infrastructure and just want a quick and reliable search for a simple data model.

Should I consider it as an asset to add to my toolbelt?

From a developer’s point of view, CloudSearch is an endpoint that provides the application with the full-text search features it needs. It is not very different from other search engines out there, such as Elasticsearch and Apache Solr. In fact, it shares the same core with these two, namely Apache Lucene Core. Therefore, the terminology is pretty much the same when comparing the features of these search engines. This means that, as a developer, knowing CloudSearch gives you solid ground for understanding how any of the others work.

One might say that other search engines, available both as self-managed services and as cloud-hosted services, are more sought after, especially in the PHP world. I am thinking here of Elasticsearch, which has gained a lot of popularity lately, among PHP developers at least. Apache Solr is still in the books with powerful features, despite a somewhat steeper learning curve and demanding maintenance. There are others, such as Algolia, which is a very nice competitor for the previous ones.

But let’s not forget the enterprise applications world. PHP has matured enough to be considered as the language of choice at least for building the customer facing web platforms of some big players. Enterprise level players are already using cloud providers for some of the needed services, Amazon Web Services being the choice for a lot of them. See where I’m going with this? It’s a common case that, when they look for a search engine for their customer facing websites, they look in the AWS yard and discover it provides CloudSearch, which solves their problem. To conclude, if you are developing a web application to be integrated in such an enterprise environment, then knowing CloudSearch will prove very useful.

Knowledge about Amazon CloudSearch is also a plus when you want to apply for a job that requires previous work with AWS, as it often is among the examples of AWS services in the job offers.

How does it compare to other search engines out there?

Comparing search engines down to the smallest detail is a tough task. One of the main reasons is that each of us, as everyday users of search functionality, expects different things from the search we are using. Moreover, no two business domains have the same search needs. On top of that, search needs can differ even within the same website depending on the viewport: on desktop, a facets sidebar might matter more than on mobile, while on mobile a well-configured, typo-tolerant fuzzy search might be the number one requested feature.

We are most familiar with e-commerce sites, because a large part of our work revolves around e-commerce, but there are several other use cases that require a slightly different feature set from the search engine. For example, an e-commerce site makes heavy use of faceting and lets the user filter by a large number of attributes, while a blog system or a scientific papers search site requires good highlighting capabilities and less faceting, just enough to power the tag cloud in the sidebar.

Therefore, without trying to present an exhaustive list of features, we’ll look at CloudSearch and the other mentioned search solutions in the light of the following features:

  • Query syntaxes / parsers
  • Faceting and filtering
  • Sorting and pagination
  • Suggestions and text autocomplete
  • Fuzzy matching
  • Did you mean…
  • Languages support and stemming
  • Synonyms and stopwords
  • Geo distance search and geo distance sorting

Query syntaxes / parsers

When writing the queries, CloudSearch allows you to choose from four query parsers (simple, structured, dismax and lucene). Similar to Apache Solr, which served as an inspiration, CloudSearch expects the query for documents selection in a query string parameter named simply “q”. In fact, the four query strategies are more than just parsers, they also control how the query will be performed once parsed.

Other similarities with Apache Solr are the “lucene” and the “dismax” parsers. The “lucene” parser allows you to write low level queries in the Lucene syntax. The dismax parser is an implementation of the DisjunctionMaxQuery mode that was first provided by Apache Solr, then CloudSearch creators added it to CloudSearch for easy migration of existing code to their new solution.

The main advantage of using dismax is that it takes a user-entered phrase, parses it into tokens and then searches individual words across several fields using different boosting, and it never fails if the user input is bad.

From a developer’s point of view, this is the easiest way to get good results, which most of the time is all it takes to power a site search.

Because they share the same Lucene core, CloudSearch has these two query types in common with Elasticsearch. The latter provides them in its JSON-based syntax: the “query string query” accepts the Lucene syntax, and the “multi match query” behaves similarly to dismax in CloudSearch. Elasticsearch also has a “dis max query”, a compound query that takes a somewhat different approach.

CloudSearch also provides the “simple” parser and the “structured” parser. The simple parser is, as its name says, a fairly simple parser that you will quickly find inadequate for most non-trivial search use cases. The real power of querying with CloudSearch resides in the structured parser. The structured query syntax reminded us of functional programming, as each query part is enclosed in parentheses and the part identifier is the first element in the enclosed sequence of elements. It takes some time to get comfortable with, but then you literally feel you can do anything… in terms of search, of course.
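To make the difference between the parsers more tangible, here is a minimal Python sketch that only builds request URLs; it does not call the service. The endpoint host, the field names “title” and “plot” and the helper function names are hypothetical, and the “q.options” value with weighted fields is written the way we remember it from the AWS documentation, so double-check it against your own domain.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; use the one shown for your own domain.
SEARCH_ENDPOINT = "https://search-example.eu-west-1.cloudsearch.amazonaws.com"


def search_url(query, parser="simple", extra=None):
    """Build a search request URL for the 2013-01-01 API."""
    params = {"q": query, "q.parser": parser}
    params.update(extra or {})
    return f"{SEARCH_ENDPOINT}/2013-01-01/search?{urlencode(params)}"


def term(field, value):
    """Structured-syntax term clause; backslashes and single quotes are escaped."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"(term field={field} '{escaped}')"


def all_of(*clauses):
    """Structured-syntax AND: the operator comes first, then its operands."""
    return "(and " + " ".join(clauses) + ")"


# dismax: raw user input, searched across several fields with different boosts.
dismax_url = search_url(
    "star wars",
    parser="dismax",
    extra={"q.options": "{fields:['title^5','plot']}"},
)

# structured: the same intent expressed as explicit, nested clauses.
structured_query = all_of(term("title", "star"), term("title", "wars"))
structured_url = search_url(structured_query, parser="structured")

print(structured_query)
print(dismax_url)
print(structured_url)
```

The structured query is composed from small functions on purpose: since every clause is a parenthesized expression, nesting clauses in code maps directly to nesting them in the query.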

Faceting and filtering

A facet is the value of a given field that a set of documents have in common. Most of the time you want to display a count next to the facet, so that users know how many documents share that value and what to expect if they click on it to use it as a filter.

CloudSearch provides all you need to equip your search results page with faceted navigation. For each field where you need facet counts for the results of a query, you add one query parameter whose name is “facet”, followed by a dot and the name of the field. The syntax of the value resembles JSON and lets you request counts either for discrete values or for ranges: you specify buckets of discrete values or range buckets, and the query result contains the requested counts.

There is also the non-bucketed simpler way, which just returns counts for all distinct values in the field. One shortcoming of this approach is that for a given field you can add just one facet request per query. Solr offers the faceting feature in a similar way, but with more configuration options, while Elasticsearch has taken a more general approach and offers facet counting as part of the aggregations framework, through “terms aggregation”, “range aggregation” and “filter aggregation”.
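As an illustration of the CloudSearch facet parameters, the following Python sketch builds “facet.FIELD” name and value pairs. The field names “genres” and “year” and the helper names are hypothetical, and the option strings follow the JSON-like form as we recall it from the AWS documentation, so treat them as an assumption to verify.

```python
from urllib.parse import urlencode


def all_values(field):
    """Non-bucketed form: counts for the distinct values found in the field."""
    return f"facet.{field}", "{}"


def top_values(field, size, sort="count"):
    """Non-bucketed form, limited to `size` values, sorted by 'count' or 'bucket'."""
    return f"facet.{field}", f"{{sort:'{sort}',size:{size}}}"


def range_buckets(field, ranges):
    """Bucketed form: one count per inclusive [low,high] range."""
    buckets = ",".join(f'"[{low},{high}]"' for low, high in ranges)
    return f"facet.{field}", f"{{buckets:[{buckets}]}}"


# A dict keeps one entry per parameter name, which mirrors the
# one-facet-request-per-field restriction mentioned above.
params = dict([
    ("q", "star"),
    top_values("genres", size=5),
    range_buckets("year", [(1980, 1989), (1990, 1999)]),
])

print(params["facet.genres"])
print(params["facet.year"])
print(urlencode(params))
```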

In terms of filtering the results, CloudSearch follows again the example of Apache Solr and expects the filter query in the “fq” query string parameter. The difference is that CloudSearch allows only structured parser queries for the “fq” parameter, while Solr expects Lucene syntax in the “fq” parameter. Elasticsearch takes a different approach again and allows any type of query clause to be used in a so-called filter context, so you use the same building blocks, the various query clauses, to build both the query for selecting and ranking and the filter for leaving out documents that do not comply.

Sorting and pagination

Similar to Apache Solr, and as expected from any search engine, CloudSearch provides means for pagination and sorting the search results.

The default sorting criterion is the relevance score computed internally by CloudSearch. However, you can change the default and use one or more criteria for sorting. CloudSearch applies the criteria in the order given. As expected, document fields are eligible to be used as criteria, but you can also use expressions, which are pretty powerful. The list of comma-separated sorting criteria is provided via the “sort” URL parameter.

CloudSearch provides simple pagination through the “start” and “size” parameters. This mechanism is, however, limited to the first 10000 results. To get past that limit, you have to use the cursor feature, which lets you retrieve sequential sets of hits, always starting from the beginning of the result set. The usual use case of random page access can be simulated by starting from the initial cursor on each request, which also avoids stale cursors.
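The two pagination styles can be sketched in Python as follows. This is a sketch under assumptions: the helper names are ours, the `fetch` callable stands in for whatever HTTP client you use, and the response shape (“hits.hit” and “hits.cursor”) is how we remember the JSON search response. We also read the 10000 limit as applying to the end of the requested page. The in-memory `fake_fetch` only exists so that the loop can be run without a domain.

```python
MAX_WINDOW = 10000  # offset-based paging only reaches the first 10000 results


def sort_param(criteria):
    """Turn [("year", "desc"), ("title", "asc")] into the value of "sort"."""
    return ",".join(f"{name} {direction}" for name, direction in criteria)


def page_params(page, size):
    """Offset-based paging for a 1-based page number."""
    start = (page - 1) * size
    if start + size > MAX_WINDOW:
        raise ValueError("page is beyond the offset window; use a cursor instead")
    return {"start": start, "size": size}


def iterate_hits(fetch, query, size=100):
    """Walk the whole result set sequentially, starting from the initial cursor."""
    cursor = "initial"
    while True:
        response = fetch({"q": query, "size": size, "cursor": cursor})
        hits = response["hits"]["hit"]
        if not hits:
            return
        yield from hits
        cursor = response["hits"]["cursor"]


# In-memory stand-in for the search endpoint, for demonstration only.
DOCS = [{"id": str(n)} for n in range(5)]


def fake_fetch(params):
    offset = 0 if params["cursor"] == "initial" else int(params["cursor"])
    batch = DOCS[offset:offset + params["size"]]
    return {"hits": {"hit": batch, "cursor": str(offset + params["size"])}}


print(sort_param([("year", "desc"), ("title", "asc")]))
print(page_params(page=3, size=20))
print([hit["id"] for hit in iterate_hits(fake_fetch, "pizza", size=2)])
```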

Suggestions and text autocomplete

CloudSearch provides suggestions through separate search indices named suggesters. A suggester is built upon a single text field from the indexed documents. There is a dedicated endpoint for querying suggestions and each suggestions request can target only one defined suggester. That means that if you want suggestions from multiple fields, you have to perform several requests and then merge results. There is a limit of 10 suggesters that can be defined on each domain, so you have to use them wisely. Speaking of limits, only the first 512 bytes of data in the targeted field are used during matching, so the feature is not intended for large text contents.

The logic behind providing suggestions is word prefix matching. You can gradually introduce fuzziness into the process by allowing one or two typos in the searched prefix. In our tests, it still provided suggestions even after we wrote one complete word and then started another one that was not the next word in the text indexed for the target field. In some cases this didn’t happen, though. Nonetheless, it’s a nice feature that, based on word prefix matching, gives you a list of documents for what you have typed so far.
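Since each suggestions request targets a single suggester, merging has to happen on the client. Below is a minimal Python sketch of that step. The suggester names, the sample documents and the helper names are hypothetical, the response shape (“suggest.suggestions” with “suggestion”, “score” and “id”) is how we recall the JSON output, and scores coming from different suggesters may not be directly comparable, so the ranking here is only one possible choice.

```python
from urllib.parse import urlencode


def suggest_query(prefix, suggester, size=5):
    """Query string for the dedicated suggest endpoint (one suggester per request)."""
    return urlencode({"q": prefix, "suggester": suggester, "size": size})


def merge_suggestions(*responses, limit=10):
    """Merge several suggest responses, keeping the best-scored entry per document."""
    best = {}
    for response in responses:
        for item in response["suggest"]["suggestions"]:
            current = best.get(item["id"])
            if current is None or item["score"] > current["score"]:
                best[item["id"]] = item
    # sorted() is stable, so equally scored items keep their first-seen order.
    ranked = sorted(best.values(), key=lambda item: item["score"], reverse=True)
    return ranked[:limit]


# Hand-written sample responses from two hypothetical suggesters.
by_title = {"suggest": {"suggestions": [
    {"suggestion": "Pizza Napoli", "score": 3, "id": "doc1"},
    {"suggestion": "Pizzeria Roma", "score": 1, "id": "doc2"},
]}}
by_brand = {"suggest": {"suggestions": [
    {"suggestion": "Pizza Napoli SRL", "score": 2, "id": "doc1"},
    {"suggestion": "Pizza Hut", "score": 5, "id": "doc3"},
]}}

print(suggest_query("piz", "title_suggester"))
print([item["suggestion"] for item in merge_suggestions(by_title, by_brand)])
```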

An autocomplete feature in the form of typeahead suggestions is not provided. We had to implement it ourselves in PHP for one of our projects, by using a combination of prefix queries and phrase queries and then post-processing the resulting documents with regular expressions. The approach is pretty resource intensive, but a good caching mechanism will help. As long as the content in the targeted fields is suitable, the typeahead suggestions obtained this way are satisfactory.

Fuzzy matching

Fuzzy matching attempts to find a match which, although it’s not a 100 percent match, is above the matching threshold set by the engine user. CloudSearch allows you to perform fuzzy searches only when you make use of the simple query parser. To perform a fuzzy search, append the ~ operator and a value that indicates by how much a term can differ from the user query string and still be considered a match. For example, by specifying matck~1 (note the typo, k instead of h) you will get results which differ by up to one character, which means the results will include hits for the word “match”.
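To illustrate what “differ by up to one character” means, here is a small Python sketch. The edit distance function is plain Levenshtein distance, used only as an illustration; we do not claim it is the exact metric CloudSearch applies internally. The helper names are ours.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_term(term, max_edits=1):
    """Append the ~ operator of the simple query parser to a single term."""
    return f"{term}~{max_edits}"


def fuzzy_query(user_input, max_edits=1):
    """Make every whitespace-separated term of the user input fuzzy."""
    return " ".join(fuzzy_term(term, max_edits) for term in user_input.split())


print(fuzzy_term("matck"))
print(edit_distance("matck", "match"))
print(fuzzy_query("matck highlihgting", max_edits=2))
```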

Did you mean…

At the moment, CloudSearch does not have support for "Did you mean?" queries, even if the backing engine, Solr, supports this feature. Elasticsearch offers something similar, but under the name of “phrase suggester”, which can be used to implement a “Did you mean?” feature for your website search.

Languages support and stemming/lemmatization

During indexing, the text fields are tokenized and normalized. CloudSearch offers support for 34 languages, which are in the common package of languages supported by other search engines.

Synonyms and stopwords

Synonyms can be easily added to a CloudSearch domain, either directly using the AWS Console or through any of the other tools built for domain management.

Synonyms are mainly used in search to:

  • map common misspellings to the correct spelling;
  • define equivalent terms, such as “film” and “movie”;
  • map multiple words to a single word or vice versa, such as “week end” and “weekend”;
  • map a general term to a more specific one.

An important note here is that CloudSearch allows you to specify synonyms in two ways:

  • as a group, where each term in the group is a synonym of every other term in the group;
  • as an alias for a specific term - the difference is that the term is not considered a synonym of the alias.

Stopwords are words that should be ignored both during indexing and at search time because they are either insignificant or so common that they would return too many results. A stopwords dictionary is a JSON array of terms, for example ["a", "dans", "et", "de"]. If you choose one of the predefined analysis schemes, CloudSearch will use the default stopwords dictionary for the specified language. If you create a custom analysis scheme, you must provide the list of stopwords yourself if you want to use the feature.
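The synonym and stopword dictionaries described above can be prepared in Python as plain JSON strings. This is a sketch under assumptions: the scheme name is hypothetical, the sample terms are illustrative, and the key names (“groups”, “aliases”, “AnalysisOptions” and so on) follow the AWS configuration API as we remember it. The code only builds the structure and does not call AWS.

```python
import json

# Synonyms: groups are symmetric, aliases are one-directional.
synonyms = {
    "groups": [["film", "movie"]],
    "aliases": {"youth": ["child", "kid"]},
}

# Stopwords: a flat JSON array of terms.
stopwords = ["a", "dans", "et", "de"]

# Both dictionaries are passed as JSON-encoded strings inside the scheme.
analysis_scheme = {
    "AnalysisSchemeName": "custom_fr",  # hypothetical name
    "AnalysisSchemeLanguage": "fr",
    "AnalysisOptions": {
        "Synonyms": json.dumps(synonyms),
        "Stopwords": json.dumps(stopwords),
    },
}

print(json.dumps(analysis_scheme, indent=2))
```

A structure like this is what you would hand to the domain configuration tooling of your choice, for example an AWS SDK call that defines an analysis scheme.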

Geo distance search and geo distance sorting

Searching and ranking results by geographic location can be achieved by using a special field type, called latlon. A value is specified as a comma-separated latitude and longitude pair, for example 35.11,-120.34.

Geo search

Unlike other search engines, CloudSearch only allows searching inside a rectangular bounding box. To use the bounding box filter and constrain results to a particular area, you have to determine the latitude and longitude of the upper-left and lower-right corners of the rectangular area you are interested in, and then specify your bounding box filter as:

fq=latlon_field:['upper_left_lat,upper_left_long','lower_right_lat,lower_right_long']

For example, given you stored the location of each document in a field named “location”, to filter results and see only those located in the center of Paris, you would send the following parameter:

http://search-{cloudsearch-domain-here}/2013-01-01/search?q=pizza&fq=location:['48.898009,2.265496','48.822419,2.413706']
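If you start from a center point rather than from two corners, the corners have to be computed first. The Python sketch below does this with a simple approximation (about 111.32 km per degree of latitude, with longitude scaled by the cosine of the latitude), which is reasonable for city-sized boxes away from the poles. The helper names and the center point used in the example are our own.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def bounding_box(lat, lon, half_side_km):
    """Return (upper_left, lower_right) corners of a box centered on lat, lon."""
    delta_lat = half_side_km / KM_PER_DEGREE_LAT
    delta_lon = half_side_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    upper_left = (lat + delta_lat, lon - delta_lon)
    lower_right = (lat - delta_lat, lon + delta_lon)
    return upper_left, lower_right


def bbox_filter(field, upper_left, lower_right):
    """Format the value of the fq parameter for a latlon field."""
    (ul_lat, ul_lon), (lr_lat, lr_lon) = upper_left, lower_right
    return (
        f"{field}:['{ul_lat:.6f},{ul_lon:.6f}',"
        f"'{lr_lat:.6f},{lr_lon:.6f}']"
    )


# The box from the example above, then a 5 km half-side box around a
# point in central Paris (coordinates chosen for illustration).
print(bbox_filter("location", (48.898009, 2.265496), (48.822419, 2.413706)))
corners = bounding_box(48.8566, 2.3522, half_side_km=5)
print(bbox_filter("location", *corners))
```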

Geo sort

You can define an expression as part of your search request to sort results by distance. Amazon CloudSearch expressions support the haversin function, which computes the great-circle distance between two points on a sphere using the latitude and longitude of each point.

Here is an example on how to sort by distance using the haversin function inside an expression:

  • first define an expression to hold the haversine result:
    expr.distance=haversin(48.858619, 2.294163,location.latitude,location.longitude)

  • then use the expression as the criterion for the sort parameter:
    sort=distance asc

Therefore, the final query string for finding the closest pizza place to the Eiffel Tower in Paris, given that the documents we indexed in CloudSearch have a “location” field of type latlon, would be:

http://search-{service-endpoint}/2013-01-01/search?q=pizza&expr.distance=haversin(48.858619, 2.294163,location.latitude,location.longitude)&sort=distance asc
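For readers who want to see what the haversin function computes, here is a Python sketch of the haversine formula, together with a helper that assembles the request parameters shown above. The Earth radius of 6371 km, the helper names and the second landmark's approximate coordinates are our own assumptions; CloudSearch may use slightly different constants.

```python
import math
from urllib.parse import urlencode

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_first_params(query, lat, lon, field="location"):
    """Parameters that sort hits by distance from the given point."""
    expression = f"haversin({lat},{lon},{field}.latitude,{field}.longitude)"
    return {"q": query, "expr.distance": expression, "sort": "distance asc"}


eiffel_tower = (48.858619, 2.294163)
notre_dame = (48.852968, 2.349902)  # approximate coordinates

print(round(haversine_km(*eiffel_tower, *notre_dame), 2), "km")
print(urlencode(nearest_first_params("pizza", *eiffel_tower)))
```

Note that urlencode takes care of the spaces and parentheses, which would otherwise have to be escaped by hand when the query string is typed directly into a URL.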

In conclusion

We hope we’ve convinced you by now to befriend CloudSearch. It can become that kind of friend that never expects you to listen to all the drama in its life and understand all its mind’s inner workings or invest your time into keeping it entertained all the time, but instead it keeps its frustrations about resources and spikes to itself and when you are in need it gladly jumps to help you get your job done.

This article is the first one of a series of articles on CloudSearch and its purpose is just to make you curious about the possibilities it offers. If you find this a bit too theoretical, then stay tuned, because the next article in this series is all about using CloudSearch in a real life example.

Till then, happy searching...

