Sunday, July 5, 2015

Amazon CloudSearch vs ElasticSearch vs Apache Solr Comparison in detail



This detailed article is co-authored with Dwarak.
SlideShare link: http://www.slideshare.net/harishganesan/amazon-cloud-searchvsapachesolrvselasticsearchcomparisonreportv11

It’s the information age, and getting information to the people who want it, when they want it, in the most efficient way is one of the foremost challenges. Effective search tools are therefore a primary concern. Many options are available to developers, including open source projects, independent vendors, and cloud-based search engines, but given the collaborative strengths of open source products and the rapid growth of cloud-based offerings, this comparative report focuses on three major contenders: Apache Solr, Elasticsearch, and Amazon CloudSearch. It aims to provide a detailed and objective assessment of the strengths of each.
There are many features and variables to take into account when deciding which product to choose. This report provides a straight comparison of all the aspects a developer will need to consider, including such features as Getting Started, Operations, Indexing, Search and Query, Scaling, Protocols, API support, Customization, Cost, and more.
Admin Operations

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
| --- | --- | --- | --- |
| Backup | Replication/Custom handler/Custom scripts | Snapshot API/Custom scripts | Fully-managed |
| Patch Management | Manual/Automated via custom scripts | Manual/Automated via custom scripts | Fully-managed |
| Re-indexing | Manual | Manual | Fully-managed; manual option available from management console |
| Monitoring | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like NewRelic, Stackdriver, Datadog | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like NewRelic, Stackdriver, Datadog | CloudSearch default metrics |
| Maintenance | External managed service | External managed service | Fully-managed |

API

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
| --- | --- | --- | --- |
| Client Library | Java, PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, JavaScript | Java, Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby | Amazon SDK |
| HTTP RESTful API | Yes | Yes | Yes |
| Request Format | XML, JSON, CSV | XML, JSON | XML, JSON |
| Response Format | XML, JSON, CSV | XML, JSON | XML, JSON |
| Third party Integrations | Available for commercial and open source | Available for commercial and open source | Amazon Web Services integrations available |

Search Functions

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
| --- | --- | --- | --- |
| Schema | Schema and schema-less | Schema and schema-less | Schema |
| Dynamic fields support | Yes | Yes | Yes |
| Synonyms | Yes | Yes | Yes |
| Multiple indexes | Yes | Yes | No |
| Faceting | Yes | Yes | Yes |
| Rich documents support | Yes | Yes | No |
| Auto Suggest | Yes | Yes | Yes |
| Highlighting | Yes | Yes | Yes |
| Query parser | Standard, DisMax, Extended DisMax, other parsers | Standard, query_string, DisMax, match, multi_match | Simple, structured, Lucene, or DisMax |
| Geosearch | Yes | Yes | Yes |
| Analyzers, Tokenizers and Token filters | Default/Custom | Default/Custom | Default |
| Fuzzy Logic | Yes | Yes | Yes |
| Did you mean | Default/Custom | Default/Custom | No |
| Stopwords | Yes | Yes | Yes |
| Customization | Yes | Yes | No |

Advanced

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
| --- | --- | --- | --- |
| Cluster management | ZooKeeper | In-built | Fully-managed |
| Scaling | Vertical scaling/horizontal scaling | Vertical scaling/horizontal scaling | Fully-managed horizontal scaling |
| Replication | Yes | Yes | Yes |
| Sharding | Yes | Yes | Yes |
| Failover | Yes, if set up in cluster replica mode | Yes, if set up in cluster replica mode | Fully-managed |
| Fault tolerant | Yes, if set up in cluster mode | Yes, if set up in cluster mode | Fully-managed |

Import and Export

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
| --- | --- | --- | --- |
| Data import | Default import handlers, custom import handlers | Rivers modules, Logstash input plugins, custom programs | Batch upload |
| Data export | Default export handlers, custom export handlers | Snapshot API | Custom program |

Others

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
| --- | --- | --- | --- |
| Web Interface | Solr Admin | Sense | AWS Management Console |

In today's world of vast and available information, a good search experience is central to a good user experience. Hence, delivering effective search tools has become the key goal of all software products, market places, e-commerce websites, and content management systems. Developers looking to deliver a premium search experience to their users should be aware of some broad trends:
1) Open source and platform-based search engines are replacing proprietary search engines because of better licensing models and community support.
2) The cloud delivery model is succeeding over the on-premise delivery model because of scalability, high availability and operating expense.
In light of the above trends, the choice of leading candidates for search technology boils down to three: Apache Solr, Elasticsearch, and Amazon CloudSearch. Our clients often ask us how these three choices compare relative to each other. This report aims to make it easy for developers to pick the right technology for their application by presenting a comprehensive framework for evaluation of the three options. We have also applied our framework to top feature sets that are critical to any search workload. We then broke them down further into granular features and compared each of the three search engines.
In this report, we summarize our conclusions and present them in a smack down style summary card. We encourage our readers to run a more in-depth evaluation for their specific use cases.

Feature 1: Getting Started

‘Getting Started’ is the first step an engineer takes to understand the basics of a product’s major features. In this section, we will see how each of the search engines discussed above facilitates ‘Getting Started’.
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch require end users to spend significant time understanding and setting up the respective search engines. The “Getting Started” manuals of both assume the end user has at least minimal knowledge of search engines, their related functions, and architecture.
The installation processes for Apache Solr and Elasticsearch include tasks such as:
       Server setup
       Search engine download
       Dependent software installations
       Setup of environmental requirements
       Understanding of basic server commands
       Administrative access
Apache Solr and Elasticsearch ship with test examples which allow users to do “warm up” search and indexing operations. While the default test schema in Apache Solr is sufficient for the user to get started, Elasticsearch’s schema-less design allows the user to send document requests without defining any schema.
Amazon CloudSearch
If you already have an Amazon Web Services (AWS) account set up, you can create a CloudSearch domain in a few clicks using the AWS Management Console. The console guides administrators through a step-by-step process, requesting user input for:
       Instance type
       High availability options
       Replication options
       Schema definitions
       Access policies
Among these options, it is important to note that Amazon CloudSearch does not require all of this information up front. A CloudSearch domain name and engine type are adequate to create a CloudSearch instance.
The other configurations such as schema, instance type, access policies, and high availability options can be modified at a later time based on the application requirements.
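As a minimal sketch (assuming the boto3 AWS SDK for Python and a hypothetical domain name), creating a domain programmatically might look like this:

```python
import boto3

# Hypothetical domain name; requires AWS credentials with CloudSearch access.
cs = boto3.client("cloudsearch", region_name="us-east-1")

# A domain name alone is enough to create the domain; schema, instance type,
# access policies, and availability options can all be configured later.
response = cs.create_domain(DomainName="products")
print(response["DomainStatus"]["ARN"])
```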
The users are abstracted from hardware provisioning, software installation, configuration, cluster setup and other administration activities.
Users receive two regional endpoints: a search endpoint and a document endpoint. Both can be accessed using the RESTful API or the AWS Software Development Kit (SDK) with Identity and Access Management (IAM) credentials.
Another important note is that the default CloudSearch access policies for the document service and search service endpoints block all IP addresses. Developers must configure the authorized IP addresses that may access the CloudSearch endpoints.
CloudSearch also provides a sample dataset of IMDb movies, which can be used to test drive the service. The CloudSearch developer documentation walks through the steps to launch a test domain using this sample dataset.

Conclusion
Apache Solr and Elasticsearch expect users to have basic practical knowledge of the search engine and to complete a few significant tasks before ‘Getting Started’ is accomplished.
In Amazon CloudSearch, the ‘Getting Started’ activities are easier, and end users can have a CloudSearch instance up and running with a few clicks in a few minutes.

Feature 2: Operations and Management

In this section, we’ll discuss some important administrative operations, such as:
       Index backup
       Patch management
       Re-indexing and recovery

2.1 Backup

Data backup is a routine operation, carried out on a defined schedule. It is essential for recovering from failures such as hardware crashes, data corruption, or related events.
Apache Solr
Apache Solr provides a feature called ‘ReplicationHandler’. Its main objective is to replicate index data to slave servers, but it can also be used to maintain a backup copy. A replication slave node can be configured against the Solr master and dedicated solely as a backup server, with no other operations taking place on that node.
Solr’s implicit support for replication allows ReplicationHandler to be used as an API. The API has optional parameters such as the backup location, the snapshot name, and the number of backups to keep. The backup API is bound to store snapshots on a local disk; any other storage option requires customization.
If you need to store backups in a different location, such as Amazon’s Simple Storage Service (S3), a local storage server, or a remote data center, ReplicationHandler has to be customized. Because the Solr core libraries are open source, any such customization is possible.
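As a rough sketch (assuming a local Solr core named collection1; the host, paths, and snapshot name are placeholders), triggering a backup via the ReplicationHandler might look like this:

```python
import requests

# Hypothetical Solr host and core name.
SOLR = "http://localhost:8983/solr/collection1"

# Ask the ReplicationHandler to snapshot the index to a local directory,
# keeping at most the three most recent backups.
response = requests.get(SOLR + "/replication", params={
    "command": "backup",
    "location": "/var/backups/solr",
    "name": "nightly",
    "numberToKeep": 3,
    "wt": "json",
})
print(response.json())
```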
Elasticsearch
Elasticsearch provides an advanced option called ‘Snapshot API’ for backing up the entire cluster.  The API will back up the current cluster state and related data and save it to a shared repository.
The first backup is a complete copy of the data; subsequent backups are incremental, capturing only the delta between the current data and the previous snapshots. Elasticsearch requires end users to register a repository, whose type can be one of:
       Shared file system
       Amazon S3
       Hadoop Distributed File System (HDFS)
       Azure Cloud
This integration gives a greater flexibility for developers to manage their backups.
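For illustration only (assuming the AWS cloud plugin is installed, with hypothetical repository and bucket names), registering an S3 repository and taking a snapshot could look like this:

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address

# Register an S3-backed snapshot repository.
requests.put(ES + "/_snapshot/nightly_backups", json={
    "type": "s3",
    "settings": {"bucket": "my-es-backups", "region": "us-east-1"},
})

# Take a snapshot of the whole cluster and wait for it to finish.
requests.put(ES + "/_snapshot/nightly_backups/snapshot_1",
             params={"wait_for_completion": "true"})
```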
Backup Process
The backup options in Apache Solr and Elasticsearch can be executed manually or automated. To automate the entire backup process, one has to write custom scripts that call the relevant API or handler. Most engineering teams follow this model of writing custom scripts for backup automation.
Backup also involves maintaining the latest snapshots and archives, with key management tasks such as snapshot retrieval, archival, and expiration.
In an alternate approach, if the Solr or Elasticsearch cluster is set up in replication mode, one of the slave nodes can be designated as the backup server. Automating that slave-node backup still requires a script written by the developer.
Amazon CloudSearch
Amazon CloudSearch inherently takes care of the data that is stored and indexed, leaving a lighter load for engineering and operations teams. CloudSearch self-manages all data backup, and the backups are maintained internally behind the scenes. In the event of a hardware failure or other problem, Amazon CloudSearch restores the backup automatically; this process is invisible to end users.
Conclusion
The default option in Apache Solr is to back up only to a local disk; it does not offer other storage options as Elasticsearch does. However, engineers can write their own handlers to manage the backup process.
Elasticsearch is packaged with plugins for multiple storage options, which gives engineers an added advantage.
Amazon CloudSearch relieves users of the intricacies of backup and its management. IT operations or managed service teams have a smaller role in the backup process, as the entire operation is handled behind the scenes by CloudSearch.

2.2 System upgrades and patch management

Patch management and system upgrades such as OS patches and fixes are inevitable in operations and administration. Every system eventually faces a version upgrade, OS maintenance, or hardware and software changes.
Rolling Restarts
Apache Solr and Elasticsearch both recommend ‘Rolling Restarts’ for patch management, operating system upgrades, and other fixes. A Rolling Restart stops and starts each node in the cluster sequentially, which allows the cluster to continue serving search requests while each node is updated with the latest code, fixes, or patches. Rolling Restarts are adopted when high availability is mandatory and downtime is not acceptable.
Sometimes Rolling Restarts require intelligent decision making based on cluster topology. If a cluster consists of shards and replicas, the order in which nodes are restarted must be chosen carefully.

Apache Solr
Apache’s ZooKeeper service is a stand-alone application and does not get upgraded automatically when Apache Solr is upgraded; it must be upgraded manually at the same time.
Elasticsearch
Elasticsearch recommends disabling the ‘shard allocation’ setting during a node restart. This tells Elasticsearch not to rebalance missing shards, since the cluster would otherwise immediately start reallocating them when a node goes down.
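A minimal sketch of this step (assuming a local cluster; the exact setting name varies slightly across Elasticsearch versions) might look like:

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address

# Disable shard allocation before taking a node down for patching.
requests.put(ES + "/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.enable": "none"}
})

# ... restart / patch the node here ...

# Re-enable allocation once the node has rejoined the cluster.
requests.put(ES + "/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.enable": "all"}
})
```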
Amazon CloudSearch
Amazon CloudSearch internally manages all patches and upgrades related to its operating system. When new features are rolled out to the managed search service, the upgrades are self-managed and immediately available to all customers without any action on their part.
Conclusion
The patch management in Apache Solr and Elasticsearch has to be carried out manually using the Rolling Restarts feature. Customers automate this process by developing custom scripts to do system upgrades and patch management.
Patch management in Amazon CloudSearch is transparent to the customers. The upgrades and patches done on Amazon CloudSearch are regularly updated in the ‘What’s New’ section of the CloudSearch documentation.

2.3 Re-indexing

Any business application changes over its lifetime, as the business running it changes. Business change has a direct effect on the data structure of the system’s persistent information store. The search engine, often a secondary or alternate store, will eventually have to change its data structure as well. Any change to the search engine’s data structure requires re-indexing the data.
Example: a product company starts collecting ‘feedback’ from its customers for a given product. The text from the new ‘feedback’ field needs to be added to the search schema, which may require re-indexing.
If the search data is not re-indexed after a structural change, the data that has already been indexed could become inaccurate and the search results may behave differently than expected.
Re-indexing becomes a necessary process over a period of time as the application grows. It is also identified as a common and mandatory admin operation executed periodically based on application requirements.
Apache Solr
Apache Solr recommends re-indexing if there is a change in your schema definitions. The options below are widely used by the Apache Solr user community.
       Create a fresh index with the new settings, then copy all documents from the old index to the new one.
       Configure the Data Import Handler with ‘SolrEntityProcessor’, which imports data from Solr instances or cores for a given search query. SolrEntityProcessor has a limitation: it can only copy fields that are stored in the source index.
       Configure the Data Import Handler with the original data source and push the data fresh into the new index.
Elasticsearch
Elasticsearch proposes several approaches for data re-indexing. The following approaches are usually combined:
·         Use Elasticsearch’s Scan and Scroll and Bulk APIs to fetch and push data into the new index (a sketch follows this list).
·         Update or create an index alias with the old index name pointing at the new index, then delete the old index.
·         Use open source Elasticsearch plugins that can extract all data from the cluster and re-index it. Most of these plugins internally use the Scan and Scroll and Bulk APIs mentioned above, which reduces development time.
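As a rough sketch of the first approach (assuming the official elasticsearch-py client and hypothetical index names):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch(["http://localhost:9200"])  # hypothetical cluster address

def reindexed_docs():
    # Scan and Scroll fetches every document from the old index efficiently.
    for hit in scan(es, index="products_v1",
                    query={"query": {"match_all": {}}}):
        yield {"_index": "products_v2", "_type": hit["_type"],
               "_id": hit["_id"], "_source": hit["_source"]}

# The Bulk API pushes the documents into the new index.
bulk(es, reindexed_docs())
```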
Amazon CloudSearch
Amazon CloudSearch recommends rebuilding the index when index fields are added or modified, and expects an indexing request to be issued after a configuration change. Whenever the configuration changes, the CloudSearch domain status changes to ‘NEEDS INDEXING’. During the index rebuild, the domain’s status changes to ‘PROCESSING’, and upon completion it changes to ‘ACTIVE’.
Amazon CloudSearch can continue to serve search requests during the indexing process, but the configuration changes are not reflected in the results until indexing completes. Re-indexing can take some time; the duration is roughly proportional to the volume of data in your index.
Amazon CloudSearch also allows document uploads while indexing is in progress, but updates can become slower if there is a large volume of document updates. In such a scenario, uploads or updates can be throttled or paused until the CloudSearch domain returns to the ‘ACTIVE’ state.
Customers can initiate re-indexing by issuing the index-documents command using RESTful API, AWS command line interface (CLI), or AWS SDK. They can also initiate re-indexing from the CloudSearch management console.
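As an illustrative sketch using boto3 (hypothetical domain name):

```python
import boto3

# Hypothetical domain name; requires AWS credentials with CloudSearch access.
cs = boto3.client("cloudsearch", region_name="us-east-1")

# Trigger re-indexing after schema changes; the domain status moves from
# NEEDS INDEXING to PROCESSING, and back to ACTIVE on completion.
cs.index_documents(DomainName="products")

status = cs.describe_domains(DomainNames=["products"])
print(status["DomainStatusList"][0]["Processing"])
```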
Conclusion
Re-indexing in Apache Solr and Elasticsearch is mostly a manual process, because it requires a decision that factors in data size, current request load, and available maintenance windows.
Amazon CloudSearch manages the re-indexing process inherently and leaves much less to administrators. The re-indexing time period is abstracted away and not disclosed, but Amazon CloudSearch runs the process according to the practices mentioned above.

Feature 3: Monitoring

Monitoring server health is an essential daily task for operations and administration. In this section, we will describe the built-in monitoring capabilities for all three search engines.
Apache Solr
Apache Solr has a built-in web console for monitoring indexes, performance metrics, information about index distribution and replication, and information on all threads running in the Java Virtual Machine (JVM) at the time.
For more detailed monitoring, Java Management Extensions (JMX) can be configured with Solr to expose runtime statistics as MBeans. The Apache Solr JVM container has built-in instrumentation that enables monitoring using JMX.
Elasticsearch
Elasticsearch has a management and monitoring plugin called ‘Marvel’. Marvel includes an interactive console called ‘Sense’ that helps users interact easily with Elasticsearch nodes. Elasticsearch has diverse built-in APIs that emit heap usage, garbage collection stats, file descriptors, and more. Marvel is tightly integrated with these APIs: it polls them periodically, collects statistics, and stores the data back in Elasticsearch. Marvel’s interactive graph dashboard allows administrators to query and aggregate historical stats.
Amazon CloudSearch
Amazon CloudSearch recently introduced Amazon CloudWatch integration. The Amazon CloudSearch metrics can be used to make scaling decisions, troubleshoot issues, and manage clusters.
Amazon CloudSearch publishes four metrics into Amazon CloudWatch: SuccessfulRequests, SearchableDocuments, IndexUtilization, and Partitions.
The CloudWatch metrics can be configured to set alarms, which can notify administrators through Amazon Simple Notification Service.
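For illustration (with a hypothetical domain name and AWS account ID as the client dimension), fetching one of these metrics with boto3 might look like:

```python
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch", region_name="us-east-1")

# CloudSearch metrics live in the AWS/CloudSearch namespace, keyed by
# DomainName and ClientId (the AWS account ID).
stats = cw.get_metric_statistics(
    Namespace="AWS/CloudSearch",
    MetricName="SuccessfulRequests",
    Dimensions=[{"Name": "DomainName", "Value": "products"},
                {"Name": "ClientId", "Value": "123456789012"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])
```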
Conclusion
Apache Solr and Elasticsearch offer built-in consoles and integrations with external plugins. They also support SaaS-based monitoring tools and custom plugins developed by customers.
CloudSearch’s integration with CloudWatch exposes a useful set of metrics, and newer ones are expected in the future.

Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export

4.1 Schema management

Schema: a schema is the definition of fields and field types used by the search system to organize data within the documents it indexes.
Schema definition is the foremost task in search data structure design. It is important that the schema caters to all business requirements and suits the application.
Apache Solr and Elasticsearch
Both Elasticsearch and Apache Solr can run a search application in ‘schema-less’ or ‘schema’ mode. Schema mode is the appropriate choice for application development and production environments.
Schema-less mode is a very good option for newcomers getting started. After server setup, users can start the application without a schema structure and let field definitions be created during indexing. However, to run a production-grade application, a proper schema structure becomes mandatory.
Amazon CloudSearch
Amazon CloudSearch also allows users to set up search domains without any index fields. Index fields can be added at any time, but they must exist before documents can be indexed or searched.
In addition, the CloudSearch management console integrates with AWS services such as S3 and DynamoDB, and can also read from a local machine, so a schema can be imported directly into a CloudSearch domain. After the import, CloudSearch allows the user to edit fields or add new ones. This is convenient when a pre-built schema is being migrated to a CloudSearch domain.
Conclusion
Apache Solr and Elasticsearch can be started without any schema, but they cannot be put into production use that way. Amazon CloudSearch allows creating domains without any index fields, but to serve indexing and search requests, the schema must be created.
The general best practice in schema management is to prototype and design the schema to suit application requirements before finalizing the search structure. The underlying schema concept of all three search engines is consistent with this practice.

4.2 Dynamic fields

Dynamic fields are like regular field definitions, except that they support wildcard matching. They allow documents to be indexed without knowing in advance the fields they contain. A dynamic field is defined using a wildcard (*) as the first, last, or only character of its name. All undefined fields are matched against the dynamic field rules, and the indexing options configured for the matching dynamic field are applied.


Apache Solr and Elasticsearch
Apache Solr and Elasticsearch allow end users to set up dynamic fields and rules using RESTful API and schema configuration.
Amazon CloudSearch
In Amazon CloudSearch, dynamic fields can be configured using the indexing options in the CloudSearch management console, the CloudSearch RESTful API, or the AWS SDK.
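A minimal sketch with boto3 (hypothetical domain name and field pattern):

```python
import boto3

# In CloudSearch, a dynamic field name contains a wildcard; for example,
# "*_t" matches any document field whose name ends in "_t".
cs = boto3.client("cloudsearch", region_name="us-east-1")

cs.define_index_field(
    DomainName="products",
    IndexField={"IndexFieldName": "*_t", "IndexFieldType": "text"},
)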
Conclusion
If you are unsure about the schema structure or exact field names, dynamic fields come in handy. Amazon CloudSearch, Apache Solr, and Elasticsearch all provide the flexibility to configure dynamic fields, which helps application development teams cover any field definitions omitted from the schema document.

4.3 Data types

There are a variety of data types supported by these search engines. The table below illustrates the data field types supported by each search engine.
| Data type | Solr | Elasticsearch | CloudSearch |
| --- | --- | --- | --- |
| String / Text | Yes | Yes | Yes |
| Number types | integer, double, float, long | byte, short, integer, long, float, double | integer, double |
| Date types | Yes | Yes | Yes |
| Enum fields | Yes | Yes | No |
| Currency | Yes | No | No |
| Geo location / Latitude-Longitude | Yes | Yes | Yes |
| Boolean | Yes | Yes | No |
| Array types | Yes | Yes | Yes |
Conclusion
The most important data types like string, date, and number types are supported by all three search engines. Geo location data type, which is now regularly used by modern applications, is also supported by all search engines.
Engineers and developers may use an alternate data type if a particular type is not supported by their chosen search engine. For example, the ‘currency’ data type supported in Solr is not available in Elasticsearch or CloudSearch; in such cases, engineers use a number type as an alternative.

4.4 Data import & export

The most important task in search application development is migrating data from its source of origin to the search engine. The origin can be a database, a file system, or another persistent store. To bootstrap a search data set, the full data set must be migrated or imported from its origin into the search engine.
Likewise, extracting data from a search engine and exporting it to a different destination is also a crucial task, though executed only occasionally.
Apache Solr
Apache Solr has an in-built handler called the Data Import Handler (DIH). The DIH provides a tool for migrating and/or importing data from the origin store. It can index data from sources such as:
       Relational Database Management System (RDBMS)
       Email
       HTTP URL end point
       Feeds like RSS and ATOM
       Structured XML files
The DIH has more advanced features like Apache Tika integration, delta import, and transformers to quickly migrate the data.
The Apache Solr export handler can export query result data in JavaScript Object Notation (JSON) or comma-separated values (CSV) format. The export query expects sort and filter query parameters and returns only stored fields. Users also have the option of developing a custom export handler and incorporating it with the Solr core libraries.
Elasticsearch
Elasticsearch ‘Rivers’ is an elegant pluggable service which runs inside the Elasticsearch cluster. It can be configured to pull or push data to be indexed into the cluster. Some of the popular Elasticsearch Rivers modules are CouchDB, Dropbox, DynamoDB, FileSystem, Java Database Connectivity (JDBC), Java Messaging Service (JMS), MongoDB, neo4j, Redis, Solr, Twitter, and Wikipedia.
However, ‘Rivers’ are deprecated in newer releases of Elasticsearch, which recommend using the official client libraries built for popular programming languages. Alternatively, the Logstash input plugins are another recognized tool for shipping data into Elasticsearch.
For data export, an Elasticsearch snapshot can be taken of individual indices or an entire cluster into a remote repository. This is discussed in detail in the section ‘Operations and Management - Backup’.
Amazon CloudSearch
Amazon CloudSearch recommends sending documents in batches for upload to a CloudSearch domain. A batch is a collection of add, update, and delete operations described in JSON or XML format.
Amazon CloudSearch limits a single batch upload to 5 MB, but allows parallel batch uploads to reduce the time needed for a full data upload. The supported degree of parallelism depends on the CloudSearch instance type: larger instance types have higher upload capacity, smaller ones lower. Batch upload programs should therefore throttle uploads intelligently based on instance capacity.
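For illustration (the document endpoint below is hypothetical; the endpoint for your domain appears in the CloudSearch console), a batch upload with boto3 might look like:

```python
import json
import boto3

# Hypothetical document endpoint for the domain.
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-products-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

# A batch is a list of add/delete operations, at most 5 MB per upload.
batch = [
    {"type": "add", "id": "1",
     "fields": {"title": "Smartphone", "price": 199}},
    {"type": "delete", "id": "42"},
]

client.upload_documents(
    documents=json.dumps(batch),
    contentType="application/json",
)
```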
Conclusion
Apache Solr has good handlers for exporting and importing data. If the available options are not viable, Apache Solr allows one to develop a new custom handler, or customize an existing one, for data import and export.
Elasticsearch integrates with popular data sources through ‘River’ modules and plugins. However, future versions of Elasticsearch strongly recommend using Logstash input plugins, or developing and contributing new Logstash inputs, since plugin customization is permitted in Elasticsearch.
Amazon CloudSearch does not have options as elaborate as the other two search engines. However, by combining custom programs with the bulk upload recommendations above, customers can successfully migrate data into CloudSearch.

Feature 5: Search and Indexing features

In this section, we will evaluate the ‘Search and Indexing’ features present in the engines under evaluation. This is a very important feature set, as these features are widely used by search application engineers.

5.1 Analyzers, Tokenizers and Token filters

Generally speaking, a search engine prepares text strings for indexing and searching using analyzers, tokenizers, and filters. These libraries are configured for indexing and searching the data, and most of the time they are composed into a sequential pipeline:
       During indexing and querying, the analyzer assesses the field text and tokenizes each block of text into individual terms. Each token is a sub-sequence of the characters in the text.
       The token filter processes each token in the stream sequentially and applies its filter functionality.
Apache Solr and Elasticsearch

Apache Solr and Elasticsearch have multifaceted in-built libraries of analyzers, tokenizers, and token filters. These libraries are packaged with the search engine installation and can be configured for indexing and searching.
Although analyzers can be configured for both indexing and querying, the same series of libraries need not be used for both operations. Indexing and searching can be configured with different tokenizers and filters, as their goals can differ.

| Search Engine | Tokenizers | Filters |
| --- | --- | --- |
| Apache Solr | Standard, Classic, Keyword, Letter, Lower Case, N-Gram, Edge N-Gram, ICU, Path Hierarchy, Regular Expression Pattern, UAX29 URL Email, White Space | ASCII Folding, Beider-Morse, Classic, Common Grams, Collation Key, Daitch-Mokotoff Soundex, Double Metaphone, Edge N-Gram, English Minimal Stem, Hunspell Stem, Hyphenated Words, ICU Folding, ICU Normalizer 2, ICU Transform, Keep Words, KStem, Length, Lower Case, Managed Stop, Managed Synonym, N-Gram, Numeric Payload Token, Pattern Replace, Phonetic, Porter Stem, Remove Duplicates Token, Reversed Wildcard, Shingle, Snowball Porter, Stemmer, Standard, Stop, Suggest Stop, Synonym, Token Offset Payload, Trim, Type As Payload, Type Token, Word Delimiter |
| Elasticsearch | Standard, Edge NGram, Keyword, Letter, Lowercase, NGram, Whitespace, Pattern, UAX Email URL, Path Hierarchy, Classic, Thai | Standard Token, ASCII Folding Token, Length Token, Lowercase Token, Uppercase Token, NGram Token, Edge NGram Token, Porter Stem Token, Shingle Token, Stop Token, Word Delimiter Token, Stemmer Token, Stemmer Override Token, Keyword Marker Token, Keyword Repeat Token, KStem Token, Snowball Token, Phonetic Token, Synonym Token, Compound Word Token, Reverse Token, Elision Token, Truncate Token, Unique Token, Pattern Capture Token, Pattern Replace Token, Trim Token, Limit Token Count Token, Hunspell Token, Common Grams Token, Normalization Token, CJK Width Token, CJK Bigram Token, Delimited Payload Token, Keep Words Token, Keep Types Token, Classic Token, Apostrophe Token |

Amazon CloudSearch
Amazon CloudSearch uses an analysis scheme configuration to analyze text data during indexing. Analysis schemes control:
       Text field content processing
       Stemming
       Inclusion of stopwords and synonyms
       Tokenization (Japanese language)
       Bigrams (Chinese, Japanese, Korean languages)  
The following analysis options are applied when text fields are configured with an analysis scheme:
1.       Algorithmic stemming: the level of algorithmic stemming (minimal, light, or heavy) to perform. The available stemming levels vary by analysis scheme language.
2.       Stemming dictionary: a dictionary that overrides the results of the algorithmic stemming.
3.       Japanese tokenization dictionary: a dictionary specifying how particular characters should be grouped into words (Japanese only).
4.       Stopwords: a set of terms to ignore during both indexing and search.
5.       Synonyms: a dictionary of words that have the same meaning in the text data.
Before processing the analysis scheme, Amazon CloudSearch tokenizes and normalizes the text data. During tokenization, the text is split into multiple tokens, as in all search engine text processing. During normalization, upper-case characters are converted to lower case and other normalizations are applied.
After tokenization and normalization are complete, stemming, stopwords, and synonyms are applied.
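As a hedged sketch with boto3 (the scheme, domain, and field names are hypothetical), defining and attaching an analysis scheme could look like:

```python
import boto3

cs = boto3.client("cloudsearch", region_name="us-east-1")

# Define an English analysis scheme with light stemming, custom stopwords,
# and a synonym alias.
cs.define_analysis_scheme(
    DomainName="products",
    AnalysisScheme={
        "AnalysisSchemeName": "my_english",
        "AnalysisSchemeLanguage": "en",
        "AnalysisOptions": {
            "AlgorithmicStemming": "light",
            "Stopwords": '["a", "an", "the"]',
            "Synonyms": '{"aliases": {"phone": ["smartphone"]}}',
        },
    },
)

# Attach the scheme to a text field, then re-index the domain.
cs.define_index_field(
    DomainName="products",
    IndexField={"IndexFieldName": "description", "IndexFieldType": "text",
                "TextOptions": {"AnalysisScheme": "my_english"}},
)
cs.index_documents(DomainName="products")
```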
Conclusion
Apache Solr and Elasticsearch are packaged with varied libraries of analyzers, tokenizers, and filters, each with distinct functions. These libraries can also be customized, which gives developers greater flexibility.
Amazon CloudSearch does not carry tokenizer or filter libraries as sophisticated as those of Apache Solr or Elasticsearch, but it has simplified their configuration. CloudSearch’s tokenizers and filters cover the most common search requirements and use cases, which is ideal for developers who want to quickly integrate search functionality into their application stack.


5.2 Faceting

Faceting is the organization of search results into categories or groups based on indexed terms. It allows search results to be categorized into sub-groups, which can be used as the basis for filters or further searches, and it enables efficient computation of result counts per facet. For example, facets for ‘Laptop’ search results could be ‘Price’, ‘Operating System’, ‘RAM’, or ‘Shipping Method’.
Faceting is a popular function that helps consumers filter through search results easily and effectively.
Apache Solr
Apache Solr has advanced faceting options, ranging from simple to very complex faceting behavior.
The table below details the parameters used during faceting. They can be grouped by field value, date, range, pivot, multi-select, and interval.
| Facet grouping | Parameters |
| --- | --- |
| Field value parameters | facet.field, facet.prefix, facet.sort, facet.limit, facet.offset, facet.mincount, facet.missing, facet.method, facet.enum.cache.minDf, facet.threads |
| Date faceting parameters | facet.date, facet.date.start, facet.date.end, facet.date.gap, facet.date.hardend, facet.date.other, facet.date.include |
| Range faceting parameters | facet.range, facet.range.start, facet.range.end, facet.range.gap, facet.range.hardend, facet.range.other, facet.range.include |
| Pivot | facet.pivot, facet.pivot.mincount |
| Interval | facet.interval, facet.interval.set |


Elasticsearch
Elasticsearch has deprecated facets and announced that they will be removed in a future release. The Elasticsearch team felt that their facet implementation was not designed from the ground up to support complex aggregations, so facets are being replaced with aggregations.
Elasticsearch describes them as follows: “An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents. The context of the execution defines what this document set is (for example, a top-level aggregation executes within the context of the executed query/filters of the search request).”
Elasticsearch strongly recommends migrating from facets to aggregations. The aggregations are classified into two main families, Bucketing and Metric.
The following table lists the aggregations available in Elasticsearch.

| Elasticsearch Aggregators |
| --- |
| Min Aggregation, Max Aggregation, Sum Aggregation, Avg Aggregation, Stats Aggregation, Extended Stats Aggregation, Value Count Aggregation, Percentiles Aggregation, Percentile Ranks Aggregation, Cardinality Aggregation, Geo Bounds Aggregation, Top hits Aggregation, Scripted Metric Aggregation, Global Aggregation, Filter Aggregation, Filters Aggregation, Missing Aggregation, Nested Aggregation, Reverse nested Aggregation, Children Aggregation, Terms Aggregation, Significant Terms Aggregation, Range Aggregation, Date Range Aggregation, IPv4 Range Aggregation, Histogram Aggregation, Date Histogram Aggregation, Geo Distance Aggregation, GeoHash grid Aggregation |


Amazon CloudSearch
Amazon CloudSearch simplifies facet configuration when defining indexing options. These facets target common use cases like e-commerce, online travel, and classifieds. A facet can be any field with a date, literal, or numeric data type, and is configured during CloudSearch domain setup. Amazon CloudSearch also allows bucket definitions to calculate facet counts for particular subsets of the facet values.
Facet information can be retrieved in two ways:
Sort: returns facet information sorted either by facet counts or facet values.
Buckets: returns facet information for particular facet values or ranges.
During searching, facet information can be fetched for any facet-enabled field by specifying the “facet.FIELD” parameter in the search request (‘FIELD’ is the name of a facet-enabled field).
Amazon CloudSearch allows multiple facets, which help refine search results further. See the example below.
Example: "q=poet&facet.genres={}&facet.rating={}&facet.year={}&return=_no_fields"
Conclusion
All three search engines allow users to perform faceting with minimal effort. However, in terms of an advanced complex implementation, the approaches are different for each search engine.

5.3 Auto Suggestion

When a user types a search query, suggestions relevant to the input are presented, and as the user types more characters, the suggestions are refined. This feature is called auto-suggest. Auto-suggest is an appealing and useful requirement and is employed in many search user interfaces.
This feature can be implemented at the Search Engine level or at the Search Application level. Below are some options available in these three search engines.

Apache Solr
Apache Solr has native support for the auto-suggest feature. It can be implemented using NGramFilterFactory, EdgeNGramFilterFactory, or TermsComponent. Usually this Solr feature is used in conjunction with jQuery or asynchronous client libraries to create powerful auto-suggestion experiences in front-end applications.
Elasticsearch
Elasticsearch also supports edge n-grams, which are easy to set up, flexible, and fast. Elasticsearch introduced a new data structure, the Finite State Transducer (FST), which resembles a big graph. It is managed in memory, which makes it much faster than a term-based query could be. Elasticsearch recommends using edge n-grams when the query input and its word ordering are less predictable.
Amazon CloudSearch
Amazon CloudSearch offers ‘Suggesters’ to achieve auto-suggest. A CloudSearch Suggester is configured on a particular text field. When a Suggester is queried with a search string, CloudSearch returns all documents whose value in the Suggester field begins with that string. Suggesters can be configured to match the query exactly, or to perform fuzzy matching to correct the query string; the fuzzy matching level can be set to low or high, or left at the default.
Suggesters can also be configured with a SortExpression, which computes a score for each suggestion. It is important to re-index the domain when a new Suggester is configured: suggestions will not be returned until all documents have been indexed.
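As a small sketch (hypothetical endpoint and suggester name, configured on a text field):

```python
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-products-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

# Fetch up to five suggestions for the partial query "sma".
response = client.suggest(query="sma", suggester="title_suggester", size=5)
for match in response["suggest"]["suggestions"]:
    print(match["suggestion"], match["id"])
```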
Conclusion
Amazon CloudSearch provides a simple yet powerful ‘Suggest’ implementation, which is sufficient for most applications. If you are looking for advanced options or further customization of suggestions, Apache Solr and Elasticsearch offer some good options.


5.4 Highlighting

Highlighting gives formatting cues to end users in the search results. It is a valuable feature: front-end search applications highlight snippets of matched text from each result, conveying to end users why the document matched their query. In this section, we describe the options present in all three search engines.
Apache Solr
Apache Solr includes matched text fragments from documents in the query response. These fragments appear in the response as a highlighted section that search clients use as a presentation cue. Apache Solr is packaged with good highlighting utilities that give control over the text fragments, fragment size, fragment formatting, and so on. These can be incorporated with Solr query parsers and request handlers.
Apache Solr comes with three highlighting utilities:
       Standard Highlighter
       FastVector Highlighter
       Postings Highlighter
The Standard Highlighter is the most commonly used because it is a good choice for a wide variety of search use cases. The FastVector Highlighter is ideal for large documents and highlighting text in a variety of languages. The Postings Highlighter works well for full-text keyword search.
Elasticsearch
Elasticsearch also allows highlighting search results on one or more fields. The implementation offers a Lucene-based plain highlighter, a fast-vector-highlighter, and a postings-highlighter. In Elasticsearch, the highlighter type can be forced in the query, a flexible option for developers who want a specific highlighter to suit their requirements.
The three highlighters in Elasticsearch exhibit the same behavior seen in Solr, because both engines inherit them from the Lucene family.
Amazon CloudSearch
Amazon CloudSearch simplifies highlighting: specify the highlight.FIELD parameter in the search request, and CloudSearch returns excerpts with the search results showing where the search terms occur within a particular field of each matching document.
For example, the search term ‘smartphone’ highlighted in the description field:
"highlights": {"description": "A *smartphone* is a mobile phone with an advanced mobile operating system. They typically combine the features of a cell phone with those of other popular mobile devices, such as personal digital assistant (PDA), media player and GPS navigation unit. A *smartphone* has a touchscreen user interface and can run third-party apps, and are camera phones."}
Amazon CloudSearch also provides controls such as the number of search term occurrences within an excerpt and how they should be highlighted (plain text or HTML).
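For illustration (hypothetical endpoint; highlight options are passed per field as JSON):

```python
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-products-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

# Request plain-text excerpts with up to two highlighted phrases,
# wrapped in asterisks.
response = client.search(
    query="smart phone",
    highlight='{"description": {"format": "text", "max_phrases": 2, '
              '"pre_tag": "*", "post_tag": "*"}}',
)
for hit in response["hits"]["hit"]:
    print(hit["highlights"]["description"])
```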
Conclusion
From a development perspective, all three search engines provide easy and simple highlighting implementations. If you are looking for different and more advanced highlighting options, Apache Solr and Elasticsearch have some good features.

Feature 6: Multilingual support

Multilingualism is a very important feature for global applications that cater to non-English-speaking geographies. A leading information measurement company’s survey found that search engines built with multilingual features succeed because of native language support and a focus on the cultural background of users.
Business impact: multilingual search is an effective marketing strategy for getting consumers’ attention. In e-commerce, a platform does more business when it speaks the customer’s native tongue.
Apache Solr
Apache Solr is packaged with multilingual support for the most common languages. It carries many language-specific tokenizer and filter libraries that can be configured during indexing and querying.
Apache Solr engineering forums recommend a multi-core architecture in which each core manages one language. Solr also supports language detection using Tika and LangDetect, which helps map text data to language-specific fields during indexing.
Elasticsearch
Elasticsearch has incorporated a vast collection of language analyzers for the most commonly spoken languages. The primary role of a language analyzer is to split, stem, filter, and apply the transformations specific to that language.
Elasticsearch also allows users to define a custom analyzer as an extension of another analyzer.
Amazon CloudSearch
Amazon CloudSearch has strong support for language-specific text processing, with pre-defined default analysis schemes for 34 languages. CloudSearch processes text and text-array fields based on the configured language-specific analysis scheme.
Amazon CloudSearch also allows users to define a new analysis scheme as an extension of a default language analysis scheme.
Conclusion
All three search engines have ample and effective support features for widely spoken international languages.

Languages Support
The table below lists the languages supported by each search engine.

| Search engine | Languages supported |
| --- | --- |
| Apache Solr | Arabic, Brazilian Portuguese, Bulgarian, Catalan, Chinese, Simplified Chinese, CJK, Czech, Danish, Dutch, Finnish, French, Galician, German, Greek, Hebrew, Lao, Myanmar, Khmer, Hindi, Indonesian, Italian, Irish, Japanese, Latvian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Scandinavian, Serbian, Spanish, Swedish, Thai and Turkish |
| Elasticsearch | Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai and Turkish |
| Amazon CloudSearch | Arabic, Armenian, Basque, Bulgarian, Catalan, Chinese - Simplified, Chinese - Traditional, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hebrew, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and Thai |


Feature 7: Protocol & API Support

7.1 Request and Response formats

| Search engine | Request formats | Response formats |
| --- | --- | --- |
| Apache Solr | XML, JSON, CSV | JSON, XML, CSV |
| Elasticsearch | JSON | XML, JSON |
| Amazon CloudSearch | XML, JSON | XML, JSON |

7.2 External Integrations

| Search engine | Integrations available |
| --- | --- |
| Apache Solr | Drupal, Magento, Django, ColdFusion, Wordpress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak, DataStax Enterprise Search, Cloudera Search, Hortonworks Data Platform, MapR |
| Elasticsearch | Drupal, Django, Symfony2, Wordpress, CouchBase, SearchBlox, Hortonworks Data Platform, MapR |
| Amazon CloudSearch | Amazon Web Services integrations available |

7.3 Protocols Support

| Search engine | Protocols supported |
| --- | --- |
| Apache Solr | HTTP, HTTPS |
| Elasticsearch | HTTP, HTTPS |
| Amazon CloudSearch | HTTP, HTTPS |

Feature 8: High Availability

All three search engines are architected for
       High availability (HA)
       Replication
       Scaling design principles
In this section, we will discuss the high availability options present in these three search engines.

8.1 Replication

Replication is the copying or synchronizing of the search index from master nodes to slave nodes so that the data can be managed efficiently.
Replication is a key design principle in highly available search and in scaling. From a high availability perspective, replication enables both HA and failover from master nodes (shards or leaders) to slave nodes (replicas). From a scaling perspective, replication is used to scale out the slave or replica nodes when request traffic increases.
Apache Solr
Apache Solr supports two models of replication: legacy mode and SolrCloud. In legacy mode, the replication handler copies data from the master node’s index to slave nodes. The master server manages all index updates while the slave nodes handle read queries; this segmentation of master and slave allows Solr clusters to scale and serve heavy loads.
Apache SolrCloud is an advanced distributed cluster setup of Solr nodes, designed for high availability and fault tolerance. Unlike legacy mode, there is no explicit concept of “master/slave” nodes; instead, the search cluster is split into leaders and replicas. Each leader is responsible for ensuring that its replicas hold the same data it stores. Apache Solr has a configuration called ‘numShards’ which defines the number of shards (leaders). During start-up, the core index is split across ‘numShards’ shards, which are designated as leaders. Nodes that join the Solr cluster after the initial ‘numShards’ are automatically assigned as replicas of the leaders.
Elasticsearch
Elasticsearch follows a concept similar to SolrCloud. In brief, an Elasticsearch index can be split into multiple shards, and each shard can be replicated to any number of nodes (0, 1, 2 … n). Once replication is set up, the index consists of primary shards and replica shards. The numbers of shards and replicas are defined at index creation; the number of replicas can be changed dynamically, but the shard count cannot.
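A brief sketch of this (assuming a local cluster and a hypothetical index name):

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address

# Create an index with 2 primary shards and 2 replicas per shard.
requests.put(ES + "/products", json={
    "settings": {"number_of_shards": 2, "number_of_replicas": 2}
})

# The replica count can be changed later; the shard count cannot.
requests.put(ES + "/products/_settings", json={
    "index": {"number_of_replicas": 3}
})
```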

Apache Solr and Elasticsearch
Both Apache Solr and Elasticsearch support synchronous and asynchronous replication models. If replication is configured in ‘synchronous’ mode, the primary (leader) shard waits for successful responses from the replica shards before acknowledging the commit. In ‘asynchronous’ mode, the response is returned to the client as soon as the request executes on the primary or leader shard, and the request is forwarded to the replicas asynchronously.
The table below depicts the replication topology followed in Solr and Elasticsearch.


Replication handling in Apache Solr and Elasticsearch:

| Label | Node | Role |
| --- | --- | --- |
| S1 | Node 1 | Shard 1 of the cluster |
| S2 | Node 2 | Shard 2 of the cluster |
| R1 | Node 3 | Replica 1 of Shard 1 |
| R2 | Node 4 | Replica 1 of Shard 2 |
| R3 | Node 5 | Replica 2 of Shard 1 |
| R4 | Node 6 | Replica 2 of Shard 2 |


Amazon CloudSearch
Amazon CloudSearch is simple and refined in its handling of replication, streamlining the job of search engineers and administrators. When configuring scaling options, Amazon CloudSearch prompts for the desired replication count, which should be based on load requirements.
Amazon CloudSearch automatically scales the number of replicas for a domain up and down based on request traffic and data volume, but never below the desired replication count. The replication scaling option can be changed at any time: if the scale requirement is temporary (for example, anticipated spikes during a seasonal sale), the desired replication count can be pre-scaled up and then reverted after request volume returns to a steady state. Modifying the replication count does not require index rebuilding, but the time to complete replica synchronization depends on the size of the search index.
The benefits of the Amazon CloudSearch replication model are:
       Search instance capacity is automatically replicated and load is distributed, so the search layer is robust and highly available at all times.
       Improved fault tolerance: if any one replica goes down, the other replica(s) continue to handle requests while the failed replica recovers.
       The entire process of scaling and distribution is automated, avoiding manual intervention and support.

Conclusion
All three search engines have a solid foundation for replication. Apache Solr and Elasticsearch let you define your own replication topology, configured for synchronous or asynchronous replication, and they can be scaled manually or automatically based on application requirements with custom programs. However, substantial managed-service operations are required when cluster replication is set up at enterprise scale.
Amazon CloudSearch fully manages replication, including scaling, load distribution, and fault tolerance. This simplicity saves operations costs for enterprises and companies.

8.2 Failover

Failover is a back-end operation that switches to secondary or standby nodes in the event of a primary server failure. It is an important fault tolerance function for systems with low or zero downtime requirements.
Apache Solr and Elasticsearch
When an Apache Solr or Elasticsearch cluster is built with shards and replicas, the cluster inherently becomes fault-tolerant and supports failover out of the box.
During a failure, the cluster is expected to keep serving operations while the failed node is put into a recovery state. Both the Apache Solr and Elasticsearch documentation strongly recommend a distributed cluster setup to protect the user experience from application or infrastructure failures.
If all nodes storing a shard and its replicas fail, client requests will also fail. If the shards are configured to be tolerant, partial results can be returned from the remaining available shards. This behavior is the same in both Apache Solr and Elasticsearch.
The table below depicts how failover is handled in a cluster. This layout applies to both Solr and Elasticsearch.




| Node | Replica 1 | Replica 2 |
| --- | --- | --- |
| Shard 1 (Node 1) | Shard 1 first replica (Node 3) | Shard 1 second replica (Node 5) |
| Shard 2 (Node 2) | Shard 2 first replica (Node 4) | Shard 2 second replica (Node 6) |

The table below illustrates failure scenarios in a search cluster.

| Scenario | Behavior |
| --- | --- |
| A | If Shard 1 fails, one of its replica nodes, either Node 3 or Node 5, is chosen as leader. |
| B | If Shard 2 fails, one of its replica nodes, either Node 4 or Node 6, is chosen as leader. |
| C | If Shard 1 Replica 1 fails, Shard 1 Replica 2 continues to support replication as well as serve requests. |
| D | If Shard 2 Replica 1 fails, Shard 2 Replica 2 continues to support replication as well as serve requests. |
Elasticsearch uses its internal Zen Discovery to detect failures: if the node holding a primary shard dies, a replica is promoted to primary. Apache Solr uses Apache ZooKeeper for coordination, failure detection, and leader voting; ZooKeeper initiates the leader election process among replicas when a leader or primary shard fails.

Amazon CloudSearch
Amazon CloudSearch has built-in failover support. Amazon CloudSearch recommends scaling options and availability options to increase fault tolerance in the event of a service disruption or node failures.
When Multi-AZ is turned on, Amazon CloudSearch provisions the same number of instances for your search domain in a second availability zone within the region. The instances in the primary and secondary zones are each capable of handling the full load in the event of a failure.
In the event of a service disruption or failure in one availability zone, traffic is automatically redirected to the secondary availability zone. In parallel, Amazon CloudSearch self-heals the failed part of the cluster and restores the nodes without any administrative intervention. During this switch, in-flight queries might fail and will need to be retried from the front-end application.
Failover support can be further improved by increasing the partitions and replicas in the Amazon CloudSearch scaling options. If one replica or partition fails, the other nodes handle requests while it recovers.
Amazon CloudSearch is very sophisticated in its failure handling: node health is continuously monitored, and in the event of infrastructure failures, nodes are automatically recovered or replaced.
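For illustration (hypothetical domain name), enabling Multi-AZ with boto3 could look like:

```python
import boto3

# Enabling Multi-AZ provisions a duplicate set of instances in a second
# Availability Zone for the domain.
cs = boto3.client("cloudsearch", region_name="us-east-1")

cs.update_availability_options(DomainName="products", MultiAZ=True)

options = cs.describe_availability_options(DomainName="products")
print(options["AvailabilityOptions"]["Options"])
```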
Conclusion
Failover can be architected by applying techniques such as replication, sharding, service discovery, and failure detection. Apache Solr and Elasticsearch advocate building your search system in ‘cluster mode’ to address failover. They take on that responsibility through service discovery, which detects unhealthy nodes, maintains cluster information, and rebalances the search cluster when node failures are detected.
Amazon CloudSearch supports failover for single-node as well as cluster-mode domains. Behind the scenes, CloudSearch continuously monitors the health of the search instances and manages them automatically during failures.

Feature 9: Scaling

The ability to scale in terms of computing power, memory, or data volume is essential in any data- and traffic-bound application. Scaling is a significant design principle employed to improve performance, load balancing, and high availability.
Over time, the search cluster is expected to be scaled horizontally (scale out) or vertically (scale up) depending upon need.
Scale-up is the process of moving from a small server to a larger server. Scale-out is the process of adding servers to share the load. The scaling strategy should be selected based on application requirements.
Apache Solr and Elasticsearch
Scaling an Apache Solr or Elasticsearch application involves manual processes. These can range from a simple server addition to advanced tasks like cluster topology changes, storage changes, or infrastructure upgrades.
Vertical scaling requires steps like fresh setup and configuration, downtime, and node restarts. Horizontal scaling may involve re-sharding, rebalancing, or cache warming.
While a search cluster system can benefit from powerful hardware, vertical scaling has its own limitations. Upgrading or increasing the infrastructure specifications on the same server can involve tasks like:
       New setup
       Backup
       Downtime
       Application re-testing
Scaling out is generally considered easier than scaling up.
An expert search administrator (Apache Solr or Elasticsearch) is usually assigned to keep a close watch on the performance of the search servers. Infrastructure and search metrics play a key role in the administrator's decision making.
When these metrics cross a particular server's thresholds and start affecting overall performance, new servers have to be spawned manually. The scale-out task can also expand to index partitioning, auto-warming, caching, and re-routing/distribution of search queries to the new instances. It takes a Solr or Elasticsearch expert on your team to identify and execute this activity periodically.
Sharding and Replication
Though scaling up, scaling out, and scaling down involve manual work, technology-driven companies automate this process by developing custom programs. These programs continuously monitor the cluster and make decisions to scale elastically; the outcome is quite similar to what AWS Auto Scaling offers.
In terms of administration functionality, both Apache Solr and Elasticsearch offer scaling techniques called Sharding and Replication.
Sharding (partitioning) is a method in which a search index is split into multiple logical units called "shards". When the indexed documents outgrow a single node's physical capacity, administrators recommend sharding. When sharding is enabled, search requests are distributed to every shard in the collection, and the results are individually collected and then merged.
Another scaling technique, replication (see 8.1 Replication, discussed in detail), adds new servers with redundant copies of your index data to handle higher concurrent query loads by distributing requests across multiple nodes.
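To make the two techniques concrete, the sketch below creates an Elasticsearch index with two shards and two replicas per shard, matching the six-node layout described earlier; the host and index name are assumptions, and a Solr collection can be set up analogously through its Collections API:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Two primary shards split the index; two replicas of each shard add
# redundant copies that absorb query load and take over on failure.
es.indices.create(
    index="products",
    body={"settings": {"number_of_shards": 2, "number_of_replicas": 2}},
)

Note that Elasticsearch fixes the primary shard count at index creation, while the replica count can be changed later; this is part of why scaling out needs upfront planning.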
Amazon CloudSearch
Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the amount of data or the query volume changes. CloudSearch can be scaled based on data volume or on request traffic. When the search data volume increases, CloudSearch moves from a smaller search instance type to a larger one. If the capacity of the largest search instance type is also exceeded, CloudSearch partitions the search index across multiple search instances (the sharding technique).
When traffic and concurrency grow, Amazon CloudSearch deploys additional search instances (replicas) to support the load. This automation eases the complexity and manual labour of the scaling-out process. Conversely, when traffic drops, Amazon CloudSearch scales down your search domain by removing the additional search instances to minimize costs.
The Amazon CloudSearch management console allows users to configure the desired partition count and the desired replication count, and to change the instance type (scaling up) at any time. This inherent elastic scaling is one of the most important points in favor of Amazon CloudSearch.
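The same settings are scriptable. A minimal sketch with boto3 follows, where the domain name and instance type are assumptions; note that preconfigured partition counts only apply to the largest instance type:

import boto3

cs = boto3.client("cloudsearch", region_name="us-east-1")

# Desired counts act as a floor; CloudSearch still scales further
# automatically as data volume or query traffic grows.
cs.update_scaling_parameters(
    DomainName="movies",
    ScalingParameters={
        "DesiredInstanceType": "search.m3.2xlarge",  # scale up
        "DesiredReplicationCount": 2,                # replicas for traffic
        "DesiredPartitionCount": 2,                  # partitions for data
    },
)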
Conclusion
Scaling in search is implemented in the form of sharding and replication. All three search engines offer strong support for setting up their search tier in ‘cluster mode’.
Scaling Apache Solr and Elasticsearch often requires administration, as there is no hard and fast rule. Techniques like elastic scaling can be implemented only up to a limit; when the cluster grows further, manual intervention and careful planning are required. Vertical scaling in Apache Solr and Elasticsearch is even more delicate: it requires individual management of the nodes in the cluster and is executed using techniques like rolling restarts and custom scripts.
Amazon CloudSearch takes away all of these operational intricacies from administrators. The desired partition count and desired replication count options in CloudSearch automatically scale the domain up and down based on data volume and request traffic. This saves a lot of effort and cost in operations and management.

Feature 10: Customization

At times, the search system may not support a specific feature or offer built-in integration with other systems. In such cases, most open source software allows developers to build the desired features as plugins, extensions, or modules. The developer community often shares extension libraries that address practical needs; these libraries can be customized and integrated with the system.
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch both come from the same open source lineage, allowing customization of:
       Analyzers
       Filters
       Tokenizers
       Language analysis
       Field types
       Validators
       Fall back query analysis
       Alternate query custom handlers
Since both products are open source, developers can customize or extend the libraries through plugins to fit the required feature modifications. Build and deployment become the developer's responsibility after extending the code base.
Apache Solr and Elasticsearch have many plugin extensions that allow developers to add custom functionality for a variety of purposes. These plugins are packaged as separate libraries and wired into the application through configuration mappings.
Amazon CloudSearch
Amazon CloudSearch does not allow any customization. The search features in Amazon CloudSearch are offered by AWS after careful thought and collective customer feedback. The Amazon CloudSearch team continually evaluates new features and rolls them out proactively.
Conclusion
Amazon CloudSearch has a highly capable feature set for building search systems. However, if you anticipate heavy customization of your search functionality, Apache Solr or Elasticsearch is the better choice, as their core search libraries are open source. It is also important to note that any customization of the core libraries leaves the build and deployment responsibility with the developer, and the customization must be maintained for every version upgrade or new release of your search engine.

Feature 11: More

11.1 Client libraries

Client libraries are required for communicating with search engines. They are essential for developers because they handle the low-level details of connecting to the search engine and let applications interact through simple high-level methods.
Apache Solr
Apache Solr has open source API clients for interacting with Solr using simple high-level methods. The client libraries are available for Java, PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, and JavaScript.
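For instance, with the community pysolr library, indexing and searching take only a few lines; the Solr URL, core name, and fields here are illustrative assumptions:

import pysolr

# Assumed local Solr core.
solr = pysolr.Solr("http://localhost:8983/solr/collection1", timeout=10)

# Add (and commit) a document, then query it back.
solr.add([{"id": "doc1", "title": "Search engines compared"}])
results = solr.search("title:search")
for doc in results:
    print(doc["id"])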
Elasticsearch
Elasticsearch provides official clients for Java, Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby. There are also community-provided client libraries that can be integrated with Elasticsearch.
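A comparable sketch with the official Python client (a 2015-era, 1.x-style API; the host, index, and fields are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a document, then run a full-text match query against it.
es.index(index="articles", doc_type="article", id=1,
         body={"title": "Search engines compared"})
es.indices.refresh(index="articles")  # make the document searchable now
res = es.search(index="articles",
                body={"query": {"match": {"title": "search"}}})
print(res["hits"]["total"])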
RESTful API
Besides the official and community client APIs, Elasticsearch and Apache Solr can be integrated through their RESTful APIs. The REST endpoints can be called from a typical web client written in your preferred programming language, or even from the command line.
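A minimal sketch of that raw REST access using Python's requests library; the endpoints, core, and index names assume default local installations:

import requests

# Query Solr over HTTP (default port 8983).
r = requests.get("http://localhost:8983/solr/collection1/select",
                 params={"q": "title:search", "wt": "json"})
print(r.json()["response"]["numFound"])

# Query Elasticsearch over HTTP (default port 9200).
r = requests.get("http://localhost:9200/articles/_search",
                 params={"q": "title:search"})
print(r.json()["hits"]["total"])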
Amazon CloudSearch
Amazon CloudSearch exposes a RESTful API for configuration, document service, and search.
       The configuration API can be used for CloudSearch domain creation, configuration, and end-to-end management.
       The document service API enables you to add, replace, or delete documents in your Amazon CloudSearch domain.
       The search API is used to send search or suggestion requests to your Amazon CloudSearch domain.
Alternatively, AWS also provides a downloadable SDK package, which simplifies coding. The SDK is available for popular languages like Java, .NET, PHP, and Python. The SDK APIs cover most Amazon Web Services, including Amazon S3, Amazon EC2, CloudSearch, and DynamoDB. The SDK package includes the AWS library, code samples, and documentation.
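A minimal sketch of the document service and search APIs through boto3; the domain endpoint below is a placeholder, and the region and fields are assumptions:

import json
import boto3

# Each domain has its own document/search endpoint (placeholder URL).
domain = boto3.client(
    "cloudsearchdomain",
    region_name="us-east-1",
    endpoint_url="https://search-movies-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

# Document service API: add or replace documents as a JSON batch.
batch = [{"type": "add", "id": "doc1",
          "fields": {"title": "Search engines compared"}}]
domain.upload_documents(documents=json.dumps(batch).encode("utf-8"),
                        contentType="application/json")

# Search API: a simple-parser query against the domain.
res = domain.search(query="search", queryParser="simple")
print(res["hits"]["found"])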

Feature 12: Cost

From an overall perspective, cost is a very important factor, and companies always look for ways to reduce Total Cost of Ownership (TCO). In this section, we examine the cost components of the three search engines.
Apache Solr and Elasticsearch
The cost of Apache Solr and Elasticsearch includes infrastructure resources, managed services, and people. For any type of deployment, server costs and engineering costs are unavoidable. The commitment to continuous admin operations depends on the application's requirements and criticality.
Amazon CloudSearch
Amazon CloudSearch's cost likewise includes server costs and engineering costs, which are essential for any search deployment. Being a fully managed service, Amazon CloudSearch covers the managed services as part of the server costs. There is also no upfront charge: usage is billed at the end of each month based on actual CloudSearch consumption.
Conclusion
The net operating costs are essentially the same across all three search engines, but people costs will be about 30% higher for self-managed Apache Solr or Elasticsearch compared to Amazon CloudSearch.
For example, a business-critical search application will require 24x7 support and managed services. This cost is incurred as part of managed services, which is an additional expense in Apache Solr and Elasticsearch deployments.

Search is an indispensable feature in most business applications.
Apache Solr and Elasticsearch are time-proven solutions. Many large organizations have used them for years but are now looking for greater operational efficiency and cost effectiveness; other companies are looking for innovative ways to grow their businesses and provide value. In recent years, a huge number of technology companies have started to reap the benefits of cloud-based search services, mainly in terms of getting started quickly and then accommodating growth without the need to switch vendors. When scalability, cost, and speed-to-market are primary concerns, we recommend using some form of cloud service. And if you want to enjoy the benefits of a cloud solution built on the architecture of Apache Solr, we recommend Amazon CloudSearch.






