Cross-region deployments provide increased resiliency to maintain business continuity during power outages, natural disasters, or other operational disruptions. Many large enterprises design and implement special plans to be prepared for such situations. They rely on solutions built with AWS services and features to improve their reliability and response times. Amazon OpenSearch Service is a managed service for OpenSearch, a search and analytics engine at scale. OpenSearch Service provides high availability within an AWS Region through its Multi-AZ deployment model and delivers regional resiliency with cross-cluster replication. Amazon OpenSearch Serverless is a deployment option that provides on-demand automatic scaling, to which we continue to add many features.
With the existing cross-cluster replication feature in OpenSearch Service, one domain is designated as the leader and another as the follower, using an active-passive replication model. While this model provides a way to continue operations during a regional failure, it requires you to manually configure the follower. Additionally, after recovery, you must reconfigure the leader-follower relationship between the domains.
In this post, we describe two solutions that provide cross-region resiliency without the need to reestablish relationships during a failback, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Simple Storage Service (Amazon S3). These solutions apply to both OpenSearch Service managed clusters and OpenSearch Serverless collections. We use OpenSearch Serverless as an example for the configurations in this post.
Solution Overview
In this post, we describe two solutions. In both options, local data sources in a region write to an OpenSearch Ingestion (OSI) pipeline configured within the same region. The solutions can be extended across multiple regions, but we show two regions as an example, because regional resiliency across two regions is a popular deployment pattern for many large-scale enterprises.
You can use these solutions to address cross-region resiliency needs for OpenSearch Serverless deployments and active-active replication needs for both provisioned and serverless OpenSearch Service options, especially when data sources produce disparate data across different regions.
Prerequisites
Complete the following preliminary steps:
- Deploy OpenSearch Service domains or OpenSearch Serverless collections in all regions where resiliency is needed.
- Create S3 buckets in each Region.
- Set up the AWS Identity and Access Management (IAM) permissions required for OSI. For instructions, see Amazon S3 as a source. Choose Amazon Simple Queue Service (Amazon SQS) as the method for processing data.
After completing these steps, you can create two OSI pipelines, one in each region, with the configurations detailed in the following sections.
Use OpenSearch Ingestion (OSI) for cross-region writes
In this solution, OSI takes local data from the region it is in and writes it to the other region. To facilitate cross-region writes and increase data durability, we use an S3 bucket in each region. The OSI pipeline in the other region reads this data and writes it to the collection in its local region. The OSI pipeline in the other region follows a similar data flow.
When reading data, you have two options: Amazon SQS or Amazon S3 scans. For this post, we use Amazon SQS because it helps provide near real-time data delivery. This solution also facilitates writing directly to these local buckets in the case of pull-based OSI data sources. See Source under Key concepts to understand the different types of sources that OSI uses.
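To make the SQS option more concrete, the following minimal Python sketch shows what an S3 event notification delivered through a queue tells a reader about new objects. OSI performs this step internally; the function name `object_locations`, the bucket name, and the object key are hypothetical, and the sketch assumes the bucket sends notifications directly to SQS (a notification relayed through Amazon SNS would be wrapped in an additional envelope).

```python
import json
from urllib.parse import unquote_plus


def object_locations(sqs_message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for objects created in an S3 event notification."""
    event = json.loads(sqs_message_body)
    locations = []
    for record in event.get("Records", []):
        # Only object-creation events signal new data to ingest.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        locations.append((bucket, key))
    return locations


# Hypothetical notification body, trimmed to the fields used above.
sample_body = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "osi-cross-region-write/2024/01/events+1.ndjson"},
        },
    }]
})
print(object_locations(sample_body))
```

Because each notification arrives shortly after the object is written, a queue-driven reader does not have to wait for a periodic scan of the bucket, which is what makes near real-time delivery possible.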
The following diagram shows the data flow.
The data flow consists of the following steps:
- Local data sources in a region write their data to the OSI pipeline in their region. (This solution also supports sources that write directly to Amazon S3.)
- OSI writes this data to the collection in its own region and then to the S3 bucket in the other region.
- OSI reads the other region's data from the local S3 bucket and writes it to the local collection.
- Collections in both regions now contain the same data.
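The steps above can be traced with a small in-memory model. This is only an illustrative Python sketch, not an OSI configuration: the `Region` class, the two functions, and the region names are hypothetical stand-ins for the collections, buckets, and pipelines described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    collection: list = field(default_factory=list)  # stands in for the local collection
    bucket: list = field(default_factory=list)      # stands in for the local S3 bucket
    read_offset: int = 0                            # how much of the bucket has been read


def write_pipeline(local: Region, remote: Region, doc: dict) -> None:
    """Steps 1-2: index the document locally, then stage a copy in the other region's bucket."""
    local.collection.append(doc)
    remote.bucket.append(doc)


def read_pipeline(local: Region) -> None:
    """Step 3: copy unread objects (the other region's data) from the local bucket to the local collection."""
    for doc in local.bucket[local.read_offset:]:
        local.collection.append(doc)
    local.read_offset = len(local.bucket)


east, west = Region("us-east-1"), Region("us-west-2")
write_pipeline(east, west, {"id": 1, "origin": east.name})
write_pipeline(west, east, {"id": 2, "origin": west.name})
read_pipeline(east)
read_pipeline(west)

# Step 4: both collections hold the same documents.
same = sorted(d["id"] for d in east.collection) == sorted(d["id"] for d in west.collection)
print(same)
```

In this model each bucket only ever receives the other region's documents, while both collections end up with the full set.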
The following snippets show the configuration of the two pipelines.
The code for the write pipeline is as follows:
To separate management and operations, we use two prefixes, osi-local-region-write and osi-cross-region-write, for buckets in both regions. OSI uses these prefixes to copy only the data from the local region to the other region. OSI also creates the keys s3.bucket and s3.key to decorate documents written to a collection. We remove this decoration while writing across regions; the pipeline in the other region adds it again.
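The prefix-based routing and the removal of the S3 decoration can be sketched in plain Python. This is an illustration of the idea rather than the pipeline configuration itself: the destination labels are hypothetical, and the sketch assumes the decoration appears as a nested `s3` object holding `bucket` and `key`.

```python
LOCAL_PREFIX = "osi-local-region-write"
CROSS_PREFIX = "osi-cross-region-write"


def destinations(s3_key: str) -> list[str]:
    """Decide where a document read from the local bucket should go (labels are hypothetical)."""
    if s3_key.startswith(LOCAL_PREFIX):
        # Local-origin data: index locally and forward to the other region.
        return ["local-collection", "other-region-bucket"]
    if s3_key.startswith(CROSS_PREFIX):
        # Data that already crossed regions: index locally only, never forward again.
        return ["local-collection"]
    return []


def strip_s3_decoration(doc: dict) -> dict:
    """Return a copy without the s3 metadata (assumed to be a nested 's3' object)."""
    return {k: v for k, v in doc.items() if k != "s3"}


doc = {"message": "hello", "s3": {"bucket": "example-bucket", "key": "osi-local-region-write/a.json"}}
print(destinations(doc["s3"]["key"]))
print(strip_s3_decoration(doc))
```

Forwarding only objects under the local prefix is what keeps a document from bouncing back and forth between the two regions.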
This solution enables near real-time data delivery across all regions, with the same data available in both regions. However, although the OpenSearch Service collections contain the same data, the buckets in each region contain only partial data. The following solution addresses this issue.
Use Amazon S3 for cross-region writes
In this solution, we use the Amazon S3 cross-region replication feature. This solution is compatible with all data sources available with OSI. OSI again uses two pipelines, but the key difference is that OSI writes the data to Amazon S3 first. After completing the steps that are common to both solutions, see Examples for setting up live replication for instructions on how to set up Amazon S3 cross-region replication. The following diagram shows the data flow.
The data flow consists of the following steps:
- Local data sources in a region write their data to OSI. (This solution also supports sources that write directly to Amazon S3.)
- This data is first written to the S3 bucket.
- OSI reads this data and writes it to the Region’s local collection.
- Amazon S3 replicates the data across regions, and OSI in each region reads the replicated data and writes it to its local collection.
The following snippets show the configuration of both pipelines.
The code for the write pipeline is as follows:
This solution is relatively simple to set up and relies on Amazon S3 cross-region replication. This solution ensures that the data in the S3 bucket and the OpenSearch Serverless collection is the same in both regions.
For more information about the SLA for this replication and the metrics that are available to monitor the replication process, see S3 Replication Update: Replication SLAs, Metrics, and Events.
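As a rough illustration of the replication setup this solution depends on, the following Python sketch builds a replication configuration in the shape accepted by the S3 `PutBucketReplication` API, with Replication Time Control and replication metrics enabled. The function name, rule ID, role ARN, and bucket names are hypothetical, and the sketch only builds the dictionary without calling AWS. Both buckets need versioning enabled before replication can be configured.

```python
import json


def replication_configuration(role_arn: str, destination_bucket_arn: str, prefix: str = "") -> dict:
    """Build a one-rule configuration that replicates objects under prefix to the destination bucket."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "osi-cross-region-replication",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # an empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": destination_bucket_arn,
                    # Replication Time Control and metrics, both with a 15-minute threshold.
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    }


# Hypothetical ARNs; one configuration per direction gives two-way replication.
role = "arn:aws:iam::111122223333:role/example-replication-role"
east_to_west = replication_configuration(role, "arn:aws:s3:::example-bucket-west")
west_to_east = replication_configuration(role, "arn:aws:s3:::example-bucket-east")
print(json.dumps(east_to_west, indent=2))

# With boto3 installed and credentials configured, each configuration could be applied with:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-bucket-east", ReplicationConfiguration=east_to_west)
```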
Impairment scenarios and additional considerations
Let’s consider a regional outage scenario. For this use case, we assume that your application is powered by an OpenSearch Serverless collection as a backend. When one region is impacted, these applications can simply fail over to the OpenSearch Serverless collection in the other region and continue operations without interruption, because all of the data present before the outage is available in both collections.
When the region issue is resolved, you can return to the OpenSearch Serverless collection in that region immediately or after waiting for some time for the missing data in that region to be replenished. Operations can continue without interruption.
You can automate these failover and recovery operations to provide a seamless user experience. This automation is beyond the scope of this post, but will be covered in a future post.
The existing cross-cluster replication solution requires you to manually reestablish a leader-follower relationship and restart replication from the beginning once it has recovered from a failure. However, the solutions discussed here automatically resume replication from the point where it was last interrupted. If for some reason only the OpenSearch Service resource (a collection or domain) fails, the data is still available in the local buckets and is written to the collection or domain as soon as it becomes available again.
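The resume behavior can be pictured with a small Python model in which a durable buffer keeps every document until it has been written successfully. This is a simplified sketch under stated assumptions: the `Collection` class and `drain` function are hypothetical, and the in-memory queue stands in for the S3 bucket and its pending notifications.

```python
from collections import deque


class Collection:
    """Stand-in for a collection or domain that can be temporarily unavailable."""

    def __init__(self):
        self.available = True
        self.docs = []

    def index(self, doc):
        if not self.available:
            raise ConnectionError("collection unavailable")
        self.docs.append(doc)


def drain(queue: deque, collection: Collection) -> int:
    """Index queued documents in order; stop at the first failure and keep the rest queued."""
    written = 0
    while queue:
        try:
            collection.index(queue[0])
        except ConnectionError:
            break  # the failed document stays at the head of the queue
        queue.popleft()  # remove a document only after a successful write
        written += 1
    return written


pending = deque({"id": i} for i in range(1, 4))
target = Collection()

target.available = False
first_attempt = drain(pending, target)   # nothing is written, nothing is lost

target.available = True
second_attempt = drain(pending, target)  # processing resumes from the first pending document
print(first_attempt, second_attempt, [d["id"] for d in target.docs])
```

Because a document leaves the buffer only after it is written, recovery needs no re-seeding step and no change to the relationship between the regions.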
You can also use these solutions effectively in an active-passive replication model. In such cases, it is sufficient to have a minimal set of resources in the replication region, such as a single S3 bucket. You can modify this solution to address different scenarios by using additional services such as Amazon Managed Streaming for Apache Kafka (Amazon MSK), which has a built-in replication feature.
When building cross-region solutions, keep in mind the cross-region data transfer costs for AWS. As a best practice, consider adding a dead-letter queue to all your production pipelines.
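A dead-letter queue holds events that could not be written, so that one bad document does not block or silently drop data. The following Python sketch shows the general pattern only; it does not represent how OSI implements its dead-letter queue, and the names `write_with_dlq` and `reject_malformed` are hypothetical.

```python
def write_with_dlq(docs, write, dead_letters, max_attempts=3):
    """Try each document up to max_attempts times; park persistent failures in dead_letters."""
    delivered = 0
    for doc in docs:
        for attempt in range(1, max_attempts + 1):
            try:
                write(doc)
                delivered += 1
                break
            except Exception as error:
                if attempt == max_attempts:
                    # Keep the document and the reason so it can be inspected and replayed.
                    dead_letters.append({"document": doc, "error": str(error)})
    return delivered


def reject_malformed(doc):
    """Hypothetical sink that refuses documents without a 'message' field."""
    if "message" not in doc:
        raise ValueError("missing 'message' field")


dead_letters = []
delivered = write_with_dlq([{"message": "ok"}, {"oops": 1}], reject_malformed, dead_letters)
print(delivered, dead_letters)
```

Recording the failure reason alongside the document makes it easier to fix the cause and replay the parked events later.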
Conclusion
In this post, we described two solutions that achieve regional resiliency for OpenSearch Serverless collections and clusters managed by OpenSearch Service. If you need explicit control over writing data between regions, use solution one. In our experiments with a few KB of data, most writes completed within a second between two chosen regions. Choose solution two if you need the simplicity it offers. In our experiments, replication was fully completed within a few seconds. 99.99% of objects will be replicated within 15 minutes. These solutions also serve as the architecture for an active-active replication model in OpenSearch Service using OpenSearch Ingestion.
You can also use OSI as a mechanism to search for data available in other AWS services, such as Amazon S3, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility). For more details, see Working with Amazon OpenSearch Ingestion pipeline integrations.
About the authors
Muthu Pitchaimani is a search specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in networking and security topics and is based in Austin, Texas.
Aruna Govindaraju is a Solutions Architect specializing in Amazon OpenSearch and has worked with many commercial and open source search engines. She is passionate about search, relevance, and user experience. Her expertise in correlating end-user signals with search engine behavior has helped many customers improve their search experience.