Temporal Platform's production readiness checklist
This page describes common challenges customers face who self-host Temporal and it shares recommendations to mitigate those issues.
Temporal at its core is about durability and reliability. To ensure this durability and reliability, a Temporal Service must be deployed according to best practices.
This guide provides a path to demonstrate that Temporal consumers can be confident in a Temporal Service and provides a list of key tests you as a user should perform against the service.
Self-Hosting Challenge Areas
Significant engineering and ongoing effort is required to resolve several potential challenges:
- Scalability with spiky or growing workloads
- Global hosting
- Uptime, availability and reliability
- Management and control plane
- Latency, which must be kept low and consistent
- Security
- Maintenance and upgrades
- Expert support to users of the service
- Cost management
Each of these components is an essential part of building a mission critical Temporal Service. Without demonstrated architectural durability, the value of Temporal's Durable Execution model is compromised.
Scalability with Variable or Growing Workloads
Workloads can be highly variable, and you may experience sustained workload spikes. Temporal recommends scaling your clusters to well above the average throughput. See Scaling Temporal: The Basics for an introduction to the topic.
Temporal server throughput is often limited by the number of Shards configured for the Temporal Service. A Shard is an unit within a Temporal Service by which concurrent Workflow Execution throughput can be scaled. Shard capacity, and often overall cluster throughput, is set at build time for a cluster and that cluster setting cannot be adjusted later. Adding more Shards if needed requires a cluster rebuild, and a migration to the new cluster.
The requirements of your Temporal Service will vary widely based on your intended production workload. You will want to run your own proof of concept tests and watch for key metrics to understand the system health and scaling needs.
Load testing. You can use the Omes benchmarking tool, see how we ourselves stress test Temporal, or write your own.
All metrics emitted by the server are listed in Temporal's source. There are also equivalent metrics that you can configure from the client side. At a high level, you will want to track these 3 categories of metrics:
- Service metrics: For each request made by the service handler we emit
service_requests
,service_errors
, andservice_latency
metrics withtype
,operation
, andnamespace
tags. This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations. - Persistence metrics: The Server emits
persistence_requests
,persistence_errors
andpersistence_latency
metrics for each persistence operation. These metrics include theoperation
tag such that you can get the request rates, error rates or latencies per operation. These are super useful in identifying issues caused by the database. - Workflow Execution stats: The Server also emits counters for when Workflow Executions are complete.
These are useful in getting overall stats about Workflow Execution completions.
Use
workflow_success
,workflow_failed
,workflow_timeout
,workflow_terminate
andworkflow_cancel
counters for each type of Workflow Execution completion. These include thenamespace
tag.
Availability
A high level of availability and reliability (99.99%) is a requirement for mission critical deployments. Temporal recommends testing for this availability level while load testing. We also recommend validating this level of reliability while doing server upgrades, to ensure no loss of service availability.
Temporal Clusters can be deployed in as many regions as needed to meet various requirements:
- Data Residency
- Latency
- Security / Isolation This can multiply the effort to implement and maintain clusters.
Temporal Cloud is available in 15 AWS regions.
Management and Control Plane
Temporal success leads to larger Temporal deployments. Needs can increase, and can go from having one or two production use cases in a single region to many use cases in many regions. Running multiple Temporal Services is complex work, as each needs its own setup, tuning, and configuration.
Needing to monitor and manage all your Temporal Services in a unified way leads to operational management pain. Consider adding a layer on top of Temporal to manage multiple Temporal Services: a control plane. A control plane manages and directs data flow, deciding where data packets should be sent. A Temporal Service data plan can streamline operations and improve efficiency. Since Temporal does not ship its own open source data plane, rolling your own can be complex and take effort to add.
Temporal Cloud provides exactly that support. With Temporal Cloud, all Namespaces in all regions can be managed from a single view. Temporal Cloud also has RBAC functionality that can delegate responsibilities for individual Namespaces.
Self-hosted Temporal does not support RBAC or audit logging out of the box. Temporal Cloud provides RBAC and SSO support, audit logging, data encryption, third party penetration test validation, and SOC 2-Type II and HIPAA compliance.
Maintenance and Upgrades
Temporal recommends keeping up-to-date and not falling behind on your server versions.
Temporal Server is proactively updated, and releases as often as every two weeks. Temporal recommends upgrading sequentially, not skipping any minor versions, although you can skip patch versions. No support is guaranteed for Temporal Server, but very old servers will be hard for even the community to support, so we encourage you to keep up to date. You must create and maintain the infrastructure to host and run your self-hosted Temporal installation, such as Kubernetes, as well as data stores for persistence.
Server upgrades can negatively affect self-hosted Temporal Service availability. Temporal recommends load and availability testing during the upgrade process to understand the performance implications.
Temporal Cloud updates are managed by the Temporal Cloud team; Cloud upgrades are seamless.
Expert Support
Temporal recommends that customer platform teams who are building out a Temporal service gain deep experience across the lifecycle and breadth of a Temporal application.
Specific activities include:
- Worker tuning
- Worker best practices
- Code reviews
- Design guidance
- Training
- Code reviews
- Security reviews
- Metrics and monitoring
- Technical onboarding
Temporal support provides guidance on all of the above.
Cost Management
Running a mission critical, global Temporal Service can be expensive. Temporal Server is a complex system to run and scale. Temporal recommends performance testing and planning scaling as your performance requirements evolve. Following our guidance can oversize your self-hosted Temporal Server installs, but this is necessary to handle unpredictable spiky workloads. Performance testing can help you right-size your environments. Running mission critical Temporal as a Service requires multiple Temporal Clusters for high availability and global coverage.
It is a good practice to have trained, experienced administrators familiar with Temporal Service architecture to maintain your Temporal servers and provide a mission critical service. Staffing, training and skill development can be significant costs to maintaining a Temporal Service.
Temporal Cloud can be significantly less expensive to set up and scale.