Incident Period: The past week
Impact: Brief performance degradation (a few seconds at a time) during high-load periods
Current Status: Stable – no degradation observed in the last 72 hours
During the past week, the database experienced short intervals of degraded performance. The degradation was caused by a combination of application-level inefficiencies and sudden load spikes from customer activity, which collectively pushed the database beyond its scaling thresholds.
Root Causes:
* Auditing service – generated excessive database queries; the lack of proper batching and query optimization placed unnecessary load on the database (a sketch of the unbatched vs. batched write pattern follows this list).
* Archival microservice – attempted to process and archive very large volumes of historical test run data in a single pass, creating long-running transactions and high I/O consumption.
* Customer load spikes – a few customers generated unusually heavy workloads within a short span of time, compounding the load already caused by the auditing and archival services.
* Autoscaling lag – the database autoscaling mechanism did not have sufficient time to react to the sudden spike, leading to short bursts of unhandled load before the system stabilized.
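For illustration only, here is a minimal sketch of the pattern behind the auditing root cause and its batching fix, assuming a PostgreSQL database accessed via psycopg2; the `audit_events` table, its columns, the event dict shape, and the function names are hypothetical and do not reflect the actual auditing service code.

```python
# Hypothetical sketch: batching audit writes instead of issuing one INSERT per event.
# Table/column names and the psycopg2 connection are assumptions for illustration.
from psycopg2.extras import execute_values


def record_audit_events_unbatched(conn, events):
    # Problem pattern (simplified): one statement and one round trip per event,
    # which multiplies query volume during high-load periods.
    # `events` is assumed to be an iterable of dicts with actor/action/created_at keys.
    with conn.cursor() as cur:
        for e in events:
            cur.execute(
                "INSERT INTO audit_events (actor, action, created_at) VALUES (%s, %s, %s)",
                (e["actor"], e["action"], e["created_at"]),
            )
    conn.commit()


def record_audit_events_batched(conn, events, batch_size=500):
    # Remediated pattern (simplified): buffer events and write each batch as a
    # single multi-row INSERT, sharply reducing the number of queries issued.
    with conn.cursor() as cur:
        for i in range(0, len(events), batch_size):
            chunk = events[i:i + batch_size]
            execute_values(
                cur,
                "INSERT INTO audit_events (actor, action, created_at) VALUES %s",
                [(e["actor"], e["action"], e["created_at"]) for e in chunk],
            )
    conn.commit()
```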
Corrective Actions:
* Auditing service queries were optimized to reduce redundant load.
* The archival microservice was redesigned to archive data in smaller, controlled batches (see the archival sketch after this list).
* Some high-volume customers were migrated to new infrastructure with isolated database clusters, reducing the risk of one customer's workload impacting others.
* New monitoring dashboards and alerts were introduced to catch early signs of database stress; query performance metrics and background job load are now tracked in real time (see the metrics sketch after this list).
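To make the batched archival concrete, below is a minimal sketch of per-batch archival with short transactions, again assuming PostgreSQL accessed via psycopg2; the `test_runs` / `test_runs_archive` tables, the batch size, the cutoff, and the pause between batches are illustrative assumptions rather than the actual microservice design.

```python
# Hypothetical sketch: archiving old test run data in small, controlled batches
# so no single long-running transaction or I/O burst hits the live database.
import time

BATCH_SIZE = 1000        # assumed batch size; would be tuned per table in practice
PAUSE_SECONDS = 0.5      # brief pause between batches to let other load through


def archive_old_test_runs(conn, cutoff):
    """Move rows older than `cutoff` into an archive table, one batch at a time."""
    while True:
        with conn.cursor() as cur:
            # Move one batch: delete from the live table and insert into the
            # archive table within the same short transaction.
            cur.execute(
                """
                WITH moved AS (
                    DELETE FROM test_runs
                    WHERE id IN (
                        SELECT id FROM test_runs
                        WHERE finished_at < %s
                        ORDER BY finished_at
                        LIMIT %s
                    )
                    RETURNING *
                )
                INSERT INTO test_runs_archive SELECT * FROM moved
                """,
                (cutoff, BATCH_SIZE),
            )
            moved = cur.rowcount
        conn.commit()          # committing per batch keeps transactions short
        if moved == 0:
            break              # nothing left to archive
        time.sleep(PAUSE_SECONDS)
```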
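And here is a minimal sketch of how query latency and background job load could be exported for the new dashboards and alerts, using the prometheus_client library; the metric names, the `timed_query` helper, and the placeholder job are hypothetical and stand in for whatever instrumentation the monitoring stack actually uses.

```python
# Hypothetical sketch: exposing query latency and background job load as
# Prometheus metrics that dashboards and alerts can consume.
import time
from prometheus_client import Histogram, Gauge, start_http_server

# Assumed metric names for illustration.
QUERY_LATENCY = Histogram(
    "db_query_duration_seconds",
    "Time spent executing database queries",
    ["query_name"],
)
BACKGROUND_JOBS_IN_FLIGHT = Gauge(
    "background_jobs_in_flight",
    "Number of background jobs (auditing, archival, ...) currently running",
)


def timed_query(cur, query_name, sql, params=None):
    """Run a query and record its duration under the given label."""
    start = time.perf_counter()
    try:
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        QUERY_LATENCY.labels(query_name=query_name).observe(time.perf_counter() - start)


@BACKGROUND_JOBS_IN_FLIGHT.track_inprogress()
def run_background_job():
    """Placeholder background job; the gauge tracks how many are running."""
    time.sleep(1)


if __name__ == "__main__":
    start_http_server(8000)   # expose /metrics on port 8000 for Prometheus to scrape
    while True:               # keep the process alive in this standalone sketch
        run_background_job()
```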
✅ Conclusion:
The degraded performance was the result of combined factors: inefficient auditing queries, aggressive archival processing, and sudden customer traffic spikes. With the applied optimizations, the customer migration, and improved monitoring, the system has stabilized and no further degradation has been observed. Continuous monitoring and the preventive measures above will help maintain long-term reliability.