Maintaining Casewhere solutions - Daily operations

Introduction

This document serves as a comprehensive resource tailored for IT administrators, system operators, and any personnel involved in the daily maintenance to ensure the seamless operation and optimal performance of Casewhere solutions.

Documentation and communication

Personnel

Creating a personnel list for daily operations is essential for maintaining transparency and assigning responsibilities effectively. Below is a template for a personnel list that you can customize according to your organization's structure and needs.

Name	Roles	Responsibilities	Contact information
Jone Doe	System administrators	Be responsible for system health monitoring, server performance, and database connectivity	jone-doe@example.com
Jane Doe	Casewhere specialists	Provide IT supports, addresses technical issues related to the Casewhere platform	jane-doe@example.com

Solution components

It's recommended to document all system components and their relations. This will aid in debugging and troubleshooting.

Solution diagram

Diagrams are always the best tools for describing solution architecture. A typical Casewhere solution can be illustrated as follows:

Casewhere architecture

Component list

There is still a need to list all system components with detailed profiles, configurations, etc. For example:

Name	Type	Description
Casewhere Web	VM	Cores: 8. RAM: 16GB. SSD 256GB. IP: x.x.1.3. Outbound IP: y.y.1.20
Casewhere DB	VM	Cores: 8. RAM: 32GB. SSD 512GB. IP: x.x.1.5
Identity Service	VM	Safewhere Identify
CVR	Datafordeler	Restful API
CPR	Service platform	Restful API
Digital Posts	Service platform	Restful API

Others

Other things, such as domains and SSLs, must be documented. It's recommended to create reminders for SSL renewal.

Solution monitoring

Solution monitoring should be automated as much as possible. Daily reviews of notifications and alerts are essential to ensure seamless availability and performance. Detailed analysis can be conducted on suspicious or potentially high-risk threats.

Monitoring availability

Casewhere provides a health check endpoint at {worker_api}/api/v0.1/ping, which verifies the availability of Casewhere, including its web applications and associated databases. If your solution depends on external services, it is recommended to implement availability tests for publicly accessible services at a minimum.

The following screenshot demonstrates the use of Azure Monitor Standard Test to set up tests for monitoring the availability of Casewhere.

For on-premise solutions, Casewhere provides an external tool that continuously performs health checks and sends alerts to designated recipients when the system becomes unavailable.

Monitoring custom metrics

Each project has unique concerns and may require defining and monitoring specific metrics to ensure the solution meets expectations.

Casewhere offers standard components for monitoring system performance and generating insightful SLA reports. These components allow you to monitor individual resources (e.g., pages, workflows) or groups of resources, define service goals, and configure alert rules. Learn more here.

The following screenshot illustrates how Casewhere can manage and monitor various types of metrics.

Monitoring system resources

It is recommended to use the built-in monitoring tools provided by hosting providers. If these options are limited, you can also consider setting up alerts using Windows Performance Counters on both the Web Server and Database Server. Some typical monitoring metrics include:

CPU utilization. For example, every 5 minutes, check and alert if the average CPU percentage in the last 30 minutes is greater than or equal to 85%.
Memory consumption. For example, every 5 minutes, check and alert if the average available RAM percentage in the last 30 minutes is greater than or equal to 85%.
Disk consumption. For example, every day, check and alert if the available space of the data disk is less than 10%.
Error rate: For instance, every 5 minutes, check and trigger an alert if the total HTTP 5xx errors in the last 30 minutes exceed 10.
Unusual slow: For example, check every minute and trigger an alert if the average response time over the last 5 minutes exceeds 10 seconds.

Monitoring applications

It is advisable to monitor critical components within the solution, including the payment gateway, mail service, CPR service, user authentication module, among others. Casewhere provides standard components for capturing custom events generated by the solution. You can integrate a variety of logging technologies to store events, ranging from cloud services like Azure Application Insights to local tools such as Windows Event Log. It's crucial to establish monitoring objectives and set up alerts accordingly.

The screenshot below shows how Casewhere collects custom events in Application Insights:

If integration with a cloud-managed service like Application Insights isn't possible, Casewhere provides a self-monitoring component that store and manage events within Casewhere and triggers custom alerts.

System backups and recovery

Backup procedures

The backup strategy should be able to provide multiple restore points for data recovery. It’s helpful to recover from a data corruption disaster. For example:

Resource	Frequency	Time	Retention
MongoDB VM	Daily	02:00 AM, UTC	7 days
MongoDB VM	Monthly	02:00 AM, UTC	3 months
MongoDB VM	Weekly	02:00 AM, UTC	5 weeks

Data verification

The IT personnel must regularly test if the backup is restorable to make sure there is no surprise during the disaster recovery.

Disaster recovery plan

The primary goals of the disaster recovery plan:

To minimize interruptions to normal operations.
To limit the extent of disruption and damage.
To minimize the economic impact of the interruption.
To establish alternative means of operation in advance.
To train personnel with emergency procedures.
To provide for smooth and rapid restoration of service

For any disaster scenario, the following elements should be addressed:

Emergency response
- Notify customers, responsible personnel, and concerned parties immediately
- Consult with related parties to determine the degree of disaster and its impact on the business.
- Constantly monitor the disaster progress and keep parties updated.
Backup data: Even when you have regular backups, an instant backup should be taken at the time of disaster if it’s still possible.
Recovery actions
- Follow the strategic actions designed for the happening scenario.
- Ensure that all personnel involved know their tasks.
- Monitor the recovery progress and keep related parties updated.
- Send a service report to concerned parties after recovery.

Troubleshooting

When you need to conduct a thorough analysis to investigate an incident or a bug, familiarizing yourself with Casewhere logs is essential for troubleshooting problems in production. Casewhere currently provides the following types of logs:

System log: Platform logs and application logs
Performance log: The response time of all HTTP requests
Audit log: Data changes, who, what, and when

You can learn more about Casewhere logs here.