IT infrastructures are becoming more complex, dense, and dispersed. IT personnel are challenged with quickly identifying, correlating, and resolving problems before they impact end users or the business. In addition, IT administrators need to be able to determine incident severity levels to help them identify and prioritize issues.
NetApp® StorageGRID® has various methods for monitoring. Here’s what our Alerts system provides:
- Actionable intelligence. Each alert tells you what constitutes a problem and what to do to resolve it.
- Issue severity. Alerts are tagged by severity so that issues can easily be prioritized.
- Flexibility. You can customize alerts so that they fit into your business processes and IT workflows.
Most customers rely on StorageGRID to perform multiple workloads, from backups to archiving to analytics. Many of these workloads are crucial to running IT operations. As an IT administrator, it’s important to understand your multitenant environment so that you can preempt problems and stay ahead of business needs.
You want to discover issues before you receive a telephone call at 2 o’clock in the morning telling you about a problem that should have been fixed yesterday. This is where the StorageGRID Alerts system shines. The Alerts system is an easy-to-use interface for detecting, evaluating, and resolving issues that can occur during StorageGRID operations.
StorageGRID comes out of the box with a long list of predefined alerts that use intuitive names and descriptions to help you quickly and easily understand the problem.
The alert notifications tell you which node and site are affected, when the alert was triggered, and the metric value associated with the alert. You also receive recommendations for actions to resolve the alert.
For example, suppose that you’re out to dinner celebrating your in-laws’ anniversary. You get an alert that tells you exactly what has broken and recommends actions to fix it. You can confidently instruct one of your onsite technicians to take care of the problem, saving you a trip to the office.
StorageGRID is a distributed object storage system that spans a global namespace. It is a fault-tolerant system that is designed to operate normally even when errors occur.
The StorageGRID Alerts system uses the following severities:
- Critical. An abnormal condition exists that has stopped normal operations. You must address the issue immediately.
- Major. An abnormal condition exists that currently affects operations. You should address the issue to make sure that normal operations are not stopped.
- Minor. The system is operating normally, but an abnormal condition exists that could affect the system’s ability to operate if it continues.
A busy system can generate a large amount of information. Therefore, severity tagging is important to prioritize issue resolution activities.
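As a rough illustration of severity-based prioritization, the following Python sketch orders a batch of alerts so that the most severe, longest-standing ones come first. The `Alert` class, its fields, and the sample alert names are hypothetical and exist only for this example; they are not the StorageGRID data model or API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value sorts first, so the most urgent severity leads the list.
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2


@dataclass
class Alert:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    node: str
    site: str
    severity: Severity
    triggered_at: float  # seconds since an arbitrary epoch


def triage(alerts):
    """Order alerts by severity, then oldest first within each severity."""
    return sorted(alerts, key=lambda a: (a.severity, a.triggered_at))


alerts = [
    Alert("Example minor alert", "node-3", "site-b", Severity.MINOR, 100.0),
    Alert("Example critical alert", "node-1", "site-a", Severity.CRITICAL, 300.0),
    Alert("Example major alert", "node-2", "site-a", Severity.MAJOR, 200.0),
    Alert("Older major alert", "node-4", "site-b", Severity.MAJOR, 50.0),
]

for alert in triage(alerts):
    print(alert.severity.name, alert.node, alert.name)
```

Sorting on the pair (severity, trigger time) keeps the rule simple: anything critical is handled first, and within a severity level the oldest outstanding alert gets attention before newer ones.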
For example, you can use severity levels to assign work and set timelines so that issues are addressed before they affect system performance. Knowing what might go wrong in your environment, and how serious each condition is, makes it easier to resolve issues.
Flexibility to customize
We understand that every company is unique and needs alerts that match its own business processes and thresholds. That’s why we’ve made the StorageGRID Alerts system so flexible.
StorageGRID uses Prometheus expressions to define thresholds. This allows highly customizable alerts, as well as compound alerts that are triggered based on the Boolean operators AND and OR. For example, you can set an alert if [CPU usage is above 90% for 5 minutes] AND [Average Request Duration is above 60s for 5 minutes]. Any StorageGRID metric can be queried and monitored, and thresholds can be set according to severity level.
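To make the logic of the compound example concrete, here is a minimal Python sketch of "condition A held for 5 minutes AND condition B held for 5 minutes". It only models the idea; in StorageGRID the rule itself is written as a Prometheus expression. The function names, the sample format, and the data are assumptions made for this illustration.

```python
FIVE_MINUTES = 300  # seconds


def above_for(samples, threshold, duration_s, now):
    """Return True if every sample in the last `duration_s` seconds exceeds `threshold`.

    `samples` is a list of (timestamp_seconds, value) pairs. An empty window
    returns False, because there is no evidence that the condition held.
    """
    window = [value for ts, value in samples if now - duration_s <= ts <= now]
    return bool(window) and all(value > threshold for value in window)


def compound_alert(cpu_samples, request_samples, now):
    """Fire only when both conditions held over the same five-minute window."""
    cpu_high = above_for(cpu_samples, 90.0, FIVE_MINUTES, now)           # CPU usage above 90%
    requests_slow = above_for(request_samples, 60.0, FIVE_MINUTES, now)  # duration above 60s
    return cpu_high and requests_slow


# Hypothetical samples taken once a minute for five minutes.
timestamps = range(0, 301, 60)
cpu = [(ts, 95.0) for ts in timestamps]
cpu_dip = [(ts, 85.0 if ts == 180 else 95.0) for ts in timestamps]
req = [(ts, 75.0) for ts in timestamps]

print(compound_alert(cpu, req, now=300))      # both conditions held throughout
print(compound_alert(cpu_dip, req, now=300))  # CPU dipped below 90% once
```

The first call prints `True` and the second prints `False`: a single dip below the CPU threshold inside the window is enough to keep the compound alert from firing, which is the point of requiring the condition to hold for the full duration.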
For example, suppose that you have a small subsegment of data that is very important to your business. If an abnormal condition affects the area where that data is stored, you want the situation to be treated as a critical alert, even if it would otherwise be only a minor problem.
For details on the syntax of Prometheus queries, see the Prometheus documentation.