Monitoring in the cloud
By identifying problems early, monitoring can prevent many troubles and save companies considerable losses. However, setting up a proper monitoring strategy is not as trivial as it may seem at first glance. Particularly in the cloud, monitoring is not free, which is why we should approach its deployment responsibly. Let's look at how to do it.
Jakub Procházka
Monitoring strategy
Companies enter the cloud either with an existing monitoring system that can be integrated with the public cloud, or they rely purely on the native tools of the public cloud providers.
However, people often picture it too simply: you tick a box in the cloud and monitoring is instantly solved. Unfortunately, it's not that easy. Monitoring is a complex service that needs to be thought of holistically (similar to backups, which I wrote about in one of my previous articles).
First of all, monitoring itself should always be preceded by a monitoring strategy that a company should prepare before entering the public cloud (or at least at the beginning of its adoption).
Source: https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/strategy/monitoring-strategy
The monitoring strategy should then indicate everything that will be part of the monitoring. In addition to defining the scope, the company should not forget to clarify the following:
- the criticality categories of individual incidents (see the following figure),
- who the target recipients of notifications are,
- who will consume the logs,
- what data we want to visualize.
Source: https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/strategy/monitoring-strategy
The monitoring tool itself is useless if nobody looks at the logs, if the notifications end up in spam, or if the recipient gets lost in a flood of notifications.
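One way to make these decisions explicit is to write them down as data before any tool is configured. The following is only a minimal sketch: the severity names, recipient groups and channels are hypothetical placeholders, not values prescribed by any provider.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Severity(Enum):
    # Hypothetical criticality categories; adapt them to your own strategy.
    CRITICAL = 0
    ERROR = 1
    WARNING = 2
    INFORMATIONAL = 3


@dataclass(frozen=True)
class Route:
    recipients: Tuple[str, ...]  # who is notified
    channels: Tuple[str, ...]    # how they are notified


# Hypothetical routing table: every category has an explicit owner,
# and low-severity events deliberately notify nobody.
ROUTING = {
    Severity.CRITICAL: Route(("on-call",), ("sms", "ticket")),
    Severity.ERROR: Route(("ops-team",), ("email", "ticket")),
    Severity.WARNING: Route(("ops-team",), ("email",)),
    Severity.INFORMATIONAL: Route((), ()),
}


def route_for(severity: Severity) -> Route:
    """Look up who should be notified, and how, for a given severity."""
    return ROUTING[severity]
```

Keeping the informational category without recipients is the point of the exercise: such events are still logged, but they do not add to the flood of notifications.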
At the same time, the monitoring strategy will give us a specific monitoring framework. It is thus possible to avoid unnecessary costs associated with monitoring and logging data that in reality has no benefit.
I often see customers who have turned on logging for everything and are now unpleasantly surprised by an ever-increasing invoice. In such cases, it is necessary to take a step back, then review and consolidate the existing solution.
You can read more about the monitoring strategy, for example, in the official cloud adoption framework from Microsoft.
Source: https://www.commitstrip.com/en/2019/05/20/monitoring-everything
Monitoring in the cloud - what are the options?
The most widely used monitoring services in the cloud include Azure Monitor and AWS CloudWatch. These services can then be connected to other tools, including third-party tools such as Datadog or Splunk.
Both core services allow monitoring not only of resources running in their own cloud environment, but also of resources outside it, most often on-premises. They therefore support hybrid scenarios. Integration with the on-premises environment and data collection are then handled by agents.
Source: https://aws.amazon.com/cloudwatch
Monitored data
We divide the collected data into two basic groups, namely logs and metrics.
Metrics are numeric values that characterize the service at a specific time. They can often be displayed in various graphs and plotted in near real-time.
An example of a metric might be the CPU utilization of a virtual server or the responsiveness of a web application. Some basic metrics are collected automatically in the cloud, while others require the installation of an agent or extension for the service.
Metrics have a default retention period; in Azure, for example, it is 93 days for platform metrics. If we want to keep metrics longer, we need to export them to paid storage.
Source: https://docs.microsoft.com/en-us/azure/azure-monitor/overview
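To make the idea of a metric and its retention window more concrete, here is a small sketch. It only models the concept locally; the `MetricPoint` type and the `split_by_retention` helper are hypothetical names of my own, not part of any cloud SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Retention window taken from the Azure platform metrics example above.
RETENTION = timedelta(days=93)


@dataclass(frozen=True)
class MetricPoint:
    name: str            # e.g. CPU utilization of a virtual server
    timestamp: datetime  # the time the value applies to
    value: float         # metrics are always numeric


def split_by_retention(points, now, retention=RETENTION):
    """Return (kept, expired): points inside and outside the retention window."""
    cutoff = now - retention
    kept = [p for p in points if p.timestamp >= cutoff]
    expired = [p for p in points if p.timestamp < cutoff]
    return kept, expired


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
points = [
    MetricPoint("cpu_percent", now - timedelta(days=1), 42.0),
    MetricPoint("cpu_percent", now - timedelta(days=100), 37.5),
]
kept, expired = split_by_retention(points, now)
```

Anything that lands in `expired` is data we would only still have if it had been exported to longer-term storage in time.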
Logs are structured text records that typically contain a timestamp, the type of information (warning, error, critical, etc.) and the record itself. To retain logs, they must be stored in a dedicated repository. In Azure this is Log Analytics, which serves as a kind of central point not only for logs, but for monitoring in general.
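The three parts of a log record can be illustrated with a short parser. This is a sketch under an assumption: the line format used here (ISO timestamp, upper-case level, free-text message) is hypothetical, and real services each have their own schema.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed line format: "<ISO timestamp> <LEVEL> <message>"
LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


class LogRecord(NamedTuple):
    timestamp: datetime
    level: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line, or return None if it does not match the format."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    try:
        timestamp = datetime.fromisoformat(match["ts"])
    except ValueError:
        return None  # first field was not a valid timestamp
    return LogRecord(timestamp, match["level"], match["msg"])


record = parse_line("2024-05-01T12:00:00+00:00 ERROR Database connection failed")
```

Once records are structured like this, filtering by level or time range becomes a simple query, which is exactly what a central log repository does at scale.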
Main areas of monitoring in the cloud
In general, cloud monitoring can be further divided into the following four categories:
- Provider platform monitoring
- Activity and audit logs
- Monitoring IaaS and PaaS
- Application monitoring
Platform monitoring
Platform monitoring provides information about the availability of the cloud environment and reports service outages, not only in a given region. It also alerts us in advance to planned maintenance or downtime. At Microsoft this service is called Azure Service Health, at Amazon AWS Health.
Source: https://aws.amazon.com/blogs/aws/announcing-the-aws-health-tools-repository
Activities and audit logs
Furthermore, we should also log the activities performed in the cloud, along with the related audit logs described by my colleague Martin Gavanda in the previous article. In addition to the already mentioned Azure Monitor and AWS CloudWatch, other services such as the Azure Activity log or AWS X-Ray can be used for this.
IaaS and PaaS monitoring
This type of monitoring is usually of most interest to users because, unlike with the platform, the client's responsibility comes into play here, in line with the shared responsibility model (which we described in an earlier article).
Monitoring plays an important role here precisely because it draws attention to possible deficiencies, errors or incidents in the managed environment. An example of this could be the failure of one of the servers (IaaS VM), its inappropriate utilization or the unavailability or overload of the PaaS database.
For PaaS services, the monitoring options offered may vary according to the specific type of service. In the context of IaaS and PaaS monitoring, we should mention an Azure tool called Diagnostic settings, which covers activity logs and resource logs and provides detailed diagnostic and audit information related to Azure resources.
Source: https://docs.microsoft.com/en-us/azure/azure-monitor/essentials/diagnostic-settings
Application monitoring
Going one level deeper, we get into the OS and application layer. With extensions and agents, we get to see much more detail, such as application logs, custom application logs, operating system logs and details of other services running on our VM/EC2.
With application monitoring, it is also possible to monitor services running outside the environment, for example, on-premise.
For AWS application monitoring, Amazon offers CloudWatch Application Insights, and Microsoft in turn offers Azure Application Insights. Integrating these tools is usually easy; in some cases, codeless application monitoring is also supported for selected programming languages (you can read more about it in the official Microsoft documentation).
Application monitoring greatly assists both the development and DevOps teams and can help improve the user-friendliness of the application. It is possible to track the details of individual sessions and users and their movements in the application, including errors, return visits and much more.
How to get a grip on the costs and deployment of monitoring?
Monitoring may constitute a significant item on the final invoice. In order to minimise this impact, in addition to the monitoring strategy, we must not forget a few other useful tips:
- Use data sampling
- Limit the amount of data with a data cap (daily volume cap)
- Benefit from volume discounts
- When debugging a large volume of logs, enable a higher log level only for the time necessary for debugging
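The first two tips can be sketched together in a few lines. In practice, sampling and daily caps are configured in the monitoring service itself; the `CappedSampler` class below is a hypothetical local illustration of the logic, not a provider API.

```python
import random


class CappedSampler:
    """Keep a fraction of records and stop once a daily byte budget is spent."""

    def __init__(self, sample_rate, daily_cap_bytes, rng=None):
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.0 and 1.0")
        self.sample_rate = sample_rate
        self.daily_cap_bytes = daily_cap_bytes
        self.sent_bytes = 0
        self._rng = rng or random.Random()

    def should_send(self, record: str) -> bool:
        size = len(record.encode("utf-8"))
        if self.sent_bytes + size > self.daily_cap_bytes:
            return False  # daily cap would be exceeded
        if self._rng.random() >= self.sample_rate:
            return False  # dropped by sampling
        self.sent_bytes += size
        return True

    def reset_day(self) -> None:
        """Call once a day to start a fresh budget."""
        self.sent_bytes = 0


sampler = CappedSampler(sample_rate=1.0, daily_cap_bytes=10)
results = [sampler.should_send(r) for r in ("12345", "12345", "1")]
```

The trade-off is worth stating: a cap protects the invoice, but anything that arrives after the cap is reached is lost, so the limit should sit comfortably above a normal day's volume.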
Deployment of monitoring can be automated (at least partially), for example by using appropriate tags or, in the case of IaC, in a deployment script.
We can also check whether monitoring meets the company's requirements by using policies that verify compliance with monitoring rules for individual resources and alert us to any deficiencies.
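The idea behind such a policy can be shown with a simplified check. In real environments this is the job of services such as Azure Policy or AWS Config; the `Resource` type, the `monitoring` tag and its `enabled` value below are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    name: str
    tags: Dict[str, str] = field(default_factory=dict)
    diagnostics_enabled: bool = False


def non_compliant(resources, tag="monitoring", required_value="enabled") -> List[str]:
    """Names of resources tagged for monitoring that have no diagnostics set up."""
    return [
        r.name
        for r in resources
        if r.tags.get(tag) == required_value and not r.diagnostics_enabled
    ]


inventory = [
    Resource("web-vm", {"monitoring": "enabled"}, diagnostics_enabled=True),
    Resource("db-paas", {"monitoring": "enabled"}),
    Resource("sandbox-vm"),  # not tagged, so out of scope for the rule
]
gaps = non_compliant(inventory)
```

Here only `db-paas` is reported: it is tagged as requiring monitoring but has none configured, while the untagged sandbox machine is intentionally ignored.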
Automation and notifications
Individual incidents can be responded to in different ways. When an alert is triggered, it is possible to send an e-mail or SMS, dial a phone number, create a ticket, send a notification to an application or directly run an automation script that tries to fix the problem itself. These actions can of course be combined, including per recipient, i.e. different groups can be notified in different ways.
For example, if the web server on a VM is unavailable, it is possible to send an email to the admins, create a ticket, and try to automatically restart the web service on the VM (apache/nginx/IIS). If this resolves the issue, it can be logged in a ticket and emailed to the admins who can investigate the non-urgent issue further later during business hours. If the problem persists, the next step is to escalate the ticket and send an SMS or dial the hotline.
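The scenario above can be written down as a short workflow. This is a sketch only: the three callables passed in (`restart_service`, `is_healthy`, `notify`) are hypothetical hooks that you would wire to your own tooling, and the channel names are placeholders.

```python
from typing import Callable, List


def handle_web_outage(
    restart_service: Callable[[], None],
    is_healthy: Callable[[], bool],
    notify: Callable[[str, str], None],
) -> List[str]:
    """Run the first-line response to an outage and return the channels used."""
    used: List[str] = []

    def send(channel: str, message: str) -> None:
        notify(channel, message)
        used.append(channel)

    # First line: inform the admins, open a ticket, try the automatic fix.
    send("email", "Web server unavailable, trying an automatic restart")
    send("ticket", "Web server unavailable")
    restart_service()

    if is_healthy():
        # Resolved: record it and leave the analysis for business hours.
        send("ticket", "Resolved by automatic restart")
        send("email", "Resolved by automatic restart, please investigate later")
    else:
        # Still down: escalate the ticket and wake somebody up.
        send("ticket", "Restart did not help, escalating")
        send("sms", "Web server still unavailable after restart")
    return used


sent = []
resolved = handle_web_outage(lambda: None, lambda: True, lambda c, m: sent.append((c, m)))
escalated = handle_web_outage(lambda: None, lambda: False, lambda c, m: None)
```

Passing the actions in as functions keeps the escalation logic separate from the tools that carry it out, so the same flow can be tested without sending a single real notification.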
Visualization
Instant rendering of the collected data can make the work of administrators and developers much easier (and especially faster). The most important data can be visualized on dashboards, which provide a quick overview of the environment or application being monitored. This often gives us important information about what is happening in our environment at a glance.
Custom views can be built with queries according to the current need and can also be pinned to your own dashboard.
Examples of what such dashboards can look like in AWS and Azure are shown in the following images. As a test: can you tell which one belongs to which provider?
Source: Official website of the provider
Source: Official website of the provider
And that's not all...
There are two other special types of monitoring that I haven't mentioned today: security monitoring and cost monitoring. Each of them is a topic for a separate article, and it is possible that we will get to them in the future.
And how do you monitor your environment? Would you rather go the single tool route for hybrid environments or monitor the cloud separately? Let me know in the comments.
Now you can look forward to the next article by my colleague Martin Gavanda on the topic of keys. If you are interested in other topics related to the cloud, read our series Cloud Encyclopedia - A quick guide to the cloud.