This blog post describes how to use the
Azure portal to automatically stop an Azure VM, if it is idle for a certain
time, but without stopping it, when it was just started and hence the average
idle time is below the set threshold.

For our data science projects we require
virtual machines with lots of CPUs and memory, hence they are quite expensive. Typically,
a program run is started during business hours, then runs for a couple of hours
and then might finish during the night, after which the VM is idle. It would be
nice to automatically stop the VM when it is idle for, let’s say, an hour, so
that no further charges incur.

To achieve this, the way suggested by
Microsoft is to create an alert rule with an upper threshold limit for the CPU
percentage during the last hour that calls an automation runbook to stop the
VM. This is nicely described in https://docs.microsoft.com/en-us/azure/automation/automation-create-alert-triggered-runbook.

However, the drawback is that when the VM
is started again, and the alert rule happens to be evaluated e.g. five minutes
after the start, then the average CPU percentage is likely to be below the
threshold, because 55 minutes of the past 60 minutes the CPU percentage was
zero. Consequently, the VM is stopped, which is clearly not desired. Increasing
the execution frequency of the alert rule is no workaround, as there is no
guarantee of when the rule is evaluated. Even if the frequency is set to an
hour, it could still happen to be executed e.g. five minutes after the VM was
started.

Another drawback is, that the standard
metric for the CPU percentage uses the average over all CPUs. It is not
uncommon, that in a program run there are phases with high CPU parallelization that
alternate with phases where this is not possible and only one CPU is busy. Even
if the one CPU is utilized to 100%, the overall average of many CPUs could be
below the threshold configured in the alert rule and hence the VM is stopped
prematurely.

It would be nice, if the VM metrics
included the up-time. Then one could add a second condition to the alert rule
that required the up-time to be longer than one hour. Another improvement would
be the ability to base the CPU percentage not on the average of all CPUs, but
on the CPU with the highest utilization.

This blog post describes how to access the up-time
and to define a metric per individual CPU and how to use that information in an
alert rule to stop a VM. The solution uses a Log Analytics Workspace that
collects the up-time and individual CPU percentage. This log information is
queried in an alert rule and based on the query result the VM is stopped via an
automation runbook.

The involved steps are:

  1. Adding a metric for the CPU
    percentage per CPU of a VM
  2. Setting up a Log Analytics
    Workspace that captures the VM up-time.
  3. Definition of an automation runbook
    to stop a VM
  4. Definition of an alert rule
    that evaluates the VM up-time and CPU percentage and calls the runbook

Here is a detailed description of those
four steps. It is assumed that your Azure subscription contains already a
resource group with a virtual machine in it.


1. Adding a metric for the CPU percentage per CPU of a VM

1.1. In the Azure Portal select the VM and go to Monitoring -> Diagnostic settings
1.2. Enable guest-level monitoring
1.3. If your VM has more than one CPU and you are interested in capturing the CPU time per processor:
1.3.1. Select Metrics -> Custom -> Add new metric:
1.3.1.1.  Class: Processor
1.3.1.2.  Counter: percentprocessortime
1.3.1.3.  Unit: Percent
1.3.1.4.  Condition: IsAggregate=FALSE
1.3.1.5.  Counter specifier: /builtin/processor/percentprocessortimepercpu
1.3.1.6.  Display name: CPU percentage guest OS per CPU
1.3.1.7.  Save the metric and diagnostic settings. Setting IsAggregate to FALSE is what activates
capturing CPU processor time events on a per CPU basis


2. Setting up a Log Analytics Workspace that captures the VM up-time

2.1. Create a Log Analytics resource from the Azure Marketplace in your resource group.
2.1.1. Create it in a new Log Analytics Workspace.
2.1.2. For our purposes it is sufficient to choose the Free pricing tier.
2.2. In the log analytics workspace select Settings -> Advanced settings -> Data
2.2.1. For your Windows VMs select: Windows Performance counters
2.2.1.1.          Unselect all counters that you do not need log events for
2.2.1.2.           In the “Collect the following performance counters” box add
2.2.1.2.1.              Processor(*)% Processor Time
2.2.1.2.2.              SystemSystem Up Time
2.2.2. For your Linux VMs select: Linux Performance counters
2.2.2.1.          Select: Apply below configuration to my machines
2.2.2.2.          Unselect all counters that you do not need log events for
2.2.2.2.1.              In the “Collect the following performance counters” box add
2.2.2.2.1.1.    Processor(*)% Processor Time
2.2.2.2.1.2.    System(*)Uptime
2.3. In the log analytics workspace select Workspace Data Sources -> Virtual machines
2.3.1. Connect all VMs that for which the log analytics workspace should capture the VM events.
A VM can be connected to only one log analytics workspace.

Now you can query the log events captured by the log analytics workspace. In the workspace select General -> Logs -> New Query. Under Schema -> Active -> LogManagement you can see the tables that you can query. The Perf table is the one that the processor time and up-time events get stored into. The queries use the Kusto Query Language. See the Learn more section next to the query box.


3. Definition of an automation runbook to stop a VM

3.1. Create an Automation resource from the Azure Marketplace in your resource group.
3.1.1. Also create an Azure Run As account.
3.2. In the created automation account select Shared Resources -> Modules
3.2.1. Update the modules as described in https://docs.microsoft.com/en-

us/azure/automation/automation-update-azure-modules

3.2.1.1.   Runbook type: PowerShell
3.2.1.2.   Publish and Start the Update-AutomationAzureModulesForAccount runbook
(it is enough to provide the resource group and automation account; in our case this

took seven minutes)
3.3. In the automation account select Process Automation -> Runbooks -> Create a
3.3.1. With Runbook type PowerShell
3.4. Select the created runbook and insert the below PowerShell script
3.5. Publish the created runbook.

<#
.SYNOPSIS
This runbook stops a resource management VM in response to an Azure alert trigger.

.DESCRIPTION
This runbook stops a resource management VM in response to an Azure alert trigger.
The input is alert data that has the information required to identify which VM to stop.

DEPENDENCIES
- The runbook must be called from an Azure alert via a custom alert rule.

REQUIRED AUTOMATION ASSETS
- An Automation connection asset called "AzureRunAsConnection" that is of type AzureRunAsConnection.
- An Automation certificate asset called "AzureRunAsCertificate".

.PARAMETER WebhookData
Optional, so it can be used in an action group without having to specify the parameter,
as it is passed to the action by the alert rule.
Object with a RequestBody property which has as value a JSON string, which represents
a JSON object with the two properties schemaId and data.
The value of the data property is a JSON object with the search result table, 
from which the VM data is taken

.NOTES
AUTHOR: Sandy Team
LASTEDIT: 2019-11-18
See also https://docs.microsoft.com/en-us/azure/automation/automation-create-alert-triggered-runbook
#>

[OutputType("PSAzureOperationResponse")]
param
(
    [Parameter (Mandatory=$false)]
    [object] $WebhookData
)
$ErrorActionPreference = "stop"

if (!$WebhookData) {
    throw "Missing WebhookData input. This runbook is meant to be started from an Azure alert webhook."
}

# From the WebhookData JSON get the RequestBody object, which contains a JSON string with the schemaId and custom data
write-output $WebhookData
if ($WebhookData.GetType().Name -eq "String") {
    # started from test pane where WebhookData is passed as string and not as object
    write-output "Converting WebhookData JSON string to object."
    $WebhookData = (ConvertFrom-Json -InputObject $WebhookData)
}
$RequestBody = (ConvertFrom-Json -InputObject $WebhookData.RequestBody)

# Check that the data comes from Log Analytics 
$schemaId = $RequestBody.schemaId
write-output "schemaId: $schemaId"
if ($schemaId -ne "Microsoft.Insights/LogAlert") {
    throw "The alert data schema - $schemaId - is not supported."
}

# Get the info needed to identify the VM (depends on the payload schema)
$VmResourceId = $RequestBody.data.SearchResults.tables[0].rows[0][0]
$CpuPercentage = $RequestBody.data.SearchResults.tables[0].rows[0][1]
$UpSeconds = $RequestBody.data.SearchResults.tables[0].rows[0][2]
write-output "vmResourceId: $VmResourceId"
write-output "cpuPercentage: $CpuPercentage"
write-output "upSeconds: $UpSeconds"
$Parts = $VmResourceId.Split('/')
$SubscriptionId = $Parts[2]
$ResourceGroupName = $Parts[4]
$ResourceName = $Parts[8]
write-output "virtualMachineName: $ResourceName"
write-output "resourceGroupName: $ResourceGroupName"
write-output "subscriptionId: $SubscriptionId"

# Authenticate to Azure by using the service principal and certificate. Then, set the subscription.
write-output "Authenticating to Azure with service principal and certificate"
$ConnectionAssetName = "AzureRunAsConnection"
write-output "Get connection asset: $ConnectionAssetName"
$Conn = Get-AutomationConnection -Name $ConnectionAssetName
if ($Conn -eq $null)
{
    throw "Could not retrieve connection asset: $ConnectionAssetName. Check that this asset exists in the Automation account."
}
write-output "Authenticating to Azure with service principal."
Connect-AzureRmAccount -ServicePrincipal -Tenant $Conn.TenantID -ApplicationId $Conn.ApplicationID -CertificateThumbprint $Conn.CertificateThumbprint | write-output
write-output "Setting subscription to work against: $SubscriptionId"
Set-AzureRmContext -SubscriptionId $SubscriptionId -ErrorAction Stop | write-output

# Stop the VM.
write-output "Stopping the VM - $ResourceName - in resource group - $ResourceGroupName -"
Stop-AzureRmVM -Name $ResourceName -ResourceGroupName $ResourceGroupName -Force

You can run and test the runbook in Edit mode via the Test pane. As input it needs a JSON document of the form:


{"RequestBody": "{"schemaId": "Microsoft.Insights/LogAlert", "data": {"SearchResults":
{"tables": [{"rows": [["/subscriptions/5be6fbf8-53de-xxxx-xxxx-97576a0fd7c0/resourcegroups
/sandy-demo-rg/providers/microsoft.compute/virtualmachines/sandy-demo-vm",1.2,4000]]}]}}}"}

It is a JSON document with the property RequestBody, which has as value a stringified JSON object with the two properties schemaId and data. Replace the subscription id, resource group and VM name with your own. When you run the script, it should stop your VM and display some debug output.


4. Definition of an alert rule that evaluates the VM up-time and CPU percentage and calls the runbook

4.1. In your resource group select Monitoring -> Alerts -> New alert rule
4.2. As resource select your log analytics workspace (using the Log Analytics workspaces filter)
4.3. Add a condition with:
4.3.1. Signal type: Log Search
4.3.2. Signal name: Custom log search
4.3.3. Search query: the below query (in Kusto query language), in which you have to replace your own
VM name
4.3.4. Based on: Number of results
4.3.5. Operator: Greater than
4.3.6. Threshold value: 0
4.3.7. Period (in minutes): 60
4.3.8. Frequency (in minutes): 10
4.4. Create an action group with one action
4.4.1. Action Type: Automation Runbook
4.4.2. Runbook source: User
4.4.3. Runbook: select the runbook created above in your subscription and automation account
4.4.4. Configure parameters: leave empty
4.4.5. Enable the common alert schema: No
4.5. Include custom Json payload for webhook: select

4.6. Enter your custom Json payload: enter the following JSON document: {“IncludeSearchResults”:true}

In the following Kusto query language script the value of the variable computerName needs to be replaced by your own VM name. The value for maximumAverageCpuPercentageOfBusiestCpu and minimumUpTimeSeconds can be adapted to your needs.

let computerName = "sandy-demo-vm";
let maximumAverageCpuPercentageOfBusiestCpu = 3;
let minimumUpTimeSeconds = 3600;

let minimumUpTimeSecondsAgo = datetime_add('second', -minimumUpTimeSeconds, now());
let averageCpuPercentageDuringMinimumUptime = toscalar( // is null, if no row is found, i.e. the VM is stopped for at least minimumUpTimeSeconds
Perf 
| where Computer == computerName 
    and CounterName == "% Processor Time"
    and TimeGenerated >= minimumUpTimeSecondsAgo
| summarize averageCpuTimePerCpu = avg(CounterValue) by InstanceName
| summarize max(averageCpuTimePerCpu) 
);

let vmUptimeSeconds = toscalar( // is null, if no row is found
Perf 
| where Computer == computerName 
    and (CounterName == "Uptime" or CounterName == "System Up Time")  // on Linux or Windows
    and TimeGenerated > ago(10m) // if there is no log for the last 10 minutes, then the VM is already stopped
| top 1 by TimeGenerated desc
| project CounterValue
);

let vmResourceId = toscalar(
Perf 
| where Computer == computerName
| top 1 by TimeGenerated desc
| project _ResourceId
);

// returns one row only, if the VM is still running and both conditions are true; otherwise returns no rows
print vmResourceId=vmResourceId, cpuPercentage=averageCpuPercentageDuringMinimumUptime, upSeconds=vmUptimeSeconds
| where cpuPercentage < maximumAverageCpuPercentageOfBusiestCpu  // null compared to a number evaluates to false
    and upSeconds > minimumUpTimeSeconds

The alert rule will run the Kusto query every 10 minutes. If the VM is either not running, or it is running, but for less than an hour or it is running, but the CPU percentage of the busiest CPU was on average above 3% during the last hour, then no row is returned by the query and no action is taken.
Otherwise one row is returned, which contains the Azure resource id of the VM, and the runbook is called with a WebhookData object as input. This is an object according to the Microsoft.Insights/LogAlert JSON schema (see https://docs.microsoft.com/en-us/azure/azure-monitor/platform/alerts-log-webhook). Due to {“IncludeSearchResults”:true} in the alert rule, the WebhookData contains RequestBody.data.SearchResults, which contains the one row query result. From that result the VM name, resource group and subscription id are extracted and they are used by the runbook to stop the VM.

The cost for the alert rule depends on its
data source and execution frequency. E.g. an evaluation of the Log Analytics
log every 10 minutes costs one dollar per month. A pure metrics alert rule
costs ten cents per month.