Implementing Cloud Monitoring Aggregation from AWS CloudWatch, Azure Monitor, and GCP Ops
What This Guide Covers
You are building a centralized observability pipeline that ingests telemetry from AWS CloudWatch, Azure Monitor, and Google Cloud Ops into a single aggregation layer. The end result is a unified data model where infrastructure metrics, application logs, and distributed traces are normalized, correlated, and queryable via a common interface, eliminating vendor-specific silos.
Prerequisites, Roles & Licensing
- Target Aggregation Platform: Elastic Stack (ELK) v8.x or Datadog Agent v7.x (this guide uses the Elastic Stack as the primary example due to its open-schema flexibility, but the architectural principles apply to any SIEM/observability platform).
- AWS Permissions: IAM Role with
CloudWatchReadOnlyAccessandlogs:CreateLogGroup,logs:CreateLogStream,logs:PutLogEvents(for Kinesis or direct Log Group pushes). - Azure Permissions: Managed Identity or Service Principal with
Monitoring ReaderandLog Analytics Contributorroles. - GCP Permissions: Service Account with
monitoring.viewer,logging.viewer, andbigquery.dataEditor(if exporting to BigQuery first). - Networking: PrivateLink endpoints or VPC peering connections must be established between the cloud provider egress points and the aggregation cluster ingress. Public internet exposure for telemetry data is a security violation in enterprise contexts.
The Implementation Deep-Dive
1. AWS CloudWatch: The Kinesis Bridge Pattern
AWS CloudWatch does not push data directly to arbitrary HTTP endpoints. You must use an intermediary. The most robust pattern for high-throughput environments is CloudWatch Logs → Kinesis Data Firehose → S3 → S3 Event Notification → Lambda → Elasticsearch. For real-time metrics, use CloudWatch Metrics → Kinesis Data Streams → Consumer Application.
The Architectural Reasoning:
Direct API polling of CloudWatch is inefficient and rate-limited. Push-based architectures via Kinesis decouple the generation of telemetry from its consumption. This prevents back-pressure from the aggregation layer from impacting the source infrastructure.
Configuration Steps:
-
Create the Kinesis Data Firehose Delivery Stream:
In the AWS Console, navigate to Kinesis Data Firehose. Create a delivery stream.- Source: Direct PUT (if using custom agents) or AWS Service (CloudWatch Logs).
- Destination: Amazon S3.
- S3 Bucket: Create a dedicated bucket (e.g.,
telemetry-raw-cloudwatch). - Prefix:
cloudwatch/logs/.
-
Configure CloudWatch Logs Subscription Filter:
You cannot subscribe to the entire log group blindly. You must define a filter pattern to reduce noise.API Endpoint:
PUT /put-subscription-filters{ "logGroupName": "/aws/lambda/my-critical-function", "filterName": "ErrorAndWarning", "filterPattern": "(ERROR | WARN )" }The Trap: Configuring a subscription filter on a high-volume log group (like
/aws/ec2/system) without a restrictive filter pattern. This will generate terabytes of useless data, causing Kinesis throughput charges to skyrocket and overwhelming your aggregation cluster. Always apply a filter pattern, even if it is just""(empty string) to ensure you are intentional about the volume. -
Implement the S3 Event Trigger Lambda:
When data lands in S3, an event triggers a Lambda function. This function parses the JSON, enriches it with metadata (e.g.,cloud.provider: aws,aws.region: us-east-1), and forwards it to the aggregation endpoint.Lambda Python Snippet:
import json import boto3 import requests from urllib.parse import unquote_plus def lambda_handler(event, context): kinesis_client = boto3.client('kinesis') elastic_url = "https://your-elasticsearch-endpoint:9200/_bulk" headers = {"Content-Type": "application/x-ndjson"} bulk_data = [] for record in event['Records']: # Decode the Kinesis record payload = record['kinesis']['data'] data = json.loads(payload) # Enrichment data['cloud.provider'] = 'aws' data['@timestamp'] = data.get('timestamp', data.get('@timestamp')) # Prepare for Elasticsearch Bulk API action = {"index": {"_index": "cloudwatch-logs-" + data.get('aws.region', 'unknown')}} bulk_data.append(json.dumps(action)) bulk_data.append(json.dumps(data)) if bulk_data: response = requests.post( elastic_url, data="\n".join(bulk_data) + "\n", headers=headers, auth=("user", "password") ) return {"statusCode": response.status_code}The Trap: Sending large batches to Elasticsearch in a single HTTP request. The Elasticsearch HTTP layer has a default max content length. If your Lambda processes a large S3 multipart upload, the JSON payload will exceed this limit, resulting in
413 Request Entity Too Largeerrors. Always chunk the bulk data into sizes under 10MB per request.
2. Azure Monitor: The Data Collection Rule (DCR) Strategy
Azure has deprecated the legacy Log Analytics Agent (MMA). You must use the Azure Monitor Agent (AMA) driven by Data Collection Rules (DCRs). This declarative approach allows you to define what data is collected and where it is sent in a JSON policy.
The Architectural Reasoning:
DCRs separate the collection policy from the agent configuration. This allows you to update the data flow without restarting agents on thousands of VMs. It also enables multi-destination routing, sending data to both the Azure Log Analytics workspace and your external aggregation platform simultaneously.
Configuration Steps:
-
Define the DCR JSON:
Create a JSON file defining the DCR. This example collects performance counters and sends them to an HTTP endpoint.{ "apiVersion": "2022-06-01-preview", "type": "Microsoft.Insights/dataCollectionRules", "name": "telemetry-dcr", "location": "eastus", "properties": { "dataSources": { "performanceCounters": [ { "name": "PerfCounterDataSource", "streams": ["Microsoft-Perf"], "samplingFrequencyInSeconds": 15, "counterSpecifiers": [ "\\Processor Information(_Total)\\% Processor Time", "\\Memory\\Available MBytes" ], "schedule": { "frequency": "PT1M", "offset": "PT0S" } } ] }, "destinations": { "logAnalytics": [ { "workspaceResourceId": "/subscriptions/{subscription-id}/resourceGroups/{rg-name}/providers/Microsoft.OperationalInsights/workspaces/{workspace-name}", "name": "AzureLogAnalytics" } ], "customDestinations": [ { "streams": ["Microsoft-Perf"], "name": "ExternalAggregator", "httpDestination": { "uri": "https://your-aggregator-endpoint/azure-ingest", "headers": { "Authorization": "Bearer ${env:AZURE_MSI_TOKEN}" } } } ] }, "dataFlows": [ { "streams": ["Microsoft-Perf"], "destinations": ["AzureLogAnalytics", "ExternalAggregator"] } ] } } -
Deploy the DCR via ARM/Bicep:
Use Azure Resource Manager templates to deploy this configuration. This ensures infrastructure-as-code consistency.The Trap: Relying on Managed Identity tokens without setting the
audienceparameter correctly. Azure AD tokens are audience-specific. If you do not specify the correct audience (e.g.,https://management.azure.com/), the token will be rejected by the target service if it validates the audience claim. In the DCR HTTP destination, ensure the identity has permissions to access the target endpoint, or use a static API key if the target does not support Azure AD. -
Associate the DCR with VMs:
Create a Data Collection Rule Association (DCRA) to link the DCR to specific VMs or VM Scale Sets.{ "apiVersion": "2022-06-01-preview", "type": "Microsoft.Insights/dataCollectionRuleAssociations", "name": "telemetry-dcr-association", "properties": { "description": "Associates DCR with VM", "targetResourceId": "/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vm-name}", "dataCollectionRuleId": "/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Insights/dataCollectionRules/telemetry-dcr" } }
3. GCP Ops: The Pub/Sub Export Pipeline
Google Cloud Operations (formerly Stackdriver) integrates natively with Pub/Sub. This is the most straightforward pipeline because Pub/Sub is a true push-based message queue with high durability and exactly-once semantics.
The Architectural Reasoning:
GCP Logs Router can export logs to Pub/Sub topics. This decouples the logging service from the aggregation backend. Pub/Sub handles the buffering, allowing the consumer (your aggregation service) to scale independently of the log generation rate.
Configuration Steps:
-
Create the Pub/Sub Topic:
gcloud pubsub topics create telemetry-logs -
Create a Logs Sink:
Navigate to Logging → Logs Router → Create Sink.- Sink Name:
export-to-pubsub - Filter:
resource.type="gce_instance"(or your specific filter) - Destination: Pub/Sub Topic →
telemetry-logs - Include Excluded Fields: Check this box if you need full log payloads. By default, GCP excludes some metadata to save costs.
- Sink Name:
-
Implement the Pub/Sub Subscriber:
Use a Cloud Run service or a Kubernetes deployment to subscribe to the topic. The subscriber receives JSON-encoded log entries.Node.js Subscriber Snippet:
const {PubSub} = require('@google-cloud/pubsub'); const pubsub = new PubSub(); const topicName = 'telemetry-logs'; const subscriptionName = 'telemetry-subscription'; const subscription = pubsub.topic(topicName).subscription(subscriptionName); subscription.on('message', (message) => { const logEntry = JSON.parse(message.data.toString()); // Enrichment const payload = { ...logEntry, 'cloud.provider': 'gcp', 'project.id': logEntry.resource.labels.project_id, '@timestamp': new Date().toISOString() }; // Send to Aggregation Platform sendToAggregator(payload); message.ack(); // Acknowledge receipt }); function sendToAggregator(payload) { // Logic to POST to Elasticsearch/Datadog/etc. }The Trap: Ignoring the ordering keys in Pub/Sub. If you require logs from a single instance to be processed in order (e.g., for debugging a sequence of events), you must set an ordering key (usually the instance ID) when publishing. Without ordering keys, Pub/Sub distributes messages randomly across subscribers, which can cause out-of-order processing in your aggregation layer. For log aggregation, order is often less critical than throughput, but for trace correlation, it is vital.
Validation, Edge Cases & Troubleshooting
Edge Case 1: Clock Skew and Timestamp Normalization
The Failure Condition:
Logs from AWS, Azure, and GCP arrive at the aggregation platform with timestamps that do not align. A request trace starting in AWS Lambda shows a timestamp 500ms before the corresponding log in Azure Functions, despite the logical flow being sequential.
The Root Cause:
Each cloud provider uses its own internal clock. While generally accurate, microsecond-level drift occurs. Furthermore, the aggregation pipeline introduces latency. The timestamp in the log entry (@timestamp) is often the time of generation, but the ingestion timestamp (_ingest.timestamp) is the time of arrival. If you index by ingestion time, your data is skewed by network latency.
The Solution:
Always use the source-generated timestamp for indexing and correlation. In the enrichment steps (Lambda, DCR, Pub/Sub subscriber), explicitly override the aggregation platform’s default timestamp field with the source timestamp. Ensure all source systems are configured to use UTC. In Elasticsearch, use the date field type with strict UTC parsing.
{
"settings": {
"index": {
"codec": "best_compression"
}
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
}
}
Edge Case 2: Schema Drift and Ingestion Failures
The Failure Condition:
Suddenly, logs from a specific AWS Lambda function stop appearing in the aggregation platform. The Kinesis Firehose shows successful deliveries to S3, but the Elasticsearch index is empty for that source.
The Root Cause:
Schema drift. A developer changed the logging format in the Lambda function from JSON to plain text, or removed a required field that the enrichment Lambda expects. The enrichment Lambda crashes, and the error is not propagated back to CloudWatch.
The Solution:
Implement a dead-letter queue (DLQ) or error sink in your enrichment logic. In the AWS Lambda example, wrap the enrichment logic in a try-catch block. If an error occurs, send the raw payload to a separate “error” index or S3 prefix for manual inspection.
try:
data['cloud.provider'] = 'aws'
# Enrichment logic
except Exception as e:
# Log the error to a separate sink
error_log = {
"error": str(e),
"raw_data": data,
"source": "aws-cloudwatch-enrichment"
}
send_to_error_sink(error_log)
raise e
Additionally, use schema validation tools like JSON Schema to validate incoming data before processing. This prevents invalid data from crashing your pipeline.
Edge Case 3: Rate Limiting and Back-pressure
The Failure Condition:
During a traffic spike, the aggregation platform’s ingestion rate limits are hit. AWS Kinesis Firehose reports “ProvisionedThroughputExceededException,” and Azure DCRs drop packets.
The Root Cause:
The aggregation platform cannot keep up with the burst of telemetry. The source systems are pushing data faster than the consumer can process it.
The Solution:
Implement back-pressure handling.
- AWS: Configure Kinesis Firehose to buffer data in S3 and process it in batches. Use the
BatchSizeandBufferIntervalsettings to control the flow. - Azure: Use the DCR’s built-in throttling capabilities. Configure the
maxEventsPerSecondlimit in the DCR to prevent overwhelming the destination. - GCP: Use Pub/Sub’s backpressure mechanism. Configure the subscriber to acknowledge messages only after successful processing. If the subscriber is slow, Pub/Sub will hold messages in the backlog.
Monitor the ingestion rate of your aggregation platform and set up alerts for when it approaches 80% of the limit. This allows you to scale the aggregation cluster proactively.