High Cardinality Label Handling in DNS Metric Stores for Scalable Observability
- by Staff
DNS telemetry has become a foundational pillar of observability and security in modern digital infrastructure, producing rich, continuous streams of metrics that capture service behavior, client access patterns, resolution latencies, failure modes, and usage trends. These metrics are essential not only for understanding DNS performance and availability but also for surfacing early signs of compromise, misconfiguration, or abuse. As with all telemetry systems, DNS observability pipelines must store and process this data efficiently, reliably, and at scale. One of the greatest challenges in doing so is handling high-cardinality labels—combinations of metric dimensions that explode in number as DNS systems grow in complexity, heterogeneity, and geographic distribution.
Cardinality in metrics refers to the number of unique combinations of label values associated with a given metric name. For example, a metric tracking DNS query latency might be labeled with dimensions such as resolver_id, client_subnet, query_type, domain_tld, region, and response_code. In a typical large-scale deployment, each of these labels may have thousands or even millions of possible values. When combined across dimensions, they generate an enormous number of unique time series, placing immense strain on metric storage systems, ingestion pipelines, and query engines.
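To make the explosion concrete, the worst-case series count is simply the product of the per-label value counts. A minimal sketch, using hypothetical value counts for the labels named above:

```python
from math import prod

# Hypothetical per-label value counts for a DNS query latency metric.
# Real deployments will differ; these are illustrative assumptions.
label_value_counts = {
    "resolver_id": 500,
    "client_subnet": 100_000,
    "query_type": 10,
    "domain_tld": 1_500,
    "region": 30,
    "response_code": 12,
}

def worst_case_series(counts):
    """Upper bound on unique time series: the product of per-label value counts."""
    return prod(counts.values())

total = worst_case_series(label_value_counts)
print(f"worst-case unique series: {total:,}")  # 270,000,000,000,000
```

Real traffic never exercises the full cross-product, but even a small fraction of a 10^14 upper bound is far beyond what a single-node metric store can index.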
DNS data is particularly prone to high cardinality because of its inherently dynamic nature. New domain names are constantly being registered, clients use randomized or frequently changing subdomains (as seen with CDNs or malware DGAs), and traffic can originate from millions of IP addresses spread across thousands of networks. Adding to this complexity, operational teams often wish to monitor fine-grained dimensions such as per-customer DNS usage, per-device performance metrics, or per-zone error rates. The resulting volume of unique time series can reach billions, far exceeding the design limits of traditional time-series databases.
Handling this explosion of cardinality requires a multi-layered approach, combining storage-efficient architectures, query-time aggregation strategies, real-time label filtering, and often selective label suppression or dimensionality reduction. Metric stores such as Prometheus, Thanos, Mimir, VictoriaMetrics, and proprietary solutions like Google Monarch or Amazon Timestream offer different trade-offs in managing high-cardinality data. Prometheus, for example, stores time series as a combination of metric name and a set of key-value labels, with each unique label set creating a new series. While efficient for moderate cardinality, Prometheus’s performance can degrade significantly when faced with millions of unique series, leading to excessive memory consumption and slow queries.
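The Prometheus data model described above can be sketched in a few lines: a series is identified by the metric name plus its full label set, so changing any single label value mints a brand-new series. This is a simplified illustration, not Prometheus's actual storage code:

```python
def series_key(metric_name, labels):
    """A series is identified by metric name plus the sorted label set;
    any change in any label value yields a distinct series."""
    return (metric_name, tuple(sorted(labels.items())))

store = set()
store.add(series_key("dns_query_duration_seconds",
                     {"resolver_id": "r1", "query_type": "A"}))
store.add(series_key("dns_query_duration_seconds",
                     {"resolver_id": "r1", "query_type": "AAAA"}))
# Same metric, one label differs -> two distinct series.
print(len(store))  # 2
```

Because every unique label set is indexed independently, a label such as a raw query name can multiply series counts by the full diversity of observed values.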
To address this in DNS environments, organizations often offload high-cardinality metrics to scalable backends like Cortex or Mimir, which distribute storage and query workloads across many nodes. These systems shard time series based on consistent hashing of label sets, spreading the load horizontally and enabling parallel ingestion and retrieval. Compression algorithms tailored for time-series data—such as delta-of-delta encoding, XOR-based encoding, or Gorilla compression—further reduce storage costs by exploiting the temporal regularity of DNS metrics.
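The sharding step can be illustrated with a toy consistent-hash ring. This is a minimal sketch, not the actual Cortex/Mimir implementation; the node names and virtual-node count are assumptions:

```python
import hashlib
import bisect

class ConsistentHashRing:
    """Minimal consistent-hash ring: series keys map to ingester nodes,
    so adding or removing a node remaps only a fraction of the series."""

    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, series_key):
        h = self._hash(series_key)
        hashes = [entry[0] for entry in self.ring]
        # Wrap around to the first vnode if h exceeds the largest ring position.
        idx = bisect.bisect_right(hashes, h) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["ingester-a", "ingester-b", "ingester-c"])
key = 'dns_queries_total{resolver_id="r42",query_type="A"}'
print(ring.node_for(key))
```

Because the mapping depends only on the hash of the label set, every replica of the distributor routes a given series to the same ingester without coordination.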
Even with scalable storage, ingestion pipelines must cope with cardinality before data reaches the backend. Label filtering and relabeling at the edge are common strategies. For instance, an ingestion agent like Promtail, Fluent Bit, or OpenTelemetry Collector might be configured to drop or hash certain labels that exceed cardinality thresholds. Query names, which are nearly unbounded in variability, are often normalized to top-level domains or second-level domains, replacing values like abc123.example.com with *.example.com or simply example.com. This reduces cardinality while preserving analytical utility. Similarly, client IPs may be masked to /24 subnet granularity or represented by anonymized identifiers grouped by behavior.
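The normalization and masking steps above can be sketched in a few functions. This is an illustrative stand-in for what an agent's relabeling rules would do, not the configuration syntax of any particular collector:

```python
import ipaddress

def normalize_qname(qname, keep_labels=2):
    """Collapse a query name to its last `keep_labels` DNS labels,
    e.g. abc123.example.com -> example.com."""
    parts = qname.rstrip(".").split(".")
    return ".".join(parts[-keep_labels:])

def mask_client_ip(ip, prefix=24):
    """Reduce a client IP to its /24 network to bound subnet cardinality."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def relabel(labels):
    """Apply edge-side cardinality reduction before metrics leave the host."""
    out = dict(labels)
    if "query_name" in out:
        out["query_name"] = normalize_qname(out["query_name"])
    if "client_ip" in out:
        out["client_subnet"] = mask_client_ip(out.pop("client_ip"))
    return out

print(relabel({"query_name": "abc123.example.com", "client_ip": "203.0.113.57"}))
# {'query_name': 'example.com', 'client_subnet': '203.0.113.0/24'}
```

The trade-off is deliberate: per-subdomain and per-host detail is sacrificed at the edge so that the backend only ever sees a bounded label space.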
For real-time analysis and dashboards, pre-aggregation at ingestion time becomes crucial. Rather than storing raw metrics with all label combinations, resolvers and collectors compute aggregates such as mean latency, error count, or request volume per simplified label groupings over defined time windows. These summaries are pushed into the metric store as lower-cardinality time series, which can still power alerting and visualization without the overhead of full label space fidelity. More detailed data can be retained in logs or data lakes for deeper, offline analysis, where big data tools like Apache Spark or ClickHouse can operate on massive, wide schemas without overwhelming memory.
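A minimal sketch of ingestion-time pre-aggregation, assuming raw samples of the form (timestamp, resolver_id, response_code, latency) and a 60-second window; the field names are illustrative:

```python
from collections import defaultdict

def preaggregate(samples, window_s=60):
    """Roll raw per-query samples up into (window, resolver_id, response_code)
    buckets, emitting count and mean latency instead of one series per query."""
    buckets = defaultdict(lambda: {"count": 0, "latency_sum": 0.0})
    for ts, resolver_id, response_code, latency in samples:
        key = (ts // window_s * window_s, resolver_id, response_code)
        b = buckets[key]
        b["count"] += 1
        b["latency_sum"] += latency
    return {
        k: {"count": v["count"], "mean_latency": v["latency_sum"] / v["count"]}
        for k, v in buckets.items()
    }

samples = [
    (0, "r1", "NOERROR", 0.010),
    (30, "r1", "NOERROR", 0.030),
    (70, "r1", "SERVFAIL", 0.200),
]
agg = preaggregate(samples)
```

Only the bucketed summaries are pushed to the metric store, so the series count is bounded by the chosen label grouping rather than by raw query diversity.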
Another technique is dynamic label management using time-based or value-based expiration. High-cardinality labels that are inactive for a certain period can be expired or garbage collected, freeing up resources. Systems like Mimir implement TTLs for time series, ensuring that rarely used combinations do not persist indefinitely. Some platforms also employ query-time label pruning, where only a subset of label dimensions is retained for aggregation or display depending on user intent and zoom level. This avoids loading full cardinality datasets when only top-level trends are needed.
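The TTL-based expiration idea can be sketched as a toy in-memory series index; the TTL value and series names are assumptions, and real systems apply this logic inside their storage engines rather than in application code:

```python
import time

class SeriesIndex:
    """Toy series index with a TTL: series not written to within
    `ttl_s` seconds are garbage collected."""

    def __init__(self, ttl_s=3600):
        self.ttl_s = ttl_s
        self.last_write = {}  # series key -> last write timestamp

    def write(self, key, now=None):
        self.last_write[key] = time.time() if now is None else now

    def gc(self, now=None):
        """Remove series whose last write is older than the TTL; return count removed."""
        now = time.time() if now is None else now
        stale = [k for k, ts in self.last_write.items() if now - ts > self.ttl_s]
        for k in stale:
            del self.last_write[k]
        return len(stale)

idx = SeriesIndex(ttl_s=3600)
idx.write("dns_queries_total{qname='example.com'}", now=0)
idx.write("dns_queries_total{qname='x9f2.transient.test'}", now=0)
idx.write("dns_queries_total{qname='example.com'}", now=3000)  # still active
removed = idx.gc(now=4000)  # only the inactive series expires
```

Transient label combinations, such as one-off subdomains, thus pay an indexing cost only while they are live.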
An emerging area of innovation involves adaptive dimensionality reduction based on usage patterns. Metric stores can track which label combinations are frequently queried and prioritize their storage, indexing, and cache retention, while deprioritizing those that are never accessed. In DNS scenarios, for instance, operators may be highly interested in query error rates by resolver_id and domain_tld, but not care about metrics split by transient subdomain hashes. Intelligent sampling, clustering, and sketching techniques—such as HyperLogLog for unique client counts or t-digest for latency percentiles—offer approximate but scalable ways to retain insight without preserving every granular label permutation.
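The usage-tracking side of this idea can be sketched with a simple counter over queried label sets; a real implementation would live in the query frontend and feed cache and index policy, and the class and label names here are hypothetical:

```python
from collections import Counter

class QueryUsageTracker:
    """Track how often each label combination is queried; label sets outside
    the top-k become candidates for reduced indexing and cache retention."""

    def __init__(self, keep_top=2):
        self.keep_top = keep_top
        self.hits = Counter()

    def record_query(self, label_set):
        # frozenset makes the label combination hashable and order-independent.
        self.hits[frozenset(label_set.items())] += 1

    def hot_sets(self):
        """Return the most frequently queried label combinations."""
        return [dict(k) for k, _ in self.hits.most_common(self.keep_top)]

tracker = QueryUsageTracker(keep_top=1)
for _ in range(10):
    tracker.record_query({"resolver_id": "r1", "domain_tld": "com"})
tracker.record_query({"subdomain_hash": "a93f"})  # rarely queried
hot = tracker.hot_sets()
```

Probabilistic structures like HyperLogLog and t-digest complement this by replacing exact per-label series with fixed-size summaries, trading a small accuracy loss for bounded memory.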
Ultimately, high-cardinality label handling in DNS metric stores is not merely a technical optimization—it is a critical enabler for scalable, usable observability. Without it, metric stores buckle under unbounded series growth, queries slow to the point of uselessness, and the dashboards and alerts that teams depend on become unreliable precisely when they are needed most. With deliberate cardinality management—edge relabeling, pre-aggregation, expiration, and approximation—operators can retain the insight DNS telemetry offers at a cost their infrastructure can sustain.