Joint Analysis of DNS and TLS Handshakes in Spark

In modern network environments, gaining visibility into encrypted traffic is one of the most pressing challenges for security analysts, threat hunters, and network engineers alike. While encryption protects privacy and data integrity, it also obscures content and can mask malicious behavior if not properly monitored. To navigate this challenge, the joint analysis of DNS queries and TLS handshakes has emerged as a powerful strategy for characterizing encrypted sessions, detecting threats, and understanding service usage patterns without the need for deep packet inspection. By leveraging Apache Spark, organizations can perform scalable, high-throughput correlation of DNS and TLS telemetry across billions of records, enabling contextual enrichment and behavioral modeling that would otherwise be impossible at scale.

DNS and TLS are intrinsically linked in most modern applications. DNS is responsible for resolving domain names into IP addresses, while TLS provides the cryptographic layer that secures HTTP, SMTP, and other protocols over these resolved endpoints. Although DNS itself may be plaintext or encrypted (e.g., DoH or DoT), its logs typically include query names, timestamps, source IPs, and the corresponding response IPs. TLS handshakes, captured at network taps or from flow collectors like Zeek, Suricata, or sensor-enabled load balancers, include the client IP, server IP, Server Name Indication (SNI), TLS version, selected cipher suite, certificate metadata, and handshake timestamps. When these two datasets are correlated by shared attributes—most commonly time proximity and IP address relationships—they provide a near-complete view of who connected to what service, when, and over what secure channel.

Processing this telemetry at scale requires a big-data processing engine capable of handling multi-terabyte workloads and high-velocity data ingestion. Apache Spark, with its distributed computation model, integrated SQL capabilities, and support for structured streaming, is well-suited to this task. The process begins by ingesting and pre-processing the raw DNS and TLS datasets. Each dataset is typically stored in Parquet or ORC format in a cloud data lake (e.g., S3, ADLS, or GCS) and partitioned by time for efficient querying. Spark jobs read the data, extract relevant fields, and normalize timestamps to a common format. Timestamps are crucial for join operations, as DNS resolutions and TLS handshakes often occur within a short window of each other—typically within a few seconds.

Once the datasets are normalized, a temporal join is performed to correlate DNS and TLS records. This involves joining DNS records (resolved domain → IP) with TLS handshakes (client → server IP) based on a rolling or fixed time window, such as ±10 seconds. Spark’s window functions and range joins enable this type of time-bounded correlation efficiently, especially when leveraging broadcast joins for small DNS-to-IP maps or using Bloom filters to reduce the shuffle size. In cases where multiple DNS responses return the same IP or where NAT hides true source diversity, additional logic may be needed to resolve ambiguity—such as comparing TTLs, session reuse patterns, or maintaining IP-domain mappings with confidence scores.

After the correlation step, the enriched dataset includes tuples of DNS query data and associated TLS handshake metadata. Analysts can now ask sophisticated questions, such as: Which domains were associated with self-signed certificates? Which TLS versions were used by domains flagged as malicious? How many different domains resolved to a shared hosting IP but presented different X.509 certificates? These queries support a wide range of use cases, from detecting phishing sites and domain fronting to identifying unusual cipher suite usage or expired certificates in critical applications.

In security contexts, joint analysis of DNS and TLS helps surface advanced threats that evade traditional detection. For example, a malware variant might resolve a DGA-generated domain to a bulletproof VPS and immediately initiate a TLS session using a rare or outdated cipher suite. Alone, the DNS request or the TLS handshake might not appear suspicious. But together, the unusual domain, low TTL, ephemeral IP, and non-standard certificate signature algorithm raise a strong composite signal. Spark can be used to score such composite behaviors in batch or streaming mode, feeding threat detection pipelines or generating indicators for further investigation.
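The composite-signal idea can be sketched in plain Python before being ported to a Spark UDF or column expressions. The weights, thresholds, and field names here are purely illustrative, not tuned detection logic:

```python
import math

def shannon_entropy(s: str) -> float:
    """Bits per character; DGA-generated labels tend to score noticeably higher."""
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Illustrative set of legacy cipher suites; not an authoritative list.
LEGACY_CIPHERS = {"TLS_RSA_WITH_RC4_128_SHA", "TLS_RSA_WITH_3DES_EDE_CBC_SHA"}

def composite_score(rec: dict) -> float:
    """Combine weak per-field signals into one composite score."""
    score = 0.0
    if shannon_entropy(rec["domain"]) > 3.5:   # random-looking domain
        score += 0.4
    if rec["dns_ttl"] <= 60:                   # fast-flux style low TTL
        score += 0.2
    if rec["cipher"] in LEGACY_CIPHERS:        # rare/outdated cipher suite
        score += 0.2
    if rec["cert_self_signed"]:
        score += 0.2
    return score

rec = {
    "domain": "xk2qz9vw1p.example",
    "dns_ttl": 30,
    "cipher": "TLS_RSA_WITH_RC4_128_SHA",
    "cert_self_signed": True,
}
score = composite_score(rec)
```

No single condition above is damning on its own, which is exactly the point: the score only climbs when several weak indicators co-occur in one correlated session.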

Behavioral modeling becomes far more accurate with this joint dataset. Machine learning models trained on DNS features alone—such as domain length, entropy, or query frequency—can be significantly enhanced with TLS context, such as certificate common names, public key sizes, issuer chains, and handshake timing characteristics. These models can be implemented in PySpark using libraries like MLlib or integrated with external frameworks via Delta tables for feature engineering. Features extracted from TLS certificates, such as uncommon country codes in subject fields, mismatched CN and SNI, or reuse of leaf certificates across unrelated domains, provide high-fidelity signals for anomaly detection.
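One such feature, a CN/SNI mismatch flag, can be sketched as a plain Python function that would typically be wrapped as a Spark UDF or rewritten as native column expressions for feature engineering; the field names are assumptions, and the wildcard handling is deliberately simplified:

```python
def cn_matches_sni(cn: str, sni: str) -> bool:
    """True if the certificate CN covers the SNI, including one-label wildcards.
    Simplified: real matching also considers subjectAltName entries."""
    cn, sni = cn.lower(), sni.lower()
    if cn == sni:
        return True
    if cn.startswith("*."):
        # A wildcard covers exactly one additional left-most label.
        suffix = cn[1:]                      # e.g. ".example.com"
        head, sep, rest = sni.partition(".")
        return sep == "." and "." + rest == suffix
    return False

def features(rec: dict) -> dict:
    """Extract a few joint DNS/TLS features for a model; names illustrative."""
    domain = rec["domain"]
    return {
        "domain_len": len(domain),
        "num_labels": domain.count(".") + 1,
        "cn_sni_mismatch": not cn_matches_sni(rec["cert_cn"], rec["sni"]),
        "key_bits": rec["pubkey_bits"],
    }

feats = features({
    "domain": "api.example.com",
    "cert_cn": "*.example.com",
    "sni": "api.example.com",
    "pubkey_bits": 2048,
})
```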

For enterprises monitoring internal traffic, joint analysis supports asset discovery and shadow IT detection. Devices that query known SaaS domains but establish TLS sessions with IPs hosted on unexpected networks may indicate the use of proxies, misconfigured clients, or unauthorized services. Correlating DNS and TLS reveals which internal hosts are accessing these services, how frequently, and under what encryption parameters. This visibility helps refine access policies, detect data exfiltration, and ensure that traffic complies with organizational standards for encryption strength and certificate trust.
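A toy version of this check, with a hypothetical allow-list of networks where sanctioned SaaS endpoints are expected to live, might look like:

```python
import ipaddress

# Illustrative allow-list and SaaS domain set; real deployments would load
# these from asset inventory or CDN/provider published ranges.
APPROVED_NETS = [ipaddress.ip_network("198.51.100.0/24")]
SAAS_DOMAINS = {"files.saas.example"}

def is_shadow_it(domain: str, server_ip: str) -> bool:
    """Flag a correlated session: known SaaS domain, but the TLS endpoint
    terminates outside the approved networks (proxy, misconfig, or rogue host)."""
    if domain not in SAAS_DOMAINS:
        return False
    addr = ipaddress.ip_address(server_ip)
    return not any(addr in net for net in APPROVED_NETS)
```

Applied as a filter over the joined dataset, this surfaces exactly the sessions described above: the domain looked sanctioned, but the encrypted channel went somewhere it should not have.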

From an infrastructure optimization perspective, joint DNS-TLS telemetry can inform load balancing, CDN placement, and resolver behavior. Operators can assess whether DNS responses are leading to optimal TLS endpoints, whether certificate negotiation latency varies by geography, or whether resolver decisions are impacting TLS handshake success. For example, a resolver that frequently returns IPs with broken certificates can be flagged for reconfiguration or monitoring.
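The resolver check can be sketched as a simple per-resolver aggregation; in production this would be a Spark groupBy over the joined telemetry, and the field names are assumptions:

```python
from collections import defaultdict

def broken_cert_rate(records):
    """Fraction of TLS handshakes with certificate errors, keyed by the
    resolver that returned the endpoint IP (cert_error is 0 or 1)."""
    totals, broken = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["resolver"]] += 1
        broken[r["resolver"]] += r["cert_error"]
    return {res: broken[res] / totals[res] for res in totals}

rates = broken_cert_rate([
    {"resolver": "10.1.1.1", "cert_error": 1},
    {"resolver": "10.1.1.1", "cert_error": 0},
    {"resolver": "10.2.2.2", "cert_error": 0},
])
```

Resolvers whose rate exceeds an operational threshold become candidates for reconfiguration or closer monitoring.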

Privacy and compliance considerations are paramount in this type of analysis. DNS and TLS logs can contain sensitive metadata that, when joined, reveal behavioral fingerprints of users and devices. Data governance must ensure that logs are pseudonymized or anonymized where necessary, and that access to correlated datasets is tightly controlled. Techniques such as hashing client IPs, redacting uncommon SNI fields, or limiting retention of detailed logs to rolling windows help meet regulatory requirements while preserving analytical utility.
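One of these techniques, keyed hashing of client IPs, can be sketched in a few lines. Using an HMAC rather than a bare hash matters because the IPv4 space is small enough to brute-force an unkeyed digest; the key handling shown is purely illustrative:

```python
import hashlib
import hmac

def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Keyed hash: the same IP maps to a stable token for correlation, but
    the mapping cannot be reversed or recomputed without the key, which can
    be rotated per retention window to limit long-term linkability."""
    return hmac.new(key, ip.encode(), hashlib.sha256).hexdigest()[:16]

key = b"rotate-me-per-retention-window"   # illustrative; use a managed secret
token = pseudonymize_ip("10.0.0.5", key)
```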

Ultimately, the joint analysis of DNS and TLS handshakes in Spark represents a synthesis of visibility and scalability. It combines two of the most fundamental data sources in encrypted network traffic to provide a unified view of intent and implementation—what was accessed, and how it was secured. Spark’s ability to process, correlate, and enrich these datasets in parallel across large clusters makes it the engine of choice for organizations seeking to turn raw network telemetry into actionable intelligence. As the landscape of encrypted traffic grows more complex, and as adversaries increasingly blend into legitimate channels, this form of multi-layered analysis becomes indispensable for understanding and securing the modern digital environment.
