OSINT: The Architecture of Public Information Leakage

Companies spend millions on perimeter security while haemorrhaging information through channels that require no exploitation whatsoever. No vulnerability scan catches a misconfigured S3 bucket whose name was reconstructed from a job posting. No intrusion detection system alerts when someone queries your historical DNS records and maps your entire infrastructure evolution over four years. No firewall stops a researcher from correlating the skill listings of your engineering team to infer exactly which cloud migration you are midway through.

This is the world that Open Source Intelligence (OSINT) practitioners navigate. This article is a deep technical guide to how that navigation works, what each channel reveals, how the channels combine into actionable intelligence, and what defenders can realistically do about it.

Note on examples: Throughout this article, Natasha, a security analyst at the fictional firm Blackriver Analytics, appears as an illustrative figure. Her examples are not presented as a linear story but as isolated illustrations of specific techniques, both correct and incorrect. Blackriver Analytics and Natasha are entirely fictional.

The Intelligence Cycle

OSINT does not begin with a search query. It begins with a question. The intelligence cycle, described in NATO’s 2001 Open Source Intelligence Handbook and used in military, law enforcement, and corporate intelligence contexts, is commonly broken into six phases:

Planning and direction: Define the intelligence requirement. What are you trying to know, and why? An undefined question produces unfocused collection and worthless output.
Collection: Gather raw data from identified sources, both passively (observing data that already exists) and actively (querying services that log queries).
Processing: Normalise and deduplicate. Raw data from ten sources is not intelligence; it is noise until it is structured.
Analysis: Identify patterns, anomalies, and relationships. This is where judgment enters.
Production: Synthesise findings into a coherent, actionable output.
Dissemination and feedback: Deliver the product and refine requirements based on what it does and does not answer.

Skipping planning is the most common mistake made by practitioners with technical skill but insufficient discipline. When Natasha was asked to assess the external exposure of a financial services client, she started querying Shodan within minutes of receiving the brief. She found interesting open ports, but because she had no defined scope, she spent three hours on infrastructure that turned out to belong to the client’s outsourced data centre provider, not to the client itself. The intelligence cycle exists precisely to prevent this.

Part 1: Infrastructure Reconnaissance

1.1 WHOIS, RDAP, and Registration Data

The WHOIS protocol dates from 1982. For most of its history, it returned the full contact details of domain registrants: names, addresses, phone numbers, email addresses. GDPR and the ICANN RDAP transition have redacted personal data from most public WHOIS responses, but a significant quantity of organisational data remains.

What a modern WHOIS or RDAP query still returns:

The registrar (which company manages the registration)
Nameservers (which DNS infrastructure is in use)
Registration and expiration dates
Whether a privacy protection service is in use, and which one
The registrant organisation name (for most corporate registrations)

The historical record is often more valuable than the current snapshot. Services like DomainTools, SecurityTrails, and WhoisFreaks maintain archives of WHOIS data going back years. A domain that migrated its nameservers from Amazon Route 53 to Cloudflare in 2022 and then to a private Anycast network in 2023 tells a story about infrastructure maturity and likely investment cycles. A domain whose administrative contact changed three times in eighteen months tells a story about organisational instability or acquisition activity.

For corporate investigations, the ARIN, RIPE, and APNIC WHOIS databases provide organisation-level records that map IP ranges, network abuse contacts, and the history of address space allocation. These are separate from the domain WHOIS system and remain fully public.

# RDAP query (replaces legacy WHOIS for most gTLDs)
curl https://rdap.verisign.com/com/v1/domain/example.com

# Legacy WHOIS
whois example.com

# Historical DNS (A record) data via SecurityTrails; historical WHOIS is a separate view
# https://securitytrails.com/domain/example.com/history/a
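
The fields listed above can be pulled out of an RDAP response programmatically. The following is a minimal sketch, assuming the JSON has already been fetched (for example with the curl command shown earlier); the sample document is a hypothetical, heavily trimmed response, and `summarise_rdap` is an illustrative helper name, not part of any library.

```python
import json

# Hypothetical, trimmed RDAP domain object; real responses carry many more fields.
SAMPLE_RDAP = json.loads("""
{
  "ldhName": "EXAMPLE.COM",
  "events": [
    {"eventAction": "registration", "eventDate": "1995-08-14T04:00:00Z"},
    {"eventAction": "expiration", "eventDate": "2026-08-13T04:00:00Z"}
  ],
  "nameservers": [{"ldhName": "B.IANA-SERVERS.NET"}, {"ldhName": "A.IANA-SERVERS.NET"}],
  "entities": [
    {"roles": ["registrar"],
     "vcardArray": ["vcard", [["version", {}, "text", "4.0"],
                              ["fn", {}, "text", "Example Registrar, Inc."]]]}
  ]
}
""")

def summarise_rdap(doc):
    """Extract registrar, nameservers, and lifecycle dates from an RDAP domain object."""
    events = {e["eventAction"]: e["eventDate"] for e in doc.get("events", [])}
    registrar = None
    for entity in doc.get("entities", []):
        if "registrar" in entity.get("roles", []):
            # vcardArray has the shape ["vcard", [[name, params, type, value], ...]]
            for prop in entity.get("vcardArray", ["vcard", []])[1]:
                if prop[0] == "fn":
                    registrar = prop[3]
    return {
        "domain": doc.get("ldhName", "").lower(),
        "registrar": registrar,
        "nameservers": sorted(ns["ldhName"].lower() for ns in doc.get("nameservers", [])),
        "registered": events.get("registration"),
        "expires": events.get("expiration"),
    }

summary = summarise_rdap(SAMPLE_RDAP)
print(summary)
```

Running the same summary against archived responses and diffing the results is one simple way to spot nameserver or registrar changes over time.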

1.2 Autonomous System Numbers and IP Range Enumeration

Every organisation that routes traffic on the internet either leases IP space from a hosting provider or operates its own Autonomous System (AS), identified by an Autonomous System Number (ASN). An ASN is a globally unique 16-bit or 32-bit identifier assigned by a Regional Internet Registry (ARIN for North America, RIPE for Europe, APNIC for Asia-Pacific, etc.).

Finding an organisation’s ASN unlocks its entire announced IP range. If a company operates AS64512 and announces the prefix 203.0.113.0/24 (both are reserved example values: a private-use ASN and a documentation prefix), you know every address in that prefix is under the organisation’s direct control. Cross-referencing the range with internet scan data from Shodan or Censys shows which publicly reachable services those scanners have observed on it, which is usually close to a complete picture.

# Look up the ASN for a known IP
whois -h whois.radb.net 203.0.113.1

# Find all prefixes announced by an ASN
whois -h whois.radb.net -- '-i origin AS64512'

# Via BGP route collectors (RIPE RIS, RouteViews)
# https://bgplay.massimocandela.com/

The distinction between “the organisation’s IP range” and “the IP range of the cloud provider they use” matters. A company running entirely on AWS does not announce its own prefixes; its IPs are part of Amazon’s ASN. But internal services, VPN gateways, and mail servers often run on dedicated infrastructure the company does own. The ASN lookup is the fastest way to find the boundary.
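
Once the announced prefixes are known, sorting discovered IP addresses into "owned" and "hosted elsewhere" is a simple membership test. The sketch below uses only the standard library; the addresses are documentation-range placeholders and `split_by_ownership` is a hypothetical helper name.

```python
import ipaddress

def split_by_ownership(ips, announced_prefixes):
    """Separate IPs inside the organisation's announced prefixes from all others."""
    networks = [ipaddress.ip_network(p) for p in announced_prefixes]
    owned, external = [], []
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in networks):
            owned.append(ip)
        else:
            external.append(ip)  # likely a cloud, CDN, or hosting provider
    return owned, external

# Placeholder data: resolved IPs from subdomain enumeration, prefixes from the ASN lookup
owned, external = split_by_ownership(
    ["203.0.113.10", "198.51.100.7", "203.0.113.200"],
    ["203.0.113.0/24"],
)
print("owned:", owned)
print("external:", external)
```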

1.3 Certificate Transparency Logs

In practice, every TLS certificate issued by a publicly trusted Certificate Authority is submitted to one or more Certificate Transparency (CT) logs, usually as a precertificate before final issuance, because major browsers such as Chrome and Safari reject certificates that lack proof of logging. The mechanism, specified in RFC 6962 (2013) and revised as version 2.0 in RFC 9162 (2021), was introduced to make unauthorised certificate issuance detectable. Its side effect is that every hostname that has ever appeared in a publicly trusted certificate is permanently and publicly recorded.

CT logs are append-only, Merkle-tree-structured, and queryable by anyone. The most important query interface for practical OSINT is crt.sh, which indexes logs from Google, Cloudflare, DigiCert, Let’s Encrypt, and others.

# Query crt.sh for all certificates for a domain (including subdomains)
# %25 is the URL-encoded form of the % wildcard
curl -s "https://crt.sh/?q=%25.example.com&output=json" |
  jq -r '.[].name_value' |
  sort -u

# The %.example.com wildcard catches all subdomains
# Results include decommissioned infrastructure, staging environments,
# acquired company subdomains, and internal tooling exposed via certificate

What a careful CT log analysis returns for an active company:

Production services (expected, but confirms names and infrastructure)
Development and staging environments (dev., staging., test., sandbox., beta.)
Internal tooling that was briefly exposed via HTTPS (jira., confluence., grafana., kibana.)
Certificates for subdomains of recently acquired companies (timeline of acquisition)
Wildcard certificates (*.example.com) whose issuance date may correlate with infrastructure changes
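
The categories above can be approximated by a simple label match on the hostnames returned from the log. This is a minimal sketch, assuming records shaped like the crt.sh JSON output (a `name_value` field holding newline-separated names); the label sets and the `classify_ct_names` helper are illustrative choices, not a standard.

```python
import re

# Label sets taken from the categories above; extend them per engagement.
ENV_LABELS = {"dev", "staging", "test", "sandbox", "beta"}
TOOL_LABELS = {"jira", "confluence", "grafana", "kibana"}

def classify_ct_names(records, apex):
    """Group hostnames from crt.sh-style records into rough categories."""
    groups = {"wildcard": set(), "non_production": set(),
              "internal_tooling": set(), "other": set()}
    for record in records:
        # name_value can hold several newline-separated names per certificate
        for name in record["name_value"].lower().split("\n"):
            name = name.strip()
            if name != apex and not name.endswith("." + apex):
                continue  # ignore names outside the target domain
            if name.startswith("*."):
                groups["wildcard"].add(name)
                continue
            # Tokenise the part left of the apex: "staging-api." -> {"staging", "api", ""}
            tokens = set(re.split(r"[.-]", name[: -len(apex)]))
            if tokens & ENV_LABELS:
                groups["non_production"].add(name)
            elif tokens & TOOL_LABELS:
                groups["internal_tooling"].add(name)
            else:
                groups["other"].add(name)
    return groups

# Hypothetical sample records
records = [
    {"name_value": "www.example.com\nexample.com"},
    {"name_value": "staging-api.example.com"},
    {"name_value": "grafana.example.com"},
    {"name_value": "*.example.com"},
    {"name_value": "dev.payments.example.com"},
    {"name_value": "example.com.unrelated.test"},
]
groups = classify_ct_names(records, "example.com")
for category, names in groups.items():
    print(category, sorted(names))
```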

Natasha was assessing a financial services firm and found, via a crt.sh query, that the organisation had issued a certificate for corebanking-api.internal.target.com eighteen months prior. The certificate had not been renewed. A DNS lookup showed the subdomain still resolved. The service was still responding. The word “internal” in the subdomain name had given the team false confidence that it was not accessible from the public internet. It was.

Let’s Encrypt certificates in particular are revealing because they can be issued within seconds at no cost for any domain the requester controls, which means development and testing infrastructure tends to use them. A pattern of Let’s Encrypt certificates for dev-*, test-*, and staging-* subdomains alongside a single DigiCert certificate for the primary domain tells you something about the organisation’s certificate management practices.

1.4 DNS Intelligence

DNS is the phone book of the internet. It is designed to be public, queried by anyone, and cached globally. This makes it one of the most information-dense surfaces in OSINT, and one of the most consistently underestimated by defenders.

Passive DNS and Historical Resolution Data

Passive DNS is the practice of observing and recording DNS resolution responses from recursive resolvers worldwide. Services like SecurityTrails (now part of Recorded Future), DNSDB (originally Farsight Security, now part of DomainTools), and VirusTotal maintain archives of DNS resolutions going back years, indexed by domain, IP address, and nameserver.

A passive DNS query for a domain returns every IP address the service’s sensors have observed that domain resolving to, with first-seen and last-seen timestamps. This enables:

Tracking infrastructure migrations over time
Finding IP addresses used before a CDN was introduced (which may still host the origin server)
Identifying shared hosting (multiple domains resolving to the same IP)
Reconstructing the timeline of a service’s operational history

If a company moved behind Cloudflare in 2021, its origin IP may still appear in passive DNS records from 2020, and if the origin still accepts direct connections, the CDN’s protections can be bypassed. Tools such as CF-Hero and cloudflare-bypasser automate this lookup for Cloudflare-protected domains.
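
The pre-CDN lookup reduces to a filter over historical resolution records. The sketch below assumes the records have already been exported from a passive DNS service as `(ip, first_seen, last_seen)` tuples; the CDN prefix is a documentation-range stand-in rather than a real provider range, and `pre_cdn_candidates` is a hypothetical helper.

```python
import ipaddress
from datetime import date

def pre_cdn_candidates(history, cdn_prefixes, cdn_adopted):
    """Return IPs seen before the CDN cutover that do not belong to the CDN."""
    cdn_nets = [ipaddress.ip_network(p) for p in cdn_prefixes]
    candidates = []
    for ip, first_seen, last_seen in history:
        addr = ipaddress.ip_address(ip)
        behind_cdn = any(addr in net for net in cdn_nets)
        if not behind_cdn and first_seen < cdn_adopted:
            candidates.append(ip)
    return candidates

# Hypothetical passive DNS export for one hostname
history = [
    ("203.0.113.25", date(2019, 3, 1), date(2021, 2, 10)),   # pre-CDN origin
    ("198.51.100.14", date(2021, 2, 10), date(2024, 1, 5)),  # CDN edge address
]
candidates = pre_cdn_candidates(
    history,
    cdn_prefixes=["198.51.100.0/24"],  # placeholder for the CDN's published ranges
    cdn_adopted=date(2021, 2, 10),
)
print(candidates)
```

A candidate is only a lead: whether the old address still hosts the origin has to be confirmed separately.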

Zone Transfers (AXFR)

A DNS zone transfer is the mechanism by which a secondary nameserver replicates the full zone from a primary. The AXFR query type requests a complete copy of the zone. This is entirely legitimate between authorised servers; it is a significant information disclosure when allowed for arbitrary querying clients.

A misconfigured nameserver that responds to AXFR requests from unauthenticated sources returns every DNS record in the zone: every A record, CNAME, MX, TXT, NS, and SRV, plus PTR records when the reverse zone is exposed the same way. This is not a vulnerability in the software sense; it is a configuration failure. US-CERT (now part of CISA) published an alert about it in 2015 (TA15-103A). It still occurs regularly.

# Attempt a zone transfer (this will be refused by well-configured nameservers)
dig AXFR @ns1.example.com example.com

# If successful, returns every DNS record in the zone

Subdomain Brute-Force Enumeration

When zone transfers are blocked, subdomain enumeration relies on wordlist-based brute force and passive sources. The standard toolkit includes:

amass (OWASP): Combines passive sources (CT logs, DNS archives, search engines) with active DNS queries. The passive mode alone queries over 20 data sources without sending DNS queries to the target.
subfinder: Passive-only subdomain enumeration against over 50 data sources. Fast and accurate.
dnsx + puredns: Active DNS resolution and brute-force at scale, capable of testing millions of subdomain candidates against wordlists like SecLists.

# Passive enumeration with amass (no active DNS queries)
amass enum --passive -d example.com -o amass.txt

# Passive enumeration with subfinder
subfinder -d example.com -o subfinder.txt

# Merge and deduplicate the passive results
sort -u amass.txt subfinder.txt > subdomains.txt

# Active brute-force with puredns
puredns bruteforce /opt/wordlists/subdomains-top1m.txt example.com

Reverse DNS and PTR Records

A PTR record maps an IP address back to a hostname. IP ranges operated by an organisation often have PTR records revealing the naming convention used for servers. An IP range where every host follows the pattern prod-db-03.us-east-1.example.com tells you the environment (prod), the role (db), the instance number, the region, and the domain in a single DNS lookup.
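
A naming convention like this can be decomposed with a single regular expression. The pattern below is a sketch that assumes the exact `env-role-index.region.domain` layout from the example; real conventions vary, so the expression would need adjusting per target.

```python
import re

# Assumed layout: <env>-<role>-<index>.<region>.<domain>
HOST_PATTERN = re.compile(
    r"^(?P<env>[a-z]+)-(?P<role>[a-z]+)-(?P<index>\d+)"
    r"\.(?P<region>[a-z0-9-]+)\.(?P<domain>.+)$"
)

def parse_hostname(hostname):
    """Split a PTR hostname into its convention fields, or return None if it does not fit."""
    match = HOST_PATTERN.match(hostname.lower().rstrip("."))
    return match.groupdict() if match else None

fields = parse_hostname("prod-db-03.us-east-1.example.com.")
print(fields)
print(parse_hostname("mail.example.com"))  # does not follow the convention
```

Applied across a whole reverse-resolved range, counting the distinct `env`, `role`, and `region` values gives a quick inventory of environments and server roles.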

MX, TXT, and SPF Records

Mail exchange records reveal the email infrastructure: whether the organisation uses Google Workspace, Microsoft 365, Proofpoint, Mimecast, or an on-premises mail server. SPF records enumerate every IP range and third-party service authorised to send mail on the domain’s behalf, which directly reveals which SaaS products the organisation uses.

# SPF record: reveals every authorised mail sender
dig +short TXT example.com | grep -i 'v=spf1'

# A typical SPF record reveals: Google Workspace, Salesforce,
# Zendesk, Mailchimp, and a SendGrid integration, each as an "include:"
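
Extracting the third-party senders from an SPF record is a matter of splitting the record into its mechanisms. A minimal sketch follows; it handles only the `include`, `ip4`, `ip6`, and `all` mechanisms, and the sample record is illustrative rather than taken from a real domain.

```python
def parse_spf(record):
    """Collect include targets, IP ranges, and the 'all' qualifier from an SPF record."""
    result = {"include": [], "ip4": [], "ip6": [], "all": None}
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    for term in terms[1:]:
        qualifier = term[0] if term[0] in "+-~?" else "+"  # '+' is the default qualifier
        body = term.lstrip("+-~?")
        mechanism, _, value = body.partition(":")  # splits at the first colon only
        mechanism = mechanism.lower()
        if mechanism in ("include", "ip4", "ip6"):
            result[mechanism].append(value)
        elif mechanism == "all":
            result["all"] = qualifier
    return result

# Illustrative record
spf = parse_spf("v=spf1 include:_spf.google.com include:sendgrid.net ip4:203.0.113.0/24 ~all")
print(spf)
```

Each `include` target maps to a mail-sending service, so the list doubles as a partial SaaS inventory.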

DMARC, DKIM, and BIMI records add further context: whether the organisation has invested in email authentication, whether DMARC reports go to a third-party security vendor, and whether the domain publishes a BIMI record, which requires an enforced DMARC policy and therefore signals a mature email authentication setup.

Subdomain Takeover

A subdomain takeover occurs when a subdomain’s DNS record points to an external service that has been deprovisioned, but the DNS record was not removed. If assets.example.com has a CNAME pointing to examplebucket.s3.amazonaws.com and the S3 bucket no longer exists, an attacker can create a bucket with that exact name and gain control of the subdomain.

The impact ranges from phishing and credential harvesting (content served from assets.example.com appears to originate from the legitimate domain) to cookie theft (if the session cookie is scoped to .example.com). The attack is particularly insidious because the takeover requires no access to the target’s infrastructure at all.

# Tools for subdomain takeover detection
subjack -w subdomains.txt -t 100 -o takeovers.txt
nuclei -l subdomains.txt -t takeovers/
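
The core check behind these tools can be sketched as follows: find CNAMEs that point at a third-party service and whose target no longer exists. This is a simplified illustration, assuming the CNAME map and the set of still-live targets were collected beforehand; the suffix list is a tiny illustrative sample, whereas real tools carry per-service fingerprints.

```python
# Illustrative suffixes only; real tools maintain much larger fingerprint lists.
THIRD_PARTY_SUFFIXES = (".s3.amazonaws.com", ".github.io", ".herokuapp.com")

def takeover_candidates(cname_map, live_targets):
    """Return (subdomain, target) pairs whose third-party CNAME target is not live."""
    candidates = []
    for subdomain, target in sorted(cname_map.items()):
        target = target.lower().rstrip(".")
        if target.endswith(THIRD_PARTY_SUFFIXES) and target not in live_targets:
            candidates.append((subdomain, target))
    return candidates

# Hypothetical inputs
cname_map = {
    "assets.example.com": "examplebucket.s3.amazonaws.com.",
    "blog.example.com": "example-corp.github.io",
    "www.example.com": "www.example.com.cdn.example.net",
}
live_targets = {"example-corp.github.io"}  # targets confirmed to still exist

dangling = takeover_candidates(cname_map, live_targets)
print(dangling)
```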

1.5 Internet-Wide Scan Databases

Shodan

Shodan continuously probes every publicly routable IP address on the internet, testing hundreds of ports and protocols, capturing banners, certificates, and service fingerprints. The collected data is indexed and searchable.

A Shodan query for an organisation’s ASN or IP range returns:

Every publicly reachable port and the service running on it
TLS certificate information for each HTTPS service
HTTP headers, including Server and X-Powered-By
Version strings from SSH banners, FTP greetings, and similar service advertisements
Vulnerability flags where Shodan has matched banner data to CVE signatures

# Shodan query by organisation name
org:"Example Corp"

# By ASN
asn:AS64512

# By IP range
net:203.0.113.0/24

# By product (all publicly exposed Jenkins instances)
product:Jenkins org:"Example Corp"

# Find RDP exposed to the internet
port:3389 org:"Example Corp"

Natasha ran a Shodan query against the client’s ASN and found a Jenkins CI server on port 8080 with no authentication, advertising its version in the HTTP header. The version was three years old and had three known RCE vulnerabilities. Neither the engineering team nor the security team was aware the instance was publicly reachable. It had been the CI server for a deprecated internal project; the project was cancelled but the server was never shut down.

Censys

Censys performs similar internet-wide scanning but with a particular strength in TLS certificate analysis and structured data export. Its data model is better suited to automated enumeration of all hosts associated with a set of certificates or all hosts sharing a certificate issuer pattern.

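# Censys Python SDK (reads CENSYS_API_ID and CENSYS_API_SECRET from the environment)
from censys.search import CensysHosts

h = CensysHosts()
# Phrases in the Censys query language are wrapped in double quotes
query = 'services.tls.certificates.leaf_data.subject.organization: "Example Corp"'
# search() yields pages of results; each page is a list of host records
for page in h.search(query):
    for host in page:
        print(host["ip"], host["services"])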

Censys is particularly useful for finding all hosts sharing the same TLS certificate (useful when a wildcard cert is deployed across multiple services) and for finding hosts that once had an organisation’s certificate but have since been transferred or deprovisioned.

Netlas, ZoomEye, and FOFA

Netlas provides deep port scanning and banner analysis with a powerful query syntax. ZoomEye has strong coverage of IoT devices and Asian infrastructure. FOFA, operated by a Chinese security company, has broad coverage and supports complex filter expressions. Each has different coverage gaps; professional OSINT uses multiple sources and cross-references discrepancies.

Part 2: Web Surface Intelligence

2.1 Search Engine Dorking

Google, Bing, and DuckDuckGo index parts of an organisation’s public web presence that are often overlooked in formal asset inventories. Advanced search operators allow precise targeting of specific types of content.

Core operators and their intelligence value:

Operator	Syntax	Intelligence value
`site:`	`site:example.com`	Enumerate all indexed pages on a domain
`filetype:`	`filetype:pdf site:example.com`	Find published documents
`inurl:`	`inurl:admin site:example.com`	Locate administrative interfaces
`intitle:`	`intitle:"index of" site:example.com`	Find directory listings
`intext:`	`intext:"api_key" site:example.com`	Locate pages containing specific strings
`cache:`	`cache:example.com/page`	View Google’s cached copy of a page (retired by Google in 2024; the Wayback Machine is the practical substitute)

Effective Google dorks for infrastructure intelligence:

# Find exposed configuration files
site:example.com filetype:env OR filetype:cfg OR filetype:conf

# Locate exposed admin panels
site:example.com inurl:admin OR inurl:dashboard OR inurl:management

# Find accidentally indexed spreadsheets
site:example.com filetype:xls OR filetype:xlsx OR filetype:csv

# Discover exposed API documentation
site:example.com inurl:swagger OR inurl:api-docs OR inurl:openapi

# Find backup files
site:example.com filetype:bak OR filetype:backup OR filetype:old

# Locate exposed credentials in public paste sites
site:pastebin.com "example.com" password OR apikey OR secret

Natasha ran site:example.com filetype:pdf and found 47 documents, most of them marketing brochures. The forty-third was a system architecture document from 2019 titled “Migration Plan Q3.” It described the company’s entire database schema, the names of all microservices, and the decision to retain a legacy Oracle database for “regulatory reasons” that had not been mentioned anywhere else. The document had been accidentally indexed. It had been there for five years.

The GHDB (Google Hacking Database), maintained by Exploit-DB, catalogues thousands of tested dork patterns for finding specific types of exposed systems. Reviewing the GHDB categories (Advisories and Vulnerabilities, Files Containing Passwords, Sensitive Directories, etc.) against a target domain takes under an hour and often surfaces content that routine scans miss.

2.2 The Wayback Machine and Archive Analysis

The Internet Archive’s Wayback Machine has crawled the public web since 1996. For OSINT purposes, it provides access to:

Historical versions of pages that have since been modified or deleted
Old site structure and navigation, revealing past services and features
Historical job postings (deleted from the current site but archived during the posting window)
Old technology stack indicators (outdated framework headers, old JavaScript includes)
Email addresses and phone numbers that appeared in earlier versions of pages

# Query the Wayback Machine CDX API for all archived URLs of a domain
curl "https://web.archive.org/cdx/search/cdx?url=*.example.com&output=text&fl=original&collapse=urlkey"

# This returns every URL the archive has ever crawled under a domain,
# including paths that no longer exist on the live site

Historical snapshots of a company’s careers page provide a timeline of technical hiring. If the careers page from 2021 shows ten open roles for Kubernetes engineers and the 2022 snapshot shows twenty, and the 2023 snapshot shows none, that timeline is intelligence: the migration probably completed in early 2023.

The Wayback Machine CDX API is particularly useful for finding URLs that are no longer linked from the current site. Old admin panels, debug endpoints, and test pages that were once publicly reachable often remain listed in the archive’s index. The URL itself reveals information even if the content is no longer accessible.
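
A CDX export for a large site can run to hundreds of thousands of URLs, so some triage is needed. The sketch below filters a list of archived URLs down to paths containing a keyword as a whole token; the keyword set and the `interesting_archived_paths` name are illustrative assumptions.

```python
import re
from urllib.parse import urlsplit

# Illustrative keyword set; tune per target
KEYWORDS = {"admin", "debug", "test", "backup", "staging", "internal"}

def interesting_archived_paths(urls):
    """Return deduplicated host+path strings whose path contains a keyword token."""
    hits = set()
    for url in urls:
        parts = urlsplit(url)
        # Token match avoids false positives such as "latest" matching "test"
        tokens = set(re.split(r"[^a-z0-9]+", parts.path.lower()))
        if parts.hostname and tokens & KEYWORDS:
            hits.add(parts.hostname + parts.path)  # hostname drops the port, path drops the query
    return sorted(hits)

# Hypothetical CDX output (the "original" field)
archived = [
    "http://example.com/",
    "http://example.com:80/admin/login.php",
    "https://example.com/admin/login.php?next=/",
    "http://example.com/products/latest",
    "http://dev.example.com/debug/vars",
]
paths = interesting_archived_paths(archived)
print(paths)
```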

2.3 Web Technology Fingerprinting

Every web service reveals something about its implementation through its HTTP headers, HTML structure, JavaScript includes, and error pages.

BuiltWith and Wappalyzer identify technology stacks from passive fingerprinting. A site using Salesforce, Adobe Analytics, Segment, and Intercom has a specific marketing and analytics stack. A site using nginx with a Lua-based authentication layer and Redis has very different characteristics from one using Apache with PHP.

HTTP headers are particularly revealing:

Server: nginx/1.18.0 reveals the web server version
X-Powered-By: Express reveals the Node.js framework
Set-Cookie: PHPSESSID=... reveals PHP on the backend
X-AspNet-Version: 4.0.30319 reveals the .NET version
CF-Ray: indicates Cloudflare is in the path
Custom headers (X-Request-Id, X-Trace-Id) reveal internal observability tooling

# Inspect headers
curl -I https://example.com

# Or with verbose output including TLS certificate info
curl -v https://example.com 2>&1 | head -60
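
The header-to-technology mapping described above is easy to encode as a small rule set. This is a minimal sketch covering only the headers listed in this section; dedicated fingerprinting tools use far larger rule sets, and `fingerprint_headers` is a hypothetical helper.

```python
def fingerprint_headers(headers):
    """Map a dict of HTTP response headers to (category, evidence) findings."""
    h = {name.lower(): value for name, value in headers.items()}
    findings = []
    if "server" in h:
        findings.append(("web server", h["server"]))
    if "x-powered-by" in h:
        findings.append(("framework", h["x-powered-by"]))
    if "x-aspnet-version" in h:
        findings.append(("asp.net runtime", h["x-aspnet-version"]))
    if "phpsessid=" in h.get("set-cookie", "").lower():
        findings.append(("backend language", "PHP"))
    if "cf-ray" in h:
        findings.append(("cdn", "Cloudflare"))
    return findings

# Hypothetical response headers
findings = fingerprint_headers({
    "Server": "nginx/1.18.0",
    "X-Powered-By": "Express",
    "CF-Ray": "0000000000000000-LHR",
})
print(findings)
```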

Error pages deserve specific attention. A Django debug error page (accidentally left enabled in production) exposes the full local file paths of the application, the installed apps and middleware, most of the settings module (Django masks values whose names look like secrets, but database hosts and names remain visible), and the Python version. Default error pages in other frameworks leak comparable detail when they expose stack traces in production.

Part 3: Repository and Code Intelligence

3.1 GitHub and GitLab Enumeration

GitHub’s search API and web interface allow querying across all public repositories. The relevant dimensions:

Organisation discovery:

# All public repositories for an organisation
https://github.com/orgs/example-corp/repositories

# Search-engine dorks for domain-specific references on GitHub
site:github.com "example.com" apikey
site:github.com "example.com" password
site:github.com "@example.com" DB_PASSWORD

Contributor enumeration: Every commit in a public repository records the author’s name and email address (unless the author has configured a noreply address). For organisational repositories, this means every engineer who has ever committed code is listed, along with their commit email. The result is a lower bound on the organisation’s engineering headcount over time.

# Extract all unique commit authors from a repository's history (all branches)
git log --all --format='%ae' | sort -u

# This typically returns first.last@example.com,
# which reveals both the person and the email format convention

GitHub dork patterns:

# Find exposed environment files
# (GitHub's current code search uses path:; filename: is the legacy qualifier)
org:example-corp path:.env

# Find configuration files with passwords
org:example-corp path:config password

# Find hardcoded AWS keys ("AKIA" is the AWS access key ID prefix)
org:example-corp "AKIA"

# Find internal hostnames
org:example-corp "internal.example.com"

# Find Slack webhook URLs (often committed by accident)
org:example-corp "hooks.slack.com/services"

Natasha found a public repository in the client’s GitHub organisation named infra-tools. The repository contained Terraform configuration files. The Terraform state file, which normally should never be committed, was present in the history from a commit three years prior. Terraform state files contain the current state of all provisioned infrastructure: every EC2 instance ID, every RDS endpoint, every security group rule, every IAM role name, and often secrets such as database passwords in plain text. The file had been removed in a subsequent commit, but it was still fully accessible in the history.

3.2 Commit History and Secret Scanning

The persistence of git history is the core problem. A developer who commits an API key, realises the mistake, removes the file in the next commit, or even rewrites history and force-pushes, may believe the secret is gone. But if the repository was public for even a moment while the secret was present, automated scanners may already have captured it. GitHub itself logs pushed commits and can keep the underlying objects reachable by commit hash even after a force-push.

TruffleHog (Truffle Security) is one of the most capable tools for this. It scans the full git object database, not just the checked-out files, and verifies whether detected credentials are still active against the relevant service’s API.

# Scan entire git history for secrets, verify active credentials
trufflehog git https://github.com/example-corp/example-repo

# Scan a local repo, including its full commit history
trufflehog git file:///path/to/repo

# Scan all public repos of an organisation
trufflehog github --org=example-corp

Gitleaks is faster and more suitable as a pre-commit hook:

gitleaks detect --source . --verbose
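
At their core, these scanners match known credential formats against text. The sketch below shows the idea with two patterns only and no entropy analysis or live verification, both of which the real tools add; the sample key is the placeholder value AWS uses in its own documentation, and `scan_text` is a hypothetical helper.

```python
import re

# Two illustrative patterns; production scanners ship hundreds of detectors.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}

def scan_text(text):
    """Return (line_number, detector_name, matched_string) for every pattern hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, name, match.group(0)))
    return findings

# Sample file contents using AWS's documented example key
sample = "DEBUG=false\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\nREGION=us-east-1\n"
findings = scan_text(sample)
print(findings)
```

To cover history rather than a single checkout, the same function would be run over the output of `git log -p --all`.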

According to GitGuardian’s State of Secrets Sprawl 2025 report, roughly 23.8 million new secrets were detected in public GitHub commits during 2024. The categories include: AWS access keys, GitHub personal access tokens, Google API keys, Slack webhooks, database connection strings, private TLS keys, and Stripe secret keys.

The remediation for an already-exposed secret is not to delete the file. The remediation is to immediately revoke the credential at the issuing service, then clean the history (using git filter-repo or BFG Repo-Cleaner), and then force-push. Deleting the file without revoking the credential leaves it active and accessible to anyone who captured it.

3.3 Dependency and Supply Chain Intelligence

A public repository’s dependency manifests (package.json, requirements.txt, go.mod, Gemfile.lock, pom.xml) reveal the complete technology stack in version-specific detail. Cross-referencing specific versions against CVE databases (NVD, OSV, Snyk Vulnerability DB) tells you exactly which known vulnerabilities are present in the codebase.
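
The cross-referencing step can be sketched as a lookup of pinned versions against an advisory table. In this illustration the package names, versions, and advisory IDs are all hypothetical placeholders, and the advisory table stands in for data that would in practice come from NVD, OSV, or a similar source.

```python
def parse_pins(requirements_text):
    """Extract exactly pinned packages (name==version) from requirements.txt content."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins

def match_advisories(pins, advisories):
    """Return (package, version, advisory_id) for every pinned version with an advisory."""
    return sorted(
        (name, version, advisory_id)
        for name, version in pins.items()
        for advisory_id in advisories.get((name, version), [])
    )

# Hypothetical manifest and advisory data
requirements = "examplelib==1.2.3\nothertool==4.0.0  # pinned\nunpinned-package>=2.0\n"
advisories = {("examplelib", "1.2.3"): ["EXAMPLE-2024-0001"]}

pins = parse_pins(requirements)
matches = match_advisories(pins, advisories)
print(pins)
print(matches)
```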

Beyond vulnerabilities, the dependency list is a technology fingerprint. Organisations using specific internal packages (@example-corp/design-system, example-corp-auth-lib) reveal the boundaries of their internal package ecosystem. The package names often reflect project names, team names, and internal brand names that appear nowhere in public documentation.

Part 4: Personnel and Organisational Intelligence

4.1 LinkedIn and Professional Networks

LinkedIn’s value for OSINT lies in the voluntary, structured, and continuously updated nature of the data. Professionals maintain profiles to be discovered; the intelligence challenge is aggregating what those profiles reveal at scale.

Individual profile analysis:

Job title and seniority reveal reporting level and responsibility scope
Company tenure reveals whether the organisation retains its engineers
Listed skills reveal the technology stack, including specific versions and cloud providers
Certifications (AWS Solutions Architect, CISSP, CKA) reveal infrastructure choices and security investment
Education and career trajectory reveal where the organisation recruits and what seniority level they hire at

Aggregate analysis across an organisation: The most powerful use of LinkedIn OSINT is not analysis of individual profiles but the construction of a model from aggregate data. For a company with 50 engineers on LinkedIn:

Aggregate signal	Implication
12 engineers list “Kubernetes” + “AWS EKS”	Container orchestration on EKS
5 engineers list “Datadog”	Datadog is the observability platform
3 recent hires with “Apache Kafka” experience	Event streaming migration or new service
VP of Engineering joined 4 months ago from a competitor	Significant leadership change
8 engineers with “Rust” experience, all hired in last 6 months	Active Rust migration or greenfield Rust project
2 engineers listing “SOC 2 Type II” compliance experience	Active compliance program
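
Aggregation of this kind is a frequency count over normalised skill lists. The sketch below assumes the profile data has already been collected into dictionaries by some lawful means; the profiles are invented and `aggregate_skills` is a hypothetical helper.

```python
from collections import Counter

def aggregate_skills(profiles, min_count=2):
    """Count how many profiles list each skill; keep skills listed by at least min_count."""
    counts = Counter()
    for profile in profiles:
        # A set ensures each person counts once per skill, regardless of casing
        counts.update({skill.strip().lower() for skill in profile["skills"]})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(skill, n) for skill, n in ranked if n >= min_count]

# Invented profiles
profiles = [
    {"skills": ["Kubernetes", "AWS EKS", "Datadog"]},
    {"skills": ["kubernetes", "Datadog", "Rust"]},
    {"skills": ["Kubernetes", "Apache Kafka"]},
]
signals = aggregate_skills(profiles)
print(signals)
```

The threshold matters: a skill listed by one engineer may reflect a previous employer, while the same skill across a dozen profiles is much more likely to describe the current stack.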

Social graph analysis: LinkedIn’s “People also viewed” and connection suggestions reflect the underlying professional graph. If employees of Company A consistently appear in the connection suggestions for employees of Company B, there is probably some relationship between the two. If that signal strengthens over three months, it may indicate a pending acquisition, a partnership, or staff moving from one company to the other.

4.2 Email Harvesting and Pattern Reconstruction

Corporate email addresses almost always follow a predictable pattern. The most common are firstname.lastname@example.com, f.lastname@example.com, and firstname@example.com. Finding one email address reveals the pattern for the entire organisation.

Hunter.io queries published web content to surface email addresses associated with a domain and infer the pattern:

curl "https://api.hunter.io/v2/domain-search?domain=example.com&api_key=YOUR_KEY"
# Returns found email addresses + the inferred pattern with a confidence score

theHarvester automates collection from multiple sources:

theHarvester -d example.com -b google,linkedin,bing,twitter,certspotter -l 500

Once the pattern is known, generating the email address for any person whose name is known is trivial. Combined with LinkedIn data (full name, role, tenure), this allows reconstruction of email addresses for specific individuals who have never published their email publicly.
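
Pattern inference and generation can be sketched in a few lines. The example covers only four common formats and ignores complications such as hyphenated or multi-part names; the pattern labels and function names are illustrative assumptions.

```python
# Each pattern builds the local part from lowercase first and last names
PATTERNS = {
    "first.last": lambda first, last: f"{first}.{last}",
    "f.last": lambda first, last: f"{first[0]}.{last}",
    "flast": lambda first, last: f"{first[0]}{last}",
    "first": lambda first, last: first,
}

def infer_pattern(first, last, email):
    """Return the name of the pattern that reproduces a known address, or None."""
    local_part = email.split("@", 1)[0].lower()
    for name, build in PATTERNS.items():
        if build(first.lower(), last.lower()) == local_part:
            return name
    return None

def generate(pattern, first, last, domain):
    """Build the address a person would have under the given pattern."""
    return f"{PATTERNS[pattern](first.lower(), last.lower())}@{domain}"

# One known address reveals the convention; names from other sources fill in the rest
pattern = infer_pattern("Alice", "Smith", "a.smith@example.com")
guess = generate(pattern, "Bob", "Jones", "example.com")
print(pattern, guess)
```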

Natasha identified the correct email pattern for the target organisation from a single email address found in a public conference agenda from 2020. She then used a list of names from LinkedIn and generated 47 email addresses. She verified 41 of them as active using smtp-user-enum against the organisation’s mail server. This is not an exploit: no email was sent and no inbox was accessed. It is, however, active collection rather than passive observation, because each probe is a connection the mail server can log. The enumeration relied on the SMTP VRFY and RCPT TO commands. Most modern servers disable VRFY, but many still answer RCPT TO differently for valid and invalid recipients.

Email permutation and SMTP verification:

# Generate permutations for a known name and domain
python3 email-permutator.py --first alice --last smith --domain example.com

# Verify which permutations are valid (does not send email)
smtp-user-enum -M VRFY -U generated-emails.txt -t mail.example.com

4.3 Job Posting Intelligence

Job postings are one of the most undervalued OSINT surfaces. They are written by people inside the organisation, describing specific internal needs, and they reveal information that would never appear in marketing materials.

A job posting for a “Senior Security Engineer” that requires “experience with CrowdStrike Falcon and SentinelOne” strongly suggests which EDR products the organisation has deployed. A posting for a “Database Reliability Engineer” requiring “PostgreSQL 15 and Aurora MySQL” reveals the database choices. A posting requiring “experience migrating from Oracle to PostgreSQL” reveals an active migration project.

Postings for unusual specialisations reveal unusual infrastructure. A posting for someone with “experience with TS/SCI-cleared environments” reveals the organisation has government contracts with classified data handling requirements. A posting requiring “experience with HSMs and key ceremony processes” reveals cryptographic infrastructure and regulatory context.

Historical postings via the Wayback Machine reveal what the organisation was building in previous years. Comparing postings from 2019, 2021, and 2023 provides a timeline of technical direction changes, completed migrations, and abandoned projects.

After Natasha had already mapped the client’s infrastructure through DNS and CT logs, she reviewed historical job postings archived in the Wayback Machine. A 2021 posting for a “Kafka + Flink Data Engineer” confirmed the event streaming architecture she had inferred from LinkedIn. But a 2019 posting for an “IBM Mainframe COBOL Developer” revealed a legacy system that appeared in none of the modern infrastructure. That mainframe was still running. It was not reachable from the internet, but it was connected to the same network as the modern infrastructure, and neither the client nor their security team had mentioned it.

4.4 Conference Talks, Publications, and Blog Posts

Engineers and security professionals give conference talks. Those talks contain architecture diagrams, deployment decisions, technology choices, and lessons learned from failures. The same people publish blog posts and open-source tools that reveal operational context.

Searching for employees’ names on conference archives (DEF CON, Black Hat, AWS re:Invent, KubeCon, USENIX) provides a stream of technical detail that is individually innocuous and collectively specific. A talk titled “How We Scaled Our gRPC Gateway to Handle 10M Requests Per Second” reveals the existence of a gRPC gateway, the approximate scale of the organisation’s traffic, and the specific engineering challenges they solved.

Part 5: Document and Metadata Intelligence

5.1 PDF and Office Document Metadata

The metadata embedded in documents is a second document, invisible to casual readers but extractable in seconds with freely available tools. The properties of an Office document include:

Author: the name of the person who created the document, taken from the Office user name setting, which often defaults to the account display name or login name
Company: the organisation name from the Office installation
LastModifiedBy: the last person to edit the document
Template: the name of the document template used (sometimes contains internal naming conventions)
RevisionNumber: how many times the document was revised (reveals whether it was a quick draft or a lengthy process)
TotalEditingTime: the cumulative time spent editing the document
Created and Modified timestamps: tell you when work actually started and ended

PDF metadata has its own fields: Creator (the application that created the original), Producer (the PDF conversion engine), Author, and custom XMP metadata fields that vary by application.

# Extract all metadata from a document
exiftool document.pdf
exiftool presentation.pptx

# Example output for a PDF:
# Author: alice.smith
# Creator: Microsoft Word 2019
# Producer: Adobe PDF Library 15.0
# Created: 2023-04-12 09:23:11+00:00
# Modified: 2023-04-14 16:45:33+00:00
# Company: Example Corp (Internal Use Only)

The Author field is particularly revealing when it contains the login username rather than the person’s display name. A username like asmith or alice.smith directly reveals the organisation’s username convention, which in turn often reveals the email pattern.
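
The inference from one known name and username pair to a general convention can be sketched as follows. The pattern labels and both helper functions are hypothetical, and the assumption that email addresses follow the same convention as usernames needs independent confirmation.

```python
def build_username(label, first, last):
    """Build a username from a real name under a named convention."""
    f, l = first.lower(), last.lower()
    patterns = {
        "first.last": f"{f}.{l}",
        "firstlast": f"{f}{l}",
        "flast": f"{f[0]}{l}",
        "first_last": f"{f}_{l}",
        "lastf": f"{l}{f[0]}",
    }
    return patterns[label]

def infer_convention(username, first, last):
    """Return the convention label that reproduces `username`, or None."""
    for label in ("first.last", "firstlast", "flast", "first_last", "lastf"):
        if build_username(label, first, last) == username.lower():
            return label
    return None

# One Author value tied to a known employee is enough to test the pattern.
label = infer_convention("asmith", "Alice", "Smith")
print(label)
print(build_username(label, "Bob", "Jones") + "@example.com")
```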

5.2 EXIF Data in Images

EXIF (Exchangeable Image File Format) data is metadata embedded in JPEG, TIFF, and HEIC images by the capturing device. It includes:

Camera make and model (or smartphone model)
Lens information
Capture timestamp
GPS coordinates (if enabled on the device)
Software used for editing
Colour profile and display metadata

GPS coordinates in published images are not a minor concern. They reveal the physical location of the photographer at the moment of capture. For organisations that publish office photos, event photos, or product launch images, this is a direct map of office locations, data centre locations, and employee whereabouts.

The McAfee case (2012) is the textbook example: Vice published a photo in an interview with John McAfee while he was evading authorities. The EXIF data contained GPS coordinates. He was located and arrested within days.

For organisations, the risk is typically lower in magnitude but higher in frequency. Routine publication of internal event photos with GPS data enabled creates a persistent, searchable record of office locations, conference attendance, and meeting patterns that was never intended to be public.

exiftool photo.jpg | grep -i GPS
# GPS Latitude: 37°46'26.81"N
# GPS Longitude: 122°25'10.48"W
# GPS Altitude: 5.2 m Above Sea Level
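
Mapping tools generally expect signed decimal degrees rather than the degrees, minutes, seconds form shown above. A minimal conversion sketch follows; the function name is illustrative, and the parser accepts both the `°` form used here and the `deg` form that exiftool prints by default.

```python
import re

DMS = re.compile(r"""\s*(\d+)\s*(?:°|deg)\s*(\d+)'\s*([\d.]+)"\s*([NSEW])\s*""")

def dms_to_decimal(dms):
    """Convert a coordinate such as 37°46'26.81"N to signed decimal degrees."""
    match = DMS.fullmatch(dms)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value

lat = dms_to_decimal("37°46'26.81\"N")
lon = dms_to_decimal("122°25'10.48\"W")
print(f"{lat:.6f}, {lon:.6f}")
```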

Part 6: Cloud and Third-Party Exposure

6.1 S3 and Cloud Storage Enumeration

Amazon S3 bucket names are globally unique identifiers that form predictable URLs: https://bucket-name.s3.amazonaws.com. If an organisation has a bucket named example-corp-backups and it is configured to allow public listing, anyone who knows or guesses the name can enumerate its contents.

Bucket names frequently follow discoverable patterns:

companyname-backups, companyname-assets, companyname-logs
projectname-data, projectname-deploy
Names derived from internal code names found in job postings or GitHub comments

Tools for S3 enumeration:

# S3Scanner checks a list of bucket names for public access
s3scanner -bucket-file buckets.txt
# (S3Scanner 3.x syntax; 2.x used: s3scanner scan --buckets-file buckets.txt)

# Generate candidate bucket names
python3 bucket-name-gen.py --company example-corp --output buckets.txt

# Test directly
aws s3 ls s3://example-corp-backups --no-sign-request

The --no-sign-request flag is important: it performs the request without any AWS credentials, testing whether the bucket allows unauthenticated access.
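
A candidate list like buckets.txt can be produced by combining name stems with common suffixes. The sketch below is one possible generator, not a specific published tool; the stems, suffixes, and separators are assumptions to be replaced with terms gathered during collection.

```python
from itertools import product

def candidate_buckets(stems, suffixes, separators=("-", ".", "")):
    """Combine stems and suffixes into plausible bucket names."""
    names = set()
    for stem, sep, suffix in product(stems, separators, suffixes):
        name = f"{stem}{sep}{suffix}".lower()
        # S3 bucket names must be between 3 and 63 characters long.
        if 3 <= len(name) <= 63:
            names.add(name)
    return sorted(names)

# Hypothetical inputs: a company name and an internal code name from a job posting.
stems = ["example-corp", "projectfalcon"]
suffixes = ["backups", "assets", "logs"]
for name in candidate_buckets(stems, suffixes):
    print(name)
```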

Qualys reported in 2023 that nearly half of audited S3 configurations had some form of misconfiguration. The categories range from allowing public listing (lowest risk) through allowing public reads (moderate risk) to allowing public writes (severe risk, bucket can be used for malware distribution under the victim’s domain).

GCP Cloud Storage (storage.googleapis.com) and Azure Blob Storage (*.blob.core.windows.net) have similar enumeration characteristics. The naming conventions differ but the methodology is the same.

6.2 Exposed APIs and Dashboards

Swagger and OpenAPI documentation: Many organisations expose their API documentation publicly. Swagger UI at api.example.com/swagger or developer.example.com/docs describes every endpoint, its parameters, its authentication requirements, and often includes example requests and responses. This is not necessarily a misconfiguration; public API documentation is intentional. But it also documents every attack surface the API presents.

# Google dork for exposed Swagger instances
site:example.com inurl:swagger OR inurl:api-docs OR inurl:"openapi.json"

Grafana and observability dashboards: Grafana instances are often set up with a default configuration that allows anonymous viewing of dashboards. An organisation’s monitoring dashboard may contain:

Real-time request rates and error rates (reveals scale and current health)
Database query counts and latency (reveals database architecture)
Alert thresholds and SLO definitions (reveals what the organisation considers acceptable)
Hostname patterns for all instrumented services

# Shodan query for public Grafana instances
product:Grafana title:"Grafana" http.component:"Grafana"

Exposed Jupyter notebooks: Data science teams often run Jupyter notebooks on servers and forget to secure them. A publicly accessible Jupyter instance provides a Python execution environment within the organisation’s network, typically with access to the same datasets and APIs the data scientist was using.

# Shodan dork for unauthenticated Jupyter
"jupyter" port:8888 http.title:"Jupyter"

Natasha found a publicly accessible Grafana dashboard via a Shodan query against the client’s IP range. The dashboard was titled “Platform Health Overview.” It displayed real-time request counts (revealing approximate user counts), error rates broken down by microservice name (revealing all service names), a database connection pool utilisation graph (revealing the database type and configuration), and a panel showing the number of active Kubernetes pods per namespace. None of this was behind authentication. The dashboard had been configured for public access during a troubleshooting session and never locked down.

Part 7: Synthesis, Tooling, and the Graph Model

7.1 The Graph Model of Intelligence

OSINT is fundamentally a graph problem. Every piece of data is a node. Every relationship between data points is an edge. Intelligence emerges when the graph becomes dense enough to reveal paths that connect the question to the answer.

The nodes in a typical corporate OSINT investigation include: domains, IP addresses, ASNs, subdomains, people, email addresses, organisations, cloud resources, repositories, and documents. The edges include: resolves-to, owns, employs, committed-to, issued-certificate-for, links-to, and mentions.

The investigator’s task is to expand the graph from known seed nodes until the answer to the intelligence question becomes visible as a structural property of the graph.
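
The graph model can be illustrated with a plain adjacency list and a breadth-first search. This is a minimal sketch with hypothetical seed data; a real investigation would use a graph database or one of the tools described below.

```python
from collections import deque

# (source, relationship, destination) triples; all values are illustrative.
EDGES = [
    ("Example Corp", "owns", "example.com"),
    ("Example Corp", "employs", "alice.smith"),
    ("example.com", "resolves-to", "203.0.113.10"),
    ("alice.smith", "committed-to", "example-corp/backend"),
]

def build_graph(edges):
    """Store each edge in both directions so discovery can run either way."""
    graph = {}
    for src, rel, dst in edges:
        graph.setdefault(src, []).append((rel, dst))
        graph.setdefault(dst, []).append((f"inverse:{rel}", src))
    return graph

def find_path(graph, start, goal):
    """Return the shortest list of nodes linking start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel, neighbour in graph.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

graph = build_graph(EDGES)
print(find_path(graph, "alice.smith", "203.0.113.10"))
```

The returned path is the chain of evidence: each hop corresponds to one collected fact that should be independently verifiable.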

7.2 Maltego

Maltego is the graphical OSINT platform that makes the graph model operational. It represents entities as nodes, relationships as edges, and transforms as automated queries that expand the graph by querying external data sources.

Starting from a company name, a sequence of Maltego transforms can:

Resolve the company to its domain names and IP ranges
Expand domains to subdomains via CT logs
Expand IPs to Shodan banners
Expand domains to WHOIS contacts
Expand email addresses to LinkedIn profiles
Expand LinkedIn profiles to connections and former employers

The visual graph immediately reveals clusters, central nodes, and structural anomalies that would take hours to identify in tabular data. Maltego integrates with over 120 data partners including Shodan, SecurityTrails, VirusTotal, and HaveIBeenPwned.

7.3 SpiderFoot

SpiderFoot is an open-source OSINT automation framework that runs over 200 modules against a seed (domain, IP, email, name, ASN) and aggregates data from over 100 external sources. It produces a structured report with a built-in graph view.

# Install and run via Docker
docker pull smicallef/spiderfoot
docker run -p 5001:5001 smicallef/spiderfoot

# Or run programmatically via CLI
spiderfoot -s example.com -t INTERNET_NAME -m sfp_crt,sfp_dnsbrute,sfp_shodan,sfp_linkedin -o json

The SpiderFoot module ecosystem includes certificate enumeration, DNS brute force, Shodan queries, LinkedIn scraping, Google dorking, breach data queries, and S3 bucket enumeration. Running all relevant modules against a seed domain produces an investigation graph in minutes that would take days manually.

7.4 Recon-ng

Recon-ng is a command-line reconnaissance framework with a module architecture similar to Metasploit. It is designed for structured, documented OSINT campaigns where each data collection step is logged, reproducible, and auditable.

recon-ng
[recon-ng] > workspaces create example-corp
[recon-ng][example-corp] > modules search recon/domains
[recon-ng][example-corp] > modules load recon/domains-hosts/certificate_transparency
[recon-ng][example-corp] > options set SOURCE example.com
[recon-ng][example-corp] > run

Recon-ng stores all findings in a local SQLite database, enabling joins and queries across collected data. This is particularly useful in long-running investigations where data from multiple sessions needs to be synthesised.
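
The kind of cross-table query this enables can be sketched with Python's built-in sqlite3 module. The two tables below are a simplified stand-in for a workspace database, and the column set is an assumption; inspect the actual schema of your workspace before adapting the query.

```python
import sqlite3

# In-memory stand-in for a workspace database (simplified, assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (host TEXT, ip_address TEXT, module TEXT);
CREATE TABLE ports (ip_address TEXT, port TEXT, protocol TEXT, module TEXT);
""")
conn.executemany("INSERT INTO hosts VALUES (?, ?, ?)", [
    ("www.example.com", "203.0.113.10", "certificate_transparency"),
    ("grafana.example.com", "203.0.113.25", "certificate_transparency"),
])
conn.executemany("INSERT INTO ports VALUES (?, ?, ?, ?)", [
    ("203.0.113.10", "443", "tcp", "shodan_ip"),
    ("203.0.113.25", "3000", "tcp", "shodan_ip"),
])

def exposed_services(connection):
    """Join hostnames found in one session with ports found in another."""
    query = """
        SELECT h.host, p.port
        FROM hosts AS h
        JOIN ports AS p ON p.ip_address = h.ip_address
        ORDER BY h.host, CAST(p.port AS INTEGER)
    """
    return connection.execute(query).fetchall()

for host, port in exposed_services(conn):
    print(f"{host}:{port}")
```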

Part 8: OPSEC for the Investigator

OSINT collection is not uniformly passive. A crt.sh query or a Shodan search leaves no trace on the target’s side. But a direct DNS query, an HTTP request to a discovered endpoint, or a LinkedIn profile view does leave an observable trace there.

The OSINT investigator’s OPSEC concerns are:

Do not tip off the target. A sudden spike in DNS queries against a company’s nameservers from an unusual IP block may trigger their monitoring systems. A LinkedIn profile view from a profile associated with a known investigation firm warns the subject.

Separate investigation infrastructure. All active queries should originate from infrastructure that is not connected to the investigator’s real identity. This means dedicated VMs, dedicated network egress (not the investigator’s home or office IP), and browser profiles with no persistent session state.

Use passive sources where possible. crt.sh, SecurityTrails, Shodan, and the Wayback Machine do not notify the target when queries are made. Direct DNS queries, HTTP requests, and social media profile views do leave observable traces.

Manage sock puppet accounts carefully. LinkedIn views from a real LinkedIn account are visible to the profile owner in some configurations. A dedicated investigation account that is plausible but not connected to the investigator’s identity is the standard approach.

Natasha, conducting an authorised red team assessment, queried the target’s Grafana instance directly from her office IP to verify that it was publicly accessible. The target’s WAF logged the request. The security team investigated the IP, found it resolved to a security consultancy, and realised an assessment was underway before the client’s management intended to disclose it. The lesson: even authorised assessments require OPSEC discipline. Passive verification (for example, confirming the dashboard appeared in Shodan rather than accessing it directly) would have avoided the early disclosure.

Part 9: Defence and Surface Reduction

Understanding the offensive techniques is the prerequisite for effective defence. Each technique described above has a corresponding defensive measure.

Infrastructure reconnaissance:

Use WHOIS privacy for domain registrations so that registrant names, email addresses, and phone numbers are not published.
Monitor Certificate Transparency logs for your own domains with Certstream or Cert Spotter. Alert on unexpected certificates.
Enumerate your own subdomains from CT logs and passive DNS on a regular schedule. Find forgotten infrastructure before others do.
Audit internet-facing exposure with Shodan and Censys against your own ASN and IP ranges quarterly.

DNS:

Restrict zone transfers (AXFR) to authorised secondary nameservers using ACLs and TSIG authentication.
Remove DNS records for decommissioned services. A DNS record pointing to a dead service is a subdomain takeover waiting to happen.
Audit PTR records; avoid naming conventions that reveal server roles, environments, and regions.
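
The stale-record check can be sketched as a scan for CNAME records whose targets no longer resolve. This is a simplified illustration: the zone data and hostnames are hypothetical, and a target that still resolves can also be vulnerable if the underlying service account has been released, which this check does not detect.

```python
import socket

def resolves(hostname):
    """Return True if the hostname currently resolves (live DNS lookup)."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def dangling_cnames(records, resolver=resolves):
    """Return (name, target) for every CNAME whose target does not resolve.

    `records` is a list of (name, record_type, value) tuples, for example
    exported from your own zone file.
    """
    return [
        (name, value)
        for name, rtype, value in records
        if rtype.upper() == "CNAME" and not resolver(value.rstrip("."))
    ]

# Offline demonstration with a stand-in resolver instead of live DNS.
zone = [
    ("www.example.com", "A", "203.0.113.10"),
    ("shop.example.com", "CNAME", "example-shop.hosted-store.example.net."),
    ("docs.example.com", "CNAME", "example-docs.docs-host.example.org."),
]
still_live = {"example-docs.docs-host.example.org"}
print(dangling_cnames(zone, resolver=lambda host: host in still_live))
```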

Web surface:

Periodically run Google dorks against your own domains. Search for exposed configuration files, directory listings, and administrative interfaces.
Establish a process for metadata stripping before any document is published publicly.
Enable rate limiting and bot detection on sensitive endpoints that should not be accessible to automated scanners.

Repository intelligence:

Scan all public repositories with TruffleHog on a weekly schedule. Alert on any commit that introduces a new secret type.
Implement pre-commit hooks with Gitleaks in all internal and public repositories.
Audit public repository membership. Developers who leave the organisation should be removed from the GitHub organisation promptly.

Personnel:

Establish a published information policy: what technology names, project names, and infrastructure details are appropriate to list publicly.
Monitor the aggregate of employee LinkedIn profiles quarterly. What does the skill distribution reveal about your technology choices and current migrations?

Cloud and third-party:

Run S3Scanner and AWS IAM Access Analyzer against all buckets quarterly.
Audit every publicly accessible dashboard, status page, and API documentation page for operational data that was not intended to be public.
Review third-party integrations for public-facing components. Jira Service Management, Confluence, Datadog, and Grafana all have publicly accessible modes that are occasionally enabled unintentionally.

Continuous monitoring: The single most effective defensive posture is a continuous OSINT programme run against your own organisation. The goal is not to eliminate your public footprint; the goal is to understand it completely, make deliberate decisions about what it reveals, and find surprises before adversaries do.

Part 1 Addendum: BGP Routing Intelligence and Network Topology

1.6 BGP Looking Glasses, RIPE Stat, and Route History

Border Gateway Protocol (BGP) is the routing protocol of the public internet. Route announcements propagate between Autonomous Systems and are observed by route collectors such as those operated by RIPE NCC (RIS) and the University of Oregon (RouteViews), with CAIDA among the projects that archive and analyse the resulting data. These archives reach back to the late 1990s, creating a long-running record of which AS has announced which prefixes at what point in time.

The intelligence value of this record is significant. A company that acquired another company will absorb its IP space; the BGP history records the exact date of the announcement change. A company that migrated from one data centre to another will show a change in the origin AS for its prefixes. A company that outsourced its network operations will show its prefixes now originating from the MSP’s ASN. Each of these transitions is a dated data point with strategic implication.

RIPE Stat (stat.ripe.net) is the most accessible interface to this data. Querying any IP prefix or ASN returns:

All historical BGP announcements and withdrawals for the prefix with timestamps
The complete origin AS path history, showing every AS that has ever announced the prefix
RPKI validation status (valid, invalid, or not-found)
Reachability visibility across all route collectors
Peer ASN relationships, revealing upstream and downstream connectivity

# RIPE Stat API: all current BGP announcements for a prefix
curl "https://stat.ripe.net/data/bgp-state/data.json?resource=203.0.113.0/24"

# Historical routing data for an ASN
curl "https://stat.ripe.net/data/routing-history/data.json?resource=AS64512"

# All prefixes currently announced by an ASN, with first and last seen dates
curl "https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS64512"

# BGP looking glass queries via RIPE RIS
# https://ris.ripe.net/cgi-bin/lg
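
Once routing history for a prefix has been downloaded, origin changes can be extracted mechanically. The sketch below assumes a simplified response shape (a `by_origin` list with per-prefix `timelines`); check the field names against the actual API response before relying on it. The sample data is illustrative.

```python
def origin_timeline(data):
    """Flatten the assumed response shape into sorted (prefix, start, end, origin) tuples."""
    events = []
    for entry in data.get("by_origin", []):
        for prefix in entry.get("prefixes", []):
            for span in prefix.get("timelines", []):
                events.append(
                    (prefix["prefix"], span["starttime"], span["endtime"], entry["origin"])
                )
    # ISO 8601 timestamps sort chronologically as plain strings.
    return sorted(events)

def origin_changes(events):
    """Return (prefix, date, old_origin, new_origin) for each change of origin AS."""
    changes = []
    last_origin = {}
    for prefix, start, _end, origin in events:
        previous = last_origin.get(prefix)
        if previous is not None and previous != origin:
            changes.append((prefix, start, previous, origin))
        last_origin[prefix] = origin
    return changes

# Illustrative data using documentation prefixes and private-use ASNs.
sample = {
    "by_origin": [
        {"origin": "64512", "prefixes": [{"prefix": "203.0.113.0/24", "timelines": [
            {"starttime": "2018-03-01T00:00:00", "endtime": "2021-05-31T00:00:00"}]}]},
        {"origin": "64513", "prefixes": [{"prefix": "203.0.113.0/24", "timelines": [
            {"starttime": "2021-06-01T00:00:00", "endtime": "2024-01-01T00:00:00"}]}]},
    ]
}
print(origin_changes(origin_timeline(sample)))
```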

RPKI and Route Origin Validation

Resource Public Key Infrastructure (RPKI) allows IP address holders to publish Route Origin Authorizations (ROAs): cryptographically signed objects specifying which ASN is authorised to announce a given prefix. A BGP route whose origin ASN does not match any valid ROA has status “invalid” and may indicate a BGP hijack or route leak.
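
The validation logic itself is compact. The sketch below follows the origin validation procedure of RFC 6811 using only the standard library; ROAs are modelled as plain (prefix, max length, ASN) tuples, and the sample values are illustrative.

```python
import ipaddress

def validate_origin(route_prefix, origin_asn, roas):
    """Classify a route as 'valid', 'invalid', or 'not-found'.

    `roas` is an iterable of (prefix, max_length, asn) tuples.
    """
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # A ROA covers the route only if the route falls inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa_asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered by at least one ROA but matched by none means invalid.
    return "invalid" if covered else "not-found"

roas = [("203.0.113.0/24", 24, 64512)]
print(validate_origin("203.0.113.0/24", 64512, roas))   # matching origin
print(validate_origin("203.0.113.0/24", 64513, roas))   # wrong origin
print(validate_origin("203.0.113.0/25", 64512, roas))   # more specific than max length
print(validate_origin("198.51.100.0/24", 64512, roas))  # no covering ROA
```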

As of 2024, approximately 45% of globally routable prefixes are covered by ROAs. For OSINT purposes, the RPKI status of an organisation’s prefixes reveals the level of routing security investment. Cloudflare’s BGP hijack detection system, described in their engineering blog, demonstrates how public BGP data can be used to build real-time hijack detection. The same data is available to any researcher.

# Check RPKI validity for an origin ASN and prefix pair via RIPE Stat
curl "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS64512&prefix=203.0.113.0/24"

# BGPalerter: open-source real-time monitoring for hijacks and ROA violations
# https://github.com/nttgin/BGPalerter
# Sends alerts when an ASN's prefixes are announced by an unexpected origin

For OSINT practitioners, a BGP history search often surfaces IP ranges an organisation used before migrating to a cloud provider. The old prefixes may still have associated DNS PTR records, may appear in passive DNS archives, and may have residual services still listening that the organisation believes are decommissioned simply because they are no longer the primary infrastructure.

Part 2 Addendum: Exposed Version Control Directories and Source Code Disclosure

2.4 Git Directory Exposure

When developers deploy web applications by copying a working directory to a server rather than using a proper build pipeline, the .git directory containing the complete version history is often served alongside the application. The .git/config file, .git/HEAD, and .git/COMMIT_EDITMSG are served as static files. Tools like git-dumper can reconstruct the entire codebase from these exposed objects.

# First check: does the server return the .git/HEAD file?
curl -s https://example.com/.git/HEAD
# A body such as "ref: refs/heads/main" confirms exposure;
# a 200 status alone does not, because many sites return 200 for custom error pages

# Automated reconstruction
pip install git-dumper
git-dumper https://example.com/.git /tmp/recovered-repo
# git-dumper fetches all reachable objects, resolves pack files,
# and checks out the working tree locally from the downloaded data

# Manual verification: git config reveals the remote origin URL
curl -s https://example.com/.git/config
# [remote "origin"]
#     url = git@bitbucket.org:example-corp/backend.git
# This reveals the organisation's Bitbucket workspace and a private repository name
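
When auditing your own hosts, the two checks above can be automated by inspecting the response bodies rather than status codes. A minimal sketch, assuming the bodies have already been fetched; the function names are illustrative.

```python
import re

# A genuine HEAD file holds either a symbolic ref or a 40-character commit hash.
HEAD_PATTERN = re.compile(r"^(ref: refs/\S+|[0-9a-f]{40})\s*$")
REMOTE_URL = re.compile(r"^\s*url\s*=\s*(\S+)", re.MULTILINE)

def looks_like_git_head(body):
    """Return True if the body has the shape of a real .git/HEAD file."""
    return bool(HEAD_PATTERN.match(body))

def remote_urls(config_text):
    """Extract every remote URL from the text of a .git/config file."""
    return REMOTE_URL.findall(config_text)

config = '[remote "origin"]\n\turl = git@bitbucket.org:example-corp/backend.git\n'
print(looks_like_git_head("ref: refs/heads/main\n"))
print(looks_like_git_head("<!DOCTYPE html><title>Not found</title>"))
print(remote_urls(config))
```

Matching on content avoids false positives from servers that answer every path with a 200 status and a generic page.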

A repository recovered from a production server is more sensitive than a development repository. Environment-specific files (.env.production, config/production.yml) that developers exclude from version control in development are sometimes committed to production branches. PortSwigger’s Web Security Academy describes the exposed .git directory as “a critical misconfiguration that should never occur in production environments” and documents that recovering it routinely yields database passwords, admin credentials, and internal API keys.

Related version control exposures

The same class of misconfiguration affects Subversion (.svn/entries, .svn/wc.db), Mercurial (.hg/), and Bazaar (.bzr/). Beyond version control, text editors leave backup files at predictable paths: Vim creates .swp files, and many editors produce ~-suffixed copies. PHP frameworks often generate .php.bak files during framework updates.

robots.txt as an intelligence source

robots.txt instructs search engine crawlers not to index certain paths. Site owners routinely list sensitive paths they want to keep out of search results, which inadvertently maps those paths for any researcher who reads the file.

# robots.txt frequently contains lines like:
# Disallow: /backup/
# Disallow: /admin/
# Disallow: /internal/
# Disallow: /staging/
# Disallow: /.git/          <- sometimes listed, sometimes not
# Disallow: /api/v1/docs

curl https://example.com/robots.txt
# Then methodically request each Disallow path to determine if it is accessible
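
Extracting the Disallow paths from a saved robots.txt is straightforward. The parser below is a minimal sketch that ignores user-agent grouping and wildcard semantics; it only collects the distinct paths in the order they appear.

```python
def disallowed_paths(robots_txt):
    """Return the distinct Disallow paths in a robots.txt body, in order."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow value means "allow everything" and is skipped.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths

sample = """User-agent: *
Disallow: /backup/
Disallow: /admin/   # private
Disallow:
Allow: /public/
Disallow: /backup/
"""
print(disallowed_paths(sample))
```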

Error pages and framework fingerprinting

A Django debug error page appears when DEBUG=True is left enabled in production. It contains the full local file path of the application, the stack trace with local variables, the installed apps and middleware, the Python version and module search path, and the settings dictionary. Django masks settings whose names contain strings such as PASSWORD, SECRET, KEY, or TOKEN, but database hosts, database names, and usernames remain visible. Comparable detail leaks from Rails exception pages, Spring Boot Whitelabel Error Pages with stack traces, and Express.js default error handlers. These are not subtle information disclosures; they are near-complete configuration dumps served over HTTP.

Part 3 Addendum: Supply Chain and Dependency Intelligence

3.4 Package Registry Intelligence and Dependency Confusion

An organisation’s internal package names are observable from multiple sources before any attack is attempted: GitHub repository names often reflect internal naming, Docker Hub image tags carry organisation prefixes, npm organisation scopes and PyPI organisation prefixes are queryable via public APIs, and job postings frequently mention internal tooling by name. Conference talks and engineering blog posts go further: engineers describing their internal platform architecture name the packages they have built.

Dependency confusion

Alex Birsan’s 2021 research demonstrated that the namespace resolution logic of package managers could be exploited when build systems are configured with both an internal registry and a public one. In the affected configurations, the package manager prefers the highest version number across all configured registries. An attacker who discovers that a company uses an internal package named example-corp-auth can publish a package with the same name and a high version number on the public npm or PyPI registry; the next build will download the malicious version instead of the internal one. This technique earned Birsan over $130,000 in bug bounties from companies including Apple, Microsoft, Shopify, and PayPal.

# Enumerate all npm packages under an organisation's scope
curl "https://registry.npmjs.org/-/v1/search?text=%40example-corp&size=100" | 
  jq -r '.objects[].package.name'

# Check if an internal package name exists on PyPI (if it does not, the namespace is claimable)
pip index versions example-corp-internal-auth 2>/dev/null && echo "exists" || echo "claimable"

# Collect the dependency names declared in package.json across an organisation's public GitHub repos
# (names in the result that do not exist on the public registry are candidate internal packages)
gh repo list example-corp --limit 200 --json name -q '.[].name' | while read -r repo; do
  gh api "repos/example-corp/$repo/contents/package.json" 2>/dev/null |
    jq -r '.content' | base64 -d 2>/dev/null |
    jq -r '.dependencies // {} | keys[]' 2>/dev/null
done | sort -u > all-dependencies.txt
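
The resolution behaviour that dependency confusion relies on can be simulated without touching any registry. This sketch models registries as dictionaries and applies the naive "highest version wins" rule; the package names and versions are hypothetical, and real package managers have more nuanced resolution rules.

```python
def parse_version(text):
    """Turn '1.3.1' into (1, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def resolve(package, registries):
    """Return (version, registry_name) for the highest version found, or None."""
    candidates = [
        (parse_version(version), version, name)
        for name, index in registries.items()
        for version in index.get(package, [])
    ]
    if not candidates:
        return None
    _, version, name = max(candidates, key=lambda item: item[0])
    return version, name

registries = {
    "internal": {"example-corp-auth": ["1.2.0", "1.3.1"]},
    "public": {"example-corp-auth": ["99.0.0"]},  # attacker-published
}
print(resolve("example-corp-auth", registries))
```

The defensive counterpart is to make this selection impossible: pin the registry per package or scope, or reserve the internal names on the public registry.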

Typosquatting detection as an intelligence signal

An organisation that is already a target of typosquatting supply chain attacks has probably been profiled by threat actors who assessed its dependency list as valuable. Tools like depscan (OWASP) report packages for which known typosquats exist in public registries. The presence of a typosquat targeting an organisation’s exact package names suggests that adversarial OSINT preceded the attack.

The emerging threat of slopsquatting compounds this: LLM-generated code sometimes references package names that do not exist. Threat actors now monitor LLM outputs for hallucinated package names and register those names on public registries before developers report the error, creating supply chain attack vectors from AI-generated code.
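
Near-miss names can be found by comparing registry names against your own package names with an edit distance. The sketch below uses a plain Levenshtein implementation; the threshold of 2 and the sample names are assumptions, and short package names will need a stricter threshold to avoid noise.

```python
def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def likely_typosquats(own_packages, registry_names, max_distance=2):
    """Return (candidate, own_package, distance) for close but non-identical names."""
    hits = []
    for candidate in registry_names:
        for own in own_packages:
            distance = edit_distance(candidate, own)
            if 0 < distance <= max_distance:
                hits.append((candidate, own, distance))
    return sorted(hits)

own = ["example-corp-auth"]
observed = ["example-corp-auth", "exmaple-corp-auth", "example-corp-aut", "requests"]
for hit in likely_typosquats(own, observed):
    print(hit)
```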

Software Bill of Materials as an intelligence source

SBOMs in CycloneDX or SPDX format are increasingly required by US government procurement (Executive Order 14028) and enterprise vendor contracts. Organisations that publish SBOMs, or whose SBOMs are discoverable in container image registries, have provided a complete dependency manifest: the entire software stack with version-specific detail. Container image layers in public Docker Hub or GitHub Container Registry images reveal the complete package list, build-time configuration, and which internal registry was used during the build.

# Extract SBOM from a Docker image (if present as OCI artifact)
docker buildx imagetools inspect example-corp/backend:latest --format '{{json .SBOM}}'

# Inspect an image's layer history (reveals RUN commands from Dockerfile)
docker history example-corp/backend:latest --no-trunc

Part 4 Addendum: Breach Data, Stealer Logs, and Paste Site Intelligence

4.5 Credential Intelligence from Public and Semi-Public Sources

Compromised credential data is one of the most operationally significant OSINT categories for corporate investigations because it directly reveals the human attack surface: which employees have had their credentials exposed, through which breaches, with what data classes, and from which time period.

HaveIBeenPwned (HIBP)

Troy Hunt’s HaveIBeenPwned aggregates breach data from hundreds of breached sites, currently indexing over 12 billion records. The v3 API requires an API key for account lookups. Domain-level searches additionally require that the key holder has verified control of the domain, and return every breached alias at that domain together with the breaches it appears in.

# Single address check (v3 API; an API key is required)
curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/user@example.com?truncateResponse=false" \
  -H "hibp-api-key: YOUR_KEY" | jq '[.[] | {Name, BreachDate, DataClasses}]'

# Domain-level search: all breached aliases for a domain the key holder has verified
curl -s "https://haveibeenpwned.com/api/v3/breacheddomain/example.com" \
  -H "hibp-api-key: YOUR_KEY"
# Returns a map of alias to breach names, e.g. {"alice.smith":["Adobe","LinkedIn"],"bob":["Adobe"]}

The domain-level endpoint is the most valuable one for corporate work, although the verification requirement means it serves the domain owner or an assessor working with the owner’s cooperation. An organisation appearing in twenty breaches, the most recent including plaintext passwords and security questions, has a substantially different risk profile from one appearing in three breach records from ten years ago whose only exposed data class is email addresses.
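
That comparison can be made explicit with a simple score that rewards sensitive data classes and discounts age. The weights below are arbitrary assumptions chosen for illustration, not an established metric; the field names follow the Name, BreachDate, and DataClasses fields used in the queries above, and the sample breaches are hypothetical.

```python
from datetime import date

# Hypothetical weights: data classes that enable direct account takeover score higher.
SENSITIVE = {
    "Passwords": 3,
    "Security questions and answers": 3,
    "Password hints": 2,
}

def exposure_score(breach, today):
    """Higher means more urgent: severity of data classes divided by age in years."""
    breach_year = int(breach["BreachDate"][:4])
    age_years = max(today.year - breach_year, 0)
    severity = 1 + sum(SENSITIVE.get(name, 0) for name in breach["DataClasses"])
    return severity / (1 + age_years)

breaches = [
    {"Name": "OldForum", "BreachDate": "2014-02-10", "DataClasses": ["Email addresses"]},
    {"Name": "RecentSaaS", "BreachDate": "2023-05-01",
     "DataClasses": ["Email addresses", "Passwords"]},
]
today = date(2024, 1, 1)
for breach in sorted(breaches, key=lambda b: exposure_score(b, today), reverse=True):
    print(f"{breach['Name']:<12} {exposure_score(breach, today):.2f}")
```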

IntelligenceX and full-text breach search

IntelligenceX (intelx.io) indexes paste sites, data breach archives, leaked documents, and Tor-based resources and makes them full-text searchable. A query for @example.com returns every record containing that domain across all indexed sources, including records deleted from their original location. The IntelX API provides structured programmatic access:

# Search IntelX for domain mentions across all indexed sources
curl -s "https://2.intelx.io/intelligent/search" \
  -H "x-key: YOUR_KEY" \
  -d '{"term":"example.com","buckets":[],"lookuplevel":0,"maxresults":100,"timeout":0,"datefrom":"","dateto":"","sort":2,"media":0,"terminate":[]}'

Stealer logs and infostealer markets

Infostealer malware (RedLine Stealer, Vidar Stealer, Raccoon Stealer, Lumma Stealer) infects individual machines and harvests: browser-saved credentials, autofill data, cryptocurrency wallet files, and active session cookies from every authenticated session open at the time of infection. The harvested data is assembled into “logs” and sold in bulk on Telegram channels and dark web markets.

The intelligence value for corporate investigation is significant. A log from an infected employee machine may contain:

Corporate SSO credentials captured from the browser credential store
Active session cookies for Salesforce, Jira, GitHub, and internal dashboards, enabling session hijacking without the password and regardless of MFA
VPN credentials and MFA backup codes stored as text files
Internal API keys stored in browser local storage by web applications
Browser history from the infected machine, revealing internal tool names, hostnames, and URL patterns not visible from external reconnaissance

Services such as SpyCloud, Hudson Rock Cavalier, and Constella Intelligence monitor stealer log markets and provide enterprise APIs. A company that discovers its employees appear in stealer log data faces a narrow window: the logs are typically actioned within hours to days of distribution.

After Blackriver Analytics discovered, through a stealer log query, that a target organisation’s employees had three sets of corporate credentials exposed via Lumma Stealer logs from the previous quarter, Natasha checked the specific data classes. The logs included active session tokens for the organisation’s cloud console that had not yet expired. The team immediately escalated and documented the exposure without accessing the sessions, exactly as their engagement agreement required. The client did not know the logs existed until Blackriver reported it.

Paste site monitoring

Paste services (Pastebin, Ghostbin, Hastebin, Rentry, PrivateBin instances) are used by threat actors to stage exfiltrated data, share credentials, and publish proofs of compromise. Monitoring for corporate indicators requires continuous automation because pastes are often deleted or expire within hours.

# Google dorking for paste site mentions (rate-limited but free)
site:pastebin.com "example.com" password OR apikey OR token OR internal
site:pastebin.com "@example.com"

# Pastebin Scraping API (real-time feed of recent public pastes;
# requires a Pastebin Pro account with a whitelisted source IP)
curl -s "https://scrape.pastebin.com/api_scraping.php?limit=250" | 
  python3 -c "import sys,json; [print(p['key'],p['title']) for p in json.load(sys.stdin)]"

# For each paste key, fetch the content and search it for domain indicators
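
The per-paste search step can be sketched as a line-by-line scan for the domain plus a few accompanying keywords. The keyword list mirrors the dork above and is an assumption; the function name and sample text are illustrative.

```python
import re

def scan_paste(text, domain, keywords=("password", "apikey", "token", "internal")):
    """Return (line_number, matched_keywords) for each line mentioning the domain."""
    # Match the domain and its subdomains, but not lookalikes such as
    # notexample.com or example.com.evil.net.
    domain_re = re.compile(
        rf"(?<![A-Za-z0-9-]){re.escape(domain)}(?![\w-]|\.\w)", re.IGNORECASE
    )
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if domain_re.search(line):
            lowered = line.lower()
            hits.append((number, [k for k in keywords if k in lowered]))
    return hits

paste = "\n".join([
    "db=vpn.example.com password=hunter2",
    "nothing here",
    "contact bob@example.com",
    "notexample.com token",
    "example.com.evil.net",
])
print(scan_paste(paste, "example.com"))
```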

Commercial alternatives include Recorded Future’s Insikt Group monitoring, DarkOwl’s paste service aggregation, and the Flare platform. All provide alerting when specified indicators appear across paste services.

Part 5 Addendum: Geolocation and Visual Intelligence

5.3 Systematic Image Geolocation

The Bellingcat investigative network has codified a five-layer approach to image geolocation that extends far beyond GPS EXIF data. Their methodology was developed through investigations including the identification of the Buk missile launcher that shot down MH17 in 2014, the documentation of Syrian airstrike locations, and the tracking of military equipment movements. For corporate OSINT, the same methodology applies whenever an organisation publishes photographs of physical locations.

Layer 1: EXIF and technical metadata

GPS coordinates in image EXIF data are the simplest case and already covered in Part 5.2. When GPS data is absent (stripped by the publishing platform or not captured by the device), the remaining technical metadata provides softer constraints: camera make and model narrow the device category and country of sale, the capture timestamp provides time of day (which determines shadow geometry), and the software version fingerprints the processing chain.

Layer 2: Visual landmark analysis

Every environment contains identifying features:

Utility pole design: Countries use recognisably different pole designs, crossarm configurations, and insulator types. Wooden poles are common in North America; concrete poles dominate in Germany and Eastern Europe; steel lattice towers indicate high-voltage lines near suburban areas.
Vegetation: Plant species carry geographic constraints. Mediterranean cypress, palm species, and specific cactus types indicate climate zones and rule out large portions of the map.
Road markings: Lane line colours (white in most countries, yellow in some), centerline patterns, and pedestrian crossing styles differ by country and have changed over time within countries.
Signage fonts and language: Street signs, warning signs, and commercial signage follow country-specific typographic conventions. The DIN font family on signage strongly indicates Germany; the background colour, text colour, and route shield shape of motorway signage help distinguish countries such as France, the UK, and the Netherlands from one another.
Architecture: Roof pitch, window proportions, balcony railing styles, and building material types all carry regional indicators.

Layer 3: Shadow analysis

The direction and length of shadows in a photograph encode the sun’s azimuth and elevation at the moment of capture. Given an approximate latitude and the timestamp, the expected shadow geometry can be calculated and compared to what the photograph shows. Inconsistencies expose fabricated capture times and locations.

# SunCalc: interactive solar position calculator
# https://www.suncalc.org
# Input: latitude, longitude, date, time
# Output: sun azimuth, elevation, shadow direction

# PhotoSunPosition: desktop tool for forensic shadow analysis
# Compares observed shadow directions to computed positions for candidate locations

Shadow analysis has been used to expose fabricated footage in conflict zones where claimed timestamps were inconsistent with observed shadow lengths. For corporate OSINT, the technique is relevant when verifying the claimed location or timing of published photographs.
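The shadow check described above reduces to computing the sun's position for a candidate place and time and comparing it with the photograph. The sketch below uses the widely published NOAA low-accuracy approximation for declination and the equation of time; it is an illustration under stated assumptions, not the algorithm used by SunCalc or any other tool named here. The function names are hypothetical, timestamps must be timezone-aware, and accuracy is roughly within a degree, which is adequate for spotting gross inconsistencies but not for fine forensic work.

```python
import math
from datetime import datetime, timezone


def solar_position(lat_deg: float, lon_deg: float, when: datetime) -> tuple[float, float]:
    """Approximate (elevation, azimuth) in degrees; azimuth is clockwise from north."""
    when = when.astimezone(timezone.utc)  # requires a timezone-aware datetime
    day = when.timetuple().tm_yday
    hours = when.hour + when.minute / 60 + when.second / 3600
    g = 2 * math.pi / 365 * (day - 1 + (hours - 12) / 24)  # fractional year (radians)
    # Equation of time (minutes) and solar declination (radians), NOAA approximation
    eqtime = 229.18 * (0.000075 + 0.001868 * math.cos(g) - 0.032077 * math.sin(g)
                       - 0.014615 * math.cos(2 * g) - 0.040849 * math.sin(2 * g))
    decl = (0.006918 - 0.399912 * math.cos(g) + 0.070257 * math.sin(g)
            - 0.006758 * math.cos(2 * g) + 0.000907 * math.sin(2 * g)
            - 0.002697 * math.cos(3 * g) + 0.00148 * math.sin(3 * g))
    solar_minutes = hours * 60 + eqtime + 4 * lon_deg  # true solar time
    hour_angle = math.radians(solar_minutes / 4 - 180)
    lat = math.radians(lat_deg)
    sin_elev = (math.sin(lat) * math.sin(decl)
                + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, sin_elev))))
    azimuth = (math.degrees(math.atan2(
        math.sin(hour_angle),
        math.cos(hour_angle) * math.sin(lat) - math.tan(decl) * math.cos(lat),
    )) + 180) % 360
    return elevation, azimuth


def shadow(height_m: float, elevation_deg: float, azimuth_deg: float) -> tuple[float, float]:
    """Length (metres) and compass bearing of the shadow of a vertical object."""
    if elevation_deg <= 0:
        raise ValueError("sun is below the horizon")
    length = height_m / math.tan(math.radians(elevation_deg))
    return length, (azimuth_deg + 180) % 360
```

In use, the analyst estimates the bearing and relative length of a shadow in the image, computes the expected values for the claimed location and timestamp, and treats a large disagreement as evidence that one of the two claims is wrong.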

Layer 4: Satellite and aerial imagery comparison

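Once a candidate location is established from visual evidence, satellite imagery verifies it by matching distinctive features. Free Sentinel-2 imagery (10 m resolution) is enough to confirm large structures and road layouts; matching smaller features to within a few metres requires higher-resolution imagery such as Google Earth or a commercial provider.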

# Sentinel Hub EO Browser: free ESA satellite imagery (10m resolution Sentinel-2)
# https://apps.sentinel-hub.com/eo-browser/

# Google Earth historical imagery timeline
# Allows comparison at any date where imagery is available

# Planet Labs: commercial daily satellite imagery (requires subscription)
# https://www.planet.com/

# SkyFi: on-demand satellite tasking for specific locations
# https://skyfi.com/

Rooftop HVAC unit arrangements, parking lot striping patterns, shadow geometries of fixed structures (buildings, towers), and the configuration of road intersections are all stable enough to use as ground truth for positional verification.

Layer 5: Street-level verification

Ground-level imagery services confirm the position established by satellite analysis:

Google Street View: Widest global coverage; historical imagery available via the timeline slider
Yandex Panorama: Superior coverage in Russia and former Soviet states. Separately, Yandex Image Search is notably stronger at matching faces than Western reverse image search engines
Mapillary: Community-contributed, fills gaps in commercial coverage in rural and restricted areas
Apple Look Around: Higher-resolution imagery in covered areas (North America, Western Europe, Japan, Australia)
Kakao Road View: South Korea-specific, superior coverage within country

Reverse image search for attribution

A photograph on a corporate website that a reverse image search also finds on a freelancer’s portfolio or in a conference slide deck extends the intelligence chain. The original source may contain metadata absent from the republished version. Yandex Image Search, TinEye, and Bing Visual Search index different portions of the web and often surface different results for the same query image.

Part 6 Addendum: IoT, Industrial Systems, and Mobile Applications

6.3 Industrial Control Systems and Internet-Exposed Operational Technology

Industrial Control Systems (ICS) including SCADA (Supervisory Control and Data Acquisition), PLCs (Programmable Logic Controllers), DCS (Distributed Control Systems), and HMIs (Human-Machine Interfaces) were designed for reliability in air-gapped environments with no internet connectivity assumption. The convergence of IT and OT networks, accelerated by remote access requirements introduced from 2020 onward, has placed substantial quantities of operational technology directly on the public internet.

Shodan’s coverage of ICS protocols is extensive and documented. Most industrial protocols listen on a well-known default port that operators rarely change, making targeted discovery straightforward.

Protocol-specific Shodan queries:

# Modbus TCP (port 502): PLCs and RTUs in manufacturing, water treatment, oil and gas
port:502

# Siemens S7 protocol (port 102): Siemens S7-300, S7-400, S7-1200, S7-1500 PLCs
port:102

# BACnet (UDP port 47808): building automation systems (HVAC, lighting, access control, fire)
port:47808

# DNP3 (ports 20000, 19999): electric utility SCADA, water treatment SCADA
port:20000

# EtherNet/IP (port 44818): Allen-Bradley PLCs, Rockwell Automation
port:44818

# Omron FINS (port 9600): Omron CS/CJ series PLCs
port:9600

# Niagara Fox (ports 1911, 4911): Tridium Niagara building automation controllers
port:1911,4911

# GE SRTP (port 18245): GE Series 90 PLCs
port:18245

# Automatic tank gauges (ATGs) at fuel stations
"in-tank inventory" port:10001

# Combine with geographic or organisation filters
port:502 country:DE org:"Example Energy Corp"

The data returned from these queries includes vendor names, firmware versions, device model numbers, and sometimes facility identifiers or geographic references embedded in the device’s banner or web interface. A Modbus device at a known geographic IP returning tank level readings in real time is not a theoretical risk; it has been documented repeatedly in security research.
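A defender can turn the same query list on their own address space. The sketch below only builds query strings; it makes no network calls and does not use any Shodan client library. The port table is copied from the queries above, net: is Shodan's CIDR filter, and the function name build_exposure_queries and the example netblock (a documentation range) are hypothetical.

```python
# Default ports taken from the protocol query list above
ICS_DEFAULT_PORTS = {
    "Modbus TCP": 502,
    "Siemens S7": 102,
    "BACnet": 47808,
    "DNP3": 20000,
    "EtherNet/IP": 44818,
    "Omron FINS": 9600,
    "GE SRTP": 18245,
}


def build_exposure_queries(netblocks, ports=ICS_DEFAULT_PORTS):
    """Return {(protocol, netblock): query} restricted to the given CIDR ranges."""
    queries = {}
    for protocol, port in ports.items():
        for block in netblocks:
            # One query per protocol per range keeps results attributable
            queries[(protocol, block)] = f"port:{port} net:{block}"
    return queries


if __name__ == "__main__":
    for key, query in build_exposure_queries(["203.0.113.0/24"]).items():
        print(key[0], "->", query)
```

Running the generated queries on a schedule and alerting on any non-empty result gives an organisation the same view of its exposed operational technology that an outside investigator has.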

Building Automation Systems

Building Management Systems (BMS) control HVAC, lighting, access control, fire suppression, and elevators. Modern BMS typically run web-based management interfaces and are routinely connected to corporate networks for facilities management. Shodan’s BMS coverage is significant.

# Tridium Niagara (most widely deployed BMS platform globally)
port:4911 product:Niagara

# Johnson Controls Metasys
http.title:"Metasys" port:443

# Siemens Desigo CC
http.title:"Desigo" port:443

# Generic building automation web interfaces
# (Shodan has no OR operator; run each title as a separate query)
http.title:"Building Automation"
http.title:"Building Management"
http.title:"BMS"

# Schneider Electric EcoStruxure
http.title:"EcoStruxure"

The intelligence value of BMS access extends beyond the obvious physical implications. BMS floor plans and zone configuration databases contain office layout data: the number of floors occupied, department-to-zone mappings (sometimes including room labels like “Executive Conference Suite” or “SOC Operations”), energy consumption patterns revealing after-hours occupancy schedules, and the number of physical access control points.

IP cameras

Exposed IP cameras are one of the largest device categories in Shodan. Beyond the obvious surveillance implications, camera management interfaces expose: device model and firmware version, the camera’s RTSP stream URL often present in page source, geographic labels entered during installation, and sometimes the camera’s position within the building access control system.

# Hikvision (largest camera manufacturer globally)
product:"Hikvision" port:80,443,8080

# Axis cameras
product:"Axis" http.title:"AXIS"

# Dahua cameras
product:"Dahua" http.title:"WEB SERVICE"

# Exposed RTSP streams (Shodan has screenshots for many)
port:554 has_screenshot:true

# Default-credential indicators: free-text match on banner content
# (the tag: filter is restricted to higher-tier Shodan plans)
"default password"

6.4 Mobile Application Intelligence

Mobile applications are a systematically underexplored OSINT surface. An organisation’s iOS and Android applications contain significant intelligence in both their store metadata and their compiled binary content, much of which is accessible without special tools or privileged access.

App store metadata as primary intelligence

Before any binary analysis, the application store listings reveal:

Developer account name: Often the full legal entity name, which may differ from the trading name and may appear in no other public source
Privacy policy URL: Links to a specific endpoint, sometimes an internal or recently provisioned subdomain not otherwise visible
App permissions: The declared Android permissions or iOS entitlements reveal device capabilities the app requests (camera, microphone, contacts, location, NFC, Bluetooth)
Related apps from the same developer: Maps the full product portfolio, including less-publicised internal tools and white-labelled products
Version history with release notes: Reveals development cadence, recent feature changes, and deprecated capabilities

APK decompilation and static analysis

Android APK files are ZIP archives containing compiled Dalvik bytecode (classes.dex), the application manifest (AndroidManifest.xml, stored in a binary XML encoding), and all resources. JADX decompiles DEX bytecode back to human-readable Java, including for apps originally written in Kotlin. MobSF (Mobile Security Framework) automates the full static analysis pipeline.

# Decompile APK to Java source code
jadx -d output-dir/ example-corp.apk

# Extract the manifest separately (readable XML)
aapt dump xmltree example-corp.apk AndroidManifest.xml

# Automated full static analysis with MobSF
docker run -it --rm -p 8000:8000 opensecurity/mobile-security-framework-mobsf
# Upload APK at http://localhost:8000; output includes permissions,
# hardcoded secrets, network calls, exported components, and API surface

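# Strings extraction without full decompilation
# (APK entries are compressed, so unpack the archive before running strings)
unzip -o -q example-corp.apk -d apk-unpacked/
strings apk-unpacked/classes*.dex | grep -oE "https?://[a-zA-Z0-9./_-]+"
strings apk-unpacked/classes*.dex | grep -iE "(api|internal|staging|dev|prod)\."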

AndroidManifest.xml is the highest-value file in the APK for reconnaissance. It declares:

Every Activity (screens), Service (background processes), Content Provider (data sharing endpoints), and Broadcast Receiver (event handlers) the app registers
Which components are exported (exported="true") and accessible to other applications
All permissions the app requests, including custom permissions defined by related apps
Deep link URL schemes that may expose internal routing logic
Backup behaviour and file provider configuration
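The manifest items listed above can be pulled out programmatically once the manifest has been decoded to text XML (JADX writes a decoded copy into its resources output). The following is a minimal sketch using only the Python standard library; the function name summarise_manifest and the returned keys are hypothetical. It treats a component with an intent filter and no explicit exported attribute as implicitly exported, which was the platform default for apps targeting versions before Android 12.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
COMPONENT_TAGS = ("activity", "activity-alias", "service", "provider", "receiver")


def summarise_manifest(xml_text: str) -> dict:
    """Summarise permissions, exported components, deep links, and backup flag."""
    root = ET.fromstring(xml_text)
    permissions = [e.get(ANDROID_NS + "name") for e in root.iter("uses-permission")]
    exported, implicit, schemes = [], [], set()
    app = root.find("application")
    for comp in (app if app is not None else []):
        if comp.tag not in COMPONENT_TAGS:
            continue
        name = comp.get(ANDROID_NS + "name")
        flag = comp.get(ANDROID_NS + "exported")
        if flag == "true":
            exported.append((comp.tag, name))
        elif flag is None and comp.find("intent-filter") is not None:
            # No explicit attribute: exported by default on older target SDKs
            implicit.append((comp.tag, name))
        for data in comp.iter("data"):
            scheme = data.get(ANDROID_NS + "scheme")
            if scheme and scheme not in ("http", "https"):
                schemes.add(scheme)  # custom deep link scheme
    return {
        "permissions": permissions,
        "exported": exported,
        "implicitly_exported": implicit,
        "deep_link_schemes": sorted(schemes),
        "allow_backup": app.get(ANDROID_NS + "allowBackup") if app is not None else None,
    }
```

The exported and implicitly exported lists are usually the first thing to review, since each entry is a component that other applications on the device can invoke.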

Hardcoded secrets in decompiled code

Developers embed secrets in mobile application code more frequently than in server-side code because they believe compiled binaries are harder to analyse. After JADX decompilation, the source tree is searchable with standard tools:

# Common secret patterns in decompiled Java (-E enables | alternation)
grep -rE "api_key|apiKey|API_KEY|api-key" output-dir/sources/ --include="*.java"
grep -rE "password|passwd|secret|credential" output-dir/sources/ --include="*.java"
grep -rE "token|access_token|bearer" output-dir/sources/ --include="*.java"

# AWS credentials (AKIA prefix for access keys)
grep -rE "AKIA[0-9A-Z]{16}" output-dir/

# Firebase configuration (often hardcoded)
grep -rE "google-services|firebase|firebaseio" output-dir/

# Internal API endpoints (dots escaped so they match literally)
grep -rE "https://internal|https://api\." output-dir/sources/
grep -rE "https://[a-z0-9.-]+\.(internal|corp|local|intranet)" output-dir/sources/

# Endpoints in resource files (often overlooked)
grep -r "https://" output-dir/resources/ | grep -v "schemas.android.com"

Natasha decompiled the target’s Android application during an engagement and found three hardcoded API keys in a utility class named Constants.java. One was a Google Maps API key with no application or referrer restriction, usable by anyone. The second was a Stripe publishable key stored under a test-environment constant name but carrying the live-mode prefix: production configuration had shipped where test configuration was intended (a misconfiguration the development team was unaware of). The third was a backend API key for an internal analytics endpoint at metrics-internal.example.com that had never appeared in any external scan, certificate transparency log, or passive DNS dataset. That endpoint was the most sensitive finding of the entire engagement.
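The individual grep searches for hardcoded secrets can be consolidated into a single pass that records the file and line of each hit. This is a minimal sketch under stated assumptions: the pattern set is illustrative rather than exhaustive, the names PATTERNS and scan_tree are hypothetical, and dedicated secret scanners apply far larger rule sets plus entropy checks.

```python
import re
from pathlib import Path

PATTERNS = {
    # AWS access key IDs: AKIA followed by 16 upper-case letters or digits
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Assignments such as API_KEY = "..." or token: '...'
    "generic_api_key": re.compile(
        r"""(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*["']([^"']{8,})["']"""
    ),
    # Hostnames under internal-only suffixes
    "internal_endpoint": re.compile(
        r"https://[a-z0-9.-]+\.(?:internal|corp|local|intranet)\b", re.IGNORECASE
    ),
    "firebase_db": re.compile(r"https://[a-z0-9-]+\.firebaseio\.com", re.IGNORECASE),
}


def scan_tree(root, suffixes=(".java", ".kt", ".xml", ".json")):
    """Return (relative_path, line_number, label, match) for every pattern hit."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            for label, pattern in PATTERNS.items():
                for match in pattern.finditer(line):
                    findings.append(
                        (str(path.relative_to(root)), lineno, label, match.group(0))
                    )
    return findings
```

Pointing scan_tree at the JADX output directory produces a list that can be triaged by label; every hit still needs manual validation, because test fixtures and documentation strings match the same patterns as live credentials.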

iOS IPA analysis

iOS applications distributed through the App Store have FairPlay-encrypted binaries that cannot be meaningfully decompiled until they are decrypted, which in practice requires a jailbroken device. Development builds and enterprise-distributed applications, however, are often unencrypted. The Info.plist file within the IPA is never encrypted; it may be stored as XML or as a binary property list, and plutil renders either form readable.

# Unzip IPA
unzip example-corp.ipa -d extracted/

# Read Info.plist
plutil -p extracted/Payload/ExampleCorp.app/Info.plist
# Contains: bundle ID, version, minimum iOS version, URL schemes,
# required device capabilities, App Transport Security exceptions
# (ATS exceptions are particularly informative: they list domains
# the app is permitted to connect to without HTTPS)

# Check ATS exceptions (reveals the domains exempted from HTTPS enforcement, not every domain the app contacts)
plutil -p extracted/Payload/ExampleCorp.app/Info.plist | grep -A3 "NSExceptionDomains"

# Extract strings from the binary (encrypted binaries still leak some strings from unencrypted sections)
strings extracted/Payload/ExampleCorp.app/ExampleCorp | grep -E "https?://"
strings extracted/Payload/ExampleCorp.app/ExampleCorp | grep -iE "(api|internal|staging)"
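
The plutil command is macOS-only. As a portable alternative, the same Info.plist fields can be read with Python's standard library, since plistlib parses both XML and binary property lists. The sketch below is illustrative: the function names are hypothetical, while the dictionary keys (CFBundleIdentifier, CFBundleURLTypes, NSAppTransportSecurity, NSExceptionDomains, NSAllowsArbitraryLoads) are Apple's documented Info.plist keys.

```python
import plistlib
import re
import zipfile


def read_info_plist(ipa) -> dict:
    """Load Info.plist from an IPA given as a path or a binary file object."""
    with zipfile.ZipFile(ipa) as archive:
        for name in archive.namelist():
            # Only the top-level app bundle, not embedded frameworks or extensions
            if re.fullmatch(r"Payload/[^/]+\.app/Info\.plist", name):
                return plistlib.loads(archive.read(name))
    raise FileNotFoundError("no Payload/*.app/Info.plist in archive")


def summarise_plist(info: dict) -> dict:
    """Extract the reconnaissance-relevant fields from a parsed Info.plist."""
    ats = info.get("NSAppTransportSecurity", {})
    schemes = [
        scheme
        for url_type in info.get("CFBundleURLTypes", [])
        for scheme in url_type.get("CFBundleURLSchemes", [])
    ]
    return {
        "bundle_id": info.get("CFBundleIdentifier"),
        "version": info.get("CFBundleShortVersionString"),
        "min_os": info.get("MinimumOSVersion"),
        "url_schemes": schemes,
        "ats_exception_domains": sorted(ats.get("NSExceptionDomains", {})),
        "ats_allows_arbitrary_loads": bool(ats.get("NSAllowsArbitraryLoads", False)),
    }
```

A non-empty ats_exception_domains list names hosts the app is allowed to reach with weakened transport security, which frequently includes legacy or internal services.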

Part 7 Addendum: MITRE ATT&CK Reconnaissance Mapping

7.5 Mapping OSINT to the ATT&CK Framework

MITRE ATT&CK’s Reconnaissance tactic (TA0043) provides a structured taxonomy of the techniques described in this article. Security teams can use this mapping to audit defensive coverage, identify detection gaps, and prioritise investments in monitoring and countermeasures.

The Reconnaissance tactic contains ten technique families:

Technique	ID	Sub-techniques	Covered in
Gather Victim Identity Information	T1589	Email addresses (T1589.002), employee names (T1589.003), credentials (T1589.001)	Parts 4.2, 4.5
Gather Victim Network Information	T1590	DNS (T1590.002), IP addresses (T1590.005), network topology (T1590.004), network security appliances (T1590.006)	Parts 1.1–1.6
Gather Victim Org Information	T1591	Determine physical locations (T1591.001), business relationships (T1591.002), business tempo (T1591.003), identify roles (T1591.004)	Parts 4.1, 4.3, 4.4
Gather Victim Host Information	T1592	Hardware (T1592.001), software (T1592.002), firmware (T1592.003), client configurations (T1592.004)	Parts 2.3, 6.3, 6.4
Search Open Websites/Domains	T1593	Social media (T1593.001), search engines (T1593.002), code repositories (T1593.003)	Parts 2.1, 3.1, 4.1
Search Victim-Owned Websites	T1594	(no sub-techniques)	Parts 2.2, 2.3
Active Scanning	T1595	Scanning IP blocks (T1595.001), vulnerability scanning (T1595.002), wordlist scanning (T1595.003)	Parts 1.5, 6.1
Search Open Technical Databases	T1596	DNS/passive DNS (T1596.001), WHOIS (T1596.002), certificate logs (T1596.003), CDNs (T1596.004), scan databases (T1596.005)	Parts 1.1–1.5
Search Closed Sources	T1597	Threat intelligence vendors (T1597.001), purchase technical data (T1597.002)	Part 4.5
Phishing for Information	T1598	(active, out of scope for passive OSINT)	n/a

Detection opportunities by technique

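T1595.001 (Scanning IP blocks) is the most detectable reconnaissance technique. The source ranges of the large internet-wide scanners are comparatively well known: some services publish them, and community-maintained lists track others, including Shodan’s crawlers, so the traffic can be blocked or logged. A sweep of ports across an organisation’s entire IP range from a small set of sources is observable through NetFlow analysis.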
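The NetFlow observation can be made concrete with a few lines of Python. The sketch below assumes flow records have already been exported and reduced to (source IP, destination IP, destination port) tuples; the function name detect_sweeps, the threshold of 50 hosts, and the documentation-range addresses in the usage note are all hypothetical choices, and a production detector would also window by time.

```python
import ipaddress
from collections import defaultdict


def detect_sweeps(flows, org_network: str, min_hosts: int = 50):
    """Flag external sources that touched at least `min_hosts` distinct internal hosts.

    flows: iterable of (src_ip, dst_ip, dst_port) tuples.
    Returns a sorted list of (src_ip, distinct_hosts, distinct_ports).
    """
    network = ipaddress.ip_network(org_network)
    hosts_by_src = defaultdict(set)
    ports_by_src = defaultdict(set)
    for src, dst, dport in flows:
        # Only count inbound flows: external source, internal destination
        if ipaddress.ip_address(dst) in network and ipaddress.ip_address(src) not in network:
            hosts_by_src[src].add(dst)
            ports_by_src[src].add(dport)
    return sorted(
        (src, len(hosts), len(ports_by_src[src]))
        for src, hosts in hosts_by_src.items()
        if len(hosts) >= min_hosts
    )
```

A source that contacts 60 addresses in 203.0.113.0/24 on port 502 is returned as a sweep, while a client that repeatedly contacts a single web server is not. Cross-referencing the flagged sources against a scanner list then separates commodity internet-wide scanning from targeted activity.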

T1596.002 (WHOIS) and T1596.003 (Certificate logs) generate no observable signals at the target. CT logs are public; WHOIS queries return to the requester without the registrant’s knowledge.

T1590.002 (DNS) is partially detectable: direct DNS queries against the organisation’s authoritative nameservers appear in DNS server query logs. Passive DNS collection (SecurityTrails, DNSDB) generates no observable signal.

T1589.003 (Employee names via LinkedIn) is detectable only through LinkedIn’s own analytics, which report profile views to profile owners in some configurations. An investigator whose sock puppet account views multiple profiles from the same organisation in a single session may trigger LinkedIn’s detection systems.
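For a coverage audit, the detectability statements above can be encoded as data and compared against the log sources a team actually collects. This is a minimal sketch: the class name, the visibility labels, and the log source identifiers are hypothetical, and the entries simply restate the claims made in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconSignal:
    technique_id: str
    name: str
    visibility: str   # "observable", "partial", or "none"
    log_source: str   # where a signal would appear, if anywhere


SIGNALS = [
    ReconSignal("T1595.001", "Scanning IP blocks", "observable", "netflow"),
    ReconSignal("T1590.002", "DNS", "partial", "authoritative-dns-logs"),
    ReconSignal("T1589.003", "Employee names", "partial", "linkedin-profile-views"),
    ReconSignal("T1596.002", "WHOIS", "none", ""),
    ReconSignal("T1596.003", "Certificate logs", "none", ""),
]


def audit(signals, collected_sources):
    """Split techniques into covered, gap (signal exists but is not collected),
    and undetectable (no signal ever reaches the target)."""
    report = {"covered": [], "gap": [], "undetectable": []}
    for signal in signals:
        if signal.visibility == "none":
            report["undetectable"].append(signal.technique_id)
        elif signal.log_source in collected_sources:
            report["covered"].append(signal.technique_id)
        else:
            report["gap"].append(signal.technique_id)
    return report
```

The gap list is the actionable output: those are techniques that leave a trace the organisation could see but currently does not collect. The undetectable list can only be addressed by reducing what is published in the first place.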

MITRE D3FEND countermeasures

MITRE D3FEND is the defensive complement to ATT&CK, mapping offensive technique IDs to specific defensive countermeasures with implementation guidance. Relevant countermeasures for the Reconnaissance tactic include:

ATT&CK Technique	D3FEND Countermeasure	Implementation
T1595 (Active Scanning)	Network Traffic Filtering (D3-NTF)	Block known scanner IP ranges at perimeter
T1596.002 (WHOIS)	Data Exchange Mapping (D3-DEM)	Monitor RDAP query volumes for your domains
T1590.002 (DNS)	DNS Allowlisting (D3-DNSAL)	Restrict recursive queries; log authoritative query sources
T1593.003 (Code repositories)	Credential Transmission Scoping (D3-CTS)	Enforce pre-commit secret scanning
T1589.002 (Email addresses)	Data Synthesis Analysis (D3-DSA)	Monitor SMTP VRFY/RCPT TO probing attempts

Part 8 Extended: Advanced OPSEC for the Investigator

8.1 Infrastructure Separation and Identity Isolation

The practitioner’s threat model is unusual: the adversary is not a criminal syndicate but a sophisticated corporate security team with access to commercial threat intelligence subscriptions (Recorded Future, Mandiant Advantage, CrowdStrike Falcon Intelligence), CDN access logs, email tracking pixels via marketing platforms, and LinkedIn’s professional graph data. Each of these tools surfaces different categories of investigator error.

Hardware and platform separation

The cleanest separation uses dedicated physical hardware purchased without identity linkage, running a fresh operating system installation with no accounts or credentials shared with the investigator’s professional or personal identity. This level of isolation is warranted for investigations targeting organisations with a mature threat intelligence programme that correlates browser fingerprints, TLS session characteristics, and timing patterns.

For most investigations, separate virtual machines in a dedicated hypervisor (VirtualBox, VMware Workstation, QEMU with KVM) with snapshots restored between investigations provide adequate separation. The critical discipline: never use a browser profile, email account, or API key that has ever been associated with a real identity on the same VM instance used for investigation.

Network egress and anonymity layers

The investigator’s true IP address must never source any active query. Options in increasing order of protection:

Commercial VPN alone (insufficient): VPN egress IPs are catalogued by most commercial threat intelligence services. A query originating from a known VPN exit node is flagged as VPN traffic; the VPN provider’s identity is known and subpoenable.
Residential proxy networks: Traffic appears to originate from residential consumer ISP addresses, not data centre ranges, and is far less likely to be flagged by standard threat intelligence tools. Services like Bright Data, Oxylabs, and Smartproxy operate residential proxy networks; their use for investigation is legally complex and should be verified against applicable law.
Tor Browser: Provides strong anonymity for passive data collection. Tor exit nodes are publicly listed (check.torproject.org) and blocked by some Cloudflare-protected targets. Tor is slow and unsuitable for bandwidth-intensive collection.
Tails OS: A live operating system booted from USB, routing all system traffic through Tor, with no write capability to the host disk. Tails leaves no artefacts on the machine after shutdown. It is purpose-built for high-risk investigations where disk forensics of the investigator’s machine is a threat. Available at tails.net (the former tails.boum.org address redirects there).
Whonix: A two-VM architecture (Whonix-Gateway + Whonix-Workstation) where the workstation VM has no direct internet access; all traffic routes through the gateway VM running Tor. More practical than Tails for multi-session investigations. Available at whonix.org.

Browser fingerprinting and WebRTC leak prevention

A standard browser builds a fingerprint from JavaScript APIs that is unique enough to identify the browser session across VPN reconnections and IP changes. Canvas API fingerprinting reads the output of hardware-accelerated graphics rendering; WebGL fingerprinting does the same for 3D rendering; AudioContext fingerprinting analyses audio processing characteristics; font enumeration lists installed fonts via CSS fallback rendering. Combined with screen resolution, timezone, and HTTP headers, these produce a fingerprint with entropy high enough to identify an individual browser installation.
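The phrase "entropy high enough" can be made quantitative. An attribute value shared by a fraction p of browsers contributes -log2(p) bits of identifying information, and if attributes were independent the bits would add. The sketch below works through that arithmetic; the function names are hypothetical, any share values supplied to it are the caller's estimates, and because real attributes are correlated the independent sum is an upper bound rather than a measurement.

```python
import math


def surprisal_bits(share: float) -> float:
    """Identifying information, in bits, of a value seen in `share` of browsers."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return -math.log2(share)


def combined_bits(shares: dict) -> float:
    """Upper bound on total bits, assuming the attributes are independent."""
    return sum(surprisal_bits(p) for p in shares.values())


def expected_matches(population: int, bits: float) -> float:
    """Expected number of browsers in `population` sharing the full fingerprint."""
    return population / 2 ** bits
```

As a worked example with purely illustrative figures: a canvas hash shared by 1 in 1,024 browsers (10 bits), a font list shared by 1 in 256 (8 bits), and a timezone shared by 1 in 16 (4 bits) give 22 bits, so among roughly four million browsers (2^22) only about one would be expected to match all three.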

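WebRTC is a particularly severe leak. In Chromium-based browsers and Firefox, ICE candidate gathering can expose addresses that bypass the VPN tunnel: the machine’s local interface addresses and, when a STUN server is reached outside the tunnel, its true public address. Recent browser versions typically replace local addresses with randomised mDNS hostnames unless the page holds camera or microphone permission, but older or reconfigured browsers still return raw addresses. This is not a bug; it is WebRTC working as designed. Any web page running JavaScript can trigger it.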

// Any page can run this to enumerate ICE candidates when WebRTC is enabled.
// Host candidates carry local interface addresses; recent browsers usually
// substitute a random mDNS name (*.local) unless media permission is granted.
const pc = new RTCPeerConnection({ iceServers: [] });
// Register the handler before candidate gathering starts
pc.onicecandidate = (e) => {
  if (e.candidate && e.candidate.candidate) {
    console.log("ICE candidate:", e.candidate.candidate);
  }
};
pc.createDataChannel("");
pc.createOffer().then((offer) => pc.setLocalDescription(offer));

Mitigation: disable WebRTC entirely (media.peerconnection.enabled = false in Firefox’s about:config), or use a browser that ships with fingerprint resistance and WebRTC leak protection built in, such as Mullvad Browser (Firefox-based).
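To verify that a mitigation is working, the candidate strings a browser emits can be collected from a test page and classified. The sketch below parses the standard ICE candidate attribute layout (foundation, component, transport, priority, address, port, then "typ" and the candidate type); the function name classify_candidate and the exposure labels are hypothetical.

```python
def classify_candidate(candidate: str) -> dict:
    """Classify what an ICE candidate line reveals about the client."""
    fields = candidate.split()
    if len(fields) < 8 or not fields[0].startswith("candidate:") or fields[6] != "typ":
        raise ValueError("not a recognisable ICE candidate line")
    address, cand_type = fields[4], fields[7]
    if address.endswith(".local"):
        exposure = "obfuscated"   # mDNS hostname instead of a raw address
    elif cand_type == "host":
        exposure = "local-ip"     # raw address of a local interface
    elif cand_type in ("srflx", "prflx"):
        exposure = "public-ip"    # address as seen by a STUN server or peer
    elif cand_type == "relay":
        exposure = "relay-only"   # address of a TURN relay, not the client
    else:
        exposure = "unknown"
    return {"address": address, "type": cand_type, "exposure": exposure}
```

In an investigation browser that is configured correctly, a test page should produce either no candidates at all or only obfuscated ones; any local-ip or public-ip result that does not belong to the VPN or proxy egress indicates a leak.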

API query attribution

Every authenticated query to a commercial OSINT platform is logged against the API key used. API keys are associated with verified identities (email, payment method). A target organisation that has a relationship with a commercial intelligence platform might, under some vendor agreements, be notified that queries for their domain originated from a specific API key. Mitigation: register investigation API keys to investigation-specific identities; rotate keys between major engagements; use the platform’s unauthenticated web interface for preliminary queries where possible.

SANS Institute’s SEC487 course specifically addresses this operational risk in its OPSEC module, noting that “the investigator must treat their own query history as a potential intelligence source for the target.”

Sock puppet account construction and lifecycle

A LinkedIn account used for investigation must be plausible to human review: a professional photograph (AI-generated face from thispersondoesnotexist.com or similar; Stable Diffusion generates faces without the legal complications of photographing real people), a believable career history with verifiable-sounding but unverifiable employers, and enough connection activity to avoid looking freshly minted.

Critical operational rules:

Never access the sock puppet account from a device or IP address that has ever authenticated to any real identity account for the same platform
Never use an email domain associated with the investigation firm for the sock puppet’s registration address
Maintain separate browser profiles per sock puppet account; never mix sessions
Rotate sock puppet accounts after a significant investigation to avoid pattern matching across engagements
Expect LinkedIn to flag accounts with no mutual connections that view many profiles from the same company within a single session

Conclusion

OSINT is effective because it is asymmetric. An investigator who spends a day systematically working through the surfaces described in this article can reconstruct a detailed model of an organisation’s infrastructure, technology choices, personnel, and operational patterns from information that was publicly available before the investigation began. The organisation spent none of that day thinking about what it was revealing.

The defensive response is not classification. It is consciousness: understanding what the surfaces are, what they reveal when read together, and applying systematic discipline to reduce the delta between what the organisation intends to disclose and what is actually visible.

The techniques are not secret. The information is not hidden. What is missing, in most organisations, is the practice of looking at themselves the way a skilled investigator would.

References

NATO Supreme Allied Command Atlantic. Open Source Intelligence Handbook, v1.2. January 2002. archive.org/stream/NATOOSINTHandbookV1.2
Laurie, B.; Langley, A.; Kasper, E. Certificate Transparency. RFC 6962. IETF, June 2013. rfc-editor.org/rfc/rfc6962
Laurie, B. et al. Certificate Transparency Version 2.0. RFC 9162. IETF, December 2021. rfc-editor.org/rfc/rfc9162
University of Arizona. CYBV 354: Principles of Open Source Intelligence. College of Information Science. azcast.arizona.edu
OWASP Foundation. Web Security Testing Guide v4.2: Test for Subdomain Takeover. owasp.org/www-project-web-security-testing-guide/v42
OWASP Foundation. Amass: In-Depth Attack Surface Mapping and Asset Discovery. github.com/owasp-amass/amass
Caulfield, J. Privacy Implications of EXIF Data. EDUCAUSE Review, June 2021. er.educause.edu/articles/2021/6/privacy-implications-of-exif-data
Are Your Documents Leaking Sensitive Information? Scrub Your Metadata! EDUCAUSE Review, January 2017. er.educause.edu/blogs/2017/1/are-your-documents-leaking-sensitive-information-scrub-your-metadata
Qualys Research Team. Hidden Risks of Amazon S3 Misconfigurations. December 2023. blog.qualys.com/vulnerabilities-threat-research/2023/12/18/hidden-risks-of-amazon-s3-misconfigurations
Snyk. State of Secrets: 28 Million Credentials in Public Repositories. 2025. snyk.io/articles/state-of-secrets
CISA. DNS Zone Transfer AXFR Requests May Leak Domain Information. Alert AA15-103A, April 2015. cisa.gov/news-events/alerts/2015/04/13/dns-zone-transfer-axfr-requests-may-leak-domain-information
Acunetix. What Are DNS Zone Transfers (AXFR)? acunetix.com/blog/articles/dns-zone-transfers-axfr
ISACA. What to Know About EXIF Data: A More Subtle Cybersecurity Risk. 2025. isaca.org/resources/news-and-trends/industry-news/2025/what-to-know-about-exif-data-a-more-subtle-cybersecurity-risk
SOS Intelligence. OPSEC in OSINT: Protecting Yourself While Investigating. sosintel.co.uk/opsec-in-osint-protecting-yourself-while-investigating
ICANN Governmental Advisory Committee. WHOIS and Data Protection. gac.icann.org/activity/whois-and-data-protection
SANS Institute. SEC487: Open-Source Intelligence (OSINT) Gathering and Analysis. sans.org/cyber-security-courses/open-source-intelligence-gathering
PortSwigger Research. Information Disclosure Vulnerabilities. Web Security Academy. portswigger.net/web-security/information-disclosure
MITRE ATT&CK. Reconnaissance Tactic TA0043. MITRE Corporation. attack.mitre.org/tactics/TA0043
Bellingcat. Online Investigation Toolkit. bellingcat.gitbook.io/toolkit
Hunt, T. Have I Been Pwned: Check if Your Email Address Has Been Exposed in a Data Breach. haveibeenpwned.com
Rapid7 Labs. 2024 Attack Intelligence Report. May 2024. rapid7.com/research/report/2024-attack-intelligence-report
Birsan, A. Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies. February 2021. medium.com/@alex.birsan/dependency-confusion-4a5d60fec610
Bazzell, M. OSINT Techniques: Resources for Uncovering Online Information, 10th ed. IntelTechniques, 2023. inteltechniques.com/book1.html
MITRE Corporation. D3FEND: A Knowledge Graph of Cybersecurity Countermeasures. d3fend.mitre.org
Cloudflare Engineering. BGP Hijack Detection. Cloudflare Blog. blog.cloudflare.com/bgp-hijack-detection