// OSINT BASICS — THE WIRE FRAMEWORK
All data is signal. Noise is what you haven't learned to read yet.
This guide uses .mdx so tutorials render with code, images, and searchable headings
while staying in plain-text source control. Deploy new guides by adding files to
src/app/(main)/tutorials/.
Step 01 — Anchor your question
Before touching any tool, write down:
- The claim you are verifying (one sentence, falsifiable)
- The minimum evidence that would satisfy the claim
- The target identifier — domain, email, username, IP, or person name
- Legal / ethical scope — you must have authorization for active reconnaissance
CLAIM: "company X owns domain Y"
EVIDENCE: WHOIS registrant match + TLS cert + DNS SOA record
TARGET: example.com
SCOPE: Public passive intel only (no active scanning)
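The four-line brief above can also be kept as a small record so that an empty field is caught before any tool runs. This is a minimal sketch; the class name `InvestigationBrief` and its `missing_fields` helper are hypothetical and not part of the framework.

```python
from dataclasses import dataclass


# Hypothetical helper: the class and method names are illustrative only.
@dataclass(frozen=True)
class InvestigationBrief:
    claim: str      # one falsifiable sentence
    evidence: str   # minimum evidence that would satisfy the claim
    target: str     # domain, email, username, IP, or person name
    scope: str      # legal / ethical boundary for the work

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty or whitespace-only."""
        return [name for name, value in vars(self).items() if not value.strip()]


brief = InvestigationBrief(
    claim="company X owns domain Y",
    evidence="WHOIS registrant match + TLS cert + DNS SOA record",
    target="example.com",
    scope="Public passive intel only (no active scanning)",
)
print(brief.missing_fields())
```

An empty list means every part of the brief has been written down; anything else names what still needs an answer.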
Step 02 — Choose your trail (Intent-based routing)
The directory groups tools by intent. Pick the right lane:
| Intent | Use when | Tools |
|--------|----------|-------|
| Verification | Confirming ownership, history, or claims | WHOIS, crt.sh, Wayback Machine |
| Subdomain Recon | Mapping external attack surface | theHarvester, Amass, crt.sh |
| Identity | Email → account footprint | holehe, HaveIBeenPwned, Sherlock |
| Automation | Multi-vector sweep, correlation | SpiderFoot, OSINT Framework |
| Research | Literature, filings, scientific data | Scholar, EDGAR, PubMed |
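The routing table can be read as a plain lookup from intent to tool list. The sketch below copies the tool names from the table; the `tools_for` function is a hypothetical name, and the directory itself may store this mapping differently.

```python
# Tool lists copied from the routing table; the lookup function is hypothetical.
TOOLS_BY_INTENT = {
    "verification": ["WHOIS", "crt.sh", "Wayback Machine"],
    "subdomain recon": ["theHarvester", "Amass", "crt.sh"],
    "identity": ["holehe", "HaveIBeenPwned", "Sherlock"],
    "automation": ["SpiderFoot", "OSINT Framework"],
    "research": ["Scholar", "EDGAR", "PubMed"],
}


def tools_for(intent: str) -> list[str]:
    """Return the tools listed for an intent, ignoring case and outer spaces."""
    key = intent.strip().lower()
    if key not in TOOLS_BY_INTENT:
        raise ValueError(f"unknown intent: {intent!r}")
    return TOOLS_BY_INTENT[key]


print(tools_for("Identity"))
```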
Step 03 — Run theHarvester (passive recon)
Install in a virtualenv and run against a target domain:
# Clone and install
git clone https://github.com/laramies/theHarvester
cd theHarvester
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements/base.txt
# Run passive sweep (no active DNS queries).
# Available -b sources vary by version; check `python3 theHarvester.py -h` first.
python3 theHarvester.py -d example.com -b google,bing,crtsh -l 200 -f report
Output surface:
- Emails — employee naming conventions, shadow accounts
- Subdomains — exposed services, dev/staging assets
- Virtual hosts — shared-IP hostnames
Save the discovered subdomains one per line in subdomains.txt, then feed that file into Amass for passive enumeration:
amass enum -passive -df subdomains.txt -o amass_out.txt
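The Amass command reads subdomains.txt, which theHarvester does not write by itself. One way to bridge the gap is sketched below. It assumes the JSON report written by `-f report` has a top-level "hosts" list whose items look like "host" or "host:ip"; check your own report file before relying on that, since the layout can differ between versions. The function names are hypothetical.

```python
import json
from pathlib import Path


def extract_hosts(report: dict) -> list[str]:
    """Return sorted, unique, lowercased hostnames from a parsed report.

    Assumption: report["hosts"] is a list of "host" or "host:ip" strings.
    """
    names = set()
    for item in report.get("hosts", []):
        host = str(item).split(":", 1)[0].strip().lower()
        if host:
            names.add(host)
    return sorted(names)


def write_subdomains(report_path: str, out_path: str = "subdomains.txt") -> int:
    """Write one hostname per line and return how many were written."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    hosts = extract_hosts(report)
    Path(out_path).write_text("".join(h + "\n" for h in hosts), encoding="utf-8")
    return len(hosts)


sample_report = {"hosts": ["Dev.example.com:192.0.2.10", "dev.example.com", "www.example.com"]}
print(extract_hosts(sample_report))
```

With a real report on disk, `write_subdomains("report.json")` would produce the file that the Amass command expects.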
Step 04 — Email footprint with holehe
pip install holehe
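# --only-used prints only the services where the address is registered.
# --csv writes the results to a CSV file; the CLI does not appear to offer
# a --json flag, so confirm the available options with `holehe --help`.
holehe target@example.com --only-used --csv
Holehe probes registration and password-recovery flows to test whether an address is registered. The project states that this does not alert the account holder, but confirm that against the current source, and consider the --no-password-recovery flag to skip recovery-based modules. When used as a Python library, each module returns a result of this shape (shown as JSON; real results may carry additional fields):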
{
  "name": "twitter",
  "domain": "twitter.com",
  "exists": true,
  "emailrecovery": null,
  "rateLimit": false
}
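A rate-limited module result is inconclusive, not negative, so it helps to separate confirmed hits from checks that need a retry. The sketch below works on a list of result objects with the fields shown above; the sample data and function names are made up for illustration.

```python
def confirmed_accounts(results: list[dict]) -> list[str]:
    """Service names where the address exists and the check was not rate-limited."""
    return sorted(
        r["name"]
        for r in results
        if r.get("exists") is True and not r.get("rateLimit", False)
    )


def needs_retry(results: list[dict]) -> list[str]:
    """Service names whose check was rate-limited and should be repeated later."""
    return sorted(r["name"] for r in results if r.get("rateLimit", False))


# Illustrative sample only; not real output.
sample = [
    {"name": "twitter", "domain": "twitter.com", "exists": True,
     "emailrecovery": None, "rateLimit": False},
    {"name": "imgur", "domain": "imgur.com", "exists": False,
     "emailrecovery": None, "rateLimit": False},
    {"name": "github", "domain": "github.com", "exists": False,
     "emailrecovery": None, "rateLimit": True},
]
print(confirmed_accounts(sample))
print(needs_retry(sample))
```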
Step 05 — Automated sweep with SpiderFoot
# Docker (easiest)
docker pull smicallef/spiderfoot
docker run -p 5001:5001 smicallef/spiderfoot
# Open Web UI → http://localhost:5001
# New Scan → Target: example.com → Modules: All → Start
SpiderFoot correlates across 200+ modules. Export as GEXF and open in Gephi for visual link analysis.
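Before opening a large export in Gephi, a quick degree count can show which entities are most connected. GEXF is XML, so the standard library is enough for a first look. The sketch below uses a tiny hand-written sample in GEXF 1.2 layout, not real SpiderFoot output, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written sample in GEXF 1.2 layout; not real SpiderFoot output.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="a" label="example.com"/>
      <node id="b" label="www.example.com"/>
      <node id="c" label="mail.example.com"/>
    </nodes>
    <edges>
      <edge id="0" source="a" target="b"/>
      <edge id="1" source="a" target="c"/>
    </edges>
  </graph>
</gexf>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def degree_by_label(root: ET.Element) -> Counter:
    """Count how many edge endpoints touch each node, keyed by node label."""
    labels = {}
    for el in root.iter():
        if local_name(el.tag) == "node":
            labels[el.get("id")] = el.get("label", el.get("id"))
    degree = Counter()
    for el in root.iter():
        if local_name(el.tag) == "edge":
            for end in (el.get("source"), el.get("target")):
                degree[labels.get(end, end)] += 1
    return degree


degrees = degree_by_label(ET.fromstring(SAMPLE_GEXF))
print(degrees.most_common(1))
```

For an exported file, pass `ET.parse(path).getroot()` instead of the parsed sample string.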
Step 06 — De-duplicate and index
All scraped links should flow through the Python indexer before being exposed:
cd /workspace/services/indexer
source .venv/bin/activate
# Scrape a URL and de-duplicate against the existing index
PYTHONPATH=. python -m indexer scrape-url https://example.com
# Run dedup explicitly
PYTHONPATH=. python -c "
from indexer.dedupe import deduped
links = [{'url': 'https://example.com'}, {'url': 'https://example.com'}]
print(deduped(links))
"
The indexer flags duplicate URLs and can tag entries with schema labels
(curious_science, medical, entertainment) to prevent cross-contamination
of categories.
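To make the idea of URL de-duplication concrete, here is a minimal sketch that normalizes each URL and keeps the first entry per normalized form. It is not the indexer's actual `deduped` implementation, which is not shown in this guide; `normalize_url` and `dedupe_links` are hypothetical names, and the normalization rules (lowercase scheme and host, drop fragment, ignore trailing slash) are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, ignore a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_links(links: list[dict]) -> list[dict]:
    """Keep the first entry for each normalized URL, preserving input order."""
    seen = set()
    unique = []
    for link in links:
        key = normalize_url(link["url"])
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique


links = [
    {"url": "https://example.com"},
    {"url": "https://EXAMPLE.com/"},
    {"url": "https://example.com/a#section"},
]
print(dedupe_links(links))
```

Keeping the first occurrence means any metadata attached to the earliest scrape survives; a different policy, such as keeping the newest, would be equally reasonable.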
Step 07 — Tag and categorize
Use the schema system in services/indexer/models.py:
from indexer.models import IndexEntry, Schema

entry = IndexEntry(
    url="https://noosphere.princeton.edu/",
    title="Global Consciousness Project",
    description="Real-time GCP RNG data stream.",
    schema=Schema.CURIOUS_SCIENCE,  # prevents appearing in entertainment
    tags=["gcp", "consciousness", "rng", "science"],
)
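The real definitions live in services/indexer/models.py and are not reproduced in this guide. As a rough stand-in, the sketch below shows one way a schema label could keep categories apart: an enum for the labels and a filter that only returns entries carrying the requested one. The field names mirror the example above, but everything else, including `entries_for`, is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


# Hypothetical stand-in for services/indexer/models.py; the real module may differ.
class Schema(str, Enum):
    CURIOUS_SCIENCE = "curious_science"
    MEDICAL = "medical"
    ENTERTAINMENT = "entertainment"


@dataclass
class IndexEntry:
    url: str
    title: str
    description: str
    schema: Schema
    tags: list[str] = field(default_factory=list)


def entries_for(schema: Schema, entries: list[IndexEntry]) -> list[IndexEntry]:
    """Return only the entries labelled with the given schema."""
    return [e for e in entries if e.schema is schema]


catalog = [
    IndexEntry("https://noosphere.princeton.edu/", "Global Consciousness Project",
               "Real-time GCP RNG data stream.", Schema.CURIOUS_SCIENCE, ["gcp", "rng"]),
    IndexEntry("https://example.com/show", "Example Show",
               "Placeholder entertainment entry.", Schema.ENTERTAINMENT),
]
print([e.title for e in entries_for(Schema.ENTERTAINMENT, catalog)])
```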
Step 08 — Promote to the directory
Once de-duplicated and tagged, push curated entries to src/data/navigation.json
under the appropriate intent node. The sidebar tree and flat directory table
update automatically on next build.
Reference: OSINT Framework categories
The OSINT Framework tree (loaded from arf.json) covers:
- Username — cross-platform account search
- Email Address — breach lookups, registration checks
- Domain Name — WHOIS, passive DNS, cert history
- IP Address — geolocation, ASN, abuse reports
- Social Networks — scrape-safe profile enumeration
- Dark Web — onion service discovery (Tor required)
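A category tree like this is easy to summarize with a short recursive walk. The sketch below assumes each node is an object with "name", "type" ("folder" or "url"), and an optional "children" list, which matches the general layout of arf.json as far as I know; verify against the file you load. The sample tree and function names are invented for illustration.

```python
def count_links(node: dict) -> int:
    """Count leaf nodes of type 'url' under a node, including the node itself."""
    if node.get("type") == "url":
        return 1
    return sum(count_links(child) for child in node.get("children", []))


def links_per_category(tree: dict) -> dict[str, int]:
    """Map each top-level category name to the number of links beneath it."""
    return {child["name"]: count_links(child) for child in tree.get("children", [])}


# Invented sample in the assumed layout; not real arf.json content.
SAMPLE_TREE = {
    "name": "OSINT Framework",
    "type": "folder",
    "children": [
        {"name": "Username", "type": "folder", "children": [
            {"name": "Tool A", "type": "url", "url": "https://example.com/a"},
            {"name": "Group", "type": "folder", "children": [
                {"name": "Tool B", "type": "url", "url": "https://example.com/b"},
            ]},
        ]},
        {"name": "Dark Web", "type": "folder", "children": []},
    ],
}
print(links_per_category(SAMPLE_TREE))
```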
Ethical boundaries
- Only investigate systems you own or have written authorization to assess
- Passive recon (search engines, CT logs, DNS) is generally low-risk
- Active scanning (Nmap, brute-force) requires explicit scope agreement
- Rate-limit all automated tools to avoid ToS violations
- Never expose tooling as an open anonymous service without auth + audit logs
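The rate-limiting rule above can be enforced in wrapper scripts with a small minimum-interval limiter. This is a minimal sketch under simple assumptions (single thread, fixed spacing between calls); `MinIntervalLimiter` is a hypothetical name, and the right interval depends on each service's terms.

```python
import time


class MinIntervalLimiter:
    """Block so that successive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock            # injectable for testing
        self._sleep = sleep            # injectable for testing
        self._next_allowed = float("-inf")

    def wait(self) -> float:
        """Sleep if needed, reserve the next slot, and return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


limiter = MinIntervalLimiter(0.01)  # short interval so the demo finishes quickly
for target in ["a.example.com", "b.example.com", "c.example.com"]:
    limiter.wait()
    print("querying", target)  # placeholder for the real lookup
```

Calling `limiter.wait()` immediately before each outbound request keeps the spacing in one place instead of scattering sleep calls through the script.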