// OSINT BASICS — THE WIRE FRAMEWORK

All data is signal. Noise is what you haven't learned to read yet.

This guide uses .mdx so tutorials render with code, images, and searchable headings while staying in plain-text source control. Deploy new guides by adding files to src/app/(main)/tutorials/.

Step 01 — Anchor your question

Before touching any tool, write down:

The claim you are verifying (one sentence, falsifiable)
The minimum evidence that would satisfy the claim
The target identifier — domain, email, username, IP, or person name
Legal / ethical scope — you must have authorization for active reconnaissance

CLAIM:    "company X owns domain Y"
EVIDENCE: WHOIS registrant match + TLS cert + DNS SOA record
TARGET:   example.com
SCOPE:    Public passive intel only (no active scanning)
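The four-field brief above can also live as a small record that refuses to start when a field is blank. This is a minimal sketch; the `Brief` class and its `problems` method are hypothetical names for illustration, not part of any tool in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """One investigation brief: claim, evidence bar, target, scope."""
    claim: str
    evidence: list[str]
    target: str
    scope: str

    def problems(self) -> list[str]:
        """Return a list of missing fields; an empty list means ready to start."""
        issues = []
        if not self.claim.strip():
            issues.append("claim is empty")
        if not self.evidence:
            issues.append("no evidence criteria listed")
        if not self.target.strip():
            issues.append("target is empty")
        if not self.scope.strip():
            issues.append("scope is empty")
        return issues


brief = Brief(
    claim="company X owns domain Y",
    evidence=["WHOIS registrant match", "TLS cert", "DNS SOA record"],
    target="example.com",
    scope="Public passive intel only (no active scanning)",
)
print(brief.problems())  # prints []
```

Checking for blank fields does not prove the claim is falsifiable or the scope is lawful; it only stops a run that skipped the writing step.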

Step 02 — Choose your trail (Intent-based routing)

The directory groups tools by intent. Pick the right lane:

| Intent | Use when | Tools |
|--------|----------|-------|
| Verification | Confirming ownership, history, or claims | WHOIS, crt.sh, Wayback Machine |
| Subdomain Recon | Mapping external attack surface | theHarvester, Amass, crt.sh |
| Identity | Email → account footprint | holehe, HaveIBeenPwned, Sherlock |
| Automation | Multi-vector sweep, correlation | SpiderFoot, OSINT Framework |
| Research | Literature, filings, scientific data | Scholar, EDGAR, PubMed |
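The same routing can be expressed as a lookup, which is handy when a script needs to pick a lane. This is a sketch only: the tool lists are copied from the table, while `TOOLS_BY_INTENT`, `tools_for`, and `intents_using` are hypothetical names, not part of the directory's code.

```python
# Intent -> tools, copied from the routing table (keys lowercased).
TOOLS_BY_INTENT = {
    "verification": ["WHOIS", "crt.sh", "Wayback Machine"],
    "subdomain recon": ["theHarvester", "Amass", "crt.sh"],
    "identity": ["holehe", "HaveIBeenPwned", "Sherlock"],
    "automation": ["SpiderFoot", "OSINT Framework"],
    "research": ["Scholar", "EDGAR", "PubMed"],
}


def tools_for(intent: str) -> list[str]:
    """Return the tools for an intent, ignoring case and surrounding spaces."""
    key = intent.strip().lower()
    if key not in TOOLS_BY_INTENT:
        raise KeyError(f"unknown intent {intent!r}; expected one of {sorted(TOOLS_BY_INTENT)}")
    return TOOLS_BY_INTENT[key]


def intents_using(tool: str) -> list[str]:
    """Return every intent whose lane lists the given tool (exact name match)."""
    return [intent for intent, tools in TOOLS_BY_INTENT.items() if tool in tools]
```

For example, `intents_using("crt.sh")` shows that certificate transparency data serves both the verification and the subdomain lanes.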

Step 03 — Run theHarvester (passive recon)

Install in a virtualenv and run against a target domain:

# Clone and install
git clone https://github.com/laramies/theHarvester
cd theHarvester
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements/base.txt

# Run passive sweep (no active DNS queries)
python3 theHarvester.py -d example.com -b google,bing,crtsh -l 200 -f report

Output surface:

Emails — employee naming conventions, shadow accounts
Subdomains — exposed services, dev/staging assets
Virtual hosts — shared-IP hostnames

Feed the subdomains into Amass for graph enumeration:

amass enum -passive -df subdomains.txt -o amass_out.txt
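The Amass command reads subdomains.txt, so something has to produce that file from theHarvester's report. Below is a minimal sketch of that hand-off. It assumes the report was saved as JSON with a top-level "hosts" list whose entries are either "hostname" or "hostname:ip"; check that layout against your theHarvester version. The function names are hypothetical.

```python
import json
from pathlib import Path


def hosts_from_report(report: dict, root: str) -> list[str]:
    """Extract unique, in-scope hostnames from a parsed report.

    Assumption: report["hosts"] holds strings like "www.example.com"
    or "www.example.com:93.184.216.34".
    """
    root = root.lower()
    names = set()
    for entry in report.get("hosts", []):
        host = str(entry).split(":", 1)[0].strip().lower().rstrip(".")
        # Keep only the root domain and its subdomains; drop anything off-target.
        if host == root or host.endswith("." + root):
            names.add(host)
    return sorted(names)


def write_subdomains(report_path: str, root: str, out_path: str = "subdomains.txt") -> int:
    """Write one hostname per line and return how many were written."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    hosts = hosts_from_report(report, root)
    Path(out_path).write_text("".join(h + "\n" for h in hosts), encoding="utf-8")
    return len(hosts)
```

Filtering against the root domain keeps the sweep inside the scope written down in Step 01, since search-engine sources can return hosts that belong to someone else.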

Step 04 — Email footprint with holehe

pip install holehe
# Show only sites where the address is registered; --csv saves the results to a file
holehe target@example.com --only-used --csv

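Holehe probes registration and password-recovery flows to infer whether an address is registered on a site. It is designed not to alert the account holder, but confirm that against the current source before relying on it. Each module returns a record like this: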

{
  "name": "twitter",
  "domain": "twitter.com",
  "exists": true,
  "emailrecovery": null,
  "rateLimit": false
}
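A rate-limited module cannot tell you whether the account exists, so treat those records separately from real hits. The sketch below assumes you have a list of records with the fields shown above, however they were collected; `confirmed_accounts` and `needs_retry` are hypothetical helper names.

```python
def confirmed_accounts(results: list[dict]) -> list[str]:
    """Domains where the address exists and the check was not rate-limited."""
    return sorted(
        r["domain"]
        for r in results
        if r.get("exists") is True and not r.get("rateLimit")
    )


def needs_retry(results: list[dict]) -> list[str]:
    """Domains whose check was rate-limited, so the result is inconclusive."""
    return sorted(r["domain"] for r in results if r.get("rateLimit"))


results = [
    {"name": "twitter", "domain": "twitter.com", "exists": True,
     "emailrecovery": None, "rateLimit": False},
    {"name": "github", "domain": "github.com", "exists": False,
     "emailrecovery": None, "rateLimit": True},
    {"name": "spotify", "domain": "spotify.com", "exists": False,
     "emailrecovery": None, "rateLimit": False},
]
print(confirmed_accounts(results))  # prints ['twitter.com']
print(needs_retry(results))         # prints ['github.com']
```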

Step 05 — Automated sweep with SpiderFoot

# Docker (easiest)
docker pull smicallef/spiderfoot
docker run -p 5001:5001 smicallef/spiderfoot

# Open Web UI → http://localhost:5001
# New Scan → Target: example.com → Modules: All → Start

SpiderFoot correlates across 200+ modules. Export as GEXF and open in Gephi for visual link analysis.
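Before opening Gephi, a quick degree count shows which entities the scan linked most often. This is a rough sketch using only the standard library. GEXF stores the graph as `node` and `edge` elements; that each node's `label` attribute carries the entity value in a SpiderFoot export is an assumption to verify on your own file. `degree_by_label` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def _local(tag: str) -> str:
    """Strip the XML namespace so any GEXF version's tags match."""
    return tag.rsplit("}", 1)[-1]


def degree_by_label(gexf_text) -> Counter:
    """Count how many edge endpoints touch each node, keyed by node label."""
    root = ET.fromstring(gexf_text)
    labels = {}
    for el in root.iter():
        if _local(el.tag) == "node":
            labels[el.get("id")] = el.get("label", el.get("id"))
    degree = Counter()
    for el in root.iter():
        if _local(el.tag) == "edge":
            for end in (el.get("source"), el.get("target")):
                # Fall back to the raw id if an edge points at an unlisted node.
                degree[labels.get(end, end)] += 1
    return degree


sample = """<gexf xmlns="http://www.gexf.net/1.2draft"><graph>
  <nodes>
    <node id="0" label="example.com"/>
    <node id="1" label="www.example.com"/>
    <node id="2" label="mail.example.com"/>
  </nodes>
  <edges>
    <edge id="0" source="0" target="1"/>
    <edge id="1" source="0" target="2"/>
  </edges>
</graph></gexf>"""
print(degree_by_label(sample).most_common(1))  # prints [('example.com', 2)]
```

For a real export, read the file as bytes (`Path("scan.gexf").read_bytes()`, with a file name of your choosing) and pass the result to the same function.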

Step 06 — De-duplicate and index

Run every scraped link through the Python indexer before publishing it to the directory:

cd /workspace/services/indexer
source .venv/bin/activate

# Scrape a URL and de-duplicate against the existing index
PYTHONPATH=. python -m indexer scrape-url https://example.com

# Run dedup explicitly
PYTHONPATH=. python -c "
from indexer.dedupe import deduped
links = [{'url': 'https://example.com'}, {'url': 'https://example.com'}]
print(deduped(links))
"

The indexer flags duplicate URLs and can tag entries with schema labels (curious_science, medical, entertainment) to prevent cross-contamination of categories.
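To make the idea concrete, here is one way URL de-duplication can work: normalize each URL, then keep the first entry per normalized key. This is an illustrative sketch, not the indexer's actual `deduped` implementation, whose rules may differ. The names `normalize` and `dedupe_links` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop default ports, fragments, trailing slashes."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded; the query string is kept as-is.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe_links(links: list[dict]) -> list[dict]:
    """Keep the first link for each normalized URL, preserving input order."""
    seen = set()
    unique = []
    for link in links:
        key = normalize(link["url"])
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique


links = [{"url": "https://example.com"}, {"url": "https://Example.com/"}]
print(dedupe_links(links))  # prints [{'url': 'https://example.com'}]
```

Keeping the query string is a deliberate choice here: two URLs that differ only in parameters may point at different records, so collapsing them is left to a human.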

Step 07 — Tag and categorize

Use the schema system in services/indexer/models.py:

from indexer.models import IndexEntry, Schema

entry = IndexEntry(
    url="https://noosphere.princeton.edu/",
    title="Global Consciousness Project",
    description="Real-time GCP RNG data stream.",
    schema=Schema.CURIOUS_SCIENCE,  # prevents appearing in entertainment
    tags=["gcp", "consciousness", "rng", "science"],
)
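The comment on the schema argument is the key point: a category view should select entries by schema label, so a curious_science entry never shows up under entertainment. The sketch below shows that filter with stand-in classes. They are hypothetical look-alikes written for this example, not the real definitions in services/indexer/models.py, and `entries_for` is an assumed helper name.

```python
from dataclasses import dataclass, field
from enum import Enum


class Schema(Enum):
    """Stand-in for the indexer's schema labels (assumed values)."""
    CURIOUS_SCIENCE = "curious_science"
    MEDICAL = "medical"
    ENTERTAINMENT = "entertainment"


@dataclass
class IndexEntry:
    """Stand-in entry with the same fields used in the example above."""
    url: str
    title: str
    description: str
    schema: Schema
    tags: list[str] = field(default_factory=list)


def entries_for(entries: list[IndexEntry], schema: Schema) -> list[IndexEntry]:
    """Return only the entries labeled with the requested schema."""
    return [e for e in entries if e.schema is schema]


gcp = IndexEntry(
    url="https://noosphere.princeton.edu/",
    title="Global Consciousness Project",
    description="Real-time GCP RNG data stream.",
    schema=Schema.CURIOUS_SCIENCE,
    tags=["gcp", "consciousness", "rng", "science"],
)
print(entries_for([gcp], Schema.ENTERTAINMENT))  # prints []
```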

Step 08 — Promote to the directory

Once de-duplicated and tagged, push curated entries to src/data/navigation.json under the appropriate intent node. The sidebar tree and flat directory table update automatically on next build.
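The promotion step can be scripted once you know the file's layout. The sketch below assumes, purely for illustration, that the navigation data is a JSON object mapping each intent node name to a list of entry objects with a "url" key. The real shape of src/data/navigation.json is not shown in this guide, so adapt the function before using it. `promote` is a hypothetical name.

```python
import json


def promote(navigation: dict, intent: str, entry: dict) -> bool:
    """Append entry under an existing intent node; return False if its URL is already there.

    Assumed layout: {"<intent>": [{"url": ..., "title": ...}, ...], ...}
    """
    if intent not in navigation:
        # Refuse to invent new nodes; the intent list is curated by hand.
        raise KeyError(f"unknown intent node: {intent!r}")
    node = navigation[intent]
    if any(existing.get("url") == entry["url"] for existing in node):
        return False
    node.append(entry)
    return True


navigation = {"research": [], "verification": []}
entry = {"url": "https://noosphere.princeton.edu/", "title": "Global Consciousness Project"}
print(promote(navigation, "research", entry))  # prints True
print(promote(navigation, "research", entry))  # prints False
print(json.dumps(navigation["research"], indent=2))
```

Returning False on a repeat URL gives the promotion step a second line of defense after the indexer's de-duplication.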

Reference: OSINT Framework categories

The OSINT Framework tree (loaded from arf.json) covers:

Username — cross-platform account search
Email Address — breach lookups, registration checks
Domain Name — WHOIS, passive DNS, cert history
IP Address — geolocation, ASN, abuse reports
Social Networks — scrape-safe profile enumeration
Dark Web — onion service discovery (Tor required)

Ethical boundaries

Only investigate systems you own or have written authorization to assess
Passive recon (search engines, CT logs, DNS) is generally low-risk
Active scanning (Nmap, brute-force) requires explicit scope agreement
Rate-limit all automated tools to avoid ToS violations
Never expose tooling as an open anonymous service without auth + audit logs
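Rate-limiting is the easiest of these rules to enforce in code. The sketch below spaces calls at least a fixed number of seconds apart. `MinInterval` is a hypothetical helper, and the right interval depends on each service's terms, which this guide does not specify.

```python
import time


class MinInterval:
    """Block so that successive wait() calls return at least `seconds` apart."""

    def __init__(self, seconds: float, clock=time.monotonic, sleep=time.sleep):
        self._seconds = seconds
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time the previous call was released

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._seconds - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call limiter.wait() immediately before each outbound request.
limiter = MinInterval(2.0)
```

A monotonic clock is used so that system clock adjustments cannot shorten the gap between requests.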

// Field checklist

Step 01 — Anchor your question
Step 02 — Map the surface
Step 03 — Harvest identifiers
Step 04 — Cross-reference
Step 05 — Archive snapshots
Step 06 — Correlate timelines
Step 07 — Stress-test claims
Step 08 — Promote to directory
