The NullRoute Behavioral Genome is an observational dataset of post-authentication attacker behavior, captured across a live multi-node honeypot network and clustered into behavioral families. The registry is rebuilt periodically as new session data is collected - it is not updated in real time.
It is a research instrument, not a ground truth. The families, atoms, and scores reflect what was observed on our sensors under specific conditions. They are not a complete picture of attacker behavior in the wild.
Scope: This methodology describes how raw shell commands from Cowrie honeypot sessions are transformed into behavioral atoms, grouped into families, and exposed via the public API. It covers data collection, normalization, matching, and known limitations.
Data is captured by Cowrie SSH honeypots deployed on 4 sensor nodes across 4 geographic regions (DE, US, FR, SG). Each node presents a different persona - healthcare infrastructure, AI/ML platform, ops dashboard - to observe persona-dependent targeting behavior.
All nodes accept SSH connections on port 22, either directly or via port forwarding to Cowrie. Sessions are logged as JSON events that include the full command sequence, session timing, the credential used, and the source IP.
Only post-authentication sessions are included in the Genome: sessions where the attacker logged in (credentials accepted by Cowrie's UserDB) and then issued at least one shell command.
Login attempts without a subsequent session are excluded. Passive fingerprinting events (P0f, Fatt, Suricata) are excluded from all analysis.
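The inclusion rule above can be sketched as a small filter over Cowrie's JSON log lines. This is a minimal illustration, not the production pipeline: it assumes the standard Cowrie event IDs `cowrie.login.success` and `cowrie.command.input` and the `session` and `input` fields, and the function name `post_auth_sessions` is hypothetical.

```python
import json
from collections import defaultdict


def post_auth_sessions(json_lines):
    """Return {session_id: [commands]} for sessions that qualify for inclusion.

    A session qualifies if it has a successful login AND at least one command.
    Event and field names are assumed to follow Cowrie's JSON log output.
    """
    logged_in = set()
    commands = defaultdict(list)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        session = event.get("session")
        eventid = event.get("eventid")
        if eventid == "cowrie.login.success":
            logged_in.add(session)
        elif eventid == "cowrie.command.input":
            commands[session].append(event.get("input", ""))
    # Keep only sessions that both authenticated and issued a command.
    return {s: cmds for s, cmds in commands.items() if s in logged_in and cmds}


# Hypothetical sample: a1 never authenticates, c3 logs in but runs nothing.
sample = [
    '{"eventid": "cowrie.login.failed", "session": "a1"}',
    '{"eventid": "cowrie.login.success", "session": "b2"}',
    '{"eventid": "cowrie.command.input", "session": "b2", "input": "uname -a"}',
    '{"eventid": "cowrie.login.success", "session": "c3"}',
]
sessions = post_auth_sessions(sample)
```

Only session `b2` survives the filter in this sample; the failed login and the command-less login are both dropped.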
Authentication acceptance bias: "Post-authentication" here means Cowrie accepted the credential against its UserDB configuration. This captures actors who present known or guessed credentials - not necessarily actors who would have succeeded against a real system. If UserDB is configured permissively, the session population includes credential-spraying tools that happen to guess a valid combination, not only targeted operators. The behavioral families reflect what these actors do after access is granted, regardless of intent or sophistication.
Sensor bias: Cowrie presents a partially emulated filesystem. Binary payloads that attackers try to run often fail silently because Cowrie intercepts execution. This means the Genome captures the intent and initial command sequence, but not necessarily the full execution chain that a deployed payload would have produced.
A behavioral atom is a normalized label assigned to a raw shell command based on its observable pattern. Atoms are not a strict ontology: some represent specific actions (disable_immutable_ssh), some represent tactical objectives (recon_gpu), and some are structural fallbacks with weak behavioral meaning (pipe_chain, list_dir). This is a pragmatic classification, not a formal taxonomy.
Example: chattr -ia .ssh, chattr -i ~/.ssh/authorized_keys, and chattr -a .ssh all map to the atom disable_immutable_ssh.
Atoms are extracted by applying ordered regex patterns to each raw command. Patterns are checked in specificity order - more specific patterns first. If no pattern matches, generic structural fallbacks are applied (prefix-based: ls → list_dir, echo → write_output, pipe presence → pipe_chain). Commands that match nothing are labeled other.
Fallback information loss: Generic fallbacks collapse semantically different commands into a single bucket. echo ok | nc 1.2.3.4 80 (C2 beacon attempt) and dmesg | tail (log inspection) both become pipe_chain. This is a deliberate tradeoff - it avoids unmatched noise - but it means fallback atoms carry weak behavioral signal and should not be treated as discriminative features. They are excluded from Sigma rule generation and from the "meaningful atom" criterion used in classification state assignment.
Atom extraction pipeline: raw command → ordered regex patterns → generic fallbacks → "other"

Examples (patterns shown are illustrative - production patterns are broader):

| Raw Command | Atom | Matched By |
|---|---|---|
| chattr -ia .ssh | disable_immutable_ssh | pattern: chattr\s+(-\w+\s+)?\.ssh |
| rm -rf /var/tmp/dota3 | cleanup_dota | pattern: rm\s+-rf\s+/var/tmp/dota |
| free -m | recon_memory | pattern: free\s+-[mh] |
| ls -la /home | list_dir | fallback: starts with "ls " |
| echo ok \| nc 1.2.3.4 80 | pipe_chain | fallback: contains "\|" |
| dmesg \| tail | pipe_chain | fallback: contains "\|" |
| unmapped_binary --flag | other | no match |
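The extraction pipeline can be sketched in a few lines of Python. This is a minimal illustration that uses only the three illustrative patterns quoted in this section, not the production vocabulary. The order of the fallback checks is an assumption, chosen so that the examples in this section come out as documented (pipe presence is tested before the echo prefix, since echo ok | nc 1.2.3.4 80 is listed as pipe_chain).

```python
import re

# Illustrative patterns only, taken from the examples in this section.
# Order matters: more specific patterns come first.
NAMED_PATTERNS = [
    ("disable_immutable_ssh", re.compile(r"chattr\s+(-\w+\s+)?\.ssh")),
    ("cleanup_dota", re.compile(r"rm\s+-rf\s+/var/tmp/dota")),
    ("recon_memory", re.compile(r"free\s+-[mh]")),
]


def extract_atom(command):
    """Map one raw shell command to a behavioral atom label."""
    cmd = command.strip()
    # Stage 1: ordered named patterns.
    for atom, pattern in NAMED_PATTERNS:
        if pattern.search(cmd):
            return atom
    # Stage 2: generic structural fallbacks (check order is an assumption).
    if "|" in cmd:
        return "pipe_chain"
    if cmd == "ls" or cmd.startswith("ls "):
        return "list_dir"
    if cmd == "echo" or cmd.startswith("echo "):
        return "write_output"
    # Stage 3: nothing matched.
    return "other"


for raw in ["chattr -ia .ssh", "echo ok | nc 1.2.3.4 80", "unmapped_binary --flag"]:
    print(f"{raw!r} -> {extract_atom(raw)}")
```

Because named patterns run before the fallbacks, a command that contains a pipe but also matches a named pattern keeps its named atom in this sketch.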
The current atom vocabulary covers 37 named atoms plus generic fallbacks. Key atoms and their tactical meaning:
| Atom | Tactical Meaning | Example Command |
|---|---|---|
| disable_immutable_ssh | Remove immutable bit from .ssh (pre-persistence prep) | chattr -ia .ssh |
| destroy_ssh_dir | Wipe existing .ssh to take full control | rm -rf .ssh |
| inject_ssh_key | Add own key to authorized_keys | echo 'ssh-rsa...' >> .ssh/authorized_keys |
| cleanup_dota | Remove competitor's dota payload (discriminative) | rm -rf /var/tmp/dota3 |
| recon_gpu | GPU discovery for crypto-mining targeting | nvidia-smi, gpustat |
| decode_payload | Decode base64 payload in-memory | base64 -d payload.b64 |
| persistence_cron | Install cron-based persistence | crontab -l, echo '...' \| crontab - |
| read_env_file | Read .env for credentials or config | cat /opt/app/.env |
Known limitation: The atom vocabulary was built from observed patterns on our sensors. Attackers using obfuscation, base64-encoded commands, or non-standard binary paths may produce atoms that don't reflect their true intent. Novel attack techniques not yet observed will match no named pattern and will be labeled with a generic fallback atom or with other.
Sessions are grouped into families using cosine similarity over atom bags at the clustering stage (analysis CLI). Sessions whose atom bags are sufficiently similar are merged into a variant. Variants whose representative bags are sufficiently similar are merged into a family.
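As a rough illustration of the clustering metric, cosine similarity over two atom bags can be computed as below. This sketch assumes each bag is a count of atoms per session; whether the analysis CLI uses counts or binary presence is not stated here, and the merge thresholds are not shown because this section does not specify them.

```python
import math
from collections import Counter


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two atom bags (atom -> count)."""
    dot = sum(count * bag_b.get(atom, 0) for atom, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sessions, expressed as bags of atoms.
session_1 = Counter(["recon_uname", "disable_immutable_ssh", "inject_ssh_key"])
session_2 = Counter(["recon_uname", "inject_ssh_key", "cleanup_dota"])
session_3 = Counter(["recon_gpu", "recon_memory"])

sim_12 = cosine_similarity(session_1, session_2)  # two of three atoms shared
sim_13 = cosine_similarity(session_1, session_3)  # no atoms shared
```

The metric is symmetric: swapping the two arguments gives the same value, which is the property the clustering stage relies on.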
Why cosine for clustering, weighted containment for matching? These are different problems with different objectives. Clustering is symmetric: two sessions belong to the same family when their atom profiles are mutually similar. Cosine similarity captures this directional closeness between equal-status objects. Matching at inference time is asymmetric: a partial query (one observed session) should be explained by a family, not the reverse. Weighted containment measures how much of the query's behavioral mass the family accounts for - a recall-biased metric suited for the partial-session case. Using cosine at inference time would penalise large canonical families when the query only shows an early-stage or incomplete session.
Family names are auto-generated from their most discriminative atoms (highest IFF weight within the cluster) unless manually overridden with a canonical name (e.g., dota_mdrfckr). "Most discriminative" means the atoms with the highest Inverse Family Frequency in the registry at build time - atoms rare across families, abundant within the cluster.
| Tier | Meaning | Examples |
|---|---|---|
| canonical | Manually validated. Cross-node confirmation or published research backing. Name explicitly assigned. | dota_mdrfckr (5 published pieces, 4 nodes) |
| provisional | Auto-clustered. Behaviorally coherent but not independently validated. Name is auto-generated from gene IDs. | scanner_system_fingerprint, botnet_network_recon |
The matching API gives canonical families a fixed 0.05 score boost when ranking results, so that a manually validated family ranks above a same-scoring provisional cluster. This is an additive adjustment, not a relative 5% increase: score += 0.05, capped at 1.0. It is applied to the weighted containment score after per-variant matching and before classification state assignment. The intent is to resolve ties in favour of validated families - not to reclassify weak matches as strong ones. A provisional family with score 0.60 will still outrank a canonical family with score 0.48 + 0.05 = 0.53.
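The boost and the worked example above can be reproduced with a short sketch. The function names and the (name, tier, score) tuple layout are hypothetical; only the +0.05 adjustment and the 1.0 cap come from the description above.

```python
def apply_canonical_boost(score, tier, boost=0.05):
    """Add the canonical boost, capped at 1.0. Provisional scores are unchanged."""
    if tier == "canonical":
        return min(1.0, score + boost)
    return score


def rank_families(raw_scores):
    """raw_scores: list of (name, tier, score). Returns (name, boosted_score), best first."""
    boosted = [(name, apply_canonical_boost(score, tier)) for name, tier, score in raw_scores]
    return sorted(boosted, key=lambda item: item[1], reverse=True)


# The example from the text: provisional 0.60 still outranks canonical 0.48 + 0.05.
ranked = rank_families([
    ("dota_mdrfckr", "canonical", 0.48),
    ("scanner_system_fingerprint", "provisional", 0.60),
])

# A true tie is resolved in favour of the canonical family.
tie = rank_families([
    ("scanner_system_fingerprint", "provisional", 0.60),
    ("dota_mdrfckr", "canonical", 0.60),
])
```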
The /match endpoint uses weighted containment scoring - a recall-biased similarity metric designed for partial session matching.
Standard Jaccard similarity penalises a large canonical family (many atoms) when the query only shows a partial session. A 2-atom scanner family can outrank a 14-atom worm family even when the worm family explains the query better. Weighted containment shifts the objective: it measures how much of the query's behavioral mass the candidate variant accounts for, not symmetric set overlap. This is a design choice with a tradeoff - a query contaminated with one rare but incidental atom can disproportionately influence the result.
Matching is done at the variant level, not the family level:

1. For each family, score the query against every known variant.
2. The family's score = max over all variant scores.
3. Apply canonical boost (+0.05, capped at 1.0) for canonical families.
4. Rank all families by score. Assign classification state to the top result.

Per-variant weighted containment formula:

$$\mathrm{score}(Q, V) = \frac{\sum_{a_i \in Q \cap V} w_i}{\sum_{a_i \in Q} w_i}, \qquad w_i = \mathrm{IFF}(a_i) = \log\frac{N}{1 + \mathrm{family\_freq}(a_i)} + 1.0$$

Here Q is the query's atom set and V is the variant's atom set, so Q ∩ V is the set of atoms present in both. N is the total number of families in the registry, family_freq(a_i) is the number of families that contain atom a_i, and IFF is Inverse Family Frequency (see note below).
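The per-variant formula translates directly into code. The sketch below takes the IFF weights as an input dictionary; the weight values in the example are made up for illustration, and how the API treats query atoms that have no registry weight is not specified here, so the sketch simply expects a weight for every query atom.

```python
def weighted_containment(query_atoms, variant_atoms, weights):
    """Fraction of the query's weighted atom mass that the variant accounts for."""
    query = set(query_atoms)
    total_mass = sum(weights[atom] for atom in query)
    if total_mass == 0:
        return 0.0
    matched_mass = sum(weights[atom] for atom in query & set(variant_atoms))
    return matched_mass / total_mass


# Hypothetical weights: the rare atom carries more mass than the commodity one.
weights = {"cleanup_dota": 3.0, "recon_gpu": 2.0, "recon_uname": 1.0}

query = {"cleanup_dota", "recon_uname", "recon_gpu"}
variant = {"cleanup_dota", "recon_uname", "inject_ssh_key"}

score = weighted_containment(query, variant, weights)  # (3.0 + 1.0) / 6.0
```

Note the asymmetry: atoms that appear only in the variant (inject_ssh_key here) do not lower the score, which is what makes the metric tolerant of partial sessions.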
Inverse Family Frequency (IFF) assigns higher weight to atoms that appear in few families. cleanup_dota (only in dota_mdrfckr) has high IFF weight. recon_uname (present in almost every family) has low IFF weight.
This means a query containing rm -rf /var/tmp/dota3 will produce a high score for dota_mdrfckr specifically, even if the query also contains commodity atoms like uname -a that match many families.
IFF vs IDF: IFF is analogous to Inverse Document Frequency in text retrieval, but the rarity unit is family presence, not session or variant frequency. An atom that appears in 1 of 9 families gets a high IFF weight even if it appears in thousands of sessions within that family. This is intentional - the goal is to identify atoms that discriminate between families, not atoms that are rare overall. As the registry grows, IFF values will shift: atoms that are currently discriminative may become commodity if new families adopt them.
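To make the weighting concrete, the sketch below computes IFF weights from a toy 9-family registry. The registry contents are hypothetical, and the natural logarithm is an assumption: the formula says only "log".

```python
import math


def iff_weights(families):
    """families: {family_name: set of atoms}. Returns {atom: IFF weight}."""
    n = len(families)
    family_freq = {}
    for atoms in families.values():
        for atom in set(atoms):
            family_freq[atom] = family_freq.get(atom, 0) + 1
    # IFF(atom) = log(N / (1 + family_freq)) + 1.0, natural log assumed.
    return {atom: math.log(n / (1 + freq)) + 1.0 for atom, freq in family_freq.items()}


# Toy registry: recon_uname appears in 8 of 9 families, cleanup_dota in 1 of 9.
families = {f"family_{i}": {"recon_uname"} for i in range(8)}
families["dota_mdrfckr"] = {"cleanup_dota", "inject_ssh_key"}

weights = iff_weights(families)
# recon_uname:  log(9 / 9) + 1.0 = 1.0
# cleanup_dota: log(9 / 2) + 1.0, roughly 2.5
```

Session counts never enter the calculation: cleanup_dota would get the same weight whether it appeared in ten sessions or ten thousand within its one family.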
The matching API scores the query against all known variants of each family and takes the best score. This prevents the canonical variant (which may represent an older session) from hiding discriminative atoms that appear in newer variants.
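A minimal sketch of the best-variant rule follows. To stay self-contained it uses a plain unweighted containment as a stand-in scoring function; the real API uses the IFF-weighted version. The variant contents are hypothetical.

```python
def unweighted_containment(query_atoms, variant_atoms):
    """Stand-in scorer for this sketch: share of query atoms found in the variant."""
    query = set(query_atoms)
    if not query:
        return 0.0
    return len(query & set(variant_atoms)) / len(query)


def family_score(query_atoms, variants, score_fn=unweighted_containment):
    """A family's score is the best score over all of its variants."""
    return max((score_fn(query_atoms, variant) for variant in variants), default=0.0)


# Hypothetical family: the older variant lacks the discriminative atom.
older_variant = {"recon_uname", "inject_ssh_key"}
newer_variant = {"recon_uname", "cleanup_dota"}

query = {"recon_uname", "cleanup_dota"}
best = family_score(query, [older_variant, newer_variant])
only_older = family_score(query, [older_variant])
```

Scoring against the older variant alone would hide the match on cleanup_dota; taking the maximum lets the newer variant carry the family.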
| State | Threshold | Meaning |
|---|---|---|
| attributed | top_score ≥ 0.50 | Strong behavioral match. Confident family assignment. |
| partial_signal | 0.20 ≤ top_score < 0.50 | Some behavioral signal. Family is a candidate, not definitive. |
| novel_candidate | top_score < 0.20, ≥ 3 commands, ≥ 1 named atom | Behavioral signal present but no known family fits well. Candidate for future family seeding. |
| noise | ≤ 2 commands, or all commands map only to fallback atoms (pipe_chain, list_dir, write_output, other) | Insufficient signal to classify. Not a failure - most single-command probe sessions are noise. |
Threshold calibration: These thresholds were set by manual inspection of the 9-family registry - not by held-out evaluation or precision/recall optimisation. The 0.50 boundary was anchored by verifying that known dota_mdrfckr sessions score above it; the noise boundary was anchored by verifying that single-command login probes fall below it. No inter-family confusion matrix was computed. Treat these values as reasonable operational defaults for the current registry size, not validated statistical cutoffs. As the registry grows, thresholds should be revisited.
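The state table can be read as a small decision function. The sketch below is one possible reading: the table does not say which rule wins when a short session also scores highly, so checking the noise condition first is an assumption, and the function name `classify` is hypothetical.

```python
FALLBACK_ATOMS = {"pipe_chain", "list_dir", "write_output", "other"}


def classify(top_score, command_count, atoms):
    """Assign a classification state from the top family score and session shape."""
    named_atoms = [atom for atom in atoms if atom not in FALLBACK_ATOMS]
    # Assumption: the noise check takes precedence over score thresholds.
    if command_count <= 2 or not named_atoms:
        return "noise"
    if top_score >= 0.50:
        return "attributed"
    if top_score >= 0.20:
        return "partial_signal"
    # At this point: score < 0.20, at least 3 commands, at least 1 named atom.
    return "novel_candidate"


state = classify(0.58, 6, ["recon_uname", "cleanup_dota", "pipe_chain"])
```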
All data comes from Cowrie SSH honeypots. Cowrie emulates a subset of Linux behavior. Some attackers detect honeypot artifacts (Twisted SSH stack fingerprint, non-persistent filesystem) and abort. The Genome captures what attackers do when they don't detect the honeypot - which is a selection bias.
Sensors are in DE, US, FR, SG on specific IP ranges. Attackers targeting specific networks (e.g., cloud provider ranges, healthcare ASNs) may be overrepresented or underrepresented depending on how our IP ranges are classified by attackers' targeting logic.
Any finding observed on only one sensor node should be treated as strong evidence, not proof. Cross-node confirmation - the same behavioral family matched on 2+ independent nodes - is an epistemic principle, not a quantified confidence model. It increases credibility by ruling out sensor-specific artifacts, but the magnitude of the increase depends on how independent the nodes' traffic populations actually are.
The current registry covers sessions from March–April 2026. Families that were active before this period or that have since changed their TTPs may not be accurately represented.
Cowrie intercepts binary execution. Attackers attempting to run ELF binaries see simulated success or failure. The Genome cannot observe what the binary would have done - only what commands preceded execution attempts.
Each sensor node presents a different persona (healthcare, AI/ML platform, ops dashboard). Attackers who inspect the environment before acting may change their behavior based on what they find. This means the Genome does not capture "raw" attacker behavior - it captures behavior shaped by the personas we present. Observed targeting differences between nodes may reflect persona response as much as inherent attacker strategy.
Behavioral families reflect observed command sequences. A family may represent an automated tool, an operator playbook, or a combination. The Genome makes no claim about whether a session was manually driven or scripted unless that is explicitly noted in the family description.
With 9 families, IFF weights and classification thresholds are sensitive to small changes in registry composition. Adding or removing a single family can shift IFF weights for shared atoms and move borderline sessions across classification boundaries. Results should be interpreted in the context of the registry version reported by /stats.
Scoring the query against all variants and taking the best score makes large families with many variants easier to match by chance. A family with 20 variants has more opportunities to align with an arbitrary query than a family with 2 variants, independent of true behavioral similarity. This is a known bias; the variant proliferation rate is tracked but not currently corrected for.
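A back-of-the-envelope calculation shows why variant count matters. It assumes each variant has the same independent chance p of clearing a threshold by coincidence. That assumption is unrealistic, because variants of one family share most of their atoms, so the numbers illustrate the direction of the bias rather than its real size. The value p = 0.05 is purely hypothetical.

```python
def chance_of_spurious_match(p_single, variant_count):
    """P(at least one of k independent variants matches by chance) = 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p_single) ** variant_count


p = 0.05  # hypothetical per-variant chance of a coincidental match
small_family = chance_of_spurious_match(p, 2)
large_family = chance_of_spurious_match(p, 20)
print(f"2 variants: {small_family:.3f}, 20 variants: {large_family:.3f}")
```

Under these toy assumptions the 20-variant family is several times more likely to produce a coincidental best match than the 2-variant family.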
The confidence score in /match responses is a weighted containment score - the fraction of the query's IFF-weighted atom mass that overlaps with the best-matching family variant. It is not:
A score of 0.58 (attributed) for dota_mdrfckr means: 58% of the query's IFF-weighted atom mass overlapped with the best-matching dota_mdrfckr variant. The remaining 42% was unmatched weighted atom mass - this may reflect behavioral drift, query-specific context, extraction aliasing (fallback atoms absorbing meaningful commands), or representation gaps in the variant's atom set. It does not necessarily mean the session contained 42% novel behavior.
Sigma rules are auto-generated from the canonical variant's atom sequence. Each rule is issued in two stages:
Broad (hunt): Triggers on any single named atom pattern (generic fallback atoms - pipe_chain, list_dir, write_output, other - are excluded from broad rules). High recall, expect false positives. Use for threat hunting and hypothesis generation - not direct alerting.
High-fidelity (alert): Requires co-occurrence of 2 characteristic atoms from the family's canonical variant. Expected to produce fewer false positives than the broad rule, but this has not been validated against real production telemetry. Treat FP estimates as engineering expectations, not measured rates.
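The difference between the two stages can be expressed as set logic over atoms. This is a sketch of the trigger conditions only, not Sigma syntax and not the rule generator itself; how the "characteristic" atoms are chosen is not described here, so they are passed in as an input.

```python
FALLBACK_ATOMS = {"pipe_chain", "list_dir", "write_output", "other"}


def broad_hit(observed_atoms, variant_atoms):
    """Broad (hunt): any single named atom of the canonical variant was observed."""
    named = set(variant_atoms) - FALLBACK_ATOMS
    return bool(named & set(observed_atoms))


def high_fidelity_hit(observed_atoms, characteristic_atoms, required=2):
    """High-fidelity (alert): at least `required` characteristic atoms co-occur."""
    return len(set(characteristic_atoms) & set(observed_atoms)) >= required


# Hypothetical canonical variant and characteristic atoms.
variant = {"recon_uname", "disable_immutable_ssh", "cleanup_dota", "pipe_chain"}
characteristic = {"disable_immutable_ssh", "cleanup_dota"}

observed = {"recon_uname", "pipe_chain"}
hunt = broad_hit(observed, variant)                  # one commodity atom is enough
alert = high_fidelity_hit(observed, characteristic)  # no characteristic atoms seen
```

The example shows why the broad stage is for hunting only: a lone commodity atom such as recon_uname satisfies it, while the alert stage stays silent.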
Telemetry prerequisite: These rules require Linux process_creation events with CommandLine populated. Collection tools that provide this include auditd (with appropriate rules), Falco, Sysdig, and eBPF-based EDR solutions. These sources vary in how command arguments are captured and may require tuning. Standard syslog does not include CommandLine and will not trigger these rules.
The Genome registry is rebuilt as new session data is collected. Family IDs are derived from the canonical variant's session hash and are not guaranteed stable across full rebuilds - if the underlying session data changes, IDs may change. External systems and integrations should use the name field (e.g., dota_mdrfckr) as the durable identifier for canonical families, not family_id. Provisional family IDs offer no stability guarantee.
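For integrators, the practical consequence is to key local state on name rather than family_id. The sketch below uses made-up records: the family_id values are hypothetical placeholders, and the record layout beyond the name and family_id fields is an assumption.

```python
def index_by_name(families):
    """Index family records by their durable name field."""
    return {family["name"]: family for family in families}


# Hypothetical records from two registry builds; the ID changes, the name does not.
build_1 = [{"family_id": "a3f9", "name": "dota_mdrfckr"}]
build_2 = [{"family_id": "7c21", "name": "dota_mdrfckr"}]

before = index_by_name(build_1)
after = index_by_name(build_2)

# A lookup keyed on name survives the rebuild; one keyed on family_id would not.
still_found = "dota_mdrfckr" in after
id_changed = before["dota_mdrfckr"]["family_id"] != after["dota_mdrfckr"]["family_id"]
```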
The API exposes a schema_version field in /stats. Breaking changes in the atom vocabulary or family taxonomy will increment the schema version.