# Duplication explained

> What fallow's duplication output means, how clone detection works, and how to interpret and act on the results.

This page covers what `fallow dupes` reports, how to read the output, and when to act.

<Tip>
  Pass `--explain` together with `--format json` on any command to include metric definitions directly in the JSON output as a `_meta` object. The MCP server always includes `_meta` automatically.
</Tip>

## Clone types

Fallow classifies clones using the standard taxonomy from clone detection research.

| Clone type | What matches                                                 | Detection mode     |
| :--------- | :----------------------------------------------------------- | :----------------- |
| **Type-1** | Exact token sequences (whitespace/comments already stripped) | `strict`, `mild`   |
| **Type-2** | Same structure, but identifiers and/or literals differ       | `weak`, `semantic` |

As you move from `strict` to `semantic`, recall increases (more clones found) but precision decreases (more potential false positives).

<Accordion title="Research: clone taxonomy">
  The Type-1 through Type-4 clone taxonomy was surveyed by [Roy, Cordy, and Koschke (2009)](https://doi.org/10.1016/j.scico.2009.02.007). Fallow detects Type-1 and Type-2 clones. Type-3 (gapped clones with inserted/deleted statements) and Type-4 (semantically equivalent but syntactically different) require analysis beyond token comparison and are not currently supported.
</Accordion>

### Detection modes in detail

| Mode       | What is normalized                                                         | Trade-off                                             |
| :--------- | :------------------------------------------------------------------------- | :---------------------------------------------------- |
| `strict`   | Nothing -- exact token match                                               | Highest precision, lowest recall                      |
| `mild`     | Equivalent to strict (AST tokenization already strips whitespace/comments) | Default, good starting point                          |
| `weak`     | + string literal values abstracted                                         | Catches copies with different messages/URLs           |
| `semantic` | + identifiers + numeric literals + type annotations stripped               | Catches renamed-variable clones; produces the most false positives |

```text fallow dupes --mode semantic theme={null}
● Clone group (196 lines, 2 instances) [semantic match]
  src/lib/dutch-holidays.ts:193-388
  src/lib/dutch-holidays.ts:389-584
  Renamed: holidays2024→holidays2025, year2024→year2025
```
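
One way to picture the modes is as a normalization pass applied to each token before comparison. The sketch below is illustrative only: the `Token` shape, the `TokenKind` names, and the `$STR`/`$ID`/`$NUM` placeholders are assumptions, not fallow's internal representation.

```typescript
// Hypothetical token model; fallow's real tokenizer is not documented on this page.
type Mode = "strict" | "mild" | "weak" | "semantic";
type TokenKind = "keyword" | "identifier" | "string" | "number" | "operator" | "type";

interface Token {
  kind: TokenKind;
  text: string;
}

// Maps a token stream to the keys that would be compared under a given mode.
function normalize(tokens: Token[], mode: Mode): string[] {
  const abstractStrings = mode === "weak" || mode === "semantic";
  const semantic = mode === "semantic";
  const out: string[] = [];
  for (const t of tokens) {
    if (t.kind === "string" && abstractStrings) out.push("$STR");
    else if (t.kind === "identifier" && semantic) out.push("$ID");
    else if (t.kind === "number" && semantic) out.push("$NUM");
    else if (t.kind === "type" && semantic) continue; // type annotations are dropped
    else out.push(t.text); // strict and mild compare the raw token text
  }
  return out;
}
```

Under this model, two statements that differ only in a variable name and a string value compare as different in `strict`, `mild`, and `weak`, and as equal in `semantic`.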

## Key metrics

### Duplication percentage

Fraction of total source tokens that appear in at least one clone group. Computed over the full analyzed file set, not just the groups shown by `--top`.

| Range  | Interpretation  | Action                                                |
| :----- | :-------------- | :---------------------------------------------------- |
| 0–5%   | Low duplication | No action needed                                      |
| 5–15%  | Moderate        | Review the largest clone families                     |
| 15–30% | High            | Prioritize extraction of shared modules               |
| 30%+   | Very high       | Likely structural issue; look for copy-pasted modules |

<Info>
  Duplication percentage is mode-dependent. Running with `semantic` mode reports a percentage at least as high as `strict`, and usually higher, because more normalization means more matches. Compare percentages only across runs using the same mode.
</Info>
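
The definition above counts each token once, even if it falls inside several clone groups. A minimal sketch of that calculation follows; the `Instance` shape with inclusive token offsets is an assumption made for illustration and is not fallow's output schema.

```typescript
// Hypothetical instance record: a token range inside one file (inclusive bounds).
interface Instance {
  file: string;
  startToken: number;
  endToken: number;
}

// Percentage of source tokens covered by at least one clone instance.
function duplicationPercentage(totalTokens: number, instances: Instance[]): number {
  const covered = new Set<string>();
  for (const inst of instances) {
    for (let i = inst.startToken; i <= inst.endToken; i++) {
      covered.add(`${inst.file}:${i}`); // a Set avoids double-counting overlapping instances
    }
  }
  return totalTokens === 0 ? 0 : (covered.size / totalTokens) * 100;
}
```

Because overlapping instances are deduplicated, the result cannot exceed 100 as long as `totalTokens` covers every file that appears in `instances`.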

### Token count and line count

Each clone group reports both token count and line count.

* **Tokens** are language-aware units (keywords, identifiers, operators, literals). This is what the detection engine matches on.
* **Lines** are the source lines spanned by the clone. Useful for estimating refactoring effort.

Larger clones have higher refactoring value. A 200-line clone group is worth extracting; a 6-line one probably isn't.

### Instance count

The number of locations where the same code appears. A clone group with 5 instances means the same block was copied to 5 places. Fixing a bug in the logic requires updating all 5.

| Instances | Risk                     | Action                                                |
| :-------- | :----------------------- | :---------------------------------------------------- |
| 2         | Normal, often acceptable | Extract if the code is complex or likely to change    |
| 3–5       | Bug risk increases       | Extract into a shared function or module              |
| 6+        | High maintenance burden  | Extract urgently; divergence between copies is likely |

## Clone groups and families

### Clone groups

A **clone group** is a single duplicated code block found at 2+ locations. Each location is an **instance**.

```text theme={null}
● Clone group 1 (8 lines, 2 instances)
  src/validators/userValidator.ts:12-19
  src/validators/orderValidator.ts:8-15
```

This means one block of 8 lines appears in both files. Here's what it looks like:

<CodeGroup>
  ```typescript userValidator.ts:12-19 theme={null}
  const errors: string[] = [];
  for (const [key, rule] of Object.entries(rules)) {
    const value = data[key];
    if (rule.required && (value === undefined || value === null)) {
      errors.push(`${key} is required`);
    }
  }
  return errors;
  ```

  ```typescript orderValidator.ts:8-15 theme={null}
  const errors: string[] = [];
  for (const [key, rule] of Object.entries(rules)) {
    const value = data[key];
    if (rule.required && (value === undefined || value === null)) {
      errors.push(`${key} is required`);
    }
  }
  return errors;
  ```
</CodeGroup>

One clone group = one duplicated block. But the same two files may share more than one block:

```text theme={null}
● Clone group 2 (6 lines, 2 instances)
  src/validators/userValidator.ts:24-29
  src/validators/orderValidator.ts:20-25

● Clone group 3 (5 lines, 2 instances)
  src/validators/userValidator.ts:35-39
  src/validators/orderValidator.ts:31-35
```

Three separate clone groups, all between the same two files. That pattern is a **clone family**.

### Clone families

A **clone family** is a set of clone groups that involve the same files. Those files share multiple distinct clones and were probably copy-pasted from each other.

```text theme={null}
Family: src/validators/userValidator.ts ↔ src/validators/orderValidator.ts
  3 clone groups, 19 duplicated lines
  → Extract shared validation module
```
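
Grouping clone groups into families amounts to bucketing them by the set of files they touch. The following is a sketch under assumptions: `CloneGroup` and `CloneFamily` are hypothetical shapes, and summing each group's line count once is inferred from the example above (8 + 6 + 5 = 19), not from fallow's source.

```typescript
// Hypothetical shapes for illustration only.
interface CloneGroup {
  lines: number;
  files: string[]; // one entry per instance
}

interface CloneFamily {
  files: string[];
  groups: number;
  duplicatedLines: number;
}

// Buckets clone groups by their (order-independent) file set and keeps
// only buckets that contain more than one group.
function buildFamilies(groups: CloneGroup[]): CloneFamily[] {
  const byFileSet = new Map<string, CloneFamily>();
  for (const g of groups) {
    const files = Array.from(new Set(g.files)).sort();
    const key = files.join("\u0000");
    const family = byFileSet.get(key) ?? { files, groups: 0, duplicatedLines: 0 };
    family.groups += 1;
    family.duplicatedLines += g.lines;
    byFileSet.set(key, family);
  }
  return Array.from(byFileSet.values()).filter((f) => f.groups > 1);
}
```

Feeding in the three validator groups from this page yields one family with 3 groups and 19 duplicated lines.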

A clone family changes the refactoring approach. An individual clone group suggests extracting a function; a clone family suggests that the files themselves need to be merged or restructured.

| Pattern                   | What it tells you          | Refactoring strategy                                |
| :------------------------ | :------------------------- | :-------------------------------------------------- |
| Single clone group        | One block was copied       | Extract that block into a shared function           |
| Family within one file    | Multiple self-clones       | Extract the repeated pattern into a helper          |
| Family across 2 files     | Files were copy-pasted     | Merge into a shared module with configuration       |
| Family across a directory | Template-based duplication | Replace the template with a parameterized generator |
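
As an illustration of the first row, the 8-line block shared by `userValidator.ts` and `orderValidator.ts` could be pulled into one function. This is a sketch, not a prescribed fix: the `Rule` shape and the `validateRequired` name are assumptions inferred from the snippet shown earlier.

```typescript
// Hypothetical rule shape, inferred from how `rule.required` is used in the clone.
interface Rule {
  required?: boolean;
}

// Shared replacement for the duplicated block in both validators.
function validateRequired(
  data: Record<string, unknown>,
  rules: Record<string, Rule>,
): string[] {
  const errors: string[] = [];
  for (const [key, rule] of Object.entries(rules)) {
    const value = data[key];
    if (rule.required && (value === undefined || value === null)) {
      errors.push(`${key} is required`);
    }
  }
  return errors;
}
```

Each validator would then call `validateRequired(data, rules)` instead of carrying its own copy, so a fix to the required-field check lands in one place.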

## When duplication is acceptable

Not all duplication should be eliminated. Context matters.

* **Test files**: Test cases often repeat setup code intentionally. Abstracting test setup makes tests harder to read and debug. Use [`--production`](/cli/global-flags) to exclude test/story/dev files entirely, or use `dupes.ignore` patterns for more granular control.
* **Generated code**: Codegen output (GraphQL, Prisma, OpenAPI) is inherently duplicative. Exclude with `dupes.ignore`.
* **Small clones (\< 10 lines)**: Very small clones are often idiomatic patterns (error handling, guard clauses) rather than meaningful duplication.
* **Cross-language ports**: If you maintain both `.ts` and `.js` versions intentionally, use `--skip-local` to focus on cross-directory duplicates instead.

<Warning>
  Premature abstraction is worse than duplication. If the duplicated code isn't changing and isn't causing bugs, leave it. Extract clones that are both **large** and **in actively changing files**. Use [`fallow health --hotspots`](/cli/health) to cross-reference.
</Warning>

## Interpreting `--threshold` results

The `--threshold` flag sets a maximum allowed duplication percentage. If the project exceeds the threshold, fallow exits with code 1.

```text fallow dupes --threshold 15 theme={null}
Duplication: 19.4% (27,255 duplicated lines across 398 files, exceeds 15% threshold)
Found 1,184 clone groups, 2,959 instances (0.23s) ✗
```

Start with a threshold above your current level, then ratchet it down over time. Combine with `--baseline` for even more gradual adoption.
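
One possible ratchet policy, sketched below, is to set the next threshold a small margin above the current duplication level and never let it rise above the previous threshold. The `nextThreshold` helper and the one-point default margin are hypothetical; fallow does not ship this function.

```typescript
// Suggests the next `--threshold` value as a whole number.
// Policy (an assumption, not fallow behavior): round the current level up,
// add a margin, and never loosen an existing threshold.
function nextThreshold(currentPct: number, previousThreshold: number, margin = 1): number {
  const candidate = Math.ceil(currentPct) + margin;
  return Math.min(previousThreshold, candidate);
}
```

For a project at 19.4% with a previous threshold of 25, this suggests 21. If the project is already above its threshold, the function keeps the existing value rather than raising it.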

## How suffix-array detection works

Fallow concatenates all token streams into one sequence, builds a <Tooltip tip="A sorted array of all suffixes of a string, enabling efficient pattern matching without pairwise comparison">suffix array</Tooltip> with <Tooltip tip="Longest Common Prefix array, used alongside suffix arrays to find the longest shared sequences between code blocks">LCP array</Tooltip>, and scans for repeated subsequences above the minimum length. This runs in O(n log n) time and requires no pairwise file comparison.

<Accordion title="Research: suffix-array clone detection">
  Token-based clone detection was pioneered by [Baker (1995)](https://doi.org/10.1109/WCRE.1995.514697). Suffix-array approaches for scalable clone detection were developed by [Li et al. (2006)](https://doi.org/10.1145/1137983.1138012). Fallow's approach is closest to the suffix-array method, with normalization levels corresponding to clone types.
</Accordion>
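
The idea can be sketched in a few dozen lines. This is a teaching sketch, not fallow's implementation: tokens are plain numbers, the suffix array is built with a naive comparison sort (slower than O(n log n) in the worst case), the LCP array uses Kasai's algorithm, and the function names are hypothetical. Unique negative sentinels between files are an assumption used here to stop matches from spanning file boundaries.

```typescript
// Naive suffix array: sort suffix start positions by comparing token sequences.
function buildSuffixArray(tokens: number[]): number[] {
  const n = tokens.length;
  const sa = Array.from({ length: n }, (_, i) => i);
  sa.sort((a, b) => {
    while (a < n && b < n) {
      if (tokens[a] !== tokens[b]) return tokens[a] - tokens[b];
      a++;
      b++;
    }
    return b - a; // the suffix that ran out first is shorter and sorts first
  });
  return sa;
}

// Kasai's algorithm: lcp[i] is the common prefix length of suffixes sa[i - 1] and sa[i].
function buildLcp(tokens: number[], sa: number[]): number[] {
  const n = tokens.length;
  const rank = new Array<number>(n).fill(0);
  sa.forEach((start, i) => {
    rank[start] = i;
  });
  const lcp = new Array<number>(n).fill(0);
  let h = 0;
  for (let i = 0; i < n; i++) {
    if (rank[i] === 0) {
      h = 0;
      continue;
    }
    const j = sa[rank[i] - 1];
    while (i + h < n && j + h < n && tokens[i + h] === tokens[j + h]) h++;
    lcp[rank[i]] = h;
    if (h > 0) h--;
  }
  return lcp;
}

interface Repeat {
  length: number;
  starts: [number, number];
}

// Reports adjacent suffix pairs that share at least `minTokens` tokens.
function findRepeats(tokens: number[], minTokens: number): Repeat[] {
  const sa = buildSuffixArray(tokens);
  const lcp = buildLcp(tokens, sa);
  const repeats: Repeat[] = [];
  for (let i = 1; i < sa.length; i++) {
    if (lcp[i] >= minTokens) {
      repeats.push({ length: lcp[i], starts: [sa[i - 1], sa[i]] });
    }
  }
  return repeats;
}

// File A = [1, 2, 3, 4], file B = [9, 1, 2, 3, 4], joined with sentinels -1 and -2.
const demo = findRepeats([1, 2, 3, 4, -1, 9, 1, 2, 3, 4, -2], 4);
```

A real detector would also merge adjacent pairs into groups with more than two instances and collapse shorter repeats nested inside longer ones; both steps are omitted here.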

## Limitations

* **Type-3 clones** (gapped clones with inserted/deleted lines) are not detected. If someone copied a function and added a few lines in the middle, fallow will report the matching portions as separate smaller clones rather than one large near-miss.
* **Type-4 clones** (semantically equivalent but syntactically different) are not detected. Two functions that do the same thing but are written differently will not be flagged.
* **Cross-file-type detection** only works between `.ts` and `.js` files (via `--cross-language`). Other language pairs are not supported.
* **Minimum size thresholds** mean very small clones (below `--min-tokens` / `--min-lines`) are invisible. This is intentional: small clones are usually idiomatic patterns.
* **Semantic mode false positives**: Normalizing identifiers and literals can match code that is structurally similar but semantically unrelated. Review semantic-mode results before acting.
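
To make the Type-4 limitation concrete, here is a hypothetical pair of functions that behave identically but share almost no token sequence. A token-based detector would not be expected to report them as clones in any mode.

```typescript
// Same behavior, different structure: a loop versus a reduce call.
function sumLoop(values: number[]): number {
  let total = 0;
  for (const v of values) {
    total += v;
  }
  return total;
}

function sumReduce(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0);
}
```

Finding pairs like this takes code review or behavioral testing rather than clone detection.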

## JSON `_meta` object

When `--explain` is passed (or via MCP), the JSON output includes a `_meta` object:

```json theme={null}
{
  "schema_version": 3,
  "_meta": {
    "docs": "https://docs.fallow.tools/explanations/duplication",
    "metrics": {
      "duplication_percentage": {
        "name": "Duplication percentage",
        "description": "Fraction of total source tokens in clone groups",
        "range": "[0, 100]",
        "interpretation": "lower is better; <5% low, 5–15% moderate, >15% high"
      },
      "clone_group": {
        "name": "Clone group",
        "description": "Set of 2+ code fragments with identical normalized token sequences"
      },
      "clone_family": {
        "name": "Clone family",
        "description": "Multiple clone groups sharing the same file set, indicating systematic duplication"
      }
    }
  }
}
```

AI agents and CI systems can use this to interpret results without consulting external documentation.
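
A consumer might read `_meta` roughly as follows. The types below mirror only the fields visible in the sample above; the `describeMetric` helper is hypothetical, and real output may contain fields that are not modeled here.

```typescript
// Field names are taken from the sample JSON on this page; everything else is an assumption.
interface MetricMeta {
  name: string;
  description: string;
  range?: string;
  interpretation?: string;
}

interface ExplainedOutput {
  schema_version: number;
  _meta?: {
    docs: string;
    metrics: Record<string, MetricMeta>;
  };
}

// Returns a one-line description of a metric, or undefined when `_meta`
// is absent (no `--explain`) or the key is unknown.
function describeMetric(output: ExplainedOutput, key: string): string | undefined {
  const metric = output._meta?.metrics[key];
  if (!metric) return undefined;
  const hint = metric.interpretation ? ` (${metric.interpretation})` : "";
  return `${metric.name}: ${metric.description}${hint}`;
}
```

Treating `_meta` as optional keeps the same code path working for output produced without `--explain`.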
