Codebase health analysis goes beyond finding dead code. While unused exports and unreachable files tell you what to delete, health metrics tell you what to fix: which functions are too complex, which files change too often, and where untested complexity creates the most risk. This page covers what fallow health reports, when to act on the numbers, and where the research comes from.
Complexity metrics
These metrics appear in fallow health output, both per-function (findings) and per-file (file scores).
Cyclomatic complexity
The number of linearly independent paths through a function’s control flow graph. For structured programs (like JS/TS), this simplifies to 1 + the number of decision points (if/else, switch cases, loops, ternary ?:, logical &&/||, catch).
| Range | Interpretation | Action |
|---|---|---|
| 1–10 | Simple, easy to test exhaustively | No action needed |
| 11–20 | Moderate, most functions should stay here | Review if growing |
| 21–50 | High, hard to test all paths | Split into smaller functions |
| 50+ | Very high, testing all paths is impractical | Refactor urgently |
Research
Introduced by Thomas McCabe in “A Complexity Measure” (1976). The practical upper bound of 20 comes from NIST SP 500-235 (Watson & McCabe, 1996).
Cognitive complexity
How hard a function is to understand when reading top-to-bottom. Unlike cyclomatic complexity (which counts paths for testing), cognitive complexity measures comprehension difficulty. It penalizes nesting depth, break/continue to labels, recursion, and sequences of logical operators.
| Range | Interpretation | Action |
|---|---|---|
| 0–7 | Easy to understand at a glance | No action needed |
| 8–15 | Moderate, may benefit from extraction | Extract helpers if growing |
| 15+ | Hard to follow, likely mixing concerns | Refactor: extract, simplify, or decompose |
Research
Developed by G. Ann Campbell at SonarSource. Full specification: SonarSource white paper.
Complexity density
Total cyclomatic complexity divided by lines of code. Normalizing by file size lets you compare fairly: a 500-line file with complexity 50 (density 0.1) vs. a 10-line file with complexity 10 (density 1.0).
| Value | Interpretation | Action |
|---|---|---|
| < 0.3 | Low, data files, configs, simple modules | No action needed |
| 0.3–1.0 | Moderate, normal application code | Review the densest functions |
| > 1.0 | Very dense, nearly every line is a decision point | Refactor: extract logic, simplify conditions |
Cyclomatic vs cognitive: a code example
These two functions illustrate why both metrics matter: the switch has higher cyclomatic complexity (more paths to test) but is trivially readable, while the nested if chain has lower cyclomatic complexity but is harder to follow. Cognitive complexity captures that difference.
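A sketch of the contrast (hypothetical functions; complexity counts annotated per the definitions above):

```typescript
// Cyclomatic 5 (1 + four case branches), cognitive ~1: flat structure, trivially readable.
function statusLabel(code: number): string {
  switch (code) {
    case 200: return "ok";
    case 301: return "redirect";
    case 404: return "not found";
    case 500: return "server error";
    default:  return "unknown";
  }
}

// Cyclomatic 4 (1 + three ifs), cognitive ~6: each nesting level adds to the reading cost.
function discount(user: { active: boolean; years: number; orders: number }): number {
  if (user.active) {          // +1
    if (user.years > 2) {     // +1, plus +1 for nesting
      if (user.orders > 10) { // +1, plus +2 for nesting
        return 0.2;
      }
      return 0.1;
    }
  }
  return 0;
}
```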
File health scores
Per-file scores from fallow health --file-scores. Zero-function files (barrel files, re-export files) are excluded.
Maintainability Index (MI)
Fallow uses a simplified Maintainability Index adapted for JavaScript/TypeScript module graphs.
| Range | Interpretation | Action |
|---|---|---|
| 70–100 | Good maintainability | Monitor |
| 40–70 | Moderate | Review periodically, address if declining |
| 0–40 | Poor | Prioritize for refactoring |
fallow health --file-scores --top 5
How fallow's MI relates to the classic Maintainability Index
The original MI from Oman and Hagemeister (1992) combines Halstead Volume, cyclomatic complexity, lines of code, and comment percentage. Microsoft adapted it to a 0–100 scale for Visual Studio. Fallow’s variant makes three changes for the JS/TS ecosystem:
- Replaces Halstead Volume with complexity density. Halstead metrics require full type resolution, which fallow doesn’t have (syntactic analysis only). Complexity density captures the same “how complex per unit of code” signal.
- Adds dead code ratio. Unused exports are a JS/TS-specific maintenance burden that the classic MI doesn’t account for.
- Adds fan-out coupling penalty. High import counts are a strong signal of maintenance difficulty in module-based codebases. The penalty uses logarithmic scaling (ln(fan_out + 1) × 4, capped at 15 points), reflecting diminishing marginal risk per additional import; a sketch follows this list.
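A sketch of that penalty term, assuming the formula above (the function name is illustrative, not fallow's source):

```typescript
// ln(fan_out + 1) × 4, capped at 15 points, per the MI variant described above.
function fanOutPenalty(fanOut: number): number {
  return Math.min(Math.log(fanOut + 1) * 4, 15);
}

fanOutPenalty(3);  // ≈ 5.5 — a handful of imports costs little
fanOutPenalty(60); // 15  — the cap: diminishing marginal risk per extra import
```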
Fan-in, fan-out, and dead code ratio
Fan-in: number of files that import this file. High fan-in = high blast radius when changed. Fan-out: number of files this file imports. High fan-out = high coupling and change propagation risk. Dead code ratio: fraction of value exports with zero references (0–1). Type-only exports (interfaces, type aliases) are excluded.
| Metric | Signal | Action |
|---|---|---|
| Fan-in > 20 | Many dependents, critical file | Extra review before changes |
| Fan-out > 15 | Many imports, high coupling | Consider splitting into smaller modules |
| Dead code = 0 | All exports used | No action needed |
| Dead code > 0.3 | Significant unused code | Run fallow fix to auto-remove |
Research: fan-in/fan-out
Fan-in/fan-out coupling metrics originate from Henry and Kafura (1981). Later refined in the Chidamber & Kemerer OO metrics suite (1994), where Coupling Between Objects (CBO) captures the same idea for class-level dependencies.
Coupling concentration
Two vital signs metrics summarize coupling at the project level:
| Metric | Description |
|---|---|
| p95_fan_in | 95th percentile fan-in across all files |
| coupling_high_pct | Percentage of files with fan-in above the effective threshold |
The effective threshold is max(p95_fan_in, 10); the floor of 10 prevents false positives in small projects where even moderate fan-in is normal.
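A sketch of how these two vital signs relate, assuming a nearest-rank percentile (fallow's exact percentile method is not specified here):

```typescript
// Computes p95_fan_in and coupling_high_pct from per-file fan-in counts.
function couplingVitals(fanIns: number[]) {
  const sorted = [...fanIns].sort((a, b) => a - b);
  const p95 = sorted[Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1)];
  const threshold = Math.max(p95, 10); // floor of 10 guards small projects
  const highCount = fanIns.filter((f) => f > threshold).length;
  return { p95_fan_in: p95, coupling_high_pct: (highCount / fanIns.length) * 100 };
}
```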
Risk profiles
Risk profiles show what percentage of functions fall into each risk bin. Two profiles are reported as vital signs.
Unit size (function length in lines of code):
| Bin | LOC range |
|---|---|
| Low risk | 1–15 |
| Medium risk | 16–30 |
| High risk | 31–60 |
| Very high risk | > 60 |
Unit interfacing (parameter count):
| Bin | Parameters |
|---|---|
| Low risk | 0–2 |
| Medium risk | 3–4 |
| High risk | 5–6 |
| Very high risk | 7+ |
The profiles appear as unit_size_profile and unit_interfacing_profile in the vital signs JSON, each containing low_risk, medium_risk, high_risk, and very_high_risk percentage values.
When the unit size profile shows >= 3% of functions in the very-high-risk bin (> 60 LOC), the human output includes a Large functions drill-down listing the specific oversized functions with file path, function name, line number, and LOC count. In JSON output, these appear as a large_functions array on the health report.
Research: SIG quality model
Risk profiles follow the SIG/TÜViT quality model, which benchmarks codebases by the distribution of code across risk categories rather than by averages. A codebase where 95% of functions are low-risk is healthier than one with the same average complexity but a fat tail of very-high-risk functions.
Untested complexity risk
Complex code without test coverage is the most likely source of production bugs. Fallow quantifies this risk per function using a score based on the CRAP metric (Change Risk Anti-Patterns), originally developed by Alberto Savoia and Bob Evans at Agitar Labs in 2007. The score appears in fallow health --file-scores output and JSON.
How the score is computed
Fallow uses the canonical CRAP formula: score = CC^2 × (1 − cov/100)^3 + CC, where cov is per-function coverage (0–100). Fallow obtains that value in one of three ways:
| Model (coverage_model) | When | How cov is determined |
|---|---|---|
| static_estimated | Default | Graph-based per-function estimate: directly test-referenced = 85%, transitively test-reachable = 40%, not test-reachable = 0%. |
| istanbul | --coverage <coverage-final.json> (auto-detected from coverage/ and .nyc_output/, or via FALLOW_COVERAGE) | Real per-function statement coverage produced by Jest, Vitest, c8, or nyc. |
| static_binary | Legacy, snapshot/JSON compatibility only | Whole-file binary: test-reachable file = 100% (score = CC), untested file = 0% (score = CC^2 + CC). Superseded by static_estimated. |
Under the static_estimated model, a not-test-reachable function with cyclomatic complexity 5 scores 5^2 × 1 + 5 = 30, which is exactly the default flag threshold.
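A sketch of the formula in code (the boundary values match the examples in this section):

```typescript
// CRAP(fn) = CC^2 × (1 − cov/100)^3 + CC, with cov as a percentage (0–100).
function crapScore(cc: number, cov: number): number {
  return cc ** 2 * (1 - cov / 100) ** 3 + cc;
}

crapScore(5, 0);   // 30    — not test-reachable: flagged at the default threshold
crapScore(5, 85);  // ≈ 5.1 — directly test-referenced under static_estimated
crapScore(5, 100); // 5     — fully covered: score collapses to CC
```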
For execution evidence from real traffic (hot/cold paths, runtime-weighted dead-code verdicts), see Runtime coverage. Runtime coverage is a paid Fallow Runtime feature and accepts V8 or Istanbul runtime data via --runtime-coverage; it merges on top of whichever static coverage_model is active.
Default threshold: 30 (functions scoring at or above this value are flagged)
Interpreting the numbers
| Score | Interpretation | Action |
|---|---|---|
| < 30 | Acceptable risk | No action needed |
| 30–99 | Moderate risk, complex untested code | Add test coverage or reduce complexity |
| 100–999 | High risk | Prioritize: add tests, then simplify |
| > 999 | Extreme risk (shown as “>999” in terminal) | Refactor urgently, this code is both complex and invisible to tests |
Scores above 999 are displayed as >999 in the terminal for readability. JSON output always contains the exact value.
File-level aggregation
Each file reports two aggregated values:
| Field | Meaning |
|---|---|
| crap_max | Highest per-function score in the file |
| crap_above_threshold | Count of functions scoring >= 30 |
The active model is reported as coverage_model on the health summary (static_estimated, istanbul, or the legacy static_binary).
Refactoring targets
The AddTestCoverage refactoring target fires when both conditions are true:
- crap_above_threshold >= 2 (multiple risky functions, not just one outlier)
- complexity_density > 0.3 (the file is genuinely complex, not just large)
Suppression
To exclude a file from untested complexity risk reporting:
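A minimal sketch, assuming suppression goes through the same health.ignore glob patterns mentioned under Limitations; the config file name and exact shape are assumptions, not fallow's documented schema:

```json
{
  "health": {
    "ignore": ["src/generated/**", "src/legacy/parser.ts"]
  }
}
```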
Static estimation vs. real coverage
The CRAP formula needs a per-function coverage percentage: a fully tested function (cov = 100) always scores exactly CC, regardless of complexity, while a completely untested function (cov = 0) scores CC^2 + CC. Without coverage input, the default static_estimated model derives cov from the dependency graph: 85% for directly test-referenced exports, 40% for transitively test-reachable code, 0% for code with no path from any test file. This is an approximation of test coverage:
- False negatives: a file imported by tests but with functions that are never actually exercised will score lower (look healthier) than it should.
- False positives: a file referenced only from non-test code can score worse than it deserves if it is in fact unit-tested through indirect paths.
For real numbers, pass --coverage <coverage-final.json> (Jest, Vitest, c8, or nyc). Fallow auto-detects coverage/coverage-final.json and .nyc_output/coverage-final.json, or set FALLOW_COVERAGE to point at a custom path. The coverage_model field flips from static_estimated to istanbul when real coverage is in use. The legacy static_binary model (whole-file 0%/100% based on test reachability) is superseded by static_estimated and is retained only for snapshot/JSON backwards compatibility.
Research
The CRAP metric was introduced by Alberto Savoia and Bob Evans at Agitar Labs. Original presentation: “CRAP4J” (2007). The core insight is that complexity alone is not dangerous if the code is well-tested, and simple code is low-risk even without tests. Risk concentrates where complexity and missing coverage overlap.
Hotspot metrics
Hotspots are files that are both complex and frequently changing. Bugs concentrate at this intersection, and refactoring here yields the highest return. Available via fallow health --hotspots.
The core insight: Complex code that never changes is stable. Simple code that changes often is manageable. Complex code that changes often is where defects concentrate.
Hotspot score
| Score | Interpretation | Action |
|---|---|---|
| 70–100 | Critical hotspot | Prioritize refactoring in next sprint |
| 40–70 | Moderate hotspot | Schedule review, reduce complexity |
| 0–40 | Low risk | Monitor |
fallow health --hotspots --top 3
Weighted commits
Recency-weighted commit count using exponential decay with a 90-day half-life:
- A commit yesterday contributes ~1.0
- A commit 90 days ago contributes ~0.5
- A commit 180 days ago contributes ~0.25
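A sketch of the weighting, assuming a straightforward exponential-decay implementation:

```typescript
// Half-life decay: a commit's contribution halves every 90 days.
const HALF_LIFE_DAYS = 90;

function commitWeight(ageDays: number): number {
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

// Weighted commits = sum of weights over all commits touching the file.
const weighted = [1, 90, 180].reduce((sum, age) => sum + commitWeight(age), 0); // ≈ 1.74
```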
Research
Recency-weighted change metrics were introduced by Graves et al. (2000). The empirical link between churn and defect density was established by Nagappan and Ball (2005) at Microsoft.
Churn trend
Compares commit frequency in the first half vs. the second half of the analysis window:
| Trend | Meaning | Action |
|---|---|---|
| accelerating | Recent > 1.5× older half | High priority if complexity is also high |
| stable | Balanced frequency | Normal, monitor score |
| cooling | Recent < 0.67× older half | Lower priority, stabilizing |
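A sketch of the classification, assuming the thresholds in the table (inputs are commit counts per half-window):

```typescript
type ChurnTrend = "accelerating" | "stable" | "cooling";

// olderHalf/recentHalf: commit counts in each half of the analysis window.
function churnTrend(olderHalf: number, recentHalf: number): ChurnTrend {
  if (recentHalf > 1.5 * olderHalf) return "accelerating";
  if (recentHalf < 0.67 * olderHalf) return "cooling";
  return "stable";
}
```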
Research: churn × complexity methodology
Pioneered by Michael Feathers and systematized by Adam Tornhill in Your Code as a Crime Scene (2015) and Software Design X-Rays (2018).
Refactoring targets
Refactoring targets are ranked, actionable recommendations produced by fallow health --targets. Files are prioritized for refactoring using the metrics above.
Targets are synthesis, not data. The other sections show measurements. Targets tell you what to do about them.
Priority score
| Score | Interpretation | Action |
|---|---|---|
| 70–100 | High priority | Address in the current sprint |
| 40–70 | Moderate | Schedule for upcoming work |
| 0–40 | Lower priority | Monitor, address opportunistically |
Effort estimation
Every target includes an effort estimate (low, medium, or high) based on three file-level signals:
| Effort | Criteria |
|---|---|
| low | < 100 lines AND ≤ 3 functions AND < 5 importers |
| high | ≥ 500 lines, OR ≥ 20 importers, OR (≥ 15 functions AND density > 0.5) |
| medium | Everything else |
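A sketch of these rules (field names assumed for illustration):

```typescript
type Effort = "low" | "medium" | "high";

// Applies the effort criteria from the table above, in order.
function estimateEffort(file: {
  lines: number;
  functions: number;
  importers: number;
  density: number;
}): Effort {
  const { lines, functions, importers, density } = file;
  if (lines < 100 && functions <= 3 && importers < 5) return "low";
  if (lines >= 500 || importers >= 20 || (functions >= 15 && density > 0.5)) return "high";
  return "medium";
}
```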
Evidence
For three recommendation categories, fallow provides structured evidence linking the target back to specific analysis data. This lets agents and scripts act on recommendations without a second tool call.
| Category | Evidence fields | Example |
|---|---|---|
| remove_dead_code | unused_exports: list of export names | ["assertEqual", "assertIs", "assertNever"] |
| extract_complex_functions | complex_functions: list of {name, line, cognitive} | [{"name": "processData", "line": 45, "cognitive": 113}] |
| break_circular_dependency | cycle_path: list of file paths forming the cycle | ["src/a.ts", "src/b.ts", "src/c.ts"] |
| All other categories | evidence omitted from JSON | (none) |
Baselines
Use --save-baseline and --baseline with the health command to track refactoring progress over time. When a baseline is active, only new targets (not present in the baseline) are reported.
Baseline entries are keyed by relative_path:category_label (e.g., src/utils.ts:dead code). A target is considered “known” if the same file + category combination exists in the baseline, regardless of whether its priority score or effort level has changed.
This is useful in CI to prevent new tech debt from landing while allowing existing targets to be addressed incrementally.
Recommendation categories
Each target is assigned a category based on which rule matches first (evaluated in priority order):
| Category | Display label | Trigger | What it means |
|---|---|---|---|
| Urgent churn + complexity | churn+complexity | Hotspot score ≥ 50, accelerating trend, density > 0.5 | Actively-changing file with growing complexity, highest urgency |
| Break circular dependency | circular dep | File in an import cycle with fan-in ≥ 5 | Changes cascade through the cycle to many dependents |
| Split high-impact file | high impact | Fan-in ≥ 20 (or ≥ 10 with ≥ 5 functions) and density > 0.3 | High blast radius amplifies every bug introduced |
| Remove dead code | dead code | Dead code ratio ≥ 0.5 with ≥ 3 value exports | Majority of exports are unused; reduce surface area |
| Extract complex functions | complexity | Top cognitive complexity ≥ 30 | Contains functions that are very hard to understand |
| Reduce coupling | coupling | Fan-out ≥ 15 and MI < 60 (excluding entry points) | Too many imports reduce testability |
fallow health --targets --top 5
JSON structure
With --format json, each target includes structured data for programmatic consumption:
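A sketch of one target's shape; fields beyond trigger, value, threshold, and evidence are illustrative, not fallow's exact schema:

```json
{
  "file": "src/legacy/helpers.ts",
  "category": "remove_dead_code",
  "priority": 72,
  "effort": "low",
  "trigger": { "metric": "dead_code_ratio", "value": 0.6, "threshold": 0.5 },
  "evidence": { "unused_exports": ["assertEqual", "assertIs", "assertNever"] }
}
```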
Each target's trigger includes value and threshold for programmatic use. The evidence field provides actionable detail for categories that support it. AI agents can use evidence.unused_exports to generate removal PRs without a second tool call. Categories without evidence omit the field entirely.
Health score
The health score is a single 0–100 number summarizing overall codebase quality. Available via fallow health --score.
| Penalty | Formula | Max |
|---|---|---|
| Dead files | min(dead_file_pct × 0.2, 15) | 15 |
| Dead exports | min(dead_export_pct × 0.2, 15) | 15 |
| Complexity | min(critical_complexity_pct × 4, 20) | 20 |
| P90 complexity | 0 for formula v2 runs; older snapshots fall back to min(max(0, p90_cyclomatic − 10), 10) | 10 |
| Maintainability | min(maintainability_low_pct × 1.5, 15) | 15 |
| Hotspots | min(hotspot_top_pct_count / ceil(total_files × 0.01) × 10, 10) | 10 |
| Unused deps | min(unused_deps_per_k_files × 0.5, 25) | 25 |
| Circular deps | min(circular_deps_per_k_files × 0.5, 25) | 25 |
| Unit size | min(functions_over_60_loc_per_k × 0.5, 10) | 10 |
| Coupling | min(coupling_high_pct × 0.5, 5) | 5 |
| Duplication | min(max(0, duplication_pct − 5) × 1.0, 10) | 10 |
health_score.formula_version identifies the formula used. Version 2 uses scale-invariant density and tail metrics so large monorepos are not diluted by many healthy files. Older snapshots that lack the v2 fields fall back to the previous average, p90, and raw-count aggregators. The unit size penalty tracks functions above 60 LOC per 1,000 functions. The coupling penalty uses the percentage of high fan-in files. The duplication penalty kicks in when more than 5% of lines are duplicated; --score automatically runs duplication analysis to populate this metric.
Letter grades: A (score >= 85), B (70–84), C (55–69), D (40–54), F (below 40).
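A sketch of the aggregation, assuming each penalty from the table is subtracted from 100 and the result is floored at 0 (penalty computation abbreviated):

```typescript
// Combines the penalty terms from the table above into a 0–100 score.
function healthScore(penalties: number[]): number {
  const total = penalties.reduce((sum, p) => sum + p, 0);
  return Math.max(0, 100 - total);
}

// Letter grade cutoffs as documented above.
function grade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 85) return "A";
  if (score >= 70) return "B";
  if (score >= 55) return "C";
  if (score >= 40) return "D";
  return "F";
}
```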
Penalty fields are null in JSON when the corresponding analysis didn’t run. --score computes the score and automatically runs duplication analysis; add --hotspots (or combine --score --targets) when the score should include the churn-backed hotspot penalty.
Combining health with dead code analysis
Health metrics are most powerful when combined with fallow’s other analyses. Running fallow (all analyses) or fallow health alongside fallow dead-code reveals patterns that neither analysis shows alone:
- Dead code in complex files: A file with high cyclomatic complexity and 70% unused exports is a strong candidate for deletion rather than refactoring
- Hotspots with circular dependencies: A frequently-changed file that sits in an import cycle amplifies every change through the cycle to its dependents
- Duplicated complex functions: Functions flagged as complex that also appear as clone instances in fallow dupes are refactoring candidates where extraction yields the most benefit

Use fallow health --targets for prioritized recommendations that synthesize all these signals into a ranked action list.
Limitations
- Generated files (GraphQL codegen, Prisma client, OpenAPI types) may have high complexity density but shouldn’t be refactored. Use health.ignore patterns to exclude them.
- Test files with many it()/test() blocks can have high cyclomatic complexity. This is usually fine: test suites should cover many paths.
- Config files (Webpack, ESLint) may appear as hotspots due to frequent changes during setup. Their churn typically cools after initial configuration.
- MI scores are not portable. Fallow’s formula differs from Visual Studio and SonarQube, so scores are not directly comparable across tools.
- Hotspot scores are project-relative. A score of 80 means “worst in your project,” not “objectively bad.” A well-maintained project may have a top score of 30.
- Fan-in/out counts modules, not imports: import { a, b, c } from './utils' counts as 1 edge, not 3.
JSON _meta object
When --explain is passed (or via MCP), each command’s JSON output includes a _meta object: