A fintech startup shipped a payment processing bug that duplicated charges for 11,000 customers. The fix took 20 minutes. The PR that introduced it? 1,400 lines across 23 files. Two senior engineers approved it. Neither caught the bug.
Not because they were lazy. Because the human brain has limits, and 1,400 lines blows past every single one of them.
The Numbers Nobody Wants to Hear
The well-known SmartBear study of Cisco's internal code review data found that defect detection drops off a cliff somewhere between 200 and 400 lines of code. Reviews of fewer than 200 lines caught 70-90% of defects. Past 400 lines, detection rates collapsed. Past 800, reviewers were basically scrolling and clicking approve.
The same research, covering millions of lines of reviewed code, backed this up on the time axis too: SmartBear's guidance is to cap a review session at about an hour, and attention degrades well before that ceiling. You physically cannot maintain the level of focus required to trace logic paths through a massive diff for 45 minutes straight. Your eyes glaze. You start skimming. You approve.
Everyone knows this. Almost nobody acts on it.
Why Teams Keep Shipping Monster PRs
"But the feature requires all these changes." Sure. Sometimes. But more often, the real reason is that splitting work into smaller increments feels like overhead. Writing a clean commit history, creating intermediate states that compile and pass tests, thinking about the review experience for someone else? That takes effort. Dumping a week of work into one PR is the path of least resistance.
And managers rarely push back. Sprint velocity looks better when you ship one big PR instead of five small ones. The Jira ticket gets moved to done. Everyone's happy until something breaks.
There's also the tooling problem. GitHub's diff view is genuinely bad for large PRs. Beyond ~500 lines, the page gets sluggish. File trees collapse. Context disappears. Reviewers end up clicking through files individually, losing the thread of how changes connect. GitLab handles it slightly better with its merge request threading, but the cognitive load is the same.
The Feature Branch Anti-Pattern
Long-lived feature branches are the main culprit. A developer branches off main on Monday, works in isolation for a week, then opens a PR on Friday afternoon with 2,000 lines and a description that says "implement user dashboard." The reviewer now has to reconstruct a week of decisions from a flat diff.
Trunk-based development with feature flags solves most of this. But getting a team to adopt it means changing habits, and habits are harder to change than code.
What Actually Happens During a Large PR Review
Watch someone review a 600+ line PR sometime. Really watch. They'll spend the first 5 minutes reading carefully. Then they start jumping between files. Around minute 12, they're scanning for obvious syntax issues and security red flags. By minute 20, they're checking if tests exist (not if the tests are good, just if they're present) and reaching for the approve button.
Microsoft Research's 2013 study of code review practices at Microsoft found a wide gap between perceived and actual effort: reviewers self-reported spending an hour or more on large changes, while the observed data suggested the real figure was a small fraction of that, largely regardless of diff size.
And the comments on large PRs? Mostly cosmetic. Variable naming. Formatting. Maybe a "should we add a test for this?" without actually blocking on it. The deep logic bugs, the race conditions, the subtle authorization flaws? Those require sustained attention that large PRs make impossible.
The Security Angle Most Teams Miss
Security vulnerabilities hide incredibly well in large diffs. An SQL injection buried at line 847 of a 1,200-line PR has next to no chance of getting caught in review. Stuff that Semgrep or CodeQL would flag in seconds just sails past human reviewers when there's too much noise.
After the 2021 Codecov supply chain attack, plenty of companies audited their PR review processes, and a recurring theme was that malicious changes which slipped through review tended to be buried in large, complex PRs. Attackers know this. If you want to sneak something past a reviewer, pad it with 500 lines of legitimate refactoring.
```javascript
// 847 lines into a "refactoring" PR
// looks harmless between two legitimate query changes
const query = "SELECT * FROM users WHERE email = '" + req.body.email + "'";
// reviewer is already mentally checked out by now
```
Compare that to a 30-line PR where the same vulnerability would be one of maybe eight meaningful lines to review. Night and day.
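For reference, the non-vulnerable version is barely longer. A minimal sketch, assuming a driver with placeholder support such as node-postgres (`findUserByEmailQuery` is a hypothetical helper, not a standard API); the point is that the SQL text and the user input travel separately:

```javascript
// Build the query as { text, values }, the shape node-postgres accepts.
// The email value is bound as a parameter, never spliced into the SQL
// string, so it cannot be interpreted as SQL no matter what it contains.
function findUserByEmailQuery(email) {
  return {
    text: "SELECT * FROM users WHERE email = $1",
    values: [email],
  };
}

// With pg, usage would be: await pool.query(findUserByEmailQuery(req.body.email))
```

In a 30-line PR, a reviewer sees instantly whether the query uses a placeholder or concatenation. In a 1,200-line PR, they see neither.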
Splitting PRs Without Losing Your Mind
The pushback is always the same: "splitting this would be more work." And yeah, sometimes it is. But the alternatives are worse. A few strategies that actually work in practice:
The Vertical Slice
Instead of one PR with the database migration, API endpoint, frontend component, and tests all together, ship them separately. Migration first. Then the API endpoint with its tests. Then the frontend consuming it. Each PR is reviewable in isolation. Each one ships independently. If the frontend PR has a bug, you're not also debugging the migration.
Stacked PRs
GitHub doesn't support stacked PRs natively, which is annoying. But tools like Graphite, ghstack, and spr make it workable. You write PR 1 (data layer), PR 2 depends on 1 (business logic), PR 3 depends on 2 (API). Reviewers see small, focused diffs. When PR 1 merges, 2 automatically rebases.
Graphite has reported that teams using stacked PRs reduced average review time by 40% and caught 2.3x more bugs per line of code reviewed. Vendor numbers, sure, but directionally they should make anyone reconsider their process.
The Preparatory Refactoring
Martin Fowler wrote about this years ago and most teams still ignore it. Before building the feature, open a PR that just refactors the existing code to make the feature easy to add. Rename things. Extract methods. Move files. Pure refactoring, no behavior change. Then the feature PR drops into clean code and stays small.
Two 150-line PRs instead of one 500-line PR. Both get properly reviewed. Both are easier to revert if something goes wrong.
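As a toy sketch of what that looks like (the fee math, names, and amounts here are hypothetical), the preparatory PR is a pure extraction, and the feature PR then touches one small function instead of every call site:

```javascript
// PR 1: preparatory refactor, no behavior change. The fee math was
// previously inlined at several call sites; extract it so the
// upcoming feature only has to change one place.
function calculateFee(amountCents) {
  return Math.round(amountCents * 0.029) + 30; // 2.9% + 30 cents, as before
}

function chargeTotal(amountCents) {
  return amountCents + calculateFee(amountCents);
}

// PR 2, the actual feature, is now a tiny reviewable diff:
// e.g. add a feeSchedule parameter to calculateFee.
```

The reviewer of PR 1 only has to confirm "no behavior change." The reviewer of PR 2 only has to think about the new behavior. Neither has to do both at once.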
Measuring Whether You're Getting Better
Track these. Seriously.
PR size distribution. Plot a histogram of lines changed per PR over the last quarter. If the median is above 300, you have a problem. If you see PRs regularly hitting 1,000+, those reviews are theater.
Time from PR open to first meaningful comment. Not the "LGTM" bot or the CI status. The first human comment that engages with the code. If this number is over 4 hours, reviewers are batching reviews and context-switching into them cold, which means worse feedback.
Defect escape rate. How many bugs make it to production that were introduced in PRs with reviews? Cross-reference with PR size. You'll almost certainly find that the largest PRs have the highest escape rate. That data is hard to argue with in a planning meeting.
Review comment depth. Are comments about logic and architecture, or just nitpicks about formatting? If 80% of review comments are style-related, either the team needs a formatter (just use Prettier and stop arguing) or reviewers aren't engaging deeply enough.
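The size metric is trivial to compute once you have the data. A minimal sketch (`prSizeStats` and its threshold are assumptions, not a standard tool; pulling per-PR additions and deletions from your Git host's API is left out):

```javascript
// Given PR sizes in lines changed (additions + deletions), report the
// median and the share of PRs over a soft threshold. Sample sizes are
// hypothetical; feed in a quarter of real data.
function prSizeStats(sizes, threshold = 400) {
  const sorted = [...sizes].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  const oversized = sizes.filter((s) => s > threshold).length;
  return { median, oversizedShare: oversized / sizes.length };
}
```

If `median` comes back above 300 or `oversizedShare` is more than a handful of percent, that is the planning-meeting chart.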
When Large PRs Are Actually Fine
Generated code. If a migration tool produces 2,000 lines of schema changes from a single command, wrapping that in a big PR is reasonable. The review focuses on the migration command and its configuration, not every generated line.
Dependency updates. Renovate or Dependabot PRs that bump a lockfile by 800 lines aren't the same as 800 lines of hand-written code. Review the changelog, check for breaking changes, run the tests. The diff itself is noise.
Delete-heavy PRs. Removing 1,500 lines of dead code is a fundamentally different review than adding 1,500 lines. Deletion is almost always safe (almost; watch for reflection-based usage), and the cognitive load is minimal.
For everything else? Keep it under 400 lines. Under 200 if you can manage it.
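These exceptions are easy to encode in a soft size gate. A sketch, assuming PRs carry `generated` and `dependencies` labels plus additions/deletions counts (all hypothetical field names; adapt to whatever your Git host's API returns):

```javascript
// Classify a PR for review-size purposes, applying the exceptions
// above before complaining about raw line count.
function reviewSizeVerdict(pr) {
  const total = pr.additions + pr.deletions;
  const exempt =
    pr.labels.includes("generated") ||
    pr.labels.includes("dependencies") ||
    pr.deletions > pr.additions * 3; // delete-heavy: mostly removals
  if (exempt) return "ok";
  if (total > 400) return "split";
  if (total > 200) return "borderline";
  return "ok";
}
```

Wire something like this into CI as a comment or label, not a hard block; the goal is a visible nudge, not a new fight with the linter.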
Automated Scanning Catches What Reviewers Can't
Even small PRs have limits. Humans are bad at spotting certain vulnerability patterns, especially when they span multiple files or involve framework-specific behavior. Static analysis tools don't get tired at line 200. They don't start skimming after 15 minutes.
ScanMyCode.dev runs security and code quality scans that flag exactly the issues that slip past human review: injection vectors, hardcoded secrets, vulnerable dependencies, performance bottlenecks. With exact file and line numbers, not vague warnings. Pair that with smaller PRs and you've got actual coverage instead of the illusion of it.
Stop Pretending Large Reviews Work
The data is clear. The anecdotal evidence is overwhelming. Every experienced developer has approved a large PR they didn't fully understand. Most just don't talk about it because it feels like admitting failure.
It isn't failure. It's biology. Human attention is finite, and code review processes that ignore this are broken by design.
Start measuring PR size. Set soft limits (branch protection rules won't do this for you, but a lightweight CI check or bot can flag oversized PRs without blocking them). Invest in stacking tools. Make splitting a PR a skill the team practices, not a chore people avoid.
And if you want to see what's already hiding in your codebase from the large PRs that already shipped, run a code review audit. Full report within 24 hours. Cheaper than the next production incident.