AI Slop in Code: A Systematic Approach to Reviewing AI-Generated Code
What is AI slop in code?
AI slop is code that compiles and passes tests but no human developer would have written it — characterized by happy-path-only logic, phantom utilities that duplicate existing project code, inconsistent error handling, and wrong abstractions. It accumulates when AI generates code without full project context, optimizing for 'compiles and passes tests' rather than architectural fit.
TL;DR
- AI slop: code that compiles and passes tests but accumulates architectural debt — 4 patterns: happy path only, phantom utilities, inconsistent error handling, wrong abstractions
- AI has no project context: it doesn't know about your HttpClientService, Either<Failure,T> pattern, or DI wiring through get_it
- Review integration points, not internal logic — AI usually gets logic right but misses existing project patterns
- 10-point checklist covers: reuse of existing utilities, error handling consistency, correct imports, no unused dependencies
- Partial automation possible: pre-commit hooks catch duplicated code; LLM-as-Judge scores against project conventions
I’m building a mobile app with Flutter, running the backend on Supabase Edge Functions, and using Claude Code heavily for code generation. It works well, but there’s a catch: AI code compiles and passes tests, then a couple months later you realize the codebase has accumulated a layer of code that no one would’ve written by hand.
I call this layer AI slop. Below are 4 specific patterns I’ve learned to spot during review, a 10-point checklist, and a few ways to automate detection.
Why AI generates slop in the first place
Three things are at play here.
First, the model has no project context. When you ask it to write a function, it sees the prompt and a handful of files. It doesn’t know you already have an HttpClientService in core/services/, that errors are handled via Either<Failure, T> from dartz, that DI is wired through get_it. It writes code as if it’s a greenfield project.
Second, the model optimizes for “compiles and passes tests,” not “easy to live with six months from now.” Code that duplicates existing logic or breaks architectural conventions is just as successful in its eyes as clean code.
Third, every prompt is a blank slate. Even with a large context window, the model doesn’t remember that three weeks ago you decided to replace approach X with approach Y.
Pattern 1: Happy path only
The most common pattern. AI writes the main-scenario logic and stops. Timeouts, network errors, empty responses, race conditions - either missing entirely or handled only perfunctorily.
Real code example (simplified):
// AI wrote this
Future<List<Trip>> fetchTrips(String userId) async {
final response = await supabase
.from('trips')
.select('*')
.eq('user_id', userId);
return (response as List)
.map((json) => Trip.fromJson(json))
.toList();
}
Compiles fine, mock test passes. What’s missing:
- response can be null if RLS blocks the query
- Trip.fromJson will throw on an unexpected schema
- No PostgrestException handling
- No timeout
- No distinction between “user has no trips” and “query failed”
What it should look like - nothing fancy, just explicit handling:
Future<Either<Failure, List<Trip>>> fetchTrips(String userId) async {
try {
final response = await supabase
.from('trips')
.select('*')
.eq('user_id', userId)
.timeout(const Duration(seconds: 10));
final list = response as List? ?? [];
return Right(list.map((json) => Trip.fromJson(json as Map<String, dynamic>)).toList());
} on TimeoutException {
return Left(TimeoutFailure());
} on PostgrestException catch (e) {
return Left(ServerFailure(message: e.message, code: e.code));
} catch (e) {
return Left(ServerFailure(message: e.toString()));
}
}
What to look for in review: functions returning Future<T> instead of Future<Either<Failure, T>>. try/catch blocks with empty catch (e) or catch (e) { print(e); }. Unchecked type casts (as List instead of as List?).
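Two of these signals are mechanically greppable before a human even looks at the diff. A rough shell sketch - the fixture file only exists to make it self-contained; in practice, point SRC at your lib/ directory:

```shell
#!/bin/bash
# Flag Pattern-1 signals in Dart sources: bare Future<T> returns and swallowed errors.
# The fixture below stands in for a real repo; replace SRC with your lib/ path.
SRC=$(mktemp -d)
cat > "$SRC/trips_repository.dart" <<'EOF'
Future<List<Trip>> fetchTrips(String userId) async {
  try { return load(userId); } catch (e) {}
}
EOF

# Functions returning Future<T> rather than Future<Either<Failure, T>>
HAPPY=$(grep -rnE 'Future<[A-Za-z]' "$SRC" | grep -v 'Future<Either' || true)
# Empty catch blocks that silently swallow errors
SWALLOWED=$(grep -rn 'catch (e) {}' "$SRC" || true)

if [ -n "$HAPPY" ]; then echo "happy-path returns:"; echo "$HAPPY"; fi
if [ -n "$SWALLOWED" ]; then echo "swallowed errors:"; echo "$SWALLOWED"; fi
```

Naive regexes, so treat hits as prompts to look closer, not verdicts - a legitimate Future<void> fire-and-forget will also trip the first grep.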
Pattern 2: Architectural drift
AI solves tasks in isolation, even when the project already has tooling for it. You end up with duplicates and inconsistent approaches to the same problem across the codebase.
I needed a date formatting function. AI wrote:
// In features/trip/presentation/widgets/trip_card.dart
class DateFormatter {
static String format(DateTime date) {
final day = date.day.toString().padLeft(2, '0');
final month = date.month.toString().padLeft(2, '0');
return '$day.$month.${date.year}';
}
}
Problem: core/utils/date_utils.dart already had:
extension DateTimeExtensions on DateTime {
String toDisplayFormat() => '${day.toString().padLeft(2, '0')}.${month.toString().padLeft(2, '0')}.$year';
}
Now there are two date formatters in the codebase. Some widgets use DateFormatter.format(date), others use date.toDisplayFormat(). A month later the designer asks for a new date format, and you discover you need to change it in two places.
It gets worse when the drift hits infrastructure:
// AI wrote in a new Edge Function
const supabase = createClient(
Deno.env.get('SUPABASE_URL')!,
Deno.env.get('SUPABASE_ANON_KEY')!
);
Meanwhile _shared/supabase-client.ts already had:
export function createSupabaseClient(req: Request) {
return createClient(
Deno.env.get('SUPABASE_URL')!,
Deno.env.get('SUPABASE_ANON_KEY')!,
{
global: { headers: { Authorization: req.headers.get('Authorization')! } },
auth: { persistSession: false }
}
);
}
The new function works, but skips the Authorization header - it runs with anon permissions, ignoring the user’s JWT. RLS won’t kick in. This bug won’t show up in tests. It’ll show up when a user sees someone else’s data or can’t see their own.
What to look for in review: new classes/functions that duplicate logic from core/, _shared/, utils/. Direct import 'package:http/http.dart' where a centralized client should be used. SDK initialization (Supabase, Mapbox, HTTP client) inside feature code instead of DI.
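The grep-first habit is worth making concrete. Before accepting a new helper, search core/ and _shared/ for the same concern - a self-contained sketch, with fixture files mirroring the DateFormatter example above (use your real project root in practice):

```shell
#!/bin/bash
# Does a new feature-level helper duplicate something already in core/?
# Fixture mirrors the DateFormatter example; substitute your project root.
SRC=$(mktemp -d)
mkdir -p "$SRC/lib/core/utils" "$SRC/lib/features/trip"
echo "extension DateTimeExtensions on DateTime { String toDisplayFormat() => ''; }" \
  > "$SRC/lib/core/utils/date_utils.dart"
echo "class DateFormatter { static String format(DateTime d) => ''; }" \
  > "$SRC/lib/features/trip/trip_card.dart"

# Keywords for the task at hand (date formatting, in this case)
KEYWORDS='DateFormatter\|toDisplayFormat\|formatDate'
NEW=$(grep -rl "$KEYWORDS" "$SRC/lib/features" || true)
EXISTING=$(grep -rl "$KEYWORDS" "$SRC/lib/core" || true)

if [ -n "$NEW" ] && [ -n "$EXISTING" ]; then
  echo "possible duplicate:"
  echo "  new:      $NEW"
  echo "  existing: $EXISTING"
fi
```

Run it interactively during review; the CI variant later in this article does the same check automatically on new files.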
Pattern 3: Over-abstraction
AI loves abstractions. Abstract classes, interfaces, and factories show up in “good code” in training data, so the model reproduces them without considering whether they’re needed here.
The task was simple - load a photo for a POI and display it. AI proposed:
abstract class ImageLoader {
Future<Uint8List?> load(String url);
}
class NetworkImageLoader implements ImageLoader {
@override
Future<Uint8List?> load(String url) async {
// ...
}
}
class CachedImageLoader implements ImageLoader {
final ImageLoader delegate;
final Map<String, Uint8List> _cache = {};
CachedImageLoader({required this.delegate});
@override
Future<Uint8List?> load(String url) async {
if (_cache.containsKey(url)) return _cache[url];
final result = await delegate.load(url);
if (result != null) _cache[url] = result;
return result;
}
}
class ImageLoaderFactory {
static ImageLoader create({bool cached = true}) {
final loader = NetworkImageLoader();
return cached ? CachedImageLoader(delegate: loader) : loader;
}
}
The actual task is one line:
CachedNetworkImage(imageUrl: photoUrl)
Because cached_network_image is already in pubspec.yaml. Already used in 12 other widgets. AI didn’t know that. It solved the problem in a vacuum.
4 classes instead of 1 line. Each of those classes needs testing, maintenance, and explaining to new developers.
What to look for: inheritance chains for tasks that a standard library or an existing package already handles. Factory/Strategy/Builder patterns where a plain function would do. An abstract class with a single implementation and no clear reason for polymorphism.
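The single-implementation check is also scriptable. A rough sketch for Dart sources - the regexes are deliberately naive, so hits are candidates to question, not automatic rejections:

```shell
#!/bin/bash
# List abstract classes with at most one implementation in the codebase.
# Fixture stands in for lib/; point SRC at your real source tree.
SRC=$(mktemp -d)
cat > "$SRC/image_loader.dart" <<'EOF'
abstract class ImageLoader {}
class NetworkImageLoader implements ImageLoader {}
EOF

REPORT=$(
  for CLS in $(grep -rhoE 'abstract class [A-Za-z0-9_]+' "$SRC" | awk '{print $3}' | sort -u); do
    IMPLS=$(grep -rhE "(implements|extends).*\b$CLS\b" "$SRC" | wc -l | tr -d ' ')
    if [ "$IMPLS" -le 1 ]; then
      echo "$CLS: $IMPLS implementation(s) - is the abstraction earning its keep?"
    fi
  done
)
echo "$REPORT"
```

An ImageLoader hierarchy like the one above shows up immediately: one abstract class, one implementation, no polymorphism.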
Pattern 4: Confident hallucination
The code looks normal, but AI invented an API call, SDK method, or configuration that doesn’t exist. Or it existed in an older version.
TypeScript Edge Functions example:
// AI wrote this for IP geolocation
const location = await supabase.functions.geo.lookup(clientIp);
supabase.functions.geo doesn't exist in the Supabase JS SDK. Never has. The code can still pass checks if the client is loosely typed (an any-typed client won't flag the property access), but it'll crash in production.
Flutter example:
// AI wrote this for a custom Mapbox marker
final marker = mapboxMap.annotations.createSymbolAnnotation(
SymbolAnnotationOptions(
geometry: Point(coordinates: Position(lng, lat)),
iconImage: 'custom-marker',
iconAnchor: IconAnchor.BOTTOM,
),
);
IconAnchor.BOTTOM is the correct name in the website docs. In the SDK version installed in the project (mapbox_maps_flutter: ^2.5.0), the enum is IconAnchor.bottom. Case difference. Compilation error, but AI didn’t know which SDK version was in the project.
It can get worse - hallucinated behavior:
// AI wrote: "on rate limit error, Supabase automatically retries after 5 seconds"
// This is false. Supabase does not do automatic retries.
// AI just confidently lied in a comment.
What to look for: method calls you haven’t seen before - verify through docs and cmd+click. Comments describing library behavior (“automatically,” “by default,” “built-in support”) - verify them. Enum values and constants - check against the SDK version in pubspec.yaml or package.json.
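For the version check specifically, trust the lockfile over the website docs. A sketch - the fixture below mimics a pubspec.lock entry so the snippet is runnable; run the awk line against your real lockfile, and note the pub-cache path in the comment is the standard layout but may differ on your machine:

```shell
#!/bin/bash
# Resolve the *installed* SDK version before trusting enum names from web docs.
# Fixture mimics a pubspec.lock entry; run the awk against your real lockfile.
LOCK=$(mktemp)
cat > "$LOCK" <<'EOF'
  mapbox_maps_flutter:
    dependency: "direct main"
    version: "2.5.0"
EOF

# Find the package entry, then the first version: line inside it
VERSION=$(awk '/^  mapbox_maps_flutter:/{f=1} f && /version:/{gsub(/[ "]/,""); split($0,a,":"); print a[2]; exit}' "$LOCK")
echo "resolved mapbox_maps_flutter $VERSION"

# Then grep the installed package source for the symbol, e.g.:
#   grep -rn 'enum IconAnchor' ~/.pub-cache/hosted/pub.dev/mapbox_maps_flutter-$VERSION/lib
```

Thirty seconds of grepping the installed source would have caught the IconAnchor.BOTTOM vs IconAnchor.bottom mismatch before compilation did.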
Systematic checklist for reviewing AI code
You can print this out or keep it open during review.
AI CODE REVIEW - CHECKLIST
CONTEXT
[ ] 1. Does the code use existing utilities from core/ / _shared/?
Grep for task keywords before analyzing the new code.
[ ] 2. Does the code follow the project's architectural patterns?
(DI, Either<Failure,T>, naming, layers)
[ ] 3. Does the code import packages already in pubspec.yaml?
Or does it reimplement something that already exists.
CORRECTNESS
[ ] 4. Are all error paths handled?
Network timeout, null response, unexpected data schema.
[ ] 5. Are edge cases covered?
Empty list, negative numbers, concurrent requests.
[ ] 6. Are API calls and SDK methods verified?
Does the method exist? In this version? With the right parameters?
QUALITY
[ ] 7. Are all abstractions justified?
Every interface/abstract class has >= 2 implementations or a clear reason.
[ ] 8. Is there logic duplication?
Compare with existing code, especially formatting, validation, network requests.
[ ] 9. Are comments truthful?
Especially those describing library or external service behavior.
SECURITY
[ ] 10. Does the code bypass auth/RLS?
Direct DB queries without user JWT, hardcoded credentials, missing permission checks.
Automation: catching slop before code review
Some problems can be caught automatically, before review even starts.
Git pre-commit hook
A simple bash hook that catches the most obvious signs:
#!/bin/bash
# .git/hooks/pre-commit
STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(dart|ts)$')
if [ -z "$STAGED_FILES" ]; then
exit 0
fi
ERRORS=0
for FILE in $STAGED_FILES; do
# Direct Supabase init without shared client
if grep -n "createClient(Deno.env" "$FILE" | grep -v "_shared" > /dev/null 2>&1; then
echo "Warning: $FILE: direct createClient outside _shared/"
ERRORS=$((ERRORS + 1))
fi
# Flutter: direct HTTP imports without centralized client
if grep -n "import 'package:http/http.dart'" "$FILE" > /dev/null 2>&1; then
echo "Warning: $FILE: direct http import, use DioClient from core/"
ERRORS=$((ERRORS + 1))
fi
# Empty catch blocks
if grep -n "catch (e) {}" "$FILE" > /dev/null 2>&1; then
echo "Warning: $FILE: empty catch block"
ERRORS=$((ERRORS + 1))
fi
done
if [ $ERRORS -gt 0 ]; then
echo ""
echo "Found $ERRORS potential AI-slop patterns."
echo " Check the files before committing."
exit 1
fi
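One operational note that's easy to miss: Git only runs the hook if the file is executable and sits at exactly .git/hooks/pre-commit. Demonstrated against a throwaway directory standing in for a repo - in your actual project, the last two commands are all you need:

```shell
#!/bin/bash
# Git silently skips hooks that are not executable.
# mktemp/mkdir stand in for an existing repo's .git directory.
REPO=$(mktemp -d)
mkdir -p "$REPO/.git/hooks"

# Install the hook (real content elided here) and make it executable
printf '#!/bin/bash\nexit 0\n' > "$REPO/.git/hooks/pre-commit"
chmod +x "$REPO/.git/hooks/pre-commit"
ls -l "$REPO/.git/hooks/pre-commit"
```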
Custom lint rules for Dart
package:custom_lint lets you write rules specific to your project:
// tool/lints/lib/avoid_direct_supabase_init.dart
class AvoidDirectSupabaseInit extends DartLintRule {
const AvoidDirectSupabaseInit() : super(code: _code);
static const _code = LintCode(
name: 'avoid_direct_supabase_init',
problemMessage: 'Use SupabaseClientService from core/services instead of direct initialization',
);
@override
void run(CustomLintResolver resolver, ErrorReporter reporter, CustomLintContext context) {
context.registry.addMethodInvocation((node) {
if (node.methodName.name == 'initialize' &&
node.target?.toString() == 'Supabase') {
reporter.reportErrorForNode(_code, node);
}
});
}
}
This catches Supabase.initialize(...) in feature code where DI should be used.
Dependency check in CI
A simple script to verify new files don’t create duplicate utilities:
#!/bin/bash
# scripts/check_duplicates.sh
# Check that new dart files don't reimplement what already exists in core/utils/
NEW_FILES=$(git diff origin/main...HEAD --name-only --diff-filter=A | grep "\.dart$")
for FILE in $NEW_FILES; do
# Look for date formatters
if grep -l "DateFormat\|DateTime\|toDisplayFormat\|formatDate" "$FILE" > /dev/null 2>&1; then
EXISTING=$(grep -rl "DateFormat\|toDisplayFormat\|formatDate" lib/core/ 2>/dev/null)
if [ -n "$EXISTING" ]; then
echo "Warning: $FILE creates a date formatter, but core/ already has:"
echo " $EXISTING"
fi
fi
done
FAQ
How does AI slop accumulate differently in solo projects versus team codebases, and does team size change the detection strategy?
In solo projects, slop accumulates silently because there is no review gate — only the original author sees the code, and they often wrote the context-lacking prompt that created the problem. In team codebases, slop is more likely to be caught in PR review but also more likely to proliferate: each developer independently generates code that doesn’t know about work their teammates did last week. For teams, the pre-commit hooks and CI duplicate-detection scripts in this article are higher-priority investments than for solo work, where CLAUDE.md quality has a disproportionately larger impact.
Is there a point at which a codebase has accumulated so much AI slop that a full audit is more cost-effective than incremental review?
The trigger threshold is roughly when 3 or more of the 4 slop patterns appear in every significant PR without pre-commit hooks catching them. At that point, a focused architectural audit of the core/, _shared/, and utils/ directories — cataloguing existing utilities that AI keeps duplicating — takes 4–8 hours but pays back in reduced review overhead within 2–3 sprints. The audit output is a reference document you feed AI at the start of each session: “existing utilities: [list with file paths and purposes].”
Do different LLMs produce different types of slop, or are the 4 patterns consistent across Claude, GPT-4o, and Gemini?
The 4 patterns appear across all major models, but the frequency differs. Over-abstraction (Pattern 3) is more prevalent with GPT-4o, which has a stronger tendency toward class hierarchies. Happy-path-only code (Pattern 1) is consistent across all models and correlates with prompt specificity — the more explicit you are about error handling requirements, the less this appears. Confident hallucination (Pattern 4) varies by domain: Gemini hallucinates less on well-documented public APIs but more on niche SDK versions. Running multi-agent review (mentioned in the related AI code review article) catches model-specific blind spots.
Living with it
AI agents are getting smarter with every update and already do a decent job of accounting for project context. But slop still shows up, especially in larger codebases with their own conventions, non-standard patterns, and a history of decisions.
In practice this shifts the focus of review. AI usually writes the logic correctly. The problems are in how new code fits into the existing system: what dependencies it uses, whether it duplicates what’s already there, whether it handles errors properly.
What works for me: before a complex task, I explicitly give Claude a list of files and patterns to follow. During review, I look at integration points first, not the logic inside the function. And I don’t accept code without running it - hallucinations only surface at runtime.
The checklist and hooks above cover most of what I’ve run into. The rest is project-specific: your architectural decisions, packages, conventions. Worth adding to the checklist as you discover them.
Frequently Asked Questions
What is AI slop in code?
AI slop is code that compiles and passes tests but no human would have written it. It includes unnecessary abstractions, duplicated utilities that already exist in the project, inconsistent error handling, and phantom dependencies — patterns that accumulate when AI generates code without full project context.
How do you detect AI-generated code problems during review?
Focus on integration points, not internal logic. AI usually gets the logic right but misses existing project patterns. Check: does it reuse existing utilities? Does it follow the project's error handling pattern? Does it import the right dependencies? A 10-point checklist catches most issues.
Can AI code review be automated?
Partially. Pre-commit hooks can catch duplicated code and unused imports. LLM-as-Judge can score code against project conventions. But architectural fit — whether new code belongs in the existing system — still requires human judgment during review.
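A minimal sketch of that LLM-as-Judge idea, assuming the Claude Code CLI's non-interactive print mode (claude -p); the conventions file and prompt wording are illustrative, not a fixed API:

```shell
#!/bin/bash
# Assemble an LLM-as-Judge prompt from a project conventions file.
# Conventions and prompt wording are illustrative assumptions;
# `claude -p` is Claude Code's non-interactive print mode.
CONVENTIONS=$(mktemp)
cat > "$CONVENTIONS" <<'EOF'
- Errors are returned as Either<Failure, T>, never thrown from repositories
- Dates are formatted via DateTimeExtensions.toDisplayFormat() in core/utils
- Supabase clients come from _shared/supabase-client.ts, never created inline
EOF

PROMPT="Score this diff 1-5 for fit with these conventions and list every violation:
$(cat "$CONVENTIONS")"
echo "$PROMPT"

# Wired into review, something like:
#   git diff --cached | claude -p "$PROMPT"
```

Keep the conventions file short and concrete (file paths, pattern names); vague rules like "write clean code" give the judge nothing to score against.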