
How I Test Laravel Applications in Production Teams
A refactor of the invoice generation flow came across my desk last quarter. The PR added a clean new abstraction: an InvoiceService with a generate(Order $order) method that handled PDF rendering, the mail dispatch, and the accounting sync. The unit tests mocked the whole service and asserted that generate() was called once with the right order. CI was green. Two reviewers signed off. The deploy went out.
Two weeks later, a customer reported they had never received an invoice email for an order they had placed. Their account showed the order as paid. The PDF existed in S3. The accounting sync had run. The email simply never went out. The bug, when we found it, was a single line in the new service: the mail dispatch had been commented out during an earlier draft of the refactor and never put back. Nothing in the test suite caught it because the test suite was checking that the service's method was called, not that the service did anything useful.
That incident is the entire reason this article exists. Passing tests do not necessarily mean working software. A green CI pipeline that asserts the wrong things is not a safety net. It is a coat of paint on a wall with a hole behind it. The actual question every test should answer is whether the business outcome happened, and almost everything below comes back to that one frame.
The Biggest Myth About Testing
The myth is that more tests are better. The follow-on myth is that coverage percentage is a meaningful number. Both are wrong in the specific way that lets entire teams ship a suite that gives them comfort but does not give them protection.
I have worked on Laravel codebases with 92% coverage that paged somebody every weekend. I have worked on codebases with 60% coverage that ran for years without an incident. The difference was not the number. The difference was what the tests were actually verifying.
The question that matters at review time is not "did the author add tests." It is "do these tests protect the business outcome." A test that asserts a method was called proves the wiring works. It does not prove the wiring does anything useful at the other end. A test that asserts the database row changed, the email was sent, the queue job was dispatched, and the policy denied the wrong user is a test that protects something a customer can perceive.
Most teams I have joined did not have a testing problem in the abstract. They had a confidence problem. They had a suite, the suite passed, and nobody trusted it enough to refactor anything substantial. That is not a tooling failure. That is a strategy failure.
My Testing Philosophy
A short list of principles that have held up across every Laravel app I have shipped. None of them are novel. All of them are easy to ignore.
- Test behavior, not implementation. A test that breaks every time I rename a private method is a test that punishes refactoring. A test that breaks when the public outcome changes is a test that protects me.
- Protect business rules first. Pricing, eligibility, authorization, payment flows, anything where a wrong answer costs money or trust. Those tests come before anything cosmetic.
- Confidence over coverage. The metric I care about is "can I delete 500 lines and trust the suite to tell me if I broke something." Coverage percentage is a poor proxy for that. The shape of the tests is the actual signal.
- Fast feedback beats perfect coverage. A suite that runs in twenty seconds gets run on every commit. A suite that runs in twenty minutes gets run after the third "I will check it after lunch."
- Production bugs are uncovered tests. Every incident points at a test that should have existed. The fix is rarely a one-liner. The discipline is to write the test alongside the fix.
Each of these took a real incident to make me internalize it. The principles are the scars, not the syllabus.
What I Actually Unit Test
Unit tests are the most overused tool in the Laravel testing landscape. They are fast, they are easy to write, and they end up dominating suites that should be doing more feature-level work. I keep unit tests for things where the unit really is the boundary: pure functions that have no Laravel in them.
Good candidates:
- Domain logic that operates on plain values (price calculators, tax rules, discount engines).
- Pure service classes with no framework dependencies.
- Value objects (a Money class, a DateRange class).
- Validation rules whose logic is non-trivial.
- Helper classes that transform data without touching the database.
Bad candidates (almost universally):
- Tests that assert Hash::make() returns a string. That is Laravel's job, not mine.
- Tests that confirm an Eloquent hasMany relationship exists. The relationship is a one-line declaration; the framework guarantees it works.
- Tests that mock the database and then assert a save call happened.
- Tests that wrap a setter and assert the value got set.
The line I use: if the test would still be useful in a non-Laravel project, it is probably a real unit test. If it only makes sense because it is exercising Laravel itself, the test is a maintenance liability.
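A minimal sketch of a unit that passes that check: a discount rule that works on plain integers (amounts in cents) and has no Laravel in it. The class name, the 10% rate, and the 100.00 threshold are hypothetical, chosen only for illustration.

```php
<?php

// Hypothetical domain rule: 10% off orders of 100.00 or more, amounts in cents.
final class BulkDiscount
{
    public function __construct(
        private int $thresholdCents = 10_000,
        private int $percentOff = 10,
    ) {}

    /** Returns the total after discount, rounded down to a whole cent. */
    public function apply(int $subtotalCents): int
    {
        if ($subtotalCents < 0) {
            throw new InvalidArgumentException('Subtotal cannot be negative.');
        }

        // Below the threshold the subtotal passes through untouched.
        if ($subtotalCents < $this->thresholdCents) {
            return $subtotalCents;
        }

        return intdiv($subtotalCents * (100 - $this->percentOff), 100);
    }
}
```

The tests worth writing here sit on the edges: one cent below the threshold, exactly on it, a subtotal that forces rounding, and a negative input. None of them needs a database, a container, or a mock.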
One adjacent category worth a brief mention: Pest's architecture tests. They are not behavior tests. They are static-style assertions about the codebase itself: no model may import a controller, every Job implements ShouldQueue, nothing in the domain layer depends on a vendor SDK. They run at unit-test speed and catch convention drift that code review misses on busy weeks. They are not a substitute for business tests; treat them as cheap insurance against the team forgetting the shape it agreed to.
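The three conventions named above might look roughly like this in a Pest suite with architecture testing available (Pest 2 or later). The namespaces (App\Models, App\Http\Controllers, App\Jobs, App\Domain) and the choice of Stripe as the vendor SDK are assumptions about a typical app layout, not details from any codebase described here.

```php
<?php

use Illuminate\Contracts\Queue\ShouldQueue;

// Assumed namespaces; adjust to the application's actual layout.
arch('models do not import controllers')
    ->expect('App\Models')
    ->not->toUse('App\Http\Controllers');

arch('every job is queueable')
    ->expect('App\Jobs')
    ->toImplement(ShouldQueue::class);

arch('the domain layer does not depend on a vendor SDK')
    ->expect('App\Domain')
    ->not->toUse('Stripe'); // 'Stripe' stands in for any vendor namespace
```

These run inside the normal Pest run, so they need a Laravel project with Pest installed; they are not standalone scripts.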
Feature Tests Are Where I Spend Most of My Time
If I had to delete every test in a codebase except one category, I would keep the feature tests. They are the only kind that exercises the actual stack the way production does: a request hits the router, the middleware runs, the form request validates, the controller delegates, the policy authorizes, the database is touched, the response is built, the side effects fire. Everything important either works or it does not.
A feature test for a Sanctum-protected subscription endpoint, in Pest:
use Illuminate\Support\Facades\Notification;
use Illuminate\Support\Facades\Queue;
use Laravel\Sanctum\Sanctum;
use function Pest\Laravel\deleteJson;

it('lets a customer cancel their own subscription', function () {
    // Fake both facades up front so the assertions below have something to inspect.
    Queue::fake();
    Notification::fake();

    $subscription = Subscription::factory()
        ->for(User::factory())
        ->active()
        ->create();

    Sanctum::actingAs($subscription->user, ['subscriptions:write']);
    deleteJson("/api/subscriptions/{$subscription->id}")
        ->assertNoContent();

    expect($subscription->fresh()->status)->toBe('cancelled');
    Queue::assertPushed(StopBillingJob::class);
    Notification::assertSentTo($subscription->user, SubscriptionCancelledNotification::class);
});
In one test I have verified routing, Sanctum middleware, the token ability check, the controller, the Policy, the database transition, the queued job dispatch, and the notification fan-out. The same coverage in unit tests would be eight or nine tests, all of which would have to mock half the system, and none of which would have caught a bug in the wiring between them.
The endpoints I cover with feature tests, in rough priority order:
- Authentication and login flows (including failure paths).
- Authorization on every endpoint that touches a resource that has owners. The authorization post has the framing for what good Policy coverage looks like.
- Anything related to billing or money.
- Checkout and order placement.
- API endpoints with public callers.
- Multi-tenant access boundaries (the kind of bug the PR review post covers in the cross-tenant Policy story).
- CRUD for any resource where the wrong row being affected would matter.
Production story: the test that caught the cross-tenant leak in CI
A PR added a "share invoice" feature. The diff updated the InvoicePolicy@view method to allow access if the user appeared in invoice_shares. The author had also added a feature test that explicitly asserted a user on Team B could not see an invoice on Team A, even when shared with another user on Team A. The test failed in CI because the new Policy rule did not scope the share check to the current team. The author tightened the rule, the test passed, and the bug never reached production.
The same shape of bug had reached production a year earlier in a different module. That time, the cleanup took two days and a customer trust conversation. This time, the test caught it in fifteen minutes and the PR shipped clean. The difference was that somebody had decided to write that specific assertion. Cross-tenant tests are a category I now ask for explicitly on every PR that touches authorization.
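The tightened rule can be sketched without the framework. This is a plain-PHP approximation under assumptions: the User and Invoice shapes below are hypothetical stand-ins for the Eloquent models, and the array of shared user ids stands in for a lookup against invoice_shares.

```php
<?php

// Hypothetical, framework-free stand-ins for the Eloquent models.
final class User
{
    public function __construct(public int $id, public int $teamId) {}
}

final class Invoice
{
    /** @param int[] $sharedWithUserIds user ids listed in invoice_shares */
    public function __construct(
        public int $ownerId,
        public int $teamId,
        public array $sharedWithUserIds = [],
    ) {}
}

final class InvoicePolicy
{
    public function view(User $user, Invoice $invoice): bool
    {
        // Tenant boundary first: nothing below can grant cross-team access.
        if ($user->teamId !== $invoice->teamId) {
            return false;
        }

        return $user->id === $invoice->ownerId
            || in_array($user->id, $invoice->sharedWithUserIds, true);
    }
}
```

The assertion that matters is the one that looks redundant: a user who appears in the share list but belongs to another team must still be denied. Without the team check at the top, that case returns true.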
Integration Tests
Integration tests are for the boundaries the application does not own. Stripe. SendGrid. S3. The internal API of a sister service. Anything where "did our code call the right method" is not the question, because the real question is "did the other side accept our request and respond the way we expected."
The rules I follow on these:
- Use the framework's fake first. Http::fake, Storage::fake, Mail::fake, Notification::fake, Queue::fake, Bus::fake, Event::fake. These fakes intercept the call and let you assert what was sent, which is usually exactly what you want.
- Do not fake when the bug lives in the integration. If the question is whether the Stripe webhook payload is parsed correctly, faking Stripe is the wrong move. Build a fixture from a real Stripe payload, replay it through the controller, assert the result. Faking the SDK would test our knowledge of our own mock; replaying the fixture tests our knowledge of theirs.
- Sandbox accounts for the boundary tests that need a real round-trip. One or two end-to-end tests per integration, run nightly, against the vendor's sandbox. They are slow, they are flaky in the way that real networks are flaky, and they catch the kind of bug nothing else catches.
The queue caveat deserves its own paragraph. The sync driver, which Laravel's stock phpunit.xml selects for the test environment, runs every dispatched job inline in the same process, with no retry attempt, no visibility timeout, no failed-job table interaction, and no chance of duplicate execution. Production queues retry, time out, record failures, and occasionally run a job twice; the sync driver lies about all of that. A sync-backed test proves the job runs once. It says nothing about whether the job is safe to run twice, how it behaves on the third retry, or what the failed-job handler does when the worker dies mid-execution. For anything queue-sensitive, I use Queue::fake() to assert dispatch and drive the job manually, or I run against a real driver in a worker process that mirrors production semantics. The queues post covers the idempotency discipline this protects.
What I Never Test
Every unnecessary test has a maintenance cost. It runs on every commit, breaks on every unrelated refactor, and gives the team nothing useful when it does break. The list of things I do not bother testing has grown longer over the years.
- Laravel framework internals. If a test would fail only because the framework itself changed, the framework's own test suite covers it.
- Trivial Eloquent relationships. A hasMany declaration is a one-liner that the framework guarantees works. I test the business rules that use the relationship, not the relationship itself.
- Third-party packages. If I do not own the code, I trust the package's own tests for its behavior. I only test how my code uses the package.
- Getters, setters, and trivial accessors. An accessor that returns $this->first_name . ' ' . $this->last_name is not a test target. It is plain code.
- Pure CRUD with no business rule. An admin endpoint that lets a superuser edit a category's name has no logic worth testing. The Policy that gates it is worth testing. The string update is not.
- Internal implementation details. Private methods. Specific class structure. Specific service composition. Tests that assert these break when I refactor and stop me from refactoring.
The single question I ask before writing a test: what bug would this test catch? If I cannot name a concrete bug, the test is decorative. Decorative tests slow CI down, raise the false-positive rate, and dilute the suite's signal.
Common Testing Mistakes I See
A non-exhaustive list of the patterns that turn a test suite from an asset into a liability.
- Testing implementation details. Asserting that a private method was called, that a specific Eloquent query was built, that the controller delegated to a particular service. All of these break the moment the implementation is refactored without changing behavior, which means they punish the team for doing exactly the thing the suite should be enabling.
- Overusing mocks. Mocks were the bug in the opening story. A mock that asserts "this method was called" without asserting the method does anything is a test that proves your wiring exists. Wiring failures are rare. Logic failures are common.
- Skipping the unhappy path. Most tests cover the success case. The cases that break in production are the timeouts, the invalid input, the unauthorized user, the soft-deleted resource, the duplicate webhook delivery. Every test of a happy path needs at least one test of the matching failure path.
- Not testing authorization. A controller without a feature test asserting "an unauthorized user gets a 403" is a feature one PR away from being world-readable. Authorization tests are the cheapest insurance you can buy.
- Brittle assertions. Asserting on the exact text of an error message instead of its type. Asserting on the exact order of an array when order does not matter. Asserting on JSON structure with no flexibility for added optional fields. These tests fail for cosmetic changes and train the team to ignore failing tests.
- Huge fixture setup. A test that needs forty model factories to set up has a design smell, not a testing smell. The code under test is probably doing too much. Smaller, more focused factories with the state() pattern keep the setup readable.
- Tests depending on execution order. If your tests fail when run in a different order, the suite has hidden state. Use RefreshDatabase, isolate setup, and never rely on data from a previous test.
- Sleeping in tests. sleep(2) in a test means the author did not know how to wait for the actual condition. It is slow, it is flaky, and it teaches the next engineer that sleep is normal. Use Carbon::setTestNow(), Queue::assertPushed after dispatch, and assertion-based waiting on real events.
- Database state leaking between tests. Without RefreshDatabase or transactions, one failing test corrupts the next. The cure is non-negotiable: every test starts from a known state, and the framework's traits exist precisely for this.
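For the sleep() item, the underlying idea is to control time instead of waiting for it. In a Laravel test that job belongs to Carbon::setTestNow(); the sketch below shows the same principle in plain PHP with a hand-rolled clock, so it illustrates the idea and is not the framework's API. Clock, FrozenClock, TrialPeriod, and the 14-day rule are all hypothetical.

```php
<?php

interface Clock
{
    public function now(): DateTimeImmutable;
}

// Test double: time only moves when the test says so.
final class FrozenClock implements Clock
{
    public function __construct(private DateTimeImmutable $now) {}

    public function now(): DateTimeImmutable
    {
        return $this->now;
    }

    public function advance(string $interval): void
    {
        $this->now = $this->now->modify($interval);
    }
}

// Hypothetical rule under test: a trial expires 14 days after it starts.
final class TrialPeriod
{
    public function __construct(
        private DateTimeImmutable $startedAt,
        private Clock $clock,
    ) {}

    public function hasExpired(): bool
    {
        return $this->clock->now() >= $this->startedAt->modify('+14 days');
    }
}
```

A test advances the clock by thirteen days, asserts the trial is still live, advances one more day, and asserts it has expired. Fourteen days of behavior are covered without a single real second of waiting.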
My Test Pyramid (Or Why I Don't Worship It)
The classic test pyramid recommends many unit tests, fewer feature tests, very few end-to-end tests. The diagram is everywhere. It is also wrong for most Laravel applications I have worked on.
The actual balance I see in healthy production Laravel codebases is closer to 20% unit, 60% feature, 20% integration. Feature tests carry the weight because Laravel is a framework where the interesting bugs live at the seams between components: route binding, middleware, Policy resolution, validation, queue dispatch, observer side effects. Unit tests cannot see those seams.
The pyramid is not wrong as a concept. It is wrong as a recipe applied across every stack regardless of where the bugs actually live. In a domain-heavy library with no framework, the pyramid is right. In a Laravel web app that mostly orchestrates HTTP, persistence, and async work, the inversion is closer to reality.
The shape that matters is not the pyramid. It is the question: which tests catch the most production bugs per minute spent maintaining them? For Laravel apps, the answer is overwhelmingly feature tests.
Speed Matters More Than People Think
A test suite that runs in twenty seconds gets run on every save. A test suite that runs in twenty minutes gets run "when CI does it." The behavior difference is not laziness. It is psychology. A fast suite is a tool the team reaches for. A slow suite is something to schedule around.
The practical levers I lean on:
- Parallel testing. php artisan test --parallel spreads tests across cores. On a four-core CI runner with a properly isolated database (one schema per worker), this is the single biggest speedup. Just make sure tests do not write to the same shared external resources.
- RefreshDatabase over DatabaseMigrations. RefreshDatabase migrates once and then wraps each test in a transaction that rolls back; DatabaseMigrations re-runs every migration per test. The latter is correct on the rare PR that changes migrations; the former is right for everything else.
- Factories over seeders for test data. Seeders are for environment setup. Factories are for tests. A test that calls a full seeder is loading data it does not need.
- Pest over PHPUnit for new suites. Pest is faster to write and reads better, which lowers the activation energy for adding the next test. PHPUnit is fine for legacy suites; converting an entire suite is rarely worth it.
- HTTP fakes over real network calls. Even an "instant" network call is slow when run thousands of times. Fake the boundary, assert on the request, and keep the real call in the nightly integration suite.
- Profile the suite occasionally. --profile in Pest, slow-test reporting in PHPUnit. A single bad fixture or a missed mock can quietly add minutes to the run.
A two-minute suite is the line for me. Past two minutes, developers stop running it locally. Once developers stop running it locally, bugs reach CI more often, the loop slows down, and the team's trust in the suite erodes. Keeping the suite under that threshold is engineering work, not optimization.
A Real Production Example
Take a subscription cancellation feature. The customer hits DELETE /api/subscriptions/{id}. We expect: the subscription status flips to cancelled, billing stops on the next cycle, the customer gets a confirmation notification, downstream services are told, and the user loses premium access immediately.
The bad test, which I have seen ship more than once:
use function Pest\Laravel\actingAs;

it('calls the cancellation service', function () {
    // The real service is swapped for a mock, so none of its logic runs.
    $service = $this->mock(SubscriptionService::class);
    $service->shouldReceive('cancel')->once();

    $subscription = Subscription::factory()->active()->create();

    actingAs($subscription->user)
        ->deleteJson("/api/subscriptions/{$subscription->id}")
        ->assertNoContent();
});
That test proves the controller calls cancel(). It proves nothing about what cancel() does. The service could be empty, throw silently inside a try/catch, or send the wrong notification. The test would still pass.
The good test, which is barely longer:
use Illuminate\Support\Facades\Notification;
use Illuminate\Support\Facades\Queue;
use Laravel\Sanctum\Sanctum;
use function Pest\Laravel\deleteJson;

it('cancels a subscription end to end', function () {
    Notification::fake();
    Queue::fake();

    $subscription = Subscription::factory()
        ->active()
        ->hasFeatures(['premium-dashboard'])
        ->create();

    Sanctum::actingAs($subscription->user, ['subscriptions:write']);

    deleteJson("/api/subscriptions/{$subscription->id}")
        ->assertNoContent();

    // Domain state changed.
    expect($subscription->fresh()->status)->toBe('cancelled');
    expect($subscription->fresh()->cancelled_at)->not->toBeNull();

    // Billing actually stops.
    Queue::assertPushed(StopBillingJob::class, fn ($job) =>
        $job->subscriptionId === $subscription->id
    );

    // Customer is informed through the right channel.
    Notification::assertSentTo(
        $subscription->user,
        SubscriptionCancelledNotification::class
    );

    // Premium access is gone immediately.
    expect($subscription->user->fresh()->hasFeature('premium-dashboard'))
        ->toBeFalse();
});
This second test asserts what the business actually wanted. If somebody refactors SubscriptionService, swaps it for an Action, or restructures the controller entirely, the test still passes as long as the outcome is right. If somebody comments out the notification (the opening story), the test fails. If somebody forgets to dispatch the billing-stop job, the test fails. The structure of the test mirrors the structure of the user-visible promise, and that is the only structure worth defending.
The longer test is not slower in any meaningful sense. It runs in the same hundred milliseconds. It costs a handful of extra assertions to write. It catches an order of magnitude more bugs.
How Production Bugs Change My Test Suite
Every production incident I have ever debugged ended the same way: someone said "we should add a test for that," and then the test either got written or it did not. Whether the team consistently follows through is the difference between a suite that grows usefully over time and a suite that offers the same false sense of safety it did two years ago.
The discipline I have settled on: any production bug serious enough to warrant a fix is also serious enough to warrant a test. The test goes in the same PR as the fix. Without the fix, the new test fails. With the fix, it passes. From then on, that specific failure mode has a permanent guard.
Production story: the duplicate-invoice test
The duplicate invoice email incident I covered in the queues post ended with an idempotency guard on the GenerateInvoiceJob. The PR that shipped the guard also shipped a new integration test that explicitly dispatched the job twice against the same order and asserted that exactly one email had been sent, exactly one invoice had been generated, and the order's invoice_status sat at sent. The test failed without the guard. It passes with it. The team can refactor the job freely, knowing that if anybody removes the idempotency guard accidentally, the test will catch it before merge.
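The shape of that test, reduced to plain PHP. This is a sketch under assumptions: Order and RecordingMailer are hypothetical in-memory stand-ins, and the guard shown (skip when the invoice status is already sent) is one plausible form of an idempotency guard, not necessarily the one that shipped.

```php
<?php

// Hypothetical in-memory stand-ins for the order row and the mail transport.
final class Order
{
    public function __construct(public int $id, public string $invoiceStatus = 'pending') {}
}

final class RecordingMailer
{
    /** @var int[] order ids an invoice email was sent for */
    public array $sent = [];

    public function sendInvoice(Order $order): void
    {
        $this->sent[] = $order->id;
    }
}

final class GenerateInvoiceJob
{
    public function __construct(private Order $order) {}

    public function handle(RecordingMailer $mailer): void
    {
        // Idempotency guard: a retry or duplicate delivery becomes a no-op.
        if ($this->order->invoiceStatus === 'sent') {
            return;
        }

        $mailer->sendInvoice($this->order);
        $this->order->invoiceStatus = 'sent';
    }
}

// Run the job twice against the same order, as a duplicate delivery would.
$order = new Order(id: 42);
$mailer = new RecordingMailer();

(new GenerateInvoiceJob($order))->handle($mailer);
(new GenerateInvoiceJob($order))->handle($mailer);
```

Delete the guard and the mailer records two sends, which is exactly the failure the test exists to catch. A production guard also has to hold when two workers pick up the job at the same moment, which an in-process check like this does not demonstrate.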
That pattern, over years, is how a suite becomes a real safety net. Every bug found in production teaches the suite one more thing it should have been checking. The suite gets more useful over time without anyone setting a coverage goal.
The corollary is harsh: a team that ships fixes without writing the matching test is repeating the bug on a longer timeline. The same root cause will show up again, in a different module, on a different Tuesday. I have seen it happen more times than I want to admit.
Production story: the payment refactor the suite carried
The opposite story is the one a confident suite earns. The payments module on one app I worked on had grown to almost three thousand lines spread across a controller, two services, a billing manager, and a tangle of helper classes. A multi-currency requirement forced a rewrite. The refactor consolidated four classes into one service and two action classes, deleted over a thousand lines, and changed almost every internal signature in the module.
The feature tests barely moved. They asserted that a Pro plan checkout produced the right charge in the right currency, that a failed payment fired the right notification, that an unauthorized user could not call the endpoint. Those assertions did not care which internal class did the work. The PR ran, the suite passed, the deploy was uneventful, and the on-call channel stayed silent the following week. The bug-prevention stories are the dramatic ones. The refactor-enabling stories are the ones that compound.
Tests That Paid For Themselves
A few moments where a single test caught the bug that would have made the week miserable. None of them were heroic. All of them existed because someone took ten minutes to write the right assertion the first time.
- The dashboard widget that almost leaked across tenants. An analytics endpoint scoped its query by user but not by team. The feature test asserting that a user on Team B could not see widget data for an account on Team A failed in CI on the first push. The author switched to a tenant-scoped query, the test passed, and the customer email to legal never had to be sent.
- The Stripe webhook that would have double-charged. Stripe occasionally delivers the same charge.succeeded event twice. An integration test processed the payload twice and asserted exactly one ledger entry, one notification, one downstream sync. A later refactor accidentally removed the idempotency check on the handler; the test caught it before merge. No customer was billed twice for the same charge.
- The checkout that almost shipped without a transaction. A PR split order creation into two service calls. The feature test that forced the fulfillment dispatch to throw asserted that the order row was rolled back. It failed because the new structure had silently dropped the outer DB::transaction wrap. The bug in production would have orphaned every failed order behind a "you charged me but never delivered" support ticket.
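The rollback assertion from the checkout story can be sketched with PDO and an in-memory SQLite database, assuming the pdo_sqlite extension is available. The placeOrder function and the orders table are hypothetical; in the Laravel app the wrapper was DB::transaction and the failure was forced through the fulfillment dispatch.

```php
<?php

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)');

/**
 * Hypothetical checkout step: create the order row, then hand off to fulfillment.
 * Both happen inside one transaction, so a fulfillment failure undoes the insert.
 */
function placeOrder(PDO $pdo, callable $dispatchFulfillment): void
{
    $pdo->beginTransaction();

    try {
        $pdo->exec("INSERT INTO orders (status) VALUES ('paid')");
        $dispatchFulfillment();
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
}

// Force the fulfillment step to fail, the way the feature test did.
$failed = false;
try {
    placeOrder($pdo, function (): void {
        throw new RuntimeException('fulfillment unavailable');
    });
} catch (RuntimeException $e) {
    $failed = true;
}

// With the transaction in place, the failed attempt leaves no row behind.
$orderCount = (int) $pdo->query('SELECT COUNT(*) FROM orders')->fetchColumn();
```

Remove the beginTransaction, commit, and rollBack calls and the count becomes one: an order row with no fulfillment behind it, which is the orphan the story describes.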
Each test cost less than an hour to write. Each one would have cost days of incident response, customer messaging, and on-call exhaustion if the matching bug had reached production. A test that catches one production bug has paid for itself for years.
What Changed As I Became More Experienced
Early in my career I wanted more tests. The number was the goal. I added unit tests for getters, mocked the database into shapes that no longer matched production, and felt good about a 90% coverage badge that protected almost nothing.
Now I want better tests. Fewer is fine. The shape that matters is whether the suite gives me the confidence to change production code without fear. A green CI run does not earn that confidence. The shape of the assertions does. A suite of forty feature tests that cover the business outcomes is worth more than a suite of four hundred unit tests that cover the class structure.
The deeper shift is in what the suite is for. Tests are not a quality stamp for the code that just shipped. They are a permission slip for the engineer six months from now who needs to refactor it. If the suite makes that engineer brave, it is doing its job. If it makes them afraid to touch the file, it is doing the opposite of its job, regardless of how much coverage it shows on the dashboard.
Closing Thoughts
Tests are written for the people who come after you, not for the CI badge that turns green on the PR that adds them. The day a test earns its keep is not the day it is merged. It is the day, two years later, when somebody picks up a story and changes a file you have not looked at in months, and the suite tells them in twenty seconds whether they broke something a customer would notice.
The single shift that matters: stop measuring tests by how many of them exist, and start measuring them by what they let the team do. A small suite that lets engineers refactor confidently is more valuable than a large suite that everybody steps around. A green CI run on tests that assert the wrong things is worse than no tests at all, because it produces false confidence.
The shift in how I think about tests took years to land. Early on, a test was a way to prove to myself and to my reviewer that the change I had just made worked. The audience was the present. Now the audience is the future. A test is a quiet note left for the engineer who will inherit this file when nobody from the original team is still around to explain it, and the note says: if you keep the outcome intact, the rest of the system will not punish you for changing the internals.
The sticky note version: the best test is not the one that turns CI green. It is the one that lets the engineer six months from now refactor a class without holding their breath. Build the kind of suite that does that, and the team's velocity stops being a function of how brave they are feeling on any given Tuesday.
What is the worst test you have ever seen pass on a PR that later caused a real production bug?