Risks & Critical Points
A "what could go wrong" map for new and existing developers. Read this before shipping anything to production. Pairs with — and intentionally overlaps lightly — fragile areas: fragile-areas tells you which code to handle carefully; this doc tells you which failure modes to plan for.
Each entry: what's at risk · where it lives · why it bites · what to do.
TL;DR for the impatient
If you only read three things before your first prod-touching PR:
- There is no automated rollback for backend migrations or Firebase deploys. Migrations are tracked and skipped once completed, but failures stop the chain and require manual repair; Firebase Functions v2 has no one-click rollback — you redeploy the previous bundle. See § Deploy-time foot-guns.
- The Firebase Realtime Database is the source of truth. Postgres is a one-way replica. If reports look wrong, check pg-sync, not the database itself.
- The iOS NSAllowsArbitraryLoads = true flag in PIVOT-Mobile/ios/pivot3/Info.plist is still active. All HTTP requests from production builds (including auth tokens) are MITM-able on hostile networks. See § NSAllowsArbitraryLoads.
1. What could break production
Grouped by failure category, not by file. Each item names the smoking gun, then says what you can do about it.
🚨 Critical — likely-or-already-burned
Stripe webhook: missing idempotency (signature fallback is fail-closed)
File: functions/systems/stripe/handlers/stripe-webhook-handler.ts
The handler reads (req as any).rawBody || JSON.stringify(req.body). If
rawBody is absent, stripe.webhooks.constructEvent throws on the
signature mismatch and the request is rejected with a 400 — fail-closed,
not fail-open. The fallback produces noisy errors, not forged-event
acceptance.
The real risk is no idempotency-key tracking. Stripe retries
webhooks; if event.id repeats, the handler re-runs. Most paths are
overwrite-shaped and idempotent in practice (handleSubscriptionChange,
handleCustomerUpdated); handleSubscriptionCreated is the dangerous
one — duplicate delivery can create redundant company↔subscription links.
- Do: add an event-ID-processed store before extending any handler.
Log duplicates rather than silently swallowing them. If you touch the
HTTP layer, re-verify with the Stripe CLI to confirm
rawBody is still populated.
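A minimal sketch of the event-ID-processed store described above. The names ProcessedEventStore, InMemoryEventStore and handleOnce are hypothetical, not taken from the codebase, and the in-memory store is a test stand-in only; a real store would need an atomic write so that two concurrent deliveries cannot both pass the check.

```typescript
// Minimal shape of the Stripe event fields this sketch relies on.
interface WebhookEvent {
  id: string;
  type: string;
}

// Hypothetical store: returns true only the first time an event ID is seen.
// A real implementation would need an atomic write (e.g. a database transaction).
interface ProcessedEventStore {
  markIfNew(eventId: string): boolean;
}

// In-memory stand-in, suitable for tests only.
class InMemoryEventStore implements ProcessedEventStore {
  private seen: Record<string, boolean> = {};

  markIfNew(eventId: string): boolean {
    if (Object.prototype.hasOwnProperty.call(this.seen, eventId)) return false;
    this.seen[eventId] = true;
    return true;
  }
}

type Outcome = "processed" | "duplicate";

// Runs `handler` at most once per event.id and reports duplicates to `log`.
function handleOnce(
  event: WebhookEvent,
  store: ProcessedEventStore,
  handler: (e: WebhookEvent) => void,
  log: (msg: string) => void
): Outcome {
  if (!store.markIfNew(event.id)) {
    // Duplicates are logged, not silently swallowed.
    log("duplicate webhook delivery skipped: " + event.id + " (" + event.type + ")");
    return "duplicate";
  }
  handler(event);
  return "processed";
}
```

Note the trade-off in this sketch: the ID is marked before the handler runs, so a handler that throws will not be retried. A store that records a status per event (started / completed) would be one way to avoid that.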
get-secret.ts returns empty string — guard coverage is uneven
File: functions/get-secret.ts
secret(name) returns secretCache[name] || process.env[name] || ''.
Missing values are silently "". Call-site behaviour is inconsistent:
- Guarded (fail fast): Stripe (get-stripe-instance.ts), SMTP (nodemailer.transporter.ts), migrations. These throw at init.
- Guarded (silent skip): Slack (send-slack-notification.ts) logs and returns null. Acceptable for fire-and-forget notifications.
- Unguarded: Lightspeed (lightspeed-api-token.ts), Cluster POS (cluster-api-wrapper.ts). These concatenate "" into URLs / Bearer tokens. The first API call fails (axios rejection), not "hours later", but the error message is unhelpful.
- Do: for new integrations, use defineSecret() from firebase-functions/params (see integrations/givex/secrets.ts). For existing secret('NAME') call sites, add an explicit throw-on-empty guard if there isn't one (e.g. if (!value) throw new Error('missing ' + name)).
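The throw-on-empty guard could be factored into one small helper, sketched here. requireSecret is a hypothetical name, and the lookup parameter stands in for the existing secret(name) function so the sketch stays self-contained; the secret name in the usage line is also made up.

```typescript
// Hypothetical wrapper: `lookup` stands in for the existing secret(name) helper.
// Throws at the call site instead of letting "" flow into a URL or Bearer token.
function requireSecret(name: string, lookup: (n: string) => string): string {
  const value = lookup(name);
  if (!value || value.trim() === "") {
    // Only the name is reported; the value itself is never logged.
    throw new Error("missing secret: " + name);
  }
  return value;
}

// Usage (hypothetical secret name):
//   const token = requireSecret("LIGHTSPEED_CLIENT_SECRET", secret);
```

Failing here turns the unhelpful axios rejection into an error that names the missing secret.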
28 registered migrations, no down-step
Runner: functions/migrations/run-migrations.ts
(28 entries; the folder also contains ~70 one-off utility scripts under
sub-folders that are not registered).
Migration state is tracked in the database under /migrations/{name}
(completed / failed + runAt + duration). Completed migrations are
skipped on subsequent deploys — so accidental re-runs are not the risk.
The risk is mid-deploy failure: the runner exits on first failure
(line 190) and leaves the chain partially applied. There is no
automated down-step.
- Do: design every new migration to be idempotent and resumable. Test
against a copy of prod via functions/scripts/copy-company-to-dev.ts.
Document the manual recovery path in a comment at the top of the file. Never log PII or raw values.
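One possible shape for an "idempotent and resumable" migration is a per-item checkpoint, sketched below. Checkpoint and runResumable are hypothetical names, and the checkpoint here is a plain object; in practice it would have to be persisted somewhere durable (for example alongside the migration's own state record), which this sketch does not show.

```typescript
// Hypothetical checkpoint: IDs of items that have already been migrated.
interface Checkpoint {
  done: Record<string, boolean>;
}

// Applies `apply` to each item not yet recorded in the checkpoint.
// The checkpoint is updated after every successful item, so a failure
// mid-run leaves a state the next run can resume from.
function runResumable(
  itemIds: string[],
  checkpoint: Checkpoint,
  apply: (id: string) => void
): number {
  let applied = 0;
  for (const id of itemIds) {
    if (Object.prototype.hasOwnProperty.call(checkpoint.done, id)) continue;
    apply(id);                  // may throw; earlier items stay recorded
    checkpoint.done[id] = true; // record only after success
    applied++;
  }
  return applied;
}
```

The apply step should itself be overwrite-shaped: if the process dies between apply and the checkpoint write, that one item runs again on resume.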
Schedule → Postgres dual-write can diverge silently
See fragile-areas.md › Schedule → Postgres dual-write.
Listed here to keep the prod-risk list complete: a single failed
pg-sync write means the live UI and reporting layer disagree with
no alert. Check Rollbar for pg-sync errors when reports look off.
Three-Firebase-project drift
See fragile-areas.md › Three-Firebase-project drift.
Project IDs (pivot-dev-59310 / pivot-not-production-project / pivot-inc)
appear in src/config.ts, backend .env.<project-id>
files, mobile plists, GitHub Actions, and Fastlane. Pointing the running
app at the wrong project is one typo away.
- Do: before any deploy, confirm firebase use and the exact env file. Check the Firebase init log in browser console / function logs.
Silent failures (no alarm, wrong data)
POS employee-mapping cache staleness
See fragile-areas.md › pos-sync employee mapping cache.
A stale mapping silently attaches wages, sales and tip-outs to the wrong
employee. Re-listed here because the symptom is invisible until
payroll runs.
Background push handler — AsyncStorage failures
See fragile-areas.md › Background message handler
(PIVOT-Mobile/index.js). Failures are silent; badge counts drift; users
think notifications are broken without a clear cause.
APNs / provisioning-profile expiration is unmonitored (already expired)
Files: PIVOT-Mobile/ios/pivot3/Info.plist, PIVOT-Mobile/AppStore_jobs.pivot.qa.mobileprovision
The provisioning profile committed at AppStore_jobs.pivot.qa.mobileprovision
has ExpirationDate: 2024-10-29 — expired 7+ months ago. Fastlane
uses match(readonly: true) to fetch certs from the cert repo but does
no expiry validation. No before_all check; no CI preflight; no
Sentry/Slack alert.
When the APNs key in App Store Connect rotates or a cert lapses, iOS push for prod stops silently until someone notices badges drifting.
- Do: add a cert_age_check Fastlane lane that runs match, then parses each profile's NotAfter and fails if any cert/key is within 30 days of expiry. Wire it into before_all. Replace the expired profile in the repo as part of the next release.
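The expiry rule itself is simple date arithmetic. The sketch below shows it in TypeScript for illustration only; an actual Fastlane lane would express the same check in Ruby. SigningAsset and findExpiring are hypothetical names, and how the dates are extracted from profiles or certs is assumed to happen elsewhere.

```typescript
// Hypothetical record for a cert, key, or provisioning profile.
interface SigningAsset {
  name: string;
  expiresAt: string; // ISO-8601 timestamp, e.g. a profile's expiration date
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the assets that are already expired or expire within `thresholdDays`.
function findExpiring(
  assets: SigningAsset[],
  now: Date,
  thresholdDays: number
): SigningAsset[] {
  const cutoff = now.getTime() + thresholdDays * MS_PER_DAY;
  return assets.filter((a) => {
    const t = new Date(a.expiresAt).getTime();
    // Unparseable dates are flagged rather than silently passed.
    return isNaN(t) || t <= cutoff;
  });
}
```

A preflight step would fail the build whenever the returned list is non-empty and print the asset names.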
🚨 iOS NSAllowsArbitraryLoads is true (active cleartext-HTTP vulnerability)
File: PIVOT-Mobile/ios/pivot3/Info.plist
Verified: the plist contains an unscoped NSAllowsArbitraryLoads = true
under NSAppTransportSecurity. There is no NSExceptionDomains
scoping. This applies to all HTTP requests in production builds:
auth tokens, GPS punches, and chat payloads are MITM-able on hostile
networks. App Store review can also flag this.
- Do: remove the key (or set it to false) immediately. If dev or staging actually needs cleartext for emulator domains, scope it via NSExceptionDomains listing only those specific hosts. Add a CI check that fails the build when NSAllowsArbitraryLoads is true in a release configuration.
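A rough sketch of what the CI check could look like, assuming the plist is stored as XML text (not binary) and that the key and its value are adjacent. The function names and the "Release" configuration string are assumptions for illustration; reading the file from disk is left to the caller.

```typescript
// Returns true when the plist text sets NSAllowsArbitraryLoads to <true/>.
// Keys such as NSExceptionAllowsInsecureHTTPLoads do not match this pattern.
function allowsArbitraryLoads(plistXml: string): boolean {
  return /<key>\s*NSAllowsArbitraryLoads\s*<\/key>\s*<true\s*\/>/.test(plistXml);
}

// Throws for release builds only, so a scoped dev setup is not blocked.
function assertReleasePlistIsSafe(plistXml: string, configuration: string): void {
  if (configuration === "Release" && allowsArbitraryLoads(plistXml)) {
    throw new Error("NSAllowsArbitraryLoads is true in a Release build");
  }
}
```

A regex is enough for a tripwire like this; a stricter check would parse the plist properly.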
Cloud Functions: uneven runWith coverage, no project-wide maxInstances
Pattern: setGlobalOptions({ region: 'us-central1' }) in
functions/index.js sets region only — no
default memory, timeout, or maxInstances. Some functions specify
runWith({ memory, timeoutSeconds }) (integrations, crons, some
on-call handlers); core HTTP endpoints like setEmployeeClockInV2,
setEmployeeClockOutV2, getCompanyTimeNew do not.
A runaway query, hot Pub/Sub loop, or customer flood can scale
instances unboundedly. The Cloud SQL connection storm is partly
mitigated — functions/shared/infrastructure/pg-client.ts
sets max: 5 per instance — but N instances × 5 connections still hits
the SQL connection limit. The default v2 timeout is 60s; the 540s figure
is the maximum for event-driven functions, not their default.
- Do: set explicit maxInstances and timeoutSeconds on every new function (and on the hot HTTP endpoints above when you next touch them). Add memory hints for known-heavy paths (export generators, pg-sync, integration syncs). Consider a project-wide maxInstances default via setGlobalOptions.
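The "N instances × 5 connections" arithmetic can be turned into a cap, as in this sketch. Only the pool size of 5 comes from the text above; the SQL connection limit (100) and the reserved headroom (10) in the usage lines are placeholder assumptions, and the function names are hypothetical.

```typescript
// Worst-case connections if every instance fills its pool.
function peakConnections(instances: number, poolMaxPerInstance: number): number {
  return instances * poolMaxPerInstance;
}

// Largest instance count whose pooled connections still fit under the SQL limit,
// leaving `reserved` connections for migrations, scripts, and admin access.
function maxSafeInstances(
  sqlConnectionLimit: number,
  poolMaxPerInstance: number,
  reserved: number
): number {
  if (poolMaxPerInstance <= 0) throw new Error("poolMaxPerInstance must be positive");
  const usable = sqlConnectionLimit - reserved;
  return Math.max(0, Math.floor(usable / poolMaxPerInstance));
}

// Placeholder numbers: limit 100, pool 5, reserve 10.
const cap = maxSafeInstances(100, 5, 10); // 18
const overload = peakConnections(40, 5);  // 200, double the assumed limit
```

The resulting cap is the kind of number that would be passed as maxInstances, summed across every function that shares the pool.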
Deploy-time foot-guns
GitHub Actions deploy: project-safe and gated, but no smoke test
File: .github/workflows/deploy.yml
The pipeline is safer than it looks:
- Project targeting is explicit: every Firebase step passes --project <env-id> (or the GitHub Actions variable vars.FIREBASE_PROJECT_ID), overriding any stale firebase use on the runner.
- Production deploys are manual-only (workflow_dispatch); dev, main→staging, and uat auto-deploy on push but reference GitHub environment: names that typically carry UI-configured approval rules.
- Last green SHA is cached per environment (lines 174–179, 289–297), enabling fast revert+redeploy.
Real gaps:
- No post-deploy smoke test: success is declared the moment firebase deploy returns 0, with no curl against the live URL.
- No automated rollback: Firebase Functions v2 has no one-click rollback; you redeploy the previous SHA.
- Do: before a hot deploy, note the prior green commit SHA. After any non-trivial deploy, hit the new function URL and the web app's health endpoint before walking away. Don't merge to production on Friday afternoon.
Manual ops scripts lack local safeguards
Folder: functions/scripts/
Most scripts (set-custom-claim.js, copy-company-to-dev.ts,
duplicate-data.ts) hard-code a project ID or load a service-account
file, initialize Firebase Admin silently, and run. They print nothing
about which project they're targeting and have no --dry-run flag.
set-super-admin-claims.js is the exception — it supports --dry-run
and multi-env targeting; setup-mia-agent.js at least prints the
project ID.
Note: Firebase Admin SDK writes are captured in Cloud Logging server-side, so an audit trail exists — it just lives where you have to go look for it after the fact. The real risk is silent mis-targeting, not absent logs.
- Do: print the project ID at the top of any new script and pause
for confirmation if it matches a production-shaped name. Prefer
scripts that support --dry-run. Run against staging first. After running, glance at Cloud Logging to confirm the writes hit the project you intended.
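A sketch of the safeguard described above, kept free of Firebase imports so it can be unit-tested. planRun, RunPlan and the PRODUCTION_PROJECT_IDS list are hypothetical; treating pivot-inc as the production project is an assumption here and should be replaced with the real set. An explicit list is used instead of name matching because a project ID can contain the word "production" without being production.

```typescript
// ASSUMPTION: explicit list of production project IDs; adjust to the real set.
const PRODUCTION_PROJECT_IDS = ["pivot-inc"];

interface RunPlan {
  banner: string;                // printed before anything else happens
  requiresConfirmation: boolean; // pause and ask the operator to confirm
  willWrite: boolean;            // false when --dry-run is present
}

// Decides what a script should print and whether it may write, from its inputs.
function planRun(projectId: string, argv: string[]): RunPlan {
  const dryRun = argv.indexOf("--dry-run") !== -1;
  const isProd = PRODUCTION_PROJECT_IDS.indexOf(projectId) !== -1;
  return {
    banner: "Target project: " + projectId + (dryRun ? " (dry run, no writes)" : ""),
    requiresConfirmation: isProd && !dryRun,
    willWrite: !dryRun,
  };
}
```

A script would print plan.banner first, prompt when plan.requiresConfirmation is true, and skip every write when plan.willWrite is false.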
Mobile env binding has no post-build assertion
File: PIVOT-Mobile/src/active.env.js (generated by set-dev /
set-stage / set-prod npm scripts before each build)
Not a true race — iOS and Android CI workflows use per-platform
concurrency: groups, so parallel runs on the same branch are
serialized and each runs in its own VM. The real risk is crash
recovery: if a build crashes after env setup but before the bundle
finishes, a retry that skips the env step can produce a bundle pointing
at the wrong backend. The bundled JS is not introspectable after the
fact, so there is no way to verify what env actually shipped.
- Do: bake the env intent into the bundle as an explicit constant
at build time (e.g. write export const BUILD_ENV = 'prod' next to active.env.js), then have the lane read it back post-build and assert it matches the lane's intent before upload to TestFlight / Play Console.
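The read-back step might look like the sketch below. readBuildEnv and assertBuildEnv are hypothetical names. It assumes the text being checked still contains the constant as a plain string assignment; if the shipped bundle is compiled to bytecode or heavily minified, the check would have to run against the generated source file instead.

```typescript
// Extracts the BUILD_ENV string literal from source text, or null if absent.
function readBuildEnv(source: string): string | null {
  const match = /BUILD_ENV\s*=\s*['"]([A-Za-z0-9_-]+)['"]/.exec(source);
  return match ? match[1] : null;
}

// Throws when the baked-in env does not match what the lane intended to build.
function assertBuildEnv(source: string, expected: string): void {
  const actual = readBuildEnv(source);
  if (actual !== expected) {
    throw new Error(
      "BUILD_ENV mismatch: expected " + expected + ", found " + String(actual)
    );
  }
}
```

A missing constant is treated as a mismatch, so a retry that skipped the env step fails loudly instead of uploading.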
cors.json is * for GET/HEAD on Cloud Storage
File: cors.json
CORS doesn't gate fetchability — direct navigation and server-side
fetches always work regardless. What it grants is cross-origin
browser-script fetches: any website's JS can fetch() URLs in the
bucket. So if a signed or unauthenticated bucket URL ever ends up in a
hostile page, that page's JS can read it. Not a critical risk on its
own, but means: don't rely on URL obscurity, and don't store sensitive
content at predictable bucket paths.
Auth & data integrity
RTDB rules: 1,114 lines; tests exist but coverage is uneven
File: database.rules.json
There is a rules-test suite: tests/rules/ holds
~20 Jest specs (notifications, payroll, schedule, employees, companies,
requests, chat, etc.) run via firebase emulators:exec per
jest.rules.config.js. The risk isn't
absence of testing — it's that a rules change to an untested path
still silently breaks reads/writes without surfacing as a TS or build
error. Combined with the 148+ hard-coded RTDB paths in the frontend
(fragile-areas.md), the blast
radius of a single rule typo is large.
- Do: when you touch a rule, run the matching spec under tests/rules/ against the emulator. If your path has no coverage, add a spec in the same PR; the existing files are good templates.
Firestore rules cover three collections; everything else hits the catch-all deny
File: firestore.rules
Rules exist for IntegrationSettings, IntegrationPluginSyncs,
IntegrationProviderHealthStates, plus a catch-all deny. If anyone adds
a new Firestore collection without an explicit rule, the deny catches it
— but they may then assume rules are in place when none have been
written for their specific use case.
- Do: every new Firestore collection ships with explicit rules in the same PR.
Cloud Functions: 33 async triggers, uneven traceId coverage
Pattern: ~33 on-*.ts triggers: 21 under functions/db/,
6 under functions/modules/*/endpoints/events/,
plus a handful in integrations / storage.
Async event-driven design means a failed downstream trigger doesn't roll back the source action — that's working as intended. The observability gap is harder: when a chain breaks (client write OK, notification trigger errors), correlating the original write to the downstream failure across function boundaries is manual log-grepping.
integration-engine already does
this right — its task.service.ts generates a traceId and propagates
it via withLogContext() (see functions/shared/infrastructure/logger.ts).
The functions/db/ triggers (pg-sync listeners, etc.) do not.
Cloud Logging assigns a trace per invocation automatically, but that
field does not cross invocations.
- Do: for any new multi-step trigger chain, generate a traceId at
the entry point, attach it to the emitted event payload, and log it
on every branch. Follow the integration-engine pattern, not the
db/listener pattern.
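The three steps above (generate at the entry point, attach to the payload, log on every branch) can be sketched independently of any logger. This is not the integration-engine implementation; TracedEvent, newTraceId, startTrace, continueTrace and logLine are hypothetical names, and the ID generator is a simple placeholder.

```typescript
// Envelope that carries the traceId alongside the business payload.
interface TracedEvent<T> {
  traceId: string;
  payload: T;
}

// Placeholder ID generator; any sufficiently unique string works.
function newTraceId(): string {
  return Date.now().toString(36) + "-" + Math.random().toString(36).slice(2, 10);
}

// Entry point: start a trace, or continue one supplied by the caller.
function startTrace<T>(payload: T, existingTraceId?: string): TracedEvent<T> {
  return { traceId: existingTraceId || newTraceId(), payload: payload };
}

// Downstream step: emit a follow-up event that keeps the same traceId.
function continueTrace<T, U>(parent: TracedEvent<T>, payload: U): TracedEvent<U> {
  return { traceId: parent.traceId, payload: payload };
}

// Structured log line with the traceId as a field, not buried in the message.
function logLine(event: TracedEvent<unknown>, step: string): string {
  return JSON.stringify({ traceId: event.traceId, step: step });
}
```

With the ID in every emitted payload and log line, correlating a client write with a failed downstream trigger becomes a single filter on traceId.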
2. Technical debt / shortcuts
Listed by surface. None of these block product work today; all of them slow new hires, inflate test surface, or compound the risks above.
Frontend (web)
| Debt | Where | Why it costs |
|---|---|---|
| Firebase JS SDK v8 (EOL Aug 2022) | src/index.tsx and ~all imports | Locked into namespaced API; no v9+ tree-shaking; no security backports for 3+ years |
| ~800 any / @ts-ignore sites (grep -rE "\bany\b\|@ts-ignore" --include='*.ts' --include='*.tsx' src/; ~382 : any annotations + 34 @ts-ignore directives at last count) | across src/ | Refactors hit invisible walls; type errors land in prod |
| ~26 unit tests; e2e dormant | src/**/__tests__, e2e/ — last commit ~9 weeks ago | Payroll, schedule, exports have no regression net; the e2e specs exist but nobody's run them recently |
| Dual state stores | Redux + Zustand — see fragile-areas.md › Dual state stores | Easy to double-store; ambiguous source of truth |
| Mega-components / mega-context | see fragile-areas.md › Mega-components | Every change risks unrelated regressions |
| 148+ hard-coded RTDB paths | see fragile-areas.md | Schema migration = global grep |
| No i18n missing-key detection | src/i18n.ts | Untranslated strings ship; FR placeholders linger as TODO |
| tasks.todo (~972 lines) | tasks.todo | Mix of done / pending / forgotten; unclear signal, not in CI |
| Inline style={{ }} and styled-components above logic | scattered | Violates CLAUDE.md §"Component must look like" |
Backend
| Debt | Where | Why it costs |
|---|---|---|
| console.* mixed with logger.* | ~1,420 raw console.* calls across functions/ | No structured fields, no PII redaction, no Cloud Logging severity |
| Legacy zones outside FSD | see fragile-areas.md › Backend legacy zones | db/, http/, on-call/, cron/, schedule/, services/ — layer rules don't apply |
| requests imports services/ | see fragile-areas.md | Violates module boundary; historical, not yet refactored |
| Migrations folder mixes one-time and idempotent | functions/migrations/ | No header convention says which is which |
| Manual DI in 15 modules | functions/modules/*/container.ts | Intentional (decisions.md › Manual dependency injection) but new hires routinely look for an inversify-style container |
Mobile
| Debt | Where | Why it costs |
|---|---|---|
| 0 test files | PIVOT-Mobile/ | RN upgrades, native-module bumps, patch decay — all blind |
| ~115 any / @ts-ignore sites (107 \bany\b + 8 @ts-ignore at last count; narrower : any count is 56) | PIVOT-Mobile/src/ | Lower than web but still untyped surfaces around native bridges |
| main.jsbundle (~10 MB), pivot3.ipa (~22 MB), pivot3.app.dSYM.zip (~42 MB) on disk | PIVOT-Mobile/ root | Gitignored, but devs sometimes commit by accident in a panic. Move to GitHub Releases. |
| Legacy moment use in a few files | PIVOT-Mobile/ (~15 imports vs ~90 dayjs imports) | dayjs is the active library; clean up the tail when you touch those files |
| 2 patches in patches/ | PIVOT-Mobile/patches/ (react-native-image-resizer, react-native-pdf) | Silent IOUs against the next RN / native-module upgrade |
3. Security audit
Run through this list before the next mobile release or rotation review. Items in 🚨 should be addressed first.
Past incident: Play Store service-account key was committed (now resolved)
A Google Play service-account private key for pivot-play-deploy@pivot-inc.iam.gserviceaccount.com was committed at PIVOT-Mobile/fastlane/play-store-key.json in commit 2d7c42ab ("Switch Android CI/CD from APK to AAB"). The production playstore lane in fastlane/Fastfile deploys to com.pivot3 with release_status: "completed", so the key could have shipped a public Play Store build.
Resolved in PR #501 ("fix: remove Play Store service account key and fix .gitignore"): the file was removed from the working tree and fastlane/play-store-key.json was added to .gitignore in the same change. The mobile repo no longer contains the key. Whether the historical git history was purged or only the working tree was cleaned is worth confirming if the repo has ever been public.
- Standing lessons:
  - Never paste service-account JSON next to the lane that consumes it. CI should echo "$PLAY_STORE_KEY" > fastlane/play-store-key.json from a secret at runtime, never check it in.
  - Any committed credential should be treated as compromised even after a working-tree delete: rotate the key in GCP, not just the file in git.
  - When you add a new credential-shaped file to any release lane, grep .gitignore first.
Local-only secrets (gitignored, present on dev machines)
These files exist on developer disks but are correctly gitignored.
If you find one not covered by .gitignore after a refactor, treat it
as a regression equivalent to the play-store-key issue above.
- PIVOT-Mobile/pivot-inc-656b450b7a70.json: Firebase deployment service account
- PIVOT-Mobile/AppleSignInKey_ZZRY22BNJ3.p8: Sign in with Apple private key
- PIVOT-Mobile/fastlane/prod_auth_key.p8: App Store Connect key
- PIVOT-Mobile/AppStore_jobs.pivot.qa.mobileprovision: provisioning profile
- PIVOT-Mobile/fastlane/.env: Fastlane Apple-ID credentials
- pivot/.env: has dev/staging/prod Firebase web keys in one file

Firebase web/iOS API keys (in GoogleService-Info-*.plist and google-services.json) are committed and are designed to be public: they identify the project, they don't authorize it. Pair with Firebase App Check if you ever need to lock down RTDB to attested clients only.
PII in logs
- ~1,420 raw console.* sites across functions/ bypass the structured-logging redaction layer. Cloud Logging retains these effectively forever.
- Migrations and ops scripts occasionally print user records or company-shaped data while debugging. Cloud Logs cannot be retroactively scrubbed selectively.
- Do: never write a log line that contains a token, password, full request body, or full user record. Use the structured logger from functions/shared/infrastructure/logger.ts and pass only the keys you need.
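"Pass only the keys you need" can be enforced with a small allowlist helper, sketched here. pickLogFields is a hypothetical name and is not part of the existing logger; the record in the usage comment is made-up sample data.

```typescript
// Copies only the allowlisted keys, so tokens and full records never reach the log call.
function pickLogFields(
  record: Record<string, unknown>,
  allowedKeys: string[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of allowedKeys) {
    if (Object.prototype.hasOwnProperty.call(record, key)) {
      out[key] = record[key];
    }
  }
  return out;
}

// Usage (sample data):
//   pickLogFields({ uid: "u1", email: "a@b.c", token: "t" }, ["uid"])
//   yields an object containing only uid.
```

An allowlist fails safe: a field added to the record later stays out of the logs until someone explicitly opts it in.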
4. Quick reference: when something is broken
| Symptom | First place to check |
|---|---|
| Stripe customer charged / linked twice | duplicate webhook delivery — stripe-webhook-handler.ts, no event.id dedup; handleSubscriptionCreated is the risky path |
| Integration suddenly returns 401 / malformed URL | secret('NAME') returned ""; check the call site for a missing guard (most likely Lightspeed or Cluster POS) |
| Reports / dashboards lag the UI | pg-sync errors in Rollbar — see fragile-areas.md › pg-sync |
| Push silently stops on iOS prod | APNs cert / provisioning profile expiry; the committed profile expired 2024-10-29 — check App Store Connect for the live state |
| Wrong backend on mobile after a release | PIVOT-Mobile/src/active.env.js not regenerated on retry — re-run the lane from a clean state |
| Firestore rule "denied" for new collection | a rule was never written; default deny in firestore.rules catches it |
| Function spinning up thousands of instances | no maxInstances on a hot HTTP endpoint — clamp it at the function definition; verify the pg-client pool cap is still 5 |
| Audit asks "who changed this user's claim?" | Cloud Logging has the Admin SDK call; ops script didn't print one locally — functions/scripts/set-custom-claim.js |
For symptoms tied to specific code (payroll totals, POS sync, mobile clock-in, etc.), the table in fragile-areas.md › Quick Reference covers them.
See also
- Fragile areas — "this code is load-bearing complexity, here's what to know"
- Decisions — the why behind RTDB-first, manual DI, dual state, Hono, etc.
- Architecture overview — the system map
- CLAUDE.md: the coding-rules contract; the commit checklist at the bottom is your last line of defence