Risks & Critical Points

A "what could go wrong" map for new and existing developers. Read this before shipping anything to production. It pairs with fragile areas and intentionally overlaps with it lightly: fragile-areas tells you which code to handle carefully; this doc tells you which failure modes to plan for.

Each entry: what's at risk · where it lives · why it bites · what to do.


TL;DR for the impatient

If you only read three things before your first prod-touching PR:

  1. There is no automated rollback for backend migrations or Firebase deploys. Migrations are tracked and skipped once completed, but failures stop the chain and require manual repair; Firebase Functions v2 has no one-click rollback — you redeploy the previous bundle. See § Deploy-time foot-guns.
  2. The Firebase Realtime Database (RTDB) is the source of truth. Postgres is a one-way replica. If reports look wrong, check pg-sync first, not the RTDB data itself.
  3. The iOS NSAllowsArbitraryLoads = true flag in PIVOT-Mobile/ios/pivot3/Info.plist is still active. All HTTP requests from production builds (including auth tokens) are MITM-able on hostile networks. See § NSAllowsArbitraryLoads.

1. What could break production

Grouped by failure category, not by file. Each item names the smoking gun, then says what you can do about it.

🚨 Critical — likely-or-already-burned

Stripe webhook: missing idempotency (signature fallback is fail-closed)

File: functions/systems/stripe/handlers/stripe-webhook-handler.ts

The handler reads (req as any).rawBody || JSON.stringify(req.body). If rawBody is absent, stripe.webhooks.constructEvent throws on the signature mismatch and the request is rejected with a 400 — fail-closed, not fail-open. The fallback produces noisy errors, not forged-event acceptance.

The real risk is no idempotency-key tracking. Stripe retries webhooks; if event.id repeats, the handler re-runs. Most paths are overwrite-shaped and idempotent in practice (handleSubscriptionChange, handleCustomerUpdated); handleSubscriptionCreated is the dangerous one — duplicate delivery can create redundant company↔subscription links.

  • Do: add an event-ID-processed store before extending any handler. Log duplicates rather than silently swallowing them. If you touch the HTTP layer, re-verify with the Stripe CLI to confirm rawBody is still populated.
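
A minimal sketch of the event-ID-processed store described above. Every name here (ProcessedEventStore, InMemoryEventStore, handleOnce) is hypothetical and does not exist in the codebase. The in-memory store only makes the example runnable; a real store would have to persist across function instances and claim the ID atomically.

```typescript
// Hypothetical interface; a production store must persist across instances.
interface ProcessedEventStore {
  /** Claims the ID. Resolves true if newly claimed, false if already seen. */
  claim(eventId: string): Promise<boolean>;
  /** Releases a claim so that a Stripe retry can run the handler again. */
  release(eventId: string): Promise<void>;
}

// In-memory stand-in for illustration only: state is lost on cold start.
class InMemoryEventStore implements ProcessedEventStore {
  private readonly seen = new Set<string>();

  async claim(eventId: string): Promise<boolean> {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  async release(eventId: string): Promise<void> {
    this.seen.delete(eventId);
  }
}

interface WebhookEvent {
  id: string;
  type: string;
}

type Outcome = "processed" | "duplicate";

async function handleOnce(
  event: WebhookEvent,
  store: ProcessedEventStore,
  handler: (event: WebhookEvent) => Promise<void>,
  logDuplicate: (message: string) => void,
): Promise<Outcome> {
  if (!(await store.claim(event.id))) {
    // Log duplicates rather than silently swallowing them.
    logDuplicate(`duplicate Stripe event ${event.id} (${event.type}) skipped`);
    return "duplicate";
  }
  try {
    await handler(event);
    return "processed";
  } catch (err) {
    // Give the claim back so the next retry is not mistaken for a duplicate.
    await store.release(event.id);
    throw err;
  }
}
```

Claiming before running the handler keeps two concurrent deliveries from both executing; releasing on failure keeps Stripe's retry from being dropped as a duplicate. Whether that trade-off suits handleSubscriptionCreated is a judgment call for whoever implements the real store.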

get-secret.ts returns empty string — guard coverage is uneven

File: functions/get-secret.ts

secret(name) returns secretCache[name] || process.env[name] || ''. Missing values are silently "". Call-site behaviour is inconsistent:

  • Guarded (fail fast): Stripe (get-stripe-instance.ts), SMTP (nodemailer.transporter.ts), migrations — throw at init.

  • Guarded (silent skip): Slack (send-slack-notification.ts) — logs and returns null. Acceptable for fire-and-forget notifications.

  • Unguarded: Lightspeed (lightspeed-api-token.ts), Cluster POS (cluster-api-wrapper.ts) — concatenate "" into URLs / Bearer tokens. The first API call fails (axios rejection), not "hours later" — but the error message is unhelpful.

  • Do: for new integrations, use defineSecret() from firebase-functions/params (see integrations/givex/secrets.ts). For existing secret('NAME') call sites, add an explicit throw-on-empty guard if there isn't one (e.g. if (!value) throw new Error('missing ' + name)).
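
A sketch of the throw-on-empty guard for existing call sites. requireSecret and bearerHeader are hypothetical names, and the lookup parameter stands in for the existing secret(name) helper so the example stays self-contained.

```typescript
// Hypothetical wrapper; `lookup` stands in for the existing secret(name) helper.
function requireSecret(name: string, lookup: (name: string) => string): string {
  const value = lookup(name);
  if (!value) {
    // Name the missing secret, but never echo secret values in errors or logs.
    throw new Error(`missing secret ${name}`);
  }
  return value;
}

// Illustrative use: build the header only after the guard has passed,
// so an empty value can never be concatenated into "Bearer ".
function bearerHeader(name: string, lookup: (name: string) => string): string {
  return `Bearer ${requireSecret(name, lookup)}`;
}
```

With this in place, an unset secret fails at the call site with the secret's name in the message, instead of surfacing as an opaque axios rejection.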

28 registered migrations, no down-step

Runner: functions/migrations/run-migrations.ts (28 entries; the folder also contains ~70 one-off utility scripts under sub-folders that are not registered).

Migration state is tracked in the database under /migrations/{name} (completed / failed + runAt + duration). Completed migrations are skipped on subsequent deploys — so accidental re-runs are not the risk. The risk is mid-deploy failure: the runner exits on first failure (line 190) and leaves the chain partially applied. There is no automated down-step.

  • Do: design every new migration idempotent and resumable. Test against a copy of prod via functions/scripts/copy-company-to-dev.ts. Document the manual recovery path in a comment at the top of the file. Never log PII or raw values.
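
One way to read "idempotent and resumable" is a per-item checkpoint, sketched below. The Checkpoint interface and runResumable are hypothetical: the real runner records only completed / failed per migration, so where a cursor would be stored is an assumption left to the migration author.

```typescript
// Hypothetical checkpoint API; where the cursor is persisted is an assumption.
interface Checkpoint {
  load(): number; // index of the next item to process
  save(next: number): void;
}

function runResumable<T>(
  items: readonly T[],
  checkpoint: Checkpoint,
  applyItem: (item: T) => void, // must be idempotent (overwrite-shaped)
): number {
  const start = checkpoint.load();
  let applied = 0;
  items.slice(start).forEach((item, offset) => {
    applyItem(item);
    checkpoint.save(start + offset + 1); // persist progress after each success
    applied += 1;
  });
  return applied;
}
```

If applyItem throws, the checkpoint still points at the failed item, so a re-run picks up there rather than starting over. The item step must stay idempotent anyway, because a crash between applyItem and save replays one item. The sketch also assumes the item list is produced in a stable order on every run.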

Schedule → Postgres dual-write can diverge silently

See fragile-areas.md › Schedule → Postgres dual-write. Listed here to keep the prod-risk list complete: a single failed pg-sync write means the live UI and reporting layer disagree with no alert. Check Rollbar for pg-sync errors when reports look off.

Three-Firebase-project drift

See fragile-areas.md › Three-Firebase-project drift. Project IDs (pivot-dev-59310 / pivot-not-production-project / pivot-inc) appear in src/config.ts, backend .env.<project-id> files, mobile plists, GitHub Actions, and Fastlane. Pointing the running app at the wrong project is one typo away.

  • Do: before any deploy, confirm firebase use and the exact env file. Check the Firebase init log in browser console / function logs.

Silent failures (no alarm, wrong data)

POS employee-mapping cache staleness

See fragile-areas.md › pos-sync employee mapping cache. A stale mapping silently attaches wages, sales and tip-outs to the wrong employee. Re-listed here because the symptom is invisible until payroll runs.

Background push handler — AsyncStorage failures

See fragile-areas.md › Background message handler (PIVOT-Mobile/index.js). Failures are silent; badge counts drift; users think notifications are broken without a clear cause.

APNs / provisioning-profile expiration is unmonitored (already expired)

Files: PIVOT-Mobile/ios/pivot3/Info.plist, PIVOT-Mobile/AppStore_jobs.pivot.qa.mobileprovision

The provisioning profile committed at AppStore_jobs.pivot.qa.mobileprovision has ExpirationDate: 2024-10-29, so it expired 7+ months ago. Fastlane uses match(readonly: true) to fetch certs from the cert repo but does no expiry validation. There is no before_all check, no CI preflight, and no Sentry/Slack alert.

When the APNs key in App Store Connect rotates or a cert lapses, iOS push for prod stops silently until someone notices badges drifting.

  • Do: add a cert_age_check Fastlane lane that runs match then parses each profile's NotAfter and fails if any cert/key is within 30 days of expiry. Wire it into before_all. Replace the expired profile in the repo as part of the next release.
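
The lane itself would be Ruby in the Fastfile; the threshold rule it needs is sketched here in TypeScript to match the rest of the codebase. SigningAsset and expiringWithin are hypothetical names, and the sketch assumes the expiry dates have already been parsed out of the profiles and certificates.

```typescript
interface SigningAsset {
  name: string;
  expiresAt: Date; // a profile's ExpirationDate or a certificate's NotAfter
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns every asset that is already expired or expires within the threshold.
function expiringWithin(
  assets: readonly SigningAsset[],
  now: Date,
  thresholdDays = 30,
): SigningAsset[] {
  const cutoff = now.getTime() + thresholdDays * DAY_MS;
  return assets.filter((asset) => asset.expiresAt.getTime() <= cutoff);
}
```

A preflight would fail the build whenever the returned list is non-empty. Already-expired assets, such as the profile dated 2024-10-29, fall under the same check because their expiry is earlier than any future cutoff.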

🚨 iOS NSAllowsArbitraryLoads is true (active cleartext-HTTP vulnerability)

File: PIVOT-Mobile/ios/pivot3/Info.plist

Verified: the plist contains an unscoped NSAllowsArbitraryLoads = true under NSAppTransportSecurity. There is no NSExceptionDomains scoping. App Transport Security therefore no longer blocks cleartext http:// in production builds, so any auth token, GPS punch, or chat payload sent over plain HTTP is MITM-able on hostile networks. App Store review can also flag this.

  • Do: remove the key (or set it to false) immediately. If dev or staging actually needs cleartext for emulator domains, scope it via NSExceptionDomains listing only those specific hosts. Add a CI check that fails the build when NSAllowsArbitraryLoads is true in a release configuration.
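
A sketch of the detection half of that CI check, assuming Info.plist is stored as XML text (not a binary plist) and that the value is a literal rather than a build-setting variable. The function and constant names are hypothetical; reading the file and restricting the check to release configurations are left to the CI step.

```typescript
// Matches the exact key followed by a <true/> value in an XML plist.
// Related keys such as NSAllowsArbitraryLoadsInWebContent do not match,
// because the pattern requires </key> directly after the name.
const ARBITRARY_LOADS_TRUE =
  /<key>\s*NSAllowsArbitraryLoads\s*<\/key>\s*<true\s*\/>/;

function allowsArbitraryLoads(plistXml: string): boolean {
  return ARBITRARY_LOADS_TRUE.test(plistXml);
}
```

A CI step would read the plist contents, call allowsArbitraryLoads, and exit non-zero when it returns true. A regex is a shortcut here; a proper plist parser would be more robust if the file layout ever changes.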

Cloud Functions: uneven runWith coverage, no project-wide maxInstances

Pattern: setGlobalOptions({ region: 'us-central1' }) in functions/index.js sets region only — no default memory, timeout, or maxInstances. Some functions specify runWith({ memory, timeoutSeconds }) (integrations, crons, some on-call handlers); core HTTP endpoints like setEmployeeClockInV2, setEmployeeClockOutV2, getCompanyTimeNew do not.

A runaway query, hot Pub/Sub loop, or customer flood can scale instances unboundedly. The Cloud SQL connection storm is only partly mitigated: functions/shared/infrastructure/pg-client.ts sets max: 5 per instance, but N instances × 5 connections can still exhaust the SQL connection limit. The v2 default timeout is 60s for both HTTP and event-driven functions; only the configurable maximums differ (3,600s for HTTP, 540s for event-driven).

  • Do: set explicit maxInstances and timeoutSeconds on every new function (and on the hot HTTP endpoints above when you next touch them). Add memory hints for known-heavy paths (export generators, pg-sync, integration syncs). Consider a project-wide maxInstances default via setGlobalOptions.
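
The "N instances × 5 connections" arithmetic can be turned around to pick a ceiling. The sketch below is hypothetical, and the connection limit of 100 used in the example is an assumed figure, not the real Cloud SQL setting; check the instance's actual max_connections before using any number.

```typescript
// Largest instance count whose combined pools fit under the SQL connection limit.
function maxInstancesBudget(
  sqlConnectionLimit: number,
  poolMaxPerInstance: number,
  reservedConnections = 0, // headroom for migrations, admin sessions, other clients
): number {
  if (poolMaxPerInstance <= 0) {
    throw new Error("poolMaxPerInstance must be positive");
  }
  const usable = sqlConnectionLimit - reservedConnections;
  return Math.max(0, Math.floor(usable / poolMaxPerInstance));
}

// Assumed limit of 100, pool cap of 5 (pg-client.ts), 10 connections held back.
const exampleBudget = maxInstancesBudget(100, 5, 10); // (100 - 10) / 5 = 18
```

The budget is shared by every function that opens a pool against the same database, so the per-function maxInstances values should sum to no more than this figure.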

Deploy-time foot-guns

GitHub Actions deploy: project-safe and gated, but no smoke test

File: .github/workflows/deploy.yml

The pipeline is safer than it looks:

  • Project targeting is explicit — every Firebase step passes --project <env-id> (or the GitHub Actions variable vars.FIREBASE_PROJECT_ID), overriding any stale firebase use on the runner.
  • Production deploys are manual-only (workflow_dispatch); dev, main→staging, and uat auto-deploy on push but reference GitHub environment: names that typically carry UI-configured approval rules.
  • Last green SHA is cached per environment (lines 174–179, 289–297), enabling fast revert+redeploy.

Real gaps:

  • No post-deploy smoke test — success is declared the moment firebase deploy returns 0, with no curl against the live URL.

  • No automated rollback — Firebase Functions v2 has no one-click rollback; you redeploy the previous SHA.

  • Do: before a hot deploy, note the prior green commit SHA. After any non-trivial deploy, hit the new function URL and the web app's health endpoint before walking away. Don't deploy to production on Friday afternoon.
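
A sketch of what a post-deploy smoke test could look like. smokeTest and FetchLike are hypothetical names, and the URLs to probe are not defined anywhere in this document, so they are left to the caller. The fetch function is injected to keep the example testable; on Node 18+ the global fetch fits the same shape.

```typescript
// Minimal shape of the response fields the check needs.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

// Probes each URL once and returns a description of every failure.
async function smokeTest(
  urls: readonly string[],
  fetchFn: FetchLike,
): Promise<string[]> {
  const failures: string[] = [];
  for (const url of urls) {
    try {
      const res = await fetchFn(url);
      if (!res.ok) failures.push(`${url} -> HTTP ${res.status}`);
    } catch (err) {
      // Network-level failures (DNS, TLS, timeout) count as failures too.
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${url} -> ${reason}`);
    }
  }
  return failures;
}
```

A workflow step would run this after firebase deploy returns and fail the job when the list is non-empty, which turns "deploy returned 0" into "deploy returned 0 and the endpoints answered".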

Manual ops scripts lack local safeguards

Folder: functions/scripts/

Most scripts (set-custom-claim.js, copy-company-to-dev.ts, duplicate-data.ts) hard-code a project ID or load a service-account file, initialize Firebase Admin silently, and run. They print nothing about which project they're targeting and have no --dry-run flag. set-super-admin-claims.js is the exception — it supports --dry-run and multi-env targeting; setup-mia-agent.js at least prints the project ID.

Note: Firebase Admin SDK writes are captured in Cloud Logging server-side, so an audit trail exists — it just lives where you have to go look for it after the fact. The real risk is silent mis-targeting, not absent logs.

  • Do: print the project ID at the top of any new script and pause for confirmation if it matches a production-shaped name. Prefer scripts that support --dry-run. Run against staging first. After running, glance at Cloud Logging to confirm the writes hit the project you intended.
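
A sketch of the target check for a new script. It assumes pivot-dev-59310 and pivot-not-production-project are the two non-production projects and pivot-inc is production; that mapping is inferred from the project list earlier in this document and should be confirmed. All function names are hypothetical.

```typescript
// Assumption: these two IDs are non-production and pivot-inc is production.
const NON_PRODUCTION_PROJECTS: ReadonlySet<string> = new Set([
  "pivot-dev-59310",
  "pivot-not-production-project",
]);

// Anything not explicitly known as non-production requires confirmation.
// A substring test for "prod" would get this wrong in both directions:
// it would flag "pivot-not-production-project" and miss "pivot-inc".
function needsConfirmation(projectId: string): boolean {
  return !NON_PRODUCTION_PROJECTS.has(projectId);
}

// First line a script should print before touching Firebase Admin.
function targetBanner(projectId: string, dryRun: boolean): string {
  const mode = dryRun ? "DRY RUN (no writes)" : "LIVE (writes enabled)";
  return `Target project: ${projectId} | mode: ${mode}`;
}
```

Using an allowlist means a mistyped project ID also triggers the confirmation prompt, which is the mis-targeting case this section is worried about.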

Mobile env binding has no post-build assertion

File: PIVOT-Mobile/src/active.env.js (generated by set-dev / set-stage / set-prod npm scripts before each build)

Not a true race — iOS and Android CI workflows use per-platform concurrency: groups, so parallel runs on the same branch are serialized and each runs in its own VM. The real risk is crash recovery: if a build crashes after env setup but before the bundle finishes, a retry that skips the env step can produce a bundle pointing at the wrong backend. The bundled JS is not introspectable after the fact, so there is no way to verify what env actually shipped.

  • Do: bake the env intent into the bundle as an explicit constant at build time (e.g. write export const BUILD_ENV = 'prod' next to active.env.js), then have the lane read it back post-build and assert it matches the lane's intent before upload to TestFlight / Play Console.
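
A sketch of the read-back assertion, assuming the generated marker file contains a line of the form export const BUILD_ENV = 'prod'. The env names dev / stage / prod are inferred from the set-dev / set-stage / set-prod script names, and both functions are hypothetical.

```typescript
type BuildEnv = "dev" | "stage" | "prod";

// Extracts the value from a line such as: export const BUILD_ENV = 'prod'
function readBuildEnv(source: string): string | null {
  const match = /export\s+const\s+BUILD_ENV\s*=\s*['"]([^'"]+)['"]/.exec(source);
  return match?.[1] ?? null;
}

// Throws when the marker is missing or disagrees with the lane's intent.
function assertBuildEnv(source: string, expected: BuildEnv): void {
  const actual = readBuildEnv(source);
  if (actual !== expected) {
    throw new Error(
      `BUILD_ENV is ${actual ?? "missing"}, but the lane expected ${expected}`,
    );
  }
}
```

The lane would pass in the contents of the marker file after the bundle step and before upload. Treating a missing marker as a failure covers the retry-skipped-the-env-step case described above.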

cors.json is * for GET/HEAD on Cloud Storage

File: cors.json

CORS doesn't gate fetchability — direct navigation and server-side fetches always work regardless. What it grants is cross-origin browser-script fetches: any website's JS can fetch() URLs in the bucket. So if a signed or unauthenticated bucket URL ever ends up in a hostile page, that page's JS can read it. Not a critical risk on its own, but means: don't rely on URL obscurity, and don't store sensitive content at predictable bucket paths.

Auth & data integrity

RTDB rules: 1,114 lines; tests exist but coverage is uneven

File: database.rules.json

There is a rules-test suite: tests/rules/ holds ~20 Jest specs (notifications, payroll, schedule, employees, companies, requests, chat, etc.) run via firebase emulators:exec per jest.rules.config.js. The risk isn't absence of testing — it's that a rules change to an untested path still silently breaks reads/writes without surfacing as a TS or build error. Combined with the 148+ hard-coded RTDB paths in the frontend (fragile-areas.md), the blast radius of a single rule typo is large.

  • Do: when you touch a rule, run the matching spec under tests/rules/ against the emulator. If your path has no coverage, add a spec in the same PR — the existing files are good templates.

Firestore rules cover three collections; everything else hits the catch-all deny

File: firestore.rules

Rules exist for IntegrationSettings, IntegrationPluginSyncs, IntegrationProviderHealthStates, plus a catch-all deny. If anyone adds a new Firestore collection without an explicit rule, the deny catches it — but they may then assume rules are in place when none have been written for their specific use case.

  • Do: every new Firestore collection ships with explicit rules in the same PR.

Cloud Functions: 33 async triggers, uneven traceId coverage

Pattern: ~33 on-*.ts triggers: 21 under functions/db/, 6 under functions/modules/*/endpoints/events/, plus a handful in integrations / storage.

Async event-driven design means a failed downstream trigger doesn't roll back the source action — that's working as intended. The observability gap is harder: when a chain breaks (client write OK, notification trigger errors), correlating the original write to the downstream failure across function boundaries is manual log-grepping.

integration-engine already does this right — its task.service.ts generates a traceId and propagates it via withLogContext() (see functions/shared/infrastructure/logger.ts). The functions/db/ triggers (pg-sync listeners, etc.) do not. Cloud Logging assigns a trace per invocation automatically, but that field does not cross invocations.

  • Do: for any new multi-step trigger chain, generate a traceId at the entry point, attach it to the emitted event payload, and log it on every branch. Follow the integration-engine pattern, not the db/ listener pattern.
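
A sketch of the propagation rule only: mint once at the entry point, copy on every hop, log on every branch. It does not reproduce withLogContext() or task.service.ts, whose signatures are not shown in this document; every name below is hypothetical.

```typescript
interface TracedEvent<T> {
  traceId: string;
  payload: T;
}

// Entry point: mint the ID once. `newId` is injected; in a Cloud Function it
// could be randomUUID from node:crypto.
function startTrace<T>(payload: T, newId: () => string): TracedEvent<T> {
  return { traceId: newId(), payload };
}

// Downstream step: reuse the incoming ID instead of minting a new one.
function continueTrace<T, U>(
  incoming: TracedEvent<T>,
  payload: U,
): TracedEvent<U> {
  return { traceId: incoming.traceId, payload };
}

// One structured line per branch, always carrying the traceId.
function traceLine(traceId: string, step: string, message: string): string {
  return JSON.stringify({ traceId, step, message });
}
```

Because the ID travels inside the event payload, it survives the hop between invocations, which the per-invocation trace field assigned by Cloud Logging does not.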

2. Technical debt / shortcuts

Listed by surface. None of these block product work today; all of them slow new hires, inflate test surface, or compound the risks above.

Frontend (web)

| Debt | Where | Why it costs |
| --- | --- | --- |
| Firebase JS SDK v8 (EOL Aug 2022) | src/index.tsx and ~all imports | Locked into namespaced API; no v9+ tree-shaking; no security backports for 3+ years |
| ~800 any / @ts-ignore sites (grep -rE -e '\bany\b' -e '@ts-ignore' --include='*.ts' --include='*.tsx' src/; ~382 : any annotations + 34 @ts-ignore directives at last count) | across src/ | Refactors hit invisible walls; type errors land in prod |
| ~26 unit tests; e2e dormant | src/**/__tests__, e2e/ (last commit ~9 weeks ago) | Payroll, schedule, exports have no regression net; the e2e specs exist but nobody's run them recently |
| Dual state stores | Redux + Zustand; see fragile-areas.md › Dual state stores | Easy to double-store; ambiguous source of truth |
| Mega-components / mega-context | see fragile-areas.md › Mega-components | Every change risks unrelated regressions |
| 148+ hard-coded RTDB paths | see fragile-areas.md | Schema migration = global grep |
| No i18n missing-key detection | src/i18n.ts | Untranslated strings ship; FR placeholders linger as TODO |
| tasks.todo (~972 lines) | tasks.todo | Mix of done / pending / forgotten: unclear signal, not in CI |
| Inline style={{ }} and styled-components above logic | scattered | Violates CLAUDE.md §"Component must look like" |

Backend

| Debt | Where | Why it costs |
| --- | --- | --- |
| console.* mixed with logger.* | ~1,420 raw console.* calls across functions/ | No structured fields, no PII redaction, no Cloud Logging severity |
| Legacy zones outside FSD | see fragile-areas.md › Backend legacy zones | db/, http/, on-call/, cron/, schedule/, services/: layer rules don't apply |
| requests imports services/ | see fragile-areas.md | Violates module boundary; historical, not yet refactored |
| Migrations folder mixes one-time and idempotent | functions/migrations/ | No header convention says which is which |
| Manual DI in 15 modules | functions/modules/*/container.ts | Intentional (decisions.md › Manual dependency injection) but new hires routinely look for an inversify-style container |

Mobile

| Debt | Where | Why it costs |
| --- | --- | --- |
| 0 test files | PIVOT-Mobile/ | RN upgrades, native-module bumps, patch decay: all blind |
| ~115 any / @ts-ignore sites (107 \bany\b + 8 @ts-ignore at last count; narrower : any count is 56) | PIVOT-Mobile/src/ | Lower than web but still untyped surfaces around native bridges |
| main.jsbundle (~10 MB), pivot3.ipa (~22 MB), pivot3.app.dSYM.zip (~42 MB) on disk | PIVOT-Mobile/ root | Gitignored, but devs sometimes commit by accident in a panic. Move to GitHub Releases. |
| Legacy moment use in a few files | PIVOT-Mobile/ (~15 imports vs ~90 dayjs imports) | dayjs is the active library; clean up the tail when you touch those files |
| 2 patches in patches/ | PIVOT-Mobile/patches/ (react-native-image-resizer, react-native-pdf) | Silent IOUs against the next RN / native-module upgrade |

3. Security audit

Run through this list before the next mobile release or rotation review. Items marked 🚨 should be addressed first.

Past incident: Play Store service-account key was committed (now resolved)

A Google Play service-account private key for pivot-play-deploy@pivot-inc.iam.gserviceaccount.com was committed at PIVOT-Mobile/fastlane/play-store-key.json in commit 2d7c42ab ("Switch Android CI/CD from APK to AAB"). The production playstore lane in fastlane/Fastfile deploys to com.pivot3 with release_status: "completed", so the key could have shipped a public Play Store build.

Resolved in PR #501 ("fix: remove Play Store service account key and fix .gitignore"): the file was removed from the working tree and fastlane/play-store-key.json was added to .gitignore in the same change. The mobile repo no longer contains the key. Whether the historical git history was purged or only the working tree was cleaned is worth confirming if the repo has ever been public.

  • Standing lessons:
    • Never paste service-account JSON next to the lane that consumes it. CI should echo "$PLAY_STORE_KEY" > fastlane/play-store-key.json from a secret at runtime, never check it in.
    • Any committed credential should be treated as compromised even after a working-tree delete — rotate the key in GCP, not just the file in git.
    • When you add a new credential-shaped file to any release lane, grep .gitignore first.

Local-only secrets (gitignored, present on dev machines)

These files exist on developer disks but are correctly gitignored. If you find one not covered by .gitignore after a refactor, treat it as a regression equivalent to the play-store-key issue above.

  • PIVOT-Mobile/pivot-inc-656b450b7a70.json — Firebase deployment service account
  • PIVOT-Mobile/AppleSignInKey_ZZRY22BNJ3.p8 — Sign in with Apple private key
  • PIVOT-Mobile/fastlane/prod_auth_key.p8 — App Store Connect key
  • PIVOT-Mobile/AppStore_jobs.pivot.qa.mobileprovision — provisioning profile
  • PIVOT-Mobile/fastlane/.env — Fastlane Apple-ID credentials
  • pivot/.env — has dev/staging/prod Firebase web keys in one file

Firebase web/iOS API keys (in GoogleService-Info-*.plist and google-services.json) are committed and are designed to be public — they identify the project, they don't authorize it. Pair with Firebase App Check if you ever need to lock down RTDB to attested clients only.

PII in logs

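  • ~1,420 raw console.* sites across functions/ bypass the structured-logging redaction layer. Cloud Logging keeps these for the full retention period of the log bucket (30 days by default for _Default, longer where custom retention or export sinks are configured).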

  • Migrations and ops scripts occasionally print user records or company-shaped data while debugging. Cloud Logs cannot be retroactively scrubbed selectively.

  • Do: never write a log line that contains a token, password, full request body, or full user record. Use the structured logger from functions/shared/infrastructure/logger.ts and pass only the keys you need.
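
"Pass only the keys you need" can be enforced with an allowlist helper, sketched below. pickFields is a hypothetical name and the record fields in the usage note are made up; the sketch shows only how the log payload is built, not the structured logger's own API.

```typescript
// Copies only the named keys, so unlisted fields can never reach a log line.
function pickFields<T extends object, K extends keyof T>(
  record: T,
  keys: readonly K[],
): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of keys) {
    out[key] = record[key];
  }
  return out;
}
```

For a record with uid, role, email and token fields, pickFields(user, ["uid", "role"]) yields an object with only uid and role. An allowlist fails safe: a field added to the record later stays out of the logs until someone names it explicitly.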


4. Quick reference: when something is broken

| Symptom | First place to check |
| --- | --- |
| Stripe customer charged / linked twice | Duplicate webhook delivery: stripe-webhook-handler.ts has no event.id dedup; handleSubscriptionCreated is the risky path |
| Integration suddenly returns 401 / malformed URL | secret('NAME') returned ""; check the call site for a missing guard (most likely Lightspeed or Cluster POS) |
| Reports / dashboards lag the UI | pg-sync errors in Rollbar; see fragile-areas.md › pg-sync |
| Push silently stops on iOS prod | APNs cert / provisioning profile expiry; the committed profile expired 2024-10-29, so check App Store Connect for the live state |
| Wrong backend on mobile after a release | PIVOT-Mobile/src/active.env.js not regenerated on retry; re-run the lane from a clean state |
| Firestore rule "denied" for new collection | A rule was never written; default deny in firestore.rules catches it |
| Function spinning up thousands of instances | No maxInstances on a hot HTTP endpoint; clamp it at the function definition and verify the pg-client pool cap is still 5 |
| Audit asks "who changed this user's claim?" | Cloud Logging has the Admin SDK call; the ops script (functions/scripts/set-custom-claim.js) didn't print one locally |

For symptoms tied to specific code (payroll totals, POS sync, mobile clock-in, etc.), the table in fragile-areas.md › Quick Reference covers them.


See also

  • Fragile areas — "this code is load-bearing complexity, here's what to know"
  • Decisions — the why behind RTDB-first, manual DI, dual state, Hono, etc.
  • Architecture overview — the system map
  • CLAUDE.md — the coding-rules contract; the commit checklist at the bottom is your last line of defence