The worst time to design your incident response process is at 2am when an alert fires and nobody knows who owns the call.
Why startup IR plans fail
A lot of startups have something that looks like an incident response plan: a Google Doc titled “Security Incidents,” a Slack channel #sec-incidents, and a vague understanding that you should “escalate to leadership” for serious things. That’s not a plan. That’s a prayer.
When a real incident hits, the absence of a plan shows up as:
- 20 minutes of Slack messages trying to figure out who is responsible
- Someone remediating before you’ve finished scoping, destroying forensic evidence in the process
- Customers notified inconsistently because no one agreed on the communication template
- A Privacy Protection Authority (PPA) notification missed because no one tracked the 72-hour clock
A functional IR plan has defined ownership, pre-built playbooks for your specific threat model, integrated tooling that reduces manual work during the incident, and a post-incident review process that actually changes how you operate.
The short version
| What you need | Why it matters |
|---|---|
| Severity definitions | Stops arguments about urgency while the attacker is still in your systems |
| Source-specific playbooks | Your team follows steps instead of improvising |
| Pre-staged investigation queries | Detection and scoping happen in minutes, not hours |
| Evidence preservation procedure | Keeps forensic artifacts intact for legal and PPA purposes |
| Post-incident review process | Turns incidents into system improvements |
The incident response framework
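Use NIST SP 800-61 revision 2 as your baseline. It defines four phases and groups containment, eradication, and recovery into one; splitting that phase out gives six working phases: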
| Phase | What happens |
|---|---|
| Preparation | Build the plan, tools, and playbooks before an incident occurs |
| Detection & Analysis | Identify the incident and scope its full impact |
| Containment | Isolate affected systems without destroying forensic evidence |
| Eradication | Remove the attacker’s access, persistence mechanisms, and artifacts |
| Recovery | Restore systems and verify clean state before returning to production |
| Post-Incident Activity | Document the timeline, identify gaps, assign corrective actions |
Most startups fail at Preparation and Detection. The first time they think seriously about IR is during an active incident, which is the worst possible moment.
Phase 1: Preparation
Define severity levels
Write these down and get leadership to agree on them before any incident. The purpose is to remove judgment calls during high-stress moments.
| Severity | Definition | Target response time | Who responds |
|---|---|---|---|
| P1 — Critical | Active breach confirmed, ransomware active, data exfiltration in progress, production down | < 15 minutes | On-call engineer, CTO, legal |
| P2 — High | Credential compromise suspected, privilege escalation detected, uncontained threat | < 1 hour | On-call engineer, security lead |
| P3 — Medium | Policy violation, anomalous access pattern, single failed attack with no impact | < 4 hours | On-call engineer |
| P4 — Low | Informational alert, no evidence of actual impact | Next business day | Assigned analyst |
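One way to keep the agreed definitions from drifting is to store them as data that your paging or ticketing tooling can read. The sketch below is only an illustration of the table above: the `Severity` class, the `SEVERITIES` mapping, and `response_deadline` are hypothetical names, and "next business day" is modeled as "no fixed deadline".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class Severity:
    name: str
    response_target: Optional[timedelta]  # None means "next business day"
    responders: Tuple[str, ...]


# Values copied from the severity table; adjust to whatever leadership signs off on.
SEVERITIES = {
    "P1": Severity("Critical", timedelta(minutes=15), ("on-call engineer", "CTO", "legal")),
    "P2": Severity("High", timedelta(hours=1), ("on-call engineer", "security lead")),
    "P3": Severity("Medium", timedelta(hours=4), ("on-call engineer",)),
    "P4": Severity("Low", None, ("assigned analyst",)),
}


def response_deadline(level: str, alert_time: datetime) -> Optional[datetime]:
    """Latest acceptable first response, or None when there is no fixed deadline."""
    target = SEVERITIES[level].response_target
    return None if target is None else alert_time + target
```

Keeping the table in one machine-readable place means the on-call tool and the written plan cannot quietly disagree.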
Build your contact list — offline
Document this somewhere accessible without internet. Slack, Notion, and your ticketing system may be unavailable or compromised during a serious incident.
- Internal: On-call rotation, CTO, VP Engineering, legal counsel, CFO (for insurance claims), PR/comms contact
- External: Cloud provider premium support, cyber insurance carrier incident hotline (different number than claims), forensics retainer if you have one
- Regulatory: CERT-IL (1-800-611-911), PPA breach notification portal URL
- Customer-facing: B2B customers have contractual breach notification requirements — know your point of contact at each account
Pre-stage your investigation tooling
During an incident, you want to be running queries, not installing software or searching for passwords. Pre-create:
- A Slack channel template that auto-populates with the incident log template when triggered
- Read-only access credentials for responders who don’t normally have production access
- Saved SIEM queries for the five most common investigation scenarios (see Phase 2 below)
- An immutable evidence storage location in a separate, isolated account — write-once, cannot be deleted or overwritten
The immutable evidence store configuration pattern:
Evidence bucket:
→ Versioning: enabled
→ Object Lock: COMPLIANCE mode, 730-day minimum retention
→ Access: write-only for incident responders, read for legal/forensics only
→ Hosted in a dedicated security account, separate from production
→ No delete permissions granted to any human principal
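If your evidence store is an S3-style bucket, the pattern above can be written down as data and checked automatically. This is a sketch under that assumption: the inner dictionaries follow the shape of S3's versioning and Object Lock configuration bodies, while the wrapper keys (`versioning`, `object_lock`) and both function names are hypothetical. Access rules and account separation are not covered here.

```python
import json

RETENTION_DAYS = 730  # minimum retention from the pattern above


def evidence_bucket_settings(retention_days: int = RETENTION_DAYS) -> dict:
    """Build the versioning and Object Lock settings for the evidence bucket."""
    return {
        "versioning": {"Status": "Enabled"},
        "object_lock": {
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": retention_days}},
        },
    }


def check_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means the settings match the pattern."""
    problems = []
    if settings["versioning"].get("Status") != "Enabled":
        problems.append("versioning is not enabled")
    retention = settings["object_lock"]["Rule"]["DefaultRetention"]
    if retention.get("Mode") != "COMPLIANCE":
        problems.append("Object Lock is not in COMPLIANCE mode")
    if retention.get("Days", 0) < RETENTION_DAYS:
        problems.append("retention is shorter than 730 days")
    return problems


if __name__ == "__main__":
    print(json.dumps(evidence_bucket_settings(), indent=2))
```

Running the check on a schedule, against settings read back from the live bucket, catches drift before the bucket is needed.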
Phase 2: Detection and Analysis
Detection is where your SIEM pays for itself. For each incident type you need pre-built alert rules that fire into your alerting pipeline, and pre-built investigation queries that tell you the scope.
The five incident types you’re most likely to face
Based on what we see across Israeli startup environments:
- Account takeover — compromised identity provider or cloud console credentials
- Privilege escalation — IAM role abuse, policy manipulation, assume-role chains
- Data exfiltration — bulk object storage reads, unusual API query volumes
- Insider threat or departing employee — access abuse in the off-boarding window
- Cloud misconfiguration — public storage bucket, overly permissive firewall rule
Pre-built investigation queries (pseudo-code)
Save these as named queries in your SIEM. When an alert fires, the investigator runs the relevant query immediately rather than writing it from scratch under pressure.
Account takeover — impossible travel detection:
QUERY authentication_events
WHERE event_type = "login_success"
AND timestamp >= NOW - 30 minutes
GROUP BY user_id
HAVING count_distinct(country) > 1
RETURN user_id, list(source_ip), list(country), list(timestamp)
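For teams that want to test the logic before committing it to a SIEM, the same rule can be sketched in Python over exported events. The event fields mirror the pseudo-query; the function name and the in-memory list of dictionaries are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def impossible_travel(events, now, window=timedelta(minutes=30)):
    """Return {user_id: [events]} for users with successful logins
    from more than one country inside the window ending at `now`."""
    by_user = defaultdict(list)
    for e in events:
        if e["event_type"] == "login_success" and now - window <= e["timestamp"] <= now:
            by_user[e["user_id"]].append(e)
    return {
        user: sorted(logins, key=lambda e: e["timestamp"])
        for user, logins in by_user.items()
        if len({e["country"] for e in logins}) > 1
    }
```

Counting distinct countries is a coarse proxy for travel speed, so expect false positives from VPN users and tune with an allowlist if needed.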
Privilege escalation — high-risk permission changes:
QUERY cloud_audit_events
WHERE event_name IN (
"AttachRolePolicy", "CreateRole", "PassRole",
"PutUserPolicy", "UpdateAssumeRolePolicy",
"CreatePolicyVersion", "SetDefaultPolicyVersion"
)
AND actor_type != "root"
AND timestamp >= NOW - 24 hours
RETURN timestamp, actor_arn, event_name, source_ip, target_resource
ORDER BY timestamp DESC
Data exfiltration — bulk object access:
QUERY storage_access_events
WHERE operation = "GET_OBJECT"
AND response_code = 200
AND timestamp >= NOW - 1 hour
GROUP BY requester_identity
HAVING sum(bytes_sent) > 500_000_000 -- tune threshold per environment
RETURN requester_identity, count(requests), sum(bytes_sent), min(timestamp), max(timestamp)
ORDER BY sum(bytes_sent) DESC
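The 500 MB threshold in the query is a starting point that needs tuning. One possible approach, assuming you can export historical per-hour byte totals for an identity, is to take the mean plus three standard deviations and keep the 500 MB figure as a floor. The heuristic and the function name are illustrative assumptions, not a recommendation from any standard.

```python
import statistics


def suggest_threshold(hourly_bytes, floor=500_000_000, sigmas=3):
    """Suggest an alert threshold from historical hourly byte totals.

    Uses mean + sigmas * population standard deviation, never going
    below `floor`. Requires at least one historical data point.
    """
    mean = statistics.mean(hourly_bytes)
    spread = statistics.pstdev(hourly_bytes)
    return max(floor, mean + sigmas * spread)
```

A backup job that routinely reads several gigabytes an hour would get a higher threshold than a human analyst, which keeps the alert meaningful for both.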
Departing employee access after off-boarding date:
QUERY authentication_events AS e
JOIN terminated_employees AS t ON e.user_id = t.user_id
WHERE e.timestamp > t.termination_date
AND e.event_type = "login_success"
RETURN e.timestamp, e.user_id, e.source_ip, e.event_type
ORDER BY e.timestamp DESC
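If your HR system and identity logs do not live in the same SIEM, the join can be done on exported data. A minimal sketch, assuming `terminations` is a mapping from user ID to termination time that you build from an HR export (a hypothetical input):

```python
from datetime import datetime


def post_termination_logins(auth_events, terminations):
    """Return successful logins that happened after the user's termination time,
    newest first. `terminations` maps user_id -> termination datetime."""
    hits = [
        e for e in auth_events
        if e["event_type"] == "login_success"
        and e["user_id"] in terminations
        and e["timestamp"] > terminations[e["user_id"]]
    ]
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)
```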
Scoping the incident
When a detection fires, answer five questions (who, what, when, where, and how) before taking any containment action. Premature containment, such as disabling a user account without checking for active sessions or lateral movement, can tip off an attacker, cause production impact, and destroy forensic evidence.
| Question | Where to look |
|---|---|
| Who — which identity? | Identity provider logs, cloud IAM audit trail |
| What — what actions? | Cloud API audit events, application access logs |
| When — start time, still ongoing? | Earliest event timestamp, most recent event |
| Where — which resources? | Storage bucket names, compute instance IDs, database identifiers |
| How — what access vector? | Source IP, user agent, console vs. API, MFA state |
Document your answers before moving to containment. The scoping document becomes the foundation of your PPA notification and post-incident report.
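The "document before containing" rule is easier to enforce when the scoping document has a fixed shape. The sketch below assumes a simple free-text record with one field per question; `ScopingRecord` and its methods are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass
class ScopingRecord:
    who: str = ""    # which identity
    what: str = ""   # what actions were taken
    when: str = ""   # start time, and whether it is still ongoing
    where: str = ""  # which resources
    how: str = ""    # access vector

    def unanswered(self):
        """Names of the questions that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready_for_containment(self):
        """True only when every question has an answer written down."""
        return not self.unanswered()
```

An incident bot could refuse to post the containment checklist until `ready_for_containment()` returns true, or at least warn the responder about the open questions.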
Phase 3: Containment
Containment isolates the incident without destroying forensic evidence. The sequence is always: preserve first, then contain.
Preserve evidence before any remediation
Before disabling any account, terminating any instance, or modifying any access policy:
1. Export audit logs covering the incident window → write to immutable evidence store
- Cloud API audit trail
- Identity provider authentication logs
- Storage access logs for affected buckets
2. For affected compute instances:
- Take disk snapshots (label with incident ID and timestamp)
- Capture instance metadata (running processes, network connections, scheduled tasks)
- Do NOT terminate the instance yet
3. Record exact timestamps of:
- When the alert fired
- When the investigation started
- When evidence was preserved
- When containment began
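A small manifest written alongside each exported artifact makes the preservation step auditable later. The sketch below records a SHA-256 digest and a UTC collection time for each artifact. Hashing is an addition for illustration, not part of the steps above, and `manifest_entry` and its field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def manifest_entry(incident_id, artifact_name, data, collected_at=None):
    """Describe one preserved artifact: what it is, when it was collected,
    and a digest that lets a later reader confirm it was not altered."""
    collected_at = collected_at or datetime.now(timezone.utc)
    return {
        "incident_id": incident_id,
        "artifact": artifact_name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "collected_at": collected_at.isoformat(),
    }
```

Write the manifest to the same immutable store as the artifacts so that the digests carry the same retention guarantees.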
Contain credential compromise
Disable the affected identity:
→ Set account to disabled/suspended state in your identity provider
→ Do not delete — deletion removes audit history
Revoke active sessions:
→ Call identity provider API to invalidate all active sessions for the user
→ Revoke all API keys and tokens issued to that identity
Apply emergency deny policy:
→ Attach a Deny-All permission policy to the cloud IAM user/role
→ This blocks further API calls even if a token was already issued
Scope the blast radius:
→ Query audit logs for all actions taken by this identity in the past 30 days
→ Identify any roles assumed, users created, or permissions granted
→ Each of these is a potential persistence mechanism requiring separate remediation
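The emergency deny policy is worth keeping as a ready-made file so nobody writes JSON under pressure. Assuming AWS-style IAM, where an explicit deny overrides any allow, a deny-all document could look like the sketch below. The `Sid` value and the function name are illustrative.

```python
import json


def deny_all_policy():
    """Build an identity policy that explicitly denies every action on every resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EmergencyDenyAll",  # illustrative label
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be reviewed and stored with the playbook.
    print(json.dumps(deny_all_policy(), indent=2))
```

Test attaching and detaching it against a throwaway identity during a drill, so the first real use is not also the first use.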
Contain a compromised compute instance
1. Take disk snapshot → evidence store (before any changes)
2. Capture memory if malware analysis is required
3. Replace the instance's network security group with an isolation group
→ Isolation group: no inbound, no outbound, no exceptions
4. Do NOT terminate until forensic analysis is complete
5. If instance is in an auto-scaling group, remove it from the group first
to prevent the ASG from replacing it before analysis
Contain an exposed storage bucket
1. Enable access logging if not already active (for subsequent forensics)
2. Block all public access settings immediately
3. Remove any bucket policies granting public or cross-account access
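4. Rotate any credentials scoped to this bucket, including keys used to generate signed URLs (existing signed URLs stay valid until they expire or the signing credential is revoked)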
5. Query storage access logs to determine what was accessed, by whom, and when
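For step 2, it helps to verify the result rather than trust the console. The sketch below assumes an S3-style public access block with its four standard flags and checks a configuration that has been read back from the bucket. The function names are hypothetical, and fetching the live configuration is left to your own tooling.

```python
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def lockdown_config():
    """The target state: every public access flag enabled."""
    return {flag: True for flag in REQUIRED_FLAGS}


def public_access_gaps(current):
    """Return the flags that are missing or not set to True in `current`."""
    return [flag for flag in REQUIRED_FLAGS if current.get(flag) is not True]
```

An empty result from `public_access_gaps` is the evidence to record in the incident log, with a timestamp, before moving on to the bucket policy.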
Phase 5: Recovery
Before returning any system to production, verify clean state:
- Confirm persistence is removed — check for IAM users created, backdoor functions deployed, scheduled tasks added, or modified startup scripts during the compromise window
- Rotate all exposed credentials — API keys, service account credentials, database passwords, JWT signing secrets, and any secrets that may have been readable from compromised systems
- Verify log integrity: check your cloud audit trail for events like DeleteTrail, StopLogging, DeleteLogGroup, or equivalent. An attacker who covered their tracks is more dangerous than one who didn’t.
- Run standard detection rules for 24 hours against the recovered systems before declaring recovery complete. Reinfection in the first 24 hours is common when root cause wasn’t fully identified.
Legal notification obligations
For Israeli companies, several clocks start the moment you confirm a breach:
| Obligation | Deadline | Who to notify |
|---|---|---|
| PPA notification (Amendment 13) | 72 hours from discovery | Privacy Protection Authority breach portal |
| Individual notification (Amendment 13) | 30 days | Affected data subjects |
| INCD notification (National Cybersecurity Law 2026) | 12 hours (regulated sectors) | Israel National Cyber Directorate |
| Customer notification | Per contract — typically 24–72 hours | Designated customer contacts per your MSAs |
Loop in legal counsel the moment you classify a P1 or P2 incident where personal data may be involved. They need to track the notification deadlines and review your scope assessment before you make any statements to customers about what was or wasn’t accessed.
Do not make promises to customers about the scope of a breach until your SIEM investigation is complete and legal has reviewed the output. Premature statements that later prove incorrect create significant liability.
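Tracking the clocks is simple arithmetic, which makes it a good candidate for automation. The sketch below takes the statutory deadlines from the table and assumes each one is measured from the same discovery time; confirm the actual starting point of each clock with counsel. Customer deadlines are contract-specific and left out. All names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Durations copied from the table above.
DEADLINES = {
    "PPA notification": timedelta(hours=72),
    "Individual notification": timedelta(days=30),
    "INCD notification (regulated sectors)": timedelta(hours=12),
}


def notification_deadlines(discovered_at):
    """Map each obligation to the time by which it is due."""
    return {name: discovered_at + delta for name, delta in DEADLINES.items()}


def time_remaining(discovered_at, now):
    """Map each obligation to the time left; negative values mean overdue."""
    return {name: due - now for name, due in notification_deadlines(discovered_at).items()}
```

Posting the output of `time_remaining` into the incident channel every few hours keeps the deadlines visible to people who are busy with containment.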
Phase 6: Post-incident review
Within five business days of full recovery, run a post-incident review (PIR). The output is not a blame report — it’s a list of specific system improvements.
Structure the PIR around:
- Timeline reconstruction — minute-by-minute from first indicator to full containment, sourced from log data (not from memory)
- Detection gap — why did the alert fire when it did? What would have caught it sooner?
- Response gaps — where did the team lose time? What wasn’t covered by the playbook?
- Control failures — which security control, if in place, would have prevented or limited the incident?
- Action items — specific, assigned, time-bounded. Track in your engineering backlog, not a separate document.
The timeline reconstruction should come from your SIEM, not from team recollection. A parameterized query that takes the incident’s known indicators and returns a time-ordered event sequence is the most reliable artifact for the PIR and for PPA submission:
QUERY all_event_sources
WHERE (source_ip = <attacker_ip> OR actor_id = <compromised_user>)
AND timestamp BETWEEN <incident_start> AND <incident_end>
RETURN timestamp, source, event_type, actor_id, resource_id, result
ORDER BY timestamp ASC
Running that query and exporting the result takes minutes in a well-instrumented environment. Manual reconstruction from memory takes days and introduces errors.
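If some log sources are not in the SIEM, the same timeline can be assembled from exports. A minimal sketch, assuming each source is a list of event dictionaries with `timestamp`, `source_ip`, and `actor_id` fields as in the pseudo-query; `build_timeline` is a hypothetical helper.

```python
from datetime import datetime


def build_timeline(sources, start, end, attacker_ip=None, compromised_user=None):
    """Merge events from several sources into one time-ordered list.

    `sources` maps a source name to a list of event dicts. An event is kept
    when it matches either known indicator and falls inside the window.
    """
    rows = []
    for source_name, events in sources.items():
        for e in events:
            matches_ip = attacker_ip is not None and e.get("source_ip") == attacker_ip
            matches_user = compromised_user is not None and e.get("actor_id") == compromised_user
            if (matches_ip or matches_user) and start <= e["timestamp"] <= end:
                rows.append({**e, "source": source_name})
    return sorted(rows, key=lambda r: r["timestamp"])
```

All timestamps need to be in the same timezone before merging, otherwise the ordering is misleading.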
Final thought
An IR plan is only useful if it exists before the incident. The playbooks, severity definitions, pre-staged queries, and evidence procedures described here take a few days to build and return years of operational benefit. Every incident run through a documented process improves the next one.
If you want help wiring detection rules into your alerting workflow, or want a second opinion on your IR plan against the NIST 800-61 framework, contact us.