Every log format is someone’s best guess at what will matter later. The pipeline’s job is to make sure that guess doesn’t block everyone else.
The problem starts at the source
When you build a SIEM or a log aggregation pipeline, the first thing you notice is that every data source looks different.
CloudTrail uses eventName and userIdentity. Okta has eventType and actor. Your internal application ships whatever the backend engineer decided to put in the JSON payload three years ago. A SaaS vendor’s webhook has its own envelope format with nested arrays and inconsistent field names.
This is normal. Every source was built by a different team with a different mental model. Expecting them to agree on a schema upfront is not realistic — especially in a company that is moving fast.
The common response is to write a custom parser for each source and bolt it directly into the pipeline. That works for three sources. By the time you have a dozen, the parser layer becomes a maintenance burden, and adding source 13 requires negotiating with every consumer of the data.
There is a better way to think about this.
The short version
| Idea | Why it matters |
|---|---|
| Define events at the system level | One normalized schema drives detection rules, dashboards, and queries — regardless of source |
| Producers keep their own format | Developers ship logs in whatever shape fits their system; normalization happens in the pipeline |
| New sources are additive, not disruptive | Adding a log source does not break existing rules or require downstream changes |
What “system-wide events” actually means
A system-wide event is a normalized record that represents something that happened in your environment — regardless of which system reported it.
The idea is to define a common event model at the pipeline level:
- Every event has a `timestamp`, a `source_ip`, a `tenant_id`, and an `event_type`
- Fields like `actor`, `action`, and `target` are mapped from source-specific names
- Source-specific fields you want to keep are preserved alongside the normalized ones
The pipeline — not the producer — owns the translation.
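A minimal sketch of that base schema as a ClickHouse table (the single-table layout, column names, and types are illustrative assumptions, not an exact production schema):

```sql
-- illustrative base schema; a real deployment may split tables per source
CREATE TABLE events
(
    timestamp   DateTime64(3),
    tenant_id   String,
    event_type  LowCardinality(String),  -- 'auth', 'iam_change', 'http_request', ...
    actor       String,
    action      String,
    target      String,
    source_ip   String,
    raw         String                   -- original source payload, kept for investigation
)
ENGINE = MergeTree
ORDER BY (tenant_id, event_type, timestamp);
```

With that in place, two very different sources map into the same shape: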
```
[CloudTrail event]        eventName: "AttachRolePolicy"
        |
        ▼
[Transform: cloudtrail]   action: "AttachRolePolicy"
                          actor: "arn:aws:iam::123456789012:user/admin"
                          source_ip: "203.0.113.10"
                          event_type: "iam_change"
        |
        ▼
[Normalized event]        stored in ClickHouse → queryable by detection rules

[Okta event]              eventType: "user.session.start"
        |
        ▼
[Transform: okta]         action: "user.session.start"
                          actor: "user@example.com"
                          source_ip: "198.51.100.22"
                          event_type: "auth"
        |
        ▼
[Normalized event]        stored in ClickHouse → same detection rules apply
```
The detection rule asking “did a user authenticate from a suspicious IP?” does not need to know whether the authentication came from Okta, your VPN, or your internal app. It queries for `event_type = 'auth'` with a `source_ip` that appears on a threat-intel list, and it fires.
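As a sketch, assuming the threat-intel list is loaded into its own table (the `threat_intel_ips` name and layout are assumptions), that rule is one query against the normalized layer:

```sql
-- fires for any source that maps to event_type = 'auth'
SELECT tenant_id, actor, source_ip, timestamp
FROM events
WHERE event_type = 'auth'
  AND source_ip IN (SELECT ip FROM threat_intel_ips)
  AND timestamp >= now() - INTERVAL 1 HOUR
```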
Why this matters for scale
Detection rules written against a normalized schema survive source changes.
If Okta changes its field names in an API update, you update the Okta transform. Every detection rule that queries the normalized layer keeps working without modification.
If you add a new log source — say, a 1Password Events API feed — you write one new transform. You do not touch existing rules. You do not need to tell every analyst to update their queries. The new source starts producing normalized events, and the existing detection surface covers it immediately.
Compare that to the alternative: embedding source-specific field names into every detection rule. You end up with fragile queries, a lot of duplication, and a growing tax on every new source you onboard.
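To make that concrete, here is the same threshold rule written directly against a hypothetical Okta-shaped table; the table and column names are made up for the sketch, and CloudTrail, the VPN, and the internal app would each need their own copy:

```sql
-- Okta-only copy of the rule; every other auth source needs its own variant
SELECT
    account_id AS tenant_id,
    actor_alternate_id AS actor,
    client_ip_address AS source_ip,
    count() AS login_count
FROM okta_system_log
WHERE raw_event_type = 'user.session.start'
  AND published >= now() - INTERVAL 1 HOUR
GROUP BY tenant_id, actor, source_ip
HAVING login_count > 20
```

The normalized version, written once, replaces every one of those copies: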
```sql
-- works for every auth source after normalization
SELECT
    tenant_id,
    actor,
    source_ip,
    count() AS login_count
FROM events
WHERE event_type = 'auth'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY tenant_id, actor, source_ip
HAVING login_count > 20
```
This query does not care whether the auth events came from Okta, your internal SSO, or a self-hosted app. It queries the normalized layer, and every source that maps to `event_type = 'auth'` is automatically covered.
Why this gives developers flexibility
The flip side of centralized normalization is that producers get to ship logs in whatever format makes sense for them.
An application team does not need to implement your SIEM’s schema. They do not need to coordinate with the security team on field names before they can start logging. They ship events, and the transform layer maps them.
This separation is useful in a few concrete ways:
Faster onboarding. You can start ingesting a new source the day you decide to. The schema negotiation happens once, in the transform file, not across every team.
Source-native field names in storage. Because transforms preserve original fields alongside normalized ones, you still have the raw event if you need it. Analysts can go back to source-specific fields for deep investigation without losing the normalized layer for broad queries.
Application teams can evolve their formats. If they add new fields to their log output, those fields pass through and get stored. If they rename an existing field, you update the transform. The security team’s detection rules are not affected.
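Because the original fields travel with the normalized ones, an analyst can drop back to source-native data in the same table. A sketch, assuming the transform keeps the untouched payload in a raw column (as in the schema sketch above) and that the source was CloudTrail:

```sql
-- normalized fields for the broad filter, source-native field for the detail
SELECT
    timestamp,
    actor,
    JSONExtractString(raw, 'requestParameters', 'policyArn') AS attached_policy
FROM events
WHERE event_type = 'iam_change'
  AND timestamp >= now() - INTERVAL 1 DAY
```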
The transform is where schema decisions live
In practice, this pattern means writing one transform per log source. The transform does three things:
- Extracts the fields that matter into the normalized schema
- Sets the `event_type` so downstream rules can route by category
- Tags the event with `tenant_id` so multi-tenant queries stay isolated
Here is what a minimal transform looks like conceptually:
```yaml
# transform for an internal application log
transforms:
  transform_myapp:
    type: remap
    inputs: ["source_myapp"]
    source: |
      .action = .request.method + " " + .request.path
      .actor = .user.email
      .source_ip = .request.remote_ip
      .event_type = "http_request"
      .tenant_id = .account_id
      .table_name = "app_events"
```
The application team never sees this file. They just ship their logs. The transform is an internal contract between the pipeline and the storage layer.
What breaks without this pattern
If you skip normalization and let every source define its own schema in storage, you end up with:
| Problem | What it looks like |
|---|---|
| Duplicate detection logic | The same “suspicious login” rule written 4 times, once per source |
| Schema drift going undetected | Okta updates a field name; six alert rules silently stop firing |
| Onboarding friction | Adding a new source requires updating every downstream query |
| Cross-source correlation is hard | Joining CloudTrail and Okta events on actor means mapping different field names in every query |
None of these are showstoppers individually. Together they add up to a pipeline that the team avoids adding to, because every change feels expensive.
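For contrast, once both sources land in the normalized schema, correlating them is a plain self-join on the shared `actor` field; a sketch, with an arbitrary 15-minute window:

```sql
-- an authentication followed shortly by an IAM change, by the same actor
SELECT a.actor, a.source_ip, c.action, c.timestamp AS change_time
FROM events AS a
INNER JOIN events AS c
    ON c.tenant_id = a.tenant_id AND c.actor = a.actor
WHERE a.event_type = 'auth'
  AND c.event_type = 'iam_change'
  AND c.timestamp BETWEEN a.timestamp AND a.timestamp + INTERVAL 15 MINUTE
```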
How this plays out in a real SIEM deployment
At Xpernix, every log source that enters the pipeline gets normalized before it reaches ClickHouse. CloudTrail, Okta, 1Password, S3 access logs, and agent-shipped events all go through per-source transforms that map to the same base schema.
The result is that a detection rule written for CloudTrail authentication events and one written for Okta authentication events can share a common structure, even though the source data looks nothing alike. And when a customer connects a new log source, they do not need to review every existing rule to see if something broke.
The only requirement is a new transform file — usually under 30 lines — and a ClickHouse table for the new source. Everything else carries over.
Final thought
System-wide events are not a new idea. Every mature logging platform eventually gets here, either intentionally or after enough pain. The question is whether you design for it from the start or build a normalization layer after three years of schema chaos.
For a SIEM, the value is straightforward: write your rules once, cover every source. For developers, it means they can ship logs without coordinating schema changes with the security team. And for the platform, it means each new data source adds capability without adding fragility.
If you want to understand how this works in a production setup — or if you’re building an ingest pipeline and want a second opinion on your normalization approach — contact us.