Every log format is someone’s best guess at what will matter later. The pipeline’s job is to make sure that guess doesn’t block everyone else.
The problem starts at the source
When you build a SIEM or a log aggregation pipeline, the first thing you notice is that every data source looks different.
CloudTrail uses eventName and userIdentity. Okta has eventType and actor. Your internal application ships whatever the backend engineer decided to put in the JSON payload three years ago. A SaaS vendor’s webhook has its own envelope format with nested arrays and inconsistent field names.
This is normal. Every source was built by a different team with a different mental model. Expecting them to agree on a schema upfront is not realistic — especially in a company that is moving fast.
The common response is to write a custom parser for each source and bolt it directly into the pipeline. That works for three sources. By the time you have a dozen, the parser layer becomes a maintenance burden, and adding source 13 requires negotiating with every consumer of the data.
There is a better way to think about this.
The short version
| Idea | Why it matters |
|---|---|
| Define events at the system level | One normalized schema drives detection rules, dashboards, and queries — regardless of source |
| Producers keep their own format | Developers ship logs in whatever shape fits their system; normalization happens in the pipeline |
| New sources are additive, not disruptive | Adding a log source does not break existing rules or require downstream changes |
What “system-wide events” actually means
A system-wide event is a normalized record that represents something that happened in your environment — regardless of which system reported it.
The idea is to define a common event model at the pipeline level:
- Every event has a `timestamp`, a `source_ip`, a `tenant_id`, and an `event_type`
- Fields like `actor`, `action`, and `target` are mapped from source-specific names
- Source-specific fields you want to keep are preserved alongside the normalized ones
The pipeline — not the producer — owns the translation.
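A minimal sketch of that base schema as a ClickHouse table (the single-table layout, column names, and types are illustrative assumptions, not an exact production schema):

```sql
-- illustrative base schema; a real deployment may split tables per source
CREATE TABLE events
(
    timestamp   DateTime64(3),
    tenant_id   String,
    event_type  LowCardinality(String),  -- 'auth', 'iam_change', 'http_request', ...
    actor       String,
    action      String,
    target      String,
    source_ip   String,
    raw         String                   -- original source payload, kept for investigation
)
ENGINE = MergeTree
ORDER BY (tenant_id, event_type, timestamp);
```

With that in place, two very different sources map into the same shape: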
```
[CloudTrail event]        eventName: "AttachRolePolicy"
        |
        ▼
[Transform: cloudtrail]   action: "AttachRolePolicy"
                          actor: "arn:aws:iam::123456789012:user/admin"
                          source_ip: "203.0.113.10"
                          event_type: "iam_change"
        |
        ▼
[Normalized event]        stored in ClickHouse → queryable by detection rules

[Okta event]              eventType: "user.session.start"
        |
        ▼
[Transform: okta]         action: "user.session.start"
                          actor: "user@example.com"
                          source_ip: "198.51.100.22"
                          event_type: "auth"
        |
        ▼
[Normalized event]        stored in ClickHouse → same detection rules apply
```
The detection rule asking “did a user authenticate from a suspicious IP?” does not need to know whether the authentication came from Okta, your VPN, or your internal app. It queries for `event_type = 'auth'` with a `source_ip` that appears on a threat-intel list, and it fires.
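As a sketch, assuming the threat-intel list is loaded into its own table (the `threat_intel_ips` name and layout are assumptions), that rule is one query against the normalized layer:

```sql
-- fires for any source that maps to event_type = 'auth'
SELECT tenant_id, actor, source_ip, timestamp
FROM events
WHERE event_type = 'auth'
  AND source_ip IN (SELECT ip FROM threat_intel_ips)
  AND timestamp >= now() - INTERVAL 1 HOUR
```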
Why this matters for scale
Detection rules written against a normalized schema survive source changes.
If Okta changes its field names in an API update, you update the Okta transform. Every detection rule that queries the normalized layer keeps working without modification.
If you add a new log source — say, a 1Password Events API feed — you write one new transform. You do not touch existing rules. You do not need to tell every analyst to update their queries. The new source starts producing normalized events, and the existing detection surface covers it immediately.
Compare that to the alternative: embedding source-specific field names into every detection rule. You end up with fragile queries, a lot of duplication, and a growing tax on every new source you onboard.
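To make that concrete, here is the same threshold rule written directly against a hypothetical Okta-shaped table; the table and column names are made up for the sketch, and CloudTrail, the VPN, and the internal app would each need their own copy:

```sql
-- Okta-only copy of the rule; every other auth source needs its own variant
SELECT
    account_id AS tenant_id,
    actor_alternate_id AS actor,
    client_ip_address AS source_ip,
    count() AS login_count
FROM okta_system_log
WHERE raw_event_type = 'user.session.start'
  AND published >= now() - INTERVAL 1 HOUR
GROUP BY tenant_id, actor, source_ip
HAVING login_count > 20
```

The normalized version, written once, replaces every one of those copies: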
```sql
-- works for every auth source after normalization
SELECT
    tenant_id,
    actor,
    source_ip,
    count() AS login_count
FROM events
WHERE event_type = 'auth'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY tenant_id, actor, source_ip
HAVING login_count > 20
```
This query does not care whether the auth events came from Okta, your internal SSO, or a self-hosted app. It queries the normalized layer, and every source that maps to `event_type = 'auth'` is automatically covered.
Why this gives developers flexibility
The flip side of centralized normalization is that producers get to ship logs in whatever format makes sense for them.
An application team does not need to implement your SIEM’s schema. They do not need to coordinate with the security team on field names before they can start logging. They ship events, and the transform layer maps them.
This separation is useful in a few concrete ways:
Faster onboarding. You can start ingesting a new source the day you decide to. The schema negotiation happens once, in the transform file, not across every team.
Source-native field names in storage. Because transforms preserve original fields alongside normalized ones, you still have the raw event if you need it. Analysts can go back to source-specific fields for deep investigation without losing the normalized layer for broad queries.
Application teams can evolve their formats. If they add new fields to their log output, those fields pass through and get stored. If they rename an existing field, you update the transform. The security team’s detection rules are not affected.
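Because the original fields travel with the normalized ones, an analyst can drop back to source-native data in the same table. A sketch, assuming the transform keeps the untouched payload in a raw column (as in the schema sketch above) and that the source was CloudTrail:

```sql
-- normalized fields for the broad filter, source-native field for the detail
SELECT
    timestamp,
    actor,
    JSONExtractString(raw, 'requestParameters', 'policyArn') AS attached_policy
FROM events
WHERE event_type = 'iam_change'
  AND timestamp >= now() - INTERVAL 1 DAY
```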
The transform is where schema decisions live
In practice, this pattern means writing one transform per log source. The transform does three things:
- Extracts the fields that matter into the normalized schema
- Sets the `event_type` so downstream rules can route by category
- Tags the event with `tenant_id` so multi-tenant queries stay isolated
Here is what a minimal transform looks like conceptually:
```yaml
# transform for an internal application log
transforms:
  transform_myapp:
    type: remap
    inputs: ["source_myapp"]
    source: |
      .action = .request.method + " " + .request.path
      .actor = .user.email
      .source_ip = .request.remote_ip
      .event_type = "http_request"
      .tenant_id = .account_id
      .table_name = "app_events"
```
The application team never sees this file. They just ship their logs. The transform is an internal contract between the pipeline and the storage layer.
What breaks without this pattern
If you skip normalization and let every source define its own schema in storage, you end up with:
| Problem | What it looks like |
|---|---|
| Duplicate detection logic | The same “suspicious login” rule written 4 times, once per source |
| Schema drift going undetected | Okta updates a field name; six alert rules silently stop firing |
| Onboarding friction | Adding a new source requires updating every downstream query |
| Cross-source correlation is hard | Joining CloudTrail and Okta events on actor means mapping different field names in every query |
None of these are showstoppers individually. Together they add up to a pipeline that the team avoids adding to, because every change feels expensive.
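For contrast, once both sources land in the normalized schema, correlating them is a plain self-join on the shared `actor` field; a sketch, with an arbitrary 15-minute window:

```sql
-- an authentication followed shortly by an IAM change, by the same actor
SELECT a.actor, a.source_ip, c.action, c.timestamp AS change_time
FROM events AS a
INNER JOIN events AS c
    ON c.tenant_id = a.tenant_id AND c.actor = a.actor
WHERE a.event_type = 'auth'
  AND c.event_type = 'iam_change'
  AND c.timestamp BETWEEN a.timestamp AND a.timestamp + INTERVAL 15 MINUTE
```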
How this plays out in a real SIEM deployment
At Xpernix, every log source that enters the pipeline gets normalized before it reaches ClickHouse. CloudTrail, Okta, 1Password, S3 access logs, and agent-shipped events all go through per-source transforms that map to the same base schema.
The result is that a detection rule written for CloudTrail authentication events and one written for Okta authentication events can share a common structure, even though the source data looks nothing alike. And when a customer connects a new log source, they do not need to review every existing rule to see if something broke.
The only requirement is a new transform file — usually under 30 lines — and a ClickHouse table for the new source. Everything else carries over.
Final thought
System-wide events are not a new idea. Every mature logging platform eventually gets here, either intentionally or after enough pain. The question is whether you design for it from the start or build a normalization layer after three years of schema chaos.
For a SIEM, the value is straightforward: write your rules once, cover every source. For developers, it means they can ship logs without coordinating schema changes with the security team. And for the platform, it means each new data source adds capability without adding fragility.
If you want to understand how this works in a production setup — or if you’re building an ingest pipeline and want a second opinion on your normalization approach — contact us.