Just after finalizing the event schema, the ingest pipeline felt like the obvious next step. It's the core of the platform - get this wrong, and nothing else matters. Plus, having a working pipeline means I can actually test events end-to-end instead of staring at JSON schemas all day.
So I did what any reasonable person does: spent a week going down rabbit holes, comparing options, and agonizing over decisions that probably don't matter as much as I think they do.
Here's what I landed on, and why.
Why We Chose ClickHouse for Analytics Storage
For the events database, there was no real alternative. ClickHouse.
If you're building anything that involves analytics at scale - high-cardinality queries, time-series data, aggregations over billions of rows - ClickHouse is in a league of its own. The numbers don't lie:
- Columnar storage means queries only read the columns they need
- Parallel processing across cores makes aggregations stupid fast
- Compression ratios of 10-20x are common (my test data hit 14x)
- Real-time inserts with eventual consistency that actually works
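To make the columnar and compression points concrete, here's the kind of table and query this enables. A sketch only - the table name, columns, codecs, URL, and credentials are illustrative, not the schema from the previous post - using ClickHouse's HTTP interface from TypeScript:

```typescript
// Illustrative only: a hypothetical events table and a narrow aggregation.
// Column names, codecs, and the URL/credentials are assumptions.
const CLICKHOUSE_URL = "https://clickhouse.internal.example:8443";

const createTable = `
  CREATE TABLE IF NOT EXISTS events (
    event_id    UUID,
    project_id  LowCardinality(String),
    event_name  LowCardinality(String),
    distinct_id String,
    properties  String CODEC(ZSTD(3)),  -- JSON blob, compresses extremely well
    timestamp   DateTime64(3)
  )
  ENGINE = MergeTree
  PARTITION BY toYYYYMM(timestamp)
  ORDER BY (project_id, event_name, timestamp)
`;

// Columnar storage: this query only reads three columns,
// no matter how wide the rows get.
const dailyCounts = `
  SELECT toDate(timestamp) AS day, event_name, count() AS events
  FROM events
  WHERE project_id = {project_id:String}
  GROUP BY day, event_name
  ORDER BY day
`;

async function runQuery(sql: string, params: Record<string, string> = {}): Promise<string> {
  // ClickHouse's HTTP interface takes the SQL in the body and
  // query parameters as param_<name> in the URL.
  const qs = new URLSearchParams(
    Object.entries(params).map(([k, v]) => [`param_${k}`, v])
  );
  const res = await fetch(`${CLICKHOUSE_URL}/?${qs}`, {
    method: "POST",
    headers: { Authorization: "Basic " + btoa("default:password") }, // placeholder creds
    body: sql,
  });
  if (!res.ok) throw new Error(await res.text());
  return res.text();
}

await runQuery(createTable);
console.log(await runQuery(dailyCounts, { project_id: "proj_123" }));
```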
I briefly looked at alternatives. TimescaleDB is great for time-series but struggles with high cardinality. BigQuery is powerful but the cost model is terrifying for unpredictable query patterns. Druid is... Druid (sorry, Druid fans).
The real decision was ClickHouse Cloud vs. self-hosted ClickHouse.
I went self-hosted. This deserves its own blog post (coming soon), but the short version: I wanted full control over the cluster topology, the ability to tune settings without support tickets, and honestly - I wanted to learn it properly. Managing a ClickHouse cluster is a rite of passage for anyone building analytics infrastructure.
Fair warning: it's a whole thing. ZooKeeper (or ClickHouse Keeper), replication, sharding, backups. But once it's running, it's beautiful.
Comparing Event Ingestion Options: DIY vs AWS vs Cloudflare
With the database sorted, I needed to figure out how events get from the SDK to ClickHouse. I had three realistic options:
Option 1: Build It Yourself (Traditional Infrastructure)
The DIY approach: spin up some servers, write an HTTP service, manage a queue, handle scaling yourself.
Pros:
- Full control over everything
- No vendor lock-in
- Can optimize for specific use cases
Cons:
- I have to manage servers
- I have to handle scaling
- I have to wake up at 3am when things break
For a solo founder trying to ship fast, this felt like signing up for ops work I didn't want.
Option 2: AWS Lambda + API Gateway (The "Safe" Choice)
The enterprise-approved stack: API Gateway → Lambda → S3/Kinesis → Lambda → ClickHouse.
Let me break down what this actually costs for a high-volume event ingestion use case. Assume 100 million events/month (not crazy for a product capturing behavioral signals):
| Service | Usage | Cost |
|---|---|---|
| API Gateway | 100M requests × $3.50/million | ~$350 |
| Lambda (Ingest) | 100M invocations × 128MB × 50ms | ~$25 |
| S3 | 50GB storage + operations | ~$5 |
| SQS | 100M messages × $0.40/million | ~$40 |
| Lambda (Insert) | 10M invocations (batched) | ~$10 |
| Data Transfer | ~100GB egress to ClickHouse | ~$10 |
| Total | | ~$440/mo |
And that's before you factor in:
- CloudWatch Logs (can easily add $100+/mo at this scale)
- API Gateway's per-request pricing, which only gets more brutal as volume grows
- Cold starts adding latency to your ingest path
- The cognitive overhead of managing IAM roles, VPCs, security groups...
AWS is phenomenal infrastructure. But for high-volume, latency-sensitive HTTP ingestion? The pricing model works against you.
Option 3: Cloudflare Workers + R2 + Queues
Here's the same 100M events/month on Cloudflare:
| Service | Usage | Cost |
|---|---|---|
| Workers (Paid Plan) | Base | $5 |
| Workers Requests | 100M (10M free) | $27 |
| Workers CPU Time | ~500M ms (30M free) | ~$10 |
| R2 Storage | 50GB × $0.015/GB (10GB free) | ~$0.60 |
| R2 Operations | Class A + B | ~$5 |
| Queues | ~6M operations (1M free) | ~$2 |
| KV | Auth/rate limit lookups | ~$1 |
| Total | | ~$51/mo |
That's roughly an 8-9x cost difference for the same workload. But the real wins are elsewhere.
Why Cloudflare Workers Beat AWS Lambda for Event Ingestion
1. Zero Cold Starts with V8 Isolates
Cloudflare Workers use V8 isolates, not containers. There's no cold start. Ever. Your first request of the day is as fast as your millionth.
For an analytics ingest endpoint, this matters. I don't want p99 latency spikes because Lambda decided to spin up a new container.
2. Global Edge Network by Default
Workers run on Cloudflare's edge network - 330+ data centers worldwide. When someone in Tokyo sends an event, it hits a server in Tokyo. Not us-east-1.
For a product that tracks user behavior, latency directly impacts data quality. Faster ingest = more accurate timestamps = better analytics.
3. Simpler Developer Experience
With AWS, I'd be managing:
- API Gateway configurations
- Lambda function versions and aliases
- IAM roles and policies
- VPC configurations for ClickHouse access
- CloudWatch dashboards and alarms
- S3 bucket policies
- SQS dead-letter queues
With Cloudflare, it's:
- Worker code
- A wrangler.toml file
Deploy with wrangler deploy. Done. The operational simplicity is underrated.
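For a sense of what that config amounts to, here's a hypothetical wrangler.toml for the ingest worker - the binding and resource names are made up, so check the Wrangler docs for your version:

```toml
# Hypothetical wrangler.toml for the ingest worker - names are illustrative
name = "ingest-worker"
main = "src/index.ts"
compatibility_date = "2025-01-01"

# KV namespace used for API key and rate-limit lookups
[[kv_namespaces]]
binding = "API_KEYS"
id = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# R2 bucket where incoming event batches land
[[r2_buckets]]
binding = "EVENTS_BUCKET"
bucket_name = "events-incoming"

# Producer binding for the queue that feeds the insert worker
[[queues.producers]]
binding = "EVENT_QUEUE"
queue = "event-files"
```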
4. GDPR Compliance with Data Localization
Cloudflare's data localization features are genuinely useful for GDPR compliance. I can ensure EU user data never leaves EU data centers. Try doing that with a single Lambda function.
5. R2 Has Zero Egress Fees
This is huge. S3 charges $0.09/GB for data transfer out. R2 charges... nothing.
For an analytics pipeline where data flows from R2 → Worker → ClickHouse, eliminating egress fees is real money saved.
Event Ingestion Pipeline Architecture
Here's how events flow through the system: SDK → POST /v1/batch → ingest worker → R2 → queue → insert worker → ClickHouse, with a dead letter queue catching anything that fails along the way.
Note
This only covers the happy path. I'm not going to go into detail about rate limiting, authentication, or error handling at each step.
When building your own pipeline, handle the failure scenarios before the happy path. We can't afford to drop events: if we lose an event, we lose the signal.
Let me walk through each step:
Step 1: SDK Batches Events Client-Side
The client SDK doesn't fire an HTTP request for every event. It queues them locally and flushes either:
- When the batch hits 50 events, or
- Every 5 seconds, whichever comes first
This is standard practice (Segment, Amplitude, everyone does this), but worth mentioning because it dramatically reduces request volume and improves reliability.
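For illustration, here's roughly what that batching logic looks like - a sketch, not the actual SDK, which also handles persistence, retries, and flushing on page unload:

```typescript
// Simplified sketch of client-side batching - not the actual SDK.
type AnalyticsEvent = {
  event: string;
  distinct_id?: string;
  properties?: Record<string, unknown>;
};

const MAX_BATCH_SIZE = 50;       // flush when the batch hits 50 events...
const FLUSH_INTERVAL_MS = 5_000; // ...or every 5 seconds, whichever comes first

class EventQueue {
  private buffer: AnalyticsEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(private send: (events: AnalyticsEvent[]) => Promise<void>) {}

  enqueue(event: AnalyticsEvent): void {
    this.buffer.push(event);
    if (this.buffer.length >= MAX_BATCH_SIZE) {
      void this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => void this.flush(), FLUSH_INTERVAL_MS);
    }
  }

  async flush(): Promise<void> {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await this.send(batch);
    } catch {
      // Naive recovery: put the events back and let the next flush retry them
      this.buffer = batch.concat(this.buffer);
    }
  }
}
```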
Step 2: POST /v1/batch API Endpoint
The SDK sends a single HTTP POST with the batch payload. Authorization header carries the project's API key.
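The payload itself is nothing fancy. Something roughly like this, with illustrative field names rather than the real wire format:

```typescript
// Illustrative request - field names are not the real wire format.
await fetch("https://ingest.example.com/v1/batch", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer pk_live_placeholder", // the project's API key
  },
  body: JSON.stringify({
    sent_at: new Date().toISOString(),
    events: [
      { event: "page_view", distinct_id: "user_123", properties: { path: "/pricing" } },
      { event: "button_click", distinct_id: "user_123", properties: { id: "cta-signup" } },
    ],
  }),
});
```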
Step 3: Ingest Worker Validates and Enriches Events
This is where the magic happens:
✅ Validates API key (KV lookup, cached)
✅ Checks rate limit (KV, with sliding window)
✅ Validates payload (Zod schemas, event-type specific)
✅ Enriches events (geo from CF headers, device parsing, UTM extraction)
✅ Generates fingerprint if no distinct_id (for cookieless tracking)
✅ Compresses batch (gzip)
✅ Writes to R2: incoming/{project_id}/{timestamp}.ndjson.gz
✅ Returns 202 Accepted
Key design decisions here:
Why R2 first, not direct to queue? Durability. R2 is object storage - data written there isn't going anywhere. If the queue has issues, I haven't lost events. They're sitting in R2, waiting to be processed.
Why gzip compression? Reduces storage costs and speeds up downstream processing. Event payloads compress extremely well (lots of repeated keys, similar structures).
Why 202 Accepted? The client doesn't need to wait for the event to hit ClickHouse. They just need to know we received it. Async processing FTW.
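Putting the checklist and those decisions together, here's a compressed sketch of the ingest worker's happy path. This is illustrative, not the production code - rate limiting and fingerprinting are omitted, and the binding names (API_KEYS, EVENTS_BUCKET) are placeholders:

```typescript
// Sketch of the ingest worker's happy path. Rate limiting, fingerprinting,
// and most error handling are omitted; binding names are assumptions.
// Types like KVNamespace and R2Bucket come from @cloudflare/workers-types.
import { z } from "zod";

const BatchSchema = z.object({
  sent_at: z.string(),
  events: z.array(
    z.object({
      event: z.string(),
      distinct_id: z.string().optional(),
      properties: z.record(z.unknown()).optional(),
    })
  ),
});

interface Env {
  API_KEYS: KVNamespace;
  EVENTS_BUCKET: R2Bucket;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Validate the API key with a cached KV lookup
    const apiKey = request.headers.get("Authorization")?.replace("Bearer ", "") ?? "";
    const projectId = await env.API_KEYS.get(apiKey);
    if (!projectId) return new Response("unauthorized", { status: 401 });

    // Validate the payload shape
    const parsed = BatchSchema.safeParse(await request.json());
    if (!parsed.success) return new Response("bad request", { status: 400 });

    // Enrich with geo from Cloudflare's request metadata
    const cf = request.cf as { country?: string; city?: string } | undefined;
    const enriched = parsed.data.events.map((e) => ({
      ...e,
      properties: { ...e.properties, $country: cf?.country, $city: cf?.city },
    }));

    // Gzip the batch as NDJSON
    const ndjson = enriched.map((e) => JSON.stringify(e)).join("\n");
    const stream = new Blob([ndjson]).stream().pipeThrough(new CompressionStream("gzip"));
    const gzipped = await new Response(stream).arrayBuffer();

    // Durable write to R2 before anything else can go wrong
    const key = `incoming/${projectId}/${Date.now()}.ndjson.gz`;
    await env.EVENTS_BUCKET.put(key, gzipped);

    // Tell the client we have it; everything downstream is async
    return new Response(null, { status: 202 });
  },
};
```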
Step 4: R2 Event Notifications Trigger Queue
R2 supports event notifications. When a new file lands in the incoming/ prefix, it pushes a message to a Cloudflare Queue.
This decouples ingestion from insertion. The ingest worker can return immediately; the insert worker processes files at its own pace.
Step 5: Insert Worker Bulk Loads to ClickHouse
The insert worker:
✅ Reads file from R2
✅ Decompresses
✅ Deduplicates by event ID (idempotency matters)
✅ Transforms to ClickHouse format
✅ Bulk inserts to ClickHouse tables (events, persons, groups, memberships)
✅ Moves file to processed/ prefix
✅ Acknowledges message (removes from queue)
The bulk insert is important. ClickHouse loves big batches. Inserting 1000 events in one query is way more efficient than 1000 individual inserts.
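Here's the insert worker as a similarly compressed sketch. It talks to ClickHouse over the plain HTTP interface; the table name, bindings, and the simplified { key } queue message body are all assumptions (the real R2 event notification payload carries more fields):

```typescript
// Sketch of the insert worker. Retries with backoff, metrics, and the
// persons/groups/memberships tables are omitted. The simplified { key }
// message body, table name, and bindings are assumptions.
interface Env {
  EVENTS_BUCKET: R2Bucket;
  CLICKHOUSE_URL: string;  // e.g. https://clickhouse.internal.example:8443
  CLICKHOUSE_AUTH: string; // "user:password", stored as a secret
}

export default {
  async queue(batch: MessageBatch<{ key: string }>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const key = msg.body.key;

      // Read and decompress the NDJSON file from R2
      const object = await env.EVENTS_BUCKET.get(key);
      if (!object) {
        msg.ack(); // already processed or deleted
        continue;
      }
      const text = await new Response(
        object.body.pipeThrough(new DecompressionStream("gzip"))
      ).text();

      // Deduplicate by event ID so retries stay idempotent
      // (assumes the SDK or ingest worker stamps an event_id on every event)
      const seen = new Set<string>();
      const rows = text
        .split("\n")
        .filter(Boolean)
        .map((line) => JSON.parse(line))
        .filter((e) => !seen.has(e.event_id) && !!seen.add(e.event_id));

      // One bulk INSERT per file over ClickHouse's HTTP interface
      const res = await fetch(
        `${env.CLICKHOUSE_URL}/?query=${encodeURIComponent("INSERT INTO events FORMAT JSONEachRow")}`,
        {
          method: "POST",
          headers: { Authorization: "Basic " + btoa(env.CLICKHOUSE_AUTH) },
          body: rows.map((r) => JSON.stringify(r)).join("\n"),
        }
      );
      if (!res.ok) {
        msg.retry(); // back onto the queue; eventually the DLQ
        continue;
      }

      // Move the file out of incoming/ (R2 has no rename: copy, then delete;
      // stored uncompressed here for brevity)
      await env.EVENTS_BUCKET.put(key.replace("incoming/", "processed/"), text);
      await env.EVENTS_BUCKET.delete(key);
      msg.ack();
    }
  },
};
```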
Step 6: Dead Letter Queue for Failure Handling
If the insert worker fails (ClickHouse down, network issues, whatever):
- Message goes back to the queue
- Retries up to 3 times with exponential backoff
- After max retries, moves to Dead Letter Queue
The DLQ is just another Cloudflare Queue. I have a separate worker that can replay failed events manually (or automatically after the underlying issue is fixed).
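The replay worker itself is pleasantly boring - a consumer on the DLQ that pushes message bodies back onto the main queue once the underlying issue is fixed. A sketch, with made-up binding names:

```typescript
// Sketch of a DLQ replay worker: re-enqueue failed messages onto the main
// queue. Binding names and the { key } body shape are illustrative.
interface Env {
  EVENT_QUEUE: Queue<{ key: string }>; // producer binding for the main queue
}

export default {
  async queue(batch: MessageBatch<{ key: string }>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      await env.EVENT_QUEUE.send(msg.body); // back through the normal path
      msg.ack();
    }
  },
};
```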
Scaling Considerations: Do You Really Need Kafka?
This architecture is optimized for getting to market fast. But honestly? After researching the actual limits, I think this setup will carry us further than I originally thought.
Why We're Not Using Kafka (Yet)
My initial instinct was "if we hit 100K+ events/second, we'll need Kafka." But then I actually looked at Cloudflare Queues' current limits:
- 5,000 messages/second per queue (up from 400 in early beta)
- 250 concurrent consumer invocations
- 10,000 queues per account
Do the math: 5,000 messages/second × 86,400 seconds/day × 30 days is roughly 13 billion messages per month from a single queue. And I can shard across multiple queues if needed.
For a product capturing behavioral signals to power emails and surveys, we'd need thousands of active customers at peak load before hitting these limits. That's a good problem to have, and by then, we'd have the resources to revisit architecture.
Kafka's real value add would be:
- Infinite replay capability (Queues have 4-day retention)
- Its ecosystem (Schema Registry, Kafka Connect, etc.)
- If we're ingesting from sources that already speak Kafka
But for now? Queues are more than enough. Don't add infrastructure you don't need.
ClickHouse Kafka Engine: Only If You Already Have Kafka
ClickHouse does have native Kafka integration - a single Kafka table can handle 60K-300K simple messages per second. But this only makes sense if you're already running Kafka for other reasons.
Adding Kafka just to use the Kafka Engine would be trading one piece of infrastructure (Insert Worker) for a much more complex one (Kafka cluster). Not worth it unless Kafka is already in your stack.
Why We're Not Doing Streaming Inserts
I briefly considered direct ClickHouse inserts from the ingest worker for sub-second latency. But ClickHouse best practices are clear: batch 10K-100K rows per insert.
The current architecture (R2 → Queue → batched inserts) actually follows this best practice. The few seconds of latency from file buffering is a feature, not a bug - it lets us batch efficiently and handle ClickHouse downtime gracefully.
For true real-time (sub-second), you'd want ClickHouse's async inserts or Buffer tables. But for behavioral triggers and dashboards? A 5-10 second delay is imperceptible. Ship the simpler thing.
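If we ever do need sub-second visibility, async inserts are mostly a matter of flipping settings on the insert. A hypothetical variant of the insert worker's bulk insert above - async_insert and wait_for_async_insert are real ClickHouse settings, everything else is illustrative:

```typescript
// Hypothetical async-insert variant: ClickHouse buffers rows server-side
// and flushes them in batches. URL, credentials, and sample row are illustrative.
const CLICKHOUSE_URL = "https://clickhouse.internal.example:8443";
const rows = [{ event_id: "evt_1", event_name: "page_view", timestamp: "2025-01-01 00:00:00.000" }];

const params = new URLSearchParams({
  query: "INSERT INTO events FORMAT JSONEachRow",
  async_insert: "1",
  wait_for_async_insert: "0", // fire-and-forget: lower latency, weaker delivery guarantee
});

await fetch(`${CLICKHOUSE_URL}/?${params}`, {
  method: "POST",
  headers: { Authorization: "Basic " + btoa("default:password") }, // placeholder creds
  body: rows.map((r) => JSON.stringify(r)).join("\n"),
});
```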
Multi-Region ClickHouse for Disaster Recovery
This one is still on the roadmap, but not for performance - for disaster recovery.
ClickHouse supports multi-region replication, but with caveats:
- Latency between regions should stay under ~100ms (US coasts work fine; US-Europe gets tricky)
- ZooKeeper is sensitive to network latency
- Setup involves VPC peering, headless services, ZK observers...
The ClickHouse docs literally say "it's not trivial to set up." They're not wrong.
For now, solid backups to R2 (with cross-region replication enabled) plus the durability of the ingest pipeline give us acceptable DR posture. True multi-region ClickHouse is a "when we have dedicated infrastructure engineers" problem.
Key Takeaways for Building Event Ingestion Pipelines
If you're building high-volume event ingestion in 2026:
- ClickHouse for analytics storage - Nothing else comes close for this use case
- Cloudflare Workers for ingestion - 8-9x cheaper than AWS at scale, free tier covers beta entirely, zero cold starts, global by default
- Decouple ingestion from insertion - R2 as a buffer gives you durability and operational flexibility
- Design for failure - DLQs, idempotency, retries. Events are precious; don't lose them.
- Don't over-engineer - Cloudflare Queues handle 5,000 msg/sec per queue. You probably don't need Kafka.
The entire pipeline is clean, focused TypeScript. It handles millions of events with minimal operational overhead.
Sometimes the boring architecture is the right architecture. And sometimes, the architecture you ship today will carry you further than you think.
Next up: Deploying and scaling a self-hosted ClickHouse cluster (or: how I learned to stop worrying and love ZooKeeper).