Building a Multi-Tenant Observability Platform with SigNoz + OneUptime

Modern SaaS teams need deep observability without sacrificing tenant isolation or compliance. This post explains how we built a multi-tenant monitoring platform that routes logs, metrics, and traces to isolated SigNoz and OneUptime stacks, enforces strong security controls, and aligns with SOC 2 and ISO 27001 practices. The result: each customer gets a dedicated monitoring experience while we keep the operational footprint lean and repeatable.

## 1) Architecture Overview

We designed a hub-and-spoke model:

- A central monitoring VM hosts the observability stack.
- Each tenant has either:
  - a fully isolated SigNoz stack (frontend, query, collector, ClickHouse), or
  - a shared stack with strict routing based on a tenant identifier (for lightweight tenants).
- Each application VM runs an OpenTelemetry (OTEL) Collector that tails PM2 logs, receives OTLP traces/metrics, and forwards to the monitoring VM.

This gives a consistent ingestion pipeline while allowing isolation by default where needed.

## 2) Tenant Segregation Strategy

We support two isolation modes:

1) Full isolation per tenant

   - Dedicated SigNoz stack per tenant
   - Separate ClickHouse instance
   - Separate OTEL collector upstream
   - Strongest data isolation

2) Logical isolation on a shared stack

   - Single SigNoz + ClickHouse
   - Routing by business_id (header + resource attribute)
   - Good for smaller tenants

Tenant traffic is identified by:

- x-business-id for SigNoz
- x-oneuptime-token for OneUptime

We default to full isolation for regulated or high-traffic customers.
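To make the shared-stack mode concrete, here is a minimal sketch of how an ingest request might carry the tenant identity on the wire; the endpoint and payload file are illustrative, only the header names come from the list above:

```bash
# Shared-stack ingest: SigNoz traffic is tagged with x-business-id
# (OneUptime ingestion carries x-oneuptime-token instead); endpoint and payload are illustrative
curl -sS -X POST http://otlp.shared.monitoring.example:4318/v1/traces \
  -H "Content-Type: application/json" \
  -H "x-business-id: tenant-a" \
  --data @trace-payload.json
```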

## 3) Provisioning and Hardening the Monitoring VM

We treat the monitoring VM as a controlled production system:

- SSH keys only, no password auth
- Minimal inbound ports (22, 80, 443, 4317/4318)
- Nginx as a single TLS ingress
- Docker Compose for immutable service layout

Example provisioning steps (high-level):
```bash
# SSH key-based access only
az vm user update --resource-group <rg> --name <vm> --username <user> --ssh-key-value "<pubkey>"

# Open required ports (restrict SSH to trusted IPs)
az network nsg rule create ... --destination-port-ranges 22 80 443 4317 4318
```
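The "SSH keys only, no password auth" control comes down to a couple of sshd settings; a minimal sketch, assuming a stock Ubuntu image (service name varies by distro):

```bash
# Disable password and root logins so only key-based SSH access works
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo systemctl restart ssh   # "sshd" on some distributions
```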
## 4) Multi-Tenant Routing at the Edge

We use Nginx maps to route traffic by hostname for both UI and OTLP ingestion:
```nginx
map $host $signoz_collector_upstream {
    signoz.tenant-a.example  signoz-otel-collector-tenant-a;
    signoz.tenant-b.example  signoz-otel-collector-tenant-b;
    default                  signoz-otel-collector-default;
}

server {
    listen 4318;

    location / {
        proxy_pass http://$signoz_collector_upstream;
    }
}
```

This gives us clean DNS-based tenant routing while keeping a single IP.
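The UI side follows the same pattern. A sketch of the companion map, assuming the tenant frontends are reachable under these container names and the default SigNoz UI port (names and ports here are illustrative):

```nginx
map $host $signoz_ui_upstream {
    signoz.tenant-a.example  signoz-frontend-tenant-a:3301;
    signoz.tenant-b.example  signoz-frontend-tenant-b:3301;
    default                  signoz-frontend-default:3301;
}

server {
    listen 443 ssl;
    server_name signoz.tenant-a.example signoz.tenant-b.example;

    location / {
        proxy_pass http://$signoz_ui_upstream;
    }
}
```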
## 5) Collector Configuration: Logs, Traces, Metrics

Each tenant VM runs an OTEL Collector with filelog + OTLP. We parse PM2 logs (JSON wrapper), normalize severity, and attach resource fields for fast filtering in SigNoz.

Core fields we enforce:

- severity_text (info/warn/error)
- service.name
- deployment.environment
- business_id

Minimal config excerpt:
```yaml
processors:
  resourcedetection:
    detectors: [system]
  resource:
    attributes:
      - key: business_id
        value: ${env:BUSINESS_ID}
        action: upsert
  transform/logs:
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes["severity"]) where attributes["severity"] != nil
```

This makes severity_text, service.name, and host.name searchable immediately in SigNoz.
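The excerpt above covers only the processors. For completeness, the receiver and pipeline side of the same file might look roughly like this; the PM2 log path, parser settings, and exporter endpoint are illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  filelog:
    include: [ /home/app/.pm2/logs/*.log ]   # PM2 log location (illustrative)
    operators:
      - type: json_parser                    # unwrap the PM2 JSON wrapper

exporters:
  otlphttp:
    endpoint: ${env:SIGNOZ_ENDPOINT}         # per-tenant collector endpoint

service:
  pipelines:
    logs:
      receivers: [filelog, otlp]
      processors: [resourcedetection, resource, transform/logs]
      exporters: [otlphttp]
```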
## 6) Client-Side Integration (Apps)

We used a consistent OTEL pattern across backend, web, and agent services:

- Backend: OTLP exporter for traces
- Web: browser traces forwarded to backend (which re-exports)
- Agents: OTEL SDK configured with OTEL_EXPORTER_OTLP_ENDPOINT

Typical environment variables:
```bash
BUSINESS_ID=tenant-a
SIGNOZ_ENDPOINT=http://signoz.tenant-a.example:4318
ONEUPTIME_ENDPOINT=http://status.tenant-a.example:4318
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://127.0.0.1:4318/v1/traces
DEPLOY_ENV=production
```
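On the backend (PM2 implies Node.js here), the trace exporter bootstrap is only a few lines; a minimal sketch, assuming the standard OpenTelemetry Node SDK packages, with service name and tenant attributes supplied via OTEL_RESOURCE_ATTRIBUTES:

```typescript
// tracing.ts – hypothetical bootstrap; load it before the rest of the app
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // Falls back to the local collector from the env file above
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ?? 'http://127.0.0.1:4318/v1/traces',
  }),
});

// service.name, business_id and deployment.environment can ride in via
// OTEL_RESOURCE_ATTRIBUTES=service.name=backend,business_id=tenant-a,deployment.environment=production
sdk.start();
```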
## 7) DNS and TLS (Public UX)

We terminate TLS at Nginx with real certificates (ACME/Let's Encrypt):
```bash
sudo certbot --nginx -d signoz.tenant-a.example -d status.tenant-a.example
```

We keep per-tenant TLS policies aligned with strong ciphers and HSTS.
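Roughly what that policy looks like inside each tenant's server block; the cipher list shown is one reasonable choice, not necessarily the exact one we ship:

```nginx
# Per-tenant TLS policy: modern protocols, strong ciphers, HSTS
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphers on;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
```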
## 8) Verification and Observability QA

We validate the pipeline with:

- OTEL health endpoint (/health on collector)
- Test traffic from backend
- ClickHouse queries to confirm log attributes
- SigNoz filters for severity_text, service.name, host.name

Example ClickHouse check (internal):
```sql
SELECT severity_text, count()
FROM signoz_logs.logs_v2
WHERE resources_string['business_id'] = 'tenant-a'
  AND timestamp >= now() - INTERVAL 15 MINUTE
GROUP BY severity_text;
```
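To exercise the pipeline before trusting the numbers, we hit the collector health endpoint and push a throwaway log record over OTLP/HTTP; a sketch with illustrative ports and payload:

```bash
# Collector health check (port/path depend on the health check extension config)
curl -fsS http://127.0.0.1:13133/health

# Minimal OTLP/HTTP log record as test traffic
curl -sS -X POST http://127.0.0.1:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[{"resource":{"attributes":[{"key":"business_id","value":{"stringValue":"tenant-a"}}]},"scopeLogs":[{"logRecords":[{"severityText":"INFO","body":{"stringValue":"pipeline smoke test"}}]}]}]}'
```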
## 9) Security and Compliance (SOC 2 + ISO 27001)

We implemented controls aligned with SOC 2 and ISO 27001:

- Access control: SSH keys only, least privilege, MFA on cloud console.
- Network segmentation: minimal open ports; SSH restricted by source IP.
- Secrets management: runtime secrets stored in a vault, never in code.
- Encryption in transit: TLS everywhere, no plaintext endpoints exposed.
- Encryption at rest: disk encryption enabled on VMs and DB volumes.
- Audit trails: system logs retained; infra changes tracked in code.
- Change management: all config in repos; change reviews before deployment.
- Monitoring and alerting: OneUptime for SLOs and uptime checks.
- Incident response: documented procedures, retention and escalation.
- Backup strategy: ClickHouse backup policies per tenant.

## 10) Repeatability: Infra + Tenant Config as Code

We split configuration by responsibility:

- Monitoring services repo: all infra and Nginx routing
- Tenant repos: OTEL collector config and deploy hooks

That means a new VM can be rebuilt with:

1) Pull monitoring repo and run docker compose up -d
2) Update DNS + TLS
3) Run tenant deployment scripts to install collector and env
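A sketch of what that rebuild looks like in practice; the repo URL and script name are illustrative, the steps mirror the list above:

```bash
# 1) Bring up the monitoring stack from the monitoring services repo
git clone git@github.com:example/monitoring-services.git && cd monitoring-services
docker compose up -d

# 2) Point DNS at the new VM, then issue certificates
sudo certbot --nginx -d signoz.tenant-a.example -d status.tenant-a.example

# 3) Install the tenant collector and environment from the tenant repo
./deploy/install-collector.sh tenant-a   # script name is hypothetical
```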
## Final Takeaways

This architecture gives us the best of both worlds:

- Strong tenant isolation for compliance-focused clients
- Shared ops processes and standard config
- Fast log filtering (severity/service/env/host) for high signal-to-noise debugging
- A repeatable, audited deployment flow suitable for SOC 2 and ISO 27001 requirements