25 changes: 25 additions & 0 deletions cometchat-on-prem/docker/air-gapped-deployment.mdx
---
title: "Air-Gapped Deployment"
sidebarTitle: "Air-Gapped"
---

Guidelines for deploying the platform in offline or isolated (air-gapped) environments.

## Offline installation steps

- Export required Docker images with `docker save`
- Transfer images via removable media, secure copy (SSH), or an isolated internal network
- Import images on the target system with `docker load`
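
For example, transferring one image might look like this (the image name and tag are placeholders; substitute the images your deployment actually uses):

```shell
# On a connected host: export the image and record a checksum
docker save cometchat/chat-api:latest -o chat-api.tar
sha256sum chat-api.tar > chat-api.tar.sha256

# After transferring both files to the air-gapped host:
sha256sum -c chat-api.tar.sha256   # verify the archive survived the transfer
docker load -i chat-api.tar
```

Verifying checksums after transfer catches corruption introduced by removable media before a partially loaded image causes confusing runtime failures.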

## Local registry

- Host images in Harbor, Nexus, or a private Docker registry
- Enforce role-based access control (RBAC) and image retention policies
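
Once images are loaded on a host inside the air gap, they can be retagged and pushed into the internal registry. A minimal sketch, where `registry.internal.example` is a placeholder for your registry hostname:

```shell
# Retag the loaded image for the internal registry, then push it
docker tag cometchat/chat-api:latest registry.internal.example/cometchat/chat-api:latest
docker push registry.internal.example/cometchat/chat-api:latest
```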

## Limitations in air-gapped mode

- No access to external push notification services
- No S3 or other cloud object storage unless internally emulated
- No cloud-hosted analytics, logging, or monitoring integrations

Air-gapped deployments require careful planning for certificate management, image updates, and backup strategies. For assistance with compliance requirements (HIPAA, FedRAMP, ISO 27001) or custom air-gapped architectures, [contact us](https://www.cometchat.com/contact-sales).
121 changes: 121 additions & 0 deletions cometchat-on-prem/docker/configuration-reference.mdx
---
title: "Configuration Reference"
sidebarTitle: "Configuration"
---

Use this reference when updating domains, migrating environments, troubleshooting service misconfiguration, or performing production deployments. Values are sourced from `docker-compose.yml`, service-level `.env` files, and the domain update guide.

## Global notes

- All services read environment variables from their respective directories.
- Domain values must be updated consistently across API, WebSocket, Notifications, Webhooks, and NGINX configurations.
- Changing the primary domain impacts reverse proxy routing, OAuth headers, CORS, webhook endpoints, and TiDB host references.
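
Because a domain change touches several services at once, it helps to sweep every configuration directory for the old value after editing. A minimal sketch, assuming GNU `sed`; the paths and domains below are examples, not actual defaults:

```shell
# Create a sample service .env to demonstrate the sweep (path is hypothetical)
mkdir -p /tmp/onprem/chat-api
printf 'MAIN_DOMAIN="old.example.com"\n' > /tmp/onprem/chat-api/.env

# Replace the old domain in every file that still references it
grep -rl 'old.example.com' /tmp/onprem | xargs sed -i 's/old\.example\.com/new.example.com/g'

# Confirm no stale references remain
if grep -rq 'old.example.com' /tmp/onprem; then echo "stale references remain"; else echo "clean"; fi
```

A sweep like this is especially useful for catching the TiDB host and webhook URL references, which are easy to miss because they live in different service directories.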

## Chat API

Update these values when changing domains:

- `MAIN_DOMAIN="<your-domain>"`
- `EXTENSION_DOMAIN="<your-domain>"`
- `WEBHOOKS_BASE_URL="https://webhooks.<your-domain>/v1/webhooks"`
- `TRIGGERS_BASE_URL="https://webhooks.<your-domain>/v1/triggers"`
- `EXTENSION_BASE_URL="https://notifications.<your-domain>"`
- `MODERATION_ENABLED=true`
- `RULES_BASE_URL="https://moderation.<your-domain>/v1/moderation-service"`
- `ADMIN_API_HOST="api.<your-domain>"`
- `CLIENT_API_HOST="apiclient.<your-domain>"`
- `ALLOWED_API_DOMAINS="<your-domain>,<additional-domain>"`
- `DB_HOST="tidb.<your-domain>"`
- `DB_HOST_CREATOR="tidb.<your-domain>"`
- `V3_CHAT_HOST="websocket.<your-domain>"`

## Management API (MGMT API)

- `ADMIN_API_HOST="api.<your-domain>"`
- `CLIENT_API_HOST="apiclient.<your-domain>"`
- `APP_HOST="dashboard.<your-domain>"`
- `API_HOST="https://mgmt-api.<your-domain>"`
- `MGMT_DOMAIN="<your-domain>"`
- `MGMT_DOMAIN_TO_REPLACE="<your-domain>"`
- `RULES_BASE_URL="https://moderation.<your-domain>/v1/moderation"`
- `ACCESS_CONTROL_ALLOW_ORIGIN="<your-domain>,<additional-domain>"`

## WebSocket

Hostnames are derived automatically from NGINX and Chat API configuration; no manual domain updates are required.

## Notifications service

- `CC_DOMAIN="<your-domain>"` (controls routing, token validation, and push delivery)

## Moderation service

- `CHAT_API_URL="<your-domain>"` for rule evaluation, metadata retrieval, and decision submission

## Webhooks service

- `CHAT_API_DOMAIN="<your-domain>"` - must match the Chat API domain exactly to avoid retries or signature verification failures

## Extensions

```json
"DOMAINS": [
"<allowed-domain-1>",
"<allowed-domain-2>",
"<your-domain>"
],
"DOMAIN_NAME": "<your-domain>"
```

Defines CORS and allowed origins for extension traffic.

## Receipt Updater

- `RECEIPTS_MYSQL_HOST="tidb.<your-domain>"` for delivery receipts, read receipts, and thread metadata

## SQL Consumer

```json
"CONNECTION_CONFIG": {
"host": "<tidb-host>"
},
"ALTER_USER_CONFIG": {
"host": "<tidb-host>"
},
"API_CONFIG": {
"API_DOMAIN": "<api-domain>"
}
```

Controls database migrations, multi-tenant provisioning, and internal requests to Chat API.

## NGINX configuration files

Update domain values in:

- `chatapi.conf`
- `extensions.conf`
- `mgmtapi.conf`
- `notifications.conf`
- `dashboard.conf`
- `globalwebhooks.conf`
- `moderation.conf`
- `websocket.conf`

These govern TLS termination, routing, reverse proxy rules, and WebSocket upgrades.
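
As an illustrative sketch only, each of these files follows the same pattern of terminating TLS and proxying to an internal service. The server name, certificate paths, and upstream name/port below are assumptions, not shipped defaults:

```nginx
# Hypothetical excerpt in the style of chatapi.conf; substitute <your-domain> as elsewhere
server {
    listen 443 ssl;
    server_name api.<your-domain>;

    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    location / {
        proxy_pass http://chat-api:8080;            # upstream name and port are examples
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;

        # websocket.conf additionally needs upgrade headers:
        # proxy_set_header Upgrade $http_upgrade;
        # proxy_set_header Connection "upgrade";
    }
}
```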

## Summary of domain values to update

- Chat API, Client API, and Management API
- Notifications, Moderation, Webhooks, and Extensions services
- NGINX reverse proxy hostnames
- TiDB host references
- WebSocket host configuration in Chat API

Configuration changes should be tested in staging environments before production deployment. For assistance with complex multi-region setups, custom domain architectures, or migration planning, [contact us](https://www.cometchat.com/contact-sales).
174 changes: 174 additions & 0 deletions cometchat-on-prem/docker/monitoring.mdx
---
title: "Monitoring"
sidebarTitle: "Monitoring"
---

Monitoring ensures system health, operational visibility, and SLA compliance for CometChat On-Prem deployments.

## Monitoring stack

The following open-source tools form the monitoring and observability stack for CometChat On-Prem deployments:

- **Prometheus**: Collects and stores metrics from all services
- **Grafana**: Visualizes metrics with dashboards and alerts
- **Loki**: Stores and queries logs from all containers
- **Promtail**: Tails logs from Docker containers and pushes them to Loki
- **Node Exporter**: Collects host-level metrics (CPU, memory, disk, network)
- **cAdvisor**: Collects container-level resource usage metrics

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Grafana │
│ (Dashboards & Visualization) │
└──────────────┬─────────────────────────┬────────────────────┘
│ │
│ Queries │ Queries
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Prometheus │ │ Loki │
│ (Metrics Store) │ │ (Log Store) │
└────────┬─────────┘ └────────┬─────────┘
│ │
│ Scrapes (/metrics) │ Pushes
▼ ▼
┌─────────────────────────────────────────┐
│ Node Exporter │ cAdvisor │ Promtail │
│ (Host Metrics) │ (Container)│ (Logs) │
└─────────────────────────────────────────┘
│ │ │
└────────────────┴──────────────┘
┌───────▼────────┐
│ Docker Swarm │
│ CometChat │
│ Services │
└────────────────┘
```

## Key metrics to monitor

### Infrastructure
- CPU usage per node
- Memory usage per node
- Disk space and I/O
- Network traffic
- Container resource usage

### Application services
- WebSocket active connections
- Chat API request rate and latency
- API error rates (4xx, 5xx)
- Service uptime

### Data stores
- **Kafka**: Consumer lag, message throughput
- **Redis**: Memory usage, cache hit ratio, connected clients
- **MongoDB**: Operation latency, connections, replication lag
- **TiDB**: Query duration, region health, storage capacity

### Load balancer
- NGINX request rate
- Response status codes
- Active connections

## Alerting

Alerts should focus on user impact, capacity risks, and data integrity rather than raw metric noise.

Set up alerts for these critical conditions:

- CPU usage > 80% for 5 minutes
- Memory usage > 85% for 5 minutes
- Disk space < 15%
- Service down for 2 minutes
- Database query latency > 100ms
- Kafka consumer lag > 10,000 messages
- Redis memory > 90%
- WebSocket connection errors > 10/second
- API error rate > 5%
- Container restarts

These thresholds are recommended starting points and should be adjusted based on workload characteristics and environment scale.
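
As a sketch, the first few thresholds above can be expressed as Prometheus alerting rules. The expressions follow standard Node Exporter metric names, but the exact labels in your deployment may differ:

```yaml
groups:
  - name: cometchat-onprem
    rules:
      - alert: HighCPU
        # Average non-idle CPU across all cores on a node, sustained for 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels: { severity: warning }
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels: { severity: warning }
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 15
        labels: { severity: critical }
```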

## Grafana dashboards

Create dashboards to visualize:

1. **Overview**: System health, active users, request rates, error rates
2. **Infrastructure**: CPU, memory, disk, network per node
3. **WebSocket**: Active connections, message throughput, errors
4. **API**: Request rate, latency, error rates by endpoint
5. **Databases**: Query performance, connections, replication status
6. **Kafka**: Consumer lag, throughput, partition health
7. **Logs & Error Analysis**: Error aggregation, log volume, search, and correlation with metrics

### Logs & Error Analysis Dashboard

This dashboard provides centralized visibility into application errors, log patterns, and system anomalies for rapid troubleshooting and incident investigation.

**Key Visualizations:**

- **Error Volume by Service**: Time-series graph showing error log count per service, helping identify which components are experiencing issues
- **Top Error Messages**: Table displaying the most frequent error messages with occurrence counts, enabling quick identification of recurring problems
- **Log Volume Trends**: Track total log volume over time to detect unusual spikes that may indicate issues or attacks
- **Error Rate by Severity**: Breakdown of errors by severity level (CRITICAL, ERROR, WARNING) for prioritization
- **Service Health Correlation**: Side-by-side view of error logs and service metrics (CPU, memory, latency) to correlate errors with resource constraints
- **Search & Filter**: Interactive LogQL query panel for ad-hoc log searches and pattern matching
- **Recent Critical Errors**: Live feed of the latest critical errors across all services for immediate awareness

**Use Cases:**
- Rapid incident investigation by correlating errors with metric anomalies
- Identifying error patterns and root causes across distributed services
- Monitoring error trends to detect degradation before user impact
- Post-incident analysis and root cause identification
- Compliance and audit trail review

## Log queries

Use Loki's LogQL to search and filter logs across all services:

```logql
# View all errors
{service="chat-api"} |= "error"

# WebSocket connection issues
{service="websocket"} |~ "connection.*failed"

# API 5xx errors
{service="nginx"} |~ "HTTP/[0-9.]+ 5[0-9]{2}"

# High latency requests
{service="chat-api"} | json | latency > 1000
```

## Troubleshooting

### First check Grafana dashboards

Start with the Overview dashboard to determine the blast radius: confirm whether the issue is node-level, service-level, or data-store related before drilling into the component-level dashboards.

### Check Prometheus targets
```bash
curl http://localhost:9090/api/v1/targets
```

### Check Loki status
```bash
curl http://localhost:3100/ready
```

### View Promtail logs
```bash
docker service logs promtail
```

### Check service metrics
```bash
# Node Exporter
curl http://localhost:9100/metrics

# cAdvisor
curl http://localhost:8080/metrics
```