godoxy-yusing/internal/metrics/uptime/README.md
yusing 60cdffcf3c refactor(metrics): remove unused fields from RouteAggregate and update related documentation
- Removed `display_name`, `is_docker`, and `is_excluded` fields from the `RouteAggregate` struct and corresponding Swagger documentation.
- Updated references in the README and code to reflect the removal of these fields, ensuring consistency across the codebase.
2026-01-25 17:26:44 +08:00

# Uptime
Tracks and aggregates route health status over time, providing uptime/downtime statistics and latency metrics.
## Overview
The uptime package monitors route health status and calculates uptime percentages over configurable time periods. It integrates with the `period` package for historical storage and provides aggregated statistics for visualization.
### Primary Consumers
- `internal/api/v1/metrics` - HTTP endpoint for uptime data
- `internal/homepage` - Dashboard uptime widgets
- Monitoring and alerting systems
### Non-goals
- Does not perform health checks (handled by `internal/route/routes`)
- Does not provide alerting on downtime
- Does not persist data beyond the period package retention
- Does not aggregate across multiple GoDoxy instances
### Stability
Internal package. Data format and API are stable.
## Public API
### Exported Types
#### StatusByAlias
```go
type StatusByAlias struct {
	Map       map[string]routes.HealthInfoWithoutDetail `json:"statuses"`
	Timestamp int64                                     `json:"timestamp"`
}
```
Container for health status of all routes at a specific time.
#### Status
```go
type Status struct {
	Status    types.HealthStatus `json:"status" swaggertype:"string" enums:"healthy,unhealthy,unknown,napping,starting"`
	Latency   int32              `json:"latency"`
	Timestamp int64              `json:"timestamp"`
}
```
Individual route status at a point in time.
#### RouteAggregate
```go
type RouteAggregate struct {
	Alias         string             `json:"alias"`
	Uptime        float32            `json:"uptime"`
	Downtime      float32            `json:"downtime"`
	Idle          float32            `json:"idle"`
	AvgLatency    float32            `json:"avg_latency"`
	CurrentStatus types.HealthStatus `json:"current_status" swaggertype:"string" enums:"healthy,unhealthy,unknown,napping,starting"`
	Statuses      []Status           `json:"statuses"`
}
```
Aggregated statistics for a single route.
#### Aggregated
```go
type Aggregated []RouteAggregate
```
Slice of route aggregates, sorted alphabetically by alias.
### Exported Variables
#### Poller
```go
var Poller = period.NewPoller("uptime", getStatuses, aggregateStatuses)
```
Pre-configured poller for uptime metrics. Start with `Poller.Start()`.
### Unexported Functions
#### getStatuses
```go
func getStatuses(ctx context.Context, _ StatusByAlias) (StatusByAlias, error)
```
Collects current status of all routes. Called by the period poller every second.
**Returns:**
- `StatusByAlias` - Map of all route statuses with current timestamp
- `error` - Always nil (errors are logged internally)
#### aggregateStatuses
```go
func aggregateStatuses(entries []StatusByAlias, query url.Values) (int, Aggregated)
```
Aggregates status entries into route statistics.
**Query Parameters:**
- `period` - Time filter (5m, 15m, 1h, 1d, 1mo)
- `limit` - Maximum number of routes to return (0 = all)
- `offset` - Offset for pagination
- `keyword` - Fuzzy search keyword for filtering routes
**Returns:**
- `int` - Total number of routes matching the query
- `Aggregated` - Slice of route aggregates
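
The `limit` and `offset` semantics can be pictured as a window over the filtered, sorted route list. The following is a minimal sketch, not the package's implementation: `paginate` is a hypothetical helper, and returning an empty result for an out-of-range offset is an assumption.

```go
package main

import "fmt"

// paginate returns the window of items selected by offset and limit.
// limit == 0 means "no limit", mirroring the documented `limit` parameter.
// Hypothetical helper; out-of-range handling is an assumption.
func paginate[T any](items []T, limit, offset int) []T {
	if offset < 0 {
		offset = 0
	}
	if offset >= len(items) {
		return nil
	}
	items = items[offset:]
	if limit > 0 && limit < len(items) {
		items = items[:limit]
	}
	return items
}

func main() {
	aliases := []string{"api", "blog", "db", "git", "wiki"}
	fmt.Println(paginate(aliases, 2, 1)) // two aliases starting at index 1
	fmt.Println(paginate(aliases, 0, 3)) // limit 0: everything from index 3 on
}
```

In this sketch the total count would be taken before the window is applied, which matches the documented meaning of the returned `int` (routes matching the query).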
## Architecture
### Core Components
```mermaid
flowchart TD
    subgraph Health Monitoring
        Routes[Routes] -->|GetHealthInfoWithoutDetail| Status[Status Map]
        Status -->|Polls every| Second[1 Second]
    end
    subgraph Poller
        Poll[getStatuses] -->|Collects| StatusByAlias
        StatusByAlias -->|Stores in| Period[Period StatusByAlias]
    end
    subgraph Aggregation
        Query[Query Params] -->|Filters| Aggregate[aggregateStatuses]
        Aggregate -->|Calculates| RouteAggregate
        RouteAggregate -->|Uptime| UP[Uptime %]
        RouteAggregate -->|Downtime| DOWN[Downtime %]
        RouteAggregate -->|Idle| IDLE[Idle %]
        RouteAggregate -->|Latency| LAT[Avg Latency]
    end
    subgraph Response
        RouteAggregate -->|JSON| Client[API Client]
    end
```
### Data Flow
```mermaid
sequenceDiagram
    participant Routes as Route Registry
    participant Poller as Uptime Poller
    participant Period as Period Storage
    participant API as HTTP API
    Routes->>Poller: GetHealthInfoWithoutDetail()
    Poller->>Period: Add(StatusByAlias)
    loop Every second
        Poller->>Routes: Collect status
        Poller->>Period: Store status
    end
    API->>Period: Get(filter)
    Period-->>API: Entries
    API->>API: aggregateStatuses()
    API-->>Client: Aggregated JSON
```
### Status Types
| Status | Description | Counted as Uptime? |
| ----------- | ------------------------------ | ------------------ |
| `healthy` | Route is responding normally | Yes |
| `unhealthy` | Route is not responding | No |
| `unknown` | Status could not be determined | Excluded |
| `napping` | Route is in idle/sleep state | Idle (separate) |
| `starting` | Route is starting up | Idle (separate) |
### Calculation Formula
For a set of status entries:
```
Uptime = healthy_count / total_count
Downtime = unhealthy_count / total_count
Idle = (napping_count + starting_count) / total_count
AvgLatency = sum(latency) / count
```
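Note: `unknown` statuses are excluded from all calculations, so `total_count` counts only entries whose status is not `unknown`. The same exclusion applies to the entries that feed `AvgLatency`.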
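
The formula can be sketched in Go as follows. This is an illustration under stated assumptions rather than the package's code: `healthStatus`, `sample`, and `ratios` are hypothetical stand-ins for the real types, and averaging latency over every non-`unknown` sample is an assumption, since the formula leaves `count` unspecified.

```go
package main

import "fmt"

// Local stand-in for the package's status values; the real type is types.HealthStatus.
type healthStatus string

const (
	healthy   healthStatus = "healthy"
	unhealthy healthStatus = "unhealthy"
	unknown   healthStatus = "unknown"
	napping   healthStatus = "napping"
	starting  healthStatus = "starting"
)

// sample is a hypothetical, reduced form of a status entry.
type sample struct {
	status  healthStatus
	latency int32
}

// ratios applies the formula above, skipping `unknown` samples entirely.
func ratios(samples []sample) (up, down, idle, avgLatency float32) {
	var total, latencySum float32
	for _, s := range samples {
		switch s.status {
		case unknown:
			continue // excluded from every count
		case healthy:
			up++
		case unhealthy:
			down++
		case napping, starting:
			idle++
		}
		total++
		latencySum += float32(s.latency) // assumption: all counted samples contribute
	}
	if total == 0 {
		return 0, 0, 0, 0 // nothing countable, avoid dividing by zero
	}
	return up / total, down / total, idle / total, latencySum / total
}

func main() {
	up, down, idle, lat := ratios([]sample{
		{healthy, 40}, {healthy, 60}, {unhealthy, 0}, {napping, 0}, {unknown, 0},
	})
	fmt.Printf("up=%.2f down=%.2f idle=%.2f avg=%.1f\n", up, down, idle, lat)
}
```

With four countable samples (the `unknown` one is dropped), the three ratios sum to 1.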
## Configuration Surface
No explicit configuration. The poller uses period package defaults:
| Parameter | Value |
| ------------- | ---------------------------- |
| Poll Interval | 1 second |
| Retention | 5m, 15m, 1h, 1d, 1mo periods |
## Dependency and Integration Map
### Internal Dependencies
| Package | Purpose |
| ------------------------- | --------------------- |
| `internal/route/routes` | Health info retrieval |
| `internal/metrics/period` | Time-bucketed storage |
| `internal/types` | HealthStatus enum |
| `internal/metrics/utils` | Query utilities |
### External Dependencies
| Dependency | Purpose |
| ---------------------------------------- | ---------------- |
| `github.com/lithammer/fuzzysearch/fuzzy` | Keyword matching |
| `github.com/bytedance/sonic` | JSON marshaling |
### Integration Points
- Route health monitors provide status via `routes.GetHealthInfoWithoutDetail()`
- Period poller handles data collection and storage
- HTTP API provides query interface via `Poller.ServeHTTP`
## Observability
### Logs
Poller lifecycle and errors are logged via zerolog.
### Metrics
No metrics exposed directly. Status data available via API.
## Failure Modes and Recovery
| Failure | Detection | Recovery |
| -------------------------------- | --------------------------------- | ------------------------------ |
| Route health monitor unavailable | Empty map returned | Log warning, continue |
| Invalid query parameters | `aggregateStatuses` returns empty | Return empty result |
| Poller panic | Goroutine crash | Process terminates |
| Persistence failure | Load/save error | Log, continue with empty state |
### Fuzzy Search
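The package uses `fuzzy.MatchFold` for keyword matching:
- Case-insensitive matching
- Subsequence matching: the keyword's characters must appear in the alias in order, so any substring also matches
- Boolean result: `MatchFold` itself only reports whether an alias matches; ranked results would need the library's `RankMatchFold`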
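
To make the matching behaviour concrete, here is a standard-library approximation of a case-insensitive, in-order character match. It is a sketch for illustration only: `matchFold` is a hypothetical local function, and lowercasing with `strings.ToLower` only approximates the Unicode folding the real library performs.

```go
package main

import (
	"fmt"
	"strings"
)

// matchFold reports whether every rune of keyword appears in alias in order,
// ignoring case. Hypothetical stdlib approximation, not the library's code.
func matchFold(keyword, alias string) bool {
	rest := []rune(strings.ToLower(alias))
	for _, want := range strings.ToLower(keyword) {
		i := 0
		for i < len(rest) && rest[i] != want {
			i++
		}
		if i == len(rest) {
			return false // ran out of alias before finding this rune
		}
		rest = rest[i+1:] // continue searching after the matched rune
	}
	return true
}

func main() {
	for _, alias := range []string{"My-API-Server", "grafana", "pihole"} {
		fmt.Printf("%-14s matches \"api\": %v\n", alias, matchFold("api", alias))
	}
}
```

Because the match is a subsequence test, a keyword such as `api` can also match aliases that merely contain those letters in order, which is worth keeping in mind when filtering large route lists.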
## Usage Examples
### Starting the Poller
```go
import "github.com/yusing/godoxy/internal/metrics/uptime"

func init() {
	uptime.Poller.Start()
}
```
### HTTP Endpoint
```go
import (
	"github.com/gin-gonic/gin"
	"github.com/yusing/godoxy/internal/metrics/uptime"
)

func setupUptimeAPI(r *gin.Engine) {
	r.GET("/api/uptime", uptime.Poller.ServeHTTP)
}
```
**API Examples:**
```bash
# Get latest status
curl http://localhost:8080/api/uptime
# Get 1-hour history
curl "http://localhost:8080/api/uptime?period=1h"
# Get with limit and offset (pagination)
curl "http://localhost:8080/api/uptime?limit=10&offset=0"
# Search for routes containing "api"
curl "http://localhost:8080/api/uptime?keyword=api"
# Combined query
curl "http://localhost:8080/api/uptime?period=1d&limit=20&offset=0&keyword=docker"
```
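
The same queries can be assembled programmatically. The sketch below uses only `net/url` and `strconv` and mirrors the curl examples. `uptimeURL` is a hypothetical helper, and omitting empty or zero-valued parameters is a design choice of this sketch, not documented server behaviour.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// uptimeURL builds a request URL for the endpoint shown in the curl examples.
// Empty strings and zero values are left out of the query string.
func uptimeURL(base, period, keyword string, limit, offset int) string {
	q := url.Values{}
	if period != "" {
		q.Set("period", period)
	}
	if keyword != "" {
		q.Set("keyword", keyword)
	}
	if limit > 0 {
		q.Set("limit", strconv.Itoa(limit))
	}
	if offset > 0 {
		q.Set("offset", strconv.Itoa(offset))
	}
	if len(q) == 0 {
		return base
	}
	return base + "?" + q.Encode() // Encode escapes values and sorts keys
}

func main() {
	fmt.Println(uptimeURL("http://localhost:8080/api/uptime", "1d", "docker", 20, 0))
}
```

Using `url.Values.Encode` avoids hand-escaping keywords that contain spaces or other reserved characters.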
### WebSocket Streaming
```javascript
const ws = new WebSocket(
  "ws://localhost:8080/api/uptime?period=5m&interval=5s"
);
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  data.data.forEach((route) => {
    // uptime is a 0..1 ratio; format it as a percentage with one decimal
    console.log(`${route.alias}: ${(route.uptime * 100).toFixed(1)}% uptime`);
  });
};
### Direct Data Access
```go
// Get entries for the last hour
entries, ok := uptime.Poller.Get(period.MetricsPeriod1h)
if !ok {
	return // no entries available for this period
}
for _, entry := range entries {
	for alias, status := range entry.Map {
		fmt.Printf("Route %s: %s (latency: %dms)\n",
			alias, status.Status, status.Latency.Milliseconds())
	}
}

// Get aggregated statistics.
// aggregateStatuses is unexported, so this call only compiles inside
// package uptime; external callers should go through Poller.ServeHTTP.
_, agg := aggregateStatuses(entries, url.Values{
	"period": []string{"1h"},
})
for _, route := range agg {
	fmt.Printf("%s: %.1f%% uptime, %.1fms avg latency\n",
		route.Alias, route.Uptime*100, route.AvgLatency)
}
```
### Response Format
**Latest Status Response:**
```json
{
  "alias1": {
    "status": "healthy",
    "latency": 45
  },
  "alias2": {
    "status": "unhealthy",
    "latency": 0
  }
}
```
**Aggregated Response:**
```json
{
  "total": 5,
  "data": [
    {
      "alias": "api-server",
      "uptime": 0.98,
      "downtime": 0.02,
      "idle": 0.0,
      "avg_latency": 45.5,
      "current_status": "healthy",
      "statuses": [
        { "status": "healthy", "latency": 45, "timestamp": 1704892800 }
      ]
    }
  ]
}
```
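
A client can decode the aggregated response with `encoding/json`. The sketch below declares local mirror types instead of importing the internal package. The type names `status`, `routeAggregate`, and `aggregatedResponse` and the helper `decodeAggregated` are hypothetical, and the `total`/`data` envelope is taken from the sample above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented JSON shapes (hypothetical names).
type status struct {
	Status    string `json:"status"`
	Latency   int32  `json:"latency"`
	Timestamp int64  `json:"timestamp"`
}

type routeAggregate struct {
	Alias         string   `json:"alias"`
	Uptime        float32  `json:"uptime"`
	Downtime      float32  `json:"downtime"`
	Idle          float32  `json:"idle"`
	AvgLatency    float32  `json:"avg_latency"`
	CurrentStatus string   `json:"current_status"`
	Statuses      []status `json:"statuses"`
}

type aggregatedResponse struct {
	Total int              `json:"total"`
	Data  []routeAggregate `json:"data"`
}

// decodeAggregated parses an aggregated response body.
func decodeAggregated(body []byte) (aggregatedResponse, error) {
	var resp aggregatedResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

func main() {
	body := []byte(`{"total":5,"data":[{"alias":"api-server","uptime":0.98,
		"downtime":0.02,"idle":0.0,"avg_latency":45.5,"current_status":"healthy",
		"statuses":[{"status":"healthy","latency":45,"timestamp":1704892800}]}]}`)
	resp, err := decodeAggregated(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, r := range resp.Data {
		fmt.Printf("%s: %.1f%% uptime (%d of %d routes shown)\n",
			r.Alias, r.Uptime*100, len(resp.Data), resp.Total)
	}
}
```

Note that `total` can exceed `len(data)` when `limit` is set, as in the sample where `total` is 5 but one route is listed.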
## Performance Characteristics
- Status collection is `O(n)` per poll, where n = number of routes
- Aggregation is `O(m * k)`, where m = entries and k = routes
- Memory is `O(p * r * s)`, where p = periods, r = routes, and s = status size
- Fuzzy search is `O(routes * keyword_length)`
## Testing Notes
- Mock `routes.GetHealthInfoWithoutDetail()` for testing
- Test aggregation with known status sequences
- Verify pagination and filtering logic
- Test fuzzy search matching
## Related Packages
- `internal/route/routes` - Route health monitoring
- `internal/metrics/period` - Time-bucketed metrics storage
- `internal/types` - Health status types