godoxy-yusing/internal/metrics/uptime/README.md

# Uptime

Tracks and aggregates route health status over time, providing uptime/downtime statistics and latency metrics.

## Overview

The uptime package monitors route health status and calculates uptime percentages over configurable time periods. It integrates with the `period` package for historical storage and provides aggregated statistics for visualization.

### Primary Consumers

- `internal/api/v1/metrics` - HTTP endpoint for uptime data
- `internal/homepage` - Dashboard uptime widgets
- Monitoring and alerting systems

### Non-goals

- Does not perform health checks (handled by `internal/route/routes`)
- Does not provide alerting on downtime
- Does not persist data beyond the period package retention
- Does not aggregate across multiple GoDoxy instances

### Stability

Internal package. Data format and API are stable.

## Public API

### Exported Types

#### StatusByAlias

```go
type StatusByAlias struct {
    Map       map[string]routes.HealthInfoWithoutDetail `json:"statuses"`
    Timestamp int64                                     `json:"timestamp"`
}
```

Container for health status of all routes at a specific time.

#### Status

```go
type Status struct {
    Status    types.HealthStatus `json:"status" swaggertype:"string" enums:"healthy,unhealthy,unknown,napping,starting"`
    Latency   int32              `json:"latency"`
    Timestamp int64              `json:"timestamp"`
}
```

Individual route status at a point in time.

#### RouteAggregate

```go
type RouteAggregate struct {
    Alias         string             `json:"alias"`
    Uptime        float32            `json:"uptime"`
    Downtime      float32            `json:"downtime"`
    Idle          float32            `json:"idle"`
    AvgLatency    float32            `json:"avg_latency"`
    CurrentStatus types.HealthStatus `json:"current_status" swaggertype:"string" enums:"healthy,unhealthy,unknown,napping,starting"`
    Statuses      []Status           `json:"statuses"`
}
```

Aggregated statistics for a single route.

#### Aggregated

```go
type Aggregated []RouteAggregate
```

Slice of route aggregates, sorted alphabetically by alias.

### Exported Variables

#### Poller

```go
var Poller = period.NewPoller("uptime", getStatuses, aggregateStatuses)
```

Pre-configured poller for uptime metrics. Start with `Poller.Start()`.

### Unexported Functions

#### getStatuses

```go
func getStatuses(ctx context.Context, _ StatusByAlias) (StatusByAlias, error)
```

Collects current status of all routes. Called by the period poller every second.

**Returns:**

- `StatusByAlias` - Map of all route statuses with current timestamp
- `error` - Always nil (errors are logged internally)

#### aggregateStatuses

```go
func aggregateStatuses(entries []StatusByAlias, query url.Values) (int, Aggregated)
```

Aggregates status entries into route statistics.

**Query Parameters:**

- `period` - Time filter (5m, 15m, 1h, 1d, 1mo)
- `limit` - Maximum number of routes to return (0 = all)
- `offset` - Offset for pagination
- `keyword` - Fuzzy search keyword for filtering routes

**Returns:**

- `int` - Total number of routes matching the query
- `Aggregated` - Slice of route aggregates

## Architecture

### Core Components

```mermaid
flowchart TD
    subgraph Health Monitoring
        Routes[Routes] -->|GetHealthInfoWithoutDetail| Status[Status Map]
        Status -->|Polls every| Second[1 Second]
    end

    subgraph Poller
        Poll[getStatuses] -->|Collects| StatusByAlias
        StatusByAlias -->|Stores in| Period[Period StatusByAlias]
    end

    subgraph Aggregation
        Query[Query Params] -->|Filters| Aggregate[aggregateStatuses]
        Aggregate -->|Calculates| RouteAggregate
        RouteAggregate -->|Uptime| UP[Uptime %]
        RouteAggregate -->|Downtime| DOWN[Downtime %]
        RouteAggregate -->|Idle| IDLE[Idle %]
        RouteAggregate -->|Latency| LAT[Avg Latency]
    end

    subgraph Response
        RouteAggregate -->|JSON| Client[API Client]
    end
```

### Data Flow

```mermaid
sequenceDiagram
    participant Routes as Route Registry
    participant Poller as Uptime Poller
    participant Period as Period Storage
    participant API as HTTP API

    Routes->>Poller: GetHealthInfoWithoutDetail()
    Poller->>Period: Add(StatusByAlias)

    loop Every second
        Poller->>Routes: Collect status
        Poller->>Period: Store status
    end

    API->>Period: Get(filter)
    Period-->>API: Entries
    API->>API: aggregateStatuses()
    API-->>Client: Aggregated JSON
```

### Status Types

| Status      | Description                    | Counted as Uptime? |
| ----------- | ------------------------------ | ------------------ |
| `healthy`   | Route is responding normally   | Yes                |
| `unhealthy` | Route is not responding        | No                 |
| `unknown`   | Status could not be determined | Excluded           |
| `napping`   | Route is in idle/sleep state   | Idle (separate)    |
| `starting`  | Route is starting up           | Idle (separate)    |

### Calculation Formula

For a set of status entries:

```
Uptime = healthy_count / total_count
Downtime = unhealthy_count / total_count
Idle = (napping_count + starting_count) / total_count
AvgLatency = sum(latency) / count
```

Note: `unknown` statuses are excluded from all calculations.

## Configuration Surface

No explicit configuration. The poller uses period package defaults:

| Parameter     | Value                        |
| ------------- | ---------------------------- |
| Poll Interval | 1 second                     |
| Retention     | 5m, 15m, 1h, 1d, 1mo periods |

## Dependency and Integration Map

### Internal Dependencies

| Package                   | Purpose               |
| ------------------------- | --------------------- |
| `internal/route/routes`   | Health info retrieval |
| `internal/metrics/period` | Time-bucketed storage |
| `internal/types`          | HealthStatus enum     |
| `internal/metrics/utils`  | Query utilities       |

### External Dependencies

| Dependency                               | Purpose          |
| ---------------------------------------- | ---------------- |
| `github.com/lithammer/fuzzysearch/fuzzy` | Keyword matching |
| `github.com/bytedance/sonic`             | JSON marshaling  |

### Integration Points

- Route health monitors provide status via `routes.GetHealthInfoWithoutDetail()`
- Period poller handles data collection and storage
- HTTP API provides query interface via `Poller.ServeHTTP`

## Observability

### Logs

Poller lifecycle and errors are logged via zerolog.

### Metrics

No metrics exposed directly. Status data available via API.

## Failure Modes and Recovery

| Failure                          | Detection                         | Recovery                       |
| -------------------------------- | --------------------------------- | ------------------------------ |
| Route health monitor unavailable | Empty map returned                | Log warning, continue          |
| Invalid query parameters         | `aggregateStatuses` returns empty | Return empty result            |
| Poller panic                     | Goroutine crash                   | Process terminates             |
| Persistence failure              | Load/save error                   | Log, continue with empty state |

### Fuzzy Search

The package uses `fuzzy.MatchFold` for keyword matching:

- Case-insensitive matching
- Substring matching
- Fuzzy ranking

## Usage Examples

### Starting the Poller

```go
import "github.com/yusing/godoxy/internal/metrics/uptime"

func init() {
    uptime.Poller.Start()
}
```

### HTTP Endpoint

```go
import (
    "github.com/gin-gonic/gin"
    "github.com/yusing/godoxy/internal/metrics/uptime"
)

func setupUptimeAPI(r *gin.Engine) {
    r.GET("/api/uptime", uptime.Poller.ServeHTTP)
}
```

**API Examples:**

```bash
# Get latest status
curl http://localhost:8080/api/uptime

# Get 1-hour history
curl "http://localhost:8080/api/uptime?period=1h"

# Get with limit and offset (pagination)
curl "http://localhost:8080/api/uptime?limit=10&offset=0"

# Search for routes containing "api"
curl "http://localhost:8080/api/uptime?keyword=api"

# Combined query
curl "http://localhost:8080/api/uptime?period=1d&limit=20&offset=0&keyword=docker"
```

### WebSocket Streaming

```javascript
const ws = new WebSocket(
  "ws://localhost:8080/api/uptime?period=1m&interval=5s"
);

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  data.data.forEach((route) => {
    console.log(`${route.alias}: ${route.uptime * 100}% uptime`);
  });
};
```

### Direct Data Access

```go
// Get entries for the last hour
entries, ok := uptime.Poller.Get(period.MetricsPeriod1h)
for _, entry := range entries {
    for alias, status := range entry.Map {
        fmt.Printf("Route %s: %s (latency: %dms)\n",
            alias, status.Status, status.Latency.Milliseconds())
    }
}

// Get aggregated statistics
_, agg := uptime.aggregateStatuses(entries, url.Values{
    "period": []string{"1h"},
})

for _, route := range agg {
    fmt.Printf("%s: %.1f%% uptime, %.1fms avg latency\n",
        route.Alias, route.Uptime*100, route.AvgLatency)
}
```

### Response Format

**Latest Status Response:**

```json
{
  "alias1": {
    "status": "healthy",
    "latency": 45
  },
  "alias2": {
    "status": "unhealthy",
    "latency": 0
  }
}
```

**Aggregated Response:**

```json
{
  "total": 5,
  "data": [
    {
      "alias": "api-server",
      "uptime": 0.98,
      "downtime": 0.02,
      "idle": 0.0,
      "avg_latency": 45.5,
      "current_status": "healthy",
      "statuses": [
        { "status": "healthy", "latency": 45, "timestamp": 1704892800 }
      ]
    }
  ]
}
```

## Performance Characteristics

- O(n) status collection per poll where n = number of routes
- O(m \* k) aggregation where m = entries, k = routes
- Memory: O(p _ r _ s) where p = periods, r = routes, s = status size
- Fuzzy search is O(routes \* keyword_length)

## Testing Notes

- Mock `routes.GetHealthInfoWithoutDetail()` for testing
- Test aggregation with known status sequences
- Verify pagination and filtering logic
- Test fuzzy search matching

## Related Packages

- `internal/route/routes` - Route health monitoring
- `internal/metrics/period` - Time-bucketed metrics storage
- `internal/types` - Health status types