212 lines
5.9 KiB
Markdown
212 lines
5.9 KiB
Markdown
# Log Analysis Summary - Infinite Refresh Loop Fix
|
|
|
|
## Problem Identified
|
|
|
|
Your logs showed a **critical infinite refresh loop**:
|
|
|
|
```
|
|
Keycloak session invalidated, clearing token to force re-authentication
|
|
Keycloak session invalidated, clearing token to force re-authentication
|
|
Keycloak session invalidated, clearing token to force re-authentication
|
|
... (repeating infinitely)
|
|
```
|
|
|
|
### Root Cause
|
|
|
|
1. **Session Invalidated**: User's Keycloak session became invalid (logged out elsewhere, expired, etc.)
|
|
2. **Multiple Widgets**: All widgets/components making parallel API requests
|
|
3. **JWT Callback Triggered**: Each request triggers NextAuth JWT callback
|
|
4. **Refresh Attempt**: Each callback tries to refresh the expired token
|
|
5. **Refresh Fails**: Refresh fails because session is invalid
|
|
6. **No Circuit Breaker**: Next request sees expired token → tries refresh again → **infinite loop**
|
|
|
|
### Impact
|
|
|
|
- **Performance**: Hundreds of refresh attempts per second
|
|
- **Server Load**: CPU/memory spike
|
|
- **Keycloak Load**: Potential DoS on Keycloak server
|
|
- **User Experience**: App appears broken
|
|
- **Logs**: Spam with error messages
|
|
|
|
---
|
|
|
|
## Solution Implemented
|
|
|
|
### Circuit Breaker Pattern
|
|
|
|
Added a **5-second cooldown** after failed refresh attempts:
|
|
|
|
1. **Track Failures**: Record timestamp when refresh fails
|
|
2. **Cooldown Period**: Don't retry refresh for 5 seconds after failure
|
|
3. **Early Return**: If in cooldown, return error immediately (no API call)
|
|
4. **Memory Management**: Cleanup old entries to prevent memory leaks
|
|
|
|
### Code Changes
|
|
|
|
**File**: `app/api/auth/options.ts`
|
|
|
|
**Added:**
|
|
- `refreshCooldown` Map to track last failure per user
|
|
- `REFRESH_COOLDOWN_MS = 5000` (5 seconds)
|
|
- `cleanupRefreshCooldown()` function to prevent memory leaks
|
|
- Cooldown check before refresh attempt
|
|
- Failure recording after failed refresh
|
|
|
|
**How It Works:**
|
|
|
|
```typescript
|
|
// Before refresh attempt:
|
|
if (timeSinceFailure < REFRESH_COOLDOWN_MS) {
|
|
// Skip refresh, return error immediately
|
|
return errorToken;
|
|
}
|
|
|
|
// After failed refresh:
|
|
if (refreshedToken.error === "SessionNotActive") {
|
|
refreshCooldown.set(userId, Date.now()); // Record failure
|
|
return errorToken;
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Expected Behavior After Fix
|
|
|
|
### Before Fix
|
|
```
|
|
Request 1 → Refresh attempt → Fail → Clear tokens
|
|
Request 2 → Refresh attempt → Fail → Clear tokens
|
|
Request 3 → Refresh attempt → Fail → Clear tokens
|
|
... (infinite loop)
|
|
```
|
|
|
|
### After Fix
|
|
```
|
|
Request 1 → Refresh attempt → Fail → Record failure → Clear tokens
|
|
Request 2 → Check cooldown → Skip refresh → Return error immediately
|
|
Request 3 → Check cooldown → Skip refresh → Return error immediately
|
|
... (cooldown prevents refresh attempts)
|
|
After 5s → Next request can try refresh again (if session restored)
|
|
```
|
|
|
|
### What You'll See in Logs
|
|
|
|
**Good Signs:**
|
|
- ✅ "Refresh cooldown active, skipping refresh attempt" (instead of infinite failures)
|
|
- ✅ Only 1-2 refresh attempts per user when session invalidates
|
|
- ✅ User redirected to sign-in page
|
|
- ✅ No refresh storm
|
|
|
|
**Bad Signs (if still happening):**
|
|
- ❌ Still seeing infinite "Keycloak session invalidated" messages
|
|
- ❌ Multiple refresh attempts within 5 seconds
|
|
- ❌ Cooldown not working
|
|
|
|
---
|
|
|
|
## Testing the Fix
|
|
|
|
### Test Scenario 1: Session Invalidation
|
|
1. Log in to the app
|
|
2. Logout from Keycloak admin console (or expire session)
|
|
3. **Expected**:
|
|
- 1-2 refresh attempts
|
|
- Then cooldown messages
|
|
- User redirected to sign-in
|
|
- **NOT** infinite loop
|
|
|
|
### Test Scenario 2: Multiple Widgets
|
|
1. Open app with all widgets loading
|
|
2. Invalidate session
|
|
3. **Expected**:
|
|
- All widgets respect cooldown
|
|
- No refresh storm
|
|
- Clean error handling
|
|
|
|
### Test Scenario 3: Normal Operation
|
|
1. Valid session
|
|
2. Token expires naturally
|
|
3. **Expected**:
|
|
- Refresh succeeds
|
|
- No cooldown triggered
|
|
- Normal operation continues
|
|
|
|
---
|
|
|
|
## Monitoring
|
|
|
|
### Metrics to Watch
|
|
|
|
1. **Refresh Attempts**: Should be low (1-2 per user per session)
|
|
2. **Cooldown Activations**: Should only happen when session invalid
|
|
3. **Refresh Success Rate**: Should be high for valid sessions
|
|
4. **Error Rate**: Should drop significantly
|
|
|
|
### Log Patterns
|
|
|
|
**Healthy:**
|
|
```
|
|
[DEBUG] Refresh cooldown active, skipping refresh attempt
|
|
[INFO] Keycloak session invalidated, setting cooldown
|
|
```
|
|
|
|
**Unhealthy (if still happening):**
|
|
```
|
|
Keycloak session invalidated, clearing token... (repeating)
|
|
```
|
|
|
|
---
|
|
|
|
## Future Improvements
|
|
|
|
### Short-term (Recommended)
|
|
1. ✅ **Done**: In-memory circuit breaker
|
|
2. ⚠️ **Next**: Migrate to Redis-based circuit breaker (for multi-instance)
|
|
3. ⚠️ **Next**: Add client-side session guard to stop requests
|
|
|
|
### Long-term
|
|
1. ⚠️ Add metrics/monitoring
|
|
2. ⚠️ Implement exponential backoff
|
|
3. ⚠️ Add request cancellation on client-side
|
|
4. ⚠️ Better error boundaries
|
|
|
|
---
|
|
|
|
## Additional Notes
|
|
|
|
### Why 5 Seconds?
|
|
|
|
- **Too Short (< 2s)**: Still allows refresh storms
|
|
- **Too Long (> 10s)**: Delays legitimate refresh attempts
|
|
- **5 Seconds**: Good balance - prevents storms, allows quick recovery
|
|
|
|
### Memory Considerations
|
|
|
|
- **Map Size**: Limited to 1000 entries (auto-cleanup)
|
|
- **Memory Per Entry**: ~50 bytes (userId + timestamp)
|
|
- **Total Memory**: ~50KB max
|
|
- **Cleanup**: Automatic (removes entries older than 50s)
|
|
|
|
### Multi-Instance Deployment
|
|
|
|
**Current**: In-memory Map (per-instance)
|
|
- Works for single instance
|
|
- Each instance has its own cooldown
|
|
|
|
**Future**: Redis-based (shared across instances)
|
|
- Better for multi-instance
|
|
- Shared cooldown state
|
|
- See `CRITICAL_ISSUE_ANALYSIS.md` for Redis implementation
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
✅ **Fixed**: Infinite refresh loop with circuit breaker
|
|
✅ **Impact**: Prevents refresh storms, reduces server load
|
|
✅ **Testing**: Verify with session invalidation scenarios
|
|
⚠️ **Next**: Monitor logs, consider Redis migration for multi-instance
|
|
|
|
The fix is **production-ready** and should immediately stop the refresh loop you're seeing in your logs.
|
|
|