- components that monitor the status of processes in [[Distributed Systems]] and decide whether each process is still available or has failed
- in [[Distributed Systems]], it is impossible to definitively distinguish a failed process from one that is just very slow to respond
- [[Networking|Network]] issues can cause a process to appear unresponsive, even when it is still functioning correctly
- failure detectors must make a tradeoff between detection time and the rate of false positives
- if the detector declares failures too quickly, it can incorrectly classify slow processes as failed (higher false positive rate)
- if it is too conservative, actual failures take much longer to detect
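
A minimal sketch of the tradeoff above, assuming a simple heartbeat-and-timeout design (one common approach, not the only one). The names `HeartbeatDetector`, `heartbeat`, `is_suspected`, and the timeout values are hypothetical and chosen for illustration. Timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative only).

    A process is suspected once no heartbeat has been recorded
    for more than `timeout` seconds.
    """
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, process_id: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this process.
        self.last_seen[process_id] = now

    def is_suspected(self, process_id: str, now: float) -> bool:
        last = self.last_seen.get(process_id)
        if last is None:
            # Assumption: a process we have never heard from is suspected.
            return True
        return now - last > self.timeout


# Two detectors with different timeouts observe the same process.
aggressive = HeartbeatDetector(timeout=1.0)
conservative = HeartbeatDetector(timeout=5.0)
for detector in (aggressive, conservative):
    detector.heartbeat("p1", now=0.0)

# At t=3.0 "p1" is alive but slow: its next heartbeat has not arrived yet.
false_positive = aggressive.is_suspected("p1", now=3.0)      # suspected too early
still_trusted = conservative.is_suspected("p1", now=3.0)     # not yet suspected

# If "p1" had actually crashed at t=0.0, the conservative detector
# would only begin to suspect it after t=5.0.
late_detection = conservative.is_suspected("p1", now=5.5)
```

With the short timeout, the slow process is wrongly suspected at t=3.0. With the long timeout it is still trusted, but a real crash would go unnoticed until more than 5 seconds had passed. Neither detector can tell the two cases apart; the timeout only sets where the tradeoff lands.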