Process & Service Management

TL;DR

Every running program is a process with a PID. Control them with signals (SIGTERM=graceful stop, SIGKILL=force). Long-running services are managed by systemd (systemctl start/stop/enable). Schedule recurring tasks with cron. Monitor with ps, top/htop, and journalctl.

Explain Like I'm 12

Think of your computer as a school. Every process is a student doing their work. Each student has an ID number (PID). The principal (systemd) makes sure important students (services like the web server) are always in school — if one goes home sick, the principal calls them back. Signals are like announcements: "Please finish up and leave" (SIGTERM) vs "FIRE DRILL! Everyone out NOW!" (SIGKILL). Cron is the school bell — it triggers activities at specific times.

Process Lifecycle

Linux process lifecycle showing states: created, running, sleeping, stopped, zombie, and the signals that transition between them

Process Basics

# List processes
ps aux                          # All processes, full detail
ps aux --sort=-%mem | head -10  # Top 10 by memory usage
ps aux --sort=-%cpu | head -10  # Top 10 by CPU usage
ps -ef --forest                 # Tree view (parent-child)

# Process details
# USER  PID %CPU %MEM  VSZ   RSS TTY  STAT START TIME COMMAND
# root    1  0.0  0.1 16944  4840 ?    Ss   Mar28 0:05 /sbin/init

# Interactive monitors
top                             # Built-in (press q to quit, k to kill)
htop                            # Better UI (install: apt install htop)

# Find PID by name
pgrep nginx                     # PIDs of all nginx processes
pidof nginx                     # Similar to pgrep
STATStateMeaning
RRunningCurrently using CPU or waiting in the run queue
SSleepingWaiting for an event (I/O, timer, signal)
DDisk sleepUninterruptible sleep — waiting for I/O (can't be killed)
TStoppedSuspended by signal (Ctrl+Z or SIGSTOP)
ZZombieFinished but parent hasn't collected exit status
Info: Zombie processes aren't using CPU or memory — they're just leftover entries in the process table. They happen when a parent process doesn't call wait() on its children. A few zombies are harmless; many indicate a buggy parent process.

Signals

Signals are software interrupts sent to processes. The most important ones:

SignalNumberDefault ActionCan Be Caught?
SIGTERM15Graceful terminationYes — process can clean up
SIGKILL9Immediate terminationNo — kernel kills the process
SIGINT2Interrupt (Ctrl+C)Yes
SIGHUP1Hangup / reload configYes — many daemons reload
SIGSTOP19Pause processNo — can't be ignored
SIGCONT18Resume paused processYes
# Send signals
kill 1234                  # SIGTERM (default — graceful stop)
kill -9 1234               # SIGKILL (force kill)
kill -HUP 1234             # SIGHUP (reload config)
killall nginx              # Send SIGTERM to all nginx processes
pkill -f "python app.py"  # Kill by command pattern

# Keyboard signals
# Ctrl+C  → SIGINT (interrupt)
# Ctrl+Z  → SIGTSTP (suspend to background)
# Ctrl+\  → SIGQUIT (quit with core dump)
Warning: Always try kill PID (SIGTERM) first. Give the process 5-10 seconds to clean up. Only use kill -9 (SIGKILL) if it's completely unresponsive. SIGKILL doesn't allow cleanup — open files won't be flushed, temp files won't be deleted, locks won't be released.

Job Control

# Background a command
./long-task.sh &           # Start in background (& at end)
# [1] 12345                 # Job number and PID

# Manage background jobs
jobs                        # List background jobs
fg %1                       # Bring job 1 to foreground
bg %1                       # Resume suspended job in background

# Suspend and background
# 1. Run a command
# 2. Press Ctrl+Z (suspends it)
# 3. Type 'bg' (resumes in background)

# Keep running after logout
nohup ./script.sh &         # Immune to SIGHUP (won't die on logout)
# Output goes to nohup.out

# Or use screen/tmux for persistent sessions
tmux new -s mywork          # Create named session
tmux detach                 # Detach (Ctrl+B, d)
tmux attach -t mywork       # Reattach later
Tip: Use tmux or screen for long-running tasks on remote servers. If your SSH connection drops, the session keeps running. Reattach when you reconnect. This is essential for deployments and long builds.

systemd & Services

systemd is the init system (PID 1) on most modern Linux distributions. It manages services (daemons), boot order, logging, and more.

# Service management
sudo systemctl start nginx        # Start now
sudo systemctl stop nginx         # Stop now
sudo systemctl restart nginx      # Stop + start
sudo systemctl reload nginx       # Reload config (no downtime)
sudo systemctl status nginx       # Show status and recent logs

# Boot behavior
sudo systemctl enable nginx       # Start on boot
sudo systemctl disable nginx      # Don't start on boot
sudo systemctl is-enabled nginx   # Check if enabled

# List services
systemctl list-units --type=service              # Running services
systemctl list-units --type=service --state=failed  # Failed services

# View logs (journalctl)
journalctl -u nginx              # Logs for a specific unit
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx -f           # Follow logs live (like tail -f)
journalctl -p err                # Only error-level messages
journalctl --disk-usage          # How much space logs use
Info: systemctl reload is better than restart for services like Nginx — it reloads the config file without dropping existing connections. Not all services support reload; check with systemctl show nginx | grep Reload.

Creating a Custom Service

# /etc/systemd/system/myapp.service
[Unit]
Description=My Python Web App
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/python app.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
# Enable and start the custom service
sudo systemctl daemon-reload         # Reload unit files after changes
sudo systemctl enable --now myapp    # Enable + start in one command

Cron: Scheduled Tasks

# Edit your crontab
crontab -e

# Cron syntax:
# ┌─── minute (0-59)
# │ ┌─── hour (0-23)
# │ │ ┌─── day of month (1-31)
# │ │ │ ┌─── month (1-12)
# │ │ │ │ ┌─── day of week (0-7, 0=Sun, 7=Sun)
# │ │ │ │ │
# * * * * * command

# Examples
0 2 * * *   /opt/backup.sh           # Every day at 2:00 AM
*/15 * * * * /opt/health-check.sh    # Every 15 minutes
0 9 * * 1   /opt/weekly-report.sh    # Every Monday at 9 AM
0 0 1 * *   /opt/monthly-cleanup.sh  # First day of each month

# List cron jobs
crontab -l                            # Your crontab
sudo crontab -l -u www-data          # Another user's crontab

# System-wide cron
ls /etc/cron.daily/                   # Scripts run daily
ls /etc/cron.weekly/                  # Scripts run weekly
Warning: Cron jobs run with a minimal environment — your PATH and other env vars may not be set. Always use absolute paths for commands and files in crontab. Redirect output to log: 0 2 * * * /opt/backup.sh >> /var/log/backup.log 2>&1

System Monitoring

# CPU & Memory
top                          # Live process monitor
htop                         # Better interactive monitor
free -h                      # Memory summary
vmstat 1 5                   # VM stats every 1 sec, 5 times

# Disk I/O
iostat -x 1                  # Disk I/O stats per second
iotop                        # I/O usage by process (like top for disk)

# Network
ss -tuln                     # Listening ports
ss -s                        # Socket statistics summary
iftop                        # Network traffic by connection
nethogs                      # Bandwidth per process

# Load average
uptime                       # Shows 1min, 5min, 15min load averages
# Load average: 2.50, 1.80, 1.20
# (On a 4-core system, 4.0 = 100% utilized)
Tip: Load average is the number of processes waiting for CPU. Compare it to your core count: on a 4-core system, load of 4.0 = fully utilized, 8.0 = overloaded (processes queuing). Check cores with nproc.

Test Yourself

What is the difference between SIGTERM and SIGKILL? Why does it matter?

SIGTERM (15) requests graceful shutdown — the process can catch it, clean up resources (flush buffers, release locks), and exit. SIGKILL (9) kills immediately — can't be caught or ignored, no cleanup happens. SIGKILL can leave corrupted files, unreleased locks, and orphaned resources. Always try SIGTERM first.

What is a zombie process and how do you fix it?

A zombie is a process that has finished but whose parent hasn't called wait() to collect its exit status. It's a process table entry using no CPU/memory. Fix: you can't kill a zombie (it's already dead). Fix the parent process — restart it or send SIGCHLD. If the parent dies, init (PID 1) adopts and reaps the zombies.

Write a cron expression for "every weekday at 8:30 AM".

30 8 * * 1-5 — minute 30, hour 8, any day of month, any month, Monday through Friday (1-5). Remember: the day-of-week field is 0-7 where both 0 and 7 are Sunday.

What does systemctl enable nginx vs systemctl start nginx do?

start runs the service right now but it won't survive a reboot. enable creates a symlink so the service starts automatically on boot but doesn't start it now. For both: systemctl enable --now nginx.

A server's load average is 12.0 but it has 4 CPU cores. What does this mean?

Load average 12.0 on 4 cores = 3x overloaded. There are 12 processes wanting CPU but only 4 can run simultaneously — 8 are queuing. Investigate with top (sort by CPU: press P), check for runaway processes, I/O wait (wa% in top), or insufficient resources.

Interview Questions

A production web server becomes unresponsive. Walk through your debugging process.

1) uptime — check load average (CPU overload?). 2) free -h — check if memory is exhausted (OOM killer active?). 3) df -h — check if disk is full. 4) top/htop — find top CPU/memory consumers. 5) dmesg | tail — kernel messages (OOM kills, hardware errors). 6) journalctl -u nginx --since "5 min ago" — service logs. 7) ss -tuln — is the port actually listening? 8) curl -I localhost — can the server reach itself?

Explain the difference between a process and a thread in Linux.

A process has its own memory space, file descriptors, and PID — isolated from other processes. A thread shares the parent process's memory space but has its own stack and thread ID. Threads are lighter (faster to create, less memory) but sharing memory creates concurrency risks (race conditions). In Linux, threads are implemented as lightweight processes using clone() — they appear in ps -eLf.

How does the OOM killer work? How would you protect a critical process from it?

When the system runs out of memory, the kernel's OOM (Out of Memory) killer selects a process to kill based on a score (/proc/PID/oom_score) — it targets the biggest memory consumer. Protect critical processes: set OOMScoreAdjust=-1000 in the systemd unit file, or write -17 to /proc/PID/oom_adj. Better: set proper memory limits with cgroups so the OOM killer isn't triggered.