Why does DBSCAN with Euclidean distance fail on raw lat/lon GPS data?

Euclidean distance treats degrees as Cartesian units, so a 1-degree longitude step varies from ~111 km at the equator to near zero at the poles. Convert coordinates to radians and use the haversine metric (or project to a local UTM CRS) before fitting.

How do I choose eps and min_samples for mixed urban and rural routes?

Derive eps from a k-distance elbow plot computed on a representative sample. Urban stops need a tighter radius (50–80 m) to avoid merging adjacent delivery points; rural depots tolerate 150–300 m. Set min_samples to the number of pings expected during your minimum valid dwell window (e.g., 3 pings at 30-second intervals cover a dwell of at least 60 seconds).

How can I run DBSCAN on a 10-million-point GPS dataset without memory errors?

Partition the dataset by vehicle or H3/S2 spatial cell, run DBSCAN independently per partition, then merge boundary clusters with union-find logic. Alternatively, use Dask or Spark to distribute the BallTree construction and cluster expansion across workers.

What causes phantom stops near traffic signals and how do I prevent them?

A low min_samples value combined with frequent 30–90 second halts at intersections generates clusters with valid density but no operational meaning. Apply a minimum dwell duration filter (e.g., ≥ 2 minutes) post-clustering and confirm ping_count meets a floor before writing a stop record.

DBSCAN for Fleet Stop Clustering

Fleet telematics pipelines routinely ingest millions of GPS pings daily, yet raw coordinate streams rarely map cleanly to operational stops. Sensor drift, multipath interference in urban canyons, variable idling patterns, and inconsistent sampling intervals introduce spatial noise that breaks naive radius-based stop detection. Density-based spatial clustering resolves this by identifying high-density point regions without requiring predefined stop boundaries or fixed grid overlays. This methodology anchors modern Stop Detection & Dwell Time Analytics frameworks, enabling logistics platforms to transform fragmented GPS traces into structured, actionable location intelligence.

Why Density-Based Clustering Outperforms Fixed-Radius Methods

Traditional stop detection relies on static geofences or distance-threshold heuristics. These approaches break when vehicles idle in non-standard locations (loading docks, curbside drop-offs, temporary staging areas) or when GPS error grows beyond 5–10 metres. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) evaluates point neighbourhoods dynamically. A point is designated a core point if at least min_samples points (in scikit-learn, counting the point itself) lie within radius eps. Border points lie within eps of a core point without being core themselves and attach to that core's cluster; outliers remain unassigned with label −1. This density-aware behaviour absorbs GPS jitter, discards most transient traffic halts as noise, and adapts to irregular stop geometries without manual polygon definition.

The mathematical heart of the algorithm is the neighbourhood set:

N(p) = { q ∈ D : d(p, q) ≤ eps }

where D is the set of GPS points and d is the chosen distance metric.

A core point satisfies |N(p)| ≥ min_samples. Directly-density-reachable points extend those cores; density-connected regions form the final clusters. No predefined cluster count is required, which is the decisive operational advantage over k-means for stop detection.
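The definitions above can be illustrated with a small dependency-free sketch. It assumes planar coordinates in metres and a hypothetical toy dataset, and it only labels points as core, border, or noise; it does not perform the cluster expansion that a full DBSCAN implementation does.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' under the DBSCAN definitions."""
    # Neighbourhood N(p): every point within eps of p, including p itself
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbours]
    labels = []
    for i, n in enumerate(neighbours):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in n):
            labels.append("border")  # within eps of a core point, but not core itself
        else:
            labels.append("noise")
    return labels

# Toy planar coordinates in metres (hypothetical data)
pings = [(0, 0), (5, 0), (0, 5), (5, 5), (12, 5), (200, 200)]
print(classify_points(pings, eps=8, min_samples=4))
```

The four tightly grouped pings each see four points within 8 m and become core points, the ping at (12, 5) is within reach of one core point only, and the distant ping has no neighbours at all.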

Before clustering, raw coordinates must pass through GPS data preprocessing and cleaning — particularly outlier removal in raw telematics streams — to prevent anomalous pings from seeding false dense regions.

Prerequisites

Requirement	Minimum version / detail
Python	3.10+
pandas	2.0 — nullable dtypes prevent silent NaN-as-zero bugs
numpy	1.24 — `np.radians` vectorised over float64 arrays
scikit-learn	1.3 — `DBSCAN` with `algorithm="ball_tree"` and `metric="haversine"`
geopandas	0.14 — optional; required for CRS reprojection and spatial join enrichment
pyproj	3.5 — UTM reprojection when Euclidean distance is preferable to haversine

Data schema requirements. Each GPS record must carry vehicle_id, latitude, longitude, speed_kmh, and timestamp_utc. Null coordinates or zero-epoch timestamps must be removed before vectorisation. Records where latitude or longitude falls outside the valid bounding box of your operational region indicate a sensor fault and should be quarantined.

Mathematical background. You need basic familiarity with the haversine formula (great-circle distance from angular coordinates), radian conversion, and the trade-off between BallTree and KDTree spatial indexes. BallTree supports the haversine metric and is required here. scikit-learn's KDTree handles only Minkowski-family metrics (Euclidean, Manhattan, Chebyshev) and rejects metric="haversine"; running it with a Euclidean metric on raw lat/lon inputs yields distorted distances without any warning.

Coordinate reference system decision. DBSCAN’s eps parameter must be expressed in units consistent with the chosen distance metric. Two valid approaches:

1. Keep WGS84 (EPSG:4326) → convert to radians → use metric="haversine" → express eps in radians (metres ÷ 6,371,000).
2. Reproject to a local UTM zone via the WGS84-to-local CRS conversion workflow → use Euclidean distance → express eps directly in metres.

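Approach 1 works at any geographic extent and is the simpler default, particularly when the operational area spans multiple UTM zones (each zone is only 6° of longitude wide, and distortion grows outside it). Approach 2 is preferable when the fleet stays within a single UTM zone and you need distance outputs in human-readable metres for downstream SLA calculations.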

Step-by-Step Workflow

1. Velocity and Temporal Pre-Filtering

High-speed transit points inflate computational complexity and distort density calculations. Apply a speed threshold to isolate stationary pings, then split each vehicle's remaining pings into segments wherever a temporal gap appears:

import pandas as pd

IDLE_THRESHOLD_KMH = 7.0
GAP_MINUTES = 15

df = df[df["speed_kmh"] <= IDLE_THRESHOLD_KMH].copy()
df = df.sort_values(["vehicle_id", "timestamp_utc"])

# Split into discrete trip segments on temporal gaps
df["time_delta"] = df.groupby("vehicle_id")["timestamp_utc"].diff()
df["segment_id"] = (
    df.groupby("vehicle_id")["time_delta"]
    .transform(lambda s: (s > pd.Timedelta(minutes=GAP_MINUTES)).cumsum())
)

Speed alone is insufficient. Combine with heading variance and acceleration deltas to filter false positives at traffic signals — see the discussion in the troubleshooting section below.

Expected output shape: a filtered DataFrame retaining 15–40 % of raw rows for urban delivery routes, with a new segment_id column partitioning each vehicle’s data into gap-free windows.

2. Coordinate Transformation and BallTree Construction

Convert decimal degrees to radians. The haversine formula expects angular inputs to compute great-circle distances accurately:

import numpy as np
from sklearn.neighbors import BallTree

df["lat_rad"] = np.radians(df["latitude"])
df["lon_rad"] = np.radians(df["longitude"])
coords = df[["lat_rad", "lon_rad"]].values  # shape (N, 2)

# Build once; reuse for parameter sweeps
tree = BallTree(coords, metric="haversine")

BallTree construction is O(N log N). Building it once and reusing it for k-distance plots (for eps calibration) pays dividends at fleet scale. For datasets above 5 M rows per vehicle, consider building per-segment_id trees to cap memory allocation.

3. DBSCAN Parameter Derivation

Hardcoded eps values fail when GPS accuracy varies across vehicle types. Derive eps from a k-distance elbow plot on a representative sample:

import matplotlib.pyplot as plt

K = 4  # same as your intended min_samples
distances, _ = tree.query(coords, k=K + 1)  # +1 because first nn is self
kth_distances = np.sort(distances[:, K])[::-1]

plt.plot(kth_distances)
plt.xlabel("Points (sorted by distance)")
plt.ylabel(f"{K}-th neighbour distance (radians)")
plt.title("k-distance plot — choose eps at the elbow")
plt.tight_layout()
plt.savefig("k_distance_elbow.png", dpi=150)

Convert the elbow value from radians to metres for human review: metres = radians * 6_371_000. Urban delivery operations typically yield elbows at 8 × 10⁻⁶ to 2.4 × 10⁻⁵ radians (50–150 m). Rural depot visits tolerate up to 5 × 10⁻⁵ radians (≈ 318 m).
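Reading the elbow off a plot by eye does not scale to many vehicle classes. One common heuristic, which is an assumption of this sketch rather than a method the workflow above prescribes, is to pick the point farthest from the straight line joining the two ends of the sorted curve. The k-distance values below are hypothetical.

```python
def elbow_index(sorted_desc):
    """Index of the point farthest from the chord joining the curve's endpoints."""
    n = len(sorted_desc)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    x0, y0 = 0, sorted_desc[0]
    x1, y1 = n - 1, sorted_desc[-1]

    # Perpendicular distance to the chord is proportional to this cross product
    def offset(i):
        return abs((y1 - y0) * (i - x0) - (x1 - x0) * (sorted_desc[i] - y0))

    return max(range(n), key=offset)

# Hypothetical k-distances in radians, sorted descending as in the plot
kth = [9e-5, 6e-5, 3e-5, 1.6e-5, 1.5e-5, 1.4e-5, 1.3e-5, 1.2e-5]
i = elbow_index(kth)
eps_rad = kth[i]
print(i, eps_rad, round(eps_rad * 6_371_000, 1))  # index, radians, metres
```

Treat the result as a starting value and still review the plot, since noisy curves can have several plausible elbows.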

4. Density-Based Cluster Execution

from sklearn.cluster import DBSCAN

EPS_RADIANS = 1.5e-5    # ~95 m for urban routes
MIN_SAMPLES = 4         # 4 pings at a 30 s interval span ~90 s; the dwell filter in step 5 enforces the 2-minute minimum

clusterer = DBSCAN(
    eps=EPS_RADIANS,
    min_samples=MIN_SAMPLES,
    metric="haversine",
    algorithm="ball_tree",
    n_jobs=-1,           # parallelise neighbourhood queries
)
df["cluster_id"] = clusterer.fit_predict(coords)

The n_jobs=-1 flag spreads the neighbourhood (radius) queries across all CPU cores; the cluster-expansion pass that follows is single-threaded. Verify that clusterer.labels_ contains a healthy proportion of valid cluster IDs (>= 0) versus noise (-1). A noise fraction above 60 % usually indicates eps is too small or min_samples is too large for the sampling interval.

Expected output shape: the DataFrame now carries cluster_id — non-negative integers for valid stops, −1 for noise.
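The noise check described above is simple to automate. The sketch below is a minimal helper that works on any sequence of labels; the example label list is hypothetical, and the 0.60 ceiling is the figure quoted in the text rather than a universal constant.

```python
def noise_fraction(labels):
    """Share of points labelled -1 (noise) in a DBSCAN label sequence."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels is empty")
    return sum(1 for label in labels if label == -1) / len(labels)

NOISE_CEILING = 0.60  # threshold quoted in the text above

example_labels = [0, 0, 0, 1, 1, -1, -1, 2, 2, 2]  # hypothetical fit_predict output
frac = noise_fraction(example_labels)
if frac > NOISE_CEILING:
    print(f"noise fraction {frac:.0%}: revisit eps / min_samples")
else:
    print(f"noise fraction {frac:.0%}: within tolerance")
```

In the pipeline above, the same function could be called as noise_fraction(df["cluster_id"]).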

5. Noise Isolation and Centroid Aggregation

stops = (
    df[df["cluster_id"] != -1]
    .groupby(["vehicle_id", "segment_id", "cluster_id"])
    .agg(
        centroid_lat=("latitude", "mean"),
        centroid_lon=("longitude", "mean"),
        arrival_time=("timestamp_utc", "min"),
        departure_time=("timestamp_utc", "max"),
        ping_count=("cluster_id", "count"),
    )
    .reset_index()
)

# Compute dwell duration; feed into downstream SLA checks
stops["dwell_seconds"] = (
    stops["departure_time"] - stops["arrival_time"]
).dt.total_seconds()

# Drop clusters that are implausibly short (traffic light artefacts)
MIN_DWELL_SECONDS = 120
stops = stops[stops["dwell_seconds"] >= MIN_DWELL_SECONDS]

The centroid_lat / centroid_lon columns represent the mean GPS position across all pings in each stop group, a reasonable approximation for compact urban stops. For large depot footprints (> 200 m diameter), consider the spatial (geometric) median instead, which is more robust to edge-case GPS scatter. SciPy has no built-in geometric-median routine, so compute it yourself, for example with Weiszfeld's iterative algorithm.
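A spatial median can be computed in a few lines. The following is a minimal sketch of Weiszfeld's iteration, assuming the pings have already been projected to planar metres (it is not meant for raw lat/lon); the function name, tolerance, and sample points are illustrative choices.

```python
import math

def geometric_median(points, tol=1e-9, max_iter=200):
    """Weiszfeld iteration for the point minimising summed distance to `points`."""
    # Start from the arithmetic mean
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(max_iter):
        wx = wy = wsum = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < 1e-12:
                continue  # skip a coincident point to avoid division by zero
            w = 1.0 / d
            wx += w * px
            wy += w * py
            wsum += w
        if wsum == 0.0:
            break  # every point coincides with the current estimate
        nx, ny = wx / wsum, wy / wsum
        converged = math.hypot(nx - x, ny - y) < tol
        x, y = nx, ny
        if converged:
            break
    return x, y

# Four pings at a dock plus one stray ping 300 m away (hypothetical, metres)
pts = [(0, 0), (2, 0), (0, 2), (2, 2), (300, 0)]
mean_x = sum(p[0] for p in pts) / len(pts)
med_x, med_y = geometric_median(pts)
print(round(mean_x, 1), round(med_x, 2), round(med_y, 2))
```

The arithmetic mean is dragged tens of metres toward the stray ping, while the median stays inside the dock footprint.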

These dwell intervals feed directly into time-window based dwell calculation modules, where shift boundaries, service-level agreements, and overtime rules are applied.

6. Downstream Enrichment and Typing

Raw centroids lack semantic context. Reverse-geocode coordinates, cross-reference against commercial zoning layers, and apply business rules to classify stops as warehouses, retail locations, customer sites, or unauthorised layovers. This enrichment layer powers Location Typing and POI Matching for Stops, enabling automated compliance checks and route optimisation feedback loops.

# Spatial join against a local POI GeoDataFrame using geopandas
import geopandas as gpd
from shapely.geometry import Point

stops_gdf = gpd.GeoDataFrame(
    stops,
    geometry=[Point(lon, lat) for lon, lat in zip(stops["centroid_lon"], stops["centroid_lat"])],
    crs="EPSG:4326",
)

poi_gdf = gpd.read_file("data/poi_polygons.geojson")  # customer sites, depots, etc.
enriched = gpd.sjoin(stops_gdf, poi_gdf[["poi_type", "poi_name", "geometry"]],
                     how="left", predicate="within")

Mathematical Model: Haversine Distance and Radians

The haversine formula gives the great-circle distance between two points on a sphere:

d = 2r · arcsin( sqrt( sin²(Δφ/2) + cos(φ₁)·cos(φ₂)·sin²(Δλ/2) ) )

where φ is latitude, λ is longitude (both in radians), r is Earth’s radius (6,371,000 m), and d is the surface distance in metres.

scikit-learn’s haversine metric implements this formula internally, so the eps value you pass is expressed as d/r — i.e. the angular distance in radians. To convert:

Metres to radians: eps_rad = desired_metres / 6_371_000
Radians to metres: metres = eps_rad * 6_371_000
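The formula and the two conversions can be checked with a short standalone sketch. It uses only the standard library and the same 6,371,000 m radius as the text; the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius used throughout this article

def metres_to_radians(metres):
    return metres / EARTH_RADIUS_M

def radians_to_metres(radians):
    return radians * EARTH_RADIUS_M

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

print(metres_to_radians(95))               # eps for a ~95 m radius
print(haversine_m(0.0, 0.0, 0.0, 1.0))     # one degree of longitude at the equator
print(haversine_m(60.0, 0.0, 60.0, 1.0))   # the same step at 60° N
```

The last two calls show why degrees cannot be treated as Cartesian units: the same one-degree longitude step covers about half the ground distance at 60° N.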

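One practical implication: haversine stays accurate over any extent because the cos(φ) terms account for meridian convergence at every latitude; its residual error comes from treating the Earth as a sphere (up to roughly 0.5 %), which is negligible at stop-clustering scales. A UTM projection is accurate only within its 6°-wide zone, so if your operational region spans more than one zone, keep the haversine metric rather than projecting.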

Numerical stability. The arcsin(sqrt(...)) form is well conditioned for small distances. In float64, coordinate differences at GPS precision (about 0.00001°, or 1.7 × 10⁻⁷ rad) sit far above machine epsilon, so rounding is not a practical concern. In float32 the same differences approach the resolution of the format itself, so keep coordinate arrays in float64.

Scalability Patterns

Static in-memory clustering becomes prohibitive above ~10 M rows per run. Adopt one of the following patterns based on fleet size and latency requirements:

Spatial partitioning (recommended for batch pipelines). Divide the operational region into overlapping H3 or S2 cells. Run DBSCAN independently per cell, then merge boundary clusters using union-find (disjoint-set) logic. Overlapping cells by at least one eps radius ensures no boundary stop is split across partitions.

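import h3
import pandas as pd

def h3_partition(df: pd.DataFrame, resolution: int = 7) -> dict[str, pd.DataFrame]:
    # latlng_to_cell is the h3-py v4 name (v3 called it geo_to_h3)
    cells = [
        h3.latlng_to_cell(lat, lon, resolution)
        for lat, lon in zip(df["latitude"], df["longitude"])
    ]
    # assign() returns a copy, so the caller's DataFrame is not mutated.
    # Cells do not overlap here; buffer each partition by eps separately.
    tagged = df.assign(h3_cell=cells)
    return {cell: grp for cell, grp in tagged.groupby("h3_cell")}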
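The merge step mentioned above can be sketched with a small disjoint-set structure. This is a simplified illustration under stated assumptions: each per-cell run emits (point_id, cell, local_cluster_id) tuples, points in an overlap zone appear once per cell, and any shared non-noise point links two clusters. A stricter merge would link only through shared core points; all names here are hypothetical.

```python
class DisjointSet:
    """Minimal union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_boundary_clusters(assignments):
    """Map each (cell, local_cluster_id) to a representative global cluster key."""
    ds = DisjointSet()
    first_seen = {}
    for point_id, cell, local_id in assignments:
        if local_id == -1:
            continue  # noise never links clusters
        key = (cell, local_id)
        ds.find(key)  # register the cluster
        if point_id in first_seen:
            ds.union(first_seen[point_id], key)  # shared point => same stop
        else:
            first_seen[point_id] = key
    return {key: ds.find(key) for key in list(ds.parent)}

# Hypothetical output of two per-cell runs; p2 lies in the overlap zone
assignments = [
    ("p1", "cellA", 0), ("p2", "cellA", 0),
    ("p2", "cellB", 3), ("p3", "cellB", 3),
    ("p4", "cellB", 4), ("p5", "cellA", -1),
]
merged = merge_boundary_clusters(assignments)
print(merged)
```

Clusters ("cellA", 0) and ("cellB", 3) share p2 and therefore resolve to the same representative, while ("cellB", 4) stays separate.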

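Approximate nearest neighbours. Replace the exact BallTree with FAISS or Annoy to trade exactness for faster queries when processing continental-scale telemetry. Neither library offers a haversine metric, so convert coordinates to 3D unit vectors first (for small angles, Euclidean chord length closely tracks the great-circle angle). Approximate methods introduce a recall trade-off: calibrate the approximation factor against your acceptable false-negative stop rate.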

Incremental/streaming windows. Process streaming events with a sliding temporal buffer (e.g., 30-minute windows with 5-minute overlap). Reassign points only when new pings alter neighbourhood density beyond a stability threshold. This pattern integrates naturally with the timestamp synchronisation stage that aligns multi-device clocks before data enters the clustering buffer.
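The window arithmetic for the sliding buffer can be sketched as a generator. This covers only the boundary computation, using the 30-minute length and 5-minute overlap from the example above as defaults; the reassignment logic and stability threshold are outside its scope, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

def sliding_windows(start, end, length=timedelta(minutes=30), overlap=timedelta(minutes=5)):
    """Yield (window_start, window_end) pairs covering [start, end) with the given overlap."""
    if overlap >= length:
        raise ValueError("overlap must be shorter than the window length")
    step = length - overlap
    t = start
    while t < end:
        yield t, min(t + length, end)  # the final window is clipped to `end`
        t += step

start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
windows = list(sliding_windows(start, end))
for w_start, w_end in windows:
    print(w_start.time(), w_end.time())
```

Each window starts 25 minutes after the previous one, so any stop shorter than the overlap that straddles a boundary is still fully contained in one window.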

Routing and Engine Integration Notes

Stop centroids produced here are the primary inputs to route optimisation engines. Key integration considerations:

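Coordinate order. OSRM expects longitude,latitude (GeoJSON order) in its request URLs, not latitude,longitude. Valhalla takes named lat and lon fields in its JSON locations, so order matters less there, but each value must still be mapped to the right key. The centroid columns produced above are named centroid_lat / centroid_lon: map them explicitly when constructing API payloads to avoid silent misrouting.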
Snapping distance. When submitting stop centroids as waypoints, set the OSRM radiuses parameter to a value slightly larger than your clustering eps in metres. This ensures the engine can snap the centroid to the nearest road even when the stop occurred off-network (e.g., inside a private warehouse yard).
Rate limiting. Batch geocoding and POI-matching calls against commercial APIs need exponential backoff combined with a circuit breaker. Quotas vary by provider, so take the ceiling from your plan's documented limit; 50 req/s measured over a 30-second rolling window is only a starting point.
Timezone awareness. arrival_time and departure_time are stored in UTC. Dwell durations can be computed directly in UTC, but shift boundaries and SLA windows are defined in local time, so convert the timestamps to the stop's geographic timezone (using timezonefinder or equivalent) before applying them.
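The coordinate-order and snapping notes can be combined in a small URL builder. This is a sketch only: the base URL assumes a self-hosted OSRM instance on localhost, the function name is hypothetical, and the snap radius should be chosen relative to your own eps.

```python
def osrm_route_url(stops, snap_radius_m, base_url="http://localhost:5000", profile="driving"):
    """Build an OSRM route request URL from (centroid_lat, centroid_lon) pairs."""
    # OSRM wants longitude first, so swap each (lat, lon) pair explicitly
    coords = ";".join(f"{lon:.6f},{lat:.6f}" for lat, lon in stops)
    # One snapping radius (in metres) per waypoint
    radiuses = ";".join(str(snap_radius_m) for _ in stops)
    return f"{base_url}/route/v1/{profile}/{coords}?radiuses={radiuses}"

# Two hypothetical stop centroids, given as (lat, lon)
url = osrm_route_url([(52.52, 13.405), (52.5, 13.4)], snap_radius_m=110)
print(url)
```

Keeping the swap inside one function means the rest of the pipeline can continue to carry (lat, lon) pairs without risking a silent reversal at the API boundary.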

Operational Troubleshooting

Cluster merging across adjacent delivery points

Cause. eps is too large relative to the spacing between consecutive delivery addresses (often 20–50 m on dense urban routes).

Symptom. Two or more distinct customer stops are assigned the same cluster_id; the aggregated centroid falls between buildings rather than at either address.

Fix. Reduce eps to 50–60 m for urban environments, and lower still where address spacing falls below that and GPS accuracy permits. Validate by spot-checking merged clusters against address geocodes. Alternatively, apply a heading-change filter: if consecutive pings show a heading reversal > 90°, force a segment break before clustering.

Memory exhaustion on full-fleet batch runs

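Cause. scikit-learn's DBSCAN computes every point's eps-neighbourhood in bulk before expanding clusters, so memory grows with N times the average neighbour count and approaches O(N²) when eps is large relative to point density (thousands of stationary pings at one depot are the worst case). Setting algorithm="brute" makes this worse by computing pairwise distances.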

Symptom. MemoryError during fit_predict, typically at 5–20 M points depending on available RAM.

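Fix. Confirm algorithm="ball_tree" is set. Partition by vehicle or H3 cell before clustering (see Scalability Patterns above). Collapse exact duplicate pings and pass their counts through sample_weight so that dense depots do not inflate neighbour lists. Raising leaf_size from DBSCAN's default of 30 to 100–200 shrinks the tree at the cost of slower queries, though the tree is rarely the dominant allocation. Keep coordinates in float64: in float32, radian values resolve only to roughly a metre, which conflicts with the precision requirement noted in the haversine section.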

Phantom stops at traffic signals and rail crossings

Cause. Low min_samples combined with frequent 30–90 second halts at fixed infrastructure generates clusters with technically valid density but no operational meaning.

Symptom. Stop records with dwell_seconds of 60–180 s clustered at known intersection coordinates; high false-positive rate in compliance reports.

Fix. Enforce dwell_seconds >= 120 (or your domain minimum) as a post-clustering filter. Pre-filter the input stream using a road-network mask: any point within 10 m of a signalised intersection (derivable from OpenStreetMap highway=traffic_signals) can be flagged for exclusion before clustering.

Coordinate wrapping artefacts near ±180° longitude

Cause. If your operational region straddles the antimeridian, longitude values jump from +179.9 to −179.9. The haversine metric handles this correctly in theory, but pandas mean() on longitude columns does not — the centroid falls on the wrong side of the dateline.

Symptom. Stop centroids land far from the true location, often near 0° longitude on the opposite side of the globe, for fleets operating in eastern Siberia or the Aleutian end of Alaska.

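Fix. Aggregate in a representation that is continuous across the antimeridian. Either use pyproj.Transformer to project to a regional UTM zone adjacent to the dateline (zone 60 or zone 1), aggregate, then reproject back to WGS84, or average the points as 3D unit vectors. Avoid EPSG:3857 (Web Mercator) for this purpose, because its x axis has the same discontinuity at ±180°.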
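One projection-free way to obtain a dateline-safe centroid is to average the pings as 3D unit vectors and convert the result back to angles. The sketch below is an illustration of that idea using only the standard library; the function name and sample coordinates are hypothetical, and it treats the Earth as a sphere.

```python
import math

def spherical_centroid(latlon_deg):
    """Mean position of (lat, lon) pairs in degrees, averaged as 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon in latlon_deg:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    # Normalisation is unnecessary: atan2 depends only on the vector's direction
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    return lat, lon

# Two pings 0.2° apart in longitude, on either side of the antimeridian
pings = [(64.0, 179.9), (64.0, -179.9)]
naive_lon = sum(lon for _, lon in pings) / len(pings)
lat, lon = spherical_centroid(pings)
print(naive_lon, round(lat, 4), round(abs(lon), 4))
```

The naive mean collapses to 0° longitude, whereas the vector average stays on the antimeridian where the vehicle actually was.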

GPS multipath inflating density and creating phantom clusters

Cause. Urban canyon multipath bounces satellite signals off building facades, creating a halo of 10–50 m scatter around the true vehicle position. This artificially elevates local point density.

Symptom. Clusters appear at locations the vehicle never actually stopped (e.g., mid-block on a through street) due to reflected signal accumulation.

Fix. Apply a Kalman filter for GPS noise reduction or a Savitzky-Golay smoother to raw coordinates before clustering. Validate cluster stability by re-running with 5 % of pings randomly dropped: a stable genuine stop should survive the subsampling; a multipath artefact will typically disappear.

eps calibrated on one vehicle class fails for another

Cause. Heavy trucks require wider loading-zone radii (100–200 m) than last-mile vans (40–80 m). A single global eps under-clusters trucks or over-clusters vans.

Symptom. Truck stops split into 3–5 sub-clusters at the same depot; van stops merge across adjacent customer addresses.

Fix. Maintain a vehicle_class → eps lookup table populated from k-distance elbow analysis per class. Apply the appropriate eps when partitioning the dataset by vehicle before clustering. For fleets with many classes, train a lightweight regression model mapping vehicle class and median GPS accuracy to optimal eps.
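A lookup table of this kind can be as simple as a dictionary plus a conversion helper. In the sketch below the class names are hypothetical and the radii are placeholders picked from inside the ranges quoted above; they should be replaced with values from your own per-class k-distance analysis.

```python
EARTH_RADIUS_M = 6_371_000

# Placeholder radii in metres, picked from inside the ranges quoted above;
# replace with values from your own per-class k-distance analysis.
EPS_METRES_BY_CLASS = {
    "heavy_truck": 150,
    "last_mile_van": 60,
}

def eps_radians_for(vehicle_class, default_m=95):
    """Look up eps for a vehicle class and convert it to radians for haversine."""
    return EPS_METRES_BY_CLASS.get(vehicle_class, default_m) / EARTH_RADIUS_M

for vehicle_class in ("heavy_truck", "last_mile_van", "unknown_class"):
    print(vehicle_class, eps_radians_for(vehicle_class))
```

Falling back to a default keeps the pipeline running for unlisted classes, but logging those cases is advisable so the table can be extended.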

Deployment Checklist

Confirm input coordinates are WGS84 (EPSG:4326) and not already in a projected CRS before radian conversion.

Verify eps is expressed in radians when using metric="haversine", not in metres.

Confirm algorithm="ball_tree" — never "brute" — on datasets above 10,000 points.

Validate that timestamp_utc column is a tz-aware datetime64[ns, UTC] dtype before temporal gap splitting.

Run a k-distance elbow plot on a representative 50,000-point sample for each vehicle class before deploying fixed eps values.

Apply MIN_DWELL_SECONDS filter (≥ 120 s) after aggregation to suppress traffic-signal artefacts.

Test centroid coordinate order (lat/lon vs. lon/lat) against the routing API with a single known location before bulk submission.

Confirm downstream dwell calculations localise arrival_time / departure_time to the stop’s geographic timezone, not the server’s local timezone.

Set up monitoring on the noise fraction (label == −1 rate). Alert if it exceeds 65 % on any vehicle, which indicates a data quality or parameter regression.

Validate cluster stability under 5 % random ping subsampling before promoting parameters to production.

Related

Tuning DBSCAN eps and min_samples for Delivery Truck Stops: vehicle-class-specific parameter calibration with k-distance methodology
Time-Window Based Dwell Calculation: applying shift boundaries and SLA rules to the stop records produced here
Location Typing and POI Matching for Stops: enriching raw centroids with commercial and operational context
Kalman Filtering for GPS Noise Reduction: pre-processing raw traces to reduce the multipath scatter that corrupts density estimates
Outlier Removal in Raw Telematics Streams: removing anomalous pings before they seed false clusters