How we safely deduplicated a HubSpot database of around 800,000 contacts (without merging the wrong people)

How we cleared more than 20,000 duplicate matches from a roughly 800,000-contact HubSpot database without merging the wrong people, and stopped new duplicates forming.

John Kelleher

Here is how we cleared a large duplicate backlog from a HubSpot database of around 800,000 contacts, safely, without merging the wrong people, and how we stopped the duplicates coming back. The business is anonymised and the figures are rounded, but the work and the results are real.

The challenge

A business with a large, mixed B2B and B2C contact database, around 800,000 records, had a serious duplication problem. Most of it traced back to a single historical data import, and it was distorting reporting, breaking attribution and undermining trust in the CRM.

HubSpot's native duplicate manager could not solve it. It caps the view at 5,000 records, so the team could not even see the full extent of the problem. Every merge is manual and permanent. And it offers no way to stop new duplicates forming.

The risk most teams miss

The obvious fix, merging any records that share a phone number or email, would have been a disaster. Across the database, several thousand of those "duplicates" were actually two different people who happened to share one phone number: a shared household mobile, a reception line, a reissued handset. In a database where each person's history has to stay separate, auto-merging them would have fused two people into one record, and a HubSpot merge cannot be cleanly undone.
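To make the risk concrete, here is a minimal sketch that groups contacts by phone number and reports every number attached to more than one distinct name. The field names (`name`, `phone`) and the sample records are hypothetical, and the name comparison is deliberately crude (case and spacing only). It illustrates the problem, not the matching rules used on the project.

```python
from collections import defaultdict


def phones_shared_by_different_names(contacts):
    """Return {phone: sorted names} for phones attached to more than one distinct name."""
    names_by_phone = defaultdict(set)
    for contact in contacts:
        phone = contact.get("phone")
        if phone:
            # Ignore case and extra whitespace; nothing cleverer than that.
            name = " ".join(contact.get("name", "").lower().split())
            names_by_phone[phone].add(name)
    return {p: sorted(n) for p, n in names_by_phone.items() if len(n) > 1}


# Hypothetical sample data (numbers from the UK range reserved for fiction).
contacts = [
    {"id": 1, "name": "Anna Reid", "phone": "+447700900123"},
    {"id": 2, "name": "Tom Reid", "phone": "+447700900123"},    # shared household mobile
    {"id": 3, "name": "Anna Reid", "phone": "+447700900456"},
    {"id": 4, "name": "anna reid ", "phone": "+447700900456"},  # a genuine duplicate
]

conflicts = phones_shared_by_different_names(contacts)
print(conflicts)
```

A merge rule keyed on phone alone would treat both groups the same way. Only the second group is a true duplicate; the first is two people.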

What we built

As a UK HubSpot Diamond partner and a software engineering firm, we built a deduplication solution in two halves that share one decision library:

  • A supervised one-time clean-up that scales past the native 5,000-record cap and sees the whole database.
  • A go-forward guardrail, a HubSpot custom-code workflow that catches and resolves new duplicates as contacts arrive, so the clean-up holds.

The engineering that made it safe:

  • The name-gate. A record is only auto-merged when the name also agrees. Where a phone or email matched but the names did not, the records were never merged automatically; they were held for human review.
  • Smart matching. Mobile numbers were normalised to a single international format, and emails were matched with typo-domain correction, so a misspelled domain no longer hides (or fakes) a duplicate.
  • Clean resolution. Linked records clustered to a single surviving record under clear winner rules, with a short-name guard and automatic safety checks that abort the run on any inconsistency.
  • Reversible by review. A full backup of every affected record before any merge, a dry run reviewed before anything changed, and a complete audit log. Nothing was merged until the dry run was approved.
  • Reported in their terms. A branded run report with the results split by B2B and B2C.
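The matching and resolution steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the production decision library. It assumes UK mobile numbers, a small hand-written typo map (`DOMAIN_FIXES`), exact name equality for the name-gate, and an "oldest record survives" winner rule. All of these are stand-ins chosen for the example, as are the field names. The function only builds a plan, in the spirit of a dry run, and changes nothing.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical typo map; a real one would be curated from the data.
DOMAIN_FIXES = {"gmial.com": "gmail.com", "hotmial.com": "hotmail.com"}


def normalise_uk_mobile(raw):
    """Return a UK mobile as +447xxxxxxxxx, or None if it does not look like one."""
    digits = re.sub(r"\D", "", raw or "")
    if digits.startswith("00"):
        digits = digits[2:]            # 0044... -> 44...
    if digits.startswith("44"):
        digits = digits[2:]
    digits = digits.lstrip("0")        # trunk zero: 07700... or +44 (0)7700...
    return "+44" + digits if re.fullmatch(r"7\d{9}", digits) else None


def normalise_email(raw):
    """Lower-case the address and correct known typo domains."""
    local, sep, domain = (raw or "").strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return None
    return f"{local}@{DOMAIN_FIXES.get(domain, domain)}"


def names_agree(a, b):
    """Name-gate stand-in: equal after ignoring case and extra spaces."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def plan_merges(contacts):
    """Return (merge_clusters, review_pairs) without changing any record."""
    parent = {c["id"]: c["id"] for c in contacts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    by_key = defaultdict(list)
    for c in contacts:
        for key in (normalise_uk_mobile(c.get("phone")), normalise_email(c.get("email"))):
            if key:
                by_key[key].append(c)

    review = set()
    for group in by_key.values():
        for a, b in combinations(group, 2):
            if names_agree(a["name"], b["name"]):
                parent[find(a["id"])] = find(b["id"])  # link into one cluster
            else:
                review.add((a["id"], b["id"]))         # hold for a person

    clusters = defaultdict(list)
    for c in contacts:
        clusters[find(c["id"])].append(c)
    # Winner rule (an assumption for this sketch): the oldest record comes first.
    plan = [sorted(m, key=lambda c: c["created"]) for m in clusters.values() if len(m) > 1]
    return plan, sorted(review)


contacts = [
    {"id": 1, "name": "Anna Reid", "phone": "07700 900123",
     "email": "anna@gmail.com", "created": "2019-03-01"},
    {"id": 2, "name": "anna reid", "phone": "+44 (0)7700 900123",
     "email": "anna@gmial.com", "created": "2021-06-10"},
    {"id": 3, "name": "Tom Reid", "phone": "+447700900123",
     "email": "tom@example.com", "created": "2020-01-15"},
]
plan, review = plan_merges(contacts)
```

In this sample, records 1 and 2 form one cluster with record 1 listed first as the survivor, while the pairs involving record 3 land in the review list because the phone matches but the name does not.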

The results

  • More than 20,000 duplicate matches identified across the full database, well beyond what the native tool could show.
  • Around 12,000 records auto-merged safely, each having passed the name-gate.
  • Around 8,000 records routed to human review rather than risked, including the several thousand cases of two different people sharing a phone number, caught and held rather than wrongly merged.
  • Zero wrong merges, and a clean before-and-after figure for the team.
  • A go-forward guardrail in place, so the database stays clean rather than drifting back.

The point

Safe deduplication is a set of decisions, not a button. The reason this worked at scale is the same reason it was safe: every merge had to earn its place, and anything uncertain was held for a person. The result is a CRM the business can trust again, and one that stays clean.

More on the how: deduplicating HubSpot contacts at scale and why merging on phone number alone is dangerous.

Want your HubSpot duplicates cleared safely, at scale? See our HubSpot data engineering work, or tell us what you are running and we will scope it.

John Kelleher

Author
John is the founder and Chief Executive of SpotDev.
