Credential compromise: A comparison of approaches to password scraping

Written by Ian Tinney

February 14, 2022


Password scraping has real potential as a breach detection/notification security tool, provided it is applied using robust controls.


Password scraping is a technique that helps you discover whether your username and password have been compromised in a data breach, and it’s now being used as a tool to encourage better password hygiene. The concept was initially the brainchild of Troy Hunt, who launched the HaveIBeenPwned website in 2013 to allow users to search the Adobe data leak. Subsequent leaks have seen the number of logged credentials grow, spurring interest in its potential as a security service.

Tech giants add alerts

HaveIBeenPwned has collated records from 538 data breaches and enables users to check whether their email address has been found in these published data breaches. It has also captured over 600 million unencrypted passwords, which are stored as SHA-1 hashes, in a list that can be downloaded or searched by hashing your own password. It has proven to be a phenomenal success – so much so that other technology companies are now using the concept. 

Google introduced its Password Checkup extension in 2019 before incorporating it into Google accounts and then Chrome later the same year. It compares passwords and usernames saved in Chrome against over 4 billion credentials, and if the user enters a combination on a website that matches, they will see a pop-up warning them that a data breach has exposed their password.

Hot on its heels, Apple added a data breach notification feature to iOS 14 in 2020. This compares the passwords logged in its Keychain password manager with those recovered from known data breaches. There are numerous other examples, and several pure-play password managers are adding password notification alerts to their service offerings too.

Issues with password scraping

But how useful are these services? Does an alert mean your credentials have been compromised? And what should you do with that knowledge?

Firstly, there are some real limitations with these forms of password scraping. The breach information has to be manually identified and uploaded, and keeping pace with thousands of breaches would require round-the-clock monitoring. That’s seldom the case, which means you are only capturing a fraction of the breaches that are continually being uploaded to dark websites and pasteboards.

Secondly, the accuracy of the results is not great. These services will show me the websites where my email address has been used and a potential list of personally identifiable information (PII) that may have been compromised, but there’s no way to establish whether both my username and password were obtained or which of the PII, if any, that particular company has on me. (The list of PII disclosed under the website associated with that email address is a generic capture-all list.)

Usually, it’s the email address that has been captured because this has a specific format that is easy to trawl for, but my email address is public information. This is a real problem because any email address that has been in use for some time will almost certainly be flagged as compromised.

Credentials are separated into username and password databases for security purposes, but this prevents the user from establishing whether both have been compromised together. The service won’t tell me what that password is. I have to know the password and use it to search a 10GB database, either by downloading the database or by hashing the password and searching the site with a two-part hash, which is technically beyond most people.
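To make the hashing step less abstract, here is a minimal Python sketch of one way a two-part hash lookup can work: the password is hashed with SHA-1, only a short prefix of the hash is sent to the service, and the remainder is compared locally against the list of suffixes that comes back. The function names, the five-character prefix length and the "SUFFIX:COUNT" response layout are assumptions made for illustration rather than a description of any particular service, and no network call is made here.

```python
import hashlib


def split_sha1(password: str, prefix_len: int = 5) -> tuple[str, str]:
    """Hash the password with SHA-1 and split the hex digest in two.

    Only the prefix would be sent to a lookup service; the suffix
    never leaves the user's machine. The prefix length is an assumption.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:prefix_len], digest[prefix_len:]


def count_in_range(suffix: str, range_body: str) -> int:
    """Look for our suffix in a response made of 'SUFFIX:COUNT' lines.

    Returns how many times the password was seen, or 0 if it is absent.
    The response layout is assumed for this sketch.
    """
    wanted = suffix.upper()
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == wanted and count.isdigit():
            return int(count)
    return 0


# Usage with a made-up response body (the counts are illustrative only).
prefix, suffix = split_sha1("Pencil")
sample_body = f"{'0' * 35}:3\n{suffix}:12\n"
times_seen = count_in_range(suffix, sample_body)
```

The point of splitting the hash is that the service only ever learns the prefix, which is shared by many other passwords, so the check can be made without disclosing the password itself.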

Finally, there’s no guidance on what I should do next. I don’t know where the data was breached from, so while these service providers can tell me where the credentials were used, i.e. the websites that match the password manager, they can’t help me identify what should be my first port of call. You’ll be advised to change your password on all sites linked to that email address regardless of whether you need to or not.

The way password scraping is currently performed therefore limits its potential application, but it doesn’t have to be like this.

Corporate vs consumer

Password scraping came about in response to the tardiness of organisations that were slow or reluctant to disclose a breach, or were even completely in the dark about it. It allowed Joe Public to take matters into their own hands to check the exposure of their online accounts and has proven a useful consumer tool. But as the value of scraping has become apparent in a user context, interest has grown in its possible use as a commercial security tool.

Organisations are now looking to use scraping and alerts as part of their security arsenal to improve password management internally and/or as a customer-facing service to demonstrate their commitment to security. Yet, the way scraping is currently performed limits its application – it simply has too many shortcomings. It’s likely to trigger far too many alerts, can’t qualify and rate the severity of the breach and offers limited remediation advice, all of which means it’s likely to become a burden rather than an asset to the security team.

In order for password scraping to become a viable security tool, it needs to:

  1. Capture more breach data with greater accuracy

Collecting data breaches 24×7 allows data to be gathered when it is leaked, reducing the window of compromise and allowing the business to react more quickly. Typically this breach data will have been tampered with or merged in some way, so it needs to be cleansed. By comparing and combining data, it can be deduplicated (preventing the same breach from being reported multiple times) and compared with other data breaches to determine if the threat level has changed.

  2. Prove domain ownership without technical modifications

Checking your entire workforce/domain using existing password scraping services today requires you to prove you own the domain by modifying its DNS records. For some large organisations or those that have to observe specific regulations, such as in the public sector, this represents a huge challenge. But if a trust model is employed, whereby the business simply proves it owns the domain without changing its DNS records, there’s no need to navigate this technical hurdle.

  3. Establish the seriousness of the breach

What has been lost? Have both usernames and passwords been compromised in the same dataset? Which employees are affected? Being alerted to a breach should be just the beginning. The security team cannot simply reset thousands of passwords; it needs more intelligence to determine its response. Pinpointing precisely what data has been exposed and who it relates to can help validate the data and prevent false positives, for example, those whose accounts are no longer in use. The threat level can then be determined or risk-scored and action taken accordingly.

  4. Provide remediation advice

The intelligence used to risk-score exposure can also help guide remediation, focusing resources on high-risk employees. The security team alerts those individuals affected by informing them of the source of the breach (e.g. Adobe) and the associated password (e.g. Pencil) before asking if they have reused them elsewhere/on any company systems. The team can then quickly establish which internal systems might be at risk.
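As a rough illustration of the cleansing and validation steps in the list above, the Python sketch below normalises raw breach records, removes duplicates so the same exposure is not reported twice, and then picks out the employees on a given domain whose password appeared alongside their email address in the same dataset. The record layout, field names and function names are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BreachRecord:
    """One row recovered from a leaked dataset (hypothetical layout)."""
    source: str                     # where the data was breached from, e.g. "Adobe"
    email: str
    password: Optional[str] = None  # None when only the email was exposed


def normalise(record: BreachRecord) -> BreachRecord:
    """Cleanse a record so that trivially different copies compare equal."""
    return BreachRecord(
        source=record.source.strip().lower(),
        email=record.email.strip().lower(),
        password=record.password or None,  # treat an empty string as missing
    )


def deduplicate(records: Iterable[BreachRecord]) -> list[BreachRecord]:
    """Drop repeated records while keeping the original order."""
    seen: set[BreachRecord] = set()
    unique: list[BreachRecord] = []
    for record in map(normalise, records):
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def exposed_pairs(records: Iterable[BreachRecord], domain: str) -> dict[str, set[str]]:
    """Map each email on `domain` to the sources that leaked it with a password."""
    suffix = "@" + domain.lower()
    result: dict[str, set[str]] = {}
    for record in deduplicate(records):
        if record.email.endswith(suffix) and record.password is not None:
            result.setdefault(record.email, set()).add(record.source)
    return result
```

Keeping the breach source on every record is what later allows the security team to tell an affected employee where the exposure came from, rather than simply asking for a blanket password reset.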

Next-gen password scraping takes the concept to another level with continuous monitoring that uploads breach data in near real-time to UK data centres with guaranteed uptime, security, availability and reliability. Unstructured data is cleansed, broken down and reconstituted into meaningful parts and compared with other breach information to determine whether the new data alters the current risk status of the credentials. Finally, risk-scoring this information by collectively assessing passwords, users, breach sources and where the information comes from means we can determine which users present the most risk to the business and escalate as necessary.
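To give a feel for what risk-scoring might look like in practice, here is a deliberately simple Python sketch. It is not how any particular product calculates risk: the fields, the weights and the function names are assumptions invented for this example, and a real scoring model would draw on far more intelligence.

```python
from dataclasses import dataclass


@dataclass
class Exposure:
    """Facts gathered about one exposed credential (hypothetical fields)."""
    email: str
    password_exposed: bool   # was a password found alongside the username?
    account_active: bool     # is the account still in use?
    privileged_user: bool    # e.g. an administrator or finance user
    source_verified: bool    # has the breach source been validated?


def risk_score(exposure: Exposure) -> int:
    """Return a score from 0 to 100 using arbitrary illustrative weights."""
    if not exposure.account_active:
        return 0  # dormant accounts are treated as false positives here
    score = 10  # baseline: an email address on its own is low risk
    if exposure.password_exposed:
        score += 50
    if exposure.privileged_user:
        score += 25
    if exposure.source_verified:
        score += 15
    return score


def escalation_order(exposures: list[Exposure]) -> list[Exposure]:
    """Sort exposures so the riskiest users are handled first."""
    return sorted(exposures, key=risk_score, reverse=True)
```

Even a crude model like this shows why scoring matters: an exposed email address alone sits at the bottom of the queue, while an active, privileged account with a matching password from a validated source goes straight to the top.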

At 4Data, we recognise the potential of password scraping as a means to quickly alert and inform the business of a credential breach and so are delighted to add the Threatstatus Trillion solution to our security portfolio. Trillion is a commercial breach detection system that has been architected to deliver a robust, secure means of discovering, validating and quantifying credential compromise. It uses rapid data discovery, data pre-processing, big data analytics and risk scoring to determine the threat posed by exposed credentials and offers appropriate advice on how to remediate threats.

For businesses that want to extend this type of credential monitoring to their customers, Threatstatus also offers the Arc solution. This monitors for credential compromise and detects when both usernames and passwords are being used together as a pair by an attacker who will spray the cloud in an active attempt to gain unauthorised access. Offered as an API for the customer-facing front-end of the business, Arc is ideally suited to e-commerce providers who want to show they take the security of their customers seriously.

To find out how Trillion can help you protect your users and your data or how Arc can protect your customers, contact us at info@4datasolutions.com or on 0330 128 9180.
