Scraping tools that extract data from websites are increasingly giving attackers an alternative route to the data they need for sophisticated attacks. This is what you need to know.
Beware of scrapers: once published on the Internet, information can be collected and used for other purposes.
It has become almost routine for the data of millions of users to fall into the wrong hands through hacks and leaks at online services. Recent examples include the records of over 533 million Facebook users and 500 million LinkedIn accounts.
What is unusual in these cases, however, is that both companies deny having been hacked. According to them, the incidents were cases of scraping.
The data that was harvested had been published by the users themselves and was visible to other members. But what exactly is scraping, how does it work, and how can you protect yourself from it?
Scraping Concept Overview
Scraping, short for screen scraping or web scraping, is a technique in which an application or script reads and stores information from a website or online service; in other words, it “collects” information from the screen.
Well-known uses of this technology include search engine bots such as Google's, which constantly traverse the Internet in order to index websites (crawling). Price comparison portals also use the method to collect huge amounts of data and then evaluate it.
In many cases this practice is even in the interest of website operators, since such indexing brings them greater reach or higher sales of their products and services. The same technology, however, can serve other ends: companies can use scraping to automatically trawl competitors' online stores.
They can then, for example, adjust their own prices so that they always undercut the competition slightly (price scraping), or they can lift product descriptions and images (content scraping) or even an entire store design, saving considerable time and money. The phone numbers and email addresses collected from Facebook have likewise been linked directly to subsequent waves of smishing and phishing.
How does scraping work?
The scraping process generally consists of two steps: visiting the desired web pages (whether static or dynamically generated) and then extracting the data. There are many scraping tools available, many of them published on GitHub, offering solutions and toolkits for a wide range of applications.
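To make the two steps concrete, here is a minimal sketch using only the Python standard library; the URL is a placeholder rather than a real target, and a real scraper would add error handling and rate limiting.

```python
# Step 1: visit the page. Step 2: extract data from its HTML.
# Minimal sketch with the Python standard library only; the URL is a
# placeholder, not a real scraping target.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the target of every hyperlink found on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Step 1: download the raw HTML of the page.
html = urlopen("https://example.com").read().decode("utf-8", errors="replace")

# Step 2: parse the HTML and pull out the pieces of interest.
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)
```

In practice, toolkits such as Scrapy or BeautifulSoup take over the extraction step; the two-step principle stays the same.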
In the case of the Facebook data, which also included information marked as private, experts believe a special method was used that exploited a gap in the platform's contact import feature, a gap Facebook closed at the end of 2019.
The feature is meant to let users find friends and acquaintances on Facebook by uploading their phone's address book. According to Facebook, attackers abused it on a large scale, submitting lists of phone numbers to match them against user profiles and then harvesting the information contained in the matching public profiles.
Is scraping legal or illegal?
The answer depends on many factors. If no technical protection measures are circumvented during scraping, the act itself is not illegal; after all, only information that is already publicly accessible is collected. What is done with the data afterwards, however, may well be illegal.
If images, articles and the like are copied and republished anywhere without permission, that is a clear copyright violation, and using scraped datasets for phishing and similar activities is illegal as well.
The verdict is even clearer when it comes to the collection of personal data. Data protection laws provide clear rules on collecting and storing personal data.
Anyone doing so must have a lawful basis, such as explicit consent or a legitimate interest in collecting and storing the data. The law also requires that no more data be processed than is necessary for the task at hand (data minimization).
Most social media operators also prohibit scraping in their terms and conditions. The fact that, as at Facebook and LinkedIn, little else is done in practice to prevent it casts a poor light on their security precautions.
Scraping: protection measures
Site operators have various options for protecting themselves against scrapers. Commonly used methods include CAPTCHA challenges and a robots.txt file that denies crawlers access. In addition, web application firewalls can usually detect the suspicious request patterns of a scraper.
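Keep in mind that robots.txt is purely advisory: compliant crawlers honor it, while malicious scrapers simply ignore it. Here is a minimal sketch of the compliant side, using Python's standard library (example.com stands in for a real site):

```python
# How a *polite* crawler honors robots.txt, via the standard library.
# robots.txt is advisory only; it does not technically block a scraper.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # download and parse the file

# If the file contained, for example:
#   User-agent: *
#   Disallow: /members/
# then can_fetch() would return False for URLs under /members/.
print(rp.can_fetch("*", "https://example.com/members/12345"))
```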
Operators should also avoid making life too easy for automated data collectors. Some sites appear to have numbered user profiles sequentially in the SQL database. That gives scrapers remarkably easy access: a simple script that increments the number in the profile link is enough for bulk data collection.
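One common countermeasure, sketched below with a hypothetical URL scheme and function name, is to expose random identifiers instead of sequential database IDs, so that neighboring profiles cannot be found by counting upwards:

```python
# Sketch of the countermeasure: hand out random, non-guessable profile
# identifiers instead of sequential database IDs. All names here are
# hypothetical placeholders.
import uuid

def new_profile_url(base: str = "https://example.com/profile/") -> str:
    # uuid4() provides 122 random bits, so a scraper cannot enumerate
    # profiles by simply incrementing a number in the URL.
    return base + uuid.uuid4().hex

print(new_profile_url())
```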
And what about the user side? Users should be aware that any information that is publicly available is also at risk of being scraped, whether on Facebook, LinkedIn or any other platform.
Mainton security experts put it this way: once information is published on the Internet, it can be collected, and you have no control over who copies the data or what is done with it.
Accordingly, the only way to keep public information from being collected and misused is not to publish it in the first place. Facebook also recommends that all users regularly review their privacy settings and adapt them to their current preferences.
Mainton Company - custom software development and testing, DevOps and SRE, SEO and online advertising since 2004.