
I am attempting to scrape a site which is notoriously difficult to scrape. Access from datacentres is generally blocked. In the past I've used various proxies, but recently these have stopped working.

The site springs various traps when it doesn't like the user: for example, certain JavaScript components fail, or the server redirects AJAX requests to localhost, effectively null-routing them.

I had previously assumed that the server was filtering by IP. Recently, however, I've noticed that the site acts up even from a "good" IP address, but only if proxied. In other words, if I open the site from a browser on computer A, it works perfectly fine. If I try to connect from computer B, which uses computer A as a proxy server, the site fails to load. Even if I connect from computer A using the proxy server running on that same machine, the site still fails to load.

Which leads me to believe that the site is somehow detecting the existence of a proxy.

The proxy software is one I've written myself, so I know for certain that it does not add any headers which would give it away. I have used it successfully for many years without issue, so it's unlikely to have an obvious bug. It cannot be queried by the remote server. It doesn't touch headers or certificates -- it only forwards HTTPS traffic with the CONNECT method. (There is no plain HTTP traffic.)
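
For reference, here's a stripped-down sketch of the kind of tunnelling it does (not my actual code, and simplified -- no error handling, naive request parsing -- but the principle is identical): read the CONNECT line, dial the target, answer 200, then shuttle bytes verbatim in both directions. The TLS session runs end to end between the browser and the site.

```python
import socket
import threading

def tunnel(src, dst):
    # Copy bytes one way until either side closes.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def handle(client):
    # Read the request, e.g. "CONNECT example.com:443 HTTP/1.1" plus headers.
    request = client.recv(4096).decode("latin-1")
    host, _, port = request.split()[1].partition(":")
    upstream = socket.create_connection((host, int(port or 443)))
    client.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
    # From here on, bytes are relayed verbatim; nothing is added or altered.
    threading.Thread(target=tunnel, args=(client, upstream)).start()
    threading.Thread(target=tunnel, args=(upstream, client)).start()

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 8888))
listener.listen(5)
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,)).start()
```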

The browser I'm using is Firefox, and WebRTC is disabled.

My question is: is there any way for a website/webserver to detect:

  1. That a browser has some proxy settings configured?
  2. That a proxy server is being used at all?
  • Sounds like bot detection, not proxy detection. When you run a manual browser session from computer B using computer A as a proxy, does it still fail to load?
    – pcalkins
    Commented May 3, 2022 at 22:07
  • What timezone is the IP in? What timezone is the computer (that is running the browser) set to? Are they the same?
    – mti2935
    Commented May 4, 2022 at 0:19
  • @pcalkins Yes, it fails from a browser as well. It seems to fail if there are any proxy settings on the browser at all. Commented May 4, 2022 at 6:36
  • Would you like to share the HTTP header exchange for both cases, with and without the proxy?
    – elsadek
    Commented May 7, 2022 at 17:04
  • I think the question might benefit if you shared the site in question. That way, someone might take the time to analyze it and figure out what's going on. Otherwise, at this point, we're just guessing.
    – nobody
    Commented May 13, 2022 at 18:01

4 Answers


Yes, it's possible!

Conventional proxy servers usually add an X-Forwarded-For HTTP header; that's how normal proxies work. The server you are connecting to can simply read this header, which contains the original client's IP address. So, all in all, detectable.
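
For illustration, here's a minimal, hypothetical sketch of that server-side check (a real site would do this inside its web framework; this handler just reports what it sees):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        forwarded = self.headers.get("X-Forwarded-For")
        if forwarded:
            # The header carries the original client IP(s); its mere
            # presence reveals that a proxy sits in the path.
            body = f"Proxy detected; original client: {forwarded}"
        else:
            body = "No forwarding header seen"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```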

On the other hand, we have something called "anonymizers". These are the same kind of proxy software, but deliberately configured not to send the X-Forwarded-For header. Even then, there are ways to infer that a proxy is in use. Suppose ten people are using the same proxy. The website never sees their individual IPs; instead, it sees ten different request streams coming from a single IP, with multiple operating systems, multiple browsers, multiple browser versions, etc. From that, it can conclude that you may be using a proxy.

Hopefully this was helpful.

Note: X-Forwarded-For header documentation: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For

  • You missed the part where I said I use my own proxy software, which does not add any additional headers. Furthermore, I am the only person using it, so nobody else is sharing the IP. Commented May 7, 2022 at 6:55

Some HTTPS web servers may use the TLS session resumption technique (https://www.ssl.com/article/tracking-users-with-tls/) to reduce handshake overhead: both TLS endpoints rapidly build new encryption material based on a previous session. The downside is that it allows the server to track the client endpoint and detect any change in its IP address, which may be caused by switching networks, connecting directly and then through a proxy, changing the proxy, etc.
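
To make the mechanics concrete, here is a sketch of resumption from the client side using Python's ssl module (example.com is a placeholder host). The second connection offers the saved session, and that saved session is precisely the correlation handle the server can use, regardless of which IP the connection arrives from:

```python
import socket
import ssl

host = "example.com"  # placeholder host
ctx = ssl.create_default_context()

# First connection: full handshake. Read some application data so the
# (TLS 1.3) session ticket has a chance to arrive, then save the session.
with ctx.wrap_socket(socket.create_connection((host, 443)),
                     server_hostname=host) as s1:
    s1.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    s1.recv(1024)
    saved = s1.session

# Second connection: offer the saved session. If the server accepts it,
# the two connections are linkable -- even across different client IPs.
with ctx.wrap_socket(socket.create_connection((host, 443)),
                     server_hostname=host,
                     session=saved) as s2:
    print("session resumed:", s2.session_reused)
```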


Short answer: yes, but not always.

In the past, many websites simply relied on IP blacklists, often sold by third parties, which contain lists of IPs known to have been used for malicious purposes. They can be a useful tool to protect a website, but do not detect the majority of proxies and can be avoided by almost all attackers simply by using unflagged IPs.

There are now proxy detection services which many websites use, mainly maxmind.com and spur.us. These services can detect the majority of residential proxies in most cases, but there are some they fail to detect. They typically use databases of known proxies, which are populated from analysing historical traffic data.

Residential proxies: These can be detected through prior knowledge of an IP being used by a proxy provider (think large-scale IP enumeration), or by monitoring traffic from that IP long-term and detecting anomalies (think vast numbers of different users, from different timezones and with different browsers/languages, all using one IP). The caveat is that this detection comes with a fair number of false positives (relying on historical traffic data is always going to be inaccurate in many cases, and many mobile-network and even residential IPs are assigned from a pool and so have multiple past uses), and it cannot detect less well-known or newly registered proxies, which motivated attackers prioritise.

Datacenter proxies: These can be detected by looking up the hosting information of the IP and checking whether it is present in a known datacenter. This of course fails when the datacenter is not known.
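
As a toy illustration of that lookup (the CIDR ranges below are documentation placeholders, not real hosting ranges; a real service would build them from published WHOIS/ASN data):

```python
import ipaddress

# Placeholder "known datacenter" ranges (RFC 5737 documentation blocks).
DATACENTER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_datacenter_ip(ip: str) -> bool:
    # Flag the IP if it falls inside any range attributed to a hoster.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)

print(is_datacenter_ip("203.0.113.7"))  # True  -> flagged as datacenter
print(is_datacenter_ip("192.0.2.50"))   # False -> not in any known range
```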

Mobile network/4G proxies: These are mostly undetected by existing proxy detection services, since they typically aren't used at large scale, are often used by only a few attackers, and are often freshly registered.

In recent years, multiple new proxy detection techniques have been proposed that claim 90-99% accuracy at detecting residential proxies. These mostly rely on detecting discrepancies in the latencies of a network connection, commonly with the help of machine learning. The discrepancy arises because a proxy splits the connection into two distinct legs, and it can be measured by comparing the RTT of packets between the server and the proxy with the RTT of an application-level round trip between the server and the real client.

For slow residential proxies this method works very well, hence the up-to-99% accuracy. For other types of proxies it doesn't perform as well, and it also produces false positives in practice: random network delays and fluctuations, which are fairly common, can cause non-proxy IPs to be falsely flagged, since such added latency can look identical to that introduced by a proxy, and the method necessarily relies on a set tolerance. It is less accurate against datacenter proxies and low-latency residential proxies, which are typically faster and add less latency to the connection, and motivated attackers can effectively bypass detection by choosing proxies they have a very low latency to (so the added latency stays within the tolerance). An example of this is BadPass.
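
Here is a toy sketch of the decision logic only (all numbers are made up; in a real deployment the TCP-leg RTT might be read from the kernel, e.g. via TCP_INFO on Linux, and the application-level RTT measured by timing an echo to the browser):

```python
# Assumed inputs: tcp_rtt_ms is the RTT of the TCP leg the server
# terminates (server <-> proxy, or server <-> client when direct);
# app_rtt_ms is an application-level round trip that must reach the
# real client. The tolerance is illustrative.

def looks_proxied(tcp_rtt_ms: float, app_rtt_ms: float,
                  tolerance_ms: float = 5.0) -> bool:
    # Without a proxy, both measurements cover the same path and should
    # agree to within normal jitter. A proxy splits the path, so the
    # application round trip picks up an extra proxy <-> client hop.
    return (app_rtt_ms - tcp_rtt_ms) > tolerance_ms

# Direct connection: the two RTTs roughly agree.
print(looks_proxied(tcp_rtt_ms=42.0, app_rtt_ms=44.0))  # False

# Proxied connection: TCP terminates at the proxy, but the echo also
# travels proxy -> client -> proxy, inflating the application RTT.
print(looks_proxied(tcp_rtt_ms=12.0, app_rtt_ms=95.0))  # True
```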

There are some methods that can detect proxies with 100% certainty, but these tend to be novel and not publicly disclosed.

So, to sum up: it is possible for a website to detect the use of a proxy, but it is not always feasible, so most do not do it.

Disclaimer: I run a proxy detection service that uses novel proxy detection techniques to achieve 100% accuracy detecting all types of proxies: detectproxy.io


HTTP proxies often listen on well-known ports. By means of a port scan, the contacted web server can detect which ports are open on the connecting client, which can indicate whether the IP address the server sees belongs to a middlebox.
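
A rough sketch of what such a server-side probe could look like (the port list and timeout are illustrative):

```python
import socket

# Ports commonly used by HTTP/SOCKS proxies (illustrative selection).
COMMON_PROXY_PORTS = [80, 1080, 3128, 8080, 8888]

def open_proxy_ports(ip: str, timeout: float = 1.0) -> list[int]:
    found = []
    for port in COMMON_PROXY_PORTS:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 when the port accepts connections.
            if s.connect_ex((ip, port)) == 0:
                found.append(port)
    return found

print(open_proxy_ports("203.0.113.7"))  # placeholder client IP
```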

  • Not the case in this instance. In fact, the same happens if I run the proxy from behind a firewall on my own machine. Commented May 8, 2022 at 18:52
  • @CaptainCodeman Any chance the site is running a port scan to detect HTTP proxies through the browser/client? There are some tricks which allow a limited form of port scanning using JavaScript.
    – nobody
    Commented May 13, 2022 at 15:15
  • @nobody No, that's not it; I keep the local proxy running all the time, even when I don't route traffic through it. (And when I don't use the proxy, the site works fine.) Commented May 13, 2022 at 17:49
