Questions tagged [web-crawler]
A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web robots, or – especially in the FOAF community – Web scutters.
web-crawler
9,711
questions
0 votes, 0 answers, 12 views
AWS crawler creating Null values for partition columns
I have some country-level partitioned data in S3, and a crawler is crawling this root folder and creating a table. There are no null values for the country code. But when I looked in Athena, there ...
0 votes, 0 answers, 18 views
Why Am I Getting a 490 Response Code on TorProxy on an Ubuntu Server? [closed]
I've set up a TorProxy on my Ubuntu server for web crawling due to specific network requirements. When my crawler begins to operate, it functions correctly for about 3 to 4 minutes. However, after ...
-3 votes, 0 answers, 24 views
Download ICD-10 codes (International Classification of Diseases)
We can easily browse the ICD-10 codes: https://icd.who.int/browse10/2019/en
Unfortunately, there is no way to download all of the codes as a TXT (or XLS) file in order to parse with Python, or import ...
-1 votes, 0 answers, 19 views
Crawler - Rotten Tomatoes website - problem with pages
I'm trying to crawl the Rotten Tomatoes website, but I have a problem:
to get the HTML for page 5 and above of the movies, for example:
https://www.rottentomatoes.com/browse/movies_at_home/?page=**8**
...
-2 votes, 0 answers, 19 views
Mass-attack of Amazon bots [closed]
G'day folks. Recently we discovered a significant spike in outgoing data on our web server.
It turns out Amazon bots are downloading our web imagery, a lot. We set a disallow in our robots.txt, over a ...
1 vote, 1 answer, 57 views
Scrapy Spider does not work with multiple URLs
I wrote a Scrapy spider and used Selenium in it to scrape the products on devgrossonline.com.
It does not work with multiple category URLs, but it works when I provide only one URL.
Here is my spider:
...
-1 votes, 0 answers, 22 views
The time obtained by the Python crawler is incorrect when getting comments
When I use Python to crawl stock comments from a website, the time parsed from the website is different from the time obtained by my crawler.
For example:
When I use F12 to inspect the website, I find ...
-4 votes, 0 answers, 30 views
Cannot fetch images from specific site [closed]
I'm using PHP (Laravel) code to fetch images from external URLs and then save them into my project folder.
It works for all image URLs except some from a specific site, e.g. https://f00.esfr.pl/foto/...
0 votes, 1 answer, 31 views
TYPO3 indexed search fails to index PDF files
I'm hoping to get help with a problem I can't solve. The working environment is as follows:
SYSTEM
Debian 12 bookworm
PHP 7.4 (tried 8.2 and 8.3 with failure on crawler) + FPM/FastCGI
/usr/bin/...
0 votes, 0 answers, 12 views
How to download PDFs using Norconex Web Crawler?
I have tried to download PDFs from certain URLs (e.g. https://example.com) using the Norconex Web Crawler (v3.0) and the configuration below, but with no luck. Can someone please help me with this?
<?xml ...
0 votes, 0 answers, 37 views
Getting subsequent GET calls for some PUT, POST APIs in a website
I'm observing subsequent GET calls for some PUT and POST APIs. I already checked the code, and there are no GET calls created for those endpoints. But I'm seeing these calls in my server logs.
Say, for example ...
-2 votes, 0 answers, 35 views
TikTok finding username with videoID
I am currently working on a project that deals with the data of the DSA transparency database. Specifically, I am looking at the TikTok data. Now I would like to go one step further and check if the ...
0 votes, 0 answers, 10 views
Issues with Crawling Yahoo Auction During Peak Hours in a Cross-Border E-commerce System (Errors 404, 500)
I am seeking assistance with a critical issue we are facing in our cross-border e-commerce auction and proxy purchase platform. Our system relies heavily on web crawling technology to access Yahoo ...
0 votes, 0 answers, 25 views
Facebook Crawler not picking updated OpenGraph meta tags via Sharing Debugger but does via crawler curl call
Setup
It's a React app with React Helmet. It's deployed with Docker on a VPS and is exposed with Nginx. Cloudflare is used for SSL and as a Prerender.io worker.
Problem explanation
I make a change to ...
0 votes, 0 answers, 14 views
How to focus on the Instagram post comment textarea using vanilla JS?
I can select the textarea using the devtools console, but I cannot focus on it and start typing, and because of that the post button is disabled.
BTW, I can do it using Python + Selenium.