getSeoSitemap v5.0.0 | 2023-02-27

PHP library to generate a sitemap.
It crawls a whole domain, checking all URLs.
It performs Search Engine Optimization checks only on the URLs included in the sitemap.

Please support this project by making a donation via PayPal or via Bitcoin (BTC) to the address 19928gKpqdyN6CHUh4Tae1GW9NAMT6SfQH.

Warning

Before upgrading from a release lower than 4.1.1 to 4.1.1 or higher, you must drop the getSeoSitemap and getSeoSitemapExec tables from your database.

Overview
This script creates a full gzip sitemap, or multiple gzip sitemaps plus a gzip sitemap index.
It includes change frequency, last modification date and priority, set according to your own rules.
Change frequency is automatically selected among daily, weekly, monthly and yearly.
The maximum URL length is 767 characters; longer URLs will make the script fail.
The maximum page size is 16777215 bytes; larger pages will make the script fail.
URLs with an HTTP response code different from 200, or with a size of 0, will not be included in the sitemap.
It checks all internal and external links inside HTML pages and JS sources (href URLs in 'a' tags, plus form action URLs when the method is GET).
It checks all internal and external sources.
Mailto URLs will not be included in the sitemap.
URLs inside PDF files will not be scanned and will not be included in the sitemap.
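
For reference, the generated sitemap follows the standard sitemaps.org protocol. A single entry looks like the following (illustrative example with a hypothetical URL; the actual change frequency, last modification date and priority depend on your rules):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://example.com/page.html</loc>
      <lastmod>2023-02-27</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.8</priority>
    </url>
  </urlset>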

getSeoSitemapBot is a crawler like Googlebot and it does not execute JavaScript.
That means it does not follow URLs created by JavaScript.
On https://support.google.com/webmasters/answer/2409684?hl=en Google says:
".....
Some features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash can make it difficult for search engines to crawl your site.
Check the following:
Use a text browser such as Lynx to examine your site, since many search engines see your site much as Lynx would.
If features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
....."

To improve SEO, following the robots.txt rules for "User-agent: *", it checks:

  • HTTP response code of all internal and external sources within the domain (images, scripts, links, iframes, videos, audios)
  • malformed URLs within the domain
  • page title of URLs within the domain
  • page description of URLs within the domain
  • page h1/h2/h3 of URLs within the domain
  • page size of URLs in the sitemap
  • image alt of URLs within the domain
  • image title of URLs within the domain.

You can use absolute or relative URLs inside the site.
This script automatically sets which URLs to skip and which to allow in the sitemap, following the robots.txt rules for "User-agent: *" and the robots tag in the page head.
There is no automatic function to submit the updated sitemap to search engines.
The sitemap will be saved in the main directory of the domain.
It rewrites robots.txt, adding updated sitemap information (see the example below).
The maximum number of URLs that can be inserted into the sitemap is 2.5T.
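
As an illustration, the rewritten robots.txt could look like this (hypothetical domain and disallow rule; the Sitemap directive is the standard way to announce a sitemap to crawlers):

  User-agent: *
  Disallow: /private/

  Sitemap: https://example.com/sitemap.xml.gz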

Other main features:

  • backup of all previous sitemaps into the bak folder.
  • it retries a URL scan once, after 5 seconds, when the HTTP response code is different from 200 (see the sketch after this list).
  • it prevents saving the sitemap when the percentage difference in total URLs from the previous successful run exceeds a preset value.
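
The retry behavior can be pictured with a minimal PHP sketch (illustrative only, not the library's actual code; fetchStatus and the sample URL are invented for the example):

  <?php
  // Minimal sketch: get a URL's HTTP response code and retry once
  // after 5 seconds when it is not 200.
  function fetchStatus(string $url): int
  {
      $ch = curl_init($url);
      curl_setopt_array($ch, [
          CURLOPT_NOBODY         => true,  // status only, skip the body
          CURLOPT_RETURNTRANSFER => true,
      ]);
      curl_exec($ch);
      $code = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
      curl_close($ch);
      return $code;
  }

  $code = fetchStatus('https://example.com/page.html');
  if ($code !== 200) {
      sleep(5);  // wait 5 seconds, then retry once
      $code = fetchStatus('https://example.com/page.html');
  }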

Using getSeoSitemap, you will be able to give your visitors a better browsing experience.

Requirements

  • PHP 8.0.
  • MariaDB 10.4.
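
You can check both versions from the command line (the exact client command may vary with your installation):

  php -v
  mariadb --version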

Instructions
1 - copy the getSeoSitemap folder into a protected zone of your server.
2 - set all user parameters in config.php.
3 - in your server crontab, schedule the script once a day, preferably when your server is not too busy.
A command line example to schedule the script every day at 7:45:00 AM is:
45 7 * * * php /example/example/example/example/example/getSeoSitemap/getSeoSitemap.php
Once you know how long the whole script takes to execute, you could add a timeout to the crontab entry.
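
For example, assuming the GNU coreutils timeout command is available, this hypothetical entry stops the script if it runs for more than one hour:

  45 7 * * * timeout 3600 php /example/example/example/example/example/getSeoSitemap/getSeoSitemap.php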

Warning
Before upgrading from a release lower than 4.1.1 to 4.1.1 or higher, you must drop the getSeoSitemap and getSeoSitemapExec tables from your database.
Do not save any file whose name starts with sitemap in the main directory, otherwise the getSeoSitemap script could delete it.
The robots.txt file must be present in the main directory of the site, otherwise getSeoSitemap will fail.
In case of FPM timeout errors, you should set pm.process_idle_timeout to 30s or higher (see the example after this list).
To make getSeoSitemap run faster, if you use a script like Geoplugin you should exclude the getSeoSitemapBot user-agent from it.
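
For reference, pm.process_idle_timeout belongs in the PHP-FPM pool configuration; the file location varies by distribution, and the value below is just an example:

  ; e.g. /etc/php/8.0/fpm/pool.d/www.conf (path varies by distribution)
  ; pm.process_idle_timeout is used when pm = ondemand
  pm.process_idle_timeout = 30s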