javascript - Is it possible to crawl directly through a site tree, remotely or locally?


I'm a n00b at web development and have a n00b question.

Suppose there's a site that is, for example, laid out like

index.php
    page1.php
    page2.php
        page2-1.php
        page2-2.php
    page3.php

Is there a way to go directly to every subpage starting from the index, without any knowledge of the subpage names? In concrete terms, is it possible in, say, JavaScript, to construct a function that works like

console.log(printsitetree("stackoverflow.com"));
/* prints:
     stackoverflow.com
     stackoverflow.com/questions
            .
            .
            .
            stackoverflow.com/questions/29633992
            .
            .
            .
                stackoverflow.com/questions/29633992/is-there-any-tool-to-calculate-the-distance-between-a-program-point-and-a-execut
            .
            .
            .
     stackoverflow.com/tags
     .
     .
     .
*/

Without relying on undue brute force?

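Theory

You can get a list of the links on a site, if the site wants to let you have them. This is done via a site map: http://en.wikipedia.org/wiki/site_map

Usually, a site provides the location of its sitemap in its robots.txt file so that crawlers can access it. The sitemap is an XML file with the URLs nested under sitemap/loc (in a sitemap index) or url/loc (in a regular sitemap).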
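
As a rough illustration of that first step, here is a minimal sketch that pulls the sitemap locations out of the text of a robots.txt file. The function name findSitemapUrls and the sample file contents (including the example.com URL) are made up for the illustration. It assumes the robots.txt text has already been downloaded somehow.

```javascript
// Hypothetical helper: return every URL listed on a "Sitemap:" line
// of a robots.txt file. The field name is matched case-insensitively.
function findSitemapUrls(robotsTxt) {
  const urls = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    // Take the first whitespace-free token after "Sitemap:".
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Made-up robots.txt contents, for illustration only.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /private/",
  "Sitemap: https://example.com/sitemap.xml",
].join("\n");

console.log(findSitemapUrls(sampleRobots));
```

A robots.txt file may list several sitemaps or none at all, which is why the sketch returns an array.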

Example

Let's say you want the links needed to crawl http://www.msn.com/.
You can go to the usual robots file location, which is http://www.msn.com/robots.txt, and there you can find the line:
Sitemap: http://sitemap.msn.com/xml
Visit that URL and there is our URL list:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-autos-0</loc>
  </sitemap>
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-entertainment-0</loc>
  </sitemap>
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-entertainment-1</loc>
  </sitemap>
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-finance-0</loc>
  </sitemap>
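
To tie this back to the question, here is a hedged sketch of how the loc entries could be read in JavaScript and followed from a sitemap index down to the page lists. The names extractLocs, isSitemapIndex and collectPageUrls are hypothetical. The regular expression is a shortcut rather than a real XML parser, and fetchText is a function you would supply yourself to download a URL as text.

```javascript
// Extract the text of every <loc> element. This regex is a shortcut, not an
// XML parser: entities such as &amp; are left undecoded.
function extractLocs(xml) {
  return Array.from(xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g), (m) => m[1]);
}

// Assumption: a sitemap index is recognised by its <sitemapindex> root element.
function isSitemapIndex(xml) {
  return /<sitemapindex[\s>]/.test(xml);
}

// Collect page URLs, descending into nested sitemaps when the document is an
// index. fetchText is a caller-supplied (url) => Promise<string>; the depth
// limit is an arbitrary guard against endless nesting.
async function collectPageUrls(sitemapUrl, fetchText, depth = 3) {
  const xml = await fetchText(sitemapUrl);
  const locs = extractLocs(xml);
  if (!isSitemapIndex(xml)) return locs; // a regular sitemap: these are pages
  if (depth === 0) return [];            // stop descending
  const nested = await Promise.all(
    locs.map((loc) => collectPageUrls(loc, fetchText, depth - 1))
  );
  return nested.flat();
}

// The first two entries of the index shown above.
const sampleIndex = `
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap> <loc>http://sitemap.msn.com/xml/en-nz-autos-0</loc> </sitemap>
  <sitemap> <loc>http://sitemap.msn.com/xml/en-nz-entertainment-0</loc> </sitemap>
</sitemapindex>`;

console.log(extractLocs(sampleIndex));
```

In Node 18 or later, fetchText could be as simple as (url) => fetch(url).then((r) => r.text()). In a browser, requests to another site's robots.txt or sitemap are usually blocked by the same-origin policy unless that site allows them through CORS, so this kind of crawl generally has to run server-side.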

Disclaimer

Not all sites provide this, and there is no guarantee that the links listed there are valid or form a complete list. It's up to you to figure out whether it's useful for your purpose.

