javascript - Is it possible to crawl directly through a site tree remotely or locally?
I'm a n00b at web development and have a n00b question.
Suppose there's a site that is, for example, laid out like
index.php
page1.php
page2.php
page2-1.php
page2-2.php
page3.php
Is there a way I can try to go directly to every subpage starting from the index, without knowledge of the subpage names? In concrete terms, is it possible in, say, JavaScript, to construct a function
that works like
console.log(printsitetree("stackoverflow.com"));
/* prints:
stackoverflow.com
stackoverflow.com/questions
. . .
stackoverflow.com/questions/29633992
. . .
stackoverflow.com/questions/29633992/is-there-any-tool-to-calculate-the-distance-between-a-program-point-and-a-execut
. . .
stackoverflow.com/tags
. . .
*/
without relying on undue brute force?
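Theory
You can get a list of the links on a site, if the site wants to let you have them. This is done via a site map: http://en.wikipedia.org/wiki/site_map
Usually, a site provides the location of its sitemap in its robots.txt file, so that crawlers can access it. The sitemap is an XML file with the URLs nested under sitemap/loc (in a sitemap index) or url/loc (in a plain sitemap).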
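As a rough illustration of the robots.txt step, here is a minimal sketch that pulls the sitemap locations out of the text of a robots.txt file. The function name `extractSitemapUrls` and the `User-agent`/`Disallow` lines in the sample are made up for the example; fetching the file itself is left out, since how you do that depends on where the script runs.

```javascript
// Hypothetical helper: collect the sitemap URLs declared in a robots.txt body.
function extractSitemapUrls(robotsTxt) {
  const urls = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Drop anything after a "#" (a robots.txt comment), then trim whitespace.
    // Assumption: the sitemap URL itself contains no "#".
    const line = rawLine.split("#")[0].trim();
    // Match the directive name case-insensitively and capture the URL.
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Made-up robots.txt content; only the sitemap line comes from the example site.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /private/",
  "sitemap: http://sitemap.msn.com/xml",
].join("\n");

console.log(extractSitemapUrls(sampleRobots));
```

A robots.txt file may list several sitemaps, which is why the sketch returns an array rather than a single URL.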
Example
Let's say you want the links needed to crawl http://www.msn.com/.
You can go to the usual robots file location, which is http://www.msn.com/robots.txt, and there you can find the line:
sitemap: http://sitemap.msn.com/xml
Visit that URL, and there is our URL list:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-autos-0</loc>
  </sitemap>
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-entertainment-0</loc>
  </sitemap>
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-entertainment-1</loc>
  </sitemap>
  <sitemap>
    <loc>http://sitemap.msn.com/xml/en-nz-finance-0</loc>
  </sitemap>
  ...
</sitemapindex>
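To turn a document like this into a plain list of URLs, something along these lines could work. It is only a sketch: `extractLocs` is a made-up name, and it uses a regular expression rather than a real XML parser, so it assumes the `<loc>` elements hold plain text (no CDATA sections or namespace prefixes). In a browser you could use `DOMParser` instead.

```javascript
// Hypothetical helper: return the text of every <loc> element in a sitemap document.
function extractLocs(xml) {
  const locs = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    // Undo the XML escaping of "&", the escape most likely to appear in a URL.
    locs.push(match[1].replace(/&amp;/g, "&"));
  }
  return locs;
}

// Two entries copied from the sitemap index above.
const sampleXml = `
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://sitemap.msn.com/xml/en-nz-autos-0</loc></sitemap>
  <sitemap><loc>http://sitemap.msn.com/xml/en-nz-finance-0</loc></sitemap>
</sitemapindex>`;

for (const loc of extractLocs(sampleXml)) {
  console.log(loc);
}
```

Note that the entries of a sitemap index point to further sitemaps rather than to pages. To reach the page URLs you would fetch each of those files and run the same extraction on them, where the pages sit under url/loc.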
Disclaimer
Not all sites provide this, and there is no guarantee that the links are all there or that the list is complete. It's up to you to figure out whether it's useful for your purpose.