A web crawler (web spider) is a program that automatically browses the WWW, collecting information. Such a program is the base for all WWW search engines. A web crawler can also have other purposes. In a nutshell, a web crawler works as follows: (1) it starts from a well known web-page (downloads the web-page); (2) from the current web-page, it collects all urls in the page; (3) it systematically travels the collected urls (e.g. depth first search, breath-first search, etc). In this homework, you will write a web crawler that will search through webpages in the www.cs.fsu.edu domain. The objective is to find at least 20 broken links in the webpages in the www.cs.fsu.edu.domain. Since we only focus on www.cs.fsu.edu, the crawler can be customized to ignore all other web servers. Due Oct. 28. Please send the source code as well as a file that contains the 20 broken links to Yixin Shou (shou@cs.fsu.edu). Hint: (1) Use the HTTP protocol to download files. After connect to www.cs.fsu.edu port 80. The crawler should be the following command to the web server GET filepath HTTP/1.1\n Host: www.cs.fsu.edu\n \n For example, to get the page http://www.cs.fsu.edu/~xyuan/index.html, you will send the following string to the server (www.cs.fsu.edu:80): "GET /~xyuan/index.html HTTP/1.1\nHosts: www.cs.fsu.edu\n\n" After command is sent, the server will respond using the HTTP/1.1 protocol: If the file is not found (the link is broken), the first line of responds is "HTTP/1.1 404 Not Found". If the file is found, the file will be sent back and the first line of the respond is "HTTP/1.1 200 OK" (2) HTTP/1.1 server reponses message has two parts: HTTP header and data. HTTP header and data is separated with an empty line. Your program needs to handle at least two types of responses: chunked reply and regular reply. Dynamically generated replies are usually chunked (try "GET / HTTP/1.1\nHost: www.cs.fsu.edu\n\n"): there is a line : "Transfer-Encoding: chunked" in the header. In this case, the data format is size1 (in hex) ... data (of size1 bytes) \r\n size2 ... data (of size2 bytes) \r\n size3 ... \r\n 0 For regular reply, there is a line "Content-Length: XXX" in the header. In this case, the data will be XXX bytes. (3) Your crawler should start from the page: http://www.cs.fsu.edu/ (4) Your crawler does not need to find all links. It is good enough to find a sufficient number of links (in our case 30 broken links). (5) The links in a webpage is specified in href="xxxx", where xxxx may be in a url format or in the regular file path format. Your crawler should construct the url from the file path format (again, you don't need to consider all cases, just ignore a link if your program cannot understand the link). (6) The web crawler must not search the same page two times, or it will be in an infinite loop. It must store all pages searched.