VPF::GetURLs - Programmers' Forum

Zhenia87

Date: 22.3.2008, 19:28

I need to record all the URLs that Google returns for a query (as far as I understand from the assignment description and example). I have a program that parses a site, but when I give it a URL such as the one from the assignment example:
http://www.google.com/search?hl=en&q=%...amp;btnG=Search
the program prints this:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.google.com/search?hl=en&q=%...amp;btnG=Search
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
        at java.net.URL.openStream(URL.java:1007)
        at geturls.ReadTag.<init>(ReadTag.java:44)
        at geturls.ReadTag.<init>(ReadTag.java:37)
        at geturls.GetURLs.<init>(GetURLs.java:14)
        at geturls.GetURLs.main(GetURLs.java:41)
If I use a URL such as www.google.com instead, the program works fine.
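
A likely explanation, offered as an assumption rather than a confirmed diagnosis: `URL.openStream()` sends Java's default `User-Agent` header, and Google's search pages appear to reject requests that carry it, while the home page does not. The sketch below sets a browser-like header before reading. The class name `UserAgentFetch`, its method names, and the header value are hypothetical and not part of the original program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class UserAgentFetch {
    /** Hypothetical browser-like header value (an assumption, not a required string). */
    static final String USER_AGENT = "Mozilla/5.0 (compatible; GetURLs-example)";

    /** Prepares a connection with a custom User-Agent. No network traffic happens here. */
    static URLConnection prepare(URL url) throws IOException {
        URLConnection conn = url.openConnection(); // creates the object, does not connect yet
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    /** Opens a reader on the URL; the request is sent when getInputStream() is called. */
    static BufferedReader open(URL url) throws IOException {
        return new BufferedReader(new InputStreamReader(prepare(url).getInputStream()));
    }
}
```

In `ReadTag`, the constructor line that wraps `myURL.openStream()` could call `UserAgentFetch.open(myURL)` instead. Whether this alone is enough to avoid the 403 depends on the server's rules.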

ASSIGNMENT:
Name of project: GetUrls

Description: The goal of the program is to make lists of links copied from the Google pages for each keyword supplied, also using the words collected
in order to present fewer pages on Google.

1. The program takes a list of "keywords" (note that keywords can be Google queries that include Google commands; for example, "google.com -inurl:google.com" is considered a keyword as well). This list is called main_keywords.txt.

2. For each of the keywords, the program will search Google (using a different datacenter and a different proxy each time) and will need to make 2 lists:

  1. Recursive_words.txt - The program will append all the words collected from all the Google pages (1.....maximum), each word on a new line (note that words can also be Unicode).
   (Remove duplicate words, remove HTML elements like <a href=> etc...)

  2. Collected_Links.txt - The program will append all the full URLs collected from the Google pages to this list.
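
The word-collection rule for Recursive_words.txt (strip HTML elements, split into words, drop duplicates) could be sketched roughly as follows. This is only an illustration under assumptions: the class name `WordCollector` and method `extractWords` are hypothetical, tags are removed with a simple regular expression rather than a real HTML parser, duplicates are compared case-sensitively, and HTML entities such as `&amp;` are not decoded.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class WordCollector {
    /**
     * Returns the distinct words of an HTML fragment in order of first appearance.
     * A "word" here is a run of Unicode letters or digits (an assumption).
     */
    static Set<String> extractWords(String html) {
        // Replace every tag with a space so neighbouring words stay separated.
        String text = html.replaceAll("<[^>]*>", " ");
        Set<String> words = new LinkedHashSet<String>(); // keeps order, drops duplicates
        for (String w : text.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }
}
```

For example, `extractWords("<b>mail</b> and mail")` yields the two words `mail` and `and`.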

example:
----------------------------

Main_Keywords.txt contains:
______________________________
"google.com" -inurl:google.com
"yahoo.com" -inurl:yahoo.com
______________________________

program will query google like this (notice that "" is part of the query):
http://www.google.com/search?hl=en&q=%...amp;btnG=Search
http://www.google.com/search?hl=en&q=%...amp;btnG=Search
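
Turning a keyword line into a query URL of this shape can be done with `java.net.URLEncoder`. The sketch below is an illustration only: the class name `QueryUrlBuilder` and method `buildSearchUrl` are hypothetical, and the URL layout is copied from the default URL used in `GetURLs.main`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryUrlBuilder {
    /** Builds a search URL for one keyword line; quotes and colons are percent-encoded. */
    static String buildSearchUrl(String keyword) throws UnsupportedEncodingException {
        // URLEncoder turns spaces into '+' and reserved characters into %XX escapes.
        String encoded = URLEncoder.encode(keyword, "UTF-8");
        return "http://www.google.com/search?hl=en&q=" + encoded + "&btnG=Search";
    }
}
```

For the first keyword in the example, `buildSearchUrl("\"google.com\" -inurl:google.com")` returns `http://www.google.com/search?hl=en&q=%22google.com%22+-inurl%3Agoogle.com&btnG=Search`.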

then:
1. The program will take all the "site descriptions" (written in black) and append them to Recursive_words.txt, each word on a new line.
2. The program will take all the "full URLs" (written in green) and append the links, each on its own line, to Collected_Links.txt.
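
Appending to the two list files, one entry per line, might look like the sketch below. The class name `ListFileAppender` and method `appendLines` are hypothetical, UTF-8 is an assumed encoding (chosen because the assignment mentions Unicode words), and `java.nio.file` requires Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class ListFileAppender {
    /** Appends each entry as its own line, creating the file if it does not exist. */
    static void appendLines(Path file, List<String> lines) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

A call such as `ListFileAppender.appendLines(Paths.get("Collected_Links.txt"), links)` would add the collected links to the end of the file without overwriting earlier results.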

SITE-PARSING PROGRAM:
ReadTag.java

Code


package geturls;  // same package as GetURLs, matching geturls.ReadTag in the stack trace

import java.io.*;
import java.net.*;
public class ReadTag {
    /** The URL that this ReadTag object is reading */
    protected URL myURL = null;
    /** The Reader for this object */
    protected BufferedReader inrdr = null;
  
    /* Simple main showing one way of using the ReadTag class. */
    public static void main(String[] args) throws MalformedURLException, IOException {
        if (args.length == 0) {
            System.err.println("Usage: ReadTag URL [...]");
            return;
        }

        for (int i=0; i<args.length; i++) {
            ReadTag rt = new ReadTag(args[i]);  // read each URL argument in turn, not only the first
            String tag;
            while ((tag = rt.nextTag()) != null) {
                System.out.println(tag);
            }
            rt.close();
        }
    }
  
    /** Construct a ReadTag given a URL String */
    public ReadTag(String theURLString) throws 
            IOException, MalformedURLException {

        this(new URL(theURLString));
    }

    /** Construct a ReadTag given a URL */
    public ReadTag(URL theURL) throws IOException {
        myURL = theURL;
        // Open the URL for reading
        inrdr = new BufferedReader(new InputStreamReader(myURL.openStream()));
    }

    /** Read the next tag.  */
    public String nextTag() throws IOException {
        int i;
        while ((i = inrdr.read()) != -1) {
            char thisChar = (char)i;
            if (thisChar == '<') {
                String tag = readTag();
                return tag;
            }
        }
        return null;
    }

    public void close() throws IOException {
        inrdr.close();
    }

    /** Read one tag. Adapted from code by Elliotte Rusty Harold */
    protected String readTag() throws IOException {
        StringBuffer theTag = new StringBuffer("<");
        int i = '<';
      
        while (i != '>' && (i = inrdr.read()) != -1) {
                theTag.append((char)i);
        }     
        return theTag.toString();
    }

    /* Return a String representation of this object */
    public String toString() {
        return "ReadTag[" + myURL.toString() + "]";
    }
}

GetURLs.java

Code


package geturls;
import java.io.*;
import java.net.*;
import java.util.*;

public class GetURLs {
    /** The tag reader */
    ReadTag reader;

    public GetURLs(URL theURL) throws IOException {
        reader = new ReadTag(theURL);
    }

    public GetURLs(String theURL) throws MalformedURLException, IOException {
        reader = new ReadTag(theURL);
    }

    /* The tags we want to look at */
    public final static String[] wantTags = {
        "<a ", "<A "
        
    };

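    /** Collect every anchor tag (the whole tag text, not just the href value). */
    public ArrayList<String> getURLs() throws IOException {
        ArrayList<String> al = new ArrayList<String>();
        String tag;
        while ((tag = reader.nextTag()) != null) {
            for (int i=0; i<wantTags.length; i++) {
                if (tag.startsWith(wantTags[i])) {
                    al.add(tag);
                    break;        // a tag matches at most one prefix, so stop checking
                }
            }
        }
        return al;
    }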

    public void close() throws IOException {
        if (reader != null) 
            reader.close();
    }
    public static void main(String[] argv) throws 
            MalformedURLException, IOException {
        String theURL = argv.length == 0 ?
            "http://www.google.com/search?hl=en&q=%22google.com%22+-inurl%3Agoogle.com&btnG=Search" : argv[0];
        GetURLs gu = new GetURLs(theURL);
        try {
            ArrayList urls = gu.getURLs();
            Iterator urlIterator = urls.iterator();
            while (urlIterator.hasNext()) {
                System.out.println(urlIterator.next());
            }
        } finally {
            gu.close();  // release the underlying reader even if reading fails
        }
    }
}
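
Note that `getURLs()` returns whole anchor tags such as `<a href="...">`, while Collected_Links.txt is supposed to hold the URLs themselves. A minimal sketch of pulling the `href` value out of one tag is shown below. The class name `HrefExtractor` and method `extractHref` are hypothetical, and the regular expression only covers the common quoted and unquoted attribute forms, not every legal HTML variation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Matches href="...", href='...' or an unquoted href=value, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href value of a tag, or null if the tag has no href attribute. */
    static String extractHref(String tag) {
        Matcher m = HREF.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three alternatives captured the value.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }
}
```

Each tag printed by `GetURLs.main` could be passed through `extractHref` before being written out, skipping the tags for which it returns `null`.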

This message was edited by Zhenia87 - 2.4.2008, 18:24
