T.R. Istanbul University

Department of Computer Engineering

Search Engine Design Course Web Page

Instructor: Asst. Prof. Dr. Şadi Evren ŞEKER

**Text Books:**

- Michael W. Berry and Murray Browne, *Understanding Search Engines: Mathematical Modeling and Text Retrieval* (Software, Environments, Tools), Second Edition

**Course Outline:**

Week 1: Introduction to search engines and modules

Week 2: Concepts of Web Crawler, Tokenizer, Stemmer, Stop Words

Week 3: Draft coding of web crawler and tokenizer (Click to download the codes)

Week 4: Statistical analysis of keywords on web pages and probabilistic modeling with n-grams. Introduction to the Indexer.

Week 5: Statistical components of search engines (statistical approaches to similarity, estimating result set size, duplicate detection), storing information (BigTable)

Week 6: Text processing (text statistics, Zipf's Law, Heaps' Law), tokenizing, stemming, the Porter stemmer algorithm, PageRank calculation
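As a rough illustration of the PageRank calculation covered this week, the following sketch runs the standard power iteration over a small adjacency list. The class name, the damping factor of 0.85, and the fixed iteration count are illustrative choices, not part of the course material.

```java
import java.util.Arrays;

// Illustrative sketch: power-iteration PageRank over an adjacency list.
public class PageRankSketch {
    // links[i] lists the pages that page i links to
    public static double[] pageRank(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                        // start uniform
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);                // teleportation term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {                // dangling node: spread evenly
                    for (int j = 0; j < n; j++) next[j] += d * rank[i] / n;
                } else {
                    for (int j : links[i]) next[j] += d * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, 2 -> 0: a simple cycle, so all ranks converge to 1/3
        int[][] links = { {1}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

For a symmetric cycle like the one in `main`, every page ends up with the same rank; the interesting behavior appears once some pages receive more in-links than others.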

**Homeworks:**

JAVA is strongly encouraged as the implementation environment because of its strong libraries, such as those for string handling and networking. You may nevertheless submit homeworks in another programming language; if you choose one, you must submit all homeworks in that same language, and you are on your own to find equivalents of the libraries that will be provided during the course.

1. Read the article below and run the source code attached to it.

http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

2. List 10 tree data structures and briefly explain the importance of each. (deadline: Oct 11, 2011, should be submitted before the class)

3. Implement a tokenizer that skips HTML tags and lists all distinct keywords together with their number of occurrences in the file. Also implement a stop-word solution: a list of stop words will be an input to your tokenizer, and your tokenizer will exclude any keyword that appears on that list. (deadline: Oct 11, 2011, should be submitted before the class)

4. Implement an n-gram counter for web pages and query strings. You are asked to compute and store the n-gram values of each web site you have tokenized and to implement a search algorithm over the n-gram values of the web pages. Take the n value of the n-gram as a parameter, and combine the work you have done so far into a single project. (deadline: Oct 24, 2011, until midnight)
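The counting part of the n-gram homework can be sketched as follows, assuming the input is an already-tokenized word list and that an n-gram is stored as a space-joined key; the class and method names are illustrative, and the search and storage parts are left to the assignment.

```java
import java.util.*;

// Illustrative sketch: count word n-grams of a token stream; n is a parameter.
public class NGramSketch {
    public static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        Map<String, Integer> grams = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            // join n consecutive tokens into one n-gram key
            String gram = String.join(" ", tokens.subList(i, i + n));
            grams.merge(gram, 1, Integer::sum);
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("search", "engine", "design", "search", "engine");
        // bigrams: "search engine" x2, "engine design" x1, "design search" x1
        System.out.println(countNGrams(tokens, 2));
    }
}
```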

5. Implement the simhash algorithm and a result set size estimation function.
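The simhash part can be sketched as below: each token is hashed to 64 bits, each bit position receives a +1/-1 vote, and the sign of the vote total gives the fingerprint bit. The FNV-1a token hash and the class names are assumptions; any well-mixed 64-bit hash works, and the result set size estimation is not shown.

```java
import java.util.List;

// Illustrative sketch: 64-bit simhash fingerprints for near-duplicate detection.
public class SimHashSketch {
    public static long simhash(List<String> tokens) {
        int[] v = new int[64];
        for (String t : tokens) {
            long h = hash64(t);
            for (int i = 0; i < 64; i++)
                v[i] += ((h >>> i) & 1) == 1 ? 1 : -1;   // per-bit vote
        }
        long fingerprint = 0;
        for (int i = 0; i < 64; i++)
            if (v[i] > 0) fingerprint |= 1L << i;        // majority wins the bit
        return fingerprint;
    }

    // FNV-1a 64-bit hash of a token (an assumed choice of hash function)
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (char c : s.toCharArray()) { h ^= c; h *= 0x100000001b3L; }
        return h;
    }

    // Hamming distance between fingerprints: small distance = near duplicates
    public static int hamming(long a, long b) { return Long.bitCount(a ^ b); }
}
```

Identical token lists produce identical fingerprints (Hamming distance 0), and documents sharing most tokens tend to land within a few bits of each other, which is what makes simhash useful for duplicate detection.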

6. Implement the Porter stemmer algorithm.
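To show the flavor of the task, here is a sketch of Porter's step 1a only (plural suffixes); the full algorithm adds steps 1b through 5 with measure-based conditions, so this fragment is a starting point, not a complete stemmer.

```java
// Illustrative sketch of Porter stemmer step 1a (plural suffix rules only).
public class PorterStep1a {
    public static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // sses -> ss
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // ies  -> i
        if (w.endsWith("ss"))   return w;                              // ss   -> ss
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1); // s    -> (drop)
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step1a("caresses")); // caress
        System.out.println(step1a("ponies"));   // poni
        System.out.println(step1a("cats"));     // cat
    }
}
```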

**Submit all missing homeworks (including the 6th HW) no later than the day before the midterm.**