Assignment 1

UCR - CS 172 Spring 2024


Instructions: Submit an electronic copy on Canvas by 4/20. This is an individual assignment.

 

Exercise A

Consider the following document D, taken from a collection C.

"The University of California, Riverside is one of 10 universities within the prestigious University of California system, and the only UC located in Inland Southern California. Widely recognized as one of the most ethnically diverse research universities in the nation."

Consider the following two queries:

Characteristics of collection C are as follows:

Assuming that stemming is used (i.e. "university" and "universities" are viewed as one term), compute the scores of Q1 and Q2 for D, using

(a) Vector space model cosine similarity, where the weight of a term in the query vector is the term frequency of the term in the query, and the weight of a term in the document vector is tf*log2(N/df).

(b) BM25.

(c) Unigram Language Model with Jelinek-Mercer smoothing with lambda=0.7.

Make and state any assumptions necessary, e.g., about the constants in BM25.
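
As a rough illustration of the three scoring functions named above, here is a minimal Python sketch. All function and parameter names are hypothetical, and several choices are assumptions rather than part of the assignment: the BM25 variant (no relevance information, with k1=1.2, b=0.75, k2=100), the natural logarithm in BM25, and the Jelinek-Mercer convention in which lambda weights the collection model (some textbooks put lambda on the document model instead). The collection statistics (N, df, collection term counts) must be supplied by the caller.

```python
import math


def cosine_tfidf(query_tf, doc_tf, df, n_docs):
    """Cosine similarity: query weights are raw tf, document weights are tf*log2(N/df)."""
    doc_w = {t: tf * math.log2(n_docs / df[t]) for t, tf in doc_tf.items()}
    dot = sum(qtf * doc_w.get(t, 0.0) for t, qtf in query_tf.items())
    q_norm = math.sqrt(sum(v * v for v in query_tf.values()))
    d_norm = math.sqrt(sum(v * v for v in doc_w.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def bm25(query_tf, doc_tf, df, n_docs, doc_len, avg_doc_len,
         k1=1.2, b=0.75, k2=100.0):
    """One common BM25 variant without relevance information (constants are assumptions)."""
    big_k = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for t, qf in query_tf.items():
        f = doc_tf.get(t, 0)
        n = df.get(t, 0)
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        score += idf * ((k1 + 1) * f / (big_k + f)) * ((k2 + 1) * qf / (k2 + qf))
    return score


def jm_log_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.7):
    """Log query likelihood with Jelinek-Mercer smoothing.

    Assumption: lam is the weight of the collection model,
    (1 - lam) the weight of the document model.
    """
    score = 0.0
    for t in query_terms:
        p = (1 - lam) * doc_tf.get(t, 0) / doc_len + lam * coll_tf.get(t, 0) / coll_len
        if p == 0.0:
            return float("-inf")  # term unseen in both document and collection
        score += math.log(p)
    return score
```

The language-model score is returned as a sum of logs, which ranks documents the same way as the product of term probabilities; exponentiate it if the product itself is wanted.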

 

Exercise B

Consider documents:

D1: Basketball is a sport played in many countries. NBA is a basketball league.

D2: NBA is a basketball tournament in the US and other countries.

D3: Basketball is a sport played in many countries. NBA is a basketball league. Click here for more.

1. Compute the simhash of each document and the simhash distances between them. First remove the stopwords.

Use 8-bit signatures. To hash a word, sum the ASCII numbers of its characters and then do "mod 256".

 

2. What are the tradeoffs (advantages/disadvantages) between shorter and longer signatures?

 

3. Based on the above documents, what threshold (maximum number of differing bits) would you pick for near-identical detection with 8-bit signatures?
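
The simhash procedure described in question 1 can be sketched in Python as follows. The word hash follows the assignment (sum of ASCII codes mod 256). Everything else is an assumption made for illustration: the stopword list, lowercasing before hashing, weighting each word by its frequency, and breaking ties (a column total of 0) toward a 0 bit. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stopword list; the assignment does not specify one.
STOPWORDS = {"a", "and", "for", "here", "in", "is", "more", "other", "the"}


def word_hash(word, bits=8):
    """Hash a word by summing the ASCII codes of its characters, mod 2**bits."""
    return sum(ord(ch) for ch in word) % (1 << bits)


def simhash(text, bits=8):
    """Compute a simhash fingerprint with frequency-weighted words."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    totals = [0] * bits
    for word, weight in Counter(tokens).items():
        h = word_hash(word, bits)
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:  # assumption: a total of exactly 0 yields a 0 bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Calling simhash on each of D1, D2, and D3 and then hamming on each pair gives the distances asked for; a different stopword list or tie-breaking rule can change individual bits, so the assumptions should be stated alongside the results.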