SMART Information Retrieval System

The SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s.{{cite journal |last1=Salton |first1=G. |last2=Lesk |first2=M. E. |title=The SMART automatic document retrieval systems—an illustration |journal=Communications of the ACM |date=June 1965 |volume=8 |issue=6 |pages=391–398 |doi=10.1145/364955.364990 |doi-access=free }} Many important concepts in information retrieval were developed as part of research on the SMART system, including the vector space model, relevance feedback, and Rocchio classification.

Gerard Salton led the group that developed SMART. Other contributors included Mike Lesk.

The SMART system also provides a set of corpora, queries, and reference rankings drawn from different subject areas, notably:

ADI: publications from information science reviews
Computer science
Cranfield collection: publications from aeronautic reviews
Forensic science: library science
MEDLARS collection: publications from medical reviews
Time magazine collection: archives of the general-interest magazine Time from 1963

Part of the legacy of the SMART system is the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd.qqq, where the first three letters represent the term weighting of the collection document vector and the second three letters represent the term weighting of the query vector. For example, ltc.lnn denotes the ltc weighting applied to a collection document and the lnn weighting applied to a query.
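As a minimal sketch of how such a mnemonic can be taken apart, the following Python snippet splits a ddd.qqq string into a document triple and a query triple. The names `Triple` and `parse_smart` are hypothetical, and the assignment of the three letters to term frequency, document frequency, and length normalization (in that order) is an assumption of this sketch.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One three-letter weighting scheme (field names are illustrative)."""
    tf: str    # term-frequency component (assumed to be the first letter)
    df: str    # document-frequency component (assumed second letter)
    norm: str  # length-normalization component (assumed third letter)


def parse_smart(notation: str) -> tuple[Triple, Triple]:
    """Split a 'ddd.qqq' mnemonic into (document triple, query triple)."""
    doc, sep, query = notation.partition(".")
    if not sep or len(doc) != 3 or len(query) != 3:
        raise ValueError(f"expected the form 'ddd.qqq', got {notation!r}")
    # Unpacking a three-character string yields one letter per field.
    return Triple(*doc), Triple(*query)


doc_scheme, query_scheme = parse_smart("ltc.lnn")
print(doc_scheme)
print(query_scheme)
```

Keeping the two triples separate reflects that the notation allows documents and queries to be weighted differently.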

The following tables establish the SMART notation:{{Cite web|url=http://sauparna.sdf.org/Information_Retrieval/.ontfidf|title=On The Provenance of tf-idf|last=Palchowdhury|first=Sauparna|date=2016|website=sauparna.sdf.org|access-date=2019-07-29}}

{| class="wikitable"
|+ Symbols and notation
|-
| colspan="4" | $D_i = \{w_{i_1}, w_{i_2}, \ldots, w_{i_t}\}$ represents a document vector, where $w_{i_k}$ is the weight of the term $T_k$ in $D_i$ and $t$ is the number of unique terms in $D_i$. Positive weights characterize terms that are present in a document, and a weight of zero is used for terms that are absent from a document.
|-
| $f_{i_k}$ || Occurrence frequency of term $T_k$ in document $D_i$ || $u_i$ || Number of unique terms in document $D_i$
|-
| $N$ || Number of collection documents || $\operatorname{avg}(u)$ || Average number of unique terms in a document
|-
| $n_k$ || Number of documents with term $T_k$ present || $b_i$ || Number of characters in document $D_i$
|-
| $\max(f_{i_k})$ || Occurrence frequency of the most common term in document $D_i$ || $\operatorname{avg}(b)$ || Average number of characters in a document
|-
| $\operatorname{avg}(f_{i_k})$ || Average occurrence frequency of a term in document $D_i$ || $G$ || Global collection statistics
|-
| $s$ || colspan="3" | The slope in the context of pivoted document length normalization: Singhal, A., Buckley, C., & Mitra, M. (1996). [http://singhal.info/pivoted-dln.pdf Pivoted Document Length Normalization]. SIGIR Forum, 51, 176-184.
|}
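The symbols above can be made concrete with a toy collection. The following Python sketch computes them for three invented documents; the corpus, the whitespace tokenization, and the choice to count spaces as characters are all assumptions made for illustration, not part of SMART itself.

```python
from collections import Counter

# Hypothetical three-document collection, tokenized on whitespace.
docs = {
    "D1": "information retrieval system",
    "D2": "vector space model for retrieval",
    "D3": "relevance feedback feedback",
}

# f[i][k]: occurrence frequency of term k in document i
f = {name: Counter(text.split()) for name, text in docs.items()}

N = len(docs)                                                    # number of collection documents
n = Counter(term for counts in f.values() for term in counts)   # n_k: documents containing term k
u = {name: len(counts) for name, counts in f.items()}           # u_i: unique terms per document
avg_u = sum(u.values()) / N                                      # avg(u)
b = {name: len(text) for name, text in docs.items()}            # b_i: characters (spaces included)
avg_b = sum(b.values()) / N                                      # avg(b)
max_f = {name: max(counts.values()) for name, counts in f.items()}
avg_f = {name: sum(counts.values()) / len(counts) for name, counts in f.items()}

print(N, n["retrieval"], u, max_f["D3"], avg_f["D3"])
```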

{| class="wikitable"
|+ SMART term-weighting triple notation
|-
! colspan="4" | Term frequency $\text{tf}(f_{i_k})$
! colspan="4" | Document frequency $\text{df}(N, n_k)$
! colspan="4" | Document length normalization $g(G, D_i)$
|-
| colspan="2" | b || $1$ || Binary weight || x || n || $1$ || Disregards the collection frequency || x || n || $1$ || No document length normalization
|-
| t || n || $f_{i_k}$ || Raw term frequency || colspan="2" | f || $\log_2\left(\frac{N}{n_k}\right)$ || Inverse collection frequency || colspan="2" | c || $\sqrt{\sum_{k=1}^t w_{i_k}^2}$ || Cosine normalization
|-
| colspan="2" | a || $0.5 + 0.5\frac{f_{i_k}}{\max(f_{i_k})}$ || Augmented normalized term frequency || colspan="2" | t || $\log_2\left(\frac{N+1}{n_k}\right)$ || Inverse collection frequency || colspan="2" | u || $1-s+s\frac{u_i}{\operatorname{avg}(u)}$ || Pivoted unique normalization
|-
| colspan="2" | l || $1+\log_2 f_{i_k}$ || Logarithm || colspan="2" | p || $\log_2\left(\frac{N-n_k}{n_k}\right)$ || Probabilistic inverse collection frequency || colspan="2" | b || $1-s+s\frac{b_i}{\operatorname{avg}(b)}$ || Pivoted character length normalization
|-
| colspan="2" | L || $\frac{1+\log_2(f_{i_k})}{1 + \log_2(\operatorname{avg}(f_{i_k}))}$ || Average-term-frequency-based normalization || colspan="8" |
|-
| colspan="2" | d || $1+\log_2(1+\log_2(f_{i_k}))$ || Double logarithm || colspan="8" |
|}

The gray letters in the first, fifth, and ninth columns are the scheme used by Salton and Buckley in their 1988 paper: Salton, G., & Buckley, C. (1988). [https://ecommons.cornell.edu/server/api/core/bitstreams/fc18789c-6a03-48e6-8226-7dba0ce94e32/content Term-Weighting Approaches in Automatic Text Retrieval]. Inf. Process. Manage., 24, 513-523. The bold letters in the second, sixth, and tenth columns are the scheme used in experiments reported thereafter.
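The formulas in the table can be combined into a small weighting routine. The Python sketch below is illustrative only: it assumes the conventional reading in which a term's weight is the term-frequency component times the document-frequency component, divided by the normalization factor. The function names and the toy statistics are hypothetical, and the pivoted normalizations (u and b) are left out because they need a slope value that the table does not fix.

```python
import math
from collections import Counter


def tf_weight(code, f, max_f, avg_f):
    """Term-frequency component for one term with raw count f >= 1."""
    if code == "b":
        return 1.0
    if code == "n":
        return float(f)
    if code == "a":
        return 0.5 + 0.5 * f / max_f
    if code == "l":
        return 1.0 + math.log2(f)
    if code == "L":
        return (1.0 + math.log2(f)) / (1.0 + math.log2(avg_f))
    if code == "d":
        return 1.0 + math.log2(1.0 + math.log2(f))
    raise ValueError(f"unknown term-frequency code {code!r}")


def df_weight(code, N, n_k):
    """Document-frequency component for a term found in n_k of N documents."""
    if code == "n":
        return 1.0
    if code == "f":
        return math.log2(N / n_k)
    if code == "t":
        return math.log2((N + 1) / n_k)
    if code == "p":
        return math.log2((N - n_k) / n_k)  # fails when n_k == N (log of zero)
    raise ValueError(f"unknown document-frequency code {code!r}")


def weigh(counts, triple, N, n):
    """Weight every term of one text; all terms are assumed to appear in n."""
    tf_code, df_code, norm_code = triple
    max_f = max(counts.values())
    avg_f = sum(counts.values()) / len(counts)
    raw = {
        term: tf_weight(tf_code, f, max_f, avg_f) * df_weight(df_code, N, n[term])
        for term, f in counts.items()
    }
    if norm_code == "n":
        return raw
    if norm_code == "c":
        length = math.sqrt(sum(w * w for w in raw.values()))
        return raw if length == 0 else {t: w / length for t, w in raw.items()}
    raise ValueError(f"normalization {norm_code!r} is not covered by this sketch")


# Hypothetical statistics: 3 documents, 'retrieval' occurs in 2 of them.
N = 3
n = {"information": 1, "retrieval": 2, "system": 1}

doc_w = weigh(Counter("information retrieval system".split()), "ltc", N, n)
query_w = weigh(Counter("retrieval system".split()), "lnn", N, n)

# Inner product of the two weighted vectors, used here as the matching score.
score = sum(doc_w.get(term, 0.0) * w for term, w in query_w.items())
print(doc_w, query_w, score)
```

The example applies ltc to the document and lnn to the query, mirroring the ltc.lnn combination mentioned earlier. Because the letters b and n recur across the three components, the position within the triple, not the letter alone, selects the formula.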

References

External links

[ftp://ftp.cs.cornell.edu/pub/smart/ Software and test collections]{{dead link|date=October 2019}} (FTP at Cornell University)
[https://web.archive.org/web/20080324201744/http://tesla.tcnj.edu/SMART/index.php Interactive SMART tutorial]

Category:Discontinued software

Category:Search engine software