Multi-Strategy Search Engine Implementation

Project Overview

This project showcases a sophisticated search engine implemented with three distinct search mechanisms, illustrating expertise in data structures, algorithms, and Python programming. The search engine processes a corpus of text files, enabling users to search for documents containing specific terms efficiently and effectively.

Key Features

Three Search Mechanisms:
- Linear Search: A straightforward search through all files, demonstrating baseline functionality.
- Dictionary-based Indexed Search: Utilizes Python's built-in dictionary for fast lookup and significantly improved search speeds.
- Custom-implemented Hash Table: A bespoke hash table implementation, highlighting deep understanding of hash mechanics and collision handling.
Corpus Processing: Efficiently manages large datasets, tested with a diverse range of documents including Slate magazine articles (approximately 5000) and Berlitz travelogues.
Web Interface: Displays search results in a browser, allowing users to navigate seamlessly to the original documents.
Performance Comparison: Illustrates the significant speed enhancements of indexed searches over linear search, providing a clear demonstration of the benefits of advanced data structures.
HTML Generation: Creates well-formatted HTML output for search results, including highlighted search terms for better user experience.

Technologies Used

Python: Core programming language for implementation.
HTML: For generating user-friendly search results.
Web Browser Integration: Utilizes the webbrowser module to display results.
HTML Templating: Jinja2 templating engine for advanced HTML generation.

Implementation Details

1. Linear Search (`linear_search.py`)

Implements a basic linear search across all files in the corpus.
Time Complexity: O(kn) for k words per file and n files.

2. Dict-based Indexed Search (`index_search.py`)

Creates a fast lookup index using Python's built-in dictionary.
Index Creation Complexity: O(n) for n total words in all files.
Search Complexity: O(1) after indexing.

3. Custom Hash Table Search (`myhtable_search.py`)

Implements a custom hash table from scratch, including custom hash function and collision handling.
Demonstrates advanced understanding of hash table mechanics.
Index Creation Complexity: O(n) for n total words in all files.
Search Complexity: Approaching O(1) after indexing (depending upon the number of buckets used in creating the index).

4. Main Program (`search.py`)

Manages the user interface and search execution.
Generates HTML output and opens results in a web browser.

Custom Hash Table Implementation (Details)

This project includes a custom implementation of a hash table using open hashing (separate chaining) to handle collisions. The hash table is represented as a list of lists, where each bucket is a list of key-value pairs (tuples).

Features

Hash Table Creation:
- htable(nbuckets): Initializes a hash table with a specified number of buckets, returning a list of empty lists.
Hash Function:
- hashcode(o): Computes a hash code for integers and strings. For integers, it returns the integer value directly. For strings, it calculates the hash code using a custom formula.
Bucket Index Lookup:
- bucket_indexof(table, key): Finds the index of a key within a specific bucket.
Insert or Update Key-Value Pair:
- htable_put(table, key, value): Inserts or updates the key-value pair in the appropriate bucket. If the key already exists, its value is updated. If the key does not exist, a new key-value pair is appended to the bucket.
Retrieve Value by Key:
- htable_get(table, key): Retrieves the value associated with the given key from the hash table. Returns None if the key is not found.
Bucket String Representation:
- htable_buckets_str(table): Returns a string representation of the hash table's buckets. Each bucket is displayed with its index and contents.
Hash Table String Representation:
- htable_str(table): Returns a string representation of the entire hash table similar to Python's dictionary str method. The order is based on the bucket order and the insertion order within each bucket.

Challenges and Solutions

Efficient Indexing: Implemented a hash table to reduce search time from O(n) to O(1).
Custom Hash Table: Designed a robust hash function and effective collision handling mechanisms.
Large Dataset Handling: Optimized the code to process and search thousands of files efficiently.

Future Improvements

Implement more advanced text processing techniques (e.g., stemming, removing stop words).
Add support for phrase searches and boolean operators.
Enhance the user interface with a more sophisticated web-based front-end.
Implement multi-threading for faster indexing of large corpora.

Installation and Usage

Clone the repository:

git clone https://github.com/amitch2019/multi-strategy-search-engine.git

Navigate to the project directory:
```
cd multi-strategy-search-engine
```
Run the search engine:
```
python search.py
```

Requirements

Python 3.x
Required libraries: os, re, string, jinja2

Access to Code

The source code for this project is not publicly available at the moment.
However, it can be shared with interested persons upon request.
Please contact me directly to request access to the code.

Command Line Input

Sample Output Loaded to the Default Browser

Contact

For any questions or inquiries, please contact me at [[email protected]].

This project demonstrates a robust understanding of search algorithms, data structures, and Python programming. It efficiently handles large datasets, provides significant performance improvements, and includes a user-friendly web interface for displaying search results.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Multi-Strategy Search Engine Implementation

Project Overview

Key Features

Technologies Used

Implementation Details

1. Linear Search (`linear_search.py`)

2. Dict-based Indexed Search (`index_search.py`)

3. Custom Hash Table Search (`myhtable_search.py`)

4. Main Program (`search.py`)

Custom Hash Table Implementation (Details)

Features

Challenges and Solutions

Future Improvements

Installation and Usage

Requirements

Access to Code

Command Line Input

Sample Output Loaded to the Default Browser

About

Uh oh!

Releases

Packages

amitch2019/Search-Engine-Implementation

Folders and files

Latest commit

History

Repository files navigation

Multi-Strategy Search Engine Implementation

Project Overview

Key Features

Technologies Used

Implementation Details

1. Linear Search (linear_search.py)

2. Dict-based Indexed Search (index_search.py)

3. Custom Hash Table Search (myhtable_search.py)

4. Main Program (search.py)

Custom Hash Table Implementation (Details)

Features

Challenges and Solutions

Future Improvements

Installation and Usage

Requirements

Access to Code

Command Line Input

Sample Output Loaded to the Default Browser

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

1. Linear Search (`linear_search.py`)

2. Dict-based Indexed Search (`index_search.py`)

3. Custom Hash Table Search (`myhtable_search.py`)

4. Main Program (`search.py`)

Packages