Upload and search the contents of your files!
In the example below, three files were uploaded, each containing a specific word:
- Txt: car
- Docx: automobile
- Pdf: vehicle
The search for any of these words matches all 3 files, as they are close semantically. Searching for an unrelated term, like banana, returns nothing.
The goal of the app is to provide the best possible experience for uploading files and searching for content inside them.
To provide the best experience to the user, the aim is for simplicity and ease of use. On the frontend, established UX patterns like drag and drop, instant feedback, and clear messages are chosen. On the backend, the focus is on the performance of the file indexing process, tuning its parameters to achieve a good balance between speed and search quality.
- File upload using drag and drop.
- File upload using "upload" button.
- Enable selection (or dragging) of multiple files to upload.
- Support upload queuing: multiple files can be selected for upload. The frontend accepts multiple files simultaneously but uploads them sequentially. If an upload has not finished, the user can still add more files, which are enqueued and uploaded normally.
- Provide instant upload feedback: show live progress.
- Upload error handling: provide clear error messages and a "try again" option.
- Accept files with the same name: like Dropbox, files are saved using a unique ID that does not collide, so multiple files with the same name are supported.
- Supported file formats: txt, md, docx and pdf.
  - docx will require a library (e.g., mammoth.js)
  - pdfs can be complex to parse. Parsing pdfs to extract text directly is an option but cannot handle images or scanned pdfs containing text. The alternatives are OCR or LLMs. For the PoC, a basic text-parsing library (e.g., pdf-parse or pdf.js) will be used, and LLM parsing only if time allows.
- Perform file validation before uploading: check format and size.
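The validation and queuing requirements above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `validateFile`, `UploadQueue`, and `FileLike` are hypothetical, and the 10 MiB size cap is an assumed value since no limit is stated here.

```typescript
// Formats come from the requirements list; the size cap is an assumption.
const ALLOWED_EXTENSIONS = ["txt", "md", "docx", "pdf"];
const MAX_SIZE_BYTES = 10 * 1024 * 1024; // assumed 10 MiB limit

// Minimal shape shared with the browser's File object (name and size).
interface FileLike {
  name: string;
  size: number;
}

// Returns null when the file is acceptable, otherwise a user-facing message.
function validateFile(file: FileLike): string | null {
  const dot = file.name.lastIndexOf(".");
  const ext = dot >= 0 ? file.name.slice(dot + 1).toLowerCase() : "";
  if (!ALLOWED_EXTENSIONS.includes(ext)) {
    return `Unsupported format: ${ext || "(none)"}`;
  }
  if (file.size > MAX_SIZE_BYTES) {
    return `File is larger than ${MAX_SIZE_BYTES} bytes`;
  }
  return null;
}

type Uploader = (file: FileLike) => Promise<void>;

// Accepts many files at once but uploads them one at a time.
class UploadQueue {
  private pending: FileLike[] = [];
  private running = false;
  private upload: Uploader;
  readonly failed: { file: FileLike; reason: string }[] = [];

  constructor(upload: Uploader) {
    this.upload = upload;
  }

  get pendingCount(): number {
    return this.pending.length;
  }

  // Can be called while an upload is in flight; new files are appended.
  enqueue(files: FileLike[]): void {
    for (const file of files) {
      const reason = validateFile(file);
      if (reason === null) this.pending.push(file);
      else this.failed.push({ file, reason });
    }
    void this.drain();
  }

  // Processes the queue sequentially; a second call is a no-op while running.
  private async drain(): Promise<void> {
    if (this.running) return;
    this.running = true;
    try {
      let next: FileLike | undefined;
      while ((next = this.pending.shift()) !== undefined) {
        try {
          await this.upload(next);
        } catch (err) {
          // Failed uploads are recorded so the UI can offer "try again".
          this.failed.push({ file: next, reason: String(err) });
        }
      }
    } finally {
      this.running = false;
    }
  }
}
```

Invalid files never enter the queue, so the user gets an error immediately instead of after a wasted upload. The `failed` list is one possible hook for a "try again" option.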
- Persist uploaded file content
- Persist uploaded file metadata (file name, file ID, path)
By storing files and metadata locally or in memory, a working solution can be quickly built with minimal configuration and integration. Adding a layer of abstraction on these storage systems allows for easy replacement later with proper cloud storage.
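One way to picture this abstraction layer is an interface with an in-memory implementation behind it. The sketch below is illustrative only: `FileStorage`, `InMemoryFileStorage`, and the `memory://` path scheme are hypothetical names, and the methods are synchronous to keep the example short (a cloud-backed implementation would most likely need them to return promises).

```typescript
interface FileMetadata {
  id: string;
  name: string;
  path: string;
}

interface StoredFile {
  metadata: FileMetadata;
  content: Uint8Array;
}

// The rest of the app depends only on this interface, not on a concrete store.
interface FileStorage {
  save(name: string, content: Uint8Array): FileMetadata;
  get(id: string): StoredFile | undefined;
  list(): FileMetadata[];
  delete(id: string): boolean;
}

class InMemoryFileStorage implements FileStorage {
  private files = new Map<string, StoredFile>();
  private counter = 0;

  save(name: string, content: Uint8Array): FileMetadata {
    // A counter keeps the sketch dependency-free; a real app would likely use UUIDs.
    this.counter += 1;
    const id = `file-${this.counter}`;
    const metadata: FileMetadata = { id, name, path: `memory://${id}` };
    this.files.set(id, { metadata, content });
    return metadata;
  }

  get(id: string): StoredFile | undefined {
    return this.files.get(id);
  }

  list(): FileMetadata[] {
    return Array.from(this.files.values(), (file) => file.metadata);
  }

  delete(id: string): boolean {
    return this.files.delete(id);
  }
}
```

Because files are keyed by generated ID rather than by name, two uploads with the same name coexist without colliding.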
Search is the main feature of the App. I believe the best possible experience includes features like suggestions, fuzzy search, match highlighting (showing where in the file the term appears), semantic search, pagination (or infinite scroll), and filtering (by file name, type, or date). However, each of those features comes with trade-offs, such as increased system complexity, storage needs (e.g., index size), and processing power, in addition to the time required to actually implement them.
Semantic search using vectors was chosen as it is a good opportunity to demonstrate the use of AI. A popular approach to semantic search involves using a pre-trained model to embed text into a vector space, where vectors represent the meaning of the text based on their position. By building an index of the extracted embeddings, vector searches can be performed using algorithms for nearest neighbor search.
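To make the idea concrete, here is a minimal brute-force nearest neighbor search over embeddings. It is a sketch under stated assumptions, not how the app performs search: it uses cosine similarity with a linear scan instead of a real vector index, the 2-D vectors are invented toy values rather than output from an embedding model, and `search`, `toyIndex`, and the `0.5` threshold are hypothetical.

```typescript
type Vector = number[];

interface IndexedDoc {
  id: string;
  embedding: Vector;
}

// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: Vector, b: Vector): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every document, drops those below minScore, returns the top k.
function search(index: IndexedDoc[], query: Vector, k: number, minScore: number) {
  return index
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(doc.embedding, query) }))
    .filter((result) => result.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy embeddings: the three documents point in roughly the same direction.
const toyIndex: IndexedDoc[] = [
  { id: "car.txt", embedding: [0.9, 0.1] },
  { id: "automobile.docx", embedding: [0.8, 0.2] },
  { id: "vehicle.pdf", embedding: [0.7, 0.3] },
];

const carLike = search(toyIndex, [1, 0], 3, 0.5); // matches all three files
const bananaLike = search(toyIndex, [0, 1], 3, 0.5); // matches nothing
console.log(carLike.map((r) => r.id), bananaLike.length);
```

The `minScore` cutoff is what lets an unrelated query return no results instead of the least-bad matches. A linear scan is fine for a handful of files; dedicated vector indexes exist to make the same lookup fast at larger scale.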
The app must also enable basic management of uploaded files.
- List files
- Filter files by name or format
- Download files
- Delete files
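Filtering by name or format could look like the sketch below. The function `filterFiles` and the `FileEntry` shape are hypothetical, and it assumes the format is derived from the file extension rather than from inspecting the content.

```typescript
interface FileEntry {
  id: string;
  name: string;
}

// Lower-cased extension without the dot, or "" when the name has none.
function extensionOf(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot >= 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// Empty filters match everything; both filters are case-insensitive.
function filterFiles(files: FileEntry[], nameQuery = "", format = ""): FileEntry[] {
  const query = nameQuery.trim().toLowerCase();
  const wantedFormat = format.trim().toLowerCase();
  return files.filter(
    (file) =>
      (query === "" || file.name.toLowerCase().includes(query)) &&
      (wantedFormat === "" || extensionOf(file.name) === wantedFormat),
  );
}
```

For example, `filterFiles(files, "report", "pdf")` keeps only PDF files whose name contains "report".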
For non-functional requirements, the focus is on aspects considered important for providing the best user experience. The project is also built to support future growth in functionality and complexity.
- Search feels fast: search results returned in < 1s.
- The interface is minimal and without clutter to reduce cognitive overhead
- File storage must support future swapping to another provider (e.g., from local to S3)
These items will not be included in this PoC due to time constraints and prioritization, but they also serve as suggestions for how the app can be improved.
- Authentication and user isolation
- Upload from 3rd parties: Google Drive, Dropbox, etc
- Production-grade persistent storage (e.g., cloud databases)
- Responsive layout for mobile users
- OCR/LLM parsing of files.
- There are many file types that could be supported. To name a few: rtf, ebooks (epub, mobi, etc), presentation (ppt, pptx), spreadsheets (xls, xlsx), OpenDocument formats (odt, odp, ods)
- Support for archives (zip, tar, rar, 7z): unpack and parse all files archived.
- Enable support for encodings other than UTF-8
For the PoC, the implementation uses Next.js with React, combined with tRPC, enabling live TypeScript type checking on both frontend and backend. Files and their metadata are stored in memory. The vector search is built using LangChain.
Search is handled by FAISS with Google Generative AI embeddings for semantic search. Parsing of Docx files is handled by mammoth.js and text-based pdf content extraction by pdf-parse.
To get the best out of AI coding assistants, it is good practice to "guide" them to solve one small task at a time. Each task should be small enough to be implemented by creating or modifying only a couple of files, with enough guards and restrictions to keep the model from hallucinating. However, the AI also needs the context of the whole application being built. To give it enough context to start developing, I described the initial planned version in a file that was fed to the model as context. Its content was essentially a more detailed version of what is described in this Readme. Later, as the project evolved and the code itself provided enough context for the AI, this file was replaced by this Readme.
This project was built with Cursor, and four Cursor rules files were created to guide the agents. Some rules apply to the whole project, while others are specific to certain folders. This separation also helps keep the model context smaller.
- backend-rules.mdc: includes rules for TypeScript, Next.js, and tRPC
- frontend-rules.mdc: rules for HTML, CSS, Tailwind, and React
- testing-rules.mdc: applies to any file inside the playwright folder or files ending in .test.ts
- rules.mdc: applies to the whole project and includes software development rules
The development tasks were created on GitHub Projects, which automatically creates the corresponding issues in the GitHub repo. The Projects page has a Kanban-style board view with columns, where each task's progress is tracked. See SearchBox Project here.
The first version of this PoC has been built to enable upload and search of documents. All files are stored in memory. The FAISS index is stored in memory but a copy is kept in a file to allow quicker rebuilding. Storage can be easily replaced later with proper storage solutions.
The search supports semantic search using FAISS vector store with Google Generative AI embeddings. This allows for finding documents based on meaning and context, not just exact text matches.
See DOCKER.md file for instructions.
Make sure the .env file has the GOOGLE_API_KEY set. Get a free one here.
# Install pnpm
npm install -g pnpm
# Install dependencies
pnpm install
# Run dev
pnpm dev
# unit
pnpm test-unit
# integration
pnpm test-e2e
For SearchBox to launch as an MVP, it needs a few improvements. Currently, each instance of the app serves a single user, and there is no authentication or persistent storage. The app needs to control user access and properly isolate user data.
Also, some features were not fully implemented, like deletion of files or filtering by file format.
The CI pipeline is not executing the integration tests correctly. This needs further investigation.
Finally, the semantic search can be further optimized by tuning parameters such as the threshold used when searching.
