Index Word/PDF Documents From File System To SQL Server

Devze (https://www.devze.com), 2023-03-27 02:05, source: the web

I'm trying to come up with a simple solution to a problem I have because all of those I have found so far just seem too complicated!

The situation is that we use a proprietary application for managing most aspects of our business. It has an SQL Server 2005 backend database, which is quite large. The application also allows the attaching of Word and PDF documents to records, which we use extensively, and these are stored in the file system on the server, with the filenames referenced in the database. Unfortunately the search facilities in the application are poor, so I'm trying to build my own version.

So far I've got a neat ASP.NET page with a search box which allows users to enter words to search for, as well as filter their results on other fields, such as department, date, etc. The stored procedure I've written in the database looks for the words they're searching for in several different fields in the database. What I'm really aiming for is a Google-style 'one search to rule them all' effect, where the user doesn't have to specify where they expect to find the word they're looking for; they just get hits anywhere that it appears in the database. And this is working.

What I want to add now is the ability for the search to include the text of the documents which are 'attached' to records. They are all either .doc or .pdf files but if I couldn't search the .pdf files it wouldn't be the end of the world.

In my ideal world what I'd do is find some software which would index the folder containing the documents (currently there are around 100,000 of them, averaging about 100k) and populate a table in my existing database with this index so that I could then just include that table in my search. I'd love it to just contain a record for each unique word it indexed and a join table referencing documents in the file system containing that word.
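As an illustration only, the word table and join table described above might look something like the following. This is a minimal sketch, not a finished indexer: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, it assumes the plain text has already been extracted from each .doc or .pdf file by some other tool, and every table, column, and function name in it is hypothetical.

```python
import re
import sqlite3


def build_index(conn, documents):
    """Fill a word table and a word-to-document join table.

    `documents` maps a file path to its already-extracted plain text.
    """
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS words (
            word_id INTEGER PRIMARY KEY,
            word    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS word_documents (
            word_id  INTEGER NOT NULL REFERENCES words(word_id),
            doc_path TEXT NOT NULL,
            PRIMARY KEY (word_id, doc_path)
        );
    """)
    for path, text in documents.items():
        # One row per unique word, one join row per (word, document) pair.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            conn.execute(
                "INSERT OR IGNORE INTO words (word) VALUES (?)", (word,)
            )
            conn.execute(
                "INSERT OR IGNORE INTO word_documents (word_id, doc_path) "
                "SELECT word_id, ? FROM words WHERE word = ?",
                (path, word),
            )
    conn.commit()


def find_documents(conn, word):
    """Return the paths of all documents containing `word`, sorted."""
    rows = conn.execute(
        "SELECT wd.doc_path FROM words w "
        "JOIN word_documents wd ON wd.word_id = w.word_id "
        "WHERE w.word = ? ORDER BY wd.doc_path",
        (word.lower(),),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {"a.doc": "Invoice for ACME", "b.pdf": "acme contract"})
    print(find_documents(conn, "acme"))
```

With tables shaped like this, the existing search could join against the word table in the same way it already queries the other fields.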

Given that this seems fanciful and, as far as I can see, there isn't any software that will do this or anything close to it, what solution would you recommend? The server already has dtSearch running on it, indexing the very files I'm interested in. However, whilst I could wade through the documentation trying to figure out how to implement a search of this index through my own webpage (which I've started to do, and found heavy going), that would have to be a separate search from the one against the SQL database. I couldn't return results from the file index and the database in a unified way.

So, starting from the ultimate wish of having the indexed words stored in the database, with a view to implementing full-text searching on that, what would anyone suggest?

SQL Server has full-text search (http://msdn.microsoft.com/en-us/library/ms142571.aspx); this supports both PDF and Word files (though with some wrinkles - installation can be a bit tricky). The link is to the SQL Server 2008 documentation, but the feature has been present since SQL Server 2000.

So, super simplistically - your solution would require you to load the documents into SQL Server, and amend your stored proc to query them using the built-in free text search features.
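A rough sketch of what that could involve is below. The T-SQL is held in Python strings and is not executed here; the table, column, constraint, and catalog names are all made up for illustration, and the `?` placeholders assume an ODBC-style driver such as pyodbc. The only part that runs as written is the helper that reads the attachments from a folder into rows ready for insertion. The type column tells the full-text engine which filter to use for each row, and indexing PDF content generally depends on a PDF filter being installed on the server.

```python
from pathlib import Path

# Hypothetical schema: one row per attached file, with the raw bytes in
# Content and the extension in FileExtension. The statements may need to
# be run separately rather than as a single batch.
CREATE_SQL = """
CREATE TABLE dbo.AttachedDocuments (
    DocumentId    INT IDENTITY(1,1) NOT NULL
                  CONSTRAINT PK_AttachedDocuments PRIMARY KEY,
    FileName      NVARCHAR(260)  NOT NULL,
    FileExtension NVARCHAR(10)   NOT NULL,  -- e.g. '.doc' or '.pdf'
    Content       VARBINARY(MAX) NOT NULL
);
CREATE FULLTEXT CATALOG DocumentCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.AttachedDocuments
    (Content TYPE COLUMN FileExtension)
    KEY INDEX PK_AttachedDocuments;
"""

# Parameterised insert, one execution per row from document_rows().
INSERT_SQL = (
    "INSERT INTO dbo.AttachedDocuments (FileName, FileExtension, Content) "
    "VALUES (?, ?, ?)"
)

# The kind of predicate the existing stored procedure could add.
SEARCH_SQL = (
    "SELECT DocumentId, FileName FROM dbo.AttachedDocuments "
    "WHERE CONTAINS(Content, ?)"
)


def document_rows(folder, extensions=(".doc", ".pdf")):
    """Yield (file name, lower-cased extension, raw bytes) per attachment."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path.name, path.suffix.lower(), path.read_bytes()
```

Each tuple from `document_rows` lines up with the three parameters of `INSERT_SQL`, so the load step would be a loop that executes that statement once per file.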

Keeping the file system and database versions of the document synchronized could be a challenge, but other than that, I think the solution should be fairly straightforward.
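One possible way to approach the synchronization, sketched under assumptions: record a content hash for each file when it is loaded, then periodically compare the folder against those stored hashes to decide what to reload or delete. The function names and the idea of storing a digest per file name are hypothetical, not something the application or SQL Server provides.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def plan_sync(folder, stored_digests):
    """Compare the folder with what the database already holds.

    `stored_digests` maps file name -> digest recorded at the last load.
    Returns (to_upsert, to_delete) as sorted lists of file names.
    """
    current = {
        p.name: file_digest(p) for p in Path(folder).iterdir() if p.is_file()
    }
    # New or modified files need to be (re)loaded into the database.
    to_upsert = sorted(
        name for name, digest in current.items()
        if stored_digests.get(name) != digest
    )
    # Files that have vanished from disk should be removed from the table.
    to_delete = sorted(name for name in stored_digests if name not in current)
    return to_upsert, to_delete
```

Hashing every file on each run means reading all of them, so with around 100,000 documents it may be worth checking file size and modification time first and hashing only the files where those have changed.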