How to access tag information on office files via C#

Developer https://www.devze.com 2023-04-12 16:57 Source: Internet
I would like to write a simple bit of code that would extract only the tag information from a set of office (docx, pptx, etc.) files that exist in a directory so that it could be indexed and searched easily.

When I say "tag", I mean the tag info that you have been able to add to a file since Vista, typically through Explorer. For example, the pptx file in the screenshot below has the tag "bubble" attached.

[Screenshot: a pptx file with the tag "bubble" attached]

But searching those tags is already built into Windows, you say? Why, yes, but I need this to only index the tags and I need to expose the info through an intranet rather than inside of Windows.

I have found that inside the Office file package, the actual information is stored in the /docProps/core.xml file, in the cp:keywords element. I realize that, in code, I could unzip the package, open that file, and extract what I need. I'm hoping there's a pre-abstracted solution out there somewhere, however. I seriously doubt that unzipping is what Windows does to index that same information (though admittedly, I can't find any good info on it).
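For reference, the unzip-and-read fallback described above could look roughly like the sketch below. It relies only on System.IO.Compression and System.Xml.Linq; the class and method names (OfficeTags, ReadKeywords, ReadTags) are made up for illustration, and the assumption that multiple tags sit in cp:keywords as one semicolon-separated string should be checked against real files.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class OfficeTags
{
    // Namespace bound to the "cp" prefix in docProps/core.xml.
    static readonly XNamespace Cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    // Parses the tags out of an already-open core.xml stream.
    public static string[] ReadKeywords(Stream coreXml)
    {
        var keywords = XDocument.Load(coreXml).Root?.Element(Cp + "keywords")?.Value;
        if (string.IsNullOrWhiteSpace(keywords)) return new string[0];

        // Assumption: multiple tags are stored as one ';'-separated string.
        return keywords.Split(';')
                       .Select(t => t.Trim())
                       .Where(t => t.Length > 0)
                       .ToArray();
    }

    // Opens an Office package (docx, pptx, ...) as a zip archive and reads its tags.
    // Returns an empty array when the package has no core.xml part.
    public static string[] ReadTags(string path)
    {
        using (var zip = ZipFile.OpenRead(path))
        {
            var entry = zip.GetEntry("docProps/core.xml");
            if (entry == null) return new string[0];

            using (var stream = entry.Open())
                return ReadKeywords(stream);
        }
    }
}
```

To index a directory, ReadTags could be called for each path returned by Directory.EnumerateFiles. On .NET Framework, ZipFile needs a reference to System.IO.Compression.FileSystem.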

I have also found some discussions about IFilters. However, an IFilter accesses the text of the file, so I don't see how it helps solve this particular problem.

Can anyone point me in the right direction on this one?


I don't have Word installed, but I'd guess that the tags are accessible from the standard property system as Keywords entries, as are the tags on a JPG picture.

If you want to know exactly how it's done, I played with the shell COM API and put a full code sample in a Gist: FileTags.cs. That was just for fun, though; you should use the Microsoft Windows API Code Pack, as its implementation is a lot cleaner.

To get the tags (called keywords internally), reference Microsoft.WindowsAPICodePack.Shell.dll, then:

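using System;
using Microsoft.WindowsAPICodePack.Shell;

class Program
{
    static void Main()
    {
        // ShellFile wraps a COM shell item, so dispose it when done.
        using (var shellFile = ShellFile.FromFilePath(@"C:\path\to\some\file.jpg"))
        {
            // System.Keywords holds the tags; the value is null when the file has none.
            var tags = (string[])shellFile.Properties.System.Keywords.ValueAsObject;
            tags = tags ?? new string[0];
            Console.WriteLine("Tags: {0}", String.Join("; ", tags));
        }
        Console.ReadLine(); // keep the console window open
    }
}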

If they didn't mess it up, it should work starting from Windows XP SP2. (Mine should work from SP1, as I avoided helpers such as PropVariantGetStringElem, but it's really annoying without them.)
