Search and destroy / capturing illegal data...
The Environment:
I manage a few very "open" databases. Access is usually full select/insert/update/delete, typically through linked tables (to SQL Server) in custom-built MS Access databases.
The Rules
No social security numbers, etc. (e.g., think FERPA/HIPAA).
The Problem
Users enter / hide the illegal data in creative ways (e.g., an SSN in the middle name field); administrative/disciplinary control is weak/ineffective. The general attitude (even from most of the bosses) is that security is a hassle, and if you find a way around it then good for you. I need a (better) way to find the data after it has been entered.
What I've Tried
Initially, I made modifications to the various custom-built user interfaces folks had (that I was aware of), all the way down to the table structures that they were linking to on our database server. SSNs, for example, no longer had a field of their own. And yet... I continue to find them buried in other data fields.
After a secret audit by some folks at my institution turned up this buried data, I wrote some SQL that (literally) checks every character in every field in every table of the database, looking for anything that matches an SSN pattern. It takes a long time to run, and the users are finding ways around my pattern definitions.
My Question
Of course, a real solution would entail policy enforcement. That has to be addressed (way) above my head, however; it is beyond the scope and authority of my position.
Are you aware of, or do you use, any (free or commercial) tools targeted at auditing for FERPA & HIPAA data (or, if not those policies specifically, then data patterns in general)?
I'd like to find something that I can run on a schedule and that stays updated with new pattern definitions.
I would monitor the users, in two ways.
- The same users are likely to be entering the same data, so track who is getting around the roadblocks and identify them. Make sure their violations are documented so that they can be disciplined appropriately. Their efforts create risk (monetary and legal, which becomes monetary) for the entire organization.
- Look at the queries that users issue. If they are successful in searching for the information, then it is somehow stored in the repository.
If you are unable to track users, begin instituting passwords.
In the long run, though, your organization needs to upgrade its users.
In the end you are fighting an impossible battle unless you have support from management. If it's illegal to store an SSN in your DB, then this rule must have explicit support from the top. @Iterator is right, record who is entering this data and document their actions: implement an audit trail.
Search across the audit trail, not the database itself. This should be quicker, since you only have one day (or one hour, or ...) of data to search. Record and publish each violation.
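To make that concrete, here is a minimal sketch of scanning only recent audit entries for SSN-like values. It is Python rather than the SQL you would run on the server, and the audit row layout (`user`, `table`, `column`, `new_value`, `changed_at`) and the pattern itself are assumptions for illustration, not anything your audit trail is known to contain.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern: nine digits in a 3-2-4 grouping, with an optional
# space, dot, or hyphen between groups, not embedded in a longer number.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}[-. ]?\d{2}[-. ]?\d{4}(?!\d)")

def find_violations(audit_rows, since):
    """Return (user, table, column, value) for recent rows holding an SSN-like value."""
    hits = []
    for row in audit_rows:
        if row["changed_at"] < since:
            continue  # outside the window we were asked to check
        if SSN_LIKE.search(row["new_value"] or ""):
            hits.append((row["user"], row["table"], row["column"], row["new_value"]))
    return hits

# Hypothetical audit data, standing in for one day of logged changes.
now = datetime(2024, 1, 2, 12, 0)
audit_rows = [
    {"user": "alice", "table": "student", "column": "middle_name",
     "new_value": "123-45-6789", "changed_at": now - timedelta(hours=1)},
    {"user": "bob", "table": "student", "column": "middle_name",
     "new_value": "Marie", "changed_at": now - timedelta(hours=2)},
    {"user": "carol", "table": "student", "column": "notes",
     "new_value": "123456789", "changed_at": now - timedelta(days=3)},
]

for hit in find_violations(audit_rows, since=now - timedelta(days=1)):
    print(hit)
```

Because each hit carries the user name, the same pass gives you the "who" for the documentation step as well as the "where".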
You could tighten up some validation. No numeric field, I guess, needs to be as long as an SSN. No name field needs numbers in it. No address field needs more than 5 or 6 digits in it (how many houses are there on Route 66?). Hmmm, could a phone number be used to represent an SSN? The trouble is you can't stop someone entering acaaabdf etc. (encoding 131126 etc.); there's always a way to defeat your checks.
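Those rules could be sketched roughly as follows. This is Python for illustration only; the field kinds (`name`, `address`, `numeric`) and the exact thresholds are assumptions taken from the guesses above, and in practice the checks would live in the Access forms or in constraints on the server.

```python
def digit_count(value):
    """Count the digit characters in a string."""
    return sum(ch.isdigit() for ch in value)

def validate(field_kind, value, max_address_digits=6, ssn_length=9):
    """Return a reason string if the value breaks a rule, else None."""
    digits = digit_count(value)
    if field_kind == "name" and digits > 0:
        return "name contains digits"
    if field_kind == "address" and digits > max_address_digits:
        return "address has too many digits"
    if field_kind == "numeric" and digits >= ssn_length:
        return "numeric value is as long as an SSN"
    return None

# Hypothetical inputs: the first two are rejected, the last two pass.
print(validate("name", "Ann 123456789"))
print(validate("numeric", "123456789"))
print(validate("address", "1200 Route 66"))
print(validate("name", "acaaabdf"))
```

The last call shows the limit of this approach: a letter-encoded value passes every rule here.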
You'll never achieve perfection, but you can at least catch the accidental offender.
One other suggestion: you can post a new question asking about machine learning plugins (essentially statistical pattern recognition) for your database of choice (MS Access). By flagging some of the database updates as good/bad, you may be able to leverage an automated tool to find the bad stuff and bring it to your attention.
This is akin to spam filters that find the bad stuff and remove it from your attention. However, to get good answers on this, you may need to provide a bit more detail in the question, such as the number of samples you have (if it's not many, then an ML plugin would not be useful), your programming skills (for what's known as feature extraction), and so on.
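As a rough idea of what feature extraction means here, the sketch below turns a field value into a few numbers that a classifier could be trained on. The feature names are hypothetical choices for illustration, and no particular ML tool or plugin is implied.

```python
import re

def extract_features(value):
    """Summarize a field value as simple numeric features."""
    digit_runs = re.findall(r"\d+", value)
    return {
        "length": len(value),
        "digit_count": sum(len(run) for run in digit_runs),
        "longest_digit_run": max((len(run) for run in digit_runs), default=0),
        "distinct_letters": len({ch for ch in value.lower() if ch.isalpha()}),
    }

# Hypothetical samples you would label good/bad by hand before training.
for sample in ["Marie", "123-45-6789", "acaaabdf"]:
    print(sample, extract_features(sample))
```

Each flagged update would contribute one such feature row plus its good/bad label.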
Despite this suggestion, I believe it's better to target the user behavior than to build a smarter mousetrap.