Recommend a strategy? (basically, for large-scale refactoring) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.

Closed 4 years ago.

Might anyone be able to recommend a strategy for the following?

There's a large organization which refers to customers by an ID, or reference, like ‘1234A’ (four digits and a letter). These IDs are used pretty much everywhere: database queries and primary keys, about 40 Java applications, many external interfaces, Visual Basic, spreadsheets, you name it. They will need to change this from ‘1234A’ to also allow the format ‘1234AB’ (four digits and two letters). So, a simple change with a really big impact.

I’m starting to think about what might be a good approach. Might anyone know any recommended strategies, or patterns etc?

I've noticed the related posting: Strategy for large scale refactoring

Thanks!

The common recommendation is "find/change all the code by hand". When the code base gets big, this gets to be a problem, as you have observed.

I'll note your problem is much like the Y2K problem, which I characterize as a special case of "field expansion" (which has happened with phone numbers, license plates, bar codes, and transaction IDs in large-scale stock trading systems, and will happen with social security numbers).

What is ideally needed is a tool that can identify all the instances of the problem data and, for each instance, determine what code changes are needed there. For the Y2K problem, one had to find the date fields with 2-digit years and, for each occurrence of such data in the code, patch that code (e.g., expand data declarations to include 2 more digits, remove "19"+ string concatenations that generate 4-digit dates from 2-digit dates, etc.).

Finding the data itself can be hard. How do you know something is a date (or in your case, an extended ID)? Fundamentally you need to identify the sources or sinks of such data (e.g., date fields on screens, calls to get_current_year, comparisons to other things already known to be dates, etc.) and trace where that data flows (to arguments in calls, to assignments of copies, to prints, ...). [The Y2K guys also used X mod 100 == 0 as a hint that something was a year, because this computation is likely part of a leap-year check and therefore the data involved must be a year.]
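
As a very rough illustration of "identifying sources", here is a sketch in Java that flags source lines which look like they touch a customer ID. Everything in it is an assumption made for illustration: the class name IdUsageHints, the three heuristics, and the naming convention it looks for are hypothetical. It only produces seeds for a search; it does no data flow tracing, which is the part that really needs a tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: cheap textual hints that a source line handles a customer ID. */
final class IdUsageHints {
    // Assumption: a quoted literal of four digits plus one letter is probably a sample ID.
    private static final Pattern ID_LITERAL = Pattern.compile("\"\\d{4}[A-Z]\"");
    // Assumption: a comparison of length() against 5 may be an old-format length check.
    private static final Pattern LENGTH_CHECK =
            Pattern.compile("length\\(\\)\\s*[=!<>]=?\\s*5\\b");
    // Assumption: ID variables and columns are named like custId, customer_id, cust_ref.
    private static final Pattern ID_NAME =
            Pattern.compile("(?i)\\bcust(omer)?_?(id|ref)\\b");

    /** Returns the names of all heuristics that fire on the given line. */
    static List<String> hintsFor(String line) {
        List<String> hints = new ArrayList<>();
        if (ID_LITERAL.matcher(line).find()) {
            hints.add("id-literal");
        }
        if (LENGTH_CHECK.matcher(line).find()) {
            hints.add("length-check");
        }
        if (ID_NAME.matcher(line).find()) {
            hints.add("id-name");
        }
        return hints;
    }
}
```

Each flagged line is only a candidate source or sink; where the value goes from there still has to be followed by hand or by flow analysis.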

Then for each use of the data, you need to decide what to do about that use: leave it alone (date copies aren't wrong if they still work when extended) or fix it (e.g., remove the addition of century prefixes). For your extended IDs, what matters is what kinds of things can be done to them. Can they be torn apart into the digit section and the letter section? Does the first letter signify something by itself? Based on the answers to these questions, it is generally obvious what to do at each point of use in the code.
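
To make the "can it be torn apart?" question concrete, here is a minimal Java sketch of a value type that accepts both formats and exposes the two sections. The class CustomerId and its methods are hypothetical, and it assumes the letters are uppercase A-Z; the point is only that code which splits IDs by fixed offsets (say, "the letter is the character at index 4") is exactly the code that needs fixing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type: the one place that knows how an ID is structured. */
final class CustomerId {
    // Assumed formats: four digits followed by one (old) or two (new) uppercase letters.
    private static final Pattern FORMAT = Pattern.compile("(\\d{4})([A-Z]{1,2})");

    private final String digits;
    private final String letters;

    private CustomerId(String digits, String letters) {
        this.digits = digits;
        this.letters = letters;
    }

    /** Parses either format; rejects anything else. */
    static CustomerId parse(String raw) {
        Matcher m = FORMAT.matcher(raw);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a customer ID: " + raw);
        }
        return new CustomerId(m.group(1), m.group(2));
    }

    String digits() {
        return digits;
    }

    String letters() {
        return letters;
    }

    /** True for the old single-letter form. */
    boolean isLegacyFormat() {
        return letters.length() == 1;
    }

    @Override
    public String toString() {
        return digits + letters;
    }
}
```

Callers that ask for digits() and letters() keep working when the second letter appears; callers that did their own substring arithmetic are the ones to hunt down.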

Now in fact, you can do all of the above by hand, and that's at least more organized than "give it to the programmers and let them do as they will".

But in fact, as with the Y2K adventure, you can get tools (much better than the Y2K tools) to automate most of this. Such tools have to be capable of processing the programming languages of interest (you didn't say what you had) with compiler-level semantic analysis (e.g., they know the language's data types), must be able to match sources/sinks of the data type, must be able to follow data flows ("flow analysis" in the language of the compiler community), and must be able to mechanically apply usage-specific transformations.

The tools that can do this are called program transformation systems. Most of these tools can apply source-to-source transformations like the following:

domain Java.

pattern date_source_1():expression
  " calendar.get_year() ";

rule remove_century_prefix(s: sum): expression -> expression
   "  \"19\"+\s "
   rewrites to
   "   \s  " ;

[This example format is for our DMS Software Reengineering Toolkit. We are assuming that 2-digit dates are represented as strings and we want to find and fix these. A rule has a name (so human beings can name the specific rule of interest, just as functions in C have names) and a source and replacement pattern separated by rewrites to. The " marks surrounding the source and target patterns are meta-quotes, and indicate that the text inside them is from the programming language named in the domain declaration. The \" inside the meta-quotes is backslashed to allow domain/language-specific quotes inside the pattern. The \s represents a subexpression which is part of the concatenation expression. The pattern definition allows one to match possible sources of dates.]

So the rules describe how to handle each of the cases encountered, but they have to be qualified by use of entities of the appropriate datatype; you don't want the above rule to run on every string concatenation. Most of the existing program transformation tools don't provide much help here.

DMS does provide the ability, at least for C, Java and COBOL, to do quite serious data flow tracing. So you'd have to revise the rule:

rule remove_century_prefix(s: sum): expression -> expression
   "  \"19\"+\s "
   rewrites to
   "   \s  " if is_date(s);

where is_date uses DMS's built-in flow analysis machinery to detect a data flow into s from a recognized source of dates, such as the date_source_1 pattern shown above.

Using such program transformation machinery, you can automate a large part of such field expansion tasks.

You could write a Java program that edits the field. Note that it will need separate read and write logic for each file type.

Also, for text files you can search in Windows for files that contain the field, open all of those files in Notepad++, and then use "Find and Replace" across all the open files.
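
If you would rather script the search than do it through Explorer, a sketch along these lines lists every line of every text file under a directory that contains an old-format ID. The class name OldIdFinder is hypothetical, and it assumes the files are UTF-8 text and that IDs use uppercase letters; files that fail to decode are skipped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: reports where old-format IDs occur in text files. */
final class OldIdFinder {
    // Four digits and exactly one uppercase letter, delimited by word boundaries,
    // so "1234AB" and "12345A" do not match.
    private static final Pattern OLD_ID = Pattern.compile("\\b\\d{4}[A-Z]\\b");

    /** Returns the 1-based numbers of the lines that contain an old-format ID. */
    static List<Integer> matchingLineNumbers(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (OLD_ID.matcher(lines.get(i)).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    /** Walks a directory tree and reports "relative/path:line" for every hit. */
    static List<String> scan(Path root) {
        List<String> report = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
            for (Path file : files) {
                List<String> lines;
                try {
                    lines = Files.readAllLines(file, StandardCharsets.UTF_8);
                } catch (CharacterCodingException notUtf8Text) {
                    continue; // binary or differently encoded; needs its own reader
                }
                for (int lineNumber : matchingLineNumbers(lines)) {
                    report.add(root.relativize(file) + ":" + lineNumber);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }
}
```

This only finds literal IDs in data and text; it says nothing about code that assumes the old length, so treat the report as a worklist rather than a complete answer.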

For Excel files and any other format that stores the data in a non-text (binary) form, it is better to edit them with a Java program using a library such as Apache POI.

There is no pattern for this, since the IDs are validated.

You need to add support in one application at a time, which means that each application should be able to handle both kinds of IDs without breaking down. Also add some sort of configuration flag which can be set to start generating IDs in the new format (but don't enable it yet).

Do this for each application and test it.

When all applications have been tested, simply change their configurations so that they start generating the new IDs.
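
A minimal Java sketch of that idea, under stated assumptions: the class CustomerIdPolicy, the flag name, and in particular the numbering scheme in generate() are made up for illustration, since the question doesn't say how IDs are actually allocated. What matters is that validation accepts both formats from day one, while generation switches only when the flag is flipped.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical policy object: accept both formats, generate per configuration. */
final class CustomerIdPolicy {
    private static final Pattern OLD_FORMAT = Pattern.compile("\\d{4}[A-Z]");
    private static final Pattern NEW_FORMAT = Pattern.compile("\\d{4}[A-Z]{2}");

    // Would come from the application's configuration; off until every app is ready.
    private final boolean generateNewFormat;

    CustomerIdPolicy(boolean generateNewFormat) {
        this.generateNewFormat = generateNewFormat;
    }

    /** Accepts both formats regardless of the flag. */
    boolean isValid(String id) {
        return id != null
                && (OLD_FORMAT.matcher(id).matches() || NEW_FORMAT.matcher(id).matches());
    }

    /**
     * Illustrative allocation only: the low four decimal digits become the number,
     * the remainder selects the letter(s).
     */
    String generate(int sequence) {
        int number = sequence % 10_000;
        int letterIndex = sequence / 10_000;
        int limit = generateNewFormat ? 26 * 26 : 26;
        if (sequence < 0 || letterIndex >= limit) {
            throw new IllegalArgumentException("Sequence out of range: " + sequence);
        }
        if (generateNewFormat) {
            char first = (char) ('A' + letterIndex / 26);
            char second = (char) ('A' + letterIndex % 26);
            return String.format(Locale.ROOT, "%04d%c%c", number, first, second);
        }
        return String.format(Locale.ROOT, "%04d%c", number, (char) ('A' + letterIndex));
    }
}
```

With the flag off, new CustomerIdPolicy(false).generate(12345) gives "2345B"; with it on, the same sequence gives "2345AB", and isValid accepts both either way.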

Nobody will save you this time. Move the ID validation(s) into a set of shared components managed in one place; that way you'll be able to change the format quickly next time. If all or most of the applications are on the network, you can add a "load new format definition" function and distribute regular expressions to all the apps out there this way.
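
Roughly what such a shared component could look like in Java, as a sketch only: SharedIdValidator and loadFormat are hypothetical names, and how the new regular expression reaches each application (file, database row, network call) is left open.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Pattern;

/** Hypothetical shared component: one validator whose format can be replaced at runtime. */
final class SharedIdValidator {
    // AtomicReference lets a new format be swapped in while other threads are validating.
    private final AtomicReference<Pattern> format;

    SharedIdValidator(String initialRegex) {
        this.format = new AtomicReference<>(Pattern.compile(initialRegex));
    }

    /**
     * "Load new format definition": compiles the regex first, so a malformed
     * definition throws PatternSyntaxException and leaves the current format in place.
     */
    void loadFormat(String regex) {
        format.set(Pattern.compile(regex));
    }

    boolean isValid(String id) {
        return id != null && format.get().matcher(id).matches();
    }
}
```

An application would start with new SharedIdValidator("\\d{4}[A-Z]") and later call loadFormat("\\d{4}[A-Z]{1,2}") when the new definition is distributed.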