Not sure this is the right place to ask this question, but I'm having a disagreement with a colleague on this idea.
Let's say we have a dataset made up of "unclean" strings. The end goal is to have "clean" strings.
Now let's say there is an existing model, called Model A. We pass all of the "unclean" strings into Model A and call the result the "clean" data.
Next, if I train Model B with the same "unclean" strings as the input and use the output from Model A as the training and validation targets, would Model B ever provide significantly different results? And would there be any way to determine whether it performed better than Model A?
A little more background on the goals: the current project goal is to create an ML version of Model A, which is itself regex-based. My understanding is that if we trained Model B using the output of Model A, it would, at best, recreate the functionality of Model A.
There is no human editing of the output from Model A, so it really is just Model A (regex) feeding into Model B (some ML algorithm).
Here's some example data:
Model A:
Input String | Output String |
---|---|
wats going on? | What's going on? |
How R U today? | How are you today? |
123 main street | 123 Main St |
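
To make the evaluation part of the question concrete, here is a minimal sketch of how I picture the scoring. The Model A outputs come from the table above; the `human_gold` references and the `model_b_out` predictions are purely hypothetical values I made up for illustration, since no human-checked data or trained Model B exists yet. The helper `exact_match_rate` is also just my own name for a simple string-equality score.

```python
def exact_match_rate(predictions, references):
    """Fraction of positions where the prediction equals the reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Model A outputs, copied from the example table.
model_a_out = ["What's going on?", "How are you today?", "123 Main St"]

# HYPOTHETICAL: what Model B might predict for the same three inputs.
model_b_out = ["What's going on?", "How are you today?", "123 Main Street"]

# HYPOTHETICAL: human-written references that do not depend on Model A.
human_gold = ["What's going on?", "How are you today?", "123 Main Street"]

# Scored against Model A's labels, any disagreement counts against Model B,
# whether or not Model B's answer is actually the better one.
b_vs_a = exact_match_rate(model_b_out, model_a_out)

# Only an independent reference lets the two models be compared on quality.
a_vs_gold = exact_match_rate(model_a_out, human_gold)
b_vs_gold = exact_match_rate(model_b_out, human_gold)

print(f"B agreement with A: {b_vs_a:.2f}")
print(f"A vs. human gold:   {a_vs_gold:.2f}")
print(f"B vs. human gold:   {b_vs_gold:.2f}")
```

In this made-up case, Model B disagrees with Model A on one row, and the validation set built from Model A's output would record that as an error. As far as I can tell, it only shows up as an improvement if there is some reference that didn't come from Model A, which is exactly what we don't have.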