
Not sure this is the right place to ask this question, but I'm having a disagreement with a colleague on this idea.

Let's say we have a dataset consisting of "unclean" strings. The end goal is to have "clean" strings.

Now let's say there is an existing model, called Model A. We pass all of the "unclean" strings through Model A and call the result the "clean" data.

Next, suppose I train Model B with the same "unclean" strings as the input and use Model A's output as the training and validation targets. Would Model B ever produce significantly different results? And would there be any way to determine whether it performed better than Model A?

A little more background on the goals: the current project goal is to create an ML version of Model A, which is based on regex. My understanding is that if we trained Model B on the output of Model A, it would, at best, recreate the functionality of Model A.

There is no human editing of the output from Model A, so it really is just Model A (regex) feeding into Model B (some ML algorithm).

Here's some example data:

Model A

Input String       Output String
wats going on?     What's going on?
How R U today?     How are you today?
123 main street    123 Main St
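
To make the setup concrete, here is a rough sketch in Python of what I mean. The regex rules, the `model_a_clean` function, and the commented-out `ModelB` lines are placeholders I made up for this question; the real Model A has far more rules, and Model B could be any ML approach.

```python
import re

# Hypothetical stand-in for Model A: a handful of regex rules.
# (The real Model A has many more rules; these just cover the examples above.)
RULES = [
    (re.compile(r"\bwats\b", re.IGNORECASE), "What's"),
    (re.compile(r"\bR U\b", re.IGNORECASE), "are you"),
    (re.compile(r"\bstreet\b", re.IGNORECASE), "St"),
    (re.compile(r"\bmain\b", re.IGNORECASE), "Main"),
]

def model_a_clean(text: str) -> str:
    # Apply each regex rule in order -- this plays the role of Model A.
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# The "unclean" dataset (in reality, many more strings).
unclean = ["wats going on?", "How R U today?", "123 main street"]

# Model A's output becomes the labels for Model B -- no human review.
pairs = [(s, model_a_clean(s)) for s in unclean]

# Train/validation split for Model B; both sides come from Model A's output.
split = int(0.8 * len(pairs))
train_pairs, val_pairs = pairs[:split], pairs[split:]

# model_b = SomeSeq2SeqModel()      # placeholder for whatever ML model we pick
# model_b.fit(train_pairs)
# model_b.evaluate(val_pairs)       # "accuracy" here is agreement with Model A, not correctness
```

As the sketch hopefully shows, Model A's output is both the training signal and the yardstick, which is why I suspect Model B can at best reproduce Model A.
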
  • If Model A is much bigger than Model B, you may be interested in the idea of model distillation: en.wikipedia.org/wiki/Knowledge_distillation – Commented Jul 9 at 17:25
  • Unfortunately, this is not the goal of our project, but this is a new topic that I did not know about and will keep in mind in the future. – setty, Commented Jul 9 at 17:48

