For context, I’ve got a heavy background in software development and am skilled with reporting tools, databases, and SQL, so I have a solid understanding of datasets. I also generally get the idea that you can train a machine learning model by providing it with a training set of input data matched with output data (“supervised learning”), and then ask it to predict the output for new input data not in the training set.
Let’s say I have a number of problems that I think can be solved via machine learning, with two examples being:
- Lateness Predictability – There’s a task in my system where a user has to create a document. A document may take a day or may take 10 days, depending on what its specs ask for. There are probably some other obvious factors that come into play, such as how much experience the person has in creating these documents. This task is on the critical path of a larger set of steps, and it would be very helpful to look at historical data to determine whether this task is likely to be completed on time or will be late.
- Iteration Predictability – Let’s say that once the document is done it goes through reviews, and it would be helpful to predict how many iterations that may take. Again, factors like document spec complexity, who is reviewing, and who the creator is come into play.
How do I go about designing and building a product feature which allows me to understand what the likely outcome of these problems is (for 1, what’s the likelihood that this will be late, and for 2, what’s the likely number of iterations), as well as identifying the largest factors that affect that prediction (i.e. this task is likely going to be late because this person has almost never completed this task as fast as you expect)?
I have a loose idea that the process would involve:
- Building a training set of data for each of these problems.
- Running a process to have the machine ‘learn’.
- Taking the output of that (which is a model?) and running any new problems through this model.
- Looking at the output model and determining what the highest weighted factors are in determining the outcome?
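To make that loose idea concrete, this is roughly what I picture the four steps looking like for the lateness problem. It is only a sketch: the data is invented, the two features (`spec_complexity`, `author_experience`) and all function names are hypothetical, and I have assumed logistic regression written out in plain Python rather than any particular library, which may well not be the right choice.

```python
import math

# Step 1: a (made-up) training set. Each row is
# (spec_complexity, author_experience), both scaled 0-1; label 1 = task was late.
FEATURES = ["spec_complexity", "author_experience"]
X = [
    (0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.9, 0.4),
    (0.2, 0.9), (0.3, 0.8), (0.1, 0.7), (0.4, 0.9),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=2000, lr=0.5):
    """Step 2: fit logistic regression weights with plain gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * v for w, v in zip(weights, row))
            error = sigmoid(z) - label
            for i, v in enumerate(row):
                grad_w[i] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_late_probability(model, row):
    """Step 3: run a new, unseen task through the trained model."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def factor_contributions(model, row):
    """Step 4 (my guess at it): weight * value per feature, largest magnitude first."""
    weights, _ = model
    contribs = [(name, w * v) for name, w, v in zip(FEATURES, weights, row)]
    return sorted(contribs, key=lambda c: abs(c[1]), reverse=True)


model = train(X, y)
new_task = (0.85, 0.15)  # complex spec, inexperienced author
p_late = predict_late_probability(model, new_task)
top_factors = factor_contributions(model, new_task)
print(f"P(late) = {p_late:.2f}", top_factors)
```

Here the "model" is just a list of weights plus a bias, which is part of why I am asking whether the output can be treated as a static formula. I am also not sure that ranking by weight times value is a sound way to identify the factors behind a prediction.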
I have a hard time understanding the tactical steps to take here, for example:
- What kind of skillset is needed to do steps 1 and 2 in the process? Do I need to find someone who is highly skilled in stats to identify the key variables to feed in as the input dataset? Or do I gather all the input data I can find and let the machine learning process identify the right input variables?
- How do I take that output and implement it in a product?
- Is the model output just a static formula that I can implement in my software, or is that not possible?
- If I want to have the machine continually learn and course correct itself, is that just a formula that dynamically updates the model as it finds new inputs/outputs that let it refine its model?
- How do I extract the ‘factors’ of why something was predicted as such?
I’d love to hear about a practical example of how someone took a machine learning problem and implemented it in a web application, for example.