Developing machine learning models involves a lot of steps. Whether you’re working with labeled or unlabeled data, you might think numbers are just numbers and that it doesn’t matter what each of the features of a data set signifies when it comes to spitting out insights with the potential for true impact. It’s true that there are tons of great machine learning libraries out there such as scikit-learn that make it straightforward to gather up some data and plop them into a cookie-cutter model. Pretty quickly, you might start to think there’s no problem you can’t solve with machine learning.
Frankly, that’s a beginner’s mindset. You are not yet aware of everything you don’t know. Data sets given in machine learning courses or the free ones you find online often have been groomed already and are convenient to use when applying machine learning models, but once you take your skills and knowledge out of the playpen and into the real world, you’ll face some additional challenges.
Lots of people believe that domain knowledge, or additional knowledge regarding the industry or area the data pertains to, is superfluous. And it’s kind of true. Do you need domain knowledge in the area for which you’re developing the model? No. You can still produce fairly accurate models without it. In theory, machine learning, and deep learning in particular, can be treated as a black-box approach. This means you can put labeled data into a model without deep knowledge of the area and without even looking at the data very closely.
But if you go down this route, you’ll have to deal with the consequences. This is a very inefficient way to train classifiers: for them to function properly, you’ll need massive amounts of labeled data and a lot of computational power.
Incorporating domain knowledge into your architecture and your model can make it a lot easier to explain the results, both to yourself and to an outside viewer. Every bit of domain knowledge can serve as a stepping stone through the black box of a machine learning model.
It’s very easy to think that domain knowledge isn’t required because, for lots of visual data sets such as COCO, the limited domain knowledge that is required is part of being a seeing human. Even more complex data sets that contain images of cancer cells can seem similarly obvious to the human eye, despite a lack of expert-level knowledge. You can do a basic evaluation of similarities or differences between cells without any specific medical knowledge.
Natural-language processing (NLP) and computer vision are prime examples of areas where it’s easy to think that domain knowledge is entirely unnecessary. That’s largely because they are such everyday tasks for us that we may not even notice how we’re applying our domain knowledge.
If you start working in areas such as outlier detection, which isn’t such an everyday human task, the importance of domain knowledge quickly becomes apparent.
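To make that concrete, here is a minimal sketch of how one piece of domain knowledge can change an outlier detector's answer. Everything in it is an assumption for illustration: the daily sales figures are invented for a hypothetical shop that is much busier on weekends, the function names are made up, and the detector is a simple modified z-score based on the median absolute deviation, with the commonly used 0.6745 scaling constant and 3.5 cutoff. It isn't meant as the one right way to detect outliers.

```python
from statistics import median


def robust_outliers(values, threshold=3.5):
    """Return positions whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def grouped_outliers(values, groups, threshold=3.5):
    """Run the same detector separately inside each domain-defined group."""
    flagged = []
    for label in set(groups):
        idx = [i for i, g in enumerate(groups) if g == label]
        local = robust_outliers([values[i] for i in idx], threshold)
        flagged.extend(idx[j] for j in local)
    return sorted(flagged)


# Invented daily sales for two weeks, Monday through Sunday (hypothetical data).
sales = [98, 102, 95, 105, 100, 295, 310,
         290, 97, 103, 99, 101, 300, 305]
# Domain knowledge: this shop's weekends are expected to look different.
day_type = ["weekday"] * 5 + ["weekend"] * 2 + ["weekday"] * 5 + ["weekend"] * 2

naive = robust_outliers(sales)                   # treats all days alike
domain_aware = grouped_outliers(sales, day_type)  # compares like with like

print("naive:", naive)
print("domain-aware:", domain_aware)
```

On this toy data, the naive detector flags every weekend along with the odd Monday, because it has no idea that busy weekends are normal. Once the days are split by type, only the second Monday (position 7), a weekday with weekend-level sales, stands out. The statistics didn't change; the only new ingredient is knowing which days should be compared with each other.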