Data retrieval

As mentioned earlier in this chapter, ML systems need data for functioning. It is not available all of the time, in fact, most of the time, the data itself is not available in a format with which we can actually start training ML models. But what if there is no standard dataset for a particular problem that we are trying to solve using ML? Welcome to reality! This happens for most real-life ML projects. For example, let's say we are trying to analyze the sentiments of tweets regarding the New Year resolutions of 2018 and trying to estimate the most meaningful ones. This is actually a problem for which there is no standard dataset available. We will have to scrape it from Twitter using its APIs. Another great example is business logs. Business logs are treasures of knowledge. If effectively mined and modeled, they can help in many decision-making processes. But often, logs are not available directly to the ML engineer. So, the ML engineer needs to spend a considerable amount of time figuring out the structure of the logs and they might write a script so that the logs are captured as required. All of these processes are collectively called data retrieval or data collection.