Pothole Detector by AI

Process

We initially brainstormed several ideas for a machine learning model and how it could improve people's everyday lives. After narrowing down our ideas, we settled on potholes, since they are a common road hazard and detecting them can make driving noticeably safer.

Data Collection
Because we based our pothole detector on the YOLOv4 supervised learning model, we needed to find and label a large number of pothole images to train it. The dataset would ultimately be the key factor determining the model's accuracy, since the images we fed in influence how the model's weights are adjusted to detect potholes. We used Labelbox to label the potholes in the images we found.
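For Darknet-style YOLOv4 training, each image needs a matching .txt file with one normalized box per line. Below is a minimal sketch of converting pixel-space boxes (as a labeling tool like Labelbox might export them) into that format; the export's field layout here is an assumption, not Labelbox's actual API.

```python
# Convert a pixel-space bounding box into a Darknet/YOLO label line.
# The (left, top, width, height) input layout is a hypothetical export
# format; adjust to whatever your labeling tool actually produces.

def to_yolo_line(box, img_w, img_h, class_id=0):
    """box = (left, top, width, height) in pixels -> YOLO normalized line."""
    left, top, w, h = box
    x_center = (left + w / 2) / img_w
    y_center = (top + h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# Example: a 120x80 px pothole box at (200, 340) in a 1280x720 image.
print(to_yolo_line((200, 340, 120, 80), 1280, 720))
# -> "0 0.203125 0.527778 0.093750 0.111111"
```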

Splitting up the Data
After gathering and labeling all the data we could, we ended up with around three thousand labeled images. The next step was to split the data into training, testing, and validation sets. The training set would be used to train our model, the testing set to check that the model was functioning as intended, and the validation set to fine-tune the model's parameters. For our model, we sectioned off 80% of the images for training, 10% for testing, and the final 10% for validation.
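A split like this takes only a few lines. Here is a minimal sketch, assuming all images sit in a single folder and that training consumes Darknet-style file lists; the paths are hypothetical, not our exact layout.

```python
# Shuffle the labeled images and split them 80/10/10 into
# train/test/val, writing one Darknet-style file list per split.
import random
from pathlib import Path

random.seed(42)  # fixed seed so the split is reproducible
images = sorted(Path("data/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
n_train = int(0.8 * n)
n_test = int(0.1 * n)

splits = {
    "train": images[:n_train],
    "test": images[n_train:n_train + n_test],
    "val": images[n_train + n_test:],
}

for name, files in splits.items():
    with open(f"{name}.txt", "w") as f:
        f.write("\n".join(str(p) for p in files))
```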

Formatting the Data
Because we had collected our data from a number of different sources, some of it was incompatible with our model. After writing functions to convert the files to our desired file extensions and reorganize them into their proper image sets, we were ready to begin training our model.
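As one illustration of that cleanup, here is a hedged sketch, assuming the Pillow library is available, that normalizes mixed image formats to .jpg in one pass; the directory names and extension list are placeholders rather than our exact pipeline.

```python
# Normalize mixed image formats (.png, .webp, etc.) to .jpg and
# collect the results in one directory for the splitting step.
from pathlib import Path
from PIL import Image

src = Path("data/raw")
dst = Path("data/images")
dst.mkdir(parents=True, exist_ok=True)

for path in src.iterdir():
    if path.suffix.lower() in {".png", ".webp", ".bmp", ".jpeg", ".jpg"}:
        img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
        img.save(dst / f"{path.stem}.jpg", "JPEG", quality=95)
```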

Model Training
For the training process, we used an NVIDIA Tesla T4 GPU with 16 GB of VRAM and began feeding in our data. After around 12 hours, we were able to deploy our site for the first time and test it out. Although the model appeared to work seamlessly on the few pothole photos we tested manually, our validation dataset soon showed us the opposite. From the confusion matrices that we generated, we noticed our model had too many false negatives in its detection.
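For a detector, a false negative is a ground-truth box that no prediction overlaps above some IoU threshold. The sketch below shows that counting logic in simplified form; it ignores confidence scores and strict one-to-one matching, which a full evaluation would handle, and all the boxes in the example are made up.

```python
# Count true positives, false positives, and false negatives for one
# image by matching predicted boxes to ground truth with an IoU threshold.
# Boxes are (x1, y1, x2, y2) in pixels.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def count_confusion(ground_truth, predictions, thresh=0.5):
    tp = sum(any(iou(gt, p) >= thresh for p in predictions) for gt in ground_truth)
    fn = len(ground_truth) - tp  # potholes the model missed entirely
    fp = sum(not any(iou(p, gt) >= thresh for gt in ground_truth) for p in predictions)
    return tp, fp, fn

gts = [(10, 10, 50, 50), (100, 100, 160, 150)]
preds = [(12, 11, 48, 52)]  # model found the first pothole, missed the second
print(count_confusion(gts, preds))  # -> (1, 0, 1): one false negative
```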

Inherent Problems
After diving deeper into the issue, we found that variation in labeling technique seemed to be the cause. Because many different people had labeled the data, each brought their own bias toward what did and did not count as a pothole. To combat this, we had reviewers re-check the previously labeled data against a set of shared standards to produce a more uniform dataset with minimal bias. After this further review, we fed the new data into our model and arrived at a much better model.
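A reviewer pass like this can be partially automated. The heuristic below is purely illustrative, not the standard we actually applied: it flags YOLO label files containing unusually small boxes, which are often ambiguous and worth a second look.

```python
# Flag label files with tiny normalized boxes for manual re-review.
# MIN_AREA is an arbitrary illustrative threshold; tune it to your data.
from pathlib import Path

MIN_AREA = 0.001  # normalized box area below which a label looks suspect

def suspicious_labels(label_dir="data/labels"):
    flagged = []
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            _, _, _, w, h = map(float, line.split())
            if w * h < MIN_AREA:  # tiny box: queue the file for a reviewer
                flagged.append(label_file.name)
                break
    return flagged

print(suspicious_labels())
```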