Collect a reasonable amount of representative data.
For images, an initial model usually needs on the order of 10k-50k images. The total can grow to millions or even hundreds of millions of images, depending on how robust you want your computer vision system to be.
As a rule of thumb, your training data should come from sources similar to those in the actual use case. In technical terms, you want the distribution of the data your model sees in production to match the distribution of your training set. For instance, if your camera has a fisheye lens, your computer vision models will generally find fisheye images more useful than non-fisheye images. Another example: if you're distinguishing cats from dogs, you'd want many examples of the kinds of dogs and cats you'll see later on, not just one kind of cat or dog.
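One rough way to check whether two datasets "match" is to compare how often each label appears in the training set and in a sample of live data. The sketch below is only an illustration: the helper names, the breed labels, and the choice of total variation distance as the metric are all assumptions made here, not something prescribed above. It also compares labels only, so it would not catch differences in the images themselves, such as fisheye versus non-fisheye lenses.

```python
from collections import Counter


def label_distribution(labels):
    """Return a dict mapping each label to its share of the sample."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0.0 means the label shares are identical; 1.0 means no overlap.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical data: the training set is almost all one kind of cat,
# while the live sample contains a kind the model never saw.
train_labels = ["tabby"] * 90 + ["siamese"] * 10
live_labels = ["tabby"] * 50 + ["siamese"] * 30 + ["persian"] * 20

train_dist = label_distribution(train_labels)
live_dist = label_distribution(live_labels)
tvd = total_variation_distance(train_dist, live_dist)

print(f"Total variation distance: {tvd:.2f}")
```

A value near 0 suggests the label mix is similar; a larger value, as in this made-up example, hints that the training set underrepresents what shows up in practice. What counts as "too large" depends on the application and is left open here.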