Student: ThienNgo Nguyen Le

Instructor: Dr. Feng Jiang

Class: Machine Learning - CS3210

Final Project - Applying Machine Learning to Agriculture

Introduction

Nowadays, machine learning techniques are widely used in many different fields, including agriculture. Many machine learning applications have been brought to this field, such as yield prediction algorithms and automated harvesting systems, and the technology contributes a great deal to the improvement of agriculture. However, there are still many challenges and roadblocks to applying machine learning in agriculture. An application implemented in agriculture has to be practical and has to align with other existing technologies. Machine learning, as the name itself suggests, requires a lot of data, so gaps in data collection and preparation are also a big challenge. In this project, I focus on data collection strategies and the practical implementation of machine learning techniques in agriculture. This project has five phases:

1. Data collection strategies.
2. Wild weed detection.
3. Growing stages and environment analysis.
4. Pest and disease detection.
5. Complete automatic growing model.

In this report, I will demonstrate the work I have done for Phase 1 and Phase 2.

1. Data collection strategies.

As mentioned in the previous paragraph, the lack of data is one of the main blockers to applying machine learning to agriculture, so finding good tools and methods to collect data is important. In the first phase of this project, I focus on finding methods to collect the data. I use an Arducam ESP8266 UNO board connected to an "OV2640 Mini Module Camera" to capture pictures of the plants, a "Soil Moisture Sensor" to measure the moisture of the soil over time, and a "Temperature Humidity Sensor" to record the temperature and humidity of the growing environment. All the data is saved to an SD card attached to the board.

In [1]:
from IPython.display import Image
PATH = ("/Users/thienngole/Desktop/MSU/10-MSU-Spring-2019/" 
        + "CS3120-MachineLearning/Assignment/FinalProject/img/")
Image(filename = PATH + "board.png", width=400, height=400)
Out[1]:
In [2]:
Image(filename = PATH + "sensors.png", width=400, height=400)
Out[2]:
In [3]:
Image(filename = PATH + "greenhouse.jpg", width=300, height=400)
Out[3]:
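
For later analysis, the readings copied from the SD card can be loaded into Python with pandas. The sketch below is only an illustration: the file name sensor_log.csv and the column names are assumptions, since the exact log format depends on the board firmware.

import pandas as pd

#Load the sensor log copied from the SD card (assumed CSV format with these columns).
log = pd.read_csv("sensor_log.csv",
                  names=["timestamp", "soil_moisture", "temperature", "humidity"],
                  parse_dates=["timestamp"])

#Quick summary of the growing environment over the collection period.
print(log.describe())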

This method works really well. However, it takes some manual labor to copy the data from the SD card and put the card back into the board. It normally takes me 20 to 30 minutes per week to collect the data for 10 samples. If we scale up to 1,000 or 10,000 samples, this becomes hard to sustain because of the labor involved. Cloud storage would be an ideal solution that could save a lot of time in this situation. I came up with the model shown in the figure below: a Raspberry Pi connects the board to the internet and uploads the data directly to a cloud storage service, so I can then download the data remotely from cloud storage.

In [4]:
Image(filename = PATH + "rpconnection.png", width=400, height=400)
Out[4]:
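
As a rough sketch of the upload step in this model, the Raspberry Pi could simply post each reading to the storage service over HTTP using the requests library. The endpoint URL and the payload fields below are placeholders, since I have not yet picked a specific cloud service.

import requests

#Placeholder endpoint of the cloud storage / database service.
UPLOAD_URL = "https://example.com/api/readings"

def upload_reading(soil_moisture, temperature, humidity):
    #Send one sensor reading from the Raspberry Pi to cloud storage.
    payload = {
        "soil_moisture": soil_moisture,
        "temperature": temperature,
        "humidity": humidity,
    }
    response = requests.post(UPLOAD_URL, json=payload, timeout=10)
    response.raise_for_status()  #fail loudly if the upload did not succeed

#Example: upload a single reading.
upload_reading(soil_moisture=0.42, temperature=24.5, humidity=0.61)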

2. Wild weed detection.

Weed control in the early stage of a crop (less than eight weeks old) has a big effect on crop yield, because weeds compete vigorously with the crop for nutrients and water during this period, so detecting and removing the weeds is a useful task.

In [5]:
import os
import cv2
import numpy as np
data_path = ("/Users/thienngole/Desktop/MSU/10-MSU-Spring-2019/"
             +"CS3120-MachineLearning/Assignment/FinalProject/data/")
data_set = []
labels = []
#Each sub-folder of the data directory is one class of plant images.
image_set = os.listdir(data_path)
image_set.remove('.DS_Store')  #skip the macOS metadata file

print("Reading data")
#Read data
for class_name in image_set:
    image_list = os.listdir(data_path + class_name)
    for an_image in image_list:
        if '.png' in an_image:
            image = cv2.imread(data_path + class_name + '/' + an_image)
            image = cv2.resize(image, (64, 64), interpolation=cv2.INTER_CUBIC)
            data_set.append(image)
            labels.append(class_name)
            
print("Done reading data")
Reading data
Done reading data
In [6]:
#Scale pixel values to the range [0, 1].
my_data_set = np.array(data_set, dtype='float32') / 255.0
my_labels = np.array(labels)
In [7]:
from sklearn.model_selection import train_test_split
#Split data set into 75% training and 25% testing.
(X_train, X_test, y_train, y_test) = train_test_split(my_data_set, my_labels, test_size=0.25, random_state=20)
In [8]:
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)
X_train shape (4380, 64, 64, 3)
y_train shape (4380,)
X_test shape (1461, 64, 64, 3)
y_test shape (1461,)
In [9]:
from sklearn.preprocessing import LabelBinarizer
print("Shape before one-hot encoding: ", y_train.shape)
lb = LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test = lb.transform(y_test)
print("Shape after one-hot encoding: ", y_train.shape)
Shape before one-hot encoding:  (4380,)
Shape after one-hot encoding:  (4380, 13)
In [10]:
from keras.models import Sequential
from keras.layers.convolutional import Conv2D
from keras.layers.core import Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Dropout
from keras.layers.core import Flatten
from keras.layers.core import Dense
inputShape = (64, 64, 3)
#Convolutional blocks use the ReLU activation, batch normalization, max pooling, and dropout.
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=inputShape))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=1))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=1))
model.add(Conv2D(64, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=1))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=1))
model.add(Conv2D(128, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=1))
model.add(Conv2D(128, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=1))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(len(lb.classes_)))
model.add(Activation("softmax"))
   
In [11]:
#Compile the model with categorical cross-entropy loss and the SGD optimizer.
model.compile(loss="categorical_crossentropy", optimizer='sgd', metrics=["accuracy"])
In [12]:
import matplotlib.pyplot as plt
def acc_loss_graphic(model_history, title):
    fig = plt.figure()
    plt.subplot(2,1,2)
    plt.plot(model_history.history['acc'], label='training accuracy')
    plt.plot(model_history.history['val_acc'], label='test accuracy')
    plt.plot(model_history.history['loss'], label='training loss')
    plt.plot(model_history.history['val_loss'], label='test loss')
    plt.title(title)
    plt.ylabel('loss/accuracy')
    plt.xlabel('epoch')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
In [13]:
#Train the model with the original dataset
org_data_fit_history = model.fit(X_train, y_train, batch_size=32, 
                    validation_data=(X_test, y_test), epochs=10)
Train on 4380 samples, validate on 1461 samples
Epoch 1/10
4380/4380 [==============================] - 81s 18ms/step - loss: 2.9533 - acc: 0.1454 - val_loss: 3.0339 - val_acc: 0.0719
Epoch 2/10
4380/4380 [==============================] - 91s 21ms/step - loss: 2.5972 - acc: 0.1669 - val_loss: 2.5075 - val_acc: 0.1246
Epoch 3/10
4380/4380 [==============================] - 77s 18ms/step - loss: 2.3758 - acc: 0.2212 - val_loss: 2.1453 - val_acc: 0.2834
Epoch 4/10
4380/4380 [==============================] - 78s 18ms/step - loss: 1.8880 - acc: 0.3737 - val_loss: 1.8998 - val_acc: 0.2649
Epoch 5/10
4380/4380 [==============================] - 85s 20ms/step - loss: 1.7235 - acc: 0.4082 - val_loss: 3.2420 - val_acc: 0.2115
Epoch 6/10
4380/4380 [==============================] - 76s 17ms/step - loss: 1.5374 - acc: 0.4872 - val_loss: 1.7275 - val_acc: 0.4305
Epoch 7/10
4380/4380 [==============================] - 85s 19ms/step - loss: 1.3824 - acc: 0.5386 - val_loss: 1.3574 - val_acc: 0.5510
Epoch 8/10
4380/4380 [==============================] - 82s 19ms/step - loss: 1.2393 - acc: 0.5906 - val_loss: 1.6037 - val_acc: 0.5175
Epoch 9/10
4380/4380 [==============================] - 81s 18ms/step - loss: 1.0728 - acc: 0.6429 - val_loss: 1.2099 - val_acc: 0.5811
Epoch 10/10
4380/4380 [==============================] - 89s 20ms/step - loss: 0.9278 - acc: 0.6916 - val_loss: 1.7823 - val_acc: 0.5072
In [14]:
acc_loss_graphic(org_data_fit_history, 'Original data trained model loss and accuracy')
In [15]:
from sklearn.metrics import classification_report
# Evaluate the model trained on the original data set.
print("[INFO] evaluating network...")
predictions = model.predict(X_test, batch_size=32)
print(classification_report(y_test.argmax(axis=1),
                            predictions.argmax(axis=1), target_names=lb.classes_))
[INFO] evaluating network...
                           precision    recall  f1-score   support

              Black-grass       0.00      0.00      0.00        81
                 Charlock       0.78      0.86      0.82       117
                 Cleavers       0.62      0.60      0.61        96
         Common Chickweed       0.99      0.41      0.58       202
             Common wheat       0.50      0.04      0.07        54
                  Fat Hen       0.73      0.09      0.15       128
           Genovese basil       1.00      0.01      0.03        71
         Loose Silky-bent       0.61      0.73      0.67       181
                    Maize       0.70      0.73      0.72        67
        Scentless Mayweed       0.26      0.93      0.40       155
         Shepherd’s Purse       0.34      0.39      0.36        67
Small-flowered Cranesbill       0.62      0.93      0.74       135
               Sugar beet       1.00      0.07      0.14       107

                micro avg       0.51      0.51      0.51      1461
                macro avg       0.63      0.45      0.41      1461
             weighted avg       0.65      0.51      0.46      1461

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
In [16]:
from keras.preprocessing.image import ImageDataGenerator
#Augment the training data by rotating, shifting, shearing, zooming, and flipping the original images.
data_generator = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                        height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                        horizontal_flip=True, fill_mode="nearest")
In [17]:
#Continue training the model on augmented batches generated from the original training set.
gen_data_fit_history = model.fit_generator(data_generator.flow(X_train, y_train, batch_size=32), 
                                           validation_data=(X_test, y_test), 
                                           steps_per_epoch=len(X_train)//16, 
                                           epochs=10)
Epoch 1/10
273/273 [==============================] - 176s 646ms/step - loss: 1.1698 - acc: 0.6089 - val_loss: 1.8237 - val_acc: 0.4401
Epoch 2/10
273/273 [==============================] - 167s 610ms/step - loss: 1.0195 - acc: 0.6615 - val_loss: 1.0110 - val_acc: 0.6598
Epoch 3/10
273/273 [==============================] - 160s 586ms/step - loss: 0.9232 - acc: 0.6884 - val_loss: 0.8643 - val_acc: 0.7036
Epoch 4/10
273/273 [==============================] - 169s 620ms/step - loss: 0.8660 - acc: 0.7134 - val_loss: 1.1952 - val_acc: 0.5667
Epoch 5/10
273/273 [==============================] - 164s 601ms/step - loss: 0.7939 - acc: 0.7366 - val_loss: 1.6848 - val_acc: 0.5024
Epoch 6/10
273/273 [==============================] - 157s 575ms/step - loss: 0.7600 - acc: 0.7444 - val_loss: 0.6133 - val_acc: 0.7823
Epoch 7/10
273/273 [==============================] - 162s 593ms/step - loss: 0.6866 - acc: 0.7671 - val_loss: 0.7088 - val_acc: 0.7420
Epoch 8/10
273/273 [==============================] - 148s 542ms/step - loss: 0.6779 - acc: 0.7657 - val_loss: 0.4541 - val_acc: 0.8385
Epoch 9/10
273/273 [==============================] - 153s 559ms/step - loss: 0.6312 - acc: 0.7867 - val_loss: 0.6034 - val_acc: 0.7926
Epoch 10/10
273/273 [==============================] - 149s 544ms/step - loss: 0.5846 - acc: 0.7997 - val_loss: 0.8510 - val_acc: 0.7070
In [18]:
acc_loss_graphic(gen_data_fit_history, 'Generated data trained model loss and accuracy')
In [21]:
# Evaluate the same model after training on the augmented data.
print("[INFO] evaluating network...")
predictions = model.predict(X_test, batch_size=32)
print(classification_report(y_test.argmax(axis=1),
                            predictions.argmax(axis=1), target_names=lb.classes_))
[INFO] evaluating network...
                           precision    recall  f1-score   support

              Black-grass       0.55      0.14      0.22        81
                 Charlock       0.72      0.99      0.84       117
                 Cleavers       0.90      0.80      0.85        96
         Common Chickweed       0.99      0.51      0.67       202
             Common wheat       0.67      0.91      0.77        54
                  Fat Hen       0.97      0.29      0.45       128
           Genovese basil       1.00      1.00      1.00        71
         Loose Silky-bent       0.72      0.86      0.78       181
                    Maize       0.77      0.93      0.84        67
        Scentless Mayweed       0.45      0.99      0.62       155
         Shepherd’s Purse       0.64      0.37      0.47        67
Small-flowered Cranesbill       0.99      0.61      0.75       135
               Sugar beet       0.61      0.86      0.71       107

                micro avg       0.71      0.71      0.71      1461
                macro avg       0.77      0.71      0.69      1461
             weighted avg       0.78      0.71      0.69      1461
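
Once trained, the model can classify a new field image using the same preprocessing as the training data. A minimal sketch, assuming a hypothetical input file new_plant.png and reusing the model and lb objects defined above:

#Classify a single new image with the trained model.
new_image = cv2.imread("new_plant.png")  #hypothetical input file
new_image = cv2.resize(new_image, (64, 64), interpolation=cv2.INTER_CUBIC)
new_image = new_image.astype("float32") / 255.0

#The model expects a batch dimension: (1, 64, 64, 3).
probabilities = model.predict(np.expand_dims(new_image, axis=0))
predicted_class = lb.classes_[probabilities.argmax(axis=1)[0]]
print("Predicted class:", predicted_class)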

3. Growing stages and environment analysis.

To be completed in a future phase.

4. Pest and disease detection.

To be completed in a future phase.

5. Complete automatic growing model (Smart Grow).

To be completed in a future phase.

Conclusion

In this report, I wanted to show how the size of the data set affects the results of the model, by training it first on the original data set and then on additional data generated from the original images by rotating, shifting, and zooming. Overall, the weed detection model clearly shows that the amount of data has a big effect: generating more training data from the original data set gives a higher accuracy, and in this case the overall test accuracy improved from about 51% to about 71%. The number of images available for each class also affects the prediction accuracy for that class. Common Chickweed and Genovese basil, the two classes with the most images in my dataset, both reach very high precision whether the model is trained on the original data set or on the generated data. Therefore, data preparation is a very important factor in the success of a machine learning application. Data collection is very important in machine learning; it would save us a lot of time if the data were collected right from the start.
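
The per-class point above can be checked directly from the labels list built in the data-loading step; a short sketch to print the number of images in each class:

from collections import Counter

#Number of images per class in the original data set.
for class_name, count in Counter(labels).most_common():
    print(class_name, count)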
