TWSE Buy/Sell Daily Report CAPTCHA: Studying an Open-Source Project, Follow-up Processing, and Automated Download
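Open-source project
- The Taiwan Stock Exchange (TWSE) buy/sell daily report query system provides the buy/sell daily report for listed stocks for the most recent trading day, and a CAPTCHA must be entered before downloading. There are many ways to defeat this kind of CAPTCHA. This post introduces an open-source project on GitHub, 證交所買賣交易日報驗證碼 (TWSE buy/sell daily report CAPTCHA), which is easy to get started with and gives good results.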
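Follow-up work on the open-source project
- The project's own documentation is already very detailed, so interested readers can download the project and study its documents and code. Here I add my own follow-up processing and the steps for wiring it into an automated download.
- First, run the image preprocessing once over all labeled files. I labeled 15,000 image files in total, so the loop count in the program needs to be adjusted to match.
- Next, train the model; some messages from an actual run are included as well. The program saves one model file after every epoch. The time required depends on the hardware; in my environment one epoch takes about three minutes.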
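Because one model file is written per epoch, it can be convenient to choose the checkpoint to load programmatically instead of by hand. The sketch below assumes the file names follow an `epoch-loss-val_loss.hdf5` pattern. That pattern is only inferred from the name `model/01-0.07-0.28.hdf5` together with the reported loss (0.0747) and val_loss (0.2760); it is not confirmed from the project. `parse_checkpoint_name` and `best_checkpoint` are hypothetical helpers, not part of the project.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Assumed naming scheme: <epoch>-<loss>-<val_loss>.hdf5, e.g. 01-0.07-0.28.hdf5
CHECKPOINT_RE = re.compile(r"^(\d+)-(\d+\.\d+)-(\d+\.\d+)\.hdf5$")


class Checkpoint(NamedTuple):
    path: Path
    epoch: int
    loss: float
    val_loss: float


def parse_checkpoint_name(path: Path) -> Optional[Checkpoint]:
    """Return the parsed fields, or None if the name does not match the pattern."""
    match = CHECKPOINT_RE.match(path.name)
    if match is None:
        return None
    epoch, loss, val_loss = match.groups()
    return Checkpoint(path, int(epoch), float(loss), float(val_loss))


def best_checkpoint(model_dir: str) -> Optional[Checkpoint]:
    """Pick the checkpoint with the lowest val_loss (ties go to the later epoch)."""
    parsed = [parse_checkpoint_name(p) for p in Path(model_dir).glob("*.hdf5")]
    candidates = [c for c in parsed if c is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c.val_loss, -c.epoch))
```

The returned `path` could then be passed to `load_model` in place of a hard-coded file name. Note that the file names carry only two decimal places, so the comparison is coarse.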
2021-03-16 19:46:04.754543: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs/train/plugins/profile/2021_03_16_19_46_04
2021-03-16 19:46:04.758101: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.trace.json.gz
2021-03-16 19:46:04.761178: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs/train/plugins/profile/2021_03_16_19_46_04
2021-03-16 19:46:04.761298: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.memory_profile.json.gz
2021-03-16 19:46:04.761559: I tensorflow/python/profiler/internal/profiler_wrapper.cc:111] Creating directory: logs/train/plugins/profile/2021_03_16_19_46_04
Dumped tool data for xplane.pb to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.xplane.pb
Dumped tool data for overview_page.pb to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.overview_page.pb
Dumped tool data for input_pipeline.pb to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs/train/plugins/profile/2021_03_16_19_46_04/localhost.localdomain.kernel_stats.pb
160/160 [==============================] - ETA: 0s - loss: 0.0747 - digit1_loss: 0.0163 - digit2_loss: 0.0129 - digit3_loss: 0.0158 - digit4_loss: 0.0131 - digit5_loss: 0.0166 - digit1_accuracy: 0.9944 - digit2_accuracy: 0.9954 - digit3_accuracy: 0.9950 - digit4_accuracy: 0.9954 - digit5_accuracy: 0.9944
Epoch 00001: saving model to model/01-0.07-0.28.hdf5
160/160 [==============================] - 159s 995ms/step - loss: 0.0747 - digit1_loss: 0.0163 - digit2_loss: 0.0129 - digit3_loss: 0.0158 - digit4_loss: 0.0131 - digit5_loss: 0.0166 - digit1_accuracy: 0.9944 - digit2_accuracy: 0.9954 - digit3_accuracy: 0.9950 - digit4_accuracy: 0.9954 - digit5_accuracy: 0.9944 - val_loss: 0.2760 - val_digit1_loss: 0.0445 - val_digit2_loss: 0.0558 - val_digit3_loss: 0.0707 - val_digit4_loss: 0.0520 - val_digit5_loss: 0.0530 - val_digit1_accuracy: 0.9985 - val_digit2_accuracy: 0.9975 - val_digit3_accuracy: 0.9965 - val_digit4_accuracy: 0.9985 - val_digit5_accuracy: 0.9985
Epoch 2/30
52/160 [========>.....................] - ETA: 1:44 - loss: 0.0788 - digit1_loss: 0.0095 - digit2_loss: 0.0142 - digit3_loss: 0.0153 - digit4_loss: 0.0164 - digit5_loss: 0.0235 - digit1_accuracy: 0.9954 - digit2_accuracy: 0.9958 - digit3_accuracy: 0.9965 - digit4_accuracy: 0.9962 - digit5_accuracy: 0.9927
- Use the model to predict. The project does not explain this step, and since I am still new to DL it took me a long time to work out. I am recording what I learned here, in the code comments.
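import cv2
import numpy as np
from tensorflow.keras.models import load_model  # assumes TensorFlow 2.x Keras

# preprocessing, one_hot_decoding, allowedChars, CAPTCHA_FOLDER and
# PROCESSED_FOLDER come from the open-source project.
def TakeGuess(filename):
    print('model loading...')
    model = load_model("model/01-0.07-0.28.hdf5")  # load the trained model
    print('loading completed')
    from_filename = CAPTCHA_FOLDER + filename
    to_filename = PROCESSED_FOLDER + filename
    preprocessing(from_filename, to_filename)  # preprocess the image to be recognized
    img = cv2.imread(to_filename)
    npary = np.array(img) / 255.0
    npary = npary.reshape(-1, 40, 190, 3)  # reshape the preprocessed image into the form the model accepts
    x = model.predict(npary)  # run the prediction
    result = one_hot_decoding(x, allowedChars)  # use one_hot_decoding to turn the prediction into a string
    return result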
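The training log lists five outputs (`digit1` to `digit5`), so `model.predict` presumably returns one probability array per character position. As a rough illustration of what a function like `one_hot_decoding` has to do with that result, here is a minimal sketch in plain Python. It is an assumption, not the project's implementation: `decode_predictions` is a hypothetical name, and the toy alphabet `"AB7"` stands in for the project's `allowedChars`.

```python
from typing import List, Sequence


def decode_predictions(
    predictions: Sequence[Sequence[Sequence[float]]], allowed_chars: str
) -> List[str]:
    """Decode per-position class probabilities into strings.

    predictions[i][j][k] is the probability that character position i of
    sample j is allowed_chars[k] (one array per output head).
    """
    if not predictions:
        return []
    batch_size = len(predictions[0])
    decoded = []
    for sample in range(batch_size):
        chars = []
        for head in predictions:
            probs = head[sample]
            # argmax: index of the most probable character for this position
            best_index = max(range(len(probs)), key=lambda k: probs[k])
            chars.append(allowed_chars[best_index])
        decoded.append("".join(chars))
    return decoded


# Toy example: 2 character positions, 1 sample, alphabet "AB7".
toy = [
    [[0.1, 0.8, 0.1]],  # position 1 -> "B"
    [[0.0, 0.2, 0.8]],  # position 2 -> "7"
]
print(decode_predictions(toy, "AB7"))  # prints ['B7']
```

The function returns one string per sample in the batch. `TakeGuess` feeds a single image, so in that setting only the first element would be of interest.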
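Program for downloading the TWSE buy/sell daily report
For the download itself, a good-quality program has also been shared online: 分點進出取資料研究 (a study on fetching per-branch broker trading data). After downloading that program, replace the part where the CAPTCHA is typed in by hand with a call to the recognition function above, and the automated version is complete.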
Original version
print('Enter the CAPTCHA code: ', end='', flush=True)
vcode = sys.stdin.readline().strip()
params['CaptchaControl1'] = vcode
params['TextBox_Stkno'] = '2330'
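Automatic recognition version
# print('Enter the CAPTCHA code: ', end='', flush=True)
# vcode = sys.stdin.readline().strip()
# TakeGuess prepends CAPTCHA_FOLDER, so imgpath should be a file name inside that folder.
vcode = TakeGuess(imgpath)
params['CaptchaControl1'] = vcode
params['TextBox_Stkno'] = '2330'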
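- The result of an actual download test is shown in the screenshot below. My subjective impression is that the recognition rate is around 80% (roughly one failure in every five downloads), so there is still room for improvement.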
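With roughly one attempt in five failing, one simple mitigation is to retry with a fresh CAPTCHA. The following is a minimal sketch under stated assumptions: `fetch_once` is a hypothetical callable that performs one complete attempt (fetch a CAPTCHA image, call `TakeGuess`, submit the form) and returns the downloaded data, or `None` when the CAPTCHA is rejected. How the site signals a rejected CAPTCHA is not covered here and would have to be checked against the download program.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def download_with_retry(
    fetch_once: Callable[[], Optional[T]],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
) -> Optional[T]:
    """Call fetch_once until it returns a non-None result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch_once()
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause between tries to avoid hammering the server
    return None
```

If each attempt really succeeded independently about 80% of the time, all five attempts would fail with probability 0.2^5 = 0.00032, or about 0.03%. Independence is an assumption here; a CAPTCHA style the model handles badly could fail repeatedly.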