My new project is to analyze Edgar documents. Edgar is where companies in the US are required by law to publish their annual reports, quarterly reports, and other filings. Many investors consider this data a valuable resource for investment decisions.
Challenges
- Volume: The first challenge in analyzing Edgar documents is volume. Edgar receives about 3,000 announcements per day, which amount to up to 40k actual files per day. Analyzing 40k files per day is hard work if you need to cover the last 3 months of data. In particular, if you need to repeat the analysis with a modified parameter, 40k files per day becomes quite a performance burden.
- Cracking Documents: The documents consist of various file types, including HTML, PDF, Excel, JPG, PNG, GIF, and JS. OCR (Optical Character Recognition) is required for PDF, JPG, and similar files.
- Storage: With the standard S1 tier, the index provides 25GB of storage, which only lasts a few months for my project. The standard S2 tier provides 100GB. Still, because I need to keep adding new documents to the Search, the 100GB will fill up eventually.
- Cost: The standard S1 tier costs US$245.28 per month, per Unit. Note that this is a per-Unit price: if you use 4 Units to increase performance, the cost is US$981.12 per month. The standard S2 tier with 4 Units costs US$3,924.48 per month.
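
To make the volume and cost figures above easier to replay with different inputs, here is a rough Python sketch of the arithmetic. It only uses the numbers quoted in the list. The S2 per-Unit price is my own derivation from the quoted 4-Unit total (not a figure from a price sheet), the 90-day window is an assumed approximation of "3 months", and the function names `monthly_cost` and `files_to_process` are made up for this illustration.

```python
from decimal import Decimal

# Per-Unit monthly prices in USD.
# S1 is the quoted price. S2 is an assumption derived from the quoted
# 4-Unit total (US$3,924.48 / 4), not taken from an official price list.
PRICE_PER_UNIT_USD = {
    "S1": Decimal("245.28"),
    "S2": Decimal("3924.48") / 4,
}

# Upper bound on files per day, as quoted in the Volume item.
FILES_PER_DAY = 40_000


def monthly_cost(tier: str, units: int) -> Decimal:
    """Monthly cost in USD for a tier running the given number of Units."""
    return PRICE_PER_UNIT_USD[tier] * units


def files_to_process(days: int, files_per_day: int = FILES_PER_DAY) -> int:
    """Upper bound on the number of files in a backfill covering `days` days."""
    return days * files_per_day


if __name__ == "__main__":
    print("S1 x 4 Units:", monthly_cost("S1", 4))
    print("S2 x 4 Units:", monthly_cost("S2", 4))
    # Assumption: 3 months is treated as roughly 90 days.
    print("Files in 90 days:", files_to_process(90))
```

With these inputs, a 90-day backfill comes to as many as 3.6 million files, and every repeated run with a modified parameter pays that cost again.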