Custom search engine for legal research database

 

ActaPublica.se is a leading research database company in Sweden, with more than 90 million legal documents. The database, along with monitoring services, allows for extensive investigations and background checks.

These are used by professionals such as journalists, lawyers and researchers. Documents like primary investigations, annual reports, court decisions and more are imported from around 600 government agencies. 

The application core is built on the Laravel framework. Background processes written in Python and PHP identify documents and notify users via email and the API.

We integrated Elasticsearch to return fast results for specific search queries and filters.

Elasticsearch’s relevance ranking shows the best-matching documents at the top of the result set. Users can also use the API services to integrate document search into other applications.
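
As a rough illustration of how a scored text search can be combined with filters, the sketch below builds an Elasticsearch Query DSL request body in Python. The field names `content`, `court` and `received_date` and the helper `build_search_body` are assumptions made for this example; the actual index mapping is not described here. The `term` filter assumes `court` is indexed as a keyword field.

```python
import json


def build_search_body(text, court=None, date_from=None, size=20):
    """Build a bool query: a scored full-text match plus unscored filters."""
    filters = []
    if court is not None:
        # term filters match exact values and do not influence the score
        filters.append({"term": {"court": court}})
    if date_from is not None:
        filters.append({"range": {"received_date": {"gte": date_from}}})
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "filter": filters,
            }
        },
    }


# Hypothetical search: the word "bankruptcy" in documents from one court.
body = build_search_body("bankruptcy", court="Stockholm", date_from="2020-01-01")
print(json.dumps(body, indent=2))
```

Clauses under `must` contribute to the relevance score, while clauses under `filter` only include or exclude documents, which is why the ranking stays driven by the text match.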

LiteBreeze can build structured data management systems that effectively process and store data and give multiple users real-time access to it. This suits businesses such as news agencies and research organizations like ActaPublica.

Key features of ActaPublica

Document insertion and processing

Document insertion: Documents that should be available in the search database are inserted from several types of sources, such as direct file uploads, automated imports from online providers, and our sister projects through an internal API.

Document processing: The inserted documents are processed and the searchable data are stored in the database for faster searches.

Automated document processing scripts generate file previews, create backups and extract data such as personal identity numbers and regional data for faster search.
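
The extraction of personal identity numbers can be sketched as follows. Swedish personal identity numbers end in a check digit computed with the Luhn algorithm, so candidates found by a pattern can be checked before they are stored. This is a minimal sketch, not the project's actual script: the function names and the pattern are assumptions, and the date part is not validated.

```python
import re

# Candidate pattern: optional century, YYMMDD, optional separator, four digits.
CANDIDATE = re.compile(r"\b(?:\d{2})?(\d{6})[-+]?(\d{4})\b")


def luhn_valid(ten_digits):
    """Check the Luhn checksum over a 10-digit number (YYMMDDNNNC)."""
    total = 0
    for index, char in enumerate(ten_digits):
        # Digits in even positions (0-based) are doubled, the rest are kept.
        value = int(char) * (2 if index % 2 == 0 else 1)
        total += value - 9 if value > 9 else value
    return total % 10 == 0


def extract_identity_numbers(text):
    """Return checksum-valid numbers found in text, normalised to YYMMDD-NNNN."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = match.group(1) + match.group(2)
        if luhn_valid(digits):
            found.append(f"{match.group(1)}-{match.group(2)}")
    return found


sample = "Parties: 811218-9876 and 19811218-9876; case ref 811218-9875."
numbers = extract_identity_numbers(sample)
```

The checksum rejects most digit strings that merely look like an identity number, such as the last one in the sample, which keeps false matches out of the search index.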

Types of documents: ActaPublica is designed to handle a multitude of file types, including text and PDF documents, audio and video files, zip files, etc.

Document search

Document search: Searching among the available documents using different combinations of search parameters is the main feature of ActaPublica.

Wildcard search: Users can enter wildcard expressions to find relevant results. The symbol * matches any sequence of characters, and ? matches any single character in the keyword.
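
To make the wildcard semantics concrete, the sketch below translates a wildcard expression into a regular expression in Python. It only illustrates how * and ? behave; in the application the matching is done by the search engine, and the helper names here are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")  # any sequence of characters, including none
        elif char == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, term):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None
```

With these definitions, `matches("sven*", "Svensson")` and `matches("s?derberg", "soderberg")` are true, while `matches("s?derberg", "sderberg")` is false because ? requires exactly one character.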

Search result filters: Filtering the search results using further criteria is possible with intelligent filters, where only the filtering options relevant to the current search result are displayed.
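
The idea of showing only relevant filtering options can be sketched in Python by counting the values that actually occur in the current result set. In Elasticsearch this kind of count is typically obtained with a terms aggregation; plain Python is used here to keep the example self-contained. The field names and sample documents are made up for illustration.

```python
from collections import Counter


def available_filters(results, fields):
    """Count the values present in the current results for each filter field."""
    options = {}
    for field in fields:
        counts = Counter(doc[field] for doc in results if doc.get(field))
        if counts:  # fields with no values in this result set are not offered
            options[field] = dict(counts)
    return options


results = [
    {"title": "Judgment A", "court": "Svea hovrätt", "type": "judgment"},
    {"title": "Judgment B", "court": "Svea hovrätt", "type": "judgment"},
    {"title": "Report C", "court": None, "type": "annual report"},
]
filters = available_filters(results, ["court", "type", "agency"])
```

Here `filters` contains options for `court` and `type` only; `agency` is left out because no document in the current result carries that field.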

Downloading search results: The documents listed in the search result can be downloaded either individually or as a zip archive.

Multiple documents can be selected from the search result and downloaded as a compressed zip file, which reduces the number of individual downloads the user has to perform.
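
Bundling several selected documents into one download can be sketched with Python's standard `zipfile` module. The helper `build_zip` and the file names are hypothetical, and the sketch keeps everything in memory, which may not suit very large documents.

```python
import io
import zipfile


def build_zip(documents):
    """Pack a {filename: bytes} mapping into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in documents.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Two selected documents become a single download.
archive_bytes = build_zip({
    "judgment-a.txt": b"Text of judgment A",
    "report-c.txt": b"Text of report C",
})
```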

Security

Request throttling: Search and download requests to ActaPublica are continuously monitored to ensure data security and to protect against online attacks.

If a user exceeds the allotted number of requests within a time frame, it is logged and further requests are prevented until the time frame ends.
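
The behaviour described above resembles a fixed-window rate limiter. The following Python sketch shows one way such a limiter could work; the class name, the in-memory storage and the injectable clock are assumptions for illustration, and a production setup would more likely keep the counters in a shared store.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per user in each `window`-second frame."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.state = {}        # user id -> (frame start, request count)
        self.blocked_log = []  # (user id, timestamp) of rejected requests

    def allow(self, user_id):
        now = self.clock()
        start, count = self.state.get(user_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous time frame has ended
        if count >= self.limit:
            # Log the violation and reject until the frame ends.
            self.blocked_log.append((user_id, now))
            self.state[user_id] = (start, count)
            return False
        self.state[user_id] = (start, count + 1)
        return True
```

Passing the clock in as a parameter makes the limiter easy to test without waiting for real time to pass.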

Search agents

Search agents: Users can save search criteria as an ‘agent’ so that the same search can be applied to future documents. These agents are used to notify users of new matching documents or to filter documents on the search page.

Types of available agents: Number search (search based on personal identity numbers), Court search (based on court name), Text search (based on any text from the document contents) and Advanced search (with multiple filter options including the other search types).

Agent highlights: Once an agent matches a new document, the specific part which matched the agent is stored as a highlight so that the users can get a clear context of the matching text.
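
A highlight of this kind can be sketched as a snippet cut around the first match. The Python example below is a simplified assumption of how this could be done for a plain text agent; the function name and the context width are hypothetical, and a search engine's own highlighting feature could serve the same purpose.

```python
import re


def find_highlight(text, term, context=30):
    """Return the first match of `term` with surrounding context, or None."""
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    # Mark the snippet as truncated where it does not reach the text boundary.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


document = "The district court ruled that Example AB must repay the loan in full."
highlight = find_highlight(document, "example ab", context=10)
```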

Agent creation from search criteria: Although agents can be created on their own, they can also be created from the search page using the current search criteria.

This enables users to try out the search before creating the corresponding agents.

Email notification: Users can enable email notifications for their agents and are then notified by email about new matching documents.


API

Application programming interface: Unique API access credentials can be used to access data from an account, to build custom services and solutions.

The API can be used to retrieve search results for document and agent searches, and also to retrieve individual document and agent details in JSON format.

Localization

Multilingual support: The application supports eight languages: English, Danish, German, Spanish, French, Italian, Portuguese and Swedish. Users can easily switch between languages from the footer of the platform.

Admin Panel

Features: The Admin Panel is a one-stop control center for the entire ActaPublica environment. Only users with administrative privileges can access the Admin Panel.

User management: Users can be created and managed from the Admin Panel by administrators. They can be grouped into organisations and be assigned privileges.

Permissions and roles: Users and organisations can be assigned permissions and roles to restrict usage and access.
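
One common way to model this is role-based access control, sketched below in Python. Everything in it is an assumption for illustration: the role names, the permission names, and the rule that a user inherits the roles of their organisation are not taken from the actual application.

```python
from dataclasses import dataclass, field

# Hypothetical role definitions; the real roles and permissions are not documented here.
ROLE_PERMISSIONS = {
    "reader": {"search"},
    "researcher": {"search", "download", "manage_agents"},
    "admin": {"search", "download", "manage_agents", "admin_panel"},
}


@dataclass
class Organisation:
    name: str
    roles: set = field(default_factory=set)


@dataclass
class User:
    name: str
    organisation: Organisation
    roles: set = field(default_factory=set)

    def permissions(self):
        """Union of permissions granted by the user's and the organisation's roles."""
        granted = set()
        for role in self.roles | self.organisation.roles:
            granted |= ROLE_PERMISSIONS.get(role, set())
        return granted

    def can(self, permission):
        return permission in self.permissions()


newsroom = Organisation("Newsroom", roles={"reader"})
reporter = User("reporter", newsroom, roles={"researcher"})
intern = User("intern", newsroom)
```

In this sketch `reporter.can("download")` is true, while `intern.can("download")` is false because the intern only has the organisation's `reader` role.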

Activity logs

Usage logs: All user activities are logged so that they can be traced back later. Real-time error notifications help the developers keep the services properly maintained.

Agent processing logs: The various stages of agent processing are logged in the backend. This keeps track of the events to aid auditing and debugging.

Technical information

Elastic Compute Cloud (EC2) hosting: The application runs on a high-performance EC2 instance to provide a hassle-free experience to end users.

Relational Database Service (RDS) Aurora Serverless: Automatically starts up, shuts down and scales database capacity according to the application’s needs, for example scaling down when the workload drops during weekends.

Lambda functions: Allow code to run without provisioning or managing servers, and scale automatically.

Simple Storage Service (S3): Used to store the documents and deliver results to the application upon requests for document previews and downloads.

Textract and Rekognition: The application utilizes these AWS text identification/extraction services to identify and process the contents of new documents.

Simple Email Service (SES): Sends email alerts to users as soon as matching documents are identified for saved agents.

Kinesis: Kinesis streams are used for big data streaming and handling.

Step Functions: Used to build efficient and fast serverless applications. In combination with AWS Lambda, they serve as the backend for the internal document insertion API.

Simple Notification Service (SNS): Used in combination with SES to monitor emails sent from the application. This helps in blacklisting email addresses that bounce.
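
As a sketch of the bounce handling, the Python function below reads an SNS message that carries an SES bounce notification and returns the addresses to blacklist. SNS delivers the SES notification as a JSON string in its `Message` field. Treating only permanent bounces as blacklist candidates is an assumption of this example, as are the function name and the sample data.

```python
import json


def bounced_addresses(sns_message_body):
    """Return addresses from a permanent-bounce SES notification sent via SNS."""
    envelope = json.loads(sns_message_body)
    # The SES notification is embedded as a JSON string inside the SNS envelope.
    notification = json.loads(envelope["Message"])
    if notification.get("notificationType") != "Bounce":
        return []
    bounce = notification["bounce"]
    if bounce.get("bounceType") != "Permanent":
        return []  # transient bounces (e.g. a full mailbox) are skipped here
    return [r["emailAddress"] for r in bounce.get("bouncedRecipients", [])]


# Made-up sample message in the shape of an SES bounce notification.
ses_notification = {
    "notificationType": "Bounce",
    "bounce": {
        "bounceType": "Permanent",
        "bouncedRecipients": [{"emailAddress": "user@example.com"}],
    },
}
sns_body = json.dumps({"Type": "Notification", "Message": json.dumps(ses_notification)})
to_blacklist = bounced_addresses(sns_body)
```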

Automated deployment: Releases are made easier and faster with automated deployments using AWS CloudFormation and Bitbucket Pipelines. AWS CloudFormation is a service for securely provisioning AWS resources. Bitbucket Pipelines is a CI/CD service.

Elasticsearch, Logstash and Kibana (ELK) stack: This is used to efficiently store, retrieve and log data.

The Elasticsearch store is hosted on Elastic Cloud. This allows the use of the latest Elasticsearch versions and makes features like Elastic X-Pack, extensive monitoring capabilities and snapshotting available.

Logstash is used as a data pipeline for Elasticsearch. Kibana is used for monitoring and navigating through Elasticsearch.

Database: The application’s SQL database is configured on AWS Aurora Serverless for better data security and easy backups.

Aurora Serverless automatically starts up, shuts down and scales database capacity according to the application’s needs.

Responsive design: The website is built on the Bootstrap framework to provide a responsive and optimized user experience on all devices and orientations.

Coding standards: The project follows the PSR-12 coding standard to keep the PHP code readable and easily maintainable, with proper code comments and PHPDoc blocks.

Current HTML5 and CSS3 are used for the web pages, supporting the safety, compatibility and stability of the website.

Future challenges

Improved UX – We need to keep improving the user experience based on user feedback and industry updates, so that users can get the best out of the platform.

Open Distro for Elasticsearch – We are considering switching to Open Distro for Elasticsearch as an open-source, enterprise-grade and community-driven alternative to Elastic Cloud.

Faster notifications – The aim is to further improve the turn-around time for mail notifications by making use of various modern solutions such as AWS Lambda.

Improve code coverage – We will make better use of design patterns such as the repository pattern and improve unit-test coverage.

Multiple authentication methods – OTP and BankID based authentication.

We trust LiteBreeze with all our web development work in Siren and Acta Publica. They developed the archive service for Acta Publica which all Swedish media firms rely on daily. We are excited about ongoing work for Siren and recommend LiteBreeze for their AWS expertise. - Martin Fredriksson (Stockholm, Sweden)
Team of developers who worked on this project: Preeth, Bheem, Saji, Praveen R, Dileep