Legal information retrieval

Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important for providing access to the law to laypeople and legal professionals. Its importance has increased because of the vast and quickly growing number of legal documents available through electronic means. Legal information retrieval is a part of the growing field of legal informatics.

In a legal setting, it is frequently important to retrieve all information related to a specific query. However, commonly used boolean search methods (exact matches of specified terms) on full-text legal documents have been shown to have an average recall rate as low as 20 percent, meaning that only 1 in 5 relevant documents is actually retrieved. In that study, the researchers believed that they had retrieved over 75% of relevant documents. Such low recall may result in failing to retrieve important or precedential cases. In some jurisdictions this may be especially problematic, as legal professionals are ethically obligated to be reasonably informed as to relevant legal documents.

Legal information retrieval attempts to increase the effectiveness of legal searches by increasing the number of relevant documents retrieved (providing a high recall rate) and reducing the number of irrelevant documents retrieved (a high precision rate). This is a difficult task, as the legal field is prone to jargon, polysemes (words that have different meanings when used in a legal context), and constant change.

Techniques used to achieve these goals generally fall into three categories: boolean retrieval, manual classification of legal text, and natural language processing of legal text.

== Problems ==

Application of standard information retrieval techniques to legal text can be more difficult than application in other subjects. One key problem is that the law rarely has an inherent taxonomy.
Instead, the law is generally filled with open-ended terms, which may change over time. This can be especially true in common law countries, where each decided case can subtly change the meaning of a certain word or phrase.

Legal information systems must also be programmed to deal with law-specific words and phrases. Though this is less problematic for words which exist solely in law, legal texts also frequently use polysemes: words that may have different meanings when used in a legal or common-speech manner, potentially both within the same document. The legal meaning may also depend on the area of law in which the word is applied. For example, in the context of European Union legislation, the term "worker" has four different meanings:

* Any worker as defined in Article 3(a) of Directive 89/391/EEC who habitually uses display screen equipment as a significant part of his normal work;
* Any person employed by an employer, including trainees and apprentices but excluding domestic servants;
* Any person carrying out an occupation on board a vessel, including trainees and apprentices, but excluding port pilots and shore personnel carrying out work on board a vessel at the quayside;
* Any person who, in the Member State concerned, is protected as an employee under national employment law and in accordance with national practice.

It also has the common meaning:

* A person who works at a specific occupation.

Though the terms may be similar, correct information retrieval must differentiate between the intended use and irrelevant uses in order to return the correct results.

Even if a system overcomes the language problems inherent in law, it must still determine the relevancy of each result. In the context of judicial decisions, this requires determining the precedential value of the case. Case decisions from senior or superior courts may be more relevant than those from lower courts, even where the lower court's decision contains more discussion of the relevant facts.
The opposite may be true, however, if the senior court has only a minor discussion of the topic (for example, if it is a secondary consideration in the case). An information retrieval system must also be aware of the authority of the jurisdiction. A case from a binding authority is most likely of more value than one from a non-binding authority.

Additionally, the intentions of the user may determine which cases they find valuable. For instance, where a legal professional is attempting to argue a specific interpretation of law, they might find a minor court's decision which supports their position more valuable than a senior court's position which does not. They may also value similar positions from different areas of law, different jurisdictions, or dissenting opinions.

Overcoming these problems can be made more difficult because of the large number of cases available. The number of legal cases available via electronic means is constantly increasing (in 2003, US appellate courts handed down approximately 500 new cases per day), meaning that an accurate legal information retrieval system must incorporate methods of both sorting past data and managing new data.

== Techniques ==

=== Boolean searches ===

Boolean searches, where a user may specify terms such as use of specific words or judgments by a specific court, are the most common type of search available via legal information retrieval systems. They are widely implemented but overcome few of the problems discussed above.

The recall and precision rates of these searches vary depending on the implementation and searches analyzed. One study found a basic boolean search's recall rate to be roughly 20%, and its precision rate to be roughly 79%. Another study implemented a generic search (that is, not designed for legal uses) and found a recall rate of 56% and a precision rate of 72% among legal professionals. Both numbers increased when searches were run by non-legal professionals, to a 68% recall rate and 77% precision rate.
This is likely explained by the use of complex legal terms by the legal professionals.

=== Manual classification ===

In order to overcome the limits of basic boolean searches, information systems have attempted to classify case law and statutes into more computer-friendly structures. Usually, this results in the creation of an ontology to classify the texts, based on the way a legal professional might think about them. These ontologies attempt to link texts on the basis of their type, their value, and/or their topic areas. Most major legal search providers now implement some sort of classification search, such as Westlaw's “Natural Language” or LexisNexis' Headnote searches. Additionally, both of these services allow browsing of their classifications, via Westlaw's West Key Numbers or Lexis' Headnotes. Though these two search algorithms are proprietary and secret, it is known that they employ manual classification of text (though this may be computer-assisted).

These systems can help overcome the majority of problems inherent in legal information retrieval systems, in that manual classification has the greatest chance of identifying landmark cases and understanding the issues that arise in the text. In one study, ontological searching resulted in a precision rate of 82% and a recall rate of 97% among legal professionals. The legal texts included, however, were carefully controlled to just a few areas of law in a specific jurisdiction.

The major drawback to this approach is the requirement of using highly skilled legal professionals and large amounts of time to classify texts. As the amount of text available continues to increase, some have stated their belief that manual classification is unsustainable.

=== Natural language processing ===

In order to reduce the reliance on legal professionals and the amount of time needed, efforts have been made to create a system to automatically classify legal text and queries.
Adequate translation of both would allow accurate information retrieval without the high cost of human classification. These automatic systems generally employ natural language processing (NLP) techniques that are adapted to the legal domain, and also require the creation of a legal ontology. Though multiple systems have been postulated, few have reported results. One system, “SMILE,” which attempted to automatically extract classifications from case texts, resulted in an F-measure (a single score combining recall and precision) of under 0.3 (compared to a perfect F-measure of 1.0). This is probably much lower than an acceptable rate for general usage. Despite the limited results, many theorists predict that the evolution of such systems will eventually replace manual classification systems.

=== Citation-based ranking ===

In the mid-90s the Room 5 case law retrieval project used citation mining for summaries and ranked its search results based on citation type and count. This slightly pre-dated the PageRank algorithm at Stanford, which was also a citation-based ranking. Ranking of results was based

Landweber iteration

The Landweber iteration or Landweber algorithm is an algorithm to solve ill-posed linear inverse problems, and it has been extended to solve non-linear problems that involve constraints. The method was first proposed in the 1950s by Louis Landweber, and it can now be viewed as a special case of many other more general methods.

== Basic algorithm ==

The original Landweber algorithm attempts to recover a signal $x$ from (noisy) measurements $y$. The linear version assumes that $y = Ax$ for a linear operator $A$. When the problem is in finite dimensions, $A$ is just a matrix.

When $A$ is nonsingular, an explicit solution is $x = A^{-1}y$. However, if $A$ is ill-conditioned, the explicit solution is a poor choice since it is sensitive to any noise in the data $y$. If $A$ is singular, this explicit solution does not even exist. The Landweber algorithm is an attempt to regularize the problem, and is one of the alternatives to Tikhonov regularization. We may view the Landweber algorithm as solving

$$\min_{x} \|Ax - y\|_{2}^{2}/2$$

using an iterative method. The algorithm is given by the update

$$x_{k+1} = x_{k} - \omega A^{*}(Ax_{k} - y),$$

where the relaxation factor $\omega$ satisfies $0 < \omega < 2/\sigma_{1}^{2}$. Here $\sigma_{1}$ is the largest singular value of $A$. If we write $f(x) = \|Ax - y\|_{2}^{2}/2$, then the update can be written in terms of the gradient

$$x_{k+1} = x_{k} - \omega \nabla f(x_{k})$$

and hence the algorithm is a special case of gradient descent.

For ill-posed problems, the iterative method needs to be stopped at a suitable iteration index, because it semi-converges.
This means that the iterates approach a regularized solution during the first iterations, but become unstable in further iterations. The reciprocal of the iteration index $1/k$ acts as a regularization parameter. A suitable parameter is found when the mismatch $\|Ax_{k} - y\|_{2}^{2}$ approaches the noise level.

Using the Landweber iteration as a regularization algorithm has been discussed in the literature.

== Nonlinear extension ==

In general, the updates generated by

$$x_{k+1} = x_{k} - \tau \nabla f(x_{k})$$

will generate a sequence $f(x_{k})$ that converges to a minimizer of $f$ whenever $f$ is convex and the stepsize $\tau$ is chosen such that $0 < \tau < 2/(\|\nabla f\|^{2})$, where $\|\cdot\|$ is the spectral norm.

Since this is a special type of gradient descent, there currently is not much benefit to analyzing it on its own as the nonlinear Landweber, but such analysis was performed historically by many communities not aware of unifying frameworks. The nonlinear Landweber problem has been studied in many papers in many communities.

== Extension to constrained problems ==

If $f$ is a convex function and $C$ is a convex set, then the problem

$$\min_{x \in C} f(x)$$

can be solved by the constrained, nonlinear Landweber iteration, given by

$$x_{k+1} = \mathcal{P}_{C}(x_{k} - \tau \nabla f(x_{k})),$$

where $\mathcal{P}$ is the projection onto the set $C$. Convergence is guaranteed when $0 < \tau < 2/(\|A\|^{2})$. This is again a special case of projected gradient descent (which is a special case of the forward–backward algorithm).
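The linear update and its projected variant can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the helper names (`matvec`, `transpose`, `sigma1_squared`, `landweber`, `clip_nonnegative`) are hypothetical, the matrix is real so that $A^{*}$ is the transpose, $\sigma_{1}^{2}$ is estimated by power iteration on $A^{*}A$, and the constraint set is assumed to be the nonnegative orthant. The iteration count `n_iter` plays the role of the stopping index discussed above.

```python
import math

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    """Transpose of a real matrix; stands in for the adjoint A*."""
    return [list(col) for col in zip(*A)]

def sigma1_squared(A, iters=200):
    """Estimate sigma_1^2 (largest eigenvalue of A^T A) by power iteration."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    lam = 0.0
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        lam = math.sqrt(sum(wi * wi for wi in w))
        if lam == 0.0:
            return 0.0
        v = [wi / lam for wi in w]
    return lam

def landweber(A, y, omega, n_iter, project=None):
    """Iterate x <- x - omega * A^T (A x - y), optionally projecting onto C."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        residual = [ri - yi for ri, yi in zip(matvec(A, x), y)]
        gradient = matvec(At, residual)
        x = [xi - omega * gi for xi, gi in zip(x, gradient)]
        if project is not None:
            x = project(x)
    return x

def clip_nonnegative(x):
    """Projection onto the convex set C = {x : x_i >= 0} (an assumed example set)."""
    return [max(0.0, xi) for xi in x]

A = [[2.0, 0.0], [0.0, 1.0]]
omega = 1.0 / sigma1_squared(A)  # lies inside the interval (0, 2 / sigma_1^2)

# Unconstrained case: the exact solution of A x = y is [1, 3].
x_hat = landweber(A, [2.0, 3.0], omega, n_iter=500)

# Constrained case: the second component would be negative, so it is clipped to 0.
x_proj = landweber(A, [2.0, -3.0], omega, n_iter=500, project=clip_nonnegative)
```

With noisy data one would typically stop much earlier than 500 iterations, for example once the mismatch $\|Ax_{k} - y\|_{2}^{2}$ reaches the estimated noise level.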
== Applications ==

Since the method has been around since the 1950s, it has been adopted and rediscovered by many scientific communities, especially those studying ill-posed problems. In X-ray computed tomography it is called simultaneous iterative reconstruction technique (SIRT). It has also been used in the computer vision community and the signal restoration community. It is also used in image processing, since many image problems, such as deconvolution, are ill-posed. Variants of this method have also been used in sparse approximation problems and compressed sensing settings.

GeneXus

GeneXus is a low-code, cross-platform, knowledge representation-based development tool, mainly oriented towards enterprise-class applications for the web, smart devices, and the Microsoft Windows platform. GeneXus uses a mostly declarative language to generate native code for multiple environments. It includes a normalization module, which creates and maintains an optimal database structure based on user views. The languages for which code can be generated include COBOL, Java, Objective-C, RPG, Ruby, Visual Basic, and Visual FoxPro. Some of the DBMSs supported are Microsoft SQL Server, Oracle, IBM Db2, Informix, PostgreSQL, and MySQL. GeneXus was developed by the Uruguayan company ARTech Consultores SRL, which was later renamed Genexus SA. The latest version is GeneXus 18, which was released on November 10, 2022.

KeyBase

KeyBase is a database and web application for managing and deploying interactive taxonomic keys for plants and animals, developed by the Royal Botanic Gardens Victoria. KeyBase provides a medium where pathway keys, which were traditionally developed for print and other classical types of media, can be used more effectively in the internet environment. The platform is built around "keys", which can be easily linked together, joined with other keys, or merged into larger, seamless groups of keys, with each key still available to be browsed independently. Keys in the KeyBase database can be filtered and displayed in a variety of ways and formats.

SWILE

SWILE (formerly: Lunchr) is a French app-based company that focuses on improving the employee experience. Among others, the platform offers meal vouchers, gift vouchers, mobility vouchers, and business travel solutions. In March 2020, it was renamed SWILE and entered the lunch break and meal voucher market.

== History ==

The company was founded as Lunchr by Loïc Soubeyrand in 2016. Originally, Lunchr was an app for pre-ordering lunch on the spot or to go. In January 2017, the company raised €2.5 million in seed funding from Daphni. In 2018, the company raised €11 million (Series A) from Idinvest, followed by another €30 million in February 2019 (Series B), notably from Index Ventures and Kima Ventures.

In January 2020, Lunchr became one of the first startups to join the French Tech 120. A few months later, in March, Lunchr diversified its services, adding team life management tools and changing its brand name to Swile. In June 2020, the company raised a further €70 million in a new round of financing (Series C) from the same investors and the BPI. In November 2020, Swile acquired Briq, a startup specializing in employee engagement.

In January 2021, Swile won a tender with Carrefour and distributed 62,000 Swile cards to its employees. In early October 2021, a new $200 million (€175 million) fundraising round, in which the Japanese company SoftBank joined other investors, gave Swile a valuation of $1 billion. President Emmanuel Macron cited the company as "a further proof that FrenchTech is at the forefront internationally." In May 2022, the company acquired the travel management start-up Okarito for €6 million.

== Overview ==

Swile operates in two countries (France and Brazil) and has a total of 1000 employees, 5.5 million users and 85,000 corporate customers, including Carrefour, Le Monde, JCDecaux, PSG, Airbnb, Spotify, Red Bull, and TikTok in the private sector, as well as numerous local authorities and government ministries in the public sector.

ISLRN

The ISLRN or International Standard Language Resource Number is a persistent unique identifier for language resources.

== Context ==

On November 18, 2013, 12 major organisations (see list below) from the fields of Language Resources and Technologies, Computational Linguistics, and Digital Humanities held a cooperation meeting in Paris (France) and agreed to announce the establishment of the International Standard Language Resource Number (ISLRN), to be assigned to each Language Resource. Among the 12 organisations, 4 institutions constitute the ISLRN Steering Committee (ST):

* ADHO
* ACL
* Asian Federation of Natural Language Processing (ST)
* COCOSDA, International Committee for the Coordination & Standardisation of Speech Databases and Assessment Techniques
* ICCL (COLING)
* European Data Forum
* ELRA (ST)
* IAMT, International Association for Machine Translation
* ISCA
* LDC (ST)
* Oriental COCOSDA (ST)
* RMA, Language Resource Management Agency

== Size and Content ==

The Joint Research Centre (JRC), the European Commission's in-house science service, was the first organisation to adopt the ISLRN initiative and to request ISLRNs. 2500 resources and tools have already been allocated an ISLRN. These resources include written data (annotated corpus, annotated text, list of misspelled words, terminological database, treebank, wordnet, etc.) and speech corpora (synthesised speech, transcripts and audiovisual recordings, conversational speech, folk sayings, etc.).

== Objectives ==

Providing Language Resources with unique names and identifiers using a standardized nomenclature ensures the identification of each Language Resource and streamlines citation with proper references in activities within Human Language Technology as well as in documents and scientific publications. Such a unique identifier also enhances reproducibility, an essential feature of scientific work.

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

== History ==

Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.

=== Naming ===

The name of the Pig programming language was chosen arbitrarily. The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck because it is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.

== Example ==

Below is an example of a "Word Count" program in Pig Latin:

The above program will generate parallel executable tasks which can be distributed across multiple machines in a Hadoop cluster to count the number of words in a dataset such as all the webpages on the internet.
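As a rough illustration of the stages a word-count dataflow passes through, here is a minimal single-machine sketch in plain Python. It is not Pig Latin and does not run on Hadoop; the function name `word_count`, the sample input, and the stage labels in the comments are assumptions made for illustration only.

```python
import re
from collections import Counter

def word_count(lines):
    """Count words in an iterable of text lines, most frequent first."""
    # Tokenize: split every line into whitespace-separated tokens.
    words = (token for line in lines for token in line.split())
    # Filter: keep only tokens made entirely of word characters.
    filtered = (w for w in words if re.fullmatch(r"\w+", w))
    # Group and count: tally the occurrences of each distinct word.
    counts = Counter(filtered)
    # Order: descending by count, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

sample = ["the quick brown fox", "the lazy dog", "the fox"]
result = word_count(sample)
```

On a cluster, each of these stages would be compiled into distributed tasks rather than run in a single process, which is the part the platform handles for the programmer.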
== Pig vs SQL ==

In comparison to SQL, Pig has a nested relational model, uses lazy evaluation, uses extract, transform, load (ETL), is able to store data at any point during a pipeline, declares execution plans, and supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines.

On the other hand, it has been argued that DBMSs are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users can specify that data from two tables must be joined, but generally not which join implementation to use. Even where SQL allows the join implementation to be specified, "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm." Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.
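The idea of splitting one data stream and applying a different operator to each sub-stream can be sketched in plain Python. This is only an analogy for the DAG-shaped workflows described above, not Pig Latin; the helper `split_stream`, the sample records, and the age threshold are hypothetical.

```python
def split_stream(records, predicate):
    """Split one stream into two sub-streams according to a predicate."""
    matched, unmatched = [], []
    for record in records:
        (matched if predicate(record) else unmatched).append(record)
    return matched, unmatched

# Hypothetical (name, age) records flowing through the pipeline.
records = [("alice", 34), ("bob", 17), ("carol", 52), ("dave", 15)]

# One input, two branches: the workflow is a small DAG, not a single chain.
adults, minors = split_stream(records, lambda r: r[1] >= 18)

# A different operator is applied to each sub-stream.
adult_names = sorted(name.upper() for name, _ in adults)
minor_avg_age = sum(age for _, age in minors) / len(minors)
```

Each branch produces its own result, which is the contrast with a query that is oriented around a single output.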