Elasticsearch
cheat sheet and summary
https://www.elastic.co/guide/index.html
https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html
https://www.edureka.co/blog/elasticsearch-tutorial/
https://www.elastic.co/training/free#quick-starts
https://www.runtastic.com/blog/en/increasing-search-engine-relevance-elasticsearch/
Inside:
Uses Lucene
Index: analogous to a database
An index is a collection of documents
Type: analogous to a table (mapping types are removed in 7.x+)
Document: a JSON document (analogous to a row)
Shard:
Fuzzy query: matches terms within a given edit distance (number of single-character differences)
Analyzer:
Explain: shows how the score for a result was computed
ELK
Logstash
Kibana, Machine Learning
https://logz.io/learn/complete-guide-elk-stack/
Visualization, monitoring, Beats: Heartbeat (uptime monitoring) and other Beats, graphs
Uses Zen Discovery (instead of ZooKeeper)
Shard = Lucene index = group of segments
Highlight (tells why and where the match happened)
Searching: simple, Query DSL, filtered, phrase (exact combination)
ES_JAVA_OPTS="-Xms10g -Xmx10g"
Flush: The flush API flushes one or more indices. It releases memory by pushing data from the in-memory buffer to the index storage and clearing the internal transaction log.
Refresh: The refresh API explicitly refreshes one or more indices, making all operations performed since the last refresh available for search.
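Both are plain POST endpoints; a minimal sketch (the index name `my-index` is a placeholder):

```json
POST /my-index/_flush

POST /my-index/_refresh
```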
Term vectors: get detailed per-term information about a document (term frequency, document frequency, positions, ...)
Task status, cancel
Relationship can be done with Parent/child and Nested
Similarity
APM
Application Performance Monitoring
Mapping
The schema for an index. More dynamic than SQL, since fields can be added on the fly and runtime (virtual) fields are supported.
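A minimal explicit mapping, as a sketch (index and field names are hypothetical):

```json
PUT /products
{
  "mappings": {
    "properties": {
      "name":    { "type": "text" },
      "sku":     { "type": "keyword" },
      "price":   { "type": "double" },
      "created": { "type": "date" }
    }
  }
}
```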
Data type
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
Common types
binary Binary value encoded as a Base64 string.
boolean true and false values.
Keywords The keyword family, including keyword, constant_keyword, and wildcard.
Numbers Numeric types, such as long and double, used to express amounts.
Dates: Date types, including date and date_nanos.
alias Defines an alias for an existing field.
Objects and relational types
Structured data types
Range Range types, such as long_range, double_range, date_range, and ip_range.
ip IPv4 and IPv6 addresses.
version Software versions. Supports Semantic Versioning precedence rules.
murmur3 Computes and stores hashes of values.
Aggregate data types
aggregate_metric_double Pre-aggregated metric values.
histogram Pre-aggregated numerical values in the form of a histogram.
Text search types
text fields The text family, including text and match_only_text. Analyzed, unstructured text.
annotated-text Text containing special markup. Used for identifying named entities.
completion Used for auto-complete suggestions.
search_as_you_type text-like type for as-you-type completion.
token_count A count of tokens in a text.
Document ranking types
dense_vector Records dense vectors of float values.
sparse_vector Records sparse vectors of float values.
rank_feature Records a numeric feature to boost hits at query time.
rank_features Records numeric features to boost hits at query time.
Spatial data types
Other types
percolator: Indexes queries written in Query DSL.
Query DSL
Search collapse
Highlight: shows which parts of the field matched
Async search
Sort result: https://www.elastic.co/guide/en/elasticsearch/reference/current/sort-search-results.html
Query context, Filter context
Leaf Query Clauses
Match
Term
Phrase
Wildcard
Fuzzy
Full text queries: match, match_phrase (exact phrase), multi_match (search across multiple fields)
intervals: A full text query that allows fine-grained control of the ordering and proximity of matching terms.
match: The standard query for performing full text queries, including fuzzy matching and phrase or proximity queries.
match_bool_prefix: Creates a bool query that matches each term as a term query, except for the last term, which is matched as a prefix query.
match_phrase: Like the match query but used for matching exact phrases or word proximity matches.
match_phrase_prefix: Like the match_phrase query, but does a wildcard search on the final word.
multi_match: The multi-field version of the match query.
combined_fields: Matches over multiple fields as if they had been indexed into one combined field.
query_string: Supports the compact Lucene query string syntax, allowing you to specify AND|OR|NOT conditions and multi-field search within a single query string. For expert users only.
simple_query_string: A simpler, more robust version of the query_string syntax suitable for exposing directly to users.
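A sketch of a typical full-text search combining multi_match with fuzziness (index and field names are hypothetical; `^2` boosts the title field):

```json
GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "quick brown fox",
      "fields": ["title^2", "body"],
      "fuzziness": "AUTO"
    }
  }
}
```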
Term-level queries: term, terms, range, exists, prefix, wildcard, regexp, fuzzy (edit distance), terms_set
Boosting query: positive, negative, and negative_boost
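A boosting-query sketch: documents matching the negative clause stay in the results but have their score multiplied by negative_boost (index and field names are hypothetical):

```json
GET /products/_search
{
  "query": {
    "boosting": {
      "positive": { "match": { "description": "apple" } },
      "negative": { "match": { "description": "pie" } },
      "negative_boost": 0.5
    }
  }
}
```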
Constant score Queries
Disjunction max Queries
Function score Queries
Scripting: Painless, Expression, Mustache, Java
https://www.elastic.co/guide/en/elasticsearch/painless/current/index.html
decay functions: gauss, linear, exp
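A function_score sketch with a gauss decay on a geo_point field, so scores fall off with distance from the origin (index, field, and coordinates are hypothetical):

```json
GET /places/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "name": "coffee" } },
      "functions": [
        {
          "gauss": {
            "location": {
              "origin": "52.52,13.40",
              "scale": "2km",
              "decay": 0.5
            }
          }
        }
      ]
    }
  }
}
```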
distance_feature: A query that computes scores based on the dynamically computed distances between the origin and documents' date, date_nanos, and geo_point fields. It is able to efficiently skip non-competitive hits.
more_like_this: Finds documents which are similar to the specified text, document, or collection of documents.
percolate: Finds queries that are stored as documents and that match the specified document.
rank_feature: A query that computes scores based on the values of numeric features and is able to efficiently skip non-competitive hits.
script: Allows a script to act as a filter. Also see the function_score query.
script_score: A query that allows you to modify the score of a sub-query with a script.
wrapper: A query that accepts other queries as a json or yaml string.
pinned: A query that promotes selected documents over others matching a given query.
nested query: Used for documents containing fields of the nested type. With this query, you can query each nested object as an independent document.
has_child & has_parent queries: Used to query the parent-child relationship between two document types within a single index. The has_child query returns the matching parent documents, while the has_parent query returns the matching child documents.
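A nested-query sketch, where both conditions must hold within the same nested object (index and field names are hypothetical):

```json
GET /orders/_search
{
  "query": {
    "nested": {
      "path": "items",
      "query": {
        "bool": {
          "must": [
            { "match": { "items.name": "keyboard" } },
            { "range": { "items.qty": { "gte": 2 } } }
          ]
        }
      }
    }
  }
}
```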
https://www.elastic.co/guide/en/elasticsearch/guide/master/geopoints.html
geo_point: Fields that support lat/lon pairs.
geo_shape: Fields that support points, lines, circles, polygons, multi-polygons, etc.
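A geo_distance filter sketch on a geo_point field; it runs in filter context, so it doesn't affect scoring (index and field names are hypothetical):

```json
GET /shops/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "5km",
          "location": { "lat": 52.52, "lon": 13.40 }
        }
      }
    }
  }
}
```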
Aggregation
Bucket aggregations don't calculate metrics over fields like metrics aggregations do; instead, they create buckets of documents.
Each bucket is associated with a key and a document criterion. When the aggregation is executed, every bucket's criterion is evaluated against every document; each time a criterion matches, the document is considered to "fall into" the relevant bucket.
Metrics aggregations keep track of and compute metrics over a set of documents.
Pipeline aggregations aggregate the output of other aggregations and their associated metrics.
Matrix aggregations operate on multiple fields and produce a matrix result from the values extracted from the requested document fields. Matrix aggregations do not support scripting.
Cardinality: count of distinct values of a particular field
extended_stats: all the statistics about a specific numerical field in aggregated documents
Filter aggregation
Terms aggregation
Nested aggregation
Date histogram aggregation—used with date values.
Scripted aggregation—used with scripts.
Top hits aggregation—used with top matching documents.
Range aggregation—used with a set of range values.
aggs: keyword showing that you are using an aggregation.
name_of_aggregation: the user-defined name of the aggregation.
type_of_aggregation: the type of aggregation being used.
field: the field keyword.
document_field_name: the name of the document field being targeted.
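Putting those keywords together, a sketch of a terms bucket aggregation with a nested avg metric (index and field names are hypothetical; size 0 suppresses the hits):

```json
GET /sales/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
```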
Analysis: the process of converting text into tokens or terms.
https://www.elastic.co/blog/found-text-analysis-part-1
Analyzers
Standard, Simple, Whitespace, Stop, Keyword, Pattern, Language, Snowball, Custom
Persian
https://github.com/mlkmhd/persian-analyzer-elasticsearch
https://github.com/hlavki/jlemmagen
https://github.com/NarimanN2/ParsiAnalyzer
https://www.elastic.co/guide/en/elasticsearch/plugins/7.14/analysis-icu-analyzer.html
Tokenizer
Responsible for generating tokens from text: using whitespace or other punctuation, the text is broken down into tokens.
Standard, Edge NGram, Keyword, Letter, Lowercase, NGram, Whitespace, Pattern, UAX Email URL, Path Hierarchy, Classic, Thai
Shingle: word-level n-grams (groups of adjacent tokens)
Token Filters
Token filters can further modify, delete, or add tokens to the token stream.
Prefer applying synonyms at search time rather than at index time; index-time synonyms cause problems (e.g. expanding "atm" to "automated teller machine" gets baked into the index and requires a reindex to change).
Stemming: reduces words to their root form.
Character Filters
Character filters process the text before it reaches the tokenizer. They search for special characters, HTML tags, or specified patterns, and either delete them or replace them with appropriate words.
HTML strip
Mapping
Pattern replace
Normalizers
are similar to analyzers except that they may only emit a single token. As a consequence, they do not have a tokenizer and only accept a subset of the available char filters and token filters.
Only the filters that work on a per-character basis are allowed.
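A sketch of a custom analyzer wiring the three stages together (char filter, then tokenizer, then token filters); index, analyzer, and filter names are hypothetical:

```json
PUT /autocomplete-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_edge"]
        }
      },
      "filter": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      }
    }
  }
}
```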
Ingest
Sometimes we need to transform a document before indexing it, for instance removing or renaming a field. This is handled by ingest nodes via ingest pipelines.
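An ingest-pipeline sketch with remove and rename processors (pipeline and field names are hypothetical):

```json
PUT /_ingest/pipeline/cleanup
{
  "processors": [
    { "remove": { "field": "tmp_field" } },
    { "rename": { "field": "user_name", "target_field": "user.name" } }
  ]
}
```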
ILM: index lifecycle management
Rollover: Creates a new write index when the current one reaches a certain size, number of docs, or age.
Shrink: Reduces the number of primary shards in an index.
Force merge: Triggers a force merge to reduce the number of segments in an index’s shards.
Freeze: Freezes an index and makes it read-only.
Delete: Permanently remove an index, including all of its data and metadata.
Lifecycle
Hot: The index is actively being updated and queried.
Warm: The index is no longer being updated but is still being queried.
Cold: The index is no longer being updated and is queried infrequently. The information still needs to be searchable, but it’s okay if those queries are slower.
Frozen: The index is no longer being updated and is queried rarely. The information still needs to be searchable, but it’s okay if those queries are extremely slow.
Delete: The index is no longer needed and can safely be removed.
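An ILM policy sketch combining these phases and actions (policy name, thresholds, and ages are hypothetical):

```json
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```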
Data stream:
Append only time series: good for logs
https://www.elastic.co/guide/en/elasticsearch/reference/current/set-up-a-data-stream.html
Ranking
geo_shape (box) + function decay + rank features + term with boost
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
Profiling
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html
Search
API reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html
Guide:
Using edge n-grams for search-as-you-type is easy to set up, flexible, and fast. However, sometimes it is not fast enough. Latency matters, especially when you are trying to provide instant feedback. Sometimes the fastest way of searching is not to search at all.
The completion suggester in Elasticsearch takes a completely different approach. You feed it a list of all possible completions, and it builds them into a finite state transducer, an optimized data structure that resembles a big graph. To search for suggestions, Elasticsearch starts at the beginning of the graph and moves character by character along the matching path. Once it has run out of user input, it looks at all possible endings of the current path to produce a list of suggestions.
This data structure lives in memory and makes prefix lookups extremely fast, much faster than any term-based query could be. It is an excellent match for autocompletion of names and brands, whose words are usually organized in a common order: “Johnny Rotten” rather than “Rotten Johnny.”
When word order is less predictable, edge n-grams can be a better solution than the completion suggester. This particular cat may be skinned in myriad ways.
Add fuzziness
Add custom weight for top results
https://blog.mimacom.com/autocomplete-elasticsearch-part1/
https://blog.mimacom.com/autocomplete-elasticsearch-part2/
https://blog.mimacom.com/autocomplete-elasticsearch-part3/
https://blog.mimacom.com/autocomplete-elasticsearch-part4/
https://www.elastic.co/blog/you-complete-me
https://www.elastic.co/blog/found-uses-of-elasticsearch
search analyzer
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
Index prefix
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-prefixes.html
Search as you type:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html
Suggestion(completion): https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters.html
Term suggester: suggests "similar" terms based on edit distance. Suggestions come from data in the index, and there are a lot of knobs and turns to tune it.
Phrase suggester: very similar to the term suggester, but takes a whole phrase into account.
Completion suggester: search-as-you-type functionality.
While the first two provide did-you-mean / spellchecking functionality based on the actual terms in the index, the completion suggester shows some 5 or 10 relevant suggestions while the user is typing. For this one you need to manually index a field of the completion type, on which ES can later do a fast lookup.
The completion suggester provides auto-complete/search-as-you-type functionality. This is a navigational feature to guide users to relevant results as they are typing, improving search precision. It is not meant for spell correction or did-you-mean functionality like the term or phrase suggesters.
Context suggester: a continuation of the completion suggester, adding context about where the user is coming from (geo), or letting the engine boost one company over another (e.g. because they paid for it). Here you also need to manually index the additional context data.
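A completion-suggester sketch: map a field as type completion, then query it with a prefix (index, field, and suggestion names are hypothetical):

```json
PUT /songs
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}

POST /songs/_search
{
  "suggest": {
    "song-suggest": {
      "prefix": "joh",
      "completion": { "field": "suggest" }
    }
  }
}
```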
Query:
match_bool_prefix: term queries for each word + a prefix query for the last word
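As a sketch (index and field names are hypothetical), this matches "quick" and "brown" as terms and "f" as a prefix:

```json
GET /articles/_search
{
  "query": {
    "match_bool_prefix": {
      "title": "quick brown f"
    }
  }
}
```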