Introduction
This is a list of built-in functions of WhereOS, based on Spark and Hive functions and third-party libraries. More functions can be added to WhereOS via Python or R bindings, or as Java and Scala UDF (user-defined function), UDAF (user-defined aggregation function) and UDTF (user-defined table-generating function) extensions. Custom libraries can be added via the Settings page or installed from the WhereOS Store.
Function: !
! expr – Logical not.
Class: org.apache.spark.sql.catalyst.expressions.Not
Function: %
expr1 % expr2 – Returns the remainder after `expr1`/`expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Remainder
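For example, the remainder of 10 divided by 3 (a sketch; result per standard Spark SQL semantics):
SELECT 10 % 3;
1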
Function: &
expr1 & expr2 – Returns the result of bitwise AND of `expr1` and `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.BitwiseAnd
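For example, 3 (binary 011) AND 5 (binary 101) share only the lowest bit:
SELECT 3 & 5;
1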
Function: *
expr1 * expr2 – Returns `expr1`*`expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Multiply
Function: +
expr1 + expr2 – Returns `expr1`+`expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Add
Function: -
expr1 - expr2 – Returns `expr1`-`expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Subtract
Function: /
expr1 / expr2 – Returns `expr1`/`expr2`. It always performs floating point division.
Class: org.apache.spark.sql.catalyst.expressions.Divide
Function: <
expr1 < expr2 - Returns true if `expr1` is less than `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.LessThan
Function: <=
expr1 <= expr2 - Returns true if `expr1` is less than or equal to `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.LessThanOrEqual
Function: <=>
expr1 <=> expr2 – Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are null and false if one of them is null.
Class: org.apache.spark.sql.catalyst.expressions.EqualNullSafe
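A sketch contrasting null-safe and plain equality (results per Spark SQL semantics):
SELECT 2 <=> 2, 1 <=> NULL, NULL <=> NULL;
true false true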
Function: =
expr1 = expr2 – Returns true if `expr1` equals `expr2`, or false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.EqualTo
Function: ==
expr1 == expr2 – Returns true if `expr1` equals `expr2`, or false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.EqualTo
Function: >
expr1 > expr2 – Returns true if `expr1` is greater than `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.GreaterThan
Function: >=
expr1 >= expr2 – Returns true if `expr1` is greater than or equal to `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual
Function: ^
expr1 ^ expr2 – Returns the result of bitwise exclusive OR of `expr1` and `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.BitwiseXor
Function: abs
abs(expr) – Returns the absolute value of the numeric value.
Class: org.apache.spark.sql.catalyst.expressions.Abs
Function: acos
acos(expr) – Returns the inverse cosine (a.k.a. arc cosine) of `expr`, as if computed by `java.lang.Math.acos`.
Class: org.apache.spark.sql.catalyst.expressions.Acos
Function: add_bias
add_bias(feature_vector in array) – Returns features with a bias in array
Class: hivemall.ftvec.AddBiasUDF
Function: add_days
Class: brickhouse.udf.date.AddDaysUDF
Function: add_feature_index
add_feature_index(ARRAY[DOUBLE]: dense feature vector) – Returns a feature vector with feature indices
Class: hivemall.ftvec.AddFeatureIndexUDF
Function: add_field_indices
add_field_indices(array features) – Returns arrays of string that field indices (:)* are augmented
Class: hivemall.ftvec.trans.AddFieldIndicesUDF
Function: add_field_indicies
add_field_indicies(array features) – Returns arrays of string that field indices (:)* are augmented
Class: hivemall.ftvec.trans.AddFieldIndicesUDF
Function: add_months
add_months(start_date, num_months) – Returns the date that is `num_months` after `start_date`.
Class: org.apache.spark.sql.catalyst.expressions.AddMonths
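For example, the result is clamped to the last day of the target month (per the Spark SQL reference):
SELECT add_months('2016-08-31', 1);
2016-09-30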
Function: aggregate
aggregate(expr, start, merge, finish) – Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
Class: org.apache.spark.sql.catalyst.expressions.ArrayAggregate
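For example, folding an array into its sum with an initial state of 0 (per the Spark SQL reference):
SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);
6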
Function: amplify
amplify(const int xtimes, *) – amplify the input records x-times
Class: hivemall.ftvec.amplify.AmplifierUDTF
Function: and
expr1 and expr2 – Logical AND.
Class: org.apache.spark.sql.catalyst.expressions.And
Function: angular_distance
angular_distance(ftvec1, ftvec2) – Returns an angular distance of the given two vectors
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
angular_distance(l.features, r.features) as distance,
distance2similarity(angular_distance(l.features, r.features)) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
distance asc;
doc1 doc2 distance similarity
1 3 0.31678355 0.75942624
1 2 0.33333337 0.75
2 3 0.09841931 0.91039914
2 1 0.33333337 0.75
3 2 0.09841931 0.91039914
3 1 0.31678355 0.75942624
Class: hivemall.knn.distance.AngularDistanceUDF
Function: angular_similarity
angular_similarity(ftvec1, ftvec2) – Returns an angular similarity of the given two vectors
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
angular_similarity(l.features, r.features) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
similarity desc;
doc1 doc2 similarity
1 3 0.68321645
1 2 0.6666666
2 3 0.9015807
2 1 0.6666666
3 2 0.9015807
3 1 0.68321645
Class: hivemall.knn.similarity.AngularSimilarityUDF
Function: append_array
Class: brickhouse.udf.collect.AppendArrayUDF
Function: approx_count_distinct
approx_count_distinct(expr[, relativeSD]) – Returns the estimated cardinality by HyperLogLog++. `relativeSD` defines the maximum estimation error allowed.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
Function: approx_percentile
approx_percentile(col, percentage [, accuracy]) – Returns the approximate percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column `col` at the given percentage array.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
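A sketch with a percentage array (example from the Spark SQL reference):
SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
[10.0,10.0,10.0]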
Function: argmin_kld
argmin_kld(float mean, float covar) – Returns mean or covar that minimize a KL-distance among distributions
The returned value is (1.0 / (sum(1.0 / covar))) * (sum(mean / covar))
Class: hivemall.ensemble.ArgminKLDistanceUDAF
Function: array
array(expr, …) – Returns an array with the given elements.
Class: org.apache.spark.sql.catalyst.expressions.CreateArray
Function: array_append
array_append(array arr, T elem) – Append an element to the end of an array
SELECT array_append(array(1,2),3);
1,2,3
SELECT array_append(array('a','b'),'c');
"a","b","c"
Class: hivemall.tools.array.ArrayAppendUDF
Function: array_avg
array_avg(array) – Returns an array in which each element is the mean of a set of numbers
WITH input as (
select array(1.0, 2.0, 3.0) as nums
UNION ALL
select array(2.0, 3.0, 4.0) as nums
)
select
array_avg(nums)
from
input;
["1.5","2.5","3.5"]
Class: hivemall.tools.array.ArrayAvgGenericUDAF
Function: array_concat
array_concat(array x1, array x2, ..) – Returns a concatenated array
SELECT array_concat(array(1),array(2,3));
[1,2,3]
Class: hivemall.tools.array.ArrayConcatUDF
Function: array_contains
array_contains(array, value) – Returns true if the array contains the value.
Class: org.apache.spark.sql.catalyst.expressions.ArrayContains
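For example:
SELECT array_contains(array(1, 2, 3), 2);
true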
Function: array_distinct
array_distinct(array) – Removes duplicate values from the array.
Class: org.apache.spark.sql.catalyst.expressions.ArrayDistinct
Function: array_except
array_except(array1, array2) – Returns an array of the elements in array1 but not in array2, without duplicates.
Class: org.apache.spark.sql.catalyst.expressions.ArrayExcept
Function: array_flatten
array_flatten(array>) – Returns an array with the elements flattened.
SELECT array_flatten(array(array(1,2,3),array(4,5),array(6,7,8)));
[1,2,3,4,5,6,7,8]
Class: hivemall.tools.array.ArrayFlattenUDF
Function: array_hash_values
array_hash_values(array values, [string prefix [, int numFeatures], boolean useIndexAsPrefix]) returns hash values in array
Class: hivemall.ftvec.hashing.ArrayHashValuesUDF
Function: array_index
Class: brickhouse.udf.collect.ArrayIndexUDF
Function: array_intersect
array_intersect(array1, array2) – Returns an array of the elements in the intersection of array1 and array2, without duplicates.
Class: org.apache.spark.sql.catalyst.expressions.ArrayIntersect
Function: array_join
array_join(array, delimiter[, nullReplacement]) – Concatenates the elements of the given array using the delimiter and an optional string to replace nulls. If no value is set for nullReplacement, any null value is filtered.
Class: org.apache.spark.sql.catalyst.expressions.ArrayJoin
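For example (per the Spark SQL reference):
SELECT array_join(array('hello', 'world'), ' ');
hello world
SELECT array_join(array('hello', null, 'world'), ' ', ',');
hello , world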
Function: array_max
array_max(array) – Returns the maximum value in the array. NULL elements are skipped.
Class: org.apache.spark.sql.catalyst.expressions.ArrayMax
Function: array_min
array_min(array) – Returns the minimum value in the array. NULL elements are skipped.
Class: org.apache.spark.sql.catalyst.expressions.ArrayMin
Function: array_position
array_position(array, element) – Returns the (1-based) index of the first element of the array as long.
Class: org.apache.spark.sql.catalyst.expressions.ArrayPosition
Function: array_remove
array_remove(array, element) – Remove all elements that equal to element from array.
Class: org.apache.spark.sql.catalyst.expressions.ArrayRemove
Function: array_repeat
array_repeat(element, count) – Returns the array containing element count times.
Class: org.apache.spark.sql.catalyst.expressions.ArrayRepeat
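For example:
SELECT array_repeat('123', 2);
["123","123"]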
Function: array_slice
array_slice(array values, int offset [, int length]) – Slices the given array by the given offset and length parameters.
SELECT
array_slice(array(1,2,3,4,5,6),2,4),
array_slice(
array('zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'),
0, -- offset
2 -- length
),
array_slice(
array('zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'),
6, -- offset
3 -- length
),
array_slice(
array('zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'),
6, -- offset
10 -- length
),
array_slice(
array('zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'),
6 -- offset
),
array_slice(
array('zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'),
-3 -- offset
),
array_slice(
array('zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'),
-3, -- offset
2 -- length
);
[3,4]
["zero","one"]
["six","seven","eight"]
["six","seven","eight","nine","ten"]
["six","seven","eight","nine","ten"]
["eight","nine","ten"]
["eight","nine"]
Class: hivemall.tools.array.ArraySliceUDF
Function: array_sort
array_sort(array) – Sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
Class: org.apache.spark.sql.catalyst.expressions.ArraySort
Function: array_sum
array_sum(array) – Returns an array in which each element is summed up
WITH input as (
select array(1.0, 2.0, 3.0) as nums
UNION ALL
select array(2.0, 3.0, 4.0) as nums
)
select
array_sum(nums)
from
input;
["3.0","5.0","7.0"]
Class: hivemall.tools.array.ArraySumUDAF
Function: array_to_str
array_to_str(array arr [, string sep=',']) – Convert array to string using a separator
SELECT array_to_str(array(1,2,3),'-');
1-2-3
Class: hivemall.tools.array.ArrayToStrUDF
Function: array_union
array_union(array1, array2) – Returns an array of the elements in the union of array1 and array2, without duplicates.
Class: org.apache.spark.sql.catalyst.expressions.ArrayUnion
Function: arrays_overlap
arrays_overlap(a1, a2) – Returns true if a1 contains at least a non-null element present also in a2. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.ArraysOverlap
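For example, the arrays share the element 3:
SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5));
true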
Function: arrays_zip
arrays_zip(a1, a2, …) – Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
Class: org.apache.spark.sql.catalyst.expressions.ArraysZip
Function: ascii
ascii(str) – Returns the numeric value of the first character of `str`.
Class: org.apache.spark.sql.catalyst.expressions.Ascii
Function: asin
asin(expr) – Returns the inverse sine (a.k.a. arc sine) of `expr`, as if computed by `java.lang.Math.asin`.
Class: org.apache.spark.sql.catalyst.expressions.Asin
Function: assert
Throws an assertion error if the boolean input is false; optionally asserts with a message if an input string is provided. assert(boolean) assert(boolean, string)
Class: brickhouse.udf.sanity.AssertUDF
Function: assert_equals
Class: brickhouse.udf.sanity.AssertEqualsUDF
Function: assert_less_than
Class: brickhouse.udf.sanity.AssertLessThanUDF
Function: assert_true
assert_true(expr) – Throws an exception if `expr` is not true.
Class: org.apache.spark.sql.catalyst.expressions.AssertTrue
Function: atan
atan(expr) – Returns the inverse tangent (a.k.a. arc tangent) of `expr`, as if computed by `java.lang.Math.atan`
Class: org.apache.spark.sql.catalyst.expressions.Atan
Function: atan2
atan2(exprY, exprX) – Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (`exprX`, `exprY`), as if computed by `java.lang.Math.atan2`.
Class: org.apache.spark.sql.catalyst.expressions.Atan2
Function: auc
auc(array rankItems | double score, array correctItems | int label [, const int recommendSize = rankItems.size ]) – Returns AUC
Class: hivemall.evaluation.AUCUDAF
Function: average_precision
average_precision(array rankItems, array correctItems [, const int recommendSize = rankItems.size]) – Returns MAP
Class: hivemall.evaluation.MAPUDAF
Function: avg
avg(expr) – Returns the mean calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Average
Function: base64
base64(bin) – Converts the argument from a binary `bin` to a base 64 string.
Class: org.apache.spark.sql.catalyst.expressions.Base64
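For example:
SELECT base64('Spark SQL');
U3BhcmsgU1FM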
Function: base91
base91(BINARY bin) – Convert the argument from binary to a BASE91 string
SELECT base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
AA+=kaIM|WTt!+wbGAA
Class: hivemall.tools.text.Base91UDF
Function: bbit_minhash
bbit_minhash(array<> features [, int numHashes]) – Returns a b-bits minhash value
Class: hivemall.knn.lsh.bBitMinHashUDF
Function: bigint
bigint(expr) – Casts the value `expr` to the target data type `bigint`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: bin
bin(expr) – Returns the string representation of the long value `expr` represented in binary.
Class: org.apache.spark.sql.catalyst.expressions.Bin
Function: binarize_label
binarize_label(int/long positive, int/long negative, …) – Returns positive/negative records that are represented as (…, int label) where label is 0 or 1
Class: hivemall.ftvec.trans.BinarizeLabelUDTF
Function: binary
binary(expr) – Casts the value `expr` to the target data type `binary`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: bit_length
bit_length(expr) – Returns the bit length of string data or number of bits of binary data.
Class: org.apache.spark.sql.catalyst.expressions.BitLength
Function: bits_collect
bits_collect(int|long x) – Returns a bitset in array
Class: hivemall.tools.bits.BitsCollectUDAF
Function: bits_or
bits_or(array b1, array b2, ..) – Returns a logical OR given bitsets
SELECT unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3))));
[1,2,3,4]
Class: hivemall.tools.bits.BitsORUDF
Function: bloom
Constructs a BloomFilter by aggregating a set of keys bloom(string key)
Class: brickhouse.udf.bloom.BloomUDAF
Function: bloom_and
Returns the logical AND of two bloom filters; representing the intersection of values in both bloom1 AND bloom2 bloom_and(string bloom1, string bloom2)
Class: brickhouse.udf.bloom.BloomAndUDF
Function: bloom_contains
Returns true if the referenced bloom filter contains the key. bloom_contains(string key, string bloomfilter)
Class: brickhouse.udf.bloom.BloomContainsUDF
Function: bloom_contains_any
bloom_contains_any(string bloom, string key) or bloom_contains_any(string bloom, array keys) – Returns true if the bloom filter contains any of the given keys
WITH data1 as (
SELECT explode(array(1,2,3,4,5)) as id
),
data2 as (
SELECT explode(array(1,3,5,6,8)) as id
),
bloom as (
SELECT bloom(id) as bf
FROM data1
)
SELECT
l.*
FROM
data2 l
CROSS JOIN bloom r
WHERE
bloom_contains_any(r.bf, array(l.id))
Class: hivemall.sketch.bloom.BloomContainsAnyUDF
Function: bloom_not
Returns the logical NOT of a bloom filter; representing the set of values NOT in bloom1 bloom_not(string bloom)
Class: brickhouse.udf.bloom.BloomNotUDF
Function: bloom_or
Returns the logical OR of two bloom filters; representing the union of values in either bloom1 OR bloom2 bloom_or(string bloom1, string bloom2)
Class: brickhouse.udf.bloom.BloomOrUDF
Function: boolean
boolean(expr) – Casts the value `expr` to the target data type `boolean`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: bpr_sampling
bpr_sampling(int userId, List posItems [, const string options]) – Returns a relation consisting of
Class: hivemall.ftvec.ranking.BprSamplingUDTF
Function: bround
bround(expr, d) – Returns `expr` rounded to `d` decimal places using HALF_EVEN rounding mode.
Class: org.apache.spark.sql.catalyst.expressions.BRound
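For example, HALF_EVEN rounding ties to the nearest even neighbor:
SELECT bround(2.5, 0);
2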
Function: build_bins
build_bins(number weight, const int num_of_bins[, const boolean auto_shrink = false]) – Return quantiles representing bins: array
Class: hivemall.ftvec.binning.BuildBinsUDAF
Function: call_kone_elevator
Class: com.whereos.udf.KONEElevatorCallUDF
Function: cardinality
cardinality(expr) – Returns the size of an array or a map. The function returns -1 if its input is null and spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Class: org.apache.spark.sql.catalyst.expressions.Size
Function: cast
cast(expr AS type) – Casts the value `expr` to the target data type `type`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: cast_array
Class: brickhouse.udf.collect.CastArrayUDF
Function: cast_map
Class: brickhouse.udf.collect.CastMapUDF
Function: categorical_features
categorical_features(array featureNames, feature1, feature2, .. [, const string options]) – Returns a feature vector array
Class: hivemall.ftvec.trans.CategoricalFeaturesUDF
Function: cbrt
cbrt(expr) – Returns the cube root of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Cbrt
Function: ceil
ceil(expr) – Returns the smallest integer not smaller than `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Ceil
Function: ceiling
ceiling(expr) – Returns the smallest integer not smaller than `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Ceil
Function: changefinder
changefinder(double|array x [, const string options]) – Returns outlier/change-point scores and decisions using ChangeFinder. It will return a tuple
Class: hivemall.anomaly.ChangeFinderUDF
Function: char
char(expr) – Returns the ASCII character having the binary equivalent to `expr`. If n is larger than 256 the result is equivalent to chr(n % 256)
Class: org.apache.spark.sql.catalyst.expressions.Chr
Function: char_length
char_length(expr) – Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Class: org.apache.spark.sql.catalyst.expressions.Length
Function: character_length
character_length(expr) – Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Class: org.apache.spark.sql.catalyst.expressions.Length
Function: chi2
chi2(array> observed, array> expected) – Returns chi2_val and p_val of each columns as , array>
Class: hivemall.ftvec.selection.ChiSquareUDF
Function: chr
chr(expr) – Returns the ASCII character having the binary equivalent to `expr`. If n is larger than 256 the result is equivalent to chr(n % 256)
Class: org.apache.spark.sql.catalyst.expressions.Chr
Function: coalesce
coalesce(expr1, expr2, …) – Returns the first non-null argument if exists. Otherwise, null.
Class: org.apache.spark.sql.catalyst.expressions.Coalesce
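For example, the first non-null argument is returned:
SELECT coalesce(NULL, 1, NULL);
1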
Function: collect
collect(x) – Returns an array of all the elements in the aggregation group
Class: brickhouse.udf.collect.CollectUDAF
Function: collect_list
collect_list(expr) – Collects and returns a list of non-unique elements.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.CollectList
Function: collect_max
collect_max(x, val, n) – Returns a map of the max N numeric values in the aggregation group
Class: brickhouse.udf.collect.CollectMaxUDAF
Function: collect_set
collect_set(expr) – Collects and returns a set of unique elements.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet
Function: combine
combine(a,b) – Returns a combined list of two lists, or a combined map of two maps
Class: brickhouse.udf.collect.CombineUDF
Function: combine_hyperloglog
combine_hyperloglog(x) – Combines two HyperLogLog++ binary blobs.
Class: brickhouse.udf.hll.CombineHyperLogLogUDF
Function: combine_previous_sketch
combine_previous_sketch(grouping, map) – Returns a map of the combined keys of previous calls to this
Class: brickhouse.udf.sketch.CombinePreviousSketchUDF
Function: combine_sketch
combine_sketch(x) – Combine two sketch sets.
Class: brickhouse.udf.sketch.CombineSketchUDF
Function: combine_unique
combine_unique(x) – Returns an array of all distinct elements of all lists in the aggregation group
Class: brickhouse.udf.collect.CombineUniqueUDAF
Function: concat
concat(col1, col2, …, colN) – Returns the concatenation of col1, col2, …, colN.
Class: org.apache.spark.sql.catalyst.expressions.Concat
Function: concat_array
concat_array(array x1, array x2, ..) – Returns a concatenated array
SELECT array_concat(array(1),array(2,3));
[1,2,3]
Class: hivemall.tools.array.ArrayConcatUDF
Function: concat_ws
concat_ws(sep, [str | array(str)]+) – Returns the concatenation of the strings separated by `sep`.
Class: org.apache.spark.sql.catalyst.expressions.ConcatWs
Function: conditional_emit
conditional_emit(a,b) – Emit features of a row according to various conditions
Class: brickhouse.udf.collect.ConditionalEmit
Function: conv
conv(num, from_base, to_base) – Convert `num` from `from_base` to `to_base`.
Class: org.apache.spark.sql.catalyst.expressions.Conv
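For example, converting 100 from base 2 to base 10 (per the Spark SQL reference):
SELECT conv('100', 2, 10);
4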
Function: conv2dense
conv2dense(int feature, float weight, int nDims) – Return a dense model in array
Class: hivemall.ftvec.conv.ConvertToDenseModelUDAF
Function: convert_label
convert_label(const int|const float) – Convert from -1|1 to 0.0f|1.0f, or from 0.0f|1.0f to -1|1
Class: hivemall.tools.ConvertLabelUDF
Function: convert_to_sketch
convert_to_sketch(x) – Truncate a large array of strings, and return a list of strings representing a sketch of those items
Class: brickhouse.udf.sketch.ConvertToSketchUDF
Function: corr
corr(expr1, expr2) – Returns Pearson coefficient of correlation between a set of number pairs.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Corr
Function: cos
cos(expr) – Returns the cosine of `expr`, as if computed by `java.lang.Math.cos`.
Class: org.apache.spark.sql.catalyst.expressions.Cos
Function: cosh
cosh(expr) – Returns the hyperbolic cosine of `expr`, as if computed by `java.lang.Math.cosh`.
Class: org.apache.spark.sql.catalyst.expressions.Cosh
Function: cosine_distance
cosine_distance(ftvec1, ftvec2) – Returns a cosine distance of the given two vectors
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
cosine_distance(l.features, r.features) as distance,
distance2similarity(cosine_distance(l.features, r.features)) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
distance asc;
doc1 doc2 distance similarity
1 3 0.45566893 0.6869694
1 2 0.5 0.6666667
2 3 0.04742068 0.95472616
2 1 0.5 0.6666667
3 2 0.04742068 0.95472616
3 1 0.45566893 0.6869694
Class: hivemall.knn.distance.CosineDistanceUDF
Function: cosine_similarity
cosine_similarity(ftvec1, ftvec2) – Returns a cosine similarity of the given two vectors
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
cosine_similarity(l.features, r.features) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
similarity desc;
doc1 doc2 similarity
1 3 0.5443311
1 2 0.5
2 3 0.9525793
2 1 0.5
3 2 0.9525793
3 1 0.5443311
Class: hivemall.knn.similarity.CosineSimilarityUDF
Function: cot
cot(expr) – Returns the cotangent of `expr`, as if computed by `1/java.lang.Math.tan`.
Class: org.apache.spark.sql.catalyst.expressions.Cot
Function: count
count(*) – Returns the total number of retrieved rows, including rows containing null. count(expr[, expr…]) – Returns the number of rows for which the supplied expression(s) are all non-null. count(DISTINCT expr[, expr…]) – Returns the number of rows for which the supplied expression(s) are unique and non-null.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Function: count_min_sketch
count_min_sketch(col, eps, confidence, seed) – Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg
Function: covar_pop
covar_pop(expr1, expr2) – Returns the population covariance of a set of number pairs.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation
Function: covar_samp
covar_samp(expr1, expr2) – Returns the sample covariance of a set of number pairs.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.CovSample
Function: crc32
crc32(expr) – Returns a cyclic redundancy check value of the `expr` as a bigint.
Class: org.apache.spark.sql.catalyst.expressions.Crc32
Function: cube
cube([col1[, col2 ..]]) – create a multi-dimensional cube using the specified columns so that we can run aggregation on them.
Class: org.apache.spark.sql.catalyst.expressions.Cube
Function: cume_dist
cume_dist() – Computes the position of a value relative to all values in the partition.
Class: org.apache.spark.sql.catalyst.expressions.CumeDist
Function: current_database
current_database() – Returns the current database.
Class: org.apache.spark.sql.catalyst.expressions.CurrentDatabase
Function: current_date
current_date() – Returns the current date at the start of query evaluation.
Class: org.apache.spark.sql.catalyst.expressions.CurrentDate
Function: current_timestamp
current_timestamp() – Returns the current timestamp at the start of query evaluation.
Class: org.apache.spark.sql.catalyst.expressions.CurrentTimestamp
Function: date
date(expr) – Casts the value `expr` to the target data type `date`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: date_add
date_add(start_date, num_days) – Returns the date that is `num_days` after `start_date`.
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Function: date_format
date_format(timestamp, fmt) – Converts `timestamp` to a value of string in the format specified by the date format `fmt`.
Class: org.apache.spark.sql.catalyst.expressions.DateFormatClass
Function: date_range
date_range(a,b,c) – Generates a range of dates from start date a to end date b, incremented by c days, as multiple rows
Class: brickhouse.udf.date.DateRangeUDTF
Function: date_sub
date_sub(start_date, num_days) – Returns the date that is `num_days` before `start_date`.
Class: org.apache.spark.sql.catalyst.expressions.DateSub
Function: date_trunc
date_trunc(fmt, ts) – Returns timestamp `ts` truncated to the unit specified by the format model `fmt`. `fmt` should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", "WEEK", "QUARTER"]
Class: org.apache.spark.sql.catalyst.expressions.TruncTimestamp
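For example (per the Spark SQL reference):
SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');
2015-01-01 00:00:00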
Function: datediff
datediff(endDate, startDate) – Returns the number of days from `startDate` to `endDate`.
Class: org.apache.spark.sql.catalyst.expressions.DateDiff
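For example:
SELECT datediff('2009-07-31', '2009-07-30');
1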
Function: dateseries
Class: com.whereos.udf.DateSeriesUDF
Function: day
day(date) – Returns the day of month of the date/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.DayOfMonth
Function: dayofmonth
dayofmonth(date) – Returns the day of month of the date/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.DayOfMonth
Function: dayofweek
dayofweek(date) – Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, …, 7 = Saturday).
Class: org.apache.spark.sql.catalyst.expressions.DayOfWeek
Function: dayofyear
dayofyear(date) – Returns the day of year of the date/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.DayOfYear
Function: decimal
decimal(expr) – Casts the value `expr` to the target data type `decimal`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: decode
decode(bin, charset) – Decodes the first argument using the second argument character set.
Class: org.apache.spark.sql.catalyst.expressions.Decode
Function: deflate
deflate(TEXT data [, const int compressionLevel]) – Returns a compressed BINARY object by using Deflater. The compression level must be in range [-1,9]
SELECT base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
AA+=kaIM|WTt!+wbGAA
Class: hivemall.tools.compress.DeflateUDF
Function: degrees
degrees(expr) – Converts radians to degrees.
Class: org.apache.spark.sql.catalyst.expressions.ToDegrees
Function: dense_rank
dense_rank() – Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.
Class: org.apache.spark.sql.catalyst.expressions.DenseRank
Function: dimsum_mapper
dimsum_mapper(array row, map colNorms [, const string options]) – Returns column-wise partial similarities
Class: hivemall.knn.similarity.DIMSUMMapperUDTF
Function: distance2similarity
distance2similarity(float d) – Returns 1.0 / (1.0 + d)
Class: hivemall.knn.similarity.Distance2SimilarityUDF
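A sketch following directly from the formula, since 1.0 / (1.0 + 1.0) = 0.5:
SELECT distance2similarity(1.0);
0.5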
Function: distcache_gets
distcache_gets(filepath, key, default_value [, parseKey]) – Returns map|value_type
Class: hivemall.tools.mapred.DistributedCacheLookupUDF
Function: distributed_bloom
Loads a bloom filter from a file in the distributed cache and makes it available as a named bloom. distributed_bloom(string filename) distributed_bloom(string filename, boolean returnEncoded)
Class: brickhouse.udf.bloom.DistributedBloomUDF
Function: distributed_map
Class: brickhouse.udf.dcache.DistributedMapUDF
Function: double
double(expr) – Casts the value `expr` to the target data type `double`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: e
e() – Returns Euler’s number, e.
Class: org.apache.spark.sql.catalyst.expressions.EulerNumber
Function: each_top_k
each_top_k(int K, Object group, double cmpKey, *) – Returns top-K values (or tail-K values when k is less than 0)
Class: hivemall.tools.EachTopKUDTF
Function: element_at
element_at(array, index) – Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array. element_at(map, key) - Returns value for given key, or NULL if the key is not contained in the map
Class: org.apache.spark.sql.catalyst.expressions.ElementAt
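The 1-based and negative indexing on arrays can be sketched as follows (a Python illustration of the documented semantics, not the Spark code):

```python
def element_at(arr, index):
    # 1-based indexing; negative indices count from the last element
    if index == 0:
        raise ValueError("index must not be 0 (indexing is 1-based)")
    if abs(index) > len(arr):
        return None  # index beyond the array length -> NULL
    return arr[index - 1] if index > 0 else arr[index]

a = ["a", "b", "c"]
print(element_at(a, 1))   # a
print(element_at(a, -1))  # c
print(element_at(a, 4))   # None
```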
Function: elt
elt(n, input1, input2, …) – Returns the `n`-th input, e.g., returns `input2` when `n` is 2.
Class: org.apache.spark.sql.catalyst.expressions.Elt
Function: encode
encode(str, charset) – Encodes the first argument using the second argument character set.
Class: org.apache.spark.sql.catalyst.expressions.Encode
Function: estimated_reach
estimated_reach(x) – Estimate reach from a sketch set of Strings.
Class: brickhouse.udf.sketch.EstimatedReachUDF
Function: euclid_distance
euclid_distance(ftvec1, ftvec2) – Returns the square root of the sum of the squared differences: sqrt(sum((x - y)^2))
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
euclid_distance(l.features, r.features) as distance,
distance2similarity(euclid_distance(l.features, r.features)) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
distance asc;
doc1 doc2 distance similarity
1 2 2.4494898 0.28989795
1 3 2.6457512 0.2742919
2 3 1.0 0.5
2 1 2.4494898 0.28989795
3 2 1.0 0.5
3 1 2.6457512 0.2742919
Class: hivemall.knn.distance.EuclidDistanceUDF
Function: euclid_similarity
euclid_similarity(ftvec1, ftvec2) – Returns a Euclidean-distance-based similarity, which is `1.0 / (1.0 + distance)`, of the given two vectors
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
euclid_similarity(l.features, r.features) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
similarity desc;
doc1 doc2 similarity
1 2 0.28989795
1 3 0.2742919
2 3 0.5
2 1 0.28989795
3 2 0.5
3 1 0.2742919
Class: hivemall.knn.similarity.EuclidSimilarity
Function: exists
exists(expr, pred) – Tests whether a predicate holds for one or more elements in the array.
Class: org.apache.spark.sql.catalyst.expressions.ArrayExists
Function: exp
exp(expr) – Returns e to the power of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Exp
Function: explode
explode(expr) – Separates the elements of array `expr` into multiple rows, or the elements of map `expr` into multiple rows and columns.
Class: org.apache.spark.sql.catalyst.expressions.Explode
Function: explode_outer
explode_outer(expr) – Separates the elements of array `expr` into multiple rows, or the elements of map `expr` into multiple rows and columns.
Class: org.apache.spark.sql.catalyst.expressions.Explode
Function: explodegeometry
Class: com.whereos.udf.ExplodeGeometryUDTF
Function: explodemultipolygon
Class: com.whereos.udf.ExplodeMultiPolygonUDTF
Function: expm1
expm1(expr) – Returns exp(`expr`) – 1.
Class: org.apache.spark.sql.catalyst.expressions.Expm1
Function: extract_feature
extract_feature(feature_vector in array) – Returns features in array
Class: hivemall.ftvec.ExtractFeatureUDF
Function: extract_weight
extract_weight(feature_vector in array) – Returns the weights of features in array
Class: hivemall.ftvec.ExtractWeightUDF
Function: extractframes
Class: com.whereos.udf.ExtractFramesUDTF
Function: extractpixels
Class: com.whereos.udf.ExtractPixelsUDTF
Function: f1score
f1score(array[int], array[int]) – Returns the F1 score
Class: hivemall.evaluation.F1ScoreUDAF
Function: factorial
factorial(expr) – Returns the factorial of `expr`. `expr` is [0..20]. Otherwise, null.
Class: org.apache.spark.sql.catalyst.expressions.Factorial
Function: feature
feature(feature, value) – Returns a feature string
Class: hivemall.ftvec.FeatureUDF
Function: feature_binning
feature_binning(array features, map quantiles_map) – returns a binned feature vector as an array
feature_binning(number weight, array quantiles) – returns a bin ID as int
WITH extracted as (
select
extract_feature(feature) as index,
extract_weight(feature) as value
from
input l
LATERAL VIEW explode(features) r as feature
),
mapping as (
select
index,
build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
from
extracted
group by
index
),
bins as (
select
to_map(index, quantiles) as quantiles
from
mapping
)
select
l.features as original,
feature_binning(l.features, r.quantiles) as features
from
input l
cross join bins r
> ["name#Jacob","gender#Male","age:20.0"] ["name#Jacob","gender#Male","age:2"]
> ["name#Isabella","gender#Female","age:20.0"] ["name#Isabella","gender#Female","age:2"]
Class: hivemall.ftvec.binning.FeatureBinningUDF
Function: feature_hashing
feature_hashing(array features [, const string options]) – returns a hashed feature vector in array
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-libsvm');
["4063537:1.0","4063537:1","8459207:2.0"]
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10');
["7:1.0","7","1:2.0"]
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10 -libsvm');
["1:2.0","7:1.0","7:1"]
Class: hivemall.ftvec.hashing.FeatureHashingUDF
Function: feature_index
feature_index(feature_vector in array) – Returns feature indices in array
Class: hivemall.ftvec.FeatureIndexUDF
Function: feature_pairs
feature_pairs(feature_vector in array [, const string options]) – Returns a relation
Class: hivemall.ftvec.pairing.FeaturePairsUDTF
Function: ffm_features
ffm_features(const array featureNames, feature1, feature2, .. [, const string options]) – Takes categorical variables and returns a feature vector array in a libffm format
Class: hivemall.ftvec.trans.FFMFeaturesUDF
Function: filter
filter(expr, func) – Filters the input array using the given predicate.
Class: org.apache.spark.sql.catalyst.expressions.ArrayFilter
Function: find_in_set
find_in_set(str, str_array) – Returns the index (1-based) of the given string (`str`) in the comma-delimited list (`str_array`). Returns 0, if the string was not found or if the given string (`str`) contains a comma.
Class: org.apache.spark.sql.catalyst.expressions.FindInSet
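The comma edge case above is easy to miss; as a sketch of the documented semantics (not the Spark implementation):

```python
def find_in_set(s, str_array):
    # The needle itself containing a comma can never match a list element,
    # so the documented behavior is to return 0 in that case.
    if "," in s:
        return 0
    parts = str_array.split(",")
    return parts.index(s) + 1 if s in parts else 0  # 1-based, 0 if absent

print(find_in_set("ab", "abc,b,ab,c,def"))  # 3
print(find_in_set("a,b", "abc,b,ab"))       # 0 (needle contains a comma)
```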
Function: first
first(expr[, isIgnoreNull]) – Returns the first value of `expr` for a group of rows. If `isIgnoreNull` is true, returns only non-null values.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.First
Function: first_element
first_element(x) – Returns the first element in an array
SELECT first_element(array('a','b','c'));
a
SELECT first_element(array());
NULL
Class: hivemall.tools.array.FirstElementUDF
Function: first_index
first_index(x) – First value in an array
Class: brickhouse.udf.collect.FirstIndexUDF
Function: first_value
first_value(expr[, isIgnoreNull]) – Returns the first value of `expr` for a group of rows. If `isIgnoreNull` is true, returns only non-null values.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.First
Function: flatten
flatten(arrayOfArrays) – Transforms an array of arrays into a single array.
Class: org.apache.spark.sql.catalyst.expressions.Flatten
Function: float
float(expr) – Casts the value `expr` to the target data type `float`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: float_array
float_array(nDims) – Returns an array of nDims elements
Class: hivemall.tools.array.AllocFloatArrayUDF
Function: floor
floor(expr) – Returns the largest integer not greater than `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Floor
Function: fmeasure
fmeasure(array|int|boolean actual, array|int|boolean predicted [, const string options]) – Returns the F-measure (f1score is the special case with beta=1.0)
Class: hivemall.evaluation.FMeasureUDAF
Function: format_number
format_number(expr1, expr2) – Formats the number `expr1` like '#,###,###.##', rounded to `expr2` decimal places. If `expr2` is 0, the result has no decimal point or fractional part. `expr2` also accepts a user-specified format. This is supposed to function like MySQL's FORMAT.
Class: org.apache.spark.sql.catalyst.expressions.FormatNumber
Function: format_string
format_string(strfmt, obj, …) – Returns a formatted string from printf-style format strings.
Class: org.apache.spark.sql.catalyst.expressions.FormatString
Function: from_camel_case
from_camel_case(a) – Converts a string in CamelCase to one containing underscores.
Class: brickhouse.udf.json.ConvertFromCamelCaseUDF
Function: from_json
from_json(jsonStr, schema[, options]) – Returns a struct value with the given `jsonStr` and `schema`.
Class: org.apache.spark.sql.catalyst.expressions.JsonToStructs
Function: from_unixtime
from_unixtime(unix_time, format) – Returns `unix_time` in the specified `format`.
Class: org.apache.spark.sql.catalyst.expressions.FromUnixTime
Function: from_utc_timestamp
from_utc_timestamp(timestamp, timezone) – Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield ‘2017-07-14 03:40:00.0’.
Class: org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp
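The documented example can be reproduced in plain Python (a sketch only; 'GMT+1' is modeled here as a fixed one-hour offset, which ignores the DST handling a real time zone database would apply):

```python
from datetime import datetime, timezone, timedelta

# Interpret a naive timestamp as UTC, then render it in GMT+1.
ts = datetime.strptime("2017-07-14 02:40:00", "%Y-%m-%d %H:%M:%S")
ts_utc = ts.replace(tzinfo=timezone.utc)
gmt1 = ts_utc.astimezone(timezone(timedelta(hours=1)))
print(gmt1.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-07-14 03:40:00
```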
Function: generate_series
generate_series(const int|bigint start, const int|bigint end) – Generates a series of values from start to end. A similar function to PostgreSQL's [generate_series](https://www.postgresql.org/docs/current/static/functions-srf.html)
SELECT generate_series(2,4);
2
3
4
SELECT generate_series(5,1,-2);
5
3
1
SELECT generate_series(4,3);
(no return)
SELECT date_add(current_date(),value),value from (SELECT generate_series(1,3)) t;
2018-04-21 1
2018-04-22 2
2018-04-23 3
WITH input as (
SELECT 1 as c1, 10 as c2, 3 as step
UNION ALL
SELECT 10, 2, -3
)
SELECT generate_series(c1, c2, step) as series
FROM input;
1
4
7
10
10
7
4
Class: hivemall.tools.GenerateSeriesUDTF
Function: generateheatmap
Class: com.whereos.udf.HeatmapGenerateUDTF
Function: geocode
Class: com.whereos.udf.GeocodingUDTF
Function: geokeyradius
Class: com.whereos.udf.GeoKeyRadiusUDTF
Function: geokeys
Class: com.whereos.udf.GeoKeysUDTF
Function: get_json_object
get_json_object(json_txt, path) – Extracts a json object from `path`.
Class: org.apache.spark.sql.catalyst.expressions.GetJsonObject
Function: greatest
greatest(expr, …) – Returns the greatest value of all parameters, skipping null values.
Class: org.apache.spark.sql.catalyst.expressions.Greatest
Function: group_count
A sequence id for all rows with the same value for a specific grouping
Class: brickhouse.udf.collect.GroupCountUDF
Function: grouping
grouping(col) – Indicates whether a specified column in a GROUP BY is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set.
Class: org.apache.spark.sql.catalyst.expressions.Grouping
Function: grouping_id
grouping_id([col1[, col2 ..]]) – returns the level of grouping, equals to `(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)`
Class: org.apache.spark.sql.catalyst.expressions.GroupingID
Function: guess_attribute_types
guess_attribute_types(ANY, …) – Returns attribute types
select guess_attribute_types(*) from train limit 1;
Q,Q,C,C,C,C,Q,C,C,C,Q,C,Q,Q,Q,Q,C,Q
Class: hivemall.smile.tools.GuessAttributesUDF
Function: hamming_distance
hamming_distance(integer A, integer B) – Returns Hamming distance between A and B
select
hamming_distance(0,3) as c1,
hamming_distance("0","3") as c2 -- 0=0b00, 3=0b11
;
c1 c2
2 2
Class: hivemall.knn.distance.HammingDistanceUDF
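For integer arguments, the Hamming distance is the number of differing bits, i.e. the popcount of the XOR. A minimal sketch (not the Hivemall implementation, which also accepts string arguments):

```python
def hamming_distance(a, b):
    # Number of differing bits = population count of a XOR b
    return bin(a ^ b).count("1")

print(hamming_distance(0, 3))  # 2  (3 = 0b11 differs from 0 in two bits)
print(hamming_distance(0, 4))  # 1  (4 = 0b100 differs in one bit)
```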
Function: hash
hash(expr1, expr2, …) – Returns a hash value of the arguments.
Class: org.apache.spark.sql.catalyst.expressions.Murmur3Hash
Function: hash_md5
Class: brickhouse.udf.sketch.HashMD5UDF
Function: haversine_distance
haversine_distance(double lat1, double lon1, double lat2, double lon2, [const boolean mile=false])::double – Returns the distance between two locations in km [or miles] using the `haversine` formula
Usage: select haversine_distance(lat1, lon1, lat2, lon2) from …
Class: hivemall.geospatial.HaversineDistanceUDF
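The haversine formula itself is standard; a Python sketch follows (the Earth-radius constant and the mile conversion factor are assumptions, not values taken from the Hivemall source):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_distance(lat1, lon1, lat2, lon2, mile=False):
    # Great-circle distance via the haversine formula; km by default.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    d = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
    return d * 0.621371 if mile else d

# Tokyo to Osaka, roughly 400 km
print(round(haversine_distance(35.68, 139.76, 34.69, 135.50), 1))
```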
Function: hbase_balanced_key
hbase_balanced_key(keyStr,numRegions) – Returns an HBase key balanced evenly across regions
Class: brickhouse.hbase.GenerateBalancedKeyUDF
Function: hbase_batch_get
hbase_batch_get(table,key,family) – Do a single HBase Get on a table
Class: brickhouse.hbase.BatchGetUDF
Function: hbase_batch_put
hbase_batch_put(config_map, key, value) – Perform batch HBase updates of a table
Class: brickhouse.hbase.BatchPutUDAF
Function: hbase_cached_get
hbase_cached_get(configMap,key,template) – Returns a cached object, given an HBase config, a key, and a template object used to interpret JSON
Class: brickhouse.hbase.CachedGetUDF
Function: hbase_get
hbase_get(table,key,family) – Do a single HBase Get on a table
Class: brickhouse.hbase.GetUDF
Function: hbase_put
string hbase_put(config, map key_value) – string hbase_put(config, key, value) – Do a HBase Put on a table. Config must contain zookeeper quorum, table name, column, and qualifier. Example of usage: hbase_put(map('hbase.zookeeper.quorum', 'hb-zoo1,hb-zoo2', 'table_name', 'metrics', 'family', 'c', 'qualifier', 'q'), 'test.prod.visits.total', '123456')
Class: brickhouse.hbase.PutUDF
Function: hex
hex(expr) – Converts `expr` to hexadecimal.
Class: org.apache.spark.sql.catalyst.expressions.Hex
Function: hitrate
hitrate(array rankItems, array correctItems [, const int recommendSize = rankItems.size]) – Returns HitRate
Class: hivemall.evaluation.HitRateUDAF
Function: hivemall_version
hivemall_version() – Returns the version of Hivemall
SELECT hivemall_version();
Class: hivemall.HivemallVersionUDF
Function: hll_est_cardinality
hll_est_cardinality(x) – Estimate reach from a HyperLogLog++.
Class: brickhouse.udf.hll.EstimateCardinalityUDF
Function: hour
hour(timestamp) – Returns the hour component of the string/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.Hour
Function: hyperloglog
hyperloglog(x, [b]) – Constructs a HyperLogLog++ estimator to estimate reach for large values, with optional bit parameter for specifying precision (b must be in [4,16]). Default is b = 6. Returns a binary value that represents the HyperLogLog++ data structure.
Class: brickhouse.udf.hll.HyperLogLogUDAF
Function: hypot
hypot(expr1, expr2) – Returns sqrt(`expr1`**2 + `expr2`**2).
Class: org.apache.spark.sql.catalyst.expressions.Hypot
Function: if
if(expr1, expr2, expr3) – If `expr1` evaluates to true, then returns `expr2`; otherwise returns `expr3`.
Class: org.apache.spark.sql.catalyst.expressions.If
Function: ifnull
ifnull(expr1, expr2) – Returns `expr2` if `expr1` is null, or `expr1` otherwise.
Class: org.apache.spark.sql.catalyst.expressions.IfNull
Function: in
expr1 in(expr2, expr3, …) – Returns true if `expr1` equals any of `expr2`, `expr3`, ….
Class: org.apache.spark.sql.catalyst.expressions.In
Function: indexed_features
indexed_features(double v1, double v2, …) – Returns a list of features as array: [1:v1, 2:v2, ..]
Class: hivemall.ftvec.trans.IndexedFeatures
Function: infinity
infinity() – Returns the constant representing positive infinity.
Class: hivemall.tools.math.InfinityUDF
Function: inflate
inflate(BINARY compressedData) – Returns a decompressed STRING by using Inflater
SELECT inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
aaaaaaaaaaaaaaaabbbbccc
Class: hivemall.tools.compress.InflateUDF
Function: initcap
initcap(str) – Returns `str` with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.
Class: org.apache.spark.sql.catalyst.expressions.InitCap
Function: inline
inline(expr) – Explodes an array of structs into a table.
Class: org.apache.spark.sql.catalyst.expressions.Inline
Function: inline_outer
inline_outer(expr) – Explodes an array of structs into a table.
Class: org.apache.spark.sql.catalyst.expressions.Inline
Function: input_file_block_length
input_file_block_length() – Returns the length of the block being read, or -1 if not available.
Class: org.apache.spark.sql.catalyst.expressions.InputFileBlockLength
Function: input_file_block_start
input_file_block_start() – Returns the start offset of the block being read, or -1 if not available.
Class: org.apache.spark.sql.catalyst.expressions.InputFileBlockStart
Function: input_file_name
input_file_name() – Returns the name of the file being read, or empty string if not available.
Class: org.apache.spark.sql.catalyst.expressions.InputFileName
Function: instr
instr(str, substr) – Returns the (1-based) index of the first occurrence of `substr` in `str`.
Class: org.apache.spark.sql.catalyst.expressions.StringInstr
Function: int
int(expr) – Casts the value `expr` to the target data type `int`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: intersect_array
intersect_array(array1, array2, …) – Returns the intersection of a set of arrays
Class: brickhouse.udf.collect.ArrayIntersectUDF
Function: is_finite
is_finite(x) – Determine if x is finite.
SELECT is_finite(333), is_finite(infinity());
true false
Class: hivemall.tools.math.IsFiniteUDF
Function: is_infinite
is_infinite(x) – Determine if x is infinite.
Class: hivemall.tools.math.IsInfiniteUDF
Function: is_nan
is_nan(x) – Determine if x is not-a-number.
Class: hivemall.tools.math.IsNanUDF
Function: is_stopword
is_stopword(string word) – Returns whether the given word is an English stopword or not
Class: hivemall.tools.text.StopwordUDF
Function: isnotnull
isnotnull(expr) – Returns true if `expr` is not null, or false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.IsNotNull
Function: isnull
isnull(expr) – Returns true if `expr` is null, or false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.IsNull
Function: isochronedistanceedges
Class: com.whereos.udf.IsochroneDistanceEdgesUDTF
Function: isochronedistancepolygons
Class: com.whereos.udf.IsochroneDistancePolygonsUDTF
Function: isochronedurationedges
Class: com.whereos.udf.IsochroneDurationEdgesUDTF
Function: isochronedurationpolygons
Class: com.whereos.udf.IsochroneDistancePolygonsUDTF
Function: item_pairs_sampling
item_pairs_sampling(array pos_items, const int max_item_id [, const string options]) – Returns a relation consisting of
Class: hivemall.ftvec.ranking.ItemPairsSamplingUDTF
Function: jaccard_distance
jaccard_distance(integer A, integer B [,int k=128]) – Returns Jaccard distance between A and B
select
jaccard_distance(0,3) as c1,
jaccard_distance("0","3") as c2, -- 0=0b00, 3=0b11
jaccard_distance(0,4) as c3
;
c1 c2 c3
0.03125 0.03125 0.015625
Class: hivemall.knn.distance.JaccardDistanceUDF
Function: jaccard_similarity
jaccard_similarity(A, B [,int k]) – Returns Jaccard similarity coefficient of A and B
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
jaccard_similarity(l.features, r.features) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
similarity desc;
doc1 doc2 similarity
1 2 0.14285715
1 3 0.0
2 3 0.6
2 1 0.14285715
3 2 0.6
3 1 0.0
Class: hivemall.knn.similarity.JaccardIndexUDF
Function: java_method
java_method(class, method[, arg1[, arg2 ..]]) – Calls a method with reflection.
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Function: jobconf_gets
jobconf_gets() – Returns the value from JobConf
Class: hivemall.tools.mapred.JobConfGetsUDF
Function: jobid
jobid() – Returns the value of mapred.job.id
Class: hivemall.tools.mapred.JobIdUDF
Function: join_array
Class: brickhouse.udf.collect.JoinArrayUDF
Function: json_map
json_map(json) – Returns a map of key-value pairs from a JSON string
Class: brickhouse.udf.json.JsonMapUDF
Function: json_split
json_split(json) – Returns an array of JSON strings from a JSON Array
Class: brickhouse.udf.json.JsonSplitUDF
Function: json_tuple
json_tuple(jsonStr, p1, p2, …, pn) – Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.
Class: org.apache.spark.sql.catalyst.expressions.JsonTuple
Function: kld
kld(double mu1, double sigma1, double mu2, double sigma2) – Returns KL divergence between two distributions
Class: hivemall.knn.distance.KLDivergenceUDF
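For two univariate Gaussians the KL divergence has a closed form; a sketch of that formula follows (an illustration of the math, not necessarily the exact convention Hivemall uses):

```python
import math

def gaussian_kld(mu1, sigma1, mu2, sigma2):
    # KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in closed form:
    # ln(s2/s1) + (s1^2 + (mu1-mu2)^2) / (2*s2^2) - 1/2
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

print(gaussian_kld(0.0, 1.0, 0.0, 1.0))  # 0.0 for identical distributions
```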
Function: kpa_predict
kpa_predict(@Nonnull double xh, @Nonnull double xk, @Nullable float w0, @Nonnull float w1, @Nonnull float w2, @Nullable float w3) – Returns a prediction value in Double
Class: hivemall.classifier.KPAPredictUDAF
Function: kurtosis
kurtosis(expr) – Returns the kurtosis value calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis
Function: l1_normalize
l1_normalize(ftvec string) – Returns an L1-normalized value
Class: hivemall.ftvec.scaling.L1NormalizationUDF
Function: l2_norm
l2_norm(double x) – Returns the L2 norm of the given input x.
WITH input as (
select generate_series(1,3) as v
)
select l2_norm(v) as l2norm
from input;
3.7416573867739413 = sqrt(1^2+2^2+3^2)
Class: hivemall.tools.math.L2NormUDAF
Function: l2_normalize
l2_normalize(ftvec string) – Returns an L2-normalized value
Class: hivemall.ftvec.scaling.L2NormalizationUDF
Function: lag
lag(input[, offset[, default]]) – Returns the value of `input` at the `offset`th row before the current row in the window. The default value of `offset` is 1 and the default value of `default` is null. If the value of `input` at the `offset`th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), `default` is returned.
Class: org.apache.spark.sql.catalyst.expressions.Lag
Function: last
last(expr[, isIgnoreNull]) – Returns the last value of `expr` for a group of rows. If `isIgnoreNull` is true, returns only non-null values.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Function: last_day
last_day(date) – Returns the last day of the month which the date belongs to.
Class: org.apache.spark.sql.catalyst.expressions.LastDay
Function: last_element
last_element(x) – Returns the last element in an array
SELECT last_element(array('a','b','c'));
c
Class: hivemall.tools.array.LastElementUDF
Function: last_index
last_index(x) – Last value in an array
Class: brickhouse.udf.collect.LastIndexUDF
Function: last_value
last_value(expr[, isIgnoreNull]) – Returns the last value of `expr` for a group of rows. If `isIgnoreNull` is true, returns only non-null values.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Function: lat2tiley
lat2tiley(double lat, int zoom)::int – Returns the tile number of the given latitude and zoom level
Class: hivemall.geospatial.Lat2TileYUDF
Function: lcase
lcase(str) – Returns `str` with all characters changed to lowercase.
Class: org.apache.spark.sql.catalyst.expressions.Lower
Function: lda_predict
lda_predict(string word, float value, int label, float lambda[, const string options]) – Returns a list which consists of
Class: hivemall.topicmodel.LDAPredictUDAF
Function: lead
lead(input[, offset[, default]]) – Returns the value of `input` at the `offset`th row after the current row in the window. The default value of `offset` is 1 and the default value of `default` is null. If the value of `input` at the `offset`th row is null, null is returned. If there is no such an offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), `default` is returned.
Class: org.apache.spark.sql.catalyst.expressions.Lead
Function: least
least(expr, …) – Returns the least value of all parameters, skipping null values.
Class: org.apache.spark.sql.catalyst.expressions.Least
Function: left
left(str, len) – Returns the leftmost `len` characters from the string `str` (`len` can be string type). If `len` is less than or equal to 0, the result is an empty string.
Class: org.apache.spark.sql.catalyst.expressions.Left
Function: length
length(expr) – Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Class: org.apache.spark.sql.catalyst.expressions.Length
Function: levenshtein
levenshtein(str1, str2) – Returns the Levenshtein distance between the two given strings.
Class: org.apache.spark.sql.catalyst.expressions.Levenshtein
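Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The classic dynamic-programming computation, as a sketch (not the Spark implementation):

```python
def levenshtein(s1, s2):
    # Row-by-row DP over the edit-distance matrix, keeping only the previous row.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution (free on match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```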
Function: like
str like pattern – Returns true if str matches pattern, null if any arguments are null, false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.Like
Function: ln
ln(expr) – Returns the natural logarithm (base e) of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Log
Function: locate
locate(substr, str[, pos]) – Returns the position of the first occurrence of `substr` in `str` after position `pos`. The given `pos` and return value are 1-based.
Class: org.apache.spark.sql.catalyst.expressions.StringLocate
Function: log
log(base, expr) – Returns the logarithm of `expr` with `base`.
Class: org.apache.spark.sql.catalyst.expressions.Logarithm
Function: log10
log10(expr) – Returns the logarithm of `expr` with base 10.
Class: org.apache.spark.sql.catalyst.expressions.Log10
Function: log1p
log1p(expr) – Returns log(1 + `expr`).
Class: org.apache.spark.sql.catalyst.expressions.Log1p
Function: log2
log2(expr) – Returns the logarithm of `expr` with base 2.
Class: org.apache.spark.sql.catalyst.expressions.Log2
Function: logloss
logloss(double predicted, double actual) – Returns the Logarithmic Loss
Class: hivemall.evaluation.LogarithmicLossUDAF
Function: logress
logress(array features, float target [, constant string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Class: hivemall.regression.LogressUDTF
Function: lon2tilex
lon2tilex(double lon, int zoom)::int – Returns the tile number of the given longitude and zoom level
Class: hivemall.geospatial.Lon2TileXUDF
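lat2tiley (above) and lon2tilex follow the standard OpenStreetMap "slippy map" tiling scheme; a Python sketch of those formulas (assumed to match the Hivemall functions, since both target the same tile URLs):

```python
import math

def lon2tilex(lon, zoom):
    # Linear mapping of longitude [-180, 180) onto 2^zoom tiles
    return int((lon + 180.0) / 360.0 * (1 << zoom))

def lat2tiley(lat, zoom):
    # Web-Mercator projection of latitude onto 2^zoom tiles
    rad = math.radians(lat)
    return int((1.0 - math.asinh(math.tan(rad)) / math.pi) / 2.0 * (1 << zoom))

# Tile containing Helsinki (60.17 N, 24.94 E) at zoom 10
print(lon2tilex(24.94, 10), lat2tiley(60.17, 10))
```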
Function: lower
lower(str) – Returns `str` with all characters changed to lowercase.
Class: org.apache.spark.sql.catalyst.expressions.Lower
Function: lpad
lpad(str, len, pad) – Returns `str`, left-padded with `pad` to a length of `len`. If `str` is longer than `len`, the return value is shortened to `len` characters.
Class: org.apache.spark.sql.catalyst.expressions.StringLPad
Function: lr_datagen
lr_datagen(options string) – Generates a logistic regression dataset
WITH dual AS (SELECT 1) SELECT lr_datagen('-n_examples 1k -n_features 10') FROM dual;
Class: hivemall.dataset.LogisticRegressionDataGeneratorUDTF
Function: ltrim
ltrim(str) – Removes the leading space characters from `str`. ltrim(trimStr, str) – Removes from `str` the leading characters contained in the trim string `trimStr`.
Class: org.apache.spark.sql.catalyst.expressions.StringTrimLeft
Function: mae
mae(double predicted, double actual) – Returns the Mean Absolute Error
Class: hivemall.evaluation.MeanAbsoluteErrorUDAF
Function: manhattan_distance
manhattan_distance(list x, list y) – Returns sum(|x - y|)
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
manhattan_distance(l.features, r.features) as distance,
distance2similarity(angular_distance(l.features, r.features)) as similarity
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
distance asc;
doc1 doc2 distance similarity
1 2 4.0 0.75
1 3 5.0 0.75942624
2 3 1.0 0.91039914
2 1 4.0 0.75
3 2 1.0 0.91039914
3 1 5.0 0.75942624
Class: hivemall.knn.distance.ManhattanDistanceUDF
Function: map
map(key0, value0, key1, value1, …) – Creates a map with the given key/value pairs.
Class: org.apache.spark.sql.catalyst.expressions.CreateMap
Function: map_concat
map_concat(map, …) – Returns the union of all the given maps
Class: org.apache.spark.sql.catalyst.expressions.MapConcat
Function: map_exclude_keys
map_exclude_keys(Map map, array filteringKeys) – Returns the filtered entries of a map not having specified keys
SELECT map_exclude_keys(map(1,'one',2,'two',3,'three'),array(2,3));
{1:"one"}
Class: hivemall.tools.map.MapExcludeKeysUDF
Function: map_filter_keys
map_filter_keys(map, key_array) – Returns the filtered entries of a map corresponding to a given set of keys
Class: brickhouse.udf.collect.MapFilterKeysUDF
Function: map_from_arrays
map_from_arrays(keys, values) – Creates a map with a pair of the given key/value arrays. All elements in keys should not be null
Class: org.apache.spark.sql.catalyst.expressions.MapFromArrays
Function: map_from_entries
map_from_entries(arrayOfEntries) – Returns a map created from the given array of entries.
Class: org.apache.spark.sql.catalyst.expressions.MapFromEntries
Function: map_get_sum
map_get_sum(map src, array keys) – Returns sum of values that are retrieved by keys
Class: hivemall.tools.map.MapGetSumUDF
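As an illustration of the semantics (a Python sketch; the treatment of missing keys as contributing 0 is an assumption, not confirmed from the Hivemall source):

```python
def map_get_sum(src, keys):
    # Look up each key in the map and sum the values found;
    # keys absent from the map contribute 0 (assumed behavior).
    return sum(src.get(k, 0) for k in keys)

print(map_get_sum({"a": 1.0, "b": 2.5, "c": 4.0}, ["a", "c", "x"]))  # 5.0
```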
Function: map_include_keys
map_include_keys(Map map, array filteringKeys) – Returns the filtered entries of a map having specified keys
SELECT map_include_keys(map(1,'one',2,'two',3,'three'),array(2,3));
{2:"two",3:"three"}
Class: hivemall.tools.map.MapIncludeKeysUDF
Function: map_index
Class: brickhouse.udf.collect.MapIndexUDF
Function: map_key_values
map_key_values(map) – Returns an array of key-value pairs contained in a Map
Class: brickhouse.udf.collect.MapKeyValuesUDF
Function: map_keys
map_keys(map) – Returns an unordered array containing the keys of the map.
Class: org.apache.spark.sql.catalyst.expressions.MapKeys
Function: map_tail_n
map_tail_n(map SRC, int N) – Returns the last N elements from a sorted array of SRC
Class: hivemall.tools.map.MapTailNUDF
Function: map_url
map_url(double lat, double lon, int zoom [, const string option]) – Returns a URL string
OpenStreetMap: http://tile.openstreetmap.org/${zoom}/${xtile}/${ytile}.png
Google Maps: https://www.google.com/maps/@${lat},${lon},${zoom}z
Class: hivemall.geospatial.MapURLUDF
Function: map_values
map_values(map) – Returns an unordered array containing the values of the map.
Class: org.apache.spark.sql.catalyst.expressions.MapValues
Function: max
max(expr) – Returns the maximum value of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Max
Function: max_label
max_label(double value, string label) – Returns the label that has the maximum value
Class: hivemall.ensemble.MaxValueLabelUDAF
Function: maxrow
maxrow(ANY compare, …) – Returns the row that has the maximum value in the 1st argument
Class: hivemall.ensemble.MaxRowUDAF
Function: md5
md5(expr) – Returns an MD5 128-bit checksum as a hex string of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Md5
Function: mean
mean(expr) – Returns the mean calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Average
Function: mhash
mhash(string word) – Returns a MurmurHash3 INT value starting from 1
Class: hivemall.ftvec.hashing.MurmurHash3UDF
Function: min
min(expr) – Returns the minimum value of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
Function: minhash
minhash(ANY item, array features [, constant string options]) – Returns n different k-depth signatures (i.e., clusterid) for each item
Class: hivemall.knn.lsh.MinHashUDTF
Function: minhashes
minhashes(array<> features [, int numHashes, int keyGroup [, boolean noWeight]]) – Returns minhash values
Class: hivemall.knn.lsh.MinHashesUDF
Function: minkowski_distance
minkowski_distance(list x, list y, double p) – Returns sum(|x – y|^p)^(1/p)
WITH docs as (
select 1 as docid, array('apple:1.0', 'orange:2.0', 'banana:1.0', 'kuwi:0') as features
union all
select 2 as docid, array('apple:1.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
union all
select 3 as docid, array('apple:2.0', 'orange:0', 'banana:2.0', 'kuwi:1.0') as features
)
select
l.docid as doc1,
r.docid as doc2,
minkowski_distance(l.features, r.features, 1) as distance1, -- p=1 (manhattan_distance)
minkowski_distance(l.features, r.features, 2) as distance2, -- p=2 (euclid_distance)
minkowski_distance(l.features, r.features, 3) as distance3, -- p=3
manhattan_distance(l.features, r.features) as manhattan_distance,
euclid_distance(l.features, r.features) as euclid_distance
from
docs l
CROSS JOIN docs r
where
l.docid != r.docid
order by
doc1 asc,
distance1 asc;
doc1 doc2 distance1 distance2 distance3 manhattan_distance euclid_distance
1 2 4.0 2.4494898 2.1544347 4.0 2.4494898
1 3 5.0 2.6457512 2.2239802 5.0 2.6457512
2 3 1.0 1.0 1.0 1.0 1.0
2 1 4.0 2.4494898 2.1544347 4.0 2.4494898
3 2 1.0 1.0 1.0 1.0 1.0
3 1 5.0 2.6457512 2.2239802 5.0 2.6457512
Class: hivemall.knn.distance.MinkowskiDistanceUDF
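The distances in the table above can be reproduced in plain Python. This is an illustrative sketch, not the UDF itself; it parses the `name:weight` feature strings and assumes a missing key counts as 0:

```python
def parse_features(features):
    """Parse 'name:weight' strings (as in the example above) into a dict."""
    out = {}
    for f in features:
        name, _, weight = f.partition(":")
        out[name] = float(weight)
    return out

def minkowski_distance(x, y, p):
    """sum(|x - y|^p)^(1/p) over the union of feature names."""
    keys = set(x) | set(y)
    return sum(abs(x.get(k, 0.0) - y.get(k, 0.0)) ** p for k in keys) ** (1.0 / p)

doc1 = parse_features(["apple:1.0", "orange:2.0", "banana:1.0", "kuwi:0"])
doc2 = parse_features(["apple:1.0", "orange:0", "banana:2.0", "kuwi:1.0"])

print(minkowski_distance(doc1, doc2, 1))  # 4.0 (Manhattan)
print(minkowski_distance(doc1, doc2, 2))  # sqrt(6), the Euclidean distance
```

With p=1 and p=2 this matches the `manhattan_distance` and `euclid_distance` columns of the result table.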
Function: minute
minute(timestamp) – Returns the minute component of the string/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.Minute
Function: mod
expr1 mod expr2 – Returns the remainder after `expr1`/`expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Remainder
Function: monotonically_increasing_id
monotonically_increasing_id() – Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records. The function is non-deterministic because its result depends on partition IDs.
Class: org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
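The bit layout described above (partition ID in the upper 31 bits, record number in the lower 33 bits) can be sketched in Python; `monotonic_id` is an illustrative name, not a WhereOS function:

```python
def monotonic_id(partition_id, record_number):
    """Pack partition ID into the upper 31 bits and the per-partition
    record number into the lower 33 bits of a 64-bit integer."""
    assert partition_id < (1 << 31) and record_number < (1 << 33)
    return (partition_id << 33) | record_number

# IDs increase within a partition and never collide across partitions,
# but they are not consecutive across partitions.
print(monotonic_id(0, 0))  # 0
print(monotonic_id(2, 5))  # (2 << 33) + 5
```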
Function: month
month(date) – Returns the month component of the date/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.Month
Function: months_between
months_between(timestamp1, timestamp2[, roundOff]) – If `timestamp1` is later than `timestamp2`, then the result is positive. If `timestamp1` and `timestamp2` are on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless roundOff=false.
Class: org.apache.spark.sql.catalyst.expressions.MonthsBetween
Function: moving_avg
Returns the moving average of a time series for a given time window.
Class: brickhouse.udf.timeseries.MovingAvgUDF
Function: mrr
mrr(array rankItems, array correctItems [, const int recommendSize = rankItems.size]) – Returns MRR
Class: hivemall.evaluation.MRRUDAF
Function: mse
mse(double predicted, double actual) – Returns the mean squared error
Class: hivemall.evaluation.MeanSquaredErrorUDAF
Function: multiday_count
multiday_count(x) – Returns a count of events over several different periods.
Class: brickhouse.udf.sketch.MultiDaySketcherUDAF
Function: named_struct
named_struct(name1, val1, name2, val2, …) – Creates a struct with the given field names and values.
Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
Function: nan
nan() – Returns the constant representing not-a-number.
SELECT nan(), is_nan(nan());
NaN true
Class: hivemall.tools.math.NanUDF
Function: nanvl
nanvl(expr1, expr2) – Returns `expr1` if it's not NaN, or `expr2` otherwise.
Class: org.apache.spark.sql.catalyst.expressions.NaNvl
Function: ndcg
ndcg(array rankItems, array correctItems [, const int recommendSize = rankItems.size]) – Returns nDCG
Class: hivemall.evaluation.NDCGUDAF
Function: negative
negative(expr) – Returns the negated value of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.UnaryMinus
Function: next_day
next_day(start_date, day_of_week) – Returns the first date which is later than `start_date` and named as indicated.
Class: org.apache.spark.sql.catalyst.expressions.NextDay
Function: normalize_unicode
normalize_unicode(string str [, string form]) – Transforms `str` with the specified normalization form. The `form` takes one of NFC (default), NFD, NFKC, or NFKD
SELECT normalize_unicode('ハンカクカナ','NFKC');
ハンカクカナ
SELECT normalize_unicode('㈱㌧㌦Ⅲ','NFKC');
(株)トンドルIII
Class: hivemall.tools.text.NormalizeUnicodeUDF
Function: not
not expr – Logical not.
Class: org.apache.spark.sql.catalyst.expressions.Not
Function: now
now() – Returns the current timestamp at the start of query evaluation.
Class: org.apache.spark.sql.catalyst.expressions.CurrentTimestamp
Function: ntile
ntile(n) – Divides the rows for each window partition into `n` buckets ranging from 1 to at most `n`.
Class: org.apache.spark.sql.catalyst.expressions.NTile
Function: nullif
nullif(expr1, expr2) – Returns null if `expr1` equals to `expr2`, or `expr1` otherwise.
Class: org.apache.spark.sql.catalyst.expressions.NullIf
Function: numeric_range
numeric_range(a,b,c) – Generates a range of integers from a to b, incremented by c, as multiple rows
Class: brickhouse.udf.collect.NumericRange
Function: nvl
nvl(expr1, expr2) – Returns `expr2` if `expr1` is null, or `expr1` otherwise.
Class: org.apache.spark.sql.catalyst.expressions.Nvl
Function: nvl2
nvl2(expr1, expr2, expr3) – Returns `expr2` if `expr1` is not null, or `expr3` otherwise.
Class: org.apache.spark.sql.catalyst.expressions.Nvl2
Function: octet_length
octet_length(expr) – Returns the byte length of string data or number of bytes of binary data.
Class: org.apache.spark.sql.catalyst.expressions.OctetLength
Function: onehot_encoding
onehot_encoding(PRIMITIVE feature, …) – Computes a one-hot encoded label for each feature
WITH mapping as (
select
m.f1, m.f2
from (
select onehot_encoding(species, category) m
from test
) tmp
)
select
array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_features
from
test t
CROSS JOIN mapping m;
["2","8","count:9"]
["5","8","count:10"]
["1","6","count:101"]
Class: hivemall.ftvec.trans.OnehotEncodingUDAF
Function: or
expr1 or expr2 – Logical OR.
Class: org.apache.spark.sql.catalyst.expressions.Or
Function: parse_url
parse_url(url, partToExtract[, key]) – Extracts a part from a URL.
Class: org.apache.spark.sql.catalyst.expressions.ParseUrl
Function: percent_rank
percent_rank() – Computes the percentage ranking of a value in a group of values.
Class: org.apache.spark.sql.catalyst.expressions.PercentRank
Function: percentile
percentile(col, percentage [, frequency]) – Returns the exact percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0, and frequency must be a positive integral value.
percentile(col, array(percentage1 [, percentage2]…) [, frequency]) – Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0, and frequency must be a positive integral value.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Percentile
Function: percentile_approx
percentile_approx(col, percentage [, accuracy]) – Returns the approximate percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column `col` at the given percentage array.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Function: permutations
Class: com.whereos.udf.PermutationUDTF
Function: pi
pi() – Returns pi.
Class: org.apache.spark.sql.catalyst.expressions.Pi
Function: plsa_predict
plsa_predict(string word, float value, int label, float prob[, const string options]) – Returns a list which consists of (int label, float prob) pairs
Class: hivemall.topicmodel.PLSAPredictUDAF
Function: pmod
pmod(expr1, expr2) – Returns the positive value of `expr1` mod `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Pmod
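`pmod` differs from `%` for negative dividends. A plain-Python sketch, where `math.fmod` mirrors the truncated (sign-of-dividend) remainder that SQL's `%` produces:

```python
import math

def pmod(a, n):
    """Positive modulus: shift a negative truncated remainder up by n."""
    r = math.fmod(a, n)   # truncated remainder, keeps the sign of a
    return r + n if r < 0 else r

print(math.fmod(-7, 3))  # -1.0, what `%` returns
print(pmod(-7, 3))       # 2.0, what pmod returns
```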
Function: polynomial_features
polynomial_features(feature_vector in array) – Returns a feature vector having a polynomial feature space
Class: hivemall.ftvec.pairing.PolynomialFeaturesUDF
Function: popcnt
popcnt(a [, b]) – Returns a popcount value
select
popcnt(3),
popcnt("3"), -- 3=0b11
popcnt(array(1,3));
2 2 3
Class: hivemall.knn.distance.PopcountUDF
Function: populate_not_in
populate_not_in(list items, const int max_item_id [, const string options]) – Returns a relation consisting of the items that do not exist in the given items
Class: hivemall.ftvec.ranking.PopulateNotInUDTF
Function: posexplode
posexplode(expr) – Separates the elements of array `expr` into multiple rows with positions, or the elements of map `expr` into multiple rows and columns with positions.
Class: org.apache.spark.sql.catalyst.expressions.PosExplode
Function: posexplode_outer
posexplode_outer(expr) – Separates the elements of array `expr` into multiple rows with positions, or the elements of map `expr` into multiple rows and columns with positions.
Class: org.apache.spark.sql.catalyst.expressions.PosExplode
Function: posexplodepairs
Class: com.whereos.udf.PosExplodePairsUDTF
Function: position
position(substr, str[, pos]) – Returns the position of the first occurrence of `substr` in `str` after position `pos`. The given `pos` and return value are 1-based.
Class: org.apache.spark.sql.catalyst.expressions.StringLocate
Function: positive
positive(expr) – Returns the value of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.UnaryPositive
Function: pow
pow(expr1, expr2) – Raises `expr1` to the power of `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Pow
Function: power
power(expr1, expr2) – Raises `expr1` to the power of `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.Pow
Function: powered_features
powered_features(feature_vector in array, int degree [, boolean truncate]) – Returns a feature vector having a powered feature space
Class: hivemall.ftvec.pairing.PoweredFeaturesUDF
Function: precision_at
precision_at(array rankItems, array correctItems [, const int recommendSize = rankItems.size]) – Returns Precision
Class: hivemall.evaluation.PrecisionUDAF
Function: prefixed_hash_values
prefixed_hash_values(array values, string prefix [, boolean useIndexAsPrefix]) – Returns an array in which each element has the specified prefix
Class: hivemall.ftvec.hashing.ArrayPrefixedHashValuesUDF
Function: printf
printf(strfmt, obj, …) – Returns a formatted string from printf-style format strings.
Class: org.apache.spark.sql.catalyst.expressions.FormatString
Function: quantified_features
quantified_features(boolean output, col1, col2, …) – Returns identified features in a dense array
Class: hivemall.ftvec.trans.QuantifiedFeaturesUDTF
Function: quantify
quantify(boolean output, col1, col2, …) – Returns identified features
Class: hivemall.ftvec.conv.QuantifyColumnsUDTF
Function: quantitative_features
quantitative_features(array featureNames, feature1, feature2, .. [, const string options]) – Returns a feature vector array
Class: hivemall.ftvec.trans.QuantitativeFeaturesUDF
Function: quarter
quarter(date) – Returns the quarter of the year for date, in the range 1 to 4.
Class: org.apache.spark.sql.catalyst.expressions.Quarter
Function: r2
r2(double predicted, double actual) – Returns R squared (the coefficient of determination)
Class: hivemall.evaluation.R2UDAF
Function: radians
radians(expr) – Converts degrees to radians.
Class: org.apache.spark.sql.catalyst.expressions.ToRadians
Function: raise_error
raise_error() or raise_error(string msg) – Throws an error
SELECT product_id, price, raise_error('Found an invalid record') FROM xxx WHERE price < 0.0
Class: hivemall.tools.sanity.RaiseErrorUDF
Function: rand
rand([seed]) – Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
Class: org.apache.spark.sql.catalyst.expressions.Rand
Function: rand_amplify
rand_amplify(const int xtimes [, const string options], *) – amplify the input records x-times in map-side
Class: hivemall.ftvec.amplify.RandomAmplifierUDTF
Function: randn
randn([seed]) – Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.
Class: org.apache.spark.sql.catalyst.expressions.Randn
Function: rank
rank() – Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
Class: org.apache.spark.sql.catalyst.expressions.Rank
Function: readjsongeometry
Class: com.whereos.udf.ReadJSONGeometryUDF
Function: recall_at
recall_at(array rankItems, array correctItems [, const int recommendSize = rankItems.size]) – Returns Recall
Class: hivemall.evaluation.RecallUDAF
Function: reflect
reflect(class, method[, arg1[, arg2 ..]]) – Calls a method with reflection.
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Function: regexp_extract
regexp_extract(str, regexp[, idx]) – Extracts a group that matches `regexp`.
Class: org.apache.spark.sql.catalyst.expressions.RegExpExtract
Function: regexp_replace
regexp_replace(str, regexp, rep) – Replaces all substrings of `str` that match `regexp` with `rep`.
Class: org.apache.spark.sql.catalyst.expressions.RegExpReplace
Function: rendergeometries
Class: com.whereos.udf.CollectAndRenderGeometryUDF
Function: renderheatmap
Class: com.whereos.udf.HeatmapRenderUDF
Function: rendertile
Class: com.whereos.udf.TileRenderUDF
Function: repeat
repeat(str, n) – Returns the string which repeats the given string value n times.
Class: org.apache.spark.sql.catalyst.expressions.StringRepeat
Function: replace
replace(str, search[, replace]) – Replaces all occurrences of `search` with `replace`.
Class: org.apache.spark.sql.catalyst.expressions.StringReplace
Function: rescale
rescale(value, min, max) – Returns rescaled value by min-max normalization
Class: hivemall.ftvec.scaling.RescaleUDF
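Min-max normalization is simple arithmetic; a plain-Python sketch (the behavior when min equals max is an assumption here, not taken from the UDF):

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: maps the range [min, max] onto [0, 1]."""
    if min_value == max_value:
        return 0.5  # assumed midpoint for a degenerate range
    return (value - min_value) / (max_value - min_value)

print(rescale(5.0, 0.0, 10.0))  # 0.5
print(rescale(0.0, 0.0, 10.0))  # 0.0
```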
Function: reverse
reverse(array) – Returns a reversed string or an array with reverse order of elements.
Class: org.apache.spark.sql.catalyst.expressions.Reverse
Function: rf_ensemble
rf_ensemble(int yhat [, array proba [, double model_weight=1.0]]) – Returns ensembled prediction results in (int label, double probability, array probabilities)
Class: hivemall.smile.tools.RandomForestEnsembleUDAF
Function: right
right(str, len) – Returns the rightmost `len` (`len` can be string type) characters from the string `str`. If `len` is less than or equal to 0, the result is an empty string.
Class: org.apache.spark.sql.catalyst.expressions.Right
Function: rint
rint(expr) – Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
Class: org.apache.spark.sql.catalyst.expressions.Rint
Function: rlike
str rlike regexp – Returns true if `str` matches `regexp`, or false otherwise.
Class: org.apache.spark.sql.catalyst.expressions.RLike
Function: rmse
rmse(double predicted, double actual) – Returns the root mean squared error
Class: hivemall.evaluation.RootMeanSquaredErrorUDAF
Function: rollup
rollup([col1[, col2 ..]]) – Creates a multi-dimensional rollup using the specified columns so that we can run aggregation on them.
Class: org.apache.spark.sql.catalyst.expressions.Rollup
Function: round
round(expr, d) – Returns `expr` rounded to `d` decimal places using HALF_UP rounding mode.
Class: org.apache.spark.sql.catalyst.expressions.Round
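HALF_UP rounding differs from the banker's rounding used by Python's built-in `round`; a Decimal-based sketch of the semantics (`sql_round` is an illustrative name):

```python
from decimal import Decimal, ROUND_HALF_UP

def sql_round(expr, d):
    """Round to d decimal places, ties going away from zero (HALF_UP)."""
    quantum = Decimal(1).scaleb(-d)  # e.g. d=2 -> Decimal('0.01')
    return float(Decimal(str(expr)).quantize(quantum, rounding=ROUND_HALF_UP))

print(sql_round(2.5, 0))  # 3.0 (HALF_UP)
print(round(2.5))         # 2 (Python's banker's rounding, for contrast)
```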
Function: row_number
row_number() – Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
Class: org.apache.spark.sql.catalyst.expressions.RowNumber
Function: rowid
rowid() – Returns a generated row id of the form {TASK_ID}-{SEQUENCE_NUMBER}
Class: hivemall.tools.mapred.RowIdUDF
Function: rownum
rownum() – Returns a generated row number `sprintf("%d%04d", sequence, taskId)` as a long
SELECT rownum() as rownum, xxx from …
Class: hivemall.tools.mapred.RowNumberUDF
Function: rpad
rpad(str, len, pad) – Returns `str`, right-padded with `pad` to a length of `len`. If `str` is longer than `len`, the return value is shortened to `len` characters.
Class: org.apache.spark.sql.catalyst.expressions.StringRPad
Function: rtrim
rtrim(str) – Removes the trailing space characters from `str`. rtrim(trimStr, str) – Removes the trailing string which contains the characters from the trim string from the `str`
Class: org.apache.spark.sql.catalyst.expressions.StringTrimRight
Function: salted_bigint
Class: brickhouse.hbase.SaltedBigIntUDF
Function: salted_bigint_key
Class: brickhouse.hbase.SaltedBigIntUDF
Function: schema_of_json
schema_of_json(json[, options]) – Returns schema in the DDL format of JSON string.
Class: org.apache.spark.sql.catalyst.expressions.SchemaOfJson
Function: second
second(timestamp) – Returns the second component of the string/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.Second
Function: select_k_best
select_k_best(array array, const array importance, const int k) – Returns selected top-k elements as array
Class: hivemall.tools.array.SelectKBestUDF
Function: sentences
sentences(str[, lang, country]) – Splits `str` into an array of array of words.
Class: org.apache.spark.sql.catalyst.expressions.Sentences
Function: sequence
sequence(start, stop, step) – Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions. Supported types are: byte, short, integer, long, date, timestamp. The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the ‘date’ or ‘timestamp’ type then the step expression must resolve to the ‘interval’ type, otherwise to the same type as the start and stop expressions.
Class: org.apache.spark.sql.catalyst.expressions.Sequence
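The inclusive-stop semantics differ from Python's `range`; a sketch of the integer case only (date/timestamp sequences with interval steps are omitted, and a non-zero step is assumed):

```python
def sequence(start, stop, step):
    """Inclusive version of range(): both start and stop can appear in the output."""
    out, x = [], start
    while (step > 0 and x <= stop) or (step < 0 and x >= stop):
        out.append(x)
        x += step
    return out

print(sequence(1, 5, 2))   # [1, 3, 5]
print(sequence(5, 1, -2))  # [5, 3, 1]
```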
Function: sessionize
sessionize(long timeInSec, long thresholdInSec [, String subject]) – Returns a UUID string of a session.
SELECT
sessionize(time, 3600, ip_addr) as session_id,
time, ip_addr
FROM (
SELECT time, ip_addr
FROM weblog
DISTRIBUTE BY ip_addr, time SORT BY ip_addr, time DESC
) t1
Class: hivemall.tools.datetime.SessionizeUDF
Function: set_difference
set_difference(a,b) – Returns a list of those items in a, but not in b
Class: brickhouse.udf.collect.SetDifferenceUDF
Function: set_similarity
set_similarity(a,b) – Compute the Jaccard set similarity of two sketch sets.
Class: brickhouse.udf.sketch.SetSimilarityUDF
Function: sha
sha(expr) – Returns a sha1 hash value as a hex string of the `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Sha1
Function: sha1
sha1(expr) – Returns a sha1 hash value as a hex string of the `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Sha1
Function: sha2
sha2(expr, bitLength) – Returns a checksum of SHA-2 family as a hex string of `expr`. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.
Class: org.apache.spark.sql.catalyst.expressions.Sha2
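The bit-length dispatch can be sketched with `hashlib`, mapping 0 to SHA-256 as described above:

```python
import hashlib

def sha2(data, bit_length):
    """SHA-2 family checksum of a string as a hex string; bit length 0 means 256."""
    algos = {0: hashlib.sha256, 224: hashlib.sha224, 256: hashlib.sha256,
             384: hashlib.sha384, 512: hashlib.sha512}
    return algos[bit_length](data.encode("utf-8")).hexdigest()

print(sha2("abc", 256))
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```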
Function: shiftleft
shiftleft(base, expr) – Bitwise left shift.
Class: org.apache.spark.sql.catalyst.expressions.ShiftLeft
Function: shiftright
shiftright(base, expr) – Bitwise (signed) right shift.
Class: org.apache.spark.sql.catalyst.expressions.ShiftRight
Function: shiftrightunsigned
shiftrightunsigned(base, expr) – Bitwise unsigned right shift.
Class: org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned
Function: shuffle
shuffle(array) – Returns a random permutation of the given array.
Class: org.apache.spark.sql.catalyst.expressions.Shuffle
Function: sigmoid
sigmoid(x) – Returns 1.0 / (1.0 + exp(-x))
WITH input as (
SELECT 3.0 as x
UNION ALL
SELECT -3.0 as x
)
select
1.0 / (1.0 + exp(-x)),
sigmoid(x)
from
input;
0.04742587317756678 0.04742587357759476
0.9525741268224334 0.9525741338729858
Class: hivemall.tools.math.SigmoidGenericUDF
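The two result columns above differ slightly because the UDF evaluates in lower precision; the exact double-precision values in the left column can be reproduced in plain Python:

```python
import math

def sigmoid(x):
    """Logistic function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(3.0))   # 0.9525741268224334
print(sigmoid(-3.0))  # 0.04742587317756678
```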
Function: sign
sign(expr) – Returns -1.0, 0.0 or 1.0 as `expr` is negative, 0 or positive.
Class: org.apache.spark.sql.catalyst.expressions.Signum
Function: signum
signum(expr) – Returns -1.0, 0.0 or 1.0 as `expr` is negative, 0 or positive.
Class: org.apache.spark.sql.catalyst.expressions.Signum
Function: simple_r
Class: com.whereos.udf.RenjinUDF
Function: sin
sin(expr) – Returns the sine of `expr`, as if computed by `java.lang.Math.sin`.
Class: org.apache.spark.sql.catalyst.expressions.Sin
Function: singularize
singularize(string word) – Returns singular form of a given English word
SELECT singularize(lower("Apples"));
"apple"
Class: hivemall.tools.text.SingularizeUDF
Function: sinh
sinh(expr) – Returns hyperbolic sine of `expr`, as if computed by `java.lang.Math.sinh`.
Class: org.apache.spark.sql.catalyst.expressions.Sinh
Function: size
size(expr) – Returns the size of an array or a map. The function returns -1 if its input is null and spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Class: org.apache.spark.sql.catalyst.expressions.Size
Function: sketch_hashes
sketch_hashes(x) – Returns the MD5 hashes associated with a KMV sketch set of strings
Class: brickhouse.udf.sketch.SketchHashesUDF
Function: sketch_set
sketch_set(x) – Constructs a sketch set to estimate reach for large values
Class: brickhouse.udf.sketch.SketchSetUDAF
Function: skewness
skewness(expr) – Returns the skewness value calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Skewness
Function: slice
slice(x, start, length) – Subsets array x starting from index start (array indices start at 1, or starting from the end if start is negative) with the specified length.
Class: org.apache.spark.sql.catalyst.expressions.Slice
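The 1-based index and negative-start semantics can be sketched in Python (`sql_slice` is an illustrative name):

```python
def sql_slice(x, start, length):
    """Subset x from 1-based index `start`; a negative start counts from the end."""
    if start == 0:
        raise ValueError("SQL array indices start at 1")
    i = start - 1 if start > 0 else len(x) + start
    return x[i:i + length]

print(sql_slice([1, 2, 3, 4, 5], 2, 2))   # [2, 3]
print(sql_slice([1, 2, 3, 4, 5], -2, 2))  # [4, 5]
```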
Function: smallint
smallint(expr) – Casts the value `expr` to the target data type `smallint`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: snr
snr(array features, array one-hot class label) – Returns Signal Noise Ratio for each feature as array
Class: hivemall.ftvec.selection.SignalNoiseRatioUDAF
Function: sort_and_uniq_array
sort_and_uniq_array(array) – Takes array and returns a sorted array with duplicate elements eliminated
SELECT sort_and_uniq_array(array(3,1,1,-2,10));
[-2,1,3,10]
Class: hivemall.tools.array.SortAndUniqArrayUDF
Function: sort_array
sort_array(array[, ascendingOrder]) – Sorts the input array in ascending or descending order according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order.
Class: org.apache.spark.sql.catalyst.expressions.SortArray
Function: sort_by_feature
sort_by_feature(map in map) – Returns a sorted map
Class: hivemall.ftvec.SortByFeatureUDF
Function: soundex
soundex(str) – Returns Soundex code of the string.
Class: org.apache.spark.sql.catalyst.expressions.SoundEx
Function: space
space(n) – Returns a string consisting of `n` spaces.
Class: org.apache.spark.sql.catalyst.expressions.StringSpace
Function: spark_partition_id
spark_partition_id() – Returns the current partition id.
Class: org.apache.spark.sql.catalyst.expressions.SparkPartitionID
Function: split
split(str, regex) – Splits `str` around occurrences that match `regex`.
Class: org.apache.spark.sql.catalyst.expressions.StringSplit
Function: split_words
split_words(string query [, string regex]) – Returns an array containing split strings
Class: hivemall.tools.text.SplitWordsUDF
Function: splitlinestring
Class: com.whereos.udf.LineSplitterUDTF
Function: sqrt
sqrt(expr) – Returns the square root of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.Sqrt
Function: sst
sst(double|array x [, const string options]) – Returns change-point scores and decisions using Singular Spectrum Transformation (SST). It will return a tuple
Class: hivemall.anomaly.SingularSpectrumTransformUDF
Function: stack
stack(n, expr1, …, exprk) – Separates `expr1`, …, `exprk` into `n` rows.
Class: org.apache.spark.sql.catalyst.expressions.Stack
Function: std
std(expr) – Returns the sample standard deviation calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp
Function: stddev
stddev(expr) – Returns the sample standard deviation calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp
Function: stddev_pop
stddev_pop(expr) – Returns the population standard deviation calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop
Function: stddev_samp
stddev_samp(expr) – Returns the sample standard deviation calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp
Function: str_to_map
str_to_map(text[, pairDelim[, keyValueDelim]]) – Creates a map after splitting the text into key/value pairs using delimiters. Default delimiters are ',' for `pairDelim` and ':' for `keyValueDelim`.
Class: org.apache.spark.sql.catalyst.expressions.StringToMap
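A plain-Python sketch of the splitting behavior; the handling of duplicate keys and of pairs without a key/value delimiter is an assumption here, not taken from the implementation:

```python
def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Split text into a key/value map using the two delimiters."""
    out = {}
    for pair in text.split(pair_delim):
        k, sep, v = pair.partition(kv_delim)
        out[k] = v if sep else None  # assumed: no delimiter -> null value
    return out

print(str_to_map("a:1,b:2"))  # {'a': '1', 'b': '2'}
```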
Function: string
string(expr) – Casts the value `expr` to the target data type `string`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: struct
struct(col1, col2, col3, …) – Creates a struct with the given field values.
Class: org.apache.spark.sql.catalyst.expressions.NamedStruct
Function: subarray
subarray(array values, int offset [, int length]) – Slices the given array by the given offset and length parameters.
SELECT
array_slice(array(1,2,3,4,5,6),2,4),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
0, -- offset
2 -- length
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
6, -- offset
3 -- length
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
6, -- offset
10 -- length
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
6 -- offset
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
-3 -- offset
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
-3, -- offset
2 -- length
);
[3,4]
["zero","one"]
["six","seven","eight"]
["six","seven","eight","nine","ten"]
["six","seven","eight","nine","ten"]
["eight","nine","ten"]
["eight","nine"]
Class: hivemall.tools.array.ArraySliceUDF
Function: subarray_endwith
subarray_endwith(array original, int|text key) – Returns an array that ends with the specified key
SELECT subarray_endwith(array(1,2,3,4), 3);
[1,2,3]
Class: hivemall.tools.array.SubarrayEndWithUDF
Function: subarray_startwith
subarray_startwith(array original, int|text key) – Returns an array that starts with the specified key
SELECT subarray_startwith(array(1,2,3,4), 2);
[2,3,4]
Class: hivemall.tools.array.SubarrayStartWithUDF
Function: substr
substr(str, pos[, len]) – Returns the substring of `str` that starts at `pos` and is of length `len`, or the slice of byte array that starts at `pos` and is of length `len`.
Class: org.apache.spark.sql.catalyst.expressions.Substring
Function: substring
substring(str, pos[, len]) – Returns the substring of `str` that starts at `pos` and is of length `len`, or the slice of byte array that starts at `pos` and is of length `len`.
Class: org.apache.spark.sql.catalyst.expressions.Substring
Function: substring_index
substring_index(str, delim, count) – Returns the substring from `str` before `count` occurrences of the delimiter `delim`. If `count` is positive, everything to the left of the final delimiter (counting from the left) is returned. If `count` is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for `delim`.
Class: org.apache.spark.sql.catalyst.expressions.SubstringIndex
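The positive/negative `count` semantics can be sketched in Python:

```python
def substring_index(s, delim, count):
    """Everything before `count` occurrences of delim: counted from the left
    when count is positive, from the right when negative; case-sensitive."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

print(substring_index("www.apache.org", ".", 2))   # www.apache
print(substring_index("www.apache.org", ".", -2))  # apache.org
```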
Function: sum
sum(expr) – Returns the sum calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Sum
Function: sum_array
Class: brickhouse.udf.timeseries.SumArrayUDF
Function: tan
tan(expr) – Returns the tangent of `expr`, as if computed by `java.lang.Math.tan`.
Class: org.apache.spark.sql.catalyst.expressions.Tan
Function: tanh
tanh(expr) – Returns the hyperbolic tangent of `expr`, as if computed by `java.lang.Math.tanh`.
Class: org.apache.spark.sql.catalyst.expressions.Tanh
Function: taskid
taskid() – Returns the value of mapred.task.partition
Class: hivemall.tools.mapred.TaskIdUDF
Function: tf
tf(string text) – Returns the term frequency of each word in the given text as a map of (string, float)
Class: hivemall.ftvec.text.TermFrequencyUDAF
Function: throw_error
Class: brickhouse.udf.sanity.ThrowErrorUDF
Function: tile
tile(double lat, double lon, int zoom)::bigint – Returns a tile number in the range [0, 2^(2n)) where n is the zoom level: tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^zoom
See https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames for details.
Class: hivemall.geospatial.TileUDF
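The formula above can be sketched in Python using the OpenStreetMap slippy-map convention for `xtile` and `ytile` (valid latitudes are assumed to lie within the Web Mercator range, roughly ±85.05):

```python
import math

def xtile(lon, zoom):
    """Tile column for a longitude at the given zoom level."""
    return int((lon + 180.0) / 360.0 * (1 << zoom))

def ytile(lat, zoom):
    """Tile row for a latitude at the given zoom level (Web Mercator)."""
    lat_rad = math.radians(lat)
    return int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * (1 << zoom))

def tile(lat, lon, zoom):
    """tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^zoom."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)

print(tile(0.0, 0.0, 1))  # 3
```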
Function: tilex2lon
tilex2lon(int x, int zoom)::double – Returns longitude of the given tile x and zoom level
Class: hivemall.geospatial.TileX2LonUDF
Function: tiley2lat
tiley2lat(int y, int zoom)::double – Returns latitude of the given tile y and zoom level
Class: hivemall.geospatial.TileY2LatUDF
Function: timestamp
timestamp(expr) – Casts the value `expr` to the target data type `timestamp`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: tinyint
tinyint(expr) – Casts the value `expr` to the target data type `tinyint`.
Class: org.apache.spark.sql.catalyst.expressions.Cast
Function: to_bits
to_bits(int[] indexes) – Returns a bitset representation of the given indexes as long[]
SELECT to_bits(array(1,2,3,128));
[14,-9223372036854775808]
Class: hivemall.tools.bits.ToBitsUDF
Function: to_camel_case
to_camel_case(a) – Converts a string containing underscores to CamelCase
Class: brickhouse.udf.json.ConvertToCamelCaseUDF
Function: to_date
to_date(date_str[, fmt]) – Parses the `date_str` expression with the `fmt` expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the `fmt` is omitted.
Class: org.apache.spark.sql.catalyst.expressions.ParseToDate
Function: to_dense
to_dense(array feature_vector, int dimensions) – Returns a dense feature in array
Class: hivemall.ftvec.conv.ToDenseFeaturesUDF
Function: to_dense_features
to_dense_features(array feature_vector, int dimensions) – Returns a dense feature in array
Class: hivemall.ftvec.conv.ToDenseFeaturesUDF
Function: to_json
to_json(expr[, options]) – Returns a JSON string with a given struct value
Class: org.apache.spark.sql.catalyst.expressions.StructsToJson
Function: to_map
to_map(key, value) – Converts two aggregated columns into a key-value map
WITH input as (
select 'aaa' as key, 111 as value
UNION all
select 'bbb' as key, 222 as value
)
select to_map(key, value)
from input;
> {"bbb":222,"aaa":111}
Class: hivemall.tools.map.UDAFToMap
Function: to_ordered_list
to_ordered_list(PRIMITIVE value [, PRIMITIVE key, const string options]) – Returns a list of values sorted by the value itself or by a specific key
WITH t as (
SELECT 5 as key, 'apple' as value
UNION ALL
SELECT 3 as key, 'banana' as value
UNION ALL
SELECT 4 as key, 'candy' as value
UNION ALL
SELECT 2 as key, 'donut' as value
UNION ALL
SELECT 3 as key, 'egg' as value
)
SELECT -- expected output
to_ordered_list(value, key, '-reverse'), -- [apple, candy, (banana, egg | egg, banana), donut] (reverse order)
to_ordered_list(value, key, '-k 2'), -- [apple, candy] (top-k)
to_ordered_list(value, key, '-k 100'), -- [apple, candy, (banana, egg | egg, banana), donut]
to_ordered_list(value, key, '-k 2 -reverse'), -- [donut, (banana | egg)] (reverse top-k = tail-k)
to_ordered_list(value, key), -- [donut, (banana, egg | egg, banana), candy, apple] (natural order)
to_ordered_list(value, key, '-k -2'), -- [donut, (banana | egg)] (tail-k)
to_ordered_list(value, key, '-k -100'), -- [donut, (banana, egg | egg, banana), candy, apple]
to_ordered_list(value, key, '-k -2 -reverse'), -- [apple, candy] (reverse tail-k = top-k)
to_ordered_list(value, '-k 2'), -- [egg, donut] (alphabetically)
to_ordered_list(key, '-k -2 -reverse'), -- [5, 4] (top-2 keys)
to_ordered_list(key), -- [2, 3, 3, 4, 5] (naturally ordered keys)
to_ordered_list(value, key, '-k 2 -kv_map'), -- {4:"candy",5:"apple"}
to_ordered_list(value, key, '-k 2 -vk_map') -- {"candy":4,"apple":5}
FROM
t
Class: hivemall.tools.list.UDAFToOrderedList
Function: to_ordered_map
to_ordered_map(key, value [, const int k|const boolean reverseOrder=false]) – Converts two aggregated columns into an ordered key-value map
with t as (
select 10 as key, 'apple' as value
union all
select 3 as key, 'banana' as value
union all
select 4 as key, 'candy' as value
)
select
to_ordered_map(key, value, true), -- {10:"apple",4:"candy",3:"banana"} (reverse)
to_ordered_map(key, value, 1), -- {10:"apple"} (top-1)
to_ordered_map(key, value, 2), -- {10:"apple",4:"candy"} (top-2)
to_ordered_map(key, value, 3), -- {10:"apple",4:"candy",3:"banana"} (top-3)
to_ordered_map(key, value, 100), -- {10:"apple",4:"candy",3:"banana"} (top-100)
to_ordered_map(key, value), -- {3:"banana",4:"candy",10:"apple"} (natural)
to_ordered_map(key, value, -1), -- {3:"banana"} (tail-1)
to_ordered_map(key, value, -2), -- {3:"banana",4:"candy"} (tail-2)
to_ordered_map(key, value, -3), -- {3:"banana",4:"candy",10:"apple"} (tail-3)
to_ordered_map(key, value, -100) -- {3:"banana",4:"candy",10:"apple"} (tail-100)
from t
Class: hivemall.tools.map.UDAFToOrderedMap
Function: to_sparse
to_sparse(array feature_vector) – Returns a sparse feature vector as an array
Class: hivemall.ftvec.conv.ToSparseFeaturesUDF
Function: to_sparse_features
to_sparse_features(array feature_vector) – Returns a sparse feature vector as an array
Class: hivemall.ftvec.conv.ToSparseFeaturesUDF
Function: to_string_array
to_string_array(array) – Returns an array of strings
select to_string_array(array(1.0,2.0,3.0));
["1.0","2.0","3.0"]
Class: hivemall.tools.array.ToStringArrayUDF
Function: to_timestamp
to_timestamp(timestamp_str[, fmt]) – Parses the `timestamp_str` expression with the `fmt` expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the `fmt` is omitted.
Class: org.apache.spark.sql.catalyst.expressions.ParseToTimestamp
Function: to_unix_timestamp
to_unix_timestamp(timeExp[, format]) – Returns the UNIX timestamp of the given time.
Class: org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp
Function: to_utc_timestamp
to_utc_timestamp(timestamp, timezone) – Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.
Class: org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp
Function: tokenize
tokenize(string englishText [, boolean toLowerCase]) – Returns tokenized words in array
Class: hivemall.tools.text.TokenizeUDF
Function: train_adadelta_regr
train_adadelta_regr(array features, float target [, constant string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Class: hivemall.regression.AdaDeltaUDTF
Function: train_adagrad_rda
train_adagrad_rda(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by Adagrad+RDA regularization binary classifier
Class: hivemall.classifier.AdaGradRDAUDTF
Function: train_adagrad_regr
train_adagrad_regr(array features, float target [, constant string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Class: hivemall.regression.AdaGradUDTF
Function: train_arow
train_arow(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight, float covar>
Build a prediction model by Adaptive Regularization of Weight Vectors (AROW) binary classifier
Class: hivemall.classifier.AROWClassifierUDTF
Function: train_arow_regr
train_arow_regr(array features, float target [, constant string options]) – A standard AROW (Adaptive Regularization of Weight Vectors) regressor that uses `y - w^Tx` for the loss function.
SELECT
feature,
argmin_kld(weight, covar) as weight
FROM (
SELECT
train_arow_regr(features,label) as (feature,weight,covar)
FROM
training_data
) t
GROUP BY feature
Class: hivemall.regression.AROWRegressionUDTF
Function: train_arowe2_regr
train_arowe2_regr(array features, float target [, constant string options]) – A refined version of the AROW (Adaptive Regularization of Weight Vectors) regressor that uses the adaptive epsilon-insensitive hinge loss `|w^t - y| - epsilon * stddev` for the loss function
SELECT
feature,
argmin_kld(weight, covar) as weight
FROM (
SELECT
train_arowe2_regr(features,label) as (feature,weight,covar)
FROM
training_data
) t
GROUP BY feature
Class: hivemall.regression.AROWRegressionUDTF$AROWe2
Function: train_arowe_regr
train_arowe_regr(array features, float target [, constant string options]) – A refined version of the AROW (Adaptive Regularization of Weight Vectors) regressor that uses the epsilon-insensitive hinge loss `|w^t - y| - epsilon` for the loss function
SELECT
feature,
argmin_kld(weight, covar) as weight
FROM (
SELECT
train_arowe_regr(features,label) as (feature,weight,covar)
FROM
training_data
) t
GROUP BY feature
Class: hivemall.regression.AROWRegressionUDTF$AROWe
Function: train_arowh
train_arowh(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight, float covar>
Build a prediction model by AROW binary classifier using hinge loss
Class: hivemall.classifier.AROWClassifierUDTF$AROWh
Function: train_classifier
train_classifier(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by a generic classifier
Class: hivemall.classifier.GeneralClassifierUDTF
Function: train_cw
train_cw(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight, float covar>
Build a prediction model by Confidence-Weighted (CW) binary classifier
Class: hivemall.classifier.ConfidenceWeightedUDTF
Function: train_kpa
train_kpa(array features, int label [, const string options]) – Returns a relation
Class: hivemall.classifier.KernelExpansionPassiveAggressiveUDTF
Function: train_lda
train_lda(array words[, const string options]) – Returns a relation consisting of <int topic, string word, float score>
Class: hivemall.topicmodel.LDAUDTF
Function: train_logistic_regr
train_logistic_regr(array features, float target [, constant string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Class: hivemall.regression.LogressUDTF
Function: train_logregr
train_logregr(array features, float target [, constant string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Class: hivemall.regression.LogressUDTF
Function: train_multiclass_arow
train_multiclass_arow(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight, float covar>
Build a prediction model by Adaptive Regularization of Weight Vectors (AROW) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassAROWClassifierUDTF
Function: train_multiclass_arowh
train_multiclass_arowh(list features, int|string label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight, float covar>
Build a prediction model by Adaptive Regularization of Weight Vectors (AROW) multiclass classifier using hinge loss
Class: hivemall.classifier.multiclass.MulticlassAROWClassifierUDTF$AROWh
Function: train_multiclass_cw
train_multiclass_cw(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight, float covar>
Build a prediction model by Confidence-Weighted (CW) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassConfidenceWeightedUDTF
Function: train_multiclass_pa
train_multiclass_pa(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight>
Build a prediction model by Passive-Aggressive (PA) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF
Function: train_multiclass_pa1
train_multiclass_pa1(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight>
Build a prediction model by Passive-Aggressive 1 (PA-1) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF$PA1
Function: train_multiclass_pa2
train_multiclass_pa2(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight>
Build a prediction model by Passive-Aggressive 2 (PA-2) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF$PA2
Function: train_multiclass_perceptron
train_multiclass_perceptron(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight>
Build a prediction model by Perceptron multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassPerceptronUDTF
Function: train_multiclass_scw
train_multiclass_scw(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight, float covar>
Build a prediction model by Soft Confidence-Weighted (SCW-1) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF$SCW1
Function: train_multiclass_scw2
train_multiclass_scw2(list features, {int|string} label [, const string options]) – Returns a relation consisting of <{int|string} label, {string|int|bigint} feature, float weight, float covar>
Build a prediction model by Soft Confidence-Weighted 2 (SCW-2) multiclass classifier
Class: hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF$SCW2
Function: train_pa
train_pa(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by Passive-Aggressive (PA) binary classifier
Class: hivemall.classifier.PassiveAggressiveUDTF
Function: train_pa1
train_pa1(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by Passive-Aggressive 1 (PA-1) binary classifier
Class: hivemall.classifier.PassiveAggressiveUDTF$PA1
Function: train_pa1_regr
train_pa1_regr(array features, float target [, constant string options]) – PA-1 regressor that returns a relation consisting of `(int|bigint|string) feature, float weight`.
SELECT
feature,
avg(weight) as weight
FROM
(SELECT
train_pa1_regr(features,label) as (feature,weight)
FROM
training_data
) t
GROUP BY feature
Class: hivemall.regression.PassiveAggressiveRegressionUDTF
Function: train_pa1a_regr
train_pa1a_regr(array features, float target [, constant string options]) – Returns a relation consisting of `(int|bigint|string) feature, float weight`.
Class: hivemall.regression.PassiveAggressiveRegressionUDTF$PA1a
Function: train_pa2
train_pa2(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by Passive-Aggressive 2 (PA-2) binary classifier
Class: hivemall.classifier.PassiveAggressiveUDTF$PA2
Function: train_pa2_regr
train_pa2_regr(array features, float target [, constant string options]) – Returns a relation consisting of `(int|bigint|string) feature, float weight`.
Class: hivemall.regression.PassiveAggressiveRegressionUDTF$PA2
Function: train_pa2a_regr
train_pa2a_regr(array features, float target [, constant string options]) – Returns a relation consisting of `(int|bigint|string) feature, float weight`.
Class: hivemall.regression.PassiveAggressiveRegressionUDTF$PA2a
Function: train_perceptron
train_perceptron(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by Perceptron binary classifier
Class: hivemall.classifier.PerceptronUDTF
Function: train_plsa
train_plsa(array words[, const string options]) – Returns a relation consisting of <int topic, string word, float score>
Class: hivemall.topicmodel.PLSAUDTF
Function: train_randomforest_classifier
train_randomforest_classifier(array features, int label [, const string options, const array classWeights]) – Returns a relation consisting of the per-tree models together with `var_importance`, `int oob_errors`, and `int oob_tests`
Class: hivemall.smile.classification.RandomForestClassifierUDTF
Function: train_randomforest_regr
train_randomforest_regr(array features, double target [, string options]) – Returns a relation consisting of the per-tree models together with `var_importance`, `double oob_errors`, and `int oob_tests`
Class: hivemall.smile.regression.RandomForestRegressionUDTF
Function: train_randomforest_regressor
train_randomforest_regressor(array features, double target [, string options]) – Returns a relation consisting of the per-tree models together with `var_importance`, `double oob_errors`, and `int oob_tests`
Class: hivemall.smile.regression.RandomForestRegressionUDTF
Function: train_regressor
train_regressor(list features, double label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight>
Build a prediction model by a generic regressor
Class: hivemall.regression.GeneralRegressorUDTF
Function: train_scw
train_scw(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight, float covar>
Build a prediction model by Soft Confidence-Weighted (SCW-1) binary classifier
Class: hivemall.classifier.SoftConfideceWeightedUDTF$SCW1
Function: train_scw2
train_scw2(list features, int label [, const string options]) – Returns a relation consisting of <{int|bigint|string} feature, float weight, float covar>
Build a prediction model by Soft Confidence-Weighted 2 (SCW-2) binary classifier
Class: hivemall.classifier.SoftConfideceWeightedUDTF$SCW2
Function: train_slim
train_slim( int i, map r_i, map> topKRatesOfI, int j, map r_j [, constant string options]) – Returns the row index, column index, and non-zero weight values of the prediction model
Class: hivemall.recommend.SlimUDTF
Function: transform
transform(expr, func) – Transforms elements in an array using the function.
Class: org.apache.spark.sql.catalyst.expressions.ArrayTransform
Function: translate
translate(input, from, to) – Translates the `input` string by replacing the characters present in the `from` string with the corresponding characters in the `to` string.
Class: org.apache.spark.sql.catalyst.expressions.StringTranslate
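The same character-for-character mapping exists in Python as `str.translate`, which is handy for cross-checking expected results (a sketch, not WhereOS code):

```python
# Mimic SQL translate('AaBbCc', 'abc', '123'): 'a'->'1', 'b'->'2', 'c'->'3'
table = str.maketrans("abc", "123")
print("AaBbCc".translate(table))  # A1B2C3
```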
Function: transpose_and_dot
transpose_and_dot(array X, array Y) – Returns dot(X.T, Y) as array<array<double>>, shape = (X.#cols, Y.#cols)
WITH input as (
select array(1.0, 2.0, 3.0, 4.0) as x, array(1, 2) as y
UNION ALL
select array(2.0, 3.0, 4.0, 5.0) as x, array(1, 2) as y
)
select
transpose_and_dot(x, y) as xy,
transpose_and_dot(y, x) as yx
from
input;
[["3.0","6.0"],["5.0","10.0"],["7.0","14.0"],["9.0","18.0"]] [["3.0","5.0","7.0","9.0"],["6.0","10.0","14.0","18.0"]]
Class: hivemall.tools.matrix.TransposeAndDotUDAF
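The result above can be reproduced with a plain Python sketch of dot(X.T, Y), useful for sanity-checking the (X.#cols, Y.#cols) shape:

```python
def transpose_and_dot(X, Y):
    # result[i][j] = sum over rows r of X[r][i] * Y[r][j]
    # shape = (number of columns of X, number of columns of Y)
    return [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(len(Y[0]))]
            for i in range(len(X[0]))]

X = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
Y = [[1, 2], [1, 2]]
print(transpose_and_dot(X, Y))  # [[3.0, 6.0], [5.0, 10.0], [7.0, 14.0], [9.0, 18.0]]
```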
Function: tree_export
tree_export(string model, const string options, optional array featureNames=null, optional array classNames=null) – Exports a decision tree model as JavaScript/dot
Class: hivemall.smile.tools.TreeExportUDF
Function: tree_predict
tree_predict(string modelId, string model, array features [, const string options | const boolean classification=false]) – Returns a prediction result of a random forest: <int value, array<double> posteriori> for classification and <double value> for regression
Class: hivemall.smile.tools.TreePredictUDF
Function: tree_predict_v1
tree_predict_v1(string modelId, int modelType, string script, array features [, const boolean classification]) – Returns a prediction result of a random forest
Class: hivemall.smile.tools.TreePredictUDFv1
Function: trim
trim(str) – Removes the leading and trailing space characters from `str`.
trim(BOTH trimStr FROM str) – Removes the leading and trailing `trimStr` characters from `str`.
trim(LEADING trimStr FROM str) – Removes the leading `trimStr` characters from `str`.
trim(TRAILING trimStr FROM str) – Removes the trailing `trimStr` characters from `str`.
Class: org.apache.spark.sql.catalyst.expressions.StringTrim
Function: trunc
trunc(date, fmt) – Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`. `fmt` should be one of ["year", "yyyy", "yy", "mon", "month", "mm"]
Class: org.apache.spark.sql.catalyst.expressions.TruncDate
Function: truncate_array
Class: brickhouse.udf.collect.TruncateArrayUDF
Function: try_cast
try_cast(ANY src, const string typeName) – Explicitly cast a value as a type. Returns null if cast fails.
SELECT try_cast(array(1.0,2.0,3.0), 'array')
SELECT try_cast(map('A',10,'B',20,'C',30), 'map')
Class: hivemall.tools.TryCastUDF
Function: ucase
ucase(str) – Returns `str` with all characters changed to uppercase.
Class: org.apache.spark.sql.catalyst.expressions.Upper
Function: udfarrayconcat
udfarrayconcat(values) – Concatenates the array arguments
Class: com.whereos.udf.UDFArrayConcat
Function: unbase64
unbase64(str) – Converts the argument from a base 64 string `str` to a binary.
Class: org.apache.spark.sql.catalyst.expressions.UnBase64
Function: unbase91
unbase91(string) – Converts a BASE91 string to a binary
SELECT inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
aaaaaaaaaaaaaaaabbbbccc
Class: hivemall.tools.text.Unbase91UDF
Function: unbits
unbits(long[] bitset) – Returns a long array of the given bitset representation
SELECT unbits(to_bits(array(1,4,2,3)));
[1,2,3,4]
Class: hivemall.tools.bits.UnBitsUDF
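A Python sketch of the to_bits/unbits round trip, assuming the bitset is packed into 64-bit words reinterpreted as Java-style signed longs (the word layout here is an illustrative assumption, not the Hivemall source):

```python
def to_bits(indexes):
    # Set each index as a bit in an array of 64-bit words
    words = [0] * (max(indexes) // 64 + 1)
    for i in indexes:
        words[i // 64] |= 1 << (i % 64)
    # Reinterpret each word as a signed 64-bit (Java long) value
    return [w - (1 << 64) if w >= (1 << 63) else w for w in words]

def unbits(words):
    # Recover the sorted bit positions from the packed words
    out = []
    for w_idx, w in enumerate(words):
        u = w & ((1 << 64) - 1)  # back to unsigned
        for b in range(64):
            if (u >> b) & 1:
                out.append(w_idx * 64 + b)
    return out

print(unbits(to_bits([1, 4, 2, 3])))  # [1, 2, 3, 4]
```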
Function: unhex
unhex(expr) – Converts hexadecimal `expr` to binary.
Class: org.apache.spark.sql.catalyst.expressions.Unhex
Function: union_hyperloglog
union_hyperloglog(x) – Merges multiple hyperloglogs together.
Class: brickhouse.udf.hll.UnionHyperLogLogUDAF
Function: union_map
union_map(x) – Returns a map which contains the union of an aggregation of maps
Class: brickhouse.udf.collect.UnionUDAF
Function: union_max
union_max(x, n) – Returns a map of the union of maps with max N elements in the aggregation group
Class: brickhouse.udf.collect.UnionMaxUDAF
Function: union_sketch
union_sketch(x) – Constructs a sketch set to estimate reach for large values by collecting multiple sketches
Class: brickhouse.udf.sketch.UnionSketchSetUDAF
Function: union_vector_sum
union_vector_sum(x) – Aggregate adding vectors together
Class: brickhouse.udf.timeseries.VectorUnionSumUDAF
Function: unix_timestamp
unix_timestamp([timeExp[, format]]) – Returns the UNIX timestamp of current or specified time.
Class: org.apache.spark.sql.catalyst.expressions.UnixTimestamp
Function: upper
upper(str) – Returns `str` with all characters changed to uppercase.
Class: org.apache.spark.sql.catalyst.expressions.Upper
Function: uuid
uuid() – Returns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string.
Class: org.apache.spark.sql.catalyst.expressions.Uuid
Function: var_pop
var_pop(expr) – Returns the population variance calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop
Function: var_samp
var_samp(expr) – Returns the sample variance calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp
Function: variance
variance(expr) – Returns the sample variance calculated from values of a group.
Class: org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp
Function: vector_add
Class: brickhouse.udf.timeseries.VectorAddUDF
Function: vector_cross_product
Multiplies a vector by another vector
Class: brickhouse.udf.timeseries.VectorCrossProductUDF
Function: vector_dot
vector_dot(array x, array y) – Performs vector dot product.
SELECT vector_dot(array(1.0,2.0,3.0),array(2.0,3.0,4.0));
20
SELECT vector_dot(array(1.0,2.0,3.0),2);
[2.0,4.0,6.0]
Class: hivemall.tools.vector.VectorDotUDF
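Both behaviors shown above (vector-by-vector dot product, and element-wise scaling when the second argument is a scalar) can be sketched in Python:

```python
def vector_dot(x, y):
    # Scalar second argument: element-wise multiplication
    if isinstance(y, (int, float)):
        return [xi * y for xi in x]
    # Vector second argument: classic dot product
    return sum(xi * yi for xi, yi in zip(x, y))

print(vector_dot([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 20.0
print(vector_dot([1.0, 2.0, 3.0], 2))                # [2.0, 4.0, 6.0]
```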
Function: vector_dot_product
Returns the dot product of two vectors
Class: brickhouse.udf.timeseries.VectorDotProductUDF
Function: vector_magnitude
Class: brickhouse.udf.timeseries.VectorMagnitudeUDF
Function: vector_scalar_mult
Multiplies a vector by a scalar
Class: brickhouse.udf.timeseries.VectorMultUDF
Function: vectorize_features
vectorize_features(array featureNames, feature1, feature2, .. [, const string options]) – Returns a feature vector array
Class: hivemall.ftvec.trans.VectorizeFeaturesUDF
Function: voted_avg
voted_avg(double value) – Returns an averaged value by bagging for classification
Class: hivemall.ensemble.bagging.VotedAvgUDAF
Function: weekday
weekday(date) – Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, …, 6 = Sunday).
Class: org.apache.spark.sql.catalyst.expressions.WeekDay
Function: weekofyear
weekofyear(date) – Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with >3 days.
Class: org.apache.spark.sql.catalyst.expressions.WeekOfYear
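The rule described (weeks start on Monday; week 1 is the first week with more than 3 days in the new year) is ISO-8601 week numbering, so Python's `date.isocalendar` can cross-check results:

```python
from datetime import date

# ISO week number: 2008-02-20 falls in week 8 of 2008
print(date(2008, 2, 20).isocalendar()[1])  # 8
```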
Function: weight_voted_avg
weight_voted_avg(expr) – Returns an averaged value by considering sum of positive/negative weights
Class: hivemall.ensemble.bagging.WeightVotedAvgUDAF
Function: when
CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END – When `expr1` = true, returns `expr2`; else when `expr3` = true, returns `expr4`; else returns `expr5`.
Class: org.apache.spark.sql.catalyst.expressions.CaseWhen
Function: window
Class: org.apache.spark.sql.catalyst.expressions.TimeWindow
Function: word_ngrams
word_ngrams(array words, int minSize, int maxSize) – Returns a list of n-grams for the given words, where `minSize <= n <= maxSize`
SELECT word_ngrams(tokenize('Machine learning is fun!', true), 1, 2);
["machine","machine learning","learning","learning is","is","is fun","fun"]
Class: hivemall.tools.text.WordNgramsUDF
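The output order above (each start position emits its n-grams from smallest to largest before moving on) can be sketched in Python:

```python
def word_ngrams(words, min_size, max_size):
    # For each start position, emit every n-gram with min_size <= n <= max_size
    out = []
    for i in range(len(words)):
        for n in range(min_size, max_size + 1):
            if i + n <= len(words):
                out.append(" ".join(words[i:i + n]))
    return out

tokens = ["machine", "learning", "is", "fun"]
print(word_ngrams(tokens, 1, 2))
# ['machine', 'machine learning', 'learning', 'learning is', 'is', 'is fun', 'fun']
```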
Function: write_to_graphite
Writes a metric or collection of metrics to Graphite.
write_to_graphite(String hostname, int port, Map nameToValue, Long timestampInSeconds)
write_to_graphite(String hostname, int port, Map nameToValue)
write_to_graphite(String hostname, int port, String metricName, Double metricValue, Long timestampInSeconds)
write_to_graphite(String hostname, int port, String metricName, Double metricValue)
Class: brickhouse.udf.sanity.WriteToGraphiteUDF
Function: write_to_tsdb
This function writes metrics to the TSDB (metric names should look like proc.loadavg.1min or http.hits, while the tags string is a space-separated collection of tags). On failure returns 'WRITE_FAILED', otherwise 'WRITE_OK'.
write_to_tsdb(String hostname, int port, Map nameToValue, String tags, Long timestampInSeconds)
write_to_tsdb(String hostname, int port, Map nameToValue, String tags)
write_to_tsdb(String hostname, int port, Map nameToValue)
write_to_tsdb(String hostname, int port, String metricName, Double metricValue, String tags, Long timestampInSeconds)
write_to_tsdb(String hostname, int port, String metricName, Double metricValue, String tags)
write_to_tsdb(String hostname, int port, String metricName, Double metricValue)
Class: brickhouse.udf.sanity.WriteToTSDBUDF
Function: x_rank
x_rank(KEY) – Generates a pseudo sequence number starting from 1 for each key
Class: hivemall.tools.RankSequenceUDF
Function: xpath
xpath(xml, xpath) – Returns a string array of values within the nodes of xml that match the XPath expression.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathList
Function: xpath_boolean
xpath_boolean(xml, xpath) – Returns true if the XPath expression evaluates to true, or if a matching node is found.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean
Function: xpath_double
xpath_double(xml, xpath) – Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathDouble
Function: xpath_float
xpath_float(xml, xpath) – Returns a float value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathFloat
Function: xpath_int
xpath_int(xml, xpath) – Returns an integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathInt
Function: xpath_long
xpath_long(xml, xpath) – Returns a long integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathLong
Function: xpath_number
xpath_number(xml, xpath) – Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathDouble
Function: xpath_short
xpath_short(xml, xpath) – Returns a short integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathShort
Function: xpath_string
xpath_string(xml, xpath) – Returns the text contents of the first xml node that matches the XPath expression.
Class: org.apache.spark.sql.catalyst.expressions.xml.XPathString
Function: year
year(date) – Returns the year component of the date/timestamp.
Class: org.apache.spark.sql.catalyst.expressions.Year
Function: zip_with
zip_with(left, right, func) – Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.
Class: org.apache.spark.sql.catalyst.expressions.ZipWith
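The null-padding behavior of zip_with has a direct Python analogue in `itertools.zip_longest`, shown here as a sketch:

```python
from itertools import zip_longest

def zip_with(left, right, func):
    # The shorter array is padded with None before func is applied
    return [func(a, b) for a, b in zip_longest(left, right)]

print(zip_with([1, 2, 3], [10, 20], lambda a, b: (a, b)))  # [(1, 10), (2, 20), (3, None)]
```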
Function: zscore
zscore(value, mean, stddev) – Returns a standard score (zscore)
Class: hivemall.ftvec.scaling.ZScoreUDF
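The standard-score formula is simply (value - mean) / stddev; a one-line Python sketch:

```python
def zscore(value, mean, stddev):
    # Number of standard deviations that value lies from the mean
    return (value - mean) / stddev

print(zscore(70.0, 60.0, 5.0))  # 2.0
```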
Function: |
expr1 | expr2 – Returns the result of bitwise OR of `expr1` and `expr2`.
Class: org.apache.spark.sql.catalyst.expressions.BitwiseOr
Function: ~
~ expr – Returns the result of bitwise NOT of `expr`.
Class: org.apache.spark.sql.catalyst.expressions.BitwiseNot