Benchmarks

Image Captioning on MSCOCO (Cross-Entropy Loss)

| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| LSTM-A3 | 75.3 | 59.0 | 45.4 | 35.0 | 26.7 | 55.6 | 107.7 | 19.7 |
| Attention | 76.4 | 60.6 | 46.9 | 36.1 | 27.6 | 56.6 | 113.0 | 20.4 |
| Up-Down | 76.3 | 60.3 | 46.6 | 36.0 | 27.6 | 56.6 | 113.1 | 20.7 |
| GCN-LSTM | 76.8 | 61.1 | 47.6 | 36.9 | 28.2 | 57.2 | 116.3 | 21.2 |
| Transformer | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
| Meshed-Memory Transformer | 76.3 | 60.2 | 46.4 | 35.6 | 28.1 | 56.5 | 116.0 | 21.2 |
| X-LAN | 77.5 | 61.9 | 48.3 | 37.5 | 28.6 | 57.6 | 120.7 | 21.9 |
| TDEN | 75.5 | 59.4 | 45.7 | 34.9 | 28.7 | 56.7 | 116.3 | 22.0 |

Image Captioning on MSCOCO (CIDEr Score Optimization)

| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| LSTM-A3 | 77.9 | 61.5 | 46.7 | 35.0 | 27.1 | 56.3 | 117.0 | 20.5 |
| Attention | 79.4 | 63.5 | 48.9 | 37.1 | 27.9 | 57.6 | 123.1 | 21.3 |
| Up-Down | 80.1 | 64.3 | 49.7 | 37.7 | 28.0 | 58.0 | 124.7 | 21.5 |
| GCN-LSTM | 80.2 | 64.7 | 50.3 | 38.5 | 28.5 | 58.4 | 127.2 | 22.1 |
| Transformer | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
| Meshed-Memory Transformer | 80.7 | 65.5 | 51.4 | 39.6 | 29.2 | 58.9 | 131.1 | 22.9 |
| X-LAN | 80.4 | 65.2 | 51.0 | 39.2 | 29.4 | 59.0 | 131.0 | 23.2 |
| TDEN | 81.3 | 66.3 | 52.0 | 40.1 | 29.6 | 59.8 | 132.6 | 23.4 |

Video Captioning on MSVD

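| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| MP-LSTM | 77.0 | 65.6 | 56.9 | 48.1 | 32.4 | 68.1 | 73.1 | 4.8 |
| TA | 80.4 | 68.9 | 60.1 | 51.0 | 33.5 | 70.0 | 77.2 | 4.9 |
| Transformer | 79.0 | 67.6 | 58.5 | 49.4 | 33.3 | 68.7 | 80.3 | 4.9 |
| TDConvED | 81.6 | 70.4 | 61.3 | 51.7 | 34.1 | 70.4 | 77.8 | 5.0 |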

Video Captioning on MSR-VTT

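| Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| MP-LSTM | 73.6 | 60.8 | 49.0 | 38.6 | 26.0 | 58.3 | 41.1 | 5.6 |
| TA | 74.3 | 61.8 | 50.3 | 39.9 | 26.4 | 59.4 | 42.9 | 5.8 |
| Transformer | 75.4 | 62.3 | 50.0 | 39.2 | 26.5 | 58.7 | 44.0 | 5.9 |
| TDConvED | 76.4 | 62.3 | 49.9 | 38.9 | 26.3 | 59.0 | 40.7 | 5.7 |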

Visual Question Answering

| Model | Overall | Yes/No | Number | Other |
|---|---|---|---|---|
| Uniter | 70.1 | 86.8 | 53.7 | 59.6 |
| TDEN | 71.9 | 88.3 | 54.3 | 62.0 |

Caption-Based Image Retrieval on Flickr30k

| Model | R1 | R5 | R10 |
|---|---|---|---|
| Uniter | 61.6 | 87.7 | 92.8 |
| TDEN | 62.0 | 86.6 | 92.4 |

Visual Commonsense Reasoning

| Model | Q -> A | QA -> R | Q -> AR |
|---|---|---|---|
| Uniter | 73.0 | 75.3 | 55.4 |
| TDEN | 75.0 | 76.5 | 57.7 |